enhancement of speaker identification using sid-usable ... - eurasip

ENHANCEMENT OF SPEAKER IDENTIFICATION USING SID-USABLE SPEECH

Saurabh S. Khanwalkar†, Brett Y. Smolenski†, Robert E. Yantorno† and S. J. Wenndt‡

† Speech Processing Lab, ECE Department, Temple University, 12th & Norris Streets, Philadelphia, PA 19122-6077, USA
‡ Air Force Research Laboratory/IFEC, 32 Brooks Rd., Rome, NY 13441-4514, USA

ABSTRACT

Most present-day speaker identification (SID) systems focus on the speech features used for modeling the speakers, with no concern for the speech being input to the system. Knowing how reliable the input speech information is can be very important and useful. The idea of SID-usable speech is to identify and extract those portions of corrupted input speech that are more reliable for SID systems, thereby enhancing the speaker identification process. In this paper, usability in speech with reference to speaker identification, called SID-usable speech, is presented. Here the SID system itself is used to determine those speech frames that are usable for accurate speaker identification. Two novel approaches to identifying SID-usable speech frames are presented, which resulted in 78% and 72% correct detection of SID-usable speech. It is also shown that SID performance can be quantified by comparing the amount of speech data required for correct identification. The amount of SID-usable speech used as input was approximately 30% less than the entire input data, with a considerable enhancement in SID performance.

1. INTRODUCTION

Speaker identification plays an important role in electronic authentication.
In an operational environment, speech is degraded by many kinds of interference. The interference can be classified broadly as stationary or non-stationary. Stationary interference is noise, which can be dealt with using de-noising and noise-reduction techniques, whereas non-stationary interference could be speech from a different speaker. Such interference is a common occurrence, and the corrupted speech is known as co-channel speech [1]. Traditional approaches to co-channel speech processing have been to enhance the prominent (target) speaker, to suppress the interfering speaker's speech, or both. However, previous studies on co-channel speech have shown that it is desirable to process only those portions of the co-channel speech which are minimally degraded [2]. Such portions of speech, considered usable for speaker identification, are referred to as "usable speech". A significant amount of research has been conducted on finding speech features that yield maximum information about the identity of the speakers, thereby increasing the accuracy of the SID system. With only usable portions of speech being input to the SID system, there is an increase in the accuracy of speaker identification [3]. A number of usable speech measures, for use in the usable speech extraction system, have been developed [4] [5] [6]. These measures are based on the Target-to-Interferer Ratio (TIR) of a frame of speech, with a 20 dB TIR threshold used to classify usable speech [7].
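The TIR-based frame labeling just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the frame length, and the assumption that separate target and interferer signals are available (true only in simulation, where co-channel speech is created by mixing) are all hypothetical.

```python
import numpy as np

def tir_usable_frames(target, interferer, frame_len=160, threshold_db=20.0):
    """Label each frame as usable when the Target-to-Interferer Ratio
    (TIR), in dB, meets the threshold (20 dB, as in the paper).
    frame_len=160 assumes 20 ms frames at 8 kHz (an assumption here)."""
    n_frames = min(len(target), len(interferer)) // frame_len
    labels = []
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        p_t = np.sum(target[sl] ** 2) + 1e-12      # target frame energy
        p_i = np.sum(interferer[sl] ** 2) + 1e-12  # interferer frame energy
        tir_db = 10.0 * np.log10(p_t / p_i)
        labels.append(tir_db >= threshold_db)
    return np.array(labels)
```

In an operational setting the clean TIR is unavailable, which is why the usable speech measures [4] [5] [6] must estimate usability from the co-channel signal alone.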
The usable speech concept has been incorporated for speaker recognition improvement through silence removal [8] and a multi-pitch tracking algorithm.

Usable speech is application dependent, i.e., speech segments that are usable for speech recognition may not be usable for speaker identification, and vice versa. The research presented in this paper shows that the SID system itself can be used to determine the usability of the speech being input to the system. This intuitive, SID-application-dependent definition of usability in speech is termed SID-usable speech. In an operational environment, there must be some way to identify SID-usable speech frames before they are input into the SID system, i.e., a preprocessor block, the SID-usable speech extractor, placed before the SID process. Thus, this paper presents the development of criteria for the detection of speaker identification (SID)-usable speech segments.

2. BACKGROUND

A brief background on the speaker identification system, along with SID-usable speech labeling, is given in the following section. The preprocessor SID-usable classification systems are presented in Section 3, and the experimental evaluation of the SID system is presented in Section 4.

2.1 Vector Quantization

The SID system used in the experiments outlined below uses a vector quantization classifier to build the feature space and to perform speaker classification [10].
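The preprocessor arrangement described above, a SID-usable speech extractor placed in front of the SID process, can be sketched as follows; the function names and the mask-based interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sid_with_usable_preprocessor(frames, usable_mask, sid_classify):
    """Preprocessor block: discard frames not flagged SID-usable,
    then pass only the usable frames to an unchanged SID back end."""
    usable = frames[np.asarray(usable_mask, dtype=bool)]
    return sid_classify(usable)
```

The SID back end itself is untouched; only the amount (and reliability) of the speech it receives changes, which is the essence of the preprocessor idea.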
LPC-cepstral coefficients are used as features, with the Euclidean distance between test utterances and the trained speaker models as the distance measure. Although the SID system used is not the most advanced, it serves the purpose of illustrating the concept of SID-usable speech; one would expect the same approach to work at least as well, and possibly better, with a more advanced SID system. A vector quantizer maps k-dimensional vectors in the vector space R^k into a finite set of vectors Y = {y_i : i = 1, 2, ..., N}. Each vector y_i is called a codeword, and the set of all the codewords is called a codebook. In this system the 14th-order LPC-cepstral feature space is clustered into 128 centroids during the training stage, and this set of centroids is referred to as the codebook.

2.2 Distances from Speaker Models

During the testing stage, the test utterance is divided into n frames, and the Euclidean distance of the features of the n frames from each of the m trained speaker models is determined. For each speaker model, the minimum distance obtained over the codewords is considered the distance from the model, and an (m x n) classification matrix is formed.
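A minimal sketch of the VQ-based SID of Sections 2.1 and 2.2, assuming plain k-means as a stand-in for the codebook training and an averaged minimum distance as the final decision rule; the function names, the small codebook sizes used for illustration, and the decision rule are assumptions rather than the paper's exact implementation.

```python
import numpy as np

def train_codebook(features, n_codewords=128, n_iter=20, seed=0):
    """Cluster one speaker's training vectors (e.g. 14th-order
    LPC-cepstra) into a codebook of centroids via plain k-means
    (a hypothetical stand-in for the VQ training stage)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=min(n_codewords, len(features)),
                     replace=False)
    codebook = features[idx].copy()
    for _ in range(n_iter):
        # (F, K) matrix of frame-to-codeword Euclidean distances
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(len(codebook)):
            members = features[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)  # move centroid
    return codebook

def classification_matrix(test_frames, codebooks):
    """Entry (j, i) is the minimum Euclidean distance from test
    frame i to speaker j's codewords: the (m x n) matrix of 2.2."""
    m, n = len(codebooks), len(test_frames)
    D = np.empty((m, n))
    for j, cb in enumerate(codebooks):
        d = np.linalg.norm(test_frames[:, None, :] - cb[None, :, :], axis=2)
        D[j] = d.min(axis=1)
    return D

def identify(test_frames, codebooks):
    """Decide for the speaker whose model gives the smallest
    average frame distance (an assumed decision rule)."""
    D = classification_matrix(test_frames, codebooks)
    return int(D.mean(axis=1).argmin())
```

Each row of the (m x n) matrix holds one speaker model's minimum codeword distance for every test frame, matching the classification matrix described in Section 2.2.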


utterances for each speaker. Forty-eight speakers were chosen, spanning all the dialect regions, with an equal number of male and female speakers. Of the ten utterances, four were used for training the SID system.

For the testing phase, five testing datasets were generated (four usable speech databases and the original, entire-speech TIMIT database). The system was tested on the remaining six utterances to obtain the SID performance metric, the percentage accuracy of speaker identification. The bar chart in Figure 3 shows the results of the SID system performance for the different SID-usable speech databases.

[Figure 3: SID performance comparison with the different generated SID-usable speech data. Bar chart of percent correct, in bar order: Sinusoidal SID-Usable 98.21, Real SID-Usable 97.32, k-NN SID-Usable 96.91, SVM SID-Usable 95.12, All Frames 94.87.]

As Figure 3 shows, using only SID-usable speech gives the SID system better performance. The amount of real SID-usable speech was approximately 30% less than the all-frames data, without the SID system performance being compromised. Moreover, the performance of the SID system is better for the input SID-usable data obtained from the SID-usable speech classifiers than for the all-frames data.

5. CONCLUSION

In this paper, usability in speech with reference to speaker identification, called SID-usable speech, was presented. Here the SID system was used to determine those speech segments that are usable for accurate speaker identification.
Two novel approaches to identifying SID-usable speech frames were presented, which resulted in 78% and 72% correct detection of SID-usable speech. We have shown that SID performance can be quantified by comparing the amount of speech data required for correct identification. The amount of SID-usable speech was approximately 30% less than the entire input data, without the SID system performance being compromised. Therefore, it can be concluded that using only SID-usable speech improves speaker identification performance.

6. ACKNOWLEDGEMENTS

The Air Force Research Laboratory, Air Force Materiel Command, and USAF sponsored this effort under agreement number F30602-02-2-0501. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.

7. DISCLAIMER

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

REFERENCES

[1] R. E. Yantorno, "Co-channel speech study," final report for summer research faculty program, Tech. Rep., Air Force Office of Scientific Research, Speech Processing Lab, Rome Labs, New York, 1999.
[2] J. Lovekin, R. E. Yantorno, S. Benincasa, S. Wenndt and M. Huggins, "Developing usable speech criteria for speaker identification," Proc. ICASSP 2001, pp. 421-424, 2001.
[3] S. Khanwalkar, B. Y. Smolenski and R. E. Yantorno, "Speaker identification enhancement under co-channel conditions using sinusoidal model based usable speech detection," IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2004).
[4] K. Krishnamachari and R. E. Yantorno, "Spectral autocorrelation ratio as a usability measure of speech segments under co-channel conditions," IEEE International Symposium on Intelligent Signal Processing and Communication Systems, pp. 710-713, Nov. 2000.
[5] J. M. Lovekin, K. R. Krishnamachari and R. E. Yantorno, "Adjacent pitch period comparison as a usability measure of speech segments under co-channel conditions," IEEE International Symposium on Intelligent Signal Processing and Communication Systems, pp. 139-142, Nov. 2001.
[6] N. Chandra and R. E. Yantorno, "Usable speech detection using modified spectral autocorrelation peak-to-valley ratio using the LPC residual," 4th IASTED Int. Conference on Signal and Image Processing, pp. 146-150, 2002.
[7] R. E. Yantorno, "Co-channel speech and speaker identification study," Tech. Rep., Air Force Office of Scientific Research, Speech Processing Lab, Rome Labs, New York, 1998.
[8] J. K. Kim, D. S. Shin and M. J. Bae, "A study on the improvement of speaker recognition system by voiced detection," 45th Midwest Symposium on Circuits and Systems (MWSCAS), vol. III, pp. 324-327, 2002.
[9] A. N. Iyer, B. Y. Smolenski, R. E. Yantorno, J. Cupples and S. Wenndt, "Speaker identification improvement using the usable speech concept," European Signal Processing Conference (EUSIPCO 2004).
[10] F. K. Soong, A. E. Rosenberg and B. H. Juang, "Report: A vector quantization approach to speaker recognition," AT&T Technical Journal, vol. 66, pp. 14-26, 1987.
[11] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-34, no. 4, pp. 744-754, August 1986.
[12] R. Roy, A. Paulraj and T. Kailath, "ESPRIT: A subspace rotation approach to estimation of parameters of sinusoids in noise," IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-34, pp. 1340-1342, October 1986.
[13] D. G. Manolakis, V. K. Ingle and S. M. Kogon, Statistical and Adaptive Signal Processing: Spectral Estimation, McGraw-Hill, December 1999.
[14] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, ISBN 0-387-94559-8, 1995.
