enhancement of speaker identification using sid-usable ... - eurasip

ENHANCEMENT OF SPEAKER IDENTIFICATION USING SID-USABLE SPEECH

Saurabh S. Khanwalkar†, Brett Y. Smolenski†, Robert E. Yantorno† and S. J. Wenndt‡

† Speech Processing Lab, ECE Department, Temple University, 12th & Norris Streets, Philadelphia, PA 19122-6077, USA
‡ Air Force Research Laboratory/IFEC, 32 Brooks Rd., Rome, NY 13441-4514, USA

ABSTRACT

Most present-day speaker identification (SID) systems focus on the speech features used for modeling the speakers, with no concern for the speech being input to the system. Knowing how reliable the input speech information is can be very important and useful. The idea of SID-usable speech is to identify and extract those portions of corrupted input speech that are more reliable for SID systems, thereby enhancing the speaker identification process. In this paper, usability in speech with reference to speaker identification, called SID-usable speech, is presented. Here the SID system itself is used to determine those speech frames that are usable for accurate speaker identification. Two novel approaches to identifying SID-usable speech frames are presented, which resulted in 78% and 72% correct detection of SID-usable speech. It is also shown that SID performance can be quantified by comparing the amount of speech data required for correct identification. The amount of SID-usable speech used as input was approximately 30% less than the entire input data, with a considerable enhancement in SID performance.

1. INTRODUCTION

Speaker identification plays an important role in electronic authentication.
In an operational environment, speech is degraded by many kinds of interference. The interference can be classified broadly as stationary or non-stationary. Stationary interference is noise, which can be dealt with using de-noising and noise-reduction techniques, whereas non-stationary interference could be speech from a different speaker. Such interference is a common occurrence, and the corrupted speech is known as co-channel speech [1]. Traditional approaches to co-channel speech processing have been to enhance the prominent (target) speaker, to suppress the interfering speaker's speech, or both. However, previous studies on co-channel speech have shown that it is desirable to process only those portions of the co-channel speech which are minimally degraded [2]. Such portions of speech, considered usable for speaker identification, are referred to as "usable speech". A significant amount of research has been conducted on finding speech features that yield maximum information about the identity of the speakers, thereby increasing the accuracy of the SID system. With only usable portions of speech being input to the SID system, there is an increase in the accuracy of speaker identification [3]. A number of usable speech measures, for use in the usable speech extraction system, have been developed [4] [5] [6]. These measures are based on the Target-to-Interferer Ratio (TIR) of a frame of speech, with a 20 dB TIR threshold used to classify usable speech [7].
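The TIR-based frame labeling just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the frame length, and the assumption that separate target and interferer signals are available (true only in simulation, where co-channel speech is created by mixing) are all hypothetical.

```python
import numpy as np

def tir_usable_frames(target, interferer, frame_len=160, threshold_db=20.0):
    """Label each frame as usable when the Target-to-Interferer Ratio
    (TIR), in dB, meets the threshold (20 dB, as in the paper).
    frame_len=160 assumes 20 ms frames at 8 kHz (an assumption here)."""
    n_frames = min(len(target), len(interferer)) // frame_len
    labels = []
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        p_t = np.sum(target[sl] ** 2) + 1e-12      # target frame energy
        p_i = np.sum(interferer[sl] ** 2) + 1e-12  # interferer frame energy
        tir_db = 10.0 * np.log10(p_t / p_i)
        labels.append(tir_db >= threshold_db)
    return np.array(labels)
```

In an operational setting the clean TIR is unavailable, which is why the usable speech measures [4] [5] [6] must estimate usability from the co-channel signal alone.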
The usable speech concept has been incorporated for speaker recognition improvement through silence removal [8] and a multi-pitch tracking algorithm.

Usable speech is application dependent, i.e., speech segments that are usable for speech recognition may not be usable for speaker identification, and vice versa. The research presented in this paper shows that the SID system itself can be used to determine the usability of the speech being input to the system. This intuitive, SID-application-dependent definition of usability in speech is termed SID-usable speech. In an operational environment, there must be some way to identify SID-usable speech frames before they are input into the SID system, i.e., a preprocessor block, the SID-usable speech extractor, placed before the SID process. Thus, this paper presents the development of criteria for the detection of speaker identification (SID)-usable speech segments.

2. BACKGROUND

A brief background on the speaker identification system, along with SID-usable speech labeling, is given in the following section. The preprocessor SID-usable classification systems are presented in Section 3, and the experimental evaluation of the SID system is presented in Section 4.

2.1 Vector Quantization

The SID system used in the experiments outlined below uses a vector quantization classifier to build the feature space and to perform speaker classification [10].
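The preprocessor arrangement described above, a SID-usable speech extractor placed in front of the SID process, can be sketched as follows; the function names and the mask-based interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sid_with_usable_preprocessor(frames, usable_mask, sid_classify):
    """Preprocessor block: discard frames not flagged SID-usable,
    then pass only the usable frames to an unchanged SID back end."""
    usable = frames[np.asarray(usable_mask, dtype=bool)]
    return sid_classify(usable)
```

The SID back end itself is untouched; only the amount (and reliability) of the speech it receives changes, which is the essence of the preprocessor idea.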
LPC-cepstral coefficients are used as features, with the Euclidean distance between test utterances and the trained speaker models as the distance measure. Although the SID system used is not the most advanced, it serves the purpose of illustrating the concept of SID-usable speech; one would expect the same approach to work at least as well, and possibly better, with a more advanced SID system. A vector quantizer maps k-dimensional vectors in the vector space R^k into a finite set of vectors Y = {y_i : i = 1, 2, ..., N}. Each vector y_i is called a codeword, and the set of all the codewords is called a codebook. In this system the 14th-order LPC-cepstral feature space is clustered into 128 centroids during the training stage, and this set of centroids is referred to as the codebook.

2.2 Distances from Speaker Models

During the testing stage, the test utterance is divided into n frames, and the Euclidean distance of the features of the n frames from each of the m trained speaker models is determined. For each speaker model, the minimum distance obtained over the codewords is considered the distance from the model, and an (m x n) classification matrix is formed.
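A minimal sketch of the VQ-based SID of Sections 2.1 and 2.2, assuming plain k-means as a stand-in for the codebook training and an averaged minimum distance as the final decision rule; the function names, the small codebook sizes used for illustration, and the decision rule are assumptions rather than the paper's exact implementation.

```python
import numpy as np

def train_codebook(features, n_codewords=128, n_iter=20, seed=0):
    """Cluster one speaker's training vectors (e.g. 14th-order
    LPC-cepstra) into a codebook of centroids via plain k-means
    (a hypothetical stand-in for the VQ training stage)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=min(n_codewords, len(features)),
                     replace=False)
    codebook = features[idx].copy()
    for _ in range(n_iter):
        # (F, K) matrix of frame-to-codeword Euclidean distances
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(len(codebook)):
            members = features[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)  # move centroid
    return codebook

def classification_matrix(test_frames, codebooks):
    """Entry (j, i) is the minimum Euclidean distance from test
    frame i to speaker j's codewords: the (m x n) matrix of 2.2."""
    m, n = len(codebooks), len(test_frames)
    D = np.empty((m, n))
    for j, cb in enumerate(codebooks):
        d = np.linalg.norm(test_frames[:, None, :] - cb[None, :, :], axis=2)
        D[j] = d.min(axis=1)
    return D

def identify(test_frames, codebooks):
    """Decide for the speaker whose model gives the smallest
    average frame distance (an assumed decision rule)."""
    D = classification_matrix(test_frames, codebooks)
    return int(D.mean(axis=1).argmin())
```

Each row of the (m x n) matrix holds one speaker model's minimum codeword distance for every test frame, matching the classification matrix described in Section 2.2.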


utterances for each speaker. Forty-eight speakers were chosen, spanning all the dialect regions, with an equal number of male and female speakers. Of the ten utterances, four were used for training the SID system.

For the testing phase, five testing datasets were generated (four usable speech databases and the original, entire-speech TIMIT database). The system was tested on the remaining six utterances to obtain the SID performance metric, the percentage accuracy of speaker identification. The bar chart in Figure 3 shows the results of the SID system performance for the different SID-usable speech databases.

[Figure 3: SID performance comparison with the different generated SID-usable speech data. Bar chart of percent correct, in bar order: Sinusoidal SID-Usable 98.21, Real SID-Usable 97.32, k-NN SID-Usable 96.91, SVM SID-Usable 95.12, All Frames 94.87.]

As Figure 3 shows, using only SID-usable speech gives the SID system better performance. The amount of real SID-usable speech was approximately 30% less than the all-frames data, without the SID system performance being compromised. Moreover, the performance of the SID system is better for the input SID-usable data obtained from the SID-usable speech classifiers than for the all-frames data.

5. CONCLUSION

In this paper, usability in speech with reference to speaker identification, called SID-usable speech, was presented. Here the SID system was used to determine those speech segments that are usable for accurate speaker identification.
Two novel approaches to identifying SID-usable speech frames were presented, which resulted in 78% and 72% correct detection of SID-usable speech. We have shown that SID performance can be quantified by comparing the amount of speech data required for correct identification. The amount of SID-usable speech was approximately 30% less than the entire input data, without the SID system performance being compromised. Therefore, it can be concluded that using only SID-usable speech improves speaker identification performance.

6. ACKNOWLEDGEMENTS

The Air Force Research Laboratory, Air Force Materiel Command, and USAF sponsored this effort under agreement number F30602-02-2-0501. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.

7. DISCLAIMER

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

REFERENCES

[1] R. E. Yantorno, "Co-channel speech study," final report for summer research faculty program, Tech. Rep., Air Force Office of Scientific Research, Speech Processing Lab, Rome Labs, New York, 1999.
[2] J. Lovekin, R. E. Yantorno, S. Benincasa, S. Wenndt and M. Huggins, "Developing usable speech criteria for speaker identification," Proc. ICASSP 2001, pp. 421-424, 2001.
[3] S. Khanwalkar, B. Y. Smolenski and R. E. Yantorno, "Speaker identification enhancement under co-channel conditions using sinusoidal model based usable speech detection," IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2004).
[4] K. Krishnamachari and R. E. Yantorno, "Spectral autocorrelation ratio as a usability measure of speech segments under co-channel conditions," IEEE International Symposium on Intelligent Signal Processing and Communication Systems, pp. 710-713, Nov. 2000.
[5] J. M. Lovekin, K. R. Krishnamachari and R. E. Yantorno, "Adjacent pitch period comparison as a usability measure of speech segments under co-channel conditions," IEEE International Symposium on Intelligent Signal Processing and Communication Systems, pp. 139-142, Nov. 2001.
[6] N. Chandra and R. E. Yantorno, "Usable speech detection using modified spectral autocorrelation peak-to-valley ratio using the LPC residual," 4th IASTED Int. Conference on Signal and Image Processing, pp. 146-150, 2002.
[7] R. E. Yantorno, "Co-channel speech and speaker identification study," Tech. Rep., Air Force Office of Scientific Research, Speech Processing Lab, Rome Labs, New York, 1998.
[8] J. K. Kim, D. S. Shin and M. J. Bae, "A study on the improvement of speaker recognition system by voiced detection," 45th Midwest Symposium on Circuits and Systems (MWSCAS), vol. III, pp. 324-327, 2002.
[9] A. N. Iyer, B. Y. Smolenski, R. E. Yantorno, J. Cupples and S. Wenndt, "Speaker identification improvement using the usable speech concept," European Signal Processing Conference (EUSIPCO 2004).
[10] F. K. Soong, A. E. Rosenberg and B. H. Juang, "Report: A vector quantization approach to speaker recognition," AT&T Technical Journal, vol. 66, pp. 14-26, 1987.
[11] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-34, no. 4, pp. 744-754, August 1986.
[12] R. Roy, A. Paulraj and T. Kailath, "ESPRIT: A subspace rotation approach to estimation of parameters of sinusoids in noise," IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-34, pp. 1340-1342, October 1986.
[13] D. G. Manolakis, V. K. Ingle and S. M. Kogon, Statistical and Adaptive Signal Processing: Spectral Estimation, McGraw-Hill, December 1999.
[14] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, ISBN 0-387-94559-8, 1995.
