Enhancement of Speaker Identification Using SID-Usable ... (EURASIP)

utterances for each speaker. Forty-eight speakers were chosen, spanning all the dialect regions, with an equal number of male and female speakers. Of the ten utterances, four were used for training the SID system.

For the testing phase, five test datasets were used: four usable-speech databases and the original, complete TIMIT database. The system was tested on the remaining six utterances per speaker to obtain the SID performance metric, the percentage accuracy of speaker identification. The bar chart in Figure 3 shows the results of the SID system for the different SID-usable speech databases.

[Figure 3: SID performance comparison with the different generated SID-usable speech data. Percent correct: Sinusoidal SID-Usable 98.21, Real SID-Usable 97.32, k-NN SID-Usable 96.91, SVM SID-Usable 95.12, All Frames 94.87.]

As Figure 3 shows, the SID system performs better when only SID-usable speech is used. The amount of real SID-usable speech was approximately 30% less than the all-frames data, yet SID performance was not compromised. Moreover, the SID system performs better on the SID-usable data obtained from the SID-usable speech classifiers than on the all-frames data.

5. CONCLUSION

In this paper, usability in speech with reference to speaker identification, called SID-usable speech, was presented. The SID system was used to determine which speech segments are usable for accurate speaker identification. Two novel approaches to identifying SID-usable speech frames were presented, which resulted in 78% and 72% correct detection of SID-usable speech. We have shown that SID performance can be quantified by comparing the amount of speech data required for correct identification. The amount of SID-usable speech was approximately 30% less than the entire input data, without the SID system performance being compromised.
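The evaluation protocol above (four training utterances and six test utterances per speaker, with performance reported as the percentage of test utterances attributed to the correct speaker, and the ~30% reduction in retained frames) can be sketched as follows. This is an illustrative sketch only; the speaker labels, counts, and the single misclassification are made up for demonstration and are not from the paper:

```python
# Illustrative sketch of the percent-correct SID metric and the
# frame-reduction figure described in the text. All data is synthetic.

def percent_correct(true_speakers, predicted_speakers):
    """Percentage of test utterances attributed to the right speaker."""
    assert len(true_speakers) == len(predicted_speakers)
    hits = sum(t == p for t, p in zip(true_speakers, predicted_speakers))
    return 100.0 * hits / len(true_speakers)

def frame_reduction(n_usable_frames, n_total_frames):
    """Percentage of frames discarded when only SID-usable speech is kept."""
    return 100.0 * (1 - n_usable_frames / n_total_frames)

# 48 speakers x 6 test utterances = 288 identification trials,
# matching the setup described above.
truth = ["spk%02d" % (i // 6) for i in range(288)]
preds = list(truth)
preds[0] = "spk47"  # one hypothetical misclassification (truth[0] is spk00)

acc = percent_correct(truth, preds)   # 287/288 trials correct
print("accuracy: %.2f%%" % acc)
print("reduction: %.0f%%" % frame_reduction(70, 100))  # keeping 70 of 100 frames
```

The metric is deliberately simple: SID accuracy is just the hit rate over closed-set identification trials, which is why the paper can compare systems directly by the amount of input data each needs to reach a given accuracy.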
Therefore, it can be concluded that using only SID-usable speech improves speaker identification performance.

6. ACKNOWLEDGEMENTS

The Air Force Research Laboratory, Air Force Materiel Command, and USAF sponsored this effort under agreement number F30602-02-2-0501. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.

7. DISCLAIMER

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

