
2.2. Grapheme to Phoneme Converters (G2P)

The text corpus has to be phonetized so that the distribution of the basic recognition units (phones, diphones, syllables, etc.) in the text corpus can be analyzed. Phonetizers, or G2P converters, are the tools that convert the text corpus into its phonetic equivalent. Phonetizers are also required for generating the phonetic lexicon, an important component of ASR systems [7]. The lexicon represents each entry of the system's vocabulary word set in its phonetic form; it dictates to the decoder the phonetic composition, or pronunciation, of an entry. For Western languages, corpus phonetization and the development of phonetic lexicons are traditionally done by linguists, and the resulting lexicons are subject to constant refinement and modification. The phonetic nature of Indian scripts, however, reduces the effort to building mere mapping tables and rules for the lexical representation. These rules and mapping tables together comprise the phonetizers, or Grapheme to Phoneme converters.

2.2.1 Marathi G2P

Marathi is very similar to Hindi in both script and pronunciation. The Marathi G2P is therefore derived from the Hindi G2P developed at HP Labs India [5]. Two rules were added to address issues specific to Marathi: one to nasalize the consonant that follows an anuswara, and another to substitute the Hindi chandrabindu with the consonant 'n'. All other rules are common to both languages.

2.2.2 Tamil G2P

The Tamil G2P is a one-to-one mapping table. Exceptions arise in the case of three vowels where the vowel precedes the consonant in the orthography but the vowel sound follows the consonant in pronunciation. Such combinations have been identified and split accordingly, so as to ensure one-to-one correspondence between the pronunciation and the phonetic lexicon.

2.2.3 Telugu G2P

The Telugu G2P is, in most cases, a one-to-one mapping between graphemes and the corresponding phones. It too was derived from the HP Labs Hindi G2P [5]. Special ligatures are used to denote nasalization, homo-organic nasals and dependent vowels.

2.3. Optimal Text Selection

As mentioned in the earlier sections, the optimal set of sentences is extracted from the phonetized corpus using an Optimal Text Selection (OTS) algorithm, which selects phonetically rich sentences. These sentences are distributed among the intended number of speakers such that good speaker variability is achieved in the database; the distribution also attempts to give each speaker as many phonetic contexts as possible. The best-known OTS algorithm is the greedy algorithm as applied to the set covering problem [4]. In the greedy approach, a sentence is selected from the corpus if it best satisfies a desired criterion. In general, the criterion for optimal text selection is the maximum number of diphones, or the highest score computed by a scoring function based on the types of diphones that the sentence is composed of.
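As a concrete illustration of the greedy approach, the following is a minimal sketch in which the criterion is simply the number of not-yet-covered diphones a sentence contributes. The function and data-structure names are illustrative assumptions, not the paper's actual implementation.

```python
def greedy_ots(sentences, diphones_of):
    """Select a phonetically rich subset of sentences by greedy set cover.

    sentences   -- list of candidate sentence strings
    diphones_of -- dict: sentence -> set of the diphones it contains,
                   obtained from a G2P pass over the corpus (assumed given)
    """
    covered, selected = set(), []
    remaining = list(sentences)
    while remaining:
        # Criterion: the sentence contributing the most uncovered diphones.
        best = max(remaining, key=lambda s: len(diphones_of[s] - covered))
        if not (diphones_of[best] - covered):
            break                     # every diphone in the corpus is covered
        covered |= diphones_of[best]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each pick requires a full scan of the remaining corpus, which motivates the faster threshold-based variant described next.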
A threshold-based approach has been implemented for the OTS of the Indian language ASRs. This algorithm performs as well as the classical greedy algorithm in significantly less time. A threshold is maintained on the number of new diphones to be covered by a newly selected sentence, and a whole set of sentences is thus selected in each iteration. The optimal text so selected has full coverage of the diphones present in the corpus. These sentences are distributed among the intended number of speakers. Table 2 gives the diphone coverage statistics of the optimal text for each language.

Table 2: Diphone coverage in the three languages

Language   Phones   Diphones covered
Marathi      74          1779
Tamil        55          1821
Telugu       66          2781

3. Speech Data Collection

In this section, the various steps involved in building the speech corpus are detailed. First, the recording media are chosen so as to capture the effects of channel and microphone variations. For the databases built for the Indian language ASRs, the speech data was recorded over a number of landline and cellular phones using a multi-channel computer telephony interface card.

3.1. Speaker Selection

Speech data is collected from native speakers of the language who are comfortable speaking and reading it. The speakers are chosen such that the diversity in gender, age and dialect is sufficiently captured. The recordings are clean, with minimal background disturbance. Any mistakes made while recording were undone by re-recording or by making the corresponding changes in the transcription set. The various statistics of the data collected for the Indian language ASRs are given in the next subsection.

3.2. Data Statistics

Speakers from various parts of the respective states (regions) were carefully recorded in order to cover all possible dialectal variations of the language. Each speaker recorded 52 sentences of the optimal text. Table 3 gives the number of speakers recorded in each language and in each of the recording modes, landline and cellphone. To capture different microphone variations, four different cellphones were used while recording the speakers. Table 4 gives the age-wise distribution of the speakers in the three languages, and Table 5 shows the gender-wise distribution.

Table 3: Number of speakers in the three languages

Language   Landline   Cellphone   Total
Marathi        92         84       176
Tamil          86        114       200
Telugu        108         75       183


Table 4: Age-wise speaker distribution

Language   18-30   30-40   40-50   50-60   >60
Marathi      77      56      26      12      5
Tamil        80      28      31      16     45
Telugu       52      52      32      32     15

Table 5: Gender distribution among speakers

Language   Male   Female
Marathi     91      85
Tamil      118      82
Telugu      93      90

3.3. Transcription Correction

Despite the care taken to record the speech with minimal background noise and pronunciation mistakes, some errors crept in during recording. These errors had to be identified manually by listening to the speech, and utterances judged unsuitable were discarded.

For the data collected in the three Indian languages, the transcriptions were manually edited and the utterances were ranked by the quality of the recorded speech, classified as "Good", "With Channel Distortion", "With Background Noise" or "Useless", as appropriate. Pronunciation mistakes were carefully identified and, where possible, the corresponding changes were made in the transcriptions so that each utterance and its transcription correspond to each other. The idea behind this classification was to make the utmost use of the data and to let it serve as a corpus for further related research work.

4. Building ASR Systems

Typically, ASR systems comprise three major constituents: the acoustic models, the language models and the phonetic lexicon. Each of these components is discussed in the subsections that follow.

4.1. Acoustic Model

The first key constituent of an ASR system is the acoustic model. Acoustic models capture the characteristics of the basic recognition units [3]. The recognition units can be at the word, syllable or phoneme level, and each choice brings its own constraints and inadequacies. For large vocabulary continuous speech recognition (LVCSR) systems, the phoneme is the preferred unit. Neural networks (NNs) and Hidden Markov Models (HMMs) are the techniques most commonly used for acoustic modeling in ASR systems. We have chosen semi-continuous Hidden Markov Models (SCHMMs) [1] to represent context-dependent phones (triphones). At recognition time, various words are hypothesized against the speech signal. To compute the likelihood of a word, the lexicon is consulted and the word is broken into its constituent phones. The likelihood of each phone is computed from its HMM, and the combined likelihood of all the phones represents the likelihood of the word under the acoustic model. The triphone acoustic models built for the Indian language ASRs are 5-state SCHMMs with states clustered using decision trees. The models were built using the CMU Sphinx II trainer.

Two different models were trained, from the landline and the cellphone data. The training data was carefully selected to have an even distribution across all the possible variations. For each system, the speech data collected was divided into training and test sets based upon the demographic details: 70% of the speakers were included in the training set and the remaining 30% were used to test the respective systems.
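To make the word-likelihood computation concrete, here is a hedged sketch. It assumes the feature stream has already been segmented per phone, whereas a real decoder such as Sphinx II searches over segmentations and uses the context-dependent triphone models; all names here are illustrative.

```python
def word_log_likelihood(word, lexicon, phone_hmm_score, segments):
    """Score a hypothesized word against acoustic feature segments.

    word            -- the word being hypothesized against the signal
    lexicon         -- dict: word -> list of its constituent phones
    phone_hmm_score -- callable (phone, segment) -> log-likelihood that the
                       segment was produced by that phone's HMM
    segments        -- per-phone feature segments (assumed pre-segmented here)
    """
    phones = lexicon[word]            # the lexicon gives the pronunciation
    # The combined likelihood of the word is the product of its phone
    # likelihoods, i.e. the sum of their log-likelihoods.
    return sum(phone_hmm_score(p, seg) for p, seg in zip(phones, segments))
```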
4.2. Language Model

The language model attempts to convey the behavior of the language. By means of probabilistic information, it captures the syntax of the language, aiming to predict the occurrence of specific word sequences possible in the language. From the perspective of the recognition engine, the language model helps narrow down the search space to valid combinations of words [3]; it guides and constrains the search among alternative word hypotheses during decoding. Most ASR systems use stochastic language models (SLMs). Speech recognizers seek the word sequence W_b that is most likely to have produced the acoustic evidence A:

P(W_b | A) = max_W P(W | A) = max_W P(A | W) P(W) / P(A)

P(A), the probability of the acoustic evidence, is normally omitted as it is common to all word sequences. The recognized word sequence is the word combination W_b that maximizes the above equation. LMs assign a probability estimate P(W) to word sequences W = {w_1, w_2, ..., w_n}. These probabilities can be trained from a corpus. Since a corpus gives only finite coverage of word sequences, smoothing has to be applied, discounting observed frequencies toward those of lower-order models. SLMs use N-gram LMs, in which the probability of occurrence of a word is assumed to depend only on the past N-1 words. The trigram model (N = 3), which conditions on the previous two words, is particularly powerful: most words have a strong dependence on the previous two words, and the model can be estimated reasonably well from an attainable corpus. A trigram language model with back-off was used for recognition. The LM was created using the CMU Statistical LM Toolkit [2]. For the baseline systems developed, the LM was generated from the training and testing transcripts. The unigrams of the language model were taken as the lexical entries of the systems.

5. Baseline System Performance

Two systems were built for each language, from the landline and cellphone data separately. The tests on each system were done using the speech data of the held-out 30% of the speakers. "Good" utterances of these speakers were decoded using the Sphinx II decoder, with appropriate tuning to get the best performance. The experiments were evaluated by recognition accuracy, computed using the word error rate (WER) metric, which aligns a recognized word string against the correct word string and counts the number of substitutions (S), deletions (D) and insertions (I) relative to the number of words in the correct sentence (N):

WER = 100 * (S + D + I) / N

Along similar lines, the sentence error rate (SER) metric is the mean percent error in sentence recognition.
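A minimal sketch of the two metrics follows, using the standard edit-distance alignment; this is an illustration, not the evaluation script used for the experiments reported here.

```python
def word_error_rate(reference, hypothesis):
    """WER% via edit-distance alignment of hypothesis words against reference.

    The minimum edit distance equals S + D + I; N is the reference length.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum word edits to turn hyp[:j] into ref[:i]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                                    # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                                    # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dist[i][j] = min(
                dist[i-1][j-1] + (ref[i-1] != hyp[j-1]),  # substitution / match
                dist[i-1][j] + 1,                         # deletion
                dist[i][j-1] + 1)                         # insertion
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)

def sentence_error_rate(references, hypotheses):
    """SER%: the percentage of sentences not recognized exactly."""
    wrong = sum(r.split() != h.split() for r, h in zip(references, hypotheses))
    return 100.0 * wrong / len(references)
```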


Table 6 shows the performance of each ASR in terms of its WER and SER, together with the size of the vocabulary of the system.

Table 6: The performance of the different ASRs in terms of word error rate (WER) %

ASR System         WER    Vocabulary
Marathi Landline   23.2     21640
Marathi Mobile     25.0     18912
Tamil Landline     23.9     13883
Tamil Mobile       22.2     16187
Telugu Landline    18.5     25626
Telugu Mobile      20.1     16419

6. Conclusion and Future Work

In this paper, we discussed the optimal design and development of speech databases for three Indian languages. We hope the simple methodology of database creation presented here will serve as a catalyst for the creation of speech databases in all other Indian languages. We also created speech recognizers and presented preliminary results on these databases. We hope the ASRs created will serve as baseline systems for further research on improving the accuracies in each of the languages. Our future work is focused on tuning these models and testing them with language models built from a large corpus.

7. Acknowledgements

We would like to acknowledge the help and support of Naval, Gaurav, Pavan, Natraj, Santhosh and Jitendra for speech data collection. Our special thanks to Dr. Ravi Mosur, Dr. James K. Baker and Dr. Raj Reddy of Carnegie Mellon University for their support and suggestions.

8. References

[1] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, Vol. 77, No. 2, 1989.
[2] R. Rosenfeld, "The CMU Statistical Language Modelling (SLM) Toolkit (Version 2)".
[3] X. Huang, A. Acero and H. Hon, "Spoken Language Processing: A Guide to Theory, Algorithm and System Development", Prentice Hall, New Jersey, 2001.
[4] T. Cormen, C. Leiserson and R. Rivest, "Introduction to Algorithms", The MIT Press, Cambridge, Massachusetts, 1990.
[5] K. Bali and P. P. Talukdar, "Tools for the Development of a Hindi Speech Synthesis System", 5th ISCA Speech Synthesis Workshop, Pittsburgh, pp. 109-114, 2004.
[6] Marathi CIIL corpus, http://tdil.mit.gov.in/corpora/ach-corpora.htm#tech
[7] S. P. Singh et al., "Building Large Vocabulary Speech Recognition Systems for Indian Languages", International Conference on Natural Language Processing, 1:245-254, 2004.
