
2.2. Grapheme to Phoneme Converters (G2P)

The text corpus has to be phonetized so that the distribution of the basic recognition units (phones, diphones, syllables, etc.) in the text corpus can be analyzed. Phonetizers, or G2P converters, are the tools that convert the text corpus into its phonetic equivalent. Phonetizers are also required for generating the phonetic lexicon, an important component of ASR systems [7]. The lexicon represents each entry of the system's vocabulary word set in its phonetic form; it dictates to the decoder the phonetic composition, or pronunciation, of an entry. For Western languages, corpus phonetization and the development of phonetic lexicons are traditionally done by linguists, and the resulting lexicons are subject to constant refinement and modification. The phonetic nature of Indian scripts, however, reduces the effort to building mere mapping tables and rules for the lexical representation. These rules and mapping tables together comprise the phonetizers, or Grapheme to Phoneme converters.

2.2.1 Marathi G2P

Marathi is very similar to Hindi in both script and pronunciation. The Marathi G2P is therefore derived from the Hindi G2P developed at HP Labs India [5]. Two rules were added to address issues specific to Marathi: one to nasalize the consonant that follows an anuswara, and another to substitute the Hindi chandrabindu with the consonant 'n'. All other rules are common to both languages.

2.2.2 Tamil G2P

The Tamil G2P is a one-to-one mapping table. Exceptions arise in the case of three vowels where the vowel precedes the consonant in the orthography but the vowel sound follows the consonant in pronunciation. Such combinations have been identified and split accordingly, so as to ensure one-to-one correspondence between the pronunciation and the phonetic lexicon.

2.2.3 Telugu G2P

The Telugu G2P is, in most cases, a one-to-one mapping between graphemes and the corresponding phones. It too was derived from the HP Labs Hindi G2P [5]. Special ligatures are used to denote nasalization, homo-organic nasals and dependent vowels.

2.3. Optimal Text Selection

As mentioned in the earlier sections, the optimal set of sentences is extracted from the phonetized corpus using an Optimal Text Selection (OTS) algorithm, which selects phonetically rich sentences. These sentences are distributed among the intended number of speakers such that good speaker variability is achieved in the database; the distribution also attempts to give each speaker as many phonetic contexts as possible. The best-known OTS algorithm is the greedy algorithm as applied to the set covering problem [4]. In the greedy approach, a sentence is selected from the corpus if it best satisfies a desired criterion. In general, the criterion for optimal text selection is the maximum number of diphones, or the highest score computed by a scoring function based on the types of diphones that the sentence is composed of.
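As a concrete illustration of the greedy approach, the following is a minimal sketch in which the criterion is simply the number of not-yet-covered diphones a sentence contributes. The function and data-structure names are illustrative assumptions, not the paper's actual implementation.

```python
def greedy_ots(sentences, diphones_of):
    """Select a phonetically rich subset of sentences by greedy set cover.

    sentences   -- list of candidate sentence strings
    diphones_of -- dict: sentence -> set of the diphones it contains,
                   obtained from a G2P pass over the corpus (assumed given)
    """
    covered, selected = set(), []
    remaining = list(sentences)
    while remaining:
        # Criterion: the sentence contributing the most uncovered diphones.
        best = max(remaining, key=lambda s: len(diphones_of[s] - covered))
        if not (diphones_of[best] - covered):
            break                     # every diphone in the corpus is covered
        covered |= diphones_of[best]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each pick requires a full scan of the remaining corpus, which motivates the faster threshold-based variant described next.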
A threshold-based approach has been implemented for the OTS of the Indian language ASRs. This algorithm performs as well as the classical greedy algorithm in significantly less time. A threshold is maintained on the number of new diphones to be covered by a newly selected sentence, and a whole set of sentences is thus selected in each iteration. The optimal text so selected has full coverage of the diphones present in the corpus. These sentences are distributed among the intended number of speakers. Table 2 gives the diphone coverage statistics of the optimal text for each language.

Table 2: Diphone coverage in the three languages

Language   Phones   Diphones covered
Marathi      74          1779
Tamil        55          1821
Telugu       66          2781

3. Speech Data Collection

In this section, the various steps involved in building the speech corpus are detailed. First, the recording media are chosen so as to capture the effects of channel and microphone variations. For the databases built for the Indian language ASRs, the speech data was recorded over a number of landline and cellular phones using a multi-channel computer telephony interface card.

3.1. Speaker Selection

Speech data is collected from native speakers of the language who are comfortable speaking and reading it. The speakers are chosen such that the diversity in gender, age and dialect is sufficiently captured. The recordings are clean, with minimal background disturbance. Any mistakes made while recording were undone by re-recording or by making the corresponding changes in the transcription set. The various statistics of the data collected for the Indian language ASRs are given in the next subsection.

3.2. Data Statistics

Speakers from various parts of the respective states (regions) were carefully recorded in order to cover all possible dialectal variations of the language. Each speaker recorded 52 sentences of the optimal text. Table 3 gives the number of speakers recorded in each language and in each of the recording modes, landline and cellphone. To capture different microphone variations, four different cellphones were used while recording the speakers. Table 4 gives the age-wise distribution of the speakers in the three languages, and Table 5 shows the gender-wise distribution.

Table 3: Number of speakers in the three languages

Language   Landline   Cellphone   Total
Marathi        92         84       176
Tamil          86        114       200
Telugu        108         75       183


Table 4: Age-wise speaker distribution

Language   18-30   30-40   40-50   50-60   >60
Marathi      77      56      26      12      5
Tamil        80      28      31      16     45
Telugu       52      52      32      32     15

Table 5: Gender distribution among speakers

Language   Male   Female
Marathi     91      85
Tamil      118      82
Telugu      93      90

3.3. Transcription Correction

Despite the care taken to record the speech with minimal background noise and pronunciation mistakes, some errors crept in during recording. These errors had to be identified manually by listening to the speech, and utterances judged unsuitable were discarded.

For the data collected in the three Indian languages, the transcriptions were manually edited and the utterances were ranked by the quality of the recorded speech, classified as "Good", "With Channel Distortion", "With Background Noise" or "Useless", as appropriate. Pronunciation mistakes were carefully identified and, where possible, the corresponding changes were made in the transcriptions so that each utterance and its transcription correspond to each other. The idea behind this classification was to make the utmost use of the data and to let it serve as a corpus for further related research work.

4. Building ASR Systems

Typically, ASR systems comprise three major constituents: the acoustic models, the language models and the phonetic lexicon. Each of these components is discussed in the subsections that follow.

4.1. Acoustic Model

The first key constituent of an ASR system is the acoustic model. Acoustic models capture the characteristics of the basic recognition units [3]. The recognition units can be at the word, syllable or phoneme level, and each choice brings its own constraints and inadequacies. For large vocabulary continuous speech recognition (LVCSR) systems, the phoneme is the preferred unit. Neural networks (NNs) and Hidden Markov Models (HMMs) are the techniques most commonly used for acoustic modeling in ASR systems. We have chosen semi-continuous Hidden Markov Models (SCHMMs) [1] to represent context-dependent phones (triphones). At recognition time, various words are hypothesized against the speech signal. To compute the likelihood of a word, the lexicon is consulted and the word is broken into its constituent phones. The likelihood of each phone is computed from its HMM, and the combined likelihood of all the phones represents the likelihood of the word under the acoustic model. The triphone acoustic models built for the Indian language ASRs are 5-state SCHMMs with states clustered using decision trees. The models were built using the CMU Sphinx II trainer.

Two different models were trained, from the landline and the cellphone data. The training data was carefully selected to have an even distribution across all the possible variations. For each system, the speech data collected was divided into training and test sets based upon the demographic details: 70% of the speakers were included in the training set and the remaining 30% were used to test the respective systems.
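To make the word-likelihood computation concrete, here is a hedged sketch. It assumes the feature stream has already been segmented per phone, whereas a real decoder such as Sphinx II searches over segmentations and uses the context-dependent triphone models; all names here are illustrative.

```python
def word_log_likelihood(word, lexicon, phone_hmm_score, segments):
    """Score a hypothesized word against acoustic feature segments.

    word            -- the word being hypothesized against the signal
    lexicon         -- dict: word -> list of its constituent phones
    phone_hmm_score -- callable (phone, segment) -> log-likelihood that the
                       segment was produced by that phone's HMM
    segments        -- per-phone feature segments (assumed pre-segmented here)
    """
    phones = lexicon[word]            # the lexicon gives the pronunciation
    # The combined likelihood of the word is the product of its phone
    # likelihoods, i.e. the sum of their log-likelihoods.
    return sum(phone_hmm_score(p, seg) for p, seg in zip(phones, segments))
```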
4.2. Language Model

The language model attempts to convey the behavior of the language. By means of probabilistic information, it captures the syntax of the language, aiming to predict the occurrence of specific word sequences possible in the language. From the perspective of the recognition engine, the language model helps narrow down the search space to valid combinations of words [3]; it guides and constrains the search among alternative word hypotheses during decoding. Most ASR systems use stochastic language models (SLMs). Speech recognizers seek the word sequence W_b that is most likely to have produced the acoustic evidence A:

P(W_b | A) = max_W P(W | A) = max_W P(A | W) P(W) / P(A)

P(A), the probability of the acoustic evidence, is normally omitted as it is common to all word sequences. The recognized word sequence is the word combination W_b that maximizes the above equation. LMs assign a probability estimate P(W) to word sequences W = {w_1, w_2, ..., w_n}. These probabilities can be trained from a corpus. Since a corpus gives only finite coverage of word sequences, smoothing has to be applied, discounting observed frequencies toward those of lower-order models. SLMs use N-gram LMs, in which the probability of occurrence of a word is assumed to depend only on the past N-1 words. The trigram model (N = 3), which conditions on the previous two words, is particularly powerful: most words have a strong dependence on the previous two words, and the model can be estimated reasonably well from an attainable corpus. A trigram language model with back-off was used for recognition. The LM was created using the CMU Statistical LM Toolkit [2]. For the baseline systems developed, the LM was generated from the training and testing transcripts. The unigrams of the language model were taken as the lexical entries of the systems.

5. Baseline System Performance

Two systems were built for each language, from the landline and cellphone data separately. The tests on each system were done using the speech data of the held-out 30% of the speakers. "Good" utterances of these speakers were decoded using the Sphinx II decoder, with appropriate tuning to get the best performance. The experiments were evaluated by recognition accuracy, computed using the word error rate (WER) metric, which aligns a recognized word string against the correct word string and counts the number of substitutions (S), deletions (D) and insertions (I) relative to the number of words in the correct sentence (N):

WER = 100 * (S + D + I) / N

Along similar lines, the sentence error rate (SER) metric is the mean percent error in sentence recognition.
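A minimal sketch of the two metrics follows, using the standard edit-distance alignment; this is an illustration, not the evaluation script used for the experiments reported here.

```python
def word_error_rate(reference, hypothesis):
    """WER% via edit-distance alignment of hypothesis words against reference.

    The minimum edit distance equals S + D + I; N is the reference length.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum word edits to turn hyp[:j] into ref[:i]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                                    # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                                    # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dist[i][j] = min(
                dist[i-1][j-1] + (ref[i-1] != hyp[j-1]),  # substitution / match
                dist[i-1][j] + 1,                         # deletion
                dist[i][j-1] + 1)                         # insertion
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)

def sentence_error_rate(references, hypotheses):
    """SER%: the percentage of sentences not recognized exactly."""
    wrong = sum(r.split() != h.split() for r, h in zip(references, hypotheses))
    return 100.0 * wrong / len(references)
```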


Table 6 shows the performance of each ASR in terms of its WER and SER, together with the size of the vocabulary of the system.

Table 6: The performance of the different ASRs in terms of word error rate (WER) %

ASR System         WER    Vocabulary
Marathi Landline   23.2     21640
Marathi Mobile     25.0     18912
Tamil Landline     23.9     13883
Tamil Mobile       22.2     16187
Telugu Landline    18.5     25626
Telugu Mobile      20.1     16419

6. Conclusion and Future Work

In this paper, we discussed the optimal design and development of speech databases for three Indian languages. We hope the simple methodology of database creation presented here will serve as a catalyst for the creation of speech databases in all other Indian languages. We also created speech recognizers and presented preliminary results on these databases. We hope the ASRs created will serve as baseline systems for further research on improving the accuracies in each of the languages. Our future work is focused on tuning these models and testing them with language models built from a large corpus.

7. Acknowledgements

We would like to acknowledge the help and support of Naval, Gaurav, Pavan, Natraj, Santhosh and Jitendra for speech data collection. Our special thanks to Dr. Ravi Mosur, Dr. James K. Baker and Dr. Raj Reddy of Carnegie Mellon University for their support and suggestions.

8. References

[1] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, Vol. 77, No. 2, 1989.
[2] R. Rosenfeld, "The CMU Statistical Language Modelling (SLM) Toolkit (Version 2)".
[3] X. Huang, A. Acero and H. Hon, "Spoken Language Processing: A Guide to Theory, Algorithm and System Development", Prentice Hall, New Jersey, 2001.
[4] T. Cormen, C. Leiserson and R. Rivest, "Introduction to Algorithms", The MIT Press, Cambridge, Massachusetts, 1990.
[5] K. Bali and P. P. Talukdar, "Tools for the Development of a Hindi Speech Synthesis System", 5th ISCA Speech Synthesis Workshop, Pittsburgh, pp. 109-114, 2004.
[6] Marathi CIIL corpus, http://tdil.mit.gov.in/corpora/ach-corpora.htm#tech
[7] S. P. Singh et al., "Building Large Vocabulary Speech Recognition Systems for Indian Languages", International Conference on Natural Language Processing, 1:245-254, 2004.
