
Geoffrey Zweig
Joint work with Patrick Nguyen
Microsoft Research
CLSP Seminar, Feb. 2, 2010


Outline
• Motivation
• Detector based recognition
• Segment based recognition
• Segmental CRF Framework
  • Mathematical Model
  • Observations & Features
  • Detector Design
• The MMI Multi-phone Example
• Experiments
  • Bing Mobile
  • Wall Street Journal
• Conclusion


Motivation
• Combine millions of informative features
  • Possibly redundant
  • Add them till the mistakes go away
• Acoustic side:
  • Base features on detection events: phones, syllables, informative templates and others
  • Model segment (word) level phenomena
  • Avoid the assumption of frame-level independence
  • Use existing best systems as a baseline feature
• Language modeling side:
  • Support the state of the art
  • Extend it by joint discriminative training with the acoustic model
• This paradigm is not widely used in ASR now
  • Lots of room to explore


What kind of model could let us do this?
• Maximum Entropy Models
• Conditional Random Fields
  • Proven log-linear combination of features
  • Desirable theoretical properties: maximum entropy
  • Track record of success in other areas
• We'll adopt the CRF, and extend it to operate in a segmental way


Some Related Work
• Detector-based ASR
  • Lee 2004 – Knowledge-rich modeling
• Direct Modeling
  • Kuo & Gao 2006 – Maximum Entropy Markov Models
  • Gunawardana et al. 2005 – Hidden CRFs
  • Morris & Fosler-Lussier 2006 – CRFs for phone recognition
  • Layton & Gales 2006 – Augmented Statistical Models
  • Heigold et al. 2009 – Flat Direct Model
We draw from both areas, synthesize, and add a segmental aspect.


Mathematical Model


Regular CRFs
[figure: linear-chain CRF with states s_1 … s_n above observations o_1 … o_n]


Segmental CRF (1)
[figure: states s_1, s_2, s_3 spanning groups of the observations o_1 … o_n]
Observations are blocked into groups corresponding to words.
[figure: an alternative segmentation, with states s_1, s_2 over the same observations]
We don't know how many words there are, so we must consider various segmentations.


Segmental CRF (2)
[figure: edge e between states s_l and s_r, with observation block o(e) = o_3 … o_4]
No added computational complexity from using feature functions of two states and an observation block.
See Zweig & Nguyen, ASRU 2009 for the DP recursions, gradient computation, training & decoding.


The Meaning of States: ARPA LM
• States are actually language model states
• States imply the last word
[figure: edge between states s_l and s_r, with observation block o(e) = o_3 … o_4]


Embedding a Language Model
[figure: fragment of a finite-state LM graph with numbered states for histories such as "the", "dog", "nipped", "the dog", "dog nipped", "dog barked", "dog wagged", "hazy", and arcs labeled with words]
At minimum, we can use the state sequence to look up LM scores from the finite-state graph. These can be features.
And we also know the actual arc sequence.


Recap
[figure: states s_l, s_r joined by an edge with observation block o(e)]
• States represent whole words (not phonemes)
• A log-linear model relates words to the observations o_1 … o_n
• Observations are blocked into groups corresponding to words; observations are typically detection events
• For a hypothesized word sequence s, we must sum over all possible segmentations q of the observations
• Training is done to maximize the product of label probabilities on the training data (conditional maximum likelihood, CML)
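In symbols, the recap above can be written as follows. This is a reconstruction in the spirit of the ASRU 2009 formulation, not a verbatim quote of it: e ranges over the edges of a segmentation q, with left and right states s_{l(e)}, s_{r(e)}, spanned observations o(e), and feature weights λ_k.

```latex
p(s \mid o) =
\frac{\displaystyle\sum_{q}\exp\Big(\sum_{e \in q}\sum_{k}\lambda_k\,
      f_k\big(s_{l(e)},\, s_{r(e)},\, o(e)\big)\Big)}
     {\displaystyle\sum_{s'}\sum_{q'}\exp\Big(\sum_{e \in q'}\sum_{k}\lambda_k\,
      f_k\big(s'_{l(e)},\, s'_{r(e)},\, o(e)\big)\Big)}
```

The denominator sums over all word sequences s' and all of their segmentations q', which is what the DP recursions referenced above make tractable.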


Observations & Features*
* Specific to the SCARF Speech Recognition Toolkit
http://research.microsoft.com/en-us/downloads/


The Observations
• Detector streams
  • Phone detectors
  • Syllable detectors
  • Multi-phone detectors
  • Template detectors
• Could also use raw speech frames
• At the end of the day, we just need to define features that measure the consistency between a word hypothesis and the underlying acoustics


Inputs
• Atomic streams: sequences of (detection, time) pairs
• Optional dictionaries: specify the expected sequence of detections for a word


The Features
• An array of features is automatically constructed
• They measure forms of consistency between expected and observed detections
• They differ in their use of ordering information and their generalization to unseen words:
  • Existence Features
  • Expectation Features
  • Levenshtein Features
• Also:
  • LM features
  • "Baseline" feature
  • User-defined features


Existence Features
• Does unit X exist within the span of word Y?
• Created for all X, Y pairs in the dictionary and in the training data
• Can automatically be created for unit n-grams
• No generalization, but arbitrary detections are OK
[figure: hypothesized word, e.g. "kid", spanning detected units within o_1 … o_n, e.g. "k ae d"]
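A minimal sketch of how existence features might be computed. Function and argument names are illustrative, not SCARF's actual API:

```python
def existence_features(hyp_word, spanned_units, unit_vocab):
    """Fire a binary feature (word Y, unit X) for every vocabulary unit X
    detected within the span of the hypothesized word Y.

    hyp_word: the hypothesized word, e.g. "kid"
    spanned_units: units detected in its span, e.g. ["k", "ae", "d"]
    unit_vocab: the set of units for which features were created
    """
    present = set(spanned_units)
    return {(hyp_word, u): 1.0 for u in unit_vocab if u in present}
```

Because the features are keyed on the (word, unit) pair, they memorize word-specific evidence and do not generalize to unseen words, as the slide notes.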


Expectation Features
• Use the dictionary to get generalization ability across words!
• Correct Accept of u: the unit is in the dictionary pronunciation of the hypothesized word, and it is detected in the span of the hypothesized word
• False Reject of u: the unit is in the pronunciation of the hypothesized word, but it is not detected in the span of the hypothesized word
• False Accept of u: the unit is not in the pronunciation of the hypothesized word, and it is detected
• Running example: the dictionary pronunciation of "accord" is "ax k or d", and the units seen in its span are "ih k or" – correct accepts of k and or, false rejects of ax and d, and a false accept of ih
• Automatically created for unit n-grams


Levenshtein Features
• Align the detector sequence in a hypothesized word's span with the dictionary sequence that's expected
• Count the number of each type of edit: Match of u, Substitution of u, Insertion of u, Deletion of u
• Example: expected "ax k or d" vs. detected "ih k or *" gives Sub-ax = 1, Match-k = 1, Match-or = 1, Del-d = 1
• Operates only on the atomic units
• Generalization ability across words!
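A sketch of the alignment step, using a standard Levenshtein DP with a backtrace that tallies per-unit edit counts (an illustrative implementation, not SCARF's):

```python
def levenshtein_features(expected, detected):
    """Align the detected unit sequence with the expected dictionary sequence
    and count matches, substitutions, insertions, and deletions per unit."""
    m, n = len(expected), len(detected)
    # standard edit-distance DP table
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = cost[i - 1][j - 1] + (expected[i - 1] != detected[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # backtrace, counting edit types keyed on the unit involved
    feats = {}
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (expected[i - 1] != detected[j - 1])):
            kind = "match" if expected[i - 1] == detected[j - 1] else "sub"
            feats[(kind, expected[i - 1])] = feats.get((kind, expected[i - 1]), 0) + 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            feats[("del", expected[i - 1])] = feats.get(("del", expected[i - 1]), 0) + 1
            i -= 1
        else:
            feats[("ins", detected[j - 1])] = feats.get(("ins", detected[j - 1]), 0) + 1
            j -= 1
    return feats
```

Run on the slide's example, this reproduces Sub-ax = 1, Match-k = 1, Match-or = 1, Del-d = 1. As with expectation features, keying on the unit rather than the word gives generalization across words.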


Language Model Features
• Basic LM: the language model cost of transitioning between states
• Discriminative LM training: a binary feature for each arc in the language model, indicating whether the arc is traversed in transitioning between states
• Training will result in a weight for each arc in the LM – discriminatively trained, and jointly trained with the acoustic model


The Baseline Feature
• The baseline feature treats the 1-best output of a baseline system as a detector stream
• The time associated with a word is its midpoint
• The baseline feature is:
  • 1 if a hypothesized word spans exactly one word in the baseline stream, and the two words match
  • otherwise -1
• To maximize it, the hypothesis must have the same number of words as the baseline, and their identities must be the same
• With a high enough weight, the baseline output is guaranteed
• In practice, the weight is learned along with all the others
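A sketch of this definition (names and time representation are my own; the source only specifies the +1/-1 logic):

```python
def baseline_feature(hyp_word, hyp_start, hyp_end, baseline_words):
    """+1 if the hypothesized word's span covers exactly one baseline word
    and the words match; -1 otherwise.

    baseline_words: list of (word, midpoint_time) pairs from the baseline
    system's 1-best output, each word located at its midpoint.
    """
    spanned = [w for (w, t) in baseline_words if hyp_start <= t < hyp_end]
    return 1 if spanned == [hyp_word] else -1
```

Summed over a hypothesis, this total is maximized exactly when the hypothesis reproduces the baseline word sequence, which is why a large enough learned weight would guarantee the baseline output.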


Mutual Information Between Multi-phone Units and Words
• Ostensibly this requires O(|W||U|) time
• But it simplifies when the probability of error depends on the unit only: O(|W| + |U|)
• See Zweig & Nguyen, Interspeech 2009 for the algorithm


The Errorless Case
• w+: words that contain unit u_j; w-: those that don't
• Errorless => the MI is the entropy of the partition between words that have the unit and those that don't
• A 50/50 split is best
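Under the slide's errorless assumption, the unit's MI reduces to the binary entropy of the w+/w- partition, which can be sketched as follows (function and argument names are illustrative):

```python
import math

def errorless_unit_mi(word_probs, words_with_unit):
    """MI(u; w) in the errorless case: the entropy of the partition between
    words containing the unit (w+) and words that don't (w-).

    word_probs: dict mapping word -> unigram probability (sums to 1)
    words_with_unit: the set w+ of words whose pronunciation contains the unit
    """
    p = sum(prob for w, prob in word_probs.items() if w in words_with_unit)
    if p <= 0.0 or p >= 1.0:
        return 0.0  # degenerate partition carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
```

The binary entropy peaks at p = 0.5, which is the slide's point that a 50/50 split between w+ and w- is best.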


The Effect of Errors
• Reliable detection may trump frequency
• Key simplification: the probability of error depends only on the multi-phone unit itself, not the word it occurs in
• Can be computed in O(|W| + |U|) for all units present in the lexicon!


What Error Model?
• Interspeech 2009 paper:
[equation: error model relating P(u_j = 1 | w) and P(u_j = 0 | w) via parameters a, b, c]
• a = 1, b = 1, c = 0.5
• Later explored:
  • Better settings of a, b
  • An empirically derived model based on large-scale phonetic decoding
• Qualitative results unchanged


Most Informative Units

Unit                      MI(u; w)
ax_n                      0.026 bits
k_ae_l_ax_f_ao_r_n_y_ax   0.023
ax_r                      0.021
s_t                       0.018
ao_r                      0.017


Sample Multi-phone Pronunciations

Word        Unit Breakdown
Academia    ae_k_ax d_iy m_iy ax
Academic    ae_k_ax d_eh m_ih_k
Academics   ae_k_ax d_eh m_ih_k s
Academies   ax_k_ae_d_ax_m_iy z
Academy     ax_k_ae_d_ax_m_iy

We essentially have empirically derived "syllables", based on maximum mutual information.


Experiments: Bing Mobile


What is Bing Mobile?


Experimental Setup
• Bing Mobile voice search data
• Multi-phone detection based on HMM decoding
• 2500 hrs. of acoustic training data
  • ½ for training the detectors
  • ½ for training the SCARF parameters
• Word-level HMM trained on all the data for the baseline
  • The HMM acoustic model has 11k states and 260k Gaussians
  • Basic MFCC system
• Baseline HMM decoding produces a set of hypotheses which constrain the segmentations considered
• Dev set of 8,777 utterances; test set of 12,758
• Full description in Zweig & Nguyen, ASRU 2009


Effect of Phoneme and Multi-phone Detectors

System                           Sentence Error Rate
Baseline                         37.1%
+ phone                          36.2
+ multi-phone                    35.7    (multi-phone better than phone)
+ phone & multi-phone            35.4    (both work together)
+ phone & multi-phone (3 best)   35.2
All + discriminative LM          35.0    (some delta from the full LM)

Existence, Expectation and Levenshtein features all used; phoneme biphones, multi-phone unigrams.
Overall 2.1% absolute error rate reduction.


Effect of Feature Subtraction

System                           Sentence Error
phone & multi-phone (3 best)     35.2%
less existence                   +0.5
less expectation                 +0.6
less levenshtein                 +0.3

• All features are useful.
• Existence and Expectation features are most useful.
• Perhaps because Levenshtein-type ordering constraints are in the dictionary already?


Experiments: Wall Street Journal
or: "Can it handle more than 3 words?"


Wall Street Journal Setup
• SI284 acoustic training data – 84 WSJ0 speakers, 200 WSJ1 speakers
  • Speaker-independent models
  • ~72 hrs of speech
• Used the distributed LM
  • 3.5M trigrams, 3.2M bigrams
  • 40M training words (1.6M words)
  • 20k open vocabulary
• Used Pronlex syllables
• HMM baseline of 9.7% WER
• Constraint lattices have 3.7% oracle WER


Wall Street Journal Results

Oracle Detections
Features (one only)   WER
Baseline              9.7%
Levenshtein           4.2
Expectation           4.2
Existence             7.1
Can achieve close to the 3.7% theoretically possible.

Better AM in Syllable Detection
Features              WER
Baseline              9.7%
Levenshtein           9.2
All                   9.2
In realistic conditions, a better detector produces better results.


Conclusions
• Presented a framework for detector-based speech recognition with Segmental CRFs
  • Integrates multiple detector streams
  • Enables and uses segment-level features
  • Joint and discriminative language model and acoustic model training
  • Built for continuous improvement with new, independently developed detector streams
• Proposed a method for designing good detectors
  • Maximize mutual information with words
  • Illustrated the method with MMI multi-phone units
• Tested the scheme
  • Real-world cell-phone data
  • Wall Street Journal CSR
