NLP for IR - Microsoft Research
NLP for IR<br />
MONOJIT CHOUDHURY<br />
MICROSOFT RESEARCH LAB INDIA
Interpreting Queries<br />
cannot copy files from excel 2007 to excel 2010 on windows 8<br />
side effects of liver drugs on kidney<br />
polished tiles hard to keep clean
Beyond counting words<br />
Word structure: Spelling & Morphology<br />
Relationship between words: Syntax<br />
Meaning of Words: Synonyms (Lexical Semantics)<br />
Meaning of phrases and sentences: Compositionality (Semantics)<br />
Long range dependencies: Discourse & topic<br />
Translation
How many words can you make with: Establish<br />
Establishes, Establishing, Established,<br />
Establishment, Establishments<br />
Disestablish, Disestablishes, Disestablished, Disestablishing, Disestablishment,<br />
Disestablishments,<br />
Establishmentarian, Establishmentarians, Establishmentarianism,<br />
Establishmentarianisms, Disestablishmentarian, Disestablishmentarians,<br />
Disestablishmentarianism, Disestablishmentarianisms<br />
Antidisestablishmentarian, Antidisestablishmentarians,<br />
Antidisestablishmentarianism, Antidisestablishmentarianisms
Morphological Analysis<br />
Anti + dis + establish + ment + ary + an + ism<br />
Prefixes<br />
Root<br />
Suffixes<br />
Inflectional vs. Derivational Morphology
Analyze the following words<br />
Information’s<br />
Allocation<br />
जाना<br />
कर्तव्य<br />
अपरिचित<br />
ಬೆಂಗಳೂರಿನಲ್ಲಿ<br />
ಸುಂದರವಾಗಿದೆ
Morphological Analyzer<br />
◦Root lexicon<br />
◦Suffix lexicon<br />
◦Morphotactic rules<br />
◦Morphophonemic (sandhi) rules<br />
Finite State Automaton
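The components listed above can be illustrated with a minimal sketch. It assumes tiny, hypothetical lexicons and matches prefixes, then a root, then a sequence of suffixes. It applies no morphophonemic rules, so "arian" is listed as a single surface suffix in place of "ary" + "an", and it imposes no ordering constraints on suffixes, which real morphotactic rules would. A production analyzer would typically compile all of this into a finite-state machine.

```python
from typing import List, Optional

# Toy lexicons: hypothetical, for illustration only.
PREFIXES = ["anti", "dis"]
ROOTS = ["establish"]
# Surface forms only; "arian" stands in for "ary" + "an" because this
# sketch has no morphophonemic (spelling-change) rules.
SUFFIXES = ["ment", "arian", "ism", "es", "ed", "ing", "s"]


def strip_suffixes(rest: str) -> Optional[List[str]]:
    """Segment `rest` into a sequence of known suffixes, or return None."""
    if rest == "":
        return []
    for suffix in SUFFIXES:
        if rest.startswith(suffix):
            tail = strip_suffixes(rest[len(suffix):])
            if tail is not None:
                return [suffix] + tail
    return None


def analyze(word: str) -> Optional[List[str]]:
    """Return prefixes + root + suffixes for `word`, or None if no parse."""
    word = word.lower()
    # Try peeling off one prefix, then analyze the remainder recursively.
    for prefix in PREFIXES:
        if word.startswith(prefix):
            rest = analyze(word[len(prefix):])
            if rest is not None:
                return [prefix] + rest
    # Otherwise the word must start with a root followed only by suffixes.
    for root in ROOTS:
        if word.startswith(root):
            suffixes = strip_suffixes(word[len(root):])
            if suffixes is not None:
                return [root] + suffixes
    return None
```

For example, `analyze("Establishments")` yields the root followed by "ment" and "s", while a word with no root in the lexicon yields `None`.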
Online Resources for MA<br />
http://ltrc.iiit.ac.in/showfile.php?filename=onlineServices/morph/index.htm<br />
Telugu, Hindi, Marathi, Kannada, Punjabi<br />
http://nltr.org/snltr-software/<br />
Bangla
Guess the intended word<br />
A. Part<br />
B. X;zfg<br />
C. Poter<br />
D. Slort<br />
E. Parti<br />
F. Parat<br />
1. Sort<br />
2. Party<br />
3. Apart<br />
4. Port<br />
5. Partition<br />
6. Porter<br />
7. Potter<br />
8. Sport<br />
9. Spare
Edit distance<br />
Minimum number of<br />
deletions, substitutions, and<br />
insertions required to convert<br />
one string into another.
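The definition above corresponds to what is usually called Levenshtein distance. A minimal sketch using the standard dynamic-programming recurrence, with unit cost for every operation (the function name is illustrative):

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # turning i source characters into "" takes i deletions
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(
                prev[j] + 1,         # delete s_ch
                curr[j - 1] + 1,     # insert t_ch
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Applied to the quiz above, "poter" is at distance 1 from both "porter" and "potter", and "slort" is at distance 1 from both "sort" and "sport", so edit distance alone cannot pick the intended word.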
Beyond simple Edit Distance<br />
Errors in the first letter are rare<br />
Transposition errors are common<br />
Finger shift errors are not uncommon<br />
Phonetic errors can have very high edit distance:<br />
What is ghoti?
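One common way to account for transposition errors is to count a swap of two adjacent characters as a single edit. The sketch below implements the optimal string alignment variant of Damerau-Levenshtein distance. It is one possible extension, not necessarily the one intended in these slides, and it does not model first-letter, finger-shift, or phonetic errors.

```python
def osa_distance(source: str, target: str) -> int:
    """Edit distance where an adjacent transposition costs one edit."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "ab" in source versus "ba" in target.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this measure "prot" is one edit away from "port", where plain edit distance gives two. A phonetic respelling such as "ghoti" for "fish" still scores 5, which is why phonetic errors need a different treatment.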
Guess the intended word<br />
A. Largest river part of the world<br />
B. X;zfg binary array<br />
C. Howrah station poter charges<br />
D. Decathlon slort shop<br />
E. India Pakistan parti<br />
F. Birthday parat rentals<br />
1. Sort<br />
2. Party<br />
3. Apart<br />
4. Port<br />
5. Partition<br />
6. Porter<br />
7. Potter<br />
8. Sport<br />
9. Spare
Language models<br />
Given a sequence of words $w_1 w_2 \dots w_n$, what is the<br />
probability that it belongs to a language L?<br />
n-gram models: $P(w_n \mid w_1 w_2 \dots w_{n-1})$<br />
1. Unigram model<br />
2. Bigram model<br />
3. Trigram model
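As an illustration of the bigram case, the sketch below estimates probabilities by relative frequency from a tiny, made-up corpus of tokenized queries. The corpus, the `<s>` start marker, and the absence of smoothing are all assumptions of this example; real models smooth the counts so that unseen pairs do not get probability zero.

```python
from collections import Counter
from typing import List, Tuple

START = "<s>"  # sentence-start marker (an assumption of this sketch)


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count contexts and (context, word) pairs over tokenized sentences."""
    context_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for words in sentences:
        padded = [START] + words
        for prev, curr in zip(padded, padded[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, curr)] += 1
    return context_counts, pair_counts


def bigram_prob(prev: str, curr: str,
                context_counts: Counter, pair_counts: Counter) -> float:
    """Unsmoothed estimate of P(curr | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, curr)] / context_counts[prev]


def sequence_prob(words: List[str],
                  context_counts: Counter, pair_counts: Counter) -> float:
    """Probability of a word sequence as a product of bigram probabilities."""
    prob = 1.0
    padded = [START] + words
    for prev, curr in zip(padded, padded[1:]):
        prob *= bigram_prob(prev, curr, context_counts, pair_counts)
    return prob


# Hypothetical toy corpus, loosely echoing the quiz queries above.
CORPUS = [
    ["birthday", "party", "rentals"],
    ["birthday", "party", "ideas"],
    ["india", "pakistan", "partition"],
]
CONTEXTS, PAIRS = train_bigram(CORPUS)
```

On this toy data "party" is the only word ever seen after "birthday", which is the kind of evidence that lets a spelling corrector prefer "party" over "apart" for "birthday parat rentals".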
Measuring Quality of Language Models<br />
Entropy of $P(w_n \mid w_1 w_2 \dots w_{n-1})$:<br />
$-\sum_{v \in V} P(v \mid w_1 w_2 \dots w_{n-1}) \log_2 P(v \mid w_1 w_2 \dots w_{n-1})$<br />
Perplexity: $2^{\text{entropy}}$
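A small sketch of these two quantities for a single next-word distribution. The distribution is passed in as a plain dictionary from word to probability; the numbers in the usage note are made up for illustration.

```python
import math
from typing import Dict


def entropy(dist: Dict[str, float]) -> float:
    """Entropy in bits of a next-word distribution P(v | history)."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def perplexity(dist: Dict[str, float]) -> float:
    """Perplexity is 2 raised to the entropy."""
    return 2 ** entropy(dist)
```

A uniform distribution over four words has entropy 2 bits and perplexity 4, so perplexity can be read as the effective number of equally likely next words. A skewed distribution such as `{"a": 0.5, "b": 0.25, "c": 0.25}` has entropy 1.5 bits.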
Perplexity of language models for<br />
Queries and Documents<br />
Model Documents Queries<br />
1-gram 2007 8570<br />
2-gram 182 53<br />
3-gram 15 3
Rivers of India<br />
A. Komthi<br />
B. Cunoe<br />
C. Coyne<br />
D. Cosy<br />
E. Gaubeli<br />
1. Kali<br />
2. Khoh<br />
3. Kuno<br />
4. Koshi<br />
5. Kaveri<br />
6. Kabini<br />
7. Gomati<br />
8. Godavari<br />
9. Goini
Transliteration<br />
Sound-preserving transformation of words from one<br />
script to another.<br />
भारत<br />
Forward transliteration:<br />
Bharat, Bhaarat, Bharath, Bhaarath, Baarath<br />
Bharat<br />
Backward transliteration:<br />
भारत, भरत
Scope of Transliteration<br />
Names<br />
People, Location, Books, Movies, …
Song Lyrics
Reviews and Forums
Social Media
And an entire epic!
Aspects of Transliterated texts<br />
Code mixing<br />
Transliteration<br />
Errors, contraction
Transliteration based Tasks<br />
Forward transliteration<br />
Backward transliteration<br />
Identifying Transliteration equivalents
Rule-based transliteration<br />
क k, c, ch, q<br />
ख kh, k, q<br />
ग g, k<br />
घ gh, g<br />
च ch, c, s<br />
छ ch, chh, s<br />
ज j, g<br />
अ a, o<br />
आ a, aa, ay<br />
इ i, e, ee, ii, ey<br />
उ o, u, oo, uu, ui
Statistical Rule-based transliteration<br />
क k (0.9), c (0.05), ch (0.01), q (0.04)<br />
ख kh (0.8), k (0.15), q (0.05)<br />
ग g, k<br />
घ gh, g<br />
च ch, c, s<br />
छ ch, chh, s<br />
ज j, g<br />
अ a, o<br />
आ a, aa, ay<br />
इ i, e, ee, ii, ey<br />
उ o, u, oo, uu, ui
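A minimal sketch of how such a weighted rule table could be used to rank forward transliterations. It uses only the two entries that carry probabilities above, and it assumes each source symbol is mapped independently of its neighbours, which is a simplification. The input in the usage note is a bare symbol sequence chosen to exercise the table, not a real word.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Probabilities copied from the two weighted rules above.
RULES: Dict[str, Dict[str, float]] = {
    "क": {"k": 0.9, "c": 0.05, "ch": 0.01, "q": 0.04},
    "ख": {"kh": 0.8, "k": 0.15, "q": 0.05},
}


def transliterate(symbols: Sequence[str],
                  rules: Dict[str, Dict[str, float]] = RULES
                  ) -> List[Tuple[str, float]]:
    """Rank candidate spellings by the product of per-symbol probabilities."""
    options = [list(rules[sym].items()) for sym in symbols]
    scores: Dict[str, float] = {}
    for combo in product(*options):
        text = "".join(piece for piece, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        # Different rule choices can spell the same string; sum their mass.
        scores[text] = scores.get(text, 0.0) + prob
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `transliterate(["क", "ख"])` ranks "kkh" first with probability 0.9 × 0.8 = 0.72, followed by "kk" with 0.9 × 0.15 = 0.135.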
Noisy Channel Model<br />
s (= भारत) → noisy channel → t (= Bhaarat)<br />
s (= bhaarat) → noisy channel → t (= भारत)<br />
s (= bhaarat) → noisy channel → t (= barath)<br />
$p(t \mid s) = \frac{p(s \mid t)\, p(t)}{p(s)}$<br />
$\hat{t} = \arg\max_t\, p(s \mid t)\, p(t)$
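The decision rule can be sketched as a lookup over candidate targets. Since p(s) is the same for every candidate t, it drops out of the argmax. The channel and prior tables below are invented numbers for illustration only; in practice the channel model would come from error or transliteration statistics and the prior from a language model.

```python
from typing import Dict


def decode(observed: str,
           channel: Dict[str, Dict[str, float]],
           prior: Dict[str, float]) -> str:
    """Return the target t maximizing p(s | t) * p(t) for observed s.

    channel[t][s] holds p(s | t); prior[t] holds p(t).
    """
    def score(t: str) -> float:
        return channel.get(t, {}).get(observed, 0.0) * prior[t]

    return max(prior, key=score)


# Hypothetical numbers: "poter" is slightly more likely as a
# misspelling of "porter" than of "potter".
CHANNEL = {"porter": {"poter": 0.02}, "potter": {"poter": 0.01}}
GENERAL_PRIOR = {"porter": 0.3, "potter": 0.7}
STATION_PRIOR = {"porter": 0.6, "potter": 0.4}  # e.g. after "Howrah station"
```

With the general prior the decoder returns "potter" (0.01 × 0.7 beats 0.02 × 0.3). With the context-specific prior it returns "porter", which shows how the language model term can override the channel term and vice versa.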
Transliteration through Phonetic<br />
Transformation<br />
s (= भारत) → G2P → π (= bhArat) → P2G → t (= Bhaarat)<br />
s (= bhaarat) → G2P → π (= bhArat) → P2G → t (= भारत)<br />
s (= bhaarat) → G2P → π (= bhArat) → P2G → t (= barath)<br />
$p(t \mid s) = p(t \mid \pi)\, p(\pi \mid s) = \frac{p(\pi \mid t)\, p(t)}{p(\pi)} \cdot \frac{p(s \mid \pi)\, p(\pi)}{p(s)}$
Handling spelling variation at Web-scale
Questions