27.12.2013 Views

NLP for IR - Microsoft Research


NLP for IR

MONOJIT CHOUDHURY

MICROSOFT RESEARCH LAB INDIA


Interpreting Queries

cannot copy files from excel 2007 to excel 2010 on windows 8
side effects of liver drugs on kidney
polished tiles hard to keep clean


Beyond counting words

Word structure: Spelling & Morphology
Relationship between words: Syntax
Meaning of words: Synonyms (Lexical Semantics)
Meaning of phrases and sentences: Compositionality (Semantics)
Long-range dependencies: Discourse & topic
Translation


How many words can you make with: Establish

Establishes, Establishing, Established
Establishment, Establishments
Disestablish, Disestablishes, Disestablished, Disestablishing, Disestablishment, Disestablishments
Establishmentarian, Establishmentarians, Establishmentarianism, Establishmentarianisms
Disestablishmentarian, Disestablishmentarians, Disestablishmentarianism, Disestablishmentarianisms
Antidisestablishmentarian, Antidisestablishmentarians, Antidisestablishmentarianism, Antidisestablishmentarianisms



Morphological Analysis

Anti + dis + establish + ment + ary + an + ism

Prefixes: anti-, dis-
Root: establish
Suffixes: -ment, -ary, -an, -ism

Inflectional vs. Derivational Morphology


Analyze the following words

Information’s
Allocation
जाना
कर्तव्य
अपरिचित

ಬೆಂಗಳೂರಿನಲ್ಲಿ
ಸುಂದರವಾಗಿದೆ


Morphological Analyzer

◦ Root lexicon
◦ Suffix lexicon
◦ Morphotactic rules
◦ Morphophonemic (sandhi) rules

Finite State Automaton
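The analyzer components listed above can be sketched in a few lines. This is a minimal illustration, not a real analyzer: the lexicons are toy assumptions covering only the "establish" family, the morphotactic rule is simply "prefixes, then one root, then suffixes", and the single morphophonemic rule (surface "ari" for underlying "ary") is encoded directly in the suffix lexicon. The function name `analyze` is hypothetical.

```python
# Toy lexicons: illustrative assumptions, not a real English morphology.
PREFIXES = ["anti", "dis"]
ROOTS = ["establish"]
# Surface form -> underlying suffix. "ari" -> "ary" stands in for a
# morphophonemic rule (y becomes i before another suffix).
SUFFIXES = {
    "ment": "ment", "ari": "ary", "ary": "ary", "an": "an",
    "ism": "ism", "s": "s", "es": "es", "ed": "ed", "ing": "ing",
}


def analyze(word):
    """Return every segmentation of the form prefix* + root + suffix*."""
    results = []

    def take_suffixes(rest, morphemes):
        if not rest:  # whole word consumed: accept, like a final FSA state
            results.append(morphemes)
            return
        for surface, underlying in SUFFIXES.items():
            if rest.startswith(surface):
                take_suffixes(rest[len(surface):], morphemes + [underlying])

    def take_prefixes(rest, morphemes):
        for root in ROOTS:  # either the root starts here ...
            if rest.startswith(root):
                take_suffixes(rest[len(root):], morphemes + [root])
        for prefix in PREFIXES:  # ... or one more prefix does
            if rest.startswith(prefix):
                take_prefixes(rest[len(prefix):], morphemes + [prefix])

    take_prefixes(word.lower(), [])
    return results


if __name__ == "__main__":
    print(" + ".join(analyze("Antidisestablishmentarianism")[0]))
```

The recursion plays the role of the finite state automaton: each call is a state, each lexicon entry a transition, and an empty remainder is the accepting state. A production analyzer would compile the lexicons and rules into a finite state transducer instead.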


Online Resources for MA

http://ltrc.iiit.ac.in/showfile.php?filename=onlineServices/morph/index.htm
Telugu, Hindi, Marathi, Kannada, Punjabi

http://nltr.org/snltr-software/
Bangla


Guess the intended word

A. Part
B. X;zfg
C. Poter
D. Slort
E. Parti
F. Parat

1. Sort
2. Party
3. Apart
4. Port
5. Partition
6. Porter
7. Potter
8. Sport
9. Spare


Edit distance

The minimum number of deletions, substitutions, and insertions required to convert one string into another.


Beyond simple Edit Distance

Errors in the first letter are rare
Transposition errors are common
Finger-shift errors are not uncommon
Phonetic errors can have a very high edit distance: what is "ghoti"?
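One common way to account for transposition errors is the optimal string alignment variant of edit distance, which counts a swap of two adjacent letters as a single operation. The sketch below assumes unit costs throughout; it does not model first-letter, finger-shift, or phonetic errors, which would need position-dependent or keyboard-aware weights.

```python
def osa_distance(source, target):
    """Edit distance in which swapping two adjacent letters costs 1."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ra" <-> "ar"
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(osa_distance("prat", "part"))   # one transposition
    print(osa_distance("ghoti", "fish"))  # phonetic respelling, far apart
```

"prat" is one transposition away from "part", whereas "ghoti" and "fish" share no aligned letters and stay at the maximum possible distance for their lengths, which is the point of the phonetic-error bullet.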


Guess the intended word

A. Largest river part of the world
B. X;zfg binary array
C. Howrah station poter charges
D. Decathlon slort shop
E. India Pakistan parti
F. Birthday parat rentals

1. Sort
2. Party
3. Apart
4. Port
5. Partition
6. Porter
7. Potter
8. Sport
9. Spare


Language models

Given a sequence of words $w_1 w_2 \ldots w_n$, what is the probability that it belongs to a language $L$?

n-gram models: $P(w_n \mid w_1 w_2 \ldots w_{n-1})$

1. Unigram model
2. Bigram model
3. Trigram model
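As an illustration of the bigram case, the sketch below estimates $P(w_n \mid w_{n-1})$ from counts. The three-sentence corpus, the add-one smoothing, and the function names are assumptions made for this example only.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def bigram_prob(word, prev, model):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def sentence_prob(sentence, model):
    """Probability of a word sequence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(word, prev, model)
    return prob


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    corpus = ["birthday party rentals", "birthday party supplies",
              "india pakistan partition"]
    model = train_bigram(corpus)
    print(sentence_prob("birthday party rentals", model))
    print(sentence_prob("birthday part rentals", model))
```

Even on this tiny corpus the model prefers "birthday party rentals" over "birthday part rentals", which is how context can help pick the intended word.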


Measuring Quality of Language Models

Entropy of $P(w_n \mid w_1 w_2 \ldots w_{n-1})$:

$$-\sum_{v \in V} P(v \mid w_1 w_2 \ldots w_{n-1}) \log_2 P(v \mid w_1 w_2 \ldots w_{n-1})$$

Perplexity: $2^{\text{entropy}}$
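A small worked check of the two formulas above, assuming the conditional distribution over next words is given as a dictionary (the function names and the example distributions are illustrative):

```python
import math


def entropy(distribution):
    """Entropy in bits of a next-word distribution {word: probability}."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def perplexity(distribution):
    """Perplexity is 2 raised to the entropy."""
    return 2 ** entropy(distribution)


if __name__ == "__main__":
    uniform = {"sort": 0.25, "port": 0.25, "part": 0.25, "party": 0.25}
    skewed = {"party": 0.5, "part": 0.25, "port": 0.25}
    print(entropy(uniform), perplexity(uniform))
    print(entropy(skewed), perplexity(skewed))
```

A uniform distribution over four words has entropy 2 bits and perplexity 4, so perplexity can be read as the effective number of equally likely choices the model faces at each position.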


Perplexity of language models for Queries and Documents

Model     Documents   Queries
1-gram    2007        8570
2-gram    182         53
3-gram    15          3


Rivers of India

A. Komthi
B. Cunoe
C. Coyne
D. Cosy
E. Gaubeli

1. Kali
2. Khoh
3. Kuno
4. Koshi
5. Kaveri
6. Kabini
7. Gomati
8. Godavari
9. Goini


Transliteration

Sound-preserving transformation of words from one script to another.

Forward transliteration: भारत → Bharat, Bhaarat, Bharath, Bhaarath, Baarath

Backward transliteration: Bharat → भारत, भरत


Scope of Transliteration

Names
People, Location, Books, Movies, …


Song Lyrics


Reviews and Forums


Social Media


And an entire epic!


Aspects of Transliterated texts

Code mixing
Transliteration
Errors, contraction


Transliteration-based Tasks

Forward transliteration
Backward transliteration
Identifying transliteration equivalents


Rule-based transliteration

क → k, c, ch, q
ख → kh, k, q
ग → g, k
घ → gh, g
च → ch, c, s
छ → ch, chh, s
ज → j, g
अ → a, o
आ → a, aa, ay
इ → i, e, ee, ii, ey
उ → o, u, oo, uu, ui


Statistical Rule-based transliteration

क → k (0.9), c (0.05), ch (0.01), q (0.04)
ख → kh (0.8), k (0.15), q (0.05)
ग → g, k
घ → gh, g
च → ch, c, s
छ → ch, chh, s
ज → j, g
अ → a, o
आ → a, aa, ay
इ → i, e, ee, ii, ey
उ → o, u, oo, uu, ui
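A weighted rule table like the one above can be turned into a ranked list of candidates by multiplying the probabilities of the chosen rules. The sketch below uses only the two rows of the slide that carry probabilities (क and ख); it assumes the input is already segmented into symbols and that the rules apply independently, and the function name `transliterate` is hypothetical. Vowel signs and the inherent vowel are ignored.

```python
from itertools import product

# Rule probabilities for the two letters that have them in the table above.
RULES = {
    "क": {"k": 0.9, "c": 0.05, "ch": 0.01, "q": 0.04},
    "ख": {"kh": 0.8, "k": 0.15, "q": 0.05},
}


def transliterate(symbols, top_n=3):
    """Rank Roman candidates for a sequence of pre-segmented source symbols."""
    options = [RULES[symbol].items() for symbol in symbols]
    scores = {}
    for combination in product(*options):
        text = "".join(roman for roman, _ in combination)
        prob = 1.0
        for _, rule_prob in combination:
            prob *= rule_prob  # independence assumption between symbols
        # Different rule choices may yield the same string; add them up.
        scores[text] = scores.get(text, 0.0) + prob
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    for text, prob in transliterate(["क", "ख"]):
        print(text, round(prob, 4))
```

Enumerating every combination grows exponentially with word length, so a practical system would use beam search or dynamic programming over the same rule table.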


Noisy Channel Model

s → noisy channel → t

s (= भारत) → t (= Bhaarat)
s (= bhaarat) → t (= भारत)
s (= bhaarat) → t (= barath)

$$p(t \mid s) = \frac{p(s \mid t)\, p(t)}{p(s)}$$

$$t^{*} = \arg\max_{t}\; p(s \mid t)\, p(t)$$
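The argmax rule can be sketched for the spelling-correction case, where s is the observed string and t ranges over candidate intended words. Everything numeric here is an assumption for illustration: the channel score $p(s \mid t)$ is approximated by an unnormalised `error_rate ** edit_distance`, and the priors $p(t)$ are hypothetical values, not measured frequencies.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def channel_score(observed, intended, error_rate=0.1):
    """Unnormalised stand-in for p(s | t): each edit costs a factor error_rate."""
    return error_rate ** edit_distance(observed, intended)


def decode(observed, prior):
    """Return the t maximising p(s | t) * p(t) over the candidates in prior."""
    return max(prior, key=lambda t: channel_score(observed, t) * prior[t])


if __name__ == "__main__":
    # Hypothetical priors p(t); in practice these come from a language model.
    prior = {"porter": 0.5, "potter": 0.3, "port": 0.2}
    print(decode("poter", prior))
```

All three candidates are one edit away from "poter", so the channel score ties and the prior decides. Conditioning that prior on the surrounding words is what lets a query such as "Howrah station poter charges" resolve differently from other contexts.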


Transliteration through Phonetic Transformation

s → G2P → π (= bhArat) → P2G → t

s (= भारत) → t (= Bhaarat)
s (= bhaarat) → t (= भारत)
s (= bhaarat) → t (= barath)

$$p(t \mid s) = p(t \mid \pi)\, p(\pi \mid s) = \frac{p(\pi \mid t)\, p(t)}{p(\pi)} \cdot \frac{p(s \mid \pi)\, p(\pi)}{p(s)}$$


Handling spelling variation at Web-scale


Questions
