Introduction to Information Retrieval
Introduction to Information Retrieval Introduction to Information Retrieval
Introduction to Information Retrieval Documents as vectors • Each document is now represented as a real-valued vector of tf-idf weights ∈ R |V|. • So we have a |V|-dimensional real-valued vector space. • Terms are axes of the space. • Documents are points or vectors in this space. • Very high-dimensional: tens of millions of dimensions when you apply this to web search engines • Each vector is very sparse - most entries are zero. 34
Introduction to Information Retrieval Queries as vectors • Key idea 1: do the same for queries: represent them as vectors in the high-dimensional space • Key idea 2: Rank documents according to their proximity to the query • proximity = similarity • proximity ≈ negative distance • Recall: We’re doing this because we want to get away from the you’re-either-in-or-out, feast-or-famine Boolean model. • Instead: rank relevant documents higher than nonrelevant documents 35
- Page 1 and 2: Introduction to Information Retriev
- Page 3 and 4: Introduction to Information Retriev
- Page 5 and 6: Introduction to Information Retriev
- Page 7 and 8: Introduction to Information Retriev
- Page 9 and 10: Introduction to Information Retriev
- Page 11 and 12: Introduction to Information Retriev
- Page 13 and 14: Introduction to Information Retriev
- Page 15 and 16: Introduction to Information Retriev
- Page 17 and 18: Introduction to Information Retriev
- Page 19 and 20: Introduction to Information Retriev
- Page 21 and 22: Introduction to Information Retriev
- Page 23 and 24: Introduction to Information Retriev
- Page 25 and 26: Introduction to Information Retriev
- Page 27 and 28: Introduction to Information Retriev
- Page 29 and 30: Introduction to Information Retriev
- Page 31 and 32: Introduction to Information Retriev
- Page 33: Introduction to Information Retriev
- Page 37 and 38: Introduction to Information Retriev
- Page 39 and 40: Introduction to Information Retriev
- Page 41 and 42: Introduction to Information Retriev
- Page 43 and 44: Introduction to Information Retriev
- Page 45 and 46: Introduction to Information Retriev
- Page 47 and 48: Introduction to Information Retriev
- Page 49 and 50: Introduction to Information Retriev
- Page 51 and 52: Introduction to Information Retriev
- Page 53: Introduction to Information Retriev
<strong>Introduction</strong> <strong>to</strong> <strong>Information</strong> <strong>Retrieval</strong><br />
Documents as vec<strong>to</strong>rs<br />
• Each document is now represented as a real-valued vec<strong>to</strong>r<br />
of tf-idf weights ∈ R |V|.<br />
• So we have a |V|-dimensional real-valued vec<strong>to</strong>r space.<br />
• Terms are axes of the space.<br />
• Documents are points or vec<strong>to</strong>rs in this space.<br />
• Very high-dimensional: tens of millions of dimensions when<br />
you apply this <strong>to</strong> web search engines<br />
• Each vec<strong>to</strong>r is very sparse - most entries are zero.<br />
34