Summary

Summary Data Analysis & Retrieval Midterm

19 views 1 purchase

Course
Data Analyse En Retrieval (INFOB3DAR)

Institution
Universiteit Utrecht (UU)

Book
Introduction to Information Retrieval

All subjects that are discussed in the first part of the Data Analysis & Retrieval (INFOB3DAR) course for the midterm, clearly summarized. Based on the lectures and the book.

[Show more]

Preview 3 out of 22 pages

View example

Summarized whole book? No
Which chapters are summarized? Unknown
Uploaded on June 1, 2022
Number of pages 22
Written in 2021/2022
Type Summary

indexing
information retrieval
mapreduce
map reduce
top k searching
top k querying
top k scoring
scoring and ranking
frequent item sets
linear algebra
gaussian elimination
pagerank
page rank
approxima

Book Title:Introduction to Information Retrieval

Author(s):Christopher D. Manning, Prabhakar Raghavan

Edition:2008
ISBN:9780521865715
Edition:Unknown

Summary
Samenvatting voor Information Retrieval Exam (X_400435)
Class notes
Lectures Information Retrieval
Summary
Samenvatting Introduction to Information Retrieval

Institution
Universiteit Utrecht (UU)
Education
Informatica
Course
Data Analyse En Retrieval (INFOB3DAR)

Suniht

Member since 3 year 82 documents sold

$9.81

Added

Add to cart

Add to wishlist

100% satisfaction guarantee
Immediately available after payment
Both online and in PDF
No strings attached

Data-analysis & retrieval
Midterm

Indexing 2

MapReduce 5

Scoring & Ranking 7
Top-k searching 9

Frequent item sets 11

Linear algebra 13
Gaussian elimination 14

PageRank 15

Approximate string matching 16

,Indexing
Text searching
- Collection: Fixed set of documents
- Goal: retrieving documents relevant to the user’s information need
- User’s need for information usually expressed by one or more search terms

Quality measures:
- Precision: fraction of retrieved documents that is relevant to user’s information need
(also called selectivity)
- Recall: fraction of relevant docs in collection that are retrieved (also called sensitivity)

Boolean retrieval
- Basic model for IR (Information Retrieval)
- Uses logical operators (AND, OR, NOT) and brackets
- Term-document incidence matrix: matrix that shows if a term appears in a document
(with 1 or 0 for true or false)
- For each term, you get a bit array where each bit is determined by whether or not
the term is contained in the corresponding document
- You can use the above mentioned bit array in bitwise operations to run queries
- For example, get documents from a collection of 6 that meet the query ‘Brutus
AND Caesar BUT NOT Calpurnia’ could correspond to the bitwise operation
110100 && 110111 && (NOT 010000) = 100100
- Problem: collections are often rather large, too large for the use of such a matrix
- Solved by the sparse matrix approach
- Sparse matrix: documents identified by unique docID and terms are organised in a
dictionary, with each term having its own posting list
- Posting list: ordered list of documents containing the corresponding term
- Dictionary is implemented as a hash table or tree like structure
- Implementations of postings lists:
- Internal memory, static situation: arrays
- Internal memory, dynamic situation: linked list
- External memory: linked list (block structure)

Tree like structures
- B-tree: binary tree, but with a maximum of 4 branches leading out from a node (4
branches needs 3 values to determine which branch to follow
- Trie (prefix tree): leafs are the terms, built up by various nodes adding prefixes

, Indexing process
1. From the documents to be indexed, get the relevant tokens (terms)
2. Modify the tokens to be more general (no capital letters etc.)
3. Get posting lists for the terms using indexer

Boolean query processing
- Query = term1 AND term2
1. Locate postings list for p1 for term1
2. Locate postings list p2 for term2
3. Calculate the intersection of p1 and p2 by list merging
- List merging: keep only the docID’s that occur in all input lists (intersection)
1. Get docID of both lists
2. If they are equal, add to result
3. Otherwise, get next docID of list where current docID is lower than other
4. Repeat steps 2 and 3 with new docID
- Query = term1 AND NOT term2
- Go through the first list and add only the docID’s that do not occur in the other
input list to the result
- Query = term1 AND term2 AND … AND termn
- 𝑛! possibilities of merging one by one
- Use the length of postings list, merge lists with smallest length first
- Skip pointers: pointer to docID further down the list
- Use when the docID that the pointer points to
is lower than the current docID of the other list
- For example, on the right, the pointer from 11
to 31 will be used when merging the two lists
- Many skip pointers → more comparisons,
more skips, higher memory cost
- Few skip pointers → fewer comparisons, less frequent skips, longer
jumps, lower memory cost
- Rule of thumb: 𝑛 skip pointers for postings list of length 𝑛
- Instead of merging the lists one by one, you can merge them all at the same time
by making a slight adjustment to the algorithm for list merging
- This approach allows more efficient use of skip pointers

Phrase queries
- Juxtapositions of terms: difference between “fight club” and “fight” AND “club”
- Solution 1: biword index (make an index for each term)

The benefits of buying summaries with Stuvia:

Guaranteed quality through customer reviews

Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.

Quick and easy check-out

You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.

Focus on what matters

Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!

Frequently asked questions

What do I get when I buy this document?

You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.

Satisfaction guarantee: how does it work?

Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.

Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller Suniht. Stuvia facilitates payment to the seller.

Will I be stuck with a subscription?

No, you only buy these notes for $9.81. You're not tied to anything after your purchase.

Can Stuvia be trusted?

4.6 stars on Google & Trustpilot (+1000 reviews)

75323 documents were sold in the last 30 days

Founded in 2010, the go-to place to buy study notes for 14 years now

Start selling

Popular Universities in the United States

Popular books

Find notes and summaries for these qualifications

Summary

Summary Data Analysis & Retrieval Midterm

Document information

Subjects

Connected book

More summaries for

Written for

Seller

Reviews received

Content preview