Information Retrieval and Search

CMPT 479/880 Notes
Robert D. Cameron
March 6, 2001

The IR Problem

Given a document collection and a query, retrieve relevant documents matching the query.

Recall and Precision

Two fundamental measures of the quality of information retrieval.

Recall: the percentage of relevant documents that are retrieved.
Precision: the percentage of retrieved documents that are relevant.
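
A minimal sketch of the two measures, assuming the set of relevant documents is already known (in practice it comes from user judgments). The function name recall_precision and the document ids are hypothetical, chosen only for illustration.

```python
def recall_precision(relevant, retrieved):
    """Return (recall, precision) as fractions for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d7"}
recall, precision = recall_precision(relevant, retrieved)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
# prints: recall = 0.50, precision = 0.67
```

Retrieving every document in the collection gives 100% recall but poor precision, which is why the two measures are reported together.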

Relevance

When is a document relevant?

User's judgment.
No clear automatable answer based on the query and the document collection.
If there were an automatable answer => 100% recall, 100% precision.

Vector Space Model

A high-dimensionality vector space is constructed consisting of one dimension for every unique term found in the document collection.

Documents and queries are represented by vectors. The coordinate value along each dimension is a weight, typically based on the number of occurrences of the term in the document or query.

Documents and queries may be matched by vector comparison operations.

Distance measures.
Cosine similarity measures.
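
A sketch of both kinds of comparison, assuming vectors are stored sparsely as dictionaries mapping each term to its weight (absent terms have weight 0). The function names and the sample weights are illustrative assumptions, not part of the notes.

```python
import math


def dot(u, v):
    """Dot product of two sparse term vectors (dicts of term -> weight)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v over the union of their terms."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


# Hypothetical weights: raw term counts for one document and one query.
doc = {"information": 2, "retrieval": 1, "search": 1}
query = {"information": 1, "retrieval": 1}
print(f"cosine = {cosine_similarity(doc, query):.4f}")      # prints: cosine = 0.8660
print(f"distance = {euclidean_distance(doc, query):.4f}")   # prints: distance = 1.4142
```

Cosine similarity depends only on the direction of the vectors, so a long document and a short query can still match closely; a distance measure also penalizes the difference in length.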

Text Analysis

Extraction of the term vector from a text.

Term candidates: lexical elements extracted from a document that may be used as terms.
Stop words: commonly occurring words that are deleted from the list of term candidates.
Stemming: may be used to reduce the dimensionality by combining words that differ only in suffix.
Frequency: the number of occurrences of a word in a document.
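
The four steps above can be sketched as follows. The stop list and the suffix list are small illustrative assumptions; the stemmer is a crude suffix stripper, not a published stemming algorithm.

```python
import re
from collections import Counter

# Assumed for illustration only: a tiny stop list and a tiny suffix list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "are"}
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_vector(text):
    """Extract term candidates, drop stop words, stem, and count occurrences."""
    candidates = re.findall(r"[a-z]+", text.lower())  # term candidates
    terms = [stem(w) for w in candidates if w not in STOP_WORDS]
    return Counter(terms)  # term -> frequency in this text


vector = term_vector("The indexing of documents and the retrieval of indexed documents")
print(dict(vector))
# prints: {'index': 2, 'document': 2, 'retrieval': 1}
```

Here "indexing" and "indexed" collapse onto the single dimension "index", which is the dimensionality reduction that stemming is meant to provide.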

Term Frequency and Indexing Significance

Luhn: term frequency is useful in determining significance for retrieval.

(from Chapter 2 of Information Retrieval, C.J. van Rijsbergen)

TF-IDF Weighting

Frequently occurring terms may not be useful if they are equally frequent in all documents.

TF: Term Frequency in a document.
N: total number of documents.
n: total number of documents having the term.
IDF: Inverse Document Frequency
IDF = log₂(N/n)
The information content associated with knowing that a term is in a particular document.
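
A worked sketch of the weighting, assuming the weight of a term in a document is the product TF × IDF with IDF = log₂(N/n) as defined above. The four-document collection and the function names are hypothetical.

```python
import math


def idf(N, n):
    """Inverse document frequency: log2 of (total documents / documents with the term)."""
    return math.log2(N / n)


def tf_idf_weights(docs):
    """docs is a list of dicts (term -> TF). Returns a list of dicts (term -> TF * IDF)."""
    N = len(docs)
    df = {}  # n for each term: number of documents containing it
    for counts in docs:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return [{t: tf * idf(N, df[t]) for t, tf in counts.items()} for counts in docs]


# Hypothetical collection: "web" occurs in every document, "index" in only one.
docs = [
    {"web": 3, "search": 1},
    {"web": 1, "vector": 2},
    {"web": 2, "search": 1},
    {"web": 1, "index": 4},
]
weights = tf_idf_weights(docs)
print(weights[0])  # prints: {'web': 0.0, 'search': 1.0}
print(weights[3])  # prints: {'web': 0.0, 'index': 8.0}
```

With N = 4, the term "web" has n = 4 and IDF = log₂(4/4) = 0, so it contributes nothing despite its high frequency, while "index" has n = 1 and IDF = log₂(4/1) = 2.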

Relevance to Web Search Engines

The vector space model with TF-IDF weights is relevant to search engine development.

Raymie Stata, Krishna Bharat and Farzin Maghoul, "The Term Vector Database: fast access to indexing terms for Web pages," 9th International World Wide Web Conference, ISBN 1-930792-00-X.