Information Retrieval and Search
CMPT 479/880 Notes
Robert D. Cameron
March 6, 2001
The IR Problem
Given a document collection and a query, retrieve
relevant documents matching the query.
Recall and Precision
Two fundamental measures of the quality of information retrieval.
- Recall: what percentage of relevant documents are retrieved.
- Precision: what percentage of retrieved documents are relevant.
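As a rough illustration of the two measures, the sketch below computes both from a set of retrieved document ids and a set of relevant document ids. The function name, the document ids, and the convention of returning 0.0 for an empty denominator are assumptions made for this example, not part of the notes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 4 documents retrieved, 5 relevant, 3 in common.
p, r = precision_recall(
    {"d1", "d2", "d3", "d4"},
    {"d1", "d2", "d3", "d5", "d6"},
)
print(p, r)  # precision = 3/4, recall = 3/5
```

Retrieving every document in the collection drives recall to 100% while precision collapses, which is why the two measures are reported together.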
Relevance
When is a document relevant?
- User's judgment.
- No clear automatable answer based on the query and the
document collection.
- If there were an automatable answer, 100% recall and 100% precision would be achievable.
Vector Space Model
A high-dimensionality vector space is constructed
consisting of one dimension for every unique term
found in the document collection.
Documents and queries are represented by vectors.
The coordinate value along each dimension is a weight,
typically based on the number of occurrences of the term
in the document or query.
Documents and queries may be matched by vector comparison operations.
- Distance measures.
- Cosine similarity measures.
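A minimal sketch of the two comparison operations, assuming sparse vectors stored as term-to-weight dictionaries with raw occurrence counts as the weights. The function names and the sample texts are illustrative only.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(term, 0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty vector
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two sparse term vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0) - v.get(t, 0)) ** 2 for t in terms))


# Hypothetical document and query, with raw counts as weights.
doc = Counter("information retrieval and search".split())
query = Counter("information search".split())

sim = cosine_similarity(doc, query)
dist = euclidean_distance(doc, query)
print(sim, dist)
```

Cosine similarity depends only on the direction of the vectors, so a long document and a short query can still score highly; a distance measure is sensitive to vector length as well.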
Text Analysis
Extraction of the term vector from a text.
- Term candidates: lexical elements extracted from a document that may be
used as terms.
- Stop words: commonly occurring words deleted from the
list of term candidates.
- Stemming: may be used to reduce the dimensionality by
combining words that differ only in suffix.
- Frequency: the number of occurrences of a word in a document.
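The text analysis steps above can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the tiny stop word list, the naive suffix list (a real system would more likely use an established stemmer such as Porter's), and the rule that a stem keeps at least three letters.

```python
import re
from collections import Counter

# Small illustrative stop word list (assumption, not from the notes).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are"}

# Naive suffix stripping, checked in this order (assumption, not Porter).
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_vector(text):
    """Extract term candidates, drop stop words, stem, and count."""
    candidates = re.findall(r"[a-z]+", text.lower())
    terms = [stem(w) for w in candidates if w not in STOP_WORDS]
    return Counter(terms)


vec = term_vector("The indexes and the indexing of indexed documents")
print(vec)  # "indexes", "indexing" and "indexed" share one dimension
```

Three surface forms collapse into a single term here, which is the dimensionality reduction that stemming is meant to provide.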
Term Frequency and Indexing Significance
Luhn: term frequency is useful in determining significance for retrieval.

(from Chapter 2 of Information Retrieval, C.J. van Rijsbergen)
TF-IDF Weighting
Frequently occurring terms may not be useful if they are
equally frequent in all documents.
- TF: term frequency in a document.
- N: total number of documents.
- n: total number of documents having the term.
- IDF: inverse document frequency.
IDF = log2(N/n)
IDF measures the information content associated with knowing
that a term is in a particular document.
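A minimal sketch of the weighting, assuming the weight of a term in a document is the product TF x IDF with IDF = log2(N/n) as defined above. The function names and the four-document collection are hypothetical.

```python
import math
from collections import Counter


def idf(N, n):
    """Inverse document frequency: log2(N/n)."""
    return math.log2(N / n)


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter()  # n for each term: number of documents having the term
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # TF: occurrences of each term in this document
        # Assumed weighting: plain product of TF and IDF.
        vectors.append({t: tf[t] * idf(N, df[t]) for t in tf})
    return vectors


# Hypothetical collection: "web" occurs in every document.
docs = [
    ["web", "search", "engine"],
    ["web", "page", "index"],
    ["web", "search", "search", "query"],
    ["web", "term", "vector"],
]
weights = tf_idf_vectors(docs)
print(weights[0]["web"], weights[2]["search"], weights[0]["engine"])
```

With N = 4, the term "web" gets IDF = log2(4/4) = 0 and so contributes nothing to any vector, while "engine", found in one document, gets IDF = log2(4/1) = 2.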
Relevance to Web Search Engines
The vector space model with TF-IDF weights is relevant
to search engine development.
Raymie Stata, Krishna Bharat and Farzin Maghoul
"The Term Vector Database: fast access to indexing terms for Web pages"
9th International World Wide Web Conference
ISBN 1-930792-00-X