Natural Language Processing (NLP)

intelligently processing natural languages, like English or Chinese or German, is one of the biggest areas of AI research

  • NLP draws on many fields, including linguistics, mathematics, computer science, psychology, and cognitive science
  • plus it has many practical applications

a language is essentially a set of strings, plus rules, called a grammar, for which strings are legal

languages also have semantics, i.e. meaning

for example, the English sentence “Mario is hungry.” is grammatically well-formed, but “Mario hungry is.” is ungrammatical

the semantics of “Mario is hungry.” is that it’s an assertion that some particular person named Mario is hungry

  • the assertion may be true or false, of course
  • plus we don’t know who is doing the asserting

note that the ungrammatical sentence “Mario hungry is.” seems to have the same meaning

  • it is possible to have grammatical sentences without meaning, a famous example being “Colorless green ideas sleep furiously.” (from linguist Noam Chomsky)

nowadays, some of the most successful natural language processing techniques are statistical in nature, i.e. they use statistical and probabilistic algorithms to learn rules for solving language problems

  • such techniques don’t seem to have a deep understanding of language in the way that humans do, yet nonetheless they have proven extremely successful in solving many NLP problems

Language Models: N-grams

for some applications, a useful way to model a language is as a sequence of tokens

  • tokens could be words, or characters, or syllables, or sentences; different applications might best be handled with different tokens

for example, suppose you have a sentence consisting of 5 word tokens: <w1, w2, w3, w4, w5>

  • e.g. “Fonzi ate ten red apples” is a five-word sentence, where w1=Fonzi, w2=ate, w3=ten, etc.

the unigrams of the sentence are just the individual words: w1, w2, w3, w4, w5

the bigrams of the sentence are the pairs of adjacent words: (w1,w2), (w2,w3), (w3,w4), (w4,w5)

the trigrams of the sentence are the triples of 3 adjacent words: (w1,w2,w3), (w2,w3,w4), (w3,w4,w5)

in general, an n-gram is a sequence of n consecutive words from a text
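
as a quick illustration, here is a minimal Python sketch of extracting the n-grams of a list of word tokens (the helper function name and the use of the example sentence above are just for illustration):

    # a minimal sketch: extract the n-grams of a list of word tokens
    def ngrams(tokens, n):
        """Return the n-grams (as tuples) of n consecutive tokens."""
        return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

    words = ["Fonzi", "ate", "ten", "red", "apples"]   # the example sentence above
    print(ngrams(words, 1))   # unigrams: [('Fonzi',), ('ate',), ...]
    print(ngrams(words, 2))   # bigrams:  [('Fonzi', 'ate'), ('ate', 'ten'), ...]
    print(ngrams(words, 3))   # trigrams: [('Fonzi', 'ate', 'ten'), ...]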

given a large corpus of text, i.e. a collection of documents, we can count n-grams and get useful information about the probability of particular n-grams

for example, suppose we have counted all the trigrams for a large collection of news articles from the web

given two words followed by a blank, such as “Simon Fraser ___”, we can make a good guess about the most likely way to fill in the blank by looking at all the trigrams that start with “Simon Fraser” and choosing the most frequently occurring word after that
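
here is a rough Python sketch of that idea, under the simplifying assumption that the corpus is given as one long list of word tokens (the tiny example corpus is made up):

    from collections import Counter, defaultdict

    # count, for every pair of adjacent words, which words follow it (i.e. trigram counts)
    def train_trigrams(tokens):
        counts = defaultdict(Counter)
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
        return counts

    # guess the most frequent word that follows the pair (w1, w2) in the corpus
    def predict(counts, w1, w2):
        following = counts.get((w1, w2))
        return following.most_common(1)[0][0] if following else None

    corpus = ("Simon Fraser University is in Burnaby and "
              "Simon Fraser University has a campus downtown").split()
    counts = train_trigrams(corpus)
    print(predict(counts, "Simon", "Fraser"))   # -> 'University'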

n-grams can also be done at the level of characters, which is useful for applications such as:

  • language identification, i.e. determining what language a text is written in
  • spelling correction
  • genre identification, i.e. determining if a text is a news story, a scientific publication, a novel, etc.
  • named entity recognition, i.e. determining the names of things (like people, countries, businesses) mentioned in a text
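
for example, language identification can be done by comparing the character n-gram profile of a text against profiles built from text in each candidate language; here is a very small Python sketch (the training snippets and the crude overlap score are made up for illustration; real systems use large corpora):

    from collections import Counter

    # character bigram profile of a text
    def char_bigrams(text):
        text = text.lower()
        return Counter(text[i:i+2] for i in range(len(text) - 1))

    # crude overlap score between two bigram profiles
    def overlap(profile_a, profile_b):
        return sum(min(profile_a[g], profile_b[g]) for g in profile_a)

    # tiny made-up training snippets, one per candidate language
    profiles = {
        "english": char_bigrams("the cat sat on the mat and the dog barked"),
        "german":  char_bigrams("der hund bellte und die katze sass auf der matte"),
    }

    def identify(text):
        target = char_bigrams(text)
        return max(profiles, key=lambda lang: overlap(profiles[lang], target))

    print(identify("the dog and the cat"))   # -> 'english'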

Text Classification

text classification is the problem of labeling a text as belonging to one of a given set of pre-defined categories

for example, spam detection labels email as either “ham” or “spam”

for example, a marker gives an English essay a grade of “A+”, “A”, “B”, “C+”, etc.

n-grams can be useful in spam detection

  • typically, a spam detector is trained on both ham and spam messages, so that it can create one language model for Prob(Message|Spam) and another language model for Prob(Message|Ham)

    • the notation Prob(x|y) is a conditional probability, and means “the probability of x given that y is the case”
  • then given a new message, Bayes rule is used to estimate the probability of spam/ham (a code sketch is given below):

    \[\operatorname*{argmax}_{c\,\in\,\{\textrm{ham},\,\textrm{spam}\}} P(c|\textrm{Message}) = \operatorname*{argmax}_{c\,\in\,\{\textrm{ham},\,\textrm{spam}\}} P(\textrm{Message}|c)\,P(c)\]
    

in addition to word-based n-grams, spam filters may also use character-based n-grams so they can catch intentional misspellings like “f.r.e.e!”
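
here is a small Python sketch of this kind of classifier, using word unigram counts and add-one smoothing (the tiny training sets and the smoothing scheme are assumptions made for illustration):

    import math
    from collections import Counter

    # tiny made-up training sets; real spam filters are trained on thousands of messages
    spam_train = ["buy cheap pills now", "free money click now", "cheap pills free"]
    ham_train  = ["meeting moved to noon", "see you at the seminar", "notes from the meeting"]

    def unigram_counts(messages):
        return Counter(w for m in messages for w in m.split())

    spam_counts, ham_counts = unigram_counts(spam_train), unigram_counts(ham_train)
    p_spam = len(spam_train) / (len(spam_train) + len(ham_train))   # the prior P(c)
    p_ham  = 1 - p_spam
    vocab  = len(set(spam_counts) | set(ham_counts))

    # log( P(Message|c) * P(c) ) under a unigram model with add-one smoothing
    def log_score(message, counts, prior):
        total = sum(counts.values())
        score = math.log(prior)
        for w in message.split():
            score += math.log((counts[w] + 1) / (total + vocab))
        return score

    def classify(message):
        spam = log_score(message, spam_counts, p_spam)
        ham  = log_score(message, ham_counts, p_ham)
        return "spam" if spam > ham else "ham"

    print(classify("cheap pills"))              # -> 'spam'
    print(classify("notes from the seminar"))   # -> 'ham'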

another interesting idea for text classification mentioned in the textbook is to use compression

  • text compression algorithms, like zip, essentially find patterns in text and replace those patterns with shorter descriptions
  • you can detect spam using compression as follows:
    • append all your training spam messages into a text file and compress it
    • append all your training ham messages into another text file and compress it
    • when you get a new message M, append it to both the spam and ham messages, and compress the result — whichever compresses to a smaller file is then used to classify M, e.g. if the spam messages compress smaller, then M is classified as spam
    • the intuition is that M will compress well with a set of other messages if M shares similar patterns with those messages
    • experiments with off-the-shelf compression tools show that they can do about as well as n-gram classification, but they’re not as fast
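
a rough sketch of this trick using Python's built-in zlib module is below; to keep the comparison fair when the two training files have different sizes, this version compares how much each compressed file grows when the new message is appended, which is a slight variation on the procedure described above (the training strings are placeholders):

    import zlib

    def compressed_size(text):
        return len(zlib.compress(text.encode("utf-8")))

    # placeholder training data; in practice these would be files of many real messages
    spam_text = "buy cheap pills now free money click here cheap pills free offer"
    ham_text  = "meeting moved to noon see you at the seminar notes from the meeting"

    def classify(message):
        # how much does appending the message grow each compressed file?
        spam_growth = compressed_size(spam_text + " " + message) - compressed_size(spam_text)
        ham_growth  = compressed_size(ham_text + " " + message) - compressed_size(ham_text)
        return "spam" if spam_growth < ham_growth else "ham"

    print(classify("free cheap pills"))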

Information Retrieval

information retrieval (IR) is the task of finding documents relevant to a user's needs

these days, IR is dominated by web searching, but there’s a long history of IR research before the web

many early IR systems used boolean keyword searching, i.e. you could write logical queries like (“farm” and (“BC” or “Alberta”))

  • boolean keyword searches can be powerful, but many users find it tricky to write correct queries

most IR systems do queries on words using word count statistics

the idea is that the query is a set of words like “farming in Kansas”, and for every document in the database, a score is generated for how good a match that document is for the query

  • many IR systems do some pre-processing on queries, e.g. they might fix spelling errors, or change words like “farming” to a standard stem form such as “farm”
    • e.g. “farms”, “farming”, “farmed” might all be changed to the stem “farm” (see the example after this list)
  • documents are presented to the user from highest scoring downwards
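
as an example of the stemming step mentioned above, the Porter stemmer (here via the NLTK library, which is assumed to be installed) maps those variants to a common form:

    from nltk.stem import PorterStemmer   # assumes the NLTK package is installed

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["farms", "farming", "farmed"]])   # -> ['farm', 'farm', 'farm']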

for example, the BM25 scoring function scores a query q against a document d by taking into account three things:

  • the frequency with which query terms appear in the document; this is called TF, for term frequency; e.g. if the query is “farming in Kansas”, then a document where “farming” appears many times is considered to score more highly

  • the inverse document frequency, denoted IDF is a measure of how frequently a term appears in all documents, with a lower IDF meaning it occurs very frequently and so is likely not to be as informative as a term that occurs in fewer documents; e.g. the term “in” is a common word that occurs in many documents, while “farming” is less common; in BM25, IDF is defined as follows:

    \[\textrm{IDF}(q_i) = \log \frac{N - \textrm{DF}(q_i) + 0.5}{\textrm{DF}(q_i) + 0.5}\]

    here, \(N\) is the total number of documents in the database, and \(\textrm{DF}(q_i)\) is the number of documents that contain \(q_i\)

  • shorter documents that match the query are preferred over longer documents; the intuition is that, say, a million word document is likely to match lots of queries and not be terribly relevant, while a short document that matches the query is more likely to be relevant

here is the final formula:

\[\textrm{BM25}(d_j,q_{1:N})=\sum_{i=1}^{N}\textrm{IDF}(q_i)\cdot \frac{\textrm{TF}(q_i,d_j)\cdot (k+1)}{\textrm{TF}(q_i,d_j) + k \cdot (1-b+b\cdot \frac{|d_j|}{L})}\]

in this formula, \(q_i\) is word \(i\) of the query \(q\), \(\textrm{TF}(q_i,d_j)\) is the number of times word \(q_i\) occurs in document \(d_j\), \(L=\sum_{i}|d_i|/N\) is the average document length of documents in the database, and \(k\) and \(b\) are tuning parameters that are typically set to values like \(k=2.0\) and \(b=0.75\)
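
here is a direct Python transcription of the formula; the toy document collection and whitespace tokenization are simplifying assumptions:

    import math

    # toy document collection; a real IR system indexes millions of documents
    docs = [
        "farming in Kansas is mostly wheat farming".split(),
        "Kansas City is a city in Missouri".split(),
        "the history of industrial agriculture".split(),
        "weather report for the prairies".split(),
        "corn and wheat prices fell in April".split(),
    ]
    N = len(docs)                       # number of documents in the database
    L = sum(len(d) for d in docs) / N   # average document length
    k, b = 2.0, 0.75                    # tuning parameters, as above

    def df(term):                       # number of documents containing the term
        return sum(1 for d in docs if term in d)

    def idf(term):                      # note: a common word like "in" gets a low (even negative) IDF
        return math.log((N - df(term) + 0.5) / (df(term) + 0.5))

    def bm25(doc, query):
        score = 0.0
        for term in query:
            tf = doc.count(term)
            score += idf(term) * (tf * (k + 1)) / (tf + k * (1 - b + b * len(doc) / L))
        return score

    query = "farming in Kansas".split()
    for d in sorted(docs, key=lambda d: -bm25(d, query)):
        print(round(bm25(d, query), 3), " ".join(d))   # documents from highest scoring downwards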

in practice, IR systems are carefully evaluated and tuned on many different sets of documents, and getting a good scoring function can be surprisingly tricky

  • plus, it requires some non-trivial software engineering to make the system fast and memory-efficient on large amounts of data — anything less than instantaneous results will probably not be tolerated by most users these days!

web search engines use many of the same ideas as traditional IR systems, but hyperlinks are a major difference between web pages and regular documents

  • algorithms like Google’s PageRank take into account the quality of the web pages that “point” to a page, so that the ranking cannot be so easily fooled by spammers who add lots of occurrences of a word to their page (a rough sketch follows this list)
  • before Google, most web search engines used traditional IR techniques, and the results were typically terrible
    • you might have to click through 2 or 3 pages of results to get what you wanted
  • the arrival of Google was a major breakthrough: it ranked web search results much better than any other major search engine, and soon became the dominant searching tool it is today
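
here is a rough power-iteration sketch of the PageRank idea, on a made-up four-page link graph (this is only the basic idea, not Google's actual algorithm, which includes many refinements):

    # each page's rank is (1-d)/N plus d times the rank "passed along" by the pages linking to it
    links = {              # page -> pages it links to (a made-up graph)
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    pages = list(links)
    d = 0.85               # damping factor, a commonly quoted value
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):    # iterate until the ranks roughly stabilize
        new_rank = {p: (1 - d) / len(pages) for p in pages}
        for p, outgoing in links.items():
            for q in outgoing:
                new_rank[q] += d * rank[p] / len(outgoing)
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))   # page "C", with three in-links, ranks highest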

Information Extraction

information extraction is the process of skimming a text to look for information of interest

  • skimming implies that the analysis of the text isn’t very deep

typically, an information extractor has a relatively simple and clear goal about what sort of information it is extracting

for example, an information extraction system might search a text to find all email addresses, or all company names, or all products and their prices, etc.

for relatively simple information extraction problems, such as finding all postal codes in a text, it might be enough to use regular expressions
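
for instance, here is a short sketch using Python's re module to pull Canadian-style postal codes out of a text (the pattern and the sample sentence are just for illustration):

    import re

    # Canadian postal codes look like "V5A 1S6": letter digit letter, space, digit letter digit
    postal_code = re.compile(r"\b[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d\b")

    text = "Mail it to 8888 University Drive, Burnaby, BC V5A 1S6 before Friday."
    print(postal_code.findall(text))   # -> ['V5A 1S6']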

but for more complex information extraction, more sophisticated linguistic and learning techniques may be needed

  • e.g. suppose you want to get movie recommendations from a forum discussion
  • how would you determine what movies are being discussed, and what people think of the movie?
    • both bad/good movies might have a lot of discussion
  • perhaps you could somehow count how many people like a movie, and how many don’t
    • your program doesn’t need to understand why they like or dislike the movie, it just has to determine what the “sentiment” is, i.e. favorable/unfavorable (a crude sketch of this follows below)
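
here is a very crude Python sketch of this kind of sentiment counting, using hand-picked keyword lists and made-up forum posts (a real system would learn these cues from data):

    # crude keyword-based sentiment counting over forum posts about a movie
    positive = {"loved", "great", "awesome", "recommend", "fantastic"}
    negative = {"hated", "boring", "awful", "terrible", "waste"}

    posts = [   # made-up forum posts
        "I loved the movie, the ending was great",
        "what a waste of time, so boring",
        "awesome soundtrack, would recommend it",
    ]

    likes = dislikes = 0
    for post in posts:
        words = set(post.lower().split())
        if words & positive and not words & negative:
            likes += 1
        elif words & negative and not words & positive:
            dislikes += 1

    print(likes, "favorable,", dislikes, "unfavorable")   # -> 2 favorable, 1 unfavorable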