CMPT 413 - Spring 2008: Computational Linguistics

Computational Linguistics is the study of human language from a computational perspective. This course will examine algorithms used in the automatic analysis or production of language. Along with formal models of language, we will also study the engineering of natural language processing software.

More details: Course outline

Announcements

Assignments

Important note on assignment submission and development: Your homework will be submitted electronically using the department-provided submission server. Connect to the submission server by going to the URL: https://submit.cs.sfu.ca/

We will be using the Python-based Natural Language Toolkit (nltk) for most of the homework in this course. You must use nltk version 0.9 for the homework.

Your code should run on a standard Python 2.5 interpreter, and you can assume that the NLTK modules and corpora are available. You must test your programs on the CSIL Linux machines before you submit. Python v2.5 and the nltk v0.9 libraries and data have been installed in the directory /cmpt413, which is available on all CSIL Linux machines. If you cannot access it on a machine in the lab, please email csilop@cs.sfu.ca and they will fix the problem.

For the most part, you can test your programs by connecting with ssh to the CSIL Linux machines (peach.csil, mango.csil, and the other fruit machines in the lab).

To use the NLTK installation in /cmpt413 on the CSIL Linux machines, first check which shell you are using by running echo $SHELL. Then, every time you want to use python2.5 with nltk, run source /cmpt413/setup.sh (if you use bash or sh) or source /cmpt413/setup.csh (if you use tcsh or csh). Finally, run python2.5 to start the Python 2.5 interpreter. All nltk libraries can then be used as in the NLTK tutorials.
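After sourcing the setup script, a quick sanity check can confirm which interpreter and nltk version you are actually running. The following is only a sketch: the function name check_environment is made up for illustration, and it assumes that the installed nltk exposes a __version__ attribute (it falls back to "unknown" if not).

```python
import sys


def check_environment():
    """Return (python_version, nltk_version); nltk_version is None if nltk cannot be imported."""
    python_version = "%d.%d" % (sys.version_info[0], sys.version_info[1])
    try:
        import nltk
    except ImportError:
        # The setup script was probably not sourced in this shell.
        return python_version, None
    # Assumption: nltk exposes __version__; fall back to "unknown" otherwise.
    return python_version, getattr(nltk, "__version__", "unknown")


if __name__ == "__main__":
    python_version, nltk_version = check_environment()
    sys.stdout.write("Python %s\n" % python_version)
    if nltk_version is None:
        sys.stdout.write("nltk not found: source the setup script first\n")
    else:
        sys.stdout.write("nltk %s\n" % nltk_version)
```

If the script reports that nltk was not found, the most likely cause is that the setup script was not sourced in the current shell.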

  1. Homework #1
  2. Homework #2 (deadline extended to Wed, Feb 13 due to snow days)
    • Additional Reading for Python newbies: Python Tutorial by Guido van Rossum. Read up to Chapter 10.
    • The data files are in the directory: ~anoop/cmpt413/hw2/
    • AT&T fst toolkit: ~anoop/cmpt413/fsm-4.0/bin; download and documentation
    • CMU Pronunciation dictionary: cmudict
  3. Homework #3
    • The data files are in the directory: ~anoop/cmpt413/hw3/
    • Read Chapter 4 of the NLTK Tutorial.
    • If you are attempting Question 8, read Kevin Knight's workbook on statistical machine translation.
  4. Homework #4
    • The data files are in the directory: ~anoop/cmpt413/hw4/
    • Read Chapter 7 and Chapter 8 of the NLTK Tutorial.
  5. Homework #5
    • The data files are in the directory: ~anoop/cmpt413/hw5/
    • Read Chapter 8 and Chapter 11 of the NLTK Tutorial.

On some CSIL Linux machines, you might in rare cases have to extend the CPU time limit for a process. If you are using tcsh, run the command "limit cputime 1800" to extend the CPU time limit to 1800 seconds (30 minutes). If you are using bash, use the command "ulimit -t 1800".
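To see which CPU time limit currently applies to your Python process, you can query it from inside Python with the standard resource module (Unix only). This is a minimal sketch; the function name describe_cpu_limit is made up for illustration, and the values it reports depend on the machine and shell you are using.

```python
import resource


def describe_cpu_limit():
    """Return a string describing the soft and hard CPU time limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)

    def fmt(value):
        # RLIM_INFINITY means that no limit is set.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return "%d seconds" % value

    return "soft=%s, hard=%s" % (fmt(soft), fmt(hard))


if __name__ == "__main__":
    import sys
    sys.stdout.write(describe_cpu_limit() + "\n")
```

The soft limit is the one that terminates a long-running process; the hard limit is the ceiling up to which the soft limit can be raised.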

Textbook and References

Syllabus and Readings

The following list summarizes the topics that will be covered in this course. Also included are the required and optional readings for each topic. J&M refers to the book "Speech and Language Processing" by Jurafsky and Martin. In addition to the required readings, optional or review readings are provided for those who are having difficulty understanding the material.

  1. Introduction to Linguistics and Formal language theory
  2. Finite-state methods: automata and transducers (applications to orthography, morphology, phonology)
  3. Finite-state methods: edit distance (shortest path in a transducer, spelling correction, evaluation metrics)
  4. Probability models and language: n-grams
    • Chp 6 sections 6.1-6.2 and 6.7 (J&M). 
      • Warning: it is essential to read the errata pages for this chapter. Do not read sections 6.3-6.6 from J&M (1st ed. only; if you have the 2nd ed., read that for this chapter)
    • Links: TextCat and languid: language identification based on n-gram matching
  5. Hidden Markov models (sequence learning)
  6. Some applications of sequence learning (automatic speech recognition, part of speech tagging, name-finding, word segmentation)
  7. Context-free grammars and parsing algorithms (natural language syntax)
  8. Feature structures and unification
  9. Lexical semantics and word-sense disambiguation
    • Notes #18 (3/31/2008)
    • Chp 16 and Chp 17, sections 17.1 and 17.2 (J&M)
  10. Discourse and dialog models
    • Notes #19 (4/7/2008)
    • Chp 18, section 18.1 and Chp 19, sections 19.1-19.3 (J&M)
  11. Natural Language Semantics (translation into logic, language understanding, language generation)
  12. Natural Language and complexity theory (mathematical linguistics)
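As a small taste of topic 3 above, here is a sketch of edit distance computed by dynamic programming, which amounts to finding the cheapest path through the grid of insertions, deletions, and substitutions. It is not the code used in the homework. The function name edit_distance and the sub_cost parameter are chosen for illustration, and insertions and deletions are assumed to cost 1 each.

```python
def edit_distance(source, target, sub_cost=1):
    """Minimum edit distance between two sequences.

    Insertions and deletions cost 1; substituting one symbol for a
    different one costs sub_cost (1 gives the Levenshtein distance).
    """
    n, m = len(source), len(target)
    # previous[j] is the distance between source[:i-1] and target[:j];
    # for the empty source prefix this is j insertions.
    previous = list(range(m + 1))
    for i in range(1, n + 1):
        # current[0] is the cost of deleting the first i symbols of source.
        current = [i] + [0] * m
        for j in range(1, m + 1):
            if source[i - 1] == target[j - 1]:
                substitute = previous[j - 1]
            else:
                substitute = previous[j - 1] + sub_cost
            current[j] = min(previous[j] + 1,      # deletion
                             current[j - 1] + 1,   # insertion
                             substitute)           # match or substitution
        previous = current
    return previous[m]


if __name__ == "__main__":
    import sys
    sys.stdout.write("%d\n" % edit_distance("kitten", "sitting"))  # prints 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the target rather than with the product of the two lengths. Setting sub_cost=2 makes a substitution as expensive as a deletion followed by an insertion.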

Course Expectations and Policies


anoop at cs.sfu.ca