CMPT 413 - Spring 2008: Computational Linguistics

Instructor: Dr. Anoop Sarkar
Location and Time:
AQ 3003, 2:30-3:20p Mon, Fri
AQ 3159, 2:30-3:20p Wed

Mailing List: cmpt-413 _at_ sfu.ca (always prefix "cmpt-413: " to all messages sent to this list)
Mailing list archives

Office: TASC 9427
Office hours: Wed, 12:30-1:30

Teaching Assistant: Anton Venema (email: avenema _at_ gmail.com)
TA Office Hours: Wed, 3:30-4:30 in TASC1-9406.

Computational Linguistics is the study of human language from a computational perspective. This course will examine algorithms used in the automatic analysis or production of language. Along with formal models of language, we will also study the engineering of natural language processing software.

More details: Course outline

Announcements

Grading for the course:

Final Exam: 34%
3 quizzes: 12% each (total of 36%)
Homeworks: 30%

Important Dates:

Mon, Jan 7: First day of classes
Mon, Apr 7: Last day of classes
Quiz #1: Feb 4, 2008
Quiz #2: Mar 10, 2008
Quiz #3: Mar 31, 2008
Fri, Apr 18: Final Exam. Time: 3:30-6:30pm. Location: AQ 5018

Homework files are in the Unix directory ~anoop/cmpt413/ on the CSIL Linux machines.

Assignments

Important note on assignment submission and development: Your homework will be submitted electronically using the department-provided submission server. Connect to the submission server by going to the URL: https://submit.cs.sfu.ca/

We will be using the Python-based nltk: Natural Language Toolkit for most of the homeworks in this course. You must use nltk version 0.9 for the homeworks.

Your code should run on a standard Python 2.5 interpreter and you can assume that the NLTK modules and corpora are available. You must test your programs on the CSIL linux machines before you submit. Python v2.5 and the nltk v0.9 libraries and data have been installed in the directory /cmpt413 which is available on all CSIL Linux machines. If you cannot access it on a machine in the lab, please email csilop@cs.sfu.ca and they will fix the problem.

For the most part, you can test your programs by connecting over ssh to the CSIL Linux machines (peach.csil, mango.csil, and the other fruit-named machines in the lab).

The NLTK installation on the CSIL Linux machines is in /cmpt413. To use it:

First, check which shell you are using by running echo $SHELL.
Then, every time you want to use python2.5 with nltk, run source /cmpt413/setup.sh (if you use bash or sh) or source /cmpt413/setup.csh (if you use tcsh or csh).
Finally, run python2.5 to start up the Python 2.5 interpreter. All nltk libraries can now be used as in the NLTK tutorials.
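
One quick way to confirm that the setup script took effect is to ask the interpreter what it sees. The sketch below is only an illustration: the function name environment_report is hypothetical, and it assumes the installed NLTK may or may not expose a __version__ attribute, so it falls back to "unknown" rather than failing.

```python
import sys


def environment_report():
    """Return the running Python version and the NLTK version, if NLTK imports."""
    report = {"python": "%d.%d" % tuple(sys.version_info[:2]), "nltk": None}
    try:
        import nltk
        # Assumption: some NLTK releases may lack __version__, hence the fallback.
        report["nltk"] = getattr(nltk, "__version__", "unknown")
    except ImportError:
        # NLTK is not on the path; the setup script was probably not sourced.
        pass
    return report


print(environment_report())
```

If the "nltk" entry is None, the setup script has most likely not been sourced in the current shell.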

Homework #1

Additional Reading: Short Introduction to NLTK by S. Bird, E. Klein and E. Loper.
The data files are in the directory: ~anoop/cmpt413/hw1/
The Porter stemmer is available from Martin Porter's page

Homework #2 (deadline extended to Wed, Feb 13 due to snow days)

Additional Reading for Python newbies: Python Tutorial by Guido van Rossum. Read up to Chapter 10.
The data files are in the directory: ~anoop/cmpt413/hw2/
AT&T fst toolkit: ~anoop/cmpt413/fsm-4.0/bin; download and documentation
CMU Pronunciation dictionary: cmudict

Homework #3

The data files are in the directory: ~anoop/cmpt413/hw3/
Read Chapter 4 of the NLTK Tutorial.
If you are attempting Question 8, read Kevin Knight's workbook on statistical machine translation.

Homework #4

The data files are in the directory: ~anoop/cmpt413/hw4/
Read Chapter 7 and Chapter 8 of the NLTK Tutorial.

Homework #5

The data files are in the directory: ~anoop/cmpt413/hw5/
Read Chapter 8 and Chapter 11 of the NLTK Tutorial.

On some CSIL Linux machines, in some rare cases, you might have to extend your CPU time limit for a process. If you are using tcsh then run the command "limit cputime 1800" to extend CPU time to 1800 secs or 30 mins. If you are using bash then use the command "ulimit -t 1800".
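
The same limit can also be inspected or raised from inside a Python program using the standard resource module (Unix only). This is a minimal sketch, not part of the course setup: the function name extend_cpu_limit is hypothetical, and it deliberately leaves the limit alone when it is already unlimited or at least as large as requested.

```python
import resource


def extend_cpu_limit(seconds=1800):
    """Raise the soft CPU-time limit of this process to `seconds` if it is lower.

    The request is capped by the hard limit, which an ordinary user cannot raise.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    # RLIM_INFINITY means "no limit"; nothing to extend in that case.
    if soft == resource.RLIM_INFINITY or soft >= seconds:
        return soft, hard
    if hard != resource.RLIM_INFINITY:
        seconds = min(seconds, hard)
    resource.setrlimit(resource.RLIMIT_CPU, (seconds, hard))
    return resource.getrlimit(resource.RLIMIT_CPU)
```

Calling extend_cpu_limit(1800) at the top of a long-running script has roughly the effect of the shell commands above, but only for that one process.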

Textbook and References

Textbook:

No official textbook; use the lecture notes and the readings from the Syllabus below.

Recommended Textbooks:

We will follow the material in this textbook closely but not in all aspects. The exercises and discussion in this book will be helpful to supplement what is discussed in class.

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin. 1st edition (January 26, 2000), 934 pages, Prentice Hall, ISBN: 0130950696

The book also has a webpage. In particular visit it for the Errata and the online Resources sections.

Introduction to Natural Language Processing by Steven Bird, Ewan Klein and Edward Loper. This book describes the Python-based Natural Language toolkit that we will use in the homework assignments and will provide additional reading material for that purpose. Note that the tutorials needed for the homeworks are available online on nltk.sf.net.

Reference Textbooks: Material that I use from time to time, but not required reading for those taking the course.

Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. 1st edition (1999), 680 pages, M.I.T. Press/Triliteral, ISBN: 0262133601

This book will be useful in cases where you want a different presentation of the same material that is required reading from J&M. In many cases the statistical approaches are covered in a bit more detail in this book. However, it does not contain all the topics that we will cover in this course.

Syllabus and Readings

The following list summarizes the topics that will be covered in this course. Also included are the required and optional readings for each topic. J&M refers to the book "Speech and Language Processing" by Jurafsky and Martin. Apart from the required readings, the optional or review readings are provided for those who are having difficulty understanding the material.

Introduction to Linguistics and Formal language theory

Notes #1 (1/14/2008)
Chp 1 and 2 (J&M)
Links: Festival: Open-source text to speech, Speech Animation at AT&T, Summarization at SFU
Links: CMU Communicator: Dialog System, RUTH: Rutgers University Talking Head, WordsEye: NLP & Graphics

Finite-state methods: automata and transducers (applications to orthography, morphology, phonology)

Notes #2 (1/14/2008)
Notes #3 (1/14/2008)
Notes #4 (1/23/2008)
Notes #5 (2/1/2008)
Chp 1 and 2 (J&M)
Chp 3 (J&M)

Finite-state methods: edit distance (shortest path in a transducer, spelling correction, evaluation metrics)

Notes #6 (2/1/2008)
Chp 5 sections 5.1-5.6 (J&M)
Links: Levenshtein Demo
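
As a small illustration of the edit distance topic above, here is a sketch of the standard dynamic-programming computation of minimum edit distance. It is not taken from the lecture notes: the function name and the sub_cost parameter are assumptions for illustration. Insertions and deletions cost 1; substitutions cost 1 by default, and passing sub_cost=2 gives the variant in which a substitution counts as a deletion plus an insertion.

```python
def edit_distance(source, target, sub_cost=1):
    """Minimum edit distance between two sequences via dynamic programming."""
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i symbols of the source prefix
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j symbols of the target prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
    return dist[n][m]


print(edit_distance("intention", "execution"))
```

The table dist corresponds to shortest-path costs in the edit transducer: each cell is a state, and each of the three options in the min is an arc.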

Probability models and language: n-grams

Notes #7 (2/13/2008)
Notes #8 (2/20/2008)
Notes #9 (2/18/2008)
Reading: Sections 1-14 from Kevin Knight's statistical MT workbook
Reading: Sections 1-2.7 (p.15) & Section 5.1 from Empirical Study of Smoothing by Chen and Goodman

Chp 6 sections 6.1-6.2 and 6.7 (J&M).

Warning: it is essential to read the errata pages for this chapter. Do not read sections 6.3-6.6 from J&M (this applies to the 1st ed. only; if you have the 2nd ed., read that version of the chapter instead).

Links: TextCat and languid: language identification based on n-gram matching
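
To make the n-gram topic concrete, the following sketch trains a bigram model and returns add-one (Laplace) smoothed probabilities. Add-one smoothing is chosen here only because it is the simplest scheme to show; it is not a recommendation from the readings. The function names and the <s> and </s> boundary markers are assumptions for illustration.

```python
from collections import defaultdict


def train_bigrams(sentences):
    """Count bigrams and their contexts over tokenized sentences."""
    context_counts = defaultdict(int)  # how often each word appears as a context
    bigram_counts = defaultdict(int)   # how often each (previous, word) pair appears
    vocab = set()                      # word types that can be predicted
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
            vocab.add(word)
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, word, model):
    """Add-one smoothed estimate of P(word | prev)."""
    context_counts, bigram_counts, vocab = model
    count = bigram_counts.get((prev, word), 0)
    return (count + 1.0) / (context_counts.get(prev, 0) + len(vocab))


model = train_bigrams([["a", "b"], ["a", "c"]])
print(bigram_prob("a", "b", model))
```

For any fixed context, the smoothed probabilities over the vocabulary sum to one, which is a useful sanity check when experimenting with other smoothing methods.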

Hidden Markov models (sequence learning)

Notes #10 (2/20/2008)
Notes #10a (3/5/2008)
Notes #10b (3/5/2008)
Spreadsheet demos: Viterbi algorithm and HMM Learning
Chp 5 section 5.9; sections 7.1-7.3 and Appendix D (J&M)
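
For readers who prefer code to spreadsheets, here is a minimal sketch of the Viterbi algorithm for a discrete HMM. It is an illustration only and is not derived from the spreadsheet demos; the function signature and the toy weather-style model below are hypothetical. Probabilities are multiplied directly, which is fine for short sequences; longer inputs would normally use log probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [dict((s, start_p[s] * emit_p[s][observations[0]]) for s in states)]
    back = [{}]  # back[t][s] = predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model with made-up numbers, for illustration only.
STATES = ["Healthy", "Fever"]
START = {"Healthy": 0.6, "Fever": 0.4}
TRANS = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
         "Fever": {"Healthy": 0.4, "Fever": 0.6}}
EMIT = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
        "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

print(viterbi(["normal", "cold", "dizzy"], STATES, START, TRANS, EMIT))
```

Replacing max with a sum over predecessors (and dropping the back-pointers) turns the same recurrence into the forward algorithm.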

Some applications of sequence learning (automatic speech recognition, part of speech tagging, name-finding, word segmentation)

Notes #11 (2/22/2008)
Notes #12 (2/20/2008)
Chp 8 sections 8.1-8.5 (J&M)

Context-free grammars and parsing algorithms (natural language syntax)

Notes #13 (2/28/2008)
Notes #14 (2/28/2008)
Notes #15 (2/28/2008)
Notes #16 (2/28/2008)
Chp 9 and Chp 10 (J&M)
Earley Algorithm (ppt) from Jason Eisner's NLP course
Review: Chp 2 and 4.1 (Sipser)

Feature structures and unification

Notes #17 (3/17/2008)
Supplementary transparencies (3/28/07)
Chp 11 (J&M)

Lexical semantics and word-sense disambiguation

Notes #18 (3/31/2008)
Chp 16 and Chp 17, sections 17.1 and 17.2 (J&M)

Discourse and dialog models

Notes #19 (4/7/2008)
Chp 18, section 18.1 and Chp 19, sections 19.1-19.3 (J&M)

~~Natural Language Semantics (translation into logic, language understanding, language generation)~~

Notes #20
Chp 14 and 15 (J&M)

~~Natural Language and complexity theory (mathematical linguistics)~~

Notes #21
Chp 13 (J&M)

Course Expectations and Policies

All course information on this web page is tentative and could be in error. It can also change at any time. Confirm crucial dates or information with me in person during class. Double check the SFU calendar or schedule information for official class times and for the final exam time and location.
Students are expected to attend all classes: announcements about assigned readings, homeworks and exams will be made available at the start of each class. Such announcements may not be made on this web page, so don't rely on information here instead of attending class.
Lecture notes or other materials put up on this web page are only additional material and not an alternative to the readings assigned. Reading only the lecture notes will not be enough to prepare for the assignments or the exams.
Late assignments will be graded as follows: you will have 5 grace days which you can use throughout the semester; you can use all 5 days for one assignment or submit up to 5 assignments late by one day without penalty (or any other combination that adds up to 5 days). Once your grace days are used up you will be graded on 40% of the total grade for the assignment.
If you must miss an exam because of illness, you are required to contact me prior to the exam either by email or a message in my mailbox. A valid note from a medical doctor is required specifying date of absence and reason. If you miss an exam due to valid medical reasons you will be graded on your performance on the rest of the course. Make-up exams will not be given under any circumstances.
Email policy: Use the prefix "cmpt-413: " on all your messages. If you do not include the prefix, then the mail might go unanswered.
For personal advising come during my office hours (posted above).
Copying on assignments or exams will be taken very seriously. If you are caught cheating on an assignment or an exam you will be hauled off for disciplinary action. There will be no assignments to be solved as a group. Despite this, students often meet to discuss the assignments. You have to be very careful that you do not take any notes or copy during these meetings.
For more on academic dishonesty read the University code of academic honesty.

anoop at cs.sfu.ca
