Announcements
- Grading for the course:
- Final Exam: 34%
- 3 quizzes: 12% each (total of 36%)
- Homeworks: 30%
- Important Dates:
- Mon, Jan 7: First day of classes
- Mon, Apr 7: Last day of classes
- Quiz #1: Feb 4, 2008
- Quiz #2: Mar 10, 2008
- Quiz #3: Mar 31, 2008
- Fri, Apr 18: Final Exam. Time: 3:30-6:30pm. Location: AQ 5018
- The homework files are in the Unix directory ~anoop/cmpt413/ on the CSIL Linux machines
Assignments
Important note on assignment submission and development: Your homework will be submitted electronically using the department-provided submission server. Connect to the submission server by going to the URL: https://submit.cs.sfu.ca/
We will be using the Python-based Natural Language Toolkit (nltk) for most of the homeworks in this course. You must use nltk version 0.9 for the homeworks.
Your code should run on a standard Python 2.5 interpreter, and you can assume that the NLTK modules and corpora are available. You must test your programs on the CSIL Linux machines before you submit.
Python v2.5 and the nltk v0.9 libraries and data have been installed in the directory /cmpt413, which is available on all CSIL Linux machines. If you cannot access it on a machine in the lab, please email csilop@cs.sfu.ca and they will fix the problem.
For the most part, you can ssh into the CSIL Linux machines (peach.csil, mango.csil, and the other fruit-named machines in the lab) to test your programs.
The NLTK installation on the CSIL Linux machines is in /cmpt413. First, check which shell you are using by running echo $SHELL. Then, every time you want to use python2.5 with nltk, run source /cmpt413/setup.sh (if you use bash or sh) or source /cmpt413/setup.csh (if you use tcsh or csh). Finally, run python2.5 to start the Python 2.5 interpreter. All nltk libraries can now be used as in the NLTK tutorials.
- Homework #1
- Additional Reading: Short Introduction to NLTK by S. Bird, E. Klein and E. Loper.
- The data files are in the directory:
~anoop/cmpt413/hw1/
- The Porter stemmer is available from Martin Porter's page
- Homework #2 (deadline extended to Wed, Feb 13 due to snow days)
- Additional Reading for Python newbies: Python Tutorial by Guido van Rossum. Read up to Chapter 10.
- The data files are in the directory:
~anoop/cmpt413/hw2/
- AT&T fst toolkit: ~anoop/cmpt413/fsm-4.0/bin; download and documentation
- CMU Pronunciation dictionary: cmudict
- Homework #3
- The data files are in the directory:
~anoop/cmpt413/hw3/
- Read Chapter 4 of the NLTK Tutorial.
- If you are attempting Question 8, read Kevin Knight's workbook on statistical machine translation.
- Homework #4
- The data files are in the directory:
~anoop/cmpt413/hw4/
- Read Chapter 7 and Chapter 8 of the NLTK Tutorial.
- Homework #5
- The data files are in the directory:
~anoop/cmpt413/hw5/
- Read Chapter 8 and Chapter 11 of the NLTK Tutorial.
On some CSIL Linux machines, in some rare cases, you might have to extend the CPU time limit for a process. If you are using tcsh, run the command "limit cputime 1800" to extend the CPU time to 1800 seconds (30 minutes). If you are using bash, use the command "ulimit -t 1800".
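A typical session applying the tip above might look like the following sketch (the exact output of these commands depends on your machine and shell):

```shell
# Check which shell you are running first:
echo $SHELL
# bash or sh: raise the per-process CPU-time limit to 1800 seconds (30 minutes)
# (in tcsh/csh, the equivalent is: limit cputime 1800)
ulimit -t 1800
# show the current limit to confirm it took effect
ulimit -t
```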
Textbook and References
- Textbook:
- no official textbook; use the lecture notes and the readings from the Syllabus section below
- Recommended Textbooks:
- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin. 934 pages, 1st edition (January 26, 2000), Prentice Hall, ISBN: 0130950696. We will follow the material in this textbook closely but not in all aspects. The exercises and discussion in this book will be helpful to supplement what is discussed in class.
The book also has a webpage. In particular, visit it for the Errata and the online Resources sections.
- Introduction to Natural Language Processing by Steven Bird, Ewan Klein and Edward Loper. This book describes the Python-based Natural Language toolkit that we will use in the homework assignments and will provide additional reading material for that purpose. Note that the tutorials needed for the homeworks are available online on nltk.sf.net.
- Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. 680 pages, 1st edition (1999), M.I.T. Press/Triliteral, ISBN: 0262133601. This book will be useful in cases where you want a different presentation of the same material that is required reading from J&M. In many cases the statistical approaches are covered in a bit more detail in this book. However, it does not contain all the topics that we will cover in this course.
Syllabus and Readings
The following list summarizes the topics that will be covered in this course. Also included are the required and optional readings for each topic. J&M refers to the book "Speech and Language Processing" by Jurafsky and Martin. Apart from the required readings, the optional or review readings are provided for those who are having difficulty understanding the material.
- Introduction to Linguistics and Formal language theory
- Notes #1 (1/14/2008)
- Chp 1 and 2 (J&M)
- Links: Festival: Open-source text to speech, Speech Animation at AT&T, Summarization at SFU
- Links: CMU Communicator: Dialog System, RUTH: Rutgers University Talking Head, WordsEye: NLP & Graphics
- Finite-state methods: automata and transducers (applications to orthography, morphology, phonology)
- Notes #2 (1/14/2008)
- Notes #3 (1/14/2008)
- Notes #4 (1/23/2008)
- Notes #5 (2/1/2008)
- Chp 1 and 2 (J&M)
- Chp 3 (J&M)
- Finite-state methods: edit distance (shortest path in a transducer, spelling correction, evaluation metrics)
- Notes #6 (2/1/2008)
- Chp 5 sections 5.1-5.6 (J&M)
- Links: Levenshtein Demo
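The dynamic-programming recurrence behind Levenshtein edit distance (Chp 5 of J&M) can be sketched in a few lines of Python. The function name and unit costs below are illustrative only, not part of the course materials (J&M also discuss a cost-2 substitution variant):

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, and unit-cost
    substitutions needed to turn source into target."""
    m, n = len(source), len(target)
    # d[i][j] = edit distance between source[:i] and target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

print(edit_distance("intention", "execution"))  # J&M's classic example
```

With unit costs the "intention"/"execution" example comes out to 5.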
- Probability models and language: n-grams
- Notes #7 (2/13/2008)
- Notes #8 (2/20/2008)
- Notes #9 (2/18/2008)
- Reading: Sections 1-14 from Kevin Knight's statistical MT workbook
- Reading: Sections 1-2.7 (p.15) & Section 5.1 from Empirical Study of Smoothing by Chen and Goodman
- Chp 6 sections 6.1-6.2 and 6.7 (J&M).
- Warning: it is essential to read the errata pages for this chapter, and do not read sections 6.3-6.6 in the 1st edition of J&M (read the 2nd edition for this chapter if you have it)
- Links: TextCat and languid: language identification based on n-gram matching
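As a concrete illustration of the n-gram material, here is a sketch of a bigram model with add-one (Laplace) smoothing; the tiny corpus and the function names are made up for the example:

```python
from collections import defaultdict

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with boundary markers."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    vocab = set()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        vocab.update(padded)
        for w in padded:
            unigrams[w] += 1
        for w1, w2 in zip(padded, padded[1:]):
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams, vocab

def prob(w2, w1, unigrams, bigrams, vocab):
    """Add-one smoothed P(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + V)."""
    return (bigrams[(w1, w2)] + 1.0) / (unigrams[w1] + len(vocab))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi, V = train_bigram(corpus)
# a seen bigram gets higher probability than an unseen one
print(prob("cat", "the", uni, bi, V) > prob("dog", "cat", uni, bi, V))
```

Smoothing ensures unseen bigrams such as ("cat", "dog") still get nonzero probability, at the cost of discounting the seen ones.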
- Hidden Markov models (sequence learning)
- Notes #10 (2/20/2008)
- Notes #10a (3/5/2008)
- Notes #10b (3/5/2008)
- Spreadsheet demos: Viterbi algorithm and HMM Learning
- Chp 5 section 5.9; sections 7.1-7.3 and Appendix D (J&M)
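The Viterbi algorithm from the spreadsheet demos can be sketched as below. The weather HMM is a standard toy example, not course data, and the probabilities are made up for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs under a simple HMM
    (probabilities given as nested dicts)."""
    # V[t][s] = (best prob of any path ending in state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][best_prev][0] * trans_p[best_prev][s] * emit_p[s][obs[t]],
                       best_prev)
    # follow backpointers from the best final state
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    path.reverse()
    return path

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p))
```

The dictionary-of-tuples table mirrors the spreadsheet layout: one column per time step, one cell per state, each cell holding a probability and a backpointer.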
- Some applications of sequence learning (automatic speech recognition, part-of-speech tagging, name-finding, word segmentation)
- Context-free grammars and parsing algorithms (natural language syntax)
- Notes #13 (2/28/2008)
- Notes #14 (2/28/2008)
- Notes #15 (2/28/2008)
- Notes #16 (2/28/2008)
- Chp 9 and Chp 10 (J&M)
- Earley Algorithm (ppt) from Jason Eisner's NLP course
- Review: Chp 2 and 4.1 (Sipser)
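A chart parser of the kind covered in Chp 10 of J&M can be sketched as a CYK recognizer for a grammar in Chomsky Normal Form; the toy grammar and function name here are made up for the example:

```python
def cyk_recognize(words, lexical, binary, start="S"):
    """CYK recognizer for a CNF grammar.
    lexical: dict word -> set of nonterminals A with rules A -> word
    binary:  dict (B, C) -> set of nonterminals A with rules A -> B C"""
    n = len(words)
    # table[i][j] = set of nonterminals deriving words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):          # widths, smallest first
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # split point
                for B in table[i][k]:
                    for C in table[k][j]:
                        table[i][j] |= binary.get((B, C), set())
    return start in table[0][n]

lexical = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
binary = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cyk_recognize(["she", "eats", "fish"], lexical, binary))
```

Extending each table cell to store backpointers instead of bare nonterminals turns the recognizer into a parser that recovers trees.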
- Feature structures and unification
- Notes #17 (3/17/2008)
- Supplementary transparencies (3/28/07)
- Chp 11 (J&M)
- Lexical semantics and word-sense disambiguation
- Notes #18 (3/31/2008)
- Chp 16 and Chp 17, sections 17.1 and 17.2 (J&M)
- Discourse and dialog models
- Notes #19 (4/7/2008)
- Chp 18, section 18.1 and Chp 19, sections 19.1-19.3 (J&M)
- Natural Language Semantics (translation into logic, language understanding, language generation)
- Notes #20
- Chp 14 and 15 (J&M)
- Natural Language and complexity theory (mathematical linguistics)
- Notes #21
- Chp 13 (J&M)
Course Expectations and Policies
- All course information on this web page is tentative and could be in error. It can also change at any time. Confirm crucial dates or information with me in person during class. Double check with the SFU calendar or schedule information for official class times and the final exam time and location.
- Students are expected to attend all classes: announcements about assigned readings, homeworks and exams will be made available at the start of each class. Such announcements may not be made on this web page, so don't rely on information here instead of attending class.
- Lecture notes or other materials put up on this web page are only additional material and not an alternative to the readings assigned. Only reading the lecture notes will not be enough to prepare for the assignments or the exams.
- Late assignments will be graded as follows: you will have 5 grace days which you can use throughout the semester; you can use all 5 days for one assignment, or submit up to 5 assignments late by one day each without penalty (or any other combination that adds up to 5 days). Once your grace days are used up, you will be graded on 40% of the total grade for the assignment.
- If you must miss an exam because of illness, you are required to contact me prior to the exam either by email or a message in my mailbox. A valid note from a medical doctor is required specifying date of absence and reason. If you miss an exam due to valid medical reasons you will be graded on your performance on the rest of the course. Make up exams will not be given under any circumstances.
- Email policy: Use the prefix "cmpt-413: " on all your messages. If you do not include the prefix, then the mail might go unanswered.
- For personal advising come during my office hours (posted above).
- Copying on assignments or exams will be taken very seriously. If you are caught cheating on an assignment or an exam, you will be hauled off for disciplinary action. No assignments are to be solved as a group. Despite this, students often meet to discuss the assignments; be very careful that you do not take notes or copy anything during these meetings.
- For more on academic dishonesty read the University code of academic honesty.