Anoop Sarkar: CMPT 882-3 - Statistical Learning of Natural Language

CMPT 882-3 - Statistical Learning of Natural Language

Anoop Sarkar

Office: ASB 10859

Phone: 291-4933

Email: anoop at cs.sfu.ca

CMPT 882-3: Fall 2002

Monday 4:00p - 5:50p in SCB 8662

Wednesday 3:30p - 4:20p in SCB 8662

Office Hrs by appointment (send me email)

In this course we will study basic algorithms that produce state of the art results on tasks involving natural language text. For each of these tasks, we will compare knowledge-rich approaches which use a lot of human supervision to knowledge-poor techniques which use parameter re-estimation or bootstrapping algorithms. We will also compare generative models (models which maximize likelihood of the training data) with discriminative models (models which minimize classification error rate).
For more details: Course Description for CMPT 882-3

Important Dates

Sep 4: First class
Oct 14: No Class: Thanksgiving
Nov 11: No Class: Remembrance Day
Dec 4: Final Project Presentations

Final Projects

Julia Birke, Metaphoinder: A study in metaphor clustering (slides)
Charu Jain, Transformations (slides)
Chris Demwell, Tagging with Hidden Markov Models (slides)
Roozbeh Farahbod, Modified Voted Perceptron: A Re-ranking method for NP Chunking (slides)
Mona Vajihollahi, Re-ranking for NP Chunking using the Maximum Entropy Framework (slides)
Daniel Zimmerman, RNA Secondary Structure Prediction with Error-Driven Transformation Based Learning (slides)

Reading List

Introduction to Statistical NLP and Supervised Decision List Learning
Notes: Lecture #01 pdf
Bootstrapping techniques in learning word meanings: word-sense disambiguation
Unsupervised Word Sense Disambiguation Rivaling Supervised Methods (1995). David Yarowsky. Proceedings of ACL-95. pp. 189-196
Notes: Lecture #02 pdf
Additional Readings:
- Learning Decision Lists (1987). Ronald L. Rivest. Machine Learning. 2(3), pp. 229-246.
- Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French (1994). David Yarowsky. Proceedings of ACL-94. pp. 88-85.
- Unsupervised Models for Named Entity Classification (1999). Michael Collins, Yoram Singer. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
Data:
- "Plant" dataset extracted from New York Times (taken from North American Business News, LDC dataset)
- Penn Treebank Tagging Guidelines: this will explain the tags that appear after the underscore after each word, e.g. plant_NN
- Training and test data for WSD in multiple languages: senseval 2 data.
Homework:
- Write a concordance tool that shows a keyword (say, "plant") in context. Use the above data set and produce an output that looks like this:
```
de_VBP training_VBG new_JJ Ukrainian_JJ     plant    _NN operators_NNS to_TO replace_VB Russi
_NNPS who_WP are_VBP leaving_VBG the_DT     plant    s_NNS in_IN Ukraine_NNP and_CC improving
N and_CC safety_NN procedures_NNS at_IN     plant    s_NNS in_IN both_DT countries_NNS ,_, sa
iet-designed_JJ reactors_NNS at_IN a_DT     plant    _NN in_IN the_DT Czech_NNP Republic_NNP 
er_JJ to_TO pay_VB to_TO make_VB the_DT     plant    _NN safer_JJR ._. 
er_JJ to_TO pay_VB to_TO make_VB the_DT     plant    _NN safer_JJR ._. 
 ,_, ''_'' she_PRP said_VBD ,_, are_VBP     plant    _NN moratoriums_NNS ._. 
_NNS at_IN the_DT Orange_NNP County_NNP     plant    _NN ._. 
  
```
- perl script for concordance (readme)
Comparing Decision Lists to Naive Bayes
Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval (1998). David Lewis. Proceedings of ECML-98 (10th meeting). pp. 4-15.
Notes: Lecture #03 pdf
Additional Readings:
- A Comparison of Event Models for Naive Bayes Text Classification (1998). Andrew Mccallum and Kamal Nigam. In AAAI-98 Workshop on Learning for Text Categorization.
- ifile: An Application of Machine Learning to E-mail Filtering (2000). Jason Rennie. Proceedings of the KDD-2000 Workshop on Text Mining.
Homework:
- Download, install and use rainbow, an implementation of Naive Bayes for document classification by Andrew McCallum and his collaborators. You can use the 20Newsgroups and the WebKB datasets available from the CMU Text Learning page.
- Optionally, you can download, install and play with Christian Borgelt's Naive Bayes implementation. You will need to download table.tar.gz and bayes.tar.gz. Use the data files that come with the package.
- Other links: AutoClass, Mistral
Hidden Markov Models and their Application to Sequence Analysis
Chapters 3 and 4 of Statistical Language Learning. Eugene Charniak. MIT Press. 1993.
Notes: Lecture #04 pdf
Other Sources:
- Chapter 2 from Statistical Methods in Speech Recognition. Frederick Jelinek. MIT Press. 1998.
- A tutorial on Hidden Markov Models and selected applications in speech recognition (1989). L. R. Rabiner. Proceedings of the IEEE, vol 77, number 2.
- Chapter 6 from Fundamentals of Speech Recognition. Lawrence Rabiner and Biing-Hwang Juang. Prentice Hall. 1993.
Additional Readings:
- A stochastic parts program and noun phrase parser for unrestricted text (1988). Kenneth Church. Proceedings of ANLP-88.
- Best-first Model Merging for Hidden Markov Model Induction (1994). A. Stolcke and S. M. Omohundro. TR-94-003, ICSI, Berkeley, CA.
Homework:
- Create a markov chain model on a corpus of your choice using the script create_model from this directory (you can use the "Plant" dataset as your corpus, for example). Then produce a sample output using trigen.pl (also in the same directory). Read the file readme.txt for further instructions.
Using the Forward-Backward Algorithm
Does Baum-Welch Re-estimation help taggers? (1994). David Elworthy. Proceedings of 4th ACL Conf on ANLP, Stuttgart. pp. 53-58.
Notes: Lecture #05 pdf
Additional Readings:
- A Practical Part-of-Speech Tagger (1992). Doug Cutting, Julian Kupiec, Jan Pedersen and Penelope Sibun. In Proceedings of ANLP-92.
- Tagging text with a probabilistic model (1994). Bernard Merialdo. Computational Linguistics 20(2):155-172.
- Text Classification from Labeled and Unlabeled Documents using EM (2000). Kamal Nigam, Andrew Mccallum, Sebastian Thrun and Tom Mitchell. Machine Learning volume 39, number 2-3.
Homework:
- Use the HMM alpha/beta spreadsheet and the HMM Viterbi spreadsheet. Note that you will need either Microsoft Excel or Openoffice to open these spreadsheets. The idea of using a spreadsheet for teaching the forward-backward algorithm comes from Jason Eisner's TNLP paper.
Hidden Markov Models for Name Finding
Nymble: a High-Performance Learning Name-finder (1997). Daniel M. Bikel, Scott Miller, Richard Schwartz, Ralph Weischedel. Proceedings of ANLP-97.
Notes: Lecture #06 pdf
Additional Readings:
- Information Extraction with HMMs and Shrinkage (1999). Dayne Freitag, Andrew McCallum. Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction.
- Information Extraction with HMM Structures Learned by Stochastic Optimization (2000). Dayne Freitag and Andrew McCallum. Proceedings of AAAI-2000.
Other Applications of HMMs:
- Hiding a Semantic Class Hierarchy in a Markov Model (1998). Steven Abney and Marc Light. manuscript.
The EM algorithm for hybrid models
EM for hybrid models
We will look at the use of the EM algorithm (a generalization of the forward-backward algorithm for HMMs) and apply it to the problem of finding the appropriate interpolation weights between a word-based and a part-of-speech based language model.
mixture.pl is a simple Perl script that implements this idea.
Additional Readings:
- A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1997). Jeff A. Bilmes, Technical Report, University of Berkeley, ICSI-TR-97-021.
- The EM algorithm (1997). Michael Collins. manuscript.
Discriminative Methods, Transformation-Based Learning
Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging (1995). Eric Brill. Computational Linguistics, volume 21, number 4, pp. 543-565.
Notes: Lecture #07 pdf
Additional Readings:
- Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging (1995). Eric Brill. Proceedings of the Third Workshop on Very Large Corpora WVLC-95. pp. 1-13
- Text Chunking Using Transformation-Based Learning (1995). Lance Ramshaw and Mitch Marcus. Proceedings of the Third Workshop on Very Large Corpora WVLC-95. pp. 82-94.
Homework:
- Download the fntbl toolkit: a recent implementation of the TBL learning algorithm. Use it for POS tagging and NP chunking tasks (included in the download). Examine the rules that have been learned for these tasks.
- Download and use the mu-TBL system which uses Prolog to implement TBL.
Discriminative Methods, Maximum Entropy Models
A simple introduction to maximum entropy models for natural language processing (1997). Adwait Ratnaparkhi. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
A Maximum Entropy Model for Part-of-Speech Tagging (1996). Adwait Ratnaparkhi. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 133-142.
Notes: Lecture #08 pdf
Additional Readings:
- Maximum Entropy Markov Models for Information Extraction and Segmentation (2000). Andrew Mccallum, Dayne Freitag and Fernando Pereira. In Proc. 17th International Conf. on Machine Learning. pp. 591-598.
- Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data (2001). John Lafferty, Andrew McCallum and Fernando Pereira. In Proc. 18th International Conf. on Machine Learning
- A Gaussian prior for smoothing maximum entropy models (1999). S. Chen and R. Rosenfeld, Technical Report CMUCS -99-108, Carnegie Mellon University.
Homework:
- Download the Java code for MXPOST from Adwait Ratnaparkhi's web page. Using the training data and raw text supplied to you train an initial model on the training data and then augment the training data using the raw text iteratively in a self-training loop until the raw text is exhausted. Measure performance at each iteration on the test data provided.
Hypothesis Testing: Unsupervised Learning of Lexical Knowledge
Automatic Extraction of Subcategorization from Corpora (1997). Ted Briscoe and John Carroll. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97).
Notes: Lecture #09 pdf
Additional Readings:
- Surface Cues and Robust Inference as a Basis for the early Acquisition of Subcategorization Frames (1993). Michael Brent. In Gleitman, L. and Landau, B., editors, The Acquisition of the Lexicon, pages 433-470. MIT Press.
- Automatic Extraction of Subcategorization Frames for Czech (2000). Anoop Sarkar and Daniel Zeman. In Proceedings of COLING-2000.
- Automatic Verb Classification Using Distributions of Grammatical Features (1999). Suzanne Stevenson and Paola Merlo. In Proc. of the 9th Conference of the European Chapter of the ACL, pages 45-52.
Homework:
- By now you have decided on a project. From now on, there will be no more homeworks. Instead you should concentrate on getting the datasets, and doing the relevant reading for your project work.
Learning Morphology
Minimally supervised morphological analysis by multimodal alignment (2000). Yarowsky, D. and R. Wicentowski. In Proceedings of ACL-2000, pages 207-216.
Unsupervised Learning of the Morphology of a Natural Language (2001). John Goldsmith. Computational Linguistics, Volume 27, Number 2.
Notes: Lecture #10 pdf
Additional Readings:
- Knowledge-Free Induction of Morphology Using Latent Semantic Analysis (2000). Patrick Schone and Daniel Jurafsky. In Proceedings of the Fourth Conference on Computational Natural Language Learning and of the Second Learning Language in Logic Workshop.
- Language independent named entity recognition combining morphological and contextual evidence (1999). S. Cucerzan and D. Yarowsky, In Proc. 1999 Joint SIGDAT Conference on EMNLP and VLC.
HMM Redux: Almost Parsing
Supertagging: An Approach to Almost Parsing (1999). Srinivas Bangalore and Aravind K. Joshi. Computational Linguistics, volume 25, number 2, pages 237-265.
Notes: Lecture #11 pdf
Additional Readings:
- Transplanting Supertags from English to Spanish (1998). Srinivas Bangalore. In Proceedings of TAG+4 Workshop on Tree Adjoining Grammars.
- New Models for Improving Supertag Disambiguation (1999). John Chen, Srinivas Bangalore and K. Vijay-Shanker. In Proc. of the 9th EACL.
- Reranking an N-Gram Supertagger (2002). John Chen, Srinivas Bangalore, Michael Collins and Owen Rambow. In Proc. of 6th International Workshop on Tree Adjoining Grammars and Related Frameworks.
Prepositional Phrase Attachment
Coping with syntactic ambiguity or how to put the block in the box on the table (1982). Kenneth Church and Ramesh Patil. Computational Linguistics 8:139-49.
Prepositional Phrase Attachment through a Backed-Off Model (1995). Michael Collins and James Brooks. Proceedings of the Third Workshop on Very Large Corpora WVLC-95.
Notes: Lecture #12 pdf
Additional Readings:
- A rule based approach to prepositional phrase attachment disambiguation (1994). Eric Brill and Philip Resnik. In Proceedings of COLING-94, Kyoto, Japan.
- PP Attachment Ambiguity Resolution through Supervised Learning (1998). J. Stetina and M. Nagao. Journal of Natural Language Processing (Japan) Vol. 5 No. 1. pp. 37-57.
- Attaching Multiple Prepositional Phrases: Generalized Backed-off Estimation (1997). P. Merlo and M. Crocker and C. Berthouzoz. In Proceedings of Second Conference on Empirical Methods in Natural Language Processing, pages 149-155.
Unsupervised Prepositional Phrase Attachment
Structural Ambiguity and Lexical Relations (1993). Donald Hindle and Mats Rooth. Computational Linguistics. Volume 19, Number 1, March 1993, Special Issue on Using Large Corpora: I.
Statistical Models for Unsupervised Prepositional Phrase Attachment (1998). Adwait Ratnaparkhi. In Proceedings of COLING-ACL 1998.
Notes: Lecture #13 pdf
Additional Readings:
- An Unsupervised Approach to Prepositional Phrase Attachment using Contextually Similar Words (2000). P. Pantel and D. Lin. In Proceedings of Association for Computational Linguistics 2000. pp. 101-108. Hong Kong.
Statistical Parsing using a Treebank: Context-Free Grammars and Lexicalized Models
Head-Driven Statistical Models for Natural Language Parsing. Michael Collins. PhD Dissertation, University of Pennsylvania, 1999. Read chapters 2 and 3, pages 31-102
Statistical parsing with an automatically-extracted tree adjoining grammar (2000). David Chiang. In Proceedings of ACL 2000, Hong Kong, October 2000, pages 456-463.
Notes: Lecture #14 pdf and additional slides (from Michael Collins' thesis presentation)
Additional Readings:
- Statistical Parsing with a Context-Free Grammar and Word Statistics. Eugene Charniak. Proc. of 14th National Conf. on AI, AAAI Press. 1997.
- Corpus Variation and Parser Performance. Daniel Gildea. Proc. of 2001 Conf. on Empirical Methods in Natural Language Processing (EMNLP). 2001.
Code:
- Parser for CFGs
- Parser for TIGs
Parsing Algorithms and the Inside-Outside Algorithm for PCFGs
Inside-Outside Reestimation from partially bracketed corpora. Fernando Pereira and Yves Schabes. In 30th Annual Meeting of the Association for Computational Linguistics, pages 128-135, Newark, Delaware, 1992.
Applications of stochastic context-free grammars using the Inside-Outside algorithm. K. Lari and S. J. Young. Computer Speech and Language, 4:35-56, 1990.
Additional Readings:
- Parsing the Wall Street Journal with the Inside-Outside Algorithm. Yves Schabes, Michael Roth and Randy Osborne. In Proc. of Sixth Conference of the European Chapter of the Association for Computational Linguistics, 1993.
Optional:
- Basic Methods of Probabilistic Context-Free Grammars. F. Jelinek, J. D. Lafferty and R. L. Mercer. Technical Report RC 16374, IBM, Yorktown Heights. 1990.
Co-training
Combining Labeled and Unlabeled Data with Co-training. Avrim Blum and Tom Mitchell. In Proc. of the Workshop on Computational Learning Theory (COLT98). 1998.
Analyzing the Effectiveness and Applicability of Co-training. Kamal Nigam and Rayid Ghani. In Ninth International Conference on Information and Knowledge Management (CIKM-2000), pp. 86-93. 2000.
Additional Readings:
- Enhancing Supervised Learning with Unlabeled Data. Sally Goldman and Yan Zhou. Proc. 17th International Conf. on Machine Learning (ICML-2000). pages 327--334, 2000.
- Bootstrapping. Steven Abney. In Proc. of ACL-02. 2002.
- PAC generalization bounds for co-training. Sanjoy Dasgupta, Michael Littman and David McAllester. Neural Information Processing Systems (NIPS), 2001.
Active Learning, Sample Selection
Committee-Based Sample Selection for Probabilistic Classifiers. Shlomo Argamon-Engelson and Ido Dagan. in Journal of Artificial Intelligence Research, 1999.
On Minimizing Training Corpus for Parser Acquisition. Rebecca Hwa. In Proc. of Workshop on Computational Natural Language Learning. 2001.
Additional Readings:
- Toward Optimal Active Learning through Sampling Estimation of Error Reduction . Nicholas Roy and Andrew McCallum. In Proc. 18th International Conf. on Machine Learning. pages 441-448, 2001.
- Employing EM in pool-based active learning for text classification . Andrew K. McCallum and Kamal Nigam. In Proceedings of ICML-98, 15th International Conference on Machine Learning. pages 350--358, 1998.
Discriminative Models: Boosting
A short introduction to boosting . Y. Freund and R. Schapire. Journal of the Japanese Society for Artificial Intelligence. 14(5), pages 771-780, 1999.
Boosting Applied to Tagging and PP Attachment . Steven Abney, Robert E. Schapire, and Yoram Singer. Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 38-45. 1999.
Additional Readings:
- Discriminative Reranking for Natural Language Parsing. Michael Collins. In Proc. 17th International Conf. on Machine Learning, pages 175-182, 2000.
- Bagging and Boosting a Treebank Parser. John Henderson and Eric Brill. In Proc. of 6th Applied Natural Language Processing Conference, 2000.

Misc Presentations

Crash course in Perl. Perl shell for trying out the code fragments interactively (run with perl perlsh.pl and copy/paste code from the tutorial into the session).
Project ideas
Project requirements
LaTeX and Microsoft Word style files for final project report

Links to Useful Software and Corpora

NP Chunking datasets and results by Eric Tjong Kim Sang.
Stanford Stat NLP Page: links to various implementations of algorithms we have discussed.
FSNLP Parsers Page: links to parsers available.

Citations

ACL: Annual Meeting of the Association of Computational Linguistics (ACL Web)
ECML: European Conference on Machine Learning
AAAI: American Association for Artificial Intelligence
KDD: ACM International Conference on Knowledge Discovery and Data Mining
ANLP: Conference on Applied Natural Language Processing

Last modified: $Date: 2003-09-03 00:53:00 $
[Home]

Monday	4:00p - 5:50p in SCB 8662
Wednesday	3:30p - 4:20p in SCB 8662
Office Hrs	by appointment (send me email)