Natural Language Laboratory at Simon Fraser University
Writing a letter, reading a newspaper, watching the six o'clock news, having a conversation -- the everyday written and spoken language of such activities is called natural language to distinguish it from artificial, made-up languages like programming languages. For over 30 years, researchers have studied how computers can be programmed to understand and generate written text and spoken utterances. The study area has been called natural language processing (NLP) or computational linguistics, though these terms tend to be associated with text processing rather than speech processing.
These days, NLP research is conducted at many universities and in the research laboratories of large companies, and there is a growing number of commercial NLP products (Obermeier, 1989) such as machine translation systems (see Hovy, 1993) and natural language interfaces (Sijtsma & Zweekhorst, 1993).
The study of natural language is frequently decomposed into a number of smaller, partially overlapping study areas: phonology, morphology, syntax, semantics and pragmatics. The scope of each area is described below, together with problems that each area presents for NLP. The descriptions of areas are adapted from Crystal (1992).
Language is a medium: its auditory form is spoken language, its visual form is written language. This view of language is briefly described below. Also, some Internet sites for natural language research are given.
Problems in processing speech include the ratio of noise to data, the varying speech rates within and across individuals, and coarticulation. Coarticulation takes place when the articulations of two or more sounds overlap in the vocal tract, e.g., the `sh-' in `shoe' is normally pronounced with lip-rounding in anticipation of the `-oo' sound.
A major morphological problem is ambiguity: the suffix `s', for example, can indicate the plural of a noun or the present tense of a verb. Another problem is exceptions, for example, the plural of the noun `foot' is `feet' (not `foots').
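Both problems can be seen in a minimal Python sketch. The lexicon, the exception table and the function name `analyze` are hypothetical and chosen only for illustration; a real morphological analyser would need far larger resources and many more rules.

```python
# Hypothetical toy lexicon: stem -> parts of speech it can have.
LEXICON = {
    "walk": {"noun", "verb"},
    "dog": {"noun"},
    "foot": {"noun"},
    "sing": {"verb"},
}

# Exceptions are listed explicitly instead of being derived by rule.
IRREGULAR_PLURALS = {"feet": "foot"}


def analyze(word):
    """Return every (stem, reading) pair the toy lexicon allows for a word."""
    readings = []
    if word in IRREGULAR_PLURALS:
        readings.append((IRREGULAR_PLURALS[word], "noun, plural"))
    if word.endswith("s"):
        stem = word[:-1]
        categories = LEXICON.get(stem, set())
        # The regular plural rule is blocked for stems with an irregular plural.
        if "noun" in categories and stem not in IRREGULAR_PLURALS.values():
            readings.append((stem, "noun, plural"))
        if "verb" in categories:
            readings.append((stem, "verb, present tense"))
    return readings


print(analyze("walks"))  # two readings: the suffix `s' is ambiguous
print(analyze("feet"))   # one reading, found in the exception table
print(analyze("foots"))  # no reading: the regular rule is blocked
```

The ambiguity of `walks' cannot be resolved by morphology alone; the surrounding sentence is needed to choose between the noun and verb readings.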
Structural ambiguity occurs when a sentence construction can be assigned several possible structures or combinations of elements, e.g. in ``Jane saw the man in the park with the telescope'' the prepositional phrase ``with the telescope'' could be attached to either ``Jane saw'' or ``the man in the park.''
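One way to make structural ambiguity concrete is to count how many parse trees a grammar assigns to a sentence. The Python sketch below uses a CKY-style chart over a small hand-written grammar; the grammar, the lexicon and the function name `count_parses` are assumptions made for illustration and are not taken from any particular system.

```python
from collections import defaultdict

# Hypothetical toy grammar in binary form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # prepositional phrase attached to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # prepositional phrase attached to the noun phrase
    ("PP", "P", "NP"),
]

# Hypothetical toy lexicon: word -> categories.
LEXICON = {
    "Jane": ["NP"], "saw": ["V"], "the": ["Det"], "man": ["N"],
    "park": ["N"], "telescope": ["N"], "in": ["P"], "with": ["P"],
}


def count_parses(words, root="S"):
    """Count the distinct parse trees for words with a CKY-style chart."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, category) -> number of trees
    for i, word in enumerate(words):
        for category in LEXICON[word]:
            chart[(i, i + 1, category)] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, root)]


print(count_parses("Jane saw the man with the telescope".split()))              # 2
print(count_parses("Jane saw the man in the park with the telescope".split()))  # 5
```

Under this toy grammar the shortened sentence has the two structures described above. The full sentence has five, because ``in the park'' can itself be attached in more than one way; the number of structures grows quickly as prepositional phrases are added.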
Unbounded or long distance dependency is a relationship between two syntactic components of a sentence in which the related constituents are not required to be within some bounded distance of each other. The dependency, which may extend over one or more clause boundaries, usually involves an empty noun phrase constituent called a ``trace'' which is coindexed with another noun phrase appearing earlier, as in ``Show me the report[i] that Nick wanted Dan to write t[i]'' where t[i] marks the trace and, although `report' is the object of the verb `write', there is no explicit object following the verb.
Lexical ambiguity refers to a semantic property of words that they can have multiple senses or meanings, e.g., the word `crook' has different senses: it can mean a thief, a bend, or a shepherd's stick. Resolution of lexical ambiguity is required for understanding sentences that contain ambiguous words like `crook', e.g., in ``The crook stole a diamond ring,'' the thief sense is meant.
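A very simple way to resolve such ambiguity is to count how many words in the sentence are typical of each sense. The Python sketch below does this with a word-overlap heuristic in the spirit of the Lesk algorithm; the cue words for each sense and the function name `disambiguate` are hypothetical and were chosen by hand for this example.

```python
# Hypothetical cue words for each sense of `crook', chosen by hand.
SENSES = {
    "thief": {"stole", "steal", "robbed", "police", "arrested", "crime"},
    "bend": {"arm", "elbow", "road", "river", "curve"},
    "shepherd's stick": {"shepherd", "sheep", "staff", "flock", "carried"},
}


def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word occurs in the sentence at all.
    """
    words = {w.strip(".,!?'\"").lower() for w in sentence.split()}
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("The crook stole a diamond ring."))      # thief
print(disambiguate("The shepherd carried a crook."))        # shepherd's stick
print(disambiguate("She held it in the crook of her arm."))  # bend
```

The heuristic depends entirely on the hand-picked cue words, so it is best read as an illustration of the problem rather than a solution to it.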
Similarity or paraphrase refers to a property of sentences that different ones can have the same (or very similar) meanings, e.g.,
``Give me the Western region financial performance for July,''
``Give me the July financial performance for the Western region,''
``Give me the financial performance for July for the Western region'' and
``Give me the July Western region financial performance'' (cf. McFetridge, 1991).
The problem is recognizing when two sentences are paraphrases.
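For the four requests above, a crude test is enough: ignore word order and function words, then compare what is left. The Python sketch below does this; the stopword list and the function names `content_words` and `same_content` are assumptions made for illustration.

```python
import string
from collections import Counter

# Hypothetical, minimal list of function words to ignore.
STOPWORDS = {"the", "for"}


def content_words(sentence):
    """Return a bag (multiset) of the lower-cased content words of a sentence."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = sentence.lower().translate(table)
    return Counter(w for w in cleaned.split() if w not in STOPWORDS)


def same_content(a, b):
    """Treat two sentences as paraphrases if their bags of content words match."""
    return content_words(a) == content_words(b)


requests = [
    "Give me the Western region financial performance for July,",
    "Give me the July financial performance for the Western region,",
    "Give me the financial performance for July for the Western region",
    "Give me the July Western region financial performance",
]

print(all(same_content(requests[0], r) for r in requests[1:]))  # True
print(same_content(requests[0],
                   "Give me the July financial performance for the Eastern region"))  # False
```

Because word order is discarded, the same test wrongly treats ``Dan wrote the report for Nick'' and ``Nick wrote the report for Dan'' as paraphrases, which shows why recognizing paraphrase in general needs syntactic and semantic analysis.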
Reference is a relationship of identity between linguistic units, e.g., between a pronoun and a noun or noun phrase. Pronouns are of various kinds, including definite pronouns like `it' and `them', personal pronouns such as `I' and `you', reflexive pronouns like `myself' and `yourself', and relative pronouns such as `who', `whom' and `that'. The problem is resolving reference, i.e., connecting a pronoun with the noun or noun phrase to which it refers.
Reference can occur across sentence boundaries, and can be backwards or forwards. Anaphora (or back-reference) is reference to an earlier part of a discourse. Cataphora (or forward reference) is reference to a later part of the discourse. The difference can be seen in two different two-sentence discourses where the first sentence each time is ``John is at home.'' There is an anaphoric reference to John when the second sentence is ``If he is not drunk, Peter will be surprised'' versus a cataphoric reference to Peter when the second sentence is ``If he is not drunk, Peter will take me there'' (Strzalkowski & Cercone, 1986, p. 159). Traditional syntactic solutions have been able to treat only simple classes of anaphora and only occasional inter-sentential references.
Presupposition is the information assumed by a person when using language and which is at the centre of a person's communicative interest, e.g., ``There is unrest in Yugoslavia'' assumes the existence of (a country called) Yugoslavia.
Conversational repair refers to ``the attempt made by participants in a conversation to make good a real or imagined deficiency in the interaction (for example, a mishearing or misunderstanding)'' (Crystal, 1992, p. 298). A major problem here is working out which participant is wrong or mistaken and hence should have their conversation (and understanding) repaired.
Indirect meaning refers to the communicative purpose of a piece of language which does not directly reflect its surface form. The true communicative purpose is understood from examining the context in which the piece of language was used, for example, ``It's hot in here'' looks like an assertion, but in the right context -- spoken to someone standing by a window -- might be a request to open the window. Likewise, ``Can you pass the salt?'' looks like a question, but can also be a request to pass the salt if said when sitting at a table and spoken to someone closer to the salt than you are!
There is a collection of HPSG resources at Ohio State University.
For useful general collections of NLP software, visit the Natural Language Software Registry (Saarbrucken, Germany) and the public sections of the Consortium for Lexical Research (Las Cruces, NM).
For language corpora, access the sites run by the Consortium for Lexical Research or the Oxford Text Archive (Oxford, England).
For NLP-related papers and bibliographies, check the Computation and Language E-Print Archive (Harvard, MA), the natural language processing section of the Artificial Intelligence Repository at Carnegie Mellon University, and the collection of bibliographies on artificial intelligence at the University of Manitoba, Canada. For information about associations concerned with the study of natural language, see the World-Wide Web site operated by the Association for Computational Linguistics (ACL) based in New York, and the NLP section of SIGART, which is the ACM Special Interest Group on Artificial Intelligence. Various language and subject dictionaries and other reference works can be found at Carnegie Mellon University. Also available online are listings of academic departments and institutes, and of companies and corporate research labs.