Natural Language Laboratory at Simon Fraser University



Next: 2. Theoretical Laboratory Work (Post-1988) Up: About the Natural Language Laboratory Previous: Table of Contents

1. About Natural Language

This section introduces some of the major problems addressed in NLP. It can be readily skipped by those familiar with the field.

Writing a letter, reading a newspaper, watching the six o'clock news, having a conversation -- the everyday written and spoken language of such activities is called natural language to distinguish it from artificial, made-up languages like programming languages. For over 30 years, researchers have studied how computers can be programmed to understand and generate written text and spoken utterances. The study area has been called natural language processing (NLP) or computational linguistics, though these terms tend to be associated with text processing rather than speech processing.

These days, NLP research is conducted at many universities and in the research laboratories of large companies, and there is a growing number of commercial NLP products (Obermeier, 1989) such as machine translation systems (see Hovy, 1993) and natural language interfaces (Sijtsma & Zweekhorst, 1993).

The study of natural language is frequently decomposed into a number of smaller, partially overlapping study areas: phonology, morphology, syntax, semantics and pragmatics. The scope of each area is described below, together with problems that each area presents for NLP. The descriptions of areas are adapted from Crystal (1992).

Language is a medium: its auditory form is spoken language, its visual form is written language. This view of language is briefly described below. Also, some Internet sites for natural language research are given.

1.1. Phonology

Phonology is the study of the sound structure of language. Sounds are organized into a system of contrasts, and analyzed in terms of phonemes, distinctive features, or other such phonological units according to the theory used. A phoneme is the minimal unit of the sound system of a language. Some languages have as few as 15; others have as many as 80. No two languages have the same system of phonemes. Distinctive features are used either to define phonemes or as an alternative to the notion of phoneme. Example pairs include +nasal and -nasal, and +voice (voiced) and -voice (voiceless). Nasal sounds are produced when there is complete closure in the mouth and all the air thus escapes through the nose, as in the `n-' sound of `nasal'. Voiced sounds are produced while the vocal cords are vibrating, e.g., the `b-' sound in `bin'; voiceless or unvoiced sounds are produced when there is no such vibration, as in the `p-' sound of `pin'.
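The idea of defining phonemes by distinctive features can be sketched in a few lines of Python. This is only an illustration: the four-phoneme inventory, the dictionary layout and the function name `contrast` are assumptions made for the example, and only the two features named above (voice and nasal) are modelled.

```python
# Toy feature bundles for four English consonants.
# The inventory and the two-feature system are illustrative assumptions,
# not a complete phonological analysis.
FEATURES = {
    "p": {"voice": False, "nasal": False},
    "b": {"voice": True, "nasal": False},
    "m": {"voice": True, "nasal": True},
    "n": {"voice": True, "nasal": True},
}


def contrast(a, b):
    """Return the sorted names of the features on which two phonemes differ."""
    return sorted(f for f in FEATURES[a] if FEATURES[a][f] != FEATURES[b][f])


print(contrast("p", "b"))  # the `pin'/`bin' contrast: differs only in voice
print(contrast("b", "m"))  # differs only in nasality
print(contrast("m", "n"))  # empty: these two features alone cannot separate them
```

The empty result for `m' and `n' shows why a real feature system needs more than two features: both sounds are voiced nasals and differ in place of articulation, which this sketch does not represent.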

Problems include the ratio of noise to data, the varying speech rates within and across individuals, and coarticulation. Coarticulation takes place when the articulations of two or more sounds overlap in the vocal tract, e.g., the `sh-' in `shoe' is normally pronounced with lip-rounding in anticipation of the `-oo' sound.

1.2. Morphology

Morphology is the study of the structure of words, especially through the use of morphemes. Morphemes are commonly divided into free forms (morphemes which can occur as separate words) and bound forms (morphemes which cannot occur in this way). For example, `unselfish' consists of three morphemes: `self', which is a free form, and `un-' and `-ish', which are bound forms.

A major morphological problem is ambiguity: the suffix `-s', for example, can indicate the plural of a noun or the third-person singular present tense of a verb. Another problem is exceptions: for example, the plural of the noun `foot' is `feet' (not `foots').
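Both problems can be seen in a minimal morphological analyser, sketched below in Python. The tiny lexicon, the exception table and the function name `analyze` are hypothetical and chosen only to mirror the examples above; a real analyser would need far more rules and a much larger lexicon.

```python
# Hypothetical toy lexicon: `walk' is listed as both a noun and a verb.
NOUNS = {"cat", "foot", "walk"}
VERBS = {"run", "walk"}
# Exceptions are listed explicitly and override the regular `-s' rule.
IRREGULAR_PLURALS = {"feet": "foot"}


def analyze(word):
    """Return every (stem, category, feature) analysis the toy rules allow."""
    analyses = []
    if word in IRREGULAR_PLURALS:
        analyses.append((IRREGULAR_PLURALS[word], "noun", "plural"))
    if word.endswith("s"):
        stem = word[:-1]
        # Regular plural, blocked for nouns that have an irregular plural.
        if stem in NOUNS and stem not in IRREGULAR_PLURALS.values():
            analyses.append((stem, "noun", "plural"))
        # The same suffix also marks the present tense of a verb.
        if stem in VERBS:
            analyses.append((stem, "verb", "present"))
    return analyses


print(analyze("walks"))  # two analyses: the suffix is ambiguous
print(analyze("feet"))   # one analysis, found through the exception table
print(analyze("foots"))  # no analysis: the regular rule is blocked
```

Returning a list of analyses rather than a single answer is a deliberate choice here: morphology alone cannot decide between the noun and verb readings of `walks', so the decision is left to later processing.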

1.3. Syntax

Syntax is the study of how words are combined to form sentences in a language. Syntactic structures (or constructions) are analyzed into sequences of syntactic categories (or classes). The sequences are established on the basis of syntactic relationships that linguistic items have with each other in a construction, e.g., ``tall people'' is generally analyzed into a noun phrase consisting of an adjective `tall' and a noun `people'. Linguists have designed grammars for many languages. A grammar is a system of syntax and inflections for a language. Inflection is the change words undergo when used, for example, in the plural (`mouse' and `mice') or in the past tense (`fly' and `flew'). Parsing refers to the assignment of syntactic categories and structures in single sentences. Parsers often but not always use grammars. The following are some major problems for syntactic processing.

Structural ambiguity occurs when a sentence construction can be assigned several possible structures or combinations of elements, e.g., in ``Jane saw the man in the park with the telescope'' the prepositional phrase ``with the telescope'' could be attached to either ``Jane saw'' (Jane used the telescope) or ``the man in the park'' (the man or the park has the telescope).
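Structural ambiguity can be made concrete by counting how many parse trees a grammar assigns to a sentence. The Python sketch below uses a CKY-style chart over a small grammar in binary form. The grammar rules, the lexicon and the function name `count_parses` are assumptions made for this illustration and are not taken from any particular parser.

```python
from collections import defaultdict

# Assumed toy grammar; every rule has the form parent -> left right.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # prepositional phrase attached to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # prepositional phrase attached to a noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "Jane": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "park": ["N"], "telescope": ["N"],
    "in": ["P"], "with": ["P"],
}


def count_parses(sentence, root="S"):
    """Count the distinct parse trees for a sentence using a CKY chart."""
    words = sentence.split()
    n = len(words)
    chart = defaultdict(int)  # (start, end, category) -> number of trees
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[i, i + 1, category] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[start, end, parent] += (
                        chart[start, mid, left] * chart[mid, end, right]
                    )
    return chart[0, n, root]


print(count_parses("Jane saw the man with the telescope"))              # 2
print(count_parses("Jane saw the man in the park with the telescope"))  # 5
```

With one prepositional phrase the toy grammar finds the two attachments, to the verb phrase or to the noun phrase. With two prepositional phrases it finds five trees, because ``in the park'' can itself attach in more than one place and ``with the telescope'' can also modify ``the park''. The number of analyses grows quickly as phrases are added, which is part of what makes structural ambiguity a hard problem.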

Unbounded or long distance dependency is a relationship between two syntactic components of a sentence in which the related constituents are not required to be within some bounded distance of each other. The dependency, which may extend over one or more clause boundaries, usually involves an empty noun phrase constituent called a ``trace'' which is coindexed with another noun phrase appearing earlier, as in ``Show me the report[i] that Nick wanted Dan to write t[i]'' where, although `report' is the object of the verb `write', there is no explicit object following the verb.

1.4. Semantics

Semantics is the study of meaning in language. It contains a number of branches including philosophical semantics and linguistic semantics, which have both been studied in NLP. Philosophical semantics studies relations between linguistic expressions (like sentences) and the entities in the world to which they refer, and the conditions under which such expressions can be said to be true or false. Analysis is performed with logical systems. Linguistic semantics studies the semantic properties of natural languages using a variety of linguistic constructs. Among the phenomena studied within semantics are the following.

Lexical ambiguity refers to a semantic property of words that they can have multiple senses or meanings, e.g., the word `crook' has different senses: it can mean a thief, a bend, or a shepherd's stick. Resolution of lexical ambiguity is required for understanding sentences that contain ambiguous words like `crook', e.g., in ``The crook stole a diamond ring,'' the thief sense is meant.
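One simple way to approach the resolution of lexical ambiguity is to compare the words surrounding the ambiguous word with a set of cue words for each sense and to pick the sense with the largest overlap. The Python sketch below does this for `crook'. It is a much simplified overlap heuristic, and the cue-word sets and the function name `disambiguate` are invented for the example.

```python
# Hypothetical cue words for each sense of `crook'.
SENSES = {
    "thief": {"stole", "steal", "robbed", "police", "arrested"},
    "bend": {"arm", "elbow", "road", "river"},
    "shepherd's stick": {"shepherd", "sheep", "flock", "staff"},
}


def disambiguate(sentence):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word occurs in the sentence at all.
    """
    context = {w.strip(".,!?").lower() for w in sentence.split()}
    scores = {sense: len(context & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("The crook stole a diamond ring."))  # thief
print(disambiguate("The shepherd carried a crook."))    # shepherd's stick
print(disambiguate("I saw a crook."))                   # None
```

The last call shows the limit of the approach: when the sentence contains no cue word, the context gives this heuristic nothing to decide with.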

Similarity or paraphrase refers to a property of sentences that different ones can have the same (or very similar) meanings, e.g.,

``Give me the Western region financial performance for July,''
``Give me the July financial performance for the Western region,''
``Give me the financial performance for July for the Western region'' and
``Give me the July Western region financial performance'' (cf. McFetridge, 1991).

The problem is recognizing when two sentences are paraphrases.
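A crude first approximation to paraphrase recognition is to compare the sets of content words in two sentences while ignoring word order and function words. The Python sketch below applies this to the four requests above. The function-word list and the function names `content_words` and `same_content` are assumptions made for the illustration.

```python
# Assumed list of function words to ignore; a real system would use a larger one.
FUNCTION_WORDS = {"the", "for", "a", "an", "of"}


def content_words(sentence):
    """Return the set of lower-cased content words in a sentence."""
    words = (w.strip(".,!?").lower() for w in sentence.split())
    return frozenset(w for w in words if w and w not in FUNCTION_WORDS)


def same_content(a, b):
    """True when two sentences contain exactly the same content words."""
    return content_words(a) == content_words(b)


REQUESTS = [
    "Give me the Western region financial performance for July",
    "Give me the July financial performance for the Western region",
    "Give me the financial performance for July for the Western region",
    "Give me the July Western region financial performance",
]

print(all(same_content(REQUESTS[0], r) for r in REQUESTS))  # True
print(same_content(REQUESTS[0],
                   "Give me the Eastern region financial performance for July"))  # False
print(same_content("The dog bit the man", "The man bit the dog"))  # True, wrongly
```

The last call shows why the problem is hard: the two sentences share all their content words yet mean different things, so word sets alone cannot decide whether sentences are paraphrases.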

Reference is a relationship of identity between linguistic units, e.g., between a pronoun and a noun or noun phrase. Pronouns are of various kinds, including definite pronouns like `it' and `them', personal pronouns such as `I' and `you', reflexive pronouns like `myself' and `yourself', and relative pronouns such as `who', `whom' and `that'. The problem is resolving reference, i.e., connecting a pronoun with the noun or noun phrase to which it refers.

Reference can occur across sentence boundaries, and can be backwards or forwards. Anaphora (or back-reference) is reference to an earlier part of a discourse. Cataphora (or forward reference) is reference to a later part of the discourse. The difference can be seen in two different two-sentence discourses where the first sentence each time is ``John is at home.'' There is an anaphoric reference to John when the second sentence is ``If he is not drunk, Peter will be surprised'' versus a cataphoric reference to Peter when the second sentence is ``If he is not drunk, Peter will take me there'' (Strzalkowski & Cercone, 1986, p. 159). Traditional syntactic solutions have been able to treat only simple classes of anaphora and only occasional inter-sentential references.

1.5. Pragmatics

Pragmatics is the study of the communicative use of language, particularly the structure of conversations and dialogue: how participants take turns in conversations, how speakers use knowledge of communication (e.g., about the context in which language is used), and the effects their use of language has on other participants. Pragmatic problems include the following.

Presupposition is the information that a person assumes when using language, as distinct from what is at the centre of that person's communicative interest, e.g., ``There is unrest in Yugoslavia'' assumes the existence of (a country called) Yugoslavia.

Conversational repair refers to ``the attempt made by participants in a conversation to make good a real or imagined deficiency in the interaction (for example, a mishearing or misunderstanding)'' (Crystal, 1992, p. 298). A major problem here is working out which participant is wrong or mistaken and hence should have their conversation (and understanding) repaired.

Indirect meaning refers to a communicative purpose that a piece of language serves but that its surface form does not directly reflect. The true communicative purpose is understood by examining the context in which the piece of language was used. For example, ``It's hot in here'' looks like an assertion but, in the right context -- spoken to someone standing by a window -- might be a request to open the window. Likewise, ``Can you pass the salt?'' looks like a question, but can also be a request to pass the salt if said at a table to someone closer to the salt than you are!

1.6. Language as a Medium

The NLP community has responded to the growing interest in multimedia systems by investigating how to integrate natural language (in typed, handwritten and spoken forms) with other kinds of multimedia input, such as graphics and input mechanisms like menus and data gloves. Similarly, there have been studies of generating coordinated multimedia output in which natural language is mixed with diagrams and so forth.

1.7. Natural Language Sites on the Internet

The University of Stuttgart has an extensive listing of computational linguistics resources and institutions. IRIT in Italy maintains a good list of (computational) linguistics and NLP resources on the Internet.

There is a collection of HPSG resources at Ohio State University.

For useful general collections of NLP software, visit the Natural Language Software Registry (Saarbrucken, Germany) and the public sections of the Consortium for Lexical Research (Las Cruces, NM).

For language corpora, access the sites run by the Consortium for Lexical Research or the Oxford Text Archive (Oxford, England).

For NLP-related papers and bibliographies, check the Computation and Language E-Print Archive (Harvard, MA), the natural language processing section of the Artificial Intelligence Repository at Carnegie Mellon University, and the collection of bibliographies on artificial intelligence at the University of Manitoba, Canada.

For information about associations concerned with the study of natural language, see the World-Wide Web site operated by the Association for Computational Linguistics (ACL) based in New York, and the NLP section of SIGART, which is the ACM Special Interest Group on Artificial Intelligence.

Various language and subject dictionaries and other reference works can be found at Carnegie Mellon University. Also available online are listings of academic departments and institutes, and of companies and corporate research labs.



  • Last modified 24 July 1998 (Dan Fass <fass@cs.sfu.ca>)