Natural Language Laboratory at Simon Fraser University



Next: 4. Laboratory Publications and Software Up: About the Natural Language Laboratory Previous: 2. Theoretical Laboratory Work (Post-1988)

3. Applied Laboratory Work (Post-1988)

The Natural Language Laboratory has projects in two of the most common NLP applications: natural language interfaces and machine translation.

Natural language interfaces allow people to communicate with machines in a natural language such as English. A particular application of natural language interfaces is to computer databases so that non-technical people can directly access the information in the databases.

Machine translation systems automatically translate from one natural language (say French) into another (say Spanish or English).

The Natural Language Laboratory also works on a third application: grammar development systems. These are systems used by (computational) linguists to design and test grammars and parsers.

3.1. Natural Language Interfaces to Databases

One of the most intensively studied areas of NLP is question-answering by natural language interfaces serving as "front ends" to databases. Such front ends help users in two main ways (McTear, 1987): they relieve users of the need to know the structure of the database, and they make the system more convenient and flexible to use. A particular application of natural language interfaces is executive information systems, which allow management executives to access directly the information in their companies' databases. SystemX turns ordinary English questions into database queries expressed in SQL (short for Structured Query Language), the standard computer language for manipulating relational databases. Work on SystemX began in 1986. In 1990, SystemX was heavily revised: the old set of grammar rules and semantic interpretation rules was replaced by Head-Driven Phrase Structure Grammar (HPSG) and a new semantics. Both versions are described below.

3.1.1. SystemX (Pre-1990)

The pre-1990 SystemX (McFetridge et al., 1988a, 1988b; Cercone et al., 1989, 1990) is the front end to a database containing Simon Fraser University academic advisor information. It has quite extensive coverage of English, and can handle passives, imperatives, possessives, relative clauses, prepositional phrases and quantification.

SystemX has a modular design, comprising a natural language understanding module and a database query module. The natural language understanding module has three components: a lexical analyzer (which analyzes words), a parser (which handles syntax), and a semantic interpreter (which performs semantic interpretation and produces canonical semantic representations).

The lexical analyzer contains two subsystems, TEMPLATE and MORPHOS, and accesses a syntactic lexicon. TEMPLATE uses the form of certain words to recognize that they belong to certain categories such as proper names, part numbers and report identity numbers. MORPHOS is a morphological analyzer which employs a set of rules to strip the endings off words and identify their roots. Words are looked up in the syntactic lexicon and the syntactic information retrieved is used by the parser.
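The suffix-stripping idea behind MORPHOS can be illustrated with a short sketch. The rules and lexicon below are invented for illustration; they are not the actual MORPHOS rules.

```python
# A minimal sketch of rule-based suffix stripping in the spirit of MORPHOS.
# Rules are tried in order; a candidate root counts only if it is in the
# lexicon. Both the rules and the lexicon here are hypothetical.

SUFFIX_RULES = [
    ("ies", "y"),   # "studies" -> "study"
    ("ing", ""),    # "asking"  -> "ask"
    ("ed", ""),     # "asked"   -> "ask"
    ("s", ""),      # "courses" -> "course"
]

LEXICON = {"study", "ask", "course", "advisor"}

def find_root(word):
    """Return a known root for `word`, trying suffix rules in order."""
    if word in LEXICON:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return None
```

A real analyzer would also handle consonant doubling, irregular forms, and multiple analyses per word; the point here is only the rule-plus-lexicon lookup cycle.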

The parser contains a set of grammar rules. The rules are applied to the syntactic information and a parse tree for the sentence is built. The parse tree is passed to the semantic interpreter.

The semantic interpreter contains a set of semantic rules and a semantic lexicon. Each entry in the lexicon is a frame describing the database entity referred to by a particular word or expression. Information from the frames is attached to the parse tree, and the semantic rules are then applied to build canonical semantic representations. When there is some ambiguity within an expression which the interpreter cannot resolve on its own, the names of the database entities contained within the canonical representation of the expression are passed to a component called Pathfinder (Hall, 1986). Pathfinder uses the semantic information inherent in the database design to help disambiguate the expression. A database is a (restricted) conceptual model which represents the entities and relationships of the domain, and English expressions correspond to substructures of that model. If an expression is ambiguous in some way, there will be multiple candidate substructures in the model. Pathfinder selects the correct one by analyzing the constraints inherent in the various relationship types used to structure the model: for each candidate substructure, it measures the degree of semantic relatedness among the entities to which the expression refers, then selects the candidate exhibiting the highest degree of relatedness.
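Pathfinder's selection step can be sketched as follows. The link-list representation and the relatedness weights are hypothetical stand-ins for the measures Pathfinder actually derives from the relationship types of the conceptual model.

```python
# Illustrative sketch of Pathfinder-style candidate selection. A candidate
# substructure is represented as a list of (entity, entity, weight) links;
# the weights stand in for semantic relatedness and are invented.

def relatedness(candidate):
    """Mean relatedness over the links of a candidate substructure."""
    return sum(w for _, _, w in candidate) / len(candidate)

def select_candidate(candidates):
    """Return the candidate whose entities are most closely related."""
    return max(candidates, key=relatedness)
```

For example, given a direct student-advisor link and an indirect path through courses, the directly related candidate wins.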

Often, however, Pathfinder cannot select a single best candidate structure, and a user must be the final arbiter. In such situations, the candidates must be presented to the user in a manner he or she can understand. In SystemX, this is done by generating an English sentence corresponding to each candidate structure. The sentences currently generated are somewhat rough and halting; research is planned to improve their quality.

The database query module contains a component that translates the canonical semantic representations into a logical form. A second component translates the logical form into SQL. Different versions of this component would translate the logical form into other database languages.
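As an illustration of that final translation step, the sketch below renders a simple logical form as SQL. The dictionary-based shape of the logical form is an assumption for illustration, not SystemX's actual representation.

```python
# Toy translation of a logical form into SQL. The logical form here is a
# dict naming the requested attributes, the relation, and equality
# constraints; a real module would handle joins, quantifiers, and
# aggregation as well.

def to_sql(lf):
    """Render a simple logical form as an SQL SELECT statement."""
    cols = ", ".join(lf["select"])
    sql = f"SELECT {cols} FROM {lf['from']}"
    if lf.get("where"):
        conds = " AND ".join(f"{attr} = '{val}'" for attr, val in lf["where"])
        sql += f" WHERE {conds}"
    return sql
```

Because only this component speaks SQL, swapping it out would retarget the interface at a different query language, as the text notes.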

3.1.2. SystemX (Post-1990)

The post-1990 SystemX (Popowich et al., 1992; Cercone et al., 1993, 1994; Fass et al., 1995, 1996), like its predecessor, is a natural language interface to a relational database that turns ordinary English questions into queries in SQL. The new SystemX has been used as a front end to four different databases.

The main application has been as an advanced prototype executive information system for Rogers Cablesystems, a major national Canadian cable company, through a grant from the Canadian Cable Labs Fund. The initial target is a statistical database of the company's customer service operations, describing sets of entities such as service outages, telephone calls, customer service representatives, work orders, and payment methods.

Statistical databases are common in executive information systems, since executives are typically interested in summary information. Most conceptual modelling languages do not provide the facility to represent statistical concepts. Natural language queries to statistical databases often lack direct correspondence to database objects. Such queries often refer to the domain entities which are being summarized and these are only indirectly represented in the database via the statistics. This lack of correspondence means that the mediation of a conceptual model is even more necessary than in the general case. The SystemX group has focused on the proper representation of statistical concepts in such a model, and the use of such concepts in disambiguating natural language queries.

This extension into the domain of statistics is a major difference between the pre- and post-1990 versions of SystemX. The other major difference is that the natural language understanding module has been replaced in the newer SystemX by one based on a more modern grammar formalism and a new semantics. The grammar formalism is Head-Driven Phrase Structure Grammar. Two parsers have been developed, one written in Lisp by McFetridge (McFetridge & Cercone, 1990), and the other written in Prolog by Popowich and Vogel (1990, 1991a). The two parsers are used to test competing ideas, some of which are easier to implement and test in one language than in the other. The semantics, developed by McFetridge (1991), is modelled on the structure of the Rogers database. The parser produces logical forms which are then passed to an adapted version of the database query module from the pre-1990 SystemX.

SystemX has also been interfaced to DBLEARN, a tool for knowledge discovery in databases. DBLEARN is the subject of an IRIS 2 project, HMI-5, of which Nick Cercone is a principal investigator.

A major phase of customizing a natural language interface to a database is generating a conceptual model. Gary Hall has been looking into means to automate this phase as much as possible. A related problem is how to propagate changes in the domain of the database to the natural language interface. Successful automatic generation techniques should ease this problem, because such techniques should be able to deduce the changes in the domain that prompt changes to the logical model and to the patterns of data in the database itself.

Hall and Gupta (1991; Gupta & Hall, 1991) have been studying how to integrate the representation of the static and dynamic aspects of a domain into a single conceptual model. Hall is looking to extend Pathfinder (used in both versions of SystemX) to measure semantic relatedness in models which include relationship types that Pathfinder cannot presently manage.

3.2. Grammar Development Tools

The grammars used in NLP systems are becoming very sophisticated. Developing such grammars requires designing, testing and modifying grammar rules, principles and lexicons. Tools for developing such grammars are needed by system developers (and sophisticated users) of applications like natural language interfaces to databases, and by students and researchers for understanding existing grammars, for extending them, and for developing new grammars. Tools will make it easier to study how linguistic problems are handled by different grammars, to import ideas from one grammar into another, and to describe fresh linguistic problems within an existing grammar.

3.2.1. HPSG-PL Grammar Development System

HPSG-PL is a Prolog implementation of the HPSG formalism, developed by Fred Popowich, Sandi Kodric and Carl Vogel (Popowich & Vogel, 1991c; Kodric, Popowich & Vogel, 1992). It was developed to be a working tool for designing and testing grammars written within the HPSG framework.

The system consists of a lexical compiler, a constraint processor, a chart parser, and a module for linking the parser to a graphical interface. Using this system, a user can examine the properties of the HPSG formalism itself and investigate characteristics of specific grammars that use the formalism. The system can also be used in conjunction with the TreeTool graphical interface, developed by Baker et al. (1990). TreeTool is a C program that uses the Suntools windowing environment to display graphic trees corresponding to Prolog-style term representations of trees. TreeTool has also been adapted to XVIEW. The code for HPSG-PL is available free for noncommercial purposes. The gzipped UNIX tar file also includes TreeTool, a sample grammar covering a fragment of English, and supporting documentation. A more up-to-date version, containing some modified files, is also available as a gzipped UNIX tar file. The documentation is also available separately (in PostScript form).
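TreeTool's input format can be illustrated with a sketch that parses a simplified Prolog-style term, such as s(np(john),vp(v(runs))), into a nested (label, children) tree. The fragment of term syntax handled here is deliberately minimal.

```python
# Parse a simplified Prolog-style term into nested (label, children)
# tuples, the kind of structure a tree-display tool like TreeTool renders.
# Handles only alphanumeric atoms, parentheses, and commas.

def parse_term(text):
    """Parse a whole term; raise AssertionError on trailing input."""
    term, rest = _term(text.replace(" ", ""))
    assert rest == "", f"unparsed input: {rest!r}"
    return term

def _term(s):
    # Read the functor/atom name.
    i = 0
    while i < len(s) and (s[i].isalnum() or s[i] == "_"):
        i += 1
    label, s = s[:i], s[i:]
    children = []
    # Read the argument list, if any.
    if s.startswith("("):
        s = s[1:]
        while True:
            child, s = _term(s)
            children.append(child)
            if s.startswith(","):
                s = s[1:]
            else:
                break
        assert s.startswith(")"), "missing closing parenthesis"
        s = s[1:]
    return (label, children), s
```

An atom such as john parses to ("john", []); a compound term nests recursively.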

3.2.2. Pleuk Grammar Development Shell

Pleuk (Calder, 1993) is a grammar development shell written by Jo Calder, Kevin Humphreys and Mike Reape. Jo Calder worked in the Natural Language Laboratory during 1993-1994. Many different grammatical formalisms can be embedded within Pleuk. Those currently supported are:

Pleuk features a unique "derivation checker", a graphical system which allows the user to "grow" derivations by actions such as selecting lexical or other material and inserting that material into larger structures (as defined by the formalism in use). Facilities are also provided to allow Pleuk to process collections of test sentences.

Pleuk requires SICStus Prolog version 2.1#9 or later, plus a variety of ancillary programs available free of charge from many ftp sites. The code for Pleuk is available free for research purposes.

3.2.3. Emacs User Interface to ALE

The Attribute Logic Engine (ALE) is an integrated phrase structure parsing and definite clause logic programming system written in Prolog by Bob Carpenter and Gerald Penn. It is suitable for writing grammars in formalisms that use typed feature structures such as HPSG. ALE is available free for research purposes.

An Emacs interface to ALE has been written that is described in a technical report (Laurens, 1995) available online (20 pages, PostScript). The code for the interface is available free for research purposes.

3.3. Machine Translation

A unification-based machine translation engine for real-time machine translation is being developed in partnership with TCC Communications Corporation of Victoria, BC. The engine uses the shake-and-bake method (see Popowich, 1994b, 1995a, 1995b, 1996). For a description of the system, see Popowich et al. (1997). For other recent work on the system, see Turcato et al. (1997, 1998), Turcato (1998), and Toole (to appear).

The engine uses shallow semantic analysis together with a rich description of the interrelationships between words in each language. This approach permits the use of existing lexicons without the development and implementation of a rich semantic representation, thereby shortening development time.
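The shake-and-bake idea can be caricatured in a few lines: transfer each source word to a bag of target words, then search for an ordering of the bag that the target grammar accepts. The dictionary and the toy word-order constraint below are invented for illustration and bear no relation to the actual TCC engine.

```python
# Toy shake-and-bake translation: lexical transfer produces an unordered
# bag of target words ("shake"), and generation searches for an ordering
# the target grammar licenses ("bake"). Dictionary and grammar are
# hypothetical; a real engine constrains the search far more tightly.

from itertools import permutations

TRANSFER = {"the": "el", "dog": "perro", "barks": "ladra"}

def accepts(sentence):
    """Toy target-language constraint: determiner, noun, verb order."""
    cats = {"el": "det", "perro": "n", "ladra": "v"}
    return [cats[w] for w in sentence] == ["det", "n", "v"]

def shake_and_bake(source_words):
    bag = [TRANSFER[w] for w in source_words]   # lexical transfer
    for order in permutations(bag):             # shake the bag
        if accepts(list(order)):                # bake: test grammaticality
            return " ".join(order)
    return None
```

The key property this preserves from the real method is that translation happens at the lexical level, with target word order recovered by the target grammar rather than carried over from the source.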

The lab is currently developing robust English and Spanish grammars based on the Head-Driven Phrase Structure Grammar (HPSG) approach and employing a highly modular design.

The resulting grammars and engine are to be embedded in the next generation of TCC's TeleTranslator products.

3.4. Computer-Assisted Language Learning

Computer-assisted language instruction programs have been developed for German (Heift & McFetridge, 1993; Heift, 1998) and are being developed for Spanish (McFetridge & Heift, 1994).

In 1998, the Natural Language Laboratory, together with the Language Learning Centre and the Language Training Institute of Simon Fraser University, entered into an agreement with the Ministry of Education of Greece to produce an introductory Self-Instructional Language Program for Greek. A critical component of this program will be computer-assisted language instruction software. The Natural Language Laboratory is creating an intelligent language tutor for teaching Greek, capable of analyzing students' responses to exercises and individualizing instruction according to each student's progress and abilities. Much of this work is based on the PhD dissertation of Trude Heift (1998), "Designed Intelligence: A Language Teacher Model".

3.5. Information Extraction

The Natural Language Laboratory has developed a prototype information extractor (PIE) system. PIE is a document navigation tool for use with web search engines and other information retrieval systems. PIE analyzes the returned documents and summarizes the way that key words from the user's search are used in those documents. The user can then judge the content of the document before manually accessing it.
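A keyword-in-context summary of the kind PIE produces can be sketched as follows. The sentence-splitting heuristic and the dictionary-shaped result are assumptions for illustration, not PIE's actual method.

```python
# Sketch of keyword-in-context summarization: for each search keyword,
# collect the sentences of a returned document in which it occurs, so a
# user can judge the document's content before opening it.

import re

def keyword_summary(document, keywords):
    """Map each keyword to the sentences of `document` containing it."""
    # Naive sentence split on sentence-final punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", document)
    summary = {}
    for kw in keywords:
        pattern = rf"\b{re.escape(kw)}\b"
        summary[kw] = [s for s in sentences
                       if re.search(pattern, s, re.IGNORECASE)]
    return summary
```

A tool like PIE would apply this over each document a search engine returns and present the matching sentences alongside the hit list.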



  • Last modified 9 October 1998 (Dan Fass <fass@cs.sfu.ca>)