
Implementations and modes of use

  As already noted, the inferential core of DATR is extremely simple to implement. We know of more than a dozen different implementations of the language, though there may well be others that we do not know of. The best known and most widely available are our own (Brighton/Sussex), which is written in Prolog and runs on most Unix platforms; Gibbon's (Bielefeld) DDATR (Scheme) and NODE (Sicstus Prolog); Schillo's (Bielefeld) ZDATR (C); and Kilbury's (Duesseldorf) QDATR, a Prolog implementation which runs (in compiled form) on PCs and on Sicstus Prolog under Unix. All of these are freely available on request, as is an extensive archive of over one hundred example fragments, some of which illustrate formal techniques and others of which apply DATR to the lexical phonology, morphology, syntax or semantics of a wide variety of languages. Other interesting implementations that we are familiar with include the experimental reverse query implementation by Langer (Osnabrueck), Duda & Gebhardi's (Berlin) implementation that is dynamically linked to PATR, and Barg's (Duesseldorf) implementation of a system that induces DATR descriptions from extensional data sets.

Lexicons can either be developed by hand or, in principle at least, they can be induced from relevant data. Once created, lexicons are used for language understanding, language generation, or both. Lexicons that are in use also have to be maintained. At present, implementations of lexical representation systems are typically specialised to one or two of these tasks. A language for lexical knowledge representation is merely one component of a lexical representation system, of course, but its design may well have implications for the tasks noted above. A language that coded everything into bit strings might be fully adequate for the induction and generation tasks, say, but it would probably not facilitate manual lexicon maintenance.

From a more formal point of view, Barg (1994) provides the useful tabular conceptualisation of the inferential tasks that may be associated with a lexical representation language like DATR shown in Table 2, below.

Table: Possible inference tasks (adapted from Barg 1994)

  Task                     Theory    Query     Value
  Conventional inference   given     given     unknown
  Reverse query            given     unknown   given
  Theory induction         unknown   given     given

Consider the English verbal morphology facts that provided our running example in Section 2, above. The conventional inference task presupposes that we have a description (such as that given in that section) and a query (such as Love: <mor past participle>): the task is to infer the appropriate value for this query, namely love ed. This task is crucial to lexicon development and maintenance since it provides the means by which the developer can check the empirical adequacy of their analysis. It is also a task that is likely to figure in the on-line use of the lexicon in a language processing system, once the relevant lexical entry (i.e., the relevant DATR node) has been determined, to recover information associated with the entry. And it is the task that does the compilation in systems that use a partially or fully compiled-out off-line lexicon (as in Andry et al. 1992).
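The conventional inference task can be sketched in a few lines of Python. The encoding below is illustrative, not DATR syntax: each node maps a path (a tuple of atoms) either to a value or to a redirect to another node, with DATR's default mechanism approximated by longest-defined-prefix matching, and the quoted-path idiom approximated by a `"<root>"` marker that re-evaluates at the global (query) node.

```python
# A toy DATR-like theory: VERB supplies the morphology, Love inherits it.
# The dict encoding and the "<root>" marker are assumptions of this sketch.
THEORY = {
    "VERB": {
        ("mor", "past", "participle"): ["<root>", "ed"],
        ("mor", "pres", "tense", "sing", "three"): ["<root>", "s"],
    },
    "Love": {
        ("root",): ["love"],
        (): ("node", "VERB"),  # empty-path default: inherit everything from VERB
    },
}

def lookup(node, path):
    """Find the longest defined prefix of `path` at `node` (an approximation
    of DATR's default mechanism); return its definition and the leftover suffix."""
    sentences = THEORY[node]
    for i in range(len(path), -1, -1):
        prefix = tuple(path[:i])
        if prefix in sentences:
            return sentences[prefix], list(path[i:])
    raise KeyError((node, path))

def evaluate(node, path, global_node=None):
    """Conventional inference: compute the value of the query node:<path>."""
    global_node = global_node or node
    defn, leftover = lookup(node, path)
    if isinstance(defn, tuple) and defn[0] == "node":
        # Inherit: re-query the target node with the unmatched suffix,
        # keeping the original query node as the global context.
        return evaluate(defn[1], leftover, global_node)
    out = []
    for atom in defn:
        if atom == "<root>":
            # Quoted-path analogue: evaluate <root> at the global node.
            out += evaluate(global_node, ["root"], global_node)
        else:
            out.append(atom)
    return out
```

Under this encoding, `evaluate("Love", ["mor", "past", "participle"])` yields `["love", "ed"]`, mirroring the Love: <mor past participle> = love ed theorem of the running example.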

The reverse query task again presupposes that we have a description available to us. But instead of starting with a known query, we start instead with a known value (love ed, say), and the task is to infer what queries would lead to that value (Love: <mor past participle>, Love: <mor past tense sing one>, etc.). An alternative formulation is to start with a known value and path, and the task is to infer the appropriate nodes. The ability to perform this kind of inference may also be useful in lexicon development and maintenance. However, its most obvious application is to ``bootstrap'' lexical access in language processing systems that make direct use of an on-line lexicon: given a surface form (in analysis) or a semantic form (in generation), we need to identify a lexical entry associated with that form by reverse query, and then access other lexical information associated with the entry by conventional inference. Langer (1994) gives an inference algorithm, based on the familiar chart data structure, for reverse querying DATR lexicons; and Gibbon (1993) describes EDQL (Extended DATR Query Language) which permits quantification into components of multisentence DATR queries.
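In its simplest generate-and-test form, reverse query is just the inversion of a theorem table. The sketch below works over a tiny hand-compiled table of query-value pairs (the theorems themselves are illustrative, not the output of a real DATR theory); Langer's chart-based algorithm is precisely a way of avoiding this brute-force enumeration on realistic lexicons.

```python
# A hand-compiled table of DATR theorems: (node, path) query -> value.
# The entries are illustrative stand-ins for a compiled-out lexicon.
THEOREMS = {
    ("Love", ("mor", "past", "participle")): ["love", "ed"],
    ("Love", ("mor", "past", "tense", "sing", "one")): ["love", "ed"],
    ("Love", ("mor", "pres", "tense", "sing", "three")): ["love", "s"],
}

def reverse_query(value):
    """Reverse query by exhaustive inversion: return every (node, path)
    query whose conventional-inference value is `value`."""
    return sorted(q for q, v in THEOREMS.items() if v == value)
```

For the surface form love ed, `reverse_query(["love", "ed"])` recovers both the past-participle and the past-tense-singular queries, which is exactly the one-to-many character of the task described above.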

The final task is that of theory induction. Here one starts with a set of known query-value pairs (Love: <mor past participle> = love ed, Love: <mor pres tense sing three> = love s, etc.) and the task is to induce a description that has those pairs as theorems under the application of conventional inference. In a world in which all the relevant data was already clearly set out in descriptive linguistic work, an algorithm that efficiently achieved this kind of induction would be the philosopher's stone for the construction of computational lexicons. In the real world, such an algorithm would still be useful for domains like morphology (where the quality and clarity of extant descriptive linguistic work is very high), for bootstrapping lexical descriptions for subsequent manual development by humans, for updating lexicons in the light of newly encountered lexical information, and for converting one kind of lexicon into a completely different kind of lexicon by inducing the latter from the output of the former. The automatic induction of (symbolic) lexicons from data is a very new research area in computational linguistics: Kilbury (1993), Kilbury et al. (1994), Light (1994) and Light et al. (1993) have proposed a variety of incremental algorithms that take a partial lexical hierarchy and elaborate it as necessary in the light of successively presented data sets, whilst Barg (1994) has presented a nonincremental algorithm that induces full DATR hierarchies from suitable data sets.
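The flavour of the induction task can be conveyed by a deliberately naive sketch: given extensional query-value pairs for several lemmas, guess each lemma's root and hoist the path-to-suffix rules that all lemmas agree on into a shared parent node. This is an illustration of the factoring idea only, not Barg's (1994) algorithm; the data, node names and dict encoding are all assumptions of the sketch.

```python
# Extensional data: (lemma node, query path) -> surface value.
# The entries below are illustrative.
DATA = {
    ("Love", ("mor", "past", "participle")): "love ed",
    ("Love", ("mor", "pres", "sing", "three")): "love s",
    ("Walk", ("mor", "past", "participle")): "walk ed",
    ("Walk", ("mor", "pres", "sing", "three")): "walk s",
}

def induce(data):
    """Naive induction: take each node's root to be the first atom of its
    values, then hoist every path -> suffix rule on which all nodes agree
    into a single hypothesised VERB parent node."""
    roots = {n: v.split()[0] for (n, _), v in data.items()}
    rules = {}  # path -> set of suffixes observed across nodes
    for (node, path), value in data.items():
        suffix = value[len(roots[node]):].strip()
        rules.setdefault(path, set()).add(suffix)
    # Keep only the rules that are uniform across the data set.
    parent = {p: s.pop() for p, s in rules.items() if len(s) == 1}
    theory = {"VERB": parent}
    for node, root in roots.items():
        theory[node] = {("root",): root, "parent": "VERB"}
    return theory
```

On this data set the sketch recovers a two-level hierarchy: a VERB node carrying the ed/s suffix rules, and lemma nodes that record only their root and their parent, which is the shape of description the running example assumed.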

Since DATR is no more than a language, it does not itself dictate how a DATR lexicon is to be used. As it turns out, different researchers have used it very differently. Andry et al. (1992), in the context of a speech recognition task involving the parsing of ``extremely large lattices of lexical hypotheses'' (p. 248), opted for off-line compilation of their 2000-word DATR lexicons into pairs of on-line lexicons, one of which was encoded with bit-vectors for speed and compactness. At the other extreme, Duda & Gebhardi (1994) present an interface between a PATR-based parser and a DATR lexicon where the former is dynamically linked to the latter and able to query it freely, in both conventional and reverse modes, without restriction. And Gibbon (1993) presents an implementation of a very flexible query language, EDQL, which allows quantification over any constituents of (possibly complex) DATR queries.
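The off-line compilation strategy can be sketched as follows: theorems are compiled out of the lexicon in advance, and closed feature sets are packed into integer bit-vectors so that on-line feature checks reduce to bitwise operations. The feature inventory, lexicon entries and subsumption check below are assumptions of this sketch, not details of the Andry et al. (1992) encoding.

```python
# A closed inventory of morphosyntactic features, one bit position each.
# The inventory and the compiled entries are illustrative.
FEATURES = ["past", "pres", "sing", "plur", "one", "two", "three"]
BIT = {f: 1 << i for i, f in enumerate(FEATURES)}

def encode(features):
    """Pack a feature bundle into a single integer bit-vector."""
    vec = 0
    for f in features:
        vec |= BIT[f]
    return vec

# Compiled-out lexicon: surface form -> bit-vector of its feature bundle.
COMPILED = {
    "love ed": encode(["past"]),
    "love s": encode(["pres", "sing", "three"]),
}

def subsumes(query_vec, entry_vec):
    """Cheap on-line check: does the entry carry every queried feature?
    With bit-vectors this is a single AND plus a comparison."""
    return query_vec & entry_vec == query_vec
```

The design trade-off is the one noted earlier for bit-string encodings: on-line matching becomes a machine-word operation, at the cost of a representation that no human would want to maintain by hand.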


Copyright © Roger Evans, Gerald Gazdar & Bill Keller, Tuesday 10 November 1998