
Introduction

Irregular lexemes are standardly regular in some respect. Most are just like regular lexemes except that they deviate in one or two characteristics. What is needed is a natural way of saying "this lexeme is regular except for this property". One obvious approach is to use nonmonotonicity and inheritance machinery to capture such lexical irregularity (and subregularity), and much recent research into the design of representation languages for natural language lexicons has thus made use of nonmonotonic inheritance networks (or "semantic nets") as originally developed for more general representation purposes in Artificial Intelligence (Daelemans et al. 1992). DATR is a rather spartan nonmonotonic language for defining inheritance networks with path/value equations. In keeping with its deliberately minimalist character, it lacks many of the constructs embodied either in general purpose knowledge representation languages or in contemporary grammar formalisms. But the present document seeks to show that the language is nonetheless sufficiently expressive to represent concisely the structure of lexical information at a variety of levels of language description.
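The idea of saying "this lexeme is regular except for this property" can be made concrete with a small sketch. The following Python fragment is not DATR and does not reproduce DATR's syntax or semantics; it is a simplified illustration of default inheritance over path/value pairs, in which the node names, the path strings, the "{root}" placeholder and the lookup function are all hypothetical choices made for this example.

```python
# Simplified sketch of default (nonmonotonic) inheritance over path/value
# pairs. This is an illustration only, not an implementation of DATR:
# the node names, path strings and "{root}" placeholder are assumptions.

NETWORK = {
    # General node: states what is true of verbs by default.
    "VERB": {
        "parent": None,
        "facts": {
            "syn cat": "verb",
            "mor past": "{root}ed",
            "mor present participle": "{root}ing",
        },
    },
    # Regular lexeme: supplies only its root and inherits the rest.
    "Walk": {
        "parent": "VERB",
        "facts": {"mor root": "walk"},
    },
    # Irregular lexeme: regular except for one overridden property.
    "Go": {
        "parent": "VERB",
        "facts": {"mor root": "go", "mor past": "went"},
    },
}


def lookup(node, path):
    """Return the value of `path` at `node`, inheriting by default.

    The nearest definition wins, so a value stated at a lexeme overrides
    the value it would otherwise inherit from its parent.
    """
    current = node
    while current is not None:
        facts = NETWORK[current]["facts"]
        if path in facts:
            value = facts[path]
            if "{root}" in value:
                # Fill in the root of the node that was originally queried.
                value = value.format(root=lookup(node, "mor root"))
            return value
        current = NETWORK[current]["parent"]
    raise KeyError(f"{path!r} is undefined at {node!r}")


print(lookup("Walk", "mor past"))
print(lookup("Go", "mor past"))
print(lookup("Go", "mor present participle"))
```

In this sketch the entry for Go states only its root and its one exceptional form; everything else, including its present participle, comes from the general VERB node.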

The development of DATR has been guided by a number of concerns which we summarise here. Our objective has been a language which (i) has an explicit declarative semantics (see Section 4), (ii) has an explicit theory of inference (see Section 5), (iii) can be readily and efficiently implemented (see Section 7), (iv) has the necessary expressive power to encode the lexical entries presupposed by work in the unification grammar tradition (see Section 6.8), and (v) can express all the evident generalisations and subgeneralisations about such entries (see Section 6).

With respect to (i) and (ii), the present document presents Keller's 1995 and 1996 treatment of the formal foundations of DATR, a treatment which replaces the rather different, and not entirely adequate, treatment presented by Evans & Gazdar [E&G] in their original 1989a and 1989b papers on the language. With respect to (iii), the core inference engine for DATR can be coded in a page of Prolog (see, e.g., Gibbon 1993, p. 50). At the time of writing, we know of more than a dozen different implementations of the language, some of which have been used with large DATR lexicons in the context of big NLP systems (e.g., Andry et al. 1992; Cahill 1993, 1994; Cahill & Evans 1990). We will comment further on implementation matters in Section 7, below. A major purpose of the present document is to exhibit the use of DATR for lexical description (iv) and the way it makes it relatively easy to capture lexical generalisations and subregularities at a variety of analytic levels (v). We will pursue (iv) and (v) in the context of an informal example-based introduction to the language and to techniques for its use, and we will make frequent reference to the DATR-based lexical work that has been done since 1989.

DATR is a language for lexical knowledge representation. It is a kind of programming language, not a theoretical framework for the lexicon (in the way that HPSG is a theoretical framework for syntax, say). As will become evident below, the language is well suited to lexical frameworks that embrace, or are consistent with, nonmonotonicity and inheritance of properties through networks of nodes. But those two dispositions hardly constitute a restrictive notion of suitability in the context of contemporary NLP work. Nor are they absolute requirements: it is, for example, entirely possible to write useful DATR fragments that never override inherited values (and so are monotonic) or which define isolated nodes with no inheritance.

It is true, of course, that our examples, here and elsewhere, reflect a particular set of assumptions about how NLP lexicons can be best organised. But, apart from the utility of inheritance and nonmonotonicity, we have been careful not to build those assumptions into the DATR language itself. There is, for example, no built-in assumption that lexicons should be lexeme-based rather than, say, word- or morpheme-based.

Unlike some other NLP inheritance languages, DATR is not intended to provide the facilities of a particular syntactic formalism. Rather, it is intended to be a lexical formalism that can be used with any syntactic representation which can be encoded in terms of attributes and values. Thus, at the time of writing, we know of nontrivial DATR lexicons written for GPSG, LTAG, PATR, Unification Categorial Grammar, and Word Grammar. Equally, the use of DATR does not commit one, in advance, to adopting any particular set of theoretical assumptions with respect to phonology, morphology or semantics. In phonology, for example, the language allows one to write transducers that map strings of atomic phonemes to strings of atomic phones. But it also allows one to encode full-blown feature and syllable-tree based prosodic analyses.

Unlike the formalisms typically proposed by linguists, DATR does not attempt to embody in its design any substantive and restrictive universal claims about the lexicons of natural language. That does not distinguish it from most NLP formalisms, of course. However, we have also sought to ensure that its design does not embody features that would restrict its use to a single language (English, say) or to a particular class of closely related languages (the Romance class, say). The available evidence suggests that we have succeeded in the latter aim since, at the time of writing, nontrivial DATR fragments of the lexicons of Arabic, Arapesh, Czech, English, French, German, Gikuyu, Italian, Latin, Polish, Portuguese, Russian and Spanish have been developed. There are also smaller indicative fragments for Baoule, Dakota, Dan, Dutch, Japanese, Nyanja, Sanskrit, Serbo-Croat, Swahili and Tem.

Unlike most other languages proposed for lexical knowledge representation, DATR is not intended to be restricted in the levels of linguistic description to which it can sensibly be applied. It is designed to be equally applicable at phonological, orthographic, morphological, syntactic and semantic levels of description. But it is not intended to replace existing approaches to those levels. Rather, we envisage descriptions of different levels according to different theoretical frameworks being implementable in DATR: thus an NLP group might decide, for example, to build a lexicon with DRT-style semantic representations, HPSG-style syntactic representations, "item & arrangement" style morphological representations and a KIMMO-style orthographic component, implementing all of these, including the HPSG lexical rules, in DATR. DATR itself does not mandate any of the choices in this example, but neither does it allow such choices to be avoided. However, DATR's framework-agnosticism may make it a plausible candidate for the construction of polytheoretic lexicons: for example, a lexicon that would allow either categorial or HPSG-style subcategorisation specifications to be derived, depending on the setting of a parameter. DATR cannot be (sensibly) used without a prior decision as to the theoretical frameworks in which the description is to be conducted; there is no "default" framework for describing morphological facts in DATR, say. Thus, for example, Gibbon (1992) and Langer & Gibbon (1992) use DATR to implement their ILEX theory of lexical organisation, Corbett & Fraser (1993) and Fraser & Corbett (1997) use DATR to implement their Network Morphology framework, and Gazdar (1992) shows how Paradigm Function Morphology analyses (Stump 1992) can be mapped into DATR. Indeed, it would not be entirely misleading to think of DATR as a kind of assembly language for constructing (or reconstructing) higher level theories of lexical representation.

This document is organised as follows. Section 2 uses an analysis of English verbal morphology to provide an informal introduction to DATR. Section 3 describes the language more precisely: its syntax, inferential and default mechanisms, and the use of abbreviatory variables. Section 4 provides a formal denotational semantics for the language, and Section 5 defines a formal theory of inference. Section 6 describes a wide variety of DATR techniques, including case constructs and parameters, boolean logic, finite state transduction, lists and DAGs, lexical rules, multiple inheritance, and ways to encode ambiguity and alternation. Section 7 discusses existing implementations and the modes of use of the language.

---------------------------------------------------------

Copyright © Roger Evans, Gerald Gazdar & Bill Keller, Tuesday 10 November 1998