Lexicons can be developed by hand or, in principle at least, induced from relevant data. Once created, lexicons are used for language understanding, language generation, or both; lexicons in use must also be maintained. At present, implementations of lexical representation systems are typically specialised to one or two of these tasks. A language for lexical knowledge representation is, of course, merely one component of a lexical representation system, but its design may well have implications for the tasks noted above. A language that coded everything into bit strings, say, might be fully adequate for the induction and generation tasks, but it would probably not facilitate manual lexicon maintenance.
From a more formal point of view, Barg (1994) provides
the useful tabular conceptualisation of the inferential tasks that may
be associated with a lexical representation language like DATR shown
in Table 2, below.

Table 2. Inferential tasks:

                        description    query       value
  conventional query    known          known       inferred
  reverse query         known          inferred    known
  theory induction      inferred       known       known

The conventional query task presupposes a known description (a set of DATR sentences defining a lexicon) and a known query (Love:<mor past participle>): the task is to infer the appropriate value for this query, namely love ed. This task is crucial to lexicon development and maintenance, since it provides the means by which developers can check the empirical adequacy of their analyses. It is also a task that is likely to figure in the on-line use of the lexicon in a language processing system: once the relevant lexical entry (i.e., the relevant DATR node) has been determined, conventional querying recovers the information associated with that entry. And it is the task that effects the compilation in systems that use a partially or fully compiled-out off-line lexicon (as in Andry et al. 1992).
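Real DATR inheritance is considerably richer than any short illustration can show, but the following Python toy (all names, the data, and the longest-prefix matching scheme are invented for illustration, not DATR's actual semantics) sketches what the conventional query task involves: given a description and a query such as Love:<mor past participle>, infer the value love ed.

```python
# Toy description: each node maps path prefixes either to a value sequence
# or to the name of a node to inherit from. Longest defined prefix wins,
# standing in (crudely) for DATR's default inheritance.
LEXICON = {
    "VERB": {
        ("mor", "past", "participle"): ["<root>", "ed"],
        ("mor", "pres", "tense", "sing", "three"): ["<root>", "s"],
    },
    "Love": {
        ("root",): ["love"],
        (): "VERB",  # empty path: inherit everything else from VERB
    },
}

def query(node, path, global_node=None):
    """Infer the value of node:<path>; raises ValueError if undefined."""
    global_node = global_node or node          # remember the querying node
    defs = LEXICON[node]
    best = max((p for p in defs if path[:len(p)] == p), key=len)
    val = defs[best]
    if isinstance(val, str):                   # redirect to another node
        return query(val, path, global_node)
    out = []
    for atom in val:                           # expand "<root>" sub-queries
        if atom == "<root>":                   # evaluated at the global node
            out.extend(query(global_node, ("root",)))
        else:
            out.append(atom)
    return out

print(query("Love", ("mor", "past", "participle")))  # ['love', 'ed']
```

Note that the `<root>` sub-query is evaluated at the original (global) node, so `Love` supplies the root even though the suffix rule lives at `VERB`; this mirrors, very loosely, why inheritance makes such descriptions compact.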
The reverse query task again presupposes that we have a description
available to us. But instead of starting with a known query, we start
instead with a known value (love ed, say), and the task is to
infer what queries would lead to that value (Love:<mor past
participle>, Love:<mor past tense sing one>, etc.). An alternative formulation is
to start with a known value and path, and the task is to infer the
appropriate nodes. The ability to perform this kind of inference may
also be useful in lexicon development and maintenance. However, its
most obvious application is to ``bootstrap'' lexical access in
language processing systems that make direct use of an on-line
lexicon: given a surface form (in analysis) or a semantic form (in
generation), we need to identify a lexical entry associated with that
form by reverse query, and then access other lexical information
associated with the entry by conventional inference. Langer
(1994) gives an inference algorithm, based on the
familiar chart data structure, for reverse querying DATR lexicons;
and Gibbon (1993) describes EDQL (Extended DATR Query Language) which permits quantification into components of
multisentence DATR queries.
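Langer's chart-based algorithm works over the intensional description itself; the deliberately naive Python sketch below (hypothetical data, not Langer's method) instead assumes a compiled-out table of theorems and simply inverts it, which is enough to illustrate both formulations of the reverse query task.

```python
# Assume a compiled-out lexicon: a flat table mapping each query
# (node, path) to its value. The data here are invented for illustration.
THEOREMS = {
    ("Love", ("mor", "past", "participle")): ("love", "ed"),
    ("Love", ("mor", "past", "tense", "sing", "one")): ("love", "ed"),
    ("Love", ("mor", "pres", "tense", "sing", "three")): ("love", "s"),
}

def reverse_query(value):
    """Known value -> all queries (node, path) that yield it."""
    return sorted(q for q, v in THEOREMS.items() if v == value)

def reverse_query_by_path(value, path):
    """Alternative formulation: known value and path -> the nodes."""
    return sorted(node for (node, p), v in THEOREMS.items()
                  if v == value and p == path)
```

As the first function shows, a single surface form (love ed) can correspond to several queries, which is exactly why reverse query is the natural "bootstrap" for lexical access: it narrows the search to candidate entries, after which conventional inference takes over.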
The final task is that of theory induction. Here one starts with a
set of known query-value pairs (Love:<mor past participle> = love ed.,
Love:<mor pres tense sing three> = love s.,
etc.) and the task is to induce a description that has those pairs as
theorems under the application of conventional inference. In a world
in which all the relevant data were already clearly set out in
descriptive linguistic work, an algorithm that efficiently achieved
this kind of induction would be a philosopher's stone for the
construction of computational lexicons. In the real world, such an
algorithm would still be useful for domains like morphology (where the
quality and clarity of extant descriptive linguistic work is very
high), for bootstrapping lexical descriptions for subsequent manual
development, for updating lexicons in the light of newly
encountered lexical information, and for converting one kind of
lexicon into a completely different kind of lexicon by inducing the
latter from the output of the former. The automatic induction of
(symbolic) lexicons from data is a very new research area in
computational linguistics: Kilbury (1993), Kilbury et
al. (1994), Light (1994) and Light et al.
(1993) have proposed a variety of incremental algorithms
that take a partial lexical hierarchy and elaborate it as necessary in
the light of successively presented data sets, whilst Barg
(1994) has presented a nonincremental algorithm that
induces full DATR hierarchies from suitable data sets.
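None of the cited induction algorithms is reproduced here, but the following hypothetical Python sketch conveys the core idea of factoring shared information into a parent node. Note the strong simplifying assumption: the data are already segmented into (root, suffix) pairs, a segmentation that a real induction algorithm would itself have to discover.

```python
# Invented query-value data, pre-segmented into (root, suffix) pairs.
DATA = {
    ("Love", ("mor", "past", "participle")): ("love", "ed"),
    ("Love", ("mor", "pres", "tense", "sing", "three")): ("love", "s"),
    ("Hate", ("mor", "past", "participle")): ("hate", "ed"),
    ("Hate", ("mor", "pres", "tense", "sing", "three")): ("hate", "s"),
}

def induce(data):
    """Factor shared suffix rules into a parent node; keep roots local."""
    roots, suffix_rules = {}, {}
    for (node, path), (root, suffix) in data.items():
        roots.setdefault(node, root)
        suffix_rules.setdefault(path, suffix)   # generalise across lexemes
    hierarchy = {"VERB": {p: ("<root>", s) for p, s in suffix_rules.items()}}
    for node, root in roots.items():
        # each lexeme states only its root and inherits the rest
        hierarchy[node] = {("root",): (root,), (): "VERB"}
    return hierarchy
```

The induced hierarchy has the input pairs as theorems under a conventional-inference regime like the one sketched earlier, which is the adequacy criterion for theory induction stated above.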
Since DATR is no more than a language, it does not itself dictate how a DATR lexicon is to be used. As it turns out, different researchers have used it very differently. Andry et al. (1992), in the context of a speech recognition task involving the parsing of ``extremely large lattices of lexical hypotheses'' (p. 248), opted for off-line compilation of their 2,000-word DATR lexicons into pairs of on-line lexicons, one of which was encoded with bit vectors for speed and compactness. At the other extreme, Duda & Gebhardi (1994) present an interface between a PATR-based parser and a DATR lexicon in which the former is dynamically linked to the latter and able to query it freely, in both conventional and reverse modes, without restriction. And Gibbon (1993) presents an implementation of a very flexible query language, EDQL, which allows quantification over any constituents of (possibly complex) DATR queries.
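Andry et al. do not publish their encoding, so the following is a purely hypothetical sketch of the general bit-vector idea behind such compact on-line lexicons: fix an ordering on the atomic feature values and represent each entry's value set as a single integer mask.

```python
# Hypothetical feature inventory; the ordering fixes each value's bit.
FEATURES = ["sing", "plur", "one", "two", "three", "past", "pres"]
BIT = {f: 1 << i for i, f in enumerate(FEATURES)}

def encode(values):
    """Pack a set of feature values into one integer bit mask."""
    mask = 0
    for v in values:
        mask |= BIT[v]
    return mask

def has(mask, value):
    """Constant-time membership test on the packed representation."""
    return bool(mask & BIT[value])

entry = encode(["sing", "three", "pres"])
```

A whole lexicon stored this way trades the flexibility of symbolic inference for the speed and compactness that parsing ``extremely large lattices of lexical hypotheses'' demands, which is precisely the trade-off behind the off-line/on-line split described above.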