Grammar as knowledge representation

A description of a recognizer or parser for a language is in some sense a description of that language, but it need not by any means be a perspicuous one. Moreover, it may well be an implementation-specific definition, but implementations - even implementations of a programming language that is thought to be well understood - can differ significantly and unexpectedly. It is for this reason that computer scientists have turned in recent years away from procedural definitions of the semantics for programming languages and towards the declarative descriptions of denotational semantics. Rather similar considerations hold when we consider writing programs that process natural language input: both syntactically and semantically, we need to have a secure definition of the natural language (or an approximation to a natural language) that we are processing if we are to have any idea how the system should behave under a wide range of conditions.

A language can be regarded as a set whose membership is precisely specifiable by rules.

The set of compound linguistic expressions in a natural language is not finite, so we cannot list them all (cf. 'a slow machine', 'a very slow machine', 'a very very slow machine', ...). As far as is known, no natural language is a finite language. The range of constructions that make a language infinite is typically rather large. In English, a word like 'and' permits an unbounded number of phrases to be combined together and

relative clauses can contain verb phrases which can contain noun phrases which can contain relative clauses which ... .

What we need are formal (taht is, mathematical) systems that define the membership of the infinite sets of linguistic expressions and assign a structure to each member of these sets. Such formal systems are grammars. From this perspective, the various kinds of transition networks that we have looked at in the preceding chapters are grammars. However, the word 'grammar' is often used in a rather more restricted sense in current natural language work to refer to formal systems that meet not only the criterion just noted, but are in addition subject to two further criteria:
Firstly, grammars, in the sense discussed in this chapter, are expressed in a declarative grammar formalism which only contains information about which objects combine together and what the properties of the resultant object are, a formalism which contains no extraneous procedural information about how to put the objects together, such as the control information implicitly encoded in transition networks.

Secondly, grammars, as we shall present them, transparently provide each legal string with an implicit structural description, without the necessity of explicit structure-building annotations, as required in an ATN, for example.

Like FSTNs and RTNs, but unlike many ATNs, the grammars we shall present directly characterize the order of elements in the string, as opposed to attempting to reconstruct some hypothesized

underlying order. And, like RTNs, our grammars will use simple recursion to manage the structural complexity. Grammars as we shall present them are declarative and, for the most part, based on a decomposition of syntactic categories (roughly 'parts of speech') into components know as features. One of the merits of such formalisms will only become apparent when we look at semantics in Chapter 8: grammars of this kind support a compositional approach to meaning, one in which every well-formed expression has a meaning of its own, a meaning that has been composed from the meaning of the subexpressions that make it up. In such a context, the syntactic structure imposed by the grammar on an expression is a key element in the determination of its meaning.

From the perspective of NLP, the study of grammar is a branch of knowledge representation: a grammar is simply a way of representing certain aspects of what we know about a language that is explicit and formal enough to be understood by a machine. Consider by analogy, a much simpler area of knowledge representation, the representation of mass: this involves mathematical questions (which is the appropriate number system to use - integers, reals, rationals, complex numbers, ...?), notational questions (do we use arabic numerals, roman numerals, points in a graph, ...?) and detailed descriptive questions (is mass(ingot93) = 128Kg true, accurate, accurate enough, ...?). In this chapter, we will examine these three kinds of question as they apply to the grammars of natural languages.

One important mathematical question that has interested linguists, off and on, for the last 30 years is the question of just how powerful a formal system is needed to describe any natural language.

We have already seen one example of this issue in Chapter 2 where we noted the inability of FSTNs to recognize exactly the strings in the language a\un\db\un\d - that is, the language consisting of strings containing some number of as followed by the same number of bs. And the RTNs considered in Chapter 3 can recognize exactly the strings of a\un\db\un\d but not those of a\un\db\un\dc\un\d. There are a number of good reasons why a computational linguist should care about this question. If, for example, we are using an RTN for our language description and we encounter an a\un\db\un\dc\un\d construction in the language in question, then we will know that time spent trying to get our RTN to recognize the construction will be time wasted. Or suppose that we had access to hardware that would handle FSTNs ultrafast and observed that in actual occurrences of a\un\db\un\d constructions, the value of n never exceeded 3; then we might decide to compile our RTN descriptions down into FSTNs subject to an n = 3 upper bound (such a compilation is possible for any given finite upper bound on the value of n). Knowledge of the underlying mathematics can help to avoid wasting time and, sometimes, even suggest shortcuts. We will consider some of the general mathematical questions to do with power in a bit more detail in the final sections of this chapter.

The academic linguist's primary criterion in evaluating a type of grammar has always been its ability to capture significant generalizations within the grammar of a language and across the grammars of different languages. However, expressing significant generalizations is largely a matter of notation, and classes of grammars, taken as sets of mathematical objects, have properties that are theirs independently of the notations that might be used to define them. Thus, a class of grammars determines the set of languages that can be described by its members, the set of structural analyzes that can be assigned to sentences in these languages, and so on.

Like the academic linguist, the computational linguist needs to know what natural languages are really like, mathematically speaking. But, in addition, it is useful for the computational linguist to know what languages are roughly like, or mostly like, and whether particular languages (English, say, or Japanese) exhibit certain mathematical properties. These latter questions are of little interest to the academic linguist. Suppose, for example, that Swahili has certain features that show it to be a type Y language, mathematically. But suppose further that these features are statistically rather uncommon in every day Swahili usage, so that 99.5% of such usage can be parsed under the assumption that Swahili is of a mathematically more restrictive type and has a computationally more straightforward type X grammar. Suppose, finally, that we have a working parser based on a type X grammar for Swahili which handles 85% of the input it is offered and which, for numerous practical reasons (ill-formed input, dialectal variations, non-standard spellings, idiosyncratic stylistic forms, ...), cannot be further improved (85% is actually quite a good figure in a practical context). It is clear that under such conditions, a switch to a parser based on a type Y grammar would simply not be worth the effort since, ceteris paribus, it could effect at best an improvement from 85% to 85.5%.

Turning now to notational issues, the design and choice of grammar formalisms for NLP is a topic that has attracted considerable attention in the 1980s and much progress has been made. At least the following general criteria are relevant:
(i) linguistic naturalness,
(ii) mathematical power, and
(iii) computational effectiveness.
Firstly, those who use such formalisms need a notation that allows and encourages them to encode their linguistic descriptions in a manner that is easy to understand and modify, accords with the way they think about language, and expresses the relevant generalizations about the domain. For example, if all tensed verbs in a language appear in VP initial position, then the formalism should allow us to say exactly that, and not force us into making 40 statements, one for each subclass of verbs, or into some cumbersome circumlocution making reference to all words showing some disjunction of particular endings, a circumlocution that just happens to pick out all the tensed verbs. Secondly, if we decide that natural languages are type Y, mathematically speaking, and we wish our grammars to take this fact into account, then the grammar description language will need to be able to express type Y grammars.

As will be seen in the final sections of this chapter, some notational restrictions on grammar formalisms can have the effect of seriously limiting the class of grammars that can be expressed. Conversely, apparently minor notational changes can radically increase the potential mathematical power of the systems characterized. We may not want this to happen, either because we wish to inhibit users of the formalism from devising unnatural analyzes, or for reasons of computational effectiveness, to which we now turn.

A grammar formalism used by an academic linguist is usually only read by other academic linguists. But a computational grammar formalism, like a programming language, has to be read and understood by both humans and machines. Furthermore, given the typical goals of NLP work, we want machines to be able to understand and employ the formalism in realistic amounts of time.

Indeed, the issues that arise in the design of grammar formalisms are exactly those that arise in the design of any specialist declarative computer language for knowledge representation in some particular domain.

Some existing grammar formalisms were proposed to express linguistic theories. With such formalisms, the self-imposed limitations are often intended as empirical claims about the nature of natural languages. Others were designed for use as tools by the academic or computational linguist, and these are normally free of intended expressive limitations.

The formalisms we shall use in this chapter (and subsequently) to represent what are essentially context-free phrase structure grammars are Definite Clause Grammar (DCG) and PATR, and both these formalisms belong to the latter class.

PATR has, in recent years, become a potential lingua franca for NLP work, and many other grammatical formalisms can be compiled into it.

All the kinds of grammar that have been used in computational linguistics have employed, in one form or another, the following:
(i) A representation for syntactic categories or 'parts of speech'.
(ii) A data type for words (and hence a lexicon, dictionary or word list).
(iii) A data type for syntactic rules.
(iv) a data type for syntactic structures.
The grammar as whole can be viewed as a particular kind of data type composed of th first three items. A parser, then, is an algorithm that takes such a data type, together with a string, and attempts to return one or more instances of the syntactic structure data type.

Thus, a complete grammar formalism provides, at least, a language for specifying syntactic categories, a language for representing lexical entries, a language in which to write rules (possibly rules of more than one type) and a language for exhibiting syntactic structures. These languages may be distinct, or two or more of them may collapse.

Send us a comment.



[Contents] [Previous] [Next]
This document was translated by troff2html v0.21 on October 22, 1996.