Words, rules and structures

A lexicon is minimally a list of words that associates each word with its syntactic properties. The most important of these properties is its (gross) syntactic category - for example, whether the word is a verb or a noun. In addition, depending on the sophistication of the overall grammar, the lexicon will contain information as to the subcategory of the word (such as, whether a particular verb is transitive or intransitive), other syntactic properties (such as, the gender of a noun in a language that makes gender distinctions), and perhaps also morphological and semantic information. In the very simplest grammars, then, a lexical entry might look like this (using PATR notation):




Word paid: <cat> = V.

which simply tells us that 'paid' is a word whose category (cat) is verb (V), whereas a slightly more sophisticated grammar might require a lexical entry like this:




Word paid: <cat> = V <tense> = past <arg1> = NP.

which tells, in addition, that the verb is in the past tense and that it is transitive - how <arg1> = NP comes to mean transitive will be explained later.

The character of syntactic rules varies considerably depending on the precise character of the theory of grammar involved, and often such theories permit the use of more than one kind of rule. One very frequently employed type of syntactic rule is the phrase structure rule, and most current computational linguistic frameworks presuppose the existence of such rules. A phrase structure rule simply tells what a particular syntactic category (the 'left-hand side' - LHS) can be composed of (the 'right-hand side' - RHS). Thus, S -> NP VP tells us that a sentence can consist of a noun phrase followed by a verb phrase, while VP -> V NP tells us that a verb phrase can consist of a verb followed by a noun phrase. If you treat the arrow '->' as a convenient abbreviation for the expression 'can consist of' when looking at phrase structure rules, then you will not go far wrong. Phrase structure rules can also be used to introduce lexical items; thus, a rule like V -> paid is just an alternative way of representing the information we encoded above as:




Word paid: <cat> = V.

In principle, then, we could at this stage, if we wished, collapse our rules and our lexicon into a single component, since both components use phrase structure rules in this example. In practice, however, we will sometimes treat lexical entries as grammatical rules and sometimes consider them as a separate component, depending on which seems clearest at a given moment.

So far, we have discussed grammar in the abstract without ever exhibiting an example of one.

In this section and the next, we will look in some detail at what grammars look like and how they work. To start with, let us consider a very straightforward grammar (Grammar1) employing four syntactic categories; namely, noun phrase (NP), sentence (S), verb phrase (VP) and verb (V). We will supply the grammar with only three rules: one telling us that a sentence can consist of a noun phrase followed by a verb phrase, another which lets a verb phrase consist of a verb and a noun phrase, and a third which lets a verb phrase consist of a verb standing on its own. We also need a lexicon to supply us with some ready-made verbs and noun phrases. We will imagine that we are beginning to build a natural langauge front end to a hospital database, so all our lexical items will be drawn from this domain.


BOX: \Grammar1




Rule {simple sentence formation} S -> NP VP. Rule {transitive verb} VP -> V NP. Rule {intransitive verb} VP -> V.

Word Dr. Chan: <cat> = NP. Word nurses: <cat> = NP. Word MediCenter: <cat> = NP. Word patients: <cat> = NP. Word died: <cat> = V. Word employed: <cat> = V.



The lexicon tells us that 'Dr. Chan', 'nurses', 'MediCenter', and 'patients' are all noun phrases while 'died' and 'employed' are both verbs.

Grammars such as Grammar1 are called context-free phrase structure grammars (CF-PSGs). The key features of a CF-PSG are the employment of a finite set of grammatical categories and a finite set of rules for specifying how LHS categories can be realized as sequences of RHS elements. An RHS element may then be either a category or a particular symbol of the language. When a CF-PSG rule specifies that an LHS category can be realized as a particular RHS, this realization is deemed to be possible regardless of the context in which the LHS category appears. The rule does not make any restriction on the context in which this can happen - hence, the term 'context free'. There are a variety of possible notations for CF-PSGs, the Backus-Naur Form used for programming language grammars being one, but we use (a subset of) the PATR notation for Grammar1.

We have now completely defined Grammar1. But what is it good for? What can it do? Well, a grammar has two functions.

The first is to define the sets of grammatical strings of words in the language under consideration (or, more realistically, part of that language). Such strings are said to be 'generated' by the grammar. This use of the word generate does not imply the existence of any actual procedure that produces sentences; it indicates merely that certain strings are allowed as grammatical by the grammar.

So, in the case of a natural language like English, a grammar should define the set of grammatical sentences, the set of grammatical verb phrases, and so on. In defining the sets of grammatical strings it will, by implication, also define the ungrammatical strings, since an ungrammatical string is simply a string that is not grammatical - that is, a string that is not 'generated' by the grammar. The second function the grammar has is to associate one or more structures with each grammatical string. The syntactic structure of a phrase can be thought of as a kind of justification that the phrase is grammatical. It displays the various parts of the phrase and the ways in which the combinations of parts are licenced by the rules of the grammar. As we will see, this breaking up of a phrase into parts is a key factor in the computation of the semantics of the phrase. Typically, if a given string has two different structures, then it will also have two different meanings, and hence be ambiguous. Part of the job of the grammar, then, is to make correct predictions about this kind of structural ambiguity.

What form do these structures take? The answer, in mathematical parlance, is directed acyclic graphs: all contemporary grammatical frameworks, without exception, employ some kind of directed acyclic graph to represent structure. We will not delve into graph theory here, however, and directed acyclic graphs will anyway receive more attention in Chapter 7. Conveniently, most, but by no means all, contemporary linguists have elected to use one particular kind of directed acyclic graph - namely, the tree - for their structural descriptions. And a tree is what it sounds like: it has a root, it has branches and it terminates in leaves. But computational linguists, like genealogists, conventionally exhibit their trees upside down, with the root poking into the sky and the leaves on the ground. Figure 4.1 shows such a tree.


How does Grammar1 define sets of strings and associate such strings with structures like the one shown in Figure 4.1? Consider the top of the tree: here we have an S which has as daughters an NP and a VP, and nothing else, appearing in that order. The first rule of Grammar1 says that an S can consist of an NP followed by a VP, and that is exactly what we have here. So, the grammar legitimates this part of the structure. If we turn our attention now to the VP, then again we find that the grammar contains a rule that corresponds to this part of the structure. Finally, we look in the lexicon and find that 'MediCenter' and 'nurses' are both NPs, as the tree claims, and that 'employed' is a V. So, every substructure in the tree corresponds to some rule in the grammar, and the tree is thus admitted by the grammar. Since the tree is a legitimate structure, it follows that the string of words made up of the leaves of the tree - namely 'MediCenter employed nurses' - is, according to Grammar1, a grammatical expression of category S - that is, a sentence. This is not the only string of words that Grammar1 will admit as a sentence - there are 39 more such strings, including 'Nurses died, Dr. Chan employed patients, and so on.

We have represented our illustrative tree graphically above, but such diagrams are not, given the limitations of present computers, a convenient data structure to use for NLP. The simplest way to handle trees is to manipulate them as lists where the first element is the root category and subsequent elements are the daughter subtrees in order of occurrence. Thus the tree above would be:




[s,[np,'MediCenter'],[vp,[v,employed],[np,nurses]]]

This kind of list would appear in a "pretty-printed" format as follows:




[s [np 'MediCenter'] [vp [v employed] [np nurses]]]

The general principle used here is that a tree is represented by a list whose first element is the top label and whose subsequent elements are the immediate subtrees in order. These elements are then themselves lists, representing the subtrees according to the same convention. In this scheme, a leaf node of a tree is treated as a tree with a label but no immediate subtrees. Thus it will be represented by a list with one element - the label.

An alternative and equivalent representation of a tree is as a term:




s(np('MediCenter'),vp(v(employed),np(nurses)))

We will conclude this section with some comments about the appropriate way to evaluate a particular grammar for a (fragment of a) language, given a grammar formalism and an underlying mathematical interpretation for that formalism. There are basically three empirical criteria:
(1) Does it undergenerate?
(2) Does it overgenerate?
(3) Does it assign appropriate structures to the strings that it generates?
Once again, we are using the word generate in a neutral sense here, simply discussing the adequacy of the grammar as a description of a set of strings and ignoring the procedural complications that would be involved if we wished to produce actual example sentences from the description.

A grammar for English undergenerates if there are some syntactically well-formed expressions of English to which the grammar assigns no structure. But the sets of rules that grammarians propose are rarely intended to generate the whole of a natural language and so questions of undergeneration are normally considered relative to the goals of the grammar. If the grammar is intended as a set of rules to handle all English relative clauses and yet it fails to assign a structure to 'that saw you' in 'the man that saw you', then the grammar is clearly empirically inadequate for reasons of undergeneration. In a computational context, undergeneration by the grammar is not necessarily a problem, if the goal is to produce appropriate language; on the other hand, it could prove fatal if the goal is to recognize or understand.

A grammar overgenerates if it legitimates strings that cannot be construed as grammatical expressions of the language in question, given the category assigned to them by the grammar. This appears simpler than the undergeneration criterion, since the grammarian's goals do not seem to be bound up with the criterion. However, two factors serve to make it less straightforward than it at first appears. Firstly, our intuitions about what is or is not grammatical are often hard to disentangle from our judgements about what is meaningful; is 'stones bet stones stones stones' ungrammatical, or merely unlikely to have a useful meaning? And, secondly, the grammar may decide grammaticality on the basis of more than one component, so the fact that one component fails to forbid some string does not mean that the grammar as a whole will fail in this way. Of course, if all the components of the grammar are fully specified, then there should be no problem of evaluation. But often, in academic linguistic work, one or more syntactic components exist in name only (logical form, interpretative rules, and so on). This makes it difficult or impossible to evaluate the empirical adequacy of the rule components that are fully specified. In a computational context, overgeneration by the grammar is not necessarily a problem if the goal is to recognize or understand well-formed language. On the other hand, it could prove fatal if the goal is to produce well-formed language.

The question of whether a grammar assigns the correct structures is likewise a subtle one. Natural language expressions do not come to us with their structures ready marked: we have to infer the relevant structures indirectly by considering, for instance, what other phrases we could substitute for a given subphrase while maintaining grammaticality.

In addition to these empirical criteria, standard scientific considerations such as simplicity and generality apply to grammars in much the same way as they do to any other theories about natural phenomena. Other things being equal, a grammar with seven rules is to be preferred to one with 93 rules. And, other things being equal, a grammar for a fragment of English that can be converted into a grammar for the corresponding fragment of French with a few minor changes is to be preferred to a grammar for the English fragment that bears no relation whatsoever to its French counterpart. Likewise, a grammar for Swedish relative clauses that can be extended to handle a class of Swedish

questions with the addition of a couple of rules and a feature is to be preferred to one that requires a complete new set of parallel rules and features to encompass the questions.

Exercise 4.1

Code:aprecg.pl

Send us a comment.



[Contents] [Previous] [Next]
This document was translated by troff2html v0.21 on October 22, 1996.