Practical NL parsing: disambiguation
An important problem in parsing natural (human) language by computer is identifying, among the often large number of analyses a grammar assigns to an input, the syntactic analysis that underlies the correct interpretation. In tackling this problem, Ted Briscoe and I devised and implemented a novel probabilistic approach to syntactic disambiguation (Carroll & Briscoe, 1992; Briscoe & Carroll, 1993).
More recently, I have been working on robust analysis of unrestricted English text (Briscoe & Carroll, 1995; Carroll & Briscoe, 1996; Carroll & Briscoe, 2002). This research overcomes a major source of 'brittleness' (that is, failure) in conventional parsing with linguistically motivated grammars, namely omissions in the lexicon, by requiring less detailed lexical information (derived fully automatically from the output of a lexical tagger), by producing slightly 'shallower' analyses, and by relying on a variant of the same probabilistic disambiguation technique.
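To give the flavour of the approach, here is a minimal sketch (with entirely hypothetical rule probabilities and tree encoding) of probabilistic disambiguation: competing analyses are ranked by the probabilities of the rules they use, and the highest-scoring one is selected. The published model conditions probabilities on LR parse actions rather than on bare rules, so this illustrates only the general principle.

```python
import math

# Hypothetical rule probabilities, as might be estimated from a treebank.
RULE_LOGPROB = {
    ("S",  ("NP", "VP")):      math.log(0.90),
    ("VP", ("V", "NP")):       math.log(0.50),
    ("VP", ("V", "NP", "PP")): math.log(0.20),
    ("NP", ("NP", "PP")):      math.log(0.15),
}
UNSEEN = math.log(1e-6)  # fallback for rules absent from the table

def label(node):
    return node if isinstance(node, str) else node[0]

def logprob(tree):
    """Sum the log probabilities of all rules used in a tree.
    A tree is (category, [daughters]); a leaf is a plain string."""
    if isinstance(tree, str):
        return 0.0
    cat, daughters = tree
    rule = (cat, tuple(label(d) for d in daughters))
    return RULE_LOGPROB.get(rule, UNSEEN) + sum(logprob(d) for d in daughters)

def disambiguate(analyses):
    """Return the most probable of the competing analyses."""
    return max(analyses, key=logprob)

# PP attachment ambiguity in "I saw the man with a telescope":
vp_attach = ("S", [("NP", ["I"]),
                   ("VP", [("V", ["saw"]), ("NP", ["the man"]),
                           ("PP", ["with a telescope"])])])
np_attach = ("S", [("NP", ["I"]),
                   ("VP", [("V", ["saw"]),
                           ("NP", [("NP", ["the man"]),
                                   ("PP", ["with a telescope"])])])])
print(disambiguate([vp_attach, np_attach]) is vp_attach)  # True here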
On this same issue of disambiguation, I have also investigated techniques for associating probabilistic information with lexicalised grammar formalisms, such as Lexicalized Tree Adjoining Grammar (Carroll & Weir, 1997).
Practical NL parsing: efficiency
Another major problem for practical parsing of text with formal grammars is ensuring adequate parser throughput. In work on the Alvey NL Tools I developed novel processing techniques which led to improved throughput in unification-based parsers (Carroll, 1993). On the basis of this work, I presented empirical evidence (Carroll, 1994) suggesting that, in practice, wide-coverage NL grammars do not necessarily elicit worst-case complexity from parsing algorithms.
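As an illustration of how evidence of this kind can be gathered, the sketch below (using made-up timing data, not figures from the cited experiments) fits an effective complexity exponent k in time ≈ c·n^k to observed parse times; a fitted k well below the theoretical bound indicates that worst-case behaviour is not arising in practice.

```python
import math

def fit_exponent(lengths, times):
    """Least-squares fit of log(time) = k*log(length) + c, estimating
    the effective complexity exponent k from observed parse times."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Illustrative numbers only: parse times growing roughly as n^1.9,
# well below, say, a cubic worst-case bound.
lengths = [5, 10, 20, 40]
times = [0.01, 0.039, 0.15, 0.58]
print(round(fit_exponent(lengths, times), 2))
```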
Following on from this research, I have worked on efficient parsing with wide-coverage grammars in the HPSG framework (Kiefer et al, 1999; Oepen & Carroll, 2000; Malouf, Carroll & Copestake, 2000), and also on fast automaton-based parsing of lexicalised tree grammars (Shaumyan, Carroll & Weir, 2002).
Practical NL parsing: evaluation
Measuring parser accuracy is a difficult problem: a number of approaches have been proposed in the literature, but all have their drawbacks. Carroll, Briscoe & Sanfilippo (1998) describe and justify a new (dependency-based) technique which overcomes some of the shortcomings of previous proposals. Carroll, Minnen & Briscoe (1999) take this a step further, describing a new test suite of naturally-occurring English sentences annotated to this standard for use in evaluating a parser. The test suite is available for public download.
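The core of a dependency-based evaluation is straightforward to state: compare the set of grammatical-relation triples produced by the parser against the gold-standard annotation, and report precision, recall and F-score. The sketch below uses hypothetical triples; the details of the actual annotation scheme differ.

```python
def prf(gold, test):
    """Precision, recall and F1 over sets of grammatical-relation
    triples of the form (relation, head, dependent)."""
    gold, test = set(gold), set(test)
    correct = len(gold & test)
    p = correct / len(test) if test else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Gold standard vs. parser output for "John saw the man with a telescope";
# the parser has attached the PP to the noun rather than the verb.
gold = {("subj", "saw", "John"), ("dobj", "saw", "man"),
        ("mod", "saw", "telescope")}
test = {("subj", "saw", "John"), ("dobj", "saw", "man"),
        ("mod", "man", "telescope")}
print(prf(gold, test))  # (0.667, 0.667, 0.667), rounded
```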
Acquisition of lexical information
One application of robust natural language analysis technology is the extraction of linguistic knowledge about words from large amounts of raw, unannotated text. This information can be used to improve parsing accuracy or coverage, or can feed into other types of language analysis.
I have experimented with automatically acquiring subcategorisation information for verbs (Briscoe & Carroll, 1997), and have used this to determine empirically whether the subcategorisation frequency data gathered for individual verbs can improve parser accuracy (Carroll, Minnen & Briscoe, 1998), and whether the information can improve the coverage of a 'deep' grammar of English (Carroll & Fang, 2004).
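In outline, acquisition of this kind tallies the putative frames observed for each verb in parsed text and then filters out unreliable ones. The sketch below uses a simple relative-frequency cutoff and invented counts; the published system instead applies statistical hypothesis testing to decide which frames to keep.

```python
from collections import Counter, defaultdict

def collect_frames(observations):
    """Tally putative subcategorisation frames per verb. Each observation
    is a (verb, frame) pair extracted from parser output, e.g.
    ('give', 'NP_NP'); frame extraction itself is glossed over here."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    return counts

def filter_frames(counts, threshold=0.05):
    """Keep frames whose relative frequency for a verb exceeds a cutoff,
    discarding frames that are probably parser noise."""
    lexicon = {}
    for verb, frames in counts.items():
        total = sum(frames.values())
        lexicon[verb] = {f: n / total for f, n in frames.items()
                         if n / total >= threshold}
    return lexicon

# Invented counts; the last frame is the kind of misparse noise filtered out.
obs = ([("give", "NP_NP")] * 40 + [("give", "NP_PP")] * 55
       + [("give", "NP_S")] * 6 + [("give", "NP_ADJP")] * 2)
print(filter_frames(collect_frames(obs))["give"])
```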
Diana McCarthy and I have applied automatically acquired selectional preferences to the tagging of words with their most likely sense in context (Carroll & McCarthy, 2000; McCarthy & Carroll, 2003). More recent work has used the similarity of contexts in which words appear to identify their predominant senses, without needing any annotated data (McCarthy et al, 2004a; McCarthy et al, 2004b).
Large-scale grammar and lexicon development
As part of a collaboration with the LinGO Lab at Stanford University, I have contributed to the advanced graphical interface that the LKB system presents to the grammar writer. This system has been used to teach courses in grammar writing (Copestake et al, 2001).
I was involved in designing and implementing one of the first grammar development environments of this type (Boguraev et al, 1988), which supported the development of the Alvey NL Tools grammar. Using this environment, we developed a wide-coverage syntactic and semantic rule-based grammar of English (Grover, Carroll & Briscoe, 1993), covering a large class of syntactic phenomena that occur in actual texts. We also devised a methodology and implemented a system used to build an associated large computational lexicon from the machine-readable version of the Longman Dictionary of Contemporary English (Carroll & Grover, 1988).
The difficulty of creating and maintaining large, unstructured lexicons led (in a separate but related research effort) to the development of a novel approach to lexicon representation (Russell et al, 1992) using multiple default inheritance, but with certain specific restrictions to ensure tractability.
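The sketch below illustrates the general idea of default inheritance in a lexicon: feature values percolate down a hierarchy of lexical types unless a more specific entry overrides them. It resolves multiple inheritance with a simple fixed priority order over parents; the restrictions the cited formalism places on how defaults may interact, which are what guarantee tractability, are not modelled here.

```python
class LexType:
    """A node in a lexical inheritance hierarchy. Feature values are
    inherited by default from (ordered) parents and may be overridden
    by a local, exceptional value."""
    def __init__(self, name, parents=(), **features):
        self.name, self.parents, self.features = name, parents, features

    def get(self, feature):
        if feature in self.features:      # local value overrides defaults
            return self.features[feature]
        for parent in self.parents:       # earlier parent takes priority
            try:
                return parent.get(feature)
            except KeyError:
                pass
        raise KeyError(feature)

verb = LexType("verb", cat="V", past_suffix="ed")
transitive = LexType("transitive", (verb,), subcat=("NP",))
sing = LexType("sing", (transitive,), past_form="sang")  # irregular verb

print(sing.get("cat"))        # 'V'     -- inherited from 'verb'
print(sing.get("subcat"))     # ('NP',) -- inherited from 'transitive'
print(sing.get("past_form"))  # 'sang'  -- local, exceptional value
```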
Linguistic approaches to tactical generation
Using formal linguistic grammars to generate natural language text from a specified semantic representation has parallels with parsing using these grammars. I have worked on two approaches to tactical generation: semantic head-driven generation (Russell, Warwick & Carroll, 1990) for unification grammars (augmented with relational constraints) covering fragments of English, French and German; and efficient chart-based generation for large-scale lexicalised grammars (Carroll et al, 1999; Cahill et al, 2001; Carroll & Oepen, 2005).
In a similar vein to this work on reversible grammars, I have been involved in developing an accurate and efficient morphological analyser/generator for English using finite-state techniques (Minnen, Carroll & Pearce, 2001). The morphological analyser, the generator, and an associated orthographic post-processor are freely available.
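To illustrate the kind of orthographic alternation such a system handles, here is a sketch of regular past-tense generation using plain string tests. The cited system instead compiles rules of this kind into finite-state transducers, and its actual rule set differs from this simplification.

```python
import re

def past_tense(stem):
    """Generate a regular English past-tense form, applying typical
    orthographic alternations."""
    if stem.endswith("e"):                                # bake -> baked
        return stem + "d"
    if re.search(r"[^aeiou]y$", stem):                    # try -> tried
        return stem[:-1] + "ied"
    if re.search(r"[^aeiou][aeiou][^aeiouwxy]$", stem):   # stop -> stopped
        # Consonant doubling actually depends on stress (visit -> visited),
        # which this simplified check ignores.
        return stem + stem[-1] + "ed"
    return stem + "ed"                                    # walk -> walked

for v in ("bake", "try", "play", "stop", "walk"):
    print(v, "->", past_tense(v))
```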