Carroll, J. (1993) Practical unification-based parsing of natural language, Technical Report No. 314, Computer Laboratory, University of Cambridge. PhD Thesis.
The thesis describes novel techniques and algorithms for the practical parsing of realistic Natural Language (NL)
texts with a wide-coverage unification-based grammar of English. The thesis tackles two of the major problems in
this area: firstly, the fact that parsing realistic inputs with such grammars can be computationally very expensive,
and secondly, the observation that many analyses are often assigned to an input, only one of which usually forms
the basis of the correct interpretation.
The thesis starts by presenting a new unification algorithm, justifies why it is well-suited to practical NL parsing,
and describes a bottom-up active chart parser which employs this unification algorithm together with several
other novel processing and optimisation techniques. Empirical results demonstrate that an implementation of this
parser has significantly better practical performance than a comparable, state-of-the-art unification-based parser.
Next, techniques for computing an LR table for a large unification grammar are described, a context free
non-deterministic LR parsing algorithm is presented which has better time complexity than any previously
reported using the same approach, and a unification-based version is derived. In experiments, the performance
of an implementation of the latter is shown to exceed both the chart parser and also that of another efficient
LR-like algorithm recently proposed.
Building on these methods, a system for parsing text taken from a given corpus is described which uses
probabilistic techniques to identify the most plausible syntactic analyses for an input from the often large number
licensed by the grammar. New techniques implemented include an incremental approach to semi-supervised
training, a context-sensitive method of scoring sub-analyses, the accurate manipulation of probabilities during
parsing, and the identification of the highest ranked analyses without exhaustive search. The system attains a
similar success rate to approaches based on context-free grammar, but produces analyses which are more
suitable for semantic processing.
The thesis includes detailed analyses of the worst-case space and time complexities of all the main algorithms
described, and discusses the practical impact of the theoretical complexity results.
Download postscript version.