Summary of Final Report for EPSRC Project 'Integrated Language Database' (IED4/1/5808, GR/H90148)

The overall goal of the Integrated Language Database (ILD) project was to develop an integrated language database system that would facilitate the extraction of information from large bodies of online text (corpora), supporting rapid and efficient development of multilingual computational lexicons and published dictionaries.

The remit for the Cambridge group was to develop multilingual corpus analysis tools with suitable interfaces to the core database system (which was to be implemented by Sharp). The languages chosen were English and French, for their widespread applicability, and also as a test of the generality of the text analysis techniques used and their suitability for multilingual processing. The corpus tools were to be robust and efficient, based on finite-state techniques, context-free / unification-based grammars, and statistical processing. Over the course of the project we developed new tools and/or enhanced existing ones for part-of-speech tagging, lemmatisation, phrasal parsing, and inference of lexical knowledge from large corpora. The tools delivered consisted of:

1. Robust statistical phrasal parsers (English and French)

2. Adaptation of the Acquilex tagger to unknown words, 2 releases (English and French)

3. Addition of text grammar to parsers (English and French)

4. Prototype English subcategorisation frame extractor

5. Evaluation of parsers

6. French tagged corpus

7. English lemmatiser

For most of these tools, we were able to show significant improvement over their precursors and/or state-of-the-art alternatives developed elsewhere.

The main results were as follows. Improvements to unknown word tagging decreased the overall error rate of the Acquilex tagger by 3%. A wide-coverage lemmatiser capable of processing 55,000 words per second was developed. Addition of a text grammar to the English phrasal parser increased coverage by 8% and decreased syntactic ambiguity by 38%. Final coverage of the English phrasal grammar was 79% of the test corpus (SUSANNE), and for 62% of test sentences the highest-ranked analysis returned by the parser contained no crossing brackets. Finally, use of verb subcategorisation frames obtained automatically from corpora using the extractor system improved the mean crossing brackets rate of the parser by 7%.
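
For readers unfamiliar with the crossing brackets measure used in the parser evaluation, the following is a minimal illustrative sketch, not the evaluation software used in the project. It assumes each analysis is represented as a list of constituent spans given as half-open (start, end) token offsets; the function names crosses, crossing_brackets and summarise are hypothetical. A parser bracket is counted as crossing when it overlaps a treebank bracket without either one containing the other.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def crossing_brackets(parsed, gold):
    """Count parser brackets that cross at least one treebank bracket."""
    return sum(1 for p in parsed if any(crosses(p, g) for g in gold))


def summarise(pairs):
    """Return (proportion of sentences with zero crossings, mean crossings per sentence).

    pairs is a list of (parsed_spans, gold_spans) tuples, one per sentence.
    """
    counts = [crossing_brackets(parsed, gold) for parsed, gold in pairs]
    zero_rate = sum(1 for c in counts if c == 0) / len(counts)
    mean_rate = sum(counts) / len(counts)
    return zero_rate, mean_rate


# Toy data (invented for illustration): two five-token sentences.
gold = [(0, 5), (0, 2), (2, 5)]
parsed_bad = [(0, 5), (1, 3), (3, 5)]   # (1, 3) crosses the gold span (0, 2)
parsed_good = [(0, 5), (0, 2), (2, 5)]  # identical to the gold bracketing

zero_rate, mean_rate = summarise([(parsed_bad, gold), (parsed_good, gold)])
```

Under this representation, the zero-crossings figure corresponds to the proportion of sentences whose top-ranked analysis has a count of zero, and the mean crossing brackets rate is the average count per sentence.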