International Summer School

CONTEMPORARY TOPICS

IN COMPUTATIONAL LINGUISTICS

Tzigov Chark, Bulgaria
7-9 Sept 1997


COURSE DESCRIPTIONS


Information Extraction as a Core Language Technology

Yorick Wilks (University of Sheffield)

What is IE?

Information Extraction (IE) technology is now coming on to the market and is of great significance to information end-user industries of all kinds, especially finance companies, banks, publishers and governments. For instance, Lloyd's of London need to know of daily ship sinkings throughout the world and pay large numbers of people to locate them in newspapers in a wide range of languages; doing that automatically would be a paradigm case of IE.

Computational linguistic techniques and theories are playing a strong role in this emerging technology (IE), not to be confused with the more mature technology of Information Retrieval (IR), which selects a relevant subset of documents from a larger set. IE extracts information from the actual text of documents, by computer and at high speed, and normally from publicly available electronic sources such as news wires. Any application of IE technology is usually preceded by an IR phase, which selects a set of documents relevant to some query--normally a string of features or terms that appear in the documents. So, IE is interested in the structure of the texts, whereas one could say that, from an IR point of view, texts are just bags of unordered words.
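The contrast between the two views can be illustrated with a minimal Python sketch. It is not taken from the course materials: the example sentence, the single regular-expression pattern and the function names are all invented for illustration, and a real IE system would use the linguistic modules described in this course rather than one hand-written pattern.

```python
import re
from collections import Counter

def bag_of_words(text):
    """IR-style view: an unordered multiset of lowercased word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def ir_matches(text, query_terms):
    """True if every query term occurs somewhere in the text."""
    bag = bag_of_words(text)
    return all(term in bag for term in query_terms)

# Hypothetical one-pattern "template" for the ship-sinking example:
# "The <ship> sank off <place>". Word order matters here.
SINKING = re.compile(
    r"The (?P<ship>[A-Z][\w ]*?) sank off (?P<place>[A-Z]\w*)"
)

def extract_sinking(text):
    """IE-style view: fill named slots from the structure of the text."""
    match = SINKING.search(text)
    return match.groupdict() if match else None

# Invented example sentence and a scrambled copy of the same words.
report = "The Maria sank off Crete on Monday."
scrambled = "Monday on Crete off sank Maria the."

same_bag = bag_of_words(report) == bag_of_words(scrambled)
slots = extract_sinking(report)
scrambled_slots = extract_sinking(scrambled)
```

Both strings look identical to the bag-of-words comparison and both match the query term "sank", but only the original word order allows the ship and place slots to be filled.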

Recent IE technology

In this course we will look at the history of IE and at the empirical linguistic modules that make up a classic IE system: part of speech tagging, syntactic analysis or verb pattern analysis, lexicon development, word sense tagging and so on. We shall also look at the evaluation schemas developed for IE within the US Government ARPA program, which did so much to push the development of the technology over the last ten years, as well as the wider systems also developed there, such as general architectures for NLP within which to situate IE.


Multi-Engine Machine Translation Environments

Sergei Nirenburg (New Mexico State University)

Multi-engine machine translation environments combine results from a variety of machine translation systems working on the same input text, in order to improve the overall quality of translation. The individual engines can be built on any of the known methods - rule-based or corpus-orientated. They can also be either commercial packages or research prototypes. The complex task of combining the outputs and, whenever possible, intermediate structures from the various engines will be discussed.
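As a rough illustration of the idea, the sketch below shows one very simple combination strategy: each engine returns a candidate translation with a confidence score, and the combiner keeps the highest-scoring candidate for each segment. This is an assumption made for illustration only; the course description does not say which combination method will be presented, and the two toy engines, their tiny lexicons and their scores are hypothetical.

```python
from typing import Callable, List, Tuple

# An engine maps a source segment to (translation, confidence in [0, 1]).
Engine = Callable[[str], Tuple[str, float]]

def rule_based_engine(segment: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a rule-based engine."""
    lexicon = {"dobro utro": "good morning"}
    if segment in lexicon:
        return lexicon[segment], 0.9
    return segment, 0.1  # low confidence: segment passed through untranslated

def corpus_based_engine(segment: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a corpus-based engine."""
    memory = {"blagodarya": "thank you"}
    if segment in memory:
        return memory[segment], 0.8
    return segment, 0.1

def combine(segments: List[str], engines: List[Engine]) -> List[str]:
    """For each segment, keep the candidate with the highest confidence."""
    output = []
    for segment in segments:
        candidates = [engine(segment) for engine in engines]
        best_text, _ = max(candidates, key=lambda c: c[1])
        output.append(best_text)
    return output

result = combine(["dobro utro", "blagodarya"],
                 [rule_based_engine, corpus_based_engine])
```

Selecting whole segments by score ignores the intermediate structures mentioned above; using those would require engines that expose more than a final string.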

Semantic Syntax

Pieter Seuren (Univ of Nijmegen)

Semantic Syntax (SESYN) is an integrated theory of natural language syntax and semantics. It is a direct continuation of Generative Semantics. SESYN is a rule system that establishes a (bidirectional) mapping between the meaning representation of sentences (called semantic analysis structures) and their surface realisation. The semantic analysis structures are higher-order predicate calculus trees and contain the lexical items for open class words. The surface structures are based on an orthodox version of Transformational grammar.

The rule system works as follows: first, a semantic analysis structure is constructed using a set of formation rules. The formation rules are a grammar for the semantic analysis structures. Then, the semantic analysis structure is transformed into a surface structure using cyclic and post-cyclic transformational rules. The cyclic rules apply in a bottom-up way and are mostly lexicon-driven: predicates are lexically marked for the cyclic rules they induce. The post-cyclic rules are largely structure-driven and apply in linear order as defined by the grammar. There is a limited number of highly constrained transformation rules. In SESYN there is no surface syntax grammar. Instead, a grammar for the semantic structure is used, coupled with a transformational component: SESYN is a procedural framework. SESYN is formally precise and achieves a high degree of empirical success. Exact rule systems are available for a number of languages (English, French, Dutch, German) and more are being developed.

The course will cover:

Recommended reading:
Seuren, Pieter A.M. 1996. SEMANTIC SYNTAX. Oxford: Blackwell.

An Architecture for Linguistically Intensive Content Characterisation

Branimir Boguraev (Apple Computer, Cupertino)

We will describe a novel approach to content characterisation of text documents. It is domain- and genre-independent, by virtue of not requiring an in-depth analysis of the full meaning. At the same time, it remains closer to the core meaning by choosing a different granularity of its representations (phrasal expressions rather than sentences or paragraphs), by exploiting a notion of discourse contiguity and coherence for the purposes of uniform coverage and context maintenance, and by utilising a strong linguistic notion of salience, as a more appropriate and representative measure of a document's "aboutness". We will focus on the requirements of an architecture for performing such a task, irrespective of a document's style, genre, or domain.

Natural Language Generation

Michael Zock (LIMSI, CNRS)

Natural language generation is a dynamic field that lies at the crossroads of a number of disciplines: linguistics, psychology, rhetoric, computer science (to name just four).

The goal of this course will be two-fold: (a) to make the non-specialist aware of the problems, potential and achievements of natural language generation, (b) to help bridge the gap that still exists between the experts in the different disciplines (e.g., linguists, psychologists).

Issues which will be addressed here are:

It will be emphasised that the problem of natural language generation can only be solved in the realm of cognitive science, within a framework where linguists, psychologists and computer scientists agree to meet and work in concert.

Computational Morphology

Harald Trost (Austrian Institute for AI)

Duration: 3 hours
Audience: Students with a general background in computational linguistics
Goal: To provide a short introduction to state-of-the-art methods in computational morphology with an emphasis on two-level morphology.

Content:

Recommended reading:
Richard Sproat. 1992. Morphology and Computation. Cambridge, MA: MIT Press.

Corpus Linguistics

Tony McEnery (Univ of Lancaster)

Corpora have had an increasing impact upon computational linguistics. Since the Brown corpus was exploited to yield quantitative data that enabled the production of the first robust part-of-speech tagging systems, the growth of corpus-based NLP systems has been steady. In this course the speaker will examine the growth of corpora and link this to developments in computational linguistics.

Recent Developments in Anaphora Resolution

Ruslan Mitkov (Univ of Wolverhampton)

Anaphora resolution is a key issue in NLP, being vital in natural language interfaces, machine translation, automatic abstracting and a number of other NLP applications. After considerable initial research in anaphora resolution and after years of relative silence in the early eighties, anaphora resolution has attracted the attention of many researchers in the last 10 years and much promising work on the topic has been reported. The course will consist of the following parts:

Automatic Abstracting

Benjamin Tsou (City University of Hong Kong)


Last update: 1 Sep 1997


Nicolas Nicolov
Cognitive and Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, UK
Phone: +44-1273 678408
Fax: +44-1273 671320
nicolas@cogs.susx.ac.uk