International Summer School
CONTEMPORARY TOPICS
IN COMPUTATIONAL LINGUISTICS
Tzigov Chark, Bulgaria
7-9 Sept 1997
Course Descriptions
Information Extraction as a Core Language Technology
-
Yorick Wilks (University of Sheffield)
What is IE?
Information Extraction (IE) technology is now coming on to the market and
is of great significance to information end-user industries of all kinds,
especially finance companies, banks, publishers and governments. For
instance, Lloyd's of London needs to know each day of ship sinkings
throughout the world and pays large numbers of people to locate reports of
them in newspapers in a wide range of languages; doing that automatically
would be a paradigm case of IE.
Computational linguistic techniques and theories are playing a strong role
in this emerging technology (IE), not to be confused with the more mature
technology of Information Retrieval (IR), which selects a relevant subset
of documents from a larger set. IE extracts information from the actual
text of documents, by computer and at high speed, and normally from
publicly available electronic sources such as news wires. Any application
of IE technology is usually preceded by an IR phase, which selects a set of
documents relevant to some query, normally a string of features or terms
that appear in the documents. So IE is interested in the structure of the
texts, whereas one could say that, from an IR point of view, texts are just
unordered bags of words.
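The contrast between the two views can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sample sentences, the function names and the single hand-written pattern are invented for the example and do not come from any IE system discussed in the course.

```python
import re
from collections import Counter

# Two invented sample documents (assumption: not real news wire text).
DOCS = [
    "The tanker Aurora sank off Crete on Monday.",
    "Shares in shipping firms rose on Monday.",
]

def bag_of_words(text):
    """IR view: an unordered multiset of lower-cased word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def retrieve(query_terms, docs):
    """IR phase: keep the documents that contain every query term."""
    return [d for d in docs if all(bag_of_words(d)[t] > 0 for t in query_terms)]

# IE view: a hypothetical pattern that depends on word order and fills a template.
SINKING = re.compile(
    r"The (?P<type>\w+) (?P<name>[A-Z]\w+) sank off (?P<place>[A-Z]\w+)"
)

def extract(doc):
    """IE phase: return a filled template, or None if the pattern does not match."""
    match = SINKING.search(doc)
    return match.groupdict() if match else None

relevant = retrieve(["sank"], DOCS)           # IR narrows the document set
records = [extract(d) for d in relevant]      # IE reads the text itself
```

The retrieval step never looks at word order, while the extraction pattern only works because of it; real IE systems replace the single regular expression with the linguistic modules listed in the next section.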
Recent IE technology
In this course we will look at the history of IE
and at the empirical linguistic modules that make up
a classic IE system: part of speech tagging, syntactic analysis
or verb pattern analysis, lexicon development, word sense tagging
and so on. We shall also look at the evaluation schemas
developed for IE within the US Government ARPA program which did so
much to push the development of the technology over the last ten
years, as well as the wider systems also developed there such as
general architectures for NLP within which to situate IE.
Multi-Engine Machine Translation Environments
-
Sergei Nirenburg (New Mexico State University)
Multi-engine machine translation environments combine results from a variety
of machine translation systems working on the same input text, in order to
improve the overall quality of translation. The individual engines can be
built on any of the known methods - rule-based or corpus-orientated. They
can also be either commercial packages or research prototypes. The complex
task of combining the outputs and, whenever possible, intermediate
structures from the various engines will be discussed.
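One very simple way to combine engine outputs is to let each engine report a confidence for every segment and keep the most confident candidate. The sketch below assumes exactly that; the two toy engines, their tiny dictionaries and the confidence values are hypothetical, and the combination methods discussed in the course may differ.

```python
from typing import Callable, List, Tuple

# Assumption: each engine maps a source segment to (translation, confidence in [0, 1]).
Engine = Callable[[str], Tuple[str, float]]

def rule_based(segment: str) -> Tuple[str, float]:
    """Hypothetical rule-based engine backed by a one-entry lexicon."""
    lexicon = {"guten morgen": "good morning"}
    key = segment.lower()
    return (lexicon[key], 0.9) if key in lexicon else (segment, 0.1)

def example_based(segment: str) -> Tuple[str, float]:
    """Hypothetical corpus-oriented engine backed by a one-entry translation memory."""
    memory = {"wie geht es": "how are you"}
    key = segment.lower()
    return (memory[key], 0.8) if key in memory else (segment, 0.2)

def combine(segments: List[str], engines: List[Engine]) -> List[str]:
    """For each segment, keep the output of the most confident engine."""
    output = []
    for segment in segments:
        candidates = [engine(segment) for engine in engines]
        best_text, _ = max(candidates, key=lambda c: c[1])
        output.append(best_text)
    return output

result = combine(["Guten Morgen", "Wie geht es"], [rule_based, example_based])
```

Selecting whole segments ignores the intermediate structures mentioned above; it is the smallest design that still shows several engines working on the same input.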
Semantic Syntax
-
Pieter Seuren (Univ of Nijmegen)
Semantic Syntax (SESYN) is an integrated theory of natural language syntax
and semantics. It is a direct continuation of Generative Semantics. SESYN
is a rule system that establishes a (bidirectional) mapping between the
meaning representation of sentences (called semantic analysis structures)
and their surface realisation. The semantic analysis structures are
higher-order predicate calculus trees and contain the lexical items for
open class words. The surface structures are based on an orthodox version
of Transformational grammar.
The rule system works as follows: first a semantic analysis structure
is constructed using a set of formation rules. The formation rules are
a grammar for the semantic analysis structures. Then, the semantic
analysis structure is transformed into a surface structure using
cyclic and post-cyclic transformational rules. The cyclic rules apply
in a bottom-up way and are mostly lexicon-driven: predicates are
lexically marked for the cyclic rules they induce. The post-cyclic
rules are largely structure-driven and apply in linear order as
defined by the grammar. There is a limited number of highly
constrained transformation rules. In SESYN there is no surface
syntax grammar. Instead a grammar for the semantic structure is used
coupled with a transformational component - SESYN is a procedural
framework.
SESYN is formally precise and achieves a high degree of empirical success.
Exact rule systems are available for a number of languages (English,
French, Dutch, German) and more are being developed.
The course will cover:
- Basic principles of Semantic Syntax. Fully formalized generation of
sentences from semantic base, maximally unified for all languages.
- Demonstration for English and perhaps other languages.
- Further demonstration (computer implementation). Discussion of
problems and perspectives.
Recommended reading:
-
Seuren, Pieter A.M. 1996. SEMANTIC SYNTAX. Oxford: Blackwell.
An Architecture for Linguistically Intensive Content Characterisation
-
Branimir Boguraev (Apple Computer, Cupertino)
We will describe a novel approach to content characterisation of text
documents. It is domain- and genre-independent, by virtue of not requiring
an in-depth analysis of the full meaning. At the same time, it remains
closer to the core meaning by choosing a different granularity of its
representations (phrasal expressions rather than sentences or paragraphs),
by exploiting a notion of discourse contiguity and coherence for the
purposes of uniform coverage and context maintenance, and by utilising a
strong linguistic notion of salience, as a more appropriate and
representative measure of a document's ``aboutness''. We will focus on the
requirements of an architecture for performing such a task, irrespective of
a document's style, genre, or domain.
Natural Language Generation
-
Michael Zock (LIMSI, CNRS)
Natural language generation is a dynamic field that lies at the crossroads
of a number of disciplines: linguistics, psychology, rhetoric, computer
science (to name just four).
The goal of this course will be two-fold: (a) to make the non-specialist
aware of the problems, potential and achievements of natural language
generation, (b) to help bridge the gap that still exists between the experts
in the different disciplines (e.g., linguists, psychologists).
Issues which will be addressed here are:
- Why is generation an important research area?
- Why is generation a difficult task?
- What has been achieved?
- What methods have been invented?
- Which problems have been neglected?
- On which areas is interest currently focussed?
It will be emphasised that the problem of natural language generation can
only be solved in the realm of cognitive science, within a framework where
linguists, psychologists and computer scientists agree to meet and work in
concert.
Computational Morphology
-
Harald Trost (Austrian Institute for AI)
Duration:
3 hours
Audience:
Students with a general background in computational
linguistics
Goal:
To provide a short introduction to state-of-the-art methods
in computational morphology with an emphasis on two-level
morphology.
Content:
-
A short introduction to morphology:
what is a word; functions (inflection, derivation and compounding) and
phenomena of morphology; with examples from different languages;
-
Applications of computational morphology:
e.g., lemmatization, finding word boundaries, morphological analysis;
-
Two-level morphology: the basic approach:
This part explains the basic mechanism of two-level morphology: encoding
morphological and phonological phenomena in two-level rules, and compiling
rules and lexica into finite-state transducers. It will be exemplified by a
small English lexicon and rule set.
-
Extensions and innovative applications of two-level morphology:
This includes the application of two-level formalisms to non-concatenative
morphology (umlaut in German, templatic morphology in Arabic) and the
application to speech synthesis using a phonological lexicon.
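The lexical-to-surface mapping at the heart of the two-level approach can be illustrated with English e-insertion ("fox+s" surfaces as "foxes"). The sketch below is a deliberately simplified stand-in: it checks the rule context directly in Python instead of compiling rules into finite-state transducers, and its lexicon, function names and rule formulation are assumptions made for the example.

```python
# Stem endings after which the plural "s" needs an inserted "e" (simplified).
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")

def generate(lexical: str) -> str:
    """Map a lexical form such as 'fox+s' to a surface form such as 'foxes'."""
    surface = []
    for i, ch in enumerate(lexical):
        if ch != "+":
            surface.append(ch)      # default pair x:x, the symbol is copied
            continue
        left, right = lexical[:i], lexical[i + 1:]
        if left.endswith(SIBILANT_ENDINGS) and right == "s":
            surface.append("e")     # pair +:e, licensed by the context
        # otherwise pair +:0, the morpheme boundary is deleted
    return "".join(surface)

def analyse(surface: str, lexicon: dict) -> list:
    """Analysis by generate-and-test over a tiny lexicon of stems and suffixes."""
    return [
        (stem, suffix)
        for stem in lexicon["stems"]
        for suffix in lexicon["suffixes"]
        if generate(stem + "+" + suffix) == surface
    ]

# Hypothetical miniature lexicon.
LEXICON = {"stems": ["fox", "cat", "church"], "suffixes": ["s"]}

plural_fox = generate("fox+s")
plural_cat = generate("cat+s")
analysis = analyse("churches", LEXICON)
```

A compiled transducer runs in both directions without enumerating the lexicon; the generate-and-test loop here only imitates that bidirectionality for a handful of entries.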
Recommended reading:
-
Richard Sproat. 1992.
Morphology and Computation.
Cambridge, MA: MIT Press.
Corpus Linguistics
-
Tony McEnery (Univ of Lancaster)
Corpora have had an increasing impact upon computational linguistics. Since
the Brown corpus was exploited to yield quantitative data that enabled the
production of the first robust part-of-speech tagging systems, the growth of
corpus based NLP systems has been steady. In this course the speaker will
examine the growth of corpora and link this to developments in computational
linguistics.
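How quantitative data from a tagged corpus can drive a tagger is easiest to see in a unigram baseline, which assigns each word the tag it carried most often in training. The sketch below is such a baseline under stated assumptions: the three-sentence corpus is hand-made for the example (it is not Brown corpus data), and the tag names and default tag are arbitrary choices.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus, purely illustrative.
TAGGED = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

def train(corpus):
    """Count word/tag pairs and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag known words from the model and fall back to a default tag otherwise."""
    return [(word, model.get(word, default)) for word in words]

model = train(TAGGED)
tagged = tag(["the", "cat", "barks", "loudly"], model)
```

The unknown word "loudly" receives the default tag, which shows why larger corpora and context-sensitive models matter for robust tagging.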
Recent Developments in Anaphora Resolution
-
Ruslan Mitkov (Univ of Wolverhampton)
Anaphora resolution is a key issue in NLP, being vital in natural language
interfaces, machine translation, automatic abstracting and in a number of
other NLP applications. After considerable initial research in anaphora
resolution and after years of relative silence in the early eighties,
anaphora resolution has attracted the attention of many researchers in the
last 10 years and much promising work on the topic has been reported.
The course will consist of the following parts:
-
Brief introduction (theoretical background: basic notions and terminology;
earlier work; the role of center (focus) and center (focus) tracking)
-
Anaphora resolution in the last 10 years (knowledge-based (integrated)
approaches; alternative approaches; latest trends towards knowledge-poor
approaches)
-
Applications (anaphora resolution in Machine Translation (theoretical
issues, recent research); anaphora resolution in message understanding)
-
The future (unsolved problems; future directions)
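As a rough illustration of what a knowledge-poor approach can look like, the sketch below resolves a pronoun to the most recent preceding mention that agrees with it in gender and number. It is an assumption-laden toy, not a method presented in the course: the `Mention` record, the pronoun table and the hand-annotated example are all invented for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Mention:
    text: str
    position: int   # token index in the discourse
    gender: str     # "m", "f" or "n"
    number: str     # "sg" or "pl"

# Pronoun features; None means the pronoun does not constrain gender.
PRONOUNS = {
    "he": ("m", "sg"), "she": ("f", "sg"),
    "it": ("n", "sg"), "they": (None, "pl"),
}

def resolve(pronoun: str, position: int, mentions: List[Mention]) -> Optional[Mention]:
    """Pick the closest preceding mention that agrees in gender and number."""
    gender, number = PRONOUNS[pronoun.lower()]
    candidates = [
        m for m in mentions
        if m.position < position
        and m.number == number
        and (gender is None or m.gender == gender)
    ]
    return max(candidates, key=lambda m: m.position, default=None)

# Hand-annotated mentions for: "Mary gave the report to John. He read it."
mentions = [
    Mention("Mary", 0, "f", "sg"),
    Mention("the report", 2, "n", "sg"),
    Mention("John", 5, "m", "sg"),
]

he = resolve("He", 7, mentions)
it = resolve("it", 9, mentions)
```

Agreement plus recency needs no world knowledge or full parse, which is the appeal of such approaches; it also fails whenever two agreeing candidates compete, which is where the centering and knowledge-based techniques listed above come in.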
Automatic Abstracting
-
Benjamin Tsou (City University of Hong Kong)
Last update: 1 Sep 1997
Nicolas Nicolov
Cognitive and Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, UK
Phone: +44-1273 678408
Fax: +44-1273 671320
nicolas@cogs.susx.ac.uk