LINGUIST List 4.429

Fri 04 Jun 1993

FYI: New list: South-East Asian languages; Parser

Editor for this issue: <>


  1. Brian Migliazza, South East Asian Languages & Linguistics Interest Group (fwd)
  2. Atro Voutilainen, English Constraint Grammar Parser

Message 1: South East Asian Languages & Linguistics Interest Group (fwd)

Date: Fri, 28 May 1993 16:49:41
From: Brian Migliazza <>
Subject: South East Asian Languages & Linguistics Interest Group (fwd)

No. 1 (93-05-28) SEALLIG 28 May 1993

SEALLIG: South East Asian Languages & Linguistics Interest Group

Moderator: Brian Migliazza <>
 Linguistics Department
 Thammasat University
 Bangkok, Thailand

Asst. Editors: to be announced


Welcome to a new email forum, the South East Asian Languages and Linguistics
Interest Group. There has been a lot of response from various people
around the world for a language interest group focused on the languages of
SEA -- so I will send out this notice and see what y'all think.

If you like this idea please let me know. Let me know also your ideas on
how to organize this list.

Thailand has been on the email nets for a while now, and quite a few
universities here are being added to the Internet. What this means for us
is that many of the Thai linguists and other academics are now accessible
via email. Other countries in SEA are also coming online with the Internet,
so fruitful academic interaction is now possible between those of you
working outside of SEA and the local scholars within SEA.

This SEALLIG list is designed to facilitate quick interaction between all
of us around the world who are interested in the languages and linguistics
of this region. My idea would be to "LOOSELY" define the SEA region --
both in terms of the geography and in terms of the languages. Thus, I would
consider SEA to run from Southern China to Indonesia, and from Eastern
India to the Philippines, including all the languages in between.

I am willing to serve as moderator -- meaning that I would compile and
collate all messages sent to me and then send them to you all. As in the
LINGUIST file, it would be preferable for people to respond directly to
the person making the query. Then that person should compile the responses
and send them to me for distribution to the entire group.

As for topic areas, it would probably be good to organize the comments/
messages by major language families, such as TB (Tibeto-Burman),
AN (Austronesian), AA (Austroasiatic), MY (Miao-Yao), and TK (Tai-Kadai).
If messages overlap these areas, we can put them in a general area.

Also we can have a "Notices" section for information on programs in these
languages, books published, journals available, upcoming conferences. If
you are interested, we could also maintain a directory of scholars and the
languages they study.

Send in any feedback that you may have. I am open to suggestions.
Hopefully soon we will have assistant editors for this list, from the
universities here in Thailand (Chulalongkorn, Thammasat, and Mahidol).

Brian Migliazza <>

Message 2: English Constraint Grammar Parser

Date: Fri, 28 May 93 12:00:33 +0
From: Atro Voutilainen <>
Subject: English Constraint Grammar Parser


As of June 1, 1993, the English Constraint Grammar Parser ENGCG,
developed at the Department of General Linguistics and the Research
Unit for Computational Linguistics, University of Helsinki, is
released for non-commercial academic use. ENGCG is released in
collaboration with Helsinki University Licensing Ltd.

The various parts of the system were written by the following persons:

* ENGTWOL lexicon (c) Atro Voutilainen, Juha Heikkila
* Grammar for morphological disambiguation (c) Atro Voutilainen
* Grammar for syntactic functions (c) Arto Anttila
* Two-level program (c) Kimmo Koskenniemi and Lingsoft, Inc.
* Constraint Grammar parser, academic version: Bart Jongejan
  ((c) CRI A/S, Denmark)
* Constraint Grammar parser, production version (c) Pasi Tapanainen

The system is shipped as a fully compiled run-time version for Sun
SparcStations (2 or 10). (Depending on customer requirements, it may
become available for other machines as well.)

ENGCG is based on the Constraint Grammar framework originally proposed
by Fred Karlsson. The theoretical background as well as the English
description is documented as a book to appear under the title:

Karlsson, F., Voutilainen, A., Heikkila, J. & Anttila, A.
(forthcoming). ``Constraint Grammar: a Language-Independent System
for Parsing Unrestricted Text''. To be published by Mouton de Gruyter.

A short description of the main modules of ENGCG:

 Preprocessing:
    * sentence boundary determination
    * normalisation of typographical conventions
    * detection of fixed expressions,
      e.g. multiword prepositions and compounds

 Morphological description:
  -- ENGTWOL, a TWOL-style morphological description
    * 56,000 entries
    * accounts for all inflected and central derived forms
  -- Morphological heuristics
    * a heuristic module that assigns ENGTWOL-style descriptions
      to those words not recognised by ENGTWOL

 English Constraint Grammar:
  (i) grammar for morphological (e.g. part-of-speech) disambiguation
    * 1,100 `grammar-based' constraints
        * 99.7--100% of all words retain the appropriate
          morphological reading
        * 3--6% of all words remain (partly) ambiguous
    * 200 `heuristic' constraints
        * resolve some 50% of remaining ambiguities
        * after heuristic disambiguation, 99.5% or more
          retain the appropriate morphological reading
  (ii) grammar for determining syntactic functions
    * 250 syntactic constraints for syntactic ambiguity resolution
        * some 75--85% of all words become syntactically unambiguous
        * some 95.5--98% of all words retain the appropriate
          syntactic-function tag
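
The disambiguation step described above can be pictured with a small
sketch. It is a toy illustration only, assuming that each word arrives
with a list of candidate readings and that a constraint discards a
reading in a given context without ever discarding a word's last
remaining reading. The tags (DET, N, V), the single constraint, and the
function names are hypothetical; they are not taken from ENGTWOL, from
the ENGCG grammars, or from the Constraint Grammar rule formalism.

```python
# Toy illustration only: tags, constraint, and names are hypothetical and
# are not taken from ENGTWOL or the ENGCG grammars.

def remove_reading(readings, tag):
    """Discard readings equal to `tag`, but never discard the last reading."""
    kept = [r for r in readings if r != tag]
    return kept if kept else readings


def disambiguate(sentence):
    """sentence: list of (word, [candidate tags]); returns a pruned copy."""
    result = [(word, list(tags)) for word, tags in sentence]
    for i in range(1, len(result)):
        prev_tags = result[i - 1][1]
        word, tags = result[i]
        # Hypothetical constraint: directly after an unambiguous determiner,
        # the verb reading is discarded.
        if prev_tags == ["DET"]:
            result[i] = (word, remove_reading(tags, "V"))
    return result


sentence = [("the", ["DET"]), ("run", ["N", "V"]), ("ended", ["V"])]
analysed = disambiguate(sentence)
```

In this sketch "run" keeps only its noun reading, while a word whose
only reading is V would keep it even after a determiner, so every word
always retains at least one reading.
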

Speed of analysis on a Sun SparcStation 2:
  -- There are two C implementations of the Constraint Grammar Parser.
     With the `academic' version, written by Bart Jongejan, CRI A/S,
     analysis speed is:
    * preprocessing, morphological analysis,
      morphological disambiguation: 35--55 words per second
    * preprocessing, morphological analysis,
      morphological disambiguation, syntactic analysis:
      15--25 words per second

This offer concerns non-commercial academic research purposes. ENGCG
is distributed on a sublicence basis to academic departments. If your
department wants to obtain the right to use ENGCG, please request a
copy of the requisite Licence Agreement by sending the name and
address of your department and the responsible person to

Atro Voutilainen
Dept. of General Linguistics
P.O. Box 4, University of Helsinki
FIN-00014 University of Helsinki

e-mail: avoutila@ling.Helsinki.FI
fax: +358 0 191 3598

A Licence Agreement form will be sent to you promptly. When the form
has been properly completed and returned, and the fee of 1,500 US
dollars paid, the software will be shipped immediately.

The package contains the following items:
 - ENGCG on a 3.5 inch HD diskette
 - book manuscript
 - a short User's Manual

Contact Atro Voutilainen or Fred Karlsson for further details.

There is also a production version of the parser, written by Pasi
Tapanainen. It is some 20--25 times faster than the academic version.
(Those interested in non-academic use of ENGCG should contact Mr.
Krister Linden.)
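
As a rough worked example of what these figures imply, the sketch below
combines the academic-version speeds quoted for the SparcStation 2 with
the 20--25 times speed-up. It assumes that the speed-up applies equally
to both configurations and that the low and high ends of the ranges can
be paired; neither assumption is stated in the announcement, and the
function and variable names are made up for illustration.

```python
# Figures quoted in the announcement (academic version, Sun SparcStation 2).
ACADEMIC_WPS = {
    "morphology": (35, 55),            # words per second
    "morphology + syntax": (15, 25),   # words per second
}
SPEEDUP = (20, 25)  # production version: "some 20--25 times faster"


def production_range(academic, speedup=SPEEDUP):
    """Assumption: pair low with low and high with high to get a range."""
    return (academic[0] * speedup[0], academic[1] * speedup[1])


def seconds_for(words, wps_range):
    """Return (slowest, fastest) processing time in seconds."""
    low, high = wps_range
    return (words / low, words / high)


full_production = production_range(ACADEMIC_WPS["morphology + syntax"])
morph_production = production_range(ACADEMIC_WPS["morphology"])
test_text_time = seconds_for(300, ACADEMIC_WPS["morphology + syntax"])
```

Under these assumptions the production version would run at roughly
300--625 words per second for the full analysis, and a 300-word text
would take about 12--20 seconds with the academic version.
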

For the time being, texts of up to 300 words can be analysed with
ENGCG, free of charge, for testing purposes, by sending the text as an
e-mail message to <>. The analysis is sent by return mail. More specific
instructions about testing ENGCG can be obtained by sending a mail
message to <>.