LINGUIST List 2.45

Tuesday, 19 Feb 1991

FYI: Ling files for ftp, Software, Ling database

Editor for this issue: <>


  1. John Lawler, ftp availability of linguistic files
  2. Joseph Pentheroudakis, Product Announcement: Morfogen
  3. , Database for/with (syntactic analysis) trees

Message 1: ftp availability of linguistic files

Date: Fri, 15 Feb 91 22:10:12 EST
From: John Lawler <>
Subject: ftp availability of linguistic files
I wish to announce the availability of several files of interest to
linguists. All are available by anonymous ftp and will be
kept up to date as later material arrives.

They are:

 (1) the (semi-)official LSA list of members' e-mail addresses,
 compiled and maintained by John Moyne. Current size: 71,586 bytes
 Location: EMAILIST.LSA on LING or

 (2) an archive of all back issues of the LINGUIST list.
 Current size: 267,562 bytes.
 Location: LINGUIST.LST on LING or

 (3) the current archive of electronically-submitted responses to
 the LSA computer survey that appeared in a previous LINGUIST
 issue. Current size: 196,227 bytes.
 Location: SURVEY.LSA on LING or

 As you can see, there are two host machines at the University of
 Michigan - UM and UB. The files are on both of them. To reach
 them from the Internet (note - we are in the process of developing
 a BITnet fileserv program, but currently we cannot service BITnet
 requests - sorry):

 (a) ftp to either machine
 (b) log in as "anonymous"
 (c) send your real id as a password
 (d) "cd ling"
 (e) if you want to see the available files, use the "ls" command
 (f) to get a particular file, "get <filename>"
 (g) "quit" to get off ftp
 [all case-insensitive commands]

...and that's it. 
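
For readers who would rather script steps (b) through (g), here is a minimal
sketch using Python's standard ftplib module. The function name
fetch_ling_file and the example host and password are illustrative
assumptions, not part of the announcement; the host is left as a parameter
on purpose.

```python
from ftplib import FTP


def fetch_ling_file(ftp, filename, password, directory="ling"):
    """Follow steps (b)-(g) on an already-connected ftplib.FTP object.

    Returns (listing, data): the directory listing and the file's bytes.
    """
    ftp.login("anonymous", password)   # (b), (c): anonymous login, real id as password
    ftp.cwd(directory)                 # (d): cd ling
    listing = ftp.nlst()               # (e): ls
    chunks = []
    ftp.retrbinary("RETR " + filename, chunks.append)  # (f): get <filename>
    ftp.quit()                         # (g): quit
    return listing, b"".join(chunks)


# Usage (hypothetical host; substitute one of the two machines named above):
#   listing, data = fetch_ling_file(FTP("ftp.example.edu"), "SURVEY.LSA", "you@your.site")
```

Passing in the connection object rather than a host name keeps the sketch
independent of whichever of the two machines is used.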

This may be the beginning of a linguistics server. Hold onto your

-John Lawler

Message 2: Product Announcement: Morfogen

Date: Sun, 17 Feb 1991 14:48 MST
From: Joseph Pentheroudakis <PENTHERJCC.UTAH.EDU>
Subject: Product Announcement: Morfogen
[The following announcement advertises a commercial product.
Its posting to LINGUIST is done as an informational service to 
the members, and in NO WAY WHATSOEVER indicates endorsement of 
the product by the list, or the moderators of the list, to whom the
virtues--or failings--of the product are unknown.]



ECS announces the release of version 2.0 of Morfogen, a tool
used to develop morphological analysis grammars and to
interface with any electronic dictionary.

Morfogen consists of:

a) A rule compiler: inflectional and derivational paradigms
 are listed in a simple, textbook-style format, and are
 then compiled into a finite state machine; and

b) C source code for the morphological analysis routines:
 these routines access the compiled grammar and return the
 stem of an inflected form and the morpheme(s) matched;
 they can also be used to reject spurious derivations.
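
To make the two components concrete, here is a minimal Python sketch of the
general technique: suffix rules compiled into a small finite-state structure
(a trie over reversed suffixes), and an analysis routine that returns the
stem and the morpheme matched. The rule format, the function names and the
lexicon check are assumptions made for illustration; they are not Morfogen's
actual rule syntax or C interface.

```python
def compile_rules(rules):
    """Compile (suffix, morpheme, replacement) rules into a trie keyed on
    reversed suffix characters, a simple finite-state acceptor."""
    root = {}
    for suffix, morpheme, replacement in rules:
        state = root
        for ch in reversed(suffix):
            state = state.setdefault(ch, {})
        # "#final" is a multi-character key, so it cannot clash with a letter.
        state.setdefault("#final", []).append((suffix, morpheme, replacement))
    return root


def analyze(word, machine, lexicon):
    """Return (stem, morpheme) pairs for `word`, keeping only stems found
    in `lexicon` so that spurious derivations are rejected."""
    results = []
    state = machine
    for ch in reversed(word):
        state = state.get(ch)
        if state is None:
            break
        for suffix, morpheme, replacement in state.get("#final", []):
            stem = word[: len(word) - len(suffix)] + replacement
            if stem in lexicon:
                results.append((stem, morpheme))
    return results


# Hypothetical toy grammar and dictionary.
RULES = [("s", "PLURAL", ""), ("ies", "PLURAL", "y"), ("ed", "PAST", "")]
LEXICON = {"cat", "pony", "walk", "bus"}
MACHINE = compile_rules(RULES)
```

With this toy grammar, analyze("ponies", MACHINE, LEXICON) yields the stem
"pony" with the morpheme PLURAL, while analyze("bus", MACHINE, LEXICON)
yields nothing, because the candidate stem "bu" is not in the dictionary.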

Morfogen has been used to build morphological analyzers for
French, English and Spanish, and agglutinative languages such as
Japanese. Development time for these analyzers has ranged from
one to three weeks. The compiled analyzers for these languages
are also available, along with the morphological analysis routines.
Grammars for Italian, German, Turkish and Korean are
forthcoming. Versions exist for DOS, OS/2, SunOS, and

The compiled grammars range in size from 13K for the English
grammar to 50K for the French grammar; the object file versions
of the C routines are less than 20K. The compact size of the
grammars and the analysis routines makes them ideal for use in
a memory-resident mode.

Morfogen is an ideal product for applications requiring
morphological analysis and dictionary access. A substantial
discount is available for academic institutions. For information
or to request a demo disk, contact:

Joseph E. Pentheroudakis
Executive Communication Systems, Inc.
455 North University Avenue, Suite 202
Provo, Utah 84601

fax: (801) 374-6292
voice: (801) 377-1167

Message 3: Database for/with (syntactic analysis) trees

Date: Fri, 15 Feb 91 14:28 MET
Subject: Database for/with (syntactic analysis) trees
[Extracted from Humanist Discussion Group, Vol. 4, No. 1053. 
Monday, 18 Feb 1991.]

- Also new is a freely copyable demo version for MS-DOS.

See below for details and for a general introduction to the LDB.

The Linguistic DataBase (LDB)

The LDB is a database system developed by the TOSCA group at Nijmegen
University which allows linguists who are not experts in computing
to access syntactically analyzed corpora. The data in the database
comprises `syntactic analysis trees' of the contiguous utterances
in a natural-language text. Since these trees are built from a
continuous text, they give a good representation of actual
language use and can thus provide a testing ground for linguistic
hypotheses. The range of extractable information in such a
database is mainly dependent on the degree to which the text has
been prepared. Formerly studies of corpora were restricted to the
level of words or word-classes, but with the Linguistic DataBase
it becomes possible to extend these studies to the level of
syntax, so that larger constituents can be analyzed.

Unlike currently available database packages, the LDB has
been created specifically to handle the type of data linguists
need to analyze - a labelled tree structure with a variable
number of branches at each node and the possibility of recursion.
The LDB can be used to examine the trees on the terminal
screen, search for utterances with given properties, and handle
database-wide queries about constructs in the utterances.
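
As an illustration of the kind of data described here, the following Python
sketch models a labelled tree with any number of branches per node and
searches a set of trees for utterances with a given property. It is not the
LDB's own representation; the class and function names, and the idea of
storing a function label and a category label on each node, are assumptions
made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One constituent: two labels and any number of child constituents."""
    function: str
    category: str
    children: list = field(default_factory=list)
    word: str = ""  # filled in for leaves only

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def find(trees, predicate):
    """Return the trees that contain at least one node satisfying `predicate`."""
    return [t for t in trees if any(predicate(n) for n in t.walk())]


# Two toy utterances with made-up labels.
tree1 = Node("UTT", "S", [
    Node("SU", "NP", [Node("HD", "N", word="dogs")]),
    Node("V", "VP", [Node("HD", "V", word="bark")]),
])
tree2 = Node("UTT", "S", [Node("V", "VP", [Node("HD", "V", word="stop")])])

# Utterances containing a noun phrase that functions as subject.
hits = find([tree1, tree2], lambda n: n.function == "SU" and n.category == "NP")
```

Because children are themselves Node objects, recursion (a clause inside a
noun phrase inside a clause) needs no special handling.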

The LDB does not presume special graphics hardware. For
this reason it has been implemented for common machines (VAX and
IBM PC/AT) and common terminals (VT100, ADM3, etc.).
Where possible, special terminal features are used,
such as highlighting and graphics characters, but even on the
so-called `dumb' ADM3A the trees are represented by an
acceptable imitation of graphics. Terminal types not already
provided for can be easily installed by the user.

The LDB also does not presume a computationally expert
user. Thus control of the program is designed to be simple and
clear. The overall control is handled by a menu system, which
displays short descriptions of the choices, each of which can be
activated by a single keystroke. In the Tree Viewer, which is
used to examine an analysis tree on the terminal screen, there is
not enough space left on the screen to produce these
descriptions, so that commands (mostly of one keystroke) are
listed in abbreviated form. A description of all commands can be
accessed by a `help' command, however.

For queries going beyond a single tree, the Exploration Scheme
formalism has been developed. An Exploration Scheme consists of a
search pattern, itself a tree much like the analysis trees, and a
specification of the operations to be performed on the
information the pattern discovers. The possibilities of
Exploration Schemes are various. They range from a simple search
for a tree, in order to examine it with the Tree Viewer, to the
creation of frequency tables. The formalism is designed in such a
way that the novice can start exploring immediately. From there,
he can gradually expand his knowledge to the more complex
features. In order to facilitate formulating Exploration Schemes
the LDB has a special scheme editor.
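
The idea of a search pattern plus an operation on what it finds can be
sketched in Python as follows. This is not the Exploration Scheme formalism
itself, whose syntax is not given here; the tuple representation, the use of
None as a wildcard, the subsequence rule for children and all function names
are assumptions made for illustration.

```python
from collections import Counter


def matches(pattern, node):
    """True if `node` fits `pattern`. Both are (function, category, children)
    tuples; None in a pattern label matches any label, and the pattern's
    children must match an ordered subsequence of the node's children."""
    p_func, p_cat, p_kids = pattern
    func, cat, kids = node
    if p_func is not None and p_func != func:
        return False
    if p_cat is not None and p_cat != cat:
        return False
    remaining = iter(kids)  # shared iterator enforces left-to-right order
    return all(any(matches(p, k) for k in remaining) for p in p_kids)


def subtrees(node):
    """Yield `node` and every node below it."""
    yield node
    for kid in node[2]:
        yield from subtrees(kid)


def frequency_table(trees, pattern, key):
    """Count key(node) over every node in `trees` that matches `pattern`."""
    return Counter(key(n) for t in trees for n in subtrees(t) if matches(pattern, n))


# Toy data with made-up labels.
np1 = ("SU", "NP", (("DT", "ART", ()), ("HD", "N", ())))
np2 = ("OD", "NP", (("HD", "N", ()),))
np3 = ("OD", "NP", (("DT", "ART", ()), ("PREMOD", "ADJ", ()), ("HD", "N", ())))
t1 = ("UTT", "S", (np1, ("V", "VP", ()), np2))
t2 = ("UTT", "S", (("SU", "NP", (("HD", "PRON", ()),)), ("V", "VP", ()), np3))

# Pattern: any NP containing an article followed, somewhere later, by a noun.
pattern = (None, "NP", ((None, "ART", ()), (None, "N", ())))
table = frequency_table([t1, t2], pattern, key=lambda n: n[0])
```

Here the table counts matching noun phrases by their function label; swapping
the key function changes what is tabulated without touching the pattern.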

The LDB package comes with the Nijmegen Corpus, a 130,000
word collection of modern British English with a full syntactic
analysis of each utterance. Each node in the tree (i.e. each
constituent in the utterance) carries a function label and a
category label. In the future more corpora will become available.
Furthermore, since the database system is independent of both
formalism and language, it is possible to use it for any other
kind of analyzed corpus.

The LDB package requires (1) VAX with VMS; (2) IBM PC (AT preferred),
640K RAM, hard disk, at least one 1.2 Mb high-capacity diskette drive, MS-DOS,
no special graphics hardware; or (3) any UNIX machine, competent C-compiler,
enough knowledge about terminal and file I/O to be able to
configure the program to the system. Not copy protected. Source
code (ca. 25,000 lines of CDL2) not available.

It costs Hfl. 100 (academic institutions), Hfl. 5000 (other).
[as of Jan. 1991 Hfl. 1 is about $ 0.60]
A user manual is not included in the academic distribution;
the book Linguistic Exploitation of Syntactic Databases (see
publications) contains all necessary information and is priced at Hfl. 70.

A (fully functional) demonstration version for any MS-DOS machine with a hard disk
is available
 - on a 5.25" 360K diskette from the address below
 - by ftp at in the directory pub/LDB
 - by listserv from LISTSERVHEARN as files

For more information contact
 Hans van Halteren
 TOSCA Group
 Department of English
 University of Nijmegen
 P.O. Box 9103
 6500 HD Nijmegen
 The Netherlands
 tel: (+31)-080-512836

Publications


van Halteren, Hans and Nelleke Oostdijk. ``Using an Analyzed
Corpus as a Linguistic Database'', in Computers in Literary
and Linguistic Computing, Proceedings of the
XIIIth ALLC Conference (Norwich 1986),
John Roper (vol. ed.), J. Hamesse and A. Zampolli (series eds.)

van Halteren, Hans and Theo van den Heuvel. Linguistic
Exploitation of Syntactic Databases. (Rodopi, Amsterdam 1990).

de Haan, Pieter. ``Exploring the Linguistic Database: Noun Phrase
Complexity and Language Variation'', in Corpus Linguistics
and Beyond, Willem Meijs, ed. (Rodopi, Amsterdam 1987).