LINGUIST List 5.465

Thu 21 Apr 1994

Sum: Morphological analyzers

Editor for this issue: <>


Directory

  1. Paul Deane, Sum: Morphological analyzers

Message 1: Sum: Morphological analyzers

Date: Tue, 19 Apr 1994 07:14:27
From: Paul Deane <an995@freenet.carleton.ca>
Subject: Sum: Morphological analyzers


I received a number of responses to my query about morphological
analyzers. There are two I can publicly give on the net. In addition,
a major U.S. company will soon be making a major product announcement
about their morphological analyzer. And there seem to be quite a few
"in-house" analyzers developed independently by various private
companies. Most of these I cannot pass on publicly to the net
without permission from the companies involved.

The information I received on publicly available products follows.


PC-KIMMO:

>In vol 4-401 of the linguist list, Tom Donaldson posted a reply to a
>similar query, so you may want to get it from the archives. I've
>enclosed some excerpts from his reply pertaining to PC-KIMMO, since
>I've used it myself. It's a free tool written in C for writing
>morphological analyzers. It comes w/ a simple analyzer and small
>lexicon for English. It can be acquired from the Consortium for
>Lexical Research at clr.nmsu.edu [128.123.1.11]; cf. the
>sub-directories of /pub/tools/ling-analysis. Send e-mail inquiries to
>lexical@nmsu.edu.

>PC-Kimmo is a microcomputer version of the KIMMO morphological analyzer
>available via ftp. To contact the developers:
>
> Academic Computing Department
> PC-KIMMO project
> 7500 W. Camp Wisdom Road
> Dallas, TX 75236
> U.S.A.
>
> phone: 214/709-3346, -2418
> fax: 214/709-24333
> email: Evan.Antworth@sil.org
>

LINGSOFT, INC.:

>Dear Mr. Deane,

>I noticed your message in The Linguist 1/4/94 regarding a query on
>Morphological Analyzers for commercial use.

>Our company has been working in this area since 1986 and has
>morphological analyzers for a number of different languages, including
>and especially for English.

>Our systems have proved to be very fast and also provide very wide
>coverage in text analysis. Last month our German system won a
>competition in Germany for the best overall German morphological
>analyzer, against seven other German systems from seven different
>German universities.

>I have included below some information on our ENGCG (English
>Constraint Grammar system) and brief information on Lingsoft and our
>other products.

>Our products are focused on commercial use and we are flexible in
>negotiating appropriate software licenses to meet with your
>requirements.

>I look forward to your reply.

>Best regards,
>Eugene Young.

...............................................................
>Eugene Young : eyoung@ling.Helsinki.fi (email)
>Lingsoft, Inc. : +358 0 499 556 (ph)
>Museokatu 18 A 3 : +358 0 440 602 (fax)
>FIN-00100 Helsinki :
>FINLAND :
...............................................................

>******************** Further information follows ***********************

> ENGCG - A Constraint Grammar Parser of English.

>ENGCG is based on the Constraint Grammar framework originally proposed by
>Prof. Fred Karlsson.

>ENGCG consists of the following main modules:

>Preprocessor
> * sentence boundary determination
> * normalisation of typographical conventions
> * detection of fixed expressions, e.g. multiword prepositions and
> compounds.

>ENGTWOL, a TWOL-style morphological description
> * 56,000 entries
> * accounts for all inflected and central derived forms,
> * no two-level rules,
> * 147 sublexicons,
> * approximately 5400 compounds,
> * approximately 580 idioms,
> * 159 features consisting of:
> - 110 morphosyntactic features
> - 18 derivational features
> - 18 stylistic features
> - 13 punctuator features.

>Morphological Heuristics
> * a heuristic module that assigns ENGTWOL-style descriptions to
> those words not recognised by ENGTWOL.

>ENGCG - English Constraint Grammar
> i) grammar for morphological (e.g. part-of-speech) disambiguation,
> * 1,100 'grammar-based' constraints,
> * 99.7-100% of all words retain the appropriate morphological reading,
> * 3-6% of all words remain (partly) ambiguous,
> * 200 'heuristic' constraints,
> * resolves some 50% of remaining ambiguities,
> * after heuristic disambiguation, 99.5% or more retain the appropriate
> morphological reading.
> ii) grammar for determining syntactic functions
> * some 200 mapping statements,
> * 250 syntactic constraints that discard contextually illegitimate
> syntactic-function tags,
> * some 75-85% of all words become syntactically unambiguous,
> * some 95.5-98% of all words retain the appropriate syntactic-function tag.
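
[A rough illustration of the constraint-based disambiguation idea
described above. This is only a sketch: it does not use ENGCG's rule
formalism or tag set, and the three-word lexicon, the tag names, and
both constraints are invented for the example. Each word starts with
all of its possible readings; constraints then discard readings that
are illegitimate in context, and the last remaining reading of a word
is never discarded.]

```python
# Toy constraint-style disambiguation. The lexicon, tag names, and
# constraints are hypothetical; none of them is taken from ENGCG.

LEXICON = {
    "the": {"DET"},
    "can": {"N", "V", "AUX"},
    "rusts": {"N", "V"},
}

def remove_verb_after_det(readings, i):
    """Discard verb-like readings directly after an unambiguous determiner."""
    if i > 0 and readings[i - 1] == {"DET"}:
        return readings[i] - {"V", "AUX"}
    return readings[i]

def remove_noun_after_noun(readings, i):
    """Invented heuristic: directly after an unambiguous noun, drop the noun reading."""
    if i > 0 and readings[i - 1] == {"N"}:
        return readings[i] - {"N"}
    return readings[i]

CONSTRAINTS = [remove_verb_after_det, remove_noun_after_noun]

def disambiguate(words):
    """Apply all constraints repeatedly until no reading can be discarded."""
    readings = [set(LEXICON[w]) for w in words]
    changed = True
    while changed:
        changed = False
        for i in range(len(readings)):
            for constraint in CONSTRAINTS:
                kept = constraint(readings, i)
                # Never discard the last remaining reading of a word.
                if kept and kept != readings[i]:
                    readings[i] = kept
                    changed = True
    return readings

print(disambiguate(["the", "can", "rusts"]))
```

[Readings only ever shrink, so the loop terminates. A word that no
constraint touches simply stays ambiguous, which corresponds to the
"remain (partly) ambiguous" figures quoted above.]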

>For the time being, texts of up to 300 words can be analysed with
>ENGCG, free of charge, for testing purposes, by sending the text as an
>email message to engcg@ling.helsinki.fi. The analysis is sent via
>return mail. -- More specific instructions about testing ENGCG can be
>obtained by sending a mail message to engcg-info@ling.helsinki.fi.

>*****************************************************************************

>Lingsoft, Inc.
>Museokatu 18 A 3, 00100 Helsinki, Finland. Ph: 358 0 499 556 Fax: 358 0 440 602
>_______________________________________________________________________________

>1 Introduction

>Lingsoft, Inc. is a linguistic software company based in Helsinki,
>Finland. Lingsoft specializes in providing high quality linguistic
>software for text retrieval and information management systems with an
>emphasis on the processing of English, German, Danish, Swedish, and
>Finnish. The methods and technologies Lingsoft uses are language
>independent, hence Lingsoft also supports Estonian, Russian, Swahili,
>and will be supporting the following languages in the near future:
>French, Italian, and Norwegian.

>Lingsoft's business strategy is to apply state-of-the-art linguistic
>technologies to enhance text retrieval and information management
>systems for a variety of languages. In doing so we strive to provide
>fast and accurate linguistic software that will enhance the
>productivity of existing systems. We currently provide modules for:
>i) search stem formation and morphological analysis for base-form
>reduction of inflected forms for information retrieval; ii)
>high-performing noun phrase extraction, for English, for text indexing
>and information retrieval as well as robust surface syntactic analysis
>for unrestricted text; iii) automatic hyphenation and spelling
>verification and correction for word processing, typesetting, and desk
>top publishing systems.

>2 Background
>---------------

>Lingsoft was founded in August 1986 by Professor Kimmo Koskenniemi,
>Professor Fred Karlsson, and Mr Keijo Kaivanto. Professor Kimmo
>Koskenniemi is the developer of the language independent morphological
>analysis method called the "Koskenniemi Two-Level Model for
>morphology" which has gained general recognition as the only method
>truly applicable to any language, and reasonably efficient (with
>speeds up to 1000 words per second with large dictionaries on
>mainframes and UNIX hosts, and 100 words or more per second on
>personal computers).

>Professor Fred Karlsson is Professor of Linguistics at the University
>of Helsinki and Head of the Research Unit for Computational Linguistics
>(RUCL). He is also the author of the hyphenation logic for Finnish and
>Swedish and of the speller for Swedish, and the developer of the
>Constraint Grammar Parser and grammar formalism.

>Lingsoft has a team of linguists and programmers developing new
>products, and a number of consulting advisers guaranteeing the best
>scientific quality of the products. Mr Krister Linden, M.SC. MA.,
>is the managing director.

>The theoretical foundations of the methods used in our software
>products have been developed at the Department of General Linguistics,
>and RUCL, both at the University of Helsinki. The methods developed
>in Helsinki have gained worldwide recognition and are currently used
>at dozens of universities around the world.

>Lingsoft is also involved with several European Commission - Eureka
>and LRE projects, such as GRAAL, DELIS, and TRANSTERM. In the GRAAL
>project, Lingsoft is cooperating with Nokia, where the
>surface-syntactic parser will be used in text-indexing and domain
>specific knowledge-extraction applications. This project has several
>industrial and academic partners from France, Italy, Germany,
>Portugal, Greece and Switzerland, ranging from car manufacturers to
>helicopter builders and telecommunications providers. Within this
>project Lingsoft aims at developing French, German and Italian
>surface-syntactic parsers. In the smaller DELIS project, Lingsoft is
>using the tool for corpus-processing but the aim is to develop methods
>for lexical semantic descriptions. This project has members from
>Germany, Italy, France, Holland and England, where the commercial
>partners are dictionary publishers.

>3 Summary
>------------

>Lingsoft's software modules have been in successful commercial use
>since 1986. Our list of clients includes, amongst others, the largest
>Finnish newspaper and magazine publishers, government departments,
>Finnish subsidiaries of multinational corporations, specialists in the
>field of text indexing and information retrieval systems, and several
>international software developers and manufacturers.

>Lingsoft aims to provide high performance state-of-the-art linguistic
>software in a variety of Nordic and European languages. Lingsoft's
>continuous internal product development and close association with
>RUCL at the University of Helsinki and other advanced computational
>linguistics research facilities internationally ensure that the
>methods and algorithms used are well researched and scientifically
>proven.

>Lingsoft is in a position to offer a variety of advanced linguistic
>tools in the following areas across a number of different operating
>platforms (from mainframes to PCs):

> * morphological analysis and generation;
> * stemming for information retrieval;
> * part-of-speech tagging;
> * noun phrase extraction for running text;
> * surface syntactic analysis;
> * grammar checkers, currently for Finnish only;
> * hyphenation and spell-checking.

>Our strategy is to integrate state-of-the-art linguistic technologies
>to provide a fast and accurate method to further enhance the
>functionality of new and existing text retrieval and information
>management systems.

>If you have questions, need more specific information, or need to
>discuss your application, please contact Eugene Young
>(eyoung@ling.helsinki.fi).

>Products available:

>* Base form reduction and search form production (Morphological Analyzers).
> * English * Finnish * Russian
> * German * Danish * Estonian
> * Swedish

>* Terminology identification and syntactic analysis for English.

>* Hyphenation and spell-checking (languages currently supported)
> * Finnish * Swedish * Russian

>* Other modules:
> - Finnish grammar checker
> - Module for the retrieval of Russian names written according to
>Finnish, Swedish, English, German, or French spelling
>conventions. Converts Russian names written according to the
>conventions of other languages into the Finnish convention, thus
>facilitating correct matches despite the variation.

>*******************
>PRODUCT INFORMATION
>********************

>Base form reduction and search form production (Morphological Analyzers).
>-------------------------------------------------------------------------

>* English: contains 75,000 base forms, recognizing over 300,000 word forms.

>* German: contains 70,000 base forms, recognizing over 500,000 word
>forms and an infinite number of new compounds. It is currently being
>extended with material from German newspaper text.

>* Swedish: contains almost 60,000 base forms based on the Svenska
>Akademiens Ordlista, which serves as the norm for the Swedish
>language.

>* Danish: contains 35,000 roots and is based on Gyldendals
>Retskrivnings Ordbogen.

>* Finnish: contains 40,000 roots at the moment, but one verb root in
>Finnish may have 18,000 inflected forms and one noun some 2,000 forms.
>The analyzer is also able to recognise new compounds, which for all
>practical purposes makes the number of recognized word forms infinite.

>* Estonian: contains 35,000 base forms at the moment and a compounding
>and word mechanism similar to Finnish, which for all practical
>purposes makes the number of recognized word forms infinite. The
>analyzer is based on Ulle Viks' Morphological Dictionary for
>Estonian.

>* Russian: contains approximately 80,000 base forms. It is based on
>the morphological word-book of Zalisnyak, but the words have been
>selected based on corpus material and extensive additions and
>corrections have been made to the compounding mechanism.
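
[To make the "infinite number of word forms" point concrete, here is
a minimal sketch of base-form reduction with open compounding. It is
not Lingsoft's method: the four-stem lexicon and the two endings are
invented, and a real analyzer would rely on a large lexicon plus
language-specific rules. Because compounds are recognized by
recursion rather than listed one by one, the set of accepted word
forms is unbounded even though the stem list is finite.]

```python
# Toy base-form reduction with open compounding. The stem and suffix
# lists are invented for illustration only.

STEMS = {"book", "shelf", "case", "text"}
SUFFIXES = ["", "s"]  # hypothetical inflectional endings

def split_compound(body):
    """Yield every way to split `body` into one or more known stems."""
    if not body:
        return
    if body in STEMS:
        yield [body]
    for i in range(1, len(body)):
        head = body[:i]
        if head in STEMS:
            for rest in split_compound(body[i:]):
                yield [head] + rest

def base_forms(word):
    """Return all analyses of `word` as (stem+stem+..., suffix) pairs."""
    analyses = []
    for suffix in SUFFIXES:
        if suffix and not word.endswith(suffix):
            continue
        body = word[: len(word) - len(suffix)]
        for parts in split_compound(body):
            analyses.append(("+".join(parts), suffix))
    return analyses

print(base_forms("bookcases"))
print(base_forms("textbookcase"))
```

[An unknown word yields an empty list, which is the point where a
heuristic guesser would have to take over.]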

>Terminology identification and syntactic analysis
>-------------------------------------------------

>* ENGDIS: A part-of-speech disambiguator for English with 99.7-100%
>correctness on restricted text with 3-6% ambiguity in the output.

>* ENGIND: A noun phrase extraction tool for indexing of unrestricted
>English text with a recall of 98.5-100% and a precision of 95-98%.

>* ENGNPG: A noun phrase grammar with a simplified function tag set
>indicating only nominal heads, nominal modifiers, verbs, adverbials,
>and conjunctions for unrestricted text (correctness 99-100%, ambiguity
>left 5-8%).

>* ENGCG: A general surface syntactic constraint grammar with a full
>functional tag set for English (correctness 96-97%, ambiguity left 10-18%).
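
[The message does not say how the recall and precision figures were
measured. Assuming the usual definitions (recall = correct items
found / items that should have been found; precision = correct items
found / items returned), a small worked example looks like this. The
noun phrases below are made up for the illustration.]

```python
def recall_precision(extracted, gold):
    """Return (recall, precision) for extracted items against a
    hand-checked gold set. Both collections must be non-empty."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    return len(correct) / len(gold), len(correct) / len(extracted)

# Hypothetical data: four noun phrases in the gold set, five extracted,
# one of which is spurious.
gold = {"the parser", "a large lexicon", "English text", "noun phrases"}
extracted = gold | {"of unrestricted"}

recall, precision = recall_precision(extracted, gold)
print(recall, precision)
```

[Here every gold phrase is found, so recall is 4/4, while one of the
five returned phrases is wrong, so precision is 4/5. A tool tuned the
way ENGIND is described would behave similarly: it rarely misses a
noun phrase but occasionally returns an extra one.]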

>Hyphenation and spell-checking
>------------------------------

>* FINHYP9 - a high quality hyphenation algorithm for Finnish, finds
>99% of the points with 99.9% correctness. It is open and rule-based
>and thus able to cover any type of words, including foreign names and
>technical terms.

>* Finnish Spell Finder - a high speed spelling-checker with a large
>compacted Finnish dictionary and a Spell Finder interface from
>Microlytics, Inc.

>* SWEHYP - a hyphenation algorithm for Swedish which is a rule-based
>algorithm (like FINHYP), and finds hyphenation points with a
>correctness of 98% or more.

>* Swedish Spell Finder - a high speed spelling checker for Swedish.
>It is based on the two-level model and accepts compound words in an
>open but controlled way, with an interface from Microlytics, Inc.

>* RUSHYP - a hyphenation algorithm for Russian which is a rule-based
>algorithm.

>* Russian Spell Finder - a high speed spelling-checker with a large
>compacted dictionary and a Spell Finder interface from Microlytics, Inc.

>Other language modules
>----------------------

>* FINCORR - a routine for checking the correct usage of Finnish.
>Detects and suggests correction of certain common errors such as the
>use of commas, government (e.g. 'alkaa satamaan'), and spelling of
>learned words.

>* RUSNOM - a module for the retrieval of Russian names written
>according to Finnish, Swedish, English, German, or French spelling
>conventions. Converts names written according to the conventions of
>other languages into the Finnish convention, thus facilitating correct
>matches despite the variation.

>Information on program size and performance:
>--------------------------------------------

>* Morphological Analyzers producing the baseforms and word-class tags
>have data files of approximately 1.0-1.5MB (soon to be reduced by 50%)
>and a 55kB driver.

>* Disk space requirements for spell-checkers are 240-290kB of data and
>a 50kB driver.

>* Hyphenation algorithms require approximately 90kB of memory.

>* The programs are currently available for Unix workstations, OS/2,
>Windows and PCs with DOS/Extender.

>* Language analysis performance is dependent on the language being
>analysed and the tools used for the analysis. On a Sun SPARCstation
>2, the analysis performance is 100-1000 words per second.
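
[For planning purposes, the quoted 100-1000 words per second can be
turned into a processing-time estimate. This is plain arithmetic on
the figures above; the one-million-word corpus is a hypothetical
size chosen for the example.]

```python
def analysis_time_range(n_words, slow_wps=100, fast_wps=1000):
    """Return (best, worst) analysis time in seconds for n_words,
    given the quoted range of 100-1000 words per second."""
    return n_words / fast_wps, n_words / slow_wps

best, worst = analysis_time_range(1_000_000)  # hypothetical corpus size
print(f"{best / 60:.0f} to {worst / 60:.0f} minutes")
```

[So a million words would take somewhere between roughly a quarter of
an hour and just under three hours, depending on the language and the
tools used.]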