Editor for this issue: <>
I received a number of responses to my query about morphological
analyzers. There are two I can publicly give on the net. In addition,
a major U.S. company will soon be making a major product announcement
about their morphological analyzer. And there seem to be quite a few
"in-house" analyzers developed independently by various private
companies. Most of these I cannot pass on publicly to the net without
permission from the companies involved. The information I received on
publicly available products follows.

PC-KIMMO:

>In vol 4-401 of the linguist list, Tom Donaldson posted a reply to a
>similar query, so you may want to get it from the archives. I've
>enclosed some excerpts from his reply pertaining to PC-KIMMO, since
>I've used it myself. It's a free tool written in C for writing
>morphological analyzers. It comes w/ a simple analyzer and small
>lexicon for English. It can be acquired from the Consortium for
>Lexical Research at clr.nmsu.edu [128.123.1.11]; cf. the
>sub-directories of /pub/tools/ling-analysis. Send e-mail inquiries to
>lexical@nmsu.edu.
>PC-Kimmo is a microcomputer version of the KIMMO morphological analyzer
>available via ftp. To contact the developers:
>
>   Academic Computing Department
>   PC-KIMMO project
>   7500 W. Camp Wisdom Road
>   Dallas, TX 75236
>   U.S.A.
>
>   phone: 214/709-3346, -2418
>   fax: 214/709-24333
>   email: Evan.Antworth@sil.org
>
>LINGSOFT, INC.:
>Dear Mr. Deane,
>I noticed your message in The Linguist 1/4/94 regarding a query on
>Morphological Analyzers for commercial use.
>Our company has been working in this area since 1986 and has
>morphological analyzers for a number of different languages, including
>and especially for English.
>Our systems have proved to be very fast and also provide very wide
>coverage in text analysis. Our German system has, last month, won a
>competition in Germany for the best overall German morphological
>analyzer amongst seven other German systems from seven different
>German universities.
>I have included below some information on our ENGCG (English
>Constraint Grammar system) and brief information on Lingsoft and our
>other products.
>Our products are focused on commercial use and we are flexible in
>negotiating appropriate software licenses to meet your
>requirements.
>I look forward to your reply.
>Best regards,
>Eugene Young.
>...............................................................
>Eugene Young          : eyoung@ling.Helsinki.fi (email)
>Lingsoft, Inc.        : +358 0 499 556 (ph)
>Museokatu 18 A 3      : +358 0 440 602 (fax)
>FIN-00100 Helsinki    :
>FINLAND               :
>...............................................................
>******************** Further information follows ***********************
>        ENGCG - A Constraint Grammar Parser of English.
>ENGCG is based on the Constraint Grammar framework originally proposed by
>Prof. Fred Karlsson.
>ENGCG consists of the following main modules:
>Preprocessor
>    * sentence boundary determination
>    * normalisation of typographical conventions
>    * detection of fixed expressions, e.g. multiword prepositions and
>      compounds.
>ENGTWOL, a TWOL-style morphological description
>    * 56,000 entries
>    * accounts for all inflected and central derived forms,
>    * no two-level rules,
>    * 147 sublexicons,
>    * approximately 5400 compounds,
>    * approximately 580 idioms,
>    * 159 features consisting of:
>        - 110 morphosyntactic features
>        - 18 derivational features
>        - 18 stylistic features
>        - 13 punctuator features.
>Morphological Heuristics
>    * a heuristic module that assigns ENGTWOL-style descriptions to
>      those words not recognised by ENGTWOL.
>ENGCG - English Constraint Grammar
>  i) grammar for morphological (e.g. part-of-speech) disambiguation,
>    * 1,100 'grammar-based' constraints,
>    * 99.7-100% of all words retain the appropriate morphological reading,
>    * 3-6% of all words remain (partly) ambiguous,
>    * 200 'heuristic' constraints,
>    * resolves some 50% of remaining ambiguities,
>    * after heuristic disambiguation, 99.5% or more retain the appropriate
>      morphological reading.
>  ii) grammar for determining syntactic functions
>    * some 200 mapping statements,
>    * 250 syntactic constraints that discard contextually illegitimate
>      syntactic-function tags,
>    * some 75-85% of all words become syntactically unambiguous,
>    * some 95.5-98% of all words retain the appropriate syntactic-function tag.
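[The disambiguation figures quoted above can be combined in a rough
back-of-the-envelope calculation. The sketch below is not part of
Lingsoft's material. It assumes that the "some 50%" heuristic resolution
rate applies uniformly to the 3-6% of words left ambiguous by the
grammar-based constraints; the function name is made up for
illustration.]

```python
def residual_ambiguity(ambiguous_after_grammar, heuristic_resolution=0.5):
    """Fraction of words still ambiguous after the heuristic constraints.

    Assumption (ours, not Lingsoft's): the heuristics resolve a fixed
    share of the words that the grammar-based constraints left ambiguous.
    """
    return ambiguous_after_grammar * (1.0 - heuristic_resolution)


# The quoted range: 3-6% of words remain (partly) ambiguous after the
# grammar-based constraints have applied.
for rate in (0.03, 0.06):
    left = residual_ambiguity(rate)
    print(f"{rate:.1%} ambiguous after grammar -> {left:.1%} after heuristics")
```

[Under that assumption, roughly 1.5-3% of words would still be ambiguous
after the heuristic pass.]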
>For the time being, texts of up to 300 words can be analysed with
>ENGCG, free of charge, for testing purposes, by sending the text as an
>email message to engcg@ling.helsinki.fi. The analysis is sent via
>return mail. -- More specific instructions about testing ENGCG can be
>obtained by sending a mail message to engcg-info@ling.helsinki.fi.
>*****************************************************************************
>Lingsoft, Inc.
>Museokatu 18 A 3, 00100 Helsinki, Finland. Ph: 358 0 499 556 Fax: 358 0
>440 602
>_______________________________________________________________________________
>1 Introduction
>Lingsoft, Inc. is a linguistic software company based in Helsinki,
>Finland. Lingsoft specializes in providing high quality linguistic
>software for text retrieval and information management systems with an
>emphasis on the processing of English, German, Danish, Swedish, and
>Finnish. The methods and technologies Lingsoft uses are language
>independent, hence Lingsoft also supports Estonian, Russian, and Swahili,
>and will be supporting the following languages in the near future:
>French, Italian, and Norwegian.
>Lingsoft's business strategy is to apply state-of-the-art linguistic
>technologies to enhance text retrieval and information management
>systems for a variety of languages. In doing so we strive to provide
>fast and accurate linguistic software that will enhance the
>productivity of existing systems. We currently provide modules for:
>i) search stem formation and morphological analysis for base-form
>reduction of inflected forms for information retrieval; ii)
>high-performing noun phrase extraction, for English, for text indexing
>and information retrieval as well as robust surface syntactic analysis
>for unrestricted text; iii) automatic hyphenation and spelling
>verification and correction for word processing, typesetting, and
>desktop publishing systems.
>2 Background
>---------------
>Lingsoft was founded in August 1986 by Professor Kimmo Koskenniemi,
>Professor Fred Karlsson, and Mr Keijo Kaivanto.
>Professor Kimmo Koskenniemi is the developer of the language independent
>morphological analysis method called the "Koskenniemi Two-Level Model for
>morphology" which has gained general recognition as the only method
>truly applicable to any language, and reasonably efficient (with
>speeds up to 1000 words per second with large dictionaries on
>mainframes and UNIX hosts, and 100 words or more per second on
>personal computers).
>Professor Fred Karlsson is Professor of Linguistics at the University
>of Helsinki and Head of Research Unit for Computational Linguistics
>(RUCL). He is also the author of hyphenation logic for Finnish and
>Swedish, speller for Swedish, and developer of the Constraint Grammar
>Parser and grammar formalism.
>Lingsoft has a team of linguists and programmers developing new
>products, and a number of consulting advisers guaranteeing the best
>scientific quality of the products. Mr Krister Linden, M.SC. MA.,
>is the managing director.
>The theoretical foundations of the methods used in our software
>products have been developed at the Department of General Linguistics,
>and RUCL, both at the University of Helsinki. The methods developed
>in Helsinki have gained worldwide recognition and are currently used
>at dozens of universities around the world.
>Lingsoft is also involved with several European Commission - Eureka
>and LRE projects, such as GRAAL, DELIS, and TRANSTERM. In the GRAAL
>project, Lingsoft is cooperating with Nokia, where the
>surface-syntactic parser will be used in text-indexing and domain
>specific knowledge-extraction applications. This project has several
>industrial and academic partners from France, Italy, Germany,
>Portugal, Greece and Switzerland, ranging from car manufacturers to
>helicopter builders and telecommunications providers. Within this
>project Lingsoft aims at developing a French, German and an Italian
>surface-syntactic parser.
>In the smaller DELIS project, Lingsoft is
>using the tool for corpus-processing but the aim is to develop methods
>for lexical semantic descriptions. This project has members from
>Germany, Italy, France, Holland and England, where the commercial
>partners are dictionary publishers.
>3 Summary
>------------
>Lingsoft's software modules have been in successful commercial use
>since 1986. Our list of clients includes, amongst others, the largest
>Finnish newspaper and magazine publishers, government departments,
>Finnish subsidiaries of multinational corporations, specialists in the
>field of text indexing and information retrieval systems, and several
>international software developers and manufacturers.
>Lingsoft aims to provide high performance state-of-the-art linguistic
>software in a variety of Nordic and European languages. Lingsoft's
>continuous internal product development and close association with
>RUCL at the University of Helsinki and other advanced computational
>linguistics research facilities internationally ensure that the
>methods and algorithms used are well researched and scientifically
>proven.
>Lingsoft is in a position to offer a variety of advanced linguistic
>tools in the following areas across a number of different operating
>platforms (from mainframes to PCs):
>    * morphological analysis and generation;
>    * stemming for information retrieval;
>    * part-of-speech tagging;
>    * noun phrase extraction for running text;
>    * surface syntactic analysis;
>    * grammar checkers, currently for Finnish only;
>    * hyphenation and spell-checking.
>Our strategy is to integrate state-of-the-art linguistic technologies
>to provide a fast and accurate method to further enhance the
>functionality of new and existing text retrieval and information
>management systems.
>If you have questions, need more specific information, or need to
>discuss your application, please contact Eugene Young
>(eyoung@ling.helsinki.fi).
>Products available:
>* Base form reduction and search form production (Morphological Analyzers).
>    * English      * Finnish      * Russian
>    * German       * Danish       * Estonian
>    * Swedish
>* Terminology identification and syntactic analysis for English.
>* Hyphenation and spell-checking (languages currently supported)
>    * Finnish      * Swedish      * Russian
>* Other modules:
>    - Finnish grammar checker
>    - Module for the retrieval of Russian names written according to
>      Finnish, Swedish, English, German, or French spelling
>      conventions. Converts Russian names written according to the
>      conventions of other languages into the Finnish convention, thus
>      facilitating correct matches despite the variation.
>*******************
>PRODUCT INFORMATION
>*******************
>Base form reduction and search form production (Morphological Analyzers).
>-------------------------------------------------------------------------
>* English: contains 75,000 base forms, recognizing over 300,000 word forms.
>* German: contains 70,000 base forms, recognizing over 500,000 word
>forms and an infinite number of new compounds, and is currently being
>extended with material from German newspaper text.
>* Swedish: contains almost 60,000 base forms based on the Svenska
>Akademins Ordlista, which serves as the norm for the Swedish
>language. correct words.
>* Danish: contains 35,000 roots and is based on Bylendals
>Retskrivnings Ordbogen.
>* Finnish: contains 40,000 roots at the moment, but one verb root in
>Finnish may have 18,000 inflected forms and one noun some 2,000 forms.
>The analyzer is also able to recognise new compounds, which for all
>practical purposes makes the number of recognized word forms infinite.
>* Estonian: contains 35,000 base forms at the moment and a compounding
>and word mechanism similar to Finnish, which for all practical
>purposes makes the number of recognized word forms infinite. The
>analyzer is based on Ulle Viks' Morphological Dictionary for
>Estonian.
>* Russian: contains approximately 80,000 base forms. It is based on
>the morphological word-book of Zalisnyak, but the words have been
>selected based on corpus material and extensive additions and
>corrections have been made to the compounding mechanism.
>Terminology identification and syntactic analysis
>-------------------------------------------------
>* ENGDIS: A part-of-speech disambiguator for English with 99.7-100%
>correctness on restricted text with 3-6% ambiguity in the output.
>* ENGIND: A noun phrase extraction tool for indexing of unrestricted
>English text with a recall of 98.5-100% and a precision of 95-98%.
>* ENGNPG: A noun phrase grammar with a simplified function tag set
>indicating only nominal heads, nominal modifiers, verbs, adverbials,
>and conjunctions for unrestricted text (correctness 99-100%, ambiguity
>left 5-8%).
>* ENGCG: A general surface syntactic constraint grammar with a full
>functional tag set for English (correctness 96-97%, ambiguity left
>10-18%).
>Hyphenation and spell-checking
>------------------------------
>* FINHYP9 - a high quality hyphenation algorithm for Finnish, finds
>99% of the points with 99.9% correctness. It is open and rule-based
>and thus able to cover any type of words, including foreign names and
>technical terms.
>* Finnish Spell Finder - a high speed spelling-checker with a large
>compacted Finnish dictionary and a Spell Finder interface from
>Microlytics, Inc.
>* SWEHYP - a hyphenation algorithm for Swedish which is a rule-based
>algorithm (like FINHYP), and finds hyphenation points with a
>correctness of 98% or more.
>* Swedish Spell Finder - a high speed spelling checker for Swedish.
>Based on the two-level model and accepts compound words in an open but
>controlled way with an interface from Microlytics, Inc.
>* RUSHYP - a hyphenation algorithm for Russian which is a rule-based
>algorithm.
>* Russian Spell Finder - a high speed spelling-checker with a large
>compacted dictionary and a Spell Finder interface from Microlytics, Inc.
>Other language modules
>----------------------
>* FINCORR - a routine for checking the correct usage of Finnish.
>Detects and suggests correction of certain common errors such as the
>use of commas, government (e.g. 'alkaa satamaan'), and spelling of
>learned words.
>* RUSNOM - a module for the retrieval of Russian names written
>according to Finnish, Swedish, English, German, or French spelling
>conventions. Converts names written according to the conventions of
>other languages into the Finnish convention, thus facilitating correct
>matches despite the variation.
>Information on program size and performance:
>--------------------------------------------
>* Morphological Analyzers producing the baseforms and word-class tags
>have data files of approximately 1.0-1.5MB (soon to be reduced by 50%)
>and a 55kB driver.
>* Disk space requirements for spell-checkers are 240-290kB of data and
>a 50kB driver.
>* Hyphenation algorithms require approximately 90kB of memory.
>* The programs are currently available for Unix workstations, OS/2,
>Windows and PCs with DOS/Extender.
>* Language analysis performance is dependent on the language being
>analysed and the tools used for the analysis. On a Sun SPARCstation
>2, the analysis performance is 100-1000 words per second.
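
[To put the quoted throughput range in practical terms, the sketch below
estimates wall-clock time for a batch run. It is not from Lingsoft's
material: the one-million-word corpus size is a hypothetical figure
chosen for illustration, the function name is made up, and the
calculation assumes a constant analysis rate.]

```python
def analysis_time_seconds(num_words, words_per_second):
    """Estimated time to analyse num_words at a constant rate."""
    if words_per_second <= 0:
        raise ValueError("words_per_second must be positive")
    return num_words / words_per_second


corpus_words = 1_000_000  # hypothetical corpus size, not from the source

# The quoted range for a Sun SPARCstation 2: 100-1000 words per second.
for rate in (100, 1000):
    seconds = analysis_time_seconds(corpus_words, rate)
    print(f"{rate} words/s: about {seconds / 60:.0f} minutes")
```

[At those rates a corpus of that size would take somewhere between about
17 minutes and a little under three hours.]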