Back in April I submitted requests for information on language stemmers
(a sub-area of "morphological analysis" involving generation of roots
and inflected forms) to two mailing lists:

LINGUIST
    Thanks to Patty Schmidt, a linguist at Logos USA, for directing me
    to the LINGUIST mailing list, and to Lifen Chen and Arlene Puryear
    (more Georgetown linguists) for directing me to Patty. I believe
    that Patty's employer, Logos USA, develops machine translation
    software.

INSOFT-L
    Mailing list primarily for those concerned with
    internationalization of software.

Thanks to all who responded! Below is a summary of responses and other
results.

========================================================================

--------------------------------------------
>From Thomas Everth of Circle Noetic Services
--------------------------------------------

Internet address: EVERTH@AppleLink.Apple.COM

I am directing the marketing of CNS (Circle Noetic Services). We are a
linguistic software company in New Hampshire (USA).

CNS has developed a product called "WordFan" that is exactly aimed at
the market that you describe. WordFan will produce all conjugated forms
of an input word, or derive the base form from any of its conjugated
forms.

The first version of WordFan will be released in mid-'93 for English.
Other languages to follow. German is high up on the list indeed.
Japanese is currently not under development, but Russian, Arabic and
most Western European languages are.

CNS was founded in 1987 and has since then provided hyphenation
algorithms (now in 29 languages) and spelling checking (now in 13
languages, incl. Arabic) for the computing industry. Our products have
been licensed by major national and international vendors for
typesetting, word processing and DTP applications.

CNS also offers or has under development:

    IOW (In Other Words), a large lexical database with over a million
    concepts and relations based on over 100,000 English words.

    Linguistic tools for OCR and handwriting recognition.

    Wordlists in many languages: conjugated, with rule markings and
    conjugation rules, morphological breakdown, morphological cross
    references,.... and many more.

The WordFan product will in the future also include many other
relations besides "conjugation" like: synonym, antonym, homolog ... etc.
Hammer -> (is a) tool -> [other tools] for example.

WordFan will also come (optionally) with algorithms to split Germanic
compound words like: "Bundestagsverwaltungshauptapparat" or other such
tongue twisting monsters. (I am German by the way.) Splitting Germanic
compound words should be a must for text retrieval software in these
languages.

Our technology is being developed by former linguists and programmers
from MIT.

Please call us at (603) 672-6151 or fax: (603) 672-8025 for further
information. Via internet use: D1634@AppleLink.Apple.COM or my personal
id: EVERTH@AppleLink.Apple.COM

------------------------------------
>From Krister Linden of Lingsoft Inc.
------------------------------------

Internet address: klinden@ling.Helsinki.FI

Lingsoft is a small software company in Finland. We specialize in
morphology and morphosyntactic analysis. Our methods are based on the
Kimmo Koskenniemi two-level model. We also sell products based on the
FiniteState-syntactic model presented last week at the EACL. That paper
received the Don Walker Award for best paper.

We have:

    0. spell-checking and hyphenation
    1. morphological analysis and generation
    2. stemming for information retrieval
    3. part-of-speech tagging ( >99% correct, <5% ambiguity)
    4. NP extraction for text indexing and retrieval
       ( >98% recall, >95% precision)
    5. surface syntactic analysis
    6. grammar checker

    English    1,2,3,4,5
    German     1 (end of May), 0,2 (end of summer)
    Swedish    0,1,2,3
    Russian    0,1,2
    Finnish    0,1,2,3,6
    Danish     1, 0,2,3 (end of year)
    Swahili    1,2

All the lexicons have between 40.000 and 80.000 roots. The programs are
programmed in C and have been ported to various platforms. The speed of
all the tools is between 600 and 1000 w/s on a Sparcstation 2.

In the near future we will have tools for French, Estonian, Italian and
Norwegian as well.

Krister Linden
Lingsoft Inc.

---
tomd: From sales literature sent via hardcopy mail, I learned that Prof
Kimmo Koskenniemi is one of the "founders and principal owners of
Lingsoft." He developed a "two-level model" of morphological analysis
that seems to be popular as the basis of software for morphological
analysis.

-------------------------------------------------------------
>From Richard Sproat of AT&T's Linguistics Research Department
-------------------------------------------------------------

Internet address: rws@research.att.com

Probably the best and most general available commercial software for
doing this kind of thing is PC-KIMMO, which you can actually get for
free by anonymous FTP. I enclose some info (dated January 92 -- I assume
it still holds) on that below. There is also a book to go with that by
Evan Antworth, which you can get from the Summer Institute of
Linguistics (address below).

For more general discussion of various methods for doing computational
morphology, you can also consult two recent MIT Press Books:

1. Computational Morphology: Practical mechanisms for the English
   lexicon. By Graeme D. Ritchie, Graham J. Russell, Alan W. Black and
   Stephen G. Pulman. ACL-MIT Press Series in Natural Language
   Processing. Cambridge, Massachusetts: MIT Press, 1992

2. And my own 1992 book in the same series, Morphology and Computation.
   Mine covers a wider variety of stuff than does the Ritchie et al.
   book.

Richard Sproat                     Linguistics Research Department
AT&T Bell Laboratories           | tel (908) 582-5296
600 Mountain Avenue, Room 2d-451 | fax (908) 582-7308
Murray Hill, NJ 07974, USA       | rws@research.att.com

---
TomD: Richard also enclosed a lengthy "news" item on PC-KIMMO from Evan
Antworth. It seemed a bit too long to include here, but see the next
item *from* Evan Antworth.

---------------------------------------------------------------------
>From Evan Antworth of Academic Computing Department, (institution???)
---------------------------------------------------------------------

Internet address: evan.antworth@sil.org

Here is some information on PC-KIMMO, a program for morphological
parsing. It has been reviewed in _Computational Linguistics_ 17:2, June
1991 and also in _Computers and the Humanities_ 26:2, April 1992. We
provide the C source code with the intention that it be used in programs
developed by the user. Of course, I cannot say whether or not it could
successfully be used in your application. Let me know if I can help you
further.

Evan Antworth
evan.antworth@sil.org

------------------------------------------

PC-KIMMO: A Two-level Processor for Morphological Analysis

WHAT IS PC-KIMMO?

PC-KIMMO is a new implementation for microcomputers of a program dubbed
KIMMO after its inventor Kimmo Koskenniemi (see Koskenniemi 1983). It is
of interest to computational linguists, descriptive linguists, and those
developing natural language processing systems. The program is designed
to generate (produce) and/or recognize (parse) words using a two-level
model of word structure in which a word is represented as a
correspondence between its lexical level form and its surface level
form.

Work on PC-KIMMO began in 1985, following the specifications of the LISP
implementation of Koskenniemi's model described in Karttunen 1983. The
coding has been done in Microsoft C by David Smith and Stephen McConnel
under the direction of Gary Simons and under the auspices of the Summer
Institute of Linguistics. The aim was to develop a version of the
two-level processor that would run on an IBM PC compatible computer and
that would include an environment for testing and debugging a linguistic
description. The PC-KIMMO program is actually a shell program that
serves as an interactive user interface to the primitive PC-KIMMO
functions. These functions are available as a C-language source code
library that can be included in a program written by the user.

[tomd: much text deleted]

HOW TO CONTACT US

PC-KIMMO is a research project in progress, not a finished commercial
product. In this spirit, we invite your response to the software and the
book. Please direct your comments to:

    Academic Computing Department
    PC-KIMMO project
    7500 W. Camp Wisdom Road
    Dallas, TX 75236
    U.S.A.

    phone: 214/709-3346, -2418
    email: evan.antworth@sil.org (Evan Antworth)

REFERENCES

Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for
    morphological analysis. Occasional Publications in Academic
    Computing No. 16. Dallas, TX: Summer Institute of Linguistics.
    ISBN 0-88312-639-7, 273 pages, paperbound.

Karttunen, Lauri. 1983. KIMMO: a general morphological processor. Texas
    Linguistic Forum 22:163-186.

Koskenniemi, Kimmo. 1983. Two-level morphology: a general computational
    model for word-form recognition and production. Publication No. 11.
    University of Helsinki: Department of General Linguistics.

----------------------
>From Ian Hersey of IBM
----------------------

Internet Address: hersey@vnet.IBM.COM

We do have a system that both lemmatizes ("stems") and generates all
inflected forms, and it is available for about 19 European languages. We
also do lemmatization for Japanese. The code is language-independent:
you just plug in the dictionary you need and go from there. This same
service also performs hyphenation (not for Japanese -- it isn't ever
hyphenated) and spell-checking. This system is available for Windows,
OS/2, AIX, VM and MVS.

I should mention that our morphological processing only handles
inflectional morphology: "compute" can generate "computes", "computed"
and "computing" (all forms of the verb "to compute"), but it will not
generate "computer". The "-er" and other affixes that change the part of
speech are known as derivational morphology, and our service doesn't
handle that area (yet).

I'm not the one to give pricing information. Please contact Brian Gessel
at 301-803-2943 for that; he's our business person. He can also provide
you with an OEM fact sheet that lists all of the languages and sizes.

Regards,
Ian

--------------------------------------------------------
>From Daniel Stieger of Institut fuer Informationssysteme
--------------------------------------------------------

Internet Address: stieger@inf.ethz.ch

[tomd: Dani Stieger is responding to a query regarding a German language
stemmer based on the "Porter algorithm."]

As I mentioned to your colleague there is no serious report about our
experiments. I am in possession of a "Semester Work" (a short report
performed by a student) about this subject. It is NOT available in
machine readable form [tomd: text deleted] AND ... it is written in
GERMAN. The Report contains also a listing of the German Porter
algorithm (written in MODULA-2 !!). Furthermore, you need the
decomposition of German words so that you are really stemming the right
(ending) part of the word (as you know, German words may be composed of
several words). For the decomposition I used an automatically generated
dictionary (215'000 German words).

[tomd: text deleted]

>You mention "Porter (1983)." Can you send me the full citation? Is
>there some way we can get the source of your experiments with the
>algorithm?
>

M.F. Porter: An Algorithm for Suffix Stripping. Program, Vol. 14, No. 3,
1980, pp. 130-137.

[tomd: text deleted]

Dani

************************************************************************
Daniel Stieger                       stieger@inf.ethz.ch
Institut fuer Informationssysteme
ETH Zentrum, IFW E43.2               Tel: +41-1-254-7226
CH - 8092 Zuerich                    Fax: +41-1-262-3973
************************************************************************

======================================================================

Thanks again for all your help,

Tom

# Tom Donaldson              2400 Research Blvd., Suite 350        #
# Senior Software Developer  Rockville, MD 20850                   #
# Personal Library Software  (301) 990-1155, FAX: (301) 963-9738   #
# e-mail: tomd@pls.com                                             #