The following is a long overdue summary of responses to a query
concerning Japanese tokenizers:

> I am looking for software to tokenize a stream of Japanese text
> into "words" in a fashion that would be acceptable to the
> majority of native speakers of Japanese. I understand that it
> is often difficult because of a lack of word delimiters in much
> Japanese text.
>
> Does anyone know of any commercial libraries,
> soon-to-be-available software, shareware, freeware, or
> any-kinda-ware for tokenization of Japanese? Are there any
> accepted algorithms for tokenizing Japanese?
>
> To date I have located one outfit in Florida, USA, that is
> working on such a tokenizer, but there must be others.

I posted queries to two mailing lists apart from LINGUIST:

1) INSOFT-L is dedicated to the internationalization of software.
   To subscribe, send a message to listserv@cis.vutbr.cz with a
   message body of the following form:

      SUB INSOFT-L Yourfirstname Yourlastname

2) LANTRA is a language interpretation and translation list. To
   subscribe, send a message to LISTSERV@SEARN.SUNET.SE. I somehow
   failed to save my subscription message to this list, but the
   message body should be either:

      SUB LANTRA-L YourFirstName YourLastName
   or:
      SIGNON LANTRA-L YourFirstName YourLastName

Some of the responses came by rather circuitous routes. Thanks to
all of you who responded or passed my query along to others.

I have shortened some of the messages. Text marked {stuff removed}
was not relevant to this summary.

==================================

The outfit in Florida that I had already located is Linguistic
Software Solutions (LSS). It is the sole marketer of products from
SoftArt. The only email address I have is for Lane Carder at SoftArt:

   soft.art@AppleLink.Apple.COM

The contact information I have for LSS:

   Patricial Carder
   Linguistic Software Solutions, Inc.
   606 Bald Eagle Dr., Suite 203
   Marco Island, FL 33937
   USA

There is an article on SoftArt in "Language Industry Monitor", #16,
July-August 1993:

   L A N G U A G E   I N D U S T R Y   M O N I T O R
   "The World of Natural Language Computing"
   ISSN 0925-3327
   Eerste Helmersstraat 183
   1054 DT Amsterdam, The Netherlands
   Tel: +31 20 685-0462
   Fax: +31 20 685-4300
   Internet: colinb@paramount.nikhefk.nikhef.nl
   CompuServe: 70023,1164

tomd
==================================
==================================

Another promising outfit is MRJ. I received a phone call from Ken
Sheers, who saw a message on CompuServe that someone from LANTRA
posted there (thanks to whoever did it!) ... MRJ will have a
segmenter real soon now. Seems like a very bright, energetic crew.

   Ken Sheers
   MRJ, Inc.
   10455 White Granite Dr.
   Oakton, VA 22124
   Voice: (703) 385-0830
   e-mail: ksheers@mrj.com

tomd
==================================
==================================

I got several responses regarding the Juman tokenizer mentioned in
the following message. I have not yet been able to connect to the
ftp site, but you may have better luck.

tomd
==================================

Hello,

Here is a freeware program (almost; see below) for segmenting and
part-of-speech tagging Japanese text:

   Name of the program : JUMAN
   developed by University of Kyoto
   ftp : pine.kuee.kyoto-u.ac.jp
         pub/juman/newjuman.tar.Z

After uncompressing and untarring, please look at doc/main.tex (it
may be only in Japanese); it tells how to install the program.

Our lab, the Computing Research Lab, has been using this program for
the past two years and the performance is reasonable considering this
is a public domain program. As you may know, all the big Japanese
electronics/computer companies have better segmentation programs but
they are not redistributable (we cannot get them in any way).

One thing you should do before getting this software from the ftp
site is to get permission from the developer of the software,
Professor Matsumoto (email : matsu@is.aist-nara.ac.jp). I understand
that this is just for record-keeping.

Cheers,

Takahiro Wakao
Computing Research Lab
New Mexico State University
twakao@crl.nmsu.edu

==================================

These folks at IBM appear to be offering support for quite a few
languages.

tomd
==================================

>From: "Brian Gessel (919-469-7741 TL883)" <briang@vnet.IBM.COM>
X-Addr: PRGS Natural Language Processing
        P.O. Box 60000, Mailstop TH8/5W/661
        8000 Regency Parkway
        Cary, NC 27511
        Internet: briang@vnet.ibm.com

Tom,

We intend to offer word segmentation for Japanese, as well as
morphology (stemming, inflection) for the European languages you
mentioned. Of course, we can also offer spell aid, hyphenation
support, and synonyms for a wide variety of European languages.

We currently run under DOS, Windows, OS/2 (16- and 32-bit), and AIX
(a form of Unix). We also intend to port the linguistic service to
Macintosh System 7 and additional Unix platforms. Our initial targets
were Sun (probably Solaris) and HP. I am not sure we will cover
SunOS, if it is very different than Solaris, but we want to make the
service as portable as possible. We have no plans for porting to VMS,
but are open to significant business opportunities.

One caveat--we do not currently support Japanese under DOS or
Windows, but, again, will consider requirements as they are
presented.

Although we do not yet have all the machinery in place for full-scale
OEM marketing, we are open for business and are handling requests on
an ad hoc basis. If you would like further discussion of our plans,
we will probably need to put a confidential disclosure agreement
(CDA) in place.

{stuff removed}

Regards,
Brian

==================================

Of course all you INSOFT-L folks already know about Ken Lunde's book.
It is not a source of Japanese tokenization software, but does
provide a lot of other useful information. I have severely truncated
the message, so if you need ordering information, contact Ken.

tomd
==================================

>From: lunde@mv.us.adobe.com (Ken Lunde)
>Date: Thu, 16 Sep 93 13:41:38 PDT
>Subject: "Understanding Japanese Information Processing" released!

It's here! My book entitled "Understanding Japanese Information
Processing" was released yesterday (9/15/93), and I received the
author copies this morning. I am quite pleased with the final result.

Many of you inquired about being told when the book is available.
Well, the time is now. Expect to start seeing it in book stores in
about 2 weeks or so.

I am appending international ordering information below. Note that
most accept orders and inquiries by e-mail.

Please contact me in the case of questions...

-- Ken Lunde

{stuff removed}

==================================

My company is not a member of the following consortium, and I have
not yet received information regarding the consortium. It may be
appropriate for some of you, however.

tomd
==================================

>From: davidc@titan.wordperfect.com (David Cook (Unix Dev))

There is lots of interesting stuff at the CLR at New Mexico State
University. See the information below. I hope it is helpful. Feel
free to post this to the mailing list. Also I would appreciate it if
you would mail me a summary of the replies you get.

Thanks,

David Cook
WordPerfect Japanese Development

=====================================================================
CONSORTIUM FOR LEXICAL RESEARCH
Computing Research Lab
New Mexico State University
Las Cruces, NM 88003
phone: (505) 646-5466
fax: (505) 646-6218
email: lexical@nmsu.edu
bitnet: lexical at nmsu
*********

Consortium for Lexical Research Newsletter 6
From the Computing Research Laboratory
New Mexico State University

Edited by - Margarita Gonzalez and Jim Cowie.
Contributions and inquiries to - lexical@nmsu.edu OR
lexical@nmsu.bitnet
FTP address for accessing materials - clr.nmsu.edu [128.123.1.12]

This newsletter is distributed in plain ASCII, but it is also
available in postscript from clr.nmsu.edu. The directory is
newsletter and the file is news6.ps. Information on using the CLR
archives and on becoming a member of CLR can be obtained by emailing
lexical@nmsu.edu.

Recently Added Materials

Two new items are available this month: JUMAN is a segmenter and
part-of-speech tagger for Japanese text, and TRIG is a parser for
English which uses a Link Grammar.

JUMAN
Ftp Directory: members-only/lexica/JUMAN-MCC/

Juman is a program which segments Japanese into words and tags these
words with parts of speech. It was produced at Kyoto University and
then heavily modified by researchers at MCC. The tables used for
tagging are generated by a Prolog program, but the program which
actually does the tokenizing and tagging is written in C, so that
users do not need to have a working Prolog implementation if they
just want to use Juman.

CLR Membership

The members-only area of the CLR archives is growing rapidly, with
valuable materials and software available to lexical researchers who
are members of the consortium. If your interests lie in lexicology,
lexicography and lexical research, we encourage your organization to
become a member, promoting the use of these valuable resources for
lexical research and ensuring that they can be maintained.

==================================

This one shows promise for the future ...

==================================

>From: cwh@world.std.com (Carl W Hoffman)

My company has developed a program which indexes Japanese text
through the use of a Japanese dictionary. One step in the indexing
process is tokenization. Our code does not do a perfect job of
tokenizing Japanese (this is an area of active research); however, it
does correctly parse conjugated verbs and adjectives in nearly all of
the cases we have encountered. We are continuing development of this
program to improve its linguistic abilities.

We have also incorporated our tokenizer into a Japanese text browser.
Our browser enables a native English speaker with a modest knowledge
of Japanese to read difficult Japanese electronic texts. Using a
mouse, you can click on various words in the text to see English
definitions. The browser currently runs either under English DOS
(requires VGA) or under Japanese DOS/V. A Microsoft Windows version
of the browser is under development.

Sincerely,

Carl Hoffman
President

------------------------------------------------------------------------------
Carl Hoffman            304 Newbury Street  Tanaka Building 3F
Basis Technology Corp.  Boston, MA 02115    1-9-6 Iwamotocho, Chiyoda-ku
                        U.S.A.              JAPAN
IN:  cwh@std.com        Tel: 617-262-2062   Tel: 03-3863-2997
CIS: 76416,3365         Fax: 617-262-4284   Fax: 03-3863-2998

Thanks again for your assistance.

Tom

# Tom Donaldson              2400 Research Blvd., Suite 350      #
# Senior Software Developer  Rockville, MD 20850                 #
# Personal Library Software  (301) 208-1222, FAX: (301) 963-9738 #
# e-mail: tomd@pls.com                                           #