LINGUIST List 4.918

Thu 04 Nov 1993

Sum: Japanese Tokenization



Directory

  1. Tom Donaldson, Japanese Tokenization: Summary Of Responses

Message 1: Japanese Tokenization: Summary Of Responses

Date: Thu, 4 Nov 93 09:18:25 -0500
From: Tom Donaldson <tomd@dfhake.pls.com>
Subject: Japanese Tokenization: Summary Of Responses
The following is a long overdue summary of responses to a query
concerning Japanese tokenizers:
> I am looking for software to tokenize a stream of Japanese text
> into "words" in a fashion that would be acceptable to the
> majority of native speakers of Japanese. I understand that it
> is often difficult because of a lack of word delimiters in much
> Japanese text.
>
> Does anyone know of any commercial libraries,
> soon-to-be-available software, shareware, freeware, or
> any-kinda-ware for tokenization of Japanese? Are there any
> accepted algorithms for tokenizing Japanese?
>
> To date I have located one outfit in Florida, USA, that is
> working on such a tokenizer, but there must be others.
I posted queries to two mailing lists apart from LINGUIST:
1) INSOFT-L is dedicated to the internationalization of software. To
subscribe, send a message to listserv@cis.vutbr.cz with a message body
of the following form:
 SUB INSOFT-L Yourfirstname Yourlastname
2) LANTRA is a language interpretation and translation list.
To subscribe, send a message to LISTSERV@SEARN.SUNET.SE. I somehow
failed to save my subscription message to this list, but the message
body should be either:
 SUB LANTRA-L YourFirstName YourLastName
or:
 SIGNON LANTRA-L YourFirstName YourLastName
Some of the responses came by rather circuitous routes. Thanks to all
of you who responded or passed my query along to others.
I have shortened some of the messages. The {stuff removed} was not
relevant to this summary.
==================================
The outfit in Florida that I had already located is Linguistic
Software Solutions (LSS). It is the sole marketer of products from
SoftArt. The only email address I have is for Lane Carder at SoftArt:
 soft.art@AppleLink.Apple.COM
The contact information I have for LSS:
 Patricia Carder
 Linguistic Software Solutions, Inc.
 606 Bald Eagle Dr., Suite 203
 Marco Island, FL 33937
 USA
There is an article on SoftArt in "Language Industry Monitor", #16,
July-August 1993:
 L A N G U A G E   I N D U S T R Y   M O N I T O R
 "The World of Natural Language Computing"
 ISSN 0925-3327
 Eerste Helmersstraat 183
 1054 DT Amsterdam, The Netherlands
 Tel: + 31 20 685-0462 Fax: +31 20 685-4300
 Internet: colinb@paramount.nikhefk.nikhef.nl
 CompuServe: 70023,1164
tomd
==================================
==================================
Another promising outfit is MRJ. I received a phone call from Ken
Sheers, who saw a message on CompuServe that someone from LANTRA
posted there (thanks to whoever did it!) ...
MRJ will have a segmenter real soon now. Seems like a very bright,
energetic crew.
 Ken Sheers
 MRJ, Inc.
 10455 White Granite Dr.
 Oakton, VA 22124
 Voice: (703) 385-0830
 e-mail: ksheers@mrj.com
tomd
==================================
==================================
I got several responses regarding the Juman tokenizer mentioned in the
following message. I have not yet been able to connect to the ftp
site, but you may have better luck. tomd
==================================
Hello,
Here is a freeware program (almost; see below) for segmenting and
part-of-speech tagging Japanese text:
Name of the program : JUMAN
developed by University of Kyoto
ftp : pine.kuee.kyoto-u.ac.jp
 pub/juman/newjuman.tar.Z
After uncompressing and untarring, please look at doc/main.tex
(it may be in Japanese only); it tells how to install the program.
Our lab, the Computing Research Lab, has been using this program
for the past two years and the performance is reasonable considering
this is a public domain program. As you may know, all the big Japanese
electronics/computer companies have better segmentation programs but
they are not redistributable (we cannot get them in any way).
One thing you should do before getting this software from the ftp site is
to get permission from the developer of the software, Professor Matsumoto
(email : matsu@is.aist-nara.ac.jp). I understand that this is just for
record-keeping.
Cheers,
Takahiro Wakao
Computing Research Lab
New Mexico State University
twakao@crl.nmsu.edu
==================================
These folks at IBM appear to be offering support for quite a few
languages. tomd
==================================
>From: "Brian Gessel (919-469-7741 TL883)" <briang@vnet.IBM.COM>
X-Addr: PRGS Natural Language Processing
 P.O. Box 60000, Mailstop TH8/5W/661
 8000 Regency Parkway
 Cary, NC 27511 Internet: briang@vnet.ibm.com
Tom,
We intend to offer word segmentation for Japanese, as well as morphology
(stemming, inflection) for the European languages you mentioned. Of
course, we can also offer spell aid, hyphenation support, and synonyms
for a wide variety of European languages.
We currently run under DOS, Windows, OS/2 (16- and 32-bit), and AIX (a
form of Unix). We also intend to port the linguistic service to
Macintosh System 7 and additional Unix platforms. Our initial targets
were Sun (probably Solaris) and HP. I am not sure we will cover SunOS,
if it is very different than Solaris, but we want to make the service as
portable as possible. We have no plans for porting to VMS, but are open
to significant business opportunities. One caveat--we do not currently
support Japanese under DOS or Windows, but, again, will consider
requirements as they are presented.
Although we do not yet have all the machinery in place for full-scale
OEM marketing, we are open for business and are handling requests on an
ad hoc basis. If you would like further discussion of our plans, we
will probably need to put a confidential disclosure agreement (CDA)
in place.
{stuff removed}
Regards,
Brian
==================================
Of course all you insoft-l folks already know about Ken Lunde's book.
It is not a source of Japanese tokenization software, but does provide
a lot of other useful information. I have severely truncated the
message, so if you need ordering information, contact Ken. tomd
==================================
>From: lunde@mv.us.adobe.com (Ken Lunde)
>Date: Thu, 16 Sep 93 13:41:38 PDT
>Subject: "Understanding Japanese Information Processing" released!
 It's here! My book entitled "Understanding Japanese Information
Processing" was released yesterday (9/15/93), and I received the author
copies this morning. I am quite pleased with the final result.
 Many of you inquired about being told when the book is available.
Well, the time is now. Expect to start seeing it in book stores in about
2 weeks or so.
 I am appending international ordering information below. Note
that most accept orders and inquiries by e-mail.
 Please contact me in the case of questions...
-- Ken Lunde
{stuff removed}
==================================
My company is not a member of the following consortium, and I have not
yet received information regarding the consortium. It may be
appropriate for some of you, however. tomd
==================================
>From: davidc@titan.wordperfect.com (David Cook (Unix Dev))
 There is lots of interesting stuff at the CLR at New Mexico State
University. See the information below. I hope it is helpful. Feel
free to post this to the mailing list. Also I would appreciate it if
you would mail me a summary of the replies you get.
 Thanks,
 David Cook
 WordPerfect Japanese Development
=====================================================================
 CONSORTIUM FOR LEXICAL RESEARCH
 Computing Research Lab
 New Mexico State University
 Las Cruces, NM 88003
 phone: (505) 646-5466
 fax: (505) 646-6218
 email: lexical@nmsu.edu
 bitnet: lexical at nmsu
 *********
Consortium for Lexical Research Newsletter 6
From the Computing Research Laboratory, New Mexico State University
Edited by - Margarita Gonzalez and Jim Cowie.
Contributions and inquiries to - lexical@nmsu.edu OR lexical@nmsu.bitnet
FTP address for accessing materials - clr.nmsu.edu [128.123.1.12]
This newsletter is distributed in plain ASCII, but it is also
available in postscript from clr.nmsu.edu. The directory is newsletter
and the file is news6.ps.
Information on using the CLR archives and on becoming a member of CLR
can be obtained by emailing lexical@nmsu.edu.
Recently Added Materials
Two new items are available this month: JUMAN is a segmenter and
part-of-speech tagger for Japanese text, and TRIG is a parser for
English which uses a Link Grammar.
JUMAN
Ftp Directory: members-only/lexica/JUMAN-MCC/
Juman is a program which segments Japanese into words and tags these
words with parts of speech. It was produced at Kyoto University and
then heavily modified by researchers at MCC. The tables used for
tagging are generated by a Prolog program, but the program which
actually does the tokenizing and tagging is written in C, so that
users do not need to have a working Prolog implementation if they just
want to use Juman.
CLR Membership
The members-only area of the CLR archives is rapidly increasing its
volume with valuable materials and software available to lexical
researchers, members of the consortium. If your interests lie in
lexicology, lexicography and lexical research, we encourage your
organization to become a member, promoting the use of these valuable
resources for lexical research and ensuring that they can be maintained.
==================================
This one shows promise for the future ...
==================================
>From: cwh@world.std.com (Carl W Hoffman)
My company has developed a program which indexes Japanese text through
the use of a Japanese dictionary. One step in the indexing process is
tokenization. Our code does not do a perfect job of tokenizing Japanese
(this is an area of active research); however, it does correctly parse
conjugated verbs and adjectives in nearly all of the cases we have
encountered. We are continuing development of this program to improve
its linguistic abilities.
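
For illustration only: the message does not describe how the Basis
Technology tokenizer works, so the following is not their method. A common
textbook way to sketch dictionary-driven tokenization is greedy
longest-match lookup. The Python sketch below assumes a hypothetical toy
dictionary (TOY_DICTIONARY) and a hypothetical function name
(longest_match_tokenize); it does not handle conjugation, and a real
system would need a full lexicon and morphological rules.

```python
def longest_match_tokenize(text, dictionary, max_len=None):
    """Greedy left-to-right longest-match segmentation (illustrative only)."""
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry starts here: emit one character as an unknown token.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy dictionary; a real lexicon would be far larger.
TOY_DICTIONARY = {"私", "は", "日本語", "日本", "語", "を", "勉強", "します"}

tokens = longest_match_tokenize("私は日本語を勉強します", TOY_DICTIONARY)
print(tokens)
```

Greedy matching is simple but can choose the wrong split when a long
entry overlaps the start of the following word, which is one reason the
problem remains an area of active research.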
We have also incorporated our tokenizer into a Japanese text browser.
Our browser enables a native English speaker with a modest knowledge of
Japanese to read difficult Japanese electronic texts. Using a mouse, you
can click on various words in the text to see English definitions. The
browser currently runs either under English DOS (requires VGA) or under
Japanese DOS/V. A Microsoft Windows version of the browser is under
development.
Sincerely,
Carl Hoffman
President
 ------------------------------------------------------------------------------
Carl Hoffman 304 Newbury Street Tanaka Building 3F
Basis Technology Corp. Boston, MA 02115 1-9-6 Iwamotocho, Chiyoda-ku
 U.S.A. JAPAN
IN: cwh@std.com Tel: 617-262-2062 Tel: 03-3863-2997
CIS: 76416,3365 Fax: 617-262-4284 Fax: 03-3863-2998
Thanks again for your assistance.
Tom
 # Tom Donaldson 2400 Research Blvd., Suite 350 #
 # Senior Software Developer Rockville, MD 20850 #
 # Personal Library Software (301) 208-1222, FAX: (301) 963-9738 #
 # e-mail: tomd@pls.com #