LINGUIST List 2.511

Fri 13 Sep 1991

FYI: Language identification, Languages of the World



Directory

  1. stan kulikowski ii, language identification algorithm
  2. LANGUAGES OF THE WORLD, MUNICH

Message 1: language identification algorithm

Date: Fri, 13 Sep 91 09:47:33 CDT
From: stan kulikowski ii <STANKULI@UWF.BITNET>
Subject: language identification algorithm
 USING SHORT WORDS:
 a language identification algorithm
 stan kulikowski ii
 educational research and development center
 the university of west florida
 pensacola, fl, usa 32514
abstract
 a simple algorithm is developed to identify the language of a line of text.
the hypothesis is that languages contain frequent small words (3 chars or less)
which can be used to distinguish many lines of their text. a heuristic program
has been written to acquire distinctive language signatures and test them against
text files collected from network sources. this algorithm is moderately
successful, usually identifying better than 50% of lines in worst case
analyses. when tuned, it performs better on actual network text (70-85%
success). combining this method with character level analyses from
cryptographic studies may be successful in developing a quick method of
language identification which is sensitive to quite small samples of text.
the heuristic program is available from the author upon request.
introduction
 last month in linguist list (vol 2 no 368) i put out a query for an algorithm
to identify distinctly english text mixed in texts of other languages. i had a
number of requests to report back what i found.
 i have been collecting files off the internet as samples of languages to use
in research on the uniform measurement of textual complexity for educational
purposes. at present i am identifying active network sources of natural
language diversity and attempting to quantify these for educational uses. so
far i have found fewer than a dozen languages, but i have been able to gather
several megabytes of such data. in order to calibrate software which measures
textual complexity in different languages, i feel that i need large samples of
relative purity in linguistic content.
 the general problem is that english is more or less the matrix language in
the structure of global networks. as such, english is a common contaminant in
network text written in other languages. i am not concerned at this point with
words and small phrases which are borrowed into another language: a natural
process in linguistic diachronics. i am concerned about file processes which
transfer chunks of text of one language into another. this is very common in
networking. a writer in spanish may want to quote a network source on a
programming virus and use the mailer to bring in parts of another text file.
automatic requoting of other messages in networks is easier than paraphrasing
so this form of reference is growing. the transfers often bring english into
discussions in other languages.
 to calibrate my software, i need to eliminate cross-linguistic file transfers
so later we can accurately measure these properties in active network sources.
our first pass at this is to have a student look at each message as it arrives
off the network. if big hunks of the thing are not the language expected from
that source, it can be discarded. but hand-processing hundreds of files a day
can be prone to error, especially when cross-quoted text may just be a line or
two. so i came in search of an algorithm i could use to verify large
calibration corpora and eventually use to monitor active network text sources.
 well, no one came right back with an oh-yeah so-and-so's work does that. i
did get a number of replies relayed from usenet's sci.crypt that cryptographers
use a method based on the frequency of bigram and trigram character sequences.
this may work for file-sized data, but i doubt that it would be sensitive to a
datum in the range of 40-80 bytes which is what you get in line-by-line text
transfers.
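
For orientation, the character-level method mentioned above might look roughly like the sketch below. It is not the cryptographers' actual procedure, which the replies did not spell out; the function names `ngram_profile` and `cosine` and the choice of cosine similarity as the comparison are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = " ".join(text.lower().split())  # collapse runs of whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity of two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0
```

A line of N characters yields only N - 2 trigrams, so a 40-80 byte line gives at most 78 counts to compare against a reference profile, which is the reason for the doubt expressed above about sensitivity at line level.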
 my original request hypothesized that the frequent occurrence of short words
in a language may be line-by-line distinct. since nobody said nay, i wrote a
small program to test that notion.
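
The short-word hypothesis can be illustrated with a minimal sketch. This is not the author's heuristic program, which is available only on request; the function names, the `top_n` cutoff, the tie-breaking rule, and the toy `SAMPLES` lines are all assumptions invented for illustration. Only the cutoff of 3 characters comes from the abstract.

```python
from collections import Counter

MAX_LEN = 3  # "short word" cutoff from the abstract (3 chars or less)

def short_words(line):
    """Return lowercase alphabetic tokens of at most MAX_LEN characters."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in line.lower().split())
    return [tok for tok in tokens if tok.isalpha() and len(tok) <= MAX_LEN]

def build_signatures(samples, top_n=20):
    """Map each language to its frequent short words, minus words shared
    with any other language's frequent set."""
    frequent = {}
    for lang, lines in samples.items():
        counts = Counter(w for line in lines for w in short_words(line))
        frequent[lang] = {w for w, _ in counts.most_common(top_n)}
    signatures = {}
    for lang, words in frequent.items():
        others = set().union(*(v for k, v in frequent.items() if k != lang))
        signatures[lang] = words - others
    return signatures

def identify(line, signatures):
    """Return the language with the most signature hits, or None when
    there are no hits or the top score is tied."""
    words = short_words(line)
    scores = {lang: sum(w in sig for w in words) for lang, sig in signatures.items()}
    ranked = sorted(scores.values(), reverse=True)
    if not ranked or ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)

# Toy training lines, invented for this example (not the author's corpora).
SAMPLES = {
    "english": ["the cat sat on the mat and it was in the sun",
                "he is at the top of the hill and we are not"],
    "spanish": ["el gato se fue a la casa de su tio y no lo vio",
                "la luz del sol es de un color que no se ve en el mar"],
}
SIGNATURES = build_signatures(SAMPLES)
```

With these toy signatures, `identify("the book is on the table", SIGNATURES)` scores only english hits, while a line with no short words at all is left unclassified. A real run would need signatures learned from much larger samples.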
[Moderators' note: The rest of this posting is available on the server.
 To get the file, send a message to:
 listserv@uniwa.uwa.oz.au
The message should consist of the single line:
 get signature
You will then receive the complete posting.]

Message 2: LANGUAGES OF THE WORLD, MUNICH

Date: Fri, 13 Sep 91 12:36+0000
From: <HASPELMATH@PHILOLOGIE.FU-BERLIN.DBP.DE>
Subject: LANGUAGES OF THE WORLD, MUNICH
From: Ulrich Lueders, Munich:
TYPOLOGY JOURNAL BEGINS PUBLICATION:
Just out is the first issue of
 LANGUAGES OF THE WORLD,
a new international journal focusing on problems of language typology,
genetic relationship, geographical linguistics, and related topics.
LANGUAGES OF THE WORLD includes the LINGUISTIC NEWS LINES, a medium of
information and communication for linguists of various disciplines. The
LINGUISTIC NEWS LINES are devoted to news, announcements, commentaries,
interviews, conference reports, and similar informal material.
LW appears 4 times a year. Individual issues are DM 10.00 (Europe),
US$ 8.00 (Western hemisphere and Africa), US$ 9.00 (Asia, Australia),
with a reduced rate for students.
Contact: Ulrich J. Lueders, Editor & publisher
 LINCOM EUROPA
 Sportplatzstrasse 6
 D-8044 Unterschleissheim/Muenchen, GERMANY
 (no e-mail address yet)