USING SHORT WORDS: a language identification algorithm

stan kulikowski ii
educational research and development center
the university of west florida
pensacola, fl, usa 32514

abstract

a simple algorithm is developed to identify the language of a line of text. the hypothesis is that languages contain frequent small words (3 chars or less) which can be used to distinguish many lines of their text. a heuristic program has been written to acquire distinctive language signatures and test them against text files collected from network sources. this algorithm is moderately successful, usually identifying better than 50% of lines in worst case analyses. when tuned, it performs better on actual network text (70-85% success). combining this method with character level analyses from cryptographic studies may be successful in developing a quick method of language identification which is sensitive to quite small samples of text. the heuristic program is available from the author upon request.

introduction

last month in linguist list (vol 2 no 368) i put out a query for an algorithm to identify distinctly english text mixed in texts of other languages. i had a number of requests to report back what i found.

i have been collecting files off the internet as samples of languages to use in research on the uniform measurement of textual complexity for educational purposes. at present i am identifying active network sources of natural language diversity and attempting to quantify these for educational uses. so far i have found fewer than a dozen languages, but i have been able to gather several megabytes of such data. in order to calibrate software which measures textual complexity in different languages, i feel that i need large samples of relative purity in linguistic content.

the general problem is that english is more or less the matrix language in the structure of global networks. as such, english is a common contaminant in network text written in other languages.

i am not concerned at this point with words and small phrases which are borrowed into another language: a natural process in linguistic diachronics. i am concerned about file processes which transfer chunks of text of one language into another. this is very common in networking. a writer in spanish may want to quote a network source on a programming virus and use the mailer to bring in parts of another text file. automatic requoting of other messages in networks is easier than paraphrasing, so this form of reference is growing. the transfers often bring english into discussions in other languages.

to calibrate my software, i need to eliminate cross-linguistic file transfers so later we can accurately measure these properties in active network sources. our first pass at this is to have a student look at each message as it arrives off the network. if big hunks of the thing are not the language expected from that source, it can be discarded. but hand-processing hundreds of files a day can be prone to error, especially when cross-quoted text may just be a line or two. so i came in search of an algorithm i could use to verify large calibration corpora and eventually use to monitor active network text sources.

well, no one came right back with an oh-yeah so-and-so's work does that. i did get a number of replies relayed from usenet's sci.crypt that cryptographers use a method based on the frequency of bigram and trigram character sequences. this may work for file-sized data, but i doubt that it would be sensitive to a datum in the range of 40-80 bytes, which is what you get in line-by-line text transfers. my original request hypothesized that the frequent occurrence of short words in a language may be line-by-line distinct. since nobody said nay, i wrote a small program to test that notion.

[Moderators' note: The rest of this posting is available on the server. To get the file, send a message to:

    listserv@uniwa.uwa.oz.au

The message should consist of the single line:

    get signature

You will then receive the complete posting.]
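
The short-word hypothesis stated in the posting above can be illustrated with a minimal Python sketch. This is not the author's heuristic program, which is only available on request. The word lists in `SIGNATURES`, the function names, and the simple hit-counting rule are all assumptions made for illustration; the posting says only that signatures are built from frequent words of 3 characters or less.

```python
import re

# Hypothetical signatures: a few frequent short words per language.
# These lists are illustrative assumptions, not the signatures acquired
# by the program described in the posting.
SIGNATURES = {
    "english": {"the", "of", "and", "to", "in", "is", "it", "for"},
    "spanish": {"de", "la", "que", "el", "en", "los", "las", "por", "un", "una"},
    "german": {"der", "die", "und", "das", "ist", "zu", "den", "von", "mit"},
    "french": {"le", "la", "les", "de", "des", "et", "un", "une", "du", "est"},
}

MAX_LEN = 3  # "3 chars or less", as stated in the abstract


def short_words(line):
    """Return the lowercase alphabetic words of at most MAX_LEN letters."""
    words = re.findall(r"[^\W\d_]+", line.lower())
    return [w for w in words if len(w) <= MAX_LEN]


def identify_line(line, signatures=SIGNATURES):
    """Guess the language of one line by counting signature hits.

    Returns the language with the most hits, or None when no signature
    word occurs or when two languages tie (an assumed policy).
    """
    counts = {lang: 0 for lang in signatures}
    for word in short_words(line):
        for lang, signature in signatures.items():
            if word in signature:
                counts[lang] += 1
    best = max(counts.values())
    if best == 0:
        return None
    winners = [lang for lang, count in counts.items() if count == best]
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    for sample in (
        "the name of the file is in the header",
        "el nombre de la persona que vive en los campos",
        "der Hund und die Katze",
    ):
        print(identify_line(sample), "<-", sample)
```

Returning `None` on a tie or on zero hits is a deliberate choice in this sketch: a line of 40-80 bytes may contain no signature word at all, which would be consistent with the partial per-line success rates reported in the abstract.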

From: Ulrich Lueders, Munich

TYPOLOGY JOURNAL BEGINS PUBLICATION

Just out is the first issue of LANGUAGES OF THE WORLD, a new international journal focusing on problems of language typology, genetic relationship, geographical linguistics, and related topics.

LANGUAGES OF THE WORLD includes the LINGUISTIC NEWS LINES, a medium of information and communication for linguists of various disciplines. The LINGUISTIC NEWS LINES are devoted to news, announcements, commentaries, interviews, conference reports, and similar informal material.

LW appears 4 times a year. Individual issues are DM 10.00 (Europe), US$ 8.00 (Western hemisphere and Africa), US$ 9.00 (Asia, Australia), with a reduced rate for students.

Contact:
Ulrich J. Lueders, Editor & publisher
LINCOM EUROPA
Sportplatzstrasse 6
D-8044 Unterschleissheim/Muenchen, GERMANY
(no e-mail address yet)