LINGUIST List 25.4689

Fri Nov 21 2014

Software: Computational Linguistics; Morphology; Text/Corpus Linguistics: types2: Type and Hapax Accumulation Curves

Editor for this issue: Andrew Lamont <alamont@linguistlist.org>


Date: 20-Nov-2014
From: Tanja Säily <tanja.saily@helsinki.fi>
Subject: Computational Linguistics; Morphology; Text/Corpus Linguistics: types2: Type and Hapax Accumulation Curves

types2 is a free tool for visualizing and assessing the statistical significance of differences in word frequencies across corpora and other data sets. It is especially useful for analysing variation in the frequencies of types and hapax legomena, which are common measures of morphological productivity and lexical diversity. The previous version, types1, was introduced in 2009; the new version facilitates comparisons through interactive visualization and adjusts the significance for multiple hypothesis testing.

The software can analyse data sets from the perspective of the following statistics:
- number of words: the total number of running words in the text corpus
- number of tokens: the number of occurrences of the words of interest in the study
- number of types: how many distinct tokens have been seen
- number of hapaxes: how many types have occurred only once
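
As a rough illustration of these four statistics, the following Python sketch counts them for a toy text. It is not part of types2; the function name corpus_statistics and the predicate is_token are hypothetical, and the "words of interest" are assumed here to be words ending in -ness.

```python
from collections import Counter

def corpus_statistics(words, is_token):
    """Return (words, tokens, types, hapaxes) for a list of running words.

    `is_token` is a hypothetical predicate that selects the words of interest.
    """
    tokens = [w for w in words if is_token(w)]
    freq = Counter(tokens)                      # frequency of each distinct token
    hapaxes = sum(1 for c in freq.values() if c == 1)
    return len(words), len(tokens), len(freq), hapaxes

text = "the kindness and the darkness of the happiness and kindness".split()
stats = corpus_statistics(text, lambda w: w.endswith("ness"))
print(stats)  # (10, 4, 3, 2)
```

Here the text has 10 running words, 4 tokens in -ness, 3 types (kindness, darkness, happiness) and 2 hapaxes (darkness, happiness).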

The tool can be employed for visualization, statistical hypothesis testing, and exploratory data analysis. To enhance the reliability of the results, it uses robust, nonparametric statistics (more specifically, Monte Carlo permutation tests). The only modelling assumption is that, under the null hypothesis, individual "samples" are exchangeable.
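
The general idea of a Monte Carlo permutation test under this exchangeability assumption can be sketched in Python as follows. This is a simplified illustration, not the algorithm implemented in types2: it compares a group of samples against random groups with the same number of samples (rather than the same number of words or tokens), and the names count_types and permutation_p_value are hypothetical.

```python
import random

def count_types(samples):
    """Number of distinct tokens across a collection of samples."""
    return len(set().union(*samples))

def permutation_p_value(samples, group_idx, n_perm=10000, seed=0):
    """Estimate how unusual it is for a group to have so few types.

    samples   -- list of samples, each a list of tokens
    group_idx -- indices of the samples that form the group of interest
    Returns the observed type count and a one-sided Monte Carlo p-value.
    """
    rng = random.Random(seed)
    k = len(group_idx)
    observed = count_types([samples[i] for i in group_idx])
    at_most = 0
    for _ in range(n_perm):
        # If samples are exchangeable, any k of them are equally likely
        # to have formed the group.
        chosen = rng.sample(range(len(samples)), k)
        if count_types([samples[i] for i in chosen]) <= observed:
            at_most += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (at_most + 1) / (n_perm + 1)

samples = [["a"], ["a"], ["b", "c"], ["d", "e"]]
observed, p = permutation_p_value(samples, group_idx=[0, 1], n_perm=2000)
print(observed, p)
```

A small p-value suggests that the group uses fewer types than would be expected if its samples had been drawn at random from the whole data set. Testing many groups at once calls for a multiple-testing correction, which this sketch omits.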

The software is written by Jukka Suomela, and the system is designed and developed in collaboration with Tanja Säily. It has been tested on Windows, Macintosh and Unix platforms. The output is provided in three formats: web pages, PDF images and raw statistics in a database. The software is freely available at http://users.ics.aalto.fi/suomela/types2/ and http://dx.doi.org/10.5281/zenodo.9868

Linguistic Field(s): Computational Linguistics
                            Morphology
                            Text/Corpus Linguistics

Page Updated: 21-Nov-2014