Academic Paper


Title: A fast and flexible architecture for very large word n-gram datasets
Author: Michael Flor
Institution: NLP and Speech Group
Linguistic Field: Computational Linguistics
Abstract: This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern matching searches (with wildcards or regular expressions), on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.
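
Illustration: the abstract describes a trie that serves both literal and wildcard queries from the same structure, supports incremental updates, and allows merging of models. The sketch below is a toy in-memory version of that general idea, not TrendStream's implementation, which the abstract describes as a compressed, compiled binary format. The class name `NGramTrie`, its methods, and the `*` single-word wildcard convention are all assumptions made for this example.

```python
from typing import Dict, Iterator, Sequence, Tuple


class NGramTrie:
    """Toy word-level trie mapping n-grams to counts (hypothetical, illustrative only)."""

    def __init__(self) -> None:
        self.children: Dict[str, "NGramTrie"] = {}
        self.count = 0  # number of times the n-gram ending at this node was seen

    def add(self, ngram: Sequence[str], count: int = 1) -> None:
        """Incrementally update the count of one n-gram, creating nodes as needed."""
        node = self
        for word in ngram:
            node = node.children.setdefault(word, NGramTrie())
        node.count += count

    def lookup(self, ngram: Sequence[str]) -> int:
        """Literal query: return the stored count, or 0 if the n-gram is absent."""
        node = self
        for word in ngram:
            child = node.children.get(word)
            if child is None:
                return 0
            node = child
        return node.count

    def match(
        self, pattern: Sequence[str], prefix: Tuple[str, ...] = ()
    ) -> Iterator[Tuple[Tuple[str, ...], int]]:
        """Wildcard query on the same trie: '*' stands for any single word."""
        if not pattern:
            if self.count:
                yield prefix, self.count
            return
        head, rest = pattern[0], pattern[1:]
        if head == "*":
            candidates = list(self.children.items())
        else:
            child = self.children.get(head)
            candidates = [(head, child)] if child is not None else []
        for word, child in candidates:
            yield from child.match(rest, prefix + (word,))

    def merge(self, other: "NGramTrie") -> None:
        """Merge another trie into this one by summing counts node by node."""
        self.count += other.count
        for word, child in other.children.items():
            self.children.setdefault(word, NGramTrie()).merge(child)


# Small usage example with made-up counts.
trie = NGramTrie()
trie.add(["the", "cat", "sat"], 3)
trie.add(["the", "dog", "sat"], 2)
trie.add(["the", "cat", "ran"], 1)

extra = NGramTrie()
extra.add(["the", "cat", "sat"], 4)
trie.merge(extra)

literal = trie.lookup(["the", "cat", "sat"])
wild = sorted(trie.match(["the", "*", "sat"]))
```

Here the wildcard search walks the same nodes a literal lookup would, branching only at `*` positions, which is why no separate index is needed in this sketch. A production system would also need the compression and memory-mapped or on-demand loading that the abstract mentions; those are outside the scope of this example.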

This article appears in Natural Language Engineering, Vol. 19, Issue 1.
