Books: Index Structures for the Exploration of Natural Language Corpora: Goller
Editor for this issue: Rebekah McClure
Date: 24-Jan-2013
From: Ulrich Lueders <lincom.europat-online.de>
Subject: Index Structures for the Exploration of Natural Language Corpora: Goller
Title: Index Structures for the Exploration of Natural Language Corpora
Series Title: Linguistic Resources for Natural Language Processing 06
Publisher: Lincom GmbH
Author: Johannes Goller
Paperback: ISBN: 9783862884087 Pages: 140 Price: Europe EURO 64.80
This study describes the development of a large-scale corpus query system – a specialized search engine used to perform advanced types of pattern search, especially for patterns used by linguists interested in discovering syntactic phenomena in large corpora.
The study begins with a review of traditional search engine algorithms. The main focus then shifts to suffix arrays, a data structure that has been available since 1987 but, for various technical reasons, is not commonly used in large-scale search engines.
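For readers unfamiliar with the data structure, the basic idea of a suffix array can be sketched as follows. This is a minimal illustration and not code from the book: it works on a token list rather than on characters, uses a naive sort-based construction that would not scale to large corpora, and the names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
from typing import List


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction for illustration only: it compares whole suffixes,
    so it is far too slow for a real corpus.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens: List[str], sa: List[int], pattern: List[str]) -> List[int]:
    """Return all corpus positions where `pattern` occurs, in ascending order.

    All suffixes that begin with the pattern form one contiguous block of
    the suffix array, so two binary searches locate that block.
    """
    m = len(pattern)

    # Lower bound: first suffix whose first m tokens are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m tokens are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


corpus = "the cat sat on the mat near the cat".split()
suffix_array = build_suffix_array(corpus)
print(find_occurrences(corpus, suffix_array, ["the", "cat"]))
```

Because every occurrence of a phrase sits in one contiguous block of the sorted array, the number of hits is simply the size of that block, which is part of what makes the structure attractive for corpus queries.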
This study takes recently developed algorithms as the starting point for a new attempt to reintroduce the suffix array as a data structure of practical value to corpus-linguistic research. One of the key findings is a technique that combines several suffix arrays using indexed bit vectors and makes it possible to search layers of meta information, such as part-of-speech information and semantic labels, in parallel with the text. A set of algorithms operating on that data structure is presented, enabling sophisticated pattern matching, such as gap-matching and gap-filling, as well as improved methods of concordance generation. The final chapters present practical examples of how the new system is used to make linguistically relevant discoveries in real corpora.
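To give a rough idea of what a query across several annotation layers with a gap looks like, here is a small sketch. It is deliberately not the book's technique: instead of suffix arrays and indexed bit vectors it uses a plain linear scan over two aligned lists, and the function names, the constraint format and the tag set are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# A constraint is (word, tag); None means "any value on that layer".
Constraint = Tuple[Optional[str], Optional[str]]


def fits(words: List[str], tags: List[str], i: int, constraint: Constraint) -> bool:
    """Check position i against a constraint on the word and tag layers."""
    word, tag = constraint
    return (word is None or words[i] == word) and (tag is None or tags[i] == tag)


def match_with_gap(
    words: List[str],
    tags: List[str],
    left: Constraint,
    right: Constraint,
    max_gap: int,
) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where `left` holds at i, `right` holds at j,
    and at most `max_gap` tokens lie between them.

    Naive scan for illustration; an index-based system would avoid
    visiting every position.
    """
    hits = []
    n = len(words)
    for i in range(n):
        if not fits(words, tags, i, left):
            continue
        # j = i + 1 is a gap of zero; j = i + 1 + max_gap is the widest gap.
        for j in range(i + 1, min(n, i + max_gap + 2)):
            if fits(words, tags, j, right):
                hits.append((i, j))
    return hits


words = ["the", "old", "cat", "sat", "on", "the", "mat"]
tags = ["DET", "ADJ", "NOUN", "VERB", "ADP", "DET", "NOUN"]

# "the", then at most one arbitrary token, then any noun.
for i, j in match_with_gap(words, tags, ("the", None), (None, "NOUN"), max_gap=1):
    # The tokens words[i + 1:j] are what fills the gap.
    print(words[i:j + 1], "gap filler:", words[i + 1:j])
```

In this sketch the word layer and the tag layer are consulted at the same position, which is the sense in which the layers are searched in parallel; the slice between the two matched positions corresponds loosely to the material a gap-filling query would report.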