This study describes the development of a large-scale corpus query system – a specialized search engine for advanced pattern search, in particular for the patterns that linguists use to discover syntactic phenomena in large corpora.
The study begins with a review of traditional search engine algorithms; the main focus then shifts to suffix arrays, a data structure that has been available since 1987 but, for various technical reasons, is not commonly used in large-scale search engines.
Recently developed algorithms serve as the starting point for a new attempt to re-introduce the suffix array as a data structure of practical value to corpus-linguistic research. One of the key findings is a technique that combines several suffix arrays using indexed bit vectors, which makes it possible to search layers of meta-information, such as part-of-speech information and semantic labels, in parallel with the text itself. A set of algorithms operating on that data structure is presented; these enable sophisticated pattern matching, such as gap-matching and gap-filling, as well as improved methods of concordance generation. The final chapters present practical examples of how the new system is used to make linguistically relevant discoveries in real corpora.
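
As a rough illustration of the basic idea, the following Python sketch builds a token-level suffix array, looks up a phrase by binary search, and then filters the hits against a parallel part-of-speech layer. It is a minimal sketch under stated assumptions, not the system described in this study: the function names, the toy corpus, and the tags are hypothetical, the construction is deliberately naive, and the tag layer is checked by direct position lookup rather than by the indexed bit vectors developed here.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    """Return corpus positions sorted by the token sequence starting at each one."""
    # Naive construction for illustration only; real systems use faster algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, pattern):
    """Return the sorted corpus positions at which `pattern` (a token list) starts."""
    m = len(pattern)
    # Suffixes truncated to the pattern length stay in sorted order, so the
    # matches form one contiguous block. Materializing the list is a shortcut
    # for this sketch; a real implementation compares lazily during the search.
    prefixes = [tokens[i:i + m] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(suffix_array[lo:hi])


def find_with_tags(tokens, tags, suffix_array, pattern, tag_pattern):
    """Keep matches whose parallel tag layer fits `tag_pattern` (None = any tag)."""
    hits = []
    for pos in find_occurrences(tokens, suffix_array, pattern):
        window = tags[pos:pos + len(tag_pattern)]
        if len(window) == len(tag_pattern) and all(
            want is None or want == got for want, got in zip(tag_pattern, window)
        ):
            hits.append(pos)
    return hits


# Hypothetical toy corpus with one part-of-speech tag per token.
tokens = "the old man the boats and the old man sleeps".split()
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "CONJ", "DET", "ADJ", "NOUN", "VERB"]
sa = build_suffix_array(tokens)

all_hits = find_occurrences(tokens, sa, ["old", "man"])
adj_noun_hits = find_with_tags(tokens, tags, sa, ["old", "man"], ["ADJ", "NOUN"])
verb_hits = find_with_tags(tokens, tags, sa, ["old", "man"], [None, "VERB"])
```

In this toy corpus the phrase "old man" occurs twice, and the tag layer separates the adjective-noun reading from the noun-verb reading, which is the kind of query that a text-only index cannot answer on its own.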