"Buenos dias", "buenas noches" -- these were the first words in a foreign language I heard in my life, as a three-year-old boy growing up in developing post-war Western Germany, where the first gastarbeiters had arrived from Spain. Fascinated by the strange sounds, I tried to get to know some more languages, the only opportunity being TV courses in English and French -- there was no foreign language education for pre-teen school children in Germany in those days.
To find some answers Tim Machan explores the language's present and past, and looks ahead to its futures among the one and a half billion people who speak it. His search is fascinating and important, for definitions of English have influenced education and law in many countries and helped shape the identities of those who live in them.
This volume provides a new perspective on the evolution of the special language of medicine, based on the electronic corpus of Early Modern English Medical Texts, containing over two million words of medical writing from 1500 to 1700.
Index Structures for the Exploration of Natural Language Corpora
Linguistic Resources for Natural Language Processing 06
This study describes the development of a large-scale corpus query system – a specialized search engine used to perform advanced types of pattern search, especially for patterns used by linguists interested in discovering syntactic phenomena in large corpora.
The study begins with a review of traditional search engine algorithms; the main focus then shifts to suffix arrays, a data structure that has been available since 1987 but is not commonly used in large-scale search engines for various technical reasons.
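For readers unfamiliar with the data structure, the following is a minimal sketch of a token-level suffix array, not the system described in the study. It assumes a corpus small enough to sort naively in memory; the function names `build_suffix_array` and `find_occurrences` are illustrative and do not come from the book.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return suffix start positions sorted by the token sequence they begin.

    Naive construction for illustration only: it materialises every suffix,
    which is far too slow and memory-hungry for a real corpus.
    """
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def find_occurrences(tokens: Sequence[str], sa: List[int],
                     pattern: Sequence[str]) -> List[int]:
    """Return all corpus positions where `pattern` starts, in corpus order."""
    pat = list(pattern)
    m = len(pat)

    def prefix(k: int) -> List[str]:
        # First m tokens of the k-th smallest suffix.
        return list(tokens[sa[k]:sa[k] + m])

    # Lower bound: first suffix whose m-token prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pat:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-token prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pat:
            lo = mid + 1
        else:
            hi = mid

    # All matches are contiguous in the suffix array between the two bounds.
    return sorted(sa[start:lo])


corpus = "the cat sat on the mat and the cat slept".split()
sa = build_suffix_array(corpus)
print(find_occurrences(corpus, sa, ["the", "cat"]))  # [0, 7]
```

Because every occurrence of a pattern is a prefix of some suffix, all matches sit next to each other in the sorted array, so two binary searches locate them without scanning the text.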
Recently developed algorithms are considered in this study as the starting point for a new attempt to re-introduce the suffix array as a data structure of practical value to corpus-linguistic research. One of the key findings is a technique that combines several suffix arrays using indexed bit vectors and enables the searching of layers of meta information, such as part-of-speech information and semantic labels, in parallel to searching the text. A set of algorithms operating on that data structure is presented, enabling sophisticated pattern matching, such as gap-matching and gap-filling, as well as improved methods of concordance generation. The final chapters present practical examples of how the new system is used to make linguistically relevant discoveries in real corpora.
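The idea of searching an annotation layer in parallel to the text, with a gap between the matched items, can be illustrated with a deliberately simplified sketch. It is not the technique from the study: plain Python integers stand in for indexed bit vectors, a linear scan stands in for suffix array lookups, and the names `positions_bitvector`, `gap_match` and `bit_positions` as well as the sample tags are assumptions made for this example.

```python
from typing import List, Sequence


def positions_bitvector(layer: Sequence[str], value: str) -> int:
    """Return an integer whose bit i is set when layer[i] == value.

    A linear scan is used here for brevity; a real system would obtain
    these positions from an index over the layer.
    """
    bits = 0
    for i, item in enumerate(layer):
        if item == value:
            bits |= 1 << i
    return bits


def gap_match(first_bits: int, second_bits: int,
              min_gap: int, max_gap: int) -> int:
    """Mark positions i where the first item occurs at i and the second
    item occurs after a gap of min_gap..max_gap intervening tokens."""
    hits = 0
    for gap in range(min_gap, max_gap + 1):
        # Shifting right by gap + 1 aligns position i + gap + 1 with i.
        hits |= first_bits & (second_bits >> (gap + 1))
    return hits


def bit_positions(bits: int) -> List[int]:
    """List the indices of all set bits, lowest first."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Two parallel layers over the same token positions (hypothetical data).
words = "the cat sat on the old mat".split()
tags = ["DT", "NN", "VB", "IN", "DT", "JJ", "NN"]

the_bits = positions_bitvector(words, "the")
noun_bits = positions_bitvector(tags, "NN")

# "the" followed by a noun with at most one token in between.
print(bit_positions(gap_match(the_bits, noun_bits, 0, 1)))  # [0, 4]
```

Representing each layer's matches as a bit vector over shared token positions lets a query mix word forms and part-of-speech tags with simple bitwise operations, which is the general intuition behind querying several layers at once.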