LINGUIST List 27.939

Mon Feb 22 2016

Qs: Lexicography and variation: big data via Google?

Editor for this issue: Anna White <>

Date: 19-Feb-2016
From: Stefan Dollinger <>
Subject: Lexicography and variation: big data via Google?

Dear colleagues,

While the use of internet data in lexicography is nothing new, the question has been raised of how best to normalize the "big and messy" data on the internet using site-restricted searches (SRSs). SRSs have been employed to obtain information on the regional variation of a given term (and, ideally, a given meaning), yet some issues remain unresolved. The question of how to phrase searches to target specific meanings is perhaps the most challenging aspect, though by no means the only one.
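To make the normalization question concrete, here is a minimal sketch of one possible approach. It is not the method of the discussion paper. It assumes that hit counts per top-level domain have already been collected by hand, and it scales them by a baseline count for the same domain. The function names, the example term, the domains, and all figures are hypothetical.

```python
from typing import Dict


def build_srs_query(term: str, domain: str) -> str:
    """Build a site-restricted query string, e.g. '"toque" site:.ca'.

    The term is quoted so that it is searched as an exact phrase.
    """
    return f'"{term}" site:{domain}'


def normalize_hits(
    term_hits: Dict[str, int],
    baseline_hits: Dict[str, int],
    per: int = 1_000_000,
) -> Dict[str, float]:
    """Scale raw hit counts by a per-domain baseline.

    term_hits:     hits for the search term in each domain (collected manually)
    baseline_hits: an assumed proxy for the size of each domain's index,
                   e.g. hits for a very common word in the same domain
    per:           express the result as hits per this many baseline hits
    """
    normalized: Dict[str, float] = {}
    for domain, hits in term_hits.items():
        base = baseline_hits.get(domain, 0)
        if base <= 0:
            # No usable baseline for this domain, so skip it rather than guess.
            continue
        normalized[domain] = hits * per / base
    return normalized


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    term_hits = {".ca": 50, ".uk": 10}
    baseline_hits = {".ca": 1_000_000, ".uk": 2_000_000}
    for domain in term_hits:
        print(build_srs_query("toque", domain))
    print(normalize_hits(term_hits, baseline_hits))
```

The choice of baseline is the weak point of any such scheme, since the reported hit counts of a commercial engine are estimates and the size of its index per domain is not public.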

An interesting discussion is developing in this forum:

I wonder if anyone has used black box commercial search engines such as Google, which, despite all its shortcomings and annoyances, offers a temptingly large, in fact the largest, index in the world. Other search engines, e.g., are more precise, yet their index is smaller.

My question: Does anyone have experience, or can anyone add to the methodology presented in the above discussion forum?

As the issues raised relate to a number of linguistic approaches, I would primarily ask for input on open-class lexical items, which, in contrast to most grammatical items, show very low frequency counts. It is important that participants consider this aspect, which means, as shown in the discussion paper, that existing web-scale resources (of 12 billion words etc.) are still much too small to assist in the regional labelling of lexical items.

Thanks for your input. You are welcome to post directly in the forum on, on the entire approach or on any aspect of the paper (click on relevant text passage to open a dialog box for your comment).

I will post a summary on LINGUIST.

Thank you for considering offering your expertise.

Linguistic Field(s): Anthropological Linguistics
                            Applied Linguistics
                            Computational Linguistics
                            Historical Linguistics
                            Ling & Literature
                            Text/Corpus Linguistics

Page Updated: 22-Feb-2016