LINGUIST List 27.2136

Tue May 10 2016

Sum: Lexicography and variation: big data via Google?

Editor for this issue: Anna White <awhite@linguistlist.org>


Date: 06-May-2016
From: Stefan Dollinger <stefan.dollinger@sprak.gu.se>
Subject: Lexicography and variation: big data via Google?

Discussion period: 19 Feb. to c. mid March 2016.

The discussion was centred on this draft paper:
https://www.academia.edu/s/1a487c74ab

And was announced in https://linguistlist.org/issues/27/27-939.html

Summary:

Fifty-two discussants took part in the 21-day Session on academia.edu. Below I will attempt to summarize what were, in my view, the most salient issues. For other topics, please refer directly to the Session link (read from the bottom up). Thanks to all those who gave their time. Apologies to those whose posting(s) are not reported below; this is not for lack of appreciation.

One conversation stream, started by Robert Lew, cut right to the validity of the entire approach. In measured yet incisive messages, Robert raised a number of concerns about using Google, one of them rather serious, and insisted that using Google was ''bad science'', echoing an Adam Kilgarriff paper from 2007. I had used that paper as a springboard to reconsider some of Adam's objections in ''Googleology as Smart Lexicography''. I realized early on that the discussion centred a bit too much on Google, though this was an inadvertent reflection of the title. The discussion was intended more as a ''how reliable are web searches?'' debate about open-access search engines in general, but the Google focus made it very concrete and tangible, and it offered insights that go far beyond Google's ''Black Box'' structure.

Most crucially, Robert pointed out that Google page counts are unreliable: the figures displayed are not matched by the results one actually finds when clicking through. This is a serious problem for any method, such as this one, that relies on Google's numbers to create its normalized cross-domain indices. It seemed as if the error was probably proportionally inflated across domains, as the results found in DCHP-2 match what we know about the regional patterns of Canadianisms on an international as well as a Canada-internal regional scale (for a very concise account, see https://www.academia.edu/18967380/How_to_write_a_historical_dictionary_a_sketch_of_The_Dictionary_of_Canadianisms_on_Historical_Principles_Second_Edition). So, while the absolute numbers are off, the ratio between the numbers in different domains seems to be correct. Robert pursued the issue further and found a number of infelicities even in the ratio. There clearly is more work to be done, but the use of more precise search engines is not the panacea it seems. The question remains why the results we get are in line with the few terms whose regional variation we already knew and, for the many others, usually make perfect sense when matched with the extra-linguistic histories of the terms. Everyone will be able to verify this once DCHP-2 is in open access in late 2016.

The entire point of my paper was that the clean web-scaled resources that Kilgarriff advocates are still not big enough (e.g. 12 billion words) to produce regional information. So if we want regional labels in dictionaries, one of the areas that, as I argue, lexicographers do worst, we will have to make the best of suboptimal search engines. I make this point in the paper with an example. Robert pointed out that the Yandex and Exalead search engines might be preferable, yet it remains to be checked whether their indices are large enough to compete with the data from the messy Google interface.

The point raised in the paper, that Google MUST be tracked and that results can only be confirmed post hoc with the help of extensive tracking data, is the key to the method, which, I believe, has been refined. This would apply to other indices as well, whether Yandex, Exalead or others. So, one take-away message might be: if you want to argue from frequency, NEVER just search the web; always track the domain sizes, document them, and then search the web.
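
The take-away above can be illustrated with a minimal sketch. It is not the procedure used for DCHP-2, only one plausible reading of "normalized cross-domain indices": each reported hit count is divided by the domain size tracked on the same day, and the resulting rates are scaled so that the strongest domain is 100. The class name, function name and all figures are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainSnapshot:
    """One documented observation; all field names are hypothetical."""
    domain: str       # top-level domain searched, e.g. ".ca"
    term_hits: int    # hits reported for the term within that domain
    domain_size: int  # size reported for the whole domain, tracked the same day


def normalized_indices(snapshots):
    """Scale hits by tracked domain size, then index the top domain at 100."""
    # Relative rate per domain; domains without a usable size are skipped.
    rates = {
        s.domain: s.term_hits / s.domain_size
        for s in snapshots
        if s.domain_size > 0
    }
    top = max(rates.values(), default=0.0)
    if top == 0:
        return {domain: 0.0 for domain in rates}
    return {domain: round(100 * rate / top, 1) for domain, rate in rates.items()}


# Invented figures, not real search-engine data.
example = [
    DomainSnapshot(".ca", term_hits=5_000, domain_size=1_000_000),
    DomainSnapshot(".us", term_hits=10_000, domain_size=20_000_000),
    DomainSnapshot(".uk", term_hits=1_000, domain_size=4_000_000),
]
indices = normalized_indices(example)
print(indices)  # {'.ca': 100.0, '.us': 10.0, '.uk': 5.0}
```

In this invented example the .us domain returns twice as many raw hits as .ca, yet its index is only a tenth of the .ca value once domain size is taken into account. This is why an argument from frequency needs the tracked sizes and not the raw counts alone.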

Is Google messy? No doubt. Do we have an alternative for the kind of tasks geographically-minded lexicographers need to handle? Not yet.

So while web-scaled corpora (e.g. TenTen and resources in SketchEngine) and resources like GloWbE (Mark Davies) are extremely useful, they are far too small (in the latter case very much so) to help address the regional distribution of the lexical items searched for.

There is room for more exploration. Lexicographers are generally not computational linguists, so any method that would help them would need to be simple and effective, or come in the form of an app. That was the idea behind my paper: a practical method that is computationally unsophisticated yet sound (or at least much sounder than current practice).

Thanks to all who posted comments, not only Robert Lew but especially also Robert Fuchs, Dorota Lockyer and Victoria Ventura, for their various suggestions. I will incorporate them, with full acknowledgement, as far as possible in the final version of the paper.

Thanks for the contributions! Great to be part of the collaborative spirit.

Stefan Dollinger
https://gu-se.academia.edu/StefanDollinger
Gothenburg, Sweden, 6 May 2016

Linguistic Field(s): Anthropological Linguistics
                            Applied Linguistics
                            Computational Linguistics
                            General Linguistics
                            Historical Linguistics
                            Lexicography
                            Ling & Literature
                            Semantics
                            Sociolinguistics

Page Updated: 10-May-2016