Discussion Details

Title: Re: 15.2594, Disc: Re: 15.2577, FYI:Using Google Script
Submitter: Damon Allen Davison
Description: Re: Linguist 15.2577, Linguist 15.2594
Dear List,
John Atkinson writes:
  ... Google shows the first as 16 times less common than the second.
  Of course, it's no use entering 'take the liberty', because three quarters
  of the returns are things like 'Take the Liberty Bridge Exit'. Also, a type
  of automobile called a Liberty seems to turn up in a high proportion of the
  hits on both sides.
This is a very important point and bears elaboration. You might even
say that this is the fundamental problem with simply googling
linguistic queries. Google does not allow truly literal searches
because it strips quite a bit of metainformation from queries. The
information Google does account for belongs to two basic classes:
lexical and syntactic data. The problem on the lexical side is that
only literal lexemes are accounted for. Google has some support for
synonymic searches using the '~' operator, but its weakness for our
purposes is that its morphological features are currently limited to
accounting for plural variation. Verb morphology is not accounted for
at all. Searches for 'took a liberty' and 'takes a liberty', along
with their the-equivalents, return similar results to 'take a
liberty', but seemingly with less punctuation noise.
(In engineering terms, the internet already has a low signal-to-noise
ratio. When Google strips the signal, the query, of metainformation,
it lowers this ratio even further. Google is simply not designed
to handle linguistic queries.)
Services like the University of Liverpool's WebCorp
(http://www.webcorp.org.uk , but refusing connections at the moment)
try to eliminate the static by using Google to do an initial search,
and then filtering those results using verb morphology,
capitalization, and punctuation.
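To make that kind of post-filtering concrete, here is a minimal sketch in Python. It is not WebCorp's actual code, which I have not seen; the names TAKE_FORMS, build_pattern, and filter_hits are made up for illustration, and the hand-written list of verb forms stands in for a real morphological lexicon. It assumes the initial search has already returned a list of text snippets.

```python
import re

# Hypothetical, hand-written inflections of "take"; a real tool would
# presumably draw these from a morphological lexicon.
TAKE_FORMS = ["take", "takes", "took", "taken", "taking"]

def build_pattern(verb_forms, determiner, noun):
    """Compile a pattern for any verb form + determiner + noun."""
    verbs = "|".join(re.escape(v) for v in verb_forms)
    # Only the verb is matched case-insensitively (sentence-initial "Take").
    # The determiner and noun must appear exactly as given, so the proper
    # noun "Liberty" is rejected. Only whitespace may separate the words,
    # so intervening punctuation blocks the match.
    return re.compile(
        rf"\b(?i:{verbs})\s+{re.escape(determiner)}\s+{re.escape(noun)}\b"
    )

def filter_hits(snippets, pattern):
    """Keep only the snippets that contain the pattern."""
    return [s for s in snippets if pattern.search(s)]
```

Given snippets such as 'Take the Liberty Bridge Exit' and 'She took the liberty of opening the letter', a pattern built with build_pattern(TAKE_FORMS, "the", "liberty") keeps only the second. The cost of the capitalization check is that genuine uses in titles or all-caps text are thrown out along with the bridges and automobiles.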
  Perhaps the preponderance of 'take the liberty to' in web-pages is because
  it's common in officialese, while 'take a liberty' is a rather more literary
  term. Nothing to do with their relative well-formedness.
Yes, I think it's safe to say that Google searches are nice
frequency-of-use indicators, nothing more. It is very difficult to define a
threshold above which things are grammatical.
Warm Regards,
Damon Allen Davison
Date Posted: 20-Sep-2004
LL Issue: 15.2606
