LINGUIST List 15.2606

Mon Sep 20 2004

Disc: Re: 15.2577, FYI:Using Google Script

Editor for this issue: Naomi Fox <foxlinguistlist.org>


Directory

  1. Damon Allen Davison, Re: 15.2594, Disc: Re: 15.2577, FYI:Using Google Script
  2. Costas Gabrielatos, Re: 15.2594, Disc: Re: 15.2577, FYI:Using Google Script

Message 1: Re: 15.2594, Disc: Re: 15.2577, FYI:Using Google Script

Date: Sun, 19 Sep 2004 11:37:40 +0200
From: Damon Allen Davison <allolexgmail.com>
Subject: Re: 15.2594, Disc: Re: 15.2577, FYI:Using Google Script

Re: Linguist 15.2577, Linguist 15.2594

Dear List,

John Atkinson writes: 
> ... Google shows the first as 16 times less common than the second.
> Of course, it's no use entering "take the liberty", because three quarters
> of the returns are things like "Take the Liberty Bridge Exit". Also, a type
> of automobile called a Liberty seems to turn up in a high proportion of the
> hits on both sides.

This is a very important point and bears elaboration. You might even
say that this is the fundamental problem with simply googling
linguistic queries. Google does not allow truly literal searches
because it strips quite a bit of metainformation from queries. The
information Google does account for belongs to two basic classes:
lexical and syntactic data. The problem on the lexical side is that
only literal lexemes are accounted for. Google has some support for
synonymic searches using the ' ~ ' operator, but its weakness for our
purposes is that its morphological features are currently limited to
accounting for plural variation. Verb morphology is not accounted for
at all. Searches for "took a liberty" and "takes a liberty", along
with their the-equivalents, return similar results to "take a
liberty", but seemingly with less punctuation noise.

(In engineering terms, the internet already has a low signal-to-noise
ratio. When Google strips metainformation from the signal, that is, the
query, it lowers this ratio even further. Google is simply not designed
to handle linguistic queries.)

Services like the University of Liverpool's WebCorp
(http://www.webcorp.org.uk , but refusing connections at the moment)
try to eliminate the static by using Google to do an initial search,
and then filtering those results using verb morphology,
capitalization, and punctuation.
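
A rough sketch of what such post-filtering could look like is given
below. It is not WebCorp's actual implementation; the names
TAKE_FORMS, PATTERN and filter_hits, the hand-written list of verb
forms, and the sample snippets are all assumptions made for
illustration. The idea is to accept any inflected form of TAKE, to
require lowercase "liberty" (which drops "Liberty Bridge" and the
automobile), and to reject hits where punctuation intervenes.

```python
import re

# Inflected forms of TAKE, listed by hand (an assumption for this sketch).
TAKE_FORMS = ("take", "takes", "taking", "took", "taken")

# A form of TAKE, then "a" or "the", then lowercase "liberty".
# The verb and the article are matched case-insensitively via (?i:...),
# but "liberty" must be lowercase, which drops proper nouns.
# Only a single space may separate the words, so hits that span
# punctuation are rejected as well.
PATTERN = re.compile(
    r"\b(?i:(" + "|".join(TAKE_FORMS) + r")) (?i:(a|the)) liberty\b"
)


def filter_hits(snippets):
    """Return (verb form, article, snippet) for each snippet that passes."""
    kept = []
    for text in snippets:
        match = PATTERN.search(text)
        if match:
            kept.append((match.group(1).lower(), match.group(2).lower(), text))
    return kept


# Invented sample snippets standing in for raw search results.
samples = [
    "Take the Liberty Bridge Exit",
    "He took a liberty with the text.",
    "I take the liberty to write to you.",
    "We are taking the Liberty for a test drive.",
    "take the, liberty",
]

for verb, article, text in filter_hits(samples):
    print(verb, article, "|", text)
```

Only the second and third sample snippets survive the filter.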

> Perhaps the preponderance of "take the liberty to" in web-pages is because
> it's common in officialese, while "take a liberty" is a rather more literary
> term. Nothing to do with their relative well-formedness.

Yes, I think it's safe to say that Google searches are nice frequency
of use indicators, nothing more. It is very difficult to define a
threshold above which things are grammatical.

Warm Regards,

Damon

- 

Damon Allen Davison
http://www.allolex.net

Message 2: Re: 15.2594, Disc: Re: 15.2577, FYI:Using Google Script

Date: Sun, 19 Sep 2004 10:31:29 -0400 (EDT)
From: Costas Gabrielatos <c.gabrielatoslancaster.ac.uk>
Subject: Re: 15.2594, Disc: Re: 15.2577, FYI:Using Google Script

In Linguist 15.2577, Danko Sipka <danko.sipkaasu.edu> wrote:

> Dear Linguists,
>
> I frequently use Google to determine lexical and morphosyntactic 
> well-formedness of two options in various languages. I advise my 
> students to do the same. In order to save time required to go to 
> Google two times for one inquiry, I have created a simple script at:
>
> http://cli.la.asu.edu/togoogleornot.htm
>
> which lets you enter two options, choose the target language and then 
> get hits for both options in one window. For example, if a student of 
> English enters take the liberty as the first option and take a liberty 
> as the second, it will be possible to determine that the first option 
> is well-formed while the other is not.

I queried the two constructions and the Google results show that 'take
the liberty' is much more frequent in web pages than 'take a
liberty'. However, I don't think that low frequency entails that a
construction is ill-formed or unacceptable.

Also, looking at only one form of the lemma TAKE (the form 'take') may
hide some interesting variations in the relative frequency of the
combinations with 'the liberty' and 'a liberty'. Google queries of the
different forms of TAKE returned the following results:

take    the/a liberty:  39,300 / 1,520
takes   the/a liberty:   3,720 /   313
taking  the/a liberty:   9,770 /   810
took    the/a liberty:  50,900 /   687
taken   the/a liberty:  56,300 /   680

Although the 'the' construction is more frequent for all forms of
TAKE, the frequency difference is much less marked in the cases of
'takes' and 'taking' (both about 12 times more frequent) than in
'take' (x26), 'took' (x74), and 'taken' (x83).
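
The ratios quoted above can be recomputed from the Google counts with
a few lines of Python. This is only a sketch of the arithmetic; the
names google_hits and the_to_a_ratios are made up for the example.

```python
# Google hit counts from the table above: form -> (the liberty, a liberty).
google_hits = {
    "take":   (39_300, 1_520),
    "takes":  (3_720, 313),
    "taking": (9_770, 810),
    "took":   (50_900, 687),
    "taken":  (56_300, 680),
}


def the_to_a_ratios(counts):
    """For each verb form, how many times more frequent 'the' is than 'a'."""
    return {form: the / a for form, (the, a) in counts.items()}


ratios = the_to_a_ratios(google_hits)
for form, ratio in ratios.items():
    print(f"{form:<7} x{ratio:.1f}")
```

Rounded to whole numbers, the results agree with the figures given
above (x26, x12, x12, x74, x83).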

I also queried 'TAKE the liberty' and 'TAKE a liberty' in BNCweb:

take    the/a liberty:   6 / --
takes   the/a liberty:  -- / --
taking  the/a liberty:  -- /  1
took    the/a liberty:  13 /  1
taken   the/a liberty:  10 / --

There don't seem to be enough instances of the two constructions to
draw any conclusions. However, there are three interesting points to
consider. First, the 'a' construction is found in a representative
corpus, which indicates that it is acceptable. Second, although in the
Google query the 'the' construction is about 40 times more frequent
(considering all the forms of TAKE), in the BNC it is only 15 times
more frequent, which points towards exercising caution when using the
web as a corpus - as John Atkinson has already mentioned (Linguist
15.2594). Finally, in the BNC, one of the two instances of the 'a'
construction is with the form 'taking', although there are no
instances of 'taking' in the 'the' construction.
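
The "about 40 times" and "15 times" figures come from pooling the
counts over all forms of TAKE before dividing. A minimal sketch of
that calculation follows, with "--" in the BNC table treated as zero;
the names google_hits, bnc_hits and pooled_ratio are invented for the
example.

```python
# form -> (the liberty, a liberty); "--" in the BNC table is entered as 0.
google_hits = {
    "take":   (39_300, 1_520),
    "takes":  (3_720, 313),
    "taking": (9_770, 810),
    "took":   (50_900, 687),
    "taken":  (56_300, 680),
}
bnc_hits = {
    "take":   (6, 0),
    "takes":  (0, 0),
    "taking": (0, 1),
    "took":   (13, 1),
    "taken":  (10, 0),
}


def pooled_ratio(counts):
    """Total 'the' hits divided by total 'a' hits, summed over all forms."""
    total_the = sum(the for the, _ in counts.values())
    total_a = sum(a for _, a in counts.values())
    return total_the / total_a


print(f"Google: x{pooled_ratio(google_hits):.1f}")
print(f"BNC:    x{pooled_ratio(bnc_hits):.1f}")
```

With only two 'a' instances in the BNC, the second ratio is very
sensitive to a single additional hit, which is one more reason for
caution.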

Perhaps Google queries can be more useful in helping learners become
aware of the different contexts that the different forms of two
constructions are used in.

Costas Gabrielatos 