Title: Re: A Challenge to the Minimalist Community
I personally tend to feel that these sorts of meta-discussions aren't
very useful and so have tried to stay clear, but let me just inject some
facts into the discussion, since subsequent mentions of the Klein and
Manning work have been incorrect in major respects (one case looks
like a genuine misunderstanding, others like the authors couldn't have
read all of either paper...).
This isn't to say that there wasn't a core of value in the original
posting: I do think it is high time that linguists look more at the success
that is being achieved by empirical and machine learning methods and
use it to question some of the assumptions of the dominant theories,
assumptions that were adopted in the 1950s before there were
successful empirical and machine learning methods for any domain.
On 6 May 2005, Sean Fulop wrote:
They go on to reference work by Klein and Manning which induces
grammars in an "unsupervised" fashion from text. Well first of all, it
is still debatable whether anything can ever actually do this (see the
algorithmic learning theory literature, summarized in Jain et al.
1999), and Sproat and Lappin also note that Klein and Manning's
scheme uses part-of-speech tagged text, which is a far cry from
text. This is a huge annotation, and could be taken as a
component of Universal Grammar. It is a component that is argued
for in P&P, as well.
and later referring to our work,
On 11 May 2005, Carson Schutze wrote:
No one in P&P ever claimed that inducing the ability to parse a
representative subset of a corpus of everyday speech to a certain
approximation (given POS tags) required innate linguistic machinery.
It just isn't the case that the Klein and Manning results require part-of-
speech tagged text. Both of the papers cited in the original post (and
below) -- Klein and Manning 2002 and 2004 -- show results working
from simply a sequence of words and doing automatic distributional
induction of word classes. And see also Dan's thesis (Klein 2005) for
the most recent and complete exposition of the work. It is true that a
lot of the results we present are from pre-tagged text, and furthermore
the word class induction method that we use is rather simple - it's not as good
as methods already proposed in Schuetze 1995, let alone other
recent promising work, of which the best is perhaps Clark (2003). But
that's just because it wasn't our focus, precisely because there was
previous quite successful work on learning word classes. I would
conjecture that our fully unsupervised results would improve
considerably if one simply welded Clark's word class induction to the
rest of our system. (And if one doesn't want to assume a list of words
as input, there is other unsupervised work that has looked at word
segmentation and phoneme recognition... Start welding it together.)
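To make "distributional induction of word classes" concrete, here is a minimal sketch of the general idea: represent each word by counts of its immediate left and right neighbours, then compare words by the similarity of those count vectors. This is not the method of Klein and Manning, Schuetze, or Clark; the function names, the boundary symbols, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences):
    """Map each word to counts of its immediate left/right neighbours."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        # "<s>" and "</s>" are assumed sentence-boundary symbols.
        padded = ["<s>"] + list(sent) + ["</s>"]
        for k in range(1, len(padded) - 1):
            word = padded[k]
            vecs[word][("L", padded[k - 1])] += 1
            vecs[word][("R", padded[k + 1])] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[f] * b[f] for f in a if f in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (hypothetical): "cat" and "dog" share neighbours, "the" does not.
corpus = [["the", "cat", "sleeps"],
          ["the", "dog", "sleeps"],
          ["a", "cat", "runs"]]
vecs = context_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["the"]))
```

Words that occur in similar neighbourhoods get similar vectors, so clustering these vectors yields rough word classes without any tagged input. A real system would need much more data and a proper clustering step.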
[Clark's work is also relevant to the algorithmic learning theory
comment: it's not clear to me how relevant such work on learnability of
general classes like regular or context-free languages is to human
language learnability, since the latter may very well depend on data-
dependent features of the rather restricted class of languages that are
human languages (something Chomsky would maybe even agree
with), but to the extent that one examines such work, again work such
as Clark and Thollard (2004) shows that probabilistic formal
languages have better learnability possibilities: they show that a rather
broad class of PDFAs (Probabilistic Deterministic Finite State Automata)
are PAC-learnable from positive data alone.]
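For readers unfamiliar with the term, a probabilistic deterministic finite state automaton assigns a probability to every string: each state has at most one outgoing transition per symbol, each transition carries a probability, and each state has a stopping probability. The sketch below only shows how such an automaton scores a string. It says nothing about the learning algorithm of Clark and Thollard (2004), and the function name, data layout, and toy automaton are assumptions for illustration.

```python
def pdfa_probability(string, start, transitions, stop):
    """Probability that a PDFA generates `string` and then stops.

    transitions: dict mapping (state, symbol) -> (next_state, probability)
    stop:        dict mapping state -> probability of stopping in that state
    """
    state = start
    prob = 1.0
    for symbol in string:
        if (state, symbol) not in transitions:
            return 0.0  # no such transition: the string has probability zero
        state, p = transitions[(state, symbol)]
        prob *= p
    return prob * stop.get(state, 0.0)

# Hypothetical two-state automaton over {a, b}. In each state the outgoing
# probabilities plus the stopping probability sum to 1.
transitions = {(0, "a"): (0, 0.5), (0, "b"): (1, 0.3)}
stop = {0: 0.2, 1: 1.0}
print(pdfa_probability("ab", 0, transitions, stop))
```

Because the automaton defines a distribution over strings, a learner can be judged by how close its distribution is to the target, which is what makes learning from positive data alone a well-posed problem.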
On 11 May 2005, Carson Schutze wrote:
What is particularly notable about the Klein-Manning grammar
induction procedures is that they do what Chomsky and others
have argued is impossible: They induce a grammar using general
statistical methods which have few, if any, built-in assumptions
that are specific to language.
To even debate this, we would have to establish a definition
for "grammar"; earlier in the paragraph this system is described as
inferring a "parser", which, as has been discussed, is crucially not
the same thing under usual interpretations of these terms.
The important point is the suggestion that some 'alternative(s)' to
P&P can supposedly do "what Chomsky and others have argued is
impossible ... induce a grammar". Here we have a comparison
based on a false premise, it seems to me. What is the evidence that
the Klein/Manning algorithms induce a grammar that has the
properties Chomsky argued required innate structure to learn? All
we've been told about it is that it parses some corpora at some rate
less than 80% but is "quickly converging" on that level of accuracy.
To be precise, what Klein and Manning do is show that given a
reasonable amount of text (but in no way huge! - less than 100,000
words), we can learn the constituent units and dependencies/
headedness of that text (with a reasonable degree of success). The
model that is built from the data could reasonably be called a grammar
(though certainly not one that knows about things like binding theory
or long distance dependencies), but we don't actually build a parser at
all - though that would be an obvious extension, since a treebank
parser could be built on the results by supervised learning methods in
the usual way. While a human language grammar is clearly much
more than knowledge of constituency, constituency is such an
important and basic part of knowledge of language that I do feel that it
is a very reasonable first target, and a reasonable thing to feel that
you should be able to do better with a P&P/Minimalist language
learner, precisely because a large part of the principles and
parameters that have concretely been proposed do deal with issues of
Later, Carson writes:
What are we to make of "with this in mind" as a connective between
the upper (and preceding) paragraphs and the lower? The former
talks about learning a grammar of a natural language. The latter
talks about correctly parsing 90% of examples sampled from some
corpus the system was trained on. Accomplishing the very narrow
parsing task in S&L's challenge hardly tells us anything about
whether some system is or is not able to learn a natural language
grammar, so if our goal is really studying how humans acquire
grammars, the challenge is virtually irrelevant to that goal.
I would agree with this. The more appropriate goal seems to be to
show a language learner equipped with some version of the innate
knowledge assumed in P&P/Minimalism outperforming a language learner
without that knowledge on a grammar induction task. However, it doesn't seem
unreasonable to me to focus on constituency learning as the first such
task - it's one of the more basic and better understood areas of
On 11 May 2005, Charles Yang, in Linguist 16.1505, wrote:
The recent, and remarkable, work of Klein and Manning (2002)
takes this a step further. So far as I can tell, in the induction of a
grammatical constituent, Klein & Manning's model not only keeps
track of the constituent itself, but also its aunts and sibling(s) in the
tree structure. These additional structures is what they refer to
as ''context''; those with a more traditional linguistics training may
recall ''specifier'', ''complement'', ''c-command'', and ''government''.
I take this to be the genuine misunderstanding, but it isn't right: while
we define "context" as a general notion, the "context" that we
concretely use is nothing more or less than the word class immediately
to the left and right of a putative constituent. This model was adopted
precisely because such a model of using word classes to the left and
right had proven so successful in distributional word class induction.
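To show how little structure this notion of "context" involves, here is a minimal sketch that enumerates every contiguous span of a word-class sequence together with the word class immediately to its left and right. It is an illustration of the definition only, not code from Klein and Manning (2002); the function name, the boundary symbols, and the example tag sequence are assumptions.

```python
def spans_with_contexts(tags):
    """Yield (span, context) for every non-empty contiguous span of `tags`.

    span:    tuple of word classes inside the putative constituent
    context: (word class just to the left, word class just to the right)
    """
    # "<s>" and "</s>" are assumed boundary symbols for sentence edges.
    padded = ["<s>"] + list(tags) + ["</s>"]
    n = len(tags)
    for i in range(n):
        for j in range(i + 1, n + 1):
            span = tuple(tags[i:j])
            # tags[i:j] sits at padded[i+1 .. j], so its neighbours are
            # padded[i] on the left and padded[j+1] on the right.
            context = (padded[i], padded[j + 1])
            yield span, context

# Hypothetical tagged sentence: "the dog barked"
for span, context in spans_with_contexts(["DT", "NN", "VBD"]):
    print(span, context)
```

Nothing here refers to tree positions such as aunts, siblings, specifiers, or c-command: the context of a span is just a pair of adjacent word classes.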
Dan Klein and Christopher D. Manning. 2002. A Generative
Constituent-Context Model for Improved Grammar Induction.
Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics, pp. 128-135.
Dan Klein and Christopher D. Manning. 2004. Corpus-Based Induction
of Syntactic Structure: Models of Dependency and Constituency.
Proceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics (ACL 2004).
Dan Klein. 2005. The Unsupervised Learning of Natural Language
Structure. Ph.D. thesis, Stanford University.
Hinrich Schuetze. 1995. Distributional Part-of-Speech Tagging.
Proceedings of EACL 7, pp. 141-148. http://arxiv.org/abs/cmp-lg/9503009
Alexander Clark. 2003. Combining Distributional and Morphological
Information for Part of Speech Induction. Proceedings of EACL 2003.
Alexander Clark and Franck Thollard. 2004. PAC-learnability of
Probabilistic Deterministic Finite State Automata. Journal of Machine
Learning Research, 5 (May):473-497.