Editor for this issue: Martin Jacobsen <marty@linguistlist.org>
Phil Bralich wrote:

> Let me illustrate with some quotes from the MUC-6 web page which outlines the tasks to be accomplished. To see this site yourself go to http://cs.nyu.edu/cs/faculty/grishman/muc6.html (I have no idea why no one in these discussions is providing the relevant URLs besides me).
>
> You will note in the following that there is no concern whatsoever for the ability to do a constituent analysis of a tree as that is precisely what is being avoided. Certainly any amount of constituent analysis would be of value and would not be excluded but there seems to be an awareness that it is not available so different criteria are chosen. Note also that information being extracted is not described in terms of phrases or clauses.
>
> [excerpt deleted]

[ ... and later ... ]

> However, having demonstrated with a 75,000 word dictionary and a parser that does a wide variety of new functions, we can no longer dismiss theoretical syntax. Certainly, IE and IR will benefit greatly when we extend our current tools to those areas because we provide so much more information about the environments of the "named entities" and so on.

[ ... and later ... ]

> Here you are talking about doing something useful with huge numbers of documents of unrestricted text whereas I am speaking (primarily) about doing question/answer, statement/response repartee, grammar checking, the improvement of machine translation devices and of course about a significant, overnight improvement in navigation and control devices. Nothing in the MUC standards speaks to any of this. The MUC standards are actually quite narrow compared to the very wide realm of what is possible with NLP.
>
> This problem is not unknown in the field. Take a look at what is said in _The State of the Art in Human Language Technology_ (Ron Cole, Editor-in-Chief), a 1996 report commissioned by the National Science Foundation. In this report, Ted Briscoe (Section 3.7, p. 1) states, "Despite over three decades of research effort, no practical domain independent parser of unrestricted text has been developed." In addition, in that same report, Hans Uszkoreit and Anne Zaenen state (Section 3.1, p. 1), "Currently, no methods exist for efficient distributed grammar engineering [parsing]. This constitutes a serious bottleneck in the development of language technology products."
>
> Thus, while there is some IE and IR happening without parsers, there are hundreds of other possible technologies that cannot be developed with the standards used by MUC. To create these other technologies it is necessary to meet standards just like those I have proposed or the bottleneck will not be broken. In addition all IR and IE technologies will be significantly improved once these NLP tools are brought to bear on that domain.

[ ... and later ... ]

> While there is a lack of precision in some of them, I don't think it is at all a problem to expect a system to label tense, change active to passive or passive to active or to be able to answer a simple question. I also do not find that spectacularly undefined.

[ ... and later ... ]

> All I suggest is that the reader go to the MUC page himself (URL given above) and decide for himself. The tasks that IR and IE set out for themselves may be of some value in that very limited domain, but they have absolutely no applicability to the development of other NL tools such as q&a, machine translation and so on. In order to approach these other areas you absolutely have to have a parser.

[ ... and finally, in response to someone else ... ]

> However, I do not see how anyone could come even close to meeting the standards I propose without first having a fully worked out theory of syntax. Even if programmers were the ones to develop a program that met those standards we would have to admit that somewhere in those lines of code was a true theory of syntax. Even if you just created a huge series of jerry-rigs, either they would not work or they would merge into a theory. The phenomena to be described are complex, subtle, and intricate. Only a completely worked out theory of syntax will result in such programs.

First, I welcome Dr. Bralich's more subdued tone. But I still disagree with him, and I think that the final paragraph I've cited here illustrates the crux of our disagreement.

I agree, of course, that there is a theory of language, no matter how simple and crude, embedded in ANY language-processing system. There used to be people in NLP who denied that, and claimed that their approaches were purely statistical, but I'm pretty sure most of those folks have finally admitted that such a goal is impossible (and counterproductive). And I certainly agree that the MUC criteria for evaluation bear on a tiny subset of the potential applications of language processing systems. Dr. Bralich lists a number of other potential applications, which I've repeated above.

However, taken together, these observations do NOT imply that Dr. Bralich's standards are very useful. The problem is that Dr. Bralich cannot conceive of a situation in which his desiderata do not entail his conclusions.

For instance, Dr. Bralich lists question-answering as one of the potential applications which MUC does not address. This is true. However, the DARPA community HAS addressed question-answering systems in a related evaluation: the ATIS air travel reservation domain for spoken language understanding systems. In this evaluation, an attempt was made to define a syntactic/shallow semantic level of evaluation; the effort was called SEMEVAL, and it failed miserably.
The problem was that there are too many different theories of language, and settling on an intermediate level of evaluation which embraced a particular one (and it was of course impossible to settle on such a level without embracing a particular one) was counterproductive and irrelevant, for a number of reasons:

(1) The point was answering the question, not parsing the sentence. The intermediate representation was unimportant as an evaluation criterion.
(2) Many people were using a simple frame-based approach which did not use an articulated intermediate representation, and they (rightly) observed that they would be unfairly penalized for using an alternative approach to reaching the same goal.
(3) The range of syntactic constructions encountered in a corpus of 14,000 spontaneous utterances on air travel is pretty large, and linguistic theories do not yet address the majority of them.

In other words, not only was the proposed evaluation level unimportant and biased, it was also impossible to determine the "right" answer, because we don't know it yet. All these objections apply to Dr. Bralich's proposed criteria.

I sympathize tremendously with Dr. Bralich's goals. However, the problem of standards for language processing systems is far more complex than he anticipates. There are three purposes one might have for defining such standards:

(a) they help determine whether systems are behaving in linguistically well-justified ways;
(b) they allow one to compare systems;
(c) they help determine whether these systems can contribute to tasks which require linguistic processing.

The problem is that (a) is ill-defined; there's no evidence that it bears any relationship to (c); and it bears on (b) only if you're trying to evaluate a system without a task in mind (which turns out to be nearly impossible). Dr. Bralich seems unable to imagine a scenario in which (a) and (c) do not entail each other, and my point is that the entire history of evaluation of language systems has failed to demonstrate that they are more than peripherally connected.

[Just to clarify, I say (a) is ill-defined because theories currently vastly underanalyze the available data, and conflict on the data that they do analyze. Resolving these conflicts MIGHT be useful for evaluation if (a) were to imply (c) and there were no other way to get to (c), but this has never been shown.]

Now, like Dr. Bralich, I don't believe for a second that we can get 100% of language analysis for ANY application without a detailed theory of syntax. But (a) I don't really care which one it is (well, I do, but not for the purposes of this discussion :-) ), and (b) we're nowhere near 100% analysis for any task (and demonstrating that Dr. Bralich's system matches the Penn Treebank at 100% accuracy would indicate nothing relative to this goal). Committing to Dr. Bralich's criteria counterproductively biases the path that we take toward this goal.

By way of conclusion, let me elaborate on this final point. Tremendous progress has been made in computational linguistics over the last ten years by ABANDONING the convictions commonly held in theoretical linguistics about how to make progress in this area. Theoretically well-motivated systems perform no better, and in many situations perform worse (in both speed and accuracy), than pragmatically constructed systems which fundamentally change the assumptions about how language research is to be conducted. These systems have a theory of language behind them; it's just a sort of theory which theoretical linguists aren't very interested in. Dr. Bralich's standards impose a bias from theoretical linguistics on what systems MUST do in order to be successful; so far, the computational linguistics community has demonstrated this bias to be false and counterproductive.
I personally think this result has vast implications for linguistic theory, and I would hate to see a set of standards adopted which effectively eliminated this alternative branch of investigation.

Samuel Bayer
The MITRE Corporation
In vol-9-383, Philip A. Bralich <bralich@hawaii.edu> wrote:

>...at some point we have to step back and ask ourselves which theories, whatever their motivation are capable of accounting for the data. That is which theories after three decades of trying have done the best job and how can we demonstrate that? ...Nor am I saying they will not at some point arrive at a proper description of the data that does indeed meet their goals (whether that be a description of processing, acquisition, or whatever) but I am saying that as long as they do not have a demonstrably satisfying account of the basics, they can make no strong claim to being a mature or effective theory. They can call themselves psychologically motivated, or learning theory motivated or whatever, but they cannot make many claims to being able to account for the data.

[Bralich quotes Piu ten Hacken:]

>>1. a linguistic theory which does not give a description of all the phenomena Bralich's parser covers
>>need not be a bad theory;

[Bralich again:]

>Correct, but it is also not a mature theory.

"The data" which a theory might claim to be able to "account for" encompasses a wide range of phenomena. Quite apart from the fact that Bralich has claimed only that his parser + grammar works for English, there is the question of what kind of adequacy it achieves. Chomsky outlined long ago (1965, in "Aspects of the Theory of Syntax") three levels of adequacy for theories, namely observational, descriptive and explanatory. So far as I have seen in this discussion, Bralich is saying that his parser + grammar does a good job at observational adequacy (this is what he appears to be calling "the basics"), but I have seen nothing about descriptive or explanatory adequacy.

Over the years, there have been many parser + grammar combinations that have achieved a reasonably high degree of observational adequacy.
(The fact that they have not been put "in the reviewers' hands", as Bralich says they should be, does not negate this claim. Many of these programs are proprietary to the companies that developed them, and those companies are not about to release them as stand-alone parsers.) But observational adequacy is not the only goal for most modern theories of linguistics, and a theory which achieved only observational adequacy (if there were such a theory) could not claim to be a mature theory either.

So why not try to compare parsing programs at the level of descriptive adequacy? Because that presupposes we know the correct analysis of every sentence in some large (and "complete") test corpus, and we don't. That isn't to say people haven't tried to do such comparisons over the years, just that it has turned out to be more difficult than expected. While many parsing programs achieve reasonably similar levels of observational adequacy, i.e. they recognize more or less the same sets of sentences, they do not assign comparable constituent structures, and it is not apparent which constituent structure is correct. (Or, worse for purposes of comparison, an LFG parser, say, will assign two different sorts of parallel structures where a GPSG parser will assign only one.)

I could give a long list here of syntactic constructions of English for which the appropriate structure is not apparent, but I'll content myself with one example: in the sentence "I wonder who came", is there a gap after "who"? (For that matter, is there a gap, in the sense of a phrasal node which does not dominate any terminals, in _any_ wh-construction?)

It goes without saying that parsing programs do not attempt to achieve explanatory adequacy. There are computer programs which have made forays into this area, but that's a different topic. Apart from those attempts, explanatory adequacy has been the exclusive realm of theoretical linguists.
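The point about observational versus descriptive adequacy can be made concrete with a minimal Python sketch. It is not taken from Bralich's parser or any other system mentioned here: the tree encoding, the labels (S, VP, CP, NP) and the functions `spans`, `words` and `bracket_scores` are all hypothetical. The two bracketings of "I wonder who came" are assumed for illustration: `analysis_a` posits an empty subject NP (a gap) after "who" inside a CP, while `analysis_b` has no gap. Both cover the same word string, so they are equally "observationally adequate"; the scorer, loosely in the spirit of labeled-bracket precision and recall (real PARSEVAL-style scoring usually discards empty elements), shows that their constituent structures only partly agree.

```python
from typing import List, Set, Tuple, Union

# Hypothetical tree encoding: a word is a plain string; a phrase is
# (label, [children]). A phrase with no children is an empty category (a gap).
Tree = Union[str, Tuple[str, list]]
Span = Tuple[str, int, int]


def spans(tree: Tree, start: int = 0) -> Tuple[Set[Span], int]:
    """Return every labeled constituent as (label, start, end), plus the end index."""
    if isinstance(tree, str):
        return set(), start + 1  # a word occupies one position
    label, children = tree
    result: Set[Span] = set()
    pos = start
    for child in children:
        child_spans, pos = spans(child, pos)
        result |= child_spans
    result.add((label, start, pos))  # a gap gets a zero-width span (i, i)
    return result, pos


def words(tree: Tree) -> List[str]:
    """The yield: the word string the tree covers (what observational adequacy checks)."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]


def bracket_scores(reference: Tree, candidate: Tree) -> Tuple[float, float]:
    """Labeled-bracket precision and recall of candidate against reference."""
    ref, _ = spans(reference)
    cand, _ = spans(candidate)
    matched = len(ref & cand)
    return matched / len(cand), matched / len(ref)


# "I wonder who came", analysis A: wh-phrase in a CP, with a subject gap after "who".
analysis_a = ("S", [("NP", ["I"]),
                    ("VP", ["wonder",
                            ("CP", [("NP", ["who"]),
                                    ("S", [("NP", []), ("VP", ["came"])])])])])

# Analysis B: "who" is simply the subject of the embedded clause; no gap.
analysis_b = ("S", [("NP", ["I"]),
                    ("VP", ["wonder",
                            ("S", [("NP", ["who"]), ("VP", ["came"])])])])

if __name__ == "__main__":
    print("same words:", words(analysis_a) == words(analysis_b))  # True
    p, r = bracket_scores(analysis_a, analysis_b)
    print(f"A as reference: precision={p:.3f} recall={r:.3f}")  # 0.833 / 0.625
    p, r = bracket_scores(analysis_b, analysis_a)
    print(f"B as reference: precision={p:.3f} recall={r:.3f}")  # 0.625 / 0.833
```

Swapping the arguments simply swaps precision and recall, which is the point of the example: the numbers say nothing about descriptive adequacy until someone has decided which analysis is the correct one.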
In summary, after "three decades of trying", there is no "mature theory" of syntax, not even whatever theory underlies Bralich's grammar. Look at it this way: it's employment security.

Mike Maxwell
Mike_Maxwell@sil.org