Bertolo, Stefano, ed. (2001) Language Acquisition and Learnability. Cambridge University Press, viii+247pp, hardback ISBN 0-521-64149-7, $64.95; paperback ISBN 0-521-64620-0, $22.95.
Lee Fullerton, University of Minnesota.
This book is not an anthology in the normal sense, but rather a five-chapter introduction to the topics of its title by six authors. The editor has made an effort to standardize terminology and formal notation and sets out his norms in the fourteen-page Chapter 1. He claims the book is accessible, but the reader definitely needs to be versed in the mathematics of sets, probability and statistics, as well as in Chomsky's Principles and Parameters (1981; hereafter PPH, where H stands for 'hypothesis') and his Minimalist Program (1995). The book contains almost no discussion of results in first-language acquisition research. Rather, it is a theoretical, mostly abstract treatment of the nature of parameters, universal grammar (UG) and the learning device. Each chapter has exercises interrupting the text. (Those for Chapter 5 are all at the end.)
Chapter 1, A brief overview of learnability, by Stefano Bertolo
Chapter 2, Learnability and the acquisition of syntax, by Martin Atkinson
Chapter 3, Language change and learnability, by Ian Roberts
Chapter 4, Information theory, complexity and linguistic descriptions, by Robin Clark
Chapter 5, The Structural Triggers Learner, by William G. Sakas and Janet D. Fodor
In the first chapter Bertolo describes the book as a collaboration of linguistics, psychology and learning theory in which linguists elaborate all the theoretical possibilities, learning researchers eliminate some of these based on theory supported by empirical studies, and psychologists eliminate others based on their studies of learning schedules and types of data learned. B then elaborates briefly on the first four of the following five questions central to learnability researchers: (i) What is being learned, exactly? (ii) What kind of hypotheses is the learner capable of entertaining? (iii) How are the data of the target language presented to the learner? (iv) What are the restrictions that govern how the learner updates her conjectures in response to the data? (v) Under what conditions, exactly, do we say that a learner has been successful in the language learning task? B's elaboration of the last question begins by noting that the current consensus rejects the notion that humans learn language by Identification by Enumeration (of conjectures about the finite values of a finite number of parameters), but he proceeds anyway to outline three types of such learning, all of which are theoretically compatible with the PPH: Identification in the Limit, the Wexler and Culicover Criterion, and the Probably Approximately Correct Criterion. B describes each formally and proves PPH grammars learnable mathematically under each. More recent work, B says, has turned to investigating another possible organization of the space which parameters occupy: the Subset Principle. In the last three pages of the chapter B gives formal definitions of parameter spaces and a notation for exploring interesting regions of them.
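For readers unfamiliar with the idea, Identification by Enumeration can be sketched in a few lines. This is only an illustration under stated assumptions and is not taken from the book: the candidate "languages" are hypothetical finite sets of strings, the names `CANDIDATES` and `learn_by_enumeration` are invented here, and the learner simply conjectures the first candidate in a fixed enumeration that is consistent with every sentence seen so far.

```python
from typing import FrozenSet, Iterable, List, Optional

# Hypothetical toy languages (assumption): each is a finite set of "sentences".
# Smaller languages are listed before the languages that properly include them.
CANDIDATES: List[FrozenSet[str]] = [
    frozenset({"s v"}),
    frozenset({"s v", "s v o"}),
    frozenset({"s v", "s o v"}),
]

def learn_by_enumeration(text: Iterable[str]) -> List[Optional[int]]:
    """Return the learner's conjecture (an index into CANDIDATES) after each datum."""
    seen = set()
    conjectures: List[Optional[int]] = []
    for sentence in text:
        seen.add(sentence)
        # Conjecture the first enumerated language that contains all data so far.
        guess = next(
            (i for i, lang in enumerate(CANDIDATES) if seen <= lang), None
        )
        conjectures.append(guess)
    return conjectures

conjectures = learn_by_enumeration(["s v", "s o v", "s v", "s o v"])
print(conjectures)
```

On this toy text the conjecture changes once and then stays fixed, which is the intuition behind success "in the limit": after some finite point the learner never changes its mind again.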
Atkinson's Chapter 2 has three parts, which incidentally retrace the successive foci of recent learning theory: the child's linguistic environment, the subset problem, and algorithms accounting for the effects of interacting parameters. Regarding the child's environment, A concludes first that negative feedback (i.e. correction or other expression of disapproval of ill-formed sentences) is nearly nonexistent and would not make a difference even if it were frequent. Second, the environment of Motherese does indeed present the child with simple sentences: embeddings are very rare. Parameters are therefore settable without reference to embedding, and such phenomena as WH-movement over one or more clause boundaries are acquired in the form of movement over one or more nodes within the simple sentence. Third, A considers whether Motherese is "graded", i.e. whether the data presented are ordered in a way that matches the brain's predetermined sequence of structure-types acquired. The reasonable answer is no, but empirical evidence is lacking. Brain maturation might cause the child to focus on different aspects of the consistently simple data in a predetermined sequence.
A introduces the subset problem with a set-theoretical analysis and illustrates it later with the phenomenon of across-clause binding of anaphors and pronouns. Any language that binds an anaphor across x number of clauses will also allow that binding across x-1 clauses, which results in language X properly including language X-1. This discussion is quite lengthy, including at the end treatment of the null subject parameter as a possible subset problem, one in which the values of the parameter are only two: pronoun subject expressed or not. In fact, if one ignores the across-clause(s) binding problem, it turns out that maybe all parameters have binary values, neither of which has the qualifier 'optional'. A suggests that current research on the proper articulation of the binding theory will lead to the conclusion that it too is binary. I see the binding of anaphors as part of the same problem as the distance of WH-movement mentioned above. Both are acquired as binding across nodes within the simple sentence. A closes this section with the conclusion that subset relations are not a part of human grammar.
A's last section deals with problems that arise with the interaction of two or more parameters. Given XP = Spec, X' and X' = Comp(lement), X, where Spec is S(ubject), Comp is O(bject) and unbarred X is V(erb), four arrangements of S, V, O are possible depending on the (binary) ordering values assigned to the two parameters. The first yields either S-first or S-last while the second yields either VO or OV. Given initial settings and a wealth of data sentences, usually including more than S, V, O (e.g. a second complement of the verb, free adverbials, auxiliary verbs), what triggering word orders are needed to move the learner from his/her initial settings to the target settings of his/her language? Are there setting pairs (initial, target) for which no sequence of triggers will move the learner to the target settings? Once A introduces a third parameter, plus or minus verb second (V2), the second question is answered with yes. These "local maxima" occur when the initial setting for V2 is minus and the target setting is plus. A gets around this dilemma by assuming that learners set parameter values according to a particular ordering of parameters. If the two X-bar parameters are set first, then the later setting of V2 can avoid the problem. The ordering of parameters set is of course consistent with the above-mentioned ordering of focal points on the input data according to brain maturation.
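The four arrangements produced by the two X-bar ordering parameters can be enumerated mechanically. The following is a minimal sketch, not Atkinson's formalism: the function `linearize` and the Boolean encoding of the two parameters are assumptions made here for illustration, and the V2 parameter is left out.

```python
from itertools import product
from typing import Dict, Tuple

def linearize(spec_first: bool, comp_first: bool) -> str:
    """Linearize S, V, O from two binary ordering parameters (illustrative encoding).

    spec_first: Spec (S) precedes X' if True, follows it if False.
    comp_first: Comp (O) precedes the head X (V) if True (OV), follows it if False (VO).
    """
    x_bar = ["O", "V"] if comp_first else ["V", "O"]
    xp = ["S"] + x_bar if spec_first else x_bar + ["S"]
    return "".join(xp)

# One word order for each of the 2 x 2 parameter settings.
orders: Dict[Tuple[bool, bool], str] = {
    (s, c): linearize(s, c) for s, c in product([True, False], repeat=2)
}

for setting, order in orders.items():
    print(setting, order)
```

Each of the four settings yields a distinct order, two S-first and two S-last, which is why a bare S V O string can already narrow the space considerably before V2 enters the picture.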
Ian Roberts begins Chapter 3 by noting that language change requires us to believe that a generation of learners can sometimes set a parameter's value differently from the members of the generation providing input. Latin pretty clearly has underlying OV order, while its daughters, the modern Romance languages, all have VO. Such things happen, R says, because between the parents' setting and the input there is the learning algorithm (device), which in cases of change finds the parent setting unlearnable. Citing Cinque (1997), who finds in a large sample of unrelated languages 32 ordered parameters expressed in IP alone, R calculates 2-to-the-64th-power grammars. Setting one parameter per second, the learner's acquisition would in the worst case take more than 34 years. What allows acquisition within three to five years is the learning algorithm.
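The scale of such worst-case figures is easy to check. The sketch below rests on assumptions that are mine rather than the chapter's: that n binary parameters define 2^n candidate grammars, that the learner tests one whole grammar per second, and that a year is 365.25 days. The function name `years_to_enumerate` is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: Julian year

def years_to_enumerate(n_parameters: int, seconds_per_grammar: float = 1.0) -> float:
    """Years needed to try every grammar defined by n binary parameters."""
    n_grammars = 2 ** n_parameters
    return n_grammars * seconds_per_grammar / SECONDS_PER_YEAR

for n in (30, 32, 64):
    print(f"{n} parameters: {years_to_enumerate(n):.3g} years")
```

Under these assumptions, exhaustive search takes about 34 years for 30 parameters and about 136 years for 32, while for 64 parameters it runs to hundreds of billions of years. Whichever count one adopts, the point stands that blind enumeration cannot be what the child is doing.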
Drawing heavily on Kauffman's (1995) study of the clumping of matter in galaxies using Boolean networks and their states, R credits the learning algorithm with the notions of markedness between the two values of a parameter and an implicational relation of one parameter to another. The latter is only two deep, that is, parameter X can have either value 0 or value 1 and, if 1, then parameter Y is in play with its two values, but there are no further parameters involved. Not only are the values binary but also the size of the network. Following a long discussion of feature checking within Chomsky's (1995) Minimalist Program, R boils it down to the same kind of binary network: any feature F is either expressed at the level of Phonetic Form or it is not: if expressed, then by either Merge (lexical-phonological insertion) or by Move.
Markedness in acquisition has the (always conservative) learner setting parameters only in response to trigger input which expresses the marked value. Otherwise the unmarked value is the default setting. Move is marked with regard to no phonological expression of the feature, and Merge is marked with respect to Move. This notion of markedness is part of the learning algorithm and is distinct from Cinque's parameters within IP, which are part of UG. These latter R calls Jakobsonian parameters and illustrates them with four mood parameters, e.g. Mood-sub-Speech-Act is unmarkedly declarative and markedly non-declarative.
Both types of markedness are subject to the observation that the default (unmarked) setting unmarkedly lacks overt expression while the marked setting is expressed. Thus declarative sentences in English have English's underlying order SVO while yes/no-interrogative and imperative sentences show subject-verb inversion (VSO), a result of Move.
By all these notions R claims to put syntactic variation and syntactic change into the realm of cognitive theory, which in turn allows variation and change to be evidence for describing UG and the learning algorithm. R's first case is the loss of two sites (C and AgrS) for the main verb in Modern English (ModE). In the absence of an auxiliary, ModE requires DO-Support with negation and interrogation. It also places the verb after certain adverbials and after a floated quantifier. None of this was true in Middle English (ME): no DO-Support; verbs preceded a floated quantifier and an adverbial. By R's analysis the ME order shows movement of the main verb into AgrS while ModE lacks such movement and is therefore simpler and more "elegant" to the learning algorithm. The ModE situation shows no overt expression of the unmarked setting zero of the V-to-AgrS parameter, as opposed to the ME situation, where Move correlates with the marked setting 1. A further change is the loss of distinctive verb inflection for person and number (four different endings in 1400, only one (3sg -s) today). It too illustrates elegant as unmarked: ME has the marked setting of phonological expression, i.e. Merge, while ModE has the unmarked setting of no overt expression.
R treats two more cases of syntactic change in similar fashion: (1) the shift of 'habere' from an independent verb of possession in Latin to a suffix for future and conditional in the modern Romance languages and (2) the shift from SOV in Latin and Proto-Germanic to SVO in Romance and English, respectively.
At its beginning Robin Clark identifies Chapter Four as an invitation to linguists to explore the mathematics of information theory and statistics. At the end of his second section (page 136) C summarizes his intuitive account of the relationship between "texts," i.e. adult input, and the setting of parameters: "...[A] system has the learnability property just in case there is some learner that learns the languages [including variation within a single language] determined by that system from any arbitrarily selected 'fair' text, one where each parameter value is expressed above the learnability threshold [frequency]. The complexity bound U...should serve to limit the complexity of the input text; in particular, given U we can establish an upper bound of both the sample size and the time required by the learner. ...[A]s the complexity bound U grows, the sentences which express structures near the bound become less likely. It will take increasingly large samples to learn more complex parameter values. Assuming, as seems reasonable, that the time to converge is a function of the size of the text [the learner] learns on, then the time-complexity of learning is also a function of U. But U is a bound on parameter expression: no parameter can contain more information than can be expressed by a phrase marker of complexity at most U. In other words, the information content of a parameter value is directly related to probabilities. Finally, since cross-linguistic variation is determined by the different parameter values, U also limits the amount of variation that is possible across languages. We now turn to the formalization of these intuitions."
As implied above, C assumes that parameters have a finite number of values greater than two. He seems also to assume a critical learning period extending beyond age three. He assumes further that the learner comes equipped with some innate mathematics.
C's sketch of probability and information theory begins with three axioms about sets of events and four axioms about the probabilities of events within the set. After introducing the notion of entropy, a measure of the uncertainty in a system, C defines conditional entropy and relative entropy. Examples: Heads reduce entropy by selecting for semantic, syntactic and morphological properties of the constituents they govern (conditional entropy). Which sense of a two-meaning word like 'grade' is intended (school, slope) can be approached by calculating the probability of each in a given context (relative entropy).
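The three quantities can be made concrete with the standard textbook definitions. The sketch below is not Clark's notation, and the probabilities for the two senses of 'grade' in two invented contexts are made up purely for illustration.

```python
import math
from typing import Dict, Tuple

def entropy(p: Dict[str, float]) -> float:
    """Shannon entropy H(X) in bits."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def conditional_entropy(joint: Dict[Tuple[str, str], float]) -> float:
    """H(Y | X) in bits, for a joint distribution over pairs (x, y)."""
    px: Dict[str, float] = {}
    for (x, _), q in joint.items():
        px[x] = px.get(x, 0.0) + q
    return -sum(q * math.log2(q / px[x]) for (x, _), q in joint.items() if q > 0)

def relative_entropy(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Relative entropy (Kullback-Leibler divergence) D(p || q) in bits."""
    return sum(pi * math.log2(pi / q[k]) for k, pi in p.items() if pi > 0)

# Made-up numbers (assumption): sense of 'grade' with and without context.
prior = {"school": 0.5, "slope": 0.5}
joint = {
    ("classroom", "school"): 0.45, ("classroom", "slope"): 0.05,
    ("road", "school"): 0.05, ("road", "slope"): 0.45,
}
in_classroom = {"school": 0.9, "slope": 0.1}

h_prior = entropy(prior)                    # uncertainty with no context
h_given_context = conditional_entropy(joint)  # uncertainty once context is known
d_context = relative_entropy(in_classroom, prior)

print(h_prior, h_given_context, d_context)
```

With these numbers, knowing the context cuts the uncertainty about the intended sense from one bit to less than half a bit, which is the kind of reduction the selection properties of heads are said to provide.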
C's discussion of parameters begins with the idea of describing an object, any object, including linguistic objects. The complexity of the description depends on the degree to which the object has structure: Objects with a great deal of internal predictability, like languages, have short descriptions, which may be formulated as instructions; objects with no structure (random objects, like a sequence of coin tosses) cannot have a compressed description. To tie these notions to symbolic descriptions, C discusses Turing machines, including two-tape ones (for finding e.g. palindromes) and universal ones, which can simulate any particular Turing machine. This leads to a discussion of data compression and codes, in particular, instantaneous codes, in which no codeword is identical with any initial sequence ("prefix") of any other codeword. Optimal instantaneous codes give the shortest codewords to those entities with the highest probability of occurrence. To end this section C provides a binary, optimal, instantaneous code description of a universal Turing machine.
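Huffman coding is one standard way to build an optimal instantaneous (prefix-free) code, and it shows the link between probability and codeword length directly. The sketch below uses that technique as an illustration; it is not claimed to be the construction Clark uses, and the four-symbol distribution is invented.

```python
import heapq
from typing import Dict

def huffman_code(probs: Dict[str, float]) -> Dict[str, str]:
    """Build a binary Huffman code for the given symbol probabilities."""
    # Heap entries: (probability, tie_breaker, {symbol: codeword_so_far}).
    # The unique integer tie-breaker keeps the dicts from ever being compared.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

def is_prefix_free(code: Dict[str, str]) -> bool:
    """True if no codeword is an initial sequence of a different codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

# Invented distribution (assumption) over four symbols.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
lengths = {sym: len(word) for sym, word in code.items()}
print(code, lengths, is_prefix_free(code))
```

The most probable symbol receives a one-bit codeword and the two rarest receive three bits each, so the expected codeword length here equals the entropy of the distribution, 1.75 bits.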
The Kolmogorov complexity of an object is the length of the briefest program (formal description) of the object by a Turing machine. C demonstrates a relationship between K-complexity and entropy, namely, as sample size grows the two approach each other. For linguists this means that as input to the learner grows, its uncertainty about parameter settings shrinks. There must be an upper limit on the complexity of any given parameter setting, for the input data that allow the learner to make the setting become ever less frequent the more complex the setting. At some level of complexity the frequency of corresponding data becomes so low that it fails to meet the threshold required for learnability. Corpus-based studies can reveal the upper bound of complexity and thereby inform both the typologist and the learnability researcher as to how much information can be packed into a parameter. This in turn should have consequences for theoretical syntax and developmental psycholinguistics.
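Kolmogorov complexity itself is not computable, but the contrast between structured and patternless objects can be illustrated with an ordinary compressor, whose output length is at best a crude upper-bound proxy. The sketch below is such an illustration only; the use of `zlib`, the helper `compressed_size` and the two test strings are my assumptions, not anything in the chapter.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a rough complexity proxy)."""
    return len(zlib.compress(data, 9))

N = 10_000

# A highly predictable object: the same two symbols repeated.
structured = b"ab" * (N // 2)

# A pseudo-random stand-in for a sequence of coin tosses. It comes from a
# seeded generator, so it is reproducible, but a general-purpose compressor
# finds no pattern in it.
rng = random.Random(0)
random_bytes = bytes(rng.getrandbits(8) for _ in range(N))

structured_size = compressed_size(structured)
random_size = compressed_size(random_bytes)
print(structured_size, random_size)
```

The repeated string shrinks to a small fraction of its length while the patternless one does not shrink at all, which mirrors the claim that only objects with internal predictability admit short descriptions.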
In Chapter 5 Sakas and Fodor (S&F) introduce the Structural Triggers Learner (STL), in three versions: the strong STL, which uses parallel processing of many candidate grammars, the weak STL, which uses serial processing and throws out all ambiguous input sentences, and the dynamic weak STL, which gleans what it can from ambiguous sentences. Before examining any of these, however, S&F discuss four general problems (summarized below) and show the unsupportability of Gibson and Wexler's (1994) Triggering Learning Algorithm (ignored here).
S&F posit three phases for the learner's work in establishing a parameter setting: I. recognizing the trigger structure when present in the input; II. adopting the corresponding value for the parameter; III. finding any other parameter settings that are now in error and resetting them by I and II. That underlying phrase structures get changed by derivational processes is a problem for the learner. For example, how is a learner of German, underlyingly verb last, going to arrive at that setting when nearly all of the input consists of independent clauses, which have the verb first or second? Of course, the input reflects movement of the German verb, but that brings up another problem. In an SVO string the German verb has been moved into the C(omplementizer) position, the English verb remains in its underlying position in VP, and the French verb has been moved up to an inflectional head. However, in none of these languages is there anything in the input's surface string to indicate what node dominates the verb. How is a learner to set a parameter for the verb's landing site? S&F assume a parser, which sets parameters, including "deep" parameters, i.e. those for which there is never any triggering information in the surface string of the input. The parser can work only with the learner's current grammar, but triggers from which the learner can learn are contained in only that input which the parser cannot yet parse. Sidestepping this "parsing paradox", what the parser delivers is a phrase structure tree for the surface string together with all the information telling how this tree is derived from the underlying tree. The string SVO is necessarily ambiguous to the learner because its verb position could be underlying, as in English, or derived from SOV, as in German. The parser's output ought to reveal a trigger for setting the verb-landing-site parameter, but that can't happen here because the parser yields two mutually incompatible outputs.
What does the learning device do with ambiguous input data?
Although S&F consider both a strong and a weak version of the STL, they decide in the end on a third version. Basic to all three is the Parametric Principle, by which the value of every parameter is set for all time and independently of all other parameters. This yields rapid acquisition, because each successive setting act eliminates a (progressively smaller) host of candidates for target grammar. Independence of parameters also eliminates the need for III above.
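The arithmetic behind this claim can be sketched as follows, under assumptions that go beyond what is stated above: all parameters are taken to be binary and fully independent, and a value once set is never revised. The function `remaining_grammars` is a hypothetical name used only for illustration.

```python
from typing import List

def remaining_grammars(n_parameters: int, n_set: int) -> int:
    """Candidate grammars still in play after n_set binary parameters are fixed."""
    if not 0 <= n_set <= n_parameters:
        raise ValueError("n_set must lie between 0 and n_parameters")
    return 2 ** (n_parameters - n_set)

n = 10  # illustrative number of parameters (assumption)

# Size of the candidate space after 0, 1, ..., n parameters have been set.
trajectory: List[int] = [remaining_grammars(n, k) for k in range(n + 1)]

# Number of candidates eliminated by each successive setting act.
eliminated: List[int] = [a - b for a, b in zip(trajectory, trajectory[1:])]

print(trajectory)
print(eliminated)
```

Each setting act halves the space, so the number of grammars eliminated shrinks at every step, and ten acts suffice to reduce 1024 candidates to one.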
The S&F process of setting parameters has the parser working with the learner's current grammar until it encounters a (sub)string that it cannot parse; it then turns to a "supergrammar" which contains "treelets" supplied by UG. These treelets are minimal structures--S&F's examples show only two branches--in which the terminals and the node(s) are labeled. Each represents one value of one parameter; a binary parameter thus has two treelets, an n-ary parameter has n treelets. Assuming that treelet selection (parameter setting) leads the learner to build the underlying syntactic structure of the target language, what happens when the (sub)string the parser is looking at is derived, i.e. distorted by deletion, movement, etc.? If I understand correctly, S&F offer two answers: the (sub)string contains traces reflecting underlying structure; the supergrammar contains, in addition to underlying treelets, also treelets for all possible derived structures, i.e. derivational steps also have parameters.
This model eliminates I above, since the triggers are discovered by the parser's getting hung up. The parsing paradox is eliminated by the parser's ability to turn to the supergrammar. The derivation problem is solved by the presence of derived treelets in the supergrammar. Problems like that of the verb's landing site are solved by labels on the UG treelets. The problem of ambiguous input remains, but S&F suggest that in practice it may be a small problem, since adult input during early learning may be very simple, expressing no more than six parameters per sentence.
Overall, this book focuses on the mathematics of formalizing and testing theories. Actual adult sentences addressed to children are rarely cited, and children's own speech is never discussed. Of course, it's probably true that acquisition is nearly complete when children begin uttering three- and four-word strings, so speculation about what's going on before that point should not be unwelcome. Yet the formal, mathematical approach forces the practitioner to make simplifying assumptions, and the more of these there are, the farther they take him/her from common sense and the real world. All of these authors know this. Sakas and Fodor even illustrate it in their endnote 12 with a joke that circulated once among mathematicians: "A Mafia boss kidnaps a mathematician, locks him into a dank cellar, says 'I'll be back in six months and you must then give me a formula to predict whether my horse will win at the races. If you don't, I'll shoot you.' He leaves. He returns in six months, asks the mathematician for the formula, the mathematician doesn't have it, the Mafia man pulls out his gun. But the mathematician says 'No, don't shoot me. I don't have the formula yet but I have made significant progress. I have it worked out for the case of the perfectly spherical horse.'"
Among models discussed, that of Sakas and Fodor is the most detailed, but it too falls short. Traces are as discernible in the input as node labels, i.e. not at all. The supergrammar is so loaded with treelets that it threatens to lose the internal predictability of structured objects in the sense of Clark, Chapter 4. Assuming binarity, a parameter ought to have only plus and minus values, e.g. plus for the VP parameter could mean VO, and minus would then mean OV. The same for what we used to call rules, e.g. plus or minus V2. If research supports them, the values M(arked) and U(nmarked) would be even better. The claim that every parameter is set independently of every other loses the insight of implicational universals, e.g. underlying verb-last structure implies postpositions. If early input were to contain no adpositions but clearly trigger OV, that should in turn trigger an initial and unmarked setting for postpositions (contingent unmarkedness as per Roberts, Chapter 3). If later input revealed only prepositions, the learning device ought to be able to change the adposition parameter setting to prep. Finally, Sakas and Fodor worry too much about ambiguous input. If they were to take intonation and prosody into account, ambiguity would drop sharply. For example, they label the following sentence ambiguous: He fed her dog biscuits (biscuits to her dog or dog biscuits to her). Speak the sentence aloud once for each reading; you will find no ambiguity. Note also that, from the page, the second reading is hard to fetch--and a literate child would never fetch it--because it expresses anomalous behavior. This fact is not unimportant.
REFERENCES
Chomsky, N. (1981) Principles and Parameters in Syntactic Theory, in N. Hornstein and D. Lightfoot, eds., Explanation in Linguistics: the Logical Problem of Language Acquisition, Longman.
Chomsky, N. (1995) The Minimalist Program, MIT Press.
Cinque, G. (1997) Adverbs and the Universal Hierarchy of Functional Projections, unpublished manuscript, University of Venice.
Gibson, E. and K. Wexler (1994) Triggers, Linguistic Inquiry 25: 407-54.
Kauffman, S. (1995) At Home in the Universe, Viking Press.
Wexler, K. and P. Culicover (1980) Formal Principles of Language Acquisition, MIT Press.
ABOUT THE REVIEWER
Lee Fullerton is an Associate Professor on leave from the University of Minnesota. His main scholarly interests are historical Germanic morphology and phonology and the syntax of Modern German.