Editor for this issue: Terence Langendoen <terry
linguistlist.org>
Van Eynde, F., Schuurman, I., and Schelkens, N., eds. (2000) Computational Linguistics in the Netherlands 1998, Editions Rodopi B.V., 233 pp., Studies in Practical Linguistics

Laura Alonso, CLiC (Centre for Language and Computation), University of Barcelona

This book consists of fourteen selected papers from the 1998 Computational Linguistics in the Netherlands conference (CLIN). The contents fall into three groups: statistical methods, syntax/semantics and applications.

The section on statistical methods is the largest, with six papers. Two of them concentrate on phonotactic properties (Kleiweg & Nerbonne and Stoianov & Nerbonne). The other four are of a general methodological nature; they concern style adaptation of statistical language models (Van Uystel, Wambacq & Van Compernolle), memory-based word sense disambiguation (Veenstra, van den Bosch, Buchholz, Daelemans & Zavrel), instance families in memory-based language learning (van den Bosch), and the arbitrariness of lexical categories (Durieux, Daelemans & Gillis).

The section on syntax and semantics consists of five papers. Three of them employ the framework of Head-driven Phrase Structure Grammar (HPSG) for formal analyses of Polish clitics (Kupsc), null-headed nominals in German and English (Nerbonne & Mullen), and words without content, or figure heads (Van Eynde). The other two report on work connected with the Amazon grammar, on the last part of the middle field in Dutch (Van Dreumel) and on parenthetical reporting clauses (Schelfhout).

The section on applications has three papers. One shows how NLP tools can be used for the development of a course on computational linguistics (Bouma). Another one discusses the use of NLP techniques in document processing (van der Eijk & Janssen).
The last one concerns the evaluation of the NLP components of the OVIS2 spoken dialogue system (Veldhuijzen van Zanten, Bouma, Sima'an, van Noord & Bonnema).

Here is a summary and discussion of the papers in the volume.

Kleiweg, P., Nerbonne, J., 'An FGREP Investigation on Phonotactics'. This paper discusses experiments with neural networks trained to represent letters in a manner meaningful to the processing task: the network is presented with monosyllabic words, one letter at a time, and has to learn to predict the next letter. The algorithm used is Miikkulainen's FGREP (Forming Global Representations with Extended Back Propagation). However, since the network did very badly at distinguishing valid from invalid words, FGREP was augmented with a 'dispersion' algorithm to improve distinctness among the letter representations. This improved performance and made the network sensitive to dependencies between letters separated by one or more letters, yielding a stricter notion of 'valid word'. These experiments are psycholinguistically interesting in that they were aimed at capturing some aspects of human language processing rather than at developing a better algorithm for the machine learning of language. The data sets used for these experiments and the results are available at: http://www.let.rug.nl/~kleiweg/papers/afiip.

Stoianov, I., Nerbonne, J., 'Exploring Phonotactics with Simple Recurrent Networks'. This paper extends an initial experiment on learning graphotactics to the learning of phonotactics. In addition, a further analysis of the neural network is conducted with regard to variables such as word frequency, length, neighborhood and error location. This informal comparison of SRNs and human performance suggests that neural networks may be used for learning natural language, thus challenging connectionism to tackle symbolic problems.

Van Uystel, D.H., Wambacq, P., Van Compernolle, D., 'Style Adaptation of Statistical Language Models'.
Language models assign each sentence hypothesis generated by a speech recognizer its probability of occurring in a given domain. Statistical language models are based on n-grams: an n-gram model assumes that the occurrence of a word in a sentence depends only on the preceding n-1 words. This paper discusses the adaptation of language models from a general training corpus (broadcast news or financial newspaper) to the style of a given domain (news talkshow). The authors propose three different adaptation schemes, transformation and two variants of relevance weighting, which make use of a weighted counting approach. The results show that weighted counting is more effective on the financial newspaper corpus than on the broadcast news, although the gains from the tested methods remain rather modest. POS n-grams may not be sufficient to characterize style, so more fine-grained style distinctions should be made.

Veenstra, J., Van den Bosch, A., Buchholz, S., Daelemans, W., Zavrel, J., 'Memory-based Word Sense Disambiguation'. The authors present a method for the word sense disambiguation task in the SENSEVAL project <http://www.itri.brighton.ac.uk/events/senseval>: the association of a word in context with its contextually appropriate sense tag. For each word to be disambiguated, a distinct classifier is constructed and then trained on POS-tagged corpus examples and selected information from dictionary entries. The classifier extracts: (1) context features: a window of two words (and their POS tags) to the left and the right of the word of interest; and (2) keyword features: a number of relatively frequent words that often co-occur with the sense of interest. The method achieves a relatively high accuracy and is computationally very economical, in contrast with other machine-learning methods that cannot deal with the number of features that interact in word sense disambiguation.
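The context features described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `context_features`, the padding symbol `_` and the example sentence with its POS tags are assumptions introduced here, and the keyword features are left out.

```python
from typing import List, Tuple

# Hypothetical padding used when the window extends past the sentence edge.
PAD = ("_", "_")


def context_features(
    tagged: List[Tuple[str, str]], index: int, window: int = 2
) -> List[str]:
    """Collect the words and POS tags within `window` positions to the
    left and right of the target word at tagged[index]."""
    features: List[str] = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the target word itself is not a context feature
        pos = index + offset
        word, tag = tagged[pos] if 0 <= pos < len(tagged) else PAD
        features.extend([word.lower(), tag])
    return features


# Illustrative sentence; the tags are assumed, not taken from the paper.
sentence = [
    ("He", "PRP"), ("sat", "VBD"), ("on", "IN"), ("the", "DT"),
    ("bank", "NN"), ("of", "IN"), ("the", "DT"), ("river", "NN"),
]
feats = context_features(sentence, 4)  # target word: "bank"
```

Each feature vector of this kind, paired with a sense tag, would be one stored instance for the word-specific classifier.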
An interesting future direction for this investigation would be to determine whether the method can feed on dictionary information only, when no abundant labeled training corpus is available.

Van den Bosch, A., 'Instance Families in Memory-Based Language Learning'. Pure memory-based language learning treats a set of pre-classified linguistic training instances as points in a multi-dimensional feature space. They are stored in memory, and new instances are classified by matching them against all instances in the instance base. This paper shows how careful abstraction can reduce the memory requirements of these systems. Six automated classification tasks are carried out to compare the two approaches. The FAMBL algorithm (FAMily-Based Learning) carefully merges groups of nearest-neighbor instances labeled with the same class into a single, more general instance. It works in two stages: a 'probing' stage in which all possible families are extracted randomly, and a 'family extraction' stage in which no family is extracted that has more members or more distance between members than the median. This reduces the number of items in memory by between 31% and 75%, depending on the task. When a new instance is submitted, it is matched against the stored family expressions rather than against the whole instance base. The careful abstraction method was applied to six language tasks: grapheme-phoneme conversion, word pronunciation, morphological segmentation, base-NP chunking, PP attachment, and part-of-speech tagging. It performed close to a non-abstraction algorithm (IB1-IG), though equaling it on only three of the six tasks. The FAMBL algorithm does not adequately handle properties like very high disjunctivity or feature interaction. The incorporation of feature interaction is a relevant point for future research in the field of language learning, involving linguistic, cognitive and computational knowledge.

Durieux, G.,
Daelemans, W., Gillis, S., 'On the Arbitrariness of Lexical Categories'. This paper shows how automated classification can predict, with similar reliability, categories in domains that are linguistically considered to differ greatly in their degree of arbitrariness. The starting hypotheses were: (1) that the degree to which a lexical category is predictable can be shown quantitatively by machine learning techniques, and (2) that memory-based learning succeeds in learning lexical categories irrespective of their degree of arbitrariness. These hypotheses were tested through automated classification tasks concerning three lexical categories in Dutch of varying predictability: completely predictable (diminutive), rather predictable (stress) and essentially arbitrary (gender). The results of the two experiments show that predictability is indeed reflected by machine learning techniques and that any lexical category can be learned with these methods. However, a correlation was observed between predictability and success in learning: the more arbitrary a category is, the harder it is for the algorithm to learn it correctly.

Kupsc, A., 'Position of Polish Clitics: an HPSG Approach'. Even though Polish clitics have a rather free distribution, they are subject to some constraints that are worth analyzing. Since most of the positions of clitics follow from general principles of Polish linear order, the author treats them as syntactic items and uses order domains, which account for Polish linear order facts, to account for their distribution as well. An LP constraint is proposed which uniformly accounts for the distribution of both preverbal and postverbal clitics. Two alternatives to this approach are also analyzed: one based on lexical weight and the other on topological fields.
However, neither of them yields satisfactory results: the first is still too general, and the second is too restrictive for the freedom of order of Polish.

Nerbonne, J., Mullen, T., 'Null-Headed Nominals in German and English'. The authors give an explanation for certain nominal phrase constructions in German and English which are best considered as having empty lexical heads: they propose the Left Periphery feature of a nominal tree structure, which can be empty, full or none. Simple, language-specific rules are then applied to account for the combination of signs according to their Left Periphery values: for example, determiners such as 'none' or 'mine' are restricted to combining with nominal constituents whose left periphery is empty, while 'no' and 'my' require a nominal constituent with a full left periphery. These rules also give a satisfactory account of some kinds of anaphoric constructions in German which seem to lack a clear nominal head, as well as of the 'one' anaphor in English. The cross-linguistic descriptive power of this account justifies the use of the null constituent. To show this, a grammar has been implemented which, not using the null constituent or the feature Left Periphery, introduces ambiguity and requires unnecessary detail. The Left Periphery feature seems to provide a general explanation for a number of phenomena where determiners or adjectives, and not nominal elements, appear to be the central part of the phrase.

Van Eynde, F., 'Figure Heads in HPSG'. Figure heads are words without semantic content, such as the copula or the infinitival 'to'. For these, the head-driven semantics of HPSG-94 stipulates that the semantic value of a semantically vacuous word is identified with that of its complement, in contrast to the general tendency of this framework to assign to a head-complement combination the content value of its head daughter. But this leaves unresolved the semantic contribution of the verbal tense.
Moreover, no criteria are given for the identification of vacuous words. To overcome part of these limitations, the author integrates a treatment of the tenses in HPSG and defines some criteria for identifying vacuous words. In doing so, he provides empirical evidence against the figure head treatment of HPSG, and replaces it with an alternative, nonsubstantive analysis in which vacuous verbs have no 'content' feature.

Van Dreumel, S., 'The Amazon Grammar and the Last Part of the Middle Field'. The author successfully formalizes the end of the structuralist Middle Field, the part of the sentence that lies between the two poles of the sentence. Since these two poles can be empty in Dutch, the boundary between the two fields is invisible, thus causing a transparency problem for a structuralist parser like AMAZON. A parsing technique is proposed that predicts the Middle Field closing point by recognizing and determining the internal order of elements with closing properties, namely particles belonging to pronominal adverbs and predicative elements. Once formalized and implemented, the properties of these closing mi-elements enable the development of more efficient and more robust parsers, able to handle transparency situations in which an empty verb cluster is followed by an infinitival complement without complementizer.

Schelfhout, C., 'Corpus-Based Analysis of Parenthetical Reporting Clauses'. In investigating the syntactic properties of parenthetical reporting clauses in Dutch, the paper shows that an overly simple analysis, in which the quote is considered the direct object of the reporting verb, is inadequate. The author proposes an analysis in which the quote and the reporting clause are taken to be adjoined. However, some problems arose from this analysis, because obligatory objects were sometimes absent, and there were also frequent inversions in the reporting clause.
To solve this, an abstract particle 'so' was postulated in first position in reporting clauses. This particle may or may not be explicit, and it stands in an anaphoric relation to the quote. This analysis will be implemented in AMAZON (a parser for Dutch), and it is hoped that testing the implementation on the corpus will shed light on some questions that still remain, such as: What is the relation between the 'so' and the quote or a possible direct object in the quote? Are there contexts in which 'so' could never become explicit? At which positions in the quote can a reporting clause occur?

Bouma, G., 'A Modern Computational Linguistics Course using Dutch'. This paper presents a course in computational linguistics concentrating on realistic language technology applications for Dutch. The main goals of the course are that students learn to use high-level tools and that they become familiar with quantitative evaluation methods by working with real data. Some illustrative exercises in the course deal with finite state methods using regular expression calculus, grammar development, and natural language interface development, such as report generation and the development of a question-answering system.

van der Eijk, P., Janssen, D., 'XML Mixed Content Grammars'. The authors argue that some NLP tools, such as style checkers or Controlled Language tools, need to evolve into content processing. They discuss the extension of a Controlled Language tool (CLarity) by mapping XML DTDs to a constraint-based grammar formalism, embedding the XML document structure within the syntactic analysis. The only drawback of this extension is the loss of modularity between grammar and DTD, which requires grammar developers to be competent both in computational linguistics and in document technology. However, this extension can be generalized to support many arbitrary DTDs.
They also discuss the effect of XML content manipulation on XML document integrity, and outline conditions on grammars under which an NLP system preserves XML well-formedness and validity, interpreted as guidelines for grammar writers and implemented as a verification procedure.

Veldhuijzen van Zanten, G., Bouma, G., Sima'an, K., van Noord, G., Bonnema, R., 'Evaluation of the NLP Components of the OVIS2 Spoken Dialogue System'. In the framework of a five-year research program for the development of spoken language information systems, two natural language processing modules are being developed: a rule-based grammar module and a data-oriented module that is memory-based and stochastic. In order to compare them, a formal evaluation has been carried out, showing that the grammar-based component performs much better than the data-oriented one and requires far fewer computational resources. For example, the best data-oriented method obtains an error rate for concept accuracy of 24.5%, whereas the best grammar-based method obtains a rate of 17%, and the differences increase with sentence length. The most important problem for the application is the disambiguation of the word graph used for representing all sequences of words that the speech recognizer hypothesizes for a spoken utterance. Here, a combination of speech scores and trigram scores performs much better in string accuracy than the data-oriented methods. The grammar-based methods incorporate N-gram statistics.

Laura Alonso i Alemany is a postgraduate student at the University of Barcelona. She is currently working on a shallow rhetorical parser for Spanish unrestricted text. Her thesis project consists of developing a rhetorical-structure-based system for automated text summarization for Spanish. Her areas of interest are discourse and rhetoric, and natural language processing.