Date: Mon, 17 May 2004 10:35:40 +0300 From: Verginica Mititelu Subject: Treebanks: Building and Using Parsed Corpora
EDITOR: Abeillé, Anne TITLE: Treebanks SUBTITLE: Building and Using Parsed Corpora SERIES: Text, Speech and Language Technology, volume 20 PUBLISHER: Kluwer Academic Publishers YEAR: 2003
Verginica Barbu Mititelu, Institute for Artificial Intelligence, Romanian Academy
The book is a collection of 21 papers on building and using parsed corpora, most of them previously presented at workshops and conferences (ATALA, LINC, LREC, EACL).
The objective of the book, as stated in the Introduction, is to present an overview of the work being done in the field of treebanks, the results achieved so far, and the open questions. The intended audience comprises linguists, including computational linguists, psycholinguists, and sociolinguists.
The book is organized in two parts: Building treebanks (15 chapters, pp. 1-277) and Using treebanks (6 chapters, pp. 279-389), each with subparts. It also contains a preface (p. xi), an introduction (pp. xiii-xxvi), a list of contributing authors and their affiliations (pp. 391-397), and an index of topics (pp. 399-405).
The organization of the Introduction (signed by Anne Abeillé) mirrors the structure of the whole book: it has two parts, entitled Building treebanks and Using treebanks, respectively. After making the terminological distinction between tagged corpora and parsed corpora (or treebanks), the author explains why treebanks are needed and gives a general presentation of the topics covered by the papers in the volume, stressing that the problems encountered for each language are, to a great extent, the same, which accounts for a certain redundancy among the papers collected in this volume.
PART I. BUILDING TREEBANKS
The chapters of the first part are grouped according to the language or language family for which the approaches to building treebanks are presented: the first four chapters are dedicated to English treebanks, the next two to German ones; there are two papers on Slavic treebanks, four on Romance parsed corpora, and the last three chapters of the first part address treebanks for other languages (Chinese, Japanese, Turkish).
ENGLISH TREEBANKS Chapter 1. The Penn Treebank: an Overview. Ann Taylor, Mitchell Marcus, and Beatrice Santorini. The authors present the annotation schemes and the methodology used during the 8-year Treebank project. The part-of-speech (POS) tagset is based on that of the Brown Corpus, but adjusted to serve the stochastic orientation of the Penn Treebank and its concern with sparse data, and reduced to eliminate lexical and syntactic redundancies. More than one tag can be associated with a word, thus avoiding arbitrary decisions. POS tags also contain syntactic functions, thus serving as a basis for syntactic bracketing. The bracketing style was modified during the project: it began as a skeletal context-free bracketing with limited empty categories and no indication of non-contiguous structures and dependencies, and evolved into a style of annotation that aimed at clearly distinguishing between arguments and adjuncts of a predicate, recovering the structure of discontiguous constituents, and making use of null elements and coindexing to deal with wh-movement, passives, and subjects of infinitival constructions. The first objective was not always easy to achieve via structural differences, which is why a set of easily identifiable roles was defined, although these sometimes proved difficult to apply as well. The Penn Treebank (PTB) project also produced dysfluency annotation of transcribed conversations, labeling complete and incomplete utterances, non-sentence elements (fillers, explicit editing terms, discourse markers, coordinating conjunctions), and restarts (with or without repair).
For all three annotation schemes, a two-step methodology was adopted: an automatic step (the PARTS and Brill taggers for POS tagging, the Fidditch deterministic parser for syntactic bracketing, and a simple Perl script identifying common non-sentential elements) followed by hand correction.
Chapter 2. Thoughts on Two Decades of Drawing Trees. Geoffrey Sampson. The author develops the idea that the annotation of both written and (transcribed) oral corpora makes obvious the deficiencies of theoretical linguistics and may even contradict some widely accepted conventional linguistic wisdom. For instance, sentences of the form subject-intransitive verb are rather infrequent in English corpora, contrary to what can be found in some linguistics textbooks.
Chapter 3. Bank of English and beyond. Timo Järvinen. The aim of this paper is twofold. On the one hand, the author describes the four modules (pre-processing, i.e. segmentation and tokenization; POS assignment; POS tagging; functional analysis) of the English Constraint Grammar (ENGCG) system used for annotating corpora for compiling the second edition of the Collins COBUILD Dictionary of English, as well as the methodology adopted in view of the huge amount of data to be dealt with; manual inspection was thus possible only for some random fragments of the data, and automatic methods were created for monitoring the rest.
On the other hand, as clearly stated, the CG system was chosen for its morphological accuracy, but its syntactic ambiguity was too high. That is why Järvinen argues for a Functional Dependency Grammar (FDG) parser, which better deals with long-distance dependencies, ellipses and other complex phenomena. He points out the need for deep parsing instead of shallow parsing, his reasons being, besides the lower ambiguity, the practical orientation of the former.
Chapter 4. Completing Parsed Corpora. Sean Wallis. A more challenging title for this paper could have been: ''Do we need linguists for constructing treebanks?'' To answer this question, S. Wallis starts by giving a brief overview of the phases of the annotation of the International Corpus of English - British Component (ICE-GB) and by pointing out that the use of two parsers (TOSCA and the Survey parser) increased the number of inconsistencies in the corpus, hence the necessity of post-correction. He provides two arguments against Sinclair (1992), who found human annotators a source of errors in the treebank.
To ensure the quality of the parsed corpus, one has two problems to solve: decision (i.e. the correctness of the analysis) and consistency (of the analysis throughout the corpus). S. Wallis draws a distinction between longitudinal correction (working through a corpus sentence by sentence until it is completed) and transverse correction (working through a corpus construction by construction), bringing arguments in favor of the latter: it is less time-consuming and offers control over the accuracy of the analysis and its consistency. The price paid is difficulty in implementing and managing the process. But once the grammatical query tool (Fuzzy Tree Fragments) is created, it can be used not only for correction but also for searching and browsing the corpus for linguistic queries, thus providing a post-project use of the tool.
As clearly stated in the Critique section of Wallis's paper, the question formulated above receives an affirmative answer if the final aim of the corpus is not a study of the parser performance, but of language variation.
GERMAN TREEBANKS Chapter 5. Syntactic Annotation of a German Newspaper Corpus. Thorsten Brants, Wojciech Skut, Hans Uszkoreit. This paper is a presentation of the syntactic annotation of the NEGRA newspaper corpus. Language-specific reasons (free word order, among others), the corpus structure (frequently elliptical constructions) and the characteristics of the formalism motivated the choice of Dependency Grammar for the annotation. However, it was modified to take advantage of phrase-structure grammar as well: flat structures, no empty categories, treatment of the head as a grammatical function expressed by labeling rather than by the syntactic structure, allowance of crossing branches (which give rise to a large number of errors), a more explicit annotation of grammatical functions, and encoding of predicate-argument information.
A characteristic of this project is the interactive annotation process, which makes use of the TnT statistical tagger and second-order Markov models for POS tagging. Syntactic structure is built incrementally, using cascaded Markov models. A graphical user interface allows for manual tree manipulation and runs taggers and parsers in the background. Human annotators need to concentrate only on the problematic cases, which are assigned different probabilities by the statistical tagger and parser. Accuracy is ensured by having the same set of sentences annotated by two different annotators. Differences are discussed and, after agreement is reached, modifications are applied to the annotation.
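A second-order (trigram) Markov model of the kind TnT uses estimates tag-transition and word-emission probabilities as relative frequencies over the training corpus. A minimal sketch of that estimation step (the toy corpus and tag names are invented for illustration; a real tagger adds smoothing, unknown-word handling, and Viterbi decoding):

```python
from collections import Counter

# Tiny hand-made tagged corpus, for illustration only.
sents = [
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("the", "DET"), ("dog", "N"), ("barks", "V")],
]

# Estimate P(tag_i | tag_{i-2}, tag_{i-1}) and P(word_i | tag_i)
# from relative frequencies, padding each sentence with <s> markers.
trans, bigram, emit, tagc = Counter(), Counter(), Counter(), Counter()
for sent in sents:
    tags = ["<s>", "<s>"] + [t for _, t in sent]
    for i in range(2, len(tags)):
        trans[(tags[i - 2], tags[i - 1], tags[i])] += 1
        bigram[(tags[i - 2], tags[i - 1])] += 1
    for w, t in sent:
        emit[(w, t)] += 1
        tagc[t] += 1

def p_trans(t2, t1, t):  # P(t | t2, t1)
    return trans[(t2, t1, t)] / bigram[(t2, t1)]

def p_emit(w, t):        # P(w | t)
    return emit[(w, t)] / tagc[t]

print(p_trans("<s>", "DET", "N"))  # 1.0 on this toy corpus
print(p_emit("cat", "N"))          # 0.5
```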
The design of the corpus and the annotation scheme make it usable for different linguistic investigations and also for training taggers and chunkers.
Chapter 6. Annotation of Error Types for German Newsgroup Corpus. Markus Becker, Andrew Bredenkamp, Berthold Crysmann, Judith Klein. This paper presents work towards controlled-language and grammar-checking applications for German. The corpus in the FLAG project consisted of email messages, as they present the required characteristics: high error density, accessibility, and electronic availability. Their annotation proceeded in three phases: developing a typology of grammatical errors in the target language (German), manual annotation on paper, and annotation by means of computer tools.
The first phase relied on traditional grammar books and its outcome was a type hierarchy of possible errors, which also contains error domains (attempts to define the relations between the affected words) useful in guiding the detection of errors. Although the hierarchy was fine-grained, only a pool of 16 error types was to be detected and classified in the annotation process. After being manually annotated, the same set of sentences was annotated in turn with two tools: Annotate and DiET. The annotation with the former has a tree format: the nodes are the error types and the edges carry descriptive information on these types, yielding a rich representation of the structure of errors in terms of relations. However, this representation is built bottom-up, the error type being added last. DiET offers a better method for configuring an annotation schema, which is why the annotation was performed with this latter tool. The overwhelming majority of errors were orthographic (83%), followed at a great distance by grammatical ones (16%).
TREEBANKS FOR SLAVIC LANGUAGES Chapter 7. The Prague Dependency Treebank. Alena Böhmová, Jan Hajic, Eva Hajicová, Barbora Hladká. For the annotation of the Czech newspaper corpus, a 3-level structure was used. At the morphological level, the automatic analyzer ideally produces, for each token in the input data, the lemma and the associated MTag. Whenever more than one lemma and/or MTag is produced, manual disambiguation is needed. For the analytical (syntactic) level of annotation, dependency structure was used, based on a dependency/determination relation. Solutions were found for problematic structures such as coordination, ellipsis, ambiguity, and apposition. Two modes of annotation were employed: first, manual annotation; then the Collins parser was trained on the annotated data and used to generate the structure, while syntactic functions continued to be assigned manually. The separately produced morphological and analytical annotations are then merged, all discrepancies being manually resolved. The third level of annotation, the tectogrammatical one, describes the meaning of sentences by means of tectogrammatical functions and the information structure of sentences. Analytic trees are transduced into tectogrammatical ones in two phases: an automatic one (which makes the necessary changes to the syntactic trees, such as merging auxiliary nodes with verbs) and a manual one.
Chapter 8. An HPSG-Annotated Test Suite for Polish. Malgorzata Marciniak, Agnieszka Mykowiecka, Adam Przepiórkowski, Anna Kupsc. The aim of the paper is to present the construction of a test suite for Polish, consisting of written sentences, both correct and incorrect, the latter manually annotated with correctness markers. Each of these two types is further classified into three subgroups according to complexity. Moreover, each sentence is hand-annotated with the list of linguistic phenomena it displays, chosen from nine groups of hierarchies of such phenomena. Sentences are annotated with attribute-value matrices (AVMs), whose content is restricted by an HPSG signature. The result is a database of sentences, the correct ones augmented with their HPSG structures, and a database of wordforms. The aim of the former database is to evaluate computational grammars for Polish.
TREEBANKS FOR ROMANCE LANGUAGES Chapter 9. Developing a Syntactic Annotation Scheme and Tools for a Spanish Treebank. Antonio Moreno, Susana López, Fernando Sánchez, Ralph Grishman. The paper reports on building an annotated Spanish corpus based on newspaper articles. Problems specific to Spanish are presented: dealing with multiword constituents and with amalgams or portmanteau words, null subjects and ellipses, ''se''-constructions, etc. There are three levels of annotation: syntactic categories, syntactic functions, and morpho-syntactic features together with some semantic features. The annotation and debugging tools are also presented in the paper, along with some error statistics, the current state of the Spanish treebank, and future developments.
Chapter 10. Building a Treebank for French. Anne Abeillé, Lionel Clément, François Toussenel. A newspaper corpus, representative of contemporary written French, was subjected to automatic tagging (segmentation with special attention to compounds, tagging relying on a trigram method, and retagging making use of contextual information) and parsing (surface and shallow annotation, theory-neutral, with the aim of identifying sentence boundaries and limited embedding). Each annotation with morphosyntax, lemmas (based on lexical rules), compounds and sentence boundaries was followed by manual validation. The resulting treebank was used for evaluating lemmatizers and for training taggers.
Chapter 11. Building the Italian Syntactic-Semantic Treebank. Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari, Ornella Corazzari, Alessandro Lenci, Antonio Zampolli, Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili, Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana, Fabio Pianesi, Rodolfo Delmonte. The paper presents the syntactic-semantic annotation of a balanced corpus and of a specialized one. Four levels of annotation were adopted: morpho-syntactic annotation (POS, lemma, morpho-syntactic features); syntactic annotation, made up of constituency annotation (identification of phrase boundaries and labeling of constituents) and functional annotation (with functional relations); and lexico-semantic annotation (distinguishing among single lexical items, semantically complex units and title sense units; specification of senses for each word, relying on ItalWordNet, along with other lexico-semantic information such as figurative usage, idiomatic expressions, etc.). The first two types of annotation were performed semi-automatically, while the other two were performed manually. This treebank brings two innovations: sense tagging (which amounts to a semantic annotation of the corpus) and two distinct layers of syntactic annotation, constituency and functional, grounded in language-specific phenomena (such as free constituent order and the pro-drop property) and in further uses of the resulting treebank, which is compatible with different approaches to syntax.
In the second part of the article the annotation tool, GesTALt, is presented: its constituent applications and the architecture of the tool. Finally, the uses of the obtained data are presented: improvement of a translation system, enrichment of dictionaries, and improvement at the level of analysis.
Chapter 12. Automated Creation of a Medieval Portuguese Partial Treebank. Vitor Rocio, Mário Amado Alves, J. Gabriel Lopes, Maria Francisca Xavier, Gracia Vicente. The novelty of the approach presented in this paper lies in applying tools and resources developed for Contemporary Portuguese to the annotation of a corpus of Medieval Portuguese. The differences between these two phases of the language are presented.
The neural-network-based POS tagger was trained on a manually tagged set of words from each of the texts in the Medieval Portuguese corpus. It was then used to extract a dictionary and to tag the rest of the texts, followed by manual correction. For the lexical analysis, a morphocentric lexical knowledge base (LKB) was used. The lexical analyzer takes as input the output of the POS tagger and applies to it the knowledge in the LKB. Its output serves as input to the syntactic analyzer.
The authors present the resources used and the adaptations required to deal with the corpus. A similar method for dealing with corpora of other Romance languages is envisaged.
TREEBANKS FOR OTHER LANGUAGES Chapter 13. Sinica Treebank. Keh-Jiann Chen, Chi-Ching Lou, Ming-Chung Chang, Feng-Yi Chen, Chao-Jan Chen, Chu-Ren Huang, Zhao-Ming Gao. The paper reports on the construction of a treebank for Mandarin Chinese based on the Sinica Corpus, which was already annotated when the treebank project started, so its resources could be reused. The authors provide reasons for their choice of the grammar formalism used for the representation of lexico-grammatical information, namely Information-based Case Grammar. They also present the concepts they work with: the principles of inheritance, the phrasal categories, etc.
The Sinica treebank is not merely a syntactically annotated corpus but also a semantically annotated one, containing thematic information. The automatic annotation process was followed by manual checking, as in most projects. The language-specific phenomena (for instance, constructions with nominal predicates) are given a short presentation, along with the solutions adopted in the annotation process. The treebank aims at being a reliable resource for (theoretical) linguists, but not only for them, so tools for extracting information from it were developed.
Chapter 14. Building a Japanese Parsed Corpus. Sadao Kurohashi, Makoto Nagao. The morphological and syntactic annotation of a Japanese newspaper corpus is presented in this paper. It was developed in parallel with the improvement of the morphological analyzer JUMAN and of the dependency structure analyzer KNP (chosen in accordance with the characteristics of Japanese). The dependency relation is defined on bunsetsu, the traditional Japanese linguistic unit. The free word order of Japanese raised a problem which remains unsolved: the predicate-argument relation in embedded sentences.
Chapter 15. Building a Turkish Treebank. Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür. The aims of the Turkish treebank are to be representative and to contain all the information relevant for its potential users.
There are two levels of annotation: morphological and syntactic. Both take into consideration the characteristics of Turkish, especially its rich inflectional and derivational morphology. Thus, each word is annotated for each of its morphemes, as this information may be necessary for syntax. The free word order and the discontinuities favor the use of the dependency framework. For its typical problems (the pro-drop phenomenon, verb ellipsis, etc.), the solutions adopted in the annotation process are given.
PART II. USING TREEBANKS
Chapter 16. Encoding Syntactic Annotation. Nancy Ide, Laurent Romary. The emergence of treebanks, along with the proliferation of annotation schemes, triggered the need for a general framework to accommodate these schemes and the different theoretical and practical approaches. The general framework (built within XCES) presented in this paper is an abstract model, independent of theory and tagset, that can be instantiated in different ways according to the annotator's approach and goal. This abstract model uses two knowledge sources: the Data Category Registry (an inventory of data categories for syntactic annotation) and a meta-model (a domain-dependent abstract structural framework for syntactic annotation). Two other sources are used for the project-specific formats of the annotation scheme: the Data Category Specification (DCS) (the description of the set of data categories used within a certain annotation scheme) and the Dialect Specification (defining the project-specific format for syntactic annotation). By combining the meta-model with the DCS, a virtual annotation markup language (AML) can be defined for comparing annotations, merging them, or designing tools for visualization, editing, extraction, etc. A concrete AML results from the combination of a virtual AML and a Dialect Specification. The abstract model ensures the coherence and consistency of the annotation schemes.
Chapter 17. Parser Evaluation. John Carroll, Guido Minnen, Ted Briscoe. The emergence of syntactic parsers triggered the need for methods of evaluating them; in fact, this has become a genuine branch of NLP research. This paper presents a corpus annotation scheme that can be used for the evaluation of syntactic parsers. The scheme makes use of a grammatical relation hierarchy containing types of syntactic dependencies between heads and dependents. Based on the EAGLES lexicon/syntax standards (Barnett et al. 1996), this hierarchy aims at being language- and application-independent.
The authors present a 10,000-word corpus, semi-automatically marked up. For its evaluation three measures are calculated: precision (the number of bracketing matches divided by the total number of bracketings returned by the parser), recall (the number of bracketing matches divided by the number of bracketings in the corpus) and F-score, a measure combining the previous two: (2 x precision x recall)/(precision + recall). This last measure can be used to illustrate parser accuracy. The evaluation of grammatical relations provides information about precision and recall for groups of relations or for single relations, and is thus useful for indicating the areas where more effort should be concentrated for improvement.
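The three measures follow directly from the bracket counts; a minimal sketch (the counts below are invented for illustration, not results from the chapter):

```python
def prf(matched, returned, gold):
    """Bracketing precision, recall and F-score as defined above.

    matched  -- number of bracketings the parser got right
    returned -- total bracketings returned by the parser
    gold     -- total bracketings in the gold-standard corpus
    """
    precision = matched / returned
    recall = matched / gold
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical numbers, for illustration only:
p, r, f = prf(matched=85, returned=100, gold=95)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```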
Chapter 18. Dependency-based Evaluation of MINIPAR. Dekang Lin. The author presents a dependency-based method for evaluating parser performance. To represent a dependency tree he makes use of a set of tuples, one for each node in the tree, specifying the word, its grammatical category, its head (if any, together with its position with respect to that head) and its relationship with the head (again, if any). To perform the evaluation, dependency trees are generated for the parser-produced trees (called here answers) and for the manually constructed trees (called keys) and compared on a word-by-word basis. Importantly, a selective evaluation is also possible: one can measure the parser's performance with respect to a certain type of dependency relation or even to a certain word. Two scores are calculated: recall and precision.
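The word-by-word comparison of dependency tuples, including the selective evaluation, can be sketched as follows (the tuple layout and the toy sentence are simplifications for illustration, not the chapter's exact data format):

```python
# Each word maps to a tuple: (category, head word, relation to head).
# The root word has head None and relation None.
key = {  # manually constructed tree ("key")
    "John": ("N", "saw", "subj"),
    "saw":  ("V", None, None),
    "Mary": ("N", "saw", "obj"),
}
answer = {  # parser output ("answer")
    "John": ("N", "saw", "subj"),
    "saw":  ("V", None, None),
    "Mary": ("N", "John", "obj"),  # wrong head
}

def dep_scores(answer, key, relation=None):
    """Precision/recall over dependency tuples; optionally restrict
    the evaluation to one relation type (selective evaluation)."""
    if relation is not None:
        answer = {w: t for w, t in answer.items() if t[2] == relation}
        key = {w: t for w, t in key.items() if t[2] == relation}
    correct = sum(1 for w, t in answer.items() if key.get(w) == t)
    precision = correct / len(answer) if answer else 0.0
    recall = correct / len(key) if key else 0.0
    return precision, recall

print(dep_scores(answer, key))           # overall: 2 of 3 tuples match
print(dep_scores(answer, key, "subj"))   # only subject relations: (1.0, 1.0)
```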
The author goes on to present MINIPAR, a principle-based broad-coverage English parser (Berwick et al. 1991). The dependency-based method presented above is used for evaluating this parser. One interesting outcome of this evaluation is that the parser performs better on longer sentences than on shorter ones. This may be the result of having trained the parser on press reportage, with long sentences, while the shorter sentences are found in fiction, the genre against which the parser was tested.
GRAMMAR INDUCTION WITH TREEBANKS Chapter 19. Extracting Stochastic Grammars from Treebanks. Rens Bod. The assumption (see Scha 1990, 1992, Bod 1992, 1995, 1998) constituting the basis of this article is that ''human language perception and production processes may very well work with representations of concrete past language experiences, and that language processing models could emulate this behavior if they analyzed new input by combining fragments of representations from annotated corpus''. So the idea is to use an already annotated corpus as a stochastic grammar. The idea is not new, but the aim of the article is to answer the question: to what extent can constraints be imposed on the subtrees used without decreasing the performance of the parser?
The results reported here were obtained using a data-oriented parsing (DOP) model (presented in section 2 of the paper) applied to two corpora of phrase structure trees: the Air Travel Information System (ATIS) corpus and the Wall Street Journal (WSJ) part of the PTB. The conclusion drawn from the experiments is that almost all constraints decrease the performance of the model: the most probable parse (which takes overlapping subtrees into consideration) gives better results than the most probable derivation (which does not); the larger the subtrees, the better the predictions (as larger subtrees capture more dependencies than small ones); the larger the lexical context (up to a certain depth, which seems to be corpus-specific), the better the accuracy (as more lexical dependencies are taken into account); low-frequency subtrees make an important contribution to parse accuracy (as they tend to be larger, thus containing more lexical/structural context useful for further parsing); and the use of subtrees with non-headwords has a good impact on the performance of the model (as they contain syntactic relations for those non-headwords which cannot be found in other subtrees).
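The subtrees a DOP model draws on can be enumerated by deciding, at each child of a node, whether to cut there (leaving a bare nonterminal as a substitution site) or to keep expanding. A minimal sketch of this DOP-1-style fragment extraction on a toy tree (the tuple representation is invented for illustration):

```python
from itertools import product

# A tree is (label, child, child, ...); a leaf is a bare string.
tree = ("S", ("NP", "John"), ("VP", ("V", "saw"), ("NP", "Mary")))

def fragments(node):
    """All fragments rooted at this node: at each child, either cut
    (keeping only the child's nonterminal label) or keep the child
    and recurse into it."""
    if isinstance(node, str):
        return [node]
    label, kids = node[0], node[1:]
    child_options = []
    for k in kids:
        if isinstance(k, tuple):
            child_options.append([k[0]] + fragments(k))
        else:
            child_options.append([k])  # lexical leaf stays as-is
    return [(label,) + combo for combo in product(*child_options)]

def all_fragments(tree):
    """Fragments rooted at every internal node of the tree."""
    result = list(fragments(tree))
    for k in tree[1:]:
        if isinstance(k, tuple):
            result += all_fragments(k)
    return result

frags = all_fragments(tree)
print(len(frags))  # 17 fragments for this toy tree
```

Note that the depth-1 fragments (e.g. ("S", "NP", "VP")) are exactly the CFG rules of the tree; the larger fragments are what let DOP capture dependencies a plain PCFG misses.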
Chapter 20. A Uniform Method for Automatically Extracting Stochastic Lexicalized Tree Grammars from Treebanks and HPSG. Günter Neumann. As the title states, the paper presents a uniform method for the automatic extraction of stochastic lexicalized tree grammars (SLTG) from treebanks (allowing corpus-based analysis of grammars) and from HPSG (allowing extraction of domain-independent and phenomena-oriented subgrammars), with the future aim of merging the two SLTGs to improve the coverage of treebank grammars on unseen data and to ease the adaptation of treebanks to new domains.
The major operation in the extraction of an SLTG is recursive top-down tree decomposition according to the head principle; thus each extracted tree is automatically lexically anchored. The path from the lexical anchor to the root of the tree is called a head-chain. Two additional operations are involved: each subtree of the head-chain is copied and the copy is processed individually by the decomposition operation, thus allowing a phrase to occur both in head and in non-head positions; and for each SLTG tree with a modifier phrase attached, a new tree is created with the modifier unattached, so that the extracted grammar can recognize sentences with fewer or no modifiers compared to those seen. The result is an SLTG, which is processed by a two-phase stochastic parser. The rest of the paper describes the extraction of SLTGs from the PTB and the NEGRA treebank, on the one hand, and from a set of parse trees produced with an English HPSG, on the other, as well as some experimental results on the use of an extracted SLTG.
Chapter 21. From Treebank Resources to LFG F-Structures. Anette Frank, Louisa Sadler, Josef van Genabith, Andy Way. This paper presents two methods for automatic f-structure annotation. The first consists in extracting a context-free grammar (CFG) from a treebank, following Charniak (1996). A set of regular-expression-based annotation principles is then developed and applied to the CFG, resulting in an annotated CFG. The annotated rules are rematched against the treebank trees, the result being f(unctional)-structures. The second method uses flat tree descriptions. Annotation principles define projection constraints which associate partial c(onstituent)-structures with their corresponding partial f-structures. When these principles are applied to a flat set-based encoding of treebank trees, they induce the f-structures. The two methods are robust for the following reasons: the principles are partial, underspecified and match unseen configurations; partial annotations are generated instead of failure; and the constraint solver copes with conflicting information.
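The first step of the first method, reading a CFG off a treebank with relative-frequency rule probabilities in the style of Charniak (1996), can be sketched like this (the toy trees and tuple representation are invented for illustration):

```python
from collections import Counter

# Toy treebank: trees as nested tuples (label, child, child, ...).
treebank = [
    ("S", ("NP", "John"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "Mary"), ("VP", ("V", "saw"), ("NP", "John"))),
]

def read_rules(tree, counts):
    """Read off one CFG rule per internal node of the tree."""
    if isinstance(tree, str):
        return
    lhs, kids = tree[0], tree[1:]
    rhs = tuple(k[0] if isinstance(k, tuple) else k for k in kids)
    counts[(lhs, rhs)] += 1
    for k in kids:
        read_rules(k, counts)

counts = Counter()
for t in treebank:
    read_rules(t, counts)

# Relative-frequency probability of each rule given its left-hand side:
lhs_totals = Counter()
for (lhs, _), c in counts.items():
    lhs_totals[lhs] += c
pcfg = {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```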
DISCUSSION
Although this was not the objective of the book, its first part can be used as a textbook by those venturing to construct a treebank. As the papers focus on different types of languages, displaying grammatical phenomena and different ways of dealing with them, they can serve as a repository of solutions to various problems encountered when designing a corpus, establishing an annotation scheme for a treebank, or developing annotation tools. The style in which the papers are written is helpful in this respect: they are clear and accessible, and the information is introduced gradually. The second part of the book has a narrower audience than the first, due to the technical detail involved in the presentation of various applications in computational linguistics: lexicon induction (Järvinen), grammar induction (Frank et al., Bod), parser evaluation (Carroll et al.), and checker evaluation (Becker et al.).
REFERENCES
Barnett, R., N. Calzolari, S. Flores, P. Hellwig, P. Kahrel, G. Leech, M. Melera, S. Montemagni, J. Odijk, V. Pirrelli, A. Sanfilippo, S. Teufel, M. Villegas, L. Zaysser (1996) EAGLES Recommendations on Subcategorisation. Report of the EAGLES Working Group on Computational Lexicons, ftp://ftp.ilc.pi.cnr.it/pub/eagles/lexicons/synlex.ps.gz.
Berwick, R.C., S.P. Abney, C. Tenny (Eds.) (1991) Principle-Based Parsing: Computation and Psycholinguistics. Kluwer Academic Publishers.
Bod, R. (1992) Data Oriented Parsing (DOP), Proceedings COLING '92, Nantes, France.
Bod, R. (1995) Enriching Linguistics with Statistics: Performance Models of Natural Language, ILLC Dissertation Series 1995-14, University of Amsterdam.
Bod, R. (1998) Spoken Dialogue Interpretation with the DOP Model, Proceedings COLING-ACL'98, Montreal, Canada.
Charniak, E. (1996) Tree-bank Grammars. Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pp. 1031-1036. MIT Press.
Scha, R. (1990) Taaltheorie en Taaltechnologie; Competence en Performance, in Q.A.M. de Kort and G.L.J. Leerdam (Eds.), Computertoepassingen in de Neerlandistiek, Almere: Landelijke Vereniging van Neerlandici (LVVN-jaarboek).
Scha, R. (1992) Virtuele Gramatica's en Creatieve Algoritmen, Gramma/TTT 1(1).
Sinclair, J. (1992) The automatic analysis of corpora. In J. Svartvik (Ed.) Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Berlin: Mouton de Gruyter, pp. 379-397.