Software Details
| Title: | Language Independent Part-of-Speech Tagger |
|---|---|
| Submitter: | Vlad V. Gojol |
| Description: | Dear Readers, Those interested in the two pieces of NLP software presented below are welcome to contact me directly (gojol@sunu.rnc.ro). Thank you, Dr.ing. Vlad Gojol, Senior Research Engineer, Institutul National de Informatica, Bucuresti, Romania. LANGUAGE INDEPENDENT PART-OF-SPEECH TAGGER I created a part-of-speech tagger with an unusual capacity for dealing with large contexts, especially for German. I used Negra (seemingly the best-known German corpus, with a freely obtainable licence). The tagger currently reputed to be the most accurate for German is perhaps TnT. It reports an error rate of 3.4% on this corpus. But I have found a systematic error in Negra: all occurrences of the auxiliary verbs are tagged as auxiliary (VAFIN), though in 50% of the cases they function as finite verbs (VVFIN). I corrected a part of the corpus (ca. 40,000 tokens). In this more correct environment (where the error rate of TnT should probably be around 4.5%), my tagger gets an error rate of 1.7%. On another German corpus (I call it X), with comparable contents (newspaper articles) and tagset, but with an attached EXTERIOR lexicon (i.e. not extracted from the corpus), the result is 2.4%. I also used Susanne (the only English corpus I could get free). The reported result for TnT is 3.8%. Mine is 2.8%. On the "A" texts, which as journalistic texts are the most comparable to those in Negra, it is 2.3%. By restricting the tagset to a more normal size (ca. 100 tags, determined as optimal after many test runs), it is 1.3%. Initially I had used a Romanian corpus, with a result of 0.9% (compared to 1.7%, 2.5% and 4.2% obtained by the Xerox, Birmingham and Brill taggers respectively). The speed is comparable to that of TnT and can be adjusted by parameter setting, at the expense of accuracy (but without affecting it much). 
The incremental operating mode and the segmentation of the data structures allow it to run on computers with very little memory. There is the advantage of an intuitive output (no hostile binary matrix), in a form analogous to the input of some expert systems. Alternative taggings are output, with scores: unlike with other taggers, they do not refer to individual words, but to whole sentence parts (representing, in a sense, phrase surfaces of minimum energy). Special facilities exist, such as virtual tags, or context essentialisation (which yields the minimal set of contexts characteristic of a certain linguistic style, useful not only for maximum accuracy and speed), etc. Recently added features: wide-character support, the possibility of being called as a simple library routine, and the option of dispensing with files altogether (a complex file system is emulated in memory). For example (for the second feature above), you can create an instance of the tagger (let's say for English with the Lancaster tagset), call it to tag a certain text buffer (writing the resulting tags into another buffer) and finally kill the instance, all this without using any disc file: t = GojolTagger_new("lancaster"); error_code = GojolTagger_tag(t, input_buffer, output_buffer); GojolTagger_free(t); Everything is built on two essentially new concepts: organicity and context propagation. I have not published anything about them, in order to preserve their commercial appeal. The accuracy, comparable to that of manual tagging, led me to find many errors in the corpora used: 98 in Negra, 36 in Susanne. Prof. G. Sampson replied gratefully, saying that it was the first time anybody had reported more than 2 errors, and that my findings make a new version of Susanne necessary. 
The handling of very large contexts could even change the design of current tagsets, by cancelling some unnatural decisions (motivated only by the inability of existing taggers to see beyond a 3-token neighborhood), such as those concerning the auxiliary verbs, participles etc., thereby removing some of the burden from the subsequent stages of text processing. It is written in C (Linux). Demos for German (Negra) and English (Susanne) are available. LANGUAGE INDEPENDENT STATISTICAL PARSER After learning from a 46,000-word POS-tagged corpus and a 32,000-word parsed (treebank) corpus, a 2,000-word text (not included in either of the two corpora) is parsed (tagging excluded) in 6 seconds (on a 200 MHz machine) with 2% incomplete trees (even for these declared failures, well-formed trees sufficient for a subsequent translator are provided); the extracted grammar has ca. 12,000 rules. The Negra corpus of German was used. After learning from a 17,000-word parsed corpus and from the same 46,000-word POS-tagged one, a 2,000-word text included in the first (but excluded from the second), to guarantee that the grammar is complete relative to it (i.e. contains all the rules necessary for its correct parsing), is processed in 2 seconds with no incomplete trees; the extracted grammar has ca. 7,000 rules. The system is language independent: for English, on the Susanne corpus, comparable results are obtained. To have an acceptable parser for any other language, you essentially need only a corpus of 30,000 tagged words, of which only 20,000 must also be parsed; for optimal results, 50,000 and 30,000 respectively. The parser may accept a set of rules intended to modify the statistical grammar deduced from the corpus. 
Moreover, it can take as input a context-free grammar alone (in which case it ceases to be a statistical parser), but in this operating mode it requires much time and memory (during learning, not during parsing as such) if the grammar is oversized. The statistical grammar is refined not by simply adding the proposed rules, but by modifying the corpus, to exploit all the real contexts possible for them. Semantic processing could easily be inserted at rule reduction points. In fact, this generalized parser can also work as a compiler generator: by appending specific semantic routines, you get efficient compilers for C, Pascal etc. This versatile system has more than 40 parameters which tune the accuracy and speed according to the target language sample. The output is in treebank format and optionally in graphical form (with the trees actually drawn). Linux demos exist for German and English. As only the minimal definition of C is used, it is easily adaptable to any machine (for other Unix-like operating systems, a simple recompilation would probably be sufficient). |
| Linguistic Field(s): | Computational Linguistics |
| Language Specialty: | English, German |
| LL Issue: | 12.511 |
| Date Posted: | 23-Feb-2001 |
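
The error rates quoted in the description are easier to compare as relative error reductions, i.e. the share of a baseline tagger's errors that the other tagger avoids. The following is only a minimal sketch of that arithmetic and is not part of the announced software: the function names `relative_error_reduction` and `print_comparison` are hypothetical, the figures passed to them would be the ones quoted in the posting, and the 4.5% figure for TnT on the corrected Negra portion is the author's own estimate rather than a measurement.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of the announced tagger.
 * Returns the fraction of the baseline's errors that the system avoids:
 *     (baseline - system) / baseline
 * Both arguments are error rates in percent; the baseline must be > 0. */
double relative_error_reduction(double baseline_error_pct, double system_error_pct)
{
    assert(baseline_error_pct > 0.0);
    return (baseline_error_pct - system_error_pct) / baseline_error_pct;
}

/* Prints one comparison line for a corpus, given the two error rates. */
void print_comparison(const char *corpus, double baseline_pct, double system_pct)
{
    printf("%-24s baseline %.1f%%  tagger %.1f%%  reduction %.0f%%\n",
           corpus, baseline_pct, system_pct,
           100.0 * relative_error_reduction(baseline_pct, system_pct));
}
```

With the Susanne figures from the posting (3.8% reported for TnT, 2.8% for this tagger), `relative_error_reduction(3.8, 2.8)` is about 0.26, so roughly a quarter of the errors are avoided. With the estimated 4.5% against 1.7% on the corrected Negra portion, the value is about 0.62.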


