
Software Details

Title: CLaRK System - an XML-based System for Corpora Development
Submitter: Kiril Simov
Description: Dear List members,

I would like to announce the CLaRK System - an XML-based System for Corpora Development. It is available on the web page of the BulTreeBank Project:

Please follow the "CLaRK System" link and then "Download".

The system is implemented in Java.

Short description:

CLaRK is an XML-based software system for corpora development. The main aim behind the design of the system is to minimise human intervention during the creation of language resources. It incorporates several technologies: (1) XML technology; (2) Unicode; (3) Regular Cascade Grammars; (4) Constraints over XML Documents.

For document management, storage and querying, we chose XML technology because of its popularity and its ease of understanding. The core of CLaRK is an XML Editor, which is the main interface to the system. Besides the XML language itself, we implemented XPath for navigation in documents and XSLT for transformation of XML documents.
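
As a rough illustration of XPath-style navigation over a corpus document, here is a minimal sketch in Python using the standard library's limited XPath subset. It is not CLaRK code; the element and attribute names (`text`, `s`, `w`, `pos`) are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus fragment; tag and attribute names are illustrative only.
doc = ET.fromstring(
    "<text>"
    "<s><w pos='N'>corpus</w><w pos='V'>grows</w></s>"
    "<s><w pos='N'>linguist</w><w pos='V'>annotates</w><w pos='N'>text</w></s>"
    "</text>"
)

# Select every <w> element marked as a noun, anywhere in the document.
nouns = [w.text for w in doc.findall(".//w[@pos='N']")]

# Count the words in each sentence by navigating from each <s> to its children.
words_per_sentence = [len(s.findall("w")) for s in doc.findall("s")]
```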

For multilingual processing tasks, CLaRK is based on a Unicode encoding of the information inside the system. There is a mechanism for the creation of a hierarchy of tokenisers. They can be attached to the elements in the DTDs, and in this way there are different tokenisers for different parts of the documents.
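
The idea of attaching different tokenisers to different elements can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the CLaRK mechanism itself: the rule lists, the `TOKENISERS` mapping, and the element names `date` and `p` are all hypothetical, and the "hierarchy" is reduced to a child tokeniser that adds one rule on top of its parent's rules.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical parent tokeniser: token category -> regular expression.
# Python 3 patterns are Unicode-aware, so WORD also covers non-Latin letters.
DEFAULT = [("WORD", r"[^\W\d_]+"), ("NUMBER", r"\d+"), ("PUNCT", r"[^\w\s]")]

# Hypothetical child tokeniser: its own rule is tried first, then the parent's.
DATE = [("DATE", r"\d{1,2}-[A-Za-z]{3}-\d{4}")] + DEFAULT

# Assumed mapping from element names to tokenisers; other elements use DEFAULT.
TOKENISERS = {"date": DATE}

def tokenise(text, rules):
    """Return (category, token) pairs, trying the rules in the given order."""
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in rules))
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

def tokenise_element(elem):
    """Pick the tokeniser attached to this element's name and apply it."""
    rules = TOKENISERS.get(elem.tag, DEFAULT)
    return tokenise(elem.text or "", rules)

doc = ET.fromstring("<entry><date>23-May-2002</date><p>Posted in 2002.</p></entry>")
result = {child.tag: tokenise_element(child) for child in doc}
```

With this setup the content of `date` is kept as a single token, while the same digits inside `p` are split by the default rules.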

The basic mechanism of CLaRK for linguistic processing of text corpora is the cascade regular grammar processor. The main challenge for the grammars in question is how to apply them to an XML encoding of the linguistic information. The system offers a solution that uses XPath to construct the input word for the grammar and an XML encoding of the categories of the recognised words.
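
A minimal sketch of the cascade idea, in Python, may help. It assumes a toy encoding that is not taken from CLaRK: words are `w` elements with a `pos` attribute, the input word is the space-separated sequence of categories of a sentence's children, and each recognised span is wrapped in a new element named after its category. The two-rule grammar is likewise hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def category(elem):
    # Input symbol for the grammar: the POS of a word, or the tag of a phrase
    # recognised at an earlier level (this is what lets the rules cascade).
    return elem.get("pos") if elem.tag == "w" else elem.tag

def apply_rule(parent, label, pattern):
    """Wrap each match of `pattern` over the children's categories in <label>."""
    children = list(parent)
    starts, word = {}, ""
    for i, child in enumerate(children):
        starts[len(word)] = i          # offset in the input word -> child index
        word += category(child) + " "  # one "CAT " chunk per child
    # The lookbehind forces matches to begin at the start of a chunk.
    regex = re.compile(r"(?<!\S)(?:" + pattern + ")")
    rebuilt, done = [], 0
    for m in regex.finditer(word):
        first = starts[m.start()]
        last = starts.get(m.end(), len(children))
        rebuilt.extend(children[done:first])
        phrase = ET.Element(label)
        phrase.extend(children[first:last])
        rebuilt.append(phrase)
        done = last
    rebuilt.extend(children[done:])
    parent[:] = rebuilt

# Hypothetical cascade: level 1 builds noun phrases, level 2 uses them in PPs.
GRAMMAR = [
    ("NP", r"(?:Det )?(?:Adj )*N "),
    ("PP", r"P NP "),
]

s = ET.fromstring(
    "<s><w pos='Det'>the</w><w pos='Adj'>annotated</w><w pos='N'>corpus</w>"
    "<w pos='P'>of</w><w pos='N'>texts</w></s>"
)
for label, pattern in GRAMMAR:
    apply_rule(s, label, pattern)

shape = [child.tag for child in s]
```

The second rule can only fire because the first one has already replaced the word sequence with `NP` elements, which is the sense in which the grammars form a cascade here.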

Several mechanisms for imposing constraints over XML documents are available. These constraints cannot be expressed with standard XML technology. The following types of constraints are implemented in CLaRK: (1) Regular expression constraints - additional constraints over the content of given elements based on a context; (2) Number restriction constraints - cardinality constraints over the content of a document; (3) Value constraints - restriction of the possible content or parent of an element in a document based on a context. The constraints are used in two modes: checking the validity of a document against a set of constraints, and supporting the linguist's work during the building of a corpus. The first mode allows the creation of constraints for the validation of a corpus according to given requirements. The second mode supports the underlying strategy of minimising human labour.
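
The validation mode for the three constraint types can be illustrated with a small Python sketch. The function names, the sample documents, and the specific constraints are assumptions invented for this example, and the value check uses an attribute as a simplified stand-in for element content; CLaRK's own constraint language is not shown here.

```python
import re
import xml.etree.ElementTree as ET

def number_ok(doc, path, low, high):
    """Number restriction: the count of selected nodes lies in [low, high]."""
    return low <= len(doc.findall(path)) <= high

def value_ok(doc, path, attr, allowed):
    """Value constraint (simplified): each selected node's attribute is allowed."""
    return all(e.get(attr) in allowed for e in doc.findall(path))

def regex_ok(doc, path, pattern):
    """Regular expression constraint over the sequence of child tags."""
    return all(
        re.fullmatch(pattern, " ".join(child.tag for child in e)) is not None
        for e in doc.findall(path)
    )

def validate(doc):
    # Hypothetical requirements for a toy treebank.
    return {
        "exactly one VP": number_ok(doc, ".//VP", 1, 1),
        "known POS values": value_ok(doc, ".//w", "pos", {"Det", "N", "V"}),
        "sentence is NP VP": regex_ok(doc, ".//s", r"NP VP"),
    }

good = ET.fromstring(
    "<text><s><NP><w pos='Det'>the</w><w pos='N'>corpus</w></NP>"
    "<VP><w pos='V'>grows</w></VP></s></text>"
)
bad = ET.fromstring(
    "<text><s><VP><w pos='X'>grows</w></VP><VP><w pos='V'>grows</w></VP></s></text>"
)

report_good = validate(good)
report_bad = validate(bad)
```

Here `report_good` maps every constraint name to `True`, while `report_bad` flags all three, which is the kind of feedback the validation mode would give for a corpus.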

With best regards,


----------------------------------------------------------------
Kiril Simov
BulTreeBank Project
Linguistic Modelling Laboratory, CLPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
----------------------------------------------------------------
Linguistic Field(s): Text/Corpus Linguistics

LL Issue: 13.1458
Date Posted: 23-May-2002
