LINGUIST List 21.1298

Wed Mar 17 2010

FYI: Corpus release: Le Petit Prince in UNL

Editor for this issue: Elyssa Winzeler <elyssalinguistlist.org>

        1.    Ronaldo Martins, Corpus release: Le Petit Prince in UNL

Message 1: Corpus release: Le Petit Prince in UNL
Date: 15-Mar-2010
From: Ronaldo Martins <r.martinsundlfoundation.org>
Subject: Corpus release: Le Petit Prince in UNL

The UNDL Foundation has released a UNL version of “Le Petit Prince” (The
Little Prince), the famous novella by Antoine de Saint-Exupéry, published
in 1943. The corpus is available under a Creative Commons
Attribution-ShareAlike (CC BY-SA) license at the UNLarium
(http://www.unlweb.net/unlarium), and may be used by researchers and
developers interested in the semantic annotation of natural language texts.

What is UNL?

The UNL (Universal Networking Language) is a knowledge representation
language that has been used for several different tasks in natural language
engineering, such as machine translation, multilingual document generation,
summarization, information retrieval and semantic reasoning. It was
originally proposed by the Institute of Advanced Studies of the United
Nations University, in Tokyo, and is currently promoted by the UNDL
Foundation, in Geneva, Switzerland, under a mandate of the United Nations.
[read more about UNL in

Why Le Petit Prince?

Le Petit Prince is one of the best-selling books ever (more than 80 million
copies) and has been translated into more than 180 languages, which makes
it possible to contrast and evaluate a wide range of UNL-based
translations. Additionally, the text offers the chance to experiment with
UNL in three settings that have seldom been explored: a French original,
narrative text and literature. Our main goal is to “UNL-plicate” the text in at
least three different directions: replication, summarization and
simplification, in as many languages as possible. [read more about
UNLplication in http://www.unlweb.net/wiki/index.php/UNLplication]

How was the text UNLized?

The complete text of Le Petit Prince, which is in the public domain in
Canada, was obtained from
http://wikilivres.info/wiki/Le_Petit_Prince. The whole text comprises
15,513 word forms (tokens) and 1,684 sentences. The UNLization of the text
was carried out entirely by hand with the UNL Editor, a graph-based
authoring tool developed by the UNDL Foundation. The sentences were
divided into two main groups: a) the training corpus, which
comprises the first 53 sentences of the book (dedication and first
chapter), including the title; and b) the application corpus, which
comprises the remaining 1,548 sentences. The training corpus was UNLized
collectively by a group of four human UNLizers in order to synchronize and
normalize the UNLization strategies. The application corpus was ordered
by sentence similarity (rather than by order of appearance) and was
UNLized from December 2009 to February 2010 according to the guidelines
resulting from the training exercise, which are available at
http://www.unlweb.net/wiki/index.php/UNLization_Guidelines.
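
For readers who want to reproduce a comparable setup on their own text, the
split and reordering described above can be sketched roughly as follows.
This is only an illustration, not the procedure actually used by the UNDL
Foundation: the announcement does not say how sentence similarity was
measured, so the token-overlap (Jaccard) measure and the greedy chaining
below are assumptions, as are all function names.

```python
from typing import List, Tuple

# From the announcement: the training corpus is the first 53 sentences.
TRAINING_SIZE = 53


def split_corpus(sentences: List[str],
                 training_size: int = TRAINING_SIZE) -> Tuple[List[str], List[str]]:
    """Split sentences (in reading order) into (training, application)."""
    return sentences[:training_size], sentences[training_size:]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap of two sentences (assumed similarity measure)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def order_by_similarity(sentences: List[str]) -> List[str]:
    """Greedy chain (hypothetical): keep the first sentence, then repeatedly
    append the remaining sentence most similar to the last one placed."""
    if not sentences:
        return []
    remaining = list(sentences)
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        # On ties, max() keeps the earliest candidate, so the result is stable.
        best = max(range(len(remaining)),
                   key=lambda i: jaccard(last, remaining[i]))
        ordered.append(remaining.pop(best))
    return ordered
```

Grouping similar sentences this way lets an annotator handle recurring
constructions together, which is presumably the motivation for not following
the order of appearance.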

Further information

For further information, please contact
Ronaldo MARTINS (mailto:r.martinsundlfoundation.org)
Language Resources Manager
UNDL Foundation
48, route de Chancy, CH-1213, Petit-Lancy, Geneva, Switzerland
+41 22 879 8090


The UNDL Foundation (http://www.undlfoundation.org) is a non-profit
organization based in Geneva, Switzerland, which has received a mandate
from the United Nations to implement the Universal Networking
Language (UNL). The UNL Programme is a collaborative effort to create
natural language resources and technology to reduce language barriers and
strengthen cross-cultural communication in the framework of the United
Nations. Participation in the Programme is free and open to individuals and
institutions, either as researchers or as developers. Special funds are
available for some languages.

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics
