LINGUIST List 17.2368
Tue Aug 22 2006
Software: NEMLAR Arabic Resources in ELRA Catalogue - 08/06
Editor for this issue: Svetlana Aksenova
Message 1: NEMLAR Arabic Resources in ELRA Catalogue - 08/06
From: Helene Mazo <mazoelda.org>
Subject: NEMLAR Arabic Resources in ELRA Catalogue - 08/06
ELRA - Language Resources Catalogue - Update
We are happy to announce the following Arabic resources, produced within the NEMLAR project (www.nemlar.org). All 3 resources are owned and copyrighted by the Nemlar Consortium. They are available in our catalogue. To view all the Language Resources available, you can visit our on-line catalogue: http://www.elra.info or http://www.elda.org
ELRA-W0042 NEMLAR Written Corpus
This corpus consists of about 500,000 words of Arabic text from 13 different categories. The text is provided in 4 different versions:
- Raw text
- Fully vowelized text
- Text with Arabic lexical analysis
- Text with Arabic POS-tags
The database is distributed on 1 ISO 9660 CD-ROM volume.
For more information, see http://catalog.elda.org:8080/product_info.php?products_id=873&osCsid=2eb47737dba8e4365c4972784a235948
ELRA-S0219 NEMLAR Broadcast News Speech Corpus
The data, provided by ELDA, consists of about 40 hours of Arabic speech (mainly Standard Arabic from a number of broadcast companies). Transcriptions follow the Transcriber conventions as used by ELDA and focus on the orthographic, named-entity, and speaker/turn segmentation levels. No phonetic transcription/segmentation is planned.
The database is distributed on 1 ISO 9660 DVD-ROM volume.
For more information, see http://catalog.elda.org:8080/product_info.php?products_id=874&osCsid=2eb47737dba8e4365c4972784a235948
ELRA-S0220 NEMLAR Speech Synthesis Corpus
The NEMLAR Speech Synthesis Corpus contains the recordings of 2 native Egyptian speakers (male and female, 35 years old) recorded in a studio over 2 channels (voice + laryngograph). The data collection and transcription were performed by RDI (Egypt).
Speech samples are stored in 96 kHz, 24 bit with the least significant byte first (“lohi” or Intel format) as (signed) integers.
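As an illustration of the sample format described above, the following minimal Python sketch decodes packed 24-bit signed integers stored least significant byte first. It assumes the bytes come from headerless sample data; the announcement does not describe the file layout or how the 2 channels are stored, so the function names decode_s24le and bytes_per_second are hypothetical and the channel handling is left to the reader.

```python
# Sketch only: assumes a buffer of packed 24-bit little-endian signed samples.
# The actual file layout of the corpus is not stated in the announcement.

SAMPLE_RATE_HZ = 96_000   # 96 kHz, as stated in the announcement
BYTES_PER_SAMPLE = 3      # 24 bit = 3 bytes per sample


def decode_s24le(raw: bytes) -> list[int]:
    """Decode packed 24-bit signed integers, least significant byte first."""
    if len(raw) % BYTES_PER_SAMPLE != 0:
        raise ValueError("buffer length must be a multiple of 3 bytes")
    return [
        int.from_bytes(raw[i:i + BYTES_PER_SAMPLE], byteorder="little", signed=True)
        for i in range(0, len(raw), BYTES_PER_SAMPLE)
    ]


def bytes_per_second(channels: int = 1) -> int:
    """Raw data rate for the given number of channels (hypothetical helper)."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels


if __name__ == "__main__":
    # Three samples: 1, -1, and the most negative 24-bit value.
    demo = bytes([0x01, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x80])
    print(decode_s24le(demo))    # [1, -1, -8388608]
    print(bytes_per_second(2))   # 576000
```

At this rate, one minute of 2-channel audio would occupy roughly 34.6 MB uncompressed, which is consistent with the corpus being shipped on several DVD-ROM volumes.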
The speaker read 2,032 prompted sentences covering approx. 42,000 words in three categories: transcribed speech (20%), written text (50%), and constructed phrases (30%).
The database is provided with orthographic, prosodic and phonetic transcriptions in SAMPA. All transcriptions were segmented at the utterance (sentence/command word) level, annotated at the word level and checked manually. A pronunciation lexicon including 3,589 headwords with phonetics in SAMPA is also available.
The database is distributed on 3 ISO 9660 DVD-ROM volumes.
For more information, see http://catalog.elda.org:8080/product_info.php?products_id=875&osCsid=2eb47737dba8e4365c4972784a235948
For more information on the catalogue, please contact Valérie Mapelli mailto:mapellielda.org