LINGUIST List 3.945

Tue 01 Dec 1992

Qs: Survey of Modern Greek Corpora



  1. Dionysis Goutsos, Survey of Modern Greek corpora

Message 1: Survey of Modern Greek corpora

Date: Tue, 01 Dec 92 13:52:19 GMT
From: Dionysis Goutsos
Subject: Survey of Modern Greek corpora

Postal address:
School of English
University of Birmingham
Birmingham B15 2TT

fax: (int +) 44 21 414 3600

27 November 1992

Dear Colleague,

 We have recently become aware of the lack of communication
between researchers on Modern Greek and the need for exchange of
information, and so we are taking the initiative to distribute
this survey of machine-readable corpora of Modern Greek.
 Its aim is to collect information about the nature and
structure of collections of text in machine-readable form and the
specifications of hardware and software tools. This information
will be available to interested researchers and is intended to
provide a basis for discussion and exchange of information on the
future of Modern Greek corpora.
 By corpus, we mean broadly any text collection: texts to be
studied individually and not linked in any coordinated way, the
collected works of an author, texts selected to study a
particular author, textbanks, databases or bibliographies. If you
are not personally involved in the compilation of such a
machine-readable corpus, could you please pass the survey on to
others or suggest their names to us?
 We hope to compile the results of the survey by March
1993; depending on the extent of the response, we may come back
to you for more detail.

 We would like to thank you in advance for your help and we'd
be happy to hear any suggestions from you.

Dionysis Goutsos
Rania Hatzidaki
Philip King

Modern Greek Corpus Initiative

 Survey of machine-readable corpora of Modern Greek


A1. By what name is the corpus known?

A2. Who compiled the corpus?

A3. Where was it compiled? (Institution)

A4. Contact Address


A5. When did the compilation start?

A6. What was the incentive for starting the compilation?

B1. How are texts entered?
(word-processor, text-editor, typesetting tapes, optical
scanning, other)

B2. How is the corpus stored and in what format?
B2.1. What computer facilities do you use?
(IBM Personal Computer or compatible, Apple Macintosh,
workstation, mainframe)

B2.2. What software do you use for corpus processing? (please
specify item and function: word frequency, concordancing of
selected items etc.)

B2.3. Do you use ready-made or customized software?

B2.4. If you use your own software, which programming language
do you use?

B3. How do you handle the special problem of Greek characters?

 - in input processing

 - in screen output

 - in printing

B4. Do you have software for linguistic annotation (tagging,
parsing, lemmatization)?
If yes, specify


C1. How was the text acquired?

C2. How is the corpus organized?

C3. Can you give some details of the content?

C3.1. Written texts:
C3.1.1. What genres are included in your collection?

C3.1.2. What are the media of the original texts? (printed book,
periodical, manuscript, ephemera, other)

C3.1.3. Do you encode typographic and layout information?
If so, specify

C3.2. Spoken texts (transcriptions):
C3.2.1. What genres are included in your collection?

C3.2.2. What is the medium of the original source? (TV, radio,
telephone, direct: talk, conversation, other)

C3.2.3. Is the material spontaneous or not, surreptitious or not?

C3.2.4. Do you encode information about speakers (e.g. age, sex)
or about the recording?

C3.2.5. What transcription system do you use? (phonetic,
phonological, enhanced orthographical, orthographical)

C4. What period do the texts in the corpus represent?

 from _____________ to ____________

C5. What is the total amount of data stored in your collection?

 - in bytes

 - in words

 - in minutes of spoken text recording

C6. What use is made of the corpus? (specify, where appropriate)

 - to build up a multifunctional linguistic corpus

 - for lexicographic purposes

 - for literary research

 - for stylistic research

 - for preparation of a scholarly edition

 - for research in linguistics

 - for research in language learning/teaching

 - for commercial applications

 - for natural language processing applications

 - other

C7. Is it available to other interested parties?
 If so, under what conditions?


D1. Do you plan any changes in the composition of your corpus?

D2. Are you planning to develop new text-handling software?

D3. Are there any specialized areas of Modern Greek for which a
corpus approach would be particularly useful?

D4.1. What are your views on the development of a general corpus
of Modern Greek (such as the Brown Corpus of English or the
Birmingham English Corpus)?

D4.2. What would you consider its optimal size to be?

D5. Do you prefer a 'clean text' strategy (i.e. plain
orthographic files) as opposed to annotated, phonologically
coded, parsed etc. text?

D6. Do you think that multilingual corpora or corpora containing
'parallel texts' are needed?

D7. Do you have any other views on the development of Modern
Greek corpora and software for processing them?

Please list any publications you are aware of that were based on
the electronic text you describe.
