FYI: GerManC Corpus is Now Available


Author: Richard Whitt

Linguistic Field(s): Computational Linguistics
Historical Linguistics
Text/Corpus Linguistics

FYI Body: The complete GerManC corpus, a representative corpus of Early
Modern German from 1650 to 1800, is now publicly available at the
Oxford Text Archive:
http://www.ota.ox.ac.uk/desc/2544

Following the model of the ARCHER corpus and given the aim of
representativeness, the GerManC corpus consists of text samples of
about 2000 words from eight genres: drama, newspapers, sermons
and personal letters (to represent orally oriented registers) and
narrative prose (fiction or non-fiction), scholarly (i.e. humanities),
scientific and legal texts (to represent more print-oriented registers). In
order to facilitate tracing historical developments, the whole period was
divided into fifty-year sections (in this case 1650-1700, 1700-1750 and
1750-1800), and an equal number of texts from each genre was
selected for each of these sub-periods.

The complete corpus thus consists of 360 samples, comprising
approximately 800,000 words. Appendix 1 in the download package
contains a list of the files in the corpus, with full documentation, in an
Excel spreadsheet.
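
As a rough check on the figures above, the following Python sketch works out how many samples fall into each genre and sub-period, and the mean sample length implied by the stated totals. The variable and function names are illustrative only and are not taken from the corpus documentation. The announcement does not say how samples are distributed within a genre and sub-period, so the sketch stops at that level.

```python
# Figures taken from the announcement; all names here are illustrative.
GENRES = [
    "drama", "newspapers", "sermons", "personal letters",   # orally oriented
    "narrative prose", "scholarly", "scientific", "legal",  # print-oriented
]
PERIODS = ["1650-1700", "1700-1750", "1750-1800"]
TOTAL_SAMPLES = 360
APPROX_TOTAL_WORDS = 800_000


def samples_per_cell(total, genres, periods):
    """Return the number of samples in each genre/sub-period cell,
    assuming the samples are spread evenly across all cells."""
    cells = len(genres) * len(periods)
    if total % cells != 0:
        raise ValueError("total does not divide evenly across cells")
    return total // cells


per_cell = samples_per_cell(TOTAL_SAMPLES, GENRES, PERIODS)
mean_words = APPROX_TOTAL_WORDS / TOTAL_SAMPLES

print(f"{per_cell} samples per genre and sub-period")
print(f"about {mean_words:.0f} words per sample on average")
```

With eight genres and three sub-periods there are 24 cells, so an even split gives 15 samples per cell, and the implied mean length is in line with the "about 2000 words" per sample stated above.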

Project team: Martin Durrell (PI), Paul Bennett (Co-Investigator), Silke
Scheible (RA), Richard J. Whitt (RA), and Astrid Ensslin (RA,
newspaper corpus).