FYI: EF Cambridge Open Language Database
New Corpus of L2 English writings: EF Cambridge Open Language Database (EFCamDat)
We are pleased to announce the release of a new resource of L2 English writings, the EF Cambridge Open Language Database (EFCamDat). EFCamDat was developed at the Dept. of Theoretical and Applied Linguistics at the University of Cambridge, in collaboration with EF Education First, an international educational organisation. EFCamDat contains writings submitted to Englishtown, EF's online school, which is accessed daily by thousands of learners worldwide. The database currently contains 412K scripts from 76K learners, totalling 32 million words. As new data come in, we expect to reach 100 million words by the end of 2014 and to be able to follow the longitudinal development of even more students.
Scripts are organised according to EF's proficiency levels and the topic of the writing activity, and include teachers' corrections and scores. In addition, scripts have been annotated automatically with Penn Treebank part-of-speech tags (Marcus et al., 1993) and grammatical relations according to the Stanford Dependency scheme (De Marneffe and Manning, 2008). Details of the automatic annotation, and an evaluation of how these tools perform on learner data, are presented in Geertzen et al., 2013.
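To illustrate what these two annotation layers record for each word, here is a minimal Python sketch. It is not the EFCamDat export format: the `Token` class, the helper functions `pos_counts` and `find_relation`, and the toy sentence are all hypothetical and hand-annotated for illustration only. Each token carries a Penn Treebank tag plus the index of its governing word and a Stanford Dependency label.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    index: int      # 1-based position in the sentence
    word: str
    pos: str        # Penn Treebank part-of-speech tag
    head: int       # index of the governing token, 0 for the root
    relation: str   # Stanford Dependency label


# Hand-annotated toy sentence (an assumption, not taken from EFCamDat).
SENTENCE = [
    Token(1, "She", "PRP", 2, "nsubj"),
    Token(2, "likes", "VBZ", 0, "root"),
    Token(3, "green", "JJ", 4, "amod"),
    Token(4, "apples", "NNS", 2, "dobj"),
]


def pos_counts(tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(token.pos for token in tokens)


def find_relation(tokens, relation):
    """Return (head word, dependent word) pairs for one relation label."""
    by_index = {token.index: token for token in tokens}
    return [
        (by_index[token.head].word, token.word)
        for token in tokens
        if token.relation == relation and token.head != 0
    ]
```

Calling `find_relation(SENTENCE, "dobj")` on the toy sentence pairs the verb with its direct object, which is the kind of query a search over grammatical relations answers.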
EFCamDat is freely available to the academic community, subject to an end-user agreement protecting copyright. It can be accessed through a web-based interface at:
(please click on Frequently Asked Questions to download relevant documentation).
The interface supports selection of scripts by proficiency level and learner nationality, search for parts of speech and grammatical relations, and export of raw text as well as tagged scripts.
We gratefully acknowledge support by the Isaac Newton Trust, Trinity College, Cambridge, and EF Education First.
Dora Alexopoulou, Rachel Baker, Jeroen Geertzen, Anna Korhonen
De Marneffe, M. C. and Manning, C. D. (2008). The Stanford typed dependencies representation. In Coling 2008: Proc. of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8.
Education First (2012). Englishtown. http://www.englishtown.com/.
Geertzen, J., Alexopoulou, T., and Korhonen, A. (2012). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum (SLRF), Carnegie Mellon. Cascadilla Press.
Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.