LINGUIST List 26.3441

Thu Jul 30 2015

Confs: General Linguistics, Sociolinguistics/USA

Editor for this issue: Erin Arnold

Date: 30-Jul-2015
From: Katie Kindle
Subject: Preparing your Corpus for Archival Storage

Preparing your Corpus for Archival Storage

Date: 07-Jan-2016 - 07-Jan-2016
Location: Washington, DC, USA
Contact: Malcah Yaeger

Linguistic Field(s): General Linguistics; Sociolinguistics

Meeting Description:

We wanted to pass along some good news from LSA and NSF. The special session that we proposed, "Preparing your Corpus for Archival Storage," has been accepted as a Satellite Session for LSA 2016, to be held January 7-10, 2016, in Washington, D.C. The session will take place on Thursday, January 7, before the start of the Annual LSA Meeting.

As you all know, NSF has emphasized that we should prepare our corpora for storage, so that other researchers [or we ourselves, at a later date] can use the older material for comparison with newer studies. This meeting will present the critical factors that could not be included in an earlier NSF-supported workshop examining issues in preparing data for comparison and sharing. Below you will see the line-up of scheduled presenters. There will be six oral presentations, plus discussion among the presenters and with other workshop participants. NSF is funding the workshop, so there will be no additional registration fees for those already taking part in the annual meeting.

Note that any students who are about to carry out their own fieldwork, or who have already begun doing so, are eligible to apply for funding to help defray the extra costs of arriving early. There is an application form for the scholarship. Applicants must include documentation showing that they have been carrying out, or are about to carry out, fieldwork. In addition, there will be [at least] two other symposia during LSA that are relevant to this question but will not overlap in coverage.

Session Abstract:

An NSF-supported satellite workshop on Archival Preparation will be held in conjunction with, and immediately preceding, the annual meeting of the Linguistic Society of America (LSA) on January 7, 2016. NSF policy now stipulates that investigators are expected to make available to other researchers the primary data created or gathered under NSF grants. However, the metadata presently associated with archived data are often inadequate to permit data (e.g., sound files) from related studies to be compared; without an agreed-upon coding protocol, there can be no effective sharing and comparison across speech corpora. Invited speakers will discuss specific coding conventions for such factors as socioeconomic and educational speaker demographics, and for language choice, stance and footing. Using appropriate metadata for these factors will facilitate the sharing of corpora and research to determine how each factor affects language use.

NSF previously supported a workshop in which leading scholars discussed data protocols, obtaining ethics board approval for human subject research, and ensuring that the information gathered about human subject demographics, attitudes and the situations in which they were recorded provides enough scope and detail to permit meaningful comparison across studies, and thus to encourage data sharing. Following that model, this workshop will extend the topics covered and provide a training forum in which to develop protocols for sharable data that conform to the spirit of NSF policy. This award will also support the participation of students in the training and discussions.

We are looking forward to seeing you all in Washington in January!

Best wishes,

Malcah Yaeger
Chris Cieri


90th Annual Meeting of the Linguistic Society of America
Session Title: Preparing your Corpus for Archival Storage: Coding for Socioeconomics, Education, Language Choice, Stance.
Type and length of session: Satellite Workshop, preceding the main meeting, with discussion
Organizers: Malcah Yaeger-Dror (UArizona) & Christopher Cieri (LDC, UPenn)
NSF has funded a satellite workshop, to take place on Thursday, January 7, 2016, at the Marriott Marquis in Washington, immediately before the start of the LSA meeting.

There will be three sections to the workshop.

1. The first section will propose appropriate coding options for socioeconomic status and education.

The two speakers will be:

Anne Fabricius, Roskilde University:
Social class, social capital, social practice and language in British sociolinguistics: unraveling historical and ethnographic complexities
This discussion will elaborate on the approaches to social class analysis and coding that I and others have pursued in studies of the elite/establishment sociolect of England over the past fifteen years. Changing social and political contexts as well as ‘class as an ideological construct’ within British society have ramifications for sociolinguistics and corpus work. We will look at several traditions of social class analysis and their potential contributions to sociolinguistic research. The importance of fine-grained historical and contextual understanding will be a recurring theme.

Suzanne Wagner, Michigan State University:
Conceptualizing and coding social class in North American sociolinguistics
North American sociolinguists have classified speakers by their (perceived) social status in three main ways: (i) indices of occupation, education and other measures of cultural/economic capital; (ii) locally relevant categories like ‘Burnout’; (iii) evaluation of the ‘linguistic market’ value of speech in different occupations. These methods will be evaluated with a view to establishing future coding systems that allow for both geographic and longitudinal comparisons across datasets. To illustrate this discussion, examples are drawn from the Influence of Higher Education on Local Phonology (IHELP) project, which includes data from multiple USA sites, and from two kinds of historical archives.

2. The second section will propose coding options for specific situational variables.

The two speakers will be:

Richard Ogden, York University, UK:

Frans Gregersen, LANCHART Centre of Copenhagen University:
Discourse contexts within sociolinguistic interviews, a presentation of the LANCHART DCA coding scheme.
It is the hallmark of a mature science that previously collected data and results are tested against new knowledge. In this endeavor, a lot depends on the quality of metadata. Much of this information is commonplace and refers to technical details about how data were collected, recorded, transcribed, etc. But the intelligent exploitation of older data will also depend crucially on how variation within recordings is documented. In my presentation I will review what we have done at the Copenhagen University LANCHART Centre to control for internal variation within the recording sessions and critically discuss the resulting coding scheme.

3. The third section will propose coding options for bilingual/code-switching segments.

The two research perspectives will be represented by:

Naomi Nagy, University of Toronto & Paulina Lyskawa:
Moving forward with multilingual transcription
Since 2009, the Heritage Language Variation and Change in Toronto Project has been building corpora of conversational speech in a range of Heritage Languages in Toronto. Teams of students from each community have developed language-specific transcription protocols. We apply variationist methods to quantify the effects of various contextual forces on the selection of forms both within and across languages. This presentation will describe and problematize how we indicate use of multiple languages within one conversation and efforts to maintain consistency across protocols from different languages/communities, commenting on efforts to make these transcripts useful for inquiries developed subsequent to transcription.

Barbara Bullock, U TX, Austin & Jacqueline Toribio, U TX, Austin:
Toward automated methods of bilingual annotation
Coding of bilingual data requires levels of linguistic annotation and metadata that are typically irrelevant for populations in which the speakers sampled are presumed to be dominant and proficient in only one and the same language. As such, linguists with interests in bilingualism have traditionally had to manually categorize the language of the data, the type of data (code-switching, borrowing, calquing), and the linguistic abilities of its speakers. Here, we discuss the procedures and the results of the Bilingual Annotation TaskS (BATS) Force to automatically classify bilingual language data in ways that can be scaled up for large data sets.

Page Updated: 30-Jul-2015