LINGUIST List 4.205

Fri 19 Mar 1993

FYI: Spoken Corpus on CDROM Available

Editor for this issue: <>


  1. "Henry S. Thompson", HCRC Map Task Corpus on CD: Audio and transcripts of natural speech

Message 1: HCRC Map Task Corpus on CD: Audio and transcripts of natural speech

Date: Thu, 18 Mar 93 23:03:02 GMT
From: "Henry S. Thompson" <>
Subject: HCRC Map Task Corpus on CD: Audio and transcripts of natural speech

 The HCRC Map Task Corpus

The Human Communication Research Centre (HCRC) is happy to announce
the release of the Map Task Corpus. The Map Task Corpus is a set of 8
CD-ROMs containing linked audio and transcriptions of a total of about
18 hours of spontaneous speech that was recorded from 128 two-person
conversations according to a detailed experimental design.

Altogether, the corpus as distributed provides a thorough and
invaluable set of resources and tools for use in analyzing all levels
of linguistic structure, via both text-based and speech-based
investigation. The range of research questions that can be addressed
using this corpus spans a wide spectrum of linguistic and cognitive
issues. We have kept the price as low as possible to encourage
researchers from many disciplines to use this corpus as a common
reference point for many different kinds of research.

The HCRC is an interdisciplinary research centre at the Universities
of Edinburgh and Glasgow, supported by the UK Economic and Social
Research Council and the Universities Funding Council. The publication
of the Map Task Corpus was made possible by assistance from the
Linguistic Data Consortium.

Corpus Details

Sixty-four speakers (32 female, 32 male, all adults) each took part
in four conversations in a quiet recording studio. They were all
students at the University of Glasgow, 61 of them being native Scots.
The conversations were carried out in an experimental setting in which
each participant has a schematic map in front of them, not visible to
the other. Each map consists of an outline and roughly a dozen
labelled features (e.g. "a white cottage", "an oak forest", "Green
Bay"). Most features are common to the two maps, but not all. One
map has a route drawn in, the other does not. The task is for the
participant without the route to draw one on the basis of discussion
with the participant with the route. In addition to the conversations,
each speaker provides a wordlist reading, consisting of the major
vocabulary items contained in the conversations. All recordings were
made direct to Digital Audio Tape (DAT) at 48 kHz, providing very good
acoustic quality.

The experimental design allows a number of different phonemic,
syntactico-semantic and pragmatic contrasts to be explored in a
controlled way. In particular, maps and feature names were designed
to allow for controlled exploration of phonological reductions of
various kinds in a number of different referential contexts, and to
provide, via varying patterns of matches and mis-matches between the
two maps, a range of different stimuli for referent negotiation. Also,
the conditions of the conversations were carefully balanced: in half
of them the speakers were strangers, in half friends; in half of them
the speakers could see each other's faces, in half they could not.

Subjects accommodated easily to the task and experimental setting, and
produced evidently unselfconscious and fluent speech. The syntax is
largely clausal rather than sentential, and the dialogues show good
turn-taking, with modest amounts of overlap and interruption. The total
corpus runs to
about 18 hours of speech, with the transcripts consisting of around
150,000 word tokens drawn from just over 2,000 word form types.

Transcription is at the orthographic level, quite detailed,
including filled pauses, false starts and repetitions, broken words,
etc. Considerable care has been taken to ensure consistency of
notation, which is thoroughly documented. Although the full
complexity of overlapped regions has not been reflected in the
transcriptions, such regions are clearly set off from the rest of the
transcripts. Transcripts are connected to the acoustic sampled data
by sample numbers marked every few turns.
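
As an illustration of how such sample-number markers might be used, the
following minimal sketch converts a sample number into a time offset. It
assumes the markers index the 20 kHz waveform data as distributed; the
announcement does not say this explicitly, and the function names are
hypothetical, not part of the corpus tools.

```python
# Assumption: sample numbers in the transcripts refer to the 20 kHz data
# on the CD-ROMs. The rate can be overridden if that turns out to be wrong.
SAMPLE_RATE_HZ = 20_000


def sample_to_seconds(sample_number, rate_hz=SAMPLE_RATE_HZ):
    """Convert a sample number to a time offset in seconds."""
    return sample_number / rate_hz


def format_offset(sample_number, rate_hz=SAMPLE_RATE_HZ):
    """Render a sample number as minutes:seconds.milliseconds."""
    total_ms = round(sample_number * 1000 / rate_hz)
    minutes, rem_ms = divmod(total_ms, 60_000)
    seconds, millis = divmod(rem_ms, 1000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"
```

For example, under this assumption sample number 1230000 lies 61.5
seconds into the recording.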

CD-ROM Contents

The waveform data are provided in "raw" (headerless) files (16-bit
samples, 20 kHz sample rate, 2 channels per conversation), and
alternative header files are provided for use with software based on
either the NIST "SPHERE" header structure or the European "SAM" header
structure. Transcriptions are provided for each conversation, marked
up with TEI-compliant SGML, in a minimally intrusive and easily
separated way. PostScript files of the map images used in the
experiments are provided, along with full documentation of the
experimental design and data collection protocol, resources for using
SGML tools on the transcriptions and other text materials, and an
extensive set of source code for performing basic signal processing
functions on the waveform data, such as down-sampling,
de-multiplexing, channel summation, and D/A conversion for Sun
workstations (including playback of segments selected via inspection
of transcripts in Emacs).
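
For readers without the supplied source code, de-multiplexing a
headerless two-channel file can be sketched roughly as follows. This is
not the code shipped on the CD-ROMs. It assumes that the two channels
are interleaved sample by sample and that the byte order is known; the
announcement specifies neither, so check the corpus documentation. The
function names are hypothetical.

```python
import struct


def demultiplex(raw_bytes, big_endian=False):
    """Split interleaved 16-bit two-channel data into two sample lists.

    Assumption: samples alternate channel 0, channel 1, channel 0, ...
    The byte order is NOT stated in the announcement; pass big_endian
    explicitly once it is known.
    """
    # Keep only whole stereo frames (2 channels x 2 bytes = 4 bytes each).
    n_samples = len(raw_bytes) // 4 * 2
    fmt = (">" if big_endian else "<") + f"{n_samples}h"
    samples = struct.unpack(fmt, raw_bytes[: n_samples * 2])
    return list(samples[0::2]), list(samples[1::2])


def sum_channels(chan0, chan1):
    """Mix two channels into one by averaging each pair of samples.

    Averaging rather than plain addition keeps the result within the
    16-bit range; this is a design choice of this sketch.
    """
    return [(a + b) // 2 for a, b in zip(chan0, chan1)]
```

A whole file could be read with open(path, "rb").read() and passed to
demultiplex, although for conversations of this length reading in
chunks would be kinder to memory.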

The CD-ROMs are in High Sierra (ISO 9660) format with the Rock Ridge
extensions, and are compatible with (inter alia) Unix, MS-DOS and
Macintosh operating systems.
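
As a rough consistency check on the figures given above, the size of the
waveform data can be estimated from the stated format. The sketch below
assumes that "about 18 hours" is the total recorded duration and ignores
the wordlist readings, transcripts, documentation and software, so the
result is only an order-of-magnitude estimate.

```python
# Figures taken from the announcement; "about 18 hours" is approximate.
HOURS_OF_SPEECH = 18
SAMPLE_RATE_HZ = 20_000
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 2
NUM_DISCS = 8


def audio_bytes(hours, rate_hz, bytes_per_sample, channels):
    """Uncompressed size in bytes of audio with the given parameters."""
    return hours * 3600 * rate_hz * bytes_per_sample * channels


total_bytes = audio_bytes(
    HOURS_OF_SPEECH, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS
)
bytes_per_disc = total_bytes // NUM_DISCS

print(total_bytes)     # 5184000000
print(bytes_per_disc)  # 648000000
```

On these assumptions the audio alone comes to roughly 5.2 GB, or about
648 MB per disc, which is in the region of what a CD-ROM typically
holds.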

Copies of the Map Task Corpus are available from the LDC for $200 or
from HCRC for 164.50 UK pounds (including VAT) at the addresses given
below, plus postage and packing as necessary. Please contact us (by
e-mail if possible) for details of payment methods and shipping costs.

In Europe please contact

 Henry Thompson
 University of Edinburgh
 Human Communication Research Centre
 2 Buccleuch Place
 Edinburgh EH8 9LW
 Tel: +44 31 650-4440
 Fax: +44 31 650-4587

 Dawn Griesbach
 2 Buccleuch Place
 Edinburgh EH8 9LW
 Tel: +44 31 650-4594
 Fax: +44 31 650-4587

Outside Europe please contact

 Elizabeth Hodas
 Linguistic Data Consortium
 441 Williams Hall
 University of Pennsylvania
 Philadelphia, PA 19104-6305

 Tel: (215) 898-0464
 Fax: (215) 573-2175