LINGUIST List 8.678

Wed May 7 1997

Sum: Names in corpora

Editor for this issue: Ann Dizdar


  1. Kristine Hasund, Sum: names in corpora

Message 1: Sum: names in corpora

Date: Wed, 07 May 1997 10:58:30 +0200
From: Kristine Hasund
Subject: Sum: names in corpora

A while ago, I posted a message on Linguist and Corpora requesting
information about the different conventions that are used to protect
the identity of informants in spoken corpora.

I wish to thank the following people who kindly replied:

Gerald Nelson
Antoinette Renouf
Bernadette Vine
Svenja Sachweh
Susan Meredith Burt
Christine Cheepen
Marco Antonio Da Rocha
Bill Fisher
Dan I. Slobin

Below is a (longish!) summary of the answers I got to the following
questions:

- To what extent have last names, first names and addresses been erased
from tape/video recordings and replaced by fictitious names in the

 - If first names have been changed as well as last names and addresses,
what were the reasons given for doing so? (legal, ethical, or other)

- Were names changed manually or automatically (eg by means of a
"search-replace" word processor function)?

1) Gerald Nelson, University College London: The ICE corpus

Since 1990, I have been responsible for collecting, transcribing, and
digitizing the recordings for the British ICE corpus. Perhaps I
should say that all recording was non-surreptitious: speakers were
required to complete a form prior to recording, granting permission
for its use in academic research. The form also contained an option
to have names changed in the transcripts and in the recording.

 In practice, very few speakers opted for anonymity. However, the
authors of personal and business letters very often chose anonymity.
In all cases we changed first and last names, as well as addresses.

Names were changed for legal reasons.

 Names were changed manually during transcription, and explicitly
marked as changed names. In the digitized version, on CD, we
concealed names by putting a "beep" in the appropriate place.

2) Antoinette Renouf, University of Liverpool

 I think that the point of your investigation, which I understand to
be that there is a sociolinguistic etc significance in the first name
choice, is not something that they will have considered much. The
question of preserving anonymity will almost certainly have been
overriding in their considerations, since they are largely
administrators and worried about legal issues.

I would have thought you would have more luck looking for work done by
linguists and sociolinguists on names per se. You might talk to
Patrick Hanks at OUP, because he did at least a surnames dictionary
and maybe a first name dictionary. Or try other lexicographers and
editors of the books of names for naming children.

Also, occasionally the newspapers announce that such and such are the
top ten names for babies in the country. They must get this info from
the registration of birth offices. I think sociolinguistics is the
place to look for research and bibliography.

3) Bernadette Vine, Victoria University of Wellington: The Wellington
Corpus of Spoken New Zealand English:

Real names have not been erased on the tapes/videos unless the people
who recorded the tape specifically requested them to be. In the
transcripts pseudonyms are always used unless the name is a matter of
public record i.e., broadcast material. Broadcasting material from
the Corpus is prohibited except where specific permission has been
obtained. Researchers using the Corpus have to sign a document saying
they will not disclose any information from the material they listen to.

First and last names have been changed and place names where this may
identify speakers. This was done because speakers were assured that
their identity would be protected.

Names were changed during the initial transcription stage. Generally
names with the same gender or ambiguity of gender as the real name,
stress patterns, number of syllables and ethnicity were used as

We are currently collecting and transcribing another Corpus and have
followed the same general principles (with a few differences due to
the differing nature of the two corpora).

4) Svenja Sachweh, Freiburg: nursing homes corpus

Since I'm working in the sensitive area of communication in nursing
homes for the aged, I practically changed everything in order to
guarantee 100% anonymity. Due to the fact that I did not have the
technical means to erase something from my tape-recordings, I did not
do that. (The chances are very good that no-one else but me will ever
listen to the tapes.) However, I replaced everything (i.e. first and
last names, place names, addresses, etc.) in the transcripts.

I changed first names for ethical reasons - after all, I promised to
do that when I asked for permission to audiotape conversations!

 I did use a search and replace function. However, since I keep finding
instances of real names during analysis, I also change names manually.

5) Susan Meredith Burt, University of Wisconsin Oshkosh: University
students corpus

I have done work with taped conversations 1) between university
students--Americans paired with students from other countries--and 2)
of my family the year that we hosted a foreign student in our home.
With conversations between students, I have changed their names to
plausible first names that begin with the same letter. Claire would
become Claudine, for example. This makes it possible for me to
remember who's who. In the case of people the speakers talk about, I
change first name and last name the same way. In the case of my
family, I assumed that anonymity was impossible, so I left our own
names the way they are, but I systematically changed the name of our
German guest. I have not erased any names on the tapes. I simply
change the names as I transcribe. This is not hard--you just have to
think out everyone's pseudonyms before you begin transcribing.

6) Christine Cheepen, University of Surrey

I always (...) replace names by syllabic equivalents. Sometimes it
may not be necessary, but I think it is safer, just in case someone
objects later on.

The reasons given for changing names: Legal and ethical - casual
conversation nearly always involves some gossip about people not
present. In what I would call transactional (as opposed to
interactional) dialogue - e.g. service calls, there is of course the
problem of confidentiality.

Names changed manually or automatically: It very much depends on how
many items need to be changed. Sometimes I have transcribed casual
conversation by changing manually as I transcribe, but in those cases
I always do a search and replace at the end in case I've missed some.

7) Marco Rocha, University of Sussex

I have invariably replaced all names and addresses, including names of
buildings, such as hospitals. I have not erased them from tape

I have personally assured informants that anonymity would be
guaranteed. The conversations occur in a hospital and many of them
involve medical information of a private nature. Reasons (for
replacing names) are thus ethical.

Names were changed manually as the data were transcribed by myself.
Replacements used attempt strenuously to retain the prosodic features
of the speech recorded.


8) Bill Fisher

 I've been involved in the production and processing
of speech corpora sponsored more-or-less directly by
ARPA for a long time, starting with TIMIT and going
through SWITCHBOARD and CALLHOME. I don't believe there
has ever been an effort to disguise personal names
that are used in the conversations we record, although
speaker i.d. is a thinly-disguised code in accompanying
tables of speaker information.

 There are probably 2 reasons for what may seem to be
a very lax policy: 1) the subjects sign a legal form
allowing the recording agency to do anything they want
to with the speech; 2) since the main use of the corpora
is to train and test speech recognizing computers,
a mutilated speech wave would hurt.

 Bill Fisher

9) Dan I. Slobin, University of California at Berkeley: child language

	In child language research it has been the norm for the past
30 years or more to protect the identity of the child by assigning a
pseudonym. Roger Brown started this in his pioneering work at Harvard
in 1962, naming the first two children in the study "Adam" and "Eve."
His proposal was to work forward through the Bible, and so the third
child he studied was named "Sarah." Several others followed this
model, naming children "Noah" and "Shem." Other investigators simply
picked another name for the child. Last names are always deleted, and
the names of other participants are also changed (siblings, visitors,
etc.). This is easily done by global replacement.
	It is part of our agreement with committees for the protection
of human subjects that all participants in psychological research be
kept anonymous, and that the data can be used and distributed only
with the consent of the subjects. Every university has a standing
committee for this purpose, with its own set of rules.
	Tape- and video-recordings can only be used with the consent
of the persons who were recorded (or their parents). It is more
difficult in the case of videotapes, and one must be very careful
about informed consent, since the identity of the participants can't
be hidden from view.
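
Several of the replies above mention a global search-and-replace pass,
sometimes followed by a manual check for names that slipped through. As a
rough illustration only, that workflow might be sketched in Python as
below. The mapping, the function names, and the sample names are all
hypothetical; none of the respondents described their tools in this
detail.

```python
import re

# Hypothetical mapping from real names to pseudonyms (illustrative only).
PSEUDONYMS = {
    "Claire": "Claudine",
    "Smith": "Brown",
}

def pseudonymize(text, mapping):
    """Replace each whole-word occurrence of a real name with its pseudonym."""
    if not mapping:
        return text
    # Longest names first, so a multi-word name such as "Mary Ann"
    # wins over the shorter "Mary".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

def leftover_names(text, mapping):
    """Return real names still present (ignoring case) for manual checking."""
    found = []
    for name in mapping:
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE):
            found.append(name)
    return found
```

The replacement is case-sensitive and matches whole words only, so a
differently cased or misspelled name is left untouched; the second
function flags differently cased leftovers for the kind of manual pass
that some respondents found they still needed.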

---------------------------------------------
Kristine Hasund
English Department
Høgskolen i Agder
Tordenskjoldsgate 90
4604 Kristiansand
Tel: 38 14 16 43
---------------------------------------------