AUTHOR: Jonathan Harrington
TITLE: Phonetic Analysis of Speech Corpora
PUBLISHER: Wiley-Blackwell
YEAR: 2010

Olga Dmitrieva, Department of Linguistics, Stanford University

SUMMARY

The author defines the potential audience for this book as scholars of phonetics embarking on their first large-scale project, such as a master's or honors thesis, as well as a general research audience. The stated goal is to supplement readers' knowledge of acoustic phonetics and basic statistical techniques with a practical guide to testing research hypotheses using speech corpora. The book is essentially a practical introduction to phonetic data analysis using the Emu speech database system (a set of software tools for the creation, manipulation, and analysis of speech databases) and the Emu-R interface, a collection of R functions written for handling data imported from Emu (R is a software environment for statistical computing and graphics). The book consists of nine chapters, a preface followed by a list of simple instructions for downloading the necessary software, a bibliography, and an index. Each chapter, with the exception of the introductory first chapter, is followed by a set of ''questions'' or exercises devised to further familiarize readers with the concepts, tools, and functions discussed in the chapter. Solutions to the exercises are supplied. A brief overview of each of the nine chapters is presented below.

Chapter 1 addresses the importance of corpora in phonetic research and the challenges of creating your own corpus. The chapter begins with a discussion of the advantages of speech corpora over other kinds of material that phonetic research may rely upon, such as impressionistic transcription, and then offers a brief review of the issues related to designing your own corpus. While this introduction is brief and fairly basic, it provides helpful references which readers may consult should they require more information on a particular issue.

The chapter ends with a summary and an overview of the book’s structure, along with a list of corpora available for phonetic research with examples of phonetic studies which analyzed them.

Like the rest of the book, Chapter 2 is dedicated to hands-on exploration of the Emu system and the Emu-R interface. It begins by walking the reader through the process of starting up the Emu database tool, downloading a database, and opening an annotated utterance from a database. It shows with detailed illustrations how annotation in Emu is structured and what kind of information is available to view when the utterance is brought up. A basic introduction to R follows. This develops into a discussion of the Emu-R interface, in particular how to read (query) annotation labels and associated time-stamps into R from the Emu annotation files and how to perform simple operations with them, such as saving them as objects in R (''segment lists'') and calculating the duration of the labeled segments.

Chapter 3 continues to familiarize readers with the functionalities of the Emu database tool and its interface with R, and advances to a discussion of the basic signal processing capabilities of the Emu system. The body of the chapter works through the procedures for calculating, displaying, and manually correcting vowel formants in Emu, as well as importing them into R as ''trackdata'' objects and creating formant plots. The exercises at the end of the chapter build on the information presented in the chapter and extend the focus to calculating intensity, zero-crossing rate, and fundamental frequency.

Chapter 4 pursues the intricacies of annotation structures in Emu and the types of queries that can be applied to them in Emu-R. After a brief overview of the basic types of operators which can be used to make more complex queries within the same annotation tier, it moves on to discuss the way annotation tiers can be linked in Emu so that queries can span more than one tier at a time. The rest of the chapter describes the types of connections between annotation tiers and shows how the tiers can be linked in Emu manually and semi-automatically, and how the linked tier structure can be translated to Praat TextGrids.

Chapter 5 aims at deepening the readers' understanding of the way Emu data are treated in R as segment lists and trackdata objects, the differences between the two, and the functionalities available for each. It is illustrated with articulatory movement data obtained with the electromagnetic midsagittal articulograph (EMA).

The chapter begins with an overview of the EMA recording technique and the process of constructing the articulatory movement database. It also describes the way the movement database is annotated and structured in Emu. The movement data are then used to show how the basic objects necessary for further analysis are created. The practical demonstration is supplemented by a discussion of the types of objects in R and the ways they differ, especially in terms of the functions that can be applied to them. The use of comparison operators and logical vectors is demonstrated while computing mean voice onset time (VOT), introducing along the way some basic descriptive statistics and tabulation functions in R. The author also points out the need to supplement the analysis of the mean with an analysis of the distribution and provides a procedure for creating boxplots displaying the median, the interquartile range, and the range of the VOT data. The analysis of intergestural coordination is introduced with synchronized tongue-body/tongue-tip movement plots for individual segments as well as ensemble plots for categories of segments. The analysis of intragestural coordination, in particular of velocity, relies on the differencing operation discussed in the chapter. The chapter also considers an approach to articulatory movement as a critically damped mass-spring system and uses it to test a particular hypothesis related to the practice dataset.
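
To make the differencing operation concrete, here is a minimal sketch of how velocity can be approximated from equally spaced position samples. It is not taken from the book (which works in R with Emu-R functions); the function name, the sample values, and the 200 Hz sampling rate are all made up for illustration.

```python
def velocity(positions, sample_rate_hz):
    """Approximate velocity by first-order differencing.

    positions: equally spaced position samples (e.g. in mm).
    Returns one value fewer than the input, in units per second.
    """
    return [(later - earlier) * sample_rate_hz
            for earlier, later in zip(positions, positions[1:])]


# Hypothetical tongue-tip height samples (mm) at a made-up 200 Hz rate.
tongue_tip_mm = [0.0, 0.5, 1.5, 2.0, 2.0]
speeds = velocity(tongue_tip_mm, 200)

# Index of the sample interval with the highest absolute velocity.
peak_index = max(range(len(speeds)), key=lambda i: abs(speeds[i]))
```

The time of peak velocity found this way is the kind of landmark on which analyses of movement within a single gesture typically rely.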

Chapter 6 returns to the subject of vowel formants and formant transitions, this time in more detail and with more complex analyses. It covers issues of contextual influence on vowels, vowel targets, normalization, vowel reduction and undershoot, and coarticulatory influences of vowels on consonants. It introduces the technique of k-means used to assess the influence of the immediate phonetic context on vowel acoustics.

The chapter discusses the idea of the vowel target as the first formant (F1) maximum and shows how to find the point of F1 maximum using Emu-R functions and how to export the established target times as annotations into Emu. It also demonstrates two extrinsic vowel normalization techniques, transformation to z-scores and subtraction of a speaker-dependent constant, as well as transformation to the Bark scale as an intrinsic normalization technique. It shows how Euclidean distance is used to assess the degree of vowel space expansion or centralization and to compare the relative distance between two vowel categories in a formant space. Plotting a histogram is also introduced.
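
As a rough illustration of two of these ideas, the sketch below converts one speaker's formant values to z-scores and then measures each token's Euclidean distance from the centroid of the normalized vowel space. It is written in Python, not in the book's R, and the function names and formant values are invented; using the sample standard deviation is likewise an assumption.

```python
import math
import statistics


def zscore(values):
    """Rescale one speaker's formant values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [(v - mean) / sd for v in values]


def distance_to_centroid(points):
    """Euclidean distance of each (F1, F2) point from the centroid of all points."""
    centroid = tuple(statistics.mean(axis) for axis in zip(*points))
    return [math.dist(p, centroid) for p in points]


# Made-up F1/F2 values (Hz) for three vowel tokens of one speaker.
f1 = [300.0, 500.0, 700.0]
f2 = [2200.0, 1500.0, 800.0]
normalized = list(zip(zscore(f1), zscore(f2)))
spread = distance_to_centroid(normalized)
```

Larger distances from the centroid indicate a more expanded vowel space, smaller ones a more centralized space.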

A method of estimating vowel undershoot by fitting a parabola to the trajectory of the second formant and measuring its curvature is discussed as well. Another useful outcome of this analysis is formant smoothing, which results from reconstructing the formant trajectories from the parabola coefficients. The author also offers an alternative and superior method of formant smoothing, the discrete cosine transformation (DCT), which allows the degree of smoothing to be controlled and provides a better fit to the contour of the formant trajectory.
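
The smoothing idea can be sketched as follows: transform the formant track with a discrete cosine transformation, keep only the first few coefficients, and rebuild the track from them. This is an illustrative Python version with an unnormalized DCT-II, which may be scaled differently from the Emu-R implementation; the function names and the F2 values are hypothetical.

```python
import math


def dct(signal):
    """Unnormalized DCT-II coefficients of a sequence."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(signal))
            for k in range(n)]


def smooth(signal, n_coefficients):
    """Rebuild the signal from its first n_coefficients (at least 1) DCT coefficients."""
    n = len(signal)
    kept = dct(signal)[:n_coefficients]
    return [kept[0] / n
            + (2.0 / n) * sum(c * math.cos(math.pi * k * (i + 0.5) / n)
                              for k, c in enumerate(kept) if k > 0)
            for i in range(n)]


# Made-up, slightly jagged F2 track (Hz).
f2_track = [1500.0, 1620.0, 1580.0, 1750.0, 1700.0, 1640.0, 1660.0, 1500.0]
f2_smooth = smooth(f2_track, 3)
```

Keeping fewer coefficients gives a smoother track: one coefficient reduces the track to its mean, while keeping all of them reproduces the original exactly.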

The rest of the chapter is devoted to quantifying and comparing the coarticulatory influence of vowels on neighboring consonants using locus equations.
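
For readers unfamiliar with locus equations: one is a straight line fitted to F2 at vowel onset as a function of F2 at the vowel target, across different vowel contexts of the same consonant. The sketch below fits such a line by ordinary least squares; it is a Python illustration with invented values and a hypothetical function name, not the book's R code.

```python
def locus_equation(f2_target, f2_onset):
    """Least-squares line f2_onset = slope * f2_target + intercept."""
    n = len(f2_target)
    mean_x = sum(f2_target) / n
    mean_y = sum(f2_onset) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(f2_target, f2_onset))
    sxx = sum((x - mean_x) ** 2 for x in f2_target)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up F2 values (Hz) for one consonant before three different vowels.
targets = [1000.0, 1500.0, 2000.0]
onsets = [1400.0, 1650.0, 1900.0]
slope, intercept = locus_equation(targets, onsets)
```

A slope close to 1 means the onset frequency follows the vowel closely (strong vowel-on-consonant coarticulation), while a slope close to 0 means the onset stays near a fixed locus whatever the vowel.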

Chapter 7 looks at electropalatography (EPG) and shows how the data obtained with this technique can be processed and analyzed in Emu-R. At the start, a general overview of palatography and electropalatography is provided, with a discussion of the advantages and limitations of these techniques. It is also demonstrated how EPG data are accessed, represented, and manipulated in Emu-R. The types of plots available for EPG data are also considered.

Most of the chapter concentrates on data reduction techniques available for EPG data, which allow a more convenient way of evaluating the shape and position of the contact. These techniques include ''contact profiles'', where contacts are summed by column and/or by row, and ''contact distribution indices'', such as the anteriority index, centrality index, dorsopalatal index, and center of gravity index. The author shows how Emu-R functions can be applied to calculate and plot these data-reduced objects for the EPG data and used to answer simple research questions, such as comparing the amount of overlap in consonant clusters and the amount of vowel-on-consonant coarticulatory influence.
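
To give a flavor of these data-reduction steps, the sketch below computes row and column contact profiles and a center of gravity index for a single palatogram. It is illustrative Python, not Emu-R: the 8 x 8 grid, the function names, and the row weights (7.5 for the front row down to 0.5 for the back row, one commonly cited weighting) are assumptions, and the book's own functions may differ.

```python
def row_profile(palate):
    """Number of contacted electrodes in each row (front row first)."""
    return [sum(row) for row in palate]


def column_profile(palate):
    """Number of contacted electrodes in each column (left to right)."""
    return [sum(column) for column in zip(*palate)]


def centre_of_gravity(palate):
    """Weighted mean row position of contact; higher values are more anterior.

    Assumed weights: (number of rows - 0.5) for the front row down to 0.5
    for the back row.
    """
    rows = row_profile(palate)
    total = sum(rows)
    if total == 0:
        raise ValueError("no contacts in this palatogram")
    n_rows = len(rows)
    return sum((n_rows - i - 0.5) * count for i, count in enumerate(rows)) / total


# Made-up palatogram: full contact in the two front rows, lateral contact behind.
palate = [[1] * 8, [1] * 8] + [[1, 0, 0, 0, 0, 0, 0, 1] for _ in range(6)]
rows = row_profile(palate)
columns = column_profile(palate)
cog = centre_of_gravity(palate)
```

Each palatogram is thereby reduced to a handful of numbers that can be tracked over time or compared across consonant categories.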

Chapter 8 addresses spectral analysis, starting with a review of the fundamentals of spectra. This review covers digital sinusoids and their components; some basic mathematics behind the Fourier transform; sampling windows and the importance of applying Hamming or Hanning windows before the discrete Fourier transform (DFT) to reduce the effects of spectral leakage; the trade-off between time and frequency resolution in the results of Fourier analysis and its dependence on the length of the window; the interpolation and smoothing provided by the zero-padding technique; and pre-emphasis and its uses, for instance in distinguishing between two sounds that are differentiated mostly by energy in the high frequencies. The theory is supplemented by example calculations and plots in R. At the end of the section the author shows how the spectral data derived from the speech signal using the Emu-tkassp toolkit can be handled in Emu-R. He discusses the way trackdata objects containing spectral data are represented in Emu-R and demonstrates basic operations that can be applied to spectral objects and their components, for example to limit the frequency range.
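
The windowing and zero-padding steps can be illustrated with a small self-contained example. The sketch below applies a Hamming window to a sinusoid, computes a magnitude spectrum with a direct (slow) DFT, and then repeats the calculation with zero padding. It is a Python illustration with made-up names and values, not the book's R code.

```python
import cmath
import math


def hamming(n_points):
    """Symmetric Hamming window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def magnitude_spectrum(signal, n_fft=None):
    """Magnitudes of DFT bins 0..n_fft/2, zero-padding the signal to n_fft."""
    n_fft = n_fft or len(signal)
    padded = list(signal) + [0.0] * (n_fft - len(signal))
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n, x in enumerate(padded)))
            for k in range(n_fft // 2 + 1)]


# Made-up test signal: 4 cycles of a sinusoid in 32 samples.
n_samples = 32
sinusoid = [math.sin(2 * math.pi * 4 * n / n_samples) for n in range(n_samples)]
windowed = [s * w for s, w in zip(sinusoid, hamming(n_samples))]

spectrum = magnitude_spectrum(windowed)          # 17 bins
interpolated = magnitude_spectrum(windowed, 64)  # 33 bins on a finer frequency grid
```

Zero padding doubles the number of spectral points here but only interpolates between them; the true frequency resolution is still set by the 32-sample window.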

The chapter also introduces a number of data-reduction techniques which allow for a more effective comparison between different phonetic categories and shows how they can be applied using Emu-R functions. These include computing the spectral average, the spectral sum, and the ratio of the spectral average or spectral sum in a certain frequency range to the total spectral energy, which is also shown to be useful in normalizing for possible variation in a speaker's loudness. The author demonstrates that a difference spectrum, produced by subtracting one spectrum from another, can also be used for normalizing, as well as for distinguishing between certain phonetic categories. The spectral slope technique, which fits a line of best fit to the spectrum and reduces it to intercept and slope coefficients, is introduced to show its uses for differentiating among places of articulation in oral stops. The author highlights that all of these techniques can be applied at a single point in time as well as across spectral slices, allowing the evaluation of changes in a particular parameter of the spectrum through time.

The chapter also introduces the method of calculating spectral moments, which encode some basic properties related to the shape of the spectrum such as its mean, variance, skew, and kurtosis. The final part of the chapter deals with yet another way of assessing the shape of the spectrum, with the help of the discrete cosine transformation (DCT). It is also shown how the DCT method can be applied to signal smoothing.
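
Spectral moments treat the spectrum as if it were a probability distribution over frequency. The sketch below shows one way to compute the first four, assuming power values on a linear scale and reporting kurtosis with 3 subtracted so that a normal distribution scores 0. The function name and example values are hypothetical, and the Emu-R implementation may handle scaling differently.

```python
def spectral_moments(frequencies, power):
    """Mean, variance, skew and excess kurtosis of a spectrum.

    The power values are normalized to sum to 1 and used as weights.
    """
    total = sum(power)
    weights = [p / total for p in power]
    mean = sum(f * w for f, w in zip(frequencies, weights))

    def central(order):
        return sum(((f - mean) ** order) * w
                   for f, w in zip(frequencies, weights))

    variance = central(2)
    skew = central(3) / variance ** 1.5
    kurtosis = central(4) / variance ** 2 - 3.0
    return mean, variance, skew, kurtosis


# Made-up symmetric three-point spectrum (frequencies in Hz, linear power).
mean, variance, skew, kurtosis = spectral_moments(
    [1000.0, 2000.0, 3000.0], [1.0, 2.0, 1.0])
```

A spectrum with its energy concentrated at high frequencies has a high first moment, and one tilted towards low frequencies has positive skew, which is what makes these values useful for separating fricatives and stop bursts.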

Chapter 9, the final chapter, is dedicated to methods of classifying speech sounds. The issues discussed appear most immediately relevant to the field of speech recognition, although, as the author points out, probabilistic methods such as the ones used for classification are becoming more and more important in experimental phonetics and linguistics in general. The final goal of the techniques described here is to separate the phonetic categories most effectively using the smallest number of contributing parameters. As the basis of most probabilistic classification analyses, Bayes' theorem and the Gaussian distribution are introduced up front. The concepts of training and testing stages, closed and open tests, and supervised and unsupervised learning are explained along the way. Data classification in a one-dimensional parameter space is demonstrated and followed by increasingly complex examples of classification in two-dimensional and multidimensional spaces. The author addresses the issues of over-fitting the training model and correlation between parameters. It is also demonstrated how principal component analysis (PCA) can be used to reduce the redundancy of the model. The author acknowledges that time is often crucial in phonetic research since so much of the data extracted from speech is dynamic in nature. Here he presents a method for compressing dynamic spectral data in which the DCT is applied to reduce each spectral slice to a small number of coefficients and a polynomial is fitted to each coefficient as a function of time, resulting in three values representing the mean, the slope, and the curvature of this coefficient's trajectory in time. Thus a multitude of DFT slices and their components can be reduced to a single point in an n-dimensional space, which then serves as a basis for classification.

The rest of the chapter discusses the advantages of classification using a ''support vector machine'' (SVM) for data that are not normally distributed and compares its performance with that of the Gaussian model in the classification of oral stops in the two-dimensional space of the dynamic DCT parameters.
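
The simplest case described for this chapter, Gaussian classification along a single parameter, can be sketched in a few lines. In the training stage a normal distribution and a prior probability are estimated for each category; in the testing stage a new value is assigned to the category with the highest prior-weighted density, which by Bayes' theorem is the category with the highest posterior probability. The Python code below is only an illustration with invented F2 values and function names; the book's examples are in R.

```python
from statistics import NormalDist


def train(labelled_values):
    """Estimate a Gaussian and a prior probability for each category."""
    total = sum(len(values) for values in labelled_values.values())
    return {label: (NormalDist.from_samples(values), len(values) / total)
            for label, values in labelled_values.items()}


def classify(model, x):
    """Return the category with the largest prior * likelihood for x."""
    scores = {label: prior * dist.pdf(x)
              for label, (dist, prior) in model.items()}
    return max(scores, key=scores.get)


# Made-up F2 training values (Hz) for two vowel categories.
training = {"i": [2100.0, 2200.0, 2300.0], "u": [800.0, 900.0, 1000.0]}
model = train(training)
label_a = classify(model, 2000.0)
label_b = classify(model, 1100.0)
```

Classifying the training tokens themselves corresponds to a closed test, and classifying unseen tokens to an open test.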

EVALUATION

The book succeeds entirely in its goal of providing an accessible and effective practical introduction to using the Emu speech database system and Emu-R functions to analyze phonetic data. It is written in clear and accessible language, and the topics are introduced in a coherent and easy-to-follow manner, with the complexity of the material gradually increasing from the beginning towards the end of the book. Even rather complicated concepts are made easy to understand with an exceptional use of analogy and a commendable restraint from going into too many mathematical and technical details. What I particularly appreciated about the organization of the book is that it is structured not around the features of the Emu system but rather around the types of phonetic analysis that most students and researchers are likely to get involved in: vowel acoustics, formants, and formant transitions; normalization; articulatory data analysis; and spectral analysis. I also found it very helpful that the functions and commands introduced in earlier chapters were often repeated in later chapters. The use of graphic devices is superb throughout.

However, the title of the book seems somewhat misleading: it suggests a certain breadth of scope and implies that the discussion will concentrate on using already available phonetically annotated corpora to answer research questions, while this is only briefly touched upon in the text. Since the book is very clearly focused on instructing the reader in the uses of one particular system for phonetic analysis of their own speech recordings, it appears that something along the lines of ''A practical introduction to phonetic data analysis using the Emu speech database system'' would be more appropriate.

It should also be mentioned that the book limits itself to a very well defined area, mostly ways of extracting data from already annotated corpora. (This is by no means to underestimate the subject: it covers an impressive range of methods and techniques and will without a doubt be a great resource for phoneticians.) The issues preceding data extraction -- such as development of the hypothesis, experimental methods, and construction of the corpus, including annotations -- are given only cursory attention, as are the statistical analysis and the interpretation of the results. Overall, I do not take this to be a drawback, since it is obviously not a book on linguistic research methods, although in places it would benefit from a slightly more detailed exploration of the linguistic background of the analyzed data. A brief statement about the implications of the patterns uncovered in the data could also make the exercises more exciting.

A few minor issues: on a couple of occasions the practical exercises were difficult to complete due to bugs in the Emu system. I understand this to be a developing program and I am sure these problems will soon be resolved. There are also a couple of places in the book where the text refers to elements of a table or a graph as being in ''bold'' when this highlighting is actually absent.

To sum up, this is a well-written, well-structured, easy-to-follow workbook which boasts an excellent set of practical exercises and demonstrations and covers a wide range of techniques. Overall, those readers who have a basic background in phonetics and statistics and are prepared to work their way carefully through this book will be greatly rewarded by its informativeness and effectiveness. While it may be of particular interest to researchers and students looking for an alternative to Praat and Praat scripting in phonetic data processing, the book will be a valuable addition to the list of readings in any class on research methods in linguistics, as well as an excellent main reading for a more specialized workshop or seminar.

REFERENCES

Boersma, Paul & Weenink, David (2010). Praat: doing phonetics by computer [Computer program]. URL http://www.praat.org/

The Emu speech database system (Version 2.3), URL http://emu.sourceforge.net/

R Development Core Team (2007). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

ABOUT THE REVIEWER

Olga Dmitrieva is a PhD candidate in the Department of Linguistics at
Stanford University. Her main research interests are in phonetics and
phonology. Her work addresses issues of the phonetics-phonology interface,
functional considerations in language typology, sound change, and language
interference. She is currently working on a crosslinguistic study of the
perception and production of consonant length in relation to the
typological distribution of geminate consonants.