Dear Colleagues:

At the last COCOSDA meeting, held in Berlin in September 1993 (during EUROSPEECH-93), the committee for speech synthesis decided to form a small working group to gather information about ongoing research on speaker and voice variability in speech synthesis. I was asked to organize and initiate this group (together with Kim Silverman and Nick Campbell). As usual, university terms start right after the conferences, and the workload with them, so I have not been active in this matter until now. However, I want to give a report at the next COCOSDA meeting, which is to take place in September during ICSLP-94 and/or at the IEEE/ESCA workshop on speech synthesis just before ICSLP. I would therefore be very grateful for any information you can provide on this matter. I specify it a little more closely in the following.

Recent work in speech synthesis deals not only with implementing TTS systems and improving their quality on all levels, but also with the question of how to bring something of the variety of human voices into the synthesis. Some typical questions or problems in this domain are:

-- How can we synthesize emotional speech (happy, angry, sad, etc.)? In which way must the synthetic voice be varied (prosody, voice quality), and what can be achieved with the different approaches [concatenative synthesis, parametric synthesis by rule with a source-filter model (e.g., formant synthesis), articulatory synthesis]?

-- How can we synthesize a variety of voices with the same system? How can we, for instance, transform a male synthetic voice into a female one and vice versa? How can we interpolate between several voices (which may be particularly difficult in concatenative synthesis, which is based on elements of natural utterances)? What are the specific problems in this respect?

-- How can we synthesize a variety of speaking styles (casual, clear, formal)? Which reductions, elisions, etc. increase the naturalness of a "neutral" synthesis system (e.g., a reading machine) and should thus be incorporated, and which ones should not because they are not appropriate?

-- How can we adapt a synthetic voice to a given natural one (not only with respect to the sex of the natural speaker, but also to fundamental frequency range, spectral properties, etc.) when the task is, for whatever reason, to make the synthetic voice sound as similar to the natural target voice as possible?

As I said, these are only some of the questions, meant to make a little clearer what kind of information we would like to receive. The list is far from complete.

To make it easy for you (and hence, I hope, to increase the number of responses), I am not circulating a long questionnaire. I only ask you to answer a few questions if you or colleagues at your institution are active in this domain. Please indicate the type of work done (this will be kept confidential if desired) and the results achieved so far. If you have publications on this subject, please give the references (not necessarily in English!). Collecting references on this subject is most important to us.

For convenience, you may use the following preformatted mailer to reply to me. As I am distributing this letter over several (moderated and unmoderated) mailing lists, please use this mailer. PLEASE DO NOT RESPOND TO THE LIST (it might be flooded otherwise, making some people very upset with me!). If you are not active in this domain yourself but know people who are, please forward this mail to them. I apologize to anybody who receives this letter more than once via different channels.
--- Start of Mailer ---------------------------------------------------
mail -s "Voice Variability in Speech Synthesis" wgh[...]uni-bonn.de

Your name, institution, address (including fax and e-mail) ...

Active in which area of speech (processing and) synthesis? ...

Which principle of speech synthesis do you apply ...
- concatenative synthesis with parts of natural utterances using PSOLA or some parametric representation
- synthesis by rule using a parametric representation (formant synthesis, LPC, ...)
- synthesis by an articulatory model

What kind of system do you use ...
- text-to-speech
- dialog system (semantic representation to speech or similar)
- other application

Which language(s) is your system able to synthesize?

Which specific research of yours is particularly related to voice and speaker variability as indicated above? Which questions are covered at your lab?

MOST IMPORTANT: If you have relevant publications, please give me a list of references ...
--- End of Mailer -----------------------------------------------------

As time is running short, I would appreciate receiving this information as soon as possible, but, PLEASE, BEFORE JUNE 15, 1994. I will then compile a list of information and a small bibliography and distribute it to those who respond to this mail.

Thank you in advance for your kind cooperation.

Sincerely yours,
Wolfgang Hess