Tutorial 4: Automatic Recognition of Natural Speech

Location: Room 101C, TICC

Presented by

Douglas O'Shaughnessy

Abstract

The automatic conversion of natural (spontaneous) speech into text is a highly interdisciplinary task involving aspects of computer science, engineering, acoustics, linguistics, and psychology. Recent major advances in this field have come not only from improvements in recognition algorithms and in computational speed, memory, and power, but also from the integration of concepts from human speech production and perception and from the use of powerful models of natural language. For an ICASSP audience, this tutorial will discuss the modern techniques of automatic speech recognition, emphasizing the breadth of knowledge needed to approach near-human performance in this complex task.

We will first briefly examine human speech production from an acoustic-phonetic view. The standard methods of speech analysis (e.g., FFT, LPC, and mel-based cepstrum) will be presented and discussed in terms of efficiency and robustness. The differences in objectives between speech coding and speech recognition will be noted. We will present the modern stochastic approaches to speech recognition (i.e., hidden Markov models), with simple examples to emphasize understanding for a non-expert audience. The issues of adequate training corpora and methods, and the many trade-offs for different practical applications, will be discussed. The differences between read speech and conversational speech will be noted, in terms of disfluencies and variable speaking rate. The added difficulties of recognizing speech over the telephone and with hands-free terminals will be explained, as well as issues of distributed speech recognition and the need to reduce a recognizer's computational footprint.

The importance of appropriate language models will be emphasized, with both basic N-gram models and more complex class-based and distance models discussed. We will examine the inadequacy of simply using N-grams as vocabulary size increases, despite the increasing availability of training texts and the increasing power of computers. We will describe the current state of the art in recognition of natural speech, both in commercial applications and in research, noting where current systems do well and where they need to improve. The possibilities of integrating knowledge-based sources (e.g., aspects of expert systems) into the current stochastic approaches to speech recognition will be examined. Predictions as to the future course of speech recognition research will be made, in light of the current success of application-specific recognizers and the continued failure to approach human performance on more general tasks.

Speaker Biography

Dr. O'Shaughnessy has worked in the speech communication field for almost 40 years, first as a student at MIT (BSc and MS in 1972, PhD in 1976), then as director of a research team at INRS in the areas of speech analysis, coding, synthesis, recognition and enhancement. His textbook "Speech Communication: Human and Machine" (Addison-Wesley, 1987, and now in its second edition from IEEE Press, 2000) is well known and has been widely used in university courses on speech. It indicates the breadth of knowledge he brings to bear on issues of speech communication. His most recent focus has been on speech recognition, where his research group publishes regularly in the ICASSP and Interspeech Proceedings, as well as in relevant journals. He is an associate editor for the Journal of the Acoustical Society of America (JASA) and for the EURASIP Journal on Applied Signal Processing. He is the founding editor-in-chief of the EURASIP Journal on Audio, Speech, and Music Processing.

In addition to his duties at INRS, he teaches every year as an adjunct professor in the electrical engineering department at McGill University. He was the General Chair for ICASSP-2004 in Montreal, and is a Fellow of both the IEEE and the ASA. From 1995 to 1999, he served as an Associate Editor for the IEEE Transactions on Speech and Audio Processing, and was recently elected a member of the IEEE Technical Committee for Speech Processing. He was recently a Member-at-Large of the IEEE SPS Board of Governors and a member of the IEEE SPS Conference Board. In 2003, with Li Deng, he co-authored the book "Speech Processing: A Dynamic and Optimization-Oriented Approach". He has presented tutorials on speech recognition at ICASSP-96 in Atlanta, at ICASSP-2002 in Orlando, and at ICC-2003 in Anchorage. He has authored papers at every ICASSP (except one) since 1986.


©2016 Conference Management Services, Inc. -||- email: webmaster@icassp09.com -||- Last updated Friday, April 03, 2009