Thursday, April 23, 15:30 - 18:00, Location: Show and Tell Area A
Alissa M. Harrison, Wai-Kit Lo, Helen Meng
CHELSEA is a prototype of a computer-assisted pronunciation training (CAPT) system for Chinese-speaking learners of English. Unlike previous pronunciation scoring systems, which can only detect mispronunciations, CHELSEA can also effectively diagnose mispronunciations at the phone level. The design of CHELSEA is grounded in the theory of language transfer, i.e., the idea that phonological knowledge of the learner's first language (L1) can carry over to the second language (L2). Our approach involves a comparative phonological analysis of Chinese (Mandarin/Cantonese) and English to predict possible phonetic confusions. These confusions are formalized as a set of context-sensitive rules. For example, /n/ at the beginning of a word is usually pronounced as /l/. The corresponding context-sensitive rule is "/n/ -> /l/ / # _". Based on these rules, the system automatically generates a list of common mispronunciations, which is used to extend the canonical entries in the pronunciation lexicon. This extended pronunciation lexicon is incorporated in an HMM-based speech recognizer, which is tasked with constrained phone recognition. More specifically, given a known sequence of words, it outputs the optimal phone sequence available from the extended lexicon. CHELSEA translates the recognition results into comprehensible feedback for the learner by aligning the recognized phone sequence with the model transcription and highlighting the differences in standard phonetic transcription (IPA). Additionally, the system uses the time boundaries of the recognition output to let the learner play back and compare individual words within an utterance. Experimentation comparing the system's output with annotations by an expert human listener shows that only 14.9% of phones are falsely rejected as mispronounced by the system (false rejection rate, FRR), while 43.6% of mispronunciations are falsely accepted (false acceptance rate, FAR).
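To illustrate how a context-sensitive rule such as "/n/ -> /l/ / # _" could extend a canonical lexicon entry, the following is a minimal sketch, not the CHELSEA implementation. The rule tuple format, the function names `rule_applies` and `expand_pronunciations`, and the ARPAbet-like phone labels are all assumptions made for this example.

```python
from itertools import product

# Hypothetical rule format: (target phone, substitute phone, context).
# The context "word_initial" stands for the environment "# _".
RULES = [("n", "l", "word_initial")]


def rule_applies(phones, i, rule):
    """Return True if the rule's target and context match position i."""
    target, _, context = rule
    if phones[i] != target:
        return False
    if context == "word_initial":
        return i == 0
    return True  # any other context is treated as context-free in this sketch


def expand_pronunciations(canonical, rules=RULES):
    """List the canonical pronunciation plus every predicted variant."""
    options = []
    for i, phone in enumerate(canonical):
        alternatives = [phone]  # the canonical phone always comes first
        for rule in rules:
            if rule_applies(canonical, i, rule) and rule[1] not in alternatives:
                alternatives.append(rule[1])
        options.append(alternatives)
    # The Cartesian product over positions yields all combinations.
    return [list(variant) for variant in product(*options)]


# Illustrative entry for the word "nine".
variants = expand_pronunciations(["n", "ay", "n"])
```

In this sketch only the word-initial /n/ receives an alternative, so the entry gains one extra variant beginning with /l/, while the word-final /n/ is left unchanged. A constrained recognizer would then choose among such variants for each known word.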
In the inevitable tradeoff between FRR and FAR, we believe that minimizing FRR is more important for an effective learning tool. The system may be more forgiving and encouraging in terms of FAR, but over time it strives to help with the learner's salient mispronunciations. For mispronunciation diagnosis, the system correctly identifies the phone produced by the learner in 51.0% of the learner's mispronunciations.
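As a rough sketch of how FRR and FAR might be computed from phone-level comparisons, the function below takes one triple per phone: the canonical phone, the phone heard by the human annotator, and the phone recognized by the system. The data layout, the name `error_rates`, and the choice of denominators (FRR over phones the annotator judged correct, FAR over phones the annotator judged mispronounced) are assumptions for illustration; the toy data is invented and unrelated to the reported results.

```python
def error_rates(triples):
    """Compute (FRR, FAR) from (canonical, annotated, recognized) triples."""
    # Split phones by the human annotator's judgment.
    correct = [t for t in triples if t[1] == t[0]]
    mispronounced = [t for t in triples if t[1] != t[0]]

    # False rejection: annotator heard the canonical phone, system did not.
    false_rejects = sum(1 for canon, _, recog in correct if recog != canon)
    # False acceptance: annotator heard an error, system output the canonical phone.
    false_accepts = sum(1 for canon, _, recog in mispronounced if recog == canon)

    frr = false_rejects / len(correct) if correct else 0.0
    far = false_accepts / len(mispronounced) if mispronounced else 0.0
    return frr, far


# Invented toy data: (canonical, annotated, recognized).
toy = [
    ("n", "n", "n"),     # correct, accepted
    ("n", "n", "l"),     # correct, falsely rejected
    ("n", "l", "l"),     # mispronounced, detected
    ("n", "l", "n"),     # mispronounced, falsely accepted
    ("ay", "ay", "ay"),  # correct, accepted
    ("ay", "ay", "ay"),  # correct, accepted
]
frr, far = error_rates(toy)
```

Under these assumed definitions, lowering FRR means fewer correctly pronounced phones are flagged, which matches the stated preference for a forgiving learning tool.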