Show and Tell Demonstration

Title: Motion Trajectory Driven Talking Head

Date and Location:

Thursday, April 23, 15:30 - 18:00, Location: Show and Tell Area A

Presented by

Lijuan Wang, Ning Xu, Xiaojun Qian, Yao Qian, Frank Soong

Description

In addition to speech, visual information (e.g., facial expressions, head motions, and gestures) is an important part of human communication. It conveys, explicitly or implicitly, the intentions, emotional states, and other paralinguistic information encoded in the speech chain. In this demonstration, we present a multilingual, real-time text-to-audiovisual talking head that automatically generates both audio and visual streams for given text. The audio stream is synthesized by our multilingual HMM-based TTS engine. The visual stream is rendered by simultaneously combining multiple animation channels, which control a cartoon figure parameterized in a 2D/3D model.
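
As a rough illustration of how several animation channels might be combined, the sketch below blends per-channel control-parameter trajectories into a single stream for the cartoon figure. The channel names, the shared frame rate, and the weighted sum are assumptions made for illustration, not the demo's actual rendering interface.

    import numpy as np

    # Hypothetical animation channels controlling the 2D/3D cartoon figure.
    CHANNELS = ("lip_sync", "head_motion", "expression", "eye_blink")

    def blend_channels(frames, weights=None):
        """Combine per-channel trajectories into one control stream.

        frames maps a channel name to a (T, D) array of control parameters;
        all channels are assumed to share the frame rate of the synthesized
        audio, so the audio and visual streams stay synchronized.
        """
        weights = weights or {}
        out = None
        for name in CHANNELS:
            if name not in frames:
                continue
            contrib = weights.get(name, 1.0) * np.asarray(frames[name], float)
            out = contrib if out is None else out + contrib
        return out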

The innovation of this demo lies in its new HMM-based motion trajectory generation method. The idea is analogous to HMM-based speech synthesis, which forms utterances by predicting the most likely speech parameters from statistically trained HMMs. Given any text input, the proposed system statistically generates (synthesizes) the most likely motion trajectories of both the head and critical markers on the face. The synthesized motion trajectories are then transformed into control parameters that drive a lively 2D/3D cartoon head.
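
As noted, the trajectory generation step mirrors HMM-based speech parameter generation. The sketch below shows the standard maximum-likelihood parameter generation (MLPG) computation for a single feature dimension with static and delta statistics; the simplified delta window, the boundary handling, and the dense solver are illustrative assumptions, not necessarily the exact formulation used in the demo.

    import numpy as np

    def mlpg(means, variances):
        """Maximum-likelihood parameter generation for one feature dimension.

        means, variances: (T, 2) arrays of HMM-predicted static and delta
        means/variances per frame (a simplification of the usual
        static + delta + delta-delta case).
        Returns the smooth static trajectory c of length T that maximizes
        the HMM likelihood by solving W^T S^-1 W c = W^T S^-1 mu.
        """
        T = means.shape[0]
        W = np.zeros((2 * T, T))              # maps statics to [static; delta]
        for t in range(T):
            W[2 * t, t] = 1.0                 # static coefficient
            if t > 0:
                W[2 * t + 1, t - 1] = -0.5    # delta(t) = 0.5*(c[t+1] - c[t-1])
            if t < T - 1:
                W[2 * t + 1, t + 1] = 0.5
        mu = means.reshape(-1)
        prec = 1.0 / variances.reshape(-1)    # diagonal inverse covariance
        WtP = W.T * prec                      # W^T Sigma^-1
        return np.linalg.solve(WtP @ W, WtP @ mu)

In practice the normal-equation matrix W^T S^-1 W is band-diagonal, so a banded solver is typically used instead of the dense solve shown here.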

The speech animation synthesis system is based on a data-driven, statistically trained HMM model, which is built in four steps: 1. data collection; 2. model training; 3. motion trajectory generation; and 4. 2D/3D talking head rendering. In data collection, a motion capture system records abundant facial marker motion trajectories together with simultaneous audio (speech) and video. The recordings cover rich phonetic (speech) contexts, different speaking styles, lively emotions, and natural facial expressions. In model training, HMMs are trained to model the captured motion trajectories statistically in the maximum likelihood sense. In motion trajectory generation, the trained HMMs are used to generate (predict) the most likely motion trajectories given the acoustic and prosodic features of the speech. Finally, the cartoon talking head is rendered by transforming the marker motion trajectories into head and facial control parameters to synthesize a lively animation sequence.
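
Step 4 maps marker trajectories onto head and facial control parameters. One common decomposition, sketched below as an assumption rather than the authors' exact method, is to fit a rigid head pose to each frame of markers (orthogonal Procrustes) and treat the residual marker displacements as the facial deformation controls.

    import numpy as np

    def head_and_face_params(markers, neutral):
        """Split one frame of 3D markers into head pose and facial deformation.

        markers, neutral: (N, 3) arrays of tracked and neutral-pose markers.
        Returns (R, t, deformation): a rigid head rotation and translation,
        plus the residual per-marker displacement driving the facial channels.
        """
        mu_m, mu_n = markers.mean(axis=0), neutral.mean(axis=0)
        A, B = markers - mu_m, neutral - mu_n
        # Orthogonal Procrustes: best rotation mapping the neutral pose
        # onto this frame's markers.
        U, _, Vt = np.linalg.svd(A.T @ B)
        R = U @ Vt
        if np.linalg.det(R) < 0:              # guard against reflections
            U[:, -1] *= -1
            R = U @ Vt
        t = mu_m - R @ mu_n
        # Whatever rigid motion cannot explain is facial deformation.
        deformation = markers - (neutral @ R.T + t)
        return R, t, deformation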

Subjective evaluation shows that the proposed system can produce highly intelligible and natural speech-synchronized animations. Users can interactively test our online system by inputting any text in English or Chinese.

