MetaFace Conceptual Diagram
An Explanation of the Picture
User interaction in this application starts at a text box (below the animated character) in which the user can enter natural language.
The input can be a question, a response to a question posed by the computer, or even a command.
This stimulus is then sent to the server, where the artificial intelligence subsystem interprets it using word graphs and produces appropriate output in the VHML format.
The VHML document is parsed on the server to extract animation information for expressions, gestures, and facial manipulation.
At this stage, any embedded data (HTML files in this case) can be extracted, ready to be sent to the client.
Plain text from the VHML document is then extracted and translated to the particular speech markup format used by the Text to Speech (TTS) engine.
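The parsing step described above can be sketched as follows. This is a minimal illustration that assumes the VHML document can be handled as ordinary XML. The tag names (`happy`, `sad`, `smile`, `nod`, `embed`), the `src` attribute, and the function name `split_vhml` are placeholders chosen for the example and have not been checked against the VHML specification or the MetaFace implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag sets; real VHML element names may differ.
ANIMATION_TAGS = {"happy", "sad", "smile", "nod"}
EMBED_TAGS = {"embed"}


def split_vhml(document: str):
    """Split a VHML-like XML string into animation cues, embedded data and plain text."""
    root = ET.fromstring(document)
    cues = []      # (tag, text covered by the tag)
    embedded = []  # references to embedded data, e.g. HTML files
    for element in root.iter():
        if element.tag in ANIMATION_TAGS:
            cues.append((element.tag, "".join(element.itertext()).strip()))
        elif element.tag in EMBED_TAGS:
            embedded.append(element.get("src"))
    # Plain text for the TTS stage: all character data, whitespace-normalised.
    plain_text = " ".join("".join(root.itertext()).split())
    return cues, embedded, plain_text


SAMPLE = (
    "<vhml><happy>Hello there.</happy> "
    "<embed src='info.html'/> Here is the page you asked for.</vhml>"
)
cues, embedded, plain_text = split_vhml(SAMPLE)
```

The three return values correspond to the three things the server extracts at this stage: animation information, embedded data for the client, and plain text for the TTS engine.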
The text is synthesised into phoneme information (phonemes, durations, pitch, etc.) using a Natural Language Processor (NLP), which is Festival in this application.
The phoneme information can then be used to construct MPEG-4 visemes (the visual equivalent of phonemes).
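As a rough illustration of how visemes might be derived from phoneme information, the sketch below maps timed phonemes onto a viseme timeline. The table is deliberately partial (consonants only), its viseme numbers are an assumption that should be checked against the MPEG-4 viseme table, and the function name `phonemes_to_visemes` is hypothetical.

```python
# Partial, illustrative phoneme-to-viseme table. The viseme numbers are an
# assumption to be checked against the MPEG-4 viseme table; 0 means "no viseme".
PHONEME_TO_VISEME = {
    "p": 1, "b": 1, "m": 1,
    "f": 2, "v": 2,
    "t": 4, "d": 4,
    "k": 5, "g": 5,
    "s": 7, "z": 7,
    "n": 8, "l": 8,
    "_": 0,  # silence
}


def phonemes_to_visemes(phonemes):
    """Map (phoneme, duration_ms) pairs to (viseme, start_ms, duration_ms) triples."""
    timeline = []
    start = 0
    for phoneme, duration in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, 0)  # unknown phonemes fall back to 0
        # Merge consecutive identical visemes into one longer segment.
        if timeline and timeline[-1][0] == viseme:
            prev_viseme, prev_start, prev_duration = timeline[-1]
            timeline[-1] = (prev_viseme, prev_start, prev_duration + duration)
        else:
            timeline.append((viseme, start, duration))
        start += duration
    return timeline
```

Several phonemes share one mouth shape, which is why the table is many-to-one and why adjacent phonemes such as "m" and "b" collapse into a single viseme segment.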
The personality module then constructs MPEG-4 Facial Animation Parameters (FAPs) from the gesture tags, emotional tags, facial animation tags, and visemes.
The phoneme information, FAPs, and embedded data/metadata can now be sent to the client.
Phoneme information is sent to the client instead of a compressed waveform because it is less bandwidth intensive.
The client machine’s TTS Digital Signal Processor (DSP) (Mbrola in this application) synthesises the phoneme information to produce a speech waveform.
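To make the bandwidth argument concrete, the sketch below serialises phoneme information in the plain-text layout that MBROLA reads (`.pho`: one phoneme per line with a duration in milliseconds and optional position/pitch pairs) and compares its size with compressed audio of the same duration. The phoneme symbols, timings, and the 16 kbit/s audio bitrate are assumptions made for the example, not figures from MetaFace.

```python
def to_pho(phonemes):
    """Serialise phoneme information as MBROLA .pho text.

    Each item is (phoneme, duration_ms, pitch_points), where pitch_points is a
    list of (position_percent, pitch_hz) pairs and may be empty.
    """
    lines = []
    for phoneme, duration_ms, pitch_points in phonemes:
        fields = [phoneme, str(duration_ms)]
        for position, pitch in pitch_points:
            fields += [str(position), str(pitch)]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


# "hello" as an illustrative phoneme sequence (symbols and timings are made up).
HELLO = [
    ("_", 50, []),
    ("h", 70, []),
    ("@", 60, [(50, 120)]),
    ("l", 65, []),
    ("@U", 180, [(20, 125), (80, 100)]),
    ("_", 50, []),
]

pho_text = to_pho(HELLO)
pho_bytes = len(pho_text.encode("ascii"))
total_ms = sum(duration for _, duration, _ in HELLO)

# Assumed audio bitrate for comparison: a placeholder, not a MetaFace figure.
AUDIO_BITS_PER_SECOND = 16_000
audio_bytes = AUDIO_BITS_PER_SECOND // 8 * total_ms // 1000
```

Comparing `pho_bytes` with `audio_bytes` gives a feel for the saving; the exact ratio depends on the audio codec and bitrate chosen for the comparison.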
The embedded data is displayed in the web browser, and the FAPs are synchronised with the speech during playback, producing character animation and multimedia browser content in response to the user’s stimulus.
The user’s anthropomorphic experience thus originates from a VHML document.
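One simple way to picture the synchronisation of animation with speech is to sample a timed viseme sequence at the animation frame rate, so that each frame shows the mouth shape for the sound being spoken at that moment. The sketch below is only an illustration under stated assumptions: the 25 frames per second default, the contiguous `(viseme, start_ms, duration_ms)` segment layout, and the function name `viseme_per_frame` are all hypothetical, and a real player would more likely drive FAP interpolation from the audio clock.

```python
def viseme_per_frame(timeline, frames_per_second=25):
    """Sample a viseme timeline at a fixed frame rate.

    timeline holds contiguous (viseme, start_ms, duration_ms) segments in
    playback order. Returns one viseme number per animation frame.
    """
    if not timeline:
        return []
    end_ms = timeline[-1][1] + timeline[-1][2]
    frame_ms = 1000 / frames_per_second
    frames = []
    segment = 0
    frame_index = 0
    while frame_index * frame_ms < end_ms:
        t = frame_index * frame_ms
        # Advance to the segment that contains time t.
        while t >= timeline[segment][1] + timeline[segment][2]:
            segment += 1
        frames.append(timeline[segment][0])
        frame_index += 1
    return frames
```

Because both the speech waveform and the frame list are derived from the same phoneme durations, they stay aligned as long as playback starts at the same instant.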