Conversational AI avatar – Proof of Concept

As a proof-of-concept project, the Network Institute Tech Labs has created a modular conversational avatar using state-of-the-art speech and AI technology. It is meant to serve as a demonstrator for researchers who are interested in using:

  • Conversational agents
  • Realistic humanoid avatars
  • Realistic emotional human speech
  • Flexible speech recognition
  • Emotional speech creation
  • Emotionally expressive agents

This demo shows that a user can talk – using normal speech – to an avatar that reacts to that speech with emotional speech and facial expressions. The result is a natural conversation with the agent.
The demo only shows that the technical prerequisites are met and that a conversation including emotional expression by the agent is feasible. Many aspects of this setup can be adapted and improved for specific needs.

Please play the video below to get a quick first impression of this tool.

If you are interested in using such an emotional conversational agent, please do not hesitate to contact the Tech Labs! Read on for a more detailed and technical description of this proof-of-concept project.



Technical explanation

In this demo the Emotional Conversational Agent (ECA) uses the following software and services:

  • Unity 2022 with HDRP (High Definition Render Pipeline)
  • The avatar was created in Reallusion Character Creator 4 and the animations in iClone 8
  • Speech recognition is done through Microsoft Azure Cognitive Speech Services (subscription)
  • Conversation is handled by OpenAI ChatGPT 3.5 (subscription)
  • The OpenAI response is converted to audio speech through Microsoft Azure Cognitive Speech Services (subscription) using Speech Styles
  • For the facial expressions, Ekman Action Units are used to animate facial features via custom code
  • Simulated lip-sync is done using SALSA

The flow of the program is roughly as follows:

  1. Press the <Start> button to start listening to the default microphone of the system
  2. After silence is detected, the recorded audio is sent to Microsoft Azure Cognitive Speech Services (MACSS) to be converted to text.
  3. The recognized text is then sent to OpenAI as part of the existing message history with the chat bot.
  4. The chat bot is instructed to respond to each input using a preset set of emotion tags and to include a weight [0,1] for each emotion (see the first sketch after this list).
  5. The emotion tags are the MACSS Speech Style tags that are valid for several English (US) and Chinese neural voices.
  6. Once the chat bot’s response is received, the emotion tags are extracted from it. The code also stores which pieces of the response text belong to which emotion tags.
  7. The response text (minus the emotion tags) is sent to MACSS to produce audio speech. The emotion tag is used to apply a Speech Style, such as <friendly> or <sad>, to the audio, together with the weight (or Speech Style Degree) belonging to that tag (see the SSML sketch after this list).
  8. Once the audio is received, it is converted to a standard WAV file and audio playback is started.
  9. During audio playback, the MACSS emotion tags are converted into Ekman emotions and subsequently into Action Units (AUs) that are animated frame by frame using the avatar’s facial blendshapes. The weights of the emotion tags are used to set the weight of each blendshape belonging to the AUs (see the mapping sketch after this list).
  10. The loudness of the audio is used to dynamically adjust the blendshape weights, making the facial expression animate “along” with the force of the speech.
  11. The SALSA plugin is used to generate live pseudo-lip-sync.
  12. After the speech audio has played, the system is reset to process the next spoken text from the user.
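
To make steps 1–4 concrete, the sketch below shows how the same pipeline could be driven from Python with the Azure Speech SDK and the (pre-1.0) openai package. The demo itself runs inside Unity (C#), and the prompt wording, the <style degree> tag format, the keys and the region shown here are illustrative placeholders rather than the demo’s exact implementation.

    import azure.cognitiveservices.speech as speechsdk
    import openai

    openai.api_key = "OPENAI_KEY"  # placeholder

    # Azure Speech-To-Text: listen on the default microphone until silence is detected.
    speech_config = speechsdk.SpeechConfig(subscription="AZURE_KEY", region="westeurope")
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    result = recognizer.recognize_once()          # returns once the user stops talking
    user_text = result.text

    # Instruct the chat bot to prefix its answer with an emotion tag and a weight.
    messages = [
        {"role": "system",
         "content": "You are a bored teenager who loves gaming. "
                    "Start every reply with one tag of the form <style degree>, where style is "
                    "one of: angry, cheerful, excited, friendly, hopeful, sad, shouting, "
                    "terrified, unfriendly, whispering, and degree is a number between 0 and 1."},
        {"role": "user", "content": user_text},
    ]
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    reply = response.choices[0].message.content   # e.g. "<sad 0.8> Ugh, whatever, my save file is gone."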
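
Steps 7 and 8 rely on Azure’s SSML <mstts:express-as> element, which carries both the Speech Style and the Speech Style Degree. A minimal Python sketch of that step, again with placeholder credentials and an example style-capable neural voice:

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="AZURE_KEY", region="westeurope")
    # audio_config=None keeps the audio in memory instead of playing it on the local speakers.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

    # Speech Style "sad" with Speech Style Degree 0.8 on an English (US) neural voice.
    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-US-JennyNeural">
        <mstts:express-as style="sad" styledegree="0.8">
          Ugh, whatever, my save file is gone.
        </mstts:express-as>
      </voice>
    </speak>
    """
    result = synthesizer.speak_ssml_async(ssml).get()

    # result.audio_data holds the synthesized speech (WAV by default), ready for playback.
    with open("reply.wav", "wb") as f:
        f.write(result.audio_data)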
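
Steps 9 and 10 boil down to two lookup tables and a per-frame weight calculation. The Python sketch below illustrates the idea; the mapping tables and the loudness formula are illustrative assumptions, not the exact ones in the demo’s custom Unity code.

    # Illustrative mapping from MACSS Speech Styles to Ekman emotions.
    STYLE_TO_EKMAN = {
        "cheerful": "happiness", "friendly": "happiness", "excited": "happiness",
        "hopeful": "happiness", "sad": "sadness", "angry": "anger",
        "unfriendly": "anger", "shouting": "anger", "terrified": "fear",
    }

    # Action Units commonly associated with each Ekman emotion (FACS).
    EKMAN_TO_AUS = {
        "happiness": [6, 12],           # cheek raiser, lip corner puller
        "sadness":   [1, 4, 15],        # inner brow raiser, brow lowerer, lip corner depressor
        "anger":     [4, 5, 7, 23],     # brow lowerer, upper lid raiser, lid tightener, lip tightener
        "fear":      [1, 2, 4, 5, 20],  # brow raisers, brow lowerer, upper lid raiser, lip stretcher
    }

    def frame_weights(style, style_degree, loudness):
        """Blendshape weights for the AUs of the current emotion, for one animation frame.

        style_degree is the [0,1] weight returned by the chat bot;
        loudness is the current speech amplitude, normalised to [0,1].
        """
        emotion = STYLE_TO_EKMAN.get(style, "neutral")
        base = style_degree * (0.5 + 0.5 * loudness)   # louder speech -> stronger expression
        return {au: base for au in EKMAN_TO_AUS.get(emotion, [])}

    # Example: a "sad" reply with weight 0.8 during a quiet moment in the audio.
    print(frame_weights("sad", 0.8, 0.2))   # {1: 0.48, 4: 0.48, 15: 0.48}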

Limitations:

  • MACSS has only English and Chinese neural voices (male and female) that can produce a large enough set of Speech Styles (emotional voices) to be useful for our purposes. MACSS supports many other languages, but most offer only a few Speech Styles or none at all, resulting in flat, emotionless speech.
  • The Speech Styles currently available are: angry, cheerful, excited, friendly, hopeful, sad, shouting, terrified, unfriendly, whispering
  • We have an alternative version for showing facial expressions. This module does not use Ekman Action Units, but predefined emotional animations. It chooses from a set of applicable emotional expressions (e.g. Happy) and changes and modulates the animation for as long as the speech runs. This module also uses the speech loudness to vary the facial expression weights.
  • Because of the steps involved, there is a slight delay between talking to the agent and getting feedback. Most of this results from the ‘silence’ detection in the Speech-To-Text module. As you can see in the video, the delay between finishing talking and the start of the agent’s response is somewhere between 1 and 2 seconds.
  • At the moment, because of licensing, it is only possible to run this system on a computer in our labs in the NU building.

Improvements and changes that can be made

  • The agent can have any look. There are some limitations in hair styles and clothing, but otherwise avatars can look any way you want.
  • The virtual environment the agent is in can be anything you want.
  • The agent can also have body animations. These will come from a set of predefined animations and can be selected based on any number of criteria.
  • The facial expressions now follow the loudness of the agent’s speech. This can be changed or further improved to allow for a different, smoother, or more realistic flow of expressions.
  • The lip-sync is difficult to do live. Many adjustments can be made to the SALSA plug-in that is currently used. Other solutions are possible too, but live lip-sync will always suffer in quality when compared to pre-made lip-sync.
  • It is possible to talk in one language (such as Dutch) and have MACSS translate the spoken audio to English text that will be used by the AI and the Text-To-Speech engine (see the sketch after this list).
  • The agent can be ‘programmed’ to have any kind of characteristics. In this demo it is a ‘bored teenager who loves gaming’.
  • The level of facial expression can be adjusted.
  • The mapping from MACSS Speech Styles to Ekman emotions can be adjusted.
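
As a sketch of the translation option mentioned above, MACSS can recognise speech in one language and return the text in another in a single step. The snippet below uses the Azure Speech SDK’s translation API from Python, with placeholder credentials and Dutch-to-English as an example language pair:

    import azure.cognitiveservices.speech as speechsdk

    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription="AZURE_KEY", region="westeurope")
    translation_config.speech_recognition_language = "nl-NL"   # the language the user speaks
    translation_config.add_target_language("en")               # the language the AI works with

    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config)
    result = recognizer.recognize_once()       # listens on the default microphone until silence

    english_text = result.translations["en"]   # English text to pass on to the chat bot and TTS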