ADVISER: A Toolkit for Developing Multi-modal, Multi-domain and Socially-engaged Conversational Agents

by   Chia Yu Li, et al.
University of Stuttgart

We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents. The final Python-based implementation of our toolkit is flexible, easy to use, and easy to extend not only for technically experienced users, such as machine learning researchers, but also for less technically experienced users, such as linguists or cognitive scientists, thereby providing a flexible platform for collaborative research. Link to open-source code:



There are no comments yet.


page 2

page 5


EmoTxt: A Toolkit for Emotion Recognition from Text

We present EmoTxt, a toolkit for emotion recognition from text, trained ...

LEGOEval: An Open-Source Toolkit for Dialogue System Evaluation via Crowdsourcing

We present LEGOEval, an open-source toolkit that enables researchers to ...

CRSLab: An Open-Source Toolkit for Building Conversational Recommender System

In recent years, conversational recommender system (CRS) has received mu...

deepSELF: An Open Source Deep Self End-to-End Learning Framework

We introduce an open-source toolkit, i.e., the deep Self End-to-end Lear...

MOGPTK: The Multi-Output Gaussian Process Toolkit

We present MOGPTK, a Python package for multi-channel data modelling usi...

DRIFT: A Toolkit for Diachronic Analysis of Scientific Literature

In this work, we present to the NLP community, and to the wider research...

PyIT2FLS: A New Python Toolkit for Interval Type 2 Fuzzy Logic Systems

Fuzzy logic is an accepted and well-developed approach for constructing ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Dialog systems or chatbots, both text-based and multi-modal, have received much attention in recent years, with an increasing number of dialog systems in both industrial contexts such as Amazon Alexa, Apple Siri, Microsoft Cortana, Google Duplex, XiaoIce zhou2018design and Furhat222, as well as academia such as MuMMER Foster2016 and Alana Curry2018. However, open-source toolkits and frameworks for developing such systems are rare, especially for developing multi-modal systems comprised of speech, text, and vision. Most of the existing toolkits are designed for developing dialog systems focused only on core dialog components, with or without the option to access external speech processing services (Bohus2009; Baumann2012; Lison2016; Ultes2017; ortega2019adviser; Lee2019).
To the best of our knowledge, there are only two toolkits, proposed in Foster2016 and Bohus2017, that support developing dialog agents using multi-modal processing and social signals SSI. Both provide a decent platform for building systems, however, to the best of our knowledge, the former is not open-source, and the latter is based on the .NET platform, which could be less convenient for non-technical users such as linguists and cognitive scientists, who play an important role in dialog research.
In this paper, we introduce a new version of ADVISER - previously a text-based, multi-domain dialog system toolkit ortega2019adviser - that supports multi-modal dialogs, including speech, text and vision information processing. This provides a new option for building dialog systems that is open-source and Python-based for easy use and fast prototyping. The toolkit is designed in such a way that it is modular, flexible, transparent, and user-friendly for both technically experienced and less technically experienced users.
Furthermore, we add novel features to ADVISER, allowing it to process social signals and to incorporate them into the dialog flow. We believe that these features will be key to developing human-like dialog systems because it is well-known that social signals, such as emotional states and engagement levels, play an important role in human computer interaction mctear2016conversational. However in contrast to open-ended dialog systems ElizaWeizenbaum66, our toolkit focuses on task-oriented applications GUS, such as searching for a lecturer at the university ortega2019adviser. The purpose we envision for dialog systems developed using our toolkit is not the same as the objective of a social chatbot such as XiaoIce zhou2018design. Rather than promoting “an AI companion with an emotional connection to satisfy the human need for communication, affection, and social belonging” zhou2018design, ADVISER helps develop dialog systems that support users in efficiently fulfilling concrete goals, while at the same time considering social signals such as emotional states and engagement levels so as to remain friendly and likeable.

2 Objectives

The main objective of this work is to develop a multi-domain dialog system toolkit that allows for multi-modal information processing and that provides different modules for extracting social signals such as emotional states and for integrating them into the decision making process. The toolkit should be easy to use and extend for users of all levels of technical experience, providing a flexible collaborative research platform.

2.1 Toolkit Design

We extend and substantially modify our previous, text-based dialog system toolkit ortega2019adviser while following the same design choices. This means that our toolkit is meant to optimize the following four criteria: Modularity, Flexibility, Transparency and User-friendliness at different levels. This is accomplished by decomposing the dialog system into independent modules (services), which in turn are either rule-based, machine learning-based or both. These services can easily be combined in different orders/architectures, providing users with flexible options to design new dialog architectures.

2.2 Challenges & Proposed Solutions


The main challenges in handling multi-modality are a) the design of a synchronization infrastructure and b) the large range of different latencies from different modalities. To alleviate the former, we use the publisher/subscriber software pattern presented in section 4 to synchronize signals coming from different sources. This software pattern also allows for services to run in a distributed manner. By assigning computationally heavy tasks such as speech recognition and speech synthesis to a more powerful computing node, it is possible to reduce differences in latency when processing different modalities, therefore achieving more natural interactions.

Socially-Engaged Systems

Determining the ideal scope of a socially-engaged dialog system is a complex issue, that is which information should be extracted from users and how the system can best react to these signals. Here we focus on two major social signals: emotional states and engagement levels (see section 3.1), and maintain an internal user state to track them over the course of a dialog. Note that the toolkit is designed in such a way that any social signal could be extracted and leveraged in the dialog manager. In order to react to social signals extracted from the user, we provide an initial affective policy module (see section 3.5) and an initial affective NLG module (see section 3.7), which could be easily extended to more sophisticated behavior. Furthermore, we provide a backchanneling module that enables the dialog system to give feedback to users during conversations. Utilizing these features could lead to increased trust and enhance the impression of an empathetic system.

3 Functionalities

3.1 Social Signal Processing

We present the three modules of ADVISER for processing social signals: (a) emotion recognition, (b) engagement level prediction, and (c) backchanneling. Figure 1 illustrates an example of our system tracking emotion states and engagement levels.

Figure 1: Tracking emotion states and engagement levels using multi-modal information.

Multi-modal Emotion Recognition

For recognizing a user’s emotional state, all three available modalities – text, audio, and vision – can potentially be exploited, as they can deliver complementary information zeng2009survey. Therefore, the emotion recognition module can subscribe to the particular input streams of interest (see section 4 for details) and apply emotion prediction either in a time-continuous fashion or discretely per turn.
In our example implementation in the toolkit, we integrate speech emotion recognition, i.e. using the acoustic signal as features. Based on the work presented in neumann2017attentive

we use log Mel filterbank coefficients as input to convolutional neural networks (CNNs). For the sake of modularity, three separate models are employed for predicting different types of labels: (a) basic emotions {angry, happy, neutral, sad}, (b) arousal levels {low, medium, high}, and (c) valence levels {negative, neutral, positive}. The models are trained on the IEMOCAP dataset 

busso2008iemocap. The output of the emotion recognition module consists of three predictions per user turn, which can then be used by the user state tracker (see section 3.4). For future releases, we plan to incorporate multiple training datasets as well as visual features.

Engagement Level Prediction

User engagement is closely related to states such as boredom and level of interest, with implications for user satisfaction and task success (Forbes-Riley2012; Schuller2009). In ADVISER, we assume that eye activity serves as an indicator of various mental states (Schuller2009; Niu2018) and implement a gaze tracker that monitors the user’s direction of focus via webcam.
Using OpenFace 2.2.0, a toolkit for facial behavior analysis Baltrusaitis2018, we extract the features gaze_angle_x and gaze_angle_y, which capture left-right and up-down eye movement, for each frame and compute the deviation from the central point of the screen. If the deviation exceeds a certain threshold for a certain number of seconds, the user is assumed to look away from the screen, thereby disengaging. Thus, the output of our engagement level prediction module is the binary decision {looking, not looking}. Both the spatial and temporal sensitivity can be adjusted, such that developers have the option to decide how far and how long the user’s gaze can stray from the central point until they are considered to be disengaged. In an adaptive system, this information could be used to select re-engagement strategies, e.g. using an affective template (see section 3.7).


In a conversation, a backchannel (BC) is a soft interjection from the listener to the speaker, with the purpose of signaling acknowledgment or reacting to what was just uttered. Backchannels contribute to a successful conversation flow clark2004. Therefore, we add an acoustic backchannel module to create a more human-like dialog experience. For backchannel prediction, we extract 13 Mel-frequency-cepstral coefficients from the user’s speech signal, which form the input to the convolutional neural network based on OrtegaBC2020. The model assigns one of three categories from the proactive backchanneling theory goodwin1986 to each user utterance {no-backchannel, backchannel-continuer and backchannel-assessment}. The predicted category is used to add the backchannel realization, such as Right or Uh-huh, to the next system response.

3.2 Speech Processing

Automatic Speech Recognition (ASR)

The speech recognition module receives a speech signal as input, which can come from an internal or external microphone, and outputs decoded text. The specific realization of ASR can be interchanged or adapted, for example for new languages or different ASR methods. We provide an end-to-end ASR model for English based on the Transformer neural network architecture. We use the end-to-end speech processing toolkit ESPnet watanabe2018espnet and the IMS-speech English multi-dataset recipe denisov2019ims, updated to match the LibriSpeech Transformer-based system in ESPnet karita2019comparative

and to include more training data. Training data comprises the LibriSpeech, Switchboard, TED-LIUM 3, AMI, WSJ, Common Voice 3, SWC, VoxForge and M-AILABS datasets with a total amount of 3249 hours. As input features, 80-dimensional log Mel filterbank coefficients are used. Output of the ASR model is a sequence of subword units, which include single characters as well as combinations of several characters, making the model lexicon independent.

Speech Synthesis

For ADVISER’s voice output, we use the ESPnet-TTS toolkit hayashi2019espnettts, which is an extension of the ESPnet toolkit mentioned above. We use FastSpeech as the synthesis model speeding up mel-spectrogram generation by a factor of 270 and voice generation by a factor of 38 compared to autoregressive Transformer TTS ren2019fastspeech. We use a Parallel WaveGAN yamamoto2020parallel to generate waveforms that is computationally efficient and achieves a high mean opinion score of 4.16. The FastSpeech and WaveGAN models were trained with 24 hours of the LJSpeech dataset from a single speaker  ljspeech and are capable of generating voice output in real-time when using a GPU. The synthesis can run on any device in a distributed system. Additionally, we optimize the synthesizer for abbreviations, such as Prof., Univ., IMS, NLP, ECTS and PhD, as well as for German proper names, such as street names. These optimizations can be easily extended.

Turn Taking

To make interacting with the system more natural, we use a naive end-of-utterance detection. Users indicate the start of their turn by pressing a hotkey, so they can choose to pause the interaction. The highest absolute peak of each recording chunk is then compared with a predefined threshold. If a certain number of sequential chunks do not peak above the threshold, the recording stops. We are currenlty in the process of planning more sophisticated turn taking models, such as turn-taking.

3.3 Natural Language Understanding

The natural language understanding (NLU) unit parses the textual user input de2008spoken - or the output from the speech recognition system - and extracts the user action type, generally referred to as intent in goal-oriented dialog systems (e.g. Inform and Request), as well as the corresponding slots and values. The domain-independent, rule-based NLU presented in ortega2019adviser is integrated into ADVISER and adapted to the new domains presented in section 5.

3.4 State Tracking

Belief State Tracking (BST):

The BST tracks the history of user informs and the user action types, requests, with one BST entry per turn. This information is stored in a dictionary structure that is built up, as the user provides more details and the system has a better understanding of user intent.

User State Tracking (UST):

Similar to the BST, the UST tracks the history of the user’s state over the course of a dialog, with one entry per turn. In the current implementation, the user state consists of the user’s engagement level, valence, arousal, and emotion category (details in section 3.1).

3.5 Dialog Policies


To determine the correct system action, we provide three types of policy services: a handcrafted and a reinforcement learning policy for finding entities from a database


, as well as a handcrafted policy for looking up information through an API call. Both handcrafted policies use a series of rules to help the user find a single entity or, once an entity has been found (or directly provided by the user), find information about that entity. The reinforcement learning (RL) policy’s action-value function is approximated by a neural network which outputs a value for each possible system action, given the vectorized representation of a turn’s belief state as input. The neural network is constructed as proposed in

vath2019combine following a duelling architecture wang2015dueling. It consists of two separate calculation streams, each with its own layers, where the final layer yields the action-value function. For off-policy batch-training, we make use of prioritized experience replay schaul2015prioritized.

Affective Policy

In addition, we have also implemented a rule-based affective policy service that can be used to determine the system’s emotional response. As this policy is domain-agnostic, predicting the next system emotion output rather than the next system action, it can be used alongside any of the previously mentioned policies.

User Simulator

To support automatic evaluation and to train the RL policy, we provide a user simulator service outputting at the user acts level. As we are concerned with task-oriented dialogs here, the user simulator has an agenda-based schatzmann2007agenda architecture and is randomly assigned a goal at the beginning of the dialog. Each turn, it then works to first respond to the system utterance, and then after to fulfill its own goal. When the system utterance also works toward fulfilling the user goal, the RL policy is rewarded by achieving a shorter total dialog turn count ortega2019adviser.

3.6 External Information Resources

ADVISER supports three options to access information from external information sources. In addition to being able to query information from SQL-based databases, we add two new options that includes querying information via APIs and from knowledge bases (e.g. Wikidata (wikidata)). For example, when a user asks a simple question - Where was Dirk Nowitzki born?, our pretrained neural network predicts the topic entity - Dirk Nowitzki - and the relation - place of birth. Then, the answer is automatically looked up using Wikidata’s SPARQL endpoint.

3.7 Natural Language Generation (NLG)

In the NLG service, the semantic representation of the system act is transformed into natural language. ADVISER currently uses a template-based approach to NLG in which each possible system act is mapped to exactly one utterance. A special syntax using placeholders reduces the number of templates needed and accounts for correct morphological inflections ortega2019adviser. Additionally, we developed an affective NLG service, which allows for different templates to be used depending on the user’s emotional state. This enables a more sensitive/adaptive system. For example, if the user is sad and the system does not understand the user’s input, it might try to establish common ground to prevent their mood from getting worse due to the bad news. An example response would be “As much as I would love to help, I am a bit confused” rather than the more neutral “Sorry I am a bit confused”. One set of NLG templates can be specified for each possible emotional state. At runtime, the utterance is then generated from the template associated with the current system emotion and system action.

4 Software Architecture

Figure 2: Example ADVISER toolkit configuration: Grey represents backend components, blue represents domain-specific services, and all other colors represent domain-agnostic services. Two components are run remotely.

4.1 Dialog as a Collection of Services

To allow for maximum flexibility in combining and reusing components, we consider a dialog system as a group of services which communicate asynchronously by publishing/subscribing to certain topics. A service is called as soon as at least one message for all its subscribed topics is received and may additionally publish to one or more topics. Services can elect to receive the most recent message for a topic (e.g. up-to-date belief state) or a list of all messages for that topic since the last service call (e.g. a list of video frames). Constructing a dialog system in this way allows us to break free from a pipeline architecture. Each step in the dialog process is represented by one or more services which can operate in parallel or sequentially. For example, tasks like video and speech capture may be performed and processed in parallel before being synchronized by a user state tracking module subscribing to input from both sources. Figure 2 illustrates the system architecture. For debugging purposes, we provide a utility to draw the dialog graph, showing the information flow between services, including remote services, and any inconsistencies in publish/subscribe connections.

4.2 Support for Distributed Systems

Services are location-transparent and may thus be distributed across multiple machines. A central dialog system discovers local and remote services and provides synchronization guarantees for dialog initialization and termination. Distribution of services enables, for instance, a more powerful computer to handle tasks such as real-time text-to-speech generation (see Figure 2). This is particularly helpful when multiple resource-heavy tasks are combined into a single dialog system.

4.3 Support for Multi-Domain Systems

In addition to providing multi-modal support, the publish/subscribe framework also allows for multi-domain support by providing a structure which enables arbitrary branching and rejoining of graph structures. When a service is created, users simply specify which domain(s) it should publish/subscribe to. This, in combination with a domain tracking service, allows for seamless integration of domain-agnostic services (such as speech input/output) and domain-specific services (such as NLU/NLG for the lecturers domain).

5 Example Use Cases

5.1 Example Domains

We provide several example domains to demonstrate ADVISER’s functionalities. Databases for lecturers and courses at the Institute for Natural Language Processing (IMS), which we used in the previous version of ADVISER, were adapted to the new system architecture. As example APIs, we implemented a weather domain that makes calls to the OpenWeatherMap API333 and a mensa domain for gathering information from the dining hall at the university of Stuttgart. Note that affective templates were only added to the lecturers and mensa domain. All domains can be used within the same dialog, simply by switching the topic.

5.2 An Example Implementation

Our toolkit allows for easy creation of a dialog system within a few lines of code as follows.

# domains
weather = WeatherDomain()
mensa = MensaDomain()
# create system from desired services
ds = DialogSystem(services=[
    DomainTracker(domains=[mensa, weather]),

As a first step, a dialog system object is initialized, which is responsible for coordinating the initialization and graceful termination of all dialog services. Talking about multiple domains in one dialog is enabled by creating a simple keyword-based domain tracker which is introduced as the first argument to the dialog system. To make the dialog multi-modal, speech and vision modules are introduced next, along with modules to extract engagement and emotion. So far, all of these modules are domain-agnostic and can be used as shared resources between all domains. Next, domain-dependent services such as NLUs, BSTs and NLGs for weather and mensa, are added. The following shows an example dialog.

System Hello, please let me know how I can help you, I can discuss the following domains: Mensa Food and Weather. UserI could have something to eat. What does the mensa offer today? System What type of dish are you looking for? UserI would like a main dish. System Should the meal be vegan? UserYes. System The meal mediterranean Ebly wheat is served today, is a main dish and is vegan. UserOkay, cool, I will go there now! What is the weather like? System The weather in Stuttgart on January 28 at 3 PM is 3 degrees celsius with light snow. UserThank you, ADVISER, good bye! System Thank you, good bye.

6 Related Work

Other tools for building dialog systems include ConvLab Lee2019, an open-source, text-based dialog system platform that supports both pipelined architectures and an end-to-end neural architecture. ConvLab also provides reusable components and supports multi-domain settings. Other systems are largely text-based, but offer the incorporation of external speech components. InproTK Baumann2012, for instance, in which modules communicate by networks via configuration files, uses ASR based on Sphinx-4 and synthesis based on MaryTTS. Similarly, RavenClaw Bohus2009 provides a framework for creating dialog managers; ASR and synthesis components can be supplied, for example, by connecting to Sphinx and Kalliope. OpenDial Lison2016 relies on probabilistic rules and provides options to connect to speech components such as Sphinx. Multi-domain dialog toolkit - PyDial Ultes2017 supports connection to DialPort.

As mentioned in the introduction, Microsoft Research’s \psi is an open and extensible platform that supports the development of multi-modal AI systems Bohus2017. It further offers audio and visual processing, such as speech recognition and face tracking, as well as output, such as synthesis and avatar rendering. And the MuMMER (multi-modal Mall Entertainment Robot) project Foster2016 is based on the SoftBank Robotics Pepper platform, and thereby comprises processing of audio-, visual- and social signals, with the aim to develop a socially engaging robot that can be deployed in public spaces.

7 Conclusions

We introduce ADVISER – an open-source, multi-domain dialog system toolkit that allows users to easily develop multi-modal and socially-engaged conversational agents. We provide a large variety of functionalities, ranging from speech processing to core dialog system capabilities and social signal processing. With this toolkit, we hope to provide a flexible platform for collaborative research in multi-domain, multi-modal, socially-engaged conversational agents.