Conversational AI systems traditionally comprise multiple modules such as Automatic Speech Recognition (ASR), Language Understanding (LU), Dialogue State Tracking (DST), Dialogue Management (DM), Language Generation (LG), and Text To Speech (TTS). While a full-fledged spoken dialogue system typically incorporates all these components (depicted in Figure 1), most research has been performed in a domain-centric and somewhat isolated fashion in sub-areas such as speech synthesis and recognition, language understanding, dialogue state tracking, turn taking and dialogue management, etc. This has resulted in limited dissemination of knowledge between these fields. While this phenomenon is being mitigated by the universality of deep learning underlying most recent advances in all of these areas, conversational AI platforms that provide proper infrastructure to support model training and evaluation can be a strong catalyst for research and development in the field.
With advancements in joint and end-to-end learning [joint_dilek, joint_rastogi, joint_zhao, e2e_bingliu, e2e_weston, e2e_moss, e2e_mila], along with widely used out-of-the-box ASR and TTS solutions [gcloud, polly, ms_speech], several platforms and toolkits [pydial, ParlAI, rasa, NeMo, convlab, among others] have recently been proposed for building conversational AI systems. Each toolkit is largely designed around specific use cases, with some centered on research and others designed for scalability and use in production, as described later. For this reason, together with issues such as unacceptable performance (e.g. occasional toxic or irrelevant output [metoo_alexa, offensive_convai]) and the lack of toolkits that support both research and prototyping, there has been a disconnect between state-of-the-art research and its application in production systems. However, as conversational AI agents become more and more capable, new toolkits are needed that can bridge the gap between research and production and support quick prototyping of full conversational experiences with ASR, LU, DST/DM, LG, and TTS capabilities. Only a few toolkits, such as RASA [rasa], have recently been proposed that support both modeling and prototyping of an end-to-end experience. Most existing toolkits are centered around specific modules (e.g. LU) at the expense of other components (e.g. LG). Section 1.1 provides an overview of existing state-of-the-art toolkits.
To address the challenges mentioned above, we propose the Plato Research Dialogue System (https://github.com/uber-research/plato-research-dialogue-system), a flexible conversational AI research, experimentation, and prototyping platform. By following well-founded principles of Spoken Dialogue Systems, Plato is designed to be easy to understand for researchers and developers with different levels of expertise in conversational AI. Plato’s architecture and component-driven design supports statistical dialogue systems, multi-agent training (e.g. learning via agent-to-agent interaction [georgila2014single, papangelis-etal-2019-collaborative, liu2017iterative, liu2018adversarial]), joint learning of modules [joint_dilek, joint_rastogi, joint_zhao], and end-to-end learning of conversational agents [e2e_bingliu, e2e_weston, e2e_moss, e2e_mila], while being flexible enough for quick prototyping and demonstration system development. Plato is agnostic to statistical learning frameworks: developers and researchers can use any Python library (e.g. any Deep Learning or Reinforcement Learning library) they choose. As such, Plato seamlessly works with Ludwig [ludwig] for code-less training of Deep Learning models and quick prototyping.
Plato is designed to be modular, i.e. each conversational agent is composed of a number of modules that can run sequentially, in parallel, or in any combination of the two. Each module can be a “standard” component such as LU, DM, or LG, or anything else that fits the application’s purpose (a part-of-speech tagger, a topic classifier, a joint model, etc.). This modular design helps make Plato more extensible and scalable, and allows several developers to build and experiment with individual modules simultaneously. Internally, each component of a conversational agent can be anything from a statistical model (trained online or offline) to a set of rules (e.g. using pattern matching for LU or templates for LG). Moreover, each component can call an API or service such as Google Cloud [gcloud], Amazon Transcribe [aws_transcribe], or Polly [polly] for speech recognition, speech synthesis, or any other function. Besides building full conversational AI applications, Plato can be used to evaluate and experiment with various kinds of Natural Language Processing (NLP) tasks such as Sentiment Analysis, Topic Modeling, Dialogue State Tracking, Social Language Generation, and others.
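As an illustration of this component-driven design, the sketch below chains a rule-based LU and a template-based LG sequentially. All class and method names here are hypothetical stand-ins, not Plato’s actual API:

```python
from abc import ABC, abstractmethod

class ConversationalComponent(ABC):
    """Minimal stand-in for a Plato-style module interface (illustrative)."""
    @abstractmethod
    def process(self, message: dict) -> dict: ...

class KeywordLU(ConversationalComponent):
    """A rule-based LU: maps keywords to a crude intent."""
    def process(self, message):
        text = message["text"].lower()
        intent = "request_price" if "price" in text else "inform"
        return {**message, "intent": intent}

class TemplateLG(ConversationalComponent):
    """A template-based LG conditioned on the detected intent."""
    TEMPLATES = {"request_price": "Prices range from $10 to $30.",
                 "inform": "Noted, thank you."}
    def process(self, message):
        return {**message, "reply": self.TEMPLATES[message["intent"]]}

def run_pipeline(components, message):
    """Run components sequentially, each consuming the previous output."""
    for c in components:
        message = c.process(message)
    return message

out = run_pipeline([KeywordLU(), TemplateLG()], {"text": "What is the price?"})
```

A parallel arrangement would instead fan the same message out to several components and merge their outputs; the sequential case shown here is the most common one.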
1.1 Existing Publicly Available Conversational AI Systems
Several toolkits and conversational AI platforms have been proposed recently. The following are some state-of-the-art and widely adopted platforms:
PyDial [pydial] is a toolkit for statistical modeling of multi-domain Dialogue Systems that supports Reinforcement Learning (RL) and Deep RL models. It is modular and supports customization of modules; however, it is primarily designed for research rather than production and therefore requires deep knowledge of the conversational AI field.
ParlAI [ParlAI] is a dialogue research framework which contains popular baselines and state-of-the-art models corresponding to a variety of tasks such as SQuAD [squad], bAbI [bAbi], and visual question answering. It provides integration with Amazon Mechanical Turk [amt] for data collection and with Facebook Messenger for applications. Although ParlAI is rich in terms of models and tasks, it is not modular and its design is not centered around standard Dialogue System architectures containing modules such as Language Understanding, Policy, Dialogue State Tracking, and Language Generation; therefore, custom code may be required to build each component for a new task. Similar to ParlAI, DeepPavlov [DeepPavlov] provides models (and corresponding code) for various tasks. However, the agent class provided in both frameworks needs extensions in order to be applicable to new tasks.
Rasa [rasa] is a rich set of libraries for building Dialogue Systems. It is developer-friendly and supports training of LU and dialogue policy (action) models with very little code. Rasa also supports model-based dialogue policies and provides wrappers to easily train various models through configuration files (similar to Plato). Rasa supports online learning, that is, users can interact with the agent to assess its behaviour and for other debugging purposes. It also supports learning conversations and corresponding actions from examples by defining the conversations in the form of stories. Although Rasa is quite robust in terms of functionality, its architecture is quite different from standard module-based Dialogue System architectures; that being said, custom code can be written along with existing modules (such as the Agent) to extend Rasa’s functionality. On the other hand, even though Rasa provides functionality to experiment with various state-of-the-art models, it is primarily designed for scalability.
Neural Modules (NeMo) [NeMo] is a toolkit supporting speech recognition and NLP models such as named entity recognition and intent and slot filling. Even though NeMo has integrations with state-of-the-art models such as BERT [bert] and Transformers [transformers], it currently does not support all the building blocks of a Dialogue System: modules such as LG, policy, DST, and DM are missing. Hence, it does not provide an end-to-end conversational experience.
ConvLab [convlab] is a multi-domain dialogue research platform whose objective is to enable researchers to quickly set up experiments with reusable components and modules to compare across different approaches. ConvLab provides evaluation modules (leveraging humans or algorithms) for comparing and evaluating different policies or models. Even though it supports most standard dialogue system components (except ASR and TTS modules), ConvLab is primarily designed for research and evaluation of various state-of-the-art implementations.
Other frameworks such as SimDial [simdial], OpenDial [opendial], and Olympus [olympus] (among others) seem to be less active in terms of support for state-of-the-art models and learning frameworks when compared to the aforementioned toolkits.
The goal we try to achieve with Plato is to develop a conversational AI platform that is one abstraction level above the aforementioned platforms (each of the use cases supported by these platforms can therefore also be supported by Plato), is easy to understand and debug, and is agnostic to the underlying libraries that implement each component’s statistical models or APIs. Plato can support any kind of Python module, such as PyTorch [pytorch], TensorFlow [tf], or Keras [keras], and can integrate with code-free Deep Learning libraries such as Ludwig [ludwig], so that users at any level of experience can build a quick prototype and experiment with state-of-the-art ideas, while expert users can dive deeper and build custom and more scalable systems. Plato supports continuous and code-free training of each component and can be used to implement agent architectures that consist of a single module (e.g. an end-to-end network) or many serial or parallel modules. Several Deep Learning and Reinforcement Learning based examples are provided, as well as integrations with standard datasets such as DSTC [dstc] and MetaLWOZ [metalwoz] to guide users in leveraging the full capabilities of Plato. Lastly, Plato supports multi-agent interaction, where several agents can interact with each other and train any of their components online.
2 Plato Architecture
The Plato Research Dialogue System is a platform that can be used to create, train, and evaluate conversational AI agents. It has four main components: 1) dialogue, which defines and implements dialogue acts and dialogue states; 2) domain, which includes the ontology of the dialogue and the database that the dialogue system queries; 3) controller, which orchestrates the conversations; and 4) agent, which implements the different components of each conversational agent. These four major components are shown in Figure 2 and are described in detail in the following sections. In Plato, each of these components is instantiated using configurations described in a YAML file (see Appendix A.5 for an example).
Plato is designed to be modular and flexible: it supports standard conversational agents, such as the ones depicted in Figure 3, as well as any customized conversational agent architecture (for example, joint LU and DST) via the generic conversational agent that operates as shown in Figure 4. At the highest level of abstraction, Plato supports agents communicating with other agents while adhering to some dialogue principles (Figure 5). These principles define what each agent can understand (an ontology of entities or meanings, e.g. price, location, user preferences, cuisine types) and what it can do (ask for more information, provide some information, call an API, etc.). The agents can communicate over speech, text, or structured information (e.g. dialogue acts), and each agent has its own configuration. In the rest of this section we go over the major components of Plato.
Plato facilitates conversations between agents via well-defined concepts in dialogue theory, such as dialogue states and dialogue acts. A Plato agent, however, may need to perform actions not directly related to dialogue (e.g. call an API) or actions that communicate information in modalities other than speech (e.g. show an image). Therefore, Plato models Actions and States as abstract containers out of which Dialogue Acts and Dialogue States are created. If needed for specific applications (e.g. multi-modal conversational agents) we may have task-specific Dialogue Acts and States.
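A minimal sketch of this container relationship (class and field names are illustrative, not Plato’s exact classes):

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """Generic action container; concrete act types specialize it."""
    name: str

@dataclass
class DialogueAct(Action):
    """A dialogue act: an intent plus slot-value parameters."""
    params: dict = field(default_factory=dict)

@dataclass
class ShowImageAct(Action):
    """A non-dialogue action, e.g. for a multi-modal agent."""
    image_url: str = ""

# A dialogue act and a non-dialogue action share the same abstraction,
# so a policy can emit either without special-casing modalities.
act = DialogueAct(name="inform", params={"cuisine": "italian"})
show = ShowImageAct(name="show_image", image_url="menu.png")
```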
To implement a slot-filling task-oriented dialogue system in Plato (slot-filling systems are dialogue systems whose primary purpose is to provide a natural language interface to an API, for example for flight booking or information retrieval), we need to specify two elements that constitute the domain of the dialogue system:
Ontology of the dialogue system. In task-oriented applications, the ontology determines the informable slots (the user provides that information), requestable slots (the user requests that information), and system-requestable slots (the system requests that information) for the conversation, thereby reflecting the schema of the database that the agent queries to retrieve the information it provides.
Database of items (restaurants, dishes, answers to questions, etc.). While the database could already exist, Plato provides utilities to construct the domain and the database of a dialogue system from data.
For instance, in the case of a conversational agent for restaurant reservation, the cuisine could be thought of as an informable slot and the database could contain information about restaurants of different cuisines, their price range, address, etc. For non-slot-filling applications, Plato ontologies and databases can be extended to meet task- or domain-specific requirements.
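For concreteness, an ontology for such a restaurant domain might contain the following slot categories (the structure is illustrative; the keys of Plato’s generated .json may differ):

```python
# Illustrative ontology for a restaurant-reservation domain, showing
# the three slot categories described above.
ontology = {
    "informable": {                     # the user provides these
        "cuisine": ["italian", "chinese", "indian"],
        "pricerange": ["cheap", "moderate", "expensive"],
    },
    "requestable": ["address", "phone", "pricerange"],  # the user may ask for these
    "system_requestable": ["cuisine", "pricerange"],    # the system may ask for these
}
```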
Plato provides a utility that makes it easy to generate an ontology (a .json file) and a database (SQLite) from a .csv file, with columns representing item attributes and rows representing items (for an example, see plato/example/data/flowershop.csv). The main purpose of this utility is to help quick prototyping of conversational agents. The command plato domain --config <PATH/TO/CONFIG.YAML> calls the utility and generates the appropriate .db and .json files that define the domain. In the YAML configuration file, the user specifies details such as the path to the input .csv file, the columns that represent informable slots, etc.
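The effect of this utility can be sketched in a few lines of standard-library Python; the function name, ontology keys, and database schema below are assumptions for illustration, not Plato’s exact output format:

```python
import csv, io, json, sqlite3

def build_domain(csv_text, informable_cols, table="items"):
    """Derive an ontology (JSON) and a database (SQLite) from a CSV
    whose rows are items and whose columns are item attributes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cols = list(rows[0].keys())
    ontology = {
        "informable": {c: sorted({r[c] for r in rows}) for c in informable_cols},
        "requestable": cols,
        "system_requestable": informable_cols,
    }
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE {table} ({', '.join(c + ' TEXT' for c in cols)})")
    db.executemany(f"INSERT INTO {table} VALUES ({', '.join('?' for _ in cols)})",
                   [tuple(r[c] for c in cols) for r in rows])
    return json.dumps(ontology), db

csv_text = "name,cuisine,pricerange\nLuigi's,italian,cheap\nTaj,indian,expensive\n"
ontology_json, db = build_domain(csv_text, ["cuisine", "pricerange"])
```

The real utility additionally writes the .json and .db files to the paths given in the YAML configuration.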
In Plato, controllers are objects that orchestrate the conversations between the agents. A controller instantiates the agents, initializes them for each dialogue, passes input and output appropriately, and keeps track of statistics. Note that it is each agent’s responsibility to interact with the world if needed (listen to the microphone, call an API, etc.).
Running the command plato run --config <PATH/TO/CONFIG.YAML> runs Plato’s basic controller, shown in Figure 6. This command receives a value for the --config argument which points to a Plato application configuration YAML file. In this configuration file, details of the agent(s) involved in the dialogue, paths to the ontology and database of the dialogue, along with other parameters and options are specified.
Plato also has a controller that comes with a Graphical User Interface (GUI). This controller can be started by running plato gui --config <PATH/TO/CONFIG.YAML>. This controller is very similar to Plato’s basic controller except that the user is prompted through a GUI as opposed to interacting via the terminal.
Every conversational application in Plato can have one or more agents. Each agent has a role (assistant, user, tourist, tutor, etc.) and a set of components such as NLU, DM, DST, dialogue policy, and NLG (Plato supports external API calls for automatic speech recognition (ASR) and text-to-speech synthesis (TTS)). An agent can have one explicit module for each of these components, or some of the components can be combined into one or more modules (e.g. joint or end-to-end agents) that run sequentially or in parallel (Figure 4). All components inherit from conversational_module, as shown in Figure 7, which facilitates communication between any components via conversational_frames. Each of these modules can be either rule-based or trained. In the next subsections, we describe how to build rule-based and trained modules for agents.
2.4.1 Rule-based modules
Plato provides rule-based versions of all components of a slot-filling conversational agent: slot_filling_nlu, slot_filling_dst, slot_filling_policy, slot_filling_nlg, and the default version of the Agenda-Based User Simulator [schatzmann2007agenda] agenda_based_us. These can be used for quick prototyping, baselines, or sanity checks. Specifically, all of these components follow rules or patterns conditioned on the given ontology and sometimes on the given database and should be treated as the most basic version of what each component should do.
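In the same spirit, a toy ontology-conditioned NLU might look as follows (a deliberately naive sketch; the real slot_filling_nlu is more elaborate):

```python
# Ontology-conditioned, rule-based NLU sketch: match ontology values
# verbatim in the utterance to produce inform acts, and fall back to a
# request act for wh-questions.
ONTOLOGY = {"cuisine": ["italian", "chinese"], "pricerange": ["cheap", "expensive"]}

def slot_filling_nlu(utterance):
    text = utterance.lower()
    slots = {slot: v for slot, values in ONTOLOGY.items()
             for v in values if v in text}
    if slots:
        return {"intent": "inform", "slots": slots}
    if text.startswith(("what", "where", "which")):
        return {"intent": "request", "slots": {}}
    return {"intent": "unknown", "slots": {}}
```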
2.4.2 Trained modules
Plato supports training of agents’ components in an online (during the interaction) or offline (from data) manner, using any machine learning framework. Virtually any model can be loaded into Plato as long as Plato’s Input/Output interface is respected. For example, if a model is a custom NLU, it simply needs to inherit from Plato’s NLU abstract class (plato.agent.component.nlu), implement the necessary functions and pack/unpack the data into and out of the custom model.
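Sketching this, with a locally defined stand-in for Plato’s abstract NLU class (the real plato.agent.component.nlu interface has more methods, e.g. for initialization, training, and saving):

```python
from abc import ABC, abstractmethod

class NLU(ABC):
    """Stand-in for Plato's abstract NLU class (illustrative)."""
    @abstractmethod
    def process_input(self, utterance, state=None): ...

class MyCustomNLU(NLU):
    """Wraps an arbitrary external model: pack the input, query the
    model, and unpack its prediction into a Plato-friendly structure."""
    def __init__(self, model):
        self.model = model          # any object with a .predict(text) method

    def process_input(self, utterance, state=None):
        raw = self.model.predict(utterance)          # pack & query
        return {"intent": raw["label"],              # unpack for Plato
                "confidence": raw.get("score", 1.0)}

class DummyModel:
    """Placeholder for a trained model from any framework."""
    def predict(self, text):
        return {"label": "inform", "score": 0.8}

nlu = MyCustomNLU(DummyModel())
```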
Plato internal experience
To facilitate online learning, debugging, and evaluation, Plato keeps track of its internal experience using a utility called the dialogue_episode_recorder, which stores information about previous dialogue states, actions taken, current dialogue states, utterances received and utterances produced, rewards received, and a few other constructs including a custom field that can be used to track anything else that cannot be contained by the aforementioned categories. At the end of a dialogue or at specified intervals, each conversational agent will call the train() function of each of its internal components, passing the dialogue experience as training data. Each component then picks the parts it needs for training.
To use learning algorithms that are implemented inside Plato, any external data, such as DSTC2 data, should be parsed into this Plato experience so that they may be loaded and used by the components currently under training. Alternatively, users may parse the data and train their models outside of Plato and then load the trained model when they want to use it for a Plato agent.
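The shape of this experience can be approximated as follows (class and field names are illustrative, not the recorder’s exact attributes):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Turn:
    """One step of experience, mirroring the kinds of fields the
    recorder tracks: states, actions, utterances, rewards, and a
    free-form custom field."""
    state: Any
    action: Any
    new_state: Any
    utterance_in: str = ""
    utterance_out: str = ""
    reward: float = 0.0
    custom: dict = field(default_factory=dict)

class EpisodeRecorder:
    def __init__(self):
        self.dialogues = []          # list of dialogues, each a list of Turns

    def record(self, dialogue_idx, turn):
        while len(self.dialogues) <= dialogue_idx:
            self.dialogues.append([])
        self.dialogues[dialogue_idx].append(turn)

rec = EpisodeRecorder()
rec.record(0, Turn(state="greet", action="ask_cuisine",
                   new_state="asked", reward=-1.0))
```

A data parser that targets this structure can feed external corpora to any component’s train() function exactly as online experience would.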
Parsing data with Plato
Plato provides parsers for the DSTC2, MetaLWOZ, and Taskmaster data sets. These parsers can be used to create training data for different components of the agent based on these well-known data sets. For other data sets, users should implement custom parsers that convert the data into a Plato-readable format. Users can then load the data into Plato (via options in the configuration file) and train or fine-tune the desired components.
Training components of conversational agents
There are two main ways to train each component of a Plato agent: online, as the agent interacts with other agents, simulators, or users, and offline, from data. For online training, users can determine the training schedule via hyper-parameters (after how many dialogues to train, for how many epochs, how large the experience pool and minibatch are, etc.). Moreover, users can use algorithms implemented in Plato or external frameworks such as TensorFlow, PyTorch, Keras, Ludwig, etc.
Model Training with Plato
Besides supervised models, Plato also provides implementations of reinforcement learning algorithms, such as Q-Learning and REINFORCE. Such algorithms can be used, for example, to train the agent’s dialogue policy online, as the agent interacts with its environment. Plato provides the flexibility to train after each dialogue or at certain intervals. Note that while the reinforcement learning implementations are geared towards dialogue policies, Plato and Ludwig support online training of any component.
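As a concrete example of the kind of update involved, here is a minimal tabular Q-Learning step over dialogue states and system actions (a generic sketch of the algorithm, not Plato’s implementation):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9            # learning rate and discount factor

Q = defaultdict(float)             # (state, action) -> estimated value

def q_update(state, action, reward, next_state, actions):
    """Standard Q-Learning update toward reward + discounted best next value."""
    best_next = max((Q[(next_state, a)] for a in actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Two turns of (hypothetical) recorded experience: asking for the
# cuisine costs a small turn penalty; completing the task pays +20.
actions = ["ask_cuisine", "offer_restaurant"]
q_update("no_info", "ask_cuisine", -1.0, "has_cuisine", actions)
q_update("has_cuisine", "offer_restaurant", 20.0, "done", actions)
```

Run over many recorded dialogues, updates like these gradually shift the policy toward actions that complete the task in fewer turns.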
Training with Ludwig
Although virtually any modeling framework can be used within Plato to build and train deep learning models for the different components of conversational agents, Ludwig is a good choice when the aim is quick prototyping or education, as it allows users to train models without writing any code. Users need to parse their data into .csv files, create a Ludwig configuration YAML file that describes the model architecture, the features to use from the .csv, and other parameters, and then run a command in a terminal. Plato integrates with Ludwig models, i.e. it can load, save, train, and query them; trained models can be loaded into the corresponding modules through configuration files. In the Plato tutorial we provide examples of building and training language understanding, generation, dialogue policy, and dialogue state tracking models for Plato using Ludwig.
Training with other frameworks
To use other learning frameworks (TensorFlow, PyTorch, Keras, etc.) users simply need to write a class that interfaces with the trained model (i.e. parses the input Plato provides and processes the model’s response into structures that Plato understands).
3 Plato Settings
As mentioned in the previous section, Plato supports interaction between a conversational agent and human agents, simulated users, or other conversational agents. In this section we discuss the details of each configuration.
3.1 Single Agent
A single Plato conversational agent can interact with a) human users, via text or speech, command line or graphical interface; b) simulated users, via dialogue acts or text; and c) data (loading data into Plato in order to train components or generating simulated data). The specifics of each interaction are specified in YAML configuration files, which have three main sections:
GENERAL defines the mode of the interaction (e.g. with a simulator, via speech or text, etc), the number of agents to spawn, paths to experience logs, and global arguments (passed to all components of each agent, for convenience).
DIALOGUE defines dialogue-specific settings, such as number of dialogues to run for, domain, etc.
AGENT_i defines each agent’s settings, such as role, and component-specific settings, such as model paths, learning rates, or other arguments each component needs. Note that the global arguments from the GENERAL section are also passed to each component of each agent.
For an example, see Appendix A.5 or the example configuration files in the Plato codebase.
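Putting the three sections together, a configuration file might look roughly like the following; the key names here are illustrative and should be checked against the example configurations shipped with Plato:

```yaml
GENERAL:
  interaction_mode: simulation       # e.g. text, speech, simulation
  agents: 1
  experience_logs:
    path: logs/
  global_arguments:                  # passed to every component
    ontology: domain/CamRestaurants.json
    database: domain/CamRestaurants.db

DIALOGUE:
  num_dialogues: 10
  domain: CamRest

AGENT_0:
  role: system
  NLU: slot_filling_nlu
  DM: slot_filling_policy
  NLG: slot_filling_nlg
```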
3.2 Multiple Agents
The basic controller plato.controller.basic_controller provides support for two agents interacting with each other. Similar to the single-agent configuration, this can be done via dialogue acts or free text. Interaction through speech is also possible; for example, the agents can exchange .wav files instead of text.
In this setting, it is possible to train all components of each agent online and concurrently, or following any desired schedule, e.g. alternating training, training in batches, etc. These options can be defined in the YAML configuration file, which is very similar to the single-agent case but defines multiple AGENT sections (see plato/example/config/application/MultiAgent_train.yaml). Plato provides some example multi-agent reinforcement learning algorithm implementations for concurrent dialogue policy learning.
Example use cases of multiple conversational agents include:
General-sum games such as negotiations, where the agents’ objectives are not completely aligned (but not completely opposed either).
Multi-party interactions, where one or more conversational agents interact with groups of other agents, for example family dinner ordering, playing board games, etc.
Smart Home, where a conversational agent is a point of contact between human agents, conversational agents, and non-conversational agents.
To support complex use cases, a new controller may be necessary to make sure that information is passed correctly between the agents.
3.3 Graphical User Interface
Plato provides a dedicated controller to handle the GUI. In its current implementation it supports interaction between two agents, as shown in Figure 8. This interface is just an example based on PySimpleGUI, a flexible and easy-to-use package.
4 Conclusion
We have introduced the Plato Research Dialogue System, a flexible platform for research and development of conversational AI agents. With an easy-to-understand, extensible design, Plato provides the infrastructure required by virtually any conversational AI agent architecture and supports any Python machine learning framework for the agent’s components. As Plato continues to grow, more models, algorithms, and metrics will be integrated, as well as more examples, tutorials, and data parsers for publicly available datasets.
Plato can be obtained from: https://github.com/uber-research/plato-research-dialogue-system
We would like to thank Michael Pearce and Zack Kaden for their contributions.
Appendix A Appendix
Here we provide some examples of how to run Plato. This appendix assumes some general knowledge of Spoken Dialogue Systems theory, prominent data sets, benchmarks, etc. A full user guide and installation instructions can be found at https://github.com/uber-research/plato-research-dialogue-system.
a.1 Quick Start
To run Plato after installation, you can simply run the plato command, which supports four sub-commands:
plato run --config <CONFIG.YAML>
plato gui --config <CONFIG.YAML>
plato domain --config <CONFIG.YAML>
plato parse --config <CONFIG.YAML>
Each of these sub-commands receives a value for the --config argument that points to a configuration file. For some quick examples, try the following configuration files for the Cambridge Restaurants domain:
plato run --config CamRest_user_simulator.yaml
plato run --config CamRest_text.yaml
plato run --config CamRest_speech.yaml
plato gui --config CamRest_GUI_speech.yaml
a.2 Train a module
There are several options for training a module in Plato, as detailed in the user guide (https://github.com/uber-research/plato-research-dialogue-system/README.md). Plato provides support for offline and online training (allowing for custom training schedules). For online training, users need to implement a train() function in their module (called according to the schedule) which may directly train the custom model, call an API for training it, etc. Offline training can happen inside or outside of Plato. In this section, we demonstrate the latter, as training within Plato is relatively straightforward.
In this example, we show how to train an NLU using DSTC2 [dstc] data. Our model will jointly predict the intent and the IOB (Inside-Outside-Beginning) slot tags. Table 1, below, shows a snapshot of the training data.
| Transcript | Intent(s) | IOB tags |
| expensive restaurant that serves vegetarian food | inform | B-inform-pricerange O O O B-inform-food O |
| asian oriental type of food | inform | B-inform-food I-inform-food O O O |
| what is the phone number | request_phone | O O O O O |
| thank you good bye | bye thankyou | O O O O |
| how about french food | reqalts inform | O O B-inform-food O |
Using Ludwig, we can define our model with a simple configuration file (which can be found in Plato’s user guide) and call the following command to train the model (some details are omitted, see the full Plato guide):
ludwig experiment --model_definition_file ludwig_config.yaml --data_csv NLU.csv
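The referenced ludwig_config.yaml declares the model itself. A sketch of what such a definition could contain is shown below; the feature names must match the columns of NLU.csv, and this should be treated as illustrative rather than the exact file from the guide:

```yaml
input_features:
  - name: transcript       # the utterance column of NLU.csv
    type: text
    encoder: rnn
    cell_type: lstm

output_features:
  - name: intent           # intent label(s), e.g. "inform"
    type: category
  - name: iob              # per-token IOB tags, e.g. "B-inform-food O"
    type: sequence
```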
To load the trained model into Plato, a wrapper class needs to be written to interface with the specific model, i.e. to properly format the input, query the model, and parse its output so that downstream Plato components can understand it. For Ludwig specifically, these classes are already provided in Plato. The final step is then to modify Plato’s configuration file to point to the Ludwig-based NLU and to call plato using that configuration.
a.3 Load pre-trained models
Large pre-trained models from libraries such as Huggingface [wolf2019transformers] are becoming a standard in conversational AI. Plato inherently supports such models, as from Plato’s perspective they are no different from any other statistical model. For example, to use a pre-trained Huggingface BERT model for NLU, one needs to write a class implementing Plato’s NLU interface and load and query BERT appropriately (see https://github.com/huggingface/transformers for detailed instructions and documentation for Huggingface models, and Plato’s user guide for a relevant tutorial).
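A sketch of such a wrapper is shown below. Model loading is deferred so that the output-formatting logic runs without downloading any weights; the pipeline task and default model name are assumptions, and in practice a checkpoint fine-tuned for intent classification would be substituted:

```python
def to_dialogue_act(pipeline_output):
    """Convert pipeline output like [{'label': 'inform', 'score': 0.93}]
    into the intent structure downstream Plato components expect."""
    best = max(pipeline_output, key=lambda p: p["score"])
    return {"intent": best["label"], "confidence": best["score"]}

class TransformerNLU:
    """NLU wrapper around a Hugging Face text-classification pipeline."""
    def __init__(self, model_name="bert-base-uncased"):
        self.model_name = model_name   # assumed checkpoint; replace as needed
        self._pipe = None

    def _pipeline(self):
        if self._pipe is None:         # lazy load: only on first use
            from transformers import pipeline
            self._pipe = pipeline("text-classification", model=self.model_name)
        return self._pipe

    def process_input(self, utterance):
        return to_dialogue_act(self._pipeline()(utterance))
```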
a.4 Other topics
Besides the above use cases, Plato provides a variety of utilities such as data parsing, database creation, and simulated data generation. For more details please see the full user guide.