Conversational systems have entered mainstream applications in recent years. As a result, a large number of dyadic chatbots have become available to users, from chit-chat bots that converse about generic topics to entertain people, to expert chatbots that provide useful services through natural language, such as booking trains and flights or controlling lights. However, in everyday life, and particularly in the case of smart speakers such as Echo and Google Home, conversational systems are commonly in the presence of multiple users who talk not only to the chatbot but also to each other. For a chatbot to determine whether and when to speak, the most commonly used solution in those contexts is direct address, that is, an explicit reference to the name of the chatbot in an utterance to command it to speak next: “Siri, what is the next entry in my calendar?” However, this makes interacting with chatbots mechanical, less social, and, in many situations, plainly awkward. Similarly, there are situations in which multiple people may want to interact with a chatbot at the same time, as in a chat group, to coordinate among themselves and achieve a common goal. Also, some experimental conversational systems propose scenarios where one person interacts with several chatbots in the same conversation in order to compare or coordinate the services they provide.
Without direct address, chatbots need to know their proper turn to interact, which leads to what is referred to as the multi-party turn-taking problem. Fundamentally, the goal is to predict which agent in the conversation is the most likely to speak next and, conversely, when an agent must wait before interacting. An agent can be either a person or a chatbot. An interaction can be a reply to the last interaction, a reply to an earlier interaction in the dialogue, or even a new idea or an interruption. In the state of the art on text-only conversational systems we find efforts such as finch, a multi-party system enabling interactions between people and four chatbots that are experts in financial investments. There, turn-taking is controlled by a rule-based service that is called for every utterance exchanged in the group chat and considers both the content and the history of interaction between participants. Even though the rule-based system solves the turn-taking problem for finch in that specific domain, the approach is hard to scale, both in growing the set of rules and in applying it to new domains, since it depends heavily on expert knowledge.
Therefore, the main contribution of this paper is to present and evaluate turn-taking as a machine learning (ML) problem, which can provide a more scalable solution. This involves defining a way to model turn-taking using an ML approach, and designing datasets to train and evaluate such models. For the former, given a finite set of agents that can speak, the algorithm predicts only the most likely agent to speak next, assuming that only one agent should speak at a time; the information used for the prediction can include participant data only, or both participant and content data. For the latter, we present three corpora: one based on dialogues from seasons of an American TV situation comedy (sitcom), one with data gathered from the finch system, and multibotwoz, created from the MultiWOZ dataset.
To validate the proposed approach, we evaluated architectures with vanilla algorithms such as Maximum Likelihood Estimation (MLE), Support Vector Machines (SVMs), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) to determine the most suitable technique for predicting turn-taking in conversations. We found that the CNN models achieved higher accuracy than the other ML techniques on all three datasets, and that the content improved the overall performance of all approaches for the topic-oriented finch and multibotwoz datasets but not for the chit-chat dialogues of the sitcom dataset, where the ML models were not able to beat the baseline. Finally, we found that the size of the corpus had a very positive impact on the accuracy of the content-based deep learning approaches, since those models perform best on multibotwoz, the largest dataset.
2 Related Work
To manage the dialogue of chatbots in a multi-party conversation, rule-based systems based on finite-state automata have been proposed. However, those works do not explore ML-based turn-taking models, which can be an issue since rule-based systems tend not to scale well, especially considering that natural language evolves very quickly.
On the other hand, several ML-based end-to-end data-driven dialogue systems have been built and evaluated, including some that consider multi-party dialogues, albeit by disentangling them into dyadic dialogues. Further studies have also been conducted to build participant social role models.
A related problem arises when a chatbot is built as a collection of multiple chatbots with different skills and, after an utterance of the user, has to select which of the component chatbots must answer. In fact, this was the approach used by the majority of the contestants of the Alexa Prize (https://developer.amazon.com/alexaprize), proposed by Amazon to advance voice control skills and technologies. The 2018 chatbot winner implemented an open-domain dyadic social conversation system that achieved an average rating of 3.56 out of 5 in the last seven days of the semi-finals. However, although these ML-based related works are extremely important for building better end-to-end dialogue systems, as far as we know, none of them has been employed to learn multi-party turn-taking models from conversation logs with the goal of predicting who speaks next.
The closest work is that of Ouchi and Tsuboi, which proposes a model that encodes the context to predict the addressee and a response in a multi-party conversation. We, on the other hand, do not try to predict the response. Furthermore, there are datasets such as Reddit or Ubuntu IRC that are forum-oriented and do not have the structure of a conversation with clear speaker turns. We studied the transformation of the Ubuntu IRC dataset into a multi-party dialogue dataset; however, it is far from trivial. The original training corpus is organized by day, with no clear threads of conversation between the participants, so multiple conversations happen concurrently each day. After applying some filters, we realized that this effort requires sophisticated algorithms and is out of the scope of this paper.
Related to the finch dataset is the Multi-Domain Wizard-of-Oz dataset (MultiWOZ; https://www.repository.cam.ac.uk/handle/1810/280608), which is a fully labeled collection of human-human written conversations spanning multiple domains and topics. However, this dataset was not created with more than one bot in the conversation with the user in mind. Rather, it was created considering a dyadic conversation between the user and a bot that can talk about multiple domains or topics. The topics are actually service providers. We have nevertheless adapted this dataset, as explained in the next section.
For learning multi-party turn-taking models we considered three datasets: a corpus based on dialogues from scripts of 10 seasons of a very popular American sitcom; the finch corpus, with logs of real-world interactions between people and four chatbots; and multibotwoz (to be provided by the authors upon request), which we created as an adaptation of the MultiWoZ dataset, a multi-domain service-based dialogue corpus. finch was designed so that the chatbots cooperate to advise the user (a person) in choosing among three investment options: savings account, certificate of deposit, and treasury bonds.
For each of those investment options there is an expert chatbot (saGuru, cdGuru and tdGuru, respectively), plus a fourth chatbot (inGuru) that acts as a mediator, moderating the conversation and ensuring that the user’s utterances receive a reply. The expert chatbots can not only answer questions related to investments but also estimate the return on investment for an initial amount and period of time specified by the user.
In general, the American sitcom dataset is more chit-chat oriented, while the finch corpus contains human-machine interactions and is more topic-oriented, in that each chatbot interacts only depending on the topic. For the American sitcom corpus, we considered each scene of each episode as a single dialogue, and kept only the scenes containing at least three (3) and at most six (6) agents (considering only the six main characters of the sitcom). The resulting corpus has 20,086 utterances in 1,050 dialogues, with 4 agents and 19 utterances exchanged on average in each dialogue.
Regarding the multibotwoz dataset, we first filtered the MultiWoZ dataset to keep only the multi-domain dialogues, i.e., the dialogues in which the user requested more than one service. Within this subset, we then kept only the dialogues that contained the following services: attractions, hotel, restaurant, taxi and train. We did so because the number of dialogues for the remaining services was not sufficient as a sample to learn turn-taking. We then set the bot name according to the service (one bot for each service). To perform this task, we created a pool of classifiers that, based on domain-specific dictionaries and on the classification of the last two utterances sent in the dialogue, could determine the domain of a given utterance.
|Statistic||Sitcom||finch||multibotwoz|
|Corpus size (number of utterances)||20,086||1,148||99,553|
|Total number of dialogues||1,050||41||6,138|
|Avg. number of agents per dialogue||4||5||4|
|Avg. number of utterances per dialogue||19||24||16|
|Avg. length of utterances (words)||11||12||13|
|Agent||Interaction frequency after user(%)|
For the sentences related to generic dialog acts (also known as speech acts, which represent the function of the speech), such as greetings, thanks, or clarification questions, we defined that these utterances were sent by the system_bot. It thus plays the role of a mediator in the conversation, just as the inGuru bot does in the finch dataset. The resulting corpus has 99,553 utterances in 6,138 dialogues, with 4 agents on average in each dialogue, varying from to, and 16 utterances exchanged on average in each dialogue.
A summary of the three corpora is listed in Tables 1 and 2. The finch corpus is much smaller than the sitcom and multibotwoz corpora, comprising 1,148 utterances in a total of 41 dialogues, with 5 agents and 24 utterances exchanged per dialogue on average. Despite that, it was created in a real-world system in which people interacted with multiple bots in the same conversation. Further, we present in Tables 3, 4 and 5 the interaction frequencies between agents, to better characterize the differences between the three datasets. Each element represents the frequency of one agent interacting after another agent. Note that we did not compute the frequency for the same agent. In all datasets we merged consecutive utterances by the same agent into one utterance.
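As an illustration of how such interaction frequencies can be obtained, the following minimal sketch merges consecutive utterances by the same agent and then counts who speaks after whom. The function names and the toy dialogue are hypothetical, and the row-wise normalisation (percentages over all turns that follow a given agent) is an assumption, since the normalisation is not spelled out in the text.

```python
from collections import Counter, defaultdict


def merge_consecutive(turns):
    """Merge consecutive utterances by the same agent into one utterance."""
    merged = []
    for agent, text in turns:
        if merged and merged[-1][0] == agent:
            merged[-1] = (agent, merged[-1][1] + " " + text)
        else:
            merged.append((agent, text))
    return merged


def interaction_frequencies(dialogues):
    """Return freq[prev][nxt]: percentage of the turns following `prev`
    that were taken by `nxt` (assumed row-wise normalisation)."""
    counts = defaultdict(Counter)
    for turns in dialogues:
        speakers = [agent for agent, _ in merge_consecutive(turns)]
        for prev, nxt in zip(speakers, speakers[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: 100.0 * c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


# Toy dialogue (hypothetical content) as a list of (agent, utterance) pairs.
toy = [[("user", "hi"), ("user", "anyone there?"), ("inGuru", "hello"),
        ("user", "savings?"), ("saGuru", "sure"), ("user", "thanks"),
        ("inGuru", "bye")]]
freq = interaction_frequencies(toy)
```

Because same-agent turns are merged first, an agent never follows itself, which matches the remark that no frequency is computed for the same agent.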
In the sitcom dataset the interaction frequencies between agents are reasonably similar, as expected in typical chit-chat scenarios, with the minimum value of the interaction frequency between agents being (agent E after D) and the maximum value (agent B after D), with an average of and standard deviation of. However, the same does not happen in the finch and multibotwoz datasets. The minimum value in the former is (cdGuru after inGuru) and the maximum value is (finch-user after inGuru) with average of and standard deviation of . In the latter, the minimum value is for taxi_bot, and the maximum value is for train_bot. This clearly shows that the conditional interaction frequencies vary more in these two datasets than in the sitcom dataset. For instance, this makes it easier to accurately predict in finch when the user interacts after inGuru, and likewise that it is very uncommon for cdGuru to speak after inGuru. Finally, no bot interacts after another bot in the multibotwoz dataset, only after the user, with the frequencies shown in Table 5.
4 Machine Learning-based Models
In this section we describe the ML-based methods considered in this research. Briefly, we evaluated different types of ML approaches, such as MLE, SVMs, and neural networks, and for some of the approaches we varied the implementation by considering only agent information, i.e., only who spoke the previous utterances, or by also taking into account the content of the utterances, i.e., who spoke and what was said. In addition, the methods differ in the way the agent information is encoded. For most methods, a binary encoding is used, while for the neural networks the agents are encoded as raw text. We refer to these methods as Traditional ML and Deep Learning Methods, respectively.
We consider a baseline, which we call Repeat Last, against which to compare our proposed methods. This approach is based on a social rule often observed in multi-party human dialogues: when more than two agents are exchanging utterances, the agent who spoke before the current speaker tends to be the next speaker. Thus, whenever an agent speaks, we predict the next speaker to be the one who had spoken before. More formally, the Repeat Last baseline prediction works as follows: let $A$ be the set of agents in the dialogue, $n = |A|$ be the number of agents, and let $a_1, \dots, a_t$ be the agents who sent an utterance in the dialogue up to a time $t$, where $a_i \in A$. Whenever the speaker $a_t$ sends an utterance, the next agent selected to talk, denoted $\hat{a}_{t+1}$, is the one who spoke at time $t-1$, i.e., $\hat{a}_{t+1} = a_{t-1}$.
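The baseline is simple enough to be sketched in a few lines. The snippet below is a minimal illustration with hypothetical function names. It assumes that positions with fewer than two previous turns are skipped when computing accuracy, which the text does not specify.

```python
def repeat_last(speakers):
    """Predict the next speaker as the agent who spoke before the current one.

    `speakers` is the list of agents who have spoken so far, in order.
    Returns None when fewer than two turns have been observed.
    """
    return speakers[-2] if len(speakers) >= 2 else None


def repeat_last_accuracy(dialogues):
    """Fraction of turns (from the third turn on) predicted correctly."""
    correct = total = 0
    for speakers in dialogues:
        for t in range(2, len(speakers)):
            total += 1
            correct += repeat_last(speakers[:t]) == speakers[t]
    return correct / total if total else 0.0
```

For a speaker sequence such as A, B, A, C, A the baseline is right whenever the conversation alternates back to the previous speaker and wrong when a third agent joins in.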
4.2 Traditional ML Methods
For the more traditional ML methods, such as MLE and SVM, we use one-hot encoding to convert the information about the agents into a feature vector, formalized as follows. Let $x_t \in \{0,1\}^n$ be a vector in an $n$-dimensional instance space, with $n$ agents in the conversation and $A_i$ the $i$-th agent, where $1 \le i \le n$, and let $a_t$ be the agent who spoke at time $t$. The binary feature vector for predicting the next speaker at time $t+1$, in the simplest case of a lookback window of size 1, can be defined as:

$$x_t[i] = \begin{cases} 1 & \text{if } a_t = A_i \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
We perform a linear transformation on , by taking into account until , where is the size of lookback window and , as:
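One way to realise this agent encoding in code is sketched below. It assumes that the lookback step amounts to concatenating the one-hot vectors of the last speakers, most recent first, and that missing history is padded with zeros. Neither detail is stated explicitly in the text, and the function names are hypothetical.

```python
def one_hot(agent, agents):
    """Binary vector with a 1 at the position of `agent` in `agents`."""
    return [1 if agent == a else 0 for a in agents]


def lookback_features(speakers, agents, window=2):
    """Concatenate the one-hot vectors of the last `window` speakers.

    The most recent speaker comes first; positions before the start of the
    dialogue are zero-padded (an assumption of this sketch).
    """
    feats = []
    for k in range(1, window + 1):
        if len(speakers) >= k:
            feats.extend(one_hot(speakers[-k], agents))
        else:
            feats.extend([0] * len(agents))
    return feats
```

With three agents and a window of two, each example is a binary vector of length six, which can be passed directly to the agent-only classifiers.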
4.2.1 Agents-based models:
The approaches considered herein encode only the information of the agents, by making use of the aforementioned one-hot encoding method. The methods are the following:
A-MLE: Maximum Likelihood Estimation  taking into account only the order in which the agents interact in the conversation. Therefore, transitions are learned by considering that the previous state is the last agent which sent an utterance and the next state is the following agent which sent an utterance. We also modeled A-MLE considering a lookback window of size 2, which means the previous state contains information of the two last agents which sent an utterance. In this case, a transition from state to , is modeled as:
We then compute the MLE with smoothing to estimate the parameter for each transition type. Therefore, for each corpus, we estimate for observed transitions as:
Where is the set of states and is the number of states in the set.
A-SVM: a multi-class linear SVM model  is trained with classes: one for each agent. The model is trained by receiving as input binary vectors of length , like in A-MLE. However, in this case, a class is predicted considering:
where is the label set which contains the names of the agents.
BA-SVM: A binary SVM model is learned for each agent, which classifies whether the respective agent is likely to reply to a given utterance. The dialogues are parsed and only the agent-encoded vectors are considered. Then, for each utterance, if the agent is the speaker of the utterance, the agent’s name is assigned as the label; otherwise the “Other” label is assigned. The training data are given as input to an SVM trainer and the generated models are saved. For each example in the testing data, all models are called and their outputs are ranked. The top-ranked agent is chosen as the most likely next speaker.
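To make the A-MLE model described above more concrete, the following sketch estimates transition probabilities between speaker states by counting, with additive smoothing. It is only one plausible reading: the smoothing constant `alpha = 1` and the use of the number of candidate agents in the denominator are assumptions, and the class name `AgentMLE` is hypothetical.

```python
from collections import Counter, defaultdict


class AgentMLE:
    """Count-based next-speaker model over the last `window` speakers."""

    def __init__(self, agents, window=1, alpha=1.0):
        self.agents = list(agents)
        self.window = window
        self.alpha = alpha  # assumed additive (add-alpha) smoothing constant
        self.counts = defaultdict(Counter)

    def fit(self, dialogues):
        """`dialogues` is a list of speaker sequences."""
        for speakers in dialogues:
            for t in range(self.window, len(speakers)):
                state = tuple(speakers[t - self.window:t])
                self.counts[state][speakers[t]] += 1
        return self

    def prob(self, state, nxt):
        """Smoothed estimate of P(next speaker = nxt | state)."""
        row = self.counts[tuple(state)]
        total = sum(row.values())
        return (row[nxt] + self.alpha) / (total + self.alpha * len(self.agents))

    def predict(self, history):
        """Most likely next speaker given the speaker history so far."""
        state = history[-self.window:]
        return max(self.agents, key=lambda a: self.prob(state, a))
```

Setting `window=2` gives the variant in which the state contains the two last agents who sent an utterance.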
4.2.2 Agents-and-Content-based Models:
Below we describe variations of the previously described methods but with the addition of content information, that is, what was actually spoken by each agent.
AC-MLE: The agent-and-content MLE-based architecture also considers the utterance being exchanged in the dialogue, in addition to the agent vector as defined in Equation 4. The utterances are first tokenized and punctuation symbols are removed, resulting in lists of tokens. A Word2Vec embedding model is then trained on all utterances in the training set. We used the gensim library (https://radimrehurek.com/gensim/models/word2vec.html) to generate the Word2Vec models. After this process, the feature vector (Utterance2Vec) representing an utterance is the mean of all of its word vectors. Then, the utterance vectors are given as input to a K-means clustering algorithm.
A binary vector representing the detected cluster of each utterance is generated, as well as a binary vector representing the agent who sent the utterance, both using the one-hot encoding method. Then, the cluster and agent vector pairs are used to train an MLE-based model. More formally, let be a vector as in Equation 3, but in this case concatenated with the binary vector described in the previous paragraph, then a class is predicted by computing the transitions as in Equation 4 and the likelihood as in Equation 5.
AC-SVM: This agent-and-content SVM-based architecture makes use of word embeddings for better capturing the semantic meaning of the utterances.
Then a multi-class linear SVM model is trained also with classes as in the A-SVM approach, however it receives as input both utterance and agent vectors concatenated into single vectors. More formally, let be a vector as in Equation 1 and be the utterance vector. We perform a linear transformation on by taking the , where is the lookback window. So, for , we have:
Then a class is predicted considering:
Where is the label set which contains the names of the agents.
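The content features used by AC-MLE and AC-SVM can be illustrated with a small dependency-free sketch. Here the word vectors and cluster centroids are toy values supplied by hand, whereas in the paper they come from a trained Word2Vec model and a K-means run. All function names are hypothetical, and the order in which agent and content vectors are concatenated is an assumption.

```python
def utterance_vector(tokens, word_vectors, dim):
    """Utterance2Vec: mean of the word vectors of the known tokens."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def nearest_cluster(vec, centroids):
    """Index of the centroid closest to `vec` (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((v - x) ** 2 for v, x in zip(vec, c))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def ac_mle_features(agent, tokens, agents, word_vectors, centroids):
    """Agent one-hot followed by cluster one-hot (AC-MLE style input)."""
    dim = len(centroids[0])
    cluster = nearest_cluster(utterance_vector(tokens, word_vectors, dim), centroids)
    return one_hot(agents.index(agent), len(agents)) + one_hot(cluster, len(centroids))


def ac_svm_features(agent, tokens, agents, word_vectors, dim):
    """Agent one-hot followed by the dense utterance vector (AC-SVM style input)."""
    return one_hot(agents.index(agent), len(agents)) + utterance_vector(tokens, word_vectors, dim)


# Toy values, not taken from the paper.
word_vectors = {"train": [1.0, 0.0], "ticket": [0.8, 0.2], "hotel": [0.0, 1.0]}
centroids = [[1.0, 0.0], [0.0, 1.0]]
agents = ["user", "train_bot", "hotel_bot"]
```

The difference between the two variants is then visible in the feature vector itself: AC-MLE sees a discrete cluster id, while AC-SVM receives the dense utterance embedding.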
4.3 Deep Learning Methods
We consider two different deep learning methods. The first is based on a CNN  which is a model generally applied on classification tasks, and the second is based on LSTM  which is a model also generally used for classification tasks but with an extra capability of learning temporal information which is particularly attractive in the context of dialogues.
AC-CNN: The agent-and-content convolutional neural network (AC-CNN) presented herein consists of a standard model used for text classification adapted for the task involved in this paper. Such adaptation consists of formatting the previous utterances and the name of the agent as a raw text, and defining the label as in the previous methods.
More formally, let be the agent who spoke utterance at time , and the agent who spoke the last utterance , to predict who will speak at time , we build the following raw text: , where represents the concatenation of textual strings. That text is then used as input to the neural network.
The architecture considered for the CNN is the following: an embedding layer with 64 dimensions; dropout set to 0.2; a convolutional layer with 64 filters, kernel size 3, and stride 1; a 1D global max-pooling layer with pool size set to 5; another dropout set to 0.2; and a 300-dimensional dense hidden layer.
AC-LSTM: For the agent-and-content long-short term memory neural network, we make use of the same raw-text-based encoding we described for the CNNs, and the main difference lies in the architecture of the two methods, since LSTMs contain a layer aiming at learning temporal information. Given that a dialogue consists of a sequence of pairs comprised of a speaker and an utterance, the goal of evaluating this model is to investigate whether such temporal sequence can be captured and learned by a ML model.
The architecture we considered for this neural network is the following: an embedding layer with 64 dimensions; dropout set to 0.25; a convolutional layer with 64 filters, kernel size 3, and stride 1; and a 1D global max-pooling layer with pool size set to 5. For this model we set meta-parameters similar to those of the CNN, the only exception being the number of epochs, which was set to 2 for the LSTM. By removing the text from the input, we also implemented two variations of the neural networks with agent information only, to compare with the agent-only models: (i) A-CNN, with the same architecture as AC-CNN; and (ii) A-LSTM, equivalent to AC-LSTM.
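The raw-text encoding shared by the neural models can be sketched as follows. The exact string format is not recoverable from the text, so this example assumes that agent names and utterances are joined by spaces in chronological order. The function names are hypothetical, and `include_content=False` corresponds to the agent-only A-CNN and A-LSTM variants.

```python
def build_input_text(history, lookback=2, include_content=True):
    """Concatenate the last `lookback` (agent, utterance) pairs into one string.

    The space-separated, chronological format is an assumption of this sketch.
    """
    parts = []
    for agent, utterance in history[-lookback:]:
        parts.append(agent)
        if include_content:
            parts.append(utterance)
    return " ".join(parts)


def make_examples(dialogue, lookback=2, include_content=True):
    """Build (input text, next speaker) training pairs from one dialogue."""
    return [
        (build_input_text(dialogue[:t], lookback, include_content), dialogue[t][0])
        for t in range(lookback, len(dialogue))
    ]
```

Each resulting pair can then be tokenized and fed to a standard text classifier whose label set is the set of agent names.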
5 Evaluation Results
In this section we present the training approach and the results obtained with the agent-based baseline models and with the learning models that consider both content and agent information, for the sitcom, finch, and multibotwoz datasets.
5.1 Training Approach
Our model places no constraint on waiting for a specific moment to start predicting; it follows a more classical batch-learning process. For each corpus, we considered a train-test split, where of subsequent dialogues are used for training and the remaining for testing. To set the meta-parameters of the models, cross-validation was applied on the training set. Regarding the number of clusters for the MLE-based architecture, after some trials and after observing how the clusters were formed in a PCA 2D plot, we parametrized the model with six (6) clusters for the sitcom data, five (5) for the finch data and seven (7) for multibotwoz. Although the number of clusters is identical to the number of agents, we did not find a correlation between the clusters and the agents or the conditional interaction between the agents. The vocabulary is built with both training and testing data; therefore, all words had word embeddings (WE) and no words were out-of-vocabulary (OOV). For both the embedding and the hidden layers in the AC-CNN models, rectified linear unit (ReLU) activation functions are applied. For training, we use the Adam optimizer, with 3 epochs and the learning rate set to 0.001. Batch size is set to 50 for the sitcom dataset, and 5 for the finch and multiwoz data. For the AC-LSTM architecture, one LSTM layer was considered, with output size set to 90 for the TV sitcom dataset and to 50 for the finch and multiwoz datasets. To evaluate the models and compare the results, we computed the accuracy metric.
5.2 Evaluation of the Proposed Methods
The Repeat Last baseline achieves an accuracy of on the sitcom dataset, on finch, and on multibotwoz. From the results shown in Tables 6 and 7 we can see that, in all datasets, higher accuracy values can be achieved when two turns are considered, i.e., with a lookback window equal to two, than when only the last turn is considered. Although we also trained MLE-based models with lookback windows ranging from three to five, we do not present the results here because they did not improve over the accuracy of the models with a lookback window equal to two.
In general, almost all models performed at or below the baseline for the sitcom dataset, while much greater improvements in accuracy were seen in the finch dataset for several of the models. Regarding the multibotwoz dataset, the models that used only the agent information could not beat the baseline, and the CNN actually had the lowest performance. On the other hand, the agent-and-content-based CNN model had the best performance compared to all other models and datasets. We present in Table 8 the difference in percentage points between the Repeat Last baseline and the results with a lookback window equal to two. Unfortunately, with these results a chatbot cannot determine which approach to use (baseline or learned model) based on the type of conversation (chit-chat or topic-oriented) and the size of the dialogue, because the results for the sitcom dataset compared to the baseline were not statistically significant. However, the results show that: (i) the size of the corpus has a very positive impact on the accuracy of the content-based deep learning approaches, and those models perform best on the larger datasets, since the results were statistically significant over the baseline for the multibotwoz dataset; and (ii) if the dialogue dataset is small and topic-oriented (but with few topics), which is the case of the finch dataset, it is sufficient to use an agent-only MLE or SVM model, although slightly higher accuracies can be achieved by using the content of the utterances with a CNN model.
6 Conclusions and Future Work
In this paper we investigated the application of machine learning (ML) techniques to learn multi-party turn-taking models from three dialogue corpora.
We presented results which indicate that if the dialogue dataset is small and topic-oriented (but with few topics), it might be sufficient to use an agent-only MLE or SVM model, although slightly higher accuracies can be achieved by using the content of the utterances with a CNN model. However, if the dialogue dataset is bigger, our results indicate that an agent-and-content CNN model performs best, albeit almost at the level of a very simple baseline model that simply uses the speaker before the last as its prediction.
Further studies could be done in order to find the best number of clusters for the clustering algorithm in the AC-MLE approach. As future work, online and reinforcement learning could also be tested, so a chatbot would be able to learn turn-taking during interaction, enabling a self-adaptive behavior on the turn-taking model. Finally, research on syntactic variations of the speaker to be used for prediction could be done.
-  I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, A. Courville, J. Pineau, and Y. Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI Conference, 2017.
-  Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium, 2018. ACL.
-  C. S. Pinhanez, H. Candello, M. C. Pichiliani, M. Vasconcelos, M. Guerra, M. Gatti de Bayser, and Paulo Cavalin. Different but equal: Comparing user collaboration with digital personal assistants vs. teams of expert agents. 2018. arXiv:1808.08157.
-  Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson. A simplest systematics for the organization of turn-taking for conversation. Language, 50(4):696–735, 1974.
-  J. W. Harris and H. Stocker. Maximum likelihood method. page 824, 1998.
-  Ürün Doǧan, Tobias Glasmachers, and Christian Igel. A unified view on multi-class support vector classification. J. Mach. Learn. Res., 17(1):1550–1831, January 2016.
-  Yann Guermeur. A generic model of multi-class support vector machine. Int. J. Intell. Inf. Database Syst., 6(6):555–577, October 2012.
-  Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. Natural language processing (almost) from scratch. CoRR, abs/1103.0398, 2011.
-  Pengfei Liu, Shafiq R. Joty, and Helen M. Meng. Fine-grained opinion mining with recurrent neural networks and word embeddings. In Lluis Marquez, Chris Callison-Burch, Jian Su, Daniele Pighin, and Yuval Marton, editors, EMNLP, pages 1433–1443. The ACL, 2015.
-  M. Gatti de Bayser, C. Pinhanez, H. Candello, M. A. Vasconcelos, M. Pichiliani, M. Alberio Guerra, P. Cavalin, and R. Souza. Ravel: a mas orchestration platform for human-chatbots conversations. In The 6th International Workshop on Engineering Multi-Agent Systems (EMAS @ AAMAS 2018), Stockholm, Sweden, 2018.
-  M. Gatti de Bayser, M. Alberio Guerra, P. Cavalin, and C. Pinhanez. Specifying and implementing multi-party conversation rules with finite-state-automata. In Proc. of the AAAI Workshop On Reasoning and Learning for Human-Machine Dialogues 2018, New Orleans, USA, 2018.
-  Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse, 9(1), 2018.
-  Micha Elsner and Eugene Charniak. You talking to me? a corpus and algorithm for conversation disentanglement. In Proceedings of Association for Computational Linguistics (ACL), 2008.
-  Samira Shaikh, Tomek Strzalkowski, George Aaron Broadwell, Jennifer Stromer-Galley, Sarah M. Taylor, and Nick Webb. Mpc: A multi-party chat corpus for modeling social phenomena in discourse. In The LREC, 2010.
-  Chun-Yen Chen, Dian Yu, Weiming Wen, Yi Mang Yang, Jiaping Zhang, Mingyang Zhou, Kevin Jesse, Austin Chau, Antara Bhowmick, Shreenath Iyer, Giritheja Sreenivasulu, Runxiang Cheng, Ashwin Bhandare, and Zhou Yu. Gunrock: Building a human-like social bot by leveraging large scale real user data. In 2nd Proceedings of Alexa Prize (Alexa Prize 2018), 2018.
-  Hiroki Ouchi and Yuta Tsuboi. Addressee and response selection for multi-party conversation. In Kevin Duh Jian Su, Xavier Carreras, editor, EMNLP, pages 2133–2143. The ACL, 2016.
-  D. C. Uthus and D. W. Aha. The ubuntu chat corpus for multiparticipant chat analysis. In AAAI Spring Symposium on Analyzing Microtext, pages 99–102, 2013.
-  Ryan Lowe, Nissan Pow, Iulian V. Serban, and Joelle Pineau. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. pages 285–294, 2015.
-  J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5-th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
-  Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1):37 – 52, 1987. Proceedings of the Multivariate Statistical Workshop for Geologists and Geochemists.