The use of embodied conversational AI agents as user interfaces is a growing research area, and various efforts have been made to train naturalistic dialog systems on human-to-human interactions (Sordoni et al., 2015b; Wen et al., 2015; Li et al., 2015). Neural methods trained on large datasets have advanced the generation of meaningful responses in chit-chat dialog. However, the responses generated by these models are frequently vague and inconsistent (Serban et al., 2016; Li et al., 2016b; Vinyals and Le, 2015).
Recent advances in language models have enabled the personalization of dialog systems by incorporating the persona of the user as embeddings alongside the dialog history (Li et al., 2016a; Zhang et al., 2018; Song et al., 2019). The persona has been explored as a user profile, encoding interaction style, language behavior, and background facts in multiple sentences.
For persona-based conversational dialog, generative and ranking models have been trained on corpora of Twitter conversations (Sordoni et al., 2015b), dialog sets comprising TV series scripts (Crane and Kauffman, 1994; Lorre and Prady, 2007; IMSD, 2014), and conversations collected from crowdsourced workers (Zhang et al., 2018).
Agent migration has been explored through a variety of prior works (Imai et al., 1999; Aylett et al.; LIREC, 2009; Duffy et al., 2003). It was framed as a process in which an agent migrates from one embodiment to another while maintaining the same relationship with the user across embodiments. Recently, Migratable AI (Tejwani et al., 2020) explored the migration of conversational AI agents across different robots while measuring users' perception of two elements of migration: identity and information migration. It was found that users reported the highest trust, competence, likability, and social presence towards the AI agent when both the identity and the information of the agent were migrated to the robots. The identity of the agent was defined as the same visual and voice characteristics, and the information as the utterances from the dialog history, which included both personal and non-personal information about the user.
Various state-of-the-art conversational datasets have also been introduced in the past, such as the Google Taskmaster dataset (Byrne et al., 2019), the Alexa Prize Topical Chat dataset (Gopalakrishnan et al., 2019), the MultiWOZ datasets (Budzianowski et al., 2018; Eric et al., 2019), the Dialog State Tracking Challenge datasets (Williams et al., 2013; Henderson et al., 2014a; Henderson et al., 2014b), and the SpaceBook corpus (Vlachos and Clark, 2014). However, they have been limited to transactional communications, such as airline or restaurant booking, or to Twitter chit-chat. There is a need for a dataset that explores the migratable elements of a conversational AI agent as it migrates across different embodiments. Such a dataset could be used to train migratable AI agents to contextually learn when to deliver personal and non-personal information of the user during dialog conversations in private and public settings.
In this work, we take a step forward from Migratable AI and extend the information migration of the agent to contextual information migration while maintaining identity migration. We define the migration context analogously to the persona in the literature.
The migration context of a dialog conversation can be viewed as the type of user information in the utterances (personal or non-personal) together with the type of embodiment into which the agent migrates (public or private). The goal is to train the agent to generate utterances during the dialog conversation based on the migration context; i.e., the agent should not present the personal information of the user when migrated into a public embodiment, and vice versa.
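This constraint can be sketched as a simple predicate over the dataset's utterance labels ("P"/"NP") and embodiment settings. The function name and signature are illustrative choices of ours, not part of the trained models, and the stricter symmetric reading of "vice versa" (non-personal information only in public settings) would tighten the predicate further.

```python
def may_deliver(info_type: str, setting: str) -> bool:
    """Return True if an utterance with the given label may be
    delivered in the given embodiment setting.

    info_type: "P" (personal) or "NP" (non-personal)
    setting:   "public" or "private"
    """
    # Personal information must never surface in a public embodiment;
    # non-personal information is treated as safe in either setting.
    return not (info_type == "P" and setting == "public")
```

A model conditioned on the migration context should, in effect, learn this predicate from the labeled dialogs rather than have it hard-coded.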
This paper offers the following contributions. First, we present a migration dataset collected from dialog conversations between crowdsourced workers using the migration context. Second, we demonstrate the personalization of dialog conversations between the migrated agent and the user by training generative and information retrieval models on the migration dataset. Lastly, we evaluate these models with qualitative metrics and human evaluation, both with and without the migration context, and report the results.
2. Related Work
Advancements in dialog systems have been made towards goal-oriented dialog models with predefined user intents and labeled internal dialog states, using data-driven methods that provide the probabilistic state distributions needed by partially observable Markov decision process (POMDP) dialog managers (Young et al., 2013; Young et al., 2010) or belief trackers based on recurrent neural networks (Henderson et al., 2013; Mrkšić et al., 2015). Several conversational datasets (Byrne et al., 2019; Gopalakrishnan et al., 2019; Williams et al., 2013) are available for training dialog systems in chit-chat settings, but they are mainly concerned with transactional goals such as restaurant booking, flight booking, or weather queries.
Neural approaches to language modeling have seen growing interest in response generation, such as recurrent neural network (RNN) frameworks for generating responses on microblogging websites (Ritter et al., 2011; Shang et al., 2015; Sordoni et al., 2015a). Similarly, sequence to sequence models were applied to dialog systems to produce novel responses, but they lacked a consistent personality (Li et al., 2016a) because they were trained over many dialogs across several users.
Personalization in goal-oriented dialog systems was explored by (Joshi et al., 2017; Lucas et al., 2009), focusing on user profile information to model speech style and to personalize reasoning over knowledge bases. The persona-based neural conversation model (Li et al., 2016a), trained on chit-chat conversations from a Twitter corpus, showed that user personas could be encoded into sequence to sequence models (Vinyals and Le, 2015) by embedding the user's persona, such as background information and speaking style, along with the dialog history.
The most relevant work is (Zhang et al., 2018), which contributed the Persona-Chat dataset by collecting persona information of users from crowdsourced workers in the form of multiple sentences. They encoded its vector representation along with the dialog history into generative models to produce personalized responses.
Prior work has explored the concept of agent migration through identity migration (Aylett et al.; LIREC, 2009; Duffy et al., 2003) and information migration architectures (LIREC, 2009; Ho et al., 2009; Ono et al., 2000). Agent migration was explored as a process in which an agent could migrate from one embodiment to another while maintaining the same relationship with the user across embodiments. For identity migration, the identity of the agent, such as appearance, voice, and dynamics of motion, was migrated across embodiments; information migration was explored as short-term memory (STM) to maintain the agent's current focus and long-term memory (LTM) to let the agent interact with users over a long period of time.
Migratable AI (Tejwani et al., 2020) explored the migration of conversational AI agents across different robots while measuring users' perception of identity and information migration. The identity of the agent was defined as the same visual and voice characteristics, and the information as the utterances from the dialog history. Since the migrated information included both the personal and non-personal information of the user, the agent used personal information in dialog conversations even when migrated into a public embodiment; user reactions to this were reported in (Tejwani et al., 2020). Hence, we further develop information migration into contextual information migration along the lines of (Tejwani et al., 2020; Zhang et al., 2018), by modeling both the personal and non-personal utterances in dialog conversations across the different settings (public or private) of the embodiments.
[Figure: Health Center Reception; Health Care Professional's Room]
3. Migration Dataset
The migration dataset is a crowd-sourced dataset, collected via Amazon Mechanical Turk (AMT), in which each pair of users conditioned their dialog on the instructions provided in a task-based scenario. The users responsible for carrying out these tasks are referred to as AMT workers.
3.1. Task based scenario
The participatory design work by Luria et al. on re-embodiment and co-embodiment (Luria et al., 2019) explored futuristic scenarios with participants using the concept of "speed dating" (Zimmerman and Forlizzi, 2017). They crafted and piloted four user enactments (DMV, Home and Work, Health Center, and Autonomous Cars) over the course of a month, in which a person might interact with multiple agents that can re-embody and co-embody.
In our research, we explore and further extend the Health Center scenario, which involves the user acting out a visit to a health center to evaluate recovery from an injury. The scenario would begin at the user's home, where the personal agent would get to know them by asking personal and non-personal questions and remind them that it was time for their medical appointment. Upon arrival at the health center reception, a public setting, the personal agent (migrated onto a receptionist robot) would greet the user, acknowledge their appointment, and escort them to the health care professional's room. In the health care professional's room, a private setting, the personal agent (migrated onto a smart TV) would further assist the user while waiting for the health care professional. This would allow us to explore the migration of the conversational agent in more sensitive settings and to address issues of context-crossing agents, privacy, and data storage perceptions.
Table 2: Descriptive statistics of the migration dataset.
|Number of instances|1014|
|Number of dialogs|92|
|Number of MRs|402|
3.2. Migration modes
For each dialog, we paired two AMT workers who were randomly assigned to one of the migration modes:
With migration context: The AMT workers were restricted in how much of the information they had learned about each other they could share. They could share personal information only in a private setting and non-personal information in a public setting.
Without migration context: The AMT workers were not restricted in the information they had learned about each other; they could share both personal and non-personal information across private and public settings.
3.3. Migration chat
From the task-based scenario, we instructed the AMT workers to carry out dialog conversations. One worker was instructed to enact the role of a person who had suffered an injury in the past and needs to visit the health center for a medical appointment. The other worker was instructed to guide the person by enacting different roles across locations: a friend at home, a receptionist at the health center reception, and a helper in the health care professional's room. At home, as a friend, his task was to get to know the person and inquire about the injury and upcoming appointment by asking a few personal and non-personal questions. At the health center reception, a public setting, as a receptionist, he was instructed to greet the person for the appointment and acknowledge him using information from the previous interaction at home (based on the migration mode). In the health care professional's room, a private setting, he was further instructed to assist the person while waiting for the health care professional (based on the migration mode).
In an early investigation of the study, we found that AMT workers asked similar questions of each other, so we added the instruction that they must ask at least two personal and two non-personal questions. At the end of each dialog conversation, the AMT workers were instructed to label each utterance in the dialog with "NP" for non-personal information or "P" for personal information. An example dialog from the dataset is shown in Table 1. We required each dialog to be between 8 and 10 turns long.
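The collection constraints above (8 to 10 turns, every utterance labeled) can be checked mechanically. The `(speaker, utterance, label)` record format below is a hypothetical schema chosen for illustration, not the dataset's actual storage format.

```python
def validate_dialog(turns):
    """Check a collected dialog against the collection constraints:
    between 8 and 10 turns, and every utterance labeled "P" or "NP".

    `turns` is a list of (speaker, utterance, label) tuples.
    Returns a list of human-readable error strings (empty if valid).
    """
    errors = []
    if not 8 <= len(turns) <= 10:
        errors.append(f"dialog has {len(turns)} turns, expected 8-10")
    for i, (_, _, label) in enumerate(turns):
        if label not in ("P", "NP"):
            errors.append(f"turn {i}: invalid label {label!r}")
    return errors
```

Such a check could be run before accepting a HIT, so that mislabeled or too-short dialogs are rejected at collection time.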
3.4.1. Descriptive Statistics
The descriptive statistics for the dataset are summarized in Table 2. Here, the number of instances is the total number of utterances in the dataset, the number of dialogs is the total number of dialog conversations, and the number of MRs is the number of distinct meaning representations (MRs). Refs/MR is the number of natural language references per MR, Words/Ref is the average number of words per reference, Slots/MR is the average number of slot-value pairs per MR, Sentences/Ref is the number of natural language sentences per reference, and Words/Sentence is the average number of words per sentence. We split the dataset in an 80:20 ratio into training and test sets.
3.4.2. Lexical Richness
We used the Lexical Complexity Analyser (Lu, 2012) to compute lexical richness, highlighted in Table 3. Along with the total number of tokens and types, we computed the type-token ratio (TTR) and the mean segmental TTR (MSTTR) by dividing the dataset into segments of a defined length (100) and then averaging the TTR over the segments, as described in (Lu, 2012).
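Both measures follow standard definitions and can be sketched directly; this is an illustration of the computation, not the Lexical Complexity Analyser itself.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct types divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def msttr(tokens, segment_length=100):
    """Mean segmental TTR: split the token stream into fixed-length
    segments, compute the TTR of each segment, and average.
    A trailing partial segment is discarded, as is standard."""
    segments = [tokens[i:i + segment_length]
                for i in range(0, len(tokens) - segment_length + 1,
                               segment_length)]
    return sum(ttr(s) for s in segments) / len(segments)
```

MSTTR is preferred over raw TTR for comparing corpora of different sizes, since TTR systematically decreases as a text grows longer.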
4. Models
Models were trained on utterances from the dialog history (and, optionally, the migration context) to generate the response word by word. The migration context consists of the type of user information, labeled Personal (P) or Non-Personal (NP), and the type of setting of the dialog conversation, Private or Public. Previously, (Zhang et al., 2018; Li et al., 2016a) explored ranking and generative models for persona-based chat personalization by training on the persona of the user, a short profile encoded in multiple sentences, along with the dialog history. We condition those training methods on the migration context instead of the persona, as described below.
Table 5: Human evaluation scores, mean (SD). The first two columns report human performance; the last two report model performance.
| |Human: No Migration Context|Human: Migration Context|Model: No Migration Context|Model: Migration Context|
|Fluency|3.12 (1.54)|4.22 (0.76)|2.22 (1.20)|2.84 (1.33)|
|Engagingness|3.66 (1.41)|4.62 (1.31)|2.41 (1.35)|2.87 (1.04)|
|Consistency|2.80 (1.07)|4.41 (1.87)|2.12 (0.96)|3.16 (1.21)|
4.1. Sequence to Sequence Model
Given a sequence of inputs $X = \{x_1, x_2, \ldots, x_{n_X}\}$, an LSTM encodes it by applying:
$$h_t = \mathrm{LSTM}(h_{t-1}, e_t)$$
For word embedding vectors, we used GloVe (Pennington et al., 2014). The vector for each text unit, such as a word or a sentence, at time step $t$ is denoted by $e_t$, and the vector computed by the LSTM model at time $t$ by combining $e_t$ and $h_{t-1}$ is denoted by $h_t$. Each input sequence $X$ is paired with a sequence of outputs $Y = \{y_1, y_2, \ldots, y_{n_Y}\}$ to predict. The softmax function for the distribution over outputs is defined as:
$$p(y_t \mid y_1, \ldots, y_{t-1}, X) = \mathrm{softmax}(W h_t + b)$$
The activation function is denoted by $\sigma$, and the model is trained through the negative log-likelihood. In order to include the migration context for personalization, we generate its vector representation $c$ and prepend it to the input sequence $X$, i.e., $X' = c \,\Vert\, X$, where $\Vert$ denotes concatenation.
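In code, conditioning on the migration context amounts to prepending context tokens to the encoder input before embedding. The pseudo-token format below is our own illustrative choice (as is the whitespace tokenizer), not the dataset's actual vocabulary.

```python
def build_model_input(context, history):
    """Prepend migration-context tokens to the dialog history tokens,
    mirroring X' = c || X, where || denotes concatenation.

    context: e.g. {"info_type": "NP", "setting": "public"}
    history: list of utterance strings (most recent last)
    """
    # Encode the migration context as pseudo-tokens; sorting the keys
    # keeps the token order deterministic across examples.
    context_tokens = [f"<{k}:{v}>" for k, v in sorted(context.items())]
    history_tokens = [tok for utt in history for tok in utt.split()]
    return context_tokens + history_tokens
```

The pseudo-tokens would be added to the model's vocabulary so that their embeddings are learned jointly with the word embeddings.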
4.2. Generative Profile Memory Network
We consider the generative model, the Generative Profile Memory Network (Zhang et al., 2018), and encode the migration context as individual memory representations in a memory network. The dialog history is encoded via an LSTM whose final state is used as the initial hidden state of the decoder. Each entry $c_i = \langle w_{i,1}, \ldots, w_{i,n} \rangle$ of the migration context is encoded via
$$f(c_i) = \sum_j \alpha_j w_{ij}$$
a weighted combination of the word embeddings, with weights computed using the inverse term frequency: $\alpha_j = 1/(1 + \log(1 + \mathrm{tf}_j))$, where $\mathrm{tf}_j$ is obtained from the GloVe index via Zipf's law (Zhang et al., 2018).
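The entry encoding can be sketched in plain Python. The weight formula follows the inverse term-frequency weighting of (Zhang et al., 2018); the embedding vectors and term frequencies below are toy values for illustration.

```python
import math

def itf_weights(term_freqs):
    """Inverse term-frequency weights: alpha_j = 1 / (1 + log(1 + tf_j)),
    so frequent words contribute less to the entry representation."""
    return [1.0 / (1.0 + math.log(1.0 + tf)) for tf in term_freqs]

def encode_entry(word_vectors, term_freqs):
    """Encode one migration-context entry as the weighted sum of its
    word embeddings, f(c_i) = sum_j alpha_j * w_ij."""
    alphas = itf_weights(term_freqs)
    dim = len(word_vectors[0])
    return [sum(a * w[d] for a, w in zip(alphas, word_vectors))
            for d in range(dim)]
```

Down-weighting frequent words keeps common function words from dominating the memory representation of an entry.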
The decoder then attends over the encoded migration context entries by computing the mask, context, and next input as:
$$m_t = \mathrm{softmax}(F W_a \hat{h}_t), \qquad c_t = m_t^{\top} F, \qquad \hat{x}_{t+1} = \tanh\!\left(W_c [c_{t-1}, x_{t+1}]\right),$$
where $F$ is the set of encoded memories.
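The attention step over the encoded memories can be sketched numerically. For simplicity, this toy version omits the learned projection $W_a$ (taking it as the identity), so it illustrates the mechanism rather than reproducing the trained model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(memories, query):
    """Attend over encoded memories F given a decoder query:
    mask m = softmax(F q), context c = m^T F (W_a taken as identity)."""
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query))
              for f in memories]
    mask = softmax(scores)
    dim = len(memories[0])
    context = [sum(m * f[d] for m, f in zip(mask, memories))
               for d in range(dim)]
    return mask, context
```

The resulting context vector is what gets combined with the next input token embedding before the decoder's next step.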
When the migration context is not enabled for the model (i.e., there is no memory), it reduces to the Sequence to Sequence model.
4.3. Information Retrieval (Starspace)
We consider an information retrieval based supervised embedding model, Starspace (Wu et al., 2018). The Starspace model consists of entities described by features, such as a bag of words; here, an entity is an utterance described by its n-grams. The model assigns a d-dimensional vector to each of the unique features in a dictionary, and the embeddings of the entities are learned implicitly from these feature embeddings.
The model performs information retrieval by computing the similarity between the word embeddings of the dialog conversation and the next utterance, trained using negative sampling and a margin ranking loss (Zhang et al., 2018). The similarity function $\mathrm{sim}(q, c)$ is defined as the cosine similarity of the sum of word embeddings of the query $q$ and a candidate $c$. The dictionary of word embeddings is a $D \times d$ matrix $W$, where row $W_i$ is the $d$-dimensional embedding of the $i$-th feature; the embeddings of the sequences $q$ and $c$ are the sums of the embeddings of their features.
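A minimal sketch of this retrieval step, with toy embeddings standing in for the learned Starspace features (it assumes non-zero bag-of-words vectors):

```python
import math

def embed(tokens, embeddings):
    """Bag-of-words entity embedding: the sum of its feature vectors."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for t in tokens:
        for d, x in enumerate(embeddings[t]):
            vec[d] += x
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_candidates(query, candidates, embeddings):
    """Score each candidate utterance by cos(sum-emb(query),
    sum-emb(candidate)) and return candidates sorted best-first."""
    q = embed(query, embeddings)
    return sorted(candidates,
                  key=lambda c: cosine(q, embed(c, embeddings)),
                  reverse=True)
```

At inference time the top-ranked candidate is returned as the next utterance; the migration context tokens would simply be part of the query entity.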
We evaluate the task using the following metrics: (i) the F1 score evaluated at the word level, (ii) perplexity for the log-likelihood of the correct utterance, and (iii) hits@1, the probability that the model ranks the correct candidate utterance first. In our setting, we choose n = 12 for the number of input responses from the dialogs for the prediction.
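The word-level F1 and hits@1 metrics can be sketched as follows; perplexity is omitted since it depends on the model's token probabilities. These are standard definitions, not the exact evaluation scripts used.

```python
from collections import Counter

def word_f1(prediction, reference):
    """Word-level F1 between a predicted and a reference utterance,
    using multiset overlap of whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def hits_at_1(ranked_lists, gold_ids):
    """Fraction of examples where the gold candidate is ranked first.
    `ranked_lists` holds candidate ids best-first per example."""
    hits = sum(1 for ranking, gold in zip(ranked_lists, gold_ids)
               if ranking[0] == gold)
    return hits / len(ranked_lists)
```

Hits@1 is the natural metric for the retrieval model, while F1 and perplexity are more informative for the generative models.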
We report results from a qualitative analysis of the models and from a human evaluation of the models performed by crowdsourced workers.
5.2.1. Qualitative Analysis
The performance of the models is reported in Table 4. The generative models improved significantly when the prediction of the utterance was conditioned on the migration context; for example, the Sequence to Sequence model and the Generative Profile Memory Network (GPMN) improved their perplexity and hits@1. However, the information retrieval model, Starspace, did not show a similar trend. This may be because the word-level probabilities used by the generative models are not directly comparable to the sentence-level ranking scores used by the IR model.
5.2.2. Human Analysis
We performed a human evaluation using crowdsourced workers, since the qualitative analysis comes with several weaknesses in its evaluation metrics (Liu et al., 2016). We followed a procedure similar to the data collection process described in Section 3.3: we paired two AMT workers to carry out a dialog conversation, randomly assigned to one of the conditions, with or without migration context. Here, we replaced one of the AMT workers with our model; the remaining worker was not told that their conversational partner was a model.
After the dialog conversation, we asked the crowdsourced workers a few additional questions in order to evaluate the model. We asked them to give scores from 1 to 5 on fluency, engagingness, and consistency, following (Zhang et al., 2018).
The results are reported in Table 5. We used the Sequence to Sequence model for the evaluation of 10 dialog conversations, both with and without the migration context. As a baseline, we also evaluated human performance by replacing the model with another worker. We observed that all measures, fluency, engagingness, and consistency, were significantly higher for both the model and human performance when the migration context was used. We also observed that the overall measures (both with and without migration context) were higher for human performance than for the model, which could be due to the linguistic difference in sentence generation from the tokens predicted by the model.
6. Conclusion
In this work, we introduced the migration dataset, which consists of crowdsourced dialog conversations between participants in a task-based migration scenario. In the dataset, we explored the information migration of the migrated agent using a migration context over the personal and non-personal utterances in the dialog history across the different settings (public or private) of the embodiment into which the agent migrates. We trained generative and information retrieval models on the dataset and report that the generative models improve when the prediction of the utterance is conditioned on the migration context. We also performed a human evaluation on the dataset and found that participants reported higher fluency, engagingness, and consistency for both the models and the human performance when the migration context was used.
We believe that the migration dataset will be useful for training future migratable AI systems to personalize dialog during the migration of a conversational AI agent across different devices.
7. Future Work
This paper described the work that supports the migration of conversational AI agents across multiple embodiments while maintaining the relationship with the user. Naturally, this is only a beginning, and there remain many exciting research challenges that can be addressed, based on the results of this paper.
In the present system, a single conversational AI assistant was migrated across different embodiments. In the future, this can be extended to multiple conversational AI assistants, with the user choosing which assistant to migrate onto a given embodiment.
The dialog conversations between the conversational AI assistant and the user were limited in the number of turns at each embodiment due to limitations of MTurk. Hence, only a limited amount of time was spent building a relationship between the agent and the user. In the future, this can be extended to long-term interactions by recording dialog conversations between the agent and the user spanning multiple days, after which the relationship with the migrated agent could be analyzed to investigate the effect on user perceptions.
It is crucial to consider the ethical implications of the conversational agent's migration, because both the agent's identity and the dialog history between the user and the agent are migrated across different embodiments. In terms of system security, for user authentication, the conversational AI assistant was designed to migrate onto a new embodiment based on detecting the user's face through the embodiment's front-facing camera. In the future, the user could control the activation of the agent's migration using a secure mechanism such as an RFID ring or NFC-enabled wearables.
In terms of data privacy and security, the dialog conversations between the user and the migrated agent across different devices were transcribed and recorded in a private and secure cloud database. This was designed to train the machine learning models on the recorded conversations and improve the behavior of the migrated agent. In the future, the system could be extended to allow users to store their dialog conversations either in their private data network or in local wearable device memory.
The migration dataset could be useful for training future migratable AI systems. It could be extended to include more dialog conversations between participants in order to improve the accuracy of the models. Human performance was evaluated only against the sequence to sequence model; in future work, different recurrent neural network (RNN) and long short-term memory (LSTM) techniques could be explored for both the model and human evaluations.
Today, we are surrounded by conversational AI agents such as Alexa, Jibo, Google Home, and Siri in different spaces (private or public). They do not share the user context with each other or with functional robots such as Pepper, Kuri, or Care-E. To have a seamless conversation across these devices, they must be connected to a single platform, presently owned by corporations such as Google, Apple, or Amazon. Therefore, this paper poses a question to the research community: what if the conversational AI agent could be the same agent, migrating across different platforms and devices while maintaining the continuity of interaction and the relationship with the user, like a spiritual companion accompanying you everywhere during daily interactions with different robots and devices? This paper further attempts to push the boundary by modeling the migrated agent's behavior in different spaces (private or public) to address data privacy and trust between the user and the agent.
- Body-hopping: migrating artificial intelligent agents between embodiments. Cited by: §1, §2.
- MultiWOZ: a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278. Cited by: §1.
- Taskmaster-1: toward a realistic and diverse dialog dataset. In 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Hong Kong. Cited by: §1, §2.
- Friends. Note: https://en.wikipedia.org/wiki/Friends Cited by: §1.
- Agent chameleons: agent minds and bodies. In Proceedings 11th IEEE International Workshop on Program Comprehension, pp. 118–125. Cited by: §1, §2.
- Multiwoz 2.1: multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669. Cited by: §1.
- Topical-chat: towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pp. 1891–1895. Cited by: §1, §2.
- The second dialog state tracking challenge. In Proceedings of the 15th annual meeting of the special interest group on discourse and dialogue (SIGDIAL), pp. 263–272. Cited by: §1.
- The third dialog state tracking challenge. In 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 324–329. Cited by: §1.
- Deep neural network approach for the dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pp. 467–471. Cited by: §2.
- An initial memory model for virtual and robot companions supporting migration and long-term interaction. In RO-MAN 2009-The 18th IEEE International Symposium on Robot and Human Interactive Communication, pp. 277–284. Cited by: §2.
- Agent migration: communications between a human and robot. In IEEE SMC’99 Conference Proceedings. 1999 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No. 99CH37028), Vol. 4, pp. 1044–1048. Cited by: §1.
- The internet movie script database. Note: http://www.imsdb.com Cited by: §1.
- Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503. Cited by: §2.
- A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055. Cited by: §1.
- A persona-based neural conversation model. arXiv preprint arXiv:1603.06155. Cited by: §1, §2, §2, §4, §5.1.
- Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541. Cited by: §1.
- How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023. Cited by: §5.2.2.
- The big bang theory. Note: https://en.wikipedia.org/wiki/The_Big_Bang_Theory Cited by: §1.
- The relationship of lexical richness to the quality of esl learners’ oral narratives. The Modern Language Journal 96 (2), pp. 190–208. Cited by: §3.4.2.
- Managing speaker identity and user profiles in a spoken dialogue system. Procesamiento del lenguaje natural (43), pp. 77–84. Cited by: §2.
- Re-embodiment and co-embodiment: exploration of social presence for robots and conversational agents. In Proceedings of the 2019 on Designing Interactive Systems Conference, DIS '19, New York, NY, USA, pp. 633–644. Cited by: §3.1.
- Multi-domain dialog state tracking using recurrent neural networks. arXiv preprint arXiv:1506.07190. Cited by: §2.
- Reading a robot’s mind: a model of utterance understanding based on the theory of mind mechanism. Advanced Robotics 14 (4), pp. 311–326. Cited by: §2.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §4.1.
- Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 583–593. Cited by: §2.
- Generative deep neural networks for dialogue: a short review. arXiv preprint arXiv:1611.06216. Cited by: §1.
- Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364. Cited by: §2.
- Exploiting persona information for diverse generation of conversational responses. arXiv preprint arXiv:1905.12188. Cited by: §1.
- A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 553–562. Cited by: §2.
- A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714. Cited by: §1, §1.
- Migratable ai. arXiv preprint arXiv:2007.05801. Cited by: §1, §2.
- A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §1, §2.
- A new corpus and imitation learning framework for context-dependent semantic parsing. Transactions of the Association for Computational Linguistics 2, pp. 547–560. Cited by: §1.
- Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745. Cited by: §1.
- The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pp. 404–413. Cited by: §1, §2.
- LIREC. Note: http://lirec.eu Cited by: §1, §2.
- Starspace: embed all the things! In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §4.3.
- The hidden information state model: a practical framework for pomdp-based spoken dialogue management. Computer Speech & Language 24 (2), pp. 150–174. Cited by: §2.
- Pomdp-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101 (5), pp. 1160–1179. Cited by: §2.
- Personalizing dialogue agents: i have a dog, do you have pets too?. arXiv preprint arXiv:1801.07243. Cited by: §1, §2, §2, §4.2, §4.2, §4.3, §4, §5.1, §5.2.2.
- Speed dating: providing a menu of possible futures. She Ji: The Journal of Design, Economics, and Innovation 3 (1), pp. 30–50. Cited by: §3.1.