A conversation system is one of the most essential modules for intelligent robots. Current intelligent robots are already capable of receiving prosodic signals from humans and giving appropriate verbal responses with techniques proposed in [13, 12, 30, 10, 40], but few of them can express nonverbal feedback while listening, or present variable body gestures according to the verbal responses while speaking, which makes the human-robot communication seem unnatural.
The authors in  enable the robots to exhibit body gestures during conversations. However, the possible options are pre-defined and limited, which means that the robot motions are constrained. Moreover, as far as we know, there is a lack of work on synthesizing body gestures for robots during the listening phase of their conversations with humans.
Our research is based on the observations of conversational regularity illustrated in Fig. 1: In human-human communications, the two roles, listener and speaker, alternate between the two parties involved in the conversation. While the listener is listening, he/she receives utterance signals as well as nonverbal cues from the speaker and may give nonverbal feedback in the meantime. Their roles exchange when the previous speaker stops talking. Then the new speaker makes verbal responses with appropriate body gestures based on what he/she heard. This procedure carries on repeatedly until the conversation ends.
In this paper, we aim to enhance the abilities of intelligent agents such as virtual avatars and real-world humanoid robots with better comprehension and expression of body gestures. This will not only help them appear more expressive, intelligible and interactive, but also provide humans with a more natural communication experience with robots.
Inspired by the great success of sequence-to-sequence (seq2seq) network  in the sequence mapping problems, we propose a human-robot interaction system composed of seq2seq-based listening and speaking models for synthesizing body gestures. The listening model takes both the speaker’s verbal and nonverbal signals as input and generates body gestures as nonverbal feedback. The speaking model takes only the verbal response as input and generates body gestures as nonverbal accompaniment.
Speech recognition models convert utterances to text, which can be encoded as a vector sequence using word embedding algorithms [23, 22, 24, 25, 8]. As for body gestures, the models proposed in [4, 11, 35] extract 2D coordinates of body keypoints and transform them into 3D space. However, the coordinate representation includes much noise, so we develop keypoints rotation and normalization methods in the gesture parsing module to discard irrelevant information. To demonstrate the predictions of our system, the body gestures generated by the listening and speaking models are reconstructed on an avatar or a robot by the motion synthesis module.
The rest of the paper is organized as follows: Section II reviews related work on conversational systems and nonverbal expression synthesis models. The architecture of the proposed system is presented in Section III. In Section IV, the body gesture parsing process is described in detail. Section V introduces the seq2seq-based listening model and speaking model that realize body gesture generation. Experimental results are given and discussed in Section VI. Finally, Section VII gives the conclusion.
II Related Work
Conversational systems have been explored for many years. Early dialogue models like ELIZA  and PARRY  can already respond to relatively complicated questions based mainly on hand-crafted rules, which makes the conversation seem somewhat monotonous. The authors in  pioneered the formulation of this problem as language translation. However, they met some difficulties because the space of possible responses in conversation is much larger than that in language translation.
The authors in  developed the seq2seq framework, which was applied to machine translation and achieved excellent performance. Later, they introduced the same approach to conversational modeling in . In analogy to mapping a sentence from one language to another in machine translation, the conversational model maps a query sentence to a response sentence. Generally, the seq2seq framework uses an LSTM  layer to encode the input sentence to a vector of fixed dimensionality, and then another LSTM layer to decode the target sentence from the vector. This encoder-decoder architecture is widely used in sequence mapping problems like machine translation , conversation modeling  and even video description  due to its powerful capabilities.
A hierarchical recurrent encoder-decoder neural network was later extended to conversational modeling, with the LSTM units replaced by GRU units. The model was improved further in  by adding stochastic latent variables to generate more diverse and meaningful responses.
All the aforementioned methods involve only verbal information. However, nonverbal expressions like body gestures and facial movements are commonly used in human conversations. In recent years, facial gesture synthesis has gained much attention. One method reconstructs virtual heads resembling the subject's appearance from a single RGB image of a human. Another captured physical features from a large collection of photos of a person and reconstructed an avatar of Tom Hanks from the learned personal characteristics. A further approach makes the person in a target video reenact the facial expressions of another person captured with a webcam. Realistic facial expressions and lip sync for a talking avatar have also been synthesized from audio signals using an RNN. Furthermore, the authors in  fused facial cues into a conversation model based on the observation that the same sentence might have different meanings with different sentiments conveyed by facial gestures. They adopted an RNN encoder-decoder architecture to generate both verbal responses and facial expressions for a chatting avatar.
In comparison with facial gestures, research on body gesture synthesis lags behind. An exploratory analysis of body gesture interaction between humanoid robots and humans showed that arm movements play an important role in conversations. The authors in [19, 18] developed a system that synthesizes appropriate body gestures based on prosodic features extracted from real-world speech using a hidden Markov model (HMM). However, they aimed at enhancing the avatar’s performance in a virtual environment and thus paid little attention to human-robot conversational interactions. Another approach extends the communicative behavior of Nao, a humanoid robot, with a set of pre-defined body gestures. In [2, 3], the authors utilized coupled hidden Markov models (CHMM) to generate verbal responses accompanied by arm movements based on the human’s prosodic characteristics. In this paper, a body gesture interaction system is constructed based on the seq2seq network to provide more natural human-robot conversational experiences. In this system, the body gestures of the avatar or robot are driven by models trained with video data captured from real human-human conversations.
III System Overview
In this paper, we propose a novel human-robot interaction system with seq2seq-based body gesture generation models. As shown in Fig. 2, when the robot is communicating with a human, audio signals containing verbal information and RGB frames containing nonverbal information are input to our system. The audio is then transcribed to text using the speech recognition algorithm proposed in [13, 12], and the raw frames are processed by the body gesture parsing module, which is elaborated in the next section. The extracted text follows two routes: it is passed to the conversation model to generate the response sentence and, along with the parsed body gesture, it is input to the listening model. The listening model then predicts the output body keypoints sequence, which is used by the motion synthesis module to perform body gesture feedback while the robot is listening. When the human stops talking, the response sentence is transformed to prosodic signals using the text-to-speech (TTS) model proposed in [10, 40]. In the meantime, the speaking model generates a body keypoints sequence based on the response sentence. With the motion synthesis module, the robot is capable of giving both verbal and nonverbal responses.
IV Gesture Parsing
IV-A Dataset Overview
Our listening and speaking models needed a substantial amount of human-human conversation data to synthesize realistic body gestures. In addition, considering that behavioral habits differ among humans, we wanted one of the communicators to be fixed during the training process to ensure persona consistency.
We found that talk shows like The Ellen Show and The Tonight Show meet our requirements perfectly. We downloaded 2263 videos of The Ellen Show and 1978 videos of The Tonight Show from YouTube. All the collected videos were segmented into clips based on the audio signals. Using the speaker recognition method proposed in , we could distinguish the current speaker and cut the video at each role-exchange border. Then, the audio in each clip was transcribed to the corresponding text by the speech recognition module. As a result, we got a set of clip-text pairs with the ID of the recognized speaker included in the text. Finally, all the clips were processed by the gesture parsing module, which is composed of keypoints extraction, 3D transformation, rotation and normalization.
IV-B Keypoints Extraction
Benefitting from the outstanding performance of AlphaPose , the module could easily extract each person’s 2D keypoints from each frame. At first, we attempted to focus on 17 keypoints of the whole body. However, we discovered that the lower body was invisible in many videos. Moreover, even when the lower body appears, it usually remains still because people rarely change their positions while communicating with others. Consequently, ignoring the lower-body keypoints largely increases the number of usable videos without causing much deviation. Fig. 3(a) shows the extraction of 2D keypoints.
In the dataset preparation phase, we also discarded the clips that did not contain exactly two people. After that, we obtained 52403 clips of The Ellen Show and 51347 clips of The Tonight Show in total.
IV-C Keypoints Transformation from 2D to 3D
It was insufficient to reconstruct body gestures from 2D keypoints only. We obtained 3D keypoint coordinates by feeding the original RGB frame, together with the 2D coordinates extracted by AlphaPose, into the model proposed in . The corresponding 3D keypoint coordinates of each person were then estimated individually (see Fig. 3(b)).
In our experiments, twelve 3D keypoints were selected as follows: head top, neck, chest, belly, left and right shoulders, elbows, wrists and hips. Since we focused mainly on the body gestures, facial keypoints were simplified to head top and neck for locating the position of head.
After this step, each keypoints group was represented by a matrix $K = [k_1, \dots, k_{12}] \in \mathbb{R}^{3 \times 12}$, where $k_j = [x_j, y_j, z_j]^\top$ indicates the 3D coordinates of the $j$-th keypoint.
IV-D Keypoints Rotation
We recognized that similar body gestures might differ completely in their representations because of the camera perspective. To eliminate this effect, all the keypoints groups were rotated.
We developed our keypoints rotation algorithm based on the hypothesis that the shoulders on both sides are at the same height. In other words, the $y$-coordinates of the left and right shoulders are expected to be equal. This hypothesis proved reasonable: we analyzed all the clips and found that in most keypoints groups, the difference between the left and right shoulders’ $y$-coordinates was only a small fraction of the person's height.
Based on the above hypothesis, we rotated all the keypoints groups around the $y$-axis. Suppose that the $i$-th keypoints matrix is $K_i$, where the coordinates of the left and right shoulders are $k_{ls} = [x_{ls}, y_{ls}, z_{ls}]^\top$ and $k_{rs} = [x_{rs}, y_{rs}, z_{rs}]^\top$ respectively. It can be assumed that $y_{ls} = y_{rs}$ through the hypothesis. For the $i$-th keypoints group, we wanted $y'_{ls} = y'_{rs}$ and $z'_{ls} = z'_{rs}$ to be satisfied after the rotation by an angle $\theta_i$ anticlockwise (see Fig. 4). We first obtained the difference vector $d_i = k_{ls} - k_{rs}$ and then calculated $\theta_i$ as the angle between the $x$-axis and the projection of $d_i$ on the $xz$-plane. Once $\theta_i$ is determined, the $i$-th rotated keypoints matrix $K'_i$ is calculated by:

$$K'_i = R_y(\theta_i)\, K_i \tag{1}$$

where $R_y(\theta)$, the basic rotation matrix about the $y$-axis, is defined as:

$$R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \tag{2}$$
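The rotation step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: keypoints are stored as a 3 x 12 array with rows x, y, z, the vertical axis is taken to be y, and the shoulder column indices `L_SHOULDER` and `R_SHOULDER` are hypothetical.

```python
import numpy as np

# Hypothetical column indices of the shoulders in a 3 x 12 keypoints matrix.
L_SHOULDER, R_SHOULDER = 4, 5


def rotation_about_y(theta):
    """Basic rotation matrix about the y-axis (assumed to be the vertical axis)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def rotate_keypoints(K):
    """Rotate a 3 x 12 keypoints matrix so that both shoulders share the same z."""
    d = K[:, L_SHOULDER] - K[:, R_SHOULDER]  # shoulder difference vector
    theta = np.arctan2(d[2], d[0])           # angle of its xz-projection from the x-axis
    return rotation_about_y(theta) @ K
```

After the rotation, the shoulder line is parallel to the x-axis, so skeletons filmed from different camera angles share a common facing direction, while the y-coordinates are left unchanged.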
IV-E Keypoints Normalization
In most machine learning algorithms, data normalization is one of the most important steps in data processing. It standardizes different scales of features, which has been proved helpful in promoting convergence of neural networks.
The goal of keypoints normalization was to eliminate the effects of absolute body positions and scales under the principle that the same body gestures should have the same representations. We attempted three normalization methods in our experiments, and the comparison among them is elaborated in Section VI.
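IV-E1 Individual normalization
For the $i$-th rotated keypoints matrix $K'_i = [k'_1, \dots, k'_{12}]$, where $k'_j = [x'_j, y'_j, z'_j]^\top$, we denote $x_{\max}$ and $x_{\min}$ as the maximum and minimum of $x'_j$ in $K'_i$. $y_{\max}$, $y_{\min}$, $z_{\max}$ and $z_{\min}$ are defined similarly.
Using individual normalization, the $i$-th normalized keypoints matrix $\hat{K}_i$ was calculated by:

$$\hat{K}_i = (K'_i - M_i) \odot S_i \tag{3}$$

where $M_i$ is composed of 12 columns of $[x_{\min}, y_{\min}, z_{\min}]^\top$, $S_i$ is composed of 12 columns of $\left[\frac{1}{x_{\max} - x_{\min}}, \frac{1}{y_{\max} - y_{\min}}, \frac{1}{z_{\max} - z_{\min}}\right]^\top$ and $\odot$ denotes the element-wise multiplication of two matrices.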
Individual normalization brings all values into $[0, 1]$, which eliminates the effects of absolute body positions and scales. However, it has a severe disadvantage, which is easiest to explain in 2D space. As illustrated in Fig. 5, assume that the keypoints have been normalized in frame $t$. In frame $t+1$, the communicator raises his/her lower arm. We would like only the coordinates of ‘left wrist’ to change in the normalized keypoints matrices. However, under the rules of individual normalization, the $y$-coordinate of ‘left wrist’ stays at 1, while all the other, unmoved keypoints are squeezed. Input features that do not match the actual body movements make it harder for the network to converge.
IV-E2 Global normalization
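The drawback described above can be reproduced with a few lines of NumPy. The sketch below assumes a per-coordinate min-max normalization of a 3 x n keypoints matrix; the three-keypoint toy skeleton (hip, shoulder, wrist) and its numbers are invented purely for illustration.

```python
import numpy as np


def individual_normalize(K):
    """Min-max normalize each coordinate row of a 3 x n keypoints matrix to [0, 1]."""
    k_min = K.min(axis=1, keepdims=True)
    k_max = K.max(axis=1, keepdims=True)
    return (K - k_min) / (k_max - k_min)  # assumes no row is constant


# Toy skeleton; columns are hip, shoulder, wrist.
frame_t = np.array([[0.0, 0.0, 1.0],   # x
                    [0.0, 1.0, 2.0],   # y
                    [0.0, 1.0, 2.0]])  # z
frame_t1 = frame_t.copy()
frame_t1[1, 2] = 4.0                   # only the wrist moves upward

norm_t = individual_normalize(frame_t)
norm_t1 = individual_normalize(frame_t1)
```

In this toy case the wrist's normalized y-coordinate is 1 in both frames, while the shoulder, which did not move, drops from 0.5 to 0.25: the normalized features attribute the motion to the wrong keypoint.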
To address the issue in individual normalization, we tried global normalization:
However, global normalization cannot standardize the absolute scale, which means that the normalized values are proportional to the size of the body skeleton. Fig. 6 shows an example that may happen in global normalization: although two humans are performing the same body gesture, the normalized coordinates are unequal because of their body scales.
IV-E3 Vector normalization
The aforementioned two normalization methods are based on the coordinate representation. In our experiments, we found that the coordinate representation cannot reflect the essential body movements. In the example in Fig. 7, when a communicator raises his upper arm, the coordinates of both elbow and wrist change. However, the wrist’s movement is a ripple effect of the elbow’s. In other words, to make the robot imitate this body gesture, we just need to move its elbow. Therefore, the normalization method should emphasize the active keypoint movements and ignore the passive position changes.
For this purpose, we developed the vector normalization method. First, we transformed coordinates to a vector representation. As illustrated in Fig. 7, we set the belly as the center of the body, whose representation is fixed. Chest and hips are connected to belly, neck and shoulders are connected to chest, head top is connected to neck, elbows are connected to the same-side shoulders, and wrists are connected to the same-side elbows. As a result, each keypoint is represented by the vector from an adjacent keypoint to itself instead of its coordinates. Finally, we scaled all the vectors to unit length.
Vector normalization eliminates the effects of absolute body positions and scales. After standardizing the length of each vector, only the direction information remains. While reconstructing body gestures for the avatar, we defined a sensible length for each connection of adjacent keypoints. However, this is unnecessary for a real-world robot because the lengths of its limbs and trunk are already fixed.
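A minimal sketch of vector normalization is given below. The parent links follow the connections listed in the text; the keypoint ordering in `NAMES` and the choice to store the belly as a zero vector are assumptions made for this illustration only.

```python
import numpy as np

# Assumed column order of the 12 keypoints in a 3 x 12 matrix.
NAMES = ["belly", "chest", "neck", "head_top",
         "l_shoulder", "l_elbow", "l_wrist",
         "r_shoulder", "r_elbow", "r_wrist",
         "l_hip", "r_hip"]

# child -> parent connections as described in the text.
PARENT = {"chest": "belly", "l_hip": "belly", "r_hip": "belly",
          "neck": "chest", "l_shoulder": "chest", "r_shoulder": "chest",
          "head_top": "neck",
          "l_elbow": "l_shoulder", "r_elbow": "r_shoulder",
          "l_wrist": "l_elbow", "r_wrist": "r_elbow"}


def vector_normalize(K):
    """Convert 3 x 12 coordinates into unit vectors pointing from parent to child."""
    idx = {name: j for j, name in enumerate(NAMES)}
    V = np.zeros_like(K, dtype=float)  # belly column stays zero (assumption)
    for child, parent in PARENT.items():
        v = K[:, idx[child]] - K[:, idx[parent]]
        V[:, idx[child]] = v / np.linalg.norm(v)  # assumes distinct keypoints
    return V
```

Because each vector is a difference of two keypoints divided by its own length, translating or uniformly scaling the skeleton leaves the output unchanged, and moving the elbow alone does not alter the elbow-to-wrist vector.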
V Seq2Seq-based Listening and Speaking Models
We adapted the sequence-to-sequence architecture  for the listening and speaking phases separately. In the listening phase, the input contains both the speaker’s keypoint sequence and the sentence he/she said. In the speaking phase, the input contains the response sentence only. The output of both phases is a keypoint sequence, which is regarded as nonverbal feedback in the listening phase and as accompaniment of the utterance response in the speaking phase.
The architecture of two models is illustrated in Fig. 8. During listening phase, two LSTMs are used to encode speaker’s verbal signals and body gestures separately. Then, feature vectors from these two LSTMs are fused and decoded to generate body gestures feedback. However, during speaking phase, only the response sentence is encoded into a latent vector, which is then decoded to synthesize body gestures accompaniment.
To parse the verbal features, we implemented word embedding, which is a collection of techniques mapping words or phrases into numerical vectors. Landmark models include Word2Vec [23, 22], GloVe , ELMo  and BERT . Word2Vec is a computationally efficient word embedding algorithm, which is built on either the Continuous Bag-of-Words (CBOW) model or the Skip-Gram model. GloVe combines global matrix factorization and local context window methods to improve the performance of word embedding. ELMo and BERT both focus on the fact that the same word might have different meanings depending on the context. However, ELMo adopts stacked LSTMs while BERT, the state-of-the-art word embedding model, uses the Transformer proposed in . In our experiments, we fine-tuned a pretrained Word2Vec model based on CBOW to embed each word into a vector.
The nonverbal features were also unified to vector representations. We simply flattened each keypoint matrix into a 36-D vector, and then concatenated every 10 contiguous keypoint vectors into a single one. The reason for gathering every 10 frames together is that we analyzed all the videos in the dataset and found that the average speech rate is 0.4 seconds per word. Since the videos were captured at 25 fps, body movements happening in 10 frames approximately correspond to speaking a word. Consequently, the body gestures would be represented by a sequence of 360-D vectors. To equate the dimensionality of feature vectors for both body gestures and sentences, our Word2Vec model embedded each word into a 360-D vector, thus sentences would also be represented by a sequence of 360-D vectors.
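The packing of keypoint matrices into 360-D vectors can be sketched as follows. The row-major flattening order, the dropping of an incomplete trailing group of frames, and the truncation to seven steps are assumptions of this sketch; the value 7 and the zero padding are taken from the implementation details reported later in the paper.

```python
import numpy as np

FRAMES_PER_STEP = 10  # 0.4 s per word at 25 fps
KEYPOINT_DIM = 36     # 3 coordinates x 12 keypoints
MAX_STEPS = 7         # sequence length reported in the implementation details


def pack_keypoints(frames):
    """Turn (T, 3, 12) keypoint matrices into a zero-padded (7, 360) sequence."""
    frames = np.asarray(frames, dtype=float)
    n_steps = min(len(frames) // FRAMES_PER_STEP, MAX_STEPS)
    used = frames[:n_steps * FRAMES_PER_STEP]
    steps = used.reshape(n_steps, FRAMES_PER_STEP * KEYPOINT_DIM)
    padded = np.zeros((MAX_STEPS, FRAMES_PER_STEP * KEYPOINT_DIM))
    padded[:n_steps] = steps  # shorter sequences keep zeros at the tail
    return padded
```

For example, a 25-frame clip yields two filled steps followed by five zero rows, so it aligns with a sentence of up to seven 360-D word embeddings.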
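Recall the mechanism of the LSTM network proposed in . Assume the input sequence is $X = (x_1, x_2, \dots, x_T)$. The LSTM unit at step $t$ updates its states based on the states at step $t-1$:

$$\begin{aligned} i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\ f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\ o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \tag{5}$$

where $\sigma$ denotes the sigmoid function, $\tanh$ is the hyperbolic tangent function, $\odot$ denotes the element-wise multiplication, and $i_t$, $f_t$, $o_t$ represent the input gate, forget gate and output gate of the $t$-th LSTM unit respectively. $c_t$ and $h_t$ are the $t$-th cell state and hidden state. $W$ and $b$ are trainable weights.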
Therefore, two hidden vectors will be produced from the encoder during listening phase. Then they are fused by element-wise addition and passed into the decoder. During speaking phase, the encoder only produces one hidden vector, which is directly input into the decoder.
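The encoding and fusion steps can be illustrated with a small NumPy sketch. It uses the standard LSTM update with the four gate pre-activations stacked in one weight matrix; the hidden size of 8, the random weights and the random inputs are placeholders, not values from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update; W maps [h_prev, x_t] to four stacked pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate cell state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


def encode(sequence, W, b, hidden):
    """Run the LSTM over a sequence and return the final hidden and cell states."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, W, b)
    return h, c


# Placeholder sizes and inputs for illustration only.
rng = np.random.default_rng(0)
hidden, dim = 8, 360
W_text = rng.normal(scale=0.1, size=(4 * hidden, hidden + dim))
W_pose = rng.normal(scale=0.1, size=(4 * hidden, hidden + dim))
b = np.zeros(4 * hidden)
words = rng.normal(size=(7, dim))  # embedded sentence
poses = rng.normal(size=(7, dim))  # packed keypoint vectors

h_text, _ = encode(words, W_text, b, hidden)
h_pose, _ = encode(poses, W_pose, b, hidden)
fused = h_text + h_pose            # element-wise addition, as in the listening phase
```

Element-wise addition requires both encoders to share the same hidden size, which is one reason for matching the dimensionality of the word and keypoint feature vectors.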
In the decoder stage, the keypoint sequence is generated one element per step. The first LSTM unit receives the hidden state and outputs the first prediction. Each later LSTM unit takes the previous prediction as input and calculates the current keypoints prediction. After the output sequence is produced, the mean squared error (MSE) is calculated by:

$$\mathcal{L} = \frac{1}{n} \sum_{t=1}^{n} \lVert y_t - \hat{y}_t \rVert^2 \tag{6}$$

where $y_t$ and $\hat{y}_t$ denote the keypoints vector in the ground truth and predicted sequence respectively, $\lVert \cdot \rVert$ denotes the Euclidean norm and $n$ is the number of LSTM units in the decoder.
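As a small worked example, the loss for one sequence can be computed as the average, over decoder steps, of the squared Euclidean distance between the ground-truth and predicted keypoints vectors. The function name `sequence_mse` is hypothetical.

```python
import numpy as np


def sequence_mse(y_true, y_pred):
    """Mean over decoder steps of the squared Euclidean distance per step."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

With seven steps of 360-D vectors, a prediction that is off by 1 in every component gives a loss of 360, since each step contributes 360 squared unit errors.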
VI-A Implementation Details
We implemented our seq2seq-based listening and speaking models using TFLearn , which is a third-party Python module built on TensorFlow. In the training stage, network weights were optimized by minimizing the loss (6) using Adam . Hyper-parameters were tuned on the validation set, and the following values were finally applied: the batch size was 64, the initial learning rate was 0.01 and the training process stopped after 40000 epochs. Moreover, the LSTM layer, in both encoder and decoder, contains 7 units, which means that our models accept a maximum of 7 words for text input and 70 contiguous frames for keypoints input (recall that we packed every 10 frames into one vector). For shorter sequences, zero padding is appended to their tails.
For body motion synthesis, we reconstructed the body gestures on the avatar and the humanoid robot Pepper (Fig. 9) with the synthesized keypoint sequence. Fig. 9 also shows the controllable joints of the avatar and Pepper . First, we selected one frame out of every ten from the keypoint sequence to form a new sequence for interpolation. After that, we chose the belly as the reference and adjusted the other vectors in the new sequence to align approximately with the avatar by geometric transformation. Then we computed the angle between adjacent vectors in the first frame. After rotating all the keypoints by the corresponding angles, the avatar and Pepper would be in the pose of the first frame. Finally, we calculated the rotation angles of all the adjacent vectors between consecutive frames, interpolated the values and rotated each vector so that the intelligent agents would present the target body gestures smoothly based on the keypoint sequence.
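The interpolation between consecutive frames can be illustrated for a single keypoint vector. The paper does not specify the interpolation scheme, so the sketch below assumes spherical linear interpolation between two unit direction vectors; both function names are hypothetical.

```python
import numpy as np


def angle_between(u, v):
    """Angle in radians between two vectors."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.arccos(np.clip(u @ v, -1.0, 1.0)))


def slerp(u, v, steps):
    """Interpolate from unit vector u to unit vector v, endpoints included."""
    omega = angle_between(u, v)
    if omega < 1e-8:  # identical directions: nothing to interpolate
        return np.tile(u, (steps, 1))
    ts = np.linspace(0.0, 1.0, steps)
    # Not defined for exactly opposite vectors, where sin(omega) is zero.
    return np.array([(np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v)
                     / np.sin(omega) for t in ts])
```

Every interpolated vector keeps unit length, so the limb sweeps through the rotation angle at a constant rate instead of being shortened midway, as would happen with a straight-line blend of the two directions.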
VI-B Metrics & Evaluations
We applied MSE and cosine similarity to measure the accuracy of listening model and speaking model separately on the test set. For listening model, each test sample is composed of a word sequence and a keypoint sequence of the speaker as input, and a keypoint sequence of the listener as ground truth. For speaking model, the input is a word sequence of the speaker, and the ground truth is his/her keypoint sequence while speaking.
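Specifically, suppose that there are $N$ test samples for the listening model. The speaker’s word and keypoint sequences of each sample are input into our listening model to predict the listener’s keypoint sequence. Then we use (6) to calculate the loss $\mathcal{L}_s$ for the $s$-th sample, and the total loss over the test set is calculated by:

$$\mathcal{L}_{\text{test}} = \frac{1}{N} \sum_{s=1}^{N} \mathcal{L}_s \tag{7}$$

The cosine similarity is obtained as follows:

$$\text{sim} = \frac{1}{N n} \sum_{s=1}^{N} \sum_{t=1}^{n} \frac{y_{s,t} \cdot \hat{y}_{s,t}}{\lVert y_{s,t} \rVert \, \lVert \hat{y}_{s,t} \rVert} \tag{8}$$

where $y_{s,t}$ and $\hat{y}_{s,t}$ are the $t$-th ground-truth and predicted keypoints vectors of the $s$-th sample and $n$ is the number of decoder steps.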
Similar evaluations are applied to the speaking model, with only the input and ground truth sequence different.
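The two metrics can be computed over a test set as sketched below. The averaging over samples and decoder steps is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def evaluate(truth, pred):
    """truth, pred: arrays of shape (N, n, D). Returns (mean MSE, mean cosine)."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Squared Euclidean distance per step, averaged over steps and samples.
    mse = np.mean(np.sum((truth - pred) ** 2, axis=2))
    # Cosine similarity per step, averaged over steps and samples.
    cos = np.mean([cosine_similarity(t, p)
                   for ts, ps in zip(truth, pred)
                   for t, p in zip(ts, ps)])
    return float(mse), float(cos)
```

A perfect prediction gives an MSE of 0 and a cosine similarity of 1. Zero-padded steps would need to be masked out before calling `cosine_similarity`, since the cosine of a zero vector is undefined.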
We compared three normalization methods on both datasets. It can be seen from Table I and Table II that vector normalization method outperforms individual and global normalization methods on both MSE and cosine similarity metrics in both datasets. Therefore, we finally adopted vector normalization method in our experiments.
Table IV: Accuracy of each keypoint vector.
| Keypoint vector | MSE | Cosine similarity | MSE | Cosine similarity |
|---|---|---|---|---|
| belly (the center) | 0.000 | 1.000 | 0.000 | 1.000 |
| neck → head top | 0.087 | 0.997 | 0.093 | 0.990 |
| chest → left shoulder | 0.090 | 0.992 | 0.100 | 0.986 |
| left shoulder → left elbow | 0.096 | 0.986 | 0.104 | 0.979 |
| left elbow → left wrist | 0.107 | 0.983 | 0.118 | 0.972 |
| chest → right shoulder | 0.092 | 0.989 | 0.099 | 0.983 |
| right shoulder → right elbow | 0.096 | 0.980 | 0.107 | 0.970 |
| right elbow → right wrist | 0.110 | 0.972 | 0.116 | 0.963 |
| belly → left hip | 0.085 | 0.997 | 0.092 | 0.992 |
| belly → right hip | 0.087 | 0.998 | 0.096 | 0.991 |
We designed an experiment that exchanges the input and ground truth sequences, so that the model attempted to learn the body gestures of the guest instead of the host. As shown in Table III, the persona had a large effect on our models. In Table IV, we compared the accuracy of each keypoint vector separately. The vector connecting elbow and wrist has the largest deviation, which might be caused by the frequent movements of the lower arm when a human is talking.
In Fig. 10, the experimental results of four examples are illustrated to show the keypoints comparison between ground truth and our predictions, as well as the synthesized body gestures on both the avatar and Pepper.
In this paper, we propose a novel body gesture interaction system for more realistic human-robot conversations. The seq2seq architecture is adapted to a listening model and a speaking model for body gesture synthesis in the corresponding conversational phases. Both models utilize LSTM networks to encode and decode the text and body gestures represented by 12 crucial upper-body keypoints, which are extracted, 3D-transformed, rotated and normalized by the gesture parsing module. Our models are trained on a substantial number of talk show videos downloaded from YouTube and evaluated by the metrics of MSE and cosine similarity. Synthesized body gestures are reconstructed on the avatar and Pepper using the keypoints vector sequence predicted by the models. Experimental results show that the proposed models are able to learn human body gestures during both listening and speaking phases in conversation, and that the proposed interaction system can provide natural human-robot conversation experiences. In the future, a real-time conversation model based on the proposed system will be investigated.
-  (2016) TensorFlow: a system for large-scale machine learning. In OSDI, pp. 265–283. Cited by: §VI-A.
-  (2012) An integrated model of speech to arm gestures mapping in human-robot interaction. IFAC Proc. Volumes 45 (6), pp. 817–822. Cited by: §II.
-  (2016) Towards an intelligent system for generating an adapted verbal and nonverbal combined behavior in human–robot interaction. Autonomous Robots 40 (2), pp. 193–209. Cited by: §II.
-  (2018) OpenPose: realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008. Cited by: §I.
-  (2018) A face-to-face neural conversation model. In CVPR, Cited by: §II.
-  (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §II.
-  (1981) Modeling a paranoid mind. Behavioral and Brain Sci. 4 (04), pp. 515. Cited by: §II.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I, §V.
-  (1985) Speaker recognition: identifying people by their voices. Proc. IEEE 73 (11), pp. 1651–1664. Cited by: §IV-A.
-  (2014) TTS synthesis with bidirectional LSTM based recurrent neural networks. In INTERSPEECH, pp. 1964–1968. Cited by: §I, §III.
-  (2017) RMPE: regional multi-person pose estimation. In ICCV, pp. 2334–2343. Cited by: §I, §IV-B.
-  (2013) Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645–6649. Cited by: §I, §I, §III.
-  (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29 (6), pp. 82–97. Cited by: §I, §I, §III.
-  (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. Cited by: §II, §V.
-  (2017) Avatar digitization from a single image for real-time rendering. ACM Trans. Graph. 36 (6), pp. 1–14. Cited by: §II.
-  (2003) Body movement analysis of human-robot interaction. In IJCAI, pp. 177–182. Cited by: §II.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §VI-A.
-  (2010) Gesture controllers. ACM Trans. Graph. 29 (4), pp. 1. Cited by: §II.
-  (2009) Real-time prosody-driven synthesis of body language. ACM Trans. Graph. 28 (5), pp. 1. Cited by: §II.
-  (2015) A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055. Cited by: §II.
-  (2012) Integration of gestures and speech in human-robot interaction. In CogInfoCom, pp. 1–6. Cited by: §I, §II.
-  (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §I, §V.
-  (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. Cited by: §I, §V.
-  (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §I, §V.
-  (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §I, §V.
-  (2011) Data-driven response generation in social media. In EMNLP, pp. 583–593. Cited by: §II.
-  (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pp. 1583. Cited by: §II.
-  (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pp. 3776–3783. Cited by: §II.
-  (2019) Pepper - documentation. Note: http://doc.aldebaran.com/2-4/home_pepper.html Cited by: §VI-A.
-  (2014) Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112. Cited by: §I, §I, §II, §V.
-  (2015) What makes Tom Hanks look like Tom Hanks. In ICCV, pp. 3952–3960. Cited by: §II.
-  (2017) Synthesizing Obama: learning lip sync from audio. ACM Trans. Graph. 36 (4), pp. 1–13. Cited by: §II.
-  (2016) TF.Learn: TensorFlow’s high-level module for distributed machine learning. arXiv preprint arXiv:1612.04251. Cited by: §VI-A.
-  (2016) Face2Face: real-time face capture and reenactment of RGB videos. In CVPR, pp. 2387–2395. Cited by: §II.
-  (2017) Lifting from the deep: convolutional 3D pose estimation from a single image. In CVPR, pp. 2500–2509. Cited by: §I, §IV-C.
-  (2017) Attention is all you need. In NIPS, pp. 6000–6010. Cited by: §V.
-  (2015) Sequence to sequence-video to text. In ICCV, pp. 4534–4542. Cited by: §II.
-  (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §II.
-  (1966) ELIZA—a computer program for the study of natural language communication between man and machine. Commun. ACM 9 (1), pp. 36–45. Cited by: §II.
-  (2015) Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In ICASSP, pp. 4470–4474. Cited by: §I, §III.