“Any human activity is impregnated with language because it takes places in an environment that is build up through language and as language” [1]. As such, human behavior is deeply related to the natural language in our lives. A human can perform an action corresponding to a given sentence and, conversely, can understand in words the behavior of an observed person. If a robot can also perform actions corresponding to a given language description, interaction with robots will become easier.
Finding the link between language and action has been of great interest in machine learning. There are datasets which provide human whole-body motions and corresponding word or sentence annotations [2, 3]. In addition, there have been attempts to learn the mapping between language and human action [4, 5]. In [4], hidden Markov models (HMMs) [6] are used to encode motion primitives and to associate them with words. The authors of [5] used a sequence-to-sequence (Seq2Seq) model [7] to learn the relationship between natural language and human actions.
In this paper, we choose to use a generative adversarial network (GAN) [8], which is a generative model consisting of a generator G and a discriminator D. G and D play a two-player minimax game, such that G tries to create more realistic data that can fool D, and D tries to differentiate between the data generated by G and the real data. Based on this adversarial training method, it has been shown that GANs can synthesize realistic high-dimensional new data, which is difficult to generate through manually designed features [9, 10, 11]. In addition, it has been proven that a GAN has a unique solution, in which G captures the distribution of the real data and D does not distinguish the real data from the data generated by G. Thanks to these features of GANs, our experiment also shows that GANs can generate more realistic actions than the previous work [5].
The proposed generative model is a GAN based on the Seq2Seq model. The objective of a Seq2Seq model is to learn the relationship between the source sequence and the target sequence, so that it can generate a sequence in the target domain corresponding to a sequence in the input domain [7]. As shown in Figure 1, the proposed model consists of a text encoder and an action decoder based on recurrent neural networks (RNNs) [12]. Since both sentences and actions are sequences, an RNN is a suitable model for both the text encoder and the action decoder. The text encoder converts an input sentence, a sequence of words, into a feature vector. A set of processed feature vectors is transferred to the action decoder, where actions corresponding to the input sequence are generated. When decoding the processed feature vectors, we use a decoder based on the attention mechanism [13].
In order to train the proposed generative network, we have chosen to use the MSR-Video to Text (MSR-VTT) dataset, which contains web video clips with annotation texts [14]. Existing datasets [2, 3] are not suitable for our purpose since they are recorded in laboratory environments. One remaining problem is that the MSR-VTT dataset does not provide human pose information. Hence, for each video clip, we have extracted a human upper body pose sequence using the convolutional pose machine (CPM) [15]. Extracted 2D poses are converted to 3D poses and used as our dataset [16]. We have gathered pairs of sentence descriptions and action sequences, containing descriptions and actions. Each sentence description is paired with about 10 to 12 actions.
The remainder of the paper is organized as follows. The proposed Text2Action network is given in Section II. Section III describes the structure of the proposed generative model and implementation details. Section IV shows various 3D human-like action sequences obtained from the proposed generative network and discusses the results. In addition, we demonstrate that a Baxter robot can generate an action based on a provided sentence.
II Text2Action Network
Let the input sentence be a sequence of words, each represented as a one-hot vector whose length is the size of the vocabulary. In this paper, we encode the sentence into a sequence of word embedding vectors based on the word2vec model [17]. Each word embedding vector is obtained by multiplying the word embedding matrix by the one-hot vector of the corresponding word, and all word embedding vectors share the same dimension. With our dataset, we have pretrained the word embedding matrix based on the method presented in [17].
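The embedding step described above can be sketched in a few lines. This is only an illustration, not the authors' code: the toy vocabulary, the 3-dimensional embedding values, and the names `one_hot`, `embed`, and `embed_sentence` are assumptions made here for the example.

```python
# Toy illustration: multiplying an embedding matrix by a one-hot vector
# selects one column, i.e. the embedding of that word.
# The vocabulary and the embedding values below are made up for the example.

VOCAB = ["a", "girl", "is", "dancing"]  # hypothetical vocabulary

# Hypothetical embedding matrix: one row per embedding dimension,
# one column per vocabulary word (3 x 4).
EMBEDDING = [
    [0.1, 0.5, -0.2, 0.9],
    [0.0, 0.3, 0.4, -0.7],
    [0.2, -0.1, 0.6, 0.8],
]


def one_hot(word):
    """Return the one-hot vector of `word` over VOCAB."""
    index = VOCAB.index(word)
    return [1.0 if i == index else 0.0 for i in range(len(VOCAB))]


def embed(word):
    """Embed a word as (embedding matrix) x (one-hot vector)."""
    w = one_hot(word)
    return [sum(row[j] * w[j] for j in range(len(w))) for row in EMBEDDING]


def embed_sentence(sentence):
    """Embed every word of a whitespace-separated sentence."""
    return [embed(word) for word in sentence.split()]
```

In practice the product is never formed explicitly; the embedding is read directly as a column lookup, which gives the same vector.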
Since the proposed generative network is a GAN, it consists of a generator G and a discriminator D, as shown in Figure 2. The objective of the generator is to generate a proper human action sequence corresponding to the embedded representation of the input sentence, and the objective of the discriminator is to differentiate real actions from fake actions given the sentence. A text encoder encodes the embedded sentence into a sequence of hidden states, which contain the processed information related to the sentence. All hidden states share the same dimension.
Let the action sequence be a sequence of human pose vectors, all of the same dimension. The pair of the text encoder and the generator G is a Seq2Seq model. The generator converts the embedded sentence into the target human pose sequence. In order to do so, the generator decodes the hidden states of the text encoder into a set of language feature vectors based on the attention mechanism [13]. There is one feature vector for generating each human pose, and its dimension is the same as that of the text encoder's hidden state.
In addition, a set of random noise vectors, all of the same dimension, is provided to G. With the set of feature vectors and the set of random noise vectors, the generator synthesizes a corresponding human action sequence (see Figure 2). Here, the first human pose input is set to the mean pose of all first human poses in the training dataset.
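The two generator inputs mentioned above, a noise sequence and the mean first pose, can be sketched as follows. The function names, the standard-normal noise, and the toy data are assumptions for illustration; the paper's actual noise parameters and dimensions are not reproduced here.

```python
import random


def mean_first_pose(action_sequences):
    """Average the first pose of every training action sequence.

    `action_sequences` is a list of sequences; each sequence is a list of
    pose vectors (lists of floats) with a common dimension.
    """
    first_poses = [sequence[0] for sequence in action_sequences]
    dim = len(first_poses[0])
    return [sum(p[d] for p in first_poses) / len(first_poses)
            for d in range(dim)]


def sample_noise_sequence(num_steps, noise_dim, rng):
    """Draw one noise vector per time step (standard normal, by assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
            for _ in range(num_steps)]


# Hypothetical toy data: two sequences of 2D "poses".
toy_sequences = [[[0.0, 2.0], [9.0, 9.0]], [[4.0, 6.0], [1.0, 1.0]]]
initial_pose = mean_first_pose(toy_sequences)
noise = sample_noise_sequence(num_steps=5, noise_dim=3, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the sampled noise reproducible, which is convenient when comparing actions generated from the same sentence.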
The objective of the discriminator D is to differentiate the action sequence generated by G from the real human action data. As shown in Figure 2, it also decodes the hidden states of the text encoder into the set of language feature vectors based on the attention mechanism [13]. With the set of feature vectors and a human action sequence as inputs, the discriminator determines whether the action sequence is fake or real given the sentence. The output from the last RNN cell is the result of the discriminator (see Figure 2). The discriminator returns 1 if the action sequence is identified as real.
In order to train G and D, we use the value function (1) of the GAN framework [8]. G and D play a two-player minimax game on this value function, such that G tries to create more realistic data that can fool D, and D tries to differentiate between the data generated by G and the real data.
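For reference, a conditional form of the GAN value function from [8] can be written as below. The symbols are assumptions introduced here and may differ from the authors' notation: x is a real action sequence, z a noise sequence, c the language features derived from the input sentence, and p_data and p_z the data and noise distributions.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x, c)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z, c), c)\big)\big]
```

Under this form, D is rewarded for outputs close to 1 on real data and close to 0 on generated data, while G is rewarded when D's output on generated data approaches 1.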
III Network Structure
III-A RNN-based Text Encoder
The hidden states of the text encoder share one dimension and are updated by the nonlinear function of an LSTM cell. The LSTM cell combines its gates through the element-wise product and uses the sigmoid function as the gate activation. The dimensions of its weight matrices and bias vectors are determined by the dimensions of the word embedding vector and of the hidden state.
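A common formulation of the LSTM cell (the variant of [12] with a forget gate) is sketched below. The notation is assumed for illustration and need not match the authors' symbols: e_t is the input word embedding, h_t the hidden state, c_t the cell state, \odot the element-wise product, and \sigma the sigmoid function.

```latex
\begin{aligned}
i_t &= \sigma(W_i e_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f e_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o e_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g e_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

With an embedding of dimension n_e and a hidden state of dimension n_h, each W has shape n_h by n_e, each U has shape n_h by n_h, and each b has length n_h.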
After the text encoder encodes the embedded sentence into its hidden states, the generator G decodes them into the set of feature vectors based on the attention mechanism [13]. Each feature vector is a weighted combination of the encoder hidden states, and the weight of each hidden state is computed by the attention mechanism.
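The attention step can be illustrated with a small sketch. It assumes the additive scoring form commonly associated with [13] (a score from a tanh of projected encoder and decoder states, followed by a softmax); the function names and the parameters `w_enc`, `w_dec`, and `v` are hypothetical and are not taken from the paper.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_feature(encoder_states, decoder_state, w_enc, w_dec, v):
    """Return (weights, feature) for one decoding step.

    score_i = v . tanh(w_enc @ h_i + w_dec @ s), weights = softmax(scores),
    feature = sum_i weights_i * h_i.
    """
    projected_s = matvec(w_dec, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(w_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_h, projected_s)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    feature = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, feature
```

Because the feature is a convex combination of the encoder hidden states, it has the same dimension as those hidden states.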
After encoding the language features, a set of random noise vectors is provided to G. With the feature vectors and the noise vectors, the generator synthesizes a corresponding human action sequence. The generator G is composed of LSTM cells whose hidden states share one dimension. Each hidden state is computed by a nonlinear function constructed based on the attention mechanism presented in [13], and the output pose at each time step is computed from the hidden state at that step.
The discriminator D also decodes the encoder hidden states into the set of feature vectors based on the attention mechanism [13] (see equations (9)-(11)). The discriminator takes the feature vectors and an action sequence as inputs and generates a scalar result (see Figure 2). It returns 1 if the input has been determined to be real data. The discriminator is composed of LSTM cells whose hidden state dimension is the same as that of G. The output of D is calculated from its last hidden state.
III-D Implementation Details
The desired performance was not obtained when we tried to train the entire network end-to-end. Therefore, we pretrain the RNN-based text encoder first. The text encoder is trained as part of an autoencoder which learns the relationship between natural language and human action, as shown in Figure 3. This autoencoder consists of the text-to-action encoder, which maps the natural language to the human action, and the action-to-text decoder, which maps the human action back to the natural language description. Both the text-to-action encoder and the action-to-text decoder are Seq2Seq models based on the attention mechanism [13].
The encoding part of the text-to-action encoder corresponds to the text encoder in our network, such that it encodes the embedded sentence into its encoder's hidden states using (2)-(8). Based on the feature vectors (see (9)-(11)), the hidden states of its decoder are calculated (see (12)-(17)), and the action sequence is generated (see (18)). Here, the zero vector is used in place of the random noise vector.
The action-to-text decoder works on a similar principle. After its encoder encodes the human action sequence into its hidden states (see Figure 3), the decoding part of the action-to-text decoder decodes them into a set of feature vectors (see (9)-(11)). Based on these feature vectors, the hidden states of its decoder are calculated (see (12)-(17)). From these hidden states, the word embedding representation of the sentence is reconstructed (see (18)).
In order to train this autoencoder network, we have used a loss function that combines the estimation loss of the action sequence and the reconstruction loss of the word embedding vector sequence. Two constants are used to control how much each of these two losses should be reduced.
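A weighted two-term loss of this kind can be sketched as follows. The use of mean squared error for both terms, the function names, and the argument names `action_weight` and `text_weight` are assumptions for illustration; the exact form of the authors' loss is not reproduced here.

```python
def mean_squared_error(predicted, target):
    """Mean squared error between two equally shaped sequences of vectors."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def autoencoder_loss(actions, est_actions, embeddings, rec_embeddings,
                     action_weight, text_weight):
    """Weighted sum of the action estimation loss and the text reconstruction loss."""
    action_loss = mean_squared_error(est_actions, actions)
    text_loss = mean_squared_error(rec_embeddings, embeddings)
    return action_weight * action_loss + text_weight * text_loss
```

Raising one weight relative to the other shifts the optimization toward that term, which is the role the two constants play in the text.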
The overall steps for training the proposed network are presented in Algorithm 1. After training the autoencoder network, the text encoder part is extracted and passed to the generator G and the discriminator D. In addition, in order to make the training of G more stable, the weight matrices and bias vectors of G that are shared with the autoencoder are initialized to the trained values. When training G and D with the GAN value function shown in (1), we do not train the text encoder. This prevents the pretrained encoded language information from being corrupted while training the network with the GAN value function.
For training the autoencoder network, the number of training epochs, the batch size, and the dimension of the hidden state in its LSTM cells are fixed in advance. The Adam optimizer [18] is used to minimize the loss function with a fixed learning rate, and the two weighting constants of the loss function are fixed as well. The hidden states of the LSTM cells composing G and D share one dimension. The random noise vector has a fixed dimension and is sampled from Gaussian noise. G and D are likewise trained for a fixed number of epochs and batch size, each with the Adam optimizer [18] on the value function and each with its own learning rate. All values of these parameters are chosen empirically.
In order to train the proposed generative network, we use the MSR-VTT dataset, which provides YouTube video clips and sentence annotations [14]. As shown in Figure 4, we have selected videos in which human behavior is observed, and extracted the upper body 2D pose of the observed person through CPM [15]. Extracted 2D poses are converted to 3D poses and used as our dataset [16]. (The dataset will be made publicly available.) We choose to use only the upper body pose rather than the full body pose, since occlusion of the lower body is frequently observed in the videos. Another option was to use the data presented in , but there are pairs of actions and sentence descriptions, which has been judged to be insufficient to train our network.
Each extracted upper body pose is a 24-dimensional vector. The 3D position of the human neck and the 3D vectors of seven other joints compose the pose vector data (see Figure 5). Since the sizes of detected human poses differ, we have normalized the joint vectors (see Figure 5). Poses that were extracted incorrectly were corrected by hand. The corrected poses are then smoothed through Gaussian filtering. Each action sequence is seconds long, and the frame rate is fps, making a total of frames for an action sequence.
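The normalization and smoothing steps can be sketched as follows. Normalizing each joint vector to unit length is an assumption made here, as are the kernel parameters `sigma` and `radius` and the edge handling; the sketch is not the authors' preprocessing code.

```python
import math


def normalize(vector):
    """Scale a 3D joint vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in vector))
    return list(vector) if norm == 0.0 else [c / norm for c in vector]


def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights on [-radius, radius], normalized to sum to 1."""
    raw = [math.exp(-(k * k) / (2.0 * sigma * sigma))
           for k in range(-radius, radius + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def smooth(signal, sigma=1.0, radius=2):
    """Gaussian-filter a 1D time series, repeating the edge values at the borders."""
    kernel = gaussian_kernel(sigma, radius)
    last = len(signal) - 1
    smoothed = []
    for t in range(len(signal)):
        value = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            index = min(max(t + offset, 0), last)
            value += weight * signal[index]
        smoothed.append(value)
    return smoothed
```

Here `smooth` would be applied independently to each coordinate of the pose vector over time, so that isolated jumps left by pose extraction are attenuated.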
Regarding the language annotations, some annotations contained information that is not relevant to the human action. For example, for the sentence ‘a man in a brown jacket is addressing the camera while moving his hands wildly’, we cannot know whether the man wears a brown jacket from human pose information alone. In these cases, we manually correct the annotation so that it includes only information related to the human action, for example ‘a man is addressing the camera while moving his hands wildly’.
In total, we have gathered pairs of sentence descriptions and action sequences, which consists of descriptions and actions. Each sentence description pairs with about 10 to 12 actions. The time length of total action sequences is hours. The number of words included in the sentence description data is , and the size of vocabulary which makes up the data is .
IV-B 3D Action Generation
We first examine how action sequences are generated when a fixed sentence input and different random noise vector inputs are given to the trained network. Figure 6 shows three actions generated with one sentence input and three differently sampled random noise vector sequences. The generated pose vector data (see Figure 5) is fitted to a human skeleton of a predetermined size. The input sentence description is ‘A girl is dancing to the hip hop beat’, which is not included in the training dataset. In this figure, the human poses in each rectangle represent one action sequence, listed in time order from left to right. The time interval between consecutive poses is 0.5 seconds. It is interesting to note that even though the same sentence input is given, varied human actions are generated if the random vectors are different. In addition, all generated motions resemble dancing.
We also examine how the action sequence is generated when the input random noise vector sequence is fixed and the sentence input varies. Figure 7 shows three actions generated from one fixed random noise vector sequence and three different sentence inputs. The input sentences are ‘A woman drinks a coffee’, ‘A muscular man exercises in the gym’, and ‘A chef is cooking a meal in the kitchen’. A limitation of this result is that it is difficult to understand the concrete context from the action alone, since no tools or background information related to the action are shown. However, the first result in Figure 7 shows an action sequence in which a human lifts the right hand and brings it close to the mouth, as when drinking something (see the th frame). The second result shows an action sequence resembling a human moving with a dumbbell in both hands. The last result shows an action sequence resembling a chef cooking food in the kitchen and trying a sample.
IV-C Comparison with [5]
We implemented the network presented in [5] based on TensorFlow and trained it with our dataset. First, we compare the generated actions when the sentence ‘Woman dancing ballet with man’, which is included in the training dataset, is given as an input to each network. The result of the comparison is shown in Figure 8. The time interval between consecutive poses is 0.4 seconds. In this figure, results from both networks are compared to the human action data that matches the input sentence in the training dataset. The result shows that our generative model synthesizes a human action sequence that is more similar to the data. Although the network presented in [5] also generates an action resembling a ballet dancer with both arms open, the action sequence synthesized by our network is more natural and more similar to the data.
In addition, we give a sentence which is not included in the training dataset as an input to each network. The result of the comparison is shown in Figure 9. The time interval between consecutive poses is 0.4 seconds. The given sentence is ‘A drunk woman is stumbling while lifting heavy weights’. It is a combination of two sentences included in the training dataset, which are ‘A drunk woman stumbling’ and ‘Woman is lifting heavy weights’. Although the situation described by the input sentence is unlikely to be seen in practice, this experiment tests whether the proposed network has learned the relationship between natural language and human action well and responds flexibly to input sentences. The action sequence generated by our network resembles a drunk woman staggering and lifting weights, while the action sequence generated by the network in [5] merely resembles a person lifting weights.
The method suggested in [5] also produces a human action sequence that corresponds somewhat to the input sentence. However, the generated behaviors are all symmetric and not as dynamic as the data. This is because its loss function is designed to maximize the likelihood of the data, whereas the data contains poses that are asymmetric to the left or right. Taking the ballet movement shown in Figure 8 as an example, our training data may contain both a left arm lifting action and a right arm lifting action for the same sentence ‘Woman dancing ballet with man’. With a network that is trained to maximize the likelihood of the entire data, a symmetric pose that lifts both arms has a higher likelihood and eventually becomes the solution of the network. On the other hand, our network, which is trained based on the GAN value function (1), manages to generate various human action sequences that look close to the training data.
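The averaging effect described above can be reproduced with a toy example. It assumes that maximizing likelihood amounts to minimizing squared error (as under a fixed-variance Gaussian), which is an assumption made here for illustration and not a statement about the exact loss of [5]; the two-dimensional "poses" are hypothetical.

```python
def squared_error(prediction, targets):
    """Total squared error of one prediction against several target poses."""
    return sum((p - t) ** 2
               for target in targets
               for p, t in zip(prediction, target))


# Hypothetical 2D "pose": (left arm height, right arm height).
left_arm_raised = [1.0, 0.0]
right_arm_raised = [0.0, 1.0]
targets = [left_arm_raised, right_arm_raised]

# The average of the two targets is a symmetric, half-raised pose.
average_pose = [(l + r) / 2.0
                for l, r in zip(left_arm_raised, right_arm_raised)]

errors = {
    "left": squared_error(left_arm_raised, targets),
    "right": squared_error(right_arm_raised, targets),
    "average": squared_error(average_pose, targets),
}
```

In this toy setting the symmetric average has a lower total error than either asymmetric pose that actually occurs in the data, so a single deterministic prediction is pulled toward it. A generator driven by random noise can instead commit to one of the two modes.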
IV-D Generated Action for a Baxter Robot
We enable a Baxter robot to execute a given action trajectory defined in a 3D Cartesian coordinate system by referring to the code from [19]. Since the maximum speed at which a Baxter robot can move its joints is limited, we slow down the given action trajectory before applying it to the robot. Figure 10 shows how the Baxter robot executes the given 3D action trajectory corresponding to the input sentence ‘A man is throwing something out’. Here, the time difference between frames capturing the Baxter’s pose is about second. We can see that the generated 3D action resembles throwing something forward.
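One simple way to slow down a trajectory is to insert interpolated waypoints and replay it at the same rate. The sketch below assumes linear interpolation and an integer slow-down factor; the function name `slow_down` is hypothetical, and this is neither the authors' method nor part of the code in [19].

```python
def slow_down(trajectory, factor):
    """Stretch a trajectory in time by an integer `factor`.

    `trajectory` is a list of waypoints (lists of floats) sampled at a fixed
    rate. Between each pair of consecutive waypoints, `factor - 1` linearly
    interpolated waypoints are inserted, so playing the result at the same
    rate takes roughly `factor` times longer.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if len(trajectory) < 2:
        return [list(p) for p in trajectory]
    stretched = []
    for start, end in zip(trajectory, trajectory[1:]):
        for step in range(factor):
            alpha = step / factor
            stretched.append([s + alpha * (e - s) for s, e in zip(start, end)])
    stretched.append(list(trajectory[-1]))
    return stretched
```

For example, `slow_down(path, 2)` on a two-waypoint path yields three waypoints, with the midpoint inserted between the original two.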
In this paper, we have proposed a generative model based on the Seq2Seq model [7] and a generative adversarial network (GAN), for enabling a robot to execute various actions corresponding to an input language description. In order to train the proposed network, we have used the MSR-Video to Text dataset [14], which contains videos recorded in real-world situations and uses a wider range of words in its language descriptions than other datasets. Since the data do not contain 3D human pose information, we have extracted the 2D upper body pose of the observed person through the convolutional pose machine [15]. Extracted 2D poses are converted to 3D poses and used as our dataset [16]. The generated 3D action sequence is transferred to a robot.
It is interesting to note that our generative model, which differs from existing related works in utilizing the advantages of the GAN, is able to generate diverse behaviors when the input random vector sequence changes. In addition, the results show that our network can generate an action sequence that is more dynamic and closer to the actual data than the network presented in [5]. The proposed generative model, which understands the relationship between human language and action, generates an action corresponding to the input language. We believe that the proposed method can make actions by robots more understandable to their users.
- [1] E. Ribes-Iñesta, “Human behavior as language: some thoughts on Wittgenstein,” Behavior and Philosophy, pp. 109–121, 2006.
- [2] W. Takano and Y. Nakamura, “Symbolically structured database for human whole body motions based on association between motion symbols and motion words,” Robotics and Autonomous Systems, vol. 66, pp. 75–85, 2015.
- [3] M. Plappert, C. Mandery, and T. Asfour, “The KIT motion-language dataset,” Big Data, vol. 4, no. 4, pp. 236–252, 2016.
- [4] W. Takano and Y. Nakamura, “Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions,” The International Journal of Robotics Research, vol. 34, no. 10, pp. 1314–1328, 2015.
- [5] M. Plappert, C. Mandery, and T. Asfour, “Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks,” arXiv preprint arXiv:1705.06400, 2017.
- [6] S. R. Eddy, “Hidden Markov models,” Current Opinion in Structural Biology, vol. 6, no. 3, pp. 361–365, 1996.
- [7] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
- [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- [9] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in Proc. of the 33rd International Conference on Machine Learning, 2016, pp. 1060–1069.
- [10] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint arXiv:1609.04802, 2016.
- [11] A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
- [12] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [13] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, “Grammar as a foreign language,” in Advances in Neural Information Processing Systems, 2015, pp. 2773–2781.
- [14] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in
- [15] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
- [16] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis, “Sparseness meets deepness: 3D human pose estimation from monocular video,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4966–4975.
- [17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
- [18] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [19] P. Steadman. (2015) baxter-teleoperation. [Online]. Available: https://github.com/ptsteadman/baxter-teleoperation