Learning End-to-End Action Interaction by Paired-Embedding Data Augmentation

07/16/2020 ∙ by Ziyang Song, et al.

In recognition-based action interaction, robots' responses to human actions are often pre-designed according to recognized categories and thus stiff. In this paper, we specify a new Interactive Action Translation (IAT) task which aims to learn end-to-end action interaction from unlabeled interactive pairs, removing explicit action recognition. To enable learning on small-scale data, we propose a Paired-Embedding (PE) method for effective and reliable data augmentation. Specifically, our method first utilizes paired relationships to cluster individual actions in an embedding space. Then two actions originally paired can be replaced with other actions in their respective neighborhood, assembling into new pairs. An Act2Act network based on conditional GAN follows to learn from augmented data. Besides, IAT-test and IAT-train scores are specifically proposed for evaluating methods on our task. Experimental results on two datasets show impressive effects and broad application prospects of our method.




1 Introduction

Action interaction is an essential part of human-robot interaction (HRI) [1]. For robots, action interaction with humans involves two levels: 1) perceiving human actions and understanding the intentions behind them; 2) performing responsive actions accordingly. Thanks to the development of action recognition methods [2], considerable progress has been made on the first level. As for the second level, robots often perform pre-designed action responses according to recognition results. We call this scheme recognition-based action interaction. However, the rich variety of human actions is mapped to a few fixed categories in this way, leading to a few fixed responses. Robots' action responses are thus stiff and lack human-like vividness. Moreover, annotating data for training action recognition models is labor-intensive.

In this paper, we aim to learn end-to-end interaction from unlabeled action interaction data. Explicit recognition is removed, leaving the interaction implicitly guided by high-level semantic translation relationships. To achieve this goal, we specify a novel Interactive Action Translation (IAT) task: given a set of "stimulation-response" action pairs that conform to defined interaction rules and carry no category labels, learn a model that generates a response for a given stimulation at inference time. The generated results are expected to manifest:

1) reality: indistinguishable from real human actions;

2) precision: conforming to defined interaction rules semantically, conditioned on the stimulation;

3) diversity: varying each time the same stimulation is given.

For different interaction scenes and defined rules, paired action data need to be re-collected each time. Thus IAT would be more appealing if it could be learned from a small number of samples. However, the task implicitly seeks a high-level semantic translation relationship, which is hard to generalize from insufficient data. Moreover, the multimodal distribution of real actions is difficult to approximate without sufficient data. This contradiction between task goals and applications poses the main challenge: achieving the three generation goals above with small-scale data.

Data augmentation is widely adopted to improve learning on small datasets. Traditional augmentation strategies apply hand-crafted transformations to existing data and thus only bring changes in limited modes. Generative Adversarial Networks (GAN) [3] have emerged as a powerful technique for generating realistic samples. Nonetheless, a reliable GAN itself requires large-scale data to train. Some variants of GAN, like ACGAN [4], DAGAN [5], and BAGAN [6], have been proposed to augment data for classification tasks. However, all of them need category labels, which are not provided in our task. Therefore, a specially designed augmentation method is needed for small-scale unlabeled data in IAT.

Figure 1: An overview of our proposed Paired-Embedding (PE) method. Colors distinguish stimulations and responses. Circle and cross denote actions of different semantic categories. Dotted lines describe paired relationships.

We propose a novel Paired-Embedding (PE) method, as Fig. 1 shows. Through the encoders of Paired Variational Auto-Encoders (PVAEs) and PCA-based linear dimension reduction, individual action instances are projected into a low-dimension embedding space. Along with the vanilla VAE objectives [7], we employ a new PE loss that utilizes paired relationships between actions to train PVAEs. Specifically, the VAE loss prefers a large variance of action embeddings, while the PE loss pulls actions within the same category together. As a result, action instances are clustered in the embedding space in an unsupervised manner. Subsequently, both actions in a data pair are allowed to be replaced with other instances in their respective neighborhoods, assembling new pairs that semantically conform to the defined interaction rules. Therefore, the diversity of paired data is significantly and reliably enriched. Finally, we train an Act2Act network based on conditional GAN [8] on the augmented data to solve our task.

Although IAT is formally an instance-conditional generation task like image translation [9, 10], it actually conditions on the semantic category of input action instances. Therefore, neither the evaluation metrics for image translation [11, 12] nor those for category-conditional generation [13] are suitable for this task. Considering the three generation goals, we propose two evaluation metrics, IAT-test and IAT-train scores, to compare methods for our task from distinct perspectives. Experiments show that our proposed method gives satisfying generated action responses, both quantitatively and qualitatively.

The major contributions of our work are summarized as follows:

1) We specify a new IAT task, aiming to learn end-to-end action interaction from unlabeled interactive action pairs.

2) We design a PE data augmentation method to resolve the main challenge of our task: learning with a small number of samples.

3) We propose IAT-test and IAT-train scores to evaluate methods on our task, covering the three task goals. Experiments demonstrate the satisfying generation quality of our proposed method.

2 Related Work

2.1 Data Augmentation with GAN

It is widely accepted that in deep learning, a larger dataset often leads to a more reliable algorithm. In practical applications, data augmentation by adding synthetic data provides another way to improve performance. The most common data augmentation strategies are applying various hand-designed transformations on existing data. As GAN arises, it is a straightforward idea to use GAN to directly synthesize realistic data for augmentation. However, GAN itself always requires large-scale data for stable training. Otherwise, the quality of synthesized data is not ensured.

Several variants of conditional GAN have been proposed for augmenting classification tasks, where category labels are included in GAN training. ACGAN [4] lets the generator and discriminator "cooperate" on classification in addition to "competing" on generation. DAGAN [5] aims to learn transformations on existing data for data augmentation. BAGAN [6] restores the dataset balance by generating minority-class samples. Unfortunately, these methods cannot be applied to augmenting data without category labels. Some other GAN-based data augmentation methods are designed for different tasks, like [14] for emotion classification and [15, 16] for person re-identification. They are only suitable for their respective tasks and do not extend to ours. Unlike these methods, our proposed method augments IAT data by re-assigning individual actions from existing pairs into new pairs. Since every action in an augmented pair is a real recorded instance, data synthesized in this way are natural and realistic. Meanwhile, the PE method ensures that augmented data follow the same interaction rules as existing data, namely the semantic-level reality of augmented data.

2.2 Evaluation Metrics for Generation

Early work often relies on subjective visual evaluation of synthesized samples from generative methods like GAN. Quantitative metrics have been proposed in recent years, and the most popular among them are Inception score (IS) [17] and Fréchet Inception distance (FID) [18]. Both of them are based on a pre-trained classification network (for image generation, an Inception network pre-trained on ImageNet). IS predicts category probabilities on generated samples through the classification network and evaluates generated results accordingly. FID directly measures the divergence between the distributions of real and synthesized data at the feature level. CSGN [19] has extended the IS and FID metrics from image generation to skeleton-based action synthesis. However, they fail to reflect the dependence of generated results upon conditions, and are thus unsuitable for conditional generation tasks like ours.

GAN-train and GAN-test scores [13] are proposed for comparing category-conditional GANs. An additional classification network is also introduced. Given category information, the two metrics quantify the correlation between generated samples and conditioned categories, in addition to the reality and diversity of generation. Nonetheless, category labels are missing in our task and semantic categories are only implicitly reflected in paired relationships. Inspired by GAN-train and GAN-test, we propose IAT-test and IAT-train scores to fit our task. In our metrics, the classification network performs binary classification on data pairs instead of explicit multi-category classification on individual instances.

3 Proposed Method

Our method consists of two parts: a core Paired-Embedding (PE) method for effective and reliable data augmentation, and an Act2Act network following the former. We illustrate the two parts separately in the following.

3.1 Paired-Embedding Data Augmentation

Figure 2: (a) The structure of Paired Variational Auto-Encoders (PVAEs) and losses for training. (b) Effects of different losses.

Here we propose a Paired-Embedding (PE) method, which aims to cluster individual action instances in a low-dimension embedding space by utilizing paired relationships between them.

3.1.1 Paired Variational Auto-Encoders (PVAEs).

PE is based on Paired Variational Auto-Encoders (PVAEs) consisting of two separate Variational Auto-Encoder (VAE) [7] networks, $V_s$ and $V_r$, with the same architecture, as shown in Fig. 2(a). Following [7], a VAE network is composed of an encoder $E$ and a decoder $D$. The encoder projects each sample $a$ into $(\mu, \sigma)$, which are the parameters of a multivariate Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$. Then a latent variable $z$ is sampled from this distribution to generate $\hat{a}$ through the decoder. The reconstruction error from $\hat{a}$ to $a$ and a prior regularization term constitute the VAE loss, i.e.,

$$L_{VAE} = L_{rec}(\hat{a}, a) + \lambda \, D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I)\big),$$

where $L_{rec}$ is the reconstruction error and $D_{KL}$ is the Kullback-Leibler divergence, with $\lambda$ controlling its relative importance.

We extract individual action instances from original action pairs. The two networks can be respectively trained under VAE loss to model the distribution of stimulative/responsive actions.
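As a minimal sketch of the VAE loss described above, the snippet below evaluates the two terms for a diagonal Gaussian posterior. The squared-error reconstruction term, the function names, and the `kl_weight` argument are assumptions made for illustration; the source does not state which reconstruction error is used.

```python
import math
import random


def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def vae_loss(action, reconstruction, mu, sigma, kl_weight=1.0):
    """Reconstruction error (squared error is an assumption) plus weighted KL."""
    rec = sum((a - r) ** 2 for a, r in zip(action, reconstruction))
    return rec + kl_weight * kl_to_standard_normal(mu, sigma)


rng = random.Random(0)
z = sample_latent([0.0, 0.0], [1.0, 1.0], rng)  # one latent sample
```

A larger `kl_weight` pulls the posterior toward the standard normal prior, while a smaller one favors accurate reconstruction.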

3.1.2 Paired-Embedding (PE) Loss.

Given an action set, the encoder of a VAE projects each action $a$ into a vector $\mu$ as the mean of a Gaussian distribution. We collect the Gaussian means from all the actions and compute a matrix $W$ for linear dimension reduction, using Principal Component Analysis (PCA) on them. These Gaussian means are further projected by $W$ into an extremely low-dimension embedding space, namely as $e = W\mu$. Owing to PCA, the variance of the Gaussian means is well maintained in the embedding space. Both stimulative and responsive actions are projected into the embedding space in this way. For two actions paired in the original dataset A, we push them towards each other in the embedding space using a Paired-Embedding (PE) loss, i.e., a distance between their embeddings,

$$L_{PE} = d(e_s, e_r),$$

where $e_s$ and $e_r$ are embeddings of an interactive pair of actions in the embedding space. Fig. 2(a) illustrates such a process.
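The projection and loss above can be sketched with NumPy as follows. This is an illustration under assumptions: the squared Euclidean distance stands in for the pairwise distance, PCA is computed via SVD on centered means, and the sizes (20 pairs, 8-d Gaussian means, 3-d embeddings) and all function names are hypothetical.

```python
import numpy as np


def pca_projection(means, dim):
    """Return (center, W): the mean vector and a (dim x latent_dim) PCA matrix."""
    center = means.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(means - center, full_matrices=False)
    return center, vt[:dim]


def embed(means, center, W):
    """Project Gaussian means into the low-dimension embedding space."""
    return (means - center) @ W.T


def pe_loss(emb_s, emb_r):
    """Mean squared distance between paired embeddings (assumed form)."""
    return float(np.mean(np.sum((emb_s - emb_r) ** 2, axis=1)))


rng = np.random.default_rng(0)
mu_s = rng.normal(size=(20, 8))  # Gaussian means of 20 stimulative actions
mu_r = rng.normal(size=(20, 8))  # Gaussian means of their paired responses
c_s, W_s = pca_projection(mu_s, dim=3)
c_r, W_r = pca_projection(mu_r, dim=3)
e_s, e_r = embed(mu_s, c_s, W_s), embed(mu_r, c_r, W_r)
loss = pe_loss(e_s, e_r)
```

Separate PCA matrices are fitted for stimulative and responsive actions here, since the two sets of Gaussian means come from two different encoders.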

3.1.3 Training PVAEs.

We train the two VAE networks synchronously and divide each epoch into two steps, as in Algorithm 1. During the first step, the two networks are independently optimized to minimize their respective VAE losses. In the second step, the PE loss guides the encoders of the two networks.

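Input: set A of action pairs, number of epochs N, and loss hyper-parameters
Initialize encoders E_s, E_r and decoders D_s, D_r
for epoch in [1, N] do
     # First step under VAE loss
     L_s = 0, L_r = 0
     for (a_s, a_r) in A do
         (μ_s, σ_s) = E_s(a_s), (μ_r, σ_r) = E_r(a_r)
         Sample z_s ~ N(μ_s, σ_s^2), Sample z_r ~ N(μ_r, σ_r^2)
         â_s = D_s(z_s), â_r = D_r(z_r)
         L_s += L_VAE(a_s, â_s, μ_s, σ_s), L_r += L_VAE(a_r, â_r, μ_r, σ_r)
     end for
     Back-prop L_s, update E_s, D_s; Back-prop L_r, update E_r, D_r
     # Second step under PE loss
     U_s = [], U_r = [], M = [], L_PE = 0
     for (a_s, a_r) in A do
         μ_s = E_s(a_s), μ_r = E_r(a_r)
         U_s.append(μ_s), U_r.append(μ_r), M.append((μ_s, μ_r))
     end for
     W_s = PCA(U_s), W_r = PCA(U_r)
     for (μ_s, μ_r) in M do
         e_s = W_s μ_s, e_r = W_r μ_r
         L_PE += d(e_s, e_r)
     end for
     Back-prop L_PE, update E_s, E_r
end for
Algorithm 1 Training of PVAEs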

Such an alternating strategy drives PVAEs from two opposite directions, as Fig. 2(b) shows.

  • On the one hand, Gaussian means should scatter for the reconstruction of different action instances. In other words, Gaussian means must maintain a sufficiently large variance, which is transferred almost losslessly to the embedding space by PCA. Consequently, the first learning step under VAE loss requires a large variance among embeddings of stimulative and responsive actions respectively.

  • On the other hand, each defined interaction rule is shared among several action pairs. For these action pairs, semantic category information is unified while other patterns in action instances are diverse. Since the embedding space has an extremely low dimension, embeddings of paired actions cannot be close for all pairs if the space mostly represents patterns apart from semantics. In other words, PE loss pushes the embedding space towards representing semantic categories of actions only. Thus, stimulative or responsive actions within the same semantic category are pulled together in the embedding space, guided by PE loss.

As a result, actions with similar semantics tend to cluster in the embedding space. Meanwhile, different clusters are far away from each other to maintain large variance. Experimental results in Sec. 4.4 further verify this effect.

3.1.4 Data Augmentation with PVAEs.

Given a set of individual action instances (either stimulative or responsive) and the corresponding VAE network from trained PVAEs, an $N \times N$ matrix $C$ is computed from the pairwise distances between action embeddings, where $N$ is the number of action instances, with $i$ and $j$ indexing two samples. A pre-set scale factor controls the neighborhood range. After that, we normalize the sum of each row in $C$ to 1, i.e.,

$$C_{ij} \leftarrow \frac{C_{ij}}{\sum_{k=1}^{N} C_{ik}}.$$

The computed matrix $C$ represents the confidence in replacing one action with another under the defined interaction rules. An action is believed to express semantics similar to other actions in its neighborhood, owing to clustering effects in the embedding space. We compute two such matrices, one for stimulative and one for responsive action instances, and use them to augment action pairs. The two actions of each action pair in the original dataset are replaced with other samples in their respective neighborhoods, according to these matrices. Assume that data pairs in the original set are evenly distributed across semantic categories. With replacement, we can ideally attain many more distinct data pairs conforming to the defined interaction rules. Such an increase in data diversity will significantly boost learning on the IAT task.
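The replacement step described above can be sketched as follows. The Gaussian kernel on squared embedding distance is an assumption chosen for illustration (the exact form of the confidence is not recoverable from the text), and `confidence_matrix`, `augment_pairs`, and the toy 1-d embeddings are hypothetical names and data.

```python
import math
import random


def confidence_matrix(embeddings, scale):
    """Row-normalized replacement confidences from pairwise embedding distances."""
    rows = []
    for e_i in embeddings:
        # Assumed Gaussian kernel: confidence decays with squared distance.
        row = [math.exp(-sum((a - b) ** 2 for a, b in zip(e_i, e_j)) / scale ** 2)
               for e_j in embeddings]
        total = sum(row)
        rows.append([v / total for v in row])  # each row sums to 1
    return rows


def augment_pairs(pairs, conf_s, conf_r, num_new, rng):
    """Replace both members of randomly chosen pairs with sampled neighbors."""
    n_s, n_r = len(conf_s), len(conf_r)
    new_pairs = []
    for _ in range(num_new):
        i, j = rng.choice(pairs)  # indices of a stimulation and its response
        i_new = rng.choices(range(n_s), weights=conf_s[i])[0]
        j_new = rng.choices(range(n_r), weights=conf_r[j])[0]
        new_pairs.append((i_new, j_new))
    return new_pairs


# Toy 1-d embeddings forming two well-separated clusters.
toy_embeddings = [[0.0], [0.1], [5.0], [5.1]]
toy_conf = confidence_matrix(toy_embeddings, scale=0.5)
```

A small `scale` keeps replacements inside tight clusters, while a large one spreads confidence over more distant samples.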

3.2 Act2Act: Encoder-Decoder Network under Conditional GAN

IAT is similar to paired image translation in the task form and goals. Both of them can be regarded as an instance-conditional generation task. They differ in that image translation conditions on the structured content of input instance, while our task implicitly conditions on the higher-level semantics of input instance. In recent years, GAN-based methods have been successful in image translation, generating photorealistic results. A similar GAN-based scheme is applied to our task.

Our Act2Act network has an encoder-decoder architecture, as in Fig. 3(a). It receives a stimulative action $a_s$ as input and gives an output $\hat{a}_r$ of the same form. Through the encoder, a low-dimension code $c$ is extracted from $a_s$. A random noise vector $z$ is sampled from a zero-mean Gaussian distribution with unit variance, and then combined with $c$ to decode $\hat{a}_r$.

Conditional GAN is applied for training, as Fig. 3(b) shows. The encoder-decoder network is treated as the Generator $G$, while a Discriminator $D$ receives a combination of two action sequences and outputs a score. Given paired training data $(a_s, a_r)$, $G$ is trained to produce a response $\hat{a}_r$ such that $(a_s, \hat{a}_r)$ is indistinguishable from $(a_s, a_r)$. Meanwhile, $D$ is trained to differentiate $(a_s, \hat{a}_r)$ from $(a_s, a_r)$ as well as possible.

Behind the above design lies our understanding of the IAT task. We consider the task as an implicit series connection of recognition and category-conditional generation. Therefore, we do not introduce the random noise until the input has been encoded into the low-dimension code, unlike in [9, 10] for image translation. The code has a very low dimension since we expect it to encode high-level semantics. Correlation between the stimulation and the response exists only in semantics, not in low-level appearance. Thus the encoder-decoder network is supervised by conditional GAN only, without a reconstruction error between the generated response and the real one.
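To make the data flow concrete, here is a shape-level sketch of the generator's forward pass with random, untrained weights. It only illustrates that the noise enters after the stimulation has been compressed into a low-dimension code. The linear layers, the `tanh` nonlinearity, and all sizes except the 32-frame length are assumptions; the networks described in this paper use 1D convolutions instead.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, CODE, NOISE = 32, 27, 4, 8  # 32 frames; the other sizes are assumptions

# Untrained stand-ins for the encoder and decoder weights.
W_enc = rng.normal(scale=0.1, size=(T * D, CODE))
W_dec = rng.normal(scale=0.1, size=(CODE + NOISE, T * D))


def generate(stimulation, noise):
    """Encode the stimulation, append the noise, then decode a response."""
    code = np.tanh(stimulation.reshape(-1) @ W_enc)  # low-dimension code
    hidden = np.concatenate([code, noise])           # noise joins after encoding
    return (hidden @ W_dec).reshape(T, D)


stim = rng.normal(size=(T, D))
resp_a = generate(stim, rng.normal(size=NOISE))
resp_b = generate(stim, rng.normal(size=NOISE))
```

Calling `generate` twice with the same stimulation and different noise vectors gives two different outputs of the same shape, which is the mechanism the diversity goal relies on.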

Figure 3: (a) The Act2Act network and (b) training under conditional GAN.

4 Experiments

4.1 IAT-test and IAT-train

Figure 4: Illustration of our proposed evaluation metrics.

Inspired by [13], we propose IAT-test and IAT-train scores to evaluate methods on our task, as illustrated in Fig. 4. Besides the training set A for the task, another set B composed of individual actions is introduced. Categories of actions in set B are annotated. Based on the annotations, we can pair actions in B and assign the pairs to a positive pair set or a negative pair set. The former contains action pairs under the same interaction rules as set A, while the latter contains the rest, as Fig. 4(a1) shows. Given a model trained on set A, we select stimulative actions from B and generate responses for them, resulting in a generated pair set. Fig. 4(a2) illustrates such a process. We evaluate the model according to the samples in the generated pair set in the following ways.

4.1.1 IAT-test.

With positive samples from the positive pair set of B and negative samples from its negative pair set, we train a binary classifier to judge whether an action pair accords with the defined interaction rules and give a 1/0 score accordingly. K-fold cross-validation is adopted to investigate and ensure the generalization performance of the classifier.

IAT-test is the test score of this classifier on the generated pair set, as shown in Fig. 4(b). If the generated pairs come from a perfect model, the IAT-test score should approximate the K-fold validation accuracy of the classifier during training. Otherwise, a lower score can be attributed to: 1) generated responses that are not realistic enough; 2) semantic translation relationships captured by the evaluated model that are not precise, especially when generalized to stimulative actions in set B. In other words, IAT-test quantifies how well the generation goals of reality and precision are achieved.

4.1.2 IAT-train.

Here a classifier similar to the above is trained, with positive samples from the generated pair set and negative samples from the negative pair set.

IAT-train is the test score of this classifier on set A, as shown in Fig. 4(c). A low score can appear because: 1) from unrealistic generation results, the classifier learns features useless for classifying real samples; 2) incorrect interaction relationships in the generated pairs mislead the classifier; 3) a lack of diversity in the generated pairs impairs the generalization performance of the classifier. Overall, IAT-train reflects the achievement of all three goals.

Combining the two metrics helps separate diversity from the other generation goals. In other words, when a model receives a high IAT-test score and a low IAT-train score, the latter can be reasonably attributed to poor generation diversity.
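The two protocols can be sketched as plain functions over a pair classifier. This is a toy illustration, not the evaluation code behind the reported scores: pairs are reduced to hypothetical `(stimulation_category, response_category)` integers, the interaction rule "response = stimulation + 1" is invented, and `fit_lookup` is a deliberately naive classifier that only memorizes positive pairs.

```python
def accuracy(classifier, pairs, labels):
    """Fraction of pairs whose predicted 1/0 label matches the target."""
    hits = sum(int(classifier(p) == y) for p, y in zip(pairs, labels))
    return hits / len(pairs)


def iat_test(real_classifier, generated_pairs):
    """Score generated pairs with a classifier trained on real pairs.
    Every generated pair is expected to follow the rules (label 1)."""
    return accuracy(real_classifier, generated_pairs, [1] * len(generated_pairs))


def iat_train(fit, generated_pairs, negative_pairs, test_pairs, test_labels):
    """Train on generated positives plus real negatives, test on real pairs."""
    train_pairs = generated_pairs + negative_pairs
    train_labels = [1] * len(generated_pairs) + [0] * len(negative_pairs)
    classifier = fit(train_pairs, train_labels)
    return accuracy(classifier, test_pairs, test_labels)


def fit_lookup(pairs, labels):
    """Toy classifier: memorizes the positive pairs seen during training."""
    positives = {p for p, y in zip(pairs, labels) if y == 1}
    return lambda pair: int(pair in positives)


# Hypothetical rule: a pair is valid when response == stimulation + 1.
rule_classifier = lambda pair: int(pair[1] == pair[0] + 1)
test_score = iat_test(rule_classifier, [(0, 1), (2, 3), (2, 2)])
train_score = iat_train(fit_lookup, [(0, 1), (0, 1)], [(0, 0), (2, 2)],
                        [(0, 1), (2, 3), (0, 0)], [1, 1, 0])
```

In the toy `iat_train` call, the generated set contains only one distinct pair, so the classifier misses the valid pair `(2, 3)`; this mirrors how a lack of diversity lowers IAT-train even when every generated pair is correct.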

4.2 Dataset

We evaluate our method on the UTD-MHAD [20] and AID [21] datasets, both composed of skeleton-based single-person daily interactive actions. For each dataset, action categories are first paired to form our defined interaction rules, such as "tennis serve - tennis swing", "throw - catch", etc. Then action clips in the dataset are divided into two parts: clips in one part are randomly paired according to the interaction rules to form set A for learning our task; clips in the other part are reserved as set B for evaluation.

4.2.1 UTD-MHAD

UTD-MHAD consists of 861 action clips from 27 categories performed by 8 subjects. Each frame describes a 3D human pose with 20 joints of the whole body. We select 10 of the 27 action categories and pair them into 5 meaningful interaction rules. Moreover, we use only 9 joints of the upper body since the other joints hardly participate in the selected actions. Finally, we obtain a set A of 80 action pairs and a set B of 160 individual action instances.

4.2.2 AID

AID consists of 102 long sequences, each containing several short action clips. Each frame describes a 3D human pose with 10 joints of the upper body. After removing 5 corrupted sequences, we have 97 sequences left, performed by 19 subjects and covering 10 action categories. Subsequently, 5 interaction rules are defined on the 10 categories. Finally, we obtain a set A of 282 action pairs and a set B of 407 individual action instances.

4.2.3 Implementation Details.

Similar to [22], action data are represented as normalized limb vectors instead of original joint coordinates. This setting brings two benefits. On the one hand, it eliminates the variance of body sizes of subjects in datasets. On the other hand, it ensures that the lengths of human limbs in each generated sequence are consistent.
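A minimal sketch of the limb-vector representation for a single frame is given below. The parent-index encoding of the skeleton and the three-joint example are hypothetical; the actual skeleton topologies of the two datasets are not specified here.

```python
import math


def to_limb_vectors(joints, parents):
    """Convert one frame of 3D joints into unit-length limb vectors.

    joints: list of (x, y, z) coordinates.
    parents: parents[k] is the parent joint index of joint k, or -1 for the root.
    """
    limbs = []
    for k, p in enumerate(parents):
        if p < 0:
            continue  # the root joint has no incoming limb
        d = [c - q for c, q in zip(joints[k], joints[p])]
        norm = math.sqrt(sum(v * v for v in d))
        limbs.append([v / norm for v in d] if norm > 0 else [0.0, 0.0, 0.0])
    return limbs


# Hypothetical three-joint chain: root -> joint 1 -> joint 2.
frame = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 2.0, 0.0)]
limbs = to_limb_vectors(frame, parents=[-1, 0, 1])
```

Because every limb is rescaled to unit length, two subjects of different body sizes performing the same pose map to the same representation, and limb lengths in a generated sequence cannot drift between frames.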

Action instances (whether at input or output) in our method are skeleton action sequences of size $T \times D$, where $T$ indicates the temporal length (unified to 32 frames on both datasets) and $D$ is the dimension of a 3D human pose in one frame. 1D convolutions are used throughout our networks. All GAN-based models in the following experiments are trained under WGAN-GP [23].

4.3 Comparison with GAN-based Data Augmentation

Data Augmentation    UTD-MHAD                AID
                     IAT-test   IAT-train    IAT-test   IAT-train
None                 85.32      53.92        87.29      51.17
CSGN [19]            87.86      58.97        89.96      68.82
PE (Ours)            91.03      64.94        90.69      75.65
Table 1: Quantitative comparison of data augmentation effects between CSGN and PE.

As discussed in Sec. 1 and 2.1, GAN-based augmentation methods for classification and other specified tasks cannot be applied to our task. Therefore, training an unconditional GAN to directly generate action pairs is left as the only choice for GAN-based data augmentation. We select CSGN [19], which shows promise in generating high-quality human action sequences unconditionally. A comparison of data augmentation effects between our PE method and this method is shown in Table 1.

Learning without augmentation gives generation results that are acceptable in reality and precision (an 85.32/87.29 IAT-test score), but extremely disappointing in diversity (a 53.92/51.17 IAT-train score). For augmentation, a CSGN network is first trained to model the distribution of paired action data. Then we mix generated action pairs with existing data to train our Act2Act network. This method benefits the learning of the subsequent task, as is especially visible from a significant increase in IAT-train score. However, it still lags behind our method by 3.17/0.73 in IAT-test and 5.97/6.83 in IAT-train. We examine generated actions from CSGN and find them to be realistic but not diverse enough, thus providing limited modes for augmentation. Such results are in line with the fact that GAN-based methods need large-scale training data to ensure multi-modal generation quality. By comparison, our PE method is better suited to such small-scale data. The considerable improvements in the diversity of generated action responses reflect similar improvements brought by PE in the diversity of paired training data.

4.4 Ablation Study

4.4.1 Embedding Space.

Figure 5: Action embeddings projected by PVAEs trained with VAE loss only and with PE loss also.

Fig. 5 visualizes the distribution of actions in the embedding space, projected by PVAEs trained with/without PE loss. Groundtruth category labels are utilized to color data points for comparison. As can be seen, the additional PE loss brings much better clustering effects, in both tighter grouping within categories (especially in Fig. 5(a)) and larger distances between categories (in Fig. 5(b)).

We analyze two critical hyper-parameters affecting PE data augmentation: the scale factor and the dimension of the embedding space. Augmentation effects reflected in the replacement confidence matrices are evaluated in terms of effectiveness and reliability. Specifically, effectiveness is measured as the probability that each sample is replaced by others to form new pairs. Meanwhile, we use groundtruth category labels to calculate reliability as the probability that the category of an action remains unchanged after replacement.
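Both quantities can be read off a row-normalized replacement confidence matrix, as in the sketch below. The exact normalization is an assumption: here both are averaged over samples, and reliability also counts the case where a sample keeps itself. The function names and the 3-sample example are hypothetical.

```python
def effectiveness(conf):
    """Average probability that a sample is replaced by a different sample."""
    n = len(conf)
    return sum(1.0 - conf[i][i] for i in range(n)) / n


def reliability(conf, labels):
    """Average probability that the category is unchanged after replacement."""
    n = len(conf)
    return sum(conf[i][j] for i in range(n) for j in range(n)
               if labels[i] == labels[j]) / n


# Hypothetical confidences for three samples; each row sums to 1.
conf = [[0.5, 0.5, 0.0],
        [0.5, 0.5, 0.0],
        [0.2, 0.0, 0.8]]
labels = [0, 0, 1]  # groundtruth categories, used for evaluation only
eff = effectiveness(conf)
rel = reliability(conf, labels)
```

In this example the third sample has a 0.2 chance of being swapped for a sample of another category, which raises effectiveness but lowers reliability, the same trade-off that a wider neighborhood produces.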

Figure 6: Data augmentation effects with different (a) scale factors and (b) dimensions of space.

As the neighborhood range controlled by the scale factor expands, the effectiveness of PE data augmentation increases while the reliability decreases. Fig. 6(a) indicates the scale factor at which effectiveness and reliability reach an equilibrium on both datasets. Changes brought by different dimensions of the embedding space are more complicated. As Fig. 6(b) shows, when the embedding space is 1-d, learning PVAEs to cluster actions in it can be difficult. The low reliability reflects relatively weak clustering effects in this case. The subtle difference between 3-d and 16-d then suggests a very flexible selection range for a reasonable embedding space dimension. When the dimension increases further, augmentation effects start to degrade, mostly due to the imbalance between PE loss and VAE loss when training PVAEs.

4.4.2 Comparison with Label-Given Methods.

Here experiments are conducted in label-given situations to give an upper bound of performance of our method:

Data        Label-given    UTD-MHAD                AID
                           IAT-test   IAT-train    IAT-test   IAT-train
Original    No             85.32      53.92        87.29      51.17
PE aug.     No             91.03      64.94        90.69      75.65
Re-assign   Yes            90.97      68.93        93.05      82.15
Split       Yes            91.35      71.64        95.04      85.89
Table 2: Quantitative comparison of generation effects between our proposed method and methods in label-given situations.

1) Re-assign: Actions are re-assigned into new pairs according to groundtruth labels. All paired relationships conforming to defined interaction rules are exhausted for the training of Act2Act.

2) Split: The network is explicitly split into two parts: a classification part for stimulative actions and a category-conditional generation part for responsive actions. The two parts are independently trained with category labels given and connected in series during inference.

As Table 2 shows, the method augmented by PE comes close to the label-given methods in performance, especially relative to the original baseline. With category labels given, more satisfactory generation results can be attained.

4.5 Qualitative Evaluation

Generated responses conditioned on some stimulative actions are shown in Fig. 7. Three fixed random noise vectors are involved in each generation. We first examine how responsive actions are generated with a fixed stimulation and different random noise vectors. It is worth noting that, given the same stimulative action, the generated responses vary due to the randomness of the noise vector. Such variety manifests in several aspects, like pose, movement speed and range.

Secondly, we examine how responsive actions are generated with a fixed random noise vector and different stimulations. As can be seen, all generated responses belong to the respective categories expected by the interaction rules. This indicates that within our method, the latent code in Act2Act precisely controls semantic translation. In addition, the human-like vividness shown in these generated actions is impressive. Overall, qualitative evaluation further verifies the effectiveness of our method in meeting all three generation goals.

Figure 7: Examples of generation on two datasets. For each example, the given stimulative action and generated responses corresponding to three random noise vectors are shown. Visualized actions are evenly sampled from 32-frame sequences.

5 Conclusion and Future Work

In this paper, we specify a novel task to learn end-to-end action interaction and propose a PE data augmentation method to enable learning with small-scale unlabeled data. An Act2Act network then learns from the augmented data. Two new metrics are also specially designed to evaluate methods on our task with respect to the generation goals of reality, precision and diversity. Our PE method manages to augment paired action data significantly and reliably. Experimental results show its superiority to the baseline and to other GAN-based augmentation methods, approaching the performance of label-given methods. Given the high quality of the generated action responses, our work shows broad application prospects in action interaction. We also hope our PE method will inspire other unsupervised learning tasks that have access to weak information, like the paired relationships in our task.

In the future, we plan to advance research in two directions. On the one hand, we aim to transfer our method to other output forms, like low-level control parameters of a robot platform. Thus generated responses can be directly applied in robot control. On the other hand, we expect to learn from unsegmented long interaction sequences instead of segmented clips to further simplify the data collection of our task.