Detect, anticipate and generate: Semi-supervised recurrent latent variable models for human activity modeling

Successful human-robot collaboration requires a predictive model of human behavior. The robot needs to be able to recognize current goals and actions and to predict future activities in a given context. However, the spatio-temporal sequence of human actions is difficult to model since latent factors such as intention, task, knowledge, intuition and preference determine the action choices of each individual. In this work we introduce semi-supervised variational recurrent neural networks which are able to a) model temporal distributions over latent factors and the observable feature space, b) incorporate discrete labels such as activity type when available, and c) generate possible future action sequences on both feature and label level. We evaluate our model on the Cornell Activity Dataset CAD-120. Our model outperforms state-of-the-art approaches in both activity and affordance detection and anticipation. Additionally, we show that samples of possible future action sequences are in line with past observations.



I Introduction

Human behavior is often stochastic and therefore difficult to predict over a longer period of time. Even within the context of a given task and a certain environment individuals might act differently based on e.g. intuition, prior knowledge and preferences. For example, if you provide a number of individuals with the task to prepare a meal following the same recipe, one person might follow a different order than specified because they have learned that a certain ingredient needs time to develop flavor. Another person might use only the big green knife instead of the more handy red knife because they prefer the color green and someone else might intentionally leave out a step.

One way to approach this problem is to model different types of human characters [10]. While this method is suitable for a single task setting such as an assembly line application, it might not scale to more general behavior which is distributed over many tasks and environments. A more scalable approach is structured prediction with e.g. conditional random fields (CRFs) [8, 7], which can capture the statistical dependencies between human subjects, their activities, objects in the environment and their affordances. However, common CRFs are limited in their capacity to model long-term dependencies due to the Markov assumption. Structural recurrent neural networks (S-RNN) [3] overcome this problem by employing recurrent neural networks (RNNs) as nodes and edges in the structured graph to detect and predict activity and affordance labels at each time step. The expressiveness and representational power of these neural networks increase the predictive power over short time horizons, but the model structure prohibits long-term sequence generation. As S-RNNs do not explicitly learn to predict future feature states, they cannot generate possible state-action sequences. Additionally, this deterministic model is not able to generate multiple possible sequences but is restricted to predicting a single label.

The key contribution of this paper is to address these issues with a generative, temporal model that can capture the complex dependencies of context and human features as well as discrete, hierarchical labels over time. In detail, we propose a semi-supervised variational recurrent neural network (SVRNN), as described in Section II-B, which inherits the generative capacities of a variational autoencoder (VAE) [6, 11], extends these to temporal data [1] and combines them with a discriminative model in a semi-supervised fashion. The semi-supervised VAE, first introduced by [5], can handle labeled and unlabeled data. This property allows us to propagate label information over time even during testing and therefore to generate possible future action sequences. Furthermore, we incorporate the dependencies between human and object features by extending the model to a multi-entity semi-supervised variational recurrent neural network (ME-SVRNN), as introduced in Section II-C. The ME-SVRNN propagates information about the current state of an entity to other entities, which increases the predictive power of the model. We apply our model to the Cornell Activity Dataset (CAD-120), consisting of four subjects who perform ten different high-level actions, see Section III for details. Our model is trained to simultaneously detect and anticipate the activities and object affordances and to predict the next time step in feature space. We find that our model outperforms state-of-the-art methods in both detection and anticipation (Section III-A) while being able to generate possible long-term action sequences (Section III-B). We conclude this paper with a final discussion of these findings in Section IV.

II Methodology

In this section we introduce the model structure and detail the inference procedure. After a short overview of VAEs, we begin with a description of the general SVRNN before extending it to the multi-entity case.

We denote random variables by bold characters and represent continuous data points by $\mathbf{x}$, discrete labels by $\mathbf{y}$ and latent variables by $\mathbf{z}$. The hidden state of an RNN unit at time $t$ is denoted by $\mathbf{h}_t$. Similarly, time-dependent random variables are indexed by $t$, e.g. $\mathbf{x}_t$. Distributions commonly depend on parameters $\theta$. For the sake of brevity, we will neglect this dependence in the following discussion.

Fig. 1: Model structure of the VAE (a), its semi-supervised version SVAE (b), and the recurrent model VRNN (c). Random variables (circle) and states of RNN hidden units (square) are either observed (gray), unobserved (white) or partially observed (gray-white). The dotted arrows indicate inference connections.

II-A Variational autoencoders and amortized inference

Our model builds on VAEs, latent variable models that are combined with an amortized version of variational inference (VI). Amortized VI employs neural networks to learn a function from the data $\mathbf{x}$ to a distribution over the latent variables $q(\mathbf{z}|\mathbf{x})$ that approximates the posterior $p(\mathbf{z}|\mathbf{x})$. Likewise, they learn the likelihood distribution as a function of the latent variables, $p(\mathbf{x}|\mathbf{z})$. This mapping is depicted in Figure 1a). Instead of having to infer $N$ local latent variables for $N$ observed data points, as is common in VI, amortized VI requires only the learning of the neural network parameters of the functions $q(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{x}|\mathbf{z})$. We call $q(\mathbf{z}|\mathbf{x})$ the recognition network and $p(\mathbf{x}|\mathbf{z})$ the generative network. To sample from a VAE, we first draw a sample from the prior, $\mathbf{z} \sim p(\mathbf{z})$, which is then fed to the generative network to yield $\mathbf{x} \sim p(\mathbf{x}|\mathbf{z})$. We refer to [12] for more details.
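For reference, the objective that amortized VI optimizes in a standard VAE can be written down explicitly. The block below states the usual evidence lower bound in the notation of this section; it is the textbook form for a vanilla VAE and not a formula taken from this paper.

```latex
\[
\log p(\mathbf{x}) \;\geq\;
\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\big[\log p(\mathbf{x}|\mathbf{z})\big]
\;-\; \mathrm{KL}\big(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big)
\]
```

The first term rewards reconstructions from the generative network, while the second keeps the output of the recognition network close to the prior.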

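To incorporate label information when available, semi-supervised VAEs (SVAE) [5] include a label $\mathbf{y}$ in the generative process $p(\mathbf{x}|\mathbf{z},\mathbf{y})$ and the recognition network $q(\mathbf{z}|\mathbf{x},\mathbf{y})$, as shown in Figure 1b). To handle unobserved labels, an additional approximate distribution over labels $q(\mathbf{y}|\mathbf{x})$ is learned, which can be interpreted as a classifier. When no label is available, the discrete label distribution can be marginalized out, e.g.

$$q(\mathbf{z}|\mathbf{x}) = \sum_{\mathbf{y}} q(\mathbf{z}|\mathbf{x},\mathbf{y})\, q(\mathbf{y}|\mathbf{x}).$$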
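The marginalization over unobserved labels can be illustrated with a small numerical sketch. The Python snippet below is not the authors' implementation: it assumes a hypothetical three-class classifier output `q_y` and hypothetical per-label means `mu_z_given_y` of a one-dimensional latent variable, and computes the mean of the resulting mixture.

```python
# Toy illustration of marginalizing a discrete label out of q(z | x, y).
# All numbers are made up; q_y stands in for a classifier output q(y | x).

def marginal_latent_mean(q_y, mu_z_given_y):
    """Mean of the mixture sum_y q(y | x) * q(z | x, y) for scalar z."""
    if abs(sum(q_y) - 1.0) > 1e-9:
        raise ValueError("q_y must sum to one")
    # Weight each label-conditional mean by the probability of its label.
    return sum(p * mu for p, mu in zip(q_y, mu_z_given_y))


q_y = [0.5, 0.25, 0.25]          # hypothetical q(y | x) over three labels
mu_z_given_y = [0.0, 2.0, 4.0]   # hypothetical means of q(z | x, y)
mixture_mean = marginal_latent_mean(q_y, mu_z_given_y)
print(mixture_mean)
```

With an observed label the sum collapses to a single term, which is why the classifier is only needed for unlabeled data points.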


VAEs can also be extended to temporal data, resulting in so-called variational recurrent neural networks (VRNN) [1]. Instead of being stationary as in vanilla VAEs, the prior over the latent variables depends in this case on past observations, $p(\mathbf{z}_t|\mathbf{h}_{t-1})$, which are encoded in the hidden state of an RNN, $\mathbf{h}_{t-1}$. Similarly, the approximate distribution $q(\mathbf{z}_t|\mathbf{x}_t,\mathbf{h}_{t-1})$ depends on the history, as can be seen in Figure 1c). The advantage of this structure is that data sequences can be generated by sampling from the temporal prior instead of an uninformed prior, i.e. $\mathbf{z}_t \sim p(\mathbf{z}_t|\mathbf{h}_{t-1})$.
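The temporal structure can be made explicit through the factorization of the joint distribution used in the VRNN of [1]. The block below restates that factorization together with a generic recurrence; the function name $f$ is a placeholder for the RNN update and is not specified further here.

```latex
\[
p(\mathbf{x}_{\leq T}, \mathbf{z}_{\leq T})
= \prod_{t=1}^{T}
p(\mathbf{x}_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t})\,
p(\mathbf{z}_t \mid \mathbf{x}_{<t}, \mathbf{z}_{<t}),
\qquad
\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{z}_t, \mathbf{h}_{t-1})
\]
```

Because $\mathbf{h}_{t-1}$ summarizes $\mathbf{x}_{<t}$ and $\mathbf{z}_{<t}$, conditioning on the hidden state is equivalent to conditioning on the full history in this model.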

II-B Semi-supervised variational recurrent neural network

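For SVRNN, we assume that we are given a dataset with temporal structure $\mathcal{D} = \{\mathcal{D}_L, \mathcal{D}_U\}$ consisting of $L$ labeled time steps $\mathcal{D}_L = \{\mathbf{x}_t, \mathbf{y}_t\} \sim \tilde{p}(\mathbf{x}_t, \mathbf{y}_t)$ and $U$ unlabeled observations $\mathcal{D}_U = \{\mathbf{x}_t\} \sim \tilde{p}(\mathbf{x}_t)$. $\tilde{p}$ denotes the empirical distribution. Further we assume that the temporal process is governed by latent variables $\mathbf{z}_t$, whose distribution $p(\mathbf{z}_t|\mathbf{h}_{t-1})$ depends on a deterministic function of the history up to time $t$: $\mathbf{h}_{t-1} = f(\mathbf{x}_{<t}, \mathbf{y}_{<t}, \mathbf{z}_{<t})$. The generative process follows $\mathbf{y}_t \sim p(\mathbf{y}_t|\mathbf{h}_{t-1})$, $\mathbf{z}_t \sim p(\mathbf{z}_t|\mathbf{y}_t,\mathbf{h}_{t-1})$ and finally $\mathbf{x}_t \sim p(\mathbf{x}_t|\mathbf{y}_t,\mathbf{z}_t,\mathbf{h}_{t-1})$. Here, $p(\mathbf{y}_t|\mathbf{h}_{t-1})$ and $p(\mathbf{z}_t|\mathbf{y}_t,\mathbf{h}_{t-1})$ are time-dependent priors, as shown in Figure 2a). To fit this model to the dataset at hand, we need to estimate the posterior over the unobserved variables $\mathbf{y}_t$ and $\mathbf{z}_t$, which is intractable. Therefore we resort to amortized VI and approximate the posterior with a simpler distribution $q(\mathbf{y}_t,\mathbf{z}_t|\mathbf{x}_t,\mathbf{h}_{t-1}) = q(\mathbf{y}_t|\mathbf{x}_t,\mathbf{h}_{t-1})\, q(\mathbf{z}_t|\mathbf{x}_t,\mathbf{y}_t,\mathbf{h}_{t-1})$, as shown in Figure 2b). To minimize the distance between the approximate and posterior distributions, we optimize the variational lower bound $\mathcal{L}$ of the marginal likelihood $p(\mathcal{D})$. As the distribution over $\mathbf{y}_t$ is only required when it is unobserved, the bound decomposes as follows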


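$$\mathcal{L} = \mathcal{L}^{L} + \mathcal{L}^{U} + \mathcal{L}^{a}. \qquad (1)$$

$\mathcal{L}^{L}$ and $\mathcal{L}^{U}$ are the lower bounds for labeled and unlabeled data points respectively, while $\mathcal{L}^{a}$ is an additional term that encourages $p(\mathbf{y}_t|\mathbf{h}_{t-1})$ and $q(\mathbf{y}_t|\mathbf{x}_t,\mathbf{h}_{t-1})$ to follow the data distribution over $\mathbf{y}_t$. This lower bound is optimized jointly. We assume the latent variables $\mathbf{z}_t$ to be i.i.d. Gaussian distributed. The categorical distribution over $\mathbf{y}_t$ is determined by parameters $\boldsymbol{\pi}$. To model such discrete distributions, we apply the Gumbel trick [4, 9]. The history $\mathbf{h}_{t-1}$ is modeled with a long short-term memory (LSTM) unit [2]. For more details, we refer the reader to the related work discussed in Section II-A.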
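The Gumbel trick mentioned above can be sketched in a few lines. The snippet below is a minimal, standard-library Python illustration of drawing a relaxed one-hot sample from a categorical distribution; the function name `gumbel_softmax_sample`, the class probabilities and the temperature of 0.5 are assumptions made for the example and are not taken from the paper.

```python
import math
import random


def gumbel_softmax_sample(log_probs, temperature, rng):
    """Draw a relaxed one-hot sample from a categorical distribution.

    log_probs: (possibly unnormalized) log-probabilities, one per class.
    temperature: positive float; smaller values give sharper samples.
    """
    noisy = []
    for lp in log_probs:
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logarithms finite
        # Gumbel(0, 1) noise by inverse transform sampling: g = -log(-log(u)).
        g = -math.log(-math.log(u))
        noisy.append((lp + g) / temperature)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
class_log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]  # hypothetical
sample = gumbel_softmax_sample(class_log_probs, 0.5, rng)
print(sample)
```

The result is a probability vector rather than a hard label, which keeps the sampling step differentiable with respect to the class parameters when implemented in an automatic differentiation framework.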

Fig. 2: Information flow through SVRNN. a) Passing samples from the prior through the generative network. b) Information passing through the inference network. c) The recurrent update. Node appearance follows Figure 1.

II-C Modeling multiple entities

To model different entities, we allow them to share information with each other over time. The structure and information flow of this model is a design choice. In our case, these entities consist of the human and additional entities, such as objects or other humans. We mark each variable with its source, i.e. the human or one of the additional entities, and summarize the history and the current observation of all additional entities in one variable each. Instead of only conditioning on its own history and observation, as described in Section II-B, we let the entities share information by conditioning on others' history and observations. Specifically, the model of the human receives information from all additional entities, while these receive information from the human model. The priors of the human are thus additionally conditioned on the summarized history of the additional entities, and its approximate distributions on their summarized history and observations; conversely, the priors and approximate distributions of each additional entity are additionally conditioned on the history and observation of the human. We assume that the labels for all entities are observed and unobserved at the same points in time. Therefore, the lower bound in Equation 1 is extended by summing over all entities.
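The information sharing between entities can be pictured as assembling, for each entity, the context it conditions on. The Python sketch below is only an illustration under stated assumptions: the text does not specify how the additional entities are summarized, so an element-wise mean is assumed here, and the names `summarize` and `conditioning_inputs` are hypothetical.

```python
def summarize(vectors):
    """Element-wise mean of equally sized vectors (an assumed summary)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def conditioning_inputs(h_human, h_objects):
    """Build the history each entity conditions on in the multi-entity case."""
    h_others = summarize(h_objects)
    # The human sees its own history plus the summary of all other entities.
    human_context = h_human + h_others
    # Each additional entity sees its own history plus the human's history.
    object_contexts = [h_o + h_human for h_o in h_objects]
    return human_context, object_contexts


h_human = [0.1, 0.2]                  # toy hidden state of the human model
h_objects = [[1.0, 0.0], [3.0, 2.0]]  # toy hidden states of two objects
human_context, object_contexts = conditioning_inputs(h_human, h_objects)
print(human_context)
print(object_contexts)
```

The same construction would apply to the current observations that enter the approximate distributions.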

III Experiments

Fig. 3: Sampled sub-activity sequences given the last five observed sub-activities of the high-level actions taking medicine (top) and having a meal (bottom). Black lines indicate ground truth and gray lines indicate sampled sub-activities. A sub-activity has an average duration of 3.6 seconds.

In this section, we present our experimental results. We evaluate our model on the Cornell Activity Dataset 120 (CAD-120) [8]. This dataset consists of 4 subjects performing 10 high-level tasks, such as cleaning a microwave or having a meal, in 3 trials each. These activities are further annotated with 10 sub-activities, such as moving and eating, and 12 object affordances, such as movable and openable. In this work we focus on classifying the sub-activities and affordances. We use the features extracted in [8] and preprocess these as in [3]. Our results rely on four-fold cross-validation with the same folds as used in [8]. For comparison, we trained the S-RNN models, for which code is provided online, on these folds and under the same conditions as described in [3]. We use a learning rate of 0.001, a batch size of 10 and the Adagrad optimizer. Further, we apply a dropout rate of 0.1 to all units except the latent variable parameters and the output layers. In each batch, we mark ca. 25 % of the labels as unobserved. The object models share all parameters, i.e. we effectively learn one human model and one object model both in the single- and multi-entity case.
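Marking a fraction of the labels in a batch as unobserved can be sketched as follows. This is a hedged illustration, not the authors' code: the helper `mask_labels` is hypothetical, unobserved labels are represented by `None`, and the masking is assumed to be drawn uniformly at random per label.

```python
import random


def mask_labels(labels, unobserved_fraction, rng):
    """Return a copy of `labels` with a random subset replaced by None."""
    n_hidden = round(unobserved_fraction * len(labels))
    hidden = set(rng.sample(range(len(labels)), n_hidden))
    return [None if i in hidden else y for i, y in enumerate(labels)]


rng = random.Random(0)
# Toy batch of sub-activity labels; the label names follow the text.
batch_labels = ["moving", "eating", "opening", "closing",
                "moving", "cleaning", "eating", "moving"]
masked = mask_labels(batch_labels, 0.25, rng)
print(masked)
```

Entries set to `None` would contribute to the unlabeled bound, all others to the labeled bound.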

III-A Detection and anticipation

First, we investigate the ability of our model to detect the current sub-activity and object affordance and to anticipate these labels at the following time step. We compare the performance to the anticipatory CRF (ATCRF) reported in [7] and the replicated results of the S-RNN [3]. The F1 score of all models, averaged over the cross-validation folds and 20 samples from the latent distributions, is reported in Table I. The SVRNN without information exchange between entities outperforms the baseline methods in all measures except sub-activity detection, where the ATCRF scores higher, while the multi-entity model achieves the highest values throughout. Sub-activity detection and anticipation in particular seem to benefit from the information provided by the object states and observations.

              Detection             Anticipation
Method        Sub-Act   Obj-Aff     Sub-Act   Obj-Aff
ATCRF [7]     86.4      85.2        40.6      41.4
S-RNN [3]     69.6      84.8        53.9      74.3
SVRNN         83.4      88.3        67.7      81.4
ME-SVRNN      89.8      90.5        77.1      82.1
TABLE I: Average F1 score (in %) of sub-activity (Sub-Act) and object affordance (Obj-Aff) labels for detection and anticipation.

III-B Generation

In contrast to S-RNN, our SVRNN model is able to generate possible, long-term action sequences. These are generated by propagating a short observation sequence through the network to obtain the summarizing state $\mathbf{h}_{t-1}$ and subsequently sampling from the priors $p(\mathbf{y}_t|\mathbf{h}_{t-1})$ and $p(\mathbf{z}_t|\mathbf{y}_t,\mathbf{h}_{t-1})$. These samples are used by the generative network to make a prediction of the next observation $\mathbf{x}_t$, which forms the next input to the model. We present a number of sampled sub-activity sequences in Figure 3. Note that a sub-activity has an average duration of 3.6 seconds [7]. Thus, we sample possible sequences for around 18 seconds, i.e. about five sub-activities, into the future. The samples are plausible action sequences given the observed past. For example, the model remembers over several time steps that the action opening requires closing. Additionally, unrelated sub-activities such as cleaning are not sampled.
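The generation procedure described above amounts to an ancestral sampling loop in which each prediction is fed back as the next input. The Python sketch below shows only this control flow on a toy, scalar model: the priors, the generative network and the recurrence are simple placeholder functions invented for the example, not the trained networks of the paper.

```python
import math
import random


def sample_categorical(probs, rng):
    """Sample an index from a list of probabilities."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def generate(h, steps, rng):
    """Roll the toy model forward from the summary state h without observations."""
    labels, observations = [], []
    for _ in range(steps):
        # Placeholder prior p(y_t | h_{t-1}) over two labels.
        p1 = 1.0 / (1.0 + math.exp(-h))
        y = sample_categorical([1.0 - p1, p1], rng)
        # Placeholder prior p(z_t | y_t, h_{t-1}): unit-variance Gaussian.
        z = rng.gauss(0.5 * h + (1.0 if y == 1 else -1.0), 1.0)
        # Placeholder generative network predicting the next observation x_t.
        x = math.tanh(z + 0.1 * h)
        # Placeholder recurrence: the prediction becomes the next input.
        h = math.tanh(0.5 * h + 0.5 * x + 0.2 * z)
        labels.append(y)
        observations.append(x)
    return labels, observations


rng = random.Random(0)
labels, observations = generate(h=0.0, steps=5, rng=rng)
horizon_seconds = 5 * 3.6  # five sub-activities of 3.6 s on average
print(labels, horizon_seconds)
```

Repeating the loop with different random seeds yields different label and feature sequences from the same observed past, which is what distinguishes this generative model from a deterministic predictor.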

IV Conclusion

In this work, we presented a generative, temporal model for human activity modeling. Our experimental evaluation shows promising performance in the three tasks of detection, anticipation and generation. In future work, we are planning to evaluate the model more extensively and to extend the model to hierarchical label structures.