I. Introduction
Human behavior is often stochastic and therefore difficult to predict over long periods of time. Even within the context of a given task and a certain environment, individuals might act differently based on, e.g., intuition, prior knowledge, and preferences. For example, if you give a number of individuals the task of preparing a meal following the same recipe, one person might follow a different order than specified because they have learned that a certain ingredient needs time to develop flavor. Another person might use only the big green knife instead of the handier red knife because they prefer the color green, and someone else might intentionally leave out a step.
One way to approach this problem is to model different types of human characters [10]. While this method is suitable for a single-task setting such as an assembly-line application, it might not scale to more general behavior that is distributed over many tasks and environments. A more scalable approach is structured prediction with, e.g., conditional random fields (CRFs) [8, 7], which can capture the statistical dependencies between human subjects, their activities, objects in the environment, and their affordances. However, common CRFs are limited in their capacity to model long-term dependencies due to the Markov assumption. Structural recurrent neural networks (SRNNs) [3] overcome this problem by employing recurrent neural networks (RNNs) as nodes and edges in the structured graph to detect and predict activity and affordance labels at each time step. The expressiveness and representational power of these neural networks increases the predictive power over short time horizons, but the model structure prohibits long-term sequence generation. As SRNNs do not explicitly learn to predict future feature states, they cannot generate possible state-action sequences. Additionally, this deterministic model is not able to generate multiple possible sequences but is restricted to predicting a single label.
The key contribution of this paper is to address these issues with a generative, temporal model that can capture the complex dependencies of context and human features as well as discrete, hierarchical labels over time. In detail, we propose a semi-supervised variational recurrent neural network (SVRNN), as described in Section II-B, which inherits the generative capacities of a variational autoencoder (VAE) [6, 11], extends these to temporal data [1], and combines them with a discriminative model in a semi-supervised fashion. The semi-supervised VAE, first introduced by [5], can handle both labeled and unlabeled data. This property allows us to propagate label information over time even during testing, and therefore to generate possible future action sequences. Furthermore, we incorporate the dependencies between human and object features by extending the model to a multi-entity semi-supervised variational recurrent neural network (ME-SVRNN), as introduced in Section II-C. The ME-SVRNN propagates information about the current state of an entity to other entities, which increases the predictive power of the model. We apply our model to the Cornell Activity Dataset (CAD-120), consisting of 4 subjects who perform ten different high-level actions; see Section III for details. Our model is trained to simultaneously detect and anticipate the activities and object affordances and to predict the next time step in feature space. We find that our model outperforms state-of-the-art methods in both detection and anticipation (Section III-A) while being able to generate possible long-term action sequences (Section III-B). We conclude this paper with a final discussion of these findings in Section IV.

II. Methodology
In this section, we introduce the model structure and detail the inference procedure. After a short overview of VAEs, we describe the general SVRNN before extending it to the multi-entity case.
We denote random variables by bold characters and represent continuous data points by x, discrete labels by y, and latent variables by z. The hidden state of an RNN unit at time t is denoted by h_t. Similarly, time-dependent random variables are indexed by t, e.g. x_t. Distributions commonly depend on parameters θ. For the sake of brevity, we will neglect this dependence in the following discussion.

II-A. Variational autoencoders and amortized inference
Our model builds on VAEs, latent variable models that are combined with an amortized version of variational inference (VI). Amortized VI employs neural networks to learn a function q(z | x) from the data to a distribution over the latent variables that approximates the posterior p(z | x). Likewise, they learn the likelihood distribution p(x | z) as a function of the latent variables. This mapping is depicted in Figure 1a). Instead of having to infer local latent variables for each observed data point, as is common in VI, amortized VI requires only learning the neural network parameters of the functions q(z | x) and p(x | z). We call q(z | x) the recognition network and p(x | z) the generative network. To sample from a VAE, we first draw a sample from the prior p(z), which is then fed to the generative network to yield x ~ p(x | z). We refer to [12] for more details.
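As a concrete illustration, the encode-sample-decode cycle of amortized inference can be sketched with stand-in linear networks. All dimensions and weights below are illustrative placeholders, not the trained parameters of our model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): 8-d observations, 2-d latent space.
D_X, D_Z, D_H = 8, 2, 16

# Randomly initialised weights stand in for trained network parameters.
W_enc = rng.normal(0, 0.1, (D_X, D_H))
W_mu = rng.normal(0, 0.1, (D_H, D_Z))
W_logvar = rng.normal(0, 0.1, (D_H, D_Z))
W_dec = rng.normal(0, 0.1, (D_Z, D_X))

def recognition(x):
    """q(z|x): map a data point to the parameters of a Gaussian over z."""
    h = np.tanh(x @ W_enc)
    return h @ W_mu, h @ W_logvar  # mean and log-variance

def generative(z):
    """p(x|z): map a latent sample back to (the mean of) an observation."""
    return np.tanh(z @ W_dec)

# Encode, sample with the reparameterization trick, decode.
x = rng.normal(size=D_X)
mu, logvar = recognition(x)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=D_Z)
x_recon = generative(z)

# KL(q(z|x) || N(0, I)) has a closed form for diagonal Gaussians.
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
```

The reparameterized sample z = mu + std * eps is what makes the recognition network trainable by backpropagation in an actual implementation.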
To incorporate label information when available, semi-supervised VAEs (SVAEs) [5] include a label y in the generative process p(x | y, z) and the recognition network q(z | x, y), as shown in Figure 1b). To handle unobserved labels, an additional approximate distribution over labels, q(y | x), is learned, which can be interpreted as a classifier. When no label is available, the discrete label distribution can be marginalized out, e.g. q(z | x) = Σ_y q(z | x, y) q(y | x).

VAEs can also be extended to temporal data, resulting in so-called variational recurrent neural networks (VRNNs) [1]. Instead of being stationary as in vanilla VAEs, the prior over the latent variables in this case depends on past observations x_{1:t-1}, which are encoded in the hidden state h_{t-1} of an RNN: p(z_t | h_{t-1}). Similarly, the approximate distribution q(z_t | x_t, h_{t-1}) depends on the history, as can be seen in Figure 1c). The advantage of this structure is that data sequences can be generated by sampling from the temporal prior instead of an uninformed prior, i.e. x_t ~ p(x_t | z_t, h_{t-1}).
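Sampling such discrete label distributions in a differentiable way is what the Gumbel trick [4, 9], which we employ in Section II-B, provides. A minimal sketch with illustrative logits and temperature:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    """Draw one relaxed one-hot sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-
    controlled softmax; as tau -> 0 the sample approaches one-hot."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())       # numerically stable softmax
    return y / y.sum()

# Illustrative logits over 10 label classes.
alpha = rng.normal(size=10)
sample = gumbel_softmax(alpha, tau=0.5)
```

Because the sample is a smooth function of the logits, gradients can flow through it during training, unlike with a hard argmax.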
II-B. Semi-supervised variational recurrent neural network
For the SVRNN, we assume that we are given a dataset with temporal structure consisting of labeled time steps (x_t, y_t) ~ p̃_l and unlabeled observations x_t ~ p̃_u, where p̃ denotes the empirical distribution. Further, we assume that the temporal process is governed by latent variables z_t, whose distribution depends on a deterministic function h_{t-1} of the history up to time t. The generative process follows y_t ~ p(y_t | h_{t-1}), z_t ~ p(z_t | y_t, h_{t-1}) and finally x_t ~ p(x_t | y_t, z_t, h_{t-1}). Here, p(y_t | h_{t-1}) and p(z_t | y_t, h_{t-1}) are time-dependent priors, as shown in Figure 2a). To fit this model to the dataset at hand, we need to estimate the posterior over the unobserved variables z_t and y_t, which is intractable. Therefore, we resort to amortized VI and approximate the posterior with a simpler distribution q(z_t, y_t | x_t, h_{t-1}), as shown in Figure 2b). To minimize the distance between the approximate and posterior distributions, we optimize the variational lower bound of the marginal likelihood. As the distribution over y_t is only required when the label is unobserved, the bound decomposes as follows:

L = Σ_{(x_t, y_t) ~ p̃_l} L_l(x_t, y_t) + Σ_{x_t ~ p̃_u} L_u(x_t) + α E_{p̃_l(x_t, y_t)}[log q(y_t | x_t, h_{t-1})]    (1)
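Written out, the two bounds follow the semi-supervised VAE of [5], with each distribution additionally conditioned on the history as in [1]. This is a sketch of the standard form; constant terms and weightings may differ in a concrete implementation:

```latex
\begin{align}
\mathcal{L}_l(\mathbf{x}_t, \mathbf{y}_t) &=
  \mathbb{E}_{q(\mathbf{z}_t \mid \mathbf{x}_t, \mathbf{y}_t, \mathbf{h}_{t-1})}
  \big[ \log p(\mathbf{x}_t \mid \mathbf{y}_t, \mathbf{z}_t, \mathbf{h}_{t-1}) \big]
  + \log p(\mathbf{y}_t \mid \mathbf{h}_{t-1}) \nonumber \\
&\quad - \mathrm{KL}\big( q(\mathbf{z}_t \mid \mathbf{x}_t, \mathbf{y}_t, \mathbf{h}_{t-1})
  \,\|\, p(\mathbf{z}_t \mid \mathbf{y}_t, \mathbf{h}_{t-1}) \big), \\
\mathcal{L}_u(\mathbf{x}_t) &=
  \sum_{\mathbf{y}_t} q(\mathbf{y}_t \mid \mathbf{x}_t, \mathbf{h}_{t-1})\,
  \mathcal{L}_l(\mathbf{x}_t, \mathbf{y}_t)
  + \mathcal{H}\big( q(\mathbf{y}_t \mid \mathbf{x}_t, \mathbf{h}_{t-1}) \big).
\end{align}
```

For unlabeled steps, the label is thus treated as an additional latent variable: the labeled bound is averaged under the classifier q(y_t | x_t, h_{t-1}), plus its entropy H.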
L_l and L_u are the lower bounds for labeled and unlabeled data points respectively, while the third term additionally encourages q(y_t | x_t, h_{t-1}) to follow the data distribution over y_t. This lower bound is optimized jointly. We assume the latent variables z_t to be i.i.d. Gaussian distributed. The categorical distribution over y_t is determined by parameters α. To model such discrete distributions, we apply the Gumbel trick [4, 9]. The history h_t is modeled with a long short-term memory (LSTM) unit
[2]. For more details, we refer the reader to the related work discussed in Section II-A.

II-C. Modeling multiple entities
To model different entities, we allow these to share information with each other over time. The structure and information flow of this model is a design choice. In our case, the entities consist of the human and additional entities, such as objects or other humans. We denote the dependency of variables on their source by the superscripts h (human) and e (additional entity). Further, we summarize the histories and current observations of all additional entities by h_{t-1}^E and x_t^E respectively. Instead of each entity conditioning only on its own history and observation, as described in Section II-B, we let the entities share information by conditioning on the others' histories and observations. Specifically, the model of the human receives information from all additional entities, while these receive information from the human model. Let c_t^h = (h_{t-1}^E, x_t^E) and c_t^e = (h_{t-1}^h, x_t^h) for e = 1, ..., E. The priors and approximate distribution then become p(y_t^h | h_{t-1}^h, c_t^h), p(z_t^h | y_t^h, h_{t-1}^h, c_t^h), and q(y_t^h | x_t^h, h_{t-1}^h, c_t^h) for the human, and p(y_t^e | h_{t-1}^e, c_t^e), p(z_t^e | y_t^e, h_{t-1}^e, c_t^e), and q(y_t^e | x_t^e, h_{t-1}^e, c_t^e) for each additional entity e. We assume that the labels of all entities are observed and unobserved at the same points in time. Therefore, the lower bound in Equation 1 is extended by summing over all entities.
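As an illustration, the shared conditioning sets can be assembled as follows. The pooling by summation and all dimensionalities are our own illustrative assumptions; the concrete sharing mechanism is, as noted above, a design choice:

```python
import numpy as np

rng = np.random.default_rng(0)
D_H, D_X = 16, 8  # illustrative hidden-state and feature dimensions

# Hidden states and observations for the human and two object entities.
h_human, x_human = rng.normal(size=D_H), rng.normal(size=D_X)
h_objs = [rng.normal(size=D_H) for _ in range(2)]
x_objs = [rng.normal(size=D_X) for _ in range(2)]

# The human model conditions on the pooled histories and observations of
# all additional entities; here we pool by summation (an assumption).
h_shared = sum(h_objs)
x_shared = sum(x_objs)
human_context = np.concatenate([h_human, h_shared, x_human, x_shared])

# Each object model conditions on its own state plus the human's.
obj_contexts = [np.concatenate([h_o, h_human, x_o, x_human])
                for h_o, x_o in zip(h_objs, x_objs)]
```

Each context vector would then be fed to the corresponding prior, recognition, and generative networks of that entity's SVRNN.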
III. Experiments
In this section, we present our experimental results. We evaluate our model on the Cornell Activity Dataset 120 (CAD-120) [8]. This dataset consists of 4 subjects performing 10 high-level tasks, such as cleaning a microwave or having a meal, in 3 trials each. These activities are further annotated with 10 sub-activities, such as moving and eating, and 12 object affordances, such as movable and openable. In this work, we focus on classifying the sub-activities and affordances. We use the features extracted in [8] and preprocess them as in [3]. Our results rely on four-fold cross-validation with the same folds as used in [8]. For comparison, we trained the SRNN models, for which code is provided online, on these folds and under the same conditions as described in [3]. We use a learning rate of 0.001, a batch size of 10, and the Adagrad optimizer. Further, we apply a dropout rate of 0.1 to all units except the latent variable parameters and the output layers. In each batch, we mark ca. 25 % of the labels as unobserved. The object models share all parameters, i.e., we effectively learn one human model and one object model in both the single-entity and multi-entity case.

III-A. Detection and anticipation
First, we investigate the ability of our model to detect the current sub-activity and object affordance and to anticipate these labels at the following time step. We compare the performance to the anticipatory CRF reported in [7] and to our replication of the SRNN results [3]. The F1 scores of all models, averaged over the cross-validation folds and 20 samples from the latent distributions, are reported in Table I. While the SVRNN without information exchange between entities already outperforms the baseline methods, the multi-entity model achieves the highest scores. Especially sub-activity detection and anticipation seem to benefit from the information provided by the object states and observations.
III-B. Generation
In contrast to the SRNN, our SVRNN model is able to generate possible long-term action sequences. These are generated by propagating a short observation sequence through the network to obtain the summarizing state h_t, and by subsequently sampling from the priors p(y_t | h_{t-1}) and p(z_t | y_t, h_{t-1}). These samples are used by the generative network to predict the next observation x_t, which forms the next input to the model. We present a number of sampled sub-activity sequences in Figure 3. Note that a sub-activity has an average duration of 3.6 seconds [7]; thus, we sample possible sequences reaching around 18 seconds into the future. The samples are plausible action sequences given the observed past. For example, the model remembers over several time steps that the action opening needs to be followed by closing. Additionally, unrelated sub-activities such as cleaning are not sampled.
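The generation procedure can be sketched as a sampling loop. All networks are replaced by random stand-in weights here; the names and dimensions are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
D_X, D_Z, D_Y, D_H = 8, 2, 10, 16  # illustrative dimensions

# Stand-ins for the trained prior, generative and recurrence networks.
W_prior_y = rng.normal(0, 0.1, (D_H, D_Y))
W_prior_z = rng.normal(0, 0.1, (D_H + D_Y, D_Z))
W_gen = rng.normal(0, 0.1, (D_Z + D_Y, D_X))
W_rec = rng.normal(0, 0.1, (D_H + D_X + D_Y + D_Z, D_H))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rollout(h, steps=5):
    """Generate a label sequence by sampling from the temporal priors."""
    labels = []
    for _ in range(steps):
        p_y = softmax(h @ W_prior_y)                        # p(y_t | h_{t-1})
        y = np.eye(D_Y)[rng.choice(D_Y, p=p_y)]             # sampled label
        z = np.concatenate([h, y]) @ W_prior_z              # mean of p(z_t | y_t, h_{t-1})
        x = np.tanh(np.concatenate([z, y]) @ W_gen)         # predicted next observation
        h = np.tanh(np.concatenate([h, x, y, z]) @ W_rec)   # update history
        labels.append(int(y.argmax()))
    return labels

# h would come from propagating an observed prefix through the network;
# here we start from a random summary state for illustration.
seq = rollout(rng.normal(size=D_H), steps=5)
```

Feeding the model's own prediction x back into the recurrence is what allows open-ended rollouts well beyond the observed prefix.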
IV. Conclusion
In this work, we presented a generative, temporal model for human activity modeling. Our experimental evaluation shows promising performance in the three tasks of detection, anticipation and generation. In future work, we are planning to evaluate the model more extensively and to extend the model to hierarchical label structures.
References
 [1] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2980–2988, 2015.
 [2] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

 [3] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [4] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.
 [5] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589, 2014.
 [6] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint, 2013.
 [7] Hema S Koppula and Ashutosh Saxena. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):14–29, 2016.
 [8] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. Learning human activities and object affordances from RGB-D videos. The International Journal of Robotics Research, 32(8):951–970, 2013.

 [9] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
 [10] Stefanos Nikolaidis, Ramya Ramakrishnan, Keren Gu, and Julie Shah. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In 2015 10th ACM/IEEE International Conference on HRI, pages 189–196. ACM, 2015.
 [11] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [12] Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. arXiv preprint arXiv:1711.05597, 2017.