Multi-Level Sequence GAN for Group Activity Recognition

12/18/2018 ∙ by Harshala Gammulle, et al. ∙ qut

We propose a novel semi-supervised, Multi-Level Sequential Generative Adversarial Network (MLS-GAN) architecture for group activity recognition. In contrast to previous works which utilise manually annotated individual human action predictions, we allow the model to learn its own internal representations to discover pertinent sub-activities that aid the final group activity recognition task. The generator is fed with person-level and scene-level features that are mapped temporally through LSTM networks. Action-based feature fusion is performed through novel gated fusion units that are able to consider long-term dependencies, exploring the relationships among all individual actions, to learn an intermediate representation or ‘action code’ for the current group activity. The network achieves its semi-supervised behaviour by performing group action classification together with the adversarial real/fake validation. We perform extensive evaluations on different architectural variants to demonstrate the importance of the proposed architecture. Furthermore, we show that utilising both person-level and scene-level features facilitates the group activity prediction better than using only person-level features. Our proposed architecture outperforms current state-of-the-art results for sports and pedestrian based classification tasks on the Volleyball and Collective Activity datasets, showing its flexible nature for effective learning of group activities.




1 Introduction

The area of human activity analysis has been an active field within the research community as it can aid in numerous important real world tasks such as video surveillance, video search and retrieval, sports video analytics, etc. In such scenarios, methods with the capability to handle multi-person actions and determine the collective action being performed play a major role. Among the main challenges, handling different people appearing at different times and capturing their contribution towards the overall group activity is crucial. Learning the interactions between these individuals further aids the recognition of the collaborative action. Methods should retain the ability to capture information from the overall frame together with information from individual agents. We argue that the overall frame is important as it provides information regarding the varying background and context, the positions of agents within the frame and objects related to the action (e.g. the ball and the net in volleyball) together with the individual agent information.

Recent works on group activity analysis have utilised recurrent neural network architectures to capture temporal dynamics in video sequences. Even though deep networks are capable of performing automatic feature learning, they require manual human effort to design effective losses. Therefore, GAN based networks have become beneficial in overcoming the limitation of deep networks as they are capable of learning both features and the loss function automatically. Furthermore extending the GAN based architecture to a semi-supervised architecture, which is obtained by combining the unsupervised GAN objective with supervised classification objective, leverages the capacity for the network to learn from both labelled and unlabelled data.

In this paper we present a semi-supervised Generative Adversarial Network (GAN) architecture based on video sequence modelling with LSTM networks to perform group activity recognition. Figure 1 shows the overall framework of our proposed GAN architecture for group activity recognition. The generator is fed with sequences of person-level and scene-level RGB features which are extracted through the visual feature extractor for each person and the scene. The extracted features are then sent through separate LSTMs to map the temporal correspondence of the sequences at each level. We utilise a gated fusion unit inspired by [1] to map the relevance of these LSTM outputs to an intermediate action representation, an ‘action code’, to represent the current group action. These action codes are then employed by the discriminator model, which determines the current group action class and whether the given action code is real (ground truth) or fake (generated). Overall, the generator focuses on generating action codes that are indistinguishable from the ground truth action codes while the discriminator tries to achieve the real/fake and group activity classifications. With the use of a gated fusion unit, the model gains the ability to consider all the inputs when deciding on the output. Therefore, it is able to map the relevance of each performed individual action and their interactions with the attention weights automatically, to perform the final classification task.
The contributions of our proposed model are as follows: (i) we introduce a novel recurrent semi-supervised GAN framework for group activity recognition, (ii) we formulate the framework such that it incorporates both person-level and scene-level features to determine the group activity, (iii) we demonstrate a feature fusion mechanism with gated fusion units, which automatically learns an attention mechanism focusing on the relevance of each individual agent and their interactions, (iv) we evaluate the proposed model on pedestrian and sports datasets and achieve state-of-the-art results, outperforming the current baseline methods.

(a) Generator (G)
(b) Discriminator (D)
Figure 1: The proposed Multi-Level Sequence GAN (MLS-GAN) architecture: (a) G is trained with sequences of person-level and scene-level features to learn an intermediate action representation, an ‘action code’. (b) The model D performs group activity classification while discriminating real/fake data from scene level sequences and ground truth/generated action codes.

2 Related Work

Human action recognition is an area that has been of pivotal interest to researchers in the computer vision domain. However, a high proportion of proposed models are based on single-human actions which do not align with the nature of the real world scenarios where actions are continuous. Furthermore, many existing approaches only consider actions performed by a single agent, limiting the utility of these approaches.

Some early works [2, 3, 4, 5] on group activity recognition have addressed the group activity recognition task on surveillance and sports video datasets with probabilistic and discriminative models that utilise hand-crafted features. As these hand-crafted feature based methods always require feature engineering, attention has shifted towards deep network based methods due to their automatic feature learning capability.

In [6] the authors introduce an LSTM based two stage hierarchical model for group activity recognition. The model first learns the individual actions, which are then integrated into a higher level model for group activity recognition. Shu et al. [7] introduced the Confidence Energy Recurrent Network (CERN), which is also a two-level hierarchy of LSTM networks and utilises an energy layer for estimating the energy of the predictions. As these LSTM based methods focus on learning each individual action independently, after which the group activity is learnt by considering the predicted individual actions, they are unable to map the interactions between individuals well [8]. Kim et al. [9] proposed a gated recurrent unit (GRU) based model that utilises discriminative group context features (DGCF) to handle people as individuals or sub groups. A similar approach has been suggested for classifying puck possession events in ice hockey, extracting convolutional layer features to train recurrent networks. In [8], the authors introduced the Structural Recurrent Neural Network (SRNN) model, which is able to handle a varying number of individuals in the scene at each time step with the aid of a grid pooling layer. Even though these deep network based models are capable of performing automatic feature learning, they still require manual human effort to design effective losses.

Motivated by the recent advancements and with the ability to learn effective losses automatically, we build on the concept of Generative Adversarial Networks (GANs) to propose a recurrent semi-supervised GAN framework for group activity recognition. GAN based models are capable of learning an output that is difficult to discriminate from real examples, and also learn a mapping from input to output while learning a loss function to train the mapping. As a result of this ability, GANs have been used in solving different computer vision problems such as inpainting [11], product photo generation [12] etc. We utilise an extended variant of GANs, the conditional GAN [13, 14, 15] architecture where both the generator and the discriminator models are conditioned with additional data such as class labels or data from other modalities. A further enhancement of the architecture can be achieved by following the semi-supervised GAN architecture introduced in [16]. There are only a handful of GAN based methods [17, 18] that have been introduced for human activity analysis. In [17] the authors train the generative model to synthesise frames in an action video sequence, and in [18] the generative model synthesises masks for humans. While these methods try to learn a distribution on video frame level attributes, no effort has been made to learn an intermediate representation at the human behaviour level. Motivated by [19, 20], which have demonstrated the viability of learning intermediate representations with GANs, we believe that learning such an intermediate representation (‘action code’) would help the action classification process, as the classification model has to classify this discriminative action code.

To this end we make the first attempt to apply GAN based methods to the group activity recognition task, where the network jointly learns a loss function as well as providing auxiliary classification of the class.

3 Methodology

GANs are generative models that are capable of learning a mapping from a random noise vector $z$ to an output vector $y$, $G : z \rightarrow y$ [21]. In our work, we utilise the conditional GAN [13], an extension of the GAN that is capable of learning a mapping from the observed image $x_t$ at time $t$ and a random noise vector $z_t$ to $y_t$, $G : \{x_t, z_t\} \rightarrow y_t$ [13, 22].

GANs are composed of two main components: the Generator (G) and the Discriminator (D), which compete in a two player game. G tries to generate data that is indistinguishable from real data, while D tries to distinguish between real and generated (fake) data.

We introduce a conditional GAN based model, Multi-Level Sequence GAN (MLS-GAN), for group activity recognition, which utilises sequences of person-level and scene-level data for classification. In Section 3.1, we describe the action code format that the GAN is trained to generate; Section 3.2 describes the semi-supervised GAN architecture; and in Section 3.3 we explain the objectives that the models seek to optimise.

3.1 Action codes

The generator network is trained to synthesise an ‘action code’ to represent the current group action. The generator maps dense pixel information to this action code, so a one-hot vector is not an optimal target. We therefore scale it to a range from 0 to 255, giving more freedom for the action generator and discriminator to represent each action code as a dense vector representation,

$$y \in \mathbb{R}^{1 \times k},$$

where $k$ is the number of group action classes in the dataset. This action code representation can also be seen as an intermediate representation for the action classification. Several works have previously demonstrated the viability of these representations with GANs [19, 20]. Overall, the action code generation is affected by the adversarial loss as well as the classification loss, so the learnt action codes need to be informative for the classification task. Figure 2 shows sample action codes for a scenario where there are 7 action classes.
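A minimal sketch of how a ground-truth action code could be built, assuming it is simply the one-hot class vector scaled so that the active class takes the value 255 and every other entry is 0. The function name `make_action_code` and the float32 dtype are illustrative choices, not details given in the text.

```python
import numpy as np


def make_action_code(class_index: int, num_classes: int) -> np.ndarray:
    """Return a 1 x k action code: a one-hot vector scaled to the 0-255 range."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index must lie in [0, num_classes)")
    code = np.zeros((1, num_classes), dtype=np.float32)
    code[0, class_index] = 255.0  # assumed scaling: active class -> 255
    return code


# Example with k = 7 classes, as in the sample codes of Figure 2.
code = make_action_code(class_index=2, num_classes=7)
```

The generator is free to output intermediate values in this range, which is what lets the code act as a dense representation rather than a hard label.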

Figure 2: Sample ground truth action codes, with $k = 7$ (i.e. we have 7 actions). Panels (a) and (b) show two example codes. Note that a green border is shown around the codes for clarity; this is not part of the code and is only included to aid display. Codes are of size 1 x k pixels.

3.2 Semi-supervised GAN architecture

The semi-supervised GAN architecture is achieved by combining the unsupervised GAN objective with the supervised classification objective. Unlike the standard GAN architecture, the discriminator of the semi-supervised GAN network is able to perform group action classification together with the real/fake classification task.

3.2.1 Generator

The generator takes person-level inputs and scene-level inputs for each video sequence for $T$ time steps. Let $X_{SC}$ be the full frame image sequence, while $X_{PL}^{n}$ is the person-level cropped bounding box image sequence for the $n^{th}$ person, where $n \in [1, N]$. The generator input, $I_G$, can be defined as follows,

$$I_G = (X_{PL}^{1}, X_{PL}^{2}, \ldots, X_{PL}^{N}, X_{SC}).$$

The generator learns a mapping from the observed input $I_G$ and a noise vector $z$ to $y$, $G : \{I_G, z\} \rightarrow y$, where $y$ is the action code. As shown in Figure 1, to obtain this mapping the generator extracts visual features from $I_G$ using a pre-trained deep network, $f$. Let the extracted scene-level visual feature be $\theta_{SC}$ and the extracted person-level features be $\theta_{PL}^{n}$ for the $n^{th}$ person such that,

$$\theta_{SC} = f(X_{SC}),$$

$$\theta_{PL}^{n} = f(X_{PL}^{n}).$$

These extracted visual features are then passed through the respective LSTM networks,

$$Z_{SC} = \mathrm{LSTM}(\theta_{SC}), \qquad Z_{PL}^{n} = \mathrm{LSTM}(\theta_{PL}^{n}).$$

Outputs for the person LSTM model are subsequently sent through a gated fusion unit to perform feature fusion as follows,

$$h_{PL}^{n} = \tanh(W_{PL}^{n} Z_{PL}^{n}),$$

where $W_{PL}^{n}$ is a weight vector for encoding. Next the sigmoid function, $\sigma$, is used to determine the information flow from each input stream,

$$q_{PL}^{n} = \sigma(\hat{W}_{PL}^{n} [Z_{PL}^{1}, \ldots, Z_{PL}^{N}, Z_{SC}]),$$

with gate weights $\hat{W}_{PL}^{n}$, after which we multiply the embedding $h_{PL}^{n}$ with the gate output $q_{PL}^{n}$ such that,

$$r_{PL}^{n} = h_{PL}^{n} \times q_{PL}^{n}.$$

Therefore, when determining information flow from the $n^{th}$ person stream we attend over all the other input streams, rather than having one constant weight value for the entire stream. Using these functions we generate gated outputs, $r_{PL}^{n}$, for each person-level input as well as one for the scene-level input, $r_{SC}$. The gated unit then fuses these person-level and scene-level outputs into a single output, $C$.

This output, $C$, is finally sent through a fully connected layer to obtain the action code, $y$, that represents the current action. This step also utilises a latent vector $z$.
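The gated fusion described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes a tanh embedding per stream (as in the gated multimodal units of [1]), a sigmoid gate computed from the concatenation of all streams, and fusion by summing the gated outputs. The summation, the function name `gated_fusion`, the weight shapes and the toy sizes are all assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(streams, enc_weights, gate_weights):
    """Fuse a list of stream vectors, each of shape (d,), into one (d_out,) vector.

    enc_weights[i]:  (d_out, d) encoding matrix for stream i.
    gate_weights[i]: (d_out, num_streams * d) gate matrix for stream i.
    """
    all_streams = np.concatenate(streams)  # every gate sees all streams
    fused = np.zeros(enc_weights[0].shape[0])
    for z, w_enc, w_gate in zip(streams, enc_weights, gate_weights):
        h = np.tanh(w_enc @ z)             # per-stream embedding
        q = sigmoid(w_gate @ all_streams)  # gate values in (0, 1)
        fused += h * q                     # gated output, summed (assumed)
    return fused


# Toy example: 3 person streams plus 1 scene stream.
rng = np.random.default_rng(0)
num_streams, d, d_out = 4, 6, 5
streams = [rng.standard_normal(d) for _ in range(num_streams)]
enc_weights = [rng.standard_normal((d_out, d)) for _ in range(num_streams)]
gate_weights = [rng.standard_normal((d_out, num_streams * d))
                for _ in range(num_streams)]
fused = gated_fusion(streams, enc_weights, gate_weights)
```

Because each gate is computed from all streams, the weight given to one person's features can change with what the other persons and the scene are doing, instead of being fixed per stream.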

3.2.2 Discriminator

The discriminator takes the scene-level inputs ($X_{SC}$) from each video sequence for $T$ time steps together with real (ground truth) / fake (generated) action codes ($y$). The aim of the semi-supervised GAN architecture is to perform real/fake validation together with the group action classification. The inputs to the discriminator model are as follows,

$$I_D = (X_{SC}, y).$$
Unlike the generator, the discriminator is not fed with person-level features. The action codes provide intermediate representations of the group activities that have been generated by considering person-level features. Therefore, the activities of the individuals are already encoded in the action codes and the scene-level features are used to support the decision. Considering these scene level inputs also contain the individual people, providing the crops of every individual is redundant and greatly increases the architecture complexity. We believe that by providing the scene level features the model should be able to capture the spatial relationships and the spatial arrangements of the individuals, which is essential when deciding upon the authenticity of the generated action code.

The scene-level feature input ($X_{SC}$) is then sent through the visual feature extractor defined in Equation 3 and we obtain $\theta_{SC}$. The scene-level features capture spatial relationships and the spatial arrangements of the people, which helps to decide whether the action is realistic given the arrangements. The action code input ($y$) is sent through a fully connected layer and we obtain $\theta_{y}$. These extracted features are then sent through a gated fusion unit to perform feature fusion, producing the fused output $C_D$.

Finally, $C_D$ is passed through fully connected layers to perform group action classification together with the real/fake validation of the current action code.

3.3 GAN Objectives

The objective of the proposed MLS-GAN model combines the adversarial (real/fake) loss with the classification loss computed by the output classifier head of the discriminator. A hyper-parameter balances the contributions of the classification loss and the adversarial loss.
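As a rough illustration of how the two terms could be combined on the discriminator side, the sketch below adds a binary real/fake cross-entropy to a weighted categorical cross-entropy. The function name `discriminator_loss`, the weight `lam` (standing in for the balancing hyper-parameter) and the choice to compute the classification term on whichever codes are passed in are assumptions, not details confirmed by the text.

```python
import numpy as np


def discriminator_loss(p_real, p_fake, class_probs, labels, lam):
    """Adversarial real/fake loss plus lam times the classification loss.

    p_real:      probability of 'real' assigned to ground-truth codes, shape (n,).
    p_fake:      probability of 'real' assigned to generated codes, shape (m,).
    class_probs: classifier-head softmax outputs, shape (b, k).
    labels:      integer group activity labels, shape (b,).
    """
    eps = 1e-12  # guards against log(0)
    p_real = np.asarray(p_real, dtype=float)
    p_fake = np.asarray(p_fake, dtype=float)
    class_probs = np.asarray(class_probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    adversarial = (-np.mean(np.log(p_real + eps))
                   - np.mean(np.log(1.0 - p_fake + eps)))
    picked = class_probs[np.arange(len(labels)), labels]
    classification = -np.mean(np.log(picked + eps))
    return adversarial + lam * classification


loss = discriminator_loss(p_real=[0.9, 0.8], p_fake=[0.2, 0.1],
                          class_probs=[[0.7, 0.3], [0.1, 0.9]],
                          labels=[0, 1], lam=1.0)
```

A larger `lam` pushes the discriminator to prioritise correct group activity labels over the real/fake decision, and a smaller one does the opposite.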

4 Experiments

4.1 Datasets

To demonstrate the flexibility of the proposed method we evaluate our proposed model on sports and pedestrian group activity datasets: the volleyball dataset [6] and the collective activity dataset [3]. We do not use the annotations for individual person activities in this research. Rather, we allow the model to learn its own internal representation of the individual activities. We argue this is more appropriate for group activity recognition as the model is able to discover pertinent sub-activities rather than being forced to learn a (possibly) less informative representation that is provided by the individual activity ground truth.

4.1.1 Volleyball dataset

The Volleyball dataset is composed of 55 videos containing 4,830 annotated frames. The dataset represents 8 group activities that can be found in volleyball: right set, right spike, right pass, right win-point, left win-point, left pass, left spike and left set. The train/test splits of [6, 7] are used.

4.1.2 Collective activity dataset

The collective activity dataset is composed of 44 video sequences representing five group-level activities. The group activity label is assigned by considering the most common action that is performed by the people in the scene. The train/test splits for evaluations are as in [6]. The available group actions are crossing, walking, waiting, talking and queueing.

4.2 Metrics

We perform comparisons to the state-of-the-art by utilising the same metrics used by the baseline approaches [23, 6]. We use the multi-class accuracy (MCA) and the mean per class accuracy (MPCA) to overcome the imbalance in the test set (e.g. the total number of crossing examples is more than twice that of queueing and talking examples [23]) when evaluating the performance. As MPCA calculates the accuracy for each class, before taking the average accuracy values, this overcomes the accuracy bias on the imbalanced test set.
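The difference between the two metrics can be seen in a small sketch, assuming MCA is the overall fraction of correctly classified samples and MPCA is the unweighted mean of the per-class accuracies. The function names `mca` and `mpca` and the toy labels are illustrative.

```python
import numpy as np


def mca(y_true, y_pred):
    """Multi-class accuracy: fraction of all samples classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def mpca(y_true, y_pred):
    """Mean per-class accuracy: average of the accuracy within each class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))


# Imbalanced toy test set: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
overall = mca(y_true, y_pred)     # 5 of 6 samples correct
balanced = mpca(y_true, y_pred)   # mean of 1.0 (class 0) and 0.5 (class 1)
```

Here the majority class inflates MCA, while MPCA weights the minority class equally and so exposes the weaker performance on it.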

4.3 Network Architecture and Training

We extract visual features through a ResNet-50 [24] network pre-trained on ImageNet for each set of person-level and scene-level inputs. Each input frame is resized to 224 x 224 as a preprocessing step prior to feature extraction. The features extracted from ResNet-50 are then sent through the first layer of LSTMs, which have 10 time steps. The number of LSTMs in the first layer is determined by considering the maximum number of persons (with bounding boxes) in each dataset. If the maximum number of available bounding boxes for a dataset is N, then the first layer of LSTMs is composed of (N+1) LSTMs, i.e. one LSTM for each person plus one for the scene level features. In cases where there are fewer than N persons we create a dummy sequence with default values. The value of N is selected separately for the volleyball dataset (N = 12, see Section 4.6) and for the collective activity dataset. The gated fusion mechanism automatically learns to discard dummy sequences when there are fewer than N people in the scene.
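The padding of missing persons can be sketched as below. The default value of zero, the function name `pad_person_streams` and the toy feature dimension are assumptions; the text only states that dummy sequences with default values are used. The 12-person limit matches the value quoted for the volleyball dataset in the timing experiment.

```python
import numpy as np


def pad_person_streams(person_feats, max_persons, fill_value=0.0):
    """Pad (n, T, d) person-level features to (max_persons, T, d).

    Rows beyond the n observed persons are dummy sequences filled with
    fill_value, so the network always receives max_persons person streams.
    """
    person_feats = np.asarray(person_feats, dtype=np.float32)
    n, num_steps, dim = person_feats.shape
    if n > max_persons:
        raise ValueError("more persons than max_persons")
    padded = np.full((max_persons, num_steps, dim), fill_value, dtype=np.float32)
    padded[:n] = person_feats
    return padded


# 9 detected persons, 10 time steps, toy 8-dimensional features.
feats = np.ones((9, 10, 8), dtype=np.float32)
padded = pad_person_streams(feats, max_persons=12)
```

With the scene stream added, this gives the fixed set of N + 1 input sequences that the first LSTM layer expects.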

The outputs of these LSTMs are passed through the gated fusion unit (GFU) to map the correspondences among person-level and scene-level streams. For all the LSTMs we set the hidden state embedding dimension to be 300 units. For the volleyball dataset the dimensionality of the FC(k) layer is set to 8 as there are 8 group activities in the dataset, and for the collective activity dataset we set this to 5. The hyper-parameter that balances the classification and adversarial losses is chosen experimentally.

In both datasets, the annotations are given in a consistent order. In the volleyball dataset the annotations are ordered based on player role (i.e. spiker, blocker); and in the collective dataset, persons in the frame are annotated from left to right in the scene. We maintain this order of the inputs allowing the GFU to understand the contribution of each person in the scene and learn how the individual actions affect the group action.

The training procedure is similar to [22] and alternates between one gradient descent pass for the discriminators and one for the action generators using standard mini-batch gradient descent (32 examples per mini-batch). It uses the Adam optimiser [26] with an initial learning rate of 0.1 for 250 epochs and 0.01 for the next 750 epochs.

For discriminator training, we take (batch_size)/2 generated (fake) action codes and (batch_size)/2 ground truth (real) action codes, where the ground truth action codes are manually created. We use Keras and Theano [28] to implement our model.
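The learning rate schedule and the half real, half fake discriminator batches can be sketched as follows. This is an illustration in plain NumPy, not the Keras/Theano code used for the experiments; the function names, the zero-based epoch counting and the random toy codes are assumptions, and the scene-level inputs that accompany each code are omitted for brevity.

```python
import numpy as np


def learning_rate(epoch):
    """0.1 for the first 250 epochs, then 0.01 (epochs counted from 0)."""
    return 0.1 if epoch < 250 else 0.01


def make_discriminator_batch(real_codes, fake_codes, batch_size=32, rng=None):
    """Build a batch with batch_size/2 real and batch_size/2 fake action codes."""
    if rng is None:
        rng = np.random.default_rng()
    half = batch_size // 2
    real_idx = rng.choice(len(real_codes), size=half, replace=False)
    fake_idx = rng.choice(len(fake_codes), size=half, replace=False)
    codes = np.concatenate([real_codes[real_idx], fake_codes[fake_idx]], axis=0)
    is_real = np.concatenate([np.ones(half), np.zeros(half)])
    order = rng.permutation(2 * half)  # shuffle real and fake together
    return codes[order], is_real[order]


rng = np.random.default_rng(0)
k = 8  # number of group activities in the volleyball dataset
real_codes = 255.0 * np.eye(k)[rng.integers(0, k, size=40)]  # toy ground truth
fake_codes = rng.uniform(0.0, 255.0, size=(40, k))           # toy generated codes
codes, is_real = make_discriminator_batch(real_codes, fake_codes, rng=rng)
```

In an alternating scheme, a batch like this would drive one discriminator update, followed by one generator update on a fresh mini-batch.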

4.4 Results

Tables 1 and 2 present the evaluations for the proposed MLS-GAN along with the state-of-the-art baseline methods for the Collective Activity [3] and Volleyball [6] benchmark datasets respectively.

In Table 1 we observe poor performance from the hand-crafted feature based models [23, 29] as they are capable of capturing only abstract level concepts [30]. The deep structured model [31], utilising a CNN based feature extraction scheme, improves upon the handcrafted features. However, it does not utilise temporal modelling to map the evolution of the actions, which we believe causes the deficiencies in its performance.

The authors in [6, 7] utilise enhanced temporal modelling through LSTMs and achieved improved performance. However we believe the two step training process leads to an information loss. First, they train a person-level LSTM model which generates a probability distribution over the individual action class for each person in the scene. In the next level only these distributions are used for deciding upon the group activities. Neither person-level features, nor the scene structure information such as the locations of the individual persons is utilised.

In contrast, by utilising features from both the person level and scene level, and further improving the learning process through the proposed GAN based learning framework, the proposed MLS-GAN model has been able to outperform the state-of-the-art models in both considered metrics.

Approach                | MCA  | MPCA
Latent SVM [23]         | 79.7 | 78.4
Deep structured [31]    | 80.6 | NA
Cardinality Kernel [29] | 83.4 | 81.9
2-layer LSTMs [6]       | 81.5 | 80.9
CERN [7]                | 87.2 | 88.3
MLS-GAN                 | 91.7 | 91.2
Table 1: Comparison of the results on Collective Activity dataset [3] using MCA and MPCA. NA refers to unavailability of that evaluation.

In Figure 3 we visualise sample frames for 4 sequences from the collective activity dataset which contain the ‘Crossing’ scene level activity. We highlight each pedestrian within a bounding box which is colour coded based on the individual activity performed, where yellow denotes ‘Crossing’, green denotes ‘Waiting’ and blue denotes ‘Walking’ activity classes. Note that the group activity label is assigned by considering the action that is performed by the majority of people in the sequence. These sequences clearly illustrate the challenges with the dataset. For the same scene level activity we observe significant view point changes. Furthermore there exists a high degree of visual similarity between the action transition frames and the action frames themselves. For example, in the 3rd column we observe such instances where the pedestrians transition from the ‘Crossing’ to ‘Walking’ classes. However, the proposed architecture has been able to overcome these challenges and generate accurate predictions.

Figure 3: Sample frames from 4 example sequences (in columns) from the collective activity dataset with the ‘Crossing’ scene level activity. The colour of the bounding box indicates the activity class of each individual, where yellow denotes ‘Crossing’, green denotes ‘Waiting’ and blue denotes ‘Walking’. The sequences illustrate the challenges due to view point changes and visual similarity between the transition frames and the action frames (e.g. the 3rd column, which transitions from ‘Crossing’ to ‘Walking’).

Comparing Table 1 with Table 2, we observe a similar performance for [6, 7] with the volleyball dataset due to the deficiencies in the two level modelling structure. In [8] and [32] the methods achieved improvements over [6, 7] by pooling the hidden feature representation when predicting the group activities. However, these methods still utilise hand engineered loss functions for training the model. Our proposed GAN based model is capable of learning a mapping to an intermediate representation (i.e. action codes) which is easily distinguishable for the activity classifier. The automatic loss function learning process embedded within the GAN objective synthesises this artificial mapping. Hence we are able to outperform the state-of-the-art methods in all considered metrics.

With the results presented in Table 2 we observe a clear improvement in performance over the baseline methods when considering players as 2 groups rather than 1 group. The 2 group representation first segments the players into the respective 2 teams using the ground truth annotations and then pools out the features from the 2 groups separately. Then these team level features are merged together for the group activity recognition. In contrast, the 1 group representation considers all players at once for feature extraction, rather than considering the two state approach. However this segmentation process is an additional overhead when these annotations are not readily available. In contrast the proposed MLS-GAN method receives all the player features together and automatically learns the contribution of each player for the group activity, outperforming both the 1 group and 2 group methods. We argue this is a result of the enhanced structure with the gated fusion units for the feature fusion process. Instead of learning a single static kernel for pooling out features from each player in the team, we attend over all the feature streams from both the player and scene levels, at that particular time step. This generates a system which efficiently varies the level of attention to each feature stream depending on the scene context.

Figure 4 visualises qualitative results from the proposed MLS-GAN model for the Volleyball dataset. Irrespective of the level of clutter and camera motion, the proposed model correctly recognises the group activity.

(a) l-set
(b) l-pass
(c) l-spike
(d) r-set
(e) r-spike
(f) r-pass
Figure 4: Visualisations of the predicted group activities for the Volleyball dataset using the proposed MLS-GAN model.

Approach                    | MCA   | MPCA
2-layer LSTMs [6] (1 group) | 70.3  | 65.9
CERN [7] (1 group)          | 73.5  | 72.2
SRNN [8] (1 group)          | 73.39 | NA
2-layer LSTMs [6] (2 group) | 81.9  | 82.9
CERN [7] (2 group)          | 83.3  | 83.6
SRNN [8] (2 group)          | 83.47 | NA
Social Scene [32] (2 group) | 89.90 | NA
MLS-GAN                     | 93.0  | 92.4
Table 2: Comparisons with the state-of-the-art for the Volleyball dataset [6]. The first block of results (1 group) is for methods that consider all the players as one group, and the second block (2 group) is for methods that first divide the players into two groups (i.e. each team) and extract features from them separately. NA refers to unavailability of results.

4.5 Ablation Experiments

We further experiment with the collective activity dataset by conducting an ablation experiment using a series of models constructed by removing certain components from the proposed MLS-GAN model. Details of the ablation models are as follows:

  1. G-GFU: We use only the generator from MLS-GAN and train it to predict group activity classes by adding a final softmax layer. This model learns through supervised learning using categorical cross-entropy loss. Further, we remove the Gated Fusion Unit (GFU) defined in Eq. 7 to Eq. 10. Therefore this model simply concatenates the outputs from each stream.

  2. G: The generator model plus the GFU, trained in a fully supervised manner as per the G-GFU model above.

  3. cGAN-(GFU and ): a conditional GAN architecture where the generator model utilises only the person-level features (no scene-level features), and does not utilise the GFU mechanism for feature fusion. However the discriminator model still receives the scene level image and the generated action code as the inputs.

  4. cGAN-GFU: a conditional GAN architecture which is similar to the proposed MLS-GAN model, but does not utilise the GFU mechanism for feature fusion.

  5. MLS-GAN- : MLS-GAN architecture where the generator utilises only the person-level features for action code generation. The discriminator model is as in cGAN-(GFU and ). As per cGAN-(GFU and ), the discriminator still receives the scene level image.

Approach         | MCA  | MPCA
G-GFU            | 58.9 | 58.7
G                | 61.3 | 60.5
cGAN-(GFU and )  | 88.4 | 87.7
cGAN-GFU         | 89.5 | 88.3
MLS-GAN-         | 91.2 | 90.8
MLS-GAN          | 91.7 | 91.2
Table 3: Ablation experiment results on Collective Activity dataset [3].

When analysing the results presented in Table 3 we observe significantly lower accuracies for methods G-GFU and G. Even though a slight improvement in performance is observed with the introduction of the GFU fusion strategy, both remain far below the GAN based variants. We believe this is due to the deficiencies of the supervised learning process, where we directly map the dense visual features to a sparse categorical vector. However, with the other variants and the proposed approach we learn an objective which maps the input to an intermediate representation (i.e. action codes) which is easily distinguishable by the classifier. The merit of the intermediate representation is shown by the performance gap between G and cGAN-(GFU and ), which we further enhance in cGAN-GFU by including scene information alongside the features extracted for the individual agents. This allows the GAN to understand the spatial arrangements of the actors when determining the group activity. Comparing cGAN-GFU and MLS-GAN- , we can also see the value of the GFU, which is able to better combine data from the individual agents. Finally, by utilising both person-level and scene-level features and combining those through the proposed GFUs, the proposed MLS-GAN model attains better recognition results.

We would like to further compare the performance of the non-GAN based models G-GFU and G with the results for the deep architectures in Table 1. Methods such as the 2-layer LSTMs [6] and CERN [7] have been able to attain improved performance compared to G-GFU and G, however with the added expense of the need for hand annotated individual actions in the database. In contrast, with the improved GAN learning procedure the same architectures (i.e. cGAN-(GFU and ), cGAN-GFU and MLS-GAN- ) have been able to achieve much better performance without using those individual level annotations.

In order to further demonstrate the discriminative power of the generated action codes, we directly classified the action codes generated by the cGAN-(GFU and ) model. We added a softmax layer to the generator model and trained only this added layer, freezing the rest of the network weights. This yielded an MPCA of 90.7 for the collective activity dataset. By comparison, the ablation model G in Table 3 (the generator without the GAN objective, trained using only the classification objective) attains an MPCA of 60.5. Hence it is clear that the additional GAN objective makes a substantial contribution.

4.6 Time Efficiency

We tested the computational requirements of the MLS-GAN method using the test set of the Volleyball dataset [6], where the total number of persons, $N$, is set to 12 and each sequence contains 10 time steps. The model generates 100 predictions in 20.4 seconds (about 0.2 seconds per prediction) using a single core of an Intel E5-2680 2.50 GHz CPU.

5 Conclusions

In this paper we propose a Multi-Level Sequential Generative Adversarial Network (MLS-GAN), which is composed of LSTM networks for capturing separate individual actions, followed by a gated fusion unit that performs feature integration while considering long-term feature dependencies. We allow the network to learn both person-level and scene-level features to avoid information loss on related objects, backgrounds, and the locations of the individuals within the scene. With the inherent ability to learn both the features and the loss function automatically, we employ a semi-supervised GAN architecture to learn an intermediate representation of the scene-level and person-level features of the given scene, rendering an easily distinguishable vector representation, an action code, to represent the group activity. Our evaluations on two diverse datasets, the Volleyball and Collective Activity datasets, demonstrate the augmented learning capacity and the flexibility of the proposed MLS-GAN approach. Furthermore, the extensive evaluations make it evident that combining scene-level features with person-level features enhances performance by a considerable margin.


  • [1] Arevalo, J., Solorio, T., Montes-y Gómez, M., González, F.A.: Gated multimodal units for information fusion. 5th International conference on learning representations 2017 workshop (2017)
  • [2] Amer, M.R., Lei, P., Todorovic, S.: Hirf: Hierarchical random field for collective activity recognition in videos. In: European Conference on Computer Vision, Springer (2014) 572–585
  • [3] Choi, W., Shahid, K., Savarese, S.: What are they doing?: Collective activity classification using spatio-temporal relationship among people. In: Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, IEEE (2009) 1282–1289
  • [4] Lan, T., Sigal, L., Mori, G.: Social roles in hierarchical models for human activity recognition. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 1354–1361
  • [5] Ramanathan, V., Yao, B., Fei-Fei, L.: Social role discovery in human events. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE (2013) 2475–2482
  • [6] Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A hierarchical deep temporal model for group activity recognition. In: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE (2016) 1971–1980
  • [7] Shu, T., Todorovic, S., Zhu, S.C.: Cern: confidence-energy recurrent network for group activity recognition. Proc. of CVPR, Honolulu, Hawaii (2017)
  • [8] Biswas, S., Gall, J.: Structural recurrent neural network (srnn) for group activity analysis. IEEE Winter Conference on Applications of Computer Vision (WACV) (2018)
  • [9] Kim, P.S., Lee, D.G., Lee, S.W.: Discriminative context learning with gated recurrent unit for group activity recognition. Pattern Recognition 76 (2018) 149–161
  • [10] Tora, M.R., Chen, J., Little, J.J.: Classification of puck possession events in ice hockey. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, IEEE (2017) 147–154
  • [11] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2536–2544
  • [12] Yoo, D., Kim, N., Park, S., Paek, A.S., Kweon, I.S.: Pixel-level domain transfer. In: European Conference on Computer Vision, Springer (2016) 517–532
  • [13] Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
  • [14] Fernando, T., Denman, S., Sridharan, S., Fookes, C.: Tracking by prediction: A deep generative model for mutli-person localisation and tracking. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE (2018) 1122–1132
  • [15] Fernando, T., Denman, S., Sridharan, S., Fookes, C.: Task specific visual saliency prediction with memory augmented conditional generative adversarial networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE (2018) 1539–1548
  • [16] Denton, E., Gross, S., Fergus, R.: Semi-supervised learning with context-conditional generative adversarial networks. arXiv preprint arXiv:1611.06430 (2016)
  • [17] Ahsan, U., Sun, C., Essa, I.: Discrimnet: Semi-supervised action recognition from videos using generative adversarial networks. arXiv preprint arXiv:1801.07230 (2018)
  • [18] Li, X., Zhang, Y., Zhang, J., Chen, Y., Li, H., Marsic, I., Burd, R.S.: Region-based activity recognition using conditional gan. In: Proceedings of the 2017 ACM on Multimedia Conference, ACM (2017) 1059–1067
  • [19] Li, Y., Song, J., Ermon, S.: Infogail: Interpretable imitation learning from visual demonstrations. In: Advances in Neural Information Processing Systems. (2017) 3815–3825
  • [20] Bora, A., Jalal, A., Price, E., Dimakis, A.G.: Compressed sensing using generative models. International Conference on Machine Learning (ICML) (2018)
  • [21] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
  • [22] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  • [23] Lan, T., Wang, Y., Yang, W., Robinovitch, S.N., Mori, G.: Discriminative latent models for recognizing contextual group activities. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 1549–1562
  • [24] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
  • [25] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (2015) 211–252
  • [26] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) (2015)
  • [27] Chollet, F., et al.: Keras. (2015)
  • [28] Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., Belopolsky, A., et al.: Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688 (2016)
  • [29] Hajimirsadeghi, H., Yan, W., Vahdat, A., Mori, G.: Visual recognition by counting instances: A multi-instance cardinality potential kernel. IEEE Computer Vision and Pattern Recognition (CVPR) (2015)
  • [30] Gammulle, H., Denman, S., Sridharan, S., Fookes, C.: Two stream lstm: A deep fusion framework for human action recognition. In: Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, IEEE (2017) 177–186
  • [31] Deng, Z., Zhai, M., Chen, L., Liu, Y., Muralidharan, S., Roshtkhari, M.J., Mori, G.: Deep structured models for group activity recognition. British Machine Vision Conference (BMVC) (2015)
  • [32] Bagautdinov, T., Alahi, A., Fleuret, F., Fua, P., Savarese, S.: Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In: Conference on Computer Vision and Pattern Recognition. Volume 2. (2017)