Introduction
Recognizing actions in a video stream requires the aggregation of temporal as well as spatial features (as in object classification). Video streams, unlike still images, exhibit both short- and long-range temporal correlations, attributes that single-frame convolutional neural networks fail to discover. Therefore, the first hurdle to reaching human-level performance is designing feature extractors that can learn this latent temporal structure. Nonetheless, there has been much progress in devising novel neural network architectures since the work of [Karpathy et al. 2014]. Another problem is the large compute, storage and memory requirement for analysing even moderately sized video snippets. One requires relatively large computing resources to train ultra-deep neural networks that can learn the subtleties of temporal correlations under varying lighting, camera angles, pose, etc. It is also difficult to apply standard image augmentation techniques (random rotations, shears, flips, etc.) to a video stream. Additionally, the features of a video stream (unlike static images) evolve with dynamics spanning several orders of magnitude in timescale.
Nonetheless, the action recognition problem has reached sufficient maturity using the two-stream deep convolutional neural network (CNN) framework [Simonyan and Zisserman 2014]. Such a framework utilises a deep CNN to extract static RGB (Red-Green-Blue) features as well as motion cues by deconstructing the optic flow of a given video clip. Notably, there has been plenty of work utilising a variety of network architectures for factorising the RGB- and optical-flow-based features. For example, an Inception network [Szegedy et al. 2016] uses 1×1 convolutions in its inception block to estimate cross-channel correlations, which is then followed by the estimation of cross-spatial and cross-channel correlations. A residual network (ResNet), on the other hand, learns residuals on the inputs instead of learning unreferenced functions [He et al. 2016]. While such frameworks have proven useful for many action recognition datasets (UCF101, UCF50, etc.), they are yet to show promise where videos have varying signal-to-noise ratios, viewing angles, etc. We improve upon existing technology by combining Inception networks and ResNets using Gaussian Process classifiers that are further combined in a product-of-experts (PoE) framework to yield, to the best of our knowledge, state-of-the-art performance on the HMDB51 dataset
[Kuehne et al. 2013]. Under a Bayesian setting, our pillar networks provide not only mean predictions but also the uncertainty associated with each prediction. Notably, our work makes the following contributions:
We introduce pillar networks++ that allow for independent multi-stream deep neural networks, enabling horizontal scalability.

We classify video snippets that are heterogeneous in camera angle, video quality, pose, etc.

We combine deep convolutional neural networks with non-parametric Bayesian models, opening up the possibility of training them with less data.

We demonstrate the utility of model averaging that takes the uncertainty around mean predictions into account.
Methods
In this section, we describe the dataset, the network architectures and the non-parametric Bayesian setup that we utilise in our four-stream CNN pillar network for activity recognition. We refer the reader to [Wang et al. 2016] and [Ma et al. 2017] for further technical details of the original network architectures. Utilising classification methodologies like AdaBoost, gradient boosting, random forests, etc. provides accuracies in the range of 5–55% on this dataset, for either the RGB- or the optic-flow-based features.
Dataset
The HMDB51 dataset [Kuehne et al. 2013] is an action classification dataset that comprises 6,766 video clips divided into 51 action classes. Although the larger UCF101 dataset exists, with 101 action classes [Soomro, Zamir, and Shah 2012], HMDB51 has proven to be more challenging: each video has been filmed with a variety of viewpoints, occlusions, camera motions and video qualities, accentuating the challenges of video-based prediction problems. The second motivation behind using this dataset lies in the fact that HMDB51 has storage and compute requirements that can be met by a modern workstation with GPUs, alleviating the need for expensive cloud-based compute resources.
All experiments were performed on an Intel Xeon E5-2687W 3 GHz workstation with 128 GB of RAM and two 12 GB NVIDIA TITAN Xp GPUs. As in the original evaluation scheme, we report accuracy as an average over the three training/testing splits.
Inception layers for RGB and flow extraction
We use the Inception-layer architecture described in [Wang et al. 2016]. Each video is divided into segments, and a short sub-segment is randomly selected from each segment so that a preliminary prediction can be produced from each snippet; these snippet predictions are later combined to form a video-level prediction. An Inception network with Batch Normalisation [Ioffe and Szegedy 2015] is utilised for both the spatial and the optic-flow stream. The feature size of each Inception network is fixed at 1024. For further details on network pre-training, construction, etc., please refer to [Wang et al. 2016].
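The segment-based consensus described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: the frame-level scores are random stand-ins for network outputs, and the function name and the choice of three segments are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def video_level_prediction(frame_scores, n_segments=3):
    """Sketch of segment-based consensus: split one video's frame-level
    class scores into equal segments, sample one snippet per segment,
    and average the snippet predictions into a video-level prediction."""
    segments = np.array_split(frame_scores, n_segments)
    snippets = [seg[rng.integers(len(seg))] for seg in segments]
    return np.mean(snippets, axis=0)  # consensus over segments

# toy example: 30 frames, 51 classes (as in HMDB51)
scores = rng.random((30, 51))
video_pred = video_level_prediction(scores)
assert video_pred.shape == (51,)
```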
Residual layers for RGB and flow extraction
We utilise the network architecture proposed in [Ma et al. 2017], where the authors leverage recurrent networks and convolutions over temporally constructed feature matrices, as shown in Fig. 1. In our instantiation, we truncate the network to yield 2048 features, which differs from [Ma et al. 2017], where these features feed into a Long Short-Term Memory (LSTM) network. The spatial-stream network takes RGB images as input, with a ResNet-101 [He et al. 2016] as the feature extractor; this ResNet-101 spatial-stream ConvNet has been pre-trained on the ImageNet dataset. The temporal stream stacks ten optical-flow images using the pre-training protocol suggested in [Wang et al. 2016]. The feature size of each ResNet is fixed at 2048. For further details on network pre-training, construction, etc., please refer to [Ma et al. 2017].
Nonparametric Bayesian Classification
Gaussian Processes (GPs) emerged out of filtering theory [Wiener 1949] and entered non-parametric Bayesian statistics via work in geostatistics [Matheron 1973]. Put simply, a GP is a collection of random variables, any finite subset of which has a joint Gaussian distribution:
Observation: $y_i \,|\, f_i, \phi \sim p(y_i \,|\, f_i, \phi)$
GP prior: $f(\mathbf{x}) \,|\, \theta \sim \mathcal{GP}\big(0,\, k(\mathbf{x}, \mathbf{x}' \,|\, \theta)\big)$
Hyperprior: $\theta, \phi \sim p(\theta)\, p(\phi)$ (1)
where $k(\cdot, \cdot \,|\, \theta)$ is the kernel function parameterised by $\theta$; $\phi$ is the parameter of the observation model; $f$ is the latent function evaluated at the inputs $\mathbf{x}_i$, i.e., the features; $y_i$ denotes the class of the input features; and $\{\theta, \phi\}$ denotes the set of hyperparameters.
For a multi-class problem with a non-Gaussian likelihood (softmax), the conditional posterior $p(\mathbf{f} \,|\, X, \mathbf{y}, \theta)$ is approximated via the Laplace approximation [Williams and Barber 1998], i.e., a second-order Taylor expansion of $\log p(\mathbf{f} \,|\, X, \mathbf{y}, \theta)$ around the mode,

$\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} \, p(\mathbf{f} \,|\, X, \mathbf{y}, \theta).$ (2)

Here, $(X, \mathbf{y})$ is the (input, output) tuple. After the Laplace approximation, the approximate posterior distribution becomes

$q(\mathbf{f} \,|\, X, \mathbf{y}, \theta) = \mathcal{N}\big(\mathbf{f} \,|\, \hat{\mathbf{f}},\, (K^{-1} + W)^{-1}\big),$ (3)

where $K$ is the kernel (Gram) matrix and $W = -\nabla\nabla \log p(\mathbf{y} \,|\, \hat{\mathbf{f}})$. Finally, we can evaluate the approximate conditional predictive density of the latent function $f_*$ at a test input $\mathbf{x}_*$,

$q(f_* \,|\, X, \mathbf{y}, \mathbf{x}_*) = \int p(f_* \,|\, \mathbf{f}, X, \mathbf{x}_*)\, q(\mathbf{f} \,|\, X, \mathbf{y}, \theta)\, \mathrm{d}\mathbf{f}.$ (4)
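As a concrete illustration of Laplace-approximated GP classification, the sketch below uses scikit-learn's `GaussianProcessClassifier`, which implements the Laplace approximation; note that scikit-learn handles the multi-class case one-vs-rest rather than with the joint softmax Laplace scheme of Williams and Barber. The random features here are stand-ins for the CNN features, and the dimensions are illustrative only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# stand-ins for CNN features: 200 clips, 32-D features, 5 classes
X = rng.normal(size=(200, 32))
y = rng.integers(0, 5, size=200)

# Laplace-approximated GP classifier with an RBF kernel
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)

proba = gpc.predict_proba(X[:3])  # per-class predictive probabilities
assert np.allclose(proba.sum(axis=1), 1.0)
```

Because the classifier is Bayesian, `predict_proba` returns a full predictive distribution over classes rather than a point label, which is what the product-of-experts fusion below consumes.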
Product of Experts
For each of the neural networks, we subdivide the training set into seven subsets so that different GPs can be trained, giving us 28 GPs for the 4 deep networks (2 Inception networks and 2 ResNets) trained in the first part of our pipeline. We assume that each of the 7 GPs is independent, such that the marginal likelihood in our product-of-experts (PoE) framework factorises as
$p(\mathbf{y} \,|\, X, \theta) \approx \prod_{k=1}^{M} p_k\big(\mathbf{y}^{(k)} \,|\, X^{(k)}, \theta\big),$ (5)
where $M$ is the number of experts and $(X^{(k)}, \mathbf{y}^{(k)})$ is the data subset assigned to the $k$-th expert. This reduces the computational expenditure from $\mathcal{O}(N^3)$ for $N$ training points to $M \times \mathcal{O}(p^3)$ for $M$ experts with $p = N/M$ points each. Notice that, unlike GPs with inducing inputs or variational parameters, such a distributed GP does not require the optimisation of additional parameters. Finally, a product of GP experts is instantiated that predicts the latent function at a test point $\mathbf{x}_*$ as

$p(f_* \,|\, \mathbf{x}_*, \mathcal{D}) = \prod_{k=1}^{M} p_k\big(f_* \,|\, \mathbf{x}_*, \mathcal{D}^{(k)}\big).$ (6)
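One simple way to instantiate a product-of-experts fusion over class-probability outputs is to multiply the experts' predictive distributions and renormalise, summing in the log domain for numerical stability. This is a minimal sketch with our own function name and toy numbers, not the authors' code.

```python
import numpy as np

def poe_fuse(prob_stack, eps=1e-12):
    """Product-of-experts fusion of class-probability vectors:
    multiply the experts' predictive distributions (sum in the log
    domain for stability) and renormalise."""
    log_p = np.log(np.clip(prob_stack, eps, 1.0)).sum(axis=0)
    log_p -= log_p.max()  # shift for numerical stability
    p = np.exp(log_p)
    return p / p.sum()

# toy example: 4 experts over 3 classes
experts = np.array([[0.5, 0.3, 0.2],
                    [0.6, 0.2, 0.2],
                    [0.4, 0.4, 0.2],
                    [0.7, 0.2, 0.1]])
fused = poe_fuse(experts)
assert np.isclose(fused.sum(), 1.0)
```

Because the product sharpens agreement, classes on which all experts place moderate mass dominate the fused distribution.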
Results
We used 3,570 videos from HMDB51 as the training dataset; this was further split into seven subsets, each with 510 videos, obtained by randomly selecting ten videos from each category, with the subsets non-overlapping. On these seven subsets, seven GPs are trained for each feature stream (RGB and flow) of each network (TSN-Inception [Wang et al. 2016] and ResNet-LSTM [Ma et al. 2017]); in total, twenty-eight GPs are generated. The features for both the RGB and the optical-flow streams were extracted from the last connected layer, with dimension 1024 for the Inception networks and 2048 for the ResNets. Fusion is then performed both vertically (seven subsets) and horizontally (four networks). The accuracies of individual GPs and of different fusion combinations (PoE) on split 1 are shown in Table 1. Fusion-1 represents the fusion of the seven GPs for each feature; Fusion-2 shows the result of fusing RGB and flow for each network; Fusion-all shows the result of fusing all 28 GPs. Additionally, results with a support vector machine (SVM) for each network, and their fusion using multi-kernel learning (MKL), are listed in the last three rows [Sengupta and Qian 2017]. The average result over the three splits is displayed in Table 2.
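The non-overlapping subset construction (seven subsets of 510 videos, ten per class) can be sketched as follows; the function name is our own, and the toy label array merely mimics the class balance.

```python
import numpy as np

rng = np.random.default_rng(0)

def partition_per_class(labels, n_subsets=7, per_class=10):
    """Split training indices into non-overlapping subsets, drawing
    `per_class` videos per action class for each subset; each subset
    then trains one GP expert."""
    subsets = [[] for _ in range(n_subsets)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for k in range(n_subsets):
            subsets[k].extend(idx[k * per_class:(k + 1) * per_class])
    return [np.array(s) for s in subsets]

labels = np.repeat(np.arange(51), 70)  # toy: 51 classes, 70 clips each
subs = partition_per_class(labels)
assert len(subs) == 7 and all(len(s) == 510 for s in subs)
assert len(set(np.concatenate(subs))) == 7 * 510  # non-overlapping
```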
Table 1: Accuracy [%] of individual GPs and their fusions on split 1 of HMDB51.

Accuracy [%]  Inception-RGB  Inception-Flow  ResNet-RGB  ResNet-Flow
GP-1  51.4  59.5  52.7  58.9
GP-2  52.0  59.7  51.9  59.1
GP-3  50.1  60.3  49.7  59.9
GP-4  48.7  58.5  49.5  59.1
GP-5  48.2  59.3  49.0  59.5
GP-6  52.0  59.5  52.2  57.9
GP-7  51.1  58.8  51.8  58.1
Average  50.5  59.4  51.0  58.9
Fusion-1  54.6  62.6  54.8  61.6
Fusion-2  69.7  68.2
Fusion-all  75.7
SVM-Single-Kernel  54.0  61.0  53.1  58.5
SVM-Multi-Kernels-1  68.1  63.3
SVM-Multi-Kernels-2  71.7
Table 2: Comparison with the state of the art on HMDB51 (average accuracy over three splits).

Methods  Accuracy [%]  Reference
Two-stream  59.4  [Simonyan and Zisserman 2014]
Rank Pooling (ALL) + HRP (CNN)  65  [Fernando and Gould 2017]
Convolutional Two-stream  65.4  [Feichtenhofer, Pinz, and Zisserman 2016]
Temporal-Inception  67.5  [Ma et al. 2017]
Temporal Segment Network (2/3 modalities)  68.5/69.4  [Wang et al. 2016]
TS-LSTM  69  [Ma et al. 2017]
Pillar Networks++ (ResNet)  66.8  this paper
Pillar Networks++ (Inception-v2)  69.4  this paper
Pillar Networks SVM-MKL  71.8  [Sengupta and Qian 2017]
ST-multiplier network + hand-crafted iDT  72.2  [Feichtenhofer, Pinz, and Wildes 2017]
Pillar Networks++ (4 networks)  73.6  this paper
Discussion
Here, we make two contributions: (a) we build on the recently proposed pillar networks [Sengupta and Qian 2017] and combine deep convolutional neural networks with non-parametric Bayesian models, opening up the possibility of training with less data, and (b) we demonstrate the utility of model averaging that takes the uncertainty around mean predictions into account. Combining these methodologies allows us to surpass the current state of the art in video classification, especially action recognition.
We utilised the HMDB51 dataset instead of UCF101, as the former has proven difficult for deep networks due to the heterogeneity of image quality, camera angles, etc. As is well known, videos contain extensive long-range temporal structure; using different networks (2 ResNets and 2 Inception networks) to capture the subtleties of this temporal structure is an absolute requirement. Since each network implements a different non-linear transformation, one can utilise them to learn very deep yet different features. Utilising the distributed-GP architecture then enables us to parcellate the feature tensors into computable chunks that form the input to a Gaussian Process classifier. Such an architectural choice therefore enables us to scale horizontally by plugging in a variety of networks as per requirement. While we have used this architecture for video-based classification, there is a wide range of problems where this methodology can be applied, from speech processing (with different pillars/networks) to natural language processing (NLP).
Ultra-deep convolutional networks have been influential for a variety of problems, from image classification to natural language processing. Recently, there has been work on combining the Inception network with a residual network such that the resulting network builds on the advantages offered by either network in isolation [Szegedy et al. 2017]. In the future, it would be useful to see how different the features are when extracted from an Inception module, a ResNet module, or a combination of both. Moreover, a wide variety of hand-crafted features can also be provided as inputs to the distributed GPs; our initial experiments using iDT features show that this is indeed the case. Input data can also be augmented using RGB differences or optic-flow warps, as has been done in [Wang et al. 2016]. Also, the second stage of training, i.e., the GP classifiers, works with far fewer examples than a deep learning network requires. It would be useful to see how pillar networks perform on immensely large datasets such as the YouTube-8M dataset
[Abu-El-Haija et al. 2016]. Additionally, the recently published Kinetics human action video dataset from DeepMind [Kay et al. 2017] is equally attractive, as pre-training the pillar networks on this dataset before fine-grained training on HMDB51 will invariably increase the accuracy of the current architecture. The Bayesian product-of-GPs would suffer from a problem were we to increase the number of experts: the precisions of the experts add up, which leads to overconfident predictions, especially in the absence of data. In unpublished work, we have utilised the generalised Product of Experts (gPoE) [Cao and Fleet 2014] and the Bayesian Committee Machine (BCM) [Tresp 2000] to increase the fidelity of our predictions. These will be reported in a subsequent publication, along with results from a robust Bayesian Committee Machine (rBCM), which includes the product-of-GPs and the BCM as special cases [Deisenroth and Ng 2015].
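The overconfidence issue can be seen with a toy calculation: for Gaussian experts, the fused precision is the sum of the expert precisions, so the fused variance shrinks as 1/M even when every expert merely returns an uninformative prior. The helper below is our own illustration, not part of the paper's pipeline.

```python
import numpy as np

def poe_gaussian(mus, s2s):
    """Product of Gaussian experts N(mu_k, s2_k): fused precision is
    the sum of expert precisions; fused mean is precision-weighted."""
    prec = 1.0 / np.asarray(s2s)
    var = 1.0 / prec.sum()
    mu = var * (prec * np.asarray(mus)).sum()
    return mu, var

for M in (1, 7, 28):
    # identical, uninformative experts: variance still collapses as 1/M
    mu, var = poe_gaussian(np.zeros(M), np.ones(M))
    print(M, var)
```

gPoE and the BCM counteract exactly this collapse by down-weighting experts or dividing out replicated prior terms.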
For inference, we have limited our experiments to the Laplace approximation under a distributed-GP framework. Alternative inference methodologies for multi-class classification include (stochastic) expectation propagation [Riihimäki, Jylänki, and Vehtari; Villacampa-Calvo and Hernández-Lobato 2017] and variational approximations [Hensman, Matthews, and Ghahramani 2015]. From our experience with variational optimisation for dynamical probabilistic graphical models [Cooray et al. 2017], there is merit in using free-energy minimisation, simply due to its lower computational overhead. It does, however, come with its own problems, such as underestimation of the variability of the posterior density, an inability to describe multimodal densities, and inaccuracy due to the presence of multiple equilibrium points. That said, some of these problems are also shared by state-of-the-art MCMC samplers for dynamical systems [Sengupta, Friston, and Penny 2015a; Sengupta, Friston, and Penny 2015b]. Due to the flexibility of utilising GPUs, both methods (variational inference and EP) can prove computationally efficient, especially for streaming data. Thus, there is scope for future work applying these inference methodologies and comparing them with the vanilla Laplace approximation utilised here.
References
 [AbuElHaija et al.2016] Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; and Vijayanarasimhan, S. 2016. YouTube-8M: a large-scale video classification benchmark.
 [Cao and Fleet2014] Cao, Y., and Fleet, D. J. 2014. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. arXiv preprint arXiv:1410.7827.
 [Cooray et al.2017] Cooray, G.; Rosch, R.; Baldeweg, T.; Lemieux, L.; Friston, K.; and Sengupta, B. 2017. Bayesian Belief Updating of Spatiotemporal Seizure Dynamics. ICML Workshop on TimeSeries methods.
 [Deisenroth and Ng2015] Deisenroth, M. P., and Ng, J. W. 2015. Distributed Gaussian processes. arXiv preprint arXiv:1502.02843.

 [Feichtenhofer, Pinz, and Wildes2017] Feichtenhofer, C.; Pinz, A.; and Wildes, R. P. 2017. Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
 [Feichtenhofer, Pinz, and Zisserman2016] Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1933–1941.
 [Fernando and Gould2017] Fernando, B., and Gould, S. 2017. Discriminatively learned hierarchical rank pooling networks. arXiv preprint arXiv:1705.10420.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

 [Hensman, Matthews, and Ghahramani2015] Hensman, J.; Matthews, A. G. d. G.; and Ghahramani, Z. 2015. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics.
 [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456.
 [Karpathy et al.2014] Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; and Fei-Fei, L. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732.
 [Kay et al.2017] Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
 [Kuehne et al.2013] Kuehne, H.; Jhuang, H.; Stiefelhagen, R.; and Serre, T. 2013. HMDB51: a large video database for human motion recognition. In High Performance Computing in Science and Engineering ‘12. Springer. 571–582.
 [Ma et al.2017] Ma, C.-Y.; Chen, M.-H.; Kira, Z.; and AlRegib, G. 2017. TS-LSTM and Temporal-Inception: exploiting spatiotemporal dynamics for activity recognition. arXiv preprint arXiv:1703.10667.

 [Matheron1973] Matheron, G. 1973. The intrinsic random functions and their applications. Advances in Applied Probability 5(3):439–468.
 [Riihimäki, Jylänki, and Vehtari] Riihimäki, J.; Jylänki, P.; and Vehtari, A. Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood.
 [Sengupta and Qian2017] Sengupta, B., and Qian, Y. 2017. Pillar Networks for action recognition. IROS Workshop on Semantic Policy and Action Representations for Autonomous Robots.
 [Sengupta, Friston, and Penny2015a] Sengupta, B.; Friston, K. J.; and Penny, W. D. 2015a. Gradientbased MCMC samplers for dynamic causal modelling. Neuroimage.
 [Sengupta, Friston, and Penny2015b] Sengupta, B.; Friston, K. J.; and Penny, W. D. 2015b. Gradientfree MCMC methods for dynamic causal modelling. Neuroimage 112:375–81.
 [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 568–576.
 [Soomro, Zamir, and Shah2012] Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
 [Szegedy et al.2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2818–2826.

 [Szegedy et al.2017] Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 4278–4284.
 [Tresp2000] Tresp, V. 2000. A Bayesian committee machine. Neural Computation 12(11):2719–2741.
 [VillacampaCalvo and HernándezLobato2017] Villacampa-Calvo, C., and Hernández-Lobato, D. 2017. Scalable multi-class Gaussian process classification using expectation propagation. arXiv e-prints.
 [Wang et al.2016] Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, 20–36. Springer.

 [Wiener1949] Wiener, N. 1949. Extrapolation, Interpolation, and Smoothing of Stationary Time Series, volume 7. MIT Press, Cambridge, MA.
 [Williams and Barber1998] Williams, C. K., and Barber, D. 1998. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12):1342–1351.