1 Introduction
Today’s digital content is inherently multimedia. In particular, with the proliferation of sensor-rich mobile devices, images and videos have become media of everyday communication. Understanding multimedia content is therefore in high demand, which accelerates the development of various techniques for visual annotation. Among them, a fundamental breakthrough underlying the success of these techniques is representation learning. This can be evidenced by the success of Convolutional Neural Networks (CNN), which demonstrate a high capability of learning and generalization in visual representation. For example, an ensemble of residual nets [11] achieves 3.57% top-5 error on the ImageNet test set, which is even lower than the 5.1% reported human-level performance. Despite this impressive progress, learning powerful and generic spatiotemporal representations remains challenging, due to the larger variations and complexities of video content.
A natural extension of CNN from the image to the video domain is the direct exploitation of 2D CNN on video frames [18, 34, 41] or 3D CNN on video clips [15, 28, 29, 38]. An inherent limitation of this extension, however, is that each convolution operation, either 2D or 3D, processes only a local window of neighboring pixels. As the window size is normally set to a small value, a holistic view of the field cannot be adequately captured. This problem is typically mitigated by performing repeated convolution and pooling operations to capture long-range visual dependencies. In this way, receptive fields can be enlarged through progressive propagation of signal responses over local operations. When a network is deep, however, the repeated operations pose difficulty to parameter optimization. Concretely, the connection between two distant pixels is only established after a large number of local operations, resulting in vanishing gradients.
In this paper, we present Local and Global Diffusion (LGD) networks – a novel architecture to learn spatiotemporal representations that capture large-range dependencies, as shown in Figure 1. In LGD networks, the feature maps are divided into local and global paths, respectively describing local variation and holistic appearance at each spatiotemporal location. The networks are composed of several stacked LGD blocks, each of which couples mutually inferring local and global paths. Specifically, the inference takes place by attaching the residual value of the global path to the output of the local feature map, while the feature of the global path is produced by a linear embedding of itself with the global average pooling of the local feature map. The diffusion is constructed at every level from bottom to top such that the learnt representations encapsulate a holistic view of content evolution. Furthermore, the final representations from both paths are combined by a novel kernel-based classifier proposed in this paper.
The main contribution of this work is the proposal of the Local and Global Diffusion networks, a two-path architecture aiming to model local and global video information. The diffusion between the two paths enables the learnt video representations to capture large-range dependencies economically and effectively. Through an extensive set of experiments, we demonstrate that our LGD networks outperform several state-of-the-art models on six benchmarks, including Kinetics400, Kinetics600, UCF101 and HMDB51 for video action recognition, and JHMDB and UCF101D for spatiotemporal action detection.
2 Related Work
We broadly categorize the existing research in video representation learning into hand-crafted and deep learning based methods.
Hand-crafted representations start by detecting spatiotemporal interest points and then describing them with local descriptors. Examples include Space-Time Interest Points (STIP) [21], Histogram of Gradient and Histogram of Optical Flow [22], 3D Histogram of Gradient [19], SIFT-3D [33] and Extended SURF [45]. These representations are extended from the image domain to model the temporal variation of 3D volumes. One particularly popular representation is the dense trajectory feature proposed by Wang et al., which densely samples local patches from each frame at different scales and then tracks them in a dense optical flow field [40]. These hand-crafted descriptors, however, are not optimized and hard to generalize across different video analysis tasks.
The second category is deep learning based video representation. Early works are mostly extended from image representation by applying 2D CNN on video frames. Karpathy et al. stack CNN-based frame-level representations in fixed-size windows and then leverage spatiotemporal convolutions for learning video representation [18]. In [34], the famous two-stream architecture is devised by applying two 2D CNN architectures separately on visual frames and stacked optical flows. This two-stream architecture is further extended by exploiting convolutional fusion [5], spatiotemporal attention [24], temporal segment networks [41, 42] and convolutional encoding [4, 27] for video representation learning. Ng et al. [49] highlight the drawback of performing 2D CNN on video frames, in which long-term dependencies cannot be captured by the two-stream network. To overcome this limitation, an LSTM-RNN is proposed in [49] to model long-range temporal dynamics in videos. Srivastava et al. [37] further formulate video representation learning as an autoencoder model based on encoder and decoder LSTMs.
The aforementioned approaches are limited by treating video as a sequence of frames and optical flows for representation learning. More concretely, pixel-level temporal evolution across consecutive frames is not explored. The problem is addressed by the 3D CNN proposed by Ji et al. [15], which directly learns spatiotemporal representation from a short video clip. Later, in [38], Tran et al. devise a widely adopted 3D CNN, namely C3D, for learning video representation over 16-frame video clips in the context of large-scale supervised video datasets. The performance of 3D CNN is further boosted by inflating 2D convolutional kernels [3], decomposing 3D convolutional kernels [28, 39] and aggregated residual transformations [9].
Despite these progresses, long-range temporal dependency beyond local operations remains not fully explored, which is the main theme of this paper. The work most closely related to this paper is [43], which investigates the non-local mean operation proposed in [2] and captures long-range dependency by iterative utilization of local and non-local operations. Our method differs from [43] in that local and global representations are learnt simultaneously, and the interaction between them encapsulates a holistic view for the local representation. In addition, we combine the final representations from both paths for more accurate prediction.
3 Local and Global Diffusion
We start by introducing the Local and Global Diffusion (LGD) blocks for representation learning. An LGD block is a cell with local and global paths interacting with each other. A classifier is then proposed to combine the local and global representations. With these, two LGD networks, namely LGD2D and LGD3D, deriving from temporal segment networks [41] and pseudo-3D convolutional networks [28], respectively, are further detailed.
3.1 Local and Global Diffusion Blocks
Unlike the existing methods which stack local operations to learn spatiotemporal representations, our proposed Local and Global Diffusion (LGD) model additionally integrates the global aspect into video representation learning. Specifically, we propose novel neural networks that learn a discriminative local representation and a global representation in parallel, while combining them to synthesize new information. To achieve this, the feature maps in the networks are split into a local path and a global path. Then, we define an LGD block to model the interaction between the two paths as
$(x_l^{k+1}, x_g^{k+1}) = \mathcal{B}^k(x_l^k, x_g^k),$  (1)

where $(x_l^k, x_g^k)$ and $(x_l^{k+1}, x_g^{k+1})$ denote the input pair and output pair of the $k$-th block. The local-global pair consists of the local feature map $x_l^k \in \mathbb{R}^{C \times T \times H \times W}$ and the global feature vector $x_g^k \in \mathbb{R}^{C}$, where $C$, $T$, $H$ and $W$ are the number of channels, temporal length, height and width of the 4D volume data, respectively. The detailed operations inside each block are shown in Figure 2 and can be decomposed into two diffusion directions as follows.
(1) Global-to-local diffusion. The first direction learns the transformation from $x_l^k$ to the updated local feature $x_l^{k+1}$ with the priority of the global vector $x_g^k$. Taking inspiration from the recent successes of Residual Learning [11], we formulate the global priority as a global residual value, which is broadcast to each location as

$x_l^{k+1} = \mathcal{F}\big(x_l^k + \mathcal{U}(W_{g \to l}\, x_g^k)\big),$  (2)

where $W_{g \to l} \in \mathbb{R}^{C \times C}$ is a projection matrix, $\mathcal{U}(\cdot)$ is the upsampling operation duplicating the residual vector to each spatiotemporal location and $\mathcal{F}$ is a local transformation function (e.g., 3D convolutions). The choice of the function $\mathcal{F}$ depends on the network architecture and will be discussed in Section 4.
(2) Local-to-global diffusion. The second direction updates the global vector with the current local feature $x_l^{k+1}$. Here, we simply linearly embed the input global feature and the Global Average Pooling (GAP) of the local feature by

$x_g^{k+1} = W_{l \to g}\, \mathrm{GAP}(x_l^{k+1}) + W_{g \to g}\, x_g^k,$  (3)

where $W_{l \to g}, W_{g \to g} \in \mathbb{R}^{C \times C}$ are the projection matrices combining the local and global features.
Compared with a traditional convolutional block which directly applies the transformation $\mathcal{F}$ to the local feature, the LGD block introduced in Eq. (2) and Eq. (3) only requires three more projection matrices to produce the output pair. In order to reduce the additional parameters of the LGD block, we exploit a low-rank approximation of each projection matrix as $W = P Q^{\top}$, in which $P \in \mathbb{R}^{C \times C'}$ and $Q \in \mathbb{R}^{C \times C'}$. When $C' \ll C$, the parameters as well as the computational cost can be sharply reduced. Through cross-validation, we empirically set a small $C'$, which is found not to impact the performance negatively. By this approximation, the number of additional parameters per block is reduced from $3C^2$ to $6CC'$.
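As a concrete illustration, the forward pass of one block with the two diffusion directions and low-rank projection factors can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the local transformation $\mathcal{F}$ is stood in by a ReLU, and the shapes and helper names (`lgd_block`, `low_rank`) are our own choices.

```python
import numpy as np

def low_rank(P, Q):
    """Approximate a C x C projection matrix as P @ Q.T, with P, Q in R^{C x C'}."""
    return P @ Q.T

def lgd_block(x_l, x_g, W_gl, W_lg, W_gg, f_local):
    """One LGD block: x_l has shape (C, T, H, W), x_g has shape (C,)."""
    # Global-to-local diffusion (Eq. 2): broadcast the global residual
    # to every spatiotemporal location, then apply the local transform.
    residual = W_gl @ x_g                                   # (C,)
    x_l_new = f_local(x_l + residual[:, None, None, None])  # (C, T, H, W)
    # Local-to-global diffusion (Eq. 3): linearly embed GAP of the new
    # local feature together with the previous global vector.
    gap = x_l_new.mean(axis=(1, 2, 3))                      # (C,)
    x_g_new = W_lg @ gap + W_gg @ x_g
    return x_l_new, x_g_new

rng = np.random.default_rng(0)
C, T, H, W, C_low = 64, 4, 7, 7, 16
# Each low-rank factor pair costs 2*C*C_low parameters instead of C*C.
mats = [low_rank(0.01 * rng.normal(size=(C, C_low)),
                 0.01 * rng.normal(size=(C, C_low))) for _ in range(3)]
x_l, x_g = rng.normal(size=(C, T, H, W)), rng.normal(size=C)
y_l, y_g = lgd_block(x_l, x_g, *mats, f_local=lambda z: np.maximum(z, 0.0))
print(y_l.shape, y_g.shape)  # (64, 4, 7, 7) (64,)
```

Note that the block preserves the shapes of both paths, so it can be stacked after every residual unit of a backbone network.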
3.2 Local and Global Combination Classifier
With the proposed LGD block, the network can learn local and global representations in parallel. The next question is how to make the final prediction by combining the two representations. Here, we consider a kernelized view of the similarity measurement between two videos. Formally, denoting $(x_l, x_g)$ and $(y_l, y_g)$ as the last output pairs of two videos, we choose the bilinear kernel [25] on both the local and global features, which can be trained end-to-end in a neural network. The kernel function is given by

$K\big((x_l, x_g), (y_l, y_g)\big) = \frac{1}{M^2}\sum_{i,j} k_b\big(x_l(i), y_l(j)\big) + k_b(x_g, y_g) \approx \Big\langle \frac{1}{M}\sum_i \mathrm{TS}\big(x_l(i)\big),\ \frac{1}{M}\sum_j \mathrm{TS}\big(y_l(j)\big) \Big\rangle + \big\langle \mathrm{TS}(x_g), \mathrm{TS}(y_g) \big\rangle,$  (4)

in which $M$ is the number of spatiotemporal locations, $k_b(x, y) = \langle x, y \rangle^2$ is the bilinear kernel and $x_l(i)$ denotes the feature vector at the $i$-th position of $x_l$. In the last line of Eq. (4), we approximate the bilinear kernel by the Tensor Sketch projection $\mathrm{TS}(\cdot)$ in [6], which can effectively reduce the dimension of the feature space. By decomposing the kernel function in Eq. (4), the feature mapping is formulated as

$\phi(x_l, x_g) = \Big[\frac{1}{M}\sum_i \mathrm{TS}\big(x_l(i)\big),\ \mathrm{TS}(x_g)\Big],$  (5)

where $[\cdot, \cdot]$ denotes the concatenation of two vectors. The mapping $\phi$ combines the pair $(x_l, x_g)$ into a high-dimensional representation. The whole process can be trained end-to-end in the neural networks. Finally, the resulting representation is fed into a fully connected layer for class label prediction.
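To make the Tensor Sketch approximation concrete, the count-sketch/FFT construction underlying the projection used in [6] can be sketched in NumPy as follows. This is our own illustration: the sketch dimension, seeds and the correlated test vectors are arbitrary choices, and in the real pipeline the projection is applied to the pooled local features and the global vector.

```python
import numpy as np

def tensor_sketch(x, h1, s1, h2, s2, d):
    """Tensor Sketch: <TS(x), TS(y)> approximates the bilinear kernel <x, y>^2."""
    def count_sketch(v, h, s):
        cs = np.zeros(d)
        np.add.at(cs, h, s * v)  # scatter-add signed entries into d bins
        return cs
    # Circular convolution of the two count sketches, computed via FFT.
    spec = np.fft.rfft(count_sketch(x, h1, s1)) * np.fft.rfft(count_sketch(x, h2, s2))
    return np.fft.irfft(spec, n=d)

rng = np.random.default_rng(1)
c, d = 32, 4096                        # input channels, sketch dimension
x = rng.normal(size=c)
y = x + 0.1 * rng.normal(size=c)       # correlated, so <x, y>^2 is well above zero
est = []
for _ in range(50):                    # average independent sketches to cut variance
    h1, h2 = rng.integers(0, d, c), rng.integers(0, d, c)
    s1, s2 = rng.choice([-1.0, 1.0], c), rng.choice([-1.0, 1.0], c)
    est.append(tensor_sketch(x, h1, s1, h2, s2, d)
               @ tensor_sketch(y, h1, s1, h2, s2, d))
true = float(x @ y) ** 2
print(np.mean(est), true)              # the two values should be close
```

The key point is that the sketches of the two inputs must share the same hash functions `h1, s1, h2, s2`; only then is the inner product of sketches an unbiased estimate of the degree-2 polynomial kernel.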
4 Local and Global Diffusion Networks
The proposed LGD block and classifier can be easily integrated with most existing video representation learning frameworks. Figure 3 shows two different constructions of LGD networks, called LGD2D and LGD3D, with different transformation functions and training strategies.
4.1 LGD2D
A straightforward way to learn video representation is to directly employ 2D convolution as the transformation function $\mathcal{F}$. Thus, in the local path of LGD2D, a shared 2D CNN is performed as the backbone network on each frame independently, as shown in the upper part of Figure 3. To enable efficient end-to-end learning, we uniformly split a video into T snippets and select only one frame per snippet for processing. The idea is inspired by the Temporal Segment Network (TSN) [41, 42], which overcomes the computational issue by selecting a subset of frames for long-term temporal modeling. Thus, the input of LGD2D consists of non-continuous frames, and the global path learns a holistic representation of all these frames. Please note that the initial local representation $x_l^0$ is achieved by a single local operation applied on the input frames, and the initial global representation $x_g^0$ is the global average of $x_l^0$. At the end of the networks, the local and global combination classifier is employed to achieve a hybrid prediction.
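The snippet sampling scheme borrowed from TSN can be sketched as follows. This is our own minimal illustration; the function name and the center-frame choice at test time are assumptions (TSN samples randomly within each snippet during training), and remainder frames beyond a whole number of snippets are simply ignored.

```python
import random

def sample_snippets(num_frames, T, training=True):
    """Uniformly split a video into T snippets and pick one frame index per snippet."""
    seg_len = num_frames // T          # remainder frames at the end are ignored
    indices = []
    for t in range(T):
        start = t * seg_len
        if training:                   # random frame within the snippet
            indices.append(start + random.randrange(seg_len))
        else:                          # deterministic center frame at test time
            indices.append(start + seg_len // 2)
    return indices

print(sample_snippets(30, 3, training=False))  # [5, 15, 25]
```

One frame per snippet keeps the per-video cost constant in T while still spanning the whole duration, which is what lets the global path see a holistic view of the video.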
4.2 LGD3D
Another major branch of video representation learning is 3D CNN [15, 28, 38]. Following the common settings of 3D CNN, we feed consecutive frames into the LGD3D network and exploit 3D convolution as the local transformation $\mathcal{F}$, as shown in the lower part of Figure 3. Nevertheless, the training of 3D CNN is computationally expensive and the model size also grows quadratically compared with 2D CNN. Therefore, we choose the pseudo-3D convolution proposed in [28], which decomposes 3D learning into 2D convolutions in the spatial domain and 1D operations in the temporal dimension. To simplify the decomposition, in this paper, we only choose the P3D-A block with the highest performance in [28], which cascades the spatial convolution and the temporal convolution in turn.
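To see why the decomposition pays off, compare parameter counts: a full 3D kernel couples 27 positions per channel pair, while the cascade of a spatial and a temporal convolution covers the same receptive field with 12. The quick sanity check below is our own illustration, assuming the standard 3×3×3 versus 1×3×3 + 3×1×1 kernel sizes from [28] and ignoring biases.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a single 3D convolution layer, biases ignored."""
    return c_in * c_out * kt * kh * kw

c = 256
full = conv3d_params(c, c, 3, 3, 3)                        # one 3x3x3 kernel
decomposed = (conv3d_params(c, c, 1, 3, 3)                 # spatial 1x3x3
              + conv3d_params(c, c, 3, 1, 1))              # temporal 3x1x1
print(full, decomposed, full / decomposed)  # 1769472 786432 2.25
```

The 2.25× reduction holds for any channel width, since it is simply the ratio 27/12 of kernel volumes.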
Here, we show an exemplar architecture of LGD3D based on the ResNet50 [11] backbone in Table 1. LGD3D firstly replaces each convolutional kernel in the original ResNet50 with one spatial convolution plus one temporal convolution, and then builds an LGD block on each residual unit. All the weights of the spatial convolutions can be initialized from the pre-trained ResNet50 model as done in [28]. The input video clip consists of consecutive frames at a small spatial resolution. The clip length is halved twice by two max pooling layers with a temporal stride of 2. The computational cost and training time can thus be effectively reduced by the small input resolution and temporal pooling. The final local representation is combined with the global representation by the kernelized classifier. This architecture can be easily extended to ResNet101 or deeper networks by repeating more LGD blocks.

4.3 Optimization
Next, we present the optimization of LGD networks. Considering the difficulty of training the whole network from scratch with the kernelized classifier [6, 25], we propose a two-stage strategy to train the LGD networks. At the beginning of the training, we optimize the basic network without the combination classifier, and adjust the local and global representations separately. Denoting $(x_l, x_g)$ and $y$ as the last output pair and the corresponding category of the input video, the optimization objective is given as

$\mathcal{L}_1 = \mathcal{L}_s\big(V_l\, \mathrm{GAP}(x_l), y\big) + \mathcal{L}_s(V_g\, x_g, y),$  (6)

where $\mathcal{L}_s(V z, y)$ denotes the softmax cross-entropy loss with projection matrix $V$. The overall loss consists of the classification errors from both the global representation and the local representation after global average pooling. After the training of the basic network, we then tune the whole network with the following loss:

$\mathcal{L}_2 = \mathcal{L}_s\big(V\, \phi(x_l, x_g), y\big),$  (7)

where $\phi$ is the feature mapping proposed in Section 3.2.
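A minimal NumPy rendering of the two training objectives is given below. This is our own sketch: the classifier matrices `V_l`, `V_g`, `V` are random stand-ins, and the feature mapping $\phi$ is replaced by simple concatenation of the two pooled features rather than the Tensor Sketch mapping of Section 3.2.

```python
import numpy as np

def softmax_xent(logits, label):
    """Softmax cross-entropy loss for a single example."""
    logits = logits - logits.max()            # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
C, num_classes, label = 64, 10, 3
x_l, x_g = rng.normal(size=(C, 4, 7, 7)), rng.normal(size=C)
V_l, V_g = rng.normal(size=(num_classes, C)), rng.normal(size=(num_classes, C))

# Stage 1 (Eq. 6): separate losses on GAP of the local path and on the global vector.
stage1 = (softmax_xent(V_l @ x_l.mean(axis=(1, 2, 3)), label)
          + softmax_xent(V_g @ x_g, label))

# Stage 2 (Eq. 7): a single loss on the combined representation phi(x_l, x_g);
# here phi is stood in by concatenating the two pooled features.
phi = np.concatenate([x_l.mean(axis=(1, 2, 3)), x_g])
V = rng.normal(size=(num_classes, 2 * C))
stage2 = softmax_xent(V @ phi, label)
print(float(stage1), float(stage2))
```

Training on the stage-1 objective first gives both paths a sensible representation before the kernelized classifier couples them in stage 2.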
Layer  Operation  Local path size 

conv1  , stride 1, 2, 2  
pool1  , max, stride 2, 1, 1  
res2  
pool2  , max, stride 2, 1, 1  
res3  
res4  
res5 
5 Experiments
5.1 Datasets
We empirically evaluate LGD networks on the Kinetics400 [3] and Kinetics600 [7] datasets. Kinetics400 is one of the large-scale action recognition benchmarks. It consists of around 300K videos from 400 action categories, divided into 240K, 20K and 40K for the training, validation and test sets, respectively. Each video in this dataset is a 10-second clip cropped from a raw YouTube video. Note that the labels for the test set are not publicly available, and the performances on Kinetics400 are all reported on the validation set. Kinetics600 is an extended version of Kinetics400, first made public in the ActivityNet Challenge 2018 [7]. It consists of around 480K videos from 600 action categories, divided into 390K, 30K and 60K for the training, validation and test sets, respectively. Since the labels for the Kinetics600 test set are available, we report the final performance on both the validation and test sets.
5.2 Training and Inference Strategy
Our proposal is implemented on the Caffe [16] framework and optimized with the mini-batch Stochastic Gradient Descent (SGD) algorithm. In the training stage, for LGD2D, the input image is randomly cropped from the resized video frame. For LGD3D, the input video clip is randomly cropped from a resized, non-overlapping 16-frame clip. Each frame/clip is randomly flipped along the horizontal direction for data augmentation. We set each mini-batch to 128 frame triples for LGD2D and 64 clips for LGD3D, implemented with multiple GPUs in parallel. For each stage in Section 4.3, the initial learning rate is set to 0.01 and divided by 10 after every 20 epochs. The training is stopped after 50 epochs.
There are two weight initialization strategies for LGD networks. The first is to train the whole network from scratch; in this case, all the convolutional kernels and the projection matrices in LGD blocks are initialized by Xavier initialization [8], and all the biases are set to zero. The second initializes the spatial convolutions with an existing 2D CNN pre-trained on the ImageNet dataset [31]. In order to preserve the semantic information of these pre-trained convolutions, we set the projection matrix $W_{g \to l}$ to zero, making the global residual value vanish when training begins. In particular, the temporal convolutions in LGD3D are initialized as an identity mapping in this case.
In the inference stage, we resize the video frames to a shorter side of 240/120 pixels for LGD2D/LGD3D, and perform spatially fully convolutional inference on the whole frame. Thus, LGD2D predicts one score for each frame triple, and the video-level prediction score is calculated by averaging the scores from 10 uniformly sampled frame triples. Similarly, the video-level prediction score of LGD3D is achieved by averaging the scores from 15 uniformly sampled 16-frame clips.
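The video-level aggregation at inference can be sketched as follows. This is our own illustration: `clip_score` stands in for a forward pass of the network, here faked with random class distributions over an assumed 400-way label space.

```python
import numpy as np

def video_score(num_frames, clip_len, num_clips, clip_score):
    """Average clip-level class scores from uniformly sampled clips."""
    # Uniformly spaced clip start frames covering the whole video.
    starts = np.linspace(0, num_frames - clip_len, num_clips).astype(int)
    scores = np.stack([clip_score(s) for s in starts])  # (num_clips, n_classes)
    return scores.mean(axis=0)

rng = np.random.default_rng(0)
fake_net = lambda start: rng.dirichlet(np.ones(400))    # stand-in clip prediction
score = video_score(num_frames=300, clip_len=16, num_clips=15, clip_score=fake_net)
print(score.shape)                                      # (400,)
```

Since each clip output is a distribution over classes, the averaged video-level score remains a valid distribution.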
5.3 Evaluation of LGD block
We first verify the effectiveness of our proposed LGD block for spatiotemporal representation learning by comparing it with two diffusion block variants that use different diffusion functions. The first variant ignores the global representation from lower layers, turning the output function of the global path into

$x_g^{k+1} = W_{l \to g}\, \mathrm{GAP}(x_l^{k+1}).$  (8)

Motivated by the channel-wise scaling proposed in [13], the second variant utilizes the global priority as a channel-wise multiplication. Thus, the output of its local path can be formulated as

$x_l^{k+1} = \mathcal{F}\big(x_l^k \odot \mathcal{U}(W_{g \to l}\, x_g^k)\big),$  (9)

where $\odot$ denotes element-wise multiplication.
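For reference, the local-path updates of the LGD block and the channel-wise variant can be rendered side by side in NumPy. This is our own sketch: the shapes, the ReLU stand-in for the local transform $\mathcal{F}$, and the function names are assumptions.

```python
import numpy as np

def local_update_lgd(x_l, x_g, W, f):
    # LGD block (Eq. 2): additive global residual broadcast to every location.
    return f(x_l + (W @ x_g)[:, None, None, None])

def local_update_scaling(x_l, x_g, W, f):
    # Channel-wise variant (Eq. 9): global priority as a multiplicative gate,
    # in the spirit of the squeeze-and-excitation scaling of [13].
    return f(x_l * (W @ x_g)[:, None, None, None])

rng = np.random.default_rng(0)
C = 32
x_l, x_g = rng.normal(size=(C, 2, 4, 4)), rng.normal(size=C)
W = 0.1 * rng.normal(size=(C, C))
relu = lambda z: np.maximum(z, 0.0)
a = local_update_lgd(x_l, x_g, W, relu)
b = local_update_scaling(x_l, x_g, W, relu)
print(a.shape, b.shape)  # (32, 2, 4, 4) (32, 2, 4, 4)
```

Both updates preserve the local feature shape; they differ only in whether the projected global vector enters additively or multiplicatively.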
Table 2 summarizes the performance comparisons on the Kinetics600 dataset. The backbone architectures are all ResNet50 trained from scratch. Overall, all three diffusion blocks (i.e., the LGD block and the two variants) exhibit better performance than the baseline networks for both 2D and 3D CNNs. The results basically indicate the advantage of exploring large-range dependency by the diffusion between the local and global paths. In particular, the proposed LGD block, which embeds both the input local and global representations and exploits the global priority as a residual value, consistently leads to better performance than the two variants.
The loss curves of the baseline networks and LGD networks are shown in Figure 4. The training losses of the local and global paths in Eq. (6) are given separately. Generally, the LGD networks produce lower losses than the baseline networks, and converge faster and more stably. Another observation is that the loss on the local path is consistently lower than the loss on the global path. We speculate that this may be due to information loss from the low-rank approximation of the projection matrices in Eq. (3).
5.4 An Ablation Study of LGD networks
Method  R50  R101  Img  Com  Long  Top1

LGD2D  ✓  –  –  –  –  72.5
LGD2D  ✓  –  ✓  –  –  74.4
LGD2D  ✓  –  ✓  ✓  –  74.8
LGD2D  –  ✓  –  –  –  74.5
LGD2D  –  ✓  ✓  –  –  76.4
LGD2D  –  ✓  ✓  ✓  –  76.7
LGD3D  ✓  –  –  –  –  74.2
LGD3D  ✓  –  ✓  –  –  75.8
LGD3D  ✓  –  ✓  ✓  –  76.3
LGD3D  ✓  –  ✓  ✓  ✓  79.4
LGD3D  –  ✓  –  –  –  76.0
LGD3D  –  ✓  ✓  –  –  77.7
LGD3D  –  ✓  ✓  ✓  –  78.3
LGD3D  –  ✓  ✓  ✓  ✓  81.5
Next, we study how each design in LGD networks influences the overall performance. Here, we choose ResNet50 (R50) or ResNet101 (R101) as the backbone network, either trained from scratch or pre-trained on ImageNet (Img). The local and global combination classifier (Com) uses the kernelized classifier for prediction. In order to capture long-term temporal information, we further extend the LGD3D network with 128-frame inputs (Long). Following the settings in [43], we firstly train the networks with 16-frame clips in the first stage of Section 4.3 and then with 128-frame clips in the second stage. When training with 128-frame clips, we increase the stride of the pool1 layer to 4, and set each mini-batch to 16 clips to meet GPU memory constraints. The training is stopped after 12.5 epochs.
Table 3 details the accuracy improvement on the Kinetics600 dataset by the different designs of LGD networks. When exploiting ResNet50 as the backbone, pre-training on ImageNet boosts the top-1 accuracy from 72.5% to 74.4% for LGD2D and from 74.2% to 75.8% for LGD3D, demonstrating the effectiveness of pre-training on a large-scale image recognition dataset. The local and global combination classifier, which combines the representations from the two paths, leads to a performance boost of 0.4% and 0.5% for LGD2D and LGD3D, respectively. Especially for LGD3D, training on 128-frame clips contributes a large performance increase of 3.1% by involving long-term temporal information in the network. Moreover, compared with ResNet50, both LGD2D and LGD3D based on ResNet101 exhibit significantly better performance, with top-1 accuracies of 76.7% and 81.5%, respectively. The results verify that deeper networks have larger capacity for spatiotemporal representation learning.
5.5 Comparisons with StateoftheArt
Method  Backbone  Top1  Top5 

I3D RGB [3]  Inception  72.1  90.3 
I3D Flow [3]  Inception  65.3  86.2 
I3D Twostream [3]  Inception  75.7  92.0 
ResNeXt101 RGB [9]  custom  65.1  85.7 
R(2+1)D RGB [39]  custom  74.3  91.4 
R(2+1)D Flow [39]  custom  68.5  88.1 
R(2+1)D Twostream [39]  custom  75.4  91.9 
NL I3D RGB [43]  ResNet101  77.7  93.3 
S3DG RGB [46]  Inception  74.7  93.4 
S3DG Flow [46]  Inception  68.0  87.6 
S3DG Twostream [46]  Inception  77.2  93.0 
From Anet17 winner report [1]  
2D CNN RGB  InceptionResNetv2  73.0  90.9 
Threestream late fusion  InceptionResNetv2  74.9  91.6 
Threestream SATT  InceptionResNetv2  77.7  93.2 
LGD3D RGB  ResNet101  79.4  94.4 
LGD3D Flow  ResNet101  72.3  90.9 
LGD3D Twostream  ResNet101  81.2  95.2 
We compare with several state-of-the-art techniques on the Kinetics400 and Kinetics600 datasets. The performance comparisons are summarized in Tables 4 and 5, respectively. Please note that most recent works employ a fusion of two or three modalities on these two datasets. Broadly, the most common settings fall into four categories, i.e., RGB, Flow, Two-stream and Three-stream. The RGB/Flow settings feed video frames/optical flow images into the networks. The optical flow image in this paper consists of two-direction optical flow extracted by the TV-L1 algorithm [50]. Two-stream methods fuse the predictions from the RGB and Flow modalities, and Three-stream approaches further merge the prediction from the audio input.
As shown in Table 4, with only RGB input, LGD3D achieves 79.4% top-1 accuracy, a relative improvement over the recent approaches I3D [3], R(2+1)D [39], NL I3D [43] and S3DG [46] of 10.1%, 6.8%, 2.1% and 6.2%, respectively. This accuracy is also higher than that of the 2D CNN with a deeper backbone reported by the ActivityNet 2017 challenge winner [1]. Note that LGD3D with RGB input obtains higher performance even compared with the Two-stream or Three-stream methods. When fusing the predictions from both RGB and Flow modalities, the accuracy of LGD3D is further improved to 81.2%, which is to date the best published performance on Kinetics400.
Method  Backbone  Top1  Top5 

From Anet18 winner report [10]  
TSN RGB  SENet152  76.2  – 
TSN Flow  SENet152  71.3  – 
StNet RGB  InceptionResNetv2  78.9  – 
NL I3D RGB  ResNet101  78.6  – 
Threestream Attention  mixed  82.3  96.0 
Threestream iTXN  mixed  82.4  95.8 
From Anet18 runnerup report [48]  
P3D RGB  ResNet152  78.4  93.9 
P3D Flow  ResNet152  71.0  90.0 
P3D Twostream  ResNet152  80.9  94.9 
LGD3D RGB  ResNet101  81.5  95.6 
LGD3D Flow  ResNet101  75.0  92.4 
LGD3D Twostream  ResNet101  83.1  96.2 
LGD3D Twostream*  ResNet101  82.7  96.0 
Similar results are observed on Kinetics600, as summarized in Table 5. Since this dataset was recently made available for the ActivityNet 2018 challenge, we show the performance of different approaches reported by the challenge winner [10] and runner-up [48]. With RGB inputs, LGD3D achieves 81.5% top-1 accuracy on the Kinetics600 validation set, a 3.4% relative improvement over P3D with the deeper ResNet152 backbone. The performance is also higher than that of NL I3D, which likewise explores large-range dependency. This result basically indicates that the LGD network is an effective way to learn video representation with a global aspect. By combining the RGB and Flow modalities, the top-1 accuracy of LGD3D reaches 83.1%, which is even higher than the three-stream method proposed by the ActivityNet 2018 challenge winner.
5.6 Evaluation on Video Representation
Here we evaluate the video representation learnt by our LGD3D on two different tasks and four popular datasets, i.e., UCF101, HMDB51, JHMDB and UCF101D. UCF101 [36] and HMDB51 [20] are two of the most popular video action recognition benchmarks. UCF101 consists of 13K videos from 101 action categories, and HMDB51 consists of 7K videos from 51 action categories. We follow the three training/test splits provided by the dataset organisers. Each split in UCF101 includes about 9.5K training and 3.7K test videos, while a HMDB51 split contains 3.5K training and 1.5K test videos.
JHMDB and UCF101D are two datasets for spatiotemporal action detection. JHMDB [14] contains 928 well-trimmed video clips of 21 actions. The videos are truncated to actions, and bounding box annotations are available for all frames. It provides three training/test splits for evaluation. UCF101D [36] is a subset of UCF101 for the action detection task, consisting of 3K videos from 24 classes with spatiotemporal ground truths.
Method  Pretraining  U101  H51 

IDT [40]  –  86.4  61.7 
Twostream [34]  ImageNet  88.0  59.4 
TSN [41]  ImageNet  94.2  69.4 
I3D RGB [3]  ImageNet+Kinetics400  95.4  74.5 
I3D Flow [3]  ImageNet+Kinetics400  95.4  74.6 
I3D Twostream [3]  ImageNet+Kinetics400  97.9  80.2 
ResNeXt101 RGB [9]  Kinetics400  94.5  70.2 
R(2+1)D RGB [39]  Kinetics400  96.8  74.5 
R(2+1)D Flow [39]  Kinetics400  95.5  76.4 
R(2+1)D Twostream [39]  Kinetics400  97.3  78.7 
S3DG RGB [46]  ImageNet+Kinetics400  96.8  75.9 
LGD3D RGB  ImageNet+Kinetics600  97.0  75.7 
LGD3D Flow  ImageNet+Kinetics600  96.8  78.9 
LGD3D Twostream  ImageNet+Kinetics600  98.2  80.5 
Method  JHMDB (IoU: 0.2, 0.5)  UCF101D (IoU: 0.05, 0.1, 0.2, 0.3)

Weinzaepfel et al. [44]  63.1  60.7  54.3  51.7  46.8  37.8 
Saha et al. [32]  72.6  71.5  79.1  76.6  66.8  55.5 
Peng et al. [26]  74.3  73.1  78.8  77.3  72.9  65.7 
Singh et al. [35]  73.8  72.0  –  –  73.5  – 
Kalogeiton et al. [17]  74.2  73.7  –  –  77.2  – 
Hou et al. [12]  78.4  76.9  78.2  77.9  73.1  69.4 
Yang et al. [47]  –  –  79.0  77.3  73.5  60.8 
Li et al. [23]  82.7  81.3  82.1  81.3  77.9  71.4 
LGD3D RGB  77.3  74.2  78.8  77.6  69.3  64.1 
LGD3D Flow  84.5  82.9  86.5  84.2  79.8  74.7 
LGD3D Twostream  85.7  84.9  88.3  87.1  82.2  75.6 
We first validate the global representations learnt by the pre-trained LGD3D network by fine-tuning it on the UCF101 and HMDB51 datasets. The performance comparisons are summarized in Table 6. Overall, the two-stream LGD3D achieves 98.2% on UCF101 and 80.5% on HMDB51, consistently indicating that the video representation produced by our LGD3D attains a performance boost over baselines on the action recognition task. Specifically, the two-stream LGD3D outperforms three traditional approaches, i.e., IDT, Two-stream and TSN, by 11.8%, 10.2% and 4.0% on UCF101, respectively. The results demonstrate the advantage of pre-training on a large-scale video recognition dataset. Moreover, compared with recent methods pre-trained on the Kinetics400 dataset, LGD3D still surpasses the best competitor, two-stream I3D, by 0.3% on UCF101.
Next, we evaluate the local representations from the pre-trained LGD3D networks on the task of spatiotemporal action detection. To build the action detection framework based on LGD3D, we first obtain action proposals in each frame by a region proposal network [30] with ResNet101. The action tubelets are generated by proposal linking and temporal trimming as in [32]. Then the prediction score of each proposal is estimated from the ROI-pooled local feature of the LGD3D network. In Table 7, we summarize the performance comparisons on JHMDB (3 splits) and UCF101D with different IoU thresholds. Our LGD3D achieves the best performance in all cases. Specifically, at the standard threshold (0.5 for JHMDB and 0.2 for UCF101D), LGD3D makes a relative improvement of 4.4% and 5.5% over the best competitor [23] on JHMDB and UCF101D, respectively. Figure 5 showcases four detection examples from JHMDB and UCF101D.

6 Conclusion
We have presented the Local and Global Diffusion (LGD) network architecture, which aims to learn local and global representations in a unified fashion. Particularly, we investigate the interaction between localized and holistic representations by designing the LGD block with diffusion operations to model local and global features. A kernelized classifier is also formulated to combine the final predictions from the two representations. With these two components, we have proposed two LGD network architectures, i.e., LGD2D and LGD3D, based on 2D CNN and 3D CNN, respectively. The results on the large-scale Kinetics400 and Kinetics600 datasets validate our proposal and analysis. Similar conclusions are also drawn from the other four datasets in the context of video action recognition and spatiotemporal action detection. The spatiotemporal video representation produced by our LGD networks is not only effective but also generalizes well across datasets and tasks. Performance improvements are clearly observed when comparing to other feature learning techniques. More remarkably, we achieve new state-of-the-art performance on all six datasets.
Our future work is as follows. First, more advanced techniques, such as attention mechanisms, will be investigated in the LGD block. Second, a more in-depth study of how to combine the local and global representations could be explored. Third, we will extend the LGD networks to other types of inputs, e.g., audio information.
References
 [1] Yunlong Bian, Chuang Gan, Xiao Liu, Fu Li, Xiang Long, Yandong Li, Heng Qi, Jie Zhou, Shilei Wen, and Yuanqing Lin. Revisiting the effectiveness of offtheshelf temporal modeling approaches for largescale video classification. arXiv preprint arXiv:1708.03805, 2017.
 [2] Antoni Buades, Bartomeu Coll, and JM Morel. A nonlocal algorithm for image denoising. In CVPR, 2005.
 [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
 [4] Ali Diba, Vivek Sharma, and Luc Van Gool. Deep temporal linear encoding networks. In CVPR, 2017.
 [5] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional twostream network fusion for video action recognition. In CVPR, 2016.
 [6] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In CVPR, 2016.
 [7] Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Khrisna, Shyamal Buch, and Cuong Duc Dao. The activitynet large-scale activity recognition challenge 2018 summary. arXiv preprint arXiv:1808.03766, 2018.
 [8] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 [9] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, 2018.
 [10] Dongliang He, Fu Li, Qijie Zhao, Xiang Long, Yi Fu, and Shilei Wen. Exploiting spatial-temporal modelling and multi-modal fusion for human action recognition. arXiv preprint arXiv:1806.10319, 2018.
 [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [12] Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In ICCV, 2017.
 [13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
 [14] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In ICCV, 2013.
 [15] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE Trans. on PAMI, 35(1):221–231, 2013.
 [16] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
 [17] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV, 2017.
 [18] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
 [19] Alexander Klaser, Marcin Marszałek, and Cordelia Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
 [20] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.

 [21] Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
 [22] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
 [23] Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei. Recurrent tubelet proposal and recognition networks for action detection. In ECCV, 2018.
 [24] Dong Li, Ting Yao, Lingyu Duan, Tao Mei, and Yong Rui. Unified spatio-temporal attention networks for action recognition in videos. IEEE Trans. on MM, 21(2):416–428, 2019.
 [25] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In ICCV, 2015.
 [26] Xiaojiang Peng and Cordelia Schmid. Multi-region two-stream R-CNN for action detection. In ECCV, 2016.
 [27] Zhaofan Qiu, Ting Yao, and Tao Mei. Deep quantization: Encoding convolutional activations with deep generative model. In CVPR, 2017.
 [28] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
 [29] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning deep spatio-temporal dependence for semantic video segmentation. IEEE Trans. on MM, 20(4):939–949, 2018.
 [30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [32] Suman Saha, Gurkirt Singh, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In BMVC, 2016.
 [33] Paul Scovanner, Saad Ali, and Mubarak Shah. A 3-dimensional sift descriptor and its application to action recognition. In ACM MM, 2007.
 [34] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
 [35] Gurkirt Singh, Suman Saha, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In ICCV, 2017.
 [36] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.
 [37] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using lstms. In ICML, 2015.
 [38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
 [39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
 [40] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.
 [41] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
 [42] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE Trans. on PAMI, 2018.
 [43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Nonlocal neural networks. In CVPR, 2018.
 [44] Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In ICCV, 2015.
 [45] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, 2008.
 [46] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy tradeoffs in video classification. In ECCV, 2018.
 [47] Zhenheng Yang, Jiyang Gao, and Ram Nevatia. Spatio-temporal action detection with cascade proposal and location anticipation. In BMVC, 2017.
 [48] Ting Yao and Xue Li. Yh technologies at activitynet challenge 2018. arXiv preprint arXiv:1807.00686, 2018.
 [49] Joe YueHei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
 [50] C Zach, T Pock, and H Bischof. A duality based approach for realtime tv-l1 optical flow. In Pattern Recognition, 2007.