
Locally-Consistent Deformable Convolution Networks for Fine-Grained Action Detection

by   Khoi-Nguyen C. Mac, et al.

Fine-grained action detection is an important task with numerous applications in robotics, human-computer interaction, and video surveillance. Several existing methods use the popular two-stream approach, which learns the spatial and temporal information independently from one another. Additionally, the temporal stream of the model usually relies on optical flow extracted from the video stream. In this work, we propose a deep learning model to jointly learn both spatial and temporal information without the necessity of optical flow. We also propose a novel convolution, namely locally-consistent deformable convolution, which enforces a local coherency constraint on the receptive fields. The model produces short-term spatio-temporal features, which can be flexibly used in conjunction with other long-temporal modeling networks. The proposed features, used in conjunction with a major long-temporal model, ED-TCN, outperform the original ED-TCN implementation on two fine-grained action datasets, 50 Salads and GTEA, by up to 10.0%, and also outperform the recent state-of-the-art TDRN by up to 5.9%.



1 Introduction

(a) frame at time $t-1$.
(b) frame at time $t$.
(c) masks of the person at time $t-1$ (blue) and $t$ (green).
(d) no motion vectors found on the background region.
(e) motion vectors found on the moving region.
(f) visualization of motion in feature space.
Figure 1: Visualization of the difference of adaptive receptive fields for the action cutting lettuce in the 50 Salads dataset: (a) and (b) are two consecutive frames; (c) is the manually defined mask of the person at times $t-1$ and $t$; (d) and (e) are motion vectors in background and moving regions (green dots indicate activation locations and red arrows indicate motion vectors); and (f) is the energy of the motion field in feature space, computed from multiple layers.

Action detection, a.k.a. action segmentation, addresses the task of classifying every frame of a given video as one of a set of defined categories, including a category for unknown actions. This is in contrast to the simpler task of action recognition, wherein a given video is pre-segmented and guaranteed to belong to one of the provided classes. The setting of action detection is closely related to the goals of human-computer interaction and video surveillance [9, 17, 18].

In this work, we focus on producing spatio-temporal features for fine-grained actions. Fine-grained actions are defined to be actions which have high inter-class similarity [14, 17], i.e. it is difficult, even for humans, to distinguish two different actions just from observing single frames. Unlike generic action detection, which can largely rely on “what” is in a video frame to perform detection, fine-grained action detection requires additional reasoning about “how” the objects are moving across several video frames. In other words, fine-grained actions rely heavily on motion, rather than mostly on appearance cues.

Traditional approaches tackle problems in video analysis by decoupling spatial and temporal information in different feature extractors and then combining the two streams with a fusion module. Fusion results are then fed into a long-term dependency modeling network in order to learn a longer-term temporal structure. The construction of such networks has received much attention in the literature, whereas the feature extraction backbone is often kept the same. More specifically, the spatial features often come from a standard spatial Convolutional Neural Network (CNN) while temporal features are modeled by inputting optical flow into a second CNN [4, 5, 10, 11, 12, 15, 17]. More recently, there have been efforts to model motion in video using variants of 3-D convolutions [1, 8, 21]. However, in such cases, modeling of the motion field is somewhat limited by the receptive fields of standard convolutional filters.

Instead of modeling temporal information with optical flow (as is generally the norm in the fine-grained action detection community), we learn temporal information in the feature space. This is accomplished by utilizing the proposed locally-consistent deformable convolution (LCDC), which is an extension of the standard deformable convolution [2]. At a high level, we model motion by evaluating the local movements in adaptive receptive fields over time, instead of relying on additional optical flow. A local coherency constraint is enforced over the adaptive receptive fields in order to make the learned temporal information consistent. Our main contributions are:

  • Modeling motion in feature space, using changes in adaptive receptive fields over time, instead of relying on pixel space as in traditional optical flow based methods. To the best of our knowledge, we are the first to extract temporal information from receptive fields.

  • Introducing local coherency constraint to enforce consistency in motion. This consistency reduces redundant model parameters, making motion modeling more robust.

  • Constructing a backbone single-stream network that can effectively learn joint spatial and temporal information. This backbone network is flexible and can be used in consonance with other approaches that model long-term temporal structure. We also prove that the learned features can replace optical flow in terms of temporal modeling.

  • Achieving a significant reduction of model complexity by using a single-stream network and enforcing the local coherency constraint on deformable convolution layers, without sacrificing overall performance. This reduction scales with the number of deformable convolution layers. Moreover, our single-stream approach is more computationally efficient than traditional two-stream networks, which require expensive optical flow computation and multi-stream inference.

To demonstrate the effectiveness of our approach, we evaluate on two standard fine-grained action detection datasets: 50 Salads [19] and Georgia Tech Egocentric Activities (GTEA) [3]. We show that our system is robust and out-performs features from two-stream networks, without any optical flow guidance. Additionally, we provide mathematical proofs and perform quantitative evaluation of the learned motion using ablation studies to demonstrate the power of our model in capturing temporal information.

2 Related work

An extensive body of literature exists for action recognition and backbone network architectures. In this section we will only review the most recent and relevant papers related to our approach.

Generic Action Recognition: Simonyan and Zisserman [15] proposed a two-stream network architecture which processes RGB video frames and optical flows. The video and optical flows are processed through two separate deep convolutional neural networks modified from the VGG architecture, and the scores from the two streams are combined at the end of the network. Feichtenhofer et al. [5, 4] proposed different ways of fusing information from the two streams, using different operators, such as sum and max. They also leverage ResNet and the two-stream architecture to capture information from the two domains. A different school of thought models motion using variants of 3-D convolutions, including C3D proposed by Tran et al. [21]. Along similar lines, Carreira and Zisserman [1] proposed a two-stream inflated 3-D convolutional network (I3D) for action recognition. However, the aforementioned variants of 3-D convolutions have largely been applied to general action recognition and use fixed receptive fields.

Fine-Grained Action Detection: Singh et al. [17] proposed a multi-stream network (MSB-RNN) for fine-grained action detection. Their network begins with a CNN that has four streams, two for appearance and two for motion. The output features of this CNN, namely MSN, are processed by a bi-directional LSTM to analyze long-term dependency. Lea et al. [10] proposed to use temporal convolution networks instead of an LSTM. The network fuses the given spatio-temporal features and captures long-range temporal patterns by convolving them in the time-domain. Lei et al. [12] later proposed a different approach, namely Temporal Deformable Residual Networks (TDRN), to model long-term temporal information by applying deformable convolution in time domain. TDRN is currently the state-of-the-art in fine-grained action detection. In general, these methods focus on the fusion part and the performance relies heavily on the extracted spatio-temporal feature, which in turn requires optical flow and multiple streams. In contrast, our approach focuses on constructing better features which do not rely on optical flow. Our produced features can be used in conjunction with other long-temporal modeling networks, such as MSN, ED-TCN, or TDRN.

Network Architectures: Pre-trained architectures for image classification, such as VGG, Inception, and ResNet [6, 16, 20], are perhaps the most important determinants of the performance of the main downstream vision tasks. Many papers have focused on improving the recognition accuracy by innovating on the network architecture. In standard convolution, not all pixels in a receptive field share the same contribution to the convolutional response because of the kernel shape. Dilated convolutions have been introduced to overcome this problem by changing the shape of receptive fields with some pre-determined patterns [7, 22, 23]. In 2017, Dai et al. [2] introduced deformable convolutional networks with adaptive receptive fields. The method is more flexible since the receptive fields depend on the input and can approximate an arbitrary object’s shape. We leverage the advances of [2], specifically the adaptive receptive fields from the model, to capture motion/actions in the feature space. We further add a local coherency constraint on receptive fields in order to ensure that the motion fields are consistent. This constraint also plays a major role in reducing model complexity.

3 Approach

Figure 2: Network architecture of our proposed framework across multiple frames. Appearance information comes from the last layer while motion information is extracted directly from deformation in the feature space instead of from a separate optical flow stream. Weights are shared across all time frames.

Our architecture builds upon deformable convolutional networks with an underlying ResNet CNN. While a deformable convolutional network has been shown to succeed in the task of object detection and semantic segmentation, it is not directly designed for fine-grained action detection. However, we observe that the deformable convolution layers have a byproduct which can capture motion very naturally, the adaptive receptive field.

At a high level, an adaptive receptive field in a deformable convolution layer can be viewed as an aggregation of important pixels, as the network has the flexibility to change where each convolution samples from. In a way, the adaptive receptive fields are performing some form of key-points detection. Therefore, our hypothesis is that, if the key-points are consistent across frames, then we can capture motion by taking the differences in the adaptive receptive fields across time. As a deformable convolution can be trained end-to-end, our network can learn to model motion at hidden layers of the network. Combining this with spatial features leads to a powerful spatio-temporal feature.

We illustrate the intuition of our method in Fig. 1. The motion here is computed using the difference in adaptive receptive fields on multiple feature spaces instead of pixel space as in optical flow. Two consecutive frames of the action cutting lettuce from the 50 Salads dataset are shown in Fig. 1(a) and Fig. 1(b). Fig. 1(c) shows masks of the person to illustrate how the action takes place. We also show the motion vectors corresponding to different regions in Fig. 1(d) and Fig. 1(e). Red arrows are used to describe the motion and green dots are used to show the corresponding activation units. We suppress motion vectors with low values for the sake of visualization. In Fig. 1(d), the activation unit lies on a background region (cut ingredients inside the bowl) and so there is no motion recorded, as the difference between two adaptive receptive fields of the background region over time is minimal. However, we can find motion in Fig. 1(e) (the field of red arrows) because the activation unit lies on a moving region, i.e. the arm region. The motion field at all activation units is seen in Fig. 1(f), where the field’s energy corresponds to the length of motion vectors at each location. The motion field is excited around the moving region (the arm) while suppressed in the background. This strongly suggests that the motion information we introduce can be used as an alternative to optical flow. A schematic of the proposed network architecture is shown in Fig. 2.

3.1 Deformable convolution

We first briefly review the deformable convolution layers, before going into a concrete description of the construction of the network architecture. Let $x$ be the input signal such that $x \in \mathbb{R}^{N}$. The standard convolution is defined as:

$$y[n] = \sum_{k} w[-k]\, x[n + k],$$

where $w \in \mathbb{R}^{K}$ is the convolutional kernel, and $n$ and $k$ are the signal and kernel indices ($n$ and $k$ can be treated as multidimensional indices). The deformable convolution proposed in [2] is therefore defined as:

$$y[n] = \sum_{k} w[-k]\, x(n + k + \Delta_{n,k}), \tag{1}$$

where $\Delta \in \mathbb{R}^{N \times K}$ represents the displacement offsets of deformable convolution. These offsets are learned from another convolution with $x$, i.e. $\Delta_{n,k} = (h_k * x)[n]$, where $h$ is a different kernel. Note that we use parentheses $(\cdot)$ instead of brackets $[\cdot]$ for $x$ in Eq. (1) because the index $n + k + \Delta_{n,k}$ requires interpolation as $\Delta$ is fractional.
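The definition above can be sketched in one dimension as follows. This is only an illustrative NumPy sketch under stated assumptions: the function names are hypothetical, the kernel is indexed with the cross-correlation convention common in deep-learning libraries (taps centred on zero), samples outside the signal are treated as zero, and the offsets are passed in directly rather than predicted by a second convolution.

```python
import numpy as np

def interp1d_zero_pad(x, pos):
    """Linearly interpolate x at a fractional position; samples outside x count as zero."""
    lo = int(np.floor(pos))
    frac = pos - lo

    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)

def deformable_conv1d(x, w, delta):
    """y[n] = sum_k w[k] * x(n + k + delta[n, k]) with kernel taps centred on zero.

    x: (N,) signal, w: (K,) kernel with K odd, delta: (N, K) fractional offsets.
    """
    N, K = len(x), len(w)
    half = K // 2
    y = np.zeros(N)
    for n in range(N):
        for j in range(K):
            k = j - half  # kernel index relative to the activation location n
            y[n] += w[j] * interp1d_zero_pad(x, n + k + delta[n, j])
    return y

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
w = np.array([0.5, 1.0, -1.0])
y_plain = deformable_conv1d(x, w, np.zeros((5, 3)))     # zero offsets: ordinary correlation
y_shift = deformable_conv1d(x, w, np.full((5, 3), 0.5)) # every tap moved by half a sample
```

With all offsets set to zero the function reduces to an ordinary correlation, which is a convenient sanity check; fractional offsets are where the interpolation matters.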

3.2 Modeling temporal information with adaptive receptive fields

We define the adaptive receptive field of a deformable convolution at time $t$ as $F^{(t)} \in \mathbb{R}^{N \times K}$, where $F^{(t)}_{n,k} = n + k + \Delta^{(t)}_{n,k}$. To extract motion information from adaptive receptive fields, we take the difference of the receptive field through time, which we denote as:

$$r^{(t)} = F^{(t)} - F^{(t-1)} = \Delta^{(t)} - \Delta^{(t-1)}. \tag{2}$$

The activation locations $n + k$ cancel in going from the first to the second expression in Eq. (2), leaving only the difference of displacement offsets.

Fig. 3 further illustrates the meaning of $r^{(t)}$ in 2D for different types of convolutions. The red square shows the current activation location, green dots show the standard receptive fields, and blue dots show the receptive fields after adding displacement offsets. In the last row, red arrows show the changes of the receptive field from time $t-1$ (faded blue dots) to time $t$ (solid blue dots). Note that there are no red arrows for standard convolution and dilated convolution because the offsets are either zero or identical across time. Red arrows only appear in deformable convolution, which motivates its use for modeling temporal information.
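A small numerical sketch of this observation in 1D is given below. The array sizes, the dilation rate, and the random offsets are illustrative assumptions; `receptive_field` is a hypothetical helper that returns the sampling position of every kernel tap at every location.

```python
import numpy as np

def receptive_field(n_locations, kernel_idx, delta):
    """Sampling positions n + k + delta[n, k] for every location n and kernel tap k."""
    n = np.arange(n_locations)[:, None]
    k = np.asarray(kernel_idx)[None, :]
    return n + k + delta

N = 6
rng = np.random.default_rng(0)
standard_k = np.array([-1, 0, 1])
dilated_k = 2 * standard_k  # dilation rate 2: a fixed pattern, nothing is learned

zeros = np.zeros((N, 3))
delta_prev = rng.normal(scale=0.5, size=(N, 3))  # hypothetical offsets at the earlier frame
delta_curr = rng.normal(scale=0.5, size=(N, 3))  # hypothetical offsets at the later frame

# Difference of receptive fields between the two frames for each convolution type.
r_standard = receptive_field(N, standard_k, zeros) - receptive_field(N, standard_k, zeros)
r_dilated = receptive_field(N, dilated_k, zeros) - receptive_field(N, dilated_k, zeros)
r_deform = (receptive_field(N, standard_k, delta_curr)
            - receptive_field(N, standard_k, delta_prev))
```

Only the deformable case yields a non-zero difference, and that difference equals the change in offsets because the fixed sampling grid cancels.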

Figure 3: Illustration of temporal information modeled by the difference of receptive fields at a single location in 2D. Only deformable convolution can capture temporal information (shown with red arrows). Related to Eq. (1) and Eq. (2), $n$ is the red square, $n + k$ are the green dots, $\Delta_{n,k}$ are the black arrows, $F^{(t)}$ are the blue dots, and $r^{(t)}$ are the red arrows.

3.3 Locally-consistent deformable convolution

Figure 4: Illustration of receptive fields at two consecutive locations (faded and solid red squares) in 2D at time $t$, with and without the local coherency constraint. The constraint produces more local consistency in receptive fields (with deformation represented by arrows).

Directly modeling motion using $r^{(t)}$ is not very effective because there is no guarantee of local consistency in receptive fields in the original deformable convolution formulation, as illustrated in Fig. 4. The figure shows receptive fields at two consecutive locations: faded and solid red squares. The receptive fields of the faded square are also annotated by faded dots. We can see that some green dots on the left side of Fig. 4 correspond to two displacement offsets (blue dots) as we move from the faded to the solid red square. This effect results in an inconsistent motion tensor $r^{(t)}$. In order to model motion effectively, we wish to have a receptive field that respects local consistency, as shown on the right side of Fig. 4. While local consistency could be learned as a side-effect of the training process, it is still not explicitly formulated in the original deformable convolution formulation.

The former behavior is attributed to the way in which $\Delta$ is defined in the original deformable convolution formulation on both location ($n$) and kernel ($k$) indices, which essentially corresponds to $x[m]$, where $m = n + k$. However, there are multiple ways to decompose $m$, i.e. $m = (n - l) + (k + l)$, for any $l$. Therefore, one single $x[m]$ is deformed by multiple $\Delta_{n-l,k+l}$, with different $l$. This produces inconsistency when we model $r^{(t)}$ in Eq. (2), as there are multiple motion vectors corresponding to the same location.
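The decomposition argument can be made concrete with a toy 1D example. Everything here is an illustrative assumption: a signal of six samples, a three-tap kernel, and random offsets. The sketch lists every (location, kernel index) pair whose undeformed tap lands on the same input sample, then compares the offsets that sample receives when offsets are indexed per pair with those it receives when a single offset is tied to the input position.

```python
import numpy as np

N = 6
kernel_idx = (-1, 0, 1)
rng = np.random.default_rng(1)

def taps_hitting(m):
    """All (location, kernel index) pairs whose undeformed tap lands on input sample m."""
    return [(m - k, k) for k in kernel_idx if 0 <= m - k < N]

delta = rng.normal(size=(N, len(kernel_idx)))  # hypothetical offsets, one per (location, tap)
delta_tilde = rng.normal(size=N)               # hypothetical offsets, one per input position

m = 3
pairs = taps_hitting(m)
dc_offsets = [delta[n, kernel_idx.index(k)] for n, k in pairs]
tied_offsets = [delta_tilde[n + k] for n, k in pairs]
```

In this toy case the same input sample is displaced in three different ways under per-pair offsets, whereas tying the offset to the input position leaves exactly one displacement and therefore one motion vector per location.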

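In order to enforce consistency, we propose a locally-consistent deformable convolution (LCDC):

$$y[n] = \sum_{k} w[-k]\, x(n + k + \tilde{\Delta}_{n+k}), \tag{3}$$

for $\tilde{\Delta} \in \mathbb{R}^{N}$. LCDC is a special case of deformable convolution where

$$\Delta_{n,k} = \tilde{\Delta}_{n+k}, \quad \forall n, k. \tag{4}$$

We name this the local coherency constraint. The interpretation of LCDC is that instead of deforming the receptive field as in Eq. (1), we can deform the input signal. Specifically, LCDC in Eq. (3) can be rewritten as:

$$y[n] = \sum_{k} w[-k]\, \tilde{x}[n + k] = (\tilde{x} * w)[n], \tag{5}$$

where

$$\tilde{x}[n] = (\mathcal{D}_{\tilde{\Delta}}\{x\})[n] = x(n + \tilde{\Delta}_{n}) \tag{6}$$

is a deformed version of $x$ and $*$ is the standard convolution ($\mathcal{D}_{\tilde{\Delta}}\{\cdot\}$ is defined as the deforming operation by offset $\tilde{\Delta}$).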
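The equivalence between deforming the receptive field and deforming the input can be checked numerically. The following is a minimal 1D sketch, assuming linear interpolation with zeros outside the signal, cross-correlation indexing of the kernel (taps centred on zero), and hypothetical function names; it is not the authors' implementation.

```python
import numpy as np

def sample(x, pos):
    """Linear interpolation of x at fractional positions; zero outside the signal."""
    return np.interp(pos, np.arange(len(x)), x, left=0.0, right=0.0)

def lcdc_direct(x, w, offsets):
    """Deform the receptive field: each tap at input position m reads x(m + offsets[m])."""
    N, half = len(x), len(w) // 2
    y = np.zeros(N)
    for n in range(N):
        for j, wk in enumerate(w):
            m = n + j - half  # undeformed input position of this tap
            if 0 <= m < N:    # taps that fall outside the signal contribute zero
                y[n] += wk * sample(x, m + offsets[m])
    return y

def lcdc_via_deformed_input(x, w, offsets):
    """Deform the input once, then apply an ordinary correlation."""
    x_tilde = sample(x, np.arange(len(x)) + offsets)
    return np.correlate(x_tilde, w, mode="same")

rng = np.random.default_rng(0)
x = rng.normal(size=8)
w = rng.normal(size=3)
offsets = rng.uniform(-1.5, 1.5, size=8)  # one hypothetical offset per input position

y_a = lcdc_direct(x, w, offsets)
y_b = lcdc_via_deformed_input(x, w, offsets)
```

Both routes give the same output, which is why the offsets only need to be stored once per input position rather than once per kernel tap.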

We now show that LCDC can effectively model both appearance and motion information in a single network, as the difference $r^{(t)}$ is equivalent to motion information produced by optical flow.

Proposition 1.

Suppose that two inputs and are related through an optical flow, i.e.


where is a location and is the optical flow operation, and is assumed to be locally varying. Then the corresponding LCDC outputs with :

are consistent, i.e. , if and only if


With the connection of LCDC to standard convolution, under the assumption that , we have:

Substituting the LHS in the optical flow relation in Eq. (7), we obtain the following equivalent conditions :

(since is locally varying). ∎

The above result shows that by enforcing consistent output and sharing weights across frames, the learned deformed map encodes motion information, as in Eq. (8). Hence, we can effectively model both appearance and motion information in a single network with LCDC, instead of using two different streams.

Furthermore, LCDC is more memory-efficient, as going from $\Delta$ to $\tilde{\Delta}$ can reduce the memory cost by $K$ times. In practice, the dimensionality of the offsets in deformable convolution is $H \times W \times (G \times K_H \times K_W \times 2)$, where $H$ and $W$ are the height and width of inputs, $G$ is the number of deformable groups, $K_H$ and $K_W$ are the height and width of kernels, and 2 comes from the fact that the offsets are 2D vectors. However, the dimensionality of LCDC offsets is only $H \times W \times 2$. We also drop the number of deformable groups since we want to model one single type of motion between two time frames. Therefore, the reduction in this case is $G \times K_H \times K_W$ times. The parameter reduction scales with the number of deformable convolution layers that are used.

3.4 Spatio-temporal features

Figure 5: A more detailed view of our network architecture with the fusion module. Appearance information comes from the output of the last layer while motion information comes from aggregating $r^{(t)}$ from multiple layers. Outputs of the final fc layer can be flexibly used as the features for any long-temporal modeling networks.

To create the spatio-temporal feature, we further concatenate (across channel dimensions) the learned motion information $r^{(t)}$ (from multiple layers) with appearance features (output of the last layer). We illustrate this process in Fig. 5.

To model the fusion mechanism, we use two 3D convolutions followed by two fully connected layers. Each 3D convolution unit is followed by batch normalization, ReLU activation, and 3D max pooling to gradually reduce the temporal dimension (while the spatial dimension is retained). Outputs of the final fully connected layer can be flexibly used as the features for any long-temporal modeling networks, such as Temporal Convolution Networks (Dilated-TCN and ED-TCN) [10], Multi-Stream Bidirectional Recurrent Neural Networks (MSB-RNN) [17], or Temporal Deformable Residual Networks (TRN and TDRN) [12]. In this paper, we use ED-TCN [10] as our long-temporal model.

3.5 Learning

Let be the dataset, such that , where is a video snippet with corresponding frame-label label . and are height and width of a frame and is the number of frames per snippet. We write the loss as:


where is the weight decay regularization on the model parameters , is the predicted class labels for the video snippet , and is the indicator function.

4 Experimental results

4.1 Datasets

We evaluate our approach on two standard datasets, namely, 50 Salads dataset and GTEA dataset.

50 Salads Dataset [19]: This dataset contains 50 salad making videos from multiple sensors. We only used RGB videos in our work. Each video lasts from 5-10 minutes, containing multiple action instances. The dataset has different granularity levels: low, mid, high, and eval. We report experimental results for mid (17 action classes) and eval level (9 action classes) in order to be consistent with results reported in [10].

Georgia Tech Egocentric Activities (GTEA) [3]: Published in 2011, this dataset contains 28 videos of 7 action classes, performed by 4 subjects. The camera in this dataset is head-mounted, thus introducing more motion instability, compared to other datasets. Each video is about 1 minute long and has around 19 different actions on average.

4.2 Implementation details

In order to train our network, we used the common Momentum solver [13] (with momentum of 0.9) and followed the standard procedure of hyper-parameter search. We used ResNet50 as our backbone network with locally-consistent deformable convolutions (at layers conv5a, conv5b, and conv5c) as in [2]. The learning rate was decayed by a fixed rate at regular epoch intervals. In the fusion module, we used a spatial kernel of size 3 and stride of 1, and a temporal kernel of size 4 and stride of 2. We also used pooling size 2 and stride of 2 in 3D max pooling. The temporal dimension is collapsed with an average operation before being input to the last two fully connected layers. For training, we used a frame rate of 6 fps for 50 Salads and 15 fps for GTEA because of different action rates. Each video snippet contained 16 consecutive frames after sampling. Images were resized and augmented with random cropping and mean removal. All data splits followed the settings of [10]. For testing, we extracted features and then downsampled them so that the frame rates are comparable with other papers. LCDC features are incorporated into the ED-TCN [10] framework to learn action classifiers with long temporal dependency. No public implementation of TDRN [12] was available; as a result, we were unable to evaluate the combination of TDRN and LCDC.
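The snippet construction described above (subsample the video to a target frame rate, then group 16 consecutive sampled frames) could look roughly like the following sketch. The source frame rate, the snippet stride, and the function name are assumptions made for illustration; the text does not specify them.

```python
def snippet_indices(num_frames, source_fps, target_fps, snippet_len=16, stride=1):
    """Return lists of original-frame indices, one list per snippet."""
    step = source_fps / target_fps  # e.g. 30 / 6 = 5 source frames per sampled frame
    sampled = []
    i = 0
    while round(i * step) < num_frames:
        sampled.append(round(i * step))
        i += 1
    # Slide a window of snippet_len consecutive sampled frames over the video.
    return [sampled[s:s + snippet_len]
            for s in range(0, len(sampled) - snippet_len + 1, stride)]

# Hypothetical 10-second clip recorded at 30 fps, sampled at 6 fps as for 50 Salads.
snippets = snippet_indices(num_frames=300, source_fps=30, target_fps=6)
```

Each snippet here spans 16 sampled frames, i.e. roughly 2.7 seconds of video at 6 fps, which is the short-term window over which the features are computed.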

4.3 Results

We benchmark our approach using the three standard metrics reported in [10, 12]: frame-wise accuracy, segmental edit score, and F1 score with an overlap threshold of 10% (F1@10). Frame-wise accuracy is the most common and simplest metric, as it evaluates whether a frame is correctly classified or not. However, it does not consider the temporal structure of the output. Segmental edit score addresses this problem by penalizing over-segmentation. It evaluates the ordering of actions without regard to specific timings. The F1@k score, proposed in [10], also penalizes over-segmentation but ignores small time shifts between the prediction and ground-truth.
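The three metrics can be sketched as below. This follows the commonly used definitions (normalized Levenshtein distance over segment labels for the edit score, IoU matching of same-label segments for F1@k); the function names are hypothetical and the details may differ from the evaluation code used for the reported numbers.

```python
import numpy as np

def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs; end is exclusive."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs

def frame_accuracy(pred, gt):
    return 100.0 * np.mean(np.asarray(pred) == np.asarray(gt))

def edit_score(pred, gt):
    """100 * (1 - normalized Levenshtein distance) between segment label sequences."""
    p = [s[0] for s in segments(pred)]
    g = [s[0] for s in segments(gt)]
    d = np.zeros((len(p) + 1, len(g) + 1))
    d[:, 0] = np.arange(len(p) + 1)
    d[0, :] = np.arange(len(g) + 1)
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return 100.0 * (1.0 - d[-1, -1] / max(len(p), len(g)))

def f1_at_k(pred, gt, overlap=0.10):
    """A predicted segment counts as a true positive if its IoU with a not-yet-matched
    ground-truth segment of the same label reaches the overlap threshold."""
    p_segs, g_segs = segments(pred), segments(gt)
    used = [False] * len(g_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label or used[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j >= 0 and best_iou >= overlap:
            used[best_j] = True
            tp += 1
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)

gt = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
pred = [0, 0, 1, 0, 1, 1, 2, 2, 2, 2]  # two wrong frames that create two spurious segments
```

On this toy example the frame-wise accuracy stays high while the edit and F1 scores drop noticeably, which illustrates why the segmental metrics are reported alongside accuracy.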

Tab. 1 shows the results on the 50 Salads dataset for two granularity levels: mid and eval (referred to as higher level in [10]). Our performance numbers are averaged across five cross-validation splits. The results show that our approach achieves superior performance in all metrics. Our system outperforms ED-TCN by 8.6% on F1, 10.0% on Edit, and 9.3% on accuracy, for mid-level granularity. For eval-level, we also outperform ED-TCN by 6.4% on F1, 5.0% on Edit, and 5.8% on accuracy. Although we do not use TDRN for long-temporal modeling, our approach is still able to outperform it by 3.7% on F1, 3.8% on Edit, and 5.9% on accuracy, for mid-level granularity. We are unable to compare on eval-level as such results are not published for TDRN. Overall, our improvement over state-of-the-art systems is robust across several metrics.


Level  Method            F1@10  Edit  Acc
Mid    Spatial CNN [11]   32.3  24.8  54.9
Mid    ST-CNN [11]        55.9  45.9  59.4
Mid    Bi-LSTM [17]       62.6  55.6  55.7
Mid    Dilated TCN [10]   52.2  43.1  59.3
Mid    ED-TCN [10]        68.0  59.8  64.7
Mid    TRN [12]           70.2  63.7  66.9
Mid    TDRN [12]          72.9  66.0  68.1
Mid    LCDC               76.6  69.8  74.0
Eval   Spatial CNN [11]   35.0  25.5  68.0
Eval   ST-CNN [11]        67.1  52.8  71.3
Eval   Bi-LSTM [17]       72.2  67.7  70.9
Eval   Dilated TCN [10]   55.8  46.9  71.1
Eval   ED-TCN [10]        76.5  72.2  73.4
Eval   LCDC               82.9  77.2  79.2

Table 1: Results on the 50 Salads dataset (2 granularity levels).
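The margins quoted in the text can be re-derived from the mid-level rows of Table 1. The short sketch below does only that arithmetic; the dictionary and function names are illustrative.

```python
# Mid-level rows of Table 1: (F1@10, Edit, Acc).
table1_mid = {
    "ED-TCN": (68.0, 59.8, 64.7),
    "TDRN": (72.9, 66.0, 68.1),
    "LCDC": (76.6, 69.8, 74.0),
}

def margin(ours, baseline):
    """Absolute improvement per metric, rounded to one decimal like the table entries."""
    return tuple(round(o - b, 1) for o, b in zip(ours, baseline))

over_edtcn = margin(table1_mid["LCDC"], table1_mid["ED-TCN"])
over_tdrn = margin(table1_mid["LCDC"], table1_mid["TDRN"])
```

The largest margin over ED-TCN is on the Edit score and the largest margin over TDRN is on accuracy.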

We further show segmentation results of a test video from the 50 Salads dataset on mid-level granularity in Fig. 6. The first row is the ground-truth segmentation. The next four rows are results from different long-temporal models using the same input features: SVM, tCNN, Dilated-TCN, and ED-TCN. All of these segmentation results are directly retrieved from the provided features in [10], without any further training. The last row shows the segmentation results obtained by feeding LCDC features into ED-TCN. We do not compare against TDRN as we are unable to obtain TDRN features. Each row also comes with its respective accuracy on the right. The figure shows that LCDC achieves a 4.8% improvement over the original ED-TCN. We also achieve a higher accuracy on the temporal boundaries, i.e. the beginning and the end of an action instance are close to those of the ground-truth.

Tab. 2 shows the results on the GTEA dataset. The final results are averaged across four cross-validation splits. Our system outperforms the original ED-TCN by 4.3% on F1 and 0.5% on accuracy. Our Edit score is also higher than that of TDRN by 2.6%, although our F1 score and accuracy are not as high. However, since we use ED-TCN as the long-temporal model, our results are more comparable with the original ED-TCN than with TRN or TDRN. Since TDRN has been shown to outperform ED-TCN, we believe that using TDRN to model long-temporal dependency will provide further improvement in performance. However, we reiterate that we were unable to obtain a public TDRN implementation in order to achieve this.


Method            F1@10  Edit  Acc
EgoNet+TDD [18]       -     -  64.4
Spatial CNN [11]   41.8     -  54.1
ST-CNN [11]        58.7     -  60.6
Bi-LSTM [17]       66.5     -  55.5
Dilated TCN [10]   58.8     -  58.3
ED-TCN [10]        72.2     -  64.0
TRN [12]           77.4  72.2  67.8
TDRN [12]          79.2  74.1  70.1
LCDC               76.5  76.7  64.5

Table 2: Results on GTEA dataset.

Fig. 7 shows detailed segmentation result of a test video from GTEA dataset. We follow the same convention as in Fig. 6 to display the statistics of different models. The figure shows a strong improvement of LCDC over ED-TCN, being 9.2% in terms of frame-wise accuracy.

Figure 6: Comparison of segmentation results across different methods on a test video from 50 Salads dataset (mid-level).
Figure 7: Comparison of segmentation results across different methods on a test video from GTEA dataset.


Setup            Acc            Total params  Deform params
RawFeat          61.5           -             -
VanillaResnet50  69.1           30.9M         -
NaiveTemp        72.2           30.9M         -
2StrOptFlow      72.3           165.0M        -
DC               73.2 ± 0.313   45.7M         995.5K
LCDC             74.9 ± 0.306   42.7M         27.7K

Table 3: Ablation study on 50 Salads dataset, with Split 1 and mid-level granularity. We report feature accuracy (before feeding into ED-TCN) and number of parameters.

4.4 Ablation study

We performed an ablation study (Tab. 3) on Split 1 and mid-level granularity of the 50 Salads dataset. In the table, each row is a different setup, showing the frame-wise accuracy of the features before feeding into ED-TCN, the total number of parameters of the model used to produce the features, and the number of parameters related to deformable convolutions (wherever applicable). We next explain the different setups:

  (1) RawFeat: Raw features provided in [10] (input features of the original Dilated-TCN and ED-TCN) are used to train a linear classifier on top of the features.
  (2) VanillaResnet50: Frame-wise class prediction using ResNet50 (no temporal information involved in this setup).
  (3) NaiveTemp: Temporal information incorporated in a naive way. This model is the same as VanillaResnet50, except that we have multiple frames per video snippet. Temporal information is simply modeled by averaging features across time, then feeding to two fully connected layers with ReLU activations.
  (4) 2StrOptFlow: Temporal information is modeled by training a VGG-16 (with dense optical flow from multiple frames as input). This is followed by a linear fusion module which combines scores of this motion stream with NaiveTemp (appearance stream). Since all our other setups use a ResNet backbone and ED-TCN uses a VGG-like model for both streams, fusing a ResNet appearance stream and a VGG motion stream provides us a more direct comparison.
  (5) DC: Receptive fields of a deformable convolution network (with backbone ResNet50) are used to model motion, but without the local coherency constraint.
  (6) LCDC: The proposed LCDC model, which additionally enforces the local coherency constraint on receptive fields.

We also report the standard deviation of the mean for the last two setups.

Compared to RawFeat, VanillaResnet50 has a higher accuracy because the original ED-TCN uses a VGG-like model while VanillaResnet50 uses ResNet50 as the backbone network. The accuracy is further improved by 3.1% by using naive temporal modeling. Notice that the number of parameters does not change going from VanillaResnet50 to NaiveTemp, because the only difference is the number of frames being fed as input to the network. The accuracy of 2StrOptFlow (which uses optical flow as the motion stream and NaiveTemp as the appearance stream) is slightly higher than that of NaiveTemp. However, the number of parameters is significantly increased. The VGG network alone requires 134.1M parameters. Together with the appearance stream, the complexity of the model is increased to 165.0M. This prevents us from either using a larger batch size or jointly training the two streams.

DC, which directly uses adaptive receptive fields from the original deformable convolution, increases the accuracy to 73.2% with significantly fewer extra parameters (extra parameters for deformable convolution and fusion), only 14.8M more, compared against VanillaResnet50 or NaiveTemp. Enforcing the local coherency constraint in LCDC further improves the accuracy to 74.9% but with 3M fewer extra parameters (around 20.3% less) compared to DC. This complexity reduction is a consequence of the fact that LCDC uses fewer parameters for displacement offsets (Sec. 3.3). Moreover, if we consider only the parameters related to deformable convolutions, DC would require 36x more parameters than LCDC. In the current implementation, we use 3 deformable convolutions (with biases). Specifically, DC requires 331.8K parameters for the weights and 72 parameters for biases to create offsets in each deformable convolution. However, LCDC only needs 9.2K parameters for weights and 2 parameters for biases. The reduction of 36x matches our derivation in Sec. 3.3, where $G = 4$ and $K_H = K_W = 3$.
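The parameter counts quoted above can be reproduced with simple arithmetic. The sketch below assumes that each offset-predicting convolution has a 3x3 kernel over 512 input channels and that DC uses 4 deformable groups; these values are not stated explicitly in the text and were chosen because they reproduce the reported counts.

```python
def offset_conv_params(in_channels, k_h, k_w, out_channels):
    """Weights and biases of the convolution that predicts the offsets."""
    weights = in_channels * k_h * k_w * out_channels
    biases = out_channels
    return weights, biases

C_IN = 512             # assumed input channels of the offset-predicting convolution
G, K_H, K_W = 4, 3, 3  # assumed deformable groups and kernel size
N_LAYERS = 3           # number of deformable convolutions stated in the text

# DC predicts a 2D offset per group and kernel tap; LCDC predicts one 2D offset per location.
dc_w, dc_b = offset_conv_params(C_IN, K_H, K_W, G * K_H * K_W * 2)
lc_w, lc_b = offset_conv_params(C_IN, K_H, K_W, 2)

dc_total = N_LAYERS * (dc_w + dc_b)
lc_total = N_LAYERS * (lc_w + lc_b)
reduction = dc_w // lc_w
```

Under these assumptions the per-layer counts round to 331.8K and 9.2K weights with 72 and 2 biases, the totals round to 995.5K and 27.7K as in Tab. 3, and the ratio is 36.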

5 Conclusion

We introduce locally-consistent deformable convolution and create a single-stream network that can jointly learn spatio-temporal features by exploiting motion in adaptive receptive fields. The framework is significantly more compact than the two-stream alternative and can produce robust spatio-temporal features without using conventional motion extraction methods such as optical flow. Used in conjunction with a long-temporal model, our features outperform the original ED-TCN and the state-of-the-art TDRN. For future work, we plan to generalize our method to multiple resolutions in the feature space and unify long-temporal modeling into the framework.


References

  • [1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE, 2017.
  • [2] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 764–773, Oct 2017.
  • [3] A. Fathi, X. Ren, and J. M. Rehg. Learning to recognize objects in egocentric activities. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference On, pages 3281–3288. IEEE, 2011.
  • [4] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In Advances in Neural Information Processing Systems (NIPS), pages 3468–3476, 2016.
  • [5] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
  • [7] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In J.-M. Combes, A. Grossmann, and P. Tchamitchian, editors, Wavelets, pages 286–297, Berlin, Heidelberg, 1990. Springer Berlin Heidelberg.
  • [8] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, and W. Ouyang. T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2017.
  • [9] S. M. Kang and R. P. Wildes. Review of action recognition and detection methods. arXiv preprint arXiv:1610.06906, 2016.
  • [10] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. Temporal convolutional networks for action segmentation and detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [11] C. Lea, A. Reiter, R. Vidal, and G. D. Hager. Segmental spatiotemporal cnns for fine-grained action segmentation. In European Conference on Computer Vision (ECCV), pages 36–52. Springer, 2016.
  • [12] P. Lei and S. Todorovic. Temporal deformable residual networks for action segmentation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6742–6751, 2018.
  • [13] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
  • [14] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1194–1201. IEEE, 2012.
  • [15] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, pages 568–576, Cambridge, MA, USA, 2014. MIT Press.
  • [16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [17] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 1961–1970. IEEE, 2016.
  • [18] S. Singh, C. Arora, and C. Jawahar. First person action recognition using deep learned descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2620–2628, 2016.
  • [19] S. Stein and S. J. McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2013), Zurich, Switzerland. ACM, September 2013.
  • [20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.
  • [21] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4489–4497. IEEE, 2015.
  • [22] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), 2016.
  • [23] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017.