Learning Video Representations from Correspondence Proposals

05/20/2019, by Xingyu Liu et al.

Correspondences between frames encode rich information about dynamic content in videos. However, it is challenging to effectively capture and learn them due to their irregular structure and complex dynamics. In this paper, we propose a novel neural network that learns video representations by aggregating information from potential correspondences. This network, named CPNet, can learn evolving 2D fields with temporal consistency. In particular, it can effectively learn representations for videos by mixing appearance and long-range motion with an RGB-only input. We provide extensive ablation experiments to validate our model. CPNet shows stronger performance than existing methods on Kinetics and achieves state-of-the-art performance on Something-Something and Jester. We provide analysis of the behavior of our model and show its robustness to errors in proposals.


1 Introduction

Video modality can be viewed as a sequence of images evolving over time. A good model for learning video representations should be able to learn from both the static appearance of images and their dynamic change over time. The dynamic nature of video is described by temporal consistency: an object in one frame usually has its correspondence in other frames, and its semantic features are carried along the way. Analysis of these correspondences, either fine-grained or coarse-grained, can yield valuable information for video recognition, such as how objects move or how viewpoints change, which can further benefit high-level understanding tasks such as action recognition and action prediction.

Unlike static images, for which convolutional neural networks (CNNs) are the standard representation learning approach, the correspondence of objects in videos has an entirely different pattern and is more challenging to learn. For example, the corresponding objects can be arbitrarily far away, may deform or change their pose, or may not even exist in other frames. Previous methods rely on operations within a local neighborhood (e.g. convolution) or on global feature re-weighting (e.g. non-local means) for inter-frame relation reasoning and thus cannot effectively capture correspondence: stacking local operations for wider coverage is inefficient or insufficient for long-range correspondences, while global feature re-weighting fails to include positional information, which is crucial for correspondence.

In this paper, we present a novel method for learning video representations from correspondence proposals. Our intuition is that the corresponding objects of a given object in other frames typically occupy only a limited set of regions, so we should focus on those regions during learning. In practice, for each position (a pixel or a feature), we only consider the few other positions that are most likely to be its correspondence.

Figure 1: We view the video representation tensor as a point cloud of THW features. For each point (e.g. the purple point), its potentially corresponding points are its k nearest neighbors in the C-dimensional semantic space, restricted to other frames. Our CP module learns from and aggregates all these potential correspondences.

Figure 2: CP module architecture. Gray boxes denote tensors, white boxes denote operators and orange boxes denote neural networks with trainable weights. The dashed box represents the Correspondence Embedding layer, whose architecture is illustrated in detail in Figure 3.

The key of our approach is a novel neural network module for video recognition named the CP module. This module views the video representation tensor as a point cloud in semantic space. As illustrated in Figure 1, for every feature in the video representation, the module finds and groups its k nearest neighbors in other frames in the semantic space as potential correspondences. Each of the k feature pairs is processed identically and independently by a neural network, and max pooling is then applied to select the strongest response. The module effectively learns a set of functions that embeds and selects the most informative signals among the k pairs and encodes the reason for such selection. The output of the CP module is an encoded representation of correspondence, i.e. the dynamic component of the video, and can be used in subsequent parts of an end-to-end architecture and in other applications.

Ordered spatiotemporal location information is included in the CP module so that motion can be modelled. We integrate the proposed CP module into a CNN so that both the static appearance features and the dynamic motion features of videos are mixed and learned jointly. We name the resulting deep neural network CPNet. We constructed a toy dataset and show that, among the compared RGB-only video recognition architectures, CPNet is the only one that can effectively learn long-range motion. On real datasets, we show the robustness of the max pooling in the CP module: it filters out clearly wrong correspondence proposals and only selects embeddings from reasonable proposals.

We showcase CPNet in the application of video classification. We experimented with it on the action recognition dataset Kinetics [18] and compared it against existing methods, which it outperforms. It also achieves state-of-the-art results among published methods on the action-centric datasets Something-Something [12] and Jester [30] with fewer parameters. We expect that CPNet and the ideas behind it can benefit video applications and research in related domains.

2 Related Work

Representation Learning for Videos. Existing approaches to video representation learning can generally be categorized by how the dynamic component is modelled. The first family of approaches extracts a global feature vector for each frame with a shared CNN and uses recurrent neural nets to model temporal relations [6, 36]. Though recurrent architectures can efficiently capture temporal relations, they are harder to train and result in lower performance on the latest benchmarks. The second family of approaches learns dynamic changes from offline-estimated optical flow [27, 4] or online-learned optical flow [8] with a separate branch of the network. The optical flow branch may share the same architecture as the static appearance branch. Though the optical flow field bridges consecutive frames, the question of how to learn from multiple evolving 2D fields remains unanswered.

The third family of approaches uses a single-stream 3D CNN with RGB-only inputs and learns dynamic changes jointly and implicitly with static appearance [28, 3, 29, 17, 32, 40]. These architectures are usually built with local operations such as convolution and thus cannot learn long-range dependencies. To address this problem, the non-local neural network (NL Net) [33] was proposed. It adopts non-local operations where features are globally re-weighted by their pairwise feature distance. Our network consumes RGB-only inputs and explicitly computes correspondence proposals in a non-local fashion. Different from NL Net, our architecture focuses only on the top k correspondences and considers pairwise positional information, so it can effectively learn not only appearance but also dynamic motion features.

Figure 3: Correspondence Embedding layer architecture. The f's are semantic vectors of length C, the rows of the input feature tensor. g_i is the semantic vector of length C forming the i-th row of the output feature tensor. The output channel width is kept equal to C so that the output can be added back to the main-stream CNN. (t, h, w) are the normalized spatiotemporal locations.

Deep Learning on Unstructured Point Data. The pioneering work of PointNet [23] proposed a class of deep learning methods on unordered point sets. The core idea is a symmetric function constructed from shared-weight deep neural networks followed by element-wise max pooling. Due to the symmetry of the pooling, it is invariant to the order of the input points. This idea can also be applied to learning functions on generic orderless sets [37]. The follow-up work PointNet++ [24] extracts local features from point subsets within a Euclidean-space neighborhood and hierarchically aggregates features. Dynamic graph CNN [34] proposed a similar idea; the difference is that the neighborhood is determined in the semantic space and the neural network processes point pairs instead of individual points. Inspired by these works, we treat correspondence candidates as an unordered set. Through a shared-weight MLP and max pooling, our network learns informative representations of appearance and motion in videos.

Deep Learning for Correspondence and Relation Reasoning. Capturing relations is an essential task in computer vision and machine learning. A common approach to learning relations is to let extracted features interact through a designed or learned function and discover similarity from the output. This is the general idea behind previous works on stereo matching [31, 38] and flow estimation [7, 15, 21]. The learned relations can also be used later in learning high-level semantic information such as video relational reasoning [39] and visual question answering [26]. Compared to these works, we focus on learning video representations from long-range feature correspondences over time and space.

3 Learning Correspondence Proposals

Our proposed method is inspired by the following three properties of correspondences in videos:

1. Corresponding positions have similar visual or semantic features. This is the assumption underlying many computer vision tasks related to correspondence, such as image matching, relation reasoning or flow estimation.

2. Corresponding positions can span arbitrarily long ranges, spatially or temporally. In the case of fast motion or low frame rate, displacements along spatial dimensions can be large within small frame steps. Objects that disappear and then re-appear in videos across a long time can span arbitrarily long temporal range.

3. Potential correspondence positions in other frames make up only a small percentage of all positions. Given a pixel/feature, usually only a very small portion of the pixels/features in other frames can be its potential correspondence. Other, apparently dissimilar pixels/features can be safely ignored.

A good video representation model should address all three properties: it should be able to capture potential pixel/feature correspondence pairs at arbitrary locations and learn from those pairs. This poses huge challenges to the design of the deep architecture, since most deep learning methods work on regularly structured data. Inspired by recent work on deep learning on point clouds [23, 24, 34] and their motion [21], we develop an architecture that addresses the above three properties.

In this section, we first briefly review point cloud deep learning techniques and their theoretical foundations. Then we explain Correspondence Proposal (CP) module, the core of our architecture. Finally we describe how it is integrated into the entire deep neural network architecture.

3.1 Review of Point Cloud Deep Learning

Qi et al. [23] recently proposed PointNet, a neural network architecture for deep learning on point clouds. Its theoretical foundation was proven in [23]: given a point set domain X and any set function f : X → R that is continuous w.r.t. the Hausdorff distance, a symmetric function of the form

    g(x_1, ..., x_n) = γ(MAX{h(x_1), ..., h(x_n)})

can approximate f on X arbitrarily closely, where γ and h are continuous functions and MAX is the element-wise maximum operation. In practice, γ and h are instantiated as multi-layer perceptrons (MLPs), learnable functions with universal approximation potential. The symmetry of max pooling ensures that the output is invariant to the ordering of the points.

While PointNet was originally proposed to learn geometric representation for 3D point clouds, it has been shown that the MLP can take mixed types of modalities as input to learn other tasks. For example, the MLP can take learned geometric representation and displacement in 3D Euclidean space as input to estimate scene flow [21].
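To make the construction concrete, here is a minimal PyTorch sketch of such a symmetric set function: a shared per-point MLP h followed by element-wise max pooling and a second MLP γ. The layer widths and input dimension are placeholders, not the configuration used in [23].

```python
import torch
import torch.nn as nn

class SymmetricSetFunction(nn.Module):
    """gamma(MAX_i h(x_i)): invariant to the ordering of the n input points."""
    def __init__(self, in_dim=3, hidden_dim=64, out_dim=128):
        super().__init__()
        # h: shared per-point MLP, applied identically to every point
        self.h = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # gamma: MLP applied to the pooled, order-invariant feature
        self.gamma = nn.Sequential(
            nn.Linear(hidden_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, points):               # points: (batch, n, in_dim)
        per_point = self.h(points)           # (batch, n, hidden_dim)
        pooled, _ = per_point.max(dim=1)     # element-wise max over the n points
        return self.gamma(pooled)            # (batch, out_dim)

# permutation-invariance check
x = torch.randn(2, 100, 3)
f = SymmetricSetFunction()
perm = torch.randperm(100)
assert torch.allclose(f(x), f(x[:, perm]), atol=1e-5)
```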

3.2 CP Module

In this subsection, we explain the architecture of the CP module. As illustrated in Figure 2, the input and output of the CP module are both video representation tensors of shape THW × C, where T denotes the number of frames, H × W the spatial dimensions and C the number of channels. The CP module treats the input video tensor as a point cloud of THW features and accomplishes two tasks: 1) k-NN grouping; 2) correspondence embedding.

k-NN grouping. For each feature in the video representation tensor output by a CNN, the CP module selects its k most likely corresponding features in other frames. The selection is based solely on semantic similarity so that correspondences can span arbitrarily long spatiotemporal ranges. Features within the same frame are excluded because temporal consistency is defined between different frames.

The first step is to calculate the semantic similarity of all feature pairs. We use the negative L2 distance as our similarity metric. It can be computed efficiently with matrix multiplication and produces a matrix of shape THW × THW. The next step is to set the elements in the T diagonal block matrices of shape HW × HW to −∞; with this operation, features within the same frame are excluded from each other's potential correspondences. The final step is to apply an arg top-k operation along the row dimension of the matrix. It outputs an index tensor of shape THW × k, whose i-th row contains the indices of the k nearest neighbors of feature i. The workflow is shown in Figure 2.
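A minimal PyTorch sketch of this k-NN grouping step is given below. It assumes the features arrive as a (T, H, W, C) tensor and uses the negative squared L2 distance, which gives the same ranking as the negative L2 distance; the tensor layout and function name are illustrative, not taken from released code.

```python
import torch

def knn_grouping(feats, k):
    """Propose k correspondences per feature, excluding same-frame features.

    feats: (T, H, W, C) video representation tensor.
    Returns indices of shape (T*H*W, k) into the flattened point cloud.
    """
    T, H, W, C = feats.shape
    x = feats.reshape(T * H * W, C)                    # point cloud of THW features
    # negative squared L2 distance as similarity, via matrix multiplication
    sq_norm = (x ** 2).sum(dim=1, keepdim=True)        # (THW, 1)
    sim = -(sq_norm + sq_norm.t() - 2.0 * x @ x.t())   # (THW, THW)
    # exclude features from the same frame: set the T diagonal
    # (HW x HW) blocks to -inf
    frame_id = torch.arange(T).repeat_interleave(H * W)
    same_frame = frame_id[:, None] == frame_id[None, :]
    sim = sim.masked_fill(same_frame, float('-inf'))
    # arg top-k along each row: indices of the k nearest neighbours
    _, idx = sim.topk(k, dim=1)                        # (THW, k)
    return idx
```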

Correspondence Embedding layer. The goal of this layer is, for each feature, to learn a representation from its proposed correspondences. The feature's motion to its corresponding positions in other frames can be learned during this process. The top-1 correspondence candidate alone only gives information from one frame and cannot capture the full correspondence information of the entire video; besides, there may be more than one qualified correspondence candidate in a frame. We therefore use a larger k, process the k pairs identically and independently, and aggregate information from the k outputs. This is the general idea behind the Correspondence Embedding layer, the core of our CP module.

The Correspondence Embedding layer is located in the dashed box of Figure 2 and illustrated in detail in Figure 3. Suppose the spatiotemporal location and semantic vector of input feature i are (t_i, h_i, w_i) and f_i, and its j-th nearest neighbor is feature i_j with location (t_{i_j}, h_{i_j}, w_{i_j}) and semantic vector f_{i_j}, where j ∈ {1, ..., k}. For each of the k pairs, we pass the semantic vectors of the two features, i.e. f_i and f_{i_j}, and their relative spatiotemporal displacement, i.e. (t_{i_j} − t_i, h_{i_j} − h_i, w_{i_j} − w_i), to an MLP with shared weights. All three dimensions of the spatiotemporal locations are normalized to [0, 1] from their original ranges before being sent into the MLP. The k outputs are then aggregated by an element-wise max pooling operation to produce g_i, the semantic vector of output feature i. During this process, the most informative signals about correspondence, i.e. an entangled representation mixing the displacement and the two semantic vectors, are selected from the k pairs, and the output implicitly encodes motion information. Mathematically, the operation of the Correspondence Embedding layer can be written as

    g_i = MAX_{1 ≤ j ≤ k} { h(f_i, f_{i_j}, t_{i_j} − t_i, h_{i_j} − h_i, w_{i_j} − w_i) }    (1)

where h is the function computed by the MLP and MAX is element-wise max pooling.

There are other design choices for the Correspondence Embedding layer as well. For example, instead of sending both features directly to the MLP, one could first compute some distance between the two features. However, as discussed in [21], sending both features into the MLP is a more general form and yields better performance in motion learning.
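Below is a simplified PyTorch sketch of the Correspondence Embedding layer following Equation (1): for each feature it gathers its k proposals, concatenates the two semantic vectors with the normalized displacement, applies a shared MLP and max-pools over the k pairs. The batch dimension is omitted and the layer widths are placeholders rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class CorrespondenceEmbedding(nn.Module):
    """Embed each feature's k proposed correspondence pairs with a shared MLP
    and aggregate them by element-wise max pooling (Eq. (1))."""
    def __init__(self, channels, hidden):
        super().__init__()
        # input per pair: f_i (C) + f_ij (C) + normalized (dt, dh, dw)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, channels),
        )

    def forward(self, feats, idx):
        # feats: (T, H, W, C); idx: (THW, k) correspondence proposals
        T, H, W, C = feats.shape
        x = feats.reshape(T * H * W, C)
        k = idx.shape[1]
        # normalized spatiotemporal locations of every feature, in [0, 1]
        t, h, w = torch.meshgrid(
            torch.linspace(0, 1, T), torch.linspace(0, 1, H),
            torch.linspace(0, 1, W), indexing='ij')
        loc = torch.stack([t, h, w], dim=-1).reshape(T * H * W, 3).to(feats.device)
        f_i = x[:, None, :].expand(-1, k, -1)         # (THW, k, C)
        f_j = x[idx]                                   # (THW, k, C)
        disp = loc[idx] - loc[:, None, :]              # (THW, k, 3)
        pair = torch.cat([f_i, f_j, disp], dim=-1)     # (THW, k, 2C+3)
        g, _ = self.mlp(pair).max(dim=1)               # max pool over the k pairs
        return g.reshape(T, H, W, C)                   # added back to the CNN stream
```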

3.3 Overall Network Architecture

Our CP modules are inserted into a CNN architecture and interleaved with convolution layers, which enables the static image features extracted by the convolution layers and the correspondence signals extracted by the CP modules to be mixed and learned jointly. Specifically, the CP modules are inserted into ResNet [14] architectures, located right after a residual block but before the ReLU. We initialize the convolution part of our architecture with a pre-trained ImageNet model. The MLPs in the CP modules are randomly initialized with MSRA initialization [13], except for the gamma parameter of the last batch normalization layer [16], which is initialized to all zeros. This ensures an identity mapping at the start of training so that the pre-trained model can be used.
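The sketch below illustrates this insertion scheme: the CP branch is attached after a residual block and before the ReLU, and ends in a batch-normalization layer whose gamma is zero-initialized so that the network initially behaves like the pre-trained C2D model. The `cp_module` interface (same input and output shape, NCTHW layout) is an assumption of this sketch, not the released implementation.

```python
import torch.nn as nn

class CPResidualGroupTail(nn.Module):
    """Attach a CP module after a residual block, before the ReLU. The CP
    branch ends in a BatchNorm whose gamma is zero-initialized, so the whole
    network starts out identical to the pre-trained C2D model."""
    def __init__(self, residual_block, cp_module, channels):
        super().__init__()
        self.block = residual_block      # pre-trained block, with its final ReLU removed
        self.cp = cp_module              # assumed to map (N, C, T, H, W) -> same shape
        self.last_bn = nn.BatchNorm3d(channels)
        nn.init.zeros_(self.last_bn.weight)   # gamma = 0 -> CP branch starts as zero
        nn.init.zeros_(self.last_bn.bias)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.block(x)                                # residual block output
        return self.relu(y + self.last_bn(self.cp(y)))   # CP output added before ReLU
```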

In this paper, we only explore CP modules whose nearest neighbors are determined in the semantic space and restricted to other frames. In general, however, the nearest neighbors of a CP module can be determined in other metric spaces as well, such as a temporal-only space, the spatiotemporal space or a joint spatiotemporal-semantic space. We call a convolutional architecture with such generic CP modules inserted a CPNet.

model        I3D NL Net [33]   ARTNet [32]   TRN [39]   C2D CPNet (ours)
train acc.        27.8             26.8         27.1          97.9
val acc.          26.4             25.9         26.9          97.4

Table 1: Architectures and results for the toy experiment. Each model stacks two convolution layers with 16 output channels followed by pooling and a fully-connected classifier; I3D NL Net inserts an NL block and CPNet inserts a CP module between the two convolutions, ARTNet replaces a plain convolution with a SMART block, and TRN adds a temporal relation module before its fully-connected layer.
Figure 4: An “up” example in our toy dataset.

4 A Failing of Several Previous Methods

We constructed a toy video dataset where previous RGB-only methods fail in learning long-range motion. Through this extremely simple dataset, we show the drawbacks of previous methods and the advantage of our architecture.

The dataset consists of videos of a white square moving on a black canvas. Each video has 4 frames. There are four labels for the moving direction of the square: “left”, “right”, “up” and “down”. The square's moving distance per step is random between 7 and 9 pixels. The dataset has 1000 training and 200 validation videos, both with an equal number of videos per label. Figure 4 shows an example video from our dataset.
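The toy data can be generated along the following lines; the canvas size, square size and start-position logic are illustrative assumptions, since the source only specifies 4 frames, a per-step displacement of 7–9 pixels, the four direction labels and the train/validation split sizes.

```python
import numpy as np

DIRECTIONS = {'left': (0, -1), 'right': (0, 1), 'up': (-1, 0), 'down': (1, 0)}

def make_clip(label, canvas=32, square=4, frames=4, rng=np.random):
    """One toy video: a white square moving in one direction on a black canvas."""
    dy, dx = DIRECTIONS[label]
    step = rng.randint(7, 10)                  # moving distance per step: 7-9 pixels
    # start position chosen so the square stays inside the canvas for all frames
    travel = step * (frames - 1)
    y0 = rng.randint(max(0, -dy * travel), canvas - square - max(0, dy * travel) + 1)
    x0 = rng.randint(max(0, -dx * travel), canvas - square - max(0, dx * travel) + 1)
    clip = np.zeros((frames, canvas, canvas), dtype=np.float32)
    for f in range(frames):
        y, x = y0 + dy * step * f, x0 + dx * step * f
        clip[f, y:y + square, x:x + square] = 1.0
    return clip

# 1000 training clips, balanced over the four direction labels
videos = [make_clip(lbl) for lbl in DIRECTIONS for _ in range(250)]
```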

We inserted the core modules of several previous RGB-only deep architectures for video recognition, i.e. I3D NL Net [33], ARTNet [32] and TRN [39], as well as our CPNet, into a toy CNN with two convolution layers. We list the architectures used in this experiment in Table 1. The convolution parts of all models have small spatial receptive fields. The dataset-model settings are designed to simulate long-range motion situations where stacking convolution layers to increase the receptive field is insufficient or inefficient. No data augmentation is used.

The training and validation results are also listed in Table 1. Our model can overfit the toy dataset, while the other models produce essentially random guesses and fail to learn the motion. It is easy to understand why ARTNet and TRN fail: their convolutional receptive fields are too small to cover the step size of the square's motion. However, it is intriguing that NL Net, which should have a global receptive field, also fails.

We provide an explanation as follows. Though the toy NL Net gets around the problem of insufficient convolutional receptive fields, its NL block fails to include positional information and thus cannot learn long-range motion. Moreover, it is not straightforward to add pairwise positional information to the NL block without increasing the memory and computation cost to an intractable amount. Through this experiment, we show another advantage of our CPNet: by focusing only on the top k potential correspondences, memory and computation are reduced significantly, which allows positional information and semantic features to be learned together by a more complicated function such as a neural network.

5 Experiment Results

To validate the choice of our architecture for data in the wild, we first did a sequence of ablation studies on Kinetics dataset [18]. Then we re-implemented several recently published and relevant architectures with the same dataset and experiment settings to produce results as good as we can and compare with our results. Next, we experiment with very large models and compare with the state-of-the-art methods on Kinetics validation set. Finally, we did experiments on action-centric datasets Something-something v2 [12] and Jester v1 [30] and report our results on both validation and testing sets. Visualizations are also provided to help the understanding of our architecture.

Table 2: Architectures used in the Kinetics experiments of Table 3(d). Both the C2D baseline and our CPNet (with 6 CP modules) start from a conv1 layer with 64 channels and stride 2, 2, followed by residual groups res2–res5 and a global average pooling + fc head; the complete per-group configurations are given in the supplementary Table 6.

5.1 Ablation Studies

Kinetics [18] is one of the largest well-labelled datasets for human action recognition from videos in the wild. Its classification task involves 400 action classes. It contains around 246,000 training videos and 20,000 validation videos. We used C2D ResNet-18 as the backbone for all ablation experiments. The architectures we used are derived from the last column of Table 2, and we include the C2D baseline for comparison. We downsampled the videos to 1/12 of the original frame rate and used only 8 frames per clip. This ensures that the clip duration is long enough to cover a complete action while still maintaining fast experiment iteration. The single-clip, single-center-crop validation results are shown in Table 3(a), (b) and (c).

Ablation on the Number of CP modules. We explored the effect of the number of CP modules on accuracy. We experimented with adding one or two CP modules to a single residual group, two CP modules to each of two residual groups, and two CP modules to each of three residual groups. The results are shown in Table 3(a). As the number of CP modules increases, the accuracy gain is consistent.

Ablation on k. We explored combinations of the k values used at training and testing time and compare the results in Table 3(b). The highest validation accuracy is achieved when the training-time and testing-time k are the same. This suggests that using a different k forces the architecture to learn a different distribution, and the highest accuracy is achieved only when the training and testing distributions are similar.

We also notice that the highest accuracy is achieved at a sweet spot of intermediate k. An explanation is that when k is too small, the CP module cannot get enough correspondence candidates to select from; when k is too large, clearly unrelated elements are also included and introduce noise.

Ablation on the position of CP modules. We explored the effect of the position of the CP modules. We added two CP modules to each of three different residual groups, respectively. The results are shown in Table 3(c). The highest accuracy is achieved when the two CP modules are added to the middle of the three groups. A possible explanation is that the earliest group does not contain enough semantic information for finding correct k-NN, while the spatial resolution of the latest group is too low.

(a) Number of CP modules
model   top-1   top-5
C2D     56.9    79.5
1 CP    60.3    82.4
2 CPs   60.4    82.4
4 CPs   61.0    83.1
6 CPs   61.1    83.1

(b) Ablation on the CP module's k values used at training time (rows) and testing time (columns); entries are top-1/top-5 accuracy (the specific k labels were elided in the source).
59.9/82.3  59.2/81.6  56.6/79.4  52.5/76.1  49.0/72.6  44.6/58.5
59.1/81.8  60.2/82.5  59.6/81.8  56.9/80.1  53.0/77.1  48.9/73.5
59.0/81.2  60.2/82.4  60.5/82.6  59.0/81.7  55.3/79.2  49.2/73.5
53.4/76.3  56.8/79.5  59.6/81.9  60.7/82.8  59.7/82.1  57.0/80.3
51.3/75.1  53.8/77.3  56.8/79.7  59.8/82.1  60.6/82.8  59.2/81.8
52.6/76.6  53.8/77.7  55.5/79.1  58.2/80.8  60.0/82.2  60.4/82.4

(c) CP module positions (two CP modules added to one of three successive residual groups)
model           top-1   top-5
C2D             56.9    79.5
earliest group  60.4    82.4
middle group    60.8    82.8
latest group    59.2    81.6

(d) Kinetics validation accuracy of the architectures in Table 2 (clip length is 8 frames)
                      1/12 of original frame rate          1/4 of original frame rate
                      1-clip, 1 crop   25-clip, 10 crops   1-clip, 1 crop   25-clip, 10 crops
model                 top-1   top-5    top-1   top-5       top-1   top-5    top-1   top-5
C2D                   56.9    79.5     61.3    83.6        54.1    77.4     60.8    83.3
C3D [28]              58.3    80.7     64.4    85.8        55.0    78.5     63.3    85.2
NL C2D Net [33]       58.6    81.3     63.3    85.1        55.3    78.6     62.1    84.2
ARTNet [32]           59.1    81.1     65.1    86.1        56.1    78.7     64.2    85.6
CPNet (ours)          61.1    83.1     66.3    87.1        57.2    80.8     64.9    86.5

Table 3: Kinetics dataset results for the ablations and the comparison with prior work. Top-1/top-5 accuracies are shown.

5.2 Comparison with Other Architectures

We compare our architecture with the C2D/C3D baselines, C2D NL Net [33] and ARTNet [32] on Kinetics. We ran two sets of experiments, with frame-rate downsampling ratios of 12 and 4, respectively; both used 8 frames per clip. These settings allow us to compare performance under both low and high frame rates. The architectures used in these experiments are shown in Table 2. We experimented with two inference methods: 25-clip, 10-crop inference with averaged softmax scores as in [32], and single-clip, single-center-crop inference. The results are shown in Table 3(d).

Our architecture outperforms C2D/C3D baselines by a significant margin, which proves the efficacy of CP module. It also outperforms NL Net and ARTNet given fewer parameters, further showing the superiority of our CPNet.

5.3 Large Models on Kinetics

model params (M) top-1 top-5
I3D Inception [3] 25.0 72.1 90.3
Inception-ResNet-v2 [2] 50.9 73.0 90.9
NL C2D ResNet-101 [33] 48.2 75.1 91.7
CPNet C2D ResNet-101 (ours) 42.1 75.3 92.4
Table 4: Large RGB-only models on Kinetics validation accuracy. Clip length for NL Net and our CPNet is 32 frames.

We train a large model with C2D ResNet-101 as the backbone. We applied three phases of training in which we progressively increased the number of frames per clip from 8 to 16 and then to 32. We freeze the batch normalization layers starting from the second phase. During inference, we use 10 clips along the time dimension and 3-crop spatially fully-convolutional inference. The results are shown in Table 4.

Compared with large models of several previous RGB-only architectures, our CPNet achieves higher accuracy with fewer parameters. We point out that Kinetics is an appearance-centric dataset where static appearance information dominates the classification. We will show below that our CPNet has a larger advantage on action-centric datasets where the dynamic component is more important.

(a) Something-Something v2 results
model                       params (M)   val top-1   val top-5   test top-1   test top-5
Goyal et al. [12]             22.2         51.33       80.46       50.76        80.77
MultiScale TRN [39]           22.8         48.80       77.64       50.85        79.33
Two-stream TRN [39]           46.4         55.52       83.06       56.24        83.15
C2D Res18 baseline            10.7         35.24       64.49       -            -
C2D Res34 baseline            20.3         39.64       69.61       -            -
CPNet Res18, 5 CP (ours)      11.3         54.08       82.10       53.31        81.00
CPNet Res34, 5 CP (ours)      21.0         57.65       83.95       57.57        84.26

(b) Jester v1 results
model                       params (M)   val     test
BesNet [11]                   37.8         -       94.23
MultiScale TRN [39]           22.8         95.31   94.78
TPRN [35]                     22.0         95.40   95.34
MFNet [20]                    41.1         96.68   96.22
MFF [19]                      43.4         96.33   96.28
C2D Res34 baseline            20.3         84.73   -
CPNet Res34, 5 CP (ours)      21.0         96.70   96.56

Table 5: TwentyBN dataset results. Our CPNet outperforms all published results with fewer parameters.

5.4 Results on Something-Something

Something-Something [12] is a recently released dataset for recognizing human-object interaction from video. It has 220,847 videos in 174 categories. This challenging dataset is action-centric and especially suitable for evaluating the recognition of motion components in videos. For example, its categories take the form of “Pushing something from left to right”, so solely recognizing the object does not guarantee correct classification.

We trained two CPNet models with ResNet-18 and ResNet-34 C2D backbones, respectively. We applied two phases of training in which we increased the number of frames per clip from 12 to 24, and we froze the batch normalization layers in the second phase. The clip length is kept at 2 s (there is room for accuracy improvement when using 48 frames). During inference, we use 6-crop spatially fully-convolutional inference. We sample 16 clips evenly along the temporal dimension from a full-length video and average the softmax scores over clips. The results are listed in Table 5(a).
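The clip-level score averaging used at inference time can be sketched as follows; the cropping logic is omitted and `model_softmax_fn` is a placeholder for the network followed by a softmax, not an interface from the released code.

```python
import numpy as np

def evenly_sampled_clip_starts(num_video_frames, clip_len, num_clips=16):
    """Start indices of clips sampled evenly over the full-length video."""
    starts = np.linspace(0, max(num_video_frames - clip_len, 0), num_clips)
    return starts.round().astype(int)

def video_prediction(model_softmax_fn, video, clip_len=24, num_clips=16):
    """Average the per-clip softmax scores to obtain the video-level prediction."""
    scores = []
    for s in evenly_sampled_clip_starts(len(video), clip_len, num_clips):
        clip = video[s:s + clip_len]
        scores.append(model_softmax_fn(clip))    # (num_classes,) softmax scores
    return np.mean(scores, axis=0).argmax()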

Our CPNet model with a ResNet-34 backbone achieves state-of-the-art results in both validation and testing accuracy. With less than half the model size, it beats Two-stream TRN [39] by more than 2% in validation accuracy and more than 1% in testing accuracy. Our CPNet model with a ResNet-18 backbone also achieves competitive results: with fewer than half the parameters, it beats MultiScale TRN [39] by more than 5% in validation accuracy and more than 2% in testing accuracy. Besides, we also show the effect of the CP modules by comparing against the respective ResNet C2D baselines. Although the parameter increase due to CP modules is tiny, the validation accuracy gain is significant (14%).

5.5 Results on Jester

Jester [30] is a dataset for recognizing hand gestures from video. It has 148,092 videos in 27 categories. This dataset is also action-centric and especially suitable for evaluating the recognition of motion components in videos. One example category is “Turning Hand Clockwise”: solely recognizing the static gesture does not guarantee correct classification. We used the same CPNet with a ResNet-34 C2D backbone and the same training strategy as in subsection 5.4. During inference, we use 6-crop spatially fully-convolutional inference. We sample 8 clips evenly along the temporal dimension from a full-length video and average the softmax scores over clips. The results are listed in Table 5(b).

Our CPNet model outperforms all published results in both validation and testing accuracy, while having the smallest parameter count. The effect of the CP modules is again shown by comparing against the ResNet-34 C2D baseline: although the parameter increase due to CP modules is tiny, the validation accuracy gain is significant (12%).

(a) A video clip with label “playing basketball” from Kinetics validation set.
(b) A video clip with label “Rolling something on a flat surface” from Something-Something v2 validation set.
(c) A video clip with label “Thumb Up” from Jester v1 validation set.
Figure 5: Visualization of our final models. The starting point of each arrow is located at a feature i. Arrows point to the proposed correspondences (k-NN) of feature i. Proposed correspondences whose indices are in the set S_i defined in Equation (2) are marked by red arrows, the others by blue arrows. Feature changes after going through the CP module are shown as heatmaps.

5.6 Visualization

To understand the behavior of CP module and demystify why it works, we provide visualization in three aspects with the datasets used in previous experiments as follows.

What correspondences are proposed? We are interested in whether the CP module learns to propose reasonable correspondences purely based on semantic feature similarity. As illustrated in Figure 5, the CP module generally finds the majority of reasonable correspondences. Because k is a fixed hyperparameter, the k-NN in semantic space may also include wrong correspondences.

Which of the proposed correspondences activate output neurons? We are curious about the CP module's robustness to wrong proposals. We trace which of the k proposed correspondence pairs affect the values of the output neurons after max pooling. Mathematically, let h^(c) and g_i^(c) be the c-th dimension of h and g_i from Equation (1), respectively; we are interested in the set

    S_i = { j : ∃ c such that g_i^(c) = h^(c)(f_i, f_{i_j}, t_{i_j} − t_i, h_{i_j} − h_i, w_{i_j} − w_i) }    (2)

associated with a feature i. Here, j not being in S_i means that pair (i, i_j) is entirely overwhelmed by the other proposed correspondence pairs and is thus filtered out by max pooling when calculating output feature g_i. We illustrate S_i for several selected features in Figure 5 and show that the CP module is robust to incorrectly proposed correspondences.
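In practice this set can be traced directly from the pooling operation: the proposals that achieve the channel-wise maximum are exactly those in the set of Equation (2). A small sketch, assuming `pair_embeddings` holds the (THW, k, C) per-pair MLP outputs from the Correspondence Embedding layer:

```python
import torch

def contributing_proposals(pair_embeddings):
    """pair_embeddings: (THW, k, C) per-pair MLP outputs before max pooling.
    Returns a list of sets: for each feature i, the proposal indices j that
    win the max for at least one channel (the set S_i of Equation (2))."""
    winners = pair_embeddings.argmax(dim=1)     # (THW, C): winning proposal per channel
    return [set(row.tolist()) for row in winners]

# toy example: 4 features, k = 8 proposals, C = 64 channels
emb = torch.randn(4, 8, 64)
print([len(s) for s in contributing_proposals(emb)])
```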

How does the semantic feature map change? Figure 5 shows, for each frame, a heatmap of how much the semantic feature map changes after going through the CP module. We find that CP modules make larger changes to features that correspond to moving pixels. Moreover, CP modules at later stages focus more on the moving parts that carry the specific semantic information that helps the final classification.

6 Discussion

6.1 Relation to Other Single-stream Architectures

Note that since the MLPs in CP modules can potentially learn to approximate any continuous set function, CPNet can be seen as a generalization of several previous RGB-only architectures for video recognition.

CPNet can be reduced to a C3D [28] with a fixed kernel size, if we set k of the CP modules to the size of the corresponding spatiotemporal neighborhood, determine the nearest neighbors by spatiotemporal distance, and let the MLP learn to compute the inner-product operation within the neighborhood.

CPNet can also be reduced to an NL Net [33], if we set k of the CP modules to its maximum and let the MLP learn to perform the same distance and normalization functions as the NL block.

CPNet can also be reduced to a TRN [39], if we place a single CP module at the end of the C2D, determine the nearest neighbors in temporal-only space, and let the MLP learn to perform the same relation functions defined in [39].

6.2 Pixel-level Motion vs. Feature-level Motion

In two-stream architectures, motion at the pixel level, i.e. optical flow fields, is first estimated before being sent into deep networks. In contrast, CP modules capture motion at the semantic feature level. We point out that, although CP modules process positional information at a lower spatial resolution, detailed motion features can still be captured, since the semantic features already encode rich information within their receptive fields [22].

In fact, migrating positional reasoning from the original input data to the semantic representation has contributed to several successes in computer vision research. For example, in object detection, moving the input and/or output of the ROI proposal from the original image to the pooled representation tensor is at the core of the progress from RCNN [10] to Fast-RCNN [9] and Faster-RCNN [25]; in flow estimation, successful architectures also compute displacements within feature representations [7, 15].

7 Conclusion

In this paper, we presented a novel neural network architecture for learning video representations. We propose a new CP module that computes correspondence proposals for each feature and feeds each proposed pair to a shared neural network followed by max pooling to produce a new feature tensor. We show that the module can effectively capture motion and correspondence information in videos. The proposed CP module can be integrated with most existing frame-based or clip-based video architectures. We show that our architecture achieves strong performance on standard video recognition benchmarks. As future work, we plan to investigate this architecture for problems beyond video classification.

References

Supplementary

Appendix A Overview

In this document, we provide more details to the main paper and show extra results on per-class accuracy and visualizations.

In section B, we provide more details on the Kinetics/ResNet-18 ablation experiments (main paper section 5.1). In section C, we provide more details on the baseline architectures in the Kinetics/ResNet-18 comparison experiments (main paper section 5.2). In section D, we provide details on the CPNet architecture used in the Kinetics/ResNet-101 experiment (main paper section 5.3). In section E, we provide details on the architecture used in the Something-Something and Jester experiments (main paper sections 5.4 and 5.5). In section F, we report the per-class accuracy of the C2D baselines and our CPNet models on the Something-Something and Jester datasets. Lastly, in section G we provide the time complexity of our model, and in section H we provide more visualization results on all three datasets.

Appendix B CPNet Architecture in Kinetics/ResNet-18 Experiments

Our CPNet is instantiated by adding a CP module after the last convolution layer of a residual group but before the ReLU, as illustrated in Figure 6. For the Kinetics/ResNet-18 experiments in main paper sections 5.1 and 5.2, each CP module has an MLP with two hidden layers whose widths are set in proportion to the number of channels C of the CP module's input tensor. The number of nearest neighbors k is set to 8 for the results in Table 3(a), (c) and (d) of the main paper; k varies for the results in Table 3(b). The locations of the CP modules can be deduced from the last column of Table 6 for the different experiments in section 5.1 of the main paper.

Appendix C Baseline Architectures in Kinetics/ResNet-18 Comparison Experiment

In Table 6, we list all the architectures used in the Kinetics/ResNet-18 comparison experiments, as a supplement to Table 2 of the main paper. C2D/C3D are vanilla 2D and 3D CNNs. ARTNet is taken directly from [32]; it was designed to have the same number of parameters as its C3D counterpart. The NL Net model is adapted from [33] by adding an NL block at the end of each residual group of C2D ResNet-18. CPNet is instantiated in the same way as illustrated in Figure 6. Combined with the results in Table 3(d) of the main paper, our CPNet outperforms NL Net and ARTNet in validation accuracy with fewer parameters, showing its superiority.

Figure 6: CP module inserted into a residual group of ResNet-18 backbone.
Figure 7: CP module inserted into a residual group of ResNet-101 backbone.
Table 6: Complete architectures used in the Kinetics comparison experiments. Columns: layer, output size, C2D (C3D), ARTNet [32], NL C2D Net with 6 NL blocks [33], CPNet (ours) with 6 CP modules. All models start from a conv1 layer with 64 channels and stride 2, 2 (ARTNet uses a SMART block in conv1), followed by residual groups res2–res5 and a global average pooling + fc head. Parameter counts (M): 10.84 (31.81 for C3D), 31.81, 10.88, 10.86.
Table 7: CPNet architecture used in the Kinetics large-model experiments (ResNet-101 backbone, 5 CP modules): conv1 with 64 channels and stride 2, 2, followed by residual groups res2–res5 and a global average pooling + fc head.

Appendix D CPNet Architecture in Kinetics/ResNet-101 Experiment

We list the CPNet architecture used in the Kinetics/ResNet-101 experiment in Table 7. Each residual unit in ResNet-101 has three convolution layers. Our CPNet is instantiated by adding a CP module after the last convolution layer of a residual group but before the ReLU, as illustrated in Figure 7. The widths of the hidden layers in the MLPs are set in proportion to the number of channels C of the CP module's input tensor. The number of nearest neighbors k is set to 4.

We used five CP modules in the architecture. Two CP modules are placed in residual groups at a higher spatial resolution and the remaining three in residual groups at a lower spatial resolution. Such mixed usage of CP modules at residual groups of different spatial resolutions enables correspondence and motion at different semantic levels to be learned jointly. We only list the case of 8-frame input; for 32-frame input, the temporal dimension of the output sizes in the second column of Table 7 changes from 8 to 32 accordingly.

Appendix E Architecture used in Something-Something and Jester Experiments

We list the CPNet architectures used in the Something-Something [12] and Jester [30] experiments in Table 8. CPNet is instantiated in the same way as illustrated in Figure 6. The widths of the hidden layers in the MLPs are set in proportion to the number of channels C of the CP module's input tensor. The number of nearest neighbors k is set to 12.

We used five CP modules in the architecture. Two CP modules are placed in residual groups at a higher spatial resolution and the remaining three in residual groups at a lower spatial resolution. We only list the case of 12-frame input; for 24- or 48-frame input, the temporal dimension of the output sizes in the second column of Table 8 changes from 12 to 24 or 48 accordingly.

Table 8: CPNet architecture used in the Something-Something and Jester experiments (5 CP modules): conv1 with 64 channels and stride 2, 2, followed by residual groups res2–res5 and a global average pooling + fc head (one fc per dataset, matching the 174 Something-Something or 27 Jester categories).

Appendix F Per-class accuracy of Something-Something and Jester models

To understand the effect of the CP module on the final performance, we provide CPNet's per-class top-1 accuracy gain compared with the respective C2D baseline on Jester in Figure 8 and on Something-Something in Figure 10.

Figure 8: Per-class top-1 accuracy gain in percentage on Jester v1 dataset due to CP module.

We can see that categories that rely strongly on motion (especially long-range motion) typically show large accuracy improvements after adding CP modules. On the other hand, categories that do not require reasoning about motion to classify have little or negative gain in accuracy. These results coincide with our intuition that the CP module effectively captures the dynamic content of videos.

On the Jester dataset [30], the largest accuracy improvements are achieved on categories that involve long-range spatial motion, such as “Sliding Two Fingers Up”, or long-range temporal relations, such as “Stop Sign”. At the same time, categories that do not even need multiple frames to classify, such as “Thumb Up” or “Thumb Down”, have the smallest accuracy gains.

On the Something-Something dataset [12], the largest accuracy improvements are achieved on categories that involve long-range spatial motion, such as “Moving away from something with your camera”, or long-range temporal relations, such as “Lifting up one end of something without letting it drop down”. At the same time, categories that do not even need multiple frames to classify, such as “Showing a photo of something to the camera”, have the smallest or even negative accuracy gains.

Appendix G Model Run Time

In this section, we provide time complexity results for our model. Our CP module can be very efficient in terms of computation and memory, for both training and inference.

During training, NL Net [33] computes a THW × THW attention matrix followed by a row-wise softmax. The whole process is differentiable and all intermediate values have to be stored for computing gradients during back-propagation, which causes huge overhead in memory and computation. Unlike NL Net, our CP module's computation of the THW × THW similarity matrix only produces integer indices, which is non-differentiable. Thus CPNet neither computes gradients for nor stores the intermediate values of the matrix, a huge saving compared to NL Net and all other works involving global attention.
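A minimal sketch of this point, with the similarity computation wrapped in `torch.no_grad()` so that none of its intermediate values are kept for back-propagation; `similarity_fn` is a placeholder for the masked pairwise-similarity computation described in Section 3.2, not an API from the released code.

```python
import torch

def propose_indices(similarity_fn, feats, k):
    # The THW x THW similarity matrix is only needed to produce integer
    # indices, so it is computed under no_grad(): nothing from it is stored
    # for back-propagation, and it can be discarded right after top-k.
    with torch.no_grad():
        sim = similarity_fn(feats)        # (THW, THW), same-frame entries masked
        _, idx = sim.topk(k, dim=1)       # integer indices: non-differentiable
    # gradients later flow through the gathering of features in the
    # Correspondence Embedding layer, just not through this index computation
    return idx
```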

During inference, our CPNet is also efficient. We evaluate the inference time of the CPNet model used in the Jester v1 experiment, with a ResNet-34 backbone. The computing platform is an NVIDIA GTX 1080 Ti GPU with TensorFlow and cuDNN. The model performance with various batch sizes and frame lengths is illustrated in Figure 9. With a batch size of 1, CPNet reaches a processing speed of 10.1 videos/s for a frame length of 8 and 3.9 videos/s for a frame length of 32. The number of videos that can be processed in a given time also increases as the batch size increases.

Figure 9: Model run time (solid line) and number of video sequences per second (dashed line) of CPNet with a ResNet-34 backbone.

We point out that there exist more efficient implementations of the CP module. In the main paper, we only presented the approach of finding the per-point k-NN in a point cloud by computing a pairwise feature distance matrix of size THW × THW followed by a row-wise top-k, whose time complexity is quadratic in the number of features THW. This is the most convenient way to implement it in deep learning frameworks such as TensorFlow. However, when deployed on inference platforms, the per-point k-NN can be computed by much more efficient approaches using geometric data structures such as a k-d tree [1] or a Bounding Volume Hierarchy (BVH) [5] built in the C-dimensional feature space; the cost then consists of constructing and traversing such tree data structures and is typically far lower than the quadratic brute-force approach. Accelerating k-d trees or BVHs on various platforms is an ongoing research problem in the computer systems and architecture community and is not the focus of our work.
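The two strategies can be compared with a few lines of SciPy; note that this sketch ignores the same-frame exclusion used in the actual CP module, and that for high-dimensional semantic features the practical speedup of tree-based search can be limited, so it only illustrates the asymptotic argument.

```python
import numpy as np
from scipy.spatial import cKDTree

N, C, k = 8 * 14 * 14, 64, 8               # e.g. THW features with C channels
feats = np.random.randn(N, C).astype(np.float32)

# brute force: O(N^2 * C) pairwise squared distances plus a row-wise top-k
sq = (feats ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
np.fill_diagonal(d2, np.inf)                # exclude self-matches
brute_idx = np.argpartition(d2, k, axis=1)[:, :k]

# k-d tree: build once, query k+1 neighbours (the nearest hit is the point itself)
tree = cKDTree(feats)
_, tree_idx = tree.query(feats, k=k + 1)
tree_idx = tree_idx[:, 1:]

# the two strategies should agree up to ordering and floating-point ties
overlap = np.mean([len(set(a) & set(b)) / k for a, b in zip(brute_idx, tree_idx)])
print(f"average neighbour overlap: {overlap:.2f}")
```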

Figure 10: Per-class top-1 accuracy gain in percentage on Something-Something v2 dataset due to CP module.

Appendix H More Visualizations

In this section, we provide more visualizations on examples from Kinetics [18] in Figure 11, Something-Something [12] in Figure 12 and Jester [30] in Figure 13. They further show CP module’s ability to propose reasonable correspondences and robustness to errors in correspondence proposal.

Beyond what has been shown in the main paper, we also notice some negative examples. For example, in Figure 11(a), when proposing correspondences for the boy's left ice skate, the CP module incorrectly proposed a girl's left ice skate because the two ice skates' visual features are too similar. Max pooling also did not completely filter out this wrong proposal. However, we notice that the wrong proposal produces only a weak output signal: it activates just 3 out of 64 channels during max pooling, which is acceptable. We point out that such “errors” could also be fixed at later stages of the network, or could even be beneficial for applications that require reasoning about relations between similar but different objects.

(a) A video clip with label “ice skating” from Kinetics validation set.
(b) A video clip with label “riding a bike” from Kinetics validation set.
(c) A video clip with label “driving tractor” from Kinetics validation set.
Figure 11: Additional Visualization on our final models on Kinetics dataset. Approach is the same as the main paper.
(a) A video clip with label “Turning something upside down” from Something-Something v2 validation set.
(b) A video clip with label “Picking something up” from Something-Something v2 validation set.
(c) A video clip with label “Moving something down” from Something-Something v2 validation set.
(d) A video clip with label “Dropping something next to something” from Something-Something v2 validation set.
Figure 12: Additional Visualization on our final models on Something-Something v2 dataset. Approach is the same as the main paper.
(a) A video clip with label “Drumming Fingers” from Jester v1 validation set.
(b) A video clip with label “Shaking Hand” from Jester v1 validation set.
(c) A video clip with label “Stop Sign” from Jester v1 validation set.
(d) A video clip with label “Pushing Two Fingers Away” from Jester v1 validation set.
Figure 13: Additional Visualization on our final models on Jester v1 dataset. Approach is the same as the main paper.