ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning

08/24/2021 ∙ by Zhiwu Qing, et al. ∙ National University of Singapore ∙ Huazhong University of Science & Technology

The central idea of contrastive learning is to discriminate between different instances and force different views of the same instance to share the same representation. To avoid trivial solutions, augmentation plays an important role in generating different views, among which random cropping is shown to be effective for the model to learn a strong and generalized representation. The commonly used random crop operation keeps the difference between two views statistically consistent along the training process. In this work, we challenge this convention by showing that adaptively controlling the disparity between two augmented views along the training process enhances the quality of the learnt representation. Specifically, we present a parametric cubic cropping operation, ParamCrop, for video contrastive learning, which automatically crops a 3D cube from the video via differentiable 3D affine transformations. ParamCrop is trained simultaneously with the video backbone using an adversarial objective and learns an optimal cropping strategy from the data. Visualizations show that the center distance and the IoU between two augmented views are adaptively controlled by ParamCrop, and that the learned change in the disparity along the training process is beneficial to learning a strong representation. Extensive ablation studies demonstrate the effectiveness of the proposed ParamCrop on multiple contrastive learning frameworks and video backbones. With ParamCrop, we improve the state-of-the-art performance on both HMDB51 and UCF101 datasets.


1 Introduction

Learning representations from massive unlabeled data is a prominent research topic in computer vision for reducing the need for laborious and time-consuming manual annotations [chen2020simclr, he2020moco, swav, byol, simsiam]. In the video analysis paradigm, which is the focus of our work, such unsupervised learning strategies are even more crucial because of the increased labelling difficulty caused by the ambiguous association between videos and their labels. Early works manually design proxy tasks for learning from videos, either by generalizing methods from the image domain [jing2018-3drotnet, kim2019cubic_puzzles] or by exploiting temporal properties of videos [xu2019vcop, luo2020vcp, diba2019dynamonet, benaim2020speednet], where visual structures and contents are learned through solving these proxy tasks. Inspired by the instance discrimination task [wu2018ins_dis], contrastive self-supervised approaches have achieved impressive performance [chen2020simclr, simsiam, he2020moco, han2019dpc, han2020memdpc, versatile], which shows their potential to learn advanced semantic information from unlabelled data. One of the key factors in the success of the contrastive learning framework is data augmentation for generating different views of the same instance, which prevents the model from collapsing to trivial solutions. Among common data augmentation strategies, random cropping is shown in [chen2020simclr] to be one of the most effective operations.

Most current approaches to contrastive learning use random cropping with a fixed scale range and completely random spatio-temporal location selection. This makes the difference between the two cropped views statistically consistent along the training process, as showcased in grey in Figure 1. In this work, we challenge this convention and propose to control the cropping disparity between two views of the same instance. Inspired by curriculum learning, we increase the contrast difficulty in the later stage of contrastive training and observe an improvement in the representation. Motivated by this, we present a parametric cubic cropping operation for video contrastive learning, dubbed ParamCrop, where cubic cropping refers to cropping a 3D cube from the input video. The central component of ParamCrop is a differentiable spatio-temporal cropping operation, which enables ParamCrop to be trained simultaneously with the video backbone and to adjust the cropping strategy on the fly. We set the objective of ParamCrop to be adversarial to the video backbone, that is, to increase the contrastive loss. Hence, initialized with the simplest setting where the two cropped views largely overlap, ParamCrop gradually increases the disparity between the two views. It is worth noting that our objective in this paper is to discover an optimal cropping strategy so that the change in the difference between two augmented views is reasonably controlled. This is radically different from auto-augmentation approaches in the supervised setting [lim2019fastaugment, cubuk2019autoaugment, ho2019pba, cubuk2020randaugment], whose objectives are mostly to increase the diversity of the data so as to enhance generalization.

We quantitatively evaluate the representations trained by ParamCrop on two downstream tasks, i.e., video action recognition and video retrieval. The empirical results demonstrate notable improvements on multiple mainstream contrastive learning frameworks and video backbones. Interestingly, when the change in disparity between two different views generated by ParamCrop is manually simulated, similar improvements in the representation quality are still observed over random augmentation baselines. It further demonstrates that the change process in the view differences is important for training a strong and generalized representation. Finally, we achieve state-of-the-art performance on both evaluated tasks on HMDB51 and UCF101.

Contributions. Our contributions can be mainly summarized as follows: (a) We propose to alter the cropping strategy in video contrastive learning by controlling the disparity between two cropped views; (b) An optimal change process is found by introducing a differentiable spatio-temporal cropping operation and training ParamCrop adversarially and simultaneously with the video backbone; (c) Extensive ablation studies decompose the effectiveness of different components and cropping strategies; (d) State-of-the-art performance is reached on multiple downstream action recognition datasets.

Figure 2: The overall framework of the proposed ParamCrop and a detailed illustration of the differentiable spatio-temporal cropping operation. The cropping operation in ParamCrop is inserted between the other random augmentations (if any) and the backbone, which ensures the flow of gradients. For the differentiable cropping operation, a multi-layer perceptron is employed to regress six transformation parameters from the input noise, which compose the affine matrix $\mathbf{A}$ that defines the transformation from the homogeneous coordinates in the cropped video to the coordinates in the original video. After obtaining the corresponding pixel positions of the cropped video in the original video, the pixel values of the cropped video are calculated using bilinear interpolation. However, simply performing bilinear interpolation between two frames is meaningless; the specific operation used for temporal sampling is described in the supplementary material. Finally, the cropped videos are fed to the backbone for training.

2 Related Work

Self-supervised video representation learning. To avoid the laborious and time-consuming annotation process, a wide range of prior works have proposed different approaches for leveraging unlabelled data. Recent endeavors can be mainly divided into two categories, namely pretext-task-based approaches [jing2018-3drotnet, benaim2020speednet, diba2019dynamonet, wang2020video_peace, luo2020vcp, xu2019vcop] and contrastive-learning-based ones [han2019dpc, han2020memdpc, coclr, chen2020simclr, he2020moco, oord2018cpc]. The former usually introduces a proxy task for the model to solve. Besides simple generalizations from the image domain [rotation, noroozi2016jigsaw] such as rotation prediction [jing2018-3drotnet] and solving puzzles [kim2019cubic_puzzles, luo2020vcp], other tasks make predictions on the temporal dimension, such as speed prediction [benaim2020speednet], frame/clip order prediction [lee2017sort_seq, xu2019vcop], and predicting future frames [diba2019dynamonet]. Closely related to our work are contrastive-learning-based approaches, which are inspired by the instance discrimination task [wu2018ins_dis]. It requires the model to discriminate augmented samples of the same instance from those of other instances and to map different views of the same instance to the same representation. Based on the formulation in [oord2018cpc], [han2019dpc, han2020memdpc] contrast the representation of predicted future frames with that of the real ones. Some recent works exploit video pace variation as augmentation and contrast representations with different paces [wang2020video_peace, chen2020rspnet]. In both the video paradigm, which is the focus of this paper, and the image domain, augmentations are shown to be critical to learning a strong representation. Yet all of these methods apply random cropping with completely random spatio-temporal locations and a fixed scale range along the whole training process. We build our approach for video contrastive learning upon the simplest contrastive frameworks [chen2020simclr, he2020moco] and show that a parametric cubic cropping that controls the change process of the view disparity is conducive to the learned representations.

Data Augmentation.

The importance of data augmentation has long been recognized in supervised learning, where its main objective is to enhance the diversity of the training data so that the model can learn generalized representations. Hand-crafted data augmentations perturb the network by erasing information [random_erasing, cutout] or mixing different samples [zhang2017mixup, hendrycks2019augmix]. To reduce the dependence on human expertise, automatic data augmentation methods search for combinations of augmentation policies with non-differentiable methods [cubuk2019autoaugment, ho2019pba, lim2019fastaugment] or online learnable strategies [li2020dada]. Although our approach is related to automatic augmentation, ParamCrop learns a cropping region that adapts to the training process, rather than a combination of augmentations. Further, in unsupervised learning, Tamkin et al. [tamkin2020viewmaker] propose a learnable color transformation to improve the robustness of the network, whereas this work explores an automatic cropping operation to provide adaptive augmented views for video contrastive learning.

3 Method

This section introduces the proposed parametric cubic cropping framework, ParamCrop, based on contrastive learning. The objective of ParamCrop is to adaptively control the cropping disparity between the two generated views during the contrastive training process. To this end, we propose a differentiable spatio-temporal cropping operation in which the elements of the affine transformation matrix are generated by a trainable module. This enables the 3D cropping operation to be jointly optimized with the video backbone. Two identical but independent cropping modules are connected to each of the two views. To gradually increase the disparity between views, we initialize the parameters so that the two crops share a similar space-time location, and train them adversarially along with the video backbone using a gradient reversal operation. The overall framework as well as the detailed workflow is visualized in Figure 2.

3.1 Contrastive Learning

Our ParamCrop framework is built upon recent simplified contrastive learning frameworks [chen2020simclr, he2020moco], where the model is trained to maximize the agreement between two augmented views of the same instance and minimize that between different instances. Suppose there are $N$ different samples; we generate $2N$ augmented views and the contrastive loss can be written as:

$\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N}\big[\ell(2i-1,\,2i) + \ell(2i,\,2i-1)\big]$   (1)

where $\ell(i,j)$ defines the loss between two paired samples:

$\ell(i,j) = -\log\dfrac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N}\mathbb{1}_{[k\neq i]}\exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$   (2)

where $\mathrm{sim}(z_i, z_j)$ is the cosine similarity between the representations of views $i$ and $j$, and $\tau$ is the temperature parameter. Typically, the two views are generated by the same set of augmentations, usually consisting of random cropping, color jittering, etc. In traditional contrastive methods, the augmentation strategy is kept unchanged along the training process; hence, the view distribution is in fact consistent along the training process.
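For concreteness, below is a minimal PyTorch sketch of this pairwise contrastive (NT-Xent) objective. The function name and batching convention are ours for illustration; the authors' implementation may differ in details such as distributed gathering of negatives.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent loss for N instances with two views each (2N embeddings in total)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit-norm rows
    sim = z @ z.t() / tau                                # cosine similarity / temperature
    # exclude self-similarity from the softmax denominator
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float('-inf'))
    # the positive of view i in z1 is view i in z2, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```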

3.2 Differentiable 3D Affine Cropping

For the proposed ParamCrop to control the cropping disparity during the training process, the cropping operation is first required to be trainable. Therefore, we extend image affine transformations to video 3D affine transformations and combine them with bilinear interpolation to crop cubes from the original videos in a differentiable way.

Cubic cropping with 3D affine transformations. Before introducing the 3D affine transformation, we first define the notation: the original video $V$ has spatial size $W \times H$ and temporal length $T$, and the cropped video $V_c$ has spatial size $W_c \times H_c$ and temporal length $T_c$, as illustrated in Figure 3. With these notations, a 3-dimensional affine transformation matrix $\mathbf{A}$ that maps the homogeneous coordinates in the cropped video to those in the original video can be defined as follows:

$\mathbf{A} = \begin{bmatrix} s\cos\theta & -s\sin\theta & 0 & o_x \\ s\sin\theta & s\cos\theta & 0 & o_y \\ 0 & 0 & s_t & o_t \end{bmatrix}$   (3)

where $s$ indicates the region scale, $\theta$ the spatial rotation angle, $(o_x, o_y)$ the spatial center position offsets, $s_t$ the temporal scale and $o_t$ the temporal center position. We set the entries $\mathbf{A}_{13}$ and $\mathbf{A}_{23}$ to zero, as non-zero values entangle space and time by changing the cropped spatial location along the temporal axis, which is unnecessary considering that this essentially mimics camera motion that already exists in the videos. The entries $\mathbf{A}_{31}$ and $\mathbf{A}_{32}$ are set to zero as well, so as to avoid confusion of the space-time boundary, where a frame sampled at time step $t$ may contain spatial information from multiple other time steps if they are non-zero. Therefore, there are altogether six learnable parameters in the affine matrix: $s$, $\theta$, $o_x$, $o_y$, $s_t$ and $o_t$.

Figure 3: The preliminary parameters for the 3D affine transformation. (a) Illustration of the cropped region width $W_c$, the original frame width $W$ and the spatial center offset between the cropped region and the original frame. The region scale is determined by $W_c$ and $W$ as $s = W_c / W$. (b) Illustration of the temporal length of the original video $T$ and of the cropped video $T_c$. The temporal scale $s_t$ and the temporal offset $o_t$ are defined analogously to the spatial dimension.

With these parameters, the 3D affine matrix $\mathbf{A}$ is essentially a coordinate transformation function that maps the coordinate system of the cropped video to that of the original video, with scaling, rotation and translation in the spatial dimensions as well as scaling and translation in the temporal dimension. Given a homogeneous coordinate $(x_c, y_c, t_c, 1)$ in the cropped video, its corresponding coordinate $(x, y, t)$ in the original video can be calculated as follows:

$\begin{bmatrix} x \\ y \\ t \end{bmatrix} = \mathbf{A}\begin{bmatrix} x_c \\ y_c \\ t_c \\ 1 \end{bmatrix}$   (4)

where all coordinates in both the cropped and the original video are normalized, i.e., $x, y, t \in [-1, 1]$. By this means, we can calculate the corresponding sampling coordinate in the original video for each position in the cropped video. Because the transformed coordinates are continuous rather than discrete, the pixel values of the output video are calculated by bilinear interpolation. This cropping process with the 3D affine transformation is visualized in Figure 4.
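This sampling step maps directly onto standard differentiable grid-sampling primitives. The sketch below is a hedged PyTorch illustration, assuming a per-clip 3x4 affine matrix built from the six parameters; note that plain `grid_sample` performs trilinear interpolation across frames, whereas the paper replaces the temporal part with the rounding operation described in the appendix.

```python
import torch
import torch.nn.functional as F

def build_affine(s, theta, ox, oy, st, ot):
    """Assemble (B, 3, 4) affine matrices for F.affine_grid on 5D inputs.
    Rows follow PyTorch's (x, y, t) ordering for volumetric grids; all
    coordinates are normalized to [-1, 1]."""
    B = s.shape[0]
    A = torch.zeros(B, 3, 4, device=s.device, dtype=s.dtype)
    A[:, 0, 0] = s * torch.cos(theta)   # x row: spatial scale and rotation
    A[:, 0, 1] = -s * torch.sin(theta)
    A[:, 0, 3] = ox                     # spatial x offset
    A[:, 1, 0] = s * torch.sin(theta)   # y row
    A[:, 1, 1] = s * torch.cos(theta)
    A[:, 1, 3] = oy                     # spatial y offset
    A[:, 2, 2] = st                     # temporal scale
    A[:, 2, 3] = ot                     # temporal offset
    return A

def crop_clip(video, A, out_size):
    """video: (B, C, T, H, W); A: (B, 3, 4); out_size: (T_c, H_c, W_c)."""
    B, C = video.shape[:2]
    grid = F.affine_grid(A, [B, C, *out_size], align_corners=False)
    return F.grid_sample(video, grid, mode='bilinear', align_corners=False)
```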

Generating transformation parameters.

As a naive application of this 3D affine transformation matrix leads to the same cropping for every sample in a batch, some level of randomness has to be introduced for each transformation parameter to ensure the diversity of the cropped data. This is enabled by mapping a randomly generated noise vector $\boldsymbol{\epsilon}$ into the aforementioned six 3D affine transformation parameters, with which we can generate a cropped view.

In particular, we employ a multi-layer perceptron to predict the transformation parameters by:

$\hat{\mathbf{p}} = \sigma\big(\mathbf{W}_2\,\delta(\mathbf{W}_1\boldsymbol{\epsilon})\big)$   (5)

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the parameters of the multi-layer perceptron and $\delta$ denotes the ReLU activation between the two linear layers. The output of the MLP corresponds to $\hat{\mathbf{p}}$, a vector composed of the six controlling parameters of the 3D affine matrix $\mathbf{A}$. The sigmoid function $\sigma$ is employed to constrain the range of the output values and avoid generating meaningless views. Moreover, for the spatial scale $s$ and the temporal scale $s_t$, a near-zero or zero value indicates that an extremely small cube is cropped, which is meaningless as well and degenerates the learned representation. Therefore, we set a limited interval for each transformation parameter $\hat{p}_i$ in $\hat{\mathbf{p}}$:

$p_i = p_i^{\min} + \big(p_i^{\max} - p_i^{\min}\big)\odot\hat{p}_i$   (6)

where $\odot$ indicates element-wise multiplication, and $p_i^{\min}$ and $p_i^{\max}$ represent the minimum and maximum values allowed during training, respectively. These constraints ensure that the cropped region falls within the cube of the original video because of the large values of the spatial scale and the temporal scale during training, which avoids exceeding the boundary of the original video and yielding invalid views. The hyper-parameters are set empirically.
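A minimal sketch of this parameter generator is given below, assuming a two-layer MLP. The noise dimension, hidden width and the per-parameter bounds are placeholders (the scale bounds follow the values reported in Section 4; the offset bounds are an assumption for illustration).

```python
import torch
import torch.nn as nn

class ParamGenerator(nn.Module):
    """Maps a random noise vector to the six affine parameters (s, theta, o_x, o_y, s_t, o_t)."""
    def __init__(self, noise_dim: int = 16, hidden_dim: int = 32):
        super().__init__()
        self.noise_dim = noise_dim
        self.mlp = nn.Sequential(
            nn.Linear(noise_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 6),
        )
        # per-parameter intervals [p_min, p_max]; rotation is fixed to zero here,
        # and the +/-0.5 offset bounds are an illustrative assumption
        self.register_buffer("p_min", torch.tensor([0.5, 0.0, -0.5, -0.5, 0.5, -0.5]))
        self.register_buffer("p_max", torch.tensor([1.0, 0.0,  0.5,  0.5, 1.0,  0.5]))

    def forward(self, batch_size: int) -> torch.Tensor:
        eps = torch.rand(batch_size, self.noise_dim, device=self.p_min.device)
        p_hat = torch.sigmoid(self.mlp(eps))                   # squash to (0, 1), Eq. (5)
        return self.p_min + (self.p_max - self.p_min) * p_hat  # rescale to intervals, Eq. (6)
```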

Our differentiable 3D affine cropping thus consists of a multi-layer perceptron that generates the transformation parameters and a 3D cubic cropping approach taking these parameters to generate the cropped views. A ParamCrop framework contains two independent affine cropping modules, with one for each view, as in Figure 2.

Figure 4: Illustration of the coordinate systems and the cropping process using the 3D affine sampling operation. The coordinate systems are defined such that pixel locations along each axis are normalized to the range [-1, 1]. For the cropping process, the corresponding coordinate in the original video for each pixel in the cropped video is first calculated using the 3D affine transformation matrix $\mathbf{A}$. Then bilinear interpolation is applied to obtain the pixel values of the cropped video.

3.3 Objective, Optimization and Constraints

With the 3D affine transformation based sampling, the cropping operation can now be optimized simultaneously with the video backbone in the contrastive training process. Here we introduce the objective, the optimization approach and some constraints when optimizing for the objective. Recall that the objective of the ParamCrop framework is to adaptively control the cropping strategy. Inspired by curriculum learning, we aim for the differentiable 3D affine transformation to crop two views such that the disparity between two cropped views gradually increases along the training process.

Because the weights of the multi-layer perceptron that generates the transformation parameters are randomly initialized and thus usually small, it initially maps the random noise vector to values around the midpoint of each allowed interval. This means that the initial cropped cubes largely overlap and are thus very similar. Therefore, with this initialization condition, to achieve a gradual increase in the disparity, we simply set the goal of ParamCrop to be the opposite of the video backbone's optimization direction, that is, to be adversarial to the contrastive loss, i.e., to maximize $\mathcal{L}$. The rationale behind this training objective is that two identical views of the same video naturally give the lowest contrastive loss. Therefore, to gradually generate more distinct views, what ParamCrop needs to do is to gradually increase the contrastive loss.

For the optimization of this goal, we apply a simple gradient reversal strategy to the multi-layer perceptron during backpropagation. As the contrastive loss is computed for minimization, this reversal operation in fact forces the cropping module to maximize the contrastive loss, as expected.
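Gradient reversal is a standard trick; a self-contained PyTorch sketch is shown below. Placed between the parameter-generating MLP and the affine sampler, the same contrastive loss then updates the backbone to minimize it while pushing the cropping MLP to maximize it.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and optionally scales) gradients in backward."""

    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage sketch: reverse the gradient flowing back into the cropping parameters,
# so the backbone minimizes the contrastive loss while the MLP maximizes it.
# params = grad_reverse(param_generator(batch_size))
```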

However, blindly maximizing the contrastive loss without any constraint on the cropping module during adversarial training may cause the module to rapidly converge to extreme positions (two diagonal regions of the space-time cube) to fulfil the training goal, and to keep yielding two views at these extreme locations until the end of training. This can deteriorate the resulting representation, as the video backbone cannot make full use of the complete visual semantics in the videos, and simple memorization is probably enough for the video backbone since there is no diversity in the input clips across epochs. Due to the adversarial training, the final views indeed tend to converge to such extreme regions, as demonstrated in our experiments.

To solve this problem, we propose to apply the early stopping strategy to avoid the extreme solution by:

$\hat{p}_i \leftarrow \min\!\big(\max(\hat{p}_i,\, B_l),\, B_u\big)$   (7)

where $\hat{p}_i$ is the $i$-th entry of $\hat{\mathbf{p}}$, and $B_u$ and $B_l$ are the upper and lower bounds, respectively.

With the early stopping strategy, the generated views retain more diversity because the extreme saturation is avoided.
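One plausible implementation of this early stopping, consistent with the sensitivity analysis in Section 4 (where bounds that collapse to a single value detach the gradient entirely), is a simple clamp on the sigmoid outputs: once a parameter saturates at a bound, it receives zero gradient and the adversarial push stops there.

```python
import torch

def early_stop(p_hat: torch.Tensor, b_low: float, b_up: float) -> torch.Tensor:
    """Clamp each generated parameter (in (0, 1)) to [b_low, b_up].
    torch.clamp propagates zero gradient for saturated entries, so the
    cropping MLP stops being pushed further once a bound is reached."""
    return torch.clamp(p_hat, min=b_low, max=b_up)

# Usage sketch inside the cropping module:
# p_hat = torch.sigmoid(mlp(noise))
# p_hat = early_stop(p_hat, b_low=0.1, b_up=0.9)   # bound values are illustrative
```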

4 Experiments

Training dataset.

We pre-train the models on the training set of Kinetics-400 

[carreira2017k400] dataset, containing 240k training videos with each lasting about 10 seconds.

Pre-training settings. We adopt S3D-G [xie2018s3dg] and R-2D3D [tran2018r2d3d] as our backbones, and employ SimCLR [chen2020simclr] and MoCo [he2020moco] as our contrastive learning frameworks. The 3D affine cropping module takes raw frames as input and outputs augmented views at a fixed spatio-temporal size for the video backbone. LARS [you2017lars] is employed as the optimizer. The batch size, learning rate and weight decay are set to 1024, 0.3 and 1e-6, respectively. Color jittering and random horizontal flipping are employed before our cropping module. We set the minimum spatial and temporal scales to 0.5 and the maximum scales to 1.0. The early stopping bounds $B_l$ and $B_u$ are kept fixed during pre-training (see the sensitivity analysis in Figure 6), and the rotation angle bounds are both set to 0.0. When comparing with other methods, the models are pre-trained for 100 epochs for a fair comparison. To reduce the cost of the ablation studies, those models are pre-trained for only 20 epochs.

Evaluations. The trained representations are evaluated on two downstream tasks, i.e., action recognition and video retrieval, on two public datasets: (i) the UCF101 [soomro2012ucf101] dataset with 13,320 videos from 101 action categories; (ii) the HMDB51 [jhuang2011hmdb51] dataset with 6,849 videos from 51 action classes.

Fully fine-tuning and linear fine-tuning settings. The pre-trained models are fine-tuned on both UCF101 and HMDB51. The Adam [kingma2014adam] optimizer is employed with a weight decay of 1e-3. The learning rate is set to 0.0002 for fully fine-tuning and 0.002 for linear fine-tuning. In the fine-tuning phase, common data augmentation strategies are adopted, such as color jittering, random cropping and random horizontal flipping. We report fully fine-tuning results if not otherwise specified.

4.1 Understanding ParamCrop

We first analyze the training process of the ParamCrop by visualizing the curve of spatio-temporal IoU and 3D center Manhattan distance between two cropped regions along the contrastive training in Figure 5.

At the initialization stage, the two cropped cubes share a large portion of common visual content, as showcased in Figure 5. Along the training process, because the training objective of ParamCrop is to maximize the contrastive loss, the disparity between the two cropped views gradually increases: the center distance gradually grows and the IoU decreases. Without the early stopping strategy, the maximum distance is quickly reached and the two cropped views are pushed to two diagonal locations, before the distance reduces and oscillates around 0.75 (Figure 5). This is probably because the learning process without early stopping has a strong momentum, while a distance around 0.75 is in fact large enough to yield a sufficiently small IoU to maximize the contrastive loss. In contrast, ParamCrop with early stopping avoids the extreme locations, which ensures enough shared semantic information between the two views. Compared to random cropping, ParamCrop adapts the visual disparity between views along the training process, while the difference under random cropping is statistically consistent. Overall, this is what we expect from the training process of ParamCrop. As shown in Table 1, ParamCrop outperforms random cropping by 3.9% and 1.6% on HMDB51 and UCF101, respectively.
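The two diagnostics plotted in Figure 5 are straightforward to compute from the sampled crop parameters; a small sketch is given below, assuming axis-aligned boxes in normalized coordinates (rotation is disabled in our default setting).

```python
def crop_stats(box1, box2):
    """Each box is (t0, y0, x0, t1, y1, x1) in normalized coordinates.
    Returns the spatio-temporal IoU and the Manhattan distance between cube centers."""
    inter, vol1, vol2, dist = 1.0, 1.0, 1.0, 0.0
    for i in range(3):
        lo = max(box1[i], box2[i])
        hi = min(box1[i + 3], box2[i + 3])
        inter *= max(0.0, hi - lo)              # overlap along this axis
        vol1 *= box1[i + 3] - box1[i]
        vol2 *= box2[i + 3] - box2[i]
        c1 = 0.5 * (box1[i] + box1[i + 3])      # cube centers
        c2 = 0.5 * (box2[i] + box2[i + 3])
        dist += abs(c1 - c2)
    iou = inter / (vol1 + vol2 - inter + 1e-8)
    return iou, dist
```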

Figure 5: The curves of IoU (dashed lines) and center Manhattan distance (solid lines) between the two cropping regions generated by ParamCrop along the training process. The yellow and blue curves respectively indicate training with and without early stopping (ES). The red curve is a smooth center distance curve that is manually simulated to show that it is the change process that matters for learning a strong representation with ParamCrop. The markers on the curves illustrate the relative position between the two cropping regions (orange and green boxes) at different stages of training, without and with the early stopping strategy. Note that although only the spatial regions are visualized as a showcase, a similar push-away trend also exists between the two views in the temporal dimension.

We further study this improvement by manually designing cropping strategy of various difficulties:

(i) Simple: similar to the early stage in Figure 5, the two augmented views share much visual content.

(ii) Hard: similar to the later stages in Figure 5, much less visual content is shared.

(iii) Manual simulation: a smooth increase in difficulty is designed along the training process, following the simulated (red) trend in Figure 5.

(iv) AutoAugment [cubuk2019autoaugment]: An augmentation policy searched by  [cubuk2019autoaugment].

It can be observed in Table 1 that fixing the disparity to be either low or high does not give a stronger representation, whereas manually simulating the cropping process as in ParamCrop brings a notable improvement on both datasets. This demonstrates that a cropping strategy moving from low disparity to high disparity is conducive to a strong and generalized representation. The fact that manual simulation still underperforms ParamCrop further shows that the adaptive process driven by maximizing the contrastive loss is beneficial. We also experiment with the augmentation policy searched by AutoAugment [cubuk2019autoaugment], which does not improve performance, especially on HMDB51. Note that previous auto-augmentation methods [cubuk2019autoaugment, ho2019pba, lim2019fastaugment] are devoted to seeking a better combination of augmentations among a large pool of candidates. Even though ParamCrop can be viewed as an auto-augmentation technique, its motivation is to design an automatic cropping strategy that goes from easy to hard during training.

Backbone  Difficulty                          HMDB51  UCF101
S3D-G     Random                              56.0    85.3
S3D-G     Simple                              54.6    85.3
S3D-G     Hard                                55.2    85.5
S3D-G     ParamCrop                           59.9    86.9
S3D-G     Manual Simulation                   58.6    86.6
S3D-G     AutoAugment [cubuk2019autoaugment]  55.7    86.2
Table 1: The impact of data augmentation difficulty. Naive usage of simple, random, and hard augmentations yields suboptimal representations compared to the proposed ParamCrop. Notably, when we manually simulate the augmentation difficulty schedule of ParamCrop, we still observe a notable performance boost.

4.2 Ablation Studies

Random  ParamCrop  Rot  HMDB51 (FT / LFT)  UCF101 (FT / LFT)
✓       ✗          ✗    56.0 / 33.5        85.3 / 57.9
✗       ✓          ✗    58.0 / 37.2        86.9 / 58.5
✓       ✓          ✓    55.8 / 29.5        86.2 / 48.6
✓       ✓          ✗    59.9 / 37.3        86.9 / 59.3
Table 2: Baseline comparison of ParamCrop and evaluation of combining ParamCrop with the random cropping strategy. 'Random' indicates the usage of random cropping during the contrastive training process. Parametric rotation is additionally enabled in the second-to-last row to evaluate its influence.

Baseline comparisons. We first evaluate the proposed ParamCrop by comparing it with the baseline, as in Table 2. The baseline is defined as a random cropping operation in place of the differentiable 3D affine cropping. By simply switching from the random cropping operation to the proposed learnable cropping operation, improvements of 2% and 1.6% are observed in the fully fine-tuning setting on HMDB51 and UCF101, respectively, as well as 3.7% and 0.6% in the linear fine-tuning setting. This indicates not only that ParamCrop learns a more generalized representation than random cropping, but also that the learned weights serve as a good initialization for fine-tuning on downstream action recognition tasks. Note that the temporal cropping in the only-ParamCrop setting is limited to a fixed range of frames because of hardware constraints, which means that with more powerful hardware to store a longer frame period, the performance could be further improved. Additionally, we observe a larger improvement over the baseline by incorporating the traditional random cropping strategy (i.e., performing random cropping before ParamCrop). We believe this improvement comes from the additional randomness provided by random cropping, and we thus use this combination in further experiments.

Effect of enabling spatial rotation. Spatial rotation is an important part of the affine transformation. However, as shown in Table 2, introducing spatial rotation to ParamCrop in fact has an adverse effect on the representation. This is in line with previous research findings [chen2020simclr]. Hence, we fix the rotation angle to 0.0 in our other experiments.

Approach          Gradient Reverse  HMDB51       UCF101
Random Crop       -                 56.0         85.3
ParamCrop (Ours)  ✗                 56.0 (+0.0)  84.3 (-1.0)
ParamCrop (Ours)  ✓                 59.9 (+3.9)  86.9 (+1.6)
Table 3: The importance of the gradient reversal operation for adversarial training. Removing the gradient reversal degenerates ParamCrop to a simple random crop.
Backbone  Early Stopping  HMDB51 (FT / LFT)  UCF101 (FT / LFT)
S3D-G     ✗               59.9 / 32.6        87.0 / 55.8
S3D-G     ✓               59.9 / 37.3        86.9 / 59.3
Table 4: Ablation on the early stopping strategy. FT and LFT respectively indicate fully fine-tuning and linear fine-tuning.

Gradient reversal. Next, we investigate the necessity of reversing the gradient for the cropping module; the results are listed in Table 3. Removing the gradient reversal aligns the training objective of the cropping module with that of representation learning, which means the cropping module now looks for views that maximize the shared visual content. Since we perform random spatio-temporal transformations before the cropping module, the performance without gradient reversal degenerates to that of random cropping, and is even lower, since some randomly cropped spatio-temporal cubes may share visual content and maximizing agreement between the cropped views encourages the model to find shortcuts.

Early stopping. Qualitatively, it is observed in Figure 5 that early stopping promotes the diversity of the selected views, in that it avoids extreme locations (compared to training without early stopping) and thus a wider range of cropping options remains available. Quantitatively, the results are listed in Table 4. Although early stopping has little effect on the fully fine-tuning performance, it notably improves the linear separability of the learned representation. This indicates that early stopping is a necessary strategy for learning a generalized representation.

Approach  Backbone  ParamCrop  HMDB51  UCF101
SimCLR    S3D-G     ✗          56.0    85.3
SimCLR    S3D-G     ✓          59.9    86.9
SimCLR    R-2D3D    ✗          50.4    77.2
SimCLR    R-2D3D    ✓          53.0    79.4
MoCo      S3D-G     ✗          52.4    84.1
MoCo      S3D-G     ✓          54.3    85.1
MoCo      R-2D3D    ✗          45.4    72.8
MoCo      R-2D3D    ✓          48.2    73.8
Table 5: Comparison of ParamCrop integrated with different contrastive methods on different video backbones. The proposed ParamCrop improves all combinations by a notable margin.

Different frameworks and backbones. To evaluate the generalization ability of the proposed approach, ParamCrop is integrated to two mainstream contrastive learning frameworks SimCLR [chen2020simclr] and MoCo [he2020moco] and two common video backbones S3D-G [xie2018s3dg] and R-2D3D [tran2018r2d3d]. The results shown in Table 5 demonstrate that the benefit by introducing ParamCrop can be generalized to multiple contrastive frameworks and video backbones, and improvements as high as 3.9% and 2.2% are achieved on HMDB51 and UCF101 respectively.

Backbone  ParamCrop  10% Label  50% Label
S3D-G     ✗          13.9       29.9
S3D-G     ✓          14.4       31.6
Table 6: Application of ParamCrop to the semi-supervised task. All experiments are based on the MeanTeacher [tarvainen2017mean_teacher] framework on the HMDB51 dataset, and the models are initialized randomly.
Backbone  MLP Layers  P.S.     C.C. (FLOPs)  HMDB51  UCF101
S3D-G     0           10.3M    71.9G         56.0    85.3
S3D-G     1           +396     +0.002G       57.7    86.6
S3D-G     2           +0.005M  +0.002G       59.9    86.9
S3D-G     3           +0.014M  +0.002G       58.4    86.8
Table 7: Parameter size (P.S.) and computation cost (C.C.) of ParamCrop with different numbers of layers in the multi-layer perceptron (MLP). '+' denotes the absolute increase over the baseline.
Backbone  Depth  ParamCrop  HMDB51       UCF101
R-2D3D    10     ✗          43.7         71.9
R-2D3D    18     ✗          50.4         77.2
R-2D3D    34     ✗          53.9         80.6
R-2D3D    50     ✗          52.5         84.0
R-2D3D    10     ✓          46.5 (+2.8)  74.0 (+3.0)
R-2D3D    18     ✓          52.0 (+1.6)  79.5 (+2.3)
R-2D3D    34     ✓          56.8 (+2.9)  83.6 (+3.0)
R-2D3D    50     ✓          55.2 (+2.7)  85.3 (+1.3)
Table 8: Ablations on different network depths.

Parameter size and computation cost of ParamCrop. The additional parameters and computational overhead brought by ParamCrop are shown in Table 7. ParamCrop only adds an MLP, whose parameters and computation are negligible compared to S3D-G. Notably, the ablation on the number of MLP layers shows that a two-layer MLP achieves the best performance: a one-layer MLP has limited non-linear capacity, while a deeper MLP may introduce optimization difficulties.

ParamCrop with deeper networks. We explore the impact of backbones with different depths in Table 8, since deeper networks may suffer from vanishing or exploding gradients, which can hinder the optimization of ParamCrop. However, ParamCrop remains effective as the depth of the backbone grows from 10 to 50. Combined with Table 5, this suggests that ParamCrop can be applied to a variety of backbones and contrastive learning frameworks.

Figure 6: Sensitivity analysis of ParamCrop. FT and LFT in the table indicate fully fine-tuning, and linear fine-tuning, respectively.

Sensitivity test. To study how the lower and upper bounds in Equation 7, i.e., $B_l$ and $B_u$, affect the performance of ParamCrop, sensitivity tests are conducted on both HMDB51 and UCF101. The results are shown in Figure 6. We make the following observations: (1) When the bound margin equals 0.0, it is equivalent to no early stopping, and the linear fine-tuning performance is unsatisfactory. (2) With a moderate margin, there may be an inconsequential drop from the highest fully fine-tuning performance, but the linear classification performance is boosted on both datasets. (3) When the margin grows to 0.4, the fully fine-tuning performance decreases while linear fine-tuning improves slightly, which shows a trade-off between them. This suggests that a better initialization for the backbone does not always lead to better features for linear classification. (4) If the margin reaches 0.5, the performance of both fully and linear fine-tuning degrades dramatically; this is caused by the detachment of the gradient, so that ParamCrop can no longer be optimized, rather than by sensitivity to the bound itself.

4.3 Application on Semi-supervised method

We further evaluate the generalization ability of ParamCrop to the semi-supervised learning paradigm by applying the proposed parametric cubic cropping and its adversarial training to the classic semi-supervised framework MeanTeacher [tarvainen2017mean_teacher], whose structure is similar to MoCo [he2020moco] in that two different views are generated by data augmentation. We simply plug ParamCrop into MeanTeacher after the random augmentations. The results in Table 6 show consistent improvements of ParamCrop over the baseline in both the 10% and 50% label settings.

4.4 Comparison with the State-of-the-art

Approach  Architecture  Pre-train Dataset  HMDB  UCF
OPN [lee2017sort_seq] VGG UCF 23.8 59.6
3D-RotNet [jing2018-3drotnet] R3D K400 33.7 62.9
ST-Puzzle [kim2019cubic_puzzles] R3D K400 33.7 63.9
VCOP [xu2019vcop] R(2+1)D UCF 30.9 72.4
DPC [han2019dpc] R-2D3D K400 35.7 75.7
CBT [sun2019cbt] S3D K600 44.6 79.5
MemDPC [han2020memdpc] R-2D3D K400 41.2 78.1
SpeedNet [benaim2020speednet] S3D-G K400 48.8 81.1
DynamoNet [diba2019dynamonet] STCNet Y8M 59.9 88.1
DSM [wang2020dsm] 3D-Res34 K400 52.8 78.2
MoSI [mosi] R2D3D K400 48.6 70.7
MoSI [mosi] R(2+1)D UCF/HMDB 51.8 82.8
RSPNet* [chen2020rspnet] S3D-G K400 59.6 89.9
STS* [wang2021sts] S3D-G K400 62.0 89.0
ParamCrop R-2D3D K400 53.7 82.8
ParamCrop S3D-G K400 62.3 88.9
ParamCrop* S3D-G K400 63.4 91.3
Supervised [xie2018s3dg] S3D-G K400 75.9 96.8
Table 9: Comparison with state-of-the-art methods, where the pre-trained models are fully fine-tuned on HMDB51 and UCF101. For the pre-training dataset, 'K400' and 'K600' refer to the Kinetics-400 and Kinetics-600 [carreira2018k600] datasets, and 'Y8M' is the YouTube-8M dataset. '*' indicates that 64 frames are used to fine-tune the backbone.
Approach  Architecture  Pre-train Dataset  HMDB51  UCF101
MemDPC [han2020memdpc] R-2D3D K400 30.5 54.1
MemDPC n.l. [han2020memdpc] R-2D3D K400 33.6 58.5
ParamCrop R-2D3D K400 39.7 66.8
ParamCrop S3D-G K400 39.4 68.5
Table 10: Comparison with state-of-the-art methods in linear classification on HMDB51 and UCF101. 'n.l.' refers to a nonlinear classifier.

Action Recognition. We compare ParamCrop with state-of-the-art methods under the fully fine-tuning setting on HMDB51 and UCF101; the results are shown in Table 9. In these experiments, we choose SimCLR [chen2020simclr] as the base contrastive learning framework. From these results, we can draw three conclusions: (i) ParamCrop outperforms current state-of-the-art methods on both datasets. In particular, we surpass SpeedNet [benaim2020speednet] by a clear margin on both datasets when using the same pre-training dataset and backbone, i.e., Kinetics-400 and S3D-G. (ii) ParamCrop can be trained on less data yet achieve remarkable performance: DynamoNet [diba2019dynamonet] is trained on 8M videos from the YouTube-8M dataset, while our method uses only 240K videos and still obtains gains over it. (iii) ParamCrop further closes the gap between the fully supervised model pre-trained on Kinetics-400 and models trained without manual annotations. Table 10 compares the linear classification performance on HMDB51 and UCF101, which evaluates the linear separability of representations trained with self-supervised learning and serves as a good indicator of representation quality. Compared with MemDPC [han2020memdpc], the proposed ParamCrop obtains clear gains with the same R-2D3D backbone on both HMDB51 and UCF101, demonstrating the superiority of the features extracted by models pre-trained with ParamCrop.

Video Retrieval. For the video retrieval task, we mainly follow the settings in previous works [xu2019vcop, luo2020vcp, han2020memdpc]. The models pre-trained by ParamCrop on Kinetics-400 with the SimCLR framework are directly employed as feature extractors without fine-tuning. Each video in the dataset is divided uniformly along the temporal dimension into 10 clips, and a video-level feature is obtained by averaging the features of the 10 clips. Each video in the testing set is used to query its k nearest videos in the training set, and a retrieval is considered successful if a retrieved video shares the same label with the test video. The recall rate at top k (R@k) is the evaluation metric. We conduct experiments on both HMDB51 and UCF101, and compare with state-of-the-art approaches in Table 11 and Table 12, respectively. The results show that ParamCrop exceeds the state-of-the-art MemDPC [han2020memdpc] with the same backbone on both HMDB51 and UCF101, which indicates that our learnt representation is more generalized.
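The retrieval protocol reduces to nearest-neighbor search over clip-averaged features; a short sketch of the R@k metric is given below (feature extraction and clip averaging are assumed to have been done beforehand).

```python
import torch

def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, ks=(1, 5, 10, 20)):
    """query/gallery feats: (N, D) L2-normalized video-level features.
    A query is recalled at k if any of its k nearest gallery videos shares its label."""
    sim = query_feats @ gallery_feats.t()           # cosine similarity matrix
    ranks = sim.argsort(dim=1, descending=True)     # gallery indices sorted by similarity
    results = {}
    for k in ks:
        topk_labels = gallery_labels[ranks[:, :k]]  # labels of the k nearest neighbors
        hit = (topk_labels == query_labels.unsqueeze(1)).any(dim=1)
        results[f"R@{k}"] = hit.float().mean().item()
    return results
```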

Approach Architecture R@1 R@5 R@10 R@20
VCOP [xu2019vcop] R3D 7.6 22.9 34.4 48.8
VCP [luo2020vcp] R3D 7.6 24.4 36.3 53.6
SpeedNet [benaim2020speednet] S3D-G 13.0 28.1 37.5 49.5
DSM [wang2020dsm] C3D 8.2 25.9 38.1 52.0
DSM [wang2020dsm] I3D 7.6 23.3 36.5 52.5
MemDPC [han2020memdpc] R-2D3D 15.6 37.6 52.0 65.3
ParamCrop R-2D3D 21.9 46.9 59.0 71.5
ParamCrop S3D-G 21.9 46.7 60.3 72.6
Table 11: Nearest neighbor retrieval comparison on HMDB51.
Approach Architecture R@1 R@5 R@10 R@20
Jigsaw [noroozi2016jigsaw] CFN 19.7 28.5 33.5 40.0
OPN [lee2017sort_seq] OPN 19.9 28.7 34.0 40.6
VCOP [xu2019vcop] R3D 14.1 30.3 40.4 51.1
VCP [luo2020vcp] R3D 18.6 33.6 42.5 53.5
SpeedNet [benaim2020speednet] S3D-G 13.0 28.1 37.5 49.5
DSM [wang2020dsm] C3D 16.8 33.4 43.4 54.6
DSM [wang2020dsm] I3D 17.4 35.2 45.3 57.8
MemDPC [han2020memdpc] R-2D3D 40.2 63.2 71.9 78.6
ParamCrop R-2D3D 43.0 59.9 69.2 78.3
ParamCrop S3D-G 46.3 62.3 71.3 79.1
Table 12: Nearest neighbor retrieval comparison on UCF101.

5 Conclusion

In this work, we challenge the traditional random cropping strategy, whose difference between the two cropped views stays statistically consistent along the training process, and propose a parametric cubic cropping for adaptively controlling the disparity between the two cropped views during training. Specifically, we enable online training by extending the affine transformation matrix to a 3D affine transformation and learning to regress the transformation parameters, such that the cropping operation is fully differentiable. For optimization, the parametric cubic cropping operation is trained with an objective adversarial to the video backbone, using a simple gradient reversal operation. We additionally show that an early stopping strategy in the optimization process helps the video backbone learn better representations. Empirical results demonstrate that the adaptively controlled disparity between views is indeed effective for improving representation quality. Extensive ablation studies validate the effectiveness of each proposed component, and state-of-the-art performance is achieved on both action recognition and video retrieval on HMDB51 and UCF101.

References

Appendix

In this supplemental material, we provide additional details on the (a) temporal sampling strategy, (b) evaluation of ParamCrop on the different training epochs in comparison with random cropping, (c) further decomposition of ParamCrop into spatial and temporal components, (d) few-shot evaluation of the model initialized with ParamCrop pre-training and (e) visualization of the cropping area by different variants of ParamCrop.

Appendix A Temporal Sampling

In our 3D affine cropping, pixel values of the output video are obtained using bilinear interpolation in the spatial dimensions. However, simply performing bilinear interpolation along the temporal axis would blend two frames, which is physically meaningless. Therefore, we round down the temporal coordinates in the original video that correspond to the pixels in the cropped video, in a differentiable way:

$\tilde{t} = t + \mathrm{sg}\big(\lfloor t \rfloor - t\big)$   (8)

where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation, so that the forward pass uses the rounded-down coordinate while gradients flow to $t$ as if the rounding were the identity.
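In PyTorch this kind of straight-through floor is a one-liner; the sketch below is an illustration of such an operator, not necessarily the authors' exact formulation.

```python
import torch

def floor_ste(t: torch.Tensor) -> torch.Tensor:
    """Straight-through floor: forward returns floor(t); backward treats the
    operation as the identity, so gradients still reach the affine parameters."""
    return t + (torch.floor(t) - t).detach()
```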

Appendix B Training Epochs

Here, we compare ParamCrop with its random crop counterpart under different training periods in Figure 7. Overall, a longer training period gives better classification accuracy on the downstream tasks. In comparison with random cropping, ParamCrop achieves a consistent improvement on both HMDB51 and UCF101 as the training time increases.

Figure 7: Evaluation on HMDB51 and UCF101 with different training periods. Our ParamCrop achieves a consistent improvement over the random crop counterpart.

Appendix C Further Decomposing ParamCrop

Setting                          HMDB51 (FT / LFT)  UCF101 (FT / LFT)
Random crop (baseline)           56.0 / 33.5        85.3 / 57.9
Parametric temporal crop only    53.5 / 27.3        84.9 / 45.0
Parametric spatial crop only     59.9 / 32.7        86.5 / 41.3
Parametric spatio-temporal crop  58.0 / 37.2        86.9 / 58.5
Random crop + ParamCrop          59.9 / 37.3        86.9 / 59.3
Table 13: Further decomposition of ParamCrop into its spatial and temporal components, with comparison to the random cropping baseline and the full ParamCrop variant.

We further decompose ParamCrop into a parametric spatial crop and a parametric temporal crop, and evaluate them independently by fixing the other dimension to sample the whole video. That is, we always select the same 16 frames for the parametric spatial crop, and the whole spatial area for the parametric temporal crop. The results are shown in Table 13. Compared to random cropping, the parametric temporal crop yields a slightly lower performance, while the parametric spatial crop gives a notable improvement. In terms of linear fine-tuning, both the spatial-only and the temporal-only variants underperform random cropping. However, combining the temporal and spatial components notably boosts the linear fine-tuning performance, so that the representation achieves better linear separability than the random cropping baseline, which means that the parametric cubic crop helps the network learn a more generalized representation.

Appendix D Few-shot Finetuning

Backbone  Label  HMDB51 (R.C. / P.C.)  UCF101 (R.C. / P.C.)
S3D-G     50%    51.5 / 52.8           82.4 / 83.3
S3D-G     30%    44.9 / 46.0           76.9 / 78.5
S3D-G     10%    33.3 / 34.6           57.6 / 60.1
Table 14: Evaluation of ParamCrop under low-shot fine-tuning on HMDB51 and UCF101, compared to initialization with representations trained using random augmentations. 'R.C.' denotes random crop pre-training and 'P.C.' denotes ParamCrop pre-training.

We evaluate the pre-trained models by using fewer labeled videos in the fine-tuning stage on HMDB51 and UCF101. In particular, we randomly select 10%, 30% and 50% of the labeled videos from each category in the training set. The selected videos are used for fully fine-tuning, after which we report the performance on the whole validation set, as shown in Table 14. The results show a steady improvement in all label settings, which demonstrates that models pre-trained with ParamCrop are robust to the amount of labeled data available in downstream tasks.

Appendix E Visualization

To better understand what views are generated by ParamCrop, we first visualize the cropped areas of ParamCrop along a 20-epoch training process in Figure 8, namely (a) at the initialization stage, (b) after 10 epochs, and (c) after 20 epochs. At the initialization stage, the two views are initialized to share most of their visual content. As training progresses, the two views gradually become more distinct. With the early stopping strategy, however, even at the end of training there is still shared content between the two views, which is shown to be beneficial to the final representation. This process is also demonstrated quantitatively in Figure 5 of the manuscript.

Further, we visualize four cases comparing random cropping and different variants of ParamCrop in Figure 9 and Figure 10, i.e., (a) random crop, (b) ParamCrop without gradient reversal, (c) ParamCrop without early stopping, and (d) the full ParamCrop. The figures show that: (a) random crop generates views that are uniformly distributed in the spatial and temporal dimensions, and the visual content of the two cropped views is shared to a large extent; (b) ParamCrop without gradient reversal always outputs the full spatial content for two views that are uniformly distributed along the temporal axis, as this variant aims to minimize the contrastive loss by preserving more visual cues shared by the two augmented views; (c) ParamCrop without early stopping converges to extreme regions and generates completely different views; although this strategy can still learn a good initialization for the backbone, it hurts the linear fine-tuning performance; (d) the two cropped views generated by the full ParamCrop are different, but the ratio of shared content between the two views is higher than without early stopping and lower than without gradient reversal or with random cropping.

Figure 8: Visualization of the cropped views of ParamCrop along the training process.
Figure 9: Visualization of the cropping views comparing random crop and the variants of ParamCrop.
Figure 10: Visualization of the cropping views comparing random crop and the variants of ParamCrop.