Video Representation Learning by Recognizing Temporal Transformations. In ECCV, 2020.
We introduce a novel self-supervised learning approach to learn representations of videos that are responsive to changes in the motion dynamics. Our representations can be learned from data without human annotation and provide a substantial boost to the training of neural networks on small labeled data sets for tasks such as action recognition, which require to accurately distinguish the motion of objects. We promote an accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions. To learn to distinguish non-trivial motions, the design of the transformations is based on two principles: 1) To define clusters of motions based on time warps of different magnitude; 2) To ensure that the discrimination is feasible only by observing and analyzing as many image frames as possible. Thus, we introduce the following transformations: forward-backward playback, random frame skipping, and uniform frame skipping. Our experiments show that networks trained with the proposed method yield representations with improved transfer performance for action recognition on UCF101 and HMDB51.
A fundamental goal in computer vision is to build representations of visual data that can be used towards tasks such as object classification, detection, segmentation, tracking, and action recognition
[39, 11, 41, 26]. In the past decades, a lot of research has been focused on learning directly from single images and has done so with remarkable success [38, 17, 18]. Single images carry crucial information about a scene. However, when we observe a temporal sequence of image frames, i.e., a video, it is possible to understand much more about the objects and the scene. In fact, by moving, objects reveal their shape (through a change in the occlusions), their behavior (how they move due to the laws of Physics or their inner mechanisms), and their interaction with other objects (how they deform, break, clamp, etc.). However, learning such information is non-trivial. Even when labels related to motion categories are available (such as in action recognition), there is no guarantee that the trained model will learn the desired information, and will not instead simply focus on a single iconic frame and recognize a key pose or some notable features strongly correlated to the action [40].

To build representations of videos that capture more than the information contained in a single frame, we pose the task of learning an accurate model of motion as that of learning to distinguish an unprocessed video from a temporally-transformed one. Since similar frames are present in both the unprocessed and transformed sequence, the only piece of information that allows their discrimination is their temporal evolution. This idea has been exploited in the past [12, 28, 29, 33, 50] and is also related to work in time-series analysis, where dynamic time warping is used as a distance for temporal sequences [20].
In this paper, we analyze different temporal transformations and evaluate how learning to distinguish them yields a representation that is useful to classify videos into meaningful action categories. Our main finding is that the most effective temporal distortions are those that can be identified only by observing the largest number of frames. For instance, substituting the second half of a video with its first half in reverse order can already be detected by comparing just the 3 frames around the point of temporal symmetry. In contrast, distinguishing when a video is played backwards from when it is played forward
[50] may require observing many frames. Thus, one can achieve powerful video representations by using as pseudo-task the classification of temporal distortions that differ in their long-range motion dynamics.
Towards this goal, we investigate 4 different temporal transformations of a video, which are illustrated in Fig. 1:
Speed: Select a subset of frames with uniform sub-sampling (i.e., with a fixed number of frames in between every pair of selected frames), while preserving the order in the original sequence;
Random: Select a random permutation of the frame sequence;
Periodic: Select a random subset of frames in their natural (forward) temporal order and then a random subset in the backward order;
Warp: Select a subset of frames with a random sub-sampling (i.e., with a random number of frames in between every pair of selected frames), while preserving the natural (forward) order in the original sequence.
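Each of these transformations ultimately just selects a sequence of frame indices and gathers the corresponding frames. A minimal sketch of this idea, assuming videos are stored as NumPy arrays of shape (T, H, W, C) (the helper name and the example indices are ours, not the paper's):

```python
import numpy as np

def apply_transformation(video, indices):
    """Gather the frames of a (T, H, W, C) video at the given temporal indices."""
    return video[np.asarray(indices)]

video = np.random.rand(64, 112, 112, 3)       # dummy 64-frame clip
indices = 4 * np.arange(16)                   # uniform sub-sampling ("Speed"): every 4th frame
clip = apply_transformation(video, indices)
print(clip.shape)                             # (16, 112, 112, 3)
```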
We use these transformations to verify and illustrate the hypothesis that learning to distinguish them from one another (and from the original sequence) is useful to build a representation of videos for action recognition. For simplicity, we train a neural network that takes as input videos of the same duration and outputs two probability distributions: one over which of the above temporal transformations the input sequence belongs to, and one over the playback speed of the chosen speed transformation.

In the Experiments section, we transfer features of standard 3D-CNN architectures (C3D [44], 3D-ResNet [16], and R(2+1)D [45]) pre-trained through the above pseudo-task to standard action recognition data sets such as UCF101 and HMDB51, with improved performance compared to prior work. We also show that features learned through our proposed pseudo-task capture long-range motion better than features obtained through supervised learning. Our project page https://sjenni.github.io/temporal-ssl provides code and additional experiments.
Our contributions can be summarized as follows: 1) We introduce a novel self-supervised learning task to learn video representations by distinguishing temporal transformations; 2) We study the discrimination of the following novel temporal transformations: speed, periodic and warp; 3) We show that our features are a better representation of motion than features obtained through supervised learning and achieve state-of-the-art transfer learning performance on action recognition benchmarks.
Because of the lack of manual annotation, our method belongs to self-supervised learning. Self-supervised learning appeared in the machine learning literature more than 2 decades ago
[7, 2] and has been reformulated recently in the context of visual data with new insights that make it a promising method for representation learning [9]. This learning strategy is a recent variation on the unsupervised learning theme, which exploits labeling that comes for “free” with the data. Labels could be easily accessible and associated with a non-visual signal (for example, ego-motion [1], audio [35], text and so on), but also could be obtained from the structure of the data (e.g., the location of tiles [9, 34], the color of an image [53, 54, 27]) or through transformations of the data [14, 21, 22]. Several works have adapted self-supervised feature learning methods from domains such as images or natural language to videos: rotation prediction [23], Dense Predictive Coding [15], and the adaptation of the BERT language model [8] to sequences of frame feature vectors [43].
In the case of videos, we identify three groups of self-supervised approaches: 1) Methods that learn from videos to represent videos; 2) Methods that learn from videos to represent images; 3) Methods that learn from videos and auxiliary signals to represent both videos and the auxiliary signals (e.g., audio).
Temporal ordering methods. Prior work has explored the temporal ordering of the video frames as a supervisory signal. For example, Misra et al. [33] showed that learning to distinguish a real triplet of frames from a shuffled one yields a representation with temporally varying information (e.g.
, human pose). This idea has been extended to longer sequences for posture and behavior analysis by using Long Short-Term Memories
[5]. The above approaches classify the correctness of a temporal order directly from one sequence. An alternative is to feed several sequences, some of which are modified, and ask the network to tell them apart [12]. Other recent work predicts the permutation of a sequence of frames [28] or both the spatial and temporal ordering of frame patches [6, 24]. Another recent work focuses on solely predicting the arrow of time in videos [50]. Three concurrent publications also exploit the playback speed as a self-supervision signal [10, 4, 52]. In contrast, our work studies a wider range of temporal transformations. Moreover, we show empirically that the extent (in frames) of the temporal statistics captured by our features correlates with the transfer learning performance in action recognition.

Recent work [47] showed how a careful learning of motion statistics led to a video representation with excellent transfer performance on several tasks and data sets. The learning of motion statistics was made explicit by extracting optical flow between frame pairs, by computing flow changes, and then by identifying the region where a number of key attributes (e.g., maximum magnitude and orientation) of the time-averaged flow-change occurred. In this work, we also aim to learn from motion statistics, but we focus our attention entirely on the temporal evolution without specifying motion attributes of interest or defining a task based on appearance statistics. We hypothesize that these important aspects could be implicitly learned and exploited by the neural network to solve the lone task of discriminating temporal transformations of a video. Our objective is to encourage the neural network to represent well motion statistics that require a long-range observation (in the temporal domain). To do so, we train the network to discriminate videos where the image content has been preserved, but not the temporal domain. For example, we ask the network to distinguish a video at the original frame rate from when it is played 4 times faster. Due to the laws of Physics, one can expect that, in general, executing the same task at different speeds leads to different motion dynamics compared to when a video is just played at different speeds (e.g.
, compare marching vs walking played at a higher speed). Capturing the subtleties of the dynamics of these motions requires more than estimating motion between 2 or 3 frames. Moreover, these subtleties are specific to the moving object, and thus they require object detection and recognition.
In our approach, we transform videos by sampling frames according to different schemes, which we call temporal transformations. To support our learning hypothesis, we analyze transformations that require short- (i.e., temporally local) and long-range (i.e., temporally global) video understanding. As will be illustrated in the Experiments section, short-range transformations yield representations that transfer to action recognition with a lower performance than long-range ones.
Fig. 2 illustrates how we train our neural network (a 3D-CNN [44]) to build a video representation (with 16 frames).
In this section, we focus on the inputs to the network. As mentioned above, our approach is based on distinguishing different temporal transformations. We consider 4 fundamental types of transformations: Speed changes, random temporal permutations, periodic motions and temporal warp changes.
Each of these transformations boils down to picking a sequence of temporal indices to sample the videos in our data set.
In the following, D = 16 denotes the number of frames in a sampled clip, s the playback speed parameter, and τ a random initial frame index; each transformation defines which subset of frame indices of a video is used, based on the transformation type and on the speed s.
Speed (S):
In this first type we artificially change the video frame rate, i.e., its playback speed. We achieve that by skipping a different number of frames.
We consider 4 cases, Speed 0, 1, 2, 3, corresponding to sub-sampling factors 2^0, 2^1, 2^2 and 2^3 respectively, where we skip 2^s − 1 frames between consecutive selected frames. The resulting playback speed of Speed s is therefore 2^s times the original speed.
In the generation of samples for the training of the neural network we first uniformly sample the playback speed s and then use this parameter to define the other transformations. This sequence is used in all experiments as one of the categories, against either other speeds or one of the other transformations below. The index sequence is thus τ + 2^s · (0, 1, …, D − 1), where τ is a random initial index.
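A minimal NumPy sketch of this sampling, under our reading of the construction above (the helper name, the random start, and the default clip length are ours):

```python
import numpy as np

def speed_indices(num_video_frames, s, clip_len=16, rng=np.random):
    """Uniform sub-sampling: keep every 2**s-th frame, i.e. skip 2**s - 1 frames."""
    step = 2 ** s
    span = step * (clip_len - 1) + 1            # number of original frames covered
    start = rng.randint(0, num_video_frames - span + 1)
    return start + step * np.arange(clip_len)

print(speed_indices(200, s=2))                  # e.g. [37 41 45 ... 97] -> 4x playback speed
```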
Random (R):
In this second temporal transformation we randomly permute the indices of a sequence without skipping frames. We fix the range of the permuted indices so that the maximum frame skip between two consecutive selected frames is not too dissimilar to that of the other transformations.
This case is used as a reference, as random permutations can often be detected by observing only a few nearby frames. Indeed, in the Experiments section one can see that this transformation yields a low transfer performance. The index sequence is thus τ + permutation(0, 1, …, D − 1). This transformation is similar to the pseudo-task of Misra et al. [33].
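A corresponding sketch (again with a hypothetical helper; consecutive frames are permuted, matching the description above):

```python
import numpy as np

def random_indices(num_video_frames, clip_len=16, rng=np.random):
    """Random permutation of clip_len consecutive frames (no frame skipping)."""
    start = rng.randint(0, num_video_frames - clip_len + 1)
    return start + rng.permutation(clip_len)
```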
Periodic (P):
This transformation synthesizes motions that exhibit approximate periodicity. To create such artificial cases we first pick a midpoint at which the playback direction switches. Then, we compose an index sequence that runs forward from the start up to the midpoint and then backward from the midpoint. Finally, we sub-sample this sequence by skipping 2^s − 1 frames. Notice that the randomization of the midpoint in the case of s > 0 yields pseudo-periodic sequences, where the frames in the second half of the generated sequence often do not match the frames in the first half of the sequence. The resulting index sequence thus increases with step 2^s up to a random midpoint and then decreases with the same step, shifted by a random initial index τ.
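One possible construction of such indices (a sketch under our assumptions; the exact randomization of the midpoint in the paper may differ):

```python
import numpy as np

def periodic_indices(num_video_frames, s, clip_len=16, rng=np.random):
    """Forward up to a random switch point, then backward, with skips of 2**s - 1 frames.
    The video must be long enough to fit the bounce (roughly 2**s * clip_len frames)."""
    step = 2 ** s
    switch = rng.randint(1, clip_len - 1)                 # clip position where direction flips
    forward = np.arange(0, switch + 1)
    backward = np.arange(switch - 1, switch - 1 - (clip_len - switch - 1), -1)
    rel = step * np.concatenate([forward, backward])      # relative indices, may dip below zero
    lo, hi = rel.min(), rel.max()
    start = rng.randint(-lo, num_video_frames - hi)       # shift so all indices are valid
    return start + rel

print(periodic_indices(300, s=1))
```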
Warp (W):
In this transformation, we pick a set of temporally ordered indices with a non-uniform number of skipped frames between them (we allow any frame to be sampled, so the skip can be as small as a single frame). In other words, between any two consecutive frames in the generated sequence we have a random number of skipped frames, each chosen independently from a fixed set of possible skips. This transformation creates a warping of the temporal dimension by varying the playback speed from frame to frame.
To construct the index sequence we first sample the D − 1 random frame skips and then accumulate them, starting from a random initial index τ.
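A sketch of this sampler; the maximum per-step skip is our assumption (set here to 8), since the exact skip set is not reproduced above:

```python
import numpy as np

def warp_indices(num_video_frames, clip_len=16, max_skip=8, rng=np.random):
    """Forward order with an independent random skip between consecutive frames.
    The video must contain more than (clip_len - 1) * max_skip frames."""
    skips = rng.randint(1, max_skip + 1, size=clip_len - 1)   # frames advanced at each step
    rel = np.concatenate([[0], np.cumsum(skips)])
    start = rng.randint(0, num_video_frames - rel[-1])
    return start + rel

print(warp_indices(200))
```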
Let F denote our network, and let us denote with p^m (motion) and p^s (speed) its two softmax outputs (see Fig. 2). To train F we optimize the following loss

L(x) = − log p^m(x_T)_m − log p^s(x_T)_s,     (1)

where x is a video sample, the sub-index T denotes the set of sampled frame indices, m is the applied transformation type and s the corresponding speed. This loss is the cross entropy both for motion and speed classification (see Fig. 2).
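A minimal PyTorch sketch of this two-head objective (the class and head names are placeholders, not the released code; depending on the configuration in Table 1, the speed term may be dropped or applied only to samples where the speed is well defined):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformationClassifier(nn.Module):
    def __init__(self, backbone, feat_dim, num_transforms=4, num_speeds=4):
        super().__init__()
        self.backbone = backbone                      # any 3D-CNN mapping clips to (B, feat_dim)
        self.motion_head = nn.Linear(feat_dim, num_transforms)
        self.speed_head = nn.Linear(feat_dim, num_speeds)

    def forward(self, clips):                         # clips: (B, C, T, H, W)
        f = self.backbone(clips)
        return self.motion_head(f), self.speed_head(f)

def pretext_loss(motion_logits, speed_logits, motion_labels, speed_labels):
    """Cross-entropy on the transformation type plus cross-entropy on the playback speed."""
    return F.cross_entropy(motion_logits, motion_labels) + \
           F.cross_entropy(speed_logits, speed_labels)
```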
Following prior work [47], we use the smaller variant of the C3D architecture [44] for the 3D-CNN transformation classifier in most of our experiments. Training was performed using the AdamW optimizer [31] with weight decay. The initial learning rate was set separately for pre-training and for transfer learning, and was decayed over the course of training using cosine annealing [30] in both phases. We use batch normalization [19] in all but the last layer. Mini-batches are constructed such that all the different coarse time warp types are included for each sampled training video. The batch size is set to 28 examples (including all the transformed sequences). The speed type is uniformly sampled from all the considered speed types. Since not all the videos allow a sampling of all speed types (due to their short duration), we limit the speed type range to the maximal possible speed type in those examples. We use the standard pre-processing for the C3D network: video frames are first resized to 128×171 pixels, from which we extract random crops of size 112×112 pixels. We also apply random horizontal flipping of the video frames during training. We use only the raw unfiltered RGB video frames as input to the motion classifier and do not make use of optical flow or other auxiliary signals.
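A sketch of this optimization setup in PyTorch; the learning rate and weight decay below are placeholder values (the paper's exact values are not reproduced here), and the linear module stands in for the 3D-CNN:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 4)                          # stand-in for the 3D-CNN classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)  # placeholder values
epochs = 50                                        # e.g. the UCF101 pre-training schedule
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... iterate over mini-batches that contain every transformation of each sampled video ...
    scheduler.step()                               # cosine-annealed learning rate decay
```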
Pre-Training Signal | Speed loss | UCF101 (conv frozen) | HMDB51 (conv fine-tuned)
---|---|---|---
Action Labels UCF101 | - | 60.7% | 28.8%
Speed | YES | 49.3% | 32.5%
Speed + Random | NO | 44.5% | 31.7%
Speed + Periodic | NO | 40.6% | 29.5%
Speed + Warp | NO | 43.5% | 32.6%
Speed + Random | YES | 55.1% | 33.2%
Speed + Periodic | YES | 56.5% | 36.1%
Speed + Warp | YES | 55.8% | 36.9%
Speed + Random + Periodic | NO | 47.4% | 30.1%
Speed + Random + Warp | NO | 54.8% | 36.6%
Speed + Periodic + Warp | NO | 50.6% | 36.4%
Speed + Random + Periodic | YES | 60.0% | 37.1%
Speed + Random + Warp | YES | 60.4% | 39.2%
Speed + Periodic + Warp | YES | 59.5% | 39.0%
Speed + Random + Periodic + Warp | NO | 54.2% | 34.9%
Speed + Random + Periodic + Warp | YES | 60.6% | 38.0%
Datasets and Evaluation. In our experiments we consider three datasets. Kinetics [55] is a large human action dataset consisting of around 500K videos. Video clips are collected from YouTube and span 600 human action classes. We use the training split for self-supervised pre-training. UCF101 [41] contains around 13K video clips spanning 101 human action classes. HMDB51 [26] contains around 5K videos belonging to 51 action classes. Both UCF101 and HMDB51 come with three pre-defined train and test splits. We report the average performance over all splits for transfer learning experiments. We use UCF101 train split 1 for self-supervised pre-training. For transfer learning experiments we skip 3 frames corresponding to transformation Speed 2. For the evaluation of action recognition classifiers in transfer experiments we use as prediction the maximum class probability averaged over all center-cropped sub-sequences for each test video. More details are provided in the supplementary material.
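As a small illustration of this evaluation rule, the video-level prediction can be computed as the arg-max of the averaged clip probabilities (a sketch; the array layout is our assumption):

```python
import numpy as np

def video_prediction(clip_probs):
    """clip_probs: (num_clips, num_classes) softmax outputs of the center-cropped
    sub-sequences of one test video. Returns the predicted class index."""
    return int(np.mean(clip_probs, axis=0).argmax())
```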
Understanding the Impact of the Temporal Transformations.
We perform ablation experiments on UCF101 and HMDB51 where we vary the number of different temporal transformations the 3D-CNN is trained to distinguish. The 3D-CNN is pre-trained for 50 epochs on UCF101 with our self-supervised learning task. We then perform transfer learning for action recognition on UCF101 and HMDB51. On UCF101 we freeze the weights of the convolutional layers and train three randomly initialized fully-connected layers for action recognition. This experiment treats the transformation classifier as a fixed video feature extractor. On HMDB51 we fine-tune the whole network including convolutional layers on the target task. This experiment therefore measures the quality of the network initialization obtained through self-supervised pre-training. In both cases we again train for 50 epochs on the action recognition task. The results of the ablations are summarized in Table
1. For reference we also report the performance of network weights learned through supervised pre-training on UCF101.

We observe that when considering the impact of a single transformation across different cases, the types Warp and Speed achieve the best transfer performance. With the same analysis, the transformation Random leads to the worst transfer performance on average. We observe that Random is also the easiest transformation to detect (based on training performance – not reported). As can be seen in Fig. 1 (e), this transformation can lead to drastic differences between consecutive frames. Such examples can therefore be easily detected by only comparing pairs of adjacent frames. In contrast, the motion type Warp cannot be distinguished based solely on two adjacent frames and requires modelling long-range dynamics. We also observe that distinguishing a larger number of transformations generally leads to an increase in the transfer performance. The effect of the speed type classification is quite noticeable. It leads to a very significant transfer performance increase in all cases. This is also the most difficult pseudo-task (based on the training performance – not reported). Recognizing the speed of an action is indeed challenging, since different action classes naturally exhibit widely different motion speeds (e.g., “applying make-up” vs. “biking”). This task might often require a deeper understanding of the physics and objects involved in the video.
Method | Ref | Network | Train Dataset | UCF101 | HMDB51 |
---|---|---|---|---|---|
Shuffle&Learn [33] | [33] | AlexNet | UCF101 | 50.2% | 18.1% |
O3N [12] | [12] | AlexNet | UCF101 | 60.3% | 32.5% |
AoT [50] | [50] | VGG-16 | UCF101 | 78.1% | - |
OPN [28] | [28] | VGG-M-2048 | UCF101 | 59.8% | 23.8% |
DPC [15] | [15] | 3D-ResNet34 | Kinetics | 75.7% | 35.7% |
SpeedNet [4] | [4] | S3D-G | Kinetics | 81.1% | 48.8% |
AVTS [25] (RGB+audio) | [25] | MC3 | Kinetics | 85.8% | 56.9% |
Shuffle&Learn [33]* | - | C3D | UCF101 | 55.8% | 25.4% |
3D-RotNet [23]* | - | C3D | UCF101 | 60.6% | 27.3% |
Clip Order [51] | [51] | C3D | UCF101 | 65.6% | 28.4% |
Spatio-Temp [47] | [47] | C3D | UCF101 | 58.8% | 32.6% |
Spatio-Temp [47] | [47] | C3D | Kinetics | 61.2% | 33.4% |
3D ST-puzzle [24] | [24] | C3D | Kinetics | 60.6% | 28.3% |
Ours | - | C3D | UCF101 | 68.3% | 38.4% |
Ours | - | C3D | Kinetics | 69.9% | 39.6% |
3D ST-puzzle [24] | [24] | 3D-ResNet18 | Kinetics | 65.8% | 33.7% |
3D RotNet [23] | [23] | 3D-ResNet18 | Kinetics | 66.0% | 37.1% |
DPC [15] | [15] | 3D-ResNet18 | Kinetics | 68.2% | 34.5% |
Ours | - | 3D-ResNet18 | UCF101 | 77.3% | 47.5% |
Ours | - | 3D-ResNet18 | Kinetics | 79.3% | 49.8% |
Clip Order [51] | [51] | R(2+1)D | UCF101 | 72.4% | 30.9% |
PRP [52] | [52] | R(2+1)D | UCF101 | 72.1% | 35.0% |
Ours | - | R(2+1)D | UCF101 | 81.6% | 46.4% |
Notice also that our pre-training strategy leads to a better transfer performance on HMDB51 than supervised pre-training using action labels. This suggests that the video dynamics learned through our pre-training generalize well to action recognition and that such dynamics are not well captured through supervised action recognition alone.
Transfer to UCF101 and HMDB51.
We compare to prior work on self-supervised video representation learning in Table 2. A fair comparison to much of the prior work is difficult due to the use of very different network architectures as well as training and transfer settings. We opted to compare with some commonly used network architectures (i.e., C3D, 3D-ResNet, and R(2+1)D) and re-implemented two prior works [33] and [23] using C3D. We performed self-supervised pre-training on UCF101 and Kinetics. C3D is pre-trained for 100 epochs on UCF101 and for 15 epochs on Kinetics. 3D-ResNet and R(2+1)D are both pre-trained for 200 epochs on UCF101 and for 15 epochs on Kinetics. We fine-tune all the layers for action recognition. Fine-tuning is performed for 75 epochs using C3D and for 150 epochs with the other architectures.
When pre-training on UCF101 our features outperform prior work on the same network architectures. Pre-training on Kinetics leads to an improvement in transfer in all cases.
Long-Range vs Short-Range Temporal Statistics. To illustrate how well our video representations capture motion, we transfer them to other pseudo-tasks that focus on the temporal evolution of a video. One task is the classification of the synchronization of video pairs, i.e., how many frames one video is delayed with respect to the other. A second task is the classification of two videos into which one comes first temporally. These two tasks are illustrated in Fig. 3. In the same spirit, we also evaluate our features on other tasks and data sets and we report the results at our project page https://sjenni.github.io/temporal-ssl.
For the synchronization task, two temporally overlapping video sequences x1 and x2 are separately fed to the pre-trained C3D network to extract features at the conv5 layer. These features are then fused and fed as input to a randomly initialized classifier consisting of three fully-connected layers trained to classify the temporal offset between the two sequences. We consider random offsets between the two video sequences in the range -6 to +6. For the second task we construct a single input sequence by sampling two non-overlapping 8-frame sub-sequences x1 and x2, where x1 comes before x2. The network input is then either (x1, x2) for the class “before” or (x2, x1) for the class “after”. We reinitialize the fully-connected layers in this case as well.
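A possible construction of the inputs for these two probing tasks (a sketch under our assumptions; the fusion of the two conv5 features and the exact sampling ranges are not specified above):

```python
import numpy as np

def sync_pair(video, clip_len=16, max_offset=6, rng=np.random):
    """Two temporally overlapping clips; the offset (in frames) is the classification target.
    The video must be longer than clip_len + 2 * max_offset frames."""
    offset = rng.randint(-max_offset, max_offset + 1)
    start = rng.randint(abs(offset), len(video) - clip_len - abs(offset))
    x1 = video[start:start + clip_len]
    x2 = video[start + offset:start + offset + clip_len]
    return x1, x2, offset

def before_after_pair(video, clip_len=8, rng=np.random):
    """A single 16-frame input made of two non-overlapping 8-frame clips; the label
    encodes whether the clips appear in their natural temporal order."""
    s1 = rng.randint(0, len(video) - 2 * clip_len)
    s2 = rng.randint(s1 + clip_len, len(video) - clip_len + 1)
    x1, x2 = video[s1:s1 + clip_len], video[s2:s2 + clip_len]
    if rng.rand() < 0.5:
        return np.concatenate([x1, x2]), 0          # class "before"
    return np.concatenate([x2, x1]), 1              # class "after"
```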
Method | Sync. Accuracy | Sync. MAE | Before-After Accuracy
---|---|---|---
Action Labels (UCF101) | 36.7% | 1.85 | 66.6%
3D-RotNet [23]* | 28.0% | 2.84 | 57.8%
Shuffle&Learn [33]* | 39.0% | 1.89 | 69.8%
Ours | 42.4% | 1.61 | 76.9%
Fig. 4: The first row in each block corresponds to the input video. Rows two and three show the output of our adaptation of Guided Backpropagation [42] when applied to a network trained through self-supervised learning and supervised learning respectively. In all three cases we observe that the self-supervised network focuses on image regions of moving objects or persons. In (a) we can also observe how long-range dynamics are being detected by the self-supervised model. The supervised model, on the other hand, focuses a lot on static frame features in the background.

In Table 3 we compare the performance of different pre-training strategies on the time-related pseudo-tasks. We see that our self-supervised features perform better at these tasks than supervised features and other self-supervised features, thus showing that they capture well the temporal dynamics in the videos.
Visualization. What are the attributes, factors or features of the videos that self-supervised and supervised models are extracting to perform the final classification? To examine what the self-supervised and supervised models focus on, we apply Guided Backpropagation [42]. This method allows us to visualize which part of the input has the most impact on the final decision of the model. We slightly modify the procedure by subtracting the median values from every frame of the gradient video and by taking the absolute value of the result. We visualize the pre-trained self-supervised and supervised models on several test samples from UCF101. As one can see in Fig. 4, a model pre-trained on our self-supervised task tends to ignore the background and focuses on persons performing an action and on moving objects. Models trained with supervised learning on the other hand tend to focus more on the appearance of foreground and background. Another observation we make is that the self-supervised model identifies the location of moving objects/people in past and future frames. This is visible in row number 2 of blocks (a) and (c) of Fig. 4, where the network tracks the possible locations of the moving ping-pong and billiard balls respectively.
A possible explanation for this observation is that our self-supervised task only encourages the learning of dynamics. The appearance of non-moving objects or static backgrounds are not useful to solve the pretext task and are thus ignored.
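The modified Guided Backpropagation post-processing described above is simple to reproduce; a sketch (the array layout is our assumption):

```python
import numpy as np

def postprocess_gradients(grad_video):
    """grad_video: (T, H, W, C) Guided Backpropagation output for one input clip.
    Subtract the per-frame median and take the absolute value."""
    med = np.median(grad_video, axis=(1, 2, 3), keepdims=True)   # one median per frame
    return np.abs(grad_video - med)
```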
Learning Dynamics vs. Frame Features. The visualizations in Fig. 4 indicate that features learned through motion discrimination focus on the dynamics in videos and not so much on static content present in single frames (e.g., background) when compared to supervised features. To further investigate how much the features learned through the two pre-training strategies rely on motion, we performed experiments where we remove all the dynamics from videos. To this end, we create input videos by replicating a single frame 16 times (resulting in a still video) and train the three fully-connected layers on conv5 features for action classification on UCF101. Features obtained through supervised pre-training achieve an accuracy of 18.5% (vs. 56.5% with dynamics) and features from our self-supervised task achieve 1.0% (vs. 58.1%). Although the setup in this experiment is somewhat contrived (since the input domain is altered) it still illustrates that our features rely almost exclusively on motion instead of features present in single frames. This can be advantageous since motion features might generalize better to variations in the background appearance in many cases.
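The still-video control experiment amounts to repeating one frame along the temporal axis; a sketch (which frame is replicated is our choice, not specified above):

```python
import numpy as np

def make_still_video(video, clip_len=16, rng=np.random):
    """Replicate a single randomly chosen frame clip_len times, removing all motion."""
    frame = video[rng.randint(len(video))]
    return np.repeat(frame[None], clip_len, axis=0)
```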
Nearest-Neighbor Evaluation.
We perform an additional quantitative evaluation of the learned video representations via nearest-neighbor retrieval. The features are obtained by training a 3D-ResNet18 network on Kinetics with our pseudo-task and are taken as the output of the global average pooling layer, which corresponds to a vector of size 512. For each video we extract and average the features of 10 temporal crops. To perform the nearest-neighbor retrieval, we first normalize the features using the training set statistics. Cosine similarity is used as the metric to determine the nearest neighbors. We follow the evaluation proposed by
[6] on UCF101. Query videos are taken from test split 1 and all the videos of train split 1 are considered as retrieval targets. A query is considered correctly classified if its k nearest neighbors contain at least one video of the correct class (i.e., the same class as the query). We report the mean accuracy for different values of k and compare to prior work in Table 4. Our features achieve state-of-the-art performance. A sketch of this retrieval protocol is given below Table 4.

Method | Network | Top1 | Top5 | Top10 | Top20 | Top50
---|---|---|---|---|---|---
Jigsaw [34] | AlexNet | 19.7 | 28.5 | 33.5 | 40.0 | 49.4 |
OPN [28] | AlexNet | 19.9 | 28.7 | 34.0 | 40.6 | 51.6 |
Büchler et al. [6] | AlexNet | 25.7 | 36.2 | 42.2 | 49.2 | 59.5 |
Clip Order [51] | R3D | 14.1 | 30.3 | 40.0 | 51.1 | 66.5 |
SpeedNet [4] | S3D-G | 13.0 | 28.1 | 37.5 | 49.5 | 65.0 |
PRP [52] | R3D | 22.8 | 38.5 | 46.7 | 55.2 | 69.1 |
Ours | 3D-ResNet18 | 26.1 | 48.5 | 59.1 | 69.6 | 82.8 |
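A sketch of the retrieval evaluation described before Table 4; how the “training set statistics” normalize the features is our assumption (mean/std standardization followed by L2 normalization for cosine similarity):

```python
import numpy as np

def topk_retrieval_accuracy(train_feats, train_labels, test_feats, test_labels, k):
    """Fraction of query (test) videos whose k nearest training neighbors, under cosine
    similarity, contain at least one video of the query's class. Inputs are NumPy arrays."""
    mean, std = train_feats.mean(0), train_feats.std(0) + 1e-8
    tr = (train_feats - mean) / std
    te = (test_feats - mean) / std
    tr /= np.linalg.norm(tr, axis=1, keepdims=True)
    te /= np.linalg.norm(te, axis=1, keepdims=True)
    sims = te @ tr.T                                  # (num_test, num_train) cosine similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]         # indices of the k nearest training videos
    hits = (train_labels[nn_idx] == test_labels[:, None]).any(axis=1)
    return hits.mean()
```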
Qualitative Nearest-Neighbor Results. We show some examples of nearest-neighbor retrievals in Fig. 5. Frames from the query test video are shown in the leftmost block of three columns. The second and third blocks of three columns show the top two nearest neighbors from the training set. We observe that the retrieved examples often capture the semantics of the query well. This is the case even when the action classes do not agree (e.g., last row).
We have introduced a novel task for the self-supervised learning of video representations by distinguishing between different types of temporal transformations. This learning task is based on the principle that recognizing a transformation of time requires an accurate model of the underlying natural video dynamics. This idea is supported by experiments demonstrating that features learned by distinguishing time transformations capture video dynamics better than those obtained through supervised learning, and that such features generalize well to classic vision tasks such as action recognition and to time-related tasks such as video synchronization.
Acknowledgements. This work was supported by grants 169622&165845 of the Swiss National Science Foundation.
SpeedNet: learning the speediness in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9922–9931.
Improving spatiotemporal self-supervision by deep reinforcement learning. arXiv preprint arXiv:1807.11293.
Self-supervised video representation learning with odd-one-out networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 5729–5738.
Geometry guided convolutional neural networks for self-supervised video representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5589–5597.
Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555.
Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8545–8552.
SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
Split-brain autoencoders: unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1058–1067.