Implicit Label Augmentation on Partially Annotated Clips via Temporally-Adaptive Features Learning

by Yongxi Lu, et al.
University of California, San Diego

Partially annotated clips contain rich temporal contexts that can complement the sparse key frame annotations in providing supervision for model training. We present a novel paradigm called Temporally-Adaptive Features (TAF) learning that can utilize such data to learn better single frame models. By imposing distinct temporal change rate constraints on different factors in the model, TAF enables learning from unlabeled frames using context to enhance model accuracy. TAF generalizes "slow feature" learning and we present much stronger empirical evidence than prior works, showing convincing gains for the challenging semantic segmentation task over a variety of architecture designs and on two popular datasets. TAF can be interpreted as an implicit label augmentation method but is a more principled formulation compared to existing explicit augmentation techniques. Our work thus connects two promising methods that utilize partially annotated clips for single frame model training and can inspire future explorations in this direction.



1 Introduction

The success of modern machine learning techniques in solving challenging problems such as image recognition depends on the availability of large-scale, well-annotated datasets. Unfortunately, the most complex and useful tasks (e.g. semantic segmentation) are usually also the ones that require the most labeling effort. This is arguably a major obstacle for large-scale applications to real-world scenarios, such as autonomous driving, where model performance is critical due to safety concerns. In this work, we focus on methods that can utilize partially annotated clip data, more precisely short video sequences with annotations only at key frames, to improve model performance. Datasets in this format are natural byproducts of typical data collection procedures. From clips, a large number of unlabeled frames is available at virtually no additional cost, and although these frames carry no labels, they encode rich temporal contexts useful for training more accurate models. Fully utilizing partially annotated clips in learning is an interesting problem not only for its practical relevance, but also because it provides partial answers to an interesting scientific question: humans can naturally learn from the continuous evolution of sensing signals without many “labels”; can machines do the same?

We investigate a particularly intriguing case: training a model that benefits from temporal information during training but is used to make predictions on independent frames at inference. This is in contrast to video prediction models Jampani2017VideoPN ; 8296851 ; 10.1007/978-3-319-54407-6_33 ; Nilsson2018SemanticVS ; Gadde2017SemanticVC ; Wang_2015_ICCV ; Srivastava:2015:ULV:3045118.3045209 ; 8237857 ; Mathieu2016DeepMV ; 10.1007/978-3-319-46478-7_51 , where video clips are used at both training and inference. The main intuition of our approach is to decouple fast-changing factors and slow-changing factors in data. Fast-changing factors reflect rapid temporal dynamics and can only be learned from a labeled frame or its immediate neighbors, while slow-changing factors can be learned from data points within a larger temporal context. Our method utilizes the temporal context provided by the partially annotated clips to learn better features without diminishing the ability to learn fine-grained features with rapid temporal changes. This is achieved by allowing different parts of the model to adapt to distinct temporal change rates in data, a.k.a. Temporally Adaptive Features (TAF) learning. We propose a principled approach to formalize this intuition by introducing temporal change rate constraints in the learning problem and show that the resultant optimization problem can be efficiently approximated by a feature swapping procedure with contrastive loss. The TAF paradigm generalizes the well-motivated “slow feature” learning methods 7410822 ; Wiskott:2002:SFA:638940.638941 for self-supervised learning. In this regard, ours is the first to demonstrate significant empirical gains on a challenging real-world application via imposing temporal coherence regularization. It can also be seen as a form of implicit label augmentation and is related to explicit pseudo label generation techniques Mustikovela2016CanGT ; 8265246 ; Zhu2018ImprovingSS ; 8206371 , which also show promising improvements in practice. Ours, however, is a more principled treatment that handles the important issue of label uncertainty automatically. Interestingly, our work is the first to combine these two seemingly unrelated lines of research. It thus sheds new light on the theory and practice of the important problem of learning from partially annotated clips and can benefit future explorations on this topic.

The TAF framework can in theory be applied to any recognition task with partially annotated clip data. However, the advantage of doing so will depend on the task. We identify semantic segmentation, the task of assigning class labels to every pixel in an image, as a good test case due to the necessity of multi-scale modeling. Natural images usually feature structures with a great variety of sizes, functions and perspectives. This results in different intrinsic spatial and temporal change rates of different structures. A useful semantic segmentation model needs to provide comprehensive understanding of all these different structures. TAF can address this challenge by allowing different parts of the model to learn features with varying temporal change rates, rather than forcing all the features to vary slowly, as is the case in slow feature learning 7410822 ; Wiskott:2002:SFA:638940.638941 . Beyond this particular task, semantic segmentation is also a good example of the broader set of “dense prediction tasks” in computer vision, such as object detection Alpher19 ; liu2016ssd ; 7780627 ; 8099589 ; 8237586 ; Law_2018_ECCV ; 8237584 , pose estimation 6909610 , monocular depth estimation 8100183 ; Zou_2018_ECCV ; Yin2018GeoNetUL ; Godard2017UnsupervisedMD ; Garg2016UnsupervisedCF ; Fu2018DeepOR , instance segmentation 8237584 ; Alpher19c ; Alpher19d ; Alpher19e as well as panoptic segmentation DBLP:journals/corr/abs-1801-00868 ; DBLP:journals/corr/abs-1901-03784 , to name a few. Dense prediction tasks all share the key properties of laborious annotation and multi-scale features, so it is likely that our findings from semantic segmentation can directly benefit these tasks.

This paper is organized as follows. Section 2 presents our method. Section 3 compares our method to related works. Section 4 presents our empirical findings and ablation studies. Section 5 concludes the paper and discusses future directions.

2 Methods

We first introduce notations useful to our presentation. We denote the dataset as . Each input and its associated labels can be finely indexed as , where denotes the clip index and the time index within the clip, respectively. Whenever it is clear from the context, we use to denote input-label tuples at time for any particular clip.

2.1 Temporally Adaptive Feature Learning

Our method decouples the fast and slow changing factors in data by forcing the model to learn features that are adaptive to the varying temporal change rates. To be applicable to our framework, we assume the labeling function can be factorized as . We can quantify how fast the labeling function changes w.r.t. time by taking its time derivative.


Note that can be seen as a function with -dimensional input where represents one of its dimensions. quantifies the variation of w.r.t. time through this dimension. The “fast” and “slow” factors are characterized by the degree to which they contribute to temporal variations in the predictive model . To instantiate this idea, our TAF framework solves the following empirical risk minimization problem with temporal change rate constraints.

subject to (2b)
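A sketch of these two steps under an assumed notation: suppose the factorized model is written as $f(x) = g(h_1(x), \dots, h_K(x))$, with $g$ an aggregation function and $h_k$ the $k$-th feature. The symbols $g$, $h_k$, $K$, the per-feature bound $\epsilon_k$, the labeled subset $\mathcal{D}_L$ and the task loss $\ell$ are illustrative choices made here. The chain rule then splits the temporal change of the prediction into one term per feature, and the constrained problem bounds each term separately.

```latex
% Chain rule: each feature h_k contributes one term to the temporal change of f.
\frac{\mathrm{d}}{\mathrm{d}t} f(x_t)
  = \sum_{k=1}^{K} \frac{\partial g}{\partial h_k}\,
    \frac{\mathrm{d} h_k(x_t)}{\mathrm{d}t}

% Empirical risk minimization with one change-rate bound per feature.
\min_{g,\,h_1,\dots,h_K} \;
  \frac{1}{|\mathcal{D}_L|} \sum_{(x,\,y)\,\in\,\mathcal{D}_L} \ell\big(f(x),\,y\big)
\quad \text{subject to} \quad
\left\| \frac{\partial g}{\partial h_k}\,
        \frac{\mathrm{d} h_k(x_t)}{\mathrm{d}t} \right\| \le \epsilon_k,
\qquad k = 1,\dots,K
```

When $h_k$ is vector-valued, $\partial g / \partial h_k$ should be read as a Jacobian and the product as a matrix-vector product.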

For a differentiable model, the analytical form of the constraints is available if is provided. In applications where is a high dimensional vector (such as an image), this may not be possible. Thus, we propose to use a first-order finite difference to approximate the constraints via neighboring samples.
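One way such a finite-difference approximation can be written, assuming the model has the form $f = g(h_1, \dots, h_K)$ (hypothetical notation) and that $x_t$ and $x_{t+\Delta t}$ are two frames of the same clip: replacing only the $k$-th feature by its value at the neighboring frame isolates the contribution of that feature.

```latex
\frac{\partial g}{\partial h_k}\,\frac{\mathrm{d} h_k(x_t)}{\mathrm{d}t}
\;\approx\;
\frac{ g\big(h_1(x_t),\dots,h_k(x_{t+\Delta t}),\dots,h_K(x_t)\big)
     - g\big(h_1(x_t),\dots,h_k(x_t),\dots,h_K(x_t)\big) }{\Delta t}
```

The numerator is the change in the prediction when feature $k$ alone is swapped between the two frames, which matches the feature swapping procedure described in the introduction. The approximation follows from a first-order Taylor expansion of $g$ in its $k$-th argument and of $h_k$ in time, so it is only accurate for small $\Delta t$.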


The constrained optimization problem itself is difficult to solve. We can convert it into an unconstrained optimization problem with the following regularization term, which approximates the original constraints. This permits the use of gradient-based solvers if the model is differentiable.


The proposed regularization is defined on any pair of samples separated by a known interval , even if their labels are unknown. This construction thus enables learning from unlabeled data. TAF learning is limited to short clips, as the first-order approximation is valid only for small . The slack term promotes features that adapt to a distinct temporal change rate. When is small, that dimension is forced to model slow-changing factors shared within a large temporal context, which is an implicit form of data augmentation. The dimensions with large , on the other hand, can still model rapid motions in data that are important for the task. As we will discuss in the ablation studies and supplementary materials, and are important hyper-parameters.
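A minimal sketch of such a regularization term for a single pair of frames, assuming it takes a hinge form: the change in the prediction caused by swapping one feature is penalized only when it exceeds a slack that grows linearly with the time gap. The function names, the mean absolute difference, and the toy aggregation function are illustrative assumptions.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def swap_penalty(
    aggregate: Callable[[Sequence[Vector]], Vector],
    feats_key: Sequence[Vector],
    feats_pair: Sequence[Vector],
    k: int,
    eps_k: float,
    dt: float,
) -> float:
    """Hinge penalty on the prediction change caused by swapping feature k."""
    swapped = list(feats_key)
    swapped[k] = feats_pair[k]  # take feature k from the paired frame
    change = mean_abs_diff(aggregate(swapped), aggregate(feats_key))
    slack = eps_k * abs(dt)  # tolerated change grows with the time gap (assumed form)
    return max(0.0, change - slack)


def add_features(feats: Sequence[Vector]) -> Vector:
    """Toy aggregation function: element-wise sum of all features."""
    return [sum(vals) for vals in zip(*feats)]


key = [[1.0, 2.0], [0.5, 0.5]]   # two features of the key frame
pair = [[1.5, 2.5], [0.5, 0.5]]  # the same two features of a neighboring frame
tight = swap_penalty(add_features, key, pair, k=0, eps_k=0.0, dt=1.0)
loose = swap_penalty(add_features, key, pair, k=0, eps_k=1.0, dt=1.0)
```

With a zero rate bound every change is penalized (`tight` is positive), while a large bound leaves the feature unconstrained (`loose` is zero), mirroring the slow and fast dimensions discussed above. No labels are needed for either frame.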

As a remark on related methods, we note that Eqn. 4 generalizes the temporal coherence regularization in 7410822 to multiple change rates, making it more suitable for real-world applications such as semantic segmentation where multi-scale features are essential. This regularization can also be seen as implicitly assuming constant labels across the entire clip, with the slack term acknowledging the uncertainty introduced by this approximation. In this regard, measures the growth rate of uncertainty in time of the implicit pseudo labels. Prior work suggests that properly modeling the uncertainty in pseudo labels is important for the final task performance Mustikovela2016CanGT . While TAF learning is motivated differently, it leads to a similar construction.

Finally, denotes the subset of the data with annotations, the half-length of each clip is , is the sampling period and denotes the uniform distribution defined on integers. The regularized optimization problem becomes

Figure 1: The sampling procedure of TAF learning for a single pair of images, in our implementation for semantic segmentation where are defined on features extracted from a shared backbone.

2.2 Efficient Frame Sampling

The proposed objective function in Eqn. 5 is computationally inefficient when combined with mini-batch SGD or its variants. First, the computation of the sample averages requires two separate sampling streams: one for the key frames with annotations for the loss function, and the other for the regularization term using pairs of frames. In general, there is no guaranteed overlap between these two streams of samples. As a result, we usually cannot use features computed from an image to update both terms. This inefficiency is exacerbated by the fact that the regularizer requires paired inputs, making the training even less efficient. Second, the regularization term requires separate feature exchanges for each feature dimension. When is large and the model decoupling requires re-computation of a significant portion of the model, this strategy is highly inefficient.

Our proposal is as follows: Within each mini-batch, a set of image-label tuples is first sampled from the annotated key frame subset . Each of these tuples is associated with a clip. Then, for each sampled key frame, a random (unlabeled) pairing image is selected from the same clip by sampling the index difference between the key frame and the unlabeled pairing frame. In this improved procedure, all feature computations contribute to all terms in the objective function. To further make use of cached features, the regularization term is also made symmetric. To ensure tractable mini-batch updates, the summation over the dimensions is replaced by a uniform sampling of the dimension index at each training example. The efficient TAF procedure solves the following problem


where we simplify the notation by assuming that the key frame is always at the center of each clip. Figure 1 illustrates how the training objective is computed between a pair of sampled images.
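The sampling steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact training code: the names `PairSample` and `sample_minibatch` are hypothetical, the key frame is placed at the clip center as in the text, and a zero index difference is excluded so that a frame is never paired with itself.

```python
import random
from typing import List, NamedTuple


class PairSample(NamedTuple):
    clip: int        # index of the clip the key frame belongs to
    key_frame: int   # frame index of the annotated key frame
    pair_frame: int  # frame index of the unlabeled pairing frame
    dim: int         # feature dimension (branch) chosen for swapping


def sample_minibatch(
    labeled_clips: List[int],
    batch_size: int,
    half_length: int,
    period: int,
    num_dims: int,
    rng: random.Random,
) -> List[PairSample]:
    """Draw key frames, then one pairing frame and one dimension for each."""
    offsets = [d for d in range(-half_length, half_length + 1) if d != 0]
    batch = []
    for clip in rng.sample(labeled_clips, batch_size):
        key_frame = half_length * period  # key frame assumed at the clip center
        offset = rng.choice(offsets)      # uniform index difference, zero excluded
        dim = rng.randrange(num_dims)     # uniform over feature dimensions
        pair_frame = key_frame + offset * period
        batch.append(PairSample(clip, key_frame, pair_frame, dim))
    return batch


batch = sample_minibatch(
    labeled_clips=list(range(10)),
    batch_size=4,
    half_length=3,
    period=2,
    num_dims=3,
    rng=random.Random(0),
)
```

Every sampled key frame brings exactly one pairing frame and one swapped dimension, so features computed for a mini-batch serve both the supervised loss and the regularizer.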

TAF learning requires twice as many training iterations as the baseline, as only half of each mini-batch has ground truth labels. In order to compute the pair-wise loss, the aggregation function has to be evaluated twice in each forward pass (the first time using the original features, the second time using features after swapping). Thus for tractable training, should be chosen to be a lightweight function. Figure 1 illustrates the proposed sampling procedure in the application of semantic segmentation, where we use a Siamese network for the backbone feature extractor and apply the TAF procedure only at the encoder layers (details in Section 2.3).

2.3 Application to Semantic Segmentation

We now provide a brief overview of two of the most popular semantic segmentation models and explain how our framework can be applied. The multi-branch structure that enables TAF learning for FCNs and DeepLab v3+ is used in a broader set of architectures for semantic segmentation 8100143 ; 8099589 ; Yu2016MultiScaleCA ; Chen2017RethinkingAC , and we expect similar modifications to be feasible.


Fully convolutional networks (FCNs) 7298965 are among the earliest and most popular deep-learning-based architectures for semantic segmentation. They follow a straightforward multi-scale design: Feature maps at the output of three different stages of a backbone convolutional network are extracted. Due to the downsampling operators between stages, the feature maps have a decreasing spatial resolution (in the case of FCN8s that we consider, the output strides equal 8, 16 and 32, respectively). The feature maps are converted into class logit maps via a single layer of convolutions. The three predictions are aggregated via a cascade of upsampling and addition operations. In our modification of FCNs we assign and to represent three feature maps, where represents the stride-32, stride-16 and stride-8 feature maps respectively. The aggregation function is the single convolution layer and the following cascaded addition operations. In practice, we find swapping is sufficient for improved accuracies over the baselines.
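The cascade of upsampling and addition, and the effect of swapping the coarsest branch, can be illustrated with a toy one-dimensional version. This is a sketch under simplifying assumptions: each map has a single channel, the per-branch convolution layer is omitted (the inputs are treated as logits already), and nearest-neighbor upsampling stands in for whatever upsampling operator the real network uses. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = List[float]
Branches = Tuple[Row, Row, Row]  # (stride-32, stride-16, stride-8) maps


def upsample2(row: Sequence[float]) -> Row:
    """Nearest-neighbor 2x upsampling of a 1-D map."""
    return [v for v in row for _ in range(2)]


def add_maps(a: Sequence[float], b: Sequence[float]) -> Row:
    """Element-wise addition of two maps with the same resolution."""
    if len(a) != len(b):
        raise ValueError("maps must have the same resolution")
    return [x + y for x, y in zip(a, b)]


def aggregate(s32: Row, s16: Row, s8: Row) -> Row:
    """Cascade: upsample the coarse map, add the next finer one, repeat."""
    fused16 = add_maps(upsample2(s32), s16)
    return add_maps(upsample2(fused16), s8)


def aggregate_with_swap(key: Branches, pair: Branches) -> Row:
    """Key-frame prediction with the stride-32 map taken from the paired frame."""
    return aggregate(pair[0], key[1], key[2])


key: Branches = ([1.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
pair: Branches = ([2.0], [5.0, 5.0], [9.0, 9.0, 9.0, 9.0])
original = aggregate(*key)
swapped = aggregate_with_swap(key, pair)
```

Only the stride-32 map of the paired frame influences `swapped`. The difference between `original` and `swapped` is the quantity that the contrastive regularizer constrains.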

DeepLab v3+

DeepLab v3+ Chen2018EncoderDecoderWA is a recent semantic segmentation algorithm that has achieved state-of-the-art accuracy on challenging datasets such as Pascal VOC and Cityscapes. It follows an encoder-decoder structure, where the encoder is an ASPP module 7913730 that consists of five branches with different receptive fields (modeling structures at different scales): four branches with varying dilation rates and an additional image pooling branch. Similar in spirit to the FCNs case, we assign to the image pooling branch, and to the remaining branches starting from the one with the largest dilation rate. The decoder is the aggregation function in our formulation.

3 Related Works

Regularization and Data Augmentation in Deep Neural Networks

There is a rich literature of generic regularization and data augmentation techniques sharing our goal of improving generalization, e.g. norm regularization NIPS1991_563 ; Ng:2004:FSL:1015330.1015435 , reduction of co-adaptation JMLR:v15:srivastava14a ; Wan:2013:RNN:3042817.3043055 ; devries2017cutout and pooling 6144164 ; pmlr-v51-lee16a ; 8099909 . For semantic segmentation, data augmentation techniques based on simple image transformations (such as horizontal flipping, random cropping, random jittering, random scaling and rotation) are standard practice. Recent works DBLP:journals/corr/abs-1805-09501 ; Hauberg2016DreamingMD ; DBLP:journals/corr/abs-1902-09383 ; NIPS2017_6916 learn optimal transformations. These techniques are limited by not using video information but are complementary to our approach. We follow the default choice of regularization and random transformations when comparing TAF learning with corresponding baselines. Another effective solution is to use generative models for data and label synthesis 8099724 ; 8363576 ; Antoniou2018DataAG ; DBLP:journals/corr/abs-1810-10863 . Similar to ours, these methods can improve model accuracy using unlabeled data. However, it could be intrinsically difficult to generate realistic and diverse data for complicated applications, while our method can directly utilize the large amount of real video clips.

Single Image and Video Semantic Segmentation

We use semantic segmentation Chen2018EncoderDecoderWA ; 7913730 ; 7298965 ; 8100143 ; Yu2016MultiScaleCA ; Yu2017DilatedRN ; Chen2017RethinkingAC as an example to verify our method, as discussed in Section 2.3. Importantly, our method is quite different from the related literature of video semantic segmentation Jampani2017VideoPN ; 8296851 ; 10.1007/978-3-319-54407-6_33 ; Nilsson2018SemanticVS ; Gadde2017SemanticVC ; Wang_2015_ICCV ; Srivastava:2015:ULV:3045118.3045209 ; 8237857 ; Mathieu2016DeepMV ; 10.1007/978-3-319-46478-7_51 , where video clips are utilized at both training and inference. Our work learns models using video clips at training, but the model can be used on independent frames at inference. In semantic segmentation, unlabeled frames can be used via future frame predictions Srivastava:2015:ULV:3045118.3045209 ; 8237857 ; Mathieu2016DeepMV ; 10.1007/978-3-319-46478-7_51 ; Vondrick2015AnticipatingTF and label propagation Mustikovela2016CanGT ; 8265246 . The former has only been shown to improve video prediction results, not single image predictions (as expected, since future frame prediction is difficult from a single frame due to the lack of temporal context at test time). A few preliminary works suggest the latter can bring promising improvements to single frame predictions by generating pseudo labels Mustikovela2016CanGT ; 8265246 ; Zhu2018ImprovingSS ; 8206371 . But video propagation notably relies on manual screening and careful hyper-parameter tuning to reject low quality labels Mustikovela2016CanGT ; otherwise, including pseudo-labels can in some cases degrade performance 8265246 . Our method has the advantage of not requiring manual intervention. More importantly, our work suggests that regularizing the temporal behavior of features is an implicit form of video augmentation without explicit modeling of temporal dynamics, which compared to video propagation is a simpler pipeline and could be more transferable to other tasks.

Self-Supervised Learning

Self-supervised learning utilizes the large amount of unlabeled data via carefully designed “pretext” tasks or constraints that aim at capturing meaningful real-world invariance structures in the data. The goal is to learn more robust features. Future frame prediction Mathieu2016DeepMV ; 10.1007/978-3-319-46478-7_51 ; Vondrick2015AnticipatingTF ; 8237857 ; Srivastava:2015:ULV:3045118.3045209 , patch consistency via tracking Wang_2015_ICCV , transitive invariance xiaolong_iccv_17 , temporal order verification Misra2016ShuffleAL and motion consistency Jayaraman2017 ; 7780548 ; 8100121 have been proposed as useful pretext tasks. Recently, consistency across tasks has also been explored Ren2018CrossDomainSM ; Doersch2017MultitaskSV , although these methods do not consider videos. However, as pretext tasks usually differ from the target task, a separate transfer learning step is required. Ours in contrast can be used directly on the target task. Via imposing geometric constraints, several recent works use self-supervised learning to directly address real-world tasks, most notably in depth and motion predictions Garg2016UnsupervisedCF ; 8100183 ; Mahjourian_2018_CVPR ; DBLP:journals/corr/abs-1812-05642 ; Godard2017UnsupervisedMD ; Jiang_2018_ECCV ; Zou_2018_ECCV . However, these methods cannot transfer easily outside of their intended geometry application. Among them, SIGNet DBLP:journals/corr/abs-1812-05642 points to a unified framework for self-supervised learning of both semantic and geometric tasks which would broaden the applications of this line of work, but the existing work can only improve on geometric tasks. In contrast, TAF is not restricted to any particular task by design. Our TAF framework generalizes the temporal coherence regularization in 7410822 ; Wiskott:2002:SFA:638940.638941 to multiple change rates and is the first to validate the utility of this form of regularization on challenging real-world applications. In contrast, the prior works focus on theoretical insights and are not rigorously validated.

4 Experiments

4.1 Datasets and Evaluation Metrics

We test our approach on two widely-used datasets for semantic segmentation: Camvid BrostowSFC:ECCV08 and Cityscapes Cordts2016Cityscapes . The images of both datasets are frames captured from videos. Detailed annotations are provided on key frames. The meta-data of the datasets include the source frame ids of the annotated frames, which makes unlabeled frames within the same clips available. The availability of unlabeled frames in this clip format makes these two datasets ideal for testing our TAF framework. In particular, Camvid consists of 367 clips for training and 101/233 images for val/test. Key frames from the training and test sets are captured at 1Hz and annotated with 11 object classes. We capture extra frames around the key frames at 30Hz using the provided raw video. Cityscapes consists of 2975 training key frames and 500 validation images. The key frames are the 20th frames in the provided 30-frame clips (30Hz), annotated with 19 object classes. In the interest of fast experimentation and to test our methods on small datasets, we sample 20% and 50% of the clips from the Cityscapes training set, creating custom datasets with 595 and 1488 training clips respectively. We follow standard evaluation protocols and report mIOU and pixel accuracy on the held-out set.
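For reference, the two reported metrics can be computed from a confusion matrix as sketched below. This is a generic illustration of mIOU and pixel accuracy, not the evaluation code used in the experiments: the function names are hypothetical, labels are assumed to be integers in `[0, num_classes)`, classes absent from both ground truth and prediction are skipped in the mean (one common convention), and "void" pixels are not handled.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(
    truth: Sequence[int], pred: Sequence[int], num_classes: int
) -> List[List[int]]:
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def miou_and_pixel_acc(cm: List[List[int]]) -> Tuple[float, float]:
    """Mean intersection-over-union and overall pixel accuracy."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(n))
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # class c pixels that were missed
        fp = sum(cm[r][c] for r in range(n)) - tp  # pixels wrongly labeled as c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both truth and prediction
            ious.append(tp / union)
    return sum(ious) / len(ious), correct / total


truth = [0, 0, 1, 1, 2, 2]  # flattened ground-truth labels of a tiny image
pred = [0, 1, 1, 1, 2, 0]   # flattened predicted labels
miou, pixel_acc = miou_and_pixel_acc(confusion_matrix(truth, pred, 3))
```

In this toy example the per-class IOUs are 1/3, 2/3 and 1/2, so the mIOU is 0.5 and the pixel accuracy is 4/6.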

4.2 Comparison to Baselines

Method       Output stride   Training set   Backbone      TAF   mIOU   Pixel acc.
Camvid test set
FCN8s        8,16,32         Camvid         ResNet-50
FCN8s        8,16,32         Camvid         ResNet-50
DeepLabV3+   16              Camvid         MobileNetV2
DeepLabV3+   16              Camvid         MobileNetV2
DeepLabV3+   16              Camvid         ResNet-50
DeepLabV3+   16              Camvid         ResNet-50
Cityscapes validation set
DeepLabV3+   16              CS-0.2         MobileNetV2
DeepLabV3+   16              CS-0.2         MobileNetV2
DeepLabV3+   16              CS-0.2         ResNet-50
DeepLabV3+   16              CS-0.2         ResNet-50
DeepLabV3+   16              CS-0.5         MobileNetV2
DeepLabV3+   16              CS-0.5         MobileNetV2
DeepLabV3+   16              CS-0.5         ResNet-50
DeepLabV3+   16              CS-0.5         ResNet-50
Table 1: Overall results on Camvid and Cityscapes (CS) datasets.

To understand the advantage of the proposed method, we train FCNs and DeepLab v3+ models using the TAF learning paradigm on partially labelled clips and compare against fully supervised training using only key frames, on both the Camvid and Cityscapes datasets. We use mean absolute error (L1 norm) for the contrastive loss as it is a common choice in image applications (this choice is also discussed in the supplementary material). We find it important to first normalize the per-pixel prediction via the softmax function to avoid learning degenerate features. Our training hyper-parameters are detailed in the supplementary material. We use a comparable set of hyper-parameters for both the baseline and our method wherever applicable. For FCN8s, we show results for performing feature swapping only on the stride-32 branch as this leads to better accuracy, while for DeepLab v3+ all branches are swapped with equal probabilities. Our main results are summarized in Table 1. The main finding is that our method improves over the respective baseline methods using both segmentation algorithms and on both datasets. We note that TAF learning only affects the training-time procedure. At inference time, models trained with TAF have exactly the same complexity as the baselines, ensuring that the improvements from TAF do not result from increased complexity.
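As an illustration of the distance used in the contrastive loss, the sketch below normalizes per-pixel logits with a softmax and then takes the mean absolute difference between the two resulting probability maps. The function names and the nested-list layout (one list of class logits per pixel) are assumptions made for this example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the class logits of one pixel."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def softmax_l1_distance(
    logits_a: Sequence[Sequence[float]], logits_b: Sequence[Sequence[float]]
) -> float:
    """Mean absolute difference between per-pixel class probabilities."""
    total = 0.0
    count = 0
    for pixel_a, pixel_b in zip(logits_a, logits_b):
        probs_a, probs_b = softmax(pixel_a), softmax(pixel_b)
        total += sum(abs(x - y) for x, y in zip(probs_a, probs_b))
        count += len(probs_a)
    return total / count


a = [[2.0, 0.0], [0.0, 1.0]]  # two pixels, two classes
b = [[3.0, 1.0], [0.0, 1.0]]  # first pixel's logits shifted by a constant
shifted = softmax_l1_distance(a, b)
flipped = softmax_l1_distance([[2.0, 0.0]], [[0.0, 2.0]])
```

Because the softmax is invariant to adding a constant to all logits of a pixel, `shifted` is zero even though the raw logits differ, while `flipped` is positive because the predicted class changes. The distance therefore compares predicted class distributions rather than raw logit values.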

4.3 Ablation Studies

Figure 2: Effect of varying the change rate.
Figure 3: Effect of varying temporal contexts
(a) Result on train set
(b) Result on test set
Figure 4: Prediction mIOU as a function of feature swapping.

To further understand the proposed method, we perform ablation studies on Camvid dataset using FCN8s models trained with TAF. Feature swapping is only performed on the stride-32 branch for simplicity. We choose to perform ablation studies on this model as its simple design can lead to clearer insights. We use and the half size of each data clip (a.k.a. length of temporal context) to unless specified otherwise in particular studies.

Study on Change Rates Constraint

In our formulation controls the change rate of a particular feature dimension. It is interesting to observe the model performance as a function of , as this directly informs us whether the change rate constraint is effective or not. When is , the feature dimension in question will be forced to stay constant across frames. This should force it to learn features that are not informative to the final prediction, effectively reducing the model capacity and, consequently, the prediction accuracy. On the other hand, as goes to infinity the regularization term is effectively ignored, leading to sub-optimal results if the proposed regularization is indeed effective. Our findings, summarized in Figure 2, are as expected, verifying that the temporal change rate constraints are not trivially imposed.

Study on Temporal Context

Temporal context refers to how far apart in time a pair of training examples can be. Note that in our derivation, we assume that the pair of frames used in the constraints are sufficiently close. This is important because when the two data points are too far apart, the first-order approximation becomes ineffective. At the other extreme, when the temporal context is set to near zero, the regularization effect is diminished. Results from varying the temporal context are summarized in Figure 3. The mIOU reaches its maximum when the temporal context is roughly 15 examples (the same setting used in our main results). Interestingly, while a larger temporal context leads to sub-optimal results, the degradation remains relatively mild, suggesting that a precise approximation is not critical.

Study on Feature Swapping

It is particularly interesting to see how the swapping of features between image pairs affects the prediction results. There are two trivial cases that deserve careful consideration: a) If the performance of the model (especially the baseline model) does not show a decrease in accuracy even with feature swapping, then our proposed constraints are not useful, as this would suggest a natural tendency for part of the model to learn features that are insensitive to temporal changes. b) If the performance of the model does decrease after feature swapping, but both the baseline model and the TAF models demonstrate a similar rate of degradation, then it would raise questions about whether the proposed constraint can actually be successfully imposed in the optimization. A further question is whether these constraints, imposed on the training set, generalize to a test set. In Figure 4 we show that TAF learning does not result in the aforementioned trivial cases and can indeed generalize to held-out sets. Figure 5 further illustrates the effects of feature swapping.

Figure 5: Visualization of predictions with feature swapping.

Study on Feature Attenuation

Constraining the temporal change rate in features could lead to trivial solutions that are not discriminative 7410822 . In Figure 6 we show the effect of replacing the stride-32 branch either with its sample mean or zeros. The large resultant reduction in per-class accuracy suggests that TAF learning is not producing constant, trivial features as feared. However, this issue can be a function of model architectures and should be investigated further in future works.
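The attenuation test can be sketched as below, assuming that "sample mean" refers to the mean feature vector over a set of samples (it could equally be computed over spatial positions). The function name and the flat-vector layout are hypothetical.

```python
from typing import List, Sequence


def attenuate(features: Sequence[Sequence[float]], mode: str) -> List[List[float]]:
    """Replace every sample's feature vector by the sample mean or by zeros."""
    if mode == "zeros":
        return [[0.0] * len(f) for f in features]
    if mode == "mean":
        n = len(features)
        mean = [sum(column) / n for column in zip(*features)]
        return [list(mean) for _ in features]
    raise ValueError("mode must be 'mean' or 'zeros'")


stride32 = [[1.0, 3.0], [3.0, 5.0]]  # toy features of two samples
as_mean = attenuate(stride32, "mean")
as_zeros = attenuate(stride32, "zeros")
```

If a branch had collapsed to constant features, replacing it by its mean would leave predictions unchanged, so a large accuracy drop after attenuation indicates that the branch still carries sample-specific information.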

Figure 6: Change in prediction IOU after attenuating the stride-32 features.

5 Conclusion and Future Works

In this work, we propose to learn temporally-adaptive features to utilize partially annotated clips. Our proposed framework has demonstrated convincing gains on the challenging task of semantic segmentation. The ablation studies verify that our approach learns non-trivial features that reflect the proposed temporal change rate constraints, validating our design choices. Our findings suggest the potential of such constraints in enabling self-supervised learning from clip data. It would be interesting to further validate the utility of this approach in related applications. Dense prediction tasks are natural starting points. Another interesting direction is to explore data-driven metrics, such as perceptual loss perceptual_cvpr18 ; Johnson2016PerceptualLF ; 1284395 and adversarial training NIPS2014_5423 , in constructing the contrastive loss, replacing the current heuristic choice of the L1 norm. It is also interesting to further explore existing ideas from slow feature learning and video propagation (explicit label augmentation) to better model problem structures, which may in turn lead to stronger results.


  • [1] Varun Jampani, Raghudeep Gadde, and Peter V. Gehler. Video propagation networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3154–3164, 2017.
  • [2] Mennatullah Siam, Sepehr Valipour, Martin Jagersand, and Nilanjan Ray. Convolutional gated recurrent networks for video segmentation. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3090–3094, Sep. 2017.
  • [3] Mohsen Fayyaz, Mohammad Hajizadeh Saffar, Mohammad Sabokrou, Mahmood Fathy, Fay Huang, and Reinhard Klette. Stfcn: Spatio-temporal fully convolutional neural network for semantic segmentation of street scenes. In Chu-Song Chen, Jiwen Lu, and Kai-Kuang Ma, editors, Computer Vision – ACCV 2016 Workshops, pages 493–509, Cham, 2017. Springer International Publishing.
  • [4] David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6819–6828, 2018.
  • [5] Raghudeep Gadde, Varun Jampani, and Peter V. Gehler. Semantic video cnns through representation warping. 2017 IEEE International Conference on Computer Vision (ICCV), pages 4463–4472, 2017.
  • [6] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [7] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using lstms. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 843–852, 2015.
  • [8] Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng Yan. Video scene parsing with predictive feature learning. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5581–5589, Oct 2017.
  • [9] Michaël Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. CoRR, abs/1511.05440, 2016.
  • [10] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 835–851, Cham, 2016. Springer International Publishing.
  • [11] Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised learning of spatiotemporally coherent metrics. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4086–4093, Dec 2015.
  • [12] Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Comput., 14(4):715–770, April 2002.
  • [13] Siva Karthik Mustikovela, Michael Ying Yang, and Carsten Rother. Can ground truth label propagation from video help semantic segmentation? In ECCV Workshops, 2016.
  • [14] Ignas Budvytis, Patrick Sauer, Thomas Roddick, Kesar Breen, and Roberto Cipolla. Large scale labelled video data augmentation for semantic segmentation in driving scenarios. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 230–237, Oct 2017.
  • [15] Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn D. Newsam, Andrew Tao, and Bryan Catanzaro. Improving semantic segmentation via video propagation and label relaxation. CoRR, abs/1812.01593, 2018.
  • [16] Md. Alimoor Reza, Hui Zheng, Georgios Georgakis, and Jana Košecká. Label propagation in rgb-d video. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4917–4922, Sep. 2017.
  • [17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [19] Yongxi Lu, Tara Javidi, and Svetlana Lazebnik. Adaptive object detection using adjacency and zoom prediction. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2351–2359, June 2016.
  • [20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, July 2017.
  • [21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, Oct 2017.
  • [22] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In The European Conference on Computer Vision (ECCV), September 2018.
  • [23] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, Oct 2017.
  • [24] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, June 2014.
  • [25] Tinghui Zhou, Matthew Brown, Noah Snavely, and David Lowe. Unsupervised learning of depth and ego-motion from video. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6612–6619, July 2017.
  • [26] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In The European Conference on Computer Vision (ECCV), September 2018.
  • [27] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1983–1992, 2018.
  • [28] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6602–6611, 2017.
  • [29] Ravi Garg, B. G. Vijay Kumar, Gustavo Carneiro, and Ian D. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  • [30] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.
  • [31] Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, and Hartwig Adam. Masklab: Instance segmentation by refining object detection with semantic and direction features. In CVPR, 2018.
  • [32] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequential grouping networks for instance segmentation. In ICCV, 2017.
  • [33] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
  • [34] Alexander Kirillov, Kaiming He, Ross B. Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. CoRR, abs/1801.00868, 2018.
  • [35] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. Upsnet: A unified panoptic segmentation network. CoRR, abs/1901.03784, 2019.
  • [36] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6230–6239, July 2017.
  • [37] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2016.
  • [38] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
  • [39] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, June 2015.
  • [40] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • [41] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, April 2018.
  • [42] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950–957. Morgan-Kaufmann, 1992.
  • [43] Andrew Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 78–, New York, NY, USA, 2004. ACM.
  • [44] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [45] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, pages III–1058–III–1066, 2013.
  • [46] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • [47] Jawad Nagi, Frederick Ducatelle, Gianni A. Di Caro, Dan Cireşan, Ueli Meier, Alessandro Giusti, Farrukh Nagi, Jürgen Schmidhuber, and Luca Maria Gambardella. Max-pooling convolutional neural networks for vision-based hand gesture recognition. In 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), pages 342–347, Nov 2011.
  • [48] Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 464–472, Cadiz, Spain, 09–11 May 2016. PMLR.
  • [49] Shuangfei Zhai, Hui Wu, Abhishek Kumar, Yu Cheng, Yongxi Lu, Zhongfei Zhang, and Rogerio Feris. S3pool: Pooling with stochastic spatial sampling. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4003–4011, July 2017.
  • [50] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.
  • [51] Søren Hauberg, Oren Freifeld, Anders Boesen Lindbo Larsen, John W. Fisher, and Lars Kai Hansen. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In AISTATS, 2016.
  • [52] Amy Zhao, Guha Balakrishnan, Frédo Durand, John V. Guttag, and Adrian V. Dalca. Data augmentation using learned transforms for one-shot medical image segmentation. CoRR, abs/1902.09383, 2019.
  • [53] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3236–3246. Curran Associates, Inc., 2017.
  • [54] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2242–2251, July 2017.
  • [55] Maayan Frid-Adar, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. Synthetic data augmentation using gan for improved liver lesion classification. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 289–293, April 2018.
  • [56] Antreas Antoniou, Amos J. Storkey, and Harrison A Edwards. Data augmentation generative adversarial networks. CoRR, abs/1711.04340, 2018.
  • [57] Christopher Bowles, Liang Chen, Ricardo Guerrero, Paul Bentley, Roger N. Gunn, Alexander Hammers, David Alexander Dickie, Maria del C. Valdés Hernández, Joanna M. Wardlaw, and Daniel Rueckert. GAN augmentation: Augmenting training data using generative adversarial networks. CoRR, abs/1810.10863, 2018.
  • [58] Fisher Yu, Vladlen Koltun, and Thomas A. Funkhouser. Dilated residual networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 636–644, 2017.
  • [59] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. CoRR, abs/1504.08023, 2015.
  • [60] Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visual representation learning. In ICCV, pages 1338–1347, 10 2017.
  • [61] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
  • [62] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to egomotion from unlabeled video. International Journal of Computer Vision, 125(1):136–161, Dec 2017.
  • [63] Yin Li, Manohar Paluri, James M. Rehg, and Piotr Dollár. Unsupervised learning of edges. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1619–1627, June 2016.
  • [64] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6024–6033, July 2017.
  • [65] Zhongzheng Ren and Yong Jae Lee. Cross-domain self-supervised multi-task feature learning using synthetic imagery. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 762–771, 2018.
  • [66] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2070–2079, 2017.
  • [67] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [68] Yue Meng, Yongxi Lu, Aman Raj, Samuel Sunarjo, Rui Guo, Tara Javidi, Gaurav Bansal, and Dinesh Bharadia. Signet: Semantic instance aided unsupervised 3d geometry perception. CoRR, abs/1812.05642, 2018.
  • [69] Huaizu Jiang, Gustav Larsson, Michael Maire, Greg Shakhnarovich, and Erik Learned-Miller. Self-supervised relative depth learning for urban scene understanding. In The European Conference on Computer Vision (ECCV), September 2018.
  • [70] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV (1), pages 44–57, 2008.
  • [71] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [72] Richard Zhang, Phillip Isola, Alexei Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 06 2018.
  • [73] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [74] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, April 2004.
  • [75] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • [76] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [77] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [78] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

Supplementary Materials for Implicit Label Augmentation on Partially Annotated Clips via Temporally-Adaptive Features Learning

5.1 Temporal Change Rates of Semantic Classes

Figure 7: Changing rate of each class in Camvid train set. The figures show the ratio of prediction accuracy on , between results using features from and , measured in per-class IOU.

We expect different semantic classes to demonstrate different temporal change rates. This, as we discussed, is a motivation for testing our method on semantic segmentation. To verify it empirically, we compare the predicted segmentation labels at against the ground truth label at . Accuracy is reported using IOU normalized by the prediction accuracy at , as shown in Figure 7. This normalization is necessary as different semantic classes have different intrinsic difficulties. Since labels are not available beyond the key frames, we use the model prediction instead in our study. Interestingly, there is a clear differentiation in the temporal change rates among different classes, as demonstrated by the large differences in change rates of the normalized IOU. Notably, the accuracies of larger or static objects such as road, sky, tree, fence, and pavement tend to decrease slowly with time, suggesting low temporal change rates for those structures. On the other hand, smaller or moving objects like car, bicyclist, sign-symbol, pedestrian, and pole tend to change much faster with time.
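The normalization described above can be illustrated with a small sketch. It assumes that the comparison is between label maps flattened to integer sequences, and that each class's IOU at a temporally offset frame is divided by the IOU of the same class at the key frame. The function names, the toy label sequences and the handling of absent classes are hypothetical and are not taken from our implementation.

```python
def per_class_iou(pred, ref, num_classes):
    """Per-class intersection-over-union between two flat label sequences."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, r in zip(pred, ref) if p == c and r == c)
        union = sum(1 for p, r in zip(pred, ref) if p == c or r == c)
        # A class absent from both sequences has no defined IOU.
        ious.append(inter / union if union else None)
    return ious


def normalized_iou(iou_offset, iou_key):
    """Divide each offset-frame IOU by the key-frame IOU of the same class."""
    return [
        o / k if o is not None and k else None
        for o, k in zip(iou_offset, iou_key)
    ]


# Toy example with two classes and four pixels (illustrative values only).
reference = [0, 0, 1, 1]      # reference labels at the key frame
pred_key = [0, 0, 1, 1]       # prediction at the key frame
pred_offset = [0, 1, 1, 1]    # prediction at a temporally offset frame

iou_key = per_class_iou(pred_key, reference, 2)
iou_offset = per_class_iou(pred_offset, reference, 2)
ratios = normalized_iou(iou_offset, iou_key)
```

A ratio that stays close to 1 as the offset grows would correspond to a slowly changing class, while a ratio that drops quickly would correspond to a fast changing one.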

5.2 Details of Training Procedures

We use a mini-batch SGD optimizer with Nesterov momentum. We set momentum to and weight decay to . The batch size is 16, except when training models with the ResNet-50 backbone on Cityscapes, where we use a batch size of 8 due to GPU memory constraints. Training is performed on 4 Nvidia GTX 1080Ti GPUs. Synchronized batch normalization is used since the number of images per GPU is small in our setting. The ResNet-50 [76] and MobileNet v2 [77] models are pre-trained on ImageNet [78]. We adopt a learning schedule with polynomial decay with power set to , following standard practice in semantic segmentation. This schedule multiplies the initial learning rate by the factor . During training, we apply random horizontal flip, random scales between and and random cropping. For both baselines and the TAF models, we report the best results among the initial learning rate from and additionally for TAF learning from . Additional details are summarized in Table 2.
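As an illustration of a polynomial decay schedule, the following minimal sketch multiplies the initial learning rate by (1 - step / max_steps) raised to a power. The function name, the step-based granularity and the default power of 0.9 (a value commonly used in DeepLab-style training) are assumptions made for this example and should not be read as the exact settings of our experiments.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomially decayed learning rate at a given training step.

    `power=0.9` is an assumed default for illustration only.
    """
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power


# The rate starts at base_lr and decays to zero at the final step.
schedule = [poly_lr(0.02, s, 100) for s in (0, 25, 50, 75, 100)]
```

With a power below 1, the rate decays slowly at first and faster towards the end of training.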

Method      Training set  Backbone     TAF  Init. lr  Init.  Img size  Crop size  Epoch  Val size
FCN8s       Camvid        ResNet-50    no   0.02      N/A                         600
FCN8s       Camvid        ResNet-50    yes  0.02      1.0                         600
DeepLabV3+  Camvid        MobileNetV2  no   0.02      N/A                         600
DeepLabV3+  Camvid        MobileNetV2  yes  0.05      1.0                         600
DeepLabV3+  Camvid        ResNet-50    no   0.02      N/A                         600
DeepLabV3+  Camvid        ResNet-50    yes  0.05      0.5                         600
DeepLabV3+  CS-0.2        MobileNetV2  no   0.02      N/A                         300
DeepLabV3+  CS-0.2        MobileNetV2  yes  0.05      1.0                         300
DeepLabV3+  CS-0.2        ResNet-50    no   0.02      N/A                         300
DeepLabV3+  CS-0.2        ResNet-50    yes  0.05      1.0                         300
DeepLabV3+  CS-0.5        MobileNetV2  no   0.02      N/A                         300
DeepLabV3+  CS-0.5        MobileNetV2  yes  0.02      1.0                         300
DeepLabV3+  CS-0.5        ResNet-50    no   0.01      N/A                         300
DeepLabV3+  CS-0.5        ResNet-50    yes  0.01      1.0                         300
Table 2: Hyperparameters used for main results.

5.3 Choice of Temporal Change Rates

For FCN8s, our preliminary studies suggest that assigning and to leads to the best performance. We note that this design effectively disables TAF learning on the stride-16 and stride-8 branches. This design is necessary to allow the two high resolution features to model structures with fast temporal change rates sufficiently. For DeepLab v3+, we assign . There is no advantage in disabling TAF learning on any branch; in fact, our study suggests that doing so leads to worse accuracy. We believe this is attributable to the decoder ( function) design of DeepLab v3+, which provides a skip connection with output stride of from low level features and can model fast features sufficiently by itself.

5.4 Measure Temporal Change Rates Relative to Input

In our formulation, the temporal change rates are directly measured by the variations in the predictive model . However, different clips can have intrinsically different rates of motion, so it might be wise to impose the temporal change rate constraints relative to the change rates in the input images. This, as we also discuss in Section 2, is not trivial since is not available. In our preliminary studies, we empirically test using norm as a measure of the change rate, using the first order approximation as we do for . Then, we set the constraints as the proportion between the change rates in and those in . We find that this does not lead to improvement over the design we presented, and the training is usually less stable. We conjecture that this is attributable to our heuristic method of measuring differences between images, a point worth revisiting in future work.

5.5 Choice of Loss Functions

The loss function consists of two parts: the semantic loss and the contrastive loss. The former compares the prediction of the model against the ground truth annotations at the key frames, while the latter compares the predictions of the model before and after feature swapping (our regularization term). For the semantic loss, we use cross-entropy loss for both the baseline and TAF learning, per standard practice. For the contrastive loss, in our preliminary studies we experiment with a few different metrics, including mean squared error (MSE), mean absolute error (MAE, or L1 loss) and symmetric cross entropy. Among them, L1 loss leads to the most stable training and the best results. We note that it is important to first normalize the prediction at every pixel via a softmax function, as applying L1 norm regularization directly on the logits (before normalization) leads to degenerate solutions. The L1 norm is by no means the optimal metric for comparing images, as it does not reflect the rich semantic structures encoded in natural images. We believe that perceptual loss or even adversarial loss (via a learnable model) could lead to better performance and are interesting future directions to explore.
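The two-part loss can be sketched as follows. This is a minimal illustration in plain Python, assuming that predictions are given as per-pixel lists of class logits. The function names, the weighting parameter `weight`, and the choice to sum over classes and average over pixels are assumptions of this example; the feature swapping step that produces the second set of logits is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the class scores of one pixel."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pixel_logits, labels):
    """Semantic loss: mean cross-entropy against key-frame labels."""
    losses = [-math.log(softmax(l)[y]) for l, y in zip(pixel_logits, labels)]
    return sum(losses) / len(losses)


def l1_contrastive(logits_before, logits_after):
    """Contrastive loss: L1 distance between softmax-normalized predictions."""
    per_pixel = [
        sum(abs(a - b) for a, b in zip(softmax(p), softmax(q)))
        for p, q in zip(logits_before, logits_after)
    ]
    return sum(per_pixel) / len(per_pixel)


def total_loss(key_logits, key_labels, logits_before, logits_after, weight=1.0):
    """Semantic loss plus weighted contrastive loss (weight is hypothetical)."""
    semantic = cross_entropy(key_logits, key_labels)
    contrastive = l1_contrastive(logits_before, logits_after)
    return semantic + weight * contrastive
```

Because the comparison is made after the softmax, the contrastive term is bounded for each pixel and is unaffected by a constant shift of the logits, which is not the case when the L1 norm is applied to the raw logits.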