One of the key limitations of supervised learning is generalization under domain shift Quionero-Candela et al. (2009); Moreno-Torres et al. (2012), which often leads to significant performance drops when the test distribution is even slightly shifted relative to the train distribution. These cases are highly prevalent in real-life scenarios, especially for smaller training datasets. Thus, considerable attention has been given to the general problem of knowledge transfer between input domains across different scenarios and techniques Tzeng et al. (2017); Hoffman et al. (2018); Long et al. (2017); Deng et al. (2018); Courty et al. (2016); Wang and Deng (2018). The vast majority of current works assume the availability of numerous labeled or unlabeled examples from the target (test) domain Darrell (2014); Sun and Saenko (2016); Ganin and Lempitsky (2015); Lempitsky (2017), while others assume access to fewer examples, with a possible performance compromise Motiian et al. (2017); Hoffman et al. (2013). In contrast, when only a single instance is available at test time, subject to an unknown covariate shift, the concept of a target domain is no longer defined, rendering transfer techniques practically irrelevant. Accordingly, most efforts on this problem focus on detecting data to which the model is not robust and declaring it out-of-distribution (OOD) Che et al. (2019); Snoek et al. (2019); Hendrycks and Dietterich (2019). These techniques can avoid prediction errors when a domain shift is detectable, but they do not attempt to improve prediction in these cases. In this work, we examine a specific type of domain shift, known as covariate shift, where only the distribution of the model inputs is shifted Sugiyama and Kawanabe (2012); Shimodaira (2000). We explore methods that go beyond detection and attempt to improve model prediction on a "covariate-shifted" test sample rather than discarding it.
Specifically for image classification, it was shown that state-of-the-art models, trained with different robustness enhancements, may suffer severe performance degradation in the presence of a simple covariate shift Hendrycks and Dietterich (2019). Recently, Hendrycks et al. Hendrycks et al. (2019) showed that adding self-supervised learning tasks during training can improve the robustness of visual encoders to these shifts. Sun et al. Sun et al. (2019) extended this concept to allow for a self-supervised test time modification of encoder parameters, thus taking a more adjustable approach for achieving robustness. We continue these two lines of work, by addressing both the question of leveraging self-supervised learning to achieve robustness, and the desired types of model adjustability in test time.
A single unlabeled test instance may contain valuable information about a possible covariate shift. Based on this understanding, allowing the model parameters to depend on the test instance itself may help adjust the model for such shifts Sun et al. (2019); Zhang et al. (2019). Test-time training (TTT) implements this input dependency using a self-supervised mechanism Sun et al. (2019). We further improve on this mechanism while suggesting an additional source of input dependency that does not require training at test time, and significantly improves performance relative to TTT.
One of the questions raised in this paper relates to the components of a model that need to be adjusted to handle covariate shift. Covariate shifts typically do not affect semantic information, and are mostly related to low-level phenomena, which suggests that robustness should be handled in lower-level processing phases. Specifically for deep neural networks, earlier layers are considered more sensitive to low-level information, whereas later layers capture high-level semantic information Yosinski et al. (2014); Goodfellow et al. (2016). We therefore explore the relationship between the depth of a layer and its robustness to covariate shift, allowing the introduction of more careful model adjustment techniques.
We conduct our experiments in the context of image classification on diverse scenarios and show that our proposed method maintains its performance on the original domain while making substantial improvements under covariate shift with respect to previous work. Our key contributions are:
The first framework to explicitly translate self-supervised neural representations to a main task representation, using a direct differentiable mapping approach.
A novel approach for using input-dependent networks to alleviate test-time performance degradation due to covariate shift, and a demonstration of its superiority over competing approaches.
Pushing the envelope of the relatively unexplored idea of achieving robustness through test-time adjustments.
An analysis tool for determining which CNN components require modifications in order to better cope with covariate shifts, and using it to provide a lucid justification for our layer-specific approach.
2 Related work
2.1 Self-supervised learning
The field of self-supervised representation learning has made great strides recently, with massive success in language modeling Devlin et al. (2018); Radford et al. (2019), speech Oord et al. (2018); Pascual et al. (2019), and very recently in computer vision Hénaff et al. (2019); Chen et al. (2020); He et al. (2019). In this work, we focus on a simple yet effective self-supervised technique for learning visual representations Gidaris et al. (2018), and note that this approach can be extended to other fields and arbitrary (appropriate) self-supervision techniques. In the context of covariate shift, it was recently shown that training an encoder in a multi-task fashion alongside a self-supervised task can result in more robust image classification performance Hendrycks et al. (2019); Sun et al. (2019) (see Fig. 1-a). We show that a fundamental limitation of Hendrycks et al. (2019); Sun et al. (2019) comes from strictly sharing the encoder between the self-supervised and main tasks. In contrast, we propose a direct approach for optimizing knowledge sharing between the self-supervised encoder and a separate main-task encoder.
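The self-supervised technique we focus on, Gidaris et al. (2018), is rotation prediction: each image is rotated by a multiple of 90 degrees and the network predicts which rotation was applied. A minimal sketch of generating the pseudo-labeled views (the function name is illustrative):

```python
import numpy as np

def make_rotation_task(image):
    """Given an H x W (x C) image array, return the four rotated copies
    and their pseudo-labels k, where k encodes rotation by k * 90 degrees."""
    rotations = [np.rot90(image, k=k) for k in range(4)]
    labels = np.arange(4)  # 0, 1, 2, 3  ->  0, 90, 180, 270 degrees
    return rotations, labels

# Toy usage: a 2 x 2 single-channel "image"
img = np.array([[1, 2],
                [3, 4]])
views, labels = make_rotation_task(img)
```

No human annotation is needed: the label is a by-product of the transformation itself, which is what makes the task usable on an unlabeled test instance.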
2.2 Input-dependent neural networks
Our proposed input-dependent method is inspired by the idea of hyper or dynamic networks Bertinetto et al. (2016); Kristiadi et al. ; Nachmani and Wolf (2019), where the parameters of one neural network can be determined by the output of another, applied to the same input. We extend this concept to have a self-supervised encoder dynamically control the parameters of a main task encoder.
2.3 Data augmentation
It has been recently shown that massive data augmentation can improve model robustness to covariate shift in different scenarios and fields Hendrycks et al. (2020); Cubuk et al. (2019); Park et al. (2019); Xie et al. (2019). While this approach is highly applicable in many cases, it requires significant inductive bias which may not be available for every task. In contrast, this work addresses modifications in the learning model, and is therefore agnostic to the data or the task. Additionally, it does not carry the computational overhead introduced by training with data augmentations. Thus, we leave data augmentation techniques outside the scope of this work.
3 Problem formulation
For a supervised learning model f with parameters θ, a covariate shift occurs when the input distribution changes between training and test time (i.e., p_train(x) ≠ p_test(x)), while the conditional distribution p(y|x) is preserved. We also assume that p_test(x) is unknown. Our goal is to have the learning model be robust to this shift, such that for a given pair (x, y) drawn from the test distribution, the prediction f(x; θ) remains accurate despite the shift.
Specifically, in this work we are interested in a family of solutions in which the parameters depend on the input, that is, θ = θ(x).
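As a minimal numerical illustration of this formulation, consider a toy labeling rule (a stand-in for p(y|x)) held fixed while the input marginal shifts; the distributions and threshold below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# p(y|x) is the same fixed labeling rule in both domains (preserved conditional).
def label(x):
    return (x > 0.5).astype(int)

# Only the input marginal changes: p_train(x) vs. p_test(x).
x_train = rng.normal(loc=0.0, scale=1.0, size=1000)   # training inputs
x_test  = rng.normal(loc=2.0, scale=1.0, size=1000)   # covariate-shifted inputs

y_train, y_test = label(x_train), label(x_test)

# The marginals differ (means ~0 vs ~2) while the conditional is identical.
```

A model fit to x_train sees mostly inputs near 0, so its decision quality near x ≈ 2 is unconstrained by training data, even though the true labeling rule never changed.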
4.1 Joint training
Joint training (JT) employs a reversed Y-shaped architecture to train a two-head neural network with a shared feature extractor Hendrycks et al. (2019). One head is for a self-supervised task (e.g., rotation angle prediction), and the second head is for the main downstream task (e.g., classification). See Fig. 1 for a schematic diagram of this architecture. We denote the parameters of the shared feature extractor θ_e, the parameters of the self-supervised head θ_s, and the parameters of the main-task head θ_m. At training time, the self-supervised loss ℓ_s and the main-task loss ℓ_m are jointly optimized using data from the training distribution. At test time, the model only employs the main-task head, while the self-supervised head is discarded.
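The JT objective can be sketched with linear maps standing in for the shared encoder and the two heads; all shapes, class counts, and names below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder and the two heads (linear maps stand in for CNN components).
theta_e = rng.normal(size=(8, 16))   # shared extractor: input dim 8 -> feature dim 16
theta_s = rng.normal(size=(16, 4))   # self-supervised head: 4 rotation classes
theta_m = rng.normal(size=(16, 10))  # main-task head: 10 classes

def cross_entropy(logits, label):
    z = logits - logits.max()               # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def joint_loss(x, y_main, y_ss):
    h = np.tanh(x @ theta_e)                      # shared features
    loss_m = cross_entropy(h @ theta_m, y_main)   # main-task loss
    loss_s = cross_entropy(h @ theta_s, y_ss)     # self-supervised loss
    return loss_m + loss_s                        # jointly optimized

x = rng.normal(size=8)
total = joint_loss(x, y_main=3, y_ss=1)
```

The key structural point is that both losses back-propagate into the same θ_e, which is exactly the strict sharing this paper later relaxes.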
4.2 Test-time training
TTT Sun et al. (2019) is a modification of JT that enables the shared encoder to also learn at test time, without the main-task labels. Specifically, the self-supervised loss ℓ_s is minimized on the test example x, yielding the approximate minimizer θ_e*(x) of:

min_{θ_e} ℓ_s(x; θ_e, θ_s)

Then, the main-task prediction is made using the updated parameters θ_e*(x) and θ_m. Note that the model parameters now depend on the test input x.
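The test-time step can be sketched as a few gradient iterations on a label-free loss evaluated at the single test input. Below, a toy quadratic objective with an analytic gradient stands in for the actual self-supervised (rotation) loss; everything here is an illustrative assumption:

```python
import numpy as np

# Toy stand-in for the self-supervised loss l_s: the "encoder" is a vector
# theta_e, and l_s measures mismatch between an encoding of x and a target
# derivable from x alone (no main-task label is ever used).
def l_s(theta_e, x):
    target = np.sort(x)               # any label-free function of x
    return 0.5 * np.sum((theta_e * x - target) ** 2)

def grad_l_s(theta_e, x):
    target = np.sort(x)
    return (theta_e * x - target) * x  # analytic gradient of l_s w.r.t. theta_e

def ttt_update(theta_e, x, lr=0.1, steps=10):
    """Gradient steps on l_s using only the test input x."""
    for _ in range(steps):
        theta_e = theta_e - lr * grad_l_s(theta_e, x)
    return theta_e                     # theta_e*(x): now input-dependent

x = np.array([3.0, -1.0, 2.0])
theta0 = np.zeros(3)
theta_star = ttt_update(theta0, x)
```

After the update, l_s(theta_star, x) is lower than l_s(theta0, x); TTT's premise is that this decrease also helps the main-task head, which the next section argues should instead be optimized for directly.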
Both JT and its TTT extension assume that the main-task prediction somehow improves by optimizing the shared encoder to minimize the self-supervised loss ℓ_s. Though this assumption might be justified by regularization considerations, it seems that self-supervision can be used to benefit downstream tasks more systematically, both at training and at test time. Another point regards the dependency of the model parameters on the test input x. In TTT, this is achieved purely through back-propagation from ℓ_s. This input-dependency procedure is empirically backed in Sun et al. (2019), which demonstrates that the gradients of ℓ_s and ℓ_m with respect to θ_e are correlated. However, it is not directly optimized during training. In contrast, we observe that introducing input dependency through the model architecture allows for a direct optimization of input conditioning. In the next section, we propose two extensions to JT that directly address these issues, and describe how they can also be combined with the TTT principle.
5 Test-time adjustments using dynamic neural networks
Our first observation is that different layers of a deep network may have different sensitivity to covariate shifts. For example, since earlier layers capture low-level information, it may be beneficial to focus our effort on them. Consequently, we propose a split architecture, as illustrated in Fig. 1-b, where the first layers of the shared encoder of JT are split into separate modules for the main task (E_m) and the self-supervised task (E_s), allowing us to customize their behaviour, while the rest of the encoder remains shared as in JT. Next, we address the issue of a more systematic usage of self-supervised representations with the split architecture (Sec. 5.1), and continue by proposing an architectural adjustment that allows it to be explicitly optimized on the test sample (Sec. 5.2).
5.1 Data-dependent bridge
Using the split architecture, we wish to optimize the task module E_m using the self-supervised module E_s, by mapping the weights of E_s to determine those of E_m. For this, every convolutional layer in E_s is used as a filter bank, such that every filter j in the corresponding layer of E_m, with parameters w_j^m, is a weighted linear combination of the filter parameters of E_s, as follows:

w_j^m = Σ_{i=1}^{N} α_{ij} w_i^s

where N is the number of convolutional filters in the layer, and the α_{ij} are learned, data-dependent parameters, optimized in the training stage. This unique connection between E_s and E_m can be seen as a "bridge" between the filters of the self-supervised task and the main task. Thus, this architecture is able to leverage covariate-shift information acquired by E_s to benefit the main task. Since this bridge is learned end-to-end against all the training data, we term it the data-dependent bridge.
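The linear filter combination above can be sketched directly; filter counts and shapes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Self-supervised layer as a filter bank: N filters of shape (C, k, k).
N, C, k = 4, 3, 3
w_s = rng.normal(size=(N, C, k, k))

# Learned bridge coefficients alpha[i, j]: filter j of the main-task
# layer is a linear combination of all self-supervised filters i.
M = 4                                # number of filters in the main-task layer
alpha = rng.normal(size=(N, M))

def bridge(w_s, alpha):
    # w_m[j] = sum_i alpha[i, j] * w_s[i]
    return np.einsum('ij,ichw->jchw', alpha, w_s)

w_m = bridge(w_s, alpha)
```

Because the combination is linear and differentiable, gradients from the main-task loss flow through alpha into the self-supervised filters, so the mapping is trained end-to-end.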
5.2 Signal-dependent bridge
Following the mapping motivation of the data-dependent bridge, we propose a similar signal-dependent bridge. In this setting, the weighted combination for E_m is conditioned on the input signal. Specifically, the self-supervision head now predicts both the self-supervised labels and the bridge parameters α̂_{ij}(x), which are now signal-dependent. This approach defines a conditioning between the input signal and the main-task network. To combine this setting with the data-dependent bridge, we simply sum the filters predicted by both bridges. That is:

w_j^m = Σ_{i=1}^{N} (α_{ij} + α̂_{ij}(x)) w_i^s
We note that the presented methods are complementary with respect to their usage in self-supervised learning: the first (data-dependent) relies on the quality of the self-supervised representation, and statically maps it to the main representation, while the second (signal-dependent) relies on tight tuning of the self-supervised network to the specific input, and thus allows for a more dynamic adjustment.
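The combined bridge can be sketched as follows, with an arbitrary linear map standing in for the coefficient-predicting self-supervision head; all names, shapes, and the predictor itself are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, C, k = 4, 4, 3, 3
w_s = rng.normal(size=(N, C, k, k))          # self-supervised filter bank
alpha_static = rng.normal(size=(N, M))       # data-dependent coefficients (learned)
W_pred = rng.normal(size=(8, N * M)) * 0.1   # toy coefficient predictor

def predict_alpha(x_feat):
    """Stand-in for the self-supervision head predicting signal-dependent
    bridge coefficients from features of the current input."""
    return (x_feat @ W_pred).reshape(N, M)

def combined_bridge(w_s, x_feat):
    alpha = alpha_static + predict_alpha(x_feat)   # sum of both bridges
    return np.einsum('ij,ichw->jchw', alpha, w_s)

x_feat = rng.normal(size=8)
w_m = combined_bridge(w_s, x_feat)   # main-task filters now depend on the input
```

Two different inputs yield two different sets of main-task filters, which is the input conditioning the text describes: the static term carries dataset-level knowledge, while the predicted term adapts per sample.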
We note that since our method is an extension to JT and is fully differentiable, it can also benefit from the TTT principle, which introduces a complementary input-dependent signal. We thus examine this combination in our experimental study (Sec. 7).
6 Analysis of covariate shift sensitivity
We use a ResNet-26 encoder for the example task of image classification, trained with the exact same hyperparameters across all runs. As illustrated in Fig. 1(a), a ResNet-26 encoder is composed of an initial convolutional layer (C0), followed by four groups (G1-G4), each containing four residual blocks (B1-B4). In this section, we present an experiment for analyzing the sensitivity of this specific architecture to covariate shifts.
We conduct the experiment on the CIFAR-10-C dataset Hendrycks and Dietterich (2019), treating the original CIFAR-10 Krizhevsky et al. (2009) as reference, and the corrupted versions as target datasets simulating different covariate shifts. All networks are trained for image classification using a standard supervised-learning procedure.
To analyze the sensitivity of each network component to covariate shift, we developed a method for isolating the effect of covariate shift on the functionality of each block. As illustrated in Fig. 1(a), the following procedure is conducted: first, an encoder is trained using a reference dataset (reference encoder). Then, given a covariate-shifted dataset, the reference encoder is fine-tuned, allowing a specific block to optimize while all other blocks are kept frozen. Finally, a representation similarity index is computed between the reference and tuned block. Repeating this experiment across different blocks and different covariate shifts enables us to analyze the covariate-shift sensitivity of each layer.
For control, the similarity index is also computed between multiple reference encoders trained from different random initializations. The similarity score of a fine-tuned encoder and a reference encoder is expected to be lower than the score of two reference encoders.
To measure representation similarity, the Centered Kernel Alignment (CKA) index Kornblith et al. (2019) is employed. A low CKA index implies low similarity between blocks, indicating a higher sensitivity of the specific block to the covariate shift. To obtain a reliable CKA measure, multiple experiments are run for each CIFAR-10-C corruption type.
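For reference, the linear variant of the CKA index from Kornblith et al. (2019) can be computed on two matrices of per-example activations as follows:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape
    (n_examples, n_features), as in Kornblith et al. (2019)."""
    X = X - X.mean(axis=0)   # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    normalizer = (np.linalg.norm(X.T @ X, 'fro') *
                  np.linalg.norm(Y.T @ Y, 'fro'))
    return hsic / normalizer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
# A representation is maximally similar to itself (CKA = 1), and the index
# is invariant to isotropic scaling and orthogonal transforms of the features.
```

In the experiment above, X and Y would be the activations of the reference block and the fine-tuned block on the same evaluation inputs.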
The representation similarity results for each block are presented in Fig. 1(b). We were particularly interested in the blocks where the baseline similarity was kept high and above the similarity under corruptions. Clearly, and consistent with Yosinski et al. (2014), the largest gap was observed in the earlier layers, indicating their sensitivity to covariate shift. This behavior was generally consistent across the different corruptions. This suggests that earlier layers are indeed more sensitive to covariate shift, and that their adjustment is more crucial in this context. Note that although we observed significant sensitivity of the first convolutional layer (C0), we found that adjusting this layer with our proposed method resulted in training instability. We thus avoided adjusting this layer in the following experiments, and only adjusted the blocks in G1. The respective ablation study is given in Sec. 7.4.
7 Empirical Analysis
We validate SSDN on a series of image classification experiments conducted on different datasets. In all experiments, all methods were trained on a source dataset and tested on a dataset containing different covariate shifts. We assumed that the test instances underwent independent covariate shifts with respect to the training-set distribution. This setup is termed "Single". In particular, the configurations detailed below were examined using the "Single" setup.
An additional experiment is included, where unlabeled test instances arrive online and are assumed to come from the same distribution. This setup is termed "Online". As there are different approaches to tackling this unsupervised online problem, we note that this experiment is not exhaustive with respect to the possible state of the art, but is provided solely for compliance with Sun et al. (2019). Specifically, as the online queue grows, comparison to domain-adaptation methods becomes appropriate.
The following baseline methods are compared, all based on the same ResNet-26 architecture:
Standard - A supervised model trained with classification labels.
Original TTT - Performing test-time training on top of "Joint training" Sun et al. (2019).
SSDN one-pass - Splitting G1 into separate main-task and self-supervised branches and combining the data-dependent and the signal-dependent bridges as described in Sec. 5.
SSDN + TTT - Performing test-time training on top of "SSDN one-pass".
In all TTT experiments, we ran back-propagation iterations to minimize the self-supervised loss before the final classification. All other hyperparameters were identical to Sun et al. (2019).
CIFAR-10 → CIFAR-10-C. Fig. 3 depicts the experiments comparing SSDN to the baseline methods on the CIFAR-10-C dataset. Evidently, the pure feed-forward variant of SSDN performed better than the baseline TTT, making it a very attractive variant for real-time applications. Combining test-time training with SSDN further improves accuracy and achieves a 9.7% relative improvement over the closest baseline. The "Online" versions of both the baseline TTT and SSDN expectedly outperformed their "Single" counterparts, indicating that this setup is significantly less challenging. In this setup, SSDN outperformed Sun et al. (2019) by a large margin of 21.4% relative improvement.
CIFAR-10 → CIFAR-10.1. A similar trend was observed in the CIFAR-10.1 experiment (Tab. 2), where both our one-pass and TTT versions significantly outperformed all baselines. Our TTT variant improves by 10.6% over Sun et al. (2019) and 15% over Hendrycks et al. (2019). Consistent with Recht et al. (2018), these results suggest that the poor generalization of standard CIFAR-10 models to CIFAR-10.1 may be explained by the existence of covariate shifts. It seems that our input-dependent model was able to generalize from the internal covariate shift that exists within the CIFAR-10 dataset to better handle some of the variations presented by the CIFAR-10.1 dataset.
SVHN → MNIST. Tab. 2 shows results for the SVHN → MNIST experiment. Our "Single" variant outperforms all baselines by at least a 14% relative improvement. Interestingly, neither SSDN nor JT Hendrycks et al. (2019) gains further improvement from running TTT Sun et al. (2019) in this scenario. We hypothesise that this may be due to the large domain gap in this experiment: while an initial bridging improves performance by a considerable margin, TTT might overfit the sample, resulting in bridge parameters that are too far from the original distribution and thus degrade performance.
When running the experiment in the opposite direction (MNIST → SVHN), all methods performed very poorly. SVHN has a much larger sample variance than MNIST, so generalization in this direction is far from trivial and out of reach for the compared methods.
| Method | Error (%) |
| --- | --- |
| Joint training Hendrycks et al. (2019) | 16.7 |
| Original TTT Sun et al. (2019) | 15.9 |
| SSDN + TTT | 14.2 |

| Method | Error (%) |
| --- | --- |
| Joint training Hendrycks et al. (2019) | 29.6 |
| Original TTT Sun et al. (2019) | 29.7 |
| SSDN + TTT | 25.7 |
7.4 Ablation study
To better understand the effect of bridging between the self-supervised and main-task modules at different layers, we evaluated different versions of SSDN, each bridging a different part of the encoder (see Tab. 3). While bridging all layers may sound like the optimal scheme from an optimization perspective, there seems to be an important inductive bias related to the sensitivity of each block to covariate shifts (see Fig. 1(b)), and specifically in focusing on the lower-level part of the encoder (G1). It also seems that optimizing the bridge is most stable when choosing single groups rather than mixing them.
7.5 Analysis of the bridge parameters
To directly analyze the effect of the proposed signal-dependent bridge (Sec. 5.2), we inspected the predicted bridge parameters under different covariate shifts from the CIFAR-10-C experiment, using the "one-pass" variant. The results are visualized in Fig. 4, where distinct clusters are prominent for almost every covariate shift. It thus seems that the signal-dependent bridge was able to detect the covariate shift in a completely unsupervised manner, allowing the encoder to adapt accordingly.
Interestingly, the cluster location on the plot is a good proxy for the classification accuracy under the respective covariate shift. For example, "brightness" is the closest to "original", and both have the lowest test error. In contrast, "gaussian_noise" is fairly far from "original", and has a very high test error. Moreover, when looking at the t-SNE plots of the different layers, the clusters seem to behave isometrically, further suggesting that the model gains valuable and consistent covariate shift information from each sample, in an unsupervised fashion.
The results presented in this study demonstrate the contribution of self-supervised learning to handling covariate shifts, with all compared variants surpassing the equivalent fully supervised model. We observed how a direct optimization of this contribution significantly improved image classification performance in a variety of scenarios, perhaps calling for further experimentation with the presented ideas on different problems and other self-supervision techniques. In light of the analysis in Sec. 7.5, the prominent empirical performance of the proposed input-dependent mechanism might be attributed to the hierarchical nature of this architecture. It seems that the model was able to first gather information about the covariate shift, and then use it for a more appropriate analysis of the input. Such input-conditional behaviour can prove crucial when dealing with significant intra-dataset variations, or equivalently for dealing with catastrophic forgetting in continual learning or online setups.
In this paper, the problem of model robustness to covariate shift was addressed. We proposed a sensitivity analysis to better pinpoint the sources of non-robustness of CNNs. Motivated by the analysis, we introduced methods for optimizing the relationship between self-supervised representations and main task representations, and for direct input-dependent adjustment to covariate shift. We demonstrated the attractiveness of our methods in several image classification scenarios versus recently proposed alternatives. With some architectural modifications, we believe these proposed ideas can be applied to other machine learning domains to improve robustness with test-time adjustments. Specifically, we believe that the proposed input conditional mapping between representations can be extended to tackle other challenges beyond covariate shift robustness, such as online domain adaptation and continual learning, and can contribute to the growing impact of self-supervised learning techniques.
References

- (2016) Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pp. 523–531.
- (2019) Deep verifier networks: verification of deep discriminative models with deep generative models. arXiv preprint arXiv:1911.07421.
- (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- (2016) Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2019) AutoAugment: learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123.
- (2014) Contrastive adaptation network for unsupervised domain adaptation. arXiv preprint arXiv:1412.3474.
- (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2015) Unsupervised domain adaptation by backpropagation. In the 32nd International Conference on Machine Learning (ICML).
- (2016) Deep learning. MIT Press.
- (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.
- (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272.
- (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.
- (2019) Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pp. 15637–15648.
- (2020) AugMix: a simple data processing method to improve robustness and uncertainty. In Proceedings of the International Conference on Learning Representations (ICLR).
- (2013) One-shot adaptation of supervised deep convolutional models. arXiv preprint arXiv:1312.6204.
- (2018) CyCADA: cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning.
- (2019) Similarity of neural network representations revisited. arXiv preprint arXiv:1905.00414.
- Uncertainty quantification with compound density networks.
- (2009) Learning multiple layers of features from tiny images.
- (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- (2017) Domain-adversarial training of neural networks. Journal of Machine Learning Research, Vol. 17, pp. 2096–2030.
- (2017) Deep transfer learning with joint adaptation networks. In Proceedings of the International Conference on Machine Learning.
- (2012) A unifying view on dataset shift in classification. Pattern Recognition 45 (1), pp. 521–530.
- (2017) Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems.
- (2019) Hyper-graph-network decoders for block codes. In Advances in Neural Information Processing Systems.
- (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
- (2019) Learning problem-agnostic speech representations from multiple self-supervised tasks. arXiv preprint arXiv:1904.03416.
- (2009) Dataset shift in machine learning. The MIT Press.
- (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
- (2018) Do CIFAR-10 classifiers generalize to CIFAR-10?. arXiv preprint arXiv:1806.00451.
- (2012) Convolutional neural networks applied to house numbers digit classification. In Proceedings of the International Conference on Pattern Recognition.
- (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90 (2), pp. 227–244.
- (2019) Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980.
- (2018) Unsupervised representation learning by predicting image rotations. In ICLR.
- (2012) Machine learning in non-stationary environments: introduction to covariate shift adaptation.
- (2016) Deep CORAL: correlation alignment for deep domain adaptation. In European Conference on Computer Vision.
- (2019) Test-time training for out-of-distribution generalization. arXiv preprint arXiv:1909.13231.
- (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2018) Deep visual domain adaptation: a survey. Neurocomputing.
- (2019) Unsupervised data augmentation for consistency training.
- (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems, pp. 3320–3328.
- (2019) Domain-aware dynamic networks. arXiv preprint arXiv:1911.13237.