Unsupervised Domain Adaptation via Calibrating Uncertainties

07/25/2019 ∙ by Ligong Han, et al.

Unsupervised domain adaptation (UDA) aims to infer class labels for an unlabeled target domain given a related labeled source dataset. Intuitively, a model trained on the source domain produces higher uncertainties on unseen data. In this work, we build on this assumption and propose to adapt from the source to the target domain by calibrating their predictive uncertainties. The uncertainty is quantified as the Rényi entropy, from which we derive a general Rényi entropy regularization (RER) framework. We further employ variational Bayes learning for reliable uncertainty estimation. In addition, calibrating the sample variance of network parameters serves as a plug-in regularizer for training. We discuss the theoretical properties of the proposed method and demonstrate its effectiveness on three domain-adaptation tasks.




1 Introduction

The ability to model uncertainty is important in unsupervised domain adaptation (UDA). For example, self-training [16, 32] often requires the model to reliably estimate the uncertainty of its predictions on the target domain in the pseudo-label selection phase. However, traditional deep neural networks (DNN) can easily assign high confidence to a wrong prediction [5, 18], and thus cannot reliably quantify the uncertainty of a given input.

On the one hand, Bayesian neural networks (BNN) [20, 5, 1, 12] tackle this problem by taking a Bayesian view of model training. Instead of obtaining a point estimate of the weights, a BNN models the distribution over weights. We leverage BNNs as a powerful tool for uncertainty estimation. On the other hand, one can estimate the empirical uncertainty of the model by the variance of network parameters; regularizing this quantity is what we call gradient variance regularization (GVR).

Finally, our approach builds on the intuition that a model that gives similar uncertainty estimates on the two domains has learned to adapt well from source to target. Thus, we propose to directly calibrate the estimated uncertainties between the source and target domains during training. This calibration is three-fold, and we list our contributions as follows:

  • We propose a general framework for unsupervised domain adaptation by calibrating the predictive uncertainty, and discuss its relationship with entropy regularization [8] and self-training [16].

  • We introduce variational Bayes neural networks to provide reliable uncertainty estimations.

  • We propose to calibrate the variance of network parameters as a model-and-objective-agnostic regularization (GVR) on the optimization dynamics.

2 Related Work

Shannon entropy is commonly used to quantify the uncertainty of a given distribution. Entropy-based UDA has already been proposed in [28]. Unlike [28], we avoid adversarial learning, which tends to be unstable and hard to train. Entropy regularization was also proposed in [8] for semi-supervised learning and can be directly applied to UDA. However, our framework is more general since the uncertainty need not be the Shannon entropy. In fact, we formalize the uncertainty as the Rényi entropy, which is a generalization of the Shannon entropy. Many other UDA methods can be modeled under this framework; for example, self-training [16, 32] can be viewed as minimizing the min-entropy, which is a special case of the Rényi entropy.

As pointed out by [7], directly optimizing the estimated Shannon entropy given data requires the classifier to be locally Lipschitz [19]. Co-DA [14] and DIRT-T [25] address this problem by incorporating the locally-Lipschitz constraint via virtual adversarial training (VAT) [19].

Another complementary line of research employs self-ensembling and shows promising results [3]. Indeed, a BNN [5] performs Bayesian ensembling by nature, which is part of the reason why BNNs provide better uncertainty estimates.

In supervised learning, regularization is used to avoid overfitting. Besides weight decay, typical regularization techniques include label smoothing [6, 26], network output regularization [23], and knowledge distillation [11]. We believe our proposed gradient variance regularizer (GVR) can also be used in supervised settings.

3 Uncertainty in Deep Neural Networks

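Rényi entropy. For a discrete probability distribution $p = (p_1, \dots, p_n)$, the Rényi entropy [29] of order $\alpha$ ($\alpha \ge 0$, $\alpha \neq 1$) is defined as

$$H_\alpha(p) = \frac{1}{1 - \alpha} \log \Big( \sum_{i=1}^{n} p_i^{\alpha} \Big). \tag{1}$$

The limiting value of $H_\alpha$ as $\alpha \to 1$ is the Shannon entropy, and $\alpha \to \infty$ corresponds to the min-entropy, $H_\infty(p) = -\log \max_i p_i$. A typical deep neural network for classification produces a discrete distribution over the possible classes given the input data. Thus, we quantify the predictive uncertainty by the Rényi entropy of this probability distribution.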
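As a small illustration of the definition and its two limits, the sketch below computes the Rényi entropy of a probability vector in plain Python. The function name `renyi_entropy` and the example distribution are hypothetical and chosen only for this illustration; natural logarithms are assumed.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy (in nats) of a discrete distribution p for order alpha >= 0."""
    probs = [q for q in p if q > 0.0]  # zero-probability outcomes contribute nothing
    if alpha == 1.0:
        # Limit alpha -> 1: Shannon entropy.
        return -sum(q * math.log(q) for q in probs)
    if math.isinf(alpha):
        # Limit alpha -> infinity: min-entropy.
        return -math.log(max(probs))
    return math.log(sum(q ** alpha for q in probs)) / (1.0 - alpha)

# Hypothetical 3-class prediction; the entropy shrinks as the order grows.
p = [0.7, 0.2, 0.1]
for a in (0.5, 0.999, 1.0, 2.0, 50.0, math.inf):
    print(a, renyi_entropy(p, a))
```

The printed values decrease with the order, which matches the fact that the Rényi entropy is non-increasing in $\alpha$.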

Bayesian neural networks. A BNN estimates the posterior over network weights while optimizing the training objective. Given the dataset $\mathcal{D}$, the output of the BNN is denoted as $f^{w}(x)$, where $x$ is the input data and $w$ are the weights. For a classification task, $f^{w}(x)$ is the predicted logits and the resulting probability vector is given by a softmax function: $p^{w}(x) = \mathrm{softmax}(f^{w}(x))$. The predictive distribution over labels given input $x$ is $p(y \mid x, \mathcal{D}) = \mathbb{E}_{p(w \mid \mathcal{D})}[p^{w}(x)]$. We define the uncertainty evaluated by BNNs as the entropy $H_\alpha\big(p(y \mid x, \mathcal{D})\big)$.

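We adopt the method from [12], where aleatoric and epistemic uncertainties are jointly modeled. In [12], the logits are assumed to be Gaussian and the reparameterization trick is utilized. The predicted logit is $\hat{f}^{w}(x) = f^{w}(x) + \sigma^{w}(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The final predicted probability vector is approximated by Monte Carlo sampling (with $M$ samples),

$$\hat{p}(x) = \frac{1}{M} \sum_{m=1}^{M} \mathrm{softmax}\big(f^{w_m}(x) + \sigma^{w_m}(x) \odot \epsilon_m\big), \tag{2}$$

with weights $w_m$ and noise $\epsilon_m$ drawn independently for each sample.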
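The Monte Carlo estimate of the predicted probability vector can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the network weights are held fixed, so only the Gaussian logit noise is sampled, and the network is assumed to output a mean logit vector together with a log-variance vector. The names `softmax` and `mc_predictive` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mc_predictive(mean_logits, log_var, num_samples, rng):
    """Average the softmax over num_samples Gaussian-perturbed logit vectors."""
    std = [math.exp(0.5 * lv) for lv in log_var]  # log-variance -> standard deviation
    avg = [0.0] * len(mean_logits)
    for _ in range(num_samples):
        # Reparameterization: logit = mean + std * eps, with eps ~ N(0, 1).
        noisy = [mu + s * rng.gauss(0.0, 1.0) for mu, s in zip(mean_logits, std)]
        for i, q in enumerate(softmax(noisy)):
            avg[i] += q / num_samples
    return avg

rng = random.Random(0)
p_hat = mc_predictive([2.0, 0.5, -1.0], [0.0, 0.0, 0.0], num_samples=100, rng=rng)
print(p_hat)
```

When the predicted variance is close to zero, the estimate collapses to the ordinary softmax of the mean logits, which is the non-Bayesian case.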


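Variational inference. As estimating the posterior $p(w \mid \mathcal{D})$ is often intractable [1, 12], variational inference is commonly adopted, where the posterior of the weights is approximated by $q_\theta(w)$ with parameter $\theta$. Specifically, in supervised learning, $\theta$ is estimated by maximizing the evidence lower bound (ELBO) [13, 5]:

$$\mathcal{L}_{\mathrm{ELBO}}(\theta) = \underbrace{\mathbb{E}_{q_\theta(w)}\big[\log p(y \mid x, w)\big]}_{\text{(I)}} - \underbrace{\mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big)}_{\text{(II)}}, \tag{3}$$

where $p(w)$ is the prior, and term (I) is the (negative) standard cross-entropy loss evaluated at $f^{w}(x)$ with parameter $w \sim q_\theta(w)$. Gal and Ghahramani [4, 5] propose to view dropout together with weight decay as a Bayesian approximation, where sampling from $q_\theta(w)$ is equivalent to performing dropout and term (II) becomes $\ell_2$ regularization (or weight decay) on $\theta$.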

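Gradient variance. Rather than finding a variational approximation of the posterior $p(w \mid \mathcal{D})$, one can instead estimate the model-dependent uncertainty by the sample variance of $\theta$ (or the sample variance of $w$ in the case of non-Bayesian networks). To be precise, sampling $K$ mini-batches $\{B_k\}_{k=1}^{K}$ from a batch $B$, one can compute the adapted parameters by performing one gradient step (at $\theta$): $\theta'_k = \theta - \eta \nabla_\theta \mathcal{L}(\theta; B_k)$, where $\mathcal{L}$ is the objective and $\eta$ is the inner learning rate. Then the variance of $\theta'$ can be defined as the trace of the covariance of the vectorized $\theta'_k$s:

$$\mathrm{Var}(\{\theta'_k\}) = \mathrm{tr}\big(\mathrm{Cov}(\{\theta'_k\})\big) = \frac{1}{K} \sum_{k=1}^{K} \big\|\theta'_k - \bar{\theta}'\big\|_2^2, \tag{4}$$

where $\bar{\theta}' = \frac{1}{K} \sum_{k=1}^{K} \theta'_k$ and $\{\cdot\}$ denotes a collection or a set. Since $\theta'_k - \bar{\theta}'$ equals $-\eta$ times the deviation of the $k$-th mini-batch gradient from the mean gradient, regularizing the variance of the parameters is essentially regularizing the variance of the gradients. We will discuss the usage of this gradient variance as a regularizer as well as its relationship with MAML [2] in the next section.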
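To make the parameter variance concrete, here is a minimal sketch on a toy linear-regression objective. The squared-error loss, the synthetic data, the $1/K$ normalization, and the names `grad_mse` and `parameter_variance` are all assumptions made for illustration; the paper applies the idea to network parameters.

```python
import random

def grad_mse(theta, batch):
    """Gradient of the mean squared error of a linear model y ~ theta . x."""
    g = [0.0] * len(theta)
    for x, y in batch:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xj in enumerate(x):
            g[j] += 2.0 * err * xj / len(batch)
    return g

def parameter_variance(theta, mini_batches, inner_lr):
    """Trace of the covariance of one-step adapted parameters (1/K normalization assumed)."""
    adapted = []
    for batch in mini_batches:
        g = grad_mse(theta, batch)
        adapted.append([t - inner_lr * gj for t, gj in zip(theta, g)])
    k = len(adapted)
    mean = [sum(a[j] for a in adapted) / k for j in range(len(theta))]
    return sum((a[j] - mean[j]) ** 2 for a in adapted for j in range(len(theta))) / k

# Synthetic batch of 32 points, split into K = 4 mini-batches of 8.
rng = random.Random(0)
data = []
for _ in range(32):
    x = [rng.uniform(-1.0, 1.0), 1.0]
    data.append((x, 3.0 * x[0] - 1.0 + rng.gauss(0.0, 0.1)))
mini_batches = [data[i:i + 8] for i in range(0, 32, 8)]

theta = [0.0, 0.0]
var_small = parameter_variance(theta, mini_batches, inner_lr=0.01)
var_large = parameter_variance(theta, mini_batches, inner_lr=0.1)
print(var_small, var_large)
```

Because the adapted parameters differ from their mean only through the mini-batch gradients, the variance scales with the square of the inner learning rate, which is why regularizing one quantity regularizes the other.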

4 Domain Adaptation via Calibrating Uncertainties

Rényi entropy regularization. Denote the source and target datasets as $\mathcal{D}_s = \{(x_s, y_s)\}$ and $\mathcal{D}_t = \{x_t\}$ respectively, where $x_s$ and $x_t$ indicate the samples and $y_s$ is the label in the source domain. We propose to calibrate the predictive uncertainty on the target dataset with the source-domain uncertainties. Concretely, we minimize the cross-entropy (CE) loss in the source domain while constraining the predicted entropy in the target domain:

$$\min_\theta \; \mathrm{CE}\big(y_s, \hat{p}(x_s)\big) \quad \text{s.t.} \quad H_\alpha\big(\hat{p}(x_t)\big) \le \delta, \tag{5}$$

where $\mathrm{CE}$ is the cross-entropy and $\delta$ indicates the strength of the applied constraint. In practice, the network is first pretrained on the labeled source dataset using the ELBO in Equation 3. Then, unlabeled target data is introduced through Equation 5, and $\hat{p}$ is computed from Equation 2. Note that the resulting CE loss is no longer term (I) in the ELBO, since the expectation is inside the logarithm. We simply treat $q_\theta(w)$ as if it were the true posterior and evaluate the CE using $\hat{p}$. For a non-Bayesian network, $\mathrm{softmax}(f^{w}(x))$ is used as a replacement for $\hat{p}$.

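To solve Equation 5, we rewrite it as a Lagrangian with a multiplier $\lambda$ and constraint level $\delta$,

$$\mathcal{L}(\theta, \lambda) = \mathrm{CE}\big(y_s, \hat{p}(x_s)\big) + \lambda \Big( H_\alpha\big(\hat{p}(x_t)\big) - \delta \Big). \tag{6}$$

Since $\lambda, \delta \ge 0$, an upper bound on $\mathcal{L}$ is obtained,

$$\mathcal{L}(\theta, \lambda) \le \tilde{\mathcal{L}}_\alpha(\theta) = \mathrm{CE}\big(y_s, \hat{p}(x_s)\big) + \lambda H_\alpha\big(\hat{p}(x_t)\big). \tag{7}$$

Ideally, Equation 6 can be optimized via dual gradient descent, with $\lambda$ jointly updated along with $\theta$. For simplicity, we follow [10] and fix $\lambda$ as a hyper-parameter in the experiments and minimize the upper bound $\tilde{\mathcal{L}}_\alpha$.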
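The upper bound that is minimized in practice can be sketched as a source cross-entropy term plus a weighted target entropy term. The sketch below uses the Shannon entropy as the uncertainty measure and averages over hand-written probability vectors; the function names, the averaging over samples, and the example numbers are assumptions for illustration only.

```python
import math

def cross_entropy(label, probs):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label])

def shannon_entropy(probs):
    """Shannon entropy in nats (the alpha -> 1 case of the Rényi entropy)."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)

def rer_upper_bound(source_probs, source_labels, target_probs, lam,
                    entropy_fn=shannon_entropy):
    """Mean source cross-entropy plus lam times the mean target entropy."""
    ce = sum(cross_entropy(y, p) for y, p in zip(source_labels, source_probs))
    ce /= len(source_labels)
    ent = sum(entropy_fn(p) for p in target_probs) / len(target_probs)
    return ce + lam * ent

# Hypothetical predictions: two labeled source samples, one unlabeled target sample.
source_probs = [[0.9, 0.1], [0.2, 0.8]]
source_labels = [0, 1]
confident = rer_upper_bound(source_probs, source_labels, [[0.99, 0.01]], lam=0.1)
uncertain = rer_upper_bound(source_probs, source_labels, [[0.5, 0.5]], lam=0.1)
print(confident, uncertain)  # the uncertain target prediction incurs the larger penalty
```

Passing a different `entropy_fn` (for example a min-entropy) gives other members of the same family of objectives.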

Note that letting $\alpha \to 1$ in Equation 7 is in fact the (Shannon) entropy regularization described in [7, 8], except that here we consider a variational BNN. As pointed out in [8], directly optimizing Equation 7 can be difficult and expectation maximization (EM) algorithms are often used. Proposed in [30, 8], deterministic annealing EM anneals the predicted probabilities as soft labels and minimizes the resulting cross-entropy. In the extreme case, soft labels become one-hot vectors and the algorithm turns out to be self-training with pseudo-labels [16]. In our Rényi entropy regularization framework, self-training is essentially optimizing the min-entropy ($\alpha \to \infty$). Then the objective reads

$$\tilde{\mathcal{L}}_\infty = \mathrm{CE}\big(y_s, \hat{p}(x_s)\big) + \lambda\, \mathrm{CE}\big(\hat{y}_t, \hat{p}(x_t)\big),$$

with $\hat{y}_t = \mathrm{onehot}\big(\arg\max_c \hat{p}(x_t)_c\big)$ being the pseudo-labels in the target domain. The subscript $c$ denotes the $c$-th element of a given $C$-dim vector. The relationship between $\tilde{\mathcal{L}}_1$ and $\tilde{\mathcal{L}}_\infty$ can be seen immediately by noticing that the Shannon entropy is an upper bound on the min-entropy:

$$H_\infty(p) \le H_1(p).$$
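The link between self-training and the min-entropy can be checked numerically: the cross-entropy against a one-hot pseudo-label placed at the most probable class equals the min-entropy of the prediction, and the min-entropy never exceeds the Shannon entropy. The helper names and the example probability vectors below are hypothetical.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)

def min_entropy(probs):
    """Min-entropy: negative log of the largest probability."""
    return -math.log(max(probs))

def pseudo_label(probs):
    """Index of the most probable class, used as a hard pseudo-label."""
    return max(range(len(probs)), key=lambda c: probs[c])

def cross_entropy(label, probs):
    """Cross-entropy between a one-hot label and a probability vector."""
    return -math.log(probs[label])

examples = [[0.6, 0.3, 0.1], [0.34, 0.33, 0.33], [0.98, 0.01, 0.01]]
for p in examples:
    y_hat = pseudo_label(p)
    # Cross-entropy with the pseudo-label, min-entropy, and Shannon entropy.
    print(y_hat, cross_entropy(y_hat, p), min_entropy(p), shannon_entropy(p))
```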


We build our method on top of class-balanced self-training (CBST) [32] and use it as the backbone of RER. CBST selects the most confident predictions as pseudo-labels in a self-paced (“easy-to-hard”) scheme, since jointly learning the model and optimizing pseudo-labels on all unlabeled data is inherently difficult. The authors also propose to normalize the class-wise confidence levels in pseudo-label generation to balance the class distribution. For a detailed formulation, we refer readers to Sections 4.1 and 4.2 of [32].

Gradient variance regularization. The entropy regularization or self-training framework formulated above implicitly encourages cross-domain feature alignment. However, pseudo-labels can be quite noisy even if a BNN is employed to estimate their reliability. Trusting all selected pseudo-labels as one-hot-encoded “ground truth” is overconfident, and self-training with noisy pseudo-labels can lead to incorrect entropy minimization. Indeed, we observe that the model can quickly converge to its overconfident predictions. As a result, the parameter variance evaluated in the target domain using pseudo-labels via Equation 4 can be even smaller than that of the source domain. To address this problem, we propose to regularize self-training by maximizing the gradient variance. Algorithm 1 illustrates the regularized self-training procedure on the target domain (training on the source and target domains is performed alternately, which is omitted in the algorithm box). $\eta$ and $\beta$ are the inner and outer step sizes, and $\gamma$ is the hyper-parameter weighting the regularization term.

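Algorithm 1 Gradient Variance Regularization
while not done do
    Sample mini-batches $\{B_k\}_{k=1}^{K}$ from a target batch $B$
    for all $B_k$ do
        Evaluate $\nabla_\theta \mathcal{L}(\theta; B_k)$
        Compute $\theta'_k = \theta - \eta \nabla_\theta \mathcal{L}(\theta; B_k)$ (inner step, step size $\eta$)
    end for
    Collect $\Theta' = \{\theta'_k\}_{k=1}^{K}$
    Compute $\mathrm{Var}(\Theta')$ via Equation 4
    Update $\theta \leftarrow \theta - \beta \nabla_\theta \big( \mathcal{L}(\theta; B) - \gamma\, \mathrm{Var}(\Theta') \big)$ (outer step, step size $\beta$, weight $\gamma$)
end while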
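One possible reading of the procedure above is sketched below on a toy scalar regression problem. Several things here are assumptions rather than the authors' method: the outer objective is taken to be the batch loss minus a weighted variance of the one-step adapted parameters, the toy loss is a squared error, and a central finite difference stands in for automatic differentiation. All names (`gvr_step`, `adapted_variance`, and so on) are hypothetical.

```python
import random

def loss(theta, batch):
    """Mean squared error of the scalar model y ~ theta * x (toy objective)."""
    return sum((theta * x - y) ** 2 for x, y in batch) / len(batch)

def grad(theta, batch):
    """Analytic gradient of the toy loss with respect to theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)

def adapted_variance(theta, mini_batches, inner_lr):
    """Variance of the parameters after one inner gradient step per mini-batch."""
    adapted = [theta - inner_lr * grad(theta, b) for b in mini_batches]
    mean = sum(adapted) / len(adapted)
    return sum((a - mean) ** 2 for a in adapted) / len(adapted)

def gvr_step(theta, batch, num_splits, inner_lr, outer_lr, weight, eps=1e-5):
    """One outer update on (batch loss - weight * variance); assumed form of the update."""
    size = len(batch) // num_splits
    mini_batches = [batch[i * size:(i + 1) * size] for i in range(num_splits)]

    def objective(t):
        return loss(t, batch) - weight * adapted_variance(t, mini_batches, inner_lr)

    # Central finite difference in place of automatic differentiation.
    g = (objective(theta + eps) - objective(theta - eps)) / (2.0 * eps)
    return theta - outer_lr * g

rng = random.Random(0)
batch = []
for _ in range(32):
    x = rng.uniform(-1.0, 1.0)
    batch.append((x, 2.0 * x + rng.gauss(0.0, 0.05)))

theta = 0.0
for _ in range(100):
    theta = gvr_step(theta, batch, num_splits=4, inner_lr=0.1, outer_lr=0.5, weight=0.1)
print(theta)
```

With the weight set to zero the update reduces to plain gradient descent on the batch loss, so the variance term acts purely as an add-on regularizer.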

Notice that the proposed GVR shares similarities with MAML [2] when viewed from a dynamical-systems standpoint, even though MAML samples mini-batches from different tasks. Taking a first-order Taylor expansion of the loss function around $\theta$,

$$\mathcal{L}(\theta') = \mathcal{L}\big(\theta - \eta \nabla_\theta \mathcal{L}(\theta)\big) \approx \mathcal{L}(\theta) - \eta\, \big\|\nabla_\theta \mathcal{L}(\theta)\big\|_2^2,$$

we see that MAML, by minimizing the loss at the adapted parameters, tries to maximize the sensitivity of the loss functions with respect to the parameters by maximizing the norms of the gradients. In contrast, GVR maximizes the variance of the gradients, which intuitively encourages the model to escape from bad local minima.
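The first-order expansion behind this argument, in which the loss after one gradient step is approximately the current loss minus the step size times the squared gradient norm, can be checked numerically. The toy loss below and the names `loss`, `grad`, and `taylor_gap` are hypothetical and serve only to illustrate that the approximation error shrinks quadratically with the step size.

```python
def loss(theta):
    """Toy non-quadratic loss in two parameters."""
    return theta[0] ** 4 + theta[1] ** 2

def grad(theta):
    """Analytic gradient of the toy loss."""
    return [4.0 * theta[0] ** 3, 2.0 * theta[1]]

def taylor_gap(theta, inner_lr):
    """Absolute gap between the loss after one step and its first-order estimate."""
    g = grad(theta)
    adapted = [t - inner_lr * gj for t, gj in zip(theta, g)]
    first_order = loss(theta) - inner_lr * sum(gj * gj for gj in g)
    return abs(loss(adapted) - first_order)

theta = [1.0, -2.0]
gap_large = taylor_gap(theta, inner_lr=0.01)
gap_small = taylor_gap(theta, inner_lr=0.001)
print(gap_large, gap_small)  # a 10x smaller step gives a roughly 100x smaller gap
```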

It is worth mentioning that GVR is not only model-agnostic but also objective-agnostic. This is useful when the regularizer itself is the objective to be optimized. Moreover, GVR is complementary to VAT [19] since in VAT the gradient is computed with respect to input data. We conjecture that the data gradient somewhat captures the aleatoric (data-dependent) uncertainty, which we leave for future work.

5 Experiments

We first show results on three digit datasets, MNIST [15], USPS, and SVHN [21], where we consider MNIST→USPS and SVHN→MNIST. Then we present preliminary results on a challenging benchmark, VisDA17 (classification) [22], which contains 12 classes. We follow the standard protocol in [22, 27, 24]. Classification accuracies on source and target domains for base models are reported in Table 3. We use DTN [31] as our base model for MNIST→USPS and SVHN→MNIST. To implement its Bayesian variant (BDTN), we add another classifier to predict the logarithm of the variance.

(a) MNIST
Model        Source (%)   Target (%)
DTN          100.00       83.94
BDTN-M1      100.00       83.78
BDTN-M5      100.00       86.83
BDTN-M10     100.00       86.28
BDTN-M20     100.00       86.78
BDTN-M100    100.00       87.06

(b) SVHN
Model        Source (%)   Target (%)
DTN          97.42        72.91
BDTN-M1      95.91        65.51
BDTN-M5      99.16        71.12
BDTN-M10     99.42        71.38
BDTN-M20     99.50        73.64
BDTN-M100    99.33        74.91

Table 3: Training base models on MNIST and SVHN. BDTN is a modified Bayesian DTN [31], with different values of the number of Monte Carlo samples M (as defined in Equation 2). Classification accuracies in source and target domains are reported.

Domain adaptation results on the digit datasets are shown in Table 6. Our proposed Rényi entropy regularization methods with non-Bayesian and Bayesian base models are listed as RERs and BRERs respectively. We see that self-training with pseudo-labels ((B)RER-∞) is in general more stable than directly minimizing the Shannon entropy ((B)RER-1). Also, adding GVR to (B)RER-∞ improves the performance. However, we also observe that GVR is not helpful in the (B)RER-1 settings.

(a) MNIST→USPS
Model         Target Acc (%)   Acc Gain (%)
Source-DTN    83.94            -
Source-BDTN   84.89            -
RER-1         91.57 ± 0.13     7.63
RER-1-GVR     91.97 ± 0.26     8.03
RER-∞         93.57 ± 0.30     9.63
RER-∞-GVR     93.88 ± 0.14     9.94
BRER-1        92.78 ± 0.42     7.89
BRER-1-GVR    93.07 ± 0.72     8.18
BRER-∞        94.42 ± 0.12     9.53
BRER-∞-GVR    94.53 ± 0.23     9.64

(b) SVHN→MNIST
Model         Target Acc (%)   Acc Gain (%)
Source-DTN    64.48            -
Source-BDTN   70.98            -
RER-1         88.46 ± 0.90     23.98
RER-1-GVR     85.48 ± 4.71     21.00
RER-∞         88.16 ± 1.19     23.68
RER-∞-GVR     90.31 ± 2.31     25.83
BRER-1        92.49 ± 4.73     21.51
BRER-1-GVR    92.37 ± 4.76     21.39
BRER-∞        96.06 ± 0.68     25.08
BRER-∞-GVR    96.38 ± 0.05     25.40

Table 6: Results on MNIST→USPS and SVHN→MNIST. RER uses DTN [31] as the base model. BRER-∞ uses BDTN as the base model and optimizes the min-entropy objective (α → ∞), while BRER-1 optimizes the Shannon-entropy objective (α → 1). Results are averaged over 4 runs with different random seeds.

Mean accuracies on VisDA17 dataset are reported in Table 7. Following the protocol in [32], we train a standard ResNet101 [9] as the base model and add a second classifier (denoted as BRes101) to predict the logarithm of variance on logits. Results show that BNN improves upon non-Bayesian baselines by a large margin. GVR has not been tested on VisDA17 with (B)Res101 since the memory requirement exceeds our GPU capacities.

Model            Target mean-Acc (%)   Acc Gain (%)
Source-Res101    48.02                 -
Source-BRes101   46.03                 -
MMD [17]         61.1                  -
GTA-Res152 [24]  77.1                  -
RER-∞            76.81 ± 2.73          28.79
BRER-∞           80.59 ± 1.39          34.56

Table 7: Preliminary results on the VisDA17 [22] classification benchmark (validation set). Results are averaged over 5 runs with different random seeds.

6 Conclusion

In this work, we propose to approach unsupervised domain adaptation via calibrating the predictive uncertainties between source and target domains. The uncertainty is quantified under a general Rényi entropy regularization framework, within which we introduce Bayesian neural networks for accurate and reliable uncertainty estimations. From a frequentist point of view, we in addition propose to approximate the model uncertainty via the sample variance of network parameters (or gradients) during training. Results show that the uncertainty estimation by Bayesian networks and gradient variances is effective and leads to stable performance in unsupervised domain adaptation.


  • [1] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §1, §3.
  • [2] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §3, §4.
  • [3] G. French, M. Mackiewicz, and M. Fisher (2017) Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208. Cited by: §2.
  • [4] Y. Gal and Z. Ghahramani (2015) Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158. Cited by: §3.
  • [5] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §1, §2, §3.
  • [6] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §2.
  • [7] Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536. Cited by: §2, §4.
  • [8] Y. Grandvalet and Y. Bengio (2006) Entropy regularization. Semi-supervised learning, pp. 151–168. Cited by: 1st item, §2, §4.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.
  • [10] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, Vol. 3. Cited by: §4.
  • [11] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • [12] A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. In Advances in neural information processing systems, pp. 5574–5584. Cited by: §1, §3, §3.
  • [13] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.
  • [14] A. Kumar, P. Sattigeri, K. Wadhawan, L. Karlinsky, R. Feris, B. Freeman, and G. Wornell (2018) Co-regularized alignment for unsupervised domain adaptation. In Advances in Neural Information Processing Systems, pp. 9345–9356. Cited by: §2.
  • [15] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.
  • [16] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2. Cited by: 1st item, §1, §2, §4.
  • [17] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791. Cited by: Table 7.
  • [18] C. Louizos and M. Welling (2017) Multiplicative normalizing flows for variational bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2218–2227. Cited by: §1.
  • [19] T. Miyato, S. Maeda, M. Koyama, K. Nakae, and S. Ishii (2015) Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677. Cited by: §2, §4.
  • [20] R. M. Neal (2012) Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §1.
  • [21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §5.
  • [22] X. Peng, B. Usman, N. Kaushik, D. Wang, J. Hoffman, and K. Saenko (2018) VisDA: a synthetic-to-real benchmark for visual domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2021–2026. Cited by: Table 7, §5.
  • [23] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton (2017) Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548. Cited by: §2.
  • [24] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2018) Generate to adapt: aligning domains using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8503–8512. Cited by: Table 7, §5.
  • [25] R. Shu, H. H. Bui, H. Narui, and S. Ermon (2018) A dirt-t approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735. Cited by: §2.
  • [26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §2.
  • [27] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176. Cited by: §5.
  • [28] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2018) ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. arXiv preprint arXiv:1811.12833. Cited by: §2.
  • [29] Wikipedia contributors (2018) Rényi entropy — Wikipedia, the free encyclopedia. Note: [Online; accessed 13-May-2019] External Links: Link Cited by: §3.
  • [30] A. L. Yuille, P. Stolorz, and J. Utans (1994) Statistical physics, mixtures of distributions, and the em algorithm. Neural Computation 6 (2), pp. 334–340. Cited by: §4.
  • [31] X. Zhang, F. X. Yu, S. Chang, and S. Wang (2015) Deep transfer network: unsupervised domain adaptation. arXiv preprint arXiv:1503.00591. Cited by: Table 3, Table 6, §5.
  • [32] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305. Cited by: §1, §2, §4, §5.