CDOT: Continuous Domain Adaptation using Optimal Transport

09/20/2019 ∙ by Guillermo Ortiz-Jiménez, et al. ∙ EPFL

In this work, we address the scenario in which the target domain is continually, albeit slowly, evolving, and in which, at different time frames, we are given a batch of test data to classify. We exploit the geometry-awareness that optimal transport offers for the resolution of continuous domain adaptation problems. We propose a regularized optimal transport model that takes into account the transportation cost, the entropy of the probabilistic coupling, the labels of the source domain, and the similarity between successive target domains. The resulting optimization problem is efficiently solved with a forward-backward splitting algorithm based on Bregman distances. Experiments show that the proposed approach leads to a significant improvement in terms of speed and performance with respect to the state of the art.


1 Introduction

The vast majority of machine learning algorithms are designed and built around the assumption that the training and test samples are independent and identically distributed. Nevertheless, in most situations this is not the case: in practice, some type of distributional shift exists between the training and test distributions, which may cause a significant drop in the performance of any classification method.

Domain adaptation algorithms Wang and Deng (2018) try to solve this mismatch and propose ways to design classifiers that can handle differences between the training and test distributions. The body of research in this field is extremely prolific, but in most cases it has targeted the scenario in which there is access to a large amount of labeled training data from a source domain, but only a set of unlabeled test samples is given from a target domain. In this scenario, it can be shown that the performance of the adaptation depends on the distance between the source and target domains. In this sense, optimal transport Villani (2008), with its powerful mathematical machinery that defines distances between probability distributions by taking into account the geometry of the underlying space, has been very successful in providing a theoretical framework for this type of problem.

In this work, we address the scenario in which the target domain is continually, albeit slowly, evolving, and in which, at different time frames, we are given a batch of test data to classify (cf. Figure 1). This type of behaviour can be seen in a variety of applications, such as traffic monitoring with gradually changing lighting and atmospheric conditions Hoffman et al. (2014), spam emails evolving through time, or smooth regional variations of language across a country Ruder et al. (2016). Continuous domain adaptation has also found applications in healthcare, adapting the problem of X-ray segmentation to different domains Venkataramani et al. (2018).

In particular, we exploit the geometry-awareness that optimal transport offers for the resolution of continuous domain adaptation problems. We propose a regularized optimal transport model that takes into account the transportation cost, the entropy of the probabilistic coupling, the labels of the source domain, and the similarity between successive target domains. The resulting optimization problem is efficiently solved with a forward-backward splitting algorithm based on Bregman distances Van Nguyen (2017); Bùi and Combettes (2019). Experiments show that the proposed approach leads to a significant improvement in terms of speed and performance with respect to the state of the art Rakotomamonjy et al. (2015).

Figure 1: Example of a source domain and a sequence of slowly-varying target domains.

Related work

Previous work has addressed the problem of domain adaptation using optimal transport. Courty et al. (2016) propose to learn a transportation plan (using a barycentric mapping) that matches the distributions of the two domains, while constraining labelled samples of the same class to remain close during the transportation. This constraint is achieved through a regularization term based on group sparsity or a graph Laplacian matrix. This work is extended in Courty et al. (2017), where the authors propose a new formulation to jointly train a classifier while learning the transport plan. To allow for the inclusion of out-of-sample examples, Perrot et al. (2016) propose to move beyond a barycentric mapping and learn an explicit mapping jointly with the optimal transport objective. This approach was scaled up in Seguy et al. (2017), where the Monge map is obtained in a two-step approach using the dual formulation of the optimal transport problem with entropic regularization: the OT plan computed in the first step is used to train a neural network that serves as an estimate of the Monge map. Finally, Yan et al. (2018) address the problem of semi-supervised domain adaptation, i.e., when a few labelled examples are given from the target domain, through the use of the Gromov–Wasserstein distance Peyré et al. (2016) instead of the Wasserstein distance.

To our knowledge, we are the first to tackle the problem of continuous domain adaptation using optimal transport. However, other authors have considered different approaches to solve this problem. Bitarafan et al. (2016) propose an incremental approach that finds a feature space where source and target data are similarly distributed, so as to classify them in a semi-supervised manner. Bobu et al. (2018) tackle the problem while ensuring the model continues to correctly classify examples from previous domains, thus avoiding the issue of catastrophic forgetting. Wulfmeier et al. (2018) train a generative adversarial network to incrementally adapt to continuously changing target domains.

2 Problem formulation

We denote the available discrete samples by $\mathbf{X}_S \in \mathbb{R}^{n_S \times d}$ and $(\mathbf{X}_T^{(t)})_{1 \le t \le T}$, where $\mathbf{X}_S$ is the matrix of the source signal positions, and $\mathbf{X}_T^{(t)} \in \mathbb{R}^{n_T \times d}$ are the matrices of the moving target positions for each time $t \in \{1,\dots,T\}$. We denote the corresponding distributions as $\mu_S$ and $(\mu_T^{(t)})_{1 \le t \le T}$, which are embedded in a metric space. As for the optimal transport, we denote by $\mathbf{C}(\mathbf{X}, \mathbf{X}')$ the transport cost matrix between the distribution supported on $\mathbf{X}$ (to be transported) and the one supported on $\mathbf{X}'$, where each pairwise term $C_{ij} = \|\mathbf{x}_i - \mathbf{x}'_j\|_2^2$ measures the cost to transport the $i$-th component of $\mathbf{X}$ to the $j$-th component of $\mathbf{X}'$. In addition, we assume that the training samples are associated with a set of class labels $\mathbf{y}_S$, and the sequence of test samples are associated with unknown labels $(\mathbf{y}_T^{(t)})_{1 \le t \le T}$. Moreover, $h$ denotes the entropic regularization defined for every $\gamma \in \mathbb{R}_+^{n \times n'}$ as

$$h(\gamma) = \sum_{i,j} \gamma_{ij} \big( \log \gamma_{ij} - 1 \big). \qquad (1)$$

Proposed approach

In order to infer the unknown labels $(\mathbf{y}_T^{(t)})_{1 \le t \le T}$, we propose to estimate the sequential transport plans $(\gamma_t)_{1 \le t \le T}$ through the following consecutive steps.

  1. Firstly, we compute the probabilistic coupling $\gamma_1$ between the source distribution $\mu_S$ and the first target distribution $\mu_T^{(1)}$ as the solution to the entropic optimal transport problem, that is

    $$\gamma_1 = \operatorname*{argmin}_{\gamma \in \Pi(\mu_S,\, \mu_T^{(1)})} \; \langle \gamma, \mathbf{C}_1 \rangle + \epsilon\, h(\gamma), \qquad (2)$$

    where $\mathbf{C}_1 = \mathbf{C}(\mathbf{X}_S, \mathbf{X}_T^{(1)})$ is the transport cost between $\mathbf{X}_S$ and $\mathbf{X}_T^{(1)}$, $\epsilon > 0$, and $\Pi(\mu_S, \mu_T^{(1)})$ denotes the set of couplings with marginals $\mu_S$ and $\mu_T^{(1)}$. The regularization affects the sparsity of $\gamma_1$, which decreases by setting a higher value for $\epsilon$, thus leading to a fuzzier coupling between the source and the target.

  2. Then, for every $t \in \{2,\dots,T\}$, we compute the probabilistic coupling $\gamma_t$ between the mapped source distribution at time $t-1$ and the subsequent target distribution $\mu_T^{(t)}$ as follows

    $$\gamma_t = \operatorname*{argmin}_{\gamma \in \Pi(\mu_S,\, \mu_T^{(t)})} \; \langle \gamma, \mathbf{C}_t \rangle + \epsilon\, h(\gamma) + \eta_c\, \Omega_c(\gamma) + \eta_t\, \Omega_t(\gamma), \qquad (3)$$

    where $\mathbf{C}_t = \mathbf{C}(\hat{\mathbf{X}}_{t-1}, \mathbf{X}_T^{(t)})$, $\eta_c \ge 0$ and $\eta_t \ge 0$ are regularization parameters, $\Omega_c$ is the class-based regularizer, and $\Omega_t$ is the time-based regularizer.

    • The regularizer $\Omega_c$ aims at conveying label-based information, grounded on the assumption that each target sample has to receive mass only from source samples that have the same label Courty et al. (2014, 2016), yielding

      $$\Omega_c(\gamma) = \sum_{j} \sum_{c} \big\| \gamma(\mathcal{I}_c, j) \big\|_2. \qquad (4)$$

      Here above, $\mathcal{I}_c$ gathers the row indices of $\gamma$ that belong to the same class $c$, and $\gamma(\mathcal{I}_c, j)$ is the corresponding sub-vector of the $j$-th column. The mixed $\ell_{1,2}$-norm is used in order to model "group sparsity", i.e., dependencies between the group of points that belong to the same class.

    • $\Omega_t$ aims at enforcing temporal smoothness, which is modeled through the barycentric mapping¹:

      $$\Omega_t(\gamma) = \big\| n_S\, \gamma\, \mathbf{X}_T^{(t)} - \hat{\mathbf{X}}_{t-1} \big\|_F^2. \qquad (5)$$

      ¹A barycentric mapping $\hat{\mathbf{X}}_t$ of the source signal to a target signal at time $t$ is defined as a weighted barycenter of its neighbors in $\mathbf{X}_T^{(t)}$, e.g., $\hat{\mathbf{X}}_t = n_S\, \gamma_t\, \mathbf{X}_T^{(t)}$ for uniform marginals.
  3. Finally, we train a classifier on the mapped source samples $\hat{\mathbf{X}}_t$ and evaluate its accuracy on the new target datapoints $\mathbf{X}_T^{(t)}$ (a minimal code sketch of these three steps is given below).
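
To make these steps concrete, here is a minimal end-to-end sketch in Python, assuming uniform marginals and relying on the POT library (ot.dist, ot.sinkhorn) and scikit-learn's 1-NN classifier. For brevity it applies plain entropic OT at every step; the class- and time-based regularizers of (3) require the forward-backward scheme of Algorithm 1 below.

    import numpy as np
    import ot                                   # POT: Python Optimal Transport
    from sklearn.neighbors import KNeighborsClassifier

    def cdot_pipeline(X_s, y_s, targets, eps=1e-1):
        # Sequentially adapt the source samples X_s (labels y_s) to a list of
        # target batches, predicting labels for each batch with a 1-NN
        # classifier trained on the barycentric mapping of the source.
        n_s = X_s.shape[0]
        mu = np.full(n_s, 1.0 / n_s)            # uniform source marginal
        X_hat = X_s                             # mapped source, updated over time
        predictions = []
        for X_t in targets:
            nu = np.full(X_t.shape[0], 1.0 / X_t.shape[0])
            C = ot.dist(X_hat, X_t)             # squared Euclidean cost (sequential cost)
            gamma = ot.sinkhorn(mu, nu, C, reg=eps)   # entropic coupling, cf. (2)
            X_hat = n_s * gamma @ X_t           # barycentric mapping of the source
            clf = KNeighborsClassifier(n_neighbors=1).fit(X_hat, y_s)
            predictions.append(clf.predict(X_t))
        return predictions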

1: Function $f$ with $\beta$-Lipschitz continuous gradient
2: Cost $\mathbf{C}_t$, marginals $(\mu, \nu)$, and $\epsilon > 0$
3: Step-size $\alpha > 0$
4: Initialization: $\gamma^{[0]} = \mu \nu^\top$
5: for $n = 0, 1, \dots$ do
6:     $\widetilde{\mathbf{C}} \leftarrow \mathbf{C}_t + \nabla f(\gamma^{[n]}) - \alpha^{-1} \log \gamma^{[n]}$
7:     $\gamma^{[n+1]} \leftarrow \mathrm{Sinkhorn}\big(\mu, \nu, \widetilde{\mathbf{C}}, \epsilon + \alpha^{-1}\big)$
8: return $\gamma^{[n+1]}$
Algorithm 1: Fast algorithm for the regularized optimal transport problem defined in (3).
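
The pseudocode above translates almost line-for-line into Python. The following is a minimal NumPy/POT sketch, assuming uniform marginals and a user-supplied gradient grad_f of the smooth regularizer $f = \eta_c\,\Omega_c + \eta_t\,\Omega_t$; the fixed iteration count and step size are illustrative rather than tuned.

    import numpy as np
    import ot   # POT: provides the Sinkhorn solver used in line 7

    def cdot_forward_backward(C, mu, nu, grad_f, eps=1e-1, alpha=1.0, n_iter=50):
        # Forward-backward splitting for problem (3): each iteration shifts the
        # cost by the gradient of the smooth regularizer (forward step) and
        # calls Sinkhorn on the shifted cost (backward step); no line search.
        gamma = np.outer(mu, nu)                # feasible initial coupling
        for _ in range(n_iter):
            C_tilde = C + grad_f(gamma) - np.log(gamma) / alpha
            gamma = ot.sinkhorn(mu, nu, C_tilde, reg=eps + 1.0 / alpha)
        return gamma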

3 Optimization algorithm

We propose to solve the CDOT problem described in the previous section via a forward-backward splitting algorithm based on Bregman distances Van Nguyen (2017); Bùi and Combettes (2019), whose iterations are summarized in Algorithm 1 (there, $\log \gamma$ denotes the natural logarithm applied element-wise to the matrix $\gamma$). To this end, we remark that Problem (3) can be generically formulated as

$$\operatorname*{minimize}_{\gamma \in \Pi}\; f(\gamma) + g(\gamma), \qquad (6)$$

where $f$ (gathering the regularizers $\eta_c\,\Omega_c + \eta_t\,\Omega_t$) is a differentiable function with $\beta$-Lipschitz continuous gradient, $g$ is a lower semicontinuous convex function defined as

$$g(\gamma) = \langle \gamma, \mathbf{C}_t \rangle + \epsilon\, h(\gamma), \qquad (7)$$

and $\Pi$ is a convex subset defined as

$$\Pi = \big\{ \gamma \in \mathbb{R}_+^{n_S \times n_T} \;\big|\; \gamma \mathbf{1} = \mu,\ \gamma^\top \mathbf{1} = \nu \big\}. \qquad (8)$$

The above problem fits nicely into the forward-backward splitting framework of Van Nguyen (2017); Bùi and Combettes (2019), which allows us to solve (6) through the following iterative algorithm (here, $\iota_\Pi$ denotes the indicator function of $\Pi$, which is equal to $0$ for every $\gamma \in \Pi$, and $+\infty$ otherwise):

$$\gamma^{[n+1]} = \operatorname{prox}^{\phi}_{\alpha(g+\iota_\Pi)}\Big( \nabla\phi^{*}\big( \nabla\phi(\gamma^{[n]}) - \alpha\,\nabla f(\gamma^{[n]}) \big) \Big), \qquad (9)$$

where $n \in \mathbb{N}$, $\alpha > 0$ is the step-size, and $\phi$ is a Legendre function. The key ingredient in the algorithm above is the $\phi$-proximity operator of $\alpha(g + \iota_\Pi)$, which is defined as

$$\operatorname{prox}^{\phi}_{\alpha(g+\iota_\Pi)}(\zeta) = \operatorname*{argmin}_{\gamma \in \Pi}\; \alpha\, g(\gamma) + D_\phi(\gamma, \zeta), \qquad (10)$$

with $D_\phi$ the Bregman distance associated with $\phi$.

By setting $\phi = h$, the proximity operator boils down to an entropic optimal transport problem

$$\gamma^{[n+1]} = \operatorname*{argmin}_{\gamma \in \Pi}\; \Big\langle \gamma,\ \mathbf{C}_t + \nabla f(\gamma^{[n]}) - \alpha^{-1}\log \gamma^{[n]} \Big\rangle + \big(\epsilon + \alpha^{-1}\big)\, h(\gamma), \qquad (11)$$

whose solution can be efficiently computed with the Sinkhorn algorithm Cuturi (2013). According to the iterations in (9), replacing $\phi$ with $h$ leads to Algorithm 1, which is guaranteed to converge to a solution to Problem (6) by adequately setting the step-size $\alpha$, as discussed in Van Nguyen (2017); Bùi and Combettes (2019).
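
For reference, each entropic problem of the form (11) can be solved with a few lines of matrix scaling. The following is a minimal sketch of the Sinkhorn iterations Cuturi (2013); the tolerance and iteration cap are illustrative.

    import numpy as np

    def sinkhorn(mu, nu, C, reg, n_iter=1000, tol=1e-9):
        # Solve min_{gamma in Pi(mu, nu)} <gamma, C> + reg * h(gamma)
        # by alternately scaling the Gibbs kernel K = exp(-C / reg).
        K = np.exp(-C / reg)
        u = np.ones_like(mu)
        v = np.ones_like(nu)
        for _ in range(n_iter):
            v = nu / (K.T @ u)                  # enforce column marginals
            u_new = mu / (K @ v)                # enforce row marginals
            if np.max(np.abs(u_new - u)) < tol:
                u = u_new
                break
            u = u_new
        return u[:, None] * K * v[None, :]      # gamma = diag(u) K diag(v)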

Note that Algorithm 1 is strikingly similar to the generalized conditional gradient splitting (CGS) algorithm proposed in Rakotomamonjy et al. (2015). Indeed, both consist of a sequential application of the Sinkhorn algorithm to an initial coupling, until convergence to a solution of the CDOT problem. However, the CGS method performs a line search at each iteration to ensure convergence, whereas Algorithm 1 simply works with a constant step-size, leading to an optimization method that is both efficient and much easier to implement.

4 Experiments

4.1 Continuous domain adaptation of a rotating distribution

To assess the effectiveness of the proposed time regularization, we compare the adaptation and tracking performance of several optimal transport strategies that use different combinations of regularizers to perform the domain adaptation. In our experiments we replicate the setup proposed in Courty et al. (2016), which uses the standard two entangled moons dataset as source. To create the sequence of targets, we sample new data points from the source distribution and rotate them around the origin in batches, in steps of 18 degrees. In our simulations we use 500 labeled data samples in the source domain and 50 samples in each target domain. After each adaptation step, we train a new 1-Nearest Neighbor (1-NN) classifier on the mapped source samples and evaluate its accuracy on 1000 new datapoints for each target.
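
For reproducibility, the rotating-targets sequence can be generated along the following lines; this is a sketch under assumptions, as the noise level and the use of scikit-learn's make_moons are our own choices, while the batch sizes and rotation steps follow the text above.

    import numpy as np
    from sklearn.datasets import make_moons

    def rotate(X, degrees):
        # Rotate 2-D samples around the origin.
        a = np.deg2rad(degrees)
        R = np.array([[np.cos(a), -np.sin(a)],
                      [np.sin(a),  np.cos(a)]])
        return X @ R.T

    # 500 labeled source samples and ten 50-sample target batches,
    # rotated in steps of 18 degrees (up to 180 degrees).
    X_s, y_s = make_moons(n_samples=500, noise=0.05)
    targets = [rotate(make_moons(n_samples=50, noise=0.05)[0], 18 * k)
               for k in range(1, 11)]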

We compare three different methods for continuous domain adaptation that use optimal transport: (i) the algorithm proposed in Courty et al. (2016), which adds, on top of the usual entropic regularization, a group-lasso term that penalizes transport plans in which samples from different classes in the source are coupled with the same samples in the target; (ii) our proposed algorithm, which adds the time regularization term introduced in Section 2 to promote temporal smoothness; and (iii) a combination of the two, where both regularizers are included in the optimization. In addition, for each algorithm we run two sets of experiments:

  • Sequential cost (seq): We sequentially map the source samples to the targets at time $t-1$, and use the positions of the mapped samples $\hat{\mathbf{X}}_{t-1}$ together with the positions of $\mathbf{X}_T^{(t)}$ to compute the optimal transport cost $\mathbf{C}_t = \mathbf{C}(\hat{\mathbf{X}}_{t-1}, \mathbf{X}_T^{(t)})$.

  • Static cost: We fix the source samples to $\mathbf{X}_S$ and directly match them to the target samples at time $t$. In this case, the transport cost is $\mathbf{C}_t = \mathbf{C}(\mathbf{X}_S, \mathbf{X}_T^{(t)})$.

In each set of experiments, we compare three different settings:

  • Time-based regularizer (time reg): $\eta_c = 0$ and $\eta_t > 0$.

  • Class-based regularizer (class reg): $\eta_c > 0$ and $\eta_t = 0$.

  • Class-based regularizer + time-based regularizer (class reg + time reg): $\eta_c > 0$ and $\eta_t > 0$ (the corresponding gradients are sketched after this list; the parameter values themselves are tuned by grid search, cf. Figure 2).
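
As a concrete illustration of these settings, the gradients plugged into the forward-backward sketch given after Algorithm 1 could be assembled as follows. This is a hedged sketch: grad_time_reg and grad_class_reg are illustrative helper names, and the group-lasso term of (4) is treated as differentiable, which holds in practice since Sinkhorn couplings are strictly positive.

    import numpy as np

    def grad_time_reg(gamma, X_t, X_prev, n_s):
        # Gradient of the time regularizer (5): || n_s * gamma @ X_t - X_prev ||_F^2.
        return 2 * n_s * (n_s * gamma @ X_t - X_prev) @ X_t.T

    def grad_class_reg(gamma, y_s):
        # Gradient of the group-lasso regularizer (4) at a strictly positive
        # coupling: each class block of a column is normalized by its l2 norm.
        g = np.zeros_like(gamma)
        for c in np.unique(y_s):
            idx = (y_s == c)
            norms = np.linalg.norm(gamma[idx], axis=0, keepdims=True)
            g[idx] = gamma[idx] / np.maximum(norms, 1e-12)
        return g

    # The three experimental settings, as gradients for cdot_forward_backward:
    #   time reg:              grad_f = lambda g: eta_t * grad_time_reg(g, X_t, X_prev, n_s)
    #   class reg:             grad_f = lambda g: eta_c * grad_class_reg(g, y_s)
    #   class reg + time reg:  grad_f = lambda g: (eta_c * grad_class_reg(g, y_s)
    #                                              + eta_t * grad_time_reg(g, X_t, X_prev, n_s))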

Figure 2 shows the performance of the different methods. Clearly, the use of a sequential adaptation strategy, instead of a static one, allows for better tracking and adaptation. Furthermore, we can see that using the previously proposed group-lasso regularization on the source labels Courty et al. (2014, 2016) is not enough to guarantee a continuous adaptation. On the contrary, the time regularizer ensures temporal consistency along the sequence of adaptations and preserves the accuracy of the classification method on all the targets.

Figure 2: Performance comparison of different continuous domain adaptation strategies with optimal transport. The plot shows the average, minimum, and maximum values over 10 runs, using the best regularization parameters for each method (tuned via grid search on different samples).

Figure 3 illustrates how the source samples are mapped to the targets for a selection of angles (see the Appendix for the complete sequence). As we can see, at the beginning of the sequence (cf. Figure 3(a)) all methods produce similar mappings. However, as the sequence progresses, the method that does not use any time regularization fails to produce a consistent continuous mapping and collapses all mapped source samples of the same class to a single point in space. On the other hand, adding a time regularization term to the optimization ensures that the source samples follow the flow of the target distribution and are not pulled towards a single center.

(a) 18 degrees rotation
(b) 90 degrees rotation
(c) 180 degrees rotation
Figure 3: Comparison of mapping estimation on moon dataset for selected angles. Circles represent target samples and crosses source samples mapped to the target.

4.2 Speed of convergence of the optimization algorithm

We study the performance of the proposed approach on the simulated example discussed in the previous section. The experiments are performed on one source domain and one target domain, using two different sets of regularization parameters. For both Algorithm 1 and the CGS algorithm Rakotomamonjy et al. (2015), Figure 4 reports the normalized cost evaluations versus the cumulative time per iteration. The curves show that the proposed approach converges faster than CGS, especially with a low entropic regularization (i.e., a small $\epsilon$). This is because one iteration of Algorithm 1 is cheaper than one iteration of the CGS algorithm, owing to the line search performed by the latter to adjust the step size.

(a) First setting of the regularization parameters.
(b) Second setting of the regularization parameters.
Figure 4: Objective value versus time for Algorithm 1 and CGS, under two settings of the regularization parameters.

5 Conclusions

We presented an optimal-transport-based approach to perform continuous domain adaptation on slowly varying target distributions. Our solution is based on the introduction of a new time regularization term in the optimal transport problem that promotes smoothness along the trajectory of the mapped source samples. Furthermore, we proposed a new forward-backward splitting algorithm to solve the resulting optimization problem. This algorithm is general enough to be used with any differentiable regularization placed alongside the standard entropic regularization of optimal transport. Finally, we tested our framework on a synthetic example and showed its superior performance over state-of-the-art algorithms.

References

  • A. Bitarafan, M. S. Baghshah, and M. Gheisari (2016) Incremental evolving domain adaptation. IEEE Transactions on Knowledge and Data Engineering 28 (8), pp. 2128–2141. Cited by: §1.
  • A. Bobu, E. Tzeng, J. Hoffman, and T. Darrell (2018) Adapting to continuously shifting domains. Workshop track - ICLR 2018, pp. 4 (en). Cited by: §1.
  • M.N. Bùi and P. Combettes (2019) Bregman forward-backward operator splitting. Vietnam Journal of Mathematics. Cited by: §1, §3, §3.
  • N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy (2017) Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3730–3739. Cited by: §1.
  • N. Courty, R. Flamary, and D. Tuia (2014) Domain adaptation with regularized optimal transport. In ECML/PKDD 2014, LNCS, Nancy, France, pp. 1–16. Cited by: 1st item, §4.1.
  • N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy (2016) Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence 39 (9), pp. 1853–1865. Cited by: §1, 1st item, §4.1, §4.1, §4.1.
  • M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2292–2300. Cited by: §3.
  • J. Hoffman, T. Darrell, and K. Saenko (2014) Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 867–874. Cited by: §1.
  • M. Perrot, N. Courty, R. Flamary, and A. Habrard (2016) Mapping estimation for discrete optimal transport. In Advances in Neural Information Processing Systems, pp. 4197–4205. Cited by: §1.
  • G. Peyré, M. Cuturi, and J. Solomon (2016) Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 2664–2672. Cited by: §1.
  • A. Rakotomamonjy, R. Flamary, and N. Courty (2015) Generalized conditional gradient: analysis of convergence and applications. Research Report LITIS ; Lagrange ; IRISA. Cited by: §1, §3, §4.2.
  • S. Ruder, P. Ghaffari, and J. G. Breslin (2016) Towards a continuous modeling of natural language domains. In Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pp. 53–57. Cited by: §1.
  • V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel (2017) Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283. Cited by: §1.
  • Q. Van Nguyen (2017) Forward-backward splitting with bregman distances. Vietnam Journal of Mathematics 45 (3), pp. 519–539. Cited by: §1, §3, §3.
  • R. Venkataramani, H. Ravishankar, and S. Anamandra (2018) Towards continuous domain adaptation for healthcare. arXiv preprint arXiv:1812.01281. Cited by: §1.
  • C. Villani (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §1.
  • M. Wang and W. Deng (2018) Deep visual domain adaptation: A survey. Neurocomputing 312, pp. 135–153. Cited by: §1.
  • M. Wulfmeier, A. Bewley, and I. Posner (2018) Incremental adversarial domain adaptation for continually changing environments. In 2018 IEEE International conference on robotics and automation (ICRA), pp. 1–9. Cited by: §1.
  • Y. Yan, W. Li, H. Wu, H. Min, M. Tan, and Q. Wu (2018) Semi-supervised optimal transport for heterogeneous domain adaptation. In International Joint Conference on Artificial Intelligence, pp. 2969–2975. Cited by: §1.

Appendix: Comparison of all source-to-target mappings

(a) 18 degrees rotation
(b) 36 degrees rotation
(c) 54 degrees rotation
Figure 5: Comparison of mapping estimation on moon dataset. Circles represent target samples and crosses source samples mapped to the target.
(a) 72 degrees rotation
(b) 90 degrees rotation
(c) 108 degrees rotation
(d) 126 degrees rotation
(e) 144 degrees rotation
Figure 6: Comparison of mapping estimation on moon dataset. Circles represent target samples and crosses source samples mapped to the target.
(a) 162 degrees rotation
(b) 180 degrees rotation
Figure 7: Comparison of mapping estimation on moon dataset. Circles represent target samples and crosses source samples mapped to the target.