FRuDA: Framework for Distributed Adversarial Domain Adaptation

12/26/2021
by Shaoduo Gan, et al.

Breakthroughs in unsupervised domain adaptation (uDA) can help in adapting models from a label-rich source domain to unlabeled target domains. Despite these advancements, there is a lack of research on how uDA algorithms, particularly those based on adversarial learning, can work in distributed settings. In real-world applications, target domains are often distributed across thousands of devices, and existing adversarial uDA algorithms – which are centralized in nature – cannot be applied in these settings. To solve this important problem, we introduce FRuDA: an end-to-end framework for distributed adversarial uDA. Through a careful analysis of the uDA literature, we identify the design goals for a distributed uDA system and propose two novel algorithms to increase adaptation accuracy and training efficiency of adversarial uDA in distributed settings. Our evaluation of FRuDA with five image and speech datasets shows that it can boost target domain accuracy by up to 50% and improve the training efficiency of adversarial uDA by at least 11 times.



1 Introduction

Unsupervised Domain Adaptation (uDA) is a sub-field of machine learning aimed at adapting a model trained on a labeled source domain to a different, but related, unlabeled target domain. Over the last few years, adversarial domain adaptation has emerged as a prominent method for uDA, wherein the core idea is to learn domain-invariant feature representations from the data using adversarial learning [long2015learning, long2017deep, long2018conditional, ganin2016domain, tzeng2017adversarial, hoffman2018cycada, shen2018wasserstein, zou2019consensus]. This paper focuses on exploring adversarial domain adaptation in a distributed setting. Prior works [ganin2016domain, tzeng2017adversarial, hoffman2018cycada, shen2018wasserstein] have taken an algorithmic viewpoint to adversarial domain adaptation and assumed that datasets from the source and target domains are available on the same machine and can be accessed freely during the adaptation process. While this assumption has made it easy to research uDA algorithms, it can be easily violated in real-world scenarios where the domain datasets reside on massively distributed devices and are not allowed to be shared for privacy reasons.

As an example, let us consider the task of developing personalized health monitoring solutions on mobile devices, such as the prediction of COVID-19 from cough sounds collected from a smartphone [brown2020exploring]. A company (i.e., a source domain) can collect a labeled dataset for the task and train a prediction model on it. Later, this source model needs to be deployed for smartphone users around the world (i.e., target domains) and will require adapting to each user’s personal health condition and biomarkers. As collecting labels from the target domains is challenging in this setting, unsupervised domain adaptation could become a promising approach to adapt the source model and tailor it to each user’s data.

However, since the health data records from target users are distributed across thousands of devices and cannot be uploaded on a central server due to potential privacy reasons, we cannot use the centralized uDA techniques proposed in the literature. Performing adversarial domain adaptation in such distributed settings remains an under-explored problem and is the main focus of this paper. Clearly, if adversarial uDA can be extended to such distributed settings, it will further enhance the potential for real-world impact of these algorithms.

Setup and Challenges. We consider a distributed system where each domain dataset resides on a different node of the system. We assume that one domain dataset is labeled and represents the source domain. Other domain datasets are unlabeled and represent the target domains which need to undergo domain adaptation to learn a model tailored to their data distribution. New target domains can join the distributed system at any time. The nodes are able to communicate with each other over a network. Based on this novel but realistic setup, we highlight three key challenges in building a distributed adversarial uDA system:

  • The first challenge relates to system design: how do we design distributed adversarial neural network architectures that can perform uDA without exchanging any raw data between domains? Distributed training techniques have been studied extensively for supervised learning [lian2017can, tang2018d], but their investigation for adversarial uDA remains under-explored.

  • Next is the challenge of efficiency. In distributed machine learning, any communication between the nodes incurs a cost, both in terms of training time and data transfer expenses. As such, the design of a distributed adversarial uDA system should be as efficient as possible.

  • Finally, a key challenge is to obtain the highest possible accuracy for each target domain after adaptation. Let us define a collaborator as the domain with which a target domain undergoes adaptation, e.g., in X→Y adaptation, X is the collaborator for the target domain Y. Seminal theoretical works [ben2010theory] in domain adaptation have proven that the accuracy obtained for a given target domain is highly dependent on the characteristics of the collaborator with which the adaptation is performed. Hence, it becomes critical to select the optimal collaborator for each target domain in a distributed setting, to ensure that the target domain can achieve the highest accuracy through adaptation.

(a) Optimal Collaborator Selection (OCS)
(b) Distributed uDA using DILS
Fig. 1: Illustration of FRuDA. (a) A new target domain finds its optimal adaptation collaborator from a set of candidate domains. (b) The target domain performs distributed uDA with the chosen collaborator using DILS to learn a model for its distribution.

Contributions. In this paper, we propose an end-to-end learning framework, named Framework for Realistic uDA (FRuDA), which addresses all the above challenges in a unified manner, and makes adversarial uDA accurate, efficient, and privacy-preserving in a distributed setting. Figure 1 shows the system overview of FRuDA. The source code of FRuDA can be found in Appendix A.1. At the core of FRuDA are two novel contributions:

  1. A distributed collaborator selection algorithm called Optimal Collaborator Selection (OCS), which finds the best adaptation collaborator for each unlabeled target domain in the system. OCS is built on a novel theoretical formulation, which selects the optimal collaborator based on the collaborator's own in-domain error and the Wasserstein distance between the collaborator and target domain. Our results show that OCS can increase the target domain accuracy in distributed uDA systems by as much as 50% over various baselines.

  2. A distributed training strategy called DIscriminator-based Lazy Synchronization (DILS), which decomposes the adversarial learning architecture across distributed nodes and performs uDA by exchanging the gradients of the domain discriminator between nodes. DILS ensures that unlabeled target domains can learn a prediction model through adaptation from other domains in a privacy-preserving manner, without revealing or exchanging their raw data. DILS also allows for a trade-off between the accuracy and efficiency objectives of distributed uDA, using a tunable parameter.

The paper is structured as follows. In §2, we provide a brief primer on adversarial domain adaptation and further motivate the design requirements of a distributed uDA system. In §4, we present our end-to-end learning framework FRuDA, describe the two core algorithms on which FRuDA is built, and provide theoretical justifications for our design. In §5, we provide a comprehensive evaluation of FRuDA on multiple vision and speech datasets. We compare FRuDA against various baselines for selecting adaptation collaborators and observe that it significantly outperforms them on target domain accuracy, by as much as 50%. We also show that FRuDA can reduce the amount of data exchanged during training by at least 11× when compared to state-of-the-art baselines, without significantly compromising the adaptation accuracy. Finally, we show that FRuDA can co-exist with various types of adversarial uDA algorithms proposed in the literature, thus making it a generalizable framework for supporting distributed adversarial uDA.

2 Preliminaries and Design Goals

2.1 Primer on Adversarial uDA

We first provide a brief primer on adversarial uDA with one source and one target domain. In the subsequent sections, we will explain how we extend adversarial uDA to distributed systems with multiple target domains.

Let $S$ be a source domain with labeled training set $\{X_s, Y_s\}$ and $T$ be a target domain with unlabeled training set $X_t$. We can train a feature extractor $E_s$ and a classifier $C_s$ for the source domain using supervised learning by optimizing a classification loss as follows: $\min_{E_s, C_s} \mathcal{L}_{cls}\big(C_s(E_s(X_s)), Y_s\big)$, where $\mathcal{L}_{cls}$ is a loss function such as categorical cross-entropy.

The goal of adversarial uDA is to learn a feature extractor $E_t$ for the unlabeled target domain, which minimizes the divergence between the empirical source and target feature distributions. If the divergence in feature representations between domains is minimized, we can apply the pre-trained source classifier $C_s$ on the target features and obtain inferences, without requiring to learn a separate target classifier $C_t$. To learn $E_t$, two losses are optimized using adversarial learning, namely the Discriminator Loss $\mathcal{L}_D$ and the Mapping Loss $\mathcal{L}_M$, as follows:

$$\min_{D} \;\; \mathcal{L}_{D}\big(X_s, X_t;\, E_s, E_t, D\big) \quad (1)$$
$$\min_{E_t} \;\; \mathcal{L}_{M}\big(X_t;\, E_t, D\big) \quad (2)$$

Here $D$ represents a domain discriminator tasked with separating data from source and target domains. $\mathcal{L}_D$ and $\mathcal{L}_M$ are the adversarial loss functions, which have been studied by several previous works. For example, DANN [ganin2016domain] uses a cross-entropy loss to compute $\mathcal{L}_D$, where the labels indicate the domain of the data. For the Mapping loss, they simply compute $\mathcal{L}_M = -\mathcal{L}_D$ using a Gradient Reversal Layer (GRL). Instead, ADDA [tzeng2017adversarial] computes the mapping loss using the label inversion trick. Other works such as [shen2018wasserstein] use Wasserstein distance as the metric to compute $\mathcal{L}_D$.

The important takeaway here is that even though different algorithms employ different loss functions, the general training paradigm of adversarial uDA remains similar as shown in Eq. 1 and 2. This simple yet important insight means that it is possible to design a generalized domain adaptation framework for a distributed setting that can work for many uDA algorithms. This intuition will be confirmed in §5 where we show that our proposed framework can work with four types of uDA optimization objectives.
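To make this training paradigm concrete, the following is a minimal sketch of the two-step optimization of Eq. 1 and 2 in TensorFlow 2.x (the framework our system is implemented in). The ADDA-style loss functions shown here are one possible instantiation; the model objects and the `uda_step` helper are illustrative names, not part of FRuDA's API:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def disc_loss(d_src, d_tgt):
    # Eq. 1 (ADDA-style): the discriminator labels source features as 1
    # and target features as 0.
    return bce(tf.ones_like(d_src), d_src) + bce(tf.zeros_like(d_tgt), d_tgt)

def map_loss(d_tgt):
    # Eq. 2 (ADDA-style "label inversion"): the target encoder is trained
    # to make target features look like source features to the discriminator.
    return bce(tf.ones_like(d_tgt), d_tgt)

def uda_step(x_src, x_tgt, E_s, E_t, D, opt_D, opt_Et):
    # Step 1: update the discriminator D (Eq. 1). E_s stays frozen throughout.
    with tf.GradientTape() as tape:
        l_d = disc_loss(D(E_s(x_src)), D(E_t(x_tgt)))
    opt_D.apply_gradients(zip(tape.gradient(l_d, D.trainable_variables),
                              D.trainable_variables))
    # Step 2: update the target encoder E_t (Eq. 2).
    with tf.GradientTape() as tape:
        l_m = map_loss(D(E_t(x_tgt)))
    opt_Et.apply_gradients(zip(tape.gradient(l_m, E_t.trainable_variables),
                               E_t.trainable_variables))

Swapping in a different uDA objective (e.g., DANN or a Wasserstein critic) only changes the two loss functions; the two-step structure stays the same.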

2.2 Design Requirements for a Distributed Domain Adaptation System

In this section, we describe two key design requirements for a distributed domain adaptation system, which will serve as a guide for the algorithms and training strategies proposed in the paper.

Training Efficiency. Efficiency is a key design metric for any distributed system. Compared with the powerful computation capabilities of modern hardware, communication tends to be the main bottleneck in distributed training. In particular, in distributed domain adaptation, a domain dataset could reside on a smartphone, a laptop, or a wearable device, whose communication capabilities are far inferior to those of data center machines. In such massively distributed scenarios, an optimal communication strategy is crucial for the system's efficiency and user experience.

Recall the model components $E_s$, $E_t$, $C_s$, and $D$ mentioned in §2.1 that are involved in adversarial uDA. In a non-distributed setting, these components reside on the same machine and passing data from one to another has negligible cost. However, in a distributed setup, each domain has to keep a copy of these components on their machine and any exchange of information between them takes time and incurs a communication cost.

Since raw data is private and cannot be exchanged in our problem setup, we exchange the gradients of these model components to facilitate distributed training. Hence, the key challenge is: what is the most communication-efficient strategy to exchange model gradients across domains? More specifically, this question can be decomposed into three sub-questions: 1) how many domains should be involved in the communication? 2) which model components are necessary to be communicated? and 3) how frequently do they need to be communicated?

To answer these questions, we analyze various types of adversarial uDA approaches in the literature. Based on the number of domains involved in training, adversarial uDA algorithms can be categorized into pairwise adaptation [ganin2016domain, shen2018wasserstein, zou2019consensus, tzeng2017adversarial], multi-source adaptation [peng2019federated, zhao2018adversarial], and multi-target adaptation [ragab2020adversarial]. Clearly, as the latter two approaches require either multiple source or multiple target domains, they incur higher communication costs and are less efficient than pairwise adaptation. Another way to classify adversarial uDA algorithms is into tied [ganin2016domain] and untied [tzeng2017adversarial] algorithms, depending on whether the feature extractors of the source and target domains share weights (tied) or not (untied). Untied algorithms are communication-efficient because the feature extractors can be trained separately and their gradients do not need to be exchanged during training; only the gradients of the domain discriminator need to be shared across nodes. For these reasons, the design of our proposed framework is based on pairwise, untied adversarial uDA algorithms.

After narrowing down the scope of uDA algorithms from an efficiency perspective, we need to decide how frequently we should communicate gradients between nodes. Naively, we can exchange gradients of the discriminator between nodes after training each batch of data, which is common in data-parallel distributed training. However, this approach comes at the expense of a significant communication cost. Instead, we will propose a lazy synchronization strategy that exchanges the gradients every $p$ steps, and further reduces the communication costs of adversarial uDA with negligible impact on adaptation accuracy. Our proposed training strategy is presented in detail in §4.2.

FADA [peng2019federated] is a recently proposed technique for performing adversarial uDA in a federated learning setting. Based on the above analysis, FADA can be characterized as a multi-source uDA approach which exchanges the gradients of feature extractors after every training step. Although FADA originally solves a different problem than ours and assumes that the system has multiple labeled domains, we implement a special case of FADA in §5 to provide a fair comparison of its efficiency with our technique.

Fig. 2: 0° is the labeled source domain while the domains in blue are unlabeled target domains appearing sequentially in the system. The numbers in rectangles denote the post-adaptation accuracy for a domain. (a) Static Design: the labeled source acts as the collaborator for each target domain. (b) Flexible Design: each target domain chooses its collaborator dynamically; previously adapted target domains can also act as collaborators. Note that choosing the right collaborator leads to major accuracy gains over the Static Design for many domains (shown in red).

Target Domain Accuracy. Obtaining a high accuracy in the target domain is arguably the most important metric of success for a uDA solution. Prior works in learning theory [ben2010theory] have shown that the classification error obtained in a target domain depends on the characteristics of the collaborator with which the adaptation is performed. This means that if we select the right (or optimal) collaborator for each target domain, it could significantly reduce the target domain error and boost the target domain accuracy.

To better explain this aspect, we show an experiment on Rotated MNIST, a variant of MNIST in which digits are rotated clockwise by different degrees. Assume that 0°, i.e., no rotation, is the labeled source domain while 30°, 60°, 45°, 90° and 15° are five unlabeled target domains appearing sequentially, for which we would like to learn a model using adversarial uDA. The naïve approach adopted by many existing uDA approaches is to always choose the labeled source domain as the collaborator for a target domain. Figure 2(a) shows the performance of such existing approaches if each target domain always chooses the labeled source 0° as its collaborator. While this approach results in high accuracies for 15° and 30°, it performs poorly for other domains. Can we do better? What if a target domain could adapt not just from the labeled source, but also from other target domains which themselves have undergone adaptation in the past? Figure 2(b) shows that if target domains could flexibly choose their collaborators, they can achieve significantly higher accuracies. E.g., if 90° adapts from 60° (which itself underwent adaptation in the past), it could achieve an accuracy of 92.6%, which is almost 75% higher than what could be achieved by adapting from 0°, as shown in Figure 2.

In summary, this example provides a clear insight that we can significantly increase the target domain accuracy if we adapt from the right (or the optimal) collaborator. How do we choose such an optimal collaborator? In some aspects, this problem is similar to the source selection problem [tan2015transitive] where a metric such as A-distance is employed to compute a distance measure between domains, and the domain with the least distance from the target is chosen as its collaborator. However, none of the prior works are designed for a distributed setup and they require access to raw data samples from both domains to compute the distance metric for source selection. Instead, we propose a fully-distributed approach which selects an optimal collaborator based on the collaborator’s in-domain error and the Wasserstein distance between the collaborator and target.

3 Related Work

Related work for OCS. There are prior works on computing similarity between domains, e.g., using distance measures such as Maximum Mean Discrepancy [wang2020transfer], Gromov-Wasserstein Discrepancy [yan2018semi], A-distance [wang2018deep], and subspace mapping [gong2012geodesic]. However, our results in §5 show that merely choosing the most similar domain as the collaborator is not optimal. Instead, OCS directly estimates the target cross-entropy error for collaborator selection. Another advantage of OCS over prior methods is that it relies on the Wasserstein distance, which can be computed in a distributed setting without exchanging raw data between domains.

There are also works on selecting or generating intermediate domains for uDA. [tan2015transitive] studied a setting where source and target domains are too distant (e.g., image and text), which makes direct knowledge transfer infeasible. As a solution, they propose selecting intermediate domains using A-distance and domain complexity. However, as we discussed, merely using distance metrics does not guarantee the optimal collaborator. Moreover, this work used a KNN classifier and did not involve adversarial uDA algorithms.

[gong2019dlow] and [choi2020visual] use style transfer to generate images in intermediate domains between the source and target. Although interesting, these works are orthogonal to OCS in which the goal is to select the best domain from a given set of candidates. Moreover, these works are primarily focused on visual adaptation, while OCS is a general method that can work for any modality. Finally, [wulfmeier2018incremental, bobu2018adapting] are techniques for incremental uDA in continuously shifting domains. However, in our problem, different target domains may not have any inherent continuity in them and can appear in random order, and hence it becomes important to perform OCS.

Related work for DILS. There is prior work on distributed model training, wherein training data is partitioned across multiple nodes to accelerate training. These methods include centralized aggregation [li2014scaling, sergeev2018horovod], decentralized training [lian2017can, tang2018d], and asynchronous training [lian2017asynchronous]. Similarly, with the goal of preserving data privacy, Federated Learning proposes sharing model parameters between distributed nodes instead of the raw data [konevcny2016federated, yang2019federated]. However, these distributed and federated training techniques are primarily designed for supervised learning and do not extend directly to adversarial uDA architectures. A notable exception is FADA by [peng2019federated] which extends uDA to federated learning. As we discussed earlier, FADA is designed for a multi-source setting and assumes that all source domains are labeled. Moreover, it exchanges features and gradients of the feature extractor between nodes after every step. Instead, DILS operates by doing pairwise adaptation through lazy synchronization of discriminator gradients between nodes, which brings significant efficiency gains over FADA.

Source-free uDA. [kundu2020towards] and [liang2020we] are two very promising recent works on source-dataset-free uDA. Although the scope of these works is different from ours, we share the same goal of making uDA techniques more practical in real-world settings.

Property / Method | ADDA [tzeng2017adversarial], DANN [ganin2016domain], CADA [zou2019consensus] | Multi-Source uDA [zhao2018adversarial, bascol2019improving, cui2020multi] | Federated uDA [peng2019federated] | Tan et al. [tan2015transitive] | FRuDA (Ours)
Adversarial domain adaptation | ✓ | ✓ | ✓ |   | ✓
Collaborator selection before adaptation |   |   |   | ✓ | ✓
Supports distributed uDA |   |   | ✓ |   | ✓
Communication-efficient distributed uDA |   |   |   |   | ✓
Framework to support multiple uDA algorithms |   |   |   |   | ✓
Number of labeled source domains | 1 | Multiple | Multiple | 1 | 1

TABLE I: Overview of the related work in unsupervised domain adaptation and the novelty of FRuDA over prior works. Check marks denote the core properties of the different methods. FRuDA is unique in providing a framework to scale multiple adversarial uDA algorithms using optimal collaborator selection and privacy-preserving, communication-efficient distributed training.

4 FRuDA: Framework for Distributed Adversarial Domain Adaptation

We first present our two algorithmic contributions on Optimal Collaborator Selection (named OCS) and Discriminator-based Lazy Synchronization (named DILS). Later in §4.3, we explain how these two algorithms work in conjunction with each other in an end-to-end framework called FRuDA, and enable adversarial uDA algorithms to work in a distributed machine learning system.

4.1 Optimal Collaborator Selection (OCS)

Recall from §2.2 that the goal of collaborator selection is to find the optimal collaborator for each target domain. We first formulate the technical problem and then present our solution.

4.1.1 Problem Formulation.

In our problem setting shown in Figure 1(a), there is one labeled source domain and multiple unlabeled target domains. Let $T_1, T_2, \dots, T_N$ be the unlabeled target domains for which we want to learn a prediction model using uDA. We assume that target domains join the distributed system sequentially.

We define a candidate set $\Lambda_k$ as the set of candidate domains that are available to collaborate with a target domain at step $k$. When the system initializes at step $k=0$, only the labeled source domain $S$ has a learned model, hence $\Lambda_0 = \{S\}$. When the first target domain $T_1$ appears, it can only adapt from $S$ and learn a model. Having learned the model, $T_1$ is now added to the candidate set (along with its unlabeled data) and can act as a collaborator for future domains.

In general, at step $k$, $\Lambda_k = \{S, T_1, \dots, T_{k-1}\}$, as shown in Figure 1(a). For a new target domain $T_k$, the goal of OCS is to find an optimal collaborator domain $C^* \in \Lambda_k$, such that:

$$C^* = \arg\min_{C_i \in \Lambda_k} \; \mathcal{E}(C_i, T_k)$$

where $C_i$ is the $i$-th candidate domain in $\Lambda_k$ and $\mathcal{E}$ is a metric that quantifies the error of collaboration between $C_i$ and $T_k$. In other words, OCS aims to select the candidate domain from $\Lambda_k$ which has the least error of collaboration with the target domain.

4.1.2 Solution

Our key idea is quite intuitive: the optimal collaborator should be a domain, such that adapting from it will lead to the highest classification accuracy (or equivalently, the lowest classification error) in the target domain. We first introduce some notations and then present the key theoretical insight that underpins OCS.

Notations. We use domain $\mathcal{D}$ to represent a distribution on input space $\mathcal{X}$ and a labeling function $f: \mathcal{X} \to [0,1]$. A hypothesis is a function $h: \mathcal{X} \to [0,1]$. Let $\epsilon_{\mathcal{D}}(h, f) = \mathbb{E}_{x \sim \mathcal{D}}\big[\ell\big(h(x), f(x)\big)\big]$ denote the error of a hypothesis $h$ w.r.t. $f$ under the distribution $\mathcal{D}$, where $\ell$ is an error metric such as the 0-1 error or the cross-entropy error (denoted $\epsilon^{CE}$). Further, a function $g$ is called $\lambda$-Lipschitz if it satisfies the inequality $|g(x) - g(x')| \le \lambda \|x - x'\|$ for some $\lambda \ge 0$. The smallest such $\lambda$ is called the Lipschitz constant of $g$.

Theorem 1. Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be two domains sharing the same labeling function $f$. Let $K$ denote the Lipschitz constant of the cross-entropy loss function. For any two $\lambda$-Lipschitz hypotheses $h$, $f$, we can derive the following error bound for the cross-entropy (CE) error in $\mathcal{D}_2$:

$$\epsilon^{CE}_{\mathcal{D}_2}(h, f) \;\le\; \epsilon^{CE}_{\mathcal{D}_1}(h, f) \;+\; 2K\lambda\, W_1(\mathcal{D}_1, \mathcal{D}_2) \quad (3)$$

where $W_1(\mathcal{D}_1, \mathcal{D}_2)$ denotes the first Wasserstein distance between the domains $\mathcal{D}_1$ and $\mathcal{D}_2$, and $\epsilon^{CE}_{\mathcal{D}_1}(h, f)$ denotes the CE error in $\mathcal{D}_1$. A full proof is provided in the Appendix.

Theorem 1 has two key properties that make it apt for our problem setting. First, it can be used to directly estimate the cross-entropy (CE) error in a target domain ($\mathcal{D}_2$), given a hypothesis (or a classifier) from a collaborator domain ($\mathcal{D}_1$). Since the target CE error is the key metric of interest in classification tasks, this bound is more useful than the one proposed by [shen2018wasserstein], which estimates the 0-1 error in the target domain. Secondly, Theorem 1 depends on the Wasserstein distance between the domains, which can be computed in a distributed way without exchanging any private data between domains. This property is very important to our distributed problem setup and differentiates it from other distance metrics such as A-distance or Maximum Mean Discrepancy (MMD), which cannot be computed in a distributed manner.

Selecting the optimal collaborator. Motivated by Theorem 1, we now discuss how to select the optimal collaborator for a target domain. Given a collaborator domain $C$, a learned hypothesis $h$ and a labeling function $f$, we can estimate the CE error for a target domain $T$ using Theorem 1 as:

$$\hat{\epsilon}^{CE}_{T}(h, f) \;=\; \epsilon^{CE}_{C}(h, f) \;+\; 2K\lambda\, W_1(C, T) \quad (4)$$

We can tighten this bound to get a more reliable estimate of the target CE error. This is achieved by reducing the Lipschitz constant ($\lambda$) of the hypothesis during training. In uDA, the hypothesis is parameterized by a neural network, and we can train neural networks with small Lipschitz constants by regularizing the spectral norm of each network layer, as implemented in [gouk2018regularisation]. Our empirical results show that the upper bound in Eq. 4 is a good approximation of the target domain error for the purpose of collaborator selection. In our implementation, we make a simplifying assumption about the availability of a small test set on which the collaborator error can be computed. In future work, we will study more sophisticated error propagation strategies across target domains.
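As an illustration, here is a minimal sketch of one way to constrain a layer's spectral norm after each optimizer step, in the spirit of the projection method of [gouk2018regularisation]; the function name and the power-iteration budget are our own choices, not the reference implementation:

import tensorflow as tf

def clip_spectral_norm(kernel, c=1.0, power_iters=5):
    # Estimate the largest singular value of the (reshaped) kernel with
    # power iteration, then rescale the kernel so its spectral norm stays
    # below c. Applying this to every layer bounds the network's Lipschitz
    # constant by the product of the per-layer constants.
    w = tf.reshape(kernel, (-1, kernel.shape[-1]))
    u = tf.random.normal((1, w.shape[0]))
    for _ in range(power_iters):
        v = tf.math.l2_normalize(u @ w)
        u = tf.math.l2_normalize(v @ tf.transpose(w))
    sigma = tf.squeeze(u @ w @ tf.transpose(v))
    kernel.assign(kernel / tf.maximum(1.0, sigma / c))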

Now that we have a way to estimate the target CE error, we use it to select an optimal collaborator that yields the minimum estimated target CE error. Let $\Lambda = \{C_1, \dots, C_N\}$ be a set of candidate domains, each with a pre-trained model $h_i$ with Lipschitz constant $\lambda_i$ and in-domain error $\epsilon^{CE}_{C_i}(h_i, f)$. Let $T$ be a target domain for which the collaborator is to be chosen. We use Eq. 4 to select the optimal collaborator $C^*$:

$$C^* = \arg\min_{C_i \in \Lambda} \Big[\, \epsilon^{CE}_{C_i}(h_i, f) \;+\; 2K\lambda_i\, W_1(C_i, T) \,\Big] \quad (5)$$
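In code, the selection rule of Eq. 5 amounts to a simple argmin over the candidate set. The sketch below assumes each candidate records its in-domain CE error and Lipschitz constant when its model is trained, and that wasserstein_to implements the distributed $W_1$ estimate of §4.1.3; the attribute and function names are illustrative:

def select_collaborator(candidates, target, K):
    # Return the candidate that minimizes the estimated target CE error
    # of Eq. 4: in-domain error + 2*K*lambda_i * W1(candidate, target).
    def estimated_target_error(c):
        return c.ce_error + 2.0 * K * c.lipschitz * c.wasserstein_to(target)
    return min(candidates, key=estimated_target_error)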

4.1.3 Computing Wasserstein Distance across Distributed Datasets

Our optimal collaborator selection algorithm requires computing an estimate of the first Wasserstein ($W_1$) distance between a candidate domain ($C$) and the target domain ($T$). Let $X_c$ and $X_t$ denote the unlabeled datasets from the two domains. As shown by [shen2018wasserstein], the $W_1$ distance can be computed as:

$$W_1(C, T) \;\approx\; \max_{\|D\|_L \le 1} \; \frac{1}{n_c}\sum_{x \in X_c} D\big(E_c(x)\big) \;-\; \frac{1}{n_t}\sum_{x \in X_t} D\big(E_t(x)\big) \quad (6)$$

where $n_c$ and $n_t$ are the number of samples in each dataset, $E_c$ and $E_t$ are the feature encoders of each domain, and $D$ is an optimal discriminator trained to distinguish the features from the two domains. To train the optimal discriminator, the following loss is minimized:

$$\mathcal{L}_{wd} \;=\; \frac{1}{n_t}\sum_{x \in X_t} D\big(E_t(x)\big) \;-\; \frac{1}{n_c}\sum_{x \in X_c} D\big(E_c(x)\big) \;+\; \beta\, \mathcal{L}_{grad}$$

where $\mathcal{L}_{grad}$ is the gradient penalty used to enforce 1-Lipschitz continuity on the discriminator and $\beta$ is its weighting coefficient.

Interestingly, Equation 6 has a similar structure to the optimization objectives of ADDA and the other uDA algorithms discussed in this paper. Hence, we can use the same principle as DILS (described in §4.2) and exchange discriminator gradients between nodes to compute the Wasserstein distance in a distributed manner, without requiring any exchange of raw data.

We initialize the target encoder $E_t$ with $E_c$ and decompose the discriminator into two replicas ($D_c$ and $D_t$) which reside on the respective nodes. The raw data from both nodes is fed into their respective encoders and discriminators, and we compute the gradients of each discriminator with respect to $\mathcal{L}_{wd}$, i.e., $g_c$ on the candidate node and $g_t$ on the target node.

Both nodes exchange their discriminator gradients during a synchronization step and compute aggregated gradients:

$$\bar{g} = \frac{g_c + g_t}{2} \quad (7)$$

Finally, both discriminators $D_c$ and $D_t$ are updated with these aggregated gradients, and the gradient penalty is applied to enforce the 1-Lipschitz continuity on the discriminators. This process continues until convergence and results in an optimal discriminator. Once the discriminators are trained to convergence, we can calculate the Wasserstein distance as:

$$W_1(C, T) \;\approx\; \frac{1}{n_c}\sum_{x \in X_c} D_c\big(E_c(x)\big) \;-\; \frac{1}{n_t}\sum_{x \in X_t} D_t\big(E_t(x)\big)$$
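The following sketch shows the per-node logic of this distributed $W_1$ estimation, assuming a comm.exchange helper (e.g., a paired MPI send/receive) that sends the local discriminator gradients and returns the remote ones; the gradient penalty is omitted for brevity and all names are illustrative:

import tensorflow as tf

def local_critic_gradients(D, feats, is_candidate):
    # Critic loss for Eq. 6: minimizing it pushes D to score candidate
    # features high and target features low, i.e., to maximize the W1
    # objective.
    sign = -1.0 if is_candidate else 1.0
    with tf.GradientTape() as tape:
        loss = sign * tf.reduce_mean(D(feats))
    return tape.gradient(loss, D.trainable_variables)

def w1_sync_step(D, opt, grads_local, comm):
    # Exchange gradients with the other node and apply the average (Eq. 7)
    # so that both discriminator replicas stay identical.
    grads_remote = comm.exchange(grads_local)
    avg = [(g1 + g2) / 2.0 for g1, g2 in zip(grads_local, grads_remote)]
    opt.apply_gradients(zip(avg, D.trainable_variables))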

4.2 Distributed uDA using DIscriminator-based Lazy Synchronization (DILS)

4.2.1 Algorithm description

Upon selecting an optimal collaborator $C^*$ for the target domain $T$, the next step is to learn a model for $T$ by doing pairwise adversarial uDA with $C^*$. In line with our problem setting, both domains are located on distributed nodes and cannot share their training data with each other.

[Algorithm 1: DIscriminator-based Lazy Synchronization (DILS)]

As shown in Figure 1(b), we split the adversarial architecture across the distributed nodes. The feature encoders of the collaborator ($E_c$) and target ($E_t$) reside on their respective nodes, while the discriminator is split into two components $D_c$ and $D_t$. As we highlighted in §2, our framework is based on untied adversarial uDA algorithms and exchanges information between distributed nodes using gradients of the discriminators. Our training process works in two steps: 1) update the target feature extractor $E_t$ to optimize the mapping loss $\mathcal{L}_M$; 2) update the discriminators $D_c$ and $D_t$ to optimize the discriminator loss $\mathcal{L}_D$. Note that $E_c$ is assumed to be pre-trained in the past and is not updated.

We present our DIscriminator-based Lazy Synchronization (DILS) strategy in Algorithm 1. Specifically, at each training step $i$, both nodes feed their domain data $X_c$ and $X_t$ into their extractors and discriminators, and compute the gradients of $E_t$, $D_c$ and $D_t$, i.e., $g_{E_t}$, $g_{D_c}$ and $g_{D_t}$ respectively. Since $D_c$ and $D_t$ are supposed to be shared between nodes, how to synchronize these two components is crucial to adaptation accuracy and communication cost.

If we exchange the gradients of $D_c$ and $D_t$ after every training step, we can keep the discriminators strictly synchronized and ensure that distributed training converges to the same loss as the non-distributed case. However, this approach has a major downside from an efficiency perspective, as exchanging gradients at every step incurs significant communication costs and increases the overall uDA training time. To boost the training efficiency, we propose a Lazy Synchronization approach, wherein instead of every step, the discriminators are synchronized every $p$ training steps, thereby reducing the total communication amount by a factor of $p$. We denote the training steps at which the synchronization takes place as sync-up steps, while other steps are called local steps.

In effect, DILS uses synced stale gradients ($\bar{g}_D$) from the latest sync-up step to update the discriminators (line 11), instead of using their local gradients. There are two reasons behind this design choice. Firstly, $D_c$ and $D_t$ are two replicas of the same component and are intended to be consistent; updating them with different local gradients can cause them to diverge. Secondly, the local gradients of $D_c$ or $D_t$ are derived from the data of only one domain, and are likely to be biased toward that domain. Our design choice of applying the latest synced gradients ($\bar{g}_D$) circumvents these limitations of local gradients and guarantees the convergence of adversarial uDA algorithms. Our experimental results also confirm that this approach significantly reduces the uDA communication cost without degrading the target accuracy.
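A compact sketch of this lazy synchronization rule, from the target node's perspective, is shown below. compute_local_gradients stands for the per-step gradient computation of §2.1 and comm.exchange for the pairwise gradient exchange; both are assumed helpers, and the loop mirrors the structure of Algorithm 1 rather than reproducing it verbatim:

def dils_target_node(num_steps, p, E_t, D_t, opt_D, opt_Et, batches, comm):
    g_synced = None
    for i, x_tgt in zip(range(num_steps), batches):
        # Local gradients for the target encoder and its discriminator replica.
        g_Et, g_Dt = compute_local_gradients(E_t, D_t, x_tgt)
        opt_Et.apply_gradients(zip(g_Et, E_t.trainable_variables))
        if i % p == 0:  # sync-up step: exchange and aggregate
            g_remote = comm.exchange(g_Dt)
            g_synced = [(a + b) / 2.0 for a, b in zip(g_Dt, g_remote)]
        # Local and sync-up steps alike apply the latest synced (stale)
        # gradients, keeping the two discriminator replicas identical.
        opt_D.apply_gradients(zip(g_synced, D_t.trainable_variables))
    return E_t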

4.2.2 Convergence Analysis

The network structure of Lazy Synchronization can be seen as a typical Generative Adversarial Net (GAN) [goodfellow2014generative], where the generative model is the target encoder $E_t$ and the discriminative model is the discriminator ($D_c$ and $D_t$). $E_c$ and $E_t$ separately define two probability distributions over the feature space, noted as $p_c$ and $p_t$ respectively. Then, according to the theoretical analysis in [goodfellow2014generative], we know that:

Proposition 1. For a given $E_t$, if the discriminator is allowed to reach its optimum, and $E_t$ is updated accordingly to optimize the value function, then $p_t$ converges to $p_c$, which is the optimization goal of adversarial uDA.

In other words, if we can guarantee that, under the updating strategy of DILS, the convergence behaviours of $D_c$ and $D_t$ are close enough to the non-distributed case, then the target encoder $E_t$ should be able to converge as well. Note that although $D_c$ and $D_t$ are lazily synced, their weights are always identical, because they are equally initialized and apply the same gradients (the latest synced ones) at every training step. Therefore, we can treat them as one discriminator in the convergence analysis. Then we have the following theorem.

Theorem 2. In DILS, given a fixed target encoder, under certain assumptions, we have the following convergence rate for the discriminators $D_c$ and $D_t$ (full proof is in Appendix B):

$$\frac{1}{T}\sum_{i=1}^{T} \mathbb{E}\big[\|\nabla F(w_i)\|^2\big] \;\le\; \mathcal{O}\Big(\frac{1}{\eta T}\Big) \;+\; \mathcal{O}(\eta) \;+\; \mathcal{O}(p^2 \eta^2) \quad (8)$$

where $p$ is the sync-up step, $\eta$ is the learning rate, $T$ is the number of training steps, $w_i$ are the discriminator weights at step $i$, and $F$ is the discriminator loss. Set $\eta = \mathcal{O}(1/\sqrt{T})$. When $p \le \mathcal{O}(T^{1/4})$, the impact of the stale update will be very small, and thus DILS converges with rate $\mathcal{O}(1/\sqrt{T})$, which is the same as the classic SGD algorithm.

4.2.3 Privacy Analysis of DILS

Recall that a key feature of our distributed training algorithm is to exchange information between the distributed nodes using gradients of the discriminators. This clearly affords certain privacy benefits over existing uDA algorithms since we no longer have to transmit raw training data between nodes. However, prior works have shown that model gradients can potentially leak raw training data in collaborative learning [melis2019exploiting], therefore it is critical to examine: can the discriminator gradients also indirectly leak training data of a domain?

We study the performance of DILS under a state-of-the-art gradient leakage attack proposed by [zhu2019deep]. They showed that gradient matching can be a simple but robust technique to reconstruct the raw training data from stolen gradients. Let us say we are given a machine learning model $F$ with weights $W$. Let $\nabla W$ be the gradients with respect to a private input pair $(x, y)$. During distributed training, the gradients $\nabla W$ are exchanged between the nodes.

A reconstruction attack happens as follows: an attacker first randomly initializes a dummy input $x'$ and a dummy label $y'$. This data is fed into the model to compute dummy gradients as follows:

$$\nabla W' = \frac{\partial\, \ell\big(F(x'; W),\, y'\big)}{\partial W}$$

Finally, the attacker minimizes the distance between the dummy gradients and the actual gradients using gradient descent to reconstruct the private data as follows:

$$x^{*}, y^{*} \;=\; \arg\min_{x', y'} \;\big\|\nabla W' - \nabla W\big\|^{2} \quad (9)$$
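To make the threat model concrete, the sketch below shows the core loop of this gradient-matching attack, following the formulation of [zhu2019deep]; it is our own TensorFlow rendering, not the authors' code. Note that it needs both the model (weights $W$) and the full leaked gradients, which are exactly the two ingredients DILS withholds:

import tensorflow as tf

def dlg_attack(model, leaked_grads, input_shape, num_classes, iters=300):
    # Randomly initialize a dummy input/label pair, then optimize them so
    # that their gradients match the leaked ones (Eq. 9).
    x_dummy = tf.Variable(tf.random.normal((1, *input_shape)))
    y_dummy = tf.Variable(tf.random.normal((1, num_classes)))
    opt = tf.keras.optimizers.Adam(0.1)
    loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
    for _ in range(iters):
        with tf.GradientTape() as outer:
            with tf.GradientTape() as inner:
                pred_loss = loss_fn(tf.nn.softmax(y_dummy), model(x_dummy))
            dummy_grads = inner.gradient(pred_loss, model.trainable_variables)
            # Gradient-matching distance of Eq. 9.
            dist = tf.add_n([tf.reduce_sum((g1 - g2) ** 2)
                             for g1, g2 in zip(dummy_grads, leaked_grads)])
        opt.apply_gradients(zip(outer.gradient(dist, [x_dummy, y_dummy]),
                                [x_dummy, y_dummy]))
    return x_dummy, y_dummy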

Can this attack succeed on DILS? There are two key assumptions in this attack: (i) the weights $W$ of the end-to-end machine learning model are available to an adversary in order for them to compute the dummy gradients, and (ii) the gradients ($\nabla W$) of all the layers between the input and output are available to the adversary.

DILS never exchanges the weights of the target domain model (i.e., the feature extractor and the discriminator) during the adversarial training process. As shown in Algorithm 1, the target feature extractor $E_t$ is trained locally and only discriminator gradients are exchanged. Without the knowledge of the model weights $W$, an attacker cannot generate the dummy gradients $\nabla W'$ necessary to initiate the attack on the target domain.

Looking at the source or collaborator domain, we do exchange its feature extractor $E_c$ with the target domain in the initialization step of uDA, which could be used by the attacker to generate the dummy gradients $\nabla W'$. However, for the attack to succeed, the attacker also needs the real gradients ($\nabla W$) of all the layers between the input and output in the source domain. This includes gradients of $E_c$ and $D_c$. In DILS, however, we only exchange the gradients of the domain discriminator $D_c$ during training; the gradients of $E_c$ are never exchanged. Without the knowledge of the gradients of $E_c$, an attacker cannot use Eq. 9 to reconstruct the training data of the source domain.

In summary, we have proven that our strategy of distributed uDA based on discriminator gradients does not allow an attacker to reconstruct the private data of either the source or the target domain.

4.3 The End-to-End View of FRuDA

We now discuss how OCS and DILS work together to address the challenges of distributed adversarial uDA introduced in §1. As shown in Figure 1(a), a new target domain first performs OCS with all candidate domains in $\Lambda_k$ to find its optimal collaborator $C^*$. This step makes uDA systems more flexible and ensures that each target domain is able to achieve the best possible adaptation accuracy in the given setting. Next, as shown in Figure 1(b), the target domain and $C^*$ use DILS to engage in distributed uDA. This step ensures that private raw data is not exposed during adaptation, and yet the target domain is able to learn a model for its distribution in an efficient manner. Finally, the newly adapted target domain (with its model and unlabeled data) is added to the candidate set to serve as a potential collaborator for future domains.

5 Evaluation

Method | RMNIST (O1 / O2) | Digits (O1 / O2) | Office-Caltech (O1 / O2) | Mic2Mic (O1 / O2) | DomainNet (O1 / O2)
No Adaptation | 34.65 / 35.54 | 59.59 / 72.09 | 66.40 / 85.25 | 76.45 / 75.83 | 26.12 / 28.32
Random | 28.66±6.50 / 37.11±4.32 | 62.77±2.19 / 69.13±4.0 | 69.18±1.51 / 80.1±2.44 | 80.17±1.60 / 77.34±1.09 | 20.66±3.10 / 28.15±2.89
Labeled Source (LS) | 47.14±0.85 / 49.08±0.75 | 64.89±0.23 / 79.87±0.31 | 67.77±0.15 / 90.62±0.13 | 80.86±0.09 / 79.91±0.05 | 33.51±0.09 / 36.44±0.14
Multi-Collaborator | 40.51±0.30 / 42.73±0.39 | 60.94±0.13 / 75.91±0.30 | 68.90±0.24 / 82.17±0.87 | 76.90±0.13 / 79.0±0.24 | 18.41±1.04 / 25.46±0.61
Proxy A-Distance | 93.51±0.22 / 74.14±0.05 | 70.09±0.45 / 83.07±0.15 | 69.37±0.2 / 90.62±0.13 | 80.34±0.19 / 80.02±0.31 | 35.03±0.21 / 36.0±0.49
FRuDA (Ours) | 97.08±0.14 / 81.72±0.3 | 73.01±0.87 / 85.31±0.26 | 74.56±0.52 / 90.62±0.13 | 81.43±0.06 / 81.81±0.10 | 37.10±0.28 / 35.46±0.31

TABLE II: Mean adaptation accuracy (%) obtained over all target domains in a given order (O1 and O2 are the two domain orderings from Table III). The sync-up step size $p=4$ is used for this experiment.

5.1 Setup

Datasets. We evaluate FRuDA on five image and speech datasets:

  • Rotated MNIST: A variant of MNIST with digits rotated clockwise by different degrees. Each rotation is considered a separate domain.

  • Digits: Five domains of digits: MNIST (M), USPS (U), SVHN (S), MNIST-M (MM) and SynNumbers (SYN), each consisting of digit classes ranging from 0-9.

  • Office-Caltech: Images of 10 classes from four domains: Amazon (A), DSLR (D), Webcam (W), and Caltech-256 (C).

  • Mic2Mic: A speech keyword detection dataset recorded with four microphones: Matrix Creator (C), Matrix Voice (V), ReSpeaker (R) and USB (U). Each microphone represents a domain.

  • DomainNet: A new challenge dataset from which we use four labeled image domains containing 345 classes each: Real (R), QuickDraw (Q), Infograph (I), and Sketch (S).

Evaluation Metrics and Implementation. FRuDA is designed for a distributed system where each node represents a domain. Initially, one labeled source domain is given in the system, and thereafter unlabeled target domains appear sequentially in a random order. The overall accuracy of the system is measured by the mean adaptation accuracy obtained over all target domains. The communication cost is measured by the amount of traffic due to gradient exchange between nodes. For our experiments, we use an Nvidia V100 GPU to represent a distributed node. All nodes are connected via a TCP network. We use the Message Passing Interface (MPI) as the communication primitive between nodes. The system is implemented with TensorFlow 2.0. We use the implementation provided by [gouk2018regularisation] for calculating the Lipschitz constants of a neural network. More details on network architectures, hyper-parameters and downloading the source code are provided in the Appendix.

5.2 Adaptation Accuracy of FRuDA

Let $\{S, T_1, T_2, \dots, T_N\}$ denote an ordering of one labeled source domain and $N$ unlabeled target domains. For each target domain $T_i$, we first choose a collaborator domain, which could be either the labeled source domain or any of the previous target domains that have already learned a model using uDA. Upon choosing a collaborator (using OCS or any of the baseline techniques), we use DILS ($p=4$) to perform distributed adversarial uDA between the target domain and the collaborator, and compute the test accuracy in the target domain. We report the mean adaptation accuracy obtained over all target domains, i.e., $\frac{1}{N}\sum_{i=1}^{N} \mathrm{Acc}(T_i)$.

For each dataset, we choose two random orderings of source and target domains as shown in Table III. The optimization objectives of ADDA [tzeng2017adversarial] as discussed in §2.1 are used for adaptation in this experiment.

Collaborator Selection Baselines. We use four baselines for collaborator selection: (i) Labeled Source, wherein each target domain only adapts from the labeled source domain; (ii) Random Collaborator: each target domain chooses a random collaborator from the available candidates; (iii) Proxy A-distance (PAD), where we choose the domain which has the least PAD [ben2007analysis] from the target; (iv) Multi-Collaborator, based on MDAN [zhao2018adversarial], where all available candidate domains contribute to the adaptation in a weighted-average way. While this baseline is obviously less efficient than pairwise adaptation, we are interested in comparing its accuracy with our system. Note that MDAN was originally developed assuming that all candidate domains are labeled, which is not the case in our setting. Hence, for a fair comparison, we modify MDAN by only optimizing its adversarial loss during adaptation.

Dataset | Order 1 | Order 2
RMNIST | 0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330 | 0, 180, 210, 240, 270, 300, 330, 30, 60, 90, 120, 150
Digits | MM, Syn, M, U, S | Syn, M, MM, U, S
Office-Caltech | D, W, C, A | W, C, D, A
Mic2Mic | V, U, C, R | U, C, V, R
DomainNet | S, R, I, Q | S, Q, I, R

TABLE III: Domain orderings used in our experiments. The first domain in each ordering corresponds to the labeled source domain, which is introduced first in the system. All other domains have no training labels.

Results. Table II reports the mean accuracy obtained across all target domains for two orderings in each dataset. We can observe that collaborator selection techniques have a crucial impact on the adaptation accuracy, and FRuDA outperforms the baseline techniques in almost all cases. Below we describe our key findings:

Firstly, we observe that Random Collaborator and Multi-Collaborator are in general the worst among all the methods. We surmise that this is due to the fact that these methods do not consider the distances between domains, and end up with collaborator domains that are too different from the target domain. Compared with Proxy A-distance (PAD), FRuDA has better performance because we jointly consider the in-domain error and the Wasserstein distance between domains during collaborator selection.

Next, for Labeled Source (LS), we observe that its relative performance against FRuDA depends on the characteristics of the tasks, and the number and nature of domains. In RMNIST, FRuDA provides 41% accuracy gains over LS on average. This could be partly attributed to the large number of target domains (11) in this dataset: as the number of target domains increases, there are more opportunities to benefit from collaborator selection, which leads to higher accuracy gains over LS. For Office-Caltech (Order 2), the performance of FRuDA is the same as the Labeled Source baseline, because here the labeled source domain W turned out to be the optimal collaborator for all target domains. Overall, out of the 10 different orderings across 5 datasets, we found that LS outperforms FRuDA in only one setting (DomainNet Order 2). For this order, we found that the labeled source, Sketch (S), is the best adaptation collaborator for all subsequent domains, hence LS performs the best. FRuDA makes one error in collaborator selection here: for target domain Infograph (I), it picks Quickdraw (Q) as the collaborator instead of the labeled source Sketch (S), which leads to a 1% mean accuracy drop.

In general, our results demonstrate that as uDA systems scale to multiple target domains, choosing the right adaptation collaborator becomes important, warranting accurate collaborator selection algorithms.

Generalization to other uDA optimization objectives. The results presented in Table II used the optimization objectives of ADDA [tzeng2017adversarial]. However, as we discussed earlier, FRuDA is intended to be a general framework for distributed uDA and is not limited to one specific uDA algorithm. We now evaluate FRuDA with three other uDA loss formulations: (i) GRL [ganin2016domain], which uses a Gradient Reversal Layer to compute the mapping loss; (ii) WassDA [shen2018wasserstein], which uses the Wasserstein distance as the loss metric for the domain discriminator; and (iii) CADA [zou2019consensus], which operates by enforcing consensus between source and target features. In Table IV, we observe that different uDA techniques yield different target accuracies after adaptation, depending on their optimization objective. Regardless, FRuDA can work in conjunction with all of them to improve the target accuracy over the Labeled Source baseline, because our proposed framework is designed to be agnostic to the learning algorithm.

Takeaways. This section highlighted the accuracy gains achieved by FRuDA in a distributed uDA setting by selecting the optimal collaborator for each domain. We also showed that FRuDA can act as a general framework for distributed adversarial uDA by incorporating the optimization objectives of various uDA algorithms.

Method | RMNIST: ADDA / GRL / WassDA / CADA | Digits: ADDA / GRL / WassDA / CADA
No Adaptation | 34.65 / 34.65 / 34.65 / 34.65 | 59.59 / 59.59 / 59.59 / 59.59
Labeled Source | 47.14 / 47.26 / 44.39 / 41.30 | 64.89 / 65.51 / 70.34 / 65.22
Proxy A-distance | 93.51 / 94.91 / 89.04 / 78.70 | 70.09 / 67.14 / 72.66 / 67.0
FRuDA | 97.08 / 97.35 / 91.15 / 83.37 | 73.01 / 69.80 / 75.36 / 70.19

TABLE IV: Mean target accuracy (%) for four uDA methods (RMNIST and Digits, Order 1). FRuDA can be used in conjunction with various uDA methods, and improves mean accuracy over the Labeled Source baseline.

5.3 Communication Efficiency of FRuDA

To evaluate the communication efficiency of FRuDA, we use the amount of data communicated during pairwise adversarial training as a metric: if a method exchanges less data between two nodes during distributed training, it is more communication-efficient. Note that, in addition to the adversarial training costs, we also incur communication costs during collaborator selection. However, those costs are negligible compared to adversarial uDA and are not included in our analysis.

Since no previous work has studied the communication costs of adversarial uDA, we choose the most closely related work, FADA [peng2019federated], as our baseline. As discussed earlier, FADA was originally designed for multiple sources in a federated learning setup, which is not the case in our setup. For a fair comparison with our single-source setting, we modify FADA by only implementing its Federated Adversarial Alignment component and setting the number of source domains to one. The modified FADA (referred to as FADA*) has the same optimization objectives as single-source uDA, but it operates in a distributed setting. Hence, it is fair to compare it with DILS.

Fig. 3: Test accuracy vs. amount of communication per node. DILS ($p=4$) provides a substantial reduction in the amount of data exchanged during training.

Results. In Figure 3, we plot the test accuracy against the amount of data transmitted during pairwise training between two nodes. For DILS, we show two curves with different values of $p$. The results show that with the lazy synchronization strategy ($p=4$), DILS can achieve the same test accuracy as FADA* or the strictly synchronized case ($p=1$), while consuming much less communication data. On the RMNIST and Office-Caltech datasets, we save a large fraction of the traffic compared with FADA*. We present similar results for Digits, Mic2Mic and DomainNet in Table V.

There are two key reasons why DILS can be much more efficient than FADA: i) FADA (and FADA*) exchange gradients of the feature extractor during training, while DILS only exchanges gradients of the discriminator. As the discriminator has significantly fewer parameters than the feature extractor, DILS is inherently more communication-friendly. ii) In DILS, the synchronization between two nodes happens every $p$ steps, whereas FADA requires a data exchange after every single step. Therefore, DILS requires far fewer network synchronizations. In sum, DILS reduces the communication overhead over FADA by exchanging less data at a lower frequency.

Effect of sync-up step. In DILS, the sync-up step size $p$ controls the frequency of gradient exchange between nodes. In Figure 4, we vary $p$ from 1 to 10 and calculate the adaptation accuracy in the target domain for two adaptation tasks. The results show that DILS is fairly resilient to the step size up to $p=10$: the difference in adaptation accuracy between $p=1$ and $p=10$ is just 0.5% for RMNIST and 1.7% for Digits. This accuracy loss is offset by the reduction in data communication and the gains in training speed. When $p$ is high, DILS exchanges gradients less frequently, which leads to less communication overhead but slightly worse accuracy. When $p$ is low, the training accuracy is improved at the expense of higher communication cost.

Effectively, $p$ can be considered a tunable parameter to trade off between target domain accuracy, training time and communication overhead. For applications where it is important to minimize the training time and communication overhead, $p$ can be set to a higher value. Empirically, we find that $p=4$ provides a good tradeoff between accuracy and training speed for all datasets studied in this paper.

Method | Digits | Mic2Mic | DomainNet
FADA* | 7.9 GB | 25 GB | 86 GB
DILS ($p=1$) | 802 MB | 2 GB | 4 GB
DILS ($p=4$) | 201 MB | 500 MB | 1 GB

TABLE V: Communication amount per node until convergence.

5.4 Analysis and Discussion

We now analyze some additional properties of FRuDA, and provide our thoughts on FRuDA’s future research directions.

Fig. 4: Effect of varying the sync-up step $p$ from 1 to 10 on target domain accuracy for two adaptation tasks.

Error accumulation and negative transfer in sequential adaptation. FRuDA can be interpreted as a form of sequential adaptation, because new target domains can choose a collaborator from previously adapted target domains using the OCS algorithm. In sequential adaptation, error accumulation and negative transfer for downstream target domains are a possibility, especially if an unrelated domain appears in the sequence. We analyze OCS through this lens below.

Figure 5(a) shows a sequence of one source domain (0°) and 5 target domains (30°, 60°, 330°, 90°, 120°) from the RMNIST dataset. If we simply do a sequential adaptation where each domain adapts from the previous domain in the sequence, we see that 60°→330° results in high error and poor adaptation accuracy for 330°. This behavior is due to the high divergence between the two domains. More critically, however, we see that this error propagates to all the subsequent adaptation tasks (for 90° and 120°) and causes poor adaptation performance in them as well. In fact, for 120° we also observe negative transfer, as its post-adaptation accuracy (20.7%) is worse than the pre-adaptation accuracy obtained with the source domain model.

The ability of OCS to flexibly choose a collaborator for each target domain inherently counters this problem. Firstly, Figure 5(b) shows that we can choose a better collaborator for 330° using OCS and obtain almost 20% higher adaptation accuracy than in Figure 5(a). More importantly, the subsequent target domains are no longer reliant on 330° as their collaborator and can adapt from any available candidate domain, e.g., 90° adapts from 60° and obtains the best possible adaptation accuracy of 92.6% in this setup. Similarly, for 120°, we can prevent the negative transfer and achieve 91.7% accuracy by adapting from 90° (which itself underwent adaptation previously with 60°).

In summary, for a given sequence of domains, OCS enables each target domain to flexibly find its optimal collaborator and achieve the best possible adaptation accuracy.

(a) Sequential Adaptation without OCS. Each domain adapts from the previous domain in the sequence.
(b) Sequential Adaptation with OCS. Each domain flexibly chooses its optimal collaborator and shows accuracy gains (in red) over the baseline.
Fig. 5: OCS prevents negative transfer and error accumulation in sequential adaptation caused by the presence of unrelated domains.

Future Work. We showed that FRuDA can work with multiple uDA algorithms that follow the adversarial training paradigm introduced in §2.1. However, other uDA techniques such as those based on generative algorithms (e.g., [hoffman2018cycada]) and those with no adversarial learning component (e.g., [sun2017correlation]) were out of scope of this paper. Other future works include extending FRuDA to scenarios where the label spaces of source and target domains do not overlap [you2019universal] or when there is label shift between them [wu2019domain].

FRuDA, in its current form, is designed as a sequential adaptation framework, that is, it only deals with one target domain at a time. If multiple target domains join together as a batch, FRuDA will still process them one by one, and the processing order of the domains could impact their adaptation accuracy. In future work, we will extend FRuDA to batch-based processing of target domains. In particular, it will be interesting to explore how, given a pool of target domains, we should sequentially feed them to FRuDA such that the target domain accuracies are maximized.

6 Conclusion

In real-world ML applications, datasets are often distributed across thousands of users and devices. Despite the rapid advancements in adversarial uDA research, their extension to distributed settings has surprisingly remained under-explored. We introduced FRuDA, an end-to-end framework for distributed adversarial uDA which brings a novel and complementary perspective to uDA research. Through a careful analysis of the literature, we identified the key design requirements for a distributed uDA system and proposed two novel algorithms to increase adaptation accuracy and training efficiency in distributed uDA settings. A comprehensive evaluation with multiple image and speech datasets shows that FRuDA can increase target accuracy over baselines by as much as 50% and improve the training efficiency of adversarial uDA by at least 11×. Overall, this paper contributes to both the domain adaptation and distributed learning literature, by showing for the first time how domain adaptation and adversarial training can work in a distributed setting.

References

Appendix A Reproducibility Details

A.1 Architectures, Pre-Processing and Hyperparameters

We now describe the neural architectures used for each dataset along with the pre-processing steps and hyperparameters used for supervised and adversarial learning.

Rotated MNIST: The MNIST dataset is obtained from the TensorFlow Datasets repository and is rotated by different degrees to generate different domains. The same training and test partitions as in the original dataset are used in our experiments. We employ the LeNet architecture for training the feature extractor; a reference sketch is shown below. The model was trained for each source domain with a learning rate of using the Adam optimizer and a batch size of 32.
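As a minimal sketch, a LeNet-style feature extractor for 28x28x1 MNIST inputs could look as follows; the exact filter and layer sizes follow the classic LeNet-5 recipe and are assumptions, since the text above only names the LeNet family.

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# LeNet-5-style feature extractor; layer sizes are assumptions.
model = tf.keras.Sequential([
    Conv2D(filters=6, kernel_size=5, activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(filters=16, kernel_size=5, activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(120, activation='relu'),
    Dense(84, activation='relu'),
    Dense(10),  # logits over the 10 digit classes
])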

In the adversarial training process, we used the ADDA loss formulations to perform domain adaptation with a learning rate of for the target extractor and for the discriminators.
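To illustrate what this adversarial step looks like in code, below is a minimal sketch of one ADDA-style iteration, not the paper's implementation: the source extractor stays frozen, the discriminator learns to separate source and target features, and the target extractor is updated with inverted labels to fool it. The model handles (source_fe, target_fe, disc) and the two learning rates are illustrative placeholders, since the exact values are elided above.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
opt_disc = tf.keras.optimizers.Adam(1e-4)    # discriminator LR (assumed)
opt_target = tf.keras.optimizers.Adam(1e-5)  # target extractor LR (assumed)

def adda_step(source_fe, target_fe, disc, x_s, x_t):
    # 1) Discriminator step: separate source features from target features.
    with tf.GradientTape() as tape:
        f_s = source_fe(x_s, training=False)   # source extractor is frozen
        f_t = target_fe(x_t, training=True)
        logits_s = disc(f_s, training=True)
        logits_t = disc(f_t, training=True)
        d_loss = (bce(tf.ones_like(logits_s), logits_s) +
                  bce(tf.zeros_like(logits_t), logits_t))
    grads = tape.gradient(d_loss, disc.trainable_variables)
    opt_disc.apply_gradients(zip(grads, disc.trainable_variables))

    # 2) Target extractor step: fool the discriminator (inverted labels).
    with tf.GradientTape() as tape:
        logits_t = disc(target_fe(x_t, training=True), training=True)
        g_loss = bce(tf.ones_like(logits_t), logits_t)
    grads = tape.gradient(g_loss, target_fe.trainable_variables)
    opt_target.apply_gradients(zip(grads, target_fe.trainable_variables))
    return d_loss, g_loss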

Digits: This task consists of five domains: MNIST, SVHN, USPS, MNIST-Modified and SynthDigits. We used the same train/test split as in the original domain datasets. The images from all domains were normalized between 0 and 1, and resized to 32x32x3 for consistency. The following architecture was used for the feature extractor and it was trained for each source domain with a learning rate of using the Adam optimizer and a batch size of 64.

import tensorflow as tf
from tensorflow.keras.layers import (Conv2D, BatchNormalization, Dropout,
                                     ReLU, Flatten, Dense)

is_training = True  # set to False at inference time

inputs = tf.keras.Input(shape=(32, 32, 3), name='img')
x = Conv2D(filters=64, kernel_size=5, strides=2)(inputs)
x = BatchNormalization()(x, training=is_training)
x = Dropout(0.1)(x, training=is_training)
x = ReLU()(x)
x = Conv2D(filters=128, kernel_size=5, strides=1)(x)
x = BatchNormalization()(x, training=is_training)
x = Dropout(0.3)(x, training=is_training)
x = ReLU()(x)
x = Conv2D(filters=256, kernel_size=5, strides=1)(x)
x = BatchNormalization()(x, training=is_training)
x = Dropout(0.5)(x, training=is_training)
x = ReLU()(x)
x = Flatten()(x)
x = Dense(512)(x)
x = BatchNormalization()(x, training=is_training)
x = ReLU()(x)
x = Dropout(0.5)(x, training=is_training)
outputs = Dense(10)(x)  # logits over the 10 digit classes
model = tf.keras.Model(inputs, outputs)

In the adversarial training process, we used the ADDA losses to perform domain adaptation with a learning rate of for the target extractor and for the discriminator.

Office-Caltech: We used the pre-trained DeCAF features [decaf] for each domain along with the original train/test splits. The following architecture was used for the feature extractor and it was trained with a learning rate of using the Adam optimizer and a batch size of 32.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation='linear'),
    Dropout(0.7),
    Dense(256, activation='linear'),
    Dropout(0.7),
    Dense(10, activation=None),  # 10 shared Office-Caltech classes
])

In the adversarial training process, we used the ADDA losses to perform domain adaptation with a learning rate of for the target extractor and for the discriminator.

DomainNet: We used four labeled domains from the DomainNet dataset (Real, Quickdraw, Infograph, Sketch) along with their original train/test splits. A ResNet50-v2 pre-trained on ImageNet was employed as the base model for this task. We froze all but the last four layers of the base model and fine-tuned it for each source domain with a learning rate of 1e-5 using the Adam optimizer and a batch size of 64.

ResNet50V2(include_top=False, input_shape=(224, 224, 3), pooling='avg', weights='imagenet'),
Dense(345, activation='softmax')  # 345 DomainNet classes

In the adversarial training process, we used the ADDA losses to perform domain adaptation with a learning rate of for the target extractor and for the discriminator.

Mic2Mic: Similarly, we followed the same train/test splits as in the original dataset provided by the authors of [mathur2019unsupervised]. The spectrogram tensors were normalized between 0 and 1 during the training and test stages. The following model was trained for each source domain with a learning rate of using the Adam optimizer and a batch size of 64.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(filters=64, kernel_size=(8, 20), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(filters=128, kernel_size=(4, 10), activation='relu'),
    MaxPooling2D(pool_size=(1, 4)),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(31),  # 31 output classes
])

In the adversarial training process, we used the ADDA losses to perform domain adaptation with a learning rate of for the target extractor and for the discriminator.

Appendix B Proofs and Analysis

B.1 Optimal Collaborator Selection (OCS)

Theorem 1

Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be two domains sharing the same labeling function $f$. Let $K$ denote the Lipschitz constant of the cross-entropy loss function in the hypothesis output, and let $L$ be the Lipschitz constant of a hypothesis learned on $\mathcal{D}_1$. For any two $L$-Lipschitz hypotheses $h_1, h_2$, we can derive the following error bound for the cross-entropy (CE) error in $\mathcal{D}_2$:

$$\varepsilon^{CE}_{\mathcal{D}_2}(h_1) \le \varepsilon^{CE}_{\mathcal{D}_1}(h_1) + 2KL\, W_1(\mathcal{D}_1, \mathcal{D}_2) \qquad (10)$$

where $W_1(\mathcal{D}_1, \mathcal{D}_2)$ denotes the first Wasserstein distance between the domains $\mathcal{D}_1$ and $\mathcal{D}_2$, and $\varepsilon^{CE}_{\mathcal{D}_1}(h_1)$ denotes the CE error in $\mathcal{D}_1$.

Proof. The error between two hypotheses $h_1, h_2$ on a distribution $\mathcal{D}$ is given by:

$$\varepsilon_{\mathcal{D}}(h_1, h_2) = \mathbb{E}_{x \sim \mathcal{D}}\big[\,|h_1(x) - h_2(x)|\,\big] \qquad (11)$$

We define softmax cross-entropy on a given distribution $\mathcal{D}$ as

$$\varepsilon^{CE}_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}\big[-\log \sigma_{f(x)}(h(x))\big] \qquad (12)$$

where $\sigma$ is the softmax function $\sigma(z)_j = e^{z_j} / \sum_k e^{z_k}$, $f$ is the labelling function, and $\sigma_j$ denotes the projection of $\sigma$ to the $j$-component.

Then we have,

$$\big|\varepsilon^{CE}_{\mathcal{D}}(h_1) - \varepsilon^{CE}_{\mathcal{D}}(h_2)\big| \le \mathbb{E}_{x \sim \mathcal{D}}\Big[\big|\log \sigma_{f(x)}(h_1(x)) - \log \sigma_{f(x)}(h_2(x))\big|\Big] \qquad (13)$$

Further, using the definition of Lipschitz continuity, we have

$$\big|\varepsilon^{CE}_{\mathcal{D}}(h_1) - \varepsilon^{CE}_{\mathcal{D}}(h_2)\big| \le K\, \varepsilon_{\mathcal{D}}(h_1, h_2) \qquad (14)$$

where $K$ is the Lipschitz constant of the softmax cross-entropy function.

Next, we follow the triangle inequality proof from [shen2018wasserstein, proof of Lemma 1] to find that

$$\varepsilon_{\mathcal{D}_2}(h_1, h_2) \le \varepsilon_{\mathcal{D}_1}(h_1, h_2) + 2L\, W_1(\mathcal{D}_1, \mathcal{D}_2) \qquad (15)$$

where $2L$ is a Lipschitz constant for $h_1 - h_2$, if the label were constant. Since $f$ is constant outside of a measure 0 subset where the labels change, and $h_1$ and $h_2$ are Lipschitz, so in particular measurable, Equation 15 holds everywhere.

Then, by substituting Eq. 15 and Eq. 14 in Eq. 13, we get Theorem 1:

$$\varepsilon^{CE}_{\mathcal{D}_2}(h_1) \le \varepsilon^{CE}_{\mathcal{D}_1}(h_1) + 2KL\, W_1(\mathcal{D}_1, \mathcal{D}_2) \qquad (16)$$
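In practice, this bound suggests a simple collaborator-selection rule: score every candidate source by its own CE error plus a scaled Wasserstein distance to the target, and pick the minimizer. The sketch below is an illustration under assumptions, not the paper's implementation: it approximates $W_1$ with per-dimension 1-D Wasserstein distances over extracted features, and folds the unknown $2KL$ constant into a tunable coefficient c; all names (ocs_score, candidates, feat_src, feat_tgt) are hypothetical.

import numpy as np
from scipy.stats import wasserstein_distance

def ocs_score(src_ce_error, feat_src, feat_tgt, c=1.0):
    # Theorem 1 proxy: source CE error + c * W1(source, target).
    # feat_src, feat_tgt are (n_samples, n_dims) feature matrices; the mean
    # of per-dimension 1-D W1 distances is a cheap stand-in for the
    # multivariate W1 distance (an assumption of this sketch).
    w1 = np.mean([wasserstein_distance(feat_src[:, d], feat_tgt[:, d])
                  for d in range(feat_src.shape[1])])
    return src_ce_error + c * w1

def choose_collaborator(candidates, feat_tgt):
    # candidates: list of (name, src_ce_error, feat_src) tuples.
    best = min(candidates, key=lambda cand: ocs_score(cand[1], cand[2], feat_tgt))
    return best[0]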

B.2 Convergence of DIscriminator-based Lazy Synchronization (DILS)

The network structures of adversarial uDA methods resemble a GAN, where the target encoder and the discriminators ($D_s$ and $D_t$) play a minimax game similar to a GAN's generator and discriminator. The source and target feature extractors $E_s$ and $E_t$ can separately define two probability distributions on the feature representations $E_s(X_s)$ and $E_t(X_t)$, denoted as $p_s$ and $p_t$ respectively. Then, according to the theoretical analysis in [goodfellow2014generative], we know that:

Proposition 2. For a given $E_t$, if the discriminator is allowed to reach its optimum, and $E_t$ is updated accordingly to optimize the value function, then $p_t$ converges to $p_s$, which is the optimization goal of $E_t$.

In other words, if we can guarantee that, under the training strategy of Lazy Synchronization, the convergence behaviour of $D_s$ and $D_t$ is similar to the non-distributed case, then $E_t$ should also converge. We have the following theorem.
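For concreteness, the value function referenced in Proposition 2 is the standard GAN objective of [goodfellow2014generative] instantiated on the feature distributions; the transcription below into our notation is our own addition:

$$\min_{E_t} \max_{D}\; V(D, E_t) = \mathbb{E}_{z \sim p_s}\big[\log D(z)\big] + \mathbb{E}_{z \sim p_t}\big[\log(1 - D(z))\big]$$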

Theorem 1. In Lazy Synchronization, given a fixed target encoder, under certain assumptions, we have the following convergence rate for the discriminators $D_s$ and $D_t$:

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big\|\nabla F(w_t)\big\|^2 \le \frac{2\big(F(w_1) - F^*\big)}{\eta T} + \mathcal{O}\big(L^2 \eta^2 U^2 G^2\big) \qquad (17)$$

where $U$ is the sync-up step and $\eta$ is the learning rate. Set $\eta = \mathcal{O}(1/\sqrt{T})$. When $U = o(T^{1/4})$, the impact of the stale update will be very small, and thus the discriminators converge with rate $\mathcal{O}(1/\sqrt{T})$, which is the same as the classic SGD algorithm.

Proof. As we mentioned in the paper, $D_s$ and $D_t$ are lazily synced such that their weights are always identical: they are initialized with the same weights and always apply the same synced gradients at every training step. Therefore, we can consider them as one discriminator $D$ (with weights $w$) for our convergence analysis.

Notations. Throughout the proof, we use the following notations and definitions:

  • $w$ denotes the model weights of $D$.
  • $F(\cdot)$ denotes the loss function of $D$.
  • $X_s$ and $X_t$ denote the datasets of source and target feature representations (i.e., outputs of the respective feature extractors).
  • $\xi$ denotes one batch of training instances sampled from $X_s \cup X_t$.
  • $f(w; \xi)$ denotes the empirical loss of model $w$ on batch $\xi$.
  • $\nabla F(\cdot)$ denotes the gradients of function $F$.
  • $G$ denotes the gradient bound.
  • $\eta$ denotes the learning rate.
  • $T$ denotes the number of training steps.
  • $U$ denotes the size of the sync-up step in our proposed Lazy Synchronization approach.

We can formalize the optimization goal of the discriminator as:

$$\min_{w} F(w) := \mathbb{E}_{\xi \sim X_s \cup X_t}\big[f(w; \xi)\big] \qquad (18)$$

Assumption 1. $F(\cdot)$ is with $L$-Lipschitz gradients:

$$\|\nabla F(w) - \nabla F(w')\| \le L\, \|w - w'\| \qquad (19)$$

Equivalently, we can get:

$$F(w') \le F(w) + \langle \nabla F(w),\, w' - w \rangle + \frac{L}{2}\|w' - w\|^2 \qquad (20)$$

In training step $t$, we first simplify the updating rule as:

$$w_{t+1} = w_t - \eta\, g_t \qquad (21)$$

where $g_t$ denotes the update term applied at step $t$. Combining Inequality 20 and Equation 21, at time step $t$ we have:

$$F(w_{t+1}) \le F(w_t) - \eta\, \langle \nabla F(w_t),\, g_t \rangle + \frac{L\eta^2}{2}\|g_t\|^2 \qquad (22)$$

Summing all inequalities over $t = 1, \dots, T$:

$$F(w_{T+1}) \le F(w_1) - \eta \sum_{t=1}^{T} \langle \nabla F(w_t),\, g_t \rangle + \frac{L\eta^2}{2} \sum_{t=1}^{T} \|g_t\|^2 \qquad (23)$$

Taking expectation over $\xi$ on both sides and re-arranging terms:

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\langle \nabla F(w_t),\, g_t \rangle\big] \le \frac{F(w_1) - \mathbb{E}\big[F(w_{T+1})\big]}{\eta T} + \frac{L\eta}{2T} \sum_{t=1}^{T} \mathbb{E}\|g_t\|^2 \qquad (24)$$

In our proposed Lazy Synchronization algorithm, the update term is:

$$g_t = \frac{1}{U} \sum_{u = t_u - U}^{t_u - 1} \nabla f(w_u;\, \xi_u) \qquad (25)$$

where $t_u$ is the latest sync-up step for a given time step $t$, and $U$ is the sync-up step of our algorithm. This updating rule formalizes our approach, wherein the gradient applied to the discriminator is the gradient averaged over the $U$ steps before the latest sync-up step.
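To make this update rule concrete, here is a minimal single-replica sketch of the lazy-synchronization update, written under assumptions rather than as the paper's implementation: gradients are buffered locally for $U$ steps and only their average is applied (and, in the distributed setting, exchanged between the two discriminator replicas); grad_fn and batches are illustrative placeholders.

import numpy as np

def lazy_sync_discriminator(w, grad_fn, batches, eta=0.01, U=4):
    # w:       discriminator weights (np.ndarray)
    # grad_fn: grad_fn(w, batch) -> gradient of the empirical loss f(w; batch)
    # U:       sync-up step; weights change only every U steps, so buffered
    #          gradients are 'stale' by up to U steps, matching the
    #          staleness analyzed above.
    buffer = []
    for batch in batches:
        buffer.append(grad_fn(w, batch))   # local gradient, not yet applied
        if len(buffer) == U:
            g = np.mean(buffer, axis=0)    # averaged update term g_t (Eq. 25)
            # In DILS, both discriminator replicas would exchange g here and
            # apply the identical update, keeping their weights in sync.
            w = w - eta * g
            buffer.clear()
    return w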

We transform $\mathbb{E}\big[\langle \nabla F(w_t), g_t \rangle\big]$ as follows, using the identity $\langle a, b \rangle = \frac{1}{2}\big(\|a\|^2 + \|b\|^2 - \|a - b\|^2\big)$:

$$\mathbb{E}\big[\langle \nabla F(w_t),\, g_t \rangle\big] = \frac{1}{2}\Big(\mathbb{E}\|\nabla F(w_t)\|^2 + \mathbb{E}\|g_t\|^2 - \mathbb{E}\|\nabla F(w_t) - g_t\|^2\Big) \qquad (26)$$

Bringing Equation 26 back into Equation 24 (let $T_1 = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|g_t\|^2$ and $T_2 = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla F(w_t) - g_t\|^2$):

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{2\big(F(w_1) - \mathbb{E}[F(w_{T+1})]\big)}{\eta T} + (L\eta - 1)\, T_1 + T_2 \qquad (27)$$

Then we analyze $T_1$ and $T_2$. For $T_1$, since each $g_t$ is an average of stochastic gradients and each stochastic gradient is bounded by $G$, convexity of the squared norm gives:

$$T_1 = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\|g_t\|^2 \le G^2 \qquad (28)$$

For $T_2$: