1 Introduction
Data generated by networks of mobile and IoT devices poses unique challenges for training machine learning models. Due to the growing storage and computational power of these devices and concerns about data privacy, it is increasingly attractive to keep data and computation local to each device (federated_mtl). Federated Learning (FL) (mohassel2018aby; federated_secure; secureml) provides a privacy-preserving mechanism to leverage such decentralized data and computation resources to train machine learning models. The main idea behind federated learning is to have each node train on its own local data and share only model updates, never the raw data.
While federated learning promises better privacy and efficiency, existing methods ignore the fact that the data on each node are collected in a non-i.i.d. manner, leading to domain shift between nodes (datashift_book2009). For example, one device may take photos mostly indoors, while another mostly outdoors. In this paper, we address the problem of transferring knowledge from the decentralized nodes to a new node with a different data domain, without requiring any additional supervision from the user. We call this novel problem Unsupervised Federated Domain Adaptation (UFDA), as illustrated in Figure 1(a).
There is a large body of existing work on unsupervised domain adaptation (long2015; DANN; adda; CycleGAN2017; gong2012geodesic; long2018conditional), but the federated setting presents several additional challenges. First, the data are stored locally and cannot be shared, which hampers mainstream domain adaptation methods as they need to access both the labeled source and unlabeled target data (ddc; JAN; ghifary2016deep; SunS16a; DANN; adda). Second, the model parameters are trained separately for each node and converge at different speeds, while also offering different contributions to the target node depending on how close the two domains are. Finally, the knowledge learned from source nodes is highly entangled (bengio2013representation), which can possibly lead to negative transfer (pan2010survey).
In this paper, we propose a solution to the above problems called Federated Adversarial Domain Adaptation (FADA), which aims to tackle domain shift in a federated learning system through adversarial techniques. Our approach preserves data privacy by training one model per source node and updating the target model with an aggregation of source gradients, but does so in a way that reduces domain shift. First, we analyze the federated domain adaptation problem from a theoretical perspective and provide a generalization bound. Inspired by our theoretical results, we propose an efficient adaptation algorithm based on adversarial adaptation and representation disentanglement applied to the federated setting. We also devise a dynamic attention model to cope with the varying convergence rates in the federated learning system. We conduct extensive experiments on real-world datasets, including image recognition and natural language tasks. Compared to baseline methods, we improve adaptation performance on all tasks, demonstrating the effectiveness of the proposed model.
2 Related Work
Unsupervised Domain Adaptation Unsupervised Domain Adaptation (UDA) aims to transfer the knowledge learned from a labeled source domain to an unlabeled target domain. Domain adaptation approaches proposed over the past decade include discrepancy-based methods (ddc; JAN; ghifary2014domain; SunS16a; peng2017synthetic), reconstruction-based UDA models (yi2017dualgan; CycleGAN2017; hoffman2017cycada; kim2017learning), and adversary-based approaches (cogan; adda; ufdn; DANN). For example, DANN propose a gradient reversal layer to perform adversarial training against a domain discriminator, inspired by the idea of adversarial learning. adda address unsupervised domain adaptation by adapting a deep CNN-based feature extractor/classifier across the source and target domains via adversarial training.
ben2010theory introduce the $\mathcal{H}$-divergence to evaluate domain shift and provide a generalization error bound for domain adaptation. These methods assume the data are centralized on one server, limiting their applicability to distributed learning systems.

Federated Learning Federated learning (mohassel2018aby; rivest1978data; federated_secure; secureml) is a decentralized learning approach that enables multiple clients to collaboratively learn a machine learning model while keeping the training data and model parameters on local devices. Inspired by Homomorphic Encryption (rivest1978data), gilad2016cryptonets propose CryptoNets to enhance the efficiency of data encryption, achieving higher federated learning performance. federated_secure introduce a secure aggregation scheme to update machine learning models under their federated learning framework. Recently, secureml propose SecureML to support privacy-preserving collaborative training in a multi-client federated learning system. However, these methods mainly aim to learn a single global model across the data and have no convergence guarantee, which limits their ability to deal with non-i.i.d. data. To address non-i.i.d. data, federated_mtl introduce federated multi-task learning, which learns a separate model for each node. ftl propose semi-supervised federated transfer learning in a privacy-preserving setting. However, their models involve full or semi-supervision. The work proposed here is, to the best of our knowledge, the first federated learning framework to consider unsupervised domain adaptation.
Feature Disentanglement
Deep neural networks are known to extract features where multiple hidden factors are highly entangled. Learning disentangled representations can help remove irrelevant and domainspecific features and model only the relevant factors of data variation. To this end, recent work
(mathieu2016disentangling; makhzani2015adversarial; ufdn; cisac_gan) explores learning interpretable representations using generative adversarial networks (GANs) (gan) and variational autoencoders (VAEs) (vae). Under the fully supervised setting, cisac_gan propose an auxiliary classifier GAN (AC-GAN) to achieve representation disentanglement. ufdn introduce a unified feature disentanglement framework to learn domain-invariant features from data across different domains. kingma2014semi extend VAEs to the semi-supervised setting for representation disentanglement. drit propose to disentangle features into a domain-invariant content space and a domain-specific attribute space, producing diverse outputs without paired training data. Inspired by these works, we propose a method to disentangle the domain-invariant features from the domain-specific features using an adversarial training process. In addition, we propose to minimize the mutual information between the domain-invariant and domain-specific features to enhance the disentanglement.

3 Generalization Bound for Federated Domain Adaptation
We first define the notation and review a typical theoretical error bound for single-source domain adaptation (ben2007analysis; da_bound_bkitzer) devised by Ben-David et al. Then we describe our derived error bound for unsupervised federated domain adaptation. We mainly focus on the high-level interpretation of the error bound here and refer readers to the appendix (see supplementary material) for proof details.
Notation. Let $\mathcal{D}_S$ and $\mathcal{D}_T$ denote the source and target distributions on an input space $\mathcal{X}$, with a ground-truth labeling function $g: \mathcal{X} \to \{0, 1\}$. (Throughout, calligraphic $\mathcal{D}$ denotes a data distribution and italic $D$ denotes a domain discriminator.) A hypothesis is a function $h: \mathcal{X} \to \{0, 1\}$ with error w.r.t. the ground-truth labeling function $g$ defined as $\epsilon(h) = \mathbb{E}_{x \sim \mathcal{D}}\big[\,|h(x) - g(x)|\,\big]$. We denote the risk and empirical risk of hypothesis $h$ on $\mathcal{D}_S$ as $\epsilon_S(h)$ and $\hat{\epsilon}_S(h)$. Similarly, the risk and empirical risk of $h$ on $\mathcal{D}_T$ are denoted $\epsilon_T(h)$ and $\hat{\epsilon}_T(h)$. The $\mathcal{H}$-divergence between two distributions $\mathcal{D}$ and $\mathcal{D}'$ is defined as $d_{\mathcal{H}}(\mathcal{D}, \mathcal{D}') = 2 \sup_{A \in \mathcal{A}_{\mathcal{H}}} \big|\Pr_{\mathcal{D}}(A) - \Pr_{\mathcal{D}'}(A)\big|$, where $\mathcal{H}$ is a hypothesis class for input space $\mathcal{X}$, and $\mathcal{A}_{\mathcal{H}}$ denotes the collection of subsets of $\mathcal{X}$ that are the support of some hypothesis in $\mathcal{H}$.
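In practice, this divergence is often estimated by training a binary domain classifier and converting its test error $\epsilon$ into the proxy $\mathcal{A}$-distance $2(1 - 2\epsilon)$, as used in Section 5.2. A toy sketch, where a fixed 1-D threshold classifier stands in for the learned two-sample classifier (the function name and data layout are illustrative, not the paper's implementation):

```python
def proxy_a_distance(source_feats, target_feats, threshold):
    """Estimate 2 * (1 - 2 * err), where err is the error of a fixed
    1-D threshold 'domain classifier' (a stand-in for the two-sample
    classifier, e.g. a kernel SVM, used in practice)."""
    # Label convention: source = 1, target = 0; classify by thresholding.
    errors = 0
    total = len(source_feats) + len(target_feats)
    for x in source_feats:
        if not x >= threshold:   # misclassified source sample
            errors += 1
    for x in target_feats:
        if x >= threshold:       # misclassified target sample
            errors += 1
    err = errors / total
    return 2.0 * (1.0 - 2.0 * err)
```

Well-separated domains give an estimate near 2, while indistinguishable domains give a classifier at chance level and an estimate near 0.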
The $\mathcal{H}\Delta\mathcal{H}$-divergence of a measurable hypothesis class $\mathcal{H}$ is defined as $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}, \mathcal{D}') = d_{\mathcal{H}'}(\mathcal{D}, \mathcal{D}')$ with $\mathcal{H}' = \mathcal{H}\Delta\mathcal{H} = \{h(x) \oplus h'(x) \mid h, h' \in \mathcal{H}\}$ ($\oplus$: the XOR operation). We denote the optimal hypothesis that achieves the minimum combined risk on the source and the target as $h^* = \arg\min_{h \in \mathcal{H}} \epsilon_S(h) + \epsilon_T(h)$, and the error of $h^*$ as $\lambda = \epsilon_S(h^*) + \epsilon_T(h^*)$. blitzer2007learning prove the following error bound on the target domain.
Theorem 1.
Let $\mathcal{H}$ be a hypothesis space of VC-dimension $d$, and let $\hat{\mathcal{D}}_S$, $\hat{\mathcal{D}}_T$ be the empirical distributions induced by samples of size $m$ drawn from $\mathcal{D}_S$ and $\mathcal{D}_T$, respectively. Then with probability at least $1 - \delta$ over the choice of samples, for each $h \in \mathcal{H}$,

$$\epsilon_T(h) \;\le\; \hat{\epsilon}_S(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_S, \hat{\mathcal{D}}_T) + 4\sqrt{\frac{2d\log(2m) + \log(2/\delta)}{m}} + \lambda \qquad (1)$$
Let $\mathcal{D}_S = \{\mathcal{D}_{S_1}, \dots, \mathcal{D}_{S_N}\}$ and $\mathcal{D}_T$ be the $N$ source domains and the target domain in a UFDA system, where $N \ge 1$. In a federated domain adaptation system, $\mathcal{D}_S$ is distributed over $N$ nodes and the data cannot be shared between nodes during training. Classical domain adaptation algorithms aim to minimize the target risk $\epsilon_T(h)$. However, in a UFDA system, one model cannot directly access the data stored on different nodes for security and privacy reasons. To address this issue, we propose to learn a separate model $h_i$ for each distributed source domain $\mathcal{D}_{S_i}$. The target hypothesis $h_T$ is the aggregation of the parameters of the $h_i$, i.e. $h_T = \sum_{i=1}^{N} \alpha_i h_i$ with $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i = 1$. We can then derive the following error bound:
Theorem 2.
(Weighted error bound for federated domain adaptation). Let $\mathcal{H}$ be a hypothesis class with VC-dimension $d$, and let $\{\hat{\mathcal{D}}_{S_i}\}_{i=1}^{N}$ and $\hat{\mathcal{D}}_T$ be the empirical distributions induced by a sample of size $m$ from each source domain and the target domain in a federated learning system, respectively. Then, $\forall \alpha \in \mathbb{R}_{+}^{N}$ with $\sum_{i=1}^{N} \alpha_i = 1$, with probability at least $1 - \delta$ over the choice of samples, for each $h \in \mathcal{H}$,

$$\epsilon_T(h) \;\le\; \sum_{i=1}^{N} \alpha_i \Big( \hat{\epsilon}_{S_i}(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_i}, \hat{\mathcal{D}}_T) \Big) + \lambda_\alpha + 4\sqrt{\frac{2d\log\big(2(Nm)\big) + \log(2/\delta)}{Nm}} \qquad (2)$$

where $\lambda_\alpha$ is the risk of the optimal hypothesis on the mixture of $\{\mathcal{D}_{S_i}\}_{i=1}^{N}$ (with weights $\alpha$) and $\mathcal{D}_T$, and $\hat{\mathcal{D}}_{S_\alpha}$ is the mixture of source samples with size $Nm$. $d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_i}, \hat{\mathcal{D}}_T)$ denotes the divergence between domain $S_i$ and the target domain.
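The aggregation of source hypotheses into the target hypothesis can be sketched as a weighted parameter average. A minimal pure-Python illustration, where the flat-list parameter layout and the helper name are ours, not the paper's implementation:

```python
def aggregate_target(source_params, weights):
    """Form the target model as a weighted average of per-source model
    parameters (the alpha_i in the bound).

    source_params: list of dicts mapping parameter name -> list of floats
    weights: list of floats, non-negative, summing to 1
    """
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    target = {}
    for name in source_params[0]:
        n = len(source_params[0][name])
        # weighted sum of the j-th scalar across all source models
        target[name] = [
            sum(w * p[name][j] for w, p in zip(weights, source_params))
            for j in range(n)
        ]
    return target
```

With two sources and equal weights, each target parameter is simply the mean of the two source parameters.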
Comparison with Existing Bounds The bound in (2) extends (1); the two are equivalent when only one source domain exists ($N = 1$). Mansour_nips2018 provide a generalization bound for multiple-source domain adaptation under the assumption that the target domain is a mixture of the source domains. In contrast, in our error bound (2), the target domain is assumed to be a novel domain, resulting in a bound involving the $\mathcal{H}\Delta\mathcal{H}$-discrepancy (ben2010theory) and a VC-dimension constraint (vapnik1998statistical). blitzer2007learning propose a generalization bound for semi-supervised multi-source domain adaptation, assuming that partial target labels are available; our generalization bound is devised for the unsupervised setting. zhao2018nips_multisource introduce classification and regression error bounds for multi-source domain adaptation. However, these error bounds assume that the multiple source domains and the target domain share exactly the same hypothesis. In contrast, our error bound involves multiple hypotheses.

4 Federated Adversarial Domain Adaptation
The error bound in Theorem 2 demonstrates the importance of the weights $\alpha$ and the discrepancy $d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_i}, \hat{\mathcal{D}}_T)$ in unsupervised federated domain adaptation. Inspired by this, we propose a dynamic attention model to learn the weights $\alpha$ and federated adversarial alignment to minimize the discrepancy between the source and target domains, as shown in Figure 1. In addition, we leverage representation disentanglement to extract domain-invariant representations to further enhance knowledge transfer.

Dynamic Attention In a federated domain adaptation system, the models on different nodes have different convergence rates. In addition, the domain shift between each source domain and the target domain differs, so some nodes may contribute nothing, or even transfer negatively (pan2010survey), to the target domain. To address this issue, we propose dynamic attention, a mask on the gradients from the source domains. The philosophy behind dynamic attention is to increase the weight of those nodes whose gradients are beneficial to the target domain and limit the weight of those whose gradients are detrimental. Specifically, we leverage the gap statistics (inertia) to evaluate how well the target features can be clustered by an unsupervised clustering algorithm (K-Means). Assuming we have $K$ clusters, the gap statistics are computed as:

$$\mathcal{V} = \sum_{k=1}^{K} \frac{1}{2|C_k|} \sum_{i, i' \in C_k} \lVert x_i - x_{i'} \rVert^2 \qquad (3)$$

where $C_1, \dots, C_K$ are the clusters, $C_k$ denotes the indices of observations in cluster $k$, and $|C_k|$ is its size. Intuitively, a smaller gap-statistics value indicates that the feature distribution has smaller intra-class variance. We measure the contribution of each source domain by the gap-statistics gain between two consecutive iterations, $\mathcal{V}_i^{p} - \mathcal{V}_i^{p+1}$ ($p$ indicating the training step), denoting how much the clusters improve after the target model is updated with the $i$-th source model's gradient. The mask on the gradients from the source domains is defined as $\mathrm{Softmax}\big(\mathcal{V}_1^{p} - \mathcal{V}_1^{p+1}, \dots, \mathcal{V}_N^{p} - \mathcal{V}_N^{p+1}\big)$.

Federated Adversarial Alignment The performance of machine learning models degrades rapidly in the presence of domain discrepancy (long2015). To address this issue, existing work (hoffman2017cycada; Tzeng_2015_ICCV) minimizes the discrepancy with an adversarial training process. For example, Tzeng_2015_ICCV
proposes the domain confusion objective, under which the feature extractor is trained with a cross-entropy loss against a uniform distribution. However, these models require access to the source and target data simultaneously, which is prohibited in UFDA. In the federated setting, we have multiple source domains whose data are stored locally in a privacy-preserving manner, so we cannot train a single model with simultaneous access to a source domain and the target domain. To address this issue, we propose federated adversarial alignment, which divides the optimization into two independent steps: updating domain-specific local feature extractors and updating a global domain discriminator. Specifically, (1) for each domain, we train a local feature extractor, $G_i$ for $\mathcal{D}_{S_i}$ and $G_T$ for $\mathcal{D}_T$; (2) for each ($\mathcal{D}_{S_i}$, $\mathcal{D}_T$) source-target domain pair, we train an adversarial domain discriminator $D_i$ to align the distributions. Note that $D_i$ only gets access to the output feature vectors of $G_i$ and $G_T$, without violating the UFDA setting. Given the $i$-th source domain data $X_{S_i}$ and target data $X_T$, the objective for $D_i$ is defined as follows:

$$\min_{D_i}\; -\mathbb{E}_{x \sim X_{S_i}}\big[\log D_i(G_i(x))\big] - \mathbb{E}_{x \sim X_T}\big[\log\big(1 - D_i(G_T(x))\big)\big] \qquad (4)$$

In the second step, $D_i$ remains unchanged, but $G_i$ and $G_T$ are updated with the following objective:

$$\min_{G_i, G_T}\; -\mathbb{E}_{x \sim X_{S_i}}\big[\log D_i(G_i(x))\big] - \mathbb{E}_{x \sim X_T}\big[\log D_i(G_T(x))\big] \qquad (5)$$
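A minimal sketch of the two alternating losses, assuming the discriminator outputs the probability that a feature comes from the source domain (the function names and this binary cross-entropy formulation are illustrative, not the paper's exact implementation):

```python
import math

def discriminator_loss(d_on_source, d_on_target, eps=1e-12):
    """Step 1 (discriminator update): label source features 1, target 0.
    Inputs are the discriminator's probabilities on each batch."""
    ls = -sum(math.log(p + eps) for p in d_on_source) / len(d_on_source)
    lt = -sum(math.log(1.0 - p + eps) for p in d_on_target) / len(d_on_target)
    return ls + lt

def confusion_loss(d_on_target, eps=1e-12):
    """Step 2 (extractor update, discriminator frozen): the target
    extractor is trained so the discriminator outputs 'source' (1)
    on target features, i.e. inverted labels."""
    return -sum(math.log(p + eps) for p in d_on_target) / len(d_on_target)
```

A perfect discriminator (outputs 1 on source, 0 on target) drives the first loss toward zero; an extractor that fully confuses it (discriminator at 0.5) leaves the discriminator loss at $2\log 2$.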
Representation Disentanglement We employ adversarial disentanglement to extract the domain-invariant features. Specifically, we first train the disentangler and the $K$-way class identifier $C$ to correctly predict the labels with a cross-entropy loss, based on the features $f \in \{f_{di}, f_{ds}\}$. The objective is:

$$\min\; \mathcal{L}_{ce} = -\,\mathbb{E}_{(x, y)}\sum_{k=1}^{K} \mathbb{1}[k = y]\, \log C(f) \qquad (6)$$

where $f_{di}$, $f_{ds}$ denote the domain-invariant and domain-specific features, respectively. In the next step, we freeze the class identifier and only train the feature disentangler to confuse the class identifier by generating the domain-specific features $f_{ds}$, as shown in Figure 1. This can be achieved by minimizing the negative entropy of the predicted class distribution. The objective is as follows:

$$\min\; \mathcal{L}_{ent} = \mathbb{E}_{x}\sum_{k=1}^{K} C(f_{ds})_k\, \log C(f_{ds})_k \qquad (7)$$
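The confusion step above reduces to a negative-entropy penalty on the class identifier's softmax output. A minimal sketch, assuming the probabilities are already normalized (names are illustrative):

```python
import math

def negative_entropy(probs, eps=1e-12):
    """Negative entropy of a predicted class distribution.
    Minimizing this (i.e., maximizing entropy) trains the disentangler
    so that domain-specific features become uninformative for the
    frozen class identifier."""
    return sum(p * math.log(p + eps) for p in probs)
```

A uniform distribution attains the minimum ($-\log K$ for $K$ classes), so the disentangler is rewarded for making the classifier maximally uncertain on domain-specific features.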
To enhance the disentanglement, we minimize the mutual information between the domain-invariant and domain-specific features, following Peng2019DomainAL. Specifically, the mutual information is defined as $I(X; Z) = \int_{\mathcal{X}\times\mathcal{Z}} \log\frac{d\mathbb{P}_{XZ}}{d\mathbb{P}_X \otimes d\mathbb{P}_Z}\, d\mathbb{P}_{XZ}$, where $\mathbb{P}_{XZ}$ is the joint probability distribution of $(X, Z)$, and $\mathbb{P}_X$, $\mathbb{P}_Z$ are the marginals. Despite being a pivotal measure between distributions, mutual information is tractable only for discrete variables or for a limited family of problems where the probability distributions are known (mine). Following Peng2019DomainAL, we adopt the Mutual Information Neural Estimator (MINE) (mine), which estimates the mutual information by leveraging a neural network $T_\theta$: $I(X; Z) \ge \mathbb{E}_{\mathbb{P}_{XZ}}[T_\theta] - \log \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^{T_\theta}]$. To avoid computing the integrals, we leverage Monte-Carlo integration to calculate the estimate:

$$\widehat{I}(X; Z) = \frac{1}{n}\sum_{i=1}^{n} T_\theta(x_i, z_i) - \log\Big(\frac{1}{n}\sum_{i=1}^{n} e^{T_\theta(x_i, \bar{z}_i)}\Big) \qquad (8)$$

where $(x_i, z_i)$ are sampled from the joint distribution and $\bar{z}_i$ is sampled from the marginal distribution. The domain-invariant and domain-specific features are forwarded to a reconstructor with an L2 loss to reconstruct the original features, aiming to preserve representation integrity, as shown in Figure 1(b).

Optimization
Our model is trained in an end-to-end fashion. We train the federated adversarial alignment and representation disentanglement components with Stochastic Gradient Descent (SGD). The federated adversarial alignment loss and representation disentanglement loss are optimized together with the task loss. The detailed training procedure is presented in Algorithm 1.

5 Experiments
We test our model on the following tasks: digit classification (Digit-Five), object recognition (Office-Caltech10 (gong2012geodesic), DomainNet (lsdac)) and sentiment analysis (Amazon Review dataset (ref:blitzer)). Figure 2 shows some data samples and Table 9 (see supplementary material) shows the number of samples per domain used in our experiments. We perform our experiments on a 10-GPU TitanXp cluster and simulate the federated system on a single machine (as data communication is not the main focus of this paper). Our model is implemented with PyTorch. We repeat every experiment 10 times on the Digit-Five and Amazon Review datasets, and 5 times on the Office-Caltech10 and DomainNet (lsdac) datasets, reporting the mean and standard deviation of accuracy. To better explore the effectiveness of different components of our model, we evaluate three ablations, i.e. model I: with dynamic attention; model II: I + adversarial alignment; and model III: II + representation disentanglement.

5.1 Experiments on Digit Recognition
Digit-Five This dataset is a collection of five benchmarks for digit recognition, namely MNIST (lecun1998gradient), Synthetic Digits (DANN), MNIST-M (DANN), SVHN, and USPS. In our experiments, we take turns setting one domain as the target domain and the rest as the distributed source domains, leading to five transfer tasks. The detailed architecture of our model can be found in Table 6 (see supplementary material).
Since many DA models (MCD_2018; SE; hoffman2017cycada) require simultaneous access to data from different domains, it is infeasible to directly compare our model to these baselines. Instead, we compare our model to the following popular domain adaptation baselines: Domain Adversarial Neural Network (DANN) (DANN), Deep Adaptation Network (DAN) (long2015), Automatic DomaIn Alignment Layers (AutoDIAL) (carlucci2017), and Adaptive Batch Normalization (AdaBN) (li2016revisiting). Specifically, DANN minimizes the domain gap between the source and target domains with a gradient reversal layer. DAN applies a multi-kernel MMD loss (gretton2007kernel) to align the source domain with the target domain in a Reproducing Kernel Hilbert Space. AutoDIAL introduces domain alignment layers into deep models to match the source and target feature distributions to a reference distribution. AdaBN applies Batch Normalization layers (batch_normalization) to facilitate knowledge transfer between the source and target domains. When conducting the baseline experiments, we use the code provided by the authors and modify the original settings to fit the federated DA setting (i.e. each domain has its own model), denoting the resulting variants f-DAN and f-DANN. In addition, to demonstrate the difficulty of UFDA, where accessing all source data with a single model is prohibited, we also perform the corresponding multi-source DA experiments (shared source data).


Models  mt,sv,sy,up→mm  mm,sv,sy,up→mt  mt,mm,sy,up→sv  mt,mm,sv,up→sy  mt,mm,sv,sy→up  Avg

Multi-source DA (shared source data):
Source Only  63.3±0.7  90.5±0.8  88.7±0.8  63.5±0.9  82.4±0.6  77.7
DAN  63.7±0.7  96.3±0.5  94.2±0.8  62.4±0.7  85.4±0.7  80.4
DANN  71.3±0.5  97.6±0.7  92.3±0.8  63.4±0.7  85.3±0.8  82.1

Federated DA (source data not shared):
Source Only  49.6±0.8  75.4±1.3  22.7±0.9  44.3±0.7  75.5±1.4  53.5
AdaBN  59.3±0.8  75.3±0.7  34.2±0.6  59.7±0.7  87.1±0.9  61.3
AutoDIAL  60.7±1.6  76.8±0.9  32.4±0.5  58.7±1.2  90.3±0.9  65.8
f-DANN  59.5±0.6  86.1±1.1  44.3±0.6  53.4±0.9  89.7±0.9  66.6
f-DAN  57.5±0.8  86.4±0.7  45.3±0.7  58.4±0.7  90.8±1.1  67.7
FADA+attention (I)  44.2±0.7  90.5±0.8  27.8±0.5  55.6±0.8  88.3±1.2  61.3
FADA+adversarial (II)  58.2±0.8  92.5±0.9  48.3±0.6  62.1±0.5  90.6±1.1  70.3
FADA+disentangle (III)  62.5±0.7  91.4±0.7  50.5±0.3  71.8±0.5  91.7±1.0  73.6

Table 1: Accuracy (%) on the Digit-Five dataset (mean ± std).

Results and Analysis The experimental results are shown in Table 1. We make the following observations. (1) Model III achieves 73.6% average accuracy, significantly outperforming the baselines. (2) The results of models I and II demonstrate the effectiveness of dynamic attention and adversarial alignment. (3) Federated DA yields much weaker results than multi-source DA, demonstrating that the newly proposed UFDA setting is very challenging.
To dive deeper into the feature representations of our model versus the baselines, we plot in Figures 3(a)-3(d) the t-SNE embeddings of the features learned on the mm,mt,sv,sy→up task with source-only features, DANN features, DAN features and FADA features, respectively. We observe that the feature embeddings of our model have smaller intra-class variance and larger inter-class variance than DANN and DAN, demonstrating that our model generates the desired feature embedding and can extract domain-invariant features across different domains.
5.2 Experiments on OfficeCaltech10
Office-Caltech10 (gong2012geodesic) This dataset contains 10 common categories shared by the Office31 (office) and Caltech256 (griffin2007caltech) datasets. It covers four domains: Caltech (C), sampled from the Caltech256 dataset; Amazon (A), images collected from amazon.com; and Webcam (W) and DSLR (D), images taken by a web camera and a DSLR camera in an office environment.


Method  C,D,W → A  A,D,W → C  A,C,W → D  A,C,D → W  Average

AlexNet backbone:
AlexNet  80.1±0.4  86.9±0.3  82.7±0.5  85.1±0.3  83.7
DAN  82.5±0.5  87.2±0.4  85.6±0.4  86.1±0.3  85.4
DANN  83.1±0.4  86.5±0.5  84.8  86.4±0.5  85.2
FADA+attention (I)  81.2±0.3  87.1±0.6  83.5  85.9±0.4  84.4
FADA+adversarial (II)  83.1±0.6  87.8±0.4  85.4  86.8±0.5  85.8
FADA+disentangle (III)  84.3±0.6  88.4±0.5  86.1±0.4  87.3±0.5  86.5

ResNet-101 backbone:
ResNet101  81.9±0.5  87.9±0.3  85.7±0.5  86.9±0.4  85.6
AdaBN  82.2±0.4  88.2±0.6  85.9±0.7  87.4±0.8  85.7
AutoDIAL  83.3±0.6  87.7±0.8  85.6±0.7  87.1±0.6  85.9
DAN  82.7±0.3  88.1±0.5  86.5±0.3  86.5±0.3  85.9
DANN  83.5±0.4  88.5±0.3  85.9±0.5  87.1±0.4  86.3
FADA+attention (I)  82.1±0.5  87.5±0.3  85.8±0.4  87.3±0.5  85.7
FADA+adversarial (II)  83.2±0.4  88.4±0.3  86.4±0.5  87.8±0.4  86.5
FADA+disentangle (III)  84.2±0.5  88.7±0.5  87.1±0.6  88.1±0.4  87.1

Table 2: Accuracy (%) on Office-Caltech10 (mean ± std).



Models  inf,pnt,qdr,rel,skt→clp  clp,pnt,qdr,rel,skt→inf  clp,inf,qdr,rel,skt→pnt  clp,inf,pnt,rel,skt→qdr  clp,inf,pnt,qdr,skt→rel  clp,inf,pnt,qdr,rel→skt  Avg

AlexNet backbone:
AlexNet  39.2±0.7  12.7±0.4  32.7±0.4  5.9±0.7  40.3±0.5  22.7±0.6  25.6
DAN  41.6±0.6  13.7±0.5  36.3±0.5  6.5±0.5  43.5±0.8  22.9±0.5  27.4
DANN  42.6±0.8  14.1±0.7  35.2±0.3  6.2±0.7  42.9±0.5  22.7±0.7  27.2
FADA+disentangle (III)  44.9±0.7  15.9±0.6  36.3±0.8  8.6±0.8  44.5±0.6  23.2±0.8  28.9

ResNet-101 backbone:
ResNet101  41.6±0.6  14.5±0.7  35.7±0.7  8.4±0.7  43.5±0.7  23.3±0.7  27.7
DAN  43.5±0.7  14.1±0.6  37.6±0.7  8.3±0.6  44.5±0.5  25.1±0.5  28.9
DANN  43.1±0.8  15.2±0.9  35.7±0.4  8.2±0.6  45.2±0.7  27.1±0.6  29.1
FADA+disentangle (III)  45.3±0.7  16.3±0.8  38.9±0.7  7.9±0.4  46.7±0.4  26.8±0.4  30.3

Table 3: Accuracy (%) on DomainNet (mean ± std).



Method  D,E,K → B  B,E,K → D  B,D,K → E  B,D,E → K  Average
Source Only  74.4  79.2±0.4  73.5±0.2  71.4±0.1  74.6
DANN  75.2±0.3  82.7±0.2  76.5±0.3  72.8±0.4  76.8
AdaBN  76.7±0.3  80.9±0.3  75.7±0.2  74.6±0.3  76.9
AutoDIAL  76.3±0.4  81.3±0.5  74.8±0.4  75.6±0.2  77.1
DAN  75.6±0.2  81.6±0.3  77.9±0.1  73.2±0.2  77.6
FADA+attention (I)  74.8±0.2  78.9±0.2  74.5±0.3  72.5±0.2  75.2
FADA+adversarial (II)  79.7±0.2  81.1±0.1  77.3±0.2  76.4±0.2  78.6
FADA+disentangle (III)  78.1±0.2  82.7±0.1  77.4±0.2  77.5±0.3  78.9

Table 4: Accuracy (%) on the Amazon Review dataset (mean ± std).

We leverage two popular networks as the backbone of the feature generator: AlexNet (alexnet) and ResNet (resnet). Both networks are pre-trained on ImageNet (ImageNet). The other components of our model are randomly initialized from a normal distribution. During training, we set the learning rate of the randomly initialized parameters to ten times that of the pre-trained parameters, as those parameters need more updates to converge. Details of our model are listed in Table 8 (supplementary material).

Results and Analysis The experimental results on the Office-Caltech10 dataset are shown in Table 2. We utilize the same backbones as the baselines and report the results separately. We make the following observations: (1) Our model achieves 86.5% accuracy with an AlexNet backbone and 87.1% accuracy with a ResNet backbone, outperforming the compared baselines. (2) All the models have similar performance when C, D, or W is selected as the target domain, but perform worse when A is the target domain. This phenomenon is probably caused by the larger domain gap: the images in A are collected from amazon.com and contain a white background.
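The 10x learning-rate policy for randomly initialized layers can be expressed as per-group learning rates; most deep learning frameworks accept optimizer parameter groups in this form. A sketch with hypothetical parameter lists (names are ours):

```python
def make_param_groups(pretrained_params, scratch_params, base_lr):
    """Build optimizer parameter groups: randomly initialized ('scratch')
    parameters train at 10x the rate of the ImageNet-pretrained backbone."""
    return [
        {"params": pretrained_params, "lr": base_lr},
        {"params": scratch_params, "lr": 10.0 * base_lr},
    ]
```

For example, with a base learning rate of 0.001, the disentangler and classifier heads would train at 0.01 while the pre-trained backbone stays at 0.001.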
To better analyze the effectiveness of FADA, we perform the following empirical analysis: (1) $\mathcal{A}$-distance: ben2010theory suggest the $\mathcal{A}$-distance as a measure of domain discrepancy. Following long2015, we calculate the approximate $\mathcal{A}$-distance $\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon)$ for the C,D,W→A and A,C,W→D tasks, where $\epsilon$ is the generalization error of a two-sample classifier (e.g. kernel SVM) trained on the binary problem of distinguishing input samples between the source and target domains. In Figure 4(a), we plot $\hat{d}_{\mathcal{A}}$ for both tasks with raw ResNet features, DANN features, and FADA features, respectively. We observe that $\hat{d}_{\mathcal{A}}$ on FADA features is smaller than on ResNet and DANN features, demonstrating that FADA features are harder to distinguish between source and target. (2) To show how the dynamic attention mechanism benefits the training process, we plot the training loss with and without dynamic weights for the A,C,W→D task in Figure 4(b). The figure shows that the target model's training error is much smaller when dynamic attention is applied, which is consistent with the quantitative results. In addition, in the A,C,W→D setting, the weight of A decreases to the lower bound after the first few epochs while the weight of W increases during training, as photos in both D and W are taken in the same environment with different cameras. (3) To better analyze the error modes, we plot the confusion matrices for DAN and FADA on the A,C,D→W task in Figures 4(c)-4(d). The figures show that DAN mainly confuses "calculator" with "keyboard" and "backpack" with "headphones", while FADA is able to distinguish them with disentangled features.

5.3 Experiments on DomainNet
DomainNet (http://ai.bu.edu/M3SDA/) This dataset contains approximately 0.6 million images distributed among 345 categories. It comprises six domains: Clipart (clp), a collection of clipart images; Infograph (inf), infographic images with specific objects; Painting (pnt), artistic depictions of objects in the form of paintings; Quickdraw (qdr), drawings from worldwide players of the game "Quick Draw!" (https://quickdraw.withgoogle.com/data); Real (rel), photos and real-world images; and Sketch (skt), sketches of specific objects. This dataset is very large-scale and contains rich and informative visual cues across different domains, providing a good testbed for unsupervised federated domain adaptation. Some sample images can be found in Figure 2.
Results The experimental results on DomainNet are shown in Table 3. Our model achieves 28.9% and 30.3% accuracy with the AlexNet and ResNet backbones, respectively. In both scenarios, our model outperforms the baselines, demonstrating its effectiveness on a large-scale dataset. Note that this dataset contains about 0.6 million images, so even a one-percent performance improvement is non-trivial. From the experimental results, we observe that all models deliver less desirable performance when Infograph and Quickdraw are selected as the target domains. This is mainly caused by the large domain shift between the inf/qdr domains and the other domains.
5.4 Experiments on Amazon Review
Amazon Review (ref:blitzer) This dataset provides a testbed for cross-domain sentiment analysis of text. The task is to identify whether the sentiment of a review is positive or negative. The dataset contains reviews from amazon.com users for four popular merchandise categories: Books (B), DVDs (D), Electronics (E), and Kitchen appliances (K). Following gong2013connecting, we use a 400-dimensional bag-of-words representation and a fully connected deep neural network as the backbone. The detailed architecture of our model can be found in Table 7 (supplementary material).
Results The experimental results on the Amazon Review dataset are shown in Table 4. Our model achieves an accuracy of 78.9% and outperforms the compared baselines. We make two major observations: (1) Our model is not only effective on vision tasks but also performs well on linguistic tasks under the UFDA learning schema. (2) From the results of models I and II, we observe that dynamic attention and federated adversarial alignment both improve performance. However, the performance boost from model II to model III is limited, suggesting that linguistic features are harder to disentangle than visual features.
5.5 Ablation Study
To demonstrate the effectiveness of dynamic attention, we perform an ablation study. Table 5 shows the results on the Digit-Five, Office-Caltech10 and Amazon Review benchmarks. We observe that performance drops in most of the experiments when the dynamic attention model is not applied. The dynamic attention model is devised to cope with the varying convergence rates in the federated learning system, i.e., different source domains have their own convergence rates. In addition, it increases the weight of a specific domain when the domain shift between that domain and the target domain is small, and decreases the weight otherwise.
target  mm  mt  sv  sy  up  Avg  A  C  D  W  Avg  B  D  E  K  Avg
FADA w/o. attention  60.1  91.2  49.2  69.1  90.2  71.9  83.3  85.7  86.2  88.3  85.8  77.2  82.8  77.2  76.3  78.3
FADA w. attention  62.5  91.4  50.5  71.8  91.7  73.6  84.2  88.7  87.1  88.1  87.1  78.1  82.7  77.4  77.5  78.9

Table 5: Ablation study of dynamic attention on Digit-Five, Office-Caltech10, and Amazon Review (accuracy, %).
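The dynamic attention weights ablated above can be sketched as follows; the K-Means step is stubbed out with precomputed cluster assignments, and the softmax over per-source gap-statistics gains follows Section 4 (all names are illustrative, 1-D features for simplicity):

```python
import math

def gap_statistic(features, assignments, k):
    """Within-cluster dispersion: for each cluster, the sum of pairwise
    squared distances normalized by twice the cluster size (Eq. (3) style)."""
    total = 0.0
    for c in range(k):
        cluster = [f for f, a in zip(features, assignments) if a == c]
        if not cluster:
            continue
        pairwise = sum((x - y) ** 2 for x in cluster for y in cluster)
        total += pairwise / (2 * len(cluster))
    return total

def attention_weights(gains):
    """Softmax over per-source gap-statistics gains (V^p - V^{p+1});
    sources whose gradients improve target clustering get larger weight."""
    exps = [math.exp(g) for g in gains]
    z = sum(exps)
    return [e / z for e in exps]
```

Tightly clustered target features yield a gap statistic near zero, and a source whose update shrinks the statistic the most receives the largest mask weight.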
6 Conclusion
In this paper, we first proposed a novel unsupervised federated domain adaptation (UFDA) problem and derived a theoretical generalization bound for UFDA. Inspired by the theoretical results, we proposed a novel model called Federated Adversarial Domain Adaptation (FADA) to transfer the knowledge learned from distributed source domains to an unlabeled target domain with a novel dynamic attention schema. Empirically, we showed that feature disentanglement boosts the performance of FADA in UFDA tasks. An extensive empirical evaluation on UFDA vision and linguistic benchmarks demonstrated the efficacy of FADA against several domain adaptation baselines.
References
7 Model Architecture
We provide the detailed model architectures (Tables 6-8) for each component of our model: Feature Generator, Disentangler, Domain Identifier, Class Identifier, Reconstructor, and Mutual Information Estimator (MINE).


layer  configuration

Feature Generator
1  Conv2D (3, 64, 5, 1, 2), BN, ReLU, MaxPool
2  Conv2D (64, 64, 5, 1, 2), BN, ReLU, MaxPool
3  Conv2D (64, 128, 5, 1, 2), BN, ReLU

Disentangler
1  FC (8192, 3072), BN, ReLU
2  DropOut (0.5), FC (3072, 2048), BN, ReLU

Domain Identifier
1  FC (2048, 256), LeakyReLU
2  FC (256, 2), LeakyReLU

Class Identifier
1  FC (2048, 10), BN, Softmax

Reconstructor
1  FC (4096, 8192)

Mutual Information Estimator
fc1_x  FC (2048, 512), LeakyReLU
fc1_y  FC (2048, 512), LeakyReLU
2  FC (512, 1)

Table 6: Model architecture for the digit recognition task (Digit-Five dataset). For each convolution layer, we list the input dimension, output dimension, kernel size, stride, and padding. For fully connected layers, we provide the input and output dimensions. For dropout layers, we provide the probability of an element being zeroed.


layer  configuration

Feature Generator
1  FC (400, 128), BN, ReLU

Disentangler
1  FC (128, 64), BN, ReLU
2  DropOut (0.5), FC (64, 32), BN, ReLU

Domain Identifier
1  FC (64, 32), LeakyReLU
2  FC (32, 2), LeakyReLU

Class Identifier
1  FC (32, 2), BN, Softmax

Reconstructor
1  FC (64, 128)

Mutual Information Estimator
fc1_x  FC (32, 16), LeakyReLU
fc1_y  FC (32, 16), LeakyReLU
2  FC (16, 1)

Table 7: Model architecture for the sentiment analysis task (Amazon Review dataset).

layer  configuration

Feature Generator: ResNet101 or AlexNet

Disentangler
1  Dropout (0.5), FC (2048, 2048), BN, ReLU
2  Dropout (0.5), FC (2048, 2048), BN, ReLU

Domain Identifier
1  FC (2048, 256), LeakyReLU
2  FC (256, 2), LeakyReLU

Class Identifier
1  FC (2048, 10), BN, Softmax

Reconstructor
1  FC (4096, 2048)

Mutual Information Estimator
fc1_x  FC (2048, 512), LeakyReLU
fc1_y  FC (2048, 512), LeakyReLU
2  FC (512, 1)

Table 8: Model architecture for the object recognition tasks (Office-Caltech10 and DomainNet).
8 Details of Datasets
We provide detailed information about the datasets. For Digit-Five and DomainNet, we provide the train/test split for each domain. For Office-Caltech10, we provide the number of images in each domain. For the Amazon Review dataset, we show the number of positive and negative reviews for each merchandise category.


Digit-Five
Splits  mnist  mnist_m  svhn  syn  usps  Total
Train  25,000  25,000  25,000  25,000  7,348  107,348
Test  9,000  9,000  9,000  9,000  1,860  37,860

Office-Caltech10
Splits  Amazon  Caltech  Dslr  Webcam  Total
Total  958  1,123  157  295  2,533

DomainNet
Splits  clp  inf  pnt  qdr  rel  skt  Total
Train  34,019  37,087  52,867  120,750  122,563  49,115  416,401
Test  14,818  16,114  22,892  51,750  52,764  21,271  179,609

Amazon Review
Splits  Books  DVDs  Electronics  Kitchen  Total
Positive  1,000  1,000  1,000  1,000  4,000
Negative  1,000  1,000  1,000  1,000  4,000

Table 9: Statistics of the datasets used in our experiments.
9 Proof of Theorem 2
Theorem 3.
(Weighted error bound for federated domain adaptation). Let $\mathcal{H}$ be a hypothesis class with VC-dimension $d$, and let $\{\hat{\mathcal{D}}_{S_i}\}_{i=1}^{N}$ and $\hat{\mathcal{D}}_T$ be the empirical distributions induced by a sample of size $m$ from each source domain and the target domain in a federated learning system, respectively. Then, $\forall \alpha \in \mathbb{R}_{+}^{N}$ with $\sum_{i=1}^{N} \alpha_i = 1$, with probability at least $1 - \delta$ over the choice of samples, for each $h \in \mathcal{H}$,

$$\epsilon_T(h) \;\le\; \sum_{i=1}^{N} \alpha_i \Big( \hat{\epsilon}_{S_i}(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_i}, \hat{\mathcal{D}}_T) \Big) + \lambda_\alpha + 4\sqrt{\frac{2d\log\big(2(Nm)\big) + \log(2/\delta)}{Nm}} \qquad (9)$$

where $\lambda_\alpha$ is the risk of the optimal hypothesis on the mixture of $\{\mathcal{D}_{S_i}\}_{i=1}^{N}$ (with weights $\alpha$) and $\mathcal{D}_T$, and $\hat{\mathcal{D}}_{S_\alpha}$ is the mixture of source samples with size $Nm$.
Proof.
Consider a combined source domain $\mathcal{D}_{S_\alpha}$ that is a mixture of the $N$ source domains with mixture weights $\alpha = (\alpha_1, \dots, \alpha_N)$, where $\sum_{i} \alpha_i = 1$ and $\alpha_i \ge 0$. Denote the mixture source distribution by $\mathcal{D}_{S_\alpha} = \sum_{i} \alpha_i \mathcal{D}_{S_i}$, and the sample drawn from it by $\hat{\mathcal{D}}_{S_\alpha}$, of size $Nm$. We can treat $\mathcal{D}_{S_\alpha}$ and $\mathcal{D}_T$ as the source domain and target domain, respectively. Applying Theorem 1, we have that with probability at least $1 - \delta$ over the choice of samples, for each $h \in \mathcal{H}$,

$$\epsilon_T(h) \;\le\; \hat{\epsilon}_{S_\alpha}(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_\alpha}, \hat{\mathcal{D}}_T) + 4\sqrt{\frac{2d\log\big(2(Nm)\big) + \log(2/\delta)}{Nm}} + \lambda_\alpha \qquad (10)$$

where $\lambda_\alpha$ is the risk of the optimal hypothesis on $\mathcal{D}_{S_\alpha}$ and $\mathcal{D}_T$. The upper bound of $d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_\alpha}, \hat{\mathcal{D}}_T)$ can be derived as follows:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_\alpha}, \hat{\mathcal{D}}_T) \;\le\; \sum_{i=1}^{N} \alpha_i\, d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_i}, \hat{\mathcal{D}}_T),$$

where the inequality follows from the triangle inequality for the $\mathcal{H}\Delta\mathcal{H}$-divergence. Similarly, with the triangle-inequality property, the optimality term $\lambda_\alpha$ can be bounded by the corresponding weighted per-domain terms. On the other hand, for the empirical risk on the mixture, we have $\hat{\epsilon}_{S_\alpha}(h) = \sum_{i=1}^{N} \alpha_i\, \hat{\epsilon}_{S_i}(h)$. Replacing $\hat{\epsilon}_{S_\alpha}(h)$ and $d_{\mathcal{H}\Delta\mathcal{H}}(\hat{\mathcal{D}}_{S_\alpha}, \hat{\mathcal{D}}_T)$ in Eq. 10 yields the theorem.
∎
Remark.
The equation in Theorem 2 provides a theoretical error bound for unsupervised federated domain adaptation, as it assumes that the source data distributed on different nodes can form a mixture source domain. In practice, the data on different nodes cannot be shared under the federated learning schema. The theoretical error bound is therefore only valid when the weights of the models on all nodes are fully synchronized.