Adversarial Invariant Feature Learning with Accuracy Constraint for Domain Generalization

04/29/2019 ∙ by Kei Akuzawa, et al. ∙ The University of Tokyo

Learning domain-invariant representations is a dominant approach for domain generalization (DG), where we need to build a classifier that is robust to domain shifts. However, previous domain-invariance-based methods overlooked the underlying dependency of classes on domains, which is responsible for the trade-off between classification accuracy and domain invariance. Because the primary purpose of DG is to classify unseen domains rather than the invariance itself, the improvement of the invariance can negatively affect DG performance under this trade-off. To overcome the problem, this study first expands the analysis of the trade-off by Xie et al., and provides the notion of accuracy-constrained domain invariance, which means the maximum domain invariance within a range that does not interfere with accuracy. We then propose a novel method, adversarial feature learning with accuracy constraint (AFLAC), which explicitly leads to that invariance in adversarial training. Empirical validations show that the performance of AFLAC is superior to that of domain-invariance-based methods on both synthetic and three real-world datasets, supporting the importance of considering the dependency and the efficacy of the proposed method.


1 Introduction

In supervised learning we typically assume that samples are obtained from the same distribution in training and testing; however, because this assumption does not hold in many practical situations, the classification accuracy on the test data is reduced [30]. This motivates research into domain adaptation (DA) [9] and domain generalization (DG) [3]. DA methods operate in the setting where we have access to source and (either labeled or unlabeled) target domain data during training, and run some adaptation step to compensate for the domain shift. DG addresses the harder setting, where we have labeled data from several source domains and collectively exploit them such that the trained system generalizes to target domain data without requiring any access to them. Such challenges arise in many applications, e.g., hand-writing recognition (where domain shifts are induced by users, [28]), robust speech recognition (by acoustic conditions, [29]), and wearable sensor data interpretation (by users, [7]).

Figure 1: Explanation of domain-class dependency and the induced trade-off. (a) When the domain and the class are independent, (b) domain invariance and classification accuracy can be optimized at the same time, and the invariance prevents the classifier from overfitting to source domains. (c) When they are dependent, a trade-off exists between these two: (d) optimal classification accuracy cannot be achieved when perfect invariance is achieved, and (e) vice versa. We propose a method to lead explicitly to (e) rather than (d), because the primary purpose for domain generalization is classification, not domain-invariance itself.

This paper considers DG under the situation where domain and class labels are statistically dependent owing to some common latent factor (Figure 1-(c)), which we refer to as domain-class dependency. For example, the WISDM Activity Prediction dataset [16], where classes and domains correspond to activities and wearable device users, exhibits this dependency because of (1) data characteristics: some activities (jogging and climbing stairs) are strenuous to the extent that some unathletic subjects avoided them, and (2) data-collection errors: other activities were added only after the study began and the initial subjects could not perform them. Note that the dependency is common in real-world datasets and a similar setting has been investigated in DA studies [36, 12], but most prior DG studies overlooked the dependency; moreover, we need to follow an approach separate from DA because DG methods cannot require any access to target data, as we discuss further in Sec. 2.2.

Most prior DG methods utilize invariant feature learning (IFL) [27, 7, 10, 33], which can be negatively affected by the dependency. IFL attempts to learn a latent representation $h$ from input data $x$ that is invariant to domains $d$, or to match multiple source domain distributions in feature space. When source and target domains have some common structure (see [27]), matching multiple source domains leads to matching the source and target domains and thereby prevents the classifier from overfitting to source domains (Figure 1-(b)). However, under the dependency, merely imposing perfect domain invariance (which means $h$ and $d$ are independent) adversely affects the classification accuracy, as pointed out by Xie et al. [33] and illustrated in Figure 1. Intuitively speaking, since $y$ contains information about $d$ under the dependency, encoding information about $d$ into $h$ helps to predict $y$; however, IFL attempts to remove all domain information from $h$, which causes the trade-off. Although that trade-off occurs in source domains (because we use only source data during optimization), it can also negatively affect the classification performance for target domains. For example, if the target domain has characteristics similar (or, as an extreme case, identical) to those of a certain source domain, giving priority to domain invariance interferes with the DG performance (Figure 1-(d)).

In this paper, considering that prioritizing domain invariance under the trade-off can negatively affect the DG performance, we propose to maximize domain invariance within a range that does not interfere with the classification accuracy (Figure 1-(e)). We first expand the analysis by [33] of domain adversarial nets (DAN), a widely used IFL method, and derive Theorems 1 and 2, which show the conditions under which domain invariance harms the classification accuracy. In Theorem 3 we show that accuracy-constrained domain invariance, which we define as the maximum $H(d|h)$ ($H$ denotes entropy) value within a range that does not interfere with accuracy, equals $H(d|y)$. In other words, when $H(d|h) = H(d|y)$, i.e., the learned representation $h$ contains as much domain information as the class labels, it does not affect the classification performance. After deriving the theorems, we propose a novel method, adversarial feature learning with accuracy constraint (AFLAC), which leads to that invariance in adversarial training. Empirical validations show that the performance of AFLAC is superior to that of baseline methods, supporting the importance of considering domain-class dependency and the efficacy of the proposed approach for overcoming the issue.

The main contributions of this paper can be summarized as follows. First, we show that the implicit assumption of previous IFL methods, i.e., that domain and class are statistically independent, is not valid in many real-world datasets, and that this degrades their DG performance. Second, we theoretically show to what extent the latent representation can become invariant to domains without interfering with classification accuracy. This is significant because the analysis guides a novel regularization approach that is suitable for our situation. Finally, we propose a novel method which improves domain invariance while maintaining classification performance, and it achieves higher accuracy than the IFL methods on both synthetic and three real-world datasets.

2 Preliminary and Related Work

2.1 Problem Statement of Domain Generalization

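Denote $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{D}$ as the input feature, class label, and domain spaces, respectively. With random variables $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $d \in \mathcal{D}$, we can define the probability distribution for each domain as $p(x, y \mid d)$. For simplicity this paper assumes that $y$ and $d$ are discrete variables. In domain generalization, we are given a training dataset consisting of $\{x_i^s, y_i^s\}_{i=1}^{n^s}$ for all $s \in \{1, 2, \dots, m\}$, where each $\{x_i^s, y_i^s\}_{i=1}^{n^s}$ is drawn from the source domain $p(x, y \mid d = s)$. Using the training dataset, we train a classifier $g: \mathcal{X} \to \mathcal{Y}$, and use the classifier to predict labels of samples drawn from an unknown target domain $p(x, y \mid d = t)$.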

2.2 Related Work

DG has been attracting considerable attention in recent years [27, 28]. [18] showed that non-end-to-end DG methods such as DICA [27] and MTAE [11] do not tend to outperform vanilla CNN, thus end-to-end methods are desirable. End-to-end methods based on domain invariant representation can be divided into two categories: adversarial-learning-based methods such as DAN [9, 33] and pre-defined-metric-based methods [10, 20].

In particular, our analysis and proposed method are based on DAN, which measures the invariance by using a domain classifier (also known as a discriminator) parameterized by deep neural networks and imposes regularization by deceiving it. Although DAN was originally invented for DA, [33] demonstrated its efficacy in DG. In addition, they intuitively explained the trade-off between classification accuracy and domain invariance, but did not suggest any solution to the problem except for carefully tuning a weighting parameter. AFLAC also relates to domain confusion loss [31] in that their encoders attempt to minimize the Kullback-Leibler divergence (KLD) between the output distribution of the discriminators and some domain distribution ($p(d|y)$ in AFLAC and the uniform distribution in [31]), rather than to deceive the discriminator as DAN does.

Several studies that address DG without utilizing IFL have been conducted. For example, CCSA [26], CIDG [21], and CIDDG [22] proposed to make use of semantic alignment, which attempts to make the latent representation given the class label ($p(h|y)$) identical within source domains. This approach was originally proposed by [12] in the DA context, but its efficacy in overcoming the trade-off problem is not obvious. Also, CIDDG, which is the only adversarial-learning-based semantic alignment method so far, needs the same number of domain classification networks as domains, whereas ours needs only one. CrossGrad [28], which is one of the recent state-of-the-art DG methods, utilizes data augmentation with adversarial examples. However, because the method relies on the assumption that $y$ and $d$ are independent, it might not be directly applicable to our setting. MLDG [19], MetaReg [2], and Feature-Critic [23], other state-of-the-art methods, are inspired by meta-learning. These methods make no assumption about the relation between $y$ and $d$; hence, they could be combined with our proposed method in principle.

As with our paper, [21, 22] also pointed out the importance of considering the types of distributional shifts that occur, and they address the distributional shift across domains caused by the causal structure $y \to x$. However, the causal structure does not cause the trade-off problem as long as $y$ and $d$ are independent (Figure 1-(a, b)); thus it is essential to consider and address the domain-class dependency problem. They also proposed to correct the domain-class dependency with the class prior-normalized weight, which enforces the prior probability for each class to be the same across domains. Its motivation is different from ours in that it is intended to avoid overfitting, whereas we address the trade-off problem.

In DA, [36, 12] address the situation where $p(y)$ changes across the source and target domains by correcting the change of $p(y)$ using unlabeled target domain data, which is often accomplished at the cost of classification accuracy for the source domain. However, this approach is not applicable (or necessary) to DG because we are agnostic about target domains and cannot run such an adaptation step in DG. Instead, this paper is concerned with the change of $p(y)$ within source domains and proposes to maximize the classification accuracy for source domains while improving the domain invariance.

It is worth mentioning that IFL has been used in many contexts other than DG, e.g., DA [32, 9], domain transfer [17, 6], and fairness-aware classification [35, 24, 25]. However, adjusting it to each specific task is likely to improve performance. For example, in the fairness-aware classification task, [25] proposed to optimize the fairness criterion directly instead of applying invariance to sensitive variables. By analogy, we adapted IFL for DG so as to address the domain-class dependency problem.

3 Our approach

3.1 Domain Adversarial Networks

In this section, we provide a brief overview of DAN [9], on which our analysis and proposed method are based. DAN trains a domain discriminator that attempts to predict domains from latent representation encoded by an encoder, while simultaneously training the encoder to remove domain information by deceiving the discriminator.

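Formally, we denote $f_E(x)$, $q_M(y|h)$, and $q_D(d|h)$ ($E$, $M$, and $D$ are their parameters) as the deterministic encoder, the probabilistic model of the label classifier, and that of the domain discriminator, respectively. Then, the objective function of DAN is described as follows:

$$\min_{E, M} \max_{D} J(E, M, D) = \mathbb{E}_{p(x, d, y)}\left[-\gamma L_d + L_y\right], \tag{1}$$

where $L_d := -\log q_D(d \mid h = f_E(x))$, $L_y := -\log q_M(y \mid h = f_E(x))$, and $\gamma$ is a weighting parameter.

Here, the second term in Eq. 1 simply maximizes the log-likelihood of $q_M$ and $f_E$, as in standard classification problems. On the other hand, the first term corresponds to a minimax game between the encoder and the discriminator, where the discriminator $q_D(d|h)$ tries to predict $d$ from $h$ and the encoder $f_E(x)$ tries to fool $q_D(d|h)$.

As [33] originally showed, the minimax game ensures that the learned representation $h$ has little or no domain information, i.e., the representation becomes domain-invariant. This invariance ensures that the prediction from $h$ to $y$ is independent of $d$, and therefore hopefully facilitates the construction of a classifier capable of correctly handling samples drawn from unknown domains (Figure 1-(b)). Below is a brief explanation.

Because $h$ is a deterministic mapping of $x$, the joint probability distribution $\tilde{p}(h, d, y)$ can be defined as follows:

$$\tilde{p}(h, d, y) = \int_x p(x, d, y)\, p(h \mid x)\, dx = \int_x p(x, d, y)\, \delta(f_E(x) = h)\, dx, \tag{2}$$

and in the rest of the paper, we denote $\tilde{p}(h, d, y)$ as $\tilde{p}_E(h, d, y)$ because it depends on the encoder's parameter $E$. Using Eq. 2, Eq. 1 can be rewritten as follows:

$$\min_{E, M} \max_{D} \mathbb{E}_{\tilde{p}_E(h, d, y)}\left[\gamma \log q_D(d \mid h) - \log q_M(y \mid h)\right]. \tag{3}$$

Assuming $E$ is fixed, the solutions $M^*$ and $D^*$ to Eq. 3 satisfy $q_{M^*}(y|h) = \tilde{p}_E(y|h)$ and $q_{D^*}(d|h) = \tilde{p}_E(d|h)$. Substituting $q_{M^*}$ and $q_{D^*}$ into Eq. 3 enables us to obtain the following optimization problem depending only on $E$:

$$\min_{E} J(E) = -\gamma H_{\tilde{p}_E}(d \mid h) + H_{\tilde{p}_E}(y \mid h). \tag{4}$$

Solving Eq. 4 allows us to obtain the solutions $M^*$, $D^*$, and $E^*$, which are in Nash equilibrium. Here, $H_{\tilde{p}_E}(d|h)$ means the conditional entropy with the joint probability distribution $\tilde{p}_E(d, h)$. Thus, minimizing the second term in Eq. 4 intuitively means learning (the mapping function $f_E$ to) the latent representation $h$ which contains as much information about $y$ as possible. On the other hand, the first term can be regarded as a regularizer that attempts to learn $h$ that is invariant to $d$.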
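As a minimal sketch of the sign convention in a DAN-style objective, the pure-Python snippet below computes the negative log-likelihood losses of the classifier and the discriminator for one sample and forms the quantity the encoder minimizes (classification loss minus the weighted discriminator loss). The function names and the example probabilities are hypothetical and only illustrate the idea; an actual implementation would use an automatic-differentiation framework.

```python
import math

def nll(probs, target):
    """Negative log-likelihood of the target index under a probability vector."""
    return -math.log(probs[target])

def dan_encoder_objective(class_probs, y, domain_probs, d, gamma):
    """Per-sample quantity the encoder minimizes in a DAN-style setup:
    classification loss minus gamma times the discriminator's loss.
    Minimizing it rewards representations on which the discriminator fails."""
    loss_y = nll(class_probs, y)    # classifier loss on the class label
    loss_d = nll(domain_probs, d)   # discriminator loss on the domain label
    return loss_y - gamma * loss_d

# Hypothetical outputs for one sample with true class 0 and true domain 1.
class_probs = [0.7, 0.2, 0.1]
confident_disc = [0.1, 0.8, 0.1]    # discriminator recognizes the domain
confused_disc = [1 / 3, 1 / 3, 1 / 3]  # discriminator is at chance level

obj_confident = dan_encoder_objective(class_probs, 0, confident_disc, 1, gamma=1.0)
obj_confused = dan_encoder_objective(class_probs, 0, confused_disc, 1, gamma=1.0)
print(obj_confused < obj_confident)  # True: a fooled discriminator lowers the objective
```

The comparison shows why the encoder is pushed towards representations from which the domain cannot be predicted, regardless of whether that information would have helped classification.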

3.2 Trade-off Caused by Domain-Class Dependency

Here we show that the performance of DAN is impeded by the existence of domain-class dependency. Concretely, we show that the dependency causes the trade-off between classification accuracy and domain invariance: when $d$ and $y$ are statistically dependent, no value of $E$ would be able to optimize the first and second terms in Eq. 4 at the same time. Note that the following analysis also suggests that most IFL methods are negatively influenced by the dependency.

To begin with, we consider only the first term in Eq. 4 and address the optimization problem:

$$\min_{E} J_1(E) = -\gamma H_{\tilde{p}_E}(d \mid h). \tag{5}$$

Using the property of entropy, $H_{\tilde{p}_E}(d|h)$ is bounded:

$$H_{\tilde{p}_E}(d \mid h) \le H(d). \tag{6}$$

Thus, Eq. 5 has the solution $E_1^*$ which satisfies the following condition:

$$H_{\tilde{p}_{E_1^*}}(d \mid h) = H(d). \tag{7}$$

Eq. 7 suggests that the regularizer in DAN is intended to remove all information about domains from the latent representation $h$, thereby ensuring the independence of domains and latent representation.

Next, we consider only the second term in Eq. 4, thereby addressing the following optimization problem:

$$\min_{E} J_2(E) = H_{\tilde{p}_E}(y \mid h). \tag{8}$$

Considering $h$ is the mapping of $x$, i.e., $h = f_E(x)$, the solution $E_2^*$ to Eq. 8 satisfies the following equation:

$$H_{\tilde{p}_{E_2^*}}(y \mid h) = H(y \mid x). \tag{9}$$

Here we obtain $E_1^*$ and $E_2^*$, which can achieve perfect invariance and optimal classification accuracy, respectively. Using them, we can obtain the following theorem, which shows the existence of the trade-off between invariance and accuracy: perfect invariance ($E_1^*$) and optimal classification accuracy ($E_2^*$) cannot be achieved at the same time.

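Theorem 1

When $H(y|x) = 0$, i.e., there is no labeling error, and $H(d) > H(d|y)$, i.e., the domain and class are statistically dependent, $E_1^* \neq E_2^*$ holds.

Proof 1

Assume $E_1^* = E_2^* = E^*$. Using the properties of entropy, we can obtain the following for any $E$:

$$H_{\tilde{p}_E}(d \mid h) \le H_{\tilde{p}_E}(d, y \mid h) = H_{\tilde{p}_E}(y \mid h) + H_{\tilde{p}_E}(d \mid h, y) \le H_{\tilde{p}_E}(y \mid h) + H(d \mid y). \tag{10}$$

Substituting $H_{\tilde{p}_{E^*}}(d|h) = H(d)$ and $H_{\tilde{p}_{E^*}}(y|h) = H(y|x)$ into Eq. 10, we can obtain the following condition:

$$H(d) \le H(y \mid x) + H(d \mid y). \tag{11}$$

Because the domain and class are dependent on each other, the following condition holds:

$$H(y \mid x) \ge H(d) - H(d \mid y) > 0, \tag{12}$$

but Eq. 12 contradicts $H(y|x) = 0$. Thus, $E_1^* \neq E_2^*$.

Theorem 1 shows that the domain-class dependency causes the trade-off problem. Although it assumes $H(y|x) = 0$ for simplicity, we cannot know the true value of $H(y|x)$, and there are many cases in which few or no labeling errors occur and thus $H(y|x)$ is close to $0$.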

In addition, we can omit the assumption $H(y|x) = 0$ and obtain a more general result:

Theorem 2

When $I(d; y) := H(d) - H(d|y) > H(y|x)$, $E_1^* \neq E_2^*$ holds.

Proof 2

Similar to Proof 1, we assume that $E_1^* = E_2^*$ and thus Eq. 11 is obtained. Eq. 11, which is equivalent to $I(d; y) \le H(y|x)$, does not hold when $I(d; y) > H(y|x)$.

Theorem 2 shows that when the mutual information of the domain and class, $I(d; y)$, is greater than the labeling error $H(y|x)$, the trade-off between invariance and accuracy occurs. Then, although we cannot know the true value of $H(y|x)$, the performance of DAN and other IFL methods is likely to decrease when $I(d; y)$ has a large value.
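To make the quantity in Theorem 2 concrete, the following sketch estimates the mutual information between domain and class, $I(d; y) = H(d) - H(d|y)$, from a table of sample counts. The counts reuse the BMNISTR-1 sample sizes of Table 1-left for the five source domains that remain when M0 is the target; reading each listed size as the size of every class in the group is an assumption, and the helper names are hypothetical. Logarithms are taken to base 2, which appears to match the scale of the $I(d; y)$ column in Table 2.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a probability vector; zero entries are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(counts):
    """I(d; y) = H(d) - H(d|y) for a table counts[d][y] of sample counts."""
    total = sum(sum(row) for row in counts)
    n_classes = len(counts[0])
    h_d = entropy([sum(row) / total for row in counts])
    h_d_given_y = 0.0
    for y in range(n_classes):
        n_y = sum(row[y] for row in counts)
        if n_y == 0:
            continue  # unseen class contributes nothing
        p_d_given_y = [row[y] / n_y for row in counts]
        h_d_given_y += (n_y / total) * entropy(p_d_given_y)
    return h_d - h_d_given_y

# Source domains M15..M75 of BMNISTR-1: classes 0-4 shrink, classes 5-9 stay at 100.
biased = [[n] * 5 + [100] * 5 for n in (85, 70, 55, 40, 25)]
# Same domains with identical class sizes: domain and class are independent.
balanced = [[100] * 10 for _ in range(5)]

mi_biased = mutual_information(biased)
mi_balanced = mutual_information(balanced)
print(round(mi_biased, 3))        # 0.026
print(abs(mi_balanced) < 1e-12)   # True
```

Under the no-labeling-error assumption ($H(y|x) = 0$), any strictly positive value such as the one computed for the biased table already satisfies the condition of Theorem 2.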

3.3 Accuracy-Constrained Domain Invariance

If we cannot avoid the trade-off, the next question is to decide how to accommodate it, i.e., to what extent the representation should become domain-invariant for DG tasks. Here we provide the notion of accuracy-constrained domain invariance, which is the maximum domain invariance within a range that does not interfere with the classification accuracy. The reason for the constraint is that the primary purpose of DG is the classification for unseen domains rather than the invariance itself, and the improvement of the invariance could detrimentally affect the performance. For example, in WISDM, if we know the target activity was performed by a young rather than an old man, we might predict the activity to be jogging with a higher probability; thus, we would have to avoid removing such domain information that may be useful in the classification task.

Theorem 3

Define accuracy-constrained domain invariance as the maximum $H_{\tilde{p}_E}(d|h)$ value under the constraints that $H(y|x) = 0$, i.e., there is no labeling error, and that classification accuracy is maximized, i.e., $H_{\tilde{p}_E}(y|h) = H(y|x)$. Then, accuracy-constrained domain invariance equals $H(d|y)$.

Proof 3

Using Eq. 10 (which bounds $H(d|h)$ by $H(y|h) + H(d|y)$) and $H(y|h) = H(y|x)$, the following inequality holds:

$$H_{\tilde{p}_E}(d \mid h) \le H(d \mid y) + H(y \mid x). \tag{13}$$

Substituting $H(y|x) = 0$ into Eq. 13, the following inequality holds:

$$H_{\tilde{p}_E}(d \mid h) \le H(d \mid y). \tag{14}$$

Thus, the maximum $H_{\tilde{p}_E}(d|h)$ value under the optimal classification accuracy constraint is $H(d|y)$.

Note that we could improve the invariance more when $H(y|x) > 0$ (which is obvious considering Eq. 13), but we cannot know the true value of $H(y|x)$, as we discussed in Sec. 3.2. Thus, accuracy-constrained domain invariance can be viewed as the worst-case guarantee.
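The two extremes discussed above can be illustrated numerically. The sketch below uses a hypothetical joint distribution over two domains and two classes, and, as a simplifying assumption, lets the representation depend on the input only through the class label. It compares an encoder that keeps exactly the class information with one that discards everything; the helper names are illustrative only.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(a | b) in bits for a dict {(a, b): probability}."""
    p_b = defaultdict(float)
    for (_, b), p in joint.items():
        p_b[b] += p
    return sum(p * math.log2(p_b[b] / p) for (_, b), p in joint.items() if p > 0)

def push_through(p_dy, encoder):
    """Joint tables p(d, h) and p(y, h) when h = encoder(y)."""
    p_dh, p_yh = defaultdict(float), defaultdict(float)
    for (d, y), p in p_dy.items():
        h = encoder(y)
        p_dh[(d, h)] += p
        p_yh[(y, h)] += p
    return dict(p_dh), dict(p_yh)

# Hypothetical joint p(d, y): domain 0 mostly holds class 0, domain 1 mostly class 1.
p_dy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h_d = 1.0                               # H(d): both domains have probability 0.5
h_d_given_y = conditional_entropy(p_dy)  # H(d | y)

# Encoder keeping exactly the class information: h = y.
p_dh, p_yh = push_through(p_dy, lambda y: y)
keep_y = (conditional_entropy(p_dh), conditional_entropy(p_yh))

# Encoder discarding everything: h is constant (perfect invariance).
p_dh, p_yh = push_through(p_dy, lambda y: 0)
constant = (conditional_entropy(p_dh), conditional_entropy(p_yh))

print(keep_y)    # H(d|h) equals H(d|y) and H(y|h) is 0
print(constant)  # H(d|h) equals H(d) but H(y|h) rises to H(y)
```

In this toy case the class-preserving encoder reaches $H(d|h) = H(d|y)$ with no loss of class information, whereas pushing $H(d|h)$ all the way to $H(d)$ costs the full class entropy.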

3.4 Proposed Method

Based on the above analysis, the remaining challenge is to determine how to achieve accuracy-constrained domain invariance, i.e., how to impose regularization such that $H(d|h) = H(d|y)$ holds. Although DAN might be able to achieve this condition by carefully tuning the strength of the regularizer ($\gamma$ in Eq. 1), such tuning is time-consuming and impractical, as suggested by our experiments. Alternatively, we propose a novel method named AFLAC by modifying the regularization term of DAN: whereas the encoder of DAN attempts to fool the discriminator, that of AFLAC attempts to directly minimize the KLD between $p(d|y)$ and $q_D(d|h)$. Formally, AFLAC solves the following joint optimization problem by alternating gradient descent.

$$\min_{D} W(E, D) = \mathbb{E}_{p(x, d, y)}\left[L_d\right], \tag{15}$$
$$\min_{E, M} V(E, M) = \mathbb{E}_{p(x, d, y)}\left[\gamma L_{D_{KL}} + L_y\right], \tag{16}$$

where $L_{D_{KL}} := D_{KL}\left[p(d \mid y) \,\|\, q_D(d \mid h = f_E(x))\right]$, and $L_d$ and $L_y$ are the negative log-likelihoods of the discriminator and the classifier, as in Eq. 1.

The minimization of $L_y$ and $L_d$, respectively, means maximization of the log-likelihood of $q_M$ and $q_D$, as in DAN. However, the minimization of $L_{D_{KL}}$ differs from the regularizer of DAN in that it is intended to satisfy $q_D(d|h) = p(d|y)$. If $q_D(d|h)$ approximates $\tilde{p}_E(d|h)$ well through the minimization of $L_d$ in Eq. 15, the minimization of $L_{D_{KL}}$ leads to $\tilde{p}_E(d|h) = p(d|y)$. Figure 2-(b) outlines the training of AFLAC.
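The encoder-side regularizer described above can be sketched for a single sample as follows. This is an illustrative pure-Python version, not the authors' implementation: the function names, the table `p_d_given_y`, and the example probabilities are hypothetical, and the small `eps` guard against `log(0)` is an added assumption.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors; eps guards against log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def aflac_encoder_loss(class_probs, y, disc_probs, p_d_given_y, gamma):
    """Per-sample encoder/classifier loss: class NLL plus gamma * KL(p(d|y) || q_D(d|h))."""
    loss_y = -math.log(class_probs[y])
    loss_kl = kl_divergence(p_d_given_y[y], disc_probs)
    return loss_y + gamma * loss_kl

# Hypothetical setting with 2 classes and 3 domains; row y of the table is p(d | y).
p_d_given_y = [[0.6, 0.3, 0.1],
               [1 / 3, 1 / 3, 1 / 3]]
class_probs = [0.9, 0.1]

# Discriminator output that matches p(d | y = 0) versus one that is uniform over domains.
matched = aflac_encoder_loss(class_probs, 0, [0.6, 0.3, 0.1], p_d_given_y, gamma=1.0)
uniform = aflac_encoder_loss(class_probs, 0, [1 / 3, 1 / 3, 1 / 3], p_d_given_y, gamma=1.0)
print(matched < uniform)  # True
```

For class 0, which is unevenly spread over domains in this toy table, a uniform discriminator output is penalized, whereas an output matching $p(d|y)$ incurs no regularization cost; this is the difference from a target of perfect invariance.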

Figure 2: Comparative illustration of DAN and AFLAC. (a) The classifier and discriminator try to minimize $L_y$ and $L_d$, respectively. The encoder tries to minimize $L_y$ and maximize $L_d$ (fool the discriminator). (b) The discriminator tries to approximate the true $\tilde{p}_E(d|h)$ by minimizing $L_d$. The encoder tries to minimize the divergence between $p(d|y)$ and $q_D(d|h)$ by minimizing $L_{D_{KL}}$.

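Here we formally show that AFLAC is intended to achieve $H(d|h) = H(d|y)$ (accuracy-constrained domain invariance) by a Nash equilibrium analysis similar to [13, 33]. As in Section 3.1, $M^*$ and $D^*$, which are the solutions to Eqs. 15 and 16 with fixed $E$, satisfy $q_{M^*}(y|h) = \tilde{p}_E(y|h)$ and $q_{D^*}(d|h) = \tilde{p}_E(d|h)$, respectively. Thus, $V$ in Eq. 16 can be written as follows:

$$V(E) = \mathbb{E}_{\tilde{p}_E(h, y)}\left[\gamma D_{KL}\left[p(d \mid y) \,\|\, \tilde{p}_E(d \mid h)\right]\right] + H_{\tilde{p}_E}(y \mid h). \tag{17}$$

$E^*$, which is the solution to Eq. 17 and in Nash equilibrium, satisfies not only $H_{\tilde{p}_{E^*}}(y|h) = H(y|x)$ (optimal classification accuracy) but also $\mathbb{E}_{\tilde{p}_{E^*}(h, y)}\left[D_{KL}\left[p(d|y) \,\|\, \tilde{p}_{E^*}(d|h)\right]\right] = 0$, which is a sufficient condition for $H_{\tilde{p}_{E^*}}(d|h) = H(d|y)$ by the definition of the conditional entropy.

In training, $p(x, d, y)$ in the objectives (Eqs. 15, 16) is approximated by the empirical distribution composed of the training data obtained from the $m$ source domains, i.e., $\{x_i^s, y_i^s, d = s\}_{i=1}^{n^s}$ for $s \in \{1, \dots, m\}$. Also, $p(d|y)$ used in Eq. 16 can be replaced by its maximum likelihood or maximum a posteriori estimator. Note that we could use some distances other than $D_{KL}\left[p(d|y) \,\|\, q_D(d|h)\right]$ in Eq. 16, e.g., $D_{KL}\left[q_D(d|h) \,\|\, p(d|y)\right]$, but in doing so we could not observe a performance gain; hence we discontinued testing them.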
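A maximum likelihood estimate of the class-conditional domain distribution can be obtained by counting, as in the sketch below. The function name and the toy labels are hypothetical, and the uniform fallback for a class that never occurs is an arbitrary choice made only so that the sketch is total.

```python
from collections import Counter

def estimate_p_d_given_y(domains, labels, n_domains, n_classes):
    """Maximum likelihood estimate of p(d | y): the relative frequency
    of each domain among the training samples of each class."""
    pair_counts = Counter(zip(labels, domains))
    class_counts = Counter(labels)
    table = []
    for y in range(n_classes):
        if class_counts[y] == 0:
            # Class never observed: fall back to a uniform row (arbitrary choice).
            table.append([1.0 / n_domains] * n_domains)
        else:
            table.append([pair_counts[(y, d)] / class_counts[y] for d in range(n_domains)])
    return table

# Hypothetical labels: class 0 appears mostly in domain 0, class 1 is balanced.
domains = [0, 0, 0, 1, 0, 1]
labels = [0, 0, 0, 0, 1, 1]
table = estimate_p_d_given_y(domains, labels, n_domains=2, n_classes=2)
print(table)  # [[0.75, 0.25], [0.5, 0.5]]
```

Row `y` of the resulting table is the target distribution that the discriminator output is compared against for a sample of class `y`.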

4 Experiments

4.1 Datasets

Here we provide a brief overview of the one synthetic and three real-world datasets used for the performance evaluation. The concrete sample sizes for each $d$ and $y$, and the network architectures for each dataset, are shown in the supplementary material.

BMNISTR We created the Biased and Rotated MNIST dataset (BMNISTR) by modifying the sample size of the popular benchmark dataset MNISTR [11], such that the class distribution differed among the domains. In MNISTR, each class is represented by 10 digits. Each domain was created by rotating images in 15-degree increments: 0, 15, 30, 45, 60, and 75 degrees (referred to as M0, …, M75). Each image was cropped to 16 × 16 in accordance with [11]. We created three variants of MNISTR that have different types of domain-class dependency, referred to as BMNISTR-1 through BMNISTR-3. As shown in Table 1-left, BMNISTR-1 and BMNISTR-2 have similar trends but different degrees of dependency, whereas BMNISTR-1 and BMNISTR-3 differ in terms of their trends. We used two convolution layers and two fully connected (FC) layers (with nonlinear activations) as the encoder, three FC layers as the classifier, and two FC layers as the discriminator. Code for the experiment is available at https://github.com/akuzeee/AFLAC
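The biased subsampling can be sketched as follows for BMNISTR-1, using the per-class sample sizes listed in Table 1-left. This is not taken from the authors' repository: the function names, the index pool, and the random seed are hypothetical and only show one way to draw a class-dependent number of samples per domain.

```python
import random

# Per-class sample sizes for BMNISTR-1, copied from Table 1-left.
DOMAINS = ["M0", "M15", "M30", "M45", "M60", "M75"]
SIZES_CLASSES_0_TO_4 = [100, 85, 70, 55, 40, 25]
SIZE_CLASSES_5_TO_9 = 100

def target_size(domain, digit):
    """Number of samples to keep for a (domain, digit) pair."""
    if digit <= 4:
        return SIZES_CLASSES_0_TO_4[DOMAINS.index(domain)]
    return SIZE_CLASSES_5_TO_9

def subsample(indices_by_digit, domain, seed=0):
    """Pick a random subset of sample indices per digit for one domain.

    indices_by_digit maps digit -> list of available sample indices."""
    rng = random.Random(seed)
    return {digit: sorted(rng.sample(idx, target_size(domain, digit)))
            for digit, idx in indices_by_digit.items()}

# Hypothetical pool: 100 candidate indices for each of the ten digits.
pool = {digit: list(range(digit * 100, (digit + 1) * 100)) for digit in range(10)}
kept = subsample(pool, "M45")
print(len(kept[0]), len(kept[9]))  # 55 100
```

The selected images would then be rotated by the angle encoded in the domain name (45 degrees for M45) to form that domain.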

PACS The PACS dataset [18] contains 9991 images across 7 categories (dog, elephant, giraffe, guitar, house, horse, and person) and 4 domains comprising different stylistic depictions (Photo, Art painting, Cartoon, and Sketch). It has domain-class dependency, probably owing to the data characteristics. For example, $p(y = \text{person} \mid d = \text{Photo})$ is much higher than $p(y = \text{person} \mid d = \text{Sketch})$, indicating that photos of a person are easier to obtain than those of animals, but sketches of persons are more difficult to obtain than those of animals in the wild. For training, we used the ImageNet pre-trained AlexNet CNN [15] as the base network, following previous studies [18, 19]. The two-FC-layer discriminator was connected to the last FC layer, following [9].

WISDM The WISDM Activity Prediction dataset contains accelerometer sensor data of six human activities (walking, jogging, upstairs, downstairs, sitting, and standing) performed by 36 users (domains). WISDM has the dependency for the reason noted in Sec. 1. In data preprocessing, we used the sliding-window procedure with 60 frames (= 3 seconds), referring to [1], and the total number of samples was 18210. We parameterized the encoder using three 1-D convolution layers followed by one FC layer and the classifier by logistic regression, following previous studies [34, 14]. The two-FC-layer discriminator was connected to the output of the encoder.
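A sliding-window segmentation of the kind described above might look like the sketch below. The window length of 60 frames comes from the text; the stride, the handling of trailing frames, and the function name are assumptions, since the text does not state them.

```python
def sliding_windows(signal, window=60, stride=60):
    """Split a list of frames into fixed-length windows.

    window=60 follows the text (3 seconds); the stride is an assumption,
    and trailing frames that do not fill a whole window are dropped."""
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, stride)]

# Hypothetical recording: 200 frames of (x, y, z) accelerometer readings.
recording = [(0.0, 0.0, 9.8)] * 200
windows = sliding_windows(recording)
print(len(windows), len(windows[0]))  # 3 60
```

A smaller stride yields overlapping windows and therefore more samples from the same recording.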

IEMOCAP The IEMOCAP dataset [4] is a popular benchmark dataset for speech emotion recognition (SER), which aims at recognizing the correct emotional state of the speaker from speech signals. It contains a total of 10039 utterances pronounced by ten actors (domains, referred to as Ses01F, Ses01M through Ses05F, Ses05M) with emotional categories, and we only consider the four emotional categories (happy, angry, sad, and neutral), referring to [5, 8]. Also, we followed [5] for data preprocessing: we split the speech signal into equal-length segments of 3 s, and extracted the 40-dimensional log Mel-spectrogram, its deltas, and delta-deltas. We parameterized the encoder using three 2-D convolution layers followed by one FC layer and the classifier by logistic regression. The two-FC-layer discriminator was connected to the output of the encoder.

4.2 Baselines

To demonstrate the efficacy of the proposed method AFLAC, we compared it with a vanilla CNN and adversarial-learning-based methods. Specifically, (1) CNN is a vanilla convolutional network trained on the aggregation of data from all source domains. Although CNN has no special treatment for DG, [18] reported that it outperforms many traditional DG methods. (2) DAN [33] is expected to generalize across domains by utilizing domain-invariant representation, but it can be affected by the trade-off between domain invariance and accuracy, as explained in Section 3.2. (3) CIDDG is our re-implementation of the method proposed in [22], which is designed to achieve semantic alignment in adversarial training. Additionally, we used (4) AFLAC-Abl, which is a version of AFLAC modified for ablation studies. AFLAC-Abl replaces $p(d|y)$ in Eq. 16 with $p(d)$; thus it attempts to learn a representation that is perfectly invariant to domains, i.e., to make $H(d|h) = H(d)$ hold, as DAN does. Comparing AFLAC and AFLAC-Abl, we measured the genuine effect of taking domain-class dependency into account. When training AFLAC and AFLAC-Abl, we cannot obtain the true $p(d|y)$ and $p(d)$; hence we used their maximum likelihood estimators for calculating the KLD terms.

4.3 Experimental Settings

For all the datasets and methods, we used RMSprop for optimization. Further, we set the learning rate, batch size, and number of iterations to 5e-4, 128, and 10k for BMNISTR; 5e-5, 64, and 10k for PACS; 1e-4, 64, and 10k for IEMOCAP; and 5e-4 (with exponential decay at steps 18k and 24k and decay rate 0.1), 128, and 30k for WISDM, respectively. Also, we used the annealing of the weighting parameter $\gamma$ proposed in [9], and unless otherwise mentioned chose $\gamma$ from a set of candidate values for DAN, CIDDG, AFLAC-Abl, and AFLAC. Specifically, on BMNISTR and PACS, we employed a leave-one-domain-out setting [11], i.e., we chose one domain as the target and used the remaining domains as source data. Then we split the source data into 80% training data and 20% validation data, assuming that target data are not available at all in the training phase. On IEMOCAP, we chose the best $\gamma$ from a set of candidate values using a disjoint validation domain, referring to [8, 5]. On WISDM, we randomly selected 20 / 16 users as source / target domains, and split the source data into training and validation data because leave-one-domain-out evaluation is computationally expensive. Then, we conducted experiments multiple times with different random weight initializations; we trained the models with 10, 20, and 20 seeds on BMNISTR, WISDM, and IEMOCAP, chose the hyperparameter that achieved the highest validation accuracy measured in each epoch, and reported the mean scores (accuracies and F-measures) for that hyperparameter. On PACS, because training takes a long time, we chose the best $\gamma$ from a set of candidate values after three experiments, and reported the mean scores in experiments with 15 seeds.
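The evaluation protocol and the annealing of the regularization weight can be sketched as below. The split function follows the leave-one-domain-out description with an 80/20 split of the pooled source data; the annealing function uses the schedule $2 / (1 + \exp(-10p)) - 1$ commonly attributed to the domain-adversarial training work of Ganin et al., and whether the authors used exactly this constant is an assumption. All names and the toy data are hypothetical.

```python
import math
import random

def leave_one_domain_out(samples_by_domain, target, val_fraction=0.2, seed=0):
    """Hold out `target` entirely; split the pooled source samples into train/validation."""
    source = [s for dom, items in samples_by_domain.items() if dom != target for s in items]
    rng = random.Random(seed)
    rng.shuffle(source)
    n_val = int(len(source) * val_fraction)
    return source[n_val:], source[:n_val], list(samples_by_domain[target])

def annealed_gamma(gamma, step, total_steps):
    """Ramp the regularization weight from 0 towards gamma over training.

    The constant 10 in the schedule is an assumption about the setting used."""
    p = step / total_steps
    return gamma * (2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0)

# Hypothetical data: 50 sample ids in each of the six BMNISTR domains.
data = {d: [f"{d}-{i}" for i in range(50)]
        for d in ["M0", "M15", "M30", "M45", "M60", "M75"]}
train, val, test = leave_one_domain_out(data, target="M0")
print(len(train), len(val), len(test))  # 200 50 50
print(annealed_gamma(1.0, 0, 10000))    # 0.0
```

Because the validation set is drawn from the source domains only, the hyperparameter choice never sees target data, which is the constraint the text emphasizes.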

Left (sample sizes):
Dataset    Class  M0   M15  M30  M45  M60  M75
BMNISTR-1  0-4    100  85   70   55   40   25
BMNISTR-1  5-9    100  100  100  100  100  100
BMNISTR-2  0-4    100  90   80   70   60   50
BMNISTR-2  5-9    100  100  100  100  100  100
BMNISTR-3  0-4    100  25   100  25   100  25
BMNISTR-3  5-9    100  100  100  100  100  100

Right (mean F-measures, target domain M0):
Dataset    Class  CNN    DAN    CIDDG  AFLAC-Abl  AFLAC  RI
BMNISTR-1  0-4    83.86  84.54  87.50  87.46      90.62  3.6%
BMNISTR-1  5-9    83.90  85.24  87.46  86.46      88.10  1.9%
BMNISTR-2  0-4    82.54  85.30  87.64  88.60      89.64  1.2%
BMNISTR-2  5-9    82.18  85.80  86.74  87.60      89.04  1.6%
BMNISTR-3  0-4    71.26  79.22  76.76  76.56      80.02  4.5%
BMNISTR-3  5-9    78.62  83.14  82.64  82.94      82.80  -0.2%

Table 1: Left: Sample sizes for each domain-class pair in BMNISTR. Those for the classes 0-4 are variable across domains, whereas the classes 5-9 have identical sample sizes across domains. Right: Mean F-measures for the classes 0-4 and classes 5-9 with the target domain M0. RI denotes the relative improvement of AFLAC over AFLAC-Abl.
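The RI column can be checked with a few lines of Python. The definition used here, (AFLAC − AFLAC-Abl) / AFLAC-Abl expressed in percent, is an assumption, although it reproduces the listed values; the F-measures are copied from Table 1-right.

```python
def relative_improvement(aflac, aflac_abl):
    """Relative improvement of AFLAC over AFLAC-Abl, in percent."""
    return 100.0 * (aflac - aflac_abl) / aflac_abl

# (AFLAC-Abl, AFLAC) mean F-measures copied from Table 1-right.
rows = {
    ("BMNISTR-1", "0-4"): (87.46, 90.62),
    ("BMNISTR-1", "5-9"): (86.46, 88.10),
    ("BMNISTR-2", "0-4"): (88.60, 89.64),
    ("BMNISTR-2", "5-9"): (87.60, 89.04),
    ("BMNISTR-3", "0-4"): (76.56, 80.02),
    ("BMNISTR-3", "5-9"): (82.94, 82.80),
}
ri = {key: round(relative_improvement(full, abl), 1) for key, (abl, full) in rows.items()}
for key, value in ri.items():
    print(key, value)
# ('BMNISTR-1', '0-4') 3.6
# ('BMNISTR-1', '5-9') 1.9
# ('BMNISTR-2', '0-4') 1.2
# ('BMNISTR-2', '5-9') 1.6
# ('BMNISTR-3', '0-4') 4.5
# ('BMNISTR-3', '5-9') -0.2
```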

4.4 Results

We first investigated the extent to which domain-class dependency affects the performance of the IFL methods. In Table 1-right, we compared the mean F-measures for the classes 0 through 4 and classes 5 through 9 in BMNISTR with the target domain M0. Recall that the sample sizes for the classes 0 through 4 are variable across domains, whereas the classes 5 through 9 have identical sample sizes across domains (Table 1-left). The F-measures show that AFLAC outperformed the baselines in most dataset-class pairs, which supports the claims that domain-class dependency reduces the performance of domain-invariance-based methods and that AFLAC can mitigate the problem. Further, the relative improvement of AFLAC over AFLAC-Abl is larger for the classes 0 through 4 than for the classes 5 through 9 in BMNISTR-1 and BMNISTR-3, suggesting that AFLAC tends to increase performance more for classes in which the domain-class dependency occurs. Moreover, the improvement is larger in BMNISTR-1 than in BMNISTR-2, suggesting that the stronger the domain-class dependency is, the lower the performance of domain-invariance-based methods becomes. This result is consistent with Theorem 2, which shows that the trade-off is likely to occur when $I(d; y)$ is large. Finally, although the dependencies of BMNISTR-1 and BMNISTR-3 have different trends, AFLAC improved the F-measures in both datasets.

Dataset Target I(d; y) CNN DAN CIDDG AFLAC-Abl AFLAC
BMNISTR-1 M0 0.026 83.9±0.4 85.0±0.4 87.4±0.3 87.0±0.4 89.3±0.4
M15 0.034 98.5±0.2 98.5±0.1 98.3±0.2 98.3±0.2 98.8±0.1
M30 0.037 97.5±0.1 97.4±0.1 97.4±0.2 97.6±0.1 98.3±0.2
M45 0.036 89.9±0.9 90.2±0.6 89.8±0.5 92.8±0.5 93.3±0.6
M60 0.030 96.7±0.3 97.0±0.2 97.2±0.1 96.6±0.2 97.4±0.2
M75 0.017 87.1±0.5 87.3±0.4 88.2±0.3 87.7±0.5 88.1±0.4
Avg 92.3 92.6 93.1 93.3 94.2
BMNISTR-2 Avg 92.2 92.7 93.1 94.0 94.5
BMNISTR-3 Avg 90.6 91.7 91.4 91.6 92.9
PACS photo 0.102 82.2±0.4 81.8±0.4    - 82.5±0.4 83.5±0.3
art_painting 0.117 61.0±0.5 60.9±0.5    - 62.6±0.4 63.3±0.3
cartoon 0.131 64.9±0.5 64.9±0.6    - 64.2±0.3 64.9±0.3
sketch 0.023 61.4±0.5 61.4±0.5    - 59.6±0.7 60.1±0.7
Avg 67.4 67.2    - 67.2 68.0
WISDM 16 users 0.181 84.0±0.4 83.8±0.3 84.4±0.4 83.7±0.3 84.4±0.3
IEMOCAP Ses01F 0.005 56.0±0.7 60.1±0.7    - 62.9±0.5 60.4±0.9
Ses01M 61.0±0.3 63.5±0.5    - 68.0±0.5 66.1±0.3
Ses02F 0.045 61.2±0.5 60.4±0.5    - 65.8±0.5 64.2±0.4
Ses02M 76.6±0.4 47.2±0.7    - 64.7±1.7 74.3±1.3
Ses03F 0.037 69.2±0.9 71.9±0.4    - 70.0±0.6 70.1±0.4
Ses03M 56.9±0.4 57.3±0.5    - 56.2±0.4 56.8±0.4
Ses04F 0.120 75.5±0.5 75.5±0.6    - 75.4±0.6 75.7±0.6
Ses04M 58.5±0.5 57.4±0.5    - 58.7±0.5 59.2±0.5
Ses05F 0.063 61.8±0.4 62.4±0.5    - 61.9±0.3 63.4±0.7
Ses05M 47.6±0.3 46.9±0.4    - 49.6±0.4 49.9±0.4
Avg 62.4 60.3    - 63.3 64.0
Table 2: Accuracies for each dataset and target domain. The $I(d; y)$ column is estimated from source domain data and indicates the strength of domain-class dependency.

Next we compared the mean accuracies (with standard errors) on both synthetic (BMNISTR) and real-world (PACS, WISDM, and IEMOCAP) datasets (Table 2). Note that the performance of our baseline CNN on PACS, WISDM, and IEMOCAP is similar to but partly different from that reported in previous studies ([22], [1], and [8], respectively), probably because the DG performance strongly depends on validation methods and other implementation details, as reported in many recent studies [1, 8, 2, 23]. Also, we trained CIDDG only on BMNISTR and WISDM owing to computational resource constraints. This table enables us to make the following observations. (1) Domain-class dependency in real-world datasets negatively affects the DG performance of IFL methods. The results obtained on PACS (Avg) and WISDM showed that the vanilla CNN outperformed the IFL methods (DAN and AFLAC-Abl). Additionally, the results on IEMOCAP show that AFLAC tended to outperform AFLAC-Abl when $I(d; y)$ had large values (in Ses04 and Ses05), which is again consistent with Theorem 2. These results support the importance of considering domain-class dependency in real-world datasets. (2) AFLAC performed better than the baselines on all the datasets on average, except for CIDDG on WISDM. Note that AFLAC is more parameter efficient than CIDDG, as we noted in Sec. 2.2. These results support the efficacy of the proposed model in overcoming the trade-off problem.
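Entries of the form "mean±standard error" can be produced from per-seed scores as in the sketch below. The use of the sample standard deviation (with n − 1 in the denominator) is an assumption, and the scores are hypothetical rather than taken from the experiments.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Mean and standard error (sample std / sqrt(n)) of per-seed scores."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical accuracies (%) from four random seeds.
scores = [88.0, 90.0, 89.0, 91.0]
mean, sem = mean_and_standard_error(scores)
print(f"{mean:.1f}±{sem:.1f}")  # 89.5±0.6
```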

Finally, we investigated the relationship between the strength of regularization and performance. In DG, it is difficult to choose appropriate hyperparameters because we cannot use target domain data at the validation step (since they are not available during training); therefore, insensitivity to hyperparameters is important in DG. Figure 3 shows the hyperparameter sensitivity of the classification accuracies for DAN, CIDDG, AFLAC-Abl, and AFLAC. These figures suggest that DAN and AFLAC-Abl sometimes outperformed AFLAC at suitably chosen regularization strengths, but there is no guarantee that such values will be chosen by validation, whereas AFLAC is robust toward hyperparameter choice. Specifically, as shown in Figures 3-(b, d), DAN and AFLAC-Abl outperformed AFLAC at particular regularization strengths. One possible explanation of those results is that accuracy for the target domain sometimes improves by giving priority to domain invariance at the cost of the accuracies for source domains, but AFLAC improves domain invariance only within a range that does not interfere with accuracy for source domains. However, as shown in Figure 3, the performance of DAN and AFLAC-Abl is sensitive to hyperparameter choice. For example, although they achieved high scores at one setting in Figure 3-(b), the scores dropped rapidly when the regularization strength was increased or decreased from that setting. Also, the scores of DAN and AFLAC-Abl in Figure 3-(c) dropped significantly at a large regularization strength, and such a large value was indeed chosen by overfitting to the validation domain. On the other hand, Figures 3-(a, b, c, d) show that the accuracy gap between AFLAC-Abl and AFLAC increases under strong regularization. These results suggest that AFLAC, as it was designed, tends not to reduce the classification accuracy even with a strong regularizer, and this robustness might have yielded the best performance shown in Table 2.

Figure 3: Classification accuracy for various regularization strengths. Panels: (a) BMNISTR-1, M0; (b) WISDM; (c) IEMOCAP, 02M; (d) IEMOCAP, 05F. Each panel title shows the dataset name and target domain. The round markers correspond to values chosen by validation. The error bars correspond to standard errors.

5 Conclusion

In this paper, we addressed domain generalization under domain-class dependency, which was overlooked by most prior DG methods relying on IFL. We theoretically showed the importance of considering the dependency and the way to overcome the problem by expanding the analysis of [33]. We then proposed a novel method, AFLAC, which maximizes domain invariance within a range that does not interfere with classification accuracy during adversarial training. Empirical validations showed that AFLAC outperforms the baseline methods, supporting the importance of domain-class dependency in DG tasks and the efficacy of the proposed method in overcoming the issue.

References

  • [1] Andrey, I.: Real-time human activity recognition from accelerometer data using convolutional neural networks. Applied Soft Computing (2017)
  • [2] Balaji, Y., Sankaranarayanan, S., Chellappa, R.: Metareg: Towards domain generalization using meta-regularization. In: Advances in Neural Information Processing Systems 31 (2018)
  • [3] Blanchard, G., Lee, G., Scott, C.: Generalizing from several related classification tasks to a new unlabeled sample. In: Proc. of the 24th International Conference on Neural Information Processing Systems (2011)
  • [4] Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42(4), 335 (Nov 2008)
  • [5] Chen, M., He, X., Yang, J., Zhang, H.: 3-d convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters 25, 1–1 (Jul 2018)
  • [6] Chou, J.C., chieh Yeh, C., yi Lee, H., shan Lee, L.: Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. In: Proc. Interspeech (2018)
  • [7] Erfani, S., Baktashmotlagh, M., Moshtaghi, M., Nguyen, V., Leckie, C., Bailey, J., Kotagiri, R.: Robust domain generalisation by enforcing distribution invariance. In: 25th International Joint Conference on Artificial Intelligence (2016)
  • [8] Etienne, C., Fidanza, G., Petrovskii, A., Devillers, L., Schmauch, B.: Speech emotion recognition with data augmentation and layer-wise learning rate adjustment. CoRR abs/1802.05630 (2018), http://arxiv.org/abs/1802.05630
  • [9] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. (2016)
  • [10] Ghifary, M., Balduzzi, D., Kleijn, W.B., Zhang, M.: Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
  • [11] Ghifary, M., Bastiaan Kleijn, W., Zhang, M., Balduzzi, D.: Domain generalization for object recognition with multi-task autoencoders. In: Proc. of the IEEE International Conference on Computer Vision (ICCV) (2015)
  • [12] Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C., Schölkopf, B.: Domain adaptation with conditional transferable components. In: Proc. of the 33rd International Conference on International Conference on Machine Learning (2016)
  • [13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Proc. of the 27th International Conference on Neural Information Processing Systems (2014)
  • [14] Iwasawa, Y., Nakayama, K., Yairi, I., Matsuo, Y.: Privacy issues regarding the application of dnns to activity-recognition using wearables and its countermeasures by use of adversarial training. In: Proc. of the 26th International Joint Conference on Artificial Intelligence. pp. 1930–1936 (2017)
  • [15] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proc. of the 25th International Conference on Neural Information Processing Systems. pp. 1097–1105 (2012)
  • [16] Kwapisz, J.R., Weiss, G.M., Moore, S.A.: Activity recognition using cell phone accelerometers. SIGKDD Explor. Newsl. (2011)
  • [17] Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., Ranzato, M.: Fader networks: manipulating images by sliding attributes. In: Proc. of the 30th Neural Information Processing Systems (2017)
  • [18] Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: Proc. of the IEEE International Conference on Computer Vision (ICCV) (2017)
  • [19] Li, D., Yang, Y., Song, Y., Hospedales, T.M.: Learning to generalize: Meta-learning for domain generalization. In: Proc. of the 32nd AAAI Conference on Artificial Intelligence (2018)
  • [20] Li, H., Jialin Pan, S., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
  • [21] Li, Y., Gong, M., Tian, X., Liu, T., Tao, D.: Domain generalization via conditional invariant representations. In: Proc. of the 32nd AAAI Conference on Artificial Intelligence (2018)
  • [22] Li, Y., Tian, X., Gong, M., Liu, Y., Liu, T., Zhang, K., Tao, D.: Deep domain generalization via conditional invariant adversarial networks. In: The European Conference on Computer Vision (ECCV) (September 2018)
  • [23] Li, Y., Yang, Y., Zhou, W., Hospedales, T.M.: Feature-critic networks for heterogeneous domain generalization. CoRR abs/1901.11448 (2019), http://arxiv.org/abs/1901.11448
  • [24] Louizos, C., Swersky, K., Li, Y., Welling, M., Zemel, R.S.: The variational fair autoencoder. In: Proc. International Conference on Learning Representations (2016)
  • [25] Madras, D., Creager, E., Pitassi, T., Zemel, R.S.: Learning adversarially fair and transferable representations. In: Proc. of the 35th International Conference on Machine Learning (2018)
  • [26] Motiian, S., Piccirilli, M., Adjeroh, D.A., Doretto, G.: Unified deep supervised domain adaptation and generalization. In: Proc. of the IEEE International Conference on Computer Vision (ICCV) (2017)
  • [27] Muandet, K., Balduzzi, D., Schölkopf, B.: Domain generalization via invariant feature representation. In: Proc. of the 30th International Conference on Machine Learning (2013)
  • [28] Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., Sarawagi, S.: Generalizing across domains via cross-gradient training. In: Proc. International Conference on Learning Representations (2018)
  • [29] Sriram, A., Jun, H., Gaur, Y., Satheesh, S.: Robust speech recognition using generative adversarial networks. In: The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018)
  • [30] Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (2011)
  • [31] Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: Proc. of the IEEE International Conference on Computer Vision (ICCV) (2015)
  • [32] Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: Maximizing for domain invariance. CoRR abs/1412.3474 (2014), http://arxiv.org/abs/1412.3474
  • [33] Xie, Q., Dai, Z., Du, Y., Hovy, E., Neubig, G.: Controllable invariance through adversarial feature learning. In: Proc. of the 30th International Conference on Neural Information Processing Systems (2017)
  • [34] Yang, J., Nguyen, M.N., San, P.P., Li, X., Krishnaswamy, S.: Deep convolutional neural networks on multichannel time series for human activity recognition. In: Proc. of the 24th International Joint Conference on Artificial Intelligence (2015)
  • [35] Zemel, R., Wu, Y., Swersky, K., Pitassi, T., Dwork, C.: Learning fair representations. In: Proc. of the 30th International Conference on Machine Learning (2013)
  • [36] Zhang, K., Schölkopf, B., Muandet, K., Wang, Z.: Domain adaptation under target and conditional shift. In: Proc. of the 30th International Conference on Machine Learning (2013)

Sample Sizes for PACS, IEMOCAP, and WISDM

Tables 3-5 show sample sizes for each class and domain for PACS, IEMOCAP, and WISDM, respectively.

Style Guitar House Giraffe Person Horse Dog Elephant
Art Painting 184 295 285 449 201 379 255
Cartoon 135 288 346 405 324 389 457
Photo 186 280 182 432 199 189 202
Sketch 608 80 753 160 816 772 740
Table 3: Sample sizes for each (domain, class) pair in the PACS dataset. The columns show category names and the row index shows the style.
Actor hap ang neu sad
Ses01F 45 67 222 191
Ses02F 54 31 456 124
Ses03F 40 45 127 198
Ses04F 16 208 190 194
Ses05F 62 79 371 186
Ses01M 25 69 161 129
Ses02M 67 44 263 218
Ses03M 89 128 267 174
Ses04M 98 39 163 164
Ses05M 178 13 292 189
Table 4: Sample sizes for each (domain, class) pair in the IEMOCAP dataset. The columns show emotion names and the row index shows the actor ID.
User Jogging Walking Upstairs Downstairs Sitting Standing
User 1 49 248 36 75 54 26
User 2 48 161 94 62 0 0
User 3 215 218 80 77 260 89
User 4 213 207 79 72 38 26
User 5 205 217 77 70 19 27
User 6 213 192 34 29 0 0
User 7 196 206 27 23 27 11
User 8 200 207 54 57 34 27
User 9 200 103 90 69 41 32
User 10 199 209 40 40 24 32
User 11 204 206 63 39 50 27
User 12 209 119 0 0 26 17
User 13 207 202 73 44 0 0
User 14 0 208 23 26 49 32
User 15 106 204 56 54 27 25
User 16 201 217 71 63 0 27
User 17 0 236 48 49 0 21
User 18 198 220 60 63 0 0
User 19 221 230 136 47 0 0
User 20 204 104 50 48 11 9
User 21 206 179 44 47 38 27
User 22 205 109 80 32 0 0
User 23 14 101 22 29 20 0
User 24 0 209 70 64 25 51
User 25 214 222 65 47 26 22
User 26 171 285 74 55 44 54
User 27 234 281 77 64 35 43
User 28 159 208 80 67 26 47
User 29 183 216 56 55 26 47
User 30 103 117 90 60 0 0
User 31 184 214 52 49 0 0
User 32 0 215 0 0 0 0
User 33 108 116 0 0 0 0
User 34 196 195 0 0 0 0
User 35 153 183 60 37 42 39
User 36 270 293 71 43 42 35
Table 5: Sample sizes for each (domain, class) pair in the WISDM dataset. The columns show activity names and the row index shows the user ID.
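The per-domain class counts above are the raw material for quantifying domain-class dependency. One common way to do so is the empirical mutual information between the domain d and the class y, computed from a count table. The sketch below applies it to the PACS counts of Table 3. That mutual information is the measure, the function name, and the use of all four domains at once (the paper estimates dependency from source domains only) are assumptions made for illustration.

```python
import math


def mutual_information(counts):
    """Empirical mutual information I(d; y) in nats from a domain-by-class count table."""
    total = sum(sum(row) for row in counts)
    p_d = [sum(row) / total for row in counts]        # marginal over domains (rows)
    p_y = [sum(col) / total for col in zip(*counts)]  # marginal over classes (columns)
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c == 0:
                continue  # the term 0 * log 0 is treated as 0
            p_dy = c / total
            mi += p_dy * math.log(p_dy / (p_d[i] * p_y[j]))
    return mi


# PACS counts from Table 3 (rows: Art Painting, Cartoon, Photo, Sketch).
pacs = [
    [184, 295, 285, 449, 201, 379, 255],
    [135, 288, 346, 405, 324, 389, 457],
    [186, 280, 182, 432, 199, 189, 202],
    [608, 80, 753, 160, 816, 772, 740],
]
print(f"I(d; y) on PACS counts: {mutual_information(pacs):.3f} nats")

# Hypothetical table whose class proportions are identical in every domain.
independent = [[10, 20, 30], [20, 40, 60]]
print(f"I(d; y) on independent counts: {mutual_information(independent):.3f} nats")
```

A value near zero means the class distribution is nearly the same in every domain; larger values indicate stronger dependency, as with the Sketch row of PACS, whose class proportions differ visibly from the other styles.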

Network architectures for BMNISTR, IEMOCAP, and WISDM

Tables 6-8 show details of the DNN architectures. In these tables, "ConvND(i, o, k, s)" denotes an N-dimensional convolution layer. Here, i and o denote the numbers of input and output channels, and k and s denote the convolution window size and stride width, respectively. "Linear(i, o)" denotes a linear layer with i-dimensional input and o-dimensional output features. "MaxPoolND(k)" denotes an N-dimensional max pooling layer with window size k. "DP(r)" means dropout whose ratio was set to r.

Encoder
Conv2D(1, 32, 5, 1) ReLU
Conv2D(32, 48, 5, 1) ReLU MaxPool2D(2)
Linear(768, 100) ReLU
Linear(100, 100) ReLU
Classifier
Linear(100, 100) ReLU
Linear(100, 100) ReLU
Linear(100, 10)
Discriminator
Linear(100, 100) ReLU
Linear(100, 5)
Table 6: DNN architectures used on BMNISTR.
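As a consistency check on Table 6, the input width of the first linear layer can be derived from the convolution and pooling settings. The sketch below is a worked calculation under stated assumptions: square 16×16 single-channel inputs, no padding, and a pooling stride equal to the pooling window. None of these is stated in the table itself, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output length along one axis of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1


def bmnistr_flat_features(input_size=16):
    """Flattened encoder feature count for Table 6, assuming square inputs and no padding."""
    s = conv_out(input_size, kernel=5, stride=1)  # Conv2D(1, 32, 5, 1)
    s = conv_out(s, kernel=5, stride=1)           # Conv2D(32, 48, 5, 1)
    s = conv_out(s, kernel=2, stride=2)           # MaxPool2D(2), stride assumed equal to window
    return 48 * s * s                             # 48 output channels, s x s spatial map


print(bmnistr_flat_features(16))
```

Under these assumptions the spatial size goes 16 → 12 → 8 → 4, giving 48 × 4 × 4 features, which agrees with the 768-dimensional input of Linear(768, 100).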
Encoder
Conv1D(3, 50, 5, 1) ReLU MaxPool1D(2)
Conv1D(50, 40, 5, 1) ReLU MaxPool1D(2)
Conv1D(40, 20, 3, 1) ReLU DP(0.5)
Linear(200, 400) ReLU DP(0.5)
Classifier
Linear(400, 6)
Discriminator
Linear(400, 400) ReLU DP(0.5)
Linear(400, 20)
Table 7: DNN architectures used on WISDM.
Encoder
Conv2D(40, 64, (10, 3), (4, 2)) ReLU DP(0.5)
Conv2D(64, 128, (10, 3), (4, 2)) ReLU DP(0.5)
Conv2D(128, 128, (5, 3), (4, 2)) ReLU DP(0.5)
Linear(3200, 128) ReLU
Classifier
Linear(128, 4)
Discriminator
Linear(128, 100) ReLU
Linear(100, 8)
Table 8: DNN architectures used on IEMOCAP.