Multi-layer Domain Adaptation for Deep Convolutional Networks

by Ozan Ciga, et al.

Despite their success in many computer vision tasks, convolutional networks tend to require large amounts of labeled data to achieve generalization. Furthermore, performance is not guaranteed on a sample from an unseen domain at test time if the network was not exposed to similar samples from that domain at training time. This hinders the adoption of these techniques in clinical settings, where imaging data is scarce and where the intra- and inter-domain variance of the data can be substantial. We propose a domain adaptation technique that is especially suitable for deep networks to alleviate this requirement for labeled data. Our method utilizes gradient reversal layers and Squeeze-and-Excite modules to stabilize the training in deep networks. The proposed method was applied to publicly available histopathology and chest X-ray databases and achieved superior performance to existing state-of-the-art networks with and without domain adaptation. Depending on the application, our method can improve multi-class classification accuracy by 5-20% compared to DANN, introduced in (Ganin, 2014).


1 Introduction

Deep learning models have achieved great success in recent years on computer vision tasks. Fully convolutional networks (FCNs) consistently achieve state-of-the-art performance in tasks such as segmentation, classification and detection. Despite their success, however, FCNs usually require large amounts of labeled data from the domain in which the network will be deployed. As network architectures become deeper and gain trainable parameters, they become more prone to overfitting, so even larger amounts of data are needed to achieve generalization. Furthermore, regardless of the size or the domain diversity of the training set, there is no performance guarantee on an unseen dataset from a domain that the network was not exposed to at training time. These issues are especially problematic in medical image analysis, as labeled data is scarce due to the tedious and expensive annotation process, and a large distributional shift can be observed even if the data comes from the same source.

Several methods, including network weight regularization, semi-supervised approaches [3], meta-learning [8], and domain adaptation [4], have been proposed to improve generalization performance on unseen datasets. In the present work, we focus on domain adaptation. These methods aim to leverage large amounts of cheap unlabeled data from a target domain to improve generalization performance using small amounts of labeled data. In past work, [11] proposed correcting covariate shift between domains by reweighting samples from the source domain to minimize the discrepancy between the source and the target. This approach was later improved by minimizing distances between feature mappings of the source and target domains instead of the samples themselves [4]. Further modifications that improved benchmark performance were proposed later, such as tri-training, which assumes high-confidence predictions are correct [10], or leveraging the cluster assumption, in which the decision boundaries based on the modified feature representations should not cross high-density data regions [12].

In the present article we propose a simple, robust method that requires minimal modifications to an existing deep network to achieve domain adaptation. Our model repurposes Squeeze-and-Excite blocks, introduced by [6] for feature selection, to perform domain classification in the intermediate layers of a large network. We use the “squeeze” operation to get a summary statistic at the end of each convolutional block, and use a domain adaptation technique [4] to extract domain-independent features at each layer. The “excitation network” is repurposed to perform domain classification. We extend this method by matching distributions of source and target features at each layer via minimizing the Wasserstein distance.

2 Methods

Due to its conceptual simplicity, we build our model on top of gradient reversal layer (GRL) based domain adaptation, which was first introduced in [4]. In an FCN, convolutional layers extract salient features layer by layer as the feature maps shrink in spatial size and expand in semantic (depthwise) information. Once enough abstraction of the image is achieved, the features are flattened and typically fed into a few fully connected layers to perform the task objective, e.g., classification. As the network usually optimizes a minimization objective, extracted features may (and are likely to) overfit to domain-specific noise. Domain adaptation via gradient reversal aims to alleviate this by attaching another classifier to the features extracted from the input $x$, which simultaneously optimizes an adversarial objective. In the notation of [4], given $x$, the domain classifier with parameters $\theta_d$ tries to minimize the domain classification loss $L_d$, while the feature extractor of the original FCN (with parameters $\theta_f$) tries to maximize this loss. In effect, this procedure aims to remove the learned features which are domain-specific, while forcing the network to retain the domain-independent features with error gradient signals $-\partial L_d / \partial \theta_f$ and $\partial L_y / \partial \theta_f$, where $L_y$ is the loss of the label classifier, whose parameters are $\theta_y$.
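The adversarial update described above can be illustrated with a toy scalar model. This is a minimal sketch and not the implementation of [4] or of this paper: it assumes a one-parameter feature extractor `f = w * x` and a logistic domain classifier with a single weight `v` (both hypothetical), with hand-derived gradients instead of an autodiff library. It only shows that the domain classifier descends the domain loss while the feature extractor receives the same gradient with its sign flipped.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(w, v, samples):
    """Mean binary cross-entropy of a logistic domain classifier on f = w * x."""
    total = 0.0
    for x, d in samples:  # d is the domain label: 1 = source, 0 = target
        p = sigmoid(v * w * x)
        total -= d * math.log(p) + (1 - d) * math.log(1 - p)
    return total / len(samples)


def domain_loss_grads(w, v, samples):
    """Analytic gradients of the domain loss w.r.t. w (features) and v (classifier)."""
    gw = gv = 0.0
    for x, d in samples:
        err = sigmoid(v * w * x) - d  # derivative of the loss w.r.t. the logit
        gw += err * v * x
        gv += err * w * x
    n = len(samples)
    return gw / n, gv / n


def adversarial_step(w, v, samples, lr=0.1):
    """One simultaneous update with a gradient reversal on the feature side."""
    gw, gv = domain_loss_grads(w, v, samples)
    v_new = v - lr * gv     # domain classifier: ordinary gradient descent
    w_new = w - lr * (-gw)  # feature extractor: reversed (negated) gradient
    return w_new, v_new
```

In a full network the feature extractor would additionally receive the ordinary gradient of the label classification loss; that term is omitted here to keep the sketch short.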

In [4], domain adaptation is achieved by backpropagating the negative binomial cross-entropy loss of the domain classifier network. Features from the last layer prior to the fully connected classification layers are used as inputs to the domain classifier network. We note several problems with this approach: (1) as the network depth increases, the error signal from the domain classifier will tend to vanish, or will be insufficient to remove domain-specific features in the earlier layers; (2) given feature maps $F_i$ and $F_j$ where $i < j$, it becomes more challenging for the network to extract domain-independent features for $F_j$ if the features from $F_i$ are domain dependent; (3) even if domain-specific features in map $F_i$ are somehow discarded in the later layers, the encoding of these features into map $F_i$ results in capacity underuse of the network; (4) even with the adversarial training objective, which forces the preservation of salient features, a high-capacity network is likely to employ arbitrary transformations on the target samples to match source and target distributions (for a formal derivation, see Appendix E of [12]). For simple tasks that do not require deep networks, vanishing gradients or accumulation of domain-dependent features across layers do not affect the performance as much. However, in more complex medical image analysis tasks, larger networks tend to perform better; hence, domain adaptation techniques are more likely to suffer from the aforementioned issues. We aim to alleviate this by regulating the extracted features at each layer simultaneously, either by attaching a domain classifier at the end of the layer (see Figure 1) or by performing unsupervised matching of distributions at each layer.

Given a feature map $F \in \mathbb{R}^{H \times W \times C'}$, we transform $F$ into $\bar{F} \in \mathbb{R}^{C'}$ by average pooling, i.e., $\bar{F}_k = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{i,j,k}$, where $(i, j)$ indexes the element of the response to the $k$-th kernel of the map $F$, and $\bar{F}_k$ is the $k$-th element of the vector $\bar{F}$. We will use the shorthand $\bar{F}_l$ for the transformation of the map $F_l$ (feature maps of layer $l$) into such a vector, which is coined the “squeeze” operation by [6]. Although $\bar{F}_l$ itself is not enough for downstream tasks such as classification or segmentation, it may contain enough information to differentiate between two samples at a given layer. Given this information, we aim to perform domain adaptation at each layer, rather than only on the final feature map representation at the end of the network.
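As a concrete illustration of the “squeeze” operation, the following minimal sketch averages each channel of a feature map over its spatial positions. The nested-list layout (channels first) and the function name `squeeze` are assumptions made for the example; a real network would use a tensor library's global average pooling.

```python
def squeeze(feature_map):
    """Global average pooling: [C][H][W] nested lists -> list of C channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


fmap = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.0, 0.0], [0.0, 8.0]],  # channel 1
]
print(squeeze(fmap))  # [2.5, 2.0]
```

Each layer's feature map is thus reduced to one number per channel, which is the vector the per-layer domain classifier or critic operates on.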

2.1 Gradient reversal layer based domain adaptation

Figure 1: Proposed modification to the DANN architecture.

Analogous to [4], we add a domain classifier at the end of each feature map. By intervening at the intermediate layers, we aim to extract robust features that are invariant to the training domain using the supervision signal. The network is then trained simultaneously for domain adaptation and for the original objective. We denote this as the layer-wise domain-adversarial neural network, or L-DANN, as our model is based on DANN [4].

The mini domain classifier network has the same structure at every layer, but with a varying number of parameters (see Table 1, where $r$ indicates the reduction ratio). As the earlier layers in convolutional networks tend to extract low-level information such as texture patterns and edges, we increase the complexity of the domain classifier network progressively, in proportion to the depth of the feature map. Given multiple domains, the domain classifier network maximizes the multi-class cross-entropy loss via backpropagation to obscure domain information by removing the domain-specific features from the map.

2.2 Wasserstein distance based domain adaptation

Instead of using the domain labels directly, we can also achieve domain adaptation by interpreting the squeezed feature maps as samples drawn from different distributions. Given two domains, with the squeezed features of source and target samples drawn from their respective distributions, our objective is to minimize $d$ between these two distributions at each layer, where $d$ is an arbitrary distribution divergence. For our experiments, we use the Wasserstein-1 distance, also known as the Earth mover’s distance, due to its stability in training [2]. In order to stabilize the training further, we use the gradient penalty method described in [5] to enforce the Lipschitz constraint on the critic, as opposed to the weight clipping method suggested in [2]. We use the term “critic” as opposed to discriminator/classifier, to be consistent with [2, 5]. The procedure is summarized in Algorithm 1; we omit the details for brevity and refer the interested reader to [5]. In the upcoming sections, we refer to this method as L-WASS, or layer-wise Wasserstein.

1: Require: source with samples and labels, target, number of critic iterations per generator iteration $n_{critic}$, batch size $m$, learning rates, gradient penalty coefficient $\lambda$, initial parameters for the critic and for the neural network for the objective
2: repeat
3:     for each layer do
4:         for t = 1 to $n_{critic}$ do
5:             for i = 1 to $m$ do
6:                 Sample a source sample, a target sample, and a random number $\epsilon \sim U[0, 1]$
10:            end for
12:        end for
13:    end for
15: until convergence
Algorithm 1 Unsupervised domain adaptation via Wasserstein distance with gradient penalty for feature matching. The squeezed feature map of each layer is computed from the given input. The objective loss depends on the task (e.g., cross-entropy for classification).
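For reference, the critic objective of [5] on which Algorithm 1 builds is $E[D(\tilde{x})] - E[D(x)] + \lambda E[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2]$ with $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$. The sketch below evaluates this loss on squeezed feature vectors under strong simplifying assumptions that are not taken from the paper: the critic is linear, so its input gradient is available in closed form, and source features are treated as the “real” samples while target features take the role of the “generated” ones. All function names are hypothetical.

```python
import math
import random


def critic(w, s):
    """Linear critic D(s) = w . s on a squeezed feature vector s."""
    return sum(wi * si for wi, si in zip(w, s))


def critic_input_grad(w, s_hat):
    """Gradient of D w.r.t. its input at s_hat; constant (= w) for a linear critic."""
    return list(w)


def critic_loss(w, source_batch, target_batch, lam=10.0, rng=random):
    """Gradient-penalty critic loss averaged over paired source/target vectors.

    lam = 10 is the coefficient used in [5]; the source/target roles are an
    assumption of this sketch.
    """
    total = 0.0
    for s_src, s_tgt in zip(source_batch, target_batch):
        eps = rng.random()
        # Random point on the segment between the source and target features
        s_hat = [eps * a + (1 - eps) * b for a, b in zip(s_src, s_tgt)]
        grad = critic_input_grad(w, s_hat)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        penalty = (grad_norm - 1.0) ** 2
        total += critic(w, s_tgt) - critic(w, s_src) + lam * penalty
    return total / len(source_batch)
```

With a nonlinear critic the input gradient depends on the interpolated point and would be obtained by automatic differentiation; the structure of the loss stays the same.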

3 Experimental results

3.1 Implementation details

We do not use any padding or bias in the convolutional layers described in Table 1, and use the same reduction ratio $r$ for all the layers. We use the ResNet architecture enhanced with Squeeze-and-Excite blocks as our task objective network, with a varying number of layers depending on the task. Contrary to [4], we do not use a constant to scale the reversed gradient, nor do we use annealing to stabilize the training. We use the stochastic gradient descent (SGD) optimizer for the domain classifiers, the critics, and the objective network, with learning rate 0.001, momentum 0.9 and weight decay 0.0001. We tried updating the domain classifier and critic parameters with and without freezing the preceding layers, and observed that simultaneous training achieves superior performance. We perform 10 runs per experiment, and report the mean accuracy ± the standard deviation. All experiments are run for 100 epochs regardless of the network architecture or the data. For testing, we use the model with the highest validation accuracy achieved in the last 30 epochs, so that the selected model has actually converged rather than having reached a high accuracy by chance.
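The model selection rule and the reporting of results can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported.

```python
import statistics


def select_epoch(val_acc, last_k=30):
    """0-based index of the best validation accuracy within the last `last_k` epochs."""
    start = max(0, len(val_acc) - last_k)
    window = val_acc[start:]
    best_in_window = max(range(len(window)), key=window.__getitem__)
    return start + best_in_window


def summarize(run_accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(run_accuracies), statistics.stdev(run_accuracies)
```

For a 100-epoch run, an early spike in validation accuracy (for example at epoch 10) is ignored because only epochs 70 to 99 are considered.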

Layer    Input shape     Kernel size     Output shape
Input    [1×1]×C'        -               -
Conv     [1×1]×C'        [1×1]×C'/r      [1×1]×C'/r
ReLU     [1×1]×C'/r      -               [1×1]×C'/r
Conv     [1×1]×C'/r      [1×1]           (see caption)
Table 1: Domain classifier/critic D(). The final output shape depends on the architecture used: for L-DANN, the number of output channels is the number of classes, and for L-WASS it is C', the number of input channels, to perform distribution matching.
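The structure in Table 1 can be sketched in a few lines. On a 1×1 spatial map, a bias-free 1×1 convolution reduces to a matrix-vector product, which is what the example uses. The function names, the nested-list weights, and the example values (256 channels, reduction ratio 16, 2 output classes) are illustrative assumptions, not settings reported in the text.

```python
def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def domain_head(squeezed, w1, w2):
    """Bias-free 1x1 conv -> ReLU -> bias-free 1x1 conv on a [1x1]xC' input.

    w1 has shape (C'/r, C') and w2 has shape (n_out, C'/r).
    """
    hidden = [max(0.0, h) for h in matvec(w1, squeezed)]
    return matvec(w2, hidden)


def head_param_count(channels, reduction, n_out):
    """Number of weights in the two bias-free layers."""
    hidden = channels // reduction
    return channels * hidden + hidden * n_out


print(head_param_count(256, 16, 2))  # 4128
```

The parameter count grows with the number of channels, so heads attached to deeper (wider) feature maps are larger, in line with the progressive increase in classifier complexity described in Section 2.1.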

3.2 Effect of layer-wise domain adaptation on small networks

In order to determine whether layer-wise domain adaptation improves results on networks with a small number of layers, we use MNIST handwritten digits, MNIST-M (MNIST blended with random RGB color patches from the BSDS500 dataset), and SVHN (street view house numbers) to perform digit classification given an image that contains a single digit. SVHN has more variation within the dataset; hence classifying SVHN digits is considered more challenging than MNIST or MNIST-M. For all experiments, we use the same number of images per dataset for training, and likewise for testing. We use a single 2-layer neural network, the MNIST architecture defined in [4], enhanced with batch normalization prior to the ReLU layers. As we do not optimize the architecture depending on the dataset or the direction of the adaptation, our results should only be interpreted within the context of Table 2, and should not be compared to the results reported in [4]. As the MNIST architecture is not convolutional, we use the domain classifier given in the MNIST architecture for each layer. For L-WASS, the classifier remains the same, except that the number of output elements is 100, to achieve more meaningful matching of distributions. Although the performance of L-DANN remains comparable to DANN, L-WASS fails to converge for the simplest experiment, hinting that layer-wise Wasserstein distribution matching is not suitable for simple distributions.

Method           Accuracy (mean ± standard deviation), one column per experiment
No adaptation    58 ± 2         27.9 ± 5.41    77 ± 0.96
DANN             90.8 ± 1.06    27.7 ± 1.43    46.1 ± 2.27
L-DANN           90.5 ± 0.12    22.8 ± 1.72    53.8 ± 2.22
L-WASS           N/C            21.0 ± 2.11    71.2 ± 0.91
Table 2: Comparison between DANN, L-DANN and L-WASS for smaller networks. N/C: Network did not converge.

3.3 Effect of model complexity on domain adaptation

We test our method on another modality, namely chest X-ray images acquired from two separate institutions, one in the USA and one in China, which are classified into normal patients and patients with manifestations of tuberculosis [9]. The datasets vary in resolution, quality, contrast, ratio of positive to negative samples, and number of samples. In addition, each dataset has separate watermarks and descriptive texts in different parts of the X-rays, which are known to degrade the performance of neural networks. The first dataset consists of 138 images, which we refer to as S, or small, and the second dataset consists of 662 images, which we refer to as L, or large. In order to show that our method performs better with deeper architectures, we compare two architectures: SE-ResNet-101 (49.6 million trainable parameters) and SENET 154 (116.3 million). Results are shown in Table 3. Note that although DANN slightly outperforms L-WASS in one of the experiments, its performance is not consistent. In some settings, it performs worse than networks without any domain adaptation, and it even fails to converge for the deepest setting. In contrast, both L-DANN and L-WASS consistently perform better than the baseline without domain adaptation. The utility of using a deeper architecture can be observed in the S→L direction, where moving from SE-ResNet-101 to SENET 154 raises accuracy from 73.9 to 80.9 for L-DANN and from 72.1 to 81.3 for L-WASS (Table 3). In other words, deeper networks can help generalize better to larger datasets given a small labeled dataset, which is often the case in the clinical setting.

Architecture     Source→Target   Method          Precision   Recall   F1-score   Accuracy
SE-ResNet-101    L→S             No adaptation   100.0       18.9     31.8       65.9
                                 DANN            80.9        65.5     72.4       79.0
                                 L-DANN          88.1        63.8     74.0       81.2
                                 L-WASS          91.1        53.4     67.3       78.3
                 S→L             No adaptation   68.7        72.6     70.6       69.3
                                 DANN            71.8        67.6     69.6       70.1
                                 L-DANN          72.6        73.7     73.1       73.9
                                 L-WASS          70.9        76.1     73.4       72.1
SENET 154        L→S             No adaptation   100.0       3.4      6.6        59.4
                                 DANN            90.9        51.7     65.9       77.5
                                 L-DANN          100.0       43.1     60.3       76.1
                                 L-WASS          90.9        68.9     78.4       84.1
                 S→L             No adaptation   79.3        65.1     71.5       73.7
                                 L-DANN          75.1        84.5     79.5       80.9
                                 L-WASS          88.8        75.1     81.6       81.3
Table 3: Comparison between DANN, L-DANN and L-WASS for deeper networks.
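The F1-score column of Table 3 can be cross-checked from the precision and recall columns, since F1 is their harmonic mean. The helper below is a small illustrative check, not part of the authors' evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# SE-ResNet-101, no adaptation: precision 100, recall 18.9
print(round(f1_score(100.0, 18.9), 1))  # 31.8
# SENET 154, L-WASS: precision 90.9, recall 68.9
print(round(f1_score(90.9, 68.9), 1))   # 78.4
```

The first row shows why accuracy alone is misleading here: perfect precision with a recall of 18.9 yields an F1-score of only 31.8.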

3.4 Domain adaptation for feature regularization

We also test our method on the BACH (BreAst Cancer Histopathology) challenge [7]. This challenge is composed of classification of patches extracted from whole-slide images (WSI) into 4 classes (normal, benign, in-situ, and invasive cancer) and segmentation of the WSI into these classes. As high accuracy on the classification part is not uncommon, we turn our attention to the segmentation. There are 10 labeled and 20 unlabeled WSI for training, and 10 for testing. Given the stain variation among WSI, we use the 20 unlabeled images for stain normalization and for source-agnostic (i.e., independent of the institution, scanner or hospital) feature extraction. In this respect, domain adaptation acts as a regularizer on the extracted features, retaining only the features that are common to both domains. We train the same network, SE-ResNet-50, without domain adaptation, with the L-DANN module, with L-WASS, and with DANN, and achieve scores of 0.63, 0.68, 0.66, and 0.65, respectively. The score is defined in [1]; it penalizes false negatives (an incorrect “normal” class) more than false positives (any of the remaining three classes). Note that the best score on the public leaderboard is 0.63.

4 Conclusions

We presented a novel domain adaptation method for fully convolutional networks that can alleviate the requirement for large amounts of data, especially in deep networks. Our method is simple, requires a minimal amount of modification to the original network architecture, adds a small overhead to the training cost, and is cost-free at test time. We tested our method on multiple public medical imaging datasets and showed promising gains over multiple baseline networks.

5 Acknowledgments

This work was funded by Canadian Cancer Society (grant 705772) and NSERC RGPIN-2016-06283.


  • [1] G. Aresta et al. (2019) BACH: grand challenge on breast cancer histology images. Medical Image Analysis.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • [3] C. Baur, S. Albarqouni, and N. Navab (2017) Semi-supervised deep learning for fully convolutional networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 311–319.
  • [4] Y. Ganin and V. Lempitsky (2014) Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.
  • [5] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
  • [6] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
  • [7] ICIAR (2018) ICIAR2018-challenge - home. Last accessed 16 August 2019.
  • [8] G. Maicas, A. P. Bradley, J. C. Nascimento, I. Reid, and G. Carneiro (2018) Training medical image analysis systems like radiologists. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 546–554.
  • [9] Openi (2018) What is open-i ?. Last accessed 16 August 2019.
  • [10] K. Saito, Y. Ushiku, and T. Harada (2017) Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2988–2997.
  • [11] H. Shimodaira (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90 (2), pp. 227–244.
  • [12] R. Shu, H. H. Bui, H. Narui, and S. Ermon (2018) A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735.