Beyond Domain Adaptation: Unseen Domain Encapsulation via Universal Non-volume Preserving Models

12/09/2018 ∙ by Thanh-Dat Truong, et al. ∙ Trường Đại học Khoa học Tự nhiên, ĐHQG-HCM University of Illinois at Urbana-Champaign IEEE University of Arkansas 0

Recognition across domains has recently become an active topic in the research community. However, it has been largely overlooked in the problem of recognition in new unseen domains. Under this condition, the delivered deep network models are unable to be updated, adapted or fine-tuned. Therefore, recent deep learning techniques, such as: domain adaptation, feature transferring, and fine-tuning, cannot be applied. This paper presents a novel Universal Non-volume Preserving approach to the problem of domain generalization in the context of deep learning. The proposed method can be easily incorporated with any other ConvNet framework within an end-to-end deep network design to improve the performance. On digit recognition, we benchmark on four popular digit recognition databases, i.e. MNIST, USPS, SVHN and MNIST-M. The proposed method is also experimented on face recognition on Extended Yale-B, CMU-PIE and CMU-MPIE databases and compared against other the state-of-the-art methods. In the problem of pedestrian detection, we empirically observe that the proposed method learns models that improve performance across a priori unknown data distributions.



There are no comments yet.


page 1

page 2

page 3

page 6

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep learning-based detection and recognition studies have been recently achieving very accurate performance in visual applications. However, many such methods expect the test images to come from the same distribution as the training images, and often fail when presented with new unseen visual domains. For examples, in face recognition, a system is trained on RGB images/videos and then deployed on infrared or thermal images/videos. Far apart from domain adaptation [6, 29]

, feature transfer learning

[34] or feature fine-tuning [15], in the context of the proposed problem, there is no available information about new unseen environments or domains where the system will be deployed.

Figure 2: The ideas of unseen domain encapsulation. The deep model is trained only in a single domain (A), i.e. RGB images. It is deployed in other unseen domains, i.e. thermal (B) and infrared (C) images

Indeed, detection and classification crossing domains have recently become active topics in the research communities. In particular, domain adaptation has received significant attention in computer vision. As shown in Figure

1(A), in the domain adaptation, we usually have a large-scale training set with groundtruth, i.e. the source domain A, and a small training set with or without groundtruth, i.e. the target domain B. The knowledge from the source domain A will be learned and adapted to the target domain B. During the testing time, the trained model will be deployed only in the target domain B. The recent results in domain adaptation have showed significant improvement in the performance. However, in real-world applications, the trained models are potentially deployed not only in the target domain B but also in many other new unseen domains, e.g. C, D, etc. In this scenarios, the released deep network models are usually unable to be retrained or fine-tuned with the inputs in new domains or environments as shown in Figure 2. Thus, domain adaptation methods cannot be applied in these problems since the new unseen target domains are unavailable during training. Moreover, domain adaptation methods only allow a pair of domains, i.e. the source domain and the target domain. Meanwhile, real-world applications usually require more than just a pair of domains. In practice, the number of domains that released models are potentially deployed is usually large and unpredictable.

Besides, there are some prior work to perform recognition problems with high accuracy by presenting new loss functions

[25, 35] or increasing deep network structures [9] via mining hard samples in training sets. These loss functions are deployed to deal with hard samples that can be considered as unseen domains, one can consider it as hard-sample problems and then can be solved using new loss functions, e.g. Center Loss [33], Range Loss [35]

, etc. However, these methods are also limited to generalize in new unseen domains. Increasing the depth of the deep network can be also considered to deal with hard samples. However, some real world problems are unable to observe the training samples in new unseen domains during the training process. Therefore, in the scope of this work, there is no assumption about the new unseen domains. Our proposed method can be supportively incorporated with these Convolutional Neural Network (CNN) based detection and classification methods to train within end-to-end deep learning framework to potentially improve the performance.

Our UNVP FT[34] ADDA[29] DANN[6] CoGAN[20] I2IAdapt[21] ADA[31] UBM[24]
Loss Function LL+ + ✗LL
End-to-End Yes ✓Yes ✓Yes Yes Yes ✓Yes ✗Yes ✗No
Require samples in new domains No ✓Yes ✓Yes Yes Yes ✓Yes ✗No ✗Yes
Pair Domains Any Two Two Two Any Two ✗Any ✗Any
Table 1: Comparing the properties between our UNVP approach and other recent methods. Feature Transferring (FT), Adversarial Discriminative Domain Adaptation (ADDA), Domain-Adversarial Training of Neural Networks (DANN), Generalizing to Unseen Domain via Adversarial Domain Adaptation(ADA), Universal Background Models (UBM), Coupled Generative Adversarial Networks (CoGAN), DL (Deep Learning), Generative Adversarial Network (GAN), Convolutional Neural Network (CNN), Adversarial Loss (), Log Likelihood Loss (LL), Cycle Consistency Loss ()

In this paper, instead of the domain shift, we target on exploring the problem of domain generalization in the context of deep learning. This work is inspired from the Universal Background Models (UBM) [24]

to model the background environment in the speech verification system. However, instead of approaching the background or environment modeling by using Gaussian Mixture Models (GMMs), this work presents a novel approach to generalize the domain representation by a new deep network design.

1.1 Contributions of this Work

This work presents a novel approach to domain encapsulation that can learn to better generalize new unseen domains. The restrictive setting is considered in this work where there is only single source domain for training data. Table 1 summarizes the difference between our approach and the prior methods. The contribution of this work can be summarized as follows.

A novel approach named Universal Non-volume Preserving Models (UNVP)

is firstly introduced to generalize environments of new unseen domains from a given single source training domain. Secondly, the environmental features extracted from the environment modeling via UNVP and the discriminative features extracted from the deep network classifiers are then unified together to provide a final encapsulated deep features that are robustly discriminative in new unseen domains. The proposed UNVP approach is designed and implemented within an end-to-end deep learning framework and inherits the power of the Convolutional Neural Network. The proposed UNVP can be easily end-to-end integrated with a CNN deep network design for object detection, object recognition or segmentation so that it can perform improvement results. Finally, the proposed method will be experimented in numerous vision modalities and applications with improvement results to demonstrate the impact of the method.

2 Related work

This section first overviews the Universal Background Models method. Then, the recent work in domain adaptation and image-to-image translation are also summarized.

2.1 Universal Background Models (UBM)

A Universal Background Models is a destiny estimation modeling method originally used in the speaker verification system


. In the conventional Gaussian Mixture Model - Universal Background Models (GMM-UBM) framework, UBM is a GMMs trained on a pool of data, known as the background or the environment from a large number of speakers. The speaker-specific models are then adapted from the UBM using the Maximum a Posteriori Probability (MAP) parameter estimation. During the evaluation phase, each test segment is scored against all enrolled speaker models to determine the speaker identities, i.e. speaker identification. It is also scored against the background model and a given speaker model to accept or reject an identity claim, i.e. speaker verification.

2.2 Domain Adaptation

Domain adaptation has recently become one of the most popular research topics in the field [6, 28, 26, 30, 29]. The key idea of domain adaptation is to map both source and target domains into a common feature space. Tzeng et al. proposed a unified framework for unsupervised domain Adaptation based on Adversarial learning objectives (ADDA) [29]. It uses a loss function in the discriminator to be solely dependent on its target distribution. Ganin et al. proposed a method for both training and domain adaptation within a unified network [6]. It aims to learn both domain adaptation and classification at the same time.

2.3 Image-to-Image Translation

The research Image-to-Image Translation topic has been grown significantly for several years. Image Translation has many potential applications, e.g. image style transfer [5], semantic segmentation [12, 32], depth estimation [2], etc. Indeed, Domain Adaptation is the one of important applications of Image Translation [21, 3, 19, 10].

Liu et. al. presented Coupled Generative Adversarial Network (CoGAN) for learning a joint distribution of multi-domain images

[20] which applies for domain adaptation. Liu et al. introduced Unsupervised Image to Image Translation (UNIT) method [19] inherited from CoGAN. This method aims at learning joint distribution of two marginal distributions from two image domains. The shared-latent space assumption was used in CoGAN [20] for joint distribution learning. In order to improve UNIT method, Huang et al. presented Multimodal UNIT (MUNIT) [10]. This method assumes that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. It allows images in different domains can be decoded into the shared feature space.

Figure 3: The Proposed Method
Figure 4: The distributions: (A) The left figure is the distribution of MNIST dataset. The right figure is the distributions of MNIST and MNIST-M datasets. The right image illustrates that when modeling MNIST-M using the model trained on MNIST, the distribution of MNIST-M is shifted to the right. (B) The distributions of class 0 and 8 of MNIST and MNIST-M after using unseen domain generalization to train environment variation modeling.

3 The Proposed Universal Non-volume Preserving Models (UNVP)

The proposed UNVP approach presents a new tractable CNN deep network

to not only extract the deep CNN features but also formulate the probability densities of the samples in the source environment in the form of Gaussian distributions. From these learned distributions, a density-based augmentation approach is employed to expand data distribution of the source environment for generalizing to different unseen domains. Far apart from other augmentation techniques where samples are generated directly in image domain using some prior knowledge, e.g. synthesizing blurry version of the image, or adding synthesis backgrounds, our approach focuses on augmentation in semantic space via the estimation of environment density. As a results, more semantic samples are augmented and, therefore, generalize the learning process. With this architecture design, UNVP is clever to unify the power of CNN deep features modeling in the first phase in the network and the distribution modeling in the later phase in the network within an end-to-end training framework as shown in Figure

3. In particular, the proposed framework consists of three main components. (1) Domain variation UNVP modeling via deep mapping functions; (2) Unseen domain generalization; and (3) End-to-end joint training deep network.

3.1 Environment Variation Modeling via Density Functions

Modeling environment variation directly in high-dimensional image domain is extremely complicated and easy to diverse due to the effects of noisy samples. This section aims at learning a function that maps an image in image domain to its latent representation in latent domain such that the density function of

can be estimated via the probability density function

. Then via , rather than representing the environment variation directly in the image domain, it can be easily modeled via variables in latent space that provides more semantic manner. Structure and Variable Relationship. Let be a data sample in image domain , be its corresponding class label, and where denotes the parameters of , the probability density function of can be formulated via the change of variable formula as follows:


where and define the distributions of samples of class in image and latent domains, respectively. denotes the Jacobian matrix with respect to . Then the log-likelihood is computed by.


Eqns. (1) and (2) have provided two facts: (1) learning the density function of samples in class is equivalent to estimate the density of its latent representation and determinant of the associated Jacobian matrix ; and (2) if the latent distribution is defined as a Gaussian distribution, the learned function explicitly becomes the mapping function from a real data distribution to a Gaussian distribution in latent space. Then, we can model the environment variation via deviations from the Gaussian distributions of all classes in latent domain. Furthermore, when is well-defined with tractable computation of its Jacobian determinant, the two-way connection (i.e. inference and generation) can be established between and .

The prior class distributions. Motivated from these properties, given classes, we choose the Gaussian distributions with different means and covariances as prior distributions for these classes, i.e. .

Mapping Function Structure. In order to enforce the information flow from image domain to latent space with different abstraction levels, the mapping function is formulated as a composition of several sub-functions as follows.


where is the number of sub-functions. The Jacobian can be derived by . With this structure, the properties of each will define the properties for the whole mapping function . For example, if the Jacobian of is tractable, then is also tractable. Furthermore, if is a non-linear function built from a composition of CNN layers then becomes a deep convolution neural network. There are several ways to construct the sub-functions, i.e. borrowing different CNN structures for non-linearity property. In our approach, the sub-function in [23] is adopt thanks to its tractable and invertible.

where is a binary mask, and is the Hadamard product. and define the scale and translation functions during mapping process.

Learning the mapping function and Environment Modeling. In order to learn the parameter for mapping function , the log-likelihood in Eqn. (2) is maximized as follows.


Notice that after learning the mapping function, all images of all classes are mapped into the distributions of their classes. Then the environment density can be considered as the composition of these distributions. Figure 4(A) (left) illustrated an example of the learned environment distributions of MNIST dataset with 10 digit classes. The density distributions of two different environments (i.e. MNIST and MNIST-M) are also presented in Figure 4(A) (right). In the next section, a generalization approach is proposed so that using only samples in source environment, the learned model can expand the density distributions of source environment so that they can cover as much as possible the distributions of unseen environments.

3.2 Unseen Domain Generalization

After modeling the source environment variation as the compositions of its class distributions, this section introduces the generalization process of these distributions with respect to a classification model such that the expansion of these distributions can help generalize to unseen environments with high accuracy.

In particular, let be the training loss function of , and be the parameters of . The generalization process of can be formulated as updating the parameters such that even the class distributions of the unseen environment are distance away from the source environment, is still robust with high accuracy. The objective function is reformulated as.


where denotes the sets of images and their labels;

is the distance between probability distributions;

and be the density distributions of the source and the new unseen environments, respectively.

Since both and are density distributions, the Wasserstein distance with respect to and can be adopt as follows.


where denotes the transformation cost. Notice that from previous section, we have leaned a mapping function that maps the density functions from image space to prior distribution in latent space. Moreover, since is invertible with the specific formula of sub-function, computing is equivalent to . From this, we can estimate as the transformation cost between Gaussian distributions. Then is reformulated by.


where and are the means and covariances of the distributions of class in the source environment and unseen environment, respectively. Plugging this distance and applying the Lagrangian relaxation to Eqn. (5), we have

To solve this objective function, the optimization process can be divided into two alternative steps: (1) generate the sample for each class such that , and consider this is a new “hard” example for class ; and (2) add to the training data and optimize the model . In other words, this two-step optimization process aims at finding new samples belonging to distributions that are distance far away from the distributions of the source environment, and making became more robust when classifying these examples. By this way, after a certain of iteration, the distributions learned from can be generalized so that they can cover as much as possible the distributions of new unseen environments.

Figure 4(B) shows that the distributions of MNIST and MNIST-M have been joint after using unseen domain generalization to train environment variation modeling. The red and green line of Figure 4(B) illustrate the distribution of class 0 of MNIST and MNIST-M, respectively. The yellow and blue line of Figure 4(B) illustrate the distribution of class 8 of MNIST and MNIST-M, respectively. Its figure proves that by our method can cover both distributions of the source domain and distributions of an unseen domain.

Figure 5: Examples in (A) MNIST, (B) MNIST-M, (C) USPS and (D) SVHN databases

3.3 Universal Deep Models

The whole end-to-end joint training process for Universal Deep Models is illustrated in Figure 3. Given a large-scale training set in the source environment, UNVP is employed to learn the mapping function from image domain to distributions in latent space. Then the two-step training process as presented in Sec. 3.2 is adopted to train the Deep Classifier for generalization. Notice that, to further constraint the perturbation in latent space while generating for new samples , we incorporate a regularization of the latent space learned by the classifier to Eqn. (7) as follows:

New generated samples are then added to the training set and used for updating both UNVP and CNN classifiers.

4 Experimental Results

This section first validates the proposed approach in digit recognition on four digit datasets, i.e. MNIST [17], USPS [11], SVHN [22] and MNIST-M. To obtain the MNIST-M, we blend digits from the original set over patches randomly extracted from color photos from BSDS500 [1]. In this experiment, MNIST is used as the only training set and the others are used as the testing sets. Then, Subsection 4.2 shows the proposed approach in face recognition with three standard face recognition databases, i.e. Extended Yale-B [7], CMU-PIE [27], CMU-MPIE [8]. Facial images with normal illumination are used as training domain and the ones in dark illumination conditions are used as testing set on the new unseen domains (Figure 6). Finally, we show the advantages of our proposed method in the cross-domain pedestrian recognition, i.e. RGB and thermal domains. we compare the detection results using our proposed method against other standard methods in subsection 4.3.

Figure 6: Examples of Yale-B [7] and CMU-PIE [27] databases. Face images in normal illumination condition (the 1st and the 3rd rows) are used as training domain and the ones in dark illumination conditions (the 2nd and the 4th rows) are used for testing as new unseen domains.

4.1 Digit Recognition on Unseen Domains

We show the experimental results using our proposed approach to digit recognition on new unseen domains with four digit databases, i.e. MNIST, MNIST-M, USPS and SVHN, as shown in Figure 5. In order to simplify it, LeNet CNN deep network model [16] is used as the classifier in this experiment. This deep network can be technically replaced by any other deep network models. We use a ConvNet with the designed architecture conv-pool-conv-pool-fc-fc-softmax. For environment variation modeling, we use Real NVP [4] as domain variation UNVP modeling.

About the training network hyper-parameters, learning rate, batch size and regularization rate are set to , respectively. In the generalizing phase (it is equivalent to maximization of ADA), we set hyper-parameters to , respectively, as shown in Algorithm 1 of ADA [31]. Adam Optimizer is used to optimize and update the deep network.

In this experiment, MNIST is the only database used to train the classifier. Then, three other datasets, i.e. MNIST-M, USPS and SVHN are used as the new unseen domains to benchmark the performance. In training, the classifier is trained using 60,000 images of MNIST. In order to generalizing new image phase, we use 10,000 images in this set to perturb and generalize new samples. To be convenient, all digit images are resized to pixels. In testing, the proposed method is benchmarked on the testing set of MNIST and three other unseen digit datasets, i.e. USPS, SVHN and MNIST-M. The classification results using the proposed approach are compared against the pure LeNet classifier (Pure-CNN), and the Adversarial Data Augmentation (ADA) [31] methods. We also show the recognition results on these datasets using the Domain Adaptation methods, including: Adversarial Discriminative Domain Adaptation (ADDA) [29], Domain-Adversarial Training of Neural Networks (DANN) [6] and Image to Image Translation for Domain Adaptation (I2IAdapt) [21]. It is notice Pure-CNN, ADA and our approaches do not require the target domain data during training. However, ADDA, DANN and I2IAdapt require the target domain data in the training steps.

The experimental results are shown in Table 2. The results on SVHN and MNIST-M shows that the proposed approach achieves better accuracy than ADA on these datasets. As we mentioned, our perturb phase generalizes images based on semantic space via the estimation of environment density. It helps our generated images are more diverse than the synthesized images using ADA method. In USPS benchmarking, USPS and MNIST datasets have the similar environment conditions as shown in Figure 5(A) and (C). Therefore, the image domain that USPS belongs to has been learned very well by the pure classifier and do not need any extra work for it. This scenarios eventually also happens to ADA method. The proposed method achieves better accuracy than the domain adaptation method, i.e. ADDA, on SVHN dataset. However, these method requires images in new domains in training. Meanwhile our method do not required training images in new domains.

Pure-CNN 99.36% 82.46% 37.89% 56.93%
ADA 99.17% 81.77% 37.87% 60.02%
Ours 99.17% black78.85% 40.22% 60.51%
ADDA 99.29% 94.81% 32.20% 63.39%
DANN 76.66%
CoGAN 91.20%
I2IAdapt 92.10%
Table 2: Experimental results on digit classification on four digit datasets. ADA and our proposed approaches do not require target domain data during training. ADDA, DANN, CoGAN and I2IAdapt require training data from target domains during training steps.
Figure 7: Examples of pedestrian detection results on thermal images using our proposed approach and Deformable-ConvNets training only on RGB images.

4.2 Face Recognition on Unseen Domains

In this experiment, our proposed approach is compared with Pure-CNN, ADA and ADDA methods in the face recognition application on three face recognition databases, including: Extended Yale-B, CMU-PIE and CMU-MPIE databases as shown in Figure 6. In each database, the face images with normal lighting conditions will be selected as the source domain (Normal Illumination) and the face images with dark lighting conditions will be selected as the target domain (Dark Illumination). We use the same framework as Digit Recognition. All images are resized to pixels. Table 3 shows our experimental results on Extened Yale-B, CMU-PIE and CMU-MPIE datasets. The results show that our proposed method help to improve the recognition performance on new unseen domains where the lighting conditions are not known.

Database Extended Yale-B CMU-PIE CMU-MPIE
Normal Dark Normal Dark Normal Dark
Pure-CNN 98.90% 49.15% 96.09% 62.87% 99.95% 95.18%
ADA 99.00% 53.08% 96.49% 62.69% 99.92% 96.08%
Ours 99.73% 70.09% 97.32% 67.87% 99.83% 98.25%
ADDA 99.17% 75.28% 96.09% 70.33% 99.93% 97.71%
Table 3: Experimental results of face recognition on Extended Yale-B [7], CMU-PIE [27] and CMU-MPIE [8] databases. It is notice that ADA and our proposed approaches do not require target domain data during training. ADDA requires training data from target domains during training steps. For Extended Yale-B, CMU-PIE and CMU-MPIE datasets, we split each dataset into two sets: training set (80%) and testing set (20%).

4.3 Pedestrian Recognition on Unseen Domains

This experiment aims for improving pedestrian detection on thermal images on the Thermal Dataset111 This dataset includes both thermal and RGB images. We create two datasets for the pedestrian recognition: (1) RGB pedestrian and (2) Thermal pedestrian datasets. To make these datasets, pedestrian objects are cropped from images in the Thermal Dataset. Our model is trained only on RGB pedestrian dataset. The baseline is trained on RGB pedestrian dataset and tested on Thermal pedestrian dataset. In the training phase, we use images to generalize new images, all images of two datasets are resized to pixels. Table 4 shows our experimental results on RGB pedestrian dataset and Thermal pedestrian dataset.

Database RGB Thermal
Pure-CNN 96.61% 94.44%
ADA 98.05% 95.42%
Ours 97.04% 96.29%
Table 4: Experimental results of pedestrian recognition on RGB pedestrian dataset and Thermal pedestrian dataset.

To further improve pedestrian detection, we apply our pedestrian recognition trained on RGB pedestrian dataset to the Deformable-ConvNets detector [14, 13] trained on COCO [18] dataset. After the image proposal phase, we crop proposed bounding boxes and feed into our pedestrian recognition framework.

Table 5 shows our experimental results of pedestrian detection on Thermal Dataset. Because of the abstract pedestrian shape (body shape) is keep on the thermal image. Therefore, the vanilla Deformable-ConvNets detector also have good results. Although our proposal helps to softly improve the results, the results prove that our method is promising.

No. proposed bounding boxes Detector Detector + Ours
300 53.64% 53.83%
500 53.21% 53.54%
700 52.83% 53.30%
Table 5: Experimental results of pedestrian detection on Thermal Dataset. The results are based on mean average precision (mAP) metric for object detection.

5 Conclusions

This paper has introduced a new UNVP learning model approach that generalize well to different unseen domains. Only using training data from the source domain, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is ”hard” under the current model. On digit recognition, we benchmark on four popular digit recognition databases, i.e. MNIST, MNIST-M, USPS, and SVHN. The method is also experimented on face recognition on Extended Yale-B, CMU-PIE and CMU-MPIE databases and compared against other the state-of-the-art methods. In the problem of pedestrian detection tasks, we empirically observe that the proposed method learns models that improve performance across a priori unknown data distributions.