1 Introduction
Deep learning-based detection and recognition methods have recently achieved highly accurate performance in visual applications. However, many such methods expect the test images to come from the same distribution as the training images, and often fail when presented with new, unseen visual domains. For example, in face recognition, a system may be trained on RGB images/videos and then deployed on infrared or thermal images/videos. Unlike domain adaptation [6, 29], feature transfer learning [34], or feature fine-tuning [15], in the problem considered here there is no information available about the new, unseen environments or domains where the system will be deployed.

Detection and classification across domains have recently become active research topics. In particular, domain adaptation has received significant attention in computer vision. As shown in Figure
1(A), in domain adaptation we usually have a large-scale training set with ground truth, i.e. the source domain A, and a small training set with or without ground truth, i.e. the target domain B. The knowledge from the source domain A is learned and adapted to the target domain B. At test time, the trained model is deployed only in the target domain B. Recent results in domain adaptation have shown significant performance improvements. However, in real-world applications, trained models are potentially deployed not only in the target domain B but also in many other new, unseen domains, e.g. C, D, etc. In these scenarios, released deep network models usually cannot be retrained or fine-tuned with inputs from the new domains or environments, as shown in Figure 2. Thus, domain adaptation methods cannot be applied to these problems, since the new, unseen target domains are unavailable during training. Moreover, domain adaptation methods only handle a pair of domains, i.e. the source domain and the target domain, whereas real-world applications usually involve more than a pair of domains. In practice, the number of domains in which released models are potentially deployed is usually large and unpredictable.

In addition, some prior work achieves high recognition accuracy by introducing new loss functions [25, 35] or deepening network structures [9] via mining hard samples in the training sets. If unseen domains are treated as a hard-sample problem, such loss functions, e.g. Center Loss [33] and Range Loss [35], can be applied. However, these methods still generalize poorly to new, unseen domains. Increasing the depth of the deep network can also help with hard samples, but some real-world problems provide no training samples from the new, unseen domains during training. Therefore, in the scope of this work, no assumption is made about the new, unseen domains. Our proposed method can be incorporated with these Convolutional Neural Network (CNN) based detection and classification methods and trained within an end-to-end deep learning framework to potentially improve performance.
Table 1: Comparison of our UNVP approach with prior methods (DL: deep learning; PGM: probabilistic graphical model).

 | Our UNVP | FT [34] | ADDA [29] | DANN [6] | CoGAN [20] | I2IAdapt [21] | ADA [31] | UBM [24]
Model Type | DL | DL | DL | DL | DL | DL | DL | GMM
Architecture | PGM+CNN | CNN | CNN+GAN | CNN | CNN+GAN | CNN+GAN | CNN | PGM
End-to-End | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No
Requires samples in new domains | No | Yes | Yes | Yes | Yes | Yes | No | Yes
Pair of Domains | Any | Two | Two | Two | Any | Two | Any | Any
In this paper, instead of domain shift, we target the problem of domain generalization in the context of deep learning. This work is inspired by the Universal Background Models (UBM) [24] used to model the background environment in speaker verification systems. However, instead of approaching background or environment modeling with Gaussian Mixture Models (GMMs), this work presents a novel approach that generalizes the domain representation with a new deep network design.
1.1 Contributions of this Work
This work presents a novel approach to domain encapsulation that can learn to better generalize to new, unseen domains. We consider the restrictive setting where only a single source domain of training data is available. Table 1 summarizes the differences between our approach and prior methods. The contributions of this work can be summarized as follows.
First, a novel approach named Universal Non-Volume Preserving models (UNVP) is introduced to generalize the environments of new, unseen domains from a given single source training domain. Second, the environmental features extracted via the UNVP environment model and the discriminative features extracted by the deep network classifier are unified to provide final encapsulated deep features that remain robustly discriminative in new, unseen domains. The proposed UNVP approach is designed and implemented within an end-to-end deep learning framework and inherits the power of Convolutional Neural Networks; it can easily be integrated end-to-end with a CNN design for object detection, object recognition, or segmentation to improve results. Finally, the proposed method is evaluated in numerous vision modalities and applications, demonstrating consistent improvements and the impact of the method.
2 Related work
This section first reviews the Universal Background Models method. Then, recent work on domain adaptation and image-to-image translation is summarized.
2.1 Universal Background Models (UBM)
A Universal Background Model (UBM) is a density estimation method originally used in speaker verification systems [24]. In the conventional Gaussian Mixture Model–Universal Background Model (GMM-UBM) framework, the UBM is a GMM trained on a pool of data, known as the background or the environment, from a large number of speakers. Speaker-specific models are then adapted from the UBM using Maximum a Posteriori (MAP) parameter estimation. During the evaluation phase, each test segment is scored against all enrolled speaker models to determine speaker identity, i.e. speaker identification. It is also scored against the background model and a given speaker model to accept or reject an identity claim, i.e. speaker verification.
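As a rough sketch of the GMM-UBM pipeline described above, the following minimal NumPy example performs mean-only MAP adaptation and log-likelihood-ratio scoring. The two-component "UBM", the toy data, and the relevance factor `r` are illustrative assumptions, not values from [24]:

```python
import numpy as np

def gmm_logpdf(X, weights, means, variances):
    # Per-frame log density of a diagonal-covariance GMM.
    diff = X[:, None, :] - means[None, :, :]                       # (n, k, d)
    log_comp = (-0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)
                + np.log(weights))                                 # (n, k)
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def map_adapt_means(X, weights, means, variances, r=16.0):
    # Mean-only MAP adaptation: blend UBM means with the enrollment statistics.
    diff = X[:, None, :] - means[None, :, :]
    log_comp = (-0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)
                + np.log(weights))
    resp = np.exp(log_comp - log_comp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)                        # posteriors
    n_k = resp.sum(axis=0)                                         # soft counts
    ex_k = resp.T @ X / np.maximum(n_k, 1e-8)[:, None]             # first-order stats
    alpha = (n_k / (n_k + r))[:, None]                             # adaptation weight
    return alpha * ex_k + (1 - alpha) * means

# Toy "UBM" with two components (assumed already trained on background data).
w = np.array([0.5, 0.5]); mu = np.array([[0.0, 0.0], [4.0, 4.0]]); var = np.ones((2, 2))
rng = np.random.default_rng(0)
enroll = rng.normal([0.5, 0.5], 0.3, size=(200, 2))                # one speaker's data
spk_mu = map_adapt_means(enroll, w, mu, var)

# Verification score: log-likelihood ratio of speaker model vs. background model.
test = rng.normal([0.5, 0.5], 0.3, size=(50, 2))
llr = gmm_logpdf(test, w, spk_mu, var).mean() - gmm_logpdf(test, w, mu, var).mean()
print(llr > 0)  # a matching speaker should score above the background model
```

Only components with enough enrollment mass move toward the speaker; unused components stay at the background means.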
2.2 Domain Adaptation
Domain adaptation has recently become one of the most popular research topics in the field [6, 28, 26, 30, 29]. Its key idea is to map both source and target domains into a common feature space. Tzeng et al. proposed a unified framework for unsupervised domain adaptation based on adversarial learning objectives (ADDA) [29], in which the discriminator's loss depends solely on the target distribution. Ganin et al. proposed a method for both training and domain adaptation within a unified network [6], aiming to learn domain adaptation and classification at the same time.
2.3 ImagetoImage Translation
Research on image-to-image translation has grown significantly over the past several years. Image translation has many potential applications, e.g. image style transfer [5], semantic segmentation [12, 32], depth estimation [2], etc. Indeed, domain adaptation is one of the important applications of image translation [21, 3, 19, 10].
Liu et al. presented the Coupled Generative Adversarial Network (CoGAN) [20] for learning a joint distribution of multi-domain images, which applies to domain adaptation. Liu et al. later introduced the Unsupervised Image-to-Image Translation (UNIT) method [19], which inherits from CoGAN. It aims at learning the joint distribution of two marginal distributions from two image domains, using the shared-latent-space assumption of CoGAN [20]. To improve UNIT, Huang et al. presented Multimodal UNIT (MUNIT) [10], which assumes that the image representation can be decomposed into a domain-invariant content code and a style code that captures domain-specific properties, allowing images from different domains to be decoded into a shared feature space.

3 The Proposed Universal Non-Volume Preserving Models (UNVP)
The proposed UNVP approach presents a new tractable CNN deep network that not only extracts deep CNN features but also formulates the probability densities of the samples in the source environment as Gaussian distributions. From these learned distributions, a density-based augmentation approach is employed to expand the data distribution of the source environment and generalize to different unseen domains. Unlike other augmentation techniques, where samples are generated directly in the image domain using prior knowledge, e.g. synthesizing blurry versions of an image or adding synthetic backgrounds, our approach performs augmentation in semantic space via an estimate of the environment density. As a result, more semantically meaningful samples are augmented, which generalizes the learning process. With this architecture design, UNVP unifies the power of CNN deep feature modeling in the first phase of the network and distribution modeling in the later phase within an end-to-end training framework, as shown in Figure 3. In particular, the proposed framework consists of three main components: (1) domain variation modeling via deep mapping functions; (2) unseen domain generalization; and (3) end-to-end joint training of the deep network.

3.1 Environment Variation Modeling via Density Functions
Modeling environment variation directly in the high-dimensional image domain is extremely complicated and prone to divergence due to noisy samples. This section aims at learning a function F that maps an image x in the image domain I to its latent representation z in a latent domain Z, such that the density function of x can be estimated via the probability density function of z. Then, via F, rather than representing the environment variation directly in the image domain, it can be modeled via variables in latent space, which provides a more semantic representation.

Structure and Variable Relationship. Let x be a data sample in the image domain I, y be its corresponding class label, and z = F(x; θ), where θ denotes the parameters of F. The probability density function of x can be formulated via the change-of-variable formula as follows:

p_X(x, y; θ) = p_Z(z, y; θ) |det(∂F(x; θ)/∂x)|   (1)

where p_X(x, y; θ) and p_Z(z, y; θ) define the distributions of samples of class y in the image and latent domains, respectively, and ∂F(x; θ)/∂x denotes the Jacobian matrix of F with respect to x. Then the log-likelihood is computed by

log p_X(x, y; θ) = log p_Z(z, y; θ) + log |det(∂F(x; θ)/∂x)|   (2)

Eqns. (1) and (2) provide two facts: (1) learning the density function of samples in class y is equivalent to estimating the density of its latent representation z and the determinant of the associated Jacobian matrix; and (2) if the latent distribution is defined as a Gaussian distribution, the learned function F explicitly becomes the mapping function from the real data distribution to a Gaussian distribution in latent space. We can then model the environment variation via deviations from the Gaussian distributions of all classes in the latent domain. Furthermore, when F is well defined with tractable computation of its Jacobian determinant, a two-way connection (i.e. inference and generation) can be established between x and z.
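The change-of-variable log-likelihood of Eqn. (2) can be checked numerically with a toy invertible map. The elementwise affine flow below is a hypothetical stand-in for the deep mapping function F; its parameters `a`, `b` are arbitrary assumptions:

```python
import numpy as np

# z = F(x) = a * x + b (elementwise affine), with a standard Gaussian prior on z.

def log_gaussian(z, mean, var):
    return np.sum(-0.5 * ((z - mean) ** 2 / var + np.log(2 * np.pi * var)))

a, b = np.array([2.0, 0.5]), np.array([1.0, -1.0])     # assumed flow parameters
x = np.array([0.3, -0.2])

z = a * x + b                                          # forward pass, z = F(x)
log_det_jacobian = np.sum(np.log(np.abs(a)))           # diagonal Jacobian: prod |a_i|

log_px = log_gaussian(z, 0.0, 1.0) + log_det_jacobian  # Eqn. (2)

# Sanity check against the analytic density of x: if z ~ N(0, I),
# then x = (z - b) / a ~ N(-b/a, 1/a^2) per dimension.
log_px_direct = log_gaussian(x, -b / a, 1.0 / a**2)
print(np.allclose(log_px, log_px_direct))  # True
```

The same identity is what lets a flow trained with Eqn. (2) report exact densities in image space.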
The prior class distributions. Motivated by these properties, given K classes, we choose Gaussian distributions with different means and covariances as the prior latent distributions for these classes, i.e. p_Z(z, y = k) = N(z; μ_k, Σ_k) for k = 1, …, K.
Mapping Function Structure. In order to enforce the information flow from the image domain to the latent space with different abstraction levels, the mapping function F is formulated as a composition of several sub-functions:

F = f_1 ∘ f_2 ∘ … ∘ f_N   (3)

where N is the number of sub-functions. The Jacobian of F can be derived by the chain rule as the product of the Jacobians of the f_i. With this structure, the properties of each f_i define the properties of the whole mapping function F. For example, if the Jacobian of each f_i is tractable, then that of F is also tractable. Furthermore, if each f_i is a non-linear function built from a composition of CNN layers, then F becomes a deep convolutional neural network. There are several ways to construct the sub-functions, e.g. by borrowing different CNN structures for non-linearity. In our approach, the sub-function of [23] is adopted thanks to its tractability and invertibility:

f_i(x) = m ⊙ x + (1 − m) ⊙ (x ⊙ exp(S(m ⊙ x)) + T(m ⊙ x))

where m is a binary mask and ⊙ is the Hadamard product. S and T define the scale and translation functions during the mapping process.
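A minimal affine coupling layer in the style of [23] (and Real NVP [4]) can be sketched as follows. The scale and translation networks `S`, `T` here are tiny toy maps standing in for the CNNs used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
mask = np.array([1.0, 1.0, 0.0, 0.0])          # binary mask m
W_s, W_t = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

def S(h): return np.tanh(h @ W_s)              # scale network (toy)
def T(h): return h @ W_t                       # translation network (toy)

def coupling_forward(x):
    h = mask * x                               # masked half passes through
    s, t = S(h), T(h)
    y = h + (1 - mask) * (x * np.exp(s) + t)   # other half is scaled + shifted
    log_det = np.sum((1 - mask) * s)           # Jacobian is triangular
    return y, log_det

def coupling_inverse(y):
    h = mask * y                               # same masked half, so S, T agree
    s, t = S(h), T(h)
    return h + (1 - mask) * ((y - t) * np.exp(-s))

x = rng.normal(size=d)
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
print(np.allclose(x, x_rec))  # exact invertibility, by construction
```

Because the scale and translation depend only on the masked half, the Jacobian is triangular and its log-determinant is just the sum of the scales, keeping Eqn. (2) tractable.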
Learning the Mapping Function and Environment Modeling. In order to learn the parameters θ of the mapping function F, the log-likelihood in Eqn. (2) is maximized as follows:

θ* = arg max_θ Σ_{(x, y)} log p_X(x, y; θ)   (4)

Notice that after learning the mapping function, the images of all classes are mapped into the distributions of their classes. The environment density can then be considered as the composition of these distributions. Figure 4(A) (left) illustrates an example of the learned environment distributions of the MNIST dataset with 10 digit classes. The density distributions of two different environments (i.e. MNIST and MNIST-M) are also presented in Figure 4(A) (right). In the next section, a generalization approach is proposed so that, using only samples from the source environment, the learned model can expand the density distributions of the source environment to cover as much as possible the distributions of unseen environments.
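As a toy illustration of treating the environment density as a composition of the per-class latent distributions, the following sketch evaluates a uniform mixture of assumed class Gaussians (the means below are illustrative, not learned values):

```python
import numpy as np

def class_logpdf(z, mean, var):
    # Log density of one isotropic Gaussian class prior in latent space.
    return np.sum(-0.5 * ((z - mean) ** 2 / var + np.log(2 * np.pi * var)), axis=-1)

# Toy latent class priors (stand-ins for the learned per-class Gaussians).
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
var = 1.0

def environment_logpdf(z):
    # Environment density = uniform mixture over the K class distributions.
    lp = np.stack([class_logpdf(z, m, var) for m in means]) - np.log(len(means))
    mx = lp.max(axis=0)
    return mx + np.log(np.exp(lp - mx).sum(axis=0))    # log-sum-exp

z = np.array([[0.1, -0.2], [10.0, 10.0]])
dens = environment_logpdf(z)
print(dens[0] > dens[1])  # points near a class mode have higher environment density
```

Deviations from this composite density are what the next section perturbs to mimic unseen environments.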
3.2 Unseen Domain Generalization
After modeling the source environment variation as the composition of its class distributions, this section introduces the generalization process of these distributions with respect to a classification model, such that the expansion of these distributions helps generalize to unseen environments with high accuracy.
In particular, let L be the training loss function of a classifier C, and θ_C be its parameters. The generalization process can be formulated as updating θ_C such that, even when the class distributions of an unseen environment are a distance ρ away from those of the source environment, C remains robust with high accuracy. The objective function is formulated as

min_{θ_C} sup_{P : D(P, P_s) ≤ ρ} E_{(x, y) ~ P} [ L(x, y; θ_C) ]   (5)

where (x, y) denotes an image and its label; D(·, ·) is a distance between probability distributions; and P_s and P are the density distributions of the source and the new, unseen environments, respectively. Since both P_s and P are density distributions, the Wasserstein distance can be adopted as

D(P, P_s) = inf_{γ ∈ Π(P, P_s)} E_{(a, b) ~ γ} [ c(a, b) ]   (6)
where c(·, ·) denotes the transformation cost. Notice that, from the previous section, we have learned a mapping function F that maps the density functions from image space to prior Gaussian distributions in latent space. Moreover, since F is invertible given the specific form of its sub-functions, computing the distance between distributions in image space is equivalent to computing it between the corresponding latent distributions. From this, we can estimate D(P, P_s) as the transformation cost between Gaussian distributions, and D is reformulated as

D(P, P_s) = Σ_k [ ||μ_k − μ'_k||_2^2 + Tr(Σ_k + Σ'_k − 2 (Σ'_k^{1/2} Σ_k Σ'_k^{1/2})^{1/2}) ]   (7)

where {μ_k, Σ_k} and {μ'_k, Σ'_k} are the means and covariances of the distributions of class k in the source environment and the unseen environment, respectively. Plugging this distance in and applying the Lagrangian relaxation to Eqn. (5), we have

min_{θ_C} sup_P { E_{(x, y) ~ P} [ L(x, y; θ_C) ] − λ D(P, P_s) }
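The per-class term of Eqn. (7), the closed-form squared 2-Wasserstein distance between two Gaussians, can be computed as in this sketch (the two toy Gaussians are illustrative):

```python
import numpy as np

# W2^2(N1, N2) = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})

def sqrtm_psd(A):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def gaussian_w2_sq(mu1, S1, mu2, S2):
    s2_half = sqrtm_psd(S2)
    cross = sqrtm_psd(s2_half @ S1 @ s2_half)
    return np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2 * cross)

mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([3.0, 0.0]), 2.0 * np.eye(2)

d = gaussian_w2_sq(mu1, S1, mu2, S2)
# For these isotropic covariances this reduces to ||mu1-mu2||^2 + 2*(sqrt(2)-1)^2.
print(np.isclose(d, 9.0 + 2 * (np.sqrt(2) - 1) ** 2))  # True
```

Summing this quantity over the K class pairs gives the environment distance used in the relaxed objective.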
To solve this objective function, the optimization process can be divided into two alternating steps: (1) generate, for each class k, a sample whose latent distribution is a distance ρ away from the source distribution of class k, and treat it as a new "hard" example for class k; and (2) add the generated samples to the training data and optimize the model C. In other words, this two-step optimization process aims at finding new samples belonging to distributions that are a distance ρ away from the distributions of the source environment, and making C more robust when classifying these examples. This way, after a certain number of iterations, the distributions learned from the source environment are generalized so that they cover as much as possible the distributions of new, unseen environments.
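The two-step loop above can be sketched on toy 2-D latent features. A nearest-mean classifier stands in for the deep classifier C, and shifting sampled points a fixed distance `rho` from their class mean is a crude stand-in for maximizing the relaxed objective; all values are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 2.0
means = {0: np.array([0.0, 0.0]), 1: np.array([5.0, 5.0])}  # source class priors

def sample_class(k, n):
    return rng.normal(means[k], 0.5, size=(n, 2))

train = {k: sample_class(k, 100) for k in means}

def fit(train):
    # "Training" the stand-in classifier = recomputing class means.
    return {k: X.mean(axis=0) for k, X in train.items()}

def predict(model, X):
    d = np.stack([np.linalg.norm(X - m, axis=1) for m in model.values()])
    return np.array(list(model))[d.argmin(axis=0)]

for _ in range(3):                       # a few alternating rounds
    model = fit(train)                   # step (2): optimize C
    for k in means:                      # step (1): generate hard examples
        z = sample_class(k, 10)
        direction = rng.normal(size=2)
        direction /= np.linalg.norm(direction)
        train[k] = np.vstack([train[k], z + rho * direction])  # shift by rho

model = fit(train)
# A shifted "unseen environment" version of class 0 is still mostly recognized.
unseen = sample_class(0, 50) + np.array([1.5, 0.0])
acc = (predict(model, unseen) == 0).mean()
print(acc)
```

In the actual framework, step (1) perturbs in the flow's latent space and maps back through the invertible F, and step (2) updates the CNN classifier by gradient descent.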
Figure 4(B) shows that the distributions of MNIST and MNIST-M become joint after using unseen domain generalization to train the environment variation model. The red and green lines in Figure 4(B) illustrate the distribution of class 0 in MNIST and MNIST-M, respectively; the yellow and blue lines illustrate the distribution of class 8 in MNIST and MNIST-M, respectively. The figure shows that our method can cover both the distributions of the source domain and those of an unseen domain.
3.3 Universal Deep Models
The whole end-to-end joint training process for the Universal Deep Models is illustrated in Figure 3. Given a large-scale training set in the source environment, UNVP is employed to learn the mapping function from the image domain to the distributions in latent space. Then the two-step training process presented in Sec. 3.2 is adopted to train the deep classifier for generalization. Notice that, to further constrain the perturbation in latent space while generating new samples, we incorporate a regularization term, based on the latent space learned by the classifier, into Eqn. (7).
Newly generated samples are then added to the training set and used to update both the UNVP model and the CNN classifier.
4 Experimental Results
This section first validates the proposed approach on digit recognition with four digit datasets, i.e. MNIST [17], USPS [11], SVHN [22], and MNIST-M. To obtain MNIST-M, we blend digits from the original set over patches randomly extracted from color photos in BSDS500 [1]. In this experiment, MNIST is used as the only training set and the others are used as testing sets. Then, Subsection 4.2 evaluates the proposed approach on face recognition with three standard face recognition databases, i.e. Extended Yale-B [7], CMU-PIE [27], and CMU-MPIE [8]. Facial images with normal illumination are used as the training domain, and those in dark illumination conditions are used as the testing set of new, unseen domains (Figure 6). Finally, we show the advantages of our proposed method on cross-domain pedestrian recognition, i.e. RGB and thermal domains, comparing detection results of our proposed method against other standard methods in Subsection 4.3.
4.1 Digit Recognition on Unseen Domains
We evaluate the proposed approach on digit recognition on new, unseen domains with four digit databases, i.e. MNIST, MNIST-M, USPS, and SVHN, as shown in Figure 5. For simplicity, the LeNet model [16] is used as the classifier in this experiment; it can technically be replaced by any other deep network model. We use a ConvNet with the architecture conv-pool-conv-pool-fc-fc-softmax. For environment variation modeling, we use Real NVP [4] as the domain variation (UNVP) model.
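The conv-pool-conv-pool-fc-fc-softmax classifier can be traced shape-by-shape as below. The filter counts and kernel sizes are assumptions in the spirit of LeNet [16], not values reported here:

```python
import numpy as np

def conv2d_shape(h, w, k, n_filters):      # 'valid' convolution output shape
    return h - k + 1, w - k + 1, n_filters

def pool2d_shape(h, w, c, k=2):            # non-overlapping pooling output shape
    return h // k, w // k, c

h, w = 32, 32                              # input digit image (assumed size)
h, w, c = conv2d_shape(h, w, 5, 6)         # conv1: 6 filters of 5x5 -> 28x28x6
h, w, c = pool2d_shape(h, w, c)            # pool1 -> 14x14x6
h, w, c = conv2d_shape(h, w, 5, 16)        # conv2: 16 filters of 5x5 -> 10x10x16
h, w, c = pool2d_shape(h, w, c)            # pool2 -> 5x5x16
flat = h * w * c                           # 400 features feed fc1
logits = 10                                # fc2 + softmax over 10 digit classes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(np.zeros(logits))          # uniform prediction before training
print(flat, np.isclose(probs.sum(), 1.0))
```

Any backbone with the same interface (features in, class posteriors out) could replace this classifier, as noted above.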
Regarding the network training hyper-parameters, we fix the learning rate, batch size, and regularization rate. In the generalization phase (equivalent to the maximization phase of ADA), the hyper-parameters are set as in Algorithm 1 of ADA [31]. The Adam optimizer is used to optimize and update the deep network.
In this experiment, MNIST is the only database used to train the classifier; the three other datasets, i.e. MNIST-M, USPS, and SVHN, are used as the new, unseen domains to benchmark performance. The classifier is trained on the 60,000 training images of MNIST. In the sample generalization phase, 10,000 images from this set are used to perturb and generate new samples. All digit images are resized to the same resolution. For testing, the proposed method is benchmarked on the test set of MNIST and the three unseen digit datasets, i.e. USPS, SVHN, and MNIST-M. The classification results of the proposed approach are compared against the pure LeNet classifier (Pure-CNN) and the Adversarial Data Augmentation (ADA) [31] method. We also report recognition results on these datasets using domain adaptation methods, including Adversarial Discriminative Domain Adaptation (ADDA) [29], Domain-Adversarial Training of Neural Networks (DANN) [6], and Image to Image Translation for Domain Adaptation (I2IAdapt) [21]. Note that Pure-CNN, ADA, and our approach do not require target-domain data during training, whereas ADDA, DANN, and I2IAdapt do.
The experimental results are shown in Table 2. The results on SVHN and MNIST-M show that the proposed approach achieves better accuracy than ADA on these datasets. As mentioned, our perturbation phase generates images in the semantic space via the estimated environment density, which makes the generated images more diverse than those synthesized by ADA. In the USPS benchmark, USPS and MNIST have similar environment conditions, as shown in Figure 5(A) and (C); the image domain of USPS is therefore already learned well by the pure classifier and requires no extra work, a scenario that also applies to ADA. The proposed method also achieves better accuracy than the domain adaptation method ADDA on SVHN. Note that ADDA requires images from the new domains during training, whereas our method does not.
Table 2: Digit recognition accuracy; MNIST is the training (seen) domain.

Database | MNIST | USPS | SVHN | MNIST-M
Pure-CNN | 99.36% | 82.46% | 37.89% | 56.93%
ADA | 99.17% | 81.77% | 37.87% | 60.02%
Ours | 99.17% | 78.85% | 40.22% | 60.51%
ADDA | 99.29% | 94.81% | 32.20% | 63.39%
DANN | – | – | – | 76.66%
CoGAN | – | 91.20% | – | –
I2IAdapt | – | – | – | 92.10%
4.2 Face Recognition on Unseen Domains
In this experiment, our proposed approach is compared with the Pure-CNN, ADA, and ADDA methods on the face recognition application with three databases: Extended Yale-B, CMU-PIE, and CMU-MPIE, as shown in Figure 6. In each database, face images with normal lighting conditions are selected as the source domain (Normal Illumination) and face images with dark lighting conditions as the target domain (Dark Illumination). We use the same framework as in digit recognition, and all images are resized to the same resolution. Table 3 shows our experimental results on the Extended Yale-B, CMU-PIE, and CMU-MPIE datasets. The results show that our proposed method helps improve recognition performance on new, unseen domains where the lighting conditions are not known in advance.
Table 3: Face recognition accuracy under Normal (source) and Dark (unseen) illumination.

Database | Extended Yale-B | CMU-PIE | CMU-MPIE
 | Normal / Dark | Normal / Dark | Normal / Dark
Pure-CNN | 98.90% / 49.15% | 96.09% / 62.87% | 99.95% / 95.18%
ADA | 99.00% / 53.08% | 96.49% / 62.69% | 99.92% / 96.08%
Ours | 99.73% / 70.09% | 97.32% / 67.87% | 99.83% / 98.25%
ADDA | 99.17% / 75.28% | 96.09% / 70.33% | 99.93% / 97.71%
4.3 Pedestrian Recognition on Unseen Domains
This experiment aims at improving pedestrian detection on thermal images using the Thermal Dataset (https://www.flir.com/oem/adas/adas-dataset-form/), which includes both thermal and RGB images. We create two datasets for pedestrian recognition: (1) an RGB pedestrian dataset and (2) a thermal pedestrian dataset, obtained by cropping pedestrian objects from images in the Thermal Dataset. Our model is trained only on the RGB pedestrian dataset; the baseline is likewise trained on the RGB pedestrian dataset and tested on the thermal pedestrian dataset. In the training phase, we use a subset of the RGB images to generate new samples, and all images in both datasets are resized to the same resolution. Table 4 shows our experimental results on the RGB and thermal pedestrian datasets.
Table 4: Pedestrian recognition accuracy.

Database | RGB | Thermal
Pure-CNN | 96.61% | 94.44%
ADA | 98.05% | 95.42%
Ours | 97.04% | 96.29%
To further improve pedestrian detection, we apply our pedestrian recognition model, trained on the RGB pedestrian dataset, to proposals from the Deformable ConvNets detector [14, 13] trained on the COCO dataset [18]. After the proposal phase, we crop the proposed bounding boxes and feed them into our pedestrian recognition framework.
Table 5 shows our experimental results for pedestrian detection on the Thermal Dataset. Because the abstract pedestrian (body) shape is preserved in thermal images, the vanilla Deformable ConvNets detector already performs well. Although our proposal only slightly improves the results, it shows that our method is promising.
Table 5: Pedestrian detection on the Thermal Dataset.

No. of proposed bounding boxes | Detector | Detector + Ours
300 | 53.64% | 53.83%
500 | 53.21% | 53.54%
700 | 52.83% | 53.30%
5 Conclusions
This paper has introduced a new UNVP learning approach that generalizes well to different unseen domains. Using training data only from the source domain, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. On digit recognition, we benchmark on four popular databases, i.e. MNIST, MNIST-M, USPS, and SVHN. The method is also evaluated on face recognition on the Extended Yale-B, CMU-PIE, and CMU-MPIE databases and compared against other state-of-the-art methods. On pedestrian detection tasks, we empirically observe that the proposed method learns models that improve performance across a priori unknown data distributions.
References
 [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, May 2011.

 [2] A. Atapour-Abarghouei and T. Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation. In Proc. Computer Vision and Pattern Recognition, pages 1–8. IEEE, June 2018.
 [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [4] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In ICLR, 2017.
 [5] A. Dundar, M.-Y. Liu, T.-C. Wang, J. Zedlewski, and J. Kautz. Domain stylization: A strong, simple baseline for synthetic to real image domain adaptation. CoRR, abs/1807.09384, 2018.

 [6] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180–1189, Lille, France, July 2015. PMLR.
 [7] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
 [8] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multipie. Image Vision Comput., 28(5):807–813, May 2010.
 [9] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [10] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
 [11] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, May 1994.

 [12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
 [13] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
 [14] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017.
 [15] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2983–2991, Dec 2015.
 [16] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998.
 [17] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
 [18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
 [19] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems 30, pages 700–708. Curran Associates, Inc., 2017.
 [20] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems 29, pages 469–477. Curran Associates, Inc., 2016.
 [21] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [22] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
 [23] C. Nhan Duong, K. Gia Quach, K. Luu, N. Le, and M. Savvides. Temporal non-volume preserving approach to facial age-progression and age-invariant face recognition. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [24] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3):19–41, 2000.
 [25] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015.
 [26] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning transferrable representations for unsupervised domain adaptation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 2118–2126, USA, 2016. Curran Associates Inc.
 [27] T. Sim, S. Baker, and M. Bsat. The cmu pose, illumination, and expression (pie) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, FGR ’02, pages 53–, Washington, DC, USA, 2002. IEEE Computer Society.
 [28] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. CoRR, abs/1510.02192, 2015.
 [29] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [30] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.
 [31] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. CoRR, abs/1805.12018, 2018.
 [32] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585, 2017.
 [33] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, ECCV (7), volume 9911 of Lecture Notes in Computer Science, pages 499–515. Springer, 2016.
 [34] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Feature transfer learning for deep face recognition with long-tail data. CoRR, abs/1803.09014, 2018.
 [35] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tail. CoRR, abs/1611.08976, 2016.