1 Introduction
In learning-based computer vision, the mismatch between the probability distributions of the training and test samples is an essential problem to overcome for success in real-world scenarios. For example, suppose we have an object recognizer learned from a training set containing objects with specific viewpoints, backgrounds, and transformations. It is then applied to an environment with a similar object category but different viewpoints, backgrounds, and transformations. This situation might arise from a lack of labeled data representing the target environment or from insufficient knowledge about the target conditions. A good recognition model in this setting cannot be guaranteed if it is trained using traditional learning techniques.
Methods to address the distribution mismatch have been investigated under the names of domain adaptation and transfer learning (in this area, the terms “domain” and “probability distribution” are considered to be identical). More specifically, given a training set and a test set sampled from distributions $\mathcal{D}_s$ and $\mathcal{D}_t$, respectively, the goal is to predict the target labels when $\mathcal{D}_s \neq \mathcal{D}_t$ and the information about $\mathcal{D}_t$ is not sufficient. In recent years, many solutions to this problem have been proposed for computer vision applications (Saenko et al., 2010, Gopalan et al., 2011, Gong et al., 2012, Long et al., 2013) and natural language processing (Pan and Yang, 2010, Daumé III, 2009).
In image recognition, the Office data set (Saenko et al., 2010) has become a standard image set for evaluating the performance of domain adaptation models. The standard evaluation protocol on this data set uses the SURF feature descriptor (Bay et al., 2008) as input to the model. However, such a descriptor usually needs careful engineering to yield good discriminative features. Furthermore, it may add complexity in the context of real-time feature extraction. It is therefore worthwhile to build good models without using any handcrafted feature descriptors.
Representation or feature learning provides a framework to reduce the dependency on manual feature engineering (Bengio et al., 2012). Examples that can be considered as representation learning are Principal Component Analysis (PCA), Independent Component Analysis (ICA), Sparse Coding, Neural Networks, and Deep Learning. In deep learning, greedy layer-wise unsupervised training, known as pretraining, has played an important role in the success of deep neural networks (Bengio et al., 2007, Erhan et al., 2010). Although representation learning-based techniques have brought success in many applications, methods to address the distribution mismatch have not yet been well studied.
In this work, we propose a simple neural network model with good domain adaptation performance on raw image pixels. More particularly, we utilize a nonparametric probability distribution distance measure, i.e., the Maximum Mean Discrepancy (MMD), as a regularization embedded in the supervised backpropagation training. MMD is used to reduce the distribution mismatch between two hidden layer representations induced by samples drawn from different domains. Despite its effectiveness, to the best of our knowledge, the use of MMD in the context of neural networks has not yet been investigated. This work is therefore, to our knowledge, the first study to use MMD in neural networks. Specifically, we investigate whether the MMD regularization can indeed improve the discriminative domain adaptation performance of neural networks.
2 Preliminaries
In this section, we describe several tools related to our proposed method: the MMD measure, the feed forward neural network, and the denoising autoencoder. Reviews of these tools in the recent literature are also included.
2.1 Maximum Mean Discrepancy
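The Maximum Mean Discrepancy (MMD) is a measure of the difference between two probability distributions computed from their samples. It is an effective criterion that compares distributions without first estimating their density functions. Given two probability distributions $p$ and $q$ on $\mathcal{X}$, MMD is defined as
$$\mathrm{MMD}(\mathcal{F}, p, q) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \right), \tag{1}$$
where $\mathcal{F}$ is a class of functions $f: \mathcal{X} \rightarrow \mathbb{R}$. By defining $\mathcal{F}$ as the set of functions in the unit ball of a universal Reproducing Kernel Hilbert Space (RKHS), denoted by $\mathcal{H}$, it was shown that $\mathrm{MMD}(\mathcal{F}, p, q)$ will detect any discrepancy between $p$ and $q$ (Borgwardt et al., 2006).
Let $X_s = \{x_s^{(1)}, \ldots, x_s^{(n_s)}\}$ and $X_t = \{x_t^{(1)}, \ldots, x_t^{(n_t)}\}$ be data vectors drawn from distributions $\mathcal{D}_s$ and $\mathcal{D}_t$ on the data space $\mathcal{X}$, respectively. Based on the fact that $f$ is in the unit ball of a universal RKHS, one may rewrite the empirical estimate of MMD as
$$\mathrm{MMD}_e(X_s, X_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x_s^{(i)}) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_t^{(j)}) \right\|_{\mathcal{H}}, \tag{2}$$
where $\phi(\cdot): \mathcal{X} \rightarrow \mathcal{H}$ is referred to as the feature space map.
By casting (2) into a vector-matrix multiplication form, we come up with a kernelized equation of the form (Borgwardt et al., 2006)
\begin{align}
\mathrm{MMD}_e(X_s, X_t) &= \left( \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \phi(x_s^{(i)})^\top \phi(x_s^{(j)}) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \phi(x_t^{(i)})^\top \phi(x_t^{(j)}) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \phi(x_s^{(i)})^\top \phi(x_t^{(j)}) \right)^{1/2} \tag{3} \\
&= \left( \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} k(x_s^{(i)}, x_s^{(j)}) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} k(x_t^{(i)}, x_t^{(j)}) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} k(x_s^{(i)}, x_t^{(j)}) \right)^{1/2}, \tag{4}
\end{align}
where $k(x, x') = \phi(x)^\top \phi(x')$ is the kernel function, whose values over all sample pairs form the Gram matrix of all possible kernels in the data space.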
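The kernelized estimate can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this work: the function names `gaussian_kernel`, `mean_kernel`, `mmd_squared`, and `mmd`, the bandwidth value, and the toy samples are all assumptions made for the example, and the Gaussian kernel is only one possible choice of kernel.

```python
import math

def gaussian_kernel(x, y, s=1.0):
    """Gaussian kernel with standard deviation s (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * s ** 2))

def mean_kernel(A, B, kernel):
    """Average of kernel(a, b) over all pairs (a, b) from A x B."""
    return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))

def mmd_squared(Xs, Xt, kernel=gaussian_kernel):
    """Squared empirical MMD between two samples, using only kernel values."""
    return (mean_kernel(Xs, Xs, kernel)
            + mean_kernel(Xt, Xt, kernel)
            - 2.0 * mean_kernel(Xs, Xt, kernel))

def mmd(Xs, Xt, kernel=gaussian_kernel):
    """Empirical MMD; tiny negative values from rounding are clamped to 0."""
    return math.sqrt(max(mmd_squared(Xs, Xt, kernel), 0.0))

# Hypothetical toy samples: the "target" is a shifted copy of the "source".
source = [[0.0, 0.0], [0.5, -0.5], [1.0, 0.0]]
target = [[x + 2.0, y + 2.0] for x, y in source]
same_domain = mmd(source, source)      # identical samples
shifted_domain = mmd(source, target)   # shifted samples
```

In this toy case the estimate is zero for identical samples and clearly positive for the shifted ones, which is the behavior the regularizer relies on.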
In domain adaptation or transfer learning, MMD has been used to reduce the distribution mismatch between the source and target domains. Pan et al. (2009) proposed a PCA-based model referred to as Transfer Component Analysis (TCA) that used MMD to induce a subspace where the data distributions in different domains are close to each other. Long et al. (2013) presented Transfer Sparse Coding (TSC), which utilizes MMD in the encoding stage to match the distributions of the sparse codes.
Our work adopts the idea of incorporating MMD into the learning algorithm, similarly to TCA and TSC. The difference is that we carry out the MMD regularization with respect to the supervised criterion, while both TCA and TSC are unsupervised learning methods. We expect that the MMD regularization embedded in the supervised training will induce better discriminative features.
2.2 Feed Forward Neural Networks
The Feed Forward Neural Network (FFNN) has been used extensively for solving many discriminative tasks during the past decades, including object recognition tasks. The standard FFNN structure consists of three types of layers, namely the input, hidden, and output layers, with weighted inter-layer connections. FFNN training corresponds to adjusting the connection weights with respect to a specific criterion.
Let us consider a single hidden layer neural network with $x$, $h$, and $o$ as the visible, hidden, and output layers, respectively. We denote $W_1$ and $W_2$ as the connection weights between the adjacent layers. The FFNN can be written in the form of
\begin{align}
h &= \sigma_1(W_1^\top x + b), \tag{5} \\
o &= \sigma_2(W_2^\top h + c), \tag{6}
\end{align}
where $b$ and $c$ are the hidden and output units’ biases, respectively.
Note that both $\sigma_1(\cdot)$ and $\sigma_2(\cdot)$ are nonlinear activation functions. In this work, we use the rectifier function approximated by the softplus function $\sigma_1(a) = \log(1 + e^{a})$, applied elementwise, and the softmax function $[\sigma_2(a)]_l = e^{a_l} / \sum_{l'} e^{a_{l'}}$, where $a_l$ denotes the $l$-th element of the vector $a$. The rectifier function has been argued to be more biologically plausible than the logistic function (Glorot et al., 2011). More importantly, several experimental works have shown that the rectifier activation function can improve the performance of neural network models (Nair and Hinton, 2010). Furthermore, the use of the softmax function induces a probabilistic interpretation of the FFNN output.
Given the labeled training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, where $y^{(i)}$ represents the label with one active output node per class, the objective function of the FFNN in the form of the empirical log-likelihood loss function is given as
$$J_{NN} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{l} y_l^{(i)} \log o_l^{(i)}, \tag{7}$$
where $o^{(i)}$ is the network output for $x^{(i)}$, which is typically minimized by the backpropagation algorithm.
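The forward pass and the loss described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names (`softplus`, `softmax`, `affine`, `forward`, `nll_loss`), the list-of-rows weight layout, and the toy sizes are all hypothetical.

```python
import math

def softplus(a):
    """Numerically stable softplus, log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def softmax(a):
    """Softmax over a list of scores, shifted by the max for stability."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, bias, v):
    """Compute W^T v + bias, with W stored as len(v) rows of len(bias) columns."""
    return [sum(W[i][j] * v[i] for i in range(len(v))) + bias[j]
            for j in range(len(bias))]

def forward(x, W1, b, W2, c):
    """Single hidden layer network: softplus hidden units, softmax outputs."""
    h = [softplus(a) for a in affine(W1, b, x)]
    o = softmax(affine(W2, c, h))
    return h, o

def nll_loss(X, Y, W1, b, W2, c):
    """Mean negative log-likelihood for one-hot label vectors Y."""
    total = 0.0
    for x, y in zip(X, Y):
        _, o = forward(x, W1, b, W2, c)
        total -= sum(y_l * math.log(o_l) for y_l, o_l in zip(y, o))
    return total / len(X)

# Hypothetical toy problem: 2 inputs, 3 hidden units, 3 classes, zero weights.
W1 = [[0.0] * 3 for _ in range(2)]
b = [0.0] * 3
W2 = [[0.0] * 3 for _ in range(3)]
c = [0.0] * 3
X = [[1.0, -1.0], [0.5, 2.0]]
Y = [[1, 0, 0], [0, 0, 1]]
loss = nll_loss(X, Y, W1, b, W2, c)  # uniform predictions, so loss = log(3)
```

With all-zero parameters the softmax output is uniform, so the loss equals the logarithm of the number of classes, which is a convenient sanity check before training.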
2.3 Denoising Autoencoder
An autoencoder is an unsupervised neural network used for learning efficient codings. In deep learning research, it is known as an effective technique for pretraining deep neural networks (Bengio et al., 2007). Structurally, the autoencoder is very similar to the standard feed forward neural network except that its output layer has the same number of nodes as the input layer. The objective of the autoencoder is to reconstruct its own inputs by minimizing a reconstruction loss function.
A denoising autoencoder (DAE) is a variant of the autoencoder model that captures robust representations by reconstructing clean inputs given their noisy counterparts (Vincent et al., 2010). Qualitatively, the use of several types of noise such as zero-masking, Gaussian, and salt-and-pepper noise characterizes particular “filters” that correspond to the first hidden layer parameters (Vincent et al., 2010). DAEs have been considered better than standard autoencoders and comparable to restricted Boltzmann machines in terms of discriminative performance in deep learning (Erhan et al., 2010, Vincent et al., 2010).
In this work, we consider the DAE as the pretraining stage of our proposed domain adaptive model. Unlabeled images from both the source and target domains are used as inputs to the DAE pretraining. We investigate the domain adaptation performance with and without the DAE pretraining.
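The corruption step that distinguishes a DAE from a plain autoencoder can be sketched as below. The sketch only covers zero-masking noise; the function names, the masking fraction, and the use of a squared-error reconstruction loss are assumptions made for illustration, since the text does not fix these choices.

```python
import random

def zero_mask(x, fraction, rng):
    """Zero-masking noise: each element is set to 0 with the given probability."""
    return [0.0 if rng.random() < fraction else v for v in x]

def squared_error(x, x_hat):
    """Reconstruction loss measured against the clean input (assumed loss)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

rng = random.Random(0)              # fixed seed, for a repeatable example
clean = [0.2, 0.9, 0.4, 0.7, 0.1]
noisy = zero_mask(clean, 0.3, rng)  # this is what the encoder would see

# A DAE would encode `noisy`, decode it to `x_hat`, and minimize
# squared_error(clean, x_hat). Using `noisy` itself as `x_hat` below is
# only a placeholder for an untrained identity mapping.
loss_if_identity = squared_error(clean, noisy)
```

The key design point is that the loss compares the reconstruction with the clean input, not with the corrupted one, so the network cannot succeed by copying its input.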
3 Domain Adaptive Neural Networks
We propose a variant of the standard feed forward neural network that we refer to as the Domain Adaptive Neural Network (DaNN). This model incorporates the MMD measure (2) as a regularization embedded in the supervised backpropagation training. With this regularization, we aim to train the network parameters such that the supervised criterion is optimized and the hidden layer representations are encouraged to be invariant across different domains.
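Given the labeled source data $\{(x_s^{(i)}, y_s^{(i)})\}_{i=1}^{n_s}$ and the unlabeled target data $\{x_t^{(j)}\}_{j=1}^{n_t}$, the loss function of a single layer DaNN is given by
$$J_{DaNN} = J_{NN_s} + \gamma \, \mathrm{MMD}_e^2(q_s, \bar{q}_t), \tag{8}$$
where $J_{NN_s}$ is the same loss function as shown in (7) but applied only over the source data, $q_s = W_1^\top x_s + b$ and $\bar{q}_t = W_1^\top x_t + b$ are the linear combination outputs before the activation, and $\gamma$ is the regularization constant controlling the importance of the MMD contribution to the loss function.
To minimize (8), we need the gradient of $J_{DaNN}$. While computing the gradient of $J_{NN_s}$ over the network parameters is trivial, computing the gradient of $\mathrm{MMD}_e^2(\cdot, \cdot)$ depends on the choice of the kernel function. We choose the Gaussian kernel, which is considered a universal kernel (Steinwart, 2002), as the kernel function, of the form $k(x, y) = \exp\left( -\frac{\| x - y \|^2}{2 s^2} \right)$, where $s$ is the standard deviation.
We can rewrite the function $\mathrm{MMD}_e^2(\cdot, \cdot)$ in terms of the Gaussian kernel in a matrix-vector form. Let us denote the sample vectors $z_s = [1 \;\; x_s^\top]^\top$ and $z_t = [1 \;\; x_t^\top]^\top$. The additional element of 1 in each sample is used to incorporate the biases into the computation. Let us define the parameter matrices $U_1 = \begin{bmatrix} b^\top \\ W_1 \end{bmatrix}$ and $U_2 = \begin{bmatrix} c^\top \\ W_2 \end{bmatrix}$. Hence, the function $\mathrm{MMD}_e^2(\cdot, \cdot)$ can be rewritten as
\begin{align}
\mathrm{MMD}_e^2(U_1^\top z_s, U_1^\top z_t) ={} & \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \exp\left( -\frac{(z_s^{(i)} - z_s^{(j)})^\top U_1 U_1^\top (z_s^{(i)} - z_s^{(j)})}{2 s^2} \right) \nonumber \\
& + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \exp\left( -\frac{(z_t^{(i)} - z_t^{(j)})^\top U_1 U_1^\top (z_t^{(i)} - z_t^{(j)})}{2 s^2} \right) \nonumber \\
& - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \exp\left( -\frac{(z_s^{(i)} - z_t^{(j)})^\top U_1 U_1^\top (z_s^{(i)} - z_t^{(j)})}{2 s^2} \right). \tag{9}
\end{align}
Let $G_{\bullet\bullet}(i, j)$ be the gradient of $k(U_1^\top z_\bullet^{(i)}, U_1^\top z_\bullet^{(j)})$, where each symbol $\bullet$ can be either $s$ or $t$, with respect to $U_1$. Then, $G_{\bullet\bullet}(i, j)$ takes the form
$$G_{\bullet\bullet}(i, j) = -\frac{1}{s^2} \, k(U_1^\top z_\bullet^{(i)}, U_1^\top z_\bullet^{(j)}) \, (z_\bullet^{(i)} - z_\bullet^{(j)}) (z_\bullet^{(i)} - z_\bullet^{(j)})^\top U_1. \tag{10}$$
Now it is straightforward to see that the gradient of $\mathrm{MMD}_e^2(U_1^\top z_s, U_1^\top z_t)$ w.r.t. $U_1$ ($\nabla_{U_1} \mathrm{MMD}_e^2(\cdot, \cdot)$ for short) is given by
$$\nabla_{U_1} \mathrm{MMD}_e^2(\cdot, \cdot) = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} G_{ss}(i, j) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} G_{tt}(i, j) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} G_{st}(i, j). \tag{11}$$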
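A gradient of this kind is easy to get wrong, so a small numerical sketch may help. The code below assumes a Gaussian kernel with standard deviation `s` applied to the projections of bias-augmented samples by a parameter matrix (called `U1` here), and computes both the squared MMD and its analytic gradient with respect to that matrix. The function names, the list-of-rows matrix layout, and the toy data are hypothetical and chosen only for illustration.

```python
import math

def project(U, z):
    """Compute U^T z, with U stored as len(z) rows."""
    return [sum(U[i][j] * z[i] for i in range(len(z))) for j in range(len(U[0]))]

def pair_terms(U, za, zb, s):
    """Kernel value k, difference d = za - zb, and projection p = U^T d."""
    d = [a - b for a, b in zip(za, zb)]
    p = project(U, d)
    k = math.exp(-sum(v * v for v in p) / (2.0 * s * s))
    return k, d, p

def mmd2(U, Zs, Zt, s):
    """Squared MMD between the projected source and target samples."""
    def mean_k(A, B):
        return sum(pair_terms(U, a, b, s)[0] for a in A for b in B) / (len(A) * len(B))
    return mean_k(Zs, Zs) + mean_k(Zt, Zt) - 2.0 * mean_k(Zs, Zt)

def mmd2_grad(U, Zs, Zt, s):
    """Analytic gradient of mmd2 with respect to U (same shape as U)."""
    rows, cols = len(U), len(U[0])
    grad = [[0.0] * cols for _ in range(rows)]

    def accumulate(A, B, weight):
        for a in A:
            for b in B:
                k, d, p = pair_terms(U, a, b, s)
                coef = -weight * k / (s * s)   # d/dU of k is -(k / s^2) d d^T U
                for i in range(rows):
                    for j in range(cols):
                        grad[i][j] += coef * d[i] * p[j]

    accumulate(Zs, Zs, 1.0 / len(Zs) ** 2)
    accumulate(Zt, Zt, 1.0 / len(Zt) ** 2)
    accumulate(Zs, Zt, -2.0 / (len(Zs) * len(Zt)))
    return grad

# Hypothetical toy data; the leading 1.0 in each sample is the bias element.
Zs = [[1.0, 0.2, -0.1], [1.0, 0.5, 0.3]]
Zt = [[1.0, 1.2, 0.9], [1.0, 0.8, 1.4], [1.0, 1.1, 1.0]]
U1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
grad = mmd2_grad(U1, Zs, Zt, s=1.0)
```

One observation from this sketch: the first row of `grad`, which corresponds to the biases, is zero, because the constant element cancels in every pairwise difference. Under these assumptions the MMD term therefore only moves the weights, not the biases.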
The main reason for choosing the Gaussian kernel is that it has been well studied and shown to make MMD useful in practice (Gretton et al., 2012). Furthermore, it is worth noting that MMD here is applied to the linear combination outputs before the nonlinear activation function is applied. This means that MMD provides a biased estimate of the actual distribution discrepancy of the hidden representations. However, since we use the rectifier activation function, which is close to linear, we expect that the MMD measure on these outputs would still produce a good approximation of the true distribution discrepancy.
In the implementation, we separate the minimization of $J_{NN_s}$ and $\mathrm{MMD}_e^2(\cdot, \cdot)$ into two steps. First, $J_{NN_s}$ is minimized using mini-batch stochastic gradient descent with respect to the $U_1$ and $U_2$ updates. The mini-batch setting has become standard practice in neural network training as a compromise between speed and accuracy. Then, $\mathrm{MMD}_e^2(\cdot, \cdot)$ is minimized by re-updating $U_1$ with respect to the gradient (11). The latter step is accomplished by full-batch gradient descent. The details of this procedure are summarized in Algorithm 1.

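Algorithm 1: The DaNN training procedure.
1. Initialize $U_1$ and $U_2$ with small random real values;
2. Update $U_1$ and $U_2$ using mini-batch stochastic gradient descent by the standard forward-backward pass w.r.t. $J_{NN_s}$;
3. Update $U_1$ by the offline gradient descent as follows: $U_1 \leftarrow U_1 - \alpha \gamma \, \nabla_{U_1} \mathrm{MMD}_e^2(\cdot, \cdot)$, where $\alpha$ is the learning rate;
4. Repeat Steps 2 and 3 until the end of the epoch.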
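The alternating structure of the procedure can be sketched as a short training-loop skeleton. This is not the authors' implementation: the two gradient callables are hypothetical placeholders, the names `alpha` (learning rate) and `gamma` (MMD weight) are assumed, and the sketch assumes that the MMD step follows every mini-batch update, which is only one reading of the final step.

```python
def axpy(A, G, step):
    """Return A + step * G for matrices stored as lists of rows."""
    return [[a + step * g for a, g in zip(ra, rg)] for ra, rg in zip(A, G)]

def train_epoch(U1, U2, batches, supervised_grads, mmd_grad, alpha, gamma):
    """One epoch of alternating supervised and MMD updates.

    supervised_grads(U1, U2, batch) -> (dU1, dU2): hypothetical callable that
        returns backpropagated gradients of the supervised loss on a mini-batch.
    mmd_grad(U1) -> dU1: hypothetical callable that returns the full-batch
        gradient of the squared MMD with respect to U1.
    """
    for batch in batches:
        # Mini-batch SGD step on the supervised loss (both parameter matrices).
        dU1, dU2 = supervised_grads(U1, U2, batch)
        U1 = axpy(U1, dU1, -alpha)
        U2 = axpy(U2, dU2, -alpha)
        # Full-batch descent step on the MMD term (first-layer parameters only).
        U1 = axpy(U1, mmd_grad(U1), -alpha * gamma)
    return U1, U2
```

Passing the gradients in as callables keeps the skeleton independent of how backpropagation and the MMD gradient are actually computed; dropping the second update recovers a plain supervised network.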
4 Experiments and Analysis
We evaluated our proposed method in the context of object recognition over several domain mismatches. We first compared the DaNN to baselines and other recent domain adaptation methods. The results are reported in terms of the recognition accuracy, represented by the mean and standard deviation over 30 independent runs. Finally, we investigated the effect of the MMD regularization by measuring the difference between the first hidden layer activations of one domain and those of another.
4.1 Setup
Our experiments used the Office data set (Saenko et al., 2010) that contains images of 31 object classes from three different domains: amazon, webcam, and dslr. In amazon, the images contain a single centered object, while for the others the images were acquired in unconstrained settings with some variations such as lighting and background changes. Here we only used 10 object classes following the protocol designed by Gong et al. (2012), which ends up with 1410 instances in total. The number of images for amazon, webcam, and dslr, respectively, are 958, 295, and 157. Webcam and dslr are known to be more similar to each other based on the Rank of Domain (ROD) measure (Gong et al., 2012). Examples of the Office images can be seen in Figure 1.
The DaNN model used in the experiments has only one hidden layer, i.e., it is a shallow network, with 256 hidden nodes. (This number was chosen to obtain dimensionality reduction. We also tried other values such as 100, 300, and 500; 256 hidden nodes gave the best performance among them.) The input layer of the DaNN can be either raw pixels or SURF features. The output layer contains ten nodes corresponding to the ten classes.
In all our experiments, we used the parameter setting for the supervised backpropagation learning specified in Table 1. Note that we employed the dropout regularization introduced by Hinton et al. (2012), which randomly omits each hidden node for each training case with a certain probability. It has been shown to reduce overfitting when a neural network is trained on a small training set.
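Table 1: Parameter setting for the supervised backpropagation learning.
| Parameter | Value |
|---|---|
| Learning rate | 0.02 |
| Iterations | 900 |
| Momentum | 0.05 |
| L2 weight regularization | |
| Dropout fraction | |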
For the MMD regularization, we set the standard deviation $s$ of the Gaussian kernel by the following calculation: $s = \sqrt{MSD / 2}$ (Baktashmotlagh et al., 2013), where $MSD$ is the median squared distance between all source samples. The MMD regularization constant $\gamma$ was set to be sufficiently large to accommodate the small values of (11) compared to the gradient of the supervised loss at each iteration.
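A median-based bandwidth of this kind can be sketched as follows. The rule coded here, $s = \sqrt{MSD/2}$ (so that $2s^2$ equals the median squared distance), is an assumption of this sketch, as is the choice to take the median over distinct sample pairs only; the function names and the toy data are hypothetical.

```python
import math
import statistics

def median_squared_distance(X):
    """Median of squared Euclidean distances over all distinct pairs in X."""
    dists = [sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
             for i in range(len(X)) for j in range(i + 1, len(X))]
    return statistics.median(dists)

def bandwidth_from_median(X):
    """Assumed rule s = sqrt(MSD / 2), so that 2 * s**2 equals the MSD."""
    return math.sqrt(median_squared_distance(X) / 2.0)

source = [[0.0], [1.0], [3.0]]   # hypothetical 1-D source samples
s = bandwidth_from_median(source)
```

Tying the bandwidth to the data scale in this way avoids a kernel that is either nearly constant or nearly zero for all sample pairs.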
We conducted six domain shift settings, each of which is a source-target domain pair, based on the three domains of the Office data set (amazon → webcam, amazon → dslr, webcam → amazon, webcam → dslr, dslr → amazon, and dslr → webcam). The evaluation was divided into two settings: 1) unsupervised adaptation, and 2) semi-supervised adaptation. Unsupervised adaptation corresponds to the setting in which we can use both labeled images from the source domain and unlabeled images from the target domain during training, but no labels from the target domain are incorporated. In semi-supervised adaptation, we incorporate a few labeled images from the target domain as additional training images: the first three images per object category from the target domain are selected. Unlike the initial work (Saenko et al., 2010), we used all labeled images from the source domain instead of a random sample of them.
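The semi-supervised selection rule can be illustrated with a short sketch. The function name, the `(image_id, label)` data layout, and the toy labels are hypothetical; only the rule of taking the first three target images per category comes from the text.

```python
from collections import defaultdict

def semi_supervised_split(target_items, per_class=3):
    """Split (image_id, label) pairs into labeled and unlabeled target sets.

    The first `per_class` items of each label, in the given order, are treated
    as labeled; everything else stays unlabeled.
    """
    seen = defaultdict(int)
    labeled, unlabeled = [], []
    for image_id, label in target_items:
        if seen[label] < per_class:
            labeled.append((image_id, label))
            seen[label] += 1
        else:
            unlabeled.append(image_id)
    return labeled, unlabeled

# Hypothetical target set with two categories and ten images.
target = [(i, "mug" if i % 2 else "bike") for i in range(10)]
labeled, unlabeled = semi_supervised_split(target)
```

In the unsupervised setting the same function could be called with `per_class=0`, in which case every target image ends up in the unlabeled set.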
The performance of our model was then compared to SVM-based baselines, two existing domain adaptation methods, and a simple neural network, as follows:
LSVM: an SVM (Cortes and Vapnik, 1995) model with a linear kernel that was applied to the original features (see http://www.csie.ntu.edu.tw/~cjlin/liblinear).
PCA + LSVM: the same model as the LSVM but preceded by PCA to reduce the feature dimensionality.
GFK (Gong et al., 2012): the Geodesic Flow Kernel approach, which considers an infinite number of intermediate subspaces between the source and target domains, followed by kNN classification (here we used the subspaces constructed by PCA only).
TSC (Long et al., 2013): the Transfer Sparse Coding technique based on the combination of graph-regularized sparse coding, the MMD regularization, and logistic regression (see http://learn.tsinghua.edu.cn:8080/2011310560/long.html).
NN: a single layer neural network with the same structure and parameter setting (Table 1) used in our DaNN, but without the MMD regularization (it is basically Algorithm 1 without Step 3).
4.2 Results on SURF Features
We first investigated the performance of our model on the standard image features provided by Gong et al. (2012). Briefly, the image features were acquired by first applying the SURF descriptor to resized and grayscaled images to detect local scale-invariant interest points. The data points were then encoded into 800-bin histograms using a codebook trained from a subset of amazon images (Saenko et al., 2010). The final features were then normalized and z-scored to have zero mean and unit variance. We conducted the unsupervised setting evaluation, with the results shown in Table 2.
We found that DaNN and TSC have better performance than the other approaches on these standard features. More specifically, DaNN performs well when the amazon set is part of the domain pair. In the case of webcam-dslr shifts, TSC, which was not tested on the Office data set in previous work, is surprisingly the best model. Despite its effectiveness, TSC has a longer feature extraction time than, for example, neural network-based approaches, so it is less efficient in real-world situations. We also noted that GFK, which incorporates multiple intermediate subspaces, fails to surpass the baselines in several cases. This indicates that the projection onto the subspaces generated by GFK is insufficient to reduce the domain mismatch.
4.3 Results on Raw Pixels
We also conducted the evaluation on the raw pixels of the Office images. Previous works on the Office image set mostly used the SURF-based features. It is worth investigating the performance on the Office raw pixels directly, since good models on raw pixels are preferable in the sense of reducing the need for handcrafted feature extractors. We first converted the RGB pixels of the Office images into grayscale and resized the images to a common, fixed dimension. They were then z-scored to have zero mean and unit variance.
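The z-scoring step can be sketched as below. Whether the standardization is done per feature across the data set or per image is not stated in the text; this sketch assumes the per-feature variant with the population standard deviation, and the function name and toy values are hypothetical.

```python
import math

def zscore_columns(X):
    """Standardize each feature (column) of X to zero mean and unit variance."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
            for j in range(d)]
    # Constant features have zero spread; leave them centered at 0.
    return [[(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(d)] for row in X]

# Toy "images": three samples with three pixel features each.
pixels = [[0.0, 10.0, 5.0], [2.0, 30.0, 5.0], [4.0, 20.0, 5.0]]
Z = zscore_columns(pixels)
```

The guard for constant features matters for raw pixels, where border pixels can take the same value in every image and would otherwise cause a division by zero.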
Domain Adaptation Setting
In this experiment, we ran both the unsupervised and semi-supervised adaptation settings for all domain pairs. In addition, we investigated the effect on performance of DAE pretraining that precedes the NN and DaNN supervised training. The DAE pretraining slightly changes Step 1 of Algorithm 1. We denote these models as DAE + NN and DAE + DaNN. Examples of the pretrained weights are depicted in Figure 2. The complete accuracy rates on the Office raw pixels for all domain pairs are presented in Table 3.
Table 3: Accuracy rates on the Office raw pixels for all domain pairs, in the unsupervised and semi-supervised settings. Methods compared in each setting: LSVM, PCA + LSVM, GFK (Gong et al., 2012), TSC (Long et al., 2013), NN, DAE + NN, DaNN, and DAE + DaNN.
Our DaNN provides accuracy improvements in all domain pairs compared to the SVM-based baselines and the NN model. In other words, the MMD regularization indeed improves the performance of neural networks. Compared to TSC, which also employs the MMD regularization but in the unsupervised training stage, our DaNN performs better in most cases. However, TSC can match the DaNN performance on the webcam-dslr pairs, which have a lower level of mismatch than the other pairs. This indicates that using the MMD regularization in the supervised training might yield more adaptation ability than using it in the unsupervised training for pairs with more difficult mismatches.
The DAE pretraining applied to NN and DaNN improves the performance for all domain pairs. The improvements are quite significant in several cases, especially for the webcam-dslr pairs. In general, the DAE pretraining also produces more stable models, in the sense of lower standard deviations over 30 independent runs. Furthermore, the combination of DAE pretraining and DaNN performs best among all methods in these experiments in almost all cases. Qualitatively, as can be seen in Figure 2, the DAE pretraining captures more distinctive “filters”, from local blob detectors to object part detectors, especially when the amazon images are included. This effect is broadly consistent with what was found in the initial DAE work (Vincent et al., 2010), suggesting that the DAE pretraining provides more useful neural network representations.
In the semi-supervised setting, the performance trend is similar to that in the unsupervised setting. However, the performance discrepancies between NN and DaNN here become smaller than those in the unsupervised setting. This outcome also holds for the DAE pretraining. This suggests that both the MMD regularization and the DAE pretraining might be less impactful when some labeled images from the target domain can be acquired.
In-domain Setting
One may ask whether the domain adaptation results shown in Table 3 are reasonable compared to the standard learning setting. We refer to this standard setting as the in-domain setting, where the training and test samples come from the same domain. The in-domain performance can be considered as a reference that indicates the effectiveness of domain adaptation models in dealing with the domain mismatch.
We investigated the in-domain performance of the non-domain-adaptive models described in Section 4.1, i.e., LSVM, PCA + LSVM, and NN, on the raw pixels of the Office images. For each domain, we conducted k-fold cross-validation. The complete in-domain results in terms of the mean and standard deviation are shown in Table 4. In general, we can see that the best in-domain model is the NN model on both training and test images.
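Table 4: In-domain accuracy (mean ± standard deviation) on the raw pixels of the Office images.
| Methods | amazon Training | amazon Test | webcam Training | webcam Test | dslr Training | dslr Test |
|---|---|---|---|---|---|---|
| LSVM | 99.0 ± 0.3 | 52.0 ± 4.6 | 100.0 ± 0.0 | 57.7 ± 13.9 | 100.0 ± 0.0 | 51.0 ± 14.1 |
| PCA + LSVM | 64.4 ± 0.8 | 60.6 ± 6.4 | 72.0 ± 1.5 | 62.8 ± 8.7 | 75.6 ± 2.1 | 55.2 ± 13.1 |
| NN | 99.3 ± 0.1 | 74.2 ± 3.2 | 100.0 ± 0.0 | 87.2 ± 5.4 | 100.0 ± 0.0 | 77.9 ± 8.8 |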
In comparison to the domain adaptation results, the highest in-domain accuracies are better than the results with domain mismatches when amazon or webcam is used as the target set (see the highest accuracy rates in the corresponding columns of Table 3). This indicates that a better domain adaptation model might be necessary to overcome those mismatches. However, this is not the case for dslr as the target set, where the in-domain accuracy is even lower than the best domain adaptation result on the webcam → dslr pair. Given that the webcam and dslr images are quite similar and that the webcam set has more images, this shows that domain adaptation indeed helps to produce a better object recognition model in this kind of setting.
5 Conclusions and Future Work
This paper aimed to reduce the domain mismatch problem in object recognition using a simple neural network model, which we refer to as the Domain Adaptive Neural Network (DaNN). In this work, we utilized the MMD measure as a regularization in the supervised backpropagation training. This regularization encourages the hidden layer representation distributions of the two domains to be similar to each other. We demonstrated that the DaNN performs well on the Office image set, especially with raw image pixels as inputs. Furthermore, the DaNN preceded by denoising autoencoder (DAE) pretraining has better performance than the SVM-based baselines, GFK (Gong et al., 2012), and TSC (Long et al., 2013) on the Office image set (Saenko et al., 2010) in almost all domain pairs.
Despite the effectiveness of the MMD regularization, there are still many aspects that can be further improved. We have seen that the performance on raw pixels, which is a main concern of the representation learning approach, is still not as good as that on SURF features. We note that good models that perform well without any preceding handcrafted feature extractors are preferable in order to reduce complexity. A better model on raw pixels might be achieved by using deeper neural networks with a similar strategy, since deep architectures have brought success in many applications in recent years (Bengio, 2013). However, our initial work using a standard deep neural network with the DAE pretraining, which is not shown here due to the page limit, suggested that deeper representations do not always improve the performance against the domain mismatch.
In addition, a study on the kernel choice for computing MMD in the domain adaptation problem might be worth pursuing. We assumed that the universal Gaussian kernel function can detect any underlying distribution mismatch in the Office data set, which might not be true. A better understanding of the relationship between a kernel function and a particular image mismatch, e.g., changes in background, lighting, or affine transformation, would have a great impact on this field of research.
References
 Baktashmotlagh et al. (2013) M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of International Conference on Computer Vision, pages 769–776, 2013.
 Bay et al. (2008) H. Bay, T. Tuytelaars, and L. V. Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
 Bengio (2013) Y. Bengio. Deep learning of representations: Looking forward. In Statistical Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science, pages 1–37. Springer, 2013.
 Bengio et al. (2007) Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS), volume 19, page 153, 2007.
 Bengio et al. (2012) Y. Bengio, A. C. Courville, and P. Vincent. Representation learning: A review and new perspectives. Computing Research Repository, abs/1206.5538, 2012.
 Borgwardt et al. (2006) K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
 Cortes and Vapnik (1995) C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
 Daumé III (2009) H. Daumé III. Frustratingly easy domain adaptation. CoRR, abs/0907.1815, 2009.
 Erhan et al. (2010) D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, and P. Vincent. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.

 Glorot et al. (2011) X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323, 2011.
 Gong et al. (2012) B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2066–2073, 2012.
 Gopalan et al. (2011) R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In IEEE International Conference on Computer Vision, pages 999–1006, 2011.
 Gretton et al. (2012) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
 Hinton et al. (2012) G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
 Long et al. (2013) M. Long, G. Ding, J. Wang, J. Sun, Y. Guo, and P. S. Yu. Transfer sparse coding for robust image representation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 404–414, 2013.
 Nair and Hinton (2010) V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
 Pan and Yang (2010) S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
 Pan et al. (2009) S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 1187–1192, 2009.
 Saenko et al. (2010) K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226, 2010.

 Steinwart (2002) I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2002.
 Vincent et al. (2010) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.