Many practical machine-learning applications, such as sentiment classification, spam filtering, and object recognition, require repeatedly building new predictive models as fresh data becomes available. Assuming the labeling of new data is laborious or costly, it is common to encounter a new target domain with a shortage of class labels, together with a previously analyzed source domain with an abundance of class labels. Under the assumption that both domains share the same feature (input) representation, one is tempted to take a model trained on the source domain and apply it to the target domain. The problem of domain adaptation emerges when source and target domain distributions differ; the discrepancy between the two domains precludes applying the source model to the target domain directly [1, 2, 3, 4, 5]. Based on certain assumptions, domain adaptation techniques have been proposed to alleviate such distributional discrepancy.
A popular technique in domain adaptation is to search for a new common feature space where both source and target distributions show high overlap. Deep learning has recently been used successfully in this scenario [6, 7, 8]; the goal is to use a deep network architecture to transform low-level features into high-level representations. Features detected using deep networks have been shown to capture specific underlying factors of variation in the data, while being robust to other variations [9].
In this paper we introduce the notion of conceptual domain adaptation, in which high-level concepts within source and target domains are identified and aligned in order to define a common feature space. When source and target domains contain concepts with known semantic similarity, but marked differences in low-level representations, traditional domain adaptation techniques using deep learning fail to unify both domains. Conceptual domain adaptation, in contrast, focuses on the alignment of high-level concepts only, which makes it possible to solve a wider range of problems by imposing fewer constraints on the relation between the two domains.
The paper is organized as follows. Section II gives background information on deep learning and de-noising auto-encoders. Section III reviews related work combining deep learning with domain adaptation. Our main methodology and fundamental ideas are described in Section IV. Section V explains why jointly training a model on source and target data does not guarantee proper node alignment. Section VI presents our framework for aligning high-level representations, leading to the notion of conceptual domain adaptation. Section VII describes the experimental settings, and Section VIII reports results and analysis. Finally, Section IX gives our conclusions.
II Background Information
Deep learning networks iteratively learn multiple layers of intermediate non-linear data representations (i.e., data abstractions). Each layer contains a set of nodes that compute a non-linear combination of the output values of the nodes in the adjacent layer below. Although, according to the Universal Approximation Theorem [10], a feed-forward neural network with only one hidden layer and sufficient hidden units is able to approximate any continuous function, deep architectures bring added benefits, such as feature re-use and feature abstraction. Re-using features not only reduces the number of computational nodes, it also reduces the number of parameters of the model, and thus the need for more samples. Furthermore, abstract features emerging from the network tend to show more resilience to data variations.
II-A Stacked De-Noising Auto-Encoders
An auto-encoder is a three-layer neural network consisting of an input layer $x$, a middle (hidden) layer $h$, and an output layer $\hat{x}$. The network encodes data in the middle layer, $h = f(x)$, and decodes data in the output layer, $\hat{x} = g(h)$. The output and hidden layers are composed of processing units (e.g., logistic sigmoid functions).
The goal of an auto-encoder is to learn the hidden layer $h$ (i.e., the weights performing the coding step) by reconstructing the input at the output. This is achieved by minimizing a loss function (i.e., reconstruction error) $L(x, g(f(x)))$. The minimization is performed using gradient descent by iteratively updating the weights of the encoder and the decoder. In de-noising auto-encoders, the network encodes a corrupted version of the input while trying to reconstruct the original, uncorrupted input. Accordingly, the network is forced to capture statistical dependencies between input features to filter out noise.
In order to capture complex feature abstractions, multiple layers of non-linear units are required. This is attainable using stacked de-noising auto-encoders (SDAE) [12], which amount to a layer-wise training of multiple auto-encoders. The input of each auto-encoder is the hidden layer of the auto-encoder trained in the previous iteration. Here we refer to them as deep auto-encoders.
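The layer-wise training described above can be illustrated with a minimal sketch in pure Python. It is not the implementation used in this paper: the masking-noise corruption, the squared-error loss, the per-sample updates, and all names (`DenoisingAutoEncoder`, `train_stack`, `level`, `lr`) are assumptions chosen for brevity.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def corrupt(x, level, rng):
    """Masking noise (an assumed corruption): zero each input with probability `level`."""
    return [0.0 if rng.random() < level else v for v in x]


class DenoisingAutoEncoder:
    def __init__(self, n_in, n_hidden, rng):
        # Encoder weights W (n_hidden x n_in) and decoder weights V (n_in x n_hidden).
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.V = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b = [0.0] * n_hidden  # encoder biases
        self.c = [0.0] * n_in      # decoder biases

    def encode(self, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.W, self.b)]

    def decode(self, h):
        return [sigmoid(sum(w * v for w, v in zip(row, h)) + c)
                for row, c in zip(self.V, self.c)]

    def train_step(self, x, level, lr, rng):
        """One gradient step on 0.5 * ||decode(encode(corrupt(x))) - x||^2."""
        x_tilde = corrupt(x, level, rng)
        h = self.encode(x_tilde)
        x_hat = self.decode(h)
        # Error signals at the output and hidden pre-activations (computed before any update).
        d_out = [(xh - xi) * xh * (1.0 - xh) for xh, xi in zip(x_hat, x)]
        d_hid = [sum(self.V[k][j] * d_out[k] for k in range(len(x))) * h[j] * (1.0 - h[j])
                 for j in range(len(h))]
        for k in range(len(x)):
            for j in range(len(h)):
                self.V[k][j] -= lr * d_out[k] * h[j]
            self.c[k] -= lr * d_out[k]
        for j in range(len(h)):
            for i in range(len(x)):
                self.W[j][i] -= lr * d_hid[j] * x_tilde[i]
            self.b[j] -= lr * d_hid[j]
        return 0.5 * sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))


def train_stack(data, hidden_sizes, level=0.2, lr=0.5, epochs=50, seed=0):
    """Train one auto-encoder per layer; each layer's hidden codes feed the next layer."""
    rng = random.Random(seed)
    layers, inputs = [], data
    for n_hidden in hidden_sizes:
        dae = DenoisingAutoEncoder(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for x in inputs:
                dae.train_step(x, level, lr, rng)
        layers.append(dae)
        inputs = [dae.encode(x) for x in inputs]
    return layers, inputs
```

Calling `train_stack(data, [3, 2])` on a list of equal-length input vectors returns the trained layers together with the top-layer codes of the training data.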
III Related Work
The idea of transforming source and target data into a common space, as an effective solution to the distribution discrepancy problem in domain adaptation, has recently given rise to a surge of techniques using deep learning architectures. As an example, Glorot et al. (2011) proposed learning intermediate representations using stacked de-noising auto-encoders; the higher-level representation is learned using information from both source and target domains, and the classifier is finally trained in the new space using source data only. Following a similar approach, Chen et al. (2012) used marginalized stacked de-noising auto-encoders as an alternative architecture exhibiting lower computational costs and better scalability on high-dimensional feature spaces. Nguyen et al. (2015) proposed using a sparse and hierarchical network (DASH-N) for domain adaptation that is similarly trained using source and target data jointly. Ghifary et al. (2014a) proposed an alternative architecture that imposes sparse locally-connected weights in the bottom layer, in addition to the use of a sparsity regularizer.
Another research direction is to explore new cost functions during training. As an example, Ganin and Lempitsky (2014) proposed a deep network architecture to jointly learn a common representation space by minimizing reconstruction error and distribution discrepancy using a gradient reversal layer. Ghifary et al. (2014b) proposed an architecture that minimizes the maximum mean discrepancy between source and target distributions. Other work has focused on the use of regularizers during training. For example, Ghifary et al. (2016) proposed a multi-task learning architecture for domain adaptation composed of an encoder, followed by a source class predictor and a target reconstruction network; the target-reconstruction network works as a regularizer to prevent the source classifier from overfitting. A different direction is to alleviate the distribution discrepancy between source and target by shifting domains. Kan et al. (2015) proposed an architecture composed of one encoder followed by two decoders, one for each domain; the encoder maps source and target into a common space, while each decoder is responsible for its corresponding domain.
IV Conceptual Domain Adaptation
A high-level concept $c$ in domain $D$ can be captured by a pattern of output values at the top level of a deep network; it is an abstract entity that is assumed to carry a clear semantic meaning. For instance, in the domain of hand-written digits, seven is a high-level concept that carries the same meaning regardless of the writing form or style. More formally, we say that concept $c_S$ in domain $D_S$ (source domain) is correspondent to concept $c_T$ in domain $D_T$ (target domain) if and only if they carry the same semantic meaning.
In this paper, we will refer to the representation $r$ of a concept $c$ as a binary vector corresponding to the output of the last hidden layer $H$ of a deep auto-encoder, after performing a layer-wise training of multiple auto-encoders (Section II). Vector $r$ is obtained by applying a step function to each node in $H$. Specifically, assuming $h_i$ is the $i$th node in $H$, and $o_i$ is the output of node $h_i$, then $r_i = 1$ if $o_i$ exceeds the threshold of the step function, and $r_i = 0$ otherwise.
Concepts can be represented in two main forms: using local or distributed representations [19]. In local representations, activation of one hidden unit is necessary to represent the concept; in distributed representations, the concept is represented by an activation pattern over more than one hidden unit. In higher layers of deep networks, representations tend to be more local, rather than distributed. In this paper we consider (strictly) local representations of high-level concepts defined as follows:
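Definition 1 (Strictly local representation).
A representation $r$ of concept $c$ is strictly local if and only if exactly one of its hidden units is active, i.e., $\sum_i r_i = 1$, where $r$ is one representation of concept $c$ and $r_i$ is the $i$th hidden unit of representation $r$.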
Definition 2 (Aligned local representations).
Assuming (strictly) local representations, representation $r_S$ of concept $c_S$ in domain $D_S$ is aligned to representation $r_T$ of its correspondent concept $c_T$ in domain $D_T$ if and only if the activating unit of $r_S$ is also activated in $r_T$.
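The binarization step and the two definitions above can be checked mechanically. The sketch below is illustrative only: the threshold value of 0.5 and the helper names (`binarize`, `is_strictly_local`, `is_aligned`) are assumptions, not values or identifiers taken from this paper.

```python
def binarize(outputs, threshold=0.5):
    """Step function applied to each top-layer output (threshold is an assumed value)."""
    return [1 if o > threshold else 0 for o in outputs]


def is_strictly_local(r):
    """A binary representation is strictly local when exactly one unit is active."""
    return sum(r) == 1


def is_aligned(r_a, r_b):
    """Both representations are strictly local and share the same active unit."""
    if not (is_strictly_local(r_a) and is_strictly_local(r_b)):
        return False
    return r_b[r_a.index(1)] == 1


# Hypothetical top-layer outputs for the concept "seven" in two domains.
seven_source = binarize([0.1, 0.9, 0.2])
seven_target = binarize([0.8, 0.1, 0.3])
print(is_aligned(seven_source, seven_target))
```

In this toy case the two representations are each strictly local but activate different units, so they are not aligned in the sense of Definition 2.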
While many current approaches to domain adaptation are based on projecting source and target data into a new common space, our proposed approach extracts high-level concepts from each domain separately, followed by an alignment of correspondent concepts. As an illustration, consider the domain of hand-written digits and rotated hand-written digits. Here, concept seven in the first domain corresponds to the concept of rotated seven in the second domain, as they clearly carry the same semantic meaning. In order to perform domain adaptation, the two corresponding concepts must be aligned. Assuming a hierarchical representation of data (as is the case with deep neural networks), a domain-adaptation solution is considered conceptual –as opposed to representational– if the alignment between high-level correspondent concepts in the two domains does not rely on their low-level representations. Following up with the example described above, a representational approach to domain adaptation must rely on the lower-level pixel-wise relationship between seven and rotated seven to align the two concepts. In contrast, conceptual domain adaptation seeks to align the two concepts while discarding information from low-level representational properties (e.g. pixel information).
We contend that representational domain adaptation imposes a stringent limitation on the range of solvable problems, by focusing on those situations with low-level similarities between correspondent concepts across domains. By relaxing this limitation, conceptual domain adaptation leads to a less-constrained form of knowledge transfer across similar domains.
V The Problem Behind the Joint Training Approach
An important step in conceptual domain adaptation is to align correspondent concepts $c_S$ (source) and $c_T$ (target) such that the active hidden unit in representation $r_S$ of $c_S$ is also active in representation $r_T$ of $c_T$ (see Definition 2 for local alignment). This stands in stark contrast to previous approaches where no alignment takes place; an implicit assumption is made that the hidden unit capturing concept $c_S$ is also able to simultaneously capture its correspondent concept $c_T$.
To better understand the problem behind jointly training source and target domains without any form of concept alignment, consider that deep networks (e.g., deep auto-encoders) continuously update weights across the whole network, such that the formation of high-level concepts is dependent on the low-level representation of their constituent patterns. Now, in order to align correspondent concepts in the same hidden units using solely joint training, we would require that correspondent concepts exhibit lower-level representational similarities. This is rarely the case in real-world applications. A more common scenario occurs when correspondent concepts from two different domains exhibit an inherent lower-level representational discrepancy (e.g., the image of digit seven and the image of rotated digit seven). In this case, the popular joint training approach for domain adaptation could capture correspondent concepts in two different hidden units, which would lead to an inevitable misalignment between the two semantically identical concepts.
V-A The Effect of Concept Misalignment
A misalignment in the (local) representation of correspondent concepts between two domains (after joint training) directly affects the performance of domain adaptation techniques. Consider concept $c_S$ in the source domain represented by the activation of hidden unit $h_i$ in local representation $r_S$; correspondent concept $c_T$ in the target domain is also trained using the same network and the two concepts are not locally aligned. The misalignment will result in deactivation of hidden unit $h_i$ in $r_T$ (the local representation of $c_T$).
Figure 2 illustrates the misalignment problem using strictly local representations of high level concepts, under the joint-training approach.
We now introduce a search framework that provides a solution to the problem described above by adjusting high-level representations trained with deep learning (deep auto-encoders).
VI Alignment and Adjustment
Our methodology follows three main steps: (i) learning high-level concepts from source and target domains (independently) using deep auto-encoders, (ii) aligning correspondent concepts in source and target by adjusting their representations, and (iii) building a classifier on the aligned representations. Our main contribution lies in the second step, which we explain next.
VI-A Concept Alignment under the Mapping Matrix
Our goal is to have correspondent concepts from target and source domains fall into the same hidden units along the upper layer of a deep auto-encoder. As illustrated in Figure 3(a), the target representation is adjusted by a mapping function that ensures concept correspondence with the source representation. Specifically, the target data is adjusted by defining a mapping function over hidden units (here, the nodes on the top hidden layer of a deep auto-encoder architecture). The mapping function for each hidden unit gives a new representation as follows:
$$h'_j = \sum_{i} w_{ij}\, h_i \qquad (5)$$
where $h_j$ is the node being adjusted, and $h'_j$ is the new representation for that node. Each weight $w_{ij}$ is restricted to a binary value, and at most one $w_{ij} = 1$ for each $j$. In essence, the new representation $h'_j$ will take the value of the old representation $h_i$ as specified by the position $i$ where $w_{ij} = 1$, or will take the value of $0$ if $w_{ij} = 0$ for all $i$. As explained below, this can be seen as a mapping function intended to align the target and source representations.
The formulation above can be rephrased by defining a mapping matrix $M$ (Figure 3(b)) with the number of rows and columns corresponding to the number of nodes on the target and source representations respectively. As in Equation 5, activation of an element of the mapping matrix corresponds to a mapping from the specified target unit to the corresponding source unit. Each column can be activated by at most one of the units in the target representation (rows in the matrix).
Using mapping matrix $M$, the aligned target representation can be computed through a linear transformation:
$$X'_T = X_T M$$
where $X'_T$ and $X_T$ are the new and original training samples (target domain). The intuition behind this type of adjustment is based on the assumption that the (separate) training of the source and target deep auto-encoders has already extracted meaningful concepts from the data in both domains. As a result, an adjustment as proposed above should properly align correspondent high-level concepts.
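As a small worked illustration of this linear adjustment, the sketch below applies a binary mapping matrix to binarized target representations, with rows indexing target units and columns indexing source units. The helper names (`is_valid_mapping`, `apply_mapping`) and the toy values are hypothetical.

```python
def is_valid_mapping(M):
    """Binary entries, and each column (source unit) receives at most one target unit."""
    binary = all(v in (0, 1) for row in M for v in row)
    one_per_column = all(sum(row[j] for row in M) <= 1 for j in range(len(M[0])))
    return binary and one_per_column


def apply_mapping(X_t, M):
    """Row-vector product of each target sample with M.

    Column j of the result copies the target unit mapped to source unit j,
    or is 0 when no target unit is mapped to it.
    """
    n_rows, n_cols = len(M), len(M[0])
    return [[sum(x[i] * M[i][j] for i in range(n_rows)) for j in range(n_cols)]
            for x in X_t]


# Target unit 0 is mapped to source unit 1, target unit 1 to source unit 0,
# and target unit 2 stays in place.
M = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
X_t = [[1, 0, 0],
       [0, 0, 1]]
print(apply_mapping(X_t, M))  # [[0, 1, 0], [0, 0, 1]]
```

The first target sample, originally active on unit 0, becomes active on unit 1 after the adjustment, which is where the source network is assumed to encode the same concept.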
At this point, the main challenge is to find an optimal mapping matrix $M^*$ as follows:
$$M^* = \arg\max_{M} f(M; X_S, X_T)$$
where $X_S$ and $X_T$ are the source and target samples, and $f$ quantifies the goodness of matrix $M$.
VI-B Search for the Mapping Matrix
Finding matrix $M$ requires exploring a space of possible solutions, and a metric to quantify the quality of each solution. Because each of the $n_S$ columns of $M$ can be activated by one of the $n_T$ target units or by none, the total number of possible solutions (combinations) for a mapping matrix is $(n_T + 1)^{n_S}$. To handle such a large space, our work employs genetic algorithms [20]. Each mapping matrix is encoded as an individual (offspring) through a vector of integers; Figure 3(b) illustrates the mapping matrix and the corresponding encoding of an individual solution. We use three main operations (elite selection, crossover, and mutation) to generate the next population at each new iteration. The fitness value is obtained by training a $k$-NN classifier on the high-level representation of the source data and testing the performance of that classifier on the adjusted target data. We take accuracy as the fitness value. Pseudo-code to compute the fitness value of matrix $M$ is shown in Algorithm 1.
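The search just described can be sketched as follows. This is an illustration under stated assumptions, not Algorithm 1 itself: the integer encoding (entry `j` holds the index of the target unit mapped to source unit `j`, or `-1` for none), the squared Euclidean distance, one-point crossover, and all parameter values (`pop_size`, `elite_frac`, `p_mut`, `generations`) are hypothetical choices.

```python
import random
from collections import Counter


def decode(genome, n_target):
    """Expand the integer vector into a binary n_target x n_source mapping matrix."""
    M = [[0] * len(genome) for _ in range(n_target)]
    for j, i in enumerate(genome):
        if i >= 0:
            M[i][j] = 1
    return M


def adjust(X_t, genome):
    """Equivalent to multiplying each target sample by decode(genome, n_target)."""
    return [[x[i] if i >= 0 else 0 for i in genome] for x in X_t]


def knn_predict(X_train, y_train, x, k):
    order = sorted(range(len(X_train)),
                   key=lambda n: sum((a - b) ** 2 for a, b in zip(X_train[n], x)))
    return Counter(y_train[n] for n in order[:k]).most_common(1)[0][0]


def fitness(genome, X_s, y_s, X_t, y_t, k=1):
    """Accuracy of a k-NN classifier fit on source data, tested on adjusted target data."""
    X_adj = adjust(X_t, genome)
    hits = sum(knn_predict(X_s, y_s, x, k) == y for x, y in zip(X_adj, y_t))
    return hits / len(y_t)


def evolve(X_s, y_s, X_t, y_t, n_target, pop_size=20, elite_frac=0.2,
           p_mut=0.1, generations=30, seed=0):
    rng = random.Random(seed)
    n_source = len(X_s[0])

    def score(genome):
        return fitness(genome, X_s, y_s, X_t, y_t)

    # Start from the direct (identity) mapping plus random individuals.
    identity = [j if j < n_target else -1 for j in range(n_source)]
    pop = [identity] + [[rng.randrange(-1, n_target) for _ in range(n_source)]
                        for _ in range(pop_size - 1)]
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elite = pop[:n_elite]                      # elite selection
        parents = pop[:max(2, pop_size // 2)]
        children = []
        while len(children) < pop_size - n_elite:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_source) if n_source > 1 else 0
            child = a[:cut] + b[cut:]              # one-point crossover
            child = [rng.randrange(-1, n_target) if rng.random() < p_mut else g
                     for g in child]               # mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=score)
```

Because the elite individuals are carried over unchanged, the best fitness in the population never decreases from one generation to the next, and the returned individual is at least as good as the direct mapping it started from.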
VII Experimental Settings
Our experiments used a stacked de-noising auto-encoder [12] architecture for model training, and a modified version of a Matlab implementation of deep networks (https://github.com/rasmusbergpalm/DeepLearnToolbox). The model comprised nine layers of de-noising auto-encoders, each using batch gradient descent for optimization. Training stopped when no improvement was achieved over a fixed number of recent iterations, or when the number of iterations exceeded a preset threshold. The learning rate was defined following the method described by Bengio [21], as follows:
$$\epsilon_t = \frac{\epsilon_0\, \tau}{\max(t, \tau)}$$
where $\epsilon_0$ is the initial learning rate, $t$ is the current iteration, and $\tau$ is the minimum number of iterations to reduce the learning rate. We adopted a layer-wise search for hyper-parameters; specifically, we performed a grid search on sets of hyper-parameter values and opted for the best setting. The size of each hidden layer was chosen from a set of candidate sizes defined relative to the size of the previous layer. Similarly, the learning-rate and corruption-level hyper-parameters were chosen from predefined sets of values.
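The schedule recommended in Bengio's practical guide is commonly written as $\epsilon_t = \epsilon_0 \tau / \max(t, \tau)$: the rate stays constant for the first $\tau$ iterations and then decays as $1/t$. Assuming that form, a minimal sketch is given below; the function name and the example values are hypothetical and are not the settings used in the experiments.

```python
def learning_rate(t, eps0, tau):
    """Constant rate eps0 for the first tau iterations, then decay proportional to 1/t."""
    return eps0 * tau / max(t, tau)


# Example with assumed values eps0 = 0.1 and tau = 100.
for t in (0, 50, 100, 200, 400):
    print(t, learning_rate(t, eps0=0.1, tau=100))
```

With these example values the rate holds at its initial value through iteration 100 and has halved by iteration 200.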
Regarding the genetic algorithm, at each iteration we kept a fixed fraction of the population as elite instances; the remainder was generated using crossover and mutation. The population size was kept constant across iterations. The algorithm stopped when the best score had not improved over a fixed number of recent iterations. The nearest-neighbor classifier used a fixed number of neighbors $k$ and a fixed distance measure.
We applied domain adaptation to the digit recognition task and used the following datasets: MNIST (http://yann.lecun.com/exdb/mnist/), USPS (http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html), and rotated USPS. The datasets were processed following the standard format of grayscale images.
VIII Results and Analysis
VIII-A The Role of Adjustment
For the first batch of experiments, we tested the adjustment approach on two domain-adaptation scenarios including MNIST to USPS and MNIST to rotated USPS. Based on accuracy performance, we limited the architecture to 5 layers, with each layer being 2/3 of the size of the previous layer. Figure 4 shows the improvement gained with our proposed adjustment during domain adaptation on each scenario.
In the MNIST to USPS scenario, correspondent concepts between the two domains have stronger representational similarities compared to the other scenario (MNIST to rotated USPS); the proposed approach shows only a small performance gain. On the MNIST to rotated USPS scenario, the adjustment approach displays a significant improvement due to the high degree of low-level representational discrepancy. Overall, the proposed search-based framework shows performance gain, regardless of the presence –or lack of– low-level representational discrepancy.
VIII-B The Role of Depth in the Auto-encoder
In order to assess the effectiveness of the proposed mapping with respect to the number of hidden layers in the deep auto-encoder, we captured the deviation of $M$ from the identity matrix $I$ (where we assume $M$ and $I$ are square matrices). In the extreme case where $M = I$, the mapping is direct: concepts are represented by the same hidden unit. Figure 5 (top) shows the adjustment degree for each number of layers in a deep auto-encoder, on the two domain-adaptation scenarios described above. Adjustment degree is the percentage of hidden nodes in the new representation that had to be changed (adjusted) to be correctly mapped from the corresponding hidden unit in the old target representation.
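One way to read the adjustment degree is as the percentage of columns of a square mapping matrix that do not carry the identity mapping. The sketch below follows that reading; it is an interpretation of the definition above, and the function name is hypothetical.

```python
def adjustment_degree(M):
    """Percentage of units whose mapping differs from the direct (identity) mapping."""
    n = len(M)
    changed = sum(1 for j in range(n) if M[j][j] != 1)
    return 100.0 * changed / n


identity = [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]
swapped = [[0, 1, 0, 0],   # units 0 and 1 exchange roles
           [1, 0, 0, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 1]]
print(adjustment_degree(identity), adjustment_degree(swapped))  # 0.0 50.0
```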
Results show how lower layers lead to an increase of direct mappings (more identity mappings). This is to be expected: low-level features contain more “representational” rather than “conceptual” relationships; here representational alignments adopted by traditional joint-training approaches suffice to achieve good results.
Figure 5 (bottom) compares accuracy versus depth (number of layers). We observe that when our proposed adjustment is invoked, accuracy exhibits significant variation, and there is usually an optimal value that maximizes performance. However, the performance seen when using solely joint training does not show much variation with depth. The same behavior is observed in the MNIST to USPS scenario. We conclude that conceptual domain adaptation –as compared to joint-training– justifies a search for a performance maximum, with depth as the control parameter. Figure 5 (bottom) shows a decrease in accuracy using conceptual domain adaptation past layer 5. This can be explained by the increase in adjustment degree (Figure 5) at high layers, where more adjustments are required to align correspondent units. This is also an effect of our 2/3 rule in the design of the network architecture: while reducing the size of the network, more information is lost and performance degradation accrues.
VIII-C The Role of Jointly Learning New Concepts
We now compare two approaches to test different learning strategies: (i) jointly learning the source and target concepts in the same network, (ii) separately learning the source and target concepts in different networks, and a third case (iii) where the representation of each data point is constructed by concatenating the representation obtained from the previous two approaches. For each case, the adjustment was performed by constructing the mapping matrix. The size of the mapping matrix is dependent on the size of the network used for training the source and target data in each case as follows:
In case (i), we follow the same approach as previous experiments, where we search for mapping matrix $M$; we assume the same number of rows and columns (corresponding to the number of hidden units in the highest layer of the network). In case (ii), the size of $M$ is $n_T \times n_S$, where $n_T$ and $n_S$ correspond to the number of nodes along the highest layer of the deep auto-encoders trained with target and source datasets respectively ($n_T$ and $n_S$ may differ). Finally, in case (iii), $M$ is a block matrix initialized with four sub-matrices, where $M_{\text{joint}}$ is the sub-matrix corresponding to the joint training of source and target, and $M_{\text{sep}}$ corresponds to the case where source and target are learned separately. The $M_{\text{joint}}$ and $M_{\text{sep}}$ matrices are initialized as a diagonal matrix and a random matrix respectively. A diagonal mapping matrix is one where only the diagonal elements are activated, corresponding to a direct mapping between target and source hidden units. A random matrix is one where the activated elements are randomly distributed; each source hidden unit is randomly mapped to one or none of the target units. Figure 6 compares accuracy among all three approaches.
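A possible initialization for case (iii) is sketched below. The block layout is an assumption made for illustration: the joint block is placed at the top left as a diagonal matrix, the separately trained block at the bottom right as a random matrix, and the two remaining blocks are set to zero. The function names and sizes are hypothetical.

```python
import random


def diagonal_block(n):
    """Direct mapping: target unit i feeds source unit i."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def random_block(n_rows, n_cols, rng):
    """Each column (source unit) is mapped to one random row (target unit) or to none."""
    block = [[0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        i = rng.randrange(-1, n_rows)  # -1 means no target unit is mapped
        if i >= 0:
            block[i][j] = 1
    return block


def initial_concatenated_mapping(n_joint, n_target, n_source, seed=0):
    """Assemble the assumed block layout: [[diagonal, 0], [0, random]]."""
    rng = random.Random(seed)
    joint = diagonal_block(n_joint)
    separate = random_block(n_target, n_source, rng)
    top = [row + [0] * n_source for row in joint]
    bottom = [[0] * n_joint + row for row in separate]
    return top + bottom


M0 = initial_concatenated_mapping(n_joint=3, n_target=4, n_source=2)
print(len(M0), len(M0[0]))  # 7 5
```

Under this layout the matrix has one row per unit of the concatenated target representation and one column per unit of the concatenated source representation, and every column still receives at most one target unit.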
Results show that separately training source and target improves accuracy significantly, compared to using a single joint network. The reason can be traced to the inability of the joint network to capture correspondent concepts for each domain separately when there is low representational similarity. We conclude that a single joint network architecture is unable to map correspondent concepts under low representational relationships. In contrast, training a separate network for each domain facilitates capturing correspondent concepts, since there is no interference during learning. Results also show it is possible for a joint network to capture unique concepts common to both domains (under representational similarities). As illustrated in Figure 6, a reasonable approach is to use the concatenation of both representations to achieve high accuracy.
Apart from semantic similarity, MNIST and (rotated) USPS have low-level similarities. To add more complexity to the adaptation process, we have experimented with the braille dataset (images of four dots colored black or white based on each digit) and the street-view house numbers (SVHN) dataset (http://ufldl.stanford.edu/housenumbers/). We also compare our approach with an additional domain-adaptation method known as subspace alignment [23]. As shown in Figure 7(a) and Figure 7(b), conceptual domain adaptation outperforms joint training and subspace alignment in both scenarios.
IX Conclusions
This paper describes an approach to domain adaptation that employs deep learning to extract high-level concepts from source and target domains, while relaxing the alignment dependency on lower-level representations of correspondent concepts. The proposed alignment is based on adjusting the final (high-level) representation of the target data by matching the corresponding representation on the source data.
Experimental results show that our approach brings significant gains in accuracy, particularly under scenarios with high discrepancy in low-level representations. Increasing the depth of the network (i.e., of the deep auto-encoder) leads to more adjustments in order to align correspondent concepts. An abundance of representational discrepancy leads to more adjustments. Finally, we show that a combined approach that concatenates the representations of both the joint network and each of the domain networks yields best results on both experimental settings.
References
[1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine Learning, vol. 79, no. 1, pp. 151–175, 2010.
[2] H. Daumé III and D. Marcu, “Domain adaptation for statistical classifiers,” Journal of Artificial Intelligence Research, vol. 26, pp. 101–126, 2006.
[3] Y. Mansour, M. Mohri, and A. Rostamizadeh, “Domain adaptation: Learning bounds and algorithms,” arXiv:0902.3430, 2009.
[4] Y. Shi and F. Sha, “Information-theoretical learning of discriminative clusters for unsupervised domain adaptation,” in Proceedings of the 29th International Conference on Machine Learning, ser. ICML’12. Omnipress, 2012, pp. 1275–1282.
[5] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang, “Domain adaptation under target and conditional shift,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 819–827.
[6] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in International Conference on Machine Learning, 2011, pp. 513–520.
[7] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
[8] M. Ghifary, W. B. Kleijn, and M. Zhang, “Domain adaptive neural networks for object recognition,” in Pacific Rim International Conference on Artificial Intelligence. Springer, 2014b, pp. 898–904.
[9] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, “Measuring invariances in deep networks,” in Advances in Neural Information Processing Systems, 2009, pp. 646–654.
[10] B. C. Csáji, “Approximation with artificial neural networks,” Faculty of Sciences, Eötvös Loránd University, Hungary, vol. 24, p. 48, 2001.
[11] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[12] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” JMLR, vol. 11, no. Dec, pp. 3371–3408, 2010.
[13] M. Chen, Z. Xu, K. Weinberger, and F. Sha, “Marginalized denoising autoencoders for domain adaptation,” arXiv:1206.4683, 2012.
[14] H. V. Nguyen, H. T. Ho, V. M. Patel, and R. Chellappa, “DASH-N: Joint hierarchical domain adaptation and feature learning,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5479–5491, 2015.
[15] M. Ghifary, W. B. Kleijn, and M. Zhang, “Deep hybrid networks with good out-of-sample object recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2014a, pp. 5437–5441.
[16] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, “Deep reconstruction-classification networks for unsupervised domain adaptation,” in European Conference on Computer Vision. Springer, 2016, pp. 597–613.
[17] M. Kan, S. Shan, and X. Chen, “Bi-shifting auto-encoder for unsupervised domain adaptation,” in IEEE International Conference on Computer Vision, 2015, pp. 3846–3854.
[18] R. A. Wilson and F. C. Keil, The MIT Encyclopedia of the Cognitive Sciences. MIT Press, 2001.
[19] S. Thorpe, “Localized versus distributed representations,” in The Handbook of Brain Theory and Neural Networks. MIT Press, 1998, pp. 549–552.
[20] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing. Springer, 2003, vol. 53.
[21] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[23] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Subspace alignment for domain adaptation,” arXiv preprint arXiv:1409.5241, 2014.