1 Introduction
A typical assumption in learning-based visual recognition is that training and test data follow the same distribution because they belong to the same domain. In practical applications, this assumption is easily violated because the distribution of training data from the source domain diverges from that of test data from the target domain. Such domain shift [BenDavid et al.2010] is a universal issue in applications such as image recognition under varying lighting conditions and camera shooting angles, and it challenges traditional recognition models. Domain adaptation [Pan and Yang2010] addresses this issue by training the model on data from both domains so as to transfer discriminative knowledge from the source to the target.
Based on the amount of available labeled samples in the target domain, domain adaptation can be performed in two scenarios [Patel et al.2015]: semi-supervised domain adaptation (SDA) and unsupervised domain adaptation (UDA). In SDA, a small number of target samples carry class labels, so it is essential to learn the discriminative model with the assistance of labeled source samples. In UDA, no target labels are available, so the method relies on modeling the distribution relation between domains to achieve cross-domain recognition. In this paper, we aim to tackle the domain shift problem in the UDA scenario, which is more challenging and more widespread in practice.
Instance adaptation [Mansour et al.2009, Yu and Szepesvári2012] assigns importance weights to source samples in the objective function to match the data distributions of the source and target domains. This principle works well only when the support of the target distribution is contained in that of the source distribution. Feature adaptation seeks domain-invariant representations of samples so that their distributions coincide and the discriminative information is preserved. The domain-invariant feature can be obtained through linear projection [Fernando et al.2013, Sun et al.2016], kernel mapping [Gong et al.2012, Zhang et al.2018], sparse coding [Shekhar et al.2013, Yang et al.2018, Tang et al.2018], and metric learning [Kulis et al.2011, Herath et al.2017]. Classifier adaptation retrains a predefined classifier by learning the classifier parameters to guarantee good generalization in the target domain [Duan et al.2009, Xu et al.2018]. Besides the aforementioned shallow-learning-based domain adaptation methods, domain adaptation via deep learning [Ganin and Lempitsky2015, Long et al.2015, Bousmalis et al.2016, Tzeng et al.2017, Saito et al.2018, Hoffman et al.2018] achieves notable improvement and has become increasingly popular. The deep DA methods extract nonlinear domain-invariant features and train domain-robust classifiers in an end-to-end manner.
Most shallow UDA methods treat data as vectors, meaning that multi-dimensional data such as images and videos, or their features, need to be converted to vectors beforehand. This operation raises two obstacles to domain adaptation: (1) vectorization breaks the internal structure of the data, which has been shown to be essential for recognition [AjaFernndez et al.2009]; (2) vectorization increases the risk of model overfitting because the resulting vector is typically long. Deep-learning-based domain adaptation methods also suffer from the loss of structure information, because feature maps from convolutional layers need to be converted into vectors before they are fed into the fully connected layers. In addition, the number of parameters in the fully connected layers becomes large when the feature map is transformed from a tensor to a vector, increasing the overfitting risk of the deep model, especially when training data are insufficient.
To address the aforementioned issues, we propose a Structured Discriminative Tensor Dictionary Learning (SDTDL) approach for unsupervised domain adaptation. SDTDL seeks a data representation that is discriminative and robust to domain shift by separating the domain factor and the class factor in tensor space (Fig. 1). Specifically, a sample is factorized into a domain part and a class part, characterized by a domain-specific subdictionary and a class-specific subdictionary, respectively. The resulting representation is a block-diagonal sparse tensor whose nonzero blocks consist of the domain-specific representation and the class-specific representation. Classification is accomplished based on the reconstruction error associated with the class-specific representation.
Overall, our main contributions are threefold: (1) we propose a discriminative dictionary learning approach based on a tensor model for UDA. The method preserves the internal structure information of the data and is able to tackle the small-sample-size problem. (2) We model the domain factor and the class factor separately to build a structured dictionary that guarantees the discriminativeness and domain invariance of the feature. (3) Extensive experiments on object recognition and digit recognition tasks demonstrate that the proposed SDTDL outperforms the existing shallow methods and achieves competitive results compared with deep learning approaches.
2 Related Work
Feature adaptation methods based on shallow learning include feature augmentation, feature alignment and feature transformation. The method of [Gopalan et al.2011] and GFK [Gong et al.2012] are two representative feature augmentation methods that use intermediate subspaces to model domain shift. Subspace Alignment (SA) [Fernando et al.2013] extracts linear features by aligning the subspaces of the source and target domains. The feature alignment idea is extended in CORAL [Sun et al.2016] through covariance recoloring. Feature transformation methods seek a common latent feature space in which source samples and target samples are indistinguishable. The features can be obtained by linear projection [Baktashmotlagh et al.2013, Long et al.2014] or nonlinear mapping [Aljundi et al.2015]. Most recently, TAISL [Lu et al.2017] was proposed to learn a tensor-form feature via Tucker tensor decomposition; it is the work most closely related to our method. In contrast, the proposed SDTDL is able to use the valuable label information in source samples and does not need to train a classifier, which improves both performance and efficiency in UDA.
Recently, deep convolutional neural network (CNN) based methods have been developed with promising performance. Domain-Adversarial Neural Network (DANN) [Ganin and Lempitsky2015] combines CNN and adversarial learning to achieve end-to-end unsupervised domain adaptation. DDC [Tzeng et al.2017] learns two feature extractors for the source and target domains respectively with GAN. DIFA [Volpi et al.2018] extends the feature augmentation principle to generative adversarial networks. As deep UDA methods require a large number of samples for parameter training, their effectiveness tends to be limited in the small-sample-size scenario. By comparison, the proposed SDTDL is more suitable for addressing the small-sample-size problem in domain adaptation, which is demonstrated by the experimental results in Sec. 5.
3 Notations and Background
Symbol  Description  Symbol  Description 

,  Tensor samples  Class labels  
,  Tensor dictionaries  ,  Factor matrices 
,  Sparse coefficients  Mode flattening of  
The stack of and  Product of with  
Identity matrix  Vector with all ones 
Tensor Preliminaries. Table 1 lists the symbols used in this paper. An th-order tensor is an -dimensional data array, with element denoted as . The squared Frobenius norm of is defined as . The mode flattening of reorders its elements into a matrix . The mode product of a tensor with a matrix , denoted as , performs matrix multiplication along the th mode; equivalently, it can be computed by a matrix multiplication with the mode flattening of followed by re-tensorization, which undoes the flattening. For conciseness and clarity, we denote the product of a tensor with a set of matrices by
(1) 
Similarly, we define .
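The flattening and mode-product operations described above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in the paper: the function names `unfold`, `fold`, `mode_product` and `multi_mode_product` are hypothetical, modes are zero-indexed, and the column ordering of the flattened matrix follows NumPy's row-major layout, which may differ from the convention in [Kolda and Bader2009].

```python
import numpy as np

def unfold(tensor, mode):
    """Mode flattening: bring axis `mode` to the front, then reshape to a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold`: rebuild a tensor with the given full `shape`."""
    front = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(front), 0, mode)

def mode_product(tensor, matrix, mode):
    """Mode product: multiply `matrix` (J x I_mode) along axis `mode`,
    computed as matrix multiplication on the flattening plus re-tensorization."""
    new_shape = list(tensor.shape)
    new_shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, new_shape)

def multi_mode_product(tensor, matrices):
    """Product of a tensor with one matrix per mode (assumed reading of the shorthand in (1))."""
    out = tensor
    for mode, matrix in enumerate(matrices):
        out = mode_product(out, matrix, mode)
    return out
```

Because flattening only reorders elements, the squared Frobenius norm of a tensor equals that of any of its flattenings, which gives a convenient sanity check for such code.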
The Tucker decomposition of tensor is defined as
(2) 
where is a scalar, and is a rank-one tensor produced by the outer product of factor vectors. Given , the core tensor can be obtained as , where . Tucker decomposition can be written in matrix format as
(3) 
where . Note that the factor matrix in each mode satisfies the constraint .
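As an illustration of a Tucker decomposition with orthonormal factor matrices, the sketch below uses a truncated higher-order SVD, which is a standard way to obtain such factors but is not necessarily the procedure used in this paper. The names `hosvd` and `reconstruct` are hypothetical. The core tensor is obtained by multiplying the tensor with the transposed factor matrix in every mode, as stated above.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode flattening (row-major column order)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `matrix` (J x I_mode) along axis `mode` of `tensor`."""
    # tensordot puts the new axis first; move it back to position `mode`.
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def hosvd(tensor, ranks):
    """Truncated higher-order SVD: one orthonormal factor per mode plus a core tensor."""
    factors = []
    for mode, rank in enumerate(ranks):
        left, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(left[:, :rank])          # columns satisfy U^T U = I
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)  # core = X x_n U^(n)T for all n
    return core, factors

def reconstruct(core, factors):
    """Rebuild the tensor from the core and the factor matrices."""
    out = core
    for mode, factor in enumerate(factors):
        out = mode_product(out, factor, mode)
    return out
```

With full ranks the reconstruction is exact; with smaller ranks it is an approximation whose quality depends on how much energy the discarded singular vectors carry.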
Problem Definition. A domain is composed of a feature space with a marginal probability distribution , where . A task associated with a specific domain is defined by a label space and the conditional probability distribution , where . Domain adaptation considers a source domain and a target domain satisfying , and .
In this paper, we are given a set of labeled source samples , where is a th-order tensor and is its class label. We are also given a set of unlabeled target samples . We aim to infer the class label of by learning from the source and target samples.
4 The Proposed SDTDL
4.1 Formalization
For ease of understanding, we assume for now that the labels of target samples have already been predicted, and provide the details of label prediction and target sample selection in Sec. 4.3. We select a subset of target samples, based on their prediction confidence, as additional training samples to aid model training. The selected target samples from the th class are denoted as , and the source samples belonging to the th class are denoted as .
We model the generation process of crossdomain data as the combination of domain factor and class factor, in which a sample () is factorized as
(4) 
where is determined by the unique character of the domain from which is sampled and is determined by the semantic information of the class to which belongs.
In order to obtain “parsimonious” representations of th-order tensor samples, we propose to learn a structured tensor dictionary composed of factor matrices, i.e. . The structure of arises from the structure of each factor matrix. Specifically, is composed of a domain-specific subdictionary and a class-shared subdictionary , i.e. . In order to distinguish the source domain from the target domain, is further divided into a source-specific subdictionary and a target-specific subdictionary . This leads us to the following factorization of a source sample
(5) 
where is the sourcespecific subdictionary, and is the domain representation of in tensor format. Similarly, we have for target sample with the targetspecific subdictionary .
Model (4) indicates that is merely determined by the class factor, thus it is safe to assume that and can be represented over a shared subdictionary . Motivated by the success of structured discriminative dictionary learning in image classification [Yang et al.2011], we divide into a series of class-specific subdictionaries for discriminative representation. For from class , its tensor representation over is given by
(6) 
where is the classspecific representation. Similarly, we have , where provides the classspecific representation.
Based on our notation of and in Sec. 3, we define and as the representations of the source sample set and the target sample set over the shared of class , respectively. In order to correct the domain shift, the class-conditional distributions of the representations in the source and target domains should be aligned. Here we adopt Maximum Mean Discrepancy [Gretton et al.2012] to measure the distribution divergence, which gives
(7) 
where and . Beyond that, the intra-class variance of the representation should be small to facilitate discriminativeness. To that end, the following objective is to be minimized for the source domain
(8) 
where is produced by arranging duplicate so that and have the same size. In the same way, should be minimized for target domain. To satisfy both (7) and (8), we need to minimize the following objective
(9) 
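The two criteria behind (7)-(9), alignment of class-conditional means and small intra-class variance, can be illustrated numerically. The sketch below is an assumption-laden reading rather than the paper's exact objective: it uses the linear-kernel form of Maximum Mean Discrepancy (squared distance between mean representations), treats the first array axis as the sample index, and combines the terms by an unweighted sum. All function names are hypothetical.

```python
import numpy as np

def mean_discrepancy(src_reps, tgt_reps):
    """Linear-kernel MMD: squared distance between the mean source and
    mean target representations of one class (axis 0 indexes samples)."""
    diff = src_reps.mean(axis=0) - tgt_reps.mean(axis=0)
    return float(np.sum(diff ** 2))

def intra_class_scatter(reps):
    """Sum of squared deviations of each representation from the class mean."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    return float(np.sum(centered ** 2))

def discriminative_term(src_reps, tgt_reps):
    """Assumed combination: alignment term plus both intra-class scatter terms."""
    return (mean_discrepancy(src_reps, tgt_reps)
            + intra_class_scatter(src_reps)
            + intra_class_scatter(tgt_reps))
```

Note that with a single sample in a class the scatter term is exactly zero, since the sample coincides with its own class mean.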
By considering the above criteria together, our learning model can be written as
(10)  
s. t.  
The first and second terms are the fidelity of the reconstruction over the structured tensor dictionary. The third term can be viewed as discriminant analysis of the representation. determines the weighting of target domain compared with source domain, and trades off between fidelity term and discriminative term. The constraints require that the factor matrices in each mode are orthogonal matrices.
4.2 Optimization
In this section, we solve model (10) using an alternating optimization strategy, in which we seek the optimal solution for a subset of the variables while keeping all the others fixed at their values from the previous iteration, until convergence.
Optimize . With the fixed , the fidelity loss in regard to class can be written as , where . Considering all the source samples, model (10) becomes
(11)  
s. t. 
Model (11) is a typical best rank-(R1, R2, ..., RN) tensor approximation problem that can be solved by the HOOI algorithm [Lathauwer et al.2000].
Optimize . In the same way as in (11), the target-specific dictionary and the domain-specific representation of target samples are obtained by applying HOOI to the following optimization problem.
(12)  
s. t. 
where .
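For reference, a generic higher-order orthogonal iteration (HOOI) in the spirit of [Lathauwer et al.2000] can be sketched as below. It operates on an arbitrary input tensor; how the residual tensors in (11) and (12) are assembled is not reproduced here. The function name `hooi`, the fixed iteration count `n_iter`, and the HOSVD initialization are assumptions made for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode flattening (row-major column order)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `matrix` (J x I_mode) along axis `mode` of `tensor`."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def hooi(tensor, ranks, n_iter=20):
    """Alternately refine one orthonormal factor per mode, others fixed."""
    # Initialize each factor with the leading left singular vectors (HOSVD).
    factors = [np.linalg.svd(unfold(tensor, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    for _ in range(n_iter):
        for mode in range(tensor.ndim):
            # Project onto the current subspaces of all other modes.
            partial = tensor
            for other, factor in enumerate(factors):
                if other != mode:
                    partial = mode_product(partial, factor.T, other)
            left, _, _ = np.linalg.svd(unfold(partial, mode), full_matrices=False)
            factors[mode] = left[:, :ranks[mode]]
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)
    return core, factors
```

Each inner update solves the subproblem for one mode exactly, so the fit is non-decreasing across iterations; a practical implementation would typically stop on a tolerance rather than a fixed count.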
Optimize . We seek the optimal subdictionary class by class, so the model for class is
(13)  
s. t. 
where , .
We adopt the alternating optimization strategy in [Lathauwer et al.2000] to update and in turn. With fixed and , the optimal is provided by Theorem 1. With fixed , the optimal is given by and .
Theorem 1.
Let be the augmented sample tensor of class , which is generated by concatenating and along the th mode. Define matrix as
(14) 
where and are identity matrices, and are column vectors of all ones. Let be the mode flattening matrix of . Then, the optimal in (13) is given by , whose columns are the eigenvectors corresponding to the first largest eigenvalues of the following eigenvalue problem
(15) 
The proof is given in Appendix A.
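The eigenvector selection step in Theorem 1 can be sketched as follows. The symmetric matrix `sym_matrix` is a hypothetical stand-in for the matrix appearing in the eigenvalue problem (15), and the function name `top_eigenvectors` is an assumption. The sketch relies on the standard fact that, under an orthonormality constraint, the trace of the projected matrix is maximized by the eigenvectors of the largest eigenvalues.

```python
import numpy as np

def top_eigenvectors(sym_matrix, k):
    """Return the k eigenvectors of a symmetric matrix with the largest
    eigenvalues (as columns), together with those eigenvalues."""
    values, vectors = np.linalg.eigh(sym_matrix)   # eigenvalues in ascending order
    order = np.argsort(values)[::-1][:k]           # indices of the k largest
    return vectors[:, order], values[order]
```

For example, for a diagonal matrix with entries 1, 5 and 3 and k = 2, the selected eigenvalues are 5 and 3, and the trace of the projected matrix is their sum.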
4.3 Label Prediction and Sample Selection
The probability of belonging to class can be computed based on the fidelity error, i.e.
(16) 
where is the parameter of the exponential function, whose value is set to the median value of the denominator. The posterior probability can also be computed based on the deviation of from the centroid of class . Thus we have
(17) 
is adopted to replace for two reasons: (1) is more reliable because it is computed according to the real source labels; (2) pulling the target sample toward the corresponding class center in the source domain helps alleviate the domain shift.
Through the convex combination of the two kinds of probabilities, we can ultimately predict the class label of by
(18) 
In order to select target samples with reliable pseudo-labels for training, we sort the values of in descending order. Then we add the target samples associated with the highest posterior probabilities to the training sample set. The ratio of the selected target samples to the whole target sample set is a parameter of our model.
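The prediction and selection procedure of this subsection can be sketched as below. Several details are assumptions made for illustration only: errors are converted to probabilities with a normalized exponential whose scale is the median of the supplied errors, the combination weight is called `weight`, the selection ratio is called `ratio`, and selection is global rather than per class. None of these names or defaults come from the paper.

```python
import numpy as np

def error_to_probability(errors):
    """Map non-negative errors of shape (n_samples, n_classes) to row-wise
    probabilities: a smaller error gives a larger probability."""
    sigma = max(float(np.median(errors)), 1e-12)   # assumed scale choice
    logits = -errors / sigma
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def predict_and_select(recon_errors, centroid_dists, weight=0.5, ratio=0.5):
    """Combine the two probabilities convexly, predict labels, and return the
    indices of the most confident target samples."""
    probs = (weight * error_to_probability(recon_errors)
             + (1.0 - weight) * error_to_probability(centroid_dists))
    labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    n_keep = int(np.ceil(ratio * len(labels)))
    selected = np.argsort(-confidence)[:n_keep]    # descending confidence
    return labels, selected
```

A sample whose errors are nearly equal across classes receives a flat probability vector and is therefore ranked low, which is the intended effect of confidence-based pseudo-label selection.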
4.4 Initialization
The initialization process includes the following three steps. In step 1, the class-specific dictionaries are initialized by structured discriminant dictionary learning (e.g. [Yang et al.2011]) based on the labeled source samples, followed by computing the class-wise sparse codes and . Then the domain-specific dictionary is initialized through (11). In step 2, the target labels are predicted by (18) without the influence of , i.e. set in (16). Note that at this stage, although the estimated target labels may deviate from the actual ones, they provide a reasonable starting point for the iteration because of the underlying correlation between source and target. In step 3, we select a subset of target samples with their estimated labels to initialize the target-specific dictionary through (12). The proposed method is summarized in Algorithm 1.
5 Experiments
5.1 Experimental Setup
Datasets. We employ two public datasets to evaluate the proposed method. (1) The Office+Caltech dataset, released by [Gong et al.2012], consists of images of object classes from four domains, i.e., Amazon (A), Webcam (W), DSLR (D) and Caltech (C). Following [Gong et al.2012], we randomly select labeled images per class from Webcam/DSLR/Caltech and from Amazon as source samples, respectively. We run different trials and report the average and standard deviation of the recognition accuracy. For fair comparison, we use the tensor data provided by [Lu et al.2017], which is produced by the CONV5_3 layer of the VGG16 model. For other methods, we report the results in the literature. (2) To evaluate the performance of the methods in small-sample-size settings, we adopt the USPS+MNIST dataset released by [Long et al.2014], which consists of digit images from USPS and digit images in MNIST from to . These two domains thus lead to two DA tasks. The tensor samples are produced by the CONV5_3 layer of a VGG16 model pretrained with all the data in MNIST.
Baseline Models. The proposed SDTDL is compared with seven competitive UDA methods, i.e., No Adaptation (NA), TCA [Pan et al.2011], GFK [Gong et al.2012], DIP [Baktashmotlagh et al.2013], SA [Fernando et al.2013], LTSL [Shao et al.2014], LSSA [Aljundi et al.2015], and three state-of-the-art UDA methods, i.e., CORAL [Sun et al.2016], TAISL [Lu et al.2017] and JGSA [Zhang et al.2017]. For the digit recognition task, two deep UDA methods, DANN [Ganin and Lempitsky2015] and DDC [Tzeng et al.2017], are added to the comparison to evaluate SDTDL in the small-sample-size scenario.
Parameter Settings. The optimal parameters of SDTDL are set empirically based on grid search. Specifically, for object recognition, the parameters are set as: , , , , . For the digit recognition task, the parameters are set as: , , , , . The parameters of the other methods in comparison are set according to the corresponding papers.
5.2 Experimental Results
Feature Visualization.
To qualitatively evaluate the discriminativeness and robustness to domain shift of the feature extracted by SDTDL, we visualize the feature embeddings in the domain pair Webcam to Caltech (W→C). We compare SDTDL with CORAL, TAISL and JGSA in terms of the low-dimensional scatter plot given by t-distributed stochastic neighbor embedding (t-SNE) [van der Maaten and Hinton2008]. Fig. 2 (a-d) illustrate the visualized distributions of the features corresponding to source and target samples. Compared with the other baselines, the features extracted by SDTDL are more likely to form separate clusters associated with the categories. For both source and target samples, the intra-class scatter is small and the inter-class scatter is large, indicating that SDTDL is able to guarantee that the feature is discriminative. Besides, the distributions of source samples and target samples are aligned for each category, which suggests that our method can suppress the interference of the domain factor with the transfer of discriminative information from the source domain to the target domain.
Method  C→A  C→W  C→D  A→C  A→W  A→D  W→C  W→A  W→D  D→C  D→A  D→W  MEAN 

NA  89.0(2.0)  79.4(2.7)  86.2(4.0)  77.3(1.8)  74.6(3.1)  82.8(2.2)  63.7(2.1)  74.0(2.5)  94.9(2.4)  70.5(1.9)  81.1(1.9)  91.1(1.7)  80.4 
TCA  78.1(6.1)  69.0(6.6)  74.3(5.2)  56.7(4.5)  55.5(6.4)  59.9(6.7)  54.7(3.8)  68.3(4.1)  90.6(3.2)  51.9(2.2)  61.2(4.2)  89.9(2.2)  67.5 
GFK  87.6(2.3)  81.9(4.9)  84.8(.45)  75.1(3.9)  74.3(5.2)  81.4(4.3)  79.1(2.7)  84.0(4.4)  95.2(2.2)  82.2(2.4)  90.4(1.4)  92.8(2.2)  84.1 
DIP  84.8(4.3)  73.5(4.9)  82.8(7.7)  59.8(5.7)  45.5(9.1)  52.2(8.1)  65.2(4.5)  69.3(6.9)  94.1(3.1)  61.9(6.3)  76.4(3.7)  90.9(2.3)  71.4 
SA  82.0(2.6)  65.9(4.0)  73.7(4.3)  67.7(4.2)  61.1(5.1)  67.8(4.8)  70.4(4.1)  80.1(4.3)  91.1(3.3)  66.9(3.3)  77.4(6.0)  87.3(3.1)  74.3 
LTSL  87.5(2.8)  75.3(4.2)  82.3(4.1)  70.2(2.4)  66.7(4.6)  77.7(4.6)  59.1(4.4)  66.6(5.7)  90.0(3.8)  60.8(3.1)  69.2(4.5)  86.0(2.9)  74.3 
LSSA  86.4(1.7)  45.4(6.6)  73.5(2.3)  80.3(2.3)  84.0(1.7)  90.9(1.7)  29.5(7.0)  86.6(4.5)  85.8(4.7)  65.9(6.5)  92.3(0.6)  93.4(2.2)  76.2 
CORAL  80.3(1.9)  63.8(3.1)  62.1(3.0)  77.6(1.2)  61.2(2.4)  64.3(2.9)  66.6(2.2)  69.1(2.6)  82.8(2.8)  72.0(1.7)  74.2(2.2)  89.6(1.6)  72.0 
TAISL  90.0(1.9)  85.3(3.1)  90.6(1.9)  80.1(1.4)  77.9(2.6)  85.1(2.2)  82.6(2.2)  85.6(3.5)  97.7(1.5)  84.0(1.0)  87.6(2.1)  95.9(1.0)  86.9 
JGSA  87.0(0.8)  69.4(6.7)  77.29(7.0)  79.6(1.2)  67.8(4.8)  76.27(6.1)  81.4(1.0)  87.1(0.7)  96.9(1.8)  82.2(0.7)  88.5(0.8)  94.9(1.0)  82.1 
SDTDL  94.8(3.2)  89.5(4.4)  90.4(4.7)  86.4(2.5)  82.8(5.7)  88.8 (3.6)  84.4(2.2)  91.7(3.6)  97.9(1.7)  83.9(1.1)  92.1(1.4)  98.1(1.2)  90.1 
Recognition Accuracy. Table 2 shows that SDTDL achieves the highest accuracy in 7 pairs out of 12 and improves the average accuracy by 3.2 points over the best competing method (90.1 vs. 86.9 for TAISL). In C→D and D→C, our method is a close second to the best results (90.4 vs. 90.6 and 83.9 vs. 84.0, respectively). The leading performance of SDTDL over the vector-based UDA methods indicates that the internal structure information of high-dimensional visual data is indeed crucial to cross-domain recognition, and that SDTDL effectively preserves it. We also observe that SDTDL outperforms TAISL in 10 of the 12 pairs, the exceptions being C→D and D→C, which suggests that the proposed method is able to restrain the interference of domain factors and to improve the discriminativeness of the feature. In Table 3, we can see that SDTDL outperforms both the competitive shallow and deep UDA methods on the digit datasets. On the one hand, this demonstrates the strong capability of SDTDL for discriminative domain-invariant feature extraction. On the other hand, the results validate the advantage of SDTDL over other methods when large training sets are unavailable for cross-domain recognition.
Small Sample Size Scenarios. We evaluate the performance of SDTDL in addressing the small-sample-size problem through the cross-domain recognition tasks W→C and MNIST→USPS. For W→C, random samples per class from domain W and all the target samples of domain C are selected to compose the dataset. As shown in Fig. 2 (e-f), SDTDL outperforms the other three methods when the labeled source samples are limited, suggesting that SDTDL can transfer knowledge between domains when few labeled samples are available. We also note that SDTDL underperforms when only one source sample from each class is available. The reason is that, with a single sample per class, the deviation of the sample from its class mean is zero, which nullifies the discriminative term in model (10). In addition, we select all the source samples and target samples per class to simulate the scenario of small sample size in the target domain. Fig. 2 (f) shows that SDTDL offers advantages over the other three competitive shallow methods when the number of target samples is limited. For MNIST→USPS, random source samples per class and all the target samples are selected to compose the dataset. Fig. 2 (g-h) show that SDTDL outperforms the other three competitors when source samples are scarce in cross-domain digit recognition. Besides, the advantage of SDTDL over TAISL in recognition accuracy demonstrates that the structured discriminative dictionary learning strategy of SDTDL can effectively address the small-sample-size problem in cross-domain recognition.
Method  TCA  GFK  SA  JDA  CORAL  TAISL  JGSA  DANN  DDC  SDTDL 

56.3  61.2  67.8  67.2  83.6  83.0  82.3  77.1  79.1  90.7  
51.2  46.5  48.8  59.7  78.5  82.6  87.8  73.0  66.5  89.1  
MEAN  53.8  53.9  58.3  63.4  81.1  82.8  85.1  75.1  72.8  89.9 
Parameter Sensitivity Analysis. We investigate the parameter sensitivity of SDTDL w.r.t. the target domain weighting parameter , the intra-class variance parameter and the target sample selection parameter . Fig. 2 (i-j) validate that SDTDL achieves stable performance over a wide range of settings for and . The observation from Fig. 2 (k) is twofold: (1) a large proportion of target samples should be selected in SDTDL to ensure that samples from each category are provided for training; (2) the proportion should be controlled within a certain range to prevent the negative effects of false labels.
Convergence Analysis. We evaluate the convergence property of SDTDL by checking the prediction accuracy of target samples in each iteration. Fig. 2 (l) shows the increase of prediction accuracy along with the dictionary learning process, indicating that the dictionary becomes more and more transferable and discriminative. This also demonstrates the effectiveness of our pseudo-label selection strategy in model training. Besides, we observe that the dictionary evolution can reach the balance between domain robustness and discriminativeness within iterations in most cases.
Dictionary Property Analysis. To demonstrate the efficacy of the learned domain-specific and class-specific subdictionaries in extracting domain information and class information, we analyze the reconstructed samples associated with the two subdictionaries. Concretely, we apply SDTDL to domain adaptation from MNIST to USPS (M→U) and compare the original images with the domain-specific and class-specific reconstructed images. From the results in Fig. 1, we observe that the images in (b) and (e) contain more domain information, e.g., the light typeface style of MNIST and the boldface style of USPS, than category information about the digits. We also note that the images in (d) and (f) contain far more category information than typeface information. The results demonstrate that the subdictionaries learned by SDTDL can separately capture the domain factor and the class information of the data under domain shift.
6 Conclusion
Previous unsupervised domain adaptation methods vectorize multi-dimensional data in advance, leading to the loss of internal structure information that is critical to visual recognition applications. Besides, most existing methods assume that plenty of samples are available, which rarely holds in practice. In this paper, we propose to learn a structured discriminative dictionary using a tensor model. The dictionary is composed of multilinear factor matrices, providing the capability to represent tensors. Moreover, the domain-specific information and class-specific information of the cross-domain samples are captured by the corresponding subdictionaries. Our method shows strong feature extraction capability through knowledge transfer between domains, not only in the traditional domain adaptation setting, but also in the rarely explored setting of limited samples.
References
 [AjaFernndez et al.2009] Santiago Aja-Fernández, Rodrigo de Luis García, Dacheng Tao, and Xuelong Li. Tensors in Image Processing and Computer Vision. Springer, 2009.
 [Aljundi et al.2015] Rahaf Aljundi, Rémi Emonet, Damien Muselet, and Marc Sebban. Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In CVPR, 2015.
 [Baktashmotlagh et al.2013] Mahsa Baktashmotlagh, Mehrtash Tafazzoli Harandi, Brian C. Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In ICCV, 2013.
 [BenDavid et al.2010] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. ML, 79(1-2):151–175, 2010.
 [Bousmalis et al.2016] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In NIPS, 2016.
 [Duan et al.2009] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML, 2009.
 [Fernando et al.2013] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.
 [Ganin and Lempitsky2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
 [Gong et al.2012] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
 [Gopalan et al.2011] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, 2011.
 [Gretton et al.2012] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.
 [Herath et al.2017] Samitha Herath, Mehrtash Tafazzoli Harandi, and Fatih Porikli. Learning an invariant Hilbert space for domain adaptation. In CVPR, 2017.
 [Hoffman et al.2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
 [Kolda and Bader2009] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
 [Kulis et al.2011] Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, 2011.
 [Lathauwer et al.2000] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM JMAA, 21(4):1324–1342, 2000.
 [Long et al.2014] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S. Yu. Transfer joint matching for unsupervised domain adaptation. In CVPR, 2014.
 [Long et al.2015] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
 [Lu et al.2017] Hao Lu, Lei Zhang, Zhiguo Cao, Wei Wei, Ke Xian, Chunhua Shen, and Anton van den Hengel. When unsupervised domain adaptation meets tensor representations. In ICCV, 2017.
 [Mansour et al.2009] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2009.
 [Pan and Yang2010] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
 [Pan et al.2011] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. TNN, 22(2):199–210, 2011.
 [Patel et al.2015] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. Signal Processing Magazine, 32(3):53–69, 2015.
 [Saito et al.2018] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
 [Shao et al.2014] Ming Shao, Dmitry Kit, and Yun Fu. Generalized transfer subspace learning through low-rank constraint. IJCV, 109(1-2):74–93, 2014.
 [Shekhar et al.2013] Sumit Shekhar, Vishal M. Patel, Hien Van Nguyen, and Rama Chellappa. Generalized domain-adaptive dictionaries. In CVPR, 2013.
 [Sun et al.2016] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
 [Tang et al.2018] Hao Tang, Heng Wei, Wei Xiao, Wei Wang, Dan Xu, Yan Yan, and Nicu Sebe. Deep micro-dictionary learning and coding network. In WACV, 2018.
 [Tzeng et al.2017] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
 [van der Maaten and Hinton2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
 [Volpi et al.2018] Riccardo Volpi, Pietro Morerio, Silvio Savarese, and Vittorio Murino. Adversarial feature augmentation for unsupervised domain adaptation. In CVPR, 2018.
 [Xu et al.2018] Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR, 2018.
 [Yang et al.2011] Meng Yang, Lei Zhang, Xiangchu Feng, and David Zhang. Fisher discrimination dictionary learning for sparse representation. In ICCV, 2011.
 [Yang et al.2018] Baoyao Yang, Andy Jinhua Ma, and Pong C. Yuen. Domain-shared group-sparse dictionary learning for unsupervised domain adaptation. In AAAI, 2018.
 [Yu and Szepesvári2012] Yaoliang Yu and Csaba Szepesvári. Analysis of kernel mean matching under covariate shift. In ICML, 2012.
 [Zhang et al.2017] Jing Zhang, Wanqing Li, and Philip Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In CVPR, 2017.
 [Zhang et al.2018] Zhen Zhang, Mianzhi Wang, Yan Huang, and Arye Nehorai. Aligning infinite-dimensional covariance matrices in reproducing kernel Hilbert spaces for domain adaptation. In CVPR, 2018.
Appendix A Proof of Theorem 1
The proof of Theorem 1 in the main paper is presented in this section. Theorem 1 provides the solution to the following optimization problem
(19)  
s. t. 
Theorem 1 (restated).
Let be the augmented sample tensor of class , which is generated by concatenating and along the th mode. Define matrix as
(20) 
where and are identity matrices, and are column vectors of all ones. Let be the mode flattening matrix of . Then, the optimal in (19) is given by , whose columns are the eigenvectors corresponding to the first largest eigenvalues of the following eigenvalue problem
(21) 
Proof.
Based on the formula (4.3) (4.4) in [Kolda and Bader2009], we have
(22) 
Similarly, we have
(23) 
Define , we can get the following equivalence with formula derivation
(24) 
Taking (22), (23) and (24) into account, the optimization problem (10) is equivalent to the following optimization problem
(25) 
For better readability, we define the intermediate variable
(26) 
We can now obtain the optimal factor matrix for each mode by solving the following optimization problem
(27) 
where is the mode flattening matrix of . By the method of Lagrange multipliers, the optimal solution of (27) is , whose columns are the eigenvectors corresponding to the first largest eigenvalues of the eigenvalue problem (21). ∎
Appendix B Dictionary Property Analysis
In this section, we provide additional experimental results to demonstrate the efficacy of the learned domain-specific and class-specific subdictionaries in extracting domain information and class information. We apply SDTDL to the task of transferring from USPS to MNIST (U→M), in which we compare the original images with the domain-specific and class-specific reconstructed images. The results in Fig. 3 demonstrate that the domain-specific subdictionary and the class-specific subdictionary learned by SDTDL are able to extract the domain information and the class information from cross-domain data, respectively.