1 Introduction
Domain Adaptation (DA) has received much attention in recent years, as it offers the possibility of generalizing a classifier trained on one domain to another domain, where the data observed in the two domains are usually drawn from different distributions Pan and Yang (2010). For example, in visual recognition, the data instances of the two domains usually originate from different environments, sensor types, resolutions, and view angles, so they follow very discrepant distributions Li et al. (2018). It is impractical to annotate sufficient data for each domain, since labeling data is labor-intensive and expensive. Therefore, it is necessary to apply DA techniques to exploit invariant features across different domains, so that well-labeled source knowledge can be transferred to the target domain and the labeling cost is mitigated. Recently, DA has made remarkable progress in cross-domain hyperspectral image classification Deng et al. (2018), human action recognition Zhang and Hu (2019), etc.
However, the performance of traditional DA usually relies heavily on the quality or richness of the labels in the source domain, which restricts applications in the wild, since we still have to seek a better-labeled and higher-quality source domain Shu et al. (2019); Tan et al. (2019). Label quality and label sufficiency are both important in the context of domain adaptation (DA), especially for deep learning DA frameworks Chen et al. (2020). In some real applications, it may not be easy for users to label data samples correctly and sufficiently, since they often struggle with very complicated and large datasets. For example, some data points are ambiguous between different categories, or require a high level of professional expertise. Therefore, data samples are easily partially wrongly-labeled when annotators commit to labeling the whole dataset, or highly sparsely-labeled when they only label a handful of samples to reduce the labeling cost as much as possible. Moreover, because the dataset is large, it is even more challenging to guarantee label quality or richness, especially for deep learning DA frameworks, which often require vast amounts of source domain data. The resulting poorly-labeled dataset, regarded as the source domain, has a great impact on the training process of DA models, since incorrect or unknown source knowledge will cause unexpected and heavy negative transfer Pan and Yang (2010). Therefore, it is essential to study the situation when the given source domain is poorly-labeled, either partially wrongly-labeled (label quality) or highly sparsely-labeled (label richness).

In practical applications, it may not be possible to access a significant amount of labeled data, especially with the dramatic increase of data in deep learning models. Therefore, it is essential to boost positive transfer for a new unlabeled target domain using a poorly-labeled source domain. Many examples in knowledge engineering can be found where this situation truly arises. One example is the problem of sentiment classification, where the task is to automatically classify the reviews on a product or the scores on a visual image. For this classification task, we first need to collect many products or visual images and annotate them using the given reviews or scores. However, labeling them is very labor-intensive and mind-numbing for users, since some data samples are too ambiguous and show no significant divergence between categories, especially the scores on visual images. Therefore, the label quality or richness is poor, whether we commit to labeling them all or only annotate a handful of them to reduce the labeling cost.
After that, we would use this poor dataset, regarded as the source domain, to train a classifier. Since the distribution of data among different types of products or visual images can be very different, to maintain good classification performance we would need to recollect source domains in order to train the review/score classification model for each kind of product/visual image. However, this data-labeling process can also be very expensive. To further reduce the effort of annotating reviews/scores for various products/visual images, we may want to adapt a classification model trained on some products/visual images (poorly-labeled) so that it can be directly applied to make predictions for other types of products/visual images. As another example, consider the user data collected from different retailers (e.g., Walmart and Amazon), which is updated every day and often in large quantities, so it may also be impossible to guarantee label quality and richness. It is therefore very challenging to utilize the poorly-labeled user data collected from one retailer (e.g., Walmart) to exploit the interests of users from a different retailer (e.g., Amazon). In such cases, the proposed sparsely-labeled source assisted domain adaptation can save a significant amount of labeling effort.
To this end, Weakly-Supervised Domain Adaptation (WSDA) was proposed to address the challenge that the source domain contains noise in labels, features, or both Shu et al. (2019). However, such work focuses only on the label quality problem and does not further explore the case where the source labels are severely insufficient. A more realistic setting, Sparsely-Labeled Source Assisted Domain Adaptation (SLSADA), is therefore proposed in this paper to further mitigate the labeling cost, where only a sparsely-labeled source domain is available without any target labels. Notably, this paper assumes that the target domain is completely unlabeled, which increases the difficulty of our work, since previous DA studies indicate that unsupervised DA Liang et al. (2019); Yang et al. (2018) is more challenging than the semi-supervised one Wang et al. (2019a); Pereira and Torres (2018). Moreover, we aim to make the proposed model more general, since the situation where at least one labeled example of each class exists in the target domain is a simpler special case. For example, on the Office-Home dataset, abundant correct source labels are available in the WSDA setting, whereas only a few are available in the SLSADA scenario. As shown in Fig. 1, there are numerous unlabeled data but only a few labeled ones, and we need to utilize this sparsely-labeled source domain to assist recognition in the target domain.
To address the challenge of SLSADA, our aim is not only to combat the label insufficiency in the source domain, but also to mitigate the domain shift between the source and target domains. It is essential for DA to study this new SLSADA scenario, which implements knowledge transfer at a lower labeling cost than most existing approaches. Specifically, SLSADA introduces two challenges. (1) It remains important to alleviate the influence of the distributional shift across different domains, as in previous DA methods. (2) Moreover, it is non-trivial to train a well-structured classifier since only limited source labels are available.
Due to the label scarcity in the SLSADA setting, we perform semi-supervised projected clustering on the source domain using the few labeled source instances, and unsupervised projected clustering on the target domain, so that the discriminative structures of the data can be discovered, i.e., data samples from the same cluster are grouped tightly (Fig. 2 (a)). Although the cluster labels of the source domain can be made consistent with the ground-truth labels, this is uncertain in the target domain since no supervised information is provided there. Therefore, the label propagation method Nie et al. (2009) is adopted to propagate the limited source labels to the unlabeled source and target instances simultaneously, so that the target cluster labels are revealed as correctly as possible (Fig. 2 (b)). Once their labels are uncovered, we can jointly align the marginal and conditional distributions across domains using Maximum Mean Discrepancy (MMD) Gretton et al. (2006) and class-wise MMD Long et al. (2013) (Fig. 2 (c)). To refine the final recognition performance progressively and enable the different steps to facilitate each other, we iterate these three procedures a few times.
However, it is non-trivial to integrate projected clustering, label propagation, and distributional alignment into a unified optimization framework, since some variables to be optimized are only implicitly involved in their formulations, so the steps cannot promote each other. Specifically, the construction of the class-wise MMD implicitly contains the variables related to the cluster centroids, while those variables in the projected clustering should be implicit when we optimize the projection matrix. Existing DA models are usually formulated with label prediction and distributional alignment separated into different steps Ding et al. (2018); therefore, they fail to take advantage of each other's merits. In contrast, this paper further incorporates projected clustering, making the model robust to label scarcity by respecting the discriminative structures of the data. Moreover, we prove that the class-wise MMD can be rewritten as a cluster-wise MMD when we optimize the variables related to the cluster centroids, while the projected clustering can be reformulated as intra-class scatter minimization Wang et al. (2014) when we optimize the shared projection matrix. Therefore, we can couple these three quantities together so that they benefit each other in an effective optimization scheme.
The main contributions of our work are twofold:

• We first introduce a new DA scenario, called Sparsely-Labeled Source Assisted Domain Adaptation, which is more realistic as it requires only a few labeled source data, yet has been insufficiently explored so far.

• We propose a unified framework to jointly seek cluster centroids, source and target labels, and domain-invariant features. We then construct an optimization strategy to solve the objective function efficiently.
The rest of the paper is organized as follows. Related work is reviewed in Section 2. In Section 3, we present the proposed model and the SLSADA algorithm. The experimental evaluations are discussed in Section 4. Finally, we conclude the paper in Section 5.
2 Related Work
Traditional DA aims to employ previously labeled source domain data to boost the task in the target domain. However, it usually assumes that the source and target domains share an identical label space, a setting known as Closed Set Domain Adaptation (CSDA). Recently, an increasing number of new domain adaptation scenarios have been proposed to address different challenges in practical applications, such as Partial Domain Adaptation (PDA), Open Set Domain Adaptation (OSDA), and Universal Domain Adaptation (UDA). PDA transfers a learner from a big source domain to a small target domain, where the label set of the source domain is assumed to be large enough to contain the target label set Cao et al. (2019). By contrast, OSDA was proposed to deal with the challenge that the target domain contains unknown classes that are not observed in the source domain Baktashmotlagh et al. (2019). Furthermore, for a given source label set and target label set, UDA requires no prior knowledge about the label sets, which may share a common label set while each holds a private label set You et al. (2019).
All of the aforementioned works have shown great improvements in knowledge transfer thanks to the substantial amount of high-quality labeled data available in the source domain. Therefore, recent research has begun to focus on weakly-supervised DA scenarios. For instance, Tan et al. Tan et al. (2019) proposed a Collaborative Distribution Alignment (CDA) method for Weakly-Supervised Open-Set Domain Adaptation (WS-OSDA), where both domains are partially labeled and not all classes are shared between the two domains. In contrast, Shu et al. Shu et al. (2019) proposed a Transferable Curriculum Learning (TCL) approach to address the challenge of sample noise in the source domain for Weakly-Supervised Closed-Set Domain Adaptation (WS-CSDA). However, these settings still require sufficient labeled instances in either the source or target domain. To further mitigate the intensive labeling expense, we propose a more realistic DA paradigm, called Sparsely-Labeled Source Assisted Domain Adaptation, which requires only a few source labels, while satisfactory performance is warranted through the proposed unified framework. To highlight the contributions of this paper and keep the model simple, SLSADA assumes that the source and target label sets are the same and that the labeled source instances are sparsely located in each class. To the best of our knowledge, our work is the first attempt to deal with this sparsely-labeled WS-CSDA scenario.
Recent DA methods follow a mainstream approach based on feature adaptation (FDA). FDA aims to extract a shared subspace in which the distributions of the source and target data are drawn close by explicitly minimizing some predefined distance metric, e.g., Bregman Divergence Si et al. (2010), Geodesic Distance Gong et al. (2012), Wasserstein Distance Shen et al. (2018), or Maximum Mean Discrepancy (MMD) Gretton et al. (2006). The most popular distance is MMD due to its simplicity and solid theoretical foundations Zhao et al. (2018). Pan et al. Pan et al. (2011) proposed Transfer Component Analysis (TCA) to align the marginal distributions across domains using MMD. Long et al. Long et al. (2013) proposed class-wise MMD to further reduce the conditional distribution difference between the two domains. Furthermore, SCA Ghifary et al. (2017), JGSA Zhang et al. (2017), and VDA Tahmoresnezhad and Hashemi (2017) construct the class scatter matrix of the source domain to preserve its discriminative information. This paper also utilizes MMD and class-wise MMD to jointly align the marginal and conditional distributions across the source and target domains. Moreover, we prove that the projected clustering process is equivalent to boosting intra-class compactness when the projection is optimized. Therefore, the features learned by the proposed model are simultaneously domain-invariant and discriminative.
It is noteworthy that the methods mentioned above rely on the strong assumption that rich labels are available in the source domain. Moreover, they optimize the target labels in a step separate from the domain-invariant feature learning, so the two may fail to benefit each other effectively Ding et al. (2018). Different from them, this paper seamlessly incorporates projected clustering, label propagation, and distributional alignment into a unified optimization framework, and jointly optimizes the cluster centroids, the source and target labels, and the domain-invariant features, where only a few source labels are available.
3 Methodology
In this section, we present our proposed model and its optimization strategy in detail.
3.1 Problem Definition
We begin with the definitions of terminologies. $X_s \in \mathbb{R}^{m \times n_s}$ (resp. $X_t \in \mathbb{R}^{m \times n_t}$) denotes the source (resp. target) domain data, where $n_s$ (resp. $n_t$) is the number of samples and $m$ is the dimension of a data instance. In the proposed SLSADA setting, a few source labels are available while there are no target labels at all; each given label is a one-hot vector of length $C$ ($C$ is the number of classes).
Moreover, we assume that the source and target domains share the same feature space and label space, while their marginal and conditional distributions differ due to the dataset shift. Our aim is to find a projection $A$ to map $X_s$ and $X_t$ into a shared subspace where those two distributional differences can be explicitly reduced. Their new representations are then $A^{\top}X_s$ and $A^{\top}X_t$.
3.2 Projected Clustering
The projected clustering aims to jointly optimize the cluster centroids and cluster labels in an embedded space, so that data instances from the same cluster are grouped together Wang et al. (2014). Since only limited source labels are available in the SLSADA scenario, we propose to perform semi-supervised projected clustering in the source domain and unsupervised projected clustering in the target domain. Therefore, the discriminative structures of the data can be exploited with these limited source labels. The loss of projected clustering is defined as follows:
(1) 
where $F_s$ and $F_t$ are the one-hot cluster label matrices for the source and target domains, respectively. According to Wang et al. (2014), the source and target cluster centroids can be computed as the means of the embedded data within each cluster. Eq.(1) means that each data point can be reconstructed from the cluster centroids and its cluster label. In addition, we enforce that the clustering results of the labeled source data remain consistent with their initial labels.
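To make this step concrete, here is a minimal NumPy sketch of one alternating update, assuming the standard assign-then-average scheme described above; the function name and interface are hypothetical, not the paper's implementation:

```python
import numpy as np

def projected_clustering_step(Z, F, labeled_mask=None):
    """One alternating step of projected clustering in the embedded space.

    Z: (k, n) embedded data A^T X (columns are samples); F: (n, C) one-hot
    cluster labels. Unlabeled points are re-assigned to the nearest centroid,
    while labeled source points keep their given labels (semi-supervised).
    """
    counts = np.maximum(F.sum(axis=0), 1e-12)      # cluster sizes, avoid /0
    centroids = (Z @ F) / counts                   # (k, C) cluster means
    # Squared distance of every point to every centroid, shape (n, C).
    d = ((Z.T[:, None, :] - centroids.T[None, :, :]) ** 2).sum(axis=2)
    F_new = np.eye(F.shape[1])[d.argmin(axis=1)]   # hard re-assignment
    if labeled_mask is not None:
        F_new[labeled_mask] = F[labeled_mask]      # clamp labeled source data
    return centroids, F_new
```

In the unified model this assignment is relaxed and solved jointly with the other variables (Section 3.6), but the sketch conveys the reconstruct-by-centroid intuition of Eq.(1).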
3.3 Effective Label Propagation
Although the source cluster labels agree with the true labels in the semi-supervised setting, this is uncertain in the target domain since no supervised information is provided there. To address this issue, a graph-based label propagation (GLP) method Nie et al. (2009) is introduced to guide the clustering procedure on the target domain, so that the predicted cluster labels agree with the true labels as accurately as possible. Specifically, we propagate the labels from the labeled source data to the unlabeled source and target data, and the loss of label propagation is defined as follows:
(2) 
where $L = D - W$ represents the graph Laplacian matrix, and $D$ denotes a diagonal matrix whose diagonal entries are the column sums of the weight matrix $W$. Specifically,
(3) 
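To illustrate how Eq.(2)-(3) can be instantiated, the following sketch builds a k-nearest-neighbor graph with binary Euclidean-distance-based weights (an assumption; the paper's exact weighting may differ) and propagates the labels with the classic harmonic-function solution that minimizes $\mathrm{tr}(F^{\top}LF)$ with the labeled rows clamped:

```python
import numpy as np

def propagate_labels(X, Y_l, labeled_idx, n_neighbors=20):
    """Graph-based label propagation (a sketch in the spirit of Nie et al. (2009)).

    X: (m, n) data, columns are source-then-target samples; Y_l: (n_l, C)
    one-hot labels of the labeled source points; labeled_idx: their indices.
    """
    n = X.shape[1]
    d2 = ((X.T[:, None, :] - X.T[None, :, :]) ** 2).sum(-1)   # pairwise distances
    W = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]         # skip self-match
    W[np.repeat(np.arange(n), n_neighbors), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)                                    # symmetrize

    D = np.diag(W.sum(axis=1))
    L = D - W                                                 # graph Laplacian

    u = np.setdiff1d(np.arange(n), labeled_idx)               # unlabeled indices
    # Minimizing tr(F' L F) with labeled rows fixed yields the closed form:
    F_u = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, labeled_idx)] @ Y_l)
    F = np.zeros((n, Y_l.shape[1]))
    F[labeled_idx], F[u] = Y_l, F_u
    return F
```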
3.4 CrossDomain Feature Alignment
To align the domain-wise distributions between the source and target domains, MMD is adopted to explicitly reduce their marginal distribution difference, and its loss is defined as follows:
(4) 
where $M_0$ is the MMD matrix, computed as follows:

$$(M_0)_{ij} = \begin{cases} \frac{1}{n_s n_s}, & x_i, x_j \in X_s \\ \frac{1}{n_t n_t}, & x_i, x_j \in X_t \\ -\frac{1}{n_s n_t}, & \text{otherwise} \end{cases} \qquad (5)$$
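For concreteness, $M_0$ can be assembled in a few lines (a sketch; samples are assumed ordered source-first):

```python
import numpy as np

def mmd_matrix(ns, nt):
    """Marginal MMD matrix M0 of Eq.(5): tr(A'X M0 X'A) equals the squared
    distance between the embedded source mean and target mean."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)   # entries: 1/ns^2, 1/nt^2, or -1/(ns*nt)
```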
We further decrease the conditional distribution shift across domains with the class-wise MMD, formulated as follows:
(6) 
where $n_s^{(c)}$ and $n_t^{(c)}$ are the numbers of data samples from class $c$ in the source and target domains ($c = 1, 2, \dots, C$), and the class-wise MMD matrix $M_c$ is computed as follows:

$$(M_c)_{ij} = \begin{cases} \frac{1}{n_s^{(c)} n_s^{(c)}}, & x_i, x_j \in X_s^{(c)} \\ \frac{1}{n_t^{(c)} n_t^{(c)}}, & x_i, x_j \in X_t^{(c)} \\ -\frac{1}{n_s^{(c)} n_t^{(c)}}, & x_i \in X_s^{(c)}, x_j \in X_t^{(c)} \ \text{or} \ x_j \in X_s^{(c)}, x_i \in X_t^{(c)} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$
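Analogously, a sketch of the class-wise matrix $M_c$ of Eq.(7), built from the given source labels and the propagated target pseudo-labels:

```python
import numpy as np

def classwise_mmd_matrix(ys, yt_pseudo, c):
    """Class-wise MMD matrix Mc of Eq.(7) for class c, following Long et al.
    (2013); ys / yt_pseudo are integer label vectors for source / target."""
    ns, nt = len(ys), len(yt_pseudo)
    e = np.zeros(ns + nt)
    src, tgt = np.where(ys == c)[0], np.where(yt_pseudo == c)[0]
    if src.size:
        e[src] = 1.0 / src.size
    if tgt.size:
        e[ns + tgt] = -1.0 / tgt.size
    return np.outer(e, e)
```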
3.5 Overall Objective Function
Finally, we formulate the proposed model by incorporating the above Eq.(1), Eq.(2), Eq.(4), and Eq.(6) as follows:
(8) 
where the trade-off parameters balance the regularization terms, and we constrain the subspace with $A^{\top}XX^{\top}A = I$ such that the data in the subspace are statistically uncorrelated ($I$ is the identity matrix and the data matrix $X$ is pre-centralized). We further impose the constraint that $\|A\|_F^2$ is small to control the scale of $A$ Zhang et al. (2017). Remarkably, the proposed approach joins projected clustering, label propagation, and distributional alignment in a unified framework; thus the three components can benefit each other and improve the recognition of the unlabeled data in both domains.
With projected clustering, the discriminative structures of the data can be exploited effectively (i.e., data points belonging to the same cluster are congregated together), while only a few source labels are required. With label propagation, the cluster labels of the unlabeled data are revealed correctly, in both the source and target domains. Domain-invariant features mean that the feature representations of data instances with the same semantic (i.e., category) from different domains are as similar as possible; the reason domain-invariant features otherwise perform poorly is that different domains follow very different distributions (i.e., domain shift). Therefore, once the domain shift is mitigated, the domain-invariant features can be leveraged effectively. Moreover, when they are jointly optimized, the discriminative and domain-invariant features prompt a more effective graph between the source and target domains, so that the few source labels can be propagated to the unlabeled data more accurately. Meanwhile, when more accurate labels are assigned to the unlabeled data, more effective knowledge is transferred across the two domains, and more promising projected clustering performance is achieved in both domains. As such, the three procedures promote each other in a unified optimization framework, and the proposed approach is more robust and effective than treating them separately.
However, two difficulties arise when Eq.(8) is optimized. Firstly, the class-wise MMD term contains label information, so we have to rewrite it in a formulation where the cluster-label variable is involved while the cluster centroids are optimized. As mentioned before, the source and target cluster centroids in the embedded space are the class-wise means of the embedded data. Remarkably, the class-wise MMD is nothing less than the sum of distances between the mean embeddings of the source and target data from the same classes. Therefore, it is easy to verify that the conditional distribution alignment equals cluster-centroid calibration. Consequently, we expect that the learned cluster centroids not only make the embedded data points more separable and discriminative, but also boost the conditional distribution alignment when the centroids are optimized.
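The verification can be sketched in one line (our shorthand: $\mu_s^{(c)}$ and $\mu_t^{(c)}$ denote the embedded class means, i.e., the cluster centroids):

```latex
% Class-wise MMD rewritten as centroid calibration (a sketch in our notation):
\sum_{c=1}^{C}\Big\|
  \underbrace{\tfrac{1}{n_s^{(c)}}\sum_{x_i\in X_s^{(c)}} A^{\top}x_i}_{\mu_s^{(c)}}
  -\underbrace{\tfrac{1}{n_t^{(c)}}\sum_{x_j\in X_t^{(c)}} A^{\top}x_j}_{\mu_t^{(c)}}
\Big\|^2
= \sum_{c=1}^{C}\big\|\mu_s^{(c)}-\mu_t^{(c)}\big\|^2
```

so minimizing the class-wise MMD is exactly pulling corresponding source and target centroids together.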
Another challenge is how to make the projected clustering term agree with the MMD terms when the shared projection is optimized. We also prove that the projected clustering loss can be rewritten in terms of $S_s$ and $S_t$, the intra-class scatter matrices of the source and target domains, which can be computed as in previous work Wang et al. (2014). Similarly, we expect that, when the shared projection is optimized, not only are the marginal and conditional distributions of source and target aligned, but their discriminative information is also respected. Therefore, projected clustering, label propagation, and distributional alignment can be optimized simultaneously and facilitate each other.
Theorem 1. The projected clustering loss can be rewritten in terms of the intra-class scatter matrices:
(9) 
where $S_s$ and $S_t$ are the intra-class scatter matrices of the source and target domains.
Proof: Without loss of generality, we prove the source-domain part. Firstly, denote by $\bar{x}_c$ the mean of the source samples from class $c$; as mentioned before, the corresponding cluster centroid in the embedded space is $A^{\top}\bar{x}_c$. Then, we have:
(10) 
Furthermore,
(11) 
Thus, Eq.(9) is proved.
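The key identity behind the proof can be summarized as follows (with $\bar{x}_c$ the input-space class mean, as above, and using $\|v\|^2 = \operatorname{tr}(vv^{\top})$):

```latex
% Projected clustering loss as intra-class scatter (source-domain part):
\sum_{i}\big\|A^{\top}x_i - A^{\top}\bar{x}_{c_i}\big\|^2
= \operatorname{tr}\!\Big(A^{\top}\Big[\sum_{i}(x_i-\bar{x}_{c_i})(x_i-\bar{x}_{c_i})^{\top}\Big]A\Big)
= \operatorname{tr}\big(A^{\top}S_s A\big)
```

The target-domain term follows analogously with $S_t$.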
3.6 Optimization
Here an alternating optimization strategy is constructed to solve Eq.(8). We first transform it into the augmented Lagrangian function by relaxing the non-negative constraints as follows:
(12) 
where the Lagrange multipliers correspond to the non-negative constraints. When all variables except $A$ are fixed, Eq.(12) becomes:
(13) 
where $S_s$ and $S_t$ are the intra-class scatter matrices of the source and target domains, computed as in previous work Wang et al. (2014). Here we rewrite the projected clustering term via the scatter matrices, since the labels are already uncovered when $A$ is optimized. The optimal solution $A$ to Eq.(13) is then formed by the eigenvectors corresponding to the smallest eigenvalues of the resulting generalized eigen-problem.
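As a sketch of this step, with the left- and right-hand matrices of the generalized eigen-problem denoted generically as Q and B (assembled from the scatter, Laplacian, and MMD terms; names hypothetical):

```python
import numpy as np
from scipy.linalg import eigh

def solve_projection(Q, B, k):
    """Solve Q a = lambda B a and return the eigenvectors corresponding to
    the k smallest eigenvalues as the columns of the projection A."""
    vals, vecs = eigh(Q, B)   # symmetric generalized problem, ascending order
    return vecs[:, :k]        # (m, k) projection matrix
```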
When all variables except the cluster centroids are fixed, Eq.(12) becomes:
(14) 
where we rewrite the distributional alignment as cluster-centroid calibration since the labels are unknown. Thus, we take the partial derivative of $J$ with respect to the centroids and set it to zero:
(15) 
Using the KKT conditions ($\odot$ denotes the element-wise product of two matrices), we obtain the following equations:
(16) 
where $[T]_{+}$ denotes the matrix obtained by replacing the negative elements of an arbitrary matrix $T$ with 0, and $[T]_{-}$ denotes the matrix obtained by replacing the positive elements of $T$ with 0. Similarly,
(18) 
where the involved matrices are defined analogously to those in Eq.(16).
As for the cluster labels, we fix the remaining variables and Eq.(12) becomes:
(19) 
Likewise, we obtain the following equations:
(20) 
Therefore, the updating rule for the source label matrix is as follows:
(21) 
where the involved matrices are the corresponding positive and negative parts derived from Eq.(20). Similarly, the updating rule for the target label matrix is as follows:
(22) 
where the matrices are defined analogously.
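All of the rules in Eq.(16)-(22) follow the familiar NMF-style multiplicative pattern built from the $[T]_{+}/[T]_{-}$ split. Below is a generic, hedged sketch of such an update; the paper's exact numerators and denominators come from the gradients in Eq.(15)-(20) and are represented abstractly here:

```python
import numpy as np

def pos(T):
    """[T]+ : negative elements replaced by 0."""
    return np.maximum(T, 0.0)

def neg(T):
    """[T]- : positive elements replaced by 0 (stored with positive sign)."""
    return np.maximum(-T, 0.0)

def multiplicative_update(F, grad, eps=1e-12):
    """KKT-derived update F <- F * sqrt([grad]- / [grad]+).

    Every factor is non-negative, so non-negativity of F is preserved, and a
    fixed point ([grad]- == [grad]+ wherever F > 0) satisfies the KKT conditions.
    """
    return F * np.sqrt((neg(grad) + eps) / (pos(grad) + eps))
```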
Computational Complexity
We analyze the computational complexity of Algorithm 1 using big-O notation, and denote the number of iterations by $T$. The computational cost is detailed as follows: solving the generalized eigen-decomposition problem (Line 7); updating the cluster centroids (Line 8); constructing the MMD matrices (Line 6); updating the label matrices (Line 9); and updating the graph (Line 6). In summary, the overall computational complexity of Algorithm 1 is the sum of these costs over the $T$ iterations. Moreover, the value of $k$ is not greater than 200 and the number of classes is not greater than 100, so both are much smaller than the number of samples. Therefore, Algorithm 1 can be solved in polynomial time with respect to the number of samples.
4 Experiments
4.1 Datasets and Experimental Settings
To validate the effectiveness of our approach in both the DA and SLSADA scenarios, we conduct experiments on 4 benchmark datasets for cross-domain object recognition, i.e., Office10-Caltech10, Office-Home, ImageCLEF-DA, and Office-31. Fig. 3 illustrates some sample images from the Office10-Caltech10 and Office-Home datasets, which follow very different distributions. The datasets are described as follows:
Office10-Caltech10 Gong et al. (2012) contains 4 real-world object domains, where 3 domains come from the Office-31 dataset (Amazon (A), Webcam (W), and DSLR (D)) and the last one comes from the Caltech-256 dataset (Caltech (C)). We select the 10 classes shared by these 4 domains to construct the DA dataset Office10-Caltech10, which has 2,533 images and 12 DA tasks, e.g., A→W, C→D, and so on. Note that the arrow "→" points from the source domain to the target domain; for example, W→D means Webcam is the labeled source domain while DSLR is the unlabeled target domain.
Office-Home Venkateswara et al. (2017) was released recently as a more challenging dataset, crawled through several search engines and online image directories. It consists of 4 different domains: Artistic images (Ar), Clipart images (Cl), Product images (Pr), and Real-World images (Rw). In total, there are 15,500 images from 65 object categories, yielding 12 DA tasks.
ImageCLEF-DA Long et al. (2018) has 1,800 images organized by selecting the 12 common classes shared by 3 public domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P), from which 6 DA tasks can be created.
Office-31 Saenko et al. (2010) is an increasingly popular benchmark for visual DA, which includes 3 real-world object domains: Amazon (A), Webcam (W), and DSLR (D). It has 4,652 images from 31 categories, and 6 DA tasks can be constructed.
4.2 Experimental Results
DA  C→A  C→W  C→D  A→C  A→W  A→D  W→C  W→A  W→D  D→C  D→A  D→W  Avg.

TCA Pan et al. (2011)  46.5  35.3  44.6  38.6  38.3  38.9  30.2  28.7  90.4  33.3  33.4  88.8  45.6 
JDA Long et al. (2013)  43.1  35.9  47.8  34.9  42.0  36.3  31.0  40.4  88.5  29.2  29.7  89.5  45.7 
BDA Wang et al. (2017)  45.4  39.7  43.9  38.6  44.7  40.8  30.1  35.0  89.8  29.9  35.9  89.8  47.0 
VDA Tahmoresnezhad and Hashemi (2017)  50.8  44.7  49.0  37.5  44.7  43.9  30.9  41.3  88.5  30.2  34.1  89.5  48.8 
JGSA Zhang et al. (2017)  50.7  47.5  43.9  42.5  48.1  46.5  30.9  39.9  89.8  30.0  39.0  89.2  49.8 
MEDA Wang et al. (2018)  55.4  54.6  57.3  44.4  55.3  40.1  34.3  41.8  86.6  33.6  43.1  86.4  52.7 
Ours  59.2  58.0  55.4  46.4  45.1  47.1  36.5  32.0  93.6  38.3  44.8  89.5  53.8
DA  I→P  P→I  I→C  C→I  C→P  P→C  A→W  D→W  W→D  A→D  D→A  W→A  Avg.

JAN Long et al. (2017)  76.8  88.0  94.7  89.5  74.2  91.7  85.4  97.4  99.8  84.7  68.6  70.0  85.1 
CDAN Long et al. (2018)  76.7  90.6  97.0  90.5  74.5  93.5  93.1  98.2  100.0  89.8  70.1  68.0  86.8 
CAN Zhang et al. (2018)  78.2  87.5  94.2  89.5  75.8  89.2  81.5  98.2  99.7  85.5  65.9  63.4  84.1 
MADA Pei et al. (2018)  75.0  87.9  96.0  88.8  75.2  92.2  90.1  97.4  99.6  87.8  70.3  66.4  85.6 
TCA Pan et al. (2011)  77.7  81.2  92.7  87.5  74.2  84.8  76.1  97.6  99.4  79.7  64.2  63.8  81.6 
JDA Long et al. (2013)  77.0  81.3  95.2  91.2  76.8  84.3  83.3  98.0  99.8  81.7  68.2  69.0  83.8 
BDA Wang et al. (2017)  76.0  79.7  94.8  91.5  76.2  82.2  80.8  96.4  99.6  79.9  67.6  67.2  82.7 
VDA Tahmoresnezhad and Hashemi (2017)  77.3  83.3  94.3  91.5  77.0  87.2  84.3  98.6  100.0  82.5  68.7  69.8  84.5 
JGSA Zhang et al. (2017)  77.0  83.5  95.5  91.7  77.3  88.8  86.7  97.9  99.8  83.9  69.6  71.3  85.3 
MEDA Wang et al. (2018)  79.5  92.2  95.7  92.3  78.7  95.5  86.2  97.7  99.6  86.1  72.6  74.7  87.6 
Ours  79.2  90.0  94.8  91.5  78.3  93.8  86.0  98.6  99.8  88.4  74.5  71.9  87.2
DA  Ar→Cl  Ar→Pr  Ar→Rw  Cl→Ar  Cl→Pr  Cl→Rw  Pr→Ar  Pr→Cl  Pr→Rw  Rw→Ar  Rw→Cl  Rw→Pr  Avg.

JAN Long et al. (2017)  45.9  61.2  68.9  50.4  59.7  61.0  45.8  43.4  70.3  63.9  52.4  76.8  58.3 
CDAN Long et al. (2018)  50.7  70.6  76.0  57.6  70.0  70.0  57.4  50.9  77.3  70.9  56.7  81.6  65.8 
MDD Zhang et al. (2019)  54.9  73.7  77.8  60.0  71.4  71.8  61.2  53.6  78.1  72.5  60.2  82.3  68.1 
TADA Wang et al. (2019b)  53.1  72.3  77.2  59.1  71.2  72.1  59.7  53.1  78.4  72.4  60.0  82.9  67.6 
BSP Chen et al. (2019)  52.0  68.6  76.1  58.0  70.3  70.2  58.6  50.2  77.6  72.2  59.3  81.9  66.3 
TAT Liu et al. (2019)  51.6  69.5  75.4  59.4  69.5  68.6  59.5  50.5  76.8  70.9  56.6  81.6  65.8 
TCA Pan et al. (2011)  48.7  65.3  70.1  49.2  59.7  63.2  52.0  45.0  71.9  63.7  51.4  77.1  59.8 
JDA Long et al. (2013)  50.9  67.7  70.9  51.3  64.4  64.9  54.6  47.7  73.3  64.9  53.7  78.3  61.9 
BDA Wang et al. (2017)  47.8  59.3  67.7  49.0  62.0  61.4  50.1  46.0  70.7  61.8  51.5  74.5  58.5 
VDA Tahmoresnezhad and Hashemi (2017)  51.2  69.3  72.2  53.6  66.1  66.9  56.0  48.8  74.5  65.8  54.1  79.5  63.2 
JGSA Zhang et al. (2017)  51.4  69.2  72.6  51.8  67.3  67.0  55.9  48.7  75.6  64.4  53.3  78.5  63.0 
MEDA Wang et al. (2018)  55.3  75.7  77.6  57.2  73.9  72.0  58.6  52.3  78.7  68.3  57.0  81.9  67.4 
Ours  58.1  77.4  78.7  61.6  72.5  72.5  62.5  54.4  79.1  70.1  59.6  82.6  69.1
The proposed approach involves 4 parameters: the projected clustering regularizer, the projection scaling regularizer, the subspace dimension, and the number of iterations. Two of them are fixed across all datasets (the number of iterations is set to 5 and one regularizer to 0.01), and a 20-nearest-neighbor graph with Euclidean distance-based weights is adopted for simplicity. Specifically, we set the subspace dimension to 20 and the remaining regularizer to 0.05 on the Office10-Caltech10 and ImageCLEF-DA datasets, while 100 and 0.1 on the Office-Home and Office-31 datasets, since they contain more categories. In the following section, we provide an empirical analysis of parameter sensitivity, which verifies that stable performance is achieved under a wide range of values.
We adopt different types of features as inputs, either traditional shallow features or deep features. Specifically, the 800-dimensional shallow SURF features Gong et al. (2012) are adopted for Office10-Caltech10. For Office-Home, ImageCLEF-DA, and Office-31, we utilize deep features pre-extracted from a ResNet-50 model pre-trained on ImageNet He et al. (2016), with a feature dimensionality of 2,048. To construct an SLSADA scenario, we randomly choose 5 source instances from each class as labeled samples and leave the others unlabeled; the random selection is repeated ten times and the average results are reported (a sketch of this split protocol is given after the following comparison).

Since no previous approach has been proposed to tackle the SLSADA problem, we first compare the proposed approach with several state-of-the-art methods in the standard DA setting, where the labels of all source data instances are available. Specifically, we compare against both shallow DA methods (TCA Pan et al. (2011), JDA Long et al. (2013), BDA Wang et al. (2017), VDA Tahmoresnezhad and Hashemi (2017), JGSA Zhang et al. (2017), MEDA Wang et al. (2018)) and deep DA methods (JAN Long et al. (2017), CAN Zhang et al. (2018), MADA Pei et al. (2018), CDAN Long et al. (2018), MDD Zhang et al. (2019), TADA Wang et al. (2019b), BSP Chen et al. (2019), TAT Liu et al. (2019)). The performances of the different methods in the DA setting are shown in Table 1, Table 2, and Table 3.

To be specific, Table 1 shows that the results of our approach are substantially higher than all 6 other methods on most DA tasks (8/12), and the average accuracy is 53.8%, a 1.1% improvement over the best baseline MEDA. From Table 2, it can be seen that the best results are achieved on only 2/12 DA tasks, but most of our results are very close to the highest ones; besides, the average accuracy of our approach is only 0.2% lower than the best baseline MEDA. From Table 3, it can be observed that our approach also attains the best performance on most DA tasks (8/12) and increases the average accuracy by 1.0% over the best baseline MDD (68.1% to 69.1%). Therefore, the competitiveness of our approach in the DA setting is validated against those state-of-the-art DA methods, both shallow and deep.
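For reproducibility, the split protocol described above can be implemented as in this sketch (function name hypothetical):

```python
import numpy as np

def make_slsada_split(ys, n_labeled=5, seed=0):
    """Pick `n_labeled` random labeled instances per source class; the rest of
    the source domain is treated as unlabeled (repeated over several seeds
    and averaged, as in the protocol above)."""
    rng = np.random.default_rng(seed)
    labeled = []
    for c in np.unique(ys):
        idx = np.where(ys == c)[0]
        labeled.extend(rng.choice(idx, size=min(n_labeled, idx.size), replace=False))
    return np.sort(np.array(labeled))
```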
To further demonstrate the superiority of our approach in the SLSADA scenario, we also test the behavior of other mainstream approaches on the SLSADA problem. As a note, deep DA methods integrate feature extraction and knowledge transfer into an end-to-end network and achieve promising results, while this paper adopts a two-stage mechanism to promote the transferability of deep ResNet-50 features. Recent techniques have shown that effective knowledge transfer is easier and faster to implement with such a two-stage mechanism. Moreover, since the promising results of deep DA methods mainly depend on feeding adequate labeled data, they may well fail to train a classifier when the labels are very limited, as in the SLSADA scenario. Therefore, in the SLSADA scenario, we only report results compared with the two-stage methods (i.e., TCA Pan et al. (2011), JDA Long et al. (2013), BDA Wang et al. (2017), VDA Tahmoresnezhad and Hashemi (2017), JGSA Zhang et al. (2017), MEDA Wang et al. (2018)).

SLSADA  C→A  C→W  C→D  A→C  A→W  A→D  W→C  W→A  W→D  D→C  D→A  D→W  Avg.
TCA(s) Pan et al. (2011)  39.5±3.1  36.5±2.3  38.1±2.9  53.1±2.6  54.7±2.6  54.0±3.1  79.4±2.4  75.8±4.3  80.5±4.0  79.5±1.7  75.7±2.7  81.8±3.9  62.4±3.0
JDA(s) Long et al. (2013)  37.5±3.0  33.8±2.3  35.1±2.8  47.1±3.6  47.1±2.9  47.9±2.2  68.9±4.4  67.3±4.4  76.9±4.1  73.4±3.1  70.1±2.8  81.7±2.9  57.2±3.2
BDA(s) Wang et al. (2017)  39.7±2.3  35.9±2.9  38.0±2.6  49.4±2.5  52.5±2.6  51.4±2.7  73.8±3.6  71.5±4.0  79.7±4.3  74.3±2.5  72.2±1.7  81.7±2.2  60.0±2.8
VDA(s) Tahmoresnezhad and Hashemi (2017)  39.1±2.7  34.5±2.5  36.1±3.1  47.8±3.6  49.2±2.3  49.7±2.7  68.2±3.7  65.1±4.3  78.6±3.3  72.9±3.2  68.2±4.0  80.9±3.2  57.5±3.2
JGSA(s) Zhang et al. (2017)  41.2±2.6  33.1±2.1  32.8±2.3  50.4±3.1  45.7±3.6  43.6±4.8  71.0±3.3  66.5±4.5  79.7±2.1  75.0±3.4  72.7±2.2  80.3±4.4  57.7±3.2
MEDA(s) Wang et al. (2018)  39.5±3.8  35.2±2.7  35.5±2.4  53.6±3.1  53.0±3.0  50.4±4.1  77.1±2.9  77.3±3.0  78.1±3.3  77.4±2.8  76.8±3.5  78.3±4.4  61.0±3.3
Ours(s)  45.1±2.5  40.9±3.1  42.4±1.8  58.4±3.4  60.3±2.5  59.4±1.9  80.0±3.5  79.4±2.6  85.1±3.4  77.3±3.0  76.5±2.2  86.1±3.0  65.9±2.7
TCA(t) Pan et al. (2011)  33.9±3.5  27.8±6.1  33.5±4.1  31.5±1.5  30.2±2.7  30.9±2.3  27.9±2.6  28.0±1.3  73.4±5.8  31.4±1.4  32.6±1.9  79.6±2.8  38.4±3.0
JDA(t) Long et al. (2013)  33.4±3.7  28.0±5.7  31.5±4.4  28.8±3.0  32.1±2.2  31.2±3.4  28.2±3.0  28.3±6.4  69.4±6.5  29.8±2.6  31.7±2.6  79.3±2.7  37.6±3.9
BDA(t) Wang et al. (2017)  34.4±3.4  30.7±4.9  34.5±5.5  31.4±2.0  33.9±4.6  33.5±2.9  31.0±2.1