Coupled Support Vector Machines for Supervised Domain Adaptation

06/22/2017 ∙ by Hemanth Venkateswara, et al. ∙ Arizona State University, University of Michigan, Bosch

Popular domain adaptation (DA) techniques learn a classifier for the target domain by sampling relevant data points from the source and combining them with the target data. We present a Support Vector Machine (SVM) based supervised DA technique, where the similarity between source and target domains is modeled as the similarity between their SVM decision boundaries. We couple the source and target SVMs and reduce the model to a standard single SVM. We test the Coupled-SVM on multiple datasets and compare our results with other popular SVM-based DA approaches.

1 Introduction and Motivation

Supervised learning algorithms often make the implicit assumption that the test data is drawn from the same distribution as the training data. These algorithms become ineffective when this assumption is violated. Transfer learning techniques are applied to address these kinds of problems. Transfer learning involves extracting knowledge from one or more tasks or domains and utilizing (transferring) that knowledge to design a solution for a new task or domain [2]. Domain adaptation (DA) is a special case of transfer learning where we handle data from different, yet correlated distributions. DA techniques transfer knowledge from the source domain (distribution) to the target domain (distribution), in the form of learned models and efficient feature representations, to learn effective classifiers on the target domain.

Figure 1: (a) Standard SVM-based DA. The source SVM is perturbed to obtain the target SVM. (b) C-SVM: the source and target SVMs are learned together. Training error could be high. (c) The C-SVM does not overfit. Source and target data are drawn with different markers. Filled and unfilled objects are training and test data, respectively.

In this work we consider the problem of supervised DA, where we use labeled samples from the source domain along with a limited number of labeled samples from the target domain to learn a classifier for the target domain. We propose a Coupled linear Support Vector Machine (C-SVM) model that simultaneously estimates linear SVM decision boundaries for the source and target training data. Using a technique termed instance matching, researchers sample source data points such that the difference between the means of the sampled source and target data is minimized [4], [10]. Our intuition behind the C-SVM is along similar lines: we penalize the difference between the source and target decision boundaries. Since SVM decision boundaries are linear combinations of the data points, penalizing this difference can be viewed as penalizing the difference between the weighted means of the source and target data points.
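
The "weighted means" reading can be made concrete with the usual dual representation of a standard linear SVM. The block below is a sketch in assumed notation: the symbols \mathbf{w}_s, \mathbf{w}_t, \alpha, n_s and n_t are chosen here for illustration and are not taken from the text.

```latex
% Assumed notation: alpha are non-negative dual weights, y are labels in {-1, +1}.
\mathbf{w}_s = \sum_{i=1}^{n_s} \alpha_i^{s}\, y_i^{s}\, \mathbf{x}_i^{s},
\qquad
\mathbf{w}_t = \sum_{j=1}^{n_t} \alpha_j^{t}\, y_j^{t}\, \mathbf{x}_j^{t}
\quad\Longrightarrow\quad
\|\mathbf{w}_s - \mathbf{w}_t\|^2
= \Big\| \sum_{i=1}^{n_s} \alpha_i^{s}\, y_i^{s}\, \mathbf{x}_i^{s}
       - \sum_{j=1}^{n_t} \alpha_j^{t}\, y_j^{t}\, \mathbf{x}_j^{t} \Big\|^2
```

Each boundary is a signed, weighted sum of its own training points, so a penalty on the difference of the boundaries acts on the difference of those weighted sums.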

Figure 1(a) illustrates standard SVM-based DA, where an SVM is first learned on the source and is subsequently perturbed to obtain the target SVM. The perturbed SVM could be very different from the source SVM and can overfit the target training data. Figure 1(b) depicts the C-SVM, where the source and target SVMs are learned simultaneously. The source SVM provides an anchor for the target SVM. The difference between the two SVMs is modeled based on the difference between the source and target domains. In addition, the C-SVM trades training error for generalization, as illustrated in Figure 1(c). In this paper, we formulate a coupled SVM problem to estimate the source and target SVMs and reduce it to a single SVM problem that can be solved with standard quadratic optimization. We test our model and report recognition accuracies on various datasets of objects, handwritten digits, facial expressions and activities.

2 Related Work and Our Method

In this section we discuss some of the SVM-based DA techniques closely related to the C-SVM. Support Vector Machines have been extensively used for DA in the past. Daumé [3] modeled augmented features with a heuristic kernel. Bruzzone and Marconcini [2] proposed an unsupervised method (DASVM) to adapt an SVM learned on the source domain to the unlabeled data in the target domain in an iterative manner. Adapt-SVM is another technique closely related to our method, where Yang et al. [17] and Li [9] learn an SVM on the target by minimizing the classification error on the target data while also reducing the discrepancy between the source and target SVMs. We differ from this method by learning the source and target SVMs simultaneously. Aytar and Zisserman [1] extend this framework to the Projective Model Transfer SVM, which relaxes the transfer induced by the Adapt-SVM. Hoffman et al. (MMDT) [6] learn a single SVM model for the source and the transformed target data. The target data is transformed by a transformation matrix that is learned in an optimization framework along with the SVM. Duan et al. (AMKL) [4] implement a multiple kernel method where multiple base kernel classifiers are combined with a pre-learned average classifier obtained from fusing multiple nonlinear SVMs. Unlike in C-SVM, where the similarity between source and target is learned by the model, Widmer et al. [16] use a similar approach to solve multitask problems using graph Laplacians to model task similarity. We believe the C-SVM holds a unique position in this wide array of SVM solutions for DA. The C-SVM trains a linear SVM for both the source and target domains simultaneously, thereby reducing the chances of overfitting, especially when there are very few labeled samples from the target domain.

3 Problem Specification

We outline the problem as follows. The source training set consists of labeled samples from the source domain: the data points together with their labels. Along similar lines, the target training set consists of labeled data points from the target domain.

3.1 Coupled-SVM Model

The goal is to learn a target classifier that generalizes to a larger subset of the target domain and does not overfit the target training data. The catch here is that the number of labeled target data points is small. We therefore include the source data and learn the source classifier to provide an anchor point for the target classifier. The source and target classifiers are linear SVM decision boundaries. To simplify notation, we re-define the variables so that the bias of each SVM is accounted for within its decision boundary. Incorporating these definitions, the Coupled-SVM can be detailed as follows,

(1)

Equation (1) is a variation of a standard linear SVM with two decision boundaries and an additional term relating the two boundaries. The first term captures the similarity (or dissimilarity) between the source and target domains as the difference between the decision boundaries. A trade-off parameter controls the importance of this difference. The second and third terms are the SVM regularizers. The fourth and fifth terms capture the training loss, where two cost parameters, one for the source and one for the target, control the importance of source and target misclassification respectively.
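
For orientation, an objective with exactly these five terms can be written as below. This is a sketch that follows the term-by-term description, not a verbatim copy of Equation (1): the symbols \mathbf{w}_s, \mathbf{w}_t, \lambda, C_s, C_t and \xi, as well as the factors of one half, are assumptions made here.

```latex
% Assumed notation; the bias is taken to be absorbed into w_s and w_t.
\min_{\mathbf{w}_s,\, \mathbf{w}_t,\, \boldsymbol{\xi}^{s},\, \boldsymbol{\xi}^{t}} \;
\frac{\lambda}{2}\,\|\mathbf{w}_s - \mathbf{w}_t\|^2
+ \frac{1}{2}\,\|\mathbf{w}_s\|^2
+ \frac{1}{2}\,\|\mathbf{w}_t\|^2
+ C_s \sum_{i=1}^{n_s} \xi_i^{s}
+ C_t \sum_{j=1}^{n_t} \xi_j^{t}
\\[4pt]
\text{s.t.}\quad
y_i^{s}\, \mathbf{w}_s^{\top} \mathbf{x}_i^{s} \ge 1 - \xi_i^{s},\;\; \xi_i^{s} \ge 0,
\qquad
y_j^{t}\, \mathbf{w}_t^{\top} \mathbf{x}_j^{t} \ge 1 - \xi_j^{t},\;\; \xi_j^{t} \ge 0 .
```

In this form, a large \lambda pulls the two boundaries together, while \lambda = 0 decouples the problem into two independent linear SVMs.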

3.2 Solution

To simplify notation, we define a new set of variables based on the earlier ones. We concatenate the two SVM boundaries into a single variable. The individual source and target SVMs can be extracted from this concatenated variable using two permutation matrices, which are binary matrices that, applied to the concatenated variable, return the source boundary and the target boundary respectively. We also define new data points in the space of the concatenated variable, together with their labels,

(2)

where 0 denotes a vector of zeros. Similarly,

(3)
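
The construction of the new data points can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: a constant 1 is appended to absorb the bias, source points occupy the first half of the concatenated space and target points the second half, and the helper name `augment` is hypothetical.

```python
import numpy as np

def augment(X, domain):
    """Map raw points of shape (n, d) into the concatenated space (n, 2 * (d + 1)).

    A constant 1 is appended to each point to absorb the bias. The result is
    placed in the source half or the target half; the other half is zeros.
    """
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])  # append the bias feature
    zeros = np.zeros((n, d + 1))          # padding for the other domain
    if domain == "source":
        return np.hstack([Xb, zeros])
    if domain == "target":
        return np.hstack([zeros, Xb])
    raise ValueError("domain must be 'source' or 'target'")

# Toy data: 3 source points and 2 target points in 2 dimensions.
Xs = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]])
Xt = np.array([[2.0, 2.0], [-1.0, 0.0]])
X_hat = np.vstack([augment(Xs, "source"), augment(Xt, "target")])

# A concatenated boundary w = [w_s; w_t] (last entry of each half is the bias).
w_s = np.array([0.5, -1.0, 0.2])
w_t = np.array([0.4, -0.8, 0.1])
w = np.concatenate([w_s, w_t])

# Source rows are scored by w_s only, target rows by w_t only.
scores = X_hat @ w
```

Because of the zero padding, a single inner product with the concatenated variable reproduces the source SVM on source points and the target SVM on target points, which is what lets the two problems be written as one.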

For ease of derivation, we consider the linearly separable SVM and drop the slack variables (we will re-introduce them later). The minimization problem in Equation (1) can now be re-formulated as,

(4)

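where we have defined a matrix, built from the permutation matrices, that expresses the first term as a quadratic form in the concatenated variable. The second term is rewritten in terms of the concatenated variable in the same way. We introduce Lagrangian variables to solve the problem,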

(5)

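We need to minimize the Lagrangian with respect to the concatenated variable and maximize it with respect to the Lagrangian variables. We first optimize with respect to the concatenated variable by setting the derivative to zero. This expresses the concatenated variable in terms of the Lagrangian variables, the labels, the new data points, and the inverse of a matrix formed from an identity matrix and the permutation matrices. By the nature of our permutation matrices, this matrix is full rank and therefore its inverse exists. We substitute this expression for the concatenated variable in Equation (5) to arrive at the SVM dual form, which we need to maximize,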

(6)
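
The chain from the re-formulated primal to the dual can be sketched as follows. The notation is assumed here for illustration: \mathbf{w} is the concatenated variable, \hat{\mathbf{x}}_i are the zero-padded data points, and Q = I + \lambda A, where A is any symmetric matrix with \mathbf{w}^{\top} A \mathbf{w} = \|\mathbf{w}_s - \mathbf{w}_t\|^2. It is a reconstruction consistent with the description, not necessarily the exact form of Equations (4) to (6).

```latex
\begin{aligned}
\text{primal:}\quad
  & \min_{\mathbf{w}} \; \tfrac{1}{2}\, \mathbf{w}^{\top} Q\, \mathbf{w}
    \quad \text{s.t.} \quad y_i\, \mathbf{w}^{\top} \hat{\mathbf{x}}_i \ge 1 \;\; \forall i \\
\text{Lagrangian:}\quad
  & L(\mathbf{w}, \boldsymbol{\alpha})
    = \tfrac{1}{2}\, \mathbf{w}^{\top} Q\, \mathbf{w}
      - \sum_i \alpha_i \big( y_i\, \mathbf{w}^{\top} \hat{\mathbf{x}}_i - 1 \big),
    \qquad \alpha_i \ge 0 \\
\text{stationarity:}\quad
  & \frac{\partial L}{\partial \mathbf{w}} = 0
    \;\Longrightarrow\;
    \mathbf{w} = Q^{-1} \sum_i \alpha_i\, y_i\, \hat{\mathbf{x}}_i \\
\text{dual:}\quad
  & \max_{\boldsymbol{\alpha} \ge 0} \;
    \sum_i \alpha_i
    - \tfrac{1}{2} \sum_{i,j} \alpha_i\, \alpha_j\, y_i\, y_j\;
      \hat{\mathbf{x}}_i^{\top} Q^{-1}\, \hat{\mathbf{x}}_j
\end{aligned}
```

With A = [[I, -I], [-I, I]], the eigenvalues of Q are 1 and 1 + 2\lambda, so Q is positive definite for any \lambda \ge 0 and the inverse used above exists.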

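Equation (6) is the standard SVM dual, with a kernel defined on the new data points. To use any of the standard SVM libraries, we can apply a linear transformation to the new data points so that this kernel becomes an ordinary inner product between the transformed points. The decision boundary learned in the transformed space is then mapped back to the decision boundary in the space of the new data points. We re-introduce the slack variables as upper-bound constraints on the Lagrangian variables. We can easily extend the algorithm to the multi-class setting using one-vs-one or one-vs-all settings. Once the concatenated boundary is estimated, the permutation matrices are used to extract the source and target SVMs.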
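
The reduction to a standard linear SVM can be checked numerically. The sketch below assumes the coupling matrix Q = I + lam * A with A = [[I, -I], [-I, I]] and the transformation z = Q^(-1/2) x_hat; these forms, the function names, and the random stand-in data are assumptions made for illustration. No SVM is trained here: the vector `v` stands in for a boundary that a linear SVM library would return on the transformed points.

```python
import numpy as np

def coupling_matrix(dim, lam):
    """Q = I + lam * A, where w^T A w = ||w_s - w_t||^2 for w = [w_s; w_t]."""
    I = np.eye(dim)
    A = np.block([[I, -I], [-I, I]])
    return np.eye(2 * dim) + lam * A

def inv_sqrt(Q):
    """Symmetric inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(Q)
    return (vecs / np.sqrt(vals)) @ vecs.T

dim, lam = 3, 0.7  # dim counts the bias feature; lam is the coupling weight
Q = coupling_matrix(dim, lam)
Q_inv_sqrt = inv_sqrt(Q)

# Random stand-ins for zero-padded source and target points.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(4, dim))
Xt = rng.normal(size=(2, dim))
X_hat = np.vstack([
    np.hstack([Xs, np.zeros((4, dim))]),
    np.hstack([np.zeros((2, dim)), Xt]),
])

# Transformed points: rows are z_i = Q^(-1/2) x_hat_i (Q^(-1/2) is symmetric).
Z = X_hat @ Q_inv_sqrt

# The coupled kernel equals a plain linear kernel on the transformed points.
K_coupled = X_hat @ np.linalg.inv(Q) @ X_hat.T
K_linear = Z @ Z.T

# Map a boundary v from the transformed space back to w = Q^(-1/2) v,
# then split w into its source and target halves.
v = rng.normal(size=2 * dim)
w = Q_inv_sqrt @ v
w_s, w_t = w[:dim], w[dim:]
```

Under this assumed Q, the kernel between two points of the same domain is their inner product scaled by (1 + lam) / (1 + 2 * lam), and the kernel between a source and a target point is scaled by lam / (1 + 2 * lam), so lam = 0 removes all interaction between the domains.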

4 Experiments

In this section we discuss the extensive experiments we conducted to study the C-SVM model. We first outline the different datasets and their domains. We then outline the DA algorithms we compare against. Finally, we report the experimental details and our results.

4.1 Data Preparation

For our experiments, we consider multiple datasets from different applications and also test the C-SVM with different kinds of features. For all the experiments (except Office-Caltech) we use the following setting. For the training data, we sample a fixed number of examples from every category in the source domain and a fixed number from every category in the target domain. The test data consists of the remaining examples in the target domain that were not used for training.
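
The sampling protocol above can be sketched as follows. The helper name, the category labels and the per-category counts are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def sample_per_category(labels, n_per_category, seed=0):
    """Pick n_per_category indices from every label; return (chosen, remaining)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for label in sorted(by_label):
        idxs = by_label[label]
        if len(idxs) < n_per_category:
            raise ValueError(f"category {label!r} has only {len(idxs)} examples")
        chosen.extend(rng.sample(idxs, n_per_category))
    chosen_set = set(chosen)
    remaining = [i for i in range(len(labels)) if i not in chosen_set]
    return sorted(chosen), remaining

# Toy label lists; the counts below are illustrative, not the paper's values.
source_labels = ["mug"] * 5 + ["bike"] * 5
target_labels = ["mug"] * 4 + ["bike"] * 6

source_train, _ = sample_per_category(source_labels, n_per_category=3)
target_train, target_test = sample_per_category(target_labels, n_per_category=2, seed=1)
```

Only the target split produces a test set: the unused source examples are discarded, while every target example outside the training sample is kept for testing.
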

Office-Caltech datasets: For this experiment we borrow the dataset and the experimental setup outlined in [5]. The Office dataset consists of three domains: Amazon, Dslr and Webcam. The Caltech256 dataset has one domain, Caltech. All the domains share a set of 10 common categories, viz., {back-pack, bike, calculator, headphones, computer-keyboard, laptop, monitor, computer-mouse, coffee-mug, video-projector}. We use the SURF-BoW features provided by [5] for our experiments. For the training data, we sample a fixed number of examples from the source domain (a different number is used when Amazon is the source) and a fixed number of examples from the target domain, following [5].

MNIST-USPS datasets: The MNIST and USPS datasets are benchmark datasets for handwritten digit recognition. These datasets contain grayscale images of digits from 0 to 9. For our experiments, we consider a subset of the images from MNIST and from USPS, selected based on [10]. We refer to these domains as MNIST and USPS respectively. The images are resized to 16 × 16 pixels and represented as vectors of length 256.

CKPlus-MMI dataset: CKPlus [11] and MMI [12] are benchmark facial expression recognition datasets. We select 6 categories, viz., {anger, disgust, fear, happy, sad, surprise}, from the frames with the most intense expression (peak frames) of every facial expression video sequence, to get around 1500 images for each dataset with around 250 images per category. We refer to these domains as CKPlus and MMI. Generic features based on deep convolutional neural networks have shown astounding results across multiple applications [13]. We therefore decided to use an 'off-the-shelf' feature extractor developed by Simonyan and Zisserman [15]. We used the output of the first fully connected layer of the 16-weight-layer model as features of dimension 4096, which were then reduced to 100 using PCA.

HMDB51-UCF50 dataset: We pooled the 11 common categories of activity from HMDB51 [8] and UCF50 [14]. The categories from UCF50 are {BaseballPitch (throw), Basketball (shoot_ball), Biking (ride_bike), Diving (dive), Fencing (fencing), GolfSwing (golf), HorseRiding (ride_horse), PullUps (pullup), PushUps (pushup), Punch (punch), WalkingWithDog (walk)}. The category names from HMDB51 are in parentheses. We refer to these domains as HMDB51 and UCF50. We extract state-of-the-art HOG, HOF, MBHx and MBHy descriptors from the videos according to [7]. We pool the descriptors into one 1x1x1 grid and estimate Fisher Vectors with a mixture of Gaussians. We then apply PCA to reduce the dimension of these Fisher Vectors.
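
The PCA step used for the deep features and the Fisher Vectors can be sketched with a plain SVD. This is a minimal illustration: the random matrix stands in for real features, the 4096-to-100 sizes mirror the CKPlus-MMI description, and whether the projection is fitted on training data only or on all data is not specified in the text, so that choice is left to the caller.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the feature mean and the top principal directions (as rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal directions sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project centered data onto the principal directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4096))  # stand-in for 4096-d CNN features
mean, components = pca_fit(features, n_components=100)
reduced = pca_transform(features, mean, components)
```

The same `mean` and `components` would be reused to project any held-out points, so that training and test features live in the same reduced space.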

Expt.        SVM(T)          SVM(S)          SVM(S+T)        MMDT            AMKL            C-SVM
A → W (1)    56.06 ± 0.95    37.36 ± 1.19    51.26 ± 1.19    64.87 ± 1.26    67.85 ± 1.06*   66.40 ± 1.09
A → D (2)    43.15 ± 0.78    37.64 ± 0.96    47.56 ± 0.99    54.41 ± 1.00    56.22 ± 0.89    57.13 ± 0.98*
W → A (3)    44.39 ± 1.18    32.03 ± 0.90    44.87 ± 0.59    50.54 ± 0.82    52.96 ± 0.57    53.97 ± 0.42*
W → D (4)    45.20 ± 1.34    61.06 ± 0.86    65.39 ± 0.89    62.48 ± 0.98    75.95 ± 0.94*   68.27 ± 0.86
D → A (5)    42.17 ± 1.03    31.48 ± 0.65    46.17 ± 0.44    50.45 ± 0.75    52.36 ± 0.57    54.10 ± 0.55*
D → W (6)    54.91 ± 0.80    69.81 ± 1.06    76.19 ± 0.64    74.34 ± 0.66    85.94 ± 0.44*   77.17 ± 0.46
A → C (7)    26.62 ± 0.60    38.61 ± 0.50    42.46 ± 0.39    39.67 ± 0.50    44.92 ± 0.46*   44.74 ± 0.57
W → C (8)    25.82 ± 0.78    26.67 ± 0.59    34.53 ± 0.76    34.86 ± 0.79    39.20 ± 0.57    39.77 ± 0.59*
D → C (9)    26.88 ± 0.74    25.74 ± 0.47    34.68 ± 0.67    35.82 ± 0.75    41.12 ± 0.44    41.27 ± 0.51*
C → A (10)   43.52 ± 1.07    36.22 ± 0.82    47.75 ± 0.60    51.10 ± 0.76    55.98 ± 0.58*   55.56 ± 0.76
C → W (11)   55.49 ± 1.02    29.72 ± 1.54    51.28 ± 1.23    62.94 ± 1.11    68.70 ± 1.07*   67.74 ± 1.05
C → D (12)   43.07 ± 1.47    32.56 ± 1.03    47.68 ± 1.17    52.56 ± 0.97    58.82 ± 0.83    59.72 ± 1.01*
M → U (13)   70.73 ± 0.41    38.89 ± 0.61    64.36 ± 0.41    68.96 ± 0.43    79.56 ± 0.30*   76.02 ± 0.34
U → M (14)   58.23 ± 0.39    21.67 ± 0.33    38.43 ± 0.36    48.29 ± 0.32    63.80 ± 0.32*   63.25 ± 0.31
K → I (15)   33.31 ± 0.27*   13.30 ± 0.15    25.83 ± 0.31    18.28 ± 0.42    31.87 ± 0.29    33.10 ± 0.29
I → K (16)   45.65 ± 0.47    19.47 ± 0.55    25.63 ± 0.35    21.33 ± 0.81    43.59 ± 0.50    48.54 ± 0.47*
H → F (17)   28.94 ± 0.26    17.45 ± 0.17    23.00 ± 0.19    29.05 ± 0.23    33.06 ± 0.23    35.89 ± 0.25*
F → H (18)   18.64 ± 0.19    16.99 ± 0.16    19.58 ± 0.17    22.28 ± 0.18    24.28 ± 0.16    24.41 ± 0.19*
Table 1: Recognition accuracies (%) on the object, digit, facial expression and activity datasets across multiple algorithms. {Amazon(A), Webcam(W), Dslr(D), Caltech(C), MNIST(M), USPS(U), CKPlus(K), MMI(I), HMDB51(H), UCF50(F)}. A → W implies A is the source domain and W is the target domain. The best result in each row is marked with *.

4.2 Existing Methods

We compare our method with existing supervised DA techniques based on SVMs: SVM(T), a linear SVM trained on the target domain training data; SVM(S), a linear SVM trained on the source domain training data; SVM(S+T), a linear SVM trained on the union of the source and target domain training data; MMDT, the Max-Margin Domain Transform [6]; and AMKL, Adaptive Multiple Kernel Learning [4]. C-SVM denotes our Coupled SVM algorithm.

4.3 Experimental Details and Results

We conducted experiments with different combinations of datasets. Table 1 reports the results comparing multiple algorithms. The results are averaged across multiple splits of the data, with one number of splits used for the Office-Caltech dataset and another for the rest of the experiments. The low accuracies of SVM(S) demonstrate that although the datasets consist of the same categories, the domains have different distributions of data points. This is also highlighted by the success of SVM(T) even with few labeled training data points. The naive union of the source and target training data is beneficial in some cases but not always, as illustrated by SVM(S+T). Amongst the algorithms we have compared with, AMKL is on par with C-SVM, and there is little to choose between the two in terms of accuracy. However, C-SVM is the simpler solution, as it is a standard linear SVM, unlike AMKL, which is a multiple kernel based method.
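
The "on par" comparison can be made concrete by tallying the mean accuracies of Table 1. The sketch below copies the means from the table (the spread values are omitted) and counts, for each method, the experiments in which it has the highest mean. The variable names are chosen here for illustration.

```python
METHODS = ["SVM(T)", "SVM(S)", "SVM(S+T)", "MMDT", "AMKL", "C-SVM"]

# Mean recognition accuracies (%) from Table 1, in METHODS order.
MEAN_ACC = {
    "A->W": [56.06, 37.36, 51.26, 64.87, 67.85, 66.40],
    "A->D": [43.15, 37.64, 47.56, 54.41, 56.22, 57.13],
    "W->A": [44.39, 32.03, 44.87, 50.54, 52.96, 53.97],
    "W->D": [45.20, 61.06, 65.39, 62.48, 75.95, 68.27],
    "D->A": [42.17, 31.48, 46.17, 50.45, 52.36, 54.10],
    "D->W": [54.91, 69.81, 76.19, 74.34, 85.94, 77.17],
    "A->C": [26.62, 38.61, 42.46, 39.67, 44.92, 44.74],
    "W->C": [25.82, 26.67, 34.53, 34.86, 39.20, 39.77],
    "D->C": [26.88, 25.74, 34.68, 35.82, 41.12, 41.27],
    "C->A": [43.52, 36.22, 47.75, 51.10, 55.98, 55.56],
    "C->W": [55.49, 29.72, 51.28, 62.94, 68.70, 67.74],
    "C->D": [43.07, 32.56, 47.68, 52.56, 58.82, 59.72],
    "M->U": [70.73, 38.89, 64.36, 68.96, 79.56, 76.02],
    "U->M": [58.23, 21.67, 38.43, 48.29, 63.80, 63.25],
    "K->I": [33.31, 13.30, 25.83, 18.28, 31.87, 33.10],
    "I->K": [45.65, 19.47, 25.63, 21.33, 43.59, 48.54],
    "H->F": [28.94, 17.45, 23.00, 29.05, 33.06, 35.89],
    "F->H": [18.64, 16.99, 19.58, 22.28, 24.28, 24.41],
}

# Count the experiments in which each method attains the highest mean accuracy.
wins = {method: 0 for method in METHODS}
for accs in MEAN_ACC.values():
    wins[METHODS[accs.index(max(accs))]] += 1

# Average of the mean accuracies over all experiments, per method.
average = {
    method: sum(accs[i] for accs in MEAN_ACC.values()) / len(MEAN_ACC)
    for i, method in enumerate(METHODS)
}
```

On these numbers, C-SVM has the highest mean in 9 of the 18 experiments, AMKL in 8, and SVM(T) in 1, while the averages over all experiments are roughly 54.2% for AMKL and 53.7% for C-SVM. This tally ignores the reported spreads, so it is a coarse summary rather than a significance test.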

Figure 2: Average recognition accuracy (%) across different experiments varying number of training examples in source and target.

In all of these experiments we apply leave-one-out cross validation across the training target data to determine the best values of the model parameters. We also studied the C-SVM by varying the number of samples available for training. We dropped the Webcam and Dslr datasets as they have fewer data points. Figure 2(a) illustrates that increasing the number of source training data points does not affect the test accuracies. The SVM relies on support vectors to estimate the source decision boundary, and additional source training data does not modify the source boundary by much. The effect of additional target training data is comparatively more pronounced in Figure 2(b), which is intuitive. By far the most interesting is Figure 2(c). Increasing both source and target training data is nearly comparable to increasing only the number of target training data points. Source training data does not contribute to the target SVM beyond a threshold number of training data points.
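
Leave-one-out parameter selection can be sketched generically as below. The `fit` and `predict` callables, the parameter grid and the toy one-dimensional threshold classifier are hypothetical stand-ins; in the C-SVM setting, `fit` would additionally see the full source training set and train the coupled model, which is not shown here.

```python
from itertools import product

def leave_one_out_score(samples, labels, fit, predict, params):
    """Fraction of held-out samples classified correctly under `params`."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y, **params)
        correct += int(predict(model, samples[i]) == labels[i])
    return correct / len(samples)

def select_params(samples, labels, fit, predict, grid):
    """Return the grid setting with the best leave-one-out accuracy."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = leave_one_out_score(samples, labels, fit, predict, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy classifier: threshold halfway between the class means, moved by `shift`.
def fit(xs, ys, shift):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2 + shift

def predict(threshold, x):
    return 1 if x > threshold else -1

samples = [0.0, 0.2, 0.4, 1.0, 1.2, 1.4]
labels = [-1, -1, -1, 1, 1, 1]
best_params, best_score = select_params(
    samples, labels, fit, predict, grid={"shift": [-1.0, 0.0, 1.0]}
)
```

With only a handful of labeled target points per category, leave-one-out uses almost all of them for training in every fold, which is the usual reason to prefer it over a fixed validation split in this regime.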

5 Conclusions

The C-SVM is elegant, efficient and easy to implement. We plan to extend this work to study nonlinear adaptations in the future. We would like to model classifier similarity in an infinite-dimensional (kernel) space and also explore unsupervised DA.

6 Acknowledgments

This material is based upon work supported by the National Science Foundation (NSF) under Grant No:1116360. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

  • [1] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In IEEE ICCV, 2011.
  • [2] L. Bruzzone and M. Marconcini. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE PAMI, 32(5):770–787, 2010.
  • [3] H. Daumé III. Frustratingly easy domain adaptation. In Association of Computational Linguistics, 2007.
  • [4] L. Duan, I. W. Tsang, and D. Xu. Domain transfer multiple kernel learning. IEEE PAMI, 34(3):465–479, 2012.
  • [5] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE CVPR, 2012.
  • [6] J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell. Efficient learning of domain-invariant image representations. In Int’l Conference on Learning Representations (ICLR), 2013.
  • [7] V. Kantorov and I. Laptev. Efficient feature extraction, encoding, and classification for action recognition. In IEEE CVPR, 2014.
  • [8] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In IEEE ICCV, 2011.
  • [9] X. Li. Regularized adaptation: Theory, algorithms and applications. PhD thesis, University of Washington, 2007.
  • [10] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer joint matching for unsupervised domain adaptation. In IEEE CVPR, 2014.
  • [11] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In IEEE (CVPRW), 2010.
  • [12] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-based database for facial expression analysis. In IEEE ICME 2005.
  • [13] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In IEEE (CVPRW), 2014, pages 512–519.
  • [14] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013.
  • [15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [16] C. Widmer, M. Kloft, N. Görnitz, G. Rätsch, P. Flach, T. De Bie, and N. Cristianini. Efficient training of graph-regularized multitask svms. In ECML 2012.
  • [17] J. Yang, R. Yan, and A. G. Hauptmann. Adapting svm classifiers to data with shifted distributions. In Data Mining Workshops, IEEE ICDM, pages 69–76, 2007.