Transfer learning is a widely studied research topic in data mining and machine learning. It improves classification performance on a target data set (the target domain), for which sufficient labeled samples are hard to obtain, by exploiting one or more related auxiliary sample sets (referred to as source domains). Effective knowledge transfer can be achieved by integrating relevant source domain instances into the training of the model, or by mapping a model trained on the source domain to the target domain through domain adaptation.
Most existing research on transfer learning [2, 3, 4, 5, 6] focuses on the offline (batch) setting, where the training data sets of the source and target domains are assumed to be given in advance. In many real-world applications, however, training examples arrive sequentially, and collecting all the data at once is often impossible or prohibitively expensive. Research on efficient online transfer learning algorithms that use only a few target examples is therefore receiving increasing attention.
Over the past ten years, important work on online transfer learning includes [7, 8, 9, 10]. The main challenge of online transfer learning is that the transfer is not guaranteed to improve classification, because an unsuitable source domain may lead to negative transfer. To address it, many researchers combine transfer learning with other machine learning techniques to maximize positive transfer and improve the final classification performance. The learning framework represented by the online transfer learning algorithm (OTL) is currently the most popular. Wang and Pineau extended an ensemble-based transfer approach to the online setting.
In this paper, we focus on online transfer learning with homogeneous data and propose a new algorithm, OTBag. It applies the classic bagging algorithm to the online transfer learning setting, exploiting the strengths of bagging to build a more complex strong classifier from a series of weak classifiers. To some extent, this overcomes a limitation of the PA algorithm used in OTL, which relies on a single classifier model and cannot capture more complex structure. To prevent the negative transfer that transfer learning can cause, we also propose two filtering strategies (SDMV and JDSMV) for the final model selection stage that follows the online training of the weak classifiers.
The rest of the paper is organized as follows. Section II introduces related work. Section III presents our algorithm OTBag and the two filtering strategies. Section IV reports the experiments, and Section V concludes.
II Related Work
II-A Online Transfer Learning
Online transfer learning combines two important branches of machine learning research, namely online learning and transfer learning. Online learning addresses practical problems in which the training data arrive sequentially. Widely used online algorithms fall into three categories: (a) perceptron-based online learning algorithms; (b) online algorithms based on support vector machines (SVM); (c) online algorithms based on ensemble learning.
Transfer learning uses knowledge from the source domains as auxiliary knowledge to improve learning performance on the target domain. Depending on the learning setting, existing transfer learning methods can be divided into three categories: inductive transfer learning, transductive transfer learning, and unsupervised transfer learning. As a tool, transfer learning is also widely used in other areas of machine learning, including adversarial transfer in deep learning and robot motion generation tasks [21, 20].
Online transfer learning combines and goes beyond traditional online learning and transfer learning. In addition, much current research combines online transfer learning with traditional learning algorithms, or extends offline transfer learning algorithms to online versions, to deal with real-world problems in online scenarios. Shan et al. use ensemble learning and active learning to build a stable online learning framework that handles concept drift in online data streams. Yan et al. solve online heterogeneous transfer learning tasks by building a classifier with a weighted ensemble that combines offline and online decisions. The traditional offline Boosting for Transfer Learning algorithm (TrAdaBoost) has also been modified into an online transfer boosting method by combining it with online boosting.
II-B Online Bagging
Online bagging is an extension of traditional offline bagging. In offline bagging the entire training set is available in advance, and the bootstrap training set of each base model is obtained by sampling with replacement from the entire training set. As a result, each original training example can appear 0, 1, 2 or more times in each base training set. Let K denote the number of copies of a given original training example in the bootstrap training set of a base model, and N the size of the training set. Oza et al. express this as

$$P(K = k) = \binom{N}{k}\left(\frac{1}{N}\right)^{k}\left(1 - \frac{1}{N}\right)^{N-k} \qquad (1)$$

which is a binomial distribution. Processing the training set one instance at a time, drawing K from Equation (1) for the current instance, and presenting that instance K times to the base model is equivalent to the N draws with replacement from the entire data set used in traditional batch bagging. In the online setting, however, the total number of training samples N is unknown because samples keep arriving, so Equation (1) cannot be used directly. Since the samples keep arriving, we can let $N \to \infty$, in which case the binomial distribution tends to the Poisson(1) distribution: $P(K = k) = e^{-1}/k!$. This removes the dependence of Equation (1) on the total number of samples N. For each new instance in online training, $K \sim \mathrm{Poisson}(1)$ is used to generate the number of updates applied to each base classifier. As in batch bagging, the final classifier uses majority voting: it outputs the label predicted by the largest number of base classifiers.
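The Poisson(1) update rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the perceptron base learner, the names `poisson1`, `Perceptron` and `OnlineBagging`, the {-1, +1} label encoding, and the tie-breaking toward +1 are all choices made for the example.

```python
import math
import random


def poisson1(rng):
    """Draw K ~ Poisson(1) with Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


class Perceptron:
    """Minimal online base learner for labels in {-1, +1} (illustrative choice)."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else -1

    def update(self, x, y):
        if self.predict(x) != y:  # mistake-driven update
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]
            self.b += y


class OnlineBagging:
    """Online bagging: each base model sees each example K ~ Poisson(1) times."""

    def __init__(self, n_models, dim, seed=0):
        self.rng = random.Random(seed)
        self.models = [Perceptron(dim) for _ in range(n_models)]

    def update(self, x, y):
        for model in self.models:
            for _ in range(poisson1(self.rng)):
                model.update(x, y)

    def predict(self, x):
        # Majority vote; a tie is resolved to +1 (assumption).
        votes = sum(model.predict(x) for model in self.models)
        return 1 if votes >= 0 else -1
```

Because K can be 0, some base models skip a given example entirely, which is what gives the ensemble its diversity without storing the stream.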
Online bagging is a good approximation of batch bagging because the two sampling methods produce approximately the same distribution over bootstrap training sets, and when the training sets have similar distributions, the base model learning algorithms produce similar hypotheses.
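The quality of this approximation can be checked numerically. The sketch below compares the binomial probabilities of drawing an example k times in N draws (each with probability 1/N) against the Poisson(1) probabilities for growing N. The helper names are hypothetical, and comparing only k up to 10 is an arbitrary choice for the example.

```python
import math


def binomial_pmf(k, n):
    """P(K = k) when each of n draws picks a given example with probability 1/n."""
    p = 1.0 / n
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson1_pmf(k):
    """P(K = k) for K ~ Poisson(1)."""
    return math.exp(-1.0) / math.factorial(k)


def max_gap(n, k_max=10):
    """Largest pointwise difference between the two distributions for k <= k_max."""
    return max(
        abs(binomial_pmf(k, n) - poisson1_pmf(k))
        for k in range(min(n, k_max) + 1)
    )


for n in (10, 100, 1000, 10000):
    print(n, max_gap(n))
```

The printed gap shrinks roughly in proportion to 1/N, which is why the online sampler can ignore N once the stream is long.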
III Proposed Algorithm
The proposed online transfer bagging (OTBag) algorithm is an extension of the online bagging algorithm. Kamishima et al. first introduced transfer learning into the bagging algorithm in the batch setting, using a large number of source domain instances to compensate for the shortage of target training instances; effective and diverse source data can reduce the target domain training error and improve the performance of the classifier. Our algorithm is inspired by this idea.
By using source domain knowledge for the target domain, we expect the variance component of the error to be reduced further. However, transfer learning has an inherent weakness, and negative transfer remains a problem in the online setting: introducing source domain instances that are unrelated to the target concept not only fails to help the construction of the target classifier but may make the final result worse. Ideally, we would identify such irrelevant instances during training, but this is not possible. Instead, we can add a filtering strategy to the classifier, which reduces the impact of negative transfer to a certain extent.
To explain our ideas more clearly, we call the process of learning the weak classifiers the training phase and the selection process that constructs the final classifier the filtering phase, and describe both in detail below. The input consists of a set of labeled source domain examples and a set of labeled target domain examples. The labels of both the source and the target instances in the training set are known, but the target domain training examples are few compared with the source domain examples.
1) Training Phase: In the training phase, the source and target training instances are merged into a single training set. The order of the samples is randomly shuffled, but a flag identifying the instances that come from the target domain is retained. Two models are trained: a transfer model, an ensemble of M weak classifiers trained on both source and target instances, and a target-only model, trained on target instances alone. For each new sample, we first determine whether it belongs to the source domain or the target domain. A target instance is used to train both models; a source instance is used to train only the transfer model. See Algorithm 2 for details. For both models, H and F, we also record whether each new target instance is classified correctly. For the transfer model, this record is kept for each weak classifier separately. For the target-only model, there is a single indicator: the classification of the newly arrived target training instance by the final classifier that the model builds through majority voting. Algorithm 1 (OTBag) does not use a filtering strategy to reduce negative transfer, so it does not need the extra model. Since the data arrive continuously online, the prediction accuracy on the target instances of the training set is also updated and recorded in real time.
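The routing of instances and the bookkeeping described above can be sketched as follows. This is an illustration under assumptions, not the authors' Algorithm 2: the perceptron base learner, the names `train_phase`, `mixed` and `target_only`, the choice of giving both ensembles the same size, and the predict-before-update order used for the accuracy counters are all hypothetical.

```python
import math
import random


def poisson1(rng):
    """Draw K ~ Poisson(1) (Knuth's method)."""
    limit, k, prod = math.exp(-1.0), 0, rng.random()
    while prod > limit:
        k, prod = k + 1, prod * rng.random()
    return k


class Perceptron:
    """Stand-in online base learner, labels in {-1, +1}."""

    def __init__(self, dim):
        self.w, self.b = [0.0] * dim, 0.0

    def predict(self, x):
        return 1 if sum(w * v for w, v in zip(self.w, x)) + self.b >= 0 else -1

    def update(self, x, y):
        if self.predict(x) != y:
            self.w = [w + y * v for w, v in zip(self.w, x)]
            self.b += y


def vote(models, x):
    """Majority vote of an ensemble; ties go to +1 (assumption)."""
    return 1 if sum(m.predict(x) for m in models) >= 0 else -1


def bag_update(models, x, y, rng):
    """Online bagging update: each model sees (x, y) Poisson(1) times."""
    for m in models:
        for _ in range(poisson1(rng)):
            m.update(x, y)


def train_phase(stream, dim, n_weak=5, seed=0):
    """stream yields (x, y, is_target); returns both ensembles and counters."""
    rng = random.Random(seed)
    mixed = [Perceptron(dim) for _ in range(n_weak)]        # source + target
    target_only = [Perceptron(dim) for _ in range(n_weak)]  # target only
    weak_hits = [0] * n_weak  # correct target predictions per weak classifier
    ref_hits = 0              # correct target predictions of the target-only vote
    n_target = 0
    for x, y, is_target in stream:
        if is_target:
            # Record correctness before updating (assumed order).
            n_target += 1
            for i, m in enumerate(mixed):
                weak_hits[i] += m.predict(x) == y
            ref_hits += vote(target_only, x) == y
            bag_update(target_only, x, y, rng)
        # Every instance, source or target, trains the mixed ensemble.
        bag_update(mixed, x, y, rng)
    return mixed, target_only, weak_hits, ref_hits, n_target
```

Dividing `weak_hits[i]` and `ref_hits` by `n_target` gives the running target accuracies that a filtering strategy can compare.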
2) Filtering Phase: At the end of the training phase, we have the two trained models. In the filtering phase, we apply different selection strategies to the weak classifiers in order to predict the labels of the target concept more accurately and to reduce the impact of negative transfer. As in traditional batch and online bagging, the most common choice is to combine the final M weak classifiers by majority voting, as in our Algorithm 1. This strategy, however, suits situations in which the source and target domains are conceptually as similar as possible, i.e. the source domain instances have a positive impact on the target instances. Since we cannot confirm whether the given source instances are conceptually similar to the target instances, we propose two other strategies that keep the benefit of ensemble learning for classification performance while also reducing the impact of negative transfer.
2.1) Simple Dominant Majority Voting (SDMV): The target training accuracy maintained for each weak classifier is compared directly with the target training accuracy of the model trained only on target instances. If a weak classifier has a worse empirical error on the target training set than that model, it is rejected and is not used to construct the final classifier. The M weak classifiers are compared with the model in turn, and the remaining set of dominant weak classifiers is used to construct the final strong classifier. As in traditional online bagging, majority voting over this set gives the final model. See Algorithm 2 for details.
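The SDMV selection rule can be sketched as a small filter followed by a vote. This is an illustrative reading of the description, not the authors' Algorithm 2: the function names are hypothetical, labels are assumed to be in {-1, +1}, and both the tie-breaking toward +1 and the fallback to the target-only prediction when no weak classifier survives are assumptions that the text does not specify.

```python
def sdmv_select(weak_acc, ref_acc):
    """Indices of weak classifiers whose target accuracy is not worse than ref_acc."""
    return [i for i, acc in enumerate(weak_acc) if acc >= ref_acc]


def majority_vote(predictions):
    """Majority vote over labels in {-1, +1}; ties go to +1 (assumption)."""
    return 1 if sum(predictions) >= 0 else -1


def sdmv_predict(weak_preds, weak_acc, ref_pred, ref_acc):
    """Vote only among the weak classifiers that dominate the reference model."""
    kept = sdmv_select(weak_acc, ref_acc)
    if not kept:
        # No weak classifier survives: fall back to the reference (assumption).
        return ref_pred
    return majority_vote([weak_preds[i] for i in kept])
```

For example, with target accuracies `[0.9, 0.6, 0.8]` and a reference accuracy of `0.8`, only the first and third weak classifiers take part in the final vote.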
2.2) Joint Double Subset Majority Voting (JDSMV): This method divides the online training process into multiple time segments and measures the classification of the target training instances within each segment. Different combinations of weak classifiers can then be selected in different segments, which yields multiple subsets of weak classifiers. The online training process, covering both the source and the target domain, is divided into a fixed number of time segments. This number is a specified hyperparameter, and it determines the number of training samples contained in each segment. As shown in Algorithm 3, both models are trained (initialized) on the training instances of the first time segment. Starting from the second time segment, for the target domain training instances in the current segment, the classification accuracy of every weak classifier and the online classification accuracy of the model trained only on target instances are recorded separately. The indexes of the weak classifiers whose accuracy in that segment is higher than that of the target-only model are stored as the index set of the segment. Once all time segments have been processed in turn, that is, once all training instances have been used, the per-segment index sets together form the total index set. For each time segment, the weak classifiers whose indexes were recorded form a subset, and majority voting within this subset yields one decision. The model trained only on target instances also yields a decision by majority voting. Finally, majority voting over all of these decisions produces the final classification model.
Compared with the majority voting used by the traditional bagging algorithm, which can be regarded as a single-layer ensemble, the joint majority vote over the double subsets recombines the dominant weak classifiers, i.e. those biased toward the target instances, and thereby further filters out weak classifiers with negative effects.
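The two-level vote of JDSMV can be sketched as follows. This is one possible reading of the description, not the authors' Algorithm 3: the function names are hypothetical, labels are assumed to be in {-1, +1}, ties go to +1, and segments whose index set is empty are simply skipped, which the text does not specify. The accuracy lists are assumed to cover the segments from the second one onward, since the first segment only initializes the models.

```python
def segment_index_sets(weak_acc_by_segment, ref_acc_by_segment):
    """One index set per segment: weak classifiers strictly better than the reference."""
    return [
        [i for i, acc in enumerate(weak_acc) if acc > ref_acc]
        for weak_acc, ref_acc in zip(weak_acc_by_segment, ref_acc_by_segment)
    ]


def majority_vote(predictions):
    """Majority vote over labels in {-1, +1}; ties go to +1 (assumption)."""
    return 1 if sum(predictions) >= 0 else -1


def jdsmv_predict(weak_preds, ref_pred, index_sets):
    """First vote inside each segment's subset, then vote across the decisions."""
    decisions = [
        majority_vote([weak_preds[i] for i in indexes])
        for indexes in index_sets
        if indexes  # skip segments where no weak classifier dominated (assumption)
    ]
    decisions.append(ref_pred)  # the target-only model contributes one decision
    return majority_vote(decisions)
```

A weak classifier that dominates in several segments appears in several subsets, so it influences the final decision more than one that dominates only once.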
IV Experiments
In this section, we verify the performance of the proposed algorithms on three data sets. In all experiments, a large number of source domain instances and a small number of target domain instances are merged directly and used for training in online form. To reduce the effect of the random ordering of the training examples on the final result, each reported result is the average over 20 randomly permuted training and test sets. We set the number of iterations, and the corresponding parameter of the comparison algorithms, to 10. The ratio of training set to test set is 4:6.
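The repetition protocol can be sketched as a small harness. This is only an illustration of the averaging over 20 random arrangements with a 4:6 split; `run_once` is a hypothetical callback that trains on the first part and returns a test accuracy, and applying the split to a single flat list of examples is an assumption, since the text does not say how the split interacts with the two domains.

```python
import random
import statistics


def split_4_6(items, rng):
    """Shuffle a copy of items and split it 40% train / 60% test."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(round(0.4 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def repeated_eval(items, run_once, n_repeats=20, seed=0):
    """Average run_once(train, test) over n_repeats random arrangements."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train, test = split_4_6(items, rng)
        scores.append(run_once(train, test))
    return statistics.mean(scores), statistics.stdev(scores)
```

Returning the standard deviation alongside the mean makes it possible to compare the stability of different algorithms as well as their accuracy.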
IV-A Data Sets
Sentiment analysis data set: This data set is composed of Amazon user reviews of four types of products: books, DVDs, electronics, and kitchen. Each review contains a user rating (0-5 stars), review title, product name, reviewer name, location, date, and comment content. Reviews rated above 3 stars are marked as positive instances, reviews rated below 3 stars are marked as negative instances, and the remaining reviews are discarded because their ratings are ambiguous. The other preprocessing steps follow prior work on this data set; the feature dimension of each sample is 400, and each domain contains 2000 positive/negative samples. We selected 50% as the total of the training and test sets used in the experiment. Each task is denoted by a symbol that pairs the initials of the two domains, for example a source domain from books (b) and a target domain from DVDs (d). Every ordered pair of distinct domains defines a task, so the 4 domains generate 4 × 3 = 12 transfer learning tasks.
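The labeling rule and the task count can be made concrete with a short sketch. The names `label_from_rating`, `DOMAINS` and `TASKS` are hypothetical, and the {-1, +1} label encoding is an assumption made for the example.

```python
from itertools import permutations

DOMAINS = ["books", "dvd", "electronics", "kitchen"]


def label_from_rating(stars):
    """+1 above 3 stars, -1 below 3 stars, None (discarded) for exactly 3 stars."""
    if stars > 3:
        return 1
    if stars < 3:
        return -1
    return None


# Every ordered (source, target) pair of distinct domains is one transfer task.
TASKS = list(permutations(DOMAINS, 2))
print(len(TASKS))
```

Using ordered pairs matters here: books as source with DVDs as target is a different task from DVDs as source with books as target.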
Object recognition data set: We extract SURF features from the image sets, encode each image with an 800-bin histogram, and finally normalize and z-score the features. We treat each data set as a domain, selecting one domain as the source and another as the target. A symbol pairing the domain initials denotes the task, for example a source domain from Amazon (A) and a target domain from Caltech-256 (C). We pair adjacent classes in order to form binary classification problems, which gives five groups: BACKPACK vs TOURING-BIKE, CALCULATOR vs HEAD-PHONES, KEYBOARD vs LAPTOP-101, MONITOR vs MOUSE, and COFFEE-MUG vs VIDEO-PROJECTOR. For each class in the data set, 60% of the samples are used as the test set and the rest as the training set.
Mixed data set: To verify more clearly how the algorithms behave under negative transfer, the two data sets above are mixed to form a third data set. For example, part of the Amazon data from object recognition and the books data from sentiment analysis (with features truncated to 400 dimensions here) are mixed into a new source domain, the DVDs data from sentiment analysis is used as the target domain, and the task is denoted accordingly. Because object recognition samples are mixed in, the resulting source domain differs greatly from the target domain; knowledge cannot be learned directly by inductive transfer, and significant negative transfer can even occur.
IV-B Experimental Results
In the experiments, we focus on the classification performance of the algorithms OTBag, OTBag-SDMV, and OTBag-JDSMV on the three data sets. We compare the three proposed algorithms with the currently most popular online transfer algorithms: OTB, HomOTL-I, and HomOTL-II. On the one hand, we compare classification performance on the first two data sets, where the source domains are similar to the target concept; on the other hand, we verify how the algorithms respond to negative transfer on the mixed data set.
Table I shows that, across the 12 transfer tasks of the sentiment analysis data set, OTBag and OTBag-JDSMV perform best, with little difference in accuracy between them. This is because the positive source instances increase the diversity of the training samples, and the majority voting strategy over the final weak classifiers preserves this positive effect. The proposed OTBag algorithm improves accuracy over the OTB, HomOTL-I, and HomOTL-II algorithms. We attribute this to the better performance of a strong classifier built from multiple weak classifiers.
Compared with the direct majority voting of OTBag, the OTBag-JDSMV algorithm uses a double joint voting mechanism that builds subsets of weak classifiers according to the selection strategy, which makes the final strong classifier more robust. The limitation of this strategy is that it only recombines the weak classifiers; the weak classifiers themselves are not modified, so the final result differs little from that of OTBag. Note that our other strategy, OTBag-SDMV, is not outstanding in classification performance and is only slightly better than HomOTL-I/II on some tasks. The reason is that few weak classifiers classify the target training instances better than the model trained only on target instances. The purpose of this strategy is rather to perform better on data sets where inductive transfer cannot be applied directly, that is, under "negative transfer" (explained in the experiment below).
[Table II: accuracy of each algorithm on the object recognition tasks BACKPACK vs TOURING-BIKE, CALCULATOR vs HEAD-PHONES, KEYBOARD vs LAPTOP-101, MONITOR vs MOUSE, and COFFEE-MUG vs VIDEO-PROJECTOR.]
Table II shows the accuracy of each algorithm on the object recognition data set. In 9 of the 10 tasks (5 groups), the best performance belongs to OTBag or OTBag-JDSMV. As discussed above, ensemble learning offers a richer representation on this data set than HomOTL-I/II, which use a single PA algorithm. The OTB algorithm, which also uses an ensemble learning method (boosting), performs better overall than the HomOTL-I/II algorithms. We attribute the fact that OTB does not perform as well as our OTBag to the premature convergence of the source domain weights in the TrAdaBoost algorithm. Notably, the accuracy of OTBag-JDSMV differs little from that of OTBag, but its standard deviation is much smaller. This confirms that the double subset majority voting strategy of OTBag-JDSMV improves the stability of OTBag while preserving accuracy.
On the mixed data set, our main purpose is to show how our algorithms handle negative transfer. As shown in Table III, on the six tasks under mixed data, both the SDMV and JDSMV strategies make OTBag perform better than the baseline algorithms. OTBag-JDSMV stands out and is the best on four of the tasks. The original OTBag algorithm, which uses no negative transfer filtering strategy, performs similarly to the baseline algorithms; here the source instances play a more negative role. The two filtering strategies reduce negative transfer and improve classification accuracy on this data set.
V Conclusion
In this paper, we proposed OTBag, an online transfer learning framework based on bagging. The algorithm achieves better classification accuracy than popular single source domain online transfer learning methods. We also extended the filtering phase of OTBag with two strategies (OTBag-SDMV and OTBag-JDSMV) to reduce the impact of negative transfer. Of these, the joint double subset majority voting strategy (OTBag-JDSMV) performs strongly on all three data sets.
Our algorithm performs well on all three real data sets, but it is limited to binary classification. In future work, we will introduce the multi-class setting so that the algorithm better matches real-world problems with multiple labels. We will also consider multiple source domains to further improve the performance of the algorithm.
This work was supported by the National Natural Science Foundation of China (Grant No. 61673328) and the Shenzhen Scientific Research and Development Funding Program (Grant No. JCYJ20180307123637294).
- [1] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 3, no. 1, p. 9, 2016.
- [2] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting for transfer learning,” in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 193–200.
- [3] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2010.
- [4] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2200–2207.
- [5] Q. Wu, M. K. Ng, and Y. Ye, “Cotransfer learning using coupled markov chains with restart,” IEEE Intelligent Systems, vol. 29, no. 4, pp. 26–33, 2013.
- [6] M. Jiang, W. Huang, Z. Huang, and G. G. Yen, “Integration of global and local metrics for domain adaptation learning via dimensionality reduction,” IEEE Transactions on Cybernetics, vol. 47, no. 1, pp. 38–51, 2015.
- [7] P. Zhao, S. C. Hoi, J. Wang, and B. Li, “Online transfer learning,” Artificial Intelligence, vol. 216, pp. 76–102, 2014.
- [8] B. Wang and J. Pineau, “Online boosting algorithms for anytime transfer and multitask learning,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- [9] Y. Yan, Q. Wu, M. Tan, M. K. Ng, H. Min, and I. W. Tsang, “Online heterogeneous transfer by hedge ensemble of offline and online decisions,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3252–3263, 2017.
- [10] Q. Chen, Y.-t. Du, M. Xu, and C.-j. Wang, “Heteotl: An algorithm for heterogeneous online transfer learning,” in 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2018, pp. 350–357.
- [11] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” Journal of Machine Learning Research, vol. 7, no. Mar, pp. 551–585, 2006.
- [12] T. Shi and J. Zhu, “Online bayesian passive-aggressive learning,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 1084–1122, 2017.
- [13] N. Cesa-Bianchi, A. Conconi, and C. Gentile, “On the generalization ability of on-line learning algorithms,” IEEE Transactions on Information Theory, vol. 50, no. 9, pp. 2050–2057, 2004.
- [14] P. Zhao, S. C. Hoi, and R. Jin, “Double updating online learning,” Journal of Machine Learning Research, vol. 12, no. May, pp. 1587–1615, 2011.
- [15] H. Yang, Z. Xu, I. King, and M. R. Lyu, “Online learning for group lasso,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 1191–1198.
- [16] N. C. Oza, “Online bagging and boosting,” in 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 3. IEEE, 2005, pp. 2340–2345.
- [17] A. Niculescu-Mizil and R. Caruana, “Inductive transfer for bayesian network structure learning,” in Artificial Intelligence and Statistics, 2007, pp. 339–346.
- [18] H. Daume III and D. Marcu, “Domain adaptation for statistical classifiers,” Journal of Artificial Intelligence Research, vol. 26, pp. 101–126, 2006.
- [19] S. Samanta, A. T. Selvan, and S. Das, “Cross-domain clustering performed by transfer of knowledge across domains,” in 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG). IEEE, 2013, pp. 1–4.
- [20] Y. Li, S. Bai, Y. Zhou, C. Xie, Z. Zhang, and A. Yuille, “Learning transferable adversarial examples via ghost networks,” arXiv preprint arXiv:1812.03413, 2018.
- [21] S. Vyas, N. Even-Chen, S. D. Stavisky, S. I. Ryu, P. Nuyujukian, and K. V. Shenoy, “Neural population dynamics underlying motor learning transfer,” Neuron, vol. 97, no. 5, pp. 1177–1186, 2018.
- [22] J. Shan, H. Zhang, W. Liu, and Q. Liu, “Online active learning ensemble framework for drifted data streams,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 2, pp. 486–498, 2018.
- [23] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
- [24] T. Kamishima, M. Hamasaki, and S. Akaho, “Trbagg: A simple transfer learning method and its application to personalization in collaborative tagging,” in 2009 Ninth IEEE International Conference on Data Mining. IEEE, 2009, pp. 219–228.
- [25] M. Chen, Z. Xu, K. Weinberger, and F. Sha, “Marginalized denoising autoencoders for domain adaptation,” arXiv preprint arXiv:1206.4683, 2012.
- [26] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2066–2073.
- [27] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” 2007.
- [28] Z. Cui, W. Li, D. Xu, S. Shan, X. Chen, and X. Li, “Flowing on riemannian manifold: Domain adaptation by shifting covariance,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2264–2273, 2014.
- [29] S. Al-Stouhi and C. K. Reddy, “Adaptive boosting for transfer learning using dynamic updates,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2011, pp. 60–75.