1 Introduction
Transfer learning is a popular practice in deep neural networks. Fine-tuning a large number of parameters is difficult due to the complex wiring of neurons between splitting layers. The imbalanced distributions of data in the primary and target domains add to the difficulty of the problem. Reconstructing the primary wiring for the target network is a heavy burden, considering the size of the interconnections across neurons.
For supervised learning, many classification algorithms assume the same distributions for training and test data. If this distribution changes, the statistical models must be rebuilt, which is not always practical due to the difficulty of recollecting the training data or the complexity of the learning process. One of the solutions is transfer learning, which transfers the classification knowledge into a new domain (pan2010survey). The aim is to learn highly generalized models for domains with different probability distributions, or for domains without labelled data (wang2014flexible, zhang2013domain). Here, the main challenge is to reduce the shift in data distribution between the domains, by using algorithms that minimize their discrimination. It is worth mentioning that this cannot resolve domain-specific variations (long2016deep).
Transfer learning has also proven to be highly beneficial at boosting the overall performance of deep neural networks. Deep learning practices usually require huge amounts of labelled data to learn powerful models. Transfer learning enables adaptation to a different source with a small number of training samples.
On the other hand, deep neural networks practically learn intermediate features. They can provide better transfer learning, because some of those features generalize well among various domains of knowledge (glorot2011domain). These transferable features generally underlie several probability distributions (oquab2014learning), which reduces the cross-domain discrepancy (yosinski2014transferable). The common observation among several deep architectures is that features learned in the bottom layers are not that specific, while moving towards the top layers tailors them to the target dataset or task.
A recent study of the generality or specificity of deep layers for transfer learning shows two difficulties which may affect the transfer of deep features (yosinski2014transferable). First, top layers get quite specialized to their primary tasks, and second, some optimization difficulties arise due to the splitting of the network between adapted layers. In spite of these negative effects, other studies have confirmed that transferred features not only perform better than random representations, but also provide better initialization.
This paper proposes a distributed backpropagation scheme in which the convolutional filters are fine-tuned individually, but are backpropagated all at once. This is done by means of the Basic Probability Assignment (sentz2002combination) of evidence theory. The primary filters are therefore gradually transferred, based on their contributions to the classification performance on the target domain. This approach largely reduces the complexity of transfer learning, whilst improving precision. The experimental results on standard benchmarks and various scenarios confirm the consistent improvement of the distributed backpropagation strategy for transfer learning.
2 Method
A novel framework for distributed backpropagation in deep convolutional networks is introduced, which alleviates the burden of splitting a network through the middle of fragile layers. The intuition is that this difficulty relates to the complexity of the deep architecture and the imbalanced data distributions of the primary and target domains.
On one hand, the splitting of layers results in optimization difficulty, because of the high complexity of the interconnections between neurons of adapted layers. To address this, the convolutional filters are fine-tuned individually, which reduces the complexity of the non-convex optimization underlying the transfer learning problem. On the other hand, the imbalance problem arises from the different distributions of data in the primary and target domains. This issue can be handled by cost-sensitive, imbalanced learning methods. Since the power of deep neural models comes from the mutual optimization of all parameters, the individually fine-tuned convolutional filters are joined by a cost-sensitive backpropagation scheme.
The emergence of new cost-sensitive methods for imbalanced data (elkan2001foundations) enables the misclassification costs to be embedded in the form of a cost matrix, so that meaningful information can be passed to the learning process. The error, based on the misclassification costs, is measured for each class to form a confusion matrix. This matrix is the most informative contingency table in imbalanced learning problems, because it gives the success rate of a classifier on a particular class, and the failure rate in distinguishing that class from the others. The confusion matrix has proven to be a great regularizer, smoothing the accuracy among imbalanced data and giving more importance to minority distributions (ralaivola2012confusion).
Deriving a probabilistic distribution from the confusion matrix is highly effective at producing a probability assignment which contributes to imbalanced problems. This can be constructed either from recognition, substitution and rejection rates (xu1992methods) or from both precision and recall rates (deng2016improved). The key point is to harvest the maximum possible prior knowledge from the confusion matrix to overcome the imbalance challenge. The experiments confirm the advantage of the proposed distributed backpropagation for transfer learning in deep convolutional neural networks.
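As a concrete illustration of the contingency table discussed above (not code from the paper), a confusion matrix can be accumulated from actual and predicted labels in a few lines; all names are illustrative:

```python
import numpy as np

def confusion_matrix(actual, predicted, num_classes):
    """Rows index the actual class i, columns the assigned label j."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

# Toy 3-class example: two samples of class 1 are confused with class 2.
actual    = [0, 0, 1, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2, 2]
cm = confusion_matrix(actual, predicted, 3)
```

The diagonal carries the per-class success counts; off-diagonal entries carry the failure rates in distinguishing one class from another.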
3 Formulation
It is general practice in transfer learning to train a primary deep neural network on one dataset, and then fine-tune the learned features for another dataset on a new target network (bengio2012deep). The generality of the selected features for both the primary and target domains is critical to the success of transfer learning.
For implementation, the primary network is trained and its bottom layers are copied to form the target network. The top layers of the target network are initialized randomly and trained on the target data. It is possible to employ backpropagation through all layers and fine-tune their parameters for the target task, or to freeze the copied primary layers and only update the top target layers. This is usually decided by the size of the target dataset and the number of parameters in the primary layers. Fine-tuning large networks for small datasets leads to overfitting, but for small networks and large datasets it improves performance (sermanet2013overfeat). A diagram of standard backpropagation for transfer learning is presented in Figure 1.
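Assuming a toy two-layer fully connected model stored as NumPy arrays (the model, shapes, and names are hypothetical, not the paper's architecture), the copy-and-freeze recipe might be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical primary network: one hidden (bottom) layer, one output (top) layer.
primary = {
    "w_bottom": rng.normal(size=(784, 128)), "b_bottom": np.zeros(128),
    "w_top":    rng.normal(size=(128, 10)),  "b_top":    np.zeros(10),
}

def build_target(primary, n_target_classes, freeze_bottom=True):
    """Copy the bottom layer; re-initialize the top layer for the target task."""
    target = {
        "w_bottom": primary["w_bottom"].copy(),
        "b_bottom": primary["b_bottom"].copy(),
        # Top layer is re-initialized randomly for the target label set.
        "w_top": rng.normal(scale=0.01, size=(128, n_target_classes)),
        "b_top": np.zeros(n_target_classes),
    }
    # A boolean mask records which tensors receive gradient updates.
    trainable = {k: True for k in target}
    if freeze_bottom:
        trainable["w_bottom"] = trainable["b_bottom"] = False
    return target, trainable

target, trainable = build_target(primary, n_target_classes=100)
```

Freezing is expressed here as a trainability mask that a training loop would consult before applying gradients; a real framework would use its own parameter-freezing mechanism.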
To handle the overfitting issue, a distributed backpropagation paradigm is proposed. First, the large number of fine-tuning parameters is divided, so as to conquer the complexity of the primary non-convex optimization. This is implemented by breaking the deep network into distributed single-filter networks of depth one, which sharply reduces the number of parameters to be fine-tuned in each single-filter network. Second, this distributed architecture is fed with the target data. As a result, each single-filter network generates the contingency matrix of its Softmax classifier for the classification, recognition, or regression task.
Third, BPA is employed to find the contribution of each single-filter network to the specific learning task in the target domain. Finally, the calculated probability assignments are normalized and used as the costs of backpropagation in each single-filter network. In other words, the parameters are updated by multiplying the error gradients by the cost of each distributed single-filter network (Figure 2).
The single-filter architectures are initialized with parameters learned from the primary domain, and are then optimized by gradient descent through backpropagation on the target domain. This differs from ensemble learning in that the single-filter networks are not reweighted; rather, their parameters are updated to cope with the target domain.
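The splitting step above can be pictured on a toy filter bank: a 20-filter convolutional layer becomes 20 depth-one networks, each carrying a single filter (the shapes are illustrative, matching the 20-filter networks used later in the experiments):

```python
import numpy as np

# Hypothetical primary conv layer: 20 filters of size 5x5 over 1 input channel.
filter_bank = np.random.default_rng(1).normal(size=(20, 1, 5, 5))

# One single-filter network per primary filter (slicing keeps the filter axis).
single_filter_nets = [filter_bank[k:k + 1] for k in range(filter_bank.shape[0])]
```

Each slice would then be wrapped with its own Softmax classifier, so that every filter produces a confusion matrix of its own.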
3.1 Basic Probability Assignment
A confusion matrix is generally represented as class-based predictions against actual labels, in the form of a square matrix. Inspired by Dempster-Shafer theory (sentz2002combination), the construction of BPA gives a vector which is independent of the number of samples in each class and sums to one for each label. BPA provides the ability to reflect the different contributions of a classifier, or to combine the outcomes of multiple weak classifiers. A raw, two-dimensional confusion matrix, indexed by the predicted classes and actual labels, provides some common measures of classification performance. General measures include accuracy (the proportion of the total number of predictions that are correct), precision (a measure of accuracy, given that a specific class has been predicted), recall (a measure of the ability of a prediction model to select instances of a certain class from a dataset), and F-score (the harmonic mean of precision and recall) (sammut2011encyclopedia).
Suppose that a set of training samples from $|C|$ different classes is assigned to a label set $L$ by a classifier $\theta$. Each element $c_{ij}$ of the confusion matrix $C$ is the number of samples belonging to class $i$ which are assigned to label $j$. The recall ($r_{ij}$) and precision ($p_{ij}$) ratios for all $i$ and $j$ can be defined as follows (deng2016improved),
r_{ij} = \frac{c_{ij}}{\sum_{j=1}^{|L|} c_{ij}}, \qquad p_{ij} = \frac{c_{ij}}{\sum_{i=1}^{|C|} c_{ij}}    (1)
It can be seen that the recall ratio is summed over the predicted classes (rows), whilst the precision ratio is accumulated over the actual labels (columns) of the confusion matrix $C$. The probability elements of recall ($m_r$) and precision ($m_p$) for each individual class are,
m_r(C_i) = \frac{r_{ii}}{\sum_{k=1}^{|C|} r_{kk}}, \qquad m_p(C_i) = \frac{p_{ii}}{\sum_{k=1}^{|C|} p_{kk}}    (2)
These elements are synthesized to form the final probability assignment by the Dempster-Shafer rule of combination (sentz2002combination), representing the recognition ability of classifier $\theta$ on each class $C_i$ of the set $C$ as,
m(C_i) = m_r(C_i) \oplus m_p(C_i) = \frac{m_r(C_i)\, m_p(C_i)}{\sum_{k=1}^{|C|} m_r(C_k)\, m_p(C_k)}    (3)
where the operator $\oplus$ is an orthogonal sum. The overall contribution of the classifier can be presented as a probability assignment vector,
MA = \{ m(C_1), \dots, m(C_{|C|}) \}    (4)
It is worth mentioning that $MA$ should be computed on the training set, because it is assumed that no actual label set is available at test time.
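The construction above can be checked numerically. The sketch below is one plausible reading of Equations (1)-(4): the recall/precision probability elements with a normalized product over the singleton classes, as a simplified stand-in for the full Dempster-Shafer combination:

```python
import numpy as np

def bpa_from_confusion(cm):
    """Probability assignment vector from a confusion matrix (rows = actual)."""
    cm = np.asarray(cm, dtype=float)
    recall = np.diag(cm) / cm.sum(axis=1)      # r_ii: row-normalized diagonal
    precision = np.diag(cm) / cm.sum(axis=0)   # p_ii: column-normalized diagonal
    m_r = recall / recall.sum()                # probability elements of recall
    m_p = precision / precision.sum()          # probability elements of precision
    combined = m_r * m_p                       # singleton-only combination
    return combined / combined.sum()           # normalized assignment vector MA

# Toy 3-class confusion matrix: class 0 is recognized best.
cm = [[50, 2, 0],
      [3, 40, 7],
      [0, 5, 45]]
ma = bpa_from_confusion(cm)
```

The resulting vector sums to one and assigns the largest mass to the class the classifier recognizes most reliably.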
3.2 Distributed Backpropagation
Suppose that $\Xi = \{\Xi_1, \dots, \Xi_U\}$ is the set of Softmax loss functions of the $U$ single-filter networks presented in Figure 2. To apply the distributed backpropagation, Algorithm 1 is followed for each of the classifiers. The result is a set of normalized probability assignments as follows,
\Omega = \{\omega_1, \dots, \omega_U\}, \qquad \omega_u = \frac{MA_u}{\sum_{u'=1}^{U} MA_{u'}}    (5)
It is known that in each layer $l$ of the $u$-th single-filter network, the feedforward propagation is calculated as,
a^{l} = \sigma(z^{l}), \qquad z^{l} = w^{l} a^{l-1} + b^{l}    (6)
where $w^{l}$ and $b^{l}$ are the weights and biases, $a^{l}$ is an activation, and $\sigma$ is a rectification function. Considering $\Xi_u$ as the cost function of the $u$-th network, the output error of the final layer $L$ holds,
\delta^{L} = \nabla_{a} \Xi_u \odot \sigma'(z^{L})    (7)
and the backpropagation error can be stated as,
\delta^{l} = \big( (w^{l+1})^{T} \delta^{l+1} \big) \odot \sigma'(z^{l})    (8)
For the sake of gradient descent, the weights and biases are updated via,
w^{l} \rightarrow w^{l} - \eta\, \omega_u\, \delta^{l} (a^{l-1})^{T}, \qquad b^{l} \rightarrow b^{l} - \eta\, \omega_u\, \delta^{l}    (9)
It can be seen that a greater $\omega_u$ for the $u$-th single-filter network makes larger steps to update the weights and biases towards the target domain during distributed backpropagation, compared to standard backpropagation with a fixed learning rate $\eta$ in the primary domain. This helps to update only the primary filters which largely affect the target domain, and also properly connects the distributed single-filter networks during the fine-tuning process. This also implies that, in spite of forward-backward propagation in the target domain, the overall contribution of all convolutional filters is taken into account. Since the united backpropagation is involved in every iteration of training, the single-filter networks are trained together, which is different from ensemble learning. Algorithm 2 wraps up the proposed strategy.
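The cost-weighted update differs from a vanilla gradient step only by the per-network factor $\omega_u$; a minimal sketch with hypothetical names and toy gradients:

```python
import numpy as np

def weighted_sgd_step(w, b, grad_w, grad_b, eta, omega):
    """Scale the gradient step of the u-th network by its normalized BPA cost."""
    return w - eta * omega * grad_w, b - eta * omega * grad_b

# Toy one-layer parameters and constant gradients.
w, b = np.ones((4, 3)), np.zeros(3)
grad_w, grad_b = np.full((4, 3), 0.5), np.full(3, 0.5)
w2, b2 = weighted_sgd_step(w, b, grad_w, grad_b, eta=0.1, omega=0.8)
```

A network with a small $\omega_u$ barely moves, so filters that contribute little to the target task stay close to their primary-domain values.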
It is assumed that the number of classes and assigned labels are equal ($|C| = |L|$), although merging different classes is a common practice, particularly in visual classification (for example, vertical vs. horizontal categories). The benefit lies in the fact that the bottom layers of deep convolutional architectures contribute to detecting first- and second-order features, which usually capture specific directions rather than distinguishing patterns of the objects; this leads to powerful hierarchical feature learning in the merging case. In contrast, some classes can be divided into various subcategories, even though they all receive the same initial labels, and this takes advantage of the more general features in the top layers. The proposed distributed backpropagation does not merge or divide the primary labels of the datasets under study, although it seems that doing so could boost the performance of transfer learning in both the merging and dividing cases.
4 Experiments
Two different scenarios are considered to evaluate the performance of distributed backpropagation for transfer learning. In the first scenario, the performance of fine-tuning is observed for pairs of datasets with either close data distributions or similar numbers of classes. These are the MNIST & SVHN and CIFAR10 & CIFAR100 pairs, as primary & target domains, and the performance of transfer learning is reported in the form of training-test errors. For the second scenario, transfer learning is applied to pairs of datasets with distant data-class distributions, namely the MNIST & CIFAR10 and SVHN & CIFAR100 pairs. In these experiments, the datasets are arranged to examine the effect of dissimilar distributions, rather than overfitting.
Dataset    | Baseline Train (%) | Baseline Test (%)
MNIST      | 0.04               | 0.55
SVHN       | 0.13               | 3.81
CIFAR10    | 0.01               | 19.40
CIFAR100   | 0.17               | 50.90
As a practical example, suppose that the aim is to transfer MNIST as the primary domain to CIFAR as the target domain. To initialize, the CIFAR data is presented to the MNIST pretrained network, filter by filter, for each of the 20 convolutional filters of the MNIST network. Then, the outputs of the Softmax layers are passed to the BPA module, and the error gradients are multiplied by the specific cost of each filter. These new gradients backpropagate through all 20 single-filter networks to update their weights and biases. After some iterations, the MNIST network is transferred to the CIFAR network, through the distributed backpropagation scheme running on the 20 fine-tuned single-filter networks. The baselines of training and test errors on the experimental datasets are reported in Table 1. For ease of implementation and fast replication of the results, the deep learning library provided by the Oxford Visual Geometry Group (vedaldi08vlfeat) is deployed.
4.1 Transfer Learning on Fairly Balanced Domains
In this scenario, two pairs of datasets are targeted, which contain similar data distributions and perform the same recognition tasks. The results for standard vs. distributed backpropagation are reported in Table 2. The standard backpropagation trains the primary network and fine-tunes the top layers for the target network (bengio2012deep). The proposed distributed backpropagation employs BPA for cost-sensitive transfer learning.
It can be seen that the results for standard backpropagation follow the argument on the size of networks and the number of model parameters (sermanet2013overfeat). MNIST does a poor job of transferring to SVHN, due to the overfitting of SVHN over the MNIST network. In contrast, SVHN performs well as the primary network when transferring to MNIST as the target domain.
On the other hand, transferring from MNIST to the SVHN domain does not result in overfitting when the distributed backpropagation is employed. In both settings of the primary and target domains, the distributed strategy outperforms standard backpropagation.
The experiments on the CIFAR pair yield interesting results, because both datasets have the same number of samples but completely different distributions among the classes. In practice, CIFAR100 includes all the classes of CIFAR10, but CIFAR10 is not aware of several classes in CIFAR100. It can be seen that CIFAR10 transfers well to CIFAR100. This cannot outperform the baseline performance, although the target network (CIFAR100) is not overfitted. All in all, the performance of the distributed backpropagation for transfer learning is better than the standard scheme, and also outperforms the baselines of the benchmarks.
4.2 Transfer Learning on Highly Imbalanced Domains
This scenario pairs the datasets such that the similarity of their data distributions and numbers of classes is minimized. The datasets are also initially trained for different tasks, i.e. digit classification vs. object recognition. For implementation, the MNIST gray channel is repeated three times to make an RGB-like colour representation for transferring to the CIFAR network. For CIFAR, the RGB channels are converted into grayscale for transfer on the MNIST network.
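These channel adaptations are simple array operations; for instance (the luminance weights below are the common ITU-R coefficients, which the paper does not specify):

```python
import numpy as np

def gray_to_rgb(img):
    """(H, W) -> (H, W, 3): repeat the gray channel three times."""
    return np.repeat(img[..., None], 3, axis=-1)

def rgb_to_gray(img):
    """(H, W, 3) -> (H, W): luminance-weighted channel average."""
    return img @ np.array([0.299, 0.587, 0.114])

mnist_like = np.random.default_rng(2).random((28, 28))
cifar_like = np.random.default_rng(3).random((32, 32, 3))
rgb = gray_to_rgb(mnist_like)    # MNIST-style image for a CIFAR network
gray = rgb_to_gray(cifar_like)   # CIFAR-style image for an MNIST network
```

Spatial resizing (28x28 vs. 32x32) would also be needed in practice; it is omitted here for brevity.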
From Table 3, it is obvious that the distributed backpropagation outperforms all the standard results. For the first setup, CIFAR10 performs better at transfer learning than MNIST, although the numbers of classes are the same.
Primary    | Target    | Standard Train (%) | Standard Test (%) | Distributed Train (%) | Distributed Test (%)
MNIST      | SVHN      | 0.01 | 29.57 | 0.24 | 5.18
SVHN       | MNIST     | 0.35 | 1.04  | 0.16 | 0.46
CIFAR10    | CIFAR100  | 0.53 | 68.44 | 0.29 | 54.32
CIFAR100   | CIFAR10   | 0.11 | 24.08 | 0.05 | 18.24
Primary    | Target    | Standard Train (%) | Standard Test (%) | Distributed Train (%) | Distributed Test (%)
MNIST      | CIFAR10   | 0.43 | 28.92 | 0.25 | 20.85
CIFAR10    | MNIST     | 0.44 | 2.37  | 0.23 | 0.95
SVHN       | CIFAR100  | 0.71 | 89.31 | 0.46 | 61.10
CIFAR100   | SVHN      | 0.01 | 12.18 | 0.28 | 7.25
It seems that CIFAR10 provides better generalization due to the higher diversity among its classes. Here, the distributed backpropagation performs better than the standard process, and targeting MNIST from the CIFAR10 network results in a performance similar to the baseline outcomes on MNIST in Table 1. The second setup leads to the overfitting of SVHN over the CIFAR100 network, as a result of the large number of samples. The other outcome is the poor performance of transferring CIFAR100 to the SVHN network, which results from the different contents of the primary and target datasets.
The observations show that fine-tuning on the training set, while calculating BPA on the validation set, results in better generalization of the transferred model. On the other hand, computing BPA on the training plus validation sets gives higher performance. This is due to the vastly different numbers of classes in the primary and target domains. Since BPA is employed to address the imbalanced distribution problem, it better captures the distribution of the data by adjoining both training and validation sets. This is especially true when fewer classes of the primary domain are transferred to a larger number of classes in the target domain.
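Computing BPA on training plus validation data amounts to accumulating the two confusion matrices before deriving the assignment; a toy sketch with made-up counts:

```python
import numpy as np

# Hypothetical per-split confusion matrices (rows = actual, columns = predicted).
cm_train = np.array([[90, 10], [20, 80]])
cm_val   = np.array([[8, 2], [5, 5]])

# Adjoining the sets: accumulate the counts, then derive recall as usual.
cm_total = cm_train + cm_val
recall = np.diag(cm_total) / cm_total.sum(axis=1)
```

Because counts are additive, the combined matrix reflects the joint class distribution of both splits before any normalization takes place.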
5 Conclusion
We introduce a novel transfer learning scheme for deep convolutional networks that tackles the optimization complexity of a highly non-convex objective by breaking it into several distributed fine-tuning operations which backpropagate jointly. This also resolves the imbalanced learning regime of the original and target domains, by using the basic probability assignment of evidence theory across several unit-depth single-filter networks. With distributed backpropagation, the overall performance shows considerable improvement over the standard transfer learning scheme. We conduct several experiments on publicly available datasets and report the performance as training and test errors. The results confirm the advantage of our distributed strategy.
References
 (1) Y. Bengio et al. Deep learning of representations for unsupervised and transfer learning. ICML Unsupervised and Transfer Learning, 27:17–36, 2012.
 (2) X. Deng, Q. Liu, Y. Deng, and S. Mahadevan. An improved method to construct basic probability assignment based on the confusion matrix for classification problem. Information Sciences, 340:250–261, 2016.

 (3) C. Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973–978. Lawrence Erlbaum Associates Ltd, 2001.
 (4) X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513–520, 2011.
 (5) M. Long, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636, 2016.
 (6) M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1717–1724. IEEE, 2014.
 (7) S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 (8) L. Ralaivola. Confusion-based online learning and a passive-aggressive scheme. In Advances in Neural Information Processing Systems, pages 3284–3292, 2012.
 (9) C. Sammut and G. I. Webb. Encyclopedia of machine learning. Springer Science & Business Media, 2011.
 (10) K. Sentz and S. Ferson. Combination of evidence in Dempster-Shafer theory, volume 4015. Citeseer, 2002.
 (11) P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
 (12) A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
 (13) X. Wang and J. Schneider. Flexible transfer learning under support and model shift. In Advances in Neural Information Processing Systems, pages 1898–1906, 2014.
 (14) L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE transactions on systems, man, and cybernetics, 22(3):418–435, 1992.
 (15) J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
 (16) K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In ICML (3), pages 819–827, 2013.