Sequential Transfer Machine Learning in Networks: Measuring the Impact of Data and Neural Net Similarity on Transferability

03/29/2020 ∙ by Robin Hirt, et al. ∙ KIT ∙ IBM

In networks of independent entities that face similar predictive tasks, transfer machine learning enables the re-use and improvement of neural nets using distributed data sets without the exposure of raw data. As the number of data sets in business networks grows and not every neural net transfer is successful, indicators are needed for its impact on the target performance, i.e., its transferability. We perform an empirical study on a unique real-world use case comprising sales data from six different restaurants. We train and transfer neural nets across these restaurant sales data and measure their transferability. Moreover, we calculate potential indicators for transferability based on divergences of data, of data projections, and on a novel metric for neural net similarity. We obtain significant negative correlations between the transferability and the tested indicators. Our findings allow choosing the transfer path based on these indicators, which improves model performance whilst simultaneously requiring fewer model transfers.


1 Introduction

Machine learning is a main driver in the automation of process tasks across industries (Sanders2016). Although many industry players face similar problems with similar data structures in areas where machine learning can be applied, each company typically solves these problems in isolation (Hirt2018). From a systems perspective, these analytical tasks are highly comparable (mizoguchi1995task).

In an ideal world with an exhaustive exchange of all data across company borders, companies could solve similar problems more efficiently (Hirt2018). However, due to competition and, first and foremost, due to the preservation of intellectual property and privacy, sharing raw data is not feasible. From an economic standpoint, this poses a significant inefficiency, as similar problems are solved multiple times and no analytical knowledge is exchanged. Additionally, the creation of analytical models is typically costly. If every company builds its own models, every company ends up with an inferior model, as substantially more data potentially exists in the entire ecosystem. Moreover, every company has to reinvent the wheel, resulting in higher costs for model creation. Therefore, current industry practice results in inefficient resource utilization from a systems viewpoint (hicks1939foundations).

To address this challenge, we propose the utilization of transfer machine learning, a technique that enables the reuse and improvement of predictive machine learning models using different, distributed data sets. Hereby, no raw data exchange between companies is required, yet the transferred model can be improved by leveraging these different data sets. Although different types of analytical models could be transferred (Hopf2017a), neural networks are especially suited for transfer machine learning and are thus the subject of the majority of related work (Weiss2016). Multiple studies demonstrate the effectiveness and efficiency of transfer machine learning on well-known, well-formed data sets like MNIST (long2013transfer) or ImageNet (huh2016makes), but a lack of real-world industry studies is evident. One reason, amongst others, is the question of what, how, and when to transfer, since (naturally) not every neural net can be transferred to every data set (Pan2009ALearning). As our research gap, we observe a lack of techniques for identifying the impact of a neural net transfer prior to the transfer itself, which can be described as the transferability of a neural net. For the work at hand, transferability can be defined as the estimation of the extent to which representations learned from a source task can help in learning a target task (Bao2019AnLearning). This is especially relevant when considering large numbers of participants in an ecosystem and a correspondingly high number of potential neural net candidates for transfer.

To address this gap, we perform an empirical study on a real-world use case with the aim of studying the effects between different similarity measures and the transferability of neural nets. Specifically, we are interested in indicators for the transferability of neural nets that are based on a comparison of data and data projections as well as of the neural nets themselves. As a basis for this study, we consider a unique data set from an ecosystem of different restaurant branches owned by different legal entities, all of which need to perform sales forecasts to improve their respective resource allocation. As the owners fear exposing data outside their restaurants, they are not willing to share raw data. Therefore, they are in need of a pre-transfer analysis of whether a neural net transfer would add value, without having to access the raw data of a competitor.

The paper at hand is structured as follows: In the remainder of this section, we cover related work, elaborate on our contribution to theory, define prerequisites and derive hypotheses. Then, we introduce the data set, present the neural net structure and the transfer mechanism, and elaborate on indicators for transferability based on raw data, data projections and neural nets. Afterwards, we present the results by first describing the performance impact of transferring neural nets in a business network and then describing the impact of the tested indicators on transferability. After discussing our findings, we summarize the results, discuss their generalization, acknowledge limitations, and outline future research prospects.

1.1 Related Work and Contribution to Theory

The foundations of transfer learning are surveyed by Pan2009ALearning as well as Weiss2016, who provide detailed overviews of the field. A wide variety of studies on the application of transfer learning can be identified: Zhong2010CrossLearning present findings on the utilization of deep convolutional neural networks (CNN) in medical image analysis. They use large, general pre-trained sets and adapt them to a specific task to show that CNNs pre-trained on computer vision databases (e.g., ImageNet) are useful in medical image applications and that multi-view classification is possible without the pre-registration of the input images. Kim2014ConvolutionalClassification reports that pre-trained word vectors for sentence-level classification tasks can be seen as universal feature extractors that can be utilized for various classification tasks. In this study, we focus on investigating the transferability (Bao2019AnLearning) of neural networks from a source to a target domain.

Related work can be divided into three main aspects that can indicate transferability, namely task similarity, data similarity and, recently, also model similarity. Table 1 summarizes the related work on transferability in alignment with these three main research categories. A variety of work covering the topic of task similarity in transfer learning exists. Xue2007Multi-TaskPriors classify tasks that are correlated and dependent, thus showing that concepts previously learned on one task may be transferred to other tasks. Yosinski2014HowNetworks state that transferability is negatively affected by the specialization of higher-layer neurons to their source task, which eventually leads to a performance decrease on the target task. Another way to determine the transferability of neural nets is to examine the source and target data sets themselves. Jain2011 use the similarity among data points to update the detection score of a classifier and its classification boundary. Xiao2012a find suitable training instances from other domains by measuring the distance between source and target data in the domain of oil-price forecasting. Zhong2010CrossLearning apply density ratio weighting to overcome differences in marginal distributions and propose a reverse validation procedure to quantify how well a neural net approximates the true conditional distribution of the target domain. However, there are further methods for comparing data distributions that could indicate transferability, such as divergences or distances (Bhattacharyya1943OnDistributions; Eguchi1985AFunctionals; Kullback1951OnSufficiency).

Publication | Task similarity | Neural net similarity | Data similarity
Multi-Task Learning for Classification with Dirichlet Process Priors (Xue2007Multi-TaskPriors) | x | |
How transferable are features in deep neural networks? (Yosinski2014HowNetworks) | x | |
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability (Raghu2017SVCCA) | | x |
Insights on representational similarity in neural networks with canonical correlation (Morcos2018InsightsCorrelation) | | x |
Cross Validation Framework to Choose Amongst Models and Datasets for Transfer Learning (Jain2011) | | | x
Online Domain Adaptation of a Pre-Trained Cascade of Classifiers (Zhong2010CrossLearning) | | | x
This work | | x | x

Table 1: Excerpt of related work on transferability and positioning of this work.

Especially if the source data set is not available or cannot be accessed for confidentiality reasons, examining a potential source neural network can be a way to gain insights into its transferability to a target data set. To the best of our knowledge, there is no work on finding indicators for transferability based on net structures. However, recent work shows possibilities for comparing neural net similarity using SVCCA (Raghu2017SVCCA) to interpret neural network representations. Morcos2018InsightsCorrelation apply SVCCA to compare net similarity across a group of CNNs, demonstrating that networks that generalize converge to more similar representations than networks that memorize.

In the course of this work, we are interested in transferring models across different data sets for which the data distribution may vary, but not the task to be executed. Thus, we disregard methods that are purely based on task similarity. We are interested in finding ways to obtain indications of transferability in cases where data cannot be pooled (e.g., due to confidentiality issues). To get a baseline indication of data similarity in transfer learning, we compare "raw" data sets. Then, in order to reduce the amount of exposed information during the comparison, we examine ways to compare projections of those raw data sets. Given that even those projections might not be retrievable in some cases (e.g., where only models are exchanged and the initial training data is not accessible), we finally aim to find indicators for transferability based on the structure of a neural net.

The contribution of this work is threefold:

  • We develop and evaluate the utility of a multi-step system-wide transfer on a unique data set in the domain of sales forecasting.

  • We empirically show an association between the divergence of data distributions, as well as the divergence of projections of data distributions, and the transferability of models.

  • We empirically show that the Singular Vector Canonical Correlation Analysis is associated with the transferability.

1.2 Prerequisites and Research Design

In our case, we want to transfer neural networks across the different federated data distributions of $N$ companies:

$$\mathcal{D} = \{D_1, D_2, \dots, D_N\} \quad (1)$$

We define the input $X_i$ of the different data sets, composed of $n_i$ samples, of a neural network as follows:

$$X_i = \{x_1, \dots, x_{n_i}\}, \quad x_j \sim D_i \quad (2)$$

The test inputs $X_i^{test}$ and the corresponding true labels $Y_i^{test}$ are composed of $m$ samples and are constructed by sampling uniformly from $D_i$:

$$(X_i^{test}, Y_i^{test}) = \{(x_j, y_j)\}_{j=1}^{m}, \quad (x_j, y_j) \sim D_i \quad (3)$$

The performance $M$ of a neural network $f_i$ trained on $D_i$, with predicted labels $\hat{Y}_i$, is denoted as:

$$M(f_i) = M(\hat{Y}_i, Y_i^{test}) \quad (4)$$

The performance delta of a source neural network $f_s$ which is trained on a distribution $D_s$ and then transferred to a target distribution $D_t$ is described as

$$T_{s \to t} = \frac{M(f_t) - M(f_{s \to t})}{M(f_t)} \quad (5)$$
We define $T_{s \to t}$ as the transferability of a model that is trained on the source distribution $D_s$ and transferred to the target distribution $D_t$. In our case, we therefore regard transferability as the performance increase of a neural network transferred from one (source) distribution to another (target) distribution. The first goal of our work is to show that transferability, i.e., a performance increase of a transferred model, exists for the regarded problem and data set. Therefore, we formulate our first hypothesis as follows:

Hypothesis 1 (H1): A model $f_{s \to t}$ which is pre-trained on a distribution $D_s$ and transferred to a distribution $D_t$ outperforms a model $f_t$ trained solely on $D_t$.

If this hypothesis can be confirmed, the next step of this work consists of identifying possible indicators for transferability prior to the transfer itself. In order to do so, we analyze indicators for the transferability by comparing $D_s$ and $D_t$ directly as well as their respective projections. Hereby, a projection $P$ maps a distribution $D$ as follows:

$$P: D \mapsto P(D) \quad (6)$$

The projected distribution is $P(D)$. To empirically test different projections, we apply Multidimensional Scaling (MDS), Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE):

$$P \in \{P_{MDS}, P_{PCA}, P_{t\text{-}SNE}\} \quad (7)$$

To compare two distributions $D_s$ and $D_t$ and their respective projections, we calculate their data divergence $d(D_s, D_t)$ and data projection divergence $d(P(D_s), P(D_t))$.

In this work, we aim to empirically examine the association between the divergence of the data distributions $D_s$ and $D_t$, the divergence of the projected distributions $P(D_s)$ and $P(D_t)$, and the performance impact $T_{s \to t}$. Accordingly, we formulate Hypotheses 2 and 3:

Hypothesis 2 (H2): The divergence of two distributions $D_s$ and $D_t$, described as $d(D_s, D_t)$, correlates with the transferability $T_{s \to t}$.

Hypothesis 3 (H3): The divergence of the projections of two distributions $D_s$ and $D_t$, described as $d(P(D_s), P(D_t))$, correlates with the transferability $T_{s \to t}$.

Finally, we examine the neural nets themselves, without accessing the source data, to find indicators for transferability. For this purpose, we consider the Singular Vector Canonical Correlation Analysis (SVCCA). SVCCA enables the comparison of the behavior of neural nets, derived from the activations of their neurons with regard to a data input $X$. Let $\rho_{SVCCA}(f_s, f_t, X)$ denote the result of an SVCCA between a net $f_s$ and a net $f_t$ based on a data sample $X$. Accordingly, we formulate Hypothesis 4:

Hypothesis 4 (H4): The output $\rho_{SVCCA}$ of a Singular Vector Canonical Correlation Analysis correlates with the transferability $T_{s \to t}$.

Figure 1: Overview of hypotheses and corresponding goal.

In Figure 1, we give an overview of our hypotheses and their corresponding goals. For H1, we perform a two-sided one-sample t-test for the mean of all transferabilities to test whether the average transferability significantly deviates from zero. For H2-H4, we calculate Spearman's rank correlation coefficient as a non-parametric measure between the variables and test the significance of the calculated Spearman's rho.
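To make these tests concrete, the following SciPy sketch shows how H1 and H2-H4 could be evaluated; the arrays are random placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
transferabilities = rng.normal(0.01, 0.07, size=1950)  # placeholder T values
indicator = rng.normal(size=1950)                      # placeholder indicator, e.g., a divergence

# H1: two-sided one-sample t-test, mean transferability vs. zero
t_stat, p_t = stats.ttest_1samp(transferabilities, popmean=0.0)

# H2-H4: Spearman's rank correlation between an indicator and transferability
rho, p_rho = stats.spearmanr(indicator, transferabilities)
```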

2 Experiment

In this section, we first give an overview of the data we examine and subsequently elaborate on the sales forecasting model design and the transfer mechanism. We then describe how we compare data and data projections. Lastly, we present the applied variant of measuring net similarity via SVCCA.

2.1 Data Set

We analyze unique daily sales data of six different restaurant branches of two particular restaurant chains that serve different types of food. The data set captures observations from 2013 until 2017.

Branch | 1 | 2 | 3 | 4 | 5 | 6
Company | A | A | A | B | B | B
City | a | a | b | a | c | d

Table 2: Overview of available data for branches 1 to 6 (sales data from 2013-01-01 to 2017-12-31).

By precisely predicting the daily sales of each branch for the next week, month, or even year, several advantages can be leveraged: based on the predicted revenue and demand, staff schedules can be optimized for cost savings and a better customer experience. Additionally, the procurement of supplies can be improved, as spoiled food is a main cost driver for restaurants. Thus, the management of restaurant chains has a major interest in forecasting sales for their branches.

Table 2 gives an overview of all the available branch data we use in this work. Each of the two restaurant companies has three branches at different locations; branches 1, 2 and 4 are located in the same city.

Figure 2: Average branch revenue over days of the week, scaled.

Figure 2 compares the average revenue of each branch over the days of the week. We can recognize different weekly seasonalities in the restaurants' revenues. Branches 1, 2 and 5 have their highest sales on Saturday, while branches 4 and 6 reach their minimum on that very day. Differences in market orientation, opening hours and location are possible reasons for this observation. Additionally, branch 5 appears to be closed on Sundays. In general, a restaurant in the commercial city center can attract more customers on the weekend than one located in an industrial area of town, where offices and production businesses tend to be closed on those days. All branches, with the exception of the two aforementioned branches, share the common behavior of an uptrend in net revenue starting from Monday and peaking over the weekend.

2.2 Sales Forecasting Model Design and Transfer

We aim to build separate models for each data distribution, where one data distribution corresponds to the data set of one branch. Afterwards, these models are transferred to every other distribution and re-trained. This procedure is repeated until every model has passed through every distribution exactly once (H1). To empirically study the effects of data divergence, data projection divergence and net similarity on the transferability of models, we test all possible transfers in a brute-force manner and analyze the results a posteriori (H2-H4).

Our goal is to develop a model that is able to forecast daily sales on a weekly basis. There are many ways to design a sales forecasting model, such as ARIMA models, additive, or logarithmic regressions. To simplify our research design, we focus solely on Convolutional Neural Networks (CNN) for multivariate forecasting, as they have achieved superior results on similar problems in the past (Borovykh2017ConditionalNetworks). Here, the input of the neural network is:

$$x = (s, y, m, w) \quad (8)$$

where $s$ is a vector of the daily sales of the previous sales period, $y$ denotes the year, $m$ the month and $w$ the week of the observation. The complete data set can be described as the set of inputs $X = \{x_j\}$ and targets $S' = \{s'_j\}$, where $s'_j$ contains the daily sales of the following sales period. Then, the date and time index are adjusted and reformatted in line with the opening hours of the respective branches. The available variables are grouped by day in order to forecast the time series on a per-day basis. We clean obvious errors in the data set by dropping erroneous values, such as negative daily revenues.
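As an illustration of this preprocessing, the sketch below builds weekly samples from a raw sales table; the column names ("date", "revenue"), the seven-day period and the exact grouping logic are our assumptions, not details given in the paper.

```python
import pandas as pd

def build_weekly_samples(df: pd.DataFrame, period: int = 7):
    """Turn a raw sales table into (previous sales, year, month, week) -> next-period samples."""
    df = df[df["revenue"] >= 0]                              # drop erroneous negative revenues
    daily = df.groupby(pd.Grouper(key="date", freq="D"))["revenue"].sum()
    samples = []
    for i in range(period, len(daily) - period + 1, period):
        prev_sales = daily.iloc[i - period:i].to_numpy()     # s: previous sales period
        target = daily.iloc[i:i + period].to_numpy()         # s': next sales period
        d = daily.index[i]
        samples.append(((prev_sales, d.year, d.month, d.isocalendar()[1]), target))
    return samples
```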

Figure 3: Multi-head architecture of employed CNN model.

As a next step, we build a multi-head CNN model to forecast the daily sales of the next sales period. The structure of the CNN model is depicted in Figure 3. The model has four input variables: the revenue of the previous sales period, and the month, weekday and year of the observation. Each variable is fed into a separate head. All heads consist of two one-dimensional convolutional layers with the same parameter configuration, followed by a max-pooling layer. The output of the pooling layers is flattened and merged by a concatenation layer. The merged heads' output is fed into a first fully connected layer, followed by a second one to conduct the interpretation. Finally, the sales forecast for the next period is generated in the form of an output vector.

In a pre-test, we determine the model hyperparameters empirically as follows: the two one-dimensional convolutional layers both have 32 filter maps and a kernel size of 3. As activation function, the rectified linear unit is applied to both convolutional layers. The pool size of the max-pooling layer is set to 2. The first fully connected layer contains 200 neurons and the second one 100 neurons. The model is compiled with the mean squared error (MSE) as loss function during training and uses Adam as optimizer (Kingma2015Adam:Optimization). After compilation, the model is fitted on the training data set for 20 epochs with a batch size of 16.
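A minimal Keras sketch of this architecture follows. The layer hyperparameters, loss and optimizer are taken from the description above; the seven-day input window per head, the seven-day forecast horizon and the ReLU activations on the fully connected layers are assumptions.

```python
from tensorflow.keras import layers, Model

def build_multihead_cnn(seq_len: int = 7, horizon: int = 7) -> Model:
    """Multi-head CNN: one head per input variable, merged into two dense layers."""
    inputs, heads = [], []
    for name in ["sales", "month", "weekday", "year"]:       # four input variables
        inp = layers.Input(shape=(seq_len, 1), name=name)
        x = layers.Conv1D(32, 3, activation="relu")(inp)     # two Conv1D layers,
        x = layers.Conv1D(32, 3, activation="relu")(x)       # 32 filters, kernel size 3
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.Flatten()(x)
        inputs.append(inp)
        heads.append(x)
    x = layers.Concatenate()(heads)                          # merge the four heads
    x = layers.Dense(200, activation="relu")(x)
    x = layers.Dense(100, activation="relu")(x)
    out = layers.Dense(horizon)(x)                           # daily sales of next period
    model = Model(inputs, out)
    model.compile(loss="mse", optimizer="adam")
    return model
```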

For the model training and re-training, we split the data into a training and a test set for each branch. As the testing period, we consistently choose the year 2017. The remaining data forms our training or re-training set. For every model, we calculate its performance both on the test set of the actual target branch and on the union of all test sets across all branches for comparability reasons.

Figure 4: Overview of a possible transfer path for a model across different data distributions.

To implement the transfer, we re-train a source CNN on a target data set as depicted in Figure 4. Hereby, we do not freeze any layers, to enable re-weighting of the neurons in all layers. We re-train the CNN model with the same number of epochs and batch size (16) as in the base model training. Note that it would also be possible to adaptively choose certain layers to freeze and to dynamically adapt the learning rate. For this study, we chose not to change or vary the number of training parameters or frozen layers for a transfer. As a consequence, the models are more likely to "forget" previously learned knowledge; future work needs to address a more adaptive learning strategy. The degree of transfer denotes the total number of performed transfers per model. In Table 3, we give an overview of all transfers, their respective source models and the respective targets according to the degree of transfer. Generally, the number of transfers grows rapidly with a growing number of data sets $N$ and can be described by $\sum_{d=1}^{N-1} \frac{N!}{(N-d-1)!}$.

Degree of transfer | 1 | 2 | 3 | 4 | 5 | Total
Source models | 6 | 30 | 120 | 360 | 720 | -
Possible targets | 5 | 4 | 3 | 2 | 1 | -
Targets | 30 | 120 | 360 | 720 | 720 | 1950

Table 3: Number of possible transfers.
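The sketch below reproduces the transfer counts of Table 3 and illustrates a single transfer step, i.e., continuing training of the full source net on target data without freezing layers; the training data here are random placeholders.

```python
import math
import numpy as np

def num_transfers(n: int) -> int:
    """Total number of transfer sequences of degree 1..n-1 across n data sets."""
    return sum(math.factorial(n) // math.factorial(n - d - 1) for d in range(1, n))

assert num_transfers(6) == 1950  # matches Table 3

rng = np.random.default_rng(0)
def placeholder_data(n):         # shapes matching the four heads of the CNN sketch
    return [rng.normal(size=(n, 7, 1)) for _ in range(4)], rng.normal(size=(n, 7))

(X_src, y_src), (X_tgt, y_tgt) = placeholder_data(200), placeholder_data(200)
model = build_multihead_cnn()                                  # from the sketch above
model.fit(X_src, y_src, epochs=20, batch_size=16, verbose=0)   # base training on source
model.fit(X_tgt, y_tgt, epochs=20, batch_size=16, verbose=0)   # one transfer step, no frozen layers
```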

2.3 Data and Data Projection Divergence

In the following, we first introduce the utilized data divergence measure, which we apply to the unchanged data populations as well as to the projected data. Measuring the independence or divergence of two random variables or distributions can be done in different ways. In this work, we estimate the divergence of two data distributions using an energy distance meta estimator, equivalent to the maximum mean discrepancy (Szabo2014InformationToolbox; Szekely2013EnergyDistances), which is defined as follows:

$$d(D_s, D_t) = 2\,\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\| \quad (9)$$

with $X, X' \sim D_s$ and $Y, Y' \sim D_t$ independent. Its sample estimator for data sets $\{x_i\}_{i=1}^{n}$ and $\{y_j\}_{j=1}^{m}$ is

$$\hat{d} = \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\|x_i - y_j\| - \frac{1}{n^2}\sum_{i,i'}\|x_i - x_{i'}\| - \frac{1}{m^2}\sum_{j,j'}\|y_j - y_{j'}\| \quad (10)$$
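A direct NumPy/SciPy implementation of the sample estimator in Eq. (10) could look as follows; this is a sketch, not the toolbox estimator used in the study.

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """Sample estimator of the energy distance between data sets X and Y."""
    a = cdist(X, Y).mean()   # mean pairwise distance between X and Y
    b = cdist(X, X).mean()   # mean pairwise distance within X
    c = cdist(Y, Y).mean()   # mean pairwise distance within Y
    return 2 * a - b - c
```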

Considering a scenario where data cannot be exchanged across the entities of a system, it is not possible to compare two raw data sets directly. To ensure a certain degree of confidentiality, a possible solution is to compare only projected data, in which critical information is already lost due to abstraction (Narayanan2008).

Thus, in an initial step, we apply projections $P$ to the raw data to retrieve abstractions $P(D)$. We use three established algorithms to calculate abstractions of the raw data, namely t-distributed stochastic neighbor embedding (t-SNE), multidimensional scaling (MDS) and principal component analysis (PCA). t-SNE is a well-suited technique for the visualization of high-dimensional data, creates meaningful intermediate results, and is effective for interactive data analysis (Pezzotti2017ApproximatedAnalytics). MDS is a technique for analyzing the similarity or dissimilarity of data; it attempts to model the relationship between data points as distances in a geometric space (Borg2003ModernApplications). Lastly, PCA decomposes a multivariate data set into a set of successive orthogonal components that explain a maximum amount of the variance in the data (Halko2011FindingDecompositions). The projections of each technique applied to the data distribution of the first branch are visualized in Figure 5.
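A small scikit-learn helper such as the following can compute the three projections; the two-dimensional output and fixed seeds are assumptions for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

def project(X, method: str = "pca", dim: int = 2, seed: int = 0):
    """Project raw data X (n_samples x n_features) to `dim` dimensions."""
    if method == "pca":
        return PCA(n_components=dim, random_state=seed).fit_transform(X)
    if method == "mds":
        return MDS(n_components=dim, random_state=seed).fit_transform(X)
    if method == "tsne":
        return TSNE(n_components=dim, random_state=seed).fit_transform(X)
    raise ValueError(f"unknown method: {method}")
```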

Figure 5: Bi-variate kernel density estimates of the data projections (t-SNE, PCA, MDS) for the data distribution of the first branch.

Subsequently, we calculate the divergences between the data projections. Lastly, for both the raw data and the data projections, we evaluate whether a correlation with the transferability of models exists.

2.4 Neural Net Similarity

The Singular Vector Canonical Correlation Analysis (SVCCA) is a method for analyzing and comparing the representations learned by artificial neural networks (Raghu2017SVCCA). It represents an amalgamation of a singular value decomposition (SVD) and a canonical correlation analysis (CCA) (Hardoon2004CanonicalMethods).

In this work, we use SVCCA to determine the similarity of two networks from two different branches. In Figure 6, we present an overview of the application of SVCCA to a potential source net in order to identify its transferability to a target distribution.

The two neural nets to be compared are fed with data $X$. In this study, we supply a data sample, which represents the sales of 2017 from the target distribution, to the potential source net and capture the activation vectors for every layer. A neuron's response is calculated as a representation over this finite set of inputs. The resulting activation vectors for each layer of neurons are then processed by applying SVD. Similar to eigenvalues, the singular values characterize the properties of the activation matrix. This results in singular vectors with associated singular values for the first net's activations $X$ and, similarly, for the second net's activations $Y$. Of these singular vectors we keep the top $k$, such that 99% of the variation of $X$ is explained by the top $k$ vectors. This helps to remove directions associated with neurons that are constantly zero or that exhibit noise with small magnitudes (Raghu2017SVCCA).

Subsequently, CCA is applied to the sets of top singular vectors. CCA is a well-established method for measuring the similarity of two sets of randomly distributed variables. Given the two sets of vectors, we wish to find linear transformations that maximally correlate the two sub-spaces. This can be reduced to an eigenvalue problem; solving it results in linearly transformed sub-spaces whose directions are maximally correlated with one another. As a result, we ultimately obtain the mean canonical correlation $\rho_{SVCCA}$, which we use as an indicator of the transferability of a source neural net towards a target data set.
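The following sketch follows this recipe with NumPy and scikit-learn (SVD with a 99% variance threshold, then CCA on the reduced subspaces); it is a simplified stand-in for the reference SVCCA implementation of Raghu2017SVCCA.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca(acts_x: np.ndarray, acts_y: np.ndarray, var_threshold: float = 0.99) -> float:
    """SVCCA sketch: acts_* are (neurons x datapoints) activations recorded on the
    same inputs; returns the mean canonical correlation of the reduced subspaces."""
    def reduce(acts):
        acts = acts - acts.mean(axis=1, keepdims=True)       # center each neuron
        _, s, vt = np.linalg.svd(acts, full_matrices=False)
        var = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = int(np.searchsorted(var, var_threshold)) + 1     # top-k explain >= threshold
        return (s[:k, None] * vt[:k]).T                      # (datapoints x k) subspace
    x, y = reduce(np.asarray(acts_x)), reduce(np.asarray(acts_y))
    n_comp = min(x.shape[1], y.shape[1])
    xc, yc = CCA(n_components=n_comp).fit_transform(x, y)    # maximally correlated directions
    corrs = [np.corrcoef(xc[:, i], yc[:, i])[0, 1] for i in range(n_comp)]
    return float(np.mean(corrs))
```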

Figure 6: Procedure of comparing a potential source neural network to a target net.

3 Results and Discussion

We present the results of this study in two steps. First, we describe the results of the initial net training and the performed transfers, thus addressing H1. Second, we describe the output of the analysis of the association between data, data projections, neural nets and their impact on transferability, thus addressing H2-H4.

3.1 Base and Transfer Results (Hypothesis 1)

To measure the performance of the developed forecasting models, two metrics are used: RMSE and MAPE. The RMSE captures the differences between the values predicted by a model and the actually observed values and, in this work, serves as the basis for model optimization. It has proven to be a meaningful performance indicator for regression tasks (Spuler2015). The RMSE is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}(\hat{y}_t - y_t)^2} \quad (11)$$

where $\hat{y}_t$ is the predicted value, $y_t$ is the actually observed value and $T$ is the number of predictions performed. The RMSE is a scale-dependent measure and therefore not suitable for comparing forecasting errors across different data sets (Hyndman2006AnotherAccuracy). Thus, to evaluate and compare the performance of different models on different data sets, we additionally calculate the MAPE. The MAPE delivers a very intuitive interpretation in terms of relative error and is therefore broadly used in practice (DeMyttenaere2016MeanModels). It is calculated as follows:

$$\mathrm{MAPE} = \frac{100\%}{T}\sum_{t=1}^{T}\left|\frac{y_t - \hat{y}_t}{y_t}\right| \quad (12)$$

where $\hat{y}_t$ is the forecast value and $y_t$ the actual observation for each of the $T$ forecasts. In the following, to ensure comparability, we only report the MAPE for all models.
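Both metrics are straightforward to implement, e.g.:

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, Eq. (11)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error in percent, Eq. (12)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))
```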

Figure 7: Scaled daily net revenue, actual and predicted; Above: branch 1, below: branch 4.

We train base models for every branch based on all available data up to and including 2016. Then, we test the models on the full year 2017 and calculate the MAPE and RMSE. In Figure 7, we exemplarily depict the scaled daily net revenue for branch 1 and branch 4. Both base models appear to predict the actual values well. However, it is also noticeable that there are significant differences in the sales patterns between those two data distributions and, thus, models.

As shown in Table 3, the number of potential transfers, and therefore the number of possible models to be evaluated, grows rapidly. However, to give an overview of the transfer impact, we present the results for the first degree of transfer in Table 4.

Source \ Target | Branch 1 | Branch 2 | Branch 3 | Branch 4 | Branch 5 | Branch 6
base | 9.59 | 13.31 | 13.94 | 11.88 | 23.00 | 13.26
Branch 1 | - | 13.13 (+1.34%) | 15.23 (-9.30%) | 11.14 (+6.28%) | 25.51 (-10.91%) | 13.42 (-1.23%)
Branch 2 | 9.86 (-2.80%) | - | 13.85 (+0.67%) | 10.64 (+10.44%) | 24.78 (-7.74%) | 13.96 (-5.29%)
Branch 3 | 9.49 (+1.07%) | 12.52 (+5.91%) | - | 11.21 (+5.67%) | 24.71 (-7.46%) | 13.07 (+1.38%)
Branch 4 | 9.31 (+2.91%) | 13.36 (-0.43%) | 16.11 (-15.60%) | - | 25.59 (-11.29%) | 12.82 (+3.31%)
Branch 5 | 9.23 (+3.74%) | 13.72 (-3.11%) | 15.06 (-8.03%) | 11.22 (+5.55%) | - | 13.11 (+1.10%)
Branch 6 | 9.18 (+4.30%) | 12.89 (+3.12%) | 15.01 (-7.68%) | 10.97 (+7.70%) | 24.82 (-7.92%) | -
Best transfer | 9.18 (+4.30%) | 12.52 (+5.91%) | 13.85 (+0.67%) | 10.64 (+10.44%) | 24.71 (-7.46%) | 12.82 (+3.31%)

Table 4: MAPE M (the lower the better) for the first degree of transfer for all branches. In brackets: the performance increase in comparison to a model that is trained solely on the target's data.

Not all transfers have a positive impact on the performance on a target distribution (see Table 4). This indicates that transferability varies depending on the relation between the source and target distribution. Additionally, in practice it might not be feasible to test all possible transfer model candidates on a target set, as every transfer and re-training of a model incurs computational cost. Simply testing all possible combinations via a brute-force approach would therefore not be efficient.

Target | base | 1st degree | 2nd degree | 3rd degree | 4th degree | 5th degree
Branch 1 | 9.59 | 9.18 | 9.08 | 8.98 | 8.96 | 8.96
Branch 2 | 13.31 | 12.52 | 11.87 | 11.73 | 11.65 | 11.70
Branch 3 | 13.94 | 13.84 | 13.76 | 13.38 | 13.25 | 13.01
Branch 4 | 11.88 | 10.64 | 10.33 | 10.18 | 10.22 | 10.03
Branch 5 | 23.00 | 24.71 | 23.19 | 22.42 | 22.16 | 21.98
Branch 6 | 13.26 | 12.82 | 12.95 | 12.42 | 12.49 | 12.21

Table 5: MAPE M (the lower the better) of the best model along the degrees of transfer for each target distribution.

In Table 5, the best results for each branch according to the degree of transfer are presented. Note that we select the best performing model for every transfer step and every branch. For almost all branches, an increase in prediction performance can be observed with an increasing degree of transfer; in the case of branch 5, performance only starts to surpass the base model after the third transfer. It is noticeable that the best performance tends to improve with every transfer step, albeit in some cases only marginally.

With an increasing degree of transfer, we can observe that in some cases the same distributions are used to re-train models. If, for instance, we investigate target branch 1, we can observe that branches 4 and 5 seem to be good previous distributions to train a model on. However, as we always re-train the complete net, an information loss is likely to arise after multiple transfer steps. H1 states that a model which is pre-trained on a distribution $D_s$ and transferred to a distribution $D_t$ outperforms a model trained solely on $D_t$. Thus, a two-sided one-sample t-test for the mean of all transferabilities (N=1950) is conducted to test whether the average transferability significantly deviates from zero. With a mean of 0.00894, a standard deviation of 0.06728 and a p-value <.0001, the test confirms that the average transferability is positive. Thus, H1 is supported. Although the average transferability is only slightly above zero, Table 5 illustrates that there is a steady increase in performance with every further transfer step. However, in that scenario, the best performing models are cherry-picked. In reality, it would not be desirable to test all 1950 transferred models, e.g., due to computational cost. Thus, it is desirable to know in advance which models will perform best. This leads us to the study of associations with transferability.

3.2 Associations with Transferability (Hypotheses 2-4)

Returning to our previously defined research gap, we aim to find indicators of transferability between two data distributions without comparing them directly. By establishing and testing H1, we first showed the utility of a transfer in our use case. Now, we empirically study the correlation between three influence factors and transferability: the data divergence (H2), the projected data divergence (H3) and the SVCCA (H4). For every hypothesis, we calculate Spearman's rank correlation coefficient between the transferability and the corresponding indicator. The coefficient describes both the strength and the direction of the relationship; the Spearman correlation evaluates the monotonic relationship between the two continuous variables. The results are presented in Table 6. We split H3 into three sub-hypotheses corresponding to the different data projection functions we examine: H3.1 corresponds to t-SNE, H3.2 to PCA and H3.3 to MDS. For every hypothesis, we examine N=1950 transferred models.

H | Indicator for transferability between $D_s$ and $D_t$ | Spearman's rho
H2 | Data divergence | -.4294***
H3.1 | Projected data (t-SNE) divergence | .0668**
H3.2 | Projected data (PCA) divergence | -.2397***
H3.3 | Projected data (MDS) divergence | -.3101***
H4 | Neural net similarity (SVCCA) | -.2245***

*: p<.05, **: p<.01, ***: p<.001.

Table 6: Spearman correlation of all tested indicators for transferability.

Although we do not primarily aim at indicators based on raw data, as comparing raw data might not be feasible in business networks for confidentiality reasons, we formulate H2 to investigate whether there is an association without any transformation of the data. H2 states that the divergence of two distributions $D_s$ and $D_t$, described as $d(D_s, D_t)$, correlates with the transferability $T_{s \to t}$. The results indicate a significant negative association between the data divergence and the transferability (ρ = -.4294, p<.0001).

By projecting data and thus masking confidential information, we test different techniques as transferability indicators through H3, which states that the divergence of the projections of two distributions $D_s$ and $D_t$, described as $d(P(D_s), P(D_t))$, correlates with the transferability $T_{s \to t}$. The sub-hypotheses H3.1-3.3 correspond to the different projection functions. For H3.1, the results indicate a positive association between the projected data divergence based on t-SNE and the transferability (ρ = .0668, p<.05). However, the Spearman's rho is rather low, which indicates only a weak correlation between the two variables. In the case of H3.2, the results paint a clearer picture: a negative correlation between the projected data divergence based on PCA and the transferability is present (ρ = -.2397, p<.0001). A similar situation can be observed for H3.3, where we find an even stronger negative correlation between the projected data divergence based on MDS and the transferability (ρ = -.3101, p<.0001). Based on the results for H3.1-3.3, we can derive that PCA and MDS are better aligned with the correlation between data divergence and transferability identified for H2, as the direction of their correlations with the transferability is the same. For t-SNE, in contrast, we only see a weak positive monotonic association.

Even though such a comparison exposes only projected rather than raw data, a breach of confidential information is not unlikely, as certain characteristics of the original data distribution can still be extracted from the projection. Thus, we state and test H4 to find indications of transferability via the result of the SVCCA, a measure of neural net similarity. H4 states that the output of a Singular Vector Canonical Correlation Analysis correlates with the transferability $T_{s \to t}$. Our tests show a result similar to those for H2, H3.2 and H3.3: we find a significant negative association between the neural net similarity and the transferability (ρ = -.2245, p<.0001).

In summary, we can reject the null hypotheses for H2-H4. However, we observe differences in the results for each tested association. There is a clear negative correlation between the projected data divergence based on PCA and MDS and the transferability, whereas for t-SNE we only observe a weak positive correlation with a Spearman's rho below .07. The net similarity correlates in the same direction as the data and the PCA/MDS projection divergences, which indicates stable results.
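As an illustration of how these indicators could be used, the following sketch ranks candidate source branches for one target by projected-data divergence, preferring low divergence in line with the observed negative correlations; the data are random placeholders and the helpers `energy_distance` and `project` stem from the sketches above.

```python
import numpy as np

rng = np.random.default_rng(1)
branch_data = {f"branch_{i}": rng.normal(i * 0.1, 1.0, size=(200, 4)) for i in range(1, 6)}
target_data = rng.normal(0.0, 1.0, size=(200, 4))

# Rank candidate sources by divergence of MDS projections (lowest first).
ranked = sorted(branch_data,
                key=lambda b: energy_distance(project(branch_data[b], "mds"),
                                              project(target_data, "mds")))
print(ranked)
```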

4 Discussion

A multitude of insightful results can be derived from the conducted empirical research. First and foremost, what sparks our interest most is the observed dominant negative correlation between the transferability on the one hand and the data divergence, data projection divergence and neural net similarity on the other. Based on previous research, one would expect a positive correlation to be present (Xiao2012a). However, in the regarded case, we assume that a neural network benefits from divergent or different observations which are not available in its previous training data.

Additionally, in our case we consider sales data collected by different restaurants. Although the data sets originate from two different chains which serve different types of food, the underlying sales patterns might be quite comparable. The results indicate that the underlying data distribution cannot be fully learned from an isolated data population alone. Thus, we hypothesize that if a neural net receives a larger number of diverging observations as inputs, its generalization and hence its performance improve.

Another striking finding can be observed by visually inspecting the projections of the data populations and their respective transferabilities and divergences. Exemplarily, we consider projections derived through MDS and compare transfers of the first degree. In Figure 8, we present two cases in which the effect of the projected data divergence on the transferability can be visually observed, for particularly "successful" and "unsuccessful" transfers. The figure supports our hypothesis validation: successful transfers occur where the projected distributions diverge little, and, vice versa, unsuccessful transfers occur where the data is extremely divergent. However, future work is necessary to further investigate this phenomenon.

Furthermore, the correlations of the data and data projection divergences with the transferability show the same direction as the correlation between the neural net similarity and the transferability. This gives us reason to believe that the neural net similarity, as applied in this work via SVCCA, represents abstracted information similar to the divergence of the data and its projections. It also aligns with the work of Raghu2017SVCCA, who aim to find representations of features of a data set in a neuron's response. However, this assumption requires further confirmation in future work based on additional empirical research with other data sets.

Figure 8: Overlay of bi-variate kernel density estimates of the data projections (MDS) for two exemplary pairs of distributions, a) and b), and their respective bi-directional transferabilities.

5 Conclusion and Outlook

In this work, we utilize transfer machine learning on a unique sales data set. We do so to reveal two aspects of interest: first, the performance increase—labeled as transferability—of transferring models in general and second, the identification of indicators of a successful transfer prior to the transfer itself.

Therefore, we contribute to the body of knowledge in three ways. First, we implement a multi-step, system-wide transfer on the sales data of different restaurants and restaurant chains. We empirically show that such transfers can be beneficial. This is in line with Hypothesis 1, which states that a model that is pre-trained on one distribution and subsequently transferred to another distribution outperforms a model built solely on the target distribution. Second, we analyze the association of the divergence of data distributions, as well as the divergence of projections of data distributions, with the transferability. We are able to confirm Hypothesis 2 and Hypothesis 3 for different projection techniques, indicating a clear negative correlation between the data divergence and the data projection divergence on the one hand and the transferability on the other. Third, with Hypothesis 4 we analyze whether the output of a Singular Vector Canonical Correlation Analysis is associated with the transferability. Although we analyze only trained nets, and not data distributions or their projections, we are able to find an association between the neural net similarity and the transferability. In summary, for the regarded data set, this means that we are now able to estimate the transferability of models prior to the transfer, without regarding raw data. As a result, predictions about the transferability for new data sets in a business network can be made without exposing data distributions. Additionally, this could allow for more efficiency across the overall system, as the same problem does not need to be solved multiple times: a once-trained model can be re-applied several times for similar problems at each restaurant.

Despite the novelty of the approach, limitations are evident. First, only one data set comprising multiple entities, only time series forecasting and only one net architecture are considered. For theorizing about general indicators of transferability, more examples are necessary. Additionally, for the time being, we only show an association between data, data projection and neural net similarity and the transferability. We do not investigate further or exploit the association to engineer a search algorithm for transferring models in an ecosystem. On the technical side, the currently implemented transfer mechanism is prone to "forgetting", i.e., we do not dynamically adapt the frozen layers. Furthermore, the data and data projection associations with transferability neglect previous transfer steps of a model and are thus simplified. Finally, while no raw data is shared, recent research shows the possibility of retrieving single instances, especially extreme points of a population (fredrikson2015model).

Future research especially needs to address the last aspect. If we aim to allow privacy-preserving transfer machine learning, we need to incorporate differential privacy mechanisms into model training (abadi2016deep; Mironov2017). Furthermore, the empirical study can be extended by incorporating previous training sets, as these could result in stronger correlations, e.g., due to averaging over populations. A further enhancement of the transfer mechanism could prove meaningful, for instance by freezing certain layers, as well as by adapting the learning rate or the number of frozen layers with respect to the degree of transfer. Also, an in-depth investigation of the "forgetting" aspects of networks could be interesting, e.g., how many transfer steps are required for a network to "forget" information, and therefore limit the number of transfers from the beginning. As mentioned previously, more and repeated empirical studies on other data sets, models, and net architectures are necessary to address the generalizability of the approach. Finally, an exploitation of the association between SVCCA and transferability would be desirable, specifically the development of a method or search algorithm that utilizes it as a direction of search. This would allow choosing the "path of transfer" in advance and would result in higher model performances with fewer model transfers. A promising field of research lies ahead.

References