1 Introduction
Machine learning is a main driver in the automation of process tasks across industries (Sanders2016). Although many industry players face similar problems with similar data structures in areas where machine learning can be utilized, every company typically solves these problems in isolation (Hirt2018). From a systems perspective, these analytical tasks are well comparable (mizoguchi1995task).
In an ideal world with an exhaustive exchange of all data across company borders, companies could solve similar problems in a more efficient manner (Hirt2018). However, due to competition and first and foremost, due to the preservation of intellectual property and privacy, a sharing of raw data is not feasible. From an economic standpoint, this poses a significant inefficiency as similar problems are solved multiple times and no analytical knowledge is exchanged. Additionally, the creation of analytical models is typically costly. If every company builds its own models, every company would end up with an inferior model as substantially more data potentially exists in the entire ecosystem. Moreover, every company would also have to reinvent the wheel, thus resulting in higher costs for model creation. Therefore, the current industry practices result in an inefficient resource utilization from a system’s viewpoint (hicks1939foundations).
To address this challenge, we propose the utilization of transfer machine learning, a technique that enables the reuse and improvement of predictive machine learning models using different, distributed data sets. Hereby, no raw data exchange between companies is required, yet the transferred model can be improved by leveraging these different data sets. Although different types of analytical models could be transferred (Hopf2017a), neural networks are especially suited for transfer machine learning and are thus the subject of the majority of related work (Weiss2016). Multiple studies demonstrate the effectiveness and efficiency of transfer machine learning on well-known, well-formed data sets like MNIST (long2013transfer) or ImageNet (huh2016makes), but real-world industry studies are lacking. One reason, amongst others, is the question of what, how, and when to transfer, since (naturally) not every neural net can be transferred to every data set (Pan2009ALearning). As our research gap, we observe a lack of techniques for identifying the impact of a neural net transfer prior to the transfer itself, which can be described as the transferability of a neural net. For the work at hand, transferability in general can be defined as the estimation of the extent to which representations learned from a source task can help in learning a target task (Bao2019AnLearning). This is especially relevant when considering large numbers of participants in an ecosystem and a correspondingly high number of potential neural net candidates for transfer.
To address this gap, we perform an empirical study on a real-world use case with the aim of studying the association between different similarity measures and the transferability of neural nets. Precisely, we are interested in indicators for transferability of neural nets that are based on a comparison of data and data projections as well as on the neural nets themselves. As a basis for this study, we consider a unique data set of an ecosystem of different restaurant branches owned by different legal entities, all of whom need to perform sales forecasts to improve their respective resource allocations. As owners fear exposing data outside their restaurant, they are not willing to share raw data. Therefore, they need a pre-transfer analysis of the possibility of value-adding neural nets without having to access the raw data of a competitor.
The paper at hand is structured as follows: In the remainder of this section, we cover related work, elaborate on our contribution to theory, define prerequisites and derive hypotheses. Then, we introduce the data set, present the neural net structure and the transfer, and elaborate on indicators for transferability based on raw data, data projections and neural nets. Afterwards, we present the results by first describing the performance impact of transferring neural nets in a business network. Then, we describe the impact of the tested indicators on transferability. After discussing our findings, we summarize the results, discuss their generalization, recognize limitations, and show future research prospects.
1.1 Related Work and Contribution to Theory
The foundations of transfer learning are surveyed by Pan2009ALearning as well as Weiss2016, who provide a detailed overview of transfer learning. A wide variety of studies on the application of transfer learning can be identified: Zhong2010CrossLearning present findings on the utilization of deep convolutional neural networks (CNN) in medical image analysis. They use large, general pretrained sets and adapt them to a specific task to show that pretrained CNNs using computer vision databases (e.g., ImageNet) are useful in medical image applications and that multi-view classification is possible without the pre-registration of the input images. Kim2014ConvolutionalClassification reports that pretrained word vectors for sentence-level classification tasks can be seen as universal feature extractors that can be utilized for various classification tasks.
In this study, we focus on investigating the transferability (Bao2019AnLearning) of neural networks from a source to a target domain. Related work can be divided into three main aspects that can indicate the transferability, namely the task similarity, the data similarity and, recently, also the model similarity. Table 1 summarizes the related work on transferability in alignment with these three research categories. A variety of work covering the topic of task similarity in transfer learning exists. Xue2007MultiTaskPriors classify tasks that are correlated and dependent, thus showing that concepts that were previously learned on one task may be transferred to other tasks. Yosinski2014HowNetworks state that the transferability is negatively affected by the specialization of higher-layer neurons to their source task, which eventually leads to a performance decrease on the target task. Another way to determine the transferability of neural nets is to examine the source and target data sets themselves.
Jain2011 use the similarity among data points in order to update the detection score of the classifier and its classification boundary. Xiao2012a find suitable training instances from other domains by measuring the distance between the source and target data in the domain of oil price forecasting. Zhong2010CrossLearning apply density ratio weighting to overcome the difference in marginal distributions and propose a reverse validation procedure to quantify how well a neural net approximates the true conditional distribution of the target domain. However, there are more methods for comparing data distributions that could indicate transferability, such as divergences or distances (Bhattacharyya1943OnDistributions; Eguchi1985AFunctionals; Kullback1951OnSufficiency).
Publication  Task similarity  Neural net similarity  Data similarity
Multi-Task Learning for Classification with Dirichlet Process Priors (Xue2007MultiTaskPriors)  x  -  -
How transferable are features in deep neural networks? (Yosinski2014HowNetworks)  x  -  -
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability (Raghu2017SVCCA)  -  x  -
Insights on representational similarity in neural networks with canonical correlation (Morcos2018InsightsCorrelation)  -  x  -
Online Domain Adaptation of a Pre-Trained Cascade of Classifiers (Jain2011)  -  -  x
Cross Validation Framework to Choose Amongst Models and Datasets for Transfer Learning (Zhong2010CrossLearning)  -  -  x
This work  -  x  x
Especially if the source data set is not available or cannot be accessed due to confidentiality reasons, examining a potential source neural network can be a way to gain insights on its transferability to a target data set. To the best of our knowledge, there is no work on finding indicators for transferability based on net structures. However, recent work shows possibilities for the comparison of neural net similarity using SVCCA (Raghu2017SVCCA) to interpret neural network representations. Morcos2018InsightsCorrelation apply SVCCA to compare net similarity across a group of CNNs, demonstrating that networks that generalize converge to more similar representations than networks that memorize.
In the course of this work, we are interested in transferring models across different data sets for which the data distribution may vary, but not the task to be executed. Thus, we disregard methods that are purely based on task similarity. We are interested in finding ways to receive indications on the transferability in a case where data cannot be pooled (e.g. due to confidentiality issues). To get an estimate of the basic indication of data similarity in transfer learning, we compare "raw" data sets. Then, in order to potentially reduce the amount of exposed information during the comparison, we examine ways to compare projections of those raw data sets. Given that even those projections might not be retrievable in some cases (e.g. in cases where only models are exchanged and initial training data is not accessible), we finally aim to find indicators for transferability based on the structure of a neural net.
The contribution of this work is threefold:

We develop and evaluate the utility of a multi-step, system-wide transfer on a unique data set in the domain of sales forecasting.

We empirically show an association between the divergence of data distributions and the divergence of projection of data distributions with respect to the transferability of models.

We empirically show that the result of a Singular Vector Canonical Correlation Analysis (SVCCA) is associated with the transferability.
1.2 Prerequisites and Research Design
In our case, we want to transfer neural networks across different federated data distributions of companies:
(1) 
We define the input of different data sets that are composed of samples of a neural network as follows:
(2) 
The test inputs and the corresponding true labels are composed of samples and are constructed by sampling uniformly from :
(3) 
The performance M of a neural network trained on with predicted labels is denoted as:
(4) 
The performance delta of a source neural network which is trained on a distribution and then transferred to a target distribution is described as
(5) 
We define this performance delta as the transferability of a model that is trained on the source distribution and transferred to the target distribution. In our case, we therefore regard transferability as a performance increase of a neural network from one (source) distribution to another (target) distribution. The first goal of our work is to show that transferability, i.e., the performance increase of a transferred model, exists for the regarded problem and data set. Therefore, we formulate our first hypothesis as follows:
Hypothesis 1 (H1): A model that is pretrained on a source distribution and transferred to a target distribution outperforms a model trained on the target distribution only.
If this hypothesis can be confirmed, the next step of this work consists of identifying possible indicators for transferability in advance of the transfer itself. To do so, we analyze indicators for the transferability by comparing the source and target distributions directly as well as their respective projections. Hereby, a projection maps a distribution as follows:
(6) 
The projected distribution is
. To empirically test different projections, we apply multidimensional scaling (MDS), principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).
(7) 
To compare two distributions and their respective projections, we calculate their data divergence and their data projection divergence.
In this work, we aim to empirically examine the association between the divergence of the data distributions, the divergence of the projected distributions, and the performance impact of a transfer. Accordingly, we formulate Hypotheses 2 and 3:
Hypothesis 2 (H2): The divergence of two data distributions correlates with the transferability between them.
Hypothesis 3 (H3): The divergence of the projections of two data distributions correlates with the transferability between them.
Finally, we examine the neural nets themselves, without accessing the source data, to find indicators for transferability. To this end, we consider the Singular Vector Canonical Correlation Analysis (SVCCA). SVCCA enables the comparison of the behavior of neural nets, derived from the activations of their neurons with regard to a data input. We consider the result of an SVCCA between two nets based on a common data sample. Accordingly, we formulate Hypothesis 4:
Hypothesis 4 (H4): The output of a Singular Vector Canonical Correlation Analysis correlates with the transferability.
In Figure 1, we give an overview of our hypotheses and their corresponding goals. For H1, we perform a two-sided one-sample t-test for the mean of all transferabilities to test whether the average transferability significantly deviates from zero. For H2-H4, we calculate Spearman's rank correlation coefficient as a nonparametric measure between the variables and test the significance of the calculated Spearman's rho.
2 Experiment
In this chapter, we first give an overview of the data we examined and subsequently elaborate on the sales forecasting model design and the transfer mechanism. Then, we describe how we compare data and data projections. Lastly, we present the applied variation of measuring the net similarity via SVCCA.
2.1 Data Set
We analyze unique daily sales data of six different restaurant branches of two particular restaurant chains that serve different types of food. The data set captures observations from 2013 until 2017.
Branch  1  2  3  4  5  6 

Company  A  A  A  B  B  B 
City  a  a  b  a  c  d 
By precisely predicting the sales per day for each branch in the next week, month, or even year, several advantages can be realized: based on the revenue and demand, staff schedules can be optimized toward cost savings and a better experience for customers can be delivered. Additionally, the procurement of supplies can be improved, as spoiled food is a main cost driver for restaurants. Thus, the management of restaurant chains has a major interest in forecasting sales for their branches.
Table 2 gives an overview of all the available branch data we use in this work. Each of the two restaurant companies has three branches in different locations. Branches 1, 2 and 4 are located within the same city.
Figure 2 compares the average weekly revenue of each branch. We can recognize a different weekly seasonality for the revenue of the restaurants. Branches 1, 2 and 5 have their highest sales on Saturday, while branches 4 and 6 reach their minimum on that very day. Different market orientation, opening hours and locations are possible reasons for this observation. In addition, branch 5 appears to be closed on Sundays. In general, a restaurant in the commercial city center can attract more customers on the weekend than one that is located in an industrial area of town. In those areas, offices or production businesses are located which tend to be closed on those days. All branches, with the exception of the aforementioned two branches, share a common pattern, with net revenue rising from Monday and reaching its peak over the weekend.
2.2 Sales Forecasting Model Design and Transfer
We aim to build separate models for each data distribution, where one data distribution corresponds to the data set of one branch. Afterwards, these models are transferred to every other distribution and then retrained. This procedure is repeated until every model has passed through every distribution exactly once (H1). To empirically study the effects of data divergence, data projection divergence and net similarity on the transferability of models, we test all possible transfers in a brute-force manner and analyze the results a posteriori (H2-H4).
Our goal is to develop a model that is able to forecast daily sales on a weekly basis. There are many ways to design a sales forecasting model, such as ARIMA models, additive regressions, or logarithmic regressions. To simplify our research design, we focus solely on Convolutional Neural Networks (CNN) for multivariate forecasting, as they have proven to achieve superior results in similar problems in the past (Borovykh2017ConditionalNetworks). Here, the input of a neural network is:
(8) 
where is a vector of daily sales of the previous sales period, denotes the year, the month and the week of the observation. The complete data set can be described as and . Then, the date and time index are adjusted and reformatted in line with the opening hours of the respective branches. The available variables are grouped by day in order to forecast the time series on a per day basis. We clean obvious errors in the data set by dropping erroneous values, such as negative daily revenues.
As a next step, we build a multi-head CNN model to forecast the daily sales of the next sales period. The structure of the CNN model is depicted in Figure 3. The model has four input variables: revenue of the previous sales period, month, weekday and year of the observation. Each variable is fed into a separate head. All heads consist of two one-dimensional convolutional layers with the same parameter configuration, followed by a max-pooling layer. The output of the pooling layers is flattened and merged by a concatenation layer. The merged heads' output is fed into a first fully connected layer followed by a second one to conduct the interpretation. Finally, the sales forecast for the next period is generated in the form of an output vector.
In a pretest, we determine the model hyperparameters by empirical testing as follows: the two one-dimensional convolutional layers both have 32 filter maps and a kernel size of 3. The rectified linear unit (ReLU) is applied as activation function to both convolutional layers. The pool size for the max-pooling layer is set to 2. The first fully connected layer contains 200 neurons and the second one 100 neurons. The model is compiled with the mean squared error (MSE) as loss function during training and uses Adam as optimizer (Kingma2015Adam:Optimization). After compilation, the model is fitted on the training data set for 20 epochs with a batch size of 16.
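To make the layer dimensions of the described architecture concrete, the following sketch traces the tensor sizes through one head and through the merged network. It is only an illustration under assumptions that the text does not state: an input window of seven daily values per head, unpadded convolutions with stride 1, non-overlapping pooling, and an output vector of seven values. All function and variable names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1-D convolution without padding."""
    return (length - kernel_size) // stride + 1


def pool1d_out_len(length, pool_size):
    """Output length of non-overlapping max pooling (stride equals pool size)."""
    return length // pool_size


def head_feature_count(seq_len, filters=32, kernel_size=3, pool_size=2):
    """Number of flattened features produced by one head (conv, conv, pool, flatten)."""
    length = conv1d_out_len(seq_len, kernel_size)   # first convolutional layer
    length = conv1d_out_len(length, kernel_size)    # second convolutional layer
    length = pool1d_out_len(length, pool_size)      # max-pooling layer
    return length * filters                         # flatten: steps x filter maps


SEQ_LEN = 7   # assumption: one week of daily values per input variable
N_HEADS = 4   # revenue, month, weekday, year (from the text)

merged_features = N_HEADS * head_feature_count(SEQ_LEN)   # concatenation layer
layer_sizes = [merged_features, 200, 100, SEQ_LEN]         # dense 200, dense 100, output

# Weights plus biases of each fully connected layer.
dense_params = [(n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print("features per head:", head_feature_count(SEQ_LEN))
print("merged features:", merged_features)
print("dense parameters:", dense_params)
```

With a different window length or with padded convolutions the numbers change accordingly; the sketch only shows how the stated hyperparameters (32 filter maps, kernel size 3, pool size 2, 200 and 100 neurons) interact.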
For the model training and retraining, we split the data into a training and a test set for each branch. As testing period, we consistently choose the year 2017. The remaining data forms our training or retraining set. For every model, we calculate its performance on the actual target data set and, for comparability, on the union of all test sets across all branches.
To implement the transfer, we retrain a source CNN on a target data set as depicted in Figure 4. We do not freeze any layers, so that the weights of all layers can be adjusted. We retrain the CNN model with the same number of epochs (25) and batch size (16) as in base model training. Note that it would also be possible to adaptively choose certain layers to freeze and dynamically adapt the learning rate. For this study, we chose not to change or vary the number of training parameters or frozen layers for a transfer. As a consequence, the models are more likely to "forget" previously learned knowledge. Future work needs to address a more adaptive learning strategy. The degree of transfer denotes the total number of transfers performed per model. In Table 3, we give an overview of all transfers, their respective source models and the respective targets according to the degree of transfer. Generally, the number of transfers grows significantly with a growing number of data sets N and can be described by $\sum_{k=1}^{N-1} \frac{N!}{(N-k-1)!}$.
Degree of transfer  1  2  3  4  5  Total 

Source models  6  30  120  360  720  120 
Possible targets  5  4  3  2  1   
Targets  30  120  360  720  720  1950 
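The transfer counts in Table 3 can be reproduced with a few lines of code. The sketch below assumes that a transfer of degree k corresponds to an ordered sequence of k + 1 distinct branches (the branch of the base model followed by k targets); this reading is inferred from the table, and the function name is hypothetical.

```python
from itertools import permutations
from math import factorial


def transfers_by_degree(n_datasets):
    """Map each degree of transfer k to the number of transfers of that degree."""
    return {
        k: factorial(n_datasets) // factorial(n_datasets - k - 1)
        for k in range(1, n_datasets)
    }


counts = transfers_by_degree(6)
total = sum(counts.values())

# Cross-check by brute-force enumeration of ordered branch sequences.
enumerated = {
    k: sum(1 for _ in permutations(range(6), k + 1))
    for k in range(1, 6)
}

print(counts)
print(total)
```

For six branches, the closed form and the enumeration agree with the "Targets" row of Table 3, including the total of 1950 transferred models.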
2.3 Data and Data Projection Divergence
In the following, we first introduce the utilized data divergence measure, which we apply to the unchanged data populations as well as to the projected data. Measuring the independence or divergence of two random variables or distributions can be conducted in different ways. In this work, we estimate the divergence of two data distributions using an energy distance meta estimator, which is equivalent to maximum mean discrepancy (Szabo2014InformationToolbox; Szekely2013EnergyDistances) and is defined as follows:
(9)
(10)
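As an illustration of the divergence measure, the following sketch computes a plug-in estimate of the squared energy distance between two samples, using the textbook form 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||. It is not the meta estimator of the cited toolbox; the function names and the toy samples are hypothetical.

```python
import math
from itertools import product


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (p, q) with p in a and q in b."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def energy_distance_sq(x, y):
    """Plug-in estimate of the squared energy distance between samples x and y."""
    return (
        2.0 * mean_pairwise_distance(x, y)
        - mean_pairwise_distance(x, x)
        - mean_pairwise_distance(y, y)
    )


# Toy samples standing in for two branch data sets.
sample_a = [(0.0, 0.0), (1.0, 0.0)]
sample_b = [(5.0, 0.0), (6.0, 0.0)]

print(energy_distance_sq(sample_a, sample_a))  # identical samples
print(energy_distance_sq(sample_a, sample_b))  # shifted sample
```

The estimate is zero for identical samples and grows as the two samples move apart, which is the property the divergence indicator relies on.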
Considering a scenario where data cannot be exchanged across entities of a system, it is not possible to compare two data sets simultaneously. To ensure a certain degree of confidentiality, a possible solution would be to compare only projected data, where critical information is already lost due to abstraction (Narayanan2008).
Thus, in an initial step we apply projections to the raw data to retrieve abstractions. We use three established algorithms to calculate abstractions of the raw data, namely t-distributed stochastic neighbor embedding (t-SNE), multidimensional scaling (MDS) and principal component analysis (PCA). t-SNE is a well-suited technique for the visualization of high-dimensional data to create meaningful intermediate results and is effective for interactive data analysis (Pezzotti2017ApproximatedAnalytics). MDS is a technique used for analyzing the similarity or dissimilarity of data. It attempts to model the relationship between data as distances in a geometric space (Borg2003ModernApplications). Lastly, PCA decomposes a multivariate data set into a set of successive orthogonal components which explain a maximum amount of the variance in the data (Halko2011FindingDecompositions). The projections for each technique applied to the first data distribution are visualized in Figure 5.
Subsequently, we calculate divergences between the data projections. Lastly, for both the raw data and the data projections, we evaluate whether a correlation with the transferability of models is given.
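Of the three projections, PCA is the simplest to write out. The sketch below projects a data matrix onto its first principal components via a singular value decomposition; the two-dimensional target space, the function name and the synthetic data are assumptions for illustration. For t-SNE and MDS one would typically rely on an existing library implementation, for example `sklearn.manifold.TSNE` and `sklearn.manifold.MDS` in scikit-learn.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project rows of `data` onto the first principal components."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)
    # Rows of `components` are the principal axes, ordered by explained variance.
    _, _, components = np.linalg.svd(centred, full_matrices=False)
    return centred @ components[:n_components].T


rng = np.random.default_rng(0)
# Synthetic stand-in for one branch's data: 100 observations, 4 variables.
branch_data = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
projected = pca_project(branch_data)

print(projected.shape)
```

A divergence measure can then be applied to two such projected data sets instead of the raw observations.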
2.4 Neural Net Similarity
The Singular Vector Canonical Correlation Analysis (SVCCA) is a method for analyzing and comparing different representations learned by artificial neural networks (Raghu2017SVCCA). It represents an amalgamation of a singular value decomposition (SVD) and a canonical correlation analysis (CCA) (Hardoon2004CanonicalMethods).
In this work, we use SVCCA to determine the neural net similarity of two networks of two different branches. In Figure 6, we present an overview of the application of SVCCA on a potential source net to identify its transferability to a target distribution.
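The two stages of SVCCA (an SVD-based reduction followed by CCA) can be sketched compactly. The code below is a minimal illustration, not the implementation used in this study: it assumes activation matrices of shape samples x neurons recorded on the same inputs, keeps the directions explaining 99% of the variance, and reports the mean canonical correlation as the similarity score. All names and the synthetic activations are hypothetical.

```python
import numpy as np


def top_singular_directions(activations, variance_kept=0.99):
    """Reduce a (samples x neurons) activation matrix to its top SVD directions."""
    centred = activations - activations.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of directions whose cumulative variance reaches the threshold.
    k = int(np.searchsorted(explained, variance_kept)) + 1
    return u[:, :k] * s[:k]


def svcca_similarity(acts_a, acts_b, variance_kept=0.99):
    """Mean canonical correlation between the SVD-reduced activations of two layers."""
    reduced_a = top_singular_directions(acts_a, variance_kept)
    reduced_b = top_singular_directions(acts_b, variance_kept)
    # CCA via QR: the singular values of Qa^T Qb are the canonical correlations.
    q_a, _ = np.linalg.qr(reduced_a)
    q_b, _ = np.linalg.qr(reduced_b)
    correlations = np.linalg.svd(q_a.T @ q_b, compute_uv=False)
    return float(np.mean(np.clip(correlations, 0.0, 1.0)))


rng = np.random.default_rng(0)
layer_a = rng.normal(size=(200, 5))  # synthetic activations of one net
layer_b = rng.normal(size=(200, 5))  # synthetic activations of another net

same = svcca_similarity(layer_a, layer_a)
different = svcca_similarity(layer_a, layer_b)
print(same, different)
```

Comparing a layer with itself yields a score of one, while unrelated activations yield a clearly lower score.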
The two neural nets that are to be compared are fed with the same data. In this study, we supply a data sample which represents the sales of 2017 from the target distribution to the potential source net and capture the activation vectors for every layer. The neurons' response is calculated as a representation over a finite set of inputs. The resulting activation vectors for each layer of neurons are then processed by applying SVD. This results in singular vectors with associated singular values for X and similarly for Y. Similar to eigenvalues, the singular values characterize the properties of the matrix. Of these singular vectors, we keep the top ones that together explain 99% of the variation of X. This helps to remove directions with respect to neurons that are constantly zero or exhibit noise with small magnitudes (Raghu2017SVCCA).
Subsequently, CCA is applied to the two sets of top singular vectors. CCA is a well-established method for understanding the similarity of two different sets of random variables. Given the two sets of vectors, we wish to find linear transformations that maximally correlate the two subspaces. This can be reduced to an eigenvalue problem. Solving this problem results in linearly transformed subspaces with directions that are maximally correlated with one another. As a result, we ultimately obtain the SVCCA output as an indicator for the transferability of a source neural net towards a target data set.
3 Results and Discussion
We present the results of this study in two steps. First, we describe the result of the initial net training and the performed transfers, thus addressing H1. Second, we describe the output of the analysis on the association between data, data projection, neural net, and their impact on transferability, thus addressing H2-H4.
3.1 Base and Transfer Results (Hypothesis 1)
To measure the performance of the developed forecasting models, two metrics are used: the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). The RMSE is used to calculate the differences between values predicted by a model and the actual values observed and, in this work, serves as a basis during model optimization. It has proven to be a meaningful performance indicator for regression tasks (Spuler2015). RMSE is calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2} \qquad (11)$$
where $\hat{y}_t$ is the predicted value, $y_t$ is the actual value observed and $T$ is the number of different predictions performed. As a scale-dependent measure, RMSE is not suitable for comparing forecasting errors across different data sets (Hyndman2006AnotherAccuracy). Thus, to evaluate and compare the performance of different models on different data sets, we additionally calculate the MAPE. The MAPE delivers a very intuitive interpretation in terms of relative error and is therefore broadly used in practice (DeMyttenaere2016MeanModels). It is calculated as follows:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right| \qquad (12)$$
where $F_t$ is the forecast value and $A_t$ is the actual observation for the number of forecasts $n$. In the following, to ensure comparability, we only report the MAPE for all models.
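Both error measures are straightforward to compute. The sketch below uses the standard definitions of RMSE and MAPE and assumes that the MAPE is expressed as a fraction rather than in percent; the function names and the example values are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predictions and observations."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


def mape(forecast, actual):
    """Mean absolute percentage error as a fraction (multiply by 100 for percent)."""
    relative_errors = [abs((a - f) / a) for f, a in zip(forecast, actual)]
    return sum(relative_errors) / len(actual)


# Toy daily revenues; actual values must be non-zero for the MAPE to be defined.
actual_sales = [100.0, 200.0, 400.0]
forecast_sales = [110.0, 180.0, 400.0]

print(rmse(forecast_sales, actual_sales))
print(mape(forecast_sales, actual_sales))
```

The example also shows why the two measures behave differently: the RMSE depends on the revenue scale of a branch, whereas the MAPE relates each error to the size of the observation.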
We train base models for every branch based on all available data up to and including 2016. Then, we test the models on the full year of 2017 and calculate the MAPE and RMSE. In Figure 7, we depict the scaled daily net revenue for branch 1 and branch 4 as examples. Both base models appear to predict the actual values well. However, it is also noticeable that there are significant differences in sales patterns between those two data distributions and, thus, between the models.
As shown in Table 3, the number of potential transfers and therefore the number of possible models that are evaluated grows exponentially. However, to give an overview of the transfer impact, we present the results for the first degree of transfer in Table 4.
[Table 4: header row "Source / Target, base, Branch 1, Branch 2, Branch 3, Branch 4, Branch 5"; cell values missing]
Not all transfers have a positive impact on the performance on a target distribution (see Table 4). This indicates that transferability varies, depending on the association between the source and target distribution. Additionally, in practice it might not be feasible to test all possible transfer model candidates on a target set, as the transfer and retraining of a model incurs computational cost. Simply testing all possible combinations via a brute-force approach would therefore not be efficient.
[Table 5: header row "Target / Degree of transfer, base, 1 degree, 2 degree, 3 degree, 4 degree, 5 degree"; cell values missing]
In Table 5, the best results for each branch according to the degree of transfer are presented. Note that we select the best performing model for every transfer step and every branch. For almost all branches, except branch 5, an increase in prediction performance can be observed with an increasing degree of transfer. In the case of branch 5, we observe an increase in performance starting after the third transfer. It is noticeable that the increase steadily grows for every transfer step, albeit in some cases marginally.
With an increasing degree of transfer, we can observe that in some cases the same distributions are used to retrain models. If, for instance, we investigate target branch 1, we can observe that branches 4 and 5 seem to be good previous distributions to train a model on. However, as we always retrain the complete net, an information loss is likely to arise after multiple transfer steps. H1 states that a model which is pretrained on a source distribution and transferred to a target distribution outperforms a model trained on the target distribution only. Thus, a two-sided one-sample t-test for the mean of all transferabilities (N=1950) is conducted to test whether the average transferability significantly deviates from zero. With a mean of 0.00894, a standard deviation of .06728 and a p-value <.0001, the test confirms that the mean of the transferability is positive. Thus, H1 is supported. Although the average transferability is only slightly above zero, Table 5 illustrates that there is a steady increase in performance with every further transfer step. However, in that scenario, the best performing models are cherry-picked. In reality, it would not be desirable to test all 1950 transferred models, e.g., due to computational cost. Thus, it is desirable to know in advance which models will perform best. This leads us to the study of associations on transferability.
3.2 Associations on Transferability (Hypotheses 2-4)
Returning to our previously defined research gap, we aim to find an indicator of transferability between two data distributions without comparing them directly. By establishing and testing H1, we first show the utility of a transfer in our use case. Now, we empirically study the correlation between three influence factors and the transferability: the data divergence (H2), the projected data divergence (H3) and the SVCCA (H4). For every hypothesis, we calculate Spearman's rank correlation coefficient between the transferability and the corresponding indicator. The coefficient describes both the strength and the direction of the relationship. The Spearman correlation evaluates the monotonic relationship between the two continuous variables: transferability and the corresponding indicator. The results are presented in Table 6. We split H3 into three sub-hypotheses corresponding to the different data projection functions we examine: H3.1 corresponds to t-SNE, H3.2 to PCA and H3.3 to MDS. For every hypothesis, we examine N=1950 transferred models.
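The two statistical procedures used in this study can be sketched without external libraries. The first function recomputes the t statistic of the one-sample test for H1 from the summary statistics reported above (mean 0.00894, standard deviation .06728, N=1950); the second computes Spearman's rho as the Pearson correlation of average ranks. The function names are hypothetical, and p-values are omitted because they require the t distribution, for which one would typically use a statistics library (for example `scipy.stats.ttest_1samp` and `scipy.stats.spearmanr`).

```python
import math


def t_statistic_from_summary(mean, std_dev, n):
    """t statistic of a one-sample test against a population mean of zero."""
    return mean / (std_dev / math.sqrt(n))


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


t_h1 = t_statistic_from_summary(0.00894, 0.06728, 1950)
print(round(t_h1, 2))
print(spearman_rho([1, 2, 3, 4], [9, 7, 5, 1]))
```

Because Spearman's rho only uses ranks, any strictly decreasing relationship between an indicator and the transferability yields a coefficient of -1, regardless of whether the relationship is linear.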
H  Indicator for transferability  Spearman's rho
H2  Data divergence  -.4294***
H3.1  Projected data (t-SNE) divergence  .0668**
H3.2  Projected data (PCA) divergence  -.2397***
H3.3  Projected data (MDS) divergence  -.3101***
H4  Neural net similarity (SVCCA)  -.2245***
Asterisks ("*", "**", "***") mark increasing levels of statistical significance.
Although we do not intend to find indications based on raw data, as this might not be feasible in business networks due to data confidentiality reasons, we formulate H2 to investigate whether or not there is an association without any transformation of the data. H2 states that the divergence of two distributions correlates with the transferability. The results of the study indicate that there is a significant negative association between the data divergence and the transferability (rho = -.4294, p<.0001).
By projecting data and thus masking confidential information, we state and test different techniques for transferability indicators through H3. H3 states that the divergence of the projections of two distributions correlates with the transferability. The sub-hypotheses H3.1-H3.3 each correspond to a different projection function. For H3.1, results indicate that there is a positive association between the projected data divergence based on the t-SNE projection and the transferability (rho = .0668, p<.05). However, Spearman's rho is rather low, which indicates a weak correlation between the two variables. In the case of H3.2, the results paint a clearer picture: a negative correlation between the projected data divergence based on PCA and the transferability is present (rho = -.2397, p<.0001). A similar situation can be observed for H3.3, where we find an even stronger negative correlation between the projected data divergence based on MDS and the transferability (rho = -.3101, p<.0001). Based on the results for H3.1-H3.3, we can derive that PCA and MDS are better aligned with the identified correlation between data divergence and transferability (H2), as the direction of their correlations with the transferability is the same. In the case of t-SNE, in contrast, we only see a weak positive monotonic association.
Even though the comparison exposes only projected rather than raw data, a breach of confidential information is not unlikely, as certain characteristics of the original data distribution can still be extracted from the projection. Thus, we state and test H4 to find indications of transferability from the result of SVCCA, a measure of neural net similarity. H4 states that the output of a Singular Value Canonical Correlation Analysis correlates with the transferability. Our tests show a result similar to those for H2, H3.2 and H3.3: we find a significant negative association between the neural net similarity and the transferability (ρ = -.2245, p < .0001).
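To make the notion of neural net similarity more concrete, the following is a minimal NumPy sketch of the general SVCCA idea: reduce each layer's activations to their dominant directions with a singular value decomposition, then average the canonical correlations between the two reduced subspaces. It is a sketch under stated assumptions, not the implementation used in this study. The function name `svcca_similarity`, the 99% variance threshold, and the random example activations are all hypothetical.

```python
import numpy as np


def svcca_similarity(acts_a, acts_b, variance_kept=0.99):
    """Mean canonical correlation between two activation matrices.

    acts_a, acts_b: arrays of shape (n_samples, n_neurons), recorded from two
    networks (or layers) on the SAME inputs. Neuron counts may differ.
    """

    def top_subspace(acts):
        # Center each neuron's response, then keep the leading singular
        # directions that explain `variance_kept` of the total variance.
        centered = acts - acts.mean(axis=0, keepdims=True)
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        total = np.sum(s ** 2)
        if total == 0:
            raise ValueError("activations are constant")
        explained = np.cumsum(s ** 2) / total
        k = min(int(np.searchsorted(explained, variance_kept)) + 1, len(s))
        return u[:, :k]  # orthonormal basis of the retained subspace

    basis_a = top_subspace(np.asarray(acts_a, dtype=float))
    basis_b = top_subspace(np.asarray(acts_b, dtype=float))
    # With orthonormal bases, the canonical correlations are the singular
    # values of basis_a^T basis_b (cosines of the principal angles).
    corrs = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(np.clip(corrs, 0.0, 1.0)))


# Hypothetical activations: 200 inputs, 5 neurons per layer.
rng = np.random.default_rng(0)
layer_x = rng.normal(size=(200, 5))
layer_y = layer_x @ rng.normal(size=(5, 5)) + 3.0  # affine transform of layer_x
layer_z = rng.normal(size=(200, 5))                # unrelated activations

same = svcca_similarity(layer_x, layer_y)
unrelated = svcca_similarity(layer_x, layer_z)
```

Here `same` is close to 1 because canonical correlations are invariant to affine transformations of the neurons, whereas `unrelated` is much lower. Only activations enter the computation, which is why such a measure can be evaluated without exchanging the underlying data sets.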
In summary, we can reject the null hypothesis for H2 to H4. However, we observe differences in the results for each tested association. The projected data divergences based on PCA and MDS show a clear negative correlation with the transferability, whereas t-SNE shows a positive correlation with a Spearman's rho below .07. PCA and MDS exhibit larger, yet negative, Spearman's rho values. We also observe the same negative direction of correlation between the net similarity and the transferability, which indicates stable results.
4 Discussion
Several insights can be derived from the empirical research. First and foremost, what sparks our interest most is the dominant negative correlation between the transferability and the data divergence, the data projection divergence, and the neural net similarity. Based on previous research, one would expect a positive correlation (Xiao2012a). In the case at hand, however, we assume that a neural network benefits from divergent observations that were not available in its previous training data.
Additionally, in our case we consider sales data collected by different restaurants. Although the data sets originate from two different chains which serve different types of food, the underlying sales patterns might be quite comparable. Results indicate that the underlying data distribution cannot yet be learned by looking at an isolated data population. Thus, we hypothesize that if a neural net receives a larger amount of diverging observations as inputs, its generalization and hence its performance improve.
Another striking finding emerges from visually inspecting the projections of data populations together with their respective transferability and divergences. As an example, we consider projections derived through MDS and compare a first degree of transfer. In Figure 8, we present two cases in which the relation between projected data divergence and transferability can be observed visually for particularly "successful" and "unsuccessful" transfers. The figure supports our hypothesis validation: successful transfers occur when the data is extremely divergent and, vice versa, unsuccessful transfers occur when the data is less divergent. However, future work is necessary to further investigate this phenomenon.
Furthermore, the correlations of the data divergence and the data projection divergence with the transferability show the same direction as the correlation between the neural net similarity and the transferability. This gives us reason to believe that the neural net similarity, as applied in this work with SVCCA, represents similar abstracted information as the divergence of data and its projection. It also aligns with the work of Raghu2017SVCCA, who aim to find representations of features of a data set in a neuron's response. However, this assumption requires confirmation in future work based on additional empirical research with other data sets.
5 Conclusion and Outlook
In this work, we utilize transfer machine learning on a unique sales data set. We do so to reveal two aspects of interest: first, the performance increase—labeled as transferability—of transferring models in general and second, the identification of indicators of a successful transfer prior to the transfer itself.
We thereby contribute to the body of knowledge in several ways. First, we implement a multi-step, system-wide transfer on the sales data of different restaurants and restaurant chains, and we show empirically that transfers can be beneficial. This is in line with Hypothesis 1, which states that a model that is pretrained on one distribution and subsequently transferred to another distribution outperforms the model built solely on the original distribution. Second, we analyze the association between the transferability and the divergence of data distributions as well as the divergence of their projections. We are able to confirm Hypothesis 2 and Hypothesis 3 for different sub-distributions, indicating a significant negative correlation between the transferability and both data divergence and data projection divergence. Third, with Hypothesis 4 we analyze whether the output of a Singular Value Canonical Correlation Analysis is associated with the transferability. Although we analyze only trained nets, and not data distributions or their projections, we are able to find an association between the neural net similarity and the transferability. In summary, for the data set at hand we are now able to determine the transferability of models prior to the transfer and without regarding raw data. As a result, predictions about the transferability for new data sets in a business network can be made without exposing data distributions. Additionally, applying the approach could make the overall system more efficient, as the same problem does not need to be solved multiple times: a model that has been trained once can be reapplied several times to similar problems at each restaurant.
Despite the novelty of the approach, it has clear limitations. First, only one data set of multiple entities, only time series forecasting, and only one net architecture are considered. To theorize general indicators for transferability, more examples are necessary. Additionally, for the time being, we only show an association between the transferability and data divergence, data projection divergence, and neural net similarity. We do not investigate the association further or use it to engineer a search algorithm for transferring models in an ecosystem. On the technical side, the currently implemented transfer mechanism exploits "forgetting", i.e., we do not dynamically adapt the frozen layers. Furthermore, the association of data and data projection divergence with transferability neglects previous transfer steps of a model and is thus simplified. Finally, while no raw data is shared, recent research shows that single instances, especially extreme points of a population, can be retrieved from a model (fredrikson2015model).
Future research needs to address the last aspect in particular. If we aim to enable privacy-preserving transfer machine learning, we need to incorporate differential privacy mechanisms into model training (abadi2016deep; Mironov2017). Furthermore, the empirical study can be extended by incorporating previous training sets, as these could result in stronger correlations, e.g., due to averaging over populations. A further enhancement of the transfer mechanism could prove meaningful, for instance by including the freezing of certain layers, as well as adapting the learning rate or number of frozen layers with respect to the degree of transfer. An in-depth investigation of the "forgetting" aspects of networks could also be interesting, e.g., how many transfer steps are required for a network to "forget" information, which would limit the number of transfers from the beginning. As mentioned previously, more and repeated empirical studies on other data sets, models, and net architectures are necessary to address the generalizability of the approach. Finally, it would be desirable to exploit the association between SVCCA and transferability, specifically by developing a method or search algorithm that uses it as a direction of search. This would allow choosing the "path of transfer" in advance and would result in higher model performance with fewer model transfer permutations. A promising field of research lies ahead.