## 1 Introduction

Bayesian interpretations of neural networks have a long history, dating back to early work in the 1990s mackay1992bayesian, neal2012bayesian, and have recently regained attention, e.g. blundell2015weight, gal2016uncertainty, because of desirable properties such as uncertainty estimation, model robustness and regularisation.

In this paper we consider the application of Bayesian models to knowledge sharing between neural networks. Knowledge sharing comes in different facets, such as transfer learning, model distillation and shared embeddings. All of these tasks have in common that learned "features" ought to be shared across different networks.

Bayesian approaches offer a robust statistical framework to introduce prior knowledge into learning procedures. However, the tasks introduced above can be challenging in practice since "information" gained by one network such as learned features can be difficult to encode into prior distributions over networks that do not share the same architecture or even the same output dimension. We introduce here a Bayesian viewpoint that centres around features and describe a set of prior distributions derived from the theory of Gaussian processes and deep kernel learning that facilitate a variety of deep learning tasks in a unified way. In particular, we will show that our approach is applicable to knowledge distillation, transfer learning and combining experts.

## 2 Bayesian Neural Networks

In the standard Bayesian interpretation, neural network weights, denoted $\theta$, are random variables endowed with a prior distribution $p(\theta)$. Let us denote by $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ the dataset consisting of independent observations, where $x_i$ is the input data and $y_i$ the labels or output. The object of interest is the posterior distribution

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta),$$

where we write $p(\mathcal{D} \mid \theta)$ for the likelihood function.
Many tasks in deep learning naturally lend themselves to a Bayesian approach where a teacher network provides prior knowledge which is incorporated in the learning process of a student network.
However, assigning a prior on the weights directly is impractical as we are often interested in sharing information between networks with different architectures.
For this reason, we propose to distil *features*, denoted $\Phi$, generated by the teacher network, i.e.

$$p(\Phi \mid \mathcal{D}) \propto p(\mathcal{D} \mid \Phi)\, p(\Phi),$$

where $p(\Phi)$ is the prior for the features. For prior elicitation we draw on the theory of Gaussian processes, which arise naturally in the context of neural networks, although other approaches are possible. For a detailed introduction to Gaussian processes we refer the reader to rasmussen2003gaussian. In the following we briefly review how feature spaces created by neural networks form Gaussian processes.

### 2.1 Neural Networks as Gaussian Processes

The Gaussian process interpretation of neural networks originates from early work by neal2012bayesian, where it is shown that an infinite-width single-layer neural network is equivalent to a Gaussian process. This line of work has recently found renewed attention matthews2018gaussian, lee2018deep, where the authors consider deep networks. Let $x$ be an input and for layers $l = 1, \dots, L$ with widths $N_l$ write

$$\phi_i^l(x) = b_i^l + \sum_{j=1}^{N_{l-1}} W_{ij}^l\, \sigma\bigl(\phi_j^{l-1}(x)\bigr),$$

where $b_i^l$ denotes the bias of feature $i$ in layer $l$ and $W_{ij}^l$ the weights analogously. Here $\sigma$ denotes some activation function. An application of a central limit theorem under Gaussian weights, $W_{ij}^l \sim \mathcal{N}(0, \sigma_w^2 / N_{l-1})$ and $b_i^l \sim \mathcal{N}(0, \sigma_b^2)$, shows that for $N_l \to \infty$ in all layers the network induces a zero-mean Gaussian process over features with covariance function

$$K^l(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\bigl[\sigma(\phi^{l-1}(x))\, \sigma(\phi^{l-1}(x'))\bigr].$$

For finite-width neural networks, we take the empirical counterpart

$$K^l(x, x') = \frac{1}{N_l} \sum_{i=1}^{N_l} \phi_i^l(x)\, \phi_i^l(x').$$

This can also be viewed as an instance of a deep kernel GP, see e.g. wilson2016deep, although we consider degenerate linear kernels (also referred to as dot product kernels) in this inferred feature space rather than RBF or spectral kernels.
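To make the finite-width kernel concrete, the following minimal numpy sketch (widths, activation and variances are illustrative assumptions, not the exact setup of this paper) draws a random one-hidden-layer network and shows the empirical dot-product kernel concentrating as the width grows, in line with the central limit argument above.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_kernel(x, y, width, sigma_w=1.0, sigma_b=1.0):
    """Draw one random single-hidden-layer network and return the
    finite-width kernel estimate K(x, y) = (1/N) sum_i phi_i(x) phi_i(y)."""
    d = x.shape[0]
    W = rng.normal(0.0, sigma_w / np.sqrt(d), size=(width, d))
    b = rng.normal(0.0, sigma_b, size=width)
    phi_x = np.maximum(W @ x + b, 0.0)  # ReLU features for x
    phi_y = np.maximum(W @ y + b, 0.0)  # same network evaluated at y
    return phi_x @ phi_y / width

x = np.array([1.0, 0.0])
y = np.array([0.6, 0.8])

# Repeating the experiment at increasing widths: the estimates concentrate
# around the limiting GP covariance, illustrating the CLT argument.
for width in (10, 100, 10_000):
    samples = [empirical_kernel(x, y, width) for _ in range(20)]
    print(f"width={width:6d}  mean={np.mean(samples):.3f}  std={np.std(samples):.3f}")
```

The spread of the estimates shrinks with width, which is exactly the sense in which the finite-width dot-product kernel approximates the GP covariance.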

### 2.2 Distance Priors and Kullback-Leibler Divergence

In order to pass the knowledge of a teacher network to a student, a good prior distribution places high probability on features that are similar to those of the teacher network. Denote by $\tilde{\Phi}$ the features generated by the teacher network. Following our argument, a natural choice of prior is then based on the distance between features,

$$p(\Phi) \propto \exp\bigl(-\lambda\, d(\Phi, \tilde{\Phi})\bigr),$$

where $d$ is some non-negative function measuring similarity of features, but not necessarily a metric, and $\lambda$ is a tuning parameter.
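As one concrete instance of such a distance $d$, here is a stand-alone numpy sketch (a hypothetical implementation, assuming the shared features are output logits) of the temperature-softened softmax cross-entropy between student and teacher logits:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; T > 1 spreads probability mass."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_distance(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened softmax outputs; non-negative,
    small when the two sets of logits induce similar distributions."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -np.sum(p_t * np.log(p_s + 1e-12))

close = distillation_distance([2.0, 1.0, 0.1], [2.2, 0.9, 0.0])
far = distillation_distance([2.0, 1.0, 0.1], [0.0, 0.0, 5.0])
print(close, far)  # the mismatched logits yield the larger distance
```

Note that this particular $d$ only makes sense when student and teacher produce the same number of logits, a limitation discussed below.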

Approaches comparing features directly have been proposed in the past. Consider, for example, the case of model distillation bucilua2006model, hinton2015distilling, covered in more detail in the next section. If we choose independent priors for the features of the student network and $d$ is taken as

$$d(\Phi, \tilde{\Phi}) = H\bigl(s(\Phi / T),\, s(\tilde{\Phi} / T)\bigr)$$

for some temperature $T$, cross-entropy $H$, and softmax function $s$, we recover the approach of hinton2015distilling. However, such approaches have limitations. In addition to the unrealistic independence assumption, the above approach requires as many logits in the student model as in the teacher model. Similarly, we cannot easily share information from earlier layers. We circumvent this by comparing the distributions of the features on their induced function spaces using the KL divergence. The new prior distribution for the student network now reads

$$p(\Phi) \propto \exp\Bigl(-\lambda\, \mathrm{KL}\bigl(\mathcal{GP}(0, \tilde{K})\,\big\|\,\mathcal{GP}(0, K)\bigr)\Bigr),$$

where $K$ and $\tilde{K}$ denote the kernels induced by the student and teacher features respectively.

Since the feature maps are Gaussian processes, the KL divergence, evaluated at the $n$ training inputs, has an analytic form,

$$\mathrm{KL}\bigl(\mathcal{N}(0, \tilde{K})\,\big\|\,\mathcal{N}(0, K)\bigr) = \frac{1}{2}\Bigl[\operatorname{tr}\bigl(K^{-1} \tilde{K}\bigr) - n + \log\frac{\det K}{\det \tilde{K}}\Bigr].$$

In doing so, the KL divergence places a probability distribution over the space of features using the Gaussian processes $\mathcal{GP}(0, K)$, parameterised by their kernels. Note that this alleviates the requirement that the dimensionalities of the feature spaces (i.e. the number of neurons in the final layer) of the teacher and student networks match, as both are seen as functions in the same Hilbert space. The prior is then

$$p(\Phi) \propto \exp\Bigl(-\frac{\lambda}{2}\Bigl[\operatorname{tr}\bigl(K^{-1} \tilde{K}\bigr) - n + \log\frac{\det K}{\det \tilde{K}}\Bigr]\Bigr).$$

Other alternative choices for $d$ include Wasserstein distances, the Hellinger distance, the $L^2$-distance (between the features) and many more. However, we found that choices other than the KL divergence did not improve our results. It is also worth noting that using KL divergences in this way has clear links to the popular sparse variational Gaussian process and other approximate methods hensman2015scalable.
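The analytic KL above can be checked with a short numpy sketch (the feature matrices and widths below are illustrative assumptions): the divergence is computed from the two kernel matrices alone, so the teacher and student feature widths are free to differ.

```python
import numpy as np

def gaussian_kl(K_teacher, K_student, jitter=1e-8):
    """KL( N(0, K_teacher) || N(0, K_student) ) on n training inputs:
    0.5 * [ tr(K_s^{-1} K_t) - n + log det K_s - log det K_t ]."""
    n = K_teacher.shape[0]
    Ks = K_student + jitter * np.eye(n)  # jitter for numerical stability
    Kt = K_teacher + jitter * np.eye(n)
    trace_term = np.trace(np.linalg.solve(Ks, Kt))
    _, logdet_s = np.linalg.slogdet(Ks)
    _, logdet_t = np.linalg.slogdet(Kt)
    return 0.5 * (trace_term - n + logdet_s - logdet_t)

rng = np.random.default_rng(1)
teacher_feats = rng.normal(size=(5, 64))  # 64 teacher neurons
student_feats = rng.normal(size=(5, 16))  # 16 student neurons: widths differ

K_t = teacher_feats @ teacher_feats.T / 64  # empirical dot-product kernels
K_s = student_feats @ student_feats.T / 16

print(gaussian_kl(K_t, K_s))  # finite despite the width mismatch
print(gaussian_kl(K_t, K_t))  # (near) zero when the kernels agree
```

The divergence vanishes only when the two kernels coincide, which is exactly the behaviour a teacher-matching prior needs.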

## 3 Bayesian Knowledge Transfer

### 3.1 Model Distillation

Using our approach, the concept of model distillation merely becomes a Bayesian neural network where the prior is (for example) our KL prior derived from the features learned by the teacher model. As already alluded to earlier, these priors describe the behaviour of the features in more detail than a simple comparison of logits or a cross-entropy function. In addition, unlike traditional model distillation, the latent Hilbert space representation of the model is not constrained by the dimensionality of the output logits of the teacher model.

### 3.2 Transfer Learning

Transferring learned features from one model to another can be achieved similarly to the case of model distillation. It is important to note that the term "features" is not limited to the final layer logits. For example, in a convolutional neural network we could transfer the features learned by the convolutional layers disregarding the fully connected layers.

### 3.3 Combining Experts

Suppose we have a set of tasks, each associated with a neural network that has learned its respective task. We want to combine the knowledge of those "experts" into one model. In order to do so, we use the respective features and combine them as independent priors,

$$p(\Phi) \propto \prod_{k=1}^{K} \exp\Bigl(-\lambda_k\, \mathrm{KL}\bigl(\mathcal{GP}(0, \tilde{K}_k)\,\big\|\,\mathcal{GP}(0, K)\bigr)\Bigr),$$

where $\tilde{K}_k$ denotes the kernel induced by the features of expert $k$.
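A minimal numpy sketch of this combination (the helper and all sizes are illustrative assumptions): independent expert priors multiply, so their KL penalties simply add in the student's log-prior.

```python
import numpy as np

def gaussian_kl(K_expert, K_student, jitter=1e-8):
    """KL( N(0, K_expert) || N(0, K_student) ) on the n training inputs."""
    n = K_expert.shape[0]
    Ks = K_student + jitter * np.eye(n)
    Ke = K_expert + jitter * np.eye(n)
    trace_term = np.trace(np.linalg.solve(Ks, Ke))
    _, logdet_s = np.linalg.slogdet(Ks)
    _, logdet_e = np.linalg.slogdet(Ke)
    return 0.5 * (trace_term - n + logdet_s - logdet_e)

def combined_log_prior(K_student, expert_kernels, lams):
    """Independent priors multiply, so log-densities add (up to a constant):
    log p(Phi) = -sum_k lam_k * KL(expert_k || student) + const."""
    return -sum(lam * gaussian_kl(Ke, K_student)
                for lam, Ke in zip(lams, expert_kernels))

rng = np.random.default_rng(2)
expert_feats = [rng.normal(size=(4, 32)) for _ in range(3)]  # three experts
expert_kernels = [f @ f.T / 32 for f in expert_feats]
student_feats = rng.normal(size=(4, 8))
K_s = student_feats @ student_feats.T / 8

print(combined_log_prior(K_s, expert_kernels, lams=[1.0, 1.0, 0.5]))
```

Each $\lambda_k$ weighs how strongly the student is pulled towards the corresponding expert's feature distribution.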

## 4 Example Application: Fully Connected Networks for Fashion-MNIST

The first example application examines the benefit of using feature priors as a form of model distillation. The dataset considered is Fashion-MNIST. A classic convolutional neural network, composed of two convolutional layers and a dense layer with 128 nodes, achieves an accuracy of 92.7%. The goal of this exercise is to train a fully dense network with two hidden layers. Intuition would suggest that a dense network will not naively perform very well at this task. However, by placing a prior on the output of the first dense layer to match the features learned by the teacher network, we see an improvement of over 7% in absolute terms. Both networks were trained for 25 epochs and had the same architecture, loss function (excluding the prior component) and optimiser.

| Method | Accuracy (avg.) | Accuracy (std. error) | $F_1$-score Micro (avg.) | $F_1$-score Micro (std. error) | $F_1$-score Macro (avg.) | $F_1$-score Macro (std. error) |
|---|---|---|---|---|---|---|
| Naive | 80.65% | 1.02 | 80.65% | 1.02 | 77.45% | 1.47 |
| ba2014deep | 83.51% | 0.14 | 83.51% | 0.13 | 83.47% | 0.15 |
| hinton2015distilling | 86.61% | 0.11 | 86.62% | 0.11 | 86.60% | 0.11 |
| sau2016deep | 82.69% | 0.12 | 82.68% | 0.12 | 82.62% | 0.14 |
| romero2014fitnets | 88.51% | 0.07 | 88.52% | 0.07 | 88.55% | 0.08 |
| Proposed Approach | 89.81% | 0.05 | 89.81% | 0.05 | 89.81% | 0.04 |

An important technical point to mention is how we dealt with the unknown tuning parameter $\lambda$ outlined in Section 2.2. To avoid manual tuning, we first maximised the likelihood of the features agnostic of the training labels and then trained only the remaining dense layers given the inferred features. Hence, our training was broken into two parts: training the features to match the teacher network, followed by training the remaining layers to maximise the predictive performance.
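The two-stage split can be illustrated with a hypothetical numpy toy (linear layers fitted by least squares stand in for the actual networks and optimiser; all sizes and data are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: n inputs, stand-in teacher features, and integer labels.
n, d, d_teacher, n_classes = 200, 10, 32, 3
X = rng.normal(size=(n, d))
teacher_feats = np.tanh(X @ rng.normal(size=(d, d_teacher)))
labels = rng.integers(0, n_classes, size=n)

# Stage 1: fit the student's feature layer to the teacher's features alone,
# with no access to the training labels (here by least squares).
W1, *_ = np.linalg.lstsq(X, teacher_feats, rcond=None)
student_feats = X @ W1

# Stage 2: freeze the features and train only the remaining layers on the
# labels (here a one-hot least-squares head stands in for the softmax head).
Y = np.eye(n_classes)[labels]
W2, *_ = np.linalg.lstsq(student_feats, Y, rcond=None)
preds = (student_feats @ W2).argmax(axis=1)
print("train accuracy:", (preds == labels).mean())
```

The point of the split is that stage 1 never sees the labels, so no trade-off between the prior term and the likelihood (and hence no manual choice of $\lambda$) is required.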

## 5 Example Application: Multi-Level Priors for CIFAR-10

The second example application also inspects model distillation but compares the proposed approach to that set out by hinton2015distilling. We purposefully chose a more complex network to demonstrate the benefits of the proposed approach. The teacher network is composed of 4 VGG-like convolutional layers as depicted in Figure 2. As compressing multiple convolutional layers into a single dense layer would seemingly be more difficult, we split the convolutional feature extraction into two parts: the first two layers corresponding to low-level features and the latter two to higher-level features. A dense layer of 8192 hidden nodes was used to infer each of these sets of features. Comparing accuracy to the use of no parent model, and even to earlier approaches, neither of which significantly outperformed a random classifier, shows the unparalleled benefit of such Bayesian knowledge transfer.

| | No Parent | Classic Distillation | Proposed Approach | Parent Model |
|---|---|---|---|---|
| Top-1 Accuracy | 10.00% | 10.00% | 51.90% | 65.60% |
| Top-2 Accuracy | 19.98% | 19.95% | 71.64% | 82.08% |
| Top-3 Accuracy | 30.03% | 30.68% | 81.75% | 89.77% |

Finally, we note that this layer-wise distillation across architectures is not possible with hinton2015distilling, as the number of outputs per intermediate layer does not generally match between different network architectures.

#### Acknowledgements

Sebastian M. Schmon’s research is supported by the Engineering and Physical Sciences Research Council (EPSRC) grant EP/K503113/1.
