Transfer learning involves taking information and insight from one problem domain and applying it to a new problem domain. Although widely used in practice, theory for transfer learning remains less well-developed. To address this, we prove several novel results related to transfer learning, showing the need to carefully select which sets of information to transfer and the need for dependence between transferred information and target problems. Furthermore, we prove how the degree of probabilistic change in an algorithm using transfer learning places an upper bound on the amount of improvement possible. These results build on the algorithmic search framework for machine learning, allowing the results to apply to a wide range of learning problems using transfer.
Transfer learning is a type of machine learning where insight gained from solving one problem is applied to solve a separate but related problem [8]. An exciting frontier in machine learning, transfer learning has diverse practical applications in a number of fields, from training self-driving cars [1], where model parameters are learned in simulated environments and transferred to real-life contexts, to audio transcription [12], where patterns learned from common accents are applied to learn less common accents. Despite its potential for use in industry, little is known about the theoretical guarantees and limitations of transfer learning.
To analyze transfer learning, we need a way to talk about the breadth of possible problems we can transfer from and to under a unified formalism. One such approach is the reduction of various machine learning problems (such as regression and classification) to a type of search, using the method of the algorithmic search framework [7, 6]. This reduction allows for the simultaneous analysis of a host of different problems, as results proven within the framework can be applied to any of the problems cast into it. In this work, we show how transfer learning can fit within the framework, and define affinity as a measure of the extent to which information learned from solving one problem is applicable to another. Under this definition, we prove a number of useful theorems that connect affinity with the probability of success of transfer learning. We conclude our work with applied examples and suggest an experimental heuristic to determine conditions under which transfer learning is likely to succeed.
Previous work within the algorithmic search framework has focused on bias [5, 4], a measure of the extent to which a distribution of information resources is predisposed towards a fixed target. The case of transfer learning carries additional complexity as the recipient problem can use not only its native information resource, but the learned information passed from the source as well. Thus, affinity serves as an analogue to bias which expresses this nuance, and enables us to prove a variety of interesting bounds for transfer learning.
Transfer learning can be defined by two machine learning problems [8]: a source problem and a recipient problem. Each is defined by two parts, a domain and a task. The domain consists of the feature space, the label space, and the data. The task is defined by an objective function, a conditional distribution over the label space conditioned on an element of the feature space; in other words, it gives the probability that a given label is correct for a particular input. A machine learning problem is "solved" by an algorithm that takes in the domain and outputs a function. The success of an algorithm is its ability to learn the objective function as its output. Learning and optimization algorithms use a loss function to evaluate an output function and decide whether it is worth outputting. Such algorithms can be viewed as black-box search algorithms [7], where the particular algorithm determines the behavior of the black box. For transfer learning under this view, the output is defined as the final element in the search history.

Pan and Yang separated transfer learning into four categories based on the type of information passed between domains [8]:
Instance transfer: Supplementing the target domain data with a subset of data from the source domain.
Feature-representation transfer: Using a feature-representation of inputs that is learned in the source domain to minimize differences between the source and target domains and reduce generalization error in the target task.
Parameter transfer: Passing a subset of the parameters of a model from the source domain to the target domain to improve the starting point in the target domain.
Relational-knowledge transfer: Learning a relation between knowledge in the source domain to pass to the target domain, especially when either or both do not follow i.i.d. assumptions.
To analyze transfer learning from a theoretical perspective, we take inspiration from previous work that views machine learning as a type of search. Montañez casts machine learning problems, including Vapnik's general learning problem (covering regression, classification, and density estimation), into an algorithmic search framework [6]. For example, classification is seen as a search through all possible labelings of the data, and clustering as a search through all possible ways to cluster the data [6]. This framework provides a common structure we can use to analyze different machine learning problems, as each can be seen as a search problem with a defined set of components. Furthermore, any result we prove about search problems applies to all machine learning problems we can represent within the framework.

Within the algorithmic search framework, the three components of a search problem are the search space Ω, the target set T, and the external information resource F. The search space, which is finite and discrete due to the finite-precision representation of numbers on computers, is the set of elements to be examined. The target set is a nonempty subset of Ω containing the elements we wish to find. Finally, the external information resource is used to evaluate elements of the search space. Usually, the target set and external information resource are related, as the external information resource guides the search toward the target [7].
In this framework, an iterative algorithm searches for an element in the target set, as depicted in Figure 1. The algorithm is viewed as a black box that produces a probability distribution over the search space from the search history. At each step, an element of Ω is sampled according to the most recent probability distribution. The external information resource is then used to evaluate the queried element, and the element and its evaluation are added to the search history. Thus, the search history is the collection of all points sampled and all information gained from the information resource during the course of the search. Finally, the algorithm creates a new probability distribution according to its rules. Abstracting the creation of the probability distribution allows the search framework to accommodate many different search algorithms [6].

Working within the same algorithmic search framework [7, 6, 5], Sam et al. [9] defined decomposable probability-of-success metrics, to measure the performance of search and learning algorithms, as
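The query loop described above can be sketched in a few lines. This is only an illustration of the abstraction, not code from the paper; the function names and the greedy update rule are hypothetical:

```python
import random

def black_box_search(omega, f, update_rule, steps, seed=0):
    """One run of the algorithmic search framework: sample a query
    from the current distribution, evaluate it with the external
    information resource f, record both in the history, and let the
    (abstracted) algorithm produce the next distribution."""
    rng = random.Random(seed)
    history = []
    dist = [1.0 / len(omega)] * len(omega)  # initial uniform distribution
    for _ in range(steps):
        x = rng.choices(omega, weights=dist, k=1)[0]  # query an element
        history.append((x, f(x)))                     # evaluate and record
        dist = update_rule(history, omega)            # next distribution
    return history

# A hypothetical greedy update rule: concentrate probability mass on
# the best-evaluated point seen so far.
def greedy(history, omega):
    best = max(history, key=lambda pair: pair[1])[0]
    return [0.9 if x == best else 0.1 / (len(omega) - 1) for x in omega]
```

Different learning algorithms correspond to different choices of `update_rule`; the framework abstracts exactly this choice.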
where the weighting distribution over time steps is not a function of the target set (with corresponding target function), being conditionally independent of it given the information resource. They note that one can view a decomposable metric as an expectation over the probability of successfully querying an element from the target set at each step, taken according to an arbitrary distribution over time steps. In the case of transfer learning, the distribution we choose should place most or all of its weight on the final step or final few steps: since we transfer knowledge from the source problem's model after training, we care about success once training is done rather than at the start.
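As a sketch of this idea, a decomposable metric reduces to a weighted average of per-step success probabilities, and the transfer-learning case places its weight on the final step. The per-step numbers below are hypothetical:

```python
def decomposable_success(per_step_success, step_weights):
    """A decomposable probability-of-success metric: an expectation,
    over a distribution on time steps, of the probability of querying
    a target element at each step."""
    assert abs(sum(step_weights) - 1.0) < 1e-9  # weights form a distribution
    return sum(p * w for p, w in zip(per_step_success, step_weights))

# Hypothetical per-step probabilities of sampling a target element,
# improving as training progresses.
per_step = [0.01, 0.05, 0.20, 0.60]

# Transfer learning cares about success after training, so we place
# all weight on the final step.
final_only = decomposable_success(per_step, [0.0, 0.0, 0.0, 1.0])

# Averaging uniformly over steps would instead mix in early failures.
uniform_avg = decomposable_success(per_step, [0.25] * 4)
```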
Let denote a fixed learning algorithm. We cast the source, which consists of and , into the algorithmic search framework as
range;
;
;
.
where is the th query in the search process, the loss function for the source, and is an error functional on learned conditional distribution and the optimal conditional distribution .
Generally, any information from the source can be encoded in a binary string, so we represent the knowledge transferred as a finite length binary string. Let this string be . Thus, we cast the recipient, which consists of and , into the search framework as
range;
;
;
.
where is a loss function, and is an error functional on and the optimal conditional distribution .
In a transfer learning problem, we want to know how the source problem can improve the recipient problem, which it does through the information resource. So, we can think about how the bias of the recipient is changed by the learned knowledge that the source passes over. Recall that the bias is defined by a distribution over possible information resources. However, we know that the information resource will contain , the original information resource from the recipient problem. Our distribution over information resources will therefore take that into account, and only care about the learned knowledge being passed over from the source.
To quantify this, we let be the distribution placed over , the possible learning resources, by the source. We can use it to make statements similar to bias in traditional machine learning by defining a property called affinity.
Consider a transfer learning problem with a fixed k-hot target vector, a fixed recipient information resource, and a distribution over a collection of possible learning resources. The affinity between the distribution and the recipient problem is defined as the expected increase or decrease in performance on the recipient problem when using a learning resource sampled from the collection according to the given distribution.
Using affinity, we seek to prove bounds similar to existing bounds about bias, such as the Famine of Favorable Targets and Famine of Favorable Information Resources [5].
We begin by showing that affinity is a conserved quantity, implying that positive affinity towards one target is offset by negative affinity towards other targets.
For any arbitrary distribution and any ,
This result agrees with other no free lunch [13] and conservation of information results [10, 3, 4], showing that trade-offs must always be made in learning.
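The conservation idea can be illustrated numerically in a single-step toy setting: for any fixed sampling distribution, its advantage over uniform guessing, summed across all size-k targets, vanishes. The distribution below and the simplified single-step notion of advantage are assumptions made for illustration, not the paper's exact affinity definition:

```python
from itertools import combinations

def single_step_bias(p, target, k):
    """Advantage of sampling once from p over uniform random guessing,
    for a target set of size k in a search space of size len(p)."""
    return sum(p[i] for i in target) - k / len(p)

n, k = 6, 2
p = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]  # hypothetical induced distribution

# Summed over every possible size-k target, the advantage vanishes:
# favorability toward some targets is exactly offset by
# unfavorability toward the rest.
total = sum(single_step_bias(p, t, k) for t in combinations(range(n), k))
```

Here `single_step_bias(p, (0, 1), 2)` is positive (the distribution favors that target), yet `total` is zero up to floating point, mirroring the trade-off stated in the theorem.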
Assuming the dependence structure of Figure 2, we next bound the mutual information between our updated information resource and the recipient target in terms of the source and recipient information resources.
Define $q(T_R, F_R \oplus F_S)$ as the probability of success for transfer learning. Then,

$$q(T_R, F_R \oplus F_S) \leq \frac{I(F_S; T_R) + I(F_R; T_R) + D_{\mathrm{KL}}(P_{T_R} \| \mathcal{U}_{T_R}) + 1}{I_\Omega},$$

where $I_\Omega = -\log \frac{k}{|\Omega|}$ ($T_R$ being of fixed size $k$), $D_{\mathrm{KL}}(P_{T_R} \| \mathcal{U}_{T_R})$ is the Kullback-Leibler divergence between the marginal distribution on $T_R$ and the uniform distribution on $T_R$, and $I(\cdot;\cdot)$ is the mutual information.

This theorem upper bounds the probability of successful transfer to show that transfer learning cannot help us more than our information resources allow. The bound is determined by $I(F_S; T_R)$, the mutual information between the source's information resource and the recipient's target; by $I(F_R; T_R)$, the mutual information between the recipient's information resource and the recipient's target; and by how much $P_{T_R}$, the distribution over the recipient's target, diverges from the uniform distribution $\mathcal{U}_{T_R}$. This makes sense in that
the more dependent $F_S$ and $T_R$, the more useful we expect the source's information resource to be in searching for the target, in which case $I(F_S; T_R)$ can take on larger values;
the more $P_{T_R}$ diverges from $\mathcal{U}_{T_R}$, the less we are at the mercy of randomness (since the uniform distribution maximizes entropic uncertainty).
Let be a finite set of learning resources and let be an arbitrary fixed -size target set. Given a recipient problem , define
where is the decomposable probability-of-success metric for algorithm on search problem and represents the minimally acceptable probability of success under . Then,
where is the decomposable probability-of-success metric with the recipient’s original information resource.
Theorem 5.3 demonstrates that the proportion of favorable information resources for transfer learning is bounded by the degree of success without transfer, along with the affinity (average performance improvement) of the set of resources as a whole. Highly favorable transferable resources are rare for difficult tasks within any neutral set of resources lacking high affinity. Unless a set of information resources is curated toward a specific transfer task by having high affinity for it, the set will not and cannot contain a large proportion of highly favorable elements.
For any fixed algorithm , fixed recipient problem , where with a corresponding target function , and distribution over information resources , if , then
where the first quantity represents the expected decomposable probability of successfully sampling an element of the target set with transfer, marginalized over learning resources, and the second is the probability of success without transfer under the given decomposable metric.
Theorem 5.4 tells us that transfer learning only helps in the case that we have a favorable distribution on learning resources, tuned to the specific problem at hand. Given a distribution not tuned in favor of our specific problem, we can perform no better than if we had not used transfer learning. This proves that transfer learning is not inherently beneficial in and of itself, unless it is accompanied by a favorably tuned distribution over resources to be transferred. A natural question is how rare such favorably tuned distributions are, which we next consider in Theorem 5.5.
Given a fixed target function and a finite set of learned information resources , let
be the set of all discrete -dimensional simplex vectors. Then,
where and is Lebesgue measure.
We find that highly favorable distributions are quite rare for problems that are difficult without transfer learning, unless we restrict ourselves to distributions over sets of highly favorable learning resources. (Clearly, finding a favorable distribution over a set of good options is not a difficult problem.) Additionally, note that we have recovered the same bound as in Theorem 5.3.
Given the performance of a search algorithm on the recipient problem in the transfer learning case, and its performance without the learning resource, we can upper bound the absolute difference between the two as
This result shows that unless using the learning resource significantly changes the resulting distribution over the search space, the change in performance from transfer learning will be minimal.
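A small numeric sketch of this relationship, using the standard facts that a change in the probability of any event is at most the total variation distance between the two distributions, which Pinsker's Inequality in turn bounds via the KL divergence. The distributions below are hypothetical:

```python
import math

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical distributions over a 4-element search space, with and
# without the transferred learning resource.
p_transfer = [0.5, 0.2, 0.2, 0.1]
p_plain = [0.25, 0.25, 0.25, 0.25]

target = {0}
gain = abs(sum(p_transfer[i] for i in target)
           - sum(p_plain[i] for i in target))

# Change in success probability <= total variation <= sqrt(KL / 2).
tv = total_variation(p_transfer, p_plain)
pinsker = math.sqrt(kl_divergence(p_transfer, p_plain) / 2)
```

If transfer barely changes the induced distribution, `tv` is small and the possible `gain` shrinks with it, which is the content of the bound.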
We can use examples to evaluate our theoretical results. To demonstrate how Theorem 5.2 can apply to an actual case of machine learning, we can construct a pair of machine learning problems in such a way that we can properly quantify each of the terms in the inequality, allowing us to show how the probability of successful search is directly affected by the transfer of knowledge from the source problem.
Let Ω be a 16 × 16 grid with k = 1. In this case, the target set is a single cell in the grid, so choosing a target set is equivalent to choosing a cell. Let the distribution on target sets be uniform across the grid. For simplicity, we assume there is no information about the target set in the recipient's information resource, and that any information must come via transfer from the source problem. Thus, I(F_R; T_R) = 0.
First, suppose that we provide no information through transfer, so that a learning algorithm can do no better than random guessing. The probability of successful search is then 1/256. We can calculate the bound from our theorem using the known quantities:
I(F_S; T_R) = 0;
I(F_R; T_R) = 0;
I_Ω = −log₂(1/256) = 8 (because it takes 4 bits to specify a row and 4 bits to specify a column);
D_KL(P_{T_R} ‖ U_{T_R}) = 0, since the distribution on targets is uniform.
Thus, we upper bound the probability of successful search at (0 + 0 + 0 + 1)/8 = 1/8.
Now, suppose that we had an algorithm which had been trained to learn which half, the top or bottom, our target set was in. This is a relatively easy task, and would be ideal for transfer learning. Under these circumstances, the actual probability of successful search doubles to 1/128. We can examine the effect that this transfer of knowledge has on our probability of success:
I(F_S; T_R) = 1;
I(F_R; T_R) = 0;
I_Ω = 8;
D_KL(P_{T_R} ‖ U_{T_R}) = 0.
The only change is in the mutual information between the recipient target set and the source information resource, which is able to perfectly identify which half the target set is in. This brings the upper bound on the probability of successful search to (1 + 0 + 0 + 1)/8 = 1/4, exactly twice as high as without transfer learning.
This result is encouraging because it demonstrates that the upper bound for transfer learning under dependence reflects changes in the use of transfer learning and their effects: the bound doubles exactly when the probability of success doubles. However, the bound is loose; in both cases it is 32 times the actual probability of success. Tightening the bound may be possible, but as this example shows, the bound can already serve a practical purpose.
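Assuming the bound takes the form q ≤ (I(F_S;T_R) + I(F_R;T_R) + D_KL + 1)/I_Ω with I_Ω = −log₂(k/|Ω|), the grid example's arithmetic can be checked directly:

```python
import math

def transfer_success_bound(mi_source, mi_recipient, kl_term, k, omega_size):
    """Upper bound on the probability of successful transfer search,
    assuming the form q <= (I(F_S;T_R) + I(F_R;T_R) + D_KL + 1) / I_Omega
    with I_Omega = -log2(k / |Omega|); all quantities in bits."""
    i_omega = -math.log2(k / omega_size)
    return (mi_source + mi_recipient + kl_term + 1) / i_omega

# 16 x 16 grid, single-cell target, uniform distribution on targets.
no_transfer = transfer_success_bound(0, 0, 0, k=1, omega_size=256)    # 1/8
with_transfer = transfer_success_bound(1, 0, 0, k=1, omega_size=256)  # 1/4
```

One bit of transferred information (which half holds the target) doubles the bound, matching the doubling of the actual success probability from 1/256 to 1/128.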
Our theoretical results suggest that we cannot expect transfer learning to be successful without careful selection of transferred information. Thus, it is imperative to identify instances in which transferred resources will raise the probability of success. In this section, we explore a simple heuristic indicating conditions in which transfer learning may be successful, motivated by our theorems. Theorem 5.2 shows that source information resources with strong dependence on the recipient target can raise the upper bound on performance. Thus, given a source problem and a recipient problem, our heuristic uses the success of an algorithm on the recipient problem after training solely on the source problem and not the recipient problem as a way of assessing potential for successful transfer. Using a classification task, we test whether this heuristic reliably identifies cases where transfer learning works well.
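A toy sketch of the heuristic (not the VGG16 experiment; the two Gaussian problems and all function names here are invented for illustration): train only on the source problem, then measure zero-shot accuracy on the recipient problem.

```python
import random

def centroid_model(xs, ys):
    """Tiny stand-in for a trained classifier: nearest class centroid."""
    centroids = {}
    for c in (0, 1):
        pts = [x for x, y in zip(xs, ys) if y == c]
        centroids[c] = tuple(sum(v) / len(pts) for v in zip(*pts))
    def predict(x):
        return min(centroids, key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(x, centroids[c])))
    return predict

def make_problem(shift, rng, n=200):
    """Two well-separated 2-D Gaussian classes; `shift` moves the whole
    problem, giving a related but distinct recipient problem."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        xs.append((rng.gauss(shift + 3 * y, 0.5), rng.gauss(shift + 3 * y, 0.5)))
        ys.append(y)
    return xs, ys

rng = random.Random(0)
src_x, src_y = make_problem(0.0, rng)   # source problem
rec_x, rec_y = make_problem(0.5, rng)   # related recipient problem

model = centroid_model(src_x, src_y)    # train on the source only
heuristic = sum(model(x) == y for x, y in zip(rec_x, rec_y)) / len(rec_y)
# A heuristic value well above 0.5 suggests transfer is likely to help;
# a value near 0.5 suggests it may not.
```

Because the recipient problem is a small shift of the source, the source-trained model scores far above chance here; an unrelated recipient problem would drive the heuristic toward 0.5.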
We focused on two similar image classification problems: classifying tigers versus wolves (TvW)¹ and classifying cats versus dogs (CvD)². Due to the parallels between these two problems, we expect that a model trained for one task will be able to help us with the other. In our experiment, we used a generic deep convolutional neural network image classification model (VGG16 [11], using Keras³) to evaluate the aforementioned heuristic and see whether it correlates with any benefit from transfer learning. The table below contains our results:

| Run | Source Problem | Source Testing Accuracy | Recipient Problem | Additional Training | Recipient Testing Accuracy |
|---|---|---|---|---|---|
| 1 | CvD | 84.8% | TvW | N | 74.24% |
| 2 | CvD | 84.8% | TvW | Y | 95.35% |
| 3 | TvW | 92.16% | CvD | N | 48.36% |
| 4 | TvW | 92.16% | CvD | Y | 82.44% |

¹ http://image-net.org/challenges/LSVRC/2014/browse-synsets
² https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/data
³ https://keras.io/applications/#vgg16
The Source Problem column denotes the problem we are transferring from, and the Recipient Problem column denotes the problem we are transferring to. The Source Testing Accuracy column contains the image classification model's testing accuracy on the source problem after training on its dataset, using a disjoint test dataset. The Additional Training column indicates whether we did any additional training before testing the model's accuracy on the recipient problem's dataset: N indicates no additional training, meaning that the corresponding entry in the Recipient Testing Accuracy column contains the result of the heuristic, while Y indicates an additional training phase, meaning that the corresponding entry contains the experimental performance of transfer learning. In each run we start by training our model on the source problem.
Consider Runs 1 and 2. Run 1 is the heuristic run for the CvD → TvW transfer learning problem. When we apply the trained CvD model to the TvW problem without retraining, we get a testing accuracy of 74.24%. This result is promising: it is significantly above a random fair coin flip, indicating that our CvD model has learned something about the difference between cats and dogs that weakly generalizes to other images of feline and canine animals. Looking at Run 2, we see that additionally training our model on the TvW dataset yields a transfer learning testing accuracy of 95.35%, which is higher than the testing accuracy when we train our model solely on TvW (92.16%). This is an example where transfer learning improves our model's success, suggesting that the pre-training step helps our algorithm generalize.
When we look at Runs 3 and 4, we see the other side of the picture. The heuristic for the TvW → CvD transfer learning problem in Run 3 is a miserable 48.36%, roughly how well we would do by flipping a fair coin. It is important to note that this heuristic is not symmetric, which is to be expected: for example, if the TvW model is learning from the backgrounds of the images rather than the animals themselves, we would expect it to transfer poorly to the CvD problem regardless of how well the CvD model applies to the TvW problem. Looking at Run 4, the transfer learning testing accuracy is 82.44%, which is below the testing accuracy when we train solely on the CvD dataset (84.8%). This offers some preliminary support for our heuristic: when the heuristic's value is close to random, pre-training may not only fail to benefit the algorithm but can even hurt performance.
Let us consider what insights we can gain from the above results regarding our heuristic. A high value means that the algorithm trained on the source problem performs well on the recipient problem, indicating that the algorithm can identify and discriminate between salient features of the recipient problem. Thus, when we transfer what it learns (e.g., the model weights), we expect to see a boost in performance. Conversely, a low value (around 50%, since anything much lower would allow us to simply flip the labels to obtain a good classifier) indicates that the algorithm is unable to learn features useful for the recipient problem, so we would expect transfer to be unsuccessful. It is important to note that this heuristic is heavily algorithm dependent, unlike our theoretical results: problems with a large degree of latent similarity can receive poor values from our heuristic if the algorithm struggles to learn the underlying features of the problem.
These results offer preliminary support for the suggested heuristic, which was proposed to identify information resources that would be suitable for transfer learning. More research is needed to explore how well it works in practice on a wide variety of problems, which we leave for future work.
Transfer learning is a type of machine learning that involves a source and recipient problem, where information learned by solving the source problem is used to benefit the process of solving the recipient problem. A popular and potentially lucrative avenue of application is in transferring knowledge from data-rich problems to more niche, difficult problems that suffer from a lack of clean and dependable data. To analyze the bounds of transfer learning, applicable to a large diversity of source/recipient problem pairs, we cast transfer learning into the algorithmic search framework, and define affinity as the degree to which learned information is predisposed towards the recipient problem’s target. In our work, we characterize various properties of affinity, show why affinity is essential for the success of transfer learning, and prove results connecting the probability of success of transfer learning to elements of the search framework.
Additionally, we introduce a heuristic to evaluate the likelihood of success of transfer, namely, the success of the source algorithm applied directly to the recipient problem without additional training. Our results show that the heuristic holds promise as a way of identifying potentially transferable information resources, and offers additional interpretability regarding the similarity between the source and recipient problems.
Much work remains to be done to develop theory for transfer learning. Through the results presented here, we learn that there are limits to when transfer learning can be successful, and gain some insight into what powers successful transfer between problems.
See 5.1
Note that is the sum of all target vectors definable on , which themselves correspond to the nonempty subsets of . Thus, the sum equals a constant vector, where .
By the definition of affinity and the linearity of expectation, we have
where the third equality follows from the fact that neither nor is a function of , allowing both to be pulled out of their sums, and the penultimate equality follows from the linearity of expectation and the fact that for any probability mass vector .
If then
where the first equality follows from application of the chain rule for mutual information, the second and fourth equalities follow from the definition of mutual information, the third equality follows from the conditional independence assumption, and the final inequality follows by application of the Data Processing Inequality
[2].

See 5.2
See 5.3
We seek to bound the proportion of successful search problems for which for any threshold . Then,
where the final equality follows from the definition of decomposable probability-of-success metrics.
Note that all of the randomness in comes from the learned information, , and not the fixed recipient information resource . Applying Markov’s Inequality and the definition of , we obtain
See 5.4
Let be the space of possible learning resources. Then,
Since we are considering the general probability of success for algorithm on using learning resource , but with a fixed recipient information resource , we have
Also note that because our information resources are drawn from the distribution . Making these substitutions, we obtain
Given a fixed recipient problem (), where has corresponding target function , a finite set of learning resources , and a set of all discrete -dimensional simplex vectors,
where .
Let . Then,
The quantity is a uniform expectation on the amount of mass that the random distribution places on resource . Since contains all possible distributions over , under uniform expectation the same amount of probability mass gets placed on each information resource. So, for any . Since the probability mass on any two learning resources is equivalent and the total probability mass must sum to one, by the Expectation of Simplex Vectors is Simplex [5], we have . Continuing,
See 5.5
See 5.6
where the first equality follows from the definition of decomposable probability-of-success metrics and the final inequality follows by application of Pinsker's Inequality.