Exploiting a Zoo of Checkpoints for Unseen Tasks

by Jiaji Huang, et al.
Purdue University
Baidu, Inc.

There are so many models in the literature that it is difficult for practitioners to decide which combinations are likely to be effective for a new task. This paper attempts to address this question by capturing relationships among checkpoints published on the web. We model the space of tasks as a Gaussian process. The covariance can be estimated from checkpoints and unlabeled probing data. With the Gaussian process, we can identify representative checkpoints by a maximum mutual information criterion. This objective is submodular. A greedy method identifies representatives that are likely to "cover" the task space. These representatives generalize to new tasks with superior performance. Empirical evidence is provided for applications from both computational linguistics and computer vision.




1 Introduction

There are many model checkpoints published. For example, upon acceptance of this paper, the Huggingface repository (https://huggingface.co/models) includes over 18,000 checkpoints trained for dozens of tasks (the tasks are tagged very coarsely; there would be many more with fine-grained tags). On the other hand, novel tasks can always arise, exciting tremendous interest in meta learning or learning to learn Baxter (2000); Vinyals et al. (2016); Finn et al. (2017); Snell et al. (2017). This paper does not propose a new learning-to-learn method. Rather, we are interested in the question, “Can we use the checkpoint zoo to build something that can adapt to unseen tasks?”

To address the question, we have to understand how tasks are relevant to each other. A paraphrase of task relevance is task transferability Zamir et al. (2018). Intuitively, transferring to a new task is easier if we warm start from a very relevant one. Efforts in this space originate from computer vision applications. Earlier work like Taskonomy Zamir et al. (2018) explicitly uses performance metrics (e.g., test accuracy) of transfer learning. Since then, multiple works have developed task descriptors to avoid running training and evaluation as Taskonomy does. A few prominent examples are Task2vec Achille et al. (2019), Attribution Map (A-Map) Song et al. (2020), and LEEP Nguyen et al. (2020). Very recently, Task2vec has also been extended to computational linguistics Vu et al. (2020). Each of these methods has its own pros and cons. However, these studies mainly aim at interpreting the performance of transfer learning Vu et al. (2020); Li et al. (2020). They mainly attack the question, “For a target task, which source task is the most relevant?” This is different from the question we are interested in.

Understanding the relation among tasks is crucial for meta learning and transfer learning. It is known that dissimilar tasks can impose significant challenges for these learning frameworks Jerfel et al. (2019); Wang et al. (2019). Aware of this fact, several recent works Tripuraneni et al. (2020); Du et al. (2021) develop generalization error bounds for learning new tasks. In particular, they address the question, “How many tasks have to be seen, so that a meta-trained model can generalize to new tasks?” These works often assume that there is a shared function mapping data to representations for all tasks. They then propose their own measure of task diversity, which is further employed to bound the generalization error for a new task. These bounds suggest that the seen tasks must be sufficiently diversified to “cover” all components that may be used for a future task. However, it is not obvious how to compute these diversity measures in practice.

Back to our problem, a zoo of checkpoints solve a collection of seen tasks. We are interested in combining/fusing a subset of them. Motivated by the theoretical works Tripuraneni et al. (2020); Du et al. (2021), we want the selected checkpoints (essentially the tasks they solve) to be diversified. In a perhaps tangential problem Krause et al. (2008), the authors consider where to place sensors in a room such that they are the most informative about the undeployed locations. They show that the placement should be diversified and well represent the landscape. To this end, we could draw an analogy. Consider the task space as a room, and checkpoints as sensors. The unseen tasks are locations where no sensor is placed but we wish to infer their information. In later sections, we shall see how this analogy helps us design a strategy for selecting checkpoints.

A relevant concept to our work is model ensemble, which has a long history Opitz and Maclin (1999) and is still being actively studied Zhang et al. (2020). However, model ensemble usually assumes a shared output space for all component models, and therefore has no notion of multiple (seen or unseen) tasks. Another relevant topic is feature selection Peng et al. (2005). Indeed, if we consider each checkpoint a feature extractor, picking checkpoints amounts to identifying useful features. However, feature selection methods usually aim at finding the most relevant features for a target variable. In other words, they only target a single seen task.

2 Motivating the Problem

Consider a zoo of $n$ checkpoints, where the $i$-th checkpoint is trained for the $i$-th task. In this paper, we define the $i$-th task as a joint distribution $P_i(x, y_i)$ to be fit to, where $x \in \mathcal{X}$ is the input and $y_i \in \mathcal{Y}_i$ is the target output. The $i$-th checkpoint is trained on a dataset $D_i$ drawn from $P_i$. $D_i$ is often proprietary, or, due to non-trivial preprocessing, very difficult to obtain exactly as it was. The tasks are assumed to share the same input space $\mathcal{X}$, but differ in their target spaces $\mathcal{Y}_i$. This is a natural assumption. Consider a zoo of convolutional networks that handle various computer vision tasks Zamir et al. (2018), or a collection of huggingface transformers for multiple linguistic tasks. $\mathcal{X}$ is a space of images in the former case, and of text in the latter.

Suppose there is a universe of (possibly infinitely many) tasks, drawn from a “hyper-distribution” $\mathcal{P}$. That is, $P_i \sim \mathcal{P}$. Modeling a family of tasks as a “distribution over distributions” Baxter (2000); Amit and Meir (2018) has a long history in the learning-to-learn literature. The checkpoints in the zoo have distilled information from $P_1, \dots, P_n$, and we are interested in building a “stronger” model from them that can easily generalize to unseen tasks $P_{\mathrm{new}} \sim \mathcal{P}$. It should be emphasized that these new tasks are not revealed to us before we build the “stronger” model.

To be more specific, let the $i$-th checkpoint be in the (commonly seen) form $h_i \circ f_i$, where $f_i$ is a feature extractor, $h_i$ is a task-specific “head” for the $i$-th task, and $\circ$ denotes function composition. $h_i$ is often a “simple” function, e.g., a softmax classifier. Note that $h_i$ may not necessarily be available. This is true for many checkpoints in the huggingface repository. We are interested in selecting a subset $\mathcal{S}$ of checkpoints from the zoo before seeing any new task $P_{\mathrm{new}}$, and building a “stronger” feature extractor by concatenating the selected $f_i$'s, $f_{\mathcal{S}} = [f_i]_{i \in \mathcal{S}}$. The $f_{\mathcal{S}}$ should be quickly adaptable to new tasks $P_{\mathrm{new}}$. In other words, given some small training data for a new task, we want a task-specific “head” on top of $f_{\mathcal{S}}$ to be “easily” learned and to work well on test data.

Of course, this approach should have some budget constraints. Indeed, when $n$ is big, concatenating all the $f_i$'s may be costly in computation, and may result in overfitting. At this point, one may propose to use feature selection methods like min-Redundancy-Max-Relevance (mRMR) Peng et al. (2005), that is, to pick the $f_i$'s that are the most informative for a new task while being the least mutually redundant. However, in our case, no new task is revealed to us before selecting. Thus we cannot simply apply mRMR.

Having ruled out feature selection methods, we re-think the problem by modeling the space of tasks. To this end, we may make an analogy to optimal sensor placement Krause et al. (2008). In that application, one is supposed to pick the best locations to place sensors, so that their readouts are the most informative about the unsensed locations. In information theoretic language, this amounts to maximizing $I(\text{sensed locations}; \text{unsensed locations})$, where $I$ is mutual information. In our problem, each checkpoint handles its specific task. It can be considered as a “sensor” placed at a task. Then, the checkpoint zoo is a “pilot deployment” that places many “sensors” on grids of the task space, which can reveal how the tasks are related. Selecting checkpoints is essentially picking a few seen tasks that may be the most informative about the unseen ones. In the following sections, we concretize the above idea by modeling and exploiting the task space.

3 Modeling the Task Space

Theoretical works Baxter (2000); Amit and Meir (2018); Tripuraneni et al. (2020) often do not specify a form for the hyper-distribution $\mathcal{P}$. In this work, we assume a Gaussian process. So a next step is to define the covariance function, $\kappa(P_i, P_j)$, for $P_i, P_j \sim \mathcal{P}$. In other words, we want to come up with a similarity measure between the two distributions $P_i$ and $P_j$. Moreover, by definition, $\kappa$ has to be Positive Semi-Definite (PSD). In the following, we first rule out several options that are less applicable. Then we propose our approach.

3.1 A Taxonomy of Task Similarities

If we had access to data samples $D_i \sim P_i$ and $D_j \sim P_j$, a natural idea would be comparing $D_i$ and $D_j$ directly. The difficulty is that the output spaces $\mathcal{Y}_i$ and $\mathcal{Y}_j$ are not directly comparable. For example, $y_i$ may be the input images’ class labels, whereas $y_j$ may be the foreground to be segmented. Therefore, straightforward measures like KL divergence cannot be applied. To address this issue, Geometric Dataset Distance (GDD) Alvarez-Melis and Fusi (2020) proposes to estimate the Wasserstein distance between $P_i$ and $P_j$, by carefully designing the “transport cost” between $\mathcal{Y}_i$ and $\mathcal{Y}_j$ using $D_i$ and $D_j$. However, $D_i$ and $D_j$ are often proprietary. For example, the data to train huggingface transformers are often preprocessed with various tools. It is very hard (if not impossible) to recover the exact same $D_i$.

Other approaches treat $\kappa(P_i, P_j)$ as the transferability from the $i$-th task to the $j$-th. Taskonomy Zamir et al. (2018) empirically computes that by running training and testing for this $i$-to-$j$ transfer learning problem. Doing that for all $O(n^2)$ pairs is very costly. Besides, non-trivial normalizations have to be applied, as performance metrics can differ across tasks. To circumvent this brute-force process, people have come up with vector representations of tasks, reducing task similarities to distances in this vector space. For example, Task2vec Achille et al. (2019) argues that the Fisher information matrix (essentially gradients w.r.t. parameters in the checkpoint) is an indicative representation for the corresponding task. However, it assumes that all checkpoints have the same model configuration, which is a very strong assumption and need not be satisfied in practice. Another work, A-Map Song et al. (2020), is based on the intuition that similar tasks should have similar saliency maps, where a saliency map is the gradient of the (pooled) feature w.r.t. the input. Therefore, it only requires some unlabelled probing data, and model architectures can be different. However, the back-propagation can be more costly. Furthermore, to compare two saliency maps, their inputs have to be the same. This constraint can be violated for NLP applications, as the same input text is often mapped to two different sequences of embedding vectors by two checkpoints.

Recently, NCE Tran et al. (2019) and LEEP Nguyen et al. (2020) measure task transferability by estimating an approximation of the conditional entropy, $H(y_j \mid y_i) = H(y_j) - I(y_i; y_j)$. Ignoring $H(y_j)$ and focusing on the mutual information term, we can understand these two methods as measuring the dependency between two target spaces $\mathcal{Y}_i$ and $\mathcal{Y}_j$. However, to estimate $I(y_i; y_j)$, data labels must be available.

For clarity, we list the discussed methods in Tab. 1. Also, the last row of Tab. 1 highlights some properties a desired method should have. We can see various reasons why these existing methods are not applicable. Nevertheless, LEEP Nguyen et al. (2020) provides us with a principled way of understanding the relationship between two tasks, i.e., measuring dependency between their target spaces. We next derive another approximation to this dependency, which is a desired method.

Method task task back-prop
Probing data
checkpoint data checkpoint data
GDD Alvarez-Melis and Fusi (2020)
Taskonomy Zamir et al. (2018)
Task2vec Achille et al. (2019)
A-Map Song et al. (2020)
LEEP Nguyen et al. (2020)
desired method ?
Table 1: Candidate methods to measure task relevance. Those requiring access to or are not applicable in our setting. Those requiring back-propagation are costly. also rules out some candidates. In contrast, the last row highlights a desired method.

3.2 Proposed Approach

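Following the setup in NCE Tran et al. (2019) and LEEP Nguyen et al. (2020), we also assume input samples $x_1, \dots, x_N$, with target outputs $Y_i$ in the $i$-th task and $Y_j$ in the $j$-th task, where each column corresponds to one input. We wish to measure the dependency between $Y_i$ and $Y_j$. One approach is Kernel Alignment (KA) Cristianini et al. (2001); Cortes et al. (2012). In the case of linear kernels, KA can be calculated as

$$\mathrm{KA}(Y_i, Y_j) = \frac{\langle Y_i^\top Y_i,\; Y_j^\top Y_j \rangle}{\| Y_i^\top Y_i \|_F \, \| Y_j^\top Y_j \|_F}.$$

To give some intuition of the above, let the columns of $Y_i$ and $Y_j$ be one-hot. Then $Y_i^\top Y_i$ (same for $Y_j^\top Y_j$) is a binary matrix that assigns 1 to two inputs with the same label, and 0 otherwise. KA measures the dependency between the two tasks by the cosine similarity of their assignment patterns.

However, we do not have access to the data labels. Moreover, sometimes the task-specific heads are unavailable, so even a prediction of the labels (denoted as $\hat{Y}_i$ and $\hat{Y}_j$) cannot be obtained. We tackle this difficulty by resorting to unsupervised “probing” data $X = [x_1, \dots, x_N]$. Note that this $X$ need not be the inputs that train the checkpoint(s). Regardless of the availability of $h_i$ and $h_j$, we can extract feature representations by

$$F_i = [f_i(x_1), \dots, f_i(x_N)], \qquad F_j = [f_j(x_1), \dots, f_j(x_N)].$$

For notational ease, in the following we may write $F_i = f_i(X)$ and $F_j = f_j(X)$ for any input $X$.

Then we use

$$\kappa(P_i, P_j) = \mathrm{KA}(F_i, F_j) = \frac{\langle F_i^\top F_i,\; F_j^\top F_j \rangle}{\| F_i^\top F_i \|_F \, \| F_j^\top F_j \|_F}$$

as an approximation to $\mathrm{KA}(Y_i, Y_j)$.

A natural question to ask is how good this approximation is. We assure ourselves by the following reasoning. Let $\hat{Y}_i = h_i(F_i)$ and $\hat{Y}_j = h_j(F_j)$ be the predictions made for each task (although $\hat{Y}_i$ may not be obtainable when $h_i$ is not available; the same holds for $\hat{Y}_j$). If each checkpoint works sufficiently well for its own task, then $\hat{Y}_i$ is very “close” to $Y_i$, and so is $\hat{Y}_j$ to $Y_j$. Thus $\mathrm{KA}(\hat{Y}_i, \hat{Y}_j)$ well approximates $\mathrm{KA}(Y_i, Y_j)$.

It remains to validate whether $\mathrm{KA}(F_i, F_j)$ well approximates $\mathrm{KA}(\hat{Y}_i, \hat{Y}_j)$. The only gap between $F_i$ and $\hat{Y}_i$ is a task-specific head $h_i$; likewise, $h_j$ for $F_j$ and $\hat{Y}_j$. We argue that many task-specific architectures are “well conditioned” Vorontsov et al. (2017); Bansal et al. (2018), which guarantees that pairwise inner products are not distorted too much. This property further results in $\mathrm{KA}(F_i, F_j) \approx \mathrm{KA}(\hat{Y}_i, \hat{Y}_j)$. To be more concrete, consider a simple special case, where $h_i$ and $h_j$ are linear mappings. Under the “well-conditioned” assumption, each head approximately preserves pairwise inner products up to a positive scale, and it is then easy to show that $\mathrm{KA}(\hat{Y}_i, \hat{Y}_j) \approx \mathrm{KA}(F_i, F_j)$.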
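The linear special case can be made explicit under an idealized assumption that is ours, not necessarily the exact condition used by the authors. Write $F_i$ for the feature matrix of checkpoint $i$ on the probing data and $\mathrm{KA}(A, B) = \langle A^\top A, B^\top B \rangle / (\|A^\top A\|_F \|B^\top B\|_F)$. Suppose the heads are linear, $\hat{Y}_i = W_i F_i$, and exactly satisfy $W_i^\top W_i = \alpha_i I$ with $\alpha_i > 0$ (the symbols $W_i$ and $\alpha_i$ are introduced here for illustration only). Then:

```latex
\begin{align*}
\hat{Y}_i^\top \hat{Y}_i
  &= F_i^\top W_i^\top W_i F_i = \alpha_i \, F_i^\top F_i, \\
\mathrm{KA}(\hat{Y}_i, \hat{Y}_j)
  &= \frac{\langle \alpha_i F_i^\top F_i,\; \alpha_j F_j^\top F_j \rangle}
          {\|\alpha_i F_i^\top F_i\|_F \, \|\alpha_j F_j^\top F_j\|_F}
   = \frac{\alpha_i \alpha_j \,\langle F_i^\top F_i,\; F_j^\top F_j \rangle}
          {\alpha_i \alpha_j \,\|F_i^\top F_i\|_F \, \|F_j^\top F_j\|_F}
   = \mathrm{KA}(F_i, F_j).
\end{align*}
```

The scales $\alpha_i$ and $\alpha_j$ cancel because KA is a cosine similarity. When $W_i^\top W_i$ is only approximately a multiple of the identity, the equality relaxes to an approximation.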

Last but not least, we have the following guarantee on positive definiteness.

Proposition 1.

Let $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = \kappa(P_i, P_j)$, then $K \succeq 0$.

We are not the first to use kernel alignment to understand deep nets. Kornblith et al. (2019) proposes to use linear and RBF Kernel alignment to measure similarity between representations. Although the methodologies appear similar, the goals are quite different. Kornblith et al. (2019) attempts to understand representation similarity, an open-ended problem. In contrast, we are explicitly interested in measuring the dependency between two tasks’ output spaces. And it turns out that KA between their respective features is a sensible surrogate. Moreover, Proposition 1 enables us to construct a Gaussian process on the task space.
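To make the covariance estimate concrete, the following is a minimal NumPy sketch of linear kernel alignment between feature matrices, assuming the uncentered form and a layout with one column per probing input. The function names and the random stand-in features are hypothetical and not taken from the paper; real features would come from running each checkpoint on the probing data.

```python
import numpy as np

def linear_kernel_alignment(F_i, F_j):
    """Uncentered linear kernel alignment between two feature matrices.

    F_i has shape (d_i, N) and F_j has shape (d_j, N): one column per probing input.
    Uses <F_i^T F_i, F_j^T F_j> = ||F_i F_j^T||_F^2, which avoids N x N kernels.
    """
    cross = np.linalg.norm(F_i @ F_j.T, "fro") ** 2
    norm_i = np.linalg.norm(F_i @ F_i.T, "fro")
    norm_j = np.linalg.norm(F_j @ F_j.T, "fro")
    return cross / (norm_i * norm_j)

def covariance_from_features(features):
    """Build the n x n matrix of pairwise alignments for a list of feature matrices."""
    n = len(features)
    K = np.eye(n)
    for a in range(n):
        for b in range(a + 1, n):
            K[a, b] = K[b, a] = linear_kernel_alignment(features[a], features[b])
    return K

# Stand-in features: three "checkpoints" with different feature dimensions
# evaluated on the same N = 200 probing inputs (random data, for illustration).
rng = np.random.default_rng(0)
feats = [rng.standard_normal((d, 200)) for d in (8, 16, 32)]
K_demo = covariance_from_features(feats)
min_eig = np.linalg.eigvalsh(K_demo).min()  # non-negative up to rounding
```

Feature dimensions may differ across checkpoints, since only the $N \times N$ kernels are compared. The resulting matrix is a Gram matrix of unit-norm vectorized kernels, which is why it has a unit diagonal and is positive semi-definite.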

3.3 Example: Characterize Linguistic Task Space using Huggingface Checkpoints

We showcase a concrete example using 34 checkpoints from huggingface. They are trained for various tasks (or their combinations). For example, the bert-* Devlin et al. (2018) checkpoints are trained for masked language modeling and next sentence prediction. The t5-*'s are for text-to-text generation. Complete details of these checkpoints and their respective tasks can be found in the supplementary material. Note also that multiple checkpoints may target the same task. These checkpoints are usually named with the same prefix. For example, the t5-*'s only differ in their architecture configurations (e.g., number of layers). By analogy to sensor placement, this amounts to a “pilot deployment” that places multiple checkpoints at the same location in the task space. It is somewhat redundant to do so. However, they grant us a chance to do a sanity check on the estimation of $\kappa$. That is, the checkpoints corresponding to the same task should have large $\kappa$ values.

Figure 1: Left: $K$ matrix computed using 34 checkpoints from huggingface. Note that by definition, the diagonal is all-one. Right: hierarchical clustering on $K$. Colors denote clusters. Key observations: 1) Similar names usually share colors, e.g., bert-* checkpoints are orange and t5-*'s are green, as they respectively handle the same task. 2) A checkpoint fine-tuned on another task can become very different, e.g., roberta-large-mnli against roberta-large.

We input the training set of wikitext2 as probing data, and extract the contextualized word embeddings after the penultimate layer. So task-specific layers like softmax classifiers are ignored. Note that these checkpoints employ different tokenizers, often resulting in the same word being split into different sub-words. We adopt the strategy in Liu et al. (2019), taking the representation of a word to be that of its last sub-word. Now the columns of $F_i$ are contextualized word representations from the $i$-th checkpoint. Then we can compute $\kappa$ for each pair of checkpoints to get the $K$ matrix. To visualize the relationship between these checkpoints (thus the tasks they target), we apply hierarchical clustering to the $K$ matrix (using the scipy functions linkage with method="ward" and dendrogram with color_threshold=0.9) and visualize a dendrogram in fig. 1.
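The clustering step can be sketched with the two scipy functions named above. The paper does not state whether the rows of the matrix or a derived distance were passed to `linkage`; this sketch assumes each row is treated as an observation vector. The toy matrix and checkpoint names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

def cluster_checkpoints(K, names, color_threshold=0.9):
    """Ward-linkage clustering on the rows of K, returning the dendrogram data."""
    Z = linkage(K, method="ward")  # assumption: rows of K act as observation vectors
    tree = dendrogram(Z, labels=names, color_threshold=color_threshold, no_plot=True)
    return Z, tree

# Toy 4 x 4 similarity matrix with two obvious pairs (illustrative values only).
names = ["ckpt-a1", "ckpt-a2", "ckpt-b1", "ckpt-b2"]
K_toy = np.array([
    [1.00, 0.95, 0.20, 0.25],
    [0.95, 1.00, 0.22, 0.20],
    [0.20, 0.22, 1.00, 0.90],
    [0.25, 0.20, 0.90, 1.00],
])
Z, tree = cluster_checkpoints(K_toy, names)
leaf_order = tree["ivl"]  # checkpoint names in dendrogram order
```

Passing `no_plot=True` returns the dendrogram structure without drawing, so matplotlib is only needed when a figure is actually wanted.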

In fig. 1, we observe that checkpoints with the same prefix are usually close, e.g., the t5-*'s. This is not surprising, since they are trained for the same task. A more surprising discovery is that longformer-base Beltagy et al. (2020) is very similar to roberta-base Liu et al. (2019). A closer check of their documentation informs us that they are both trained for masked language modeling, although longformer-base introduces a novel attention mechanism. On the other hand, adapting a checkpoint to another task can result in a significant difference, e.g., roberta-large-mnli against roberta-large. Therefore, the estimated $\kappa$ indeed reflects our intuitions on tasks.

Finally, there are also a few counter-intuitive cases. For example, gpt is very different from gpt2, but they are both trained for causal language modeling. We conjecture that it is their different tokenizers that drive $\kappa$ to be small. A full understanding of this observation is deferred to future study.

3.4 Robustness of the Estimated Covariance

Estimating $K$ relies on probing data. Therefore, it is natural to ask how robust $K$ is against the size of the probing data. In addition, we have implicitly simplified and assumed that the probing data, as well as the inaccessible training inputs for all tasks, have the same distribution on $\mathcal{X}$.

(a) KL divergence
(b) correlation
Figure 2: Measure the difference between $\hat{K}$ (estimated from a subset of probing data) and $K^*$ (estimated from full probing data). $\hat{K}$ converges quickly to $K^*$.

However, in practice, the input distribution may have domain shift from task to task, and from task to probing data as well. Hence, we also want to check whether $K$ is stable against different genres of probing text.

We denote the covariance estimated in the previous section as $K^*$. Now we estimate $\hat{K}$ using various subsets of the wikitext2 training set. It is guaranteed that any smaller subset is included in a bigger subset. We want to see how quickly these $\hat{K}$'s approach $K^*$ as the size of the probing data increases. The same experiment is also repeated on another corpus, 10K sentences taken from the billion-word benchmark, dubbed 1B-10K. This benchmark is created from news crawl, a different genre from wiki.

To compare $\hat{K}$ against $K^*$, we measure the KL-divergence between two Gaussian distributions with $\hat{K}$ and $K^*$ as their respective covariances. Another simple measure is the cosine similarity between the vectorized $\hat{K}$ and $K^*$. We plot the KL divergence and the cosine similarity against the number of words in the probing text, shown in fig. 2(a) and 2(b) respectively. For both plots, the leftmost part of the x-axis stands for zero probing data. In this case, it is natural to assume $\hat{K} = I$, an identity matrix, i.e., all checkpoints (thus the tasks they were trained for) are independent.

In fig. 2, we observe that as the size of the probing data increases, the KL divergences quickly converge to 0, while the cosines quickly converge to 1. In addition, let $K_{\mathrm{wiki}}$ be the covariance estimated from the full wikitext2 training set, and $K_{\mathrm{1B}}$ the covariance from the full 1B-10K. The KL divergence between the two is not large considering the range of KL-divergences in fig. 2(a). On the other hand, their cosine similarity is as high as 0.98. To conclude, the estimate of the covariance function is reasonably robust against the size and genre of the probing data.
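The two comparison measures can be sketched as follows, assuming zero-mean Gaussians and positive definite covariances. The direction of the KL divergence used in the paper is not recoverable from the text, so the argument order below is an assumption, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kl(K_a, K_b):
    """KL( N(0, K_a) || N(0, K_b) ) for positive definite covariance matrices."""
    n = K_a.shape[0]
    trace_term = np.trace(np.linalg.solve(K_b, K_a))  # tr(K_b^{-1} K_a)
    _, logdet_a = np.linalg.slogdet(K_a)
    _, logdet_b = np.linalg.slogdet(K_b)
    return 0.5 * (trace_term - n + logdet_b - logdet_a)

def matrix_cosine(K_a, K_b):
    """Cosine similarity between the vectorized matrices."""
    a, b = K_a.ravel(), K_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative values: an identity estimate (zero probing data) versus a
# correlated "full data" covariance.
K_full = np.array([[1.0, 0.5], [0.5, 1.0]])
K_zero = np.eye(2)
kl_start = gaussian_kl(K_zero, K_full)
cos_start = matrix_cosine(K_zero, K_full)
```

If an estimated covariance is close to singular (for example, when several checkpoints are near-duplicates), adding a small multiple of the identity before calling `gaussian_kl` is a common remedy; whether the authors did so is not stated.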

4 Exploiting the Task Space

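We have characterized the task space using a zoo of checkpoints. Now we are ready to select checkpoints. The idea is to pick seen tasks that are the most informative about the task space. Again, remember that this is conducted before any new task is revealed to us.

Suppose we have a budget of using $k$ checkpoints. Let $\mathcal{V}$ (with cardinality $n$) be the zoo of all checkpoints, and $\mathcal{S} \subset \mathcal{V}$ the set of selected checkpoints. We want to maximize the mutual information between the tasks solved by $\mathcal{S}$ and the remainder of the task space. With some abuse of notation, we denote this mutual information as $I(\mathcal{S}; \mathcal{V} \setminus \mathcal{S})$. Here $\mathcal{S}$ should be understood as the tasks corresponding to the checkpoints in $\mathcal{S}$. We optimize the following constrained objective,

$$\max_{\mathcal{S} \subset \mathcal{V}} \; I(\mathcal{S}; \mathcal{V} \setminus \mathcal{S}) \quad \text{s.t.} \quad |\mathcal{S}| = k.$$

The above is a combinatorial optimization problem. Enumerating all cases is infeasible. Instead, a heuristic is to use greedy search. Starting from an empty $\mathcal{S}$, each time we include a checkpoint that brings the largest gain to the objective. Specifically, denote by $i \in \mathcal{V} \setminus \mathcal{S}$ a candidate checkpoint, and $\bar{\mathcal{S}} = \mathcal{V} \setminus (\mathcal{S} \cup \{i\})$. The gain can be calculated as

$$\delta_i = I(\mathcal{S} \cup \{i\}; \bar{\mathcal{S}}) - I(\mathcal{S}; \bar{\mathcal{S}} \cup \{i\}) = H(i \mid \mathcal{S}) - H(i \mid \bar{\mathcal{S}}). \qquad (3)$$

The first term implies that $i$ should surprise the current $\mathcal{S}$ the most, thus incorporating more information. Meanwhile, the second term suggests that $i$ should be representative of the remainder of the task space, thus avoiding outliers.

Since we assume a Gaussian process on the task space, any subset of $\mathcal{V}$ is Gaussian distributed, and the two terms in Eq. (3) can be easily calculated. Specifically, let $K_{\mathcal{A}\mathcal{A}}$ denote the covariance for a subset $\mathcal{A}$, $K_{i\mathcal{A}}$ the row vector of cross-covariance between $i$ and $\mathcal{A}$, and $K_{\mathcal{A}i}$ its transpose. Then,

$$\delta_i = \frac{1}{2} \log \frac{K_{ii} - K_{i\mathcal{S}} K_{\mathcal{S}\mathcal{S}}^{-1} K_{\mathcal{S}i}}{K_{ii} - K_{i\bar{\mathcal{S}}} K_{\bar{\mathcal{S}}\bar{\mathcal{S}}}^{-1} K_{\bar{\mathcal{S}}i}}. \qquad (4)$$

The greedy selection process is summarized in Algorithm 1. Denote by $\mathcal{S}_k$ the set of selected checkpoints for a given value of $k$. Algorithm 1 guarantees that for any $k_1 < k_2$, $\mathcal{S}_{k_1} \subset \mathcal{S}_{k_2}$.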

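Input: Checkpoint zoo $\mathcal{V}$ with cardinality $n$, covariance matrix $K$, number of checkpoints to pick, $k$
Output: a set of selected checkpoints $\mathcal{S}$, where $|\mathcal{S}| = k$
$\mathcal{S} \leftarrow \emptyset$
for $t = 1, \dots, k$ do
   for $i \in \mathcal{V} \setminus \mathcal{S}$ do
      Compute information gain $\delta_i$ using Eq. (4)
   end for
   $i^* \leftarrow \arg\max_{i \in \mathcal{V} \setminus \mathcal{S}} \delta_i$
   $\mathcal{S} \leftarrow \mathcal{S} \cup \{i^*\}$
end for
Algorithm 1 Maximum Mutual Information (MMI) based Selection of Checkpoints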
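The greedy procedure can be sketched in NumPy as below, using the standard Gaussian conditional-variance form of the gain (the ratio of the variance of a candidate given the selected set to its variance given all remaining tasks). The function names, the toy covariance, and the small `jitter` term added for numerical stability are assumptions of this sketch rather than details from the paper.

```python
import numpy as np

def conditional_variance(K, i, cond):
    """Variance of task i given the tasks in `cond` under a zero-mean Gaussian."""
    cond = list(cond)
    if not cond:
        return K[i, i]
    K_cc = K[np.ix_(cond, cond)]
    k_ic = K[i, cond]
    return K[i, i] - k_ic @ np.linalg.solve(K_cc, k_ic)

def greedy_mmi(K, k, jitter=1e-6):
    """Greedily pick k indices, each maximizing the mutual information gain."""
    n = K.shape[0]
    K = K + jitter * np.eye(n)  # assumption: small ridge so that solves are stable
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            rest = [j for j in range(n) if j != i and j not in selected]
            gain = 0.5 * np.log(conditional_variance(K, i, selected)
                                / conditional_variance(K, i, rest))
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# Toy covariance: tasks 0-2 are strongly related, task 3 is an isolated outlier.
K_toy = np.array([
    [1.0, 0.8, 0.8, 0.0],
    [0.8, 1.0, 0.8, 0.0],
    [0.8, 0.8, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
picks_2 = greedy_mmi(K_toy, 2)
picks_3 = greedy_mmi(K_toy, 3)
```

In this toy case the isolated task has zero gain at the first step, because it carries no information about the rest of the space, so the first pick comes from the related group. The selections are nested by construction: the first two picks for $k = 3$ coincide with the picks for $k = 2$.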

4.1 Discussion on Quality of the Greedy Approximation

It is natural to ask how good the greedy method is. At this point, we need to review the definition of submodular functions. A set function $F$ is submodular if for any sets $\mathcal{A}$ and $\mathcal{B}$, $F(\mathcal{A}) + F(\mathcal{B}) \ge F(\mathcal{A} \cup \mathcal{B}) + F(\mathcal{A} \cap \mathcal{B})$. It is not hard to show that the following holds.

Proposition 2.

$I(\mathcal{S}; \mathcal{V} \setminus \mathcal{S})$ is a submodular function of $\mathcal{S}$.

Another useful concept is that of a monotonic function. A set function $F$ is monotonic if for any $\mathcal{A} \subseteq \mathcal{B}$, $F(\mathcal{A}) \le F(\mathcal{B})$. For monotonic submodular functions, it is known that a greedy method is guaranteed to achieve an objective no smaller than $(1 - 1/e)$ of the optimal one Nemhauser et al. (1978). In our case, $I(\mathcal{S}; \mathcal{V} \setminus \mathcal{S})$ is submodular but not monotonic, as $I(\emptyset; \mathcal{V}) = I(\mathcal{V}; \emptyset) = 0$. This indicates that when the set $\mathcal{S}$ is sufficiently large, further enlarging $\mathcal{S}$ will reduce the mutual information.

It thus becomes less clear whether the $(1 - 1/e)$ guarantee still holds. In optimal sensor placement Krause et al. (2008), the authors show that if the covariance is estimated for sufficiently many fine-grained grids in the room, then the mutual information is monotonic up to a reasonably big $k$. This may suggest that our $K$ has to be estimated for sufficiently many tasks. In other words, the checkpoint zoo must be rich enough. Nevertheless, it is hard to come up with a rigorous guarantee like that of Krause et al. (2008), as assumptions made in a 2D physical space do not necessarily transfer to an abstract task space. However, in experiments, we have found that Algorithm 1 always works in a monotonic regime.
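For reference, the classical guarantee discussed above can be stated as follows. This is the standard result for a monotonic submodular set function $F$ with $F(\emptyset) = 0$, written in our own notation ($\mathcal{S}_t$ for the greedy set after $t$ steps, $F^*$ for the optimum); it applies to the mutual information objective only in the regime where that objective behaves monotonically.

```latex
\begin{align*}
F^* &= \max_{|\mathcal{S}| \le k} F(\mathcal{S}), \\
F^* - F(\mathcal{S}_{t+1}) &\le \Bigl(1 - \tfrac{1}{k}\Bigr)\bigl(F^* - F(\mathcal{S}_t)\bigr)
  \quad \text{(each greedy step closes at least a $1/k$ fraction of the gap)}, \\
F^* - F(\mathcal{S}_k) &\le \Bigl(1 - \tfrac{1}{k}\Bigr)^{k} F^* \le \tfrac{1}{e}\, F^*
  \;\;\Longrightarrow\;\;
  F(\mathcal{S}_k) \ge \Bigl(1 - \tfrac{1}{e}\Bigr) F^* .
\end{align*}
```

The middle inequality is where both properties are used: monotonicity bounds $F^*$ by the value of the greedy set joined with an optimal set, and submodularity bounds that increase by $k$ times the best single-element gain.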

5 Experiments

In the previous sections, we have presented two key components: estimation of $K$ and MMI based selection of checkpoints. In this section, we experiment with the two components combined. First, we apply algorithm 1 to the $K$ estimated in section 3.3, and show its effectiveness on multiple linguistic tasks. The baselines we compare against are random selection of checkpoints and a single commonly adopted checkpoint, e.g., bert-base-uncased. Then we extend to image classification tasks. Again we observe consistent improvements over random picks and another straightforward alternative.

5.1 Linguistic Tasks Using Huggingface Checkpoints

We run algorithm 1 on the $K$ matrix estimated in section 3.3. As $k$ increases from 1 to 5, the set $\mathcal{S}$ incrementally includes roberta-base, distilbert-base-uncased, t5-base, bert-base-cased and bart-large. These checkpoints solve multiple tasks: masked language modeling, text-to-text generation, next sentence prediction, and text denoising. Referring to the visualization of the $K$ matrix (left panel of fig. 1), we find that the selected checkpoints all belong to the big cluster in the lower right block. This matches our intuition, as the checkpoints outside this cluster are very isolated.

Then, we apply the selected checkpoints to a suite of probing tasks collected in Liu et al. (2019). To be self-contained, we briefly describe the tasks and datasets.
(a) Chunking examines spans and boundaries for each word. We use CoNLL2000 shared task dataset Sang and Buchholz (2000).
(b) NER predicts the entity type for each word. We use CoNLL2003 shared task dataset Sang and Meulder (2003).
(c) POS Tagging (POS) checks if word representations can capture basic syntax. We use PTB dataset Marcus et al. (1993).
(d) Semantic Tagging (ST) is interested in identifying the semantic role in context. The original source of dataset can be found in Bjerva et al. (2016).
(e) Event Factuality (EF) aims at determining if an event is factual. We use ItHappened-v2 dataset Rudinger et al. (2018). Events are given scores between -3 (non-factual) and 3 (factual). For instance, in “I decided not to go.”, the word “decided” has a score of 3 whereas “go” has -3. We use Pearson correlation to compare predicted scores against the ground truth.
(f-h) Syntactic constituency ancestor tagging: A set of tasks that aim at predicting the constituent label for (f) parent; (g) Grand parent (Gparent); and (h) Great Grand parent (GGparent) of a word in the phrase structure tree. The structure is derived from PTB.

Following Liu et al. (2019), we train a softmax on top of the combined word representations ($f_{\mathcal{S}}$) for each task. The gradients are not back-propagated through the checkpoints. There are two reasons why we choose to do so. First, it allows an easier comparison with Liu et al. (2019). Second, there are cases where checkpoints are published as black boxes, and back-propagation is not allowed. One prominent example is GPT-3 Brown et al. (2020). Although we do not have access to GPT-3 in this paper, we think that our experiment design should respect this fact. Another design choice is that $f_i$ is taken to be the feature at the top layer of the checkpoint. We do not use features at intermediate layers, again to respect the black-box setting.
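The probing setup described above (frozen feature extractors, concatenated features, and a softmax head trained without back-propagating into the checkpoints) can be illustrated with a small NumPy sketch. The extractors here are hypothetical stand-in functions, the data is synthetic, and the plain gradient-descent training loop with its learning rate and step count is an assumption of this sketch, not the authors' training recipe.

```python
import numpy as np

def concat_features(extractors, X):
    """Stack the outputs of frozen feature extractors along the feature axis."""
    return np.concatenate([f(X) for f in extractors], axis=1)

def train_softmax_head(Z, y, num_classes, lr=0.5, steps=300):
    """Fit a softmax classifier on fixed features Z by full-batch gradient descent."""
    n, d = Z.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = Z @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        grad = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * Z.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(Z, W, b):
    return np.argmax(Z @ W + b, axis=1)

# Synthetic stand-in for a new task with two classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
y = (X[:, 0] > 0).astype(int)

# Hypothetical frozen extractors standing in for the selected checkpoints.
extractors = [lambda X: X[:, :2], lambda X: np.tanh(X[:, 2:])]
Z = concat_features(extractors, X)  # features are computed once and never updated
W, b = train_softmax_head(Z, y, num_classes=2)
train_acc = float(np.mean(predict(Z, W, b) == y))
```

Only `W` and `b` are updated, which mirrors the constraint that checkpoints may be available as black boxes that expose features but not gradients.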

(a) Chunking
(b) NER
(c) POS Tagging
(d) Semantic tagging
(e) Event factuality
(f) Tagging parent in phrase-structure tree
(g) Tagging grandparent in phrase-structure tree
(h) Tagging great grandparent in phrase-structure tree
Figure 3: Performance on a suite of probing tasks: MMI achieves near optimal performance, compared with random search (34 trials at $k = 1$ and 20 for $k > 1$). In most of these tasks, the top layer of bert-base-uncased and bert-base-cased are inferior to the first pick by MMI. Surprisingly, in many cases, the top layer of bert-base-cased is worse than half of the 34 checkpoints.
                                                chunking  NER    POS    ST     EF     parent  Gparent  GGparent
MMI selected                                    93.88     88.64  97.57  94.35  78.35  96.64   85.85    73.32
bert-large-cased, best layer Liu et al. (2019)  93.64     84.44  96.73  93.83  76.25  96.10   82.46    67.90
Table 2: Compare MMI’s selection (Algorithm 1, ) against a strong baseline Liu et al. (2019), which uses the best layer of bert-large-cased for each task. Note that the best layer changes from task to task. It has to be found by brute-force training and evaluation for all 24 layers.

The baseline to compare against is random selection. Specifically, let $\mathcal{R}$ be a set of randomly picked checkpoints, where $|\mathcal{R}| = k$. The task-specific head is trained on top of $f_{\mathcal{R}}$. For $k = 1$, $\mathcal{R}$ ranges over each single checkpoint in the total 34. For $k > 1$, we run 20 random picks using different random seeds. Then, the metrics from these random runs can be visualized by violin plots (fig. 3). In each violin plot, the bars on the top and middle indicate best and median performances respectively. We observe an increasing trend in the metrics as $k$ increases from 1 to 5. Importantly, MMI achieves near optimal performance compared with random search, especially when

. Note also that random selection results in large variance in performance. Another interesting baseline is a single checkpoint that is commonly adopted. We plot bert-base-uncased and bert-large-cased (at $k = 1$). Surprisingly, for most tasks, bert-large-cased is inferior to half of the 34 checkpoints. This echoes some discoveries made in Liu et al. (2019). That is, higher layers of the bert model are less transferable to new tasks. Instead, stronger transferability is found in middle-layer features.

Following the above discussion, we collect metrics reported in Liu et al. (2019) for the best layer of bert-large-cased. Since layers capture different levels of information Vig and Belinkov (2019), the best layer is often task dependent. It has to be found by brute-force training and evaluation for each task. Tab. 2 lists these metrics. Also reported are metrics achieved by at . We observe that MMI outperforms the best layer of bert-large-cased for most of the tasks, and often wins by a notable margin. This is remarkable, as the checkpoints are selected before seeing any new task, and fixed throughout all tasks.

5.2 Image Classification Tasks on Cifar100

We are motivated by applications in computational linguistics. However, we find that the proposed framework extends to computer vision tasks as well. In this section, we simulate an example using cifar100 dataset.

We create 100 “seen” tasks. Each task is a 30-way classification problem whose 30 classes are randomly sampled from the total 100 classes. Each task has 480 (=500-20) training samples per class; the remaining 20 training samples per class are held out for reasons that will become clear shortly. A resnet-50 is trained for each of these “seen” tasks and stored as a checkpoint. Then, we create 20 “unseen” tasks in the same way, but each of them has only 10 training samples (part of the aforementioned holdouts) per class. This small training set is meant to expose the transferability of the selected checkpoints. Finally, the remaining 10 holdouts per class are used as probing data to estimate the covariance. We augment this small set of probing images by applying rotations of , and degrees.
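The data partition described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: samples of each class are identified by hypothetical indices 0 to 499, the 20 holdouts per class are split once and shared across tasks, and the count of 100 seen tasks is taken from the caption of fig. 4. The names `split_class` and `sample_task` are invented for this example.

```python
import random

NUM_CLASSES = 100   # classes in cifar100
PER_CLASS = 500     # training images per class
HOLDOUT = 20        # images held out per class
WAYS = 30           # classes per task


def split_class(rng):
    """Partition one class's sample indices into seen-train, few-shot, and probing parts."""
    idx = list(range(PER_CLASS))
    rng.shuffle(idx)
    holdout, train = idx[:HOLDOUT], idx[HOLDOUT:]
    return {
        "seen_train": train,           # 480 samples for training seen-task checkpoints
        "unseen_train": holdout[:10],  # 10 samples for training heads on unseen tasks
        "probe": holdout[10:],         # 10 samples used as unlabeled probing data
    }


def sample_task(rng):
    """Draw the 30 class ids that define one task."""
    return sorted(rng.sample(range(NUM_CLASSES), WAYS))


rng = random.Random(0)
splits = {c: split_class(rng) for c in range(NUM_CLASSES)}
seen_tasks = [sample_task(rng) for _ in range(100)]
unseen_tasks = [sample_task(rng) for _ in range(20)]
```

Keeping the probing indices disjoint from both training parts ensures that the probing data is never used to fit a checkpoint or a head.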

Figure 4: Covariance for the 100 seen tasks
Figure 5: Accuracy averaged over the 20 unseen tasks
Figure 6: Average (over 20 new tasks) improvements over random picks

Figure 4 shows the estimated covariance for the 100 seen tasks. Compared with the one in fig. 1, it is less structured. That is, tasks tend to be equally distant from each other. This probably suggests that the space of vision tasks fundamentally differs from that of linguistic tasks. We then apply algorithm 1 to the estimated covariance to select up to checkpoints. For each unseen task, a classification head is learned using its training data. The performance on this task is measured by the accuracy on the standard validation data (excluding classes not handled in this task). As before, a baseline is random selection. In addition, suppose we could peek at the first unseen task. We may identify the best-performing checkpoints for this task by running training and evaluation, and then keep using these checkpoints for all new tasks. This gives us another method to compare against, dubbed “peek”.
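To make the selection step concrete, the snippet below sketches greedy maximization of mutual information on a task covariance, following the greedy rule of Krause et al. (2008): at each step, pick the task whose variance conditioned on the already selected tasks is largest relative to its variance conditioned on all remaining tasks. This is an illustrative sketch and may differ in details from algorithm 1. The function names, the `jitter` regularizer, and the toy block-structured covariance are assumptions introduced for this example.

```python
import numpy as np


def conditional_variance(cov, y, given):
    """Variance of task y given the tasks in `given`, under a zero-mean Gaussian process."""
    if len(given) == 0:
        return cov[y, y]
    g = list(given)
    cross = cov[y, g]
    return cov[y, y] - cross @ np.linalg.solve(cov[np.ix_(g, g)], cross)


def greedy_mmi(cov, k, jitter=1e-6):
    """Greedily select k task indices by the mutual-information ratio criterion."""
    n = cov.shape[0]
    cov = cov + jitter * np.eye(n)  # assumed regularizer for numerical stability
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for y in range(n):
            if y in selected:
                continue
            # Tasks that are neither selected nor the current candidate.
            rest = [j for j in range(n) if j != y and j not in selected]
            score = conditional_variance(cov, y, selected) / conditional_variance(cov, y, rest)
            if score > best_score:
                best, best_score = y, score
        selected.append(best)
    return selected


# Toy covariance (made up): two independent groups of three highly correlated tasks.
block = 0.9 * np.ones((3, 3)) + 0.1 * np.eye(3)
toy_cov = np.kron(np.eye(2), block)
picked = greedy_mmi(toy_cov, 2)
```

On this toy covariance, the two selected tasks fall in different groups, which is the "covering" behavior the criterion is meant to encourage. The double loop recomputes a linear solve for every candidate, so it is only meant for small task sets.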

In fig. 5, we plot the average validation accuracy over the unseen tasks against the number of selected checkpoints. The shaded area covers the accuracies of runs using different random seeds. Overall, as expected, the accuracies increase and saturate as more checkpoints are selected. For a better view, we plot the gains of MMI and “peek” over random, that is, the MMI (and “peek”) curve minus the average curve of the shaded area in fig. 5. The gains are shown in fig. 6. MMI steadily outperforms the random baseline, whereas “peek” fluctuates and can be inferior to random at multiple set sizes. This may be caused by “overfitting” to the first task that it peeked at, which prevents generalization to other tasks. By analogy with sensor placement, this amounts to optimizing the placement for one location, which could harm the accuracy at other locations. In contrast, MMI does not exhibit this issue.
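The gain curves are a simple difference between a method's accuracy curve and the mean of the random runs. A minimal sketch, assuming the accuracies are stored as arrays indexed by the number of selected checkpoints, is given below. The function name `gain_over_random` and the numbers are placeholders for illustration and are not results from the experiments.

```python
import numpy as np


def gain_over_random(method_acc, random_acc):
    """Subtract the mean random-selection curve from a method's accuracy curve.

    method_acc: shape (num_sizes,), accuracy averaged over tasks for each set size.
    random_acc: shape (num_runs, num_sizes), one row per random seed.
    """
    method_acc = np.asarray(method_acc, dtype=float)
    random_mean = np.asarray(random_acc, dtype=float).mean(axis=0)
    return method_acc - random_mean


# Placeholder values only, chosen to make the arithmetic easy to follow.
toy_method = [0.50, 0.60]
toy_random = [[0.40, 0.50],
              [0.60, 0.50]]
toy_gain = gain_over_random(toy_method, toy_random)
```

A positive entry means the method beats the average random pick at that set size, and a negative entry means it does worse.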

6 Conclusion

In this paper, we motivate a problem that addresses two recent facts: 1) the huge number of published checkpoints; 2) the emergence of new tasks. By drawing an analogy with optimal sensor placement, we propose a framework that selects checkpoints without seeing any new task. The key steps are: 1) estimating task relevance under proprietary constraints (e.g., data, task-specific heads); 2) selecting checkpoints by maximizing mutual information, assuming a Gaussian process on the tasks. Effectiveness is validated on tasks from computational linguistics as well as computer vision. However, it should be noted that a few assumptions (section 3.2) are made for step 1). We partially validated them on text data (sections 3.3 and 3.4). A more comprehensive validation on other data (e.g., images) is left for future work. Aside from that, it would also be interesting to explore more fine-grained models of the task space (e.g., hierarchical) and to experiment with more tasks.


The authors thank Nelson Liu for his help with the data for the linguistic experiments. The first author thanks Yifan Wei for her company and support during the pandemic lockdown.


  • A. Achille, M. Lam, R. Tewari, A. Ravichandran, S. Maji, C. C. Fowlkes, S. Soatto, and P. Perona (2019) Task2vec: task embedding for meta-learning. In IEEE/CVF International Conference on Computer Vision, Cited by: §1, §3.1, Table 1.
  • D. Alvarez-Melis and N. Fusi (2020) Geometric dataset distances via optimal transport. In Conference on Neural Information Processing Systems, Cited by: §3.1, Table 1.
  • R. Amit and R. Meir (2018) Meta-learning by adjusting priors based on extended pac-bayes theory. In International Conference on Machine Learning, Cited by: §2, §3.
  • N. Bansal, X. Chen, and Z. Wang (2018) Can we gain more from orthogonality regularizations in training deep cnns?. In 32nd Conference on Neural Information Processing Systems, Cited by: §3.2.
  • J. Baxter (2000) A model of inductive bias learning. Journal of Artificial Intelligence Research. Cited by: §1, §2, §3.
  • I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §3.3.
  • J. Bjerva, B. Plank, and J. Bos (2016) Semantic tagging with deep residual networks. In COLING, Cited by: §5.1.
  • T. B. Brown, B. Mann, N. Ryder, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §5.1.
  • C. Cortes, M. Mohri, and A. Rostamizadeh (2012) Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research. Cited by: §3.2.
  • N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola (2001) On kernel-target alignment. In Advances in Neural Information Processing Systems, Cited by: §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.3.
  • S. Du, W. Hu, S. M. Kakade, J. D. Lee, and Q. Lei (2021) Few-shot learning via learning the representation, provably. In International Conference on Learning Representations, Cited by: §1, §1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, Cited by: §1.
  • G. Jerfel, E. Grant, T. L. Griffiths, and K. Heller (2019) Reconciling meta-learning and continual learning with online mixtures of tasks. In Neural Information Processing Systems, Cited by: §1.
  • S. Kornblith, M. Norouzi, H. Lee, and G. Hinton (2019) Similarity of neural network representations revisited. In International Conference on Machine Learning, Cited by: §3.2.
  • A. Krause, A. Singh, and C. Guestrin (2008) Near-optimal sensor placements in gaussian processes: theory, efficient algorithms and empirical studies. Journal of Machine Learning Research 9, pp. 235–284. Cited by: §1, §2, §4.1.
  • Y. Li, X. Jia, R. Sang, Y. Zhu, B. Green, L. Wang, and B. Gong (2020) Ranking neural checkpoints. arXiv preprint arXiv:2011.11200. Cited by: §1.
  • N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Cited by: §3.3, §3.3, §5.1, §5.1, §5.1, §5.1, Table 2.
  • M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini (1993) Building a large annotated corpus of english: the penn treebank. Computational linguistics 19 (2), pp. 313–330. Cited by: §5.1.
  • G. Nemhauser, L. Wolsey, and M. Fisher (1978) An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, pp. 265–294. Cited by: §4.1.
  • C. V. Nguyen, T. Hassner, M. Seeger, and C. Archambeau (2020) LEEP: a new measure to evaluate transferability of learned representations. In International Conference on Machine Learning, Cited by: §1, §3.1, §3.1, §3.2, Table 1.
  • D. Opitz and R. Maclin (1999) Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research 11, pp. 169–198. Cited by: §1.
  • H. Peng, F. Long, and C. Ding (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence 27 (8), pp. 1226–123. Cited by: §1, §2.
  • R. Rudinger, A. S. White, and B. V. Durme (2018) Neural models of factuality. In NAACL, Cited by: §5.1.
  • E. F. Sang and S. Buchholz (2000) Introduction to the conll-2000 shared task: chunking. In CoNLL, Cited by: §5.1.
  • E. F. Sang and F. D. Meulder (2003) Introduction to the conll-2003 shared task: language-independent named entity recognition. In CoNLL, Cited by: §5.1.
  • J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. In International Conference on Neural Information Processing Systems, Cited by: §1.
  • J. Song, Y. Chen, X. Wang, C. Shen, and M. Song (2020) Deep model transferability from attribution maps. In Advances in Neural Information Processing Systems, Cited by: §1, §3.1, Table 1.
  • A. T. Tran, C. V. Nguyen, and T. Hassner (2019) Transferability and hardness of supervised classification tasks. In International Conference on Computer Vision, Cited by: §3.1, §3.2.
  • N. Tripuraneni, M. I. Jordan, and C. Jin (2020) On the theory of transfer learning: the importance of task diversity. In Advances in Neural Information Processing Systems, Cited by: §1, §1, §3.
  • J. Vig and Y. Belinkov (2019) Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284. Cited by: §5.1.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, Cited by: §1.
  • E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal (2017) On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning, Cited by: §3.2.
  • T. Vu, T. Wang, T. Munkhdalai, A. Sordoni, A. Trischler, A. Mattarella-Micke, S. Maji, and M. Iyyer (2020) Exploring and predicting transferability across nlp tasks. In 2020 Conference on Empirical Methods in Natural Language Processing, Cited by: §1.
  • Z. Wang, Z. Dai, B. Póczos, and J. Carbonell (2019) Characterizing and avoiding negative transfer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In IEEE conference on computer vision and pattern recognition, Cited by: §1, §2, §3.1, Table 1.
  • S. Zhang, M. Liu, and J. Yan (2020) The diversified ensemble neural network. In 34th Conference on Neural Information Processing Systems, Cited by: §1.