
An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems

by   Andrea Gesmundo, et al.

Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state of the art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Also, continual learning, which adds the temporal aspect to multitask learning, is often focused on the study of common pitfalls such as catastrophic forgetting instead of being studied at a large scale as a critical component for building the next generation of artificial intelligence. We propose an evolutionary method that can generate a large scale multitask model, and can support the dynamic and continuous addition of new tasks. The generated multitask model is sparsely activated and integrates a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We empirically show that the proposed method can jointly solve and achieve competitive results on 69 image classification tasks, for example achieving the best test accuracy reported for a model trained only on public data for competitive tasks such as cifar10: 99.43



1 Introduction

The success of machine learning continues to grow as it finds new applications in areas as diverse as language generation (Brown et al., 2020), visual art generation (Ramesh et al., 2021), chip design (Mirhoseini et al., 2020), protein folding (Senior et al., 2020) and competitive sports (Silver et al., 2016; Vinyals et al., 2019). The vast majority of machine learning models are designed and trained for a single task and specific data modality, and are often trained by starting with randomly initialized parameters, or with limited knowledge transfer from a pre-trained model. While this paradigm has shown great success, it uses a large amount of computational resources, and does not leverage knowledge transfer from many related tasks in order to achieve higher performance and efficiency.

The work presented in this paper is based on the intuition that significant advances can be enabled by dynamic, continual learning approaches capable of achieving knowledge transfer across a very large number of tasks. The method described in this paper can dynamically incorporate new tasks into a large running system, can leverage pieces of a sparse multitask ML model to achieve improved quality for new tasks, and can automatically share pieces of the model among related tasks. This method can enhance quality on each task, and also improve efficiency in terms of convergence time, amount of training examples, energy consumption and human engineering effort.

The ML problem framing proposed by this paper can be interpreted as a generalization and synthesis of the standard multitask and continual learning formalization, since an arbitrarily large set of tasks can be solved jointly. But also, over time, the set of tasks can be extended with a continuous stream of new tasks. Furthermore, it lifts the distinction between a pretraining task and a downstream task. As new tasks are incorporated, the system searches for how to combine the knowledge and representations already present in the system with new model capacity in order to achieve high quality for each new task. Knowledge acquired and representations learned while solving a new task are available for use by any future task or continued learning for existing tasks.

We refer to the proposed method as “mutant multitask network” or µ2Net. This method generates a large scale multitask network that jointly solves multiple tasks to achieve increased quality and efficiency for each. It can continuously expand the model by allowing the dynamic addition of new tasks. The more accumulated knowledge is embedded into the system via learning on previous tasks, the higher the quality of the solutions for subsequent tasks. Furthermore, new tasks can be solved with increasing efficiency in terms of reducing the newly-added parameters per task. The generated multitask model is sparsely activated as it integrates a task-based routing mechanism that guarantees bounded compute cost per task as the model expands. The knowledge learned from each task is compartmentalized in components that can be reused by multiple tasks. As demonstrated through experiments, this compartmentalization technique avoids the common problems of multitask and continual learning models, such as catastrophic forgetting, gradient interference and negative transfer. The exploration of the space of task routes and the identification of the subset of prior knowledge most relevant for each task is guided by an evolutionary algorithm designed to dynamically adjust the exploration/exploitation balance without the need for manual tuning of meta-parameters. The same evolutionary logic is employed to dynamically tune the hyperparameters of the multitask model components.

2 Related work

The proposed method combines intuitions from different lines of research. The growth in capabilities of state of the art models often requires growth in terms of trainable parameters (Kaplan et al., 2020). Sparse activation techniques at sub-layer level (Shazeer et al., 2017; Du et al., 2021) or network route level (Fernando et al., 2017) allow decoupling model size growth from compute cost. This is achieved by integrating a routing technique that selects the appropriate subset of parameters storing the most relevant knowledge for each task, sample or token/patch. Aspects of the proposed method can be interpreted as neural architecture search (NAS) (Zoph and Le, 2017), for which evolutionary approaches have been applied with success (Real et al., 2019; Maziarz et al., 2018), and efficient parameter sharing NAS techniques (Pham et al., 2018; Liu et al., 2019a; Kokiopoulou et al., 2019) create a connection to routing methods (Fernando et al., 2017; Maziarz et al., 2019).

Cross-task knowledge transfer has gained popularity, especially through transfer learning from a model pre-trained on a large amount of data for one or a few general tasks, and then fine-tuned on a small amount of data for a related downstream task. This approach has been shown to be very effective in a wide variety of problems across many modalities, including language (Devlin et al., 2019; Raffel et al., 2020) and vision (Dosovitskiy et al., 2021; He et al., 2016). More complex forms of knowledge transfer such as multitask training or continual learning often lead to interesting problems such as catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999), negative transfer (Rosenstein, 2005; Wang et al., 2019) or gradient interference (Chen et al., 2018; Yu et al., 2020). Research on these topics mostly focuses on analyzing and proposing solutions, such as weighted combination methods (Liu et al., 2019b; Sun et al., 2020) or gradient transformations (Sener and Koltun, 2018; Kendall et al., 2018). These approaches are interesting but are not usually components of systems that achieve state of the art across many problems. The method proposed in this paper can be considered a large-scale, state-of-the-art focused progression from Gesmundo and Dean (2022).

Figure 1: Graphical representation of the two types of mutations used for the large scale continual learning experiment reported in Section 6: layer cloning mutation (left) and hyperparameter change (center). The graph on the right represents the model generated by the preliminary experiment described in Section 5. The bottom nodes display the task names, the top nodes display the validation accuracy, and internal nodes are represented with the color of the task that has last updated the parameters of the corresponding layer (video:

3 Evolutionary Method

This section introduces the evolutionary method to generate and expand a large scale multitask model.

3.1 Initialization and iteration

The multitask system and evolutionary process are initialized with one root model. This model can be either pretrained or randomly initialized. During the evolutionary process, the proposed method searches for the best model for a single task at a time, referred to as the active task. During the active phase of a task, a population of models trained on the active task is evolved: the active population. The first time a task becomes active, the active population for that task starts empty. For subsequent iterations, the active population is initialized with all the models trained on the active task that have been retained from previous iterations. The active population is iteratively extended by: 1) sampling a parent model (Section 3.2), 2) applying to the parent model a sampled set of mutations (Section 3.3) to produce a child model, 3) performing cycles of training and validation in order to train and score the child model. Each trained model is assigned a score that can be a function of factors such as the validation quality. Early population pruning is performed by discarding the models that did not achieve a better score than their parent. An active phase is composed of multiple generations in which multiple batches of child models are sampled and trained in parallel. At the end of a task active phase, only its best scoring model is retained as part of the multitask system. A task can become active multiple times. Details of the method are reported below (and in Algorithm 1).
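As an illustration only, the active phase described above could be sketched as follows. This is a toy under stated assumptions, not the implementation used for the experiments: a model is reduced to a single number, `mutate` and `train_and_score` are hypothetical callables, a model that has never been scored on the active task is assumed to have score minus infinity, and the parent is chosen greedily instead of with the sampling procedure of Section 3.2.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    params: float                     # toy stand-in for real model parameters
    score: float = float("-inf")      # assumption: unscored models never beat a child
    parent: Optional["Model"] = None


def active_phase(root, population, mutate, train_and_score,
                 generations=4, children_per_generation=8, rng=None):
    """Extend `population` for one active task and return its best model."""
    rng = rng or random.Random(0)
    for _ in range(generations):
        survivors = []
        # All children of a generation are sampled from the same population
        # snapshot, mimicking a batch trained in parallel.
        for _ in range(children_per_generation):
            # Simplification: greedy parent choice instead of Section 3.2 sampling.
            parent = max(population, key=lambda m: m.score) if population else root
            child = Model(params=mutate(parent.params, rng), parent=parent)
            child.score = train_and_score(child.params)
            # Early pruning: keep only children that improve on their parent.
            if child.score > parent.score:
                survivors.append(child)
        population.extend(survivors)
    return max(population, key=lambda m: m.score) if population else root


# Toy usage: the "task" rewards parameters close to 3.0.
population = []
best = active_phase(
    Model(params=0.0),
    population,
    mutate=lambda p, rng: p + rng.uniform(-1.0, 1.0),
    train_and_score=lambda p: -abs(p - 3.0),
)
```

Because pruned children are never added to the population, every retained model strictly improves on its parent, which is the property the early pruning step is meant to provide.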

3.2 Parent model sampling

The first attempt to sample a parent model for the active task is done over the active population of models for that task. The models in the active population are visited in decreasing order of score, starting with the highest scoring. Each model, $m$, can be accepted as parent with probability:

$$p_{\text{parent}}(m \mid t) = 0.5^{\,\#\text{selections}(m,\, t)}$$

where $\#\text{selections}(m, t)$ denotes the number of times the candidate model, $m$, has been previously selected as parent to generate a child model for task $t$. If the current candidate parent is not selected, then iteratively the model with the next best score is considered to be selected as parent with probability $p_{\text{parent}}(\cdot \mid t)$. If a full iteration on the active population is completed without a successful parent model selection, then the same method is applied to the randomly sorted list of all remaining models: all the models currently part of the multitask system that were trained on a task different from the current active task, $t$. This fallback list is randomly sorted since these models have not been scored for $t$. As a final fallback, a parent is uniformly sampled among all the models currently in the system. This method prioritizes the exploitation of high scoring models that had few attempts at generating an improved model for the active task. In combination with early pruning, it also automatically transitions toward a more exploratory behavior in case the higher scoring models are unable to generate an improvement.
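The sampling order above can be sketched in a few lines. This is a minimal illustration, assuming an acceptance probability of `base ** selections` where the value `base=0.5` is an assumption of this sketch; the function name, the model identifiers and the `selections` dictionary (counts for the active task only) are hypothetical.

```python
import random


def sample_parent(active_population, other_models, selections, rng, base=0.5):
    """Pick a parent for the active task.

    active_population: list of (model_id, score) pairs scored on the active task.
    other_models: ids of models trained on other tasks (unscored for this task).
    selections: dict model_id -> times already selected as parent for this task.
    """
    # 1) Active population, visited from the highest to the lowest score.
    ranked = [m for m, _ in sorted(active_population, key=lambda ms: ms[1], reverse=True)]
    # 2) Fallback list in random order, since these models have no score here.
    fallback = list(other_models)
    rng.shuffle(fallback)
    candidates = ranked + fallback
    for model_id in candidates:
        # Acceptance probability shrinks each time the model was already used.
        if rng.random() < base ** selections.get(model_id, 0):
            break
    else:
        # 3) Final fallback: uniform choice among all models in the system.
        model_id = rng.choice(candidates)
    selections[model_id] = selections.get(model_id, 0) + 1
    return model_id
```

A model that has never been selected is accepted with probability 1, so a fresh high scoring model is always tried first; repeated failures to improve push the search toward lower ranked and then unscored models.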

3.3 Mutations

The transformation of a parent model into a child model is done by sampling a set of mutations from the space of possible mutations. Deep neural networks are commonly defined by their architecture and hyperparameters. Architectures considered in this work are composed of a sequence of neural network layers, each mapping an input vector into an output vector of variable dimensions. Hyperparameters specify the details of the network instantiation such as the optimizer or data preprocessing configurations. The presented method allows for two types of mutations (Figure 1): layer cloning mutations and hyperparameter mutations.


Layer cloning mutations create a copy of any parent model layer that can be trained by the child model. If a layer of the parent model is not selected for cloning, then it is shared with the child model in a frozen state to guarantee immutability of pre-existing models. Child models can train only the cloned copies of the parent layers. The cloned layers are trained with a possibly modified version of the parent optimizer. The configuration of the child optimizer is defined by the mutated hyperparameters. If the optimizer is of a type that stores a state (i.e. parameter statistics), then the state is also cloned from the state saved by the ancestor that has last trained the cloned layer.

A trainable layer may be followed by frozen layers. In this case the gradients for the trainable layer are propagated through the frozen layers and applied only to the parameters of the trainable layers, while frozen parameters are left unchanged. The head layer is always cloned since it always needs to be trainable. If a child model is trained on a task different from the parent’s task, then a new randomly initialized head layer is created with output shape matching the number of classes of the new task.

Hyperparameter mutations can be applied to modify the configuration inherited from the parent. The new value for each hyperparameter can be sampled from a set of valid values. For numerical hyperparameters, their set of valid values is sorted into a list, and sampling is limited to neighbouring values, to apply an incremental change constraint.

Each possible layer cloning or hyperparameter mutation is independently sampled with a mutation probability, $\mu$. $\mu$ is itself a hyperparameter that is mutated by the evolutionary process. This shows that automatic tuning is not only applied to selecting the hyperparameters of the generated models, but can also be applied to self-tune the configuration of the evolutionary algorithm.
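The two mutation types can be illustrated with a short sketch. It is written under assumptions that are not part of the paper: the mutation probability is passed as `mu`, layers are identified by hypothetical string names, the head layer is called `"head"`, and boolean hyperparameters are resampled from their full set of valid values while numerical ones move to a neighbouring value.

```python
import random


def neighbour_value(valid_values, current, rng):
    """Incremental change: move to a value adjacent to `current` in sorted order."""
    ordered = sorted(valid_values)
    i = ordered.index(current)
    neighbours = [ordered[j] for j in (i - 1, i + 1) if 0 <= j < len(ordered)]
    return rng.choice(neighbours)


def sample_mutations(layers, hparams, search_space, mu, rng, head="head"):
    """Return (layers to clone, child hyperparameters) for one child model."""
    # Each layer is cloned independently with probability mu; layers that are
    # not cloned stay shared and frozen. The head is always cloned.
    cloned = [name for name in layers if name == head or rng.random() < mu]
    child_hparams = dict(hparams)  # never modify the parent configuration
    for name, valid_values in search_space.items():
        if rng.random() >= mu:
            continue  # this hyperparameter is inherited unchanged
        if isinstance(hparams[name], bool):
            child_hparams[name] = rng.choice(valid_values)
        else:
            child_hparams[name] = neighbour_value(valid_values, hparams[name], rng)
    return cloned, child_hparams
```

Including the mutation probability itself as an entry of `search_space` lets the same routine tune the evolutionary algorithm's own configuration, as described above.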

3.4 Training and scoring

A newly sampled child model is trained on the active task for a given number of epochs. The model is evaluated on the validation set after each epoch. At each intermediate evaluation, the child model is assigned a score that the evolutionary algorithm aims to maximize. The score can be defined to optimize a mixture of factors such as validation quality, inference latency, training compute or model size, depending on the application's requirements. The presented experiments aim to compare against the state of the art for a large number of tasks without any size or compute constraint. Therefore, the validation accuracy is used directly as the score without additional factors. After training, only the parameters and optimizer state of the version of the child model achieving the best score are retained.
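A minimal sketch of this train, evaluate and retain-best cycle is given below. The callables `train_one_epoch` and `validate` are hypothetical placeholders, and `state` stands for whatever bundle of parameters and optimizer state a real system would checkpoint.

```python
import copy


def train_child(state, train_one_epoch, validate, num_epochs):
    """Train for `num_epochs`, scoring after each epoch, and keep the best snapshot."""
    best_score, best_state = float("-inf"), None
    for _ in range(num_epochs):
        state = train_one_epoch(state)
        score = validate(state)  # here: validation accuracy used directly as score
        if score > best_score:
            # Snapshot so that later epochs cannot alter the retained version.
            best_score, best_state = score, copy.deepcopy(state)
    return best_state, best_score
```

If other factors such as latency or model size mattered, only `validate` would need to change, for example by returning a weighted combination of accuracy and cost.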

3.5 Discussion and properties

Notice that none of the defined mutation actions or the evolutionary algorithm allow the creation of child models that can alter the parent model in any way. Once a model has been trained, the parameters storing its knowledge cannot be modified. This method guarantees immunity from catastrophic forgetting, since the knowledge of a trained model is always preserved. It also provides a solution to negative transfer, since it automates the selection of the knowledge that is most relevant for each new task. Furthermore, it avoids gradient interference, which can arise when multiple gradients are synchronously applied to the same set of parameters. Nonetheless, models for new tasks can use knowledge and representations from prior tasks and even extend these to improve or specialize them.

The method naturally compartmentalizes the knowledge of each task in a subset of components, allowing the implementation of different dataset privacy control policies. For example, we can introduce private tasks that can benefit from the public knowledge embedded in the multitask system but are able to withhold the knowledge and representations derived from their private dataset from being used for other tasks. This can be achieved by preventing other tasks from using or cloning the components trained on a private dataset. This also allows the knowledge extracted from the private dataset to be completely removed at any future date by simply removing its components. This private/public distinction can be generalized into an access-control-list mechanism. For example, a set of related private tasks could share representations, but no other tasks could use the representations shared amongst the set of related private tasks. Privacy control capabilities are empirically demonstrated in Section 5.
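One possible way to express such an access-control-list policy is sketched below. The `Component` record, the `acl` dictionary format and both function names are assumptions made for illustration; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    trained_on: str  # dataset whose knowledge this component embeds


def visible_components(components, task, acl):
    """Components that `task` may use or clone.

    acl maps a private dataset to the set of tasks allowed to access it;
    datasets missing from acl are treated as public.
    """
    return [c for c in components
            if c.trained_on not in acl or task in acl[c.trained_on]]


def forget_dataset(components, dataset):
    """Remove all knowledge extracted from `dataset` by dropping its components."""
    return [c for c in components if c.trained_on != dataset]


components = [
    Component("layer_a", "bangla"),
    Component("layer_b", "devanagari"),
    Component("layer_c", "telugu"),
]
acl = {"telugu": {"telugu"}}  # telugu is private: only its own task may use it
```

With this configuration the telugu task can build on all three components, while bangla and devanagari only see the two public ones. Removal is safe because no other task can depend on a component it was never allowed to use.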

4 Experimental set up

This section details the instantiation of the proposed method employed in the experimental analysis. The task type for the presented set of experiments is image classification. This choice allows us to define a large benchmark of publicly available datasets with standardized framing. It also allows us to build on top of state of the art models whose architecture definition and checkpoints are public.

The Vision Transformer (ViT) is used as root model (Dosovitskiy et al., 2021).

Architecture   Layer cloning mutations can create a copy of any of ViT’s layers: 1) Patch embedding: the first layer of the model maps the input image into a sequence of embedded tokens, each corresponding to a patch of the input image. 2) Class token: a classification token is prepended to the sequence. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks (Devlin et al., 2019). 3) Position embedding: the sequence representation is then augmented with an embedding that carries each patch's positional information. 4) Transformer layers: the sequence representation generated by the input layers is iteratively transformed by a stack of transformer layers (Vaswani et al., 2017). 5) Model head: a final fully connected layer mapping the representation produced by the top-most transformer layer into the logits.

Parameters   The parameters of the root model can be either randomly initialized or loaded from a checkpoint. The preliminary experiment shows the evolution from random initialization (see Section 5), while the large scale experiment starts from a pretrained large ViT model (see Section 6).

Hyperparameters   As default hyperparameters we use those resulting from the extensive study conducted by Steiner et al. (2021): SGD momentum optimizer, cosine decay schedule, no weight decay, gradient clipping at global norm 1 and 384×384 image resolution. The evolutionary method can change the hyperparameters of optimizer, image preprocessing and architecture (see Table 1).

Mutation probability $\mu$ [0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26, 0.28, 0.30]
Learning rate [0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
Cosine learning rate schedule warm up ratio [0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4]
Momentum [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.98, 0.99]
Nesterov update [False, True]
Crop input image [False, True]
Cropped area range min [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
Cropped aspect ratio range min [0.25, 0.5, 0.75, 1.0]
Flip left/right [False, True]
Brightness delta [0.0, 0.01, 0.02, 0.05, 0.1, 0.2]
Contrast delta [0.0, 0.01, 0.02, 0.05, 0.1, 0.2]
Saturation delta [0.0, 0.01, 0.02, 0.05, 0.1, 0.2]
Hue delta [0.0, 0.01, 0.02, 0.05, 0.1, 0.2]
Table 1: Hyperparameters valid values. Bold values are defaults. This search space consists of a parametrization of the configuration for the published ViT model definition library.

5 Preliminary experiment

This section describes a smaller scale preliminary experiment that introduces details of the method application and illustrates the privacy control and initialization capabilities. This experiment demonstrates the ability to generate a multitask model from random initialization and a minimal architecture rather than evolving a pretrained state-of-the-art model. Therefore, a randomly initialized ViT Ti/16 architecture (Steiner et al., 2021) stripped of transformer layers is used as root model. To allow the method to build a capable architecture, we add an extra mutation action that can insert a new randomly initialized transformer layer just before the head layer.

Furthermore, the dataset privacy control technique (see Section 3) is demonstrated by adding a private task. Three numerals recognition tasks are used as benchmark: {bangla, devanagari, telugu}. Telugu is introduced as a private task, so that no other task can access the knowledge introduced into the system by its dataset. However, its models can leverage the knowledge provided by the public tasks.

This short experiment is configured to perform 2 active task iterations for each task. During each active task iteration 4 model generations are produced. In each generation 8 child models are sampled and trained in parallel on each of the 8 TPUv3 cores. The choice of small datasets, architecture and training budget is intended to facilitate reproducibility and fast experimental iterations. The experiment can be reproduced by using the published artifacts and completes in less than 5 minutes.

Figure 1 (right) displays the resulting multitask model jointly solving the 3 tasks. We observe a high degree of cross-task knowledge and components sharing throughout the evolutionary process. Even though the root model has no transformer layers, multiple randomly initialized transformer layers are inserted and trained improving the score of each task. Note that, at any point during the evolution, the components trained on the private task (red) are only used by the private task.

6 Large scale continual learning experiment

This section reports a single large scale continual learning experiment producing a multitask system jointly solving 69 visual tasks. A pretrained ViT L/16 is used as root model, which has been selected for its pretraining validation accuracy on the imagenet-21k dataset following Steiner et al. (2021).

6.1 ViT benchmark

The first batch of tasks introduced to the system is a set of the 3 tasks on which ViT was initially evaluated in Dosovitskiy et al. (2021). This experiment is configured to perform 5 active iterations for each task, and 4 model generations for each iteration. During each generation, 8 child models are sampled and trained in parallel, one on each of the 8 TPUv3 cores. Each model training performs 4 validation cycles. To smooth the distribution of compute across tasks of different size, the number of train samples between validation cycles is set to $\min(1\ \text{epoch},\ 128{,}116\ \text{samples})$. 128,116 samples is equivalent to $1/10$ of an epoch of the imagenet2012 training set. Unless stated otherwise, the following experiments use the same configuration.

This setup results in 8 epochs for imagenet, and 80 epochs for cifar. This is roughly equivalent to the fine-tuning setup of the baseline model we compare against (Dosovitskiy et al., 2021): 8 epochs for imagenet and 102.4 for cifar. The proposed method can be considered cheaper since: 1) ViT fine-tuning has been repeated multiple times for the hyperparameter tuning process and 2) freezing the layers that are not cloned results in cheaper training steps, since parameter updates can be skipped for frozen layers and gradient propagation can be skipped for frozen layers at the base of the model (preliminary experiments have shown a 2.5-4% training speed-up attributable to this). In order to provide a fair comparison, the root model uses the same ViT L/16 architecture, the same checkpoint pretrained on the i21k dataset, the same 384×384 image resolution, and the same optimizer and preprocessing configuration.
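The epoch counts quoted above can be checked with a short calculation. The sketch assumes that the samples seen between validation cycles are capped at the smaller of one epoch of the task and one tenth of an imagenet2012 epoch (128,116 samples), and that the best lineage accumulates every cycle of every generation and iteration. The imagenet2012 training set size of 1,281,167 images is the standard figure; the 50,000-sample cifar training set is used only as an example size, and the function name is hypothetical.

```python
IMAGENET2012_TRAIN_SIZE = 1_281_167
SAMPLES_PER_CYCLE_CAP = 128_116  # assumed cap: about 1/10 of an imagenet2012 epoch


def epochs_trained(train_set_size, iterations=5, generations=4, cycles=4,
                   cap=SAMPLES_PER_CYCLE_CAP):
    """Epochs accumulated along one lineage of models for a task."""
    samples_per_cycle = min(train_set_size, cap)  # never more than one epoch
    total_cycles = iterations * generations * cycles  # 5 * 4 * 4 = 80
    return total_cycles * samples_per_cycle / train_set_size


imagenet_epochs = epochs_trained(IMAGENET2012_TRAIN_SIZE)  # close to 8
cifar_epochs = epochs_trained(50_000)                      # one epoch per cycle
```

Any task smaller than the cap trains for exactly one epoch per cycle, so all small tasks receive 80 epochs regardless of their size, while larger tasks receive proportionally fewer.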

Model imagenet2012 cifar100 cifar10
ViT L/16 fine-tuning (Dosovitskiy et al., 2021) 85.30 93.25 99.15
µ2Net after 5 task iterations 86.38 94.75 99.35
µ2Net after 10 task iterations 86.66 94.67 99.38
µ2Net cont. after adding VTAB-full tasks 86.74 94.67 99.41
µ2Net cont. after adding VDD tasks 86.74 94.74 99.43
Table 2: Test accuracy achieved by µ2Net and by fine-tuning a comparable pretrained ViT.

Table 2 reports the top-1 test accuracy results. µ2Net outperforms fine-tuning with comparable training steps per task. Extending the training with 5 additional task iterations leads to moderate gains on imagenet2012 and cifar10. Notice that for cifar100 the accuracy decreases. This can happen since the best models are selected according to the validation accuracy and, as the model gets close to convergence, a small validation gain may correspond to a noisy perturbation of the test accuracy.

To quantify knowledge transfer, we consider the model produced for each task and examine the dataset on which each layer’s ancestors were trained. On average, the layers composing the imagenet model have performed only 60.6% of the training steps on the imagenet dataset, and have received 31.5% of the gradient updates from cifar100 and 7.9% from cifar10. The layers comprising the cifar100 model have performed 42.3% of their training on imagenet and 20.6% on cifar10. The layers comprising the cifar10 model have performed 46.1% of their training on imagenet and 35.9% on cifar100. The heterogeneity of tasks and datasets improves the representations of the different layers, and results in generally higher performance, as shown in Table 2.
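The share of training that each model received on its own dataset is stated only for imagenet; for the two cifar models it can be derived from the figures above, assuming that the three datasets account for all of the training, as they do in the imagenet case (60.6 + 31.5 + 7.9 = 100). The helper below is a hypothetical illustration of that arithmetic.

```python
def own_task_share(shares_from_other_tasks):
    """Percentage of training done on the model's own dataset.

    Assumes the listed datasets plus the model's own dataset cover 100%.
    """
    return round(100.0 - sum(shares_from_other_tasks.values()), 1)


imagenet_own = own_task_share({"cifar100": 31.5, "cifar10": 7.9})
cifar100_own = own_task_share({"imagenet": 42.3, "cifar10": 20.6})
cifar10_own = own_task_share({"imagenet": 46.1, "cifar100": 35.9})
```

Under this assumption the cifar100 and cifar10 models would have received roughly 37% and 18% of their training on their own datasets, so most of the training behind both models comes from other tasks.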

The following Sections 6.2 and 6.3 describe the extensions of the system performed by introducing two additional benchmarks. After the introduction of each benchmark, we perform an additional iteration on the imagenet and cifar tasks to analyze knowledge sharing dynamics. We find evidence of knowledge sharing: for example, the VDD benchmark (Section 6.3) includes a low resolution version of cifar100. The model generated for vdd/cifar100 is a mutation of the current full resolution cifar100 model. Afterward, the additional active task iteration on cifar100 is performed, and the resulting improved cifar100 model is a mutation of the low resolution vdd/cifar100 model.

We note that 99.43 is the best cifar10 accuracy achieved by a model trained on public data: to the best of our knowledge, the state of the art is 99.40 from Touvron et al. (2021). Dosovitskiy et al. (2021) achieves 99.50 with a double size ViT-H trained on proprietary data. Our cifar100 accuracy is currently outperformed only by Ridnik et al. (2021) (95.10) and Foret et al. (2021) (96.08).

6.2 VTAB-full benchmark

Next, we introduce to the system the 19 VTAB-full tasks (Zhai et al., 2019b), plus 5 related tasks that are not included in the standard set (Table 8). From this experiment onward, the infrastructure is scaled from 8 to 32 cores, as detailed in Appendix A. The number of task iterations is reduced from 10 to 2. These changes lead to a roughly similar exploratory budget per task. However, the increased parallelism results in faster task iterations.



Steiner et al. (2021) 95.5 94.1 80.3 99.6 95.0 83.4 97.4 86.4 99.0 96.6 83.3 99.8 91.7 75.6 100 90.4 84.7 27.5 76.5
Zhai et al. (2019a) 94.6 84.8 75.9 94.7 91.5 70.2 97.0 85.9 98.8 94.9 79.5 99.8 92.5 76.5 100 96.5 82.3 100 98.4
µ2Net cont. 1st iter. 92.6 94.6 79.8 99.6 95.3 84.5 97.3 87.5 99.1 96.3 83.7 99.8 93.2 76.9 100 96.2 83.0 32.3 94.5
µ2Net cont. 2nd iter. 92.6 94.6 80.5 99.6 95.3 84.8 97.8 88.4 99.2 97.0 84.0 99.8 94.0 76.9 100 96.4 83.0 33.3 95.1
µ2Net cont. after VDD 93.0 94.7 81.0 99.6 95.3 84.8 97.8 91.1 99.1 97.0 84.0 99.8 94.0 76.9 100 96.4 82.3 33.3 95.1
Table 3: Test accuracy achieved on the VTAB-full benchmark by: 1) fine-tuning with matching architecture and checkpoint (ViT L/16 i21k) reported by Steiner et al. (2021), 2) the Sup-rotation method (Gidaris et al., 2018) that achieved the best result in the VTAB-full leaderboard (Zhai et al., 2019a), 3-4) µ2Net results after 2 task iterations, and 5) after an additional iteration performed after the VDD benchmark introduction. Underlined models transfer knowledge from VDD tasks.

Table 3 reports the achieved results along with reference models that use limited knowledge transfer capabilities. Steiner et al. (2021) reports the quality achieved by fine-tuning a model equivalent to our root model. This outperforms µ2Net on only 2 tasks, even though it has been trained multiple times to perform hyperparameter tuning. Zhai et al. (2019a) reports the results of the best model identified with a large scale study. This state of the art model outperforms µ2Net on 4 tasks. Again, increasing the number of task iterations and adding knowledge (VDD) to the system seem to yield better quality.









1st iter. 89.1±0.19 98.2±0.23 97.2±0.34 99.9±0.04 99.9±0.03 84.5±1.18 83.6±1.79 64.8±3.57 75.2±1.20 99.0±0.11
2nd iter. 89.2±0.12 98.4±0.23 97.2±0.40 99.9±0.05 99.9±0.04 84.3±0.27 85.7±1.36 65.4±3.03 76.0±0.49 99.2±0.11
Table 4: Test accuracy mean and standard deviation achieved on the VDD benchmark by 3 system replicas.














State of the art 99.77 95.88 95.40 99.91 99.00
µ2Net cont. 1st iteration 99.82 93.60 98.68 99.75 98.72 98.60 96.60 97.80
µ2Net cont. 2nd iteration 99.82 93.68 98.60 99.69 99.84 99.10 98.00 99.40
Table 5: Test accuracy achieved on the Multitask Character Classification Benchmark by µ2Net continued extension with 2 active task iterations and by the models that have set the state of the art using comparable data splits: Jeevan and Sethi (2022) for digits, Kabir et al. (2020) for letters, Ajayan and James (2021) for kmnist, An et al. (2020) for mnist, Hazra et al. (2021) for cmaterdb/bangla. Underlined models reuse knowledge introduced by other character classification tasks. Datasets are listed in decreasing size from the biggest emnist/digits (240k samples) to the smallest telugu (2.5k).

6.3 Visual Domain Decathlon (VDD) benchmark

The VDD benchmark (Bilen et al., 2017) is introduced next. The ML methodology proposed in this paper can achieve higher efficiency by focusing the available compute on the training of a single multitask system. However, the standard approach to measure variance relies on experiment repetitions. This section demonstrates how variance can be measured for any chosen segment of the training. In practice, 2 task iterations are performed to introduce the VDD tasks, starting as usual from the state achieved after the introduction of the last benchmark. However, this experiment is run on 3 parallel copies of the system, allowing us to compute the variance of the metrics for this set of task iterations.

The VDD benchmark is composed of 10 diverse tasks. This is also reflected in the diverse variance ranges measured (see Table 4). Variance is low for most of the tasks. However, for ucf101 and aircraft it is significantly higher. The metrics that have the highest correlation with standard deviation are error rate (linear proportionality in log scale) and number of training samples per class (inverse proportionality in log scale) (Figure 3). These can be considered metrics indicative of the complexity of the task. Furthermore, variance decreases with the second iteration: average standard deviation of 0.87 after 1 iteration and 0.61 after the second. These findings can support the intuitive hypothesis that tasks with higher complexity may benefit from more iterations to decrease variance and approach convergence. The next system extension continues from the state of one randomly selected replica.
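The two average standard deviations quoted above can be reproduced from Table 4, reading each entry of the table as a mean followed by its standard deviation. The snippet below is only a check of that arithmetic; the variable names are illustrative.

```python
# Standard deviations of the 10 VDD tasks, in the column order of Table 4.
std_first_iteration = [0.19, 0.23, 0.34, 0.04, 0.03, 1.18, 1.79, 3.57, 1.20, 0.11]
std_second_iteration = [0.12, 0.23, 0.40, 0.05, 0.04, 0.27, 1.36, 3.03, 0.49, 0.11]


def average(values):
    return sum(values) / len(values)


avg_first = average(std_first_iteration)    # about 0.87
avg_second = average(std_second_iteration)  # about 0.61
```

The largest single contribution in both iterations comes from the eighth column (3.57 and 3.03), which is why the averages stay well above the typical per-task deviation.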

6.4 Multitask Character Classification benchmark

We continue extending the system by adding a set of 8 character classification tasks, which offers the opportunity to study knowledge transfer across tasks with high domain correlation.

Table 5 reports the test accuracy achieved with 2 active task iterations. Tasks with more training data (left) appear to converge in the first iteration: this hypothesis is supported by the lack of significant accuracy gains in the second iteration. In contrast, tasks with less training data (right) show a significant gain from a second training iteration. Smaller tasks use transferred in-domain knowledge: the top bangla model reuses components that embed knowledge introduced by emnist/letters, while devanagari transfers from omniglot and bangla, and telugu transfers from bangla and devanagari. Furthermore, the achieved quality is comparable to the state of the art published for each task.





















Dosovitskiy et al. (2021) 90.8 84.1 74.1 99.3 92.7 61.0 80.9 82.5 95.6 85.2 75.3 70.3 56.1 41.9 74.7 64.9 79.9 30.5 41.7
Zhai et al. (2019a) 91.7 53.7 69.5 90.8 88.1 32.8 88.5 83.4 96.0 82.0 71.1 47.3 57.2 36.6 88.3 52.1 77.1 51.6 33.7
µ2Net cont. 1st iter. 87.1 89.4 77.6 99.2 94.5 57.6 97.5 86.0 98.6 93.4 78.0 91.2 59.9 47.6 58.4 96.2 81.9 32.1 92.5
µ2Net cont. 2nd iter. 89.9 90.6 78.1 99.7 94.5 57.6 97.5 86.0 98.3 93.4 83.5 99.8 90.6 76.3 100 96.2 81.7 33.7 92.5
Table 6: Test accuracy achieved on the VTAB-1k benchmark by: 1) fine-tuning ViT L/16 i21k (matching root model) (Dosovitskiy et al., 2021), 2) the Sup-rotation method (Gidaris et al., 2018) that achieved the best result in the VTAB-1k leaderboard (Zhai et al., 2019a). Underlined models have at least one ancestor trained on the corresponding full form task. Doubly underlined models inherit directly from the current best model for the matching full form task.

6.5 VTAB-1k benchmark

The system is further extended by adding the 1k-samples version of the VTAB tasks. Since the system already contains the knowledge learned from the full version of each task, this set allows us to study how effective the proposed method is at retrieving knowledge that is already embedded in the system.

Table 6 reports results along with reference models that use limited knowledge transfer capabilities. During the first iteration, the models generated for the short form tasks can retrieve the knowledge of the corresponding full form task even without directly mutating its model, by instead mutating a model that has at least one ancestor trained on the full task. For example, the model generated for flowers102 mutates the dtd model, which has 28 ancestors, of which only the 21st was trained on oxford_flowers102. However, that is enough for flowers102 to achieve 99.2% test accuracy.
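The ancestry check described above can be illustrated with a small Python sketch. It assumes, purely for illustration, that every model records the task it was trained on and a single parent; the `Model` class, the helper names, and the toy four-node lineage are hypothetical and far shorter than the 28-ancestor chain mentioned in the text.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Model:
    """Hypothetical record of a model in the system: its task and its parent."""
    task: str
    parent: Optional["Model"] = None


def ancestors(model: Model) -> Iterator[Model]:
    """Yield the parent, grandparent, ... of a model up to the root."""
    node = model.parent
    while node is not None:
        yield node
        node = node.parent


def inherits_from_task(model: Model, task: str) -> bool:
    """True if at least one ancestor of the model was trained on `task`."""
    return any(a.task == task for a in ancestors(model))


# Toy lineage (illustrative only): root -> full-form flowers -> dtd -> 1k flowers.
root = Model(task="imagenet21k")
flowers_full = Model(task="oxford_flowers102", parent=root)
dtd = Model(task="dtd", parent=flowers_full)
flowers_1k = Model(task="flowers102_1k", parent=dtd)

reuses_full_form = inherits_from_task(flowers_1k, "oxford_flowers102")
```

In this toy lineage the short form model mutates the dtd model directly, yet the check still finds the full form task among its ancestors, which mirrors the indirect retrieval described in the paragraph above.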

After only one task iteration, 5 tasks achieve better test accuracy than the reference models without reusing any knowledge introduced by the corresponding full form task. Particularly interesting is the case of kitti-dist (a.k.a. kitti/closest_vehicle_distance), which achieves strong performance without transferring from the matching full form task, instead composing the knowledge of related tasks: kitti/closest_object_distance (8 ancestors), kitti/count_vehicles (3 ancestors) and kitti/count_vehicles (3 ancestors). It thus learns to estimate the distance of the closest vehicle by combining the knowledge of recognizing vehicles and estimating the distance of the closest object. Also, clevr-dist achieves strong performance by inheriting from the semantically equivalent task kitti/closest_object_distance without reusing the knowledge introduced by clevr-dist.
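As a rough aid for reading Table 6, the sketch below counts in how many of the 19 columns each µ2Net row exceeds both reference rows. The four lists are copied from the table rows in their printed order; the variable and function names are illustrative. The count covers all columns regardless of underlining, so it is a different quantity from the 5 tasks discussed above.

```python
# Rows of Table 6, copied in column order (19 VTAB-1k tasks).
vit_finetune = [90.8, 84.1, 74.1, 99.3, 92.7, 61.0, 80.9, 82.5, 95.6, 85.2,
                75.3, 70.3, 56.1, 41.9, 74.7, 64.9, 79.9, 30.5, 41.7]
sup_rotation = [91.7, 53.7, 69.5, 90.8, 88.1, 32.8, 88.5, 83.4, 96.0, 82.0,
                71.1, 47.3, 57.2, 36.6, 88.3, 52.1, 77.1, 51.6, 33.7]
iter_1 = [87.1, 89.4, 77.6, 99.2, 94.5, 57.6, 97.5, 86.0, 98.6, 93.4,
          78.0, 91.2, 59.9, 47.6, 58.4, 96.2, 81.9, 32.1, 92.5]
iter_2 = [89.9, 90.6, 78.1, 99.7, 94.5, 57.6, 97.5, 86.0, 98.3, 93.4,
          83.5, 99.8, 90.6, 76.3, 100, 96.2, 81.7, 33.7, 92.5]


def columns_won(candidate, *references):
    """Count columns where the candidate is strictly above every reference."""
    return sum(
        all(value > ref[i] for ref in references)
        for i, value in enumerate(candidate)
    )


wins_iter_1 = columns_won(iter_1, vit_finetune, sup_rotation)
wins_iter_2 = columns_won(iter_2, vit_finetune, sup_rotation)
```

Comparing `wins_iter_1` with `wins_iter_2` gives a quick view of how many additional tasks overtake the references after the second iteration.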

7 Conclusion

We introduced the µ2Net method, aimed at achieving state-of-the-art quality on a large task set, with the ability to dynamically introduce new tasks into the running system. The more tasks are learned, the more knowledge is embedded in the system. A ViT-L architecture (307M parameters) was evolved into a multitask system with 13,087M parameters jointly solving 69 tasks. However, as the system grows, the sparsity in parameter activation keeps the amount of compute and the memory usage per task constant. The average number of added parameters per task decreases by 38% through the experiment, and the resulting multitask system activates only 2.3% of the total parameters per task (see Figure 2 and Table 7). The proposed method allows decoupling the growth of useful representations for solving new tasks from the growth of parameters/compute per task. Furthermore, experimenting with a large number of tasks allowed us to identify different patterns of positive knowledge transfer, showing especially effective results on small tasks and across related tasks. The approach to mutations when introducing child models into the system achieves immunity against common pitfalls of multitask systems such as catastrophic forgetting, negative transfer and gradient interference, demonstrating the key properties we want to achieve in a continual learning system.
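The activation percentage quoted above can be approximated with simple arithmetic. The sketch assumes that each task activates roughly as many parameters as the ViT-L root model (307M); this is a simplification, since task heads differ in size, and some intermediate rows of Table 7 may not follow it exactly. The names used are illustrative.

```python
# Assumption: every task activates about one ViT-L worth of parameters.
VIT_L_PARAMS_M = 307

# Total system size (millions of parameters) for the first and last
# training segments listed in Table 7.
total_params_m = {
    "ViT tasks 10 iters": 659,
    "VTAB-1k 2 iters": 13087,
}


def activated_percent(total_m: float, active_m: float = VIT_L_PARAMS_M) -> float:
    """Share of the system's parameters activated for a single task, in %."""
    return 100.0 * active_m / total_m


first = round(activated_percent(total_params_m["ViT tasks 10 iters"]), 1)
last = round(activated_percent(total_params_m["VTAB-1k 2 iters"]), 1)
```

Under this assumption the per-task activated share falls as the system grows while the activated parameter count stays roughly fixed, which is the decoupling the conclusion refers to.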

Limitations   The proposed methodology does not allow for a standard study of result variance, since it aims to avoid experiment repetitions and the related inefficiencies, and instead focuses the available compute resources on the continual enrichment of a single artificial intelligence. Section 6.3 demonstrates a methodology to estimate variance at any point of the learning process.

Societal impact   The ability to get high quality trained models for new machine learning tasks can have a significant effect on society, enabling a broader set of individuals than just ML researchers and engineers to solve real-world problems. There are a number of issues raised in such systems, however. Large scale multitask systems must provide guarantees of respect for users’ data privacy, and they must also provide control over the sources of knowledge and influence for applications that are sensitive to fairness and diversity biases. The knowledge access control techniques described in Section 3 and demonstrated in Section 5 allow the implementation of policies with such guarantees.

Figure 2: Activated and added parameters per task, as a percentage of the total number of parameters of the multitask system, over the duration of the large scale experiment (see Section 6). Vertical lines highlight the start of the introduction of each of the considered benchmarks.


References

  • A. Ajayan and A. P. James (2021) Edge to quantum: hybrid quantum-spiking neural network image classifier. Neuromorphic Computing and Engineering 1. Cited by: Table 5.
  • S. An, M. J. Lee, S. Park, H. S. Yang, and J. So (2020) An ensemble of simple convolutional neural network models for mnist digit recognition. ArXiv abs/2008.10400. Cited by: Table 5.
  • P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu (2022) Pathways: asynchronous distributed dataflow for ml. ArXiv abs/2203.12533. Cited by: Appendix A.
  • H. Bilen, S. Rebuffi, and T. Jakab (2017) Visual domain decathlon. Cited by: Table 8, §6.3.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. J. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. ArXiv abs/2005.14165. Cited by: §1.
  • Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2018) GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, Cited by: §2.
  • G. Cheng, J. Han, and X. Lu (2017) Remote sensing image scene classification: benchmark and state of the art. Proceedings of the IEEE 105, pp. 1865–1883. Cited by: Table 8, Table 9.
  • M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613. Cited by: Table 8, Table 9.
  • T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha (2018) Deep learning for classical japanese literature. ArXiv abs/1812.01718. Cited by: Table 9.
  • G. Cohen, S. Afshar, J. C. Tapson, and A. van Schaik (2017) EMNIST: extending mnist to handwritten letters. 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. Cited by: Table 9.
  • N. Das, J. M. Reddy, R. Sarkar, S. Basu, M. Kundu, M. Nasipuri, and D. K. Basu (2012a) A statistical-topological feature combination for recognition of handwritten numerals. Appl. Soft Comput. 12, pp. 2486–2495. Cited by: Table 9.
  • N. Das, R. Sarkar, S. Basu, M. Kundu, M. Nasipuri, and D. K. Basu (2012b) A genetic algorithm based region sampling for selection of local features in handwritten digit recognition application. Appl. Soft Comput. 12, pp. 1592–1606. Cited by: Table 9.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §2, §4.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. ArXiv abs/2010.11929. Cited by: §2, §4, §6.1, §6.1, §6.1, Table 2, Table 6.
  • N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. Bosma, Z. Zhou, T. Wang, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. S. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. Le, Y. Wu, Z. Chen, and C. Cui (2021) GLaM: efficient scaling of language models with mixture-of-experts. ArXiv abs/2112.06905. Cited by: §2.
  • L. Fei-Fei, R. Fergus, and P. Perona (2004) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. Cited by: Table 8, Table 9.
  • C. Fernando, D. S. Banarse, C. Blundell, Y. Zwols, D. R. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra (2017) PathNet: evolution channels gradient descent in super neural networks. ArXiv abs/1701.08734. Cited by: §2.
  • P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur (2021) Sharpness-aware minimization for efficiently improving generalization. ArXiv abs/2010.01412. Cited by: §6.1.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3, pp. 128–135. Cited by: §2.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: Table 8, Table 9.
  • A. Gesmundo and J. Dean (2022) MuNet: evolving pretrained deep neural networks into scalable auto-tuning multitask systems. ArXiv abs/2205.10937. Cited by: §2.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. ArXiv abs/1803.07728. Cited by: Table 3, Table 6.
  • A. Hazra, P. Choudhary, S. C. Inunganbi, and M. Adhikari (2021) Bangla-meitei mayek scripts handwritten character recognition using convolutional neural network. Applied Intelligence 51, pp. 2291–2311. Cited by: Table 5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §2.
  • P. Helber, B. Bischke, A. R. Dengel, and D. Borth (2019) EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12, pp. 2217–2226. Cited by: Table 8, Table 9.
  • P. Jeevan and A. Sethi (2022) WaveMix: resource-efficient token mixing for images. ArXiv abs/2203.03689. Cited by: Table 5.
  • J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1988–1997. Cited by: Table 8, Table 9.
  • N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. S. Bajwa, S. Bates, S. Bhatia, N. J. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. B. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. A. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon (2017) In-datacenter performance analysis of a tensor processing unit. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. Cited by: Appendix A.
  • H. M. D. Kabir, M. Abdar, S. M. J. Jalali, A. Khosravi, A. F. Atiya, S. Nahavandi, and D. Srinivasan (2020) SpinalNet: deep neural network with gradual input. ArXiv abs/2007.03347. Cited by: Table 5.
  • Kaggle and EyePacs (2015) Kaggle diabetic retinopathy detection. Cited by: Table 8, Table 9.
  • J. Kaplan, S. McCandlish, T. J. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. ArXiv abs/2001.08361. Cited by: §2.
  • A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7482–7491. Cited by: §2.
  • D. A. Klindt, L. Schott, Y. Sharma, I. Ustyuzhaninov, W. Brendel, M. Bethge, and D. M. Paiton (2021) Towards nonlinear disentanglement in natural data with temporal sparse coding. ArXiv abs/2007.10930. Cited by: Table 8, Table 9.
  • E. Kokiopoulou, A. Hauth, L. Sbaiz, A. Gesmundo, G. Bartók, and J. Berent (2019) Fast task-aware architecture inference. ArXiv abs/1902.05781. Cited by: §2.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: Table 8, Table 9.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350, pp. 1332 – 1338. Cited by: Table 9.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proc. IEEE 86, pp. 2278–2324. Cited by: Table 9.
  • Y. LeCun, F. J. Huang, and L. Bottou (2004) Learning methods for generic object recognition with invariance to pose and lighting. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004. 2, pp. II–104 Vol.2. Cited by: Table 8, Table 9.
  • H. Liu, K. Simonyan, and Y. Yang (2019a) DARTS: differentiable architecture search. ArXiv abs/1806.09055. Cited by: §2.
  • S. Liu, Y. Liang, and A. Gitter (2019b) Loss-balanced task weighting to reduce negative transfer in multi-task learning. In AAAI, Cited by: §2.
  • K. Maziarz, A. Khorlin, Q. de Laroussilhe, and A. Gesmundo (2018) Evolutionary-neural hybrid agents for architecture search. ArXiv abs/1811.09828. Cited by: §2.
  • K. Maziarz, E. Kokiopoulou, A. Gesmundo, L. Sbaiz, G. Bartók, and J. Berent (2019) Gumbel-matrix routing for flexible multi-task learning. ArXiv abs/1910.04915. Cited by: §2.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. Psychology of Learning and Motivation 24, pp. 109–165. Cited by: §2.
  • A. Mirhoseini, A. Goldie, M. Yazgan, J. W. Jiang, E. M. Songhori, S. Wang, Y. Lee, E. Johnson, O. Pathak, S. Bae, A. Nazi, J. Pak, A. Tong, K. Srinivasa, W. Hang, E. Tuncer, A. Babu, Q. V. Le, J. Laudon, R. Ho, R. Carpenter, and J. Dean (2020) Chip placement with deep reinforcement learning. ArXiv abs/2004.10746. Cited by: §1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: Table 8, Table 9.
  • M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: Table 8, Table 9.
  • O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar (2012) Cats and dogs. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505. Cited by: Table 8, Table 9.
  • H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In ICML, Cited by: §2.
  • C. Raffel, N. M. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv abs/1910.10683. Cited by: §2.
  • A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever (2021) Zero-shot text-to-image generation. ArXiv abs/2102.12092. Cited by: §1.
  • E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In AAAI, Cited by: §2.
  • T. Ridnik, G. Sharir, A. Ben-Cohen, E. Ben-Baruch, and A. Noy (2021) ML-decoder: scalable and versatile classification head. Cited by: §6.1.
  • M. T. Rosenstein (2005) To transfer or not to transfer. In NIPS 2005, Cited by: §2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, pp. 211–252. Cited by: Table 8.
  • O. Sener and V. Koltun (2018) Multi-task learning as multi-objective optimization. In NeurIPS, Cited by: §2.
  • A. W. Senior, R. Evans, J. M. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Zídek, A. W. R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D. T. Jones, D. Silver, K. Kavukcuoglu, and D. Hassabis (2020) Improved protein structure prediction using potentials from deep learning. Nature 577, pp. 706–710. Cited by: §1.
  • N. M. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. ArXiv abs/1701.06538. Cited by: §2.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–489. Cited by: §1.
  • A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer (2021) How to train your vit? data, augmentation, and regularization in vision transformers. ArXiv abs/2106.10270. Cited by: 1st item, §4, §5, §6.2, Table 3, §6.
  • Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang (2020) ERNIE 2.0: a continual pre-training framework for language understanding. ArXiv abs/1907.12412. Cited by: §2.
  • H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou (2021) Going deeper with image transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 32–42. Cited by: §6.1.
  • A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. ArXiv abs/1706.03762. Cited by: §4.
  • B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling (2018) Rotation equivariant cnns for digital pathology. ArXiv abs/1806.03962. Cited by: Table 8, Table 9.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. P. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, pp. 1–5. Cited by: §1.
  • Z. Wang, Z. Dai, B. Póczos, and J. G. Carbonell (2019) Characterizing and avoiding negative transfer. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11285–11294. Cited by: §2.
  • J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. Cited by: Table 8, Table 9.
  • T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn (2020) Gradient surgery for multi-task learning. ArXiv abs/2001.06782. Cited by: §2.
  • X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djolonga, A. S. Pinto, M. Neumann, A. Dosovitskiy, L. Beyer, O. Bachem, M. Tschannen, M. Michalski, O. Bousquet, S. Gelly, and N. Houlsby (2019a) A large-scale study of representation learning with the visual task adaptation benchmark. arXiv: Computer Vision and Pattern Recognition. Cited by: §6.2, Table 3, Table 6.
  • X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djolonga, A. S. Pinto, M. Neumann, A. Dosovitskiy, L. Beyer, O. Bachem, M. Tschannen, M. Michalski, O. Bousquet, S. Gelly, and N. Houlsby (2019b) The visual task adaptation benchmark. ArXiv abs/1910.04867. Cited by: Table 8, Table 9, §6.2.
  • B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. ArXiv abs/1611.01578. Cited by: §2.

Appendix A Experiments details

All the experiments reported in this paper can be reproduced by using the following public resources:

  • The ViT model definition and checkpoints published by Steiner et al. (2021). These resources are available at and distributed under the Apache License 2.0.

  • Published code of the proposed method:

  • All the datasets used are publicly available via the Tensorflow Datasets image classification catalog. Refer to for detailed information regarding each dataset licence and other metadata. Table 8 reports the exact dataset splits and the reference for each task.

We also publish the µ2Net checkpoint resulting from the large-scale multitask experiment reported in Section 6. This checkpoint can be used for inference on any of the 69 learned image classification tasks, for further analysis, or even to be extended with additional tasks or methods. For information about the checkpoint and its license refer to:

References for mentioned state of the art public model for different tasks are sourced from as of May 2022.

The VTAB-1k results are reported for reference and are not directly comparable to the state of the art, as the benchmark definition specifies that VTAB-full tasks cannot be used for pre-training.

The initial experiments reported in Sections 5 and 6.1 were executed on a TPUv3 (Jouppi et al., 2017) machine with 8 cores, while all the following experiments were executed on a larger scale infrastructure using 32 TPUv4 chips in MegaCore mode, through the Pathways orchestration layer (Barham et al., 2022).

Table 7 reports more details for each training segment.

Training segment         TPU core-hours   #cores   TPU type   #Tasks   #Params (M)   Activated params per task (%)
ViT tasks 10 iters       4949             8        TPUv3      3        659           46.6%
VTAB-full 2 iters        5927             32       TPUv4      26       4400          7.0%
ViT tasks +1 iter        541              32       TPUv4      26       4577          5.2%
VDD 2 iters              1507             32       TPUv4      36       7790          3.9%
ViT tasks +1 iter        564              32       TPUv4      36       7854          3.9%
VTAB-full +1 iter        2266             32       TPUv4      36       7596          4.0%
Char. class. 2 iters     1785             32       TPUv4      44       9368          3.3%
VTAB-1k 2 iters          271              32       TPUv4      69       13087         2.3%
Table 7: Details for the different training segments of the large scale continual learning experiment described in Section 6.
Figure 3: Correlation in log scale with the standard deviation measured during the variance analysis conducted on Visual Domain Decathlon benchmark by running the training on 3 parallel replicas of the system. We display the 2 metrics that are most correlated with the standard deviation: error rate computed on the test set (left) and training samples per class (right). The red line is fitted to minimize the squared distance to the set of points.
Figure 4: Distributions of the hyperparameter values used by the best models of the 69 image classification tasks at the end of the large scale continual learning experiment described in Section 6.
Figure 5: Graph representing the architecture of the multitask system jointly solving 69 image classification tasks, generated by the large scale continual learning experiment described in Section 6. Each task is identified with a unique color. Bottom triangular nodes represent the data input of each task. Top rectangular nodes represent the head layer of each task. Each sequence of edges of the same color connecting a task input to its head defines a path, i.e. the sequence of layers composing the model for that task. Each path traverses 27 round nodes representing ViT L/16 internal layers (see Section 4) in the following order from bottom to top: patch embedding, class token, position embedding and 24 transformer layers. Internal nodes are represented with the color of the task on which the parameters of the corresponding layer were trained last, except for the gray nodes, which have not received gradient updates from any of the 69 tasks and still carry the parameters of the root model that were loaded from a checkpoint of a ViT L/16 pretrained on the imagenet-21k dataset (see Section 6) (video:
Name Train Val. Test Reference
imagenet2012 train imagenet_v2:test val (Russakovsky et al., 2015)
cifar100 train[:98%] train[98%:] test (Krizhevsky, 2009)
cifar10 train cifar10_1:test test (Krizhevsky, 2009)
VTAB-full benchmark
caltech101 train[:2754] train[2754:] test (Fei-Fei et al., 2004)
dtd train val test (Cimpoi et al., 2014)
oxford_flowers102 train val test (Nilsback and Zisserman, 2008)
oxford_iiit_pet train[:2944] train[2944:] test (Parkhi et al., 2012)
sun397 train val test (Xiao et al., 2010)
svhn_cropped train[:65931] train[65931:] test (Netzer et al., 2011)
patch_camelyon train val test (Veeling et al., 2018)
eurosat/rgb train[:16200] train[16200:21600] train[21600:] (Helber et al., 2019)
resisc45 train[:18900] train[18900:25200] train[25200:] (Cheng et al., 2017)
drd/btgraham-300 train val test (Kaggle and EyePacs, 2015)
clevr/count_cylinders train[:63000] train[63000:] val (Johnson et al., 2017)
clevr/count_all train[:63000] train[63000:] val (Johnson et al., 2017)
clevr/closest_object_distance train[:63000] train[63000:] val (Johnson et al., 2017)
dmlab train val test (Zhai et al., 2019b)
dsprites/label_x_position train[:589824] train[589824:663552] train[663552:] (Klindt et al., 2021)
dsprites/label_orientation train[:589824] train[589824:663552] train[663552:] (Klindt et al., 2021)
kitti/closest_object_distance train val test (Geiger et al., 2012)
kitti/count_vehicles train val test (Geiger et al., 2012)
kitti/closest_vehicle_distance train val test (Geiger et al., 2012)
smallnorb/label_category train test[:50%] test[50%:] (LeCun et al., 2004)
smallnorb/label_lighting train test[:50%] test[50%:] (LeCun et al., 2004)
smallnorb/label_azimuth train test[:50%] test[50%:] (LeCun et al., 2004)
smallnorb/label_elevation train test[:50%] test[50%:] (LeCun et al., 2004)
Visual domain decathlon benchmark
vdd/imagenet12 train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/svhn train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/cifar100 train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/gtsrb train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/daimlerpedcls train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/omniglot train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/ucf101 train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/aircraft train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/dtd train val[:50%] val[50%:] (Bilen et al., 2017)
vdd/vgg-flowers train val[:50%] val[50%:] (Bilen et al., 2017)
Continues in Table 9
Table 8: Datasets splits and reference (part 1 of 2). For each dataset used in the experiments, this table reports: 1) dataset name indicative of the Tensorflow Datasets Catalog identification string and linking to the corresponding catalog page ("visual_domain_decathlon" has been abbreviated as "vdd", and "diabetic_retinopathy_detection" as "drd"), 2) train, validation and test data splits, represented with the standard Tensorflow Datasets format ("validation" has been abbreviated as "val"), 3) corresponding scientific publication reference. Datasets are listed in the order of introduction into the system.
[1] The test split of the imagenet_v2 dataset is used as validation set for imagenet2012.
[2] The test split of the cifar10_1 dataset is used as validation set for cifar10.
[3] The VTAB-full benchmark also includes the cifar100 task. Cifar100 has been introduced to the µ2Net system as part of the initial benchmark. In the VTAB-full results tables we refer to the top 1 test accuracy achieved in the latest cifar100 training iteration without retraining it as part of the VTAB-full active training iteration.
[4] The definition for the VTAB standard and additional tasks has been sourced from
[5] VTAB additional task, not included in the standard scoring set. These tasks were added to further scale the system and analyze transfer across related tasks.
…Continued from Table 8

Multitask Character Classification Benchmark

| Name | Train | Val. | Test | Reference |
|---|---|---|---|---|
| emnist/digits | train[5%:] | train[:5%] | test | (Cohen et al., 2017) |
| emnist/letters | train[5%:] | train[:5%] | test | (Cohen et al., 2017) |
| kmnist | train[5%:] | train[:5%] | test | (Clanuwat et al., 2018) |
| mnist | train[5%:] | train[:5%] | test | (LeCun et al., 1998) |
| omniglot | train | small1 | small2 | (Lake et al., 2015) |
| cmaterdb/bangla | train[20%:] | train[:20%] | test | (Das et al., 2012b, a) |
| cmaterdb/devanagari | train[20%:] | train[:20%] | test | (Das et al., 2012b, a) |
| cmaterdb/telugu | train[20%:] | train[:20%] | test | (Das et al., 2012b, a) |

VTAB 1k benchmark

| Name | Train | Val. | Test | Reference |
|---|---|---|---|---|
| caltech101 | train[:800] | train[2754:2954] | test | (Fei-Fei et al., 2004) |
| cifar100 | train[:800] | train[45000:45200] | test | (Krizhevsky, 2009) |
| cifar10 | train[:800] | train[45000:45200] | test | (Krizhevsky, 2009) |
| dtd | train[:800] | val[:200] | test | (Cimpoi et al., 2014) |
| oxford_flowers102 | train[:800] | val[:200] | test | (Nilsback and Zisserman, 2008) |
| oxford_iiit_pet | train[:800] | train[2944:3144] | test | (Parkhi et al., 2012) |
| sun397 | train[:800] | val[:200] | test | (Xiao et al., 2010) |
| svhn_cropped | train[:800] | train[65931:66131] | test | (Netzer et al., 2011) |
| patch_camelyon | train[:800] | val[:200] | test | (Veeling et al., 2018) |
| eurosat/rgb | train[:800] | train[16200:16400] | train[21600:] | (Helber et al., 2019) |
| resisc45 | train[:800] | train[18900:19100] | train[25200:] | (Cheng et al., 2017) |
| drd/btgraham-300 | train[:800] | val[:200] | test | (Kaggle and EyePacs, 2015) |
| clevr/count_cylinders | train[:800] | train[63000:63200] | val | (Johnson et al., 2017) |
| clevr/count_all | train[:800] | train[63000:63200] | val | (Johnson et al., 2017) |
| clevr/closest_object_distance | train[:800] | train[63000:63200] | val | (Johnson et al., 2017) |
| dmlab | train[:800] | val[:200] | test | (Zhai et al., 2019b) |
| dsprites/label_x_position | train[:800] | train[589824:590024] | train[663552:] | (Klindt et al., 2021) |
| dsprites/label_orientation | train[:800] | train[589824:590024] | train[663552:] | (Klindt et al., 2021) |
| kitti/closest_object_distance | train[:800] | val[:200] | test | (Geiger et al., 2012) |
| kitti/count_vehicles | train[:800] | val[:200] | test | (Geiger et al., 2012) |
| kitti/closest_vehicle_distance | train[:800] | val[:200] | test | (Geiger et al., 2012) |
| smallnorb/label_category | train[:800] | test[:200] | test[50%:] | (LeCun et al., 2004) |
| smallnorb/label_lighting | train[:800] | test[:200] | test[50%:] | (LeCun et al., 2004) |
| smallnorb/label_azimuth | train[:800] | test[:200] | test[50%:] | (LeCun et al., 2004) |
| smallnorb/label_elevation | train[:800] | test[:200] | test[50%:] | (LeCun et al., 2004) |

Table 9: Dataset splits and references (part 2 of 2).
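The split specifications in the table can be read as slices over a named split, either in absolute example indices (`train[45000:45200]`) or in percentages (`train[5%:]`). The following is a minimal sketch of how such a specification resolves to an index range. The helper `resolve_split` and its floor-based percent rounding are assumptions made for illustration; they are not taken from the paper or from any dataset library, whose rounding rules may differ.

```python
import re


def resolve_split(spec: str, split_sizes: dict[str, int]) -> tuple[str, int, int]:
    """Resolve a spec such as 'train[5%:]' to (split_name, start, stop).

    Hypothetical helper: percent bounds are floored, absolute bounds are
    clipped to the split size, and negative indices are not supported.
    """
    match = re.fullmatch(r"(\w+)(?:\[(\d+%?)?:(\d+%?)?\])?", spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    name, lower, upper = match.groups()
    size = split_sizes[name]

    def bound(token, default):
        # A missing bound means "from the start" or "to the end" of the split.
        if token is None:
            return default
        if token.endswith("%"):
            return size * int(token[:-1]) // 100
        return min(int(token), size)

    return name, bound(lower, 0), bound(upper, size)


# mnist has 60,000 training examples and cifar10 has 50,000.
mnist_train = resolve_split("train[5%:]", {"train": 60000})
mnist_val = resolve_split("train[:5%]", {"train": 60000})
cifar_train = resolve_split("train[:800]", {"train": 50000})
cifar_val = resolve_split("train[45000:45200]", {"train": 50000})
```

Under these assumptions, the mnist training and validation ranges are disjoint and together cover the full original training split, and each VTAB 1k task uses 800 training plus 200 validation examples, for 1,000 in total.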
1:Active task:
2:Set of all the models currently in the multitask system:
3:Active population:  trained on
4:for  do
5:     for  do
6:          Sample parent model
7:         Parent model: none
8:         for Candidate parent model:  do
9:              if  then
11:                  break
12:              end if
13:         end for
14:         if  none then
16:         end if
17:          Sample child model
18:         Set of mutations:
19:         for Candidate mutation:  do
20:              if  then
22:              end if
23:         end for
24:         Untrained child model:
25:          Train child model
26:         Retained child model: none
27:         for  do
29:              if  none trained on  then
31:              end if
32:         end for
33:         if  none then
35:         end if
36:     end for
37:end for
38: Keep only the best model for
39: not trained on
Algorithm 1 Pseudocode for one active task iteration
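The control flow of Algorithm 1 (sample a parent, sample mutations, train the child, retain improved children, keep only the best model for the active task) can be illustrated with a toy sketch. Everything concrete below is an assumption made for illustration: the scalar `Model`, the `TARGETS` table, the `score` and `train` functions, the fixed acceptance and mutation probabilities, and the retention rule are hypothetical stand-ins, not the paper's definitions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Toy stand-in for a model: one scalar weight and the task it was trained on."""
    weight: float
    task: str


# Hypothetical per-task optimum; a model scores higher the closer its weight is.
TARGETS = {"task_a": 1.0, "task_b": -2.0}


def score(model: Model, task: str) -> float:
    return -abs(model.weight - TARGETS[task])


def train(model: Model, task: str) -> Model:
    # Toy "training": move the weight halfway toward the task optimum.
    return Model(model.weight + 0.5 * (TARGETS[task] - model.weight), task)


def active_task_iteration(models, task, rng, generations=3, children=4,
                          train_cycles=2, accept_prob=0.5, mutation_prob=0.5):
    """One active task iteration over a non-empty list of models (assumed rules)."""
    others = [m for m in models if m.task != task]
    active = [m for m in models if m.task == task]
    for _ in range(generations):
        for _ in range(children):
            # Sample parent: best-first over the active population, then the rest.
            by_score = lambda m: score(m, task)
            candidates = (sorted(active, key=by_score, reverse=True)
                          + sorted(others, key=by_score, reverse=True))
            parent = None
            for candidate in candidates:
                if rng.random() < accept_prob:  # assumed acceptance rule
                    parent = candidate
                    break
            if parent is None:
                parent = rng.choice(candidates)
            # Sample child: each candidate mutation is applied independently.
            mutations = [d for d in (-0.3, 0.3) if rng.random() < mutation_prob]
            child = Model(parent.weight + sum(mutations), task)
            # Train child, retaining the best checkpoint that improves on the
            # parent (or any checkpoint if the parent was trained on another task).
            retained = None
            for _ in range(train_cycles):
                child = train(child, task)
                improves = parent.task != task or score(child, task) > score(parent, task)
                if improves and (retained is None
                                 or score(child, task) > score(retained, task)):
                    retained = child
            if retained is not None:
                active.append(retained)
    # Keep only the best model for the active task; other models are untouched.
    best = [max(active, key=lambda m: score(m, task))] if active else []
    return others + best


rng = random.Random(0)
system = [Model(-2.0, "task_b")]
system = active_task_iteration(system, "task_a", rng)
first_best = score(system[-1], "task_a")
system = active_task_iteration(system, "task_a", rng)
```

In this toy setting, models trained on other tasks are never modified, which mirrors the compartmentalization property described in the abstract, and repeating the iteration on the same task cannot lower the best score because the previous best model stays in the active population.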