HIDRA: Head Initialization across Dynamic targets for Robust Architectures

10/28/2019, by Rafael Rego Drumond, et al.

The performance of gradient-based optimization strategies depends heavily on the initial weights of the parametric model. Recent works show that there exist weight initializations from which optimization procedures can find the task-specific parameters faster than from uniformly random initializations, and that such a weight initialization can be learned by optimizing a specific model architecture across similar tasks via MAML (Model-Agnostic Meta-Learning). Current methods are limited to populations of classification tasks that share the same number of classes due to the static model architectures used during meta-learning. In this paper, we present HIDRA, a meta-learning approach that enables training and evaluating across tasks with any number of target variables. We show that Model-Agnostic Meta-Learning trains a distribution for all the neurons in the output layer and a specific weight initialization for the ones in the hidden layers. HIDRA exploits this by learning one master neuron which is used to initialize any number of output neurons for a new task. Extensive experiments on the Miniimagenet and Omniglot data sets demonstrate that HIDRA improves over standard approaches while generalizing to tasks with any number of target variables. Moreover, our approach is shown to robustify low-capacity models in learning across complex tasks with a high number of classes for which regular MAML fails to learn any feasible initialization.


1 Introduction

Figure 1: Performance comparison of random initialization and maml vs. a single initialization learned with hidra for different numbers of target variables when training on Miniimagenet.

Machine learning models, and especially deep neural networks, are crucial in various fields of research and industry, to the point that not only experts but also practitioners of related areas depend on their application. In almost all cases, the optimization of these parametric models relies on a suitable selection of multiple hyperparameters which influence the training performance drastically. This parameter selection either requires expert knowledge or the use of hyperparameter optimization techniques [schilling2016scalable]. One often disregarded hyperparameter is the weight initialization of the parametric model, which is required as a starting point for gradient-based optimization. A suitable weight initialization is essential for fast convergence to a near-optimal solution when using a method that generally converges to a local optimum. Standard hyperparameter optimization approaches are not capable of finding a per-weight initialization for neural networks due to their high number of continuous weight parameters. Instead, a random weight initialization is typically chosen as a starting point [glorot2010understanding, He_2015_ICCV].
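For illustration, the snippet below shows the cited Glorot and He schemes as implemented in PyTorch; the two-layer network is a hypothetical example, not an architecture used in this paper.

```python
import torch.nn as nn

# Hypothetical two-layer network used only to illustrate random initialization.
layer1 = nn.Linear(784, 256)
layer2 = nn.Linear(256, 10)

nn.init.xavier_uniform_(layer1.weight)   # Glorot: variance scaled by fan-in and fan-out
nn.init.kaiming_normal_(layer2.weight)   # He: variance scaled by fan-in, suited to ReLU
nn.init.zeros_(layer1.bias)
nn.init.zeros_(layer2.bias)
```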

Recent approaches such as maml [finn2017model] show that it is possible to learn a weight initialization for a specific neural network by utilizing second-order optimization for training across a set of similar tasks. This makes it possible to find a per-weight initialization that leads to fast convergence on similar tasks. However, such a process requires that each task has the same number of target variables, since a specific model architecture is optimized, which also means having a fixed number of output neurons. In practice, this results in a huge computational effort, since a separate model architecture has to be optimized for each potential number of output neurons expected during application. Moreover, the initialization should perform equally well for two identical tasks with permuted class order, because there is no inherent ordering of the target variables of a standard classification task. This suggests that the different output neurons cannot learn different output weights when trained across data sets with different class semantics. We propose an extension to existing meta-learning approaches by learning a single master neuron which can be used to initialize any number of output neurons. During meta-learning, it is used to initialize the required number of output neurons for a specific task, train on the task via maml, and update the master neuron with regard to the different output neurons. This enables approaches like maml to train and evaluate across tasks with a dynamic number of target variables.

The core contributions of this work are as follows: (1) We demonstrate that standard maml learns interchangeable output neurons, which limits the approach to a fixed number of target variables. (2) We extend maml to work across dynamic target sizes by deploying a general master neuron that learns to initialize any number of output neurons for similar tasks. (3) Finally, we show that our method hidra leads to higher model robustness, such that a suitable weight initialization can be found even for tasks with a high number of target variables, where regular maml fails to do so (Figure 1).

2 Related Work

Current meta-learning approaches that find a model-weight initialization are typically evaluated by applying them to few-shot classification problems, because it is generally easier to generate the necessary number of tasks required for meta-learning when dealing with few-shot tasks. Few-shot learning [gui2018few, sung2018learning, Gidaris2018DynamicFV] strives to achieve the highest possible classification performance when faced with a new task that comprises only a handful of samples per class. This can be achieved by learning an initialization that converges fast, even when only few instances are given, but also through the application of other meta-learning approaches. For example, Liu et al. [Liu2018LearningTP] try to deal with this low-data regime by classifying all available tasks in one step using transductive inference through label propagation, as opposed to having a model that processes single tasks. Snell et al. [snell2017prototypical] propose to learn a single model via meta-learning that embeds instances of a task in a metric space to measure the distances between them. For a novel task, a prototypical representation is selected for each target class to predict new images simply by looking at the nearest neighbor among these prototypes. To better calculate these distances, Oreshkin et al. devised TADAM [oreshkin2018tadam], a relation metric that adapts based on the task and scales appropriately as well.

In contrast to these methods, there are many approaches that strive to optimize the model on the training instances of the evaluation task, instead of simply using them for inference. Training a model on a single few-shot task with such a small number of samples requires a suitable model initialization, because it can otherwise easily converge to a poor local optimum.

Another category of meta-learning approaches is referred to as transfer learning [sung2018comp, pan2010survey]. It describes the process of training a model on different auxiliary tasks and then using the learned model to fit the actual target problem in order to improve performance. For instance, pre-training blocks of convolutional neural networks on smaller tasks allows fitting a joint model to a much larger task [zoph2018learning]. Another angle of transfer learning is using auxiliary tasks to help the model extract more useful features by training extra heads of the architecture to learn metadata from the same inputs [ranjan2019hyperface].

Our work builds on the research of Finn et al.: Model-Agnostic Meta-Learning (maml) [finn2017model] finds an initialization for a specific model by training it across a set of similar few-shot learning tasks. They optimize a single model initialization by taking into account the validation loss on each task after performing some update iterations starting from this same initial point. Every task $\mathcal{T}_i$ consists of pairs of inputs and target values. The authors sample a batch of tasks $\mathcal{T}_i$ from a greater set $p(\mathcal{T})$. New parameters $\theta'_i$ are then calculated for a task $\mathcal{T}_i$ by performing one or more update steps with the task-specific loss $\mathcal{L}_{\mathcal{T}_i}$, starting from the initialization $\theta$. For the update steps, after initializing $\theta'_i \leftarrow \theta$, this can be written as:

$\theta'_i \leftarrow \theta'_i - \alpha \nabla_{\theta'_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i})$    (1)

Then $\theta$ is updated using the second derivative of the updated weights with regard to their validation performance over all tasks in the meta-batch, as in:

$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i})$    (2)
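As an illustrative sketch (not the authors' code), Equations (1) and (2) can be written functionally in PyTorch as follows; `forward`, `loss_fn`, and the task format are assumptions.

```python
import torch

def maml_meta_step(params, tasks, loss_fn, forward, alpha=0.01, beta=0.001, inner_steps=1):
    """One meta-iteration of maml (Equations 1 and 2), written functionally.

    params:  list of leaf tensors with requires_grad=True (the initialization theta)
    forward: forward(params, x) -> model predictions using the given parameter list
    tasks:   iterable of (x_train, y_train, x_val, y_val) tuples
    """
    meta_loss = 0.0
    for x_tr, y_tr, x_val, y_val in tasks:
        adapted = list(params)                               # theta'_i starts at theta
        for _ in range(inner_steps):                         # inner update, Equation (1)
            inner_loss = loss_fn(forward(adapted, x_tr), y_tr)
            grads = torch.autograd.grad(inner_loss, adapted, create_graph=True)
            adapted = [p - alpha * g for p, g in zip(adapted, grads)]
        meta_loss = meta_loss + loss_fn(forward(adapted, x_val), y_val)

    meta_grads = torch.autograd.grad(meta_loss, params)      # second-order term of Equation (2)
    with torch.no_grad():
        for p, g in zip(params, meta_grads):
            p -= beta * g                                    # outer update of theta
    return float(meta_loss)
```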

maml can be applied to any architecture, but out of the box it only works on a fixed topology. Nichol et al. developed reptile [reptile2018] in order to avoid the heavy computation of second derivatives in maml by approximating Equation (2) as:

$\theta \leftarrow \theta + \beta \, \frac{1}{n} \sum_{i=1}^{n} \left( \theta'_i - \theta \right)$    (3)

where $n$ is the number of tasks in the meta-batch.
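Under the same hypothetical functional setup as in the previous sketch, the reptile approximation of Equation (3) needs only first-order gradients:

```python
import torch

def reptile_meta_step(params, tasks, loss_fn, forward, alpha=0.01, beta=0.001, inner_steps=5):
    """One reptile meta-iteration: move theta toward the task-adapted weights (Equation 3)."""
    deltas = [torch.zeros_like(p) for p in params]
    for x_tr, y_tr, *_ in tasks:
        adapted = [p.detach().clone().requires_grad_(True) for p in params]
        for _ in range(inner_steps):                         # plain first-order inner updates
            loss = loss_fn(forward(adapted, x_tr), y_tr)
            grads = torch.autograd.grad(loss, adapted)
            adapted = [(p - alpha * g).detach().requires_grad_(True)
                       for p, g in zip(adapted, grads)]
        for d, p, a in zip(deltas, params, adapted):
            d += (a - p).detach()                            # accumulate (theta'_i - theta)
    with torch.no_grad():
        for p, d in zip(params, deltas):
            p += beta * d / len(tasks)                       # theta <- theta + beta * mean(theta'_i - theta)
    return params
```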

Finn et al. later extended this work with Probabilistic Model-Agnostic Meta-Learning [Finn2018ProbabilisticMM], which learns a distribution over the model parameters by injecting Gaussian noise into the gradient descent steps.

Figure 2: hidra utilizes a fixed model $M$, but instead of a fixed output layer, the method keeps a single master neuron parameterized with $\omega$. Given a task with $N$ target variables, $\omega$ is replicated $N$ times, forming a task-specific output layer with $N$ neurons parameterized with $\phi_1, \dots, \phi_N$. Dashed lines represent weights assigned to the master node with respect to the latent features from $M$. Dotted lines represent the master node replicating itself and its weights to create the output layer neurons.

Inspired by maml, a recent paper by Rusu et al. called LEO [rusu2018metalearning] proposes a method to sample network parameters of a model for few-shot learning. An additional encoder network takes a task as input and generates a latent embedding that consists of a mean and a standard deviation for each neuron to initialize. These distributions are used to sample the parameters for the respective neurons. During the training process, the latent representation is updated instead of the weights themselves. They show the effectiveness of this approach by achieving state-of-the-art results for few-shot classification. The complexity of the generated latent embedding depends on the number of neurons to initialize, since the approach generates a weight distribution to sample from for each neuron. Due to this computational bottleneck, the authors only generate the weights for an output layer that is placed on top of a pre-trained deep residual network, which is used to generate task embeddings that facilitate learning.

So far, none of these methods are specifically designed to work across tasks with different input and target shapes. The work by Brinkmeyer and Drumond et al., Chameleon [brinkmeyer2019chameleon], targets the problem of meta-learning across tasks with a variable input schema. The authors train a network which transforms the different input schemas of training batches to a fixed representation, enabling meta-learning methods such as reptile to work with tasks with different input sizes by attaching this model to the target network's input layer. Similarly, Dataset2Vec by Jomaa et al. [Jomaa2019Dataset2VecLD] extracts useful meta-features from different data sets to perform hyperparameter optimization.

Our work focuses on meta-learning over tasks with different target variables and is, to our knowledge, the first to directly address this problem.

3 Methodology

Meta-learning approaches like maml train a parameter initialization for a specific model by sampling meta-batches of similar tasks from a set of tasks $p(\mathcal{T})$. In few-shot classification, a task $\mathcal{T}_i$ is represented by a single batch containing $K$ instances for each of the $N$ classes present. Thus, it consists of predictor data $X_i$ and target data $Y_i$. Typically, similar tasks are defined to have the same feature space but different target variables, such that $X_i \in \mathbb{R}^{N \cdot K \times F}$ and $Y_i \in \{1, \dots, N\}^{N \cdot K}$ with $F$ features for every task $\mathcal{T}_i$. As usually defined in the literature, this type of problem setting is referred to as $N$-way-$K$-shot. Thus, the goal is to provide a task with $K$ instances for each of the $N$ selected classes to the model and observe a high accuracy on unseen instances of that task after training. Furthermore, a task comes with a predefined training/test split.
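For illustration, a hypothetical sampler for such $N$-way-$K$-shot tasks (with an additional query set for evaluation) could look as follows; the data layout is an assumption, not prescribed by the paper.

```python
import numpy as np

def sample_task(data_by_class, n_way, k_shot, k_query, rng=np.random):
    """Sample one N-way K-shot task: K support and k_query test instances per class.

    data_by_class: dict mapping class id -> array of instances for that class.
    Returns (X_support, y_support, X_query, y_query) with labels remapped to 0..N-1.
    """
    classes = rng.choice(list(data_by_class.keys()), size=n_way, replace=False)
    xs, ys, xq, yq = [], [], [], []
    for new_label, c in enumerate(classes):
        idx = rng.permutation(len(data_by_class[c]))[: k_shot + k_query]
        inst = np.asarray(data_by_class[c])[idx]
        xs.append(inst[:k_shot]);  ys += [new_label] * k_shot
        xq.append(inst[k_shot:]);  yq += [new_label] * k_query
    return np.concatenate(xs), np.array(ys), np.concatenate(xq), np.array(yq)
```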

Most optimization-based meta-learning approaches operate in two phases: an inner loop and an outer loop. During the inner loop, the model is trained on a specific task, starting from the current weight initialization $\theta$, for $k$ gradient steps. The updated parameter set $\theta'_i$ for a task $\mathcal{T}_i$ is then given by:

$\theta'_i = U^{(k)}_{\mathcal{T}_i}(\theta)$    (4)

where $U^{(k)}_{\mathcal{T}_i}$ is the optimization method used to compute the new weights by performing $k$ inner update steps; for maml this is shown in Equation (1). Afterwards, the performance of the current initialization is evaluated by measuring the validation performance on the same task with those updated weights. The outer loop executes the inner loop for a batch of tasks to update the current initialization with respect to the validation performance. The outer meta-objective is then defined as:

$\min_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^{\mathrm{val}}_{\mathcal{T}_i}\big(f_{U^{(k)}_{\mathcal{T}_i}(\theta)}\big)$    (5)

In maml this outer objective is optimized by relying on the second derivatives as described in Equation 2.
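A minimal sketch of this bi-level structure, with the inner optimizer $U$ passed in as a function (the maml and reptile steps sketched above are two possible instantiations); all names are placeholders.

```python
def meta_train(theta, task_sampler, inner_update, outer_update, val_loss, iterations, batch_size):
    """Generic optimization-based meta-learning loop (Equations 4 and 5).

    inner_update(theta, task) -> theta_i'   implements U^{(k)}_{T_i}       (Equation 4)
    outer_update(theta, losses) -> theta    reduces the summed validation loss (Equation 5)
    """
    for _ in range(iterations):
        tasks = task_sampler(batch_size)
        val_losses = [val_loss(inner_update(theta, t), t) for t in tasks]   # inner loop per task
        theta = outer_update(theta, val_losses)                             # outer meta-update
    return theta
```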

Optimizing a fixed network architecture restricts the model to tasks with the same number of classes. As already stated previously, the learned initialization is required to be invariant to permutations of the class order, since two sampled tasks could have the same instances while having their classes in a different order. This means that there should be no inherent difference between the initializations learned for the different output neurons. At the same time, few-shot classification is always evaluated on unseen classes. Thus, it should be possible to learn a single output-neuron initialization in the outer loop that can be dynamically adapted to each number of classes in the inner loop.

3.1 Hidra

Our method learns a single output neuron, the master node, which is replicated $N$ times during the inner loop for a task with $N$ classes. hidra takes into consideration that the number of classes might vary from task to task. Even when two tasks have the same number of target variables, their labels may represent different classes. In essence, we need a dynamic architecture that works for any number of target variables while the initialization performs equally well for any label. In order to do so, we create the master node parameterized by $\omega$. Replicating it $N$ times creates the output layer for a task that predicts $N$ values. This setup is illustrated in Figure 2.

Given a network architecture $M$ with initial parameters $\theta$, we first randomly initialize the master node with parameters $\omega$. In order to optimize the initialization, a batch of tasks with $N$ classes each is sampled. The number of classes only has to be identical within one meta-batch and can vary over the course of the meta-training. During the inner loop, we generate a temporary output layer with $N$ neurons, each of which is initialized with the current weights of the master neuron $\omega$, so that the weights of the output layer are set as

$\phi_j = \omega \quad \text{for } j = 1, \dots, N$    (6)

This output layer is connected to the top of the base model $M$ to form the task-specific model, which is capable of training on tasks with $N$ classes. We then perform one meta-iteration of maml on this model for some steps on each task $\mathcal{T}_i$ of the sampled meta-batch to compute the task-dependent sets of weights $\theta'_i$ and $\phi'_{i,1}, \dots, \phi'_{i,N}$ using Equation (1), before updating the initial weights $\theta$ and each output neuron $\phi_j$ with Equation (2). Finally, we update the weights of the master neuron $\omega$ by aggregating the updated initial weights of the output neurons $\phi_1, \dots, \phi_N$:

$\omega \leftarrow \frac{1}{N} \sum_{j=1}^{N} \phi_j$    (7)
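A small NumPy sketch of the two hidra-specific operations, assuming the master neuron is stored as a weight vector plus a bias and that the aggregation in Equation (7) is a mean over the output neurons; the function names are placeholders.

```python
import numpy as np

def expand_master(omega_w, omega_b, n_classes):
    """Equation (6): replicate the master neuron into an N-way output layer."""
    W = np.tile(omega_w[None, :], (n_classes, 1))   # shape (N, latent_dim), every row equals omega
    b = np.full(n_classes, omega_b)
    return W, b

def aggregate_master(W, b):
    """Equation (7): collapse the (meta-updated) output layer back into the master neuron.
    The aggregation is assumed here to be a mean over the N output neurons."""
    return W.mean(axis=0), b.mean()
```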

The full approach, showing the inner and outer loops of hidra, is depicted in Algorithm 1.

1:  Select gradient step-sizes $\alpha$ and $\beta$
2:  Initialize meta-data-set $p(\mathcal{T})$
3:  Initialize model $M$ with parameters $\theta$
4:  Initialize output master node $\omega$
5:  for Max-Iterations do
6:     Sample batch of tasks $B$ from $p(\mathcal{T})$ with a random output size $N$
7:     Instantiate output layer $O$ with $N$ neurons $\phi_1, \dots, \phi_N$
8:     for every neuron $\phi_j$ in $O$ do
9:        $\phi_j \leftarrow \omega$
10:     end for
11:     Define network $f_{\theta,\phi}$ by attaching $O$ on top of $M$
12:     $\phi \leftarrow (\phi_1, \dots, \phi_N)$
13:     for every task $\mathcal{T}_i$ in $B$ do
14:        $\theta'_i, \phi'_i \leftarrow \theta, \phi$
15:        for $k$ amount of inner steps do
16:           $(\theta'_i, \phi'_i) \leftarrow (\theta'_i, \phi'_i) - \alpha \nabla \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i, \phi'_i})$   (Equation 1)
17:        end for
18:     end for
19:     Update $\theta$ and $\phi$ with a meta-gradient step of size $\beta$ over all tasks in $B$   (Equation 2)
20:     $\omega \leftarrow \frac{1}{N} \sum_{j=1}^{N} \phi_j$   (Equation 7)
21:  end for
Algorithm 1: hidra Method
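For concreteness, the following is a compact, runnable sketch of one outer-loop iteration of Algorithm 1 in PyTorch. To keep it short it uses a first-order, reptile-style approximation instead of the second derivatives of Equation (2); the feature extractor, loss function, and task format are stand-ins.

```python
import torch

def hidra_meta_iteration(body_params, features, master_w, master_b, tasks,
                         loss_fn, alpha=0.4, beta=0.001, inner_steps=1):
    """One outer-loop iteration of Algorithm 1 with a first-order (reptile-style) meta-update.

    body_params: list of tensors, the shared feature extractor M (theta)
    features:    features(body_params, x) -> latent features of shape (batch, latent_dim)
    master_w:    tensor of shape (latent_dim,), the master neuron weights omega
    master_b:    scalar tensor, the master neuron bias
    tasks:       list of (x, y) batches that all share the same number of classes N
    """
    n_classes = int(max(int(y.max()) for _, y in tasks)) + 1
    d_body = [torch.zeros_like(p) for p in body_params]
    d_w = torch.zeros(n_classes, master_w.numel())
    d_b = torch.zeros(n_classes)

    for x, y in tasks:
        # Equation (6): replicate the master neuron into an N-way output layer.
        head_w = master_w.detach().repeat(n_classes, 1).requires_grad_(True)
        head_b = master_b.detach().repeat(n_classes).requires_grad_(True)
        fast = [p.detach().clone().requires_grad_(True) for p in body_params]
        for _ in range(inner_steps):                        # inner loop (Equation 1)
            logits = features(fast, x) @ head_w.t() + head_b
            grads = torch.autograd.grad(loss_fn(logits, y), fast + [head_w, head_b])
            fast = [(p - alpha * g).detach().requires_grad_(True)
                    for p, g in zip(fast, grads[:-2])]
            head_w = (head_w - alpha * grads[-2]).detach().requires_grad_(True)
            head_b = (head_b - alpha * grads[-1]).detach().requires_grad_(True)
        for d, p, f in zip(d_body, body_params, fast):      # accumulate first-order meta-directions
            d += (f - p).detach()
        d_w += (head_w - master_w).detach()
        d_b += (head_b - master_b).detach()

    with torch.no_grad():                                   # outer update of theta ...
        for p, d in zip(body_params, d_body):
            p += beta * d / len(tasks)
        # ... and Equation (7): aggregate the updated output neurons back into the master neuron.
        master_w += beta * d_w.mean(dim=0) / len(tasks)
        master_b += beta * d_b.mean() / len(tasks)
```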
Figure 3: Test accuracy for Omniglot 10-way 5-shot when using the weights for one output neuron learned via maml to initialize the other output neurons. The dashed line marks the performance of the regular initialization of maml.
Figure 4: Weights of the output layer of the initialization learned via maml when trained on Omniglot 10-way.

4 Experiments

We conduct experiments on the standard few-shot classification data sets Omniglot [lake2011one] and Miniimagenet [ravi2016optimization]. Both are frequently used as few-shot classification benchmarks. Omniglot consists of 1623 handwritten characters, each with 20 instances, taken from 50 different alphabets. We randomly split the data set, with 1200 characters used for training and the rest for testing. The Miniimagenet data set includes 100 classes from ImageNet with 600 instances per class. We utilize the split with 64 classes for training, 16 for validation and 20 for testing, as proposed by Ravi et al. [ravi2016optimization].

Figure 5: Performance comparison between maml and hidra for different numbers of target variables when training on Miniimagenet. The left plot compares hidra models trained with a static number of target variables (i). The right plot compares the best result from the static experiments with the dynamic experiments, in which the number of target variables varies within a range (ii).
Figure 6: Performance comparison between maml and hidra for different numbers of target variables when training on the Omniglot data set, showing the average accuracy on test tasks for each of the initializations. Each graph represents the accuracy for one number of gradient steps. In this setup, both the hidra and maml experiments use an inner learning rate of 0.4.
Figure 7: Performance comparison between maml and hidra for different numbers of target variables when training on the Omniglot data set, showing the average accuracy on test tasks for each of the initializations. Each graph represents the accuracy for one number of gradient steps. In this setup, the hidra experiments use an inner learning rate of 0.01, while maml uses a learning rate of 0.4, which remains its best value.

For all experiments, we use the same model architecture, originally proposed by Vinyals et al. [vinyals2016matching], and the same hyperparameters as in [finn2017model]. It consists of four convolutional blocks, each being a 3x3 convolution followed by batch normalization, a ReLU nonlinearity and 2x2 max pooling. The number of filters is set to 64 for Omniglot and 32 for Miniimagenet. The inner learning rate $\alpha$ for training the model on a specific task is set to 0.01 for Miniimagenet and 0.4 for Omniglot. For the meta-objective in Equation (5), the Adam optimizer [adam] is used with a learning rate $\beta$. We focus on $N$-way 5-shot classification tasks, since this work analyzes varying numbers of classes. During training, the number of inner gradient steps on a task is set to 5 for Miniimagenet and 1 for Omniglot. Furthermore, for every meta-epoch we sample 32 tasks for Omniglot and 4 for Miniimagenet. In contrast to the work of Finn et al. [finn2017model], we conducted the Omniglot experiments without data augmentation for maml and hidra, which leads to a slightly lower accuracy but a faster runtime for evaluating all the different numbers of classes. For evaluation, we aggregate the accuracy across 4000 randomly sampled test tasks, performing up to 10 gradient steps for Miniimagenet and up to 3 gradient steps for Omniglot on the learned initialization. We had to use an alternative implementation of maml due to hardware scalability problems of the original implementation when evaluating tasks with a high number of classes. Running the original code for 2 to 6 classes per task leads to a somewhat higher accuracy compared to the results reported in this work. Since we built hidra on top of the same framework, we can assume that these findings transfer to other meta-learning approaches used for model initialization, including maml.
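For reference, a sketch of this standard four-block backbone in PyTorch; the exact padding and the way the output head is attached are assumptions.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One of the four blocks: 3x3 conv -> batch norm -> ReLU -> 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

def make_backbone(in_channels=3, filters=32):
    """Feature extractor for Miniimagenet (filters=32) or Omniglot (in_channels=1, filters=64)."""
    return nn.Sequential(
        conv_block(in_channels, filters),
        conv_block(filters, filters),
        conv_block(filters, filters),
        conv_block(filters, filters),
        nn.Flatten(),
    )
```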

4.1 Maml

In our first experiment, we analyze the weight initialization of the output layer learned via maml to show that there is no inherent structure between the neurons, motivating the application of hidra. For that, we compare the performance of an initialization learned with maml for 10-way 5-shot Omniglot against the same initialization in which one of the ten learned output neurons is used to initialize every other output neuron. The results, shown in Figure 3, illustrate that the weights learned for a single output neuron with maml are already sufficient to initialize the complete output layer: the average accuracy across each of these single-neuron initializations is on par with the standard initialization that uses all learned output weights. Most importantly, using a single output neuron to initialize the output layer even leads to a higher performance in some of the cases. Visualizing these weights in Figure 4 shows that the output neurons all learn a similar pattern, demonstrating the redundancy of the weights in the output layer learned via maml (contribution 1).
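This probing experiment boils down to the simple operation sketched below on the learned output weight matrix; names and array layout are hypothetical.

```python
import numpy as np

def init_from_single_neuron(output_w, output_b, j):
    """Overwrite every output neuron with the weights of neuron j from a learned maml head."""
    W = np.repeat(output_w[j:j + 1, :], output_w.shape[0], axis=0)  # every row equals row j
    b = np.full_like(output_b, output_b[j])
    return W, b
```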

4.2 Hidra

For our main experiments, we compare the performance of hidra and maml when training on Omniglot and Miniimagenet with a varying number of classes. Our experiments investigate different $N$-way 5-shot problem settings, where $N$ ranges from 2 to 10 classes for Omniglot and 2 to 15 classes for Miniimagenet. In order to compare our approach to maml, which only works with a fixed number of classes, we train and evaluate a separate model with maml for each output size as a baseline. Additionally, the performance of a standard random initialization is tracked for each class size. Finally, we learn various initializations with hidra in two different settings: (i) with a fixed number of target variables and (ii) with a varying number of target variables during meta-learning. Every initialization learned via hidra is then evaluated on each of the different $N$-way settings.

Figure 8: Training and validation accuracy during Meta-Learning on Omniglot displayed at every 1000th epoch.

ACCURACY FOR MINIIMAGENET EXPERIMENTS
(columns: number of labels used in the evaluation task)

Method        2      3      4      5      6      7      8      9      10     11     12     13     14     15
hidra 2       82.95  72.93  65.78  59.96  55.77  51.95  48.76  46.13  43.95  41.85  40.05  38.51  37.01  35.61
hidra 5       77.48  67.58  60.32  55.21  50.43  47.25  44.05  41.54  39.26  37.54  35.73  34.51  33.06  31.8
hidra 8       77.79  67.74  61.24  56.62  52.24  49.02  45.99  43.34  41.08  39.24  37.46  36.01  34.63  33.38
hidra 10      79.94  68.79  62.44  57.68  54.27  51.23  48.36  45.97  43.32  41.38  39.43  37.81  36.37  34.96
hidra 4-6     82.45  73.64  66.8   61.56  57.07  52.93  50.04  47.27  44.86  42.79  40.98  39.33  37.7   36.34
hidra 2-10    82.39  73.2   66.07  60.73  56.49  52.38  49.17  46.7   44.2   42.36  40.48  38.88  37.38  36.03
maml          81.13  70.06  62.62  55.5   50.09  34.27  12.5   11.11  10.4   9.09   8.33   7.69   7.143  6.667

Table 1: Average accuracy for the experiments on Miniimagenet. Each hidra experiment (row) used $N$ labels during training and is evaluated on tasks with different target label sizes (columns). hidra 2-10 and hidra 4-6 are trained on tasks with a variable number of target variables. maml is trained on a fixed output size and evaluated on the same target shape.

ACCURACY FOR OMNIGLOT EXPERIMENTS
(columns: number of labels used in the evaluation task)

Method                        2      3      4      5      6      7      8      9      10
Learning rate 0.4, 1 step:
  hidra 2                     99.7   99.43  99.16  98.91  98.64  98.42  98.21  97.97  97.78
  hidra 5                     99.82  99.64  99.45  99.32  99.15  99.02  98.9   98.74  98.61
  hidra 8                     99.82  99.66  99.52  99.37  99.19  99.07  98.94  98.82  98.69
  hidra 10                    99.8   99.63  99.49  99.32  99.2   99.09  98.97  98.82  98.7
  hidra 5-30                  99.69  99.54  99.31  99.11  99.01  98.71  98.5   98.38  98.05
Learning rate 0.01, 1 step:
  hidra 2                     99.68  99.42  99.18  98.86  98.68  98.45  98.21  98.01  97.75
  hidra 5                     99.79  99.64  99.48  99.29  99.14  99.01  98.86  98.73  98.62
  hidra 8                     99.8   99.66  99.51  99.33  99.24  99.09  98.94  98.79  98.71
  hidra 10                    99.8   99.63  99.49  99.36  99.2   99.07  98.96  98.82  98.71
Learning rate 0.01, 3 steps:
  hidra 2                     99.85  99.69  99.54  99.31  99.13  98.91  98.69  98.48  98.24
  hidra 5                     97.05  97.64  99.17  99.57  99.53  99.46  99.37  99.28  99.19
  hidra 8                     97.73  96.13  96.89  98.27  99.22  99.43  99.42  99.33  99.3
  hidra 10                    98.09  91.89  87.72  87.94  89.96  92.81  95.53  97.65  98.66
maml                          98.16  97.64  97.36  97.12  96.85  96.42  95.35  94.46  94.58

Table 2: Average accuracy for the experiments on Omniglot. Each hidra experiment (row) is trained on tasks containing $N$ target variables and evaluated on tasks with different label sizes (columns). hidra 5-30 is trained on tasks with a variable number of target variables. maml is trained and evaluated on a fixed output size.

4.3 Results and Discussion

The results of the comparison between the different initializations learned with hidra on Miniimagenet and those learned via maml are shown in Figure 5. Note that each data point in the maml graph represents a separate model trained for the respective number of classes, while each hidra graph represents one model which is evaluated for every number of classes. The results for hidra show a similar performance to maml for up to 6 classes per task, with the hidra models trained on 2-way and 10-way problems slightly outperforming maml. Generally, all models initialized with hidra generalize to any number of classes during evaluation (contribution 2). Training on tasks with a varying target size achieves the highest accuracy, with a slight improvement over the hidra initialization trained for 2-way 5-shot classification.

Furthermore, maml fails to generalize to unseen tasks when evaluated on more than 7 classes with 5 instances each when using the architecture described above. The performance of models initialized with hidra, on the other hand, only decreases marginally with an increasing number of classes. Moreover, meta-learning with hidra on tasks with a low number of classes is demonstrated to generalize to those with a high number of classes and vice versa, essentially computing a robust initialization which is independent of the number of target variables (contribution 3). The numerical results for these experiments are given in Table 1.

Experiments evaluating our approach with a varying number of gradient steps on Omniglot are displayed in Figure 6. hidra fails to outperform maml with three gradient steps of size 0.4, as used in [finn2017model]; but whereas maml reaches its highest accuracy after 3 steps, hidra achieves its highest score after a single gradient step on an unseen task. Due to this faster convergence, we also evaluated hidra with a smaller inner learning rate of 0.01 (Figure 7), which shows the best performance on Omniglot when using 3 gradient steps. Numerical results for Omniglot are displayed in Table 2. The meta-learning progress when training hidra on Omniglot for different $N$-way 5-shot settings is illustrated in Figure 8. One can see that the gap between the training and validation error grows with the number of classes per task. The experiments were conducted using an NVIDIA Tesla K80. Training on Miniimagenet takes approximately 3.5 hours for 2 classes and 13 hours for 10 classes, for both hidra and maml. Our code is available online for reproduction purposes at: https://github.com/radrumond/hidra.

5 Conclusion

In this paper, we present a novel approach for learning a task-specific initialization through meta-learning. We show that while maml is capable of learning such an initialization, it is restricted to a fixed number of classes and includes redundancy in the learned output layer, which is demonstrated to hinder learning across tasks with a high number of classes when using a low-capacity model. hidra solves both of these problems by training a single master neuron which is used to dynamically initialize the output neurons. Experiments on common few-shot classification benchmarks demonstrate that a single hidra model generalizes to any number of classes, independent of the number of target variables used during meta-learning. At the same time, this is shown to lead to a more robust architecture which is able to train on tasks with a high number of classes, where maml is not applicable. Finally, using a single model initialized with hidra is shown to improve on the results achieved with a set of models initialized with fixed output layers.

References