Hierarchical Expert Networks for Meta-Learning

10/31/2019 ∙ by Heinke Hihn, et al. ∙ 0

The goal of meta-learning is to train a model on a variety of learning tasks, such that it can adapt to new problems within only a few iterations. Here we propose a principled information-theoretic model that optimally partitions the underlying problem space such that the resulting partitions are processed by specialized expert decision-makers. To drive this specialization we impose the same kind of information processing constraints both on the partitioning and the expert decision-makers. We argue that this specialization leads to efficient adaptation to new tasks. To demonstrate the generality of our approach we evaluate on three meta-learning domains: image classification, regression, and reinforcement learning.






1 Introduction

Intelligent systems are often formalized as decision-makers that learn probabilistic models of their environment and optimize utilities. Such utility functions can represent different classes of problems, such as classification, regression or reinforcement learning. To enable these agents to learn optimal policies, it is usually too costly to enumerate all possibilities and determine the expected utilities. Intelligent agents must instead invest their limited resources such that they optimally trade off utility versus processing costs [Gershman2015], which can be formalized in the framework of bounded rationality [Simon1955]. The information-theoretic approach to bounded rationality [Ortega2013] provides an abstract model to formalize how such agents behave in order to maximize utility within a given resource limit, where resources are quantified by information processing constraints [Edward2014, McKelvey1995, Tishby2011, Wolpert2006].

Intriguingly, the information-theoretic model of bounded rationality can also explain the emergence of hierarchies and abstractions, in particular when multiple bounded rational agents are involved in a decision-making process [Genewein2015]. In this case an optimal arrangement of decision-makers leads to specialization of agents and an optimal division of labor, which can be exploited to reduce computational effort [Gottwald2019]. Such multi-agent decision-making has recently received increased attention in the reinforcement learning literature as a way to deal with complex learning problems [Foerster2017, Ghosh2017, Khan2018]. Here, we introduce a novel gradient-based on-line learning paradigm for hierarchical decision-making systems. Our method finds an optimal soft partitioning of the problem space by imposing information-theoretic constraints on both the coupling between expert selection and on the expert specialization. We argue that these constraints enforce an efficient division of labor in systems that are bounded. As an example, we apply our algorithm to systems that are limited in their representational power—in particular by assuming linear decision-makers that can be combined to solve problems that are too complex for each decision-maker alone.

Recent machine learning research has shown impressive results on incredibly diverse tasks from problem classes such as pattern recognition, reinforcement learning, and generative model learning [devlin2018bert, Mnih2015, Schmidhuber2015]. These success stories typically have two computational luxuries in common: a large database with thousands or even millions of training samples and a very long and extensive training period. However, naïvely applying these pre-trained models to new tasks usually leads to very poor performance, as with each new incoming batch of data, expensive and slow re-learning is required. In contrast, humans are able to learn from very few examples and excel at adapting quickly [jankowski2011meta], for example in motor tasks [braun2009motor] or at learning new visual concepts [lake2015human].

Sample-efficient adaptation to new tasks can be regarded as a form of meta-learning or “learning to learn” [thrun2012learning, schmidhuber1997shifting, caruana1997multitask] and is an ongoing and active field of research–see e.g. [koch2015siamese, vinyals2016matching, Finn2017model, ravi2017optimization, ortega2019meta, botvinick2019reinforcement, yao2019hierarchically]. Meta-learning can be defined in different ways, but a common point is that the system learns on two levels, each with different time scales: slow learning across different tasks on a meta-level, and fast learning to adapt to each task individually.

Here, we propose a novel learning paradigm for hierarchical meta-learning systems. Our method finds an optimal soft partitioning of the problem space by imposing information-theoretic constraints on both the process of expert selection and on the expert specialization. We argue that these constraints drive an efficient division of labor in systems that are bounded in their respective information processing power, where we make use of information-theoretic bounded rationality [Ortega2013]. When the model is presented with previously unseen tasks it assigns them to experts specialized on similar tasks (see Figure 1). Additionally, expert networks that specialize on only a subset of the problem space allow for smaller neural network architectures with only a few units per layer. In order to split the problem space and to assign the partitions to experts, we learn to represent tasks through a common latent embedding that is then used by a selector network to distribute the tasks to the experts.

The outline of this paper is as follows: first we introduce bounded rationality and meta learning, next we introduce our novel approach and derive applications to classification, regression, and reinforcement learning. Finally, we conclude.

Figure 1: The selector assigns the new input encoding to one of the three experts, depending on the similarity of the input to previous inputs seen by the experts.

2 Background

2.1 Bounded Rational Decision Making

An important concept in decision making is the notion of utility [VonNeumann2007], where an agent picks an action $a \in \mathcal{A}$ such that it maximizes its utility in some context $w \in \mathcal{W}$, i.e. $a^*_w = \arg\max_a U(w, a)$, where the utility is given by a function $U(w, a)$ and the state distribution $p(w)$ is known and fixed. Trying to solve this optimization problem naïvely leads to an exhaustive search over all possible $(a, w)$ pairs, which is in general a prohibitive strategy. Instead of finding an optimal strategy, a bounded-rational decision-maker optimally trades off expected utility and the processing costs required to adapt. In this study we consider the information-theoretic free-energy principle [Ortega2013] of bounded rationality, where the decision-maker's resources are modeled by an upper bound $B$ on the Kullback-Leibler divergence $D_{\mathrm{KL}}$ between the agent's prior distribution $p(a)$ and the posterior policy $p(a|w)$, resulting in the following constrained optimization problem:

$$\max_{p(a|w)} \sum_{w} p(w) \sum_{a} p(a|w)\, U(w, a) \quad \text{s.t.} \quad \mathbb{E}_{p(w)}\left[ D_{\mathrm{KL}}\big(p(a|w)\,\|\,p(a)\big) \right] \le B.$$

This constraint can also be interpreted as a regularization on $p(a|w)$. We can transform this into an unconstrained variational problem by introducing a Lagrange multiplier $\beta \in \mathbb{R}^+$:

$$\max_{p(a|w)} \mathbb{E}_{p(w,a)}\left[ U(w, a) \right] - \frac{1}{\beta}\, \mathbb{E}_{p(w)}\left[ D_{\mathrm{KL}}\big(p(a|w)\,\|\,p(a)\big) \right]. \tag{3}$$

For $\beta \to \infty$ we recover the maximum utility solution and for $\beta \to 0$ the agent can only act according to the prior. The optimal prior in this case is given by the marginal $p(a) = \sum_{w} p(w)\, p(a|w)$ [Ortega2013].
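The trade-off above can be illustrated numerically. The following is a minimal sketch, assuming a small discrete problem in which the posterior is proportional to the prior times the exponentiated utility and the prior is repeatedly reset to the marginal of the posterior (a Blahut-Arimoto-style alternation). The function name `bounded_rational_policy` and its arguments are hypothetical and not taken from the paper.

```python
import math

def bounded_rational_policy(utility, p_w, beta, n_iter=200):
    """Alternate posterior and prior updates for a discrete problem.

    utility[w][a] is the utility of action a in context w, p_w the context
    distribution, and beta the resource parameter (illustrative names).
    """
    n_w, n_a = len(utility), len(utility[0])
    prior = [1.0 / n_a] * n_a  # start from a uniform prior (assumption)
    posterior = [list(prior) for _ in range(n_w)]
    for _ in range(n_iter):
        posterior = []
        for w in range(n_w):
            top = max(utility[w])  # subtract the maximum for numerical stability
            weights = [prior[a] * math.exp(beta * (utility[w][a] - top))
                       for a in range(n_a)]
            total = sum(weights)
            posterior.append([v / total for v in weights])
        # The prior becomes the marginal of the posterior over contexts.
        prior = [sum(p_w[w] * posterior[w][a] for w in range(n_w))
                 for a in range(n_a)]
    return posterior, prior
```

With a resource parameter of zero the posterior stays at the prior, while large values concentrate each context's posterior on its highest-utility action.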

2.1.1 Hierarchical Decision Making

Aggregating several bounded-rational agents by a selection policy allows for solving optimization problems that exceed the capabilities of the individual decision-makers [Genewein2015]. To achieve this, the search space is split into partitions such that each partition can be solved by a decision-maker. A two-stage mechanism is introduced: the first stage is an expert selection policy $p(x|w)$ that chooses an expert $x$ given a state $w$, and the second stage chooses an action according to the expert's posterior policy $p(a|w,x)$. The optimization problem given by (3) can be extended to incorporate a trade-off between computational costs and utility in both stages:

$$\max_{p(x|w),\, p(a|w,x)} \mathbb{E}\left[ U(w, a) \right] - \frac{1}{\beta_1} I(W; X) - \frac{1}{\beta_2} I(W; A | X), \tag{4}$$

where $\beta_1$ is the resource parameter for the expert selection stage and $\beta_2$ for the experts. $I(\cdot\,;\cdot)$ is the mutual information between the two random variables. The solution can be found by iterating the following set of equations:

$$p(x|w) = \frac{1}{Z(w)}\, p(x) \exp\big(\beta_1\, \Delta F(w, x)\big), \qquad p(x) = \sum_{w} p(w)\, p(x|w),$$

$$p(a|w,x) = \frac{1}{Z(w,x)}\, p(a|x) \exp\big(\beta_2\, U(w, a)\big), \qquad p(a|x) = \sum_{w} p(w|x)\, p(a|w,x),$$

where $Z(w)$ and $Z(w,x)$ are normalization factors and $\Delta F(w,x) = \mathbb{E}_{p(a|w,x)}\left[ U(w,a) \right] - \frac{1}{\beta_2} D_{\mathrm{KL}}\big(p(a|w,x)\,\|\,p(a|x)\big)$ is the free energy of the action selection stage. Thus the marginal distribution $p(a|x)$ defines a mixture-of-experts policy given by the posterior distributions $p(a|w,x)$ weighted by the responsibilities determined by the Bayesian posterior $p(w|x)$. Note that $p(w|x)$ is not determined by a given likelihood model, but is the result of the optimization process (4).
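To make the alternating scheme concrete, here is a minimal sketch for a small discrete problem. It assumes the standard two-stage form in which the expert posteriors are proportional to the expert prior times the exponentiated utility, and the selector weights each expert by its exponentiated free energy. The function names, the asymmetric initialization, and the small constant `eps` are illustrative assumptions, not the authors' implementation.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [v / total for v in weights]

def two_stage_iteration(U, p_w, n_x, beta1, beta2, n_iter=200, eps=1e-12):
    """Iterate selector p(x|w) and expert posteriors p(a|w,x) on utilities U[w][a]."""
    n_w, n_a = len(U), len(U[0])
    # Slightly asymmetric start so that experts can differentiate (assumption).
    p_x_w = [normalize([1.0 + 0.5 * ((w + x) % n_x == 0) for x in range(n_x)])
             for w in range(n_w)]
    p_a_wx = [[[1.0 / n_a] * n_a for _ in range(n_x)] for _ in range(n_w)]
    for _ in range(n_iter):
        p_x = [sum(p_w[w] * p_x_w[w][x] for w in range(n_w)) for x in range(n_x)]
        # Expert priors p(a|x): posteriors weighted by the responsibilities p(w|x).
        p_a_x = []
        for x in range(n_x):
            resp = normalize([p_w[w] * p_x_w[w][x] + eps for w in range(n_w)])
            p_a_x.append([sum(resp[w] * p_a_wx[w][x][a] for w in range(n_w)) + eps
                          for a in range(n_a)])
        for w in range(n_w):
            free_energy = []
            for x in range(n_x):
                post = normalize([p_a_x[x][a] * math.exp(beta2 * U[w][a])
                                  for a in range(n_a)])
                p_a_wx[w][x] = post
                # Expected utility minus the scaled KL divergence to the expert prior.
                free_energy.append(sum(
                    post[a] * (U[w][a] - math.log(post[a] / p_a_x[x][a]) / beta2)
                    for a in range(n_a) if post[a] > 0))
            top = max(free_energy)
            p_x_w[w] = normalize([(p_x[x] + eps) * math.exp(beta1 * (f - top))
                                  for x, f in enumerate(free_energy)])
    return p_x_w, p_a_wx
```

The sketch returns properly normalized distributions for both stages; how sharply the problem space is partitioned depends on the two resource parameters.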

Figure 2: Our proposed method consists of three main stages. First, the training dataset is passed through a convolutional autoencoder to find a latent representation for each sample, which we get by flattening the preceding convolutional layer (labeled as flattening layer in the figure). This image embedding is then pooled and fed to the selection network.

2.2 Meta Learning

Meta-learning algorithms can be divided roughly into Metric-Learning [koch2015siamese, vinyals2016matching, snell2017prototypical], Optimizer Learning [ravi2017optimization, Finn2017model, zintgraf2018caml], and Task Decomposition Models [lan2019meta, vezhnevets2019options]. Our approach depicted in Figure 2 can be seen as a member of the latter group.

2.2.1 Meta Supervised Learning

In a supervised learning task we are usually interested in a dataset $D$ consisting of multiple input and output pairs, and the learner is tasked with finding a function that maps from input to output, for example through a deep neural network. To do this, we split the dataset into training and test sets, fit a set of parameters on the training data, and evaluate the learned function on the test data. In meta-learning, we are instead working with meta-datasets $\mathcal{D}$, each containing regular datasets $D$ split into training and test sets. We thus have different meta-sets for meta-training, meta-validation, and meta-test ($\mathcal{D}_{\text{meta-train}}$, $\mathcal{D}_{\text{meta-val}}$, and $\mathcal{D}_{\text{meta-test}}$, respectively). On $\mathcal{D}_{\text{meta-train}}$, we are interested in training a learning procedure (the meta-learner) that can take as input one of its training sets $D_{\text{train}}$ and produce a classifier (the learner) that achieves low prediction error on its corresponding test set $D_{\text{test}}$.

A special case of meta-learning for classification are $K$-Shot $N$-way tasks. In this setting, we are given for each dataset $D$ a training set consisting of $K$ labeled examples of each of the $N$ classes ($K \cdot N$ examples per dataset) and corresponding test sets. In our study, we focus on the following variation of $K$-Shot 2-Way tasks: the meta-learner is presented with $2K$ samples ($K$ positive and $K$ negative examples) and must assign this dataset to an expert learner. Note that the negative examples may be drawn from any of the remaining classes.
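As an illustration of this binary task variation, the sketch below builds one such dataset from a dictionary of class samples. It assumes an equal number of positive and negative examples per dataset. The function `make_binary_task` and the data layout are hypothetical.

```python
import random

def make_binary_task(samples_by_class, target, k, rng):
    """Build one binary few-shot dataset for the class `target`.

    Returns a shuffled list of (sample, label) pairs with k positives drawn
    from the target class and k negatives drawn from all remaining classes.
    """
    positives = rng.sample(samples_by_class[target], k)
    negative_pool = [s for cls, items in samples_by_class.items()
                     if cls != target for s in items]
    negatives = rng.sample(negative_pool, k)
    data = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    rng.shuffle(data)
    return data
```

Passing a seeded `random.Random` instance keeps the sampled tasks reproducible across runs.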

2.2.2 Meta Reinforcement Learning

We model sequential decision problems by defining a Markov Decision Process as a tuple $(\mathcal{S}, \mathcal{A}, P, r)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability, and $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function. The aim is to find the parameter $\theta$ of a policy $\pi_\theta$ that maximizes the expected reward:

$$\theta^* = \arg\max_{\theta} J(\theta), \qquad J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ r(\tau) \right].$$

We define $r(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$ as the cumulative reward of trajectory $\tau$, which is sampled by acting according to the policy $\pi_\theta$, i.e. $a_t \sim \pi_\theta(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. Learning in this environment can then be modeled by reinforcement learning [Sutton2018], where an agent interacts with an environment over a number of (discrete) time steps $t$. At each time step $t$, the agent finds itself in a state $s_t$ and selects an action $a_t$ according to the policy $\pi_\theta(a_t|s_t)$. In return, the environment transitions to the next state $s_{t+1}$ and generates a scalar reward $r_t$. This process continues until the agent reaches a terminal state, after which the process restarts. The goal of the agent is to maximize the expected return from each state $s_t$, which is typically defined as the infinite-horizon discounted sum of the rewards. A common choice for achieving this is Q-Learning [Watkins1992], where we make use of an action-value function $Q(s_t, a_t)$ that is defined as the expected discounted sum of rewards $\sum_{k \ge 0} \gamma^k r_{t+k}$, where $\gamma$ is a discount factor. Learning the optimal policy can be achieved in many ways. Here, we consider policy gradient methods [Sutton2000], which are a popular choice to tackle continuous reinforcement learning problems. The main idea is to directly manipulate the parameters $\theta$ of the policy in order to maximize the objective $J(\theta)$ by taking steps in the direction of the gradient $\nabla_\theta J(\theta)$.
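The discounted sum of rewards used in this setting can be computed for every time step of a finished episode with a single backward pass. The following is a small sketch; the function name `discounted_returns` is an illustrative choice.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted sum of future rewards for each time step."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

For example, three unit rewards with a discount factor of 0.5 give the returns 1.75, 1.5 and 1.0.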

In meta reinforcement learning the problem is given by a set of tasks $T$, where each task is defined by an MDP as described earlier. We are now interested in finding a set of policies that maximizes the average cumulative reward across all tasks in $T$ and generalizes well to new tasks sampled from a different set of tasks $T'$.

Input: data distribution, number of samples per dataset, batch size, number of training episodes
Hyper-parameters: resource parameters for selector and experts, learning rates for selector and experts
Initialize selector and expert parameters
for each training episode do
     Sample a batch of datasets, each consisting of a training dataset and a meta-validation dataset
     for each dataset in the batch do
         Find the latent embedding of the training dataset
         Select an expert for the embedding
         Compute the free energy of the selected expert on the meta-validation dataset
     Update the selection parameters with the computed free energies
     Update the autoencoder with the positive samples in the batch
     Update the experts with their assigned training datasets
return selector and expert parameters
Algorithm 1 Expert Networks for Supervised Meta-Learning

3 Expert Networks for Meta-Learning

Information-theoretic bounded rationality postulates that hierarchies and abstractions emerge when agents have only limited access to computational resources [Genewein2015, Gottwald2019, gottwald2019bounded], e.g. limited sampling complexity [Hihn2018] or limited representational power [Hihn2019]. We will show that forming such abstractions equips an agent with the ability to learn the underlying problem structure and thus enables learning of unseen but similar concepts. The method we propose comes out of a unified optimization principle and has the following important features:

  1. A regularization mechanism to enforce the emergence of expert policies.

  2. A task compression mechanism to extract relevant task information.

  3. A selection mechanism to find the most efficient expert for a given task.

  4. A regularization mechanism to improve generalization capabilities.

3.1 Latent Task Embeddings

Note that the selector assigns a complete dataset to an expert and that this can be seen as a meta-learning task, as described in [ravi2017optimization]. To do so, we must find a feature vector of the dataset $D$. This feature vector must fulfill the following desiderata: 1) invariance against permutation of data points in $D$, 2) high representational capacity, 3) efficient computability, and 4) constant dimensionality regardless of sample size. In the following we propose such features for image classification, regression, and reinforcement learning problems.

For image classification we propose to pass the positive images in the dataset through a convolutional autoencoder and use the respective outputs of the bottleneck layer. Convolutional autoencoders are generative models that learn to reconstruct their inputs by minimizing the mean squared error between the input and the reconstructed image (see e.g. [chen2019closer]). In this way we get similar embeddings for similar inputs belonging to the same class. The latent representation is computed for each positive sample in the training dataset and then passed through a pooling function to find a single embedding for the complete dataset (see Figure 2 for an overview of our proposed model). While in principle functions such as mean, max, and min can be used, we found that max pooling yields the best results. The authors of [yao2019hierarchically] propose a similar feature set.
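The pooling step is what makes the dataset embedding independent of the order and number of samples. A minimal sketch, assuming each per-sample embedding is a plain list of floats of equal length; the function name `pool_embeddings` is hypothetical.

```python
def pool_embeddings(embeddings, mode="max"):
    """Pool per-sample embeddings into one fixed-size dataset embedding."""
    pooling = {
        "max": max,
        "min": min,
        "mean": lambda values: sum(values) / len(values),
    }[mode]
    # zip(*embeddings) iterates over embedding dimensions, not over samples.
    return [pooling(dimension) for dimension in zip(*embeddings)]
```

Because every pooling function here is symmetric in its arguments, permuting the samples leaves the pooled embedding unchanged, and its dimensionality does not depend on the number of samples.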

For regression we define a similar feature vector. The training data points are transformed into a feature vector by binning the points into a fixed number of bins according to their respective input value and collecting the respective output value. If more than one point falls into the same bin, the output values are averaged, thus providing invariance against the order of the data points in the training dataset. We use this feature vector to assign each dataset to an expert according to the selection policy.
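The binning feature can be sketched as follows. The input range, the number of bins, and the value used for empty bins (zero here) are assumptions made for illustration; the source does not specify them. The function name `binned_features` is hypothetical.

```python
def binned_features(points, n_bins, x_min, x_max):
    """Average the outputs of (input, output) pairs falling into each input bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    width = (x_max - x_min) / n_bins
    for x, y in points:
        # Clamp the index so boundary values land in a valid bin.
        index = min(max(int((x - x_min) / width), 0), n_bins - 1)
        sums[index] += y
        counts[index] += 1
    # Empty bins are filled with 0.0 (assumption).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

The result has a fixed length of `n_bins` regardless of how many points are supplied, and reordering the points does not change it.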

In the reinforcement learning setting we use a dynamic recurrent neural network (RNN) with LSTM units [hochreiter1997long] to classify trajectories. We feed the RNN with trajectory tuples to describe the Markov Decision Process underlying the task. At $t = 0$ we sample the expert according to the learned prior distribution $p(x)$, as there is no information available so far. The authors of [lan2019meta] propose a similar feature set.

Omniglot Few-Shot Classification Results

| K | 2 experts: % Acc | 2 experts: I(X;W) | 4 experts: % Acc | 4 experts: I(X;W) | 8 experts: % Acc | 8 experts: I(X;W) | 16 experts: % Acc | 16 experts: I(X;W) |
|---|---|---|---|---|---|---|---|---|
| 1 | 76.2 (± 0.02) | 0.99 (± 0.01) | 86.7 (± 0.02) | 1.96 (± 0.01) | 90.1 (± 0.01) | 2.5 (± 0.20) | 92.9 (± 0.01) | 3.2 (± 0.3) |
| 5 | 67.3 (± 0.01) | 0.93 (± 0.01) | 75.5 (± 0.01) | 1.95 (± 0.10) | 78.4 (± 0.01) | 2.7 (± 0.10) | 81.2 (± 0.01) | 3.3 (± 0.2) |
| 10 | 66.4 (± 0.04) | 0.95 (± 0.30) | 75.8 (± 0.01) | 1.90 (± 0.03) | 77.3 (± 0.01) | 2.8 (± 0.15) | 77.8 (± 0.01) | 3.1 (± 0.2) |
Table 1: Classification results for the Omniglot dataset [lake2011one]. We evaluate our system by splitting the dataset into training and validation data (80% - 20%), train the system as described in Algorithm 1, and report the classification accuracy on the validation data, i.e. on classes and samples that are novel to the model. We trained for 50,000 episodes, each with a batch of 32 datasets, and set and .

3.2 Hierarchical On-line Meta-Learning

As discussed in Section 2.1, the aim of the selection network is to find an optimal partition of the experts $x$, such that the selector's expected utility is maximized under an information-theoretic constraint $I(W;X)$, where $\theta$ are the selector's parameters (e.g. weights in a neural network), $x$ the expert and $w$ is an input. Each expert follows a policy $p_\vartheta(a|w,x)$ that maximizes their expected utility $\mathbb{E}_{p_\vartheta(a|w,x)}\left[ U(w,a) \right]$. In the following we introduce our gradient-based on-line learning algorithm to find the optimal partitioning and the expert parameters. We rewrite the optimization problem (4) as

$$\max_{\theta,\, \vartheta} \sum_{w, x, a} p(w)\, p_\theta(x|w)\, p_\vartheta(a|w,x)\, J(w, x, a),$$

where the objective $J(w, x, a)$ is given by

$$J(w, x, a) = U(w, a) - \frac{1}{\beta_1} \log \frac{p_\theta(x|w)}{p(x)} - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|w,x)}{p(a|x)},$$

and $\theta$, $\vartheta$ are the parameters of the selection policy and the expert policies, respectively. Note that each expert policy has a distinct set of parameters $\vartheta_x$, i.e. $\vartheta = \{\vartheta_x\}_x$, but we drop the index $x$ for readability. In the following we will show how we can apply this formulation to classification, regression and reinforcement learning.
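For a single sampled triple of input, expert, and action, the objective can be evaluated directly. The sketch below assumes the usual form in which each mutual-information term is written as an expected log-ratio between a posterior and its prior, scaled by the inverse resource parameter. The function and argument names are hypothetical.

```python
import math

def objective_sample(utility, p_x_given_w, p_x, p_a_given_wx, p_a_given_x,
                     beta1, beta2):
    """Utility minus the information costs of expert and action selection."""
    selection_cost = math.log(p_x_given_w / p_x) / beta1
    action_cost = math.log(p_a_given_wx / p_a_given_x) / beta2
    return utility - selection_cost - action_cost
```

When both posteriors equal their priors the information costs vanish and the objective reduces to the plain utility. Averaging this quantity over sampled triples gives a Monte Carlo estimate of the full objective.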

3.2.1 Application to Supervised Learning

Combining multiple experts can often be beneficial [Kuncheva2004], e.g. in Mixture-of-Experts [Yuksel2012] or Multiple Classifier Systems [Bellmann2018]. Our method can be interpreted as a member of this family of algorithms.

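In accordance with Section 2.1 we define the utility as the negative prediction loss, i.e. $U(\hat{y}, y) = -\mathcal{L}(\hat{y}, y)$, where $\hat{y}_x(w)$ is the prediction of expert $x$ given the input data point $w$ (in the following we will use the shorthand $\hat{y}$) and $y$ is the ground truth. We define the cross-entropy loss as a performance measure for classification and the mean squared error for regression. The objective for expert selection thus is given by

$$\max_{\theta}\; \mathbb{E}_{p_\theta(x|w)}\left[ \hat{f}(w, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|w)}{p(x)} \right],$$

where $\hat{f}(w, x) = \mathbb{E}_{p_\vartheta(\hat{y}|w,x)}\left[ -\mathcal{L}(\hat{y}, y) - \frac{1}{\beta_2} \log \frac{p_\vartheta(\hat{y}|w,x)}{p(\hat{y}|x)} \right]$, i.e. the free energy of the expert, and $\theta$, $\vartheta$ are the parameters of the selection policy and the expert policies, respectively. Analogously, the action selection objective for each expert is defined by

$$\max_{\vartheta}\; \mathbb{E}_{p_\vartheta(\hat{y}|w,x)}\left[ -\mathcal{L}(\hat{y}, y) - \frac{1}{\beta_2} \log \frac{p_\vartheta(\hat{y}|w,x)}{p(\hat{y}|x)} \right].$$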

Figure 3: Here we show the soft-partition found by the selection policy for the sine prediction problem , where are chosen uniformly at each trial. To generate these plots we train a system on or respectively, sample and points and feed the data set to the selection policy. Each color represents a different expert. We can see that the selection policy becomes increasingly more precise as we provide more points per data set (denoted by ) to the system. We set and .

3.2.2 Application to Reinforcement Learning

In the reinforcement learning setup the utility is given by the reward function $r(s, a)$. In maximum entropy RL the regularization penalizes deviation from a fixed uniform prior, but in a more general setting we can discourage deviation from an arbitrary prior policy $p(a)$ by determining the optimal policy as

$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) - \frac{1}{\beta} \log \frac{\pi(a_t|s_t)}{p(a_t)} \right) \right].$$

As discussed in Section 2.1, the optimal prior is the marginal of the posterior policy given by $p(a) = \sum_{s} p(s)\, \pi(a|s)$. We approximate the prior distributions $p(x)$ and $p(a|x)$ by exponential running mean averages of the posterior policies.
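An exponential running mean of this kind can be sketched for a discrete distribution as follows. The decay value and the averaging over a batch of posteriors are illustrative assumptions; the function name `update_prior` is hypothetical.

```python
def update_prior(prior, posterior_batch, decay=0.99):
    """Move the prior towards the mean of a batch of posterior distributions."""
    n = len(posterior_batch)
    batch_mean = [sum(p[i] for p in posterior_batch) / n
                  for i in range(len(prior))]
    # Convex combination, so the result remains a valid distribution.
    return [decay * q + (1.0 - decay) * m for q, m in zip(prior, batch_mean)]
```

A decay close to one makes the prior change slowly, which keeps it close to the long-run marginal of the posterior.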

To optimize the objective we define two separate value functions: one to estimate the discounted sum of rewards and one to estimate the free energy of the expert policies. The discounted reward for the experts is $R_t = \sum_{l=0}^{T} \gamma^l\, r(s_{t+l}, a_{t+l})$, which we learn by parameterizing the value function with a neural network. Similar to the discounted reward we can now define the discounted free energy as $F_t = \sum_{l=0}^{T} \gamma^l\, f(s_{t+l}, x_{t+l}, a_{t+l})$, where $f(s, x, a) = r(s, a) - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|s,x)}{p(a|x)}$. The value function is learned by parameterizing it with a neural network and performing regression with a mean-squared-error loss.
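A discounted free energy of this kind can be computed from a recorded trajectory. The sketch assumes the per-step free energy is the reward minus the scaled log-ratio between the expert's posterior and prior probability of the chosen action. All names are hypothetical.

```python
import math

def discounted_free_energy(rewards, posteriors, priors, beta2, gamma):
    """Discounted sum of per-step free energies along one trajectory.

    posteriors[t] and priors[t] are the probabilities the expert's posterior
    and prior assign to the action taken at step t.
    """
    total = 0.0
    discount = 1.0
    for reward, q, p in zip(rewards, posteriors, priors):
        total += discount * (reward - math.log(q / p) / beta2)
        discount *= gamma
    return total
```

If the posterior never deviates from the prior, the result equals the ordinary discounted reward.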

3.2.3 Expert Selection

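The selector network learns a policy $p_\theta(x|s)$ that assigns states $s$ to expert policies optimally. The resource parameter $\beta_1$ constrains the information processing in this step. For $\beta_1 \to 0$ the selection assigns each state completely randomly to an expert, while for $\beta_1 \to \infty$ the selection becomes deterministic, always choosing the most promising expert $x$. The selector optimizes the following objective:

$$\max_{\theta} J(\theta), \qquad J(\theta) = \mathbb{E}_{p_\theta(x|s)}\left[ \hat{f}(s, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)} \right],$$

where $\hat{f}(s, x) = \mathbb{E}_{p_\vartheta(a|s,x)}\left[ r(s, a) - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|s,x)}{p(a|x)} \right]$, which is the free energy of the expert. The gradient of $J(\theta)$ is then given (up to an additive constant) by

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(x|s)}\left[ \nabla_\theta \log p_\theta(x|s) \left( \hat{f}(s, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)} \right) \right].$$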

Figure 4: The single-expert system is not able to learn the underlying structure of the sine wave, whereas the two-expert system is already able to capture the periodic structure. Adding more experts improves adaptation further, as the results show. We trained for 10,000 episodes, each with a batch of 32 datasets.
Figure 5: Analogously to the rate-distortion curve in rate-distortion theory [Blahut1972, Arimoto1972], we can interpret this curve as a rate-utility curve showing the trade-off between information processing and expected utility (the transparent area represents the standard deviation). Increasing the processing power of the selection stage (i.e. adding more experts) improves adaptation.

The double expectation can be replaced by Monte Carlo estimates, where in practice we use a single sampled tuple $(s, x, a)$ to estimate it. This formulation is known as the policy gradient method [Sutton2000] and is prone to producing high-variance gradients, but the variance can be reduced by using an advantage function instead of the reward [schulman2015high]. The advantage function $A(s, a)$ is a measure of how well a certain action $a$ performs in a state $s$ compared to the average performance in that state, i.e. $A(s, a) = Q(s, a) - V(s)$. Here, $V(s)$ is called the value function and captures the expected cumulative reward when in state $s$, and $Q(s, a)$ is an estimate of the expected cumulative reward achieved in state $s$ when choosing a particular action $a$. Thus the advantage is an estimate of how advantageous it is to pick $a$ in state $s$ in relation to a baseline performance $V(s)$. Instead of learning $V$ and $Q$, we can approximate the advantage function

$$A(s_t, a_t) \approx r(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t),$$

such that we can get away with just learning a single value function $V$. Both the selection network and the selector value network are implemented as recurrent neural networks with LSTM cells [hochreiter1997long]. Both networks share the recurrent cell followed by independent feed-forward layers.
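A one-step approximation of the advantage that needs only a single value function can be sketched as below. It assumes the common temporal-difference form, in which the action value is replaced by the immediate reward plus the discounted value of the next state. The function name `td_advantages` and the bootstrap argument `last_value` are hypothetical.

```python
def td_advantages(rewards, values, gamma, last_value=0.0):
    """One-step advantage estimates from rewards and state-value estimates.

    values[t] is the value estimate of the state at step t; last_value is
    the estimate for the state following the final step (0.0 if terminal).
    """
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else last_value
        advantages.append(reward + gamma * next_value - values[t])
    return advantages
```

These estimates can then take the place of the raw reward in the policy gradient.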

Figure 6: In each meta-update step we sample tasks from the training task set and update the agents. After training is completed we evaluate their respective performance on tasks from the meta test set. Rewards are normalized and the episode horizon is 500 time steps. Results are averaged over 10 random seeds; each agent was trained for 1000 episodes, each with a batch of 64 environments.

3.2.4 Action Selection

The action is sampled from the posterior action distribution of the experts. Each expert maintains a policy for each of the world states and updates it according to the utility/cost trade-off. The advantage function for each expert is given as


The objective of this stage is then to maximize the expected advantage .

4 Empirical Results

4.1 Sinusoid Regression

We adopt this task from [Finn2017model]. In this $K$-shot problem, each task consists of learning to predict a sinusoid whose amplitude and phase are both chosen uniformly, and the goal of the learner is to predict the output for a given input based on only $K$ input-output pairs. Given that the underlying function changes in each iteration, it is impossible to solve this problem with a single learner. Our results show that by combining expert networks, we are able to reduce the generalization error iteratively as we add more experts to our system (see Figure 5). In Figure 4 we show how the system is able to capture the underlying problem structure as we add more experts, and in Figure 3 we visualize what the selector's partition of the problem space looks like.

4.2 Few-Shot Classification

The Omniglot dataset [lake2011one] consists of over 1600 characters from 50 alphabets. As each character has merely 20 samples, each drawn by a different person, this forms a difficult learning task, and the dataset is thus often referred to as the "transposed MNIST" dataset. The Omniglot dataset is regarded as a standard meta-learning benchmark, see e.g. [Finn2017model, vinyals2016matching, ravi2017optimization].

We train the learner on a subset of the dataset (roughly 80%, i.e. about 1300 classes) and evaluate on the remaining classes, thus investigating the ability to generalize to new data. In each round we build the datasets $D_{\text{train}}$ and $D_{\text{test}}$ by selecting a target class and sampling $K$ positive and $K$ negative samples. To generate negative samples we draw images randomly out of the remaining classes. We present the selection network with the feature representation of the positive training samples (see Figure 2), but evaluate the experts' performance on the test samples in $D_{\text{test}}$. In this way the free energy of the experts becomes a measure of how well the expert is able to generalize to new samples of the target class and distinguish them from negative examples. Using this optimization scheme, we train the expert networks to become experts in recognizing a subset of classes. After a suitable expert is selected we train that expert using the samples from the training dataset; see Figure 5 and Table 1 for results. To generate this figure, we ran a 10-fold cross-validation on the whole dataset and show the averaged performance metric and the respective standard deviation across the folds. In both settings "0 bits" corresponds to a single expert, i.e. a single neural network trained on the task.

4.3 Meta Reinforcement Learning

Task Distribution

| Parameter | Training task set | Meta-evaluation task set |
|---|---|---|
| Distance Penalty | [] | [] |
| Goal Position | [0.3, 0.4] | [0, 3] |
| Start Position | [-0.15, 0.15] | [-0.25, 0.25] |
| Motor Torques | | |
| Motor Actuation | [185, 215] | [175, 225] |
| Inverted Control | | |
| Gravity | [0.01, 4.9] | [4.9, 9.8] |

Table 2: All parameters are sampled uniformly from the specified range for each environment. The first range is used for training and the second for meta evaluation.

We create a set of RL tasks by sampling the parameters for the Inverted Double Pendulum problem [Sutton1996] implemented in OpenAI Gym [Brockman2016]. The task is to balance a two-link pendulum in an upward position. We modify inertia, motor torques, reward function, and goal position, and invert the control signal (see Table 2 for details). The control signal $a$ is continuous in the interval [-1, 1] and is generated by a neural network that outputs the parameters $\mu$ and $\sigma$ of a Gaussian. The action is sampled by re-parameterizing the distribution to $a = \mu + \sigma \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$, so that the distribution is differentiable w.r.t. the network outputs.
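The re-parameterized sampling step can be sketched in a few lines. Clipping the sampled action to the control interval is an assumption made here for illustration and is not stated in the source. The function name `sample_action` is hypothetical.

```python
import random

def sample_action(mu, sigma, rng, low=-1.0, high=1.0):
    """Sample an action as mu + sigma * eps with standard normal noise eps.

    Returns the action, clipped to [low, high] (assumption), and the noise.
    """
    eps = rng.gauss(0.0, 1.0)
    action = mu + sigma * eps
    return min(max(action, low), high), eps
```

Because the noise is drawn independently of the network outputs, the unclipped action is a deterministic and differentiable function of them, which is the point of the re-parameterization.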

The meta task set $T'$ is based on the same environment, but the parameter distribution and range are different, providing new but similar reinforcement learning problems. In each episode a batch of environments is sampled and the system is updated accordingly. After training is concluded the system is evaluated on tasks sampled from $T'$. We trained the system for 1000 episodes with 64 tasks from the training task set $T$ and evaluated for 100 system updates on tasks from $T'$. We report the results in Figure 6, where we can see improving performance as more experts are added, and the mutual information in the selection stage indicates that the tasks can be assigned to their respective expert policy.

5 Discussion

We have introduced and evaluated a novel information-theoretic approach to meta-learning. In particular we leveraged an information-theoretic approach to bounded rationality [Leibfried2017, grau2018soft, Hihn2019, Schach2018, Gottwald2019]. Our results show that our method is able to identify sub-regions of the problem set with expert networks. In effect, this equips the system with several initializations covering the problem space and thus enables it to adapt quickly to new but similar tasks. To reliably identify such tasks, we have proposed feature extraction methods for classification, regression and reinforcement learning that could simply be replaced and improved in the future. The strength of our model is that it follows from simple principles that can be applied to a large range of problems. Moreover, the system performance can be interpreted in terms of the information processing performed by the selection stage and the expert decision-makers.

Most other methods for meta-learning such as [Finn2017model] and [ravi2017optimization] try to find an initial parametrization of a single learner, such that it is able to adapt quickly to new problems. This initialization can be interpreted as a compression of the most common task properties over all tasks. Our method, however, learns to identify task properties over a subset of tasks and provides several initializations. Task-specific information is thus directly available instead of becoming available only after several iterations as in [Finn2017model] and [ravi2017optimization]. In principle, this can help to adapt within fewer iterations. In a way our method can be seen as the general case of such monolithic meta-learning algorithms.

Another hierarchical approach to meta-learning is the work of [yao2019hierarchically], where the focus is on learning similarities between completely different problems (e.g. different classification datasets). In this way the partitioning is largely governed by the different tasks. Our study, however, focuses on discovering meta-information within the same task family, where the meta-partitioning is determined solely by the optimization process and can thus potentially discover unknown dynamics and relations within a task family.

Although our method is widely applicable, it suffers from low sample efficiency in the RL domain. An interesting research direction would be to combine our system with model-based RL, which is known to improve sample efficiency. Another research direction would be to investigate our system's performance in continual adaptation tasks, such as in [yao2019hierarchically]. There the system is continuously provided with datasets (e.g. additional classes and samples). Another limitation is the restriction to binary meta-classification tasks, which we leave for future work.


This work was supported by the European Research Council Starting Grant BRISC, ERC-STG-2015, Project ID 678082.