1 Introduction
Intelligent systems are often formalized as decision-makers that learn probabilistic models of their environment and optimize utilities. Such utility functions can represent different classes of problems, such as classification, regression or reinforcement learning. For these agents, enumerating all possibilities and determining their expected utilities in order to learn an optimal policy is usually too costly. Intelligent agents must instead invest their limited resources such that they optimally trade off utility versus processing costs [Gershman2015], which can be formalized in the framework of bounded rationality [Simon1955]. The information-theoretic approach to bounded rationality [Ortega2013] provides an abstract model to formalize how such agents behave in order to maximize utility within a given resource limit, where resources are quantified by information processing constraints [Edward2014, McKelvey1995, Tishby2011, Wolpert2006].
Intriguingly, the information-theoretic model of bounded rationality can also explain the emergence of hierarchies and abstractions, in particular when multiple bounded-rational agents are involved in a decision-making process [Genewein2015]. In this case an optimal arrangement of decision-makers leads to specialization of agents and an optimal division of labor, which can be exploited to reduce computational effort [Gottwald2019]. Such multi-agent decision-making has recently received increased attention in the reinforcement learning literature as a way to deal with complex learning problems [Foerster2017, Ghosh2017, Khan2018]. Here, we introduce a novel gradient-based online learning paradigm for hierarchical decision-making systems. Our method finds an optimal soft partitioning of the problem space by imposing information-theoretic constraints on both the expert selection and the expert specialization. We argue that these constraints enforce an efficient division of labor in systems that are bounded. As an example, we apply our algorithm to systems that are limited in their representational power, in particular by assuming linear decision-makers that can be combined to solve problems that are too complex for each decision-maker alone.
Recent machine learning research has shown impressive results on incredibly diverse tasks from problem classes such as pattern recognition, reinforcement learning, and generative model learning [devlin2018bert, Mnih2015, Schmidhuber2015]. These success stories typically have two computational luxuries in common: a large database with thousands or even millions of training samples and a very long and extensive training period. However, naïvely applying these pre-trained models to new tasks usually leads to very poor performance, as with each new incoming batch of data, expensive and slow relearning is required. In contrast, humans are able to learn from very few examples and excel at adapting quickly [jankowski2011meta], for example in motor tasks [braun2009motor] or at learning new visual concepts [lake2015human]. Sample-efficient adaptation to new tasks can be regarded as a form of meta-learning or "learning to learn" [thrun2012learning, schmidhuber1997shifting, caruana1997multitask] and is an ongoing and active field of research; see e.g. [koch2015siamese, vinyals2016matching, Finn2017model, ravi2017optimization, ortega2019meta, botvinick2019reinforcement, yao2019hierarchically]. Meta-learning can be defined in different ways, but a common point is that the system learns on two levels, each with different time scales: slow learning across different tasks on a meta-level, and fast learning to adapt to each task individually.
Here, we propose a novel learning paradigm for hierarchical meta-learning systems. Our method finds an optimal soft partitioning of the problem space by imposing information-theoretic constraints on both the process of expert selection and on the expert specialization. We argue that these constraints drive an efficient division of labor in systems that are bounded in their respective information processing power, where we make use of information-theoretic bounded rationality [Ortega2013]. When the model is presented with previously unseen tasks it assigns them to experts specialized on similar tasks (see Figure 1). Additionally, expert networks that specialize on only a subset of the problem space allow for smaller neural network architectures with only a few units per layer. In order to split the problem space and to assign the partitions to experts, we learn to represent tasks through a common latent embedding that is then used by a selector network to distribute the tasks to the experts.
The outline of this paper is as follows: first we introduce bounded rationality and meta learning, next we introduce our novel approach and derive applications to classification, regression, and reinforcement learning. Finally, we conclude.
2 Background
2.1 Bounded Rational Decision Making
An important concept in decision making is the notion of utility [VonNeumann2007], where an agent picks an action $a \in \mathcal{A}$ such that it maximizes its utility in some context $s \in \mathcal{S}$, i.e. $a^*_s = \arg\max_a U(s, a)$, where the utility is given by a function $U(s, a)$ and the state distribution $p(s)$ is known and fixed. Trying to solve this optimization problem naïvely leads to an exhaustive search over all possible $(a, s)$ pairs, which is in general a prohibitive strategy. Instead of finding an optimal strategy, a bounded-rational decision-maker optimally trades off expected utility and the processing costs required to adapt. In this study we consider the information-theoretic free-energy principle [Ortega2013] of bounded rationality, where the decision-maker's resources are modeled by an upper bound $B$ on the Kullback-Leibler divergence $D_{\mathrm{KL}}$ between the agent's prior distribution $p(a)$ and the posterior policy $p(a|s)$, resulting in the following constrained optimization problem:

$$\max_{p(a|s)} \sum_{s} p(s) \sum_{a} p(a|s)\, U(s, a) \tag{1}$$

$$\text{s.t.} \quad \mathbb{E}_{p(s)}\left[ D_{\mathrm{KL}}\big(p(a|s) \,\|\, p(a)\big) \right] \le B. \tag{2}$$

This constraint can also be interpreted as a regularization on $p(a|s)$. We can transform this into an unconstrained variational problem by introducing a Lagrange multiplier $\beta > 0$:

$$\max_{p(a|s)} \mathbb{E}_{p(s, a)}\left[ U(s, a) \right] - \frac{1}{\beta}\, \mathbb{E}_{p(s)}\left[ D_{\mathrm{KL}}\big(p(a|s) \,\|\, p(a)\big) \right]. \tag{3}$$

For $\beta \to \infty$ we recover the maximum utility solution and for $\beta \to 0$ the agent can only act according to the prior. The optimal prior in this case is given by the marginal $p(a) = \sum_{s} p(s)\, p(a|s)$ [Ortega2013].
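The trade-off between utility and processing cost can be illustrated with a small numerical sketch. It assumes the well-known Gibbs form of the bounded-rational posterior, $p(a|s) \propto p(a) \exp(\beta U(s,a))$, and alternates it with the marginal prior update. The function name, the toy utility table and the iteration count are illustrative assumptions, not part of the original work.

```python
import math

def bounded_rational_policy(utility, p_state, beta, n_iter=100):
    """Alternate between the Gibbs posterior p(a|s) and the marginal prior p(a).

    utility[s][a] is U(s, a), p_state[s] is p(s), beta is the resource parameter.
    Hypothetical helper for illustration only.
    """
    n_states, n_actions = len(utility), len(utility[0])
    prior = [1.0 / n_actions] * n_actions  # start from a uniform prior
    posterior = []
    for _ in range(n_iter):
        posterior = []
        for s in range(n_states):
            u_max = max(utility[s])  # subtract the max for numerical stability
            weights = [prior[a] * math.exp(beta * (utility[s][a] - u_max))
                       for a in range(n_actions)]
            z = sum(weights)
            posterior.append([w / z for w in weights])
        # the optimal prior is the marginal of the posterior policy
        prior = [sum(p_state[s] * posterior[s][a] for s in range(n_states))
                 for a in range(n_actions)]
    return posterior, prior

# Toy problem (assumed): two states, two actions, each state prefers one action.
utility = [[1.0, 0.0], [0.0, 1.0]]
p_state = [0.5, 0.5]
greedy, _ = bounded_rational_policy(utility, p_state, beta=50.0)
lazy, _ = bounded_rational_policy(utility, p_state, beta=1e-6)
```

With a large $\beta$ the posterior concentrates on the utility-maximizing action in each state, while a $\beta$ close to zero leaves the posterior at the prior.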
2.1.1 Hierarchical Decision Making
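Aggregating several bounded-rational agents by a selection policy allows for solving optimization problems that exceed the capabilities of the individual decision-makers [Genewein2015]. To achieve this, the search space is split into partitions such that each partition can be solved by a decision-maker. A two-stage mechanism is introduced: the first stage is an expert selection policy $p(x|s)$ that chooses an expert $x$ given a state $s$, and the second stage chooses an action according to the expert's posterior policy $p(a|s, x)$. The optimization problem given by (3) can be extended to incorporate a trade-off between computational costs and utility in both stages:

$$\max_{p(a|s,x),\, p(x|s)} \mathbb{E}\left[ U(s, a) \right] - \frac{1}{\beta_1} I(S; X) - \frac{1}{\beta_2} I(S; A | X), \tag{4}$$

where $\beta_1$ is the resource parameter for the expert selection stage and $\beta_2$ for the experts. $I(\cdot\,; \cdot)$ is the mutual information between the two random variables. The solution can be found by iterating the following set of equations [Genewein2015]:

$$\begin{aligned}
p(x|s) &= \frac{1}{Z(s)}\, p(x) \exp\big(\beta_1 \Delta F(s, x)\big), & p(x) &= \sum_{s} p(s)\, p(x|s), \\
p(a|s, x) &= \frac{1}{Z(s, x)}\, p(a|x) \exp\big(\beta_2 U(s, a)\big), & p(a|x) &= \sum_{s} p(s|x)\, p(a|s, x),
\end{aligned} \tag{5}$$

where $Z(s)$ and $Z(s, x)$ are normalization factors and $\Delta F(s, x) = \mathbb{E}_{p(a|s,x)}\left[ U(s, a) \right] - \frac{1}{\beta_2} D_{\mathrm{KL}}\big(p(a|s, x) \,\|\, p(a|x)\big)$ is the free energy of the action selection stage. Thus the marginal distribution $p(a|x)$ defines a mixture-of-experts policy given by the posterior distributions $p(a|s, x)$ weighted by the responsibilities determined by the Bayesian posterior $p(s|x)$. Note that $p(s|x)$ is not determined by a given likelihood model, but is the result of the optimization process (4).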
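One step of the two-stage scheme can be sketched as follows. The sketch assumes that the selection posterior has the Gibbs form $p(x|s) \propto p(x) \exp(\beta_1 \Delta F(s,x))$ and that the free energy of an expert is its expected utility minus the scaled KL divergence to its prior. All function names are hypothetical, and the expert prior is assumed to be strictly positive wherever the posterior is.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions; assumes q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def expert_free_energy(utility_s, posterior, prior, beta2):
    """Expected utility of one expert in state s minus its information cost."""
    expected_utility = sum(p * u for p, u in zip(posterior, utility_s))
    return expected_utility - kl_divergence(posterior, prior) / beta2

def selection_posterior(p_expert, free_energies, beta1):
    """Reweight the expert prior p(x) by the exponentiated free energies."""
    f_max = max(free_energies)  # subtract the max for numerical stability
    weights = [px * math.exp(beta1 * (f - f_max))
               for px, f in zip(p_expert, free_energies)]
    z = sum(weights)
    return [w / z for w in weights]

# Two experts with free energies 1.0 and 0.0 in the current state (toy values).
sharp = selection_posterior([0.5, 0.5], [1.0, 0.0], beta1=100.0)
flat = selection_posterior([0.5, 0.5], [1.0, 0.0], beta1=0.0)
```

A large `beta1` assigns the state almost deterministically to the expert with the higher free energy, whereas `beta1 = 0` returns the prior over experts unchanged.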
2.2 Meta Learning
Meta-learning algorithms can be divided roughly into metric learning [koch2015siamese, vinyals2016matching, snell2017prototypical], optimizer learning [ravi2017optimization, Finn2017model, zintgraf2018caml], and task decomposition models [lan2019meta, vezhnevets2019options]. Our approach, depicted in Figure 2, can be seen as a member of the last group.
2.2.1 Meta Supervised Learning
In a supervised learning task we are usually interested in a dataset $D$ consisting of multiple input and output pairs, and the learner is tasked with finding a function $f$ that maps from input to output, for example through a deep neural network. To do this, we split the dataset into training and test sets, fit a set of parameters $\theta$ on the training data, and evaluate on test data using the learned function $f_\theta$. In meta-learning, we are instead working with meta-datasets $\mathcal{D}$, each containing regular datasets split into training and test sets. We thus have different meta-sets for meta-training, meta-validation, and meta-test ($\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{val}}$ and $\mathcal{D}_{\text{test}}$, respectively). On $\mathcal{D}_{\text{train}}$, we are interested in training a learning procedure (the meta-learner) that can take as input one of its training sets $D_{\text{train}}$ and produce a classifier (the learner) that achieves low prediction error on its corresponding test set $D_{\text{test}}$.

A special case of meta-learning for classification are $K$-shot $N$-way tasks. In this setting, we are given for each dataset $D$ a training set consisting of $K$ labeled examples of each of the $N$ classes ($K \cdot N$ examples per dataset) and corresponding test sets. In our study, we focus on the following variation of $K$-shot 2-way tasks: the meta-learner is presented with $2K$ samples ($K$ positive and $K$ negative examples) and must assign this dataset to an expert learner. Note that the negative examples may be drawn from any of the remaining classes.
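The construction of such a binary few-shot dataset can be sketched in a few lines. This is a minimal illustration under assumed data structures (a dictionary mapping class labels to lists of samples); the function name and the toy data are hypothetical and not taken from the original experiments.

```python
import random

def sample_k_shot_two_way(data_by_class, target, k, rng):
    """Draw k positives from the target class and k negatives from all other classes."""
    positives = rng.sample(data_by_class[target], k)
    # negatives may come from any of the remaining classes
    pool = [x for label, xs in data_by_class.items() if label != target for x in xs]
    negatives = rng.sample(pool, k)
    samples = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
    rng.shuffle(samples)  # the order of the samples carries no information
    return samples

# Toy data (assumed): strings stand in for images.
data = {"a": ["a0", "a1", "a2"], "b": ["b0", "b1"], "c": ["c0", "c1"]}
batch = sample_k_shot_two_way(data, target="a", k=2, rng=random.Random(0))
```

Each call returns $2K$ labeled samples, with label 1 marking the target class and label 0 marking negatives.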
2.2.2 Meta Reinforcement Learning
We model sequential decision problems by defining a Markov Decision Process (MDP) as a tuple $(\mathcal{S}, \mathcal{A}, P, r)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P$ is the transition probability, and $r$ is a reward function. The aim is to find the parameter $\theta$ of a policy $\pi_\theta$ that maximizes the expected reward:

$$\theta^* = \arg\max_{\theta} \mathbb{E}_{\tau \sim \pi_\theta}\left[ r(\tau) \right]. \tag{6}$$

We define $r(\tau) = \sum_{t} r(s_t, a_t)$ as the cumulative reward of trajectory $\tau$, which is sampled by acting according to the policy $\pi_\theta$, i.e. $a_t \sim \pi_\theta(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. Learning in this environment can then be modeled by reinforcement learning [Sutton2018], where an agent interacts with an environment over a number of (discrete) time steps $t$. At each time step $t$, the agent finds itself in a state $s_t$ and selects an action $a_t$ according to the policy $\pi_\theta(a_t|s_t)$. In return, the environment transitions to the next state $s_{t+1}$ and generates a scalar reward $r_t$. This process continues until the agent reaches a terminal state, after which the process restarts. The goal of the agent is to maximize the expected return from each state $s_t$, which is typically defined as the infinite-horizon discounted sum of the rewards. A common way of achieving this is Q-learning [Watkins1992], which makes use of an action-value function defined as the expected discounted sum of rewards $Q(s_t, a_t) = \mathbb{E}\left[ \sum_{k \ge 0} \gamma^k r_{t+k} \right]$, where $\gamma$ is a discount factor. Learning the optimal policy can be achieved in many ways. Here, we consider policy gradient methods [Sutton2000], which are a popular choice for tackling continuous reinforcement learning problems. The main idea is to directly manipulate the parameters $\theta$ of the policy in order to maximize the objective $J(\theta)$ by taking steps in the direction of the gradient $\nabla_\theta J(\theta)$.
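The discounted sum of rewards that appears in this setting can be computed for a finished trajectory with a single backward pass. The following is a generic sketch; the function name is an illustrative choice.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # iterate backwards so each return reuses the one that follows it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
```

A policy gradient method would weight the log-probability gradients of the chosen actions by such returns, or by an advantage estimate derived from them.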
In meta reinforcement learning the problem is given by a set of tasks $\mathcal{T}$, where each task is defined by an MDP as described earlier. We are now interested in finding a set of policies that maximizes the average cumulative reward across all tasks in $\mathcal{T}$ and generalizes well to new tasks sampled from a different set of tasks $\mathcal{T}'$.
3 Expert Networks for Meta-Learning
Information-theoretic bounded rationality postulates that hierarchies and abstractions emerge when agents have only limited access to computational resources [Genewein2015, Gottwald2019, gottwald2019bounded], e.g. limited sampling complexity [Hihn2018] or limited representational power [Hihn2019]. We will show that forming such abstractions equips an agent with the ability to learn the underlying problem structure and thus enables learning of unseen but similar concepts. The method we propose comes out of a unified optimization principle and has the following important features:

A regularization mechanism to enforce the emergence of expert policies.

A task compression mechanism to extract relevant task information.

A selection mechanism to find the most efficient expert for a given task.

A regularization mechanism to improve generalization capabilities.
3.1 Latent Task Embeddings
Note that the selector assigns a complete dataset to an expert and that this can be seen as a meta-learning task, as described in [ravi2017optimization]. To do so, we must find a feature vector of the dataset $D$. This feature vector must fulfill the following desiderata: 1) invariance against permutation of data points in $D$, 2) high representational capacity, 3) efficient computability, and 4) constant dimensionality regardless of sample size $K$. In the following we propose such features for image classification, regression, and reinforcement learning problems.

For image classification we propose to pass the positive images in the dataset through a convolutional autoencoder and use the respective outputs of the bottleneck layer. Convolutional autoencoders are generative models that learn to reconstruct their inputs by minimizing the mean squared error between the input and the reconstructed image (see e.g. [chen2019closer]). In this way we get similar embeddings for similar inputs belonging to the same class. The latent representation is computed for each positive sample in $D$ and then passed through a pooling function to find a single embedding for the complete dataset; see Figure 2 for an overview of our proposed model. While in principle functions such as mean, max, and min can be used, we found that max pooling yields the best results. The authors of [yao2019hierarchically] propose a similar feature set.

For regression we define a similar feature vector. The training data points are transformed into a feature vector by binning the points into bins according to their respective $x$ value and collecting the respective $y$ value. If more than one point falls into the same bin the values are averaged, thus providing invariance against the order of the data points in $D$. We use this feature vector to assign each dataset to an expert according to the selection policy.
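Both dataset features described above are permutation invariant and have a fixed size, which a short sketch makes concrete. The function names, the input range, the number of bins and the value used for empty bins (zero) are assumptions made for illustration; the source does not specify them.

```python
def max_pool(embeddings):
    """Element-wise maximum over a list of equally sized embedding vectors."""
    return [max(column) for column in zip(*embeddings)]

def binned_features(points, n_bins, x_min, x_max):
    """Average the y values of all (x, y) points that fall into the same x bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    width = (x_max - x_min) / n_bins
    for x, y in points:
        # clamp the index so that points on or outside the border stay in range
        idx = min(max(int((x - x_min) / width), 0), n_bins - 1)
        sums[idx] += y
        counts[idx] += 1
    # empty bins are set to zero here, which is an assumption of this sketch
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

points = [(-4.0, 1.0), (-3.9, 3.0), (4.5, 2.0)]
features = binned_features(points, n_bins=10, x_min=-5.0, x_max=5.0)
pooled = max_pool([[1.0, 5.0], [3.0, 2.0]])  # [3.0, 5.0]
```

Reordering `points` leaves `features` unchanged, and its length depends only on `n_bins`, not on the number of samples.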
In the reinforcement learning setting we use a dynamic recurrent neural network (RNN) with LSTM units [hochreiter1997long] to classify trajectories. We feed the RNN with tuples that describe the underlying Markov Decision Process defining the task. At the first time step we sample the expert according to the learned prior distribution over experts, as there is no information available so far. The authors of [lan2019meta] propose a similar feature set.

Table 1: Omniglot few-shot classification results by number of experts. Values in parentheses are standard deviations.

| K | % Acc (2 experts) | I(X;W) (2 experts) | % Acc (4 experts) | I(X;W) (4 experts) | % Acc (8 experts) | I(X;W) (8 experts) | % Acc (16 experts) | I(X;W) (16 experts) |
|---|---|---|---|---|---|---|---|---|
| 1 | 76.2 (± 0.02) | 0.99 (± 0.01) | 86.7 (± 0.02) | 1.96 (± 0.01) | 90.1 (± 0.01) | 2.5 (± 0.20) | 92.9 (± 0.01) | 3.2 (± 0.3) |
| 5 | 67.3 (± 0.01) | 0.93 (± 0.01) | 75.5 (± 0.01) | 1.95 (± 0.10) | 78.4 (± 0.01) | 2.7 (± 0.10) | 81.2 (± 0.01) | 3.3 (± 0.2) |
| 10 | 66.4 (± 0.04) | 0.95 (± 0.30) | 75.8 (± 0.01) | 1.90 (± 0.03) | 77.3 (± 0.01) | 2.8 (± 0.15) | 77.8 (± 0.01) | 3.1 (± 0.2) |
3.2 Hierarchical Online Meta-Learning
As discussed in Section 2.1, the aim of the selection network is to find an optimal partition of the problem space among the experts $x$, such that the selector's expected utility is maximized under an information-theoretic constraint on $I(S; X)$, where $\theta$ are the selector's parameters (e.g. weights in a neural network), $x$ is the expert and $s$ is an input. Each expert follows a policy $p_\vartheta(a|s, x)$ that maximizes its expected utility. In the following we introduce our gradient-based online learning algorithm to find the optimal partitioning and the expert parameters. We rewrite the optimization problem (4) as

$$\max_{\theta, \vartheta} \sum_{s, x, a} p(s)\, p_\theta(x|s)\, p_\vartheta(a|s, x)\, J(s, x, a), \tag{7}$$

where the objective $J(s, x, a)$ is given by

$$J(s, x, a) = U(s, a) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)} - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|s, x)}{p(a|x)}, \tag{8}$$

and $\theta$, $\vartheta$ are the parameters of the selection policy and the expert policies, respectively. Note that each expert policy has a distinct set of parameters $\vartheta_x$, i.e. $\vartheta = \{\vartheta_x\}_x$, but we drop the index for readability. In the following we will show how we can apply this formulation to classification, regression and reinforcement learning.
3.2.1 Application to Supervised Learning
Combining multiple experts can often be beneficial [Kuncheva2004], e.g. in Mixture-of-Experts [Yuksel2012] or Multiple Classifier Systems [Bellmann2018]. Our method can be interpreted as a member of this family of algorithms.
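In accordance with Section 2.1 we define the utility as the negative prediction loss, i.e. $U(\hat{y}, y) = -\mathcal{L}(\hat{y}, y)$, where $\hat{y}$ is the prediction of the expert given the input data point $s$ and $y$ is the ground truth. We define the cross-entropy loss as a performance measure for classification and the mean squared error for regression. The objective for expert selection thus is given by

$$\max_{\theta} \mathbb{E}_{p_\theta(x|s)}\left[ \hat{f}(s, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)} \right], \tag{9}$$

where $\hat{f}(s, x) = \mathbb{E}_{p_\vartheta(\hat{y}|s, x)}\left[ -\mathcal{L}(\hat{y}, y) - \frac{1}{\beta_2} \log \frac{p_\vartheta(\hat{y}|s, x)}{p(\hat{y}|x)} \right]$, i.e. the free energy of the expert, and $\theta$, $\vartheta$ are the parameters of the selection policy and the expert policies, respectively. Analogously, the action selection objective for each expert is defined by

$$\max_{\vartheta} \mathbb{E}_{p_\vartheta(\hat{y}|s, x)}\left[ -\mathcal{L}(\hat{y}, y) - \frac{1}{\beta_2} \log \frac{p_\vartheta(\hat{y}|s, x)}{p(\hat{y}|x)} \right]. \tag{10}$$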
3.2.2 Application to Reinforcement Learning
In the reinforcement learning setup the utility is given by the reward function $r(s, a)$. In maximum entropy RL the regularization penalizes deviation from a fixed uniform prior, but in a more general setting we can discourage deviation from an arbitrary prior policy $p(a)$ by determining the optimal policy as

$$\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) - \frac{1}{\beta} \log \frac{\pi(a_t|s_t)}{p(a_t)} \right) \right]. \tag{11}$$

As discussed in Section 2.1, the optimal prior is the marginal of the posterior policy given by $p(a) = \sum_{s} p(s)\, \pi(a|s)$. We approximate the prior distributions $p(x)$ and $p(a|x)$ by exponential running mean averages of the posterior policies.
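An exponential running mean of the posterior can be sketched as follows for a discrete distribution. The function name and the decay rate are illustrative assumptions; the source does not state the value it uses.

```python
def update_prior(prior, posterior, decay=0.99):
    """Move the prior a small step towards the most recent posterior."""
    return [decay * p + (1.0 - decay) * q for p, q in zip(prior, posterior)]

prior = [0.5, 0.5]
for _ in range(1000):
    # repeatedly observing the same posterior pulls the prior towards it
    prior = update_prior(prior, [0.9, 0.1])
```

Because both arguments are normalized and the update is a convex combination, the running prior remains a valid distribution.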
To optimize the objective we define two separate value functions: one to estimate the discounted sum of rewards and one to estimate the free energy of the expert policies. The discounted reward for the experts is $R_t = \sum_{l \ge 0} \gamma^l r(s_{t+l}, a_{t+l})$, which we learn by parameterizing the value function with a neural network. Similar to the discounted reward we can now define the discounted free energy as $F_t = \sum_{l \ge 0} \gamma^l f(s_{t+l}, x_{t+l}, a_{t+l})$, where $f(s, x, a) = r(s, a) - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|s, x)}{p(a|x)}$. The value function is learned by parameterizing it with a neural network and performing regression on the mean squared error.
3.2.3 Expert Selection
The selector network learns a policy $p_\theta(x|s)$ that assigns states $s$ to expert policies optimally. The resource parameter $\beta_1$ constrains the information processing in this step. For $\beta_1 \to 0$ the selection assigns each state completely randomly to an expert, while for $\beta_1 \to \infty$ the selection becomes deterministic, always choosing the most promising expert $x$. The selector optimizes the following objective:

$$\max_{\theta} \mathbb{E}_{p_\theta(x|s)}\left[ \hat{f}(s, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)} \right], \tag{12}$$

where $\hat{f}(s, x) = \mathbb{E}_{p_\vartheta(a|s, x)}\left[ r(s, a) - \frac{1}{\beta_2} \log \frac{p_\vartheta(a|s, x)}{p(a|x)} \right]$, which is the free energy of the expert. The gradient of $J$ is then given (up to an additive constant) by

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(x|s)}\left[ \nabla_\theta \log p_\theta(x|s) \left( \hat{f}(s, x) - \frac{1}{\beta_1} \log \frac{p_\theta(x|s)}{p(x)} \right) \right]. \tag{13}$$

The double expectation can be replaced by Monte Carlo estimates, where in practice we use a single tuple $(s, x, a)$ for $\hat{f}(s, x)$. This formulation is known as the policy gradient method [Sutton2000]. It is prone to producing high-variance gradients, but the variance can be reduced by using an advantage function instead of the reward [schulman2015high]. The advantage function $A(s_t, a_t)$ is a measure of how well a certain action $a_t$ performs in a state $s_t$ compared to the average performance in that state, i.e. $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. Here, $V(s_t)$ is called the value function and captures the expected cumulative reward when in state $s_t$, and $Q(s_t, a_t)$ is an estimate of the expected cumulative reward achieved in state $s_t$ when choosing a particular action $a_t$. Thus the advantage is an estimate of how advantageous it is to pick $a_t$ in state $s_t$ in relation to a baseline performance $V(s_t)$. Instead of learning $V$ and $Q$, we can approximate the advantage function

$$A(s_t, a_t) \approx r(s_t, a_t) + \gamma V_\varphi(s_{t+1}) - V_\varphi(s_t), \tag{14}$$

such that we can get away with just learning a single value function $V_\varphi(s_t)$. Both the selection network and the selector value network are implemented as recurrent neural networks with LSTM cells [hochreiter1997long]. Both networks share the recurrent cell followed by independent feed-forward layers.
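An advantage estimate that needs only a single value function can be sketched with the one-step temporal-difference residual, $r_t + \gamma V(s_{t+1}) - V(s_t)$. This particular form is a common choice and is assumed here for illustration; the function name and the toy numbers are hypothetical.

```python
def td_advantages(rewards, values, gamma):
    """One-step advantage estimates r_t + gamma * V(s_{t+1}) - V(s_t).

    values holds one entry per visited state, including the final state,
    so it must contain len(rewards) + 1 numbers (0.0 for a terminal state).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

advantages = td_advantages(rewards=[1.0, 0.0], values=[0.5, 1.0, 0.0], gamma=0.9)
```

A positive entry indicates that the chosen action did better than the value baseline predicted for that state, a negative entry that it did worse.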
3.2.4 Action Selection
The action is sampled from the posterior action distribution of the experts. Each expert maintains a policy for each of the world states and updates those according to the utility/cost trade-off. The advantage function for each expert is given as
(15) 
The objective of this stage is then to maximize the expected advantage .
4 Empirical Results
4.1 Sinusoid Regression
We adopt this task from [Finn2017model]. In this $K$-shot problem, each task consists of learning to predict a function of the form $y = a \sin(x + b)$, with both $a$ and $b$ chosen uniformly, and the goal of the learner is to find $y$ given $x$ based on only $K$ pairs of $(x, y)$. Given that the underlying function changes in each iteration it is impossible to solve this problem with a single learner. Our results show that by combining expert networks, we are able to reduce the generalization error iteratively as we add more experts to our system; see Figure 5 for both settings. In Figure 4 we show how the system is able to capture the underlying problem structure as we add more experts and in Figure 3 we visualize what the selector's partition of the problem space looks like.
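Sampling one such regression task can be sketched as follows. The sinusoid form $y = a \sin(x + b)$ and the default ranges for amplitude, phase and inputs are assumptions chosen for illustration, since the source text does not state the ranges; the function name is hypothetical as well.

```python
import math
import random

def sample_sinusoid_task(k, rng, amp_range=(0.1, 5.0),
                         phase_range=(0.0, math.pi), x_range=(-5.0, 5.0)):
    """Draw amplitude and phase uniformly, then k points of y = a * sin(x + b)."""
    amplitude = rng.uniform(*amp_range)
    phase = rng.uniform(*phase_range)
    xs = [rng.uniform(*x_range) for _ in range(k)]
    ys = [amplitude * math.sin(x + phase) for x in xs]
    return (amplitude, phase), list(zip(xs, ys))

(amplitude, phase), support_set = sample_sinusoid_task(k=5, rng=random.Random(1))
```

Every call produces a new function, so a learner only ever sees $K$ points of the task it has to fit.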
4.2 Few-Shot Classification
The Omniglot dataset [lake2011one] consists of over 1600 characters from 50 alphabets. As each character has merely 20 samples, each drawn by a different person, this forms a difficult learning task and is thus often referred to as the "transposed MNIST" dataset. The Omniglot dataset is regarded as a standard meta-learning benchmark, see e.g. [Finn2017model, vinyals2016matching, ravi2017optimization].
We train the learner on a subset of the dataset (1300 classes) and evaluate on the remaining classes, thus investigating the ability to generalize to new data. In each round we build the training and test datasets by selecting a target class and sampling $K$ positive and $K$ negative samples. To generate negative samples we draw images randomly out of the remaining classes. We present the selection network with the feature representation of the positive training samples (see Figure 2), but evaluate the experts' performance on the test samples. In this way the free energy of the experts becomes a measure of how well the expert is able to generalize to new samples of the target class and distinguish them from negative examples. Using this optimization scheme, we train the expert networks to become experts in recognizing a subset of classes. After a suitable expert is selected we train that expert using the samples from the training dataset; see Figure 5 and Table 1 for results. To generate this figure, we ran a 10-fold cross-validation on the whole dataset and show the averaged performance metric and the respective standard deviation across the folds. In both settings "0 bits" corresponds to a single expert, i.e. a single neural network trained on the task.
4.3 Meta Reinforcement Learning
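Table 2: Task distribution.

| Parameter | Training task set | Meta task set |
|---|---|---|
| Distance Penalty | [] | [] |
| Goal Position | [0.3, 0.4] | [0, 3] |
| Start Position | [0.15, 0.15] | [0.25, 0.25] |
| Motor Torques | | |
| Motor Actuation | [185, 215] | [175, 225] |
| Inverted Control | | |
| Gravity | [0.01, 4.9] | [4.9, 9.8] |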
We create a set of RL tasks by sampling the parameters for the Inverted Double Pendulum problem [Sutton1996] implemented in OpenAI Gym [Brockman2016]. The task is to balance a two-link pendulum in an upward position. We modify inertia, motor torques, reward function, and goal position, and invert the control signal (see Table 2 for details). The control signal is continuous in the interval $[-1, 1]$ and is generated by a neural network that outputs the mean $\mu$ and standard deviation $\sigma$ of a Gaussian. The action is sampled by reparameterizing the distribution to $a = \mu + \sigma \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$, so that the distribution is differentiable w.r.t. the network outputs.
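The reparameterized sampling step can be sketched without any deep learning library. The function name is hypothetical, and clipping the sampled action to the control interval is an assumption of this sketch rather than something the source specifies.

```python
import random

def sample_action(mu, sigma, rng, low=-1.0, high=1.0):
    """Sample a = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)  # noise that does not depend on the network outputs
    action = mu + sigma * eps  # a deterministic function of mu and sigma given eps
    return max(low, min(high, action))  # clipping is an assumption of this sketch

rng = random.Random(0)
actions = [sample_action(mu=0.2, sigma=0.5, rng=rng) for _ in range(100)]
```

Because the randomness enters only through `eps`, gradients of a downstream loss can flow through `mu` and `sigma` when the same expression is written in an automatic differentiation framework.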
The meta task set is based on the same environment, but the parameter distributions and ranges are different, providing new but similar reinforcement learning problems. In each episode a number of environments are sampled and the system is updated accordingly. After training is concluded the system is evaluated on tasks sampled from the meta task set. We trained the system for 1000 episodes with 64 tasks from the training task set and evaluated it for 100 system updates on tasks from the meta task set. We report the results in Figure 6, where we can see improving performance as more experts are added, and the mutual information in the selection stage indicates that the tasks can be assigned to their respective expert policy.
5 Discussion
We have introduced and evaluated a novel information-theoretic approach to meta-learning. In particular we leveraged an information-theoretic approach to bounded rationality [Leibfried2017, grau2018soft, Hihn2019, Schach2018, Gottwald2019]. Our results show that our method is able to identify sub-regions of the problem set with expert networks. In effect, this equips the system with several initializations covering the problem space and thus enables it to adapt quickly to new but similar tasks. To reliably identify such tasks, we have proposed feature extraction methods for classification, regression and reinforcement learning that could simply be replaced and improved in the future. The strength of our model is that it follows from simple principles that can be applied to a large range of problems. Moreover, the system performance can be interpreted in terms of the information processing performed by the selection stage and the expert decision-makers.
Most other methods for meta-learning such as [Finn2017model] and [ravi2017optimization] try to find an initial parametrization of a single learner, such that it is able to adapt quickly to new problems. This initialization can be interpreted as compression of the most common task properties over all tasks. Our method, however, learns to identify task properties over a subset of tasks and provides several initializations. Task-specific information is thus directly available instead of becoming available only after several iterations as in [Finn2017model] and [ravi2017optimization]. In principle, this can help to adapt within fewer iterations. In a way our method can be seen as the general case of such monolithic meta-learning algorithms.
Another hierarchical approach to meta-learning is the work of [yao2019hierarchically], where the focus is on learning similarities between completely different problems (e.g. different classification datasets). In this way the partitioning is largely governed by the different tasks. Our study, however, focuses on discovering meta-information within the same task family, where the meta-partitioning is determined solely by the optimization process and can thus potentially discover unknown dynamics and relations within a task family.
Although our method is widely applicable, it suffers from low sample efficiency in the RL domain. An interesting research direction would be to combine our system with model-based RL, which is known to improve sample efficiency. Another research direction would be to investigate our system's performance in continual adaptation tasks, such as in [yao2019hierarchically]. There the system is continuously provided with datasets (e.g. additional classes and samples). Another limitation is the restriction to binary meta-classification tasks, which we leave for future work.
Acknowledgments
This work was supported by the European Research Council Starting Grant BRISC, ERCSTG2015, Project ID 678082.