Existing artificial intelligence (AI) algorithms available today constitute the so-called narrow AI landscape, meaning that they have been designed, trained, and optimized by human engineers to solve a single, specific task or a very narrow collection of closely related problems. Although such algorithms sometimes outperform humans in their established skill-set, they are not able to extend their capabilities to new domains. This limits their re-usability, potentially increases the amount of data required to train them, and leaves them lacking generality and extensibility to higher order reasoning.
In contrast, algorithms capable of overcoming these limitations could eventually converge towards a repertoire of functionalities akin to a human-level skill-set. Such algorithms might be able to learn to come up with creative solutions for a wide range of multi-domain tasks. Systems employing such algorithms, often grouped under the umbrella term general AI systems, are viewed by many as the ultimate lever for solving many of humanity’s direst problems [Bostrom2014, Domingos, Stone2016].
Gradual learning is the ability of a learning system to learn in a gradual manner. This is likely a necessary precondition for acquiring such a broad set of skills and for solving a wide range of disparate problems [Rosa2016c, Kumaran2016a, Tommasino2016, Rusu2016-pp, Balcan2015, Pentina2015, Ruvolo2013, Hamker2001a]. Unfortunately, most existing algorithms struggle to deal with many tasks drawn from different distributions at the same time, or with tasks whose distributions drift from one another [Ditzler2015], and they frequently exhibit catastrophic forgetting once applied to new problems, especially under realistic computational constraints [Rusu2016-pp, Fernando2017, Hamker2001a, Kirkpatrick2016, French1999, Gershman2015]. Gradual learning is therefore a difficult and unsolved problem that requires a focused investigation by the community. The General AI Challenge provides a platform for such directed focus.
2 Round One: Gradual Learning
The aim of the first round of the challenge is to build an agent that exhibits gradual learning [Rosa2016c] as evidenced by solving a curriculum of increasingly complex and disparate sets of tasks. Despite the similarity in name to gradual learning in [Kumaran2016a] and other related concepts such as cumulative [Tommasino2016], incremental [Rusu2016-pp], life-long [Balcan2015, Pentina2015, Ruvolo2013, Hamker2001a] or continual learning [Kirkpatrick2016], in our setting, the concept encompasses a broader class of requirements that an agent needs to satisfy. These include:
Exploiting previously learned knowledge
Fast adaptation to unseen problems
Avoiding catastrophic forgetting
The above requirements are expressed in two primary objectives of the first round, each with its own set of evaluation criteria:
(Quantitative) Agent capable of passing an evaluation curriculum in the shortest number of simulation steps. Passing requires the ability to exploit previously learned knowledge and avoid catastrophic forgetting, all within a predefined time limit.
Alternatively, a description of a conceptual method for achieving gradual learning is also sought:
(Qualitative) An idea, concept, or design that shows the best promise for scalable gradual learning. A working AI agent is desirable but not necessary.
The fundamental focus of the challenge is on gradual learning, i.e. on the efficient re-use of already gained knowledge. Competitors are required to develop solutions with the following properties:
(Gradual Learning) An ability to re-use previously gained knowledge for the acquisition of subsequent knowledge to more efficiently solve hitherto new and yet unseen problems, while avoiding catastrophic forgetting.
It is important to note here not only the re-use of existing knowledge, but also the focus on the efficiency of solving new and unseen problems. Unlike compositionality or other concepts similar to gradual learning, our definition transcends the boundaries of meta-learning [Duan2017, Wang2016, Duan2017a, Santoro2016, Chen2016, Chen2016a]. We seek agents that are able to learn to quickly adapt to new and unseen tasks, while exploiting the gradual structure of the underlying problems they are solving. In other words, agents need to possess the ability to search for more efficient strategies to solve new problems.
Naturally, when solving new problems, agents must be able to retain the ability to solve old tasks. In many systems, the lack of such ability causes catastrophic forgetting [French1999], which must be avoided:
(Avoiding Catastrophic Forgetting) Avoiding the loss of information, relevant for one task, due to the incorporation of knowledge necessary for a new task.
In summary, the aim is not to optimize an agent’s performance on existing skills, i.e. how good an agent is at delivering solutions to problems it has already encountered. Instead, we wish to optimize an agent’s performance on new and unseen problems, i.e. to maximize the speed of convergence to ‘acceptable solutions’ on new problems, while exploiting existing knowledge and ensuring its survival during the acquisition of new information.
To fully appreciate our setting, below we provide background on why graduality can be beneficial and how it can be encouraged. This is followed by a more formal description of the challenge requirements, environment and evaluation procedures.
3.1 Benefits of Graduality
Given a complex task that needs to be solved, frequently a good strategy for finding a solution is to break the problem down into smaller problems which are easier to deal with. The same applies to learning [Bengio2009-fd, Salakhutdinov2013-on]. It can be much faster to learn things gradually than to try to learn a complex skill from scratch [Alexander2015-mx, Zaremba2015-ia, Gulcehre2016-mb, Oquab2014-ql]. One example of this is the hierarchical decomposition of a task into subtasks and the gradual learning of skills necessary for solving each of them [Krueger2009-vv, Vezhnevets2017, Lee2017, Andreas2017], progressing from the bottom of the hierarchy to the top [Mhaskar2016-vq, Mhaskar2016-du, Poggio2015-hg, Polya2014-vb]. A prime example in the natural world is the gradual acquisition of motor skills by infants during their first years of life [Adolph2017].
3.2 Guidance through Curricula
Exploiting graduality during learning is clearly beneficial. Building systems that learn in a gradual manner should therefore be encouraged, and their benefits and limitations explored further. To enable this type of learning, one can control and guide the learning process. Guided learning, also called curriculum learning [Gulcehre2016-ht, Bengio2009-fd, Vapnik2015a], provides control by presenting the learner with parts of the problem in the order and to the extent that is likely to be most beneficial at that point in time. One method of providing such order is in the form of a learning curriculum [Bengio2009-fd], akin to a curriculum used in schools. In this scenario, easier topics are taught before more complex ones, in order to exploit the gradual nature of the taught knowledge.
Gradual and guided learning has a number of other benefits over other types of learning [Gulcehre2016-mb, Gulcehre2016-ht, Bengio2009-fd, Pan2010-en]. For example, optimizing a model that has few parameters and gradually building up to a model with many parameters could be more efficient than starting with a model that has many parameters from the beginning [Stanley2002-nm]. In this case, a smaller number of new parameters is learned at each step [Chen2015-py, Kirkpatrick2016, Andreas2015-av, Rusu2016-pp]. This might also reduce the need for exploration. Furthermore, a priori knowledge of the system’s architecture might not be necessary [Zhou2012-ve, Rusu2016-pp, Ganegedara2016-cv], and architecture size can be dynamically derived during training to correspond to the complexity of the given problems [Fahlman1990-sp]. Last but not least, reuse of already learned skills is feasible and encouraged [Andreas2015-av, Andreas2016-at]. Once a skill is acquired, it is no longer relevant how long the skill took to discover. The cost of using an existing skill is notably smaller than that of searching for a skill from scratch.
4 Learning to Gradually Learn
Having described the objectives of the challenge and established the benefits of gradual learning through a curriculum, we will now focus on formal definitions and descriptions of the requirements of the first round of the challenge.
4.1 Instances, Tasks & Curricula
The gradual learning ability of an agent is evaluated by subjecting the agent to a sequence of tasks of increasing complexity. Such a sequence is called a curriculum:
(Curriculum) A curriculum is an n-tuple of tasks of increasing complexity. Tasks are endowed with an arbitrary measure of complexity.
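The curriculum definition above can be illustrated with a minimal sketch. The `Task` class, its `complexity` field and both helper functions are hypothetical names introduced only for illustration; the challenge code may represent tasks quite differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Task:
    """A task with a name and a scalar complexity score (hypothetical fields)."""
    name: str
    complexity: float


def make_curriculum(tasks) -> Tuple[Task, ...]:
    """Return the tasks as an n-tuple ordered by increasing complexity."""
    return tuple(sorted(tasks, key=lambda t: t.complexity))


def is_valid_curriculum(curriculum) -> bool:
    """Check that complexity never decreases along the curriculum."""
    return all(a.complexity <= b.complexity
               for a, b in zip(curriculum, curriculum[1:]))
```

Any measure that induces an ordering could be stored in `complexity`; the sketch deliberately leaves the choice of measure open.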
In section 8.3, we present one possible way of measuring task complexity in a principled manner and thereby ensuring proper ordering of curricula. It relies on a measure from the field of computational mechanics, namely statistical complexity [Crutchfield1994].
Each task is observed by an agent through task instances. The publicly available curriculum provided as part of the challenge [Stransky2017] can be seen as a curriculum of distributions over stochastic decision problems (SDPs), in particular over a variant of partially observable Markov decision processes (POMDPs) or, more generally, partially observable stochastic games (POSGs).
(Task and Instance) A task is a distribution over a set of Partially Observable Markov Decision Processes. An instance of a task is a sample from said distribution.
In other words, a single instance of a task is a POMDP. A POMDP is an 8-tuple $(S, A, O, P, \Omega, R, \gamma, H)$ where $S$ is a set of states, $A$ a set of actions, $O$ a set of observations, $P$ a transition probability distribution, $\Omega$ an observation probability distribution, $R$ a reward function, $\gamma$ a discount factor and $H$ a horizon. In the standard POMDP formulation, the goal of an agent is to maximize its future discounted reward $\mathbb{E}\left[\sum_{t=0}^{H} \gamma^{t} R(b_t, a_t)\right]$, where the expectation is over the sequence of the agent’s belief state/action pairs $(b_0, a_0, b_1, a_1, \ldots)$, where $b_0$ is an initial belief state, $a_t \sim \pi_\theta(a_t \mid b_t)$, and $\pi_\theta$ denotes a policy parameterized by $\theta$.
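As a small illustration of the discounted reward objective in the standard formulation, the following sketch sums a finite reward sequence with geometric discounting. The function name and its arguments are assumptions made for this example only.

```python
def discounted_return(rewards, gamma, horizon=None):
    """Return sum_t gamma**t * r_t over at most `horizon` rewards.

    `rewards` is the sequence of rewards collected along one trajectory,
    starting at t = 0. If `horizon` is None the whole sequence is used.
    """
    if horizon is not None:
        rewards = rewards[:horizon]
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

The expectation in the objective would be estimated by averaging this quantity over many sampled trajectories.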
Unlike in the standard setting, however, the goal in the challenge round is different and distributed across multiple levels of hierarchy. We are not interested in maximizing the agent’s future discounted reward, but rather in a more complex set of objectives at different scales. For example, at the instance level it is sufficient to find an acceptable solution to each instance, as described in Algorithm 1. Figure 1 shows the three levels of hierarchy present in the challenge:
Fast learning (A): Policy Search. At the individual instance level, the agent is required to find a solution (e.g. a policy) to a single POMDP/task instance.
Slow learning (B): Meta-Policy Discovery. The agent needs to discover a meta-strategy (e.g. a meta-policy) that quickly converges to a solution across all instances.
Fast adaptation (C): Policy Transfer. The agent is required to exploit existing acquired knowledge and policies in order to adapt to each new task in a curriculum and solve it faster.
Whether it is possible to find a correspondence between the challenge objectives and the standard reward formulation remains to be seen and up to competitors to determine.
In addition to the above hierarchy, a number of objectives and constraints need to be satisfied by competing agents. The primary goal of an agent is the completion of the entire curriculum in as short a time as possible.
(Quantitative Objective) Successful completion of an evaluation curriculum in the fewest time-steps among all competitors and within 24 hours from the start of the evaluation process on predefined evaluation hardware. Given a set of competing agents $\mathcal{A}$, the fastest agent $\alpha^{*}$ is determined according to:
$$\alpha^{*} = \arg\min_{\alpha \in \mathcal{A}} \tau(\alpha, \mathcal{C}),$$
where $\tau(\alpha, \mathcal{C})$ corresponds to the value returned by Algorithm 2, i.e. the number of simulation steps it took agent $\alpha$ to successfully complete curriculum $\mathcal{C}$.
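The selection of the fastest agent can be sketched as a plain minimum over step counts. The dictionary layout and the use of `None` for agents that did not finish in time are assumptions of this example, not part of the challenge specification.

```python
def fastest_agent(steps_by_agent):
    """Return the name of the agent with the fewest simulation steps.

    `steps_by_agent` maps an agent name to the number of steps it needed to
    complete the curriculum, or to None if it did not finish within the
    time limit (an assumed convention). Returns None if nobody finished.
    """
    finished = {a: n for a, n in steps_by_agent.items() if n is not None}
    if not finished:
        return None
    return min(finished, key=finished.get)
```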
In addition to satisfying the condition set forth in Definition 3, the agent also has to show that two further conditions are met, namely gradual learning and avoiding catastrophic forgetting.
(Gradual Learning) A manifestation of Property 1, exhibited through a reduced number of computational steps required when solving a task ensuing the solving of previous tasks, i.e. $\tau(\alpha_{i,j}, T_k) < \tau(\alpha, T_k)$, where $\tau(\cdot, T_k)$ is the number of steps needed to solve task $T_k$ and $\alpha_{i,j}$ denotes agent $\alpha$ that has already learned to solve tasks $T_i$ and $T_j$.
The above condition does not ensure that gradual learning truly occurs, nevertheless it provides sufficient evidence that an agent improves its performance on subsequent tasks, having already solved other tasks before.
(Avoiding Catastrophic Forgetting) A manifestation of Property 2 through the maintenance of fast convergence to an acceptable solution for already solved tasks, i.e. the number of steps the agent needs to solve a previously solved task again remains bounded by some constant of the agent.
This condition ensures that learning to solve a new task does not impair the ability to solve previously encountered tasks.
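One possible reading of the two conditions above, expressed as simple checks on measured step counts, is sketched below. Both function names are hypothetical, and treating the forgetting condition as a fixed upper bound on re-solving time is an assumption of this sketch.

```python
def shows_gradual_learning(steps_with_prior_tasks, steps_from_scratch):
    """True if solving a task after earlier tasks took fewer steps
    than solving the same task without having solved them."""
    return steps_with_prior_tasks < steps_from_scratch


def avoids_forgetting(resolve_steps, bound):
    """True if every previously solved task can be solved again within
    `bound` steps (assumed form of the agent-specific constant)."""
    return all(n <= bound for n in resolve_steps)
```

As noted in the text, passing such checks is evidence of gradual learning rather than proof of it.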
4.2 Validation Curricula
To test generalization of gradual learning, an agent trained on a single curriculum cannot be effectively evaluated on tasks from the same curriculum. A different curriculum is necessary. Curricula appropriate for a gradually learning agent also form a distribution from which a training and an evaluation curriculum should be drawn. The mini and micro tasks used in the challenge (described in section 6) together form one sample from this distribution. Using a single sample curriculum to estimate the distribution might be too difficult, so competitors might need to create their own curricula. Examples of a number of possible final tasks from alternative curricula were shown in [Rosa2017].
As mentioned previously, simply maximizing reward is neither sufficient nor desired. Immediately after the agent reaches an acceptable performance on an instance, the environment presents it with the next sampled instance. Upon the successful completion of a number of instances from the same task, determined according to Algorithm 1, the environment presents the agent with the next task in the curriculum. This can be seen in Algorithm 2.
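The progression through instances and tasks described above can be sketched as a nested loop. This is not a transcription of Algorithm 2: the callback `attempt_instance`, the requirement of consecutive successes, and the step budget standing in for the time limit are all assumptions made for illustration.

```python
def run_curriculum(tasks, attempt_instance, required_successes, step_budget):
    """Return the total number of steps used to finish all tasks,
    or None if the step budget is exceeded first.

    `attempt_instance(task)` is a hypothetical callback that runs one
    freshly sampled instance of `task` and returns (success, steps_used).
    """
    total_steps = 0
    for task in tasks:
        successes = 0
        while successes < required_successes:
            success, steps = attempt_instance(task)
            total_steps += steps
            if total_steps > step_budget:
                return None
            # Assumed rule: a failed instance resets the success streak.
            successes = successes + 1 if success else 0
    return total_steps
```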
The EnvStep and AgentStep functions in Algorithm 1 update the environment and the agent respectively, while exchanging reward, input and output. Together, they form the core of the environment-agent communication loop, depicted in Figure 2.
The helper functions SoftLimit and HardLimit are used to compute the respective time step limits for a given instance. As the naming suggests, there are two types of limits considered when running an instance: a soft limit and a hard limit. Both limits are expressed as a number of environment steps. They are computed dynamically because they depend on the current instance; in fact, they can even evolve as the agent progresses through the instance.
The soft limit can be thought of as a success limit. The agent is required to solve the instance within this limit in order for this attempt to count as successful. If it fails to do so, the prospective solution will no longer be considered as successful. The instance, however, does not end yet. This allows the agent to continue exploring and gather some additional useful knowledge about the unsolved problem.
A forceful instance termination comes only when the hard limit is reached. This behavior is beneficial in situations where an agent gets stuck in a particular instance.
The reasoning behind such limits is as follows: the agent has no way of communicating its needs, for example the need to change the instance to a different one or to switch to the previous task. The limits can therefore be thought of as a substitute for such communication.
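The interplay of the two limits can be sketched as follows. The sketch assumes fixed limits and a hypothetical `step_fn` callback, whereas in the challenge the limits are computed by SoftLimit and HardLimit and may change during an instance.

```python
def run_instance(step_fn, soft_limit, hard_limit):
    """Run one instance and return (counts_as_success, steps_used).

    `step_fn(t)` is a hypothetical callback that advances the agent and the
    environment by one step and returns True once the instance is solved.
    A solution found after `soft_limit` steps does not count as a success,
    and the instance is forcefully terminated after `hard_limit` steps.
    """
    for t in range(1, hard_limit + 1):
        if step_fn(t):
            return (t <= soft_limit, t)
    return (False, hard_limit)
```

Steps taken between the soft and the hard limit are not wasted: the agent can still use them to gather knowledge about the unsolved problem.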
6 Learning Environment
Participants are provided with an environment and a public curriculum of tasks of increasing complexity [Stransky2017]. Both the environment and the associated tasks are specifically constructed to minimize distraction from unrelated research problems and to focus primarily on gradual learning and associated obstacles. The environment and tasks are built on top of the CommAI-env [Mikolov2015, Baroni2017].
Participants are asked to develop and train their agents on the public curriculum and possibly any other additional knowledge or data sources that they have at their disposal. The public set of tasks provides a reference example of how the evaluation curriculum might be structured and of the type of tasks that an agent will be required to solve. Tasks come in two groups: simple micro-tasks and more advanced mini-tasks. It is important to note that natural language processing is not necessary to solve the tasks.
Mini-tasks are based directly on the CommAI-mini task set [Baroni2017], with focus on simple grammar processing and deciding language membership problems for strings. The mini-tasks are simple for an educated and biased human, but they are tremendously complex for an unbiased agent that receives only sparse reward. For this reason, we believe that simpler tasks are necessary. We refer to those as micro-tasks.
The purpose of micro-tasks is to allow the agent to acquire a sufficient repertoire of prior knowledge, necessary for discovering the solutions to the more complex mini-tasks in a reasonable time, under the assumption that the agent is capable of gradual learning.
Decomposing mini-tasks into simpler problems yields a number of skills that the agent should learn first. For each such skill, there is a separate micro-task. The micro-tasks build on top of each other in a gradual and compositional way, relying on the assumption that the agent can make use of them efficiently.
One could argue that the micro-tasks (and even the mini-tasks) are too simple and can be solved by hard-coding the necessary knowledge into the agent. This is indeed true. However, such a solution is not desirable and goes against the spirit of the challenge - to learn new skills from data. The challenge will prevent any hard-coded solution by using hidden evaluation tasks which are different from the public curriculum. A hard-coded solution that does not exhibit gradual learning should fail on such evaluation tasks.
6.3 Environment-Agent Interface
The interface between the agent and the environment is depicted in Figure 2. The environment sends reward and data (one byte) to the agent and receives the agent’s action (one byte) in response. This happens in a continuous cycle. During a single simulation step, the environment processes the received action from the agent and sends reward with new data to the agent; the agent processes this input and sends an action back to the environment.
Specifically, the environment sends 8 bits of data to the agent at every simulation step and receives 8 bits of data back from the agent. This is different from the original CommAI-env [Mikolov2015], which sends only a single bit instead of a full byte every time. Although this makes the interface slightly less flexible, the agent’s communication should be more interpretable and it should reduce the complexity of the presented problems.
Note that the environment is not sending any special information about what task the agent is in at any point in time. The environment appears as a continuous stream of bytes (and rewards) to the agent.
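The byte-level communication cycle can be sketched with a toy environment. `CopyEnv` and `interaction_loop` are hypothetical and do not reproduce the challenge API; the reward rule (echo the previous byte) and the initial action of 0 are assumptions chosen to keep the example small.

```python
class CopyEnv:
    """Toy environment: rewards the agent for echoing the previous byte."""

    def __init__(self, stream):
        self.stream = list(stream)  # bytes to present, cycled forever
        self.pos = 0
        self.expected = None        # byte the agent should echo next

    def step(self, action):
        """Consume the agent's action, return (reward, next data byte)."""
        reward = 1 if self.expected is not None and action == self.expected else 0
        data = self.stream[self.pos % len(self.stream)]
        self.pos += 1
        self.expected = data
        return reward, data


def interaction_loop(agent_step, env_step, n_steps):
    """Alternate environment and agent steps, one byte in each direction."""
    action = 0  # assumed initial action before the agent has acted
    total_reward = 0
    for _ in range(n_steps):
        reward, data = env_step(action)            # environment processes the action
        total_reward += reward
        action = agent_step(reward, data) & 0xFF   # agent replies with one byte
    return total_reward
```

Note that the agent sees only `(reward, data)` pairs; nothing in the stream marks instance or task boundaries.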
Looking back at Figure 1, it is apparent that there is a significant gap between the structure of the curriculum, the objectives to be learned within, and what kind of information the agent receives. Most notably, there is no explicit information about a successful completion of an instance or a task. Similarly, no indication of an instance or a task switch is provided. The lack of information and limited amount of feedback that an agent has at its disposal makes this round challenging and different from most other learning problems.
Moreover, unlike standard POMDPs, for many of the tasks in the public curriculum, the reward is adaptive. This can be very challenging due to the drifting nature of the reward function. Another challenging aspect of the first round is the inability to request previous tasks from the environment. Hence, competing agents need to be sample efficient as well as ‘epoch’ efficient.
8 Task Complexity via Computational Mechanics
Computational mechanics [Crutchfield2013], a subfield of physics, is concerned with exploring the ways in which nature performs computations: how to identify structure in natural processes and how to measure it, as well as how information processing is embedded in dynamical behavior. The field offers a number of useful tools for thinking about and quantifying complex systems, some of which are relevant for building intelligent machines. Namely, we are interested in the concept of ε-machines [Crutchfield1989] and the associated measure of statistical complexity $C_\mu$.
In our scenario, these concepts and tools can be useful for exploring curricula of the challenge, from the point of view of the complexity of each individual task and the ability to begin investigating the possibility of automating the creation of more principled curricula for both training and evaluation.
8.1 ε-Machines as Minimal Optimal Models
The concept of ε-machines as a class of minimal optimal predictors was developed for quantifying structure in a stochastic process via the reconstruction of the underlying causal states of a natural process [Crutchfield1994]. In their most common form, they can be thought of as a type of hidden Markov model (HMM) whose states have a unique, causal definition. ε-machines are, however, not limited to an HMM representation and can in fact be used at various levels of a representation hierarchy, depending on whether the current representation hits the limits of the agent’s computational resources [Crutchfield1994].
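An ε-machine can be constructed with the help of an equivalence relation
$$\overleftarrow{y} \sim_\epsilon \overleftarrow{y}' \iff P(\overrightarrow{Y} \mid \overleftarrow{Y} = \overleftarrow{y}) = P(\overrightarrow{Y} \mid \overleftarrow{Y} = \overleftarrow{y}'), \qquad (1)$$
where $Y_t$ denotes a discrete random variable, $\overleftarrow{Y}$ and $\overrightarrow{Y}$ a block of past (e.g. $\overleftarrow{Y} = \ldots Y_{-2} Y_{-1}$) and future random variables, respectively, with $\overleftarrow{y}$ and $\overrightarrow{y}$ denoting a particular history or a future (a sequence of symbols from alphabet $\mathcal{Y}$) of a generating process, respectively. The equivalence relation (1) states that any two histories $\overleftarrow{y}$ and $\overleftarrow{y}'$ are equivalent if the probability distribution over their futures is the same, i.e. $P(\overrightarrow{Y} \mid \overleftarrow{y}) = P(\overrightarrow{Y} \mid \overleftarrow{y}')$. Given the above equivalence relation, one can then define a mapping from the space of histories to ‘causal’ states of the underlying process. Such a mapping is called an ε-map; it takes a given past $\overleftarrow{y}$ to its corresponding causal state $\sigma \in \mathcal{S}$:
$$\epsilon(\overleftarrow{y}) = \{\overleftarrow{y}' : \overleftarrow{y}' \sim_\epsilon \overleftarrow{y}\}.$$
Intuitively, one can think of $\epsilon$ as a function that partitions the set of pasts according to the equivalence relation (1), into clusters that lead to the same distribution over futures.
We can then define the ε-machine as a tuple $(\mathcal{Y}, \mathcal{S}, \mathcal{T})$ where $\mathcal{Y}$ denotes the process’ alphabet, $\mathcal{S}$ its causal state set, and $\mathcal{T}$ a set of symbol-valued transition matrices:
$$\mathcal{T} = \{T^{(y)} : y \in \mathcal{Y}\},$$
where $T^{(y)}$ comprises the following elements:
$$T^{(y)}_{ij} = P(S_1 = \sigma_j, Y_0 = y \mid S_0 = \sigma_i),$$
where $S_0$ and $S_1$ denote two temporally consecutive random variables of a random variable chain defining the causal state process that underlies the stochastic process we are modeling. Each $S_t$ takes on some value $\sigma \in \mathcal{S}$. The set $\mathcal{T}$ thus defines the dynamics over causal states of the underlying system. An ε-machine is then a unique, minimal-size, maximally predictive unifilar representation of the observed process [Crutchfield1989].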
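To make symbol-valued transition matrices concrete, the sketch below encodes the two-state Golden Mean Process (binary sequences with no two consecutive 0s), a standard textbook example that is not taken from the challenge curriculum. The dictionary layout and helper names are assumptions of this example.

```python
# T[y][i][j] = P(next state j, emitted symbol y | current state i).
# State 0 may emit 0 or 1; state 1 (reached after a 0) must emit 1.
GOLDEN_MEAN = {
    "0": [[0.0, 0.5],
          [0.0, 0.0]],
    "1": [[0.5, 0.0],
          [1.0, 0.0]],
}


def row_sums(T):
    """Total outgoing probability of each state, summed over all symbols."""
    n = len(next(iter(T.values())))
    return [sum(T[y][i][j] for y in T for j in range(n)) for i in range(n)]


def is_unifilar(T):
    """True if each (state, symbol) pair leads to at most one next state."""
    return all(sum(1 for p in row if p > 0) <= 1
               for M in T.values() for row in M)
```

Unifilarity is what lets an observer who knows the current state track the state from the emitted symbols alone.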
Despite their desirable minimality and optimality properties, ε-machines are inherently able to model generative processes only. In order to exploit the useful properties that the ε-machine formulation offers for the analysis of our curriculum, we need to go beyond output processes and use a slightly more complex representation, that of input-output processes, namely ε-transducers.
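In our scenario, we are interested in observing the structure of not only one process, but of a coupling between two stochastic processes, which can also be viewed as the analysis of a communication channel between an agent and its environment. The concept of an equivalence relation can be extended, for this type of scenario, over joint (input-output) pasts:
$$\overleftarrow{(x,y)} \sim_\epsilon \overleftarrow{(x,y)}' \iff P(\overrightarrow{Y} \mid \overrightarrow{X}, \overleftarrow{(X,Y)} = \overleftarrow{(x,y)}) = P(\overrightarrow{Y} \mid \overrightarrow{X}, \overleftarrow{(X,Y)} = \overleftarrow{(x,y)}'),$$
defining a basis for a channel’s unique, maximally predictive, minimal statistical complexity unifilar presentation, called the ε-transducer. Similarly to ε-machines, a map from pasts to states can be created that maps a given joint input-output past $\overleftarrow{(x,y)}$ to its corresponding channel causal state:
$$\epsilon(\overleftarrow{(x,y)}) = \{\overleftarrow{(x,y)}' : \overleftarrow{(x,y)}' \sim_\epsilon \overleftarrow{(x,y)}\}.$$
The ε-transducer is then defined as the tuple $(\mathcal{X}, \mathcal{Y}, \mathcal{S}, \mathcal{T})$, where, in contrast to ε-machines, $\mathcal{X}$ additionally defines the set of inputs and $\mathcal{T}$ the set of conditional transition probabilities:
$$\mathcal{T} = \{T^{(y|x)} : x \in \mathcal{X}, y \in \mathcal{Y}\},$$
where $T^{(y|x)}$ has elements:
$$T^{(y|x)}_{ij} = P(S_1 = \sigma_j, Y_0 = y \mid S_0 = \sigma_i, X_0 = x).$$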
The ability of ε-transducers to model input-output mappings enables the incorporation of actions and makes the modeling of tasks in a curriculum feasible.
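A minimal sketch of an input-output machine is a delay channel whose output is its previous input; its state is simply the last input seen. This is an illustrative example, not one of the challenge tasks, and the nested-dictionary layout and function names are assumptions. In the challenge setting, one could think of the input as the agent's action and the output as the environment's response.

```python
def delay_channel():
    """Build T[x][y][i][j] = P(next state j, output y | state i, input x)
    for a binary channel that outputs the previous input."""
    T = {}
    for x in (0, 1):
        T[x] = {}
        for y in (0, 1):
            # From state i the channel emits y == i and moves to state x.
            T[x][y] = [[1.0 if (y == i and j == x) else 0.0 for j in (0, 1)]
                       for i in (0, 1)]
    return T


def run_deterministic(T, state, inputs):
    """Feed `inputs` through a deterministic transducer, return its outputs."""
    outputs = []
    for x in inputs:
        moves = [(y, j) for y, M in T[x].items()
                 for j, p in enumerate(M[state]) if p > 0]
        assert len(moves) == 1, "this sketch handles deterministic channels only"
        y, state = moves[0]
        outputs.append(y)
    return outputs
```

A stochastic channel would need sampling from the conditional distribution in place of the single-move lookup.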
8.3 Statistical and Structural Complexity
Having defined both ε-machines and ε-transducers, we can now define a distribution over causal states for ε-machines as
$$\pi(\sigma) = P(S_0 = \sigma), \quad \sigma \in \mathcal{S},$$
and an input-dependent state distribution for ε-transducers:
$$\pi_X(\sigma) = P_X(S_0 = \sigma), \quad \sigma \in \mathcal{S}.$$
In both cases, this defines the asymptotic probability of a process being in any one of its causal states. This information can then be used for quantifying the complexity of the underlying generating or input-output process, respectively. This measure, also called statistical complexity in the case of ε-machines, is defined as $C_\mu = H[\pi]$, where $\mu$ points to the fact that it is a measure over an asymptotic distribution, and $H[\cdot]$ denotes the Shannon entropy [Shannon48]. On the other hand, the ε-transducer’s input-dependent statistical complexity is defined as
$$C_X = H[\pi_X].$$
The upper bound on ε-transducer complexity is then the channel complexity, calculated as the supremum of statistical complexity over input processes:
$$\overline{C}_\mu = \sup_X C_X,$$
with topological state complexity being
$$C_0 = \log_2 |\mathcal{S}|,$$
and, due to the fact that uniform distributions maximize Shannon entropy, $\overline{C}_\mu \le C_0$ in general [Barnett2015a].
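As a worked example of statistical complexity as the Shannon entropy of the asymptotic causal-state distribution, the sketch below uses the two-state Golden Mean Process, a standard textbook example that is not part of the challenge. The stationary distribution is approximated by power iteration; the function names and matrix layout are assumptions of this example.

```python
import math

# T[y][i][j] = P(next state j, emitted symbol y | current state i).
GOLDEN_MEAN = {
    "0": [[0.0, 0.5], [0.0, 0.0]],
    "1": [[0.5, 0.0], [1.0, 0.0]],
}


def stationary_distribution(T, iterations=1000):
    """Approximate the asymptotic state distribution by power iteration."""
    n = len(next(iter(T.values())))
    # State-to-state matrix: sum the symbol-labelled matrices.
    P = [[sum(T[y][i][j] for y in T) for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def shannon_entropy(dist):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def statistical_complexity(T):
    """Entropy of the asymptotic causal-state distribution."""
    return shannon_entropy(stationary_distribution(T))


def topological_complexity(T):
    """Log of the number of states, an upper bound on the entropy above."""
    return math.log2(len(next(iter(T.values()))))
```

For this process the state distribution is (2/3, 1/3), giving roughly 0.918 bits, below the one-bit topological bound for two states.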
Statistical complexity is a widely used measure of complexity in physics and complexity science. It has a number of benefits over other measures and intuitive interpretations. For example, it measures the amount of information a process needs to store in order to be able to reconstruct a given signal, or in the case of structural complexity, it defines the capacity of a communication channel. In the case of tasks in our challenge, it could provide a principled way of measuring an upper bound on task complexity that can be beneficial in the construction of structured curricula and subsequent automatic generation of tasks of increasing complexity.
9 Agent-agnostic Task and Curriculum Representation
Modeling tasks of the challenge curriculum with the help of ε-transducers allows for an agent-agnostic representation of the complexity of the problem each task tackles. It not only provides a way to measure the complexity of each task, but also to compare and contrast entire task curricula and reason about the inherent compositionality and graduality of an entire curriculum (see Figure 2(b)). Preliminary results for a number of tasks from the challenge curriculum can be seen in the appendix.
9.1 Task and Curriculum Complexity
Using statistical and structural complexity, we are able to start reasoning about the correctness of the curriculum order. This allows for better-defined curricula and the subsequent possibility of automating the curriculum generation process with guaranteed characteristics.
9.2 Task and Curriculum Graduality
The minimal representation of tasks also allows us to start comparing how much structure is shared among tasks within a curriculum. This can be beneficial in designing and measuring the intrinsic graduality of a curriculum. As ε-machines and ε-transducers are inherently graphical models, graph-theoretic tools and algorithms, such as subgraph matching, can be used to compare tasks and measure the level of shared structure that can eventually be exploited by an agent with gradual learning capabilities.
In this work, we have outlined the structure and goals of the first round of the General AI Challenge, focused on building agents that can learn gradually. We have argued that building such agents requires solving problems at three different levels of hierarchy: from fast learning, akin to a search for a policy at the level of an instance of a task, through slow learning that requires the discovery of a meta-policy that generalizes across instances, all the way to fast adaptation to new tasks, akin to policy transfer between disparate problems. Furthermore, we have proposed the use of a method from computational mechanics, namely ε-transducers, for modeling tasks in a curriculum in a way that allows for quantifying task complexity and potentially the graduality of entire curricula. Explicit calculation of the complexity of tasks and its subsequent use for building better curricula is left for future work.
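One naive way to quantify shared structure between two small labelled transition graphs is a brute-force search over state relabellings, sketched below. This is only an illustration of the idea, exponential in the number of states, and not a method proposed or used in the challenge; the edge format `(source, label, target)` is an assumption.

```python
from itertools import permutations


def graph_states(edges):
    """All states that appear as a source or target of an edge."""
    return sorted({s for (s, _, _) in edges} | {d for (_, _, d) in edges})


def max_shared_edges(edges_a, edges_b):
    """Largest number of labelled edges that coincide under any injective
    mapping of the smaller graph's states into the larger graph's states."""
    states_a, states_b = graph_states(edges_a), graph_states(edges_b)
    if len(states_a) > len(states_b):
        return max_shared_edges(edges_b, edges_a)
    best = 0
    for image in permutations(states_b, len(states_a)):
        mapping = dict(zip(states_a, image))
        shared = sum((mapping[s], label, mapping[d]) in edges_b
                     for (s, label, d) in edges_a)
        best = max(best, shared)
    return best
```

Dividing the result by the number of edges in the smaller graph would give a rough overlap score between 0 and 1.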
We would like to thank the entire GoodAI team, in particular Olga Afanasjeva and Marek Rosa.
Appendix A Appendix
All figures in this section are simplified visualizations of ε-transducers, each for a particular task. These simplifications come in four flavors:
Only two instances, out of numerous possible ones, are shown, as others would only be repetitions of an already presented pattern (black rectangles delineate instances).
For the sake of clarity not all transitions (arrows) are always visualized (e.g. error or instance switch transitions).
All transition probabilities from a state are uniformly distributed among visualized transitions. Only instance decision transition probabilities are shown for illustration.
In Figure 4, states are merged together to avoid repetition when the environment is providing the current instance description. This way the visualization remains readable and focuses on the important part: the part where the agent replies.
General notation for transitions (arrows) between causal states is as follows: , where is an observation/state that is the output of the environment, the reward for the last performed action by the agent. Note that the action at current time-step determines the reward at current time-step, but the observation/state for the subsequent step.