Deep reinforcement learning (DRL) networks currently rival human-level performance in a variety of domains, including object recognition, speech recognition, video games, and complex board games such as Go. Despite these impressive achievements, DRL networks fail to adapt to trivial changes in the inputs and goals of the learning task, such as changes to board dimensions and structure. One potential reason for this shortcoming is that DRL algorithms learn through extensive feedback about the value of specific input-output associations, without any appreciation for the organizing features of the game that govern these associations. In contrast, evidence from cognitive science suggests that humans learn to perform complex tasks through a model-based approach that involves constructing an internal model of the organizing principles or rules of the environment [5, 6]. This form of learning is particularly important in dynamic environments where survival depends on the ability to generalize previous training to novel settings. While model-based learning algorithms stand to improve the robustness of DRL networks in dynamic environments, how this might be achieved remains a largely unanswered question.
One reason for the paucity of model-based approaches to deep learning is that DRL agents are typically developed to solve highly complex tasks, thereby precluding any straightforward process for exploring possible internal models of the environment. One way to facilitate development of deep model-based learning agents is to identify simpler testing environments that effectively reduce the number of available features from which internal models can be constructed. A recent paper highlighted several advantages of simplified environments for evaluating the safety and robustness of DRL agents, showing that simple “gridworld” environments provide a tractable way of identifying pitfalls in the learned action policy. Indeed, in these simple environments two state-of-the-art DRL networks failed to effectively adapt to subtle differences between training and testing environments, highlighting the need for more robust RL and DRL algorithms. Another important environmental characteristic for the purposes of building model-based agents is the ability to alter, remove, or introduce dimensions of the environment without rendering previous training irrelevant. In other words, in order to fairly evaluate the success of the agent, there needs to exist a reliable strategy or internal model that is robust across variations of the environment or task rules.
The conditions described above are satisfied by a class of impartial combinatorial games. The most notable difference between an impartial game and a game such as Go is that most impartial games have a ground truth solution. Every position in an impartial game is either hot, meaning that there exists a winning strategy for the player about to make a move, or cold, meaning that under optimal play, the player about to make a move will always lose. The distribution of hot and cold positions across the state space in an impartial game usually comes with inherent mathematical structure. In an impartial game, two players alternate in making moves until there are no available moves to make, with the player to make the last move declared the winner. The function that takes in a state and returns the set of available legal actions has as its domain an infinite set, meaning it can be generalized to arbitrary dimensions, which makes impartial games particularly amenable to a model-based learning strategy.
Working within the constraints of impartial game theory, we now consider the differences between RL agents with model-free and model-based learning strategies. Substantial evidence from cognitive psychology and neuroscience suggests that model-based learning is associated with hierarchical information processing, with action-value associations learned at lower levels of the hierarchy and abstract predictions about the environment at higher levels [12, 13, 14, 15, 16, 17]. One example of such a hierarchy is the prefrontal cortico-basal ganglia (BG) network [18, 19] found in many mammalian species. Converging evidence from human and animal neurophysiological experiments shows that the prefrontal BG networks engage in model-free learning of action-values, driven by phasic dopaminergic signals from the midbrain that alter the weights of cortical inputs to the BG in accordance with environmental feedback [20, 21, 22]. As a result of this plasticity, rewarded (or punished) actions become more (or less) likely to be executed in the future. This form of learning is analogous to that enacted by deep Q-Learning (DQL) agents that exhibit behavioral policies determined solely by the feedback of previous actions.
Importantly, rather than arising from an entirely separate and independent process, model-based learning can be viewed as a companion system that is both informed by and exerts control over the feedback-dependent associations formed through model-free learning. Behavior becomes “model-based” when lower-level feedback-dependent representations are leveraged to construct an internal model of the environmental dynamics responsible for previous observations, sometimes referred to as a “generative model”. A key difference in the behavioral outcomes of these two forms of learning is flexibility [23, 17]: model-free learning results in habitual actions based on a static cache of associated values whereas model-based learning results in goal-directed actions based on inferred dynamics of the environment. Evidence from human neuroimaging experiments suggests that the shift from model-free to model-based policies is driven by a concomitant shift from BG to prefrontal behavioral control [23, 24], signaling a shift away from feedback-dependent knowledge to active predictions drawn from the agent’s internal model of the environment.
Motivated by the hierarchical organization of prefrontal cortico-BG systems that are thought to implement model-based learning in the human brain, we devised a novel Hierarchical Q-Network (HQN) that attempts to build an internal strategy (e.g., generative model) based on inferred patterns of hot and cold positions (e.g., model-free Q-learning) in an impartial combinatorial game called Wythoff’s game. We show how this hierarchical learning structure promotes generalizability and robustness to rule changes while also improving post-training interpretability of learning outcomes. Compared to standard Q-learning and deep Q-Networks, the HQN is markedly faster at learning the task and, more importantly, shows clear benefits to the transfer of learning, not only to alterations of Wythoff’s game, but across a variety of other impartial games with distinct but similar rule structures. Below, we describe our findings, highlighting 1) the benefits afforded by impartial games for developing more robust deep learning agents and 2) the importance of hierarchical learning in environments that demand flexibility.
2.1 Impartial games: Wythoff’s game, Nim, and Euclid
Wythoff’s game is played on a two-dimensional grid in which players alternate turns to move an object that starts in the bottom-right corner towards the top-left corner. The player who places the object in the top-left corner ends, and thereby wins, the game. On every turn, the object can be moved horizontally, vertically, or diagonally towards the top-left corner.
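Wythoff’s game is an impartial game where the states are all 2-dimensional non-negative integer coordinates. From coordinates $(x, y)$, a player can access all states of the form $(x - a, y)$, $(x, y - b)$, and $(x - c, y - c)$, where $1 \le a \le x$, $1 \le b \le y$, and $1 \le c \le \min(x, y)$.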
As mentioned above, every position in an impartial game is either hot or cold, indicating which player will win the game under optimal play. For formal definitions of hot and cold positions, see Definition 5; for an inductive proof of the partitioning, see Theorem 2.
The partition of hot and cold positions in Wythoff’s game is deeply embedded in properties of the Fibonacci string and the golden ratio.
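Let $a_n = \lfloor n\varphi \rfloor$ and $b_n = \lfloor n\varphi^2 \rfloor$, where $\varphi = (1 + \sqrt{5})/2$ is the golden ratio. Then all cold positions in Wythoff’s game are of the form $(a_n, b_n)$ or $(b_n, a_n)$, where $n$ is a natural number.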
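The golden-ratio characterization of cold positions can be checked numerically on a small board. The following is a minimal sketch, not the authors' code: it assumes the top-left corner is the origin `(0, 0)`, and the function names (`wythoff_moves`, `cold_positions_by_search`, `cold_positions_by_formula`) are hypothetical. It labels positions by backward induction and compares the result with the closed-form pairs.

```python
import math


def wythoff_moves(x, y):
    """Yield every position reachable in one move from (x, y)."""
    for a in range(1, x + 1):          # horizontal moves
        yield (x - a, y)
    for b in range(1, y + 1):          # vertical moves
        yield (x, y - b)
    for c in range(1, min(x, y) + 1):  # diagonal moves
        yield (x - c, y - c)


def cold_positions_by_search(size):
    """Find cold positions on a size-by-size board by backward induction."""
    cold = set()
    for x in range(size):
        for y in range(size):
            # Every move lowers a coordinate, so all successors were
            # already labeled. A position is cold when no move reaches
            # a cold position (this includes the terminal corner).
            if not any(m in cold for m in wythoff_moves(x, y)):
                cold.add((x, y))
    return cold


def cold_positions_by_formula(size):
    """Cold positions (floor(n*phi), floor(n*phi^2)) and their mirrors."""
    phi = (1 + math.sqrt(5)) / 2
    cold = set()
    n = 0
    while True:
        a, b = math.floor(n * phi), math.floor(n * phi * phi)
        if b >= size:  # the pair no longer fits on the board
            break
        cold.update({(a, b), (b, a)})
        n += 1
    return cold
```

Floating-point evaluation of the floor terms is adequate for the small boards used here; exact integer arithmetic would be a safer choice for very large boards.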
While we benchmark our HQN agent on Wythoff’s game, we also subject the game to certain rule changes in later sections. The hot-cold partitions of the resulting games are structurally similar to that of Wythoff’s game, and are discussed in the Appendix. See Figure 12 for a visualization.
We denote by Nim the impartial game resulting from a Wythoff’s game where diagonal moves are disallowed.
We denote by Euclid the impartial game resulting from a Nim where a distance travelled in the horizontal or vertical direction has to be a multiple of the minimum of the horizontal and vertical distance to the top-left corner.
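To illustrate how a rule change alters the hot-cold partition, the sketch below applies the same backward-induction idea to the Nim variant (Wythoff’s game without diagonal moves). The names `nim_moves` and `cold_positions` are hypothetical, and the top-left corner is again assumed to be `(0, 0)`. Euclid is left out of this sketch.

```python
def nim_moves(x, y):
    """Moves in the Nim variant: horizontal or vertical only."""
    for a in range(1, x + 1):
        yield (x - a, y)
    for b in range(1, y + 1):
        yield (x, y - b)


def cold_positions(size, moves):
    """Backward induction for any game whose moves only lower coordinates."""
    cold = set()
    for x in range(size):
        for y in range(size):
            # Cold means that no available move leads to a cold position.
            if not any(m in cold for m in moves(x, y)):
                cold.add((x, y))
    return cold


nim_cold = cold_positions(10, nim_moves)
```

In this variant the cold positions fall on the main diagonal, a simpler structure than the two golden-ratio lines of Wythoff’s game.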
2.2 Hierarchical Q-Network (HQN)
The HQN is composed of two interconnected systems, the Q-agent and the Model-Network, that attempt to cooperatively generate an internal model of the task environment while working with datasets that differ in dimensionality. The Q-agent works with a high-dimensional dataset reflecting the expected value of state-action pairs. The Model-Network, on the other hand, draws on the conclusions of the Q-agent to obtain a low-dimensional dataset that solely reflects the expected values of given states. The Model-Network uses a deep neural network to extrapolate a model from the extracted dataset and evaluates the value of the model by testing the model against an opponent simulated by the Q-agent. In return, the Model-Network biases the action policy of the Q-agent to favor movements to states more likely to generalize to larger environments. The behavior of the Q-agent then effectively explores state-action pairs that contradict or corroborate the current generative model of the Model-Network. The HQN succeeds if and only if the Model-Network converges on a generalizable model of the given environment.
2.2.2 Network Details
A summary of the underlying logic behind the HQN is given above, whereas a detailed discussion about its implementation including pseudocode is provided below. Here, we provide details about the architectures of the networks that compose the HQN.
The Q-agent component of the HQN uses Q-Learning to build estimates of how good a given state is for the player, based solely on gameplay experience. The learning rate ($\alpha$), the discount rate ($\gamma$), and the exploration constant ($\tau$) of the Boltzmann distribution used to randomize action selection are held fixed throughout training. Further details are given in the next section.
The Model-Network is a feed-forward, single-layer, fully-connected network. The error limit and maximum number of iterations are also randomized, to account for errors due to over-fitting or under-fitting. The sigmoid function is used as the activation function for individual neurons. We use the standard backpropagation algorithm to train the network, with the cross-entropy function as the cost function to avoid learning slow-down; because dysfunctional models are expendable, the trade-off is worthwhile. The cross-entropy cost function is given by:
$$C = -\frac{1}{n} \sum_{x} \left[ y \ln a + (1 - y) \ln (1 - a) \right]$$
where the sum is over all training inputs $x$, $n$ is the number of inputs, $y$ is the desired output, and $a$ is the output of the neuron.
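As an illustration of the cross-entropy cost described above, the following minimal sketch averages the cost over a list of desired outputs and neuron outputs. The function name `cross_entropy_cost` and the clipping constant `eps` are assumptions introduced here for numerical safety; they are not part of the original implementation.

```python
import math


def cross_entropy_cost(desired, outputs, eps=1e-12):
    """Average cross-entropy between desired outputs and neuron outputs."""
    n = len(desired)
    total = 0.0
    for y, a in zip(desired, outputs):
        # Clip the output away from 0 and 1 so the logarithms stay finite.
        a = min(max(a, eps), 1 - eps)
        total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / n
```

The cost is small when the outputs agree with the desired labels and grows without bound as a confident output approaches the wrong label, which is why it avoids the learning slow-down associated with saturated sigmoid units.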
The Nimblenet library for Python was used to implement the neural network in the Model-Network. Further details about the separate networks and their interaction are given in the next section.
2.2.3 The Q-agent
The Q-agent relies on Q-Learning, a standard model-free reinforcement learning technique, to estimate the expected value (Q-value) of a state-action pair in a given environment over multiple training sessions (see Figure 3). Every move made adjusts the values stored in the Q-table through a value iteration update. Directly updating state-action pairs in this way affords greater precision, and is computationally simpler, than relying on error propagation to adjust the weights of a neural network. The Q-Network was able to achieve similar performance to the basic Q-agent when action values were estimated independently of the current state (e.g., board position), albeit less efficiently.
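Here $(s, a)$ is a state-action pair, $s'$ is the new state after action $a$ is taken at state $s$, $r$ is the reward associated with $(s, a)$, and $\alpha$ and $\gamma$ are the learning and the discount rate, respectively. Finally, $A$ is the function that returns the set of available actions from a given state in the environment. Then, the Q-value is updated through:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a) \right]$$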
We take and , since every action has equal effect on the outcome in a given impartial game, so discounting future rewards is redundant.
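The tabular update described above can be sketched as follows. This is the standard Q-learning rule, which the text appears to follow. The function name `q_update`, the dictionary-based table, and the default `alpha=0.1` are illustrative assumptions. The default `gamma=1.0` reflects the remark that discounting is redundant in this setting.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=1.0):
    """Apply one tabular Q-learning update and return the new Q-value.

    q maps (state, action) pairs to values; next_actions lists the
    actions available from next_state (empty at a terminal state).
    """
    # Value of the best follow-up action, or 0 at a terminal state.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)  # unseen state-action pairs start at 0
```

A `defaultdict` keeps the table sparse, so only the state-action pairs that were actually visited consume memory.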
Actions are selected through the Boltzmann distribution that uses current approximations of the Q-Values to generate a weighted probability space.
Explicitly, the probability that action $a$ is selected in state $s$ is given by:
$$P(a \mid s) = \frac{e^{Q(s, a)/\tau}}{\sum_{b \in A(s)} e^{Q(s, b)/\tau}}$$
Here $\tau$ is the constant that determines how exploratory or exploitative the action selection process is going to be. We hold $\tau$ fixed throughout, at a value reflecting a moderate degree of exploratory behavior in the model.
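Boltzmann action selection of the kind described above can be sketched as follows. The names `boltzmann_probabilities` and `boltzmann_select`, and the parameter name `tau` for the exploration constant, are assumptions made for this example. Larger values of `tau` flatten the distribution (more exploration) and smaller values sharpen it (more exploitation).

```python
import math
import random


def boltzmann_probabilities(q_values, tau):
    """Turn Q-values into selection probabilities with temperature tau."""
    # Subtracting the maximum leaves the probabilities unchanged but
    # prevents overflow in the exponential.
    top = max(q_values)
    weights = [math.exp((q - top) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_select(actions, q_values, tau, rng=random):
    """Sample one action according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(actions, weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible across runs.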
It is important to note that the process that selects actions to explore also depends on the model as a variable. Actions favored by the model generated by the Model-Network get a boost in their probability of being selected. Before the decision process is left to the Boltzmann probability space, the HQN decides whether to explore the action recommended by the model with probability:
is the expected value, or performance of the model as computed by the Model-Network. is the limit imposed on the confidence on the model, in order to maintain that the Q-agent still operates mostly independently of the Model-Network. Otherwise, “echo-loops” may be created in the HQN. is the steepness factor, determining how fast the probability approaches the limit as increases. We set as an appropriate limit, and to get an optimal probability function with respect to , where the model does not begin to influence the Q-agent until .
2.2.4 The Model-Network
The question that motivates the Q-agent is “What moves should I make in which positions to maximize my likelihood of winning?” However, this question is restricted to the space in which learning occurs. Thus, the question that motivates the Model-Network is “Are some positions better for me than others, and if so, is there any structure to how these positions are distributed across the board?” For an $n$ by $n$ Wythoff’s game, there are $2^{n^2}$ possible ways in which good and bad positions could be distributed. Without the latter question, the former question is too short-sighted to yield any useful insights into the nature of the game. The HQN architecture allows us to ask these questions simultaneously.
While the Q-agent attempts to approximate the Q-values of state-action pairs, the Model-Network works with simply the expected values of individual states in order to find a heuristic that will separate good states from bad states (see Figure 4). The expected value of a state is simply the Q-value of the best available action from that state.
The Model-Network, equipped with some fixed confidence threshold, creates a dataset classifying each state as cold or hot by comparing its expected value against that threshold. Random samples of this dataset are then fed into a neural network. We refer to the trained neural network as the model. Architectural details about the neural network were given in the previous section.
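The step from Q-values to a labeled dataset of states can be sketched as follows. The exact classification rule is not recoverable from the text, so this example assumes a symmetric rule: a state is labeled hot when its expected value is at least `threshold` and cold when it is at most `-threshold`, with ambiguous states left out. The names `state_value` and `build_dataset` are hypothetical.

```python
def state_value(q, state, actions):
    """Expected value of a state: the Q-value of its best action."""
    return max(q.get((state, a), 0.0) for a in actions)


def build_dataset(q, legal_actions, threshold):
    """Label confidently valued states as hot (1) or cold (0).

    legal_actions maps each state to the list of actions available
    from it. The symmetric threshold rule is an assumption.
    """
    dataset = []
    for state, actions in legal_actions.items():
        if not actions:  # terminal states carry no action values
            continue
        value = state_value(q, state, actions)
        if value >= threshold:
            dataset.append((state, 1))  # hot: good for the mover
        elif value <= -threshold:
            dataset.append((state, 0))  # cold: bad for the mover
    return dataset
```

Discarding states whose value lies inside the threshold band keeps poorly explored positions from contaminating the training data of the model.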
The Model-Network evaluates the performance of a model by benchmarking it against a greedy agent that has access to the Q-agent, as opposed to perfect play; the training process therefore remains unsupervised. The model receives a performance score equal to the ratio of games it wins against the greedy Q-agent. Since the Q-agent almost always remains more accurate on smaller board sizes, benchmarking games are played on a larger board size so as to favor potentially generalizable models.
2.2.5 Q-agent and Model-Network
The full HQN integrates both the Q-agent and the Model-Network through the learning algorithm modelBasedLearn. On every iteration of the modelBasedLearn algorithm, the Q-agent is trained on the specified number of gameplays, and the process concludes with the construction of a candidate model, potentially replacing the current best-performing model. Note that even models that are eventually replaced have a positive impact on the learning outcomes, since hypotheses from flawed models get contradicted by the Q-agent, allowing for the construction of more accurate models in upcoming iterations.
Performance of the HQN agent was also tested against changes in the rules of the game, without explicitly notifying the agent of such changes. It is crucial that the HQN agent is able to detect such changes and adapt to the new rules of the game, especially if the HQN agent was trained on the same set of rules earlier. To this end, we give the HQN agent access to a dataset consisting of the calculated performances of the current best-performing model. If the current calculated performance of the model falls below the average of its past performances by more than a predetermined margin, the HQN detects a severe performance drop. In this case, the current model is stored away in case the need for that same model later arises. The stored models are also checked for one that would fit the new rules of the environment, and if such a model exists, it is adopted as the current model.
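The rule-change detection described above can be sketched as follows. The exact inequality is not recoverable from the text, so this example assumes that a drop is flagged when the current performance falls below the running average by more than a margin `delta`. The names `severe_drop`, `pick_stored_model`, `delta`, and `minimum` are hypothetical.

```python
from statistics import mean


def severe_drop(history, current, delta):
    """Flag a drop when current falls more than delta below the average."""
    if not history:  # nothing to compare against yet
        return False
    return mean(history) - current > delta


def pick_stored_model(stored, evaluate, minimum):
    """Return the best stored model if it scores at least minimum.

    evaluate is a callable that scores a model under the current rules.
    Returns None when no stored model is good enough.
    """
    best = max(stored, key=evaluate, default=None)
    if best is not None and evaluate(best) >= minimum:
        return best
    return None
```

When `severe_drop` fires and `pick_stored_model` returns `None`, the agent would fall back to training a fresh model under the new rules.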
2.2.6 Model-Free Learning Agents
We compared the HQN to two non-hierarchical implementations of Q-learning.
We benchmark the HQN against an independent Q-Agent to illustrate the effect of the addition of the Model-Network to the system. The Q-Agent has an almost identical framework to the Q-Agent component of the HQN agent. The only difference is that this Q-agent does not have its exploration procedure influenced by a Model-Network. Hence, we do not include more details about its implementation.
The core difference between the Q-Network and the Q-Agent is that the Q-Network makes use of a neural network to approximate the Q-function, whereas the Q-agent algorithm does not attempt to make an inference beyond the look-up-table process for the Q-values of the state-action pairs. A high-level explanation of the algorithm is given in Figure 7, and detailed pseudo-code is given in Figure 7. Nimblenet was used to simulate the neural network.
The network was fully-connected and single-layer; adding more layers did not have a significant effect on the learning outcomes. The standard backpropagation algorithm was used with the sum-squared-error cost function and a sigmoid activation function on the individual units.
3.1 Model Building
The Model-Network displayed great efficacy in producing generalizable models for Wythoff’s game and its variants. Figure 8 shows how the two components of the HQN learned the value of board positions at different stages of learning. Models that are developed in the earlier stages of training remain mostly irrelevant to generalization; however, models that meaningfully generalize, although with low accuracy, begin to emerge soon after initial training. Such models are crucial for the learning process because they influence the way the Q-Network chooses to explore different action spaces. Without such guidance, the Q-Network explores actions without any overall purpose or insight. With the guidance from the Model-Network, the Q-Network explores actions that would either contradict or confirm an overall hypothesis about the nature of the learning environment.
Wythoff’s games have the type of mathematical structure that should be very easy for a neural network to recognize, explaining a significant portion of the HQN agent’s success. Unfortunately, neural networks are less adept at recognizing discrete, stepwise patterns than they are at recognizing regions and finding slopes. For example, even the best models generated by the HQN agent for Wythoff’s game largely ignored the stepwise distribution of the cold positions along each line. This issue calls for a layer that can be more flexible in the types of models that it can hypothesize.
3.2 HQN Efficiency vs. Q-agent and Q-Networks
We compared the performance of the HQN agent in Wythoff’s game to that of a Q-agent and a Q-Network (QN). Figure 9 shows the accuracy of all three agents during learning. The HQN agent improves performance in discrete jumps as better models replace worse ones over time. Since models are assessed by the HQN in an unsupervised manner, some models evaluated to be better will in fact be less accurate, explaining the occasional fall in the performance of the HQN agent.
The core idea that gives rise to the Q-Network is using neural networks to approximate the Q-function. Whereas traditional Q-learning attempts to fill in every single value of the Q-function with increasing accuracy in a look-up-table manner, a Q-Network attempts to train a neural network that approximates this function. The Q-Network is also able to interpolate after training, since the network attempts to approximate the Q-function continuously, filling in the gaps in the dataset. The benefits of such an approach have been demonstrated in detail in DeepMind’s Atari Network. However, while attempting to be more efficient and general than a naive Q-agent, Q-Networks sacrifice a lot of stability. Re-training a Q-Network with a newly discovered dataset can be destructive to already existing features of the network. To remedy the destructive re-training issue, especially while training through large datasets, (deep) Q-Networks make use of “experience replay”. A Q-Network agent that uses experience replay stores training data as it arrives and, at occasional intervals, backpropagates samples of that stored data through the network, rather than training only on the most recent data point.
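The experience replay mechanism described above can be sketched with a fixed-capacity buffer. This is a generic illustration rather than the implementation used in the experiments; the class name `ReplayBuffer` and its parameters are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of past transitions for experience replay."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently discards the oldest transition
        # once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, transition):
        """Record one (state, action, reward, next_state) transition."""
        self.buffer.append(transition)

    def sample(self, batch_size):
        """Draw a random batch without replacement for re-training."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

Training on random batches drawn from the buffer, instead of only the newest transition, decorrelates consecutive updates and reduces the risk that new data overwrites what the network has already learned.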
For Wythoff’s game, the Q-Network agent’s performance was subpar compared to the HQN and naive Q-agents, even with additional modifications such as experience replay. Overall Q-Network performance did not exceed random chance significantly within the time constraints that allowed the HQN and the Q-agent to attain reasonable performance. Giving the Q-Network additional advantages, such as training against a perfect agent, or increasing the number of layers in the neural network, was not able to fix the disparity.
The only structural change that observably changed the behavior of the Q-Network was to equate actions and states in the training phase. In an impartial game, how good an action is depends only on which state the action takes the game to. Moves towards cold positions are good moves, whereas moves towards hot positions are bad moves. Under most learning tasks, this assumption does not hold; for example, pressing left could win the game in one scenario but be disastrous in another. As illustrated in Figure 6, the Q-Network, similar to the HQN and the Q-agent, does not operate under this assumption, since the network trains to approximate how good an action is given the state.
However, we can hard-code the irrelevance of the starting state as an assumption, by representing an action as simply the encoding of the new state.
Under this framework, the task of the Q-Network would be to output the identical Q-vector that separates good states from bad states, given any state in the game. Since there are many more states in a game of Wythoff than there are actions, the resulting Q-vector will be significantly larger. Increasing the number of neurons to the same order fixes the problem at the cost of slower training. Even so, the resulting modified agent is able to converge on strategies at a competitive pace.
This “trick”, however, is inapplicable in most scenarios outside impartial games, which is why we did not hard-code such notions into the HQN agent. For a similar reason, we do not include in our analysis the modified Q-Network agent that treats an action as a state.
3.3 HQN performance across dimensions
The HQN agent was able to attain reasonably high levels of performance beyond the dimensions it was trained in. Figure 10 shows the accuracy of a specific model generated by the HQN agent for Wythoff’s game across different board dimensions.
The Q-agent was collecting data on a 12 by 12 board, while the Model-Network was evaluating generated models against the Q-agent on a 50 by 50 board. The fact that the Model-Network tests models by their performance on dimensions that they were not trained on is crucial to prioritize generalizability across dimensions. The fact that the Q-agent cannot perform optimally on higher dimensions is also an advantage, since models that achieve some level of generalizability will be assigned a higher score despite having poor accuracy on smaller boards.
Fortunately, even though the Model-Network was attempting to optimize performance up to a 50 by 50 board size, the models generated were able to display a reasonable degree of performance on larger boards. The model in Figure 10 achieved 70% accuracy on a 300 by 300 board.
3.4 HQN performance across rule variations
The HQN was also benchmarked in contexts where the rules of the game did not stay constant. Figure 11 shows the performance of the HQN agent across three games that had similar, but not identical, rules. The HQN agent was first trained on Wythoff’s game. Once the agent reached satisfactory performance, we changed the gameplay rules to those of the game Nim, without informing the agent of this change. The agent was able to detect this change through the sharp decrease in the model’s performance (as perceived by the HQN agent, displayed with the green line), store the model for the Wythoff task away, reset the Q-agent, and start training again. Since Nim (and later, Euclid) has a less complex action space, we decreased the period from 250 Q-agent gameplays to only 50, in order to slow the learning process down for visualization purposes. When satisfactory performance was reached in Nim, we changed the learning task to Euclid. We then cycled through the three tasks in a similar fashion once again; this time, however, the agent was able to attain satisfactory performance immediately after it detected a change in the rules.
In this paper, we proposed some basic strategies for developing and evaluating agents that learn adaptable and robust strategies, an increasingly important goal for developing AI capable of navigating novel environments. The hierarchical structure of the HQN showed promise in the transfer-learning domain, while remaining competitive with standard RL approaches in terms of performance. We trained a Q-agent, a Q-Network, and an HQN for identical amounts of time on Wythoff’s game (see Figure 9). The Q-agent was able to show improved accuracy, although at a steadily decreasing rate over time. The Q-Network, a more unstable but also more efficient advancement over the Q-agent algorithm, was not able to learn as well in this context of impartial games. The HQN agent, on the other hand, achieved increasing accuracy in discrete jumps as better models for the environment were discovered. The HQN also did more than merely excel in terms of efficiency. In the transfer-learning domain, where standard RL approaches are infamously unsuccessful, the HQN agent was able to achieve performance that generalized across dimensions (Figure 10) and remained resistant to changes to the rules of the game (Figure 11). Most importantly, we could query the HQN agent to show its strategy for game play in an intuitive and explainable way.
Towards this goal of extensibility in artificial agents, meta-reinforcement learning, the idea that RL agents can be trained to build better base networks for other RL agents to be trained on, holds a lot of promise. Wang et al., Duan et al., and Hansen provide state-of-the-art approaches to meta-reinforcement learning, which they call Deep Meta-Reinforcement Learning (DMRL), , and Deep Episodic Value Iteration (DEVI) respectively. These agents are evaluated against benchmarks beyond efficiency and accuracy metrics, including one-shot changes to rewards and the ability to learn abstract task structure. Real et al. and Miikkulainen et al. also propose algorithms to optimize network architecture, including connectivity and parameters, for high-dimensional deep learning tasks such as image recognition and language modeling. We consider these efforts important as we aim for artificial networks that can generalize across tasks and yield interpretable outcomes.
4.1 The Hierarchical Q Network
Our key innovation in this study was the introduction of the Hierarchical Q-Network (HQN), a model-based learning agent that capitalizes on hierarchical information processing (see discussion of biological motivations in subsection 4.2). The HQN was composed of a “lower” layer, the Q-agent, that explored the high-dimensional state-action search space, and a “higher” layer, the Model-Network, that abstracted away the action dimension and processed the expected values of states to extract generalized structure from the environment. More important than the hierarchical structure of the HQN, however, is that the two networks interact in such a way that observations by the Q-agent effectively inform model building and that hypotheses generated by the Model-Network effectively constrain future action policies. Without the Model-Network, the Q-agent blindly explores the massive search space without any “insight”. Conversely, without the Q-agent, the Model-Network does not have any information from which to generalize.
While the HQN’s performance was superior to that of the other RL agents tested here, we should point out that it does suffer from limitations that future work should focus on. One of the inherent limitations of the HQN agent as proposed in this paper is that neural networks were used as the implementation of the Model-Network. Neural networks have proved themselves suitable for a wide variety of learning tasks; however, they also have a wide range of limitations. For instance, a standard neural network will not be able to classify objects that follow a discrete pattern. In Wythoff’s game, even though the Model-Network was able to generate models that recognized the two symmetrical lines of cold positions, the network was unable to appreciate the discrete intervals separating cold positions along each of the lines. Processing units that can independently and cohesively handle a vast array of decision problems are essential if the goal is to understand and simulate how the biological brain can seamlessly navigate a highly complex physical environment where the inputs and goals of a learning task can change rapidly. We propose that symbolic representations combined with the strengths of the statistical approaches of neural networks might be extremely useful. An initial attempt to explore such an intersection is given by Garnelo et al., who propose a symbolic model-based learning agent. We intend to follow a similar direction in our future work.
4.2 Biological Motivations for Hierarchical Processing
The advantages of the HQN, along with recent work by others [5, 3, 31], suggest that hierarchical structure is an effective catalyst for adaptive and generalizable learning in artificial agents. Indeed, substantial evidence from experimental and computational neuroscience suggests the same is true of biological brains [15, 18, 24], pointing towards the looped architecture of cortico-basal ganglia networks as an important feature for model-based and model-free learning systems [18, 24]. The basal ganglia (BG) is a subcortical network that receives widespread cortical input through the striatum, forming a channel-like architecture - each channel representing a particular action - that loops back up to motor cortex through the thalamus. Critically, each action channel in the BG contains a facilitation and suppression pathway, capable of exerting bidirectional control over the corresponding action channel in primary motor cortex. Schultz and colleagues famously showed that, during learning, the weights of these pathways are adjusted by phasic changes in striatal dopamine, encoding both the magnitude and sign of the prediction errors estimated from Q-learning models. This dopamine-dependent plasticity of cortico-striatal connections serves to reinforce the future selection of rewarding actions while also suppressing less desirable alternatives, serving a similar computational goal to that of Q-Networks [37, 38, 39]. However, as previously mentioned, relying on feedback alone to drive learning 1) quickly becomes inefficient as task complexity increases, 2) limits the range of learned associations that can be simultaneously stored and exploited, and 3) fails to account for the robust and flexible nature of mammalian behavior.
The fundamental idea behind model-based learning is that, through experience and observation, internal beliefs are formed about the causal relationship between contextual features, states, and action values. For hierarchically structured tasks, for which state-action values depend on multiple, nested contextual features, generative models offer an imperfect but highly efficient strategy for guiding action selection. Critically, however, implementing a model-based learning strategy often relies on simultaneously learning from feedback in a model-free manner. Thus, the challenge of implementing model-based learning is two-fold, requiring 1) a generative mechanism for constructing hypotheses and 2) fluid interaction between inferential and feedback-dependent learning systems.
The neuroscience and machine learning [43, 7, 44, 45] communities have shown a growing interest in model-based learning mechanisms, leading to mutually informative lines of investigation (e.g., understanding how biological brains encode model-based learning strategies provides hints for overcoming the challenges of model-based learning in artificial agents). Evidence from human neuroimaging studies suggests that model-free learning computations in the BG are regulated by top-down inputs from a model-based learning system in the prefrontal cortex (PFC). Critically, due to the looped architecture of cortico-BG pathways, model-based computations in cortex are informed by feedback-dependent updates in the action-value landscape. Over time, cortical model-based learning systems generate predictions based on model-free computations and, in turn, provide top-down constraints that regulate feedback sensitivity and decision policies in the BG. This symbiosis between BG- and PFC-dependent learning systems is mirrored in the HQN, with observed state-action values in the Q-Network facilitating better predictive models in the Model Network that, in turn, improve future performance through top-down constraints on action evaluation. This scaffolding of model-based and model-free learning computations accelerates the learning process by proactively testing different hypotheses about the rule structure of the task and constraining future decision policies as confidence increases about the fidelity of these expectations.
4.3 Impartial Games as a Benchmark
We are not the first to observe and leverage the fact that the choice of benchmark environment profoundly influences the learning agents we design. Although the success of the DeepMind Atari Network was impressive, the benchmark featured implausible 2D environments viewed from a third-person perspective. Kempka et al. developed VizDoom, a dynamic first-person perspective learning environment, as an alternative testing benchmark for visual RL agents.  was evaluated in the VizDoom environment to demonstrate adaptability to high-scale problems. More recently, DeepMind and Blizzard announced a partnership to utilize StarCraft II as an AI research environment. StarCraft II is a third-person strategy game with complicated raw visual input, large state and action spaces, and delayed rewards and punishments for selected actions. Initial results already show that this new learning environment will be a challenge for even the most well-established deep reinforcement learning architectures.
We share similar goals with most of the aforementioned research, including designing learning agents that are more adaptable to changes in inputs and goals, as well as ensuring that learning outcomes are interpretable to humans. However, our critical argument is that, in order to achieve these goals, tasks should be designed in which adaptability, as opposed to accuracy, is prioritized. One of the ways our approach separates itself is in the sheer simplicity of the learning task chosen: Impartial games are equipped with rules straightforward enough that winning strategies have a complete mathematical theory. The fact that impartial games are “solved” games allows us to conveniently evaluate performance, and shift focus entirely to the transfer-learning and model-building domain.
Despite their simplicity, the scalability of impartial games makes them uniquely conducive to experimentation with model-based learning algorithms. Common benchmarks such as multi-armed bandit problems lack an environment that needs to be navigated through dynamic model-building: a model for the environment cannot go beyond the predetermined expected value and variance distribution. The complexity of games like Go and StarCraft II, on the other hand, precludes any straightforward approach to model-building. For impartial games, model-building can be performed by exploring the geometrical structure of value over the topology of the game environment. Thus, we argue that impartial games offer a more suitable environment for rigorously testing and comparing deep model-based agents. The benefits of using impartial games for benchmarking model-based deep learning are summarized below.
The rules of impartial games immediately generalize to bigger board dimensions, in a way that preserves the mathematical structure of the winning strategy. This feature allows us to differentiate between learning agents beyond simply looking at their performance. To determine whether a learning agent has truly understood the nature of the game environment, we need only benchmark it on a bigger board size. Thus, the learning outcomes of an agent become more transparent. Games like Chess or Go do not have structure that straightforwardly generalizes across different board sizes, so such an approach to benchmarking would be infeasible for them.
Impartial games offer a wide variety of ways to change the rules of the game without destroying the inherent mathematical structure of the winning strategy associated with the specific set of rules. Just by imposing some natural restrictions on the function that returns the set of legal moves from a position in Wythoff’s game, we were able to generate two other games (Nim and Euclid) where the winning strategy has similar mathematical structure. Rule changes in games such as Chess or Go, however insignificant, may influence overall strategy in very intricate ways. Thus Chess and Go would be less accessible for initial attempts at transfer learning across rule changes.
There are many impartial games in which structure and noise co-exist, as in the real world. We did not utilize this feature in this paper, but it reflects a further advantage of impartial games. For example, a complete mathematical characterization of the winning (hot) and losing (cold) positions in a 3D Wythoff’s game had not been discovered at the time of writing. However, results from the 2D version partially generalize and shed some light on optimal behaviour. Building a learning agent that can generalize across dimensions could be a worthwhile challenge.
All these features allow us to conclude that impartial games, when taken as a benchmark for learning agents, allow us to ask questions of the agent where answers in the affirmative demonstrate a type of intelligence that goes beyond a brute-force pattern matching task.
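The first two benefits listed above can be illustrated with a small sketch. This is not the implementation used in the paper: the function names (`wythoff_moves`, `nim_moves`, `cold_positions`) and the encoding of board positions as coordinate pairs are assumptions made for illustration only. The sketch derives 2-pile Nim from Wythoff's game by filtering out diagonal moves, and checks that the cold positions found on a small board coincide with those of a larger board restricted to the small one.

```python
def wythoff_moves(a, b):
    """Positions reachable from (a, b) in Wythoff's game (assumed encoding)."""
    # Shrink exactly one coordinate by any positive amount ...
    one_pile = [(x, b) for x in range(a)] + [(a, y) for y in range(b)]
    # ... or shrink both coordinates by the same positive amount.
    diagonal = [(a - d, b - d) for d in range(1, min(a, b) + 1)]
    return one_pile + diagonal


def nim_moves(a, b):
    """2-pile Nim as a restriction of Wythoff's game: diagonal moves removed."""
    return [(x, y) for (x, y) in wythoff_moves(a, b) if x == a or y == b]


def cold_positions(moves, size):
    """Bottom-up sweep over a size x size board collecting the cold positions."""
    cold = set()
    for a in range(size):
        for b in range(size):
            # Every successor has smaller coordinates, so it was already visited.
            # A position is cold exactly when no successor is cold.
            if not any(m in cold for m in moves(a, b)):
                cold.add((a, b))
    return cold


small = cold_positions(wythoff_moves, 8)
large = cold_positions(wythoff_moves, 16)
# The larger board, restricted to the smaller one, has the same cold positions.
assert small == {(a, b) for (a, b) in large if a < 8 and b < 8}
print(sorted(small))
print(sorted(cold_positions(nim_moves, 4)))
```

Because the rule change is expressed as a filter on the move function, the same sweep evaluates both games; a further restriction of the move function could be plugged in the same way.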
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. pages 2–5, 2013.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
-  David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
-  Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
-  Asako Toyama, Kentaro Katahira, and Hideki Ohira. A simple computational algorithm of model-based choice preference. Cognitive, Affective, & Behavioral Neuroscience, pages 1–20, 2017.
-  Wouter Kool, Fiery A Cushman, and Samuel J Gershman. When does model-based control pay off? PLoS computational biology, 12(8):e1005090, 2016.
-  Gary Marcus. Deep learning: A critical appraisal. arXiv, 2018.
-  J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, and A. Lefrancq. AI safety gridworlds. arXiv, 2017.
-  Falk Lieder and Thomas L Griffiths. When to use which heuristic: A rational solution to the strategy selection problem. In CogSci, 2015.
-  E. R. Berlekamp, J. H. Conway, and R. K. Guy. Winning Ways for your Mathematical Plays. A K Peters, Natick, MA, 1982.
-  B. B. Doll, D. A. Simon, and N. D. Daw. The ubiquity of model-based reinforcement learning. Current opinion in neurobiology, 22(6):1075–1081, 2012.
-  P. Smittenaar, T. H. FitzGerald, V. Romei, N. D. Wright, and R. J. Dolan. Disruption of dorsolateral prefrontal cortex decreases model-based in favor of model-free control in humans. Neuron, 80(4):914–919, 2013.
-  K. Wunderlich, P. Smittenaar, and R. J. Dolan. Dopamine enhances model-based over model-free choice behavior. Neuron, 75(3):418–424, 2012.
-  B. B. Doll, K. D. Duncan, D. Simon, D. Shohamy, and N. D. Daw. Model-based choices involve prospective neural activity. Nature Neuroscience, 18:1–9, 2015.
-  E. M. Russek, I. Momennejad, M. M. Botvinick, S. J. Gershman, and N. D. Daw. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLOS Computational Biology, 13:9, 2017.
-  J. P. O’Doherty, J. Cockburn, and W. M. Pauli. Learning, reward, and decision making. Annual review of psychology, 68:73–100, 2017.
-  M. J. Frank and D. Badre. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: Computational analysis. Cerebral Cortex, 22:509–526, 2012.
-  D. Badre and M. D’Esposito. Is the rostro-caudal axis of the frontal lobe hierarchical? Nature Reviews Neuroscience, 10(9):659–669, 2009.
-  W. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.
-  Neir Eshel, Michael Bukwich, Vinod Rao, Vivian Hemmelder, Ju Tian, and Naoshige Uchida. Arithmetic and local circuitry underlying dopamine prediction errors. Nature, 525(7568):243–246, 2015.
-  Neir Eshel, Ju Tian, Michael Bukwich, and Naoshige Uchida. Dopamine neurons share common response function for reward prediction error. Nature neuroscience, 19(3):479–486, 2016.
-  N. D. Daw, Y. Niv, and P. Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12):1704–1711, 2005.
-  D. Badre and M. J. Frank. Mechanisms of hierarchical reinforcement learning in cortico-striatal circuits 2: Evidence from fMRI. Cerebral Cortex, 22:527–536, 2011.
-  Zachary Abel. Putting the why in wythoff. http://blog.zacharyabel.com/2012/06/putting-the-why-in-wythoff/, 2014.
-  R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction (Vol. 1, No. 1). MIT press, Cambridge, 1998.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science, 8506, 1985.
-  Jorgen Grimnes. Nimblenet. http://jorgenkg.github.io/python-neural-network/, 2016. Github repository.
-  J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, et al. Learning to Reinforcement Learn. 2017.
-  Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, et al. . arXiv, 2016.
-  Steven S. Hansen. Deep episodic value iteration for model-based meta-reinforcement learning. arXiv, 2017.
-  Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, et al. Large-scale evolution of image classifiers. arXiv, 2017.
-  Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, et al. Evolving deep neural networks. 2017.
-  M. Garnelo, K. Arulkumaran, and M. Shanahan. Towards Deep Symbolic Reinforcement Learning. ArXiv e-prints, September 2016.
-  G. E. Alexander, M. R. DeLong, and P. L. Strick. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci., 9:357–381, 1986.
-  W. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.
-  M. J. Frank, L. C. Seeberger, and R. C. O’Reilly. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306:1940–1943, 2004.
-  S. M. L. Cox et al. Striatal d1 and d2 signaling differentially predict learning from positive and negative outcomes. Neuroimage, 109:95–101, 2015.
-  A. V. Kravitz, L. D. Tye, and A. C. Kreitzer. Distinct roles for direct and indirect pathway striatal neurons in reinforcement. Nature Neuroscience, 15:816–8, 2012.
-  Michael A McDannald, Yuji K Takahashi, Nina Lopatina, Brad W Pietras, Josh L Jones, and Geoffrey Schoenbaum. Model-based learning and the contribution of the orbitofrontal cortex to the model-free world. European Journal of Neuroscience, 35(7):991–996, 2012.
-  Peter Dayan and Kent C Berridge. Model-based and model-free pavlovian reward learning: revaluation, revision, and revelation. Cognitive, Affective, & Behavioral Neuroscience, 14(2):473–492, 2014.
-  Nathaniel D Daw, Samuel J Gershman, Ben Seymour, Peter Dayan, and Raymond J Dolan. Model-based influences on humans’ choices and striatal prediction errors. Neuron, 69(6):1204–1215, 2011.
-  D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95:245–258, 2017.
-  Nathaniel D Daw and Peter Dayan. The algorithmic anatomy of model-based evaluation. Phil. Trans. R. Soc. B, 369(1655):20130478, 2014.
-  Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple model-based reinforcement learning. Neural computation, 14(6):1347–1369, 2002.
-  Michal Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. A Doom-based AI Research Platform for Visual Reinforcement Learning. 2016.
-  Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, et al. https://deepmind.com/blog/deepmind-and-blizzard-open-starcraft-ii-ai-research-environment/, 2017.
In this section, we formalize some definitions referred to in the rest of the paper. We also provide some of the proofs that will create suitable mathematical background for analysing impartial combinatorial games, such as Wythoff’s game. We start by giving a formal description of an impartial game.
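Let $S$ be a set of states, and let $A : S \to \mathcal{P}(S)$ be the legal moves function. An impartial game $G = (S, A)$ is a game played between two players $P_1$ and $P_2$, such that:
(1) $P_1$ and $P_2$ alternate in making moves, with $P_1$ going first.
(2) Given a state $s \in S$, the move made by the player whose turn it is has to be an element of $A(s)$.
(3) A player loses at state $s$ if and only if it is that player’s turn and $A(s)$ is the empty set, meaning that there are no legal actions for that player to take.
(4) There cannot exist a sequence of states $s_1, s_2, \ldots, s_n$ such that $s_2 \in A(s_1)$, $\ldots$, $s_n \in A(s_{n-1})$, $s_1 \in A(s_n)$.
(5) From every state $s \in S$ there exists a valid sequence of states $s_1, s_2, \ldots, s_n$ such that $s_1 = s$, $s_2 \in A(s_1)$, $\ldots$, $s_n \in A(s_{n-1})$, where $A(s_n)$ is the empty set. Thus, $s_n$ is a terminal state in the game.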
Conditions (1), (2), and (3) lay out the main structure of the game. Condition (4) insists that once a state has been reached, it cannot be re-accessed, and thus the game cannot go in cycles. Condition (5), combined with Condition (4), ensures that the game will always terminate, since every legal move must decrease the maximum distance to a terminal state where there are no available actions. Once the distance reaches zero, the player whose turn it is loses, and the other player wins.
We now prove, using the principle of mathematical induction, that in any well-defined impartial game every position is either hot or cold. To begin, we formally define the notions of hot and cold.
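Let $G = (S, A)$ be an impartial game with state set $S$ and legal moves function $A$. Let $s \in S$. We say $s$ is cold if $A(s) = \emptyset$, that is, $s$ is a terminal position. We say $s$ is hot if and only if there exists an $s' \in A(s)$ such that $s'$ is a cold position. If $s$ is not a terminal position, we say $s$ is cold if and only if for all $s' \in A(s)$, $s'$ is a hot position.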
Thus, the definitions of hot and cold build recursively on each other, and since terminal states being cold constitutes the necessary base case, the recursion is well-defined. For a reader unfamiliar with recursive constructions, however, the theorem we present next may not be immediately obvious.
Let $G = (S, A)$ be an impartial game with state set $S$ and legal moves function $A$. Then, for all $s \in S$, $s$ is either hot or cold.
Proof is by induction on the maximum distance the state $s$ has to a terminal state. We refer to this distance as the depth of the state.
In the base case, the depth is just $0$, implying that $s$ is a terminal state; then by definition $s$ is cold, so the theorem is true.
Now, we assume inductively that the depth of $s$ is greater than $0$. For all $s' \in A(s)$, the depth of $s'$ is necessarily smaller than that of $s$, thus the inductive hypothesis applies to show that every such $s'$ is either hot or cold.
Case 1: All such $s'$ are hot.
Then by definition, $s$ is cold.
Case 2: There exists an $s'$ which is cold.
Then by definition, $s$ is hot.
Since these are the only two cases, the result follows by induction. ∎
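The recursive definition of hot and cold, and the induction on depth used in the proof, translate directly into a memoized recursion. The following is a minimal sketch under stated assumptions: `classify` and `two_pile_nim_moves` are hypothetical names, states are assumed to be hashable tuples, and the legal moves function is assumed to return the list of successor states. Termination of the recursion relies on the game having no cycles.

```python
from functools import lru_cache


def classify(state, legal_moves):
    """Return 'cold' or 'hot' for `state`, following the recursive definition."""

    @lru_cache(maxsize=None)
    def is_cold(s):
        # A terminal state has no successors, so all() is vacuously True (cold).
        # A non-terminal state is cold exactly when every successor is hot.
        return all(not is_cold(t) for t in legal_moves(s))

    return "cold" if is_cold(state) else "hot"


def two_pile_nim_moves(state):
    """Successors in 2-pile Nim: remove at least one item from a single pile."""
    a, b = state
    return [(x, b) for x in range(a)] + [(a, y) for y in range(b)]


for s in [(0, 0), (2, 2), (3, 1), (4, 4), (5, 2)]:
    print(s, classify(s, two_pile_nim_moves))
```

Every state receives exactly one label because `is_cold` returns a single Boolean for each state, which mirrors the two exhaustive cases of the proof.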
Thus we have that for all impartial games, the states can be partitioned into hot and cold positions. The question that remains is what the partition is, given a specific impartial game. We answer this question for the game of Nim. We will prove that a position in $2$-pile Nim is hot exactly when the piles are asymmetrical.
Let $G = (S, A)$ be the impartial game of Nim restricted to only 2 piles. Then, $(a, b) \in S$ is a cold position if and only if $a = b$.
Proof is by induction on the depth of the state $(a, b)$. If the depth is $0$, we know $(a, b) = (0, 0)$, which is cold, and $a = b$, as desired. If the depth is greater than $0$, we have two cases.
First, suppose $a \neq b$. Without loss of generality, assume $a > b$. By the rules of Nim, there exists a move that decreases $a$ to $b$. Since $(b, b)$ has a smaller depth, by induction it is cold, so $(a, b)$ is hot, as desired.
Second, suppose $a = b$. In this case, since in Nim diagonal moves are disallowed, all moves will bring the game to a state where the piles are asymmetrical. Any new state will have a smaller depth, and by induction, will be hot. Thus, $(a, b)$ is cold, as desired. ∎
The partition in Nim for arbitrarily many dimensions has a similar structure, but requires a little more background to prove, so we state it below without proof.
Let $G = (S, A)$ represent an $n$-pile Nim game. Then, $(a_1, \ldots, a_n) \in S$ is a cold position if and only if, when $a_1$ through $a_n$ are combined with the bitwise exclusive-or operation (xor), the result is $0$.
Since bitwise logical operators bring us into the realm of stepwise distributions again, it becomes difficult for an HQN-like agent to converge on optimal performance.
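Although the proof of the xor rule is omitted, its inductive structure can be checked exhaustively on small boards. The sketch below is illustrative only; the names `nim_sum`, `nim_successors`, and `xor_rule_is_consistent` are hypothetical, and pile configurations are assumed to be tuples of non-negative integers. It checks the two facts that make "cold if and only if the xor is zero" consistent with the recursive definition of hot and cold: no move from a zero-xor state leads to another zero-xor state, and every nonzero-xor state has some move to a zero-xor state.

```python
from functools import reduce
from itertools import product
from operator import xor


def nim_sum(piles):
    """Bitwise xor of all pile sizes."""
    return reduce(xor, piles, 0)


def nim_successors(piles):
    """All states reachable by removing at least one item from one pile."""
    for i, size in enumerate(piles):
        for smaller in range(size):
            yield piles[:i] + (smaller,) + piles[i + 1:]


def xor_rule_is_consistent(piles):
    """Check the xor rule at one state against the definition of hot and cold."""
    successor_sums = [nim_sum(s) for s in nim_successors(piles)]
    if nim_sum(piles) == 0:
        # Predicted cold: no move may lead to another predicted-cold state.
        return all(v != 0 for v in successor_sums)
    # Predicted hot: some move must lead to a predicted-cold state.
    return any(v == 0 for v in successor_sums)


# Exhaustively check every 3-pile state with pile sizes 0..7.
assert all(xor_rule_is_consistent(p) for p in product(range(8), repeat=3))
print(nim_sum((3, 5, 6)), nim_sum((1, 2, 4)))
```

The check is a finite sanity test on one small board, not a substitute for the proof; the board size and pile count are arbitrary choices.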
For Wythoff’s game, the proof for the partition is again somewhat involved, and hence we omit a proof for Theorem 1. We do present, however, a short proof for the partition of the hot and cold positions for Euclid, making use of the properties of the golden ratio.
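Let $G = (S, A)$ be the impartial game of Euclid restricted to $2$ dimensions. Then, let $(a, b) \in S$, and without loss of generality, assume $a \geq b$. Then, $(a, b)$ is hot if and only if $a/b > \phi$, where $\phi = \frac{1 + \sqrt{5}}{2}$ is the golden ratio.
Given a game state $(a, b)$, it suffices to show (1) that if $a/b < \phi$, then $(a, b)$ is a cold position, and (2) otherwise $(a, b)$ is a hot position. Since we reduce one of the dimensions each move, and the theorem holds for terminal positions trivially, we can inductively assume the theorem holds for all states accessible from a given state.
Let $(a, b)$ be a game state for Euclid, and suppose without loss of generality that $a \geq b$.
(1) First, let $a/b < \phi$. Since $\phi < 2$, we have $a < 2b$, so the only state accessible from $(a, b)$ is $(a - b, b)$. We need to show $\frac{b}{a - b} > \phi$, which implies by inductive hypothesis that $(a - b, b)$ is a hot position.
$$\begin{aligned} \frac{b}{a - b} = \frac{1}{a/b - 1} &> \frac{1}{\phi - 1} && \text{since } a/b < \phi \text{ by assumption} \\ &= \phi && \text{by definition of the golden ratio} \end{aligned}$$
(2) Now, we let $a/b > \phi$. We want to access a game in the form $(a - kb, b)$ where $k$ is a positive integer, and $(a - kb, b)$ is a cold position. By inductive hypothesis, this is equivalent to saying $\frac{a - kb}{b} < \phi$ or $\frac{b}{a - kb} < \phi$, depending on whether $a - kb$ or $b$ is the larger integer.
The only reason why we would be unable to access such a state is if, while removing multiples of $b$ from $a$, we skip over the entirety of the range. This would only happen if there were no integer $k$ such that $(a - kb, b)$ is a cold position, i.e. such that $\frac{a - kb}{b} < \phi$ or $\frac{b}{a - kb} < \phi$ holds true. Combining the inequalities, and using $1/\phi = \phi - 1$, we see that we need to count the number of integers $k$ such that
$$\phi - 1 < \frac{a - kb}{b} < \phi, \quad \text{that is,} \quad \frac{a}{b} - \phi < k < \frac{a}{b} - \phi + 1.$$
This is an open interval of length $1$ with irrational endpoints, and its lower endpoint is positive because $a/b > \phi$. There will be precisely $1$ such value for $k$, so the entirety of the range cannot be leaped over, as desired. ∎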
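The counting step at the end of the proof can be checked numerically. The sketch below is an illustration under explicit assumptions, not the paper's code: it assumes a variant of Euclid in which both coordinates stay positive, a move subtracts a positive multiple of the smaller coordinate from the larger one, and a position with equal coordinates is terminal. The names `exceeds_phi`, `predicted_hot`, and `cold_targets` are hypothetical. The comparison with the golden ratio is done in exact integer arithmetic, using the fact that for $x \geq 1$, $x > \phi$ holds exactly when $x^2 - x - 1 > 0$.

```python
def exceeds_phi(a, b):
    """Exact test of a / b > phi for integers a >= b > 0, without floats.

    For x = a / b >= 1, x > phi holds exactly when x**2 - x - 1 > 0,
    which after multiplying by b**2 becomes a*a - a*b - b*b > 0.
    """
    return a * a - a * b - b * b > 0


def predicted_hot(a, b):
    """Prediction of the theorem: hot iff larger / smaller exceeds phi."""
    hi, lo = max(a, b), min(a, b)
    return exceeds_phi(hi, lo)


def cold_targets(a, b):
    """Multipliers k >= 1 such that (a - k*b, b) is legal and predicted cold."""
    assert a > b > 0
    # k may not exceed (a - 1) // b, so that a - k*b stays positive (assumed rule).
    return [k for k in range(1, (a - 1) // b + 1)
            if not predicted_hot(a - k * b, b)]


# From a predicted-hot state there is exactly one move to a predicted-cold
# state; from a predicted-cold state there is none.
for a in range(2, 60):
    for b in range(1, a):
        expected = 1 if predicted_hot(a, b) else 0
        assert len(cold_targets(a, b)) == expected

print(cold_targets(7, 2), cold_targets(8, 5))
```

Using the integer form of the inequality avoids rounding problems for ratios of consecutive Fibonacci numbers such as $8/5$, which lie very close to $\phi$.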