1 Introduction
Deep reinforcement learning (DRL) networks currently rival human-level performance in a variety of domains, including object recognition [1], speech recognition [2], video games [3], and complex board games such as Go [4]. Despite the impressive achievements of DRL, networks fail to adapt to trivial changes to the inputs and goals of the learning task, such as changes to board dimensions and structure [5]. One potential reason for this shortcoming is that DRL algorithms learn through extensive feedback about the value of specific input-output associations, without any appreciation for the organizing features of the game that govern these associations. In contrast, evidence from cognitive science suggests that humans learn to perform complex tasks through a model-based approach that involves constructing an internal model of the organizing principles or rules of the environment [5, 6]. This form of learning is particularly important in dynamic environments where survival depends on the ability to generalize previous training to novel settings [7]. While model-based learning algorithms stand to improve the robustness of DRL networks in dynamic environments, how this might be achieved remains a largely unanswered question.
One reason for the paucity of model-based approaches to deep learning is that DRL agents are typically developed to solve highly complex tasks, thereby precluding any straightforward process for exploring possible internal models of the environment [8]. One way to facilitate the development of deep model-based learning agents is to identify simpler testing environments that effectively reduce the number of available features from which internal models can be constructed. A recent paper [9] highlighted several advantages of simplified environments for evaluating the safety and robustness of DRL agents, showing that simple "gridworld" environments provide a tractable way to identify pitfalls in the learned action policy. Indeed, in these simple environments two state-of-the-art DRL networks failed to effectively adapt to subtle differences between training and testing environments, highlighting the need for more robust RL and DRL algorithms. Another important environmental characteristic for the purposes of building model-based agents is the ability to alter, remove, or introduce dimensions of the environment without rendering previous training irrelevant. In other words, in order to fairly evaluate the success of the agent, there needs to exist a reliable strategy or internal model that is robust across variations of the environment or task rules [10].
The conditions described above are satisfied by a class of impartial combinatorial games [11]. The most notable difference between an impartial game and a game such as Go is that most impartial games have a ground-truth solution. Every position in an impartial game is either hot, meaning that there exists a winning strategy for the player about to make a move, or cold, meaning that under optimal play, the player about to make a move will always lose. The distribution of hot and cold positions across the state space of an impartial game usually comes with inherent mathematical structure. In an impartial game, two players alternate in making moves until there are no available moves to make, with the player who makes the last move declared the winner. The function that takes in a state and returns the set of available legal actions has an infinite set as its domain, meaning it can be generalized to arbitrary dimensions, making impartial games particularly amenable to a model-based learning strategy.
Working within the constraints of impartial game theory, we now consider the differences between RL agents with model-free and model-based learning strategies. Substantial evidence from cognitive psychology and neuroscience suggests that model-based learning is associated with hierarchical information processing, with action-value associations learned at lower levels of the hierarchy and abstract predictions about the environment at higher levels [12, 13, 14, 15, 16, 17]. One example of such a hierarchy is the prefrontal cortico-basal ganglia (BG) network [18, 19] found in many mammalian species. Converging evidence from human and animal neurophysiological experiments shows that prefrontal BG networks engage in model-free learning of action values, driven by phasic dopaminergic signals from the midbrain that alter the weights of cortical inputs to the BG in accordance with environmental feedback [20, 21, 22]. As a result of this plasticity, rewarded (or punished) actions become more (or less) likely to be executed in the future. This form of learning is analogous to that enacted by deep Q-learning (DQL) agents that exhibit behavioral policies determined solely by the feedback of previous actions. Importantly, rather than arising from an entirely separate and independent process, model-based learning can be viewed as a companion system that is both informed by and exerts control over the feedback-dependent associations formed through model-free learning [17]. Behavior becomes "model-based" when lower-level feedback-dependent representations are leveraged to construct an internal model of the environmental dynamics responsible for previous observations [16], sometimes referred to as a "generative model". A key difference in the behavioral outcomes of these two forms of learning is flexibility [23, 17]: model-free learning results in habitual actions based on a static cache of associated values, whereas model-based learning results in goal-directed actions based on inferred dynamics of the environment. Evidence from human neuroimaging experiments suggests that the shift from model-free to model-based policies is driven by a concomitant shift from BG to prefrontal behavioral control [23, 24], signaling a shift away from feedback-dependent knowledge to active predictions drawn from the agent's internal model of the environment.
Motivated by the hierarchical organization of prefrontal cortico-BG systems that are thought to implement model-based learning in the human brain, we devised a novel Hierarchical Q-Network (HQN) that attempts to build an internal strategy (e.g., a generative model) based on inferred patterns of hot and cold positions (e.g., model-free Q-learning) in a variant of impartial combinatorial games called Wythoff's game. We show how this hierarchical learning structure promotes generalizability and robustness to rule changes while also improving post-training interpretability of learning outcomes. Compared to the performance of standard Q-learning and deep Q-networks, the HQN is markedly faster at learning the task and, more importantly, shows clear benefits to the transfer of learning, not only to alterations of Wythoff's game, but across a variety of other impartial games with distinct but similar rule structures. Below, we describe our findings, highlighting 1) the benefits afforded by impartial games for developing more robust deep learning agents and 2) the importance of hierarchical learning in environments that demand flexibility.
2 Methods
2.1 Impartial games: Wythoff’s game, Nim, and Euclid
Wythoff's game is played on a two-dimensional grid in which players take turns moving an object, initially placed in the bottom-right corner, towards the top-left corner. The player who places the object in the top-left corner ends, and thereby wins, the game. On each turn, the object can be moved horizontally, vertically, or diagonally towards the top-left corner.
Definition 1.
Wythoff's game is an impartial game whose states are all 2-dimensional non-negative integer coordinates. From coordinates (x, y), a player can access all states of the form (x - k, y), (x, y - k), and (x - k, y - k), where k > 0, x - k ≥ 0, and y - k ≥ 0.
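As a concrete illustration of Definition 1, the legal moves from a state can be enumerated directly; this is a minimal sketch (the function name is ours, not from the paper):

```python
def wythoff_moves(x, y):
    """Enumerate all legal moves from state (x, y) in Wythoff's game.

    A move shifts the object k > 0 squares left, up, or diagonally
    up-left, never past the (0, 0) corner.
    """
    moves = []
    for k in range(1, x + 1):          # horizontal: (x - k, y)
        moves.append((x - k, y))
    for k in range(1, y + 1):          # vertical: (x, y - k)
        moves.append((x, y - k))
    for k in range(1, min(x, y) + 1):  # diagonal: (x - k, y - k)
        moves.append((x - k, y - k))
    return moves
```

Note that (0, 0) is terminal: it admits no moves, so the player who reaches it wins.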
As mentioned above, every position in an impartial game is either hot or cold, indicating whether the player about to move will win or lose the game under optimal play. For formal definitions of hot and cold positions, see Definition 5; for an inductive proof of the partitioning, see Theorem 2.
The partition of hot and cold positions in Wythoff’s game is deeply embedded in properties of the Fibonacci string and the golden ratio [11].
Theorem 1.
Let a_n = ⌊nφ⌋ and b_n = ⌊nφ²⌋, where φ = (1 + √5)/2 is the golden ratio. Then every cold position in Wythoff's game is of the form (a_n, b_n) or (b_n, a_n), where n is a natural number.
The mathematical structure of Wythoff’s game (expressed by Theorem 1) manifests in a highly patterned separation of hot and cold positions (see Figure 2).
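Theorem 1 can be checked mechanically by comparing the golden-ratio formula against a brute-force game-tree classification (cold iff every move leads to a hot position); this sketch reuses the move rule of Definition 1:

```python
import math
from functools import lru_cache

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

def cold_position(n):
    """n-th cold position per Theorem 1: (floor(n*phi), floor(n*phi^2))."""
    return (math.floor(n * PHI), math.floor(n * PHI ** 2))

@lru_cache(maxsize=None)
def is_cold(x, y):
    """Brute-force check: a position is cold iff every legal move leads
    to a hot position. (0, 0) has no moves, so it is vacuously cold."""
    moves = ([(x - k, y) for k in range(1, x + 1)]
             + [(x, y - k) for k in range(1, y + 1)]
             + [(x - k, y - k) for k in range(1, min(x, y) + 1)])
    return all(not is_cold(*m) for m in moves)
```

The first few cold positions generated this way, (0, 0), (1, 2), (3, 5), (4, 7), match the patterned bands visible in Figure 2.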
While we benchmark our HQN agent on Wythoff's game, we will also subject the game to certain rule changes in later sections. The hot-cold partitions of the resulting games are structurally similar to that of Wythoff's game and are discussed in the Appendix. See Figure 12 for a visualization.
Definition 2.
We denote by Nim the impartial game resulting from Wythoff's game when diagonal moves are disallowed.
Definition 3.
We denote by Euclid the impartial game resulting from Nim where any distance travelled in the horizontal or vertical direction has to be a multiple of the minimum of the horizontal and vertical distances to the top-left corner.
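Reading the two definitions as modifications of the move rule in Definition 1 (from (x, y), the horizontal and vertical distances to the corner are x and y), the variants' move sets can be sketched as follows; the behaviour when min(x, y) = 0 in Euclid (no legal moves) is our assumption, not stated in the text:

```python
def nim_moves(x, y):
    """Nim variant (Definition 2): Wythoff's game without diagonal moves."""
    return ([(x - k, y) for k in range(1, x + 1)]
            + [(x, y - k) for k in range(1, y + 1)])

def euclid_moves(x, y):
    """Euclid variant (Definition 3): the distance travelled must be a
    multiple of m = min(x, y). We assume m == 0 leaves no legal moves."""
    m = min(x, y)
    if m == 0:
        return []
    moves = [(x - k * m, y) for k in range(1, x // m + 1)]
    moves += [(x, y - k * m) for k in range(1, y // m + 1)]
    return moves
```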
2.2 Hierarchical Q-Network (HQN)
2.2.1 Overview
The HQN is composed of two interconnected systems, the Q-agent and the Model-Network, that attempt to cooperatively generate an internal model of the task environment while working with datasets that differ in dimensionality. The Q-agent works with a high-dimensional dataset reflecting the expected value of state-action pairs. The Model-Network, on the other hand, feeds off of the conclusions of the Q-agent to obtain a low-dimensional dataset that solely reflects the expected values of given states. The Model-Network uses a deep neural network to extrapolate a model from the extracted dataset and evaluates the value of the model by testing it against an opponent simulated by the Q-agent. In return, the Model-Network biases the action policy of the Q-agent to favor movements to states more likely to generalize to larger environments. The behavior of the Q-agent then effectively explores state-action pairs that contradict or corroborate the current generative model of the Model-Network. The HQN succeeds if and only if the Model-Network converges on a generalizable model of the given environment.
2.2.2 Network Details
A summary of the underlying logic behind the HQN is given above, and a detailed discussion of its implementation, including pseudocode, is provided below. Here, we provide details about the architectures of the networks that compose the HQN.
The Q-agent component of the HQN uses Q-learning [26] to build estimates of how good a given state is for the player, based solely on gameplay experience. The learning rate (α) and the discount rate (γ) are both set to 1 (see subsection 2.2.3). Action selection is randomized through a Boltzmann distribution with a fixed exploration constant (τ). Further details are given in the next section.

The Model-Network is a feedforward, single-layer, fully-connected network. The error limit and maximum number of iterations are randomized, to account for errors due to overfitting or underfitting. The sigmoid function is used as the activation function for individual neurons. We use the standard backpropagation algorithm [27] to train the network, with the cross-entropy function as the cost function to avoid learning slowdown; since dysfunctional models are expendable, this tradeoff is worthwhile. The cross-entropy cost function is given by:
C = -(1/n) Σ_x [y ln a + (1 - y) ln(1 - a)],

where the sum is over all training inputs x, n is the number of inputs, y is the desired output, and a is the output of the neuron.
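The cross-entropy cost can be computed directly from the network outputs; this is a minimal sketch for single-output neurons (the function name is ours):

```python
import math

def cross_entropy_cost(outputs, targets):
    """Cross-entropy cost C = -(1/n) * sum_x [y*ln(a) + (1-y)*ln(1-a)],
    averaged over the n training inputs. `outputs` holds the neuron
    activations a in (0, 1); `targets` holds the desired outputs y."""
    n = len(outputs)
    return -sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                for a, y in zip(outputs, targets)) / n
```

Unlike the sum-squared error, this cost keeps the gradient large when the neuron is confidently wrong, which is the "learning slowdown" the text refers to avoiding.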
The Nimblenet [28] library for Python was used to simulate the neural network in the architecture of the Model-Network. Further details about the separate networks and their interaction are given in the next section.
2.2.3 The Q-agent
The Q-agent relies on Q-learning, a standard model-free reinforcement learning technique [26], to estimate the expected value (Q-value) of a state-action pair in a given environment over multiple training sessions (see Figure 3). Every move made adjusts the values stored in the Q-table through a value-iteration update. Directly updating state-action pairs in this way affords greater precision, and is computationally simpler, than relying on error propagation to adjust the weights of a neural network. The Q-Network was able to achieve similar performance to the basic Q-agent when action values were estimated independently of the current state (e.g., board position), albeit less efficiently.
Here (s, a) is a state-action pair, s' is the new state after action a is taken at state s, r is the reward associated with (s, a), and α and γ are the learning and the discount rate, respectively. Finally, A(s) is the function that returns the set of available actions from state s in the environment. Then, the Q-value is updated through:

Q(s, a) ← Q(s, a) + α [r + γ max_{a' ∈ A(s')} Q(s', a') - Q(s, a)]
We take α = 1 and γ = 1: since every action has an equal effect on the outcome in a given impartial game, discounting future rewards is redundant.
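The value-iteration update above can be sketched as a small helper over a Q-table; the function and parameter names are ours, and terminal states (no available actions) are assumed to contribute zero future value:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=1.0, gamma=1.0):
    """One Q-learning update on the table Q (a dict keyed by (state, action)):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    `actions(state)` returns the legal actions from that state; a terminal
    state with no actions contributes 0 future value."""
    future = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
```

With alpha = gamma = 1, the update simply overwrites Q(s, a) with r plus the best attainable value from the successor state.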
Actions are selected through a Boltzmann distribution that uses current approximations of the Q-values to generate a weighted probability space.
Explicitly, the probability that action a_i is selected is given by:

P(a_i) = exp(Q(s, a_i)/τ) / Σ_{a_j ∈ A(s)} exp(Q(s, a_j)/τ)
Here τ is the constant that determines how exploratory or exploitative the action-selection process will be. We hold τ fixed throughout, at a value reflecting a moderate degree of exploratory behavior in the model.
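Boltzmann selection can be sketched as follows; subtracting the maximum Q-value before exponentiating is a standard numerical-stability step we add, not something stated in the text:

```python
import math
import random

def boltzmann_select(q_values, tau, rng=random):
    """Sample an action index with probability proportional to
    exp(Q(s, a_i) / tau); higher tau means more exploration."""
    m = max(q_values)  # subtract max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

As tau approaches 0, selection becomes greedy; as tau grows, the distribution approaches uniform.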
It is important to note that the process that selects actions to explore also depends on the model as a variable. Actions favored by the model generated by the Model-Network get a boost in their probability of being selected. Before the decision process is left to the Boltzmann probability space, the HQN decides whether to explore the action recommended by the model with a probability that increases with the model's performance E, saturating at a limit L at a rate set by a steepness factor k. E is the expected value, or performance, of the model as computed by the Model-Network. L is the limit imposed on confidence in the model, in order to maintain that the Q-agent still operates mostly independently of the Model-Network; otherwise, "echo-loops" may be created in the HQN. k is the steepness factor, determining how fast the probability approaches the limit as E increases. We set L as an appropriate limit, and chose k to obtain a suitable probability function with respect to E, one in which the model does not begin to influence the Q-agent until E exceeds a minimum performance level.
2.2.4 The Model-Network
The question that motivates the Q-agent is "What moves should I make in which positions to maximize my likelihood of winning?". However, this question is restricted to the space in which learning occurs. Thus, the question that motivates the Model-Network is "Are some positions better for me than others, and if so, is there any structure to how these positions are distributed across the board?" For an n-by-n Wythoff's game, there are 2^(n²) possible ways in which good and bad positions could be distributed. Without the latter question, the former is too shortsighted to yield any useful insights into the nature of the game. The HQN architecture allows us to ask these questions simultaneously.
While the Q-agent attempts to approximate the Q-values of state-action pairs, the Model-Network works simply with the expected values of individual states in order to find a heuristic that will separate good states from bad states (see Figure 4). The expected value of a state is simply the Q-value of the best available action from that state.
The Model-Network, equipped with some fixed confidence threshold, creates a dataset classifying a state as cold or hot according to whether its expected value falls below or above that threshold. Random samples of this dataset are then fed into a neural network. We refer to the trained neural network as the model. Architectural details about the neural network were given in the previous section.

The Model-Network evaluates the performance of a model by benchmarking it against a greedy agent that has access to the Q-agent, as opposed to perfect play; the training process thus remains unsupervised. The model receives a performance score between 0 and 1, based on the ratio of games the model can win against the greedy Q-agent. Since the Q-agent almost always remains more accurate on smaller board sizes, benchmarking games are played on a larger board size so as to favor potentially generalizable models.
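The dataset-construction step can be sketched as follows; the hot = 1 / cold = 0 encoding, the symmetric ±threshold band, and the dropping of ambiguous states are illustrative assumptions on our part:

```python
def build_model_dataset(state_values, threshold):
    """Label states for the Model-Network from Q-agent state values
    (the value of a state is the Q-value of its best available action).

    States with value above +threshold are labelled hot (1), those below
    -threshold cold (0); states inside the band are treated as too
    uncertain and dropped. The +/-threshold convention is an assumption
    for illustration, not the paper's exact rule."""
    dataset = []
    for state, value in state_values.items():
        if value > threshold:
            dataset.append((state, 1))   # hot
        elif value < -threshold:
            dataset.append((state, 0))   # cold
    return dataset
```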
2.2.5 Q-agent and Model-Network
The full HQN integrates both the Q-agent and the Model-Network through the learning algorithm modelBasedLearn. On every iteration of the modelBasedLearn algorithm, the Q-agent is trained on a specified number of gameplays, and the process concludes with the construction of a candidate model, potentially replacing the current best-performing model. Note that even models that are eventually replaced have a positive impact on the learning outcomes, since hypotheses from flawed models get contradicted by the Q-agent, allowing for the construction of more accurate models in subsequent iterations.
Performance of the HQN agent was also tested against changes in the rules of the game, without explicitly notifying the agent of such changes. It is crucial that the HQN agent is able to detect such changes and adapt to the new rules of the game, especially if the HQN agent was trained on the same set of rules earlier. To do so, we allow the HQN agent access to a dataset consisting of the calculated performances of the current best-performing model. For a predetermined threshold, if the average of the past performances exceeds the currently calculated performance of the model by more than that threshold, the HQN detects a severe performance drop. In this case, the current model is stored away in case the need for that same model later arises. The store of past models is also checked for the existence of models that would fit the new rules of the game, and if so, that model is adopted as the current model.
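The drop-detection rule can be sketched as a small predicate; the name `delta` for the predetermined threshold is our choice, and we assume no drop can be flagged before any performance history exists:

```python
def detect_rule_change(history, current, delta):
    """Flag a severe performance drop: True when the average of past
    model performances exceeds the current performance by more than
    delta (the predetermined threshold in the text)."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return (sum(history) / len(history)) - current > delta
```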
2.2.6 Model-Free Learning Agents
We compared the HQN to two non-hierarchical implementations of Q-learning.
Q-agent
We benchmark the HQN against an independent Q-agent to illustrate the effect of adding the Model-Network to the system. This Q-agent has an almost identical framework to the Q-agent component of the HQN; the only difference is that its exploration procedure is not influenced by a Model-Network. Hence, we do not include further details about its implementation.
Q-Network
The core difference between the Q-Network and the Q-agent is that the Q-Network makes use of a neural network to approximate the Q-function, whereas the Q-agent does not attempt to make an inference beyond the lookup-table process for the Q-values of state-action pairs. A high-level explanation of the algorithm and detailed pseudocode are given in Figure 7. Nimblenet [28] was used to simulate the neural network.
The network was fully-connected and single-layer; additional layers did not have a significant effect on the learning outcomes. The standard backpropagation algorithm was used with the sum-squared-error cost function and the sigmoid activation function on individual units.
3 Results
3.1 Model Building
The Model-Network displayed great efficacy in producing generalizable models for Wythoff's game and its variants. Figure 8 shows how the two components of the HQN learned the value of board positions at different stages of learning. Models developed in the earlier stages of training remain mostly irrelevant to generalization; however, models that meaningfully generalize, albeit with low accuracy, begin to emerge soon after initial training. Such models are crucial for the learning process because they influence the way the Q-agent chooses to explore different action spaces. Without such guidance, the Q-agent explores actions without any overall purpose or insight. With guidance from the Model-Network, the Q-agent explores actions that would either contradict or confirm an overall hypothesis about the nature of the learning environment.
Wythoff's game has the type of mathematical structure that should be very easy for a neural network to recognize, explaining a significant portion of the HQN agent's success. Unfortunately, neural networks are less adept at recognizing discrete, stepwise patterns than they are at recognizing regions and finding slopes. For example, even the best models generated by the HQN agent for Wythoff's game largely ignored the stepwise distribution of the cold positions along each line. As previously mentioned, this issue suggests the need for a layer that can be more flexible in the types of models it can hypothesize.
3.2 HQN Efficiency vs. Q-agent and Q-Networks
We compared the performance of the HQN agent in Wythoff's game to that of a Q-agent and a Q-Network (QN). Figure 9 shows the accuracy of all three agents during learning. The HQN agent improves performance in discrete jumps as better models replace worse ones over time. Since models are assessed by the HQN in an unsupervised manner, some models evaluated to be better will in fact be less accurate, explaining the occasional fall in the performance of the HQN agent.
The core idea behind the Q-Network is to use a neural network to approximate the Q-function. Whereas traditional Q-learning attempts to fill in every single value of the Q-function with increasing accuracy in a lookup-table manner, a Q-Network attempts to train a neural network that approximates this function. The Q-Network is also able to interpolate after training, since the network approximates the Q-function continuously, filling in the gaps in the dataset. The benefits of such an approach have been demonstrated in detail in DeepMind's Atari network [3]. However, while attempting to be more efficient and general than a naive Q-agent, Q-Networks sacrifice a lot of stability. Retraining a Q-Network with a newly discovered dataset can be destructive to already existing features of the network. To remedy the destructive retraining issue, especially while training on large datasets, (deep) Q-Networks make use of "experience replay" [3]. A Q-Network agent that uses experience replay stores training data as it comes and backpropagates that data across the network at occasional intervals, as opposed to training on each novel datapoint as it arrives.

For Wythoff's game, the Q-Network agent's performance was subpar compared to the HQN and naive Q-agents, even with additional modifications such as experience replay. Overall Q-Network performance did not significantly exceed random chance within the time constraints that allowed the HQN and the Q-agent to attain reasonable performance. Giving the Q-Network additional advantages, such as training against a perfect agent or increasing the number of layers in the neural network, did not fix the disparity.
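An experience-replay store of the kind described can be sketched in a few lines; the class and its fixed-capacity eviction policy are a generic illustration, not the paper's exact implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay store: transitions are recorded as they
    occur and replayed in random minibatches at intervals, rather than
    training on each novel datapoint immediately."""

    def __init__(self, capacity, rng=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off
        self.rng = rng or random.Random()

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)
```

Sampling uniformly from the buffer breaks the temporal correlation between consecutive transitions, which is the main source of the destructive-retraining instability noted above.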
The only structural change that observably altered the behavior of the Q-Network was to equate actions and states in the training phase. In an impartial game, how good an action is depends only on which state the action takes the game to. Moves towards cold positions are good moves, whereas moves towards hot positions are bad moves. Under most learning tasks, this assumption does not hold; e.g., pressing left could win the game in one scenario but be disastrous in another. As illustrated in Figure 6, the Q-Network, similar to the HQN and the Q-agent, does not operate under this assumption, since the network trains to approximate how good an action is given the state.
However, we can hardcode the irrelevance of the starting state as an assumption, by representing an action simply as the encoding of the new state.
Under this framework, the task of the Q-Network would be to output the identical Q-vector that separates good states from bad states, given any state in the game. Since there are many more states in a game of Wythoff (on the order of n² for an n-by-n board) than there are actions from a given state (on the order of n), the resulting Q-vector will be significantly larger. Increasing the number of neurons to the same order fixes the problem while slowing down training. Nevertheless, the resulting modified agent is able to converge on strategies at a competitive pace.
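The action-as-state encoding can be sketched as a one-hot vector over the n² board positions, so that the Q-vector ranks destination states directly; the helper name and row-major indexing are our assumptions:

```python
def one_hot_state(state, n):
    """Encode an (x, y) board position on an n-by-n board as a one-hot
    vector over the n*n states, so that 'actions' and 'states' share one
    representation (row-major indexing assumed for illustration)."""
    x, y = state
    vec = [0.0] * (n * n)
    vec[x * n + y] = 1.0
    return vec
```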
This "trick", however, is inapplicable in most scenarios outside impartial games, which is why we did not hardcode such notions into the HQN agent. For a similar reason, we do not include the modified Q-Network agent that treats an action as a state in our analysis.
3.3 HQN performance across dimensions
The HQN agent was able to attain reasonably high levels of performance beyond the dimensions it was trained on. Figure 10 shows the accuracy of a specific model generated by the HQN agent for Wythoff's game across different board dimensions.
The Q-agent collected data on a 12 by 12 board, while the Model-Network evaluated generated models against the Q-agent on a 50 by 50 board. The fact that the Model-Network tests models by their performance on dimensions they were not trained on is crucial for prioritizing generalizability across dimensions. The fact that the Q-agent cannot perform optimally on higher dimensions is also an advantage, since models that achieve some level of generalizability will be assigned a higher score despite having poor accuracy on smaller boards.
Fortunately, even though the Model-Network was attempting to optimize performance only up to a 50 by 50 board size, the generated models displayed a reasonable degree of performance on larger boards as well. The model in Figure 10 achieved 70% accuracy on a 300 by 300 board.
3.4 HQN performance across rule variations
The HQN was also benchmarked in contexts where the rules of the game did not stay constant. Figure 11 shows the performance of the HQN agent across three games that had similar, but not identical, rules. The HQN agent started out by being trained on Wythoff's game. Once the agent reached satisfactory performance, we changed the gameplay rules to those of the game Nim, without informing the agent of this change. The agent was able to detect this change through the sharp decrease in the model's performance (as perceived by the HQN agent, displayed with the green line), store the model for the Wythoff task away, reset the Q-agent, and start training again. Since Nim (and later, Euclid) has a less complex action space, we decreased the period from 250 Q-agent gameplays to only 50, in order to slow the learning process down for visualization purposes. When satisfactory performance was reached in Nim, we changed the learning task to Euclid. We then cycled through the three tasks in a similar fashion once again; this time, however, the agent was able to attain satisfactory performance immediately after it detected a change in the rules.
4 Discussion
In this paper, we proposed some basic strategies for developing and evaluating agents that learn adaptable and robust strategies, an increasingly important goal for developing AI capable of navigating novel environments. The hierarchical structure of the HQN showed promise in the transfer-learning domain, while remaining competitive with standard RL approaches in terms of performance. We trained a Q-agent, a Q-Network, and an HQN for identical amounts of time on Wythoff's game (see Figure 9). The Q-agent was able to show improved accuracy, although at a steadily decreasing rate over time. The Q-Network, a more unstable but also more efficient advancement over the Q-agent algorithm, was not able to learn as well in this context of impartial games. The HQN agent, on the other hand, achieved increasing accuracy in discrete jumps as better models of the environment were discovered. The HQN also did more than merely excel in terms of efficiency. In the transfer-learning domain, where standard RL approaches are infamously unsuccessful, the HQN agent was able to achieve performance that generalized across dimensions (Figure 10) and remained resistant to changes in the rules of the game (Figure 11). Most importantly, we could query the HQN agent to show its strategy for game play in an intuitive and explainable way.

Towards this goal of extensibility in artificial agents, meta-reinforcement learning, the idea that RL agents can be trained to build better base networks for other RL agents to be trained on, holds a lot of promise. Wang et al. [29], Duan et al. [30], and Hansen [31] provide state-of-the-art approaches to meta-reinforcement learning, which they call Deep Meta-Reinforcement Learning (DMRL), RL², and Deep Episodic Value Iteration (DEVI), respectively. These agents are evaluated against benchmarks beyond efficiency and accuracy metrics, including one-shot changes to rewards and the ability to learn abstract task structure. Real et al. [32] and Miikkulainen et al. [33] also propose algorithms to optimize network architecture, including connectivity and parameters, for high-dimensional deep learning tasks such as image recognition and language modeling. We consider these efforts important as we aim for artificial networks that can generalize across tasks and yield interpretable outcomes.
4.1 The Hierarchical Q Network
Our key innovation in this study was the introduction of the Hierarchical Q-Network (HQN), a model-based learning agent that capitalizes on hierarchical information processing (see the discussion of biological motivations in subsection 4.2). The HQN was composed of a "lower" layer, the Q-agent, that explored the high-dimensional state-action search space, and a "higher" layer, the Model-Network, that abstracted away the action dimension and processed the expected values of states to extract generalized structure from the environment. More important than the hierarchical structure of the HQN, however, is that the two networks interact in such a way that observations by the Q-agent effectively inform model building, and hypotheses generated by the Model-Network effectively constrain future action policies. Without the Model-Network, the Q-agent blindly explores the massive search space without any "insight". Conversely, without the Q-agent, the Model-Network has no information from which to generalize.
While the HQN's performance was superior to that of the other RL agents tested here, we should point out that it suffers from limitations that future work should address. One inherent limitation of the HQN agent as proposed in this paper is that neural networks were used to implement the Model-Network. Neural networks have proved suitable for a wide variety of learning tasks; however, they have important limitations. For instance, a standard neural network will not be able to classify objects that follow a discrete pattern. In Wythoff's game, even though the Model-Network was able to generate models that recognized the two symmetrical lines of cold positions, the network was unable to appreciate the discrete intervals separating cold positions along each of the lines. Processing units that can independently and cohesively handle a vast array of decision problems are essential if the goal is to understand and simulate how the biological brain can seamlessly navigate a highly complex physical environment where the inputs and goals of a learning task can change rapidly. We propose that symbolic representations combined with the strengths of the statistical approaches of neural networks might be extremely useful. An initial attempt to explore such an intersection is given by Garnelo et al., who propose a symbolic model-based learning agent [34]. We intend to follow a similar direction in our future work.
4.2 Biological Motivations for Hierarchical Processing
The advantages of the HQN, along with recent work by others [5, 3, 31], suggest that hierarchical structure is an effective catalyst for adaptive and generalizable learning in artificial agents. Indeed, substantial evidence from experimental and computational neuroscience suggests the same is true of biological brains [15, 18, 24], pointing towards the looped architecture of cortico-basal ganglia networks as an important feature for model-based and model-free learning systems [18, 24]. The basal ganglia (BG) is a subcortical network that receives widespread cortical input through the striatum, forming a channel-like architecture, with each channel representing a particular action, that loops back up to motor cortex through the thalamus [35]. Critically, each action channel in the BG contains a facilitation and a suppression pathway, capable of exerting bidirectional control over the corresponding action channel in primary motor cortex. Schultz and colleagues [36] famously showed that, during learning, the weights of these pathways are adjusted by phasic changes in striatal dopamine, encoding both the magnitude and sign of the prediction errors estimated by Q-learning models. This dopamine-dependent plasticity of cortico-striatal connections serves to reinforce the future selection of rewarding actions while suppressing less desirable alternatives, serving a computational goal similar to that of Q-Networks [37, 38, 39]. However, as previously mentioned, relying on feedback alone to drive learning 1) quickly becomes inefficient as task complexity increases, 2) limits the range of learned associations that can be simultaneously stored and exploited, and 3) fails to account for the robust and flexible nature of mammalian behavior.
The fundamental idea behind model-based learning is that, through experience and observation, internal beliefs are formed about the causal relationships between contextual features, states, and action values. For hierarchically structured tasks, in which state-action values depend on multiple, nested contextual features, generative models offer an imperfect but highly efficient strategy for guiding action selection. Critically, however, implementing a model-based learning strategy often relies on simultaneously learning from feedback in a model-free manner. Thus, the challenge of implementing model-based learning is twofold, requiring 1) a generative mechanism for constructing hypotheses and 2) fluid interaction between inferential and feedback-dependent learning systems.
Both the neuroscience [40, 41, 42] and machine learning [43, 7, 44, 45] communities have shown a growing interest in model-based learning mechanisms, leading to mutually informative lines of investigation (e.g., understanding how biological brains encode model-based learning strategies provides hints for overcoming the challenges of model-based learning in artificial agents). Evidence from human neuroimaging studies suggests that model-free learning computations in the BG are regulated by top-down inputs from a model-based learning system in the prefrontal cortex (PFC) [15]. Critically, due to the looped architecture of cortico-BG pathways, model-based computations in cortex are informed by feedback-dependent updates in the action-value landscape. Over time, cortical model-based learning systems generate predictions based on model-free computations and, in turn, provide top-down constraints that regulate feedback sensitivity and decision policies in the BG. This symbiosis between BG- and PFC-dependent learning systems is mirrored in the HQN, with observed state-action values in the Q-Network facilitating better predictive models in the Model Network that, in turn, improve future performance through top-down constraints on action evaluation. This scaffolding of model-based and model-free learning computations accelerates the learning process by proactively testing different hypotheses about the rule structure of the task and constraining future decision policies as confidence increases about the fidelity of these expectations.
4.3 Impartial Games as a Benchmark
We should point out that we are not the first to observe and leverage the fact that the chosen benchmark environment profoundly influences the learning agents we design. Although the success of the DeepMind Atari Network was impressive, the benchmark featured implausible 2D environments viewed from a third-person perspective. Kempka et al. developed VizDoom [46], a dynamic first-person learning environment, as an alternative benchmark for visual RL agents. The agent of [30] was evaluated in the VizDoom environment to demonstrate adaptability to large-scale problems. More recently, DeepMind and Blizzard announced a partnership [47] to use StarCraft II as an AI research environment. StarCraft II is a third-person strategy game with complicated raw visual input, large state and action spaces, and rewards and punishments that arrive long after the actions that caused them. Initial results already show that this new learning environment will be a challenge for even the most well-established deep reinforcement learning architectures.
We share similar goals with most of the aforementioned research, including designing learning agents that are more adaptable to changes in inputs and goals, as well as ensuring that learning outcomes are interpretable to humans. However, our central argument is that, in order to achieve these goals, tasks should be designed in which adaptability, as opposed to accuracy, is prioritized. One way our approach separates itself is in the sheer simplicity of the chosen learning task: impartial games are equipped with rules straightforward enough that their winning strategies have a complete mathematical theory. The fact that impartial games are “solved” games allows us to evaluate performance conveniently and to shift our focus entirely to transfer learning and model building.
Despite their simplicity, the scalability of impartial games makes them uniquely conducive to experimentation with model-based learning algorithms. Common benchmarks such as multi-armed bandit problems [29] lack an environment that must be navigated through dynamic model building: a model of the environment cannot go beyond the predetermined expected value and variance of the reward distribution. The complexity of games like Go and StarCraft II, on the other hand, precludes any straightforward approach to model building. For impartial games, model building can be performed by exploring the geometric structure of value over the topology of the game environment. Thus, we argue that impartial games offer a more suitable environment for rigorously testing and comparing deep model-based agents. The benefits of using impartial games for benchmarking model-based deep learning are summarized below.

The rules of impartial games immediately generalize to larger board dimensions in a way that preserves the mathematical structure of the winning strategy. This feature allows us to differentiate between learning agents beyond simply looking at their performance: to determine whether a learning agent has truly understood the nature of the game environment, we need only benchmark it on a larger board size. Thus, the learning outcomes of an agent become more transparent. Games like Chess or Go do not have structure that generalizes straightforwardly across board sizes, so such an approach to benchmarking would be infeasible.

Impartial games offer a wide variety of ways to change the rules of the game without destroying the inherent mathematical structure of the winning strategy associated with the specific set of rules. Just by imposing some natural restrictions on the function that returns the set of legal moves from a position in Wythoff’s game, we were able to generate two other games (Nim and Euclid) whose winning strategies have a similar mathematical structure. Rule changes in games such as Chess or Go, however insignificant, may influence overall strategy in very intricate ways; Chess and Go would therefore be less accessible for initial attempts at transfer learning across rule changes.

In many impartial games, structure and noise can coexist, as in the real world. This is a feature we did not exploit in this paper, but it reflects another advantage of impartial games. For example, a complete mathematical characterization of the winning (hot) and losing (cold) positions in 3D Wythoff’s game has, as of this writing, not been discovered. However, results from the 2D version partially generalize to shed some light on optimal behaviour. A learning agent that can figure out how to make such generalizations across dimensions would be a worthwhile challenge.
All these features allow us to conclude that impartial games, when taken as a benchmark for learning agents, allow us to pose questions to an agent where answers in the affirmative demonstrate a type of intelligence that goes beyond brute-force pattern matching.
References
 [1] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.

 [2] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
 [3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
 [4] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [5] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
 [6] Asako Toyama, Kentaro Katahira, and Hideki Ohira. A simple computational algorithm of model-based choice preference. Cognitive, Affective, & Behavioral Neuroscience, pages 1–20, 2017.
 [7] Wouter Kool, Fiery A Cushman, and Samuel J Gershman. When does model-based control pay off? PLoS computational biology, 12(8):e1005090, 2016.
 [8] Gary Marcus. Deep learning: A critical appraisal. arXiv, 2018.
 [9] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, and A. Lefrancq. AI safety gridworlds. arXiv, 2017.
 [10] Falk Lieder and Thomas L Griffiths. When to use which heuristic: A rational solution to the strategy selection problem. In CogSci, 2015.
 [11] E. R. Berlekamp, J. H. Conway, and R. K. Guy. Winning Ways for your Mathematical Plays. A K Peters, Natick, MA, 1982.
 [12] B. B. Doll, D. A. Simon, and N. D. Daw. The ubiquity of model-based reinforcement learning. Current opinion in neurobiology, 22(6):1075–1081, 2012.
 [13] P. Smittenaar, T. H. FitzGerald, V. Romei, N. D. Wright, and R. J. Dolan. Disruption of dorsolateral prefrontal cortex decreases model-based in favor of model-free control in humans. Neuron, 80(4):914–919, 2013.
 [14] K. Wunderlich, P. Smittenaar, and R. J. Dolan. Dopamine enhances model-based over model-free choice behavior. Neuron, 75(3):418–424, 2012.
 [15] B. B. Doll, K. D. Duncan, D. Simon, D. Shohamy, and N. D. Daw. Model-based choices involve prospective neural activity. Nature Neuroscience, 18:1–9, 2015.
 [16] E. M. Russek, I. Momennejad, M. M. Botvinick, S. J. Gershman, and N. D. Daw. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLOS Computational Biology, 13:9, 2017.
 [17] J. P. O’Doherty, J. Cockburn, and W. M. Pauli. Learning, reward, and decision making. Annual review of psychology, 68:73–100, 2017.
 [18] M. J. Frank and D. Badre. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: Computational analysis. Cerebral Cortex, 22:509–526, 2012.
 [19] D. Badre and M. D’esposito. Is the rostrocaudal axis of the frontal lobe hierarchical? Nature Reviews Neuroscience, 10(9):659–669, 2009.
 [20] W. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.
 [21] Neir Eshel, Michael Bukwich, Vinod Rao, Vivian Hemmelder, Ju Tian, and Naoshige Uchida. Arithmetic and local circuitry underlying dopamine prediction errors. Nature, 525(7568):243–246, 2015.
 [22] Neir Eshel, Ju Tian, Michael Bukwich, and Naoshige Uchida. Dopamine neurons share common response function for reward prediction error. Nature neuroscience, 19(3):479–486, 2016.
 [23] N. D. Daw, Y. Niv, and P. Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12):1704–1711, 2005.
 [24] D. Badre and M. J. Frank. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 2: evidence from fmri. Cereb. Cortex, 22:527–36, 2011.
 [25] Zachary Abel. Putting the why in Wythoff. http://blog.zacharyabel.com/2012/06/puttingthewhyinwythoff/, 2014.
 [26] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction (Vol. 1, No. 1). MIT press, Cambridge, 1998.
 [27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science, 8506, 1985.
 [28] Jorgen Grimnes. Nimblenet. http://jorgenkg.github.io/pythonneuralnetwork/, 2016. Github repository.
 [29] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, et al. Learning to reinforcement learn. arXiv, 2017.
 [30] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, et al. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv, 2016.
 [31] Steven S. Hansen. Deep episodic value iteration for model-based meta-reinforcement learning. arXiv, 2017.
 [32] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, et al. Large-scale evolution of image classifiers. arXiv, 2017.
 [33] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, et al. Evolving deep neural networks. 2017.
 [34] M. Garnelo, K. Arulkumaran, and M. Shanahan. Towards Deep Symbolic Reinforcement Learning. ArXiv eprints, September 2016.
 [35] G. E. Alexander, M. R. DeLong, and P. L. Strick. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci., 9:357–381, 1986.
 [36] W. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.
 [37] M. J. Frank, L. C. Seeberger, and R. C. O’reilly. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306:1940–3, 2004.
 [38] S. M. L. Cox et al. Striatal d1 and d2 signaling differentially predict learning from positive and negative outcomes. Neuroimage, 109:95–101, 2015.
 [39] A. V. Kravitz, L. D. Tye, and A. C. Kreitzer. Distinct roles for direct and indirect pathway striatal neurons in reinforcement. Nature Neuroscience, 15:816–8, 2012.
 [40] Michael A McDannald, Yuji K Takahashi, Nina Lopatina, Brad W Pietras, Josh L Jones, and Geoffrey Schoenbaum. Model-based learning and the contribution of the orbitofrontal cortex to the model-free world. European Journal of Neuroscience, 35(7):991–996, 2012.
 [41] Peter Dayan and Kent C Berridge. Model-based and model-free pavlovian reward learning: revaluation, revision, and revelation. Cognitive, Affective, & Behavioral Neuroscience, 14(2):473–492, 2014.
 [42] Nathaniel D Daw, Samuel J Gershman, Ben Seymour, Peter Dayan, and Raymond J Dolan. Model-based influences on humans’ choices and striatal prediction errors. Neuron, 69(6):1204–1215, 2011.

 [43] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95:245–258, 2017.
 [44] Nathaniel D Daw and Peter Dayan. The algorithmic anatomy of model-based evaluation. Phil. Trans. R. Soc. B, 369(1655):20130478, 2014.
 [45] Kenji Doya, Kazuyuki Samejima, Kenichi Katagiri, and Mitsuo Kawato. Multiple model-based reinforcement learning. Neural computation, 14(6):1347–1369, 2002.
 [46] Michal Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, 2016.
 [47] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, et al. https://deepmind.com/blog/deepmindandblizzardopenstarcraftiiairesearchenvironment/, 2017.
5 Appendix
In this section, we formalize some of the definitions referred to in the rest of the paper. We also provide some of the proofs that establish a suitable mathematical background for analysing impartial combinatorial games such as Wythoff’s game. We start by giving a formal description of an impartial game.
Definition 4.
Let $S$ be a set of states, and let $M : S \to 2^S$ be the legal moves function. An impartial game is a game played between two players $A$ and $B$, such that:

1. $A$ and $B$ alternate in making moves, with $A$ going first.

2. Given a state $s \in S$, the move made by either player has to be an element of $M(s)$.

3. A player loses at state $s$ if and only if it is that player's turn and $M(s)$ is the empty set, meaning that there are no legal actions for that player to take.

4. There cannot exist a sequence of states $s_1, s_2, \ldots, s_n$ such that $s_{i+1} \in M(s_i)$ for all $i < n$ and $s_n = s_1$.

5. From every state $s$ there exists a valid sequence of states $s = s_1, s_2, \ldots, s_n$ such that $s_{i+1} \in M(s_i)$ for all $i < n$ and $M(s_n)$ is the empty set. Thus, $s_n$ is a terminal state of the game.
Conditions 1, 2, and 3 lay out the main structure of the game. Condition 4 insists that once a state has been reached, it cannot be re-accessed, and thus the game cannot go in cycles. Condition 5, combined with Condition 4, ensures that the game will always terminate, since every legal move must decrease the maximum distance to a terminal state where there are no available actions. Once the distance reaches zero, the player whose turn it is loses, and the other player wins.
Next, we prove, using the principle of mathematical induction, that in any well-defined impartial game every position is indeed either hot or cold. We begin by formally defining the notions of hot and cold.
Definition 5.
Let $G = (S, M)$ be an impartial game, and let $s \in S$. We say $s$ is cold if $M(s) = \emptyset$, that is, if $s$ is a terminal position. We say $s$ is hot if and only if there exists an $s' \in M(s)$ such that $s'$ is a cold position. If $s$ is not a terminal position, we say $s$ is cold if and only if for all $s' \in M(s)$, $s'$ is a hot position.
Thus, the definitions of hot and cold recursively build on each other, and since terminal states being cold constitutes the necessary base case, the recursion is well-defined. However, if the reader is unfamiliar with recursive constructions, the theorem we present next does not immediately follow.
Theorem 2.
Let $G = (S, M)$ be an impartial game. Then, for all $s \in S$, $s$ is either hot or cold.
Proof.
The proof is by induction on the maximum distance from the state $s$ to a terminal state. We refer to this distance as the depth of the state.
In the base case, the depth is $0$, implying that $s$ is a terminal state; then by definition $s$ is cold, so the theorem holds.
Now, assume inductively that the depth of $s$ is greater than $0$. For all $s' \in M(s)$, the depth of $s'$ is necessarily smaller than that of $s$, so the inductive hypothesis applies to show that each such $s'$ is either hot or cold.
Case 1: All such $s'$ are hot.
Then by definition, $s$ is cold.
Case 2: There exists an $s' \in M(s)$ which is cold.
Then by definition, $s$ is hot.
Since these are the only two cases, the result follows by induction. ∎
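The induction above also doubles as an algorithm: since every successor has smaller depth, a memoized recursion over the game tree terminates and labels every state. A minimal sketch (function names are ours), shown here with 2-pile Nim as the legal-moves function:

```python
from functools import lru_cache

def classify(state, moves):
    """Label a state 'hot' or 'cold' given a legal-moves function,
    mirroring Definition 5: terminal states are cold, a state with a
    cold successor is hot, and a state whose successors are all hot
    is cold."""
    @lru_cache(maxsize=None)
    def is_cold(s):
        # all(...) is vacuously True for terminal states (no successors).
        return all(not is_cold(t) for t in moves(s))
    return 'cold' if is_cold(state) else 'hot'

def nim2_moves(state):
    """Legal moves in 2-pile Nim: reduce either pile to any smaller size."""
    x, y = state
    return [(k, y) for k in range(x)] + [(x, k) for k in range(y)]

print(classify((3, 3), nim2_moves))  # cold
print(classify((5, 2), nim2_moves))  # hot
```

Because states and moves are the only inputs, the same procedure classifies any impartial game satisfying Definition 4, as long as its state space is small enough to enumerate.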
Thus, for all impartial games, the states can be partitioned into hot and cold positions. The question that remains is what the partition is for a given impartial game. We answer this question for the game of Nim, first proving that a position in 2-pile Nim is hot precisely when the piles are asymmetrical.
Theorem 3.
Let $G$ be the impartial game of Nim restricted to only 2 piles. Then, a state $(x, y)$ is a cold position if and only if $x = y$.
Proof.
The proof is by induction on the depth of the state $(x, y)$. If the depth is $0$, we know $(x, y) = (0, 0)$, which is cold and satisfies $x = y$, as desired. If the depth is greater than $0$, we have two cases.
Case 1: $x \neq y$.
Without loss of generality, assume $x > y$. By the rules of Nim, there exists a move that decreases $x$ to $y$, reaching the state $(y, y)$. Since $(y, y)$ has a smaller depth, by induction it is cold, so $(x, y)$ is hot, as desired.
Case 2: $x = y$.
In this case, since diagonal moves are disallowed in Nim, every move will bring the game to a state in which the piles are asymmetrical. Any such state has a smaller depth and is therefore, by induction, hot. Thus, $(x, y)$ is cold, as desired.
∎
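As a quick mechanical check of Theorem 3, the recursive definition of cold can be evaluated exhaustively on small boards; the sketch below (names are ours) confirms that the cold positions of 2-pile Nim are exactly the symmetrical ones:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def nim2_cold(x, y):
    """Game-tree evaluation: a 2-pile Nim state is cold iff every legal
    move leads to a hot state; the terminal (0, 0) is vacuously cold."""
    succs = [(k, y) for k in range(x)] + [(x, k) for k in range(y)]
    return all(not nim2_cold(*s) for s in succs)

# Theorem 3: (x, y) is cold exactly when the piles are symmetrical.
assert all(nim2_cold(x, y) == (x == y)
           for x in range(10) for y in range(10))
```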
The partition for Nim with arbitrarily many piles has a similar structure but requires a little more background to prove, so we simply state it below.
Theorem 4.
Let $(x_1, x_2, \ldots, x_n)$ represent an $n$-pile Nim game. Then, $(x_1, \ldots, x_n)$ is a cold position if and only if combining $x_1$ through $x_n$ with the bitwise exclusive-or operation (xor) yields $0$, that is, $x_1 \oplus x_2 \oplus \cdots \oplus x_n = 0$.
Since bitwise logical operators bring us back into the realm of discrete, stepwise patterns, it becomes difficult for an HQN-like agent to converge on optimal performance.
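Theorem 4's xor characterization is cheap to verify against brute-force game-tree evaluation on small instances. A sketch (names are ours):

```python
import operator
from functools import lru_cache, reduce
from itertools import product

def nim_cold_xor(piles):
    """Theorem 4: a Nim position is cold iff the xor (nim-sum) of the
    pile sizes is 0."""
    return reduce(operator.xor, piles, 0) == 0

@lru_cache(maxsize=None)
def nim_cold_tree(piles):
    """Brute-force game-tree evaluation for comparison: a move reduces
    exactly one pile to a strictly smaller size."""
    succs = [piles[:i] + (k,) + piles[i + 1:]
             for i in range(len(piles)) for k in range((piles[i]))]
    return all(not nim_cold_tree(s) for s in succs)

# The two characterizations agree on all 3-pile positions up to size 5.
assert all(nim_cold_xor(p) == nim_cold_tree(p)
           for p in product(range(6), repeat=3))
```

Note how terse the xor test is compared with the exponential game-tree search, and also how discontinuous it is as a function of the pile sizes: changing one pile by 1 can flip the label anywhere on the board.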
For Wythoff’s game, the proof of the partition is again somewhat involved, and hence we omit a proof of Theorem 1. We do, however, present a short proof of the partition of the hot and cold positions for Euclid, making use of the properties of the golden ratio.
Theorem 5.
Let $G$ be the impartial game of Euclid restricted to $2$ dimensions, let $(a, b)$ be a state, and without loss of generality assume $a \le b$. Then, $(a, b)$ is hot if and only if $b/a > \phi$, where $\phi = (1 + \sqrt{5})/2$ is the golden ratio.
Proof.
Given a game state $(a, b)$, it suffices to show (1) that if $b/a < \phi$, then $(a, b)$ is a cold position, and (2) that otherwise $(a, b)$ is a hot position. Since each move reduces one of the dimensions, and the theorem holds trivially for terminal positions, we can inductively assume that the theorem holds for all states accessible from a given state.
Let $(a, b)$ be a game state for Euclid, and suppose without loss of generality that $a \le b$.
(1) First, let $b/a < \phi$. Since $b < \phi a < 2a$, the only state accessible from $(a, b)$ is $(b - a, a)$. We need to show that $a/(b - a) > \phi$, which implies by the inductive hypothesis that $(b - a, a)$ is a hot position. Since by assumption $b - a < \phi a - a = (\phi - 1)a$, and by definition of the golden ratio $\phi - 1 = 1/\phi$, we obtain
$$\frac{a}{b - a} > \frac{a}{(\phi - 1)a} = \frac{1}{\phi - 1} = \phi.$$
(2) Now, let $b/a > \phi$. We want to access a state of the form $(b - ka, a)$, where $k$ is a positive integer and $(b - ka, a)$ is a cold position. By the inductive hypothesis, this is equivalent to requiring $a/(b - ka) < \phi$ or $(b - ka)/a < \phi$, depending on whether $a$ or $b - ka$ is the larger integer.
The only reason we would be unable to access such a state is that, while removing multiples of $a$ from $b$, we skip over the entire range of suitable values. This could only happen if $a$ were larger than the number of integers $c$ for which $(c, a)$ is a cold position, i.e., for which $c > a/\phi$ or $c < \phi a$ holds in the respective case. Combining the two inequalities, we see that we need to count the integers $c$ such that $a/\phi < c < \phi a$.
There will be precisely $a$ such values of $c$, since the interval $(a/\phi,\, \phi a)$ has length $(\phi - 1/\phi)a = a$; hence the entirety of the range cannot be leaped over, as desired. ∎
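Theorem 5 can likewise be checked mechanically on small boards. The sketch below (names are ours) evaluates the 2-dimensional Euclid game tree under the move rule assumed in the proof (subtract a positive multiple of the smaller entry from the larger, keeping both entries positive) and compares the result to the golden-ratio test:

```python
import math
from functools import lru_cache

PHI = (1 + math.sqrt(5)) / 2

@lru_cache(maxsize=None)
def euclid_cold(a, b):
    """2-D Euclid via game-tree evaluation: subtract a positive multiple
    of the smaller entry from the larger, keeping both entries positive;
    a state of the form (a, a) is terminal and therefore cold."""
    a, b = min(a, b), max(a, b)
    succs = [(a, b - k * a) for k in range(1, b // a + 1) if b - k * a > 0]
    return all(not euclid_cold(*s) for s in succs)

# Theorem 5: with a <= b, (a, b) is hot iff b / a exceeds the golden
# ratio. (Equality b/a == phi never occurs, since phi is irrational.)
assert all((not euclid_cold(a, b)) == (b / a > PHI)
           for a in range(1, 15) for b in range(a, 30))
```

Consistent with the theorem, the cold positions include the consecutive Fibonacci pairs such as $(2, 3)$ and $(13, 21)$, whose ratios approach $\phi$ from below and above without ever crossing it in the losing direction.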