Comparing Knowledge-based Reinforcement Learning to Neural Networks in a Strategy Game

01/15/2019 · by Liudmyla Nechepurenko et al. · Arago

We compare a novel Knowledge-based Reinforcement Learning (KB-RL) approach with the traditional Neural Network (NN) method in solving a classical task of the Artificial Intelligence (AI) field. Neural networks became very prominent in recent years and, combined with Reinforcement Learning, proved to be very effective for one of the frontier challenges in AI - playing the game of Go. Our experiment shows that a KB-RL system is able to outperform a NN in a task typical for NN, such as optimizing a regression problem. Furthermore, KB-RL offers a range of advantages in comparison to the traditional Machine Learning methods. Particularly, there is no need for a large dataset to start and succeed with this approach, its learning process takes considerably less effort, and its decisions are fully controllable, explicit and predictable.

1 Introduction

The core difference between the hype and the reality of AI is that machines do not have a human-like brain. In fact, machines do not understand, do not reason, and do not infer. Nevertheless, people have been seeking for decades to create human-like intelligent machines by trying to simulate and inherit the way the human brain operates. One of the most recent trends in AI is Machine Learning (ML). To a large extent, ML owes its popularity to the development of Neural Networks - the technology inspired by the biological neurons in the human brain. The fact that NNs can solve previously unsolvable tasks drew a lot of attention to the field. Thus, it is no surprise that nowadays the majority of research in AI is focused mostly on NNs, while little attention is paid to other methods.

However, NNs have their own limitations and difficulties. Firstly, they are data greedy. NNs require, for example, hundreds of thousands of times more data than a human to learn. While in such domains as speech or image recognition the Internet has created an abundance of data, in some areas acquiring vast amounts of data is challenging [Ng2015]. The lack of data in such cases makes NNs ineffective and impractical. Secondly, handling huge amounts of data in NNs requires extensive computational power. Giant companies such as Amazon, Google or Facebook have access to sufficient hardware resources and can train NNs for weeks within their budget. Yet, for smaller businesses and projects, the availability of CPUs and RAM becomes the overwhelming constraint. Another major drawback of NNs is that they are one-task models. Once trained, a NN model can be incredibly effective at a specific task, such as detecting objects or playing a game. However, NNs cannot operate like a human brain, solving various tasks, generalizing concepts, and transferring knowledge between different domains. Combining NNs with Reinforcement Learning into Deep Reinforcement Learning (DRL) opened new possibilities for AI. However, DRL also inherited the aforementioned disadvantages of Neural Networks.

The method discussed in this paper is based on the idea that human knowledge can be leveraged in automating problem solving. Rather than collecting tons of data to feed a neural network, teaching explicit rules to the machine can significantly shorten the time for finding the optimal solution. In many cases, people possess a lot of knowledge about the problem, and learn from each other when the knowledge is missing. Similarly, humans could share their knowledge with the machine. Then, instead of a blank start, computers can begin problem solving like a human expert: by reasoning about the available knowledge, iterating through it, and optimizing.

This idea motivated the Arago company to develop its Knowledge-based Reinforcement Learning approach. To demonstrate the capability of this approach, Arago decided to pick a problem that is sufficiently challenging and closely related to real-world tasks. Thus, the strategy game CIVILIZATION was chosen as a benchmark. The motivation for this choice can be summarized in the following reasons:

  • Historically, games have been considered an excellent test-bed for AI research [Laird and van Lent2000]. Games such as Maze, Chess, and Checkers have been universal benchmarks for AI studies since the origin of the field. More recently, Atari console games [Mnih et al.2013], Mario [Dann et al.2017] and StarCraft [Hu et al.2018] gained increasing interest due to their higher complexity and the availability of well-developed APIs. Most recently, AlphaGo managed to win against the World Champion in Go [Silver et al.2016].

  • The complexity of the CIVILIZATION game. This game is considerably more complex than, for example, the game of Go. The complexity of Go, estimated by the number of possible games [Stanek2017], arises in a deterministic environment with static rules. In contrast, a player in the CIVILIZATION game has to manage numerous agents, with a much bigger action space and in a non-deterministic environment, which makes the estimated number of possible games far larger still.

  • The paradigm of the CIVILIZATION game is close to the real world and real business. This means that playing the game can be easily translated into solving real-world tasks. In the game, as in the business world, it is all about management under restricted resources and competing goals.

In particular, the game's implementation FreeCiv was taken as a benchmark due to its open-source availability and well-defined API. The KB-RL system was set up to play FreeCiv, and the HIRO FreeCiv Challenge [Arago2016] was launched to demonstrate the concept. To illustrate the difference between the KB-RL approach and the NN framework, we decided to conduct an experiment comparing both on a game subtask. The subtask was chosen such that (1) it can be implemented with both approaches, and (2) it is a typical task for a NN. After careful consideration, we decided on the task of optimizing a regression problem.

Naturally, we also considered Deep Reinforcement Learning as an opponent for KB-RL. However, early results showed that DRL would not bring much advantage to our experiment. The reasons for this are explained in Section 3.

The results show that the KB-RL system is able to outperform NN in the selected subtask of the FreeCiv game. Moreover, contrary to the NN, the KB-RL system provides a number of advantages such as no need for a large dataset to start and succeed with the solution. Its learning process takes considerably less effort, and its decisions are transparent and controllable.

2 Related Work

After their outstanding success, NNs and DRL continue to be the flagship of AI research, attracting large investments and progressing to overcome their constraints. For instance, DeepMind has recently published several reports on multi-task models [Kaiser et al.2017], knowledge transfer [Fernando et al.2017], and improved performance with such mechanisms as attention [Vaswani et al.2017], parallelism [Nair et al.2015], double Q-learning [Hasselt et al.2016], and continuous control [Lillicrap et al.2015].

As a result of NNs' popularity, little attention has been paid to other AI approaches, such as symbolism, evolutionary methods, or Bayesian statistics [Domingos2015]. More recently, though, new studies have emerged that show successful results in applying alternative approaches to AI tasks. For example, Dennis G. Wilson et al. show that their evolutionary algorithm can outperform the deep neural network approach in playing Atari games [Wilson et al.2018]. Other studies are targeting General AI, for instance, the CYC project [Lenat et al.1990]. Some researchers advocate that the combination of different techniques into one powerful AI system is the way to go [Geffner2018].

Though having a rule-based engine at its core, our KB-RL system is related to several AI areas, including reasoning, machine learning, and general AI systems. Generally, the system allows different methods to be combined in a highly flexible manner. Being a loosely coupled system of dedicated modules, it allows any new technique to be plugged in easily, independently of the domain. The experiment discussed in this paper is a perfect illustration of this principle. The neural network model for city output prediction was plugged into the KB-RL system with very little effort and minimal disruption to the overall solution for playing FreeCiv.

From the perspective of using CIVILIZATION as a test-bed for AI algorithms, several works have been done previously. In 2004, A. Houk used a symbolic approach and reasoning to develop an agent for playing FreeCiv. He showed that the agent could play the initial phase of the game "in a limited, but successful manner" [A. Houk2004]. However, it was not able to play and win against the embedded AI or human players. In 2014, Branavan et al. employed Natural Language Processing (NLP) to improve player performance in CIVILIZATION II [Branavan et al.2014]. They showed that a linguistically-informed game-playing agent outperforms its language-unaware counterpart. Their work was a combination of Reinforcement Learning and NLP.

Another example of applying RL to playing CIVILIZATION was presented by S. Wender in 2009 [Wender2009]. His work is particularly close to our paper as he also investigated learning the potentially best city sites. Wender implemented several modifications of the Sarsa and Q-learning algorithms and used the game score to evaluate performance. Wender showed that his algorithm was improving; however, he was able to demonstrate this only in the very initial phase of the game. Due to the lack of algorithm efficiency, he had to cut the length of each episode from the planned 120 turns to 60. In this period, the agent had time to build only 2 cities.

In contrast to the previous studies, Arago’s KB-RL system can successfully play the full CIVILIZATION game, win against embedded AI players, and demonstrate the ability to learn with more games played. With the popularity of NNs, it is reasonable to expect that the next step in playing strategic games would be based on NNs. To the best of our knowledge, there are no previous studies that use NNs to play CIVILIZATION, so we decided to try it ourselves and compare its performance to the KB-RL method.

3 Experimental setup

3.1 Task definition

Image and speech recognition are the areas where NNs demonstrate the most exceptional performance. On the other hand, it is hardly possible to explicitly encode a solution for such tasks with classical programming. Therefore, we chose for our experiment an element of FreeCiv that incorporates perception of the map image. In particular, we picked the task of evaluating map tiles for building cities so as to lead the game to the maximum of generated resources. In other words, we would like to maximize the game's generated resources by optimizing the cities' locations on the game map. As shown below, the amount of natural resources generated in one game is expressed through points of different types that are added every game turn. We call the amount of natural resources generated by all cities in one game the total game output (TGO).

For a human player, a single look at the map is enough to understand its multiple features and to estimate a tile's value with regard to the future city output and strategic position. Estimating all map features and their possible values in one script of traditional computer programming would result in dozens of 'if-else' blocks and endless code repetition, which is inefficient and error-prone. With a sufficient amount of data, NNs can solve such a task highly efficiently by analyzing the image of the map chunks and predicting their quality for the given task. Yet, solving the given task with the KB-RL approach turns out to be even more effective.

In FreeCiv, the settlement mechanism is implemented by building settler units, founding cities, and developing them. A wise choice of city location is a guarantee of the city's rapid growth, rich resources, and consequently the player's success.

Cities generate natural resources from the terrain tiles within city borders. City borders may reach terrain within the 5x5 region centered on the city, minus its corners. To extract resources from a tile, the player must have a citizen working there. Each working tile generates a number of food, production and trade points per turn. Trade points can be turned into gold, luxury or science points. These six types of points - food, production, trade, gold, luxury, and science - constitute the city output.

In this way, we calculate the city output as the sum of all points collected in every turn, and we double the production points as they can be used as half a gold point when buying the current city project. Thus, the output of city $c$ is given by equation 1:

$CO_c = \sum_{t=t_c}^{T} \left( f_t + 2\,p_t + tr_t + g_t + l_t + s_t \right)$   (1)

where $t_c$ and $T$ refer to the turn numbers at which the city was founded and at which the output is measured, and $f_t, p_t, tr_t, g_t, l_t, s_t$ are the food, production, trade, gold, luxury, and science points collected in turn $t$.

Consequently, the total game output at turn $T$ is the sum of the outputs of all cities owned by the player up to the $T$-th turn:

$TGO_T = \sum_{c=1}^{N} CO_c$   (2)

where $N$ is the number of the player's cities.
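As a concrete illustration of equations 1 and 2, the following minimal Python sketch computes a city's output and the total game output from per-turn point records. The record layout and function names are purely illustrative and not part of the FreeCiv API or our system.

```python
from typing import Dict, List

# Per-turn points of one city: food, production, trade, gold, luxury, science.
PointRecord = Dict[str, int]

def city_output(turn_points: List[PointRecord]) -> int:
    """Equation 1: sum all points over the city's turns, counting production twice."""
    return sum(p["food"] + 2 * p["production"] + p["trade"]
               + p["gold"] + p["luxury"] + p["science"]
               for p in turn_points)

def total_game_output(cities: List[List[PointRecord]]) -> int:
    """Equation 2: sum of the outputs of all of the player's cities."""
    return sum(city_output(c) for c in cities)
```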

As previously mentioned, the goal of the experiment is to maximize the TGO by optimizing city positions on the game map. Let’s take a look at the parameters related to the map tiles that are relevant for the city output.

3.2 Selected parameters

The output of each tile is affected by the terrain, the presence of special resources, and improvements such as roads, irrigation, or mines. The total city output can be affected by the city economy, the city governor, and the government type. Trade routes are also a powerful mechanism to boost trade points.

For the purpose of this experiment, we considered only the parameters that are relevant to the map qualities:

  • (TERR) Terrain of the tile and terrain of the surrounding 5x5 tiles with cut corners. There are 9 possible terrain types in the game suitable for building a city: Desert, Forest, Grassland, Hills, Jungle, Mountains, Plains, Swamp, and Tundra.

  • (RES) Resources on the tile and on the surrounding 5x5 tiles with cut corners. Every type of terrain has a chance of an additional special resource that boosts one or two of the products. A special resource can be one of 17 types, and only one per tile.

  • (WATER) Availability of water resources. The presence of Ocean or Deep Ocean terrain in the city has special significance due to their rich resources and strategic advantages. Therefore, we consider them as an extra parameter, separately from other terrain types.

  • (RIVERS) Availability of rivers. Rivers enable improvements of the terrain and enhance trade for some types of terrain.

FreeCiv is a very complex game, and there may be more parameters that affect the city output. Including each of them in the experiment was not our objective. Firstly, we aimed to include the most relevant features, and secondly, we set up equal conditions for the neural network and for KB-RL, since their performance against each other was our objective. The only two attributes in the dataset unrelated to the map qualities were those characterizing neighboring cities: the number of the player's cities in the 9x9 region (with cut corners) centered on the city, and the number of enemy cities in this region. We mark them 'NEIGHB'. A possible encoding of these parameters is sketched below.
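The following Python sketch shows one possible way to represent the selected parameters for a candidate tile and flatten them into a numeric vector. It is an illustrative layout only and does not reproduce the exact 83-dimensional encoding used for the neural network; all names and the dataclass structure are our own assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

TERRAIN_TYPES = ["Desert", "Forest", "Grassland", "Hills", "Jungle",
                 "Mountains", "Plains", "Swamp", "Tundra"]

@dataclass
class TileFeatures:
    center_terrain: str                  # TERR: terrain of the candidate tile
    surrounding_terrain: Dict[str, int]  # TERR: counts over the 5x5 region (corners cut)
    center_resource: bool                # RES: special resource on the candidate tile
    surrounding_resources: int           # RES: special resources in the region
    has_ocean: bool                      # WATER: Ocean/Deep Ocean within the city radius
    has_river: bool                      # RIVERS: river on or adjacent to the tile
    own_cities_9x9: int                  # NEIGHB: player's cities in the 9x9 region
    enemy_cities_9x9: int                # NEIGHB: enemy cities in the 9x9 region

def to_vector(f: TileFeatures) -> List[float]:
    """Flatten the features into a numeric vector (layout is illustrative only)."""
    vec = [1.0 if f.center_terrain == t else 0.0 for t in TERRAIN_TYPES]
    vec += [float(f.surrounding_terrain.get(t, 0)) for t in TERRAIN_TYPES]
    vec += [float(f.center_resource), float(f.surrounding_resources),
            float(f.has_ocean), float(f.has_river),
            float(f.own_cities_9x9), float(f.enemy_cities_9x9)]
    return vec
```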

Settling happens in the initial phase of a FreeCiv game. After the cities are built, the player mostly focuses on developing the economy, technologies, and warfare. For the purpose of the experiment, we did not need to play the game until it finished. Stopping the game ahead of time gave us the advantage of a significantly shorter episode duration: such episodes took only a fraction of the whole game time. Analyzing the HIRO FreeCiv Challenge games [Arago2016], we chose to play only the first 120 turns of the game, as this seemed to be a good trade-off between the amount of generated data, the game state, and the playing time.

3.3 Regression problem

In fact, we viewed the task as a regression problem that determines the relationship between the aforementioned parameters and the resulting output value:

$\widehat{CO} = f(\mathrm{TERR}, \mathrm{RES}, \mathrm{WATER}, \mathrm{RIVERS}, \mathrm{NEIGHB})$   (3)

In other words, given the parameters of a 5x5 map cluster, we aimed to predict a continuous integer value reflecting the future output of a city built in the cluster center.

When a new Settler is completed, the player evaluates each map tile, chooses the most suitable location, and sends the unit there to found a city. This evaluation is the step that determines the future city output. Therefore, we set up our experiment to perform the tile evaluation with two different approaches, KB-RL and NN, and compared the outcomes; a sketch of the shared selection step follows.
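Schematically, both setups share the same selection step and differ only in how a tile is scored. The sketch below is a simplification for illustration, not the system's actual control logic.

```python
def choose_city_site(candidate_tiles, score_tile):
    """Send the new Settler to the tile with the highest estimated score.

    `score_tile` is the only part that differs between the two setups:
    the knowledge-based scoring rules (KB-RL) or the trained NN predictor.
    """
    return max(candidate_tiles, key=score_tile)
```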

3.4 Experiment structure

Figure 1: The diagram for KB-RL and NN setups. The task of tile evaluation was implemented in two different ways. On the left, the knowledge-based rules were used within the KB-RL approach to evaluate the tile scores. On the right, the Neural Network model was used to predict the tile scores.

Firstly, playing the FreeCiv game was implemented with the knowledge-based approach, without any optimization involving RL. At that stage, the system could play fairly well against the embedded AI, and mostly won. After that, Arago announced the HIRO FreeCiv Challenge [Arago2016], asking human players for their expertise in playing the game. The ten best strategies were then implemented as expert knowledge in ten separate knowledge pools. The next stage was to mix all knowledge pools into one and find the best strategy via RL.

For the selected task, we designed two setups: one would evaluate the tile based on rules derived from the human players' expertise, and the other would use a neural network. Figure 1 illustrates the difference between the two setups. It is important to note that both setups were exposed to reinforcement learning and used the same knowledge pool, except for the tile evaluation outlined above. Designed this way, the difference in the output of the two setups would be the result of the different approaches to tile evaluation, and thus would become the point of comparison for the two methods.

Initially, we also considered a setup where learning would be performed first, and the NN would then be plugged in to observe its performance against the pure KB-RL approach. However, in this case the NN would lead the game through a different set of states that had not been explored and learned during the KB-RL runs. This would put the NN setup at a disadvantage. Thus, the decision was made in favor of running learning for both setups from scratch.

3.5 Neural Network setup

To create a dataset, we had 1100 fully played games acquired from the HIRO FreeCiv Challenge. Realistically, this is very scarce data for training neural networks. Considering our limited resources, it took more than a month to collect these data, and spending more resources on obtaining more data was unreasonably expensive. Therefore, we did our best to exploit the available data to their full potential.

Our goal was to create a dataset where each entry represents the map tile parameters discussed above, and the value is the output that the city could generate in the first 100 turns of its existence. We collected all tiles on which cities were built in the 1100 games and determined their map parameters according to our design. To estimate the city output on these tiles, we faced a few challenges. Firstly, cities built on the same land in different games could differ significantly due to the different game strategies and the player's progress. Secondly, cities were built on different turns, but we had to estimate each tile independently of the turn on which the city was built. Therefore, we could not use formula 1 to set the value of our dataset entries. By analyzing the data and experimenting with hyperparameters for training the neural network, we chose to calculate the city output as in formula 4:

$CO_c = \sum_{a=1}^{100} \left( f_a + 2\,p_a + tr_a + g_a + l_a + s_a \right)$   (4)

where $c$ refers to the city index, and $a$ represents the age of the city in turns. For example, $a=1$ relates to the first turn after the city was built, and $a=100$ is the 100th turn of the city's existence on the map. We replaced duplicate entries with a single entry whose city output is the average output of the duplicates. By keeping only unique entries, we aimed to minimize possible data imbalance [Kołcz et al.2003].
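A minimal Python sketch of this deduplication step, assuming each sample is a (feature tuple, 100-turn output) pair; the container layout is our own illustration, not the original preprocessing code.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def build_dataset(samples: Iterable[Tuple[tuple, float]]) -> List[Tuple[tuple, float]]:
    """Group city samples by their map-feature tuple and average the 100-turn
    outputs of duplicates (equation 4), keeping one entry per unique tile."""
    grouped = defaultdict(list)
    for features, output_100 in samples:
        grouped[features].append(output_100)
    return [(features, sum(outs) / len(outs)) for features, outs in grouped.items()]
```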

As a result, we collected more than 2700 unique entries for training our NN model. The input dataset was normalized with the min-max strategy, and the trained model had the following structure:

  • The input layer accepts an 83-dimensional feature vector.

  • One hidden layer with 95 neurons and ReLU activation.

  • Weights are initialized using a truncated normal distribution with zero mean and 0.0005 standard deviation.

  • To avoid overfitting, dropout with probability 0.5 is applied to the hidden layer.

  • The output is a single neuron holding a continuous variable.

  • The mean tile error is used as the loss function.

The Adam optimizer showed the best performance among the optimization algorithms we tried. The batch size is 30 and the learning rate is 0.002. To find the optimal hyperparameters, including the number of hidden layers, grid search was applied to the model. For model assessment, we chose K-fold cross-validation with 10 splits and shuffling. After training, the mean tile error on the test set reached 0.00637.
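For readers who want a concrete picture of this architecture, the following is a minimal Keras sketch under the stated hyperparameters. It is not the authors' original implementation: the loss is approximated as mean squared error, since the exact definition of the "mean tile error" is not given here.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_dim: int = 83) -> keras.Model:
    init = keras.initializers.TruncatedNormal(mean=0.0, stddev=0.0005)
    model = keras.Sequential([
        # One hidden layer with 95 ReLU neurons, truncated-normal weights.
        layers.Dense(95, activation="relu", input_shape=(input_dim,),
                     kernel_initializer=init),
        # Dropout with probability 0.5 to reduce overfitting.
        layers.Dropout(0.5),
        # Single output neuron for the continuous city-output target.
        layers.Dense(1, kernel_initializer=init),
    ])
    # The paper's "mean tile error" loss is approximated here by MSE (assumption).
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.002), loss="mse")
    return model

# Usage sketch: model = build_model(); model.fit(X, y, batch_size=30, epochs=...)
```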

With the NN in place, we examined the possibility of setting it up for DRL. As it turned out, each episode could contribute almost no new entries to the dataset. Firstly, a city had to exist for at least 100 turns to calculate its output as in equation 4. Secondly, the game started each time at the same place, and very few tiles were revealed to the player at the beginning. Thus, the first two or three cities were built on a very narrow patch of land, and their data entries repeated from game to game. As such, 1000 games could contribute only 7 unique entries to the dataset.

Nevertheless, we decided not to change the setup. Reducing the number of turns from 100 to a smaller count would degrade the data quality, as the city output develops in a non-linear manner, and we could not afford such harm to the prediction accuracy considering how little data we had. On the other hand, playing more turns would result in very long episodes, and even longer games did not look promising in delivering sufficient data for DRL. Moreover, the KB-RL system was set up under equal conditions and its performance was not diminished by this arrangement, which in itself points to one of the KB-RL advantages.

The dataset and the games database are publicly available at [Nechepurenko and Voss2019].

3.6 KB-RL setup

Figure 2: TGO averaged over a number of games. Both setups show improvement with more learning. The difference in output derives from the different methods in evaluating the tiles.

Our KB-RL system follows three core principles:

  • A semantic map that maps the processes to a semantic data graph, so that the system has a contextual representation of the problem world.

  • Knowledge about the solution. As opposed to recording it as a sequence of steps (like a script), the knowledge is recorded as discrete rules, which allows the engine to reuse them for automating similar but different tasks without recording repeated knowledge.

  • A decision-making engine that applies the available knowledge to the problem's context from the semantic map. Critically, due to the integrated AI approaches, the engine is able to dynamically handle incomplete or ambiguous information.

Knowledge about FreeCiv arrived in the KB-RL system from human experts. During the HIRO FreeCiv Challenge, we collected experience from top players about their settling strategies and their evaluation of the map for building cities. Their knowledge was recorded in the form of discrete rules that we call knowledge items. We used a scoring system to estimate the degree of a tile's suitability to deliver high city output. Each knowledge item contributed to the score of a particular tile independently of the others.

Knowledge items addressed the same parameter set as outlined for training the neural network: terrain, resources, and water resources on the central tile and on the surrounding tiles. There were 14 features covered by knowledge items: 9 for the different terrain types, and 5 for other features: (1) resources on the central tile, (2) resources on the surrounding tiles, (3) availability of water resources, (4) access to the deep ocean, and (5) presence of whale resources. As whale is a rare resource that boosts two products (food and production) at the same time, many players favor it over other resources. Thus, we treated it with additional rules.

Figure 3: Total game output for each single game in the run of episodes. As the game is full of random events, the output has a great variation from game to game. In the beginning, the variation is greater due to high exploration factor. Later, the agent learns to avoid bad decisions, and the variation declines for both setups.

Players have different strategies, and they value features differently. For instance, some prefer Grassland to Plains and Forest, while others put the most value on special resources. Therefore, for each feature we implemented redundant knowledge items carrying alternative amounts of points added to the score. Thus, for each of the 14 features we created a group of knowledge items of which only one had to be selected for a particular tile in the given state. In this way, all human experts' strategies were encoded into knowledge items and put together into one big knowledge pool. However, we did not prescribe an algorithm for how to combine these knowledge pieces into the optimal strategy; this task was left to the KB-RL system's intelligence. A hypothetical example of such redundant knowledge items is sketched below.
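The following Python sketch illustrates the idea of one feature group with alternative scoring rules. The item names, point values, and tile representation are hypothetical examples, not the actual knowledge items used in the system.

```python
def count_grassland(tile: dict) -> int:
    # Assumed tile representation: {"surrounding_terrain": {"Grassland": 7, ...}, ...}
    return tile["surrounding_terrain"].get("Grassland", 0)

# Hypothetical knowledge items for the "Grassland" feature group: each expert
# strategy adds a different number of points per Grassland tile in the region.
# Only one item from the group is applied to a tile in a given state; the
# reinforcement learner ranks the alternatives over many games.
GRASSLAND_GROUP = {
    "grassland_conservative": lambda tile: 1 * count_grassland(tile),
    "grassland_moderate":     lambda tile: 2 * count_grassland(tile),
    "grassland_aggressive":   lambda tile: 4 * count_grassland(tile),
}

def score_tile(tile: dict, selected_items) -> int:
    """Total tile score: each selected knowledge item contributes independently."""
    return sum(item(tile) for item in selected_items)
```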

The KB-RL system employs reasoning to combine the knowledge items into one solution, and reinforcement learning to handle redundant knowledge. Whenever the system works on some task, it selects the best-matching knowledge within the current context and executes it. The executed command may change the context of the problem, and the next best-matching knowledge can then be applied. Hence, the KB-RL system solves the problem step by step by reacting to the problem situation with suitable knowledge. When it needs to choose between alternative knowledge items, it relies on reinforcement learning to rank them against the predefined goal.

In terms of reinforcement learning, the total game output (equation 2) is the total cumulative return that the agent collects in an environment defined as a Markov Decision Process. The state space is defined by clustering over all tasks and their contexts in the system. The action space is defined by all knowledge items known to the system. We refer to the action-value $Q^{\pi}(s,a)$, or Q-value, as the expected long-term return with discount factor $\gamma$ when taking action $a$ in state $s$ under policy $\pi$, i.e., the expectation of the return. We use an on-policy, model-free algorithm similar to Monte Carlo methods [Sutton and Barto1998], but adapted to the specifics of our problem, to learn the Q-value based on the actions performed by the current policy. The policy is represented by a normal distribution.
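To make the learning step concrete, the Python sketch below shows a generic every-visit Monte Carlo update of the Q-values from one finished game. It is a textbook-style illustration, not the authors' exact adaptation, and the episode format is assumed.

```python
def monte_carlo_update(episode, q_values, counts, gamma=1.0):
    """Every-visit Monte Carlo update of Q(state, action) towards the observed
    discounted return. `episode` is a list of (state, action, reward) triples
    collected during one game; `q_values` and `counts` are dicts keyed by
    (state, action) pairs."""
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g                              # return following this step
        counts[(state, action)] = counts.get((state, action), 0) + 1
        n = counts[(state, action)]
        old_q = q_values.get((state, action), 0.0)
        q_values[(state, action)] = old_q + (g - old_q) / n  # incremental mean of returns
    return q_values
```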

4 Results

To measure the performance of both setups, we chose the metric of total game output averaged over a number of games that was calculated after every game. Figure 2 visualizes the averaged total game output for KB-RL and neural network setups. Additionally, figure 3 shows the output of each single game in the run of episodes.

In the beginning, the game outputs varied a lot, with an average TGO just under 15 000. As learning proceeded, the TGO steadily climbed and the variation declined. We ran the experiment for over 1000 episodes, by which time the game play had stabilized at an average total reward of 20 500 and 22 400 points for the NN and KB-RL setups, respectively. The average total game output over the last hundred games reached 21 400 points for the NN setup and 24 000 points for the KB-RL setup.

Notwithstanding the difference, both setups considerably improved the total game output. Relative to the starting average of just under 15 000 points, the improvement over the last hundred games amounts to roughly 60% for the KB-RL setup and roughly 43% for the NN setup. While the total game output improved for both setups due to reinforcement learning over the whole game, the difference between the KB-RL and NN results stems from the settling strategies.

Figure 4: TGO for both setups in contrast to human players, tournament games, and embedded AI players. Tournament games are games that were played by the expert knowledge pools without RL optimization during the FreeCiv Challenge.
Figure 5: Distribution of the terrain types on the city's central tile. The contrast between the two setups lies in the terrain chosen for founding a city. The KB-RL setup favored plains the most, then grassland and forest, while the NN setup built the majority of cities on grassland.
Figure 6: Distribution of the terrain by type within city borders. While both setups preferred plains and grassland the most, the difference is that the KB-RL setup occupied almost twice as many ocean tiles. In contrast, the NN setup occupied more tiles of types such as forest, desert, and hills.

To understand the achieved results, we compared them to the performance of human players and FreeCiv's own computer players. Figure 4 illustrates the average TGO for KB-RL, NN, human players, and the embedded AI. The human players' output was acquired from the experts during the HIRO FreeCiv Challenge. For comparison, we show the TGO of the top 3 players. They are certainly great experts at the game, as their play was quick and efficient, and they won against the embedded AI by a large margin.

Investigating the two setups in contrast to each other, it can be seen that the fundamental difference in settling cities lies in the choice of terrain type for the central tile, with a smaller but still significant asymmetry in the terrain types of the surrounding tiles. While both setups built a comparatively similar number of cities, with similar amounts of resources and rivers within the city borders, the terrain of the city tiles differs significantly (figures 5 and 6). In the KB-RL setup, the majority of cities were built on one of three terrain types: plains first, followed by grassland and hills. On the contrary, most of the cities in the NN setup were built on grassland, with a surprisingly large share of cities built on desert terrain. Most likely, this is a consequence of the data deficit during training of the NN model, as desert terrain is an obvious disadvantage for city development; hence, the NN model cannot generalize well to game tiles with this terrain feature.

Cities of both setups occupied grassland and plains terrain to a similar extent (figure 6). However, the KB-RL approach tended to build cities mostly on the coast, with a high number of ocean tiles belonging to the city. At the same time, the NN setup showed a stronger preference for forest terrain, while coastal terrain accounted for almost 50% less than forest. Furthermore, cities in the NN setup occupied more hills and desert terrain in comparison to the KB-RL setup.

5 Discussion

The goal of this article was to compare two approaches, knowledge-based reinforcement learning and the neural network, in solving a typical artificial intelligence task. The evaluation of map tiles for city sites was chosen because it relies on the perception of the image and is one of the most critical aspects of the game. The results show that both setups perform well in comparison to human performance and to the embedded AI players. With all other conditions equal, the KB-RL setup outperformed the NN setup on average (24 000 versus 21 400 points over the last hundred games).

Our experiment shows that leveraging experts' knowledge helps to overcome one of the biggest drawbacks of NNs: their demand for an extensive amount of data in order to achieve good results. Starting with no previous experience, KB-RL played the initial phase of the experiment equally well as the NN that had been trained on the 1100 previously played games.

Based on human knowledge and empowered by reinforcement learning, KB-RL demonstrates the ability to optimize a complex policy over a high-dimensional action space with a relatively small number of iterations. Meanwhile, the neural network could not deliver such optimization and became a bottleneck for city output improvement.

Moreover, in contrast to the NN, KB-RL solutions are fully transparent and controllable. The ability to explain the system's decisions can be imperative in many cases, especially when it comes to human health, security, and well-being.

References