Tracking the Race Between Deep Reinforcement Learning and Imitation Learning – Extended Version

08/03/2020, by Timo P. Gros et al., Saarland University

Learning-based approaches for solving large sequential decision making problems have become popular in recent years. The resulting agents perform differently, and their characteristics depend on those of the underlying learning approach. Here, we consider a benchmark planning problem from the reinforcement learning domain, the Racetrack, to investigate the properties of agents derived from different deep (reinforcement) learning approaches. We compare the performance of deep supervised learning, in particular imitation learning, to reinforcement learning for the Racetrack model. We find that imitation learning yields agents that follow more risky paths. In contrast, the decisions of deep reinforcement learning are more foresighted, i.e., they avoid states in which fatal decisions are more likely. Our evaluations show that for this sequential decision making problem, deep reinforcement learning performs best in many aspects, even though imitation learning is provided with optimal decisions as training data.


1 Introduction

In recent years, deep learning (DL) and especially deep reinforcement learning (DRL) have been applied with great success to the task of learning near-optimal policies for sequential decision making problems. DRL has been applied to various applications such as Atari games [11, 12], Go and Chess [16, 17, 18], or Rubik's cube [1]. It relies on a feedback loop between self-play and the improvement of the current strategy by reinforcing decisions that lead to good performance.

Passive imitation learning (PIL) is another well-known approach to solve sequential decision making problems, where a policy is learned based on training data that is labeled by an expert [15]. An extension of this approach is active imitation learning (AIL), where after an initial phase of passive learning, additional data is iteratively generated by exploring the state space based on the current strategy and subsequent expert labeling [7, 14]. AIL has successfully been applied to common reinforcement learning benchmarks such as cart-pole or bicycle-balancing [7].

Sequential decision making problems are typically described by Markov decision processes (MDPs). During the simulation of an MDP, the set of states that will be visited in the future depends on the current decisions. In PIL, the agent, which represents a policy, is trained by iterating over the given expert data set, whose distribution does not generally resemble this dependence. AIL extends the data with sequentially generated experiences; hence, the data is more biased towards sequentially taken decisions. In contrast, DRL does not rely on expert data at all, but simply alternates between exploitation of former experiences and exploration. It is a priori not obvious which method achieves the best result for a particular sequential decision making problem.

Here we aim at an in-depth study of empirical learning agent behavior for a range of different learning frameworks. Specifically we are interested in differences due to the sequential nature of action decisions, inherent in reinforcement learning and active imitation learning but not in passive imitation learning. To be able to study and understand algorithm behavior in detail, we conduct our investigation in a simple benchmark problem, namely Racetrack.

Racetrack is originally a pen and paper game, adopted as a benchmark in AI sequential decision making for the evaluation of MDP solution algorithms [2, 3, 13, 19]. A map with obstacles is given, and a policy for reaching a goal region from an initial position has to be found. Decisions for two-dimensional accelerations are taken sequentially, which requires foresighted planning. Ignoring traffic, changing weather conditions, fuel consumption, and technical details, Racetrack can be considered a simplified model of autonomous driving control [4]. Racetrack is ideally suited for a comparison of different learning approaches, because not only the performance of different agents but also their “driving characteristics” can be analyzed. Moreover, for small maps, expert data describing optimal policies can be obtained.

We train different agents for Racetrack using DRL, PIL, and AIL and study their characteristics. We first apply PIL and train agents represented by linear functions and artificial neural networks. As expert labeling, we apply the A* algorithm to find optimal actions for states in Racetrack. We suggest different variants of data generation to obtain more appropriate sample distributions. For AIL, we use the DAGGER approach [14] to train agents represented by neural networks. We use the same network architecture when we apply deep reinforcement learning. More specifically, we train deep Q-networks [12] to solve the Racetrack benchmark. We compare the resulting agents considering three different aspects: the success rate, the quality of the resulting action sequences, and the relative number of optimal and fatal decisions.

Amongst other things, we find that, even though it is based on optimal training data, imitation learning leads to unsafe policies, much more risky than those found by RL. Upon closer inspection, it turns out that this apparent contradiction actually has an intuitive explanation in terms of the nature of the application and the different learning methods: to minimize time to goal, optimal decisions navigate very closely to dangerous states. This works well when taking optimal decisions throughout – but is brittle to (and thus fatal in the presence of) even small divergences as are to be expected from a learned policy. We believe that this characteristic might carry over to many other applications beyond Racetrack.

The outline of our paper is the following: We first introduce the Racetrack domain (Section 2). Then we introduce the DAGGER framework and deep Q-learning (Section 3), before we describe our application to the Racetrack domain (Section 4). In Section 5, we present our experiments and findings. We finally draw a conclusion and present future work in Section 6.

This report is an extended version of the conference paper by Gros et al. [6].

2 Racetrack

Racetrack has been used as a benchmark in the context of planning [3, 13] and reinforcement learning [2, 19]. It can be played on different maps. The example used throughout the paper is displayed in Figure 1.

2.1 The Racetrack Game

At the beginning of the game, a car is placed randomly at one of the discrete positions on the start line (in purple) with zero velocity. In every step, it can speed up, hold the velocity, or slow down in the x and/or y dimension. Then, the car moves in a straight line with the new velocity from the old position to a new one, where we discretize the maps into cells. The game is lost when the car crashes, which is the case when either (1) the new position itself is a wall position or outside the map, or (2) the straight line between the old and new position intersects with a wall, i.e. the car drives through a wall on its way to the new position. The game is won when the car either stops at or drives through the goal line (in green).

Figure 1: Example of a Racetrack map: goal line is green, start line is purple.

2.2 Markov Decision Process

Given a Racetrack map, the game can be modeled as a Markov decision process.

States.

The current state is uniquely defined by the car's position p = (x, y) and its velocity v = (v_x, v_y).

Actions.

Actions represent the acceleration a = (a_x, a_y). As the car can be accelerated with values in {-1, 0, 1} in the x and in the y dimension, there are exactly 9 different actions available in every state.

Transitions.

We assume a wet road: with a certain probability, the chosen acceleration cannot be applied, i.e. it is replaced by (0, 0). Otherwise, the acceleration is as selected by the action. The new velocity is given by the sum of the acceleration and the current velocity, v' = v + a, and the new position is given by adding the new velocity to the current position, i.e. p' = p + v'.
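To make the transition concrete, the following minimal Python sketch applies a single step of these dynamics; the wet-road probability is left as a parameter, since its concrete value is not reproduced here.

```python
import random

def apply_transition(position, velocity, acceleration, noise_prob):
    """One Racetrack step: with probability `noise_prob` (wet road) the chosen
    acceleration is ignored; otherwise it is added to the velocity. The new
    position is the old position shifted by the new velocity."""
    x, y = position
    vx, vy = velocity
    ax, ay = acceleration
    if random.random() < noise_prob:
        ax, ay = 0, 0                      # wet road: acceleration not applied
    vx, vy = vx + ax, vy + ay              # v' = v + a
    return (x + vx, y + vy), (vx, vy)      # p' = p + v', new velocity v'
```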

To define several properties, we use a discretization of the MDP's transitions similar to the one of Bonet & Geffner [3]. The corresponding driving trajectory of a transition is the sequence of grid positions visited on the way from the old position to the new one [5].

If either the vertical or the horizontal speed is 0, exactly all grid coordinates between the old and the new position are contained in the trajectory. Otherwise, we consider m equidistant points on the linear interpolation between the two positions and, for each of them, round to the closest position on the map. While the original discretization determines m from the speed in a single dimension only [3], in this model it is given by the maximum of the two absolute speed components. The former is problematic for a velocity that moves less in that single dimension than in the other, as then only few points are contained in the trajectory and counterintuitive results may be produced.
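The following sketch illustrates this discretization; it assumes the number of interpolation points is the larger of the two absolute velocity components, and `is_wall` is a hypothetical helper that flags wall and off-map cells.

```python
def trajectory(old_pos, new_pos, velocity):
    """Grid cells touched when moving in a straight line from old_pos to
    new_pos, obtained by rounding equidistant interpolation points."""
    vx, vy = velocity
    m = max(abs(vx), abs(vy))              # assumed number of interpolation points
    if m == 0:
        return [old_pos]                   # no movement at all
    (x0, y0), (x1, y1) = old_pos, new_pos
    cells = []
    for i in range(m + 1):
        t = i / m
        cell = (round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0)))
        if cell not in cells:
            cells.append(cell)
    return cells

def is_valid_transition(cells, is_wall):
    """A transition is valid iff no visited cell is a wall or off the map."""
    return not any(is_wall(c) for c in cells)
```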

We consider a transition to be valid if and only if it does not crash, i.e. no position on its trajectory is a wall or outside of the map. A transition is said to reach the goal if and only if one of the positions on its trajectory is on the goal line. Additionally, a transition cannot both be invalid and reach the goal: if a transition fulfills the conditions for both, only the one that was fulfilled first holds. In words: if a car has already reached the goal, it cannot crash anymore, and vice versa.

A transition that is valid and does not reach the goal leads to the state given by the new position and velocity. If it is invalid, it leads to a bottom state that has no further transitions. Otherwise, i.e. if it reaches the goal, it leads to the terminal goal state.

Rewards/Costs.

As we consider both planning and learning approaches, we define the following two cost functions: For planning, we consider a uniform cost function, such that an optimal planner will find the shortest path to reach the goal line. For reinforcement learning, we consider a reward function that is positive if the step reaches the goal, negative if the step is invalid, and 0 otherwise; concretely, we chose a fixed positive reward for a transition that reaches the goal and a fixed negative reward for an invalid transition.

As reinforcement learning makes use of discounting, both functions motivate reaching the goal as fast as possible.
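A minimal sketch of such a reward function is given below; the concrete magnitudes are placeholders and not the constants used in the paper.

```python
def reward(reached_goal, crashed):
    """Reward of a single transition, following the structure above."""
    GOAL_REWARD = 100.0    # placeholder positive constant for reaching the goal
    CRASH_REWARD = -50.0   # placeholder negative constant for an invalid step
    if reached_goal:
        return GOAL_REWARD
    if crashed:
        return CRASH_REWARD
    return 0.0             # every other step yields reward 0
```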

2.3 Simulation

For a given map, we consider several variants of the simulation.

  1. Normal start (NS) versus random start (RS): Usually a game starts on the start line, but we also consider the slightly more difficult task of starting on a random (valid) position on the map.

  2. Zero velocity (ZV) versus random velocity (RV): Usually a game starts with velocity zero, but we further use a variant starting with a random velocity between zero and a given upper bound.

  3. Noisy (N) versus deterministic (D): In the noisy variant, the chosen acceleration is applied according to the rules given above. When the deterministic option is set, the chosen acceleration is always applied, i.e. without the wet-road chance of ignoring the acceleration and keeping the velocity unchanged.

3 Learning Approaches

We consider two different learning approaches that are based on different principles. Imitation learning is based on labeled training data, while deep reinforcement learning is based on self-play without prior knowledge.

3.1 Imitation Learning

We consider both passive and active imitation learning. For passive imitation learning, we use (1) logistic regression (LR) and linear discriminant analysis (LDA) to train linear functions, and (2) stochastic gradient descent to train neural networks. To represent the class of active imitation learning algorithms, we consider DAGGER [14].

Dagger.

Dataset Aggregation (DAGGER) is a meta-algorithm for active imitation learning. The main idea of DAGGER is to mitigate the problem caused by the disruption of the independent and identically distributed (i.i.d.) assumption in passive imitation learning for sequential decision making: the trained agent is used to iteratively sample more labeled data on which a new neural network is trained. The algorithm starts with a pre-trained neural network and then repeats the following steps:

  (i) It follows the current action policy to explore the state space.

  (ii) For every visited state, it uses an expert to find the action that shall be imitated.

  (iii) It adds the pairs of state and action to the training set, and

  (iv) trains a new policy on the enlarged data set.

Step (i) can be varied via a hyper-parameter β that sets the ratio of following the current policy or the expert for exploration; with β = 0 it follows the current policy only. Step (ii) can be done with any expert, and step (iv) with any training procedure, as sketched below.
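The following Python sketch summarizes the DAGGER loop with β = 0 as used here; `expert_action`, `train`, and `rollout` are hypothetical helpers standing in for the A* expert, the supervised training procedure, and the simulation, respectively.

```python
def dagger(initial_dataset, expert_action, train, rollout, n_iterations, n_samples):
    """Schematic DAGGER loop with beta = 0 (explore with the learned policy only)."""
    dataset = list(initial_dataset)
    policy = train(dataset)                    # pre-training on expert-labeled data
    for _ in range(n_iterations):
        visited = rollout(policy, n_samples)   # (i) explore with the current policy
        dataset += [(s, expert_action(s))      # (ii) expert-label every visited state
                    for s in visited]          # (iii) aggregate into the data set
        policy = train(dataset)                # (iv) retrain on the enlarged data set
    return policy
```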

3.2 Deep Reinforcement Learning

While there are many different approaches of deep reinforcement learning, e.g. policy-based methods [10] or methods based on Monte Carlo tree search [16, 18], we here focus on the value-based approach of deep Q-learning [12].

Deep Q-learning.

Given an MDP, we train an agent representing a policy π such that the expected cumulative reward of the MDP's episodes is maximized. As (potentially) a race can last forever, the task is a continuing one [19]. The accumulated future reward, the so-called return, of step t is therefore given by G_t = ∑_{k=0}^{∞} γ^k R_{t+k+1}, where γ with 0 < γ < 1 is a discount factor and R_{t+1} is the reward obtained during the transition from state S_t to state S_{t+1}, for t ≥ 0 [19].

For a fixed state s, an action a, and a policy π, the action-value q_π(s, a) gives the expected return that is achieved by taking action a in state s and following the policy π afterwards, i.e.

q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ].

We write q_* for the optimal action-value function that maximizes the expected return. The idea of value-based reinforcement learning methods is to find an estimate Q of the optimal action-value function. Artificial neural networks can express complex non-linear relationships and are able to generalize. Hence, they have become popular for function approximation. We estimate the Q-value function using a neural network with weights θ, a so-called deep Q-network (DQN) [11]. We denote the DQN by Q(s, a; θ) and optimize it w.r.t. the target

y = r + γ max_{a'} Q(s', a'; θ⁻).    (1)

Thus, in iteration i the corresponding loss function is

L_i(θ_i) = E[ (y − Q(s, a; θ_i))² ],    (2)

where θ⁻ refers to the parameters from some previous iteration, yielding the so-called fixed target [12]. We optimize the loss function by stochastic gradient descent, using an approximation of the gradient ∇_{θ_i} L_i(θ_i) [12].

Furthermore, we apply the idea of experience replay [12]. Instead of directly learning from observations, we store all experience tuples in a data set and sample uniformly from that set.

We generate our experience tuples by exploring the state space ε-greedily, that is, with a chance of 1 − ε during the Monte Carlo simulation we follow the policy that is implied by the current network weights, and otherwise we uniformly choose a random action [12].
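The following PyTorch sketch shows one such update step, combining uniform sampling from the replay buffer with the fixed-target loss of Equation (2); the network architecture, optimizer, and buffer management are assumed to be set up elsewhere.

```python
import random
import torch

def dqn_update(q_net, target_net, optimizer, replay_buffer, batch_size, gamma):
    """One gradient step on the DQN loss. The replay buffer is assumed to hold
    (state, action, reward, next_state, done) tuples with states given as
    feature vectors (lists of floats)."""
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = map(list, zip(*batch))
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    q_values = q_net(states).gather(1, actions).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                                    # fixed target: no gradient
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q   # r + gamma * max_a' Q(s', a'; theta^-)

    loss = torch.nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```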

In the following, we will use the terms reinforcement learning (RL) and deep reinforcement learning (DRL) interchangeably.

4 Training Racetrack Agents

In this section we describe the training process of agents based on active and passive imitation learning as well as deep reinforcement learning.

State Encoding.

Although a state in the Racetrack problem is uniquely given by the car’s position and velocity, we provide several other features that can be used as state encoding to improve the learning procedure. Instead of giving a complete encoding of the grid to the agent, the following features will be provided. These features correspond well to the idea of Racetrack being a model of autonomous driving control.

  • Eight linear distances to the nearest wall, one for each direction around the car position; the directions are given analogously to the acceleration, i.e. by combining -1, 0, and +1 in both dimensions (excluding the zero direction).

  • The distances to the nearest goal field in the x and the y dimension, respectively.

  • The total goal distance, combining the two distances above.

Together with the position and the velocity, this gives us a total of 15 features per state. We use these features for all considered learning approaches.
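A minimal sketch of this encoding is shown below; `wall_distance` and `goal_distance` are hypothetical helpers for the map geometry, and the Euclidean form of the total goal distance is an assumption made for illustration.

```python
import math

def encode_state(position, velocity, wall_distance, goal_distance):
    """Feature vector used as network input: position (2), velocity (2),
    eight wall distances, goal distances in x and y, and the total goal
    distance, i.e. 15 features in total."""
    directions = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  if (dx, dy) != (0, 0)]                   # the eight compass directions
    walls = [wall_distance(position, d) for d in directions]
    gx, gy = goal_distance(position)
    total = math.sqrt(gx * gx + gy * gy)                   # assumed Euclidean total distance
    return [*position, *velocity, *walls, gx, gy, total]
```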

Objective Function.

The learning methods that we consider rely on two different objective functions: DRL uses the reward function, while imitation learning uses data sets that were optimized w.r.t. the number of steps until the goal is reached. As DRL makes use of discounting (see Section 3.2), the accumulated reward is higher if fewer steps are taken. Thus, both objective functions serve the same purpose, even though they are not completely equivalent. Note that a direct mapping from the costs used in the planning procedure to the reward structure was not possible. We tested different reward structures for DRL and found that a negative reward for each single step, combined with a positive reward for the goal and a negative reward for invalid states, led to very poor convergence properties of the training procedure. No well-performing alternative was found to the reward structure defined in Section 2.2, up to scaling.

4.1 Imitation Learning

We want to train agents for all simulation scenarios, including those where the car starts at an arbitrary position on the map and visits future positions with different velocities. Usually, learning methods are based on the assumption that the data is i.i.d. Data that is generated via simulation of the Racetrack greatly disrupts this assumption. Thus, we propose different approaches for data generation to counter this problem.

Data Sets.

In the base case, we uniformly sample states and velocities for the simulation scenarios described in Section 2.3. The samples are then labeled by an expert. This expert basically is a Racetrack-tailored version of the A* algorithm that finds an optimal action (there might be more than one), i.e. an optimal acceleration, for the current state.

We further use additional options that can be set when sampling data to address the problem of decisions depending on each other:

  • Complete trajectory (T): If this option is set, all states on the way to the goal are added to the data set instead of only the current state.

  • Exhaustive (E): If the exhaustive option is set, all optimal solutions for the specified state are added to the data set.

  • Unique (U): Only states having a unique optimal acceleration are added to the data set.

Option E excludes option T due to runtime constraints as the number of optimal trajectories increases exponentially with the trajectory’s length.

This leads to a total of six different combinations, as displayed in Table 1.

(1) RS-RV: Uniform sample from all positions on the map and all possible velocities.
(2) NS-ZV-T: Uniform sample from all positions on the start line, combined with zero velocity. All states visited on the optimal trajectory to the goal line are included in the data set.
(3) RS-ZV-T: Uniform sample from all positions on the map, combined with zero velocity. All states visited on the optimal trajectory to the goal line are included in the data set.
(4) RS-RV-T: Uniform sample from all positions on the map and all possible velocities. All states visited on the optimal trajectory to the goal line are included in the data set.
(5) RS-RV-E: Uniform sample from all positions on the map and all possible velocities. All optimal actions for each sampled state are included in the data set.
(6) RS-RV-U: Uniform sample from all positions on the map and all possible velocities. Only states that have a unique optimal action are included in the data set.
Table 1: Racetrack configurations used to create our data sets; the options RS, RV, T, E, and U set for each configuration are indicated by its ID.

The first data set contains uniformly sampled (valid) positions and velocities and combines them with a single optimal action; it explores the whole state space equally. The data sets (2) and (3) differ in their starting points: for (2), the car is positioned on the start line, for (3) it might be anywhere on the map. Both sets contain not only the optimal acceleration for this starting state, but also for every state visited on the trajectory from there to the goal. To both sample uniformly through the state space and take the trajectories into account, (4) starts with a random position and a random velocity but still collects the whole trace. The data set (5) includes all optimal actions instead of just one; apart from that, (5) is similar to set (1). Set (6) only includes entries that have a unique optimal next action.

For each learning method, we train several instances, at least one on each data set. All data sets consist of approximately the same number of entries.

4.1.1 Passive Imitation Learning

Linear Predictors.

While deep learning clearly is more powerful than linear learning, linear classifiers have the advantage that their decisions are more transparent.

We use the package sklearn to apply both Linear Discriminant Analysis (LDA) and Logistic Regression (LR). Together with the six data sets, this gives 12 different linear agents.

Neural Networks.

We use the PyTorch package to train neural networks [8]. We repeatedly iterate over the labeled training data and use the MSE as loss function. As neural networks tend to overfit when the training iterates over the training data too often, we store the neural network after every iteration; we experimentally found that a small maximum number of iterations is more than sufficient. As we again use all six data sets, this gives us one family of stored agents per data set.
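A minimal PyTorch sketch of this training procedure is given below; the hidden-layer size, the Adam optimizer, the full-batch updates, and the one-hot MSE targets are illustrative assumptions rather than the exact choices of the paper.

```python
import torch
from torch import nn

def train_imitation_network(states, expert_actions, epochs, hidden=64, lr=1e-3):
    """Passive imitation learning sketch: fit a fully connected network to
    expert-labeled states with an MSE loss against one-hot action targets and
    store a checkpoint after every pass over the data."""
    n_features, n_actions = states.shape[1], 9
    net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                        nn.Linear(hidden, hidden), nn.ReLU(),
                        nn.Linear(hidden, n_actions))
    targets = torch.nn.functional.one_hot(expert_actions, n_actions).float()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    checkpoints = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(net(states), targets)
        loss.backward()
        optimizer.step()                      # one (full-batch) step per iteration
        checkpoints.append({k: v.clone() for k, v in net.state_dict().items()})
    return net, checkpoints
```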

As explained above, a state is represented by 15 features, which gives us the input size of the network, and there are 9 possible actions, which gives the output size. As we do not process the game via an image but through predefined features, we only use fully connected layers; more sophisticated network structures are only needed for complex inputs such as images. For a fair comparison, we use the same network size for all methods: two equally sized hidden layers between the input layer of size 15 and the output layer of size 9.

4.1.2 Active Imitation Learning

Dagger.

In the case of active imitation learning, we applied DAGGER using β = 0 for all iterations, i.e. after the pre-training we followed the trained agent without listening to the expert for exploration. To have a fair comparison, DAGGER is given the same total number of samples as PIL. The pre-training is still important for sampling within the first iteration of the algorithm, but the main idea is to generate further entries that are more important for the training of the agent. Thus, we pre-trained the agent on each of our data sets and then additionally allowed DAGGER to add further samples, split evenly over several iterations. The neural network was trained by performing eight iterations over the data set; our experiments with the networks showed that this is the best trade-off between over- and under-fitting. Again, we store the trained agents after every iteration, giving us one family of stored agents per data set for the DAGGER method.

4.2 Deep Reinforcement Learning

Deep Q-learning.

In contrast to imitation learning, reinforcement learning is not based on data sets and thus is not applied to any of the data sets given in Table 1. Training is done by self-play only: the Racetrack agent chooses its actions using a neural network and applies them to the environment. After every move, the next state (given by the features), the reward defined in Section 2 that was achieved for the move, as well as the information whether the episode has terminated are returned to the agent. The agent then uses the MSE loss between the fixed target of Equation (1) and the network's Q-value estimate to correct the weights of the network.

All imitation learning agents were trained with the same number of (new) samples and the same network structure. For comparability, we therefore restrict the DRL agents in (1) the number of entries in the replay buffer, i.e. the maximal number of entries an agent can learn from at the same time, and (2) the number of episodes that the agent can play at all. The neural network is not pre-trained but initialized randomly.

To have a trade-off between exploration and exploitation, our agent acts ε-greedily, i.e. with a probability of ε it chooses a random acceleration instead of the best acceleration, in order to explore the state space. As our DRL agent is initialized randomly, and thus starts without any experience about what good actions are, the focus lies on exploration in the beginning of the training phase. We therefore begin our training with ε = 1, i.e. always choosing a random action. After every episode, we decrease ε exponentially by a fixed decay factor to shift the focus from exploration to exploitation during training, until a lower threshold is reached.
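A small sketch of such a decay schedule, with placeholder constants for the decay factor and the lower threshold:

```python
def epsilon_schedule(episode, decay=0.999, eps_min=0.05):
    """Exponentially decaying exploration rate: start at 1.0 and shrink by a
    fixed factor per episode until a lower threshold is reached. The constants
    here are placeholders, not the values used in the paper."""
    return max(eps_min, decay ** episode)
```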

To train the agents to not just reach the goal but to minimize the number of steps, we make use of a discount factor γ < 1.

1 NS-D: Starting on a random position on the start line using the deterministic simulation.
2 NS-N: Starting on a random position on the start line using the noisy simulation.
3 RS-D: Starting on a random position on the map using the deterministic simulation.
4 RS-N: Starting on a random position on the map using the noisy simulation.
Table 2: Racetrack configurations used to train Racetrack agents with deep reinforcement learning; the options RS and D/N are indicated by the IDs.

Besides the given options of either starting on the start line (NS) or anywhere on the map (RS), DRL can benefit from learning while the noisy (N) version of Racetrack is simulated instead of the deterministic (D) one. This gives us four different training modes listed in Table 2.

To determine the best network weights, the average return over the most recent episodes during the training process is used. We save the network weights that achieve the best result. The training progress is displayed in Figure 2. Additionally, we save the network weights after running all training episodes, independently of the received average return. In total, this results in eight different DRL agents (two per training mode).

Figure 2: Training progress of the RL agent. The left graph shows the RS-N mode, the right one displays NS-D. The right plot further displays a temporary decrease of the return, which is not uncommon during training.

5 Results

For evaluation, we consider all possible combinations given by the simulation parameters as described in Section 2.3. In total, this results in six different simulation settings on which we compare the trained agents. These settings are given in Table 3. The combinations with NS and RV are not considered, as they include more starting states where a crash is inevitable than solvable ones.

In the sequel, for each learning method we present the best-performing parameter combination of all those that we tested. We investigate three aspects of the behavior of the resulting agents: the success rate, the quality of the resulting action sequences, and the relative number of optimal and fatal decisions.

1 NS-ZV-D: Starting on a random position on the start line with zero velocity using the deterministic simulation.
2 NS-ZV-N: Starting on a random position on the start line with zero velocity using the noisy simulation.
3 RS-ZV-D: Starting on a random position on the map with zero velocity using the deterministic simulation.
4 RS-ZV-N: Starting on a random position on the map with zero velocity using the noisy simulation.
5 RS-RV-D: Starting on a random position on the map with a random velocity using the deterministic simulation.
6 RS-RV-N: Starting on a random position on the map with a random velocity using the noisy simulation.
Table 3: Configurations on which we evaluate the agents; the options RS, RV, and D/N are indicated by the IDs.

5.1 Success Rate

Figure 3: Success rate results for all classes of examined agents.

We first compare how often the agents win a game, i.e. reach the goal, or lose, i.e. crash into a wall. We limit each game to a fixed maximum number of steps; if an agent has then neither succeeded nor failed, we count the episode as timed out. We compare the agents over a large number of simulation runs. For each single run of the simulation, all agents start in the same initial state. The results can be found in Figure 3.

We omitted the plot for NS-ZV-D, as all of the agents won (essentially) every game in this setting. The linear agents perform worst; especially with random starting points and velocities, they fail to reach the goal. DAGGER outperforms the passive imitation learning agents. This is not surprising, as it has been designed to cope with sequential decision making.

Throughout all settings, the DRL agents perform best. They clearly outperform DAGGER, reaching the goal several times more often in the NS-ZV-N setting.

5.2 Quality of Action Sequences

We illustrate results for the quality of the chosen action sequences in Figure 4.

Figure 4: Average reward (left) and average number of needed steps (right) for all classes of agents.

The left plot gives the cumulative reward reached by the agents, averaged over all runs (including those that are not successful). DRL clearly achieves the highest cumulative reward. We remark that the optimal policies computed via A* give higher cumulative rewards, as the goal is reached faster. However, imitation learning achieves lower results on average as it fails more often.

The right plot of Figure 4 shows results for the number of steps needed. When a car crashes, we are not interested in the number of steps taken; therefore, in this specific analysis, we only report on successful runs. The results show that, while reinforcement learning has the most wins and is the best agent considering the reward objective, it consumes the highest number of steps when reaching the goal. It even takes more steps than the linear classifiers.

5.3 Quality of Single Action Choices

Figure 5: Quality of selected actions.

Next we examine whether the agents choose the optimal acceleration, i.e. the acceleration that does not crash and leads to the goal with as few steps as possible, for different positions and velocities. We distinguish between (1) optimal actions, (2) fatal actions that unavoidably lead to a crash, and (3) secure actions that are neither of the former. We use the same settings as before, except for the ones with noise, which do not make sense when considering optimal actions; this leaves NS-ZV, RS-ZV, and RS-RV.

The results are given in Figure 5. Especially when we start from a random position on the map, we see that (independent from the setting) passive imitation learning with neural networks selects optimal actions more often than active imitation learning or deep reinforcement learning. Interestingly, DAGGER and RL select both secure and fatal choices more often than PIL.

5.4 Discussion

We found that passive imitation learning agents perform poorly (see Figure 3) even though they select optimal actions most often. One reason for this is that the data sets from which they learn contain samples that have not been generated by iteratively improving the current policy; hence, the data is not biased towards sequences of dependent decisions leading to good performance. We have observed that DAGGER and in particular DRL sometimes do not select optimal actions, but those with lower risk of hitting a wall. As a result, they need more steps than other approaches before reaching the goal, but the trajectories they use are more secure and they crash less often. This is an interesting insight, as all approaches (including PIL) try to optimize the same objective: reach the goal as soon as possible without hitting a wall.

The fact that both DAGGER and RL have a relatively high number of fatal actions, but not an increased number of losses, leads us to the assumption that these agents avoid states where they might make fatal decisions, even though these states could help to reach the goal faster.

Figure 6 illustrates the paths taken by the different agents for the easiest case (NS-ZV-D) where all policies reach their goal. DRL differs the most from the optimal (black) trajectory, which describes one of the shortest paths to the goal and obtains the maximum cumulative reward. For the harder setting where a starting point is chosen randomly (RS-ZV-D), only DAGGER and DRL make it to the goal, with DRL using significantly more steps than the optimal agent.

Figure 6: Traces of different Racetrack agents. The black trajectories are optimal. The other colors are chosen as in Figure 4. The upper plot shows an NS-ZV-D simulation, while the lower one shows RS-ZV-D.

In summary, DRL performs surprisingly well. In some aspects, it performs even better than active imitation learning, which is not only considered state of the art for sequential decision making [7], but, in contrast to DRL, even has the chance to benefit from expert knowledge.

6 Conclusion

We have presented an extensive comparison between different learning approaches to solve the Racetrack benchmark. Even though we provided optimal decisions during imitation learning, the agents based on deep reinforcement learning outperform those of imitation learning in many aspects.

We believe that our observations carry over to other applications, in particular to more complex autonomous vehicle control algorithms. We plan to consider extensions of the Racetrack problem, which include further real-world characteristics of autonomous driving. We believe that, to address the difficulties we observed with imitation learning, further investigations into the combination of expert data sets and reinforcement learning agents are necessary.

Additionally, other methods of guiding the agents to more promising solutions during training will be examined, such as reward shaping [9], and their influence on the characteristics of the final agent. Another interesting question for future work is whether multi-objective reinforcement learning can be used to adjust the agents’ behavior in a fine-grained manner.

Acknowledgements

This work has been partially funded by DFG grant 389792660 as part of TRR 248 (see https://perspicuous-computing.science)

References

  • [1] F. Agostinelli, S. McAleer, A. Shmakov, and P. Baldi (2019) Solving the Rubik’s Cube with deep reinforcement learning and search. Nature Machine Intelligence 1 (8), pp. 356–363. Cited by: §1.
  • [2] A. G. Barto, S. J. Bradtke, and S. P. Singh (1995) Learning to act using real-time dynamic programming. Artificial Intelligence 72 (1-2), pp. 81–138. Cited by: §1, §2.
  • [3] B. Bonet and H. Geffner (2001) GPT: a tool for planning with uncertainty and partial information. In Proceedings of the IJCAI Workshop on Planning with Uncertainty and Incomplete Information, pp. 82–87. Cited by: §1, §2.2, §2.2, §2.
  • [4] T. P. Gros, H. Hermanns, J. Hoffmann, M. Klauck, and M. Steinmetz (2020) Deep statistical model checking. In Proceedings of the 40th International Conference on Formal Techniques for Distributed Objects, Components, and Systems (FORTE), pp. 96–114. Cited by: §1.
  • [5] T. P. Gros, H. Hermanns, J. Hoffmann, M. Klauck, and M. Steinmetz (2020) Models and infrastructure used in "Deep Statistical Model Checking". To appear. http://doi.org/10.5281/zenodo.3760098. Cited by: §2.2.
  • [6] T. P. Gros, D. Höller, J. Hoffmann, and V. Wolf (2020) Tracking the race between deep reinforcement learning and imitation learning. In Proceedings of the 17th International Conference on Quantitative Evaluation of SysTems (QEST), Cited by: §1.
  • [7] K. Judah, A. P. Fern, T. G. Dietterich, and P. Tadepalli (2014) Active imitation learning: formal and practical reductions to i.i.d. learning. Journal of Machine Learning Research 15 (120), pp. 4105–4143. Cited by: §1, §5.4.
  • [8] N. Ketkar (2017) Introduction to pytorch. In Deep learning with Python, pp. 195–208. Cited by: §4.1.1.
  • [9] A. D. Laud (2004) Theory and application of reward shaping in reinforcement learning. Technical report. Cited by: §6.
  • [10] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §3.2.
  • [11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, Cited by: §1, §3.2.
  • [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1, §1, §3.2, §3.2, §3.2, §3.2.
  • [13] L. E. Pineda and S. Zilberstein (2014) Planning under uncertainty using reduced models: revisiting determinization. In Proceedings of the 24th International Conference on Automated Planning and Scheduling (ICAPS), pp. 217–225. Cited by: §1, §2.
  • [14] S. Ross, G. J. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Proceedings, Vol. 15, pp. 627–635. Cited by: §1, §1, §3.1.
  • [15] S. Schaal (1999) Is imitation learning the route to humanoid robots?. Trends in cognitive sciences 3 (6), pp. 233–242. Cited by: §1.
  • [16] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503. Cited by: §1, §3.2.
  • [17] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis (2018) A general reinforcement learning algorithm that masters Chess, Shogi, and Go through self-play. Science 362 (6419), pp. 1140–1144. Cited by: §1.
  • [18] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis (2017) Mastering the game of Go without human knowledge. Nature 550, pp. 354–359. Cited by: §1, §3.2.
  • [19] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. Second edition, Adaptive computation and machine learning, The MIT Press. Cited by: §1, §2, §3.2.