I. Introduction
Reinforcement learning (RL) can acquire a policy that maximizes long-term rewards in an environment. Designers do not need to specify how to achieve a goal; they only need to specify, through a reward function, what a learning agent should achieve. A reinforcement learning agent performs both exploration and exploitation to find out how to achieve a goal by itself. In real environments such as robotics, the state-action space is typically quite large. As the state-action space grows, the number of iterations needed to learn an optimal policy increases exponentially, and learning becomes too slow to obtain optimal policies in a realistic amount of time. Since a human may have knowledge that would help such an agent, a promising approach is to utilize human knowledge [ijcai2018817, DBLP:conf/atal/HarutyunyanBVN15a, 8708686].
The reward function is the factor most closely related to learning efficiency. Most difficult tasks in RL have a sparse reward function [10.5555/3327144.3327216]. With a sparse reward, the agent cannot evaluate its policy and therefore cannot learn it. In contrast, learning speeds up when the reward function is dense. Inverse reinforcement learning (IRL) [Ng:2000:AIR:645529.657801, 10.1145/1015330.1015430] is the most popular method for enriching the reward function. IRL uses an optimal policy to generate a dense reward function, and recent studies have utilized optimal trajectories [10.5555/1620270.1620297, NIPS2016_6391]. However, there is the question of the cost to the teacher of providing trajectories or policies. Humans sometimes have difficulty providing these because of the skills they may or may not have; in particular, in a robotics task, humans are required to have robot-handling skills and knowledge of the optimal trajectory. Another approach is reward shaping, which extends the original environmental reward function. Potential-based reward shaping can add external rewards while keeping the optimal policy of the environment unchanged [Ng+HR:1999]. The shaping reward is calculated as the difference between the values of a real-valued function (the potential function) at the previous and current states. In [Ng+HR:1999], it was shown that learning sped up when a learned value function was used as the potential function. Since a policy learned with potential-based reward shaping is equivalent to one learned with the value function initialized by the potential function [DBLP:journals/corr/abs11065267], using a learned value function as the potential function is equivalent to initializing the value function with it; learning is therefore jump-started. To use potential-based reward shaping, we need to define the potential function, which is often very difficult to represent directly. To solve this problem, SARSA-RS acquires it during learning [Grzes2010].
A designer provides a state aggregation function before learning, and SARSA-RS builds a value function over the abstract state space to serve as the potential function. Because the abstract state space is smaller than the original state space, reward propagation over the value function accelerates, and the agent learns the policy faster. Devlin and Kudenko proved that time-varying reward shaping keeps the policy invariant [Devlin:2012:DPR:2343576.2343638]. This result makes clear that a policy learned with SARSA-RS is equivalent to the original policy. However, it is very difficult to define a state aggregation function when the task has a high-dimensional state space. We propose a subgoal-based trajectory aggregation method in which the designer defines only a subgoal identification function to apply SARSA-RS to a reinforcement learning algorithm. Since it is easier for a designer to create a similarity function than an aggregation function [ijcai2017534], our method keeps designer effort minimal. Moreover, a non-expert may enhance a reinforcement learning algorithm if an identification function exists; this is related to interactive machine learning [Amershi_Cakmak_Knox_Kulesza_2014]. Providing subgoals is sometimes easier than providing trajectories because it requires not robot-handling skills but only task decomposition skills.

II. Related Work
The landmark-based reward shaping of Demir et al. [Demir2019] is the closest to our method. Their method shapes rewards only at landmarks using a value function. Their study focused on a POMDP environment, where landmarks automatically become abstract states; we focus on an MDP environment and propose an aggregation function. We acquire subgoals from human participants and apply our method to a task with high-dimensional observations. Potential-based advice is reward shaping over states and actions [Wiewiora2003]. The method shapes the Q-value function directly for a state-action pair, which makes it easy for a human to advise an agent on whether an action in an arbitrary state is good or not. Subgoals, in contrast, show what ought to be achieved on the trajectory to a goal, so we adopted shaping of the state value function. Harutyunyan et al. [Harutyunyan2015] showed that Q-values learned from arbitrary rewards can be used for potential-based advice. Their method mainly assumes that a teacher negates the agent's action selection, i.e., it uses failures during trial and error. In contrast, our method uses successes.
In the field of interactive reinforcement learning, a learning agent interacts with a human trainer as well as the environment [doi:10.1080/09540091.2018.1443318]. The TAMER framework is a typical interactive framework for reinforcement learning [KCAP09knox]. The human trainer observes the agent's actions and provides binary feedback during learning. Since humans often lack programming skills and knowledge of algorithms, the method relaxes the requirements for being a trainer. We likewise aim for fewer trainer requirements, and we use a GUI on a web system in our experiments with navigation tasks.
Our method is similar to hierarchical reinforcement learning (HRL) in its use of a hierarchy. The option framework is the major approach in the field of HRL. The framework of Sutton et al. [Sutton:1999:MSF:319103.319108] enables the transfer of policies learned within an option. An option consists of an initiation set, an intra-option policy, and a termination function; it expresses the combination of a subtask and a policy for that subtask. The termination function plays the role of a subgoal because it terminates an option and triggers the switch to another option. Recent methods find good subgoals for a learner simultaneously with policy learning [Bacon:2017:OA:3298483.3298491, 10.5555/3305890.3306047]. The differences from our method are whether the policy is defined over abstract states and whether rewards are generated: the option framework intends to reuse a learned policy, whereas our method focuses on improving learning efficiency. Reward shaping in HRL has been studied in [Gao2015, Li2019]. Gao et al. [Gao2015] showed that potential-based reward shaping remains policy-invariant for the MAXQ algorithm; however, designing potentials at every level is laborious. We use a single high-level value function as a potential, which reduces the design load. Li et al. [Li2019] incorporated an advantage function over high-level state-action space into reward shaping. Their approach is similar to ours in its use of a high-level value function, but it does not incorporate external knowledge into the algorithm. The reward shaping method in [Paul2019] utilized subgoals automatically discovered from expert trajectories. The potentials generated for each subgoal differ, but the value of each potential is fixed and not learned; our method learns the value of the potential.
III. Preliminaries and Notation
A Markov decision process (MDP) consists of a set of states $S$, a set of actions $A$, a transition function $T(s' \mid s, a)$, and a reward function $R(s, a)$. A policy $\pi(a \mid s)$ is a probability distribution over actions conditioned on states. In a discounted setting, the value function of a state $s$ under a policy $\pi$, denoted $V^{\pi}(s)$, is $V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0}=s\right]$. Its action-value function is $Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0}=s, a_{0}=a\right]$, where $\gamma \in [0,1)$ is a discount factor.

III-A Potential-Based Reward Shaping
Potential-based reward shaping [Ng+HR:1999, DBLP:journals/corr/abs11065267] is an effective method for preserving the original optimal policy in an environment whose reward function is augmented with an additional reward function $F$. If the potential-based shaping function is formed as

$$F(s, s') = \gamma \phi(s') - \phi(s), \qquad (1)$$

it is guaranteed that policies in the MDP with reward $R + F$ are consistent with those in the MDP with reward $R$. Note that $\gamma \in [0,1)$ and $\phi(s_{0}) = 0$, where $s_{0}$ is an absorbing state, so the MDP "stops" after a transition into $s_{0}$. $\phi$ is known as the potential function and should be a real-valued function, $\phi : S \to \mathbb{R}$. For better understanding, we use the example of Q-learning with potential-based reward shaping. The learning rule is formally written as
$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha \left[ r_{t} + F(s_{t}, s_{t+1}) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_{t}, a_{t}) \right], \qquad (2)$$

$$F(s_{t}, s_{t+1}) = \gamma \phi(s_{t+1}) - \phi(s_{t}), \qquad (3)$$

where $\alpha$ is a learning rate. We need to define an appropriate $\phi$ for every domain, and there is the problem of how to define $\phi$ so as to accelerate learning. The study in [DBLP:journals/corr/abs11065267] showed that learning with potential-based reward shaping is equivalent to initializing the value function with the potential function before learning. That result makes clear that $\phi = V^{*}$ is the best choice for accelerating learning. However, we cannot know $V^{*}$ before learning, since it is obtained only after learning. This suggests that we can accelerate learning if there is a value function that can be learned faster than the original one from the same rewards.
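As a concrete sketch of the learning rule in Eqs. (2) and (3) (a tabular toy, not the paper's implementation; `ACTIONS`, the dictionary `Q`, and the potential `phi` are illustrative assumptions):

```python
ACTIONS = ["up", "down", "left", "right"]

def shaped_q_update(Q, phi, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step with the shaping term F(s, s') = gamma*phi(s') - phi(s)
    added to the environment reward r."""
    F = gamma * phi(s_next) - phi(s)
    # Standard Q-learning target with the shaped reward r + F
    best_next = max(Q.get((s_next, b), 0.0) for b in ACTIONS)
    td_error = (r + F) + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```

Setting `phi` to an already-learned value function jump-starts learning, per the discussion above.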
III-B SARSA-RS
Grzes et al. [Grzes2008, Grzes2010] proposed a method, called "SARSA-RS," that learns the potential function during the learning of a policy. The method addresses the problem that designing an appropriate potential function for a domain is too difficult and time-consuming. We define $Z$ as a set of abstract states. The method builds a value function $V$ over $Z$ and uses it as

$$\phi(s) = V(z(s)), \qquad (4)$$

where $z : S \to Z$ is an aggregation function. The function $z$ is predefined. The potential-based shaping function of SARSA-RS is written as follows:

$$F(s_{t}, s_{t+1}) = \gamma V(z(s_{t+1})) - V(z(s_{t})). \qquad (5)$$
The method learns the value function during policy learning as

$$V(z_{t}) \leftarrow V(z_{t}) + \beta \left[ r^{z} + \gamma^{\tau} V(z_{t+1}) - V(z_{t}) \right], \qquad (6)$$

where $r^{z}$ is the transformation of MDP rewards into SMDP rewards and $\tau$ is the duration between $z_{t}$ and $z_{t+1}$. The potential function changes dynamically during learning, so the equivalence result for potential-based reward shaping cannot be applied directly, because the potential depends on time in addition to state. However, since Devlin and Kudenko showed that a shaped policy is equivalent to a non-shaped one even when the potential function changes dynamically during learning [Devlin:2012:DPR:2343576.2343638], SARSA-RS keeps the learned policy unchanged. We omit the time argument in the following sections to simplify the expressions. The size of $Z$ is smaller than that of $S$ thanks to the aggregation of states; therefore, environmental rewards propagate faster, and policy learning with SARSA-RS is also faster. As mentioned above, the method requires the predefined aggregation function $z$. In an environment with high-dimensional observations, it is almost impossible to construct such an aggregation function by hand.
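A minimal tabular sketch of the SARSA-RS pieces in Eqs. (4)–(6), under our own naming (`V_abs` is the value table over abstract states, `z` the aggregation function); this is an illustration, not the authors' code:

```python
def sarsa_rs_potential(V_abs, z, s):
    """Potential phi(s) = V(z(s)) built from the abstract-state values."""
    return V_abs.get(z(s), 0.0)

def update_abstract_value(V_abs, z, s, s_next, smdp_reward, tau, beta=0.1, gamma=0.99):
    """TD update of V over abstract states: smdp_reward is the accumulated
    reward r^z, and tau is the number of steps spent in z(s) before the
    transition to z(s_next)."""
    z_s, z_next = z(s), z(s_next)
    target = smdp_reward + (gamma ** tau) * V_abs.get(z_next, 0.0)
    V_abs[z_s] = V_abs.get(z_s, 0.0) + beta * (target - V_abs.get(z_s, 0.0))
    return V_abs
```

The updated `V_abs` then feeds back into the shaping term of Eq. (5) via `sarsa_rs_potential`.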
IV. Reward Shaping with Subgoal-Based Aggregation
We propose a method of aggregating states into abstract states. The method basically follows SARSA-RS. We use a predefined subgoal series and, with it, dynamically aggregate episodes into abstract states during learning.
IV-A Subgoal
We define a subgoal as follows.

Definition 1. A state $s$ is a subgoal if $s$ is a goal in one of the subtasks decomposed from a task.
In the option framework, a subgoal is the goal of a subtask and is expressed as a termination function [Sutton:1999:MSF:319103.319108]. Many studies on the option framework have developed automatic subgoal discovery [Bacon:2017:OA:3298483.3298491]. We aim to incorporate human subgoal knowledge into a reinforcement learning algorithm with little human effort. A subgoal is likely to lie on an optimal trajectory, because a human decomposes a task with achieving the goal in mind. We acquire a subgoal series and incorporate the subgoals into our method in the experiments. The subgoal series is written formally as $g_{1}, g_{2}, \ldots, g_{k}$, where $G = \{g_{1}, \ldots, g_{k}\}$ is a set of subgoals and a subset of $S$. There are two types of subgoal series: totally ordered and partially ordered. With totally ordered subgoals, the next subgoal is deterministically determined at any subgoal. In contrast, with partially ordered subgoals, several transitions to the next subgoal are possible from a given subgoal. We use only totally ordered subgoal series in this paper, but both types are compatible with our proposed reward shaping. Since an agent needs to achieve a subgoal only once, transitions between subgoals are unidirectional.
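For illustration only (the names and the list/dict encodings are our assumptions, not the paper's notation), a totally ordered series can be held as a list and a partially ordered one as a successor relation:

```python
# Totally ordered: the next subgoal is uniquely determined at every point.
total_order = ["g1", "g2"]  # g1 -> g2 -> goal

# Partially ordered: several transitions to a next subgoal are possible,
# so a successor relation (a DAG) is needed instead of a list.
partial_order = {"start": ["g1a", "g1b"], "g1a": ["g2"], "g1b": ["g2"]}

def next_subgoal(series, num_achieved):
    """Next subgoal of a totally ordered series, or None when all are done.

    Transitions are unidirectional: num_achieved only ever grows, since
    an agent needs to achieve each subgoal only once."""
    return series[num_achieved] if num_achieved < len(series) else None
```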
IV-B Subgoal-Based Dynamic Trajectory Aggregation
We propose a method of dynamically aggregating trajectories into abstract states using a subgoal series. The method makes SARSA-RS available for environments with high-dimensional observations because less effort is required from designers: the method requires only a subgoal series consisting of several states instead of a mapping over all states. In this section, we assume that the subgoal series is predefined. The method basically follows SARSA-RS; the main difference is the aggregation function, and a minor difference is the accumulated rewards.
IV-B1 Dynamic Trajectory Aggregation
We build abstract states to represent the achievement status of the subgoal series. If there are $k$ subgoals, the number of abstract states is $k+1$. The agent is in the first abstract state $z_{0}$ before any subgoal is achieved; the abstract state then transits to $z_{i}$ when the $i$-th subgoal $g_{i}$ is achieved. This means that the partial episode up to subgoal $g_{i}$ is aggregated into $z_{i-1}$. The aggregated episodes change dynamically from trial to trial because of the randomness of the policy and the progress of learning; as learning progresses, the aggregated episodes become fixed. The value of an abstract state is distributed to the values of the states in the trajectory. Note that the trajectories used for updating the values differ from those that receive the distributed values: the updated value function is used not for the current trial but for subsequent trials. An image of dynamic trajectory aggregation is shown in Fig. 1.
In the figure, each circle is a state, and aggregated states share a gray background area. There are two abstract states in the case of a single subgoal. The bold circles represent the states with which the designer deals; the number of bold circles in Fig. 1(b) is much smaller than in Fig. 1(a). "S" and "G" in the circles denote the start and goal, respectively. Fig. 1(b) shows that the episode is separated into two sub-episodes, each corresponding to an abstract state.
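The achievement-status abstraction can be sketched as a small tracker; the predicate-based subgoal encoding and the function names are illustrative assumptions, not the paper's implementation:

```python
def make_subgoal_tracker(subgoal_tests):
    """Map a trajectory onto abstract states z_0 .. z_k.

    subgoal_tests: list of predicates over states, one per subgoal, in
    their (totally ordered) achievement order. The returned function is
    called on every visited state and returns the current abstract-state
    index, i.e. the number of subgoals achieved so far. Transitions are
    unidirectional: the index never decreases."""
    progress = {"idx": 0}

    def abstract_state(state):
        if progress["idx"] < len(subgoal_tests) and subgoal_tests[progress["idx"]](state):
            progress["idx"] += 1  # subgoal achieved: move to the next abstract state
        return progress["idx"]

    return abstract_state
```

With a single subgoal this yields exactly two abstract states, as in Fig. 1(b).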
IV-B2 Accumulated Reward Function
We define the reward transformation function explicitly because our method updates values only on subgoal achievements, treated as abstract states. The set of abstract states forms part of a semi-Markov decision process (SMDP) [Sutton:1999:MSF:319103.319108]: the transition from one abstract state to another consists of multiple actions. The value function in an SMDP is written as

$$V(z) = \mathbb{E}\!\left[ r_{t} + \gamma r_{t+1} + \cdots + \gamma^{\tau-1} r_{t+\tau-1} + \gamma^{\tau} V(z') \,\middle|\, \mathcal{E}(z, t) \right], \qquad (7)$$

where $\tau$ is the duration of the abstract state $z$, and $\mathcal{E}(z, t)$ is the event of the aggregation function being initiated in state $z$ at time $t$. Therefore, we describe the accumulated reward formally as $r^{z} = \sum_{i=0}^{\tau-1} \gamma^{i} r_{t+i}$, where $\tau$ is the duration until subgoal achievement. The function accumulates rewards with discount $\gamma$. Depending on the policy at the time, $\tau$ varies dynamically. This follows n-step temporal difference (TD) learning [Sutton1998], because there are $\tau$ transitions between one abstract state and the next. Algorithm 1 shows the whole process of SARSA-RS with subgoal-based dynamic trajectory aggregation. $\beta$ and $\gamma$ are hyperparameters, that is, the learning rate and discount factor for updating the value function over abstract states.
In Algorithm 1, our method is involved in lines 6-13. The value function over abstract states is parameterized by $\theta$. If the current state equals the next subgoal, $V$ is updated by an approximate multi-step TD method [Sutton1998], and our method sets the next subgoal in lines 9-12.
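The SMDP reward transformation, i.e. the discounted accumulation of per-step rewards collected between two abstract states, amounts to a one-liner (a sketch under our naming):

```python
def accumulate_rewards(rewards, gamma=0.99):
    """SMDP reward for one abstract-state transition: the sum of
    gamma**i * r_i over the tau steps spent in the current abstract
    state, where tau is simply len(rewards) and varies with the policy."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```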
V. Experiments
In this section, we explain the experiments conducted to evaluate our method. We used navigation tasks in two domains, four-rooms and pinball, because they are popular problems with discrete/continuous states that have been used in previous studies [Bacon:2017:OA:3298483.3298491, NIPS2009_3683, Sutton:1999:MSF:319103.319108]. Furthermore, we used a pick-and-place task with a robot arm with continuous actions [1606.01540]. The navigation tasks involve finding the shortest path from a start state to a goal state; the pick-and-place task involves grasping an object and bringing it to a target. First, we conducted an experiment to acquire human subgoals. Second, we conducted a learning experiment in which we compared the proposed method with three other methods on the navigation tasks and with a baseline RL algorithm on the pick-and-place task. A SARSA algorithm was used for the four-rooms domain, an actor-critic algorithm for the pinball domain, and DDPG for the pick-and-place domain. All experiments were conducted on a PC [Core i7-7700 (3.6 GHz), 16 GB memory]. We used the same hyperparameters, $\alpha$ and $\gamma$, as those of the respective baseline RL algorithms.
V-A User Study: Human Subgoal Acquisition
V-A1 Navigation Task
We conducted an online user study to acquire human subgoal knowledge using a web-based GUI. We recruited 10 participants, half of whom were graduate students in the department of computer science and half of whom were not (6 males and 4 females, ages 23 to 60, average 36.4). We confirmed that they had no expertise regarding subgoals in the two domains. Participants were given the same instructions for the two domains and were then asked to designate two subgoals each for the four-rooms and pinball domains, in this fixed order. For the four-rooms domain, the number of subgoals equaled the number of hallways on the optimal trajectory. The instructions explained to the participants what subgoals are and how to set them, and also gave specific explanations of the two task domains. In this experiment, we acquired just two subgoals for learning, since two are intuitively easy to give on the basis of the structure of the problems. We treated the two subgoals as totally ordered.
V-A2 Pick-and-Place Task
The user study for the pick-and-place task was also conducted online. Since it was difficult to acquire human subgoal knowledge with a GUI, we used a free-form descriptive questionnaire. We assumed that humans use subgoals when they teach behavior verbally: they state not how to move but what to achieve in the middle of the behavior. The results of this paper weakly support this assumption. We recruited five participants who were amateurs in the field of computer science (3 males and 2 females, ages 23 to 61, average 38.4). The participants read the instructions and then typed their answers into a web form. The instructions consisted of a description of the pick-and-place task, a movie of manipulator failures, a glossary, and a question on how a human could teach successful behavior. The question included the sentence "Please teach behavior like you would teach your child," because some participants in a preliminary experiment answered that they did not know how to teach a robot. We imposed no limit on the number of subgoals.
V-B Navigation in Four-Rooms Domain
The four-rooms domain has four rooms connected by four hallways and is a common reinforcement learning benchmark. In this experiment, learning consisted of a thousand episodes. An episode was a trial that ran until the agent reached the goal state successfully or until a thousand state-action steps ended in failure. A state was expressed as a scalar label over all states. The agent could select one of four actions: up, down, left, and right. State transitions were deterministic. A reward of +1 was generated when the agent reached the goal state. The start and goal states were placed at fixed positions. The agent repeated the learning 100 times, and each learning took several tens of seconds.
V-B1 Experimental Setup
We compared the proposed reward shaping with human subgoals (HRS) against three other methods: a SARSA algorithm (SARSA) [Sutton1998], the proposed reward shaping with random subgoals (RRS), and naive subgoal reward shaping (NRS). SARSA is a basic reinforcement learning algorithm; we used it as the baseline and implemented the other two methods on top of it. RRS used two states randomly selected as subgoals from the whole state space. NRS is based on potential-based reward shaping: its potential function outputs a scalar value only when the agent has visited a subgoal state. The potential function is written formally as follows.
$$\phi(s) = \begin{cases} c & \text{if } s \in G, \\ 0 & \text{otherwise}, \end{cases} \qquad (8)$$

where $G$ is the set of subgoals and $c$ is a positive constant. Informally, NRS shapes rewards generated only at subgoals with potential-based reward shaping. The two differences from our method are that NRS has a fixed potential and that the positive potential applies only to the subgoals. The reward shaping methods were given the ordered subgoals or the aggregation of states in advance. We set the learning rate $\alpha$ for SARSA to 0.01, the discount rate $\gamma$ to 0.99, and $\beta$ to 1.0. The policy was a softmax. We chose $c$ to be 1 so that it would equal the goal reward, after a grid search over values of 1, 10, and 100.
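A minimal sketch of the NRS potential described above (the set-membership encoding and the constant name `c` are our assumptions):

```python
def nrs_potential(state, subgoals, c=1.0):
    """Naive subgoal reward shaping: a fixed positive constant c at
    subgoal states and zero elsewhere. Unlike the proposed method, this
    potential is fixed rather than learned."""
    return c if state in subgoals else 0.0
```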
We evaluated learning performance with the time to threshold and the asymptotic performance [Taylor:2009:TLR:1577069.1755839], measures of learning efficiency from transfer learning for reinforcement learning. In this experiment, the time to threshold was the number of episodes required to get below a predefined threshold of steps, and the asymptotic performance was the final performance of learning, for which we used the average number of steps between episodes 990 and 1000.
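The two evaluation measures can be computed as follows (a sketch; the list-based episode log is an assumption):

```python
def time_to_threshold(steps_per_episode, threshold):
    """Episodes needed before the per-episode step count first drops
    below the threshold; None if the threshold is never reached."""
    for episode, steps in enumerate(steps_per_episode, start=1):
        if steps < threshold:
            return episode
    return None

def asymptotic_performance(steps_per_episode, last_n=10):
    """Final performance: the average steps over the last last_n episodes
    (e.g. episodes 990-1000 in the four-rooms experiment)."""
    tail = steps_per_episode[-last_n:]
    return sum(tail) / len(tail)
```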
V-B2 Experimental Results
Fig. 2 shows the subgoal distribution acquired from the ten participants and the random subgoals generated for the four-rooms domain. In Fig. 2, the start, goal, and subgoal cells are colored red, blue, and green, respectively. The number in a cell is the frequency with which the participants selected that cell as a subgoal. These subgoals included both totally and partially ordered subgoals. As shown in Fig. 2(a), participants tended to set subgoals in the hallways, in contrast with the random subgoals [Fig. 2(b)]. Next, we show the results of the learning experiment. Fig. 3 shows the learning curves of our proposed method and the three other methods; standard errors are also shown in the figure. We plotted HRS as an average over a total of 1000 learnings across all participants. RRS was likewise averaged over 1000 learnings across 10 patterns, and NRS had almost the same conditions as HRS. SARSA was averaged over 10,000 learnings. HRS had the fewest steps in almost all episodes. The results of NRS demonstrate the difficulty of transforming subgoals into an additional reward function. We also performed an ANOVA among the four methods, setting the thresholds to 500, 300, 100, and 50 steps for the time to threshold. Table I shows the mean episodes, the standard deviations, and the results of the ANOVA and the sub-effect tests for the compared methods at each threshold.
Thres. | HRS         | RRS         | SARSA       | NRS
500    | 2.68 (1.86) | 2.91 (2.10) | 3.93 (3.07) | 5.78 (6.39)
300    | 5.06 (3.05) | 5.60 (3.61) | 6.68 (4.29) | 10.4 (9.17)
100    | 17.3 (7.92) | 18.6 (7.89) | 26.4 (11.0) | 38.3 (47.1)
50     | 33.0 (10.2) | 36.3 (11.0) | 51.8 (16.2) | 59.6 (45.8)

Thres. | HRS < RRS | {HRS, RRS} < SARSA | {HRS, RRS, SARSA} < NRS
500    | n.s.      | *                  | *
300    | *         | *                  | *
100    | n.s.      | *                  | *
50     | *         | *                  | *
As shown in Table I, HRS shortened the time required to reach 50 steps by approximately 20 episodes relative to the baseline, and it was also better than RRS. We did not find a statistically significant difference among HRS, RRS, SARSA, and NRS in terms of asymptotic performance. Our method made learning faster than the baseline method, and human subgoals led to better performance than random ones.
V-C Navigation in Pinball Domain
The navigation task in the pinball domain involves moving a ball to a target by applying forces to it. The pinball domain is difficult for humans because delicate control of the ball is necessary, as is often the case in control domains. Since humans only provide states, ordered subgoals are more tractable than near-optimal trajectories in such domains.
The difference from the four-rooms domain is the continuous state space over the position and velocity of the ball on the x-y plane. The action space has five discrete actions: four types of force and no force. In this domain, a drag coefficient of 0.995 effectively stops the ball after a finite number of steps when the no-force action is chosen repeatedly; collisions with obstacles are elastic. The four forces were up, down, right, and left on the plane. Actions were chosen randomly 10% of the time. An episode terminated with a reward of +10,000 when the agent reached the goal; an episode was interrupted when it took more than 10,000 steps. The radius of a subgoal was the same as that of the goal.
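As a simplified illustration of the drag dynamics (not the actual pinball simulator), scaling the velocity by the drag coefficient each step makes speed decay geometrically, so repeated no-force actions effectively stop the ball:

```python
def coast_step(velocity, drag=0.995):
    """One no-force step under drag: the velocity is scaled by the
    drag coefficient, so after n steps it is v0 * drag**n."""
    return velocity * drag

# After 1000 no-force steps, a unit-speed ball has effectively stopped.
v = 1.0
for _ in range(1000):
    v = coast_step(v)
```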
V-C1 Experimental Setup
We compared HRS with AC, RRS, and NRS in terms of learning efficiency, using the time to threshold and the asymptotic performance. We defined the time to threshold in this domain as the number of episodes required to reach a designated number of steps. The asymptotic performance was the average number of steps between episodes 190 and 200. Each learning consisted of at most 200 episodes, and each method learned 100 times in total from scratch. HRS, RRS, and NRS performed ten learnings for each of ten subgoal patterns: HRS and NRS used the two ordered subgoals provided by each of the ten participants, and RRS used two randomly generated ordered subgoals. We used these results to evaluate learning efficiency. Each learning took several tens of minutes. A subgoal had only a center position and a radius, the radius being the same as that of the target; a subgoal was achieved when the ball entered its circle at any velocity. We used an actor-critic (AC) algorithm, with the critic using linear function approximation over a Fourier basis [Konidaris11a] of order three, a linear actor, and a softmax policy for action selection. The learning rates were set to 0.01 for both the actor and the critic, and the discount factor was set to 0.99. The $c$ of NRS was 10,000 so as to equal the goal reward.
V-C2 Experimental Results
Fig. 4 shows the subgoal distribution acquired from the participants and the random subgoals generated for the pinball domain. In this figure, the start point, the goal, and the subgoals are colored red, blue, and green, respectively. As shown in Fig. 4, participants concentrated their subgoals on four regions around branch points, in contrast with the random subgoals.
Fig. 5 shows the learning curves of HRS, RRS, AC, and NRS. The learning indicator was the average number of steps per episode over 100 learnings, smoothed with a moving average over ten episodes; standard errors are also shown in the figure. As shown, HRS performed better than all the other methods, while RRS and NRS were almost the same as AC. We evaluated learning efficiency using the time to threshold and the asymptotic performance, with each learning result smoothed by a simple moving average over ten episodes. We performed an ANOVA to determine the differences among the four methods (HRS, RRS, AC, and NRS), with the Holm-Bonferroni method used for the sub-effect test. We set the thresholds to 3000, 2000, 1000, and 500 steps for the time to threshold. Table III shows the mean episodes, the standard deviations, and the results of the ANOVA and the sub-effect tests for the compared methods at each threshold.
Thres. | HRS          | RRS          | AC          | NRS
3000   | 21.2 (39.8)  | 34.6 (48.4)  | 51.7 (67.1) | 40.0 (50.7)
2000   | 26.8 (40.4)  | 48.7 (54.2)  | 70.2 (77.8) | 59.0 (59.0)
1000   | 56.7 (58.5)  | 89.1 (67.9)  | 101 (71.4)  | 93.0 (70.1)
500    | 115.3 (66.4) | 147.3 (64.7) | 163 (54.8)  | 157 (61.6)

Thres. | HRS < RRS | HRS < {RRS, AC}
3000   | n.s.      | *
2000   | *         | *
1000   | *         | *
500    | *         | *
From Table III, the differences between HRS and the other three methods in terms of reaching 500, 1000, and 2000 steps per episode were statistically significant. There were statistically significant differences between HRS and both RRS and AC in terms of reaching 3000 steps; there were no other significant differences. This means that HRS learned faster than RRS, AC, and NRS until reaching the thresholds. In terms of asymptotic performance, only the difference between HRS and RRS was statistically significant. From these results, we found that human ordered subgoals were more helpful for our proposed method than random ordered subgoals in the pinball domain. We obtained similar results in the four-rooms domain.
V-D Pick-and-Place Task
We used a Fetch environment based on the 7-DoF Fetch robotic arm in OpenAI Gym [1606.01540]. In the pick-and-place task, the robotic arm learns to grasp a box and move it to a target position [DBLP:journals/corr/abs180209464]. We converted the original task into a single-goal reinforcement learning framework because potential-based reward shaping does not cover the multi-goal framework [Ng+HR:1999]. The dimensionality of the observation is larger than in the previous navigation tasks, and the action is continuous. An observation is 25-dimensional and includes the Cartesian positions of the gripper and the object as well as the object's position relative to the gripper. The reward function generates a reward of -1 every step and a reward of 0 when the task is successful. The task is described in detail in [DBLP:journals/corr/abs180209464].
V-D1 Experimental Setup
We compared HRS with NRS, RRS, and DDPG [journals/corr/LillicrapHPHETS15] in terms of learning efficiency, using the time to threshold and the asymptotic performance. HRS, NRS, and RRS used DDPG as their base. We defined the time to threshold in this task as the number of epochs required to reach a designated success rate, and the asymptotic performance as the average success rate between epochs 190 and 200. Ten workers stored episodes and calculated gradients simultaneously in each epoch. HRS and NRS used the ordered subgoals provided by the five participants; RRS used randomly generated subgoals. Learning for 200 epochs took several hours. We used the OpenAI Baselines [baselines] implementation of DDPG with the default hyperparameter settings. We built the hidden and output layers of the value network over abstract states with the same structure as the Q-value network, excluding the action from the input layer. The input of the network is only the observation at subgoal achievement, and the network learns from the discounted accumulated reward until subgoal achievement. A subgoal is defined from the information in the observation. We set a margin to loosen the otherwise severe condition for achieving subgoals.

V-D2 Experimental Results
All five participants provided subgoal series in which the first subgoal was a location from which the object could be grasped and the second subgoal was grasping the object. We used this subgoal series as the input to our method. Fig. 6 shows the learning curves of HRS, RRS, NRS, and DDPG.
As shown in Fig. 6, the results were averaged over five learnings, and the shaded areas represent one standard error. The random seeds and the locations of the goal and object were varied for every learning. HRS worked more effectively than DDPG, especially after about the 75th epoch, while NRS had the worst performance through almost all epochs. The mean difference in the time to threshold at a success rate of 0.6 was 30 epochs; that is, our method reached the 0.6 success rate 30 epochs earlier than DDPG. NRS and RRS could not reach the 0.6 success rate. The asymptotic performances of HRS, DDPG, NRS, and RRS were 0.67, 0.63, 0.49, and 0.43, respectively. We confirmed that HRS had the highest asymptotic performance and the fastest achievement of the 0.6 success rate.
VI. Discussion
There was a small difference between HRS and RRS in the four-rooms navigation domain, as shown in Fig. 3. In comparison, in the pinball domain RRS was similar to AC and NRS, as shown in Fig. 5. Approximately 65% of the randomly generated subgoal states lay on an optimal trajectory in the four-rooms domain, versus approximately 20% in the pinball task, so random subgoals worked better in the four-rooms domain than in the pinball domain. We think the small difference between HRS and RRS was caused by this characteristic of the four-rooms task, in which most states lie on an optimal trajectory.
Potential-based reward shaping keeps the policy invariant under the transformation of the reward function. In the experimental results for the pinball domain, the asymptotic performance of HRS was statistically significantly different from that of RRS, whereas there was no significant difference in the four-rooms domain. As shown in Fig. 3, the performance had clearly reached its asymptote by the 121st of 1000 episodes. In contrast, it had not yet reached an asymptote by the 200th episode in Fig. 5. Since our method is based on potential-based reward shaping, RRS should converge to the same performance as HRS if learning continues.
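The invariance argument rests on the standard shaping term of Ng et al. [Ng+HR:1999], which adds the discounted potential difference to the environmental reward. The following minimal sketch shows that term; the potential table `phi` is a hypothetical stand-in for the learned value function over abstract states.

```python
GAMMA = 0.99  # discount factor (value assumed for illustration)

def shaped_reward(env_reward, phi, s, s_next, gamma=GAMMA):
    """Potential-based reward shaping: r + gamma * Phi(s') - Phi(s).
    Adding this term leaves the optimal policy unchanged."""
    return env_reward + gamma * phi[s_next] - phi[s]
```

Because the added term telescopes along any trajectory, the relative returns of policies, and hence the optimal policy, are preserved.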
We compared our method with SARSA-RS using state aggregation. The comparison is not entirely fair because the amount of domain knowledge differs: our method needs only several states as subgoals, whereas state aggregation needs a mapping from states to abstract states. However, the comparison is useful for understanding the performance of our method in detail. For the four-rooms domain, which has a discrete state space, a state aggregation is easily given such that each abstract state is a room. For the pinball domain and the pick-and-place task, which have continuous state spaces, it is difficult to obtain a human's mapping function; we therefore obtained the abstract state space by discretization. The number of abstract states was set to three in both cases to align SARSA-RS with HRS. The performance of our method was lower than that of state aggregation for the four-rooms and pinball domains. In contrast, our method outperformed state aggregation in the pick-and-place task. These results suggest that state aggregation does not work in a high-dimensional environment, whereas our method works well.
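A discretization-based aggregation of the kind described can be sketched as below. This is an assumed, one-dimensional illustration: the paper does not specify the binning, and the bounds and dimension here are hypothetical; only the number of abstract states (three) follows the text.

```python
N_ABSTRACT = 3  # number of abstract states, aligned with the subgoal count

def aggregate(position, low=0.0, high=1.0, n=N_ABSTRACT):
    """Map a 1-D continuous position in [low, high] to an abstract-state
    index in [0, n-1] by uniform binning (bounds assumed)."""
    idx = int((position - low) / (high - low) * n)
    return min(max(idx, 0), n - 1)
```

In higher-dimensional observation spaces such binning grows combinatorially, which is one plausible reason state aggregation degraded in the pick-and-place task.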
The value function over abstract states was initialized to zero in the experiments. As shown in the experimental results, our method improved the learning efficiency in the middle of learning but could not speed up RL at the beginning. This is because the shaping term was zero under the initial potentials and thus had no effect at the start. Since a nonzero initialization can shape the reward from the beginning, it might speed up RL earlier. However, the best way to initialize the value function is not clear; this remains an open question.
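The effect of initialization can be made concrete with a small sketch. This is not the authors' implementation: the optimistic table and the specific values are assumptions used only to contrast the two cases.

```python
import numpy as np

GAMMA = 0.99
N_ABSTRACT = 3  # number of abstract states, as in the experiments

v_zero = np.zeros(N_ABSTRACT)            # initialization used in the paper
v_optimistic = np.full(N_ABSTRACT, 1.0)  # hypothetical nonzero alternative

def first_step_shaping(v, s=0, s_next=1, gamma=GAMMA):
    """Shaping term gamma * V[s'] - V[s] on the very first transition."""
    return gamma * v[s_next] - v[s]
```

With the zero table, every early shaping term is exactly zero, so shaping cannot act until the abstract value function is updated; with the nonzero table, the shaping term is nonzero from the first transition.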
A limitation of providing subgoals is that there is no established methodology for doing so, and the result may depend on the choices each individual makes intuitively. Future research is hence needed to define subgoals clearly. For the four-rooms domain, as shown in Fig. 2, almost all of the subgoals were scattered within the top-right and bottom-right rooms. From this, we think that many participants tended to consider the bottom-right path the shortest one. Additionally, there was an interesting observation: half of the participants set a subgoal in the hallways. This may mean that humans share an abstract, common methodology and preference for giving subgoals. A systematic user study is necessary to make this clear.
Providing subgoals is more useful than providing optimal trajectories when the task requires robot-handling skills. We are interested in the cognitive load imposed when a human teaches behaviors to a learning algorithm. Three points are left as open problems: choosing suitable tasks for providing subgoals, measuring the quantitative difference in cognitive load among the types of provided human knowledge, and developing a graphical user interface (GUI) for teaching by subgoals. We consider tasks with perceptual structure, such as navigation in four rooms, to be suitable for providing subgoals. The four-rooms domain is a grid whose structure is explicitly clear, so hallways between rooms tend to be selected. If the task had a single room, participants would be confused and unsure of where to place subgoals. Tasks without perceptual structure may be better suited to providing optimal trajectories. The GUI is significant for both teachers and agents: the cognitive load of teachers may decrease, and appropriate subgoals can be acquired to accelerate learning. The agent needs interpretability with regard to its behaviors so that humans can acquire the desired information efficiently. We will consider incorporating the XAI approach [molnar2019] into the GUI.
VII Conclusion
In reinforcement learning, learning a policy is time-consuming. We aimed to accelerate learning with a reward transformation based on human subgoal knowledge. Although SARSA-RS, which incorporates state aggregation information into rewards, is helpful, humans can rarely deal with all states in an environment with high-dimensional observations. We proposed a method by which a human deals with only several characteristic states as subgoals. We defined a subgoal as the goal state of one of the subtasks into which a human decomposes a task. The main part of our method is the dynamic aggregation of trajectories into abstract states with a subgoal series. The method works with an accumulated reward function in the environment, which returns the rewards of n-step transitions. We collected ordered subgoals from participants and used them for evaluation on four-rooms navigation, pinball, and a pick-and-place task. The experimental results revealed that our method with human subgoals enabled faster learning than the baseline method and that human subgoal series were more helpful than random ones. We could apply SARSA-RS with our method to an environment with high-dimensional observations, and learning was clearly accelerated. Future work involves analyzing the characteristics of human subgoals to clearly define the subgoals humans provide.