1 Introduction
The RoboCup 3D Soccer Simulation environment provides a dynamic, real-time, complex, adversarial, and stochastic multi-agent environment for simulated agents. The simulated agents organize their behavior in two layers: 1. the physical layer, where controls related to walking, kicking, etc. are conducted; and 2. the decision layer, where high-level actions are taken so that behaviors emerge. In this paper, we investigate a mechanism that allows the decision layer to use recently introduced Off-Policy Gradient-Descent algorithms in Reinforcement Learning (RL), which support learnable knowledge representations, to learn a dynamic role assignment function.
In order to learn an effective dynamic role assignment function, the agents need to consider the dynamics of agent-environment interactions. We consider these interactions as the agent's knowledge. If this knowledge is represented in a formalized form (e.g., first-order predicate logic), an agent could infer many aspects of its interactions consistent with that knowledge. Knowledge representational forms show different degrees of computational complexity and expressiveness [22]. The computational requirements increase with the expressiveness of the representational form. Therefore, we need to identify and commit to a representational form that is scalable for online learning while preserving expressivity. A human soccer player knows a great deal about the game before (s)he enters the field, and this prior knowledge influences the outcome of the game to a great extent. In addition, human soccer players dynamically change their knowledge during games in order to achieve maximum rewards. Therefore, the knowledge of a human soccer player is to a certain extent either predictive or goal-oriented. Can a robotic soccer player collect and maintain predictive and goal-oriented knowledge? This is a challenging problem for agents with time constraints and limited computational resources.
We learn the role assignment function using a framework developed based on the concepts of Horde, the real-time learning architecture for expressing knowledge using General Value Functions (GVFs) [22]. Similar to Horde's sub-agents, the agents in a team are treated as independent RL sub-agents, but the agents take actions based on their belief of the world model. The agents may have different world models due to noisy perceptions and communication delays. The GVFs are constituted within the RL framework. They are predictions or off-policy controls that are answers to questions. For example, in order to make a prediction, a question of the following form must be asked: "If I move in this formation, would I be in a position to score a goal?", or "What set of actions do I need to block the progress of the opponent agent with the number 3?". The question defines what to learn. Thus, the problem of prediction or control can be addressed by learning value functions. An agent obtains its knowledge from information communicated back and forth between the agents and from agent-environment interaction experiences.
There are primarily two kinds of algorithms to learn about GVFs, both based on Off-Policy Gradient Temporal-Difference learning: 1. with action-value methods, a prediction question uses the GQ(λ) algorithm [8], and a control or goal-oriented question uses the Greedy-GQ(λ) algorithm [9]. These algorithms learn about deterministic target policies, and the control algorithm finds the greedy action with respect to the action-value function; and 2. with policy-gradient methods, a goal-oriented question can be answered using the Off-Policy Actor-Critic (Off-PAC) algorithm [24], with an extended state-value function, GTD(λ) [7], for GVFs. The policy-gradient methods are favorable for problems having stochastic optimal policies, adversarial environments, and large action spaces. The off-policy gradient TD algorithms possess a number of properties that are desirable for online learning within the RoboCup 3D Soccer Simulation environment: 1. off-policy updates; 2. linear function approximation; 3. no restrictions on the features used; 4. temporal-difference learning; 5. online and incremental operation; 6. linear memory and per-time-step computation costs; and 7. convergence to a local optimum or equilibrium point [23, 9].
In this paper, we present a methodology and an implementation to learn a dynamic role assignment function based on GVFs, considering the dynamics of agent-environment interactions. The agents ask questions, and approximate value functions answer those questions. The agents independently learn about the role assignment functions in the presence of an adversary team. Based on the interactions, the agents may have to change their roles in order to maintain the formation and to maximize rewards. There is a finite number of roles that an agent can commit to, and the GVFs learn about the role assignment function. We have conducted all our experiments in the RoboCup 3D Soccer Simulation League environment, which is based on the general-purpose multi-agent simulator SimSpark (http://svn.code.sf.net/p/simspark/svn/trunk/). The robot agents in the simulation are modeled on the Aldebaran NAO robot (http://www.aldebaranrobotics.com/). Each robot has 22 degrees of freedom. The agents communicate with the server through message passing, and each agent is equipped with noise-free joint perceptors and effectors. In addition, each agent has a noisy, restricted vision cone of 120°. Every simulation cycle is limited to 20 ms, during which agents perceive noise-free angular measurements of each joint and stimulate the necessary joints by sending torque values to the simulation server. The vision information from the server is available every third cycle (60 ms), providing spherical coordinates of the perceived objects. The agents also have the option of communicating with each other every other simulation cycle (40 ms) by broadcasting a message. The simulation league competitions are currently conducted with 11 robots on each side (22 total).

The remainder of the paper is organized as follows: In Section 2, we briefly discuss knowledge representation forms and existing role assignment formalisms. In Section 3, we introduce GVFs within the context of robotic soccer. In Section 4, we formalize our mechanism of dynamic role assignment functions within GVFs. In Section 5, we identify the question and answer functions that represent the GVFs, and Section 6 presents the experimental results and discussion. Finally, Section 7 contains concluding remarks and future work.
2 Related Work
One goal of multi-agent systems research is the investigation of the prospects of efficient cooperation among a set of agents in real-time environments. In our research, we focus on the cooperation of a set of agents in a real-time robotic soccer simulation environment, where the agents learn an optimal or a near-optimal role assignment function within a given formation using GVFs. This subtask is particularly challenging compared to other simulation leagues considering the limitations of the environment, i.e., the limited locomotion capabilities, limited communication bandwidth, and crowd management rules. The role assignment is a part of the hierarchical machine learning paradigm [20, 19], where a formation defines the role space. Homogeneous agents can change roles flexibly within a formation to maximize a given reward function.

The RL framework offers a set of tools to design sophisticated and hard-to-engineer behaviors in many different robotic domains (e.g., [4]). Within the domain of robotic soccer, RL has been successfully applied in learning the keepaway subtask in the RoboCup 2D [18] and 3D [16] Soccer Simulation Leagues. Also, in other RoboCup leagues, such as the Middle Size League, RL has been applied successfully to acquire competitive behaviors [2]. One of the most noticeable impacts of RL is reported by Brainstormers, the RoboCup 2D Simulation League team, on learning different subtasks [14]. A comprehensive analysis of a general batch RL framework for learning challenging and complex behaviors in robot soccer is reported in [15]. Despite the lack of convergence guarantees, Q(λ) [21] with linear function approximation has been used for role assignment in robot soccer [5], and faster learning is observed with the introduction of heuristically accelerated methods [3]. A dynamic role allocation framework based on dynamic programming is described in [6] for real-time soccer environments. Role assignment with this method is tightly coupled with the agent's low-level abilities and does not take the opponents into consideration. On the other hand, the proposed framework uses the knowledge of the opponent positions as well as other dynamics for the role assignment function.

Sutton et al. [22] introduced a real-time learning architecture, Horde, for expressing knowledge using General Value Functions (GVFs). Our research builds on Horde to ask a set of questions such that the agents assign optimal or near-optimal roles within formations. In addition, the following research efforts describe methods and components to build strategic agents: [1] describes a methodology to build a cognizant robot that possesses a vast amount of situated, reversible, and expressive knowledge; [11] presents a methodology for "nexting", predicting thousands of features of the world state in real time; and [10] presents methods to predict temporally extended consequences of a robot's behaviors in general forms of knowledge. GVFs have also been used successfully (e.g., [13, 25]) for switching and prediction tasks in assistive biomedical robots.
3 Learnable knowledge representation for Robotic Soccer
Recently, within the context of the RL framework [21], a knowledge representation language has been introduced that is expressive and learnable from sensorimotor data. This representation is directly usable for robotic soccer, as agent-environment interactions are conducted through perceptors and actuators. In this approach, knowledge is represented as a large number of approximate value functions, each with its 1. own policy; 2. pseudo-reward function; 3. pseudo-termination function; and 4. pseudo-terminal-reward function [22]. In continuous state spaces, approximate value functions are learned using function approximation and efficient off-policy learning algorithms. First, we briefly introduce some of the important concepts related to GVFs; the complete information about GVFs is available in [22, 8, 9, 7]. Second, we show their direct application to simulated robotic soccer.
3.1 Interpretation
The interpretation of the approximate value function as a knowledge representation language grounded on information from perceptors and actuators is defined as:
Definition 1
The knowledge expressed as an approximate value function is true or accurate if its numerical values match those of the mathematically defined value function it is approximating.
Therefore, according to Definition (1), a value function asks a question, and an approximate value function is the answer to that question. Based on this interpretation, the standard RL framework extends to represent learnable knowledge as follows. In the standard RL framework [21], the agent and the world interact in discrete time steps, $t = 1, 2, 3, \ldots$. The agent senses the state, $s_t \in \mathcal{S}$, at each time step, and selects an action, $a_t \in \mathcal{A}$. One time step later, the agent receives a scalar reward, $r_{t+1} \in \mathbb{R}$, and senses the state, $s_{t+1} \in \mathcal{S}$. The rewards are generated according to the reward function $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. The objective of the standard RL framework is to learn the stochastic action-selection policy $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$, which gives the probability of selecting each action in each state, $\pi(s, a) = \mathrm{P}(a_t = a \mid s_t = s)$, such that the agent maximizes rewards summed over the time steps. The standard RL framework extends to include a terminal-reward function, $z : \mathcal{S} \rightarrow \mathbb{R}$, where $z(s)$ is the terminal reward received when termination occurs in state $s$. In the RL framework, a discount factor $\gamma \in [0, 1]$ is used to discount delayed rewards. Another interpretation of the discounting factor is a constant probability, $1 - \gamma$, of termination on arrival at a state with zero terminal reward. This factor is generalized to a termination function $\gamma : \mathcal{S} \rightarrow [0, 1]$, where $1 - \gamma(s)$ is the probability of termination at state $s$, at which point the terminal reward $z(s)$ is generated.

3.2 Off-Policy Action-Value Methods for GVFs
The first method to learn about GVFs from off-policy experiences is to use action-value functions. Let $G_t$ be the complete return from state $s_t$ at time $t$; then the sum of the rewards (transient plus terminal) until termination at time $T$ is:

$G_t = \sum_{k=t+1}^{T} r_k + z_T.$
The action-value function is:

$Q^{\pi}(s, a) = \mathbb{E}\left[ G_t \mid s_t = s, a_t = a, \pi, \gamma \right],$

where $Q^{\pi} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. This is the expected return for a trajectory started from state $s$ and action $a$, selecting actions according to the policy $\pi$, until termination occurs according to $\gamma$. We approximate the action-value function with $\hat{Q} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. Therefore, the action-value function is a precise, grounded question, while the approximate action-value function offers the numerical answer. The complete algorithm for Greedy-GQ(λ) with linear function approximation for GVF learning is shown in Algorithm (1).
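As a rough illustration (not the authors' implementation; their experiments use the full Greedy-GQ(λ) from RLLib in C++), a simplified Greedy-GQ-style update with λ = 0 and linear features might be sketched as follows; the function names and step sizes are assumptions:

```python
import numpy as np

def greedy_gq_step(theta, w, phi, r, phi_next_all,
                   gamma=0.99, alpha=0.05, alpha_w=0.01):
    """One simplified Greedy-GQ update (lambda = 0) with linear features.

    theta: primary weights; w: secondary (correction) weights;
    phi: feature vector of (s, a); phi_next_all: feature vectors of
    (s', b) for every action b. Step sizes are illustrative only.
    """
    q_next = [theta @ f for f in phi_next_all]
    phi_bar = phi_next_all[int(np.argmax(q_next))]  # greedy target action
    delta = r + gamma * max(q_next) - theta @ phi   # TD error
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_bar)
    w = w + alpha_w * (delta - w @ phi) * phi
    return theta, w

# toy usage: two features, two actions
theta, w = np.zeros(2), np.zeros(2)
phi = np.array([1.0, 0.0])
next_feats = [np.array([0.0, 1.0]), np.array([1.0, 1.0])]
theta, w = greedy_gq_step(theta, w, phi, r=1.0, phi_next_all=next_feats)
```

The secondary weights `w` estimate the expected TD error and correct the gradient for off-policy sampling, which is what distinguishes this family from ordinary Q-learning with function approximation.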
The GVFs are defined over four functions: $\pi$, $\gamma$, $r$, and $z$. The functions $r$ and $z$ act as pseudo-reward and pseudo-terminal-reward functions respectively. Function $\gamma$ is in pseudo form as well. However, the $\gamma$ function is more substantive than the reward functions, as termination interrupts the normal flow of state transitions. Under a pseudo-termination, the standard termination is omitted. In robotic soccer, the base problem can be defined as the time until a goal is scored by either the home or the opponent team. We can consider that a pseudo-termination has occurred when the striker is changed. The GVF with respect to a state-action function is defined as:

$q^{\pi, \gamma, r, z}(s, a) = \mathbb{E}\left[ G_t \mid s_t = s, a_t = a, \pi, \gamma \right].$
The four functions, $(\pi, \gamma, r, z)$, are the question functions of the GVFs, which in turn define the general value function's semantics. The RL agent learns an approximate action-value function, $\hat{q}$, using these four auxiliary functions. We assume that the state space is continuous and the action space is discrete. We approximate the action-value function using a linear function approximator. We use a feature extractor, $\phi(s, a)$, built on tile coding [21], to generate feature vectors from state variables and actions. This is a sparse binary vector with a constant number of "1" features, and hence a constant norm. In addition, tile coding has the key advantages of enabling real-time learning and of allowing computationally efficient implementations of algorithms that learn approximate value functions. In linear function approximation, there exists a weight vector, $\theta$, to be learned. Therefore, the approximate GVFs are defined as:

$\hat{q}(s, a) = \theta^{\top} \phi(s, a),$

such that $\hat{q} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. Weights are learned using the gradient-descent temporal-difference Algorithm (1) [7]. The algorithm learns stably and efficiently, using linear function approximation, from off-policy experiences. Off-policy experiences are generated from a behavior policy, $\pi_b$, that is different from the policy being learned about, named the target policy, $\pi$. Therefore, one could learn multiple target policies from the same behavior policy.
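A minimal hashed tile coder for one state variable can be sketched as follows; the tiling counts, memory size, and hashing scheme are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def tile_features(x, n_tilings=16, tiles_per_dim=8, memory=512, action=0):
    """Hashed tile coding for a single state variable x in [0, 1].

    Produces a sparse binary vector with up to n_tilings active tiles
    (fewer if hash collisions occur) plus an always-on bias feature.
    """
    phi = np.zeros(memory + 1)
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)   # shifted grid per tiling
        tile = int((x + offset) * tiles_per_dim)
        idx = hash((t, tile, action)) % memory     # hash into fixed memory
        phi[idx] = 1.0
    phi[memory] = 1.0                              # bias feature
    return phi

phi = tile_features(0.37)
```

Because each tiling contributes one active tile, the feature vector has a (nearly) constant number of ones, which keeps per-step computation linear in the number of tilings rather than in the memory size.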
3.3 OffPolicy Policy Gradient Methods for GVFs
The second method to learn about GVFs uses off-policy policy-gradient methods with actor-critic architectures and a state-value function suitable for learning GVFs. It is defined as:

$v^{\pi, \gamma, r, z}(s) = \mathbb{E}\left[ G_t \mid s_t = s, \pi, \gamma \right],$

where $v$ is the true state-value function, and the approximate GVF is defined as:

$\hat{v}(s) = \theta^{\top} \phi(s),$

where the functions $(\pi, \gamma, r, z)$ are defined as in Subsection (3.2). Since our target policy is discrete and stochastic, we use a Gibbs distribution of the form:

$\pi_u(a \mid s) = \frac{e^{u^{\top} \phi(s, a)}}{\sum_{b} e^{u^{\top} \phi(s, b)}},$

where $\phi(s, a)$ are state-action features for state $s$ and action $a$, which are in general unrelated to the state features $\phi(s)$ used in the state-value function approximation. $u$ is a weight vector, which is modified by the actor to learn about the stochastic target policy. The log-gradient of the policy at state $s$ and action $a$ is:

$\nabla_u \log \pi_u(a \mid s) = \phi(s, a) - \sum_{b} \pi_u(b \mid s)\, \phi(s, b).$
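The Gibbs distribution and its log-gradient above can be written down directly; this is a generic sketch (the feature values below are invented for illustration):

```python
import numpy as np

def gibbs_policy(u, feats):
    """Gibbs (softmax) distribution over discrete actions.

    feats has shape (n_actions, n_features); row a is the state-action
    feature vector phi(s, a).
    """
    prefs = feats @ u
    prefs = prefs - prefs.max()          # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def log_gradient(u, feats, a):
    """grad_u log pi(a|s) = phi(s,a) - sum_b pi(b|s) phi(s,b)."""
    pi = gibbs_policy(u, feats)
    return feats[a] - pi @ feats

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
u = np.array([0.2, -0.1])
pi = gibbs_policy(u, feats)
```

A useful sanity check on the gradient formula is that its expectation under the policy is the zero vector, which follows from the two terms canceling.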
The complete algorithm for Off-PAC with linear function approximation for GVF learning is shown in Algorithm (2).
We are interested in finding optimal policies for the dynamic role assignment, and hence we use Algorithms (1) and (2) for control purposes. (We use a C++ implementation of Algorithms (1) and (2) in all of our experiments; an implementation is available at https://github.com/samindaa/RLLib.) We use linear function approximation for continuous state spaces, and discrete actions are used within options. Lastly, to summarize, the definitions of the question functions and the answer functions are given as:
Definition 2
The question functions are defined by:
- π (target policy; greedy w.r.t. the learned value function);
- γ (termination function);
- r (transient reward function); and
- z (terminal reward function).
Definition 3
The answer functions are defined by:
- the behavior policy;
- the interest function;
- the feature-vector function; and
- the eligibility-trace decay-rate function.
4 Dynamic Role Assignment
A role is a specification of an internal or an external behavior of an agent. In our soccer domain, roles select behaviors of agents based on different reference criteria; for example, the agent closest to the ball becomes the striker. Given a role space of a fixed size, collaboration among the agents is obtained through formations. The role space consists of active and reactive roles. For example, the striker is an active role, and the defender could be a reactive role. Given a reactive role, there is a function that maps roles to target positions on the field. These target positions are calculated with respect to a reference pose (e.g., the ball position) and other auxiliary criteria such as crowd management rules. A role assignment function provides a mapping from the role space to the agent space, while maximizing some reward function. The role assignment function can be static or dynamic. Static role assignments often provide inferior performance in robot soccer [6]. Therefore, we learn a dynamic role assignment function within the RL framework using off-policy control.
4.1 Target Positions with the Primary Formation
Within our framework, an agent can choose one role among thirteen roles. These roles are part of a primary formation, and an agent calculates the respective target positions according to its belief of the absolute ball position and the rules imposed by the 3D soccer simulation server. We have labeled the role space in order to describe the behaviors associated with the roles. Figure (1) shows the target positions for the role space before the kickoff state. The agent closest to the ball takes the striker (SK) role, which is the only active role. The forward left (FL) and forward right (FR) target positions are laterally offset from the agent's belief of the absolute ball position. The extended forward left (EX1L) and extended forward right (EX1R) target positions are offset further. The stopper (ST) position is placed between the ball and the home goal. The extended middle (EX1M) position is used as a blocking position, and it is calculated based on the closest opponent to the current agent. The other target positions, wing left (WL), wing right (WR), wing middle (WM), back left (BL), back right (BR), and back middle (BM), are calculated with respect to the vector from the middle of the home goal to the ball and are offset by a factor that increases close to the home goal. When the ball is within the reach of the goal keeper, the goal keeper (GK) role is changed to the goal keeper striker (GKSK) role. We slightly change the positions when the ball is near the side lines, the home goal, or the opponent goal. These adjustments are made in order to keep the target positions inside the field. We allow target positions to overlap; the dynamic role assignment function may assign the same role to multiple agents during the learning period. In order to avoid position conflicts an offset is added, and the feedback provides negative rewards for such situations.
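The target-position computation described above can be sketched as follows; all numeric offsets, scale factors, and the home-goal coordinate are hypothetical placeholders (the paper's actual values are not reproduced here):

```python
import numpy as np

HOME_GOAL = np.array([-15.0, 0.0])  # assumed home-goal midpoint

def target_position(role, ball, scale=0.3):
    """Compute a reactive role's target position from the ball belief.

    FL/FR are lateral offsets from the ball; WL/WM/WR and BL/BM/BR lie
    on the vector from the home goal to the ball, pulled back toward
    the home goal. All offsets are illustrative assumptions.
    """
    to_ball = ball - HOME_GOAL
    if role == "FL":
        return ball + np.array([-1.0, 1.5])
    if role == "FR":
        return ball + np.array([-1.0, -1.5])
    if role == "ST":
        return ball + np.array([-2.0, 0.0])
    if role in ("WL", "WM", "WR"):
        lateral = {"WL": 1.5, "WM": 0.0, "WR": -1.5}[role]
        return HOME_GOAL + 0.6 * to_ball + np.array([0.0, lateral])
    if role in ("BL", "BM", "BR"):
        lateral = {"BL": 1.0, "BM": 0.0, "BR": -1.0}[role]
        return HOME_GOAL + scale * to_ball + np.array([0.0, lateral])
    raise ValueError(role)

pos = target_position("WM", np.array([0.0, 0.0]))
```

In the same spirit, clamping the returned positions to the field boundary would implement the side-line and goal-area adjustments the section mentions.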
4.2 Roles to RL Action Mapping
The agent closest to the ball becomes the striker, and only one agent is allowed to become the striker. The other agents, except the goalie, are allowed to choose from twelve roles. We map the available roles to discrete actions of the RL algorithm. In order to use Algorithm 1, an agent must formulate a question function using a value function, and the answer function provides the solution as an approximate value function. All the agents formulate the same question: "What is my role in this formation in order to maximize future rewards?" All agents learn independently according to the question, while collaboratively aiding each other to maximize their future rewards. We make the assumption that the agents do not communicate their current roles. Therefore, at a specific step, multiple agents may commit to the same role. We discourage this condition by modifying the question to: "What is my role in this formation in order to maximize future rewards, while maintaining a completely different role from all teammates at all time steps?"
4.3 State Variables Representation
Figure 2 shows the schematic diagram of the state variable representation. All points and vectors in Figure 2 are defined with respect to a global coordinate system: the middle point of the home goal, the middle point of the opponent goal, the ball position, the self-localized points of the teammate agents, points in the direction of the orientation of the teammate agents, the midpoints of the tracked opponent agents, and points on vectors parallel to given unit vectors. Using these labels, we define the state variables as vector lengths and as angles among triples of points pivoted at a common point. The state variables cover the teammates from a starting id to an ending id, together with a fixed number of tracked opponents. Angles are normalized.
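Angle normalization of the kind used for the state variables can be sketched as follows; the target range [-π, π) is an assumption, since the original range is not legible in the text:

```python
import math

def normalize_angle(theta):
    """Wrap an angle in radians to [-pi, pi).

    The exact normalization range is an assumption for illustration.
    """
    return (theta + math.pi) % (2.0 * math.pi) - math.pi
```

Normalizing all angular state variables to a fixed range keeps the tile-coded features consistent regardless of how many full turns an underlying measurement accumulates.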
5 Question and Answer Functions
There are twelve actions available in each state; we have left out the striker role from the action set. The agent nearest to the ball becomes the striker. All agents communicate their beliefs to the other agents. Based on these beliefs, all agents calculate a cost function and assign the closest agent as the striker. We have formulated a cost function based on the relative distance to the ball, the angle of the agent, the number of teammates and opponents within a region near the ball, and whether the agents are active. In our formulation, there is a natural termination condition: scoring goals. With respect to the striker role assignment procedure, we define a pseudo-termination condition: when an agent becomes a striker, a pseudo-termination occurs, and the striker agent does not participate in the learning process unless it chooses another role. We define the question and answer functions as follows:
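A striker-selection cost of the kind described above might look like the following sketch; the weights and the penalty for inactive (e.g., fallen) agents are invented for illustration, not taken from the paper:

```python
import math

def striker_cost(dist_to_ball, angle_to_ball, n_teammates_near,
                 n_opponents_near, is_active,
                 w_dist=1.0, w_angle=0.5, w_team=0.3, w_opp=0.2):
    """Hypothetical striker-selection cost; lower cost wins the role.

    Combines distance and angle to the ball with local teammate and
    opponent counts; inactive agents are heavily penalized.
    """
    cost = (w_dist * dist_to_ball
            + w_angle * abs(angle_to_ball) / math.pi
            + w_team * n_teammates_near
            + w_opp * n_opponents_near)
    if not is_active:
        cost += 100.0
    return cost

# Each agent evaluates the cost for every teammate's communicated
# belief; the agent with minimal cost becomes the striker.
costs = {2: striker_cost(1.0, 0.1, 0, 1, True),
         3: striker_cost(0.8, 2.0, 1, 1, True),
         4: striker_cost(0.5, 0.0, 0, 0, False)}
striker = min(costs, key=costs.get)
```

Because every agent evaluates the same deterministic cost on the same communicated beliefs, all teammates agree on the striker without exchanging role decisions.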
5.1 GVF Definitions for StateAction Functions
Question functions:
- π: greedy w.r.t. the learned action-value function;
- γ: the termination function, with a pseudo-termination timeout measured in seconds;
- r: (a) the change of value of the absolute ball position; (b) a small negative reward for each cycle; (c) a negative reward given to all agents within a radius of 1.5 meters of one another; and
- z: (a) a positive terminal reward for scoring against the opponent; (b) a negative terminal reward when the opponent scores.
Answer functions:
- the behavior policy: greedy w.r.t. the target state-action function;
- the interest function;
- the feature-vector function: (a) we use tile coding to formulate the feature vector from the state variables; (b) each state variable is independently tiled with 16 tilings, each with a coarse generalization, so that a constant number of active tiles (i.e., tiles with feature value 1) is hashed into a binary feature vector, and the bias feature is always active; and
- the eligibility-trace decay-rate function.
Parameters: the step sizes, the eligibility-trace decay rate (using an efficient trace implementation), and the exploration rate.
5.2 GVF for Gradient Descent Functions
Question functions:
- π: a Gibbs distribution;
- γ: the termination function, with a pseudo-termination timeout measured in seconds;
- r: (a) the change of value of the absolute ball position; (b) a small negative reward for each cycle; (c) a negative reward given to all agents within a radius of 1.5 meters of one another; and
- z: (a) a positive terminal reward for scoring against the opponent; (b) a negative terminal reward when the opponent scores.
Answer functions:
- the behavior policy: the learned Gibbs distribution is used with a small perturbation; in order to provide exploration, with a small probability the Gibbs distribution is perturbed;
- the feature-vector function for the state-value function: we use tile coding to formulate the feature vector from the state variables; each state variable is independently tiled with 16 tilings, each with a coarse generalization, so that a constant number of active tiles (i.e., tiles with feature value 1) is hashed into a binary feature vector, and the bias feature is always active;
- the feature-vector function for the Gibbs distribution: tile coding as above, where the hashing also takes the given action into account, and the bias feature is always active; and
- the eligibility-trace decay-rate function.
Parameters: the actor and critic step sizes, the eligibility-trace decay rates (using an efficient trace implementation), and the perturbation probability.
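The perturbed-Gibbs behavior policy of this section can be sketched as follows; the mixing scheme and the perturbation probability are assumptions for illustration, since the exact values are not legible in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def behavior_policy(pi, epsilon=0.05):
    """Sample an action from a perturbed Gibbs distribution.

    With probability epsilon, the learned target distribution pi is
    mixed with a uniform distribution before sampling; otherwise the
    target distribution is used directly. The mixing scheme is an
    illustrative assumption.
    """
    pi = np.asarray(pi, dtype=float)
    if rng.random() < epsilon:
        pi = 0.5 * pi + 0.5 * np.ones(len(pi)) / len(pi)
        pi = pi / pi.sum()
    return int(rng.choice(len(pi), p=pi))

a = behavior_policy([0.7, 0.2, 0.1])
```

Keeping the behavior policy close to the target policy, rather than uniform, concentrates experience on trajectories where positive samples against an adversary are actually observed.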
6 Experiments
We conducted experiments against the teams Boldhearts and MagmaOffenburg, both semi-finalists of the RoboCup 3D Soccer Simulation competition in Mexico 2012. (The published binary of the team UTAustinVilla showed unexpected behaviors in our tests and is therefore omitted.) We conducted knowledge learning according to the configuration given in Section (5). Subsection (6.1) describes the performance of Algorithm (1), and Subsection (6.2) describes the performance of Algorithm (2) for the experimental setup.
6.1 GVFs with Greedy-GQ(λ)
The first experiments were done using a team size of five, with the RL agents playing against Boldhearts. After 140 games, our RL agents increased the chance of winning from 30% to 50%. This number does not increase further in subsequent games, but after 260 games the number of lost games (initially 35%) is reduced to 15%. In further experiments, we used the goal difference to compare the performance of the RL agents. Figure (3) shows the average goal differences that the hand-tuned role assignment and the RL agents achieve in games against Boldhearts and MagmaOffenburg using different team sizes. With only three agents per team, the RL agents need only 40 games to learn a policy that outperforms the hand-coded role selection (Figure (3(a))). Also with five agents per team, the learning agents are able to increase the goal difference against both opponents (Figure (3(b))). However, they do not reach the performance of the manually tuned role selection. Nevertheless, considering the amount of time spent fine-tuning the hand-coded role selection, these results are promising. Furthermore, the outcome of the games depends heavily on the underlying skills of the agents, such as walking or dribbling. These skills are noisy; thus the results need to be averaged over many games (standard deviations in Figure (3) are between 0.5 and 1.3).
The results in Figure (3(c)) show a bigger gap between the RL and the hand-coded agents. However, using seven agents, the goal difference is generally decreased, since the defense is easily improved by increasing the number of agents; the hand-coded role selection also results in a smaller goal difference. Furthermore, with seven agents in each team, the state space is increased significantly, and 200 games seem insufficient to learn a good policy. Sometimes the RL agents reach a positive goal difference, but it stays below that of the hand-coded role selection. In Section 7, we discuss some of the reasons for this inferior performance for team size seven. Even though the RL agents did not perform well considering only the goal difference, they have learned a moderately satisfactory policy: after 180 games, the fraction of games won increased slightly, from initially 10% to approximately 20%.
6.2 GVFs with Off-PAC
With Off-PAC, we used an environment similar to that of Subsection (6.1), but with a different learning setup. Instead of learning individual policies against the teams separately, we learned a single policy against both teams. We ran the opponent teams in a round-robin fashion for 200 games and repeated the complete runs multiple times. The first experiments were done using a team size of three, with the RL agents playing against both teams. Figure (4(a)) shows the results in bins of 20 games, averaged over two trials. After 20 games, the RL agents had learned a stable policy, but the learned policy remained bounded above by the hand-tuned role assignment function. The second experiments were done using a team size of five, with the RL agents playing against the opponent teams. Figure (4(b)) shows the results in bins of 20 games, averaged over three trials. After 100 games, our RL agents increased the chance of winning to 50%; this number does not increase further in subsequent games. As Figures (4(a)) and (4(b)) show, with three and five agents per team the RL agents are able to increase the goal difference against both opponents. However, they do not reach the performance of the manually tuned role selection. As in Subsection (6.1), considering the amount of time spent fine-tuning the hand-coded role selection, these results are promising, and the outcome of the experiment depends heavily on the underlying skills of the agents.
The final experiments were done using a team size of seven, with the RL agents playing against the opponent teams. Figure (4(c)) shows the results in bins of 20 games, averaged over two trials. As in Subsection (6.1), with seven agents per team, the results in Figure (4(c)) show a bigger gap between the RL and the hand-tuned agents. However, using seven agents, the goal difference is generally decreased, since the defense is easily improved by increasing the number of agents; the hand-tuned role selection also results in a smaller goal difference. Figure (4(c)) shows an increasing trend in the number of games won. As mentioned earlier, 200 games seem insufficient to learn a good policy. Even though the RL agents sometimes reach a positive goal difference, it stays below that of the hand-tuned role selection method. Within the given setting, the RL agents have learned a moderately satisfactory policy. Whether the learned policy is satisfactory against other teams needs to be investigated further.
The RoboCup 3D soccer simulation is inherently a dynamic and stochastic environment. There is an infinitesimal chance that a given situation (state) recurs over many games. Therefore, it is of paramount importance that the learning algorithms extract as much information as possible from the training examples. We use the algorithms in the online incremental setting, and once the experience is consumed it is discarded. Since we learn from off-policy experiences, we could save the transition tuples, $(s_t, a_t, r_{t+1}, s_{t+1})$, and learn the policy offline. Greedy-GQ(λ) learns a deterministic greedy policy. This may not be suitable for complex and dynamic environments such as the RoboCup 3D soccer simulation environment. The Off-PAC algorithm is designed for stochastic environments. The experiments show that this algorithm needs careful tuning of learning rates and feature selection, as is evident from Figure (4(a)) after 160 games.

7 Conclusions
We have designed and experimented with RL agents that learn to assign roles in order to maximize expected future rewards. All the agents in the team ask the question "What is my role in this formation in order to maximize future rewards, while maintaining a completely different role from all teammates at all time steps?". This is a goal-oriented question. We use Greedy-GQ(λ) and Off-PAC to learn experientially grounded knowledge encoded in GVFs. The dynamic role assignment function is abstracted from all other low-level components such as the walking engine, obstacle avoidance, object tracking, etc. If the role assignment function selects a passive role and assigns a target location, the lower layers handle this request. If the lower layers fail to comply with this request, for example by being reactive, this feedback is not provided to the role assignment function. If this information needs to be included, it should become a part of the state representation, and the reward signal should be modified accordingly. The target positions for passive roles are created with respect to the absolute ball location and the rules imposed by the 3D soccer simulation league. When the ball moves relatively quickly, the target locations change more quickly. We have given positive rewards only for forward ball movements. In order to reinforce more agents within an area close to the ball, we would need to provide appropriate rewards. These are part of reward shaping [12]. Reward shaping should be handled carefully, as the agents may learn suboptimal policies that do not contribute to the overall goal.
The experimental evidence shows that the agents learn competitive role assignment functions for defending and attacking. We have to emphasize that the behavior policy is greedy with a relatively small amount of exploration, or slightly perturbed around the target policy. It is not a uniformly distributed policy as used in [22]. The main reason for this decision is that when an adversary is present with the intention of maximizing its own objectives, the learning agent may in practice have to run for a long period before observing positive samples. Therefore, we have used the off-policy GreedyGQ(λ) and Off-PAC algorithms for learning goal-oriented GVFs within an on-policy control setting. Our hypothesis is that with improvements to the functionality of the lower layers, the role assignment function will find better policies for the given question and answer functions. Our next step is to let the RL agent learn policies against other RoboCup 3D soccer simulation league teams. Besides the role assignment, we also contributed by testing off-policy learning in high-dimensional state spaces in a competitive adversarial environment. We have conducted experiments with three, five, and seven agents per team; the full game consists of eleven agents per team. The next step is to extend learning to consider all agents, and to include methods that select informative state variables and features.

References
 [1] Degris, T., Modayil, J.: Scaling-up Knowledge for a Cognizant Robot. In: Notes of the AAAI Spring Symposium on Designing Intelligent Robots: Reintegrating AI (2012)
 [2] Gabel, T., Lange, S., Lauer, M., Riedmiller, M.: Bridging the Gap: Learning in the RoboCup Simulation and Midsize League. In: Proceedings of the 7th Portuguese Conference on Automatic Control (Controlo) (2006)
 [3] Gurzoni, Jr., J.A., Tonidandel, F., Bianchi, R.A.C.: Market-Based Dynamic Task Allocation using Heuristically Accelerated Reinforcement Learning. In: Proceedings of the 15th Portuguese Conference on Progress in Artificial Intelligence. pp. 365–376. EPIA'11, Springer-Verlag, Berlin, Heidelberg (2011)
 [4] Kober, J., Bagnell, J.A.D., Peters, J.: Reinforcement Learning in Robotics: A Survey. International Journal of Robotics Research (July 2013)
 [5] Köse, H., Tatlıdede, U., Meriçli, C., Kaplan, K., Akın, H.L.: Q-Learning Based Market-Driven Multi-Agent Collaboration in Robot Soccer. In: Proceedings of the Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN). pp. 219–228 (2004)
 [6] MacAlpine, P., Urieli, D., Barrett, S., Kalyanakrishnan, S., Barrera, F., Lopez-Mobilia, A., Ştiurcă, N., Vu, V., Stone, P.: UT Austin Villa 2011: A Champion Agent in the RoboCup 3D Soccer Simulation Competition. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012) (June 2012)
 [7] Maei, H.R.: Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta (2011)
 [8] Maei, H.R., Sutton, R.S.: GQ(λ): A General Gradient Algorithm for Temporal-Difference Prediction Learning with Eligibility Traces. In: Proceedings of the 3rd Conference on Artificial General Intelligence (AGI-10). pp. 1–6 (2010)
 [9] Maei, H.R., Szepesvári, C., Bhatnagar, S., Sutton, R.S.: Toward Off-Policy Learning Control with Function Approximation. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010). pp. 719–726 (2010)
 [10] Modayil, J., White, A., Pilarski, P.M., Sutton, R.S.: Acquiring a Broad Range of Empirical Knowledge in Real Time by Temporal-Difference Learning. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC). pp. 1903–1910. IEEE (2012)
 [11] Modayil, J., White, A., Sutton, R.S.: Multi-timescale Nexting in a Reinforcement Learning Robot. In: From Animals to Animats 12: 12th International Conference on Simulation of Adaptive Behavior (SAB). pp. 299–309 (2012)
 [12] Ng, A.Y., Harada, D., Russell, S.J.: Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In: Proceedings of the Sixteenth International Conference on Machine Learning (ICML). pp. 278–287. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999)
 [13] Pilarski, P., Dawson, M., Degris, T., Carey, J., Sutton, R.: Dynamic Switching and Real-Time Machine Learning for Improved Human Control of Assistive Biomedical Robots. In: 4th IEEE RAS EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob). pp. 296–302 (June 2012)
 [14] Riedmiller, M., Gabel, T.: On Experiences in a Complex and Competitive Gaming Domain: Reinforcement Learning Meets RoboCup. In: Third IEEE Symposium on Computational Intelligence and Games. pp. 17–23. IEEE (2007)
 [15] Riedmiller, M., Gabel, T., Hafner, R., Lange, S.: Reinforcement Learning for Robot Soccer. Autonomous Robots 27, 55–73 (July 2009)
 [16] Seekircher, A., Abeyruwan, S., Visser, U.: Accurate Ball Tracking with Extended Kalman Filters as a Prerequisite for a High-Level Behavior with Reinforcement Learning. In: The 6th Workshop on Humanoid Soccer Robots at Humanoids Conference, Bled (Slovenia) (2011)
 [17] Stoecker, J., Visser, U.: RoboViz: Programmable Visualization for Simulated Soccer. In: Röfer, T., Mayer, N.M., Savage, J., Saranlı, U. (eds.) RoboCup. pp. 282–293. Lecture Notes in Computer Science, Springer (2011)
 [18] Stone, P., Sutton, R.S., Kuhlmann, G.: Reinforcement Learning for RoboCup Soccer Keepaway. Adaptive Behavior 13(3), 165–188 (2005)
 [19] Stone, P., Veloso, M.: Layered Learning. In: Proceedings of the Eleventh European Conference on Machine Learning. pp. 369–381. Springer Verlag (1999)
 [20] Stone, P., Veloso, M.: Task Decomposition, Dynamic Role Assignment, and Low-Bandwidth Communication for Real-Time Strategic Teamwork. Artificial Intelligence 110(2), 241–273 (June 1999)
 [21] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
 [22] Sutton, R.S., Modayil, J., Delp, M., Degris, T., Pilarski, P.M., White, A., Precup, D.: Horde: A Scalable Real-Time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction. In: The 10th International Conference on Autonomous Agents and Multiagent Systems. pp. 761–768. AAMAS '11, International Foundation for Autonomous Agents and Multiagent Systems (2011)
 [23] Sutton, R.S., Szepesvári, C., Maei, H.R.: A Convergent O(n) Algorithm for Off-Policy Temporal-Difference Learning with Linear Function Approximation. In: Advances in Neural Information Processing Systems (NIPS). pp. 1609–1616. MIT Press (2008)
 [24] Degris, T., White, M., Sutton, R.S.: Off-Policy Actor-Critic. In: Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML) (2012)
 [25] White, A., Modayil, J., Sutton, R.: Scaling Life-Long Off-Policy Learning. In: IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL). pp. 1–6 (2012)