Many recent works treat the sequential decision-making problem of multiple autonomous, interacting agents as a Multi-Agent Reinforcement Learning (MARL) problem. In this framework, each agent learns a policy to optimize a long-term discounted reward through interacting with the environment and the other agents [zhang2021multi, qie2019joint, cui2019multi]. In many realistic applications, the environment states are only partially observable to the agents [hausknecht2015deep]. Learning in such environments has become a recurrent problem in MARL [zhang2021multi], the difficulty of which lies in the fact that the optimal policy might require a complete history of the whole system to determine the next action to perform [meuleau1999learning]. The authors in [peshkin2001learning] used finite policy graphs to represent policies with memory and successfully applied a policy gradient MARL algorithm in a fully cooperative setting.
In this paper, we consider the problem of controlling a heterogeneous team of agents to satisfy a specification given as a Truncated Linear Temporal Logic (TLTL) formula [li2017reinforcement] using MARL. TLTL is a version of Linear Temporal Logic (LTL) [pnueli1977temporal]. The semantics of TLTL is given over a finite trajectory in a set such as the state space of the system. TLTL has dual semantics. In its Boolean (qualitative) semantics (whether the formula is satisfied), a trajectory satisfying a TLTL formula is accepted by a Finite State Automaton (FSA). The quantitative semantics assigns a degree of satisfaction (robustness) of a formula. Designing a reward function that accurately represents the specification is an important issue in reinforcement learning. We use the robustness of the TLTL specification as the reward of the MARL problem. However, the TLTL robustness is evaluated over the entire trajectory, so we can only get the reward at the end of each episode. To address this, we propose a novel reward shaping technique that introduces two additional reward terms based on the quantitative semantics of TLTL and the corresponding FSA. Such rewards, which are obtained after each step, guide and accelerate the learning.
As in [peshkin2001learning], we consider a partially observable environment and train a finite policy graph for each agent. When deploying the policy, each agent only knows its own state. However, during training we assume all agents know the states of each other. Then we use this information to guide and accelerate learning. This assumption is reasonable in practice because the training environment is highly configurable and a communication system can be easily constructed, while during deployment there is no guarantee that such a communication system is available.
The idea of modifying rewards by using the semantics of temporal logics was introduced in [li2017reinforcement] to address the problem of single-agent learning in a fully observable environment. The method was then augmented with Control Barrier Functions (CBF) to successfully control two manipulator arms working together to make and serve hotdogs [li2019formal]. Although the problem in [li2019formal] involved two agents, single-agent RL algorithms were used with two different specially designed task specifications. In [sun2020automata], temporal logic rewards were used for multi-agent systems. As opposed to our work, the environment in [sun2020automata] is fully observable. Moreover, each sub-task in [sun2020automata] is restricted to be executed by one agent. In this paper, we allow both independent tasks, which can be accomplished by one agent, and shared tasks, which must be achieved by the cooperation of several agents. The rewards in [li2019formal] and [sun2020automata] encourage the FSA to leave its current state, without distinguishing whether the transition is towards the satisfaction of the TLTL specification. In this paper, the combination of the two reshaped rewards encourages transitions towards satisfaction. Another related work is [hammond2021multi], in which the authors apply an actor-critic algorithm to find a control policy for a multi-agent system that maximizes the probability of satisfying an LTL specification. However, the reward does not capture a quantitative semantics, and a fully observable environment is assumed.
The contributions of this paper can be summarized as follows. First, we propose a general procedure that uses MARL algorithms to synthesize distributed control from arbitrary TLTL specifications for a heterogeneous team of agents. Each agent can have different capabilities, and both independent and shared sub-tasks can be defined. The agents work in a partially observable environment, and the policy of each agent requires only its own state during deployment. During training, we assume a fully observable environment, which enables the use of TLTL robustness as a reward. Second, we create a novel temporal logic reward shaping technique, where two additional rewards based on TLTL robustness and FSA states are added. Both rewards are obtained immediately after each step, which guides and accelerates the learning process.
2 Preliminaries and Notation
Given a set , we use , , and to denote its cardinality, the set of all subsets of , and a finite sequence over . Given a set , a collection of sets , in which is a set of labels, is called a distribution of if . For , we use to denote the set of labels of elements in that contains .
2.1 Partially Observable Stochastic Games
Single-agent reinforcement learning uses Markov Decision Processes (MDP) as mathematical models [puterman2014markov]. In MARL, more complicated models are needed to describe the interactions between the agents and the environment. A popular model is a Stochastic Game (SG) [bucsoniu2010multi]. A Partially Observable Stochastic Game (POSG), which is a generalization of an SG, is defined as a tuple , in which is an index set for the agents; is the discrete joint state space;
is the probability distribution over initial states; is the set of actions for agent , is the discrete observation space and is the observation function; is the transition function that maps the states of the game and the joint action of the agents, defined as , to probability distributions over states of the game at the next time step; and is the reward function for agent . If every agent can fully observe the environment state when making decisions, i.e., and for all , then the POSG becomes a fully observable SG.
The goal for each agent is to find a policy that maximizes its value :
where is the discount factor and is a set of policies. Note that for each agent the environment is non-stationary, since its reward may depend on the other agents' policies. A memoryless policy is a mapping from the observations of agent to probability distributions over the action space of agent . For a fully observable SG, memoryless policies are sufficient to achieve optimal performance. However, for POSGs, the best memoryless policy can still be arbitrarily worse than the best policy with memory [singh1994learning]. A policy graph [meuleau1999learning] is a common way to represent a policy with memory.
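A policy graph can be made concrete as a small finite-state controller. The sketch below is an illustrative tabular parameterization (the node count, the uniform initialization, and all names are assumptions, not the paper's exact construction): at each step the controller samples an action conditioned on its current internal node and observation, then stochastically updates the internal node, which is what gives the policy memory.

```python
import random

class PolicyGraph:
    """Finite-state controller: internal nodes give the policy memory.
    The tabular, uniformly initialized parameterization here is an
    illustrative assumption; in the paper the distributions are learned
    by policy gradient."""

    def __init__(self, n_nodes, n_obs, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.n_nodes = n_nodes
        self.n_actions = n_actions
        # action distribution conditioned on (internal node, observation)
        self.action_probs = {
            (q, o): [1.0 / n_actions] * n_actions
            for q in range(n_nodes) for o in range(n_obs)
        }
        # node-transition distribution conditioned on (node, observation)
        self.node_probs = {
            (q, o): [1.0 / n_nodes] * n_nodes
            for q in range(n_nodes) for o in range(n_obs)
        }
        self.node = 0  # current internal memory state

    def act(self, obs):
        # Sample an action, then update the internal memory node.
        a = self.rng.choices(range(self.n_actions),
                             weights=self.action_probs[(self.node, obs)])[0]
        self.node = self.rng.choices(range(self.n_nodes),
                                     weights=self.node_probs[(self.node, obs)])[0]
        return a
```

A memoryless policy corresponds to the special case of a single internal node.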
2.2 Truncated Linear Temporal Logic
Truncated Linear Temporal Logic (TLTL) [li2017reinforcement] is a predicate temporal logic inspired by traditional Linear Temporal Logic (LTL) [pnueli1977temporal]. TLTL can express rich task specifications that are satisfied in finite time. Its formulas are defined over predicates of the form f(x) ≥ 0, where f is a function over a set (such as the state space of a system). TLTL formulas are evaluated against finite sequences over this set and have the following syntax: ϕ := ⊤ | f(x) ≥ 0 | ¬ϕ | ϕ∧ψ | ϕ∨ψ | ϕ⇒ψ | ◇ϕ | ○ϕ | ϕ U ψ | ϕ T ψ, where ⊤ is the Boolean True. ¬ (negation), ∧ (conjunction) and ∨ (disjunction) are Boolean connectives. ◇, ○, U, and T are temporal operators that stand for "eventually", "next", "until" and "then", respectively. Given a finite sequence, ◇ϕ (eventually) requires ϕ to be satisfied at some time step, ○ϕ (next) requires ϕ to be satisfied at the second time step, ϕ U ψ (until) requires ϕ to be satisfied at each time step before ψ is satisfied, and ϕ T ψ (then) requires ϕ to be satisfied at least once before ψ is satisfied. TLTL also has derived operators, e.g., □ (finite time always).
TLTL formulas can be interpreted in a qualitative semantics or a quantitative semantics. The qualitative semantics provides a yes/no answer to the corresponding property, while the quantitative semantics generates a measure of the degree of satisfaction. Given a finite trajectory , the quantitative semantics of a formula , also called robustness, is denoted by . We refer the readers to [li2017reinforcement] for detailed definitions of the TLTL qualitative and quantitative semantics.
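The quantitative semantics can be computed recursively over the formula structure. The sketch below covers only a small fragment (predicates, Boolean connectives, "eventually" and "always") with the usual min/max robustness rules; the tuple encoding of formulas is an illustrative assumption and the full TLTL grammar in [li2017reinforcement] has more cases.

```python
def robustness(formula, traj, t=0):
    """Robustness of a small TLTL-like fragment on a finite trajectory.
    Formulas are nested tuples (an assumed encoding):
      ("pred", f)         -- the predicate f(x) >= 0, robustness f(x)
      ("not", phi)        -- sign flip
      ("and"/"or", ...)   -- min / max
      ("eventually"/"always", phi) -- max / min over remaining steps
    A sketch of the standard min/max quantitative semantics, not the
    paper's full definition."""
    op = formula[0]
    if op == "pred":
        return formula[1](traj[t])
    if op == "not":
        return -robustness(formula[1], traj, t)
    if op == "and":
        return min(robustness(formula[1], traj, t),
                   robustness(formula[2], traj, t))
    if op == "or":
        return max(robustness(formula[1], traj, t),
                   robustness(formula[2], traj, t))
    if op == "eventually":
        return max(robustness(formula[1], traj, k)
                   for k in range(t, len(traj)))
    if op == "always":
        return min(robustness(formula[1], traj, k)
                   for k in range(t, len(traj)))
    raise ValueError(f"unknown operator: {op}")
```

A positive value indicates satisfaction and its magnitude measures the margin, which is what makes robustness usable as a reward signal later in the paper.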
2.3 Finite State Automata
Definition (Finite State Predicate Automaton [li2019formal]). A finite state predicate automaton (FSPA) corresponding to a TLTL formula is a tuple , in which is a set of automaton states, is the initial state, is the set of final states, is the set of trap states, is a set of predicate Boolean formulas, where the predicates are evaluated over , is the set of transitions, and maps transitions to formulas in .
The semantics of an FSPA is defined over finite sequences over . We refer to the elements of as edge guards. A transition is enabled if the corresponding Boolean formula is satisfied. Formally, noting that a Boolean formula is a particular case of a TLTL formula, the FSPA transitions from to at time if and only if , where is the subsequence of from to . Note that the robustness is evaluated only at instead of over the whole sequence. Thus we abbreviate to .
A trajectory over is accepting if it ends in the set of final states . A trajectory over is accepted by the FSPA if it induces an accepting trajectory over . Every TLTL formula can be translated into an equivalent FSPA [li2019formal], in the sense that a trajectory over satisfies the TLTL specification if and only if the same trajectory is accepted by the FSPA. All trajectories over that violate the TLTL formula drive the FSPA to the trap states in .
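A minimal container for an FSPA might look as follows. Guards are represented here as robustness functions of the current state whose sign indicates whether the edge is enabled; this representation, and the assumption that at most one outgoing edge is enabled at a time, are illustrative simplifications of the formal definition in [li2019formal].

```python
from dataclasses import dataclass

@dataclass
class FSPA:
    """Minimal FSPA sketch. guards maps each edge (q, q') to a callable
    on the current state x; the edge is enabled when the returned
    robustness value is positive. Field names are assumptions."""
    states: set
    q0: str       # initial state
    final: set    # accepting states F
    trap: set     # trap states Tr
    guards: dict  # (q, q') -> callable x -> robustness value

    def step(self, q, x):
        # Follow the (assumed unique) enabled outgoing edge, else stay.
        for (src, dst), guard in self.guards.items():
            if src == q and guard(x) > 0:
                return dst
        return q

    def accepted(self, traj):
        # Run a finite trajectory through the automaton.
        q = self.q0
        for x in traj:
            q = self.step(q, x)
        return q in self.final
```

Running a trajectory through `step` and checking membership in `final` mirrors the acceptance condition described above.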
3 Problem Formulation
The geometry of the world is modeled by an environment graph , where is a set of vertices and is a set of edges. The motions of the agents are restricted by this graph. Assume we have a set of agents where is a label set, and a set of service requests . Let be a function indicating the locations at which a request occurs. For a given request , let be the set of vertices where occurs. The definition of implicitly assumes that one request can occur at multiple vertices, since . Also, multiple requests can occur at the same vertex, since and may have non-empty intersection for . We use a distribution to model the agents’ capabilities for completing different service requests: means that request can be serviced by agent . We consider two types of services: independent and shared. An independent request is any such that . This type of request can only be serviced by agent such that . A shared request is defined by , which means that servicing it requires the cooperation of all the agents that own .
We model the motion and actions of agent by using a deterministic finite transition system where is the set of vertices that can be reached by agent ; is the initial location; is the discrete action space; is a deterministic transition function; is a proposition set; is the satisfaction relation such that 1) for all , , and 2) , , if and only if . The meaning of taking action at state is as follows: if , then the agent tries to move to vertex from ; if , then agent tries to conduct service at ; indicates that the agent stays at the current vertex without conducting any requests. Formally,
For and , a transition is also denoted by . Now we define the Motion and Service (MS) plan for an agent, which is inspired by [chen2011formal].
Definition (Motion and Service Plan). A Motion and Service (MS) plan for an agent is a sequence of states represented by ordered pairs that satisfies the following properties:
For all , .
For all , given and , if then .
For all , given and , then such that .
We use to denote the MS plan for an agent from time to , i.e., . A Team Motion and Service (TMS) plan is then defined as .
We assume that there exists a global discrete clock that synchronizes the motions and the services of requests of all agents. We also assume that the times needed to service requests are all equal to 1. In other words, similar to POSGs, at each time step, an agent either chooses to move to a vertex according to or to stay where it is to conduct a particular request in . Before the beginning of the next time step, all motions and requests are completed and the agents are ready to execute the next state in their MS plans. An MS plan thus uniquely defines a sequence of actions for an agent. More specifically, the action derived from an MS plan at is determined by and . If and , then the agent moves to vertex from (for the case where , the agent does not conduct any request and waits at ). On the other hand, means that the agent should conduct request and it must have reached vertex at the previous time step, according to property from the above definition.
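The rule for reading actions off an MS plan can be sketched directly. The pair encoding `(vertex, request)` with `None` meaning "no service", and the action labels `"go"`, `"stay"`, `"serve"`, are illustrative assumptions; the logic follows the case analysis above, where the action at each step is determined by the next plan entry.

```python
def actions_from_ms_plan(plan):
    """Derive the action sequence from an MS plan, per the rule in the
    text: if the next entry requests a service, the agent serves (it must
    already be at the right vertex, by the plan's properties); otherwise
    it moves to the next vertex, or stays if the vertex is unchanged.
    Representations are illustrative, not the paper's notation."""
    actions = []
    for (v, r), (v_next, r_next) in zip(plan, plan[1:]):
        if r_next is not None:
            actions.append(("serve", r_next))  # requires v_next == v
        elif v_next == v:
            actions.append(("stay",))
        else:
            actions.append(("go", v_next))
    return actions
```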
In this paper, we use the term global behavior to refer to the sequence of requests serviced by the whole team. An independent request is considered to be finished at time if and only if , . For a shared request , we assume that all the agents that own this request are capable of communicating with each other at the same vertex where occurs. A shared service request is completed at time if and only if for all and for all . Thus, given the individual MS plans of all the agents, a single global behavior is uniquely determined. This follows directly from the assumption of a constant completion time for all requests. Next we define a term called Team Trajectory to describe the results of executing a global TMS plan.
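The completion rules for independent and shared requests can be checked with one function, since an independent request is just a shared request with a single owner. All container shapes and names below are assumptions made for illustration; the test applied is the one stated in the text: every owner of the request is at a common vertex where the request occurs and chooses the serve action.

```python
def completed_requests(positions, actions, owners, locations):
    """Return the set of requests that finish at this step.
    positions: agent -> current vertex
    actions:   agent -> chosen action, ("serve", r) when servicing r
    owners:    request -> set of owning agents (the distribution)
    locations: request -> set of vertices where the request occurs
    A request completes only if all its owners are co-located at a vertex
    of the request and all choose to serve it (illustrative sketch)."""
    done = set()
    for r, agents in owners.items():
        verts = {positions[a] for a in agents}
        if (len(verts) == 1 and verts <= locations[r]
                and all(actions[a] == ("serve", r) for a in agents)):
            done.add(r)
    return done
```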
Definition (Team Trajectory). Given a team of agents , a set of service requests , a distribution , a function that shows the locations of requests and a TMS plan for the whole team of agents, the Team Trajectory from executing the TMS plan is a sequence of states , each of which is a -tuple, with the first elements equal to the vertices in the corresponding individual MS plan. The last element of the team trajectory at each time is a set , such that .
Now the problem considered in this paper can be formulated as follows.
Given an environment graph , a team of agents as defined above, , a set of service requests , a distribution , and a function that shows the locations of requests, find a set of MS plans, one for each agent, such that the team trajectory obtained by executing the individual MS plans satisfies a given global task specification encoded by a TLTL formula over predicate functions of the states of the team trajectory .
A natural, classical motion planning approach to Problem 3.1 would be to design in advance MS plans for each agent. However, this requires the transition function for each agent to be known, which is not always true in realistic applications. Moreover, a priori top-down design that guarantees satisfactory performance can become extremely difficult in complex and time-varying environments [bucsoniu2010multi]. Therefore, we tackle this problem using MARL, in which agents learn how to act by constantly interacting with the environment and other agents, and adjusting their behaviors according to feedback received from the environment. The transition functions are assumed to be unknown.
Consider a team of two heterogeneous agents that have to service three requests: and , with the additional requirement that must not be finished until or has been finished. The environment, shown in Fig. 2(a), is a grid world, which can be abstracted by a graph . Each cell represents a vertex (labeled with values at the bottom right) and each facet forms a reflective, two-way edge in . There are three service requests: . There is a team of two robots modeled by transition systems and . Assume both agents are able to move through any edges in . Thus the action spaces are and . The capabilities of conducting services are captured by the distribution , and , which means and are independent requests and is a shared request. For instance, to finish request , robot must move to vertex and then take action . This behavior, which can be interpreted as a sub-task toward the success of the global mission, is formulated by a TLTL formula where and are predicate functions defined over states of the team trajectory as
where is the location of robot at time (included in the team trajectory state ), is the distance from to the location at which occurs, and and are positive constants. Then the global specification is given by the TLTL formula . The FSPA corresponding to can be found in Fig. 2. The predicate functions , and are defined similarly to eqn:predicates_independSigma. The predicates are designed in a slightly different form to encode the cooperative behaviors:
where is some constant. Note that although formula is constructed hierarchically, the resultant FSPA is actually nonhierarchical.
4 Problem Solution Using MARL
In summary, our method consists of three steps. First, in Sec. 4.1, we formulate a POSG whose solution can be directly translated into a set of MS plans that solves Problem 3.1. Second, in Sec. 4.2, we introduce a reward shaping technique that employs robustness and a heuristic energy-like function to produce additional reward signals. Finally, in Sec. 4.3, we show how to solve the POSG with the reshaped rewards using algorithms available in the realm of MARL.
4.1 Formulation of Equivalent POSG
It is straightforward to define a POSG whose state is identical to the team trajectory state , i.e., . However, by doing so, the satisfaction of the TLTL specification can only be determined from the entire team trajectory. Hence, defining the reward function as the TLTL robustness violates the Markovian property of the POSG, which requires the reward to depend only on the current state of the system. Therefore, an extra element must be added to the POSG to trace the whole team's progress toward satisfying the global task. The state of the FSPA corresponding to the TLTL formula suits this job perfectly, and the state space of the POSG will be a product of the team trajectory states and the FSPA states. We use and to denote the state and action of agent at time . Definition (FSPA augmented POSG). Given Problem 3.1 and an FSPA corresponding to the TLTL specification, an FSPA augmented POSG (FSPA-POSG) is defined as a tuple , in which is the index set for agents, is the discrete state space, and is the initial state. The action space of agent is identical to the action space of transition system . The observation space is the state space of . The observation function maps the state of the FSPA-POSG to the state of agent , which means that the policy of agent only depends on its own state. Let and be the state of the FSPA-POSG and the joint action of all agents at time , respectively. Then, the transition probability from to under action is defined as
where is the edge guard of the transition from to and is a state of the Team Trajectory included in . Let the reward function be defined as
where is a constant.
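The product-state update and the sparse original reward can be sketched together. The guard representation (robustness functions keyed by edge), the terminal bonus of 1, and the trap penalty C are illustrative assumptions; the paper's exact reward uses TLTL robustness and the constant C, both elided from the extracted text.

```python
def step_fspa_posg(q, x_next, guards, final, trap, C=10.0):
    """One step of the FSPA-POSG product state (sketch). The team-state
    component x_next is produced by the agents' dynamics; the automaton
    component advances along the (assumed unique) enabled edge guard.
    Reward: -C for falling into a trap state, a terminal bonus for
    reaching a final state, 0 otherwise -- an illustrative stand-in for
    the sparse, end-of-episode robustness reward in the paper."""
    q_next = q
    for (src, dst), g in guards.items():
        if src == q and g(x_next) > 0:
            q_next = dst
            break
    if q_next in trap:
        r = -C
    elif q_next in final:
        r = 1.0  # illustrative terminal bonus
    else:
        r = 0.0
    return q_next, r
```

Because the automaton state is folded into the POSG state, this reward is Markovian even though satisfaction of the TLTL formula depends on the whole trajectory.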
In this paper, we assume that the FSPA-POSG has a finite horizon . The rationale for this assumption is that any accepting or rejecting team trajectory must be finite (the corresponding FSPA reaches a state either in or in ). However, must be large enough to give the agents adequate time to complete the task, and it thus becomes a design factor of our approach. Since the team trajectory state is included in the FSPA-POSG state, constructing a team trajectory from is very simple, and constructing a TMS plan from it is then also straightforward.
4.2 Temporal Logic Reward Shaping
As discussed in Sec. 1, we assume that the state of the FSPA-POSG is fully observable during training, while the observation function during deployment is as defined in Def. 4.1. The reward function defined above only provides feedback to the agents at the very end of each episode, which can cause serious problems: the probability of success under randomly selected actions converges to zero as the mission becomes more and more complicated, and missions described by temporal logics are expected to be complicated. Therefore, we introduce reward shaping techniques that add intermediate rewards to improve the speed and success rate of learning.
We developed two additional rewards for each agent to guide learning: and . is defined based on TLTL’s quantitative semantics as follows:
where is the disjunction of all predicates on the outgoing edges from to other non-trap states, is a predicted FSPA-POSG state in which all agents except agent take their actions and transition to their next states, and is the corresponding predicted automaton state. The robustness value can be interpreted as a metric that measures how far the system is from leaving the FSPA state given . The individual reward is designed to be the difference between the robustness value under the whole team's actions and the robustness value with agent idle. This design is reasonable for two reasons. First, is generated using only past observations, so it does not require the agent model . Second, since we know , the values of the predicate functions can be predicted by removing all the services that require agent 's action from the set of services finished at , which makes available. can distinguish each agent's contribution to the global behavior, and thus provides more informative rewards to the agents.
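The counterfactual structure of this first shaped reward can be sketched as follows. Representing guards as robustness functions and taking the disjunction as a max follow the quantitative semantics; the function names and the way the "agent i idle" state is supplied are assumptions for illustration.

```python
def r_phi(q, x_next, x_next_without_i, out_guards):
    """Sketch of the first shaped reward: robustness of the disjunction
    of guards on edges leaving q toward non-trap states, evaluated at the
    actual next team state, minus the same quantity at a predicted next
    state in which agent i stays idle. Positive values credit agent i's
    contribution toward leaving the current FSPA state. Guard
    representation is an assumption."""
    def disj_rob(x):
        # Disjunction under quantitative semantics: max over guards.
        return max(g(x) for g in out_guards[q])
    return disj_rob(x_next) - disj_rob(x_next_without_i)
```

Because the difference is taken per agent, a cooperative step that only helps when agent i participates yields a strictly positive reward for that agent.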
By incorporating , the agents are encouraged to leave the current FSPA state. However, the limitation of is that it cannot distinguish whether a transition is beneficial or not. For example, in Fig. 2, the bad transitions that we want to avoid are colored orange. In state , the two agents have already reached the location of , and the next step is to finish cooperatively. However, if they leave the vertex where occurs, will generate positive rewards because the FSPA has left , which clearly violates our intentions. Learning can still converge to the correct policies, but only through the effect of the discount factor , in the sense that the transition from to delays the completion of the mission and thus yields a lower final reward.
To overcome this issue, we design another reshaped reward signal , which can be seen as a "potential energy" of the corresponding FSPA state. Before defining , we first introduce the concept of path length. Let denote the set of all finite trajectories from to . The path length L is defined as
where is a finite sequence of FSPA states connected by transitions in and is a weight function. Then a distance function from to can be defined as
Now an energy function , can be defined as
In words, the energy function of a state is the minimum weighted sum of transitions it takes to reach the set of final states. Finally, is constructed as follows:
ensures that the agents receive an assessment of every FSPA transition immediately. There is no standard procedure for determining the weight function, and hence it becomes a design factor.
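The energy function can be computed once, offline, by a shortest-path pass over the automaton graph, and the second shaped reward then scores each FSPA transition by the resulting drop in energy. The sketch below uses Dijkstra on the reversed edge set with unit weights by default; the weight function, like in the paper, is a design choice, and all names are illustrative.

```python
import heapq

def energy(states, edges, final, weight=None):
    """Energy of each FSPA state: minimum weighted number of transitions
    to reach a final state, via Dijkstra on the reversed automaton graph.
    Unit edge weights are assumed unless a weight function is given."""
    w = weight or (lambda e: 1.0)
    rev = {q: [] for q in states}
    for (src, dst) in edges:
        rev[dst].append((src, w((src, dst))))
    dist = {q: float("inf") for q in states}
    heap = []
    for q in final:
        dist[q] = 0.0
        heapq.heappush(heap, (0.0, q))
    while heap:
        d, q = heapq.heappop(heap)
        if d > dist[q]:
            continue  # stale heap entry
        for p, c in rev[q]:
            if d + c < dist[p]:
                dist[p] = d + c
                heapq.heappush(heap, (d + c, p))
    return dist

def r_energy(q, q_next, dist):
    # Shaped reward: positive exactly when the transition decreases the
    # energy, i.e., moves the FSPA closer to the final states.
    return dist[q] - dist[q_next]
```

Unlike the first shaped reward, this term penalizes transitions that move the automaton away from the final states, such as the orange transitions discussed above.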
4.3 Solving the Equivalent FSPA-POSG
We use the policy gradient method with policy graphs, as in [meuleau1999learning] and [peshkin2001learning], to find the optimal policy for each agent in the FSPA-POSG. In the training phase, both the original reward and the two reshaped rewards require the FSPA state, which is determined by all the agents; hence, a centralized coordinator is necessary. We assume that communication among all agents is available during the training phase, which is reasonable as stated in Sec. 1. After training, each agent obtains a policy that tracks its own state history to decide the action to take at each time step. The policy of each agent is independent of the other agents' states, i.e., the policies are distributed.
To verify the efficacy of the method described in Sec. 4, we conducted simulations of training on Example 3 with the reshaped rewards , , and (the original reward only), respectively. In each simulation, we trained the policies for episodes. We set the discount factor to , the learning step size to , and the constant C in Eqn. eqn:original reward to . The number of internal states of the policy graphs was chosen to be 10 for both agents. We also applied decentralized Q-learning [matignon2012independent] with the reshaped rewards to learn memoryless policies.
Convergence of the agents' policies to the optimal ones was not guaranteed. The rates at which the optimal policies were learned within episodes are shown in Table 1. We can see that using boosted the convergence rate by compared to the original reward. Note that memoryless Q-learning completely failed the task. The averaged learning curves for runs in which the optimal policies converged are shown in Fig. 4. We can see that the proposed reward shaping technique accelerates learning and significantly reduces the variance.
The behaviors of the agents governed by the learned policies are visualized in Fig. 4. In Example 3, there are two possible routes to finishing the TLTL task: finish then ; or finish then . It can be seen that the agents are able to find the optimal policy with the minimum route length.
We also inspected the final policy graphs and their corresponding trajectories in simulations where agents with the reshaped rewards failed to satisfy the TLTL specification. It turns out that in the majority of the failures, the agents would first finish either or and then never reach , wandering among several vertices near or . One possible cause is that the gradient ascent took steps that were too large, which cannot be fully handled by simply decreasing the step size [schulman2017proximal]. More advanced policy gradient methods such as PPO [schulman2017proximal] or TRPO [schulman2015trust] might solve this issue. We will explore this direction in future research.
Table 1: Optimal policy convergence rate
In this paper, we applied MARL to synthesize distributed controls for a heterogeneous team of agents from Truncated Linear Temporal Logic (TLTL) specifications. We assumed a partially observable environment, where each agent's policy depends on the history of its own states. We used the TLTL robustness as the reward of the MARL problem and introduced two additional reshaped rewards. Simulation results demonstrated the efficacy of our framework and showed that reward shaping significantly improves the convergence rate to the optimal policy and the learning speed. Future work includes consideration of non-deterministic SGs and other policy gradient methods.