I Introduction
Urbanization, the increasing number of vehicles, and the lack of transport infrastructure have increased travel time, fuel consumption, and air pollution. Urban life has therefore come to mean wasted time, less clean air, and noise pollution. Conventional fixed-time traffic management systems cannot cope with the complexity and dynamics of large traffic networks, whereas artificial intelligence (AI) is widely employed to develop intelligent traffic systems (ITS)
Kponyo et al (2016); Balaji et al (2010); Bazzan and Klügl (2014); Rida (2014). The multi-agent system is a common framework for modeling ITS Roess et al (2004); Vilarinho et al (2017). This framework consists of a population of intelligent, autonomous agents that work together in a shared environment Schaefer et al (2016). Traffic lights Liu (2007), vehicles Adler et al (2005), and pedestrians Teknomo (2006) have all been modeled as agents in urban traffic networks. Each agent needs to learn from past experience, which is key to approximating a better decision-making policy. Both model-based Wiering (2000) and model-free Chin et al (2011) multi-agent reinforcement learning (RL) techniques are widely used in ITS research Prashanth and Bhatnagar (2011); Balaji et al (2010). In many studies, each agent considers only its own traffic state when determining its control policy. For example, a single intersection with two phases is investigated in Abdulhai et al (2003). The length of the vehicle queue waiting at the light, which the agent can measure, is taken as the state. The agent decides whether to extend the green time or switch to the next phase so that the number of waiting vehicles is minimized. The results show the superiority of the Q-learning agent under both uniform and constant-ratio traffic flows. In Wiering (2000)
, traffic lights are considered as agents that communicate with vehicles. The vehicles estimate their mean waiting time and transmit it to the traffic light, where a popular RL algorithm, Q-learning, is used to schedule the traffic signal. This study reports a 22% reduction in waiting time compared to fixed-time lights. Multi-objective reinforcement learning is utilized to control several traffic lights in
Houli et al (2010). The optimization goals include the number of stops per vehicle, the mean stopping time, and the queue length at the next intersection. The results indicate that multi-objective RL can effectively prevent queue spillovers under congested conditions and thus avoid large-scale traffic jams. Bull et al. used learning classifier systems to control the traffic lights of 4 intersections
Bull et al (2004). In that work, the traffic lights have two phases at each intersection, one for north-south movement and one for east-west. The controller at each intersection obtains the optimal phase time by extracting if-then rules. The results show that a traffic light driven by a learning classifier system significantly outperforms a fixed-time traffic light. In Steingrover et al (2005), the learning task is modeled such that the state representation is based on the sum of the cars' waiting times. Obviously, the more car information is received, the more complicated the model becomes and the larger the state space grows; this is one of the significant problems in large networks. The adaptive control introduced in Prashanth and Bhatnagar (2011) uses function approximation to map states to schedules. A fuzzy inference engine is exploited in Pacheco and Rossetti (2010) to decrease systematic faults of the Q-learning algorithm. The results demonstrate not only that learning in the fuzzy framework is faster than plain Q-learning but also that the delay at intersections decreases considerably. A multi-agent fuzzy approach is proposed in Iyer et al (2016), where Q-learning updates the rule base of a fuzzy inference engine. In Da Silva et al (2006), a method that can estimate an incomplete model of a non-stationary environment is described and applied to a network of 9 intersections. The reported results show that this method outperforms both model-free and model-based methods, but it cannot be generalized to larger networks. In other research, agents consider other agents when determining their own control policies.
For instance, coordination among agents is pursued in Medina et al (2010), where each agent considers not only the number of vehicles waiting at its own intersection but also the number of vehicles stopped at adjacent intersections. RL is applied to 5 intersections in three different scenarios, and the overall results show an improvement in delay time. In Wiering (2000), RL is used to control traffic in a grid, where a form of cooperative learning simultaneously controls the traffic signals and determines the optimal routes. One of the main drawbacks of this method is the high cost of communication and information exchange, which grows as the number of intersections increases. Cooperative RL extracts knowledge from neighboring agents while learning a schedule Salkham et al (2008); this method was implemented in an area of Dublin comprising 64 intersections.
This paper introduces a hybrid fuzzy Q-learning and game-theory method for controlling traffic lights in a multi-agent framework. It exploits the benefits of fuzzification as well as interaction with other agents. The traffic network is modeled with an autonomous agent at each intersection that decides on the duration of the green phase. The number of vehicles on the different approaches to the intersection is measured by the corresponding agent, and each agent interacts with its neighbor agents by receiving a reward for each decision. We propose that each agent fuzzify its inputs and use a fuzzy inference system to estimate the traffic state. The agent uses a Q-learning approach, modified by game theory, to learn from past experience and to account for interaction with neighbor agents: it receives a reward proportional to its own traffic state plus a reward from the neighbor agents for each decision, and uses both to update its Q-values. In the proposed method, the neighbor reward and its weight in the Q-value update are computed by fuzzy inference. The method is applied to a five-intersection traffic network, and the simulation results indicate that it outperforms fixed-time, fuzzy, Q-learning, and fuzzy Q-learning control in terms of average delay time.
This paper unfolds as follows. After this introduction, Q-learning and its fuzzy version are described in the next section. Section 3 is devoted to the application of game theory in ITS. Sections 4 and 5 present the problem statement and the proposed solution, respectively. Simulation results are given in Section 6. Finally, the paper is concluded in Section 7.
II Q-learning and fuzzy Q-learning
The objective of agents acting in dynamic environments is to make optimal decisions. If an agent is not aware of the rewards corresponding to its various actions, selecting a proper action is challenging. To achieve this goal, learning adjusts the agent's action selection based on collected data. In reinforcement learning (RL), each agent tries to optimize its actions in a dynamic environment via trial and error. RL is essentially the problem of mapping situations onto actions so as to receive the best result, i.e., the highest reward. In many cases, an action influences not only the reward of its own step but also the rewards of subsequent steps. There are model-based Wiering (2000) as well as model-free Chin et al (2011) RL techniques. In model-free RL, the agent does not need an explicit model of the environment, because its actions can be selected directly based on rewards. Q-learning is a model-free approach in which the agent has no access to the transition model Watkins and Dayan (1992); Abdoos et al (2011). Suppose that the agent is in state s, performs action a, receives reward r from the environment, and the environment changes to state s'. This experience is given by the tuple (s, a, r, s'). The state-action value, which represents the expected total reward resulting from taking action a in state s, is denoted by the Q-value Q(s, a). The agent starts with random values and, after each action, receives a tuple (s, a, r, s'). For each tuple, the state-action value is updated according to the following equation:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]   (1)
where α is the learning rate of the agent: α = 1 means that only new information is considered, and α = 0 means that the agent does not learn at all. γ is the discount factor, which determines the weight of future rewards. A value of zero makes the agent opportunistic, meaning that it considers only the current reward; on the other hand, γ close to 1 means that the agent will wait longer to achieve a large reward. Q-learning converges to the optimal value with probability one if all state-action pairs are visited repeatedly and the learning rate decreases over time
Pacheco and Rossetti (2010). Generally, RL is practical for problems with small, discrete state and action spaces. When the dimension of the state and action space grows, the lookup table becomes so large that the algorithm slows down considerably; and when states or actions are continuous, a lookup table is not applicable at all. Fuzzy theory is employed to tackle this problem. If the intelligent agent has a proper fuzzy rule set as expert knowledge about the domain, the ambiguity can be resolved: the agent can handle vague objectives and an unknown environment. In practice, action selection in large spaces is facilitated by eliminating the Q-value table; everything is based on quality values and fuzzy inference. The fuzzy inference system (FIS) processes the inputs, and the Q-learning algorithm uses the consequent part and the active rules as states. The reward signal of the Q-algorithm is built from fuzzy logic, the environment's reward signal, and an estimate of the current action's performance; the action that maximizes the reward signal is selected Glowaty (2005); Bonarini et al (2009). The learning system can select one action among several for each rule: the jth possible action in the ith rule is denoted by a_ij and its value by q_ij. Consider rules of the following form Bonarini et al (2009):

R_i: IF x is S_i THEN a_i1 with q_i1 OR a_i2 with q_i2 OR … OR a_im with q_im   (2)

Learning should find the best consequent for each rule. If the agent selects actions that result in high values, it can learn the optimal policy. Thus, the fuzzy inference system yields the necessary action for each rule Bonarini et al (2009).
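As a concrete illustration of the tabular update in Eq. (1), the following minimal Python sketch (our own illustration, not code from any cited work) keeps Q-values in a dictionary keyed by (state, action) pairs:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.7):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]
```

With alpha = 1 the old value is overwritten entirely by new information, and with alpha = 0 nothing is learned, matching the discussion above.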
III Game theory in ITS
The relation between agent-oriented environments and game theory originates from the fact that each state of an agent-oriented environment can be viewed as a game. The payoff function of the players reflects the current state of the environment, and the players' goal is to move toward an equilibrium point, i.e., the best decision-making policy. Several scholars have studied the application of game theory to traffic light control Goyal and Kaushal (2017); Groot et al (2017), integrating game theory into the multi-agent interaction approach. Some of them cast the traffic problem as a rigorous mathematical game model Bell (2000); Chen and BenAkiva (1998); Alvarez et al (2008), while others modify the agents' learning method based on game theory Xinhai and Lunhui (2009). In Alvarez et al (2008)
, signalized intersections are modeled as finite controlled Markov chains, and each intersection is treated as a non-cooperative game in which each player tries to minimize its queue. Solutions are given as Nash and Stackelberg equilibria, and the simulation results indicate shorter queue lengths than adaptive control. In
Bell (2000), a two-player non-cooperative game is formulated between a user seeking a path that minimizes the expected trip cost and an adversary choosing link performance scenarios that maximize it; the Nash equilibrium point is shown to measure network performance. Intelligent traffic control is expressed both as a Cournot game, in which the traffic authority and the users choose their strategies simultaneously, and as a bi-level Stackelberg game, in which the traffic authority is the leader that determines the signal settings in anticipation of the users' reactions. In Xinhai and Lunhui (2009), game theory is used to coordinate agents performing traffic signal control with Q-learning. It specifies strategies ({red light time plus 4 s, red light time plus 8 s, red light time minus 4 s, red light time minus 8 s, unchanged}) and actions ({east-west straight and right turn, south-north straight and right turn, east-west left turn, south-north left turn}). An interaction model is then presented via game theory as a four-tuple (N, A, I, U): N is the group of decision-makers (players); A is the set of possible strategies and actions; I represents the information each agent possesses; and U is the payoff function, which adopts the Q-value. The Nash equilibrium then satisfies Xinhai and Lunhui (2009):

Q_i(a_i*, a_{-i}*) ≥ Q_i(a_i, a_{-i}*)  for all a_i   (3)
where a_i and a_{-i} denote the action of the ith agent and the actions of the other agents, respectively, and a_i* and a_{-i}* represent the actions at the Nash equilibrium. The renewed Q-values in distributed reinforcement Q-learning are used to build the payoff values. The Q-value function is updated as:
Q_i(s, a) ← (1 − α) Q_i(s, a) + α [ r_i + Σ_{j=1}^{N} w_ij r_j + γ max_{a'} Q_i(s', a') ]   (4)
where α and γ are the learning rate and discount factor, respectively; s and a are the current state of the traffic environment and the current action; s' is the next state; N is the number of traffic signal control agents surrounding the ith agent; Q_i(s, a) is the Q-value function for the ith agent when it selects action a in state s; r_i is the reward function of the ith agent; r_j is the reward function of the jth agent neighboring the ith agent; and w_ij is a weighting function expressing the effect of agent j on the ith agent. Mathematical forms for r_j and w_ij are suggested in Xinhai and Lunhui (2009). The assumption of a discrete action-state space and the ad hoc choice of reward and weighting functions are drawbacks of that work.
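The neighbor-augmented update described above can be sketched in a few lines of Python. This is an illustrative sketch assuming the weighted-sum structure of Eq. (4); the fuzzy reward and weighting functions of the cited work are replaced here by plain numeric arguments:

```python
def q_update_with_neighbors(Q, s, a, r_own, neighbor_rewards, weights,
                            s_next, actions, alpha=0.5, gamma=0.7):
    """Q-value update in the spirit of Eq. (4): the immediate reward combines
    the agent's own reward r_i with the weighted neighbor rewards w_ij * r_j."""
    r = r_own + sum(w * rj for w, rj in zip(weights, neighbor_rewards))
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)
    return Q[(s, a)]
```

With empty neighbor lists this reduces to the plain Q-learning update of Eq. (1), which makes the role of the interaction terms easy to isolate in experiments.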
IV Problem statement
Consider a traffic network in which the lights of each intersection are controlled by an autonomous agent without any centralized management. Sensors installed below the surface of the surrounding streets, or traffic cameras at each intersection, provide information about the traffic situation to the corresponding agent. An agent has to decide on the duration of the green light for the north-south (NS) and west-east (WE) paths, and it also interacts with neighbor agents. The agent is expected to schedule the traffic lights optimally, in the sense of average delay, based on the information received from its sensors and from its neighbor agents.
The agents may have little knowledge about the others' decisions, because the information is distributed. Even if an agent has prior information about the others' decisions, it does not remain valid, since the other agents are also learning: the environment is dynamic, and the behavior of other agents may change over time. The inability to predict other agents causes uncertainty in the problem-solving procedure. This paper seeks a decision-making algorithm for light-control agents that considers neighbor agents' information in addition to the agent's own.
V Proposed algorithm
We consider a constant total duration C for the green plus red phases, so if the agent sets the green phase duration to g, the red phase duration is C − g. A typical agent receives the number of vehicles on the NS and WE streets from its own sensors, together with the green phase durations of neighbor agents, in order to schedule its own green phase. This paper proposes an autonomous agent with the structure in Fig. 1 to control each intersection.
The numbers of vehicles on the WE and NS streets, measured by the sensors, are fuzzified. A fuzzy inference engine with rules of the form given in Section II then fires the corresponding output membership functions. Finally, defuzzification yields the duration g of the green phase on the NS path; hence the green phase duration on the other (WE) path is C − g. We propose that the Q-value function updated by Eq. (4) serve as the value of each action in those rules. This update equation takes the neighbor agents' decisions into account.
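A minimal sketch of this fuzzify-infer-defuzzify pipeline is given below. The membership-function breakpoints, the two rules, and the output durations are hypothetical placeholders chosen for illustration; the paper's actual rule base and membership functions are those shown in its figures:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a, c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def ns_green_duration(n_ns, n_we, cycle=60.0):
    """Two hypothetical Mamdani-style rules with weighted-average defuzzification
    over singleton outputs; returns the NS green phase duration in seconds."""
    # Rule 1: IF NS traffic is heavy AND WE traffic is light THEN long NS green.
    w1 = min(tri(n_ns, 20, 40, 60), tri(n_we, -1, 5, 25))
    # Rule 2: IF NS traffic is light AND WE traffic is heavy THEN short NS green.
    w2 = min(tri(n_ns, -1, 5, 25), tri(n_we, 20, 40, 60))
    if w1 + w2 == 0.0:
        return cycle / 2  # no rule fires: split the cycle evenly
    return (w1 * 45.0 + w2 * 15.0) / (w1 + w2)
```

The WE approach then simply receives the remainder of the cycle, cycle − ns_green_duration(...), as described above.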
The ith agent takes the decision of the neighboring jth agent into account through a reward r_j and a weighting function w_ij. The reward is calculated in a fuzzy manner from the average delay produced by the agent's decision and the current traffic situation: a fuzzy inference engine fuzzifies these two inputs and gives the reward after defuzzification; see Fig. 1. The weighting function w_ij expresses the effect of agent j on the decision of the ith agent. This weight is also calculated by a fuzzy inference engine, which takes the agent's own green phase duration, the neighbor agents' green phase durations, and the number of waiting vehicles as inputs. A suitable choice of reward and weighting functions plays a significant role in agent learning. The agent with the structure in Fig. 1 runs the following algorithm:

1. Initialize the Q-value function Q_i(s, a) of the ith traffic signal control agent.

2. Observe the current state s of the ith intersection via the WE and NS sensors.

3. Select a proper action estimate for the observed state using the fuzzy inference system.

4. Calculate the rewards of the ith agent and of each neighboring jth traffic signal control agent, together with the weighting function for the neighboring agents.

5. Observe the new state s'.

6. Update the Q-value according to Eq. (4).

7. Return to step 2 until the variation of the Q-value falls below a threshold.
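The steps above can be sketched as a single loop. Here `env` is a hypothetical interface standing in for the sensors, the fuzzy reward and weighting engines, and neighbor communication; all names are illustrative and not the paper's implementation:

```python
def run_agent(env, actions, alpha=0.5, gamma=0.7, eps=1e-3, max_iters=10_000):
    """Sketch of the proposed control loop: observe, act greedily on Q,
    combine own and weighted neighbor rewards, update Q in the style of
    Eq. (4), and stop once the Q-value change falls below eps."""
    Q = {(s, a): 0.0 for s in env.states for a in actions}   # step 1: initialize
    for _ in range(max_iters):
        s = env.observe()                                    # step 2: current state
        a = max(actions, key=lambda b: Q[(s, b)])            # step 3: choose green time
        r_own, neighbor_feedback = env.step(a)               # step 4: rewards
        r = r_own + sum(w * rj for w, rj in neighbor_feedback)
        s2 = env.observe()                                   # step 5: new state
        old = Q[(s, a)]
        Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * max(Q[(s2, b)] for b in actions))
        if abs(Q[(s, a)] - old) < eps:                       # step 7: convergence test
            break
    return Q
```

In a real deployment the greedy choice in step 3 would typically be softened with some exploration; the sketch keeps it greedy to stay close to the numbered steps.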
VI Simulation results
Consider a traffic network with a central intersection and four neighboring intersections. The delay at each intersection depends on its physical characteristics, the traffic light scheduling, and the number of cars on the input streets. We utilize the traffic model given by the American Highway Capacity Manual (HCM) (Akgungor and Bullen, 1999, Eq. 20):
d = 0.38 C (1 − λ)² / (1 − λX) + 173 X² [ (X − 1) + √((X − 1)² + 16X/c) ]   (5)
where d, C, λ, and X are the average delay (s), cycle time (s), green ratio, and degree of saturation, respectively. Here X = v/c and λ = g/C, where c, g, and v are the capacity (vehicles per hour), green time (s), and input volume, respectively. We use this model to calculate the average delay from the green phase duration and the number of vehicles; for more details of this equation we refer to Akgungor and Bullen (1999).
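A direct transcription of Eq. (5) as reconstructed here (in the standard HCM uniform-plus-incremental delay form; Akgungor and Bullen's exact Eq. 20 should be consulted for the authoritative coefficients) reads:

```python
import math

def average_delay(C, g, v, c):
    """Average delay per vehicle (s): uniform-delay term plus incremental-delay
    term, with green ratio lam = g/C and degree of saturation X = v/c."""
    lam = g / C
    X = v / c
    d1 = 0.38 * C * (1 - lam) ** 2 / (1 - lam * min(X, 1.0))
    d2 = 173 * X ** 2 * ((X - 1) + math.sqrt((X - 1) ** 2 + 16 * X / c))
    return d1 + d2
```

Shortening the green phase for a given volume raises the delay on that approach, which is exactly the signal the agents' rewards respond to.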
Assume fixed values for the cycle time C and the capacity c. The volume v of vehicles entering each street varies within a given interval, and g is the duration of the green phase, which each agent selects using fuzzy Q-learning and interaction with adjacent agents. The traffic network simulation algorithm is as follows:

1. The volumes of vehicles entering each intersection (v) are randomly generated from a discrete uniform distribution on the given interval.

2. The average delay is calculated by Eq. (5).

3. Each agent decides on the green phase duration g.

4. Go to step 1 until the end of the simulation time.
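This loop can be sketched as below. The volume bounds and cycle time are hypothetical placeholders for the elided values, and `delay_fn` / `choose_green` are plug-in stand-ins for Eq. (5) and the agents' decisions:

```python
import random

def simulate(n_steps, delay_fn, choose_green, v_low=200, v_high=1000, cycle=90):
    """Runs the four-step loop above and returns the average delay per step.
    delay_fn(cycle, g, v) stands in for Eq. (5); choose_green(v_ns, v_we)
    stands in for the agent's fuzzy Q-learning decision."""
    total = 0.0
    for _ in range(n_steps):
        v_ns = random.randint(v_low, v_high)   # step 1: random entering volumes
        v_we = random.randint(v_low, v_high)
        g = choose_green(v_ns, v_we)           # step 3: agent picks the green time
        # step 2: delay on both approaches (WE gets the remainder of the cycle)
        total += delay_fn(cycle, g, v_ns) + delay_fn(cycle, cycle - g, v_we)
    return total / n_steps
```

Swapping in different `choose_green` policies (fixed-time, fuzzy, Q-learning, the proposed method) under the same random volumes is one simple way to reproduce the comparison reported below.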
Assume the agent structure of Fig. 1, with a Mamdani FIS whose input membership functions are given in Fig. 2 for the number of input vehicles and in Fig. 3 for the average delay, used to calculate the reward functions. Centroid defuzzification with the output membership functions of Fig. 4 is used to produce a reward value within a bounded interval.
The weighting-function FIS takes the number of vehicles, the agent's own green phase duration, and the neighbor agents' green phase durations as inputs. Fig. 2 shows the membership functions for the number of vehicles, and Fig. 5 depicts the membership functions for the agent's own and the neighbors' green phase durations. Centroid defuzzification over the output membership functions of Fig. 6 produces the weights, which are bounded values.
Finally, the agent uses fuzzy Q-learning with the rules of Section II and the Q-value update rule of Eq. (4), where the learning rate and discount factor are selected as 0.5 and 0.7, respectively. The membership functions for each measured number of vehicles are shown in Fig. 7, and the output estimates the green phase duration with the membership functions of Fig. 8.
The proposed method is compared with fuzzy Q-learning (the rules of Section II with Q-values updated by Eq. (1)), Q-learning (with Q-values updated by Eq. (1)), fuzzy control (a traditional fuzzy inference method), and fixed-time control, in terms of total average delay. The average delay in each time interval is depicted in Fig. 9, and the total average delay is illustrated in Fig. 10. The results show a substantial decrease in total average delay from fixed-time scheduling to the proposed method, which achieves the lowest delay of all methods compared.
VII Conclusion
In this study, an intelligent control method for a traffic network was developed to decrease the average delay time. Each traffic light is considered a learning agent, and this paper proposed a structure for these agents. Each agent learns to decide on the duration of the green phase through a fuzzy Q-learning algorithm modified by game theory: it receives rewards from neighbor agents, and the neighbor rewards and their weighting functions enter the learning algorithm. These quantities are computed through a FIS. The number of vehicles on each street is also measured and fuzzified for use in the decision-making process. The simulation results were compared with the fixed-time method and other intelligent methods and revealed that the proposed method achieves a considerable reduction of the average delay at intersections.
References
 Abdoos et al (2011) Abdoos M, Mozayani N, Bazzan AL (2011) Traffic light control in non-stationary environments based on multi agent Q-learning. In: 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), IEEE, pp 1580–1585
 Abdulhai et al (2003) Abdulhai B, Pringle R, Karakoulas GJ (2003) Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering 129(3):278–285
 Adler et al (2005) Adler JL, Satapathy G, Manikonda V, Bowles B, Blue VJ (2005) A multiagent approach to cooperative traffic management and route guidance. Transportation Research Part B: Methodological 39(4):297–318
 Akgungor and Bullen (1999) Akgungor AP, Bullen AGR (1999) Analytical delay models for signalized intersections. In: 69th ITE Annual Meeting, Nevada, USA
 Alvarez et al (2008) Alvarez I, Poznyak A, Malo A (2008) Urban traffic control problem a game theory approach. In: 47th IEEE Conference on Decision and Control, IEEE, pp 2168–2172
 Balaji et al (2010) Balaji P, German X, Srinivasan D (2010) Urban traffic signal control using reinforcement learning agents. IET Intelligent Transport Systems 4(3):177–188

 Bazzan and Klügl (2014) Bazzan AL, Klügl F (2014) A review on agent-based technology for traffic and transportation. The Knowledge Engineering Review 29(03):375–403
 Bell (2000) Bell MG (2000) A game theory approach to measuring the performance reliability of transport networks. Transportation Research Part B: Methodological 34(6):533–545
 Bonarini et al (2009) Bonarini A, Lazaric A, Montrone F, Restelli M (2009) Reinforcement distribution in fuzzy Q-learning. Fuzzy Sets and Systems 160(10):1420–1443
 Bull et al (2004) Bull L, Sha’Aban J, Tomlinson A, Addison JD, Heydecker BG (2004) Towards distributed adaptive control for road traffic junction signals using learning classifier systems. In: Applications of Learning Classifier Systems, Springer, pp 276–299
 Chen and BenAkiva (1998) Chen O, BenAkiva M (1998) Gametheoretic formulations of interaction between dynamic traffic control and dynamic traffic assignment. Transportation Research Record: Journal of the Transportation Research Board (1617):179–188
 Chin et al (2011) Chin YK, Bolong N, Kiring A, Yang SS, Teo KTK (2011) Q-learning based traffic optimization in management of signal timing plan. International Journal of Simulation, Systems, Science and Technology 12(3):29–35
 Da Silva et al (2006) Da Silva BC, Basso EW, Perotto FS, C Bazzan AL, Engel PM (2006) Improving reinforcement learning with context detection. In: Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, ACM, pp 810–812
 Glowaty (2005) Glowaty G (2005) Enhancements of fuzzy qlearning algorithm. Computer Science 7:77–87
 Goyal and Kaushal (2017) Goyal T, Kaushal S (2017) An intelligent scheduling scheme for realtime traffic management using cooperative game theory and ahptopsis methods for next generation telecommunication networks. Expert Systems with Applications
 Groot et al (2017) Groot N, Zaccour G, De Schutter B (2017) Hierarchical game theory for systemoptimal control: Applications of reverse stackelberg games in regulating marketing channels and traffic routing. IEEE Control Systems 37(2):129–152
 Houli et al (2010) Houli D, Zhiheng L, Yi Z (2010) Multiobjective reinforcement learning for traffic signal control using vehicular ad hoc network. EURASIP journal on advances in signal processing 2010(1):724,035
 Iyer et al (2016) Iyer V, Jadhav R, Mavchi U, Abraham J (2016) Intelligent traffic signal synchronization using fuzzy logic and Q-learning. In: International Conference on Computing, Analytics and Security Trends (CAST), IEEE, pp 156–161
 Kponyo et al (2016) Kponyo J, Nwizege K, Opare K, Ahmed A, Hamdoun H, Akazua L, Alshehri S, Frank H (2016) A distributed intelligent traffic system using ant colony optimization: A netlogo modeling approach. In: Systems Informatics, Modelling and Simulation (SIMS), International Conference on, IEEE, pp 11–17
 Liu (2007) Liu Z (2007) A survey of intelligence methods in urban traffic signal control. IJCSNS International Journal of Computer Science and Network Security 7(7):105–112
 Medina et al (2010) Medina JC, Hajbabaie A, Benekohal RF (2010) Arterial traffic control using reinforcement learning agents and information from adjacent intersections in the state and reward structure. In: Intelligent Transportation Systems (ITSC), 2010 13th International IEEE Conference on, IEEE, pp 525–530
 Pacheco and Rossetti (2010) Pacheco JC, Rossetti RJ (2010) Agent-based traffic control: a fuzzy Q-learning approach. In: 13th International IEEE Conference on Intelligent Transportation Systems (ITSC), IEEE, pp 1172–1177
 Prashanth and Bhatnagar (2011) Prashanth L, Bhatnagar S (2011) Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems 12(2):412–421
 Rida (2014) Rida M (2014) Modeling and optimization of decisionmaking process during loading and unloading operations at container port. Arabian Journal for Science and Engineering 39(11):8395–8408
 Roess et al (2004) Roess RP, Prassas ES, McShane WR (2004) Traffic engineering. Prentice Hall
 Salkham et al (2008) Salkham A, Cunningham R, Garg A, Cahill V (2008) A collaborative reinforcement learning approach to urban traffic control optimization. In: Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, pp 560–566
 Schaefer et al (2016) Schaefer M, Vokřínek J, Pinotti D, Tango F (2016) Multiagent traffic simulation for development and validation of autonomic cartocar systems. In: Autonomic Road Transport Support Systems, Springer, pp 165–180
 Steingrover et al (2005) Steingrover M, Schouten R, Peelen S, Nijhuis E, Bakker B (2005) Reinforcement learning of traffic light controllers adapting to traffic congestion. In: BNAIC, Citeseer, pp 216–223
 Teknomo (2006) Teknomo K (2006) Application of microscopic pedestrian simulation model. Transportation Research Part F: Traffic Psychology and Behaviour 9(1):15–27
 Vilarinho et al (2017) Vilarinho C, Tavares JP, Rossetti RJ (2017) Intelligent traffic lights: Green time period negotiation. Transportation Research Procedia 22:325–334

 Watkins and Dayan (1992) Watkins CJ, Dayan P (1992) Q-learning. Machine Learning 8(3–4):279–292
 Wiering (2000) Wiering M (2000) Multiagent reinforcement learning for traffic light control. In: ICML, pp 1151–1158
 Xinhai and Lunhui (2009) Xinhai X, Lunhui X (2009) Traffic signal control agent interaction model based on game theory and reinforcement learning. In: International Forum on Computer ScienceTechnology and Applications, IEEE, vol 1, pp 164–168