The robot box pushing problem has been studied in both single and multi-agent cases . The problem entails one or more robotic agents working together to push a box from an arbitrarily starting location to the goal location, through an environment that is often filled with obstacles that the box and robotic agents cannot pass through and should instead avoid. Robotic agents cooperatively pushing an object through an obstacle filled environment has applications in warehouse automation and disaster relief. In the case of the box pushing problem, the benefit of multi-agent collaboration is two-fold: the box can sometimes be too heavy for a single robot to push by itself, and the overall number of actions performed by a robot team to move the box to the goal location can be greatly reduced through collaboration and joint-actions.
Previous works on the robot box pushing problem can effectively be split into two categories: machine learning solutions and non-machine learning solutions. Of the non-machine learning solutions, there have been several attempts to perform box pushing. Genetic algorithms (GA) have been applied to attempt a Pareto optimal solution that minimizes time taken and energy expended to push a box. GA-based methods did not attempt to include obstacles in the environment and are not necessarily suitable for obstacle ridden environments. The noisy non-dominated sorting bee colony algorithm was applied to a box pushing problem with few obstacles and agents, and focused more on dealing with a noisy environment than the overall box pushing problem . It can also be noted that the results of the Q-learning algorithm are affected by uncooperative actions in a way that is similar to noise. Research has also focused on path planning for box pushing given complete knowledge of the environment by modeling objects in the environment as convex polygons and planning the box trajectory . Having complete knowledge of the environment initially is not a realistic approach in most applications, since the position of each robotic agent can affect the environment, so assuming the environment is dynamic is preferred.
Machine learning solutions to the robot box pushing problem consist primarily of reinforcement learning approaches; specifically Q-learning approaches. Single-agent Q-learning has been extended to allow three agents to push a randomly placed box to the goal location, but no obstacles are presented 
. Q-learning has been applied to the box pushing problem using decision trees for adaptive state aggregation, but the results show that the algorithm performance is not a significant improvement over single-agent Q-learning, and the environment only contains a single obstacle, making the test case trivial. Bayesian-discrimination-function-based reinforcement learning (BRL) is applied to allow multiple agents to push a box across a test area, but the area contains no obstacles . In , single-agent and multi-agent Q-learning are applied to the box pushing problem in an obstacle ridden environment. The results show, counter-intuitively, that multi-agent Q-learning is not capable of reliably reaching the goal and that the agents cannot learn from their experience, while single-agent Q-learning is reliable and learns. Overall, machine learning solutions out-perform non-machine learning solutions, but results tend to show trivial cases or fail to address aspects of the box pushing problem.
In this paper, the implementation and results in  are extended by applying an additional Q-learning algorithm to the box pushing problem, restructuring the action set for the agents, and changing the reward representation. For comparison, all of the algorithms tested in  are also tested. The results show that the extensions to the original method allow the multi-agent team to learn and converge to a policy quicker than the previously implemented algorithms, while also showing an improvement over the single-agent case.
The remainder of the paper is structured as follows. In Section 2, a description of the Q-learning algorithms implemented for this problem is provided, including: state-space, actions, and reward function. In Section 3, the experimental environment is described and experimental results are provided, along with a discussion of the results. Finally, in Section 4, a conclusion is provided.
Four Q-learning algorithms are implemented in this paper: single-agent Q-learning, multi-agent Q-learning with separate Q-tables for each agent, multi-agent Q-learning with a shared Q-table, and cooperative Q-learning with more frequent updates. In this section, these algorithms will be described, along with a detailed description of the state space representation, the possible actions for each agent, and the method for assigning reward to the agents.
In this paper, the Q-learning algorithms 
are modeled as Markov decision processes (MDPs). An MDP is defined as a tuple;, where is a finite set of states, is a finite set of actions, is the reward function, and is the discount factor.
2.1 Reinforcement Learning Algorithms
The single agent Q-learning algorithm is given by
where is the current state, is the selected action, is the learning rate, is the next state based on the transition function and the selected action, and is the next action from state .
The multi-agent (independent) Q-learning algorithm where each agent has its own Q-table is given by
where is agent ’s Q-table, is the state of agent where , is the action taken by agent where , is the next state of agent where , and is the next action taken by agent where .
Similarly, the multi-agent (independent) Q-learning algorithm where all agents share a Q-table is given by
with the difference being that is now simply .
Finally, the cooperative Q-learning algorithm  is given by
where is the weight determining whether the agent Q-table or the neighboring agent Q-tables have more impact on the update. is the number of neighbors of agent in its detection range, is the state of neighbor where , and is the action of neighbor where .
2.2 State Space Representation
Similar to other research in this area, the state space for the box pushing environment is represented by 13 bits. Bits 0-4 are the goal orientation , which is converted to a bit string through the equation
where the function rounds toward negative infinity.
The remainder of the bits are coded based on the presence of obstacles within a respective agent’s sensing range. The circle representing the agent’s sensing range is split into eight sectors of equal size, with each sector corresponding to a bit in the state representation. If an obstacle is within the sensing range in that sector, then the corresponding bit is a 1, otherwise the bit is a 0. Fig. 1 shows a visual representation of the obstacle coding for the environment. Bits are ordered counter-clockwise starting from zero degrees, and most significant to least significant. In this particular example, the coded state would be .
2.3 Action Representation
A representation of the actions is shown in Fig. (a)a. As an extension to the previous works in this area, the actions were redefined and simplified as follows: The actions 1 through 4 are only related to the movement of the box whereas actions 5 and 6 will rotate the box. The box is considered heavy enough that the results of the rotation actions will not influence the movement of the box. In other words, the rotation actions will simply rotate the box around its center point without changing the position of the center point. Since the actions can be applied to the box regardless of its orientation, a method needs to be developed to assure the correct movement. To mathematically show the result of the actions 1 through 4, these equations can be considered where and are the new location of the box center after taking either action, is the slope of the line that represents the box angle, is the amount of movement as a result of the action, and and are the old location of the box center point (see Fig. (b)b).
2.4 Reward Function
Adopted from the previous works , the reward function for all the algorithms implemented in this work are defined by a series of conditions which calculate the total reward. The total reward consists of three parts:
Part 1) The first part of the reward is dedicated to finding the distance between the box and the final goal
where the and are measured from the goal to the center point of the box.
Part 2) The second part of the reward will emphasize the rotation of the box
where is the previous angle of the box and is the new angle of the box. The difference of indicates the number of degrees of the rotation of the box. The main benefit of this part is to discourage the rotations of more than degrees. The objective here to prevent the box from constantly rotating while still providing positive rewards for small rotations that are required to evade obstacles.
Part 3) The final part of the reward is concerned with avoiding the obstacles, , and is simplified as
This part of the reward was previously defined as a set of more complicated conditions which in our tests did not show a significant difference.
The total reward is defined as follows
where are weights that are assigned manually.
3 Experimental Results
3.1 Experimental Environment and the Setup
A simulation system was developed in MATLAB to implement the algorithms. A testing environment was defined as an area of . The location of the obstacles was limited to an area of on the x-axis and on the y-axis. The obstacles radius was defined as , and their locations were generated randomly by the simulation program. The locations of the obstacles were kept the same for all the algorithms to give a fair comparison of the performance of the methods. The position of the goal location was defined as with a radius of . The box dimensions was set as and in the beginning of each episode, the box was moved to the origin.
The simulation for each algorithm was run with maximum of iterations for episodes. Each episode ends when either the box reaches the goal or the maximum number of iterations is reached.
The following parameters were set for all the algorithms: , , , , , , , .
3.2 Single-agent Q-learning Results
As described in the methodology section of this paper, the single-agent Q-learning is implemented here, and the result of the final selected path can be seen in Fig. (a)a.
As it is shown in Fig. (a)a, the algorithm has managed to choose a path that is very close to a straight line. Although an exact straight line path is not possible in this situation because of the obstacles.
A better representation of the performance of the algorithm can be shown as the number of iterations that were taken in each episode before reaching the goal (see Fig. (b)b). This figure shows that in the beginning of the learning process, the algorithm has to try numerous iterations (in this situation, up to 700 at one point) to find a path to reach the goal. But as the learning method progresses, the algorithm finds the shortest path (in this case, after 20 episodes) to reach the goal.
A conclusion can be drawn based on these results that the single-agent Q-learning is effective and can find the best path in a very reasonable amount of time, but a comparison between this and the other algorithms is needed to find whether the single-agent Q-learning is the most effective algorithm.
3.3 Multi-agent Q-learning Results with Separate Q-tables
As discussed in the methodology section, this algorithm will use multiple robots (in this case, 3 robots) to perform the task. Each robot will take an optimum action based on the method defined in section 2.1 and will update its own Q-table based on the reward. Figure (a)a shows the final selected path using this approach.
Noting Fig. (a)a, it is clear that the the box did reach the goal on the final path but there are many redundant actions that have taken place. It can be argued that the best path was not selected here as the path can be considered as stepped shaped. A better representation of the performance of this algorithm is shown in Fig. (b)b. It can be seen that the algorithm does not converges to a line even after 80 episodes although the number of iterations is certainly decreasing. This shows that the 80-episode limit was not sufficient here. The number of iterations in each episode is also noticeably high. This is another sign that the algorithm is not the most efficient one.
3.4 Multi-agent Q-learning Results with a Shared Q-table
The main difference between this algorithm and the previous one is that in this case, only one Q-table is used for all the robots. Each robot takes an optimum action and updates this sole Q-table based on the reward. The Q-table then is shared between all other robots for future action selection. The final selected path for this algorithm is shown in Fig. (a)a.
Fig. (b)b examines the performance of this algorithm. This result shows a significant improvement in comparison to the multi-agent Q-learning with separate Q-tables. A convergence can be seen in the number of required iterations that happened after about 20 episodes. It can be said that using only one shared Q-table can result in significantly better performance than using separate Q-tables for each robot.
It should be noted that although the box does reach to the goal in every episode, this cannot be counted as the only requirement to count an algorithm as efficient. Since the environment dimensions are not extremely large, there is always a factor of chance of getting to the goal after several random movements. Therefore, the efficiency of the algorithm should be assessed on how fast this was achieved.
3.5 Cooperative Q-learning with Frequent Updates
Finally, the proposed algorithm is implemented. The cooperative algorithm is superior to the multi-agent Q-learning in that each robot uses the results from other robots to improve its action selection. The ”Frequent Updates” refers to the fact that the Q-table is also updated more frequently than either of the multi-agent Q-learning algorithms because each robot updates its Q-table after each action instead of one Q-table update for all the robots as happens in multi-agent Q-learning. The result of this algorithm can be seen in Fig. (a)a. The performance of this Cooperative Q-learning can be seen in Fig. (b)b.
The result shown in Fig. (b)b shows a significant improvement over both multi-agent Q-learning algorithms. The algorithm needs significantly less iterations in each episode to reach the goal and converges to the best path after only 20 episodes. The number of iterations needed is also less than half of the same in the multi-agent Q-learning algorithms.
Overall, it can be concluded that the Cooperative Q-learning with Frequent Updates presents significantly better results than the multi-agent Q-learning algorithms (either with separate Q-tables or with a shared Q-table).
3.6 Comparison Between the Algorithms
To better discuss the performance of these four algorithms, Fig. 7 shows a comparison between them.
It is evident from this figure that the Cooperative Q-learning with Frequent Updates is performing significantly better than the other algorithms. Not only the box reaches the goal in less iterations in each episode, but also the algorithm converges surprisingly fast.
It is noteworthy to mention that based on the comparison figure, the single-agent Q-learning performs better than the multi-agent Q-learning.
A comparison between the multi-agent algorithm and the single-agent algorithm was done too. Fig. (a)a shows all the paths that were examined in a multi-agent algorithm before selecting the final one.
As a comparison, Fig. (b)b shows the same result for the single-agent algorithm.
It can be seen that significantly less paths are examined in the single-agent algorithm. The values of the Q-tables are the determinant factor in either case and it is very hard to explain the reason of these behaviors mathematically.
This work presents our implementation and comparison of four reinforcement algorithms. These were: single-agent Q-learning, multi-agent Q-learning with separate Q-tables for each agent, multi-agent Q-learning with a shared Q-table, and cooperative Q-learning with more frequent updates.
A framework was implemented in MATLAB to easily compare the performance of each of these algorithms. The results was shown graphically as the final selected path for each algorithm and also the number of required iterations in each episode before reaching the goal. The comparison between the results showed that the cooperative Q-learning with more frequent updates performed significantly better than the other algorithms. In any case, the multi-agent Q-learning algorithms (either with separate Q-tables or a shared Q-table) did not perform as well as the single-agent Q-learning algorithm.
The cooperative Q-learning with more frequent updates was able to reach the goal in significantly less number of iterations and also converged to the best path astonishingly fast.
The overall conclusion can be that the multi-agent Q-learning algorithms should be replaced with the cooperative Q-learning whenever possible. Also, the number of times that the robots update their Q-tables should be more frequent as this resulted in a remarkably better and faster convergence to the best path.
Chakraborty, J., Konar, A., Nagar, A., Das, S.: Rotation and translation selective pareto optimal solution to the box-pushing problem by mobile robots using nsga-ii. In: 2009 IEEE Congress on Evolutionary Computation. pp. 2120–2126 (May 2009).https://doi.org/10.1109/CEC.2009.4983203
-  Hwang, K.S., Ling, J.L., Wang, W.H.: Adaptive reinforcement learning in box-pushing robots. In: 2014 IEEE International Conference on Automation Science and Engineering (CASE). pp. 1182–1187 (Aug 2014). https://doi.org/10.1109/CoASE.2014.6899476
-  La, H.M., Lim, R., Sheng, W.: Multirobot cooperative learning for predator avoidance. IEEE Transactions on Control Systems Technology 23(1), 52–63 (Jan 2015). https://doi.org/10.1109/TCST.2014.2312392
-  La, H.M., Lim, R.S., Sheng, W., Chen, J.: Cooperative flocking and learning in multi-robot systems for predator avoidance. In: 2013 IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems. pp. 337–342 (May 2013). https://doi.org/10.1109/CYBER.2013.6705469
-  Parra-Gonzalez, E.F., Ramirez-Torres, J.G., Toscano-Pulido, G.: A new object path planner for the box pushing problem. In: 2009 Electronics, Robotics and Automotive Mechanics Conference (CERMA). pp. 119–124 (Sept 2009). https://doi.org/10.1109/CERMA.2009.54
-  Rakshit, P., Konar, A., Nagar, A.K.: Multi-robot box-pushing in presence of measurement noise. In: 2016 IEEE Congress on Evolutionary Computation (CEC). pp. 4926–4933 (July 2016). https://doi.org/10.1109/CEC.2016.7744422
-  Wang, Y., Silva, C.W.D.: Multi-robot box-pushing: Single-agent q-learning vs. team q-learning. In: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 3694–3699 (Oct 2006). https://doi.org/10.1109/IROS.2006.281729
-  Wang, Y., de Silva, C.W.: An object transportation system with multiple robots and machine learning. In: Proceedings of the 2005, American Control Conference, 2005. pp. 1371–1376 vol. 2 (June 2005). https://doi.org/10.1109/ACC.2005.1470156
-  Yasuda, T., Ohkura, K., Yamada, K.: Multi-robot cooperation based on continuous reinforcement learning with two state space representations. In: 2013 IEEE International Conference on Systems, Man, and Cybernetics. pp. 4470–4475 (Oct 2013). https://doi.org/10.1109/SMC.2013.760