1 Introduction
A learning automaton (LA) is a reinforcement learning approach: a decision maker that can choose the optimal action and update its strategy through interaction with a random environment narendra2012learning . As one of the most powerful tools in adaptive learning systems, the LA has found a myriad of applications oommen1996graph nicopolitidis2002using esnaashari2010data wang2014learning zhao2015cellular jiang2014new .
As illustrated in Fig. 1, learning proceeds as a loop involving two entities: the random environment (RE) and the LA. In this process, the LA continuously interacts with the RE to obtain feedback on its various actions. According to the responses of the environment to those actions, the LA updates its action probability vector by a certain method. Through sufficient iterations of this interaction, the LA attempts to learn the optimal action.
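For concreteness, the learning loop described above can be sketched as follows. The linear reward-inaction update used here is a deliberately simple placeholder (the estimator-based schemes discussed later are more sophisticated), and all parameter values are illustrative assumptions.

```python
import random

def lri_update(p, i, rewarded, lr=0.05):
    """Toy linear reward-inaction (L_RI) update: on reward, move probability
    mass toward the chosen action; on penalty, do nothing."""
    if rewarded:
        for j in range(len(p)):
            p[j] = p[j] + lr * (1 - p[j]) if j == i else p[j] * (1 - lr)

def learn(reward_probs, steps=2000, seed=0):
    """Generic LA learning loop: choose an action from the probability vector,
    observe the random environment's binary feedback, update the vector."""
    rng = random.Random(seed)
    p = [1.0 / len(reward_probs)] * len(reward_probs)
    for _ in range(steps):
        i = rng.choices(range(len(p)), weights=p)[0]  # LA selects an action
        rewarded = rng.random() < reward_probs[i]     # RE's stochastic feedback
        lri_update(p, i, rewarded)                    # LA updates its strategy
    return p

p = learn([0.8, 0.2])
```

Note that the update preserves the total probability mass, so `p` remains a valid distribution throughout the loop.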
The first studies of LA models date back to tsetlin1973automaton , which investigated deterministic LA in detail; varshavskii1963behavior introduced the stochastic variable-structure versions of LA. Since then, LA has been extensively researched, and various kinds of algorithms based on deterministic and stochastic LA have been developed. A comprehensive overview of this research has been given by thathachar2002varieties .
In general, the rate of convergence is one of the vital considerations for learning algorithms. Thathachar and Sastry therefore designed a new class of learning automata, called estimator algorithms thathachar1985new thathachar1986estimator , which converge faster than all previous schemes. These algorithms not only maintain and update the action probability vector, as before, but also keep estimating the reward probability of each action in a reward-estimator vector, which is then used to update the action probability vector. Under this strategy, even when an action is rewarded, the probability of choosing another action may increase papadimitriou2004new . Compared with traditional learning algorithms, estimator algorithms have been demonstrated to be more efficient. However, the performance of the early estimator algorithms depends strictly on the reliability of the estimator's contents, and an unreliable estimator may significantly reduce both the accuracy and the speed of convergence papadimitriou2004new . To address this, Papadimitriou et al. papadimitriou2004new designed a stochastic estimator reward-inaction learning automaton (SE_RI) based on the use of a stochastic estimator. Owing to its much faster convergence and much higher accuracy in choosing the correct action than other estimator algorithms, SE_RI is widely accepted as the most classic LA model to date.
Owing to the superiority of estimator algorithms, many novel estimators have been proposed in recent years ge2015novel jiang2011new jiang2016new . In 2015, Ge et al. ge2015novel proposed a deterministic-estimator-based LA, the Discretized Generalized Confidence Pursuit Algorithm (DGCPA), in which the estimate of each action is the upper bound of a confidence interval, and extended the algorithm to stochastic estimator schemes. The improved stochastic-estimator-based LA is currently the fastest LA model. Although the family of estimator learning automata has achieved great improvements in the field of LA, some drawbacks remain. Because of a fundamental defect, the value of an estimator cannot always be strictly correct. Especially in the initial stage of the learning process, the estimator may perform poorly at estimating the reward probability of each action. In this situation, a large amount of reward is added to the probabilities of non-optimal actions, and a large number of extra iterations are then needed to compensate for these wrong rewards.
In this paper, to overcome these drawbacks of estimator algorithms, we introduce a novel method that updates the action probability vector with a double competitive strategy. The proposed Double Competitive Algorithm (DCA) learning automaton uses a stochastic estimator similar to that of SE_RI. The first competitive strategy of DCA is that only the action with the highest current stochastic estimate of reward probability gets the opportunity to increase its probability. The second strategy is that whenever the 'optimal' action, the one with the highest current stochastic estimate of reward probability, changes, the probability of the new 'optimal' action receives a large increase while the probability of the previous 'optimal' action decreases sharply. Accordingly, wrong rewards can be corrected instantly, and the learning automaton converges rapidly and accurately.
The key contributions of this paper are summarized as follows.
 We propose a new algorithm, referred to as the Double Competitive Algorithm (DCA), and prove that the proposed scheme is ε-optimal in all stationary random environments.
 The proposed DCA is compared with the most classic LA, SE_RI, and with the current fastest LA in various stationary P-model random environments. The results indicate that the proposed DCA is more efficient.
The paper is organized as follows. In Section 2, we introduce the general idea of LA and of estimator algorithms. The DCA scheme is presented in Section 3. In Section 4, we prove that the proposed scheme is ε-optimal. Extensive simulation results demonstrating the superiority of the proposed model over the most classic LA and the current fastest LA are presented in Section 5. We conclude the paper in the last section.
2 Learning Automata and Estimator Algorithms
2.1 LA and stochastic environment
A LA is defined by a quintuple ⟨A, B, Q, F, G⟩, where:
A = {α_1, α_2, ..., α_r} is the set of outputs or actions, and α(t) is the action chosen by the automaton at any time instant t.
B is the set of inputs to the automaton, and β(t) is the input at any time instant t. The set could be finite or infinite. In this paper, we consider the case B = {0, 1}, where β(t) = 1 represents the event that the LA has been penalized, and β(t) = 0 represents the event that the LA has been rewarded.
Q = {q_1, q_2, ..., q_s} is the set of finite states, and q(t) is the state of the automaton at any time instant t.
F : Q × B → Q is a mapping in terms of the state and input at any time instant t, such that q(t + 1) = F(q(t), β(t)).
G : Q → A is the output function, which determines the output of the automaton depending on the state, such that α(t) = G(q(t)).
The random environment interacting with the LA is defined as a triple ⟨A, B, D⟩, where A and B have been defined above. D = {d_1, d_2, ..., d_r} is the set of reward probabilities, and d_i = Pr[β(t) = 0 | α(t) = α_i] corresponds to input action α_i.
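A stationary P-model environment as defined above can be implemented in a few lines. The coding 0 = reward, 1 = penalty follows the convention assumed in this section (the original symbols were lost in extraction).

```python
import random

class PModelEnvironment:
    """Stationary P-model random environment: the feedback beta(t) is binary.
    Convention assumed here: 0 = reward, 1 = penalty."""
    def __init__(self, reward_probs, seed=None):
        self.d = list(reward_probs)      # d_i = Pr[reward | action i]
        self.rng = random.Random(seed)

    def respond(self, i):
        # Reward the chosen action i with probability d_i.
        return 0 if self.rng.random() < self.d[i] else 1

env = PModelEnvironment([0.65, 0.50, 0.45], seed=1)
# The empirical reward rate of action 0 should approach d_0 = 0.65.
rate = sum(1 for _ in range(10000) if env.respond(0) == 0) / 10000
```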
2.2 Estimator Algorithms
To improve the convergence rate of LA, Thathachar and Sastry designed a new class of algorithms, called estimator algorithms thathachar1985new thathachar1986estimator . These algorithms keep running estimates of the reward probability of each action in a reward-estimate vector and then use these estimates to update the action probabilities. According to the contents of their estimators, estimator algorithms can be divided into two classes: deterministic estimator algorithms and stochastic estimator algorithms.
Deterministic estimator algorithms form the majority of estimator algorithms; examples include the discretized pursuit algorithm oommen1990discretized and the generalized pursuit algorithms agache2002generalized . In these algorithms, the deterministic estimate vector is computed using the following formula, which yields the maximum-likelihood estimate sastry1985systems thathachar1979discretized :
\hat{d}_i(t) = \frac{R_i(t)}{Z_i(t)}, \qquad (1)

where R_i(t) is the number of times action α_i has been rewarded up to the current time t, and Z_i(t) is the number of times action α_i has been selected up to the current time t.
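Eq. (1) is a one-liner in code; the counts below are made-up examples for illustration.

```python
def ml_estimate(R, Z):
    """Maximum-likelihood (deterministic) estimate of each action's reward
    probability, Eq. (1): d_i(t) = R_i(t) / Z_i(t)."""
    return [r / z if z > 0 else 0.0 for r, z in zip(R, Z)]

# Example: action 0 rewarded 13 times in 20 selections,
# action 1 rewarded 4 times in 10 selections.
d = ml_estimate(R=[13, 4], Z=[20, 10])
# d == [0.65, 0.4]
```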
Vasilakos and Papadimitriou vasilakos1995new introduced a new type of estimator, called a "stochastic estimator", for nonstationary environments, in which the reward probabilities vary with time so that the optimal action may change from time to time. In vasilakos1995new , the authors added a zero-mean, normally distributed random number to each action's estimated probability.
Papadimitriou et al. papadimitriou2004new extended the use of the stochastic estimator to stationary environments. The implementation of the stochastic estimator in papadimitriou2004new imposes a random perturbation on the deterministic estimate, such that

u_i(t) = \hat{d}_i(t) + N_i(t), \qquad (2)

where u_i(t) is the stochastic estimate of the reward probability of action α_i at time t, \hat{d}_i(t) is the deterministic estimate of the reward probability of action α_i at time t, and N_i(t) is a random number uniformly distributed in a symmetric interval. The length of the interval depends on a design parameter γ and on the number of times Z_i(t) that action α_i has been selected up to time instant t.

3 Double Competitive Algorithm
Clearly, in the field of estimator learning automata, the most important task is to estimate the reward probability of each possible action accurately. However, because of a fundamental defect, the value of an estimator cannot always be strictly correct. Especially in the initial stage of the learning process, the estimator may perform poorly at estimating the reward probability of each action. Thus, a large amount of reward is added to the probabilities of non-optimal actions, and as a result a large number of extra iterations are needed to compensate for these wrong rewards.
The proposed Double Competitive Algorithm (DCA) is a learning automaton that updates the action probability vector with a double competitive strategy. The first competitive strategy is that only the action with the highest current stochastic estimate of reward probability gets the opportunity to increase its probability. The second is that whenever the 'optimal' action, the one with the highest current stochastic estimate of reward probability, changes, the probability of the new 'optimal' action receives a large increase while the probability of the previous 'optimal' action decreases sharply. With these two competitive strategies, wrong rewards can be corrected instantly. Early in the learning process, the 'optimal' action changes constantly because the estimator is not yet reliable, so the probability of each action fluctuates continually. Eventually, once the estimator is fully reliable, the action with the highest stochastic estimate of reward probability becomes stable and the LA converges rapidly.
Moreover, because the probabilities of all actions change dramatically during the learning process, actions whose probabilities used to be relatively small get more opportunities to be selected, so their deterministic estimates are updated more often. By the law of large numbers, the precision of the stochastic estimator is therefore higher, and the stochastic estimator in the DCA scheme is more reliable than that in the SE_RI scheme. The procedure of DCA is briefly introduced below.
The DCA scheme
Algorithm DCA
Parameters
n : the resolution parameter
γ : the design parameter of the stochastic estimator
a : the attenuation factor
R_i(t) : the number of times the i-th action has been rewarded up to time instant t, for 1 ≤ i ≤ r
Z_i(t) : the number of times the i-th action has been selected up to time instant t, for 1 ≤ i ≤ r
Δ : the smallest step size, determined by the resolution parameter n
m' : the action with the highest stochastic estimate of reward probability at the previous time instant
Method
Initialize the attenuation factor a = 0.1
Initialize p_i(0) = 1/r for 1 ≤ i ≤ r
Initialize R_i and Z_i by selecting each action a small number of times
Initialize m' as a random integer within {1, 2, ..., r}
Repeat
Step 1: Select an action α(t) = α_k according to the probability distribution P(t).
Step 2: Receive a feedback β(t) from the stochastic environment.
Step 3: Set R_k(t) = R_k(t − 1) + (1 − β(t)) and Z_k(t) = Z_k(t − 1) + 1.
Step 4: Compute the deterministic estimates by setting \hat{d}_i(t) = R_i(t)/Z_i(t) for 1 ≤ i ≤ r.
Step 5: If β(t) = 1 (the chosen action was penalized), go to Step 9.
Step 6: Compute the stochastic estimates u_i(t) = \hat{d}_i(t) + N_i(t), where N_i(t) is a random number uniformly distributed in a symmetric interval whose length depends on γ and Z_i(t).
Step 7: Select the action α_m that has the highest stochastic estimate of reward probability, i.e., m = arg max_i u_i(t).
Step 8: Update the probability vector by increasing p_m(t) by the step size and decreasing the probabilities of all other actions so that the probabilities still sum to one.
Step 9: Compute the stochastic estimates in the same way as in Step 6 and select the action with the highest stochastic estimate as in Step 7, i.e., m'' = arg max_i u_i(t).
Step 10: If m'' = m', go to Step 12.
Step 11: Reduce the probability of the previous 'optimal' action α_m' by a fraction determined by the attenuation factor a, and add the removed probability to the new 'optimal' action α_m''.
Step 12: Set m' = m'' and update the probability vector for the next time instant t + 1.
End Repeat
End Algorithm
Note that the double competitive strategy is reflected in the two probability-updating procedures. Steps 7 and 8 implement the first competitive strategy: only the action with the highest current stochastic estimate of reward probability gets the opportunity to increase its probability and, so that the probabilities still sum to one, the probabilities of all the other actions decrease. The second competitive strategy is implemented in Steps 10 and 11: whenever the 'optimal' action with the highest current stochastic estimate of reward probability changes (m'' ≠ m'), the probability of the previous 'optimal' action is reduced by an amount determined by the attenuation factor a, and the new 'optimal' action then receives an additional reward equal to the probability removed from the previous one.
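Since the exact update equations of the scheme were lost in extraction, the following Python sketch reconstructs one DCA-style update step from the prose description alone. The interval form of the perturbation and the parameter values (gamma, a, step) are illustrative assumptions, not the paper's exact formulas.

```python
import random

def dca_step(p, d, Z, m_prev, gamma=0.02, a=0.9, step=0.01, rng=random):
    """One hypothetical DCA update step, reconstructed from the prose.
    p: action probability vector, d: deterministic estimates,
    Z: selection counts, m_prev: previous 'optimal' action (or None).
    Returns the updated vector and the new 'optimal' index."""
    r = len(p)
    # Stochastic estimates: deterministic estimate plus a uniform perturbation
    # whose interval shrinks as an action is sampled more often (assumption).
    u = [d[i] + rng.uniform(-gamma / max(Z[i], 1), gamma / max(Z[i], 1))
         for i in range(r)]
    m = max(range(r), key=lambda i: u[i])
    # First competitive strategy: only the current 'optimal' action m may
    # gain probability; all other actions shed mass proportionally.
    gain = min(step, 1.0 - p[m])
    shed = sum(p[i] for i in range(r) if i != m)
    for i in range(r):
        if i != m and shed > 0:
            p[i] -= gain * p[i] / shed
    p[m] += gain
    # Second competitive strategy: if the 'optimal' action changed, transfer
    # a large fraction a of the old optimum's probability to the new one
    # (a 90% decay, matching the description in the proof section).
    if m_prev is not None and m != m_prev:
        moved = a * p[m_prev]
        p[m_prev] -= moved
        p[m] += moved
    return p, m
```

Both strategies only move probability mass between actions, so the vector remains a valid distribution after every step.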
4 Proof of ε-Optimality
Whether a given algorithm is ε-optimal is an important criterion in the LA context. Thus, we show that the proposed DCA scheme is ε-optimal in every stationary environment.
Definition 1. Given any arbitrarily small δ > 0 and ε > 0, there exist a resolution parameter n_0 (that depends on δ and ε) and a time instant t_0 such that for all resolution parameters n > n_0 and all times t > t_0: Pr[p_M(t) > 1 − ε] > 1 − δ, where α_M denotes the optimal action.
To prove the ε-optimality of the DCA scheme, the following two theorems are used.
Theorem 1: Suppose there exist an index M and a time instant t_0 < ∞ such that u_M(t) > u_j(t) for all j ≠ M and all t ≥ t_0. Then there exists an integer n_0 such that for all resolution parameters n > n_0, p_M(t) converges to 1 with probability one as t → ∞.
Proof: Since we have supposed that u_M(t) > u_j(t) for all j ≠ M and all t ≥ t_0, the action with the highest stochastic estimate of reward probability never changes, so there is no difference between the proposed DCA and the SE_RI scheme. The corresponding result for the SE_RI scheme was introduced and proved in papadimitriou2004new .
Theorem 2: For each action α_i, assume p_i(0) ≠ 0. Then, for any given constants δ > 0 and M < ∞, there exist n_0 < ∞ and t_0 < ∞ such that for all resolution parameters n > n_0 and all times t > t_0: Pr[Y_i(t) > M] > 1 − δ, where Y_i(t) is the number of times action α_i has been chosen up to time instant t.
Proof: Define the random variable Y_i(t) as the number of times action α_i is chosen up to time instant t. We must prove that

Pr[Y_i(t) > M] > 1 − δ, \qquad (3)

which is equivalent to proving that

Pr[Y_i(t) ≤ M] < δ. \qquad (4)
Since the events {Y_i(t) = k} for different values of k are mutually exclusive, (4) is equivalent to

\sum_{k=0}^{M} Pr[Y_i(t) = k] < δ. \qquad (5)
Now, consider an extreme situation for the proposed learning automaton. If the random initialization of m' selects action α_i and action α_i does not have the highest stochastic estimate of reward probability in the first iteration, then the probability of action α_i suffers a ninety percent decay. Worse still, suppose the i-th action receives no reward in the subsequent iterations, so that the stochastic estimate of the reward probability of action α_i is never the highest at any time instant. Even in this case, the largest possible decrease of any action's probability during any of the first t iterations is bounded, so it is clear that:
(6) 
The probability that action α_i is chosen at most M times during t iterations then has the following upper bound:
(7) 
It is clear that a sum of M + 1 terms is less than δ if each term of the sum is less than δ/(M + 1). Thus, we must prove that:
(8) 
Observing the inequality, it is necessary to ensure that the base of the exponential factor is strictly less than unity as t increases, which can be guaranteed by choosing a sufficiently large resolution parameter. Let
(9) 
Now, we must prove that
(10) 
where
(11) 
Then we calculate that
(12) 
Applying l'Hôpital's rule repeatedly, we obtain the following equation:
(13) 
Thus, the expression has a limit of zero as t tends to infinity. Hence, for every action α_i, there exists a time t_i such that for all t > t_i the corresponding term is less than δ/(M + 1). Moreover, (8) is monotonically decreasing as t increases. Let t_0 = max_i t_i; then (8) is satisfied for all t > t_0. Furthermore, for any t > t_0, we have
(14) 
Thus, we obtain
(15) 
Hence, for any action α_i,
(16) 
We can now repeat this argument for all the actions. Define n_0 and t_0 as follows:
Thus, for each action, the required bound is satisfied for all n > n_0 and all t > t_0, and the theorem is proved.
Now we are ready to prove that the DCA scheme is ε-optimal. According to Definition 1, we must prove the following theorem.
Theorem 3: The DCA scheme is ε-optimal in every stationary random environment; that is, given any δ > 0 and ε > 0, there exist n_0 (depending on δ and ε) and t_0 such that for all n > n_0 and all t > t_0: Pr[p_M(t) > 1 − ε] > 1 − δ.
Proof: The only difference between the proposed DCA scheme and the SE_RI scheme is the method of updating the probabilities. Since Theorems 1 and 2 hold for DCA, the ε-optimality of DCA can be proved by the same method as for SE_RI, which is presented in detail in papadimitriou2004new .
5 Simulation results
In the following, the proposed DCA scheme is compared with SE_RI, the most classic LA, and with the current fastest LA ge2015novel . All of these schemes have been proved to be ε-optimal.
Within the context of LA, the speed of convergence is compared by the number of iterations needed to converge in the five benchmark environments given in papadimitriou2004new . The actions' reward probabilities for each environment are as follows:

E1: D = {0.65, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10}
E2: D = {0.60, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10}
E3: D = {0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10}
E4: D = {0.70, 0.50, 0.30, 0.20, 0.40, 0.50, 0.40, 0.30, 0.50, 0.20}
E5: D = {0.10, 0.45, 0.84, 0.76, 0.20, 0.40, 0.60, 0.70, 0.50, 0.30}
In all the simulations performed, we use the same settings as papadimitriou2004new . An algorithm is considered to have converged if the probability of choosing some action is greater than or equal to a threshold T. The automaton is considered to have converged correctly when it converges to the action that has the highest reward probability.
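The benchmark reward vectors and the convergence criterion can be expressed directly in code; the threshold value 0.999 below is an assumption, since the exact value of T was lost from the text.

```python
# The five benchmark reward-probability vectors from the text.
ENVIRONMENTS = {
    "E1": [0.65, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10],
    "E2": [0.60, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10],
    "E3": [0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10],
    "E4": [0.70, 0.50, 0.30, 0.20, 0.40, 0.50, 0.40, 0.30, 0.50, 0.20],
    "E5": [0.10, 0.45, 0.84, 0.76, 0.20, 0.40, 0.60, 0.70, 0.50, 0.30],
}

def has_converged(p, threshold=0.999):
    """Convergence test: some action's probability reaches the threshold T
    (the 0.999 default is an illustrative assumption)."""
    return max(p) >= threshold

def converged_correctly(p, d, threshold=0.999):
    """Correct convergence: the automaton locked onto the action with the
    highest true reward probability."""
    return has_converged(p, threshold) and p.index(max(p)) == d.index(max(d))
```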
Before comparing the performance of the different learning automata, a large number of evaluation tests were carried out to determine the 'best' parameters for each scheme. Parameter values are considered best if they yield the fastest convergence while the automaton converges to the correct action in a sequence of experiments. The values of the remaining parameters are taken to be the same as those used in papadimitriou2004new . Once the 'best' parameters were determined, each algorithm was executed 250,000 times in each environment using them. Before each simulation, to initialize the estimator vector, every action was sampled 10 times, and these extra 100 iterations are included in the iteration counts.
Before comparing the overall simulation results, a single representative experiment is presented to show the difference between DCA and SE_RI during the convergence process. The curves representing the probability of the optimal action as a function of time are shown in Fig. 2.
The results presented in Fig. 2 indicate that, in the DCA scheme, the probability of the optimal action changes dramatically in the initial stage of the learning process, as explained earlier. As the number of iterations increases, the stochastic estimator becomes more and more reliable, and once it is sufficiently reliable, the learning automaton converges rapidly. By contrast, during the convergence process of the SE_RI scheme, once the probability of the optimal action decreases, many extra iterations are needed to compensate for the lost probability.
Moreover, as shown in Fig. 3, the non-optimal actions in the DCA scheme have more chances to be selected than in the SE_RI scheme. Thus, during the learning process, the estimate of each action is updated more often and the precision of the stochastic estimator is higher. Consequently, the stochastic estimator becomes sufficiently reliable earlier in the DCA scheme than in the SE_RI scheme.
With the benefits explained above, the overall simulation results are presented as follows.
Table 1. Accuracy (number of correct convergences / number of experiments) in environments E1 to E5.
Scheme | E1 | E2 | E3 | E4 | E5
DCA | 0.998 | 0.997 | 0.996 | 0.999 | 0.998
SE_RI | 0.997 | 0.996 | 0.995 | 0.998 | 0.997
DGCPA | 0.997 | 0.996 | 0.995 | 0.998 | 0.997
Table 2. Average number of iterations required for convergence using the 'best' parameters.
Environment | DCA | SE_RI | DGCPA
E1 | 377 | 426 | 351
E2 | 664 | 834 | 678
E3 | 2134 | 2540 | 2032
E4 | 299 | 325 | 298
E5 | 633 | 729 | 598
Table 3. Number of iterations required to achieve the same accuracy: DCA versus SE_RI.
Environment | DCA | SE_RI | Improvement
E1 | 338 | 426 | 20.66%
E2 | 633 | 834 | 24.10%
E3 | 1990 | 2540 | 21.65%
E4 | 282 | 325 | 13.23%
E5 | 582 | 729 | 20.16%
Table 4. Iterations and convergence time required to achieve the same accuracy: DCA versus the current fastest LA.
Environment | DCA Iterations | DCA Time (ms) | Fastest-LA Iterations | Fastest-LA Time (ms)
E1 | 338 | 0.162 | 426 | 3.423
E2 | 633 | 0.339 | 834 | 7.417
E3 | 1990 | 1.167 | 2540 | 26.577
E4 | 282 | 0.126 | 325 | 2.744
E5 | 582 | 0.351 | 729 | 9.252
The accuracies (number of correct convergences / number of experiments) of DCA, SE_RI and DGCPA in environments E1 to E5 when using the 'best' learning parameters are presented in Table 1. The results show that DCA always has better accuracy than the other two algorithms. The average numbers of iterations required for convergence are summarized in Table 2, which shows that the DCA scheme converges faster than SE_RI and only slightly slower than the current fastest LA, while achieving higher accuracy. To ensure a fair comparison among the three schemes, a series of experiments was carried out to measure the number of iterations required to achieve the same accuracy. The results are shown in Tables 3 and 4.
On the one hand, compared with the most classic LA model SE_RI, the proposed DCA scheme achieves a great improvement in the speed of convergence in all benchmark environments. For example, in environment E2, DCA converges in 633 iterations, while SE_RI requires 834 iterations, an improvement of 24.10%.
On the other hand, as indicated in Table 4, the current fastest LA model performs less competitively than the proposed DCA scheme. The superiority of DCA is reflected not only in the smaller number of iterations required for convergence but also in its time efficiency. Because of the complexity of computing the confidence interval in the fastest model, its time required for convergence increases rapidly. Thus, the superiority of the proposed DCA scheme is clear.
In summary, the DCA scheme, with its double competitive strategy, is more efficient than SE_RI and the current fastest LA. It overcomes the drawbacks of existing estimator algorithms and provides a novel idea for further breakthroughs in the LA field.
6 Conclusions
In this paper, a novel P-model absorbing learning automaton is introduced. Using a double competitive strategy, the proposed DCA scheme overcomes the drawbacks of existing estimator algorithms. The benefits of the proposed scheme are analysed, and it is proved to be ε-optimal in every stationary random environment. Extensive simulations have been performed in five benchmark environments, and the results indicate that the proposed scheme converges faster and performs more efficiently than both the most classic LA and the current fastest LA. Since the reliability of the estimator is the key to guaranteeing the convergence of an LA, future work will focus on making the estimator sufficiently reliable as early as possible in the learning process.
Acknowledgements.
This research work is funded by the National Key Research and Development Project of China (2016YFB0801003), the Science and Technology Project of State Grid Corporation of China (SGCC), and the Key Laboratory for Shanghai Integrated Information Security Management Technology Research.

References
 (1) Agache, M., Oommen, B.J.: Generalized pursuit learning schemes: new families of continuous and discretized learning automata. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 32(6), 738–749 (2002)
 (2) Esnaashari, M., Meybodi, M.R.: Data aggregation in sensor networks using learning automata. Wireless Networks 16(3), 687–699 (2010)
 (3) Ge, H., Jiang, W., Li, S., Li, J., Wang, Y., Jing, Y.: A novel estimator based learning automata algorithm. Applied Intelligence 42(2), 262–275 (2015)
 (4) Jiang, W.: A new class of optimal learning automata. In: International Conference on Intelligent Computing, pp. 116–121. Springer (2011)
 (5) Jiang, W., Li, B., Li, S., Tang, Y., Chen, C.L.P.: A new prospective for learning automata: A machine learning approach. Neurocomputing 188, 319–325 (2016)
 (6) Jiang, W., Zhao, C.L., Li, S.H., Chen, L.: A new learning automata based approach for online tracking of event patterns. Neurocomputing 137, 205–211 (2014)
 (7) Narendra, K.S., Thathachar, M.A.: Learning automata (2012)
 (8) Nicopolitidis, P., Papadimitriou, G.I., Pomportsis, A.S.: Using learning automata for adaptive pushbased data broadcasting in asymmetric wireless environments. IEEE Transactions on vehicular technology 51(6), 1652–1660 (2002)
 (9) Oommen, B.J., Croix, E.d.S.: Graph partitioning using learning automata. IEEE Transactions on Computers 45(2), 195–208 (1996)
 (10) Oommen, B.J., Lanctôt, J.K.: Discretized pursuit learning automata. IEEE Transactions on systems, man, and cybernetics 20(4), 931–938 (1990)
 (11) Papadimitriou, G.I., Sklira, M., Pomportsis, A.S.: A new class of εoptimal learning automata. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34(1), 246–254 (2004)
 (12) Sastry, P.: Systems of learning automata: Estimator algorithms applications. Ph.D. thesis, Ph. D. Thesis, Dept of Electrical Engineering, Indian Institute of Science, Bangalore, India (1985)
 (13) Thathachar, M., Oommen, B.: Discretized rewardinaction learning automata. J. Cybern. Inf. Sci 2(1), 24–29 (1979)
 (14) Thathachar, M., Sastry, P.S.: A new approach to the design of reinforcement schemes for learning automata. IEEE Transactions on Systems, Man, and Cybernetics (1), 168–175 (1985)
 (15) Thathachar, M.A., Sastry, P.S.: Estimator algorithms for learning automata (1986)
 (16) Thathachar, M.A., Sastry, P.S.: Varieties of learning automata: an overview. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 32(6), 711–722 (2002)
 (17) TSetlin, M., et al.: Automaton theory and modeling of biological systems (1973)
 (18) Varshavskii, V., Vorontsova, I.: On the behavior of stochastic automata with a variable structure. Avtomatika i Telemekhanika 24(3), 353–360 (1963)
 (19) Vasilakos, A.V., Papadimitriou, G.I.: A new approach to the design of reinforcement schemes for learning automata: Stochastic estimator learning algorithm. Neurocomputing 7(3), 275–297 (1995)
 (20) Wang, Y., Jiang, W., Ma, Y., Ge, H., Jing, Y.: Learning automata based cooperative studentteam in tutoriallike system. In: International Conference on Intelligent Computing, pp. 154–161. Springer (2014)
 (21) Zhao, Y., Jiang, W., Li, S., Ma, Y., Su, G., Lin, X.: A cellular learning automata based algorithm for detecting community structure in complex networks. Neurocomputing 151, 1216–1226 (2015)