
A double competitive strategy based learning automata algorithm

12/01/2017
by   Chong Di, et al.
Shanghai Jiao Tong University

Learning Automata (LA) are considered one of the most powerful tools in the field of reinforcement learning. The family of estimator algorithms was proposed to improve the convergence rate of LA and has achieved great success. However, estimators perform poorly at estimating the reward probabilities of actions in the initial stage of the learning process, so many rewards are added to the probabilities of non-optimal actions, and a large number of extra iterations are needed to compensate for these wrong rewards. In order to improve the speed of convergence, we propose a new P-model absorbing learning automaton that uses a double competitive strategy to update the action probability vector, so that wrong rewards can be corrected instantly. Hence, the proposed Double Competitive Algorithm overcomes the drawbacks of existing estimator algorithms. A refined analysis is presented to show the ϵ-optimality of the proposed scheme. Extensive experimental results in benchmark environments demonstrate that the proposed learning automaton performs more efficiently than the most classic LA, SE_RI, and the current fastest LA, DGCPA^*.


1 Introduction

A learning automaton (LA) is a reinforcement learning approach: a decision maker that can choose the optimal action and update its strategy by interacting with a random environment narendra2012learning. As one of the most powerful tools in adaptive learning systems, LA has found a myriad of applications oommen1996graph nicopolitidis2002using esnaashari2010data wang2014learning zhao2015cellular jiang2014new.

As illustrated in Fig. 1, the learning process is based on a loop involving two entities: the random environment (RE) and the LA. In this process, the LA continuously interacts with the RE and receives feedback for its various actions. According to the responses of the environment to these actions, the LA updates its action probability vector with a certain method. Finally, the LA attempts to learn the optimal action by interacting with the RE over a sufficient number of iterations.
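As a concrete illustration of this loop, the following minimal Python sketch (ours, not from the paper; the inline environment and the update_rule callback are placeholders) shows the generic LA–RE interaction described above.

import random

def la_environment_loop(reward_probs, update_rule, n_iterations=1000):
    # Generic LA-RE interaction loop (illustrative sketch only).
    # reward_probs: reward probability of each action (the random environment).
    # update_rule: callable(p, action, beta) -> new action probability vector.
    r = len(reward_probs)
    p = [1.0 / r] * r  # uniform initial action probability vector
    for t in range(n_iterations):
        action = random.choices(range(r), weights=p)[0]             # LA picks an action
        beta = 1 if random.random() < reward_probs[action] else 0   # RE responds (1 = reward)
        p = update_rule(p, action, beta)                            # LA updates its strategy
    return p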

Figure 1: Learning automata that interact with a random environment narendra2012learning

The first study concerning LA models dates back to tsetlin1973automaton, which investigated deterministic LA in detail, while varshavskii1963behavior introduced stochastic variable-structure versions of LA. Since then, LA have been extensively researched and various kinds of algorithms based on deterministic LA [6] and stochastic LA [7] have been developed. A comprehensive overview of this research is given in thathachar2002varieties.

In general, the rate of convergence is one of the vital considerations for learning algorithms. Therefore, Thathachar and Sastry designed a new class of learning automata, called estimator algorithms thathachar1985new thathachar1986estimator. The estimator algorithms have a faster rate of convergence than all previous schemes. They not only maintain and update the action probability vector, as previous algorithms do, but also keep estimating the reward probability of each action in a reward-estimate vector and use these estimates to update the action probability vector. Under this strategy, even when an action is rewarded, it is possible that the probability of choosing another action is increased papadimitriou2004new. Compared with traditional learning algorithms, estimator algorithms have been demonstrated to be more efficient. However, the performance of early estimator algorithms depends strictly on the reliability of the estimator’s contents, and an unreliable estimator may cause a significant decrease in the accuracy and the speed of convergence papadimitriou2004new. To address this, Papadimitriou, Sklira and Pomportsis papadimitriou2004new designed a stochastic estimator reward-inaction learning automaton (SE_RI) based on the use of a stochastic estimator. Because of its much faster speed of convergence and much higher accuracy in choosing the correct action compared with other estimator algorithms, SE_RI is by now widely accepted as the most classic LA model.

Due to the superiority of estimator algorithms, many novel estimators ge2015novel jiang2011new jiang2016new have been proposed in recent years. In 2015, Ge et al. ge2015novel proposed a deterministic estimator based LA (the Discretized Generalized Confidence Pursuit Algorithm, DGCPA), in which the estimate of each action is the upper bound of a confidence interval, and extended the algorithm to stochastic estimator schemes. The improved stochastic estimator based LA, DGCPA^*, is the current fastest LA model. Although the family of estimator learning automata has brought great improvements to the field of LA, some drawbacks remain.

Because of its fundamental limitations, the value of an estimator cannot always be strictly reliable. Especially in the initial stage of the learning process of the LA, the estimator may perform poorly at estimating the reward probability of each action. In this situation, many rewards are added to the probabilities of non-optimal actions. Thus, a large number of extra iterations are needed to compensate for these wrong rewards.

In this paper, in order to overcome the drawbacks of estimator algorithms, a novel method that uses a double competitive strategy to update the action probability vector is introduced. The proposed Double Competitive Algorithm (DCA) learning automaton uses the same stochastic estimator as SE_RI. The first competitive strategy of DCA is that only the action with the highest current stochastic estimate of reward probability gets the opportunity to increase its probability. The second strategy is that whenever the ‘optimal’ action, i.e., the action with the highest current stochastic estimate of reward probability, changes, the probability of the new ‘optimal’ action receives a large increase while the probability of the original ‘optimal’ action is decreased by a large amount. In this way, wrong rewards can be corrected instantly. Consequently, the learning automaton converges rapidly and accurately.

The key contributions of this paper are summarized as follows.

- We propose a new algorithm, referred to as the Double Competitive Algorithm (DCA), and prove that the proposed scheme is ϵ-optimal in all stationary random environments.

- The proposed DCA is compared with the most classic LA SE_RI and the fastest LA DGCPA^* in various stationary P-model random environments. The results indicate that the proposed DCA is more efficient.

The paper is organized as follows. In Section 2, we introduce the general idea of LA and of estimator algorithms. The DCA scheme is presented in Section 3. In Section 4, we prove that the proposed scheme is ϵ-optimal. Extensive simulation results describing the superiority of the proposed model over the most classic LA SE_RI and the fastest LA DGCPA^* are presented in Section 5. We conclude the paper in the last section.

2 Learning Automata and Estimator Algorithms

2.1 LA and stochastic environment

A LA is defined by a quintuple ⟨α, β, Q, F, G⟩, where:

α = {α_1, α_2, …, α_r} is the set of outputs or actions, and α(t) is the action chosen by the automaton at any time instant t.

β is the set of inputs to the automaton, and β(t) is the input at any time instant t. The set β could be finite or infinite. In this paper, we consider the case when β = {0, 1}, where β(t) = 0 represents the event that the LA has been penalized and β(t) = 1 represents the event that the LA has been rewarded.

Q is the set of finite states, and q(t) is the state of the automaton at any time instant t.

F is a mapping in terms of the state and input at any time instant t, such that q(t+1) = F(q(t), β(t)).

G is a mapping Q → α, called the output function, which determines the output of the automaton depending on the state, such that α(t) = G(q(t)).

The random environment that interacts with the LA is defined by the triple ⟨α, β, D⟩, where α and β have been defined above. D = {d_1, d_2, …, d_r} is the set of reward probabilities, and d_i = Pr{β(t) = 1 | α(t) = α_i} corresponds to the input action α_i.
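To make the environment triple concrete, here is a small Python sketch (ours, not from the paper; class and method names are illustrative) of a stationary P-model environment that returns β(t) = 1 with probability d_i when action α_i is selected.

import random

class PModelEnvironment:
    # Stationary P-model random environment <alpha, beta, D> (illustrative sketch).
    def __init__(self, reward_probs):
        self.reward_probs = list(reward_probs)  # D = {d_1, ..., d_r}

    def respond(self, action_index):
        # Return beta(t): 1 (reward) with probability d_i, otherwise 0 (penalty).
        return 1 if random.random() < self.reward_probs[action_index] else 0

# Example: a 10-action environment using the reward probabilities of benchmark E1 (Section 5)
env = PModelEnvironment([0.65, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10])
beta = env.respond(0)  # feedback for selecting the first action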

2.2 Estimator Algorithms

For the purpose of improving the convergence rate of LA, Thathachar and Sastry designed a new class of algorithms, called estimator algorithms [9][10]. These algorithms keep a running estimate of the reward probability of each action in a reward-estimate vector and then use the estimates to update the action probabilities. According to the contents of the estimator, estimator algorithms can be divided into two classes: deterministic estimator algorithms and stochastic estimator algorithms.

Deterministic estimator algorithms constitute the majority of estimator algorithms; examples include the discretized pursuit algorithm oommen1990discretized and the generalized pursuit algorithms agache2002generalized. In these algorithms, the deterministic estimate vector can be computed using the following formula, which yields the maximum-likelihood estimate sastry1985systems thathachar1979discretized

d̂_i(t) = W_i(t) / Z_i(t)    (1)

where W_i(t) is the number of times action α_i has been rewarded up to the current time t, and Z_i(t) is the number of times action α_i has been selected up to the current time t.
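The maximum-likelihood estimate of Eq. (1) can be computed directly; in the short Python sketch below, the list names W and Z follow the definitions above and are otherwise our own.

def deterministic_estimates(W, Z):
    # Maximum-likelihood reward estimates d_hat_i = W_i / Z_i (Eq. 1).
    # W[i]: times action i has been rewarded; Z[i]: times action i has been selected.
    return [w / z if z > 0 else 0.0 for w, z in zip(W, Z)]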

vasilakos1995new introduced a new type of estimator called the “stochastic estimator”, aimed at non-stationary environments, in which the reward probabilities vary with time so that the optimal action may change from time to time. In vasilakos1995new, the authors added a zero-mean, normally distributed random number to the estimated reward probability of each action.

Papadimitriou et al. papadimitriou2004new also extended the use of the stochastic estimator to stationary environments. The implementation of the stochastic estimator in papadimitriou2004new imposes a random perturbation on the deterministic estimate, such that

u_i(t) = d̂_i(t) + R_i(t)    (2)

where u_i(t) is the stochastic estimate of the reward probability of action α_i at time t, d̂_i(t) is the deterministic estimate of the reward probability of action α_i at time t, and R_i(t) is a random number uniformly distributed in an interval whose length depends on a design parameter and on the number of times Z_i(t) that action α_i has been selected up to time instant t.
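A sketch of the stochastic perturbation of Eq. (2) follows. The exact perturbation interval used by SE_RI is not reproduced in this text; as an assumption for illustration, the interval half-width below shrinks as gamma / Z_i(t), where gamma stands in for the design parameter.

import random

def stochastic_estimates(W, Z, gamma=10.0):
    # Stochastic estimates u_i = d_hat_i + R_i (Eq. 2), illustrative sketch.
    # The perturbation R_i is uniform over an interval that shrinks as action i is
    # selected more often; the +-gamma/Z_i form below is an assumption, not
    # necessarily the exact interval used by SE_RI or DCA.
    u = []
    for w, z in zip(W, Z):
        d_hat = w / z if z > 0 else 0.0
        half_width = gamma / z if z > 0 else gamma
        u.append(d_hat + random.uniform(-half_width, half_width))
    return u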

3 Double Competitive Algorithm

It is clear that, in the field of estimator learning automata, the most important part is to estimate the reward probability of each possible action accurately. However, because of its fundamental limitations, the value of an estimator cannot always be strictly reliable. Especially in the initial stage of the learning process of the LA, the estimator may perform poorly at estimating the reward probability of each action. In this situation, many rewards are added to the probabilities of non-optimal actions. As a result, a large number of extra iterations are needed to compensate for these wrong rewards.

The proposed Double Competitive Algorithm (DCA) is a learning automaton that updates the action probability vector with a double competitive strategy. The first competitive strategy of DCA is that only the action with the highest current stochastic estimate of the reward probability gets the opportunity to increase its probability. The second competitive strategy is that whenever the ‘optimal’ action with the highest current stochastic estimate of reward probability changes, the probability of the new ‘optimal’ action receives a large increase while the probability of the original ‘optimal’ action is decreased by a large amount. With these two competitive strategies, wrong rewards can be corrected instantly. Clearly, the ‘optimal’ action changes frequently while the estimator is not yet reliable in the early stages of learning, so the probability of each action fluctuates continually. Eventually, when the estimator becomes fully reliable, the action with the highest current stochastic estimate of reward probability tends to remain fixed, and the LA converges rapidly.

Besides, because of the dramatic changes in the action probabilities during the learning process, actions whose probabilities used to be relatively small get more opportunities to be selected, so their deterministic estimates are updated more often. Therefore, during the learning process, the estimate of each non-optimal action gets more opportunities to be updated and, according to the Law of Large Numbers, the precision of the stochastic estimator becomes higher. Hence, the stochastic estimator in the DCA scheme is more reliable than that in the SE_RI scheme.

The procedure of DCA is briefly introduced below.

The DCA scheme

Algorithm DCA

Parameters

n : the resolution parameter

the attenuation factor

W_i(t) : the number of times the i-th action has been rewarded up to time instant t, for 1 ≤ i ≤ r

Z_i(t) : the number of times the i-th action has been selected up to time instant t, for 1 ≤ i ≤ r

Δ : the smallest step size (determined by the resolution parameter)

m' : the index of the action that had the highest stochastic estimate of reward probability at the last time instant

Method

Initialize the attenuation factor to 0.1

Initialize p_i(0) = 1/r for 1 ≤ i ≤ r

Initialize W_i and Z_i by selecting each action a number of times

Initialize m' to a random integer within [1, r]

Repeat

Step 1: At time t, choose an action α(t) = α_i according to the probability distribution P(t).

Step 2: Receive the feedback β(t) from the stochastic environment.

Step 3: Set W_i(t) = W_i(t−1) + β(t) and Z_i(t) = Z_i(t−1) + 1.

Step 4: Compute the deterministic estimate d̂_i(t) by setting d̂_i(t) = W_i(t) / Z_i(t).

Step 5: If β(t) = 0 (the action was penalized), go to Step 9.

Step 6: Compute the stochastic estimates u_j(t) = d̂_j(t) + R_j(t) for all actions, where R_j(t) is a random number uniformly distributed within an interval whose length depends on the design parameter and on Z_j(t).

Step 7: Select the action α_m that has the highest stochastic estimate of reward probability, where m = argmax_j u_j(t).

Step 8: Update the probability vector according to the first competitive strategy: the probability of action α_m is increased, and the probabilities of all other actions are decreased so that the vector still sums to one.

Step 9: Compute the stochastic estimates in the same way as in Step 6 and select the action α_m with the highest stochastic estimate as in Step 7, where m = argmax_j u_j(t).

Step 10: If m = m', go to Step 12.

Step 11: Update the probability vector according to the second competitive strategy: the probability of the previous ‘optimal’ action α_m' is reduced by an amount determined by the attenuation factor, and this amount is added to the probability of the new ‘optimal’ action α_m.

Step 12: Take the updated vector as the action probability vector P(t+1) for the next iteration and set m' = m.

End Repeat
End Algorithm

Note that the double competitive strategy is reflected in the two probability-updating procedures. Step 7 and Step 8 implement the first competitive strategy: only the action with the highest current stochastic estimate of the reward probability gets the opportunity to increase its probability and, in order to keep the probabilities summing to one, the probabilities of all the other actions decrease. The second competitive strategy is summarized in Step 10 and Step 11: whenever the ‘optimal’ action with the highest current stochastic estimate of reward probability changes (m ≠ m'), the probability of the original ‘optimal’ action α_m' is reduced by an amount determined by the attenuation factor, and the new ‘optimal’ action α_m then receives an additional reward equal to the reduced probability of action α_m'.
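Since the exact update equations of Step 8 and Step 11 are not reproduced in this version of the text, the following Python sketch illustrates one plausible reading of the two strategies; the discretized step delta for the first update and the multiplicative use of the attenuation factor in the second are our assumptions, chosen only to be consistent with the description above.

def first_competitive_update(p, m, delta):
    # First strategy (Steps 7-8, assumed discretized form): only the action m with
    # the highest stochastic estimate may gain probability; every other action loses
    # at most delta, and action m absorbs the freed mass so the vector sums to one.
    new_p = [max(p_j - delta, 0.0) if j != m else p_j for j, p_j in enumerate(p)]
    new_p[m] = 1.0 - sum(x for j, x in enumerate(new_p) if j != m)
    return new_p

def second_competitive_update(p, m, m_last, attenuation=0.1):
    # Second strategy (Steps 10-11, assumed form): when the 'optimal' action changes
    # (m != m_last), the old optimal action keeps only `attenuation` of its probability
    # and the freed amount is transferred to the new optimal action.
    if m == m_last:
        return p
    new_p = list(p)
    freed = (1.0 - attenuation) * new_p[m_last]  # e.g. a 90% decay when attenuation = 0.1
    new_p[m_last] -= freed
    new_p[m] += freed
    return new_p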

4 Proof of ϵ-optimality

Whether a given algorithm is ϵ-optimal is an important criterion in the LA context. Thus, we will show that the proposed DCA scheme is ϵ-optimal in every stationary environment.
Definition 1: Given any arbitrarily small ϵ > 0 and δ > 0, there exist an n_0 (that depends on ϵ and δ) and a t_0 such that, for all resolution parameters n > n_0 and all time t > t_0, Pr{p_m(t) > 1 − ϵ} > 1 − δ, where α_m denotes the optimal action.
To prove the ϵ-optimality of the DCA scheme, the following two theorems will be used.
Theorem 1: Suppose there exist an index m and a time instant t_0 < ∞ such that u_m(t) > u_j(t) for all j ≠ m and for all t ≥ t_0. Then there exists an integer n_0 such that, for all resolution parameters n > n_0, p_m(t) → 1 with probability one as t → ∞.
Proof: Since we have supposed that u_m(t) > u_j(t) for all j ≠ m and for all t ≥ t_0, the action with the highest stochastic estimate of reward probability does not change, so there is no difference between the proposed DCA and the SE_RI scheme. The corresponding result for SE_RI has been introduced and proved in [11].
Theorem 2: For each action α_i, assume p_i(0) ≠ 0. Then, for any given constants δ > 0 and M < ∞, there exist n_0 and t_0 < ∞ such that, for all resolution parameters n > n_0 and all time t > t_0, action α_i has been selected more than M times with probability at least 1 − δ.

Proof: Define the random variable Y_i(t) as the number of times action α_i is chosen up to time instant t. We then need to prove that

Pr{Y_i(t) > M} > 1 − δ.    (3)

This is equivalent to proving that

Pr{Y_i(t) ≤ M} < δ.    (4)

It is clear that the events {Y_i(t) = k} and {Y_i(t) = j} are mutually exclusive for any k ≠ j. Then (4) is equivalent to

Σ_{k=0}^{M} Pr{Y_i(t) = k} < δ.    (5)

Now, consider an extreme situation in the proposed learning automaton. If the random initialization sets m' = i and action α_i does not have the highest stochastic estimate of reward probability in the first iteration, then the probability of action α_i suffers a ninety percent decay. Worse still, the i-th action may not receive any reward in the subsequent iterations, which means the stochastic estimate of the reward probability of action α_i remains below the highest estimate at every time instant. Thus, during any of the first t iterations, the largest possible decrease of the probability of any action is bounded, so it is clear that:

(6)

The probability that action α_i is chosen up to M times among t iterations has the following upper bound.

(7)

It is clear that a sum of M + 1 terms is less than δ if each element of the sum is less than δ/(M + 1). Thus, we should prove that:

(8)

Observing the inequality, it is necessary to make sure that the relevant factor is strictly less than unity as t increases, which can be guaranteed for suitable parameters. Let

(9)

Now, we should prove that

(10)

where

(11)

Then we calculate that

(12)

Applying l'Hôpital's rule repeatedly, we obtain the following equation:

(13)

Thus, the expression has a limit of zero as t tends towards infinity. In this case, for every action α_i, there is a t_i such that, for all t > t_i, the corresponding term is less than δ/(M + 1). It is also clear that the left-hand side of (8) is monotonically decreasing as t increases, so (8) is satisfied for all t > t_i. Furthermore, for any such t, we have

(14)

Thus, we can conclude that

(15)

Hence, for any action α_i,

(16)

Now, we can repeat this argument for all the actions. Define t_0 and n_0 as the maxima, over all actions, of the corresponding values obtained above.

Thus, for each action, the required bound is satisfied for all t > t_0 and n > n_0, and the theorem is proved.

Now we are ready to prove that the DCA scheme is ϵ-optimal. According to Definition 1, we should prove the following theorem.

Theorem 3: The DCA is ϵ-optimal in every stationary random environment. That is, given any ϵ > 0 and δ > 0, there exist an n_0 (that depends on ϵ and δ) and a t_0 such that, for all n > n_0 and t > t_0, Pr{p_m(t) > 1 − ϵ} > 1 − δ.

Proof: The only difference between the proposed DCA scheme and the SE_RI scheme is the method used to update the probabilities. Since we have shown that Theorem 1 and Theorem 2 hold for DCA, we can prove the ϵ-optimality of DCA with the same method as for SE_RI, which is presented in detail in papadimitriou2004new.

5 Simulation results

In the following, the proposed DCA scheme is compared with the most classic LA SE_RI and with DGCPA^*, which is considered the current fastest LA. All of these schemes have been proved to be ϵ-optimal.

Within the context of LA, the speed of convergence is compared in terms of the number of iterations needed to converge in the five benchmark environments given in papadimitriou2004new. The actions' reward probabilities for each environment are as follows:

  • E1: D = {0.65, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10}.

  • E2: D = {0.60, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10}.

  • E3: D = {0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10}.

  • E4: D = {0.70, 0.50, 0.30, 0.20, 0.40, 0.50, 0.40, 0.30, 0.50, 0.20}.

  • E5: D = {0.10, 0.45, 0.84, 0.76, 0.20, 0.40, 0.60, 0.70, 0.50, 0.30}.
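For convenience, these reward-probability vectors can be written directly as Python lists (the names E1 to E5 match the labels above) and plugged into the environment and loop sketches given earlier.

E1 = [0.65, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]
E2 = [0.60, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]
E3 = [0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]
E4 = [0.70, 0.50, 0.30, 0.20, 0.40, 0.50, 0.40, 0.30, 0.50, 0.20]
E5 = [0.10, 0.45, 0.84, 0.76, 0.20, 0.40, 0.60, 0.70, 0.50, 0.30]
BENCHMARKS = {"E1": E1, "E2": E2, "E3": E3, "E4": E4, "E5": E5}  # harder environments have closer reward probabilities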

In all the simulations performed, we used the same settings as papadimitriou2004new. An algorithm is considered to have converged if the probability of choosing an action is greater than or equal to a threshold. The automaton is considered to have converged correctly when it converges to the action that has the highest reward probability.

Before comparing the performance of the different learning automata, a large number of evaluation tests were carried out to determine the ‘best’ parameters for each scheme. The parameter values are considered to be the ‘best’ if they yield the fastest convergence while the automaton converges to the correct action in a sequence of experiments. The values of the remaining design parameters are taken to be the same as those used in papadimitriou2004new. Once the ‘best’ parameters had been determined, each algorithm was executed 250,000 times in each environment using these parameters. Before each simulation, to initialize the estimator vector, every action was sampled 10 times, and these extra 100 iterations are included in the iteration counts.
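The experimental protocol can be summarized in a short driver sketch (ours, not the authors' code); the convergence threshold value, the update rule la_step, and the iteration cap are placeholders.

import random

def run_experiment(la_step, reward_probs, threshold=0.999, max_iter=100_000):
    # One experiment: run the LA until some action probability reaches `threshold`
    # and report whether it converged to the truly optimal action.
    # `la_step(p, action, beta)` is the probability-update rule of the scheme under
    # test; the threshold value here is a placeholder, not the paper's exact setting.
    r = len(reward_probs)
    p = [1.0 / r] * r
    for _ in range(max_iter):
        action = random.choices(range(r), weights=p)[0]
        beta = 1 if random.random() < reward_probs[action] else 0
        p = la_step(p, action, beta)
        if max(p) >= threshold:
            return p.index(max(p)) == reward_probs.index(max(reward_probs))
    return False

# Accuracy estimate over repeated runs, mirroring the 250,000-experiment protocol:
# accuracy = sum(run_experiment(my_update_rule, E1)
#                for _ in range(250_000)) / 250_000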

Before comparing the overall simulation results, a single typical experiment is examined to show the difference between DCA and SE_RI during the convergence process. The curves representing the probability of the optimal action as a function of time are presented in Fig 2.

Figure 2: The probability of the optimal action versus time (the extra 100 iterations used to initialize the estimator vector are not included) for DCA and SE_RI when operating in one of the benchmark environments. For both schemes, the ‘best’ learning parameters are used.
Figure 3: The percentage of selections of each action, relative to the total number of selections required for convergence, for DCA and SE_RI in the same environment, when using the ‘best’ learning parameters (250,000 experiments were performed for each scheme).

The results presented in Fig 2 indicate that, for DCA, the probability of the optimal action changes dramatically in the initial stage of the learning process, as explained earlier. As the number of iterations increases, the stochastic estimator becomes more and more reliable, and once the estimator is sufficiently reliable, the learning automaton converges rapidly. During the convergence process of the SE_RI scheme, on the other hand, once the probability of the optimal action decreases, a lot of extra iterations are needed to compensate for the lost probability.

Besides, as presented in Fig 3, the non-optimal actions in the DCA scheme have more chances to be selected than in the SE_RI scheme. Thus, during the learning process, the estimate of each action gets more opportunities to be updated and the precision of the stochastic estimator is higher. Consequently, the stochastic estimator becomes sufficiently reliable earlier in the DCA scheme than in the SE_RI scheme.

With the benefits explained above, the overall simulation results are presented as follows.

Scheme     E1      E2      E3      E4      E5
DCA        0.998   0.997   0.996   0.999   0.998
SE_RI      0.997   0.996   0.995   0.998   0.997
DGCPA^*    0.997   0.996   0.995   0.998   0.997
Table 1: Accuracy (number of correct convergences / number of experiments) of DCA, SE_RI and DGCPA^* in environments E1 to E5, when using the ‘best’ learning parameters (250,000 experiments were performed for each scheme in each environment)
Environment   DCA iterations   SE_RI iterations   DGCPA^* iterations
E1            377              426                351
E2            664              834                678
E3            2134             2540               2032
E4            299              325                298
E5            633              729                598
Table 2: Comparison of the average number of iterations required for convergence of DCA, SE_RI and DGCPA^* in environments E1 to E5, when using the ‘best’ learning parameters (250,000 experiments were performed for each scheme in each environment)
Environment   DCA iterations   SE_RI iterations   Improvement
E1            338              426                20.66%
E2            633              834                24.10%
E3            1990             2540               21.65%
E4            282              325                13.23%
E5            582              729                20.16%
Table 3: Comparison of the average number of iterations required for convergence of DCA and SE_RI when achieving the same accuracy as shown in Table 1, in environments E1 to E5 (250,000 experiments were performed for each scheme in each environment)
Environment   DCA iterations   DCA time (ms)   DGCPA^* iterations   DGCPA^* time (ms)
E1            338              0.162           426                  3.423
E2            633              0.339           834                  7.417
E3            1990             1.167           2540                 26.577
E4            282              0.126           325                  2.744
E5            582              0.351           729                  9.252
Table 4: Comparison of the average number of iterations and the average time required for convergence when achieving the same accuracy as shown in Table 1, in environments E1 to E5 (250,000 experiments were performed for each scheme in each environment)

The accuracies (number of correct convergences / number of experiments) of DCA, SE_RI and DGCPA^* in environments E1 to E5 when using the ‘best’ learning parameters are presented in Table 1. The results show that DCA always has better accuracy than the other two algorithms. The average numbers of iterations required for convergence are summarized in Table 2, which demonstrates that the DCA scheme converges faster than SE_RI and only slightly slower than the current fastest LA DGCPA^*, while achieving higher accuracy. In order to ensure that the performance comparison between DCA, SE_RI and DGCPA^* is fair, a series of experiments was carried out to verify the number of iterations required to achieve the same accuracy. The results are shown in Table 3 and Table 4.

On the one hand, compared with the most classic LA model SE_RI, the proposed DCA scheme achieves a great improvement in the speed of convergence in all benchmark environments. For example, in environment E2, DCA converges in 633 iterations, while SE_RI requires 834 iterations; thus, an improvement of 24.10% in comparison with SE_RI is obtained.

On the other hand, as indicated in Table 4, the current fastest LA model DGCPA^* performs less competitively than the proposed DCA scheme. The superiority of DCA is reflected not only in the smaller number of iterations required for convergence but also in its time efficiency. Because of the complexity of the DGCPA^* model when computing the confidence interval, its time required for convergence increases rapidly. Thus, the superiority of the proposed DCA scheme is clear.

In summary, the DCA scheme, which uses a double competitive strategy, is more efficient than SE_RI and DGCPA^*. It overcomes the drawbacks of estimator algorithms and provides a novel idea for further breakthroughs in the LA field.

6 Conclusions

In this paper, a novel P-model absorbing learning automaton is introduced. Through the use of a double competitive strategy, the proposed scheme overcomes the drawbacks of existing estimator algorithms. The benefits of the proposed scheme are analysed, and it is proved to be ϵ-optimal in every stationary random environment. Extensive simulations have been performed in five benchmark environments, and the results indicate that the proposed scheme converges faster and performs more efficiently than the most classic LA SE_RI and the current fastest LA DGCPA^*. Since the reliability of the estimator is the key to guaranteeing the convergence of LA, future work will focus on how to make the estimator sufficiently reliable as early as possible.

Acknowledgements.
This research work is funded by the National Key Research and Development Project of China (2016YFB0801003), the Science and Technology Project of State Grid Corporation of China (SGCC), and the Key Laboratory for Shanghai Integrated Information Security Management Technology Research.

References

  • (1) Agache, M., Oommen, B.J.: Generalized pursuit learning schemes: new families of continuous and discretized learning automata. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 32(6), 738–749 (2002)
  • (2) Esnaashari, M., Meybodi, M.R.: Data aggregation in sensor networks using learning automata. Wireless Networks 16(3), 687–699 (2010)
  • (3) Ge, H., Jiang, W., Li, S., Li, J., Wang, Y., Jing, Y.: A novel estimator based learning automata algorithm. Applied Intelligence 42(2), 262–275 (2015)
  • (4) Jiang, W.: A new class of ε-optimal learning automata. In: International Conference on Intelligent Computing, pp. 116–121. Springer (2011)
  • (5) Jiang, W., Li, B., Li, S., Tang, Y., Chen, C.L.P.: A new prospective for learning automata: A machine learning approach. Neurocomputing 188, 319–325 (2016)
  • (6) Jiang, W., Zhao, C.L., Li, S.H., Chen, L.: A new learning automata based approach for online tracking of event patterns. Neurocomputing 137, 205–211 (2014)
  • (7) Narendra, K.S., Thathachar, M.A.: Learning automata (2012)
  • (8) Nicopolitidis, P., Papadimitriou, G.I., Pomportsis, A.S.: Using learning automata for adaptive push-based data broadcasting in asymmetric wireless environments. IEEE Transactions on vehicular technology 51(6), 1652–1660 (2002)
  • (9) Oommen, B.J., Croix, E.d.S.: Graph partitioning using learning automata. IEEE Transactions on Computers 45(2), 195–208 (1996)
  • (10) Oommen, B.J., Lanctôt, J.K.: Discretized pursuit learning automata. IEEE Transactions on systems, man, and cybernetics 20(4), 931–938 (1990)
  • (11) Papadimitriou, G.I., Sklira, M., Pomportsis, A.S.: A new class of ε-optimal learning automata. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34(1), 246–254 (2004)
  • (12) Sastry, P.: Systems of learning automata: Estimator algorithms applications. Ph.D. thesis, Dept. of Electrical Engineering, Indian Institute of Science, Bangalore, India (1985)
  • (13) Thathachar, M., Oommen, B.: Discretized reward-inaction learning automata. J. Cybern. Inf. Sci 2(1), 24–29 (1979)
  • (14) Thathachar, M., Sastry, P.S.: A new approach to the design of reinforcement schemes for learning automata. IEEE Transactions on Systems, Man, and Cybernetics (1), 168–175 (1985)
  • (15) Thathachar, M.A., Sastry, P.S.: Estimator algorithms for learning automata (1986)
  • (16) Thathachar, M.A., Sastry, P.S.: Varieties of learning automata: an overview. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 32(6), 711–722 (2002)
  • (17) Tsetlin, M., et al.: Automaton theory and modeling of biological systems (1973)
  • (18) Varshavskii, V., Vorontsova, I.: On the behavior of stochastic automata with a variable structure. Avtomatika i Telemekhanika 24(3), 353–360 (1963)
  • (19) Vasilakos, A.V., Papadimitriou, G.I.: A new approach to the design of reinforcement schemes for learning automata: Stochastic estimator learning algorithm. Neurocomputing 7(3), 275–297 (1995)
  • (20) Wang, Y., Jiang, W., Ma, Y., Ge, H., Jing, Y.: Learning automata based cooperative student-team in tutorial-like system. In: International Conference on Intelligent Computing, pp. 154–161. Springer (2014)
  • (21) Zhao, Y., Jiang, W., Li, S., Ma, Y., Su, G., Lin, X.: A cellular learning automata based algorithm for detecting community structure in complex networks. Neurocomputing 151, 1216–1226 (2015)