Policy Gradient Reinforcement Learning for Policy Represented by Fuzzy Rules: Application to Simulations of Speed Control of an Automobile

by   Seiji Ishihara, et al.

A method fusing fuzzy inference with policy gradient reinforcement learning has been proposed that directly learns the parameters of a policy function represented by weighted fuzzy rules, so as to maximize the expected reward per episode. A previous study applied this method to a speed-control task for an automobile and obtained correct policies, some of which control the speed of the automobile appropriately, while many others generate inappropriate oscillation of the speed. In general, a policy that causes sudden changes or oscillation in the output value is undesirable, and in many cases a policy producing smooth changes in the output over time is preferable. In this paper, we propose a fusion method whose objective function introduces defuzzification by a stochastically weighted center-of-gravity model together with a constraint term for smoothness over time, as an improvement measure to suppress sudden changes in the output of the fuzzy controller. We then derive the learning rule for this fusion and also examine how the choice of reward function affects fluctuation of the output value. Experimental results on speed control of an automobile confirm that the proposed method suppresses undesirable fluctuations in the time series of the output value. Moreover, they also show that differences between reward functions can adversely affect the learning results.
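The abstract's core ingredients, a policy built from weighted fuzzy rules, center-of-gravity defuzzification made stochastic, and a smoothness penalty folded into the objective, can be illustrated with a minimal REINFORCE-style sketch. This is not the authors' exact formulation: the 1-D plant, the Gaussian rule memberships, the constants (`centers`, `width`, `sigma_u`, `lam`, `alpha`), and the moving-average baseline are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D speed-control setup: state x is the speed error and
# the controller output u is an acceleration command.  Five fuzzy rules
# with Gaussian membership functions cover the error range.
centers = np.linspace(-1.0, 1.0, 5)   # fuzzy-set centers (assumed)
width = 0.4                            # membership width (assumed)
w = np.zeros(5)                        # learnable rule consequents
sigma_u = 0.2                          # exploration noise on the output

def firing(x):
    """Normalized firing strengths of the rules at state x."""
    mu = np.exp(-0.5 * ((x - centers) / width) ** 2)
    return mu / mu.sum()

def run_episode(lam=1.0, T=30):
    """Sample one episode.  The mean output is the center-of-gravity
    combination of the rule consequents, made stochastic by Gaussian
    noise.  The smoothness constraint -lam*(u_t - u_prev)^2 is folded
    into the per-step reward."""
    x, u_prev, ret = 1.0, 0.0, 0.0
    grads = np.zeros_like(w)
    for _ in range(T):
        phi = firing(x)
        mean = phi @ w                  # center-of-gravity defuzzification
        u = rng.normal(mean, sigma_u)   # stochastic policy output
        r = -x**2 - lam * (u - u_prev) ** 2
        ret += r
        grads += (u - mean) / sigma_u**2 * phi   # d log pi / d w
        x, u_prev = 0.9 * x - 0.1 * u, u         # toy plant dynamics
    return ret, grads

def train(episodes=200, alpha=0.005):
    """REINFORCE update with a moving-average baseline for stability."""
    baseline = 0.0
    for _ in range(episodes):
        ret, grads = run_episode()
        baseline = 0.9 * baseline + 0.1 * ret
        w[:] = w + alpha * (ret - baseline) * grads
    return w
```

Raising `lam` trades tracking accuracy for a smoother output trajectory, which is the effect the paper's constraint term is meant to achieve.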




