1 Introduction
While most reinforcement learning algorithms aim at minimizing the Mean Squared Bellman Error, under function approximation it makes more sense to track the Projected Bellman Error. This is because with function approximation the true solution of the Bellman equation might not be representable by the function class; for example, with a linear architecture the true solution may not lie in the range space of the design matrix. In such a scenario, one looks instead at the projection of the solution onto the range space of the design matrix. This projected solution is the fixed point of the projected Bellman equation.
2 Problem Setup
Let $(\mathcal{S}, \mathcal{A}, P, r)$ denote a Markov Decision Problem, where states take values in a state space $\mathcal{S}$. In a state $s$ we can take an action $a \in \mathcal{A}$, for which we receive a reward $r(s, a)$. Actions are taken according to a strategy or policy $\pi$, which prescribes, for each state $s$, the action $a = \pi(s)$ to be taken.
The planning problem, or evaluation problem, corresponds to evaluating the goodness of a policy $\pi$. Described formally, starting from a state $s$, the goodness of the state while following the strategy $\pi$ is the cumulative expected reward earned,
$$ J^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, \pi(s_t)) \;\middle|\; s_0 = s\right], $$
where $\gamma \in (0, 1]$ is a discount factor.
These expectations can be estimated from Monte Carlo runs of a simulation and updated online via a Robbins-Monro type stochastic approximation.
From this perspective, let $s_0, s_1, s_2, \ldots$ be a sequence of states observed while following a policy $\pi$. At any state $s_t$ we can compute the cost-to-go as follows:
$$ G^{\lambda}_t = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t, $$
where $\lambda \in [0,1]$ is a parameter, or eligibility trace, between $0$ and $1$ which weights the rewards, and
$$ G^{(n)}_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} J(s_{t+n}), \qquad r_{t+k} = r(s_{t+k}, \pi(s_{t+k})), $$
is the $n$-step return.
The Robbins-Monro stochastic approximation algorithm for the above equations becomes
$$ J(s_t) \leftarrow J(s_t) + \alpha_t \left(G^{\lambda}_t - J(s_t)\right). $$
This form of update is what is known as a look-up table approach, since we need to keep a separate entry for each state's cost-to-go and update every entry as the simulation proceeds.
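To make the look-up table update concrete, here is a minimal sketch in Python; the 3-state chain, its rewards, and the constant step size are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Hypothetical 3-state episodic chain under a fixed policy (illustrative only).
P = np.array([[0.5, 0.5, 0.0],   # transition matrix; state 2 is absorbing
              [0.1, 0.6, 0.3],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, 1.0, 0.0])    # reward earned on leaving each state

J = np.zeros(3)                  # look-up table: one entry per state
alpha = 0.1                      # constant step size for the Robbins-Monro update

rng = np.random.default_rng(0)
for episode in range(2000):
    s, path, rewards = 0, [], []
    while s != 2:                                  # run until absorption
        path.append(s)
        rewards.append(r[s])
        s = rng.choice(3, p=P[s])
    G = 0.0
    for t in reversed(range(len(path))):           # cost-to-go from each visited state
        G += rewards[t]
        J[path[t]] += alpha * (G - J[path[t]])     # Robbins-Monro update per entry

print(J)
```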
In contrast, when it comes to function approximation we approximate $J^{\pi}(s)$ by $\tilde{J}(s; \theta) = \phi(s)^{\top}\theta$, where $\phi(s)$ is a feature vector for state $s$ and $\theta$ is a learnable parameter.
We solve the following minimization problem, for instance in the stochastic shortest path problem, where $N$ is a stopping time:
$$ \min_{\theta} \; \mathbb{E}\left[\sum_{t=0}^{N}\Big(\tilde{J}(s_t;\theta) - \sum_{m=t}^{N} r_m\Big)^{2}\right]. $$
Gradient descent gives us the following update:
$$ \theta \leftarrow \theta + \alpha \sum_{t=0}^{N} \nabla_{\theta}\tilde{J}(s_t;\theta)\Big(\sum_{m=t}^{N} r_m - \tilde{J}(s_t;\theta)\Big). $$
By allowing for an eligibility trace $\lambda \in [0,1]$, the update is
$$ \theta \leftarrow \theta + \alpha \sum_{t=0}^{N} \nabla_{\theta}\tilde{J}(s_t;\theta) \sum_{m=t}^{N} \lambda^{m-t} d_m, \qquad d_m = r_m + \tilde{J}(s_{m+1};\theta) - \tilde{J}(s_m;\theta), $$
and for discounted problems as
$$ \theta \leftarrow \theta + \alpha \sum_{t=0}^{N} \nabla_{\theta}\tilde{J}(s_t;\theta) \sum_{m=t}^{N} (\gamma\lambda)^{m-t} d_m, \qquad d_m = r_m + \gamma\,\tilde{J}(s_{m+1};\theta) - \tilde{J}(s_m;\theta). $$
This can be algebraically rearranged so that we can perform the updates in an online fashion:
$$ \theta_{t+1} = \theta_t + \alpha_t\, d_t\, z_t, \qquad z_t = \gamma\lambda\, z_{t-1} + \nabla_{\theta}\tilde{J}(s_t;\theta_t), $$
where $z_t$ is the eligibility-trace vector.
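The online form above can be sketched in a few lines of Python with linear features; the chain, the design matrix, the discount factor, and the step size below are illustrative assumptions, not values from the text.

```python
import numpy as np

# Illustrative linear TD(lambda): chain, features and constants are assumptions.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])
Phi = np.array([[1.0, 0.0],      # design matrix: one feature row phi(s) per state
                [0.5, 0.5],
                [0.0, 1.0]])

gamma, lam, alpha = 0.95, 0.7, 0.05
theta = np.zeros(2)
z = np.zeros(2)                  # eligibility-trace vector

rng = np.random.default_rng(0)
s = 0
for t in range(20000):
    s_next = rng.choice(3, p=P[s])
    delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta   # temporal difference
    z = gamma * lam * z + Phi[s]                                   # accumulate trace
    theta = theta + alpha * delta * z                              # online TD(lambda) step
    s = s_next

print(theta)
```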
It is known that the above TD($\lambda$) update, under mild conditions, converges to the fixed point $\theta^{*}$ of
$$ \Phi\theta^{*} = \Pi\, T^{(\lambda)}(\Phi\theta^{*}), $$
where $T^{(\lambda)}$ is the Bellman operator,
$\Pi = \Phi(\Phi^{\top} D \Phi)^{-1}\Phi^{\top} D$ is the projection operator onto the range space of $\Phi$,
$\Phi$ is the design matrix whose rows are the feature vectors $\phi(s)^{\top}$, and
$D = \operatorname{diag}(d(s))$, where $d$ is the steady-state distribution of the Markov Decision Process under $\pi$.
Based on the above equation one has the following Value Iteration update:
$$ \Phi\theta_{k+1} = \Pi\, T^{(\lambda)}(\Phi\theta_k). $$
Though these algorithms have strong convergence results, a fundamental problem that persists is that they are not true gradient-based methods. To illustrate, note that for $\lambda = 0$ we get the familiar TD(0) update:
$$ \theta_{t+1} = \theta_t + \alpha_t\left(r_t + \gamma\,\phi(s_{t+1})^{\top}\theta_t - \phi(s_t)^{\top}\theta_t\right)\phi(s_t). $$
A true gradient descent algorithm based on the following objective function for the 1-step Bellman error,
$$ \tfrac{1}{2}\,\mathbb{E}_{d}\big[\delta_t^{2}\big], $$
where $\delta_t = r_t + \gamma\,\phi(s_{t+1})^{\top}\theta - \phi(s_t)^{\top}\theta$, gives the following update:
$$ \theta_{t+1} = \theta_t + \alpha_t\,\delta_t\left(\phi(s_t) - \gamma\,\phi(s_{t+1})\right), $$
which differs from TD(0) by the additional $-\gamma\,\phi(s_{t+1})$ term.
The other issue is that it seems more intuitive to look at the following objective function,
$$ \big\| \Phi\theta - \Pi\, T(\Phi\theta) \big\|_{D}^{2}, $$
where $\Pi$, $T$ and $D$ are as defined before. We shall call this the Mean Squared Projected Bellman Error (MSPBE).
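To make the objective concrete, the following sketch evaluates the MSPBE of a linear architecture on a toy chain by forming the stationary distribution, the projection matrix, and the one-step Bellman operator explicitly; all numerical values are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative MSPBE evaluation for a linear architecture; all numbers are assumptions.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])
gamma = 0.95

# Steady-state distribution d of the chain (left eigenvector of P for eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d = d / d.sum()
D = np.diag(d)

# Projection onto the range space of Phi with respect to the D-weighted norm.
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

def mspbe(theta):
    v = Phi @ theta
    Tv = r + gamma * P @ v                # one-step Bellman operator applied to Phi @ theta
    e = v - Pi @ Tv                       # projected Bellman error
    return e @ D @ e                      # squared D-weighted norm

print(mspbe(np.zeros(2)), mspbe(np.array([5.0, 15.0])))
```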
2.1 Derivation
We aim at minimizing the following objective function,
$$ \mathrm{MSPBE}(\theta) = \big\| \Phi\theta - \Pi\, T(\Phi\theta) \big\|_{D}^{2}, $$
where $\Pi = \Phi(\Phi^{\top} D \Phi)^{-1}\Phi^{\top} D$ is the projection operator onto the range space of $\Phi$,
$T$ is the 1-step Bellman operator,
$$ (T J)(s) = \mathbb{E}\big[r(s, \pi(s)) + \gamma J(s')\big], $$
and $D = \operatorname{diag}(d(s))$, where $d$ is the steady-state distribution.
This objective function was introduced by Sutton et al. and is known as the Mean Squared Projected Bellman Error.
A rough idea that we propose here is to modify the TD(0) algorithm above, which takes as input a trajectory of states simulated according to policy $\pi$ and outputs $\theta$, and which is not truly gradient based because of the update
$$ \theta_{t+1} = \theta_t + \alpha_t\,\delta_t\,\phi(s_t), $$
into an algorithm that performs genuine gradient descent on the MSPBE.
Notice that by definition of the projection operator we can write $\Pi\, T(\Phi\theta) = \Phi w_{\theta}$, where
$$ w_{\theta} = \arg\min_{w} \big\| \Phi w - T(\Phi\theta) \big\|_{D}^{2}, $$
where $T(\Phi\theta) - \Phi\theta$ is the one-step Bellman error for the estimate $\Phi\theta$ of $J^{\pi}$, and $\Pi\, T(\Phi\theta) = \Phi w_{\theta}$ for some $w_{\theta} \in \mathbb{R}^{d}$ since the projection lies in the range space of $\Phi$.
To compute $w_{\theta}$, a gradient descent would result in the following update:
$$ w_{t+1} = w_t + \beta_t\left(r_t + \gamma\,\phi(s_{t+1})^{\top}\theta - \phi(s_t)^{\top} w_t\right)\phi(s_t). $$
With this updated value of $w$ we solve the next minimization problem,
$$ \min_{\theta} \big\| \Phi\theta - \Phi w \big\|_{D}^{2}. $$
A gradient descent would result in the following update:
$$ \theta_{t+1} = \theta_t + \alpha_t\left(\phi(s_t)^{\top} w - \phi(s_t)^{\top}\theta_t\right)\phi(s_t). $$
The resulting algorithm thus becomes:
Input: a trajectory of states simulated according to policy $\pi$.
Output: the parameter $\theta$.
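The sketch below shows one plausible reading of the resulting algorithm with linear features: a fast inner gradient step that tracks $w$ (the projection of the Bellman target) and a slower outer step that moves $\theta$ toward $\Phi w$. The chain, the features, and the two step sizes are our assumptions, not the authors' exact choices.

```python
import numpy as np

# One plausible reading of the proposed two-step update with linear features.
# All constants and the sampled-gradient forms below are illustrative assumptions.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])
gamma, alpha, beta = 0.95, 0.01, 0.1   # beta > alpha: w is tracked on the faster timescale

theta = np.zeros(2)
w = np.zeros(2)

rng = np.random.default_rng(0)
s = 0
for t in range(50000):
    s_next = rng.choice(3, p=P[s])
    # Step 1: move w toward the projection of the Bellman target T(Phi theta).
    target = r[s] + gamma * Phi[s_next] @ theta
    w += beta * (target - Phi[s] @ w) * Phi[s]
    # Step 2: move theta toward Phi w, i.e. toward Pi T(Phi theta).
    theta += alpha * (Phi[s] @ w - Phi[s] @ theta) * Phi[s]
    s = s_next

print(theta)
```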
2.2 A sketch of convergence analysis
Let us say that we have a $\theta$ at the beginning of some iteration. Let us look at the first update equation,
$$ w_{t+1} = w_t + \beta_t\left(r_t + \gamma\,\phi(s_{t+1})^{\top}\theta - \phi(s_t)^{\top} w_t\right)\phi(s_t). $$
This update converges to a $w^{*}$ such that $\Phi w^{*} = \Pi\, T(\Phi\theta)$.
With this fixed $w^{*}$ we consider the next update equation,
$$ \theta_{t+1} = \theta_t + \alpha_t\left(\phi(s_t)^{\top} w^{*} - \phi(s_t)^{\top}\theta_t\right)\phi(s_t). $$
This update converges to a $\theta'$ such that $\Phi\theta' = \Phi w^{*} = \Pi\, T(\Phi\theta)$.
We use this value in the first update to get a new $w^{*}$ such that $\Phi w^{*} = \Pi\, T(\Phi\theta')$.
Thus we see that running these two updates gives us a sequence $\{\theta_k\}$ which satisfies
$$ \Phi\theta_{k+1} = \Pi\, T(\Phi\theta_k). $$
This last equation is known to converge, since $\Pi T$ is a contraction with respect to the $D$-weighted norm.
2.3 Implementation Details and Extensions to Non-linear Function Approximation and Control Problems
For implementation's sake we have used the following scheme, where each gradient update step is performed sequentially, without waiting for the $w$ update to converge.
Input: a trajectory of states simulated according to policy $\pi$.
Output: the parameter $\theta$.
For non-linear function approximation, say $\tilde{J}(s;\theta)$ where $\theta$ is a learnable parameter, we have used the following scheme, with a second approximator $\tilde{J}(s; w)$ playing the role of $\Phi w$.
Input: a trajectory of states simulated according to policy $\pi$.
Output: the parameter $\theta$.
This suggests the following critic network design.
2.3.1 Implementation as a deep neural network
Let us say we have the following transition: from state $s_t$ we receive reward $r_t$ and move to state $s_{t+1}$.
We can implement this algorithm by designing a pair of cooperative neural networks. The arrangement is cooperative in the sense that the output of each network reinforces the parameters of the other network.
Here the two networks parameterize $\tilde{J}(\cdot\,; w)$ and $\tilde{J}(\cdot\,; \theta)$ respectively.
The box over $r_t + \gamma\,\tilde{J}(s_{t+1};\theta)$ and $\tilde{J}(s_t; w)$ represents the error which is backpropagated through the left-hand network.
The box over $\tilde{J}(s_t; w)$ and $\tilde{J}(s_t;\theta)$ represents the error that is backpropagated through the right-hand network.
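The following PyTorch sketch shows one way such a cooperative pair of value networks could be wired for a single transition: one network is regressed onto the Bellman target built from the other, and the other is regressed onto the first network's prediction. The architectures, losses, and optimizers are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative cooperative critic pair; sizes, losses and optimizers are assumptions.
def make_value_net(state_dim):
    return nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 1))

state_dim, gamma = 4, 0.99
theta_net = make_value_net(state_dim)   # network carrying the parameter "theta"
w_net = make_value_net(state_dim)       # network carrying the parameter "w"
opt_theta = torch.optim.Adam(theta_net.parameters(), lr=1e-3)
opt_w = torch.optim.Adam(w_net.parameters(), lr=1e-3)

def cooperative_update(s, rwd, s_next):
    # First box: Bellman target from theta_net vs. prediction of w_net.
    with torch.no_grad():
        bellman_target = rwd + gamma * theta_net(s_next)
    loss_w = nn.functional.mse_loss(w_net(s), bellman_target)
    opt_w.zero_grad(); loss_w.backward(); opt_w.step()

    # Second box: prediction of w_net is the target for theta_net.
    with torch.no_grad():
        target = w_net(s)
    loss_theta = nn.functional.mse_loss(theta_net(s), target)
    opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()

# Example call on a dummy transition (batch of one).
s = torch.randn(1, state_dim)
s_next = torch.randn(1, state_dim)
rwd = torch.tensor([[1.0]])
cooperative_update(s, rwd, s_next)
```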
2.3.2 Control Problems using Q-factors
For control problems we have the following alternative algorithm based on Q-factors, where the Q-factor is approximated by $\tilde{Q}(s, a; \theta)$ with a learnable parameter $\theta$, actions are chosen greedily (or $\epsilon$-greedily) with respect to the current estimate, and a second parameter $w$ plays the same role as before.
Input: an initial state.
Output: the parameter $\theta$.
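A minimal sketch of how the same two-parameter idea could be carried over to Q-factors with linear state-action features and an $\epsilon$-greedy behaviour policy is given below; the randomly generated MDP, the features, and all constants are assumptions made for illustration.

```python
import numpy as np

# Illustrative Q-factor version of the two-parameter update; everything below
# (environment, features, epsilon, step sizes) is an assumption for the sketch.
n_states, n_actions, gamma, eps = 3, 2, 0.95, 0.1
alpha, beta = 0.01, 0.1

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution
R = rng.normal(size=(n_states, n_actions))                        # expected rewards

def phi(s, a):                       # one-hot features over state-action pairs
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

theta = np.zeros(n_states * n_actions)
w = np.zeros(n_states * n_actions)

def greedy(s, params):
    return int(np.argmax([phi(s, a) @ params for a in range(n_actions)]))

s = 0
for t in range(20000):
    a = rng.integers(n_actions) if rng.random() < eps else greedy(s, theta)
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = greedy(s_next, theta)                         # greedy target action
    target = R[s, a] + gamma * phi(s_next, a_next) @ theta
    w += beta * (target - phi(s, a) @ w) * phi(s, a)       # track projected target
    theta += alpha * (phi(s, a) @ w - phi(s, a) @ theta) * phi(s, a)
    s = s_next

print(theta.reshape(n_states, n_actions))
```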
2.3.3 Control Problems using Actor-Critic
Since our algorithm is in essence a critic, it can be used in conjunction with any actor network to solve control problems.
We shall see more of this in the experiments section.
3 A simple example
Let us consider the following problem: a Markov chain with 3 states. The transition probabilities are as marked in the figure. Each transition carries the reward shown there, except for the self-transitions at two of the states, which carry a different reward. We take the corresponding invariant distribution over the three states.
We have a simple state-feature representation for the three states, where a small positive number $\epsilon$ is included in the features to ensure we don't run into zero gradient estimates.
We first plot the objective function used by the gradient-based algorithms, the Mean Squared Bellman Error, as a function of the tunable parameter $\theta$.
Let us compare this with the MSPBE objective function.
Let us now look at how TD(0) performs. TD(0) is known to track the MSPBE fixed point, and it also has the fastest convergence rate compared to the gradient-based methods.
We plot the MSPBE attained by the algorithm along with the updates of the tunable parameter $\theta$.
We first show the result for one choice of the initial point of $\theta$ (Figure 7). From the graph we see that it converges to a suboptimal solution.
The learning rate has been kept fixed instead of using a decreasing step size. This results in some noise, but we ignore that, as the point is made clear that a local minimum of the objective is reached.
We also confirm that when starting from other initial values of $\theta$ we reach the global minimum (Figure 7).
Again, a bit of overshoot is visible because of the randomness in the simulation and the use of constant step sizes.
We now look at the gradient-based algorithms, which aim at tracking the Mean Squared Bellman Error.
We plot the objective against the number of updates; the results are also confirmed by looking at the plot of $\theta$. We start with 3 different starting positions of $\theta$ (Figure 8).
That the gradient-based algorithms are slower is confirmed by the learning rates we had to use, which are several times larger than those of the TD-type algorithm.
The $\theta$ that is optimal for the Mean Squared Bellman Error may coincide with the optimum of the MSPBE; however, this need not be true, as illustrated in Figure 7. This figure tracks the MSPBE with changing $\theta$ while using the gradient-based algorithms. We see that for negative values of $\theta$ the behaviour is consistent with the MSPBE plot. However, for positive values of $\theta$ the local optimum of the MSPBE lies at a different point from the one at which the gradient-based algorithm settles.
Thus the gradient-based algorithms do not track the MSPBE. Let us look at the GTD2 algorithm, which does track the MSPBE (Figure 7).
However, GTD2 converges quite slowly. Its learning rate has been fixed at a value several times that of the gradient-based algorithms.
We now see that the proposed algorithm not only tracks the MSPBE (Figure 7), but does so with a learning rate several times smaller than that of the GTD2-based algorithm.
The derivation also shows that the proposed method can be implemented very easily, since it is itself a gradient-based algorithm.
4 Experiments
To analyze the effects of the proposed network update rule we carried out experiments in two different existing frameworks: Deep Q-Network (DQN) [2] and Deep Deterministic Policy Gradient (DDPG) [1]. We use these existing frameworks because we aim to investigate the impact of the proposed update rule in different existing settings (implementation code is available at https://github.com/kavitawagh/RLProject.git). DQN and DDPG are benchmark algorithms in the discrete action-space and continuous action-space settings, respectively.
In the following sections we explain how we implemented the proposed update rule in the DQN and DDPG frameworks and present the results of our experiments.
4.1 DQN
For MDPs with a discrete action space, DQN is a non-linear approximation of the Q-function implemented using a neural network. The state $s$ is the input to the Q-network, and for each action $a$ in the action space the Q-network outputs the Q-value $Q(s, a)$. A greedy policy is used to select a single Q-value out of all the outputs of the Q-network, namely $\max_a Q(s, a)$.
Non-linear function approximation is known to be unstable or even to diverge when used to approximate the Q-value function. One of the reasons for the instability is the correlation between Q-values and target Q-values when the same network is used to predict both. Hence, in the DQN framework, target Q-values are predicted by a separate target Q-network whose weight parameters are updated periodically. DQN also uses a replay buffer for training the Q-network with batches of transitions. Every time the agent takes an action in the environment, the state-transition tuple is stored in the replay buffer. In the network training step, a random batch of transition tuples is sampled from the replay buffer and used for network optimization. Figure 9 presents the original DQN algorithm.
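For reference, a compact sketch of the replay-buffer mechanics just described is given below; the capacity and batch size are illustrative assumptions rather than the values used in [2].

```python
import random
from collections import deque

import torch

# Sketch of the replay-buffer mechanics described above; capacity and batch size
# are illustrative assumptions.
class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.storage = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        # Store one transition tuple each time the agent acts in the environment.
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        # Draw a random batch and stack each field into a tensor.
        batch = random.sample(self.storage, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(s_next, dtype=torch.float32),
                torch.as_tensor(done, dtype=torch.float32))
```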

4.1.1 Modification to the algorithm
Figure 10 shows a pseudo-flowchart for the DQN algorithm along with the proposed update rule. The blocks with black borders indicate processing steps that belong to the original DQN algorithm and that we use unchanged in the modified algorithm. The shapes in blue indicate processing steps of the original DQN algorithm that we have replaced with our own processing steps, highlighted in green. The names in circles indicate the relation between our implementation and the network architecture proposed in Section 2.3.1.
In every environment step, we sample a batch of transitions from the replay buffer. Feeding the next states to the target Q-network we get the target Q-values. The Bellman operator is applied to these to obtain the target used to optimize the main Q-network. The main Q-network is then optimized by minimizing the mean squared error (MSE) loss between its prediction and this target. In the original DQN algorithm, before optimizing the main Q-network, the weights of the target Q-network are set equal to the weights of the main Q-network, and then the main Q-network is optimized. In our modification, instead of just copying the weights, we take an optimization step on the target Q-network.
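In code, the difference between the two schemes can be summarised as follows; the network sizes, optimizers, and the exact loss used for the extra optimization step on the target Q-network reflect our reading of the modification and are assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

# Sketch of the modified target-network update (our reading of the change).
def make_q_net(state_dim=4, n_actions=2):
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_q_net(), make_q_net()
main_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
target_opt = torch.optim.Adam(target_net.parameters(), lr=1e-3)
gamma = 0.99

def original_target_update():
    # Vanilla DQN: periodically overwrite the target network with the main network.
    target_net.load_state_dict(q_net.state_dict())

def modified_update(s, a, r, s_next, done):
    # Main network: regress Q(s, a) onto the Bellman target from the target network.
    # a holds integer action indices; done holds 0/1 termination flags.
    with torch.no_grad():
        y_main = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_main = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss_main = nn.functional.mse_loss(q_main, y_main)
    main_opt.zero_grad(); loss_main.backward(); main_opt.step()

    # Modification: instead of copying weights, take a gradient step on the
    # target network toward the Bellman target built from the main network.
    with torch.no_grad():
        y_target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    q_tgt = target_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss_target = nn.functional.mse_loss(q_tgt, y_target)
    target_opt.zero_grad(); loss_target.backward(); target_opt.step()

# Dummy batch of transitions for a quick smoke test.
B = 8
s, s_next = torch.randn(B, 4), torch.randn(B, 4)
a = torch.randint(0, 2, (B,))
r, done = torch.randn(B), torch.zeros(B)
modified_update(s, a, r, s_next, done)
```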

4.1.2 Environment
We trained our Q-network on the OpenAI Gym CartPole environment. Cart-Pole is the classical reinforcement learning problem where the aim is to balance a pole upright on a cart by moving the cart left or right. The state is represented by a list of observations [cart position, cart velocity, pole angle, pole velocity at the tip]. The actions are pushing the cart left or right. The reward is +1 for every step taken by the agent, including the termination step. An episode terminates if the pole angle exceeds the allowed threshold or the accumulated reward reaches 200.
4.1.3 Results
We trained the Q-network using the original DQN algorithm and the DQN algorithm with the proposed update rule. Figure 11 shows the plot of the actual training reward against the episode number. We can see that the results of the proposed update rule are comparable to those of the original DQN algorithm. After reaching the highest reward, the reward drops to near zero in some episodes. This is possible because of the stochastic nature of the problem and the behaviour policy being used: we use an epsilon-greedy policy with a minimum epsilon value of 0.1, that is, the agent can take a random action with probability 0.1.
4.2 DDPG
DDPG is an actor-critic algorithm based on the deterministic policy gradient for problems with a continuous action space. It uses non-linear approximations of the critic, which models the Q-value function $Q(s, a)$, and the actor, which models the deterministic policy $\mu(s)$. The critic network predicts the Q-value given the state and action as input; the actor network predicts the action given the state as input.
Based on the current network parameters, the actor network predicts an action. Using this predicted action, the critic network predicts the Q-value. This Q-value is used as feedback for the actor, saying how good or bad the action predicted by the actor is. The key concept that DDPG is based on is the form of the deterministic policy gradient: the gradient of the performance objective with respect to the actor network parameters has a nicely implementable form involving the gradient of the Q-value function and the gradient of the policy (actor) network with respect to its parameters. Optimizing the actor network with a gradient-ascent step in the direction of the policy gradient drives it toward an optimal policy. Formally, the policy gradient is given by
$$ \nabla_{\theta^{\mu}} J \approx \mathbb{E}\Big[\nabla_{a} Q(s, a; \theta^{Q})\big|_{a=\mu(s;\theta^{\mu})}\; \nabla_{\theta^{\mu}}\mu(s; \theta^{\mu})\Big], $$
where $J$ is the performance (state-value) objective,
$Q(s, a; \theta^{Q})$ is the Q-value function (critic network), and
$\mu(s; \theta^{\mu})$ is the deterministic policy (actor network with parameters $\theta^{\mu}$).
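In practice this gradient is realised by backpropagating through the critic into the actor; a minimal PyTorch sketch follows, in which the network sizes and the learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal actor update implementing the deterministic policy gradient (sketch).
state_dim, action_dim = 24, 4          # illustrative sizes

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=1))

critic = Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_step(s):
    # The gradient of Q(s, mu(s)) flows through the critic into the actor,
    # so minimising -Q is a gradient-ascent step on the policy objective.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

actor_step(torch.randn(32, state_dim))   # dummy batch of states
```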
Figure 12 presents the original DDPG algorithm.

4.2.1 Modification to the algorithm
Figure 13 shows a pseudo-flowchart for the DDPG algorithm along with the proposed update rule. The blocks with black borders indicate processing steps that belong to the original DDPG algorithm and that we use unchanged in the modified algorithm. The shapes in blue indicate processing steps of the original DDPG algorithm that we have replaced with our own processing steps, highlighted in green. The names in circles indicate the relation between our implementation and the network architecture proposed in Section 2.3.1.
There are two actor networks, a main actor and a target actor, and two critic networks, a main critic and a target critic. The predictions of the main networks are actually used to select actions for the agent, while the target networks are used to calculate the Bellman operator. First, a batch of transitions $(s, a, r, s')$ is sampled from the replay buffer. Feeding $s'$ as input to the target actor, we get $a' = \mu'(s')$. This $a'$ together with $s'$ in the target critic gives $Q'(s', a')$. Then we calculate the Bellman operator value, which is used as the target to optimize the main critic network. In the original DDPG algorithm, the weights of both target networks are updated using a soft update rule. In our modification, we replace the weight update step for the target critic by an optimization step.
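The two target-critic updates can be contrasted in code as below; the soft-update coefficient, the network sizes, and the loss chosen for the extra optimization step reflect our reading of the modification and are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

# Target-critic update: soft update (original DDPG) vs. optimisation step (proposed).
# Network sizes, tau and learning rates are illustrative assumptions.
state_dim, action_dim, gamma, tau = 24, 4, 0.99, 0.005

def make_critic():
    return nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                         nn.Linear(64, 1))

critic, target_critic = make_critic(), make_critic()
target_actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, action_dim), nn.Tanh())
target_critic_opt = torch.optim.Adam(target_critic.parameters(), lr=1e-3)

def q(net, s, a):
    return net(torch.cat([s, a], dim=1))

def soft_update():
    # Original DDPG: target <- tau * main + (1 - tau) * target.
    for p_t, p in zip(target_critic.parameters(), critic.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)

def modified_target_critic_step(s, a, r, s_next):
    # Proposed rule (as we read it): regress the target critic onto the Bellman
    # target built from the main critic and the target actor, instead of soft-copying.
    with torch.no_grad():
        y = r + gamma * q(critic, s_next, target_actor(s_next)).squeeze(1)
    loss = nn.functional.mse_loss(q(target_critic, s, a).squeeze(1), y)
    target_critic_opt.zero_grad(); loss.backward(); target_critic_opt.step()

# Smoke test on a dummy batch.
B = 8
s, s_next = torch.randn(B, state_dim), torch.randn(B, state_dim)
a, r = torch.randn(B, action_dim), torch.randn(B)
modified_target_critic_step(s, a, r, s_next)
soft_update()
```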

4.2.2 Environment
We trained our actor-critic networks on the OpenAI Gym BipedalWalker-v3 environment. BipedalWalker is a robot with a LIDAR system and two legs with four joints in total, two in each leg. The goal is to train the robot to walk as fast as possible on simple terrain (with no obstacles). The environment state contains the LIDAR sensor measurements, joint positions, velocities, etc. The action is to apply torques to the four joints. The torque for each joint is a value in [-1, 1], which makes this a continuous action-space problem. Reward is given for moving forward, totalling 300+ points up to the far end of the terrain. If the robot falls, it gets a reward of -100.
4.2.3 Results
We trained the actor-critic using the original DDPG algorithm and the DDPG algorithm with the proposed update rule. Figure 14 shows the plots from actual training. The number of steps the robot is able to balance on its legs is larger in modified DDPG than in original DDPG (Figure 13(a)). After visualizing the training episodes, we found that the agent trained with original DDPG tries to move forward without first taking a properly balanced pose and falls immediately, while the agent trained with modified DDPG first tries to take a balanced pose before attempting to move forward. Since the number of steps the agent stays alive differs from episode to episode, we show the average reward per episode in Figure 13(b). This figure indicates that modified DDPG explores more than original DDPG. Figure 13(c) shows the Q-value plot for both algorithms.
The positive reward (Figure 13(d)) and negative reward (Figure 13(e)) are both larger in magnitude for modified DDPG. This is because we are plotting the total positive and negative rewards accumulated in an episode, and since modified DDPG survives for more time steps it naturally accumulates more reward in both the positive and negative directions. In Figure 13(f) the critic loss of modified DDPG is always smaller than the critic loss of original DDPG.
Figure 14: training plots for original and modified DDPG (panels include positive reward and negative reward).
5 Conclusion
In our project, we came up with a new update rule for critic networks, in which we use and optimize two critic networks. We implemented the proposed update with the DQN and DDPG algorithms on the CartPole and BipedalWalker environments, respectively. The results of our experiments showed that the proposed update gives comparable results for both algorithms. In the future we aim to implement the proposed update rule in other RL algorithms such as TRPO and proximal policy optimization, and to analyze its performance. We also aim to rigorously prove the convergence of the proposed update rule.