1 Introduction
Many modern neural networks consume an enormous amount of computing resources. This is due to long training times, large data sets, and the large number of parameters in the networks. Some recent studies [blalock2020state, hinton_sparse, frankle2018lottery] showed that many weights are redundant and can be removed without loss (or with a small loss) of neural network performance.
On the other hand, input data for some neural network tasks are sequences of highly correlated frames. Video and audio processing and RL tasks are good examples. In such tasks, what a neural network saw at step t-1 is very similar to what it sees now at step t. Some studies propose various optimization algorithms for neural networks handling such data [sparnet2020, delta_rnn, oconnor_sigma]. These approaches are based on the idea of asynchronously updating the states of only those neurons that changed significantly compared to the previous step.
Neural network layers can be represented as a combination of matrix-vector multiplication and a nonlinear transformation. Therefore, any value of the output vector can be represented as
$$y_j = f\left(\sum_i w_{ji} x_i + b_j\right),$$
which is a superposition of a nonlinear function $f$ and a Multiply and Accumulate (MAC) operation. A MAC operation is the summation of the products of the input vector elements and the respective weights. If at least one operand in any such multiplication is zero, the multiplication can be omitted. We will call a multiplication of two numbers significant if both of its operands are nonzero.
In this study, we focused on RL tasks and applied a combination of the two above-mentioned approaches to optimize them. The first approach yields a 2x to 8x reduction in the number of significant neural network multiplications; the second one yields a 10x to 50x reduction. Combined, they yield a 20x to 150x reduction in the number of significant multiplications without substantial performance losses; sometimes the performance even improved. To the best of our knowledge, a combination of these methods has never been applied to RL tasks.
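The notion of a significant multiplication can be illustrated with a short sketch. The function names and the example vectors below are illustrative assumptions, not part of the original study; the code only shows that products with a zero operand can be skipped without changing the MAC result.

```python
def count_significant_multiplications(inputs, weights):
    """Count products x_i * w_i in which both operands are nonzero."""
    return sum(1 for x, w in zip(inputs, weights) if x != 0 and w != 0)


def mac(inputs, weights, bias=0.0):
    """Multiply-accumulate that skips every product with a zero operand."""
    acc = bias
    for x, w in zip(inputs, weights):
        if x != 0 and w != 0:  # only significant multiplications are executed
            acc += x * w
    return acc


# Hypothetical example data: a sparse input and a sparse weight vector.
x = [0.0, 1.5, 0.0, 2.0]
w = [0.3, 0.0, 0.7, 0.5]

significant = count_significant_multiplications(x, w)
result = mac(x, w)
print(significant, "significant multiplication(s) out of", len(x))
print("MAC result:", result)
```

Here only one of the four products has two nonzero operands, so a single multiplication is enough to obtain the same sum as the dense computation.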
It is worth noting that brain neurons also work asynchronously and send signals to each other only when necessary. In addition, there are no dense layers in the brain, while they are characteristic of classic neural networks. This suggests that the combination of the proposed methods is biologically inspired.
2 Methods
2.1 Deep Q-Network
In RL tasks, an agent receives the current environment state $s_t$ as input, after which it selects action $a_t$, then goes to a new state $s_{t+1}$ and receives a certain reward $r_t$. The agent's goal is to maximize the sum of rewards.
More strictly, the environment is formalized as a Markov decision process. A Markov decision process (MDP) is a tuple $(S, A, P, R)$, where $S$ is a set of possible states and $A$ is a set of possible actions. $P$ is the function describing transitions between states, i.e. $P(s' \mid s, a)$ is the probability of getting into state $s'$ at the next step when selecting action $a$ in state $s$. $R$ is the function describing rewards; $R(s, a, s')$ determines how big a reward the agent receives when transitioning from state $s$ to state $s'$ by selecting action $a$.
A strategy that an agent uses to select its actions depending on the state is called a policy and is usually denoted by $\pi_\theta$, where $\theta$ denotes the policy parameters.
One way to solve an RL task is to use Q-function-based approaches. A Q-function has two parameters, $s$ and $a$. $Q^\pi(s, a)$ determines what reward the agent will receive if it performs action $a$ in state $s$ and then follows policy $\pi$. If complete information about an environment is available, the exact value of a Q-function can be calculated. However, the knowledge of the world is usually incomplete and the number of possible states is enormous. Therefore, a Q-function has to be approximated using neural networks.
This approach was successfully demonstrated by DeepMind in [mnih2015human]. The neural network architecture presented in that study is called DQN (Deep Q-Network). For the purposes of this study, it is the main architecture for optimization.
This neural network is as follows:
The neural network receives an 84x84x4 tensor from the environment as input. It is fed to the first convolutional layer, which consists of 32 8x8 filters with stride 4 and ReLU activation. The second layer consists of 64 4x4 filters with stride 2 and ReLU activation. The third layer consists of 64 3x3 filters with stride 1 and ReLU activation. They are followed by a dense layer with 512 neurons and ReLU activation. The output is another dense layer with a number of neurons equal to the number of actions in the video game. Depending on the video game, the number of actions may vary from 4 to 18.
The neural network structure is shown in the following table:
Layer                              Input Shape     Param
Conv2d1 (8x8, stride=4)            [4, 84, 84]     8,224
Conv2d2 (4x4, stride=2)            [32, 20, 20]    32,832
Conv2d4 (3x3, stride=1)            [64, 9, 9]      36,928
Flatten                            [64, 7, 7]      0
Dense1 (3136, 512)                 [3136]          1,606,144
Dense2 (512, number of actions)    [512]
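The shapes and parameter counts in the table can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming unpadded convolutions and one bias per filter or neuron; the helper names are illustrative and do not come from the original study.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded convolution."""
    return (size - kernel) // stride + 1


def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per filter."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_in * n_out + n_out


# (kernel size, stride, number of filters) for the three convolutional layers
conv_layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

channels, size = 4, 84          # network input: 4 stacked 84x84 frames
shapes, params = [], []
for kernel, stride, c_out in conv_layers:
    params.append(conv_params(kernel, channels, c_out))
    size = conv_out(size, kernel, stride)
    channels = c_out
    shapes.append((channels, size, size))

flat = channels * size * size   # size of the flattened feature vector
params.append(dense_params(flat, 512))

print("feature map shapes:", shapes)
print("flattened size:", flat)
print("parameter counts:", params)
```

The last dense layer is left out because its size depends on the number of actions in the chosen game.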
Double Q-learning [double_q_learning] was also used for training.
We experimented and optimized within the following RL environments: Breakout and SpaceInvaders.
Image 2 shows frames from these video games.
2.2 Pruning, Lottery Ticket
Pruning, i.e. removal of unnecessary neural network weights, is one of the ways to obtain structural sparsity within a neural network.
Neural network pruning is not a novel idea; it was conceived back in the 90s [lecun1990optimal, hassibi1993surgeon]. There are various strategies to identify redundant connections in a neural network (by absolute value, by analyzing the Hessian, etc.).
For this study, we used weight pruning by absolute values: the closer a weight is to zero, the less significant such weight is.
The authors of [frankle2018lottery] discovered the following interesting fact: if at any stage of training the weights are pruned by absolute value, the remaining weights are reset to the values they had before training, and training is then resumed, the resulting sparse neural network trains well. However, this does not happen if the remaining weights are set to random (not initial) values. The performance of a sparse neural network after such training can turn out to be better than that of the unpruned neural network.
However, in practice [lotteryRL1, vischer2021lottery, frankle2018lottery] iterative magnitude pruning is usually used: the neural network is trained, then a small percentage of its weights (10-20 %) is pruned, then the remaining weights are reset to their initial values, and this procedure is repeated $n$ times. Thus, for the pruning rate $r$, the fraction of pruned weights $p_i$ is calculated according to the following formula:
$$p_i = 1 - (1 - r)^i \qquad (1)$$
where $i$ is the pruning iteration and $r$ is the fraction of the remaining weights pruned at every iteration (pruning rate). For example, for $r = 0.2$:
i      0      1      2      3      4      5      6      7      8      9
p_i    0.000  0.200  0.360  0.488  0.590  0.672  0.738  0.790  0.832  0.866
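The pruning schedule can be reproduced with a short sketch. It assumes that every iteration removes the fraction r of the weights that are still present, so that the cumulative pruned fraction after i iterations is 1 - (1 - r)^i; the function name is illustrative.

```python
def pruned_fraction(rate, iteration):
    """Cumulative fraction of pruned weights after `iteration` pruning rounds."""
    return 1.0 - (1.0 - rate) ** iteration


# Schedule for a pruning rate of 20 % over ten iterations (i = 0..9),
# rounded to three decimals.
schedule = [round(pruned_fraction(0.2, i), 3) for i in range(10)]
for i, p in enumerate(schedule):
    print(i, p)
```

Because each round prunes a fixed share of the remaining weights, the schedule approaches full sparsity more and more slowly as the iterations go on.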
The authors of [lotteryRL1, vischer2021lottery] showed that this phenomenon is observed in RL tasks as well. The authors of [vischer2021lottery] explored this phenomenon in detail for both the DQN and PPO algorithms. However, they studied it on small MinAtar subtasks [minatar] (MinAtar is a simplified version of the classic Atari games).
2.3 Delta-Network
The main idea behind this approach [yousefzadeh2019asynchronous, sparnet2020] is to use temporal sparsity when working with sequential data.
The output values of neural network layer $k$ can be written as:
$$Y_k^t = W_k X_k^t + B_k \qquad (2)$$
$$X_{k+1}^t = f(Y_k^t) \qquad (3)$$
where $X_{k+1}^t$ is the output of layer $k$, $X_k^t$ is the input of layer $k$ (the output of layer $k-1$), $W_k$ is the weight matrix, and $B_k$ is the bias. In a conventional neural network, every new input vector $X_k^t$ at time $t$ requires a total recomputation of the output value $Y_k^t$, which takes $n \cdot m$ multiplications for an $m \times n$ weight matrix. However, the following should be noted:
$$\Delta X_k^t = X_k^t - X_k^{t-1} \qquad (4)$$
$$Y_k^t = W_k \left(X_k^{t-1} + \Delta X_k^t\right) + B_k \qquad (5)$$
$$Y_k^t = Y_k^{t-1} + W_k \Delta X_k^t \qquad (6)$$
Thus, it is possible to recompute the layer output values at time $t$ using equations 4, 5, 6 from the layer input changes that occurred relative to the state at time $t-1$.
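A small numeric sketch may help to see why the incremental form is attractive. It assumes a single dense layer stored as nested Python lists; the helper names and the example values are illustrative. Columns of the weight matrix that correspond to unchanged inputs are skipped entirely.

```python
def dense_forward(w, x, b):
    """Full recomputation: Y = W X + B."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(w, b)]


def delta_update(w, y_prev, x_prev, x_new):
    """Incremental update: Y_t = Y_{t-1} + W (X_t - X_{t-1})."""
    delta = [xn - xp for xn, xp in zip(x_new, x_prev)]
    y_new = list(y_prev)
    multiplications = 0
    for j, d in enumerate(delta):
        if d == 0:
            continue  # unchanged input: the whole column of W is skipped
        for i, row in enumerate(w):
            y_new[i] += row[j] * d
            multiplications += 1
    return y_new, multiplications


# Hypothetical 2x3 layer; only the last input changes between the two steps.
W = [[1.0, 2.0, 0.0],
     [0.5, -1.0, 3.0]]
B = [0.5, -0.5]
x_prev = [1.0, 2.0, 3.0]
x_new = [1.0, 2.0, 4.0]

y_prev = dense_forward(W, x_prev, B)
y_delta, mults = delta_update(W, y_prev, x_prev, x_new)
y_full = dense_forward(W, x_new, B)

print("incremental:", y_delta, "using", mults, "multiplications")
print("full:       ", y_full, "using", len(W) * len(W[0]), "multiplications")
```

Both paths give the same output, but the incremental one touches only the column of the single changed input.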
However, this observation does not lead to neural network optimization by itself. But we can introduce a threshold T for output value changes such that recomputation of succeeding neurons is started only when the change of an output value exceeds this threshold.
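The authors of [sparnet2020, yousefzadeh2019asynchronous] call this approach "Hysteresis Quantizer". To implement it, an additional variable has to be introduced into each neuron to record the last transmitted value. Thus, each neuron runs the following algorithm: it compares its new value with the last transmitted value and, if the absolute difference exceeds the threshold T, transmits the change and updates the stored value; otherwise nothing is transmitted.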
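The per-neuron logic can be sketched as follows. This is a minimal illustration under the assumptions stated in the text (one stored value per neuron and a fixed threshold T); the class name `DeltaNeuron` and the attribute `last_sent` are hypothetical and do not come from the cited works.

```python
class DeltaNeuron:
    """Transmits a change only when it exceeds the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_sent = 0.0  # last value transmitted to the next layer

    def step(self, value):
        """Return the delta to propagate, or None if the change is too small."""
        delta = value - self.last_sent
        if abs(delta) > self.threshold:
            self.last_sent = value  # remember what downstream neurons now know
            return delta
        return None


neuron = DeltaNeuron(threshold=0.1)
events = [neuron.step(v) for v in [0.5, 0.55, 0.58, 0.7, 0.7]]
print(events)
```

Comparing against the last transmitted value, rather than against the previous value, keeps small changes from being lost: they accumulate until their sum crosses the threshold.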
GrAI Matter Labs [15, 16] implemented this algorithm in the NeuronFlow processor architecture [neuronflow_processor, neuronflow_hybrid].
2.4 Algorithm
In the previous sections, we described two neural network optimization algorithms. By combining them, we get the following neural network optimization algorithm:
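Stage 1:
1. Train the neural network in the environment.
2. Prune r % of the remaining weights, selecting those with the smallest absolute values.
3. Reset the remaining weights to their original (initial) values.
4. Repeat steps 1-3 n times.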
Using this algorithm, we obtain a set of structurally sparse neural networks with different degrees of sparsity. The number of neural networks in the set equals the number of pruning algorithm iterations.
Stage 2: Then we apply the delta network algorithm to these neural networks. As a result, we get a set of new neural networks that use both structural and temporal sparsity. The choice of one neural network from this set depends on the desired balance between the number of significant multiplications and neural network performance.
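Stage 1 can be sketched in a few lines. This is only an illustration: the `train` function is a hypothetical placeholder that perturbs the weights instead of running RL training, weights are kept in a flat list, and pruning is applied globally rather than per layer.

```python
import random


def magnitude_prune(weights, mask, rate):
    """Drop the `rate` fraction of surviving weights that are closest to zero."""
    alive = [i for i, m in enumerate(mask) if m]
    n_prune = int(len(alive) * rate)
    alive.sort(key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in alive[:n_prune]:
        new_mask[i] = 0
    return new_mask


def train(weights, mask):
    """Hypothetical stand-in for RL training: perturbs the surviving weights."""
    return [(w + random.gauss(0.0, 0.1)) * m for w, m in zip(weights, mask)]


def lottery_ticket(initial, rate, iterations):
    """Iterative magnitude pruning with a reset to the initial weights."""
    mask = [1] * len(initial)
    tickets = []
    for _ in range(iterations):
        # Reset: every round starts from the initial values of surviving weights.
        start = [w * m for w, m in zip(initial, mask)]
        trained = train(start, mask)
        mask = magnitude_prune(trained, mask, rate)
        tickets.append(list(mask))  # one sparse network per pruning iteration
    return tickets


random.seed(0)
initial = [random.uniform(-1.0, 1.0) for _ in range(1000)]
tickets = lottery_ticket(initial, rate=0.2, iterations=3)
sparsities = [1.0 - sum(m) / len(m) for m in tickets]
print(sparsities)
```

Each stored mask corresponds to one member of the set of sparse networks; in Stage 2 the delta algorithm would then be applied to the network trained under each mask.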
2.5 Significant operations counting
2.5.1 Number of multiplications in an unoptimized neural network
Let us estimate the number of multiplications in a standard neural network without any optimizations. The following formula can be used to count the number of multiplications in convolutional layer $k$:
$$M_k^{conv} = I_{k+1}^x \cdot I_{k+1}^y \cdot I_{k+1}^z \cdot K_k^x \cdot K_k^y \cdot I_k^z \qquad (7)$$
where
$I_{k+1}^x$ is the layer k+1 input size along the x axis
$I_{k+1}^y$ is the layer k+1 input size along the y axis
$I_{k+1}^z$ is the number of filters at layer k (layer k+1 input size along the z axis)
$K_k^x$ is the convolution size along the x axis
$K_k^y$ is the convolution size along the y axis
$I_k^z$ is the layer k input size along the z axis
The following formula can be used to count multiplications in dense layer $k$:
$$M_k^{dense} = I_k \cdot O_k \qquad (8)$$
where
$I_k$ is the layer k input size (number of neurons at layer k-1)
$O_k$ is the layer k output size (number of neurons in it)
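These counts are easy to evaluate for the DQN architecture described in section 2.1. The sketch below uses illustrative function names and takes the layer sizes from the architecture table; the last dense layer is omitted because its size depends on the number of actions.

```python
def conv_multiplications(out_x, out_y, n_filters, k_x, k_y, in_z):
    """Multiplications in a convolutional layer: one per output value,
    per kernel element, per input channel."""
    return out_x * out_y * n_filters * k_x * k_y * in_z


def dense_multiplications(n_in, n_out):
    """Multiplications in a dense layer: one per weight."""
    return n_in * n_out


counts = {
    "Conv2d1": conv_multiplications(20, 20, 32, 8, 8, 4),
    "Conv2d2": conv_multiplications(9, 9, 64, 4, 4, 32),
    "Conv2d4": conv_multiplications(7, 7, 64, 3, 3, 64),
    "Dense1": dense_multiplications(3136, 512),
}
for layer, value in counts.items():
    print(layer, value)
```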
General results for all layers are presented in the table below. It should be noted that these results are the same for any game and for any neural network inference.
Layer      Multiplications    Param
Conv2d1    3,276,800          8,224
Conv2d2    2,654,208          32,832
Conv2d4    1,806,336          36,928
Flatten    0                  0
Dense1     1,605,632          1,606,144
Dense2
2.5.2 The number of multiplications in optimized neural networks
It is clear that the degree of weight sparsity will affect the number of nonzero multiplications. It should be taken into account that the percentage of pruned weights can differ in different layers.
Using the delta layer provides different levels of temporal sparsity depending on the selected threshold, the layer, and the input data. That is why average statistics on neuron activations over neural network runs were used. Examples of estimated numbers of significant multiplications are given in the next section and in tables 4 and 5.
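One simple way to form such an estimate is sketched below. It rests on an assumption that is not stated explicitly in the text: a multiplication is treated as significant only if its weight survived pruning and its input changed, and the two events are treated as independent. The per-layer values are taken from the Breakout table in the Results section, where the delta sparsity of a layer's input is the value listed in the preceding row.

```python
def significant_multiplications(total, weight_sparsity, input_delta_sparsity):
    """Estimate: dense count scaled by the share of nonzero weights
    and the share of inputs that changed."""
    return total * (1.0 - weight_sparsity) * (1.0 - input_delta_sparsity)


# (layer, multiplications, weight sparsity, delta sparsity of the layer input)
breakout_layers = [
    ("Conv2d1", 3_276_800, 0.638, 0.992),
    ("Conv2d2", 2_654_208, 0.789, 0.990),
    ("Conv2d4", 1_806_336, 0.824, 0.968),
    ("Dense1", 1_605_632, 0.0, 0.969),
    ("Dense2", 2_048, 0.0, 0.975),
]

estimates = {
    name: significant_multiplications(total, ws, ds)
    for name, total, ws, ds in breakout_layers
}
for name, value in estimates.items():
    print(name, round(value))
```

Under this assumption the estimates stay within about one percent of the per-layer values reported for Breakout, which suggests that the two kinds of sparsity combine roughly multiplicatively.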
3 Results
Let us evaluate the results of applying this algorithm in our environments. The two main metrics we used are the reward received by a neural network and the fraction of significant multiplications. We ran each neural network 50 times in each environment to estimate these values. Neural networks were trained on an Nvidia Tesla V100 (10M steps for each neural network).
Figure 3 shows the above-mentioned neural network performance metrics at different sparsity levels and for different environments.
80 % of convolutional weights can be pruned for Breakout; notably, the in-game performance will be better than that of the unoptimized neural network. The results of estimating the number of significant multiplications at this sparsity level and with a delta algorithm threshold of 0.001 are shown in table 4. As we can see, the total number of nonzero multiplications is 75,012, which is about 125 times fewer than the number of multiplications in the unoptimized neural network.
The results are not as good in SpaceInvaders, but they are still noteworthy. The delta neural network variant allows playing without losing performance at 73 % sparsity and with 406,486 significant multiplications, which is about 23 times fewer than the number of multiplications in the unoptimized neural network. A layer-wise operation analysis is shown in table 5.
The results turn out to depend on the game the agent is playing. This can be explained by the fact that SpaceInvaders has many more changing pixels at each time step than Breakout (in Breakout only the paddle and the ball move, while in SpaceInvaders several objects can move at once: shots, a ship, and aliens). This is confirmed by the difference in delta sparsity levels between Breakout and SpaceInvaders (see tables 4 and 5).
Our reward metric results are similar to the results from [lotteryRL1, vischer2021lottery], where the worst results were in the SpaceInvaders with performance dropping very quickly as neural network sparsity grew.
Table 4. Breakout (delta threshold 0.001):
Layer      Multiplications    Nonzero multiplications    Weight sparsity    Delta sparsity
Input      0                  0                          0.0                0.992
Conv2d1    3,276,800          9,468                      0.638              0.99
Conv2d2    2,654,208          5,592                      0.789              0.968
Conv2d4    1,806,336          10,125                     0.824              0.969
Dense1     1,605,632          49,790                     0.0                0.975
Dense2     2,048              51                         0.0                0.871
Total      9,344,832          75,012                     0.79               0.987
Table 5. SpaceInvaders:
Layer      Multiplications    Nonzero multiplications    Weight sparsity    Delta sparsity
Input      0                  0                          0.0                0.986
Conv2d1    3,276,800          16,712                     0.635              0.924
Conv2d2    2,654,208          58,216                     0.711              0.773
Conv2d4    1,806,336          88,558                     0.784              0.849
Dense1     1,605,632          242,450                    0.0                0.732
Dense2     2,048              548                        0.0                0.004
Total      9,344,832          406,486                    0.737              0.936
3.1 Hardware Problem
Despite the clear advantages of this approach, there are very few opportunities today to implement it effectively on existing hardware. This is because modern GPUs are designed for handling dense matrices. Nevertheless, there are attempts to change the situation. Nvidia began offering hardware support for sparse matrix operations on one of its latest GPUs, the Tesla A100; however, the maximum supported sparsity is only 75 % so far [krashinsky2020nvidia].
The authors of the above-mentioned Delta-Network algorithm work for GrAI Matter Labs, which introduced the GrAI One processor based on the NeuronFlow architecture; it supports neural network designs based on delta neurons.
The Loihi 2 processor that Intel presented [loihi2] in September 2021 also supports a multicore asynchronous architecture and running sparse neural networks based on delta neurons.
4 Conclusion
This study is the first to demonstrate large multiplication redundancy in inference of neural networks handling RL tasks. Minimizing the number of multiplications becomes critical in areas where computation energy efficiency is important. Such areas include Edge AI and robotics. When speaking of the latter, RL algorithm optimization becomes extremely relevant. Although there is currently no suitable hardware available to take full advantage of these benefits, this research highlights the importance and potential of this area.