## 1 Introduction

The forward and backward propagation are used in training of neural networks. The training is an optimization procedure that minimizes the loss function

over data samples in a data set [4, 5, 15]. The loss function measures on average the error between the computed and the correct solution , e.g. cross entropy error function(1) |

with the softmax function

(2) |

and probability

, target as well as computed at the output.The forward propagation starts with an input . It applies an affine function

followed by a component-wise application of a non-linear activation function

to obtain an output of a layer. It proceeds sequentially through a composition of layers defining the neural network . As a result it computes an output at the final layer , as shown below(3) |

For instance, assuming that we may write

(4) |

where matrix of weights

and vector of bias

, with dimensions and being consistent but potentially different across layers.The backward propagation starts with an error

at the final layer. It uses a chain rule

(5) |

to find the corresponding error at the previous level [18, 35, 43], where is a Jacobian matrix, and , while denotes the Hadamard (component-wise) product. Notice that denotes both the output of -th and the input to -th layer, with . The backward propagation proceeds sequentially backwards through a composition of layers defining the neural network . As a result it computes errors at all layers, as shown below

(6) |

For instance, for the cross entropy error function in (1) we can write

(7) |

and for function in (4) we can write

(8) |

The errors can then be used to update coefficients of functions . In particular, in (4) these coefficients are the weights and bias , where we can write

(9) | |||||

(10) |

We point out that the last term in (9) can have different interpretations. Let us consider deriving for (4

) using component-wise derivatives and a composition of Jacobians. In the former case, the last term will be a vector and the final result will be computed as an outer product. In the latter case, it will be a three dimensional tensor that will collapse to a matrix under multiplication. In both cases, the final result will be the same, though.

Finally, notice that in practice the data samples are often organized into mini-batches to take a better advantage of parallel computing platforms [46, 47]. A smaller mini-batch can often achieve higher test accuracy, while a larger mini-batch can attain higher performance [8, 9, 16, 24, 44]

. Therefore, the stochastic gradient descent and its many variants

[6], often average the updates across the mini-batch, as shown below(11) | |||||

(12) |

where is the mini-batch size and is the learning rate [39].

Notice that the data samples inside a mini-batch are independent and can be processed in parallel, which is often referred to as data parallelism. On the other hand, the forward and backward propagation traversals through the layers are in principle sequential. Therefore, taking advantage of model parallelism, parallelism across layers, is more difficult.

In this paper we will show that in fact the forward and backward propagation can be interpreted as a solution of block bi-diagonal triangular systems of equations. This transformation allows us to use an entire class of well developed parallel numerical techniques to solve these systems. Further, it allows us to choose whether we want to solve these problems approximately or exactly. In this novel approach, the many different parallel schemes developed for this problem in linear algebra can now be seen as simply different ways of performing model parallelism in neural networks.

## 2 Block Bi-Diagonal Triangular Systems

Let a neural network be defined by a composition of affine and non-linear functions that are applied component-wise on the output, for layers. Further, let the affine function be a sum of linear function and a vector of constants . Also, recall that where is the input to -th layer, with , and note that Jacobian .

Notice that we can write forward propagation (3) as the solution of the block bi-diagonal lower triangular system , written in extended form below

(13) |

where and diagonal matrices for .

Also, notice that we can write backward propagation (6) as the solution of the block bi-diagonal upper triangular system , written in extended form below

(14) |

where , and diagonal matrices for .

Notice that it might appear that (13) and (14) define two linear systems, which is contradictory to the non-linear nature of the forward and backward propagation process in neural networks. However, notice that in reality the non-linearity is implicitly hidden in the diagonal elements and of these systems. In fact, we can represent it in this form only because the non-linear activation function is applied component-wise, and therefore if we know the point , the function and its derivative evaluations, then we can always find a scalars and , such that the results can be obtained using a scaling with those constants.

Notice that in forward propagation the points and

are not known ahead of time. However, they can potentially be estimated from a previous pass over the data. This approximation can be effective in the later stages of training, when we are in the region close to the solution. Therefore, we can approximate forward propagation by solving (

13).Notice that in backward propagation the starting points are in fact known and can be pre-computed ahead of time. Therefore, we can perform backward propagation exactly by solving (14).

We point out that the theory we have developed handles arbitrary functions as well as most popular activation functions . In particular, notice that when solving triangular systems (13) with forward and (14) with backward substitution, we always multiply with or diagonal terms, respectively. Therefore, the reformulation of backward propagation in (14) works independent of whether or . The reformulation of forward propagation in (13) works with points and for analogous reasons. Also, notice that when and the desired effect can be achieved by simply setting diagonal element to unity. However, the reformulation of forward propagation in (13) does break down when and , because there is no constant that multiplied by zero produces a non-zero result. When using floating point numbers we are often close, but rarely exactly at the numerical zero, therefore the approach is likely to succeed in practice. Nonetheless, it is clear that very careful and thoughtful algorithm design is required to handle all the corner cases correctly.

Also, notice that a similar formulation can be written for an arbitrary compute graph, that can be expressed as a directed acyclic graph. However, in general the triangular systems will no longer be block bi-diagonal and can have a more complex sparsity pattern.

There is a number of algorithms in numerical linear algebra that have been developed to solve block bi-diagonal triangular systems. In general we can split them into direct methods that solve the system exactly and iterative methods that solve the system approximately. All of these numerical schemes can now be seen as simply different ways of performing model parallelism in neural networks. Also, notice that working with multiple data samples organized into mini-batches is equivalent to finding a solution of a set of shifted systems

(15) | |||

(16) |

for , which is a well studied problem and allows for additional data parallelism.

### 2.1 FNNs

Let us assume that we are given a data sample and that a feedforward neural network is defined by an affine function and an activation function for layers.

Let be the error gradient at the output layer . Recall that the error can be propagated through the neural network by repeated applications of the formula

(17) |

for levels [30].

Notice that following (13) and (14) we can write forward propagation as the solution of the block bi-diagonal lower triangular system , written in extended form below

(18) |

while backward propagation as the solution of the block bi-diagonal upper triangular system , written in extended form below

(19) |

where and diagonal matrices and have elements

(20) | |||||

(21) |

respectively, for .

### 2.2 RNNs

Let us assume that we are given a sequence as a data sample and that a recurrent neural network is defined by an affine function and an activation function for layers and time steps.

Let be the error gradient at the output layer at time step . Recall that the error can be propagated through the neural network by repeated applications of the formula

(22) |

for levels and time steps , where we consider the terms for time to be zero [30].

Notice that following (13) and (14) we can write forward propagation as the solution of the block bi-diagonal lower triangular system , written in extended form below

(23) |

while backward propagation as the solution of the block bi-diagonal upper triangular system , written in extended form below

(24) |

where

(25) |

(26) |

with and diagonal matrices and so that

(27) | |||||

(28) |

respectively, for . Finally, letting we may write

(29) |

that in fact does not depend on time steps .

## 3 Parallel Algorithms

There is a number of algorithms in numerical linear algebra that have been developed to solve block bi-diagonal triangular systems. In general we can split them into direct methods that solve the system exactly and iterative methods that solve the system approximately. All of these numerical schemes can now be seen as simply different ways of performing model parallelism in neural networks. Also, notice that working with multiple data samples organized into mini-batches can simply be seen as performing the solution with shifted systems (15) and (16), which allows for additional data parallelism.

### 3.1 Direct Methods

The direct methods include (parallel) cyclic reduction [2, 7, 19, 20], nested dissection [11, 23], spike-based [10, 33, 34, 38], domain decomposition-based [25, 27, 28] and many other schemes [1, 12, 37, 42, 45].

For example, let us illustrate the parallel solution of the system (18) and (19

) using a variant of cyclic reduction. For clarity let us assume that we have a neural network with an odd number of levels

. The handling of the case with even number of layers is similar.Notice that using scaling and permutation matrices ^{1}^{1}1Please refer to appendix for the exact expression., we may write an equivalent system to (18) in block bi-diagonal lower triangular form as

(30) |

where off-diagonal blocks , right-hand-side , and for .

Also, that using scaling and permutation matrices, we may write an equivalent system to (19) in block bi-diagonal upper triangular form as

(31) |

where off-diagonal blocks , right-hand-side , and for .

Notice that both (30) and (31) decouple into two independent smaller systems of half the size of the original. Since we can apply this approach recursively, we can perform approximate forward and exact backward propagation in parallel in O() steps.

Notice that based on equivalence of original linear system in (18) and (30), we can state that forward propagation in (4) is equivalent to two independent forward propagations

(32) | |||||

(33) |

with modified weights for and inputs and at the initial layer.

### 3.2 Iterative Methods

The iterative methods include fixed-point schemes, such as Jacobi, Gauss-Seidel and Richardson iterations, as well as Krylov subspace methods, such as CG, BiCGStab and GMRES, among many others algorithms [3, 36]. Notice that in this particular application we have non-symmetric systems, where either the strictly upper or lower triangular part is zero. Also, the matrices are often very large and it might be prohibitively expensive to store the vector subspace generated throughout iterations. Therefore Jacobi, Richardson and BiCGStab methods are natural choices for solving these systems.

For example, let us illustrate the parallel solution of the systems (18) and (19) using the Jacobi iteration. Let the splitting and be such that and are the block-diagonal, while and are the strictly block lower and upper triangular parts of the original matrix, respectively. Then, we can write Jacobi iteration for (18) as

(36) |

and Jacobi-like iteration for (19) as

(37) |

Notice that the sufficient condition for this iteration to converge is that matrices and are diagonally dominant, which can be checked ahead of time. This computation is in fact embarrassingly parallel, in the sense that every row of matrix and is independent of others. However, multiple iterations might be required to achieve the desired accuracy.

Let us also briefly touch on the parallel solution of the systems (18) and (19

) using BiCGStab method. The convergence of BiCGStab is governed by the distribution of eigenvalues of the coefficient matrix

[17]. It is well known that the eigenvalues of a triangular matrix are simply the elements on its diagonal [21]. Therefore, it is natural to scale the coefficient matrix by the diagonal, so that all eigenvalues are clustered around one. This can be achieved by rewriting the linear systems (18) as(38) |

and (19) as

(39) |

The scaling by the block diagonal can be seen as preconditioning of the BiCGStab method. In fact more general preconditioning techniques can further speed up convergence, but might be prohibitively expensive in terms of computation and storage per iteration.

Notice that when we apply BiCGStab to solve the linear systems (38) and (39) the most compute intensive step in the algorithm is once again a matrix-vector multiplication of the form and , respectively. The former can be written in extended form as

(40) |

while the latter can be written in extended form as

(41) |

Notice that if we distribute this matrix and the corresponding vectors by rows across multiple nodes then only pair-wise communication between nodes is required to compute the result in parallel. However, multiple iterations of BiCGStab are required for convergence.

### 3.3 Mini-Batch and Shifted Systems

Finally, notice that in neural networks we are often working with a mini-batch of data, and therefore we would need to solve a set of shifted systems, such as the ones shown in (15) and (16), to extract additional data parallelism. Notice that the only difference between these systems are the diagonal blocks and , which are based on a particular data sample, while off diagonal blocks and remain the same. Therefore, the generalization of our approach to a set of shifted systems is feasible for direct methods. In the case of iterative methods, we can address it by relying on the slightly modified implementations of the block Krylov subspace methods [14, 29, 31, 40].

### 3.4 Hybrid Methods

Notice that we might choose to combine the direct and iterative methods to solve a single linear system. This could be especially useful for larger systems resulting from recurrent networks, such as (23) and (24). For instance, we can choose to solve diagonal blocks in (25) and in (26) directly, while the entire system is solved iteratively. In the future, we plan to experiment with all of these strategies.

## 4 Time, Work and Memory Complexity

Let us now make a more detailed analysis of the complexity of parallel forward and backward propagation algorithm. For instance, let us illustrate it for the fully connected neural network with weight matrices , where forward and backward propagation is defined in (4) and (8), respectively. For clarity let us assume that we have a neural network with an odd number of levels . The handling of the case with even number of layers is similar. Also, we will assume a standard theoretical PRAM (CREW) model for the formal analysis [22].

The sequential complexity of forward and backward propagation can be written as

(42) |

where is the batch size and is some scalar constant that measures the amount of operations required to evaluate the non-linear activation function. Notice that the first term in (42) corresponds to the matrix-matrix multiplication, e.g. evaluation of the affine function or , while the last term corresponds to the evaluation of the component-wise non-linear activation function .

Notice that if we assume that for then we can state that the sequential complexity is O.

### 4.1 Direct Methods

The work complexity of parallel forward propagation in (32) - (33) as well as backward propagation in (34) - (35) can be written as

(43) |

where index . The first term in (43) corresponds to the overhead required to construct the intermediate weights and right-hand-sides as well as weights and right-hand-sides for forward as well as backward propagation, respectively. The last terms in (43) correspond to two independent propagations, with the newly constructed weights.

Therefore, the time complexity of forward and backward propagation is

Comments

There are no comments yet.