Recently, deep learning has emerged as a useful technique for data classification as well as for finding feature representations. We consider the scenario of multi-class classification. A deep neural network maps each feature vector to one of the class labels by the connection of nodes in a multi-layer structure. Between two adjacent layers, a weight matrix maps the inputs (values in the previous layer) to the outputs (values in the current layer). Assume the training set includes $(y^i, x^i)$, $i = 1, \ldots, l$, where $x^i$ is the feature vector and $y^i$ is the label vector. If $x^i$ is associated with label $k$, then

$$y^i = [\,0, \ldots, 0, \underbrace{1}_{k\text{th position}}, 0, \ldots, 0\,]^T \in \mathbb{R}^{K}, \tag{1}$$

where $K$ is the number of classes and $\{1, \ldots, K\}$ are the possible labels. After collecting all weights and biases as the model vector $\theta$ and having a loss function $\xi(\theta; y^i, x^i)$, a neural-network problem can be written as the following optimization problem:

$$\min_{\theta} \; f(\theta), \quad \text{where} \quad f(\theta) = \frac{1}{2C}\theta^T\theta + \frac{1}{l}\sum_{i=1}^{l} \xi(\theta; y^i, x^i). \tag{2}$$
The regularization term avoids overfitting the training data, while the parameter $C$ balances the regularization term and the loss term. The function $f(\theta)$ is non-convex because of the connection between weights in different layers. This non-convexity and the large number of weights have caused tremendous difficulties in training large-scale deep neural networks. To apply an optimization algorithm for solving (2), the calculation of the function, gradient, and Hessian can be expensive. Currently, stochastic gradient (SG) methods are the most commonly used way to train deep neural networks (e.g., Bottou, 1991; LeCun et al., 1998b; Bottou, 2010; Zinkevich et al., 2010; Dean et al., 2012; Moritz et al., 2015). In particular, some expensive operations can be efficiently conducted in GPU environments (e.g., Ciresan et al., 2010; Krizhevsky et al., 2012; Hinton et al., 2012). Besides stochastic gradient methods, some works such as Martens (2010); Kiros (2013); He et al. (2016) have considered a Newton method of using Hessian information. Other optimization methods such as ADMM have also been considered (Taylor et al., 2016).
When the model or the data set is large, distributed training is needed. Following the design of the objective function in (2), we note it is easy to achieve data parallelism: if data instances are stored in different computing nodes, then each machine can calculate the local sum of training losses independently. (Training deep neural networks with data parallelism has been considered in SG, Newton, and other optimization methods. For example, He et al. (2015) implement a parallel Newton method by letting each node store a subset of instances.) However, achieving model parallelism is more difficult because of the complicated structure of deep neural networks. In this work, by considering that the model is distributedly stored, we propose a novel distributed Newton method for deep learning. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks.
To keep the discussion focused, among the various types of neural networks, we consider the standard feedforward networks in this work. We do not consider other types such as the convolutional networks that are popular in computer vision.
This work is organized as follows. Section 2 introduces existing Hessian-free Newton methods for deep learning. In Section 3, we propose a distributed Newton method for training neural networks. We then develop novel techniques in Section 4 to reduce running time and memory consumption. In Section 5 we analyze the cost of the proposed algorithm. Additional implementation techniques are given in Section 6. Then Section 7 reviews some existing optimization methods, while experiments in Section 8 demonstrate the effectiveness of the proposed method. Programs used for experiments in this paper are available at
Supplementary materials including a list of symbols and additional experiments can be found at the same web address.
2 Hessian-free Newton Method for Deep Learning
In this section, we begin by introducing feedforward neural networks and then review existing Hessian-free Newton methods for solving the optimization problem.
2.1 Feedforward Networks
A multi-layer neural network maps each feature vector to a class vector via the connection of nodes. There is a weight matrix between two adjacent layers that maps the input vector (the previous layer) to the output vector (the current layer). The network in Figure 1 is an example.
Let $n_m$ denote the number of nodes at the $m$th layer, $m = 0, \ldots, L$. We use $n_0$-$n_1$-$\cdots$-$n_L$ to represent the structure of the network. (Note that $n_0$ is the number of features and $n_L$ is the number of classes.) The weight matrix $W^m \in \mathbb{R}^{n_{m-1} \times n_m}$ and the bias vector $b^m \in \mathbb{R}^{n_m}$ at the $m$th layer are

$$W^m = \begin{bmatrix} w^m_{11} & \cdots & w^m_{1 n_m} \\ \vdots & & \vdots \\ w^m_{n_{m-1} 1} & \cdots & w^m_{n_{m-1} n_m} \end{bmatrix} \quad \text{and} \quad b^m = \begin{bmatrix} b^m_1 \\ \vdots \\ b^m_{n_m} \end{bmatrix}.$$
Let $z^{0,i} = x^i$ be the feature vector for the $i$th instance, and let $s^{m,i}$ and $z^{m,i}$ denote the input and output vectors of the $i$th instance at the $m$th layer, respectively. We can use

$$s^{m,i} = (W^m)^T z^{m-1,i} + b^m, \quad z^{m,i}_j = \sigma(s^{m,i}_j), \quad j = 1, \ldots, n_m, \; m = 1, \ldots, L, \tag{3}$$

to derive the values of the next layer, where $\sigma(\cdot)$ is the activation function.
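As a concrete illustration of the forward process in (3), the following sketch propagates one instance through a small network. The sigmoid activation and the 4-3-2 layer sizes are illustrative choices, not taken from the text.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, weights, biases):
    """Propagate one feature vector x through all layers.

    weights[m] has shape (n_{m-1}, n_m); biases[m] has shape (n_m,).
    Returns the list of layer outputs z^{0}, ..., z^{L}.
    """
    z = [x]
    for W, b in zip(weights, biases):
        s = W.T @ z[-1] + b      # s^{m} = (W^m)^T z^{m-1} + b^m
        z.append(sigmoid(s))     # z^{m}_j = sigma(s^{m}_j)
    return z

# Toy 4-3-2 network (n_0 = 4 features, n_L = 2 classes).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
biases = [np.zeros(3), np.zeros(2)]
z = forward(rng.standard_normal(4), weights, biases)
```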
If $W^m$'s columns are concatenated to the following vector

$$\operatorname{vec}(W^m) = \begin{bmatrix} w^m_{11} & \cdots & w^m_{n_{m-1} 1} & \cdots & w^m_{1 n_m} & \cdots & w^m_{n_{m-1} n_m} \end{bmatrix}^T,$$

then we can define

$$\theta = \begin{bmatrix} \operatorname{vec}(W^1) \\ b^1 \\ \vdots \\ \operatorname{vec}(W^L) \\ b^L \end{bmatrix}$$

as the weight vector of the whole deep neural network. The total number of parameters is

$$n = \sum_{m=1}^{L} \left( n_{m-1} \times n_m + n_m \right).$$
Because $z^{L,i}$ is the output vector of the $i$th data, by a loss function to compare it with the label vector $y^i$, a neural network solves the regularized optimization problem (2), where $C > 0$ is a regularization parameter and $\xi(z^{L,i}; y^i)$ is a convex function of $z^{L,i}$. Note that we rewrite the loss function in (2) as $\xi(z^{L,i}; y^i)$ because $z^{L,i}$ is decided by $\theta$ and $x^i$. In this work, we consider the following loss function:

$$\xi(z^{L,i}; y^i) = \| z^{L,i} - y^i \|^2.$$
The gradient of $f(\theta)$ is

$$\nabla f(\theta) = \frac{1}{C}\theta + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T \nabla_{z^{L,i}} \xi(z^{L,i}; y^i),$$

where

$$J^i = \begin{bmatrix} \dfrac{\partial z^{L,i}_1}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L,i}_1}{\partial \theta_n} \\ \vdots & & \vdots \\ \dfrac{\partial z^{L,i}_{n_L}}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L,i}_{n_L}}{\partial \theta_n} \end{bmatrix} \in \mathbb{R}^{n_L \times n}$$

is the Jacobian of $z^{L,i}$, which is a function of $\theta$. The Hessian matrix of $f(\theta)$ is

$$\nabla^2 f(\theta) = \frac{1}{C}\mathcal{I} + \frac{1}{l}\sum_{i=1}^{l} \left( (J^i)^T B^i J^i + \sum_{j=1}^{n_L} \frac{\partial \xi(z^{L,i}; y^i)}{\partial z^{L,i}_j} \nabla^2_{\theta} z^{L,i}_j \right),$$

where $\mathcal{I}$ is the identity matrix and

$$B^i_{ts} = \frac{\partial^2 \xi(z^{L,i}; y^i)}{\partial z^{L,i}_t \, \partial z^{L,i}_s}, \quad t = 1, \ldots, n_L, \; s = 1, \ldots, n_L. \tag{9}$$

From now on, for simplicity, we let

$$\xi_i \equiv \xi(z^{L,i}; y^i).$$
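The gradient formula above can be sanity-checked numerically on a tiny model. The sketch below uses a one-layer network with sigmoid activation and the squared loss; all sizes and values are made-up, and the Jacobian $J$ is written out by hand for this special case, then the analytic gradient $\theta/C + J^T \nabla_z \xi$ is compared against finite differences.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

n0, nL, C = 3, 2, 1.0
rng = np.random.default_rng(1)
x, y = rng.standard_normal(n0), np.array([1.0, 0.0])

def unpack(theta):
    W = theta[:n0 * nL].reshape(n0, nL)   # row-major: W[t, u] at index t*nL + u
    return W, theta[n0 * nL:]

def f(theta):
    W, b = unpack(theta)
    z = sigmoid(W.T @ x + b)
    return theta @ theta / (2 * C) + np.sum((z - y) ** 2)

theta = rng.standard_normal(n0 * nL + nL)
W, b = unpack(theta)
z = sigmoid(W.T @ x + b)

# Jacobian of z w.r.t. theta: dz_u/dW[t,u] = sigma'(s_u) x_t,
# dz_u/db_u = sigma'(s_u); all cross terms are zero.
sp = z * (1 - z)                          # sigmoid'(s)
J = np.zeros((nL, theta.size))
for u in range(nL):
    for t in range(n0):
        J[u, t * nL + u] = sp[u] * x[t]
    J[u, n0 * nL + u] = sp[u]

# Analytic gradient: regularization term plus J^T applied to grad of xi.
grad = theta / C + J.T @ (2 * (z - y))

# Finite-difference reference.
eps = 1e-6
fd = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
               for e in np.eye(theta.size)])
```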
2.2 Hessian-free Newton Method
For the standard Newton method, at the $k$th iteration, we find a direction $d$ minimizing the following second-order approximation of the function value:

$$\min_{d} \; \nabla f(\theta^k)^T d + \frac{1}{2} d^T \nabla^2 f(\theta^k) d, \tag{10}$$

where $\theta^k$ is the iterate at the $k$th iteration.
Note that formally the summation in (13) should be over all output components, but it is simplified because $\xi_i$ is associated with only $z^{L,i}$.
If $\nabla^2 f(\theta^k)$ is positive definite, then (10) is equivalent to solving the following linear system:

$$\nabla^2 f(\theta^k) \, d = -\nabla f(\theta^k). \tag{15}$$

Unfortunately, for the optimization problem (10), it is well known that the objective function may be non-convex and therefore $\nabla^2 f(\theta^k)$ is not guaranteed to be positive definite. Following Schraudolph (2002), we can use the Gauss-Newton matrix as an approximation of the Hessian. That is, we remove the last term in (2.1) and obtain the following positive-definite matrix:

$$G = \frac{1}{C}\mathcal{I} + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T B^i J^i. \tag{16}$$

Note that from (9), each $B^i$, $i = 1, \ldots, l$, is positive semi-definite if we require that $\xi(z^{L,i}; y^i)$ is a convex function of $z^{L,i}$. Therefore, instead of using (15), we solve the following linear system to find a direction $d^k$ for deep neural networks:

$$\left( G^k + \lambda_k \mathcal{I} \right) d^k = -\nabla f(\theta^k), \tag{17}$$

where $G^k$ is the Gauss-Newton matrix at the $k$th iteration and we add a term $\lambda_k \mathcal{I}$ because of considering the Levenberg-Marquardt method (see details in Section 4.5).
For deep neural networks, because the total number of weights may be very large, it is hard to store the Gauss-Newton matrix. Therefore, Hessian-free algorithms have been applied to solve (17); examples include Martens (2010) and Ngiam et al. (2011). Specifically, conjugate gradient (CG) methods are often used, so that a sequence of Gauss-Newton matrix-vector products is conducted. Martens (2010) and Wang et al. (2015) use the $\mathcal{R}$-operator (Pearlmutter, 1994) to implement the product without storing the Gauss-Newton matrix.
Because the use of the $\mathcal{R}$-operator for the Newton method is not the focus of this work, we leave detailed discussion to Sections II–III of the supplementary materials.
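For intuition, the Hessian-free idea can be sketched as follows: run CG on $(G + \lambda\mathcal{I})\, d = -\nabla f$ using only matrix-vector products, never forming $G$ explicitly. In the sketch below, the Jacobians and the gradient are random stand-ins rather than values from a real network, and $B^i = 2\mathcal{I}$ corresponds to the squared loss.

```python
import numpy as np

rng = np.random.default_rng(2)
l, nL, n, C, lam = 5, 2, 8, 1.0, 0.1
Js = [rng.standard_normal((nL, n)) for _ in range(l)]  # stand-in Jacobians J^i
g = rng.standard_normal(n)                             # stand-in gradient

def gauss_newton_vec(v):
    # (G + lam I) v = (1/C) v + lam v + (1/l) sum_i J_i^T B_i (J_i v), B_i = 2I.
    out = v / C + lam * v
    for J in Js:
        out += (J.T @ (2.0 * (J @ v))) / l
    return out

def cg(matvec, rhs, iters=100, tol=1e-16):
    """Conjugate gradient for an SPD system given only a matvec."""
    d = np.zeros_like(rhs)
    r = rhs.copy()          # residual rhs - A d
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

d = cg(gauss_newton_vec, -g)

# Dense reference matrix, affordable only at this toy size.
G = np.eye(n) * (1 / C + lam) + sum(2.0 * J.T @ J for J in Js) / l
```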
3 Distributed Training by Variable Partition
The main computational bottleneck in a Hessian-free Newton method is the sequence of matrix-vector products in the CG procedure. To reduce the running time, parallel matrix-vector multiplications should be conducted. However, the $\mathcal{R}$-operator discussed in Section 2 and Section II of the supplementary materials is inherently sequential: in a forward process, results in the current layer must be finished before those in the next. In this section, we propose an effective distributed algorithm for training deep neural networks.
3.1 Variable Partition
Instead of using the $\mathcal{R}$-operator to calculate the matrix-vector product, we consider the whole Jacobian matrix and directly use the Gauss-Newton matrix in (16) for the matrix-vector products in the CG procedure. This setting is possible for the following reasons.
1. A distributed environment is used.
2. With some techniques, we do not need to explicitly store every element of the Jacobian matrix.
Details will be described in the rest of this paper. To begin, we split each $J^i$ to $P$ partitions

$$J^i = \begin{bmatrix} J^i_1 & J^i_2 & \cdots & J^i_P \end{bmatrix}.$$

Because the number of columns in $J^i$ is the same as the number of variables in the optimization problem, essentially we partition the variables to $P$ subsets. Specifically, we split the neurons in each layer to several groups. Then the weights connecting one group of the current layer to one group of the next layer form a subset of our variable partition. For example, for the three-layer neural network in Figure 2, the total number of partitions is the sum, over all pairs of adjacent layers, of the product of the numbers of groups in the two layers. Each partition in Figure 2 is responsible for a sub-matrix of $J^i$. In addition, we distribute the bias variables $b^m$ to the partitions corresponding to the first neuron sub-group of the $m$th layer, so the elements of each $b^m$ are split across those partitions.
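The variable partition can be illustrated with a toy weight matrix: the weights connecting one neuron sub-group of layer $m-1$ to one sub-group of layer $m$ form one partition. The group boundaries and sizes below are arbitrary choices for illustration.

```python
import numpy as np

n_prev, n_cur = 6, 4
W = np.arange(n_prev * n_cur, dtype=float).reshape(n_prev, n_cur)

prev_groups = [range(0, 3), range(3, 6)]   # split layer m-1 into 2 groups
cur_groups = [range(0, 2), range(2, 4)]    # split layer m into 2 groups

# Each (sub-group, sub-group) pair owns one block of W.
partitions = {}
for a, T_prev in enumerate(prev_groups):
    for b, T_cur in enumerate(cur_groups):
        partitions[(a, b)] = W[np.ix_(list(T_prev), list(T_cur))]
```

Together the blocks cover every weight exactly once, which is what lets each computing node own a disjoint subset of the variables.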
By the variable partition, we achieve model parallelism. Further, because the input layer holds the feature vectors ($z^{0,i} = x^i$), our data points are split in a feature-wise way across the nodes corresponding to partitions between layers 0 and 1. Therefore, we also achieve data parallelism.
With the variable partition, the second term in the Gauss-Newton matrix (16) for the $i$th instance can be represented as

$$(J^i)^T B^i J^i = \begin{bmatrix} (J^i_1)^T \\ \vdots \\ (J^i_P)^T \end{bmatrix} B^i \begin{bmatrix} J^i_1 & \cdots & J^i_P \end{bmatrix}.$$

In the CG procedure to solve (17), the product between the Gauss-Newton matrix and a vector $v$ is

$$G v = \frac{1}{C} v + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T \left( B^i \left( \sum_{q=1}^{P} J^i_q v_q \right) \right),$$

where $v = [v_1; \ldots; v_P]$ is partitioned in the same way as the variables.
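A small numerical check of the partitioned product: with $J$ split column-wise into blocks, summing the local products $J_q v_q$ and then applying each $(J_p)^T$ to the reduced vector reproduces the full $J^T B (J v)$ for one instance. Sizes are toy values, and $B = 2\mathcal{I}$ corresponds to the squared loss.

```python
import numpy as np

rng = np.random.default_rng(3)
nL, n, P = 2, 9, 3
J = rng.standard_normal((nL, n))   # stand-in Jacobian for one instance
B = 2.0 * np.eye(nL)               # B for the squared loss
v = rng.standard_normal(n)

cols = np.array_split(np.arange(n), P)     # column-wise variable partition

# Each node computes its local J_p v_p; the sum plays the role of a reduce.
Jv = sum(J[:, c] @ v[c] for c in cols)

# Each node then applies its own (J_p)^T to the reduced vector.
out = np.concatenate([J[:, c].T @ (B @ Jv) for c in cols])
```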
However, after the variable partition, each $J^i_p$ may still be a huge matrix. The total space for storing $J^i_p$, $i = 1, \ldots, l$, is roughly

$$l \times n_L \times \frac{n}{P}.$$

If $l$, the number of data instances, is so large such that
3.2 Distributed Function Evaluation
From (3) we know how to evaluate the function value on a single machine, but the implementation in a distributed environment is not trivial. Here we check the details from the perspective of an individual partition. Consider a partition that involves neurons in the sets $T_{m-1}$ and $T_m$ from layers $m-1$ and $m$, respectively. Thus the partition is responsible for the variables

$$w^m_{ts}, \quad t \in T_{m-1}, \; s \in T_m.$$
Because (3) is a forward process, we assume that

$$z^{m-1,i}_t, \quad t \in T_{m-1}, \; i = 1, \ldots, l,$$

are available at the current partition. The goal is to generate

$$z^{m,i}_s, \quad s \in T_m, \; i = 1, \ldots, l,$$

and pass them to partitions between layers $m$ and $m+1$. To begin, we calculate the following local sum:

$$\sum_{t \in T_{m-1}} w^m_{ts} z^{m-1,i}_t, \quad s \in T_m, \; i = 1, \ldots, l. \tag{20}$$

After the local sum in (20) is obtained, we must sum up the values over the partitions between layers $m-1$ and $m$ that share the same sub-group $T_m$. Then, from (3), the following values can be calculated:

$$s^{m,i}_s = \sum_{T_{m-1}} \; \sum_{t \in T_{m-1}} w^m_{ts} z^{m-1,i}_t + b^m_s, \qquad z^{m,i}_s = \sigma(s^{m,i}_s), \tag{21}$$

where $s \in T_m$, $i = 1, \ldots, l$, and the outer summation is over all sub-groups $T_{m-1}$ of layer $m-1$.
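The steps above can be simulated on one machine: each partition computes its local sum as in (20), the sums over the partitions sharing the same output sub-group are combined (standing in for the communication step), the bias is added, and the activation is applied. Sizes and values are illustrative.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

n_prev, n_cur = 4, 3
rng = np.random.default_rng(4)
W, b = rng.standard_normal((n_prev, n_cur)), rng.standard_normal(n_cur)
z_prev = rng.standard_normal(n_prev)

prev_groups = [range(0, 2), range(2, 4)]   # two sub-groups of layer m-1

# Local sum at each partition: sum over t in T_{m-1} of w_{ts} z_t.
local = [W[list(T), :].T @ z_prev[list(T)] for T in prev_groups]

# Combine across partitions sharing the same T_m (a simulated reduce),
# then add the bias and apply the activation.
s = sum(local) + b
z = sigmoid(s)
```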
3.2.1 Allreduce and Broadcast Operations
The goal of (21) is to generate and broadcast values to some partitions between layers $m$ and $m+1$, so a reduce operation seems to be sufficient. However, we will explain in Section 3.3 that for the Jacobian evaluation, and then the product between the Gauss-Newton matrix and a vector, the partitions between layers $m-1$ and $m$ corresponding to $T_m$ also need $z^{m,i}_s$, $s \in T_m$, for their own calculations.
To this end, we consider an allreduce operation so that not only are values reduced from the partitions between layers $m-1$ and $m$, but the result is also broadcasted to them. After this is done, we make the same result available in partitions between layers $m$ and $m+1$ by choosing the partition corresponding to the first neuron sub-group of layer $m-1$ to conduct a broadcast operation. Note that for partitions between layers $L-1$ and $L$ (i.e., the last layer), a broadcast operation is not needed.
Consider an example with three partitions between layers $m-1$ and $m$ sharing the same $T_m$; the three local sums are available at the three partitions, respectively. We first conduct an allreduce operation so that the summed values are available at all three partitions. Then we choose one of them to broadcast the values to the corresponding partitions between layers $m$ and $m+1$.
Depending on the system configurations, suitable ways can be considered for implementing the allreduce and the broadcast operations (Thakur et al., 2005). In Section IV of supplementary materials we give details of our implementation.
To derive the loss value, we need one final reduce operation. For the example in Figure 2, in the end the local loss values are respectively available in the partitions between the last two layers. We then need a reduce operation to sum these local loss values and let one designated partition have the loss term in the objective value.
We have discussed the calculation of the loss term in the objective value, but we also need to obtain the regularization term $\theta^T\theta / 2$. One possible setting is that before the loss-term calculation, we run a reduce operation to sum up all local regularization terms. For example, in one partition corresponding to the neuron sub-groups $T_{m-1}$ at layer $m-1$ and $T_m$ at layer $m$, the local value is

$$\sum_{t \in T_{m-1}} \sum_{s \in T_m} \left( w^m_{ts} \right)^2. \tag{25}$$
On the other hand, we can embed the calculation in the forward process for obtaining the loss term. The idea is that we append the local regularization term in (25) to the vector in (20) for the allreduce operation in (21). The cost is negligible because we only increase the length of each vector by one. After the allreduce operation, we broadcast the resulting vector to partitions between layers $m$ and $m+1$ that correspond to the neuron sub-group $T_m$. We cannot let every partition keep the broadcasted value for subsequent allreduce operations, because regularization terms from previous layers would then be counted several times. To this end, we allow only the partition corresponding to $T_m$ in layer $m$ and the first neuron sub-group in layer $m+1$ to collect the value and include it with its local regularization term for the subsequent allreduce operation. By continuing the forward process, in the end we obtain the whole regularization term.
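The embedding idea can be sketched as follows: each partition appends its local regularization sum as one extra element of the vector it reduces, so a single (here simulated) allreduce produces both the forward sums and the accumulated regularization value. All values below are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)
local_vecs = [rng.standard_normal(3) for _ in range(2)]   # local sums as in (20)
local_regs = [rng.random(), rng.random()]                 # local sums of squared weights

# Append the regularization scalar, then do one (simulated) allreduce.
padded = [np.append(v, r) for v, r in zip(local_vecs, local_regs)]
reduced = sum(padded)

# The first entries are the usual forward sums; the last entry is the
# accumulated regularization value, at the cost of one extra element.
s_part, reg_part = reduced[:-1], reduced[-1]
```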
The resulting value is broadcasted to the partitions between the next two layers. Then only one designated partition collects the value and generates its local sum by adding its own regularization term. In the end, the partitions between the last two layers together contain the regularization terms accumulated from all layers.
We can then extend the reduce operation in (24) to generate the final value of the regularization term.
3.3 Distributed Jacobian Calculation
An element of the Jacobian matrix is the partial derivative of an output $z^{L,i}_u$, $u = 1, \ldots, n_L$, with respect to one variable, where the variables range over all weights and biases and $i = 1, \ldots, l$. However, these formulations do not reveal how they are calculated in a distributed setting. Similar to Section 3.2, we check details from the perspective of any variable partition. Assume the current partition involves neurons in the sets $T_{m-1}$ and $T_m$ from layers $m-1$ and $m$, respectively. Then we aim to obtain the following Jacobian components:

$$\frac{\partial z^{L,i}_u}{\partial w^m_{ts}} \quad \text{and} \quad \frac{\partial z^{L,i}_u}{\partial b^m_s}, \quad t \in T_{m-1}, \; s \in T_m, \; u = 1, \ldots, n_L, \; i = 1, \ldots, l.$$

Before showing how to calculate them, we first get from (3) that

$$\frac{\partial z^{L,i}_u}{\partial w^m_{ts}} = \frac{\partial z^{L,i}_u}{\partial z^{m,i}_s} \, \sigma'(s^{m,i}_s) \, z^{m-1,i}_t \tag{33}$$

and

$$\frac{\partial z^{L,i}_u}{\partial b^m_s} = \frac{\partial z^{L,i}_u}{\partial z^{m,i}_s} \, \sigma'(s^{m,i}_s), \tag{34}$$

where $u = 1, \ldots, n_L$, $t \in T_{m-1}$, $s \in T_m$, and $i = 1, \ldots, l$. Note that

$$z^{m-1,i}_t, \; t \in T_{m-1}, \quad \text{and} \quad s^{m,i}_s, \; s \in T_m,$$

are available at the current partition between layers $m-1$ and $m$. Therefore, $z^{m-1,i}_t$ and $\sigma'(s^{m,i}_s)$ in (33)-(34) can be obtained. The remaining issue is to generate $\partial z^{L,i}_u / \partial z^{m,i}_s$. We will show that they can be obtained by a backward process. Because the discussion assumes that currently we are at a partition between layers $m-1$ and $m$, we show details of generating $\partial z^{L,i}_u / \partial z^{m-1,i}_t$, $t \in T_{m-1}$, and dispatching them to partitions between layers $m-2$ and $m-1$. From (3) and (30), they can be calculated by