Distributed Newton Methods for Deep Neural Networks

02/01/2018, by Chien-Chih Wang, et al.

Deep learning involves a difficult non-convex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this paper, we focus on situations where the model is distributedly stored, and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy.

1 Introduction

Recently deep learning has emerged as a useful technique for data classification as well as finding feature representations. We consider the scenario of multi-class classification. A deep neural network maps each feature vector to one of the class labels by the connection of nodes in a multi-layer structure. Between two adjacent layers a weight matrix maps the inputs (values in the previous layer) to the outputs (values in the current layer). Assume the training set includes $(y^i, x^i)$, $i = 1, \ldots, l$, where $x^i$ is the feature vector and $y^i$ is the label vector. If $x^i$ is associated with label $k$, then

$$y^i = [\,0, \ldots, 0, \underbrace{1}_{k\text{th position}}, 0, \ldots, 0\,]^T \in R^K,$$

where $K$ is the number of classes and $1, \ldots, K$ are the possible labels. After collecting all weights and biases as the model vector $\theta$ and having a loss function $\xi(\theta; x^i, y^i)$, a neural-network problem can be written as the following optimization problem:

$$\min_{\theta} \; f(\theta), \qquad (1)$$

where

$$f(\theta) = \frac{1}{2C}\,\theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(\theta; x^i, y^i). \qquad (2)$$

The regularization term $\theta^T\theta/(2C)$ avoids overfitting the training data, while the parameter $C$ balances the regularization term and the loss term. The function $f(\theta)$ is non-convex because of the connection between weights in different layers. This non-convexity and the large number of weights have caused tremendous difficulties in training large-scale deep neural networks. To apply an optimization algorithm for solving (2), the calculation of the function, gradient, and Hessian can be expensive. Currently, stochastic gradient (SG) methods are the most commonly used way to train deep neural networks (e.g., Bottou, 1991; LeCun et al., 1998b; Bottou, 2010; Zinkevich et al., 2010; Dean et al., 2012; Moritz et al., 2015). In particular, some expensive operations can be efficiently conducted in GPU environments (e.g., Ciresan et al., 2010; Krizhevsky et al., 2012; Hinton et al., 2012). Besides stochastic gradient methods, some works such as Martens (2010); Kiros (2013); He et al. (2016) have considered a Newton method of using Hessian information. Other optimization methods such as ADMM have also been considered (Taylor et al., 2016).
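
As a concrete illustration of (1)-(2), the following minimal sketch evaluates the objective for a given model. It is not the software released with this paper; `theta_blocks`, `forward_fn`, and `loss_fn` are hypothetical names used only for illustration.

```python
import numpy as np

def objective(theta_blocks, data, C, loss_fn, forward_fn):
    """f(theta) = (1/(2C)) * theta^T theta + (1/l) * sum_i loss(z^{L,i}; y^i),
    a direct transcription of (2).  `theta_blocks` is a list of weight/bias
    arrays, `forward_fn(theta_blocks, x)` returns the network output, and
    `loss_fn` compares the output with the label vector."""
    l = len(data)
    reg = sum(np.sum(w ** 2) for w in theta_blocks) / (2.0 * C)
    loss = sum(loss_fn(forward_fn(theta_blocks, x), y) for x, y in data) / l
    return reg + loss
```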

When the model or the data set is large, distributed training is needed. Following the design of the objective function in (2), we note it is easy to achieve data parallelism: if data instances are stored in different computing nodes, then each machine can calculate the local sum of training losses independently. (Training deep neural networks with data parallelism has been considered in SG, Newton, and other optimization methods; for example, He et al. (2015) implement a parallel Newton method by letting each node store a subset of instances.) However, achieving model parallelism is more difficult because of the complicated structure of deep neural networks. In this work, by considering that the model is distributedly stored, we propose a novel distributed Newton method for deep learning. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks.

Among the various types of neural networks, we focus on standard feedforward networks in this work. We do not consider other types such as convolutional networks, which are popular in computer vision.

This work is organized as follows. Section 2 introduces existing Hessian-free Newton methods for deep learning. In Section 3, we propose a distributed Newton method for training neural networks. We then develop novel techniques in Section 4 to reduce running time and memory consumption. In Section 5 we analyze the cost of the proposed algorithm. Additional implementation techniques are given in Section 6. Then Section 7 reviews some existing optimization methods, while experiments in Section 8 demonstrate the effectiveness of the proposed method. Programs used for experiments in this paper are available at

http://www.csie.ntu.edu.tw/~cjlin/papers/dnn.

Supplementary materials including a list of symbols and additional experiments can be found at the same web address.

2 Hessian-free Newton Method for Deep Learning

In this section, we begin by introducing feedforward neural networks and then review existing Hessian-free Newton methods for solving the optimization problem.

2.1 Feedforward Networks

A multi-layer neural network maps each feature vector to a class vector via the connection of nodes. Between two adjacent layers there is a weight matrix that maps the input vector (the previous layer) to the output vector (the current layer). The network in Figure 1 is an example.

Figure 1: An example of a feedforward neural network. (This figure is modified from the example at http://www.texample.net/tikz/examples/neural-network.)

Let $n_m$ denote the number of nodes at the $m$th layer. We use $n_0$-$n_1$-$\cdots$-$n_L$ to represent the structure of the network. (Note that $n_0$ is the number of features and $n_L$ is the number of classes.) The weight matrix $W^m$ and the bias vector $b^m$ at the $m$th layer are

$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ \vdots & \vdots & & \vdots \\ w^m_{n_{m-1} 1} & w^m_{n_{m-1} 2} & \cdots & w^m_{n_{m-1} n_m} \end{bmatrix} \quad \text{and} \quad b^m = \begin{bmatrix} b^m_1 \\ \vdots \\ b^m_{n_m} \end{bmatrix}.$$

Let $z^{0,i} = x^i$ be the feature vector for the $i$th instance, and let $s^{m,i}$ and $z^{m,i}$ denote the vectors of the $i$th instance at the $m$th layer before and after the activation function, respectively. We can use

$$s^{m,i} = (W^m)^T z^{m-1,i} + b^m, \qquad z^{m,i}_j = \sigma(s^{m,i}_j), \; j = 1, \ldots, n_m, \qquad (3)$$

to derive the values of the next layer, where $\sigma(\cdot)$ is the activation function.
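
For illustration, a minimal NumPy sketch of the forward process (3), assuming a sigmoid activation (the names here are ours, not from the released code):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, weights, biases):
    """Forward process of (3): z^0 = x; s^m = (W^m)^T z^{m-1} + b^m; z^m = sigma(s^m)."""
    z = x
    for W, b in zip(weights, biases):   # W^m has shape (n_{m-1}, n_m)
        s = W.T @ z + b
        z = sigmoid(s)
    return z                            # network output z^L
```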

If $W^m$'s columns are concatenated to the following vector

$$w^m = \begin{bmatrix} w^m_{11} & \cdots & w^m_{n_{m-1} 1} & w^m_{12} & \cdots & w^m_{n_{m-1} n_m} \end{bmatrix}^T,$$

then we can define

$$\theta = \begin{bmatrix} w^1 \\ b^1 \\ \vdots \\ w^L \\ b^L \end{bmatrix}$$

as the weight vector of the whole deep neural network. The total number of parameters is

$$n = \sum_{m=1}^{L} \left( n_{m-1} \times n_m + n_m \right).$$

Because $z^{L,i}$ is the output vector of the $i$th data instance, by a loss function to compare it with the label vector $y^i$, a neural network solves the following regularized optimization problem:

$$\min_{\theta} \; f(\theta),$$

where

$$f(\theta) = \frac{1}{2C}\,\theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L,i}; y^i), \qquad (4)$$

$C > 0$ is a regularization parameter, and $\xi(z^{L,i}; y^i)$ is a convex function of $z^{L,i}$. Note that we rewrite the loss function in (2) as $\xi(z^{L,i}; y^i)$ because $z^{L,i}$ is decided by $\theta$ and $x^i$. In this work, we consider the following loss function:

$$\xi(z^{L,i}; y^i) = \| z^{L,i} - y^i \|^2. \qquad (5)$$

The gradient of $f(\theta)$ is

$$\nabla f(\theta) = \frac{1}{C}\,\theta + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \nabla_{z^{L,i}} \xi(z^{L,i}; y^i), \qquad (6)$$

where

$$J^i = \begin{bmatrix} \dfrac{\partial z^{L,i}_1}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L,i}_1}{\partial \theta_n} \\ \vdots & & \vdots \\ \dfrac{\partial z^{L,i}_{n_L}}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L,i}_{n_L}}{\partial \theta_n} \end{bmatrix}, \quad i = 1, \ldots, l, \qquad (7)$$

is the Jacobian of $z^{L,i}$, which is a function of $\theta$. The Hessian matrix of $f(\theta)$ is

$$\nabla^2 f(\theta) = \frac{1}{C}\,\mathcal{I} + \frac{1}{l} \sum_{i=1}^{l} \left[ (J^i)^T B^i J^i + \sum_{j=1}^{n_L} \frac{\partial \xi(z^{L,i}; y^i)}{\partial z^{L,i}_j} \, \nabla^2_{\theta} z^{L,i}_j \right], \qquad (8)$$

where $\mathcal{I}$ is the identity matrix and

$$B^i_{ts} = \frac{\partial^2 \xi(z^{L,i}; y^i)}{\partial z^{L,i}_t \, \partial z^{L,i}_s}, \quad t, s = 1, \ldots, n_L. \qquad (9)$$

From now on, for simplicity, we let

$$\xi_i \equiv \xi(z^{L,i}; y^i).$$

2.2 Hessian-free Newton Method

For the standard Newton method, at the $k$th iteration, we find a direction $d^k$ by minimizing the following second-order approximation of the function value:

$$\min_{d} \; \nabla f(\theta^k)^T d + \frac{1}{2}\, d^T \nabla^2 f(\theta^k)\, d, \qquad (10)$$

where $\nabla^2 f(\theta^k)$ is the Hessian matrix of $f$ at $\theta^k$. To solve (10), first we calculate the gradient vector by a backward process based on (3) through the following equations:

$$\frac{\partial f}{\partial w^m_{ts}} = \frac{w^m_{ts}}{C} + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial s^{m,i}_s}\, z^{m-1,i}_t, \qquad (11)$$
$$\frac{\partial f}{\partial b^m_s} = \frac{b^m_s}{C} + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial s^{m,i}_s}, \qquad (12)$$
$$\frac{\partial \xi_i}{\partial s^{m,i}_j} = \frac{\partial \xi_i}{\partial z^{m,i}_j}\, \sigma'(s^{m,i}_j), \qquad (13)$$
$$\frac{\partial \xi_i}{\partial z^{m-1,i}_t} = \sum_{s=1}^{n_m} \frac{\partial \xi_i}{\partial s^{m,i}_s}\, w^m_{ts}. \qquad (14)$$

Note that formally the summation in (13) should be

$$\sum_{t=1}^{n_m} \frac{\partial \xi_i}{\partial z^{m,i}_t}\, \frac{\partial z^{m,i}_t}{\partial s^{m,i}_j},$$

but it is simplified because $z^{m,i}_t$ is associated with only $s^{m,i}_t$.
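
For a single instance, the backward process (11)-(14) can be sketched as follows. This is a rough illustration only, not the paper's implementation; the squared loss (5) and a sigmoid activation are assumed, and the regularization parts $w^m_{ts}/C$ and $b^m_s/C$ of (11)-(12) must be added once outside this per-instance routine.

```python
import numpy as np

def instance_gradient(x, y, weights, biases, l):
    """Loss part of (11)-(14) for one instance (sigmoid activation, squared loss)."""
    zs = [x]
    z = x
    for W, b in zip(weights, biases):               # forward pass, keep z^0..z^L
        z = 1.0 / (1.0 + np.exp(-(W.T @ z + b)))
        zs.append(z)
    grads_W, grads_b = [], []
    dz = 2.0 * (zs[-1] - y)                         # d xi / d z^L for the squared loss (5)
    for m in reversed(range(len(weights))):
        ds = dz * zs[m + 1] * (1.0 - zs[m + 1])     # (13): sigma'(s) = z * (1 - z)
        grads_W.insert(0, np.outer(zs[m], ds) / l)  # loss part of (11)
        grads_b.insert(0, ds / l)                   # loss part of (12)
        dz = weights[m] @ ds                        # (14): back through W^m
    return grads_W, grads_b
```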

If $\nabla^2 f(\theta^k)$ is positive definite, then (10) is equivalent to solving the following linear system:

$$\nabla^2 f(\theta^k)\, d = -\nabla f(\theta^k). \qquad (15)$$

Unfortunately, for the optimization problem (10), it is well known that the objective function may be non-convex and therefore $\nabla^2 f(\theta^k)$ is not guaranteed to be positive definite. Following Schraudolph (2002), we can use the Gauss-Newton matrix as an approximation of the Hessian. That is, we remove the last term in (8) and obtain the following positive-definite matrix:

$$G = \frac{1}{C}\,\mathcal{I} + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i. \qquad (16)$$

Note that from (9), each $B^i$, $i = 1, \ldots, l$, is positive semi-definite if we require that $\xi(z^{L,i}; y^i)$ is a convex function of $z^{L,i}$. Therefore, instead of using (15), we solve the following linear system to find a direction $d^k$ for deep neural networks:

$$\left( G^k + \lambda_k \mathcal{I} \right) d^k = -\nabla f(\theta^k), \qquad (17)$$

where $G^k$ is the Gauss-Newton matrix at the $k$th iteration and we add the term $\lambda_k \mathcal{I}$ because we consider the Levenberg-Marquardt method (see details in Section 4.5).

For deep neural networks, because the total number of weights may be very large, it is hard to store the Gauss-Newton matrix. Therefore, Hessian-free algorithms have been applied to solve (17). Examples include Martens (2010); Ngiam et al. (2011). Specifically, conjugate gradient (CG) methods are often used so that a sequence of Gauss-Newton matrix-vector products is conducted. Martens (2010); Wang et al. (2015) use the $\mathcal{R}$-operator (Pearlmutter, 1994) to implement the product without storing the Gauss-Newton matrix.
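
To make the Hessian-free idea concrete, below is a rough sketch (not the authors' implementation) of CG applied to (17): the Gauss-Newton matrix is never formed, and only products with a vector are needed. For clarity the sketch keeps explicit per-instance $J^i$ and $B^i$; the methods discussed above instead evaluate $J^i v$ and $(J^i)^T u$ on the fly with the $\mathcal{R}$-operator.

```python
import numpy as np

def gn_matvec(v, J_list, B_list, C, lam):
    """(G + lambda*I) v = v/C + lambda*v + (1/l) sum_i (J^i)^T B^i (J^i v),
    computed with matrix-vector products only (cf. (16)-(17))."""
    l = len(J_list)
    out = v / C + lam * v
    for J, B in zip(J_list, B_list):
        out += J.T @ (B @ (J @ v)) / l
    return out

def cg(matvec, g, max_iter=250, tol=1e-8):
    """Conjugate gradient for (G + lambda*I) d = -g, as in Hessian-free Newton."""
    d = np.zeros_like(g)
    r = -g.copy()                      # residual for the initial point d = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(g):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Usage sketch: d = cg(lambda p: gn_matvec(p, J_list, B_list, C, lam), grad)
```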

Because the use of $\mathcal{R}$-operators for the Newton method is not the focus of this work, we leave some detailed discussion to Sections II–III of the supplementary materials.

3 Distributed Training by Variable Partition

The main computational bottleneck in a Hessian-free Newton method is the sequence of matrix-vector products in the CG procedure. To reduce the running time, parallel matrix-vector multiplications should be conducted. However, the $\mathcal{R}$-operator discussed in Section 2 and Section II of the supplementary materials is inherently sequential: in a forward process, results in the current layer must be finished before the next. In this section, we propose an effective distributed algorithm for training deep neural networks.

3.1 Variable Partition

Instead of using the $\mathcal{R}$-operator to calculate the matrix-vector product, we consider the whole Jacobian matrix and directly use the Gauss-Newton matrix in (16) for the matrix-vector products in the CG procedure. This setting is possible for the following reasons.

  1. A distributed environment is used.

  2. With some techniques we do not need to explicitly store every element of the Jacobian matrix.

Details will be described in the rest of this paper. To begin, we split each $J^i$ into $P$ partitions

$$J^i = \begin{bmatrix} J^i_1 & J^i_2 & \cdots & J^i_P \end{bmatrix}.$$

Because the number of columns in $J^i$ is the same as the number of variables in the optimization problem, essentially we partition the variables into $P$ subsets. Specifically, we split the neurons in each layer into several groups. Then the weights connecting one group of the current layer to one group of the next layer form a subset of our variable partition. For example, consider the neural network in Figure 2. By splitting each of the three layers into groups, we obtain the partitions shown in the figure, and each partition is responsible for a sub-matrix of $J^i$. In addition, we distribute the bias variables $b^m$ to the partitions corresponding to the first neuron sub-group of the $(m-1)$th layer, so $b^m$ is split among those partitions according to the neuron sub-groups of layer $m$.

By the variable partition, we achieve model parallelism. Further, because $z^{0,i} = x^i$ (see (3)), our data points are split in a feature-wise way across the nodes corresponding to partitions between layers 0 and 1. Therefore, we have data parallelism.
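
For illustration only (the layer sizes and group counts below are arbitrary, not those of Figure 2), splitting the neurons of two adjacent layers into groups induces one variable partition per pair of groups, i.e., one sub-block of $W^m$:

```python
import numpy as np

def partition_weight_matrix(W, groups_prev, groups_cur):
    """W has shape (n_{m-1}, n_m); groups_* are lists of neuron-index arrays.
    Each (p, q) block holds the weights connecting group p of layer m-1
    to group q of layer m, i.e., one variable partition."""
    return {(p, q): W[np.ix_(gp, gq)]
            for p, gp in enumerate(groups_prev)
            for q, gq in enumerate(groups_cur)}

W1 = np.random.randn(6, 4)
groups_prev = np.array_split(np.arange(6), 3)   # split layer m-1 into 3 groups
groups_cur = np.array_split(np.arange(4), 2)    # split layer m into 2 groups
blocks = partition_weight_matrix(W1, groups_prev, groups_cur)  # 3 * 2 = 6 partitions
```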


Figure 2: An example of splitting the variables in Figure 1 into partitions by a given split structure. Each circle corresponds to a neuron sub-group in a layer, while each square is a partition corresponding to the weights connecting one neuron sub-group in a layer to one neuron sub-group in the next layer.

With the variable partition, the second term in the Gauss-Newton matrix (16) for the $i$th instance can be represented as

$$(J^i)^T B^i J^i = \begin{bmatrix} (J^i_1)^T \\ \vdots \\ (J^i_P)^T \end{bmatrix} B^i \begin{bmatrix} J^i_1 & \cdots & J^i_P \end{bmatrix}.$$

In the CG procedure to solve (17), the product between the Gauss-Newton matrix and a vector $v$ is

$$Gv = \frac{v}{C} + \frac{1}{l} \sum_{i=1}^{l} \begin{bmatrix} (J^i_1)^T B^i \left( \sum_{q=1}^{P} J^i_q v_q \right) \\ \vdots \\ (J^i_P)^T B^i \left( \sum_{q=1}^{P} J^i_q v_q \right) \end{bmatrix}, \qquad (18)$$

where $v = [v_1; \ldots; v_P]$ is partitioned according to our variable split. From (9) and the loss function defined in (5),

$$B^i_{ts} = \begin{cases} 2 & \text{if } t = s, \\ 0 & \text{otherwise.} \end{cases}$$

However, after the variable partition, each $J^i_p$ may still be a huge matrix. The total space for storing $J^i_p, \forall i$, is roughly

$$l \times n_L \times \frac{n}{P}.$$

If $l$, the number of data instances, is so large that

$$l \times n_L > \frac{n}{P},$$

then storing $J^i_p, \forall i$, requires more space than the Gauss-Newton matrix. To reduce the memory consumption, we will propose effective techniques in Sections 3.3, 4.3, and 6.1.
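
The partitioned product (18) can be sketched as follows. This is illustrative only; in the actual distributed setting each partition $p$ stores only its own blocks $J^i_p$, and the inner sums over $q$ require the communication discussed later.

```python
import numpy as np

def partitioned_gn_matvec(v_parts, J_parts, B_list, C, lam):
    """Block form of (18): the p-th block of (G + lambda*I)v is
    v_p/C + lambda*v_p + (1/l) sum_i (J^i_p)^T B^i (sum_q J^i_q v_q).
    J_parts[i][p] is the block J^i_p; v_parts[p] is the p-th block of v."""
    l, P = len(J_parts), len(v_parts)
    out = [v / C + lam * v for v in v_parts]
    for i in range(l):
        Jv = sum(J_parts[i][q] @ v_parts[q] for q in range(P))   # sum_q J^i_q v_q
        BJv = B_list[i] @ Jv
        for p in range(P):
            out[p] += J_parts[i][p].T @ BJv / l
    return out
```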

With the variable partition, function, gradient, and Jacobian calculations become complicated. We discuss details in Sections 3.2 and 3.3.

3.2 Distributed Function Evaluation

From (3) we know how to evaluate the function value in a single machine, but the implementation in a distributed environment is not trivial. Here we check the details from the perspective of an individual partition. Consider a partition that involves neurons in sets $T_{m-1}$ and $T_m$ from layers $m-1$ and $m$, respectively. Thus the partition contains the variables

$$w^m_{ts}, \quad \forall t \in T_{m-1}, \; s \in T_m.$$

Because (3) is a forward process, we assume that

$$s^{m-1,i}_t, \quad \forall t \in T_{m-1}, \; i = 1, \ldots, l,$$

are available at the current partition. The goal is to generate

$$s^{m,i}_s, \quad \forall s \in T_m, \; i = 1, \ldots, l,$$

and pass them to partitions between layers $m$ and $m+1$. To begin, we calculate

$$z^{m-1,i}_t = \sigma(s^{m-1,i}_t), \quad \forall t \in T_{m-1}, \; i = 1, \ldots, l. \qquad (19)$$

Then, from (3), the following local values can be calculated for $s \in T_m$ and $i = 1, \ldots, l$:

$$\sum_{t \in T_{m-1}} w^m_{ts} \, z^{m-1,i}_t. \qquad (20)$$

After the local sum in (20) is obtained, we must sum up the values in partitions between layers $m-1$ and $m$ that correspond to the same neuron subset $T_m$:

$$s^{m,i}_s = \sum_{p \,:\, T^p_m = T_m} \Big( \sum_{t \in T^p_{m-1}} w^m_{ts} \, z^{m-1,i}_t \Big) + b^m_s, \qquad (21)$$

where $s \in T_m$, $i = 1, \ldots, l$, and $T^p_{m-1}$, $T^p_m$ denote the neuron subsets of partition $p$; the bias term $b^m_s$ is contributed by the partition that stores it.

The resulting values should be broadcasted to partitions between layers $m$ and $m+1$ that correspond to the neuron subset $T_m$. We explain details of (21) and the broadcast operation in Section 3.2.1.

3.2.1 Allreduce and Broadcast Operations

The goal of (21) is to generate and broadcast the values $s^{m,i}_s$ to some partitions between layers $m$ and $m+1$, so a reduce operation seems to be sufficient. However, we will explain in Section 3.3 that for the Jacobian evaluation and then the product between the Gauss-Newton matrix and a vector, the partitions between layers $m-1$ and $m$ corresponding to $T_m$ also need $s^{m,i}_s$, $s \in T_m$, for calculating

$$\sigma'(s^{m,i}_s), \quad \forall s \in T_m, \; i = 1, \ldots, l. \qquad (22)$$

To this end, we consider an allreduce operation so that not only are values reduced from some partitions between layers $m-1$ and $m$, but also the result is broadcasted to them. After this is done, we make the same result available in partitions between layers $m$ and $m+1$ by choosing the partition corresponding to the first neuron sub-group of layer $m-1$ to conduct a broadcast operation. Note that for partitions between layers $L-1$ and $L$ (i.e., the last layer), a broadcast operation is not needed.
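
Below is a minimal mpi4py sketch of this communication pattern, not the authors' implementation; the communicator colors and the choice of rank 0 as the broadcasting partition are assumptions made only for illustration.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Illustrative only: a "color" identifies which neuron sub-group T_m a
# partition touches.  In a real run the colors come from the chosen split
# structure; here every rank uses color 0 so the example runs on any process count.
same_Tm_color = 0      # partitions between layers m-1 and m sharing T_m
bcast_color = 0        # the sender plus the partitions between layers m and m+1 needing T_m

same_Tm_comm = comm.Split(color=same_Tm_color, key=rank)
bcast_comm = comm.Split(color=bcast_color, key=rank)

# Local sums of (20) for the instances handled here (dummy data for illustration).
local_sum = np.full((4, 8), rank, dtype=np.float64)   # shape: (|T_m|, l)
s_values = np.empty_like(local_sum)

# Allreduce of (21): every partition sharing T_m obtains the summed s-values.
same_Tm_comm.Allreduce(local_sum, s_values, op=MPI.SUM)

# The partition of the first neuron sub-group of layer m-1 (assumed to be
# rank 0 of bcast_comm) broadcasts the result toward layers m and m+1.
bcast_comm.Bcast(s_values, root=0)
```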

Consider the example in Figure 2. For the three partitions that share the same neuron sub-group of the next layer, all of them must get the values calculated via (21):

(23)

The three local sums are available at partitions , and respectively. We first conduct an allreduce operation so that are available at partitions , , and . Then we choose to broadcast values to , , and .

Depending on the system configurations, suitable ways can be considered for implementing the allreduce and the broadcast operations (Thakur et al., 2005). In Section IV of supplementary materials we give details of our implementation.

To derive the loss value, we need one final reduce operation. For the example in Figure 2, in the end we have respectively available in partitions

We then need the following reduce operation

(24)

and let one designated partition have the loss term in the objective value.

We have discussed the calculation of the loss term in the objective value, but we also need to obtain the regularization term $\theta^T\theta/(2C)$. One possible setting is that before the loss-term calculation we run a reduce operation to sum up all local regularization terms. For example, in one partition corresponding to neuron subgroups $T_{m-1}$ at layer $m-1$ and $T_m$ at layer $m$, the local value is

$$\sum_{t \in T_{m-1}} \sum_{s \in T_m} \left( w^m_{ts} \right)^2. \qquad (25)$$

On the other hand, we can embed the calculation into the forward process for obtaining the loss term. The idea is that we append the local regularization term in (25) to the vector in (20) for the allreduce operation in (21). The cost is negligible because we only increase the length of each vector by one. After the allreduce operation, we broadcast the resulting vector to partitions between layers $m$ and $m+1$ that correspond to the neuron subgroup $T_m$. We cannot let each partition collect the broadcasted value for subsequent allreduce operations because regularization terms in previous layers would be calculated several times. To this end, we allow only the partition corresponding to $T_m$ in layer $m$ and the first neuron subgroup in layer $m+1$ to collect the value and include it with its local regularization term for the subsequent allreduce operation. By continuing the forward process, in the end we get the whole regularization term.
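
A small sketch of this embedding (hypothetical helper names; shapes follow the earlier sketches): the scalar local regularization value (25) is appended to the vector of local sums before the allreduce of (21), so the same communication accumulates both quantities.

```python
import numpy as np

def pack_for_allreduce(local_sum, local_reg):
    """Append the local regularization value (25) to the local sums of (20);
    the allreduce of (21) then also accumulates the regularization term."""
    return np.concatenate([local_sum.ravel(), [local_reg]])

def unpack_after_allreduce(buf, shape):
    """Recover the summed s-values and the accumulated regularization value."""
    return buf[:-1].reshape(shape), buf[-1]
```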

We use Figure 2 to give an illustration. The allreduce operation in (23) now also calculates

(26)

The resulting value is broadcasted to

Then only one designated partition collects the value and generates the following local sum:

In the end we have

  1. contains regularization terms from

  2. contains regularization terms from

  3. contains regularization terms from

We can then extend the reduce operation in (24) to generate the final value of the regularization term.

1:Let $T_{m-1}$ and $T_m$ be the subsets of neurons at the $(m-1)$th and $m$th layers corresponding to the current partition.
2:if $m = 1$ then
3:     Read $x^i_t$ from input, where $t \in T_0$ and $i = 1, \ldots, l$.
4:else
5:     Wait for $s^{m-1,i}_t$, $t \in T_{m-1}$, $i = 1, \ldots, l$.
6:     Calculate $z^{m-1,i}_t$, $t \in T_{m-1}$, $i = 1, \ldots, l$, by (19).
7:end if
8:After calculating (20), run an allreduce operation to have
$$s^{m,i}_s, \ \forall s \in T_m, \ i = 1, \ldots, l, \qquad (27)$$
9:available in all partitions between layers $m-1$ and $m$ corresponding to $T_m$.
10:if $T_{m-1}$ is the first neuron sub-group of layer $m-1$ then
11:     if $m < L$ then
12:         We broadcast values in (27) to partitions between layers $m$ and $m+1$
13:    corresponding to the neuron subgroup $T_m$; see the description after (23)
14:     else
15:         Calculate the local loss values.
16:         If $T_m$ is the first neuron sub-group of layer $L$, run a reduce operation
17:    to get the final function value; see (24).
18:     end if
19:end if
Algorithm 1 Function evaluation in a distributed system

3.3 Distributed Jacobian Calculation

From (7) and similar to the way of calculating the gradient in (11)-(14), the Jacobian matrix satisfies the following properties:

$$\frac{\partial z^{L,i}_u}{\partial w^m_{ts}} = \frac{\partial z^{L,i}_u}{\partial s^{m,i}_s}\, \frac{\partial s^{m,i}_s}{\partial w^m_{ts}}, \qquad (28)$$
$$\frac{\partial z^{L,i}_u}{\partial b^m_s} = \frac{\partial z^{L,i}_u}{\partial s^{m,i}_s}\, \frac{\partial s^{m,i}_s}{\partial b^m_s}, \qquad (29)$$

where $u = 1, \ldots, n_L$, $i = 1, \ldots, l$, and $m = 1, \ldots, L$. However, these formulations do not reveal how they are calculated in a distributed setting. Similar to Section 3.2, we check details from the perspective of any variable partition. Assume the current partition involves neurons in sets $T_{m-1}$ and $T_m$ from layers $m-1$ and $m$, respectively. Then we aim to obtain the following Jacobian components:

$$\frac{\partial z^{L,i}_u}{\partial w^m_{ts}} \ \text{and} \ \frac{\partial z^{L,i}_u}{\partial b^m_s}, \quad \forall u, \ \forall i, \ t \in T_{m-1}, \ s \in T_m.$$

Before showing how to calculate them, we first get from (3) that

$$\frac{\partial z^{m,i}_j}{\partial s^{m,i}_j} = \sigma'(s^{m,i}_j), \qquad (30)$$
$$\frac{\partial s^{m,i}_s}{\partial w^m_{ts}} = z^{m-1,i}_t \quad \text{and} \quad \frac{\partial s^{m,i}_s}{\partial b^m_s} = 1, \qquad (31)$$
$$\frac{\partial s^{m,i}_s}{\partial z^{m-1,i}_t} = w^m_{ts}. \qquad (32)$$

From (28)-(32), the elements of the local Jacobian matrix can be derived by

$$\frac{\partial z^{L,i}_u}{\partial w^m_{ts}} = \frac{\partial z^{L,i}_u}{\partial s^{m,i}_s}\, z^{m-1,i}_t \qquad (33)$$

and

$$\frac{\partial z^{L,i}_u}{\partial b^m_s} = \frac{\partial z^{L,i}_u}{\partial s^{m,i}_s}, \qquad (34)$$

where $u = 1, \ldots, n_L$, $i = 1, \ldots, l$, $t \in T_{m-1}$, and $s \in T_m$.
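
As a shape-level sketch of (33)-(34) for one instance (illustrative only; `dzL_ds` stands for the matrix of values $\partial z^{L,i}_u / \partial s^{m,i}_s$, and `T_prev`, `T_m` are index arrays for the neuron subsets):

```python
import numpy as np

def local_jacobian_blocks(dzL_ds, z_prev, T_prev, T_m):
    """(33)-(34) for one instance: for t in T_{m-1}, s in T_m, u = 1..n_L,
    d z^L_u / d w^m_{ts} = (d z^L_u / d s^m_s) * z^{m-1}_t,
    d z^L_u / d b^m_s   =  d z^L_u / d s^m_s.
    dzL_ds has shape (n_L, n_m); z_prev has shape (n_{m-1},)."""
    Jw = np.einsum('us,t->uts', dzL_ds[:, T_m], z_prev[T_prev])
    Jb = dzL_ds[:, T_m]
    return Jw, Jb
```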

We discuss how to have the values on the right-hand side of (33) and (34) available at the current computing node. From (19), we have

$$z^{m-1,i}_t, \ \forall t \in T_{m-1}, \ i = 1, \ldots, l,$$

available in the forward process of calculating the function value. Further, in (21)-(22), to obtain $s^{m,i}_s$ for layers $m-1$ and $m$, we use an allreduce operation rather than a reduce operation so that

$$s^{m,i}_s, \ \forall s \in T_m, \ i = 1, \ldots, l,$$

are available at the current partition between layers $m-1$ and $m$. Therefore, $z^{m-1,i}_t$ in (33)-(34) can be obtained. The remaining issue is to generate $\partial z^{L,i}_u / \partial s^{m,i}_s$. We will show that they can be obtained by a backward process. Because the discussion assumes that currently we are at a partition between layers $m-1$ and $m$, we show details of generating $\partial z^{L,i}_u / \partial s^{m-1,i}_t$, $t \in T_{m-1}$, and dispatching them to partitions between layers $m-2$ and $m-1$. From (3) and (30), $\partial z^{L,i}_u / \partial s^{m-1,i}_t$ can be calculated by

$$\frac{\partial z^{L,i}_u}{\partial s^{m-1,i}_t} = \frac{\partial z^{L,i}_u}{\partial z^{m-1,i}_t}\, \sigma'(s^{m-1,i}_t). \qquad (35)$$

Therefore, we consider a backward process of using $\partial z^{L,i}_u / \partial z^{m-1,i}_t$ to generate $\partial z^{L,i}_u / \partial s^{m-1,i}_t$. In a distributed system, from (32) and (35),