# Exact gradient updates in time independent of output size for the spherical loss family

An important class of problems involves training deep neural networks with sparse prediction targets of very high dimension D. These occur naturally in e.g. neural language models or the learning of word-embeddings, often posed as predicting the probability of next words among a vocabulary of size D (e.g. 200,000). Computing the equally large, but typically non-sparse D-dimensional output vector from a last hidden layer of reasonable dimension d (e.g. 500) incurs a prohibitive O(Dd) computational cost for each example, as does updating the D × d output weight matrix and computing the gradient needed for backpropagation to previous layers. While efficient handling of large sparse network inputs is trivial, the case of large sparse targets is not, and has thus so far been sidestepped with approximate alternatives such as hierarchical softmax or sampling-based approximations during training. In this work we develop an original algorithmic approach which, for a family of loss functions that includes squared error and spherical softmax, can compute the exact loss, gradient update for the output weights, and gradient for backpropagation, all in O(d^2) per example instead of O(Dd), remarkably without ever computing the D-dimensional output. The proposed algorithm yields a speedup of up to D/4d i.e. two orders of magnitude for typical sizes, for that critical part of the computations that often dominates the training time in this kind of network architecture.

• 47 publications
• 11 publications
• 7 publications
12/22/2014

### Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets

An important class of problems involves training deep neural networks wi...
11/16/2015

### An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family

In a multi-class classification problem, it is standard to model the out...
04/29/2016

### The Z-loss: a shift and scale invariant classification loss belonging to the Spherical Family

Despite being the standard loss function to train multi-class neural net...
02/28/2019

### Efficient Contextual Representation Learning Without Softmax Layer

Contextual representation models have achieved great success in improvin...
02/04/2016

### A Factorized Recurrent Neural Network based architecture for medium to large vocabulary Language Modelling

Statistical language models are central to many applications that use se...
12/10/2018

### Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs

The Softmax function is used in the final layer of nearly all existing s...
10/29/2018

### On Controllable Sparse Alternatives to Softmax

Converting an n-dimensional vector to a probability distribution over n ...

## 1 Introduction

Many modern applications of neural networks have to deal with data represented, or representable, as very large sparse vectors. Such representations arise in natural language related tasks, where the dimension of that vector is typically (a multiple of) the size of the vocabulary, but also in the sparse user-item matrices of collaborative-filtering applications. It is trivial to handle very large sparse inputs to a neural network in a computationally efficient manner: the forward propagation and update to the input weight matrix after backpropagation are correspondingly sparse. By contrast, training with very large sparse prediction targets is problematic: even if the target is sparse, the computation of the equally large network output and the corresponding gradient update to the huge output weight matrix are not sparse and thus computationally prohibitive. This has been a practical problem ever since Bengio et al. [1] first proposed using a neural network for learning a language model, in which case the computed output vector represents the probability of the next word and is the size of the considered vocabulary, which is becoming increasingly large in modern applications [2]. Several approaches have been proposed to attempt to address this difficulty essentially by sidestepping it. They fall in two categories:

• Sampling or selection based approximations

consider and compute only a tiny fraction of the output’s dimensions sampled at random or heuristically chosen. The reconstruction sampling of

Dauphin et al. [3], the efficient use of biased importance sampling in Jean et al. [4]

, the use of Noise Contrastive Estimation

[5] in Mnih and Kavukcuoglu [6] and Mikolov et al. [7] all fall under this category. As does the more recent use of approximate Maximum Inner Product Search based on Locality Sensitive Hashing techniques[8, 9] to select a good candidate subset.

• Hierarchical softmax [10, 7] imposes a heuristically defined hierarchical tree structure for the computation of the normalized probability of the target class.

Compared to the initial problem of considering all output dimensions, both kinds of approaches are crude approximations. In the present work, we will instead investigate a way to actually perform the exact gradient update that corresponds to considering all outputs, but do so implicitly, in a computationally efficient manner, without actually computing the outputs. This approach works for a relatively restricted class of loss functions, that we call the spherical family, its simplest member being linear output with squared error (a natural choice for sparse real-valued regression targets). For simplicity and clarity we will begin with this squared error case, presenting the computational challenge that arises in the standard naive approach in Section 2 and deriving our algorithmic solution in Section 3. We will then extend our approach to the more general case of loss functions in the spherical family in Section 4. In Section 5 we will discuss numerical stability issues that may arise and detail our numerical stabilization strategy. Section 6 presents experimental validation focusing on timings obtained with our CPU and GPU implementations of our algorithm relative to the naive update algorithm.

## 2 The problem

### 2.1 Problem definition and setup

We are concerned with gradient-descent based training of a deep feed-forward neural network with target vectors of very high dimension

(e.g. ) but that are sparse, i.e. a comparatively small number, at most , of the elements of the target vector are non-zero. Such a -sparse vector will typically be stored and represented compactly as numbers corresponding to pairs (index, value). A network to be trained with such targets will naturally have an equally large output layer of dimension . We can also optionally allow the input to the network to be a similarly high dimensional sparse vector of dimension

. Between the large sparse target, output, and (optionally large sparse) input, we suppose the network’s intermediate hidden layers to be of smaller, more typically manageable, dimension

(e.g. )111Our approach does not impose any restriction on the architecture nor size of the hidden layers, as long as they are amenable to usual gradient backpropagation..

#### Mathematical notation:

• Vectors are denoted using lower-case letters, e.g. , and are considered column-vectors; corresponding row vectors are denoted with a transpose, e.g. .

• Matrices are denoted using upper-case letters, e.g. , with the transpose of .

• The column of is denoted , and its row (both viewed as a column vector).

• denotes the transpose of the inverse of a square matrix.

• denotes a -dimensional column vector filled with ones.

• denotes an indicator function whose value will be 1 if and 0 otherwise.

• is the -dimensional column vector filled with zeros except at index where its value is 1.

• is the identity matrix.

#### Network architecture

We consider a standard feed forward neural network architecture as depicted in Figure 1. An input vector

is linearly transformed into a linear activation

through a input weight matrix

(and an optional bias vector

). This is typically followed by a non-linear transformation to yield the representation of the first hidden layer . This first hidden layer representation is then similarly transformed through a number of subsequent non-linear layers (that can be of any usual kind amenable to backpropagation) e.g. with until we obtain last hidden layer representation . We then obtain the final -dimensional network output as where is a output weight matrix, which will be our main focus in this work. Finally, the network’s -dimensional output is compared to the -dimensional target vector associated with input using squared error, yielding loss .

#### Training procedure

This architecture is a typical (possibly deep) multi-layer feed forward neural network architecture with a linear output layer and squared error loss. Its parameters (weight matrices and bias vectors) will be trained by gradient descent, using gradient backpropagation Rumelhart et al. [11], LeCun [12, 13] to efficiently compute the gradients. The procedure is shown in Figure 1. Given an example from the training set as an (input,target) pair

, a pass of forward propagation proceeds as outlined above, computing the hidden representation of each hidden layer in turn based on the previous one, and finally the network’s predicted output

and associated loss . A pass of gradient backpropagation then works in the opposite direction, starting from and propagating back the gradients and upstream through the network. The corresponding gradient contributions on parameters (weights and biases), collected along the way, are straightforward once we have the associated . Specifically they are and . Similarly for the input layer , and for the output layer . Parameters are then updated through a gradient descent step and , where is a positive learning-rate. Similarly for the output layer which will be our main focus here: .

### 2.2 The easy part: input layer forward propagation and weight update

It is easy and straightforward to efficiently compute the forward propagation, and the backpropagation and weight update part for the input layer when we have a very large -dimensional but sparse input vector with appropriate sparse representation. Specifically we suppose that is represented as a pair of vectors of length (at most) , where contains integer indexes and the associated real values of the elements of such that if , and .

• Forward propagation through the input layer: The sparse representation of as the positions of elements together with their value makes it cheap to compute . Even though may be a huge full matrix, only of its rows (those corresponding to the non-zero entries of ) need to be visited and summed to compute . Precisely, with our sparse representation of this operation can be written aswhere each is a -dimensional vector, making this an operation rather than .

• Gradient and update through input layer: Let us for now suppose that we were able to get gradients (through backpropagation) up to the first hidden layer activations in the form of gradient vector . The corresponding gradient-based update to input layer weights is simply . This is a rank-one update to . Here again, we see that only the rows of associated to the (at most) non-zero entries of need to be modified. Precisely this operation can be written as: making this again a operation rather than .

### 2.3 The hard part: output layer propagation and weight update

Given some network input we suppose we can compute without difficulty through forward propagation the associated last hidden layer representation . From then on:

• Computing the final output incurs a prohibitive computational cost of since is a full matrix. Note that there is a-priori no reason for representation to be sparse (e.g. with a sigmoid non-linearity) but even if it was, this would not fundamentally change the problem since it is that is extremely large, and we supposed reasonably sized already. Computing the residual and associated squared error loss incurs an additional cost.

• The gradient on that we need to backpropagate to lower layers is which is another matrix-vector product.

• Finally, when performing the corresponding output weight update we see that it is a rank-one update that updates all elements of , which again incurs a prohibitive computational cost.

For very large , all these three operations are prohibitive, and the fact that is sparse, seen from this perspective, doesn’t help, since neither nor will be sparse.

## 3 A computationally efficient algorithm for performing the exact online gradient update

Previously proposed workarounds are approximate or use stochastic sampling. We propose a different approach that results in the exact same, yet efficient gradient update, remarkably without ever having to compute large output .

### 3.1 Computing the squared error loss L and the gradient with respect to h efficiently

Suppose that, we have, for a network input example , computed the last hidden representation through forward propagation. The network’s dimensional output is then in principle compared to the high dimensional target . The corresponding squared error loss is . As we saw in Section 2.3, computing it in the direct naive way would have a prohibitive computational complexity of because computing output with a full matrix and a typically non-sparse is . Similarly, to backpropagate the gradient through the network, we need to compute the gradient of loss with respect to last hidden layer representation . This is . So again, if we were to compute it directly in this manner, the computational complexity would be a prohibitive . Provided we have maintained an up-to-date matrix , which is of reasonable size and can be cheaply maintained as we will see in Section 3.4, we can rewrite these two operations so as to perform them in :

##### Loss computation:
 L = ∥O(Dd)Wh−y∥2 (1) = (Wh−y)T(Wh−y) = hTWTWh−yTWh−hTWTy+yTy = hTQh−2hT(WTy)+yTy = hT(QhO(d2)−2WTyO(Kd))+yTyO(K)
 ∇h=∂L∂h = ∂∥Wh−y∥2∂h (2) = 2WT(Wh−y) = 2(WTWh−WTy) = 2(QhO(d2)−WTyO(Kd))

The terms in and are due to leveraging the -sparse representation of target vector . With and , we get altogether a computational cost of which can be several orders of magnitude cheaper than the prohibitive of the direct approach.

### 3.2 Efficient gradient update of W

The gradient of the squared error loss with respect to output layer weight matrix is . And the corresponding gradient descent update to would be , where is a positive learning rate. Again, computed in this manner, this induces a prohibitive computational complexity, both to compute output and residual , and then to update all the elements of (since generally neither nor will be sparse). All elements of must be accessed during this update. On the surface this seems hopeless. But we will now see how we can achieve the exact same update on in . The trick is to represent implicitly as the factorization222Note that we never factorize a pre-exisitng arbitrary , which would be prohibitive as is huge. We will no longer store a nor work on it explicitly, but only matrices and which implicitly represent . and update and instead:

 a)Unew = U−2η(Uh)hT (3) b)Vnew = V+2ηy(U−Tnewh)T (4)

This results in implicitly updating as we did explicitly in the naive approach as we now prove:

 VnewUnew = (V+2ηy(U−Tnewh)T)Unew = VUnew+2ηy(U−Tnewh)TUnew = VUnew+2ηyhTU−1newUnew = V(U−2η(Uh)hT)+2ηyhT(U−1newUnew) = VU−2ηVUhhT+2ηyhT = VU−2η(VUh−y)hT = W−2η(Wh−y)ThT = Wnew

We see that the update of in Eq. 3 is a simple operation. Following this simple rank-one update to , we can use the Sherman-Morrison formula to derive the corresponding rank-one update to which will also be :

 U−Tnew = U−T+2η1−2η∥h∥2(U−Th)hT (5)

It is then easy to compute the , an operation needed in Eq. 4. The ensuing rank-one update of in Eq 4, thanks to the -sparsity of is only : only the rows associated to non-zero elements in are accessed and updated, sited of all rows of we had to modify in the naive update!

### 3.3 Adapting the computation of L and ∇h to the factored representation of W

With the factored representation of as , we only have implicitly, so the terms that entered in the computation of and in the previous section (Eq. 1 and 2) need to be adapted slightly as , which becomes rather than in computational complexity. But this doesn’t change the overall complexity of these computations.

The adapted update computation of and can thus be expressed simply as:

 ∇h=2(Qh^h−UT(VTy)^y)^z (6)

and

 L=hT(Qh^h−2UT(VTy)^y)+yTy (7)

### 3.4 Bookkeeping: keeping an up-to-date Q and U−T

We have already seen, in Eq. 5, how we can cheaply maintain an up-to-date following our update of . Similarly, following our updates to and , we need to keep an up-to-date which is needed to efficiently compute the loss (Eq. 1) and gradient (Eq. 2). We have shown that updates to and in equations 3 and 4 are equivalent to implicitly updating as , and this translates into the following update to :

 Qnew = Q−η(h∇Th+∇hhT)+(4η2L)hhT (8)

One can see that this last bookkeeping operation also has a computational complexity.

#### Proof that this update to Q corresponds to the update Wnew←2(Wh−y)hT

 WTnewWnew = (W−2η(Wh−y)hT)T(W−2η(Wh−y)hT) WTnewWnew = WTW−2ηh(Wh−y)TW−2ηWT(Wh−y)hT +4η2h(Wh−y)T(Wh−y)hT WTnewWnew = Q−2η(hhTWTW−hyTW)−2η(WTWhhT−WTyhT) +4η2h(hTWTWh−hTWTy−yTWh+yTy)hT WTnewWnew = Q−2η(hhTQ−h(WTy)T)−2η(QhhT−(WTy)hT) +4η2h(hTQh−hT(WTy)−(WTy)Th+yTy)hT WTnewWnew = Q−2ηh(hTQ−(WTy)T)−2η(Qh−WTy)hT +4η2h(hTQh−2hTWTy+yTy)hT WTnewWnew = Q−ηh(2(Qh−WTy)∇h)T−η(2(Qh−WTy)∇h)hT +4η2h(hT(Qh−2WTy)+yTy)LhT

where we see that the last term uses the expression of from Eq. 1 and the first two terms uses the expression of from Eq. 6: . Thus we have shown that

 WTnewWnew = Q−ηh∇Th−2η∇hhT+4η2hLhT = Q−η(h∇Th+∇hhT)+(4η2L)hhT

which is the update that we gave in Eq. 8 above.

### 3.5 Putting it all together: detailed online update algorithm and expected benefits

We have seen that we can efficiently compute cost , gradient with respect to (to be later backpropagated further) as well as updating and and performing the bookkeeping for and . Here we put everything together. The parameters of the output layer that we will learn are and implicitly represent as . We first need to initialize these parameter matrices, as well as bookkeeping matrices and in a consistent way, as explained in Algo. 1. We then iterate over the following:

• pick a next input,target example (where is -sparse and uses an appropriate sparse representation)

• perform forward propagation through all layers of the network up to the last hidden layer, to compute last hidden layer representation , that should include a constant 1 first element.

• execute Algo. 2, that we put together from the equations derived above, and that will: compute the associated squared error loss , perform an implicit gradient update step on by correspondingly updating and in a computationally efficient manner, update bookkeeping matrices and accordingly, and compute and return the gradient of the loss with respect to the last hidden layer

• having , further backpropagate the gradients upstream, and use them to update the parameters of all other layers

Having we see that the update algorithm we developed requires operations, whereas the standard approach required operations. If we take , we may state more precisely that the proposed algorithm, for computing the loss and the gradient updates will require roughly operations whereas the standard approach required roughly operations. So overall the proposed algorithm change corresponds to a computational speedup by a factor of . For and the expected speedup is thus 100. Note that the advantage is not only in computational complexity, but also in memory access. For each example, the standard approach needs to access and change all elements of matrix , whereas the proposed approach only accesses the much smaller number elements of as well as the three matrices , , and . So overall we have a substantially faster algorithm whose complexity is independent of , which, while doing so implicitly, will nevertheless perform the exact same gradient update as the standard approach. We want to emphasize here that this approach is entirely different from simply chaining 2 linear layers and and performing ordinary gradient descent updates on these: this would result in the same prohibitive computational complexity as the standard approach, and such ordinary separate gradient updates to and would not be equivalent to the ordinary gradient update to .

### 3.6 Minibatch version of the algorithm for squared error

The algorithm we derived for online gradient is relatively straightforward to extend to the case of minibatches containing examples. We iniialize parameters as in the online case follpwing Algo. 1 and apply the same training procedure outlined in Section. 3.5, but now using minibatches containing examples, rather than a single example vector. The corresponding update and gradient computation is given in Algorithm  3 which follows equivalent steps to the online version of Algorithm 2, but using matrices with columns in place of single column vectors. For example step 3 which in the online algorithm was using dimensional vectors becomes in the minibatch version using matrices instead.

Note that in the minibatch version, in step 6, we update based on the Woodbury equation, which generalizes the Sheman-Morrison formula for and involves inverting an matrix, an operation. But depending on the size of the minibatch , it may become more efficient to solve the corresponding linear equations for each minibatch from scratch every time, rather than inverting that matrix. In which case we won’t need to maintain an at all. Or in cases of minibatches containing more than examples, it may even become more efficient to invert from scratch every time.

In step 9, the update for corresponds to the implicit weight update as we now prove:

We will use the following precomputed quantities: , and and .

 Qnew = WTnewWnew = (W−2η(WH−Y)HT)T(W−2η(WH−Y)HT) = WTW−2ηH(WH−Y)TW−2ηWT(WH−Y)HT +4η2H(WH−Y)T(WH−Y)HT = Q−2η(HHTWTW−HYTW)−2η(WTWHHT−WTYHT) +4η2H(HTWTWH−HTWTY−YTWH+YTY)HT = Q−2η(HHTQ−H(WTY)T)−2η(QHHT−(WTY)HT) +4η2H(HTQH−HT(WTY)−(WTY)TH+YTY)HT = Q−2η(H^HT−H^YT+^HHT−^YHT) +4η2H(HT^H−HT^Y−^YTH+YTY)HT = = Q−η(H(2(^H−^Y))T+(2(^H−^Y))HT)+4η2H(HT(^H−^Y)−^YTH+YTY)HT = Q−η(H∇TH+∇HHT)+4η2H(HT^Z−^YTH+YTY)MHT

which is the update of we use in in step 8 of Algorithm 2.

## 4 Generalizing to a broader family of loss functions

Let the linear activations computed at the output layer. The approach that we detailed for linear output and squared error can be extended to a more general family of loss functions: basically any loss function that can be expressed using only the associated to non-zero together with the squared norm of the whole output vector, and optionally which we will see that we can both compute cheaply. We call this family of loss functions the spherical family of loss functions or in short spherical losses, defined more formally as the family of losses that can be expressed as:

 L=ℓ( ∥o∥2, sum(o), K, oK, yK)

where denotes the vector of indices of of cardinality at most that is associated to non-zero elements of in a sparse representation of; is the corresponding vector of values of at positions , i.e. ; similarly is the vector of values of linear activation at positions , i.e. .

Note that the squared error loss belongs to this family as

 ℓsquared = D∑j=1(oj−yj)2 = D∑j=1o2j−2ojyj+y2j = (D∑j=1o2j)−2(D∑j=1ojyj)+(D∑j=1y2j) = ∥o∥2−2⎛⎝∑j∈Kojyj⎞⎠+⎛⎝∑j∈Ky2j⎞⎠  since for % j∉K we have yj=0 = ∥o∥2−2oTKyK+∥yK∥2 = ℓsquared( ∥o∥2, sum(o), K, oK, yK)

where in particular doesn’t use .

The spherical family of loss functions does not include the standard log of softmax, but it includes possible alternatives, such as the spherical softmax and Taylor-softmax that we will introduce in a later section. Let us detail the steps for computing such a spherical loss from last hidden layer representation :

The gradient of the loss may be backpropagated and the parameters updated in the usual naive way with the following steps:

• compute scalars and as well as -dimensional gradient vector