Stochastic Backward Euler: An Implicit Gradient Descent Algorithm for k-means Clustering

by   Penghang Yin, et al.
McGill University

In this paper, we propose an implicit gradient descent algorithm for the classic k-means problem. The implicit gradient step or backward Euler is solved via stochastic fixed-point iteration, in which we randomly sample a mini-batch gradient in every iteration. It is the average of the fixed-point trajectory that is carried over to the next gradient step. We draw connections between the proposed stochastic backward Euler and the recent entropy stochastic gradient descent (Entropy-SGD) for improving the training of deep neural networks. Numerical experiments on various synthetic and real datasets show that the proposed algorithm finds the global minimum (or its neighborhood) with high probability, when given the correct number of clusters. The method provides better clustering results compared to k-means algorithms in the sense that it decreased the objective function (the cluster) and is much more robust to initialization.



There are no comments yet.


page 9

page 12


A Fixed-Point of View on Gradient Methods for Big Data

Interpreting gradient methods as fixed-point iterations, we provide a de...

Semi-Implicit Back Propagation

Neural network has attracted great attention for a long time and many re...

A Stochastic Alternating Balance k-Means Algorithm for Fair Clustering

In the application of data clustering to human-centric decision-making s...

Stochastic Gradient Descent for Spectral Embedding with Implicit Orthogonality Constraint

In this paper, we propose a scalable algorithm for spectral embedding. T...

Overcoming Challenges in Fixed Point Training of Deep Convolutional Networks

It is known that training deep neural networks, in particular, deep conv...

Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

We propose mS2GD: a method incorporating a mini-batching scheme for impr...

Differentiable Fixed-Point Iteration Layer

Recently, several studies proposed methods to utilize some restricted cl...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction


-means method appeared in vector quantization in signal processing, and had now become popular for clustering analysis in data mining. In the seminal paper


, Lloyd proposed a two-step alternating algorithm that quickly converges to a local minimum. Lloyd’s algorithm is also known as an instance of the more general Expectation-Maximization (EM) algorithm applied to Gaussian mixtures. In

Bottou_95 , Bottou and Bengio cast Lloyd’s algorithm as Newton’s method, which explains its fast convergence.

Aiming to speed up Lloyd’s algorithm, Elkan elkan proposed to keep track of the distances between the computed centroids and data points, and then cleverly leverage the triangle inequality to eliminate unnecessary computations of the distances. Similar techniques can be found in yingyang . It is worth noting that these algorithms do not improve the clustering quality of Lloyd’s algorithm, but only achieve acceleration. However, there are well known examples where poor initialization can lead to low quality local minima for Lloyd’s algorithm. Random initialization has been used to avoid these low quality fixed points. The article kmeans++ introduced a smart initialization scheme such that the initial centroids are well-separated, which gives more robust clustering than random initialization.

We are motivated by problems with very large data sets, where the cost of a single iteration of Lloyd’s algorithm can be expensive. Mini-batch mnbatch ; stochastic_kmeans was later introduced to adapt -means for large scale data with high dimensions. The centroids are updated using a randomly selected mini-batch rather than all of the data. Mini-batch (stochastic)

-means has a flavor of stochastic gradient descent whose benefits are twofold. First, it dramatically reduces the per-iteration cost for updating the centroids and thus is able to handle big data efficiently. Second, similar to its successful application to deep learning

deep_learning , mini-batch gradient introduces noise in minimization and may help to bypass some bad local minima. Furthermore, the aforementioned Elkan’s technique can be combined with mini-batch -means for further acceleration nested_mnbatch .

In this paper, we propose a backward Euler based algorithm for -means clustering. Fixed-point iteration is performed to solve the implicit gradient step. As is done for stochastic mini-batch -means, we compute the gradient only using a mini-batch of samples instead of the whole data, which enables us to handle massive data. Unlike the standard fixed-point iteration, the proposed stochastic fixed-point iteration outputs an average over its trajectory. Extensive experiments show that, with proper choice of step size for stochastic backward Euler, the proposed algorithm can improve over EM and Mini-batch EM and locate an improved minimum with decreased objective value.

In other words, while Lloyd’s algorithm is effective with a full gradient oracle we achieve better performance with the weaker mini-batch gradient oracle. We are motivated by recent work by two of the authors deep_relax which applied a similar algorithm to accelerate the training of Deep Neural Networks.

2 Stochastic backward Euler

The celebrated proximal point algorithm (PPA) proximal_point for minimizing some function is:


PPA has the advantage of being monotonically decreasing, which is guaranteed for any step size . Indeed, by the definition of in (1), we have

When for any with being the Lipschitz constant of , the (subsequential) convergence to a stationary point is established in PPM . If is differentiable at , it is easy to check that the following optimality condition to (1) holds

By rearranging the terms, we arrive at implicit gradient descent or the so-called backward Euler:


When has the Lipschitz constant and , the fixed point iteration


is a viable option for updating by solving the equation.

It is essentially the gradient descent

on (1) by choosing the step size .

Proposition 1

If , then we have

  • is strongly convex, and the proximal problem (1) has a unique solution .

  • The fixed point iteration (3) generates a sequence converging to at least linearly.

See (Ber08, , Proposition 1.2.3).

Let us consider -means clustering for a set of data points in with centroids . Denoting , we seek to minimize


Note that is non-differentiable at if there exist and such that

This means that there is a data point which has two or more distinct nearest centroids and . The same situation may happen in the assignment step of Lloyd’s algorithm. In this case, we simply assign to one of the nearest centroids. With that said, is basically piecewise differentiable. By abuse of notation, we can define the ’gradient’ of at any point by


where denotes the index set of the points that are assigned to the centroid . From now on and for the rest of the paper, we denote the piecewise gradient by as stated in (2), and none of the results depends on the specific assignment of ambiguous data points . Similarly, we can compute the ’Hessian’ of as was done in Bottou_95 :

where is an -D vector of all ones. In what follows, we analyze how the fixed point iteration (3) works on the piecewise differentiable with discontinuous .

Definition 1

is piecewise Lipschitz continuous on with Lipschitz constant , if can be partitioned into a finite number of sub-domains satisfying , , , and is Lipschitz continuous in each sub-domain , i.e., for each we have

We can see that is at most piecewise -Lipschitz continuous. The following result proves the convergence of fixed point iteration on -means problem.

Theorem 2.1

Let be the -means objective function defined in (4). Suppose is piecewise -Lipschitz. If , then the fixed point iteration for minimizing given by

with the initialization satisfies

  • and as .

  • is bounded. Moreover, if any limit point of a convergent subsequence of lies in the interior of some sub-domain, then the whole sequence converges to with a locally linear rate, which is a fixed point obeying


(a) We know that is piecewise quadratic. Suppose (note that could be on the boundary), then has a uniform expression restricted on which is a quadratic function, denoted by . We can extend the domain of from to the whole , and we denote the extended function still by . Since is quadratic, is -Lipschitz continuous on . Then we have the following well-known inequality

Using the above inequality and the definition of , we have

In the second equality above, we used the identity

with and . Since , is monotonically decreasing. Moreover, since is bounded from below by 0, converges and thus as .

(b) Since as , combining with the fact that , we have is bounded. Consider a convergent subsequence whose limit lies in the interior of some sub-domain. Then for sufficiently large , will always remain in the same sub-domain in which lies and thus . Since by (a), , we have as . Therefore,

which implies is a fixed point. Furthermore, by the piecewise Lipschitz condition,

Since , when is sufficiently large, is also in the same sub-domain containing . By repeatedly applying the above inequality for , we conclude that converges to .

Remark 1

This result can be extended to objective functions that are the pointwise infimum of a set of a finite number of Lipschitz differentiable functions.

2.1 Algorithm description

Instead of using the full gradient in fixed-point iteration, we adopt a randomly sampled mini-batch gradient

at the -th inner iteration. Here, denotes the index set of the points in the -th mini-batch associated with the centroid obeying . The fixed-point iteration outputs a forward looking average over its trajectory. Intuitively averaging greatly stabilizes the noisy mini-batch gradients and thus smooths the descent. We summarize the proposed algorithm in Algorithm 1. Another key ingredient of our algorithm is an aggressive initial step size , which helps bypass bad local minimum at the early stage. Unlike in deterministic backward Euler, diminishing step size is needed to ensure convergence. But should decay slowly because large step size is good for a global search.

Input: number of clusters , step size , mini-batch size , averaging parameter , step size decay parameter .
Initialize: centroid .

  for  do
     for  do
        Randomly sample a mini-batch gradient .
     end for
  end for


Algorithm 1 Stochastic backward Euler for -means.

2.2 Related work

Chaudhari et al. esgd recently proposed the entropy stochastic gradient descent (Entropy-SGD) algorithm to tackle the training of deep neural networks. Relaxation techniques arising in statistical physics were used to change the energy landscape of the original non-convex objective function yet with the minimizers being preserved, which allows easier minimization to obtain a ’good’ minimizer with a better geometry. More precisely, they suggest to replace with a modified objective function called local entropy local_entropy as follows


is the heat kernel. The connection between Entropy-SGD and nonlinear partial differential equations (PDEs) was later established in

deep_relax . The local entropy function turns out to be the solution to the following viscous Hamilton-Jacobi (HJ) PDE at


with the initial condition . In the limit , (6) reduces to the non-viscous HJ equation

whose viscosity solution is exactly the Moreau envelope Moreau_65 :

The gradient descent dynamics for is obtained by taking the limit of the following system of stochastic differential equation as the homogenization parameter :


where is the standard Wiener process. Specifically, we have

with and being the solution of (2.2) for fixed . This gives rise to the implementation of Entropy-SGD deep_relax :

where and are the gradient step sizes for the inner and outer loops, respectively, is the moving average of output from the inner loop, and introduces the noise. Stochastic backward Euler simplifies Entropy-SGD in two aspects. First, the term is absent in SBE as the mini-batch gradient itself already contains the noise. Second, the step sizes and are both set to , which make the algorithm simpler with less tunable parameters.

3 Experimental results

We show by several experiments that the proposed stochastic backward Euler (SBE) gives superior clustering results compared with the state-of-the-art algorithms for -means. SBE scales well for large problems. In practice, only a small number of fixed-point iterations are needed in the inner loop, and this seems not to depend on the size of the problem. Specifically, we chose the parameters imaxit = 5 or and the averaging parameter in all experiments. We remark that SBE is not very sensitive to these parameters. For example, it works equally well for . In addition, we always set .

3.1 2-D synthetic Gaussian data

We generated 4000 synthetic data points in 2-D plane by multivariate normal distributions with 1000 points in each cluster. The means and covariance matrices used for Gaussian distributions are as follows:

For the initial centroids given below,both Lloyd’s algorithm (or EM) and mini-batch EM got stuck at the same local minimum with objective value about ; see the left plot of Fig. 1.

Starting from where EM and mini-batch EM got stuck, we can see that SBE managed to jump over the trap of local minimum and arrived at a better minimum, which seems to be the global minimum; see the right plot of Fig. 1.

Figure 1: Synthetic Gaussian data with 4 centroids. Left: Computed centroids by EM and SBE corresponding to the objective values 1.34 and 0.89, respectively. Right: Plot of objective value v.s. number of (outer) iteration. EM converged quickly but got trapped at a local minimum. SBE bypassed this local minimum and reached a better minimum by jumping over a hill.

3.2 Iris dataset

The Iris dataset, which contains 150 4-D data samples from 3 clusters, was used for comparisons of SBE, EM as well as mini-batch EM algorithms. 100 runs were realized with the initial centroids randomly selected from the data samples. For the parameters, we chose mini-batch size , initial step size, imaxit, omaxit, and decay parameter . The histograms in Fig. 2 record the frequency of objective values given by the three algorithms. Clearly there was chance that EM got stuck at a local minimum whose value is about 0.48, whereas both SBE and mini-batch EM managed to locate an improved minimum valued at around 0.264 every time.

Figure 2: The Iris dataset with 3 clusters. Top left: histogram of objective values obtained by EM in 100 trials. Top right: histogram of objective values obtained by mini-batch EM in 100 trials. Bottom left: histogram of objective values obtained by SBE (proposed) in 100 trials. Bottom right: computed centroids by EM (black) and SBE (red), corresponding to the objective values 0.48 and 0.264, respectively.

3.3 Gaussian data with MNIST centroids

We selected 8 hand-written digit images of dimension from MNIST dataset shown in Fig. 3, and then generated 60,000 images from these 8 centroids by adding Gaussian noise. We compare SBE with both EM and mini-batch EM (mb-EM) mnbatch ; stochastic_kmeans

on 100 independent realizations with random initial guess. For each method, we recorded the minimum, maximum, mean and variance of the 100 objective values by the computed centroids.

Figure 3: 8 selected images from MNIST dataset. 60,000 sample images are generated from these 8 images by adding Gaussian noise.

We first compare SBE and EM with the true number of clusters . For SBE, mini-batch size , maximum number of iterations for backward Euler omaxit=150, maximum fixed-point iterations imaxit= 10 for SBE. We set the maximum number of iterations for EM to be 50, which was sufficient for its convergence. The results are listed in the first two rows of Table 1. We observed SBE always found a minimum around 15.68 up to a tiny error due to the noise from mini-batch. Moreover, note that although we run more iterations (taking the inner loop into account) for SBE than for EM, SBE actually requires less distance evaluations and is computationally cheaper compared with EM because of the small mini-batch. More details will be discussed in section 3.6.

In the comparison between SBE and mb-EM, we reduced mini-batch size to , omaxit, imaxit and tested for . Table 1 shows that with the same mini-batch size, SBE outperforms mb-EM in all three cases, in terms of both mean and variance of the objective values.

Method Batch size Max iter Min Max Mean Variance
8 EM 60000 50 15.6800 27.2828 20.0203 6.0030
SBE 1000 (150,10) 15.6808 15.6808 15.6808 1.49
6 mb-EM 500 100 20.44 23.4721 21.8393 0.67
SBE 500 (100,5) 20.2989 21.2047 20.4939 0.0439
8 mb-EM 500 100 15.9193 18.5820 16.4009 0.7646
SBE 500 (100,5) 15.6816 15.6821 15.6820 1.18
10 mb-EM 500 100 15.9148 18.1848 16.1727 0.4332
SBE 500 (100,5) 15.6823 15.6825 15.6824 1.5
Table 1: Gaussian data generated from MNIST centroids by adding noise. Ground truth . Clustering results for 100 independent trails with random initialization.

3.4 Raw MNIST data

In this example, We used the 60,000 images from the MNIST training set for clustering test, with 6000 samples for each digit (cluster) from 0 to 9. The comparison results are shown in Table 2. We conclude that SBE consistently performs better than EM and mb-EM. The histograms of objective value by the three algorithms in the case are plotted in Fig. 4.

Method Batch size Max iter Min Max Mean Variance
10 EM 60000 50 19.6069 19.8195 19.6725 0.0028
SBE 1000 (150,10) 19.6087 19.7279 19.6201 5.7
8 mb-EM 500 100 20.4948 20.7126 20.5958 0.0018
SBE 500 (100,5) 20.2723 20.4104 20.3090 0.0014
10 mb-EM 500 100 19.9029 20.2347 20.0146 0.0041
SBE 500 (100,5) 19.6103 19.7293 19.6354 0.0011
12 mb-EM 500 100 19.3978 19.7147 19.5136 0.0042
SBE 500 (100,5) 19.0492 19.1582 19.0972 6.2
Table 2: Raw MNIST training data. The ground truth number of clusters is . Clustering results for 100 independent trails with random initialization.
Figure 4: Histograms of objective value for MNIST training data with ground truth number of clusters . Top left: EM. Top right: SBE, mini-batch size of 1000. Bottom left: mn-EM, mini-batch size of 500. Bottom right: SBE, mini-batch size of 500.

3.5 MNSIT features

We extracted the feature vectors of MNIST training data prior to the last layer of LeNet-5 mnist_98 . The feature vectors have dimension 64 and lie in a better manifold compared with the raw data. The results are shown in Table 3 and Fig. 5 and 6.

Method Batch size Max iter Min Max Mean Variance
10 EM 60000 50 1.6238 3.0156 2.1406 0.0977
SBE 1000 (150,10) 1.6238 1.6239 1.6239 2.7
8 mb EM 500 100 2.3428 3.5972 2.7157 0.0666
SBE 500 (100,5) 2.2833 2.4311 2.3274 0.0015
10 mb EM 500 100 1.6504 2.6676 2.1391 0.0712
SBE 500 (100,5) 1.6239 1.6242 1.6240 1.37
12 mb EM 500 100 1.5815 2.6189 1.7853 0.0661
SBE 500 (100,5) 1.5326 1.5891 1.5622 9.8
Table 3: MNIST features generated by LeNet-5 network. The ground truth number of clusters is . Clustering results for 100 independent trails with random initialization.
Figure 5: Histograms of objective value for MNIST feature data with ground truth number of clusters K=10. Top left: EM. Top right: SBE, mini-batch size of 1000. Bottom left: mn-EM, mini-batch size of 500. Bottom right: SBE, mini-batch size of 500.
Figure 6: Objective value for MNIST features dataset. The ground truth number of clusters is . EM got trapped at local minimum around 2.178. Initializing SBE with this local minimizer, an improved minimum around 1.623 was found.

3.6 Comparison of time efficiency

We first compare the per-(outer)iteration costs for EM, mini-batch EM, and SBE, respectively. In every iteration, we need to find the the labels or clusters associated with data points that are used to update the centroids, for which minimum distance between the data points and current centroids are computed. This dominates the total computational cost, especially for big data. In EM, we need to find the labels for all data points when updating the centroids. In contrast, only a small batch of labels are needed in mini-batch EM and SBE. Specifically, for the datasets in sections 3.3 and 3.4 with 60,000 points of dimension 784, the per-iteration computation time of EM was around 2.3 seconds, whereas those of mini-batch EM and SBE (with 5 inner iterations) were 0.03 seconds and 0.1 seconds, respectively. For the raw MNIST data with , EM usually needed around 15 iterations to converge to a good minimum at about 19.6 with random initialization (if succeeded; see Table 2), whereas SBE needed 30 iterations and mini-batch EM always failed to do so. So basically we saw more than 10 savings in time efficiency of SBE, compared with EM. The tests were carried out on a laptop with 2.8 GHz Intel Core i7 CPU and 16 GB memory.

4 Discussions

In this section, we provide an intuitive explanation for why SBE often succeeds to find better local minima than SGD through a simple one-dimensional example in Fig. 7. At the -th iteration, SBE approximately solves due to the noise introduced by mini-batch gradient. Since is technically only piecewise Lipschitz continuous, the backward Euler may have multiple solutions. For example, in Fig. 7, we get two solutions in the leftmost valley and in the second from the left. solved by SBE is close to as it gives a lower objective value of . Then bypasses the local minimum, which explains the increase of objective value shown in the right plot of Fig. 1. We conjecture that averaging of the iterates enables an aggressive step size larger than the theoretical upper-bound as in Theorem 2.1, which also helps skip bad local minima. We did observe blowup phenomenon numerically whenever the averaging scheme was not used. It is of our interest to prove an improved upper-bound for in the future work. Similar to what was done in artina_13 , another direction is to analyze how exact the inner problems have to be solved in order to still guarantee the convergence of the outer Backward Euler problem.

Figure 7: Comparison the updates between SGD and SBE. Left: In the -th update, SGD gives while SBE solves gradient descent on and ends up with . Right: SBE tends to converge to the global minimum of local entropy . We must let in order for SBE to converge to the true global minimum of .


This work was partially supported by AFOSR grant FA9550-15-1-0073 and ONR grant N00014-16-1-2157. We would like to thank Dr. Bao Wang for helpful discussions. We also thank the anonymous reviewers for their constructive comments.


  • (1) M. Artina, M. Fornasier, and F. Solombrino. ”Linearly constrained nonsmooth and nonconvex minimization. SIAM Journal on Optimization”, 23(3), 1904-1937, 2013.
  • (2) D. Arthur and S. Vassilvitskii. ”-means++: The advantages of careful seeding”, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007.
  • (3)

    C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, ”Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses”, Physical review letters, 115(12), 128-101, 2015.

  • (4) D.P. Bertsekas: ”Nonlinear Programming” Second Edition, Athena Scientific, Belmont, Massachusetts, 2008.
  • (5) L. Bottou and Y. Bengio, ”Convergence properties of the -means algorithms”, Advances in neural information processing systems, 1995.
  • (6) P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. ”Entropy-sgd: Biasing gradient descent into wide valleys”, arXiv preprint arXiv:1611.01838, 2016.
  • (7) P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier, ”Deep Relaxation: partial differential equations for optimizing deep neural networks”, arXiv preprint arXiv:1704.04932, 2017.
  • (8) Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz, ”Yinyang -means: A drop-in replacement of the classic

    -means with consistent speedup”, Proceedings of the 32nd International Conference on Machine Learning, 2015.

  • (9) C. Elkan, ”Using the triangle inequality to accelerate -means”, Proceedings of the 20th International Conference on Machine Learning, 2003.
  • (10) A. Kaplan and R. Tichatschke, ”Proximal point method and nonconvex optimization”, Journal of Global Optimization, 13, 389-406, 1998.
  • (11) Y. LeCun, Y. Bengio, and G. Hinton, ”Deep learning”, Nature, 521(7553), 436-444, 2015.
  • (12) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, ”Gradient-based Learning Applied to Document Recognition”, Proceedings of the IEEE, 86(11), 2278-2324, 1998.
  • (13) S. Lloyd, ”Least squares quantization in PCM”, IEEE transactions on information theory, 28(2), 129-137, 1982.
  • (14) J.-J. Moreau, ” Proximité et dualité dans un espace hilbertien”, Bulletin de la Société Mathématique de France, 93, 273–299, 1965.
  • (15) J. Newling and F. Fleuret, ”Nested Mini-Batch -means”, Advances in Neural Information Processing Systems, 2016.
  • (16) R. Rockafellar, ”Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization, 14, 877-898, 1976.
  • (17) D. Sculley, ”Web-scale -means clustering”, Proceedings of the 19th international conference on World wide web, ACM, 2010.
  • (18) C. Tang and C. Monteleoni. ”Convergence rate of stochastic -means”, arXiv preprint arXiv:1610.04900, 2016.