1 Introduction
The
means method appeared in vector quantization in signal processing, and had now become popular for clustering analysis in data mining. In the seminal paper
lloyd, Lloyd proposed a twostep alternating algorithm that quickly converges to a local minimum. Lloyd’s algorithm is also known as an instance of the more general ExpectationMaximization (EM) algorithm applied to Gaussian mixtures. In
Bottou_95 , Bottou and Bengio cast Lloyd’s algorithm as Newton’s method, which explains its fast convergence.Aiming to speed up Lloyd’s algorithm, Elkan elkan proposed to keep track of the distances between the computed centroids and data points, and then cleverly leverage the triangle inequality to eliminate unnecessary computations of the distances. Similar techniques can be found in yingyang . It is worth noting that these algorithms do not improve the clustering quality of Lloyd’s algorithm, but only achieve acceleration. However, there are well known examples where poor initialization can lead to low quality local minima for Lloyd’s algorithm. Random initialization has been used to avoid these low quality fixed points. The article kmeans++ introduced a smart initialization scheme such that the initial centroids are wellseparated, which gives more robust clustering than random initialization.
We are motivated by problems with very large data sets, where the cost of a single iteration of Lloyd’s algorithm can be expensive. Minibatch mnbatch ; stochastic_kmeans was later introduced to adapt means for large scale data with high dimensions. The centroids are updated using a randomly selected minibatch rather than all of the data. Minibatch (stochastic)
means has a flavor of stochastic gradient descent whose benefits are twofold. First, it dramatically reduces the periteration cost for updating the centroids and thus is able to handle big data efficiently. Second, similar to its successful application to deep learning
deep_learning , minibatch gradient introduces noise in minimization and may help to bypass some bad local minima. Furthermore, the aforementioned Elkan’s technique can be combined with minibatch means for further acceleration nested_mnbatch .In this paper, we propose a backward Euler based algorithm for means clustering. Fixedpoint iteration is performed to solve the implicit gradient step. As is done for stochastic minibatch means, we compute the gradient only using a minibatch of samples instead of the whole data, which enables us to handle massive data. Unlike the standard fixedpoint iteration, the proposed stochastic fixedpoint iteration outputs an average over its trajectory. Extensive experiments show that, with proper choice of step size for stochastic backward Euler, the proposed algorithm can improve over EM and Minibatch EM and locate an improved minimum with decreased objective value.
In other words, while Lloyd’s algorithm is effective with a full gradient oracle we achieve better performance with the weaker minibatch gradient oracle. We are motivated by recent work by two of the authors deep_relax which applied a similar algorithm to accelerate the training of Deep Neural Networks.
2 Stochastic backward Euler
The celebrated proximal point algorithm (PPA) proximal_point for minimizing some function is:
(1) 
PPA has the advantage of being monotonically decreasing, which is guaranteed for any step size . Indeed, by the definition of in (1), we have
When for any with being the Lipschitz constant of , the (subsequential) convergence to a stationary point is established in PPM . If is differentiable at , it is easy to check that the following optimality condition to (1) holds
By rearranging the terms, we arrive at implicit gradient descent or the socalled backward Euler:
(2) 
When has the Lipschitz constant and , the fixed point iteration
(3) 
is a viable option for updating by solving the equation.
It is essentially the gradient descent
on (1) by choosing the step size .
Proposition 1
See (Ber08, , Proposition 1.2.3).
Let us consider means clustering for a set of data points in with centroids . Denoting , we seek to minimize
(4) 
Note that is nondifferentiable at if there exist and such that
This means that there is a data point which has two or more distinct nearest centroids and . The same situation may happen in the assignment step of Lloyd’s algorithm. In this case, we simply assign to one of the nearest centroids. With that said, is basically piecewise differentiable. By abuse of notation, we can define the ’gradient’ of at any point by
(5) 
where denotes the index set of the points that are assigned to the centroid . From now on and for the rest of the paper, we denote the piecewise gradient by as stated in (2), and none of the results depends on the specific assignment of ambiguous data points . Similarly, we can compute the ’Hessian’ of as was done in Bottou_95 :
where is an D vector of all ones. In what follows, we analyze how the fixed point iteration (3) works on the piecewise differentiable with discontinuous .
Definition 1
is piecewise Lipschitz continuous on with Lipschitz constant , if can be partitioned into a finite number of subdomains satisfying , , , and is Lipschitz continuous in each subdomain , i.e., for each we have
We can see that is at most piecewise Lipschitz continuous. The following result proves the convergence of fixed point iteration on means problem.
Theorem 2.1
Let be the means objective function defined in (4). Suppose is piecewise Lipschitz. If , then the fixed point iteration for minimizing given by
with the initialization satisfies

and as .

is bounded. Moreover, if any limit point of a convergent subsequence of lies in the interior of some subdomain, then the whole sequence converges to with a locally linear rate, which is a fixed point obeying
Proof
(a) We know that is piecewise quadratic. Suppose (note that could be on the boundary), then has a uniform expression restricted on which is a quadratic function, denoted by . We can extend the domain of from to the whole , and we denote the extended function still by . Since is quadratic, is Lipschitz continuous on . Then we have the following wellknown inequality
Using the above inequality and the definition of , we have
In the second equality above, we used the identity
with and . Since , is monotonically decreasing. Moreover, since is bounded from below by 0, converges and thus as .
(b) Since as , combining with the fact that , we have is bounded. Consider a convergent subsequence whose limit lies in the interior of some subdomain. Then for sufficiently large , will always remain in the same subdomain in which lies and thus . Since by (a), , we have as . Therefore,
which implies is a fixed point. Furthermore, by the piecewise Lipschitz condition,
Since , when is sufficiently large, is also in the same subdomain containing . By repeatedly applying the above inequality for , we conclude that converges to .
Remark 1
This result can be extended to objective functions that are the pointwise infimum of a set of a finite number of Lipschitz differentiable functions.
2.1 Algorithm description
Instead of using the full gradient in fixedpoint iteration, we adopt a randomly sampled minibatch gradient
at the th inner iteration. Here, denotes the index set of the points in the th minibatch associated with the centroid obeying . The fixedpoint iteration outputs a forward looking average over its trajectory. Intuitively averaging greatly stabilizes the noisy minibatch gradients and thus smooths the descent. We summarize the proposed algorithm in Algorithm 1. Another key ingredient of our algorithm is an aggressive initial step size , which helps bypass bad local minimum at the early stage. Unlike in deterministic backward Euler, diminishing step size is needed to ensure convergence. But should decay slowly because large step size is good for a global search.
2.2 Related work
Chaudhari et al. esgd recently proposed the entropy stochastic gradient descent (EntropySGD) algorithm to tackle the training of deep neural networks. Relaxation techniques arising in statistical physics were used to change the energy landscape of the original nonconvex objective function yet with the minimizers being preserved, which allows easier minimization to obtain a ’good’ minimizer with a better geometry. More precisely, they suggest to replace with a modified objective function called local entropy local_entropy as follows
where
is the heat kernel. The connection between EntropySGD and nonlinear partial differential equations (PDEs) was later established in
deep_relax . The local entropy function turns out to be the solution to the following viscous HamiltonJacobi (HJ) PDE at(6) 
with the initial condition . In the limit , (6) reduces to the nonviscous HJ equation
whose viscosity solution is exactly the Moreau envelope Moreau_65 :
The gradient descent dynamics for is obtained by taking the limit of the following system of stochastic differential equation as the homogenization parameter :
(7) 
where is the standard Wiener process. Specifically, we have
with and being the solution of (2.2) for fixed . This gives rise to the implementation of EntropySGD deep_relax :
where and are the gradient step sizes for the inner and outer loops, respectively, is the moving average of output from the inner loop, and introduces the noise. Stochastic backward Euler simplifies EntropySGD in two aspects. First, the term is absent in SBE as the minibatch gradient itself already contains the noise. Second, the step sizes and are both set to , which make the algorithm simpler with less tunable parameters.
3 Experimental results
We show by several experiments that the proposed stochastic backward Euler (SBE) gives superior clustering results compared with the stateoftheart algorithms for means. SBE scales well for large problems. In practice, only a small number of fixedpoint iterations are needed in the inner loop, and this seems not to depend on the size of the problem. Specifically, we chose the parameters imaxit
= 5 or and the averaging parameter in all experiments. We remark that SBE is not very sensitive to these parameters. For example, it works equally well for . In addition, we always set .
3.1 2D synthetic Gaussian data
We generated 4000 synthetic data points in 2D plane by multivariate normal distributions with 1000 points in each cluster. The means and covariance matrices used for Gaussian distributions are as follows:
For the initial centroids given below,both Lloyd’s algorithm (or EM) and minibatch EM got stuck at the same local minimum with objective value about ; see the left plot of Fig. 1.
Starting from where EM and minibatch EM got stuck, we can see that SBE managed to jump over the trap of local minimum and arrived at a better minimum, which seems to be the global minimum; see the right plot of Fig. 1.
3.2 Iris dataset
The Iris dataset, which contains 150 4D data samples from 3 clusters, was used for comparisons of SBE, EM as well as minibatch EM algorithms. 100 runs were realized with the initial centroids randomly selected from the data samples. For the parameters, we chose minibatch size , initial step size, imaxit
, omaxit
, and decay parameter . The histograms in Fig. 2 record the frequency of objective values given by the three algorithms. Clearly there was chance that EM got stuck at a local minimum whose value is about 0.48, whereas both SBE and minibatch EM managed to locate an improved minimum valued at around 0.264 every time.
3.3 Gaussian data with MNIST centroids
We selected 8 handwritten digit images of dimension from MNIST dataset shown in Fig. 3, and then generated 60,000 images from these 8 centroids by adding Gaussian noise. We compare SBE with both EM and minibatch EM (mbEM) mnbatch ; stochastic_kmeans
on 100 independent realizations with random initial guess. For each method, we recorded the minimum, maximum, mean and variance of the 100 objective values by the computed centroids.
We first compare SBE and EM with the true number of clusters . For SBE, minibatch size , maximum number of iterations for backward Euler omaxit
=150, maximum fixedpoint iterations imaxit
= 10 for SBE. We set the maximum number of iterations for EM to be 50, which was sufficient for its convergence. The results are listed in the first two rows of Table 1. We observed SBE always found a minimum around 15.68 up to a tiny error due to the noise from minibatch. Moreover, note that although we run more iterations (taking the inner loop into account) for SBE than for EM, SBE actually requires less distance evaluations and is computationally cheaper compared with EM because of the small minibatch. More details will be discussed in section 3.6.
In the comparison between SBE and mbEM, we reduced minibatch size to , omaxit
, imaxit
and tested for . Table 1 shows that with the same minibatch size, SBE outperforms mbEM in all three cases, in terms of both mean and variance of the objective values.
Method  Batch size  Max iter  Min  Max  Mean  Variance  

8  EM  60000  50  15.6800  27.2828  20.0203  6.0030 
SBE  1000  (150,10)  15.6808  15.6808  15.6808  1.49  
6  mbEM  500  100  20.44  23.4721  21.8393  0.67 
SBE  500  (100,5)  20.2989  21.2047  20.4939  0.0439  
8  mbEM  500  100  15.9193  18.5820  16.4009  0.7646 
SBE  500  (100,5)  15.6816  15.6821  15.6820  1.18  
10  mbEM  500  100  15.9148  18.1848  16.1727  0.4332 
SBE  500  (100,5)  15.6823  15.6825  15.6824  1.5 
3.4 Raw MNIST data
In this example, We used the 60,000 images from the MNIST training set for clustering test, with 6000 samples for each digit (cluster) from 0 to 9. The comparison results are shown in Table 2. We conclude that SBE consistently performs better than EM and mbEM. The histograms of objective value by the three algorithms in the case are plotted in Fig. 4.
Method  Batch size  Max iter  Min  Max  Mean  Variance  

10  EM  60000  50  19.6069  19.8195  19.6725  0.0028 
SBE  1000  (150,10)  19.6087  19.7279  19.6201  5.7  
8  mbEM  500  100  20.4948  20.7126  20.5958  0.0018 
SBE  500  (100,5)  20.2723  20.4104  20.3090  0.0014  
10  mbEM  500  100  19.9029  20.2347  20.0146  0.0041 
SBE  500  (100,5)  19.6103  19.7293  19.6354  0.0011  
12  mbEM  500  100  19.3978  19.7147  19.5136  0.0042 
SBE  500  (100,5)  19.0492  19.1582  19.0972  6.2 
3.5 MNSIT features
We extracted the feature vectors of MNIST training data prior to the last layer of LeNet5 mnist_98 . The feature vectors have dimension 64 and lie in a better manifold compared with the raw data. The results are shown in Table 3 and Fig. 5 and 6.
Method  Batch size  Max iter  Min  Max  Mean  Variance  

10  EM  60000  50  1.6238  3.0156  2.1406  0.0977 
SBE  1000  (150,10)  1.6238  1.6239  1.6239  2.7  
8  mb EM  500  100  2.3428  3.5972  2.7157  0.0666 
SBE  500  (100,5)  2.2833  2.4311  2.3274  0.0015  
10  mb EM  500  100  1.6504  2.6676  2.1391  0.0712 
SBE  500  (100,5)  1.6239  1.6242  1.6240  1.37  
12  mb EM  500  100  1.5815  2.6189  1.7853  0.0661 
SBE  500  (100,5)  1.5326  1.5891  1.5622  9.8 
3.6 Comparison of time efficiency
We first compare the per(outer)iteration costs for EM, minibatch EM, and SBE, respectively. In every iteration, we need to find the the labels or clusters associated with data points that are used to update the centroids, for which minimum distance between the data points and current centroids are computed. This dominates the total computational cost, especially for big data. In EM, we need to find the labels for all data points when updating the centroids. In contrast, only a small batch of labels are needed in minibatch EM and SBE. Specifically, for the datasets in sections 3.3 and 3.4 with 60,000 points of dimension 784, the periteration computation time of EM was around 2.3 seconds, whereas those of minibatch EM and SBE (with 5 inner iterations) were 0.03 seconds and 0.1 seconds, respectively. For the raw MNIST data with , EM usually needed around 15 iterations to converge to a good minimum at about 19.6 with random initialization (if succeeded; see Table 2), whereas SBE needed 30 iterations and minibatch EM always failed to do so. So basically we saw more than 10 savings in time efficiency of SBE, compared with EM. The tests were carried out on a laptop with 2.8 GHz Intel Core i7 CPU and 16 GB memory.
4 Discussions
In this section, we provide an intuitive explanation for why SBE often succeeds to find better local minima than SGD through a simple onedimensional example in Fig. 7. At the th iteration, SBE approximately solves due to the noise introduced by minibatch gradient. Since is technically only piecewise Lipschitz continuous, the backward Euler may have multiple solutions. For example, in Fig. 7, we get two solutions in the leftmost valley and in the second from the left. solved by SBE is close to as it gives a lower objective value of . Then bypasses the local minimum, which explains the increase of objective value shown in the right plot of Fig. 1. We conjecture that averaging of the iterates enables an aggressive step size larger than the theoretical upperbound as in Theorem 2.1, which also helps skip bad local minima. We did observe blowup phenomenon numerically whenever the averaging scheme was not used. It is of our interest to prove an improved upperbound for in the future work. Similar to what was done in artina_13 , another direction is to analyze how exact the inner problems have to be solved in order to still guarantee the convergence of the outer Backward Euler problem.
Acknowledgement
This work was partially supported by AFOSR grant FA95501510073 and ONR grant N000141612157. We would like to thank Dr. Bao Wang for helpful discussions. We also thank the anonymous reviewers for their constructive comments.
References
 (1) M. Artina, M. Fornasier, and F. Solombrino. ”Linearly constrained nonsmooth and nonconvex minimization. SIAM Journal on Optimization”, 23(3), 19041937, 2013.
 (2) D. Arthur and S. Vassilvitskii. ”means++: The advantages of careful seeding”, Proceedings of the eighteenth annual ACMSIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007.

(3)
C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, ”Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses”, Physical review letters, 115(12), 128101, 2015.
 (4) D.P. Bertsekas: ”Nonlinear Programming” Second Edition, Athena Scientific, Belmont, Massachusetts, 2008.
 (5) L. Bottou and Y. Bengio, ”Convergence properties of the means algorithms”, Advances in neural information processing systems, 1995.
 (6) P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. ”Entropysgd: Biasing gradient descent into wide valleys”, arXiv preprint arXiv:1611.01838, 2016.
 (7) P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier, ”Deep Relaxation: partial differential equations for optimizing deep neural networks”, arXiv preprint arXiv:1704.04932, 2017.

(8)
Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz, ”Yinyang means: A dropin replacement of the classic
means with consistent speedup”, Proceedings of the 32nd International Conference on Machine Learning, 2015.
 (9) C. Elkan, ”Using the triangle inequality to accelerate means”, Proceedings of the 20th International Conference on Machine Learning, 2003.
 (10) A. Kaplan and R. Tichatschke, ”Proximal point method and nonconvex optimization”, Journal of Global Optimization, 13, 389406, 1998.
 (11) Y. LeCun, Y. Bengio, and G. Hinton, ”Deep learning”, Nature, 521(7553), 436444, 2015.
 (12) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, ”Gradientbased Learning Applied to Document Recognition”, Proceedings of the IEEE, 86(11), 22782324, 1998.
 (13) S. Lloyd, ”Least squares quantization in PCM”, IEEE transactions on information theory, 28(2), 129137, 1982.
 (14) J.J. Moreau, ” Proximité et dualité dans un espace hilbertien”, Bulletin de la Société Mathématique de France, 93, 273–299, 1965.
 (15) J. Newling and F. Fleuret, ”Nested MiniBatch means”, Advances in Neural Information Processing Systems, 2016.
 (16) R. Rockafellar, ”Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization, 14, 877898, 1976.
 (17) D. Sculley, ”Webscale means clustering”, Proceedings of the 19th international conference on World wide web, ACM, 2010.
 (18) C. Tang and C. Monteleoni. ”Convergence rate of stochastic means”, arXiv preprint arXiv:1610.04900, 2016.