, Lloyd proposed a two-step alternating algorithm that quickly converges to a local minimum. Lloyd’s algorithm is also known as an instance of the more general Expectation-Maximization (EM) algorithm applied to Gaussian mixtures. InBottou_95 , Bottou and Bengio cast Lloyd’s algorithm as Newton’s method, which explains its fast convergence.
Aiming to speed up Lloyd’s algorithm, Elkan elkan proposed to keep track of the distances between the computed centroids and data points, and then cleverly leverage the triangle inequality to eliminate unnecessary computations of the distances. Similar techniques can be found in yingyang . It is worth noting that these algorithms do not improve the clustering quality of Lloyd’s algorithm, but only achieve acceleration. However, there are well known examples where poor initialization can lead to low quality local minima for Lloyd’s algorithm. Random initialization has been used to avoid these low quality fixed points. The article kmeans++ introduced a smart initialization scheme such that the initial centroids are well-separated, which gives more robust clustering than random initialization.
We are motivated by problems with very large data sets, where the cost of a single iteration of Lloyd’s algorithm can be expensive. Mini-batch mnbatch ; stochastic_kmeans was later introduced to adapt -means for large scale data with high dimensions. The centroids are updated using a randomly selected mini-batch rather than all of the data. Mini-batch (stochastic)
-means has a flavor of stochastic gradient descent whose benefits are twofold. First, it dramatically reduces the per-iteration cost for updating the centroids and thus is able to handle big data efficiently. Second, similar to its successful application to deep learningdeep_learning , mini-batch gradient introduces noise in minimization and may help to bypass some bad local minima. Furthermore, the aforementioned Elkan’s technique can be combined with mini-batch -means for further acceleration nested_mnbatch .
In this paper, we propose a backward Euler based algorithm for -means clustering. Fixed-point iteration is performed to solve the implicit gradient step. As is done for stochastic mini-batch -means, we compute the gradient only using a mini-batch of samples instead of the whole data, which enables us to handle massive data. Unlike the standard fixed-point iteration, the proposed stochastic fixed-point iteration outputs an average over its trajectory. Extensive experiments show that, with proper choice of step size for stochastic backward Euler, the proposed algorithm can improve over EM and Mini-batch EM and locate an improved minimum with decreased objective value.
In other words, while Lloyd’s algorithm is effective with a full gradient oracle we achieve better performance with the weaker mini-batch gradient oracle. We are motivated by recent work by two of the authors deep_relax which applied a similar algorithm to accelerate the training of Deep Neural Networks.
2 Stochastic backward Euler
The celebrated proximal point algorithm (PPA) proximal_point for minimizing some function is:
PPA has the advantage of being monotonically decreasing, which is guaranteed for any step size . Indeed, by the definition of in (1), we have
When for any with being the Lipschitz constant of , the (subsequential) convergence to a stationary point is established in PPM . If is differentiable at , it is easy to check that the following optimality condition to (1) holds
By rearranging the terms, we arrive at implicit gradient descent or the so-called backward Euler:
When has the Lipschitz constant and , the fixed point iteration
is a viable option for updating by solving the equation.
It is essentially the gradient descent
on (1) by choosing the step size .
See (Ber08, , Proposition 1.2.3).
Let us consider -means clustering for a set of data points in with centroids . Denoting , we seek to minimize
Note that is non-differentiable at if there exist and such that
This means that there is a data point which has two or more distinct nearest centroids and . The same situation may happen in the assignment step of Lloyd’s algorithm. In this case, we simply assign to one of the nearest centroids. With that said, is basically piecewise differentiable. By abuse of notation, we can define the ’gradient’ of at any point by
where denotes the index set of the points that are assigned to the centroid . From now on and for the rest of the paper, we denote the piecewise gradient by as stated in (2), and none of the results depends on the specific assignment of ambiguous data points . Similarly, we can compute the ’Hessian’ of as was done in Bottou_95 :
where is an -D vector of all ones. In what follows, we analyze how the fixed point iteration (3) works on the piecewise differentiable with discontinuous .
is piecewise Lipschitz continuous on with Lipschitz constant , if can be partitioned into a finite number of sub-domains satisfying , , , and is Lipschitz continuous in each sub-domain , i.e., for each we have
We can see that is at most piecewise -Lipschitz continuous. The following result proves the convergence of fixed point iteration on -means problem.
Let be the -means objective function defined in (4). Suppose is piecewise -Lipschitz. If , then the fixed point iteration for minimizing given by
with the initialization satisfies
and as .
is bounded. Moreover, if any limit point of a convergent subsequence of lies in the interior of some sub-domain, then the whole sequence converges to with a locally linear rate, which is a fixed point obeying
(a) We know that is piecewise quadratic. Suppose (note that could be on the boundary), then has a uniform expression restricted on which is a quadratic function, denoted by . We can extend the domain of from to the whole , and we denote the extended function still by . Since is quadratic, is -Lipschitz continuous on . Then we have the following well-known inequality
Using the above inequality and the definition of , we have
In the second equality above, we used the identity
with and . Since , is monotonically decreasing. Moreover, since is bounded from below by 0, converges and thus as .
(b) Since as , combining with the fact that , we have is bounded. Consider a convergent subsequence whose limit lies in the interior of some sub-domain. Then for sufficiently large , will always remain in the same sub-domain in which lies and thus . Since by (a), , we have as . Therefore,
which implies is a fixed point. Furthermore, by the piecewise Lipschitz condition,
Since , when is sufficiently large, is also in the same sub-domain containing . By repeatedly applying the above inequality for , we conclude that converges to .
This result can be extended to objective functions that are the pointwise infimum of a set of a finite number of Lipschitz differentiable functions.
2.1 Algorithm description
Instead of using the full gradient in fixed-point iteration, we adopt a randomly sampled mini-batch gradient
at the -th inner iteration. Here, denotes the index set of the points in the -th mini-batch associated with the centroid obeying . The fixed-point iteration outputs a forward looking average over its trajectory. Intuitively averaging greatly stabilizes the noisy mini-batch gradients and thus smooths the descent. We summarize the proposed algorithm in Algorithm 1. Another key ingredient of our algorithm is an aggressive initial step size , which helps bypass bad local minimum at the early stage. Unlike in deterministic backward Euler, diminishing step size is needed to ensure convergence. But should decay slowly because large step size is good for a global search.
2.2 Related work
Chaudhari et al. esgd recently proposed the entropy stochastic gradient descent (Entropy-SGD) algorithm to tackle the training of deep neural networks. Relaxation techniques arising in statistical physics were used to change the energy landscape of the original non-convex objective function yet with the minimizers being preserved, which allows easier minimization to obtain a ’good’ minimizer with a better geometry. More precisely, they suggest to replace with a modified objective function called local entropy local_entropy as follows
is the heat kernel. The connection between Entropy-SGD and nonlinear partial differential equations (PDEs) was later established indeep_relax . The local entropy function turns out to be the solution to the following viscous Hamilton-Jacobi (HJ) PDE at
with the initial condition . In the limit , (6) reduces to the non-viscous HJ equation
whose viscosity solution is exactly the Moreau envelope Moreau_65 :
The gradient descent dynamics for is obtained by taking the limit of the following system of stochastic differential equation as the homogenization parameter :
where is the standard Wiener process. Specifically, we have
where and are the gradient step sizes for the inner and outer loops, respectively, is the moving average of output from the inner loop, and introduces the noise. Stochastic backward Euler simplifies Entropy-SGD in two aspects. First, the term is absent in SBE as the mini-batch gradient itself already contains the noise. Second, the step sizes and are both set to , which make the algorithm simpler with less tunable parameters.
3 Experimental results
We show by several experiments that the proposed stochastic backward Euler (SBE) gives superior clustering results compared with the state-of-the-art algorithms for -means. SBE scales well for large problems. In practice, only a small number of fixed-point iterations are needed in the inner loop, and this seems not to depend on the size of the problem. Specifically, we chose the parameters
imaxit = 5 or and the averaging parameter in all experiments. We remark that SBE is not very sensitive to these parameters. For example, it works equally well for . In addition, we always set .
3.1 2-D synthetic Gaussian data
We generated 4000 synthetic data points in 2-D plane by multivariate normal distributions with 1000 points in each cluster. The means and covariance matrices used for Gaussian distributions are as follows:
For the initial centroids given below,both Lloyd’s algorithm (or EM) and mini-batch EM got stuck at the same local minimum with objective value about ; see the left plot of Fig. 1.
Starting from where EM and mini-batch EM got stuck, we can see that SBE managed to jump over the trap of local minimum and arrived at a better minimum, which seems to be the global minimum; see the right plot of Fig. 1.
3.2 Iris dataset
The Iris dataset, which contains 150 4-D data samples from 3 clusters, was used for comparisons of SBE, EM as well as mini-batch EM algorithms. 100 runs were realized with the initial centroids randomly selected from the data samples. For the parameters, we chose mini-batch size , initial step size,
omaxit, and decay parameter . The histograms in Fig. 2 record the frequency of objective values given by the three algorithms. Clearly there was chance that EM got stuck at a local minimum whose value is about 0.48, whereas both SBE and mini-batch EM managed to locate an improved minimum valued at around 0.264 every time.
3.3 Gaussian data with MNIST centroids
We selected 8 hand-written digit images of dimension from MNIST dataset shown in Fig. 3, and then generated 60,000 images from these 8 centroids by adding Gaussian noise. We compare SBE with both EM and mini-batch EM (mb-EM) mnbatch ; stochastic_kmeans
on 100 independent realizations with random initial guess. For each method, we recorded the minimum, maximum, mean and variance of the 100 objective values by the computed centroids.
We first compare SBE and EM with the true number of clusters . For SBE, mini-batch size , maximum number of iterations for backward Euler
omaxit=150, maximum fixed-point iterations
imaxit= 10 for SBE. We set the maximum number of iterations for EM to be 50, which was sufficient for its convergence. The results are listed in the first two rows of Table 1. We observed SBE always found a minimum around 15.68 up to a tiny error due to the noise from mini-batch. Moreover, note that although we run more iterations (taking the inner loop into account) for SBE than for EM, SBE actually requires less distance evaluations and is computationally cheaper compared with EM because of the small mini-batch. More details will be discussed in section 3.6.
In the comparison between SBE and mb-EM, we reduced mini-batch size to ,
imaxit and tested for . Table 1 shows that with the same mini-batch size, SBE outperforms mb-EM in all three cases, in terms of both mean and variance of the objective values.
|Method||Batch size||Max iter||Min||Max||Mean||Variance|
3.4 Raw MNIST data
In this example, We used the 60,000 images from the MNIST training set for clustering test, with 6000 samples for each digit (cluster) from 0 to 9. The comparison results are shown in Table 2. We conclude that SBE consistently performs better than EM and mb-EM. The histograms of objective value by the three algorithms in the case are plotted in Fig. 4.
|Method||Batch size||Max iter||Min||Max||Mean||Variance|
3.5 MNSIT features
We extracted the feature vectors of MNIST training data prior to the last layer of LeNet-5 mnist_98 . The feature vectors have dimension 64 and lie in a better manifold compared with the raw data. The results are shown in Table 3 and Fig. 5 and 6.
|Method||Batch size||Max iter||Min||Max||Mean||Variance|
3.6 Comparison of time efficiency
We first compare the per-(outer)iteration costs for EM, mini-batch EM, and SBE, respectively. In every iteration, we need to find the the labels or clusters associated with data points that are used to update the centroids, for which minimum distance between the data points and current centroids are computed. This dominates the total computational cost, especially for big data. In EM, we need to find the labels for all data points when updating the centroids. In contrast, only a small batch of labels are needed in mini-batch EM and SBE. Specifically, for the datasets in sections 3.3 and 3.4 with 60,000 points of dimension 784, the per-iteration computation time of EM was around 2.3 seconds, whereas those of mini-batch EM and SBE (with 5 inner iterations) were 0.03 seconds and 0.1 seconds, respectively. For the raw MNIST data with , EM usually needed around 15 iterations to converge to a good minimum at about 19.6 with random initialization (if succeeded; see Table 2), whereas SBE needed 30 iterations and mini-batch EM always failed to do so. So basically we saw more than 10 savings in time efficiency of SBE, compared with EM. The tests were carried out on a laptop with 2.8 GHz Intel Core i7 CPU and 16 GB memory.
In this section, we provide an intuitive explanation for why SBE often succeeds to find better local minima than SGD through a simple one-dimensional example in Fig. 7. At the -th iteration, SBE approximately solves due to the noise introduced by mini-batch gradient. Since is technically only piecewise Lipschitz continuous, the backward Euler may have multiple solutions. For example, in Fig. 7, we get two solutions in the leftmost valley and in the second from the left. solved by SBE is close to as it gives a lower objective value of . Then bypasses the local minimum, which explains the increase of objective value shown in the right plot of Fig. 1. We conjecture that averaging of the iterates enables an aggressive step size larger than the theoretical upper-bound as in Theorem 2.1, which also helps skip bad local minima. We did observe blowup phenomenon numerically whenever the averaging scheme was not used. It is of our interest to prove an improved upper-bound for in the future work. Similar to what was done in artina_13 , another direction is to analyze how exact the inner problems have to be solved in order to still guarantee the convergence of the outer Backward Euler problem.
This work was partially supported by AFOSR grant FA9550-15-1-0073 and ONR grant N00014-16-1-2157. We would like to thank Dr. Bao Wang for helpful discussions. We also thank the anonymous reviewers for their constructive comments.
- (1) M. Artina, M. Fornasier, and F. Solombrino. ”Linearly constrained nonsmooth and nonconvex minimization. SIAM Journal on Optimization”, 23(3), 1904-1937, 2013.
- (2) D. Arthur and S. Vassilvitskii. ”-means++: The advantages of careful seeding”, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007.
C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, ”Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses”, Physical review letters, 115(12), 128-101, 2015.
- (4) D.P. Bertsekas: ”Nonlinear Programming” Second Edition, Athena Scientific, Belmont, Massachusetts, 2008.
- (5) L. Bottou and Y. Bengio, ”Convergence properties of the -means algorithms”, Advances in neural information processing systems, 1995.
- (6) P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. ”Entropy-sgd: Biasing gradient descent into wide valleys”, arXiv preprint arXiv:1611.01838, 2016.
- (7) P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier, ”Deep Relaxation: partial differential equations for optimizing deep neural networks”, arXiv preprint arXiv:1704.04932, 2017.
Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz, ”Yinyang -means: A drop-in replacement of the classic
-means with consistent speedup”, Proceedings of the 32nd International Conference on Machine Learning, 2015.
- (9) C. Elkan, ”Using the triangle inequality to accelerate -means”, Proceedings of the 20th International Conference on Machine Learning, 2003.
- (10) A. Kaplan and R. Tichatschke, ”Proximal point method and nonconvex optimization”, Journal of Global Optimization, 13, 389-406, 1998.
- (11) Y. LeCun, Y. Bengio, and G. Hinton, ”Deep learning”, Nature, 521(7553), 436-444, 2015.
- (12) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, ”Gradient-based Learning Applied to Document Recognition”, Proceedings of the IEEE, 86(11), 2278-2324, 1998.
- (13) S. Lloyd, ”Least squares quantization in PCM”, IEEE transactions on information theory, 28(2), 129-137, 1982.
- (14) J.-J. Moreau, ” Proximité et dualité dans un espace hilbertien”, Bulletin de la Société Mathématique de France, 93, 273–299, 1965.
- (15) J. Newling and F. Fleuret, ”Nested Mini-Batch -means”, Advances in Neural Information Processing Systems, 2016.
- (16) R. Rockafellar, ”Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization, 14, 877-898, 1976.
- (17) D. Sculley, ”Web-scale -means clustering”, Proceedings of the 19th international conference on World wide web, ACM, 2010.
- (18) C. Tang and C. Monteleoni. ”Convergence rate of stochastic -means”, arXiv preprint arXiv:1610.04900, 2016.