I. Introduction
Min-max optimization has been extensively studied in the literature due to its wide range of applications. It appears throughout statistics, operations research and engineering under topics such as throttling, resource allocation, computer graphics, computational geometry, clustering, anomaly detection and facility location.
There have been attempts to solve this problem via alternative formulations or smoothing approximations [1, 2, 3, 4], and the technique of solving the min-max optimization by smoothing the target has been extensively studied in the literature. Even after these extensive studies, however, convergence rate analysis remains very limited: existing works generally show only that their methods converge to the optimal solution given enough time (existence proofs). An example of the limited convergence rate analysis for this problem can be found in [5], which solves the generic nonsmooth min-max optimization.
Consequently, to our knowledge, for the first time in the literature, we derive a major improvement on the convergence guarantees for min-max optimization problems where the components contributing to the max operation are strongly convex, smooth and have bounded gradients. Our convergence rate is such that for an optimality gap of $\epsilon$, we need $O(\epsilon^{-1/2})$ computational resource (up to logarithmic factors), improving upon the $O(\epsilon^{-1})$ optimization complexity for nonsmooth strongly convex functions having bounded gradients [6].
A specific instance of this widely studied optimization problem is named the minimal bounding sphere. There have been several attempts to solve this problem deterministically. The computational complexity is generally superlinear with respect to the number of points and the vector space dimension, with polynomial dependencies whose integer powers are, at times, much larger than 1. Thus, the feasible approaches to this problem generally focus on heuristic methods with experimentally shown efficiency [7, 8, 9]. An alternative approach to finding minimal bounding spheres, with linear time-complexity dependencies with respect to the number of points and the vector space dimension, is the so-called $(1+\epsilon)$-approximation. The corresponding attempts are based on core-set constructions [10, 11]. The state of the art among these approximative solutions finds a bounding sphere with radius $(1+\epsilon)R$, where $R$ denotes the actual minimal bounding sphere radius, with a time complexity whose dominant term is $O(Nd/\epsilon)$ [11] for an arbitrarily large number of points $N$ and vector space dimension $d$. We improve upon this by showing that the time complexity can be reduced to $O(Nd/\sqrt{\epsilon})$ (up to logarithmic factors).
We next continue with a rigorous formulation of the problem, after which we demonstrate how the improvements for both the general min-max optimization and the minimal bounding sphere are achieved.
I-A. Problem Description for the General Min-Max Optimization
Our convex optimization problem is such that the function to be minimized is of the form:
$$f(x) = \max_{n \in \{1,\ldots,N\}} f_n(x), \qquad (1)$$
where $N$ is the number of functions amongst which we select the maximum for a given argument $x$ via the $\max$ operator. Each function $f_n(\cdot)$ is twice-differentiable and displays strong convexity with Lipschitz smoothness. The gradients $\nabla f_n(x)$ are also assumed to be bounded, at least in the subspace of $\mathbb{R}^d$ subjected to the iterative search, including the optimal point. This subspace can be either preset or naturally occurring due to the nature of our method. Twice-differentiability is required since our analysis depends on the behavior of the Hessian matrix.
Normally, $f(\cdot)$ in (1) is nonsmooth due to the maximum operator. However, we will show that by optimizing a substitute function, which approximates the original sufficiently well, we can improve the time dependency of the regular convergence rate from $O(\epsilon^{-1})$ to $O(\epsilon^{-1/2})$ up to logarithmic factors, where $O(\cdot)$ is the big-O notation.
After we finish our discussion of the general setting, we will investigate a special case of this min-max optimization, called the minimal bounding sphere, in Section IV.
II. Smooth Approximation of the Max Operator
Let us define the new function $f_\gamma(\cdot)$, which we shall use as a substitute for $f(\cdot)$, as follows:
$$f_\gamma(x) = \frac{1}{\gamma}\log\left(\sum_{n=1}^{N} \exp\left(\gamma f_n(x)\right)\right), \qquad (2)$$
where $\log(\cdot)$ is the natural logarithm, $\exp(\cdot)$ is the natural exponentiation and $\gamma > 0$. This form of smooth maximum is also referred to as "log-sum-exp". We now present a lemma regarding how well $f_\gamma(\cdot)$ approximates $f(\cdot)$.
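As a concrete numerical illustration of (2), the smooth maximum can be computed stably by shifting with the true maximum before exponentiating; the shift cancels exactly in the final expression. This is a minimal sketch (the helper name `smooth_max` is our own):

```python
import numpy as np

def smooth_max(values, gamma):
    """Log-sum-exp smooth maximum: (1/gamma) * log(sum_n exp(gamma * v_n)).

    Subtracting the true max before exponentiating prevents overflow for
    large gamma; the subtracted term cancels in the final expression.
    """
    values = np.asarray(values, dtype=float)
    m = values.max()
    return m + np.log(np.exp(gamma * (values - m)).sum()) / gamma

# The gap to the true max shrinks as gamma grows; Lemma 1 bounds it by log(N)/gamma.
vals = [1.0, 2.5, 2.4, -3.0]
gammas = (1.0, 10.0, 100.0)
gaps = [smooth_max(vals, g) - max(vals) for g in gammas]
```

Larger $\gamma$ tightens the approximation but, as quantified later in Lemma 5, also worsens the conditioning of the smoothed problem.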
Lemma 1.
The substitute function $f_\gamma(\cdot)$ is both lower and upper bounded by $f(\cdot)$, with the upper bound having an additive redundancy of at most $\gamma^{-1}\log N$, such that
$$f(x) \leq f_\gamma(x) \leq f(x) + \frac{\log N}{\gamma}.$$
Proof.
By the definitions of $f(\cdot)$ and $f_\gamma(\cdot)$, we have
$$\exp\left(\gamma f(x)\right) \leq \sum_{n=1}^{N}\exp\left(\gamma f_n(x)\right), \qquad (3)$$
where $f(x) = \max_n f_n(x)$ due to (1). Since $\log(\cdot)$ is a monotonically increasing function and $\exp(\gamma f_n(x)) > 0$ for all $n$, the combination of (2) and (3) yields
$$f(x) \leq f_\gamma(x). \qquad (4)$$
Again, due to the monotonicity of $\log(\cdot)$ and $\exp(\cdot)$, we can replace each individual $f_n(x)$ in (2) with $f(x)$ as an upper bound, which gives $\sum_{n}\exp(\gamma f_n(x)) \leq N\exp(\gamma f(x))$. In combination with (4), this results in the lemma. ∎
Lemma 1 implies that if we optimize $f_\gamma(\cdot)$ instead, we incur an additional redundancy of at most $\gamma^{-1}\log N$ as a cost for smoothing the target function.
Corollary 1.
The gap between $f(x)$ and $f(x_*)$, where $x_*$ minimizes $f(\cdot)$, can be decomposed into a "smoothing" regret and the gap between their smoothed counterparts as follows:
$$f(x) - f(x_*) \leq \left(f_\gamma(x) - f_\gamma(x_{*\gamma})\right) + \frac{\log N}{\gamma},$$
where $x_{*\gamma}$ minimizes $f_\gamma(\cdot)$ and the "smoothing" regret is $\gamma^{-1}\log N$.
Proof.
The result follows directly from Lemma 1. ∎
We introduce the shorthand notation for the optimal point minimizing $f_\gamma(\cdot)$ as:
$$x_{*\gamma} \triangleq \arg\min_{x} f_\gamma(x). \qquad (5)$$
Next, we derive some properties of this new function $f_\gamma(\cdot)$, namely the gradient and Hessian, after which we can investigate its strong-convexity and smoothness parameters.
II-A. The gradient and the Hessian of the substitute function
We start with a probability vector definition, which is used for writing weighted sums via expectations.
Definition 1.
Given the "smoother" $\gamma$ and the argument $x$, we generate the probability vector $p(x)$ such that:
$$p_n(x) = \frac{\exp\left(\gamma f_n(x)\right)}{\sum_{m=1}^{N}\exp\left(\gamma f_m(x)\right)},$$
where $p_n(x)$ is the $n^{th}$ element of the vector $p(x)$.
In the following lemmas, we compute the gradient and, from there, the Hessian of the substitute function $f_\gamma(\cdot)$, which are used for the iterative optimization.
Lemma 2.
We can write the gradient $\nabla f_\gamma(x)$ as a weighted combination of individual gradients where the weights sum to $1$, such that
$$\nabla f_\gamma(x) = \sum_{n=1}^{N} p_n(x)\,\nabla f_n(x) = E_n\left[\nabla f_n(x)\right],$$
where $E_n[\cdot]$ is the expectation operation with respect to the probability mass function corresponding to the size-$N$ vector $p(x)$. Each element $p_n(x)$ of $p(x)$ corresponds to the probability assigned to $\nabla f_n(x)$ as defined in Definition 1.
Proof.
The result directly follows after taking the partial derivatives of (2) with respect to each element of $x$. ∎
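As a sanity check on Lemma 2, the gradient of the smoothed function is a softmax-weighted average of the component gradients. The sketch below uses illustrative quadratic components $f_n(x) = \|x - c_n\|^2$ (our own choice, not mandated by the paper) and validates the analytic gradient against a finite difference of $f_\gamma$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, gamma = 5, 3, 4.0
# Illustrative strongly convex components: f_n(x) = ||x - c_n||^2.
C = rng.normal(size=(N, d))

def f_components(x):
    return ((x - C) ** 2).sum(axis=1)          # shape (N,)

def weights(x):
    """Probability vector of Definition 1 (a softmax over gamma * f_n(x))."""
    z = gamma * f_components(x)
    z -= z.max()                               # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def grad_smooth(x):
    """Lemma 2: gradient of f_gamma is the p-weighted sum of component gradients."""
    grads = 2.0 * (x - C)                      # gradient of each ||x - c_n||^2
    return weights(x) @ grads

def f_gamma(x):
    z = gamma * f_components(x)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / gamma

# Central finite difference of f_gamma for comparison.
x = rng.normal(size=d)
h = 1e-6
numeric = np.array([(f_gamma(x + h * e) - f_gamma(x - h * e)) / (2 * h)
                    for e in np.eye(d)])
```

Here `numeric` and `grad_smooth(x)` should agree to several digits, as the assertion below checks.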
Lemma 3.
Considering the gradient $\nabla f_n(x)$ as a random vector and the Hessian $\nabla^2 f_n(x)$ as a random matrix, each having $N$ possible realizations generated from the probability mass function corresponding to the vector $p(x)$, the Hessian $\nabla^2 f_\gamma(x)$ can be computed with the expectation of $\nabla^2 f_n(x)$ and the covariance matrix of $\nabla f_n(x)$ as follows:
$$\nabla^2 f_\gamma(x) = E_n\left[\nabla^2 f_n(x)\right] + \gamma\,\Sigma(x),$$
where $E_n[\cdot]$, $p(x)$ are defined as in Lemma 2 and the covariance matrix $\Sigma(x)$ is given as,
$$\Sigma(x) = E_n\left[\nabla f_n(x)\,\nabla f_n(x)^T\right] - E_n\left[\nabla f_n(x)\right] E_n\left[\nabla f_n(x)\right]^T. \qquad (6)$$
Proof.
The result directly follows from taking further partial derivatives of the gradient in Lemma 2. ∎
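Lemma 3 can likewise be verified numerically: the Hessian of $f_\gamma$ equals the expected component Hessian plus $\gamma$ times the covariance of the component gradients. A sketch with the same illustrative quadratic components $f_n(x) = \|x - c_n\|^2$ (for which each component Hessian is $2I$), checked against a second-order finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, gamma = 4, 3, 2.0
C = rng.normal(size=(N, d))                    # illustrative points c_n

def f_gamma(x):
    z = gamma * ((x - C) ** 2).sum(axis=1)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / gamma

def weights(x):
    z = gamma * ((x - C) ** 2).sum(axis=1)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def hessian_smooth(x):
    """Lemma 3: Hessian = E[Hess f_n] + gamma * Cov(grad f_n)."""
    p = weights(x)
    grads = 2.0 * (x - C)                      # grad f_n, shape (N, d)
    mean_grad = p @ grads
    # E[g g^T] - E[g] E[g]^T under the probability vector p.
    cov = (grads.T * p) @ grads - np.outer(mean_grad, mean_grad)
    mean_hess = 2.0 * np.eye(d)                # Hessian of ||x - c_n||^2 is 2 I
    return mean_hess + gamma * cov

# Mixed central differences of f_gamma as an independent reference.
x = rng.normal(size=d)
h = 1e-5
numeric = np.empty((d, d))
for i, e in enumerate(np.eye(d)):
    for j, u in enumerate(np.eye(d)):
        numeric[i, j] = (f_gamma(x + h*e + h*u) - f_gamma(x + h*e - h*u)
                         - f_gamma(x - h*e + h*u) + f_gamma(x - h*e - h*u)) / (4*h*h)
```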
In the following section, we explain our methodology for accelerating the convergence rate.
III. Accelerated Optimization of the Approximation
We utilize Nesterov’s accelerated gradient descent method for smooth and strongly convex functions, for which more details are given in [12]. The algorithm is an iterative one, where the iterations are done in an alternating fashion. Starting with the initial argument pair $(x_0, y_0)$, where $y_0 = x_0$, we have the following iterative relations for $t \geq 0$:
$$y_{t+1} = x_t - \frac{1}{L_\gamma}\nabla f_\gamma(x_t), \qquad x_{t+1} = \left(1 + \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right) y_{t+1} - \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\, y_t, \qquad (7)$$
with $\kappa$ being the condition number of the Hessian in Lemma 5, which is computed as
$$\kappa = \frac{L_\gamma}{\mu}, \qquad (8)$$
where $\mu$ and $L_\gamma$ are the lower and upper bounds on the eigenvalues of the Hessian $\nabla^2 f_\gamma(x)$ for $x \in S$, respectively, the identity matrix of $d \times d$ dimensions is denoted as $I$, and $S$ is a set guaranteed to include the convex hull of all iterations $x_t$, $y_t$ and the optimal point $x_{*\gamma}$ as defined in (5). Generating the Hessian upper bound (the smoothness parameter) $L_\gamma$, and consequently the condition number $\kappa$, for the set $S$ is sufficient as in (8). The reason is twofold. Firstly, the optimality gap guarantee shown in the following as Lemma 4 depends upon upper-bounding the Hessian on the line segments pairwise connecting the algorithm iterations ($x_t$, $y_t$) and the optimal point $x_{*\gamma}$. All such segments are encapsulated by the convex hull of $x_t$, $y_t$ and the optimal point $x_{*\gamma}$. Secondly, this convex hull is itself a subset of $S$ as previously defined.
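The update (7) is the constant-momentum form of Nesterov's method. A self-contained sketch on a simple quadratic test function of our own choosing, assuming the strong-convexity and smoothness constants are known:

```python
import numpy as np

def nesterov_agd(grad, x0, mu, L, num_iters):
    """Accelerated gradient descent for an L-smooth, mu-strongly-convex function.

    Implements (7): a gradient step on x_t followed by extrapolation with
    momentum weight (sqrt(kappa) - 1) / (sqrt(kappa) + 1), kappa = L / mu.
    """
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    x, y_prev = x0.copy(), x0.copy()
    for _ in range(num_iters):
        y = x - grad(x) / L                    # gradient step
        x = (1 + beta) * y - beta * y_prev     # momentum extrapolation
        y_prev = y
    return y_prev

# Sanity check on f(x) = 0.5 x^T A x, whose minimizer is the origin.
rng = np.random.default_rng(2)
Q = rng.normal(size=(5, 5))
A = Q @ Q.T + np.eye(5)                        # eigenvalues bounded below by 1
mu, L = np.linalg.eigvalsh(A)[[0, -1]]
x_star = nesterov_agd(lambda x: A @ x, rng.normal(size=5), mu, L, 300)
```

The iterate converges geometrically at rate roughly $\exp(-t/\sqrt{\kappa})$, consistent with the guarantee quoted in Lemma 4.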
Lemma 4.
The following optimality gap is guaranteed for $t \geq 1$:
$$f_\gamma(x_t) - f_\gamma(x_{*\gamma}) \leq \frac{\mu + L_\gamma}{2}\,\left\|x_0 - x_{*\gamma}\right\|^2 \exp\left(-\frac{t-1}{\sqrt{\kappa}}\right),$$
where $\kappa = L_\gamma/\mu$ is the condition number, $\mu$ and $L_\gamma$ are the strong-convexity and Lipschitz-smoothness parameters of $f_\gamma(\cdot)$, respectively, and $x_{*\gamma}$ is the optimal point as defined in (5).
Proof.
The proof directly follows a similar formulation given in [12] under "the smooth and strongly convex case" subsection of the section "Nesterov’s accelerated gradient descent". The only exception is that we do not replace the initial distance $\|x_0 - x_{*\gamma}\|$ with an upper bound and leave it as is. ∎
III-A. Parameters of Strong-Convexity and Lipschitz-Smoothness
To compute $\kappa$, we bound the eigenvalues of $\nabla^2 f_\gamma(x)$.
Lemma 5.
We can lower and upper bound the eigenvalues of the Hessian matrix $\nabla^2 f_\gamma(x)$ for $x \in S$ as follows:
$$\mu I \preceq \nabla^2 f_\gamma(x) \preceq \left(L + \gamma G^2\right) I,$$
where $\mu \triangleq \min_n \mu_n$ and $L \triangleq \max_n L_n$ are further defined via the strong-convexity and smoothness parameters $\mu_n$ and $L_n$ of the components from the "$\max$" operator generating $f(\cdot)$, i.e. $f_n(\cdot)$, respectively, such that we have $\mu_n I \preceq \nabla^2 f_n(x) \preceq L_n I$ for $x \in S$. The parameter $G$ is a common gradient norm bound for each $f_n(\cdot)$ such that $\|\nabla f_n(x)\| \leq G$ for each $n$ and $x \in S$.
Proof.
We start with proving the lower-bound relation. Using Lemma 3, we obtain
$$\nabla^2 f_\gamma(x) \succeq E_n\left[\nabla^2 f_n(x)\right],$$
since the covariance matrix $\Sigma(x)$ is lower-bounded by the all-zeros matrix, as it is a convex combination of rank-$1$ self outer-product matrices with their lowest eigenvalue being $0$.
The expectation operation is linear. Thus, we can replace each $\nabla^2 f_n(x)$ with its lower bound $\mu_n I$ without affecting the inequality relation $\succeq$. After taking the constant identity matrix outside of the expectation, we have the renewed relation
$$\nabla^2 f_\gamma(x) \succeq E_n\left[\mu_n\right] I.$$
Since the expectation is a convex combination of the scalars $\mu_n$, we further lower bound by replacing the expectation with $\mu = \min_n \mu_n$, which gives the lower bound of this lemma.
For the upper bound, we can generate
$$\nabla^2 f_\gamma(x) \preceq L I + \gamma\,\Sigma(x)$$
using Lemma 3 by upper bounding each $\nabla^2 f_n(x)$ with $L_n I$ and the resulting expectation with $L = \max_n L_n$, similar to the lower bound.
We can upper bound the covariance matrix by first noting that the eigenvalues of a $d$-dimensional self outer-product $v v^T$ are $\|v\|^2$ and $d-1$ zeros. Consequently, we upper bound it by replacing the negative outer-product, i.e. $-E_n[\nabla f_n(x)]\,E_n[\nabla f_n(x)]^T$, in (6) with the all-zeros matrix. Then, utilizing the linearity of expectation again, we get the final upper bound by replacing the outer-product inside the expectation with $\|\nabla f_n(x)\|^2 I$. The resulting upper bound is given as
$$\nabla^2 f_\gamma(x) \preceq L I + \gamma\,E_n\left[\|\nabla f_n(x)\|^2\right] I$$
after taking the constant identity matrix outside of the expectation. We can replace the scalar $E_n[\|\nabla f_n(x)\|^2]$ with a common squared gradient-norm bound $G^2$, which gives the upper-bound relation of this lemma, thus concluding the proof. ∎
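The eigenvalue sandwich of Lemma 5 can be checked empirically. With the illustrative components $f_n(x) = \|x - c_n\|^2$ (so $\mu_n = L_n = 2$ and $\|\nabla f_n(x)\|^2 = 4\|x - c_n\|^2$), the sketch below verifies the bounds at randomly drawn points, using the pointwise quantity $\max_n \|\nabla f_n(x)\|^2$ in place of the set-wide bound $G^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, gamma = 6, 4, 3.0
C = rng.normal(size=(N, d))                    # illustrative points c_n

def weights(x):
    z = gamma * ((x - C) ** 2).sum(axis=1)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def hessian_smooth(x):
    """Lemma 3: Hessian = E[Hess f_n] + gamma * Cov(grad f_n)."""
    p = weights(x)
    grads = 2.0 * (x - C)
    mean_grad = p @ grads
    cov = (grads.T * p) @ grads - np.outer(mean_grad, mean_grad)
    return 2.0 * np.eye(d) + gamma * cov

mu_n = L_n = 2.0                               # each ||x - c_n||^2 has Hessian 2 I
for _ in range(20):
    x = rng.normal(size=d)
    G2 = ((2.0 * (x - C)) ** 2).sum(axis=1).max()   # max_n ||grad f_n(x)||^2
    eigs = np.linalg.eigvalsh(hessian_smooth(x))
    assert eigs[0] >= mu_n - 1e-6                   # lower bound: mu I
    assert eigs[-1] <= L_n + gamma * G2 + 1e-6      # upper bound: (L + gamma G^2) I
```

Note how the upper bound grows linearly in $\gamma$: tighter smoothing directly inflates the condition number $\kappa$.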
III-B. Algorithm Description
We start at some point $x_0$. We determine the "smoother" $\gamma$ needed to achieve the requested optimality gap $\epsilon$ and the set $S$ such that $S$ includes the optimal point $x_{*\gamma}$ and all future iterations $x_t$, $y_t$. We use the update rules in (7) after determining the common gradient norm bound $G$ and the individual strong-convexity and Lipschitz-smoothness parameters $\mu_n$ and $L_n$, respectively, via the set $S$. The condition number $\kappa$ and the smoothness parameter $L_\gamma$ are calculated using the lower and upper bounds in Lemma 5. The pseudocode is given in Algorithm 1. For this algorithm, we have the following performance result.
Theorem 1.
We run Algorithm 1 for a given optimality gap guarantee $\epsilon$. Then, we achieve the gap after $T$ sufficient iterations such that:
$$T = O\left(\sqrt{\frac{L + 2 G^2 \log(N)/\epsilon}{\mu}}\,\log\left(\frac{\left(\mu + L + 2 G^2 \log(N)/\epsilon\right) D^2}{\epsilon}\right)\right),$$
where $O(\cdot)$ is the big-O notation for asymptotic upper-bounding, $N$ is the number of functions contributing to the $\max$ operation resulting in $f(\cdot)$, $G$ is the common gradient norm bound for each component function in the $\max$ operator such that $\|\nabla f_n(x)\| \leq G$ for all $n$ and $x \in S$. $\mu$ is the strong-convexity parameter of the approximation $f_\gamma(\cdot)$, $L$ is the pseudo-smoothness parameter upper bounding the matrix $E_n[\nabla^2 f_n(x)]$, and $D = \|x_0 - x_{*\gamma}\|$ is the unknown initial distance between $x_0$ and $x_{*\gamma}$.
Proof.
From Lemma 4, we see that a lower $\kappa$ results in faster convergence for a fixed optimality gap. Without further information on the gradient and Hessian bounds, we need to lower the "smoother" $\gamma$ for a lower $\kappa$. However, the "smoothing" regret $\gamma^{-1}\log N$ from Corollary 1 works in the opposite direction. Consequently, we equate both the optimality gap from the smooth approximation and the "smoothing" regret to $\epsilon/2$. This results in $\gamma = 2\log(N)/\epsilon$, with $N$ being the number of functions contributing to the same $\max$ operation. $\kappa$ is generated consequently. Immediately, we have the "smoothing" regret in Corollary 1 as $\epsilon/2$. Then, we equate the gap $f_\gamma(x_T) - f_\gamma(x_{*\gamma})$ to $\epsilon/2$ using the upper bound in Lemma 4. Afterwards, we replace the condition number $\kappa$ in accordance with (8) after calculating the strong-convexity and smoothness parameters $\mu$ and $L_\gamma = L + \gamma G^2$ via Lemma 5. Finally, we upper bound the initial smooth approximation gap using the convexity relation and arrive at the result of the theorem. ∎
III-B.1 Computational Cost of the Algorithm
Corollary 2.
For an optimality gap $\epsilon$, the computation time needed is $O(\epsilon^{-0.5-\alpha})$ for an arbitrarily small $\alpha > 0$. More specifically:
$$O\left(N d\, C\, \sqrt{\frac{L + 2 G^2 \log(N)/\epsilon}{\mu}}\,\log\left(\frac{\left(\mu + L + 2 G^2 \log(N)/\epsilon\right) D^2}{\epsilon}\right)\right),$$
where $C$ is the average cost of calculating a partial derivative $\partial f_n(x)/\partial x_i$ for any $n$, $i$; $N$ is the number of functions contributing to $f(\cdot)$ and $d$ is the dimension of the domain of the $f_n(\cdot)$’s.
Proof.
We need $T$ iterations as shown in Theorem 1. We observe that each iteration of the while-loop in Algorithm 1 requires $Nd$ partial derivative calculations. Due to the computation of the probability vector $p(x)$ with respect to Definition 1, each iteration also requires a total of $N$ exponentiations, each of which has an additional computational cost independent of $N$ and $d$. The combination of these costs gives the corollary. ∎
III-B.2 Online Version of the Algorithm (without Specifying $\epsilon$)
Corollary 3.
We can achieve the time complexity in Corollary 2, which is of the form $\tilde{O}(\epsilon^{-1/2})$, in an online fashion with no requested optimality gap guarantee $\epsilon$. $\tilde{O}(\cdot)$ is the soft-O notation ignoring logarithmic factors compared to big-O.
Proof.
We initialize with some $\epsilon_0$ and run Algorithm 1 with $\epsilon_0$ as the optimality guarantee. Then, after sufficient iterations to achieve the requested $\epsilon_0$, we restart Algorithm 1 with a new guarantee $\epsilon_{i+1} = \epsilon_i/2$, for $i \geq 0$, and repeat nonstop.
For $\epsilon$ such that $\epsilon = \epsilon_0 2^{-K}$ for some integer $K$, the total exhausted time can be upper-bounded as follows, using the fact that the per-run time is monotonically increasing in $\epsilon_i^{-1}$ and the guarantees $\epsilon_i = \epsilon_0 2^{-i}$ decrease geometrically:
$$\sum_{i=0}^{K} \tilde{O}\left(\epsilon_i^{-1/2}\right) = \tilde{O}\left(\epsilon_0^{-1/2}\sum_{i=0}^{K} 2^{i/2}\right) = \tilde{O}\left(\epsilon^{-1/2}\right).$$
This bound translates to the same bound in Corollary 2. ∎
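The restart schedule behind Corollary 3 can be sketched abstractly: each stage with guarantee $\epsilon_i$ is modeled as costing $\epsilon_i^{-1/2}$ units (log factors dropped), and the geometric sum keeps the total within a constant factor of the final stage's cost. The helper name and unit-cost model are our own:

```python
import math

def online_schedule(eps_0, target_eps):
    """Halving schedule of Corollary 3: run with eps_0, then eps_0/2, ...

    Each stage with guarantee eps is modeled as costing 1/sqrt(eps) units
    (logarithmic factors ignored). Returns the stage guarantees and the
    total modeled cost, stopping once target_eps is reached.
    """
    stages, total_cost = [], 0.0
    eps = eps_0
    while True:
        stages.append(eps)
        total_cost += 1.0 / math.sqrt(eps)
        if eps <= target_eps:
            return stages, total_cost
        eps /= 2.0

stages, cost = online_schedule(1.0, 1e-4)
# Geometric sum: sum_i 2^(i/2) is within the factor sqrt(2)/(sqrt(2)-1) of its
# last term, so the total cost stays O(1/sqrt(final eps)).
```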
In the next section, we shall investigate an interesting specific application of the general accelerated min-max optimization via smooth approximation, which we have introduced.
IV. $(1+\epsilon)$-Approximation for the Problem of Minimal Bounding Sphere
Let us suppose we have $N$ points, each located at $c_n$ for $n \in \{1,\ldots,N\}$, in the $d$-dimensional space $\mathbb{R}^d$. Our minimization target is such that:
$$f(x) = \max_{n \in \{1,\ldots,N\}} \left\|x - c_n\right\|^2. \qquad (9)$$
This is the so-called minimal bounding sphere problem such that it finds an optimal point $x_*$, which, together with $\sqrt{f(x_*)}$ from (9), defines the center and radius of a ball enclosing all of the points $c_n$ with the smallest possible radius.
The optimal point is defined as:
$$x_* \triangleq \arg\min_x f(x).$$
Since $x_*$ minimizes the maximum Euclidean distance to a point $c_n$, we know that $x_*$ belongs to the convex hull of the points $c_n$, since we can always decrease these distances by moving towards the convex hull.
We shall utilize Algorithm 1 with the initial point $x_0$ belonging to this convex hull, e.g. $x_0 = N^{-1}\sum_{n=1}^{N} c_n$, the arithmetic mean of the points.
Before running Algorithm 1, we determine the strong-convexity and Lipschitz-smoothness parameters, which are $\mu_n = L_n = 2$ for all $n$ in this particular problem, since $\nabla^2 f_n(x) = 2I$. Consequently, the overall strong-convexity and pseudo-smoothness parameters are also $\mu = L = 2$, respectively. $\kappa$ reveals itself after combining Lemma 5 with (8), and $L$ is defined in Theorem 1 as the maximum smoothness parameter from the individual functions. What only remains to be set in Algorithm 1 is the gradient norm upper bound $G$, which inherently includes determining the set $S$ guaranteed to include the optimal point $x_{*\gamma}$ and all iterations $x_t$, $y_t$.
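Putting the pieces together for this problem, the following compact sketch runs the smoothed accelerated iteration from a centroid start. The choices of $\gamma$ and of the crude smoothness bound (via $G \leq 4\sqrt{f(x_0)}$, assuming iterates stay near the hull) are our own simplifications, not the exact constants of Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 30, 2
C = rng.normal(size=(N, d))                    # points to enclose
gamma = 50.0                                   # smoother; larger = tighter max

def sq_dists(x):
    return ((x - C) ** 2).sum(axis=1)

def grad_smooth(x):
    z = gamma * sq_dists(x)
    z -= z.max()
    p = np.exp(z); p /= p.sum()
    return p @ (2.0 * (x - C))                 # Lemma 2 with grad f_n = 2(x - c_n)

x = C.mean(axis=0)                             # centroid start, inside the hull
f0 = sq_dists(x).max()
mu = 2.0
L_gamma = 2.0 + gamma * 16.0 * f0              # crude L + gamma G^2 with G <= 4 sqrt(f0)
beta_k = np.sqrt(L_gamma / mu)
beta = (beta_k - 1) / (beta_k + 1)
y_prev = x.copy()
for _ in range(2000):                          # accelerated iteration (7)
    y = x - grad_smooth(x) / L_gamma
    x = (1 + beta) * y - beta * y_prev
    y_prev = y

radius = np.sqrt(sq_dists(y_prev).max())       # radius of the sphere centered at y
```

By Lemma 6 the true radius lies between $\sqrt{f(x_0)}/2$ and $\sqrt{f(x_0)}$, and by Lemma 1 the converged radius overshoots the optimum by at most roughly $\log(N)/(2\gamma R)$.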
IV-A. Gradient Norm Bound for Minimal Bounding Sphere
Assume the minimal bounding sphere is such that the maximum distance (i.e. radius) between the optimal point $x_*$ and one of the other points $c_n$ is $R$.
Lemma 6.
After setting the initial point $x_0 = N^{-1}\sum_{n=1}^{N} c_n$ and computing $f(x_0)$ using (9), we have the following bounds on the minimal bounding sphere radius $R$:
$$\frac{\sqrt{f(x_0)}}{2} \leq R \leq \sqrt{f(x_0)}.$$
Proof.
The upper bound is trivial since $x_0$ is not necessarily optimal. The lower bound comes from the fact that $x_0$ belongs to the convex hull of the points $c_n$ and, consequently, its distance to any $c_n$ cannot exceed the diameter $2R$ of the minimal bounding sphere, which encloses all the points and, hence, their convex hull. ∎
Lemma 7.
The gradient norm upper bound $G$ is such that:
$$G = O\left(\sqrt{f(x_0)} + \sqrt{\epsilon}\right),$$
where $x_0$ is the initial point of Algorithm 1 and $\epsilon$ is the requested optimality gap.
Proof.
In accordance with this specific problem, we can further upper bound the smooth approximation optimality guarantee in Lemma 4 by first upper bounding the multiplicand in parenthesis on the greater side of the inequality, since we have an exponential multiplier which is guaranteed to be nonnegative. After also upper bounding this exponential multiplier, since the upper bound of the multiplicand turns out to be always nonnegative, we obtain the following result:
(10)
This upper bounding takes place by replacing the quantities in Lemma 4 with their corresponding bounds using the facts $\mu = L = 2$, $\|x_0 - x_{*\gamma}\| \leq 2R$, and $f(x) \leq f_\gamma(x)$ for all $x$. The distance inequality is due to the fact that the minimal bounding sphere has its center at $x_*$, and $x_{*\gamma}$ is contained inside the said sphere since it is encapsulated by the convex hull of all the points $c_n$. Similarly, the remaining inequality results from Lemma 1 and Lemma 6, since $x_0$ is again contained in the same minimal bounding sphere with diameter $2R$.
Then, by Lemma 1, (10), and setting $\gamma = 2\log(N)/\epsilon$ as in Algorithm 1 for a given optimality gap guarantee $\epsilon$, we get
(11)
Regarding the gradients for the minimal bounding sphere problem, using the expectation form of the gradient in Lemma 2 and incorporating the function definition in (9), we have:
$$\nabla f_\gamma(x) = E_n\left[2\left(x - c_n\right)\right] = 2\left(x - E_n\left[c_n\right]\right). \qquad (12)$$
Combining (11) and (12), we have a bound on the gradient norms of the smoothing function at the points $x_t$ as
(13)
since we can claim $\|\nabla f_\gamma(x_t)\| \leq 2\max_n \|x_t - c_n\|$, which results from the distance between $x_t$ and some weighted average of the points $c_n$, specifically $E_n[c_n]$, being at most the distance between $x_t$ and the point farthest from it, i.e. $\max_n \|x_t - c_n\|$.
For $\nabla f_\gamma(x_0)$, its norm is upper-bounded by $4R$ since the diameter of the minimal bounding sphere, which includes the initialization $x_0$, is $2R$. For the iterates $y_t$, combining (12) and Line 1 from Algorithm 1, we have
Using the triangle inequality and by upper bounding the negative terms with 0,
Finally, using (11), we have
(14)
as we can claim $\|\nabla f_\gamma(y_t)\| \leq 2\max_n \|y_t - c_n\|$ like before.
With (13) and (14), we have bounded the gradient norms at all iterations $x_t$ and $y_t$. We take an arbitrary $x$, a member of the convex hull of the iterations $x_t$, $y_t$ and the optimal point $x_{*\gamma}$. As discussed in Section III, it is sufficient to generate a gradient norm upper bound for this arbitrary point to obtain $G$. Since $x$ is a convex combination of the $x_t$, $y_t$ and $x_{*\gamma}$, we decompose it into individual parts and insert that version of $x$ into (12). Using the triangle inequality and the same claim for any pair of iterates, the common gradient norm bound turns out to be the maximum of the bounds (13) and (14), since the gradient at the optimal point $x_{*\gamma}$ is $0$. Consequently, we can set $G$ as this maximum. ∎
IV-B. Convergence result
Before examining the convergence result, we note that, for the minimal bounding sphere problem, the $(1+\hat{\epsilon})$-approximation translates into converging to a bounding sphere with radius $(1+\hat{\epsilon})R$. Consequently, we have that, for some $\hat{\epsilon} > 0$:
$$f(x_T) \leq (1+\hat{\epsilon})^2 R^2 = R^2 + \left(2\hat{\epsilon} + \hat{\epsilon}^2\right) R^2,$$
meaning the requested optimality gap $\epsilon = (2\hat{\epsilon} + \hat{\epsilon}^2) R^2$, i.e. $\epsilon \geq 2\hat{\epsilon} R^2$ for $\hat{\epsilon} > 0$.
Theorem 2.
For the minimal bounding sphere problem, we can generate a $(1+\hat{\epsilon})$-approximate solution by achieving a bounding sphere with radius $(1+\hat{\epsilon})R$ for an arbitrarily small positive $\hat{\epsilon}$ using Algorithm 1. After setting $\mu_n = L_n = 2$ for all $n$, $\epsilon = 2\hat{\epsilon}R^2$ and $\gamma = 2\log(N)/\epsilon$, the overall computational complexity and the total number of iterations by the algorithm are
$$\tilde{O}\left(\frac{Nd}{\sqrt{\hat{\epsilon}}}\right) \quad\text{and}\quad \tilde{O}\left(\frac{1}{\sqrt{\hat{\epsilon}}}\right),$$
where $\tilde{O}(\cdot)$ is the soft-O notation which ignores the logarithmic additives and multipliers.
Proof.
We plug $\mu_n = L_n = 2$ for all $n$ and the gradient norm bound $G$ from Lemma 7 into the result of Theorem 1 regarding the number of iterations required. We can upper bound the right side of the equality for $\epsilon$ before using it here, since doing so can only provide further guarantees, as shown in Lemma 4. We also plug in the initial distance bound $\|x_0 - x_{*\gamma}\| \leq 2R$ by the definition of $R$, the radius of the minimal bounding sphere, and the selection of $x_0$ from the convex hull of the points $c_n$. We note that $R \leq \sqrt{f(x_0)} \leq 2R$ from Lemma 6 and bound $f(x_0)$ with $4R^2$. Instead of upper bounding the initial gap, as previously done in Theorem 1, we can use the upper bound resulting from Corollary 1 and setting the iteration count to $0$, since the initial smoothed gap exceeds the original gap by at most $\gamma^{-1}\log N$ due to Lemma 1. Lastly, we upper bound the reciprocal of the optimality gap, i.e. $\epsilon^{-1}$, with $(2\hat{\epsilon}R^2)^{-1}$ since $\epsilon = 2\hat{\epsilon}R^2$. ∎
References
 [1] R. Chen, “Solution of minimax problems using equivalent differentiable functions,” Computers and Mathematics with Applications, vol. 11, no. 12, pp. 1165–1169, 1985. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089812218590104X
 [2] I. Zang, “A smoothing-out technique for min-max optimization,” Math. Program., vol. 19, no. 1, pp. 61–77, 1980. [Online]. Available: https://doi.org/10.1007/BF01581628
 [3] S.-C. Fang and S.-Y. Wu, “Solving min-max problems and linear semi-infinite programs,” Computers and Mathematics with Applications, vol. 32, no. 6, pp. 87–93, 1996. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0898122196001459
 [4] G. Zhao, Z. Wang, and H. Mou, “Uniform approximation of min/max functions by smooth splines,” J. Computational and Applied Mathematics, vol. 236, no. 5, pp. 699–703, 2011. [Online]. Available: https://doi.org/10.1016/j.cam.2011.06.023
 [5] Y. Nesterov, “Smooth minimization of non-smooth functions,” Math. Program., vol. 103, no. 1, pp. 127–152, 2005. [Online]. Available: https://doi.org/10.1007/s10107-004-0552-5
 [6] S. Lacoste-Julien, M. W. Schmidt, and F. R. Bach, “A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method,” CoRR, vol. abs/1212.2002, 2012. [Online]. Available: http://arxiv.org/abs/1212.2002
 [7] T. B. Larsson, “Fast and tight fitting bounding spheres,” 2008.
 [8] E. Welzl, “Smallest enclosing disks (balls and ellipsoids),” in New Results and New Trends in Computer Science, H. Maurer, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991, pp. 359–370.
 [9] K. Fischer, B. Gärtner, and M. Kutz, “Fast smallest-enclosing-ball computation in high dimensions,” in Algorithms – ESA 2003, G. Di Battista and U. Zwick, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 630–641.

 [10] M. Bādoiu, S. Har-Peled, and P. Indyk, “Approximate clustering via core-sets,” in Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, ser. STOC ’02. New York, NY, USA: ACM, 2002, pp. 250–257. [Online]. Available: http://doi.acm.org/10.1145/509907.509947
 [11] P. Kumar, J. S. B. Mitchell, and E. A. Yıldırım, “Computing core-sets and approximate smallest enclosing hyperspheres in high dimensions,” 2002.

 [12] S. Bubeck, “Convex optimization: Algorithms and complexity,” Foundations and Trends in Machine Learning, vol. 8, no. 3–4, pp. 231–357, 2015. [Online]. Available: https://doi.org/10.1561/2200000050