# Non-convex Optimization for Machine Learning

A vast majority of machine learning algorithms train their models and perform inference by solving optimization problems. In order to capture the learning and prediction problems accurately, structural constraints such as sparsity or low rank are frequently imposed, or else the objective itself is designed to be a non-convex function. This is especially true of algorithms that operate in high-dimensional spaces or that train non-linear models such as tensor models and deep networks. The freedom to express the learning problem as a non-convex optimization problem gives immense modeling power to the algorithm designer, but often such problems are NP-hard to solve. A popular workaround has been to relax non-convex problems to convex ones and use traditional methods to solve the (convex) relaxed optimization problems. However, this approach may be lossy and nevertheless presents significant challenges for large-scale optimization. On the other hand, direct approaches to non-convex optimization have met with resounding success in several domains and remain the methods of choice for the practitioner, as they frequently outperform relaxation-based techniques; popular heuristics include projected gradient descent and alternating minimization. However, these are often poorly understood in terms of their convergence and other properties. This monograph presents a selection of recent advances that bridge a long-standing gap in our understanding of these heuristics. The monograph will lead the reader through several widely used non-convex optimization techniques, as well as applications thereof. The goal of this monograph is both to introduce the rich literature in this area and to equip the reader with the tools and techniques needed to analyze these simple procedures for non-convex problems.

#### 1.1 Non-convex Optimization

The generic form of an analytic optimization problem is the following

$$\min_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{x} \in \mathcal{C},$$

where $\mathbf{x}$ is the variable of the problem, $f: \mathbb{R}^p \to \mathbb{R}$ is the objective function of the problem, and $\mathcal{C} \subseteq \mathbb{R}^p$ is the constraint set of the problem. When used in a machine learning setting, the objective function allows the algorithm designer to encode proper and expected behavior for the machine learning model, such as fitting well to training data with respect to some loss function, whereas the constraint allows restrictions on the model to be encoded, for instance, restrictions on model size.

An optimization problem is said to be convex if the objective is a convex function and the constraint set is a convex set. We refer the reader to § 2 for formal definitions of these terms. An optimization problem that violates either one of these conditions, i.e., one that has a non-convex objective, or a non-convex constraint set, or both, is called a non-convex optimization problem. In this monograph, we will discuss non-convex optimization problems with non-convex objectives and convex constraints (§ 4, 5, 6, and 8), as well as problems with non-convex constraints but convex objectives (§ 3, 7, 9, 10, and 8). Such problems arise in many application areas.

#### 1.2 Motivation for Non-convex Optimization

Modern applications frequently require learning algorithms to operate in extremely high dimensional spaces. Examples include web-scale document classification problems where $n$-gram-based representations can have dimensionalities in the millions or more, recommendation systems with millions of items being recommended to millions of users, and signal processing tasks such as face recognition and image processing and bio-informatics tasks such as splice and gene detection, all of which present similarly high dimensional data.

Dealing with such high dimensionalities necessitates the imposition of structural constraints on the learning models being estimated from data. Such constraints are not only helpful in regularizing the learning problem, but often essential to prevent the problem from becoming ill-posed. For example, suppose we know how a user rates some items and wish to infer how this user would rate other items, possibly in order to inform future advertisement campaigns. To do so, it is essential to impose some structure on how a user's ratings for one set of items influence ratings for other kinds of items. Without such structure, it becomes impossible to infer any new user ratings. As we shall soon see, such structural constraints often turn out to be non-convex.

In other applications, the natural objective of the learning task is a non-convex function. Common examples include training deep neural networks and tensor decomposition problems. Although non-convex objectives and constraints allow us to accurately model learning problems, they often present a formidable challenge to algorithm designers. This is because, unlike convex optimization, we do not possess a handy set of tools for solving non-convex problems. Several non-convex optimization problems are known to be NP-hard to solve. The situation is made more bleak by a range of non-convex problems that are not only NP-hard to solve optimally, but NP-hard to solve approximately as well (Meka et al., 2008).

#### 1.3 Examples of Non-Convex Optimization Problems

Below we present some areas where non-convex optimization problems arise naturally when devising learning problems.

Sparse Regression

The classical problem of linear regression seeks to recover a linear model which can effectively predict a response variable as a linear function of covariates. For example, we may wish to predict the average expenditure of a household (the response) as a function of the education levels of the household members, their annual salaries and other relevant indicators (the covariates). The ability to do so allows economic policy decisions to be more informed by revealing, for instance, how education level affects expenditure.

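More formally, we are provided a set of $n$ covariate/response pairs $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ where $\mathbf{x}_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. The linear regression approach makes the modeling assumption $y_i = \mathbf{x}_i^\top \boldsymbol{\theta}^* + \eta_i$ where $\boldsymbol{\theta}^* \in \mathbb{R}^p$ is the underlying linear model and $\eta_i$ is some benign additive noise. Using the data provided, $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, we wish to recover the model $\boldsymbol{\theta}^*$ as faithfully as possible.

A popular way to recover $\boldsymbol{\theta}^*$ is the least squares formulation

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^p} \sum_{i=1}^n \left(y_i - \mathbf{x}_i^\top \boldsymbol{\theta}\right)^2.$$

The linear regression problem and the least squares estimator are extremely well studied and their behavior is precisely known. However, this age-old problem acquires new dimensions in situations where either we expect only a few of the features/covariates to be actually relevant to the problem but do not know their identity, or else we are working in extremely data-starved settings, i.e., $n \ll p$.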

The first problem often arises when there is an excess of covariates, several of which may be spurious or have no effect on the response. § 7 discusses several such practical examples. For now, consider the example depicted in Figure 1.1, that of expenditure prediction in a situation when the list of indicators includes irrelevant ones such as whether the family lives in an odd-numbered house or not, which should arguably have no effect on expenditure. It is useful to eliminate such variables from consideration to promote consistency of the learned model.

The second problem is common in areas such as genomics and signal processing, which face moderate to severe data starvation: the number of data points $n$ available to estimate the model is small compared to the number of model parameters $p$ to be estimated, i.e., $n \ll p$. Standard statistical approaches require at least $n \geq p$ data points to ensure a consistent estimation of all model parameters and are unable to offer accurate model estimates in the face of data-starvation.

Both these problems can be handled by the sparse recovery approach, which seeks to fit a sparse model vector (i.e., a vector with, say, no more than $s$ non-zero entries) to the data. The least squares formulation, modified as a sparse recovery problem, is given below

$$\hat{\boldsymbol{\theta}}_{\text{sp}} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^p} \sum_{i=1}^n \left(y_i - \mathbf{x}_i^\top \boldsymbol{\theta}\right)^2 \quad \text{s.t.} \quad \boldsymbol{\theta} \in \mathcal{B}_0(s).$$

Although the objective function in the above formulation is convex, the constraint $\boldsymbol{\theta} \in \mathcal{B}_0(s)$ (equivalently $\|\boldsymbol{\theta}\|_0 \leq s$; see the list of mathematical notation at the beginning of this monograph) corresponds to a non-convex constraint set (see Exercise 2.5). Sparse recovery effortlessly solves the twin problems of discarding irrelevant covariates and countering data-starvation since typically, only $s \log p$ (as opposed to $p$) data points are required for sparse recovery to work, which drastically reduces the data requirement. Unfortunately, however, sparse recovery is an NP-hard problem (Natarajan, 1995).
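The non-convexity of the sparsity constraint can be checked on a tiny instance. The sketch below is illustrative only: the helper names `l0_norm` and `in_sparse_set` and the two example vectors are assumptions made for this example. It shows that the midpoint of two 1-sparse vectors need not be 1-sparse, so the set of $s$-sparse vectors is not closed under convex combinations.

```python
def l0_norm(v):
    """Count the non-zero entries of v (the so-called L0 'norm')."""
    return sum(1 for entry in v if entry != 0)


def in_sparse_set(v, s):
    """Membership test for the set of vectors with at most s non-zero entries."""
    return l0_norm(v) <= s


# Two 1-sparse vectors in the plane (hypothetical example data).
a = [1.0, 0.0]
b = [0.0, 1.0]

# Their midpoint is a convex combination with weights 1/2 and 1/2.
midpoint = [0.5 * ai + 0.5 * bi for ai, bi in zip(a, b)]

print(in_sparse_set(a, 1), in_sparse_set(b, 1), in_sparse_set(midpoint, 1))
```

Both endpoints satisfy the constraint with $s = 1$ while the midpoint has two non-zero entries, which is exactly the failure of convexity.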

Recommendation Systems

Several internet search engines and e-commerce websites utilize recommendation systems to offer items to users that they would benefit from, or like, the most. The problem of recommendation ranges from benign recommendations for songs, etc., all the way to critical recommendations in personalized medicine.

To be able to make accurate recommendations, we need very good estimates of how each user likes each item (song), or would benefit from it (drug). We usually have first-hand information for some user-item pairs, for instance if a user has specifically rated a song or if we have administered a particular drug to a user and seen the outcome. However, users typically rate only a handful of the hundreds of thousands of songs in any commercial catalog and it is not feasible, or even advisable, to administer every drug to a user. Thus, for the vast majority of user-item pairs, we have no direct information.

It is useful to visualize this problem as a matrix completion problem: for a set of $m$ users and $n$ items, we have an $m \times n$ preference matrix $A = [A_{ij}]$ where $A_{ij}$ encodes the preference of the $i$-th user for the $j$-th item. We are able to directly view only a small number of entries of this matrix, for example, whenever a user explicitly rates an item. However, we wish to recover the remaining entries, i.e., complete this matrix. This problem is closely linked to the collaborative filtering technique popular in recommendation systems.

Now, it is easy to see that unless there exists some structure in the matrix $A$, and by extension, in the way users rate items, there would be no relation between the unobserved entries and the observed ones. This would result in there being no unique way to complete the matrix. Thus, it is essential to impose some structure on the matrix. A structural assumption popularly made is that of low rank: we wish to fill in the missing entries of $A$ assuming that $A$ is a low rank matrix. This can make the problem well-posed and have a unique solution since the additional low rank structure links the entries of the matrix together. The unobserved entries can no longer take values independently of the values observed by us. Figure 1.2 depicts this visually.

If we denote by $\Omega \subset [m] \times [n]$ the set of observed entries of $A$, then the low rank matrix completion problem can be written as

$$\hat{A}_{\text{lr}} = \arg\min_{X \in \mathbb{R}^{m \times n}} \sum_{(i,j) \in \Omega} \left(X_{ij} - A_{ij}\right)^2 \quad \text{s.t.} \quad \text{rank}(X) \leq r.$$

This formulation also has a convex objective but a non-convex rank constraint (see Exercise 2.5). This problem can be shown to be NP-hard as well. Interestingly, we can arrive at an alternate formulation by imposing the low-rank constraint indirectly. It turns out (see Exercise 3.4) that assuming the ratings matrix to have rank at most $r$ is equivalent to assuming that the matrix $A$ can be written as $A = UV^\top$ with the matrices $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ having at most $r$ columns. This leads us to the following alternate formulation

$$\hat{A}_{\text{lv}} = \arg\min_{U \in \mathbb{R}^{m \times r},\, V \in \mathbb{R}^{n \times r}} \sum_{(i,j) \in \Omega} \left(U_i^\top V_j - A_{ij}\right)^2.$$

There are no constraints in the formulation. However, the formulation requires joint optimization over a pair of variables $(U, V)$ instead of a single variable. More importantly, it can be shown (see Exercise 4.5) that the objective function is non-convex in $(U, V)$.
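The non-convexity of the factored objective is already visible in the smallest possible case. The following sketch is an illustration under assumed values: a single observed rating equal to 1 and rank-one factors that are plain scalars `u` and `v`; the function name `factored_loss` is hypothetical. A convex function evaluated at a midpoint can never exceed the average of its endpoint values, and the example violates this.

```python
def factored_loss(u, v, rating=1.0):
    """Squared error of a single observed entry under the factorization u * v."""
    return (u * v - rating) ** 2


# Two different factorizations that both fit the rating perfectly.
p = (1.0, 1.0)
q = (-1.0, -1.0)

# Midpoint of p and q in the joint (u, v) space.
mid = (0.5 * (p[0] + q[0]), 0.5 * (p[1] + q[1]))

endpoint_average = 0.5 * factored_loss(*p) + 0.5 * factored_loss(*q)
midpoint_value = factored_loss(*mid)

# Convexity would require midpoint_value <= endpoint_average.
print(endpoint_average, midpoint_value)
```

Here both endpoints have zero loss, yet the midpoint $(0, 0)$ has loss 1, so the objective cannot be jointly convex in the two factors.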

It is curious to note that the matrices $U$ and $V$ can be seen as encoding $r$-dimensional descriptions of users and items respectively. More precisely, for every user $i \in [m]$, we can think of the vector $U_i \in \mathbb{R}^r$ (i.e., the $i$-th row of the matrix $U$) as describing user $i$, and for every item $j \in [n]$, use the vector $V_j \in \mathbb{R}^r$ (the $j$-th row of $V$) to describe the item in vector form. The rating given by user $i$ to item $j$ can now be seen to be $A_{ij} \approx U_i^\top V_j$. Thus, recovering the rank $r$ matrix $A$ also gives us a collection of $r$-dimensional latent vectors describing the users and items. These latent vectors can be extremely valuable in themselves as they can help us in understanding user behavior and item popularity, as well as be used in "content"-based recommendation systems which can effectively utilize item and user features.

The above examples, and several others from machine learning, such as low-rank tensor decomposition, training deep networks, and training structured models, demonstrate the utility of non-convex optimization in naturally modeling learning tasks. However, most of these formulations are NP-hard to solve exactly, and sometimes even approximately. In the following discussion, we will briefly introduce a few approaches, classical as well as contemporary, that are used in solving such non-convex optimization problems.

#### 1.4 The Convex Relaxation Approach

Faced with the challenge of non-convexity, and the associated NP-hardness, a traditional workaround in literature has been to modify the problem formulation itself so that existing tools can be readily applied. This is often done by relaxing the problem so that it becomes a convex optimization problem. Since this allows familiar algorithmic techniques to be applied, the so-called convex relaxation approach has been widely studied. For instance, there exist relaxed, convex problem formulations for both the recommendation system and the sparse regression problems. For sparse linear regression, the relaxation approach gives us the popular LASSO formulation.

Now, in general, such modifications change the problem drastically, and the solutions of the relaxed formulation can be poor solutions to the original problem. However, it is known that if the problem possesses certain nice structure, then under careful relaxation, these distortions, formally referred to as a "relaxation gap", are absent, i.e., solutions to the relaxed problem would be optimal for the original non-convex problem as well.

Although a popular and successful approach, this still has limitations, the most prominent of them being scalability. Although the relaxed convex optimization problems are solvable in polynomial time, it is often challenging to solve them efficiently for large-scale problems.

#### 1.5 The Non-Convex Optimization Approach

Interestingly, in recent years, a new wisdom has permeated the fields of machine learning and signal processing, one that advises not to relax the non-convex problems and instead solve them directly. This approach has often been dubbed the non-convex optimization approach owing to its goal of optimizing non-convex formulations directly.

Techniques frequently used in non-convex optimization approaches include simple and efficient primitives such as projected gradient descent, alternating minimization, the expectation-maximization algorithm, stochastic optimization, and variants thereof. These are very fast in practice and remain favorites of practitioners.

At first glance, however, these efforts seem doomed to fail, given the aforementioned NP-hardness results. However, in a series of deep and illuminating results, it has been repeatedly revealed that if the problem possesses nice structure, then not only do relaxation approaches succeed, but non-convex optimization algorithms do too. In such nice cases, non-convex approaches are able to not only avoid NP-hardness, but actually offer provably optimal solutions. In fact, in practice, they often handsomely outperform relaxation-based approaches in terms of speed and scalability. Figure 1.3 illustrates this for some applications that we will investigate more deeply in later sections.

Very interestingly, it turns out that problem structures that allow non-convex approaches to avoid NP-hardness results, are very similar to those that allow their convex relaxation counterparts to avoid distortions and a large relaxation gap! Thus, it seems that if the problems possess nice structure, convex relaxation-based approaches, as well as non-convex techniques, both succeed. However, non-convex techniques usually offer more scalable solutions.

#### 1.6 Organization and Scope

Our goal in this monograph is to present basic tools, both algorithmic and analytic, that are commonly used in the design and analysis of non-convex optimization algorithms, as well as to present results which best represent the non-convex optimization philosophy. The presentation should enthuse, as well as equip, the interested reader and allow further readings, independent investigations, and applications of these techniques in diverse areas.

Given this broad aim, we shall appropriately restrict the number of areas we cover in this monograph, as well as the depth in which we cover each area. For instance, the literature abounds in results that seek to perform optimizations with more and more complex structures being imposed - from sparse recovery to low rank matrix recovery to low rank tensor recovery. However, we shall restrict ourselves from venturing too far into these progressions. Similarly, within the problem of sparse recovery, there exist results for recovery in the simple least squares setting, the more involved setting of sparse M-estimation, as well as the still more involved setting of sparse M-estimation in the presence of outliers. Whereas we will cover sparse least squares estimation in depth, we will refrain from delving too deeply into the more involved sparse M-estimation problems.

That being said, the entire presentation will be self contained and accessible to anyone with a basic background in algebra and probability theory. Moreover, the bibliographic notes given at the end of the sections will give pointers that should enable the reader to explore the state of the art not covered in this monograph.

#### 2.1 Convex Analysis

We recall some basic definitions in convex analysis. Studying these will help us appreciate the structural properties of non-convex optimization problems later in the monograph. For the sake of simplicity, unless stated otherwise, we will assume that functions are continuously differentiable. We begin with the notion of a convex combination.

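[Convex Combination] A convex combination of a set of $n$ vectors $\mathbf{x}_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, in an arbitrary real space is a vector $\mathbf{x}_{\boldsymbol{\theta}} := \sum_{i=1}^n \theta_i \mathbf{x}_i$ where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_n)$, $\theta_i \geq 0$, and $\sum_{i=1}^n \theta_i = 1$.

A set that is closed under arbitrary convex combinations is a convex set. A standard definition is given below. Geometrically speaking, convex sets are those that contain all line segments that join two points inside the set. As a result, they cannot have any inward "bulges".

[Convex Set] A set $\mathcal{C} \subseteq \mathbb{R}^p$ is considered convex if, for every $\mathbf{x}, \mathbf{y} \in \mathcal{C}$ and $\lambda \in [0, 1]$, we have $(1 - \lambda) \cdot \mathbf{x} + \lambda \cdot \mathbf{y} \in \mathcal{C}$ as well.

Figure 2.1 gives visual representations of prototypical convex and non-convex sets. A related notion is that of convex functions, which have a characteristic behavior under convex combinations. There are several definitions of convex functions, those that are more basic and general, as well as those that are restrictive but easier to use. One of the simplest definitions of convex functions, one that does not involve notions of derivatives, defines convex functions $f: \mathbb{R}^p \to \mathbb{R}$ as those for which, for every $\mathbf{x}, \mathbf{y} \in \mathbb{R}^p$ and every $\lambda \in [0, 1]$, we have $f((1 - \lambda) \cdot \mathbf{x} + \lambda \cdot \mathbf{y}) \leq (1 - \lambda) \cdot f(\mathbf{x}) + \lambda \cdot f(\mathbf{y})$. For continuously differentiable functions, a more usable definition follows.

[Convex Function] A continuously differentiable function $f: \mathbb{R}^p \to \mathbb{R}$ is considered convex if for every $\mathbf{x}, \mathbf{y} \in \mathbb{R}^p$ we have $f(\mathbf{y}) \geq f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$, where $\nabla f(\mathbf{x})$ is the gradient of $f$ at $\mathbf{x}$.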

A more general definition that extends to non-differentiable functions uses the notion of subgradient to replace the gradient in the above expression. A special class of convex functions is the class of strongly convex and strongly smooth functions. These are critical to the study of algorithms for non-convex optimization. Figure 2.2 provides a handy visual representation of these classes of functions.

[Strongly Convex/Smooth Function] A continuously differentiable function $f: \mathbb{R}^p \to \mathbb{R}$ is considered $\alpha$-strongly convex (SC) and $\beta$-strongly smooth (SS) if for every $\mathbf{x}, \mathbf{y} \in \mathbb{R}^p$, we have

$$\frac{\alpha}{2} \|\mathbf{x} - \mathbf{y}\|_2^2 \leq f(\mathbf{y}) - f(\mathbf{x}) - \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \leq \frac{\beta}{2} \|\mathbf{x} - \mathbf{y}\|_2^2.$$

It is useful to note that strong convexity places a quadratic lower bound on the growth of the function at every point: the function must rise up at least as fast as a quadratic function. How fast it rises is characterized by the SC parameter $\alpha$. Strong smoothness similarly places a quadratic upper bound and does not let the function grow too fast, with the SS parameter $\beta$ dictating the upper limit.
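A numerical sanity check can make the two-sided quadratic bound concrete. The sketch below assumes a hypothetical diagonal quadratic $f(\mathbf{x}) = \frac{1}{2}(x_1^2 + 4x_2^2)$, for which the smallest and largest curvatures, 1 and 4, play the roles of the SC and SS parameters; all names in the code are illustrative.

```python
import random

# Hypothetical example: f(x) = 0.5 * sum(c_i * x_i^2) with curvatures c = (1, 4).
CURVATURES = [1.0, 4.0]
ALPHA, BETA = min(CURVATURES), max(CURVATURES)


def f(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURVATURES, x))


def grad_f(x):
    return [c * xi for c, xi in zip(CURVATURES, x)]


def linearization_gap(x, y):
    """f(y) - f(x) - <grad f(x), y - x>: the middle term of the SC/SS inequality."""
    inner = sum(g * (yi - xi) for g, xi, yi in zip(grad_f(x), x, y))
    return f(y) - f(x) - inner


def sq_dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def satisfies_sc_ss(x, y, tol=1e-9):
    """Check alpha/2 * ||x-y||^2 <= gap <= beta/2 * ||x-y||^2 up to rounding."""
    gap, d2 = linearization_gap(x, y), sq_dist(x, y)
    return 0.5 * ALPHA * d2 - tol <= gap <= 0.5 * BETA * d2 + tol


random.seed(0)
pairs = [
    ([random.uniform(-5, 5), random.uniform(-5, 5)],
     [random.uniform(-5, 5), random.uniform(-5, 5)])
    for _ in range(1000)
]
all_pairs_ok = all(satisfies_sc_ss(x, y) for x, y in pairs)
print(all_pairs_ok)
```

Moving along the first coordinate makes the lower bound tight, and moving along the second makes the upper bound tight, which is why neither parameter can be improved for this function.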

We will soon see that these two properties are extremely useful in forcing optimization algorithms to rapidly converge to optima. Note that whereas strongly convex functions are definitely convex, strong smoothness does not imply convexity (see Exercise 2.5). Strongly smooth functions may very well be non-convex. A property similar to strong smoothness is that of Lipschitzness, which we define below.

[Lipschitz Function] A function $f: \mathbb{R}^p \to \mathbb{R}$ is $B$-Lipschitz if for every $\mathbf{x}, \mathbf{y} \in \mathbb{R}^p$, we have

$$|f(\mathbf{x}) - f(\mathbf{y})| \leq B \cdot \|\mathbf{x} - \mathbf{y}\|_2.$$

Notice that Lipschitzness places an upper bound on the growth of the function that is linear in the perturbation, i.e., $\|\mathbf{x} - \mathbf{y}\|_2$, whereas strong smoothness (SS) places a quadratic upper bound. Also notice that Lipschitz functions need not be differentiable. However, differentiable functions with bounded gradients are always Lipschitz (see Exercise 2.5). Finally, an important property that generalizes the behavior of convex functions on convex combinations is Jensen's inequality.

[Jensen's Inequality] If $X$ is a random variable taking values in the domain of a convex function $f$, then $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$.

This property will be useful while analyzing iterative algorithms.

#### 2.2 Convex Projections

The projected gradient descent technique is a popular method for constrained optimization problems, both convex as well as non-convex. The projection step plays an important role in this technique. Given any closed set $\mathcal{C} \subset \mathbb{R}^p$, the projection operator $\Pi_{\mathcal{C}}(\cdot)$ is defined as

$$\Pi_{\mathcal{C}}(\mathbf{z}) := \arg\min_{\mathbf{x} \in \mathcal{C}} \|\mathbf{x} - \mathbf{z}\|_2.$$

In general, one need not use only the $L_2$-norm in defining projections, but it is the most commonly used one. If $\mathcal{C}$ is a convex set, then the above problem reduces to a convex optimization problem. In several useful cases, one has access to a closed form solution for the projection.

For instance, if $\mathcal{C} = \mathcal{B}_2(1)$, i.e., the unit $L_2$ ball, then projection is equivalent (see Exercise 2.5) to a normalization step

$$\Pi_{\mathcal{B}_2(1)}(\mathbf{z}) = \begin{cases} \mathbf{z} / \|\mathbf{z}\|_2 & \text{if } \|\mathbf{z}\|_2 > 1 \\ \mathbf{z} & \text{otherwise.} \end{cases}$$

For the case $\mathcal{C} = \mathcal{B}_1(1)$, the projection step reduces to the popular soft thresholding operation. If $\hat{\mathbf{z}} := \Pi_{\mathcal{B}_1(1)}(\mathbf{z})$, then $\hat{z}_i = \text{sign}(z_i) \cdot \max\{|z_i| - \tau, 0\}$, where $\tau \geq 0$ is a threshold that can be decided by a sorting operation on the vector (see Duchi et al., 2008, for details).
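Both closed-form projections fit in a few lines. The sketch below is one possible rendering, not the monograph's code: `project_l2_ball` is the normalization step, and `project_l1_ball` finds the soft-thresholding level (called `tau` in the code) with the sort-based rule commonly attributed to Duchi et al. (2008). The function names and the `radius` parameter are assumptions of this example.

```python
import math


def project_l2_ball(z, radius=1.0):
    """Euclidean projection onto the L2 ball: rescale only if z lies outside."""
    norm = math.sqrt(sum(zi * zi for zi in z))
    if norm <= radius:
        return list(z)
    return [radius * zi / norm for zi in z]


def project_l1_ball(z, radius=1.0):
    """Euclidean projection onto the L1 ball via sorting and soft thresholding."""
    if sum(abs(zi) for zi in z) <= radius:
        return list(z)  # already feasible, threshold is zero
    magnitudes = sorted((abs(zi) for zi in z), reverse=True)
    running_sum, tau = 0.0, 0.0
    for j, mag in enumerate(magnitudes, start=1):
        running_sum += mag
        candidate = (running_sum - radius) / j
        if mag - candidate > 0:
            tau = candidate  # keep the threshold from the largest valid j
    # Shrink every magnitude by tau, clip at zero, and restore the sign.
    return [math.copysign(max(abs(zi) - tau, 0.0), zi) for zi in z]


print(project_l2_ball([3.0, 4.0]))
print(project_l1_ball([0.8, 0.6]))
print(project_l1_ball([-2.0, 0.1]))
```

In the second call the threshold works out to 0.2, giving a result whose absolute values sum to one; in the third, the small coordinate is thresholded to zero, which is how $L_1$ projections produce sparse outputs.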

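Projections onto convex sets have some very useful properties which come in handy while analyzing optimization algorithms. In the following, we will study three properties of projections. These are depicted visually in Figure 2.3 to help the reader build intuition.

[Projection Property-O] For any set $\mathcal{C} \subset \mathbb{R}^p$ (convex or not) and $\mathbf{z} \in \mathbb{R}^p$, let $\hat{\mathbf{z}} := \Pi_{\mathcal{C}}(\mathbf{z})$. Then for all $\mathbf{x} \in \mathcal{C}$, $\|\hat{\mathbf{z}} - \mathbf{z}\|_2 \leq \|\mathbf{x} - \mathbf{z}\|_2$.

This property follows by simply observing that the projection step solves the optimization problem $\min_{\mathbf{x} \in \mathcal{C}} \|\mathbf{x} - \mathbf{z}\|_2$. Note that this property holds for all sets, whether convex or not. However, the following two properties necessarily hold only for convex sets.

[Projection Property-I] For any convex set $\mathcal{C} \subset \mathbb{R}^p$ and any $\mathbf{z} \in \mathbb{R}^p$, let $\hat{\mathbf{z}} := \Pi_{\mathcal{C}}(\mathbf{z})$. Then for all $\mathbf{x} \in \mathcal{C}$, $\langle \mathbf{x} - \hat{\mathbf{z}}, \mathbf{z} - \hat{\mathbf{z}} \rangle \leq 0$.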

###### Proof.

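To prove this, assume the contrary. Suppose for some $\mathbf{x} \in \mathcal{C}$, we have $\langle \mathbf{x} - \hat{\mathbf{z}}, \mathbf{z} - \hat{\mathbf{z}} \rangle > 0$. Now, since $\mathcal{C}$ is convex and $\hat{\mathbf{z}}, \mathbf{x} \in \mathcal{C}$, for any $\lambda \in [0, 1]$, we have $\mathbf{x}_\lambda := \lambda \cdot \mathbf{x} + (1 - \lambda) \cdot \hat{\mathbf{z}} \in \mathcal{C}$. We will now show that for some value of $\lambda \in [0, 1]$, it must be the case that $\|\mathbf{z} - \mathbf{x}_\lambda\|_2 < \|\mathbf{z} - \hat{\mathbf{z}}\|_2$. This will contradict the fact that $\hat{\mathbf{z}}$ is the closest point in the convex set to $\mathbf{z}$ and prove the lemma. All that remains to be done is to find such a value of $\lambda$. Expanding $\|\mathbf{z} - \mathbf{x}_\lambda\|_2^2 = \|\mathbf{z} - \hat{\mathbf{z}}\|_2^2 - 2\lambda \langle \mathbf{x} - \hat{\mathbf{z}}, \mathbf{z} - \hat{\mathbf{z}} \rangle + \lambda^2 \|\mathbf{x} - \hat{\mathbf{z}}\|_2^2$, the reader can verify that any positive value of $\lambda$ smaller than $\min\left\{1, \frac{2 \langle \mathbf{x} - \hat{\mathbf{z}}, \mathbf{z} - \hat{\mathbf{z}} \rangle}{\|\mathbf{x} - \hat{\mathbf{z}}\|_2^2}\right\}$ suffices. Since we assumed $\langle \mathbf{x} - \hat{\mathbf{z}}, \mathbf{z} - \hat{\mathbf{z}} \rangle > 0$, any value of $\lambda$ chosen this way is always in $[0, 1]$. ∎

Projection Property-I can be used to prove a very useful contraction property for convex projections. In some sense, a convex projection brings a point closer to all points in the convex set simultaneously.

[Projection Property-II] For any convex set $\mathcal{C} \subset \mathbb{R}^p$ and any $\mathbf{z} \in \mathbb{R}^p$, let $\hat{\mathbf{z}} := \Pi_{\mathcal{C}}(\mathbf{z})$. Then for all $\mathbf{x} \in \mathcal{C}$, $\|\hat{\mathbf{z}} - \mathbf{x}\|_2 \leq \|\mathbf{z} - \mathbf{x}\|_2$.

###### Proof.

We have the following elementary inequalities

$$\begin{aligned} \|\mathbf{z} - \mathbf{x}\|_2^2 &= \|(\hat{\mathbf{z}} - \mathbf{x}) - (\hat{\mathbf{z}} - \mathbf{z})\|_2^2 \\ &= \|\hat{\mathbf{z}} - \mathbf{x}\|_2^2 + \|\hat{\mathbf{z}} - \mathbf{z}\|_2^2 - 2 \langle \hat{\mathbf{z}} - \mathbf{x}, \hat{\mathbf{z}} - \mathbf{z} \rangle \\ &\geq \|\hat{\mathbf{z}} - \mathbf{x}\|_2^2 + \|\hat{\mathbf{z}} - \mathbf{z}\|_2^2 \qquad \text{(Projection Property-I)} \\ &\geq \|\hat{\mathbf{z}} - \mathbf{x}\|_2^2. \end{aligned}$$

∎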

Note that Projection Properties-I and II are also called first order properties and can be violated if the underlying set is non-convex. However, Projection Property-O, often called a zeroth order property, always holds, whether the underlying set is convex or not.
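The contrast between convex and non-convex sets can be checked numerically. The sketch below is an illustration under assumed choices: the unit $L_2$ ball as the convex set, the two-point set $\{-1, +1\}$ on the real line as the non-convex set, and helper names invented for this example.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]


def norm(a):
    return math.sqrt(dot(a, a))


def project_l2_ball(z):
    """Projection onto the unit L2 ball (a convex set)."""
    n = norm(z)
    return list(z) if n <= 1.0 else [zi / n for zi in z]


def project_two_points(z):
    """Projection onto the non-convex set {-1, +1} on the real line."""
    return [1.0] if z[0] >= 0 else [-1.0]


def property_zero_holds(x, z, z_hat, tol=1e-9):
    return norm(sub(z_hat, z)) <= norm(sub(x, z)) + tol


def property_one_holds(x, z, z_hat, tol=1e-9):
    return dot(sub(x, z_hat), sub(z, z_hat)) <= tol


def property_two_holds(x, z, z_hat, tol=1e-9):
    return norm(sub(z_hat, x)) <= norm(sub(z, x)) + tol


# Convex case: random points z anywhere, random points x inside the ball.
random.seed(0)
convex_ok = True
for _ in range(1000):
    z = [random.uniform(-3, 3), random.uniform(-3, 3)]
    x = project_l2_ball([random.uniform(-1, 1), random.uniform(-1, 1)])
    z_hat = project_l2_ball(z)
    convex_ok = (convex_ok and property_one_holds(x, z, z_hat)
                 and property_two_holds(x, z, z_hat))

# Non-convex case: z = 0.1 projects to +1, but x = -1 also lies in the set.
z_nc, x_nc = [0.1], [-1.0]
z_hat_nc = project_two_points(z_nc)
print(convex_ok,
      property_zero_holds(x_nc, z_nc, z_hat_nc),
      property_one_holds(x_nc, z_nc, z_hat_nc),
      property_two_holds(x_nc, z_nc, z_hat_nc))
```

In the two-point example, projecting $z = 0.1$ moves it away from the feasible point $x = -1$ (distance 2 instead of 1.1), so the first order properties fail while the zeroth order property still holds.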

#### 2.3 Projected Gradient Descent

We now move on to study the projected gradient descent algorithm. This is an extremely simple and efficient technique that can effortlessly scale to large problems. Although we will apply this technique to non-convex optimization tasks later, we first look at its behavior on convex optimization problems as a warm-up exercise. We warn the reader that the proof techniques used in the convex case do not apply directly to non-convex problems. Consider the following optimization problem:

$$\min_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{x} \in \mathcal{C}. \qquad \text{(CVX-OPT)}$$

In the above optimization problem, $\mathcal{C} \subset \mathbb{R}^p$ is a convex constraint set and $f: \mathbb{R}^p \to \mathbb{R}$ is a convex objective function. We will assume that we have oracle access to the gradient and projection operators, i.e., for any point $\mathbf{x} \in \mathbb{R}^p$ we are able to access $\nabla f(\mathbf{x})$ and $\Pi_{\mathcal{C}}(\mathbf{x})$.

The projected gradient descent algorithm is stated in Algorithm 1. The procedure generates iterates by taking steps guided by the gradient in an effort to reduce the function value locally. Finally it returns either the final iterate, the average iterate, or the best iterate.
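Since the algorithm listing itself is not reproduced here, the following is a minimal sketch reconstructed from the prose description: start at the origin, alternate a gradient step with a projection, and return the final, average, and best iterates. The function signature, the test objective, and the choice of the unit $L_2$ ball as constraint set are all assumptions of this example rather than the monograph's own listing.

```python
import math


def project_l2_ball(z, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = math.sqrt(sum(zi * zi for zi in z))
    if norm <= radius:
        return list(z)
    return [radius * zi / norm for zi in z]


def projected_gradient_descent(f, grad, project, dim, step, num_steps):
    """Run PGD from the origin; return (final, average, best) over x^1..x^T."""
    x = [0.0] * dim
    running_sum = [0.0] * dim
    best, best_val = list(x), f(x)
    for t in range(1, num_steps + 1):
        # Book-keeping for the average and best iterates.
        running_sum = [s + xi for s, xi in zip(running_sum, x)]
        val = f(x)
        if val < best_val:
            best, best_val = list(x), val
        if t < num_steps:
            z = [xi - step * gi for xi, gi in zip(x, grad(x))]  # gradient step
            x = project(z)                                       # projection step
    average = [s / num_steps for s in running_sum]
    return x, average, best


# Hypothetical test problem: squared distance to (2, 2) over the unit L2 ball.
CENTER = [2.0, 2.0]


def f(x):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, CENTER))


def grad_f(x):
    return [2.0 * (xi - ci) for xi, ci in zip(x, CENTER)]


T = 100
x_final, x_avg, x_best = projected_gradient_descent(
    f, grad_f, project_l2_ball, dim=2, step=1.0 / math.sqrt(T), num_steps=T)
print(x_final, x_avg, x_best)
```

On this problem the constrained optimum is the point of the ball closest to $(2, 2)$, namely $(1/\sqrt{2}, 1/\sqrt{2})$. Tracking the best iterate costs one function evaluation per step, whereas the average needs only a running sum.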

#### 2.4 Convergence Guarantees for PGD

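We will analyze PGD for objective functions that are either a) convex with bounded gradients, or b) strongly convex and strongly smooth. Let $f^* = \min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$ be the optimal value of the optimization problem. A point $\hat{\mathbf{x}} \in \mathcal{C}$ will be said to be an $\epsilon$-optimal solution if $f(\hat{\mathbf{x}}) \leq f^* + \epsilon$.

##### 2.4.1 Convergence with Bounded Gradient Convex Functions

Consider a convex objective function $f$ with bounded gradients over a convex constraint set $\mathcal{C}$, i.e., $\|\nabla f(\mathbf{x})\|_2 \leq G$ for all $\mathbf{x} \in \mathcal{C}$.

Let $f$ be a convex objective with bounded gradients and Algorithm 1 be executed for $T$ time steps with step lengths $\eta_t = \eta = \frac{1}{\sqrt{T}}$. Then, for any $\epsilon > 0$, if $T = O\left(\frac{1}{\epsilon^2}\right)$, then $\frac{1}{T} \sum_{t=1}^T f(\mathbf{x}^t) \leq f^* + \epsilon$.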

We see that the PGD algorithm in this setting ensures that the function value of the iterates approaches $f^*$ on average. We can use this result to prove the convergence of the PGD algorithm. If we use OPTION 3, i.e., $\hat{\mathbf{x}}_{\text{best}}$, then since by construction we have $f(\hat{\mathbf{x}}_{\text{best}}) \leq f(\mathbf{x}^t)$ for all $t$, by applying Theorem 2.4.1, we get

$$f(\hat{\mathbf{x}}_{\text{best}}) \leq \frac{1}{T} \sum_{t=1}^T f(\mathbf{x}^t) \leq f^* + \epsilon.$$

If we use OPTION 2, i.e., $\hat{\mathbf{x}}_{\text{avg}}$, which is cheaper since we do not have to perform function evaluations to find the best iterate, we can apply Jensen's inequality (Lemma 2.2) to get the following

$$f(\hat{\mathbf{x}}_{\text{avg}}) = f\left(\frac{1}{T} \sum_{t=1}^T \mathbf{x}^t\right) \leq \frac{1}{T} \sum_{t=1}^T f(\mathbf{x}^t) \leq f^* + \epsilon.$$

Note that Jensen's inequality may be applied only when the function $f$ is convex. Now, whereas OPTION 1, i.e., $\hat{\mathbf{x}}_{\text{final}}$, is the cheapest and does not require any additional operations, $\hat{\mathbf{x}}_{\text{final}}$ does not converge to the optimum for convex functions in general and may oscillate close to the optimum. However, we shall shortly see that $\hat{\mathbf{x}}_{\text{final}}$ does converge if the objective function is strongly smooth. Recall that strongly smooth functions may not grow at a faster-than-quadratic rate.

The reader would note that we have set the step length to a value that depends on the total number of iterations $T$ for which the PGD algorithm is executed. This is called a horizon-aware setting of the step length. In case we are not sure what the value of $T$ would be, a horizon-oblivious setting of $\eta_t = \frac{1}{\sqrt{t}}$ can also be shown to work (see Exercise 2.5).

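###### Proof (of Theorem 2.4.1).

Let $\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$ denote any point in the constraint set where the optimum function value is achieved. Such a point always exists if the constraint set is closed and the objective function continuous. We will use the following potential function $\Phi_t = f(\mathbf{x}^t) - f(\mathbf{x}^*)$ to track the progress of the algorithm. Note that $\Phi_t$ measures the sub-optimality of the $t$-th iterate. Indeed, the statement of the theorem is equivalent to claiming that $\frac{1}{T} \sum_{t=1}^T \Phi_t \leq \epsilon$.

(Apply Convexity) We apply convexity to upper bound the potential function at every step. Convexity is a global property and very useful in getting an upper bound on the level of sub-optimality of the current iterate in such analyses.

$$\Phi_t = f(\mathbf{x}^t) - f(\mathbf{x}^*) \leq \langle \nabla f(\mathbf{x}^t), \mathbf{x}^t - \mathbf{x}^* \rangle$$

We now do some elementary manipulations

$$\begin{aligned} \langle \nabla f(\mathbf{x}^t), \mathbf{x}^t - \mathbf{x}^* \rangle &= \frac{1}{\eta} \langle \eta \cdot \nabla f(\mathbf{x}^t), \mathbf{x}^t - \mathbf{x}^* \rangle \\ &= \frac{1}{2\eta} \left( \|\mathbf{x}^t - \mathbf{x}^*\|_2^2 + \|\eta \cdot \nabla f(\mathbf{x}^t)\|_2^2 - \|\mathbf{x}^t - \eta \cdot \nabla f(\mathbf{x}^t) - \mathbf{x}^*\|_2^2 \right) \\ &= \frac{1}{2\eta} \left( \|\mathbf{x}^t - \mathbf{x}^*\|_2^2 + \|\eta \cdot \nabla f(\mathbf{x}^t)\|_2^2 - \|\mathbf{z}^{t+1} - \mathbf{x}^*\|_2^2 \right) \\ &\leq \frac{1}{2\eta} \left( \|\mathbf{x}^t - \mathbf{x}^*\|_2^2 + \eta^2 G^2 - \|\mathbf{z}^{t+1} - \mathbf{x}^*\|_2^2 \right), \end{aligned}$$

where the first step applies the identity $2ab = a^2 + b^2 - (a - b)^2$, the second step uses the update step of the PGD algorithm that sets $\mathbf{z}^{t+1} = \mathbf{x}^t - \eta \cdot \nabla f(\mathbf{x}^t)$, and the third step uses the fact that the objective function $f$ has bounded gradients.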

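(Apply Projection Property) We apply Projection Property-II (Lemma 2.3) to get

$$\|\mathbf{z}^{t+1} - \mathbf{x}^*\|_2^2 \geq \|\mathbf{x}^{t+1} - \mathbf{x}^*\|_2^2$$

Putting all these together gives us

$$\Phi_t \leq \frac{1}{2\eta} \left( \|\mathbf{x}^t - \mathbf{x}^*\|_2^2 - \|\mathbf{x}^{t+1} - \mathbf{x}^*\|_2^2 \right) + \frac{\eta G^2}{2}$$

The above expression is interesting since it tells us that, apart from the term $\frac{\eta G^2}{2}$ which is small as $\eta = \frac{1}{\sqrt{T}}$, the current sub-optimality $\Phi_t$ is small if the consecutive iterates $\mathbf{x}^t$ and $\mathbf{x}^{t+1}$ are close to each other (and hence similar in distance from $\mathbf{x}^*$).

This observation is quite useful since it tells us that once PGD stops making a lot of progress, it actually converges to the optimum! In hindsight, this is to be expected. Since we are using a constant step length, only a vanishing gradient can cause PGD to stop progressing. However, for convex functions, this only happens at global optima. Summing the expression up across time steps, performing telescopic cancellations, using $\mathbf{x}^1 = \mathbf{0}$, and dividing throughout by $T$ gives us

$$\begin{aligned} \frac{1}{T} \sum_{t=1}^T \Phi_t &\leq \frac{1}{2\eta T} \left( \|\mathbf{x}^*\|_2^2 - \|\mathbf{x}^{T+1} - \mathbf{x}^*\|_2^2 \right) + \frac{\eta G^2}{2} \\ &\leq \frac{1}{2\sqrt{T}} \left( \|\mathbf{x}^*\|_2^2 + G^2 \right), \end{aligned}$$

where in the second step, we have used the fact that $\|\mathbf{x}^{T+1} - \mathbf{x}^*\|_2 \geq 0$ and $\eta = \frac{1}{\sqrt{T}}$. This gives us the claimed result. ∎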
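To see what the final bound means in terms of iteration counts, it helps to plug in numbers. The values below ($\|\mathbf{x}^*\|_2 = 1$, $G = 2$, $\epsilon = 0.05$) are illustrative assumptions, not taken from the text.

```latex
\frac{1}{2\sqrt{T}}\left(\|\mathbf{x}^*\|_2^2 + G^2\right) \leq \epsilon
\quad\Longleftrightarrow\quad
T \geq \left(\frac{\|\mathbf{x}^*\|_2^2 + G^2}{2\epsilon}\right)^2
= \left(\frac{1 + 4}{2 \cdot 0.05}\right)^2 = 50^2 = 2500.
```

Halving the target accuracy to $\epsilon = 0.025$ raises the requirement to $100^2 = 10000$ iterations, which is the $1/\epsilon^2$ scaling in action.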

##### 2.4.2 Convergence with Strongly Convex and Smooth Functions

We will now prove a stronger guarantee for PGD when the objective function is strongly convex and strongly smooth (see Definition 2.2).

Let $f$ be an objective that satisfies the $\alpha$-SC and $\beta$-SS properties. Let Algorithm 1 be executed with step lengths $\eta_t = \eta = \frac{1}{\beta}$. Then after at most $T = O\left(\frac{\beta}{\alpha} \log \frac{\beta}{\epsilon}\right)$ steps, we have $f(\mathbf{x}^T) \leq f(\mathbf{x}^*) + \epsilon$.

This result is particularly nice since it ensures that the final iterate $\hat{\mathbf{x}}_{\text{final}} = \mathbf{x}^T$ converges, allowing us to use OPTION 1 in Algorithm 1 when the objective is SC/SS. A further advantage is the accelerated rate of convergence. Whereas for general convex functions, PGD requires $O\left(\frac{1}{\epsilon^2}\right)$ iterations to reach an $\epsilon$-optimal solution, for SC/SS functions, it requires only $O\left(\log \frac{1}{\epsilon}\right)$ iterations.

The reader would notice the insistence on the step length being set to $\eta = \frac{1}{\beta}$. In fact, the proof we show below crucially uses this setting. In practice, for many problems, $\beta$ may not be known to us or may be expensive to compute, which presents a problem. However, as it turns out, it is not necessary to set the step length exactly to $\frac{1}{\beta}$. The result can be shown to hold even for values of $\eta$ smaller than $\frac{1}{\beta}$ which are nevertheless large enough, but the proof becomes more involved. In practice, the step length is tuned globally by doing a grid search over several values, or per-iteration using line search mechanisms, to obtain a step length value that assures good convergence rates.
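The geometric rate can be observed on a toy problem. The sketch below assumes an unconstrained setting (so the projection is the identity) and a hypothetical quadratic $f(\mathbf{x}) = \frac{1}{2}(\alpha x_1^2 + \beta x_2^2)$ with $\alpha = 1$ and $\beta = 4$, whose minimizer is the origin; the function name is invented for this example. It records the squared distance to the optimum after every step taken with step length $1/\beta$.

```python
def gradient_descent_quadratic(alpha, beta, x0, num_steps):
    """Gradient descent with step 1/beta on f(x) = 0.5*(alpha*x1^2 + beta*x2^2).

    Returns the final iterate and the squared distances to the optimum (the
    origin) at every step, starting with the initial point.
    """
    step = 1.0 / beta
    x = list(x0)
    sq_dists = [x[0] ** 2 + x[1] ** 2]
    for _ in range(num_steps):
        grad = [alpha * x[0], beta * x[1]]
        x = [xi - step * gi for xi, gi in zip(x, grad)]
        sq_dists.append(x[0] ** 2 + x[1] ** 2)
    return x, sq_dists


ALPHA, BETA = 1.0, 4.0
x_last, sq_dists = gradient_descent_quadratic(ALPHA, BETA, [1.0, 1.0], 20)

# On this example each step shrinks the squared distance by at least (1 - alpha/beta).
contraction = 1.0 - ALPHA / BETA
contracts_every_step = all(
    sq_dists[t + 1] <= contraction * sq_dists[t] + 1e-15
    for t in range(len(sq_dists) - 1)
)
print(contracts_every_step, sq_dists[-1])
```

Because the error shrinks by a constant factor per step, reaching accuracy $\epsilon$ takes a number of steps proportional to $\log(1/\epsilon)$, in contrast with the $1/\epsilon^2$ dependence of the bounded gradient case.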

###### Proof (of Theorem 2.4.2).

This proof is a nice opportunity for the reader to see how the SC/SS properties are utilized in a convergence analysis. As with convexity in the proof of Theorem 2.4.1, the strong convexity property is a global property that will be useful in assessing the progress made so far by relating the optimal point $\mathbf{x}^*$ with the current iterate $\mathbf{x}^t$. Strong smoothness, on the other hand, will be used locally to show that the procedure makes significant progress between iterates.

We will prove the result by showing that after at most $T = O\left(\frac{\beta}{\alpha}\log\frac{\beta}{\epsilon}\right)$ steps, we will have $\|\mathbf{x}^T - \mathbf{x}^*\|_2^2 \le \frac{2\epsilon}{\beta}$. This already tells us that we have reached very close to the optimum. However, we can use this to show that $\mathbf{x}^T$ is $\epsilon$-optimal in function value as well. Since we are very close to the optimum, it makes sense to apply strong smoothness to upper bound the sub-optimality as follows

$$f(\mathbf{x}^T) \le f(\mathbf{x}^*) + \langle \nabla f(\mathbf{x}^*), \mathbf{x}^T - \mathbf{x}^* \rangle + \frac{\beta}{2}\|\mathbf{x}^T - \mathbf{x}^*\|_2^2.$$

Now, since is an optimal point for the constrained optimization problem with a convex constraint set , the first order optimality condition (see Bubeck, 2015, Proposition 1.3) gives us for any . Applying this condition with gives us

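$$f(\mathbf{x}^T) - f(\mathbf{x}^*) \le \frac{\beta}{2}\|\mathbf{x}^T - \mathbf{x}^*\|_2^2 \le \epsilon,$$

which proves that $\mathbf{x}^T$ is an $\epsilon$-optimal point. We now show $\|\mathbf{x}^T - \mathbf{x}^*\|_2^2 \le \frac{2\epsilon}{\beta}$. Given that we wish to show convergence in terms of the iterates, and not in terms of the function values, as we did in Theorem 2.4.1, a natural potential function for this analysis is $\Phi_t = \|\mathbf{x}^t - \mathbf{x}^*\|_2^2$.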

(Apply Strong Smoothness) As discussed before, we use it to show that PGD always makes significant progress in each iteration.

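$$\begin{aligned}
f(\mathbf{x}^{t+1}) - f(\mathbf{x}^t) &\le \langle \nabla f(\mathbf{x}^t), \mathbf{x}^{t+1} - \mathbf{x}^t \rangle + \frac{\beta}{2}\|\mathbf{x}^t - \mathbf{x}^{t+1}\|_2^2 \\
&= \langle \nabla f(\mathbf{x}^t), \mathbf{x}^{t+1} - \mathbf{x}^* \rangle + \langle \nabla f(\mathbf{x}^t), \mathbf{x}^* - \mathbf{x}^t \rangle + \frac{\beta}{2}\|\mathbf{x}^t - \mathbf{x}^{t+1}\|_2^2 \\
&= \frac{1}{\eta}\langle \mathbf{x}^t - \mathbf{z}^{t+1}, \mathbf{x}^{t+1} - \mathbf{x}^* \rangle + \langle \nabla f(\mathbf{x}^t), \mathbf{x}^* - \mathbf{x}^t \rangle + \frac{\beta}{2}\|\mathbf{x}^t - \mathbf{x}^{t+1}\|_2^2,
\end{aligned}$$

where the last step uses the gradient step $\mathbf{z}^{t+1} = \mathbf{x}^t - \eta\nabla f(\mathbf{x}^t)$.

(Apply Projection Rule) The above expression contains an unwieldy term $\mathbf{z}^{t+1}$. Since this term only appears during projection steps, we eliminate it by applying Projection Property-I (Lemma 2.3) to get

$$\langle \mathbf{x}^t - \mathbf{z}^{t+1}, \mathbf{x}^{t+1} - \mathbf{x}^* \rangle \le \langle \mathbf{x}^t - \mathbf{x}^{t+1}, \mathbf{x}^{t+1} - \mathbf{x}^* \rangle = \frac{\|\mathbf{x}^t - \mathbf{x}^*\|_2^2 - \|\mathbf{x}^t - \mathbf{x}^{t+1}\|_2^2 - \|\mathbf{x}^{t+1} - \mathbf{x}^*\|_2^2}{2}.$$

Using $\eta = \frac{1}{\beta}$ and combining the above results gives us

$$f(\mathbf{x}^{t+1}) - f(\mathbf{x}^t) \le \langle \nabla f(\mathbf{x}^t), \mathbf{x}^* - \mathbf{x}^t \rangle + \frac{\beta}{2}\left(\|\mathbf{x}^t - \mathbf{x}^*\|_2^2 - \|\mathbf{x}^{t+1} - \mathbf{x}^*\|_2^2\right).$$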

(Apply Strong Convexity) The above expression is perfect for a telescoping step but for the inner product term. Fortunately, this can be eliminated using strong convexity.

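$$\langle \nabla f(\mathbf{x}^t), \mathbf{x}^* - \mathbf{x}^t \rangle \le f(\mathbf{x}^*) - f(\mathbf{x}^t) - \frac{\alpha}{2}\|\mathbf{x}^t - \mathbf{x}^*\|_2^2.$$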

Combining with the above this gives us

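$$f(\mathbf{x}^{t+1}) - f(\mathbf{x}^*) \le \frac{\beta - \alpha}{2}\|\mathbf{x}^t - \mathbf{x}^*\|_2^2 - \frac{\beta}{2}\|\mathbf{x}^{t+1} - \mathbf{x}^*\|_2^2.$$

The above form seems almost ready for a telescoping exercise. However, something much stronger can be said here, especially due to the $-\|\mathbf{x}^{t+1} - \mathbf{x}^*\|_2^2$ term. Notice that we have $f(\mathbf{x}^{t+1}) \ge f(\mathbf{x}^*)$, since $\mathbf{x}^{t+1}$ is feasible and $\mathbf{x}^*$ is optimal. This means

$$\frac{\beta}{2}\|\mathbf{x}^{t+1} - \mathbf{x}^*\|_2^2 \le \frac{\beta - \alpha}{2}\|\mathbf{x}^t - \mathbf{x}^*\|_2^2,$$

which can be written as

$$\Phi_{t+1} \le \left(1 - \frac{\alpha}{\beta}\right)\Phi_t \le \exp\left(-\frac{\alpha}{\beta}\right)\Phi_t,$$

where we have used the fact that $1 - x \le \exp(-x)$ for all $x \in \mathbb{R}$. What we have arrived at is a very powerful result as it assures us that the potential value goes down by a constant fraction at every iteration! Applying this result recursively gives us

$$\Phi_{t+1} \le \exp\left(-\frac{\alpha t}{\beta}\right)\Phi_1 = \exp\left(-\frac{\alpha t}{\beta}\right)\|\mathbf{x}^*\|_2^2,$$

since $\mathbf{x}^1 = \mathbf{0}$. Thus, we deduce that $\Phi_T = \|\mathbf{x}^T - \mathbf{x}^*\|_2^2 \le \frac{2\epsilon}{\beta}$ after at most $T = O\left(\frac{\beta}{\alpha}\log\frac{\beta}{\epsilon}\right)$ steps, which finishes the proof. ∎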
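The per-step contraction $\Phi_{t+1} \le (1 - \alpha/\beta)\Phi_t$ derived in the proof can be checked numerically on a small instance. The following is a minimal sketch under stated assumptions, not part of the original analysis: the objective is a separable quadratic (so $\alpha$ and $\beta$ are the smallest and largest diagonal entries), the constraint set is the box $[0,1]^p$ (so the projection is coordinate-wise clipping and the constrained optimum has a closed form), and the function name `pgd_box_quadratic` and the specific numbers are hypothetical.

```python
import numpy as np

def pgd_box_quadratic(a, b, num_steps):
    """PGD on f(x) = 0.5 * sum(a_i * x_i^2) - sum(b_i * x_i) over the box [0, 1]^p.

    Uses the step length 1/beta with beta = max(a) and starts from x^1 = 0.
    Returns the list of iterates x^1, ..., x^{num_steps + 1}.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    beta = a.max()                      # strong smoothness constant of f
    x = np.zeros_like(a)
    iterates = [x.copy()]
    for _ in range(num_steps):
        grad = a * x - b                # gradient of the separable quadratic
        z = x - grad / beta             # gradient step with eta = 1/beta
        x = np.clip(z, 0.0, 1.0)        # projection onto the (convex) box
        iterates.append(x.copy())
    return iterates

# Illustrative problem data (assumed, not from the text).
a = np.array([1.0, 2.0, 5.0, 10.0])
b = np.array([2.0, -1.0, 2.5, 30.0])

# For a separable quadratic, the constrained minimizer is found per coordinate.
x_star = np.clip(b / a, 0.0, 1.0)

iterates = pgd_box_quadratic(a, b, num_steps=50)
potentials = [float(np.sum((x - x_star) ** 2)) for x in iterates]

rate = 1.0 - a.min() / a.max()          # the factor 1 - alpha/beta
contracts = all(
    nxt <= rate * cur + 1e-12
    for cur, nxt in zip(potentials, potentials[1:])
)
print("contraction holds at every step:", contracts)
```

A separable objective is used only so that $\mathbf{x}^*$ is available exactly; for a general SC/SS objective one would compare against a high-accuracy reference solution instead.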

We notice that the convergence of the PGD algorithm is of the form $\|\mathbf{x}^{t+1} - \mathbf{x}^*\|_2^2 \le \exp\left(-\frac{\alpha t}{\beta}\right)\|\mathbf{x}^*\|_2^2$. The number $\kappa := \frac{\beta}{\alpha}$ is the condition number of the optimization problem. The concept of condition number is central to numerical optimization. Below we give an informal and generic definition for the concept. In later sections we will see the condition number appearing repeatedly in the context of the convergence of various optimization algorithms for convex, as well as non-convex problems. The exact numerical form of the condition number (for instance here it is $\beta/\alpha$) will also change depending on the application at hand. However, in general, all these definitions of condition number will satisfy the following property.

[Condition Number - Informal] The condition number of a function $f$ is a scalar $\kappa$ that bounds how much the function value can change relative to a perturbation of the input.

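Functions with a small condition number are stable and changes to their input do not affect the function output values too much. However, functions with a large condition number can be quite jumpy and experience abrupt changes in output values even if the input is changed slightly. To gain a deeper appreciation of this concept, consider a differentiable function $f$ that is also $\alpha$-SC and $\beta$-SS. Consider a stationary point for $f$ i.e., a point $\mathbf{x}$ such that $\nabla f(\mathbf{x}) = \mathbf{0}$. For a general function, such a point can be a local optimum or a saddle point. However, since $f$ is strongly convex, $\mathbf{x}$ is the (unique) global minimum of $f$ (see Exercise 2.5). Then we have, for any other point $\mathbf{y}$,

$$\frac{\alpha}{2}\|\mathbf{x} - \mathbf{y}\|_2^2 \le f(\mathbf{y}) - f(\mathbf{x}) \le \frac{\beta}{2}\|\mathbf{x} - \mathbf{y}\|_2^2.$$

Dividing throughout by $\frac{\alpha}{2}\|\mathbf{x} - \mathbf{y}\|_2^2$ gives us

$$\frac{f(\mathbf{y}) - f(\mathbf{x})}{\frac{\alpha}{2}\|\mathbf{x} - \mathbf{y}\|_2^2} \in \left[1, \frac{\beta}{\alpha}\right] := [1, \kappa].$$

Thus, upon perturbing the input from the global minimum $\mathbf{x}$ to a point $\mathbf{y}$ at distance $\epsilon := \|\mathbf{x} - \mathbf{y}\|_2$ away, the function value changes in a controlled manner: it goes up by an amount at least $\frac{\alpha\epsilon^2}{2}$ but at most $\kappa\cdot\frac{\alpha\epsilon^2}{2}$. Such well-behaved response to perturbations is very easy for optimization algorithms to exploit to give fast convergence.

The condition number of the objective function can significantly affect the convergence rate of algorithms. Indeed, if $\kappa = \frac{\beta}{\alpha}$ is small, then $\exp\left(-\frac{\alpha}{\beta}\right) = \exp\left(-\frac{1}{\kappa}\right)$ would be small, ensuring fast convergence. However, if $\kappa \gg 1$ then $\exp\left(-\frac{1}{\kappa}\right) \approx 1$ and the procedure might offer slow convergence.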
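The sandwich bound on the function value around a global minimum can be illustrated on a quadratic. This is a small sketch under assumptions that are not in the text: the objective is $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top A\mathbf{x}$ with a randomly generated positive definite $A$ (so the minimizer is the origin, and $\alpha$, $\beta$ are the extreme eigenvalues of $A$), and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric positive definite matrix (illustrative assumption).
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.5 * np.eye(5)

eigvals = np.linalg.eigvalsh(A)         # eigenvalues in ascending order
alpha, beta = eigvals[0], eigvals[-1]   # SC and SS constants of f
kappa = beta / alpha                    # condition number beta / alpha

def f(x):
    """f(x) = 0.5 * x^T A x, minimized at x = 0 with f(0) = 0."""
    return 0.5 * x @ A @ x

# Ratio (f(y) - f(x*)) / (alpha/2 * ||y - x*||^2) for random perturbations y.
ratios = []
for _ in range(1000):
    y = rng.standard_normal(5)
    ratios.append(f(y) / (0.5 * alpha * (y @ y)))
ratios = np.array(ratios)

print("kappa:", kappa)
print("observed ratio range:", ratios.min(), ratios.max())
```

Every observed ratio should fall in $[1, \kappa]$. For this quadratic, $\kappa$ also coincides with the 2-norm condition number of the Hessian $A$, as returned by `np.linalg.cond(A)`.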

#### 2.5 Exercises

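Exercise 2.1. Show that strong smoothness does not imply convexity by constructing a non-convex function $f: \mathbb{R}^p \to \mathbb{R}$ that is $1$-SS.

Exercise 2.2. Show that if a differentiable function $f$ has bounded gradients i.e., $\|\nabla f(\mathbf{x})\|_2 \le G$ for all $\mathbf{x} \in \mathbb{R}^p$, then $f$ is Lipschitz. What is its Lipschitz constant?
Hint: use the mean value theorem.

Exercise 2.3. Show that for any point $\mathbf{z} \notin \mathcal{B}_2(r)$, the projection onto the ball is given by $\Pi_{\mathcal{B}_2(r)}(\mathbf{z}) = r\cdot\frac{\mathbf{z}}{\|\mathbf{z}\|_2}$.

Exercise 2.4. Show that a horizon-oblivious setting of $\eta_t = \frac{1}{\sqrt{t}}$ while executing the PGD algorithm with a convex function with bounded gradients also ensures convergence.
Hint: the convergence rates may be a bit different for this setting.

Exercise 2.5. Show that if $f: \mathbb{R}^p \to \mathbb{R}$ is a strongly convex function that is differentiable, then there is a unique point $\mathbf{x}^* \in \mathbb{R}^p$ that minimizes the function value i.e., $f(\mathbf{x}^*) = \min_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x})$.

Exercise 2.6. Show that the set of sparse vectors $\mathcal{B}_0(s) \subset \mathbb{R}^p$ is non-convex for any $s < p$. What happens when $s = p$?

Exercise 2.7. Show that $\mathcal{B}_{\text{rank}}(r) \subseteq \mathbb{R}^{n \times n}$, the set of $n \times n$ matrices with rank at most $r$, is non-convex for any $r < n$. What happens when $r = n$?

Exercise 2.8. Consider the Cartesian product set $\mathcal{C} = \mathbb{R}^{m \times r} \times \mathbb{R}^{n \times r}$. Show that it is convex.

Exercise 2.9. Consider a least squares optimization problem with a strongly convex and smooth objective. Show that the condition number of this problem is equal to the condition number of the Hessian matrix of the objective function.

Exercise 2.10. Show that if $f$ is a strongly convex function that is differentiable, then optimization problems with $f$ as an objective and a convex constraint set $\mathcal{C}$ always have a unique solution i.e., there is a unique point $\mathbf{x}^* \in \mathcal{C}$ that is a solution to the optimization problem $\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$. This generalizes the result in Exercise 2.5.
Hint: use the first order optimality condition (see proof of Theorem 2.4.2).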

#### 2.6 Bibliographic Notes

The sole aim of this discussion was to give a self-contained introduction to concepts and tools in convex analysis and descent algorithms in order to seamlessly introduce non-convex optimization techniques and their applications in subsequent sections. However, given the limited scope of this discussion, we could not cover several useful and interesting results concerning convex functions and optimization techniques. We refer the reader to literature in the field of optimization theory for a much more relaxed and deeper introduction to the area of convex optimization. Some excellent examples include (Bertsekas, 2016; Boyd and Vandenberghe, 2004; Bubeck, 2015; Nesterov, 2003; Sra et al., 2011).

#### 3.1 Non-Convex Projections

Executing the projected gradient descent algorithm with non-convex problems requires projections onto non-convex sets. Now, a quick look at the projection problem

$$\Pi_{\mathcal{C}}(\mathbf{z}) := \arg\min_{\mathbf{x} \in \mathcal{C}} \|\mathbf{x} - \mathbf{z}\|_2$$

reveals that this is an optimization problem in itself. Thus, when the set to be projected onto is non-convex, the projection problem can itself be NP-hard. However, for several well-structured sets, projection can be carried out efficiently despite the sets being non-convex.

##### 3.1.1 Projecting into Sparse Vectors

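In the sparse linear regression example discussed in § 1,

$$\hat{\boldsymbol{\theta}} = \arg\min_{\|\boldsymbol{\theta}\|_0 \le s} \sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^\top\boldsymbol{\theta}\right)^2,$$

applying projected gradient descent requires projections onto the set of $s$-sparse vectors i.e., $\mathcal{B}_0(s) := \{\mathbf{x} \in \mathbb{R}^p : \|\mathbf{x}\|_0 \le s\}$. The following result shows that the projection can be carried out by simply sorting the coordinates of the vector according to magnitude and setting all except the top-$s$ coordinates to zero.

For any vector $\mathbf{z} \in \mathbb{R}^p$, let $\sigma$ be the permutation that sorts the coordinates of $\mathbf{z}$ in decreasing order of magnitude, i.e., $|\mathbf{z}_{\sigma(1)}| \ge |\mathbf{z}_{\sigma(2)}| \ge \ldots \ge |\mathbf{z}_{\sigma(p)}|$. Then the vector $\hat{\mathbf{z}} := \Pi_{\mathcal{B}_0(s)}(\mathbf{z})$ is obtained by setting $\hat{\mathbf{z}}_i = \mathbf{z}_i$ if $\sigma^{-1}(i) \le s$ (that is, if coordinate $i$ is among the $s$ largest in magnitude) and $\hat{\mathbf{z}}_i = 0$ otherwise.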

###### Proof.

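We first notice that since the function $x \mapsto x^2$ is an increasing function on the positive half of the real line, we have $\arg\min_{\mathbf{x} \in \mathcal{B}_0(s)} \|\mathbf{x} - \mathbf{z}\|_2 = \arg\min_{\mathbf{x} \in \mathcal{B}_0(s)} \|\mathbf{x} - \mathbf{z}\|_2^2$. Next, we observe that the vector $\hat{\mathbf{z}} := \Pi_{\mathcal{B}_0(s)}(\mathbf{z})$ must satisfy $\hat{\mathbf{z}}_i = \mathbf{z}_i$ for all $i \in \text{supp}(\hat{\mathbf{z}})$, otherwise we can decrease the objective value by ensuring this. Having established this gives us $\|\hat{\mathbf{z}} - \mathbf{z}\|_2^2 = \sum_{i \notin \text{supp}(\hat{\mathbf{z}})} \mathbf{z}_i^2$. This is clearly minimized when $\hat{\mathbf{z}}$ has the $s$ coordinates of $\mathbf{z}$ with largest magnitude. ∎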
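The sort-and-truncate procedure described in the result above can be sketched in a few lines. This is an illustrative implementation, not code from the monograph; the function name `project_sparse` is hypothetical, and ties in magnitude are broken by index order, which is one of several equally valid choices.

```python
import numpy as np

def project_sparse(z, s):
    """Project z onto the set of s-sparse vectors (hard thresholding).

    Keeps the s coordinates of largest magnitude and zeroes out the rest.
    """
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    if s <= 0:
        return out
    # Indices sorted by decreasing magnitude; keep the first s of them.
    top = np.argsort(-np.abs(z), kind="stable")[:s]
    out[top] = z[top]
    return out

z = np.array([0.5, -3.0, 2.0, 0.1, -1.0])
z_hat = project_sparse(z, s=2)
print(z_hat)  # keeps -3.0 and 2.0, zeroes the other coordinates
```

Sorting costs $O(p \log p)$; a selection routine such as `np.argpartition` would bring this down to linear time, at the cost of departing slightly from the sorting description in the text.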

##### 3.1.2 Projecting into Low-rank Matrices

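In the recommendation systems problem, as discussed in § 1,

$$\hat{A}_{\text{lr}} = \arg\min_{\text{rank}(X) \le r} \sum_{(i,j) \in \Omega}\left(X_{ij} - A_{ij}\right)^2,$$

we need to project onto the set of low-rank matrices. Let us first define this problem formally. Consider matrices of a certain order, say $m \times n$, and let $\mathcal{C} \subset \mathbb{R}^{m \times n}$ be an arbitrary set of matrices. Then, the projection operator $\Pi_{\mathcal{C}}(\cdot)$ is defined as follows: for any matrix $A \in \mathbb{R}^{m \times n}$,

$$\Pi_{\mathcal{C}}(A) := \arg\min_{X \in \mathcal{C}} \|A - X\|_F,$$

where $\|\cdot\|_F$ is the Frobenius norm over matrices. For low rank projections we require $\mathcal{C}$ to be the set of low rank matrices $\mathcal{B}_{\text{rank}}(r) := \{A \in \mathbb{R}^{m \times n} : \text{rank}(A) \le r\}$. Yet again, this projection can be done efficiently by performing a Singular Value Decomposition on the matrix $A$ and retaining the top $r$ singular values and vectors. The Eckart-Young-Mirsky theorem proves that this indeed gives us the projection.

[Eckart-Young-Mirsky theorem] For any matrix $A \in \mathbb{R}^{m \times n}$, let $U\Sigma V^\top$ be the singular value decomposition of $A$ such that $\Sigma = \text{diag}(\sigma_1, \sigma_2, \ldots, \sigma_{\min(m,n)})$ where $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_{\min(m,n)}$. Then for any $r \le \min(m,n)$, the matrix $\hat{A}_{(r)} := \Pi_{\mathcal{B}_{\text{rank}}(r)}(A)$ can be obtained as $U_{(r)}\Sigma_{(r)}V_{(r)}^\top$ where $U_{(r)} := [U_1\ U_2\ \ldots\ U_r]$, $V_{(r)} := [V_1\ V_2\ \ldots\ V_r]$, and $\Sigma_{(r)} := \text{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$.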

Although we have stated the above result for projections with the Frobenius norm defining the projections, the Eckart-Young-Mirsky theorem actually applies to any unitarily invariant norm including the Schatten norms and the operator norm. The proof of this result is beyond the scope of this monograph.
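The truncated-SVD projection stated in the Eckart-Young-Mirsky theorem can be sketched as follows. This is an illustrative snippet for the Frobenius-norm case; the function name `project_low_rank` and the random test matrix are assumptions made for the example, while `np.linalg.svd` is the standard NumPy routine, which returns singular values in decreasing order.

```python
import numpy as np

def project_low_rank(A, r):
    """Project A onto the set of matrices of rank at most r.

    Computes a thin SVD and keeps only the top r singular values and
    the corresponding left and right singular vectors.
    """
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first r left singular vectors by their singular values.
    return (U[:, :r] * sigma[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
A_r = project_low_rank(A, r=2)

# The projection error equals the energy in the discarded singular values.
sigma = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A - A_r, "fro"), np.sqrt(np.sum(sigma[2:] ** 2)))
```

A full SVD is used here for simplicity; for large matrices and small $r$, one would typically reach for a routine that computes only the leading singular triplets.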

Before moving on, we caution the reader that the ability to efficiently project onto the non-convex sets mentioned above does not imply that non-convex projections are as nicely behaved as their convex counterparts. Indeed, none of the projections mentioned above satisfy projection properties I or II (Lemmata 2.3 and 2.3). This will pose a significant challenge while analyzing PGD-style algorithms for non-convex problems since, as we would recall, these properties were crucially used in all convergence proofs discussed in § 2.

#### 3.2 Restricted Strong Convexity and Smoothness

In § 2, we saw how optimization problems with convex constraint sets and objective functions that are convex and have bounded gradients, or else are strongly convex and smooth, can be effectively optimized using PGD, with much faster rates of convergence if the objective is strongly convex and smooth. However, when the constraint set fails to be convex, these results fail to apply.

There are several workarounds to this problem, the simplest being to convert the constraint set into a convex one, possibly by taking its convex hull, which is what relaxation methods do. (The convex hull of any set $\mathcal{C}$ is the "smallest" convex set that contains $\mathcal{C}$. Formally, we define $\text{conv}(\mathcal{C}) := \bigcap_{S \supseteq \mathcal{C},\ S \text{ convex}} S$. If $\mathcal{C}$ is convex then it is its own convex hull.) However, a much less drastic alternative exists that is widely popular in non-convex optimization literature.

The intuition is a simple one and generalizes many of the insights we gained from our discussion in § 2. The first thing we need to notice (see Exercise 3.4) is that the convergence results for the PGD algorithm in § 2 actually do not require the objective function to be convex (or strongly convex/strongly smooth) over the entire $\mathbb{R}^p$. These properties are only required to be satisfied over the constraint set being considered. A natural generalization that emerges from this insight is the concept of restricted properties that are discussed below.

[Restricted Convexity] A continuously differentiable function $f: \mathbb{R}^p \to \mathbb{R}$ is said to satisfy restricted convexity over a (possibly non-convex) region $\mathcal{C} \subseteq \mathbb{R}^p$ if for every $\mathbf{x}, \mathbf{y} \in \mathcal{C}$ we have $f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$, where $\nabla f(\mathbf{x})$ is the gradient of $f$ at $\mathbf{x}$.

As before, a more general definition that extends to non-differentiable functions, uses the notion of subgradient to replace the gradient in the above expression.

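[Restricted Strong Convexity/Smoothness] A continuously differentiable function $f: \mathbb{R}^p \to \mathbb{R}$ is said to satisfy $\alpha$-restricted strong convexity (RSC) and $\beta$-restricted strong smoothness (RSS) over a (possibly non-convex) region $\mathcal{C} \subseteq \mathbb{R}^p$ if for every $\mathbf{x}, \mathbf{y} \in \mathcal{C}$, we have

$$\frac{\alpha}{2}\|\mathbf{x} - \mathbf{y}\|_2^2 \le f(\mathbf{y}) - f(\mathbf{x}) - \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \le \frac{\beta}{2}\|\mathbf{x} - \mathbf{y}\|_2^2.$$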

Note that, as Figure 3.1 demonstrates, even non-convex functions can demonstrate the RSC/RSS properties over suitable subsets. Conversely, functions that satisfy RSC/RSS need not be convex. It turns out that in several practical situations, such as those explored by later sections, the objective functions in the non-convex optimization problems do satisfy the RSC/RSS properties described above, in some form.

We also remind the reader that the RSC/RSS definitions presented here are quite generic and presented to better illustrate basic concepts. Indeed, for specific non-convex problems such as sparse recovery, low-rank matrix recovery, and robust regression, the later sections will develop more refined versions of these properties that are better tailored to those problems. In particular, for sparse recovery problems, the RSC/RSS properties can be shown to be related (see Exercise 7.10) to the well-known restricted isometry property (RIP).
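To make the restricted properties concrete, consider the least squares objective $f(\boldsymbol{\theta}) = \frac{1}{2n}\|X\boldsymbol{\theta} - \mathbf{y}\|_2^2$, for which $f(\mathbf{v}) - f(\mathbf{u}) - \langle \nabla f(\mathbf{u}), \mathbf{v} - \mathbf{u} \rangle = \frac{1}{2}(\mathbf{v} - \mathbf{u})^\top H(\mathbf{v} - \mathbf{u})$ with $H = X^\top X/n$. When $n < p$, $H$ is singular and $f$ is not strongly convex over $\mathbb{R}^p$, yet over $s$-sparse vectors the difference $\mathbf{v} - \mathbf{u}$ is at most $2s$-sparse, so only $2s \times 2s$ principal submatrices of $H$ matter. The sketch below computes the resulting RSC/RSS constants by brute-force enumeration of supports, which is feasible only for tiny $p$. The random Gaussian design, the problem sizes, and the function name `restricted_constants` are assumptions made for illustration.

```python
import itertools
import numpy as np

def restricted_constants(X, k):
    """Smallest and largest eigenvalues of H = X^T X / n over all k x k
    principal submatrices, found by enumerating every support of size k."""
    n, p = X.shape
    H = X.T @ X / n
    alpha, beta = np.inf, 0.0
    for support in itertools.combinations(range(p), k):
        idx = list(support)
        eig = np.linalg.eigvalsh(H[np.ix_(idx, idx)])  # ascending order
        alpha = min(alpha, eig[0])
        beta = max(beta, eig[-1])
    return alpha, beta

rng = np.random.default_rng(0)
n, p, s = 8, 12, 2                      # fewer samples than dimensions
X = rng.standard_normal((n, p))

# Over all of R^p the Hessian is rank deficient, so alpha is (numerically) zero.
global_alpha = np.linalg.eigvalsh(X.T @ X / n)[0]

# Over s-sparse vectors, differences are 2s-sparse.
rsc, rss = restricted_constants(X, 2 * s)
print("global strong convexity constant:", global_alpha)
print("restricted constants over s-sparse vectors:", rsc, rss)
```

Enumerating supports takes time exponential in the sparsity level, so this is a didactic check rather than a practical procedure.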

#### 3.3 Generalized Projected Gradient Descent

We now present the generalized projected gradient descent algorithm (gPGD) for non-convex optimization problems. The procedure is outlined in Algorithm 2. The reader would find it remarkably similar to the PGD procedure in Algorithm 1. However, a crucial difference is in the projections made. Whereas PGD utilized convex projections, the gPGD procedure, if invoked with a non-convex constraint set $\mathcal{C}$, utilizes non-convex projections instead.

We will perform the convergence analysis for the gPGD algorithm assuming that the projection step in the algorithm is carried out exactly. As we saw in the preceding discussion, this can be accomplished efficiently for non-convex sets arising in several interesting problem settings. However, despite this, the convergence analysis will remain challenging due to the non-convexity of the problem.
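As a rough illustration of the procedure with an exact non-convex projection, the sketch below runs a gradient step followed by hard thresholding on a sparse least squares problem. It is a sketch under assumptions and not a transcription of Algorithm 2: the objective $f(\boldsymbol{\theta}) = \frac{1}{2n}\|X\boldsymbol{\theta} - \mathbf{y}\|_2^2$, the step length $1/\beta$ with $\beta$ the largest eigenvalue of $X^\top X/n$, the zero initialization, the synthetic data, and the function names are all choices made for this example.

```python
import numpy as np

def hard_threshold(z, s):
    """Exact projection onto s-sparse vectors: keep the s largest magnitudes."""
    out = np.zeros_like(z)
    top = np.argsort(-np.abs(z), kind="stable")[:s]
    out[top] = z[top]
    return out

def gpgd_sparse_least_squares(X, y, s, num_steps):
    """Gradient step followed by a (non-convex) sparse projection.

    Returns the final iterate and the objective value at every iterate.
    """
    n, p = X.shape
    beta = np.linalg.eigvalsh(X.T @ X / n)[-1]   # smoothness constant of f

    def objective(theta):
        return 0.5 * np.sum((X @ theta - y) ** 2) / n

    theta = np.zeros(p)
    history = [objective(theta)]
    for _ in range(num_steps):
        grad = X.T @ (X @ theta - y) / n
        z = theta - grad / beta                  # gradient step
        theta = hard_threshold(z, s)             # non-convex projection
        history.append(objective(theta))
    return theta, history

# Synthetic, noiseless instance (illustrative assumption).
rng = np.random.default_rng(0)
n, p, s = 200, 20, 3
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[[2, 7, 11]] = [3.0, -2.0, 1.5]
y = X @ theta_true

theta_hat, history = gpgd_sparse_least_squares(X, y, s, num_steps=100)
print("objective at start and end:", history[0], history[-1])
```

With the step length $1/\beta$, each projected point minimizes a quadratic upper bound on $f$ over the sparse set, so the objective value cannot increase from one iterate to the next. Whether the iterates reach the true sparse vector depends on conditioning properties of the kind discussed in § 3.2, which this sketch does not verify.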

Firstly, we will not be able to assume that the objective function we are working with is convex over the entire $\mathbb{R}^p$. Secondly, non-convex projections do not satisfy projection properties I or II. Finally, the first order optimality condition (Bubeck, 2015, Proposition 1.3) we used to prove Theorem 2.4.2 also fails to hold for non-convex constraint sets. Since the analyses for the PGD algorithm crucially used these results, we will have to find workarounds to all of them. We will denote the optimal function value as