In the recent years, much effort has been made to minimize strongly convex finite sums with first order information. Recent developments, combining both numerical efficiency and sound theoretical guarantees, such as linear convergence rates, include SVRG , SAG , SDCA  or SAGA  to solve the following problem:
where the functions correspond to a loss computed at a sample of the dataset, and is a (eventually non-smooth) penalization. However, theoretical guarantees about these algorithms, such as linear rates guaranteeing a numerical complexity to obtain a solution -distant to the minimum, require both strong convexity of and a gradient-Lipschitz property on each , namely for any , where stands for the Euclidean norm on and is the Lipschitz constant. However, some problems, such as the linear Poisson regression, which is of practical importance in statistical image reconstruction among others (see  for more than a hundred references) do not meet such a smoothness assumption. Indeed, we have in this example for where
are the features vectors andare the labels, and where the model-weights must satisfy the linear constraints for all .
Motivated by machine learning problems described in Section4 below, that do not satisfy the gradient-Lipschitz assumption, we consider a more specific task relying on a new smoothness assumption. Given convex functions with such that , a vector , features vectors corresponding to the rows of a matrix we consider the objective
where , is a -strongly convex function and is the open polytope
that we assume to be non-empty. Note that the linear term can be included in the regularization but the problem stands clearer if it is kept out. We say that a function is - smooth, where , if it is a differentiable and strictly monotone convex function that satisfies
for all . We detail this property and its specificities in Section 2. All along the chapter, we assume that the functions are - smooth. Note also that the Poisson regression objective fits in this setting, where is - smooth and . See Section 4.1 below for more details.
Standard first-order batch solvers (non stochastic) for composite convex objectives are ISTA and its accelerated version FISTA 
and first-order stochastic solvers are mostly built on the idea of Stochastic Gradient Descent (SGD). Recently, stochastic solvers based on a combination of SGD and the Monte-Carlo technique of variance reduction , , ,  turn out to be both very efficient numerically (each update has a complexity comparable to vanilla SGD) and very sound theoretically, because of strong linear convergence guarantees, that match or even improve the one of batch solvers. These algorithms involve gradient steps on the smooth part of the objective and theoretical guarantees justify such steps under the gradient-Lipschitz assumptions thanks to the descent lemma [7, Proposition A.24]. Without this assumption, such theoretical guarantees fall apart. Also, stochastic algorithms loose their numerical efficiency if their iterates are projected on the feasible set at each iteration as Equation (2) requires. STORC  can deal with constrained objectives without a full projection but is restricted to compact sets of constraints which is not the case of . Then, a modified proximal gradient method from  provides convergence bounds relying on self-concordance  rather than the gradient-Lipschitz property. However, the convergence rate is guaranteed only once the iterates are close to the optimal solution and we observed in practice that this algorithm is simply not working (since it ends up using very small step-sizes) on the problems considered here. Recently,  has provided new descent lemmas based on relative-smoothness that hold on a wider set of functions including Poisson regression losses. This work is an extension of  that presented the same algorithm and detailed its application to Poisson regression losses. While this is more generic than our work, they only manage to reach sublinear convergence rates that applies only on positive solution (namely ) while we reach linear rates for any solution .
The first difficulty with the objective (2) is to remain in the open polytope . To deal with simpler constraints we rather perform optimization on the dual problem
where for a function , the Fenchel conjugate is given by , and is the domain of the function . This strategy is the one used by Stochastic Dual Coordinate Ascent (SDCA) . The dual problem solutions are box-constrained to which is much easier to maintain than the open polytope . Note that as all are strictly decreasing (because they are strictly monotone on with ), their dual are defined on . By design, this approach keeps the dual constraints maintained all along the iterations and the following proposition, proved in Section 7.1, ensures that the primal iterate converges to a point of . Assume that the polytope is non-empty, the functions are convex, differentiable, with for and that is strongly convex. Then, strong duality holds, namely and the Karush-Kuhn-Tucker conditions relate the two optima as
where is the minimizer of and the maximizer of .
In this chapter, we introduce the smoothness property and its characteristics and then we derive linear convergence rates for SDCA without the gradient-Lipschitz assumption, by replacing it with smoothness, see Definition 3. Our results provide a state-of-the-art optimization technique for the considered problem (2), with sound theoretical guarantees (see Section 3) and very strong empirical properties as illustrated on experiments conducted with several datasets for Poisson regression and Hawkes processes likelihood (see Section 5). We study also SDCA with importance sampling  under smoothness and prove that it improves both theoretical guarantees and convergence rates observed in practical situations, see Sections 3.2 and 5
. We provide also a heuristic initialization technique in Section5.3 and a "mini-batch"  version of the algorithm in Section 5.4 that allows to end up with a particularly efficient solver for the considered problems. We motivate even further the problem considered in this chapter in Figure 1, where we consider a toy Poisson regression problem (with 2 features and 3 data points), for which L-BFGS-B typically fails while SDCA works. This illustrates the difficulty of the problem even on such an easy example.
We first introduce the smoothness property in Section 2, relate it to self-concordance in Proposition 2 and translate it in the Fenchel conjugate space in Proposition 2. Then, we present the shifted SDCA algorithm in Section 3 and state its convergence guarantees in Theorem 1 under the smoothness assumption. We also provide theoretical guarantees for variants of the algorithm, one using proximal operators  and the second using importance sampling  which leads to better convergence guarantees in Theorem 3.2. In Section 4 we focus on two specific problems, namely Poisson regression and Hawkes processes, and explain how they fit into the considered setting. Section 5 contains experiments that illustrate the benefits of our approach compared to baselines. This Section also proposes a very efficient heuristic initialization and numerical details allowing to optimize over several indices at each iteration, which is a trick to accelerate even further the algorithm.
2 A tighter smoothness assumption
To have a better overview of what smoothness is, we formulate the following proposition giving an equivalent property for smooth functions that are twice differentiable. Let be a convex strictly monotone and twice differentiable function. Then,
This proposition is proved in Section 7.4 and we easily derive from it that on , on and on are - smooth. This proposition is linked to the self-concordance property introduced by Nesterov  widely used to study losses involving logarithms. For the sake of clarity, the results will be presented for functions whose domain is a subset of as this leads to lighter notations. A convex function is standard self-concordant if
This property has been generalized [1, 38] but always consists in controlling the third order derivative by the second order derivative, initially to bound the second order Taylor approximation used in the Newton descent algorithm . While right hand sides of both properties ( and ) might look arbitrarily chosen, in fact they reflect the motivating use case of the logarithmic barriers where and for which the bound is reached. Hence, smoothness is the counterpart of self-concordance but to control the second order derivative with the first order derivative. As it is similarly built, smoothness shares the affine invariant property with self-concordance. It means that if is -smooth then with is also - smooth with the same constant . An extension to the multivariate case where is likely feasible but is useless for our algorithm and hence beyond the scope of this paper. From the smoothness property of a function , we derive several characteristics for its Fenchel conjugate starting with the following proposition. Let be a strictly monotone convex function and be its twice differentiable Fenchel conjugate. Then,
This proposition is proved in Section 7.5 and is the first step towards a series of convex inequalities for the Fenchel conjugate of smooth functions. These inequalities, detailed in Section 7.6, bounds the Bregman divergence of such functions and are compared to what can be obtained with self-concordance or strong convexity (on a restricted set) in the canonical case where . It appears with smoothness we obtain tighter bounds than what is achievable with other assumptions and that all these bounds are reached (and hence cannot be improved) in the canonical case (see Table 3).
3 The Shifted SDCA algorithm
The dual objective (4) cannot be written as a composite sum of convex functions as in the general objective (1), which is required for stochastic algorithms such as SRVG  or SAGA . It is better to use a coordinate-wise approach to optimize this problem, which leads to SDCA , in which the starting point has been shifted by . This shift is induced by the relation linking primal and dual variables at the optimum: the second Karush-Kuhn-Tucker condition from Proposition 4,
We first present the general algorithm (Algorithm 1), then its proximal alternative (Algorithm 2) and finally how importance sampling leads to better theoretical results. We assume that we know bounds such that for any , such bounds can be explicitly computed from the data in the particular cases considered in this chapter, see Section 4 for more details.
The next theorem provides a linear convergence rate for Algorithm 1 where we assume that each is - smooth (see Definition 3). Suppose that we known bounds such that for and assume that all are - smooth with differentiable Fenchel conjugates and is 1-strongly convex. Then, Algorithm 1 satisfies
The proof of Theorem 1 is given in Section 7.7. It states that in the considered setting, SDCA achieves a linear convergence rate for the dual objective. The bounds are provided in Section 4 below for two particular cases: Poisson regression and likelihood Hawkes processes. Equipped with these bounds, we can compare the rate obtained in Theorem 1 with already known linear rates for SDCA under the gradient-Lipschitz assumption . Indeed, we can restrict the domain of all to on which Proposition 2 states that all are -strongly convex. Now, following carefully the proof in  leads to the convergence rate given in Equation (6) but with
Since for any , Theorem 1 provides a faster convergence rate. The comparative gain depends on the values of and but it increases logarithmically with the value of . Table 1 below compares the explicit values of these linear rates on a dataset used in our experiments for Poisson regression.
Convergence rates for the primal objective are not provided since the primal iterate typically belongs to only when it is close enough to the optimum. This would make most of the values of the primal objective undefined and therefore not comparable to .
3.1 Proximal algorithm
where is a convex, prox capable and possibly non-differentiable function, we use the same technique as Prox-SDCA  with a proximal lower bound that leads to
see Section 7.2 for details. This leads to a proximal variant described in Algorithm 2 below, which is able to handle various regularization techniques and which has the same convergence guarantees as Algorithm 1 given in Theorem 1. Also, note that assuming that can be written as (8) with a prox-capable function is rather unrestrictive, since one can always add a ridge penalization term in the objective.
3.2 Importance sampling
Importance sampling consists in adapting the probabilities of choosing a sample(which is by default done uniformly at random, see Line 4 from Algorithm 1) using the improvement which is expected by sampling it. Consider a distribution on with probabilities such that for any and . The Shifted SDCA and Shifted Prox-SDCA with importance sampling algorithms are simply obtained by modifying the way is sampled in Line 4 of Algorithms 1 and 2: instead of sampling uniformly at random, we sample using such a distribution . The optimal sampling probability is obtained in the same way as  and it also leads under our smoothness assumption to a tighter convergence rate, as stated in Theorem 3.2 below. Suppose that we known bounds such that for and assume that all are - smooth with differentiable Fenchel conjugates and is 1-strongly convex. Consider defined by (7) and consider the distribution
where . The proof is given in Section 7.8. This convergence rate is stronger than the previous one from Theorem 1 since . Table 1 below compares the explicit values of these linear rates on a dataset used in our experiments for Poisson regression (facebook dataset). We observe that the smooth rate with importance sampling is orders of magnitude better than the one obtained with the standard theory for SDCA which exploits only the strong convexity of the functions .
|strongly convex||strongly convex with||smooth||smooth with|
|importance sampling||importance sampling|
4 Applications to Poisson regression and Hawkes processes
In this Section we describe two important models that fit into the setting of this chapter. We precisely formulate them as in Equation (2) and give the explicit value of bounds such as , where is the solution to the dual problem (4).
4.1 Linear Poisson regression
Poisson regression is widely used to model count data, namely when, in the dataset, each observation is associated an integer output for . It aims to find a vector such that for a given function ,
is the realization of a Poisson random variable of intensity. A convenient choice is to use for as it always guarantees that . However, using the exponential function assumes that the covariates have a multiplicative effect that often cannot be justified. The tougher problem of linear Poisson regression, where and is the polytope , appears to model additive effects. For example, this applies in image reconstruction. The original image is retrieved from photons counts
distributed as a Poisson distribution with intensity, that are received while observing the image with different detectors represented by the vectors . This application has been extensively studied in the literature, see [16, 4, 40] and  for a review with a hundred references. Linear Poisson regression is also used in various fields such as survival analysis with additive effects  and web-marketing  where the intensity corresponds to an intensity of clicks on banners in web-marketing. To formalize, we consider a training dataset with and and assume without loss of generality that for while for where (this simply means that we put first the samples corresponding to a label ). The negative log-likelihood of this model with a penalization function can be written as
where corresponds to the level of penalization, with the constraint that for . This corresponds to Equation (2) with for , which are - smooth functions, and with
Note that the zero labeled observations can be safely removed from the sum and are fully encompassed in . The algorithms and results proposed in Section 3 can therefore be applied for this model.
4.2 Hawkes processes
Hawkes processes are used to study cross causality that might occur in one or several events series. First, they were introduced to study earthquake propagation, the network across which the aftershocks propagate can be recovered given all tremors timestamps . Then, they have been used in high frequency finance to describe market reactions to different types of orders . In the recent years Hawkes processes have found many new applications including crime prediction  or social network information propagation . A Hawkes process  is a multivariate point-process: it models timestamps of nodes using a multivariate counting process with a particular auto-regressive structure in its intensity. More precisely, we say that a multivariate counting process where for is a Hawkes process if the intensity of has the following structure:
The are called baselines intensities, and correspond to the exogenous intensity of events from node , and the functions for are called kernels. They quantify the influence of past events from node on the intensity of events from node
. The main parametric model for the kernels is the so-calledexponential kernel, in which we consider
with . In this model the matrix is understood as an adjacency matrix, since entry quantifies the impact of the activity of node on the activity of node , while are memory parameters. We stack these parameters into a vector containing the baselines and the self and cross-excitation parameters . Note that in this model the memory parameters are supposed to be given. The associated goodness-of-fit is the negative log-likelihood, which is given by the general theory of point processes (see ) as
Let us define the following weights for and ,
that can be computed efficiently for exponential kernels thanks to recurrence formulas (the complexity is linear with respect to the number of events of each node). Using the parametrization of the kernels from Equation (9) we can rewrite each term of the negative log-likelihood as
To rewrite in a vectorial form we define as the number of events of node and the following vectors for :
that are the model weights involved in , and
which correspond to the vector involved in the linear part of the primal objective (2) and finally
for which contains all the timestamps data computed in the weights computed in Equation (10). With these notations the negative log-likelihood for node can be written as
First, it shows that the negative log-likelihood can be separated into independent sub-problems with goodness-of-fit that corresponds to the intensity of node with the weights carrying data from the events of the other nodes . Each subproblem is a particular case of the primal objective (2), where all the labels are equal to . As a consequence, we can use the algorithms and results from Section 3 to train penalized multivariate Hawkes processes very efficiently.
4.3 Closed form solution and bounds on dual variables
In this Section with provide the explicit solution to Line 5 of Algorithm 2 when the objective corresponds to the linear Poisson regression or the Hawkes process goodness-of-fit. In Proposition 4.3 below we provide the closed-form solution of the local maximization step corresponding to Line 5 of Algorithm 2.
This closed-form expression allows to derive a numerically very efficient training algorithm, as illustrated in Section 5 below. For these two use cases, the dual loss is given by for any (with for the Hawkes processes). For this specific dual loss, we can provide also upper bounds for all optimal dual variables , as stated in the next Proposition. For Poisson regression and Hawkes processes, if and if for all , we have the following upper bounds on the dual variables at the optimum:
for any . The proofs of Propositions 4.3 and 4.3 are provided in Section 7.9. Note that the inner product assumption from Proposition 4.3 is mild: it is always met for the Hawkes process with kernels given by (9) and it it met for Poisson regression whenever one applies for instance a min-max scaling on the features matrix.
To evaluate efficiently Shifted SDCA we have compared it with other optimization algorithms that can handle the primal problem (2) nicely, without the gradient-Lipschitz assumptions. We have discarded the modified proximal gradient method from  since most of the time it was diverging while computing the initial step with the Barzilai-Borwein method on the considered examples. We consider the following algorithms.
This is a first order batch algorithm that relies on relative-smoothness  instead of the gradient Lipschitz assumption. Its application to linear Poisson regression has been detailed in  and its analysis provides convergence guarantees with a sublinear convergence rate in . However, this method is by design limited to solutions with positive entries (namely ) and provides guarantees only in this case. Its theoretical step-size decreases linearly with and is too small in practice. Hence, we have tuned the step-size to get the best objective after 300 iterations.
This is a stochastic gradient descent algorithm with variance reduction introduced in [19, 41]. We used a variant introduced in , which uses Barzilai-Borwein in order to adapt the step-size, since gradient-Lipschitz constants are unavailable in the considered setting. We consider this version of variance reduction, since alternatives such as SAGA and SAG  do not propose variants with Barzilai-Borwein type of step-size selection.
. It relies on an estimation of the inverse of the Hessian based on gradients differences. This technique allows L-BFGS-B to consider the curvature information leading to faster convergence than other batch first order algorithms such as ISTA and FISTA.
This is the standard second-order Newton algorithm which computes at each iteration the hessian of the objective to solve a linear system with it. In our experiments, the considered objectives are both -smooth and self-concordant . The self-concordant property bounds the third order derivative by the second order derivative, giving explicit control of the second order Taylor expansion . This ensures supra-linear convergence guarantees and keeps all iterates in the open polytope (3) if the starting point is in it . However, the computational cost of the hessian inversion makes this algorithm scale very poorly with the number of dimensions (the size of the vectors ).
This is the Shifted-SDCA algorithm, see Algorithm 2, without importance sampling. Indeed, the bounds given in Proposition 4.3 are not tight enough to improve convergence when used for importance sampling in the practical situations considered in this Section (despite the fact that the rates are theoretically better). A similar behavior was observed in .
SVRG and L-BFGS-B are almost always diverging in these experiments just like in the simple example considered in Figure 1. Hence, the problems are tuned to avoid any violation of the open polytope constraint (3), and to output comparable results between algorithms. Namely, to ensure that for any iterate , we scale the vectors so that they contain only non-negative entries, and the iterates of SVRG and L-BFGS-B are projected onto . This highlights two first drawbacks of these algorithms: they cannot deal with a generic feature matrix and their solutions contain only non-negative coefficients. For each run, the simply take where . This simple choice seemed relevant for all the considered problems.
5.1 Poisson regression
For Poisson regression we have processed our feature matrices to obtain coefficients between 0 and 1. Numerical features are transformed with a min-max scaler and categorical features are one hot encoded. We run our experiments on six datasets found on UCI dataset repository and Kaggle111https://www.kaggle.com/datasets (see Table 2 for more details).
|dataset||wine 222  https://archive.ics.uci.edu/ml/datasets/wine+quality||facebook 333 https://archive.ics.uci.edu/ml/datasets/Facebook+metrics||vegas 444 https://archive.ics.uci.edu/ml/datasets/Las+Vegas+Strip||news 555 https://archive.ics.uci.edu/ml/datasets/online+news+popularity||property 666 https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction||simulated 777
The features have been generated with a folded normal distribution and the generating vectorhas 30 non zero coefficients sampled with a normal distribution.
These datasets are used to predict a number of interactions for a social post (news and facebook), the rating of a wine or a hotel (wine and vegas) or the number of hazards occurring in a property (property). The last one comes from simulated data which follows a Poisson regression. In Figure 2 we present the convergence speed of the five algorithms. As our algorithms follow quite different schemes, we measure this speed regarding to the computational time. In all runs, NoLips, SVRG and L-BFGS-B cannot reach the optimal solution as the problem minimizer contains negative values. This is illustrated in detail in Figure 3 for vegas dataset where it appears that all solvers obtain similar results for the positive values of but only Newton and SDCA algorithms are able to estimate the negatives values of . As expected, the Newton algorithm becomes very slow as the number of features increases. SDCA is the only first order solver that reaches the optimal solution. It combines the best of both world, the scalability of a first order solver and the ability to reach solutions with negative entries.
5.2 Hawkes processes
If the adjacency matrix is forced to be entrywise positive, then no event type can have an inhibitive effect on another. This ability to exhibit inhibitive effect has direct implications on real life datasets especially in finance where these effects are common [2, 3, 33]. In Figure 4 we present the aggregated influence of the kernels obtained after training a Hawkes process on a finance dataset exploring market microstructure . While L-BFGS-B (or SVRG, or NoLips) recovers only excitation in the adjacency matrix, SDCA also retrieves inhibition that one event type might have on another.
It is expected that when stocks are sold (resp. bought) the price is unlikely to go up (resp. down) but this is retrieved by SDCA only. On simulated data this is even clearer and in Figure 5 we observe the same behavior when the ground truth contains inhibitive effects. Our experiment consists in two simulated Hawkes processes with 10 nodes and sum-exponential kernels with 3 decays. There are only excitation effects - all are positive - in the first case and we allow inhibitive effects in the second. Events are simulated according to these kernels that we try to recover. While it would be standard to compare the performances in terms of log-likelihood obtained on the a test sample, nothing ensures that the problem optimizer lies in the feasible set of the test set. Hence the results are compared by looking at the estimation error (RMSE) of the adjacency matrix across iterations. Figure 5 shows that SDCA always converges faster towards its solution in both cases and that when the adjacency matrix contains inhibitive effects, SDCA obtains a better estimation error than L-BFGS-B.
5.3 Heuristic initialization
The default dual initialization in  () is not a feasible dual point. Instead of setting arbitrarily to , we design, from three properties, a vector that is linearly linked to and then rely on Proposition 5.3 to find a heuristic starting point from for Poisson regression and Hawkes processes.
Property 1: link with .
Proposition 5.3 relates exactly to the inverse of the norm of . For Poisson regression and Hawkes processes, the value of the dual optimum is linearly linked to the inverse of the norm of . Namely, if there is such that for any , then , the solution of the dual problem
satisfies for all . This Proposition is proved in Section 7.12. It suggests to consider for all .
Property 2: link with .
Property 3: link with the features matrix.
The inner product is positive and at the optimum, the Karush-Kuhn-Tucker Condition (5) (which links to through ) tells that is likely to be large if is poorly correlated to other features, i.e. if is small. Finally, the choice
for , takes these three properties into account.
Figure 6 plots the optimal dual variables from the Poisson regression experiments of Section 5.1 against the the vector from Equation 11. We observe in these experiments a good correlation between the two, but is only a good guess for initialization up to a multiplicative factor that the following proposition aims to find.
For Poisson regression and Hawkes processes and , if we constraint the dual solution to be collinear with a given vector , i.e. for some then the optimal value for is given by
Combined with the previous Properties, we suggest to consider
as an initial point, where is defined in Equation (11). This Proposition is proved in Section 7.13. Figure 7 presents the values of given its initial value for and shows that the rescaling has worked properly. We validate this heuristic initialization by showing that it leads to a much faster convergence in Figure 8 below. Indeed, we observe that SDCA initialized with Equation (12) reaches optimal objective much faster than when initialization consists in setting all dual variables arbitrarily to 1.
5.4 Using mini batches
At each step , SDCA  maximizes the dual objective by picking one index and maximizing the dual objective over the coordinate of the dual vector , and sets
In some cases this maximization has a closed-form solution, such as for Poisson regression (see Proposition 4.3) or least-squares regression where leads to the explicit update
In some other cases, such as logistic regression, this closed form solution cannot be exhibited and we must perform a Newton descent algorithm. Each Newton step consists in computingand , which are one dimensional operations given and . Hence, in a large dimensional setting, when the observations have many non zero entries, the main cost of the steps resides mostly in computing and . Since and must also be computed when using a closed-form solution, using Newton steps instead of the closed-form is eventually not much more computationally expensive. So, in order to obtain a better trade-off between Newton steps and inner-products computations, we can consider more than a single index on which we maximize the dual objective. This is called the mini-batch approach, see Stochastic Dual Newton Ascent (SDNA) . It consists in selecting a set of indices at each iteration . The value of becomes in this case