I. Introduction
Convex regression (CR) is concerned with fitting a convex function to a finite number of observations. In particular, suppose that we are given noisy observations of a convex function $f:\mathbb{R}^n \to \mathbb{R}$ as

(1) $\quad y_i = f(x_i) + \xi_i, \qquad i = 1, \dots, m,$

where the $\xi_i$'s are random variables, while $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$. The objective of CR is then to estimate the true function $f$, given the observations $(x_i, y_i)$, in a way in which the estimated function is convex.

CR is a particular class of shape-constrained regression problems, and since its first conception in the 50's, it has attracted much attention in various domains, such as statistics, economics, operations research, signal processing, and control [1, 2, 3]. In economics, CR has been motivated by the need to approximate consumers' utility functions from empirical data [4], a task which has recently been reconsidered in the context of personalized optimization with user's feedback [5].
In this paper, we study least squares estimators (LSEs) for CR. LSEs have some key advantages over many other estimators proposed in the literature for CR (e.g., constrained Gaussian processes [6], splines [7], or others [8]). First, LSEs are nonparametric, hence they do not require any tuning and avoid the issue of selecting an appropriate estimation structure. Second, LSEs can be computed by solving a convex quadratic program; therefore, at least in theory, they can be computed very efficiently using interior point methods. Third, being based on the least squares paradigm, they can be naturally extended to time-varying cases (when the function to be estimated changes continuously in time) by, e.g., exponential forgetting coefficients; these cases are becoming more and more important in the current data streaming era [9].
One of the main theoretical drawbacks, however, is that the class of functions that can be enforced is limited to general convex functions, while in many applications one would like to be able to impose at least smoothness and/or strong convexity.
In this paper, our contributions are as follows.
First, we propose a novel smooth strongly convex CR algorithm. The resulting estimator has at its heart a convex quadratically constrained quadratic program, with $m(n+1)$ variables and $m(m-1)$ constraints, where $m$ is the number of observations and $n$ the dimension of the domain. We also report on its computational complexity as a function of $m$. The building blocks for this novel algorithm are the recent results in smooth strongly convex interpolation [10, 11].
Second, we propose a decomposition scheme based on the alternating direction method of multipliers (ADMM) to lessen the computational complexity and make the method parallel. The resulting computational complexity per iteration is then $O(m^2)$, which is tight (i.e., no LSE can obtain a lower computational complexity, since the number of constraints is itself $O(m^2)$). The ADMM approach is based on a properly constructed constraint graph, as well as a dual formulation of the local subproblems.
The results presented in this paper generalize LSE to smooth strongly convex functions, and the nonsmooth results can be re-obtained as a special case.^1

^1 Notation. Vectors are indicated with $x \in \mathbb{R}^n$, and matrices with $A \in \mathbb{R}^{n \times m}$. The Euclidean norm is indicated with $\|\cdot\|_2$, the infinity norm with $\|\cdot\|_\infty$. $(\cdot)^{\mathsf{T}}$ is the transpose operator. Symmetric positive (semi)definite matrices of dimension $n$ are indicated as $\mathbb{S}^n_{+}$ ($\mathbb{S}^n_{++}$). For convex functions $f$, we indicate with $\partial f(x)$ their subgradient at point $x$ and with $f^*$ their convex conjugate. $O(\cdot)$ is the standard big-O notation.
II. Definitions and problem statement
We start by formally defining the functional class of interest. Given two parameters $\mu$ and $L$ satisfying $0 \le \mu < L \le +\infty$, we consider convex functions satisfying both a smoothness and a strong convexity condition. Given a convex function $f:\mathbb{R}^n \to \mathbb{R}$, we say that the function is $L$-smooth and $\mu$-strongly convex, which we denote with $f \in \mathcal{F}_{\mu,L}$, iff the following two conditions are satisfied:

(i) Inequality $f(x) - f(y) - \langle g_y, x - y\rangle \ge \frac{1}{2L}\|g_x - g_y\|_2^2$ holds for all $x, y \in \mathbb{R}^n$ and corresponding subgradients $g_x \in \partial f(x)$, $g_y \in \partial f(y)$;

(ii) Function $f(x) - \frac{\mu}{2}\|x\|_2^2$ is convex.
These definitions allow for $L$ to be equal to $+\infty$ (i.e., the nonsmooth case). In the case of a finite $L$, the first condition implies differentiability of the function. When $L = +\infty$, this condition becomes vacuous, and the function can be non-differentiable. The class of generic convex functions simply corresponds to $\mathcal{F}_{0,+\infty}$. The case $\mu = L$ can be discarded, as it only involves simple quadratic functions, which can be estimated parametrically much more efficiently.
The problem we are interested in solving can then be formalized by using the empirical norm as:

(2) $\quad \displaystyle \min_{f \in \mathcal{F}_{\mu,L}} \ \frac{1}{m} \sum_{i=1}^{m} \big(y_i - f(x_i)\big)^2.$
As we will see, the solution of (2) will consist of two parts: (i) the solution of a finite-dimensional optimization problem defined on the observation set, and (ii) an interpolating function which preserves the functional class on all the other points of the domain. In this respect, we define the notion of interpolation as follows.
Definition 1
The set $\{(x_i, g_i, f_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^n$, $g_i \in \mathbb{R}^n$, and $f_i \in \mathbb{R}$ for all $i$, is $\mathcal{F}_{\mu,L}$-interpolable iff there exists a function $f \in \mathcal{F}_{\mu,L}$ such that $f(x_i) = f_i$ and $g_i \in \partial f(x_i)$ for all $i$.
III. Shape-constrained least-squares
III-A. State of the art
The infinite-dimensional optimization problem in (2) can be reduced to a finite-dimensional one for the case $\mathcal{F}_{0,+\infty}$ as follows. Let $f_i = f(x_i)$ for $i = 1,\dots,m$ and define the vector $\mathbf{f} = [f_1, \dots, f_m]^{\mathsf{T}}$. Let $g_i \in \partial f(x_i)$, and define the vector $\mathbf{g} = [g_1^{\mathsf{T}}, \dots, g_m^{\mathsf{T}}]^{\mathsf{T}}$. We can now rewrite the optimization problem (2) for $\mathcal{F}_{0,+\infty}$ on the observation set as the following convex quadratic program (QP):
(3a) $\quad \displaystyle \min_{\mathbf{f}, \mathbf{g}} \ \sum_{i=1}^{m} (y_i - f_i)^2$

(3b) $\quad \text{subject to } \ f_j \ge f_i + \langle g_i, x_j - x_i \rangle, \quad \forall\, i, j = 1, \dots, m, \ i \ne j.$
Problem (3) is a convex QP with $m(n+1)$ variables and $m(m-1)$ constraints (after removing the trivial ones). Constraint (3b) imposes convexity on the observation points.
A solution of (3), which we label $(\mathbf{f}^\star, \mathbf{g}^\star)$, is interpolable by construction. In fact, once the solution is retrieved, an allowed estimator/interpolating function for function $f$ at any point $x$ is given by

(4) $\quad \hat f(x) = \displaystyle \max_{i=1,\dots,m} \ \big\{ f_i^\star + \langle g_i^\star, x - x_i \rangle \big\}.$
Two comments are in order at this point:
First, estimator (4) is nonsmooth and non-strongly convex in general. While ad-hoc smoothing techniques do exist [12], one incurs a daunting trade-off between smoothing quality (in terms of a low Lipschitz constant) and estimation quality (in terms of a small error w.r.t. the nonsmooth solution). Typically, if one wants to retrieve functions of the class $\mathcal{F}_{\mu,L}$ with a low $L$, one has to expect poor estimation quality.
Second, even though problem (3) is a convex QP and one can use off-the-shelf solvers to solve it efficiently (e.g., OSQP [13], or ECOS [14]), the number of shape constraints is still $m(m-1)$. In addition, its computational complexity grows rapidly with $m$, making it practically hard to solve problems with large $m$, even for small $n$. However, thanks to the decomposable structure of problem (3), one can resort to first-order methods [3, 12, 15], whose computational complexity scales as $O(m^2)$ per iteration, to tackle considerably larger problems, with the very recent [15] pushing the attainable sizes further. Note that a computational complexity of $O(m^2)$ is the least one can expect, given the $O(m^2)$ constraints. Finally, partitioning methods have also been advocated [16], but they are not explored here.
III-B. Smooth strongly convex regression
In this paper, we propose a new set of constraints (instead of (3b)), together with an interpolation procedure, to enforce smoothness and strong convexity automatically. We use and adapt results from [10, 11] for this purpose. The basic idea is that smoothness and strong convexity interchange via the procedure of convex conjugation. While we leave the technical details to the above-mentioned papers, we can cite the following interpolability result.
Theorem 1
(interpolability) [10, Theorem 4] The set $\{(x_i, g_i, f_i)\}_{i=1}^{m}$ is $\mathcal{F}_{\mu,L}$-interpolable iff the following set of conditions holds for every pair of indices $i, j = 1, \dots, m$:

(5) $\quad f_i - f_j - \langle g_j, x_i - x_j\rangle \ \ge \ \dfrac{1}{2(1 - \mu/L)}\Big(\dfrac{1}{L}\|g_i - g_j\|_2^2 + \mu\|x_i - x_j\|_2^2 - 2\dfrac{\mu}{L}\langle g_j - g_i, x_j - x_i\rangle\Big).$
With this in place, we are ready to modify the constraint set (3b) to accommodate $\mathcal{F}_{\mu,L}$. In particular, to solve (2) for a generic pair $(\mu, L)$, we can leverage the convex problem:
(6a) $\quad \displaystyle \min_{\mathbf{f}, \mathbf{g}} \ \sum_{i=1}^{m} (y_i - f_i)^2$

(6b) $\quad \text{subject to the conditions (5)}, \quad \forall\, i, j = 1, \dots, m, \ i \ne j.$
This is a convex quadratically constrained quadratic problem (QCQP), a special case of a second-order conic program [17]. Once a solution is found, the following theorem describes an allowed estimation/interpolation strategy.
Theorem 2
For any set $\{(x_i, g_i, f_i)\}_{i=1}^{m}$ that is $\mathcal{F}_{\mu,L}$-interpolable, an allowed interpolating function $\hat f \in \mathcal{F}_{\mu,L}$ is
(7) 
where
(8) 
and where $\mathrm{conv}(\cdot)$ indicates the convex hull.
Proof: See Appendix A.
Problem (6) together with the interpolation strategy (7) yields the promised smooth strongly convex estimator for function $f$. If we set $\mu = 0$ and $L = +\infty$, we retrieve the nonsmooth estimator. For $\mu = 0$ and $L < +\infty$, we obtain a non-strongly convex smooth estimator, while for $\mu > 0$ and $L = +\infty$ a nonsmooth strongly convex one.
Problem (6) is a QCQP in $m(n+1)$ variables and $m(m-1)$ constraints, which can be solved with off-the-shelf convex solvers, e.g., ECOS [14] or MOSEK. Since its computational complexity also grows rapidly with $m$, practical limitations similar to those of the nonsmooth estimator apply here. We will show next how to overcome them by resorting to ADMM.
III-C. Parallel implementation
Since the computational complexity of (6) could be prohibitive for practical applications, we now study how one can decompose the problem into smaller parts and reduce the overall complexity. Strategies to use first-order methods (proximal methods and ADMM) on the quadratic problem (3) for nonsmooth convex functions have been reported in [3, 12, 15]. In the case of (3), the problem is separable in both cost and constraints (once dualized), and a decomposition strategy is rather direct. For the case of (6), however, each constraint couples the variables of nodes $i$ and $j$ due to the quadratic term, and a decomposition is not immediate. We consider here a novel edge-based ADMM decomposition.
Consider the set of constraints of type (5) for all $i, j$ and define the constraint graph as a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, whose vertices are the nodes $\mathcal{V} = \{1, \dots, m\}$, and whose edges are all the ordered combinations of two distinct nodes: $\mathcal{E} = \{(i,j) : i, j \in \mathcal{V},\ i \ne j\}$. The cardinality of $\mathcal{E}$ is $m(m-1)$. For each edge $e \in \mathcal{E}$, consider its two nodes, say $i$ and $j$, and define the edge variables $f_i^e$, $f_j^e$, as well as $g_i^e$, $g_j^e$. In this context, each directed edge has its own functional and derivative variables.
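The directed constraint graph is simply the set of all ordered node pairs; a minimal sketch (the value of `m` below is illustrative):

```python
from itertools import permutations

# Vertices {0, ..., m-1}; one directed edge per ordered pair (i, j), i != j,
# so the edge set has cardinality m * (m - 1).
m = 5
V = range(m)
E = list(permutations(V, 2))
```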
We use the notation $i \in e$ to indicate that node $i$ is one of the two vertices of edge $e$, while we explicitly write $e = (i, j)$ to indicate that edge $e$ is the directed edge with $i$ as source node and $j$ as sink node. For ease of representation, we also define the local constraint,
(9) $\quad f_i^e - f_j^e - \langle g_j^e, x_i - x_j\rangle \ \ge \ \dfrac{1}{2(1 - \mu/L)}\Big(\dfrac{1}{L}\|g_i^e - g_j^e\|_2^2 + \mu\|x_i - x_j\|_2^2 - 2\dfrac{\mu}{L}\langle g_j^e - g_i^e, x_j - x_i\rangle\Big).$
We now split problem (6) at the edges, so that each variable pertaining to node $i$ is different for each edge having node $i$ as one of its vertices. Then we enforce equality of all the node variables via a supporting vector $z_i$ for each node $i \in \mathcal{V}$. With this philosophy, problem (6) can be rewritten in the equivalent form:
(10a) $\quad \displaystyle \min \ \sum_{e \in \mathcal{E}} \sum_{i \in e} \frac{1}{2(m-1)} (y_i - f_i^e)^2$

(10b) $\quad$ subject to the local constraint (9) for each edge $e \in \mathcal{E}$,

(10c) $\quad (f_i^e, g_i^e) = z_i, \quad \forall\, i \in e, \ e \in \mathcal{E}.$
It is key that constraint (9) is present only once for each directed edge.
The above problem can now be tackled with ADMM. We leave the derivations out, since they are standard, and report only the final result. Start with some initialization for the auxiliary variables $z_i$ and initialize the scaled dual variables of the constraints (10c) to zero. Set the penalization parameter $\rho > 0$. Then at each step,

1) Solve the edge QCQP for each edge $e \in \mathcal{E}$:

(11) 

2) Update the consensus variables $z_i$ for each node $i \in \mathcal{V}$:

(12) 

3) Update the scaled dual variables for each edge $e \in \mathcal{E}$ and node $i \in e$:

(13) 

Set the updated edge, consensus, and dual variables for all $e \in \mathcal{E}$ and $i \in \mathcal{V}$, and go to 1).
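The three steps above follow the standard scaled-form consensus ADMM pattern: local minimizations, an averaging (consensus) update, and a scaled dual update. As a runnable illustration of that pattern only, the toy below replaces the edge QCQP of step 1) with a scalar quadratic proximal step (the names, the value of $\rho$, and the iteration budget are our own assumptions):

```python
import numpy as np

def consensus_admm(a, rho=1.0, iters=200):
    """Scaled-form consensus ADMM for min_x sum_e 0.5 * (x - a_e)^2.

    Each entry of `a` plays the role of one local ("edge") objective; the
    minimizer is mean(a), to which the consensus variable z converges."""
    E = len(a)
    x = np.zeros(E)      # local (edge) copies of the variable
    z = 0.0              # consensus variable, cf. step 2)
    u = np.zeros(E)      # scaled dual variables, cf. step 3)
    for _ in range(iters):
        # step 1) analogue: argmin_x 0.5*(x - a_e)^2 + 0.5*rho*(x - z + u_e)^2
        x = (a + rho * (z - u)) / (1.0 + rho)
        # step 2) analogue: consensus (averaging) update
        z = float(np.mean(x + u))
        # step 3) analogue: scaled dual ascent on the consensus constraints
        u = u + x - z
    return z
```

For instance, `consensus_admm(np.array([1.0, 2.0, 6.0]))` converges to the average 3.0, the minimizer of the sum of the local objectives.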
Convergence of the above procedure to a minimizer of the original problem (6) is assured by standard ADMM convergence results [18]. In practice, one stops the ADMM iterations after a specified error criterion has been met, or a maximum number of iterations has been reached. In this case, we consider as approximate solution to our problem the final vector that ADMM yields (we let $\tilde{\mathbf{f}}$ be the near-optimal functional values and $\tilde{\mathbf{g}}$ be the near-optimal subgradient values), and we use the approximate interpolating function [see Th. 2]:
(14) 
(15) 
Note that function (14) is still a smooth strongly convex function by construction; only, the ADMM approximate solution is merely approximately interpolable, which means that the interpolation conditions hold up to a small error instead of with equality (so that function (14) is only approximately minimizing the LS metric). In practice, we notice that this approximation delivers good feasible results for small errors.
III-D. Solving the local dual
Solving (11) in its primal form is still too computationally complex (especially since it has to be done for each edge at each iteration of the ADMM algorithm). However, since there is only one scalar constraint, the dual is one-dimensional and easier to solve with a few iterations of a projected Newton's method. To do this, we write (11) in the standard form:
(16a) $\quad \displaystyle \min_{\chi} \ \frac{1}{2}\chi^{\mathsf{T}} Q \chi + p^{\mathsf{T}} \chi$

(16b) $\quad \text{subject to } \ \frac{1}{2}\chi^{\mathsf{T}} A \chi + b^{\mathsf{T}} \chi + c \le 0,$
where $\chi$ collects the edge variables, and $Q$, $A$, $p$, $b$, and $c$ are properly defined matrices, vectors, and scalars to match (16) to the original (11). Define the Lagrangian function
(17) $\quad \mathcal{L}(\chi, \lambda) = \frac{1}{2}\chi^{\mathsf{T}} Q \chi + p^{\mathsf{T}} \chi + \lambda \Big(\frac{1}{2}\chi^{\mathsf{T}} A \chi + b^{\mathsf{T}} \chi + c\Big),$
for the dual variable $\lambda \ge 0$. Note that $Q$ is positive definite by construction (due to the ADMM quadratic penalty with $\rho > 0$), and therefore, setting $H(\lambda) = Q + \lambda A$, the inverse $H(\lambda)^{-1}$ exists and is well-defined for any $\lambda \ge 0$. Set also $\chi(\lambda) = -H(\lambda)^{-1}(p + \lambda b)$. The dual problem of (16) is:
(18) $\quad \displaystyle \max_{\lambda \ge 0} \ q(\lambda) := \mathcal{L}(\chi(\lambda), \lambda),$
whose solution can be found with a standard projected Newton's method.^2

^2 For the sake of completeness, the gradient of the dual function is
(19) $\quad q'(\lambda) = \frac{1}{2}\chi(\lambda)^{\mathsf{T}} A\, \chi(\lambda) + b^{\mathsf{T}} \chi(\lambda) + c.$
Putting things together, the ADMM procedure with a projected Newton's method to solve the one-dimensional local dual problems (18) has a per-iteration computational complexity of $O(m^2)$ in the number of observations (with an additional polynomial dependence on $n$ coming from the computation of the inverse $H(\lambda)^{-1}$).
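A sketch of the projected Newton strategy on the one-dimensional dual (18), for a generic standard-form instance (16): the names $Q, p, A, b, c$ mirror (16), while the stopping logic and the iteration budget are our own choices. The dual gradient is the constraint value at the inner minimizer $\chi(\lambda)$, as in (19), and the dual second derivative is available in closed form.

```python
import numpy as np

def qcqp_dual_newton(Q, p, A, b, c, lam0=0.0, iters=50):
    """Projected Newton on the 1-D dual of
        min 0.5 x'Qx + p'x   s.t.   0.5 x'Ax + b'x + c <= 0,
    with Q positive definite and A positive semidefinite."""
    lam = max(0.0, lam0)
    x = np.linalg.solve(Q, -p)
    for _ in range(iters):
        H = Q + lam * A
        x = np.linalg.solve(H, -(p + lam * b))   # chi(lam): inner minimizer
        grad = 0.5 * x @ A @ x + b @ x + c       # q'(lam): constraint value, cf. (19)
        if lam == 0.0 and grad <= 0.0:
            break                                # constraint inactive: lam* = 0
        r = A @ x + b
        hess = -r @ np.linalg.solve(H, r)        # q''(lam) <= 0: the dual is concave
        if hess >= -1e-14:
            break
        lam = max(0.0, lam - grad / hess)        # projected Newton step
    return x, lam
```

For a ball constraint $\|x\|_2 \le 1$ (i.e., $A = I$, $b = 0$, $c = -1/2$) with the unconstrained minimizer outside the ball, the method recovers the boundary solution together with the active multiplier.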
IV. Numerical evaluation
We evaluate the presented algorithms in terms of computational time and mean square error on a toy problem. The aim is to evaluate scalability and quality in a simple one-dimensional setting. A more detailed numerical investigation is deferred to future research efforts. All the computations are performed using Python 3.6, on a 2.7 GHz Dual-Core Intel Core i5 laptop with 8 GB of RAM. We use CVXPY [19] for solving the nonsmooth and smooth problems. Internally, the QPs are solved with OSQP, while the QCQPs are solved with ECOS.
The true generating function, defined over a compact interval, is smooth and strongly convex. The observations are generated by adding zero-mean random noise.
We run (i) the nonsmooth problem with QP (3) and estimator (4); (ii) the smooth strongly convex problem with QCQP (6) and estimator (7); (iii) the edge-based ADMM to solve the smooth problem with local problems (11) and approximate estimator (14), with a fixed $\rho$ and an initialization specified by running a Gaussian process estimator [20], which offers a computationally cheap smooth (but in general nonconvex) first approximation. For the Gaussian process we use a squared exponential kernel. For the ADMM we solve the local problems by resorting to their duals and employing a projected Newton's method with backtracking. The stopping criterion for the ADMM has been selected as
(20) 
In Figure 1, we report the three methods (nonsmooth, smooth, and ADMM) for the task of estimating the given function with an increasing number of observations $m$. The observation noise standard deviation and the parameters $\mu$ and $L$ are kept fixed. We display the estimated function with a continuous blue line, while the true function is a dashed orange line. Grey dots represent the observations. As we notice, the nonsmooth and smooth estimators achieve the desired estimation task with higher accuracy the more observations are present. ADMM also does well (even with a moderate stopping error).

We run simulations for different $m$'s, considering two versions of ADMM with two different stopping tolerances (all the other parameters have been left the same as the ones in Figure 1). All the results are averaged over multiple realizations of the observations for the centralized problems, and over more realizations for the parallel methods (which depend more on the specific realization through the number of iterations).
Figure 2 reports the computational times of the various methods. As we see, the smooth method is the slowest one (as $m$ increases), as expected. The ADMM methods run slower than the other methods for small $m$, due to the inaccurate initialization (and therefore the need for more iterations to reach the specified stopping criterion), while as $m$ increases, they run the fastest (since the initialization becomes better and better).
Figure 3 represents an error metric defined as
(21) 
This metric is defined over a finer equispaced sampling of the domain, with many more points than observations. The idea is to capture the error in the estimator $\hat f$ over the whole domain, rather than just the mean square error on the observation points. As we see, and as expected, the smooth estimator is better than the nonsmooth one. ADMM delivers estimates of accuracy comparable to the smooth one. In the figure, we also display a reference line with the typical convergence rate of nonparametric LSEs [21, 22, 23, 12], in the mean-square-error-on-the-observation-points sense.
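As a sketch, an error metric of this kind can be computed as a root-mean-square discrepancy on a fine grid (the RMS form, the grid size, and the function names here are illustrative assumptions, intended only to convey the construction):

```python
import numpy as np

def grid_error(f_hat, f_true, lo, hi, M=2000):
    """RMS discrepancy between an estimator and the true function,
    evaluated on a fine equispaced grid of M points over [lo, hi]."""
    t = np.linspace(lo, hi, M)
    vals = np.array([f_hat(s) for s in t])   # estimator evaluated pointwise
    return float(np.sqrt(np.mean((vals - f_true(t)) ** 2)))
```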
What is interesting here is that the metric is built on the estimated function, which for ADMM is the approximate (14). Because of this, the ADMM estimator can be better than the smooth one (especially in low-observation settings). This is per se quite interesting and will be further explored in the future.
V. Conclusion
We have proposed a method to perform smooth strongly convex regression in a nonparametric least squares sense. The method relies on the solution of a convex quadratically constrained quadratic program. We have discussed its computational complexity and offered a first-order alternative based on ADMM to lessen the computational load to a tight $O(m^2)$ per iteration for $m$ noisy observations of the true function.
Appendix
A. Proof of Theorem 2
The theorem follows from the discussion in [10, Remark 2] with some extra computations. For the sake of completeness, we report it here. In particular, take any set $\{(x_i, g_i, f_i)\}_{i=1}^{m}$. If this set is interpolable, then an allowed interpolating function is [10, Remark 2]:
with
(22) 
As in [10, Remark 2], by convex conjugation and curvature addition, an interpolating function in $\mathcal{F}_{\mu,L}$ is:
(23) 
Computing the convex conjugate can be done in the standard way,
(24) 
For the inner optimization problem (the sup), by first-order optimality conditions,
(25) 
which, substituted back into (24), yields
(26) 
From [10, Theorem 4(c)] it follows that
Operating the substitutions in (26) and renaming the resulting quantities, the thesis follows.
References
 [1] R. J. Samworth and B. Sen, “Editorial: Nonparametric Inference Under Shape Constraints,” Statistical Science, vol. 33, no. 4, 2018.
 [2] P. Groeneboom, G. Jongbloed, and J. A. Wellner, “Estimation of a Convex Function: Characterizations and Asymptotic Theory,” The Annals of Statistics, vol. 29, no. 6, pp. 1653 – 1698, 2001.
 [3] N. S. Aybat and Z. Wang, “A Parallel Method for Large Scale Convex Regression Problems,” in Proceedings of the IEEE Conference on Decision and Control, 2014.
 [4] R. F. Meyer and J. W. Pratt, “The Consistent Assessment and Fairing of Preference Functions,” IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 3, pp. 270 – 278, 1968.
 [5] A. Simonetto, E. Dall’Anese, J. Monteil, and A. Bernstein, “Personalized Optimization with User’s Feedback,” arXiv: 1905.00775, 2019.
 [6] X. Wang and J. O. Berger, “Estimating Shape Constrained Functions Using Gaussian Processes,” SIAM/ASA Journal on Uncertainty Quantification, vol. 4, no. 1, pp. 1 – 25, 2016.
 [7] A. L. Dontchev, H. Qi, and L. Qi, “Quadratic Convergence of Newton’s Method for Convex Interpolation and Smoothing,” Constructive Approximation, vol. 19, pp. 123 – 143, 2003.
 [8] M. Birke and H. Dette, “Estimating a Convex Function in Nonparametric Regression,” Scandinavian Journal of Statistics, vol. 34, no. 2, pp. 384 – 404, 2007.
 [9] E. Dall’Anese, A. Simonetto, S. Becker, and L. Madden, “Optimization and Learning with Information Streams: Timevarying Algorithms and Applications,” Signal Processing Magazine (to appear), May 2020.
 [10] A. Taylor, J. Hendrickx, and F. Glineur, “Smooth Strongly Convex Interpolation and Exact Worstcase Performance of Firstorder Methods,” Mathematical Programming, vol. 161, no. 1, pp. 307 – 345, 2017.
 [11] A. Taylor, “Convex Interpolation and Performance Estimation of Firstorder Methods for Convex Optimization,” Ph.D. dissertation, Université catholique de Louvain, Belgium, January 2017.
 [12] R. Mazumder, A. Choudhury, G. Iyengar, and B. Sen, “A Computational Framework for Multivariate Convex Regression and Its Variants,” Journal of the American Statistical Association, vol. 114, no. 525, pp. 318–331, 2019.
 [13] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd, “OSQP: An operator splitting solver for quadratic programs,” arXiv: 1711.08013, 2017.
 [14] A. Domahidi, E. Chu, and S. Boyd, “ECOS: An SOCP Solver for Embedded Systems,” in Proceedings of the ECC, 2013.
 [15] M. Lin, D. Sun, and K.C. Toh, “Efficient algorithms for multivariate shapeconstrained convex regression problems,” arXiv: 2002.11410, 2020.
 [16] L. A. Hannah and D. B. Dunson, “Multivariate Convex Regression with Adaptive Partitioning,” JMLR, vol. 14, pp. 3261 – 3294, 2013.
 [17] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

 [18] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1 – 122, 2011.
 [19] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” JMLR, vol. 17, no. 83, 2016.
 [20] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, US: The MIT Press, 2006.
 [21] E. Seijo and B. Sen, “Nonparametric Least Squares Estimation of a Multivariate Convex Regression Function,” The Annals of Statistics, vol. 39, no. 3, pp. 1633 – 1657, 2011.
 [22] E. Lim and P. W. Glynn, “Consistency of Multidimensional Convex Regression,” Operation Research, vol. 60, no. 1, pp. 196 – 208, 2012.
 [23] J. Blanchet, P. W. Glynn, J. Yan, and Z. Zhou, “Multivariate Distributionally Robust Convex Regression under Absolute Error Loss,” in Proceedings of NeurIPS, 2019.