# Sparse Unit-Sum Regression

This paper considers sparsity in linear regression under the restriction that the regression weights sum to one. We propose an approach that combines ℓ_0- and ℓ_1-regularization. We compute its solution by adapting a recent methodological innovation made by Bertsimas et al. (2016) for ℓ_0-regularization in standard linear regression. In a simulation experiment we compare our approach to ℓ_0-regularization and ℓ_1-regularization and find that it performs favorably in terms of predictive performance and sparsity. In an application to index tracking we show that our approach can obtain substantially sparser portfolios compared to ℓ_1-regularization while maintaining a similar tracking performance.


## 1 Introduction

Linear regression with coefficients that sum to one (henceforth unit-sum regression) is used in portfolio optimization and other economic applications such as forecast combinations (Timmermann, 2006) and synthetic control (Abadie et al., 2010).

In this paper, we focus on obtaining a sparse solution (i.e. one containing few non-zero elements) to the unit-sum regression problem. A sparse solution may be desirable for a variety of reasons, such as making a model more interpretable, improving estimation efficiency if the underlying parameter vector is known to be sparse, remedying identification issues if the number of variables exceeds the number of observations, or application-specific reasons such as reducing cost by limiting the number of constituents in a portfolio.

A popular method to produce sparsity is to use regularization. Theoretically, the most straightforward way to obtain a sparse solution is to use ℓ_0-regularization (also known as best-subset selection), which amounts to restricting the number of non-zero elements in the solution. However, ℓ_0-regularized regression is NP-hard (Coleman et al., 2006; Natarajan, 1995) and has traditionally been seen as computationally infeasible for problems with more than about 40 variables, both in unit-sum regression and in standard linear regression.

Due to these computational difficulties, ℓ_0-regularization has often been replaced by ℓ_1-regularization, also known as the Lasso (Tibshirani, 1996). In ℓ_1-regularization, the ℓ_0-norm restriction that limits the number of non-zero elements is replaced by an ℓ_1-norm restriction that limits the absolute size of the coefficients. This turns the problem into a convex optimization problem that is easier to solve. An ℓ_1-norm restriction shrinks the weights towards zero and, as a consequence of this shrinkage, produces sparsity by setting some weights exactly equal to zero.

The use of ℓ_1-regularization in the presence of a unit-sum restriction was first considered by DeMiguel et al. (2009) and Brodie et al. (2009) in the context of portfolio optimization. They show that ℓ_1-regularization is able to produce sparsity in combination with a unit-sum restriction. In addition, they demonstrate that the combination can be viewed as a restriction on the sum of the negative weights. In some applications it is highly desirable to have a parameter that explicitly controls the sum of the negative weights. For example, in a portfolio optimization context negative weights represent potentially costly short positions.

However, the unit-sum restriction causes a problem when using ℓ_1-regularization: due to the unit-sum restriction the ℓ_1-norm of the weights cannot be smaller than 1. This imposes a lower bound on the amount of shrinkage produced by ℓ_1-regularization. In turn, this places an upper bound on the sparsity produced by ℓ_1-regularization. This upper bound depends entirely on the data, which makes it difficult to rely on ℓ_1-regularization if a specific level of sparsity is desired. In addition, due to the bound there does not always exist a value of the tuning parameter that guarantees the existence of a unique solution. Furthermore, Fastrich et al. (2014) point out that a combination of a non-negativity restriction and a unit-sum restriction fixes the ℓ_1-norm of the weights to 1, which renders ℓ_1-regularization useless.
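The lower bound follows from the triangle inequality: if the weights sum to one, then ‖β‖_1 ⩾ |∑_i β_i| = 1, with equality exactly when all weights are non-negative. A minimal numerical check (the function name is ours):

```python
import numpy as np

def l1_norm(beta):
    return np.abs(beta).sum()

# Any weight vector that sums to one has an l1-norm of at least one.
rng = np.random.default_rng(0)
for _ in range(1000):
    beta = rng.normal(size=5)
    beta = beta / beta.sum()        # rescale so the weights sum to one
    assert l1_norm(beta) >= 1.0 - 1e-12

# Equality holds exactly when all weights are non-negative ("long-only").
long_only = np.array([0.2, 0.3, 0.5])
assert np.isclose(l1_norm(long_only), 1.0)
```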

In order to address these issues and obtain sparse solutions in unit-sum regression, we use a recent innovation in ℓ_0-regularization in the standard linear regression setting by Bertsimas et al. (2016). They show that modern Mixed-Integer Optimization (MIO) solvers can find a provably optimal solution to ℓ_0-regularized regression for problems of practical size. To achieve this, the solver is provided with a good initial solution obtained from a discrete first-order (DFO) algorithm. In a simulation study, they show that ℓ_0-regularization performs favorably compared to ℓ_1-regularization in terms of predictive performance and sparsity.

An extended simulation study comparing ℓ_0- and ℓ_1-regularization in the standard linear regression setting is performed by Hastie et al. (2017). They find that ℓ_0-regularization outperforms ℓ_1-regularization if the signal-to-noise ratio (SNR) is high, while ℓ_1-regularization performs better if the SNR is low. Additionally, they find that if the tuning parameters are selected to optimize predictive performance, ℓ_0-regularization yields substantially sparser solutions.

A combination of ℓ_0- and ℓ_1-regularization (ℓ_0ℓ_1-regularization) is studied in the standard linear regression context by Mazumder et al. (2017). They observe that this combination yields a predictive performance similar to ℓ_1-regularization if the SNR is low, and a predictive performance similar to ℓ_0-regularization if the SNR is high. In addition, they find that ℓ_0ℓ_1-regularization produces more sparsity compared to ℓ_1-regularization, if the tuning parameters are selected in order to optimize predictive performance.

Motivated by the results in the standard linear regression setting, we propose the use of ℓ_0ℓ_1-regularization in unit-sum regression. Specifically, let y be an n-vector and let X be an n × m matrix; then we consider the problem

 min_β ‖y − Xβ‖₂²,  s.t.  ∑_{i=1}^m β_i = 1,  ‖β‖₀ ⩽ k,  ‖β‖₁ ⩽ 1 + 2s,   (1)

where β_i are the elements of β, ‖β‖₀ is the ℓ_0-norm of β (the number of its non-zero elements), ‖β‖₁ is the ℓ_1-norm of β, and s ⩾ 0. Notice that this problem is equivalent to ℓ_1-regularized unit-sum regression if k is sufficiently large, and equivalent to ℓ_0-regularized unit-sum regression if s is sufficiently large.
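As a concrete illustration, membership of the constraint set of (1) can be checked numerically; the following sketch (the function name and tolerance are ours) encodes the three restrictions:

```python
import numpy as np

def in_T(beta, k, s, tol=1e-9):
    """Check the constraint set of problem (1): weights sum to one,
    at most k non-zero elements, and l1-norm at most 1 + 2s."""
    unit_sum = abs(beta.sum() - 1.0) <= tol
    sparse = np.count_nonzero(np.abs(beta) > tol) <= k
    l1_ok = np.abs(beta).sum() <= 1.0 + 2.0 * s + tol
    return unit_sum and sparse and l1_ok

beta = np.array([0.7, 0.5, -0.2, 0.0])
assert in_T(beta, k=3, s=0.2)       # 3 non-zeros, l1-norm 1.4 = 1 + 2(0.2)
assert not in_T(beta, k=2, s=0.2)   # violates the l0 restriction
assert not in_T(beta, k=3, s=0.1)   # violates the l1 restriction
```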

The formulation in (1) provides users with explicit control over both the sparsity of the solution (through k) and the sum of the negative weights of the solution (through s). In addition, if the tuning parameters are selected in order to maximize predictive performance, we find in a simulation experiment that ℓ_0ℓ_1-regularization:

• performs better than ℓ_0-regularization in terms of predictive performance, especially if the signal-to-noise ratio is low.

• performs well compared to ℓ_1-regularization in terms of predictive performance, especially for higher signal-to-noise ratios, while at the same time producing much sparser solutions.

The main contributions of this paper can be summarized as follows. [1] We propose ℓ_0ℓ_1-regularization for the unit-sum regression problem. [2] We analyze the problem for orthogonal design matrices and provide an algorithm to compute its solution. [3] We show how the algorithm for the orthogonal design case can be used in finding a solution to the general problem by extending the framework of Bertsimas et al. (2016) to unit-sum regression. [4] We perform a simulation experiment which shows that our approach performs favorably compared to ℓ_0-regularization and ℓ_1-regularization. [5] We demonstrate in an application to stock index tracking that ℓ_0ℓ_1-regularization is able to find substantially sparser portfolios than ℓ_1-regularization, while maintaining a similar out-of-sample tracking error.

The remainder of the paper is structured as follows. In Section 2, problem (1) is studied under the assumption that X is orthogonal, and an algorithm for the orthogonal case is presented. Section 3 analyzes the sparsity production for the orthogonal case and provides some intuition about the problem. Section 4 links the algorithm for the orthogonal case to the framework of Bertsimas et al. (2016) in order to find a solution to the general problem. In Section 5, the simulation experiments are presented. Section 6 provides an application to index tracking.

## 2 Orthogonal Design

As problem (1) is difficult to study in its full generality, we first consider the special case that X is orthogonal. We derive properties of a solution to (1) under orthogonality and use these properties to construct an algorithm that finds a solution. The algorithm is presented at the end of the section. In Section 4 this algorithm is used to find a solution to the general problem by extending the framework of Bertsimas et al. (2016). In Section 3 we analyze the sparsity of the solution under orthogonality.

Assume that X′X = I_m, where I_m is the m × m identity matrix. Let us write η := X′y, so that minimizing ‖y − Xβ‖₂² in β is equivalent to minimizing Q(β) := ‖β − η‖₂². Define

 T := {β ∈ ℝ^m | ∑_{i=1}^m β_i = 1, ‖β‖₀ ⩽ k, ‖β‖₁ ⩽ 1 + 2s}.   (2)

Then, problem (1) can be written as min_{β∈T} Q(β).
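To see the reduction numerically: when the columns of X are orthonormal, the least-squares loss differs from ‖β − η‖₂² only by a constant that does not depend on β. A small sketch (we generate X with orthonormal columns via a QR decomposition; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 5
X, _ = np.linalg.qr(rng.normal(size=(n, m)))   # orthonormal columns: X'X = I_m
y = rng.normal(size=n)
eta = X.T @ y

for _ in range(100):
    beta = rng.normal(size=m)
    lhs = np.sum((y - X @ beta) ** 2)
    const = np.sum(y ** 2) - np.sum(eta ** 2)  # independent of beta
    assert np.isclose(lhs, const + np.sum((beta - eta) ** 2))
```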

We assume that the elements of η are distinct and that s > 0. Without further loss of generality we assume η₁ > η₂ > ⋯ > η_m. In Section 2.4 we relax the assumption that s > 0 and allow for s = 0.

Let

 A_z := {β ∈ ℝ^m | ∑_{i=1}^m β_i = 1, ‖β‖₀ ⩽ k, ‖β‖₁ = 1 + 2z},

where 0 ⩽ z ⩽ s, so that T = ⋃_{0⩽z⩽s} A_z. If β̂ ∈ argmin_{β∈T} Q(β), then β̂ ∈ A_z for some z ∈ [0, s]. Let us denote this value of z with ẑ. We will now show that β̂ can be computed from the signs of its elements and ẑ. In order to show this, we first solve a related problem and then show that β̂ is equal to the solution of a specific case of this related problem.

Let P and N be disjoint subsets of {1, …, m} with cardinalities p ⩾ 1 and n ⩾ 0, respectively, where p + n ⩽ k. Define

 B(P, N, z) := {β ∈ ℝ^m | ∑_{i∈P} β_i = 1 + z,  ∑_{i∈N} β_i = −z, and β_i = 0 if i ∈ (P ∪ N)^C}.

Minimization of Q(β) over the affinely restricted set B(P, N, z) has the solution

 β_i(P, N, z) = η_i − (∑_{j∈P} η_j − 1 − z)/p if i ∈ P,  β_i(P, N, z) = η_i − (∑_{j∈N} η_j + z)/n if i ∈ N,  and β_i(P, N, z) = 0 otherwise.   (3)

Recall that β̂ ∈ A_ẑ and let P̂ := {i | β̂_i > 0} and N̂ := {i | β̂_i < 0}. Furthermore, let C be the set of vectors in ℝ^m whose elements have the same signs as the elements of β̂; then β̂ ∈ A_ẑ ∩ C. Notice that the difference between A_ẑ ∩ C and B(P̂, N̂, ẑ) is that B(P̂, N̂, ẑ) places no sign restrictions on the elements β_i for which i ∈ P̂ ∪ N̂. Consequently,

 Q(β̂) = min_{β ∈ A_ẑ ∩ C} Q(β) ⩾ min_{β ∈ B(P̂, N̂, ẑ)} Q(β) = Q(β(P̂, N̂, ẑ)).

However, if β̂ ≠ β(P̂, N̂, ẑ), then β_λ := (1 − λ)β̂ + λβ(P̂, N̂, ẑ) ∈ A_ẑ ∩ C for sufficiently small λ > 0. Furthermore, as Q(β_λ) is a parabola in λ with a minimum at λ = 1, we find that Q(β_λ) < Q(β̂) for small λ > 0. As β_λ ∈ A_ẑ ∩ C, this is a contradiction. Hence, β̂ = β(P̂, N̂, ẑ), which is our first result.

###### Proposition 1.

If β̂ ∈ argmin_{β∈T} Q(β), then β̂ = β(P̂, N̂, ẑ).

So, the problem can be decomposed into finding the components of the triplet (P, N, z) that minimizes Q(β(P, N, z)). Next, we will study the properties of these components.
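For sets of the form P = {1, …, p} and N = {m − n + 1, …, m}, the candidate solution in (3) is simple to compute; a minimal sketch (the function name is ours) that also checks the affine restrictions of B(P, N, z):

```python
import numpy as np

def beta_pnz(eta, p, n, z):
    """Candidate solution beta(P, N, z) from equation (3), with
    P the indices of the p largest and N the indices of the n smallest
    elements of eta (eta is assumed sorted in decreasing order)."""
    m = len(eta)
    beta = np.zeros(m)
    beta[:p] = eta[:p] - (eta[:p].sum() - 1.0 - z) / p
    if n >= 1:
        beta[m - n:] = eta[m - n:] - (eta[m - n:].sum() + z) / n
    return beta

eta = np.array([1.2, 0.8, 0.3, -0.1, -0.6])   # sorted in decreasing order
b = beta_pnz(eta, p=2, n=1, z=0.25)
assert np.isclose(b[:2].sum(), 1.25)    # block on P sums to 1 + z
assert np.isclose(b[-1:].sum(), -0.25)  # block on N sums to -z
assert np.isclose(b.sum(), 1.0)         # unit-sum restriction holds
```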

### 2.1 Properties of Q(β(P,N,z)) as a function of P and N

The sorting of η reveals an ordered structure in the sets P and N that minimize Q(β(P, N, z)). This structure is described in the following result.

###### Proposition 2.

If (P̂, N̂, ẑ) minimizes Q(β(P, N, z)), then i ∈ P̂ if i ⩽ p̂, and i ∈ N̂ if i ⩾ m − n̂ + 1, where p̂ = |P̂| and n̂ = |N̂|.

The proof is given in the Appendix. For sets of the form P = {1, …, p} and N = {m − n + 1, …, m}, we use the notation β(p, n, z), as in (3). The following result shows that p and n should be maximized subject to p + n ⩽ k.

###### Lemma 1.

If β(p, n, z) ∈ A_z and β(p′, n′, z) ∈ A_z, where p ⩽ p′, n ⩽ n′, and p′ + n′ ⩽ k, then Q(β(p′, n′, z)) ⩽ Q(β(p, n, z)).

The proof is given in the Appendix.

We will now consider the relationship between z and the pair (p, n). With reference to (3), let us consider the sets

 P_z := {q | η_q − (∑_{i=1}^q η_i − 1 − z)/q > 0},   (4)
 N_z := {q | η_{m−q+1} − (∑_{i=1}^q η_{m−i+1} + z)/q < 0},   (5)

with cardinalities p_z := |P_z| and n_z := |N_z|. As

 η_q − (∑_{i=1}^q η_i − 1 − z)/q = ((q − 1)/q)(η_q − (∑_{i=1}^{q−1} η_i − 1 − z)/(q − 1)) < η_{q−1} − (∑_{i=1}^{q−1} η_i − 1 − z)/(q − 1),

we find P_z = {1, …, p_z}, and similarly N_z = {m − n_z + 1, …, m} if n_z ⩾ 1 and N_z = ∅ if n_z = 0. Additionally, we find that p_z is increasing in z, and similarly that n_z is increasing in z. So, by Lemma 1 we have the following result for the case p_z + n_z ⩽ k.
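The cardinalities p_z and n_z can be computed directly from (4) and (5); the sketch below (function names are ours) counts the leading indices that satisfy the defining inequalities and illustrates that both counts grow with z:

```python
import numpy as np

def card_Pz(eta, z):
    """Cardinality of P_z in (4); eta is assumed sorted in decreasing order."""
    q = np.arange(1, len(eta) + 1)
    cond = eta - (np.cumsum(eta) - 1.0 - z) / q > 0
    p = int(np.argmin(cond)) if not cond.all() else len(eta)
    assert cond[:p].all() and not cond[p:].any()   # P_z is a consecutive set
    return p

def card_Nz(eta, z):
    """Cardinality of N_z in (5); eta is assumed sorted in decreasing order."""
    rev = eta[::-1]
    q = np.arange(1, len(rev) + 1)
    cond = rev - (np.cumsum(rev) + z) / q < 0
    n = int(np.argmin(cond)) if not cond.all() else len(rev)
    assert cond[:n].all() and not cond[n:].any()   # N_z is a consecutive set
    return n

eta = np.array([1.2, 0.8, 0.3, -0.1, -0.6])
assert card_Pz(eta, 0.0) == 2 and card_Nz(eta, 0.0) == 0
assert card_Pz(eta, 1.0) == 3 and card_Nz(eta, 1.0) == 2
# both cardinalities are non-decreasing in z
zs = np.linspace(0.0, 2.0, 41)
ps = [card_Pz(eta, z) for z in zs]
ns = [card_Nz(eta, z) for z in zs]
assert ps == sorted(ps) and ns == sorted(ns)
```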

###### Proposition 3.

If p_z + n_z ⩽ k and β̂ ∈ argmin_{β ∈ A_z} Q(β), then β̂ = β(p_z, n_z, z).

We will now analyze how Q(β(p_z, n_z, z)) varies with z if p_z + n_z ⩽ k, and use this to find a minimizer if p_s + n_s ⩽ k. The case that p_s + n_s > k is treated separately in Section 2.3.

### 2.2 Properties of Q(β(p_z, n_z, z)) as a function of z for p_z + n_z ⩽ k

As p_z and n_z are integers, they increase discontinuously as z increases. In this subsection we show that Q(β(p_z, n_z, z)) and its derivative are continuous in z despite these discontinuities in p_z and n_z. This will allow us to show that ẑ = s if p_s + n_s ⩽ k.

Let z_p^+ := ∑_{i=1}^p η_i − 1 − p η_p, for p = 1, …, m, so that p_z = p if z_p^+ < z ⩽ z_{p+1}^+. We then find the ordering z_1^+ < z_2^+ < ⋯ < z_m^+, and

 η_p = (∑_{i=1}^p η_i − 1 − z_p^+)/p,  p = 1, …, m,
 η_{p+1} = (∑_{i=1}^p η_i − 1 − z_{p+1}^+)/p = (∑_{i=1}^{p+1} η_i − 1 − z_{p+1}^+)/(p + 1),  p = 1, …, m − 1.   (6)

Consequently, if z_p^+ < z ⩽ z_{p+1}^+, then

 η_p > (∑_{i=1}^p η_i − 1 − z)/p ⩾ η_{p+1}.   (7)

Similarly, let z_n^− := n η_{m−n+1} − ∑_{i=1}^n η_{m−i+1}, for n = 1, …, m, so that n_z = n if z_n^− < z ⩽ z_{n+1}^−. Then z_1^− < z_2^− < ⋯ < z_m^− and

 η_{m−n+1} = (∑_{i=1}^n η_{m−i+1} + z_n^−)/n,  n = 1, …, m,
 η_{m−n} = (∑_{i=1}^n η_{m−i+1} + z_{n+1}^−)/n = (∑_{i=1}^{n+1} η_{m−i+1} + z_{n+1}^−)/(n + 1),  n = 1, …, m − 1.   (8)

Consequently, if z_n^− < z ⩽ z_{n+1}^−, then

 η_{m−n+1} < (∑_{i=1}^n η_{m−i+1} + z)/n ⩽ η_{m−n}.   (9)

Using the cardinalities p_z and n_z of the sets P_z and N_z in (4) and (5), let β(p_z, n_z, z) be defined as in (3). If z = 0, then n_z = 0. If z > 0, then n_z ⩾ 1. The loss function

 Q(β(p_z, n_z, z)) = p_z {(∑_{i=1}^{p_z} η_i − 1 − z)/p_z}² + 𝕀{n_z ⩾ 1} n_z {(∑_{i=1}^{n_z} η_{m−i+1} + z)/n_z}² + ∑_{i=p_z+1}^{m−n_z} η_i²

is a continuous function of for , with derivative

 (dQ(β(pz,nz,z))(dz =−2{(∑pzi=1ηi)−1−zpz}+2I{nz⩾1}{(∑nzi=1ηm−i+1)+znz}, (10)

which is continuous for z ⩾ 0. That is, using (6), if z ↑ z_{p_z+1}^+, then

 −(∑_{i=1}^{p_z} η_i − 1 − z)/p_z ↑ −η_{p_z+1},

and if z ↓ z_{p_z+1}^+, then

 −(∑_{i=1}^{p_z+1} η_i − 1 − z)/(p_z + 1) ↓ −η_{p_z+1}.

A similar continuity holds for the second term of (10) due to (8). The derivative (10) is increasing in z, but it is negative for p_z + n_z < m due to (7) and (9), which imply

 dQ(β(p_z, n_z, z))/dz ⩽ −2η_{p_z+1} + 2η_{m−n_z} ⩽ 0.

We summarize these results in a proposition.

###### Proposition 4.

The function Q(β(p_z, n_z, z)) is continuous in z for z ⩾ 0, and its derivative with respect to z is negative for p_z + n_z < m.

As Q(β(p_z, n_z, z)) is strictly decreasing in z over [0, s] if p_s + n_s ⩽ k, we conclude that ẑ = s if p_s + n_s ⩽ k.

### 2.3 The case that p_s + n_s > k

If p_s + n_s > k, then β(p_s, n_s, s) violates the restriction ‖β‖₀ ⩽ k. So an alternative approach is required. By Lemma 1 and the fact that p_z and n_z are increasing in z, we should compare the objective values for all pairs (p, n) for which p + n = k, p ⩽ p_s and n ⩽ n_s. In order to do so for a given pair (p, n), we need to find the value of z that minimizes Q(β(p, n, z)). This minimizing value, which we will denote by z_{pn}, must satisfy 0 ⩽ z_{pn} ⩽ s. We will now show that z_{pn} is either equal to s_{pn} or to s.

We find

 Q(β(p, n, z)) = (p + n)((∑_{i=1}^p η_i + ∑_{j=1}^n η_{m−j+1} − 1)/(p + n))² + ((p + n)/(pn))(s_{pn} − z)² + ∑_{i=p+1}^{m−n} η_i²,

where

 s_{pn} = n(∑_{i=1}^p η_i − 1)/(p + n) − p(∑_{i=1}^n η_{m−i+1})/(p + n).

As Q(β(p, n, z)) is quadratic in z with a minimum at z = s_{pn}, we find that if s_{pn} < s, then z_{pn} = s_{pn}, and if s_{pn} ⩾ s, then z_{pn} = s.
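The quadratic structure of Q(β(p, n, z)) in z is easy to verify numerically. In the sketch below (function names are ours), the loss is computed directly from the solution formula in (3), and its interior minimizer is compared against s_{pn}:

```python
import numpy as np

def Q_direct(eta, p, n, z):
    """Q(beta(p, n, z)) computed directly from equation (3), for n >= 1."""
    m = len(eta)
    beta = np.zeros(m)
    beta[:p] = eta[:p] - (eta[:p].sum() - 1.0 - z) / p
    beta[m - n:] = eta[m - n:] - (eta[m - n:].sum() + z) / n
    return np.sum((beta - eta) ** 2)

def s_pn(eta, p, n):
    """Unconstrained minimizer of Q(beta(p, n, z)) over z."""
    m = len(eta)
    return (n * (eta[:p].sum() - 1.0) - p * eta[m - n:].sum()) / (p + n)

eta = np.array([1.2, 0.8, 0.3, -0.1, -0.6])
p, n = 2, 2
z_star = s_pn(eta, p, n)
# the loss is a parabola in z with its vertex at s_pn
for dz in (0.1, 0.5, 1.0):
    assert Q_direct(eta, p, n, z_star) < Q_direct(eta, p, n, z_star + dz)
    assert Q_direct(eta, p, n, z_star) < Q_direct(eta, p, n, z_star - dz)
```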

In the case that n = 0, the minimum s_{pn} does not exist, since the second term of Q(β(p, n, z)) degenerates as n → 0; the restriction ∑_{i=1}^m β_i = 1 then forces z_{p0} = 0. Furthermore, z_{pn} ∈ [0, s] by construction. So if p_s + n_s > k, then β̂ = β(p, n, z_{pn}) for some pair (p, n) with p + n ⩽ k, by Proposition 3 and due to the negative gradient of Q(β(p_z, n_z, z)). In the case that p + n < k, then p < p_s or n < n_s, and the pair can be enlarged without violating the restriction ‖β‖₀ ⩽ k. So if p < p_s, then Q(β(p + 1, n, z_{pn})) ⩽ Q(β(p, n, z_{pn})) by Lemma 1. Similarly, if n < n_s, then Q(β(p, n + 1, z_{pn})) ⩽ Q(β(p, n, z_{pn})). So if p_s + n_s > k, then it suffices to compare Q(β(p, n, z_{pn})) for all pairs (p, n) with p + n = k.

Hence, if p_s + n_s > k, we can compute z_{pn} for each pair (p, n) that satisfies p + n = k, p ⩽ p_s and n ⩽ n_s, and use this to compute the objective value Q(β(p, n, z_{pn})). By comparing the objective values, we can find the triplet (p̂, n̂, ẑ) for which Q(β(p̂, n̂, ẑ)) is minimal.

Combining these findings with the findings from the previous subsections, we can construct an algorithm that finds an element of argmin_{β∈T} Q(β). This algorithm is presented in Algorithm 1.

### 2.4 Extension

The case s = 0 can be treated in a way similar to the case s > 0, except that in the proof of Proposition 2 the assumption s > 0 was needed. We therefore provide a proof for s = 0.

###### Proposition 5.

Proposition 2 holds true when s = 0.

The proof is given in the Appendix.

## 3 Sparsity Under Orthogonality

In this section, we use the results from Section 2 to study the sparsity of the solution to (1) under orthogonality.

As both ℓ_0- and ℓ_1-regularization produce sparsity, we can analyze how the sparsity of the solution to (1) depends on the tuning parameters k and s. From Algorithm 1, it is straightforward to observe that the number of non-zero elements in the solution is equal to min(p_s + n_s, k), where p_s = |P_s| and n_s = |N_s|. So the ℓ_1-regularization component only produces additional sparsity if p_s + n_s < k.

In order to gain some insight into the sparsity produced by the ℓ_1-regularization component, we consider the maximum sparsity produced by ℓ_1-regularization if k = m. Notice that the sparsity is maximized if p_s + n_s is minimized, which happens when s = 0. Furthermore, if s = 0, then n_s = 0. So, the minimum number of non-zero elements is equal to

 min(p̄, k) = p̄ = max{ i | ∑_{j=1}^{i−1} (η_j − η_i) < 1 } = max{ i | ∑_{j=1}^{i−1} j Δ_j < 1 },   (11)

where Δ_j := η_j − η_{j+1}. This shows that the maximum sparsity produced by ℓ_1-regularization depends entirely on the size of the gaps between the largest elements of η. So the maximum amount of sparsity does not change if the same constant is added to each element of η.
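Expression (11) is straightforward to evaluate; the sketch below (the function name is ours) also illustrates the closing remark: adding the same constant to each element of η leaves the gaps, and hence p̄, unchanged.

```python
import numpy as np

def p_bar(eta):
    """Minimum number of non-zeros from (11): the largest i such that
    the cumulative gaps sum_{j<i} (eta_j - eta_i) stay below one."""
    eta = np.sort(eta)[::-1]
    best = 1
    for i in range(1, len(eta) + 1):
        if np.sum(eta[: i - 1] - eta[i - 1]) < 1.0:
            best = i
        else:
            break
    return best

eta = np.array([1.2, 0.8, 0.3, -0.1, -0.6])
assert p_bar(eta) == 2
# shifting all elements by a constant changes none of the gaps
assert p_bar(eta) == p_bar(eta + 7.3)
```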

To further analyze the maximum sparsity produced by the -regularization component, we consider two special cases of : one case without noise and one case with noise.

Linear and Noiseless. Suppose that the largest elements of η are linearly spaced with distance Δ (i.e. Δ_j = Δ for some Δ > 0). Then, using (11), we can derive the following closed-form expression for the minimum number of non-zero elements:

 p̄ = ⌊(1/2)(√((Δ + 8)/Δ) + 1)⌋,

where ⌊·⌋ rounds down to the nearest integer. As this function is weakly decreasing in Δ, the maximum sparsity is increasing in Δ. So, we obtain the intuition that if the largest elements of η are more similar, then less sparsity can be produced by ℓ_1-regularization.
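Away from knife-edge cases in which a cumulative gap equals exactly one, the closed-form expression can be checked against a direct search over the condition in (11); a small sketch under the linear-spacing assumption (function names are ours):

```python
import numpy as np

def p_bar_direct(delta, m=1000):
    # largest i with delta * i * (i - 1) / 2 < 1, found by direct search
    i = 1
    while i + 1 <= m and delta * (i + 1) * i / 2.0 < 1.0:
        i += 1
    return i

def p_bar_closed(delta):
    # closed-form expression derived from (11) for linearly spaced eta
    return int(np.floor(0.5 * (np.sqrt((delta + 8.0) / delta) + 1.0)))

# the two expressions agree away from boundary cases
for delta in (0.03, 0.3, 0.7, 2.0):
    assert p_bar_direct(delta) == p_bar_closed(delta)
```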

Equal and Noisy. Let η = μ + σε, where ε has i.i.d. elements ε_i ∼ N(0, 1), σ > 0, and μ is an m-vector with elements μ_i = μ for all i. As all elements of μ are equal, the gaps between the elements of η are equal to the gaps between the order statistics of ε, scaled by the constant σ. So, the size of the gaps between the largest elements of η is increasing in σ. Therefore, according to (11), the maximum sparsity is increasing in σ. As an increase in σ represents an increase in noise, we can draw the intuitive conclusion that if η has elements of similar size, then the maximum amount of sparsity produced by ℓ_1-regularization increases with noise.

## 4 General Case

In this section, we describe how a solution can be found for the general case, in which X is not required to be orthogonal. To do so, we adapt the framework laid out by Bertsimas et al. (2016) for standard linear regression. This framework consists of two components. The first component is a Discrete First-Order (DFO) algorithm that uses an algorithm for the orthogonal problem as a subroutine in each iteration. The solution of this DFO algorithm is then used as an initial solution for the second component. The second component relies on reformulating (1) as an MIO problem, which can be solved to provable optimality by an MIO solver.

### 4.1 Discrete First-Order Algorithm

In the construction of the DFO algorithm, we closely follow Bertsimas et al. (2016), but use a different constraint set, which includes an additional ℓ_1-norm restriction and the unit-sum restriction.

Denote the objective function as

 f(β) = (1/2)‖y − Xβ‖₂².

This function is Lipschitz continuously differentiable, as

 ‖∇f(β) − ∇f(η)‖₂² = ‖X′X(β − η)‖₂² ⩽ ‖X′X‖₂² ‖β − η‖₂² = (L*)² ‖β − η‖₂²,

where L* := ‖X′X‖₂ is the largest absolute eigenvalue of X′X. So, we can apply the following result.

###### Proposition 6 (Nesterov, 2013; Bertsimas et al., 2016).

For a convex function f with Lipschitz continuous gradient, we have

 f(η) ⩽ Q_L(η, β) := f(β) + (L/2)‖η − β‖₂² + ∇f(β)′(η − β),   (12)

for all L ⩾ L* and all η, β, where L* is the smallest constant such that ‖∇f(β) − ∇f(η)‖₂ ⩽ L*‖β − η‖₂.
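For the quadratic loss above, the bound in (12) with L = L* can be verified numerically on random points; a minimal sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 20, 6
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

def f(b):
    return 0.5 * np.sum((y - X @ b) ** 2)

def grad_f(b):
    return X.T @ (X @ b - y)

L_star = np.linalg.eigvalsh(X.T @ X).max()   # largest eigenvalue of X'X

def Q_L(eta_v, beta, L):
    return f(beta) + 0.5 * L * np.sum((eta_v - beta) ** 2) + grad_f(beta) @ (eta_v - beta)

# the upper bound of (12) holds at every pair of points
for _ in range(200):
    beta, eta_v = rng.normal(size=m), rng.normal(size=m)
    assert f(eta_v) <= Q_L(eta_v, beta, L_star) + 1e-9
```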

Given some fixed β, we can minimize the bound in (12) with respect to η under the constraint set T, as given in (2). Following Bertsimas et al. (2016), we find

 argmin_{η∈T} Q_L(η, β) = argmin_{η∈T} ( f(β) + (L/2)‖η − β‖₂² + ∇f(β)′(η − β) + (1/(2L))‖∇f(β)‖₂² − (1/(2L))‖∇f(β)‖₂² )
 = argmin_{η∈T} ( f(β) + (L/2)‖η − (β − (1/L)∇f(β))‖₂² − (1/(2L))‖∇f(β)‖₂² )
 = argmin_{η∈T} ‖η − (β − (1/L)∇f(β))‖₂².   (13)

Notice that (13) can be computed using Algorithm 1. Therefore, it is possible to use iterative updates in order to decrease the objective value. Specifically, let β₁ ∈ T and recursively define β_{r+1} ∈ argmin_{η∈T} Q_{L*}(η, β_r), for r ⩾ 1. Then by Proposition 6,

 f(β_r) = Q_{L*}(β_r, β_r) ⩾ Q_{L*}(β_{r+1}, β_r) ⩾ f(β_{r+1}).

In Algorithm 2, we present an algorithm that uses this updating step until some convergence criterion is reached.
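To illustrate the descent property of this updating step, consider the simple special case k = m with s large, in which the subroutine reduces to a Euclidean projection onto the unit-sum hyperplane. This projection stands in for Algorithm 1 only in this illustrative case; all names below are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 30, 8
X = rng.normal(size=(n, m))
y = rng.normal(size=n)
L = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant L*

def f(b):
    return 0.5 * np.sum((y - X @ b) ** 2)

def project_unit_sum(v):
    # Euclidean projection onto {beta : sum(beta) = 1}; plays the role of
    # the Algorithm 1 subroutine in the simple case k = m, s large
    return v + (1.0 - v.sum()) / len(v)

beta = np.full(m, 1.0 / m)               # feasible starting point
for _ in range(50):
    new = project_unit_sum(beta - (1.0 / L) * (X.T @ (X @ beta - y)))
    assert f(new) <= f(beta) + 1e-9      # each update cannot increase the loss
    beta = new
assert np.isclose(beta.sum(), 1.0)       # iterates stay on the hyperplane
```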

### 4.2 Mixed-Integer Optimization

In this section, an MIO formulation for problem (1) is presented. In order to formulate problem (1) as an MIO problem, we use three sets of auxiliary variables. The variables β_i⁺ and β_i⁻ are used to specify the positive and negative parts of the arguments β_i,