1 Introduction
Linear regression with coefficients that sum to one (henceforth unit-sum regression) is used in portfolio optimization and other economic applications such as forecast combinations (Timmermann, 2006) and synthetic control (Abadie et al., 2010).
In this paper, we focus on obtaining a sparse solution (i.e., one containing few nonzero elements) to the unit-sum regression problem. A sparse solution may be desirable for a variety of reasons, such as making a model more interpretable, improving estimation efficiency if the underlying parameter vector is known to be sparse, remedying identification issues if the number of variables exceeds the number of observations, or application-specific reasons such as reducing cost by limiting the number of constituents in a portfolio.
A popular method to produce sparsity is to use regularization. Theoretically, the most straightforward way to obtain a sparse solution is to use ℓ0-regularization (also known as best-subset selection), which amounts to restricting the number of nonzero elements in the solution. However, ℓ0-regularized estimation is NP-hard (Coleman et al., 2006; Natarajan, 1995) and has traditionally been seen as computationally infeasible for problems with more than about 40 variables, both in unit-sum regression and in standard linear regression.
Due to these computational difficulties, ℓ0-regularization has often been replaced by ℓ1-regularization, also known as the Lasso (Tibshirani, 1996). In ℓ1-regularization, the ℓ0-norm restriction that limits the number of nonzero elements is replaced by an ℓ1-norm restriction that limits the absolute size of the coefficients. This turns the problem into an easier-to-solve convex optimization problem. An ℓ1-norm restriction shrinks the weights towards zero and, as a consequence of the shrinkage, produces sparsity by setting some weights exactly equal to zero.
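As a concrete illustration of how ℓ1 shrinkage creates exact zeros, consider the penalized form of the Lasso with an orthogonal design and no unit-sum restriction: each coefficient is soft-thresholded. This is a standard textbook sketch, not the method of this paper, and the penalty level `lam` is an arbitrary illustrative value.

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso solution for an orthogonal design (penalty level lam):
    shrink every coefficient towards zero and set small ones exactly to zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([2.0, -0.8, 0.3, -0.1])
out = soft_threshold(b, 0.5)  # the two smallest coefficients become exactly zero
```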
The use of ℓ1-regularization in the presence of a unit-sum restriction was first considered by DeMiguel et al. (2009) and Brodie et al. (2009) in the context of portfolio optimization. They show that ℓ1-regularization is able to produce sparsity in combination with a unit-sum restriction. In addition, they demonstrate that the combination can be viewed as a restriction on the sum of the negative weights. In some applications it is highly desirable to have a parameter that explicitly controls the sum of the negative weights. For example, in a portfolio optimization context negative weights represent potentially costly short positions.
However, the unit-sum restriction causes a problem when using ℓ1-regularization: due to the unit-sum restriction, the ℓ1 norm of the weights cannot be smaller than 1. This imposes a lower bound on the amount of shrinkage produced by ℓ1-regularization and, in turn, an upper bound on the sparsity it can produce. This upper bound depends entirely on the data, which makes it difficult to rely on ℓ1-regularization if a specific level of sparsity is desired. In addition, due to the bound there does not always exist a value of the tuning parameter that guarantees the existence of a unique solution. Furthermore, Fastrich et al. (2014) point out that a combination of a non-negativity restriction and a unit-sum restriction fixes the ℓ1 norm of the weights to 1, which renders ℓ1-regularization useless.
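The lower bound is elementary: if the weights sum to one, the triangle inequality gives ||β||_1 ≥ |Σ_i β_i| = 1, with equality exactly when all weights are nonnegative. More precisely, ||β||_1 = 1 + 2s, where s is the sum of the absolute values of the negative weights, which is the restriction on short positions mentioned above. A minimal numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    beta = rng.normal(size=8)
    beta /= beta.sum()           # rescale so the weights sum to one
    s = -beta[beta < 0].sum()    # sum of the absolute negative weights
    assert np.isclose(np.abs(beta).sum(), 1 + 2 * s)
    assert np.abs(beta).sum() >= 1 - 1e-12  # the l1 norm never drops below 1
```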
In order to address these issues and obtain sparse solutions in unit-sum regression, we use a recent innovation in ℓ0-regularization in the standard linear regression setting by Bertsimas et al. (2016). They show that modern Mixed-Integer Optimization (MIO) solvers can find a provably optimal solution to ℓ0-regularized regression for problems of practical size. To achieve this, the solver is provided with a good initial solution obtained from a discrete first-order (DFO) algorithm. In a simulation study, they show that ℓ0-regularization performs favorably compared to ℓ1-regularization in terms of predictive performance and sparsity.
An extended simulation study comparing ℓ0- and ℓ1-regularization in the standard linear regression setting is performed by Hastie et al. (2017). They find that ℓ0-regularization outperforms ℓ1-regularization if the signal-to-noise ratio (SNR) is high, while ℓ1-regularization performs better if the SNR is low. Additionally, they find that if the tuning parameters are selected to optimize predictive performance, ℓ0-regularization yields substantially sparser solutions.
A combination of ℓ0- and ℓ1-regularization (ℓ0ℓ1-regularization) is studied in the standard linear regression context by Mazumder et al. (2017). They observe that this combination yields a predictive performance similar to ℓ1-regularization if the SNR is low, and a predictive performance similar to ℓ0-regularization if the SNR is high. In addition, they find that ℓ0ℓ1-regularization produces more sparsity compared to ℓ0-regularization, if the tuning parameters are selected in order to optimize predictive performance.
Motivated by the results in the standard linear regression setting, we propose the use of ℓ0ℓ1-regularization in unit-sum regression. Specifically, let y be an n-vector and let X be an n × p matrix; then we consider the problem
(1)  min_β ||y − Xβ||²  subject to  Σ_i β_i = 1,  ||β||_0 ≤ k,  ||β||_1 ≤ δ,
where β_i are the elements of β, ||β||_0 is the ℓ0 "norm" of β (the number of nonzero elements), ||β||_1 is the ℓ1 norm of β, and δ ≥ 1. Notice that this problem is equivalent to ℓ1-regularized unit-sum regression if k is sufficiently large, and equivalent to ℓ0-regularized unit-sum regression if δ is sufficiently large.
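For intuition, problem (1) can be solved by brute force when p is tiny: enumerate every support of size at most k and solve the resulting least-squares problem under the unit-sum and ℓ1 restrictions. The sketch below does this with SciPy's generic SLSQP solver; the function name is ours and this is only a reference implementation for toy examples, not the approach developed in this paper.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def unit_sum_l0_l1(y, X, k, delta):
    """Brute-force reference solver for problem (1) with tiny p:
    try every support of size <= k and keep the best subproblem solution."""
    n, p = X.shape
    best_val, best_beta = np.inf, None
    cons = [{"type": "eq", "fun": lambda b: b.sum() - 1.0},              # unit sum
            {"type": "ineq", "fun": lambda b: delta - np.abs(b).sum()}]  # l1 bound
    for m in range(1, k + 1):
        for S in itertools.combinations(range(p), m):
            XS = X[:, S]
            res = minimize(lambda b: ((y - XS @ b) ** 2).sum(),
                           np.full(m, 1.0 / m),  # equal weights: always feasible
                           constraints=cons, method="SLSQP")
            if res.fun < best_val:
                best_val = res.fun
                best_beta = np.zeros(p)
                best_beta[list(S)] = res.x
    return best_beta, best_val

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = X[:, 0] * 0.6 + X[:, 1] * 0.4 + 0.01 * rng.normal(size=30)
beta, val = unit_sum_l0_l1(y, X, k=2, delta=1.0)  # at most 2 nonzeros, no shorting
```

The runtime grows combinatorially in p, which is exactly why Section 4 resorts to discrete first-order updates and MIO instead.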
The formulation in (1) provides users with explicit control over both the sparsity of the solution and the sum of its negative weights. In addition, if the tuning parameters are selected in order to maximize predictive performance, we find in a simulation experiment that ℓ0ℓ1-regularization:

performs better than ℓ0-regularization in terms of predictive performance, especially if the signal-to-noise ratio is low.

performs well compared to ℓ1-regularization in terms of predictive performance, especially for higher signal-to-noise ratios, while at the same time producing much sparser solutions.
The main contributions of this paper can be summarized as follows. [1] We propose ℓ0ℓ1-regularization for the unit-sum regression problem. [2] We analyze the problem for orthogonal design matrices and provide an algorithm to compute its solution. [3] We show how the algorithm for the orthogonal case can be used in finding a solution to the general problem by extending the framework of Bertsimas et al. (2016) to unit-sum regression. [4] We perform a simulation experiment which shows that our approach performs favorably compared to ℓ0-regularization and ℓ1-regularization. [5] We demonstrate in an application to stock index tracking that ℓ0ℓ1-regularization is able to find substantially sparser portfolios than ℓ1-regularization, while maintaining a similar out-of-sample tracking error.
The remainder of the paper is structured as follows. In Section 2, problem (1) is studied under the assumption that X is orthogonal, and an algorithm for the orthogonal case is presented. Section 3 analyzes the sparsity production for the orthogonal case and yields some intuition about the problem. Section 4 links the algorithm for the orthogonal case to the framework of Bertsimas et al. (2016) in order to find a solution to the general problem. In Section 5, the simulation experiments are presented. Section 6 provides an application to index tracking.
2 Orthogonal Design
As problem (1) is difficult to study in its full generality, we first consider the special case in which X is orthogonal. We derive properties of a solution to (1) under orthogonality and use these properties to construct an algorithm that finds such a solution. The algorithm is presented at the end of the section. In Section 4 this algorithm is used to find a solution to the general problem by extending the framework of Bertsimas et al. (2016). In Section 3 we analyze the sparsity of the solution under orthogonality.
Assume that X'X = I_p, where I_p is the p × p identity matrix. Let us write b = X'y, so that minimizing ||y − Xβ||² in β is equivalent to minimizing ||b − β||². Define
(2)  B = {β ∈ R^p : Σ_i β_i = 1, ||β||_0 ≤ k, ||β||_1 ≤ δ}.
Then, problem (1) can be written as min_{β ∈ B} ||b − β||².
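The reduction to b = X'y is a two-line computation: with X'X = I_p, expanding ||y − Xβ||² gives y'y − 2b'β + β'β = ||y||² − ||b||² + ||b − β||², so the two objectives differ by a constant that does not involve β and share the same minimizers over any constraint set. A quick numerical verification, generating orthonormal columns via a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(30, 5)))  # columns satisfy X'X = I_5
y = rng.normal(size=30)
b = X.T @ y
for _ in range(100):
    beta = rng.normal(size=5)
    lhs = ((y - X @ beta) ** 2).sum()
    rhs = y @ y - b @ b + ((b - beta) ** 2).sum()
    assert np.isclose(lhs, rhs)  # objectives differ only by a constant
```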
We assume the elements of b are distinct. Without further loss of generality, we assume they are sorted in decreasing order: b_1 > b_2 > ⋯ > b_p. In Section 2.4 this setting is extended.
Let
where , so that . If , then for some . Let us denote this value of with . We will now show that can be computed from the signs of its elements and . In order to show this, we first solve a related problem and then show that is equal to the solution of a specific case of this related problem.
Let and be disjoint sets with cardinalities and , respectively, where . Define
Minimization of over the affinely restricted set has the solution
(3) 
Recall that and let and . Furthermore, let be the set of vectors with elements that have the same signs as the elements of , then . Notice that the difference between and is that there are no sign restrictions on elements , for which . Consequently,
However, if , then for sufficiently small . Furthermore, as is a parabola in with a minimum at , we find that for small . As , this is a contradiction. Hence, , which is our first result.
Proposition 1.
If , then .
So, the problem can be decomposed into finding the components of the triplet that minimizes . Next, we will study the properties of these components.
2.1 Properties of as a function of and
The sorting of reveals an ordered structure in the sets and that minimize . This structure is described in the following result.
Proposition 2.
If , then and if , and if .
The proof is given in the Appendix. For sets such as and , we use the notation , as in (3). The following result shows that and should be maximized such that .
Lemma 1.
If and , where , , , then .
The proof is given in the Appendix.
We will now consider the relationship between and the pair . With reference to (3), let us consider the sets
(4)  
(5) 
with cardinalities and . As
we find , and similarly if and if . Additionally, we find that is increasing in , and similarly that is increasing in . So, by Lemma 1 we have the following result for .
Proposition 3.
If and , then .
We will now analyze how varies with if , and use this to find a minimizer if . The case that is treated separately in Section 2.3.
2.2 Properties of as a function of for
As and are integers, they increase discontinuously as increases. In this subsection we show that and its derivative are continuous in despite these discontinuities in and . This will allow us to show that if .
Let and , for . We then find the ordering , and
(6) 
Consequently, if , then
(7) 
Similarly, let and , for . Then and
(8) 
Consequently, if , then
(9) 
Using the cardinalities and of the sets and in (4) and (5), let . If , then . If , then . The loss function
is a continuous function of for , with derivative
(10) 
which is continuous for . That is, using (6), if , then
and if , then
A similar continuity holds for the second term of (10) due to (8). The derivative (10) is increasing in , but it is negative for due to (7) and (9), which imply
We summarize these results in a proposition.
Proposition 4.
The function is continuous in for , and the derivative with respect to is negative for .
As is strictly decreasing in over if , we conclude that if .
2.3 The case that
If , then . So an alternative approach is required. By Lemma 1 and the fact that , we should compare the objective values for all pairs for which , and . In order to do so for a given pair , we need to find the value of that minimizes . This minimizing value, which we will denote by , must satisfy . We will now show that is either equal to or to .
We find
where
As is quadratic in with a minimum at , we find that if , then , and if , then .
In the case that , the minimum does not exist, since as . Furthermore, . So if then for some , by Proposition 3 and due to the negative gradient of . In the case that , then or , and . So if , then by Lemma 1. Similarly if , then . So if , then for all .
Hence, if , we can compute for each pair that satisfies , and and use this to compute the objective value . By comparing the objective values, we can find the triplet for which .
Combining these findings with the findings from the previous sections, we can construct an algorithm to find an element of . This algorithm is presented in Algorithm 1.
2.4 Extension
The case can be treated in a way similar to the case , except that in the proof of Proposition 2 the assumption was needed. We therefore provide a proof for .
Proposition 5.
Proposition 2 holds true when .
The proof is given in the Appendix.
3 Sparsity Under Orthogonality
In this section, we use the results from Section 2 to study the sparsity of the solution to (1) under orthogonality.
As both ℓ0- and ℓ1-regularization produce sparsity, we can analyze how the sparsity of the solution to (1) depends on the tuning parameters k and δ. From Algorithm 1, it is straightforward to observe that the number of nonzero elements in is equal to where and . So the ℓ1-regularization component only produces additional sparsity if .
In order to gain some insight into the sparsity produced by the ℓ1-regularization component, we consider the maximum sparsity produced by ℓ1-regularization if . Notice that the sparsity is maximized if is minimized, which happens when . Furthermore, if , then . So, the minimum number of nonzero elements is equal to
(11) 
where . This shows that the maximum sparsity produced by ℓ1-regularization depends entirely on the size of the gaps between the largest elements of b. So the maximum amount of sparsity does not change if the same constant is added to each element of b.
To further analyze the maximum sparsity produced by the ℓ1-regularization component, we consider two special cases of b: one case without noise and one case with noise.
Linear and Noiseless. Suppose that the largest elements of b are linearly spaced with distance d (i.e., consecutive gaps equal to some d > 0). Then, using (11), we can derive the following closed-form expression for the minimum number of nonzero elements:
where ⌊·⌋ rounds down to the nearest integer. As this function is weakly decreasing in d, the maximum sparsity is increasing in d. So, we obtain the intuition that if the largest elements of b are more similar, then less sparsity can be produced by ℓ1-regularization.
Equal and Noisy. Let b = μ + σz, where z has i.i.d. elements z_i ∼ N(0, 1), σ > 0, and μ is a p-vector with elements μ_i = μ for all i. As all elements of μ are equal, the gaps between the elements of b are equal to the gaps between the order statistics of z, scaled by the constant σ. So, the size of the gaps between the largest elements of b is increasing in σ. Therefore, according to (11), the maximum sparsity is increasing in σ. As an increase in σ represents an increase in noise, we can draw the intuitive conclusion that if b has elements of similar size, then the maximum amount of sparsity produced by ℓ1-regularization increases with noise.
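The scaling claim is easy to check numerically: writing b as an equal constant plus scaled noise, sorting b orders the elements exactly as sorting the noise vector, so each gap between consecutive order statistics of b is the noise scale times the corresponding gap of the noise. A short sketch (the constants are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
z = rng.normal(size=p)                   # i.i.d. standard normal noise
gz = -np.diff(np.sort(z)[::-1])          # gaps between order statistics of z
for sigma in (0.1, 0.5, 2.0):
    b = 1.0 + sigma * z                  # equal signal plus scaled noise
    gb = -np.diff(np.sort(b)[::-1])      # gaps between order statistics of b
    assert np.allclose(gb, sigma * gz)   # gaps scale linearly with sigma
```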
4 General Case
In this section, we describe how a solution can be found in the general case, in which X is not required to be orthogonal. To do so, we adapt the framework laid out by Bertsimas et al. (2016) for standard linear regression. This framework consists of two components. The first component is a Discrete First-Order (DFO) algorithm that uses an algorithm for the orthogonal problem as a subroutine in each iteration. The solution of this DFO algorithm is then used as an initial solution for the second component. The second component relies on reformulating (1) as an MIO problem, which can be solved to provable optimality by using an MIO solver.
4.1 Discrete FirstOrder Algorithm
In the construction of the DFO algorithm, we closely follow Bertsimas et al. (2016), but use a different constraint set that includes an additional ℓ1-norm restriction and a unit-sum restriction.
Denote the objective function as

g(β) = ||y − Xβ||².

This function is Lipschitz continuously differentiable, as

||∇g(β) − ∇g(β̃)|| ≤ ℓ ||β − β̃||,

where ℓ is the largest absolute eigenvalue of 2X'X. So, we can apply the following result.

Proposition 6 (Nesterov, 2013; Bertsimas et al., 2016).
For a convex Lipschitz continuously differentiable function g, we have

(12)  g(η) ≤ Q_L(η, β) = g(β) + (L/2) ||η − β||² + ∇g(β)'(η − β)

for all β, η, and L ≥ ℓ, where ℓ is the smallest constant such that ||∇g(β) − ∇g(η)|| ≤ ℓ ||β − η|| for all β, η.
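The smoothness constant can be verified numerically for the least-squares objective: since ∇g(β) = −2X'(y − Xβ) is affine in β, the difference of two gradients is 2X'X(β − η), whose operator norm is the largest eigenvalue of 2X'X. A sketch of this check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)

def grad(beta):
    """Gradient of the least-squares objective ||y - X beta||^2."""
    return -2 * X.T @ (y - X @ beta)

ell = 2 * np.linalg.eigvalsh(X.T @ X).max()  # largest eigenvalue of 2 X'X
for _ in range(100):
    b1, b2 = rng.normal(size=5), rng.normal(size=5)
    # the gradient difference never exceeds ell times the point difference
    assert np.linalg.norm(grad(b1) - grad(b2)) <= ell * np.linalg.norm(b1 - b2) + 1e-9
```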
Given some fixed β, we can minimize the bound in (12) with respect to η under the constraint set B, as given in (2). Following Bertsimas et al. (2016), we find
(13)  argmin_{η ∈ B} Q_L(η, β) = argmin_{η ∈ B} ||η − (β − (1/L) ∇g(β))||².
Notice that (13) can be computed using Algorithm 1. Therefore, it is possible to use iterative updates in order to decrease the objective value. Specifically, let β^(1) ∈ B and recursively define β^(m+1) ∈ argmin_{η ∈ B} Q_L(η, β^(m)), for all m ≥ 1. Then by Proposition 6, g(β^(m+1)) ≤ Q_L(β^(m+1), β^(m)) ≤ Q_L(β^(m), β^(m)) = g(β^(m)), so the objective value is non-increasing over the iterations.
In Algorithm 2, we present an algorithm that uses this updating step until some convergence criterion is reached.
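The updating scheme can be sketched as a projected-gradient loop, with the projection onto the constraint set abstracted as a user-supplied `project` function (a stand-in for Algorithm 1; the function name and interface are ours):

```python
import numpy as np

def dfo(y, X, project, max_iter=500, tol=1e-8):
    """Discrete first-order updates: take a gradient step of size 1/L and
    map the result back to the constraint set with `project`."""
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()  # Lipschitz constant of the gradient
    beta = project(np.full(X.shape[1], 1.0 / X.shape[1]))  # start from equal weights
    for _ in range(max_iter):
        grad = -2 * X.T @ (y - X @ beta)
        new = project(beta - grad / L)        # minimize the bound (12) over the set
        if np.linalg.norm(new - beta) < tol:  # stop once the iterates stabilize
            return new
        beta = new
    return beta
```

With `project` equal to the identity, the loop reduces to plain gradient descent on the least-squares objective; in the scheme above, `project` is Algorithm 1 applied to the gradient step β − ∇g(β)/L.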