Linear regression models are among the most extensively employed statistical methods in science and industry alike (bro2002maximum; intriligator1978econometric; isobe1990linear; nievergelt2000tutorial; reeder2004multicoil). Their simplicity, ease of use and performance in low-data regimes enable their usage in various prediction tasks. As the number of observations usually exceeds the number of variables, a practitioner has to resort to approximating the solution of an overdetermined system. Least squares approximation benefits from a closed-form solution and is the oldest (gauss1995theoria) and best-known approach in linear regression analysis. Among the benefits of linear regression models is the possibility of easily interpreting how much each variate contributes to the approximation of the dependent variable by observing the magnitudes and signs of the associated parameters.
In some application domains, partitioning the variables into non-overlapping subsets is beneficial either as a way to insert human knowledge into the regression analysis task or to further improve model interpretability. When considering high-dimensional data, grouping variables together is also a natural way to make it easier to reason about the data and the regression result. As an example, consider a regression task where the dependent variable is the score achieved by students in a University or College exam. A natural way to group the independent variables is to divide them into two groups where one contains the variables which represent a student’s effort in the specific exam (hours spent studying, number of lectures attended…), while another contains the variables related to previous effort and background (number of previous exams passed, number of years spent at University or College, grade average…). Assuming all these variables could be measured accurately, it might be interesting to know how much each group of variables contributes to the student’s score. As a further example, when analyzing complex chemical compounds, it is possible to group together fine-grained features to obtain a partition which refers to high-level properties of the compound (such as structural, interactive and bond-forming among others), and knowing how much each high-level property contributes to the result of the analysis is often of great practical value (caron13block).
In this paper we introduce a variation on the linear regression problem which allows for partitioning variables into meaningful groups. The parameters obtained by solving the problem allow one to easily assess the contribution of each group to the dependent variable as well as the importance of each element of the group.
The newly introduced problem is not easy to solve and indeed we will prove the non-convexity of the objective, and the NP-completeness of the problem itself. In Section 3 we introduce two possible algorithms to solve the problem. One is based on an Alternate Convex Search method (wendell1976minimization), where the optimization of the parameters is iterative and can get trapped in local minima; the other is based on a reformulation of the original problem into an exponential number of sub-problems, where the exponent is the cardinality of the partition. We prove convergence of the alternating least squares algorithm and the global optimality of the result returned by the second approach. We also provide guidance for building a branch and bound (lawler1966branch) solution that might be useful when the cardinality of the partition is too large to use the exact algorithm.
We test the two algorithms on several datasets. Our experiments include data extracted from the analysis of chemical compounds (caron13block) in a particular setting where this kind of analysis already proved to be of value to practitioners, and a number of datasets having a large number of features which we selected from the UCI repository (dua2019uci): in this latter case the number, size, and composition of the partition have been decided arbitrarily just to experiment with the provided algorithms. Our experimental results show that the exact algorithm is usually a good choice, the non-exact algorithm being preferable when high accuracy is not required and/or the cardinality of the partition is too large.
While to the best of our knowledge the regression problem and the algorithms we present are novel, there has been previous work dealing with alternative formulations of the linear regression problem. Some of them have proven to be of great practical use and have received attention from both researchers and practitioners.
Partial Least Squares (PLS) Regression (wold2009pls) is a very popular method in hard sciences such as chemistry and chemometrics. PLS has been designed to address the bad behavior of ordinary least squares when the dataset is small, especially when the number of features is large in comparison. In such cases, one can try to select a smaller set of features allowing a better behavior. A very popular way to select important features is to use Principal Component Analysis (PCA) to select the features that contribute most to the variation in the dataset. However, since PCA is based on the data matrix alone, one risks filtering out features that are highly correlated with the target variables in $\mathbf{Y}$. PLS has been explicitly designed to solve this problem by decomposing $\mathbf{X}$ and $\mathbf{Y}$ simultaneously, in such a way as to explain as much as possible of the covariance between $\mathbf{X}$ and $\mathbf{Y}$ (abdi10partial). Our work is substantially different from these approaches since we are not concerned at all with the goal of removing variables. On the contrary, we group them so as to make the result more interpretable and to provide valuable information about the importance of each group.
Yet another set of techniques that resemble our work are those where a partition of the variables is used to select groups of features. Very well known members of this family of algorithms are group lasso methods (bakin99adaptive; yuan2006model) (huang2012selective provide a review of such methodologies). In these works, the authors tackle the problem of selecting grouped variables for accurate prediction. In this case, as in ours, the groups for the variables are defined by the user, but in their case the algorithm needs to predict which subset of the groups will lead to better performance (i.e., either all variables in a group will be used as part of the solution or none of them will be). This is a rather different problem with respect to the one that we introduce here. In our case, we shall assume that all groups are relevant to the analysis. However, we seek a solution where all variables in the same group contribute in the same direction (i.e., with the same sign) to the solution. We argue that this formulation allows for an easier interpretation of the contribution of the whole group as well as of the variables included in each group.
In this paper we introduce a new least squares problem and provide algorithms to solve it. Our contributions include:
The definition of the Partitioned Least Squares (PartitionedLS) problem;
A formal non-convexity proof for the objective of the PartitionedLS problem;
PartLS-alt: an iterative algorithm solving the PartitionedLS problem;
PartLS-opt: an optimal algorithm solving the PartitionedLS problem;
A formal proof of convergence of PartLS-alt;
A formal proof of the global optimality of PartLS-opt;
The main ideas needed to implement a branch and bound solution aimed at optimizing PartLS-opt;
A formal proof of NP-completeness of the PartitionedLS problem;
Information about how to update the algorithms to regularize the solutions;
Information about how to leverage the non-negative least squares algorithm (lawson1995solving) to improve numerical stability;
An experimental evaluation of the two algorithms on several datasets.
2 Model description
| Symbol | Definition |
|---|---|
| $a_i$ | $i$-th component of vector $\mathbf{a}$. |
| $(f(i))_i$ | Shorthand to specify vectors (or matrices) in terms of their components. For instance $(i)_i$ shall denote a vector $\mathbf{v}$ such that $v_i = i$. |
| $k$, $K$ | $k$ is the index for iterating over the $K$ subsets belonging to the partition. |
| $m$, $M$ | $m$ is the index for iterating over the $M$ variables. |
| $\mathbf{X}$ | an $N \times M$ matrix containing the descriptions of the training instances. |
| $\mathbf{A} \times \mathbf{B}$ | matrix multiplication operation (we also simply write it $\mathbf{A}\mathbf{B}$ when the notation appears clearer). |
| $\mathbf{y}$ | a vector of length $N$ containing the labels assigned to the examples in $\mathbf{X}$. |
| $\bullet$ | wildcard used in subscripts to denote whole columns or whole rows: e.g., $\mathbf{X}_{\bullet,k}$ denotes the $k$-th column of matrix $\mathbf{X}$ and $\mathbf{X}_{m,\bullet}$ denotes its $m$-th row. |
| $\star$ | denotes an optimal solution, e.g., $p^\star$ denotes the optimal solution of the PartitionedLS problem, while $p_b^\star$ denotes the optimal solution of the PartitionedLS-b problem. |
| $\mathbf{P}$ | a partition matrix, $\mathbf{P} \in \{0,1\}^{M \times K}$, with $P_{m,k} = 1$ iff variable $\alpha_m$ belongs to the $k$-th element of the partition. |
| $P_k$ | the set of all indices in the $k$-th element of the partition: $\{m \mid P_{m,k} = 1\}$. |
| $k[m]$ | index of the partition element to which $\alpha_m$ belongs, i.e.: $k[m]$ is such that $m \in P_{k[m]}$. |
| $\circ$ | Hadamard (i.e., element-wise) product. When used to multiply a matrix by a column vector, it is intended that the columns of the matrix are each one multiplied (element-wise) by the column vector. |
| $\oslash$ | Hadamard (i.e., element-wise) division. |
| $\succeq$ | element-wise larger-than operator: $\boldsymbol{\alpha} \succeq 0$ is equivalent to $\alpha_m \geq 0$ for $m \in 1 \ldots M$. |
In this work we denote matrices with capital bold letters such as $\mathbf{X}$ and vectors with lowercase bold letters such as $\mathbf{v}$. In the text we use a regular (non-bold) font weight when we refer to the name of the vector or when we refer to scalar values contained in the vector. In other words, we use the bold face only when we refer to the vector itself. For instance, we might say that the values in the $\alpha$ vector are those contained in the vector $\boldsymbol{\alpha}$, which contains in position $i$ the scalar $\alpha_i$. We consistently define each piece of notation as soon as we use it, but we also report it in Table 1, where the reader can more easily access the whole notation employed throughout the paper.
Let us consider the problem of inferring a linear least squares model to predict a real variable $y$ given a vector $\mathbf{x} \in \mathbb{R}^M$. We will assume that the examples are available at learning time as an $N \times M$ matrix $\mathbf{X}$ and an $N \times 1$ column vector $\mathbf{y}$. We will also assume that the problem is expressed in homogeneous coordinates, i.e., that $\mathbf{X}$ has an additional column containing values equal to $1$, and that the intercept term of the affine function is included into the weight vector $\mathbf{w}$ to be computed.
The standard least squares formulation for the problem at hand is to minimize the quadratic loss over the residuals, i.e.:

$$\text{minimize}_{\mathbf{w}} \; \|\mathbf{X} \times \mathbf{w} - \mathbf{y}\|_2^2$$

This is a problem that has the closed form solution $\mathbf{w} = (\mathbf{X}^T \times \mathbf{X})^{-1} \times \mathbf{X}^T \times \mathbf{y}$. As mentioned in Section 1, in many application contexts where $M$ is large, the resulting model is hard to interpret. However, it is often the case that domain experts can partition the elements in the weights vector into a small number of groups and that a model built on this partition would be much easier to interpret. Then, let $\mathbf{P}$ be a “partition” matrix for the problem at hand (this is not a partition matrix in the linear algebra sense, it is simply a matrix containing the information needed to partition the features of the problem). More formally, let $\mathbf{P}$ be an $M \times K$ matrix where $P_{m,k} \in \{0,1\}$ is equal to $1$ iff feature number $m$ belongs to the $k$-th partition element. We will also write $P_k$ to denote the set $\{m \mid P_{m,k} = 1\}$ of all the features belonging to the $k$-th partition element.
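The closed-form least squares solution mentioned above can be illustrated with a minimal NumPy sketch. The synthetic data, the variable names and the noise level are assumptions made for illustration only; in practice a factorization-based solver such as `numpy.linalg.lstsq` is usually preferred to forming the normal equations explicitly.

```python
import numpy as np

# Synthetic data (illustrative assumption, not taken from the paper).
rng = np.random.default_rng(0)
N, M = 50, 3
X = rng.normal(size=(N, M))
X = np.hstack([X, np.ones((N, 1))])      # homogeneous coordinates: extra column of ones
w_true = np.array([2.0, -1.0, 0.5, 3.0])  # last entry plays the role of the intercept
y = X @ w_true + 0.01 * rng.normal(size=N)

# Closed form via the normal equations: w = (X^T X)^{-1} X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same problem solved with a dedicated least squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both estimates should agree up to numerical precision when the data matrix has full column rank.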
Here we introduce the Partitioned Least Squares (PartitionedLS) problem, a model where we introduce $K$ additional variables and express the whole regression problem in terms of these new variables (and in terms of how the original variables contribute to the predictions made using them). The simplest way to describe the new model is to consider its regression function (to make the discussion easier, we start with the data matrix $\mathbf{X}$ expressed in non-homogeneous coordinates and switch to homogeneous coordinates afterwards):

$$f(\mathbf{X}) = \left( \sum_{k=1}^{K} \beta_k \sum_{m \in P_k} \alpha_m x_{n,m} + t \right)_n \tag{1}$$

i.e., $f$ computes a vector whose $n$-th component is the one reported within parentheses (see Table 1 for details on the notation). The first summation is over the $K$ sets in the partition that domain experts have identified as relevant, while the second one iterates over all variables in that set. We note that the $m$-th $\alpha$ weight contributes to the $k$-th element of the partition only if feature number $m$ belongs to it. As we shall see, we require that all $\alpha$ values are nonnegative, and that $\sum_{m \in P_k} \alpha_m = 1$ for all $k$. Consequently, the expression returns a vector of predictions calculated in terms of two sets of weights: the $\beta_k$ weights, which are meant to capture the magnitude and the sign of the contribution of the $k$-th element of the partition, and the $\alpha_m$ weights, which are meant to capture how each feature in the $k$-th set contributes to it. We note that the $\boldsymbol{\alpha}$ weight vector is of the same length as the vector $\mathbf{w}$ in the least squares formulation. Despite this similarity, we prefer to use a different symbol because the interpretation of (and the constraints on) the $\alpha$ weights are different with respect to the $w$ weights.
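It is easy to verify that the definition of $f$ in (1) can be rewritten in matrix notation as:

$$f(\mathbf{X}) = \mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha}) \times \boldsymbol{\beta} + t \tag{2}$$

where $\circ$ is the Hadamard product extended to handle column-wise products. More formally, if $\mathbf{Z}$ is an $A \times B$ matrix, $\mathbf{1}$ is a $B$-dimensional vector with all entries equal to $1$, and $\mathbf{a}$ is a column vector of length $A$, then $\mathbf{Z} \circ \mathbf{a} \triangleq \mathbf{Z} \circ (\mathbf{a} \times \mathbf{1}^T)$; where the symbol $\circ$ on the right hand side of the definition is the standard Hadamard product. Equation (2) can be rewritten in homogeneous coordinates as:

$$f(\mathbf{X}) = \mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha}) \times \boldsymbol{\beta} \tag{3}$$

where $\mathbf{X}$ incorporates a column with all entries equal to 1, and we consider an additional group (with index $K+1$) having a single $\alpha_{M+1}$ variable in it. Given the constraints on the $\alpha$ variables, $\alpha_{M+1}$ is forced to assume a value equal to $1$ and the value of $t$ is then totally incorporated into $\beta_{K+1}$. In the following we will assume for ease of notation that the problem is given in homogeneous coordinates and that the constants $M$ and $K$ already account for the additional single-variable group.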
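The partitioned least squares (PartitionedLS) problem is formulated as:

$$\begin{aligned} \text{minimize}_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \quad & \|\mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha}) \times \boldsymbol{\beta} - \mathbf{y}\|_2^2 \\ \text{s.t.} \quad & \boldsymbol{\alpha} \succeq 0 \\ & \mathbf{P}^T \times \boldsymbol{\alpha} = \mathbf{1} \end{aligned} \tag{4}$$

In summary, we want to minimize the squared residuals of $f(\mathbf{X})$, as defined in (3), under the constraint that for each subset $k$ in the partition, the set of weights form a distribution: they need to be all nonnegative as imposed by constraint $\boldsymbol{\alpha} \succeq 0$ and they need to sum to $1$ as imposed by constraint $\mathbf{P}^T \times \boldsymbol{\alpha} = \mathbf{1}$.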
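To make the formulation concrete, the following sketch evaluates the partitioned regression function, the squared-residual objective and the two constraints on a toy instance. The function names (`predict`, `objective`, `is_feasible`) and the numbers are hypothetical and chosen only for illustration; the partition matrix is assumed to be a 0/1 NumPy array with exactly one 1 per row.

```python
import numpy as np

def predict(X, P, alpha, beta):
    """Partitioned regression function X (P ∘ alpha) beta (homogeneous form)."""
    # alpha[:, None] multiplies every column of P element-wise by alpha.
    return X @ (P * alpha[:, None]) @ beta

def objective(X, y, P, alpha, beta):
    """Sum of squared residuals of the partitioned model."""
    r = predict(X, P, alpha, beta) - y
    return float(r @ r)

def is_feasible(P, alpha, tol=1e-9):
    """alpha must be nonnegative and sum to one within every group."""
    return bool(np.all(alpha >= -tol) and np.allclose(P.T @ alpha, 1.0, atol=tol))

# Toy instance: M = 3 features, K = 2 groups ({0, 1} and {2}).
P = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.array([0.25, 0.75, 1.0])
beta = np.array([2.0, -3.0])
X = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 1.0]])
y = np.array([-5.0, -1.0])

pred = predict(X, P, alpha, beta)
loss = objective(X, y, P, alpha, beta)
```

In this toy instance the first group contributes with a positive sign and the second with a negative sign, while the weights inside each group describe how the group's contribution is shared among its features.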
Unfortunately we do not know a closed form solution for this problem. Furthermore, the problem is not convex and hence hard to solve to global optimality using standard out-of-the-box solvers. Even worse, later on we shall prove that the problem is actually NP-complete. The following theorem states the non-convexity of the objective function formally.
The PartitionedLS problem is not convex.
It suffices to show that the Hessian of the objective function is not positive semidefinite. By Schwarz’s theorem, since the loss function has continuous second partial derivatives, the matrix is symmetric and we can apply the Sylvester criterion for checking positive definiteness. In practice, we prove that the Hessian is not positive semidefinite by showing that not all leading principal minors are larger than zero. In our specific case, the second minor can be shown to assume values smaller than zero and this proves the theorem.
In summary, we need to show that the second principal minor can be smaller than zero, in formulae:
Let us denote with the objective of the PartitionedLS problem
In evaluating the derivatives, we consider the variables of the PartitionedLS problem in the following order: and assume the problem is not trivial, i.e., that . In the following, without loss of generality, we will assume that . Under these assumptions, to prove that the second minor is smaller than zero amounts to prove that:
The partial derivative of the loss function with respect to a specific variable is:
where denotes the index of the partition to which feature belongs and, for the sake of convenience, we define:
The partial derivative w.r.t. one of the variables is:
The second order derivatives we are interested in are:
where we assumed since in the case we are interested, we have and . Instantiating these derivatives for :
Now, it is enough to observe that the term on the left-hand side of the minus sign only depends on , while the term on the right-hand side of the minus sign depends also on and can be made arbitrarily large as we increase the values of these variables, thus making (1) negative and the Hessian not positive semidefinite.
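The non-convexity of the objective can also be checked numerically on the smallest possible instance. The sketch below is not the general argument of the proof: it assumes a single observation, a single feature and a single group, and looks at the objective as a function of the two weights only (ignoring the sum-to-one constraint). The function names and the chosen points are illustrative assumptions.

```python
def loss(alpha, beta, x=1.0, y=1.0):
    """Objective (x * alpha * beta - y)^2 for one observation and one feature."""
    return (x * alpha * beta - y) ** 2

def second_minor(alpha, beta, x=1.0, y=1.0):
    """Determinant of the 2x2 Hessian of `loss` w.r.t. (alpha, beta)."""
    h_aa = 2.0 * (x * beta) ** 2
    h_bb = 2.0 * (x * alpha) ** 2
    h_ab = 2.0 * x * (2.0 * x * alpha * beta - y)
    return h_aa * h_bb - h_ab ** 2

# Midpoint test: a convex function never exceeds the chord between two points.
p, q = (2.0, 0.5), (0.5, 2.0)
mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
loss_p, loss_q, loss_mid = loss(*p), loss(*q), loss(*mid)

# A negative determinant means the Hessian is indefinite at that point.
minor = second_minor(0.5, 0.5)
```

Both endpoints attain zero loss while the midpoint does not, and the determinant of the Hessian is negative at the chosen point, which is the same kind of violation the proof exhibits for the second leading principal minor.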
In the following we will provide two algorithms that solve the above problem. One is an alternating least squares approach which scales well with $K$, but it is not guaranteed to provide the globally optimal solution. The other one is a reformulation of the problem through a (possibly) large number of convex problems whose minimum is guaranteed to be the globally optimal solution of the original problem. Even though the second algorithm does not scale well with $K$, we believe that this should not be a problem since the PartitionedLS is by design well suited for a small number of interpretable groups. However, we do sketch a possible branch and bound strategy to mitigate this problem in Section 3.4.
The PartitionedLS model presented so far has no regularization mechanism in place and, as such, it risks overfitting the training set. Since the $\alpha$ values are normalized by definition, the only parameters that need regularization are those collected in the $\boldsymbol{\beta}$ vector. Then, the regularized version of the objective function simply adds a penalty on the size of the $\boldsymbol{\beta}$ vector:

$$\|\mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha}) \times \boldsymbol{\beta} - \mathbf{y}\|_2^2 + \eta \|\boldsymbol{\beta}\|_2^2 \tag{5}$$

where the squared Euclidean norm could be substituted with the L1 norm in case a LASSO-like regularization is preferred.
3.1 Alternating Least Squares approach
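In the PartitionedLS problem we aim at minimizing a non-convex objective, where the non-convexity depends on the multiplicative interaction between the $\alpha$ and $\beta$ variables in the expression $\mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha}) \times \boldsymbol{\beta}$. Interestingly, if one fixes $\boldsymbol{\alpha}$, the expression $\mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha})$ results in a matrix $\mathbf{X}'$ that does not depend on any variable. Then, the whole expression can be rewritten as a problem $p_{\boldsymbol{\alpha}}$ whose objective function depends on the parameter vector $\boldsymbol{\alpha}$ and is the convex objective function of a standard least squares problem in the $\beta$ variables. In a similar way, it can be shown that by fixing $\boldsymbol{\beta}$ one also ends up with a convex optimization problem $p_{\boldsymbol{\beta}}$. Indeed, after fixing $\boldsymbol{\beta}$, the objective function is the squared norm of a vector whose components are affine functions of vector $\boldsymbol{\alpha}$ (see Section 3.3 for more details). These observations naturally lead to the formulation of an alternating least squares solution where one alternates between solving $p_{\boldsymbol{\alpha}}$ and $p_{\boldsymbol{\beta}}$. In Algorithm 1 we formalize this intuition into the PartLS-alt function where, after initializing $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ randomly, we iterate until some stopping criterion is satisfied (in our experiments we fixed a number of iterations, but one may want to stop the algorithm as soon as $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ do not change between two iterations). At each iteration we take the latest estimate for the $\alpha$ variables and solve the $p_{\boldsymbol{\alpha}}$ problem based on that estimate; we then keep the newly found $\beta$ variables and solve the $p_{\boldsymbol{\beta}}$ problem based on them. At each iteration the overall objective is guaranteed not to increase in value and, indeed, we prove that, if the algorithm is never stopped, the sequence of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ vectors found by PartLS-alt has at least one accumulation point and that all accumulation points are partial optima with the same function value. (A partial optimum of a function $f(\boldsymbol{\alpha}, \boldsymbol{\beta})$ is a point $(\boldsymbol{\alpha}^\star, \boldsymbol{\beta}^\star)$ such that $f(\boldsymbol{\alpha}^\star, \boldsymbol{\beta}^\star) \leq f(\boldsymbol{\alpha}, \boldsymbol{\beta}^\star)$ for all $\boldsymbol{\alpha}$ and $f(\boldsymbol{\alpha}^\star, \boldsymbol{\beta}^\star) \leq f(\boldsymbol{\alpha}^\star, \boldsymbol{\beta})$ for all $\boldsymbol{\beta}$.)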
Let $(\boldsymbol{\alpha}_i, \boldsymbol{\beta}_i)_i$ be the sequence of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ vectors found by PartLS-alt for the PartitionedLS problem and assume that the objective function is regularized as described in (5), then:
the sequence $(\boldsymbol{\alpha}_i, \boldsymbol{\beta}_i)_i$ has at least one accumulation point, and
all accumulation points are partial optima attaining the same value of the objective function.
The PartitionedLS problem is a biconvex optimization problem and Algorithm 1 is a specific instantiation of the Alternating Convex Search strategy (gorski07biconvex) to solve biconvex problems. Theorem 4.9 in (gorski07biconvex) implies that:
if the sequence $(\boldsymbol{\alpha}_i, \boldsymbol{\beta}_i)_i$ is contained in a compact set then it has at least one accumulation point, and
if for each accumulation point $(\boldsymbol{\alpha}^\star, \boldsymbol{\beta}^\star)$ of the sequence $(\boldsymbol{\alpha}_i, \boldsymbol{\beta}_i)_i$, either the optimal solution of the problem with fixed $\boldsymbol{\alpha}$ is unique, or the optimal solution of the problem with fixed $\boldsymbol{\beta}$ is unique; then all accumulation points are partial optima and have the same function value.
The first requirement is fulfilled in our case since $\boldsymbol{\alpha}$ is constrained by definition into $[0,1]^M$, while the regularization term prevents $\boldsymbol{\beta}$ from growing indefinitely. The second requirement is fulfilled since for fixed $\boldsymbol{\alpha}$ the optimization function is quadratic and strictly convex in $\boldsymbol{\beta}$. Hence, the solution is unique.
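The alternating scheme described in this section can be sketched as follows. This is an illustrative sketch rather than the authors' Algorithm 1: it omits the regularization term assumed by the theorem, it solves the two sub-problems with `numpy.linalg.lstsq` and `scipy.optimize.nnls` (the recasting discussed in Section 3.3), and the function name `partls_alt`, the fixed iteration count and the synthetic data are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def partls_alt(X, y, P, n_iter=30, seed=0):
    """Alternate a least squares step in beta with an NNLS step in alpha."""
    rng = np.random.default_rng(seed)
    group = P.argmax(axis=1)             # k[m]: group index of each feature
    counts = P.sum(axis=0)               # number of features in each group
    alpha = rng.random(P.shape[0])       # random start; beta is computed first
    alpha = alpha / (P.T @ alpha)[group]
    history = []
    for _ in range(n_iter):
        # beta step: ordinary least squares on the N x K matrix X (P ∘ alpha).
        Z = X @ (P * alpha[:, None])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        # alpha step: NNLS where column m of X is scaled by beta_{k[m]}.
        alpha, _ = nnls(X * beta[group], y)
        # Restore the sum-to-one constraint by moving the scale into beta.
        sums = P.T @ alpha
        nonzero = sums > 0
        beta = beta * sums
        alpha = np.where(nonzero[group],
                         alpha / np.where(nonzero, sums, 1.0)[group],
                         1.0 / counts[group])   # empty groups: uniform weights
        r = X @ (P * alpha[:, None]) @ beta - y
        history.append(float(r @ r))
    return alpha, beta, history

# Synthetic instance (illustrative): 4 features split into 2 groups.
rng = np.random.default_rng(1)
P = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
X = rng.normal(size=(60, 4))
y = X @ (P * np.array([0.3, 0.7, 0.5, 0.5])[:, None]) @ np.array([2.0, -1.0])
y = y + 0.1 * rng.normal(size=60)

alpha, beta, history = partls_alt(X, y, P)
```

The recorded objective values should be non-increasing, in line with the discussion above, but the returned point may be a partial optimum rather than the global one, so in practice one might restart from several random initializations.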
3.2 Reformulation as a set of convex subproblems
Here we show how the PartitionedLS problem can be reformulated as a new problem with binary variables which, in turn, can be split into a set of convex problems such that the smallest objective function value among all local (and global) minimizers of these convex problems is also the global optimum value of the PartitionedLS problem.
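The PartitionedLS-b problem is a PartitionedLS problem in which the $\beta$ variables are substituted by a binary variable vector $\mathbf{b} \in \{-1, 1\}^K$, and the normalization constraints over the $\alpha$ variables are dropped:

$$\begin{aligned} \text{minimize}_{\boldsymbol{\alpha}, \mathbf{b}} \quad & \|\mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha}) \times \mathbf{b} - \mathbf{y}\|_2^2 \\ \text{s.t.} \quad & \boldsymbol{\alpha} \succeq 0 \\ & \mathbf{b} \in \{-1, 1\}^K \end{aligned}$$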
The PartitionedLS-b problem turns out to be a Mixed Integer Nonlinear Programming (MINLP) problem with a peculiar structure. More specifically, we note that the above definition actually defines $2^K$ minimization problems, one for each of the possible instances of vector $\mathbf{b}$. Interestingly, each one of the minimization problems can be shown to be convex by the same argument used in Section 3.1 (for fixed $\beta$ variables) and we will prove that the minimum attained by minimizing all sub-problems corresponds to the global minimum of the original problem. We also show that by simple algebraic manipulation of the result found by a PartitionedLS-b solution, it is possible to write a corresponding PartitionedLS solution attaining the same objective.
The main breakthrough here derives from noticing that in the original formulation the $\beta$ variables are used to keep track of two facets of the solution: i) the magnitude and ii) the sign of the contribution of each subset in the partition of the variables. With the $\mathbf{b}$ vector keeping track of the signs, one only needs to reconstruct the magnitude of the contributions to recover the solution of the original problem.
The following theorem states the equivalence between the PartitionedLS and the PartitionedLS-b problem. More precisely, we will prove that for any feasible solution of one of the two problems, one can build a feasible solution of the other problem with the same objective function value, from which equality between the optimal values of the two problems immediately follows.
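Let $(\ddot{\boldsymbol{\alpha}}, \mathbf{b})$ be a feasible solution of the PartitionedLS-b problem. Then, there exists a feasible solution $(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}})$ of the PartitionedLS problem such that:

$$\|\mathbf{X} \times (\mathbf{P} \circ \ddot{\boldsymbol{\alpha}}) \times \mathbf{b} - \mathbf{y}\|_2^2 = \|\mathbf{X} \times (\mathbf{P} \circ \hat{\boldsymbol{\alpha}}) \times \hat{\boldsymbol{\beta}} - \mathbf{y}\|_2^2 \tag{6}$$

Analogously, for each feasible solution $(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}})$ of the PartitionedLS problem, there exists a feasible solution $(\ddot{\boldsymbol{\alpha}}, \mathbf{b})$ of the PartitionedLS-b problem such that (6) holds. Finally, $p^\star = p_b^\star$, where $p^\star$ and $p_b^\star$ denote, respectively, the optimal value of the PartitionedLS problem and of the PartitionedLS-b problem.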
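Let $(\ddot{\boldsymbol{\alpha}}, \mathbf{b})$ be a feasible solution of the PartitionedLS-b problem and let $\bar{\boldsymbol{\beta}}$ be a normalization vector containing in $\bar{\beta}_k$ the normalization factor for variables in partition subset $k$:

$$\bar{\boldsymbol{\beta}} = \mathbf{P}^T \times \ddot{\boldsymbol{\alpha}} = \left( \sum_{m \in P_k} \ddot{\alpha}_m \right)_k$$

Then, for each $m$ such that $\bar{\beta}_{k[m]} \neq 0$, we define $\hat{\alpha}_m$ as follows:

$$\hat{\alpha}_m = \frac{\ddot{\alpha}_m}{\bar{\beta}_{k[m]}}$$

while for any $m$ such that $\bar{\beta}_{k[m]} = 0$ we can define $\hat{\alpha}_m$, e.g., as follows:

$$\hat{\alpha}_m = \frac{1}{|P_{k[m]}|}$$

In fact, for any $k$ such that $\bar{\beta}_k = 0$, any definition of $\hat{\alpha}_m$ for $m \in P_k$ such that $\sum_{m \in P_k} \hat{\alpha}_m = 1$ would be acceptable. The $\hat{\boldsymbol{\beta}}$ vector can be reconstructed simply by taking the Hadamard product of $\mathbf{b}$ and $\bar{\boldsymbol{\beta}}$:

$$\hat{\boldsymbol{\beta}} = \mathbf{b} \circ \bar{\boldsymbol{\beta}}$$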
In order to prove (6), we only need to prove that
The equality is proved as follows:
where in between row 2 and row 3 we used the fact that and are two ways to write the same thing (the former using directly the partition number , and the latter using the notation to get the partition number from the feature number ). To be more precise, we only considered the case when for all . But the result can be easily extended to the case when for some , by observing that in this case the corresponding terms give a null contribution to both sides of the equality.
Now, let be a feasible solution of the PartitionedLS problem. Then, we can build a feasible solution for the PartitionedLS-b problem as follows. For any let:
while for each , let:
Equivalence between the objective function values at and is proved in a way completely analogous to what we have seen before.
Finally, the equivalence between the optimal values of the two problems is an immediate corollary of the previous parts of the proof. In particular, it is enough to observe that for any optimal solution of one of the two problems, there exists a feasible solution of the other problem with the same objective function value, so that both $p^\star \leq p_b^\star$ and $p^\star \geq p_b^\star$ hold, and, thus, $p^\star = p_b^\star$.
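The construction used in the proof (normalizing the group weights and moving their scale and sign into the group coefficients) can be sketched in a few lines. The function name `to_partitionedls` and the toy numbers are hypothetical; for groups whose weights are all zero the sketch assigns uniform weights, which is one of the acceptable choices mentioned above.

```python
import numpy as np

def to_partitionedls(alpha_b, b, P):
    """Map a PartitionedLS-b solution (alpha_b, b) to a PartitionedLS one."""
    group = P.argmax(axis=1)                 # k[m]
    counts = P.sum(axis=0)                   # group sizes
    beta_bar = P.T @ alpha_b                 # per-group normalization factors
    positive = beta_bar > 0
    safe = np.where(positive, beta_bar, 1.0)
    alpha_hat = np.where(positive[group],
                         alpha_b / safe[group],
                         1.0 / counts[group])  # zero groups: uniform weights
    beta_hat = b * beta_bar                  # sign from b, magnitude from beta_bar
    return alpha_hat, beta_hat

# Toy instance: groups {0, 1}, {2, 3}, {4}; the second group is all zeros.
P = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
alpha_b = np.array([2.0, 6.0, 0.0, 0.0, 3.0])
b = np.array([1.0, -1.0, -1.0])
alpha_hat, beta_hat = to_partitionedls(alpha_b, b, P)

# Both parametrizations should give the same predictions on any data matrix.
X = np.random.default_rng(0).normal(size=(10, 5))
pred_b = X @ (P * alpha_b[:, None]) @ b
pred_hat = X @ (P * alpha_hat[:, None]) @ beta_hat
```

Since the predictions coincide, so do the objective values, which is the content of the equality established in the theorem.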
The complete algorithm, which detects and returns the best solution of the PartitionedLS-b problems by iterating over all possible vectors $\mathbf{b}$, is implemented by the function PartLS-opt reported in Algorithm 2.
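The enumeration over all sign vectors can be sketched as follows. This is a hedged illustration and not the authors' Algorithm 2: each sub-problem is solved with `scipy.optimize.nnls` (anticipating the recasting of Section 3.3), no regularization is applied, and the function name `partls_opt` and the noiseless synthetic data are assumptions.

```python
import itertools
import numpy as np
from scipy.optimize import nnls

def partls_opt(X, y, P):
    """Solve one NNLS problem per sign vector b and keep the best one."""
    K = P.shape[1]
    group = P.argmax(axis=1)                     # k[m]
    best_obj, best_alpha, best_b = np.inf, None, None
    for signs in itertools.product((-1.0, 1.0), repeat=K):
        b = np.array(signs)
        # For fixed b, X (P ∘ alpha) b equals (X with column m scaled by b_{k[m]}) alpha.
        alpha, rnorm = nnls(X * b[group], y)
        if rnorm ** 2 < best_obj:
            best_obj, best_alpha, best_b = rnorm ** 2, alpha, b
    return best_obj, best_alpha, best_b

# Noiseless synthetic instance (illustrative): 4 features, 2 groups.
rng = np.random.default_rng(0)
P = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
X = rng.normal(size=(40, 4))
alpha_true = np.array([0.3, 0.7, 0.5, 0.5])
beta_true = np.array([2.0, -1.0])
y = X @ (P * alpha_true[:, None]) @ beta_true

best_obj, alpha_b, b = partls_opt(X, y, P)
```

On this instance the best sign vector should match the signs of the generating group coefficients, and the unnormalized weights should equal the generating weights scaled by the magnitude of their group coefficient. The loop runs over $2^K$ sub-problems, so the sketch is only practical for small partitions.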
When dealing with the PartitionedLS-b problem, the regularization term introduced for the objective function of the PartitionedLS problem, reported in (5), needs to be slightly updated so as to accommodate the differences in the objective function when used in Algorithm 2. In this second case, since the $\beta$ variables do not appear in the optimization problems obtained after fixing the different binary vectors $\mathbf{b}$, the regularization term $\eta \|\boldsymbol{\beta}\|_2^2$ is replaced by $\eta \|\mathbf{P}^T \times \boldsymbol{\alpha}\|_2^2$. We notice that since the new regularization term is still convex, it does not hinder the convexity of the optimization problems.
3.3 Numerical Stability
The optimization problems solved within Algorithms 1 and 2, despite being convex, are sometimes hard to solve due to numerical problems. General-purpose solvers often find the data matrix to be ill-conditioned and return sub-optimal results. In this section we show how to rewrite the problems so as to mitigate these difficulties. The main idea is to recast the minimization problems as standard least squares and non-negative least squares problems, and to employ efficient solvers for these specific problems rather than the general-purpose ones.
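We start by noticing that the minimization problem at line 1 of Algorithm 1 can be easily solved by a standard least squares algorithm since the expression $\mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha})$ computes to a constant matrix $\mathbf{X}'$ and the original problem simplifies to the ordinary least squares problem: $\text{minimize}_{\boldsymbol{\beta}} \|\mathbf{X}' \times \boldsymbol{\beta} - \mathbf{y}\|_2^2$.
As for the minimization problem at line 1 of the same algorithm, we notice that we can initially ignore the constraint $\mathbf{P}^T \times \boldsymbol{\alpha} = \mathbf{1}$. Without such constraint, the problem turns out to be a non-negative least squares problem. Indeed, we note that expression $\mathbf{X} \times (\mathbf{P} \circ \boldsymbol{\alpha}) \times \boldsymbol{\beta}$ can be rewritten as the constant matrix $\mathbf{X} \circ (\mathbf{P} \times \boldsymbol{\beta})^T$ multiplied by the vector $\boldsymbol{\alpha}$, so that the whole minimization problem could be rewritten as:

$$\begin{aligned} \text{minimize}_{\boldsymbol{\alpha}} \quad & \|(\mathbf{X} \circ (\mathbf{P} \times \boldsymbol{\beta})^T) \times \boldsymbol{\alpha} - \mathbf{y}\|_2^2 \\ \text{s.t.} \quad & \boldsymbol{\alpha} \succeq 0 \end{aligned} \tag{7}$$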
After such a problem has been solved, the solution of the problem including the constraint $\mathbf{P}^T \times \boldsymbol{\alpha} = \mathbf{1}$ can be easily obtained by dividing each $\alpha$ subset by a normalizing factor and multiplying the corresponding $\beta$ variable by the same normalizing factor (it is the same kind of operation we exploited in Section 3.2; in that context the normalizing factors were denoted with $\bar{\boldsymbol{\beta}}$).
which, again, is a non-negative least squares problem.
As previously mentioned, by rewriting the optimization problems as described above and by employing special-purpose solvers for the least squares and the non-negative least squares problems, solutions appear to be more stable and accurate.
Many non-negative least squares solvers do not admit an explicit regularization term. An -regularization term equivalent to can be implicitly added by augmenting the data matrix with additional rows. The trick is done by setting all the additional to and the -th additional row as follows:
When the additional -th row and the additional are plugged into the expression inside the norm in (7), the expression evaluates to:
which reduces to when squared and summed over all the as a result of the evaluation of the norm.
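The row-augmentation trick can be checked numerically. Since the exact definition of the additional rows is not reproduced here, the sketch below makes an explicit assumption: the additional rows are taken to be $\sqrt{\eta}\,\mathbf{P}^T$ (one row per group) and the additional labels are zeros, which yields a penalty of the form $\eta \|\mathbf{P}^T \times \boldsymbol{\alpha}\|_2^2$ for any fixed sign vector. The function name `augment` and the test data are hypothetical.

```python
import numpy as np

def augment(X, y, P, eta):
    """Append K rows sqrt(eta) * P^T to X and K zeros to y (assumed construction)."""
    X_aug = np.vstack([X, np.sqrt(eta) * P.T])
    y_aug = np.concatenate([y, np.zeros(P.shape[1])])
    return X_aug, y_aug

rng = np.random.default_rng(0)
P = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
group = P.argmax(axis=1)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
eta = 0.5
b = np.array([1.0, -1.0])          # a fixed sign vector
alpha = rng.random(4)              # any nonnegative weight vector

X_aug, y_aug = augment(X, y, P, eta)

# Objective of the augmented, unregularized NNLS problem.
r_aug = (X_aug * b[group]) @ alpha - y_aug
augmented = float(r_aug @ r_aug)

# Objective of the original problem plus the explicit penalty.
r = (X * b[group]) @ alpha - y
penalized = float(r @ r) + eta * float((P.T @ alpha) @ (P.T @ alpha))
```

Under this assumption the two objective values coincide for every nonnegative weight vector, so a plain non-negative least squares solver applied to the augmented data minimizes the regularized objective.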
3.4 An alternative branch-and-bound approach
Algorithm 2 is based on a complete enumeration of all possible $2^K$ vectors $\mathbf{b}$. Of course, such an approach becomes too expensive as soon as $K$ gets large. As already noted, PartLS-opt is by design well suited for small $K$ values, so that complete enumeration should be a valid option most of the time. However, for the sake of completeness, in this section we discuss a branch-and-bound approach, based on implicit enumeration, which could be employed as $K$ gets large. Pseudo-code detailing the approach is reported in Algorithm 3.
First, we remark that the PartitionedLS-b problem can be reformulated as follows
where we notice that vector and the nonnegativity constraints have been eliminated, and replaced by the new constraints, which impose that for any , all variables such that must have the same sign. The new problem is a quadratic one with a convex quadratic objective function and simple (but non-convex) bilinear constraints. We note that, having removed the variables, the scalar objective does not need the distinction between groups anymore and it can be rewritten as or, in matrix form, as . Hence, we can reformulate the problem as follows
where , , and . Different lower bounds for this problem can be computed. The simplest one is obtained by simply removing all the constraints, which results in an unconstrained convex quadratic problem. A stronger, but more costly, lower bound can be obtained by solving the classical semidefinite relaxation of quadratic programming problems. First, we observe that problem (9) can be rewritten as follows (see (Shor87))
where . Next, we observe that the equality constraint is equivalent to requiring that is a psd (positive semidefinite) matrix and is of rank one. If we remove the (non-convex) rank one requirement, we end up with the following convex relaxation of (9) requiring the solution of a semidefinite programming problem:
Note that by Schur complement, constraint “” is equivalent to the following semidefinite constraint:
No matter which problem we solve to get a lower bound, after having solved it we can consider the vector of the optimal values of the variables at its optimal solution and we can compute the following quantity for each
If for all , then the optimal solution of the relaxed problem is feasible and also optimal for the original problem (9) and we are done. Otherwise, we can select an index such that (e.g., the largest one, corresponding to the largest violation of the constraints), and split the original problem into two subproblems, one where we impose that all variables , , are nonnegative, and the other where we impose that all variables , , are nonpositive. Lower bounds for the new subproblems can be easily computed by the same convex relaxations employed for the original problem (9), but with the additional constraints. The violations are computed also for the subproblems and, in case one of them is strictly positive, the corresponding subproblem may be further split into two further subproblems, unless its lower bound becomes at some point larger than or equal to the current global upper bound of the problem, which is possibly updated each time a new feasible solution of (9) is detected. As previously commented, Algorithm 3 provides a possible implementation of the branch-and-bound approach. More precisely, Algorithm 3 is an implementation where nodes of the branch-and-bound tree are visited in a depth-first manner. An alternative implementation is, e.g., the one where nodes are visited in a lowest-first manner, i.e., the first node to be visited is the one with the lowest lower bound.
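A depth-first branch-and-bound along the lines described above can be sketched as follows. This is not the authors' Algorithm 3: lower bounds come from the simplest relaxation (dropping the same-sign constraints of the groups not yet fixed) solved with `scipy.optimize.lsq_linear`, and the violation measure in `violations` is a hypothetical choice made for illustration. A brute-force enumeration is included only as a reference for the toy instance.

```python
import itertools
import numpy as np
from scipy.optimize import lsq_linear, nnls

def violations(z, P):
    """Per-group violation (hypothetical measure): zero iff all signs agree."""
    pos = (P * np.maximum(z, 0.0)[:, None]).max(axis=0)
    neg = (P * np.maximum(-z, 0.0)[:, None]).max(axis=0)
    return np.minimum(pos, neg)

def branch_and_bound(X, y, P, tol=1e-9):
    group = P.argmax(axis=1)
    best_val, best_z = np.inf, None
    stack = [np.zeros(P.shape[1])]        # per group: 0 free, +1 / -1 fixed sign
    while stack:
        signs = stack.pop()               # depth-first visit
        s = signs[group]
        lb = np.where(s > 0, 0.0, -np.inf)
        ub = np.where(s < 0, 0.0, np.inf)
        res = lsq_linear(X, y, bounds=(lb, ub))
        bound = 2.0 * res.cost            # res.cost is 0.5 * ||X z - y||^2
        if bound >= best_val:
            continue                      # prune: cannot improve the incumbent
        nu = violations(res.x, P)
        nu[signs != 0] = 0.0              # fixed groups are feasible by construction
        if nu.max() <= tol:
            best_val, best_z = bound, res.x   # relaxed solution is feasible
            continue
        k = int(nu.argmax())              # branch on the most violated group
        for sgn in (1.0, -1.0):
            child = signs.copy()
            child[k] = sgn
            stack.append(child)
    return best_val, best_z

def brute_force(X, y, P):
    """Reference value: enumerate all sign vectors and solve NNLS for each."""
    group = P.argmax(axis=1)
    return min(nnls(X * np.array(b)[group], y)[1] ** 2
               for b in itertools.product((-1.0, 1.0), repeat=P.shape[1]))

rng = np.random.default_rng(0)
P = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -0.5, 2.0, 1.0]) + 0.05 * rng.normal(size=50)

bb_val, bb_z = branch_and_bound(X, y, P)
ref_val = brute_force(X, y, P)
```

The generating weights deliberately have mixed signs in the first group, so the root relaxation is infeasible and at least one branching step is needed. Replacing the bound computation with the semidefinite relaxation discussed above would give tighter bounds at a higher cost per node.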
In this section we establish the theoretical complexity of the PartitionedLS-b problem. In view of reformulation (8), it is immediately seen that the cases where for all , are polynomially solvable. Indeed, in this situation problem (8) becomes unconstrained and has a convex quadratic objective function. Here we prove that as soon as we move from to , the problem becomes NP-complete. We prove this by showing that each instance of the NP-complete subset sum problem (see, e.g., (garey1979computers)) can be transformed in polynomial time into an instance of problem (8). We recall that the subset sum problem is defined as follows. Let be a collection of positive integers. We want to establish whether there exists a partition of this set of integers into two subsets such that the sums of the integers belonging to the two subsets are equal, i.e., whether there exist such that:
Now, let us consider an instance of problem (8) with partitions and two variables and for each partition (implying ). The data matrix and vector have rows defined as follows (when and are not restricted, they are assumed to vary on and respectively):
When the values so defined are plugged into problem (8) we obtain:
We prove the following theorem, which states that an instance of the subset sum problem (11) can be solved by solving the corresponding instance (12) of problem (8), and, thus, establishes NP-completeness of the PartitionedLS-b problem.
As a first step we derive the optimal solutions of the following restricted two-dimensional problems for :
This problem admits at least one global minimizer since its objective function is strictly convex quadratic. Global minimizers should be searched for among regular KKT points and irregular points. Regular points are those that fulfill a constraint qualification. In particular, in this problem all feasible points, except the origin, fulfill the constraint qualification based on the linear independence of the gradients of the active constraints. This is trivially true since there is a single constraint and the gradient of such constraint is null only at the origin. Thus, the only irregular point is the origin. In order to detect the KKT points, we first write down the KKT conditions:
where is the Lagrange multiplier of the constraint. We can enumerate all KKT points of problem (13). By summing up the first two equations, we notice that
must hold. This equation is satisfied if:
either , which implies , in view of . As previously mentioned, the origin is the unique irregular point. So, it is not a KKT point but when searching for the global minimizer, we need to compute the objective function value also at such point and this is equal to ;
or , which implies, in view of the complementarity condition, that , and, after substitution in the first two equations, we have the two KKT points