In many regression applications, the response is multivariate. If all of the components of this response were numerical, then the standard multivariate response linear regression model could be used. If some response components are categorical, then it is unclear what should be done. In this article, we develop a method for multivariate response regression when all of the components of the response are categorical. For example, given the gene expression profile of a patient with cancer originating in the kidney, a practitioner may want to predict both the cancer type (chromophobe, renal clear cell carcinoma, or renal papillary cell carcinoma) and five-year mortality risk (high or low). To simplify matters, we will focus on the bivariate categorical response regression model, but as discussed in a later section, our developments can be generalized to settings with arbitrarily many categorical response variables.
1.1 Bivariate categorical response regression model
Let be the random bivariate categorical response with numerically-coded support
when the explanatory variables have values in the vector with its first entry set to one. Existing work on this problem proposed and analyzed links between and the multivariate distribution of the response (McCullagh and Nelder, 1989; Glonek and McCullagh, 1995). For reasons to be discussed, we consider the simple link defined by
where is a three-dimensional array of unknown regression coefficients and is the regression coefficient vector corresponding to the response category pair . This model can be expressed as a univariate multinomial logistic regression model for the categorical response where has numerically coded support ;
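To make the link concrete, the following is a minimal numpy sketch of the softmax form of this model. It is entirely our own illustration: the function name `joint_probs` and the column ordering of the coefficient array are assumptions, not the paper's notation. Each category pair gets its own linear predictor, and the joint probability table is a softmax over all pairs.

```python
import numpy as np

def joint_probs(B, x, J, K):
    """P(Y = j, Z = k | x) as a J x K table under a softmax link.

    B : (p, J*K) coefficient array, one column per category pair (j, k)
        (the column ordering j*K + k is our own convention);
    x : (p,) predictor vector whose first entry is one (the intercept).
    """
    eta = x @ B                  # one linear predictor per category pair
    eta = eta - eta.max()        # subtract the max for numerical stability
    w = np.exp(eta)
    return (w / w.sum()).reshape(J, K)
```

Because the probabilities are a softmax, any constant shift of all linear predictors leaves the table unchanged; this is the identifiability issue revisited in Section 2.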
Many methods exist for penalized (univariate response) multinomial logistic regression. For example, Zhu and Hastie (2004) proposed a ridge-penalized multinomial logistic regression model, and later, Vincent and Hansen (2014) proposed to use a group-lasso penalty on rows of the unknown regression coefficient matrix. The latter approach allows for variable selection since implies that the
th predictor does not affect the response category probabilities. Simon et al. (2013) studied the sparse group-lasso from a computational perspective: the multinomial logistic regression model fits neatly into their computational framework.
Other recent methods for fitting the multinomial logistic regression model rely on dimension reduction rather than variable selection. Powers et al. (2018) proposed a nuclear norm penalized multinomial logistic regression model, which could be characterized as a generalization of the stereotype model of Anderson (1984). Price et al. (2019) penalized the Euclidean norm of pairwise differences of regression coefficient vectors for each category, which encourages fitted models for which estimated probabilities are identical for some categories.
While these methods can perform well in terms of prediction and interpretability for multinomial logistic regression, if applied to the multivariate categorical regression model, none would exploit the fact that is constructed using two distinct response variables. One could fit two separate multinomial logistic regression models, but this would fail to exploit the association between the two responses unless they were conditionally independent. Thus, there is a need to develop a new penalized likelihood framework for fitting (1) that exploits the multivariate response. Our proposed method does this, produces interpretable fitted models, and applies when , , and are large.
1.2 Parsimonious parametric restrictions
We propose two parametric restrictions to reduce the number of parameters in (1) and incorporate the special structure of the bivariate response. The first assumes that only a subset of the predictors are relevant in the model. Specifically, if for any constant and matrix of ones , then a change in the th predictor’s value does not affect the response’s joint probability mass function, i.e. the th predictor is irrelevant. By setting , it is immediate to see that imposing sparsity of the form , where is an estimator of , is a natural way to achieve variable selection of this kind. This restriction may be helpful when there are many predictors.
The second restriction we consider is that a subset of the predictors can only affect the two marginal distributions of the response: and
. Specifically, the joint distribution of the response is determined by its local odds ratios:
and its two marginal distributions and (Agresti, 2002). We suppose that changes to a subset of the entries in do not affect the odds ratios in (2), so they can only affect the marginal distributions of the response (or be irrelevant).
Suppose, for the moment, that and . The log odds ratio is then
If the th element of the vector were zero, then changes to the th element of would not affect the odds ratio, so the th predictor can only affect the marginal distributions of the response. Let and let be the matricized version of with
One can see that if the th predictor only affects the marginal distributions of the response (or is irrelevant), then the log odds ratio
must have th component equal to zero for all .
When or , we can express the logarithm of all of the local odds ratios in (2) in terms of a constraint matrix . For example, when and ,
and the vector of local log odds ratios is . If the vector , then the th predictor can only affect the marginal distributions of the response.
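The local odds ratios above are easy to compute directly from a probability table. Below is a short numpy sketch (our own illustration; the function name is an assumption) that forms all local log odds ratios of a table: each entry contrasts a 2 x 2 block of adjacent cells, and under independence every local log odds ratio vanishes.

```python
import numpy as np

def local_log_odds(P):
    """All (J-1) x (K-1) local log odds ratios of a J x K probability table.

    Entry (j, k) is log P[j,k] - log P[j,k+1] - log P[j+1,k] + log P[j+1,k+1],
    i.e. the log odds ratio of the 2 x 2 block of adjacent cells.
    """
    L = np.log(P)
    return L[:-1, :-1] - L[:-1, 1:] - L[1:, :-1] + L[1:, 1:]
```

For an independence table (an outer product of two marginal distributions), the output is identically zero, matching the fact that the marginals plus the local odds ratios determine the joint distribution.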
We propose to fit the model (1) by penalized likelihood. We add a group-lasso penalty that is non-differentiable when the optimization variable corresponding to is such that for . This encourages solutions for which for some , so that some of the predictors are estimated to only affect the marginal distributions of the response. We also add a second group-lasso penalty that is non-differentiable when the optimization variable corresponding to is such that for . This has the effect of removing predictors from the model entirely.
1.3 Alternative parameterizations
where , , and are unknown coefficient vectors. This parameterization and generalizations were also discussed by Glonek and McCullagh (1995). We found penalized likelihood optimization based on (3) to be much more difficult than our method based on (1). See Qaqish and Ivanova (2006) for more on the computational challenges involved with parameterizations like (3). In addition, the parameterization (1) allows us to establish asymptotic properties of our estimator using results from Bach (2010). Nonetheless, we consider penalized likelihood methods for estimating , and a promising direction for future research.
1.4 Multi-label classification methods
The problem of fitting multivariate categorical response regression models is closely related to the problem of “multi-label” classification. In the computer science literature, multi-label classification refers to the task of predicting many categorical (most often, binary) response variables from a common set of predictors. One of the most popular classes of methods for fitting the multivariate binary response regression model is the so-called “binary relevance” approach, which effectively fits a separate model for each of the categorical response variables. Loss functions used for fitting these models, however, often take the classification accuracy of all response variables into account, so in this sense, these methods make use of the multivariate response. For a comprehensive review of binary relevance in multi-label classification, see Zhang et al. (2018).
Naturally, there are many extensions of the binary relevance approach which account for dependence between the response variables (Montañes et al., 2014)
. One such approach is based on “classifier chains” (Read et al., 2009), which can be described as fitting successive (univariate) categorical response regression models, where in each successive model, the categorical responses from the previously fit models are included as predictors. For example, in the bivariate categorical response case, one would fit two models in sequence: and so that one could then predict from some new value of , say , using (i) and then predict from and the predicted value of using (ii). “Nested stacking” uses a similar approach, except that when fitting (ii), one replaces with its predicted value from (i) (Senge et al., 2013).
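To illustrate the classifier-chain idea in the bivariate binary case, here is a minimal numpy sketch. It is entirely our own illustration: the simple gradient-descent logistic fitter and the simulated data are assumptions, not part of any cited method. Model (i) predicts the first response from the predictors alone; model (ii) predicts the second response from the predictors augmented with the first response.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=500):
    # plain gradient-descent logistic regression (binary y in {0, 1});
    # a stand-in for any univariate categorical response model
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(w, X):
    return (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)

# simulated data where the second response depends on the first
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y1 = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
y2 = ((X[:, 1] + y1) > 0.5).astype(int)

# the chain: model (i) uses X; model (ii) uses X augmented with y1
w1 = fit_logistic(X, y1)
w2 = fit_logistic(np.column_stack([X, y1]), y2)

# prediction at a new x: predict y1 first, then feed it into model (ii)
x_new = np.array([[1.0, 1.0, 0.0]])
y1_hat = predict(w1, x_new)
y2_hat = predict(w2, np.column_stack([x_new, y1_hat]))
```

scikit-learn's `ClassifierChain` (in `sklearn.multioutput`) implements this same pattern with arbitrary base classifiers.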
Many methods other than those based on binary relevance exist, e.g., see the review paper by Tsoumakas and Katakis (2007). In general, however, these methods are not model-based, nor do they target both prediction accuracy and interpretability of fitted models, as our proposed methodology does.
2 Penalized likelihood for bivariate categorical response regression
We assume that we have observed the result of independent multinomial experiments. Let be the values of the explanatory variables for the th subject and let
be the observed response category counts for the th subject . The model assumes that is a realization of
The negative log-likelihood function, up to constants, evaluated at is
Without loss of generality, we set for all .
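As a concrete companion to the likelihood above, here is a numpy sketch of the multinomial negative log-likelihood for count data, divided by the number of subjects. The function name and array layout are our own conventions, not the paper's notation; the log-sum-exp step guards against overflow.

```python
import numpy as np

def neg_loglik(B, X, Y):
    """Multinomial negative log-likelihood (up to constants), divided by n.

    X : (n, p) design matrix whose first column is ones;
    Y : (n, J*K) observed category counts for each subject;
    B : (p, J*K) coefficient matrix (one column per category pair).
    """
    eta = X @ B
    eta = eta - eta.max(axis=1, keepdims=True)   # log-sum-exp stabilization
    logp = eta - np.log(np.exp(eta).sum(axis=1, keepdims=True))
    return -(Y * logp).sum() / X.shape[0]
```

A quick sanity check: with all coefficients zero, every category pair gets probability 1/(JK), so one trial per subject gives a value of log(JK).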
To discover the parsimonious structure described in Section 1.2, we propose the penalized maximum likelihood estimator,
where are user-specified tuning parameters; and is the Euclidean norm of a vector. As , the estimator in (5) becomes equivalent to fitting separate multinomial logistic regression models to each of the categorical response variables. Conversely, as , (5) tends towards the group lasso penalized multinomial logistic regression estimator for response . Throughout, will be used to denote (5).
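The structure of the two penalties can be sketched in a few lines of numpy. This is our own hedged illustration of the description above, not the paper's code: `Dbar` stands for the matrix of log-odds contrasts, `lam1` and `lam2` for the two tuning parameters, and row 0 for the unpenalized intercept.

```python
import numpy as np

def penalty(B, Dbar, lam1, lam2):
    """The two group-lasso penalties, summed over non-intercept rows j:
      lam1 * ||B[j] @ Dbar||_2  shrinks row j's log odds ratio contrasts
                                (zero => predictor j affects only marginals);
      lam2 * ||B[j]||_2         removes predictor j from the model entirely.
    Row 0 (the intercept) is left unpenalized."""
    rows = B[1:]
    return (lam1 * np.linalg.norm(rows @ Dbar, axis=1).sum()
            + lam2 * np.linalg.norm(rows, axis=1).sum())
```

Both terms are Euclidean norms of linear maps of each row, so the penalty is convex and positively homogeneous, and it is non-differentiable exactly where a row (or its vector of contrasts) is zero, which is what produces the two kinds of sparsity.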
The matrix used in (5) is distinct from the matrix introduced in Section 1.2. Specifically, the matrix is constructed by appending additional, linearly dependent columns to so that all log odds ratios are penalized. If we had instead used the matrix in (5), our estimator would depend on which log odds ratios the columns of correspond to. Thus, using the matrix avoids this issue and penalizes all possible log odds ratios equivalently. In the case that , it is trivial to see . In the case that and , for example, the additional column of would be the second column minus the first column of . Using this matrix , implies , and thus, implies that the th predictor can only affect the marginal distributions of the response variables.
In addition to encouraging variable selection, the second penalty on the ’s leads to a (practically) unique solution. If , the solution to (5) is not unique because for any vector , any minimizer of (5), say , has the same objective function value as . When , a minimizer is (practically) unique: , the intercept, is non-unique, but the submatrix excluding the intercept is unique. This follows from the fact that for a minimizer , for all , , otherwise cannot be the solution to (5). Making the intercept unique is trivial: one could simply impose the additional constraint that , in which case (5) would be entirely unique. A similar argument about uniqueness in penalized multinomial logistic regression models was used in Powers et al. (2018).
3 Statistical properties
We study the statistical properties of (5) with , , , and varying. We focus our attention on settings where the predictors are non-random. To simplify notation, we study the properties of a version of our estimator when the intercept is also penalized. Define the matrix throughout. For a given set , let denote the submatrix of including only rows whose indices belong to where denotes the cardinality of . Finally, let denote the squared Frobenius norm of a matrix .
To establish an error bound, we must first define our target parameter, i.e., the value of from (1) for which our estimator is consistent. As described in the previous section, for any which leads to (1), also leads to for any . Let the set denote the set of all which lead to (1), i.e.,
Then, with , we define our estimation target as . Notice that we could equivalently write
where denotes the negative log-likelihood divided by , and . Since the optimization in (6) is over feasible set , and because and are equivalent for all elements of , is simply the element of which minimizes , i.e., (6) does not depend on the data or tuning parameters. By the same argument used to describe the uniqueness of (5) in the previous section, is unique.
We will require the following assumptions:
A1. The responses are independent and generated from (4) for
A2. The predictors are normalized so that for .
We will also require the definition of a number of important quantities and sets. Let denote the subset of such that and ; let denote the subset of such that and ; and let . The sets , , and denote the set of predictors which affect the log odds ratios, affect only the marginal probabilities, and are irrelevant, respectively. Let denote this partition of into and . Next, with , we define the set
In the Appendix, we show that when the tuning parameters and are chosen as prescribed in Theorem 1, belongs to the set with high probability. This set is needed to establish our third assumption, A3. Let denote the version of that takes matrix variate inputs, and let denote the Hessian of with respect to the vectorization of its argument.
(Restricted eigenvalue) For all and , there exists a constant such that
Assumption A3 is effectively a restricted eigenvalue condition; such conditions appear often in the penalized maximum likelihood estimation literature (Raskutti et al., 2010). If we assumed that the probabilities for all class combinations were bounded below by a positive constant over all , would be proportional to the restricted eigenvalue for least squares estimators, i.e., .
We must also define the following subspace compatibility constant (Negahban et al., 2012) which we write as
The quantity measures the magnitude of the log odds penalty over the set of matrices with Frobenius norm no greater than one, where and for a subset of with . Note that only the cardinality of affects . In the following remark, we provide an upper bound on .
For all and , and every set ,
With assumptions A1-A3, we are ready to state our main result, which will depend on Condition 1, detailed below.
Suppose assumptions A1-A3 hold and let , , , and be fixed constants. Let and . If , , and Condition 1 holds, i.e., is sufficiently large, then
with probability at least .
The proof of Theorem 1, which can be found in the Appendix, relies on the generalized self-concordance (Bach, 2010) of the multinomial negative log-likelihood. In our proof, we have an exact condition on the magnitude of needed for the result of Theorem 1 to hold.
Let . Given fixed constants , , and , is sufficiently large so that , where with as specified in Theorem 1.
The bound in Theorem 1 illustrates the effects of both the group-lasso penalty and the penalty corresponding to the log odds ratios. In particular, corresponds to having to estimate total nonzero rows of , whereas the additional term comes from shrinking the rows of which do not satisfy .
The constants balance the magnitude of and relative to . In doing so, they affect the error bound by scaling and , but also by controlling the restricted eigenvalue . Specifically, controls the set : for example, if were small, a larger would mean a larger . Hence, the optimal choice of would be that which increases relative to the magnitude of
The result of Theorem 1 also demonstrates that in the case that the two response variables are truly independent, or that no predictors affect only the marginal distributions, our estimator still achieves a near oracle rate of convergence.
Suppose the conditions of Theorem 1 hold, , and . If , i.e., no predictors affect the log odds, , and there exists a constant such that for all , then
Under the same conditions, if instead, , i.e., predictors are only irrelevant or affect log odds ratios, then
The result of Corollary 1 agrees with what one would expect under the scenarios considered. First, if the two responses were truly independent () one has to estimate nonzero coefficients per model and loses the factor of since . Conversely, if , we pay the price of for inappropriately biasing our estimates towards models assuming independence.
In this section, we propose a proximal gradient descent algorithm (Parikh and Boyd (2014), Chapter 4) to compute (5). Throughout, we treat and as fixed. We let denote the objective function from (5) evaluated at with tuning parameter pair , and recall that denotes the negative log-likelihood (divided by ). In the following subsection, we describe our proposed proximal gradient descent algorithm at a high level, and in the subsequent subsection, we describe how to solve the main subproblem in our iterative procedure.
4.2 Accelerated proximal gradient descent algorithm
Proximal gradient descent is a first order iterative algorithm which generalizes gradient descent. As in gradient descent, to obtain the th iterate of our algorithm, we must compute the gradient of evaluated at the th iterate . Letting
for the gradient can be expressed as
One way to motivate our algorithm is through an application of the majorize-minimize principle. Specifically, since the negative log-likelihood is convex and has Lipschitz continuous gradient (Powers et al., 2018), we know
for all and with some sufficiently small step size . From (8), it follows that
for all with equality when . That is, the right hand side of the above is a majorizing function of at . Hence, if we obtain the th iterate of with
the majorize-minimize principle (Lange, 2016) ensures that . Thus, to solve (5), we propose to iteratively solve (9). It is well known that the sequence of iterates generated by an accelerated version of this procedure (see Algorithm 1) converges to the optimal objective value at a quadratic rate when is fixed to be smaller than with being the Lipschitz constant of . For example, see Beck and Teboulle (2009) or Section 4.2 of Parikh and Boyd (2014) and references therein.
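The accelerated scheme can be sketched generically in a few lines. The code below is a hedged illustration of FISTA-style momentum, not the paper's Algorithm 1: `grad` and `prox` stand for the gradient of the smooth loss and the proximal step solving a (9)-type subproblem, and the toy problem at the end is one whose exact minimizer (soft-thresholding) is known.

```python
import numpy as np

def fista(grad, prox, x0, step, iters=200):
    """Accelerated proximal gradient descent (FISTA-style momentum)."""
    x, z, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = prox(z - step * grad(z), step)          # (9)-style update
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum step
        x, t = x_new, t_new
    return x

# toy problem: minimize 0.5*||x - b||^2 + lam*||x||_1,
# whose exact minimizer is the soft-thresholding of b
b, lam = np.array([3.0, -0.5, 0.2]), 1.0
grad = lambda x: x - b
prox = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s * lam, 0.0)
xhat = fista(grad, prox, np.zeros(3), step=1.0)
```

The same loop applies to (5) once the row-wise proximal step of Section 4.3 is plugged in for `prox`.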
After some algebra, we can write (9) as
Fortunately, (10) can be solved efficiently row-by-row of . In particular, this problem can be split into separate optimization problems since for ,
where denotes the th row of . For the intercept (i.e., ), the solution has a simple closed form:
Then, it is straightforward to see that for , each of the subproblems in (11) can be expressed as
where corresponds to a row of , , and . In the following subsection, we show that this problem has a closed form solution, making its computation extremely efficient.
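The simplest instance of such a row-wise proximal update, when only the Euclidean-norm penalty is active, is block soft-thresholding. The sketch below is our own illustration of that special case (the full subproblem (12), with both penalties, is handled by Theorem 2 in the next subsection): rows whose norm falls below the threshold are set exactly to zero, and survivors are shrunk toward the origin.

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_2 (block soft-thresholding).
    Vectors with Euclidean norm at most tau map exactly to zero
    (the screening behavior); survivors are shrunk toward the origin."""
    nrm = np.linalg.norm(v)
    if nrm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / nrm) * v
```

This is exactly the mechanism by which the penalty removes a predictor from the model: the whole coefficient row is zeroed in a single closed-form step.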
4.3 Efficient computation of subproblem (12)
We now describe how to compute (12). Our first theorem reveals that can be obtained in essentially closed form. Throughout, let denote the Moore-Penrose pseudoinverse of a matrix .
A proof of Theorem 2 can be found in the Appendix. The result suggests that we can first screen all rows of , as we know that those with Euclidean norm less than will have minimizer . Of the rows that survive this screening, we need to apply the result from either or . Based on the statement of Theorem 2, is immediate and does not require any optimization. Regarding , there is no analytic expression for which would satisfy for arbitrary and . However, it turns out that the structure of our yields a closed form expression for , which we detail in the following proposition.
Together, Theorem 2 and Proposition 1 verify that we can solve (12) in a closed form. Since the singular value decomposition of , , and can be precomputed and stored, these updates are extremely efficient to compute. To provide further intuition about the result of Theorem 2, we provide the closed form solution for this setting which covers and in the case where
(Solutions for (12) with ) Suppose so that . Let . Then, with
it follows that
We propose to iteratively update using (9) where we solve the subproblems using the result of Theorem 2. This approach is especially efficient for large and moderately sized and since each of the subproblems involves a -dimensional optimization variable. In practice, when the tuning parameter is relatively large, of Theorem