Semi-parametric Order-based Generalized Multivariate Regression

02/19/2016, by Milad Kharratzadeh, et al.

In this paper, we consider a generalized multivariate regression problem where the responses are monotonic functions of linear transformations of predictors. We propose a semi-parametric algorithm based on the ordering of the responses which is invariant to the functional form of the transformation function. We prove that our algorithm, which maximizes the rank correlation of responses and linear transformations of predictors, is a consistent estimator of the true coefficient matrix. We also identify the rate of convergence and show that the squared estimation error decays with a rate of o(1/√(n)). We then propose a greedy algorithm to maximize the highly non-smooth objective function of our model and examine its performance through extensive simulations. Finally, we compare our algorithm with traditional multivariate regression algorithms over synthetic and real data.


1 Problem Setup

In linear multivariate regression, we have the following model:

$y = \Theta^T x + \epsilon, \qquad (1)$

where $y \in \mathbb{R}^q$ is the response vector, $x \in \mathbb{R}^p$ is the predictor vector, $\Theta \in \mathbb{R}^{p \times q}$ is the coefficient matrix, and $\epsilon \in \mathbb{R}^q$ represents the noise with i.i.d. elements that are independent of $x$. In this paper, we consider the following extension of this problem:

$y = f(\Theta^T x) + \epsilon, \qquad (2)$

where $f$ is a non-degenerate monotonic function called the utility or link function. When the input of $f$ is a vector or a matrix, it is implied that $f$ is applied separately to each individual element to give the output, which is a vector or matrix of the same size as the input. Without loss of generality, we assume that $f$ is an increasing function. We propose a semi-parametric, rank-based approach to estimate $\Theta$ which is invariant with respect to the functional form of the link functions. Our approach only uses the ordering of the elements of $y$, which makes it more robust to outliers and heavy-tailed noise compared to traditional regression algorithms. This also makes our approach applicable to cases where the numeric values of the responses are not available, and only their ordering is known.

We show that it is possible to consistently estimate $\Theta$ solely based on the ordering of the elements of $y$. Our approach to estimating $\Theta$ is based on maximizing Kendall's rank correlation of $y$ and $\Theta^T x$. For notational simplicity, we assume that all the link functions are equal and denote them by $f$; however, all the results presented in this paper hold for the case where there is a separate link function, $f_i$, for each observation. Let us rewrite (2) in matrix form:

$Y = f(X\Theta) + E, \qquad (3)$

where $p$ is the number of predictors, $q$ is the number of responses, and $n$ denotes the number of instances. $x_i^T$, $y_i^T$, and $\epsilon_i^T$ correspond, respectively, to the $i$-th rows of $X \in \mathbb{R}^{n \times p}$, $Y \in \mathbb{R}^{n \times q}$, and $E \in \mathbb{R}^{n \times q}$. To find $\Theta$, we propose to solve:

$\max_{\Theta} \ S_n(\Theta) := \frac{1}{n} \sum_{i=1}^{n} \sum_{j,k} \mathbb{1}\{y_{ij} > y_{ik}\}\, \mathbb{1}\{x_i^T \theta_j > x_i^T \theta_k\}, \qquad (4)$

where $\theta_j$ denotes the $j$-th column of $\Theta$. The intuition behind this formulation is that since $f$ is increasing and the error is i.i.d. and independent of $x$, when we have $x_i^T \theta_j > x_i^T \theta_k$, it is more likely to have $y_{ij} > y_{ik}$ than the other way around. The term in the summation is zero for $j = k$. Maximizing $S_n(\Theta)$ is equivalent to maximizing the average rank correlation of $y_i$ and $\Theta^T x_i$, since $S_n(\Theta)$ corresponds to the average over the observations of the Kendall rank correlation between $y_i$ and $\Theta^T x_i$.
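To make the formulation concrete, here is a minimal Python sketch, assuming the notation above ($Y = f(X\Theta) + E$, with $\theta_j$ the $j$-th column of $\Theta$); the function and variable names are illustrative, not part of the paper. It simulates data from the model and evaluates the pairwise rank-agreement objective $S_n(\Theta)$.

import numpy as np

def rank_correlation_objective(Theta, X, Y):
    """Empirical objective S_n(Theta): pairwise rank agreement (per instance,
    across responses) between the rows of Y and the rows of X @ Theta."""
    Z = X @ Theta                      # n x q matrix of linear predictors
    n, q = Y.shape
    total = 0.0
    for j in range(q):
        for k in range(q):
            if j == k:
                continue               # these terms are always zero
            total += np.sum((Y[:, j] > Y[:, k]) & (Z[:, j] > Z[:, k]))
    return total / n

# Toy data from the model Y = f(X @ Theta) + E with a monotonic link.
rng = np.random.default_rng(0)
n, p, q = 200, 5, 4
X = rng.normal(size=(n, p))
Theta_true = rng.normal(size=(p, q))
Theta_true[:, -1] = 0.0                        # last column set to zero (see Section 4)
Theta_true /= np.linalg.norm(Theta_true)       # Frobenius-norm normalization
f = np.tanh                                    # any increasing link would do
Y = f(X @ Theta_true) + 0.1 * rng.normal(size=(n, q))

print(rank_correlation_objective(Theta_true, X, Y))

Larger values indicate better rank agreement between responses and linear predictors; with no ties, each instance can contribute at most $q(q-1)/2$ to the inner sum.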

2 Motivating Examples and Related Work

2.1 Learning from non-linear measurements

In many practical settings, the measurements or observations are noisy non-linear functions of a linear transformation of an underlying signal. This could be due to the uncertainties and non-linearities of the measurement device, or it could arise from the experimental design (e.g., censoring or quantization). In the statistics and economics literature, this model is known as the single-index model and it has been studied extensively (Li89; Ichi93; Xia06; Del06; Yi15; Rad15). The response in the single-index model is univariate, and the form of the link function is sometimes assumed to be known.

In our model, the response is a vector (which leads to a multivariate regression inference problem) and we assume that the functional form of the link function is unknown. Also, as explained in more detail below, our inference approach only uses the ordering of the elements of the response vector. Recently, it has been shown that under certain assumptions (e.g., when the predictors are drawn from a Gaussian distribution), Lasso with non-linear measurements is equivalent to Lasso with linear measurements and an equivalent input noise proportional to the non-linearity of the link function (Thr15). Thus, it has been suggested to use Lasso in the non-linear case as if the measurements were linear. Here, and under much more general conditions, we show that our algorithm performs better than a simple application of Lasso to the non-linear problem.

2.2 Learning from the ordering of responses

Our approach is particularly of interest in applications (e.g., surveys) where subjects order a set of items based on their preferences. Some examples include physicians ranking different life-sustaining therapies they are likely to remove from critically ill patients after they have already made the decision to gradually withdraw support (Chr93; Fes12), or people ranking different types of sushi based on their preference (Kam11). In these scenarios, the underlying model cannot be learned by traditional regression techniques, which require a numerical response. However, our algorithm is directly applicable since it only uses the ordering of the elements of the response vector.

Even in the scenarios where the actual values of responses are available (e.g., numerical ratings), it is often more sensible to focus on the ordering rather than striving to learn based on the assigned numerical values. As discussed in (Kam10), there is often no invariant and objective mapping between true preference and observed ratings among users or survey participants, since “each user uses his/her own mapping based on a subjective and variable criterion in his/her own mind”. Thus, the mappings might be inconsistent among different users. Moreover, the mappings might be inconsistent for a given user across different items; as noted in (Lua04), only trained experts, e.g., wine tasters, are capable of providing consistent mappings for different items. By using the ordering in training and prediction, we minimize the effects of these inconsistencies.

2.3 Collaborative and content-based filtering

Our work is also related to the problem of personalized recommendation systems, but with important differences. Recommendation systems can be divided into three main categories: content-based filtering, collaborative filtering, and hybrid models; see (Lop11; Lee12; Bob13) for recent surveys. Content-based filtering employs the domain knowledge of users (e.g., demographic information and user profile data) and items (e.g., genre or actors of a movie) to predict the ratings. Collaborative filtering does not use any user or item information except a partially observed rating matrix, with rows and columns corresponding to users and items and matrix elements corresponding to ratings. In general, the rating matrix is extremely sparse, since each user normally does not experience and rate all items. Hybrid systems combine collaborative and content-based filtering, e.g., by making separate predictions with each filtering approach and averaging the results.

If the regression-based framework described in this paper were used in a recommendation system, it would predict each user's ordering of a set of items based on a set of features for that user. These features could include demographic information, user profile data, or ratings of a fixed set of items. Contrary to content-based filtering, our approach does not need domain-specific knowledge about the features of items (e.g., relevant features are different for books and movies). This is potentially useful in applications where the items to be ranked are diverse in nature (for example, products on the Amazon online store). Also, as opposed to collaborative filtering, we can incorporate user profile data and provide predictions for new users even if they have provided no prior ratings (i.e., providing predictions only based on features such as demographic data). In Section 8, we provide a preference prediction task where neither collaborative nor content-based filtering is applicable, since we want to make predictions for new users without using domain-specific knowledge about the items. Thus, our algorithm is different from the problems of collaborative and content-based filtering. It is important to stress that we introduce and study a general semi-parametric multivariate regression method which can be used in recommendation systems, but this is just one of multiple potential applications.

2.4 Maximum rank correlation estimation

In (Han87), Han considered a problem similar to (2) but with an important difference. His formulation, called Maximum Rank Correlation (MRC) estimation, was stated for the multiple regression setting, where the response is real-valued rather than a $q$-vector, and his goal was to maximize the rank correlation across instances. Therefore, the goal in MRC estimation is to capture the ordering of the responses across instances, whereas in this paper, our goal is to capture the ordering of the elements of $y_i$ for a fixed $i$ (for a specific instance, across responses). This difference significantly changes the scope of the problem and its theoretical properties. Considering the ordering across responses enables us to model applications where an instance's ordering (or rating) of a set of items depends exclusively on its predictors. Also, as we see in the next section, the identifiability and consistency conditions for problem (2) differ significantly from those of the multiple regression problem.

There are extensions of the MRC approach (e.g., (Cav98; Abr03)), but they all are in the multiple regression domain and only differ in how they define the objective function to solve the same problem. Our work differs from them for the same reasons mentioned above. Unlike exploded and ordered logit models (All94; Ala13), our approach is semi-parametric and invariant to the functional form of $f$. Our work is also markedly different from the ‘learning to rank’ problem in the information retrieval literature, where the goal is to find relevant documents for a given query (Cao07; Liu11).

3 Summary of Results

In Section 4, we outline the identifiability conditions on the true model. We show that the true matrix, $\Theta^*$, is identifiable up to a scaling and the addition of a vector to all of its columns. Moreover, we show that the maximizer of the optimization problem in (4) is a consistent estimator of the true matrix. In Section 5, we study the convergence rate and show that our estimator's squared error decays faster than $1/\sqrt{n}$ if the Hessian of the expected rank correlation is negative definite at the true parameter. In Section 6, we provide a greedy algorithm to estimate the maximizer of (4) in polynomial time and show how our algorithm can be extended to provide sparsity among the elements of $\Theta$. Finally, in Sections 7 and 8, we provide extensive experimental results using both synthetic and real datasets and show that our algorithm is successful in estimating the underlying true models and provides better prediction results compared to other applicable algorithms.

4 Strong Consistency

In this section, we show that the solution of (4) is strongly consistent under certain conditions. $S_n$ is invariant to the multiplication of all elements of $\Theta$ by a positive constant; i.e., for $c > 0$, $S_n(c\Theta) = S_n(\Theta)$. The objective function also does not change if the same vector is added to all of the columns of $\Theta$; i.e., for any $b \in \mathbb{R}^p$, $S_n(\Theta + b\mathbf{1}^T) = S_n(\Theta)$, where $\mathbf{1}$ is a vector of all ones. These invariances are expected, since to have a semi-parametric estimate, we target maximizing the rank correlation, and ranks are not affected when all the elements are multiplied by a positive constant ($x_i^T (c\theta_j) = c\, x_i^T \theta_j$), or are increased/decreased by the same amount ($x_i^T (\theta_j + b) = x_i^T \theta_j + x_i^T b$). In other words, since our estimation is semi-parametric in $f$ (and thus must be invariant to strictly monotonic transformations of the observations), $\Theta$ and $c\Theta + b\mathbf{1}^T$ are equivalent. So, we assume that $\|\Theta\|_F = 1$ (normalization), and that the last column of $\Theta$ is all zeros (subtracting the last column from all columns). We perform the optimization in (4) over the set

$\Omega = \{\Theta \in \mathbb{R}^{p \times q} : \|\Theta\|_F = 1, \ \theta_q = 0\},$

where $\|\cdot\|_F$ denotes the Frobenius norm.
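As a quick numerical check of these invariances (reusing the illustrative rank_correlation_objective helper and the toy data from the sketch in Section 1), scaling $\Theta$ by a positive constant and adding the same vector to every column leaves $S_n$ unchanged:

c = 3.7                                   # any positive constant
b = rng.normal(size=(p, 1))               # any vector added to every column
Theta_shifted = c * Theta_true + b @ np.ones((1, q))

s1 = rank_correlation_objective(Theta_true, X, Y)
s2 = rank_correlation_objective(Theta_shifted, X, Y)
assert np.isclose(s1, s2)                 # S_n(Theta) = S_n(c*Theta + b*1^T)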

Let us denote the true coefficient matrix by $\Theta^*$ and, without loss of generality, assume $\Theta^* \in \Omega$; otherwise, we can find $c > 0$ and $b$ such that $c\Theta^* + b\mathbf{1}^T \in \Omega$, which gives the same value of the objective function as $\Theta^*$. Also, to have a non-degenerate problem, we assume that $\Theta^*$ does not have rank 1, since in that case there exist two vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ such that $\Theta^* = u v^T$, and $x_i^T \Theta^* = (x_i^T u)\, v^T$. Therefore, the ordering of the elements of $f(x_i^T \Theta^*)$ will be either the same as the ordering of the elements of $v$ (if $x_i^T u > 0$) or the reverse of it (if $x_i^T u < 0$), with perturbations due to the noise. So, different observed orderings of the elements of $y_i$ are caused merely by noise, which is not of interest. Finally, we assume that no two columns of $\Theta^*$ are equal, because in that case the expected values of the corresponding elements of $y_i$ will be the same.

Given the model in (3), to prove strong consistency, we need the following conditions, (C1)-(C4):

  • (C1) $f$ is a non-degenerate increasing function and changes value at least at one non-zero point (i.e., $f$ is not a step function changing value only at 0).

  • (C2) The elements of $E$ are i.i.d. random variables.

  • (C3) The rows of $X$ are i.i.d. random vectors of size $p$, independent of the elements of $E$, and have a distribution function $F_X$ such that:

    • (C3.1) the support of $F_X$ is not contained in any proper linear subspace of $\mathbb{R}^p$, and

    • (C3.2) for all $j$, the conditional distribution of $x_j$ given the other components has everywhere positive Lebesgue density.

  • (C4) $\Theta^*$ is an interior point of $\Omega$. Moreover, $\Theta^*$ has rank higher than one and no two columns of it are equal.

As we show later, the second part of condition (C1) is needed in the proof of identifiability. However, for all practical purposes, the step function at 0 can be replaced by an approximate function changing value over a small interval, for which the theoretical results we present in the following hold. Conditions (C3.1) and (C3.2) are also required for identifiability, and hold in many settings; e.g., when the rows of $X$ have a multivariate Gaussian distribution. In (C4), $\Theta^* \in \Omega$ implies that its last column is all zeros, and because no two columns are equal, we can conclude that every column except the last one has at least one non-zero element. For some known constant $\zeta > 0$ which is less than all the absolute values of these non-zero elements, define $\mathcal{B}$ as the set of all $\Theta \in \Omega$ whose columns, except the last, each have at least one element with absolute value at least $\zeta$.

Denoting the solution of (4) over the set $\mathcal{B}$ by $\hat{\Theta}_n$, we prove:

$\hat{\Theta}_n \to \Theta^* \ \text{almost surely as } n \to \infty. \qquad (5)$

We conduct the proof in three steps. In Lemma 1, we prove the identifiability; in Lemma 2, we prove the convergence of $S_n$ to the expected value of the rank correlation; and finally, we prove the consistency in Theorem 1.

Lemma 1 (Identifiability).

Given (C1)-(C4), $\Theta^*$ attains the unique maximum of $\mathbb{E}[S_n(\Theta)]$ over the set $\mathcal{B}$.

Proof.

For a given such that , we have:

(6)

where , and is the -th column of . For any matrix , we have:

(7)

For any and , and maximize this expected value, because for any given , the larger term between and is chosen in the expected value. Therefore, maximizes the expected value of each of the terms in (4) and consequently, maximizes . Next, we show that is the unique maximizer of over the set .

For any , we show that

(8)

For any pair of , we can define the following two sets:

If for some , then both and have Lebesgue measure greater than 0. We claim that given and (C4), a pair and such that exists. We prove this by contradiction. Assume that such a pair does not exist. Then, since the last columns of both and are zero, setting implies that there exists such that for . Also, for any , there must exist a such that . Combining these two, we get . Since this holds for all and , we can conclude that all columns of are multiples of each other and consequently, has rank 1, which is a contradiction. Therefore, there exists a pair and for which both and have Lebesgue measure greater than 0.

Now, define the following two sets:

(9)

Next, we show that and/or have Lebesgue measure greater than 0. This is trivial if is strictly increasing, since and . We show that and/or have positive measure in general. Take an ; it is clear that for any , is also in , and is in . If we change from to , then changes from to and changes from to . Since is non-degenerate and changes value at a non-zero point, there exists a neighborhood such that for , we have and/or . Thus and/or have Lebesgue measure greater than 0.

Without loss of generality, assume that defined for and has Lebesgue measure greater than 0. In the following we show that , which proves the lemma. We have:

(10)
(11)
(12)

The summation in (10) is over all distinct pairs, and . The inequality in (11) is based on the argument made for (7); for any given , , and , the expectation in (10) is greater than or equal to zero (because maximizes (7) for every and any pair and ). Therefore, by removing all pairs of and such that and , and conditioning only on , we get a smaller term. The first term in (12) is strictly greater than zero because has positive measure, and hence . The second term is positive, because for any , the term inside is strictly greater than zero. ∎
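To see the identifiability statement at work numerically, one can compare $S_n$ at the true coefficient matrix with its value at random competitors drawn from the feasible set. The sketch below again reuses the illustrative helper and toy data from Section 1, so the specific names are assumptions rather than the paper's notation:

def random_feasible(p, q, rng):
    """Draw a random matrix with unit Frobenius norm and a zero last column."""
    Theta = rng.normal(size=(p, q))
    Theta[:, -1] = 0.0
    return Theta / np.linalg.norm(Theta)

competitors = [random_feasible(p, q, rng) for _ in range(50)]
s_true = rank_correlation_objective(Theta_true, X, Y)
s_best_competitor = max(rank_correlation_objective(T, X, Y) for T in competitors)
print(s_true, s_best_competitor)   # with enough data, s_true comes out on top

For large $n$ and mild noise, $S_n(\Theta^*)$ exceeds $S_n$ at the random competitors with overwhelming probability, which is the empirical counterpart of the identifiability statement.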

Lemma 2 (Convergence).

Given (C1)-(C4), and denoting $S(\Theta) = \mathbb{E}[S_n(\Theta)]$, we have:

$\sup_{\Theta \in \mathcal{B}} |S_n(\Theta) - S(\Theta)| \to 0 \ \text{almost surely as } n \to \infty.$

Proof.

Define

It is clear that is bounded and moreover, given (C3), , and consequently , are continuous in almost surely. Therefore, since is compact and separable, for any , there exists a sequence in a countable dense subset of such that:

Thus, almost surely:

Taking the expected value of both sides of these limits, we get for all :

(13)

We use the Borel–Cantelli lemma (Kle13) to prove the lemma.

(14)
(15)

is a compact set and also for any small . However, for a compact set, every open cover has a finite subcover. Thus, for a finite , there exist , where , and:

(16)

Also, assume that for , we have and . This is possible to achieve because of (13). Now, from (14)–(15), we have:

(17)
(18)

For each element of the finite cover, we can invoke the Strong Law of Large Numbers for U-statistics (Ser09, Chapter 5) to get:

(19)

In the following, we use the Borel-Cantelli lemma (Kle13) twice to get the desired result. Reordering the summands, we can re-write the first term in (18) as follows:

According to the Borel-Cantelli lemma, the almost sure convergence in (19) implies that each of the terms marked above is finite, and thus their sum is finite (since the number of such terms is finite). The same argument can be made for the second term in (18). Therefore, the quantity in (17) is bounded by a finite quantity, and hence is finite. Applying the Borel-Cantelli result proves the lemma. ∎

Theorem 1.

Given (C1)-(C4), the solution of (4) over the set $\mathcal{B}$, $\hat{\Theta}_n$, is strongly consistent; i.e., $\hat{\Theta}_n \to \Theta^*$ almost surely as $n \to \infty$.

Proof.

Given the results of Lemmas 1 and 2, we are now ready to prove the consistency of the solution of (4) over $\mathcal{B}$. We do so by showing that, as $n \to \infty$, any open set that contains $\Theta^*$ also contains $\hat{\Theta}_n$.

Define . is compact and there exists which is always greater than 0 (because attains the unique maximum and ). From the result of Lemma 2, we know that for any , there is an , such that for , for all

with probability 1. This implies that

cannot be in ; because otherwise, we get which is in contradiction with almost sure convergence. Thus, with probability 1, and since this is true for any , we have almost surely. ∎

5 Rate of Convergence

For ease of notation, let $\theta$ be the vectorization of the matrix $\Theta$, excluding the last column which is assumed to be all zeros. Thus,

$\theta = \mathrm{vec}([\theta_1, \dots, \theta_{q-1}]) \in \mathbb{R}^{p(q-1)}. \qquad (20)$

For $\Theta \in \Omega$, the corresponding $\theta$ is in $\mathbb{S}^{p(q-1)-1}$, the set of $p(q-1)$-dimensional vectors with norm 1. So, we can denote $\Theta$ and its columns as functions of $\theta$, and write:

$S_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j,k} \mathbb{1}\{y_{ij} > y_{ik}\}\, \mathbb{1}\{x_i^T \theta_j(\theta) > x_i^T \theta_k(\theta)\}, \qquad (21)$

where $z_i = (x_i, y_i)$ denotes the joint vector of predictors and responses for instance $i$ (so that $S_n(\theta)$ is an average of terms depending on $z_i$ and $\theta$), and $\theta_j(\theta)$ denotes the $j$'th column of $\Theta(\theta)$. Let $\theta^*$ correspond to $\Theta^*$ and $\hat{\theta}_n$ correspond to $\hat{\Theta}_n$. Based on (C4) we know that $\theta^*$ is an interior point of the corresponding feasible set. In the previous section, we showed that $\hat{\theta}_n \to \theta^*$ almost surely. In this section, we study the rate of convergence and show that $\|\hat{\theta}_n - \theta^*\|^2 = o_p(1/\sqrt{n})$, where $\|\cdot\|$ is the Euclidean norm.
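Since the arguments in this section work with the vectorized parameter, a small utility pair may help fix ideas; it maps between the $p \times q$ matrix (with its zero last column) and the corresponding vector in $\mathbb{R}^{p(q-1)}$. The column-major ordering is an arbitrary choice made for this sketch:

import numpy as np

def theta_to_vec(Theta):
    """Vectorize Theta, dropping its last column (assumed to be all zeros)."""
    return Theta[:, :-1].flatten(order="F")

def vec_to_theta(theta_vec, p, q):
    """Inverse map: rebuild the p x q matrix with a zero last column."""
    Theta = np.zeros((p, q))
    Theta[:, :-1] = theta_vec.reshape((p, q - 1), order="F")
    return Theta

# For Theta with unit Frobenius norm and a zero last column, theta_to_vec(Theta)
# lies on the Euclidean unit sphere in R^{p*(q-1)}.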

The following Lemma plays a critical role in establishing the rate of convergence. Its proof, using results from (Pakes; Nolan; Sherman), is included in the Appendix.

Lemma 3.

For in an neighborhood of , and , we have:

(22)
Proof.

See the Appendix. ∎

For the next Theorem we require that there exists an neighborhood of and a constant for which for all . Assume and denote the first and second order derivatives with respect to . The existence of is guaranteed since is an interior point of . Also, exists if is negative definite, because (since is maximized at ) and in neighborhood of , Taylor expansion gives us:

(23)

Since is negative definite, there exists a positive such that . In an neighborhood of , setting for any small positive gives . In sum, a sufficient condition for the existence of is the following:

  • (C5) The matrix $\nabla^2 S(\theta^*)$ of second-order derivatives of $S(\theta) = \mathbb{E}[S_n(\theta)]$ at $\theta^*$ is negative definite.

Theorem 2.

Assume that there exists a neighborhood of $\theta^*$ and a constant $\kappa > 0$ for which $S(\theta) - S(\theta^*) \le -\kappa \|\theta - \theta^*\|^2$ for all $\theta$ in that neighborhood. Then, the squared estimation error decays at a rate faster than $1/\sqrt{n}$: $\|\hat{\theta}_n - \theta^*\|^2 = o_p(1/\sqrt{n})$.

Proof.

By definition of we have:

(24)

Rewriting this inequality using the result of Lemma 3 gives:

(25)

which gives us . ∎

Corollary 1.

Given (C1)-(C5), we have $\|\hat{\theta}_n - \theta^*\|^2 = o_p(1/\sqrt{n})$.

6 Optimization Algorithm

In the previous sections, we showed that solving (4) provides a consistent estimate of $\Theta^*$. However, the objective function is very non-smooth (the sum of many step functions) and finding $\hat{\Theta}_n$ can be challenging. In this section, we propose a fast, greedy algorithm to solve (4).

First, consider the following maximization problem:

$\max_{w \in \mathbb{R}} \ \sum_{i=1}^{m} \mathbb{1}\{a_i (w - b_i) > 0\}, \qquad (26)$

where the $a_i$ and $b_i$ are given real numbers. This objective function is a piece-wise constant function changing values at $b_1, \dots, b_m$ (one step function changes value at each of these points). Therefore, this function takes at most $m+1$ different values. Computing each of these values requires $O(m)$ operations, and thus, finding the maximizer by computing all the possible function values requires $O(m^2)$ operations.

We propose an $O(m \log m)$ algorithm to find the maximizer, $w^*$. The algorithm works as follows. First, we sort the sequence $b_1, \dots, b_m$, in $O(m \log m)$. Then, we start from a value less than the first sorted point, i.e., $w < \min_i b_i$, and at each step, move forward to the next smallest point. At each step, we cumulatively add or subtract 1 depending on the sign of the corresponding $a_i$ (i.e., depending on whether one of the step functions went from 0 to 1 or vice-versa) and keep track of the largest cumulative sum seen so far, and the value of $w$ for which that maximum happened. After going through all points, the largest observed cumulative sum is equal to the maximum value of the objective function, and the corresponding value of $w$ is $w^*$. Since the objective function is piece-wise constant, its maximum is attained over an interval; we set $w^*$ to the center of this interval. Therefore, we can solve (26) in $O(m \log m)$, which is dominated by the cost of the sort. The details of the algorithm are summarized below under Algorithm 1. We use this algorithm repeatedly to solve (4).

Inputs: , , ; Output:
Compute:
Sort
Notation: : corresponding of
for  do
     if  then
         
         
         
     else if  then
         
         if  then
              
              
         end if
     else
         
         if  then
              
              
         end if
     end if
end for
Algorithm 1 — for solving (26)
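As a complement to Algorithm 1, here is a short Python sketch of the line search as described in the text, assuming (26) has the form $\max_w \sum_i \mathbb{1}\{a_i(w - b_i) > 0\}$ as above; the function name is illustrative, and a small guard for tied breakpoints is added:

import numpy as np

def maximize_step_sum(a, b):
    """O(m log m) line search for (26): maximize over w the number of
    satisfied indicators 1{a_i * (w - b_i) > 0}. Returns (w_star, best_value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    order = np.argsort(b)
    a, b = a[order], b[order]
    m = len(b)

    # For w below every breakpoint, exactly the terms with a_i < 0 are active.
    current = int(np.sum(a < 0))
    best, best_idx = current, -1          # -1 marks the interval (-inf, b[0])

    # Sweep left to right; crossing b_i turns a step on (a_i > 0) or off (a_i < 0).
    for i in range(m):
        if a[i] > 0:
            current += 1
        elif a[i] < 0:
            current -= 1
        # Evaluate only on genuine open intervals, i.e. after a full group of ties.
        if i == m - 1 or b[i + 1] > b[i]:
            if current > best:
                best, best_idx = current, i

    # Return the centre of the maximizing interval (an arbitrary nearby point
    # when that interval is unbounded).
    if best_idx == -1:
        w_star = b[0] - 1.0
    elif best_idx == m - 1:
        w_star = b[-1] + 1.0
    else:
        w_star = 0.5 * (b[best_idx] + b[best_idx + 1])
    return w_star, best

# Example: terms 1{w > 0}, 1{w < 1}, 1{w > 2}; the maximum value 2 is attained
# on (0, 1), so the returned point is its centre, 0.5.
print(maximize_step_sum([1.0, -2.0, 0.5], [0.0, 1.0, 2.0]))

The example call at the end shows the returned pair: the centre of a maximizing interval and the corresponding maximum count.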

We use an alternating maximization scheme to solve (