# Learning Mixed Multinomial Logit Model from Ordinal Data

Motivated by generating personalized recommendations using ordinal (or preference) data, we study the question of learning a mixture of MultiNomial Logit (MNL) models, a parameterized class of distributions over permutations, from partial ordinal or preference data (e.g. pair-wise comparisons). Despite its long-standing importance across disciplines including social choice, operations research and revenue management, little is known about this question. For a single MNL model (no mixture), computationally and statistically tractable learning from pair-wise comparisons is feasible. However, learning even a mixture of two MNL components is infeasible in general. Given this state of affairs, we seek conditions under which it is feasible to learn the mixture model in a computationally and statistically efficient manner. We present a sufficient condition as well as an efficient algorithm for learning mixed MNL models from partial preference/comparison data. In particular, a mixture of $r$ MNL components over $n$ objects can be learnt using a number of samples that scales polynomially in $n$ and $r$ (concretely, $r^{3.5} n^3 (\log n)^4$, with $r \ll n^{2/7}$ when the model parameters are sufficiently incoherent). The algorithm has two phases: first, learn the pair-wise marginals for each component using tensor decomposition; second, learn the model parameters for each component using Rank Centrality, introduced by Negahban et al. In the process of proving these results, we obtain a generalization of the existing analysis for tensor decomposition to a more realistic regime where only partial information about each sample is available.

## 1 Introduction

Background. Popular recommendation systems such as collaborative filtering are based on a partially observed ratings matrix. The underlying hypothesis is that the true/latent score matrix is low-rank and we observe a partial, noisy version of it. Therefore, matrix completion algorithms are used for learning, cf. [8, 14, 15, 20]. In reality, however, observed preference data is not just scores. For example, clicking one of many choices while browsing provides a partial order between the clicked choice and the other choices. Further, scores convey ordinal information as well, e.g. a score of 4 for paper A and a score of 7 for paper B by a reviewer suggests the ordering $B \succ A$. Similar motivations led Samuelson to propose the Axiom of revealed preference [21] as the model for rational behavior. In a nutshell, it states that consumers have a latent order of all objects, and the preferences revealed through actions/choices are consistent with this order. If indeed all consumers had an identical ordering, then learning preferences from partial preferences would effectively be the question of sorting.

In practice, individuals have different orderings of interest, and further, each individual is likely to make noisy choices. This naturally suggests the following model: each individual has a latent distribution over orderings of the objects of interest, and the revealed partial preferences are consistent with it, i.e. are samples from the distribution. Subsequently, the preference of the population as a whole can be associated with a distribution over permutations. Recall that the low-rank structure for score matrices, as a model, tries to capture the fact that there are only a few different types of choice profiles. In the context of modeling consumer choices as a distribution over permutations, the MultiNomial Logit (MNL) model with a small number of mixture components provides such a model.

Mixed MNL. Given $n$ objects or choices of interest, an MNL model is described as a parametric distribution over permutations of the $n$ objects with parameters $w = [w_i] \in \mathbb{R}^n$: each object $i$, $1 \le i \le n$, has a parameter $w_i > 0$ associated with it. The permutations are then generated randomly as follows: choose one of the $n$ objects to be ranked first at random, where object $i$ is chosen with probability $w_i / \sum_{j=1}^{n} w_j$. Let $i_1$ be the object chosen for the first position. To select the second-ranked object, choose from the remaining objects with probability proportional to their weights. We repeat until objects for all ranked positions are chosen. It can be easily seen that, as per this model, an item $i$ is ranked higher than $j$ with probability $w_i / (w_i + w_j)$.
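The sequential sampling procedure above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the weight values are hypothetical, and the closed form $w_i/(w_i+w_j)$ used for comparison is the standard pair-wise marginal of the MNL model.

```python
import random

def sample_mnl_permutation(w, rng):
    """Draw one ranking (best first) from an MNL model with positive weights w."""
    remaining = list(range(len(w)))
    ranking = []
    while remaining:
        # Pick the next-ranked object with probability proportional to its weight.
        pick = rng.choices(remaining, weights=[w[i] for i in remaining], k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking

def pairwise_frequency(w, i, j, num_samples, rng):
    """Fraction of sampled rankings in which object i is ranked above object j."""
    wins = 0
    for _ in range(num_samples):
        ranking = sample_mnl_permutation(w, rng)
        if ranking.index(i) < ranking.index(j):
            wins += 1
    return wins / num_samples

rng = random.Random(0)
w = [1.0, 2.0, 3.0, 4.0]            # illustrative weights (hypothetical values)
freq = pairwise_frequency(w, 0, 2, 20000, rng)
exact = w[0] / (w[0] + w[2])        # standard MNL pair-wise marginal, here 0.25
print(round(freq, 3), exact)
```

The empirical frequency should be close to the closed form, which is the property that makes pair-wise comparisons informative about the weights.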

In the mixed MNL model with $r \ge 2$ mixture components, each component corresponds to a different MNL model: let $w^{(1)}, \ldots, w^{(r)}$ be the corresponding parameters of the $r$ components. Let $q = [q_a] \in [0,1]^r$ denote the mixture distribution, i.e. $\sum_a q_a = 1$. To generate a permutation at random, first choose a component $a \in \{1, \ldots, r\}$ with probability $q_a$, and then draw a random permutation as per the MNL with parameters $w^{(a)}$.

Brief history. The MNL model is an instance of a class of models introduced by Thurstone [23]. The description of the MNL provided here was formally established by McFadden [17]. The same model (in the form of pair-wise marginals) was introduced independently by Zermelo [25] as well as by Bradley and Terry [7]. In [16], Luce established that MNL is the only distribution over permutations that satisfies the axiom of Independence from Irrelevant Alternatives.

On learning distributions over permutations, the question of learning a single MNL model, and more generally instances of Thurstone's model, has been of interest for quite a while now. The maximum likelihood estimator, which is logistic regression for MNL, has been known to be consistent in the large sample limit, cf. [13]. Recently, RankCentrality [19] was established to be statistically efficient. For learning sparse mixture models, i.e. distributions over permutations with each mixture component being a delta distribution, [11] provided sufficient conditions under which mixtures can be learnt exactly using pair-wise marginals: effectively, as long as the number of components scaled as $o(\log n)$ and the components satisfied an appropriate incoherence condition, a simple iterative algorithm could recover the mixture. However, it is not robust with respect to noise in the data or finite sample error in marginal estimation. Other approaches have been proposed to recover the model using convex optimization based techniques, cf. [10, 18]. The MNL model is a special case of a larger family of discrete choice models known as the Random Utility Model (RUM), and an efficient algorithm to learn RUMs is introduced in [22]. Efficient algorithms for learning RUMs from partial rankings have been introduced in [3, 4]. We note that the above list of references is very limited, including only closely related literature. Given the nature of the topic, there have been many exciting lines of research over the past century, and we are unable to provide comprehensive coverage due to space limitations.

Problem. Given observations from the mixed MNL, we wish to learn the model parameters: the mixing distribution $q$ and the parameters $w^{(1)}, \ldots, w^{(r)}$ of each component. The observations are in the form of pair-wise comparisons. Formally, to generate an observation, first one of the $r$ mixture components is chosen; then, for $\ell$ of all possible pairs, the comparison outcome is observed as per this MNL component. (We shall assume that the outcomes of these $\ell$ pairs are independent of each other, but come from the same MNL mixture component. This is effectively true even if they were generated by first sampling a permutation from the chosen MNL mixture component and then observing the implication of this permutation for the specific $\ell$ pairs, as long as they are distinct, due to the Independence of Irrelevant Alternatives hypothesis of Luce that is satisfied by MNL.) These $\ell$ pairs are chosen, uniformly at random, from $N$ pre-determined pairs $\{(i_k, j_k),\ 1 \le k \le N\}$. We shall assume that the selection of the $N$ pairs is such that the undirected graph $G = ([n], E)$, where $E = \{(i_k, j_k) : 1 \le k \le N\}$, is connected.

We ask the following questions of interest: Is it always feasible to learn mixed MNL? If not, under what conditions is it feasible, and how many samples are needed? How computationally expensive are the algorithms?

We briefly recall a recent result [1] that suggests that it is impossible to learn mixed MNL models in general. One such example is described in Figure 1. It depicts an example with and and a uniform mixture distribution. For the first case, in mixture component , with probability the ordering is (we denote objects by and ); and in mixture component , with probability the ordering is . Similarly for the second case, the two mixtures are made up of permutations and . It is easy to see the distribution over any -wise comparisons generated from these two mixture models is identical. Therefore, it is impossible to differentiate these two using -wise or pair-wise comparisons. In general, [1] established that there exist mixture distributions with over objects that are impossible to distinguish using -wise comparison data. That is, learning mixed MNL is not always possible.

Contributions. The main contribution of this work is the identification of sufficient conditions under which a mixed MNL model can be learnt efficiently, both statistically and computationally. Concretely, we propose a two-phase learning algorithm: in the first phase, using a tensor decomposition method for learning mixtures of discrete product distributions, we identify the pair-wise marginals associated with each mixture component; in the second phase, we use these pair-wise marginals to learn the parameters associated with each MNL mixture component.

The algorithm in the first phase builds upon the recent work by Jain and Oh [12]. In particular, Theorem 3 generalizes their work to the setting where, for each sample, we have limited information: as per [12], we would require that each individual gives the entire permutation; instead, we have extended the result to cope with the current setting, in which we only have information about $\ell$, potentially finite, pair-wise comparisons. The algorithm in the second phase utilizes RankCentrality [19]. Its analysis in Theorem 4 works for the setting where observations are no longer independent, as required in [19].

We find that as long as certain rank and incoherence conditions are satisfied by the parameters of each mixture component, the above two-phase algorithm is able to learn the mixture distribution and the parameters associated with each component faithfully, using a number of samples that scales polynomially in $n$ and $r$. Concretely, the number of samples required scales as $r^{3.5} n^3 (\log n)^4$, with constants dependent on the incoherence between mixture components, as long as $r \ll n^{2/7}$ and $G$, the graph of potential comparisons, is a spectral expander with the total number of edges scaling as $N = O(n \log n)$. For the precise statement, we refer to Theorem 1.

The algorithms proposed are iterative, and primarily based on spectral properties of underlying tensors/matrices with provable, fast convergence guarantees. That is, algorithms are not only polynomial time, they are practical enough to be scalable for high dimensional data sets.

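Notations. We use $[N] = \{1, \ldots, N\}$ for the first $N$ positive integers. We use $\otimes$ to denote the outer product such that $(x \otimes y \otimes z)_{ijk} = x_i y_j z_k$. Given a third order tensor $T \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and matrices $U \in \mathbb{R}^{n_1 \times r_1}$, $V \in \mathbb{R}^{n_2 \times r_2}$, $W \in \mathbb{R}^{n_3 \times r_3}$, we define a linear mapping $T[U, V, W] \in \mathbb{R}^{r_1 \times r_2 \times r_3}$ as $T[U, V, W]_{abc} = \sum_{i,j,k} T_{ijk} U_{ia} V_{jb} W_{kc}$. We let $\|x\| = \sqrt{\sum_i x_i^2}$ be the Euclidean norm of a vector, $\|M\|_2 = \max_{\|x\| \le 1, \|y\| \le 1} x^T M y$ be the operator norm of a matrix, and $\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2}$ be the Frobenius norm. We say an event happens with high probability (w.h.p.) if its probability is lower bounded by $1 - f(n)$ such that $f(n) = o(1)$ as $n$ scales to $\infty$.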

## 2 Main result

In this section, we describe the main result: sufficient conditions under which mixed MNL models can be learnt using tractable algorithms. We provide a useful illustration of the result as well as discuss its implications.

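Definitions. Let $S = \{x_t\}$ denote the collection of observations, each of which is an $N$-dimensional, $\{-1, 0, +1\}$-valued vector. Recall that each observation is obtained by first selecting one of the $r$ MNL mixture components, and then viewing the outcomes, as per the chosen component, of $\ell$ randomly chosen pair-wise comparisons from the $N$ pre-determined comparisons $\{(i_k, j_k) : 1 \le k \le N\}$. Let $x_t \in \{-1, 0, +1\}^N$ denote the $t$-th observation, with $x_{t,k} = 0$ if the $k$-th pair $(i_k, j_k)$ is not chosen amongst the $\ell$ randomly chosen pairs, and $x_{t,k} = +1$ (respectively $-1$) if $j_k$ is preferred over $i_k$ (respectively $i_k$ over $j_k$) as per the chosen MNL mixture component. By definition, it is easy to see that for any observation $x_t$ in $S$ and $1 \le k \le N$,

$$\mathbb{E}[x_{t,k}] = \frac{\ell}{N}\Big[\sum_{a=1}^{r} q_a P_{ka}\Big], \quad \text{where } P_{ka} = \frac{w^{(a)}_{j_k} - w^{(a)}_{i_k}}{w^{(a)}_{j_k} + w^{(a)}_{i_k}}. \tag{1}$$

We shall denote $P_a = [P_{ka}]_{k \in [N]} \in [-1,1]^N$ for $a \in [r]$. Therefore, in vector form,

$$\mathbb{E}[x_t] = \frac{\ell}{N} P q, \quad \text{where } P = [P_1 \ldots P_r] \in [-1,1]^{N \times r}. \tag{2}$$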
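As a sanity check on (1) and (2), the observation model can be simulated directly. The sketch below is illustrative only: the weights, mixture probabilities, comparison pairs and helper names are hypothetical, and each sampled pair is assumed to be compared independently under the chosen component, with $+1$ recorded when the second object of the pair wins (the convention implied by (1)).

```python
import random

def make_P(weights, pairs):
    """P[k][a] = (w_j - w_i) / (w_j + w_i) for pair k = (i, j) and component a, as in (1)."""
    return [[(w[j] - w[i]) / (w[j] + w[i]) for w in weights] for (i, j) in pairs]

def sample_observation(weights, q, pairs, ell, rng):
    """One observation in {-1, 0, +1}^N: ell random pairs compared under one random component."""
    w = rng.choices(weights, weights=q, k=1)[0]        # latent mixture component
    x = [0] * len(pairs)
    for k in rng.sample(range(len(pairs)), ell):       # ell distinct pairs, uniformly at random
        i, j = pairs[k]
        j_wins = rng.random() < w[j] / (w[i] + w[j])   # independent outcome per pair
        x[k] = 1 if j_wins else -1
    return x

rng = random.Random(1)
weights = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 2.0, 3.0]]  # hypothetical w^(1), w^(2)
q = [0.3, 0.7]                                          # hypothetical mixture probabilities
pairs = [(0, 1), (1, 2), (2, 3), (0, 3)]                # hypothetical comparison graph
ell, num_samples = 2, 40000

P = make_P(weights, pairs)
N = len(pairs)
expected = [ell / N * sum(q[a] * P[k][a] for a in range(len(q))) for k in range(N)]
totals = [0] * N
for _ in range(num_samples):
    x = sample_observation(weights, q, pairs, ell, rng)
    totals = [s + v for s, v in zip(totals, x)]
empirical = [s / num_samples for s in totals]
max_gap = max(abs(e - m) for e, m in zip(expected, empirical))
print(max_gap)
```

The largest gap between the sample mean and $(\ell/N) P q$ should shrink as the number of simulated observations grows.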

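That is, $P$ is a matrix with $r$ columns, each representing one of the $r$ mixture components, and $q$ is the vector of mixture probabilities. By independence, for any observation $x_t$ and any two different pairs $1 \le k \ne m \le N$,

$$\mathbb{E}[x_{t,k}\, x_{t,m}] \;\propto\; \sum_{a=1}^{r} q_a P_{ka} P_{ma}. \tag{3}$$

Therefore, the $N \times N$ matrix $\mathbb{E}[x_t x_t^T]$, or equivalently the tensor $\mathbb{E}[x_t \otimes x_t]$, is proportional to $M_2$ except in the diagonal entries, where

$$M_2 = P Q P^T \equiv \sum_{a=1}^{r} q_a (P_a \otimes P_a), \tag{4}$$

$Q = \mathrm{diag}(q)$ being the diagonal matrix whose entries are the mixture probabilities $q$. In a similar manner, the tensor $\mathbb{E}[x_t \otimes x_t \otimes x_t]$ is proportional to $M_3$ (except in $O(N^2)$ entries), where

$$M_3 = \sum_{a=1}^{r} q_a (P_a \otimes P_a \otimes P_a). \tag{5}$$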

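Indeed, the empirical estimates $\hat{M}_2$ and $\hat{M}_3$, defined as

$$\hat{M}_2 = \frac{1}{|S|}\Big[\sum_{t \in S} x_t \otimes x_t\Big], \quad \text{and} \quad \hat{M}_3 = \frac{1}{|S|}\Big[\sum_{t \in S} x_t \otimes x_t \otimes x_t\Big], \tag{6}$$

provide good proxies for $\mathbb{E}[x_t \otimes x_t]$ and $\mathbb{E}[x_t \otimes x_t \otimes x_t]$ for a large enough number of samples, and shall be utilized crucially for learning the model parameters from observations.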
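The moments in (4)-(6) translate directly into array contractions. The following is a minimal NumPy sketch; the function names, the dimensions and the placeholder observation matrix `X` are assumptions made for illustration, and `X` is random filler used only to show shapes, not data drawn from the model.

```python
import numpy as np

def population_moments(P, q):
    """M2 = sum_a q_a P_a (x) P_a and M3 = sum_a q_a P_a (x) P_a (x) P_a, as in (4)-(5)."""
    M2 = np.einsum("a,ia,ja->ij", q, P, P)
    M3 = np.einsum("a,ia,ja,ka->ijk", q, P, P, P)
    return M2, M3

def empirical_moments(X):
    """Sample averages of x (x) x and x (x) x (x) x over the rows of X, as in (6)."""
    n_samples = X.shape[0]
    M2_hat = np.einsum("ti,tj->ij", X, X) / n_samples
    M3_hat = np.einsum("ti,tj,tk->ijk", X, X, X) / n_samples
    return M2_hat, M3_hat

rng = np.random.default_rng(0)
P = rng.uniform(-1.0, 1.0, size=(6, 2))          # hypothetical N = 6 pairs, r = 2 components
q = np.array([0.4, 0.6])
M2, M3 = population_moments(P, q)

X = rng.choice([-1.0, 0.0, 1.0], size=(5, 6))    # placeholder observations, shapes only
M2_hat, M3_hat = empirical_moments(X)
```

Note that the diagonal of `M2_hat` is the average of $x_{t,k}^2$, i.e. the fraction of observations in which pair $k$ was compared, which is why the diagonal entries carry no information about $M_2$.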

Sufficient conditions for learning. With the above discussion, we state sufficient conditions for learning the mixed MNL in terms of properties of $M_2$:

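• (C1) $M_2$ has rank $r$; let $\sigma_1(M_2)$ and $\sigma_r(M_2) > 0$ be the largest and smallest singular values of $M_2$.

• (C2) For a large enough universal constant $C' > 0$,

$$N \ge C' r^{3.5} \mu^6(M_2) \Big(\frac{\sigma_1(M_2)}{\sigma_r(M_2)}\Big)^{4.5}. \tag{7}$$

In the above, $\mu(M_2)$ represents the incoherence of the symmetric matrix $M_2$. We recall that for a symmetric matrix $M \in \mathbb{R}^{N \times N}$ of rank $r$ with decomposition $M = U S U^T$, $U \in \mathbb{R}^{N \times r}$, the incoherence is defined as

$$\mu(M) = \sqrt{\frac{N}{r}} \Big(\max_{i \in [N]} \|U_i\|\Big), \tag{8}$$

where $U_i$ denotes the $i$-th row of $U$.

• (C3) The undirected graph $G = ([n], E)$ with $E = \{(i_k, j_k) : 1 \le k \le N\}$ is connected. Let $A \in \{0,1\}^{n \times n}$ be its adjacency matrix, with $A_{ij} = 1$ if $(i,j) \in E$ and $0$ otherwise; let $D = \mathrm{diag}(d_1, \ldots, d_n)$ with $d_i$ being the degree of vertex $i$, and let $L = D^{-1} A$ be the normalized Laplacian of $G$. Let $d_{\max} = \max_i d_i$ and $d_{\min} = \min_i d_i$. Let the $n$ eigenvalues of the stochastic matrix $L$ be $1 = \lambda_1(L) \ge \lambda_2(L) \ge \cdots \ge \lambda_n(L) \ge -1$. Define the spectral gap of $G$:

$$\xi(G) = 1 - \max\{\lambda_2(L), -\lambda_n(L)\}. \tag{9}$$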
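The two quantities appearing in the conditions, the incoherence (8) and the spectral gap (9), are straightforward to compute numerically. The sketch below is illustrative and rests on two assumptions: that the stochastic matrix in (9) is the random-walk matrix $D^{-1}A$, and that $U$ in (8) holds the eigenvectors of the $r$ largest-magnitude eigenvalues. The function names and the small test graphs are hypothetical.

```python
import numpy as np

def incoherence(M, r):
    """mu(M) = sqrt(N / r) * max_i ||U_i||, U = top-r eigenvectors of symmetric M, as in (8)."""
    N = M.shape[0]
    eigvals, eigvecs = np.linalg.eigh(M)
    top = np.argsort(np.abs(eigvals))[::-1][:r]
    U = eigvecs[:, top]                               # N x r with orthonormal columns
    return np.sqrt(N / r) * np.max(np.linalg.norm(U, axis=1))

def spectral_gap(A):
    """xi(G) = 1 - max(lambda_2, -lambda_n) for the random-walk matrix D^{-1} A, as in (9)."""
    d = A.sum(axis=1)
    # D^{-1} A is similar to the symmetric matrix D^{-1/2} A D^{-1/2}, so its eigenvalues are real.
    S = A / np.sqrt(np.outer(d, d))
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]
    return 1.0 - max(lam[1], -lam[-1])

K4 = np.ones((4, 4)) - np.eye(4)                      # complete graph on 4 vertices
C4 = np.array([[0, 1, 0, 1], [1, 0, 1, 0],
               [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)  # 4-cycle (bipartite)
gap_K4 = spectral_gap(K4)
gap_C4 = spectral_gap(C4)

flat = np.full((5, 5), 0.2)                           # rank 1, mass spread evenly
spiky = np.zeros((5, 5)); spiky[0, 0] = 1.0           # rank 1, mass on a single coordinate
mu_flat = incoherence(flat, 1)
mu_spiky = incoherence(spiky, 1)
```

The complete graph has a large gap while the bipartite 4-cycle has gap zero, and the evenly spread matrix attains the smallest possible incoherence while the single-coordinate matrix attains the largest.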

Note that we choose a graph to collect pairwise data on, and we want to use a graph that is connected, has a large spectral gap, and has a small number of edges. In condition (C3), we need connectivity since we cannot estimate the relative strength between disconnected components (e.g. see [13]). Further, it is easy to generate a graph with spectral gap bounded below by a universal constant (e.g. ) and the number of edges , for example using the configuration model for Erdös-Renyi graphs. In condition (C2), we require the matrix to be sufficiently incoherent with bounded . For example, if and the profile of each type in the mixture distribution is sufficiently different, i.e. , then we have and . We define , , and . The following theorem provides a bound on the error and we refer to the appendix for a proof.

###### Theorem 1.

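Consider a mixed MNL model satisfying conditions (C1)-(C3). Then for any $\delta \in (0,1)$, there exist positive numerical constants $C, C'$ such that for any positive $\varepsilon$ satisfying

$$0 < \varepsilon < \Big(\frac{q_{\min}\, \xi^2(G)\, d_{\min}^2}{16\, q_{\max}\, r\, \sigma_1(M_2)\, b^5\, d_{\max}^2}\Big)^{0.5}, \tag{10}$$

Algorithm 1 produces estimates $\hat{q} = [\hat{q}_a]$ and $\hat{w} = [\hat{w}^{(a)}]$ so that with probability at least $1 - \delta$,

$$|\hat{q}_a - q_a| \le \varepsilon, \quad \text{and} \quad \frac{\|\hat{w}^{(a)} - w^{(a)}\|}{\|w^{(a)}\|} \le C \Big(\frac{r\, q_{\max}\, \sigma_1(M_2)\, b^5\, d_{\max}^2}{q_{\min}\, \xi^2(G)\, d_{\min}^2}\Big)^{0.5} \varepsilon, \tag{11}$$

for all $a \in [r]$, as long as

$$|S| \ge C' \frac{r N^4 \log(N/\delta)}{q_{\min}\, \sigma_1(M_2)^2\, \varepsilon^2} \Big(\frac{1}{\ell^2} + \frac{\sigma_1(M_2)}{\ell N} + \frac{r^4 \sigma_1(M_2)^4}{\sigma_r(M_2)^5}\Big). \tag{12}$$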

An illustration of Theorem 1. To understand the applicability of Theorem 1, consider a concrete example with ; let the corresponding weights and be generated by choosing each weight uniformly from . In particular, the rank order for each component is a uniformly random permutation. Let the mixture distribution be uniform as well, i.e. . Finally, let the graph be chosen as per the Erdös-Rényi model with each edge chosen to be part of the graph with probability , where . For this example, it can be checked that Theorem 1 guarantees that for , , and , we have for all , and . That is, for and choosing , we need sample size of to guarantee error in both and smaller than . Instead, if we choose , we only need . Limited samples per observation leads to penalty of factor of in sample complexity. To provide bounds on the problem parameters for this example, we use standard concentration arguments. It is well known for Erdös-Rényi random graphs (see [6]) that, with high probability, the number of edges concentrates in implying , and the degrees also concentrate in , implying . Also using standard concentration arguments for spectrum of random matrices, it follows that the spectral gap of is bounded by w.h.p. Since we assume the weights to be in , the dynamic range is bounded by . The following Proposition shows that , , and .

###### Proposition 2.1.

For the above example, when , , , and with high probability.

Suppose now for general , we are interested in well-behaved scenario where and . To achieve arbitrary small error rate for , we need , which is achieved by sample size with .

## 3 Algorithm

We describe the algorithm achieving the bound in Theorem 1. Our approach is two-phased. First, learn the moments for the mixture using a tensor decomposition, cf. Algorithm 2: for each type $a \in [r]$, produce an estimate $\hat{q}_a$ of the mixture weight $q_a$ and an estimate $\hat{P}_a$ of the expected outcome $P_a$ defined as in (1). Second, for each $a$, using the estimate $\hat{P}_a$, apply RankCentrality, cf. Section 3.2, to obtain an estimate $\hat{w}^{(a)}$ of the MNL weights $w^{(a)}$.

To achieve Theorem 1, and is sufficient. Next, we describe the two phases of algorithms and associated technical results.

### 3.1 Phase 1: Spectral decomposition.

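To estimate $P$ and $q$ from the samples, we shall use a tensor decomposition of $\hat{M}_2$ and $\hat{M}_3$, the empirical estimates of $M_2$ and $M_3$ respectively; recall (4)-(6). Let $M_2 = U_{M_2} \Sigma_{M_2} U_{M_2}^T$ be the eigenvalue decomposition and let

$$H = M_3\big[U_{M_2}\Sigma_{M_2}^{-1/2},\ U_{M_2}\Sigma_{M_2}^{-1/2},\ U_{M_2}\Sigma_{M_2}^{-1/2}\big].$$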

The next theorem shows that $M_2$ and $M_3$ are sufficient to learn $P$ and $Q$ exactly, when $M_2$ has rank $r$ (throughout, we assume that $r \ll n \le N$).

###### Theorem 2 (Theorem 3.1 [12]).

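Let $M_2 \in \mathbb{R}^{N \times N}$ have rank $r$. Then there exist an orthogonal matrix $V^H = [v^H_1 \ v^H_2 \ldots v^H_r] \in \mathbb{R}^{r \times r}$ and eigenvalues $\lambda^H_a$, $1 \le a \le r$, such that the orthogonal tensor decomposition of $H$ is

$$H = \sum_{a=1}^{r} \lambda^H_a \big(v^H_a \otimes v^H_a \otimes v^H_a\big).$$

Let $\Lambda^H = \mathrm{diag}(\lambda^H_1, \ldots, \lambda^H_r)$. Then the parameters of the mixture distribution are

$$P = U_{M_2} \Sigma_{M_2}^{1/2} V^H \Lambda^H \quad \text{and} \quad Q = (\Lambda^H)^{-2}.$$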
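The recovery formulas of Theorem 2 can be checked numerically on exact moments. The sketch below whitens $M_3$ with the eigen-decomposition of $M_2$ and then extracts the orthogonal components with tensor power iteration and deflation. That extraction step is a standard technique chosen here for illustration and is not claimed to be the paper's Algorithm 2; the function name, iteration counts and test dimensions are likewise assumptions.

```python
import numpy as np

def decompose(M2, M3, r, n_iter=100, n_restarts=10, seed=0):
    """Recover (P, q) from exact M2, M3 via whitening and tensor power iteration."""
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = np.linalg.eigh(M2)
    idx = np.argsort(eigvals)[::-1][:r]
    U, s = eigvecs[:, idx], eigvals[idx]
    W = U / np.sqrt(s)                                   # whitening matrix U Sigma^{-1/2}
    H = np.einsum("ijk,ia,jb,kc->abc", M3, W, W, W)      # H = M3[W, W, W], an r x r x r tensor
    lams, vecs = [], []
    for _ in range(r):
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.normal(size=r)
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v = np.einsum("abc,b,c->a", H, v, v)     # power update v <- H[I, v, v]
                v /= np.linalg.norm(v)
            lam = np.einsum("abc,a,b,c->", H, v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam)
        vecs.append(best_v)
        # Deflation: remove the component just found before searching for the next one.
        H = H - best_lam * np.einsum("a,b,c->abc", best_v, best_v, best_v)
    Lam, V = np.array(lams), np.array(vecs).T
    P = (U * np.sqrt(s)) @ V * Lam                       # P = U Sigma^{1/2} V Lambda
    q = Lam ** -2.0                                      # Q = Lambda^{-2}
    return P, q

rng = np.random.default_rng(1)
P_true = rng.uniform(-1.0, 1.0, size=(8, 3))             # hypothetical N = 8, r = 3
q_true = np.array([0.2, 0.3, 0.5])
M2 = np.einsum("a,ia,ja->ij", q_true, P_true, P_true)
M3 = np.einsum("a,ia,ja,ka->ijk", q_true, P_true, P_true, P_true)
P_hat, q_hat = decompose(M2, M3, r=3)
order = np.argsort(q_hat)                                # components return in arbitrary order
```

With exact moments the components are recovered up to a permutation, which is why the comparison sorts by the estimated mixture weights.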

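The main challenge in estimating $M_2$ (resp. $M_3$) from empirical data is the diagonal entries. In [12], an alternating minimization approach is used for matrix completion to find the missing diagonal entries of $M_2$, and a least squares method is used for estimating the tensor $H$ directly from the samples. Let $\Omega_2$ denote the set of off-diagonal indices for an $N \times N$ matrix and $\Omega_3$ denote the off-diagonal entries of an $N \times N \times N$ tensor, such that the corresponding projections are defined as

$$P_{\Omega_2}(M)_{ij} \equiv \begin{cases} M_{ij} & \text{if } i \ne j, \\ 0 & \text{otherwise,} \end{cases} \quad \text{and} \quad P_{\Omega_3}(T)_{ijk} \equiv \begin{cases} T_{ijk} & \text{if } i \ne j,\ j \ne k,\ k \ne i, \\ 0 & \text{otherwise,} \end{cases}$$

for $M \in \mathbb{R}^{N \times N}$ and $T \in \mathbb{R}^{N \times N \times N}$.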
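The two projections are simple masking operations. A minimal NumPy sketch follows; the function names are hypothetical.

```python
import numpy as np

def project_offdiag_matrix(M):
    """Keep M_ij for i != j and zero the diagonal."""
    out = M.copy()
    np.fill_diagonal(out, 0.0)
    return out

def project_offdiag_tensor(T):
    """Keep T_ijk only when i, j, k are pairwise distinct; zero all other entries."""
    N = T.shape[0]
    i, j, k = np.indices((N, N, N))
    mask = (i != j) & (j != k) & (k != i)
    return np.where(mask, T, 0.0)

N = 4
kept_matrix = project_offdiag_matrix(np.ones((N, N))).sum()      # N (N - 1) entries survive
kept_tensor = project_offdiag_tensor(np.ones((N, N, N))).sum()   # N (N - 1)(N - 2) entries survive
dropped_tensor = N ** 3 - kept_tensor                            # 3 N^2 - 2 N entries are zeroed
```

Counting the surviving entries of an all-ones input shows that the tensor projection discards only $3N^2 - 2N$ of the $N^3$ entries.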

In lieu of above discussion, we shall use and to obtain estimation of diagonal entries of and respectively. To keep technical arguments simple, we shall use first samples based , denoted as and second samples based , denoted by in Algorithm 2.

Next, we state the correctness of Algorithm 2 when $\mu(M_2)$ is small; the proof is in the Appendix.

###### Theorem 3.

There exists universal, strictly positive constants such that for all and , if

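$$|S| \ge C' \frac{r N^4 \log(N/\delta)}{q_{\min}\, \sigma_1(M_2)^2\, \varepsilon^2} \Big(\frac{1}{\ell^2} + \frac{\sigma_1(M_2)}{\ell N} + \frac{r^4 \sigma_1(M_2)^4}{\sigma_r(M_2)^5}\Big), \quad \text{and} \quad N \ge C' r^{3.5} \mu^6 \Big(\frac{\sigma_1(M_2)}{\sigma_r(M_2)}\Big)^{4.5},$$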

then there exists a permutation over such that Algorithm 2 achieves the following bounds with a choice of for all , with probability at least :

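$$|\hat{q}_{\pi_i} - q_i| \le \varepsilon, \quad \text{and} \quad \|\hat{P}_{\pi_i} - P_i\| \le \varepsilon \sqrt{\frac{r\, q_{\max}\, \sigma_1(M_2)}{q_{\min}}},$$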

where defined in (8) with run-time .

### 3.2 Phase 2: RankCentrality.

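Recall that $E = \{(i_k, j_k) : 1 \le k \le N\}$ represents the collection of $N$ pairs and $G = ([n], E)$ is the corresponding graph. Let $\tilde{P}_a$ denote the estimate of $P_a = [P_{ka}] \in [-1,1]^N$ for the mixture component $a$, $1 \le a \le r$, where $P_{ka}$ is defined as per (1). For each $a$, using $G$ and $\tilde{P}_a$, we shall use RankCentrality [19] to obtain an estimate of $w^{(a)}$. Next we describe the algorithm and the guarantees associated with it.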

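Without loss of generality, we can assume that $w^{(a)}$ is such that $\sum_i w^{(a)}_i = 1$ for all $a \in [r]$. Given this normalization, RankCentrality estimates $w^{(a)}$ as the stationary distribution of an appropriate Markov chain on $G$. The transition probabilities are $0$ for all $(i,j) \notin E$. For $(i,j) \in E$, they are a function of $\tilde{P}_a$. Specifically, the transition matrix is $\tilde{p}^{(a)} = [\tilde{p}^{(a)}_{i,j}] \in [0,1]^{n \times n}$ with $\tilde{p}^{(a)}_{i,j} = 0$ if $(i,j) \notin E$ and $i \ne j$, and for $(i_k, j_k) \in E$ with $1 \le k \le N$,

$$\tilde{p}^{(a)}_{i_k, j_k} = \frac{1}{d_{\max}} \frac{(1 + \tilde{P}_{ka})}{2} \quad \text{and} \quad \tilde{p}^{(a)}_{j_k, i_k} = \frac{1}{d_{\max}} \frac{(1 - \tilde{P}_{ka})}{2}. \tag{14}$$

Finally, $\tilde{p}^{(a)}_{i,i} = 1 - \sum_{j \ne i} \tilde{p}^{(a)}_{i,j}$ for all $i \in [n]$. Let $\tilde{\pi}^{(a)} = [\tilde{\pi}^{(a)}_i]$ be a stationary distribution of the Markov chain defined by $\tilde{p}^{(a)}$. That is,

$$\tilde{\pi}^{(a)}_i = \sum_j \tilde{p}^{(a)}_{ji}\, \tilde{\pi}^{(a)}_j \quad \text{for all } i \in [n]. \tag{15}$$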

Computationally, we suggest obtaining estimation of by using power-iteration for iterations. As argued before, cf. [19], , is sufficient to obtain reasonably good estimation of .

The underlying assumption here is that there is a unique stationary distribution, which is established by our result under the conditions of Theorem 1. Now $\tilde{p}^{(a)}$ is an approximation of the ideal transition probabilities $p^{(a)} = [p^{(a)}_{i,j}]$, where $p^{(a)}_{i,j} = 0$ if $(i,j) \notin E$ and $p^{(a)}_{i,j} \propto w^{(a)}_j / (w^{(a)}_i + w^{(a)}_j)$ for all $(i,j) \in E$. Such an ideal Markov chain is reversible and, as long as $G$ is connected (which it is, in our case, by choice), the stationary distribution of this ideal chain is $\pi^{(a)} = w^{(a)}$ (recall, we have assumed $w^{(a)}$ to be normalized so that all its components sum up to $1$).
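The Markov chain of (14)-(15) and its power iteration can be sketched in plain Python. The example below feeds the chain the exact marginals from (1), so the stationary distribution should coincide with the normalized weights. The function name, the weights, the comparison graph and the iteration count are all hypothetical choices for illustration.

```python
def rank_centrality(n, pairs, P_col, num_iters=2000):
    """Estimate normalized MNL weights from pair-wise marginals P_col[k] for pairs[k] = (i, j)."""
    degree = [0] * n
    for i, j in pairs:
        degree[i] += 1
        degree[j] += 1
    d_max = max(degree)
    # Transition matrix of (14): move i -> j with probability (1 + P_k) / (2 d_max).
    p = [[0.0] * n for _ in range(n)]
    for (i, j), P_k in zip(pairs, P_col):
        p[i][j] = (1.0 + P_k) / (2.0 * d_max)
        p[j][i] = (1.0 - P_k) / (2.0 * d_max)
    for i in range(n):
        p[i][i] = 1.0 - sum(p[i])             # self-loop makes each row sum to one
    # Power iteration for the stationary distribution of (15): pi_i = sum_j p_ji pi_j.
    pi = [1.0 / n] * n
    for _ in range(num_iters):
        pi = [sum(pi[j] * p[j][i] for j in range(n)) for i in range(n)]
    return pi

w = [0.1, 0.2, 0.3, 0.4]                                   # hypothetical normalized weights
pairs = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]           # connected, non-bipartite graph
P_col = [(w[j] - w[i]) / (w[j] + w[i]) for i, j in pairs]  # exact marginals from (1)
pi = rank_centrality(len(w), pairs, P_col)
print([round(v, 4) for v in pi])
```

With noisy marginals in place of `P_col`, the same routine returns a perturbed stationary distribution, which is the quantity whose error the subsequent analysis controls.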

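Now $\tilde{p}^{(a)}$ is an approximation of such an ideal transition matrix $p^{(a)}$. In what follows, we state a result about how this approximation error translates into the error between $\tilde{\pi}^{(a)}$ and $w^{(a)}$. Recall that $b$ is the dynamic range of the weights, $d_{\max}$ and $d_{\min}$ are the maximum and minimum vertex degrees of $G$, and $\xi$ is as defined in (9).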

###### Theorem 4.

Let be non-bipartite and connected. Let for some positive . Then, for some positive universal constant ,

$$\frac{\|\tilde{\pi}^{(a)} - w^{(a)}\|}{\|w^{(a)}\|} \le C\, \frac{b^{5/2}}{\xi}\, \frac{d_{\max}}{d_{\min}}\, \varepsilon. \tag{16}$$

And, starting from any initial condition, the power iteration manages to produce an estimate of within twice the above stated error bound in iterations.

The proof of the above result can be found in the Appendix. For a spectral expander (e.g. a connected Erdős-Rényi graph, with high probability), $\xi = \Theta(1)$, and therefore the bound is effectively $O(\varepsilon)$ for bounded dynamic range, i.e. $b = O(1)$.

## 4 Discussion

Learning a distribution over permutations of $n$ objects from partial observations is fundamental to many domains. In this work, we have advanced the understanding of this question by characterizing sufficient conditions, and an associated algorithm, under which it is feasible to learn a mixed MNL model in a computationally and statistically efficient (polynomial in problem size) manner from partial/pair-wise comparisons. The conditions are natural: the mixture components should be "identifiable" given partial preference/comparison data, stated in terms of full rank and incoherence conditions on the second moment matrix. The algorithm allows learning of the mixture components as long as the number of mixture components scales as $o(n^{2/7})$ for distributions over permutations of $n$ objects.

To the best of our knowledge, this work provides the first such sufficient condition for learning the mixed MNL model, a problem that has remained open in econometrics and statistics for a while, and more recently in machine learning. Our work nicely complements the impossibility results of [1].

Analytically, our work advances the recently popularized spectral/tensor approach for learning mixture model from lower order moments. Concretely, we provide means to learn the component even when only partial information about the sample is available unlike the prior works. To learn the model parameters, once we identify the moments associated with each mixture, we advance the result of [19] in its applicability. Spectral methods have also been applied to ranking in the context of assortment optimization in [5].

## References

• [1] A. Ammar, S. Oh, D. Shah, and L. Voloch. What’s your choice? learning the mixed multi-nomial logit model. In Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems, 2014.
• [2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559, 2012.
• [3] H. Azari Soufiani, W. Chen, D. C Parkes, and L. Xia. Generalized method-of-moments for rank aggregation. In Advances in Neural Information Processing Systems 26, pages 2706–2714. 2013.
• [4] H. Azari Soufiani, D. Parkes, and L. Xia. Computing parametric ranking models via rank-breaking. In Proceedings of The 31st International Conference on Machine Learning, pages 360–368, 2014.
• [5] J. Blanchet, G. Gallego, and V. Goyal. A markov chain approximation to choice modeling. In EC, pages 103–104, 2013.