Convex Latent Effect Logit Model via Sparse and Low-rank Decomposition

08/22/2021
by Hongyuan Zhan, et al.

In this paper, we propose a convex formulation for learning a logistic regression model (logit) with latent heterogeneous effects on sub-populations. In transportation, logistic regression and its variants are often interpreted as discrete choice models under utility theory (McFadden, 2001). Two prominent applications of logit models in the transportation domain are traffic accident analysis and choice modeling. In these applications, researchers often want to understand and capture the individual variation under the same accident or choice scenario. The mixed effect logistic regression (mixed logit) is a popular model employed by transportation researchers. To estimate the distribution of mixed logit parameters, a non-convex optimization problem with nested high-dimensional integrals needs to be solved. Simulation-based optimization is typically applied to solve the mixed logit parameter estimation problem. Despite its popularity, the mixed logit approach to learning individual heterogeneity has several downsides. First, the parametric form of the distribution requires domain knowledge and assumptions imposed by users, although this issue can be addressed to some extent by using a non-parametric approach. Second, the optimization problems arising from parameter estimation for mixed logit and its non-parametric extensions are non-convex, which leads to unstable model interpretation. Third, the simulation size in simulation-assisted estimation lacks finite-sample theoretical guarantees and is chosen somewhat arbitrarily in practice. To address these issues, we develop a formulation that models the latent individual heterogeneity while preserving convexity, and avoids the need for simulation-based approximation. Our setup is based on decomposing the parameters into a sparse homogeneous component shared by the population and low-rank heterogeneous parts for each individual.


1 Introduction

The mixed effect logistic regression model (mixed logit) has seen great success in transportation econometrics applications. Mixed logit is a generalized linear mixed model for multinomial responses (McCullagh and Nelder, 1983). It can also be derived from discrete choice theory as a random utility model (McFadden, 2001). It brings advantages over fixed effect multinomial logistic regression (multinomial logit) due to its flexibility in incorporating individual heterogeneity. In mixed logit, the parameters for each observation are random and their distributions are estimated from data. Therefore, conditioned on the same observed variables, two observations may not share the same outcome due to differences in individual-level parameters. Under random utility theory, the parameter heterogeneity for each observation models taste variation in the population. In addition, when interpreted as a discrete choice model, mixed logit relaxes the Independence of Irrelevant Alternatives (IIA) assumption (McFadden, 1981; Train, 2009). The flexibility to model individual heterogeneity is important in transportation applications. For example, in the analysis of traffic accident outcomes, the effect of age is heterogeneous since people of the same age can have diverse health conditions, which affects their degree of resistance to the impact of a crash (Kim et al., 2008; Swedler et al., 2012; Kim et al., 2013).

The flexibility of the mixed logit model comes with difficulties in specifying and estimating it. In the past twenty years, many advances have been made in understanding the mixing distributions used (Cherchi and Polak, 2005; Fosgerau, 2006; Neuhaus and McCulloch, 2011; McCulloch and Neuhaus, 2011), and in relaxing the restrictions on the parametric form of the distribution, which leads to semi-parametric or non-parametric models (Greene and Hensher, 2003; Bansal et al., 2018). There is no consensus on the robustness of mixing distributions. Many authors have discussed the sensitivity of estimates to the specification of mixing distributions (Heckman and Singer, 1984; Hensher and Greene, 2003; Agresti et al., 2004; Huang and Zhao, 2015). In applications of mixed logit to welfare analysis, the choice of mixing distribution has been shown to have a profound impact (Cherchi and Polak, 2005; Fosgerau, 2006; Huang and Zhao, 2015). However, there are practical scenarios in which the influence of mixing distribution misspecification is secondary, as pointed out in McCulloch and Neuhaus (2011). Nevertheless, much recent effort has gone into making logit models more robust, from both computational and modeling perspectives. One line of work uses non-parametric estimation of the mixing distributions (Train, 2008; De Blasi et al., 2010; Bansal et al., 2018). A comparison between parametric and non-parametric estimation of the mixing distributions is provided in Bhat and Lavieri (2018). An early work along this line is the latent class model proposed by Greene and Hensher (2003), in which the heterogeneous parameters are conditioned on the latent class to which individuals belong. The use of mixtures of normals to approximate unknown mixing distributions is studied in (Fosgerau and Hess, 2009; Keane and Wasi, 2013). More recently, the logit-mixture logit (LML) model was proposed in Train (2016). LML is a generalization of the latent class logit (Greene and Hensher, 2003) and many other mixture models (Train, 2008; Fox et al., 2011; Fosgerau and Mabit, 2013), using functionals such as splines to approximate the kernel of the probability mass function of the discrete mixing distribution. Under very general assumptions, any continuous mixing distribution can be approximated by a discrete logit-mixture with sufficiently large support (Train, 2016). Therefore, non-parametric extensions can be considered improvements in robustness with respect to misspecification of mixing distributions.

Mixed logit and its non-parametric variants are typically estimated with the maximum simulated likelihood (MSL) estimator (McFadden, 1989; Hajivassiliou and Ruud, 1994) or the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The expectation operator is nested inside the logarithm in the maximum likelihood estimator. Therefore, simulations are often used to approximate the expectation, leading to the MSL estimator. A large body of work has been devoted to making the simulation more efficient and the approximation of the expectation integrals numerically more robust in practice. Quasi-Monte Carlo (QMC) methods have been studied by several authors (Train, 2000; Bhat, 2001; Bastin et al., 2004; Munger et al., 2012). It has been shown that QMC reduces the variance of the simulation and the number of draws needed in practice. However, it should be noted that the number of draws has to increase with the sample size in theory (Hensher and Greene, 2003; Train and others, 2009), and there are still no theoretical guidelines on the number of draws sufficient for finite samples. Quasi-Newton methods (Shanno, 1970; Fletcher, 1970) or trust-region methods (Wright and Nocedal, 1999; Bastin et al., 2006) are then applied to optimize the simulation-based objective function. EM algorithms for estimating mixed logit models and their variants are studied in Train (2008); Pacifico (2012); Jagabathula and Rusmevichientong (2016); Vij and Krueger (2017). Zhan et al. (2019) recently studied stochastic gradient methods for estimating mixed logit models with Gaussian mixing distributions. Bansal et al. (2018) study the Minorization-Maximization (MM) algorithm, an extension of EM, for fitting the logit-mixture logit model. Note that the maximum likelihood estimation problems for mixed logit and its non-parametric variants are non-convex, due to the expectation operator nested inside the logarithm. Therefore, both MSL-based estimation and EM-type algorithms depend on the starting point. Hence, the model parameters and the estimated mixing distribution can be unstable because multiple local optima exist in the optimization landscape. The effect of initialization on mixed logit parameter estimation was recently studied by Hole and Yoo (2017).

This paper is motivated by the goal of developing a model that allows analysts to incorporate individual heterogeneity while remaining convex. Preserving convexity is important for obtaining stable estimation. A naive way to allow heterogeneity is to parametrize each observation independently, but this leads to an over-parametrized model. A remedy is to have heterogeneous parameters for each sub-population or cluster. However, the clusters or grouping effects in the population are unobserved. Existing work marginalizes the latent variables, which produces non-convex objective functions (Greene and Hensher, 2003; Train, 2008; Fosgerau and Hess, 2009; De Blasi et al., 2010; Keane and Wasi, 2013; Train, 2016; Bansal et al., 2018). Therefore, we would like to avoid marginalization in the estimation. Mixing distributions can be considered probabilistic priors on the latent heterogeneity in the population. Alternatively, a structural prior is a way to encode general knowledge about the structure of data, frequently used in machine learning and inverse problems (Kaipio et al., 1999; Mansinghka et al., 2006; Huang et al., 2011; Jenatton et al., 2011a; Haeffele et al., 2014). In order to avoid marginalization, we exploit structural priors that can be expressed non-probabilistically. To take advantage of this structural information, our proposed model is estimated using a regularized maximum likelihood approach. In particular, we consider the following structure: the homogeneous effect in the population is sparse, and the individual heterogeneity is intrinsically low-dimensional despite the high dimensionality of the ambient space. The reason and theoretical justification for imposing this geometric structure will be made clear later. Our work is related to low-rank and sparse matrix decomposition (Candès and Recht, 2009; Candès et al., 2011). In many applications, researchers discovered that corrupted or unobserved data can be recovered almost exactly if the data matrix is approximately low-rank (Candès and Recht, 2009; Candès and Tao, 2010; Candès et al., 2011; Richard et al., 2012; Soltanolkotabi et al., 2014; Aybat et al., 2015). The Netflix challenge (Bennett et al., 2007) is probably the most well-known application of this technique. Moreover, data from many real problems are found to be distributed on intrinsically low-dimensional manifolds even if the observations are high-dimensional (Levina and Bickel, 2005; Udell and Townsend, 2018). We develop a fast optimization algorithm to fit the proposed convex latent effect logit model and evaluate the model on traffic accidents extracted from the Statewide Integrated Traffic Records System (SWITRS) (The California Highway Patrol, 2014). The computation of one instance of the model on datasets of 10000 samples finishes in minutes, whereas estimating a mixed logit model on datasets of similar size with the same set of variables takes several hours using the commercial software NLOGIT (Kim et al., 2013; Econometric Software, Inc., Technical report).

The remainder of this paper is organized as follows: section 2 reviews background on logit models; section 3 introduces our sparse and low-rank decomposition-based convex objective function for incorporating heterogeneity, and explains why low-rankness is a broad geometric description covering several different data-generating processes; section 4 describes the optimization algorithms for learning the parameters of the model; section 5 introduces a greedy scheme for hyperparameter tuning; section 6 presents case studies using our proposed model for traffic accident analysis, followed by the conclusion in section 7.

2 Background

We provide a brief review of the multinomial logit and mixed logit models in this section, and point out the mathematical challenges. In the following sections, let $(x_i, y_i)$ denote a pair of observations, where $x_i \in \mathbb{R}^p$ is the vector of observed variables (features) and $y_i$ is a categorical response from the set $\{1, \dots, K\}$. For example, in choice modeling, $y_i$ represents the decision made by an agent in observation $i$; in traffic crash modeling, $y_i$ denotes the severity category suffered in a crash. The probability of response $y_i$ conditioned on the variables $x_i$ and the parameters is modeled by the multinomial distribution

$P(y_i = k \mid x_i) = \frac{\exp(V_k(x_i))}{\sum_{l=1}^{K} \exp(V_l(x_i))}, \qquad k = 1, \dots, K.$ (1)

In the discrete choice literature, the $V_k$ are usually referred to as the propensity functions for making decision $k$. The conditional utility¹ from decision $k$ is

$U_k(x_i) = V_k(x_i) + \epsilon_k,$ (2)

where the $\epsilon_k$ are i.i.d. Gumbel (type-I extreme value) errors.

¹Better interpreted as "harms" in traffic accidents.

Variants of logit models differ in how the propensity functions are modeled and in the methods used to estimate them. In multinomial logistic regression (MNL), the propensity functions are parametrized by linear functionals with parameters $\beta_k$,

$V_k(x_i) = \beta_k^\top x_i, \qquad k = 1, \dots, K.$ (3)

Equation (3) implies that two observations $i$ and $j$ have identical probability distributions of outcomes whenever $x_i = x_j$. This is a restrictive assumption that can be violated in the presence of individual heterogeneity. Mixed logit models relax this restriction by placing probability distributions on the parameters, such that individual $i$ has heterogeneous parameters $\beta_i$ governed by a probability distribution $f(\beta; \theta)$. $f$ is usually referred to as the mixing distribution, and its parameters $\theta$ are to be estimated. Normal, log-normal, and truncated normal distributions are among the popular choices of parametric mixing distributions (Train, 2009). On the other hand, $f$ can be constructed via a non-parametric approach or via discrete mixtures of distributions (Fosgerau and Hess, 2009; Fosgerau and Mabit, 2013; Train, 2016; Vij and Krueger, 2017). Let $n$ denote the sample size. The unobserved heterogeneous parameters are usually integrated out, yielding the marginal log-likelihood estimator:

$\max_{\theta} \; L(\theta) = \sum_{i=1}^{n} \log \mathbb{E}_{\beta_i \sim f(\cdot;\theta)} \left[ P(y_i \mid x_i, \beta_i) \right].$ (4)

There are two main mathematical challenges in solving the optimization problem in (4). First, $L(\theta)$ is usually non-concave. Second, the expectation taken over the distribution of $\beta_i$ often has no closed-form expression and requires further approximation. $L(\theta)$ is usually replaced by the Maximum Simulated Likelihood (MSL) estimator (McFadden, 1989; Hajivassiliou and Ruud, 1994), which uses (quasi-)Monte Carlo simulation to approximate the expectation. The Expectation-Maximization (EM) iterative procedure (Dempster et al., 1977) is another option.
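To make the nested log-of-expectation in (4) concrete, the sketch below approximates the marginal log-likelihood by plain Monte Carlo under a Gaussian mixing distribution. It is a minimal illustration rather than the estimators cited above: the diagonal Gaussian mixing, the variable names, and the fixed number of draws are all assumptions made for exposition.

```python
import numpy as np

def simulated_loglik(mu, sigma, X, y, n_draws=200, seed=0):
    """Monte Carlo approximation of the marginal log-likelihood (4) under
    an assumed Gaussian mixing distribution beta ~ N(mu, diag(sigma**2)).
    mu, sigma: (K, p) mean/std of class-wise parameters; X: (n, p)
    features; y: (n,) integer responses in {0, ..., K-1}."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    probs = np.zeros(n)
    for _ in range(n_draws):
        beta = mu + sigma * rng.standard_normal(mu.shape)  # one draw of the random parameters
        V = X @ beta.T                                     # (n, K) linear propensities, eq. (3)
        V -= V.max(axis=1, keepdims=True)                  # numerical stability
        P = np.exp(V)
        P /= P.sum(axis=1, keepdims=True)                  # multinomial probabilities, eq. (1)
        probs += P[np.arange(n), y]
    probs /= n_draws                                       # inner expectation, approximated by draws
    return np.log(probs).sum()                             # log outside the average: non-concave

```

Because the logarithm is applied after the inner average, the resulting objective is non-concave in the mixing parameters, which is exactly the instability the convex formulation below is designed to avoid.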

3 Sparse and Low-rank Decomposition

We describe our sparse and low-rank decomposition approach to modeling heterogeneity in this section. Let $\beta_i$ denote the parameter vector for the $i$-th individual. For each observation $i$, we decompose the individual parameters as

$\beta_i = w + \gamma_i.$ (5)

The individual parameter $\beta_i$ is separated into a homogeneous effect vector $w$ and a heterogeneous counterpart $\gamma_i$ for individual $i$. Without any assumptions on $\gamma_i$, it is impossible to learn the homogeneous-heterogeneous effect decomposition, since equation (5) over-parametrizes the data. In mixed logit, the degrees of freedom in $\gamma_i$ are constrained by the mixing distributions used. However, this leads to a non-convex objective function, as discussed earlier. Instead of imposing mixing distributions, we penalize the degrees of freedom of the model by imposing geometric constraints. We first describe the notation before introducing the formulation. Let

$\Gamma = [\gamma_1, \dots, \gamma_n] \in \mathbb{R}^{p \times n}$ (6)

be the matrix appending the heterogeneous parameters of the $n$ individuals as columns. Denote the homogeneous effect for the $k$-th category as $w_k$, and let

$W = [w_1, \dots, w_K] \in \mathbb{R}^{p \times K}$ (7)

be the matrix collecting the homogeneous effects. Let $W^{(j)}$ denote the $j$-th row of $W$. The following constrained optimization problem is considered:

$\min_{W, \Gamma} \; -\frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid x_i; W, \Gamma) \quad \text{s.t.} \quad \sum_{j=1}^{p} \lVert W^{(j)} \rVert_2 \le \alpha, \;\; \operatorname{rank}(\Gamma) \le r,$ (8)

where $\alpha \ge 0$ and $r$ are hyper-parameters to be selected. $\sum_j \lVert W^{(j)} \rVert_2$ is the so-called group-$\ell_2$ norm used by the group Lasso (Yuan and Lin, 2006; Friedman et al., 2010b). When $\alpha$ is smaller than the minimum group-$\ell_2$ norm of the solutions to the unconstrained problem, the optimum of (8) forces $W^{(j)}$ into a vector of zeros for some $j$. Therefore, the optimal solution is sparse, and there is no homogeneous effect for variable $j$ if $W^{(j)} = 0$. When $r < \min(p, n)$, the heterogeneous effects are constrained to a low-rank subspace. In the next subsection, we explain why imposing the low-rank structural assumption helps uncover the separation between homogeneous and heterogeneous effects. Problem (8) is non-convex due to the rank constraint. A convex relaxation of (8) is presented in section 3.2.

3.1 Why Low-Rankness?

In this section, we explain situations in which low-rank heterogeneous effects arise.

3.1.1 Gaussian mixing with approximately low-rank covariance

The homogeneous-heterogeneous effect decomposition in (5) resembles the linear representation of Gaussian mixing variables. In the Gaussian mixed logit model, the random parameters can be decomposed as

$\beta_i = \mu + \eta_i, \qquad \eta_i \sim \mathcal{N}(0, \Sigma).$ (9)

Hence the mean $\mu$ of the Gaussian distribution corresponds to the homogeneous effect, and

$\Gamma = [\eta_1, \dots, \eta_n],$ (10)

where $\eta_i \sim \mathcal{N}(0, \Sigma)$. A low-rank matrix of mixing effects occurs when the covariance matrix $\Sigma$ is (approximately) low-rank. This happens when the covariance matrix is a superposition of a low-rank component and a sparse component (Chandrasekaran et al., 2012; Richard et al., 2012; Meng et al., 2014; Oymak et al., 2015; Chen et al., 2015), or when the covariance matrix has a block-diagonal structure in which the variables within each block are highly correlated and variables across different blocks are independent (Saunderson et al., 2012; Liutkus and Yoshii, 2017).

3.1.2 Latent clustered heterogeneity

Suppose the population forms $C$ clusters, such that the predictive variables have identical or similar influence on individuals belonging to the same cluster. Let $\{G_1, \dots, G_C\}$ be a partition of the data, i.e., $G_c \cap G_{c'} = \emptyset$ if $c \ne c'$, and $\bigcup_{c=1}^{C} G_c = \{1, \dots, n\}$. If the individual heterogeneities are identical within each cluster, then

$\gamma_i = \nu_c \quad \text{for all } i \in G_c.$ (11)

Therefore, observations in the same cluster have identical columns in $\Gamma$. As a result, $\operatorname{rank}(\Gamma) \le C$. (11) can be relaxed slightly; consider

$\gamma_i = \nu_c + \epsilon_i, \quad \lVert \epsilon_i \rVert_2 \text{ small}, \quad \text{for all } i \in G_c.$ (12)

(12) leads to grouping of the heterogeneous effects and forces the latent effects in the same group to stay close. $\Gamma$ can be approximated by a low-rank matrix in this case. Note that latent clustering of heterogeneity is also an implicit assumption in discrete mixture multinomial logits (Greene and Hensher, 2003; Train, 2008; Pacifico, 2012; Jagabathula and Rusmevichientong, 2016; Vij and Krueger, 2017).
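A quick numerical check of this argument (illustrative only; the dimensions and cluster count below are arbitrary) confirms that a heterogeneity matrix with cluster-wise identical columns has rank at most $C$, and that the relaxed version (12) is approximately low-rank:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, C = 17, 1000, 4
centers = rng.standard_normal((p, C))        # nu_c: one latent effect per cluster
labels = rng.integers(0, C, size=n)          # unobserved cluster memberships
Gamma = centers[:, labels]                   # eq. (11): identical columns within each cluster
print(np.linalg.matrix_rank(Gamma))          # prints 4 (= C)

Gamma_noisy = Gamma + 0.01 * rng.standard_normal((p, n))  # relaxation (12)
s = np.linalg.svd(Gamma_noisy, compute_uv=False)
print(s[:6])                                 # four dominant singular values, the rest near zero
```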

3.1.3 Latent matrix factorization

Suppose $\operatorname{rank}(\Gamma) = r$ with $r < \min(p, n)$. $\Gamma$ can be equivalently expressed as

$\Gamma = U S^\top,$ (13)

where $U \in \mathbb{R}^{p \times r}$, $S \in \mathbb{R}^{n \times r}$, and both $U$ and $S$ have full column rank. The matrix $U$ is usually interpreted as latent loadings, and the rows of $S$ are scores for each of the basis vectors in $U$. This is sometimes referred to as a matrix factorization representation (Lee and Seung, 1999, 2001; Ding et al., 2005, 2008; Abdi and Williams, 2010; Gillis, 2014). Under this interpretation, individual heterogeneities are created through different linear combinations of the same latent sources.

3.2 Convex relaxation

Optimization with matrix rank constraints is NP-hard. Finding exact solutions to problem (8) is difficult and can be impractical. Therefore, we use a convex program to approximate the original formulation. Let $\lVert \Gamma \rVert_*$ be the nuclear norm of the matrix $\Gamma$, defined as follows:

$\lVert \Gamma \rVert_* = \sum_{i=1}^{\operatorname{rank}(\Gamma)} \sigma_i(\Gamma),$ (14)

where $\operatorname{rank}(\Gamma)$ is the rank of $\Gamma$ and $\sigma_i(\Gamma)$ is the $i$-th largest singular value of $\Gamma$. The nuclear norm is a convex relaxation of the rank function (Fazel et al., 2001; Recht et al., 2008, 2010) and has been successfully applied to many problems involving the rank function as an objective or constraint (Candès and Recht, 2009; Candès et al., 2011; Soltanolkotabi et al., 2014; Aybat et al., 2014; Udell and Townsend, 2018; Chandrasekaran et al., 2012; Richard et al., 2012; Meng et al., 2014; Oymak et al., 2015; Chen et al., 2015) (and references therein). Therefore, we solve the following convex program:

$\min_{W, \Gamma} \; -\frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid x_i; W, \Gamma) + \lambda_1 \sum_{j=1}^{p} \lVert W^{(j)} \rVert_2 + \lambda_2 \lVert \Gamma \rVert_*.$ (15)

(15) reformulates (8) by first transforming the original problem into Lagrangian form, then applying the nuclear norm relaxation to $\operatorname{rank}(\Gamma)$. (15) is the sum of three convex functions, and therefore convexity is preserved. $\lambda_1$ and $\lambda_2$ are non-negative hyperparameters corresponding to the upper bounds $\alpha$ and $r$ in the hard-constrained version.
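For concreteness, the sketch below evaluates an objective of the form (15). It is specialized to binary logit so that the decomposition (5) enters the loss directly; the paper's loss is multinomial, and the function and variable names here are illustrative, not the paper's notation.

```python
import numpy as np

def objective(w, Gamma, X, y, lam1, lam2):
    """Penalized likelihood of the form (15), binary-logit simplification.
    w: (p,) homogeneous effect; Gamma: (p, n), column i is the latent
    effect gamma_i of observation i; y: (n,) labels in {0, 1}."""
    v = X @ w + np.sum(X * Gamma.T, axis=1)        # propensity with beta_i = w + gamma_i, eq. (5)
    nll = np.mean(np.logaddexp(0.0, v) - y * v)    # negative Bernoulli log-likelihood (smooth, convex)
    group = lam1 * np.sum(np.abs(w))               # group-l2 penalty; for one vector it reduces to l1
    nuclear = lam2 * np.linalg.norm(Gamma, 'nuc')  # nuclear norm ||Gamma||_*
    return nll + group + nuclear
```

In the multinomial case the groups are the rows of $W$ across the $K$ categories, so the group-$\ell_2$ penalty removes a variable's homogeneous effect from all categories at once.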

4 Optimization

4.1 Proximal Gradient algorithm

We now describe an efficient algorithm to solve the convex problem (15). Note that the group-$\ell_2$ and nuclear norm terms are non-smooth. Therefore, methods requiring second-order continuity, for example quasi-Newton methods (Fletcher, 1970; Broyden, 1970; Shanno, 1970) and trust-region methods (Sorensen, 1982; Bastin et al., 2006), cannot be applied. The main steps in the algorithm deal with these non-smooth terms. Problem (15) has the form

$\min_{x} \; F(x) = f(x) + g(x),$ (16)

where $f$ is the smooth convex part corresponding to the multinomial loss, and $g$ is the convex but non-smooth part consisting of the group-$\ell_2$ and nuclear norm penalties. The Proximal Gradient method (PG) (Beck and Teboulle, 2009; Cai et al., 2010; Toh and Yun, 2010; Ma et al., 2011; Jenatton et al., 2011b) is an efficient first-order algorithm for handling the non-smoothness in $g$. It performs the following simple iteration:

$x^{t+1} = \operatorname{prox}_{\eta g}\left( x^t - \eta \nabla f(x^t) \right),$ (17)

where $\eta$ is the step-size and $\operatorname{prox}_{\eta g}$ is the proximal operator associated with $g$ and $\eta$, defined below:

$\operatorname{prox}_{\eta g}(z) = \operatorname*{argmin}_{x} \; g(x) + \frac{1}{2\eta} \lVert x - z \rVert_2^2.$ (18)

Suppose $f$ is a continuously differentiable function with Lipschitz gradients, i.e.,

$\lVert \nabla f(x) - \nabla f(y) \rVert_2 \le L_f \lVert x - y \rVert_2$ (19)

for all $x, y$. The convergence rate of the Proximal Gradient algorithm is given by the theorem below.

Theorem 1.

(Beck and Teboulle, 2009) Suppose $g$ is a convex continuous function, possibly non-smooth, and $f$ is a convex continuously differentiable function satisfying the Lipschitz condition in (19) with Lipschitz constant $L_f$. Let $\{x^t\}$ be the sequence of iterates generated by (17) using constant step-size $\eta = 1/L_f$, and let $x^\star$ be an optimal solution of (16). Then

$F(x^t) - F(x^\star) \le \frac{L_f \lVert x^0 - x^\star \rVert_2^2}{2t}$

for any $t \ge 1$.

In particular, Theorem 1 implies that the Proximal Gradient algorithm requires $O(1/\epsilon)$ iterations to achieve $F(x^t) - F(x^\star) \le \epsilon$. When the Lipschitz constant is unknown, Beck and Teboulle (2009) give a line-search algorithm to select the step-sizes, which requires computing the solution to the proximal operator (18) for each line-search trial step. Solving the minimization problem in the proximal operation can be costly in some cases. Thus, we derive closed-form solutions for the proximal sub-problems and describe an adaptive step-size heuristic later. The Proximal Gradient method closely resembles the iterations of Gradient Descent, especially when the proximal operator admits closed-form solutions. We now apply PG to the convex latent logit model (15). Let $(W^t, \Gamma^t)$ denote the iterates after the $t$-th PG iteration (17). Applying (17) to (15) leads to the following update steps using intermediate variables $(\bar{W}^t, \bar{\Gamma}^t)$:

$\bar{W}^t = W^t - \eta \nabla_W f(W^t, \Gamma^t)$ (20a)
$\bar{\Gamma}^t = \Gamma^t - \eta \nabla_\Gamma f(W^t, \Gamma^t)$ (20b)
$(W^{t+1}, \Gamma^{t+1}) = \operatorname{prox}_{\eta g}(\bar{W}^t, \bar{\Gamma}^t)$ (20c)

The sub-problem from the proximal operator (20c) can be decoupled into two separate minimization problems:

$W^{t+1} = \operatorname*{argmin}_{W} \; \lambda_1 \sum_{j=1}^{p} \lVert W^{(j)} \rVert_2 + \frac{1}{2\eta} \lVert W - \bar{W}^t \rVert_F^2$ (21a)
$\Gamma^{t+1} = \operatorname*{argmin}_{\Gamma} \; \lambda_2 \lVert \Gamma \rVert_* + \frac{1}{2\eta} \lVert \Gamma - \bar{\Gamma}^t \rVert_F^2$ (21b)

Moreover, both (21a) and (21b) have closed-form solutions. The optimal solution to (21a) can be obtained by solving the proximal operator independently over each row $W^{(j)}$. The sub-problems with respect to $W^{(j)}$ are essentially the proximal operation arising from the group-Lasso penalty (Yuan and Lin, 2006; Jenatton et al., 2011b). For each $j$, the analytical solution to the proximal operator associated with step-size $\eta$ and $\lambda_1$ is

$W^{(j), t+1} = \left( 1 - \frac{\eta \lambda_1}{\lVert \bar{W}^{(j), t} \rVert_2} \right)_+ \bar{W}^{(j), t},$ (22)

where the operator $(\cdot)_+ = \max(0, \cdot)$ denotes thresholding at zero. The closed-form solution to the proximal operator associated with the nuclear norm penalty and step-size $\eta$ can be obtained via a generalization of the soft-thresholding procedure in (22). Suppose the Singular Value Decomposition (SVD) of $\bar{\Gamma}^t$ is

$\bar{\Gamma}^t = U \operatorname{diag}(\sigma) V^\top,$ (23)

where $\operatorname{diag}(\sigma)$ is the diagonal matrix of singular values of $\bar{\Gamma}^t$ and $U$, $V$ have orthonormal columns. The closed-form solution to (21b) is given by the Singular-Value Thresholding (SVT) operator (Cai et al., 2010; Toh and Yun, 2010; Ma et al., 2011),

$\Gamma^{t+1} = U \operatorname{diag}\left( (\sigma - \eta \lambda_2)_+ \right) V^\top.$ (24)

(22) and (24) together provide the solutions to the proximal sub-problem (20c).
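Both closed forms take only a few lines of code. The following sketch implements (22) and (24) in NumPy; the function names and the small division guard are our own additions.

```python
import numpy as np

def prox_group_l2(W_bar, thresh):
    """Row-wise group soft-thresholding, eq. (22).
    W_bar: (p, K); thresh = eta * lam1. Rows with l2 norm below the
    threshold are zeroed out, giving row-sparse homogeneous effects."""
    norms = np.linalg.norm(W_bar, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(norms, 1e-12))
    return scale * W_bar

def prox_nuclear(G_bar, thresh):
    """Singular-value thresholding, eq. (24).
    G_bar: (p, n); thresh = eta * lam2. Soft-thresholds the singular
    values, shrinking Gamma toward a low-rank matrix."""
    U, s, Vt = np.linalg.svd(G_bar, full_matrices=False)
    s = np.maximum(s - thresh, 0.0)
    return (U * s) @ Vt
```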

4.2 Acceleration, adaptive step-size and randomized SVT

We apply several simple techniques to improve the convergence speed of the Proximal Gradient algorithm described in the previous section. Nesterov's momentum method is a well-known technique for attaining the optimal convergence rate of first-order optimization methods on smooth convex problems (Nesterov, 1983). The extension that accelerates convergence on non-smooth objectives via proximal operators was developed by Beck and Teboulle (2009) and is known as the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA). Under constant or line-searched step-sizes, FISTA achieves an improved worst-case convergence rate of $O(1/t^2)$. This acceleration technique is applied in our implementation. In practice, the Lipschitz constant might be unknown or hard to compute. On the other hand, backtracking line-search requires computing the proximal operator over a range of different step-sizes. Since the intermediate variable $\bar{\Gamma}^t$ changes with each trial step-size during the line search, the Singular Value Decomposition of $\bar{\Gamma}^t$ needs to be re-computed at each trial step in order to evaluate the proximal operator. Therefore, backtracking line-search becomes costly due to the multiple SVD computations per iteration. Hence, we use an adaptive step-size scheme as a surrogate for line-search. After each iteration, the objective value of (15) is measured, and the step-size is halved if the objective value increases compared to the previous iteration. Note that Accelerated Proximal Gradient (APG) is not a strictly descent method, as opposed to Gradient Descent, so temporary increases of the function value do occur. This phenomenon is referred to as Nesterov ripples in the literature (Nesterov, 1983; O'donoghue and Candes, 2015). Further, we employ the function value-based restarting criterion introduced in O'donoghue and Candes (2015). The algorithm is restarted with the current iterates as re-initialization when ripples are detected, and the step-size is halved. The acceleration momentum is reset at each restart. A partial SVD is needed to obtain the proximal solution for the nuclear norm of the heterogeneous effect matrix $\Gamma$, as shown in eqn. (24). Note that the dimension of $\Gamma$ is $p \times n$. Therefore, SVD computation becomes a bottleneck during the optimization when $p$ or $n$ is large. To scale the estimation problem (15) to large datasets, we deploy the randomized Singular Value Decomposition algorithm of Halko et al. (2011) to perform the SVT operation. To obtain an approximate SVD of a matrix $A$, we seek a matrix $Q$ with orthonormal columns such that

$A \approx Q Q^\top A,$ (25)

i.e., $Q$ approximately spans the range of $A$. In addition, we would like the computation of $Q$ to be lightweight, by means of random projection. Let $\omega_1, \dots, \omega_l$ be standard Gaussian random vectors, where $l \ll n$. Then

$Y = [A\omega_1, \dots, A\omega_l]$ (26)

forms a matrix whose columns are linearly independent with high probability. This idea is formalized in the seminal paper by Halko et al. (2011). The randomized SVD algorithm is given in Algorithm 1. Essentially, Algorithm 1 can be viewed as first performing dimensionality reduction via random projection (lines 1-4), then running a deterministic SVD on a matrix of smaller dimension (line 5), and lastly recovering the SVD of the original matrix (lines 6-7). Algorithm 2 increases its guess of the number of singular values above the threshold and re-applies randomized SVD until the smallest computed singular value falls below the threshold. The Fast Accelerated Proximal Gradient with Adaptive Restart (FAPGAR) algorithm combining these features is listed in Algorithm 3.

Input: input matrix $A$, number of top singular vectors desired $k$
Output: approximate rank-$k$ SVD of $A$

1: generate random matrix $\Omega \in \mathbb{R}^{n \times k}$ with i.i.d. standard normal entries.
2: construct random projection $Y = A\Omega$.
3: QR decomposition $Y = QR$ to obtain an orthonormal basis $Q$ for the range of $Y$.
4: construct $B = Q^\top A$, of smaller dimension than $A$.
5: rank-$k$ SVD of $B$: $B = \tilde{U} \Sigma V^\top$.
6: $U = Q\tilde{U}$.
7: output $U, \Sigma, V$.
Algorithm 1 Randomized SVD
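In NumPy, Algorithm 1 is a direct transcription. The oversampling parameter below is a standard accuracy safeguard recommended by Halko et al. (2011); it is our addition and not part of the listing above.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD of A via random projection (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))  # step 1: Gaussian test matrix
    Y = A @ Omega                                              # step 2: random projection
    Q, _ = np.linalg.qr(Y)                                     # step 3: orthonormal basis of range(Y)
    B = Q.T @ A                                                # step 4: small (k+q) x n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)     # step 5: cheap deterministic SVD
    U = Q @ U_small                                            # step 6: lift back to original space
    return U[:, :k], s[:k], Vt[:k]                             # step 7: truncate to rank k
```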

Input: input matrix $A$, threshold $\tau$, initial guess $k_0$
Output: approximate Singular-Value Thresholding operation on $A$

1: $k \leftarrow k_0$; $(U, \Sigma, V) \leftarrow \text{RandomizedSVD}(A, k)$
2: while $\sigma_k(A) > \tau$ do
3:     $k \leftarrow 2k$
4:     $(U, \Sigma, V) \leftarrow \text{RandomizedSVD}(A, k)$
5: end while
6: $\Sigma_\tau \leftarrow \operatorname{diag}\left( (\sigma_1 - \tau)_+, \dots, (\sigma_k - \tau)_+ \right)$
7: output $U \Sigma_\tau V^\top$
Algorithm 2 Randomized SVT
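A matching sketch of Algorithm 2, reusing randomized_svd and numpy as np from the previous snippet. The doubling rule for growing k is an assumption where the listing leaves the growth schedule open.

```python
def randomized_svt(A, tau, k0=5):
    """Approximate SVT: grow the number of computed singular values until
    the smallest one falls below tau, then soft-threshold as in (24)."""
    k = k0
    U, s, Vt = randomized_svd(A, k)
    while s[-1] > tau and k < min(A.shape):  # all values above tau must be captured
        k = min(2 * k, min(A.shape))         # assumed growth rule: doubling
        U, s, Vt = randomized_svd(A, k)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt
```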

Input: number of iterations $T$, initial step-size $\eta_0$, initialization $(W^0, \Gamma^0)$, tolerance $\epsilon$, tuning parameters $\lambda_1, \lambda_2$
Output: solution to (15)

1: $(\tilde{W}^0, \tilde{\Gamma}^0) \leftarrow (W^0, \Gamma^0)$
2: $\eta \leftarrow \eta_0$
3: $\theta_0 \leftarrow 1$
4: $F_0 \leftarrow$ objective value of (15) at $(W^0, \Gamma^0)$
5: for $t \leftarrow 0$ to $T-1$ do
6:     $(\bar{W}, \bar{\Gamma}) \leftarrow (\tilde{W}^t, \tilde{\Gamma}^t) - \eta \nabla f(\tilde{W}^t, \tilde{\Gamma}^t)$
7:     for $j \leftarrow 1$ to $p$ do  ▷ proximal operator for group penalty
8:         $W^{t+1,(j)} \leftarrow \left(1 - \eta\lambda_1 / \lVert \bar{W}^{(j)} \rVert_2\right)_+ \bar{W}^{(j)}$
9:     end for
10:     $\Gamma^{t+1} \leftarrow \text{RandomizedSVT}(\bar{\Gamma}, \eta\lambda_2)$  ▷ proximal operator for nuclear norm penalty
11:     if $F(W^{t+1}, \Gamma^{t+1}) > F_t$ then  ▷ restart
12:         $(\tilde{W}^{t+1}, \tilde{\Gamma}^{t+1}) \leftarrow (W^t, \Gamma^t)$
13:         $\theta_{t+1} \leftarrow 1$
14:         $F_{t+1} \leftarrow F_t$
15:         $\eta \leftarrow \eta / 2$  ▷ adjust step-size
16:     else  ▷ acceleration
17:         $\theta_{t+1} \leftarrow \left(1 + \sqrt{1 + 4\theta_t^2}\right)/2$
18:         $\tilde{W}^{t+1} \leftarrow W^{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} (W^{t+1} - W^t)$
19:         $\tilde{\Gamma}^{t+1} \leftarrow \Gamma^{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} (\Gamma^{t+1} - \Gamma^t)$
20:         $F_{t+1} \leftarrow F(W^{t+1}, \Gamma^{t+1})$
21:     end if
22:     if $|F_{t+1} - F_t| \le \epsilon$ then
23:         break
24:     end if
25: end for
Algorithm 3 Fast Accelerated Proximal Gradient with Adaptive Restart
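A compact Python rendering of Algorithm 3 follows, reusing prox_group_l2 and prox_nuclear from the sketch in section 4.1. The FISTA momentum recursion and the restart-from-the-last-accepted-iterate rule are standard choices assumed here; treat this as a sketch, not the paper's MATLAB implementation.

```python
import numpy as np

def fapgar(grad_f, objective, W0, G0, lam1, lam2, eta=1.0, T=500, tol=1e-8):
    """FISTA-style accelerated proximal gradient with function-value
    restart and step-size halving (sketch of Algorithm 3)."""
    W, G = W0.copy(), G0.copy()          # last accepted iterates
    Wa, Ga = W0.copy(), G0.copy()        # accelerated (look-ahead) iterates
    theta, F_prev = 1.0, objective(W0, G0)
    for _ in range(T):
        gW, gG = grad_f(Wa, Ga)                            # gradient step (20a)-(20b)
        W_new = prox_group_l2(Wa - eta * gW, eta * lam1)   # prox of group penalty, eq. (22)
        G_new = prox_nuclear(Ga - eta * gG, eta * lam2)    # prox of nuclear norm, eq. (24)
        F_new = objective(W_new, G_new)
        if F_new > F_prev:                  # ripple detected: restart and shrink the step
            Wa, Ga, theta, eta = W.copy(), G.copy(), 1.0, eta / 2.0
            continue
        theta_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2))
        mom = (theta - 1.0) / theta_new     # Nesterov extrapolation weight
        Wa = W_new + mom * (W_new - W)
        Ga = G_new + mom * (G_new - G)
        if abs(F_new - F_prev) <= tol:      # function-value stopping rule
            W, G = W_new, G_new
            break
        W, G, theta, F_prev = W_new, G_new, theta_new, F_new
    return W, G
```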

5 Greedy Local Continuation for Pathwise Solutions

Algorithm 3 solves one instance of problem (15) given $(\lambda_1, \lambda_2)$. Hence, we solve (15) over a range of $\lambda_1$ and $\lambda_2$ using Algorithm 3, then choose the regularization constants by a goodness-of-fit criterion. Suppose the search space of each hyperparameter is discretized into $m$ points. Grid search requires calling Algorithm 3 $m^2$ times in order to obtain solutions over the two-dimensional grid. This can be prohibitive. In this section, we describe a fast strategy, via greedy local search and a continuation method, to establish solutions over a range of tuning constants.

5.1 Prediction for new observations

Given a new unseen sample $x_{\text{new}}$, the latent heterogeneous effect $\gamma_{\text{new}}$ needs to be decided. As shown in section 3.1, clustering is one of the situations giving rise to low-rank latent effects. Therefore, given a new observation $x_{\text{new}}$, we query its $k$-nearest neighbors using the distance defined by one minus the cosine similarity, $d(x, x') = 1 - \frac{x^\top x'}{\lVert x \rVert_2 \lVert x' \rVert_2}$. The heterogeneous effect for the new test sample is set to the weighted average of the heterogeneous effects of the observations belonging to its neighborhood, i.e., $\gamma_{\text{new}} = \sum_{i \in \mathcal{N}_k(x_{\text{new}})} \omega_i \gamma_i$.
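A minimal sketch of this prediction rule follows. The paper specifies a weighted average over the cosine-distance neighborhood; normalized similarity weights are an assumption here.

```python
import numpy as np

def predict_heterogeneity(x_new, X_train, Gamma, k=10):
    """Infer gamma_new for an unseen sample as a weighted average of the
    latent effects of its k nearest training points (section 5.1)."""
    sims = (X_train @ x_new) / (
        np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_new) + 1e-12)
    nbrs = np.argsort(1.0 - sims)[:k]   # k smallest cosine distances
    w = sims[nbrs]
    w = w / w.sum()                     # assumed: similarity-proportional weights
    return Gamma[:, nbrs] @ w           # weighted average of neighboring columns
```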

5.2 Greedy local continuation

For the Lasso and other penalized fixed effect generalized linear models, Friedman et al. (2007, 2010a) showed that the solution forms a continuous path in the regularization constant $\lambda$. Therefore, the model parameters stay close when the change in the penalty constant is small enough. Based on this observation, Friedman et al. proposed a warm-start strategy to compute the solution over the entire path of regularization hyperparameters, in which the optimal solution at a neighboring $\lambda$ is used as initialization (Friedman et al., 2007, 2010a). This idea was extended to the nuclear norm penalized problem in Ma et al. (2011).

This warm-start continuation strategy can also be applied on top of FAPGAR (Algorithm 3) when either $\lambda_1$ or $\lambda_2$ is fixed. Although it is possible to produce the solution path over the entire two-dimensional grid generated by the Cartesian product using the warm-start strategy, the computational cost becomes large since $m_1 m_2$ calls to Algorithm 3 are required, where $m_1$ and $m_2$ are the cardinalities of the grids in $\lambda_1$ and $\lambda_2$. Hence, we develop a greedy strategy with a continuation method to avoid computation over the entire search space. We propose a coordinate-wise search strategy: $\lambda_1$ and $\lambda_2$ are optimized alternately using the continuation scheme while the other is fixed. After a solution path over $\lambda_1$ is computed for a fixed $\lambda_2$, we pick the value of $\lambda_1$ yielding the best prediction on a separate validation set. After $\lambda_1$ is selected, the solution path over $\lambda_2$ is computed with the warm-start continuation strategy. The process continues, searching over $\lambda_1$ and $\lambda_2$ alternately. Let $(\lambda_1^s, \lambda_2^s)$ denote the constants selected after $s$ coordinate-wise outer iterations. A cycle is a configuration such that $(\lambda_1^s, \lambda_2^s) = (\lambda_1^{s'}, \lambda_2^{s'})$ for some $s' < s$. The search process stops when a cycle is detected. Figure 1 provides an illustration of the Greedy Local Continuation search, and a skeleton of the procedure follows.
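The skeleton below captures the alternating, cycle-terminated search; the fit and score hooks are hypothetical interfaces standing in for warm-started FAPGAR runs and validation scoring, respectively.

```python
def greedy_local_continuation(fit, score, lam1_grid, lam2_grid, lam1, lam2):
    """Coordinate-wise hyperparameter search of section 5.2.  fit(l1, l2)
    solves (15) (warm starts handled by the caller along each sweep);
    score(model) returns a validation criterion such as F-1."""
    visited = set()
    while (lam1, lam2) not in visited:  # stop when a configuration repeats (a cycle)
        visited.add((lam1, lam2))
        # sweep lam1 with lam2 fixed, then keep the best-scoring value
        lam1 = max(lam1_grid, key=lambda l1: score(fit(l1, lam2)))
        # sweep lam2 with the newly selected lam1 fixed
        lam2 = max(lam2_grid, key=lambda l2: score(fit(lam1, l2)))
    return lam1, lam2
```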

Figure 1: Graphical illustration of the Greedy Local Continuation search. $\lambda_1$ and $\lambda_2$ are optimized alternately with the warm-start strategy.

6 Case Studies

Experimental evaluations of the proposed low-rank and sparse decomposition-based latent effect model (15) and the solution algorithm FAPGAR are described in this section. Traffic accident records from the Statewide Integrated Traffic Records System (SWITRS) (The California Highway Patrol, 2014) in California are used in the experiments. It has been reported that single-vehicle crashes accounted for nearly 30% of all vehicle crashes in the United States in 2015 (NHTSA, 2017). The latent effect logit model (15) is evaluated on single-vehicle crash observations. The injury categories are: no injury or complaint of pain only, visible injury, severe injury, and fatal. Seventeen variables, listed in Table 1, are used in the model. All variables are processed into binary vectors $x_i$ using dummy coding, in which a variable is 0 if it belongs to the reference level in Table 1, and 1 otherwise. We aim to study the computational aspects of (15) and the use of this model for analyzing individual heterogeneity and quantifying the impact of different accident attributes.

feature ID        0              1           2              3             4               5                    6                 7
Description       victim gender  victim age  seatbelt used  alcohol used  cellphone used  wrong side of road   improper turning  over speed
reference level   female         25-64       false          false         false           false                false             false

feature ID        8        9                 10        11         12           13       14       15       16
Description       weather  light conditions  wet road  drug used  vehicle age  AM peak  PM peak  weekend  season
reference level   clear    daylight          false     false      0-10         false    false    false    spring/summer

Table 1: Accident-related features used in the experiment

6.1 Computation efficiency

The computational efficiency of Algorithm 3 is examined. We implement Algorithm 3 in MATLAB and perform the experiments on a MacOS system with a 2.4 GHz Intel Core i5 processor and 4 GB of 1600 MHz DDR3 memory. Data from 2012-2013 are sub-sampled to create training sets of different sizes in order to study the scalability of Algorithm 3. The average running time per iteration is shown in Figure 2. The computation time per iteration of FAPGAR scales linearly with the number of samples in the dataset. The FAPGAR algorithm is compared with the Proximal Gradient method with constant step-sizes. The objective value of (15) after each iteration is shown in Figure 4. Figure 4 clearly demonstrates the advantage of employing the adaptive acceleration techniques. The benefit of using randomized SVD (Algorithm 1) instead of deterministic SVD is illustrated in Figure 3. The computation times are measured on random input matrices. The left panel shows the time for performing randomized SVD implemented in MATLAB versus the deterministic partial SVD using the MATLAB command svds, on matrices with a varying number of columns. The right panel compares the randomized and deterministic approaches at an increasing number of required singular vectors. In both situations, randomized SVD offers more than an order of magnitude of speed-up. Note that there is approximation error in the singular values computed via randomized SVD due to the random projection step in Algorithm 1. However, the approximation error is relatively small, as shown in Table 2.

To start the Greedy Local Continuation search for $\lambda_1$ and $\lambda_2$, we first run group-$\ell_2$ regularized logistic regression over a range of $\lambda_1$ values using the software GLMNET (Friedman et al., 2007, 2010a). Note that the result of group-$\ell_2$ regularized logistic regression is equivalent to our proposed model (15) with $\lambda_2$ set to a large value, which yields a matrix of zeros for the latent heterogeneity $\Gamma$. For each $\lambda_1$ on the solution path of group-$\ell_2$ regularized logistic regression, we classify the samples on a separate validation set by choosing the class with the largest estimated probability as the output category for each validation sample. For each hyperparameter configuration, the F-1 score is computed:

$\text{F-1} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$ (27)

The value of $\lambda_1$ on the group-$\ell_2$ regularized logistic regression path with the highest F-1 score is taken as the initialization of the Greedy Local Continuation search. The progress of the greedy search process is visualized in Figure 5. The search process terminated after three iterations due to the occurrence of a cycle; the selected $(\lambda_1, \lambda_2)$ pair is marked in Figure 5.

Figure 2: Time per iteration (seconds) of FAPGAR implemented in MATLAB.
Figure 3: Computation time comparison, randomized SVD vs. deterministic SVD. Left: time against an increasing number of columns in an input matrix with 500 rows. Right: time against an increasing number of required singular vectors.
number of columns 1000 2500 5000 7500 10000
relative error 0.0385 0.0337 0.0266 0.0230 0.0208
top singular vectors 2 10 50 100 150
relative error 0.1120 0.0166 0.0216 0.0204 0.0157
Table 2: Relative error in singular values computed with Randomized SVD.
Figure 4: Objective value after each iteration, FAPGAR vs Proximal Gradient.
Figure 5: Greedy Local Continuation search on the traffic accident dataset, 2012-2013. A cycle is detected after three iterations.

6.2 Accident factor analysis

Figure 6: Cross-validation direct pseudo-elasticity over a path of $\lambda_1$. The vertical line marked 'CV' corresponds to the $\lambda_1$ with the best F-1 score from cross-validation.

The latent effect model (15) is applied to the analysis of traffic accident data in this section. The analysis is performed on randomly sampled single-vehicle accidents from 2012-2013 in SWITRS. The average direct pseudo-elasticity is a metric for quantifying the influence of a factor (Shankar and Mannering, 1996; Shankar et al., 1998; Carson and Mannering, 2001; Ulfarsson and Mannering, 2004; Anastasopoulos and Mannering, 2009; Kim et al., 2010; Anastasopoulos and Mannering, 2011; Morgan and Mannering, 2011; Kim et al., 2013; Mannering and Bhat, 2014) (and references therein). Given a binary feature vector $x_i$, the direct pseudo-elasticity measures the change in the probability of suffering a class-$k$ injury outcome when a feature switches from zero to one. The direct pseudo-elasticity for observation $i$ on class $k$ due to the change of variable $j$ can be computed by

$E_{jk}^{(i)} = \frac{P(y_i = k \mid x_{ij} = 1, x_{i,-j}; \hat{\theta}) - P(y_i = k \mid x_{ij} = 0, x_{i,-j}; \hat{\theta})}{P(y_i = k \mid x_{ij} = 0, x_{i,-j}; \hat{\theta})},$ (28)

where $\hat{\theta}$ denotes the estimated parameters of a model, and $x_{i,-j}$ is the sub-vector of features from $x_i$ excluding the $j$-th variable. In the transportation literature, the average of $E_{jk}^{(i)}$ over the training set is calculated. Note that since the parameter $\hat{\theta}$ is fitted from the training set, $E_{jk}^{(i)}$ is in fact conditioned on the training set. The purpose of averaging $E_{jk}^{(i)}$ is to achieve a better estimate of the direct pseudo-elasticity in the population, i.e.,

$E_{jk} = \mathbb{E}_{x \sim \mathcal{P}}\left[ E_{jk}(x) \mid \mathcal{D} \right],$ (29)

where $\mathcal{P}$ denotes the population and $\mathcal{D}$ denotes the training data randomly sampled from $\mathcal{P}$. Therefore, we propose an extension of the average pseudo-elasticity that incorporates the cross-validation procedure (a sketch follows the list):

  • Shuffle the dataset randomly and separate it into $K$ folds. Let $\mathcal{D}_{-k}$ denote the subset of data with the $k$-th fold removed, and $\mathcal{D}_k$ the remaining data.

  • Estimate a model on $\mathcal{D}_{-k}$, and compute the average pseudo-elasticity on $\mathcal{D}_k$.

  • Compute the mean of the average pseudo-elasticities estimated from the $K$ folds.
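A sketch of this computation, with fit and predict_proba as hypothetical model hooks and dummy-coded binary features assumed:

```python
import numpy as np

def cv_direct_pseudo_elasticity(X, y, j, k_class, fit, predict_proba, K=5, seed=0):
    """K-fold cross-validated average of the direct pseudo-elasticity
    (28) for binary feature j and injury class k_class."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    fold_means = []
    for k in range(K):
        train = np.concatenate([folds[m] for m in range(K) if m != k])
        model = fit(X[train], y[train])          # estimate on D_{-k}
        X1, X0 = X[folds[k]].copy(), X[folds[k]].copy()
        X1[:, j], X0[:, j] = 1.0, 0.0            # switch feature j on/off
        p1 = predict_proba(model, X1)[:, k_class]
        p0 = predict_proba(model, X0)[:, k_class]
        fold_means.append(np.mean((p1 - p0) / p0))   # average of eq. (28) on fold k
    return float(np.mean(fold_means))            # CV-DPE: mean over the K folds
```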

We name the direct pseudo-elasticity estimated by this procedure the cross-validation direct pseudo-elasticity (CV-DPE). We compute the CV-DPE of (15) over a path of $\lambda_1$ for the group-sparsity penalty constant, with the nuclear norm penalty constant fixed at the $\lambda_2$ obtained from the Greedy Local Continuation search. The results for each injury outcome category are displayed in Figure 6. Since the model parameters change as $\lambda_1$ varies, Figure 6 provides a visual inspection of the CV-DPE at different degrees of freedom of the model. From the CV-DPE analysis, alcohol increases the probability of suffering severe and fatal injuries, while the probability of experiencing only visible injuries, and the chance of escaping an accident without injury, are both greatly reduced. In addition, alcohol is the first factor selected into the model as $\lambda_1$ decreases, an indication of the importance of the alcohol factor in predicting accident outcomes. On the contrary, wearing a seatbelt substantially improves the chance of no injury or only visible injuries, and reduces the odds of severe injury or fatality, according to the CV-DPE result from the estimated model with the best F-1 score. Consumption of drugs increases the likelihood of fatality according to the model selected by cross-validation, while reducing the chance of all other injury categories. Factors related to violations of traffic laws, i.e., over-speeding, improper turning, and driving on the wrong side of the road, all lead to a greater chance of fatality, but are relatively less impactful than the influence of drugs and alcohol. In addition, the Greedy Local Continuation search found a $\lambda_1$ that leads to a vector of zeros for the homogeneous effect parameter corresponding to cell-phone usage. This can be interpreted as the exclusion of cell-phone influence as a common effect. This observation may be unintuitive at first sight. Upon inspecting the dataset, only a very small fraction of accident samples reported cell-phone involvement. This can be a result of under-reporting: understandably, people would not admit to using a cell-phone while driving unless caught. Hence, it is not surprising that the model selects cell-phone usage only as a heterogeneous effect variable.

Figure 7: Latent heterogeneity inferred from the 2012-2013 sample. Each point is represented by its first two principal component scores.

6.3 Visualization of latent individual heterogeneity

The latent effect model (15) also has the advantage of providing insights into the individual heterogeneity. The rank of $\Gamma$ decreases as $\lambda_2$ increases. Hence, (15) can be used as a dimension reduction tool. The $\Gamma$ learned from model (15) has rank two at the $(\lambda_1, \lambda_2)$ selected by the Greedy Local Continuation search process. Hence, each $\gamma_i$ can be visualized as a two-dimensional vector, represented by its principal component scores. The first two principal component scores of the latent heterogeneity inferred for each training point are shown in Figure 7. There are four well-identified clusters, each dominated by samples from one injury category. This is an indication of the clustering effect in the individual heterogeneity. Moreover, the cluster with the majority of severe injury accidents and the cluster comprising mostly fatality cases overlap with each other.

7 Conclusion

We present a latent effect logit model based on a sparse and low-rank decomposition of the homogeneous and heterogeneous effects of observations. The formulation has the advantage of preserving model convexity while capturing latent individual heterogeneity. The optimization problem arising from the model is solved by a Fast Accelerated Proximal Gradient with Adaptive Restart algorithm. A Greedy Local Continuation search process is developed to enable efficient exploration of the model hyperparameters. We demonstrate that low-rankness results from several different data-generating processes, and validate through experiments that clustering gave rise to the low-rankness of the latent heterogeneity in accident observations. The usefulness of the model is demonstrated by analyzing traffic accident factors. From the analysis, drug and alcohol use are found to be the factors with the largest impact on the probability of fatality in an accident, whereas seatbelt use greatly improves the chance of avoiding injuries.

References

  • H. Abdi and L. J. Williams (2010) Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2 (4), pp. 433–459. Cited by: §3.1.3.
  • A. Agresti, B. Caffo, and P. Ohman-Strickland (2004) Examples in which misspecification of a random effects distribution reduces efficiency, and possible remedies. Computational Statistics & Data Analysis 47 (3), pp. 639–653. Cited by: §1.
  • P. C. Anastasopoulos and F. L. Mannering (2009) A note on modeling vehicle accident frequencies with random-parameters count models. Accident Analysis & Prevention 41 (1), pp. 153–159. Cited by: §6.2.
  • P. C. Anastasopoulos and F. L. Mannering (2011) An empirical assessment of fixed and random parameter logit models using crash-and non-crash-specific injury data. Accident Analysis & Prevention 43 (3), pp. 1140–1147. Cited by: §6.2.
  • N. S. Aybat, D. Goldfarb, and S. Ma (2014) Efficient algorithms for robust and stable principal component pursuit problems. Computational Optimization and Applications 58 (1), pp. 1–29. Cited by: §3.2.
  • N. S. Aybat, S. Zarmehri, and S. Kumara (2015) An admm algorithm for clustering partially observed networks. In Proceedings of the 2015 SIAM international conference on data mining, pp. 460–468. Cited by: §1.
  • P. Bansal, R. A. Daziano, and E. Guerra (2018) Minorization-maximization (mm) algorithms for semiparametric logit models: bottlenecks, extensions, and comparisons. Transportation Research Part B: Methodological 115, pp. 17–40. Cited by: §1, §1, §1.
  • F. Bastin, C. Cirillo, and P. L. Toint (2004) Estimating mixed logit models with quasi-monte carlo sequences allowing practical error estimation. In the European Transport Conference, Strasbourg, France. Cited by: §1.
  • F. Bastin, C. Cirillo, and P. L. Toint (2006) Application of an adaptive monte carlo algorithm to mixed logit estimation. Transportation Research Part B: Methodological 40 (7), pp. 577–593. Cited by: §1, §4.1.
  • A. Beck and M. Teboulle (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2 (1), pp. 183–202. Cited by: §4.1, §4.1, §4.2, Theorem 1.
  • J. Bennett, S. Lanning, et al. (2007) The netflix prize. In Proceedings of KDD cup and workshop, pp. 35. Cited by: §1.
  • C. R. Bhat and P. S. Lavieri (2018) A new mixed mnp model accommodating a variety of dependent non-normal coefficient distributions. Theory and Decision 84 (2), pp. 239–275. Cited by: §1.
  • C. R. Bhat (2001) Quasi-random maximum simulated likelihood estimation of the mixed multinomial logit model. Transportation Research Part B: Methodological 35 (7), pp. 677–693. Cited by: §1.
  • C. G. Broyden (1970) The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics 6 (1), pp. 76–90. Cited by: §4.1.
  • J. Cai, E. J. Candès, and Z. Shen (2010) A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20 (4), pp. 1956–1982. Cited by: §4.1, §4.1.
  • E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis?. Journal of the ACM (JACM) 58 (3), pp. 11. Cited by: §1, §3.2.
  • E. J. Candès and B. Recht (2009) Exact matrix completion via convex optimization. Foundations of Computational mathematics 9 (6), pp. 717. Cited by: §1, §3.2.
  • E. J. Candès and T. Tao (2010) The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory 56 (5), pp. 2053–2080. Cited by: §1.
  • J. Carson and F. Mannering (2001) The effect of ice warning signs on ice-accident frequencies and severities. Accident Analysis & Prevention 33 (1), pp. 99–109. Cited by: §6.2.
  • V. Chandrasekaran, P. A. Parrilo, A. S. Willsky, et al. (2012) Latent variable graphical model selection via convex optimization. The Annals of Statistics 40 (4), pp. 1935–1967. Cited by: §3.1.1, §3.2.
  • Y. Chen, Y. Chi, and A. J. Goldsmith (2015) Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory 61 (7), pp. 4034–4059. Cited by: §3.1.1, §3.2.
  • E. Cherchi and J. W. Polak (2005) Assessing user benefits with discrete choice models: implications of specification errors under random taste heterogeneity. Transportation Research Record 1926 (1), pp. 61–69. Cited by: §1.
  • P. De Blasi, L. F. James, J. W. Lau, et al. (2010) Bayesian nonparametric estimation and consistency of mixed multinomial logit choice models. Bernoulli 16 (3), pp. 679–704. Cited by: §1, §1.
  • A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22. Cited by: §1, §2.
  • C. Ding, X. He, and H. D. Simon (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 606–610. Cited by: §3.1.3.
  • C. Ding, T. Li, and W. Peng (2008) On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis 52 (8), pp. 3913–3927. Cited by: §3.1.3.
  • Econometric Software, Inc. NLOGIT 6 reference guide. Technical report. Cited by: §1.
  • M. Fazel, H. Hindi, and S. P. Boyd (2001) A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, 2001. Proceedings of the 2001, Vol. 6, pp. 4734–4739. Cited by: §3.2.
  • R. Fletcher (1970) A new approach to variable metric algorithms. The computer journal 13 (3), pp. 317–322. Cited by: §1, §4.1.
  • M. Fosgerau and S. Hess (2009) A comparison of methods for representing random taste heterogeneity in discrete choice models. European Transport-Trasporti Europei 42, pp. 1–25. Cited by: §1, §1, §2.
  • M. Fosgerau and S. L. Mabit (2013) Easy and flexible mixture distributions. Economics Letters 120 (2), pp. 206–210. Cited by: §1, §2.
  • M. Fosgerau (2006) Investigating the distribution of the value of travel time savings. Transportation Research Part B: Methodological 40 (8), pp. 688–707. Cited by: §1.
  • J. T. Fox, K. i. Kim, S. P. Ryan, and P. Bajari (2011) A simple estimator for the distribution of random coefficients. Quantitative Economics 2 (3), pp. 381–418. Cited by: §1.
  • J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, et al. (2007) Pathwise coordinate optimization. The Annals of Applied Statistics 1 (2), pp. 302–332. Cited by: §5.2, §6.1.
  • J. Friedman, T. Hastie, and R. Tibshirani (2010a) Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33 (1), pp. 1. Cited by: §5.2, §6.1.
  • J. Friedman, T. Hastie, and R. Tibshirani (2010b) A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736. Cited by: §3.
  • N. Gillis (2014) The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines 12 (257). Cited by: §3.1.3.
  • W. H. Greene and D. A. Hensher (2003) A latent class model for discrete choice analysis: contrasts with mixed logit. Transportation Research Part B: Methodological 37 (8), pp. 681–698. Cited by: §1, §1, §3.1.2.
  • B. Haeffele, E. Young, and R. Vidal (2014) Structured low-rank matrix factorization: optimality, algorithm, and applications to image processing. In Proceeding of the 31st International Conference on Machine Learning, pp. 2007–2015. Cited by: §1.
  • V. A. Hajivassiliou and P. A. Ruud (1994) Classical estimation methods for ldv models using simulation. Handbook of econometrics 4, pp. 2383–2441. Cited by: §1, §2.
  • N. Halko, P. Martinsson, and J. A. Tropp (2011) Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53 (2), pp. 217–288. Cited by: §4.2.
  • J. Heckman and B. Singer (1984) A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica: Journal of the Econometric Society, pp. 271–320. Cited by: §1.
  • D. A. Hensher and W. H. Greene (2003) The mixed logit model: the state of practice. Transportation 30 (2), pp. 133–176. Cited by: §1, §1.
  • A. R. Hole and H. I. Yoo (2017) The use of heuristic optimization algorithms to facilitate maximum simulated likelihood estimation of random parameter logit models. Journal of the Royal Statistical Society: Series C (Applied Statistics) 66 (5), pp. 997–1013. Cited by: §1.
  • J. Huang and M. Q. Zhao (2015) Model selection and misspecification in discrete choice welfare analysis. Applied Economics 47 (39), pp. 4153–4167. Cited by: §1.
  • J. Huang, T. Zhang, and D. Metaxas (2011) Learning with structured sparsity. Journal of Machine Learning Research 12 (Nov), pp. 3371–3412. Cited by: §1.
  • S. Jagabathula and P. Rusmevichientong (2016) A nonparametric joint assortment and price choice model. Management Science 63 (9), pp. 3128–3145. Cited by: §1, §3.1.2.
  • R. Jenatton, J. Audibert, and F. Bach (2011a) Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research 12 (Oct), pp. 2777–2824. Cited by: §1.
  • R. Jenatton, J. Mairal, G. Obozinski, and F. Bach (2011b) Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research 12 (Jul), pp. 2297–2334. Cited by: §4.1, §4.1.
  • J. P. Kaipio, V. Kolehmainen, M. Vauhkonen, and E. Somersalo (1999) Inverse problems with structural prior information. Inverse problems 15 (3), pp. 713. Cited by: §1.
  • M. Keane and N. Wasi (2013) Comparing alternative models of heterogeneity in consumer choice behavior. Journal of Applied Econometrics 28 (6), pp. 1018–1045. Cited by: §1, §1.
  • J. Kim, G. F. Ulfarsson, S. Kim, and V. N. Shankar (2013) Driver-injury severity in single-vehicle crashes in california: a mixed logit analysis of heterogeneity due to age and gender. Accident Analysis & Prevention 50, pp. 1073–1081. Cited by: §1, §1, §6.2.
  • J. Kim, G. F. Ulfarsson, V. N. Shankar, and S. Kim (2008) Age and pedestrian injury severity in motor-vehicle crashes: a heteroskedastic logit analysis. Accident Analysis & Prevention 40 (5), pp. 1695–1702. Cited by: §1.
  • J. Kim, G. F. Ulfarsson, V. N. Shankar, and F. L. Mannering (2010) A note on modeling pedestrian-injury severity in motor-vehicle crashes with the mixed logit model. Accident Analysis & Prevention 42 (6), pp. 1751–1758. Cited by: §6.2.
  • D. D. Lee and H. S. Seung (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788. Cited by: §3.1.3.
  • D. D. Lee and H. S. Seung (2001) Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pp. 556–562. Cited by: §3.1.3.
  • E. Levina and P. J. Bickel (2005) Maximum likelihood estimation of intrinsic dimension. In Advances in neural information processing systems, pp. 777–784. Cited by: §1.
  • A. Liutkus and K. Yoshii (2017) A diagonal plus low-rank covariance model for computationally efficient source separation. In Machine Learning for Signal Processing (MLSP), 2017 IEEE 27th International Workshop on, pp. 1–6. Cited by: §3.1.1.
  • S. Ma, D. Goldfarb, and L. Chen (2011) Fixed point and bregman iterative methods for matrix rank minimization. Mathematical Programming 128 (1-2), pp. 321–353. Cited by: §4.1, §4.1, §5.2.
  • F. L. Mannering and C. R. Bhat (2014) Analytic methods in accident research: methodological frontier and future directions. Analytic methods in accident research 1, pp. 1–22. Cited by: §6.2.
  • V. Mansinghka, C. Kemp, T. Griffiths, and J. Tenenbaum (2006) Structured priors for structure learning. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006). Cited by: §1.
  • P. McCullagh and J. A. Nelder (1983) Generalized linear models. 2 edition, Chapman and Hall. Cited by: §1.
  • C. E. McCulloch and J. M. Neuhaus (2011) Misspecifying the shape of a random effects distribution: why getting it wrong may not matter. Statistical science, pp. 388–402. Cited by: §1.
  • D. McFadden (1981) Econometric models of probabilistic choice. In Structural Analysis of Discrete Data wit Economic Applications, C. Manski and D. McFadden (Eds.), pp. 198–272. Cited by: §1.
  • D. McFadden (1989) A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica: Journal of the Econometric Society, pp. 995–1026. Cited by: §1, §2.
  • D. McFadden (2001) Economic choices. The American economic review 91 (3), pp. 351–378. Cited by: Convex Latent Effect Logit Model via Sparse and Low-rank Decomposition, §1.
  • Z. Meng, B. Eriksson, and A. Hero (2014) Learning latent variable gaussian graphical models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1269–1277. Cited by: §3.1.1, §3.2.
  • A. Morgan and F. L. Mannering (2011) The effects of road-surface conditions, age, and gender on driver-injury severities. Accident Analysis & Prevention 43 (5), pp. 1852–1863. Cited by: §6.2.
  • D. Munger, P. L’Ecuyer, F. Bastin, C. Cirillo, and B. Tuffin (2012) Estimation of the mixed logit likelihood function by randomized quasi-monte carlo. Transportation Research Part B: Methodological 46 (2), pp. 305–320. Cited by: §1.
  • Y. Nesterov (1983) A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady 27, pp. 372–376. Cited by: §4.2.
  • J. M. Neuhaus and C. E. McCulloch (2011) The effect of misspecification of random effects distributions in clustered data settings with outcome-dependent sampling. Canadian Journal of Statistics 39 (3), pp. 488–497. Cited by: §1.
  • NHTSA (2017) Traffic safety facts 2015. Technical report, Center for Statistics and Analysis, National Highway Traffic Safety Administration, Washington, DC. External Links: Link Cited by: §6.
  • B. O’donoghue and E. Candes (2015) Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics 15 (3), pp. 715–732. Cited by: §4.2.
  • S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi (2015) Simultaneously structured models with application to sparse and low-rank matrices. IEEE Transactions on Information Theory 61 (5), pp. 2886–2908. External Links: Document, ISSN 0018-9448 Cited by: §3.1.1, §3.2.
  • D. Pacifico (2012) Fitting nonparametric mixed logit models via expectation-maximization algorithm. The Stata Journal 12 (2), pp. 284–298. Cited by: §1, §3.1.2.
  • B. Recht, M. Fazel, and P. A. Parrilo (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review 52 (3), pp. 471–501. Cited by: §3.2.
  • B. Recht, W. Xu, and B. Hassibi (2008) Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pp. 3065–3070. Cited by: §3.2.
  • E. Richard, P. Savalle, and N. Vayatis (2012) Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, USA, pp. 51–58. External Links: ISBN 978-1-4503-1285-1 Cited by: §1, §3.1.1, §3.2.
  • J. Saunderson, V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky (2012) Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. SIAM Journal on Matrix Analysis and Applications 33 (4), pp. 1395–1416. Cited by: §3.1.1.
  • V. Shankar, R. Albin, J. Milton, and F. Mannering (1998) Evaluating median crossover likelihoods with clustered accident counts: an empirical inquiry using the random effects negative binomial model. Transportation Research Record: Journal of the Transportation Research Board (1635), pp. 44–48. Cited by: §6.2.
  • V. Shankar and F. Mannering (1996) An exploratory multinomial logit analysis of single-vehicle motorcycle accident severity. Journal of safety research 27 (3), pp. 183–194. Cited by: §6.2.
  • D. F. Shanno (1970) Conditioning of quasi-newton methods for function minimization. Mathematics of computation 24 (111), pp. 647–656. Cited by: §1, §4.1.
  • M. Soltanolkotabi, E. Elhamifar, E. J. Candes, et al. (2014) Robust subspace clustering. The Annals of Statistics 42 (2), pp. 669–699. Cited by: §1, §3.2.
  • D. C. Sorensen (1982) Newton’s method with a model trust region modification. SIAM Journal on Numerical Analysis 19 (2), pp. 409–426. Cited by: §4.1.
  • D. I. Swedler, S. M. Bowman, and S. P. Baker (2012) Gender and age differences among teen drivers in fatal crashes. In 56th Annual Scientific Conference of the Association for the Advancement of Automotive Medicine, Cited by: §1.
  • The California Highway Patrol (2014) Statewide integrated traffic records system. External Links: Link Cited by: §1, §6.
  • K. Toh and S. Yun (2010) An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization 6, pp. . Cited by: §4.1, §4.1.
  • K. E. Train et al. (2009) Discrete choice methods with simulation. Cambridge Books. Cited by: §1.
  • K. E. Train (2008) EM algorithms for nonparametric estimation of mixing distributions. Journal of Choice Modelling 1 (1), pp. 40–69. Cited by: §1, §1, §1, §3.1.2.
  • K. E. Train (2009) Discrete choice methods with simulation. 2 edition, Cambridge University Press. Cited by: §1, §2.
  • K. Train (2000) Halton sequences for mixed logit. Technical report University of California Berkeley, Department of Economics, Berkeley, California. Cited by: §1.
  • K. Train (2016) Mixed logit with a flexible mixing distribution. Journal of choice modelling 19, pp. 40–53. Cited by: §1, §1, §2.
  • M. Udell and A. Townsend (2018) Why are big data matrices approximately low rank? SIAM Mathematics of Data Science (SIMODS), to appear. External Links: 1705.07474 Cited by: §1, §3.2.
  • G. F. Ulfarsson and F. L. Mannering (2004) Differences in male and female injury severities in sport-utility vehicle, minivan, pickup and passenger car accidents. Accident Analysis & Prevention 36 (2), pp. 135–147. Cited by: §6.2.
  • A. Vij and R. Krueger (2017) Random taste heterogeneity in discrete choice models: flexible nonparametric finite mixture distributions. Transportation Research Part B: Methodological 106, pp. 76–101. Cited by: §1, §2, §3.1.2.
  • S. Wright and J. Nocedal (1999) Numerical optimization. Springer Science 35 (67-68), pp. 7. Cited by: §1.
  • M. Yuan and Y. Lin (2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1), pp. 49–67. Cited by: §3, §4.1.
  • H. Zhan, K. Madduri, and V. N. Shankar (2019) An experimental study on stochastic gradient-based methods for fitting mixed logit models. working paper. Cited by: §1.