The mixed effect logistic regression model (mixed logit) has great success in transportation econometrics applications. Mixed logit is a generalized linear mixed model for multinomial response(McCullagh and Nelder, 1983). In addition, it can also be derived from discrete choice theory as a random utility model (McFadden, 2001). It brings advantages over fixed effect multinomial logistic regression (multinomial logit) due to its flexibility to incorporate individual heterogeneity. In mixed logit, the parameters for each observation are random and their distributions are estimated from data. Therefore, conditioned on the same observed variables, two observations may not share the same outcome due to the differences in individual-level parameters. Under random utility theory, the parameter heterogeneity for each observation models taste variations in the population. In addition, when interpreted as a discrete choice model, mixed logit relaxes the Independence of Irrelevant Alternative (IIA) assumption (McFadden, 1981; Train, 2009). The flexibility to model individual heterogeneity is important in transportation applications. For example in the analysis of traffic accident outcomes, the effect of age is heterogeneous since people of the same age can have diverse health conditions, which affects the degree of resistance to the impact of crash (Kim et al., 2008; Swedler et al., 2012; Kim et al., 2013).
The flexibility of applying mixed logit model comes with the difficulties to specifying and estimating the model. In the past twenty years, many advances have been made on understanding the mixing distributions used (Cherchi and Polak, 2005; Fosgerau, 2006; Neuhaus and McCulloch, 2011; McCulloch and Neuhaus, 2011)
, and relaxing the restrictions on the parametric form of the distribution, which leads to semi-parametric or non-parametric models(Greene and Hensher, 2003; Bansal et al., 2018). There are no consensus on the robustness of mixing distributions. Many authors has discussed about the sensitivity of estimates to the specification of mixing distributions (Heckman and Singer, 1984; Hensher and Greene, 2003; Agresti et al., 2004; Huang and Zhao, 2015). On the applications of mixed logit to welfare analysis, it has been shown that the choice of mixing distributions has profound impact (Cherchi and Polak, 2005; Fosgerau, 2006; Huang and Zhao, 2015). However, there are practical scenarios in which the influence of mixing distribution misspecification is secondary, pointed out in McCulloch and Neuhaus (2011). Nevertheless, many recent efforts have been put on making logit models more robust, from both computational and modeling perspective. A line of work is on the usage of non-parametric estimation for mixing distributions (Train, 2008; De Blasi et al., 2010; Bansal et al., 2018). A comparison between parametric and non-parametric estimation of the mixing distributions are provided in Bhat and Lavieri (2018). An early work along this line is the latent class model proposed by Greene and Hensher (2003), in which the heterogeneous parameters are conditioned on the latent class individuals belonged to. The use of mixture-of-normal for approximating the unknown mixing distributions are studied in (Fosgerau and Hess, 2009; Keane and Wasi, 2013). More recently, the logit-mixture logit (LML) is proposed in Train (2016). LML is a generalization of the latent class logit (Greene and Hensher, 2003) and many other mixture models (Train, 2008; Fox et al., 2011; Fosgerau and Mabit, 2013)
, using functionals such as splines to approximate the kernel of the probability mass function of discrete mixing distribution. Under very general assumptions, any continuous mixing distribution can be approximated by a discrete logit-mixture with sufficiently large support(Train, 2016). Therefore, non-parametric extensions can be considered as improvements on the robustness with respect to misspecification of mixing distributions.
or the Expectation-Maximization (EM) algorithm(Dempster et al., 1977). The expectation operator is nested inside the logarithm in the maximum likelihood estimator. Therefore, simulations are often used to approximate the expectation, leading to the MSL estimator. A large body of work has been done on making the simulation more efficient and numerically more robust approximation of the expectation integrals in practice. Quasi-Monte Carlo (QMC) methods is studied by several authors (Train, 2000; Bhat, 2001; Bastin et al., 2004; Munger et al., 2012)
. It has been shown that QMC reduces the variance of simulation and the number of draws needed in practice. However, it should be noted that the number of draws have to increase with the sample size in theory(Hensher and Greene, 2003; Train and others, 2009), and there still lacks theoretical guidelines on the sufficient number of draws required on finite samples. Quasi-Newton methods (Shanno, 1970; Fletcher, 1970) or Trust-region methods (Wright and Nocedal, 1999; Bastin et al., 2006) are then applied to optimize the simulation-based objective function. The EM algorithm for estimating mixed logit model and its variants are studied in Train (2008); Pacifico (2012); Jagabathula and Rusmevichientong (2016); Vij and Krueger (2017). Zhan et al. (2019) recently studies stochastic gradient method to estimate mixed logit models with Gaussian mixing distribution. Bansal et al. (2018) studies the Minorization-Maximization (MM) algorithm, an extension of EM, for fitting the logit-mixture logit model. Note that the maximum likelihood estimators for mixed logit and the non-parametric variants are non-convex, due to the expectation operator nested inside logarithm. Therefore, both the MSL-based estimation or EM-type algorithms are dependent on the starting point. Hence, the model parameters and the estimated mixing distribution can be unstable because multiple local optima exists in the optimization landscape. The effect of initialization on mixed logit parameter estimation was recently studied by Hole and Yoo (2017).
This paper is motivated to develop a model that allows the analysts to incorporate individual heterogeneity while being convex. Preserving convexity is important for obtaining stable estimation. A naive way to allow heterogeneity in the model is by parameterizing each observation independently. This leads to an over-parametrized model. A remedy is to have heterogeneous parameters for each sub-population or cluster. However, the clusters or grouping effect in the population is unobserved. Existent work marginalized the latent variables which produces non-convex objective functions(Greene and Hensher, 2003; Train, 2008; Fosgerau and Hess, 2009; De Blasi et al., 2010; Keane and Wasi, 2013; Train, 2016; Bansal et al., 2018). Therefore, we would like to avoid marginalization in the estimation. Mixing distributions can be considered as probabilistic priors on the latent heterogeneity in the population. Alternatively, structural prior
is a way to encode general knowledge about the structure of data, frequently used in machine learning and inverse problems(Kaipio et al., 1999; Mansinghka et al., 2006; Huang et al., 2011; Jenatton et al., 2011a; Haeffele et al., 2014). In order to avoid marginalization, we exploit structural priors that can be expressed non-probabilistically. To take advantage of these structural information, our proposed model is estimated using a regularized maximum likelihood approach. In particular, we consider the following structure: the homogeneous effect in the population is sparse and the individual heterogeneity is intrinsically low dimension despite the high dimensionality of ambient space. The reason and theoretical justification for imposing this geometric structure will be made clear later. Our work is related to low-rank and sparse matrix decomposition (Candès and Recht, 2009; Candès et al., 2011). In many applications, researchers discovered that corrupted or unobserved data can be recovered almost exactly if the data matrix is approximately low-rank (Candès and Recht, 2009; Candès and Tao, 2010; Candès et al., 2011; Richard et al., 2012; Soltanolkotabi et al., 2014; Aybat et al., 2015). The Netflix challenge (Bennett et al., 2007) is probably the most well-known application of this technique. Moreover, data from many real problems is found to distributed on intrinsically low dimensional manifolds even if the observations are high dimensional (Levina and Bickel, 2005; Udell and Townsend, 2018). We develop a fast optimization algorithm to fit the proposed convex latent effect logit model and evaluate the model on traffic accidents extracted from The Statewide Integrated Traffic Records System (SWITRS) (The California Highway Patrol, 2014). The computation of one instance of the model on datasets of 10000 samples can be finished in minutes, whereas estimating a mixed logit model on datasets of similar size with the same set of variables takes several hours using commercial software NLOGIT (Kim et al., 2013; Econometric Software, Inc., Technical report) .
The remaining of this paper is organized as follows: section 2 reviews background knowledge about logit models; section 3 introduces our sparse and low-rank decomposition-based convex objective function to incorporate heterogeneity, and explains why low-rankness is a broader geometry description for several different data generating process; section 4
describes the optimization algorithms for learning the parameters in the model. A greedy scheme for hyperparameter tuning is also introduced in section4; section 6 presents case studies using our proposed model for traffic accident analysis, followed by the conclusion in section 7.
We provide a brief review for multinomial logit and mixed logit model in this section, and point out the mathematical challenges. In the following sections, let a pair of observations, where is the observed variables (features) and is a categorical response from the set . For example, in choice modeling, represents the decision made by an agent in observation; in traffic crash modeling, denotes the severity category suffered from a crash. The probability of observation response conditioned on the variables and parameter is modeled by the multinomial distribution
In discrete choice literature, are usually referred as the propensity functions for making decisions . The conditional utility 111better interpreted as “harms” in traffic accidents from decision is
Variants of logit models differs in how the propensity functions are modeled and the methods used for estimating . In multinomial logistic regression (MNL), the propensity functions are parametrized by linear functionals with parameters ,
) implies that two observations have identical probability distribution of outcomes if. This is a restrictive assumption that can be violated in the presence of individual heterogeneity. Mixed logit models relax this restriction by placing probability distributions on the parameters, such that individual has heterogeneous parameters governed by a probability distribution . is usually referred as the mixing distribution and the parameters for
are to be estimated. Normal, log-normal, and truncated normal distributions are among the popular choices of parametric mixing distributions(Train, 2009). On the other hand, can be constructed via a non-parametric approach and via discrete mixture of distributions (Fosgerau and Hess, 2009; Fosgerau and Mabit, 2013; Train, 2016; Vij and Krueger, 2017). Let denote the sample size. The unobserved heterogeneous parameters are usually integrated out, yielding the marginal log-likelihood estimator:
There are two main mathematical challenges in solving the optimization problem in (4). First, is usually non-concave. Second, the expectation taken over the distribution of often has no closed-form expressions and requires further approximation. is usually replaced by the Maximum Simulated Likelihood (MSL) estimator (McFadden, 1989; Hajivassiliou and Ruud, 1994) using (quasi-) Monte Carlo simulation to approximate the expectation. The Expectation Maximization (EM) (Dempster et al., 1977) iterative procedure is another option.
3 Sparse and Low-rank Decomposition
We describe our sparse and low-rank decomposition approach for modeling heterogeneity in this section. Let denote the parameter for the -th individual. For each observation , we decompose the individual parameters by
The individual parameter
is separated into a homogeneous effect vector, and a heterogeneous counter-part for individual . Without any assumptions on , it is impossible to learn the homogeneous-heterogeneous effect decomposition since equation (5
) over-parametrizes the data. In mixed logit, the degrees of freedom inis constrained by the mixing distributions used. However, this leads to an non-convex objective function as discussed earlier. Instead of imposing mixing distirbutions, we will penalize the degree of freedom of the model by imposing geometry constraints. We first describe the notations before introducing the formulation. Let
be a matrix appending the heterogeneous parameters for the individuals as columns. Denote the homogeneous effect for the -th category as , and let
be a matrix collecting the homogeneous effects. Let denote the -th row of . The following constrained optimization problem is considered.
where , and are hyper-parameters to be selected. is the so-called group- norm used by group Lasso (Yuan and Lin, 2006; Friedman et al., 2010b). When is smaller than minimum group- norm of the solutions to the unconstrained problem, the optima in (8) forces into a vector of zeros for some . Therefore, the optimal solution is sparse and there are no homogeneous effect for variable if . When , the heterogeneity effects are constrained in a low-rank vector space. In the next section, we explain why imposing low-rank structural assumptions helps uncovering the separation between homogeneous and heterogeneous effect. Problem (8) is non-convex due to the rank constraint. A convex relaxation to (8) will be presented in section 4.
3.1 Why Low-Rankness?
In this section, we explain situations where low-rank heterogeneous effect arises.
3.1.1 Gaussian mixing with approximately low-rank covariance
The homogeneous-heterogeneous effect decomposition in (5) resembles the linear representation for Gaussian mixing variables. In Gaussian mixed logit model, the random parameters can be decomposed as
Hence the mean of the Gaussian distributions corresponds to the homogeneous effect, and
where . Low-rank matrix of mixing effects occurs when the covariance matrix is (approximately) low-rank. This happens when the covariance matrix is a superposition of a low-rank component and a sparse components (Chandrasekaran et al., 2012; Richard et al., 2012; Meng et al., 2014; Oymak et al., 2015; Chen et al., 2015), or when the covariance matrix has a block diagonal structure where the variables within each block are highly correlated and variables across different blocks are independent (Saunderson et al., 2012; Liutkus and Yoshii, 2017).
3.1.2 Latent clustered heterogeneity
Suppose the population forms clusters, such that the predictive variables have identical or similar influence on individuals belonging to the same cluster. Let be a partition of the data, i.e., if , . If the individual heterogeneities are identical on each cluster, then
Therefore, observations in the same cluster have identical columns in . As a result, . (11) can be relaxed slightly, consider
(12) leads to grouping of the heterogeneous effects, and forces the latent effects in the same group are stays close. can be approximated by a low-rank matrix in this case. Note that latent clustering of heterogeneity is also an implicit assumption in discrete mixture multinomial logits (Greene and Hensher, 2003; Train, 2008; Pacifico, 2012; Jagabathula and Rusmevichientong, 2016; Vij and Krueger, 2017).
3.1.3 Latent matrix factorization
Suppose with . can be equivalently expressed as
where , and both and has full column-rank. The matrix is usually interpreted as latent loadings, and are scores for each of the basis in . This is sometimes referred as a matrix factorization representation (Lee and Seung, 1999, 2001; Ding et al., 2005, 2008; Abdi and Williams, 2010; Gillis, 2014). Under this interpretation, individual heterogeneities are created through different linear combinations of the same latent sources.
3.2 Convex relaxation
Optimization with matrix rank constraints is NP-Hard. Finding the exact solutions to problem (8) is difficult and can be unpractical. Therefore, we use a convex program to approximate the original formulation. Let be the nuclear norm for matrix define as following.
where is the rank of and is the
-th largest singular value of. The nuclear norm is a convex relaxation for the rank function (Fazel et al., 2001; Recht et al., 2008, 2010) and has been successfully applied on many problems involving the rank function as objective or constraints (Candès and Recht, 2009; Candès et al., 2011; Soltanolkotabi et al., 2014; Aybat et al., 2014; Udell and Townsend, 2018; Chandrasekaran et al., 2012; Richard et al., 2012; Meng et al., 2014; Oymak et al., 2015; Chen et al., 2015) (and references therein). Therefore, we solve the following convex program
(15) reformulates (8) by first transforming the original problem into the Lagrangian form, then applying the nuclear norm relaxation to . (15) is the addition of three convex functions and therefore the convexity is preserved. are non-negative hyperparameters correspond to the upper bound and in the hard-constraint version.
4.1 Proximal Gradient algorithm
We now describe an efficient algorithm to solve the convex problem (15). Note that the terms and are non-smooth. Therefore methods requiring second order continuity, for example quasi-Newton methods (Fletcher, 1970; Broyden, 1970; Shanno, 1970) and Trust-region methods (Sorensen, 1982; Bastin et al., 2006), cannot be applied. The main steps in the algorithm are to deal with these non-smooth terms. Problem has the form
where is a smooth convex part corresponds to the multinomial loss, is convex but non-smooth part consists of . Proximal Gradient method (PG) (Beck and Teboulle, 2009; Cai et al., 2010; Toh and Yun, 2010; Ma et al., 2011; Jenatton et al., 2011b) is an efficient first-order algorithm to handle the non-smoothness in . It performs the following simple iteration
where is the step-size and is the proximal operator associated with and defined below.
Suppose is a continuously differentiable function with Lipschitz gradients, i.e.,
for all . The convergence rate of Proximal Gradient algorithm is given by the theorem below.
(Beck and Teboulle, 2009) Suppose is a convex continuous function and possibly non-smooth, is a convex continuously differentiable function satisfying the Lipschitz condition in (19) with Lipschitz constant . Let be a sequence of iterates generated from (17) using constant step-size , and let be an optimal solution of (16). Then
for any .
In particular, Theorem 1 suggests the Proximal Gradient algorithm requires iterations to achieve . When the Lipschitz constant is unknown, Beck and Teboulle (2009) shown a line-search algorithm to select the step-sizes , which requires computing the solution to the proximal operator (18
) for each line-search trial step. Solving the minimization problem in the proximal operation can be costly in some cases. Thus we derive the closed-form solutions for the proximal sub-problems and describe an adaptive step-size heuristic later. The Proximal Gradient method closely resembles the iterations in Gradient Descent, especially when the proximal operator admits close-form solutions. We now apply PG to the convex latent logit model (15). Let denote the iterates after the -th PG iteration (17). Applying (17) on (15) leads to the following update steps using intermediate variables
The sub-problem from the proximal operator (20c) can be decoupled into two separate minimization problems:
Moreover, both (21a) and (21b) has closed-form solutions. The optimal solution to (21a) can be obtained by solving the proximal operator independently over each row . The sub-problems with respect to are essentially the proximal operation arisen from the group-Lasso penalty (Yuan and Lin, 2006; Jenatton et al., 2011b). For each , the analytical solution for the proximal operator associated with step-size and is
where the operator denotes thresholding . The closed-form solution for the proximal operator associated with nuclear norm penalty and step-size can be obtained via a generalization of the soft-thresholding procedure in (22
). Suppose the Singular Value Decomposition (SVD) foris
where is the diagonal matrix of singular values of and . The closed-form solution to (21b) is given by the Singular-Value Thresholding (SVT) operator (Cai et al., 2010; Toh and Yun, 2010; Ma et al., 2011),
4.2 Acceleration, adaptive step-size and randomized SVT
We apply several simple techniques to improve the convergence speed of Proximal Gradient algorithm described in the previous section. Nesterov’s momentum method is a well-known technique to derive optimal convergence rate for first-order optimization methods on smooth convex problems (Nesterov, 1983). The extension to accelerate the convergence on non-smooth objectives via proximal operators are developed by Beck and Teboulle (Beck and Teboulle, 2009), known as the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA). Under constant step-size or line-searched step-sizes, FISTA achieves improved worse-case convergence rate of . The acceleration technique is applied in our implementation. In practice, the Lipschitz constant might be unknown or hard to compute. On the other hand, backtracking line-search requires computing the proximal operator over a range of different step-sizes. Since the intermediate variable changes due to different trial step-size during the line search process, the Singular Value Decomposition of needs to be re-computed in each trial step in order to obtain . Therefore backtracking line-search becomes costly due to multiple computations of SVD in each iteration. Hence we use an adaptive step-size scheme as a surrogate to line-search. After each iteration, the objective value of (15) is measured, the step-size is halved if the objective value increases compared to the previous iteration. Note that Accelerated Proximal Gradient (APG) is not a strictly descent method, as oppose to Gradient Descent, therefore tentative increments of the function value do occur. This phenomenon is referred as the Nesterov ripples in the literature (Nesterov, 1983; O’donoghue and Candes, 2015). Further, we employ a function value-based restarting criterion introduced in (O’donoghue and Candes, 2015). The algorithm is restarted with the current iterates as re-initialization when ripples are detected, and the step-size is halved. The acceleration momentums are reset at each restart. A partial SVD is needed in order to obtain the proximal solution for the nuclear norm of the heterogeneous effect matrix , as shown in eqn. (24). Note that the dimension of is . Therefore, SVD computation becomes a bottleneck during the optimization when or is large. To scale up the estimation problem (15) on large datasets, we deploy the randomized Singular Value Decomposition algorithm Halko et al. (2011) to perform the SVT operation. To obtain an approximate SVD for a matrix , we seek for a matrix with orthonormal columns such that
and approximately spans the range of . In addition, we would like the computation of to be lightweight, by the mean of random projection. Let be standard Gaussian random vectors, where . Then
forms a matrix with linearly-independent columns with high probability. This idea is formalized in the seminal paper by Halko et al. (2011). The randomized SVD algorithm is given in Algorithm 1. Essentially, Algorithm 1 can be viewed as first performing dimensionality reduction via random projection (line 1-4), then running the deterministic SVD for a matrix of smaller dimension (line 5) and lastly recovering the SVD for the original matrix (line 6-7). Algorithm 2 increases the guess for minimum number of singular values required to reach below the threshold and applies randomized SVD until the threshold is found. The Fast Accelerated Proximal Gradient with Adaptive Restart (FAPGAR) algorithm combining these features is listed in Algorithm 3.
5 Greedy Local Continuation for Pathwise Solutions
Algorithm 3 solves one instance of problem (15), given and . Hence, we solve (15) over a range of and using Algorithm 3, then choose the regularization constants by some goodness-of-fit criteria. Suppose the search space in each is discretized into points. Grid search requires calling Algorithm 3 times in order to obtain the solution over the two dimensional grids. This can be prohibitive. In this section, we describe a fast strategy via greedy local search and continuation method to establish the solution over a range of tuning constants.
5.1 Prediction for new observations
Given a new unseen sample , the latent heterogeneous effect needs to be decided. As show in section 3.1, clustering is one of the situations causing low-rank latent effect. Therefore, given a new observation , we query its
-nearest neighbors using distance defined by one minus the cosine similarity. The heterogeneous effect for the new test sample is set to the weighted average of heterogeneous effect of observations belonged to the neighborhood of the new test sample, i.e., .
5.2 Greedy local continuation
For Lasso or other type penalized fixed effect generalized linear models, Friedman et al. (Friedman et al., 2007, 2010a). showed that forms a continuous path of the regularization constant . Therefore, the model parameters stay close when change in the penalty constant is small enough. Based on this observation, Friedman et al. proposed warm start strategy to compute the solution over the entire path of regularization hyperparameters, where the optimal solution from a neighboring is used as initialization (Friedman et al., 2007, 2010a). This idea is extended to the nuclear norm penalized problem in (Ma et al., 2011).
This warm start continuation strategy can also be applied on top of FAPGAR (Algorithm 3), when either or is fixed. Although it is possible to produce the solution path over the entire two dimensional grids generated by the Cartesian product with warm-starting strategy, the computation cost becomes large since calls of Algorithm 3 is required. Here is the cardinality of grids in . Hence, we develop a greedy strategy with continuation method to avoid the computation over the entire search space. We propose a coordinate-wise search strategy. and is optimized alternatively using the continuation scheme while the other is being fixed. After a solution path over is computed for a fixed , we pick the value of yielding the best prediction on a separate validation set. After is selected, the solution path for is computed with the warm-start continuation strategy. The process continuous and search for and alternatively. Let denote the constants selected after coordinate-wise outer iterations. A cycle is a configuration such that for some . The search process stops when a cycle is detected. Figure 1 provides an illustration of the Greedy Local Continuation search.
6 Case Studies
Experimental evaluations on the proposed low-rank and sparse decomposition-based latent effect model (15) and the solution algorithm FAPGAR are described in this section. Traffic accident records from The Statewide Integrated Traffic Records System (SWITRS) (The California Highway Patrol, 2014) in California are used in the experiments. It has been reported that single-vehicle crashes account for nearly 30% of all vehicle crashes in the United States in 2015 (NHTSA, 2017). The latent effect logit model (15) is evaluated on single-vehicle crash observations. The set of injury categories are no injury or only complaint of pain, visible injury, severe injury, fatal. Seventeen variables listed in Table 1 are used in the model. All variables are processed into binary vector ’s using dummy coding, in which variable is 0 if it belongs to the reference level in Table 1, and 1 otherwise. We aim to study the computational aspect of (15) and the usage of this model for analyzing individual heterogeneity and quantifying the impact of different accident attributes.
6.1 Computation efficiency
Computational efficiency of Algorithm 3 is examined. We implement Algorithm 3 in MATLAB and perform the experiments on a MacOS system with 2.4 GHz Intel Core i5 processor and 4 GB 1600 MHz DDR3 memory. Data from 2012-2013 are sub-sampled to create training sets with different sizes () in order to study the scalability of Algorithm 3. The average running time per iteration are shown in Figure 2. The computation time per iteration in FAPGAR scales linearly with the number of samples in the dataset. The FAPGAR algorithm is compared with the Proximal Gradient method with constant step-sizes. The objective value of (15) after each iteration is shown in Figure 4. Figure 4 clearly demonstrates the advantage of employing the adaptive acceleration techniques. The benefit of using Randomized SVD (Algorithm 1) instead of deterministic SVD is illustrated in Figure 3. The computation time are measured on random input matrices. The left panel shows the time for performing randomized SVD implemented with Matlab and the deterministic partial SVD using Matlab command svds on matrices with varying number of columns. The right panel compares the randomized versus the deterministic approach at increasing number of required singular vectors. In both situations, randomized SVD offers more than a magnitude of speed-up. Note that there are approximation error introduced in the singular values computed via Randomzied SVD due to the random projection step in Algorithm 1. However, the approximation error is relatively small as shown in Table 2.
To start the Local Greedy Continuation search for and , we fist run group- regularized logistic regression over a range of s using software GLMNET (Friedman et al., 2007, 2010a). Note that the result of group- regularized logistic regression is equivalent to our proposed model (15) with set to a large value, which yields a matrix of zeros for the latent heterogeneity . For each on the solution path of group-
regularized logistic regression, we classify the samples on a separate validation set by choosing the class with largest estimated probability as the output category for each validation sample. For each hyperparameter configuration, the F-1 score is computed:
The value on the group- regularized logistic regression path with highest F-1 score is taken as initialization for the Local Greedy Continuation search. The progress of the greedy search process is visualized in Figure 5. The search process terminated after three iterations due to the occurrence of a cycle. The selected .
|number of columns||1000||2500||5000||7500||10000|
|top singular vectors||2||10||50||100||150|
6.2 Accident factor analysis
The latent effect model (15) is applied for analyzing traffic accident data in this section. The analysis is performed on randomly sampled single-vehicle accidents from 2012-2013 in SWITRS. The average direct pseudo-elasticity is a metric for quantifying the influence of a factor (Shankar and Mannering, 1996; Shankar et al., 1998; Carson and Mannering, 2001; Ulfarsson and Mannering, 2004; Anastasopoulos and Mannering, 2009; Kim et al., 2010; Anastasopoulos and Mannering, 2011; Morgan and Mannering, 2011; Kim et al., 2013; Mannering and Bhat, 2014) (and references therein). Given a binary feature vector , the average direct pseudo-elasticity measures the change in the probability of suffering class injury outcome, when a feature switches from zero to one. The direct pseudo-elasticity for observation on class due to the change of can be computed by
where denotes the estimated parameters for a model, is a sub-vector of features from excluding the -th variable. In transportation literature, the average of from the training set is calculated. Note that since the parameter is fitted from the training set, is in fact conditioned on the training set. The purpose of computing the average of is to achieve better estimation of the direct pseudo-elasticity in the population, i.e.,
where denotes the population, and denotes the training data randomly sampled from . Therefore, we propose an extension to the average pseudo-elasticity by incorporating the cross-validation procedure:
Shuffle the dataset randomly and separate it into folds. Let denote the subset of data with the -th fold removed, and be the remaining data.
estimate a model on , and compute the average pseudo-elasticity on .
compute the mean of average pseudo-elasticity estimated from .
We name the estimated direct pseudo-elasticity from this procedure as the cross-validation direct pseudo-elasticity (CV-DPE). We compute CV-DPE of (15) over a path of for the group-sparsity penalty constant, and fix for the nuclear norm penalty constant obtained from the Greedy Local Continuation search. The results for each injury outcome category are displayed in Figure 6. Since the model parameters change when varies, Figure 6 provides a visual inspection of CV-DPE with different degrees of freedom in the model. From the CV-DPE analysis, alcohol increases the probability of suffering severe and fatal injuries. Meanwhile, the probability of experiencing only visible injuries and the chance of free from injuries in an accident is greatly reduce. In addition, alcohol is the first factor selected into the model, when decreases. This is an indication about the importance of alcohol factor in predicting the accident results. On the contrary, wearing seatbelt improves the chance of no injury or suffering only visible injuries from accidents by more than
, reducing the odds of suffer severe injuries or fatality by more thanaccording to the CV-DPE result from the estimated model with best F-1 score. Consumption of drugs increases the likeliness of fatality by more than according to the model selected by the cross-validation, while reducing the chance of all other injury categories. Factors related to violation of traffic laws, i.e., over-speeding, improper tuning, wrong side of road, all lead to greater chance of fatality, but relatively less impactful compared to the influence of drug and alcohol. In addition, the Local Greedy Continuation found a that leads to a vector of zeros for the homogeneous effect parameter corresponds to cell-phone usage. This can be interpreted as the exclusion of cell-phone influence as a common effect. This observation may be unintuitive at the first sight. After inspecting the dataset, only less than accident samples reported cell-phone as involved. This can be a result of under-reporting — understandably, people would not admit using a cell-phone while driving unless being caught. Hence, it is not surprise to see the model selects cell-phone usage only as a heterogeneous effect variable.
6.3 Visualization of latent individual heterogeneity
The latent effect model (15) also has the advantages of providing insights into the individual heterogeneity. The rank of deceases as increases. Hence, (15) can be used as a dimensional reduction tool. The resulting learned from model (15) has rank two at the selected from the Local Greedy Continuation search process. Hence, each can be visualized by a two dimensional vector, represented by their principal component scores. The first two principal component scores of the latent heterogeneity inferred from each training point are shown in Figure 7. There are four well identified clusters, each cluster is dominated by samples from one injury category. This is an indication of the clustering effect in the individual heterogeneity. Moreover, the cluster with majority of severe injury accidents and the cluster comprises mostly fatality cases are overlapped with each other.
We present a latent effect logistic model based on sparse and low-rank decomposition between the homogeneous effect and heterogeneous effect of observations. The formulation has the advantages of preserving model convexity while capturing the latent individual heterogeneity. The optimization problem from the model is solved by a Fast Accelerated Proximal Gradient with Adaptive Restarting algorithm. A Greedy Local Continuation search process is developed to enable efficient exploration of model hyperparameters. We demonstrate that low-rankness is a result of different data-generating processes, and validate through experiments clustering gave rise to the low-rankness of latent heterogeneity in accident observations. The usefulness of the model is demonstrated by analyzing traffic accident factors. From the analysis, drug and alcohol are found to be the factors with the largest impact on the probability of suffering fatality in an accident, whereas the usage of seatbelt greatly improves the chance of avoiding injuries.
- Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2 (4), pp. 433–459. Cited by: §3.1.3.
- Examples in which misspecification of a random effects distribution reduces efficiency, and possible remedies. Computational Statistics & Data Analysis 47 (3), pp. 639–653. Cited by: §1.
- A note on modeling vehicle accident frequencies with random-parameters count models. Accident Analysis & Prevention 41 (1), pp. 153–159. Cited by: §6.2.
- An empirical assessment of fixed and random parameter logit models using crash-and non-crash-specific injury data. Accident Analysis & Prevention 43 (3), pp. 1140–1147. Cited by: §6.2.
- Efficient algorithms for robust and stable principal component pursuit problems. Computational Optimization and Applications 58 (1), pp. 1–29. Cited by: §3.2.
- An admm algorithm for clustering partially observed networks. In Proceedings of the 2015 SIAM international conference on data mining, pp. 460–468. Cited by: §1.
- Minorization-maximization (mm) algorithms for semiparametric logit models: bottlenecks, extensions, and comparisons. Transportation Research Part B: Methodological 115, pp. 17–40. Cited by: §1, §1, §1.
- Estimating mixed logit models with quasi-monte carlo sequences allowing practical error estimation. In the European Transport Conference, Strasbourg, France. Cited by: §1.
- Application of an adaptive monte carlo algorithm to mixed logit estimation. Transportation Research Part B: Methodological 40 (7), pp. 577–593. Cited by: §1, §4.1.
- A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2 (1), pp. 183–202. Cited by: §4.1, §4.1, §4.2, Theorem 1.
- The netflix prize. In Proceedings of KDD cup and workshop, pp. 35. Cited by: §1.
- A new mixed mnp model accommodating a variety of dependent non-normal coefficient distributions. Theory and Decision 84 (2), pp. 239–275. Cited by: §1.
- Quasi-random maximum simulated likelihood estimation of the mixed multinomial logit model. Transportation Research Part B: Methodological 35 (7), pp. 677–693. Cited by: §1.
- The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics 6 (1), pp. 76–90. Cited by: §4.1.
- A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20 (4), pp. 1956–1982. Cited by: §4.1, §4.1.
- Robust principal component analysis?. Journal of the ACM (JACM) 58 (3), pp. 11. Cited by: §1, §3.2.
- Exact matrix completion via convex optimization. Foundations of Computational mathematics 9 (6), pp. 717. Cited by: §1, §3.2.
- The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory 56 (5), pp. 2053–2080. Cited by: §1.
- The effect of ice warning signs on ice-accident frequencies and severities. Accident Analysis & Prevention 33 (1), pp. 99–109. Cited by: §6.2.
- Latent variable graphical model selection via convex optimization. The Annals of Statistics 40 (4), pp. 1935–1967. Cited by: §3.1.1, §3.2.
- Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory 61 (7), pp. 4034–4059. Cited by: §3.1.1, §3.2.
- Assessing user benefits with discrete choice models: implications of specification errors under random taste heterogeneity. Transportation Research Record 1926 (1), pp. 61–69. Cited by: §1.
- Bayesian nonparametric estimation and consistency of mixed multinomial logit choice models. Bernoulli 16 (3), pp. 679–704. Cited by: §1, §1.
- Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22. Cited by: §1, §2.
On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 606–610. Cited by: §3.1.3.
- On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis 52 (8), pp. 3913–3927. Cited by: §3.1.3.
- NLOGIT 6 reference guide. Technical report Cited by: §1.
- A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, 2001. Proceedings of the 2001, Vol. 6, pp. 4734–4739. Cited by: §3.2.
- A new approach to variable metric algorithms. The computer journal 13 (3), pp. 317–322. Cited by: §1, §4.1.
- A comparison of methods for representing random taste heterogeneity in discrete choice models. European Transport-Trasporti Europei 42, pp. 1–25. Cited by: §1, §1, §2.
- Easy and flexible mixture distributions. Economics Letters 120 (2), pp. 206–210. Cited by: §1, §2.
- Investigating the distribution of the value of travel time savings. Transportation Research Part B: Methodological 40 (8), pp. 688–707. Cited by: §1.
- A simple estimator for the distribution of random coefficients. Quantitative Economics 2 (3), pp. 381–418. Cited by: §1.
- Pathwise coordinate optimization. The Annals of Applied Statistics 1 (2), pp. 302–332. Cited by: §5.2, §6.1.
- Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33 (1), pp. 1. Cited by: §5.2, §6.1.
- A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736. Cited by: §3.
The why and how of nonnegative matrix factorization.
Regularization, Optimization, Kernels, and Support Vector Machines12 (257). Cited by: §3.1.3.
- A latent class model for discrete choice analysis: contrasts with mixed logit. Transportation Research Part B: Methodological 37 (8), pp. 681–698. Cited by: §1, §1, §3.1.2.
- Structured low-rank matrix factorization: optimality, algorithm, and applications to image processing. In Proceeding of the 31st International Conference on Machine Learning, pp. 2007–2015. Cited by: §1.
- Classical estimation methods for ldv models using simulation. Handbook of econometrics 4, pp. 2383–2441. Cited by: §1, §2.
- Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53 (2), pp. 217–288. Cited by: §4.2.
- A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica: Journal of the Econometric Society, pp. 271–320. Cited by: §1.
- The mixed logit model: the state of practice. Transportation 30 (2), pp. 133–176. Cited by: §1, §1.
- The use of heuristic optimization algorithms to facilitate maximum simulated likelihood estimation of random parameter logit models. Journal of the Royal Statistical Society: Series C (Applied Statistics) 66 (5), pp. 997–1013. Cited by: §1.
- Model selection and misspecification in discrete choice welfare analysis. Applied Economics 47 (39), pp. 4153–4167. Cited by: §1.
- Learning with structured sparsity. Journal of Machine Learning Research 12 (Nov), pp. 3371–3412. Cited by: §1.
- A nonparametric joint assortment and price choice model. Management Science 63 (9), pp. 3128–3145. Cited by: §1, §3.1.2.
- Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research 12 (Oct), pp. 2777–2824. Cited by: §1.
- Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research 12 (Jul), pp. 2297–2334. Cited by: §4.1, §4.1.
- Inverse problems with structural prior information. Inverse problems 15 (3), pp. 713. Cited by: §1.
- Comparing alternative models of heterogeneity in consumer choice behavior. Journal of Applied Econometrics 28 (6), pp. 1018–1045. Cited by: §1, §1.
- Driver-injury severity in single-vehicle crashes in california: a mixed logit analysis of heterogeneity due to age and gender. Accident Analysis & Prevention 50, pp. 1073–1081. Cited by: §1, §1, §6.2.
- Age and pedestrian injury severity in motor-vehicle crashes: a heteroskedastic logit analysis. Accident Analysis & Prevention 40 (5), pp. 1695–1702. Cited by: §1.
- A note on modeling pedestrian-injury severity in motor-vehicle crashes with the mixed logit model. Accident Analysis & Prevention 42 (6), pp. 1751–1758. Cited by: §6.2.
- Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788. Cited by: §3.1.3.
- Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pp. 556–562. Cited by: §3.1.3.
- Maximum likelihood estimation of intrinsic dimension. In Advances in neural information processing systems, pp. 777–784. Cited by: §1.
- A diagonal plus low-rank covariance model for computationally efficient source separation. In Machine Learning for Signal Processing (MLSP), 2017 IEEE 27th International Workshop on, pp. 1–6. Cited by: §3.1.1.
- Fixed point and bregman iterative methods for matrix rank minimization. Mathematical Programming 128 (1-2), pp. 321–353. Cited by: §4.1, §4.1, §5.2.
- Analytic methods in accident research: methodological frontier and future directions. Analytic methods in accident research 1, pp. 1–22. Cited by: §6.2.
Structured priors for structure learning..
Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006), Cited by: §1.
- Generalized linear models. 2 edition, Chapman and Hall. Cited by: §1.
- Misspecifying the shape of a random effects distribution: why getting it wrong may not matter. Statistical science, pp. 388–402. Cited by: §1.
- Econometric models of probabilistic choice. In Structural Analysis of Discrete Data wit Economic Applications, C. Manski and D. McFadden (Eds.), pp. 198–272. Cited by: §1.
A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica: Journal of the Econometric Society, pp. 995–1026. Cited by: §1, §2.
- Economic choices. The American economic review 91 (3), pp. 351–378. Cited by: Convex Latent Effect Logit Model via Sparse and Low-rank Decomposition, §1.
- Learning latent variable gaussian graphical models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1269–1277. Cited by: §3.1.1, §3.2.
- The effects of road-surface conditions, age, and gender on driver-injury severities. Accident Analysis & Prevention 43 (5), pp. 1852–1863. Cited by: §6.2.
- Estimation of the mixed logit likelihood function by randomized quasi-monte carlo. Transportation Research Part B: Methodological 46 (2), pp. 305–320. Cited by: §1.
- A method of solving a convex programming problem with convergence rate . Soviet Mathematics Doklady 27 (a), pp. 372–376. Cited by: §4.2.
- The effect of misspecification of random effects distributions in clustered data settings with outcome-dependent sampling. Canadian Journal of Statistics 39 (3), pp. 488–497. Cited by: §1.
- TRAFFIC safety facts 2015. Technical report National Highway Traffic Safety AdministrationCenter for Statistics and Analysis, National Highway Traffic Safety Administra- tion, Washington, DC.. External Links: Cited by: §6.
- Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics 15 (3), pp. 715–732. Cited by: §4.2.
- Simultaneously structured models with application to sparse and low-rank matrices. IEEE Transactions on Information Theory 61 (5), pp. 2886–2908. External Links: Cited by: §3.1.1, §3.2.
- Fitting nonparametric mixed logit models via expectation-maximization algorithm. The Stata Journal 12 (2), pp. 284–298. Cited by: §1, §3.1.2.
- Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review 52 (3), pp. 471–501. Cited by: §3.2.
- Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pp. 3065–3070. Cited by: §3.2.
- Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, USA, pp. 51–58. External Links: Cited by: §1, §3.1.1, §3.2.
- Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. SIAM Journal on Matrix Analysis and Applications 33 (4), pp. 1395–1416. Cited by: §3.1.1.
- Evaluating median crossover likelihoods with clustered accident counts: an empirical inquiry using the random effects negative binomial model. Transportation Research Record: Journal of the Transportation Research Board (1635), pp. 44–48. Cited by: §6.2.
- An exploratory multinomial logit analysis of single-vehicle motorcycle accident severity. Journal of safety research 27 (3), pp. 183–194. Cited by: §6.2.
- Conditioning of quasi-newton methods for function minimization. Mathematics of computation 24 (111), pp. 647–656. Cited by: §1, §4.1.
- Robust subspace clustering. The Annals of Statistics 42 (2), pp. 669–699. Cited by: §1, §3.2.
- Newton’s method with a model trust region modification. SIAM Journal on Numerical Analysis 19 (2), pp. 409–426. Cited by: §4.1.
- Gender and age differences among teen drivers in fatal crashes. In 56th Annual Scientific Conference of the Association for the Advancement of Automotive Medicine, Cited by: §1.
- Statewide integrated traffic records system. External Links: Cited by: §1, §6.
- An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization 6, pp. . Cited by: §4.1, §4.1.
- Discrete choice methods with simulation. Cambridge Books. Cited by: §1.
- EM algorithms for nonparametric estimation of mixing distributions. Journal of Choice Modelling 1 (1), pp. 40–69. Cited by: §1, §1, §1, §3.1.2.
- Discrete choice methods with simulation. 2 edition, Cambridge University Press. Cited by: §1, §2.
- Halton sequences for mixed logit. Technical report University of California Berkeley, Department of Economics, Berkeley, California. Cited by: §1.
- Mixed logit with a flexible mixing distribution. Journal of choice modelling 19, pp. 40–53. Cited by: §1, §1, §2.
Why are big data matrices approximately low rank?.
SIAM Mathematics of Data Science (SIMODS), to appear. External Links: Cited by: §1, §3.2.
- Differences in male and female injury severities in sport-utility vehicle, minivan, pickup and passenger car accidents. Accident Analysis & Prevention 36 (2), pp. 135–147. Cited by: §6.2.
- Random taste heterogeneity in discrete choice models: flexible nonparametric finite mixture distributions. Transportation Research Part B: Methodological 106, pp. 76–101. Cited by: §1, §2, §3.1.2.
- Numerical optimization. Springer Science 35 (67-68), pp. 7. Cited by: §1.
- Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1), pp. 49–67. Cited by: §3, §4.1.
- An experimental study on stochastic gradient-based methods for fitting mixed logit models. working paper. Cited by: §1.