1 Introduction
Emerging shared mobility services, such as car sharing, bike sharing, ride-sourcing, and microtransit, have rapidly gained popularity across cities and are gradually changing how people move around. Predicting individual preferences for these services and the induced changes in travel behavior is critical for transportation planning. Traditionally, travel behavior research has been primarily supported by discrete choice models, most notably the logit family, including the multinomial logit (MNL), nested logit, and mixed logit models. In recent years, as machine learning has become pervasive in many fields, there has been growing interest in its application to individual mode choice behavior. A number of recent publications have compared machine-learning methods and logit models in modeling travel mode choices, with a particular emphasis on their respective out-of-sample predictive accuracy. These studies have often found that machine-learning models such as neural networks (NN) and support vector machines (SVM) perform better than logit models
(e.g., xie2003work; zhang2008travel). However, the existing literature comparing logit models and machine learning for modeling travel mode choice has a number of important gaps. First, the comparisons were usually made between the MNL model, the simplest logit model, and machine-learning algorithms of varying complexity. In cases where the assumption of independence of irrelevant alternatives (IIA) is violated, such as when panel data (i.e., data containing multiple mode choices made by the same individuals) are examined, more advanced logit models such as the mixed logit model should be considered. Second, previous mode-choice studies rarely applied machine-learning models for behavioral analysis (e.g., examining variable importance, elasticities, and marginal effects) and compared the behavioral findings with those obtained from logit models. In mode-choice modeling applications, the behavioral interpretation of the results is as important as the prediction problem, since it offers valuable insights that help transit planners and policymakers prioritize the design of service attributes. Third, existing studies rarely discuss the fundamental differences in the application of machine-learning methods and logit models to travel-mode choice modeling. The notable differences between the two approaches in input data structure and data needs, the modeling of alternative-specific attributes, and the form of predicted outputs carry significant implications for model comparison. These differences and their implications, although touched on by some researchers such as omrani2013prediction, have not been thoroughly examined.
This paper tries to bridge these gaps: It provides a comprehensive comparison of logit models and machine learning in modeling travel mode choices, as well as an empirical evaluation of the two approaches based on stated-preference (SP) survey data on a proposed mobility-on-demand (MOD) transit system, i.e., an integrated transit system that runs high-frequency buses along major corridors and operates on-demand shuttles in the surrounding areas (TS2017). The paper first discusses the fundamental differences in the practical applications of the two types of methods, with a particular focus on the implications for the predictive performance of each framework and their capabilities to facilitate behavioral interpretations. Then, we compare the performance of two logit models (MNL and mixed logit) and seven machine-learning classifiers, including Naive Bayes (NB), classification and regression trees (CART), boosting trees (BOOST), bagging trees (BAG), random forest (RF), SVM, and NN, in predicting individual choices of four travel modes and their respective market shares. Moreover, we provide behavioral interpretations of the best-performing models for each approach and contrast the findings. We find that machine learning can produce higher out-of-sample prediction accuracy than logit models. Moreover, machine learning is better at revealing nonlinear relationships between trip attributes and the choice output, but may produce unreasonable behavioral outputs if the computation of marginal effects and elasticities follows a standard procedure. To tackle this problem, we propose to incorporate behavioral constraints into the procedure for calculating marginal effects and elasticities for machine-learning models, and the results improve as expected.
The rest of the paper is organized as follows. The next section provides a brief review of the literature on modeling mode choices with logit and machine-learning models. Section 3 explains the fundamentals of the logit and machine-learning models, including model formulation and input data structures, model development and evaluation, and model interpretation and application. Section 4 introduces the data used for empirical evaluation, and Section 5 describes the logit and machine-learning models examined and their specifications. Section 6 evaluates these models in terms of predictive capability and interpretability. Lastly, Section 7 concludes by summarizing the findings, identifying the limitations of the paper, and suggesting future research directions. Table 1 presents the list of abbreviations and acronyms used in this paper.
MNL  Multinomial logit
NB  Naive Bayes
CART  Classification and regression trees
RF  Random forest
BOOST  Boosting trees
BAG  Bagging trees
SVM  Support vector machines
NN  Neural networks
AIC  Akaike information criterion
BIC  Bayesian information criterion
Min  Minimum
Max  Maximum
SD  Standard deviation
SP  Stated preference
RP  Revealed preference
IIA  Independence of irrelevant alternatives
PT  Public transit
2 Literature Review
The logit family is a class of econometric models based on random utility maximization (ben1985discrete). Due to their statistical foundations and their capability to represent individual choice behavior realistically, the MNL model and its extensions have dominated travel behavior research ever since their formulation in the 1970s (mcfadden1973conditional). The MNL model is frequently challenged for its major assumption, the IIA property, and for its inability to account for taste variations among different individuals. To address these limitations, modelers have developed important extensions of the MNL model, such as the nested logit model and, more recently, the mixed logit model. The mixed logit model, in particular, has received much interest in recent years: Unlike the MNL model, it does not require the IIA assumption, can accommodate preference heterogeneity, and may significantly improve on the MNL model's behavioral realism in representing consumer choices (hensher2003mixed).
Mode-choice modeling can also be viewed as a classification problem, providing an alternative to logit models. A number of recent publications have suggested that machine-learning classifiers such as CART, NN, and SVM are effective in modeling individual travel behavior (xie2003work; zhang2008travel; omrani2013prediction; omrani2015predicting; hagenauer2017comparative; golshani2018modeling; wang2018machine). These studies generally found that machine-learning classifiers outperform traditional logit models in predicting travel-mode choices. For example, xie2003work applied CART and NN to model mode choices for commuting trips taken by residents of the San Francisco Bay Area. These machine-learning methods exhibited slightly better performance than the MNL model. Based on data collected in the same area, zhang2008travel reported that SVM can predict commuter travel mode choice more accurately than NN and MNL.
It is not surprising that machine-learning classifiers can outperform logit models in predictive tasks. Unlike logit models, which make strong statistical assumptions (i.e., constraining the model structure and assuming a certain distribution for the error term a priori), machine learning allows for more flexible model structures, which can reduce the model's incompatibility with the empirical data (xie2003work; christopher2016pattern). More fundamentally, the development of machine learning prioritizes predictive power, whereas advances in logit models are mostly driven by refining model assumptions, improving model fit, and enhancing the behavioral interpretation of the model results (brownstone1998forecasting; hensher2003mixed). In other words, the development of logit models prioritizes parameter estimation (i.e., obtaining better estimates of the parameters that underlie the relationship between the input variables and the output variable) and pays less attention to validating the model's out-of-sample predictive capability (mullainathan2017machine). In fact, recent studies have shown that the mixed logit model, despite yielding substantial improvements in overall model fit, often results in poorer prediction accuracy than the simpler and more restrictive MNL model (cherchi2010validation).

While recognizing the superior predictive power of machine-learning models, researchers often think that they have weak explanatory power (mullainathan2017machine). In other words, machine-learning models are often regarded as "not interpretable." Machine-learning studies rarely apply model outputs to facilitate behavioral interpretations, i.e., to test the response of the output variable to changes in the input variables in order to generate findings on individual travel behavior and preferences (karlaftis2011statistical). The outputs of many machine-learning models are indeed not directly interpretable, as one may need hundreds of parameters to describe a deep NN or hundreds of decision trees to understand an RF model. Nonetheless, many of the behavioral analyses applied in logit-model studies, such as the evaluation of variable importance, marginal effects, and elasticities, can be similarly implemented in machine-learning models via techniques such as partial dependence plots and sensitivity analysis (golshani2018modeling). Examining these behavioral outputs from machine-learning models could shed light on what factors drive prediction decisions, and also on the fundamental question of whether machine learning is appropriate for behavioral analysis.

Prediction and behavioral analysis are equally important in travel behavior studies. While the primary goal of some applications is to accurately predict mode choices (and investigators are usually more concerned about the prediction of the aggregate market share of each mode than about the prediction of individual choices), other studies may be more interested in quantifying the impact of different trip attributes on travel mode choices. To our knowledge, mode-choice applications that focus on behavioral outputs such as elasticities, marginal effects, value of time, and willingness-to-pay measures have received even more attention in the literature than those that focus on predicting individual mode choices or aggregate market shares. This paper thus extends the current literature by comparing the behavioral findings from logit models and machine-learning methods, going beyond the existing studies that primarily focus on predictive accuracy.
Finally, this paper points out other differences in the practical applications of the two approaches that have a bearing on model outputs and performance, including their input data structure and data needs, the treatment of alternative-specific attributes, and the forms of the predicted outputs. Discussions of these differences are largely absent from the current literature comparing the application of logit models and machine-learning algorithms in travel behavior research.
3 Fundamentals of the Logit and Machine-Learning Models
This section discusses the fundamentals of the logit and machine-learning models. Table 2 presents the list of symbols and notations used in the paper, and Table 3 summarizes the comparison between logit and machine-learning models from various angles. The rest of this section describes this comparison in detail.
Symbols  Description

$J$  Total number of alternatives
$N$  Total number of observations
$K$  Total number of features
$X$  Input data for logit models, containing $K$ features with $N$ observations for each of $J$ alternatives
$X_{jk}$  Feature $k$ for alternative $j$ of $X$
$X_{j,-k}$  All the features of $X$ for alternative $j$ except feature $k$
$X_j$  Input data for alternative $j$, where $j = 1, \dots, J$
$x_{jn}$  The $n$th observation (a row vector) of $X_j$, where $n = 1, \dots, N$
$x_{jnk}$  The $k$th feature of $x_{jn}$, where $k = 1, \dots, K$
$Z$  Input data for machine-learning models, containing $K$ features and $N$ observations
$Z_k$  Feature $k$ of $Z$
$Z_{-k}$  All the features of $Z$ except $k$
$z_n$  The $n$th observation of $Z$
$U_j$  Utility function for mode $j$
$\beta_j$  Parameter vector for alternative $j$ of the MNL model
$\beta$  Parameter matrix of the MNL model, $\beta = (\beta_1, \dots, \beta_J)$
$\hat{\beta}$  Estimated parameter matrix of the MNL model
$\varepsilon_j$  Random error for alternative $j$ of the MNL model
$y$  Observed mode choice data
$\hat{y}_n$  Estimated mode choice for observation $n$
$\theta$  Parameter or hyperparameter vector for machine-learning models
$\hat{\theta}$  Estimated parameter or hyperparameter vector
$f$  Trained machine-learning model using $Z$ and $y$
$P_{jn}$  Probability of choosing alternative $j$ for observation $n$
$\hat{P}_{jn}$  Probability prediction for choosing alternative $j$ for observation $n$
$\mathbb{1}[\hat{y}_n = j]$  Indicator function that equals 1 if $\hat{y}_n = j$
$\hat{s}_j^{\text{logit}}$  Aggregate-level prediction for mode $j$ based on $X$ and $\hat{\beta}$ for logit models
$\hat{s}_j^{\text{ML}}$  Aggregate-level prediction for mode $j$ based on $Z$ and $f$ for machine-learning models
$\eta$  Arc elasticity
$M$  Marginal effect
$C$  Constant
Logit Models  Machine-Learning Models

Model formulation  $U_j = \beta_j^{\top} x_j + \varepsilon_j$  $y = f(Z; \theta)$
Commonly used model types  MNL, mixed logit, nested MNL, generalized MNL  CART, BAG, BOOST, RF, NB, SVM, NN
Prediction type  Class probability: $\hat{P}_{jn}$  Classification: $\hat{y}_n$
Data input  $X$  $Z$
Model topology  Layer structure  Layer structure, tree structure, case-based reasoning, etc.
Optimization method  Maximum likelihood estimation, simulated maximum likelihood  Back propagation, gradient descent, recursive partitioning, structural risk minimization, maximum likelihood, etc.
Evaluation criteria  (Adjusted) McFadden's pseudo $R^2$, AIC, BIC  Resampling-based measures, e.g., cross validation
Individual-level mode prediction  $\hat{y}_n = \arg\max_j \hat{P}_{jn}$  $\hat{y}_n = f(z_n; \hat{\theta})$
Aggregate-level mode share prediction  $\hat{s}_j^{\text{logit}} = \frac{1}{N}\sum_{n=1}^{N} \hat{P}_{jn}$  $\hat{s}_j^{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}[\hat{y}_n = j]$
Variable importance  Standardized beta coefficients  Variable importance, computed using the Gini index, out-of-bag error, and many others
Variable effects  Sign and magnitude of beta coefficients  Partial dependence plots
Arc elasticity for feature $k$  Eqn. (8), with shares from Eqn. (6)  Eqn. (8), with shares from Eqn. (7)
Marginal effects for feature $k$  Eqn. (9), with shares from Eqn. (6)  Eqn. (9), with shares from Eqn. (7)
3.1 Model Development
Logit models and machine-learning models approach the mode choice prediction problem from different perspectives. Logit models view the mode choice problem as individuals selecting a mode from a set of travel options in order to maximize their utility. Under the random utility maximization framework, the model assumes that each mode provides a certain level of (dis)utility to a traveler and specifies, for each mode, a utility function with two parts: a component representing the effects of observed variables and a random error term representing the effects of unobserved factors (ben1985discrete). For example, the utility of choosing mode $j$ under the MNL model can be defined as
(1)  $U_{jn} = \beta_j^{\top} x_{jn} + \varepsilon_{jn}$

where $\beta_j$ is the vector of coefficients to be estimated and $\varepsilon_{jn}$ is the unobserved random error for choosing mode $j$. Different logit models are formed by specifying different types of error terms and different choices of coefficients on the observed variables. For instance, assuming a Gumbel-distributed error term and fixed model coefficients (i.e., coefficients that are the same for all individuals) produces the MNL model (ben1985discrete). In the MNL model, the probability of choosing alternative $j$ for individual $n$ is
(2)  $P_{jn} = \dfrac{\exp(\beta_j^{\top} x_{jn})}{\sum_{l=1}^{J} \exp(\beta_l^{\top} x_{ln})}$
Given the coefficient matrix $\beta$, the MNL model can be associated with the likelihood function
(3)  $L(\beta) = \prod_{n=1}^{N} \prod_{j=1}^{J} P_{jn}^{\,y_{jn}}$, where $y_{jn} = 1$ if individual $n$ chose mode $j$ and $y_{jn} = 0$ otherwise.
Maximum likelihood estimation can then be applied to obtain the "best" utility coefficients $\hat{\beta}$. By plugging $\hat{\beta}$ into Eqn. (2), the choice probabilities for each mode can be obtained. More complex logit models, such as the mixed logit and nested logit, can be derived similarly from different assumptions about the coefficients and error distribution. However, these models are more difficult to fit: They generally do not have closed-form likelihood functions and require simulated maximum likelihood for parameter estimation. Observe also that logit models have a layer structure, which maps the input layer to the output layer.
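As a concrete illustration, the choice probabilities in Eqn. (2) and the likelihood in Eqn. (3) can be sketched in a few lines of numpy (a minimal sketch; the function and variable names are ours, not from the paper):

```python
import numpy as np

def mnl_probabilities(X, beta):
    """MNL choice probabilities for one individual (Eqn 2).

    X    : (J, K) array of feature values, one row per alternative.
    beta : (J, K) array of alternative-specific coefficients.
    Returns a length-J vector of choice probabilities.
    """
    v = (X * beta).sum(axis=1)   # systematic utilities V_j = beta_j' x_j
    v = v - v.max()              # subtract max to stabilize the exponentials
    expv = np.exp(v)
    return expv / expv.sum()

def log_likelihood(X_all, y, beta):
    """Log of the likelihood in Eqn (3) over N observations.

    X_all : (N, J, K) feature array; y : chosen-alternative index per observation.
    """
    return sum(np.log(mnl_probabilities(X_all[n], beta)[y[n]])
               for n in range(len(y)))
```

Maximizing `log_likelihood` over `beta` (e.g., with a generic optimizer) recovers the maximum likelihood estimate $\hat{\beta}$ described above.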
Machine-learning models, in contrast, view mode choice prediction as a classification problem: Given a set of input variables, predict which travel mode will be chosen. More precisely, the goal is to learn a target function $f$ that maps the input variables $Z$ to the output target $y$ as

(4)  $y = f(Z; \theta)$

where $\theta$ represents the unknown parameter vector for parametric models like NB and the hyperparameter vector for nonparametric models such as SVM, CART, and RF. Unlike logit models, which predetermine a (usually) linear model structure and make specific assumptions about parameters and error distributions, many machine-learning models are nonlinear and/or nonparametric, which allows more flexible model structures to be learned directly from the data. In addition, whereas logit models maximize likelihood to estimate parameters, machine-learning models often apply different optimization techniques, such as back propagation and gradient descent for NN, recursive partitioning for CART, and structural risk minimization for SVM. Moreover, while logit models have a layer structure, machine-learning models have different model topologies: For example, tree-based models (CART, BAG, BOOST, and RF) all have a tree structure, whereas NN has a layer structure.
Furthermore, since the outputs of logit models are individual choice probabilities, it is difficult to compare the predictions with the observed mode choices directly. Therefore, when evaluating the predictive accuracy of logit models at the individual level, a common practice in the literature is to assign each individual to the alternative with the largest predicted probability, i.e.,

(5)  $\hat{y}_n = \arg\max_{j \in \{1, \dots, J\}} \hat{P}_{jn}$
This produces the same type of output (i.e., the travel mode choice) as the machine-learning models. Besides the prediction of individual choices, logit models and machine-learning methods are often evaluated on their capability to reproduce the aggregate choice distribution, i.e., the market share of each mode. For logit models, the predicted market share of mode $j$ is

(6)  $\hat{s}_j^{\text{logit}} = \dfrac{1}{N} \sum_{n=1}^{N} \hat{P}_{jn}$

and, for machine-learning methods, it is given by

(7)  $\hat{s}_j^{\text{ML}} = \dfrac{1}{N} \sum_{n=1}^{N} \mathbb{1}[\hat{y}_n = j]$
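The two aggregation rules in Eqns. (6) and (7) are easy to mistake for one another in code; the following sketch (function names are ours) makes the difference explicit:

```python
import numpy as np

def share_from_probs(P):
    """Eqn (6): market share of each mode as the mean predicted
    probability over all observations. P is an (N, J) matrix."""
    return P.mean(axis=0)

def share_from_classes(y_hat, n_modes):
    """Eqn (7): market share of each mode as the fraction of
    observations assigned to it by the arg-max rule of Eqn (5)."""
    return np.bincount(y_hat, minlength=n_modes) / len(y_hat)

# Three observations, three modes: the two rules can disagree.
P = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.5, 0.3]])
print(share_from_probs(P))                      # approx. [0.367, 0.433, 0.200]
print(share_from_classes(P.argmax(axis=1), 3))  # approx. [0.333, 0.667, 0.000]
```

Note how the arg-max rule pushes the predicted shares toward the modes that win most observations, even when the underlying probabilities are close.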
The calibration of the logit models is targeted at approximating aggregate market shares, as opposed to giving an absolute prediction on the individual choice (ben1985discrete; hensher2005applied). Thus, the predictive accuracy of the models may differ at the individual level and the aggregate level: Which of them should be prioritized depends on the project at hand.
Another important difference between the two approaches lies in the input data structures. The fitting of a logit model requires data on all available alternatives: Even if the attributes of non-chosen alternatives are not observed, their values need to be modeled. In contrast, machine-learning algorithms require the observed (chosen) mode only, and not necessarily information on the non-chosen alternatives. For example, many previous studies have only considered attribute values of the chosen mode (e.g., the travel time of the chosen mode (xie2003work; wang2018machine)) in their machine-learning models. We believe that it is better to also consider attribute values of the non-chosen modes: If the travel times of the non-chosen modes are not provided, the machine-learning model learns the mode choice based on the chosen-mode travel time alone, which cannot be used to analyze mode changes or to plan new transportation projects or services, such as the mobility-on-demand transit system studied in this paper.
Figure 1 shows one observation serving as the input to logit models and machine-learning models, respectively. This difference has implications for the flexibility of the two types of models in handling alternative-specific attributes (e.g., wait time is a transit-specific attribute). Thanks to their layered structure, logit models can easily accommodate these variables, since each alternative has its own utility function and alternative-specific attributes only enter the utility functions of the corresponding alternatives. While alternative-specific attributes can also be added to machine-learning models, these models do not explicitly capture that such attributes are associated only with certain alternatives.
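To make the input-structure difference concrete, here is a hedged sketch of one observation in the "long" layout a logit package would expect and the "wide" layout a machine-learning classifier would expect (column names and attribute values are illustrative, not taken from the survey):

```python
import pandas as pd

# Long format (logit): one row per (observation, alternative); attributes of
# ALL alternatives are required, and alternative-specific attributes such as
# wait time are simply zero/absent for the other modes.
long_df = pd.DataFrame({
    "obs_id":      [1, 1, 1, 1],
    "mode":        ["Car", "Walk", "Bike", "PT"],
    "travel_time": [15.0, 32.0, 15.0, 19.0],
    "wait_time":   [0.0, 0.0, 0.0, 5.0],   # transit-specific attribute
    "chosen":      [0, 0, 0, 1],
})

# Wide format (machine learning): one row per observation, with each
# alternative's attributes as separate feature columns and the chosen
# mode as the class label.
wide_df = (long_df.pivot(index="obs_id", columns="mode",
                         values="travel_time")
                  .add_prefix("tt_"))
wide_df["wait_time_pt"] = 5.0
wide_df["choice"] = "PT"
print(wide_df)
```

In the wide layout, nothing marks `wait_time_pt` as belonging only to the PT alternative; the classifier treats it as just another feature column.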
3.2 Model Evaluation
When evaluating statistical and machine-learning models, the goal is to minimize the overall prediction error, which is a sum of three terms: the bias, the variance, and the irreducible error. The bias is the error due to incorrect assumptions of the model. The variance is the error arising from the model's sensitivity to small fluctuations in the dataset used for fitting it. The irreducible error results from the noise in the problem itself. The relationship between bias and variance is often referred to as the "bias-variance tradeoff," which captures the tension between goodness-of-fit and model complexity. Goodness-of-fit measures the discrepancy between the observed values and the values expected under the model. Better-fitting models tend to be more complex, which may create overfitting issues and decrease the model's predictive capabilities. On the other hand, simpler models tend to have a worse fit and a higher bias, causing the model to miss relevant relationships between input variables and outputs, which is known as underfitting. Therefore, in order to balance the bias-variance tradeoff and obtain a model with low bias and low variance, one needs to consider multiple models at different complexity levels and use an evaluation criterion to identify the model that minimizes the overall prediction error. This process is known as model selection. The evaluation criteria can be theoretical measures, such as the adjusted $R^2$, AIC, and BIC, and/or resampling-based measures, such as cross validation and bootstrapping. Resampling-based measures are generally preferred over theoretical measures.

The selection of statistical models is usually based on theoretical measures. For example, when using logit models to predict individual mode choices, researchers usually calibrate the models on the entire dataset, examine the log-likelihood at convergence, and compare the resulting adjusted McFadden's pseudo $R^2$ (mcfadden1973conditional), AIC, and/or BIC in order to determine a best-fitting model. These three measures penalize the likelihood for including too many "useless" features. The adjusted McFadden's pseudo $R^2$ is most commonly reported for logit models, and a value between 0.2 and 0.3 is generally considered to indicate a satisfactory model fit (mcfadden1973conditional). On the other hand, AIC and BIC are commonly used to compare models with different numbers of variables.
When applying machine-learning models, cross validation is usually conducted to evaluate a set of different models, with different variable selections, model types, and choices of hyperparameters. The best model is identified as the one with the highest out-of-sample predictive power. A commonly used method is 10-fold cross validation, which applies the following procedure: 1) randomly split the entire dataset into 10 disjoint, equal-sized subsets; 2) choose one subset for validation and the rest for training; 3) train all the machine-learning models on the training set; 4) test all the trained models on the validation set and compute the corresponding predictive accuracy; 5) repeat Steps 2) to 4) ten times, with each of the 10 subsets used exactly once as the validation data; and 6) average the 10 validation results for each model to produce a mean estimate. Cross validation allows researchers to compare very different models with the single goal of assessing their predictive accuracy. This paper compares the logit and machine-learning models using 10-fold cross validation in order to evaluate their predictive capabilities at the individual and aggregate levels.
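The 10-fold procedure above can be run in a few lines with scikit-learn; the sketch below uses synthetic data as a stand-in (the dataset, model settings, and names are placeholders, not those of the paper):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data; in the paper, X would hold trip attributes and
# socio-demographics and y the chosen mode among four alternatives.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=6, n_classes=4, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "MNL-like (multinomial logistic)": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    # Mean out-of-sample accuracy over the 10 held-out folds.
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f}")
```

The same loop accommodates any classifier with a `fit`/`predict` interface, which is what allows very different model families to be compared under one criterion.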
Finally, when applying statistical models such as the logit models, researchers often take into account the underlying theoretical soundness and the behavioral realism of the model outputs to identify a final model (in addition to relying on the adjusted McFadden's pseudo $R^2$, AIC, and/or BIC). In other words, even though balancing the bias-variance tradeoff is very important, in statistical modeling a "worse" model may be preferred for reasons of theoretical soundness and behavioral realism. For example, since worsening the performance of a travel mode should decrease its attractiveness, the utility coefficients of level-of-service attributes such as transit wait time should always have a negative sign. Therefore, when a "better" model produces a positive sign for wait time, a "worse" model may be preferred. For machine-learning models, on the other hand, predictive accuracy is typically the sole criterion for selecting the best model, so machine-learning models may produce results that contradict theoretical soundness or behavioral realism. This paper, however, shows that machine-learning models can also be selected based on behavioral realism through behavioral interpretation.
3.3 Model Interpretation and Application
The interpretation of logit model outputs is intuitive and simple. As with any other statistical model, researchers can quickly learn how and why a logit model works by examining the sign, relative magnitude, and statistical significance of the model coefficients. Researchers may also apply these outputs to conduct further behavioral analysis of individual travel behavior, such as deriving marginal-effect and elasticity estimates, comparing the utility differences across various types of travel time, and calculating travelers' willingness-to-pay for trip time and other service attributes. All of these applications can be validated by explicit mathematical formulations and derivations, which allows modelers to clearly explain what happens "behind the scenes."
In contrast, machine-learning models are often criticized for being "black boxes" and lacking interpretability (klaiber2011random). Some complex machine-learning models, such as NN and RF, may contain hundreds or even thousands of parameters, and no human language can describe exactly how they work. In practice, it is often the case that more complex models have higher prediction accuracy, but increasing complexity inevitably reduces interpretability. Accordingly, machine-learning practitioners rarely attempt to interpret the model parameters directly. Instead, they apply model-agnostic interpretability methods, such as variable importance and partial dependence plots, to extract explanations from model outputs (molnar2018interpretable). On the one hand, variable importance measures show which variables have the most impact when predicting the response variable. Different machine-learning models compute variable importance in different ways; for tree-based models (such as CART and RF), for example, the mean decrease in node impurity (measured by the Gini index) is commonly used. On the other hand, partial dependence plots measure the influence of a variable on the log-odds or probability of choosing a mode after accounting for the average effects of the other variables (friedman2001elements). In recent years, as machine learning became increasingly popular in the study of societal systems, there has been a surge of research interest in the development of these model-agnostic methods to make machine-learning models and their decisions understandable (vellido2012MLinterp).

Arguably, the behavioral insights that one can obtain from logit models (with parameter ratios, marginal effects, and elasticities) may also be generated by machine-learning models through the application of model-agnostic interpretability methods and sensitivity analysis. For example, for machine-learning models, the arc elasticity for feature $k$ can be obtained by
(8)  $\eta_{jk} = \dfrac{\left[\hat{s}_j(x_k + \Delta x_k) - \hat{s}_j(x_k)\right] / \hat{s}_j(x_k)}{\Delta x_k / x_k}$

and the marginal effect for feature $k$ can be computed as

(9)  $M_{jk} = \dfrac{\hat{s}_j(x_k + \Delta x_k) - \hat{s}_j(x_k)}{\Delta x_k}$

where $\hat{s}_j(\cdot)$ denotes the predicted aggregate share of mode $j$ after the indicated change in feature $k$.
In essence, all of these techniques, despite their obvious differences, measure how the output variable responds to changes in the input features. In the context of travel mode choices, they help researchers better understand how individual choices of travel modes are impacted by a variety of factors, such as the socioeconomic and demographic characteristics of travelers and the trip attributes of each travel mode. In the current literature, however, the behavioral findings gained from machine-learning models are rarely compared with those obtained from logit models. Since the goals of mode choice studies often lie in extracting knowledge that sheds light on individual travel preferences and travel behavior, rather than merely predicting mode choices, such comparisons are necessary for a more thorough evaluation of the adequacy of machine learning. Machine-learning models that have excellent predictive power but generate unrealistic behavioral results are not useful in travel behavior studies.
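Under our reading of Eqns. (8) and (9), the sensitivity analysis for any fitted classifier amounts to perturbing one feature and re-aggregating the predicted shares; a hedged sketch (the function names and synthetic data are ours, not the authors' code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def aggregate_share(model, X, mode):
    """Mean predicted probability of `mode` over the sample (cf. Eqn 6)."""
    return model.predict_proba(X)[:, mode].mean()

def arc_elasticity(model, X, feature, mode, pct=0.01):
    """Eqn (8): percent change in the aggregate share of `mode` per
    percent change in `feature`, other features held at observed values."""
    X_up = X.copy()
    X_up[:, feature] *= (1.0 + pct)
    s0 = aggregate_share(model, X, mode)
    s1 = aggregate_share(model, X_up, mode)
    return ((s1 - s0) / s0) / pct

def marginal_effect(model, X, feature, mode, delta=1.0):
    """Eqn (9): change in the aggregate share of `mode` per unit
    increase in `feature`."""
    X_up = X.copy()
    X_up[:, feature] += delta
    return (aggregate_share(model, X_up, mode)
            - aggregate_share(model, X, mode)) / delta

# Usage on synthetic data; any classifier exposing predict_proba works.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(arc_elasticity(model, X, feature=0, mode=1))
print(marginal_effect(model, X, feature=0, mode=1))
```

Because nothing in this procedure constrains the sign of the response, the resulting elasticities can violate behavioral expectations, which motivates the behavioral constraints proposed in this paper.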
4 The Data for Empirical Evaluation
The data used for empirical evaluation came from a stated-preference (SP) survey completed by faculty, staff, and students at the University of Michigan's Ann Arbor campus. In the survey, participants were first asked to estimate the trip attributes (e.g., travel time, cost, and wait time) of their home-to-work travel for each of the following modes: walking, biking, driving, and taking the bus. Then, the survey asked respondents to envision a change in the transit system, i.e., a situation where a new public transit (PT) system, named RITMO Transit (RitmoTransit), which fully integrates high-frequency fixed-route bus services and microtransit services, has replaced the existing bus system (see Figure 2). Text descriptions were coupled with graphical illustrations to facilitate the understanding of the new system. Each survey participant was then asked to make a commute-mode choice among Car, Walk, Bike, and PT in seven stated-choice experiments, where the trip attributes for Walk, Bike, and Car were the same as their self-reported values and the trip attributes for PT were pivoted from those of driving and taking the bus. A more detailed description of the survey can be found in YAN2018.
A total of 8,141 observations collected from 1,163 individuals were kept for analysis after a data-cleaning process. The variables entering the analysis include the trip attributes for each travel mode, several sociodemographic variables, transportation-related residential preference variables, and current/revealed travel mode choices. The travel attributes include travel time for all modes, wait time for PT, daily parking cost for driving, number of additional pickups for PT, and number of transfers for PT. The socioeconomic and demographic variables include car access (car ownership for students and cars per capita in the household for faculty and staff), economic status (living expenses for students and household income for faculty and staff), gender, and identity status (i.e., faculty vs. staff vs. student). The transportation-related residential preference variables are the importance of walkability/bikeability and of transit availability when deciding where to live. Finally, current travel mode choices are also included, as state-dependence effects (i.e., the tendency of individuals to abandon or stick with their current travel mode) have been verified as important predictors of mode choice by many empirical studies. Table 4 summarizes the descriptive statistics of these variables, including a general description of each variable, category percentages for categorical variables, and the min, max, mean, and standard deviation for continuous variables.
Variable  Description  Category  %  Min  Max  Mean  SD 

Dependent Variable  
Mode Choice  Car  14.888  
Walk  28.965  
Bike  20.870  
PT  35.278  
Independent Variables  
TT_Drive  Travel time of driving (min)  2.000  40.000  15.210  6.616  
TT_Walk  Travel time of walking (min)  3.000  120.000  32.300  23.083  
TT_Bike  Travel time of biking (min)  1.000  55.000  15.340  10.447  
TT_PT  Travel time of using PT (min)  6.200  34.000  18.680  4.754  
Parking_Cost  Parking cost ($)  0.000  5.000  0.9837  1.678  
Wait_Time  Wait time for PT (min)  3.000  8.000  5.000  2.070  
Transfer  Number of transfers  0.000  2.000  0.328  0.646  
Rideshare  Number of additional pickups  0.000  2.000  1.105  0.816  
Income  Income level  1.000  6.000  1.929  1.342  
Bike_Walkability  Importance of bike and walkability  1.000  4.000  3.224  0.954  
PT_Access  Importance of PT access  1.000  4.000  3.093  1.023  
CarPerCap  Car per capita  0.000  3.000  0.529  0.476  
Female  Female or Male  Female  56.320  
Male  43.680  
Student  Students or faculty/staff  Student  73.517  
Faculty or staff  26.483  
Current_Mode_Car  Current travel mode is Car or not  Car  16.681  
Not Car  83.319  
Current_Mode_Walk  Current travel mode is Walk or not  Walk  40.413  
Not Walk  59.587  
Current_Mode_Bike  Current travel mode is Bike or not  Bike  8.254  
Not Bike  91.746  
Current_Mode_PT  Current travel mode is PT or not  PT  34.652  
Not PT  65.348 
After extracting the data from the SP survey, we preprocessed the data and verified that the independent variables exhibit little multicollinearity (farrar1967multicollinearity). The existence of multicollinearity can inflate the variance of coefficient estimates and negatively impact the predictive power of the models. This study used the variance inflation factor (VIF) to determine which variables are highly correlated with other variables and found that all variables had a VIF value of less than five, indicating that multicollinearity was not a concern.
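The VIF screening step described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' actual workflow, and the predictor data are hypothetical; it uses the standard identity that VIF values are the diagonal of the inverse correlation matrix of the predictors.

```python
# Sketch of variance-inflation-factor (VIF) screening: VIF_i equals the
# i-th diagonal element of the inverse of the predictors' correlation matrix.
# The predictor vectors below are hypothetical, not the survey variables.

def mean(xs):
    return sum(xs) / len(xs)

def corr(x, y):
    """Pearson correlation of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def invert(m):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def vif(columns):
    """columns: list of equal-length predictor vectors; returns one VIF each."""
    k = len(columns)
    R = [[corr(columns[i], columns[j]) for j in range(k)] for i in range(k)]
    Rinv = invert(R)
    return [Rinv[i][i] for i in range(k)]

# Hypothetical predictors: x1 and x2 are nearly collinear, x3 is not.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]
x3 = [3, 1, 4, 1, 5, 9]
vifs = vif([x1, x2, x3])  # VIF > 5 flags problematic collinearity
```

With these toy vectors the collinear pair receives VIF values far above the threshold of five used in the text, while the third predictor does not.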
5 Models Examined and Their Specifications
This section briefly introduces the logit and machine-learning models examined in this study. Since our dataset has a panel structure, a mixed logit model should normally be applied. However, we also fitted an MNL model as the benchmark for comparison, as previous studies generally compared machine-learning models with the MNL model only. Seven machine-learning models are examined, including simple ones like the Naive Bayes classifier (NB) and the classification and regression tree (CART), and more complex ones such as the random forest (RF), boosting (BOOST), bagging (BAG), the support vector machine (SVM), and the neural network (NN). Most previous mode choice studies only examined a subset of these models (xie2003work; omrani2013prediction; omrani2015predicting; wang2018machine; chen2017understanding).
5.1 Logit Models
We have already introduced the MNL model formulation in detail in Subsection 3.1, and so only the mixed logit model is presented here.
The mixed logit model is an extension of the MNL model, which addresses some of the MNL limitations (such as relaxing the IIA property assumption) and is more suitable for modeling panel choice datasets in which the observations are correlated (i.e., each individual is making multiple choices) (mcfadden2000mixed). A mixed logit model specification usually treats the coefficients in the utility function as varying across individuals but being constant over choice situations for each person (train2009discrete). The utility that individual $n$ obtains from alternative $j$ in choice occasion $t$ is
$$U_{njt} = \beta_n' X_{njt} + \varepsilon_{njt}, \quad (10)$$
where $\varepsilon_{njt}$ is a random error that is independent and identically distributed across people, alternatives, and time. Hence, conditioned on $\beta_n$, the probability of an individual making a sequence of choices $y_n = (y_{n1}, \ldots, y_{nT})$ is
$$P(y_n \mid \beta_n) = \prod_{t=1}^{T} \frac{\exp(\beta_n' X_{n y_{nt} t})}{\sum_{j} \exp(\beta_n' X_{njt})}. \quad (11)$$
Because the $\varepsilon_{njt}$'s are independent over the choice sequence, the corresponding unconditional probability is
$$P(y_n) = \int P(y_n \mid \beta) f(\beta) \, d\beta, \quad (12)$$
where $f(\beta)$ is the probability density function of $\beta$. This integral does not have an analytical solution, so it can only be estimated using simulated maximum likelihood (e.g., train2009discrete).

In this study, the MNL model can be summarized as follows: 1) the utility function of Car includes mode-specific parameters for TT_Drive, Parking_Cost, Income, CarPerCap, and Current_Mode_Car; 2) the utility function of Walk includes mode-specific parameters for TT_Walk, Student (sharing the same parameter with Bike), Female (sharing the same parameter with Bike), Bike_Walkability (sharing the same parameter with Bike), and Current_Mode_Walk; 3) the utility function of Bike includes mode-specific parameters for TT_Bike, Student (sharing the same parameter with Walk), Female (sharing the same parameter with Walk), Bike_Walkability (sharing the same parameter with Walk), and Current_Mode_Bike; and 4) the utility function of PT includes mode-specific parameters for TT_PT, Wait_Time, Rideshare, Transfer, Student, PT_Access, and Current_Mode_PT. We also specify three alternative-specific constants for Walk, Bike, and PT
respectively. The mixed logit model has the same specification except that travel time is generic across all modes. In addition, in order to accommodate individual preference heterogeneity (i.e., taste variations among different individuals), we specify the coefficients on all the level-of-service variables (i.e., travel time, Wait_Time, Parking_Cost, Transfer, and Rideshare) as random parameters, all assumed to follow a normal distribution. We use 1,000 Halton draws to perform the integration. Both the MNL and mixed logit models are estimated using the NLOGIT software.
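The simulation of Eqn. (12) can be sketched as follows. This is a minimal illustration, not the NLOGIT estimation used in the paper: coefficients, attributes, and choices are hypothetical, and plain pseudo-random normal draws stand in for the 1,000 Halton draws.

```python
import math
import random

def logit_prob(beta, X, chosen):
    """Conditional MNL probability of the chosen alternative given beta.

    X maps each alternative name to its attribute vector (same length as beta)."""
    utils = {j: sum(b * x for b, x in zip(beta, xs)) for j, xs in X.items()}
    denom = sum(math.exp(u) for u in utils.values())
    return math.exp(utils[chosen]) / denom

def simulated_seq_prob(mu, sigma, occasions, n_draws=500, seed=1):
    """Simulated mixed logit probability of one individual's choice sequence.

    Per draw, coefficients come from N(mu, sigma) and are held fixed across
    the individual's choice occasions; the product of conditional logit
    probabilities (Eqn. 11) is then averaged over draws (Eqn. 12)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        beta = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        p = 1.0
        for X, chosen in occasions:
            p *= logit_prob(beta, X, chosen)
        total += p
    return total / n_draws

# Hypothetical two-attribute, two-alternative example: one traveler observed
# in two choice occasions, choosing "pt" both times.
occasions = [
    ({"pt": [10.0, 1.0], "car": [15.0, 0.0]}, "pt"),
    ({"pt": [12.0, 1.0], "car": [14.0, 0.0]}, "pt"),
]
p = simulated_seq_prob(mu=[-0.1, 0.5], sigma=[0.02, 0.2], occasions=occasions)
```

The key design point is that each draw of the coefficient vector is shared across a person's choice occasions, which is exactly what distinguishes the panel mixed logit from averaging independent MNL probabilities.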
5.2 MachineLearning Models
5.2.1 Naive Bayes
The NB model is a simple machinelearning classifier. The model is constructed using Bayes’ Theorem with the naive assumption that all the features are independent
(mccallum1998comparison). NB models are useful because they are faster and easier to construct than more complicated models. As a result, NB models work well as a baseline classifier for large datasets. In some cases, NB even outperforms more complicated models (zhang2004optimality). A limitation of the NB model is that, in real-world situations, it is very unlikely for all the predictors to be completely independent of each other. Thus, the NB model is very sensitive to highly correlated predictors. In this study, the NB model is constructed through the R package e1071 (e1071).

5.2.2 Tree-based Models
The CART model builds classification or regression trees to predict either a categorical or a continuous dependent variable. In this paper, the CART model creates classification trees, where each internal node of the tree recursively partitions the data based on the value of a single predictor, and each leaf node represents the category (i.e., Car, Bike, PT, or Walk) predicted for that individual (breiman2017classification). The decision tree is sensitive to noise and prone to overfitting (last2002improving; quinlan2014c4). To control its complexity, the tree can be pruned. This study prunes the tree until the number of terminal nodes is 6. The CART model is obtained through the R package tree (tree).
To address the overfitting issues of CART models, tree-based ensemble techniques were proposed to form more robust, stable, and accurate models than single decision trees (breiman1996bagging; friedman2001elements). One of these ensemble methods is BOOST. For a $K$-class problem, BOOST creates a sequence of decision trees, where each successive tree seeks to improve the incorrect classifications of the previous trees. Predictions in BOOST are based on a weighted voting among all the boosting trees. Although BOOST usually has a higher predictive accuracy than CART, it is more difficult to interpret. Another drawback is that BOOST is prone to overfitting when too many trees are used. This study applies the gradient boosting machine technique to create the BOOST model
(friedman2001greedy). A total of 500 trees are used, with the shrinkage parameter set to 0.05 and the interaction depth to 10. The minimum number of observations in the trees' terminal nodes is 10. The BOOST model is created with the R package gbm (gbm).

Another well-known ensemble method is BAG, which trains multiple trees in parallel by bootstrapping data (i.e., sampling with replacement) (breiman1996bagging). The BAG model uses all independent variables to train the trees. For a $K$-class problem, after all the trees are trained, the BAG model makes the mode choice prediction by determining the majority vote among all the decision trees. By using bootstrapping, the BAG model is able to reduce the variance and overfitting problems of a single decision tree model. One potential drawback of the BAG model is that it assumes that all the features are independent; if the features are correlated, the variance would not be reduced by bagging. In this study, 600 classification trees are bagged, with each tree grown without pruning. The BAG model is produced with the R package ipred (ipred).
The RF model is also an ensemble method. Like BAG, RF trains multiple trees using bootstrapping (ho1998random). However, RF only uses a random subset of all the independent variables to train the classification trees. More precisely, the trees in RF use all the independent variables, but every node in each tree only uses a random subset of them (breiman2001random). By doing so, RF reduces variance between correlated trees and negates the drawback that BAG models may have with correlated variables. Similar to BAG, RF makes mode choice predictions by determining the majority votes among all the classification trees. Like other nonparametric models, RF is difficult to interpret. In this study, 600 trees are used and 10 randomly selected variables are considered for each split at the trees’ nodes. The R package used for producing the RF model is randomForest (RF).
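The bootstrap-and-vote logic shared by BAG and RF can be sketched as follows. This is a minimal illustration, not the R randomForest/ipred implementation used in the study: full trees are replaced by one-split "stumps", the random feature subset has size one, and the data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Best single threshold split on one feature (majority class per side)."""
    best = None
    values = sorted({x[feature] for x in X})
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    for t in thresholds:
        left = [lab for x, lab in zip(X, y) if x[feature] <= t]
        right = [lab for x, lab in zip(X, y) if x[feature] > t]
        lmaj = Counter(left).most_common(1)[0][0] if left else y[0]
        rmaj = Counter(right).most_common(1)[0][0] if right else y[0]
        errs = sum(lab != (lmaj if x[feature] <= t else rmaj)
                   for x, lab in zip(X, y))
        if best is None or errs < best[0]:
            best = (errs, feature, t, lmaj, rmaj)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=7):
    """Bootstrap the data per tree; pick one random feature per tree (a
    size-one stand-in for RF's random feature subset at each node)."""
    rng = random.Random(seed)
    n, k = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        feature = rng.randrange(k)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, x):
    """Majority vote among all trees, as in BAG and RF."""
    votes = [(lmaj if x[f] <= t else rmaj) for f, t, lmaj, rmaj in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical, linearly separable toy data (two features, two classes).
X = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
```

Even with such weak base learners, the majority vote recovers the correct class on this separable toy problem, which is the variance-reduction intuition behind both ensembles.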
5.2.3 Support Vector Machine
The SVM model is a binary classifier which, given labeled training data, finds the hyperplane maximizing the margin between two classes. This hyperplane is a linear or nonlinear (depending on the kernel) decision boundary that separates the two classes. Since a mode choice model typically involves multiclass classification, the one-against-one approach is used (hsu2002comparison). Specifically, for a $K$-class problem, $K(K-1)/2$ binary classifiers are trained to differentiate all possible pairs of classes. The class receiving the most votes among all the binary classifiers is selected for prediction. SVM usually performs well with both nonlinear and linear boundaries depending on the specified kernel. However, the SVM model can be very sensitive to overfitting, especially for nonlinear kernels (cawley2010over). In this study, an SVM with a radial basis kernel is used. The cost of constraint violation is set to 1.25, and the gamma parameter for the kernel is set to 0.4. The SVM model is produced with the R package e1071 (e1071).

5.2.4 Neural Network
A basic NN model has three layers of units/nodes, where each node can either be turned active (on) or inactive (off), and each connection between nodes in adjacent layers has a weight. The data is fed into the model at the input layer, goes through the weighted connections to the hidden layer, and lastly ends up at a node in the output layer, which contains $K$ units for a $K$-class problem. The hidden layer allows the NN to model nonlinear relationships between variables. Although NNs have shown promising results in modeling travel mode choice in some studies (omrani2015predicting), NN models tend to overfit and are difficult to interpret. In this paper, an NN with a single hidden layer of 10 units is used. The connection weights are trained by backpropagation with a weight decay constant of 0.1. The R package nnet (stats) is used to create our NN model.
6 Comparison of Empirical Results
This section presents the empirical results of this study. Specifically, it compares the predictive accuracy of the logit models with that of the machinelearning algorithms. In addition, it compares the behavioral findings of the bestperforming model (in terms of predictive accuracy) from each approach.
6.1 Predictive Accuracy
This study applied the 10-fold cross-validation approach. As discussed above, cross-validation requires subsetting the sample data into training sets and validation sets. One open issue is how to partition the sample dataset when it is a panel dataset (i.e., individuals with multiple observations). One approach is to treat all observations as independent choices and randomly divide these observations; the other is to subset by individuals, keeping each individual's full set of observations together. This study follows the first approach, which is commonly applied by previous studies (xie2003work; hagenauer2017comparative; wang2018machine).
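The first partitioning approach can be sketched as follows; subsetting by individuals would instead shuffle and slice a list of individual IDs, keeping each person's observations in a single fold.

```python
import random

def kfold_indices(n_obs, k=10, seed=42):
    """Randomly partition observation indices into k near-equal folds,
    treating every observation as an independent choice."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# 8,141 observations, as in the cleaned survey dataset.
folds = kfold_indices(8141, k=10)
```

Each fold serves once as the validation set while the remaining nine are pooled for training.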
As discussed in Subsection 3.1, the predictive power of the models may differ at the individual level (predicting the travel mode chosen in a particular choice situation) and at the aggregate level (predicting the market shares of each travel mode). The calibration of logit models focuses on reproducing market shares, whereas the development of machine-learning classifiers focuses on predicting individual choices. This study compares both the mean individual-level predictive accuracy and the mean aggregate-level predictive accuracy.
6.1.1 IndividualLevel Predictive Accuracy
The cross-validation results for individual-level predictive accuracy are shown in Table 5. Note that, while the machine-learning methods predict a particular travel mode, logit models return probabilities for all available modes. The results assume that the travel mode with the highest predicted probability is selected as the predicted mode for the logit models. The best two models are RF and BAG, with mean predictive accuracies of 0.856 and 0.843 respectively. By contrast, the accuracies of the MNL and mixed logit models are only 0.640 and 0.592 respectively, much lower than those of the two best-performing machine-learning models.
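The highest-probability rule used to score the logit models can be sketched as follows, with hypothetical predicted probabilities.

```python
def predicted_mode(probs):
    """Turn a logit model's probability vector into a point prediction by
    selecting the mode with the highest predicted probability."""
    return max(probs, key=probs.get)

def accuracy(prob_rows, actual):
    """Share of observations whose predicted mode matches the chosen mode."""
    hits = sum(predicted_mode(p) == a for p, a in zip(prob_rows, actual))
    return hits / len(actual)

# Hypothetical predicted probabilities for three observations.
rows = [
    {"Car": 0.1, "Walk": 0.6, "Bike": 0.1, "PT": 0.2},
    {"Car": 0.5, "Walk": 0.2, "Bike": 0.2, "PT": 0.1},
    {"Car": 0.2, "Walk": 0.3, "Bike": 0.4, "PT": 0.1},
]
acc = accuracy(rows, ["Walk", "Car", "PT"])  # two of three correct
```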
The predictive accuracy of each model by travel mode is also presented in Table 5. All models predict Walk most accurately: all machine-learning models have a mean predictive accuracy between 0.795 and 0.929 for this mode, whereas the MNL model has an accuracy of 0.860 and the mixed logit model 0.652. Both the logit models and the two best-performing machine-learning models predict PT and Bike relatively better than Car. One possible explanation is that Car, with a market share of 14.888%, has fewer observations than the other modes. Furthermore, many car owners always select Car regardless of changes in the PT profiles; such non-switching behavior makes it challenging to accurately predict Car (hess2010non).
Finally, it is somewhat surprising that the mixed logit model, a model that accounts for individual preference heterogeneity and has a significantly better model fit (adjusted McFadden's pseudo $R^2$ of 0.58) than the MNL model (adjusted McFadden's pseudo $R^2$ of 0.36), underperformed the MNL model in terms of out-of-sample predictive power. This finding is nonetheless consistent with the findings of cherchi2010validation. It suggests that the mixed logit model may have overfitted the data with the introduction of random parameters, and that such overfitting resulted in greater out-of-sample prediction error.
Overall  Car  Walk  Bike  PT  
Mean  SD  Mean  SD  Mean  SD  Mean  SD  Mean  SD  
MNL  0.640  0.016  0.393  0.028  0.860  0.017  0.414  0.032  0.698  0.027 
Mixed logit  0.592  0.013  0.481  0.026  0.652  0.033  0.573  0.032  0.601  0.034 
NB  0.602  0.013  0.612  0.046  0.852  0.018  0.368  0.038  0.529  0.030 
CART  0.593  0.014  0.428  0.032  0.795  0.022  0.329  0.038  0.653  0.026 
BOOST  0.836  0.010  0.775  0.028  0.916  0.014  0.798  0.031  0.817  0.028 
BAG  0.843  0.016  0.774  0.018  0.916  0.020  0.834  0.023  0.817  0.029 
RF  0.856  0.015  0.796  0.017  0.929  0.016  0.859  0.028  0.819  0.030 
SVM  0.731  0.016  0.589  0.021  0.858  0.027  0.575  0.030  0.778  0.026 
NN  0.643  0.014  0.445  0.046  0.863  0.026  0.420  0.040  0.678  0.028 
It is also useful to compare the four models (the two logit models and the two best-performing machine-learning models) using a $t$-test to check whether their mean accuracies are significantly different from each other. For each pair of models, the null hypothesis of the $t$-test is that the two mean predictive accuracies are the same, while the alternative hypothesis is that they differ. When the $p$-value is lower than a common threshold, e.g., 0.1 or 0.05, the null hypothesis may be rejected. Since multiple comparisons among the four models must be conducted, the $p$-values obtained from these comparisons have to be adjusted; otherwise, a null hypothesis could be incorrectly rejected by pure chance (dunnett1955multiple). Hence the $p$-values are adjusted by applying the Bonferroni correction (rice2006mathematical), which requires $p$-values $\le \alpha/m$, where $\alpha$ represents the significance level (in this case, $\alpha = 0.05$) and $m$ is the number of individual significance tests (i.e., $m = 6$). The results of the adjusted tests are given in Table 6, where the numbers below the diagonal are $p$-values, and the entries above the diagonal are testing conclusions (i.e., whether the mean difference is significant or not).

MNL  Mixed Logit  BAG  RF

MNL  Significant  Significant  Significant  
Mixed Logit  5.805e-5  Significant  Significant  
BAG  1.273e-9  1.160e-10  Significant  
RF  8.525e-11  1.656e-11  2.295e-3 
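The Bonferroni adjustment applied to Table 6 can be sketched as follows; the raw $p$-values below are hypothetical placeholders, not the study's estimates.

```python
from itertools import combinations

def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjust raw pairwise p-values: p_adj = min(1, m * p).
    A difference is significant when p_adj < alpha, equivalently when the
    raw p-value is below alpha / m."""
    m = len(p_values)
    return {pair: min(1.0, m * p) for pair, p in p_values.items()}

models = ["MNL", "Mixed logit", "BAG", "RF"]
# Hypothetical raw p-values for the m = C(4,2) = 6 pairwise t-tests.
raw = dict(zip(combinations(models, 2),
               [1e-5, 1e-9, 1e-10, 1e-10, 1e-11, 4e-4]))
adjusted = bonferroni(raw)
```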
With all the adjusted $p$-values in the last row smaller than 0.05, the mean accuracy of RF (the best model) is significantly different from that of every other model, which provides strong statistical evidence that RF has the best predictive performance at the individual level. With 85.6% predictive accuracy, it is advisable to apply machine learning to predict individual-level mode choices in activity-based or agent-based transportation models. The results indicate with high confidence that the logit models are weaker than the best-performing machine-learning model in terms of prediction.
6.1.2 AggregateLevel Predictive Accuracy
We now turn to aggregate-level predictive accuracy. To quantify the sum of the absolute differences between the market-share predictions and the real market shares from the validation data, we use the L1-norm, also known as the least absolute deviations. Taking the machine-learning models as an example, let $s_k$ and $\hat{s}_k$ represent the true (observed) and predicted shares for mode $k$. The L1-norm is thus defined as
$$\mathrm{L1} = \sum_{k} \left| s_k - \hat{s}_k \right|. \quad (13)$$
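Eqn. (13) can be computed as follows, with hypothetical market shares standing in for the cross-validation outputs.

```python
def l1_share_error(observed, predicted):
    """Sum of absolute differences between observed and predicted market
    shares across modes (the L1-norm of Eqn. 13)."""
    return sum(abs(observed[m] - predicted[m]) for m in observed)

# Hypothetical observed and predicted shares for one validation fold.
observed = {"Car": 0.15, "Walk": 0.29, "Bike": 0.21, "PT": 0.35}
predicted = {"Car": 0.17, "Walk": 0.27, "Bike": 0.22, "PT": 0.34}
err = l1_share_error(observed, predicted)  # 0.02 + 0.02 + 0.01 + 0.01 = 0.06
```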
The predictive accuracy results of the logit and machine-learning models at the aggregate level are reported in Table 7. The results show that RF outperforms all the other models, with a prediction error of 0.043 and a standard deviation of 0.014. Notably, even though logit models are expected to perform well for market-share predictions, RF has a lower error than both MNL (0.048) and mixed logit (0.076). Again, the MNL model achieved a higher aggregate-level predictive accuracy than the mixed logit model.
L1Norm  

Mean  SD  
MNL  0.048  0.020 
Mixed logit  0.076  0.026 
NB  0.297  0.043 
CART  0.293  0.038 
BOOST  0.067  0.022 
BAG  0.048  0.019 
RF  0.043  0.014 
SVM  0.174  0.030 
NN  0.221  0.037 
In summary, the results show that RF is the best model (among those evaluated) for forecasting travel choice for a new transit system featuring very different parameters.
6.2 Model Interpretation
Recent advances in machine learning make models interpretable through techniques such as variable importance measures and partial dependence plots. Machine-learning results can also be readily applied to compute behavioral outputs such as marginal effects and arc elasticities. However, other behavioral outputs such as the value of time, willingness to pay, and consumer welfare measures are hard to obtain from machine-learning models, because these measures are grounded in the random utility modeling framework and the assumption that individual utility is held constant while attributes of a product substitute for one another (e.g., paying a certain amount of money to save a unit of time). Machine-learning models lack the behavioral foundation required to obtain these measures.
This section examines the behavioral findings from the bestperforming logit model (MNL) and machinelearning model (RF). For the MNL model, we interpret the model results and calculate some behavioral outputs including marginal effects and elasticities. We then conduct comparable behavioral analysis on the RF model by applying variable importance and partial dependence plots and by performing sensitivity analysis. Finally, we compare and contrast the behavioral findings generated by the two models.
It should be noted that, while the mixed logit model is found to be inferior in terms of its predictive capacity (and hence its results are not discussed here), it can generate additional behavioral insights on individual travel that neither the MNL nor any machinelearning model can produce. Notably, the mixed logit model is very flexible in modeling (both observed and unobserved) preference heterogeneity, i.e., variations in traveler tastes for different attributes of the choice alternatives, among the study population. Since the MNL model does not recognize the panel data structure (i.e. repeated observations from the same individual), it has limited capacity to accommodate preference heterogeneity. For example, one can only model observed taste variations using the market segmentation approach (train2009discrete).
6.2.1 Variable Importance and Effects
The outputs of the MNL model are presented in Table 8. The adjusted McFadden's pseudo $R^2$ for this model is 0.36, which indicates a satisfactory model fit. All of the coefficient estimates are consistent with theoretical predictions. All level-of-service variables carry an intuitive negative sign, and all of them are statistically significant except for Parking_Cost. Individual sociodemographic characteristics are associated with travel mode choices: unsurprisingly, higher-income travelers with better car access are more likely to drive than to use alternative modes, and females are less likely to choose Walk and Bike than males, but there is no significant difference between the mode choices of students and faculty/staff. The model also shows that individual residential preferences and current travel mode choices are associated with the choice of Car, Walk, and Bike; however, people appear to have only a weak attachment to PT, as shown by the small and insignificant coefficient on Current_Mode_PT. Individuals who value walking, biking, and transit access when choosing where to live are more likely to use these modes. Also, the model shows that travelers tend to stick with their current mode even when a new travel option is offered.
The last column of the table shows the standardized beta coefficients for the MNL model, which allow researchers to assess the relative importance of the independent variables, i.e., a coefficient of larger magnitude indicates a greater impact of the corresponding independent variable on the choice outcome (menard2004six). These results show that the most important variable in predicting mode choice is TT_Bike, followed by the travel time variables for the other three modes, several revealed-preference (RP) variables (i.e., current travel modes), and some level-of-service attributes including Transfer, Rideshare, and Wait_Time. These results are reasonable and generally consistent with findings in the existing literature.
Variable  Alternatives  Unstandardized coefficients  S.E.  Standardized coefficients 
Constants  
Walk  Walk  2.882**  0.418  / 
Bike  Bike  1.767**  0.412  / 
PT  PT  3.260**  0.426  / 
Level-of-service variables  
TT_Drive  Car  -0.075**  0.005  -1.138** 
TT_Walk  Walk  -0.146**  0.004  -2.210** 
TT_Bike  Bike  -0.163**  0.006  -2.461** 
TT_PT  PT  -0.102**  0.009  -1.543** 
Wait_Time  PT  -0.158**  0.018  -0.327** 
Parking_Cost  Car  -0.134  0.096  -0.226 
Rideshare  PT  -0.438**  0.042  -0.358** 
Transfer  PT  -0.583**  0.050  -0.376** 
Sociodemographic variables  
Income  Car  0.076*  0.030  0.103* 
CarPerCap  Car  0.561**  0.083  0.267** 
Student  Walk, Bike  0.093  0.368  0.041 
PT  0.026  0.369  0.011  
Female  Walk, Bike  -0.174**  0.061  -0.086** 
Residential preference variables  
Bike_Walkability  Walk, Bike  0.072*  0.033  0.068* 
PT_Access  PT  0.112**  0.031  0.115** 
Current travel mode  
Current_Mode_Car  Car  1.369**  0.094  0.510** 
Current_Mode_Walk  Walk  1.291**  0.077  0.634** 
Current_Mode_Bike  Bike  2.891**  0.120  0.796** 
Current_Mode_PT  PT  0.090  0.073  0.043 
Sample size  1163  
Log-likelihood at constants  -11285.82  
Log-likelihood at convergence  -7160.62  
Adjusted McFadden's pseudo $R^2$  0.36 
* significant at the 5% level ** significant at the 1% level.
There is growing research interest in developing techniques to interpret machine learning in order to help explain the decisions behind complex models (miller2017explanation). This study applied widely used tools, including variable importance measures and partial dependence plots, to interpret the RF model and to compare the behavioral findings obtained from the RF model with those from the MNL model. Like (standardized) beta coefficients in a logit model, a variable importance measure can be used to indicate the impact of an input variable on predicting the response variable in machine-learning models. However, unlike (standardized) beta coefficients, which show the direction of association between the input variable and the outcome variable with a positive or negative sign, variable importance measures provide no such information.
This study uses the Gini index to measure variable importance for RF. Figure 3 shows the variable importance of each input feature for the RF and MNL models, scaled relative to the maximum value. Note that Student has two values in the MNL model because logit models have the flexibility of specifying alternative-specific coefficients to account for the different effects of a single feature on different alternatives. The ranking of the input features with respect to their relative importance in RF is generally consistent with that of the MNL model. The travel times of walking, driving, biking, and transit, and the revealed/current mode choice of biking, have a very high influence on the stated mode choice. Moreover, Student is important for neither the MNL model nor the RF model. On the other hand, slight differences do exist. For example, PT_Access and Bike_Walkability have greater importance in the RF model than in the MNL model. To conclude, the two models' outputs on variable importance are very similar, which implies that both models relied on similar information (variability of selected input features) contained in the sample data to predict the choice outcome.
Partial dependence plots are another important tool for interpreting machine-learning models. Figure 4 presents how the probability of choosing PT changes as the value of a selected variable changes for RF and MNL. The shape of each curve sheds light on the direction and magnitude of the changes, similar to the beta coefficients estimated in the MNL model. However, the beta coefficients in logit models affect the utility of mode $j$ (see Eqn. (1)) rather than the probability of choosing mode $j$ (see Eqn. (2)). Accordingly, we translate utility estimates into probability estimates for the MNL model in order to compare it with RF directly.
As shown in Figure 4(a), RF and MNL share a very similar trend for TT_PT, with a very similar slope between 10 and 25 minutes. In addition, Figures 4(b)-4(d) show that RF and MNL have similar patterns for Wait_Time, Rideshare, and Transfer, with RF having smaller slopes (in absolute value) than MNL. While MNL shows a nearly linear relationship between these features and the probability of choosing PT, RF reveals some interesting nonlinear relationships: 1) for TT_PT, RF has relatively flat tails before 10 minutes and after 25 minutes, suggesting that people tend to become insensitive to very short or very long transit times; 2) travelers are more sensitive to wait times of less than 5 minutes; and 3) the choice probability of PT decreases more significantly from 0 to 1 additional pickup than from 1 to 2 additional pickups. Therefore, unlike logit models, which usually assume a linear relationship between the input variables and the utility functions, the partial dependence plots of machine-learning models can readily reveal the nonlinearities of mode choice responses to level-of-service attributes. In contrast to the time-consuming hand-curating procedure required in logit models (often involving the introduction of interaction terms) to reveal nonlinear relationships, machine-learning algorithms search for nonlinearities automatically and thus can generate richer behavioral insights much more efficiently. We therefore believe that machine-learning models can serve as an exploratory analysis tool for identifying better specifications for logit models, in order to enhance both their predictive power and their explanatory capabilities.
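A one-dimensional partial dependence curve of the kind plotted in Figure 4 can be sketched as follows; the prediction function and rows below are hypothetical stand-ins for the fitted RF model and the sample data.

```python
def partial_dependence(predict_prob, rows, feature, grid):
    """For each grid value, fix the feature at that value in every row and
    average the model's predicted probability over the sample."""
    curve = []
    for v in grid:
        probs = [predict_prob({**row, feature: v}) for row in rows]
        curve.append(sum(probs) / len(probs))
    return curve

# Hypothetical stand-in for a fitted model's P(PT) prediction.
def toy_predict_pt(row):
    return max(0.0, min(1.0, 0.9 - 0.02 * row["TT_PT"] - 0.05 * row["Wait_Time"]))

rows = [{"TT_PT": 15, "Wait_Time": 3}, {"TT_PT": 20, "Wait_Time": 8}]
curve = partial_dependence(toy_predict_pt, rows, "TT_PT", [10, 20, 30])
```

Averaging over the sample (rather than a single row) is what lets the curve reflect the model's typical response while the other features keep their observed joint distribution.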
The existence of nonlinearities, on the other hand, may prevent researchers from conveniently obtaining willingness-to-pay measures and value-of-time parameters in the way one readily can with a logit model. There is also a caveat to the behavioral findings from the RF model: the partial dependence plots show that MNL agrees well with RF regarding TT_PT, but the impacts of Wait_Time, Rideshare, and Transfer on the choice probability of PT are smaller in the RF model than in the MNL model. This is discussed at length in the next section.
6.2.2 Arc Elasticity and Marginal Effects
Logit models are often applied to generate behavioral outputs such as marginal effects and elasticities to gain insights into individual travel behavior. Marginal effects (elasticities) measure the change in the choice probability of an alternative in response to a one-unit (one-percent) change in an independent variable. This study calculates marginal effects and elasticities for the level-of-service variables associated with the proposed mobility-on-demand transit system, including TT_PT, Wait_Time, Rideshare, and Transfer. To facilitate the comparison of these behavioral outputs obtained from the MNL model with those generated by the RF model, arc elasticities were computed using Eqn. (8), since the RF model is not able to generate point elasticity estimates. Note that the data may present very nonlinear behavior and may not be sensitive to small changes, such as a 1% and/or one-unit change in a feature. Therefore, Table 9 presents the arc elasticities computed by applying 1%, 10%, 50%, and 100% increases to the selected features. Similarly, the marginal effects are presented by applying 1, 2, and 5 units of increase for TT_PT and Wait_Time, as shown in Table 10.
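Since Eqn. (8) is not reproduced in this excerpt, the following sketch uses a generic arc-elasticity form (percentage change in the predicted share divided by the percentage change applied to the attribute), with a hypothetical stand-in for the fitted model.

```python
def arc_elasticity(predict_share, rows, feature, pct_increase):
    """Arc elasticity of the predicted market share with respect to a
    percentage increase applied to one feature in every observation."""
    base = predict_share(rows)
    bumped = [{**r, feature: r[feature] * (1 + pct_increase)} for r in rows]
    new = predict_share(bumped)
    return ((new - base) / base) / pct_increase

# Hypothetical share model: P(PT) falls linearly with TT_PT, averaged over rows.
def toy_pt_share(rows):
    return sum(max(0.0, 0.8 - 0.02 * r["TT_PT"]) for r in rows) / len(rows)

rows = [{"TT_PT": 15}, {"TT_PT": 20}]
e = arc_elasticity(toy_pt_share, rows, "TT_PT", 0.10)  # negative: longer PT times cut PT share
```

The same function evaluated at 1%, 10%, 50%, and 100% increases reproduces the structure of Table 9: for a machine-learning model with step-like responses, the elasticity can legitimately differ across increment sizes.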
The arc-elasticity and marginal-effect estimates are consistent with the results shown in the partial dependence plots. The two models generate somewhat similar results for TT_PT, but their outputs for Wait_Time, Transfer, and Rideshare are drastically different. In general, the behavioral outputs of the MNL model are quite reasonable. On the other hand, the RF model suggests that travelers are very insensitive to changes in Wait_Time, Transfer, and Rideshare, which is inconsistent with findings in the existing literature (see abrantes2011meta for a meta-analysis of these behavioral outputs). For example, RF suggests that the impact of an additional transfer on the choice probability of public transit is smaller than that of an additional minute of travel time, which is very unlikely.
These results are puzzling for modelers: why are the results on prediction and behavioral outputs inconsistent? Intuitively, a model is expected to capture traveler preferences (i.e., responses to different trip attributes) reasonably well in order to make accurate predictions of the choice outcome. However, our results show that RF has significantly higher predictive quality even though it does not produce reasonable behavioral outputs for many trip attributes. Here we offer two potential reasons for this empirical contradiction. First, as discussed above, logit models define different utility functions for different modes and assume that alternative-specific attributes affect only the utility of their corresponding alternatives, i.e., Transfer, Wait_Time, and Rideshare affect only the utility of PT. In contrast, RF does not include such constraints and needs to "learn" them from the data by itself. When the data are not perfect, however, RF may stumble at representing travelers' behavior realistically. For example, the results show that the RF model predicts that increasing the number of transfers associated with transit lowers the choice probability of both PT and Walk, which implies an (unrealistic) negative cross-marginal-effect estimate of choosing Walk with respect to Transfer.
Table 10. Direct arc elasticities and marginal effects of choosing PT (MNL vs. RF).

Arc elasticities:

Variable    Model  1% increase  10% increase  50% increase  100% increase
TT_PT       MNL    0.86         0.85          0.77          0.66
            RF     0.00         0.99          0.51          0.30
Wait_Time   MNL    0.34         0.34          0.33          0.30
            RF     0.00         0.00          0.08          0.05

Marginal effects:

Variable    Model  Unit    1-unit increase  2-unit increase  5-unit increase
Transfer    MNL    Number  10.93%           /                /
            RF     Number  0.66%            /                /
Rideshare   MNL    Number  8.23%            /                /
            RF     Number  2.27%            /                /
TT_PT       MNL    Minute  1.92%            1.92%            1.92%
            RF     Minute  1.11%            1.95%            1.56%
Wait_Time   MNL    Minute  2.96%            2.96%            2.96%
            RF     Minute  0.00%            0.66%            0.41%
Second, there are specific limitations associated with using the RF model to estimate marginal effects and elasticities. Specifically, the prediction decisions of RF are based on splits of input feature values at internal nodes. These nodes can only take on discrete value thresholds, and so the prediction decision becomes insensitive to values between and/or outside the thresholds. In the case study data, Wait_Time, Transfer, and Rideshare each take only three different values, since the data resulted from an SP survey in which only three attribute levels were used to construct the stated-choice experiments (see YAN2018 for more detail). For example, Wait_Time takes three values, i.e., 3 min, 5 min, and 8 min, and the decision trees inside RF split at 4 min, 5.5 min, and 6.5 min. As a result, RF cannot predict different outcomes for two observations with Wait_Time values of 4.1 min and 5.4 min, or of 6.6 min and 10 min. In contrast, because the MNL model assumes a linear relationship between the utility functions and the independent variables, it is capable of distinguishing "between-thresholds" and "out-of-bound" observations. If Wait_Time were observed over a larger variety of values, as TT_PT is, the RF model would likely produce more reasonable behavioral outputs.
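The flat response between split points can be reproduced with a single illustrative tree. The split thresholds below (4, 5.5, and 6.5 min) are the ones reported in the text for Wait_Time; the leaf probabilities are made up for illustration:

```python
# Minimal sketch of why a tree-based model is flat between split points.
# Thresholds follow the text (4, 5.5, 6.5 min); leaf values are invented.

def tree_predict(wait_time):
    """A single decision tree over Wait_Time with fixed split thresholds."""
    if wait_time <= 4.0:
        return 0.55            # leaf covering the 3-min design level
    elif wait_time <= 5.5:
        return 0.40            # leaf covering the 5-min design level
    elif wait_time <= 6.5:
        return 0.30
    else:
        return 0.25            # one leaf for everything beyond the last split

# Observations between the same pair of thresholds get identical predictions:
assert tree_predict(4.1) == tree_predict(5.4)
# ...and so do all "out-of-bound" values beyond the largest threshold:
assert tree_predict(6.6) == tree_predict(10.0)
```

A linear-in-parameters utility, by contrast, maps every distinct Wait_Time value to a distinct prediction.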
6.2.3 Revisiting Direct Arc Elasticities and Marginal Effects for Machine-Learning Models
It is possible to address these limitations through an alternative approach to calculating direct marginal effects and elasticities. Since the nature of the RF model structure makes it unsuitable for estimating these behavioral outputs in the standard fashion, the computations of elasticities and marginal effects need to include some behavioral constraints. In other words, while logit models improve behavioral realism by constraining the model structure, machine-learning models may achieve similar results by constraining how the model results are applied to generate behavioral outputs. When degrading the value of an attribute associated with a given alternative, a behaviorally realistic response from a machine-learning model should change only the classification of those individuals currently using that alternative. For example, when a transfer is added to the PT alternative, it is expected that some individuals who currently choose PT may switch to a different mode, whereas the classification of individuals who currently choose non-PT modes should not change. We thus propose to incorporate this behavioral constraint into the calculation of direct arc elasticities and marginal effects.
We illustrate the proposed approach with the calculation of the direct marginal effect of choosing PT with respect to Transfer. Since point estimates of marginal effects differ across individuals with different attribute values due to nonlinear responses, marginal effects should first be calculated at each data point and then aggregated to the entire market based on the respective market shares. Denote the set of individuals choosing PT by $N_{PT}$, the direct marginal effect of choosing PT with respect to Transfer by $M$, the marginal effect of choosing PT with respect to Transfer for those individuals with $i$ transfers by $M_i$, and the market share of individuals currently choosing PT with $i$ transfers by $s_i$. To allow RF to interpolate when attribute values become out-of-bound after a change (e.g., Transfer becomes three, a value unobserved in the data, when one additional transfer is added to an individual who currently has two transfers), we assume that the marginal effects for out-of-bound observations equal those at the boundary (e.g., $M_2 = M_1$). Accordingly, the direct marginal effect of choosing PT with respect to Transfer can be expressed as:

$$M = \sum_{i} s_i M_i \qquad (14)$$
A similar approach can be applied to compute elasticities for continuous variables.
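The procedure above can be sketched in a few lines. This is our own illustration, not the paper's implementation: `predict` stands in for any fitted classifier's prediction function, and the toy data and the stand-in model at the bottom are invented for demonstration.

```python
import numpy as np

def constrained_me(X, modes, predict, col, levels):
    """Direct marginal effect of choosing PT w.r.t. a discrete attribute.

    Only individuals currently choosing PT are perturbed (the behavioral
    constraint), and the highest observed level reuses the effect of the
    level below it (the boundary assumption, e.g. M_2 = M_1)."""
    per_level, shares = {}, {}
    for lv in levels:
        riders = (modes == "PT") & (X[:, col] == lv)
        shares[lv] = riders.mean()                 # market share s_i
        if lv == levels[-1]:
            per_level[lv] = per_level[levels[-2]]  # boundary assumption
        elif shares[lv] == 0:
            per_level[lv] = 0.0
        else:
            X_new = X[riders].copy()
            X_new[:, col] = lv + 1                 # add one transfer
            per_level[lv] = (predict(X_new) == "PT").mean() - 1.0
    return sum(shares[lv] * per_level[lv] for lv in levels)   # Eq. (14)

# Toy example: 10 travelers, one feature (Transfer); the stand-in model
# predicts PT whenever Transfer <= 1.
X = np.array([[0], [0], [0], [1], [1], [2], [0], [1], [2], [2]], dtype=float)
modes = np.array(["PT"] * 6 + ["Walk"] * 4)
predict = lambda X_: np.where(X_[:, 0] <= 1, "PT", "Walk")
me = constrained_me(X, modes, predict, col=0, levels=[0, 1, 2])
```

Because non-PT individuals are never re-predicted, the unrealistic cross effects discussed earlier cannot arise by construction.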
Table 11. Constrained marginal effects of choosing PT with respect to Transfer and Rideshare (RF).

Variable   Output            Value = 0  Value = 1  Value = 2  Aggregate
Transfer   Marginal effects  4.33%      19.65%     19.65%     2.97%
           Market share      25.83%     6.38%      3.07%      /
Rideshare  Marginal effects  11.43%     11.77%     11.77%     4.11%
           Market share      13.22%     10.12%     11.94%     /
Table 12. Constrained marginal effects and arc elasticities of choosing PT with respect to TT_PT and Wait_Time (RF).

Variable   Output            < Max    Max      Aggregate
TT_PT      Marginal effects  3.25%    3.25%    1.15%
           Elasticity        1.08     1.08     1.08
           Market share      35.18%   0.10%    /
Wait_Time  Marginal effects  3.10%    3.10%    1.09%
           Elasticity        0.18     0.18     0.18
           Market share      26.94%   8.34%    /
Table 11 presents the results of this approach for computing marginal effects of choosing PT with respect to Transfer and Rideshare, and Table 12 presents marginal effects and arc elasticities of choosing PT with respect to TT_PT and Wait_Time. Compared to the results in Table 10, the results regarding Transfer, Rideshare, and Wait_Time improve significantly. In terms of marginal effects, these new results suggest that the impact of a transfer is approximately equal to three minutes of transit time and the impact of an additional pickup is approximately equal to four minutes of transit time. In addition, the RF model estimates that individuals value travel time slightly more than wait time. By contrast, the MNL results indicate that the effect of a transfer is roughly equal to five minutes of travel time, that an additional pickup is roughly equivalent to four minutes, and that individuals consider wait time 1.5 times as important as travel time by transit. Regarding the elasticity estimates, RF estimates a higher value for TT_PT but a lower value for Wait_Time compared to the MNL model. Overall, the behavioral results of the RF model and the MNL model are different but somewhat comparable.
In the absence of ground truth, one cannot determine which model represents individual travel behavior more accurately. The arguments can go either way. Some may argue that results obtained from the RF model should be more accurate given its superior predictive quality. Others, on the other hand, may suggest that the behavioral outputs of any machine-learning model are unreliable, because machine learning is not inherently built for estimation tasks and lacks model selection consistency (mullainathan2017machine). We believe that this is a largely unresolved, indeed untouched, issue in the literature, and further theoretical and empirical studies are needed to shed light on the appropriateness of applying machine learning to parameter estimation and inference tasks. Our goal here is to start the conversation on these issues and to explore sound approaches for computing behavioral outputs from machine-learning algorithms in order to understand consumer choice behavior.
7 Discussion and Conclusion
The increasing popularity of machine learning in transportation research raises questions regarding its advantages and disadvantages compared to the conventional logit-family models used for travel behavior analysis. The development of logit models typically focuses on parameter estimation and pays little attention to prediction (i.e., it lacks a procedure to validate out-of-sample prediction accuracy). On the other hand, machine-learning models are built for prediction but are often considered difficult to interpret and are rarely used to extract behavioral findings from the model outputs.
This paper aims to improve the understanding of the relative strengths and weaknesses of logit models and machine learning for modeling travel mode choices. It compared logit and machine-learning models side by side, using cross-validation to assess their predictive capabilities and interpretability. The results showed that the best-performing machine-learning model, the random forest model, significantly outperforms the logit models at both the individual and aggregate levels. Somewhat surprisingly, the mixed logit model underperformed the multinomial logit model in terms of out-of-sample predictive quality, which may result from overfitting. Moreover, to interpret the RF model, we applied three techniques, variable importance, partial dependence plots, and sensitivity analysis, to extract behavioral insights from the model outputs. Some of the results (e.g., on travel time by transit) were illuminating, revealing additional behavioral information compared to the MNL model due to the RF model's ability to better capture the nonlinear effects of an independent variable on the choice output. This indicates that machine learning can, at minimum, serve as an exploratory analysis tool to reveal nonlinearities; researchers can then apply such information to specify logit models that better represent behavioral preferences and have better predictive capabilities, which should be much more efficient than the hand-curated specification procedure typically used with statistical models.
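The exploratory diagnostic mentioned above, a one-dimensional partial dependence curve, can be computed in a few lines. The sketch below is our own illustration: `predict_proba_pt` stands in for any fitted model's P(PT) prediction, and the logistic toy model and data are invented.

```python
import numpy as np

def partial_dependence(X, col, grid, predict_proba_pt):
    """Average predicted P(PT) over the sample, with feature `col`
    forced to each grid value in turn."""
    out = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, col] = v          # hold the feature fixed for everyone
        out.append(predict_proba_pt(X_mod).mean())
    return np.array(out)

# Toy data and stand-in model: P(PT) falls smoothly with travel time.
rng = np.random.default_rng(0)
X = rng.uniform(10, 60, size=(200, 1))        # travel time in minutes
proba = lambda X_: 1.0 / (1.0 + np.exp(0.1 * (X_[:, 0] - 30)))
pd_curve = partial_dependence(X, col=0, grid=[15, 30, 45],
                              predict_proba_pt=proba)
```

Plotting `pd_curve` against the grid reveals the shape of the response, including any nonlinearity, which can then inform the specification of a logit model.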
However, a direct application of the standard approaches to computing behavioral outputs such as marginal effects and elasticities from the RF model leads to unrealistic behavioral findings. This is because the machine-learning models studied here lack the behavioral assumptions embedded in logit models (i.e., constraining alternative-specific attributes to affect only the utility of their corresponding alternatives). Moreover, the RF algorithm, a tree-based model, is not capable of distinguishing "between-thresholds" and "out-of-bound" observations. To address these limitations, the paper proposed an alternative approach for estimating arc elasticities and marginal effects. This approach imposes behavioral constraints on the process of generating behavioral outputs from machine-learning model results, which leads to findings that are more realistic and somewhat comparable with the MNL's outputs.
Overall, these results are encouraging and identify many new research directions in applying machine learning to model travel behavior and forecast travel demand. Prediction and interpretation are two major topics in modeling individual choice behavior. Traditionally, each approach has focused on one aspect and ignored the other. We have demonstrated that both approaches can be applied to make predictions and infer behavior. Nonetheless, there are several major topics in travel-behavior research that we have not examined in depth. The first concerns preference heterogeneity. The development of the mixed logit model has mostly been driven by its capability to capture both observed and unobserved preference heterogeneity among individuals. Machine-learning models can accommodate observed preference heterogeneity to a limited extent through a market segmentation approach, but they cannot account for unobserved heterogeneity because they do not recognize a panel data structure. The second concerns mechanisms to correct the reporting bias associated with stated-preference data. Stated-preference data are generally considered to contain reporting bias due to their hypothetical nature. Joint revealed-preference and stated-preference models have been proposed to correct for this bias (train2009discrete), but to our knowledge no machine-learning algorithms allow such a joint estimation process. Finally, there are differences in the output formats of logit models and machine learning. As discussed in Section 6.1.1, the logit model outputs a choice probability for each alternative, whereas machine learning outputs a class (i.e., the predicted mode). To facilitate the comparison with machine learning, the common practice is to alter the outcome of logit models (i.e., assigning the alternative with the highest choice probability as the predicted class). It is not clear whether this practice has major implications for the predictive accuracy results.
There is great potential in merging important ideas from machine learning and logit models to develop more refined models for travel-behavior research. Besides addressing the limitations mentioned above, other possible research directions include: 1) examining which machine-learning models are more suitable than others for behavioral analysis; and 2) incorporating the behavioral assumptions regarding alternative-specific attributes, which are enabled by the "layered" data structure of logit models, into machine-learning algorithms.
Acknowledgements
This research was partly funded by the Michigan Institute of Data Science (MIDAS) and by Grant 7F30154 from the Department of Energy.