Imputing Missing Boarding Stations With Machine Learning Methods

03/10/2020 ∙ by Nadav Shalit, et al. ∙ 0

With the increase in population densities and environmental awareness, public transport has become an important aspect of urban life. Consequently, large quantities of transportation data are generated, and mining data from smart card use has become a standardized method to understand the travel habits of passengers. Public transport datasets, however, often lack data integrity; boarding stop information may be missing due to either imperfect acquisition processes or inadequate reporting. As a result, large quantities of observations and even complete sections of cities might be absent from the smart card database. We have developed a supervised machine learning method to impute missing boarding stops based on ordinal classification. In addition, we present a new metric, Pareto Accuracy, to evaluate algorithms where classes have an ordinal nature. Results are based on a case study in the Israeli city of Beer Sheva for one month of data. We show that our proposed method significantly outperforms current imputation methods and can improve the accuracy and usefulness of large-scale transportation data.


1 Introduction

Public transport is an integral part of everyday life. Over the past century, there has been a gradual shift of the global population to urban areas, increasing greatly the dependency on public transport [petrovic2016appraisal]. With the growing number of cars on urban roads, public transport improvement is an important step to mitigate traffic congestion. Understanding the patterns of public transport use is crucial to optimizing public transport; however, this remains a very challenging task. Over the years, many studies have been conducted to examine the behavior of public transport travelers [li2018smart]. Travel habits are of great interest to transportation planners, and their analysis can help improve demand predictions and understand necessary changes in public transport supply [briand2017analyzing]. To this end, urban planners and social scientists typically use travel surveys [stopher2007household]. While such surveys reflect human behavior, they are expensive, time consuming, and inadequate in generating sufficient amounts of data relative to the size of the population [stopher2007household].

Data collection using smart cards can generate millions of records in contrast to surveys that gather data from a typical range of 2,500 to 10,000 households [stopher2007household, maeda2019detecting]. Wonjae Jang stated that “automatic fare collection systems using smart card technology have become popular because they provide an efficient and cost-saving alternative to the manual fare collection method” [jang2010travel]. Nowadays most of the monetary transactions in public transport are based solely on smart cards [chen2018extracting]. In addition, smart cards generate geocoded timestamps recording boardings, line transfers, and sometimes alightings from transit vehicles (bus, tram, train, or metro). Recent developments in the field of data science, such as the increase in data volume, new data mining tools [li2018smart], and cloud computing [li2015towards], have created new opportunities to analyze and determine travel behavior at the individual level over long periods and large areas [ma2017understanding].

Large-scale transportation datasets have a vast potential to influence transportation research, and with the help of big data and data mining algorithms, this task has become more feasible [ma2013mining]. However, similar to datasets in other domains, large-scale transportation datasets may be affected by issues of data integrity, such as incorrect or missing values. For example, operators may have only partial data on travelers’ boarding stops. A common solution for such problems is to replace the missing or erroneous data by utilizing publicly accessible data such as the General Transit Feed Specification (GTFS), which is defined as a common format for public transportation schedules and associated geographic information [ma2012transit]. Another solution is to discard the aforementioned data, i.e., simply remove missing records or those that do not align with a prescribed hypothesis [tao2014examining]. However, these two solutions are far from optimal. Moreover, data discarding might be a reasonable solution when the share of missing data is insignificant. When the missing portion is large, however, the whole dataset is usually discarded, causing some geographic areas to be completely blind with respect to smart card data.

In this study, we establish a new machine learning based methodology for improving the quality and integrity of transportation datasets by predicting missing or corrupted travelers’ records. Namely, we propose a novel algorithm for predicting passengers’ boarding stops. Our proposed algorithm is based on features extracted by fusing multiple big data sources, such as planned GTFS schedule data, smart card data, and other geospatial (GIS) data. Then, we apply a machine learning model on these features to predict boarding stops (see Section 3).

To test and evaluate our algorithm’s performance, we utilized a real-world smart card dataset, which consists of over a million trips taken by more than 85,000 people. Using this smart card dataset, we show that our constructed boarding stop prediction model performs significantly better compared to a naïve prediction model based on GTFS data (i.e., schedule-based). Specifically, our supervised model’s accuracy was almost two times better than the schedule-based model constructed using GTFS data (see Section 4). Importantly, we show that our constructed prediction model is generic and can be used to predict boarding stops across different cities (see Section 3.4). Furthermore, as presented in Figure 1, our proposed model predicted considerably more boarding stops with over 50% accuracy than the schedule-based method.

(a) Schedule-based predictions
(b) Proposed model predictions
Figure 1: Boarding stations with predicted accuracy of over 50%

The problem we address is ordinal classification because after embedding the boarding stops (see Section 3.4), they are ordered. Ordinal classification is a classification task where the classes have an inherent order between them. Existing methods for evaluating this problem are not suitable in our case, which also requires a high level of interpretability of the results. For this purpose, we propose a new method of evaluation which shows the percentage of each error dimension. Missing data imputation is the substitution of a null or incorrect value with a replacement value. In this particular imputation, researchers need to understand the dimension of error between the imputed boarding stops and the actual ones. For example, to compare two algorithms on a single imputation where both did not predict the correct stop, we must compare them based on how ‘large’ their mistake is. Therefore, a new method of evaluation is required; we call it Pareto Accuracy.

1.1 Contributions

Our study’s overall focus is to improve the integrity of public transportation data. Specifically, our study offers the following two main contributions:

  1. We present a novel algorithm for completing missing boarding stops using supervised learning.

  2. We develop a new method of evaluating public transport metrics that is more interpretable and allows better comparison between imputation models.

In addition to the above contributions, we show which features are most important in predicting the imputed boarding stops using state-of-the-art SHAP values [lundberg2017unified] for feature importance.

1.2 Organization

The rest of the paper is organized as follows: In Section 2, we give a brief overview of related works. Section 3 describes the experimental framework and methods used to develop the model and the extraction of the features. In Section 4, we present the results of this study. In Section 5, we discuss the implications of the findings and, lastly, in Section 6 we present our conclusions and future research directions.

2 Related Work

We provide an overview of relevant studies by first presenting smart card research in general, followed by studies that have utilized smart card data with machine learning to perform predictive analytics. Then, we provide an overview of the field of missing data imputation. Lastly, we present studies in the field of ordinal classification.

2.1 Smart Card

The smart card system was introduced as a smart and efficient automatic fare collection (AFC) system in the early 2000s [chien2002efficient] and has become an increasingly popular payment method [trepanier2007individual]. In recent years, smart cards have not only become an effective payment tool, but also an increasingly popular big data source for research [jang2010travel, agard2007mining]. Agard et al. [agard2007mining] claimed that data mining techniques on public transport must be done with domain knowledge. They showed how combining data mining with transportation planning produces travel behavior indicators such as daily patterns. Initially, research focused on the use of rather classic statistical methods and descriptive analytics. Le Kieu et al. [bhaskar2014passenger] used density-based spatial clustering of applications with noise (DBSCAN) to cluster passengers for mining travel patterns. Devillaine et al. [devillaine2012detection] inferred location, time, duration, and designation of public transport users’ activities using rules derived from observed sociological behaviors and from smart card data. Kusakabe and Asakura [kusakabe2014behavioural] used a Naïve Bayesian model for data imputation and analysis of public transport to observe continuous long-term changes in the attributes of trips. A comprehensive review of smart card usage literature is provided by Pelletier et al. [pelletier2011smart].

2.2 Analyzing Smart Card with Machine Learning

Researchers have now come to realize that the traditional analysis methods are subpar when used in combination with big data. In recent years, there has been a shift towards harvesting the prognostic nature of machine learning to yield predictive analytics. Welch and Widita [welch2019big] presented a number of machine learning algorithms in the field of public transportation. Moreover, they highlighted the growing emphasis in using this data for analytical purposes. This underscores the shift from the simple analysis done in the past to the more complex analysis done today.

The current belief is that the proper use of smart card data will yield insights that were not previously available from traditional methods [welch2019big]. A literature survey on this topic can be found in the paper by Li et al. [li2018smart]

. Recently, deep learning algorithms have also been used to address public transportation issues using smart cards. For example, implementations include inferencing passenger employment status

[zhang2019deep], forecasting passenger destinations [jung2017deep, toque2016forecasting], inferencing demographics [zhang2019deep2], improving passenger segmentation [chen2018traveler], and predicting multimodal transport passenger flows [toque2017short].

2.3 Missing Data Imputation

One of the well-known problems in data mining is that data is often incomplete, and a significant amount of data could be missing or incorrect, as evident in the study by Lakshminarayan et al. [lakshminarayan1996imputation]. Missing data imputation should be carefully handled or bias might be introduced [batista2003analysis], and as shown in their paper, common methods are not always optimal. In recent years, techniques to optimize missing data imputation have been explored further [bertsimas2017predictive], even using state-of-the-art deep learning methods to impute [camino2019improving], showing the importance of this area. Imputation, however, has not been thoroughly studied in the field of public transport, and therefore it merits further examination.

2.4 Ordinal classification

Classification is a form of supervised machine learning which tries to generalize a hypothesis from a given set of records. It learns to create a mapping $h(x) = y$, where $y$ has a finite number of classes [kotsiantis2007supervised]. The basic metrics for classification are sensitivity, specificity, and accuracy [jiao2016performance]. Ordinal classification is a form of multi-class classification where there is a natural ordering between the classes (such as cold, warm, and hot) but not necessarily numerical traits for each class. For this case, a classifier will not necessarily be chosen based on traditional metrics, such as accuracy, but rather on the severity of its errors [gaudette2009evaluation]. In addition, classic modeling techniques will sometimes perform suboptimally, since standard classification models assume there is no order between classes. On such tasks, different models are sometimes preferred [frank2001simple]. Additional metrics have been proposed to evaluate such tasks differently [cardoso2011measuring].

3 Methods and Experiments

Figure 2: Methodology Overview

Our study’s main goal is to improve public transportation data integrity using machine learning algorithms. Namely, we developed a supervised learning-based model to impute missing boarding stops given a smart card dataset. Moreover, we created a generic model that is fully transferable to other datasets and can be used to impute missing data in similar transportation datasets without any adjustments.

In the process, two major constraints were identified:

  1. In constructing our model, we could only utilize properties that can be applied to any city. For example, a bus line specific to a certain city must be embedded, i.e., receive a numerical representation.

  2. Classification classes must be the same across datasets. For example, Bus Stop 14 in a specific city has no meaning in other cities; therefore, we required a different numerical representation (see Section 3.4).

To complete the missing boarding stop values, we used the following methodology (see Figure 2). First, given a smart card dataset, we pre-processed and cleaned the dataset (see Section 3.2). Then, we extracted a variety of features from the geospatial and GTFS datasets (see Section 3.3). In addition, we converted boarding stops from their original identifiers to a numerical representation by utilizing GTFS data (see Section 3.4). Next, we applied machine learning algorithms to generate a prediction model that can predict boarding stops (see Section 3.5). Lastly, we evaluated our model compared to the schedule-based method with common metrics. Moreover, we evaluated our supervised algorithm performance using a novel Pareto Accuracy metric (see Section 3.6). In the following subsections, we will describe in detail each step in our methodology.

3.1 Datasets

To apply our methodology for predicting missing boarding stops, we needed to fuse three types of datasets:

  1. Smart card dataset - contains data of smart card unique IDs; traveler types, such as student or senior travelers; boarding stops; timestamps of boarding; and unique trip identifiers for the line at that time.

  2. GTFS dataset - aligns with the smart card dataset and consists of a detailed timetable of every trip made by public transport. The GTFS dataset was also used both to enrich the feature space and to transform boarding stops into a numerical value (see Section 3.4).

  3. Geospatial dataset - contains a variety of geospatial attributes, such as number of traffic lights, population density, and more, derived from a GIS database.

By combining these three datasets, we could get a detailed overview of the bus lines and travelers. While using only the smart card and GTFS datasets would be sufficient for this methodology, fusing the geospatial dataset as well resulted in higher performance.

3.2 Data Preprocessing

To make the dataset suitable for constructing prediction models, we needed to remove any record that did not have a boarding stop or a trip ID (a unique identifier of a trip provided by a specific and unique public transport operator) from the smart card dataset. Next, we joined the smart card dataset with the GTFS dataset by matching the trip ID attribute. Furthermore, we joined the geospatial dataset with the smart card dataset using the GTFS dataset, which contains all the geographic coordinates of each route.
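The cleaning and joining steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the field names (`card_id`, `trip_id`, `boarding_stop`, `route_id`) and the toy records are hypothetical and do not reflect the schema of the actual datasets.

```python
def clean_and_join(smart_card_rows, gtfs_trips):
    """Drop unusable smart card records, then attach GTFS trip attributes.

    smart_card_rows: list of dicts (hypothetical schema).
    gtfs_trips: dict mapping trip_id -> dict of GTFS attributes.
    """
    joined = []
    for row in smart_card_rows:
        # Remove records that lack a boarding stop or a trip ID.
        if row.get("boarding_stop") is None or row.get("trip_id") is None:
            continue
        trip = gtfs_trips.get(row["trip_id"])
        if trip is None:
            # No matching GTFS trip, so the record cannot be enriched.
            continue
        joined.append({**row, **trip})
    return joined


smart_card_rows = [
    {"card_id": "a1", "trip_id": "t1", "boarding_stop": 3},
    {"card_id": "b2", "trip_id": None, "boarding_stop": 5},     # missing trip ID
    {"card_id": "c3", "trip_id": "t2", "boarding_stop": None},  # missing stop
    {"card_id": "d4", "trip_id": "t9", "boarding_stop": 1},     # unknown trip
]
gtfs_trips = {"t1": {"route_id": "r7"}, "t2": {"route_id": "r8"}}

cleaned = clean_and_join(smart_card_rows, gtfs_trips)
print(cleaned)
```

Only the first toy record survives, since it is the only one with both a boarding stop and a trip ID that matches a GTFS trip.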

3.3 Features Extraction

While the smart card dataset contains valuable data regarding a passenger’s trip, this data provided only a partial view of the real world. Moreover, machine learning performance is highly correlated to the quality of the feature space; hence, more features can result in increased model performance [gudivada2017data]. While the smart card data showed that a specific passenger boarded a transit line at a given time, it did not provide how long it had been since the line left the origin depot, how much time remained to the final destination, the total number of stops, and other trip attributes. In addition, there were physical characteristics that could improve the prediction model. These characteristics included such attributes as the number of traffic lights along the line, which increases the probability of traffic congestion forming and consequently additional delays.

In initial experiments, we tested 41 features, and using stepwise feature selection, we reduced the feature space to 15 (see Table 1). For example, we tested whether the total length of a route could help improve our classifier’s performance.

In the feature extraction process, we extracted three features by utilizing the smart card dataset, eight features by utilizing the GTFS dataset, and three features by utilizing the geospatial dataset. Overall, utilizing these three datasets, we extracted 41 features. Table 1 gives the descriptions of the 15 selected features that were used for the most accurate models.

Dataset | Feature | Explanation
--- | --- | ---
City geospatial records | Addresses_average | Number of addresses listed along the route
City geospatial records | Street_light_average | Number of street lights along the route
City geospatial records | Light_traffics_average | Number of traffic lights along the route
GTFS | Number_of_points | Number of points in the GTFS shape file per route
GTFS | Average_distance_per_stop | Total length of the route divided by number of points
GTFS | Average_time_per_stop | Total expected travel time of the route divided by number of points
GTFS | Average_points_to_stops | Number of points in the GTFS shape file per route divided by number of points
GTFS | Time_diff_of_trip | Total travel time
GTFS & smart card | Time_from_boarding_to_last_stop | Time from boarding time to the expected last stop of the route
GTFS & smart card | Time_from_departue_to_boarding | Time from route departure time to boarding time
GTFS & smart card | Predicted_sequence | Sequence number of the most likely stop according to the GTFS schedule
GTFS & smart card | Hourly_expected_lateness | Average lateness per hour (based on training data)
Smart card | Boardingtime_Seconds_from_midnight | Boarding timestamp converted to a numerical value (seconds from midnight)
Smart card | Boardingtime_weekday | Day of the week on which the boarding occurred
Smart card | Is_weekend | Whether the boarding occurred on a weekend
Table 1: Extracted Features
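As an illustration of how the three smart card features in Table 1 might be derived from a boarding timestamp, consider the following minimal sketch. The function name and the weekend definition (Friday and Saturday, matching the usual Israeli weekend) are assumptions made for this example and are not taken from the study.

```python
from datetime import datetime


def smart_card_time_features(boarding_time, weekend_days=(4, 5)):
    """Derive the temporal smart card features from a boarding timestamp.

    weekend_days uses Python's convention (Monday=0 ... Sunday=6); the default
    of Friday and Saturday is an assumption for this sketch.
    """
    seconds = (boarding_time.hour * 3600
               + boarding_time.minute * 60
               + boarding_time.second)
    weekday = boarding_time.weekday()
    return {
        "Boardingtime_Seconds_from_midnight": seconds,
        "Boardingtime_weekday": weekday,
        "Is_weekend": int(weekday in weekend_days),
    }


# A Friday morning boarding at 07:30:15.
features = smart_card_time_features(datetime(2018, 11, 23, 7, 30, 15))
print(features)
# {'Boardingtime_Seconds_from_midnight': 27015, 'Boardingtime_weekday': 4, 'Is_weekend': 1}
```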

3.4 Embedding Boarding Stops

To construct the prediction model, we used the GTFS dataset to create a schedule-based prediction. This prediction is the transit vehicle’s position along a line according to the GTFS schedule. Namely, let $s_{gtfs}$ be the sequence number of the boarding stop based on the GTFS schedule and let $s_{actual}$ be the actual boarding stop sequence number. Then, we define $s_{diff}$ as $s_{diff} = s_{actual} - s_{gtfs}$. Our prediction model goal was to predict $s_{diff}$ by utilizing the variety of features presented in the previous section. For instance, consider a passenger who boarded a line at the third stop, i.e., $s_{actual} = 3$, at the time the transit vehicle was scheduled to arrive at the second stop. The schedule-based prediction would be 2, i.e., $s_{gtfs} = 2$, the stop where it was supposed to be at that time. Then, the difference is $s_{diff} = 3 - 2 = 1$, and this is the class the algorithm will predict.
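The embedding can be sketched as follows. Two points are assumptions of this example rather than details given in the text: the schedule-based position is taken to be the last stop whose scheduled time has already passed at boarding time, and the class is computed as the actual sequence number minus the schedule-based one. The function names and the timetable values are hypothetical.

```python
from bisect import bisect_right


def schedule_based_sequence(scheduled_times, boarding_time):
    """1-based sequence number of the last stop scheduled at or before boarding_time.

    scheduled_times: sorted scheduled arrival times (seconds from midnight).
    Boardings before the first scheduled time are assigned to the first stop.
    """
    return max(bisect_right(scheduled_times, boarding_time), 1)


def embed_boarding_stop(actual_sequence, scheduled_times, boarding_time):
    """Difference between the actual and the schedule-based stop sequence."""
    return actual_sequence - schedule_based_sequence(scheduled_times, boarding_time)


# Hypothetical timetable: stops scheduled at 08:00:00, 08:02:00, 08:05:00, 08:08:00.
scheduled_times = [28800, 28920, 29100, 29280]

# Passenger boards at the third stop while the vehicle is scheduled
# to be at the second stop (08:02:30).
label = embed_boarding_stop(3, scheduled_times, 28950)
print(label)  # 1
```

Because the label is a small signed offset instead of a city-specific stop identifier, the same set of classes can be reused for any line or city.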

3.5 Constructing Prediction Model

Figure 3: Modeling methodology

To construct a model which can predict boarding stops, we performed the following steps: First, we selected several well-known classification algorithms. Namely, we used Random Forest [singh2016review], Logistic Regression [singh2016review], and XGBoost [chen2016xgboost]. Second, we split our dataset into a training dataset, which consisted of the first three weeks of data, and a test dataset, which consisted of the last week of data. The algorithm classifies the difference between the schedule-based prediction and the actual sequence. Third, for both the training and test datasets, we extracted all 14 features (see Section 3.3). Fourth, we constructed the prediction models using each one of the selected algorithms. Lastly, we compared the generated models and selected the one that presented the best performance according to the Pareto Accuracy metric (see Section 4).
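The chronological split in the second step can be sketched as below. The record layout (`date`, `s_diff`) and the toy values are hypothetical; the point is only that the split is by time rather than random, so the test week lies strictly after the training weeks.

```python
from datetime import date, timedelta


def temporal_split(records, first_day, train_days=21):
    """Split records chronologically.

    Records from the first `train_days` days form the training set;
    all later records form the test set.
    """
    cutoff = first_day + timedelta(days=train_days)
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


# Toy data: one record per day for four weeks (hypothetical values).
first_day = date(2018, 11, 1)
records = [
    {"date": first_day + timedelta(days=d), "s_diff": d % 3 - 1}
    for d in range(28)
]

train, test = temporal_split(records, first_day)
print(len(train), len(test))  # 21 7
```

Each candidate classifier would then be fitted on `train` and compared on `test`.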

3.6 Model Evaluation

We evaluated each model and compared it to the schedule-based method on the test dataset using common metrics: accuracy, recall, precision, F1 (see appendix), and Pareto Accuracy.

We used the following variables for our novel Pareto Accuracy metric: Let $p_i$ be the predicted sequence of observation $i$, $a_i$ be the actual sequence, and $d_i = |p_i - a_i|$ be the absolute difference between them. Let $l$ be the limit of acceptable difference for imputation, i.e., if an error of one stop is tolerated, such as for neighborhood segmentation, then $l = 1$.
Let the indicator be
$$I_i = \begin{cases} 1, & d_i \le l \\ 0, & \text{otherwise} \end{cases}$$
We defined Pareto Accuracy as follows:
$$PA_l = \frac{1}{n} \sum_{i=1}^{n} I_i$$
where $n$ is the number of observations.

The PA metric is a generalization of the accuracy metric. Namely, $PA_0$ is the well-known accuracy metric. The primary advantage of using the PA metric is to evaluate the true dimension of error while being extremely robust to outliers (by setting parameter $l$), unlike other ordinal classification methods. Moreover, this metric is highly informative since its outcome value can be interpreted easily; for example, 0.6 means that 60% of predictions had at most $l$ difference from true labels.

For example, let us consider a set of eight observations of embedded boarding stops {-2,0,3,20,-3,4,3,2}, where each observation is a simulated boarding by a passenger and each number ($s_{diff}$) in the set represents the difference between expected ($s_{gtfs}$) and actual boarding stops ($s_{actual}$). The fourth observation, with a value of 20, is an outlier, which might occur due to some fault in the decoder device of the public transport operator. We do not want to predict it, as it is naturally unpredictable. We seek a metric that will both be resilient to outliers, as they are unpredictable, and still account for the true dimension of the errors (see Section 2.4). Let us compare two classifiers, A and B. Classifier A predicted the following boarding stops {-2,0,4,3,-2,3,2,2}, while Classifier B predicted {3,0,3,7,1,1,3,2}. Classifier A is a more useful classifier since, in general, its predicted values are closer to the actual values; i.e., apart from the outlier, its errors never exceed one stop, which makes it more reliable. However, when using the classical accuracy and RMSE metrics, Classifier B appears better than Classifier A on both: its accuracy is higher (50% vs. 37.5%) and its RMSE is lower (5.2 vs. 6). By using the Pareto Accuracy with a limit of one stop ($PA_1$), we obtain a more accurate picture in which Classifier A clearly outperforms Classifier B (87.5% vs. 50%). Here, we see a case where metrics used for both classical classification (accuracy) and ordinal classification (RMSE) do not reflect the actual performance of each classifier.
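The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of the metric as described in this section; the function names are chosen for illustration, and a limit of 0 recovers plain accuracy.

```python
from math import sqrt


def pareto_accuracy(actual, predicted, limit=0):
    """Share of predictions whose absolute error is at most `limit`."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= limit)
    return hits / len(actual)


def rmse(actual, predicted):
    """Root mean square error between actual and predicted labels."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(squared / len(actual))


actual = [-2, 0, 3, 20, -3, 4, 3, 2]
pred_a = [-2, 0, 4, 3, -2, 3, 2, 2]
pred_b = [3, 0, 3, 7, 1, 1, 3, 2]

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(
        name,
        pareto_accuracy(actual, pred, limit=0),  # classical accuracy
        pareto_accuracy(actual, pred, limit=1),  # tolerance of one stop
        round(rmse(actual, pred), 2),
    )
# A 0.375 0.875 6.05
# B 0.5 0.5 5.23
```

Accuracy and RMSE both favor Classifier B, while the one-stop tolerance shows that Classifier A is within one stop of the truth for seven of the eight observations.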

To evaluate the performance of our method, we also performed spatial analysis on our proposed model and compared it to the schedule-based method. Analysis was done by comparing boarding stops that were predicted well, i.e., with accuracy of 50% or higher. Lastly, since we wanted to enrich our understanding on the nature and patterns of public transport, we produced and analyzed feature importance (see Figure 4) by utilizing the state-of-the-art method SHAP values [lundberg2017unified].
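The per-stop filtering behind the spatial analysis can be sketched as follows. The record layout (stop identifier, actual label, predicted label) and the stop names are hypothetical; the sketch only illustrates selecting stops whose accuracy reaches the 50% threshold mentioned above.

```python
from collections import defaultdict


def well_predicted_stops(records, threshold=0.5):
    """Return {stop_id: accuracy} for stops with accuracy >= threshold.

    records: iterable of (stop_id, actual_label, predicted_label) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for stop_id, actual, predicted in records:
        totals[stop_id] += 1
        correct[stop_id] += int(actual == predicted)
    accuracy = {stop: correct[stop] / totals[stop] for stop in totals}
    return {stop: acc for stop, acc in accuracy.items() if acc >= threshold}


records = [
    ("stop_a", 0, 0), ("stop_a", 1, 1), ("stop_a", -1, 0),
    ("stop_b", 2, 0), ("stop_b", 0, 0), ("stop_b", 1, 0), ("stop_b", -1, 0),
    ("stop_c", 0, 0), ("stop_c", 1, 0),
]

good = well_predicted_stops(records)
print(sorted(good))  # ['stop_a', 'stop_c']
```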

3.7 Experiment Settings

To evaluate the above methodology, we chose to focus on the city of Beer Sheva, Israel. We utilized a smart card dataset, consisting of over a million records from over 85,000 distinct travelers for one month during November-December 2018. We chose this city since the boarding stop information was complete and road traffic in the city is not prone to congestion.

We also used a GTFS dataset, containing over 27,000 stops in Israel with over 200,000 trips, and including all the agencies (operators) in Israel (https://transitfeeds.com/p/ministry-of-transport-and-road-safety/820). The dataset also includes a detailed timetable for every trip taken by public transport. We utilized the GTFS dataset both to enrich our feature space and to transform our boarding stops with the detailed timetable. Lastly, we used a geospatial dataset, which contains a variety of geographical attributes of the city of Beer Sheva, such as location of traffic lights, built-area density, and more.

We removed observations that were unusable (see Section 3.2), leaving us with about 92% of the smart card dataset. Afterwards, we extracted the 14 features from the above datasets as mentioned in Section 3.3. Next, we embedded the boarding stops from their Beer Sheva identifiers to a numerical value as described in Section 3.4. Lastly, we developed a machine learning algorithm to classify the above boarding stops (see Section 3.5) and evaluated the classifier’s performance (see Section 3.6).

4 Results

Using the aforementioned datasets, we trained several classifiers and evaluated their performance. Among all trained classifiers, the XGBoost classifier presented the highest performance (see Table 2). Comparison was done with common metrics as stated in Section 3.6. In addition, we evaluated our new proposed metric, Pareto Accuracy. We chose to evaluate on error sizes of 1 and 2, i.e., $PA_1$ and $PA_2$. This is because a larger gap would usually be unacceptable, and values of $PA_l$ for $l > 2$ are highly correlated with these.

Additional spatial analysis produced the two heatmaps shown in Figure 1. Feature importance, calculated using SHAP values to understand the effect of each feature, is presented in Figure 4.

In Figure 5 we can see that the results do not change for higher values of $l$ in the Pareto Accuracy metric: the proposed method still outperforms the current schedule-based method.

Algorithm | Accuracy | Recall | Precision | F1 | AUC | $PA_1$ | $PA_2$
--- | --- | --- | --- | --- | --- | --- | ---
Schedule based | 0.209 | 0.209 | 0.212 | 0.209 | 0.590 | 0.470 | 0.643
Logistic Regression | 0.205 | 0.205 | 0.097 | 0.102 | 0.562 | 0.474 | 0.654
Random Forest | 0.368 | 0.368 | 0.348 | 0.353 | 0.650 | 0.672 | 0.818
XGB | 0.410 | 0.410 | 0.393 | 0.394 | 0.675 | 0.712 | 0.843
Table 2: Classifiers Performances
Figure 4: Feature importance using SHAP values
Figure 5: Pareto chart

We can see that by all metrics the ensemble methods outperformed both the current method and logistic regression. In particular, XGBoost outperforms all other algorithms and is almost twice as good as the schedule-based method in terms of accuracy, recall, precision, and F1.

5 Discussion

From the above results we can conclude the following:

First, our methodology for embedding boarding stops demonstrates several advantages: (a) the proposed algorithm generates a generic model that can be used with different smart card datasets since the labels (numeric representations) are always aligned in all datasets; (b) by encoding the boarding stops, our proposed method ensures that the number of distinct labels will be very small and can reduce the computation time significantly for generating a prediction model; and (c) boarding stops inherently have an extremely imbalanced nature, since some stops are very commonly used while others are rarely used, and our proposed methodology considerably reduces the imbalanced nature of accurately classifying many classes with such imbalances.

Second, according to Table 2, the XGBoost algorithm produced the best results in all metrics, having 41% accuracy and 71% $PA_1$, while the current baseline schedule-based method achieved only 21% accuracy and 47% $PA_1$. Moreover, as can be observed from Figure 1, the schedule-based method predicts well only in a few main locations as opposed to our model that predicts well in many locations. Additionally, we conclude that the machine learning algorithms in this paper significantly outperformed current schedule-based imputation methods by every metric. This confirms our hypothesis that schedule-based imputation approaches can be improved by using machine learning methods. Furthermore, we can see that complex methods, such as ensemble methods, resulted in much better performance than simple algorithms, such as logistic regression. In future research, we intend to test the performance of additional prediction algorithms, such as Deep Neural Networks.

Third, from the SHAP values presented in Figure 4, the following can be noted: (a) by far the most important feature is the prediction created by the schedule-based method, i.e., Predicted sequence, which shows it is highly correlated to actual patterns and is very useful for classifying; (b) other than the first two features, the next four features are temporal, which is logical in that different time periods have an impact on traffic (such as morning commute) and when the route is further from boarding it accumulates stochastic events and variance increases; (c) while geospatial features do not have the highest values of importance, they are not insignificant, and thus we can conclude that physical attributes affect the nature of our problem, such as denser areas can lead to more congestion; and (d) the two least significant features pertain to day of the week, and we can conclude that daily public transport routines remained quite stable in our case study.

Fourth, our methodology is transferable across cities (see Section 3.4). Initial results on another city (Kiryat Gat, Israel) look promising compared to the current schedule-based method, with an accuracy of 31% vs. 20% and $PA_1$ of 56% vs. 47%.

Lastly, we introduced a new metric in this paper, a generalized accuracy metric we named Pareto Accuracy, which helps compare classifiers for ordinal classification. This metric is more robust to outliers, more interpretable, and accounts for the true dimension of errors. In addition, the metric is easy to implement. In the future, we hope to understand how Pareto Accuracy can improve additional ordinal classification use cases.

6 Conclusions and Future Work

Missing data imputation is a very difficult and complex task. On one hand, one wants as much data as possible for analysis, but on the other hand, data integrity is of utmost importance and demands imputation methods that work well. We assert that the commonly used schedule-based method suffers from subpar performance in terms of accuracy and other key metrics, as well as being highly dependent on the centrality of boarding stops. This dependency is clearly visible in a hotspot spatial analysis of the stops that are well predicted. These attributes reduce the imputed data’s integrity and make it less suitable for usage. In contrast, we showed that our proposed model outperformed the schedule-based method in all metrics, as well as being more robust to the centrality of the imputed stops. This makes it a much more suitable method for imputation as it improves data integrity.

In addition, our method is based on generic classification and thus can be used in a wide variety of transportation use cases, such as predicting alighting at stops, imputing other attributes of interest, etc. In the future, we would like to test our model in other cities to verify its generalization. In addition, we would like to test the influence of transfer learning on new datasets.

There are a few limitations to the study worth noting. One is that our method requires several constraints to succeed, such as timestamps, trip IDs, and existing trip timetables. These constraints possibly reduce the number of applicable datasets and the number of observations that could be imputed. However, these constraints also prevent using the schedule-based method; hence, in practice using this method does not reduce researchers’ ability to impute missing data. In addition, the generality of our method can increase bias as it ignores features that cannot be transferred from one dataset to another. These features, such as having each line as a categorical feature, can reduce bias when imputing a specific dataset.

In conclusion, by mining smart card data we were able to construct a passenger boarding stop prediction model that clearly surpasses the traditional schedule-based method. Our study revealed that applying machine learning techniques improves the integrity of public transport data, which can significantly benefit the field of transportation planning and operations.

Acknowledgement

This research was supported by a grant from the Ministry of Science & Technology, Israel and The Ministry of Science & Technology of the People’s Republic of China (No. 3-15741). Special thanks to Prof. Itzhak Benenson of Tel Aviv University for his helpful advice, and to Carol Teegarden for editing and proofreading.

Appendix

Metrics presented in this paper:

  • Accuracy - percent of observations that were correctly classified

  • Recall - for each class, Recall is the number of observations that were correctly classified divided by the total number of distinct observations from this class. Final Recall is the weighted average of the above on all classes.

  • Precision - for each class, Precision is the number of observations that were correctly classified divided by the total number of observations that were predicted within this class. Final Precision is the weighted average of the above on all classes.

  • AUC - area under curve (AUC), refers to the area under the ROC curve. This curve, for each class, is the true positive rate as a function of the false positive rate. Then a weighted average of the areas under the curves of all classes is calculated as the AUC metric.

  • RMSE - root mean square error (RMSE) is a metric for ordinal classification and regression. It averages the squared differences between the predicted and actual labels, then returns the square root of that average.
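The weighted Recall and Precision definitions above can be sketched as follows, under the assumption that each class is weighted by its share of the actual observations. The function name and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_precision_recall(actual, predicted):
    """Class-weighted precision and recall, weighting by actual class frequency."""
    support = Counter(actual)              # observations per actual class
    predicted_counts = Counter(predicted)  # observations per predicted class
    true_pos = Counter(a for a, p in zip(actual, predicted) if a == p)
    n = len(actual)

    precision = 0.0
    recall = 0.0
    for cls, count in support.items():
        weight = count / n
        recall += weight * true_pos[cls] / count
        # A class that is never predicted contributes zero precision.
        if predicted_counts[cls]:
            precision += weight * true_pos[cls] / predicted_counts[cls]
    return precision, recall


actual = [0, 0, 1, 1, 1, 2]
predicted = [0, 1, 1, 1, 2, 2]

precision, recall = weighted_precision_recall(actual, predicted)
print(round(precision, 3), round(recall, 3))  # 0.75 0.667
```

With this weighting, each class contributes its number of correct predictions divided by the total number of observations, so weighted Recall is always equal to Accuracy.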

References