Estimating Train Delays in a Large Rail Network Using a Zero Shot Markov Model

by   Ramashish Gaurav, et al.

India operates the world's fourth largest railway network by size, carrying over 8 billion passengers per year. However, the travel experience of passengers is frequently marked by delays, i.e., the late arrival of trains at stations, causing inconvenience. In a first, we study the systemic delays in train arrivals using n-order Markov frameworks and experiment with two regression-based models. Using train running-status data collected for two years, we report an efficient algorithm for estimating delays at railway stations with near-accurate results. This work can help railways manage their resources, while also helping passengers and the businesses served by them to plan their activities efficiently.





I Introduction

Trains have been a prominent mode of long-distance travel for decades, especially in countries with a significant land area and a large population. India, with a population of over a billion people in 2016, runs a railway system spanning tens of thousands of kilometers of route length that served over 8 billion riders [7]. The Indian railway system is the fourth largest in the world in terms of network size. However, its trains are plagued with endemic delays that can be attributed to (a) obsolete technology, e.g., dated rail engines, (b) size, e.g., a large network structure and high railway traffic, and (c) weather, e.g., fog in the winter months in north India and rains during the summer monsoons countrywide.

In this paper, we take the initial steps toward understanding and predicting train delays. Specifically, we focus on the delays of 135 trains which pass through the busy Mughalsarai station (Station Code: MGS), over a two-year period. We build an n-Order Markov Late Minutes Prediction Framework (n-OMLMPF) which, as we show, predicts near-accurate late minutes at the stations the trains travel to. To the best of our knowledge, this is the first effort to predict train delays for the Indian rail network. The closest prior work is by Ghosh et al. [4] [5], who study the structure and evolution of the Indian Railway network; however, they do not estimate delays. Our analysis is complementary to theirs and agrees with the characteristics of the busiest train stations that they find. We now define the problem, outline our contributions, and present our approach.

Problem Statement: Given a train and its route information, predict the delay in minutes at an in-line station during its journey on a valid date.

I-A Contributions

Our main contributions are that we:

  • present, as a first, a dataset of 135 Indian trains’ running-status information (which captures delays along stations), collected over two years. We plan to make it public.

  • build a scalable, train-agnostic, and Zero-Shot competent framework for predicting train arrival delays, learning from a fixed set of trains and transferring the knowledge to an unknown set of trains.

  • study delays using n-order Markov Process Regression models and carry out Akaike Information Criterion (AIC) and Schwarz Bayesian Information Criterion (BIC) analyses to find the correct order of the Markov Process. Most of the 135 trains follow a 1-order Markov Process.

  • discuss how the train-agnostic framework can leverage different types of trained models and be deployed in real time to predict the late minutes at an in-line station.

The rest of the paper is arranged as follows. We first discuss the data about train operation and its analysis in Section II and then present the proposed model in Section III. Next, in Section IV, we outline the experiments conducted with two different regression models, Random Forest Regression and Ridge Regression, and give an exhaustive analysis of our results. Finally, we conclude with pointers for future research.

II Data Preprocessing and Analysis

This section gives details of the train information we collected over a span of two years from the website [10]. Table I gives the statistics.

Total number of trains considered 135
Total number of unique stations covered 819
Maximum number of journeys made by a train 334
Average number of journeys made by a train 48
Maximum number of stations in a train’s route 129
Average number of stations in a train’s route 30
TABLE I: Data Statistics for 135 Trains Complete Data

II-A Data Collection and Segregation

We considered 135 trains that pass through Mughalsarai Station (MGS), one of the busiest stations in India. For them, we collected train running-status information (Train Data) over the period March 2016 to February 2018. A train’s Train Data consists of multiple instances of journeys, where each journey has the same set of in-line stations that the train plies through. Table II lists the important fields of interest in Train Data.

Due to the infrequent running of trains, the amount of data collected for each of the trains varied greatly. Using file size as the criterion, we selected the Train Data of 52 frequent trains (henceforth Known Trains), out of 135, as training data. The data of the remaining 83 trains (henceforth Unknown Trains) were used for testing and for evaluating the transfer of knowledge through the trained models. Figure 1 pictorially illustrates the segregation of the collected Train Data from March 2016 to February 2018 for the 135 trains. Recall that in traditional machine learning, the training and test data are drawn from the same set (or class). In contrast, we train our models on a seen set of Known Trains and test them on an unseen set of Unknown Trains, thus employing zero data of Unknown Trains for training, hence the term Zero-Shot. This problem setting is similar to Zero Shot Learning [8], where the training and test set classes’ data are disjoint. Figure 2 shows a train journey and the related notations used in this paper.

Field Name / Description
actarr_date: Actual arrival date of the train at a station
station_code: Station code name (acronym) for a station
latemin: Late minutes (arrival delay) at the station
distance: Distance of the station from the source in kilometers
month: Jan, Feb, Mar… extracted from actarr_date
weekday: Mon, Tue, Wed… extracted from actarr_date
TABLE II: Description of Train Data collected for each train

135 Trains Complete Data: March 2016 to February 2018
  ├─ 52 Known Trains: March 2016 to February 2018
  │     ├─ 52 Trains Training & Cross-validation Data: March 2016 to June 2017 (52TrnsTrCv)
  │     └─ 52 Trains Test Data: July 2017 to February 2018 (52TrnsTe)
  └─ 83 Unknown Trains Test Data: March 2016 to February 2018 (83TrnsTe)

Fig. 1: Segregation of Complete Data of 135 Trains for Experimentation.
The complete data is divided into two sets: 52 Known Trains and 83 Unknown Trains. Known Trains data is further subdivided into 52 Trains Training & Cross-validation Data (52TrnsTrCv) and 52 Trains Test Data (52TrnsTe) with different time periods. The Unknown Trains data (83TrnsTe) is kept intact to assess knowledge transfer from Known Trains to Unknown Trains.

II-B Data Preparation

We define a data-frame as a collection of multiple rows with a fixed number of columns. For our experiments we prepared two types of data-frames. The first type is a data-frame (Table III) for each station falling in the journey routes of Known Trains (henceforth Known Stations, totaling 621 out of 819), built by extracting the required information from the Train Data (Table II) of the respective trains in whose routes the station fell; these are used to train the models. The second type is a single data-frame (Table IV) capturing certain information about all 819 stations, irrespective of whether they are in-line to Known Trains or Unknown Trains. We divided the journey data in the 52TrnsTrCv Data in a 4:1 ratio to train and cross-validate the models and prepared data-frames (Table III) for the chosen 80% of the journey data. We did not prepare any data-frames (Table III) for the remaining 20% of the 52TrnsTrCv Data, the 52TrnsTe Data, or the 83TrnsTe Data, thereby leaving them in their native Train Data format (Table II).
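The 4:1 split of journey data into training and cross-validation sets might look like the following minimal Python sketch (the function name `split_journeys` and the journey-record layout are our own illustration, not the authors' code):

```python
import random

def split_journeys(journeys, train_frac=0.8, seed=42):
    """Split a train's journey records into training and cross-validation
    sets in the 4:1 ratio used for the 52TrnsTrCv data."""
    rng = random.Random(seed)
    shuffled = journeys[:]
    rng.shuffle(shuffled)           # randomize journey order reproducibly
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# e.g. 50 journey identifiers -> 40 training / 10 cross-validation
train_set, cv_set = split_journeys(list(range(50)))
```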












Fig. 2: Train route of Train 12439, from its source station to its terminal station. For the current station (Stn0), 4 previous stations are considered, whose information can be used to prepare a 4-prev-stn data-frame (Table III). The Stnk notation for the k-th previous station is used throughout this paper.

II-C Data Analysis

Here we analyze the most important factors which drive our learning and prediction algorithm. As observed in Figures 3, 4, and 5, the spikes in each month signify that the mean late minutes at a station vary monthly (the colored dots are the individual late minutes during the month). This premise was verified with similar graphs obtained for other trains and their in-line stations. In Figures 6, 7, and 8, the dots represent the mean late minutes at each in-line station during a train’s journey in a particular month. In Figure 6 we can see that the mean late minutes increase during the journey up to a certain station and decrease thereafter. We observed similar graphs for other trains and found that partial sequences of consecutive in-line stations characterize the delays during a train’s journey.

Fig. 3: Monthly variation of late minutes at an in-line station for Train 12307
Fig. 4: Monthly variation of late minutes at an in-line station for Train 12802
Fig. 5: Monthly variation of late minutes at an in-line station for Train 12816
Fig. 6: Mean late minutes during Train 12282’s journey in June 2017
Fig. 7: Mean late minutes during Train 12395’s journey in December 2017
Fig. 8: Mean late minutes during Train 12444’s journey in April 2017
train_type: Is it Special, Express, or Other? Obtained from [2] through the train number (e.g. 13050 for Train 13050)
zone: What zone does the train belong to? Obtained from [2]
is_superfast: Is it super fast? Obtained from [2]
month: Month in which the journey is made; obtained from actarr_date (Table II)
weekday: Weekday on which the journey is made; obtained from actarr_date (Table II)
Stn1_code … Stnn_code: Station codes of the n previous stations; obtained from station_code (Table II)
late_mins_Stn1 … late_mins_Stnn: Late minutes at each previous station; obtained from latemin (Table II)
db_Stn0_Stn1 … db_Stnn-1_Stnn: Distances between consecutive stations; obtained from distance (Table II)
Stn1_dfs … Stnn_dfs: Distances of the previous stations from the source station; obtained from distance (Table II)
tfc_of_Stn1 … tfc_of_Stnn: Traffic strengths of the previous stations; obtained from Open Government Data (OGD) [4]
deg_of_Stn1 … deg_of_Stnn: Degree strengths of the previous stations; obtained from OGD [4]
Stn0_dfs: Current station’s distance from the source station; obtained from distance (Table II)
Stn0_tfc: Current station’s traffic strength; obtained from OGD [4]
Stn0_deg: Current station’s degree strength; obtained from OGD [4]
Stn0_late_minutes: Current station’s target late minutes to be predicted; obtained from latemin (Table II)

The above are the columns in our prepared data-frame for each Known Station. We assert that Stn0_late_minutes depends on the values in the other columns. tfc_of_Stni and deg_of_Stni are, respectively, the total number of trains passing through Stni and the total number of direct connections of Stni to other stations. Such a data-frame is called the n-prev-stn data-frame of the target station (Stn0) for which it is prepared, where n is the number of previous stations (a partial sequence of consecutive stations) considered.

TABLE III: Description of Training Data prepared from Train Data (Table II)
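One row of such a data-frame can be assembled from journey records roughly as follows (a hedged Python sketch: the helper `make_prev_stn_row` and the record keys `code`, `late_mins`, `dfs` are hypothetical stand-ins for the paper's exact schema, and the inter-station distance columns are omitted for brevity):

```python
def make_prev_stn_row(journey, i, n, station_meta):
    """Build one n-prev-stn feature row (cf. Table III) for the station at
    position i of a journey from its n previous stations.
    journey: list of dicts with keys 'code', 'late_mins', 'dfs'
    station_meta: station code -> (traffic strength, degree strength)."""
    if i < n:
        return None  # not enough previous stations for an n-prev-stn row
    row = {}
    for k in range(1, n + 1):
        prev = journey[i - k]
        row[f"Stn{k}_code"] = prev["code"]
        row[f"late_mins_Stn{k}"] = prev["late_mins"]
        row[f"Stn{k}_dfs"] = prev["dfs"]
        row[f"tfc_of_Stn{k}"], row[f"deg_of_Stn{k}"] = station_meta[prev["code"]]
    cur = journey[i]
    row["Stn0_dfs"] = cur["dfs"]
    row["Stn0_tfc"], row["Stn0_deg"] = station_meta[cur["code"]]
    row["Stn0_late_minutes"] = cur["late_mins"]  # the regression target
    return row

demo_journey = [
    {"code": "A", "late_mins": 0, "dfs": 0},
    {"code": "B", "late_mins": 5, "dfs": 100},
    {"code": "C", "late_mins": 7, "dfs": 180},
]
demo_meta = {"A": (10, 3), "B": (20, 4), "C": (5, 2)}
demo_row = make_prev_stn_row(demo_journey, 2, 1, demo_meta)
```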
station: Station code of a station
latitude, longitude: Obtained from Google Maps APIs
stn_tfc: Traffic strength of the station; obtained from OGD [3]
stn_deg: Degree strength of the station; obtained from OGD [3]

The above are the columns in our prepared data-frame covering, collectively, all 819 stations of Known Trains and Unknown Trains. station is used as a key to obtain the rest of the features, on which k-NN is run. This data-frame helps to determine the semantically nearest station to a given station.

TABLE IV: Description of Station Features

III Proposed Model

In this section, we explain our proposed regression-based n-OMLMPF algorithm and its components. Regression is the task of analyzing the effects of independent variables (in multi-variate data) on a dependent continuous variable and predicting it. In our setting, the independent variables are the ones mentioned in Table III and the dependent continuous variable to be predicted is the target late minutes (Stn0_late_minutes). Our regression experiments, with low RMSE and significant accuracy under the 95% Confidence Interval, back our hypothesis of casting this as a regression problem. We used Random Forest Regressors (RFRs) and Ridge Regressors (RRs) as the two types of individual regression models in n-OMLMPF to learn, predict, evaluate, and compare results.

For real-time deployment and scalability, we avoided building train-specific models. Hence we looked for entities which would help us frame a train-agnostic algorithm as well as enable knowledge transfer from Known Trains to Unknown Trains. A train’s route is composed of, and characterized by, the stations in-line in its journey. Greater delays can be expected along a route which has more busy stations compared to one having fewer busy stations.

Through the analysis of multiple figures similar to the ones mentioned in subsection II-C, we observed the following details about the delay at in-line stations during a journey:

  • It highly depends on the months during which the journey is made. One can observe the variations between the summer and winter months (Fig. 3).

  • Partial routes of consecutive stations can be identified during a journey which either increase or decrease the delay at the next stations (Fig. 6).

  • Stations with a high traffic and degree strength tend to be bottlenecks in a journey, thus increasing the overall lateness (a busy station in Fig. 6, Fig. 7, and Fig. 8).

The above points suggest that multiple deciding factors (e.g. the month of travel, the sequence of stations during a journey, etc.) determine the late minutes at a considered station. Since we sought to use stations to frame a train-agnostic late minutes prediction algorithm and to enable knowledge transfer, we prepare a data-frame (Table III) for each of the Known Stations capturing the details mentioned. Later, we train n-Order Markov Process Regression models for each Known Station, described next.

III-A n-Order Markov Process Regression (n-OMPR) Models

The Markov Process asserts that the outcome at a current state depends only on the outcome of the immediately previous state. If the current state’s outcome depends on the n previous states, we call it an n-Order Markov Process. Here we assert that the late minutes at a current target station depend on the details of its n previous stations (henceforth n-prev-stns). This notion is effectively captured in the data-frame of Table III, where we record the general features of a train, the day and month of a journey, and the characteristics of the n previous stations along with those of the current target station. The idea is to learn n-OMPR models (Random Forest Regressors and Ridge Regressors) for each of the Known Stations using Algorithm 1, and later use those trained models to frame a train-agnostic late minutes prediction algorithm (n-OMLMPF, Algorithm 2). Regression models are trained on each Known Station’s corresponding n-prev-stn data-frame (Table III), with the values of n depending on the number of stations previous to it, subject to its positions during the journeys of multiple trains. This design will be clarified in section III-C. We used the python sklearn.ensemble library [9] and sklearn.linear_model library [9] for learning the Random Forest Regressor and Ridge Regressor models respectively.
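Using the sklearn libraries named above, per-station training might look like this minimal sketch (the hyperparameters are sklearn defaults, not necessarily those used by the authors, and the synthetic data is purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

def train_station_models(X, y, seed=0):
    """Train the two n-OMPR model types, an RFR and an RR, on one Known
    Station's numeric n-prev-stn feature matrix X and target vector y
    (the Stn0_late_minutes column)."""
    rfr = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    rr = Ridge(alpha=1.0).fit(X, y)
    return rfr, rr

# Synthetic demo: 200 rows of 4 features with a linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0, 4.0])
rfr_demo, rr_demo = train_station_models(X_demo, y_demo)
```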

III-B k-Nearest Neighbor (k-NN) Search

Unknown Stations (USs) are the ones which, along with the Known Stations (KSs), build the journey routes of Unknown Trains. Since we kept the Unknown Trains’ data Zero Shot, the data-frame of Table III is not prepared for USs, and thus we do not have n-OMPR models for them. Hence, we look for a KS which is most similar to the current target US with respect to the features stated in Table IV, whose model can be used to approximate the predicted late minutes at the US. We employ the k-NN search algorithm (Algorithm 3) to fulfill this objective. A two-step k-NN search is applied, since latitude and longitude data are semantically different from traffic and degree strength data. We used the python sklearn.neighbors library [9] with default options.

Input: List of Known Stations
Output: n-OMPR models for Known Stations
for n = 1 to N do
        Initialize an empty list KSLn (stores stations having n-OMPR models)
end for
for each station S in the Known Stations do
        for n = 1 to N do
               Get S’s n-prev-stn data-frame D (Table III)
               if D is not empty then
                      Train RFR & RR models on D
                      Include S in list KSLn
               end if
        end for
end for
Algorithm 1 Training n-OMPR Models
Input: Train number T, in-line stations list (SL), journey route information (Table II)
Output: A list (L) of predicted late minutes at each station during the journey
Initialize late minutes list L with entry 0 (0 minutes late at source)
for i = 1 to len(SL) - 1 do
        S = SL.At(i), the station at position i
        if S is at position i = 1 then
               Prepare S’s 1-prev-stn row data-frame (Table III) using Table II with late_mins_Stn1 set as L.At(0)
               if S is not in KSL1 (the list of stations having 1-OMPR models) then
                      S = nearest Known Station in KSL1, found using Algorithm 3
               end if
               L.At(i) = late minutes predicted for position i using S’s 1-OMPR model
        else if S is at position i = 2 then
               Prepare S’s 2-prev-stn row data-frame (Table III) using Table II with late_mins_Stn1 set as L.At(1) and late_mins_Stn2 set as L.At(0)
               if S is not in KSL2 then
                      S = nearest Known Station in KSL2, found using Algorithm 3
               end if
               L.At(i) = late minutes predicted for position i using S’s 2-OMPR model
        else (S is at position i ≥ 3 during the journey)
               Prepare S’s 3-prev-stn row data-frame (Table III) using Table II with late_mins_Stn1 set as L.At(i-1), late_mins_Stn2 set as L.At(i-2) and late_mins_Stn3 set as L.At(i-3)
               if S is not in KSL3 then
                      S = nearest Known Station in KSL3, found using Algorithm 3
               end if
               L.At(i) = late minutes predicted for position i using S’s 3-OMPR model
        end if
end for
Algorithm 2 n-OMLMPF for Known Trains and Unknown Trains (here the value of n is capped at 3, limiting the models up to 3-OMPR models)
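The feed-forward logic of Algorithm 2 can be sketched in Python as follows (heavily simplified: `make_row` is a stub standing in for full Table III row preparation, `nearest_known` abstracts the Algorithm 3 fallback, and the stub model is purely for demonstration):

```python
def predict_journey_delays(route, models, nearest_known, max_n=3):
    """Feed-forward late-minutes prediction along a route (Algorithm 2 sketch).
    route: list of station codes; models[(stn, n)]: a trained n-OMPR model;
    nearest_known(stn, n): stand-in for the Algorithm 3 k-NN fallback."""
    def make_row(stn, prev_delays):
        # Hypothetical feature builder; a real implementation would assemble
        # the full Table III row (codes, distances, traffic/degree, etc.).
        return [sum(prev_delays)] + list(prev_delays)
    L = [0.0]  # the source station is assumed to be on time
    for i in range(1, len(route)):
        n = min(i, max_n)           # use up to max_n previous stations
        stn = route[i]
        if (stn, n) not in models:  # Zero-Shot fallback for Unknown Stations
            stn = nearest_known(stn, n)
        L.append(models[(stn, n)].predict(make_row(stn, L[i - n:i])))
    return L

class _StubModel:
    # Toy model for demonstration: "predicts" the sum of previous delays + 1.
    def predict(self, row):
        return row[0] + 1.0

demo_models = {("B", 1): _StubModel(), ("C", 2): _StubModel()}
demo = predict_journey_delays(["A", "B", "C"], demo_models, lambda s, n: s)
```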

III-C Example

In our example, let there be five Known Train routes and two Unknown Train routes over dummy stations, where stations (a..q) are Known Stations and stations (r..w) are Unknown Stations. Figure 9 shows the train route map, where source stations are colored green.

Fig. 9: Visual view of the example train routes (five Known Trains and two Unknown Trains). Starting stations are highlighted.

Input: A station S, a valid list KSL of Known Stations
Output: A nearest Known Station
K1 = the k-NN Known Stations to S among the stations in KSL, on the basis of Latitude and Longitude
K2 = the k-NN Known Stations to S among the stations in K1, on the basis of Degree and Traffic strength
Return the first station in K2
Algorithm 3 k-NN search framework to get the Known Station most similar to any type of station (k set to 10)
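The two-step search of Algorithm 3 can be sketched with sklearn's NearestNeighbors (a hedged sketch: the function `nearest_known_station` and its array layout are our own illustration, and tie-breaking may differ from the authors' implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearest_known_station(target_geo, target_strength,
                          known_geo, known_strength, k=10):
    """Two-step k-NN: first shortlist the k Known Stations geographically
    closest to the target (latitude/longitude), then re-rank the shortlist
    by (traffic, degree) strength. Rows of known_geo and known_strength
    describe the same Known Station. Returns the chosen station's index."""
    k = min(k, len(known_geo))
    geo_nn = NearestNeighbors(n_neighbors=k).fit(known_geo)
    _, idx = geo_nn.kneighbors([target_geo])
    shortlist = idx[0]
    strength_nn = NearestNeighbors(n_neighbors=1).fit(known_strength[shortlist])
    _, best = strength_nn.kneighbors([target_strength])
    return int(shortlist[best[0][0]])

# Demo: station 1 is both geographically close and closest in strength.
demo_geo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [0.0, 2.0]])
demo_strength = np.array([[100.0, 5.0], [50.0, 3.0], [49.0, 3.0], [10.0, 1.0]])
demo_idx = nearest_known_station([0.0, 0.5], [50.0, 3.0],
                                 demo_geo, demo_strength, k=3)
```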

III-C1 Data Preparation and Training

We collect Train Data (Table II) for each of the seven trains and divide them into two categories, Known Trains (five) and Unknown Trains (two), based on the amount of data collected for each train. After segregating the collected data as shown in Fig. 1, we prepare n-prev-stn data-frames (Table III) for each Known Station using the Known Trains’ Table II data.

  • Preparation of 1-prev-stn data-frames (Table III):
    A station gets a 1-prev-stn data-frame owing to a train only when that train’s route provides one station previous to it. If the same station is the source station of another train’s route, it has zero previous stations there and contributes no data-frame for that train.

  • Preparation of 2-prev-stn data-frames (Table III):
    A station gets a 2-prev-stn data-frame owing to each train whose route provides a valid set of two consecutive stations previous to it; the same station may simultaneously get a 1-prev-stn data-frame owing to another train in whose route only one previous station is available.

  • Preparation of 3-prev-stn data-frames (Table III):
    Analogously, a 3-prev-stn data-frame is prepared for a station owing to each train whose route provides three consecutive stations previous to it during the journey.

Similarly, for each of the Known Stations, we prepare valid n-prev-stn data-frames (Table III), depending on the number of stations previous to them during the journeys of Known Trains. Later we use those n-prev-stn data-frames to train n-OMPR models (RFR and RR) for each Known Station as explained in Algorithm 1. While training the models, we also maintain lists KSLn which store the codes of stations that have n-OMPR models. For example, in the context of the five Known Trains here, KSL1 contains every station with one valid station previous to it during the journey of some Known Train, while KSL2 contains every station with a valid set of two previous stations.

III-C2 Prediction of Late Minutes for Train Journeys

We explain the n-OMLMPF algorithm (Algorithm 2) here with the help of the above example trains. We employ a feed-forward method for late minutes prediction at each of the in-line stations, where the late minutes predicted for the previous stations, together with their other details, are incorporated in the current target station’s n-prev-stn row data-frame. (A row data-frame consists of only one row of Table III.)

Known Trains Late Minutes Prediction

Stations in-line during the journeys of the cross-validation set and the test set of Known Trains consist only of Known Stations, for which we have trained models saved from Algorithm 1. The column entries in the n-prev-stn row data-frame (Table III) for the current station, at which the late minutes are to be predicted, are filled as explained in the table, except Stn0_late_minutes, since that is what we aim to predict. Say, for a Known Train’s cross-validation or test data, we predict late minutes at each station. As per the execution steps of Algorithm 2, the late minutes at:

  • the source station is assumed to be 0, so the late minutes list L is initialized as [0].

  • the second station is predicted through its 1-OMPR model, which was trained over the 1-prev-stn training data-frame for that station. We fill its 1-prev-stn row data-frame with the late minutes at the previous station set as the first entry in L, i.e. 0. Say the predicted late minutes is 5; L extends to [0, 5].

  • the third station is predicted through its 2-OMPR model. The first and second entries in L (0 and 5) are used as the late minutes at the two previous stations in the 2-prev-stn row data-frame for the station; say m minutes is predicted, so the list becomes [0, 5, m].

  • In a similar fashion, we keep feed-forwarding the predicted late minutes at previous stations to predict the late minutes at the subsequent stations through their respective n-OMPR models.

Unknown Trains Late Minutes Prediction

We choose one of the Unknown Trains to explain Algorithm 2 for predicting late minutes at Unknown Trains’ in-line stations. The late minutes at:

  • the source station is assumed to be 0, so the late minutes list L is initialized with [0].

  • the second station, an Unknown Station, is predicted as follows. We do not have a trained 1-OMPR model (neither RFR nor RR) for it, since it is not in KSL1. Hence, via Algorithm 3 we find the Known Station nearest to it among the ones in KSL1 which have a 1-OMPR model. Next, the 1-prev-stn row data-frame prepared for the target station, with the previous station’s late minutes taken from L, is fed to that nearest station’s model to predict the late minutes; L extends with the predicted value.

  • subsequent stations that are Known Stations with valid trained models are predicted through their own n-OMPR models, with the late minutes of their previous stations taken from L.

  • a Known Station can still lack a valid trained model: this happens when no n-prev-stn data-frame (of the required order) for it could be prepared from any of the Known Trains. In that case we again choose the station in KSLn most similar to it through Algorithm 3 and use that station’s model to predict the late minutes on the target station’s row data-frame.

  • the remaining stations are predicted in the same feed-forward fashion, falling back to Algorithm 3 whenever the current target station has no n-OMPR model of the required order.

IV Experiments and Result Analysis

The n-OMLMPF algorithm (Algorithm 2) was executed on three sets of data, namely the Cross-validation Data of Known Trains, the Test Data of Known Trains, and the Test Data of Unknown Trains, as mentioned in Figure 1, for different values of n. We enumerate the four experiments below, each conducted with both RFR and RR models individually:

  1. Exp 1: We ignored the tfc_of_Stni, deg_of_Stni and Stni_dfs columns of the data-frame (Table III), since these features are implicitly captured in Stni_code. The experiment was conducted on the 52TrnsTrCv dataset.

  2. Exp 2: We ignored the Stni_code columns of the data-frame (Table III), as tfc_of_Stni, deg_of_Stni and Stni_dfs numerically capture the properties of the station codes. This was done for the Unknown Trains case because, the test data being Zero-Shot, we did not have the partial consecutive in-line station paths of KSs and USs (hence no Stni_codes). The experiment was conducted on the 83TrnsTe data after learning the prediction models from the 52TrnsTrCv data, to assess the transfer of knowledge from Known Trains to Unknown Trains.

  3. Exp 3: We conducted Exp 2 again on the 52TrnsTrCv data; results similar to those obtained in Exp 1 for the cross-validation data endorse our notion that the two station representations (Exp 1 and Exp 2) are interchangeable.

  4. Exp 4: We conducted Exp 2 on the 52TrnsTe data with prediction models learned from the 52TrnsTrCv data.

After conducting the experiments, we analyzed the results to evaluate the performance of the trained models and to determine the optimum value of n in n-OMLMPF. For brevity, we do not present detailed results for all 135 trains, but we present the n-OMLMPF output on the test data of a few trains in Tables V, VI, and VII (negative numbers in the tables indicate that the train arrived early by that many minutes).

Actual Late
0 2 8 -1 13 25 19 18 2 9 -21 -5 6 15
Predicted Late
0 2.75 6.83 0.01 17.44 16.52 11.22 17.65 1.94 16.01 -8.77 -0.25 12.26 23.10
TABLE V: Predicted Late Minutes for Known Train 22811 Test Data (obtained from n-OMLMPF with RFR models)
Actual Late
0 3 4 -11 0 -6 15 55 30 10 18 10 11 0 7 3 5
Predicted Late
0 9.38 7.87 -2.43 3.61 0.50 26.13 36.14 29.42 32.14 20.38 3.296 6.87 -3.80 17.55 14.30 13.91
TABLE VI: Predicted Late Minutes for Known Train 12326 Test Data (obtained from n-OMLMPF with RFR models)
Actual Late
0 8 3 0 -5 -15 -10 -1 30 41 51 57 74 111 75 123 130 120 120
Predicted Late
0 10.19 10.74 10.17 11.60 11.97 27.24 34.63 28.45 40.15 41.29 42.94 60.71 72.51 75.25 70.50 74.45 67.95 71.80
TABLE VII: Predicted Late Minutes for Unknown Train 12356 Test Data with 3 Unknown Stations (obtained from n-OMLMPF with RFR models)
Random Forest Regressor (RFR) Models Ridge Regressor (RR) Models
Exp 1 (Avg %age) Exp 2 (Avg %age) Exp 3 (Avg %age) Exp 4 (Avg %age) Exp 2 (Avg %age) Exp 4 (Avg %age)
CI68 CI95 CI99 CI68 CI95 CI99 CI68 CI95 CI99 CI68 CI95 CI99 CI68 CI95 CI99 CI68 CI95 CI99
1-OMLMPF 34.65 61.37 70.47 5.90 14.73 18.51 33.67 61.05 70.21 27.60 55.41 65.57 4.97 12.87 17.29 22.34 44.30 55.71
2-OMLMPF 35.28 61.36 70.85 5.72 14.17 18.41 33.72 61.03 70.65 27.51 56.32 66.87 5.34 12.65 16.80 22.81 43.67 56.59
3-OMLMPF 33.86 62.31 71.42 6.00 14.79 18.81 33.80 62.13 71.58 27.81 55.89 66.98 4.89 12.46 16.76 22.21 44.05 55.67
4-OMLMPF 34.39 62.53 71.74 5.66 14.96 18.97 33.67 61.57 71.49 27.82 55.80 66.82 4.66 12.35 16.35 21.85 43.89 55.83
5-OMLMPF 34.77 62.70 72.10 5.51 14.52 18.75 33.45 62.03 71.96 27.93 56.20 67.07 4.61 12.43 16.16 21.85 43.87 55.18

CI68, CI95, and CI99 respectively stand for 68% CI, 95% CI, and 99% CI. Avg %age stands for Average Percentage.

TABLE VIII: Confidence Interval (CI) observations for different experiments
Known Trains Unknown Trains
Trains 12305 12361 12815 12307 13131 13151 22811 22409 18612 13119 15635 03210 04401 04821 12141 12295 22308 12439 18311
Number of Journeys 16 14 39 84 19 83 28 14 47 25 13 2 1 6 3 4 28 2 3
Mean RMSE 87.12 89.38 96.61 88.26 62.84 82.34 53.71 44.72 29.42 80.66 80.22 57.37 23.86 31.97 53.38 68.49 44.83 11.75 36.20

Trains row consists of unique Train Numbers. Number of Journeys row denotes the number of journeys undertaken by the corresponding train in its Test Data. Mean RMSE row presents the average of the RMSEs of all journeys. For example, Train 12305 covered 16 journeys with a mean RMSE of 87.12.

TABLE IX: Mean RMSE values for few Known Trains and Unknown Trains Test Data (Obtained from 4-OMLMPF with RFR Models)

IV-A Performance Evaluation of Models

We begin by noting again that a train’s Train Data consists of multiple instances of journeys, where each journey has the same set of stations that the train plies through. For each in-line station during a train’s journey, we calculated the monthly 68%, 95%, and 99% Confidence Intervals (CI) around the mean of the late minutes in a month, considering the train’s complete Train Data with outlier late minutes removed by Tukey’s Rule [6]. For each train’s cross-validation/test Train Data, we calculated the percentage of times the predicted late minutes for an in-line station fell inside each matching CI. We then averaged all the percentages (calculated for each train) across the different experiments enumerated above. Table VIII shows the corresponding figures. In Table IX we present the mean Root Mean Square Error (RMSE) values for a few Known Trains and Unknown Trains obtained from their Test Data, where the RMSE for a journey was calculated between the predicted and the actual late minutes. Note that the results reported in Tables VIII and IX include journeys where the train was actually late at the source station; our models could not capture this due to its scarce occurrence.
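The evaluation just described can be sketched in Python as follows (assumptions: the text does not state whether the intervals use the sample standard deviation or the standard error, so this sketch uses the sample standard deviation, with Tukey's conventional k = 1.5; the function names are our own):

```python
import numpy as np

def tukey_filter(x, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (Tukey's Rule)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - k * iqr) & (x <= q3 + k * iqr)]

def ci_hit(predicted, monthly_late_mins, z=1.96):
    """Check whether a predicted late-minutes value falls inside the z-based
    CI around the monthly mean (z = 1.96 for the 95% CI), after removing
    outlier late minutes with Tukey's Rule."""
    x = tukey_filter(monthly_late_mins)
    return abs(predicted - x.mean()) <= z * x.std(ddof=1)

demo_month = [10, 11, 9, 10, 12, 100]  # 100 is a Tukey outlier
```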

Preliminary analysis of the CI and mean RMSE observations showed that RFR models outperformed RR models; for completeness, however, we present CI observations of RR models for some selected experiments in Table VIII. The scattering of individual late minutes at a station during a month, as observed in Figures 3, 4, and 5, suggests considering CI95 (or higher), since the late minutes are not closely centered around the mean but cover a wider distribution around it. Under the RFR Models column in Table VIII, the figures in the CI95 columns for Exp 1 and Exp 3 indicate that, on average, we were able to predict late minutes at in-line stations within the 95% CI for approximately 62% of the cross-validation journey data of Known Trains (i.e., an accuracy of 62%). The figures in Exp 2 under both the RFR and RR Models columns for the Unknown Trains' test data do not seem promising, but since these results are for Zero-Shot trains, for which a significant amount of data is not available, the observations are appreciable. One should also note the low mean RMSE values for Unknown Trains in Table IX. The higher accuracies (around 56% and 66% for CI95 and CI99) for Known Trains' test data in the Exp 4 column under RFR Models, compared to those under RR Models, signify an important conclusion: Random Forest Regressors (an ensemble of multiple decision trees) model the deciding factors (in Table III) far better than Ridge Regressors, so the prediction of late minutes is effectively a decision-based regression task.
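The RFR-versus-RR gap on a decision-style target can be reproduced with scikit-learn [9]. The sketch below uses synthetic data as a stand-in for the formatted data-frame of Table III (feature names and the target rule are our own illustrative assumptions, not the paper's actual features); the delay regime switches on one feature, mimicking how delay propagation depends on discrete journey conditions that a linear model cannot capture.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the formatted data-frame: columns such as month,
# day-of-week, station index, and late minutes at previous stations.
X = rng.uniform(0.0, 1.0, size=(2000, 5))
# Decision-style target: the delay regime switches on the first feature.
y = np.where(X[:, 0] > 0.5, 60.0 * X[:, 1], 10.0 * X[:, 1]) + rng.normal(0.0, 2.0, 2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
results = {}
for name, model in [("RFR", RandomForestRegressor(n_estimators=100, random_state=0)),
                    ("RR", Ridge(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    results[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5  # RMSE
print({k: round(v, 2) for k, v in results.items()})
```

On such data the ensemble of decision trees recovers the regime switch, while the linear Ridge model averages across regimes and incurs a much higher RMSE.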

IV-B Determination of the Optimum Value of n in n-OMLMPF

We executed Algorithm 2 with values of n from 1 to 5, but which value truly captures the Markov-process property of delays along a train's journey? To answer this we employ two common model selection criteria [1], the Akaike Information Criterion (AIC) and the Schwarz Bayesian Information Criterion (BIC), to choose the statistically best regression model:

AIC = N ln(SSE / N) + 2p
BIC = N ln(SSE / N) + p ln(N)

where N stands for the number of observations used to train a model, SSE is the Squared Sum of Errors (between predicted and actual late minutes), and p is the number of parameters in the model (the number of columns in the formatted data-frame of Table III). The lower the score, the better the model. The number of times a run of n-OMLMPF (for a particular value of n) yielded the least AIC and BIC scores among all five runs, for each train in all four experiments, is noted in Table X. There we see that the delays along the journeys of 40.38% to 67.30% of Known Trains (under the respective experiments) follow a 1st-order Markov process, since 1-OMLMPF attains the minimum AIC and BIC scores among the five frameworks. Similarly, 71.08% to 81.93% of Unknown Trains follow a 1st-order Markov process. The remaining trains follow higher-order Markov processes, with diminishing frequency as the order increases. However, the lower cumulative RMSE scores (summed over all trains) obtained for the low-order OMLMPF frameworks under different experimental settings suggest using them for real-time deployment.
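The criteria can be computed directly from a fitted model's SSE. The sketch below uses the regression forms of AIC and BIC from [1]; the observation count and SSE values are hypothetical, chosen to illustrate how the two criteria can disagree.

```python
import math

def aic(n_obs, sse, p):
    """Akaike Information Criterion for a regression model (form in [1])."""
    return n_obs * math.log(sse / n_obs) + 2 * p

def bic(n_obs, sse, p):
    """Schwarz Bayesian Information Criterion for a regression model."""
    return n_obs * math.log(sse / n_obs) + p * math.log(n_obs)

# Hypothetical comparison: 1-OMLMPF vs. 2-OMLMPF fitted on 1000 observations.
# The higher-order model has a slightly lower SSE but more parameters.
n_obs = 1000
for order, (sse, p) in {1: (52000.0, 12), 2: (51500.0, 14)}.items():
    print(f"{order}-OMLMPF  AIC={aic(n_obs, sse, p):.1f}  BIC={bic(n_obs, sse, p):.1f}")
```

In this hypothetical case AIC slightly prefers the 2nd-order model, while BIC, with its heavier penalty p ln(N), prefers the 1st-order one; counting such wins per train across runs yields Table X.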

Random Forest Regressor Models

                 BIC Analysis                  AIC Analysis
            Exp 1  Exp 2  Exp 3  Exp 4    Exp 1  Exp 2  Exp 3  Exp 4
1-OMLMPF      32     68     35     29       21     59     31     23
2-OMLMPF       7      7      9     14        9     12      9     10
3-OMLMPF       9      5      6      5       12      7      7     11
4-OMLMPF       4      3      1      4        8      2      3      6
5-OMLMPF       0      0      1      0        2      3      2      2

The figure in each cell denotes the number of trains for which an n-OMLMPF scored the minimum among all five runs; e.g., in the BIC Analysis column for Exp 1, 1-OMLMPF scored the minimum BIC for 32 trains.

TABLE X: BIC and AIC analysis of n-OMLMPF with RFR Models

V Conclusion and Future Work

Our objective was to predict the late minutes at an in-line station given the route information of a train and a valid date. The significant accuracy results in Table VIII for Known Trains' and Unknown Trains' data demonstrate the efficacy of our proposed algorithm for a highly dynamic problem. We also determined, experimentally and statistically, that the delays along the journeys of most trains follow a 1st-order Markovian process, while a few other trains follow higher-order Markovian processes. The reasonably low RMSE results obtained for Unknown Trains in Table IX also show that we were able to transfer knowledge from Known Trains to Unknown Trains. The n-OMLMPF algorithm is designed to leverage different types of prediction models and predict delays at stations for any train; it is thus train-agnostic. With just % of the total trains in India, our approach was able to cover more than % of stations, thereby illustrating scalability. There are many avenues for future work: (a) one can expand the data collection and extend the analysis to trains India-wide; (b) one can also explore other approaches such as time-series prediction and neural networks. In particular, Recurrent Neural Networks (RNNs) have the property of memorizing past details to predict the next state. The prediction of delays along stations is inherently dynamic, which implicitly calls for an online learning algorithm that continuously learns the changing behavior of the railway network and its delays; one could thus attempt to develop an online RNN algorithm for this task. One can also consider predicting train delays in other countries.

VI Acknowledgment

We would like to thank Debarun Bhattacharjya for his help in statistically discovering the order of Markovian delays through mathematical equations. We also thank Nutanix Technologies India Pvt Ltd for the computational resources.


  • [1] D. Beal, “Information criteria methods in SAS for multiple linear regression models,” SESUG Proceedings, Paper SA05, 2007.
  • [2] I. R. F. Club, “FAQs about Indian railway numbers,” 2016.
  • [3] I. O. Data, “Indian railway time table,” 2016.
  • [4] S. Ghosh, A. Banerjee, N. Sharma, S. Agarwal, N. Ganguly, S. Bhattacharya, and A. Mukherjee, “Statistical analysis of the indian railway network: a complex network approach,” Acta Physica Polonica B Proceedings Supplement, vol. 4, no. 2, pp. 123–138, 2011.
  • [5] S. Ghosh, A. Banerjee, N. Sharma, S. Agarwal, A. Mukherjee, and N. Ganguly, “Structure and evolution of the indian railway network,” in Summer Solstice International Conference on Discrete Models of Complex Systems, 2010.
  • [6] D. C. Hoaglin, B. Iglewicz, and J. W. Tukey, “Performance of some resistant rules for outlier labeling,” Journal of the American Statistical Association, 1986.
  • [7] R. M. India, “Indian railways yearbook 2015-2016,” in Ministry of Railways (Railway Board), 2015.
  • [8] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, pp. 453–465, Mar. 2014.
  • [9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [10] RailApi, “Indian railway APIs,” 2016.