Flight delays have a significant negative economic impact and are detrimental to the climate and to international communications. In the United States alone, more than US$26.6 billion was lost to flight delays in 2017, according to an estimate by the Federal Aviation Administration (FAA) GAO19. Flight delays and cancellations also account for nearly a third of the complaints made by air travel passengers about selected airlines GAO19. Additionally, around a quarter of all commercial flights have been delayed or cancelled in recent years GAO11.
Departure delay, the most common and least tolerable type of flight delay, caused more than 1600 flights to remain at the airport for more than three hours during the summertime between 2004 and 2010 in the United States GAO11. However, its root causes have not attracted sufficient attention from researchers. Most existing works focus on other types of delay, since departure delay is influenced by both air travel and on-ground tarmac situations.
Figure 1 illustrates the concept of departure delay by showing the scheduled flight travel flow against the actual one. The flight travel flow usually comprises the following five stages: the airborne stage, the taxi-in stage, the turnaround stage, the taxi-out stage and the airborne stage again. Departure delay refers to the period between the scheduled and the actual gate departure times. Departure delay causes passenger anxiety and uncertainty in scheduled economic and social activities. Therefore, an accurate departure delay prediction model is needed to alleviate these problems GAO19.
Most aerospace experts have explored the correlation between environmental factors (weather, wind, etc.), attributes of flight data (day of the week, season, month, etc.) and flight delay time Kim2016; Sternberg2017. Recent works in relevant Computer Science fields have also explored probabilistic models ahmadbeygi2010decreasing; tu2008estimating, network representations abdelghany2004model; tu2008estimating, and machine learning models rebollo2012characterization; ahmadbeygi2008analysis. However, the problem of departure delay is under-studied and more complex, as it is influenced by environmental factors, air traffic and on-ground airport situations.
Aerospace researchers have proposed a concept called air traffic complexity (ATC) to describe airport traffic mogford1995complexity. However, most works rely on established equations based on empirical observations without offering any quantification of the concept. In this context, Internet of Things (IoT)-related techniques such as sensor networks and big data analytics have the potential to provide a scalable solution. A growing number of works explore sensor networks in the air traffic area trub2018monitoring; prabowo2019coltrane; shao2019onlineairtrajclus. They use the sensor data either to estimate the traffic at the airport or to monitor meteorological parameters.
In this work, we investigate whether it is possible to predict the flight departure delay using big data from sensor networks, harnessing machine learning and aerospace domain knowledge. Firstly, we represent the ATC using airport situational awareness maps (the trajectories of flights and vehicles at the airport). This work bridges the gap between the ATC, a concept proposed by aerospace experts, and sensor data fusion and machine learning techniques. We represent the ATC in two ways: 1) we propose a group of features extracted from sensor and radar data of aircraft at the airport, and 2) we regard the whole airport as a grid map and take a snapshot of each time period, so that the locations, speeds and other contextual information are represented as a sequence of images. Secondly, we apply factor analysis to both environmental data and ATC data to explore the correlation between flight departure delay and the situational awareness map (ATC, weather, etc.). Thirdly, both traditional machine learning methods and a deep learning framework are used to predict the flight departure delay on a real-world dataset from Los Angeles International Airport (LAX).
The experimental results show that both the features we propose to represent the ATC and the sequence of images we use to represent the airport situational awareness map can be used to predict the flight departure delay, and that they are more effective than weather conditions and schedules. Additionally, our proposed end-to-end deep learning framework predicts the flight departure delay with performance comparable to that of the machine learning-based methods. In other words, the methods based on hand-crafted features and the end-to-end deep learning approach achieve similar results.
The main contributions of this paper are the following:
We propose a generic framework to predict the departure delay using situational awareness maps, original flight schedules and weather conditions.
We borrow a concept from the aerospace area called Air Traffic Complexity (ATC), and use both aircraft trajectories and a sequence of snapshots of aircraft at the airport to represent this concept. We also analyse the importance of those features and compare them against previously studied delay-causing factors such as weather conditions and schedules.
We propose an approach to project the situational awareness map onto a sequence of images, and an end-to-end deep learning framework to predict the departure delay. The experimental results demonstrate that the approach achieves a good result (a prediction error of less than 20 minutes).
2 Related Work
There exists a large number of studies in the field of air traffic complexity (ATC). However, our literature review shows that the majority of them use complexity factors to indicate the current or expected impact on air traffic controllers' workload. Additionally, various studies on aircraft trajectory sensor data have been shown to be effective in mobility forecasting and map creation vouros2018big; andrienko2018creating.
A polynomial equation describing ATC as the combination of static sector characteristics and dynamic traffic patterns was proposed by mogford1995complexity. Another metric, called Dynamic Density (DD), was proposed by NASA as a measure of ATC laudeman1998dynamic and quickly became well known. The DD metric is defined as a linear combination of multiple weighted traffic complexity factors (TC), traffic density (TD) and air traffic controller intent (CI).
A study by NASA in 2007 kopardekar2007airspace shows that the performance of the complexity metrics varies across facilities and conducts a further validation of the 52 complexity variables for the Cleveland Centre. The paper by djokic2010air lists all the complexity factors proposed by its predecessors and attempts to reduce the size of that factor set. The research by delahaye2003air focuses on finding a non-linear dynamical model that can differentiate the level of air traffic complexity based on vectorised aircraft locations.
The research mentioned above mainly focuses on the relationship between ATC and air traffic controller workload, which normally arises in the en-route environment. Research on airfield and airport traffic complexity has received very little attention (koros2003complexity, simic2015airport), while the relationship between air traffic complexity and gate-hold delay has received even less tu2008estimating. The study by koros2003complexity focuses primarily on the ATC around the tower area and its influence on the air traffic controllers' workload. It also shows that the relative effectiveness of complexity metrics is site-specific due to the different layouts and configurations of every airport. simic2015airport proposed a new metric called Dynamic Complexity, which contains features describing the layout of a given airport and the traffic interactions in the airport itself and its vicinity.
When predicting the departure delay of flights, the air traffic complexity can be seen as one of the most important parts, since it connects and captures the characteristics of several problems, including the arrival delay propagation problem and the runway sequencing problem. In rebollo2014characterization, a snapshot of the current departure delay state is used to characterise the current network state. This snapshot is a 584-dimensional vector comprising the current departure delay state of each link in a simplified US airport network, and the delay predictions for a later time are made based on the snapshot at the current time. The main focus of xu2008multifactor is to classify the major factors that could cause or absorb flight delay. tu2008estimating discussed the seasonal weather trend and the arrival flight delay propagation effect but neglected factors that may describe the air traffic complexity. In tu2008estimating, departure delay is measured by the discrepancy between the scheduled departure time and the actual departure time from the gate.
In what follows, we define a single flight as a pair of vectors: a feature vector and a label vector. Both vectors contain ordered pairs of attributes and values associated with that flight. All flights are contained in one set, and all flight features and labels belong to their respective sets. The attributes of a feature vector are listed in Table 1 and Table 2. This is also referred to as the reference data source.
|Actual arrival time||
|Wheel on Time||
|Delay carrier (arrival)||
|Delay weather (arrival)||
|Delay security (arrival)||
Meanwhile, the attributes of a label vector are:
Delay Carrier (departure)
Delay Weather (departure)
Delay National Aviation System (departure)
Delay Security (departure)
Delay Late Aircraft Departure (departure)
The Departure Delay is the sum of the other five types of delay. Therefore, the performance metric is based only on the Departure Delay attribute. The remaining attributes are used as additional labels to help with training.
3.2 ATC Features
We define a GPS point as a vector of ordered pairs of attributes and values. The attributes are:
There are also other attributes such as altitude and aircraft type, but we removed these during the data pre-processing step (Section 5.1). These GPS points are the basis for engineered features that capture Air Traffic Complexity (ATC), as done in shao2019flight. In this paper, we propose a novel set of ATC features called TrajCNN Features based on these GPS points, described in Section 5.3.
3.3 Problem definition
Most airlines only have real-time details about their own flights, not about flights from other airlines. For this reason, to predict the Departure Delay, we only use features of the flight in question and not of any other flights. However, to capture the spatio-temporal information on the airport tarmac, airports could produce real-time aggregated and de-identified information without posing a significant privacy or security risk. We refer to this kind of information as ATC, as defined in the previous subsection.
Two factors affect the decision of when to make the prediction. On the one hand, the prediction should be made as late as possible in order to capture the most recent and relevant information. On the other hand, the earlier the prediction, the more useful it is for the airlines and the passengers. For example, early delay predictions allow passengers to adjust their schedules, while late delay predictions might leave no room for adjustment, rendering the prediction useless. We call the duration between the prediction and the scheduled departure the predicting gap. We decided that a realistic predicting gap is four hours: passengers are expected to arrive at the airport three hours before departure, so a four-hour gap gives them time to adjust their schedule before they arrive at the airport.
Our model captures all the tarmac spatio-temporal information within an observation window by selecting the subset of GPS points that fall within that window. In turn, the ATC features are constructed from this subset. Figure 1 illustrates the prediction time, the observation window and the definition of departure delay, which is the difference between the scheduled and the actual departure (gate-out) time.
Formally, for every flight, we predict the Departure Delay using only the features of that particular flight and the ATC constructed from the subset of GPS points within its observation window.
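The selection of GPS points for one flight can be sketched as follows. This is a minimal illustration rather than the pipeline used in the experiments: the four-hour predicting gap comes from the text, while the one-hour window length, the dictionary layout of a GPS point and the function names are assumptions made only for the example.

```python
from datetime import datetime, timedelta

PREDICTING_GAP = timedelta(hours=4)      # stated in the text
OBSERVATION_WINDOW = timedelta(hours=1)  # assumption: window length is illustrative


def observation_window(scheduled_departure,
                       gap=PREDICTING_GAP, window=OBSERVATION_WINDOW):
    """Return (start, end) of the window; `end` is the prediction time."""
    prediction_time = scheduled_departure - gap
    return prediction_time - window, prediction_time


def select_gps_points(points, scheduled_departure):
    """Keep the GPS points whose timestamp lies in [start, end)."""
    start, end = observation_window(scheduled_departure)
    return [p for p in points if start <= p["timestamp"] < end]


# Hypothetical GPS points; only the timestamp matters for the selection.
points = [
    {"timestamp": datetime(2016, 7, 1, 7, 30), "x": 0.1, "y": 0.2},
    {"timestamp": datetime(2016, 7, 1, 8, 59), "x": 0.4, "y": 0.5},
    {"timestamp": datetime(2016, 7, 1, 9, 30), "x": 0.7, "y": 0.8},
]
selected = select_gps_points(points, datetime(2016, 7, 1, 12, 0))
```

For a flight scheduled at 12:00, the prediction is made at 08:00, so only the point recorded at 07:30 is retained.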
4 Dataset Analysis
4.1 Flight Delay
Figure 2(a) shows detailed statistics of the flight delay in our dataset. Note that a departure delay value exists for all flights, not just the delayed ones. Departure delay can also take a negative value, meaning that the aircraft departs from the gate early. Using a very strict definition of on-time, namely departure delay ≤ 0, slightly fewer than half of the flights (49.4%) are on time, while the remainder are delayed by at least one minute. This is consistent with the median delay being 1 minute. However, the data is very skewed: flights are unlikely to leave very early, whereas very long delays are more likely. Thus, the mean delay (even including early departures) is 16 minutes, far higher than the median. The problem of flight delay prediction is made difficult not by the average magnitude of the delay but by its high variability, as measured by the standard deviation of 44.7 minutes.
It might be the case that these delays can easily be explained by a few factors. However, in the remainder of this section, we show that flight delay prediction is a complex problem, as the delay is not easily explained by any of the features alone. Simple combinations of the features are explored in Section 6.3.
The explanatory power of a feature is measured through the RMSE. For a given feature, the RMSE is calculated by taking the average delay of each class of that feature and using it as the prediction for every flight in the class. Note that if we consider the entire dataset as one single class, then the RMSE is equal to the standard deviation by definition. For this reason, we use the RMSE to compare the explanatory power of different features.
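The class-mean RMSE described above can be sketched in a few lines. The delays and class labels below are toy values chosen for illustration, not values from the dataset, and the function name is hypothetical.

```python
import math
import statistics
from collections import defaultdict


def group_mean_rmse(classes, delays):
    """RMSE when each flight is predicted by the mean delay of its class."""
    by_class = defaultdict(list)
    for c, d in zip(classes, delays):
        by_class[c].append(d)
    means = {c: sum(v) / len(v) for c, v in by_class.items()}
    squared_error = sum((d - means[c]) ** 2 for c, d in zip(classes, delays))
    return math.sqrt(squared_error / len(delays))


delays = [-5.0, 0.0, 10.0, 60.0]     # toy departure delays in minutes
days = ["Mon", "Mon", "Tue", "Tue"]  # toy Day of Week classes

# One single class: the RMSE equals the (population) standard deviation.
single_class = group_mean_rmse(["all"] * len(delays), delays)
# Splitting by a feature can only lower (or keep) the RMSE.
by_day = group_mean_rmse(days, delays)
```

The gap between `single_class` and `by_day` is the quantity compared across features in the remainder of this section: the smaller the gap, the less the feature explains.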
4.2 Day of Week and Time of Day
The first features we analyse are the temporal features: Day of Week (Figure 2(b)) and Time of Day (Figure 2(c)). Although the mean and median delays fluctuate throughout the week, the standard deviations remain high. Taking the Day of Week effect into account, the RMSE only goes down by 0.05 from the standard deviation, showing that the Day of Week effect is small.
The Time of Day effect is stronger than the Day of Week effect. In the morning before 10 AM, when the airport is quieter, there are far fewer delays. Nevertheless, the standard deviation within each hour is still very high. Although the RMSE from Time of Day is smaller than that from Day of Week, it is still only 1.1 minutes lower than the standard deviation.
4.3 Airlines and Airports
Next, we look at the effects of airlines, origin airports, and destination airports, as shown in Figure 3. The common theme is that although some airlines and some airports have more delays than others, the variance within each group is still large; the lowest RMSE, obtained with the destination airport, is 43.4.
As there are 77 origin and destination airports, we only show the top 12 in our visualisation and group the remainder into Others. We chose 12 because the combinations of these 12 airports represent more than 50% of all flights.
Finally, we analyse the relationship between several components of weather and flight delay in Figure 4. For numerical variables, we draw the line of best fit and calculate the RMSE from it. The weather condition for most flights is fair, as expected from Californian cities. However, there is still sufficient variability in weather conditions, temperature, humidity, and wind speed. Contrary to expectation, fair weather has the lowest on-time ratio. This can be explained by the fact that the standard deviation of delay during fair weather is higher than that of the whole dataset (45.2 vs 44.7). This again suggests that fair weather is not a significant factor in delay. As in most places, the geography of the city is a major determinant of wind direction, and thus the wind directions are clustered around westerly winds.
In this section, we conclude that temporal, airline, airport, and weather factors alone cannot satisfactorily explain the huge variability in airport delay. This suggests that we might need more information, such as spatial information from within the airport, or that there is a complex interplay of features that necessitates the use of more powerful learners for flight delay prediction.
The architecture of the flight departure delay prediction system is described in this section. As shown in Figure 5, multiple datasets are gathered, including the GPS sensor data of aircraft and vehicles in the tarmac area, the historical flight data from major airline companies, and the associated weather data. Besides containing noise, some datasets had redundant portions and some were incomplete. Therefore, a proper pre-processing procedure is required. To this end, we apply the methods described in shao2019onlineairtrajclus to make the datasets ready for the next stage. More specific details are discussed in Section 5.1.
In the feature extraction stage, various methods are used to extract features from the cleaned datasets. First, the whole ground surface of LAX is partitioned into different areas. Trajectories and GPS points are counted to represent the traffic density of those areas and form the baseline ATC, whereas a more granular 2D histogram method is used to construct the TrajCNN Features. A principal component analysis (PCA) is used to reduce the dimensionality of the weather data for the non-deep learning methods, while the deep learning methods can extract latent features directly from the raw data. We also extract multiple features from the combination of the historical flight scheduling data and the trajectory sensor data.
5.1 Data Pre-processing
As mentioned in the previous section, some datasets contain irrelevant and redundant data. For example, the raw trajectory sensor data comprises the GPS records of both ground vehicles and aircraft, and the historical flight scheduling data contains the ICAO (International Civil Aviation Organization) number. Besides filtering out the irrelevant data points, we also need to restore a continuous trajectory for each flight from the noisy data, which includes random jitter that falls within impossible areas. Therefore, we developed a data pipeline that ignores this random jitter. We also calculated the average speed of the aircraft using the distance between two consecutive GPS records and eliminated the outliers. After that, the pipeline forms a trajectory for each flight based on its call sign and date, since the same call sign is shared by all the flights travelling on the same route. At the end of this stage, each trajectory is labelled as one of three types based on which tarmac area the trajectory passes through. An illustration of the three tarmac types (parking area, apron area, and runway area) is shown in Figure 6. Here, the parking area denotes an area in which most aircraft move slowly. Indeed, we checked the map of the airport and found that this area is used for cargo.
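The speed-based outlier removal can be sketched as below. This is only an illustration under stated assumptions: the record layout `(t_seconds, lat, lon)`, the haversine distance and the 150 m/s threshold are choices made for the example and are not taken from the actual pipeline.

```python
import math

EARTH_RADIUS_M = 6_371_000.0
MAX_SPEED_MPS = 150.0  # assumption: illustrative threshold, not from the paper


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def drop_speed_outliers(records, max_speed=MAX_SPEED_MPS):
    """Drop records implying an implausible average speed.

    `records` is a time-sorted list of (t_seconds, lat, lon). Each record is
    compared with the last record that was kept.
    """
    kept = []
    for rec in records:
        if kept:
            t0, lat0, lon0 = kept[-1]
            dt = rec[0] - t0
            if dt <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_m(lat0, lon0, rec[1], rec[2]) / dt > max_speed:
                continue  # jitter: too far away for the elapsed time
        kept.append(rec)
    return kept


raw = [
    (0, 33.9425, -118.4080),
    (1, 33.9425, -118.4079),
    (2, 33.9525, -118.4079),  # jumps about 1.1 km in one second
    (3, 33.9425, -118.4077),
]
cleaned = drop_speed_outliers(raw)
```

A limitation of this simple scheme is that it trusts the first record of a trajectory; a more robust variant would need to handle a jittered starting point.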
5.2 Feature Extraction
The features we used in this study can be classified into three categories: Weather Features, ATC Features and TrajCNN Features. The first two are defined in Section 3; more details about the extraction of the TrajCNN Features are given in Section 5.3.
We matched every departure flight at LAX with its corresponding arrival flight record in the historical scheduling data. This brings the possible delay propagation effect into consideration, since the delay of the current flight might stem not only from the current situation but also from the status of its previous arrival flight.
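One plausible way to perform this matching is sketched below. The field names (`tail_number`, `scheduled_arrival`, `scheduled_departure`, `arrival_delay`) and the rule of taking the latest arrival of the same aircraft scheduled before the departure are assumptions for illustration; the matching rule actually used is not specified here.

```python
from datetime import datetime


def match_previous_arrival(departure, arrivals):
    """Return the latest arrival of the same aircraft scheduled no later than
    the departure, or None when there is no such record."""
    candidates = [
        a for a in arrivals
        if a["tail_number"] == departure["tail_number"]
        and a["scheduled_arrival"] <= departure["scheduled_departure"]
    ]
    return max(candidates, key=lambda a: a["scheduled_arrival"], default=None)


# Hypothetical schedule records.
arrivals = [
    {"tail_number": "N101", "scheduled_arrival": datetime(2016, 7, 1, 6, 0),
     "arrival_delay": 5},
    {"tail_number": "N101", "scheduled_arrival": datetime(2016, 7, 1, 10, 0),
     "arrival_delay": 35},
    {"tail_number": "N202", "scheduled_arrival": datetime(2016, 7, 1, 11, 0),
     "arrival_delay": 0},
]
departure = {"tail_number": "N101",
             "scheduled_departure": datetime(2016, 7, 1, 11, 30)}
previous = match_previous_arrival(departure, arrivals)
unmatched = match_previous_arrival(
    {"tail_number": "N999", "scheduled_departure": datetime(2016, 7, 1, 11, 30)},
    arrivals,
)
```

The arrival delay of the matched record (35 minutes in this toy case) can then be attached to the departure as a delay propagation feature.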
For the ATC data, we extracted features for each flight during its observation window. As shown in Figure 1, the observation window is a period before the prediction time. The features we extracted from the ATC data are listed in Table 3. These features comprise a representation of the traffic density across all three aforementioned tarmac areas. The number of potential landing/take-off aircraft is also calculated to help reveal the complexity.
For the non-deep learning methods, we performed a principal component analysis (PCA) on the weather data to reduce the dimensionality. PCA is a statistical procedure that reduces the dimension of a feature set while retaining most of its information by transforming the features into another set of linearly uncorrelated variables li2014principal. After this process, the weather data in Table 2 is reduced to 18 principal components while retaining most of its information. Furthermore, we use the most recent weather data within or near the observation window for each flight.
Graph embedding is widely used in deep learning models, especially for trajectory data chen2019trip2vec; wang2018efficient. However, our model uses the raw data: we compared the results obtained with the raw data and with graph embeddings and found the raw data to be the better option. This is because it is difficult to extract an accurate graph network from the airport.
We converted the Wind Direction attribute, which was initially a categorical attribute (e.g. N/S/W/E/NW, etc.), to bearings in radians. We applied multiple machine learning approaches to estimate the departure delay with different data. We found that weather data, in most experiments, makes no contribution to the prediction accuracy. Therefore, we did not consider it in our deep learning model.
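The conversion of the Wind Direction attribute to bearings can be sketched as follows. The 16-point compass and the convention of measuring bearings clockwise from north are assumptions; the label set of the weather source may differ (for instance, it may contain calm or variable winds, which this sketch rejects with a `ValueError`).

```python
import math

# Assumption: 16-point compass, listed clockwise starting from north.
COMPASS_POINTS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
                  "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]


def wind_direction_to_radians(label):
    """Map a compass label to a bearing in radians, clockwise from north."""
    index = COMPASS_POINTS.index(label.upper())  # ValueError if unknown
    return index * 2 * math.pi / len(COMPASS_POINTS)


bearing_east = wind_direction_to_radians("E")
bearing_northwest = wind_direction_to_radians("NW")
```

Because a bearing is circular, a model that treats it as an ordinary number sees N (0) and NNW (close to 2π) as far apart; encoding the sine and cosine of the bearing is a common remedy, although the text does not state whether this was done.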
5.3 TrajCNN Features
After pre-processing the GPS dataset, we used it to construct the TrajCNN Features, which capture the spatio-temporal information from the airport tarmac for our deep learning model. For every flight, we first selected the subset of GPS points that falls within the flight's observation window. Then, we divided the airport into a 28 × 28 grid and computed three 2D histograms. We chose a grid size of 28 as it is one of the common resolutions for CNNs, as in MNIST. The first channel is the counting channel, while the subsequent two channels are the sums of the x- and y-velocity components in each grid cell, respectively. We then use a global scaler to scale the pixel values between 0 and 1, and stack the three channels into one image. This image is the TrajCNN Features associated with the flight.
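The construction of the three-channel image can be sketched in plain Python as follows. The point layout `(x, y, vx, vy)`, the bounding box, and the per-channel scaling ranges are assumptions made for the example; in particular, the text does not state whether the global scaler uses one range for all channels or one per channel.

```python
GRID = 28  # grid size stated in the text


def trajcnn_image(points, bounds, grid=GRID):
    """Build three grid x grid histograms: count, sum of vx, sum of vy.

    `points` is an iterable of (x, y, vx, vy); `bounds` is
    (x_min, x_max, y_min, y_max). Returns nested lists [channel][row][col].
    """
    x_min, x_max, y_min, y_max = bounds
    channels = [[[0.0] * grid for _ in range(grid)] for _ in range(3)]
    for x, y, vx, vy in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # point lies outside the airport bounding box
        col = min(int((x - x_min) / (x_max - x_min) * grid), grid - 1)
        row = min(int((y - y_min) / (y_max - y_min) * grid), grid - 1)
        channels[0][row][col] += 1.0
        channels[1][row][col] += vx
        channels[2][row][col] += vy
    return channels


def scale_channels(channels, ranges):
    """Scale every pixel to [0, 1] with fixed (lo, hi) ranges per channel.

    Assumption: the ranges are computed once over the whole training set.
    """
    return [
        [[(min(max(v, lo), hi) - lo) / (hi - lo) for v in row] for row in ch]
        for ch, (lo, hi) in zip(channels, ranges)
    ]


points = [(0.0, 0.0, 1.0, -1.0), (0.0, 0.0, 2.0, 0.5), (9.99, 9.99, -3.0, 0.0)]
raw = trajcnn_image(points, bounds=(0.0, 10.0, 0.0, 10.0))
image = scale_channels(raw, ranges=[(0.0, 2.0), (-3.0, 3.0), (-1.0, 1.0)])
```

With a symmetric range for the velocity channels, an empty cell maps to 0.5, so the sign of the net velocity in a cell is preserved after scaling.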
Support Vector Machine (SVM) is a technique that was developed by Vladimir Vapnik in 1995 smola2004tutorial. It constructs a hyperplane in a high-dimensional feature space and was originally proposed for classification problems. Later, Drucker et al. drucker1997support modified it to make it suitable for regression tasks. This newer version is known as Support Vector Regression (SVR). The most crucial step in this method is the construction of the hyperplane. Exploiting kernel methods to project the features into a higher-dimensional space can make the data linearly separable and improve the results. Two major parameters govern this model: one affects how closely the model can fit the data, and the other controls the model's tolerance of error. A poorly chosen value degrades the model's generalisation ability.
The Multilayer Perceptron (MLP) is a feed-forward neural network with a simple structure. Each layer contains several units, and one or more hidden layers sit between the input and the output layers haykin1994neural. An activation function can be applied to each node in this model, except those in the input layer. Introducing the activation function improves the model's ability to fit non-linear relationships. Increasing the number of hidden layers can serve the same goal but might also cause over-fitting. In our experiments, the MLP model has two hidden layers.
Gradient Boosting Decision Tree (GBDT) has been a popular machine learning algorithm in recent years and is an effective way of building a predictive model. The basic idea is to generate multiple weak learners and combine them in an additive model to minimise the objective function, which is typically formulated as a loss function. In this paper, we use a newer variant, LightGBM, recently developed by Microsoft ke2017lightgbm, as the GBDT framework for the experiment. Although there are other implementations such as XGBoost and pGBRT, we chose LightGBM for several of its improved features:
The use of histogram-based algorithms removes the need to go through all discrete values, which results in faster training and lower memory consumption.
The implementation of advanced network communication algorithms that utilise multi-core processors for parallel learning.
As shown in Figure 7, the primary process in LightGBM training is to repeatedly create an estimator, which is a weak decision tree, and to evaluate the remaining loss based on all the weak trees created so far.
Since departure delay prediction is a regression task, we use the Root Mean Squared Error (RMSE) as the loss function.
The Convolutional Neural Network (CNN) is one of the first architectures in the now ubiquitous deep learning paradigm lecun1989backpropagation; le1989handwritten. It has been shown to work well with image and image-like data simonyan2014very; he2016deep; krizhevsky2012imagenet. As a result, many have attempted to engineer image-like features for various problems tas2018cnn. These include spatio-temporal prediction problems such as traffic prediction yao2019revisiting; wang2016traffic; ma2017learning.
The airport tarmac contains spatio-temporal information that can aid in solving the flight delay prediction problem shao2019flight. Unlike the previous work, which divides the airport into tailored areas, we construct a novel TrajCNN feature that captures more information at a higher granularity, as we perform less aggregation. The deep learning architecture can pick up latent spatio-temporal features that might be lost during the coarse-level aggregation of the previous work.
The CNN architecture is as follows. The TrajCNN Features are first fed into two convolutional blocks. Each block consists of a convolutional layer, a ReLU activation layer, and a 2 × 2 max-pooling layer; each convolutional layer has only one 3 × 3 filter. After flattening, the network outputs the ATC latent features, which capture the delay-related complexities of the airport. We fuse these with the remaining flight features and use the result as the input to a fully connected multilayer perceptron (MLP) with the delays as the output. The details of our architecture are shown in Figure 8.
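As a worked example of the shapes involved, the size of the flattened ATC latent vector can be computed from the block structure. The padding and stride of the convolution are not stated, so both are assumptions here: stride 1 throughout, with either no padding or a padding of one pixel.

```python
def conv_out(size, kernel=3, padding=0, stride=1):
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial output size of non-overlapping max-pooling."""
    return size // window


def latent_size(input_size=28, n_blocks=2, n_filters=1, padding=0):
    """Length of the flattened output after the convolutional blocks."""
    size = input_size
    for _ in range(n_blocks):
        size = pool_out(conv_out(size, padding=padding))
    return size * size * n_filters


no_padding = latent_size()             # 28 -> 26 -> 13 -> 11 -> 5
same_padding = latent_size(padding=1)  # 28 -> 28 -> 14 -> 14 -> 7
```

Under these assumptions the latent vector has 25 entries without padding and 49 entries with a padding of one pixel, which is small compared with the 429 nodes of the fully connected layer reported in Table 4.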
6 Experiment and Results
In this section, we describe in detail all the data we used for our prediction.
6.1.1 Schedule Table Data
The first dataset consists of reference data provided by the Bureau of Transportation Statistics of the United States Department of Transportation historicaldata. This is an open dataset, from which we extracted the relevant parts, comprising all flight records from the 1st of July to the 31st of August. Both arrival and departure flight information for LAX is included, since we can extract features about the delay propagation effect from the former, while the latter provides the ground truth of the departure delay that we try to predict. In addition to the overall flight delay (both arrival and departure), the delay is partitioned into five types according to cause. The dataset also contains the Call-sign/Tail-number, which helps us to match it with our GPS trajectory data.
6.1.2 Weather Data
The second dataset in this experiment is reference data collected from the website weathersource. This weather data mainly covers Los Angeles International Airport (LAX) and some surrounding regions. It is gathered hourly, which provides fine-grained tracking of the weather conditions. The essential numerical features are the temperature and the humidity level. Meanwhile, the categorical features, Condition and Wind Direction, are critical for later stages. We apply one-hot encoding to the categorical features. More details are shown in Table 1.
6.1.3 Airport GPS Trajectories Sensor Data
The third reference dataset is a private dataset consisting of GPS observations of all vehicles and aircraft at Los Angeles International Airport (LAX). It is collected by the System Wide Information Management (SWIM) program of the United States' Federal Aviation Administration (FAA). The trajectories of all vehicles can be restored from this dataset. In the spatial domain, it covers the entire tarmac area of LAX, with altitudes of up to 15000 metres. In the temporal domain, it contains seven weeks of data, from the 1st of July, 2016, to the 18th of August, 2016. Around 11 million GPS points fall within this range, including both ground vehicle and aircraft locations. We managed to restore 43,503 trajectories from it, belonging to 6,518 vehicles.
6.2 Experimental Setup
We withhold the last two weeks of the dataset (28.6%) for testing and use the remaining five weeks (71.4%) for training and validation. We performed cross-validation by dividing the dataset into 5 folds. We chose this method of cross-validation instead of a random split because the data points are temporally correlated with one another, and we want to avoid temporal correlation between folds.
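A temporal split of this kind can be sketched as follows. The start date and the seven-week span come from the GPS dataset description; forming one fold per calendar week is an assumption, since the text gives the number of folds but not how they are formed.

```python
from datetime import date, timedelta

START = date(2016, 7, 1)  # first day of the GPS data
N_WEEKS = 7               # seven weeks of data
N_TEST_WEEKS = 2          # last two weeks are withheld for testing


def week_index(day, start=START):
    """Zero-based index of the week that `day` falls in."""
    return (day - start).days // 7


def split_by_week(days):
    """Return (folds, test): one fold per training week, plus the test days.

    Assumption: each of the 5 folds is one contiguous week.
    """
    n_train_weeks = N_WEEKS - N_TEST_WEEKS
    folds = {w: [] for w in range(n_train_weeks)}
    test = []
    for d in days:
        w = week_index(d)
        if 0 <= w < n_train_weeks:
            folds[w].append(d)
        elif n_train_weeks <= w < N_WEEKS:
            test.append(d)
    return folds, test


days = [START + timedelta(days=i) for i in range(7 * N_WEEKS)]
folds, test = split_by_week(days)
test_share = round(100 * len(test) / len(days), 1)
```

Keeping every fold contiguous in time means that flights from the same day never appear on both sides of a validation split, which is the property the temporal split is meant to guarantee.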
The hyper-parameter search space for the MLP consists of:
Number of nodes sampled from the following probability distribution:is distributed randomly over where is a normalisation constant
Number of layers sampled from the following probability distribution:is distributed randomly over where is a normalisation constant
Adam learning rate is sampled from the following probability distribution: is distributed randomly over where is a normalisation constant
Where is a nearest integer rounding function.
The hyper-parameters for LightGBM in this experiment are shown in Table 4, which also summarises the hyper-parameters of the other models. The three most important parameters are the learning rate, the number of estimators, and the number of leaves. We also applied early stopping to avoid overfitting. All other parameters were left at their default settings. All hyper-parameters are held unchanged throughout the whole testing procedure, which includes a 5-fold cross-validation experiment and the subsequent application to the test dataset.
Since the training process for deep learning models is time-consuming, we did not perform 5-fold cross-validation. Instead, we randomly split the data with a ratio of 7:1:2 for training, validation and testing, respectively. Our hyper-parameter search space is as follows:
The number of fully connected layers is sampled from the following probability distribution: is distributed randomly over where is a normalisation constant.
The number of nodes in the fully connected layer is sampled from the following probability distribution: is distributed randomly over where is a normalisation constant.
The number of convolutional layers is sampled from the following probability distribution: is distributed randomly over where is a normalisation constant.
The number of filter in convolutional layer is sampled from the following probability distribution: is distributed randomly over where is a normalisation constant.
The batch size is sampled from the following probability distribution: is distributed randomly over where is a normalisation constant.
The Learning rate is sampled from the following probability distribution: is distributed randomly over where is a normalisation constant.
We performed around 100 experiments and chose the hyper-parameter set with the best validation RMSE. The best hyper-parameters are listed in Table 4. We then used this hyper-parameter set to train the final model on the combined training and validation data. Finally, we evaluated the final model on the test dataset.
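The search procedure described above (a random 7:1:2 split, around 100 sampled configurations, selection by validation RMSE) can be sketched as follows. This is only an illustration under stated assumptions: the sampling ranges and the log-uniform form are hypothetical, since the exact distributions are not reproduced here, and `toy_validation_rmse` is a stand-in for actually training a model and scoring it on the validation split.

```python
import math
import random

def split_indices(n, ratios=(7, 1, 2), seed=0):
    """Randomly split range(n) into train/validation/test index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def sample_params(rng):
    """Draw one configuration. All ranges below are assumptions."""
    def log_uniform_int(lo, hi):
        # Sample uniformly in log space, then round to the nearest integer.
        return round(math.exp(rng.uniform(math.log(lo), math.log(hi))))
    return {
        "n_fc_layer": rng.randint(1, 3),
        "n_fc": log_uniform_int(16, 2048),
        "n_conv_layer": rng.randint(1, 3),
        "n_conv": log_uniform_int(1, 64),
        "batch_size": 2 ** rng.randint(4, 8),
        "learning_rate": math.exp(rng.uniform(math.log(1e-4), math.log(1e-1))),
    }

def random_search(validation_rmse, n_trials=100, seed=0):
    """Return the sampled configuration with the lowest validation RMSE."""
    rng = random.Random(seed)
    best_params, best_rmse = None, float("inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        rmse = validation_rmse(params)
        if rmse < best_rmse:
            best_params, best_rmse = params, rmse
    return best_params, best_rmse

def toy_validation_rmse(params):
    # Hypothetical objective: replaces "train the model, score on validation".
    return 17.0 + abs(math.log10(params["learning_rate"]) + 2.0)

train_idx, val_idx, test_idx = split_indices(1000)
best_params, best_rmse = random_search(toy_validation_rmse)
print(best_params, best_rmse)
```

The test indices are never touched during the search; they are used only once, after the final model has been retrained on the training and validation data together.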
Table 4: Hyper-parameters of the models.
|Hyper-parameter||Name||Value|
|The error term||C||10000|
|Num of epoch|||||
|Early stopping round||n_patience||50|
|Num of layers||n_layer||2|
|Num of node||n_node||1553|
|Adam learning rate||adam_learning_rate||0.00976563|
|Number of estimators||n_estimators||16000|
|Number of leaves||num_leaves||39|
|Num of fully connected layer||n_fc_layer||1|
|Num of node in fully connected layer||n_fc||429|
|Num of convolutional layer||n_conv_layer||2|
|Num of filter in convolutional layer||n_conv||1|
|Num of epoch||n_epoch||200|
|Early stop patience||n_early_stop_patience||20|
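The early stopping entries in the table refer to a patience-based rule. A minimal sketch of such a rule is given below; the function name and the toy validation curve are assumptions, and the exact stopping criterion used by each library may differ in detail.

```python
def early_stopping_epoch(val_rmse_per_epoch, patience):
    """Return (best_epoch, best_rmse), stopping once the validation RMSE
    has not improved for `patience` consecutive epochs."""
    best_rmse = float("inf")
    best_epoch = -1
    for epoch, rmse in enumerate(val_rmse_per_epoch):
        if rmse < best_rmse:
            best_rmse, best_epoch = rmse, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_rmse

# Toy validation curve (hypothetical values, in minutes).
curve = [20.0, 18.0, 17.5, 17.6, 17.7, 17.9, 16.0]
result = early_stopping_epoch(curve, patience=3)
print(result)  # (2, 17.5): training stops before the last value is seen
```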
6.3 Experimental Results
In this section, we report three sets of experiments. In the first, we evaluate the prediction results using different combinations of data sources and machine learning models. In the second, we test the temporal sensitivity of the learning model and validate its robustness in flight delay time prediction. In the last, we compare the importance of the different features that we extracted and created from the historical, ATC, and weather datasets. Together, these experiments evaluate the performance of our proposed flight delay prediction framework and the importance of our proposed ATC features.
In the first set of experiments, we apply four conventional machine learning regressors, Linear Regression (LR), Support Vector Regressor (SVR), Multilayer Perceptron (MLP), and LightGBM, together with our proposed deep learning architecture, TrajCNN, to predict the flight delay using different combinations of data sources (historical, weather, and GPS data). We use a standard regression evaluation metric, the Root Mean Square Error (RMSE), to measure the performance of flight delay time prediction across models and combinations of data sources. A lower RMSE indicates a smaller error between the predicted value and the ground truth.
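For reference, the RMSE can be computed as in the short sketch below. The delay values are made up for illustration and are not taken from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical departure delays in minutes.
actual_delay = [5.0, 30.0, 0.0, 12.0]
predicted_delay = [8.0, 26.0, 0.0, 12.0]
error = rmse(actual_delay, predicted_delay)
print(error)  # 2.5
```

Because the errors are squared before averaging, a few badly mispredicted flights raise the RMSE more than many small errors do.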
As shown in Table 5, LightGBM has the best overall performance. There is a clear performance boost after adding spatiotemporal information, either in the form of ATC features or TrajCNN features. In fact, without the spatiotemporal information, the RMSE of most algorithms is only around the standard deviation of the label.
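The comparison with the standard deviation of the label can be made concrete. A model that always predicts the mean delay has an RMSE equal to the (population) standard deviation of the label, so an RMSE close to that value means the model has learned little beyond the mean. The symbols $y_i$, $\bar{y}$, $n$, and $\sigma_y$ below are notation introduced here for illustration:

```latex
\mathrm{RMSE}(\bar{y})
  = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}
  = \sigma_y,
\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
```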
The best results come from the experiments that combine flight reference data, ATC features, and weather data. These results indicate that the ATC features play a significant role in predicting flight departure delay.
Linear Regression performs poorly on non-linear features such as the ATC features, which the other algorithms are able to exploit well. Moreover, since we linearly decorrelate the data using PCA, Linear Regression receives only a minor boost in performance.
In the second set of experiments, we vary two temporal parameters to validate the robustness of the prediction results. We use only LightGBM in this experiment because it is one of the best performers and has the shortest training time. The first parameter is the length of the observation window. As illustrated in Figure 1, the observation window spans from the time point at which we start to collect ATC and weather data to the time point at which we predict the flight delay. The second parameter is the delay-predicting gap, which spans from the prediction time to the gate-out time (when the flight leaves the gate). Intuitively, a longer delay-predicting gap is likely to lead to worse results because of higher uncertainty. Meanwhile, a shorter observation window means that less training data is used in the model, which is also likely to degrade performance.
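The relationship between the two temporal parameters can be sketched as follows. This is an illustration under assumptions: the function and field names are hypothetical, and the gap is measured back from the scheduled gate-out time.

```python
from datetime import datetime, timedelta

def prediction_timeline(gate_out, gap_hours, window_hours):
    """Return (window_start, prediction_time) for a given gate-out time."""
    # The prediction is made `gap_hours` before gate-out.
    prediction_time = gate_out - timedelta(hours=gap_hours)
    # Data collection starts `window_hours` before the prediction time.
    window_start = prediction_time - timedelta(hours=window_hours)
    return window_start, prediction_time

def records_in_window(records, window_start, prediction_time):
    """Keep only records observed inside the observation window."""
    return [r for r in records if window_start <= r["time"] < prediction_time]

gate_out = datetime(2019, 7, 1, 16, 0)  # hypothetical flight
window_start, prediction_time = prediction_timeline(
    gate_out, gap_hours=4, window_hours=2
)
records = [
    {"time": datetime(2019, 7, 1, 9, 30)},   # before the window
    {"time": datetime(2019, 7, 1, 10, 15)},  # inside the window
    {"time": datetime(2019, 7, 1, 11, 59)},  # inside the window
    {"time": datetime(2019, 7, 1, 12, 30)},  # after the prediction time
]
usable = records_in_window(records, window_start, prediction_time)
print(window_start, prediction_time, len(usable))
```

Only data observed before the prediction time is used, so nothing recorded during the delay-predicting gap can leak into the features.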
In summary, the prediction accuracy remains stable even with less training data. Furthermore, even when the prediction is made four hours ahead of the flight delay event, the prediction error (around 17 minutes) is acceptable.
In the third experiment, we compare the importance of the features drawn from the ATC dataset, the weather condition dataset, and the historical dataset. We use the importance score recorded by the LightGBM regressor, which counts the number of times a specific feature is used to ‘split’ a tree featureimportance. The higher this value, the more information the feature provides to help the model differentiate between situations.
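Split-count importance can be illustrated on toy trees. The sketch below is not LightGBM itself: the nested-dictionary tree format and the feature names are hypothetical, chosen only to show what is being counted. In LightGBM this measure corresponds to the `split` importance type.

```python
from collections import Counter

def count_splits(node, counts):
    """Recursively count how often each feature is used at a split node."""
    if "feature" not in node:  # leaf node: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

def split_importance(trees):
    """Total number of splits per feature over all trees in the ensemble."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts

leaf = {"value": 0.0}
trees = [
    {"feature": "scheduled_departure_time",
     "left": {"feature": "n_aircraft_on_ground", "left": leaf, "right": leaf},
     "right": leaf},
    {"feature": "scheduled_departure_time",
     "left": leaf,
     "right": {"feature": "wind_speed", "left": leaf, "right": leaf}},
]
importance = split_importance(trees)
print(importance.most_common())
```

A split count says how often a feature is used, not how much each split reduces the error, so it is best read as a relative ranking.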
Figure 10 shows the feature importance after the training process. Firstly, a historical feature (Scheduled Departure Time) remains the most important feature in predicting departure delay. Secondly, most of our selected ATC features have significantly higher importance than the weather features, which supports the result of the first experiment that the ATC features contribute more to this task.
7 Discussion and Future Work
Our proposed features and framework achieve an acceptable prediction performance on the flight departure delay problem. However, many limitations remain, along with potential directions for future work. Firstly, this work mainly focuses on predicting the departure delay using airport traffic data. We did not consider many other traditional factors such as air traffic or the airport traffic control system. Secondly, because of the difficulty of collecting data from airports, we only used data from one airport, which limits the generalisation of our proposed methods. Although our proposed solutions do not rely on any specific factor of this airport, more diverse data are still needed to verify this. Additionally, seasonal patterns play an important role in flight delay. However, because of the range of the data provided by the FAA, we evaluated our system using only a few months of data. We plan to apply our proposed framework to different airports over a longer period in future work. Thirdly, the proposed deep learning network can still be improved with more state-of-the-art techniques, such as an attention mechanism, and can be given temporal prediction ability through recurrent components. Fourthly, because the purpose of this paper is to estimate the departure delay using only on-ground data at the airport, many factors such as the conditions of the air route, the destination, and the airline (hub or non-hub) are not taken into account, although they can also significantly affect the departure time. We plan to incorporate these factors with on-ground GPS data to achieve higher accuracy in the future. Additionally, given the rapid development of the deep learning area, many state-of-the-art CNN-based models have not yet been applied to this data. Lastly, the existing work predicts the flight delay 4 hours before the scheduled departure time based on the observation window. However, passengers or airport traffic controllers sometimes need predictions over a longer horizon. We plan to adapt the existing model to these various needs and to long-term prediction in the future.
We have proposed to use airport traffic complexity (ATC) data combined with traditional environmental factors to predict the flight departure delay from heterogeneous sensor data. We have designed, implemented, and evaluated an end-to-end deep learning-based framework to infer the delay time from a sequence of snapshots of aircraft trajectories at the airport. Our evaluation shows that our approach estimates flight departure delay with an average deviation from the ground truth of less than 20 minutes. We have also demonstrated that the airport situational awareness map plays a more critical role in departure delay time prediction than weather conditions, and we have analysed the correlation between flight departure delay and ATC components. Finally, we have demonstrated that LightGBM outperforms other classic regressors when using ATC features, and that a sequence of aircraft snapshots can be used to represent the aircraft trajectories and predict the flight delay.
This research is funded by Northrop Grumman Corporation USA for the “Spatio-temporal Analytics for Situation Awareness in Airport Operations” project, RMIT University. We would also like to acknowledge the support of the CSIRO Data61 Scholarship for Arian Prabowo and the RMIT Research Stipend Scholarship (RRSS) for Sichen Zhao. We also acknowledge the support of ARC Discovery Project DP190101485.