As a dynamic system subject to uncertainty, civil aviation frequently faces unexpected events that can result in flight delays or cancellations. When delay propagates from an initially delayed flight to downstream flights, the impact can grow exponentially if air traffic control cannot adopt an appropriate resource re-allocation strategy to restore mobility in time. Hence, estimating delay as accurately as possible within a controllable time window is critical.
Traditional methods of aviation delay prediction rely on modeling and simulation techniques, but it is hard to select modeling assumptions that guarantee the quality of the analysis. With the large-scale deployment of Internet of Things (IoT) devices, the rapid collection of ubiquitous spatial and temporal data motivates the use of big data analytics and statistical machine learning in many fields [7, 8, 9, 10, 11, 12, 13, 14]. A few studies on aviation delay prediction based on supervised machine learning have been conducted in recent years. For instance, Manna et al.
employ a Gradient Boosted Decision Tree to predict the average arrival and departure delays of a day. However, outliers are removed from the dataset before training, so data completeness and model generalization are not guaranteed. Gopalakrishnan et al.
compare the performance of Markov Jump Linear Systems, Classification and Regression Trees, and Artificial Neural Networks for predicting delays in air traffic networks, and reveal a trade-off between model simplicity and prediction accuracy. Although promising results are obtained with statistical machine learning methods, crucial temporal elements, in particular the delay propagation effect, cannot be learned by such one-shot prediction models.
Compared with previous work based on RNN, this paper makes three main contributions:
Unlike other systems that predict delay before departure or implicitly assume that the journey of each aircraft does not vary significantly, we fully account for the indeterminacy of the air transportation system and address the aviation delay prediction problem over a time horizon of one hour.
We present a complete data-manipulation workflow. Some work [20, 18] provides comparative analyses for delay prediction using several models including RNNs, but focuses more on non-time-series methods, especially boosting and bagging, whereas the input format for RNNs is far from that of the other algorithms. Moreover, although a prior data processing pipeline is described in detail, it creates day-to-day sequences rather than flight-to-flight or minute-to-minute sequences as we do, which does not suit real-time air traffic control.
Multi-source data are integrated and extended to generate a spatio-temporal dataset with richer information in spatial and temporal domains.
The remainder of this paper is structured as follows. Section II formulates the aviation delay prediction problem and reviews the fundamental elements of the LSTM network. Section III introduces the data sets we use and the corresponding feature engineering. Then, our proposed LSTM-based architecture is introduced. The experimental setup and results are demonstrated in Section IV that show the accuracy of our solution, while Section V concludes this paper and contemplates some future work.
II Problem Statement and LSTM Network
Aviation delay is affected by multiple factors. Some of them are unpredictable, such as military training activities, equipment failures at the airport, and extreme weather. However, there are still valuable temporal contexts that can be retrieved for delay prediction. For example, the late arrival or departure of a previous flight will affect the on-time departure and arrival of succeeding flights. Such a pattern motivates us to model aviation delay as a multivariate time series problem.
II-A Problem Formulation
We select an airport and the flights flying to it to formulate the arrival delay estimation problem at that airport.
Consider the prediction function $f$ with input $X = \{x_{t-N+1}, \dots, x_t\}$, $x_i \in \mathbb{R}^D$ for $t-N+1 \le i \le t$, where (i) $N$ denotes the number of time stamps, (ii) $D$ denotes the number of variables (the feature dimension), and (iii) the ground truth is the delay $y_{t+\Delta t}$, where $\Delta t$ is a self-defined time interval. Therefore, our multivariate sequential data are transformed into a supervised learning problem. Specifically, for a sequence starting at time stamp $t-N+1$, the information in the period $[t-N+1, t]$ is used to predict the delay at time stamp $t+\Delta t$, as shown in Fig. 1.
Vanilla RNNs are suited to sequential prediction problems but have difficulty overcoming exploding and vanishing error flow when learning long-term dependence. Therefore, a variant of RNNs named LSTM was designed to address these problems and has enabled significant advances in applications [25, 8]. In this paper, we feed our data to the LSTM model, whose basic cell structure is shown in Fig. 2. The gate mechanism is a key component of the LSTM structure: gates are a track that optionally lets information (hidden state $h_t$ and cell state $C_t$) through. There are three gates in an LSTM cell, the Forget Gate, Input Gate and Output Gate, whose functions are represented by the following equations:
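In standard notation, with $x_t$ the input at step $t$, $[h_{t-1}, x_t]$ the concatenation of the previous hidden state and the current input, $W_\ast$, $b_\ast$ learned weights and biases, and $\odot$ the element-wise product:

```latex
\begin{align}
f_t &= \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{align}
```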
$\sigma$ is the sigmoid activation function, which observes the states at the previous step and outputs a number between 0 and 1 to control the degree of remaining information flow. $i_t$ is the input gate, which is combined with the candidate value $\tilde{C}_t$ produced by a tanh layer to update the state; the old cell state $C_{t-1}$, representing long-term memory, is then replaced by the new one as shown in Equation 4. $h_t$ is the final output, calculated as the product of $o_t$ and $\tanh(C_t)$, and serves as the input to the next LSTM cell.
In this section, data sets used for aviation delay prediction are introduced, followed by feature engineering that constructs a proper set representing the underlying problem of the predictive task for improved model generalization. Besides, the design of our proposed LSTM-based architecture is explained.
III-A Data Description
In this paper, we use data from the first day of each month from July 2016 through February 2017, in Coordinated Universal Time (UTC). Note that, to guarantee the spatio-temporal continuity of the data, only records whose Arrival Airport is Hartsfield-Jackson Atlanta International Airport (ATL) are selected. Outline descriptions of the features extracted from each domain are illustrated in Fig. 3.
III-A1 Flight data
The flight records, including delay information, are provided by the United States Department of Transportation. The data contains the most important information, for example Departure/Arrival Airport, Scheduled Departure/Arrival Time, Actual Departure/Arrival Time and Airline.
III-A2 Trajectory data
In the United States, aircraft trajectories are collected continuously by automatic dependent surveillance-broadcast (ADS-B), a surveillance technology used by the Federal Aviation Administration (FAA) for air traffic control (ATC). The data from ADS-B Exchange covers all flights equipped with ADS-B, which includes most commercial aviation. The fields of the ADS-B data include Aircraft Identification, position information (Longitude, Latitude, Altitude), flight status (such as Aircraft Speed and Track Angle) and aircraft attributes (such as Aircraft Model and Manufacturer's Name).
III-A3 ATC data
III-A4 Weather data
The weather conditions along the air route are a significant factor for the delay prediction task. Therefore, we gather Local Climatological Data (LCD) from the National Oceanic and Atmospheric Administration (NOAA). The flight-related fields include Temperature, Precipitation, Humidity, Sky Conditions, Wind Speed, Wind Direction and so on.
III-B Feature Engineering
As a preprocessing step, feature engineering refers to the process of using domain knowledge and statistical approaches to select and create representative features. We aim to construct a cleansed data set with a relatively high-dimensional feature space so that the trained model performs better on unseen data.
III-B1 Feature selection
Features whose percentage of missing values exceeds a threshold (80%) are removed. Then, we calculate the Pearson correlation of each pairwise variable combination and remove redundant features beyond a threshold (80%). Among the remaining features, the presumably useful ones are selected according to our domain knowledge and task needs.
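The two statistical filters above can be sketched as follows; `select_features` and its thresholds are illustrative names for this paper's procedure, not code from the authors:

```python
import pandas as pd

def select_features(df: pd.DataFrame,
                    missing_thresh: float = 0.8,
                    corr_thresh: float = 0.8) -> pd.DataFrame:
    """Drop features with too many missing values, then drop one feature
    from each highly correlated (redundant) pair."""
    # 1) Remove columns whose fraction of missing values exceeds the threshold.
    keep = df.columns[df.isna().mean() <= missing_thresh]
    df = df[keep]

    # 2) Remove one feature of each pair whose |Pearson r| exceeds the threshold.
    corr = df.corr().abs()
    cols = list(corr.columns)
    drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_thresh and cols[j] not in drop:
                drop.add(cols[j])
    return df.drop(columns=list(drop))
```

The surviving columns would then be screened manually using domain knowledge, as the text describes.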
III-B2 Congestion index
The congestion indices of ground and airspace are calculated using flight data and trajectory data, respectively. The ground congestion indices are computed by counting departures and arrivals within 10-minute bins. Congestion indices of the airport (low-altitude) airspace are obtained in a similar way, but we use ADS-B data because it contains both commercial airplanes and private jets, which makes the results more accurate. Specifically, we count the flight records whose aircraft-to-ATL distance is less than 200 km and whose altitude ranges from 1,200 to 10,000 feet above mean sea level (MSL). The distance is obtained by the Haversine formula given the longitude and latitude of ATL and of an airplane:

$$d = 2R \arcsin\!\left(\sqrt{\sin^2\frac{\varphi_2-\varphi_1}{2} + \cos\varphi_1\cos\varphi_2\sin^2\frac{\lambda_2-\lambda_1}{2}}\right)$$

where $d$ is the great-circle distance between two points on a sphere with longitudes and latitudes $(\lambda_1, \varphi_1)$ and $(\lambda_2, \varphi_2)$, and $R$ is the radius of the Earth. Besides, we split the airspace beginning at 18,000 feet above MSL into equal-sized sectors, and the number of aircraft in each sector per 10 minutes is defined as the congestion index of the en-route airspace.
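The Haversine distance and the 200 km proximity test can be implemented directly; the ATL coordinates below are approximate and the function names are ours:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Example: flag aircraft within 200 km of ATL (coordinates approximate).
ATL_LON, ATL_LAT = -84.4277, 33.6407

def near_atl(lon: float, lat: float, radius_km: float = 200.0) -> bool:
    return haversine_km(lon, lat, ATL_LON, ATL_LAT) < radius_km
```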
III-B3 Encoding discrete and categorical features
To ensure the consistency of the sequence length, we need to summarize the information on succeeding flights, because the number of flights is uncertain at every point in time. However, it is inappropriate to simply take the average of these features, because our data set contains discrete and categorical variables. Considering the interpretability and diversity of these features, a hybrid encoding strategy is adopted in this paper. More concretely, the discrete and categorical variables are split into two types, high-cardinality and low-cardinality, based on a threshold of 50; they are then encoded using frequency encoding and one-hot encoding, respectively. One-hot encoding maps a discrete feature into Euclidean space, where each value of the feature corresponds to a point, so that distances between values become meaningful. However, one-hot encoding high-cardinality features would cause the curse of dimensionality. Hence, we use frequency encoding, which counts and sorts the occurrences of values, to address this issue.
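A minimal sketch of this hybrid strategy, assuming the cardinality threshold of 50 stated above (function names are ours):

```python
from collections import Counter

CARDINALITY_THRESHOLD = 50  # split point between the two encodings

def frequency_encode(values):
    """Map each category to its occurrence count (high-cardinality case)."""
    counts = Counter(values)
    return [counts[v] for v in values]

def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector (low-cardinality case)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def hybrid_encode(values):
    """Choose the encoding by the number of distinct values."""
    if len(set(values)) > CARDINALITY_THRESHOLD:
        return frequency_encode(values)
    return one_hot_encode(values)
```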
III-B4 Data fusion
Before we combine the various cleansed data sets, we have to convert all timestamps into a single time zone, UTC. Then, we merge the weather data with the trajectory data and with the flight data separately to create two data sets. Recalling Section II, we set $\Delta t = 60$ minutes, which means that if the timestamp of the last instance in a temporal sequence is $t$, the aviation delay situation at time $t + \Delta t$ is chosen as the ground truth. However, there may be no aircraft arriving at ATL at exactly that time. Thus we take the average delay of the aircraft whose arrival times lie within a short interval around $t + \Delta t$, which could be positive or negative, as the target of our prediction task. Obviously, airplanes whose estimated arrival times fall within that interval are still flying. Therefore, spatio-temporal information about these airplanes, such as how close they are to ATL, the weather conditions, and their speeds, is retrieved from the trajectory-based data set and introduced into the flight-based data set based on aircraft registration identification and time flag. Finally, we fuse the ATC data with this combined data set, taking geolocation as the key value.
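The time-aligned merge can be sketched with `pandas.merge_asof`; the tiny frames below are hypothetical stand-ins for the flight and weather data sets, not the paper's actual schema:

```python
import pandas as pd

# Hypothetical minimal frames standing in for the flight and weather data sets.
flights = pd.DataFrame({
    "time": pd.to_datetime(["2016-07-01 00:05", "2016-07-01 00:25", "2016-07-01 00:55"]),
    "delay_min": [12.0, -3.0, 7.0],
})
weather = pd.DataFrame({
    "time": pd.to_datetime(["2016-07-01 00:00", "2016-07-01 00:30", "2016-07-01 01:00"]),
    "wind_speed": [10, 15, 20],
})

# Attach to each flight record the most recent weather observation at or
# before its timestamp; both frames must be sorted by the merge key.
fused = pd.merge_asof(flights, weather, on="time", direction="backward")
```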
III-C LSTM-Based Architecture
The success of deep neural networks on a wide range of challenging prediction problems is commonly attributed to their hierarchy and depth [30, 31]. Inspired by this, Graves et al. demonstrate that RNNs can also benefit from depth in space by stacking multiple recurrent hidden layers on top of each other. Therefore, we adopt a stacked LSTM architecture for our aviation delay prediction. Fig. 5 shows the structure of our proposed framework. There is a sequence of length $N$, and each output of the second LSTM layer is joined to a fully connected (FC) layer. We expect the model to look back over historical states and capture potential temporal characteristics, since some important hidden representations may be lost in the last LSTM cell. Besides, an FC layer with more neurons has better expressivity for a complex function. During our experiments, we found that the loss converged too quickly and tended toward 0, which indicated overfitting, so we add Dropout regularization between the two LSTM layers.
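A minimal PyTorch sketch of such an architecture; the hidden size, dropout rate, and mean-pooled FC head are illustrative assumptions, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class StackedLSTMRegressor(nn.Module):
    """Two stacked LSTM layers with inter-layer dropout, plus an FC layer
    applied to every output of the second LSTM layer (sketch)."""

    def __init__(self, num_features: int, hidden_size: int = 64, dropout: float = 0.2):
        super().__init__()
        # num_layers=2 stacks two LSTM layers; `dropout` acts between them.
        self.lstm = nn.LSTM(num_features, hidden_size, num_layers=2,
                            batch_first=True, dropout=dropout)
        # FC head shared across time steps, pooled into one delay estimate.
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, num_features)
        out, _ = self.lstm(x)                     # (batch, N, hidden_size)
        per_step = self.fc(out)                   # (batch, N, 1)
        return per_step.mean(dim=1).squeeze(-1)   # (batch,)

model = StackedLSTMRegressor(num_features=203)
pred = model(torch.zeros(4, 30, 203))  # batch of 4 sequences, N = 30
```

Training would minimize `nn.MSELoss()` between `pred` and the average-delay targets, matching the MSE loss used in Section IV.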
In this section, we present the construction of the time series. Then we evaluate the performance and effectiveness of our LSTM-based architecture compared with other commonly-used machine learning algorithms for aviation delay prediction.
IV-A Data Preparation
The inputs for the LSTM are 3-dimensional sequential data. Fig. 6 shows the process of constructing the multivariate time series. Note that our data only contain records of the first day of each month from July 2016 to February 2017 (UTC), so we first split the original data into 8 days by time matching and then slice each day separately to create a series of time blocks of length $N$. Next, the training set and test set are generated by stacking these arrays vertically in sequence, which keeps the time continuity within each time block. The order of the inputs would otherwise affect the training results, since earlier samples generally receive larger gradients while neighboring sequences in the time series have similar distributions. Hence, we shuffle the order in which sequences are fed to the LSTM to guarantee the generalization of the trained model. Note that we do not shuffle the ordering of elements within individual sequences.
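The per-day slicing, vertical stacking, and sequence-level shuffling can be sketched as follows; the array sizes and `make_sequences` helper are illustrative, not the paper's code:

```python
import numpy as np

def make_sequences(day: np.ndarray, targets: np.ndarray, seq_len: int):
    """Slice one day's chronologically ordered records into overlapping
    length-N blocks; `seq_len` plays the role of N in the paper."""
    X, y = [], []
    for start in range(len(day) - seq_len + 1):
        X.append(day[start:start + seq_len])    # (N, num_features) block
        y.append(targets[start + seq_len - 1])  # delay label at the block's end
    return np.stack(X), np.array(y)

# Build blocks per day, stack the days vertically, then shuffle whole sequences.
rng = np.random.default_rng(0)
days = [rng.normal(size=(100, 5)) for _ in range(8)]  # 8 days of dummy features
labels = [rng.normal(size=100) for _ in range(8)]
Xs, ys = zip(*(make_sequences(d, t, seq_len=30) for d, t in zip(days, labels)))
X, y = np.vstack(Xs), np.concatenate(ys)
order = rng.permutation(len(X))  # shuffle sequences, not elements within them
X, y = X[order], y[order]
```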
Two types of data sets are used for training and evaluation. The first data set, with 66 features, contains only flight and weather information at the airport, while the other, with 203 features, contains everything in the first plus spatio-temporal information on succeeding flights. We set the sequence length $N$ to 30, 60, 90, and 120 to generate data sets with different time steps. Mean Square Error (MSE) is chosen as the loss function for training.
Our experiment is divided into three steps:
(1) Because most existing research utilizes ensemble methods to predict aviation delay, the first step is to compare the performance of our model with that of other powerful ensemble algorithms: Random Forest Regression (RF), a bagging method, and Gradient Boosting Regression Tree (GBRT), a boosting method. RF and GBRT are not designed for sequential prediction, so we feed all data into these models instead of slicing the data into length-$N$ sequences as for the LSTM. Besides, other commonly used baseline models such as Linear Regression (LR), Support Vector Regression (SVR), Regression Tree (RT) and Multilayer Perceptron (MLP) are also used for comparison.
(2) The second step is to train our stacked LSTM model on the two data sets separately with different sequence lengths $N$, so that we can analyze the impact of sequence length and en-route spatio-temporal information on our model.
(3) We apply the model to different airports to verify its validity. The airports include large hub airports such as Los Angeles International Airport (LAX) and O'Hare International Airport (ORD), a mid-size airport, Orlando International Airport (MCO), and a small airport, Daytona Beach International Airport (DAB).
The MSE ($\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$) is used as the evaluation metric in this paper. Table I compares the performance of different model settings, contrasting the data set without information on subsequent flights against the entire data set. The best score is highlighted. At a glance, our LSTM model clearly outperforms the other models. The second-best scheme is the ensemble methods, among which GBRT works best. These conclusions can also be drawn from Table II. LR performs poorly on both data sets, which illustrates that there is no apparent linear relation between the features and the target. Generally speaking, models with richer spatio-temporal features perform better than models that take only temporal correlations into account. Besides, we observe a declining trend in test-set MSE as the length of the sliding window increases. We attribute this to the LSTM learning more hidden patterns of prior knowledge from longer sequences; although longer sequences may introduce noise during training, the gate mechanism prevents the abuse of long-term dependencies.
Fig. 7 depicts the gap between the predicted values and the ground truth over time in a more intuitive way. There is only a slight difference in amplitude, and the trends are basically consistent, which further verifies that our proposed model can predict aviation delay accurately.
Table III shows the test-set MSE values for different airports. The data distributions at different airports vary dramatically, both from a spatio-temporal perspective (e.g., weather varies across cities) and from a flight perspective (e.g., an airline schedules flights differently at different airports). Therefore, we use the data from each airport to re-generate time series with spatio-temporal features and re-train the LSTM-120 model. The results demonstrate the effectiveness of our proposed LSTM model and the corresponding feature engineering for large hub airports. However, the model does not work well for mid- and small-size airports: with relatively few flights, it is hard to generate representative sequences.
V Conclusion and Future Work
In this work, we present a novel aviation delay prediction framework based on stacked LSTMs for commercial flights. We provide a complete data processing pipeline and generate a data set with richer spatio-temporal features while accounting for delay propagation and flight uncertainty. Our experiments verify that our framework can predict commercial flight arrival delays in the US within an acceptable error.
There is significant work to do in the future. First, we should collect more data to improve the robustness of our proposed model. Moreover, the number and frequency of flights differ across airports, and small airports in particular hold too little data to train a high-performance but data-hungry deep learning model; this is strong motivation to apply transfer learning to improve model performance at such airports with the help of prior knowledge.
This research was supported by the Center for Advanced Transportation Mobility (CATM), USDOT Grant #69A3551747125.
-  H. Song, R. Srinivasan, T. Sookoor, and S. Jeschke, Smart cities: foundations, principles, and applications. John Wiley & Sons, 2017.
-  H. Song, G. Fink, and S. Jeschke, Security and Privacy in Cyber-Physical Systems. Wiley Online Library.
-  Y. Sun, H. Song, A. J. Jara, and R. Bie, “Internet of things and big data analytics for smart and connected communities,” IEEE Access, vol. 4, pp. 766–773, 2016.
-  H. Song, D. B. Rawat, S. Jeschke, and C. Brecher, Cyber-physical systems: foundations, principles and applications. Morgan Kaufmann, 2016.
-  Y. Liu, X. Weng, J. Wan, X. Yue, H. Song, and A. V. Vasilakos, “Exploring data validity in transportation systems for smart cities,” IEEE Communications Magazine, vol. 55, no. 5, pp. 26–33, 2017.
-  Y. J. Kim, S. Choi, S. Briceno, and D. Mavris, “A deep learning approach to flight delay prediction,” in 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC). IEEE, 2016, pp. 1–6.
-  Z. Lv, H. Song, P. Basanta-Val, A. Steed, and M. Jo, “Next-generation big data analytics: State of the art, challenges, and future research topics,” IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 1891–1899, 2017.
-  G. Dartmann, H. Song, and A. Schmeink, Big data analytics for cyber-physical systems: machine learning for the internet of things. Elsevier, 2019.
-  Y. Liang, Z. Cai, J. Yu, Q. Han, and Y. Li, “Deep learning based inference of private information using embedded sensors in smart devices,” IEEE Network, vol. 32, no. 4, pp. 8–14, 2018.
-  Z. Cai and Z. He, “Trading private range counting over big iot data,” in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2019, pp. 144–153.
-  X. Zheng and Z. Cai, “Privacy-preserved data sharing towards multiple parties in industrial iots,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 5, pp. 968–979, 2020.
-  G. Han, S. Shen, H. Song, T. Yang, and W. Zhang, “A stratification-based data collection scheme in underwater acoustic sensor networks,” IEEE Transactions on Vehicular Technology, vol. 67, no. 11, pp. 10 671–10 682, 2018.
-  J. Tan, W. Liu, M. Xie, H. Song, A. Liu, M. Zhao, and G. Zhang, “A low redundancy data collection scheme to maximize lifetime using matrix completion technique,” EURASIP Journal on Wireless Communications and Networking, vol. 2019, no. 1, pp. 1–29, 2019.
-  L. A. Tawalbeh, W. Bakheder, and H. Song, “A mobile cloud computing model using the cloudlet scheme for big data applications,” in 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), 2016, pp. 73–77.
-  S. Manna, S. Biswas, R. Kundu, S. Rakshit, P. Gupta, and S. Barman, “A statistical approach to predict flight delay using gradient boosted decision tree,” in 2017 International Conference on Computational Intelligence in Data Science (ICCIDS). IEEE, 2017, pp. 1–5.
-  K. Gopalakrishnan and H. Balakrishnan, “A comparative analysis of models for predicting delays in air traffic networks.” ATM Seminar, 2017.
-  W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative study of cnn and rnn for natural language processing,” arXiv preprint arXiv:1702.01923, 2017.
-  G. Gui, F. Liu, J. Sun, J. Yang, Z. Zhou, and D. Zhao, “Flight delay prediction based on aviation big data and machine learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 140–150, 2019.
-  N. McCarthy, M. Karzand, and F. Lecue, “Amsterdam to Dublin eventually delayed? LSTM and transfer learning for predicting delays of low cost airlines,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9541–9546.
-  S. Ayhan, P. Costas, and H. Samet, “Predicting estimated time of arrival for commercial flights,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 33–42.
-  F. Kong, J. Li, B. Jiang, T. Zhang, and H. Song, “Big data-driven machine learning-enabled traffic flow prediction,” Trans. Emerg. Telecommun. Technol., vol. 30, no. 9, Sep. 2019. [Online]. Available: https://doi.org/10.1002/ett.3482
-  F. Kong, J. Li, B. Jiang, and H. Song, “Short-term traffic flow prediction in smart multimedia system for internet of vehicles based on deep belief network,” Future Generation Computer Systems, vol. 93, pp. 460–472, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X18320326
-  W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  Y. Liu, L. L. Njilla, J. Wang, and H. Song, “An lstm enabled dynamic stackelberg game theoretic method for resource allocation in the cloud,” in 2019 International Conference on Computing, Networking and Communications (ICNC), 2019, pp. 797–801.
-  AVweb-flight safety. [Online]. Available: https://www.avweb.com/flight-safety/faa-regs/air-route-traffic-control/
-  C. L. Scovel III and I. General, “FAA’s progress and challenges in advancing the next generation air transportation system,” Statement of the Honorable Calvin L Scovel III, Inspector General, US Department of Transportation before the Committee on Transportation and Infrastructure Subcommittee on Aviation United States House of Representatives, Washington DC, vol. 17, 2013.
-  ADS-B Exchange-World’s largest co-op of unfiltered flight data. [Online]. Available: https://www.adsbexchange.com/
-  123ATC. [Online]. Available: https://123atc.com/
-  R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013.
-  M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in Advances in neural information processing systems, 2013, pp. 190–198.
-  A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–6649.