Rail accident reporting in the U.S. has remained relatively unchanged for more than 40 years. The report form has 52 relevant accident fields and many of these fields have sub-fields. Some fields require entry of the value of an accident result or condition, e.g., “Casualties to train passengers” and “Speed”. Other fields restrict entries to values from a designated set of choices, e.g., “Type of equipment” and “Weather”. “Primary cause” is an example of a restricted entry field in the report where the value must be one of 389 coded values. Choosing one of these categories while filling in reports is sometimes challenging and subject to errors due to the wide range of accidents. On the other hand, this field is highly important to the analyses that transportation administrations carry out in order to provide better safety regulations.
Field 52 on the report is different from the other fields because it allows the reporter to enter a narrative description of the accident. These accident narratives provide a way for the accident reporter to describe the circumstances and outcomes of the accident in their own words. Among other things, this means providing details not entered in any of the other fields of the report. For example, while the report may show an accident cause of H401 - “Failure to stop a train in the clear,” the narrative could provide the reasons and circumstances for this failure. These additional details can be important in improving rail safety by helping in selection of a more accurate cause for the event. As a result, a method that correlates the detailed narratives with causes would be beneficial for both accident reporters and railroad administrators.
Despite the advantages of the narrative field, most safety changes result from fixed field entries since accident descriptions are difficult to automatically process. The advance of methods in text mining and machine learning has given us new capabilities to process and automatically classify textual content. This paper describes the results of a study using the latest methods in deep learning to process accident narratives, classify accident cause, and compare this classification to the causal entry on the form. The results of this study give us insights into the use of machine learning and, more specifically, deep learning to process accident narratives and to find inconsistencies in accident reporting.
This paper investigates how the narrative fields of FRA accident reports could be efficiently used to extract the cause of accidents and establish a relationship between the narrative and the possible cause. Such relationships could allow reporters to enter narratives freely and receive candidate choices for the causal field of the report. Our approach uses state-of-the-art deep learning technologies to classify texts based on their causes. The rest of this paper is organized as follows: Section II presents related work in both accident analysis and text classification with deep learning. Section III describes in detail the approach that has been used along with evaluation criteria. Section IV provides details of our implementations and Section V reports the results. Finally, Section VI presents the conclusion.
II Related Work
This paper utilizes text mining and a new generation of natural language processing techniques, i.e., deep learning [1, 2, 3], in an effort to discover relationships between accident reports’ narratives and their causes. In this section, we describe related work in both railroad accident analysis and text mining with deep learning. Train accident reports have been the subject of considerable research and different approaches have been used to derive meaningful information from these reports to help improve safety. As an example, Schafer and Barkan investigated the relationship between the length of a train and its accident rate. Their paper also emphasizes the importance of proper causal understanding. Other authors [5, 6] have used FRA data to investigate accidents caused by derailments. Recent work has applied statistical analysis to FRA data to investigate the freight train derailment rate as an important factor. All of these previous works used only the fixed field entries in the accident reports for their analysis and did not use information in the accident narratives. Some investigators have begun to apply text mining for accident report analysis in an attempt to improve safety. Nayak et al. provided such an approach on crash report data between 2004 and 2005 in Queensland, Australia. They used the Leximancer text mining tool to produce cluster maps and the most frequent terms and clusters. Jin et al. introduced the concept of chain queries that utilize text retrieval methods in combination with link-analysis techniques. Recent work by Brown provided a text analysis of narratives in accident reports to the FRA. He specifically used topic modeling of narratives to characterize contributors to the accidents. In this paper, we present a new approach to the analysis of these accident narratives using deep learning techniques.
We specifically applied three main deep learning architectures, Convolutional Neural Nets (CNN), Recurrent Neural Nets (RNN), and Deep Neural Nets (DNN), to discover accident causes from the narrative field in FRA reports.
LeCun et al. presented an overview of how these methods improved the state-of-the-art machine learning results in many fields such as object detection, speech recognition, drug discovery and many other applications. CNN were first introduced as a solution for problems involving data with multiple array structure such as 2D images. However, later researchers proposed using a 1D structure to enable CNN applications for text classification. This work was extended by Zhang et al., who developed character-level CNN for text classification. Other work has provided additional extensions to include the use of dynamic k-max pooling in the architecture for modeling sentences. In RNN, the output from a layer of nodes can reenter as input to that layer. This architecture makes these deep learning models particularly suited for applications with sequential data, including text mining. Irsoy et al. showed an implementation of a deep RNN structure for sentiment analysis of sentences. The authors compared their approach to state-of-the-art conditional random field baselines and showed that their method outperforms such techniques. Other researchers used different combinations of RNN models with some modifications and showed better performance in document classification. Also, some recent researchers combined CNN and RNN in a hierarchical fashion and showed improved overall performance for text classification. Another hierarchical model for text classification is presented by Kowsari et al., who employ stacks of deep learning architectures to provide improved document classification at each level of the document hierarchy. In our study, we have combined text mining methods and deep learning techniques to investigate the relationship of the narrative field with accident cause, which has not been explored before using such methods.
For this analysis, each report is considered as a single short document which consists of a sequence of words or unigrams. These sequences are the input to our models and the accident cause (general category or specific coded cause) is the target for the deep learning model. We convert the word sequences into vector sequences to provide input to the deep learning models. Different solutions such as “Word Embedding” and tf-idf representation are available to accomplish this goal. This section also provides details on the deep learning architectures and evaluation methods used in this study.
III-A Word Embedding and Representation
Different word representations have been proposed to translate words or unigrams into understandable numeric input for machine learning algorithms. One of the basic methods is term-frequency (TF) where each word is mapped on to a number corresponding to the number of occurrences of that word in the whole corpus. Other term frequency functions present word frequency as a Boolean or a logarithmically scaled number. As a result, each document is translated to a vector containing the frequency of the words in that document. Therefore, this vector will be of the same length as the document itself. Although such an approach is intuitive, it suffers from the fact that common words tend to dominate the representation.
III-A1 Term Frequency-Inverse Document Frequency
K. Sparck Jones proposed inverse document frequency (IDF) that can be used in conjunction with term frequency to lessen the effect of common words in the corpus. Therefore, a higher weight will be assigned to the words with both high frequency in a document and low frequency in the whole corpus. The mathematical representation of the weight of a term in a document by tf-idf is given in Equation 1.
$$W(d, t) = TF(d, t) \cdot \log\left(\frac{N}{df(t)}\right) \tag{1}$$
Where $N$ is the number of documents and $df(t)$ is the number of documents containing the term $t$ in the corpus. The first part in Equation 1 would improve recall and the latter would improve the precision of the word embedding. Although tf-idf tries to overcome the problem of common terms in a document, it still suffers from some other descriptive limitations. Namely, tf-idf cannot account for similarity between words in the document since each word is presented as an index. In recent years, with development of more complex models such as neural networks, new methods have been presented that can incorporate concepts such as similarity of words and part of speech tagging. GloVe is one such word embedding technique that has been used in this work. Another successful word embedding method used in this work is Word2Vec which is described in the next part.
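The tf-idf weighting described above can be illustrated with a minimal sketch. It assumes raw counts as the term frequency and the natural logarithm for the IDF part; the function name `tf_idf` and the toy narratives are hypothetical and are not taken from the FRA data.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Each weight is TF(d, t) * log(N / df(t)), with raw counts as TF.
    """
    n_docs = len(documents)
    # df(t): number of documents that contain term t at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Toy narratives (hypothetical, for illustration only).
docs = [
    "train derailed at switch".split(),
    "train struck vehicle at crossing".split(),
    "broken rail derailed train".split(),
]
weights = tf_idf(docs)
```

In this toy corpus the word "train" appears in every document, so its weight is zero, while a word that appears in a single document, such as "switch", receives the largest weight.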
III-A2 Word2Vec
Mikolov et al. developed the “word to vector” representation as a better word embedding approach. Word2vec uses two neural network architectures to create a high dimensional vector for each word: Continuous Bag of Words (CBOW) and continuous skip-gram (CSG). CBOW predicts a word from its surrounding context words, while CSG predicts the surrounding context words from the word itself. Overall, the word2vec method provides a very powerful relationship discovery approach.
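To make the difference between the two word2vec architectures concrete, the following sketch builds the training examples each one would see for a short sentence. In the usual formulation, CBOW takes the context words as input and the center word as target, while skip-gram takes the center word as input and each context word as a target. The helper names and the example sentence are hypothetical, and the neural network training itself is left out.

```python
def context_windows(tokens, window=2):
    """Yield (center, context) pairs, where context holds the words
    within `window` positions on either side of the center word."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the context words are the input, the center word is the target.
    return [(context, center)
            for center, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the center word is the input, each context word is a target.
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]


tokens = "conductor failed to stop train".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

With a window of 1, the word "failed" yields the CBOW example `(["conductor", "to"], "failed")` and the two skip-gram examples `("failed", "conductor")` and `("failed", "to")`.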
III-A3 Global Vectors for Word Representation (GloVe)
Another powerful word embedding technique is Global Vectors (GloVe), presented by Pennington et al. The approach is very similar to the word2vec method in that each word is represented by a high dimensional vector and trained based on the surrounding words over a huge corpus. The pre-trained embeddings for words used in this work are based on a vocabulary of 400,000 words trained over Wikipedia 2014 and Gigaword 5 with 50 dimensions for word representation. GloVe also provides other pre-trained word vectorizations with 100, 200, and 300 dimensions which are trained over even bigger corpora as well as over Twitter. Figure 2 shows an example of how these embeddings can be used to map words to a better representation. As one can see, words such as “Engineer”, “Conductor”, and “Foreman” are considered close based on these embeddings. Similarly, words such as “inspection” and “investigation” are considered very similar.
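Closeness between embedded words is commonly measured with cosine similarity. The sketch below assumes that measure and uses made-up 3-dimensional vectors; the values in `toy_embeddings` are hypothetical and are not real GloVe vectors, which have 50 or more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, invented for illustration only.
toy_embeddings = {
    "engineer": [0.9, 0.1, 0.2],
    "conductor": [0.8, 0.2, 0.1],
    "weather": [0.1, 0.9, 0.3],
}


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in embeddings if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


nearest = most_similar("engineer", toy_embeddings)
```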
III-B Text Classification with Deep Learning
The three deep learning architectures used in this paper to analyze accident narratives are Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Deep Neural Networks (DNN) [1, 2, 3]. The building blocks of these classifiers are described in greater detail in this section.
III-B1 Deep Neural Networks (DNN)
The input is a vectorized representation of documents, which connects to the first layer. The output layer has one node per class for multi-class classification and only one node for binary classification. The implementation of Deep Neural Networks (DNN) is a discriminatively trained model that uses the standard back-propagation algorithm. Different activation functions for nodes exist, such as sigmoid or tanh, but we noticed that ReLU (Equation 2) provides better results. The output layer for multi-class classification should use softmax, as shown in Equation 3.
$$f(x) = \max(0, x) \tag{2}$$
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad \forall\, j \in \{1, \ldots, K\} \tag{3}$$
Given a set of example pairs $(x, y)$, $x \in X$, $y \in Y$, the goal is to learn from these input and target spaces using hidden layers. In our text classification, the input is a vector generated by vectorization of the text using tf-idf word weighting.
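A forward pass through a small fully connected network can be sketched as follows, assuming ReLU in the hidden layer and softmax at the output, which is the usual choice for multi-class problems. The layer sizes, weights, and function names are hypothetical and chosen only to keep the arithmetic easy to follow; training by back-propagation is not shown.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def softmax(z):
    """Turn a list of scores into probabilities that sum to one."""
    # Subtracting the maximum avoids overflow and does not change the result.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j] holds the weights feeding output j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """One hidden ReLU layer followed by a softmax output layer."""
    hidden = [relu(v) for v in dense(x, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical two-feature input and hand-picked weights.
probs = forward(
    [1.0, 2.0],
    hidden_w=[[1.0, -1.0], [0.5, 0.5]], hidden_b=[0.0, 0.0],
    out_w=[[1.0, 1.0], [1.0, -1.0]], out_b=[0.0, 0.0],
)
```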
III-B2 Convolutional Neural Nets
CNNs were introduced by LeCun et al. to recognize handwritten digits in images. The proposed design, though powerful, did not catch the attention of the computer-vision and machine learning communities until almost a decade later when higher computation technologies such as Graphics Processing Units (GPU) became available. Although CNNs have been designed with the intention of being used in the image processing domain, they have also been used in text classification using word embedding [25, 2, 3].
In CNN, a convolutional layer contains connections to only a subset of the input. These subsets of neurons are receptive fields and the distance between receptive fields is called stride. The value at any neuron in the receptive field is given by the output from an activation function applied to the weighted sum of all inputs to the receptive field. Common choices for activation functions are sigmoid, hyperbolic tangent, and rectified linear. As with most CNN architectures, in this study we stack multiple convolutional layers on top of each other.
The next structure in the CNN architecture is a pooling layer. The neurons in this layer again sample a small set of inputs to produce their output value. However, in this case they simply return the minimum, average or maximum of the input values. Pooling reduces computational complexity and memory use. Additionally, it can improve performance on translated and rotated inputs. Pooling can be repeated multiple times depending on the size of input and the complexity of the model.
The final layers are traditional fully connected layers that take the flattened output of the last pooling layer as input. The output from this fully connected network is run through a softmax function for multinomial (i.e., more than two classes) problems, such as classifying cause from accident narratives.
Figure 1 shows the structure of an example CNN with one convolutional and max pooling layer for text analysis.
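The convolution and pooling steps described above can be sketched on a one-dimensional sequence. For simplicity the sketch treats each position as a single number, whereas in text classification each position would be a word embedding vector; the kernel values, the ReLU activation, and the function names are assumptions made for illustration.

```python
def conv1d(sequence, kernel, stride=1):
    """Slide `kernel` over `sequence` and apply ReLU to each weighted sum."""
    k = len(kernel)
    outputs = []
    for start in range(0, len(sequence) - k + 1, stride):
        window = sequence[start:start + k]  # one receptive field
        weighted_sum = sum(w * x for w, x in zip(kernel, window))
        outputs.append(max(0.0, weighted_sum))
    return outputs


def max_pool1d(sequence, pool_size):
    """Keep the maximum of each non-overlapping block of `pool_size` values."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]


# Hypothetical input signal and kernel.
signal = [1.0, 2.0, 3.0, 0.0, -1.0, 4.0, 2.0]
feature_map = conv1d(signal, kernel=[1.0, 0.0, -1.0])
pooled = max_pool1d(feature_map, pool_size=2)
```

Here the feature map has five values, one per receptive field, and max pooling with size 2 reduces it to two values, discarding the trailing element that does not fill a complete block.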
III-B3 Recurrent Neural Networks (RNN)
RNN are a more recent category of deep learning architectures where outputs are fed backward as inputs. Such a structure allows the model to keep a memory of the relationship between words in nodes. As such, it provides a good approach for text analysis by keeping sequences in memory.
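The general RNN structure is formulated as in Equation 4, where $x_t$ denotes the state at time $t$ and $u_t$ refers to the input at step $t$.
$$x_t = F(x_{t-1}, u_t, \theta) \tag{4}$$
Using explicit weights, Equation 4 can be written as Equation 5.
$$x_t = W_{rec}\,\sigma(x_{t-1}) + W_{in}\,u_t + b \tag{5}$$
In this equation, $W_{rec}$ is the recurrent matrix weight, $W_{in}$ are the input weights, $b$ is the bias, and $\sigma$ is an element-wise function.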
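The recurrence can be illustrated with a scalar sketch in which the recurrent weight, input weight, and bias are single numbers and the element-wise function is assumed to be tanh. The specific form used here, state equals recurrent weight times tanh of the previous state, plus input weight times input, plus bias, is one common way to write a simple RNN and is an assumption of this example; the function name is hypothetical.

```python
import math


def rnn_states(inputs, w_rec, w_in, bias, x0=0.0):
    """Scalar recurrence x_t = w_rec * tanh(x_{t-1}) + w_in * u_t + bias.

    Returns the list of states, one per input step.
    """
    states = []
    x = x0
    for u in inputs:
        # The previous state re-enters the computation at every step.
        x = w_rec * math.tanh(x) + w_in * u + bias
        states.append(x)
    return states


states = rnn_states([1.0, 0.0], w_rec=0.5, w_in=1.0, bias=0.0)
```

The second state is non-zero even though the second input is zero, which shows how the state carries information from earlier steps forward.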
The general RNN architecture has problems with vanishing and, less frequently, exploding gradients. This happens when the gradient passes through the recursions and gets progressively smaller (vanishing) or larger (exploding). To deal with these problems, long short-term memory (LSTM), a special type of RNN that preserves long-term dependencies, was introduced; it has been shown to be particularly effective at mitigating the vanishing gradient problem.
Figure 3 shows the basic cell of an LSTM model. Although LSTM has a chain-like structure similar to RNN, LSTM uses multiple gates to regulate the amount of information allowed into each node state [1, 2, 3].
Gated Recurrent Unit (GRU)
The Gated Recurrent Unit (GRU) is a more recent and simpler gating mechanism than LSTM. GRU contains two gates, does not possess internal memory (the cell state in Figure 3), and, unlike LSTM, does not apply a second non-linearity (tanh in Figure 3). We used GRU as our main RNN building block. A more detailed explanation of a GRU cell is given in the following:
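$$z_t = \sigma_g(W_z x_t + U_z h_{t-1} + b_z) \tag{6}$$
Where $z_t$ refers to the update gate vector at step $t$, $x_t$ stands for the input vector, $W$, $U$, and $b$ are parameter matrices and vector, and $\sigma_g$ is the activation function, which could be sigmoid or ReLU.
$$r_t = \sigma_g(W_r x_t + U_r h_{t-1} + b_r) \tag{7}$$
Where $r_t$ stands for the reset gate vector at step $t$.
$$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \sigma_h\left(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h\right) \tag{8}$$
Where $h_t$ is the output vector at step $t$, $r_t$ stands for the reset gate vector at step $t$, $z_t$ is the update gate vector at step $t$, and $\sigma_h$ indicates the hyperbolic tangent function.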
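A single GRU update can be sketched with scalar parameters. The sketch assumes one common GRU formulation, in which the update gate blends the previous output with a tanh candidate computed from the input and the reset-gated previous output, and it uses sigmoid for both gates. The parameter names in the dictionary `p` are hypothetical labels for the parameter matrices and bias vectors, reduced to single numbers.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t, h_prev, p):
    """One scalar GRU update; `p` maps parameter names to floats."""
    # Update gate: how much of the previous output to keep.
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])
    # Reset gate: how much of the previous output feeds the candidate.
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])
    candidate = math.tanh(
        p["W_h"] * x_t + p["U_h"] * (r_t * h_prev) + p["b_h"]
    )
    return z_t * h_prev + (1.0 - z_t) * candidate


# With all parameters at zero, both gates equal 0.5 and the candidate is 0,
# so the new output is half of the previous output.
zero_params = {name: 0.0 for name in
               ["W_z", "U_z", "b_z", "W_r", "U_r", "b_r", "W_h", "U_h", "b_h"]}
h_next = gru_step(1.0, 0.8, zero_params)
```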
Figure 4 shows the RNN architectures used in this study by employing either LSTM or GRU nodes.
In order to understand how well our model performs, we need to use appropriate evaluation methods to overcome problems such as unbalanced classes. This section describes our evaluation approach.
III-C1 F1 measurement
With unbalanced classes, as with accident reports, simply reporting the overall accuracy would not reflect the reality of a model’s performance. For instance, because some of these classes have considerably more observations than others, a classifier that chooses these labels over all others will obtain high accuracy, while misclassifying the smaller classes. Hence, the analysis in this paper requires a more comprehensive metric. One such metric is the F1-score and its two main implementations: macro-averaging and micro-averaging. The macro-averaging formulation is given in Equation 12, using the definitions of precision ($\pi$) and recall ($\rho$) in Equations 9 and 10.
$$\pi_i = \frac{TP_i}{TP_i + FP_i} \tag{9}$$
$$\rho_i = \frac{TP_i}{TP_i + FN_i} \tag{10}$$
$$F_i = \frac{2\,\pi_i\,\rho_i}{\pi_i + \rho_i} \tag{11}$$
$$F1_{macro} = \frac{\sum_{i=1}^{N} F_i}{N} \tag{12}$$
Here $TP_i$, $FP_i$, and $FN_i$ represent true positive, false positive and false negative counts, respectively, for class $i$ out of $N$ classes.
Our analysis uses macro averaging, which tends to be biased toward less populated classes. As a result, we provide a more conservative evaluation since deep learning methods tend to perform worse with smaller data sets. Another performance measure used in this study is the confusion matrix. A confusion matrix compares true values with predicted values and therefore provides information on which classes are most often misclassified and as which other classes.
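The macro-averaged F1 score and the confusion matrix can be computed directly from lists of true and predicted labels, as in the following sketch. The label names and the toy predictions are hypothetical; per-class scores with an empty denominator are set to zero, which is a convention assumed here.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(labels)


# Toy example with one large, one medium, and one very small class.
labels = ["human", "track", "signal"]
y_true = ["human", "human", "human", "human", "track", "track", "signal"]
y_pred = ["human", "human", "human", "track", "track", "track", "human"]

matrix = confusion_matrix(y_true, y_pred, labels)
score = macro_f1(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy example the accuracy is 5/7 (about 0.71), while the macro-averaged F1 is about 0.52 because the single "signal" report is misclassified and that class contributes a score of zero.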
In this section, we describe the embeddings that are used for our analysis as well as the structure of each deep learning model and the hardware that has been used to perform this work. To create the word2vec representation, we used the gensim library to construct a 100-dimensional vector for each word using a window of size 5. Similarly, we used a 100-dimensional representation of GloVe trained over a corpus with a 400K vocabulary. The input documents have been padded to the same size of 500 words for all narratives. Our experiments showed that higher dimensions would not have a significant effect on the results.
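Bringing every narrative to a fixed length of 500 tokens can be sketched as follows. The text does not state on which side the padding is applied or how longer narratives are handled, so padding on the right, truncating the tail, and using 0 as the padding id are all assumptions of this example, as is the function name.

```python
def pad_sequence(token_ids, max_len=500, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly `max_len` entries."""
    trimmed = token_ids[:max_len]
    return trimmed + [pad_id] * (max_len - len(trimmed))


short_narrative = [12, 7, 431]        # hypothetical token ids
long_narrative = list(range(1, 601))  # 600 hypothetical token ids

padded_short = pad_sequence(short_narrative)
padded_long = pad_sequence(long_narrative)
```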
Our DNN implementation consists of five hidden layers, where in each hidden layer, we have units with ReLU activation function followed by a dropout layer.
Our CNN implementation consists of three 1D convolutional layers, each followed by both a max pooling layer and a dropout layer. The kernel size for both the convolution and max pooling layers was 5. The final fully connected layer has 32 nodes and also uses a dropout layer.
Our RNN implementation is made of two GRU layers with 64 units each, followed by dropout. The final layer is a fully connected layer with 64-128 nodes. This layer also includes dropout, similar to the previous layers. The dropout rate is between 0.1 and 0.5 depending on the task and model, which helps to reduce the chance of overfitting for our models.
The processing was done on a with cores and memory. We used the Keras package with TensorFlow as its backend for our implementation.
This work has been performed using Federal Railroad Administration (FRA) reports collected during 17 consecutive years (2001-2017). FRA provides a narrative for each accident along with the corresponding cause reported for that accident. The results are in two sections. In the first section, we show the performance in labeling the general cause for each accident based on its narrative, and in the second section, we focus on the specific accident cause for the most common types of accidents according to the reported detailed cause. In both of these analyses, we also compare our performance with some traditional machine learning algorithms such as Support Vector Machines (SVM), Naive Bayes Classifier (NBC) and Random Forest as our baselines. Finally, we look at our misclassified results using confusion matrices and analyze errors made by our models.
V-A General cause analysis
The general accident cause is in the reported cause field of accident reports. This analysis considers reports with five labels as general causes. Table II shows the five causal labels and their distribution.
To classify the reports, both RNN and CNN along with two word embeddings, Word2Vec and GloVe, and DNN with tf-idf are used. Table III shows the performance of our techniques and compares it with our baselines. Generally, Word2Vec embedding produces better F1 scores over the test set. Also, the differences between RNN and CNN results are not significant.
Figure 6 shows the confusion matrix for the best classifier. This confusion matrix shows that deep learning models in conjunction with vector representations of words can provide good accuracy especially on categories with more data points.
V-B Specific cause analysis
Our analysis also considers more specific accident causes in FRA reports (one of 389 code categories). An obvious issue with more detailed causal labels is that there are some cause categories with very few reports. Therefore, over the period studied, the top ten most common causes (combined into 8 categories since H307 and H306 have the same description and the description of T220 and T207 is very similar) have been selected for analysis. Table I shows the distribution of reports on these categories. Figure 5 shows the confusion matrix for the best classifier for the top 8 categories of causes.
We also investigate classifier performance using ROC curves as in Figure 7 for both general and specific causes.
Table IV shows the results for specific causes along with a comparison with our baselines’ performances. Similar to our previous results, models using Word2Vec embedding perform better than the ones using GloVe both in CNN and RNN architecture.
V-C Error analysis
To better understand model performance, we investigated the errors made by our classifiers. The confusion matrices clearly show that the number of instances in the classes plays a major role in classification performance.
As an example, reports labeled with Signal as the main cause are the smallest group and not surprisingly, the model does poorly on these reports due to the small number of training data points.
There is, however, another factor at work in model performance which comes from rare cases where the description seems uncorrelated to the cause. As an example of such cases, our model assigned the following narrative to the mechanical category, while the originally reported cause category is Signal: “DURING HUMPING OPERATIONS THE HOKX112078 DERAILED IN THE MASTER DUE TO EXCESSIVE RETARDER FORCES.” The reported cause does not seem consistent with the report narrative.
Identifying such inconsistencies in reports’ narratives is important because both policy changes and changes to operation result from aggregate analysis of accident reports.
VI Conclusion and Future Work
This paper presents deep learning methods that use the narrative fields of FRA reports to discover the cause of each accident. These textual fields are written using specific terminologies, which makes the interpretation of the event cumbersome for non-expert readers. However, our analysis shows that when using proper deep learning models and word embeddings such as GloVe and especially Word2Vec, the relationship between these texts and the cause of the accident could be extracted with acceptable accuracy. The results of testing for the five major accident categories and top 10 specific causes (according to FRA database coding) show the deep learning methods we applied were able to correctly classify the cause of a reported accident with overall 75% accuracy. Also, the results indicate that applying recent deep learning methods for text analysis can help exploit accident narratives for information useful to safety engineers. This can be done by providing an automated assistant that could help identify the most probable cause of an accident based on the event narrative. Also, these results suggest that in some rare cases, the narrative description seems inconsistent with the suggested cause in the report. Hence, these methods may have promise for identifying inconsistencies in accident reporting and thus could potentially impact safety regulations. Moreover, the classification accuracy is higher in more frequent accident categories. This suggests that as the number of reports increases, the accuracy of deep learning models improves and these models become more helpful in interpreting such domain specific texts.
-  K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes, “Hdltex: Hierarchical deep learning for text classification,” in Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on. IEEE, 2017, pp. 364–371.
-  K. Kowsari, M. Heidarysafa, D. E. Brown, K. J. Meimandi, and L. E. Barnes, “Rmdl: Random multimodel deep learning for classification,” in Proceedings of the 2nd International Conference on Information System and Data Mining. ACM, 2018, pp. 19–28.
-  M. Heidarysafa, K. Kowsari, D. E. Brown, K. Jafari Meimandi, and L. E. Barnes, “An improvement of data classification using random multimodel deep learning (rmdl),” vol. 8, no. 4, pp. 298–310, 2018.
-  D. Schafer and C. Barkan, “Relationship between train length and accident causes and rates,” Transportation Research Record: Journal of the Transportation Research Board, no. 2043, pp. 73–82, 2008.
-  X. Liu, C. Barkan, and M. Saat, “Analysis of derailments by accident cause: evaluating railroad track upgrades to reduce transportation risk,” Transportation Research Record: Journal of the Transportation Research Board, no. 2261, pp. 178–185, 2011.
-  X. Liu, M. Saat, and C. Barkan, “Analysis of causes of major train derailment and their effect on accident rates,” Transportation Research Record: Journal of the Transportation Research Board, no. 2289, pp. 154–163, 2012.
-  X. Liu, “Statistical temporal analysis of freight train derailment rates in the united states: 2000 to 2012,” Transportation Research Record: Journal of the Transportation Research Board, no. 2476, pp. 119–125, 2015.
-  R. Nayak, N. Piyatrapoomi, and J. Weligamage, “Application of text mining in analysing road crashes for road asset management,” in Engineering Asset Lifecycle Management. Springer, 2010, pp. 49–58.
-  W. Jin, R. K. Srihari, H. H. Ho, and X. Wu, “Improving knowledge discovery in document collections through combining text retrieval and link analysis techniques,” in Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on. IEEE, 2007, pp. 193–202.
-  D. E. Brown, “Text mining the contributors to rail accidents,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 2, pp. 346–355, 2016.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  R. Johnson and T. Zhang, “Effective use of word order for text categorization with convolutional neural networks,” arXiv preprint arXiv:1412.1058, 2014.
-  X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Advances in neural information processing systems, 2015, pp. 649–657.
-  P. Blunsom, E. Grefenstette, and N. Kalchbrenner, “A convolutional neural network for modelling sentences,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
-  O. Irsoy and C. Cardie, “Opinion mining with deep recurrent neural networks.” in EMNLP, 2014, pp. 720–728.
-  D. Tang, B. Qin, and T. Liu, “Document modeling with gated recurrent neural network for sentiment classification.” in EMNLP, 2015, pp. 1422–1432.
-  S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification.” in AAAI, vol. 333, 2015, pp. 2267–2273.
-  Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy, “Hierarchical attention networks for document classification.” in HLT-NAACL, 2016, pp. 1480–1489.
-  K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of documentation, vol. 28, no. 1, pp. 11–21, 1972.
-  T. Tokunaga and I. Makoto, “Text categorization based on weighted inverse document frequency,” in Special Interest Groups and Information Process Society of Japan (SIG-IPSJ). Citeseer, 1994.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
-  J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
-  D. Scherer, A. Müller, and S. Behnke, “Evaluation of pooling operations in convolutional architectures for object recognition,” Artificial Neural Networks–ICANN 2010, pp. 92–101, 2010.
-  A. Karpathy, “The unreasonable effectiveness of recurrent neural networks,” Andrej Karpathy blog, 2015.
-  Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
-  R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks.” ICML (3), vol. 28, pp. 1310–1318, 2013.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  A. Özgür, L. Özgür, and T. Güngör, “Text categorization with class-based and corpus-based keyword selection,” Computer and Information Sciences-ISCIS 2005, pp. 606–615, 2005.
-  F. Chollet et al., “Keras: Deep learning library for theano and tensorflow,” 2015.
-  “Federal railroads administration reports,” http://safetydata.fra.dot.gov/OfficeofSafety/default.aspx.