Identifying the set of components and attributes that cause (or contribute to) code failure is an important problem in any application that relies on telemetry data for system health monitoring. For example, in Microsoft Office products, test engineers are interested in finding a generic pattern in the data that causes a code failure. This is usually not an easy problem to solve, since a combination of many factors (such as the user’s activities, hardware architecture, operating system, other programs running in the background, add-ins, etc.) can potentially contribute to a code failure. Part of this information may also not be fully captured by the telemetry signal. Moreover, a code failure is not necessarily tied to the very last activity of the user; it may be triggered by a sequence of activities in a specific order. Finally, depending on the architecture design, we might capture or lose the very last batch of telemetry data if a major failure such as a crash happens.
In this paper, we are interested in finding the root-cause of code failures as well as building a model to predict them. We use Long Short Term Memory (LSTM) networks, a type of recurrent neural network, for code failure prediction and pattern extraction.
The novelty of this work can be summarized as follows:
For code failure prediction, we propose an LSTM network whose hyper-parameters and type of LSTM cells are determined by the Bayesian optimization technique. It is usually not a trivial task to pick the LSTM architecture that achieves the best performance. Bayesian optimization enables us to systematically find the best LSTM architecture for our application. In this work, we consider two types of LSTM networks: (a) standard and (b) bi-directional. These topics are covered in Section III.
For code failure pattern extraction, we first introduce the Contributors and Blockers concepts. In this paper, contributors are the set of actions or events that individually or together result(s) in a code failure, while blockers are the set of actions or events that individually or together prevent(s) a code failure from happening. We then formulate the problem of finding contributors and blockers as two optimization problems and propose an algorithm that utilizes a trained LSTM-based prediction model to extract contributors and blockers in sequential data. Details of the proposed method are discussed in Section IV.
In Section V, we provide experimental results showing that the proposed method outperforms existing algorithms in both pattern extraction and prediction performance.
II Related Works
For sequential data analysis and prediction, conventional machine learning algorithms such as SVM, Logistic Regression and Feed-Forward Neural Networks assume independence between features and perform poorly when the order of components in a sequence is important. With a Bag of Words representation, we lose the order of components in the sequences, and although we can capture the order indirectly by creating n-grams and considering a sliding window, this type of solution is not very useful in time series analysis. Hidden Markov Models (HMM) and high-order Markov chains are another option for representing time series and an alternative for sequential modeling [3, 4, 5, 6]. However, the state space grows exponentially with the size of the sequence, window size and number of states, rendering Markov models computationally impractical for modeling long-term dependencies. Classic sequential pattern and rule mining methods can help us gain insight into events that contribute to code failures when the dataset is not very large and the overall length of sequences or the window size is relatively small. However, these methods fail to handle long sequences and large datasets. Their main drawback is the number of patterns and rules they need to keep track of, which, as with Markov models, grows exponentially with the length of the sequences. Also, sequential rule mining solutions perform well on sequences with no duplicate events, but this assumption is not necessarily valid for the code failures we see during sessions conducted by Microsoft Office users, where we cannot eliminate the possibility of duplicates in the sequences of telemetry events.
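The loss of ordering described above can be illustrated with a few lines of Python. This is a minimal sketch: the helper names `bag_of_words` and `ngrams` and the single-letter event names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bag_of_words(events):
    """Count each event, discarding its position in the sequence."""
    return Counter(events)


def ngrams(events, n):
    """Return the contiguous n-grams seen by a sliding window of width n."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


s1 = ["f", "b", "c"]
s2 = ["c", "b", "f"]

# Bag of Words maps both sessions to the same feature vector.
print(bag_of_words(s1) == bag_of_words(s2))

# Bigrams keep local order, but only inside the window.
print(ngrams(s1, 2))
print(ngrams(s2, 2))
```

The two sessions become distinguishable only once n-grams are used, and the number of possible n-grams grows quickly with the window width, which is the practical limitation mentioned above.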
The main advantage of Recurrent Neural Networks (RNN), and in particular LSTM networks, is that they are end-to-end differentiable with respect to each of the parameters in the model. Therefore, unlike other models of a combinatorial nature, RNNs and LSTM networks can be trained using gradient-based algorithms. In fact, the training complexity of LSTM grows linearly with the number of weights in the network, which makes LSTM networks very efficient for this type of problem. Moreover, an RNN model can mitigate over-fitting using standard techniques such as drop-out and weight decay.
Code failure prediction in sequential telemetry data is one of the areas where LSTM networks can offer an edge over classical machine learning and data mining methods. Zhang et al. recently used standard LSTM networks for system failure prediction; however, the focus of their work is on failure prediction and not on identifying the root-cause of failures. We are, on the other hand, more interested in identifying the root-cause of code failures and extracting patterns that lead to code failures, in addition to code failure prediction.
Interpretability is critical for some applications such as medical diagnostic tools and self-driving cars, where the reliance of the model on the correct features needs to be guaranteed [19, 20]. Recently, researchers have made efforts to better understand the decision-making process in recurrent neural networks, with applications in NLP and genomic sequencing. Karpathy et al. visualized neural generation models from an error-analysis point of view, by analyzing predictions and errors of recurrent neural networks. The approach shows the intriguing dynamics of hidden cells in LSTM networks, but is limited to a few manually inspected cases such as brace opening and closing. Li et al. used the first-order derivative to examine the saliency of input features, but they relied on the overly strong assumption that the decision score is a linear combination of input features. In another effort, Li et al. trained a separate generator that extracts a subset of text which leads to a similar decision as the one with the original input, and used it to form an interpretable summary. Lei et al. proposed a learning process that generates rationales for a given text. Rationales are subsets of words from the input text that are sufficient for prediction, and can be used as a substitute for the original text.
Our work is different from  in the sense that it extracts a set of words that may not be coherent, and can consist of words that are not necessarily in immediate sequences. The emphasis of our work is on the importance of each individual word in the sequence. Here the idea is to calculate the relative change of score for a text when a word is removed from the sequence. We utilize this approach to find the sequences of actions that lead to a specific code failure.
III Prediction Method
III-A Network Architecture
Suppose a session consists of a sequence of events that may or may not result in a code failure. Using a dataset of such sequences, we train an LSTM-based model to predict the outcome of each sequence (code failure vs. no code failure). Fig. 1 shows the high-level architecture of our proposed LSTM model. First, the sequence of events is fed to the embedding layer, where each event is encoded to an e-dimensional real-valued vector. Here e is a parameter which is determined based on the number of events and the size of the input sequence. The output of the embedding layer is passed through the LSTM layer, followed by a dropout layer to avoid over-fitting. In the end, we have a fully connected (Dense) layer that generates a single output value. In this setting, output 0 indicates “no code failure” while 1 indicates “code failure”.
For training, we use the Cross-Entropy loss function. Given the ground-truth vector of outcomes and the vector of outcomes predicted by our model for sequences of data, the cross-entropy loss function is calculated as
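For reference, the standard binary cross-entropy has the form below. The symbols are assumed notation rather than necessarily the paper's own: $y_i \in \{0, 1\}$ is the ground-truth outcome of sequence $i$, $\hat{y}_i \in (0, 1)$ is the corresponding model output, and $N$ is the number of sequences. Averaging over $N$ is a common convention and is also an assumption here.

```latex
\mathcal{L}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ y_i \log \hat{y}_i + (1 - y_i) \log\big(1 - \hat{y}_i\big) \Big]
```

The first term penalizes a low predicted score on a session that ended in a code failure, and the second penalizes a high score on a session that executed normally.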
III-B Long Short Term Memory (LSTM)
LSTM is a special type of RNN, capable of learning both long-term and short-term dependencies in data. An LSTM network consists of multiple LSTM cells. Each LSTM cell has three main components responsible for forgetting, remembering and updating data. These components are depicted in Fig. 2. For a time step and at cell , we have the following input-output relationships:
where is the vector representing input event at time and stands for the sigmoid function. In this notation, denotes the element-wise product. and are the output values of the hidden layers at time and , respectively. can also be viewed as the filtered version of cell state . While adjusts how much activation is added to the internal state (forget gate), controls the effect of the internal state on the next cell (output gate). is responsible for remembering and updating the input value . The coefficients and bias factors are optimization parameters shared among all cells. In total, there are parameters to optimize for a standard LSTM network.
In this paper, we also consider bi-directional LSTM networks. A bi-directional LSTM network can be thought of as two attached standard LSTMs running in the forward and backward directions. While the forward direction makes use of past features, the backward direction utilizes future features. For such networks, the hidden layer at time is given by . Consequently, a bi-directional LSTM network has twice as many parameters as a standard LSTM. These parameters are determined during the training process. In addition to LSTM, we also tried other types of networks such as GRU and simple RNN for this application, but LSTM performed better than both.
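The relationships referred to above correspond to the usual LSTM cell update. A sketch in commonly used notation follows; the symbol names are assumptions chosen for illustration and are not necessarily those of the original equations. Here $x_t$ is the embedded input event, $h_t$ the hidden output, $c_t$ the cell state, $f_t$, $i_t$ and $o_t$ the forget, input and output gates, $\sigma$ the sigmoid function and $\odot$ the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Under this formulation, with input dimension $e$ and hidden size $n$ (assumed symbols), each of the four blocks holds $ne + n^2 + n$ weights and biases, so a standard LSTM layer has $4(ne + n^2 + n)$ trainable parameters. A bi-directional layer is commonly formed by concatenating the forward and backward hidden states, $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$, which doubles that count.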
Feature engineering can be very challenging in some applications. An important advantage of using LSTM networks is that they do not require any feature engineering. Moreover, they need no prior knowledge of the events that form a sequence or session, and all important features are identified by the model itself.
III-C Bayesian Optimization for Hyper-parameter Tuning
Any LSTM model has some hyper-parameters that need to be set before the training process starts. In our model, these hyper-parameters include the size of the embedding layer (e), the size of the hidden layer (), the learning rate and the type of LSTM network (standard vs. bi-directional). We use the Bayesian optimization technique to find the optimal values of the hyper-parameters for our model. It uses a probabilistic model of the objective function, which in this application is the performance of the learning algorithm, and based on this probabilistic model determines the next point at which to evaluate the function. The idea is to use all the available information to decide the next point, as opposed to relying only on the last point as in conventional gradient methods. It is particularly appealing when the objective function is hard or expensive to evaluate, a good example being a deep neural network. This technique trades off exploration against exploitation, and picks the hyper-parameters of the next iteration by optimizing an acquisition function. In our application, we used expected improvement (EI) as the acquisition function. The F1-score in -fold cross-validation was set as the objective function of the optimization algorithm. Using the Bayesian optimization technique, we found the following optimal values of the parameters for our synthetic dataset: embedding size e = 3, learning rate , LSTM size 6 and LSTM type = bi-directional. We observed that most of the off-the-shelf embedding layers with an embedding size of and more would lead to over-fitting.
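To make the role of the acquisition function concrete, the sketch below evaluates the closed-form expected improvement for a maximization objective such as the F1-score, given a Gaussian posterior (mean and standard deviation) at each candidate point. It is an illustration under stated assumptions, not the implementation used in this work: the function name, the optional margin `xi`, and all numeric values for the three candidate settings are hypothetical.

```python
import math


def expected_improvement(mu, sigma, best_so_far, xi=0.0):
    """Closed-form EI for maximization under a Gaussian posterior N(mu, sigma^2)."""
    improvement = mu - best_so_far - xi
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf


# Hypothetical surrogate predictions (mean F1, std) for three settings.
candidates = {
    "e=2, standard": (0.90, 0.01),
    "e=3, bi-directional": (0.92, 0.05),
    "e=8, bi-directional": (0.85, 0.10),
}
best_f1 = 0.91  # hypothetical best cross-validated F1 observed so far

scores = {name: expected_improvement(m, s, best_f1)
          for name, (m, s) in candidates.items()}
next_point = max(scores, key=scores.get)
print(next_point, scores[next_point])
```

A candidate with a slightly lower mean but large uncertainty can still receive a sizeable score, which is how the exploration and exploitation trade-off shows up in practice.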
IV Pattern Extraction
LSTM-based models are very good at predicting the outcome of a given sequence; however, they are often hard to interpret. In code failure prediction, it is important for test engineers to find the root-cause of failures. More specifically, it is important to extract and identify patterns and combinations of events that either result in a code failure (contributors) or prevent a code failure from happening (blockers) during a session. For a given sequence of events that leads to a code failure, the contributors are formally derived from the following optimization problem
where refers to removing event from . Here, we assume that a set of events collectively contribute to a code failure and if one of these events is removed we will not see the code failure.
Similarly, blockers are identified by solving the optimization problem below
where refers to removing blocker from . This notation is based on the assumption that each blocker by itself can prevent the code failure from happening. Here, it is assumed that sequence ends with no code failure (executed normally).
A naive approach to solving these two optimization problems and finding blockers and contributors is an exhaustive search, where all combinations of actions are examined to find the one with the minimum length that satisfies the constraints. However, the search space grows exponentially in this case, rendering the exhaustive search infeasible. In what follows, we propose a greedy approach that works well in identifying the contributors and blockers. For a given sequence , we go from left to right and remove each event from the sequence in turn. After each removal, we evaluate the output of the prediction model. If the prediction changes from 1 to 0, we add the event to the contributors list, and if the prediction changes from 0 to 1, we add it to the blockers list. Removing events can be done in two ways. In the first approach, called zero-inserting, we replace the event under inspection with a default “don’t care” event, for example 0. In the second approach, called void-inserting, we completely remove the event under inspection and make the sequence shorter. As it was not obvious which approach is better, we examined both empirically. Based on the results, which are discussed in detail in the next section, the void-inserting approach outperforms the zero-inserting approach.
The approaches described above fail to detect the correct contributors and blockers in a sequence with duplicates. For example, consider and assume that we know actions are the contributors. If we remove either of the two events, the predicted label still remains the same due to the other event in the sequence. To address this problem, we need to remove the first and second events together, unless the first one has already been detected as a contributor or blocker. With this modification, the whole algorithm is described in Algorithm 1.
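The greedy procedure, including the duplicate handling, can be sketched as follows. This is an illustrative reading of the description above and not a transcription of Algorithm 1. Several parts are assumptions: `toy_predict` is a hypothetical stand-in for the trained LSTM (1 = code failure, 0 = no code failure) that encodes a rule inferred from Tables II and III (events f, b, c in that order cause a failure unless e is present), and duplicates are handled by removing all occurrences of an event together, which generalizes the "first and second" case in the text.

```python
def toy_predict(seq):
    """Hypothetical stand-in for the trained model: 1 = failure, 0 = no failure."""
    if "e" in seq:          # assumed blocker
        return 0
    it = iter(seq)          # ordered-subsequence test for f, b, c
    return int(all(ev in it for ev in ("f", "b", "c")))


def remove_positions(seq, positions, mode="void", pad="0"):
    """void: drop the events; zero: replace them with a don't-care token."""
    if mode == "void":
        return [ev for i, ev in enumerate(seq) if i not in positions]
    return [pad if i in positions else ev for i, ev in enumerate(seq)]


def extract(seq, predict, mode="void"):
    """Return ('contributors' | 'blockers', events whose removal flips the label)."""
    base = predict(seq)
    found = []
    for i, ev in enumerate(seq):            # scan left to right
        if ev in found:
            continue                        # already detected earlier
        if predict(remove_positions(seq, {i}, mode)) != base:
            found.append(ev)
            continue
        # Duplicate handling: remove every occurrence of this event together.
        dup = {j for j, other in enumerate(seq) if other == ev}
        if len(dup) > 1 and predict(remove_positions(seq, dup, mode)) != base:
            found.append(ev)
    kind = "contributors" if base == 1 else "blockers"
    return kind, found


print(extract(list("afbca"), toy_predict))
print(extract(list("afbcef"), toy_predict))
print(extract(list("ffbc"), toy_predict, mode="zero"))
```

Each sequence costs at most two model evaluations per event, so the search is linear in the sequence length rather than exponential as in the exhaustive approach.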
Another approach to solving the code failure prediction problem is to use sequential rule mining and sequential pattern mining techniques. Sequential rule mining algorithms discover rules of the form in a sequence database such that and are sequential patterns. Each rule is characterized by its support, which is the frequency of sequences that contain the rule, and its confidence, which is the likelihood of sequence appearing after . The inputs to these algorithms are often minsup (minimum support) or minconf (minimum confidence), such that only rules whose support or confidence is higher than these threshold values are returned. This dependence on the minsup and minconf thresholds can be a drawback of sequential rule mining algorithms for applications such as code failure prediction, since the threshold values are not known a priori. Moreover, as we see in the experimental section, while sequential rule mining methods can extract the contributors in our application, they are not able to detect blockers. Finally, in sequential rule mining algorithms the search space grows exponentially with the size of the database and the number of distinct actions. For these reasons, sequential rule mining algorithms are not capable of solving the code failure prediction problem efficiently.
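As an illustration of the two measures, the sketch below computes support and confidence for a single rule over a toy sequence database. It assumes a simplified, partially ordered notion of a rule (every antecedent item occurs before every consequent item, with no order inside either side). The function names, the `FAIL` marker event and the toy database are hypothetical.

```python
def contains_rule(seq, antecedent, consequent):
    """True if all antecedent items occur before all consequent items."""
    x, y = set(antecedent), set(consequent)
    for k in range(1, len(seq)):
        if x <= set(seq[:k]) and y <= set(seq[k:]):
            return True
    return False


def support_confidence(db, antecedent, consequent):
    """Support: share of sequences containing the rule.
    Confidence: share of sequences containing the antecedent that also
    contain the rule."""
    n_rule = sum(contains_rule(s, antecedent, consequent) for s in db)
    n_ante = sum(set(antecedent) <= set(s) for s in db)
    support = n_rule / len(db)
    confidence = n_rule / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical database: failing sessions end with a FAIL marker event.
db = [
    list("afbc") + ["FAIL"],
    list("fbca") + ["FAIL"],
    list("afbce"),            # blocked by e, no failure
    list("cbf"),              # wrong order, no failure
    list("gfbc") + ["FAIL"],
]
print(support_confidence(db, ["f", "b", "c"], ["FAIL"]))
```

In this toy database the session containing e only lowers the confidence of the rule; nothing in the output points to e as the reason, which is consistent with the difficulty of detecting blockers noted above.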
V Experimental Work
V-A Code Failure Prediction using Synthetic Data
In order to better understand how accurately and efficiently our proposed LSTM model can solve the code failure prediction problem, we first applied it to synthetic (simulated) data. The synthetic data was generated by assuming that we have distinct events and each session can have any combination of these events with a fixed length of events per session. The order of events is important (for example, the sequence is different from the sequence ). We assumed that any session containing the sequence will lead to code failure unless event is seen in the sequence. For example, a session with sequence of events will result in code failure but another session with will not result in any code failure. We randomly generated sequences using the above rule and split them into training and test sets. In addition to the proposed LSTM model, we used Decision Tree and Random Forest algorithms and compared the results of these three methods to see which one can better predict a code failure based on the simple rules defined above.
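A generator of this kind can be sketched in a few lines. All concrete values below are assumptions made for illustration, since the exact settings are not reproduced here: 12 single-letter events, a session length of 15, the contributor pattern f, b, c (matched as an ordered, not necessarily contiguous, subsequence, as the examples in Table II suggest), the blocker e, and a 50/50 split.

```python
import random

EVENTS = list("abcdefghijkl")    # assumed: 12 distinct events
CONTRIBUTORS = ("f", "b", "c")   # assumed contributor pattern
BLOCKER = "e"                    # assumed blocker event
SEQ_LEN = 15                     # assumed fixed session length


def label(seq):
    """1 = code failure, 0 = no code failure, under the assumed rule."""
    if BLOCKER in seq:
        return 0
    it = iter(seq)               # ordered-subsequence test
    return int(all(ev in it for ev in CONTRIBUTORS))


def make_dataset(n, seed=0):
    """Draw n random sessions and label each one with the rule above."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        seq = [rng.choice(EVENTS) for _ in range(SEQ_LEN)]
        data.append((seq, label(seq)))
    return data


data = make_dataset(1000)
split = len(data) // 2           # assumed 50/50 split
train, test = data[:split], data[split:]
print(len(train), len(test), sum(y for _, y in train))
```

Because the label depends on both the order of f, b, c and the absence of e, a model has to learn a positive rule and an overriding negative rule at the same time.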
V-B Code Failure Detection using Real Data
In the next step, we trained our proposed LSTM model using real data from Excel users, who are Microsoft Office customers that paid to use our services. We removed all identifiers from the data to protect users’ privacy. To build this dataset, we extracted all telemetry events recorded from our Excel users during the sessions in which they used Excel. We balanced the dataset to have one quarter with label 0 (code failure) and the rest with label 1 (no code failure). We then trained an LSTM model that can detect and predict a specific type of code failure using the following parameters: epochs: 1000, drop-out: 0.4, maximum sequence length: 45, embedding size: 2, training-test ratio: 50/50, number of sessions: 90,000, number of distinct telemetry events: 1067.
V-C Results and Performance Comparison
Fig. 3 shows the prediction performance of our bi-directional LSTM model in terms of accuracy, precision and recall over the number of epochs. In this model, we have the following settings: embedding size: , LSTM cell: bi-directional, batch-size: , learning rate: and hidden layer size: . Our proposed LSTM model achieves the highest performance after epochs for both training and validation sets. Table I shows the performance comparison between the proposed LSTM model at epoch , the Random Forest algorithm with a maximum depth of 4 and 10 trees with bag of words (BoW), and the Random Forest algorithm with a maximum depth of 4 and 10 trees with no feature engineering. As can be seen in Table I, our proposed LSTM model outperforms the other two in predicting code failures in our synthetic data. The main reason behind the superior performance of the LSTM model is that it can remember and learn both short-term and long-term characteristics of a sequence, while the other two methods cannot learn these dependencies. We also tried a sequential rule mining method to predict code failures in our synthetic data, but we observed that it fails to fully learn the logic we used to generate the synthetic data and cannot accurately predict the sequences with code failure. This happens because sequential rule mining methods are capable of extracting positive rules that result in an outcome (contributors in this context), but they cannot learn and deduce negative rules that prevent a specific outcome from happening (blockers in this context). As a result, sequential rule mining methods generally perform poorly in applications where one rule can override and negate the general rule.
|RandomForest (Depth=4, Trees =10), No FE||0.65||0.64||0.64|
|RandomForest (Depth=4, Trees =10), BoW||0.78||1.0||0.87|
|Bi-Directional LSTM (e=3, lstmSize=6)||1.0||1.0||1.0|
In Table II, we provide some examples to show how we can find contributors and blockers using Algorithm 1. As can be seen in the table, our algorithm is capable of extracting both contributors and blockers correctly for almost all sequences. The only exception is case 7, where the contributors are detected incorrectly.
|id||Sequence||Prediction||True Label||Confidence Score||Comment|
|0||a f b c e f||0||0||1.0||Correct Extraction|
|1||c a f h f c e c k b f a b j e||0||0||1.0||Correct Extraction|
|2||a f b c a||1||1||0.99||Correct Extraction|
|3||g b g a c a f b c k b c f c||1||1||1.0||Correct Extraction|
|4||g b d f g f i g b c||1||1||1.0||Correct Extraction|
|5||f h a d b d h f c g b j d||1||1||1.0||Correct Extraction|
|6||k f b c j b h f c f c b f c||1||1||1.0||Correct Extraction|
|7||h b j c a k c d c f b c i d||1||1||0.878||Wrong Extraction|
|8||f c d b l g l c i c b f a b||1||1||1.0||Correct Extraction|
Sequential rule mining is another set of solutions that can be used for the code failure pattern extraction. We tried a modified version of the algorithm featured in  to extract the top rules for our synthetic data. The top rules are listed in Table III. Based on the results, this sequential rule mining technique is able to identify the contributors as . However, it fails to detect the blocker event . Moreover, this sequential rule mining technique fails to work for our real Excel data where the size of data is much larger and the number of telemetry events is far more than the synthetic data. As another limitation, many recent sequential rule mining algorithms are in fact partially sequential in the sense that in a rule , only is required to occur after and the events in are unordered  .
|f,b,c → code failure||21.55||45.90||2.1304|
Table IV shows the performance of our proposed algorithm for the zero-inserting and void-inserting approaches. Based on the results, the void-inserting approach outperforms the zero-inserting approach. The model is not able to make as accurate predictions with zero-inserting because no sequence in our original dataset contains the “don’t care” event in the middle of the sequence, which means the LSTM network does not have the chance to learn this condition properly during training.
|Extraction Accuracy vs. data size|
|Bi-Dir. LSTM + Zero Inserting||
|Bi-Dir. LSTM + Void Inserting||
Fig. 4 shows the performance of our LSTM model in terms of accuracy, precision and recall during the training and test process using the real Excel data. As can be inferred from the plot, our trained LSTM model is capable of predicting code failure with over % accuracy and more than % recall. In order to find the root-cause of code failures, we need to focus on the sequences for which our LSTM model predicts code failure with high confidence. Therefore, we only consider the sequences whose confidence score is above %. The confidence score is the output of the Dense layer in Fig. 2. As can be seen in Fig. 5, our proposed LSTM model has a precision of and above for almost % of the sequences.
VI Practical Remarks: Code Failure Signature Removal
We need to apply a number of preprocessing steps to the real data before using the proposed LSTM model to predict code failures and extract contributors and blockers. First, we need to remove any event that is a signature of the code failure or of normal execution of the code. These events are highly correlated with the outcome but do not hold any useful information about the code failure. The easiest way to detect such events is to first train our model without any filtering and extract the contributors and blockers. Then, by showing the results to subject-matter experts, we can decide whether a specific event is just a signature or a real contributor or blocker. For example, in our case the proposed LSTM model initially picked up as a blocker an event that was only called when Excel was properly closed with no code failure, while in reality it was a signature of proper termination of the application.
We also noticed that there are many identical sessions in our real Excel data, that is, many sessions with the same sequence of events. We have two options to address this issue: we can either remove the redundant sessions and train the LSTM model on the reduced dataset, or keep the original dataset with all redundant data. Models developed from these two approaches would be quite different. To resolve this dilemma, we recommend making sure that the training data is a true representation of the real data and mimics real-world conditions as closely as possible. Therefore, removing the redundant data points is not a good option in our code failure prediction example.
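Before choosing between the two options, it can help to measure how much redundancy the data actually contains. The following sketch counts identical sessions; the helper name and the event names are hypothetical.

```python
from collections import Counter


def session_multiplicity(sessions):
    """Map each distinct event sequence to the number of sessions sharing it."""
    return Counter(tuple(s) for s in sessions)


# Hypothetical sessions with made-up event names.
sessions = [
    ["open", "edit", "save"],
    ["open", "edit", "save"],
    ["open", "crash"],
]
counts = session_multiplicity(sessions)
deduplicated = [list(s) for s in counts]   # option 1: reduced dataset
print(counts.most_common(1), len(deduplicated), len(sessions))
```

Keeping the original list (the second option) preserves the empirical frequency of each sequence, which is what the recommendation above relies on.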
In this paper, we applied LSTM recurrent neural networks to find sessions that are prone to code failure and to extract telemetry patterns that lead to a specific code failure. Our method is designed to process a large set of data and automatically handle edge cases in code failure prediction. We took advantage of the Bayesian optimization technique to find the optimal hyper-parameters. To extract code failure patterns, we first introduced the concepts of contributors and blockers and then used a greedy approach to find them. We used both synthetic and real data to develop and test our proposed LSTM model. Our trained LSTM model demonstrated over accuracy for detecting code failures in the synthetic data. Using the proposed greedy method, we detected the contributors and blockers in the synthetic data in more than of the cases, with a performance better than sequential rule and pattern mining algorithms.
The author would like to thank Dr. Sandi Ganguli and Wayne Roseberry for constructive discussion and providing experimental data.
[1] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in neural information processing systems, 2012, pp. 2951–2959.
[2] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm: A search space odyssey,” IEEE transactions on neural networks and learning systems, vol. 28, no. 10, pp. 2222–2232, 2017.
[3] J. S. Kinnebrew and G. Biswas, “Comparative action sequence analysis with hidden markov models and sequence mining,” Knowledge Discovery and Data (KDD), 2011.
[4] M. Scholz, “R package clickstream: Analyzing clickstream data with markov chains,” Journal of Statistical Software, vol. 74, no. 4, pp. 1–17, 2016.
[5] V. Melnykov, “ClickClust: An R package for model-based clustering of categorical sequences,” Journal of Statistical Software, vol. 74, no. 9, pp. 1–34, 2016.
[6] D. Jurafsky and J. H. Martin, “Hidden Markov Models,” in Speech and Language Processing. 2nd Edition, 2018.
[7] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015.
[8] M. J. Zaki, “SPADE: An efficient algorithm for mining frequent sequences,” Machine learning, vol. 42, no. 1-2, pp. 31–60, 2001.
[9] P. Fournier-Viger, A. Gomariz, M. Campos, and R. Thomas, “Fast vertical mining of sequential patterns using co-occurrence information,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2014, pp. 40–52.
[10] D. Lo, S.-C. Khoo, and L. Wong, “Non-redundant sequential rules—theory and algorithm,” Information Systems, vol. 34, no. 4-5, pp. 438–453, 2009.
[11] P. Fournier-Viger, C.-W. Wu, V. S. Tseng, L. Cao, and R. Nkambou, “Mining partially-ordered sequential rules common to multiple sequences,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 8, pp. 2203–2216, 2015.
[12] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[13] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, “Long short term memory networks for anomaly detection in time series,” in Proceedings. Presses universitaires de Louvain, 2015, p. 89.
[14] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, speech and signal processing (icassp), 2013 ieee international conference on. IEEE, 2013, pp. 6645–6649.
[15] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Sequential deep learning for human action recognition,” in International Workshop on Human Behavior Understanding. Springer, 2011, pp. 29–39.
[16] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognition with deep bidirectional lstm,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
[17] R. Socher, “Recursive deep learning for natural language processing and computer vision,” Ph.D. dissertation, Citeseer, 2014.
[18] K. Zhang, J. Xu, M. Renqiang Min, G. Jiang, K. Pelechrinis, and H. Zhang, “Automated it system failure prediction: A deep learning approach,” pp. 1291–1300, 12 2016.
[19] G. Montavon, W. Samek, and K.-R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing, 2018.
[20] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” knowledge discovery and data mining, 2015.
[21] J. Li, W. Monroe, and D. Jurafsky, “Understanding neural networks through representation erasure,” arXiv preprint arXiv:1612.08220, 2016.
[22] J. Li, X. Chen, E. Hovy, and D. Jurafsky, “Visualizing and understanding neural models in nlp,” arXiv preprint arXiv:1506.01066, 2015.
[23] A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and understanding recurrent networks,” arXiv preprint arXiv:1506.02078, 2015.
[24] T. Lei, R. Barzilay, and T. Jaakkola, “Rationalizing neural predictions,” arXiv preprint arXiv:1606.04155, 2016.
[25] J. Lanchantin, R. Singh, B. Wang, and Y. Qi, “Deep motif dashboard: Visualizing and understanding genomic sequences using deep neural networks,” in PACIFIC SYMPOSIUM ON BIOCOMPUTING 2017. World Scientific, 2017, pp. 254–265.
[26] K. P. Murphy, “Machine learning: a probabilistic perspective,” 2012.
[27] C. Olah, “Understanding LSTM networks,” 2015. [Online]. Available: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[28] J. Mockus, V. Tiesis, and A. Zilinskas, “The application of bayesian methods for seeking the extremum,” pp. 117–129, 09 2014.
[29] P. Fournier-Viger, R. Nkambou, and V. S.-M. Tseng, “Rulegrowth: mining sequential rules common to several sequences by pattern-growth,” in Proceedings of the 2011 ACM symposium on applied computing. ACM, 2011, pp. 956–961.