Cloud Failure Prediction with Hierarchical Temporal Memory: An Empirical Assessment

10/06/2021
by Oliviero Riganelli, et al.

Hierarchical Temporal Memory (HTM) is an unsupervised learning algorithm inspired by the features of the neocortex that can be used to continuously process streaming data and detect anomalies, without requiring large amounts of training data or labeled data. HTM is also able to continuously learn from samples, providing a model that is always up-to-date with respect to the observations. These characteristics make HTM particularly suitable for supporting online failure prediction in cloud systems, which are systems with a dynamically changing behavior that must be monitored to anticipate problems. This paper presents the first systematic study that assesses HTM in the context of failure prediction. The results that we obtained considering 72 configurations of HTM applied to 12 different types of faults introduced in the Clearwater cloud system show that HTM can help to predict failures with sufficient effectiveness (F-measure = 0.76), representing an interesting practical alternative to (semi-)supervised algorithms.

I Introduction

Cloud systems are distributed systems that rely on virtualization technologies to flexibly scale services depending on the environmental conditions, such as the workload and the available resources. Their capability to adapt to the conditions that emerge in the field, while they are operating, is extremely useful to save resources and ultimately deliver the expected quality of service despite changing conditions [5].

Unfortunately, the adaptability and flexibility of the cloud make the environment more complex, and consequently increase the chance of observing misbehaviors and failures. In fact, in contrast with the strong availability requirements of modern Web applications and services, empirical data show that failures are still extremely frequent in cloud systems [6, 17, 7]. For instance, Microsoft reported a failure rate that drastically impacts the target availability of 99.999% [17], and Vishwanath and Nagappan reported frequent server failures in data centers [31]. Further, Cotroneo et al. show that many of the failures experienced in the cloud are not timely detected and notified [7].

To handle failures properly, cloud systems are equipped with monitoring techniques that collect behavioral data about both the individual services and the infrastructure [2, 24, 11], to detect anomalies, raise alarms, and anticipate failures [33, 19, 12, 17, 16].

In this paper, we focus on the challenge of predicting failure occurrences exploiting runtime data. So far, this challenge has been addressed with machine learning algorithms, such as Support Vector Machines [29], Long Short-Term Memory [17], or Gradient Boosting Trees [33], that require a supervised training phase that must be frequently repeated to adapt the models to the changing behavior of the monitored cloud system.

Hierarchical Temporal Memory (HTM) is a brain-inspired unsupervised machine learning algorithm originally proposed by Jeff Hawkins [14]. HTM mimics the structural and functional characteristics of cells in the cerebral cortex in order to learn and make predictions, while effectively processing spatial and temporal information extracted from an input data stream. HTM continually updates its models as new data to analyze become available. While successfully applied in many other contexts [10, 27, 15, 32, 20, 18, 3], it has received little attention in the context of cloud failure prediction, with only a preliminary study by Mobilio et al. considering its usage [21].

This paper presents the first study that systematically evaluates the effectiveness of 72 configurations of HTM in supporting cloud failure prediction. The study considers a realistic telecom cloud-based system that provides IP-based voice, video, and message services. We evaluated the capability to predict failures using the anomalies reported by HTM for 12 types of faults, while analyzing Key Performance Indicators (KPIs), such as CPU, network, and memory consumption, collected from the cloud resources available in the system. Results show that the anomalies reported by HTM can be used to predict failure occurrences with good effectiveness (F-measure = 0.76).

The paper is organized as follows. Section II introduces HTM. Section III presents the online failure prediction strategy used to evaluate HTM. Section IV explains the evaluation methodology. Section V describes results. Section VI discusses related work. Section VII provides final remarks.

II Hierarchical Temporal Memory

Hierarchical Temporal Memory (HTM) is an unsupervised learning algorithm inspired by the structural and algorithmic features of the neocortex, which consists of many regions that share a similar cellular structure while being responsible for different tasks. Similar to the neocortex, HTM is composed of regions interconnected in a hierarchy. Each region learns patterns detected in the streamed input values, and the hierarchy lets information flow across regions. Such a structure allows high-level representations to be formed from low-level sensing data, mimicking the brain, which does not store every single object of a class in order to recognize the class, but simply memorizes the properties that define the objects of that class.

Sparse Distributed Representations (SDRs) are the means of memorizing and transferring information in HTM. An SDR is a large vector of bits of which only a small percentage is active, capturing the semantic meaning of the encoded information. This implies that if the set of active bits of one SDR overlaps strongly with the set of active bits of another SDR, then the two SDRs are semantically similar.
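
As a minimal illustration of how overlap captures semantic similarity, the following sketch compares randomly generated binary SDRs; the vector size and sparsity are arbitrary illustrative choices, not the parameters used by actual HTM implementations.

import numpy as np

def sdr_overlap(a: np.ndarray, b: np.ndarray) -> int:
    """Number of active bits shared by two binary SDRs."""
    return int(np.sum(a & b))

rng = np.random.default_rng(0)
n_bits, n_active = 2048, 40     # illustrative vector size and ~2% sparsity

def random_sdr() -> np.ndarray:
    sdr = np.zeros(n_bits, dtype=np.uint8)
    sdr[rng.choice(n_bits, size=n_active, replace=False)] = 1
    return sdr

x = random_sdr()
y = x.copy()
y[rng.choice(np.flatnonzero(y), size=5, replace=False)] = 0  # y: slightly perturbed x
z = random_sdr()                                             # z: unrelated information

print(sdr_overlap(x, y))  # large overlap -> semantically similar
print(sdr_overlap(x, z))  # little or no overlap -> unrelated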

HTM is composed of two main components, the Spatial Pooler (SP) and the Sequence Memory (SM), running in a pipeline. The SP is responsible for encoding the input data into an SDR: it learns the spatial characteristics of each input and finds a stable representation of the spatial patterns in the form of a sparse vector. For example, the time series data (i.e., sequences of numerical values associated with timestamps) produced by each KPI can be encoded using the standard HTM date-time and scalar encoders [25].
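
For instance, a simple scalar encoder along the following lines maps nearby KPI values to overlapping bit patterns; the value range, vector size, and number of active bits are arbitrary illustrative choices, not the parameters of the standard HTM encoders.

import numpy as np

def scalar_encode(value: float, v_min: float = 0.0, v_max: float = 100.0,
                  n_bits: int = 400, w: int = 21) -> np.ndarray:
    """Encode a scalar KPI reading (e.g., CPU %) as a binary vector in which a
    contiguous run of w bits is active; nearby values share active bits."""
    value = min(max(value, v_min), v_max)
    start = int(round((value - v_min) / (v_max - v_min) * (n_bits - w)))
    sdr = np.zeros(n_bits, dtype=np.uint8)
    sdr[start:start + w] = 1
    return sdr

a, b, c = scalar_encode(42.0), scalar_encode(43.0), scalar_encode(90.0)
print(int(np.sum(a & b)), int(np.sum(a & c)))  # high overlap vs. no overlap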

The SDR produced by the SP is then sent to the SM, which is responsible for learning and processing the sequence patterns that are used for prediction, taking into account the data processed in the past. The output of the SM is a sparse vector that represents a prediction of the next input.

In contrast with other statistical and machine learning algorithms, such as neural networks, HTM better matches the characteristics of cloud environments. In fact, other algorithms are often trained in batch mode, requiring the storage of a large set of sequences to optimize performance on that specific data [8, 9]. HTM is more flexible, providing fully online and real-time learning, jointly with quick adaptation to new patterns, without requiring large sets of sequences to be stored.

A detailed presentation of HTM is available in [13, 8].

III Online Failure Prediction in the Cloud

We study the effectiveness of HTM in supporting cloud failure prediction by using a KPI-driven online failure prediction approach, which analyzes the KPIs to detect patterns of anomalous behaviors that are likely to result in failures at a later time. KPI-driven online failure prediction is implemented by many techniques [17, 19, 29] and can be considered the most widely used approach to failure prediction in the cloud.

Fig. 1: Online Failure Prediction

The failure prediction system consists of three main types of components, as shown in Figure 1: the Anomaly Detector, which is responsible for identifying anomalous KPI values; the Local Failure Predictor, which is responsible for predicting failures based on the anomalies reported for the KPIs collected from the same cloud resource; and the Global Failure Predictor, which is responsible for ultimately predicting system failures by analyzing the resource-level failure predictions. The architecture includes one anomaly detector per analyzed KPI, one local failure predictor per analyzed cloud resource, and a single global failure predictor. Faulty executions of the system are not required to train the models.

III-A Anomaly Detector

The Anomaly Detector is the key component that discriminates between legal and anomalous values by analyzing each KPI separately, according to a learned predictive model. In this study, the anomaly detection algorithm employs Hierarchical Temporal Memory (HTM) as proposed by Ahmad et al. [1]. Note that HTM neural networks continuously learn and model the spatiotemporal characteristics of the inputs to predict the input values at the next time interval, and do not report anomalies by themselves. However, they can easily be adapted to report anomalies, as proposed by Ahmad et al. [1].

Given a real-time KPI stream $\dots, x_{t-1}, x_t, x_{t+1}, \dots$, where $x_t$ is a vector representing the state of the KPI at time $t$, HTM learns and predicts the temporal sequences of the values in such a stream. Recall that the input is first semantically encoded into a sparse distributed representation (SDR) and then normalized into a bit vector $a(x_t)$ of fixed size and sparsity. The resulting vector is used by HTM to produce a prediction in the form of another normalized sparse vector $\pi(x_t)$, which represents the prediction for the input $x_{t+1}$ at the next time interval $t+1$.

HTM can be used to determine whether $x_t$ is anomalous by comparing the actual values $a(x_t)$ to the prediction $\pi(x_{t-1})$ made at the previous time interval [1]. Since $a(x_t)$ and $\pi(x_{t-1})$ are bit vectors, a prediction error $s_t$ between 0 and 1 can be calculated depending on the “similarity” between the actual and predicted bits; that is, the prediction error is inversely proportional to the number of common bits between the actual and predicted binary vectors. More rigorously:

$$s_t = 1 - \frac{\pi(x_{t-1}) \cdot a(x_t)}{|a(x_t)|}$$

where $\cdot$ indicates the scalar product between the bit vectors and $|a(x_t)|$ denotes the number of active bits in the vector.

The prediction error $s_t$ represents an instantaneous measure of the predictability of the current KPI stream. However, thresholding this measure directly could lead to a high number of false positives, especially in noisy systems. For this reason, an anomaly likelihood $L_t$, a probabilistic metric that defines how anomalous the current state is based on the prediction history of the HTM model, is also computed [1]. The history consists of a window of the last $W$ prediction errors. Assuming that the prediction errors follow a rolling normal distribution, the sample mean and variance are continuously updated as follows:

$$\mu_t = \frac{\sum_{i=0}^{W-1} s_{t-i}}{W} \qquad\qquad \sigma_t^2 = \frac{\sum_{i=0}^{W-1} (s_{t-i} - \mu_t)^2}{W-1}$$

A threshold on the Gaussian tail probability (Q-function) is then applied to decide whether an input is anomalous:

$$L_t = 1 - Q\!\left(\frac{\tilde{\mu}_t - \mu_t}{\sigma_t}\right)$$

where $\tilde{\mu}_t$ is a short-term moving mean computed as $\mu_t$ but over a smaller window $W'$. The threshold for $L_t$ is based on a user-defined parameter $\epsilon$: an anomaly is reported if $L_t \geq 1 - \epsilon$.
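
A compact sketch of this scoring scheme in plain Python follows; the window sizes and the value of $\epsilon$ are illustrative choices, not the parameters used in our experiments.

from collections import deque
from math import erf, sqrt

import numpy as np

def prediction_error(pred_prev: np.ndarray, actual: np.ndarray) -> float:
    """s_t = 1 - (pi(x_{t-1}) . a(x_t)) / |a(x_t)| for binary SDR vectors."""
    active = actual.sum()
    if active == 0:
        return 0.0
    return 1.0 - float(pred_prev @ actual) / float(active)

def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Z > x)."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

class AnomalyLikelihood:
    """Turns raw prediction errors into anomaly reports."""

    def __init__(self, window: int = 500, short_window: int = 10, epsilon: float = 1e-4):
        self.errors = deque(maxlen=window)   # last W prediction errors
        self.short_window = short_window     # W' < W
        self.epsilon = epsilon               # user-defined threshold

    def update(self, s_t: float) -> bool:
        self.errors.append(s_t)
        if len(self.errors) <= self.short_window:
            return False                     # not enough history yet
        errs = np.asarray(self.errors)
        mu = errs.mean()
        sigma = errs.std(ddof=1)
        if sigma == 0.0:
            return False
        mu_short = errs[-self.short_window:].mean()
        likelihood = 1.0 - q_function((mu_short - mu) / sigma)
        return likelihood >= 1.0 - self.epsilon   # report an anomaly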

III-B Local Failure Predictor

The Local Failure Predictor analyzes the anomalies in the KPIs collected from the same cloud resource to determine whether they are symptoms of a failure or anomalous but legal behaviors. The Local Failure Predictor is obtained by training a model with the outputs of the anomaly detectors obtained during normal executions (note that several anomalous behaviors are usually generated also during normal executions). Deviations from normal behaviors, that is, sets of anomalies different from the ones reported during normal executions, are reported as symptoms of a possible failure in the cloud resource. The possibility to train models using only normal executions is of fundamental importance for cloud failure prediction techniques, since there are many different failure situations that cannot be anticipated, and thus a fairly complete set of failure samples is rarely available for a system.

In this experiment, we train a one-class Support Vector Machine (SVM) as Local Failure Predictor, since it can be trained with data assumed to belong to a single class, which, in this case, can easily be obtained from the normal execution of the software. The SVM needs to be updated only if the list of the KPIs monitored in a resource changes. To consider stable predictions only, the Local Failure Predictor is instructed to report a failure prediction only after a local prediction is confirmed for a number of consecutive time instants defined by a user-defined threshold.
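
A possible sketch of such a local predictor, based on scikit-learn's one-class SVM, is the following; the feature encoding of the anomaly-detector outputs, the hyperparameters, and the threshold name k are our own assumptions.

import numpy as np
from sklearn.svm import OneClassSVM

class LocalFailurePredictor:
    """Per-resource failure predictor trained only on failure-free executions."""

    def __init__(self, k: int = 3):
        # Hyperparameters are illustrative assumptions, not tuned values.
        self.model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
        self.k = k                  # consecutive confirmations before reporting
        self.consecutive = 0

    def fit(self, normal_anomaly_matrix: np.ndarray) -> None:
        # One row per time step: the anomaly flags produced by the HTM-based
        # detectors of all KPIs of this resource during normal executions.
        self.model.fit(normal_anomaly_matrix)

    def step(self, anomaly_vector: np.ndarray) -> bool:
        deviation = self.model.predict(np.asarray(anomaly_vector).reshape(1, -1))[0] == -1
        self.consecutive = self.consecutive + 1 if deviation else 0
        return self.consecutive >= self.k   # stable deviation -> local failure prediction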

III-C Global Failure Predictor

While the Local Failure Predictor performs a per-resource analysis, the goal of the Global Failure Predictor is to perform an ensemble analysis and produce a single failure prediction for the entire system. In fact, there might be anomalous behaviors limited to one resource that are handled within the resource itself, without propagating to other cloud resources and to the rest of the system. The role of the Global Failure Predictor is to determine when failures predicted locally may result in a system failure.

At each time interval $t$, the Global Failure Predictor accesses the predicted state of each resource and uses this information to confirm whether a local failure prediction is going to affect the whole system. Similarly to Sauvanaud et al. [29], we investigate two strategies: (i) single-resource global prediction and (ii) vote-based global prediction.

In the single-resource prediction, a system failure is predicted after a number of consecutive local failure predictions for the same resource, where the required number is a user-configurable threshold. This case corresponds to a policy that assumes that local failures are likely to propagate to the system level. In the vote-based prediction, instead, a failure is confirmed by the Global Failure Predictor when failures are predicted on at least half of the resources for a user-defined number of consecutive time intervals. The idea here is that the system can tolerate local failures, and a system-level failure is likely to be experienced only when multiple cloud resources are likely to fail.
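
Both strategies can be sketched as simple stateful counters over the per-resource local predictions; the parameter names and default thresholds below are placeholders, not the configurations evaluated in this paper.

class SingleResourceGlobalPredictor:
    """Predicts a system failure after a given number of consecutive local
    failure predictions on the same resource."""

    def __init__(self, resources, threshold: int = 3):
        self.counters = {r: 0 for r in resources}
        self.threshold = threshold

    def step(self, local_predictions: dict) -> bool:
        # local_predictions maps each resource to True if it locally predicts a failure.
        for resource, predicted in local_predictions.items():
            self.counters[resource] = self.counters[resource] + 1 if predicted else 0
        return any(c >= self.threshold for c in self.counters.values())


class VoteBasedGlobalPredictor:
    """Predicts a system failure after consecutive time steps in which at least
    half of the resources locally predict a failure."""

    def __init__(self, n_resources: int, threshold: int = 2):
        self.n_resources = n_resources
        self.threshold = threshold
        self.consecutive = 0

    def step(self, local_predictions: dict) -> bool:
        majority = sum(local_predictions.values()) >= self.n_resources / 2
        self.consecutive = self.consecutive + 1 if majority else 0
        return self.consecutive >= self.threshold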

IV Methodology

We empirically assess HTM as an online anomaly detector supporting failure prediction tools for cloud systems. In this section, we introduce the research questions that we studied, the telco cloud-based system that we used as a testbed for the assessment, and the fault seeding strategy that we adopted to collect data about failures caused by different types of faults.

IV-A Research Questions

To assess HTM in the context of failure prediction, we investigate the following research questions:

RQ1: Can an HTM-based anomaly detector support a failure prediction system in accurately predicting failures? We executed the prediction system with failure-free executions and with different failure types and failure patterns, and measured its ability to predict failures.

RQ2: How early can failures be predicted? We studied how early failures are predicted, for different types of failures.

These research questions are addressed by analyzing 72 parameter configurations of the prediction system. The studied configurations are obtained by combining the following parameters:


  • the anomaly likelihood threshold $\epsilon$ used for KPI anomaly detection (Section III-A); we considered three different values.

  • the local prediction threshold of a possible failure on a single resource, i.e., the number of consecutive time instants for which a local prediction must be confirmed; we considered multiple values.

  • the thresholds used in single-resource and vote-based global prediction, respectively; for each threshold, we considered a range of values.

To answer RQ1, we evaluate the quality of the models using the standard metrics precision, recall, and F-measure [28], so as to assess the effectiveness in predicting failures. We address RQ2 by calculating the time between a global failure prediction and the occurrence of the predicted failure.
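
For reference, given the numbers of true positive (TP), false positive (FP), and false negative (FN) failure predictions, these metrics are defined in the standard way:

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN} \qquad \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$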

IV-B Testbed

We perform the evaluation on a cloud environment running Clearwater, an open-source implementation of an IP Multimedia Subsystem (IMS) that provides IP-based voice, video, and message services [23]. Clearwater is a meaningful subject for this study since it represents the case of an ever-running cloud system that must operate reliably to guarantee IP-based communication. Failures must indeed be predicted and handled before they cause any service interruption.

We used the standard installation of Clearwater, which consists of six components, each one running on a different virtual machine (VM) configured with 2 vCPUs, 2GB of RAM, 20GB of hard disk space, and Ubuntu 12.04 LTS. In this evaluation, we detect anomalies from a total of 150 KPIs collected from all the components of Clearwater. The monitored KPIs include CPU and memory consumption, network usage, and many more. The detailed list of monitored KPIs is available in our online appendix to this paper [22].

IV-C Experimental Setup

We evaluate the quality of HTM by studying how it can support automatic failure prediction. Specifically, we first run Clearwater without producing any failure to train both the HTM models and the local predictors. In particular, data collected from a first week of failure-free executions is used to generate the HTM models, and data from a second week of failure-free executions is processed to train the local predictors. Note that failure-free executions are not free of anomalies, since services may operate in an unpredictable way despite not generating any failure. To generate traffic for the running system, we generated a number of users and calls according to weekly and daily patterns (e.g., more users on workdays, fewer users at night, peak times at 9 am and 7 pm), including a certain degree of randomness, as done in similar studies that used Clearwater as subject system [19].

To evaluate failure prediction, we inject the following four types of faults, one at a time: CPU hogs, memory leaks, packet loss faults, and excessive workload conditions, as defined in [4, 30]. We activate the injected faults according to three activation patterns: (i) the fault is triggered with the same frequency over time; (ii) the fault is activated with a frequency that increases exponentially, resulting in a shorter time to failure; (iii) the fault is activated randomly over time. Overall, we obtain 12 faulty scenarios: 4 fault types times 3 activation patterns. We also assess failure prediction on 12 normal executions, not included in the training set, in which the system was running under its normal operating conditions without faults or abnormal workloads, to assess the capability of not generating spurious predictions.
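
For illustration only, the three activation patterns could be generated along the following lines; the duration, rates, and number of activations are arbitrary choices, not the values used in our experiments.

import numpy as np

def activation_times(pattern: str, duration_s: int = 3600, seed: int = 1) -> np.ndarray:
    """Instants (in seconds) at which an injected fault is activated."""
    rng = np.random.default_rng(seed)
    if pattern == "constant":        # same activation frequency over time
        return np.arange(0, duration_s, 60)
    if pattern == "exponential":     # frequency grows exponentially -> shorter time to failure
        gaps = 300.0 * np.power(0.8, np.arange(40))
        times = np.cumsum(gaps)
        return times[times < duration_s]
    if pattern == "random":          # activations uniformly at random over time
        return np.sort(rng.uniform(0, duration_s, size=30))
    raise ValueError(f"unknown pattern: {pattern}")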

V Results

In this section we discuss the results obtained for the two research questions introduced in the previous section.

RQ1: Prediction Effectiveness

We evaluated the capability of HTM to support failure prediction by studying the effectiveness of the vote-based and single-resource prediction strategies. In the experiments, failures correspond to either system crashes or a call success rate dropping below a fixed threshold. We report the detailed results obtained for each configuration in the online appendix [22]. Here we discuss the most relevant results.

Figure 2 (a) shows the precision, recall, and F-measure values of vote-based global prediction for the various configurations, aggregated according to the value of the vote-based global prediction threshold. As expected, precision increases while recall decreases for increasing values of the threshold. The configuration that performs best in terms of F-measure is also the configuration providing the highest recall. In some cases, engineers may want to optimize precision (e.g., to avoid false alarms); the configuration that provides the highest precision while keeping recall at an acceptable level is reported in the online appendix [22].

Figure 2 (b) shows the precision, recall, and F-measure values of the single-resource global prediction strategy for the various configurations, aggregated according to the value of the single-resource global prediction threshold. The trends of precision, recall, and F-measure resemble the ones of the vote-based global prediction, although the differences between configurations are smaller. Two configurations provide the highest F-measure, and they are also the configurations that provide the highest recall. Similarly, two configurations provide the highest precision while keeping recall at an acceptable level.

Overall, these results show that HTM can be feasibly employed to support failure prediction, especially if high recall values are needed (i.e., when it is important to predict most of the failures at the cost of experiencing some false alarms). In this case, the vote-based strategy might be preferred to the single-resource strategy, since it achieves a higher F-measure, with more stable results across configurations.

Although it might be hard to achieve nearly perfect precision, HTM can still be employed when high precision is the priority. In such a case, the single-resource global prediction strategy should be preferred, since it reached the highest precision among the evaluated configurations without an excessive drop in recall.

Interestingly, the lowest value of the local prediction threshold generates the best results for both strategies, suggesting that local predictions should be fired immediately, with the global predictor taking decisions by considering the stability and spreading of the failure predictions. In particular, the vote-based strategy performs best when firing a global failure prediction as soon as the majority of the cloud resources are likely to fail, while the single-resource strategy performs best when waiting for enough consistent local predictions before issuing a global failure prediction.

Table I shows the recall per type of fault. The most difficult failures to predict are the ones caused by excessive workload, since an overloading number of requests generates behaviors quite similar to failure-free executions. The other failure types have similar recall, with CPU hogs, due to their impact on the running services, being simpler to detect than the other faults.

Fault Recall Median Prediction Time
CPU Hog 0.74 134 mins
Memory Leak 0.60 54 mins
Network Loss 0.65 51 mins
Excessive Workload 0.52 15 mins
TABLE I: Fault Analysis

Compared to studies where failures are used in the training phase [29, 19], HTM achieved slightly lower effectiveness. Although results obtained in different experiments are not directly comparable, a gap was expected. However, the use of failure-free executions only and the continuous training of HTM-based anomaly detectors are important features that save engineers the significant, and sometimes impractical, effort of simulating hundreds of failures to train prediction models.

Fig. 2: Performance of global predictors

RQ2: Prediction Timeliness

RQ2 studies the capability to predict failures early. To evaluate the timeliness of the predictions, we measure the time difference between the prediction time and the failure time. This measure approximately captures the time available to react to a failure prediction before the failure happens. Figure 3 shows the results aggregated according to the value of the global prediction threshold.

The vote-based global failure prediction was able to produce a failure prediction between 4 and 159 minutes before the failure occurred, with a median time of 54 minutes. The configuration with the best F-measure predicted failures with a median time of 64 minutes.

The single-resource global failure prediction was able to produce a failure prediction between 2 and 159 minutes before the failure occurred, with a median time of 53 minutes. The configurations with the best F-measure predicted failures with a median time of 57 minutes.

These time figures indeed allow the activation of automatic workarounds, such as cloning or migrating services, and in the vast majority of the cases they are also early enough to allow for manual intervention.

Table I reports the median failure prediction time per type of fault. Failures caused by excessive workloads have the smallest prediction time, since these failures are usually recognized only once the system is already congested. Failures caused by CPU hogs are, on the contrary, easy to anticipate, due to their quickly recognizable impact on the running services.

Fig. 3: Prediction timeliness

VI Related Work

The work presented in this paper relates to research in two main areas: research about applications of HTM and research about prediction of failures in cloud systems.

VI-A Applications of HTM

HTM networks have been used to address various learning tasks in different domains. In particular, they have been employed to provide real-time predictions in contexts where continuous and unpredictable changes affect the input data. For example, HTM has been employed for patient health monitoring and the prediction of human-machine gesture interactions [10, 27], network intrusion detection [15, 32], phone network intention estimation [20], short-term prediction of traffic flows [18], and detection of anomalies in crowd movements [3]. The results reported in these works motivated the investigation of HTM applied to cloud failure prediction.

Ahmad et al. [1] recently described how to use HTM to perform unsupervised real-time anomaly detection on streaming data. Their algorithm is capable of detecting spatial and temporal anomalies in predictable and noisy domains without lookahead or supervision. Although the original paper does not suggest any specific application scenario, the proposed anomaly detection schema fits the domain of cloud failure prediction well, and thus we used it in our experiments.

One of the closest applications of HTM is the one by Rodriguez et al. [26], who propose to use HTM to detect anomalous resource consumption caused by the execution of scientific workflows. Differently from our study, anomalies are used to trigger resource scheduling and not to predict failures.

VI-B Failure Prediction in Cloud Environments

Several studies propose to predict failures using supervised machine learning algorithms that generate system-level models, that is, models that capture the behavior of the entire system. For example, Lin et al. [17] used an ensemble of supervised machine learning models to predict failures in cloud systems. Mariani et al. [19] combined anomaly-based and signature-based techniques for predicting failures in multi-tier distributed systems. Supervised learning is often impractical, since it requires generating many sample failures, while changes to the system require retraining the models. On the contrary, HTM offers an unsupervised, continuously learning scheme for streamed data that can represent a better choice for cloud failure prediction, as discussed in this paper.

The failure prediction architecture considered in this paper has also been exploited by others. In particular, Sauvanaud et al. [29] proposed an approach for anomaly detection running both per-VM and system-wide analyses. Although they studied a different combination of techniques, it is interesting that, both in their work and in the study presented in this paper, the system-wide (vote-based) analysis performed slightly better than the single-resource analysis, suggesting that a degree of global reasoning on the behavior of the system is often necessary to generate effective predictions.

Related to our work, Mobilio et al. [21] studied how to dynamically deploy and undeploy lightweight anomaly detectors in cloud systems. In their evaluation, they considered HTM among the set of anomaly detectors that can be deployed, but their approach only delivers preliminary findings about the possibility of using HTM as an anomaly detector for the cloud. This study strengthens that evidence, providing quantitative results on a larger scale.

Finally, a different line of research studies how to predict incidents using texts, issue reports, and statistical data [33, 12]. In contrast, in this paper we studied the challenge of anticipating failures based on the KPIs collected online.

VII Conclusions

Predicting failures in cloud systems is a challenging problem that requires practical approaches to be solved effectively. Supervised learning schemes that require retraining the models and that exploit samples collected both during normal executions and during failures do not adapt well to this context.

HTM is an interesting alternative, offering an unsupervised, continuously learning scheme designed to work with streamed data, such as the data collected when monitoring cloud systems. In this context, this paper offers a first systematic assessment of HTM, providing initial evidence that HTM can be a practical and effective option for supporting cloud failure prediction.

In this paper, we reported on our experience with a cloud-native IMS. Future work concerns validating HTM-based cloud failure prediction on additional real-world systems.

References

  • [1] S. Ahmad, A. Lavin, S. Purdy, and Z. Agha (2017) Unsupervised real-time anomaly detection for streaming data. Neurocomputing 262. Cited by: §III-A, §VI-A.
  • [2] (Website) Note: https://aws.amazon.com/cloudwatch/ Online on 11/07/2021. Cited by: §I.
  • [3] A. Bamaqa, M. Sedky, T. Bosakowski, and B. B. Bastaki (2020) Anomaly detection using hierarchical temporal memory (htm) in crowd management. In ICCBDC, Cited by: §I, §VI-A.
  • [4] (Website) Note: https://github.com/Netflix/chaosmonkey Online on 11/07/2021. Cited by: §IV-C.
  • [5] T. Chen, R. Bahsoon, and X. Yao (2018) A survey and taxonomy of self-aware and self-adaptive cloud autoscaling systems. CSUR 51 (3). Cited by: §I.
  • [6] X. Chen, C.-D. Lu, and K. Pattabiraman (2014) Failure analysis of jobs in compute clouds: a google cluster case study. In ISSRE, Cited by: §I.
  • [7] D. Cotroneo, L. D. Simone, P. Liguori, R. Natella, and N. Bidokhti (2019) How bad can a bug get? an empirical analysis of software failures in the openstack cloud computing platform. In ESEC/FSE, Cited by: §I.
  • [8] Y. Cui, S. Ahmad, and J. Hawkins (2016) Continuous online sequence learning with an unsupervised neural network model. Neural Comput. 28 (11). Cited by: §II, §II.
  • [9] Y. Cui, C. Surpur, S. Ahmad, and J. Hawkins (2016) A comparative study of htm and other neural network models for online sequence learning with streaming data. In IJCNN, Cited by: §II.
  • [10] N. O. El-Ganainy, I. Balasingham, P. S. Halvorsen, and L. Arne Rosseland (2019) On the performance of hierarchical temporal memory predictions of medical streams in real time. In ISMICT, Cited by: §I, §VI-A.
  • [11] (Website) Note: https://www.elastic.co/elastic-stack Online on 11/07/2021. Cited by: §I.
  • [12] J. Gu, C. Luo, S. Qin, B. Qiao, Q. Lin, H. Zhang, Z. Li, Y. Dang, S. Cai, W. Wu, Y. Zhou, M. Chintalapati, and D. Zhang (2020) Efficient incident identification from multi-dimensional issue reports via meta-heuristic search. In ESEC/FSE, Cited by: §I, §VI-B.
  • [13] J. Hawkins and S. Ahmad (2016) Why neurons have thousands of synapses, a theory of sequence memory in neocortex. Front. Neural Circuits 10. Cited by: §II.
  • [14] J. Hawkins and S. Blakeslee (2004) On intelligence. Times Books, USA. Cited by: §I.
  • [15] G. Khangamwa (2010) Detecting network intrusions using hierarchical temporal memory. In AFRICOMM, Cited by: §I, §VI-A.
  • [16] D. Lin, R. Raghu, V. Ramamurthy, J. Yu, R. Radhakrishnan, and J. Fernandez (2014) Unveiling clusters of events for alert and incident management in large-scale enterprise it. In KDD, Cited by: §I.
  • [17] Q. Lin, K. Hsieh, Y. Dang, H. Zhang, K. Sui, Y. Xu, J.-G. Lou, C. Li, Y. Wu, R. Yao, M. Chintalapati, and D. Zhang (2018) Predicting node failure in cloud service systems. In ESEC/FSE, Cited by: §I, §I, §I, §III, §VI-B.
  • [18] J. Mackenzie, J. F. Roddick, and R. Zito (2019) An evaluation of htm and lstm for short-term arterial traffic flow prediction. T-ITS 20 (5). Cited by: §I, §VI-A.
  • [19] L. Mariani, M. Pezzé, O. Riganelli, and R. Xin (2020) Predicting failures in multi-tier distributed systems. JSS 161. Cited by: §I, §III, §IV-C, §V, §VI-B.
  • [20] W. J. C. Melis, S. Chizuwa, and M. Kameyama (2009) Evaluation of hierarchical temporal memory for a real world application. In ICICIC, Cited by: §I, §VI-A.
  • [21] M. Mobilio, M. Orrù, O. Riganelli, A. Tundo, and L. Mariani (2019) Anomaly detection as-a-service. In ISSREW, Cited by: §I, §VI-B.
  • [22] (Website) Note: https://github.com/lta-unimib/ICMLA2021 Online on 12/07/2021. Cited by: §IV-B, §V.
  • [23] (Website) Note: https://www.projectclearwater.org/ Online on 11/07/2021. Cited by: §IV-B.
  • [24] (Website) Note: https://prometheus.io/ Online on 11/07/2021. Cited by: §I.
  • [25] S. Purdy (2016) Encoding data for htm systems. ArXiv abs/1602.05925. Cited by: §II.
  • [26] M. A. Rodriguez, R. Kotagiri, and R. Buyya (2018) Detecting performance anomalies in scientific workflows using hierarchical temporal memory. FGCS 88. Cited by: §VI-A.
  • [27] D. Rozado, F. B. Rodriguez, and P. Varona (2011) Gaze gesture recognition with hierarchical temporal memory networks. In IWANN, Cited by: §I, §VI-A.
  • [28] F. Salfner, M. Lenk, and M. Malek (2010) A survey of online failure prediction methods. CSUR 42 (3). Cited by: §IV-A.
  • [29] C. Sauvanaud, K. Lazri, M. Kaâniche, and K. Kanoun (2016) Anomaly detection and root cause localization in virtual network functions. In ISSRE, Cited by: §I, §III-C, §III, §V, §VI-B.
  • [30] B. Sharma, P. Jayachandran, A. Verma, and C. R. Das (2013) CloudPD: problem determination and diagnosis in shared dynamic clouds. In DSN, Cited by: §IV-C.
  • [31] K. V. Vishwanath and N. Nagappan (2010) Characterizing cloud computing hardware reliability. In SoCC, Cited by: §I.
  • [32] K. Zhang, F. Zhao, S. Luo, Y. Xin, H. Zhu, and Y. Chen (2020) Online intrusion scenario discovery and prediction based on hierarchical temporal memory (htm). Appl. Sci. 10 (7). Cited by: §I, §VI-A.
  • [33] N. Zhao, J. Chen, Z. Wang, X. Peng, G. Wang, Y. Wu, F. Zhou, Z. Feng, X. Nie, W. Zhang, K. Sui, and D. Pei (2020) Real-time incident prediction for online service systems. In ESEC/FSE, Cited by: §I, §I, §VI-B.