Recent years have seen rapid development in cellular networks, with growing network scale and complexity coupled with rising performance demands. This growth makes network management increasingly challenging and limits the analysis methods that can be applied. In cellular networks, anomalies are commonly identified through alarms. A large-scale network can generate millions of alarms during a single day. Due to the interrelated network structure, a single fault can trigger a flood of alarms from multiple devices. Traditionally, to recover after a failure, an operator will analyze all relevant alarms and network information, which can be a slow process. However, not all alarms are relevant. There exists a subset of alarms that are the most significant for fault localization. We denote these as root cause alarms, and our main goal is to intelligently identify these alarms.
There exists abundant prior research in the areas of Root Cause Analysis (RCA) and fault localization. However, most proposed methods are highly specialized and take advantage of specific properties of the deployed network, either by using integrated domain knowledge or through particular design decisions [2, 3, 4]. A more general approach is to infer everything from the data itself.
In our proposed alarm RCA system, we create an influence graph to model alarm relations. Causal inference is used to infer an initial causal graph, and then we apply a novel Causal Propagation-Based Embedding (CPBE) algorithm to supplement the graph with meaningful edge weights. To identify the root cause alarms, we build upon ideas of how influence propagates in social networks and view the problem as an influence maximization problem, i.e., we want to discover the alarms with the largest influence. When a failure occurs, our system can automatically perform RCA based on the sub-graph containing the involved alarms and output the top-$K$ most probable root cause alarms.
In summary, our main contributions are as follows:
We design a novel unsupervised approach for root cause alarm localization that integrates causal inference and influence maximization analysis, making the framework robust to causal analysis uncertainty without requiring labels.
We propose HPCI, a Hawkes Process-based Conditional Independence test procedure for causal inference.
We further propose CPBE, a Causal Propagation-Based Embedding algorithm based on network embedding techniques and vector similarity to infer edge weights in causality graphs.
Extensive experiments on a synthetic and a real-world citywide dataset show the advantages and usefulness of our proposed methods.
2 Related work
Root cause alarms.
There are various ways to discover alarm correlations and root cause alarms. Rules and experience from previous incidents are frequently used. In more data-driven approaches, pattern mining techniques that compress alarm data can assist in locating and diagnosing faults. Abele et al. propose to find root cause alarms by combining knowledge modeling and Bayesian networks. Another proposed approach uses an alarm clustering algorithm that considers the network topology and then mines association rules to find root cause alarms.
Graph-based root cause analysis.
Some previous works depend on system dependency graphs, e.g., Sherlock. A disadvantage is the requirement of exact conditional probabilities, which are impractical to obtain in large networks. Other systems are based on causality graphs. G-RCA is a diagnosis system, but its causality graph is configured by hand, which is infeasible in large-scale, dynamic environments. The PC algorithm is used by CauseInfer, among others, to estimate DAGs, which are then used to infer root causes. However, such graphs can be very unreliable. Co-occurrence and Bayesian decision theory have also been used to estimate causal relations, but that approach is mainly based on log event heuristics and is hard to generalize. Nie et al. use FP-Growth and lag correlation to build a causality graph with edge weights added with expert feedback.
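A Hawkes process is a popular method to model continuous-time event sequences where past events can excite the probability of future events. The keystone of a Hawkes process is the conditional intensity function, which indicates the occurrence rate of future events conditioned on past events, denoted by $\lambda_u(t)$, where $u$ is an event type. Formally, given an infinitesimally small time window $[t, t + \Delta t)$, the probability of a type-$u$ event occurring in this window is $\lambda_u(t)\,\Delta t$. For a $U$-dimensional Hawkes process with event type set $\mathcal{U} = \{1, \dots, U\}$, each dimension $u \in \mathcal{U}$ has a specific form of conditional intensity function defined as

$$\lambda_u(t) = \mu_u + \sum_{i:\, t_i < t} k_{u u_i}(t - t_i),$$

where $\mu_u \geq 0$ is the background intensity for type-$u$ events, $(u_i, t_i)$ denotes the type and time of the $i$-th past event, and $k_{uv}(\cdot)$ is a kernel function indicating the influence from past events. An exponential kernel is most frequently used, i.e., $k_{uv}(t) = \alpha_{uv} e^{-\beta t}$, where $\alpha_{uv}$ captures the degree of influence of type-$v$ events on type-$u$ events and $\beta$ controls the decay rate. The parameters are commonly learned by optimizing a log-likelihood function. Let $\boldsymbol{\mu} = (\mu_u)$ be the background intensities, and $A = (\alpha_{uv})$ the influence matrix reflecting a certain causality between event types. Consider a set of event sequences $\mathcal{S} = \{S_1, \dots, S_m\}$, where each event sequence $S_i = \{(a_{ij}, t_{ij})\}_{j=1}^{n_i}$ is observed during a time period $[0, T_i]$, and each pair $(a_{ij}, t_{ij})$ represents an event of type $a_{ij}$ that occurred at time $t_{ij}$. The log-likelihood of a Hawkes process model with parameters $\Theta = \{\boldsymbol{\mu}, A\}$ can then be expressed as

$$\mathcal{L}(\Theta) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i} \log \lambda_{a_{ij}}(t_{ij}) - \sum_{u=1}^{U} \int_{0}^{T_i} \lambda_u(t)\, dt \right).$$

The influence matrix $A$ is generally sparse or low-rank in practice; hence, adding penalties to $\mathcal{L}(\Theta)$ is common. For instance, Zhou et al. used a mix of Lasso and nuclear norms to constrain $A$ to be both low-rank and sparse by using

$$\min_{\boldsymbol{\mu} \geq 0,\, A \geq 0} \; -\mathcal{L}(\Theta) + \rho_1 \|A\|_1 + \rho_2 \|A\|_*,$$

where $\|\cdot\|_1$ is the $L_1$-norm, and $\|\cdot\|_*$ is the nuclear norm. The parameters $\rho_1$ and $\rho_2$ control their weights. A number of algorithms can be applied to solve the above learning problem.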
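As a rough illustration of the intensity and log-likelihood above, the following sketch evaluates both for a single event sequence under an exponential kernel. The data layout (events as `(type, time)` pairs, `alpha[u][v]` as the influence of type `v` on type `u`) and the function names are assumptions made for this example; the penalty terms are left out.

```python
import math

def intensity(u, t, events, mu, alpha, beta):
    """lambda_u(t) = mu[u] + sum over past events (v, s), s < t, of alpha[u][v] * exp(-beta * (t - s))."""
    excitation = sum(
        alpha[u][v] * math.exp(-beta * (t - s)) for v, s in events if s < t
    )
    return mu[u] + excitation

def log_likelihood(events, horizon, mu, alpha, beta):
    """Log-likelihood of one event sequence observed on [0, horizon]."""
    # First term: log-intensity evaluated at every observed event.
    log_term = sum(
        math.log(intensity(u, t, events, mu, alpha, beta)) for u, t in events
    )
    # Second term: integral of every lambda_u over [0, horizon], in closed form.
    integral = sum(mu) * horizon
    for v, s in events:
        decay_mass = (1.0 - math.exp(-beta * (horizon - s))) / beta
        integral += sum(alpha[u][v] for u in range(len(mu))) * decay_mass
    return log_term - integral
```

Summing `log_likelihood` over all sequences gives the full objective; a generic optimizer could then be applied to its negative, which is not shown here.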
The PC algorithm is frequently used for learning directed acyclic graphs (DAGs) due to its strong causal structure discovery ability. Conditional Independence (CI) tests play a central role in the inference. A significance level is used as a threshold to determine if an edge should be removed or retained. Formally, given a conditioning variable set $Z$, if $X$ is independent of $Y$ given $Z$, denoted as $X \perp Y \mid Z$, the edge between $X$ and $Y$ will be removed; otherwise, it will be kept in the causal graph. A rigorous description is given in the cited literature.
The G-square test and Fisher-Z test are two common realizations for conditional independence testing in causal inference. The G-square test is used for testing independence of categorical variables using conditional cross-entropy, while the Fisher-Z test evaluates conditional independence based on Pearson's correlation. CI tests assume that the input is independent and identically distributed. In our alarm RCA scenario, the size of the time window depends on the network characteristics and needs to be selected to ensure that causal alarms exist in one window and the data between different windows are independent.
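To make the Fisher-Z test concrete, here is a minimal sketch using the standard textbook formulation: the partial correlation is read off the inverse correlation matrix, transformed with Fisher's z, and compared against a standard normal distribution. The function name and the column-index interface are assumptions for this example and are not taken from the paper's code.

```python
import math
import numpy as np

def fisher_z_test(data, x, y, cond=()):
    """Return the two-sided p-value for column x being independent of column y given cond."""
    n = data.shape[0]
    idx = [x, y, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    precision = np.linalg.pinv(corr)
    # Partial correlation of x and y given cond, read off the precision matrix.
    r = -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep the logarithm finite
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    # Two-sided p-value under the standard normal distribution.
    return math.erfc(stat / math.sqrt(2))
```

An edge between two variables would be removed when the returned p-value exceeds the chosen significance level for some conditioning set.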
4 System Overview
Our proposed framework consists of two main procedures: influence graph creation and alarm ranking. A system overview can be found in Figure 1. The alarm preprocessing module is shared and handles alarm filtering and aggregation with consideration to the network topology.
The influence graph is constructed using historical alarm transactions and is periodically recreated. It comprises alarm types as nodes and their inferred relations as edges. To create the graph, we first exploit causal inference methodology to infer an initial alarm causality graph structure by applying HPCI, a hybrid method that merges Hawkes process and conditional independence tests. We further apply a network embedding technique, CPBE, to infer the edge weights. The alarm stream is monitored in real time. When a failure occurs, the system attempts to discover the underlying root cause alarms. The related alarms are aggregated with the created influence graph and are ranked by their influence to determine the top-$K$ most probable root cause alarm candidates. The alarm candidates are then given to the network operators to assist in handling the network issue.
This section introduces the key components in our system: alarm preprocessing, the influence graph construction, and how the influence ranking is done. We start by presenting the data and its required preprocessing and aggregation steps.
5.1 Data and Alarm Preprocessing
The network topology is the topological structure of the connections between network devices. Connected network devices will interact with each other. If a failure occurs on one device, then any connected devices can be affected, triggering alarms on multiple network nodes.
Network alarms are used to identify faults in the network. Each alarm record contains information about occurrence time, network device where the alarm originated, alarm type, etc. In practice, any alarms with missing key information are unusable and are removed. Furthermore, alarm types that are either systematic or highly periodical are also removed. These types of alarms are irrelevant for root cause analysis since they will be triggered regardless of whether a fault has occurred.
We partition the raw alarm data into alarm sequences and alarm transactions in three steps as follows.
(1) Devices in connected sub-graphs of the network can interact, i.e., alarms from these devices can potentially be related to the same fault. Consequently, we first aggregate alarms from the same sub-graph together.
(2) Alarms related to the same failure will generally occur together within a short time interval. We thus further partition the alarms based on their occurrence times. Alarms that occurred within the same time window are grouped and sorted by time. The window size can be adjusted depending on network characteristics. We define each group as an alarm sequence, denoted as $S_w = \{(a_1, t_1), \dots, (a_n, t_n)\}$, where $w$ is the window, $a_i$ is the alarm type, $t_i$ is the occurrence time, and $n$ is the number of alarms.
(3) Each alarm sequence $S_w$ is transformed into an alarm transaction denoted by $\mathcal{T}_w = \{(a, t_a, n_a)\}$, where $a$ indicates the alarm type, $t_a$ its earliest occurrence time, and $n_a$ its number of occurrences. Different from $S_w$, $\mathcal{T}_w$ contains a single element for each alarm type in window $w$.
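The windowing and aggregation steps can be sketched as follows. This is a minimal illustration that assumes the alarms have already been restricted to one connected sub-graph, that each record is an `(alarm_type, time)` pair, and that windows are fixed-size and aligned to time zero; the function names are hypothetical.

```python
from collections import defaultdict

def to_sequences(alarms, window_size):
    """Group (alarm_type, time) records into fixed-size windows, each sorted by time."""
    windows = defaultdict(list)
    for alarm_type, time in alarms:
        windows[int(time // window_size)].append((alarm_type, time))
    return {
        w: sorted(seq, key=lambda rec: rec[1]) for w, seq in sorted(windows.items())
    }

def to_transaction(sequence):
    """Collapse a sequence into one (type, earliest time, count) entry per alarm type."""
    summary = {}
    for alarm_type, time in sequence:
        first, count = summary.get(alarm_type, (time, 0))
        summary[alarm_type] = (min(first, time), count + 1)
    return [(a, first, count) for a, (first, count) in summary.items()]
```

The sequences keep every alarm occurrence and serve as event-level input, while the transactions keep one entry per alarm type and serve as window-level input.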
5.2 Alarm Influence Graph Construction
In this section, we elaborate on the construction of the alarm influence graph. The graph has the alarm types as nodes and their relation as the edges. First, an initial causal structure DAG is inferred by a hybrid causal structure learning method (HPCI). Subsequently, edge weights are inferred using a novel network embedding method (CPBE).
A multi-dimensional Hawkes process can capture certain causalities behind event types, i.e., the transpose of the influence matrix can be seen as the adjacency matrix of the causal graph for event types. However, redundant or indirect edges tend to be discovered since the conditional intensity function cannot perfectly model real-world data and due to the difficulty in capturing the instantaneous causality.
To mitigate this weakness, we propose a hybrid algorithm, HPCI, that is based on the Hawkes process and the PC algorithm. HPCI is used to discover the causal structure for the alarm types in our alarm RCA scenario. The main procedure can be expressed in three steps. (1) Use a multi-dimensional Hawkes process without penalty to capture the influence intensities among the alarm types. We use the alarm sequences as input and obtain an initial weighted graph. The weight on an edge from alarm type $v$ to alarm type $u$ is the influence intensity $\alpha_{uv}$, reflecting the expectation of how long it takes for a type-$u$ event to occur after a type-$v$ event. All edges with positive weights are retained. (2) Any redundant and indirect causal edges are removed using CI tests. We use the alarm transactions as input, and for each alarm type $a$ the sequence of its occurrence counts $n_a$ over all windows is extracted. Note that $n_a$ can be $0$ if alarm type $a$ is not present in a window. For each pair of alarm types, the CI test of their respective occurrence sequences is used to test for independence and remove edges. The output is a graph with unwanted edges removed. (3) Finally, we iteratively remove the edge with the smallest intensity until the graph is a DAG. This yields our final causal graph.
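The third step can be sketched as below. Note one assumption that goes beyond the text: instead of always dropping the globally weakest edge, this sketch repeatedly finds a remaining cycle and drops the weakest edge on it, so that edges outside any cycle are never removed. The edge representation (`{(source, target): intensity}`) and function names are hypothetical.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle as a list, or None if the graph is acyclic."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    state = {node: 0 for node in graph}  # 0 = unvisited, 1 = on the DFS path, 2 = finished
    path = []

    def visit(node):
        state[node] = 1
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == 1:  # back edge closes a cycle
                cycle_nodes = path[path.index(nxt):] + [nxt]
                return list(zip(cycle_nodes, cycle_nodes[1:]))
            if state[nxt] == 0:
                found = visit(nxt)
                if found:
                    return found
        state[node] = 2
        path.pop()
        return None

    for node in graph:
        if state[node] == 0:
            found = visit(node)
            if found:
                return found
    return None

def prune_to_dag(weights):
    """Drop the weakest edge of a remaining cycle until no cycle is left."""
    weights = dict(weights)  # work on a copy of {(source, target): intensity}
    cycle = find_cycle(weights)
    while cycle:
        weakest = min(cycle, key=lambda edge: weights[edge])
        del weights[weakest]
        cycle = find_cycle(weights)
    return weights
```

The recursive search is adequate for graphs with a modest number of alarm types; a very large graph would call for an iterative traversal.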
We select CI tests to enforce sparsity in the causal graph in the second step. Compared to adding penalty terms such as the $L_1$-norm, the learning procedure is more interpretable, and our experiments show more robust results.
Edge Weights Inference.
The causal graph learned by HPCI is a weighted graph; however, the weights do not account for global effects on the causal intensities. Hence, to encode more knowledge into the graph, we propose a novel network embedding-based weight inference method, Causal Propagation-Based Embedding (CPBE). CPBE consists mainly of two steps: (1) for each node, we obtain a vector representation using a novel network embedding technique; (2) we use vector similarity to compute edge weights between nodes.
The full CPBE algorithm is shown in Algorithm 1. CPBE uses a new procedure to generate a context for the skip-gram model (lines 1-9). This procedure is also illustrated in Figure 2. In essence, for each historical alarm transaction, we use the learned causality graph to extract a causal propagation graph in which only the nodes corresponding to alarm types in the transaction are retained. Starting from each node in the causal propagation graph, we traverse the graph to generate a node-specific causal context. During the traversal for a node, only nodes that have a causal relation with that node are considered. There are various possible traversing strategies, e.g., depth-first search (DFS) and RandomWalk. The skip-gram model is applied to the generated contexts to obtain an embedding vector for each node. Finally, the edge weight between two nodes is set to be the cosine similarity of their associated vectors. We denote the final weighted graph as the alarm influence graph.
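The context generation and the final similarity step can be sketched as follows. This is an illustration only: it assumes a DFS traversal, interprets "nodes that have a causal relation" as the nodes reachable from the start node, and omits the skip-gram training itself, which would be handled by an embedding library. All function names are hypothetical.

```python
import math

def propagation_graph(causal_edges, transaction_types):
    """Keep only causal edges whose endpoints both appear in the transaction."""
    present = set(transaction_types)
    graph = {a: [] for a in sorted(present)}
    for src, dst in causal_edges:
        if src in present and dst in present:
            graph[src].append(dst)
    return graph

def dfs_context(graph, start):
    """List the start node followed by every node it can reach, in DFS preorder."""
    context, stack, seen = [], [start], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        context.append(node)
        stack.extend(reversed(graph[node]))
    return context

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Each context plays the role of a sentence for the skip-gram model, and `cosine_similarity` would then be applied to the learned vectors of the two endpoints of every causal edge.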
5.3 Root Cause Alarm Influence Ranking
This section describes how the alarm influence graph is applied to an alarm transaction to identify the root cause alarms. For each alarm transaction, an alarm propagation graph is created from the relevant nodes and applicable edges of the influence graph. Any nodes corresponding to alarms not present in the transaction are removed. The process is equivalent to how the causal propagation graph is created from the causal graph in CPBE. The alarms in each propagation sub-graph are then ranked independently. The process is illustrated in Figure 3.
We consider the problem of finding the root cause alarm as an influence maximization problem. We want to discover a small set of seed nodes that maximizes the influence spread under an influence diffusion model. A suitable model is the independent cascade model, which is widely used in social network analysis. Following this model, each node $v$ is activated by each of its neighbors independently based on an influence probability $p_{uv}$ on each edge $(u, v)$. These probabilities directly correspond to the learned edge weights. Given a seed set $S_0$ to start with at $t = 0$, at step $t > 0$, each node $u \in S_{t-1}$ tries to activate its outgoing inactivated neighbors $v$ with probability $p_{uv}$. Activated nodes are added to $S_t$ and the process terminates when $S_t = \emptyset$, i.e., when no further nodes are activated. The influence of the seed set is then the expected number of activated nodes when applying the above stochastic activation procedure.
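The independent cascade model can be illustrated with a small Monte Carlo estimate of the influence spread. This is a sketch under assumed conventions: the graph is a dictionary mapping each node to its outgoing neighbors and their activation probabilities, and the function names and the number of simulation runs are hypothetical.

```python
import random

def simulate_cascade(graph, seeds, rng):
    """Run one independent cascade; graph maps node -> {neighbor: probability}."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:  # stops once a step activates no new node
        newly_active = []
        for u in frontier:
            for v, p in graph.get(u, {}).items():
                # Each newly active node gets a single chance per inactive neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)

def estimate_influence(graph, seeds, runs=1000, seed=0):
    """Average the cascade size over repeated simulations."""
    rng = random.Random(seed)
    return sum(simulate_cascade(graph, seeds, rng) for _ in range(runs)) / runs
```

Simulation of this kind is costly on large graphs, which is one reason dedicated influence maximization algorithms estimate the spread analytically instead.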
There are numerous algorithms available to solve the influence maximization problem. In our scenario, each graph is relatively small, so the choice of algorithm is less important. We directly select the Influence Ranking Influence Estimation algorithm (IRIE) for this task. IRIE estimates the influence of each node by deriving a system of linear equations with one variable per node. The influence of a node comprises its own influence and the sum of the influences it propagates to its neighbors.
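The influence ranking idea can be sketched as a fixed-point iteration on a linear system. The form used here, $r(u) = 1 + \alpha \sum_{v} p_{uv}\, r(v)$ with a damping factor $\alpha$, is a simplified reading of the ranking part of IRIE; the damping value, the iteration count, and the omission of the influence estimation step used when selecting several seeds are all assumptions of this example.

```python
import numpy as np

def influence_rank(nodes, weights, alpha=0.7, iterations=20):
    """Iterate r = 1 + alpha * P r, where P[u, v] is the weight of edge u -> v."""
    index = {node: i for i, node in enumerate(nodes)}
    P = np.zeros((len(nodes), len(nodes)))
    for (u, v), w in weights.items():
        P[index[u], index[v]] = w
    r = np.ones(len(nodes))  # every node starts with its own unit influence
    for _ in range(iterations):
        r = 1.0 + alpha * (P @ r)
    return dict(zip(nodes, r))

def top_k(nodes, weights, k, **kwargs):
    """Return the k nodes with the largest estimated influence."""
    scores = influence_rank(nodes, weights, **kwargs)
    return sorted(nodes, key=lambda n: scores[n], reverse=True)[:k]
```

On an acyclic propagation graph the iteration converges after at most as many rounds as the longest path, so a small fixed iteration count suffices for small alarm sub-graphs.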
In this section, we present the experimental setup and evaluation results. We perform two main experiments, one to verify the correctness of our causal graph and a second experiment to evaluate the root cause identification accuracy. The first experiment is performed on both synthetic and real-world data, while the second is completed on the real-world dataset. The datasets and code are available at https://github.com/shaido987/alarm-rca.
Synthetic Data Generation.
The synthetic event sequences are generated in four steps. (1) We randomly generate a DAG with an average out-degree with event types. We set to to emulate the sparsity property of our real-world dataset. (2) For each edge , a weight is assigned by uniform random sampling from a range . (3) For each event type , we assign a background intensity by uniform random sampling from . (4) Following Ogata , we use and as parameters of a Multi-dimensional Hawkes process and simulate event sequences. We generate event sequences of length days while ensuring that the total number of events is greater than .
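The simulation step can be illustrated with a thinning procedure for a multi-dimensional Hawkes process with exponential kernels. This is a minimal sketch rather than the generator used in the experiments: the function name is hypothetical, `alpha[u, v]` is assumed to be the influence of type `v` on type `u`, and any parameter values passed in are illustrative only.

```python
import numpy as np

def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate (type, time) events on [0, horizon] by thinning; requires sum(mu) > 0."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    alpha = np.asarray(alpha, dtype=float)  # alpha[u, v]: influence of type v on type u
    excitation = np.zeros_like(mu)          # decayed contribution of past events per type
    t, events = 0.0, []
    while True:
        # The total intensity only decays until the next event, so its
        # current value is a valid upper bound for the candidate rate.
        upper = mu.sum() + excitation.sum()
        wait = rng.exponential(1.0 / upper)
        t += wait
        if t > horizon:
            return events
        excitation *= np.exp(-beta * wait)
        rates = mu + excitation
        if rng.random() * upper <= rates.sum():  # accept the candidate point
            u = int(rng.choice(len(mu), p=rates / rates.sum()))
            events.append((u, t))
            excitation += alpha[:, u]  # the new event excites every type it influences
```

For the process to remain stable, the influence matrix divided by the decay rate should have a spectral radius below one; the sketch does not check this.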
The dataset was collected from a major cellular carrier in a moderate-sized city in China between Aug 4th, 2018 and Oct 24th, 2018. After preprocessing, it consists of alarm records from devices with different alarm types. Due to the difficulty of labeling causal relations, we only have the ground-truth causal relations for a subset of alarm types, directed edges in the graph. Furthermore, we have also obtained the ground-truth root cause alarms in a random sample of alarm transactions. These are used to evaluate the root cause localization accuracy.
6.1 Causal Graph Structure Correctness
We evaluate our proposed HPCI method and the accuracy of the discovered causal graphs. We use four frequently used causal inference methods for sequential data as baselines.
PC-GS: PC algorithm with G-square CI test.
PC-FZ: PC algorithm with Fisher-Z CI test.
PCTS: Improved PC algorithm for causal discovery in time series.
HPADM4: Multi-dimensional Hawkes process with exponential parameterization of the kernels and a mix of $L_1$-norm and nuclear-norm penalties.
The significance level in the conditional independence tests included in the methods are all set to . The size of time window for aggregating event sequences is set to seconds, the maximum lag in PCTS, and the penalization level in HPADM4 is set to the default . Furthermore, the decay parameter in Hawkes process is set to , and we select Fisher-Z as the CI test in our HPCI algorithm. For evaluation, we define three metrics as follows.
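$$\text{Precision} = \frac{|\hat{E} \cap E^{*}|}{|\hat{E}|}, \qquad \text{Recall} = \frac{|\hat{E} \cap E^{*}|}{|E^{*}|}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$

where $\hat{E}$ is the set of all directed edges in the learned causal graph and $E^{*}$ is the set of ground-truth edges.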
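Edge-level precision, recall, and F1-score over directed edge sets can be computed as in the following sketch. It assumes each edge is a `(source, target)` tuple, so that a reversed edge counts as a miss; the function name is hypothetical.

```python
def edge_metrics(learned_edges, true_edges):
    """Return (precision, recall, f1) for directed edges given as (source, target) tuples."""
    learned, truth = set(learned_edges), set(true_edges)
    hits = len(learned & truth)  # correctly recovered directed edges
    precision = hits / len(learned) if learned else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For example, a learned graph with three edges of which two are correct, evaluated against four ground-truth edges, gives a precision of 2/3, a recall of 1/2, and an F1-score of 4/7.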
The F1-scores using synthetic data with are shown in Table 1. As shown, HPCI outperforms the baselines for nearly all settings of and . However, HPADM4 obtains the best result for and low , this is due to the distribution of event occurrence intervals being sparse which makes the causal dependency straightforward to capture using a Hawkes process. However, for higher or the events will be denser. Thus, Hawkes process has trouble distinguishing instantaneous causal relations, especially when events co-occur. The use of CI tests in HPCI helps to distinguish these instantaneous causal relations by taking another perspective in which causality is discovered based on distribution changes in the aggregated data without considering the time-lagged information among events. HPCI thus achieves better results. The use of time aggregation is disadvantageous for PCTS due to its focus on time series, which can partly explain its comparatively worse results.
The results on the real-world data are shown in Table 2. HPCI performs significantly better than all baselines in precision and F1-score, while PCTS obtains the highest recall. PCTS also has significantly lower precision, indicating more false positives. PCTS is designed for time series; however, those may be periodic, which can give higher lagged-correlation values, leading to more redundant edges. HPCI instead finds a good balance between precision and recall. The competitive result indicates that the causality behind the real alarm data conforms to the assumptions of HPCI to a certain extent.
6.2 Root Cause Alarm Identification
We evaluate the effectiveness of CPBE and the root cause alarm accuracy on the real-world dataset. We use the causal graph structure created by HPCI as the base and augment it with the known causal ground-truths. The causal graph is thus as accurate as possible. CPBE is compared with four baseline methods, all used for determining edge weights.
IT, directly uses the weighted causal graph discovered by HPCI with the learned influence intensities as edge weights.
Pearson, uses the aligned Pearson correlation of each alarm pair.
CP, the weights of an edge is set to where is the number of times and co-occur in a window, and is the total number of alarms.
ST, a static model with a maximum likelihood estimator. It is similar to CP, but counts the number of times one alarm occurs before the other instead of the number of co-occurrences.
For each method, IRIE is used to find the top- most likely root cause alarms in each of the labeled alarm transactions. For IRIE, we use the default parameters. We attempt to use RandomWalk, BFS, and DFS for traversal in CPBE, as well as different Skip-gram configurations with and vector length . However, there is no significant difference in the outcome, indicating that CPBE is insensitive to these parameter choices on our data. The results for different when using RandomWalk are shown in Table 3. As shown, CPBE outperforms the baselines for all . For , CPBE achieves an accuracy of 61.8% which, considering that no expert knowledge is integrated into the system, is an excellent outcome. Moreover, the running time of CPBE is around seconds and IRIE takes seconds for all alarm transactions. This is clearly fast enough for system deployment.
We present a framework to identify root cause alarms of network faults in large telecom networks without relying on any expert knowledge. We output a clear ranking of the most crucial alarms to assist in locating network faults. To this end, we propose a causal inference method (HPCI) and a novel network embedding-based algorithm (CPBE) for inferring network weights. Combining the two methods, we construct an alarm influence graph from historical alarm data. The learned graph is then applied to identify root cause alarms through a flexible ranking method based on influence maximization. We verify the correctness of the learned graph using known causal relations and show a significant improvement over the best baseline on both synthetic and real-world data. Moreover, we demonstrate that our proposed framework beats the baselines in identifying root cause alarms.
- Combining knowledge modeling and machine learning for alarm root cause analysis. IFAC Proceedings Volumes 46 (9), pp. 1843–1848.
- (2007) Towards highly reliable enterprise network services via inference of multi-level dependencies. In ACM SIGCOMM Computer Communication Review, Vol. 37, pp. 13–24.
- (2014) CauseInfer: automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems. In INFOCOM, 2014 Proceedings IEEE, pp. 1887–1895.
- (2010) GRCA: a generic root cause analysis platform for service quality management in large isp networks. In ACM Conference on Emerging Networking Experiments and Technologies.
- (2010) Learning influence probabilities in social networks. In Proceedings of the third ACM international conference on Web search and data mining, pp. 241–250.
- (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864.
- (1971) Spectra of some self-exciting and mutually exciting point processes. Biometrika 58 (1), pp. 83–90.
- (2012) Irie: scalable and robust influence maximization in social networks. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pp. 918–923.
- (2007) Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8, pp. 613–636.
- (2003) Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 137–146.
- (2018) Mining causality of network events in log data. IEEE Transactions on Network and Service Management 15 (1), pp. 53–67.
- (2018) Influence maximization on social graphs: a survey. IEEE Transactions on Knowledge and Data Engineering.
- (2010) Mining dependency in distributed systems through unstructured logs analysis. SIGOPS Operating Systems Review 44 (1), pp. 91–96.
- (2020) Localizing failure root causes in a microservice through causality inference. In 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS), pp. 1–10.
- (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
- (2016) Mining causality graph for automatic web-based service diagnosis. In Performance Computing and Communications Conference (IPCCC), 2016 IEEE 35th International, pp. 1–8.
- (1981) On lewis’ simulation method for point processes. IEEE transactions on information theory 27 (1), pp. 23–31.
- (2014) Causal discovery with continuous additive noise models. The Journal of Machine Learning Research 15 (1), pp. 2009–2053.
- (2000) Causation, prediction, and search. MIT press.
- (1991) An algorithm for fast recovery of sparse causal graphs. Social science computer review 9 (1), pp. 62–72.
- (2017) Association mining analysis of alarm root-causes in power system with topological constraints. In Proceedings of the 2017 International Conference on Information Technology, pp. 461–468.
- (2008) Estimation of space–time branching process models in seismology using an em–type algorithm. Journal of the American Statistical Association 103 (482), pp. 614–624.
- (2018) Cloudranger: root cause identification for cloud native systems. In 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 492–502.
- (2018) Network alarm flood pattern mining algorithm based on multi-dimensional association. In Proceedings of the 21st ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp. 71–78.
- (2013) Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. In Artificial Intelligence and Statistics, pp. 641–649.