# Anomaly Detection in Partially Observed Traffic Networks

This paper addresses the problem of detecting anomalous activity in traffic networks where the network is not directly observed. Given knowledge of what the node-to-node traffic in a network should be, any activity that differs significantly from this baseline would be considered anomalous. We propose a Bayesian hierarchical model for estimating the traffic rates and detecting anomalous changes in the network. The probabilistic nature of the model allows us to perform statistical goodness-of-fit tests to detect significant deviations from a baseline network. We show that due to the more defined structure of the hierarchical Bayesian model, such tests perform well even when the empirical models estimated by the EM algorithm are misspecified. We apply our model to both simulated and real datasets to demonstrate its superior performance over existing alternatives.

Comments

There are no comments yet.

## Authors

• 4 publications
• 16 publications
• 28 publications
• ### Latent Laplacian Maximum Entropy Discrimination for Detection of High-Utility Anomalies

Data-driven anomaly detection methods suffer from the drawback of detect...
02/16/2017 ∙ by Elizabeth Hou, et al. ∙ 0

read it

• ### Degree-based Outlier Detection within IP Traffic Modelled as a Link Stream

This paper aims at precisely detecting and identifying anomalous events ...
06/06/2019 ∙ by Audrey Wilmet, et al. ∙ 0

read it

• ### A Probabilistic Framework to Node-level Anomaly Detection in Communication Networks

In this paper we consider the task of detecting abnormal communication v...
02/12/2019 ∙ by Batiste Le Bars, et al. ∙ 0

read it

• ### Online network monitoring

The application of network analysis has found great success in a wide va...
10/19/2020 ∙ by Anna Malinovskaya, et al. ∙ 0

read it

• ### Detecting Anomalous Event Sequences with Temporal Point Processes

Automatically detecting anomalies in event data can provide substantial ...
06/08/2021 ∙ by Oleksandr Shchur, et al. ∙ 0

read it

• ### Semi-Markov Switching Vector Autoregressive Model-based Anomaly Detection in Aviation Systems

In this work we consider the problem of anomaly detection in heterogeneo...
02/21/2016 ∙ by Igor Melnyk, et al. ∙ 0

read it

• ### ChoiceRank: Identifying Preferences from Node Traffic in Networks

Understanding how users navigate in a network is of high interest in man...
10/20/2016 ∙ by Lucas Maystre, et al. ∙ 0

read it

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

In today’s connected world, communication is increasingly voluminous, diverse, and essential. Phone calls, delivery services, and the Internet are all modern amenities that send massive amounts of traffic over immense networks. Thus network security, such as the ability to detect network intrusions or illegal network activity, plays a vital role in defending these network infrastructures. For example, (i) computer networks can protect themselves from malware such as botnets by identifying unusual network flow patterns; (ii) supply chains can prevent cargo theft by monitoring the schedule of shipments or out-of-route journeys between warehouses; (iii) law enforcement agencies can uncover smuggling operations by detecting alternative modes of transporting goods.

Identifying unusual network activity requires a good estimator of the true network traffic, including the anomalous activity, in order to distinguish it from a baseline of what the network should look like. However, often it is not possible to observe the network directly due to constraints such as cost, protocols, or legal restrictions. This makes the problem of estimating the rate of traffic between nodes in a network difficult because the edges between nodes are latent unobserved variables. Network tomography approaches have been previously proposed for estimating network topology or reconstructing link traffic from incomplete measurements and limited knowledge about network connectivity. However for network anomography, the detection of anomalous deviations of traffic in the network, highly accurate estimation of all network traffic may not be necessary. It often suffices to detect perturbations within the network at an aggregate or global scale. This paper addresses the problem of network anomography rather than that of network tomography or traffic estimation.

### I-a Related Work

Broadly defined, the network tomography problem is to reconstruct complete network properties, e.g., source-destination (SD) traffic or network topology, based on incomplete data. The term “network tomography” was introduced in [1] where the objective is to estimate unknown source destination traffic intensities given observations of link traffic and known network topology. Since the publication of [1], the scope of the term network tomography has been used in a much broader sense (see the review papers [2, 3, 4], and [5]). For example, a variety of passive or active packet probing strategies have been used for topology reconstruction of the Internet, including unicast, multicast, or multi-multicast [6, 7], and [8]; or using different statistical measures including packet loss, packet delay, or correlation [9, 10, 11], and [12].

In the formulation of [1], the network tomography objective is to determine the total amount of traffic between SD pairs given knowledge of the physical network topology and the total amount of traffic flowing over links, called the link data. This leads to the linear model for the observations where is the known routing matrix defining the routing paths, and at each time point ,

is a vector of the observed total traffic on the links and

is a vector of the unobserved message traffic between SD pairs. Using the model that the elements of

are independent and Poisson distributed, an expectation-maximization (EM) maximum likelihood estimator (MLE) and a method of moments estimator are proposed in

[1] for the Poisson rate parameters . The authors of [13]

propose a Bayesian conditionally Poisson model, which uses a Markov chain Monte Carlo (MCMC) method to iteratively draw samples from the joint posterior of

and . The authors of [14] and [15]

assume the message traffic is instead from a Normal distribution, obtaining a computationally simpler estimator of the SD traffic rates. The authors of

[16] relax the assumption that the traffic is an independent and identically Poisson distributed sequence and instead consider the network as a directly observable Markov chain. Under this weaker assumption, they derive a threshold estimator for the Hoeffding test in order to detect if the network contains anomalous activity.

In [17] the authors propose an EM approach for Poisson maximum likelihood estimation when the network topology is unknown; however, their solution is only computationally feasible for very small networks and it does not account for observations of traffic through interior nodes. This has led to simpler and more scalable solutions in the form of gravity models where the rate of traffic between each SD pair is modeled by where and are the total traffic out of the source node and into the destination node respectively and is the total traffic in the network. Standard gravity models do not account for the interior nodes, thus in [18] and [19] tomogravity and entropy regularized tomogravity models were proposed, which incorporate the interior node information in the second stage of their algorithm. The authors of [20] generalize the tomogravity model from a rank one (time periods are independent) to a low rank approximation (time periods are correlated) and allow additional observations on individual SD pairs. Similarly, the authors of [21] and [22] use a low rank model with network traffic maps to incorporate a sparse anomaly matrix, and they solve their multiple convex objectives with the alternating direction method of multipliers (ADMM) algorithm.

Dimensionality reduction has also been used directly for anomaly detection in the SD traffic flows in networks. Under the assumption that traffic links have low rank structure, the authors in [23] and [24] use Principle Component Analysis (PCA) to separate the anomalous traffic from the nominal traffic. This low rank framework is generalized to applying PCA in networks that are temporally low rank or have dynamic routing matrices, in [25]. The authors of [25] also coin the term “network anomography” to reflect the influence of network topology reconstruction, which is a necessary component to detecting anomalies in a network with unknown structure. However, later work in [26] discusses the limitations of PCA for detecting anomalous network traffic, e.g., it is sensitive to (i) the choice of subspace size; (ii) the way traffic measurements are aggregated; (iii) large anomalies. The low rank plus sparse framework is extended to online setting with a subspace tracking algorithm in [27].

Specifically for Internet Protocol (IP) networks, some works prefer to perform anomaly detection on the flows from the IP packets instead of the SD flows. The authors of [28] use PCA to separate the anomalous and nominal flows from sketches (random aggregations of IP flows) while the authors of [29] model the sketches as time series and detect change points with forecasting. The works of [30] and [31] also perform change point detection using windowed hypothesis testing with generalized likelihood ratio or relative entropy respectively.

Because our approach in this paper is based on traffic networks or SD models, these types of approaches were the focus of our related works subsection. However, networks can also be represented as graph models or as features of the network characteristics. This subsection would be incomplete if it did not mention anomaly detection approaches to other types of network models. So, we refer to some survey papers that cover many of the recent techniques in graph based approaches: [32] and [33]. In particular, similar to the low rank approaches for SD networks, there are low rank approaches to graph models such as [34] who assume the inverse covariance matrix of their wireless sensor network data has a graph structure and solve a low rank penalized Gaussian graphical model problem and [35] who impose graph smoothness by a low rank assumption on graph Laplacian of the features of the network. [36] also uses a low rank approach on their KDD intrusion data set, but they directly apply the low rank assumption to the network characteristics of their data.

### I-B Our Contribution

In this paper, we consider networks where an exterior node (a node in an SD pair) only transmits and receives messages from a few other nodes, but because we cannot observe the network directly, we do not know which SD pairs have traffic and which do not. Thus, we develop a novel framework to detect anomalous traffic in sparse networks with unknown sparsity pattern. Our contributions are the following. 1) In order to estimate the network traffic, we propose a parametric hierarchical model that alternates between estimating the unobserved network traffic and optimizing for the best fit rates of traffic using the EM algorithm. 2) We warm-start the algorithm with the solution to non-parametric minimum relative entropy model that directly projects the rates of traffic onto the nearest attainable sparse network. 3) Since we do not make assumptions of fixed edge structure in our model, it allows us to accommodate the possibility of anomalous edges in the actual network structure because anomalies will never be known in advance. 4) Using our probabilistic model’s estimator of actual traffic rates, we test for anomalous network activity by comparing it to a baseline to determine which deviations are anomalies and which are estimation noise. We develop specific statistical tests, based on the generalized likelihood ratio framework, to control for the false positive rate of our probabilistic model, and show that even when our models are misspecified, our tests can accurately detect anomalous activity in the network.

The rest of the paper is organized in the following way. Section II proposes a problem formulation of the network we are interested in and our assumptions about it. Section III describes our proposed hierarchical Bayesian model, which is solved with a generalized EM algorithm and warm-starting the EM with a solution that satisfies the minimum relative entropy principle. Section IV describes our anomaly detection scheme through statistical goodness of fit tests and Section V describes the computational complexity of our method. Section VI contains simulation results of the performance of our proposed estimators and applications to the CTU-13 dataset of botnet traffic and a dataset of NYC taxicab traffic. Finally, Section VII concludes the paper.

## Ii Proposed Formulation

We give a simple diagram of a notional network in Fig. LABEL:sub@fig:oracle. An exterior node, , sends messages, , at a rate, , to another exterior node, , at each time point, . Messages can flow through interior nodes, such as , but the interior nodes do not absorb or create messages. Because the magnitude of flow is just the total number of messages that have been sent from one node to another, network traffic between nodes is a counting process. For tractability, it is common to assume the messages are independent and identically distributed (i.i.d.) and the total number of messages in a time period is from some parametric distribution. The Poisson distribution is the most natural choice because it models events occurring independently with a constant rate, and it is used by [1], [17], [13], [14], and [15]

although the latter two works use a Normal approximation to the Poisson for additional tractability. Under these Poisson process assumptions, the uniformly minimum variance unbiased estimator is simply the maximum likelihood estimator (MLE).

However, this is a very strong and unrealistic assumption because it would require being able to track every single message being passed in the network. Thus, we are interested in the much weaker assumption that we can only monitor the nodes themselves. Fig. LABEL:sub@fig:actual shows what we can actually observe from the network under this weaker assumption. While we also observe the total amount of traffic, unlike in [1], we do not know the network topology.

Since we can only monitor the nodes, we can only observe the total ingress and egress of the exterior nodes. Thus we know an exterior node, , transmits messages and receives messages, but we do not know which of the other nodes it is interacting with. We can also observe the flow through interior nodes, but we cannot distinguish where the messages come from or are going to. For instance, in Fig. LABEL:sub@fig:oracle, an interior node, such as , will observe all messages, , that flow through it, but it will not be able to distinguish the number of messages from each SD pair or whether all the SD pairs actually send messages.

A network with exterior nodes can naturally be mathematically formulated as a matrix, which is observed times. Let be the unobserved traffic matrix at time instance and let the elements of the matrix, , be the amount of traffic between nodes and . The row and column sums of the traffic are denoted by and respectively, and are the observed flows through interior nodes, which are indexed by . The traffic at each time instance is generated from a distribution with mean , the true intensity/rate parameter of the matrix, and is the baseline parameter of a network without any anomalies. This mathematical formulation is shown below.

 Nt=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣0Nt12Nt13⋯Nt1PNt210Nt23⋯Nt2PNt31Nt320⋯Nt3P⋮⋮⋮⋱⋮NtP1NtP2NtP3⋯0⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦

Observations

We assume a priori that the distribution of the rate matrix is centered around some baseline rate matrix , which are the assumed rates when there is no anomalous activity. We then update this prior distribution using the observations in order to get a distribution of the rates , which does account for potential anomalous activity.

## Iii Hierarchical Poisson Model with EM

We propose a generative model that assumes a series of statistical distributions govern the generation of the network. We assume that the messages passed through the network are Poisson distributed with rates . However, because we cannot observe the traffic network directly, we do not have the complete Poisson likelihood and use the EM algorithm. In the following subsections, we will show a series of generative models with increasing complexity that attain successively higher accuracy. Then we will discuss warm-starting the EM algorithm at a robust initial solution to compensate for its sensitivity to initialization.

### Iii-a Proposed Hierarchical Bayesian Model

#### Iii-A1 Maximum Likelihood by EM

The simplest hierarchical model assumes all priors are uniform, thus the only distributional assumption is that likelihood is . The maximum likelihood estimator for the Poisson rates can be approximated by lower bounds of the observed likelihood using the maximum likelihood expectation maximization (MLEM) algorithm. The MLEM alternates between computing a lower bound on the likelihood function , the E-step, and maximizing the lower bound, the M-step. A general expression for the E-step bound can be expressed as:

 (1)

where is an arbitrarily chosen distribution of , denotes statistical expectation with respect to the reference distribution , and is the Shannon entropy of . The choice of that makes the bound (1) the tightest, and results in the fastest convergence of the MLEM algorithm, is (see Section 11.4.7 of [37]); however, this is not a tractable distribution. When the observations consist of the row and column sums of the matrix

, this distribution is the multivariate Fisher’s noncentral hypergeometric distribution, and when the flows are also observed the distribution is unknown. Unfortunately, use of this optimal distribution leads to an intractable E-step in the MLEM algorithm due to the coupling (dependence) between the row and column sums of

. As an alternative we can weaken the bound on the likelihood function (1) by using a different distribution q that leads to an easier E-step. To this aim, we propose to use a distribution q that decouples the row sum from the column sum; equivalent to assuming that each sum is independent, e.g., as if each were computed with different realizations of .

###### Proposition 1.

Assume and are different time points so that observations at these time points are independent

 \emphP(D|Λ)=T∏t1=1\emphP(Rt1|Λ)T∏t2=1\emphP(Ct2|Λ)T∏t3=1\emphP(Ft3|Λ).

Then the tightest lower bound of the observed data log likelihood is

 log\emphP(D|Λ)≥3∑τ=1T∑tτ=1\emphH(qtτ)+\emphEqtτ(log\emphP(Ntτ|Λ))

where , , and are multinomial distributions.

In the EM algorithm, the expectation in the E-step is taken with respect to the distribution estimated using the previous iteration’s estimate of the parameter , and the M-step does not depend on the entropy terms in the lower bound in Proposition 1, which are constant with respect to . Since the likelihoods are all Poisson, the E-step reduces to computing the means of multinomial distributions and the M-step for any pair is given by the Poisson MLE with the unknown terms replaced by their mean values. Explicitly the M-step objective is

 ^Λk+1ij=argmaxΛij−Λij+log(Λij)Ntotalij (2)

where and the expectations are with respect to the multinomial distributions of Proposition 1 . Thus the Poisson MLE equals .

#### Iii-A2 Maximum a Posteriori by EM

Because there are unobserved variables and only observed variables, the expected log likelihoods have a lot of local maxima. In order to make the EM objective better defined and incorporate the baseline Poisson rate information

, a prior can be added to the likelihood model of the previous subsection. The EM objective of this new model is now the expected log posterior and the estimator in the M-step is the maximum a posteriori (MAP) estimator. It is natural to choose a conjugate prior of the form

where each (shape, rate) as this choice yields a closed form expression for the posterior distribution. These priors have modes at the baseline rates

. The hyperparameters

can be thought of as the belief we have in the correctness of the baseline so as , the prior variance goes to infinity, and the prior becomes non-informative because we have no confidence in the baseline, while as , the prior variance goes to zero, and the prior degenerates into the point because we are certain the baseline is correct.

Given a matrix of hyperparameters , the complete data posterior distribution is where each posterior is of the form of . Because we can only observe the network indirectly , we again must estimate the mode of this posterior using the EM algorithm, which is very similar to the algorithm for the likelihood model. The only difference is the M-step in which an additional term of the form is added to (2). Thus at every EM iteration, the entries of the MAP estimator matrix are

 ^Λk+1ij=ϵijΛ0ij+Ntotalijϵij+3T (3)

where is the same as in (2).

#### Iii-A3 Bayesian Hierarchical Model

Choosing the hyperparameters can be difficult because it is not always possible to quantify our belief in the correctness of the baseline rates. We can rectify this by allowing the

to be random with hyperpriors

. We choose uninformative hyperpriors for . A notional diagram for the proposed hierarchical model is shown in Fig. 4.

With these uninformative priors the posterior takes the form

 P(Λ|N1,…,NT)=∫P(N1,…,NT|Λ)P(Λ|ϵ)P(ϵ)P(N1,…,NT)dϵ =∫P(N1,…,NT|Λ)P(Λ|ϵ)P(N1,…,NT|ϵ)P(N1,…,NT|ϵ)P(ϵ)P(N1,…,NT)dϵ =∫P(Λ|ϵ,N1,…,NT)P(ϵ|N1,…,NT)dϵ (4)

where . The observed (incomplete data) log posterior has lower bound proportional to

 log(∫exp{Eq(logP(Λ|ϵ,N1,…,NT))} exp{Eq(log∫%P(Λ,ϵ|N1,…,NT)dΛ)}dϵ)

which is tight when , as shown in (VII) in the Appendix.

However, marginalizing the joint posterior is often not feasible, so instead it is popular to use empirical Bayes to approximate it with a point-estimate

We propose an empirical Bayes approach to maximizing the log posterior as an alternative to maximization of (III-A3) . This empirical Bayes approximation can be embedded in the EM algorithm so that once we have an estimate for , an estimator for is obtained by maximizing the expected log conditional posterior .

###### Theorem 1.

Using the time independence in Proposition 1 and the empirical Bayes approximation, the E-step of the EM algorithm for the hierarchal model is

 ^Nt1ij=\emphE(Nt1ij|Rt1,^Λk)=^Λkij∑Pj=1^ΛkijRt1i ^Nt2ij=\emphE(Nt2ij|Ct2,^Λk)=^Λkij∑Pi=1^ΛkijCt2j, ^Nt3ij=\emphE(Nt3ij|Ft3,^Λk)=^Λkij∑ij^ΛkijFt3h for any pair ij,

and the M-step is

 ^ϵk+1ij=argmaxϵij3∑τ=1T∑tτ=1logΓ(^Ntτij+ϵijΛ0ij+1)Γ(ϵijΛ0ij+1) and ^Λk+1ij=argmaxΛij(^ϵijΛ0ij)log(Λij)−^ϵijΛij−3TΛij +log(Λij)⎛⎝T∑t1=1^Nt1ij+T∑t2=1^Nt2ij+T∑t3=1^Nt3ij⎞⎠.

Since the function that lower bounds the observed log likelihood changes after every iteration of the EM algorithm, the prior should also change after every iteration. Intuitively, the earlier iterations of the EM algorithm will have expected log likelihoods that are more misspecified than the later iterations. This suggests spreading the prior distribution in the earlier iterations. The empirical Bayes approximation of Theorem 1 effectively does this by allowing the variance of the prior to be chosen using the data instead of fixing it as a constant. In this manner, the empirical Bayes approximation can be thought of as a Bayesian analog to the regularized EM algorithm of [38].

### Iii-B Warm Starting with Minimum Relative Entropy

The EM algorithm is well known to be sensitive to initialization, especially if the objective has a lot of local maxima. Thus if instead of a random initialization, the EM algorithm is warm-started, it is more likely to converge to a good maximum and also potentially converge faster. A good choice for an initialization point is a more robust estimator of the rate matrix such as the solution to a model with fewer distributional assumptions. Thus instead of modeling an explicit generative model, we can instead adopt the minimum relative entropy (MRE) principle [39, 40, 41], and [42]. Geometrically, this reduces to an information projection of the prior distribution, as shown in Fig. 5.

The constrained minimum relative entropy distribution is the density that is closest to a given prior distribution and lies in a feasible set, . This feasible set is formed from constraints that require their expected values, with respect to the minimum relative entropy distribution, to match properties of the observations,

(the total ingress, egress, and flows). And because relative entropy is the Kullback-Leibler (KL) divergence between probability distributions, this is used as the metric for closeness. This closeness criterion is well suited to the anomaly detection problem of interest to us because anomalous activity is rare, so the distribution of the actual rates,

, should be similar to the prior distribution , which is parameterized by the baselines rates .

The MRE objective is

 minP(Λ|R,C,F)%KL(P(Λ|R,C,F)||P0(Λ)) \makebox[260.172pt]subject to ∫P(Λ|R,C,F)(Λ1−¯R)dΛ=0 ∫P(Λ|R,C,F)(1′Λ−¯C)dΛ=0 ∫P(Λ|R,C,F)(AΛB−¯F)dΛ=0

where and are vectors of zeros and ones respectively, and are the average rates of observed total traffic into and out of each node, and and are 0-1 matrices summing the rates that flow through each of the interior nodes with average observations . Using the Legendre transform of the Lagrangian to get the Hamiltonian, the optimal density has the form

 P(Λ|R,C,F)=P0(Λ)Z(ρ,γ,ϕ)exp{ρ′(Λ1−¯R) (5) +γ′(1′Λ−¯C)+ϕ′(AΛB−¯F)}

where are Lagrange multipliers that maximize the negative log partition function .

###### Proposition 2.

Let be independent Laplace distributions with mean parameter and scale parameter , then the constrained mode of the MRE distribution is the solution to

 argmaxΛ∈R+−||Λ−Λ0||1+^ρ′(Λ1−¯R) +^γ′(1′Λ−¯C)′+^ϕ′(AΛB−¯F)

where .

Maximizing the above expression over

(constrained to only positive real numbers) can be seen as a slight relaxation of the more direct objective of minimizing the loss function

 argminΛ∈R+||Λ−Λ0||1 (6) subject toΛ1=¯R,1′Λ=¯C,AΛB=¯F

where is the element wise norm. The loss function in (6) has the advantage that it can be easily implemented in any constrained convex solver such as CVX [43].

The objective in (6) is an easily interpretable formulation for estimating the rate matrix, which does not depend on the unobserved traffic . And, because it does not put distributional assumptions on the “likelihood”, it is more robust to model mismatch, at the cost of accuracy. The generality of the solution to (6), while not precise enough on its own, makes it a good candidate to be further refined by the EM algorithm in the Hierarchical Poisson model.

## Iv Testing For Anomalies

Since the estimators in the previous section are maximizers of probabilistic models, a natural way to test for anomalies in the rate matrix

is to compare goodness of fit of the fitted model using hypothesis testing. By testing the null hypothesis

against the alternative hypothesis , we can control the false positive rate (FPR) (Type 1 error), of incorrectly declaring anomalous activity in the rate matrix, using a level- test. In this section we will represent a statistical model with the notation , as the results apply for both log likelihood and log posterior models.

Depending on if the statistical models are likelihoods or posteriors, the statistic

 ψ=−2T∑t=1(log(Mt(Λ0))−log(Mt(^Λ))) (7)

would be either a log likelihood ratio (LR) statistic or a log posterior density ratio (PDR) statistic [44] respectively, where . Thus testing against a threshold can be seen as a generalized log likelihood ratio test or generalized log posterior ratio test with a composite alternative hypothesis.

###### Proposition 3.

Under the standard regularity conditions for the log LR statistic or under the sufficient conditions of the Bernstein-von Mises theorem for the log PDR statistic,

will be asymptotically distributed under the null hypothesis.

Next we show that the statistic in (7) is a good estimator of the KL divergence between the true model at its maximum and the true model at the baseline. And even if the models are misspecified, the statistic

 ^ψ=−2T∑t=1(log(^Mkt(Λ0))−log(^Mkt(^Λ)))

can still be a good estimator for goodness-of-fit, where the in and indicates the iteration of the EM algorithm.

###### Proposition 4.

The statistic is a consistent estimator for

 Ψ=2\emphKL(M(Λ∗)||M(Λ0)),

the KL divergence between the true model and the true model under the null hypothesis. The statistic is a consistent estimator for

 2\emphKL(M(Λ∗)||M(Λ0)) (8) −2(\emphKL(M(Λ∗)||^Mk(^Λ∗))−\emphKL(M(Λ0)||^Mk(Λ0)))

where is the closest population local maximum at iteration .

The second term in (8) can be seen as the difference between the true model misspecification error and the model misspecification error of the null hypothesis. So if conditions are satisfied so that the EM algorithm converges to the global maximum as the number of iterations or if the model is equally as misspecified under the truth as under the null hypothesis such that the differences in the second term in (8) cancel to 0, then the statistic is also a consistent estimator of . The justification for using misspecified models can also be geometrically interpreted as follows. Because the models estimated from the EM algorithm are from the correct parametric family of distributions, the misspecified models still lie on the same Riemannian manifold as the correct models. Below, we provide an algorithm for performing hypothesis testing on the statistic .

Algorithm 1 calculates the statistic as a log ratio of the modes of the model under the null and alternative hypothesis. It then tests against a critical value , which is related to the false positive level.

Under the null hypothesis, the statistic can be decomposed as sampling error plus model error . Thus for the level- test

, a Type-I error can occur due to either sampling error or model error or a combination of both. Since typically the finite sample distribution of the statistic

is unknown, the asymptotic distribution described in Proposition 3 can be used to choose the critical value of . Assuming the model error is small, or small relative to the sampling error, we can also use Proposition 3 to choose the critical value of a test with a misspecified statistic . In the following section, we will show in simulations that the asymptotic distribution of the correct statistic is adequate for choosing the critical value of a test using the misspecified statistic .

## V Computational Complexity

In Algorithm 2, we present our hierarchical Poisson EM model warm started at the MRE estimator and analyze its computational complexity.

Warm starting the EM algorithm at the MRE solution (6) requires using interior-point methods, which have polynomial complexity in the number of variables. Since the MRE objective has linear variables and second order cone problem variables, the computational cost is of order where is the polynomial degree (often 3) and is the number of iterations of the interior point algorithm.

The E-Step consists of calculating the multinomial means using the observed data. Assume that the number of flows in the interior nodes are roughly , so that each of the row sums, column sums, and interior node flows are the summation of values. Then for each independent time instance , there are summations of values in denominator and a multiplication and division operation on each of the entries in the numerator. The total computational cost of the E-step is of order where is the number of different time points in Proposition 1 (2 + number of interior nodes).

In the M-step, the estimator

can only be solved numerically because the score function of the negative binomial distribution is a non-linear equation. Because we can derive the gradient of the score function, we can use a trust-region method with a Newton conjugate gradient subproblem (each subproblem has linear complexity in time points). Given

, the estimator can be solved in closed form (3) with scalar operations, making its complexity linear in time points. Thus the total computational cost of the M-step is of order where is the number of conjugate gradient iterations.

Given the final iterations EM estimators, evaluating the models at each entry only involves scalar operations, and getting the log ratio statistic requires summing over all entries and the

time points; so the total complexity of the anomaly test statistics is of order

. Thus, overall Algorithm 2 has computational complexity of order . Note that our choice in algorithms for the numerical optimizations were based more on convenience (using popular standard packages e.g. CVX, Matlab’s fsolve) than optimal performance, so the computational complexities listed in this section are certainly not the best case scenarios. Nonetheless, even using non-optimal numerical algorithms, we show, in the following section, that our method can run in a reasonable amount of time in both simulations and large real world problems.

## Vi Simulation and Data Examples

In this section, we model network traffic in both simulated and real datasets as hierarchical Poisson posteriors to get estimators of the true network traffic rates. These estimators, from the hierarchical Poisson posteriors where the EM algorithm is initialized randomly or at the MRE estimator (Rand-HP or MRE-HP), are tested against baseline rates to detect anomalous activity in the network, as shown in Algorithm 1. We compare the performance of our proposed models to the maximum likelihood EM (MLEM) model of [17] (with the same time independence assumptions of Proposition 1 for feasibility), the Traffic and Anomaly Map (TA-Map) method of [22], and an “Oracle” that unrealistically observes the network directly. The “Oracle” estimator is the uniformly minimum variance unbiased estimator and achieves the Cramer-Rao lower bound [45].

The Traffic and Anomaly Map method is the state-of-the-art for estimating the rates in networks with traffic anomalies. Specifically for the TA-Map method we use the objective of (P1) in [22], but with the low rank decomposition of (P4) in [22] where and is a vector of ones because the rates do not change over time. Since the anomalies also do not change over time, they can be expanded as where is a vector of rates of anomalous activity. We use to form the routing matrix for the vector of nominal rates and a full routing matrix for the vector of anomalous rates since we do not know any structural knowledge about them. Additionally, converting the notation of [22] to the notation of this paper, , are defined as the edges that are observed, and , where and are solved using CVX on (P1) in [22]. We empirically choose the penalty parameters and .

### Vi-a Simulation Results

We simulate networks where the baseline rate matrix has 10 exterior nodes and 2 interior nodes. The probability of an edge between any two nodes in the baseline network is , the baseline rates are drawn from distributions, and each interior node observes the total flow of a random 7 edges. We consider scenarios where anomalous activity can take place in either the edges or the nodes. In the first scenario, the anomalous activity can cause increases in the rates of some of the edges, new edges to appear or disappear, or both. So, the rates of anomalous activity are drawn from distributions where the probability of anomalous activity between any two nodes is . In the second scenario, there is a hidden node that is interacting with the other nodes, thus affecting the observed total flows of the known nodes. So the entries of the true rate matrix are drawn from distributions, but the true rate matrix has 11 exterior nodes and the baseline rate matrix is the submatrix of known nodes. Like in the first scenario, the probability of an edge between the hidden node and another node is . All simulations contain 200 trials, with anomalous activity in approximately half of them.

In Fig. 6 we explore the accuracy of correctly identifying anomalous activity as a function of the percentage of observed edges, where we observe time points (samples). We measure accuracy as where the number of true positives (TP) and true negatives (TN) are the number of times a method correctly detects that there is anomalous activity or no anomalous activity respectively. For the probabilistic models (MLEM, Rand-HP, MRE-HP) , we use the likelihood or posterior density ratio tests described in Section IV

where the critical value is calculated using the inverse cumulative distribution function of the

distribution at . The Traffic and Anomaly Map method uses a threshold on the maximum (absolute) value of the anomaly matrix where the threshold is chosen so that it has Type-I error. While the accuracy of all the probabilistic models increases as the percentage of observed edges increases, the MLEM has low accuracy unless over 80% of the network is observed whereas the two Hierarchical Poisson models have high accuracy even when no part of the network is directly observed. The TA-Map method also has poor performance at all percentages of the network observed. This may due to issues the TA-Map method has at separating and into the correct separate matrices even when the total estimator is accurate.

While the Rand-HP and MRE-HP models have approximately the same accuracy at detecting anomalies (MRE-HP does slightly better when only a few of the edges are observed), initializing the EM algorithm of the Hierarchical Poisson model at the MRE solution has additional benefits. Fig. 7 shows that the EM algorithm in the Hierarchical Poisson model with random initialization takes longer to converge than if it is initialized at the MRE solution. This is because, if the EM algorithm is initialized in a place where likelihood is very noisy, it may have difficultly deciding on the best of the nearby local maxima, but the MRE solution is often already close to a good local maximum.

Fig. 8 shows the mean squared error (MSE) of the estimated rate matrices . The MRE-HP model gains some of the advantages of the MRE estimator making its MSE much lower than that of the Rand-HP model. As the percentage of observed edges in the network increases, all estimators’ errors decrease to the Oracle estimator’s error, which is the lowest possible MSE among all unbiased estimators. However, both the TA-Map method and the MLEM model do not have good performance except when almost all of the network is observed, at which point every estimator performs well. Note that estimating the traffic is not the end goal in the considered anomaly detection problem. We demonstrate this by comparing Fig. 8 to Fig. 6, where we can see that estimating the traffic well (having low MSE) does not guarantee the method high accuracy. Low MSE implies that a method’s estimates do not have a large difference with the true rates, however depending on where the differences occur, it can be enough to cause the method to incorrectly detect anomalous activity.

Fig. 9 shows the ROC curves of the anomaly detection performance of the MRE-HP, MLEM, and TA-Map methods for both the anomalous rates and the hidden node scenarios, where only 20% of the edges are observed. The accuracy of the MRE-HP model increases with the total observation time , and it can detect anomalous activity almost perfectly with only 100 time points, as evidenced by its area under the curve (AUC) being very close to 1. The stars over the lines are the FPR vs TPR when using the critical values found by calculating the inverse cumulative distribution function of the distribution at 0.05. The ROC curve for testing a misspecified LR test statistic using the MLEM is just the point at because the Poisson MLE model is so misspecified, it always rejects the null hypothesis. The TA-Map method, while it does not always rejects the null hypothesis like the MLEM model, performs about as bad as random guessing (a diagonal line from to ). These results are consistent with the accuracy results shown in Fig. 6.

In Table I, we show the corresponding CPU timings of each method in the two scenarios used in Fig. 9. The algorithms were run on an Intel Xeon E5-2630 processor at 2.30GHz without any explicit parallelization; however some of the built-in Matlab functions are by default multi-threaded (such as ones that call BLAS or LAPACK libraries). While the MRE-HP is slower than the competing methods, its computation time is still very fast and on average less than half a minute. Also, note the significant performance improvement provided by MRE-HP in the considered anomaly detection problem (see Fig. 6 and Fig. 9).

### Vi-B CTU-13 Dataset

The proposed model was applied to botnet traffic networks from the CTU-13 dataset, which come from 13 different scenarios of botnets executing malware attacks captured by CTU University, Czech Republic, in 2011 [46]. The dataset contains real botnet traffic mixed with normal traffic and background traffic and the authors of [46] processed the captured traffic into bidirectional NetFlows and manually labeled them. Because the objective is to detect if there is botnet traffic among the regular users, we will only use the sub-network of nodes that are being used for normal traffic, but the traffic on this sub-network can be of any type: normal, background, or botnet. Thus, baseline traffic on the network is either normal or background traffic and the anomalous traffic is from botnets. And because the botnet traffic originates and also potentially ceases from nodes that are not the regular users, the anomalous activity is due to unobserved hidden nodes.

The observations consist of the total ingress and egress of each node along with the total flows of 10 interior nodes, where each interior node receives flow from other nodes, in addition to observing 20% of the edges in the network. An observation or sample is all the traffic that occurs in a one-hour time period. For each of the scenarios, we test the probabilistic models at an alpha level of 0.05 under both regimes where the null hypothesis is true (no botnet traffic) and not true (botnet traffic). For the TA-Map method of [22], we use the ROC curves from the simulations to choose the threshold that yields a Type-I error equal to . Table II summarizes the characteristics of each of the 13 difference scenarios.

Table III shows that the Hierarchical Poisson model initialized at the MRE solution always correctly rejects the null hypothesis when it is not true. However, the model incorrectly rejects the null hypothesis in Scenario 3. This scenario has far more nodes than any of the other scenarios, and as the number of nodes increase, the number of entries that must be estimated, , vastly outweigh the number of observations, . This gives rise to a large model misspecification error in this scenario, which would negatively impact the accuracy of Algorithm I. Like in the simulations, the Poisson MLE model always rejects the null hypothesis due to its massive model misspecification error and the TA-Map method also has poor performance in the scenarios that are computationally feasible for the method (the ones marked NA are too computationally expensive). Overall MRE-HP has good performance detecting anomalous activity, especially compared to the other methods.

In Table IV, we show the CPU timings of the algorithms for the 13 scenarios in the CTU-13 dataset under both hypothesis, where the algorithms are run on the same processor described in the simulations. Even for scenario 8, the computational times of MRE-HP are feasible despite running on a rather out-of-date processor with a low clock speed. Again we mark NA for the scenarios that are computationally infeasible for the TA-Map method (the memory requirements are above 32GB even for scenario 6). The MRE-HP method despite being slower than the TA-Map on smaller networks (see Table I), scales much more efficiently to larger networks.

### Vi-C Taxi Dataset

The proposed model was applied to a dataset consisting of yellow and green taxicabs rides from the New York City Taxi and Limousine Commission (NYC TLC) [47] and [48]. For every NYC taxicab ride, the dataset contains the pickup and drop-off locations as geographic coordinates (latitude and longitude). Green taxicabs are not allowed to pickup passengers below West 110th Street and East 96th Street in Manhattan, but occasionally they risk the chance of getting punished and ignore the regulations. In an article on June 10th 2014, the New York Post explains how the city began hiring more TLC inspectors to catch illegal pickups and enforce the location rules [49]. Thus we are interested in identifying if there are green taxicabs operating in lower Manhattan when we only know the yellow taxicab network. We treat the 18 Neighborhood Tabulation Areas (NTA) in lower Manhattan as nodes and associate any pickups or drop-offs within an NTA’s boundaries as traffic entering or leaving the node. We form edges from only frequently occurring routes of traffic, which we define as having activity at least an average of every 20 minutes for yellow taxicabs and twice a month for green taxicabs. For samples, we use the yellow and green taxicab rides from between January and May of 2014 and aggregate them into daily totals.

Like in the previous example, we indirectly observe samples of the total ingress and egress of each node, and the total flows of 10 interior nodes that each observe the flows of nodes. This creates a total traffic network with nodes and non-zero edges ( sparsity) where the baseline network (yellow taxicab rides) has of the edges. There is anomalous activity (green taxicab rides) on of the edges, where of these edges are also in the baseline network and are not. We observe the network for a total of days. Fig. 10 shows the baseline network formed from yellow taxicab rides and the unknown anomalous activity due to illegal pickups from green taxicabs.

Table V shows, for different percentages of edges observed, whether the correct decision (reject or not) is made when the null hypothesis is true (no green taxi traffic) and when it is not true (green taxi traffic). The Hierarchical Poisson model initialized at the MRE solution always makes the correct decision while the Poisson MLE model, except for when the network can be directly observed, always rejects the null hypothesis. These two models are tested at an alpha level of . The Traffic and Anomaly Map method, which has a Type-I error threshold chosen from the ROC curves of the simulations, also has poor performance.

From the results of Table V, we know the Hierarchical Poisson model initialized at the MRE solution is always able to detect changes in the network at a global scale, but we are also interested in the recovery of the individual green taxicab routes. When 70% of the network is observed, the model is able to detect 52 of the 56 edges that contain anomalous activity with only a 2% false positive rate. The 4 missed edges and 5 false alarms are shown in Fig. 11.

Out of the 4 misses, 3 of them are from green taxicab pickups from MN31, which contains the Lenox Hill and Roosevelt Island areas. Green taxicabs are allowed to pick up passengers from Roosevelt Island, but not from Lenox Hill, so some of the traffic on these 3 routes could be legal and not anomalous activity. The other miss, from MN19 to MN40, only had 11 rides in 150 days, making it harder to distinguish from just perturbation noise in the samples.

## Vii Conclusion

We have developed a framework and a probabilistic model for detecting anomalous activity in the traffic rates of sparse networks. Our framework is realistic and robust in that, at minimum, it only requires observing the total egress and ingress of the nodes. Because it imposes no fixed assumptions of edge structure, our framework allows the estimator to handle noisy observations and anomalous activity. Our simulation results show the advantages of our model over competing methods in detecting anomalous activity. Through application of our models to the CTU-13 botnet datasets, we show that the model is scalable and robust to various scenarios, and with the NYC taxi dataset, we show an application of our model and framework to an already identified real-world problem.

###### Proof of Proposition 1.

By Jensen’s inequality,

 =log⎛⎝T∏t1=1P(Rt1|Λ)T∏t2=1P(Ct2|Λ)T∏t3=1P(Ft3|Λ)⎞⎠ ≥T∑t1=1Eqt1(log% P(Rt1,Nt1|Λ))−Eqt1(logq(Nt1)) +T∑t2=1Eqt2(logP(Ct2,Nt2|Λ))−E%qt2(logq(Nt2)) +T∑t3=1Eqt3(logP(Ft3,Nt3|Λ))−E%qt3(logq(Nt3)) =T∑t1=1Eqt1(logP(Rt1|Nt1,Λ))+T∑t2=1Eqt2(logP(Ct2|Nt2,Λ)) +T∑t3=1Eqt3(logP(Ft3|Nt3,Λ))

and . The inequality is tight (by KL divergence) when , , and are multinomial distributions. ∎

###### Proof of Theorem 1.

Define as the set of all network traffic at different time points for the entire sample window . So is the intersection of the set and is its joint probability. By Jensen’s inequality,

 =log∫P(Λ,ϵ|D)dϵ=log∫P(D|Λ,ϵ)% P(Λ|ϵ)P(ϵ)P(D)dϵ =log∫(∫⋯∫P(D,∩N|Λ,ϵ)P(Λ|ϵ)P(ϵ)P