Anomaly Detection in Large Scale Networks with Latent Space Models

11/13/2019 · by Wesley Lee, et al. · University of Washington

We develop a real-time anomaly detection algorithm for directed activity on large, sparse networks. We model the propensity for future activity using a dynamic logistic model with interaction terms for sender- and receiver-specific latent factors in addition to sender- and receiver-specific popularity scores; deviations from this underlying model constitute potential anomalies. Latent nodal attributes are estimated via a variational Bayesian approach and may change over time, representing natural shifts in network activity. Estimation is augmented with a case-control approximation to take advantage of the sparsity of the network and reduces computational complexity from O(N^2) to O(E), where N is the number of nodes and E is the number of observed edges. We run our algorithm on network event records collected from an enterprise network of over 25,000 computers and are able to identify a red team attack with half the detection rate required of the model without latent interaction terms.


1 Introduction

The near ubiquity of reliable and cost-effective telecommunications technology means that even small or medium size organizations maintain enterprise networks with thousands of interconnected devices. The “Internet of Things (IoT),” bringing with it smart homes, streets, cars, and offices, has further increased the number and type of networked devices. Modern networks now consist of a mix between devices where electronic communication is a primary function (e.g. computers, servers, or smart speakers) and devices that are intended for another purpose but maintain network connectivity for convenience or remote access (e.g. home appliances or cars).

These expanding enterprise networks provide unparalleled access and convenience for device users, but also increase the potential points of vulnerability for cyberattacks. In this paper, we present a scalable statistical approach for identifying anomalous behavior in enterprise networks. We use latent space models (Hoff et al., 2002) to parsimoniously represent the likely complex structure of the connectivity graph and to capture higher order dependence between devices. This approach provides a richer baseline description of network behavior than current approaches that, for computational reasons, focus on either single-device behavior or clusters of devices. We use a combination of variational and case-control approximations to ensure that our approach can both represent rich network structure and be implemented on large-scale networks.

Our approach simultaneously addresses three key challenges that are critical for anomaly detection in cybersecurity settings. Our approach (i) uses a baseline model for behavior on the network that incorporates higher order dependence between networked devices, (ii) includes a scalable computational algorithm that leverages the sparsity in enterprise networks, and (iii) captures dynamics through an efficient online updating scheme.

First, a common approach for anomaly detection in this setting is to build a model that represents normal behavior and use deviations from this model to flag potential anomalies (Neil et al., 2013a). Ahmed et al. (2016) provides a survey of alternate anomaly detection techniques in network settings. When using a probabilistic baseline model, each pair of devices, or dyad, has a propensity to interact in a given time period. If observed device interactions are very unlikely under the baseline model, then an anomaly is registered for further investigation.

We propose a baseline that incorporates higher order network properties using a latent position model (Hoff et al., 2002). These models represent the likelihood of two devices interacting based on distance in an unobserved geometric space. The closer the estimated positions of the two devices, the more likely they are to interact. The latent distance model captures higher order features of the graph, such as the propensity to form triangles, through the geometry of the latent space. Our approach is in contrast to existing baseline models which, for computational reasons, avoid complex interaction terms, opting for simple sender- and receiver-specific popularity terms (Neil et al., 2013b) or clustering nodes and modeling interactions at a cluster level (Metelli and Heard, 2016). While an approach that relies on individual device effects will be effective for capturing some types of attacks (e.g. if an attacker uses an infected device in a way that dramatically increases the activity of that device), it will fail to detect more sophisticated attacks that change the pattern of device behavior rather than simply the volume.

Second, we develop a strategy for online computation that leverages the sparsity in enterprise networks. As alluded to above, computation in network models depends not on the number of devices, but on the number of device pairs, or dyads. If N is the number of devices, then there are approximately N^2 possible dyads. A system with 25,000 devices, for example, will have over 600 million dyads, making any algorithm that requires evaluating the full likelihood via summation over device pairs completely infeasible. This scale also precludes Bayesian computation using Markov chain Monte Carlo (MCMC), as is typically done for network models. We leverage the variational message passing algorithm described in Minka (2005) with a case-control approximation inspired by Raftery et al. (2012) in order to reduce the computation cost from O(N^2) to O(E), where E is the number of observed edges in the network. Since the computational complexity scales with the number of realized connections, rather than the number of possible ones, the algorithm will be particularly efficient for sparse graphs. Sparsity is a common feature of enterprise networks. For example, most end-user machines do not talk to each other; rather, they communicate with servers, vastly reducing the number of active dyads. In the data we use in our empirical evaluation, for example, an average of only 0.02% of possible dyads communicate in any four-hour period.
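A back-of-the-envelope calculation, using only the figures quoted above (25,000 devices, 0.02% of dyads active in a four-hour window), illustrates the gap between the O(N^2) and O(E) regimes:

```python
# Dyad-level vs edge-level cost, using the figures quoted in the text.
N = 25_000                      # number of devices
dyads = N * (N - 1)             # possible directed dyads, O(N^2)
active_fraction = 0.0002        # 0.02% of dyads communicate per period
E = int(dyads * active_fraction)

print(f"possible dyads: {dyads:,}")   # over 600 million
print(f"observed edges: {E:,}")       # roughly 125,000
print(f"ratio: {dyads // E}")
```

An algorithm that touches only observed edges therefore does several thousand times less work per sweep on a network of this sparsity.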

Our method has similarities to existing dynamic network models using latent projections and to previous work on variational inference for network models. Sewell and Chen (2015) and Durante and Dunson (2016) both propose models with temporal dynamics, but these are difficult to scale to graphs of the size we encounter in enterprise networks. Salter-Townshend and Murphy (2013) also propose a variational algorithm for the static latent position model (Hoff et al., 2002). Their approach is based on minimizing KL divergence, finding substantial computational gains over a comparable MCMC approach even before using the case-control approximation from Raftery et al. (2012). The primary expectation required in their algorithm is inherently intractable, and they proceed via a series of Taylor series expansions in order to reach a tractable expression. In contrast, we adopt a variational approach minimizing a different divergence metric that results in a tractable, analytic set of updating equations. Sewell et al. (2017) also propose a scalable computational algorithm for dynamic networks, but focus on detecting community structure rather than identifying anomalous behavior.

Third, we propose an efficient dynamic updating procedure to capture variations in behavior over time. Taking advantage of the parametric form of the variational approximation to the posterior, we allow our parameters to evolve with discrete time dynamics via Gaussian random walks, adopting the autotuning procedure from McCormick et al. (2012) in order to flexibly adjust the amount of additional variation introduced at each time step. In addition to their autotuning procedure, McCormick et al. (2012) propose a general-purpose algorithm for estimating dynamic logistic regression models. However, their approach jointly updates the logistic parameters via Newton's method and requires inversion of the corresponding Hessian. When the number of parameters in the model is large, such as when each node has its own popularity term, this inversion becomes infeasible.
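A rough operation count makes the infeasibility concrete. With one sender and one receiver popularity term per node (ignoring the latent factors entirely), the Hessian in a joint Newton update is already P × P with P = 2N, and a dense inversion costs on the order of P^3 operations:

```python
# Rough cost of one joint Newton step when every node carries its own
# sender and receiver popularity term (latent factors excluded).
N = 25_000        # number of devices, as in the LANL network
P = 2 * N         # sender + receiver popularity parameters
flops = P ** 3    # dense Hessian inversion is O(P^3)

print(f"parameters: {P:,}")
print(f"approx. operations per Newton step: {flops:.1e}")
```

At roughly 10^14 operations per step, repeated over many time periods, a joint Newton update is not a practical option at this scale.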

We evaluate our method using the Netflow activity data collected by the Los Alamos National Laboratory (LANL) on their enterprise network over a period of 89 days (Turcotte et al., 2018). Each record corresponds to a directed communication between network devices, and activity among over 25,000 devices is logged across the 89 days. In cybersecurity, a major objective is to flag potential intrusions on the network. The detection of these intruders is time-sensitive, reflecting a desire to prevent further intrusion when possible. The LANL data also contain a so-called "red team" attack, a simulated cyberattack that mimics tactics used by actual attackers. The "red team" attack provides a ground-truth event to use for benchmarking discovery. The data are available at https://csr.lanl.gov/data/2017.html.

The remainder of the paper is organized as follows. In Section 2.1 we describe the static bilinear effects model. In Section 2.2 we present the variational message passing algorithm for the static model as well as the case-control modification. Section 2.4 adapts the model and algorithm to a dynamic setting, and Section 2.5 describes anomaly detection after estimation is complete. In Section 3 we demonstrate the performance of our algorithm in a simulation study, in Section 4 we apply our algorithm to the LANL computer network data, and in Section 5 we conclude.

2 Dynamic Latent Space Models

In this section we present our model and computation strategy. We first present a network model and computational approach for static graphs and then describe how we incorporate temporal dynamics.

2.1 Bilinear Mixed-Effects Model

To model baseline behavior at a given time point we use the logistic specification of the bilinear mixed-effects model (Hoff, 2005). Letting y_ij indicate the presence of directed activity from sender i to receiver j (e.g. a message passed from a computer to a networked printer), under this model

P(y_ij = 1) = exp(θ_ij) / (1 + exp(θ_ij)),    (1)

where

θ_ij = μ + a_i + b_j + u_i'v_j.    (2)

μ controls the overall sparsity of the network, while a_i and b_j represent sender- and receiver-specific popularity terms, and u_i and v_j are k-dimensional sender- and receiver-specific latent factors. The interaction term u_i'v_j captures the affinity between i and j, and can be interpreted as the additional propensity for senders with certain latent characteristics to interact with receivers with certain other latent characteristics, over the baseline propensity implied by their respective popularities.

We complete our Bayesian model by introducing independent Gaussian priors for each of the parameters: μ, the popularity terms a_i and b_j, and the latent factors u_i and v_j.

Lastly, we use N to denote the number of nodes in the network and the N × N matrix Y to denote all directed activity in the network. Without loss of generality, we assume the number of senders and the number of receivers in the network are equal.
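The model in equations (1) and (2) can be sketched directly. The following is a minimal illustration, with all parameter values drawn from arbitrary standard normal priors rather than estimated from data:

```python
import numpy as np

# Sketch of the bilinear mixed-effects edge probability:
# logit P(y_ij = 1) = mu + a_i + b_j + u_i'v_j.
# All values below are illustrative draws, not estimates.
rng = np.random.default_rng(0)

N, k = 100, 2                    # nodes and latent dimension
mu = -6.0                        # overall sparsity (very sparse network)
a = rng.normal(0.0, 1.0, N)      # sender popularity terms
b = rng.normal(0.0, 1.0, N)      # receiver popularity terms
U = rng.normal(0.0, 1.0, (N, k)) # sender latent factors u_i
V = rng.normal(0.0, 1.0, (N, k)) # receiver latent factors v_j

def edge_prob(i, j):
    """P(y_ij = 1) under the logistic bilinear mixed-effects model."""
    theta = mu + a[i] + b[j] + U[i] @ V[j]
    return 1.0 / (1.0 + np.exp(-theta))

print(edge_prob(0, 1))
```

Note that the model is directed: edge_prob(i, j) and edge_prob(j, i) generally differ, since sender and receiver effects enter asymmetrically.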

2.2 Variational Inference

Let Θ denote the set of latent variables in the bilinear mixed-effects model. Given the large number of parameters, we wish to construct a parsimonious representation of the posterior p(Θ | Y). For example, the posterior covariance between the sender- and receiver-specific popularity terms alone would require the storage of an N × N matrix. To this end, we focus on learning a fully factorized approximation to the posterior, with independent terms for each latent variable. Specifically,

q(Θ) = q(μ) ∏_i q(a_i) q(b_i) q(u_i) q(v_i),    (3)

where each marginal term is modeled as a Gaussian (with k × k covariance matrices for each of the latent factor terms). Representing the posterior of each latent variable with an independent Gaussian leads to a storage complexity of O(N) and will also help facilitate the introduction of temporal dynamics in the following section.

Inference proceeds as a direct application of power expectation propagation (Power EP) (Minka, 2005) and takes the form of a message passing algorithm. We describe the algorithm in detail below but omit some of the theoretical basis provided in Minka (2005).

First, we can recast the posterior as a product of factors, where each factor is either a dyadic observation or a prior over a latent variable:

p(Θ | Y) ∝ ∏_(i,j) p(y_ij | μ, a_i, b_j, u_i, v_j)    (4)
× ∏_k p(θ_k).    (5)

We use f_ij to index the factor corresponding to the directed dyad (i, j), and f_k for the factor corresponding to the prior over the k-th latent variable θ_k. We index the set of latent variables in our model by k for notational simplicity, for contexts in which the differences between the various latent variables are unimportant. We cast our fully factorized approximation (see equation (3)) in terms of the same factors, such that the approximation to the posterior for θ_k is the product of messages from each of the factors to θ_k:

q(θ_k) ∝ ∏_a m_(a→k)(θ_k).    (6)

Rearranging terms, we can also view the product of messages from a factor as an approximation to that factor:

f̃_a(Θ) ∝ ∏_k m_(a→k)(θ_k).    (7)

Each message can be conceptualized as the contribution of a single factor to the posterior of a single variable. If a variable is not involved with a given factor (e.g. the popularity term of a node that appears in neither role of the dyad), the message is uniform and provides no contribution to the posterior. Under power expectation propagation, inference proceeds by iteratively selecting a single factor f_a and updating the messages from f_a to the relevant variables (μ, a_i, b_j, u_i, and v_j) in order to minimize the local α-divergence of factor f_a, i.e. the divergence between the true factor and its approximation, each taken in the context of the messages from the remaining factors. This local divergence approximates the minimization of the global α-divergence between the posterior and the approximation, under the assumption that the other factors are well approximated by their messages.

As our approximation is a product of Gaussian densities, we take the messages to be unnormalized Gaussian densities, noting that these densities are closed under multiplication; we can implicitly rescale q to a (normalized) Gaussian density after every iteration.
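The closure of Gaussians under multiplication is the workhorse of the message passing scheme: in natural parameters, precisions and precision-weighted means simply add. A minimal sketch for the univariate case:

```python
# Multiplying two univariate Gaussian messages: the (renormalized)
# product of N(m1, v1) and N(m2, v2) is again Gaussian, with
# precision 1/v1 + 1/v2 and precision-weighted means adding.

def multiply_gaussians(m1, v1, m2, v2):
    """Mean and variance of the renormalized product N(m1,v1) * N(m2,v2)."""
    prec = 1.0 / v1 + 1.0 / v2
    mean = (m1 / v1 + m2 / v2) / prec
    return mean, 1.0 / prec

m, v = multiply_gaussians(0.0, 1.0, 2.0, 1.0)
print(m, v)   # equal variances: mean halfway between, variance halved
```

Combining all messages into a marginal q(θ_k), or dividing one message out to form a cavity distribution, amounts to repeated applications (and inversions) of this operation.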

In order to minimize the local α-divergence, the update step for variable θ_k from factor f_a is given by

(8)
(9)
(10)

where proj[·] denotes KL projection onto the family of Gaussian densities (matching the mean and variance of its argument) and a damping factor is used to aid the convergence of the algorithm. Define the cavity distribution q^(\a) as the approximation q with the messages from factor f_a removed, and note that q^(\a) has the form of a Gaussian density, which we can take to be normalized. Then equation (8) can be written as

(11)

One particular strength of the Power EP approach is the ability to choose α such that evaluating the above expectations is tractable. The choice of α also affects the shape of the approximation relative to the posterior. Minka (2005) notes that smaller values of α put greater emphasis on concentrating the mass of q inside higher-density areas of the posterior (as opposed to "covering" the posterior) and can lead to understating the variability in the posterior. For the logistic likelihood, one choice of α is particularly compelling:

(12)
(13)

where the final expectation factors over each term. There are three sets of expectations to evaluate: expectations of exponentiated univariate parameters such as E[exp(t a_i)] (and equivalent expressions for the other univariate parameters), expectations of exponentiated linear forms of the latent factors, and the expectation of the exponentiated interaction E[exp(t u_i'v_j)]. The first two can be evaluated directly from the moment generating functions of the univariate and multivariate Gaussian distributions, and the last can be evaluated using the independence between the distributions over u_i and v_j together with completing-the-square techniques. Using m and v (or m and V where appropriate) to denote the mean and (co)variance of each approximate marginal:

(14)
(15)
(16)
(17)
(18)

Completing the square, used in the last equation, relies on the relevant precision matrix being positive definite in order for the resulting density to be a multivariate normal distribution.

We focus on the update steps for μ and u_i, noting that the symmetry in equation (13) with respect to μ, a_i, and b_j, and similarly between u_i and v_j, implies that the corresponding update steps can be obtained by swapping the positions of the relevant variables. The updates take the form

(19)
(20)

These densities can be represented as a linear combination of two Gaussian densities:

(21)
(22)

where

(23)

and the first and second moments can be calculated from each expression (after normalizing) to derive the corresponding Gaussian parameters.

To recap, our message passing algorithm will proceed as follows:

  1. Initialize all messages

  2. Repeat until convergence of all messages:

    1. Choose a factor f_a

    2. Update the approximation to the posterior via equations (8) and (9). We find damping the updates promising in simulations.

    3. Update messages from this factor via equation (10)

2.3 Case-Control Approximation

The update step for each factor has O(1) computational cost, but each iteration over the entire network has O(N^2) cost and can be prohibitively expensive in large networks. In addition, tracking the messages for each factor requires O(N^2) storage. Drawing inspiration from Raftery et al. (2012), we take advantage of the fact that large networks tend to be sparse, so that iterating over the entire network is computationally inefficient due to the extreme class imbalance: each non-edge contributes relatively little information to the overall model compared to the edges, which are far fewer in number.

We propose iterating over the set of factors with y_ij = 1 (the observed edges) and a random sample of the factors with y_ij = 0 (the non-edges). In practice, drawing a single random sample is preferable to drawing a new sample on each iteration through the data, due to the reduced time observed for the convergence of the algorithm. In a temporal setting, a new sample can be drawn at each time step. Supposing there are E observed edges and a random sample of non-edges of size E_0 is drawn, each iteration over the network costs O(E + E_0) rather than O(N^2). Algorithmically, we treat this sample of factors as if it were the full set of data available. To understand the effect of this choice on the means of our parameter estimates, consider exponentiating both sides of equation (2):

P(y_ij = 1) / P(y_ij = 0) = exp(μ + a_i + b_j + u_i'v_j).    (24)

Intuitively, sampling a random proportion p of the non-edges inflates the odds ratio on the left-hand side by a factor of 1/p. On the right-hand side, μ would shift upwards by log(1/p), but the other parameters would be unaffected. This suggests that a simple post-hoc mean correction suffices to return the parameters to their original scale, although it should be noted that the posterior variance of the latent variables will be larger than if the full dataset were used. We provide some evidence for the efficacy of this case-control approximation in Section 3.

2.4 Temporal Dynamics

Next, we introduce discrete-time dynamics to the bilinear mixed-effects model by allowing each of the latent variables to evolve via a Markov chain. Let θ_k^(t) denote the k-th parameter at time t, with the posterior of θ_k^(t) given the data through time t approximated by

q_t(θ_k^(t)) = N(θ_k^(t); m_k^(t), V_k^(t)).    (25)

Supposing θ_k evolves via the Gaussian random walk

θ_k^(t+1) = θ_k^(t) + w_t,  w_t ~ N(0, W_k),    (26)

the prior for θ_k^(t+1) would be given by

N(θ_k^(t+1); m_k^(t), V_k^(t) + W_k).    (27)

We adopt the adaptive tuning procedure described in McCormick et al. (2012) to determine the amount of additional variation to introduce at each time point. We parametrize this amount of variation via a "forgetting" multiplier c_t ≥ 1, which avoids specifying random walk covariance matrices W_k for the latent factors u_i and v_j:

θ_k^(t+1) | Y^(1), …, Y^(t) ~ N(m_k^(t), c_t V_k^(t)).    (28)

We choose these multipliers based on the average predictive likelihood:

(29)

The integral above cannot be evaluated in closed form, and we use a series of two approximations to estimate it. First, note that the likelihood term primarily involves the sigmoid of the logistic mean function:

(30)
(31)

Recall that the prior over each latent variable is an independent Gaussian. We approximate the distribution of the interaction term with a single Gaussian (by matching its first two moments) in order to model the entire mean function itself with a single Gaussian. Denoting the mean function by θ_ij ~ N(m, v) under this approximation, we use the following approximation for the convolution of a sigmoid and a Gaussian (see Bishop, 2006):

E[σ(θ_ij)] ≈ σ(m / √(1 + πv/8)).    (32)

Lastly, in order to reduce the computational cost involved in maximizing (29), we allow a single forgetting multiplier for μ, a single multiplier for the popularity terms a_i and b_j, and a single multiplier for the latent space terms u_i and v_j, and only evaluate (29) over a coarse grid of values. McCormick et al. (2012) argue that searching over a coarse grid gives results comparable to directly maximizing (29) when running the algorithm over sufficiently many time periods, as periods with unnecessary inflation in prior variance can be balanced against periods with more restrictive inflation. Furthermore, the authors found their results were robust to the choice of grid values. In Sections 3 and 4 we use a coarse grid that includes a multiplier of 1, allowing for no change in a parameter, as well as multipliers corresponding to multiple scales of variance inflation.

2.5 Anomaly Detection

At an edge level, anomaly detection proceeds by scoring edges based on their probabilities of observing activity, under the assumption that any past activity is non-anomalous and thus a good representation of normal behavior. We score the dyad (i, j) at time t via its predictive likelihood:

p(y_ij^(t) | Y^(1), …, Y^(t-1)),    (33)

which can be evaluated via the approximations described in the previous section. Dyads with activity but low predictive scores as well as dyads without activity but high predictive scores would then be flagged as anomalous.

For many settings, we may be interested in detecting anomalies beyond the edge level. For example, in computer networks such as the LANL network described in Section 1, security experts are interested in identifying anomalous subgraphs which may represent intruder attacks. Neil et al. (2013a) mention activity in the shape of k-stars and k-paths as common behavior for intrusions. Note that dyads in these subgraphs consist solely of edges with observed activity, so lower predictive scores are characterized as more anomalous. We can compute scores for these subgraphs from our edge-level scores given a conditional independence assumption, by multiplying the scores of the corresponding edges or, equivalently, summing the log scores.

Figure 1: A directed 3-path.

For example, for the 3-path shown in Figure 1, the score would be given by the sum of the log predictive scores of its three edges.
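The subgraph scoring rule can be sketched in a few lines; the edge probabilities below are illustrative values, not model output:

```python
import math

# Subgraph scoring under the conditional independence assumption: the
# anomaly score of a 3-path i -> j -> k -> l is the sum of the log
# predictive scores of its edges. Probabilities are illustrative.
edge_probs = {
    ("i", "j"): 1e-4,
    ("j", "k"): 5e-4,
    ("k", "l"): 2e-3,
}

path_score = sum(math.log(p) for p in edge_probs.values())
print(path_score)   # more negative => more anomalous
```

Working on the log scale keeps the products of many small probabilities numerically stable and makes subgraph scores directly comparable across shapes with the same number of edges.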

In this paper, we choose to separately consider potentially anomalous behavior for each time period, although combining scores across time may be promising, particularly in settings with fine temporal resolution where attacks may span multiple periods. A fully online detection procedure would proceed at each time step as follows:

  1. Observe network behavior

  2. Tune forgetting multipliers by maximizing (29)

  3. Flag and assess potential anomalous subgraphs

  4. Remove anomalous activity from the observed network

  5. Estimate model parameters
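The five steps above can be sketched as a control-flow skeleton. Everything here is a hypothetical placeholder interface (DummyModel and its methods are not part of the paper); only the loop structure mirrors the procedure in the text:

```python
# Schematic of the online detection loop. DummyModel is a hypothetical
# stand-in for the dynamic bilinear model; only the control flow is
# meaningful. Edges are (label, log_score) pairs for illustration.

class DummyModel:
    """Placeholder interface for the dynamic bilinear mixed-effects model."""
    def tune_forgetting(self, Y):
        pass                                   # step 2: grid search over (29)
    def flag_subgraphs(self, Y, threshold):
        return [e for e in Y if e[1] < threshold]   # step 3: low scores
    def remove(self, Y, flagged):
        return [e for e in Y if e not in flagged]   # step 4
    def update(self, Y):
        pass                                   # step 5: Power EP updates

def detection_loop(stream, model, threshold):
    for Y_t in stream:                         # step 1: observe behavior
        model.tune_forgetting(Y_t)
        flagged = model.flag_subgraphs(Y_t, threshold)
        Y_clean = model.remove(Y_t, flagged)
        model.update(Y_clean)                  # estimate on cleaned data
        yield flagged

stream = [[("a->b", -12.0), ("c->d", -1.0)]]
print(list(detection_loop(stream, DummyModel(), -10.0)))
```

Removing flagged activity before the parameter update (step 4 before step 5) is the detail that matters: it keeps anomalous behavior from being absorbed into the baseline, the failure mode noted for Computer A in Section 4.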

3 Simulation Study

To provide an idea of how well our proposed algorithm can estimate a dynamic bilinear mixed-effects model, we simulate a network following equations (1) and (2), with time dynamics following independent Gaussian random walks (see (26)). Specifically, we generate a network of N = 500 nodes from a bilinear mixed-effects model with latent dimension 2 and the following priors:

(34)
(35)
(36)
(37)
(38)

We evolve the network 99 times for a total of T = 100 periods, where at every time point each parameter follows a Gaussian random walk with (co)variance equal to 0.001 times its prior (co)variance. These prior and random walk values were chosen to create a network roughly similar to the LANL computer network, that is, characterized by high sparsity, strong heterogeneity between nodes, slow temporal dynamics, and strong dependence between time periods. The generated network averages about 2,000 directed connections per time period, or 4 per node, slightly less than what we observe for the LANL network.

We compare results from two runs of the Power EP algorithm described in Section 2.2, one that iterates over all 250,000 potential dyads in the network and another that implements the case-control modification described in Section 2.3. For the latter, we sample 2.5% of the non-edges at every time point for consideration, thus iterating over about 8,200 dyads per time point. This results in about a 93.5% reduction in computation time and a 97% reduction in storage complexity.

We compare mean estimates from each run against the generated values in terms of log-likelihood, area under the receiver operating characteristic curve (ROC AUC), and the correlation between the actual and estimated edge probabilities on the logit scale. This correlation is also calculated restricted to dyads not observed in any of the 100 time periods (72.6% of all dyads). In Figure 2, we compare the model fit of the mean estimates from the Power EP algorithm, with and without the case-control approximation, to the model fit of the true parameter values. In these plots, the log-likelihood and ROC AUC under the generating model provide a soft bound on model performance, as no other set of parameter estimates should systematically outperform them over a prolonged period. The estimates from our variational approach perform very similarly to the generating model, particularly after time period 40, suggesting the approximations used in the variational method have, at most, minor effects on the posterior mean estimates.

Figure 2: Log-likelihood of the observed network and ROC AUC of the edge probabilities. Calculated for three sets of edge probabilities: the actual edge probabilities (black), the edge probabilities estimated via the Power EP algorithm on the full data (red), and the edge probabilities estimated via the Power EP algorithm with the case-control approximation (cyan).

In Figure 3, we plot the correlation between the actual edge probabilities and their estimated counterparts on the logit scale. The performance of the algorithms ramps up over time, as each binary network provides limited information about the underlying latent variables, which must be aggregated across periods; performance largely stabilizes by time 40. The case-control modified algorithm, which iterates over a much smaller subset of the network at each time point, does perform worse than the algorithm run over the full network, but these differences are largest in the earlier time periods.

Figure 3: Correlation between the actual edge probabilities and estimated edge probabilities, both on the logit scale. Results with probabilities from the full data are in red, and results with probabilities from the case-control are in cyan.

Restricting ourselves to results from three time points, we plot the distributions of the estimated edge probabilities (again on the logit scale) against their actual counterparts in Figure 4, and find only minor systematic differences between the distributions. Both sets of estimated probabilities do struggle somewhat with the extreme left tail of probabilities (overestimating them), although these differences are exaggerated by the logit scale (e.g. probabilities of 8.3e-07 and 1.1e-07 are far apart on the logit scale) and may be hard to capture given the time frame of the simulation in comparison to the probability size.

Figure 4: Actual versus estimated edge probabilities on the logit scale. The left column compares the actual edge probabilities to estimates from the full Power EP, while the right column compares them to the edge probabilities estimated via the case-control Power EP.

Lastly, we find the increase in posterior variance of our parameters when adopting the case-control modification to be largely acceptable. Even though we only consider about 3% of the edges in any given time period, this subset of the network appears to capture most of the information for estimating the model parameters.

Time Period    μ      a      b      u      v
T = 10         6.76   1.92   1.97   2.11   2.42
T = 25         1.92   2.47   2.55   2.10   2.32
T = 100        2.24   2.63   2.73   1.54   1.78
Table 1: Multiplier in posterior variance, by parameter, when using the case-control modification to the Power EP algorithm. For node-specific parameters, the multiplier in variance is calculated for each node and averaged.

4 LANL Netflow Event Data

We demonstrate the potential for attack detection with Netflow communications data on the LANL enterprise network (Turcotte et al., 2018). Event records correspond to directed communication between two network devices and span a total of 89 days. We restrict to the sub-network of the computers with some record of outgoing communications over the 89 days, and aggregate the event records into four-hour intervals, yielding a total of 534 time periods. These computers comprise the set of network devices which may be the source of malicious behavior. We focus on modeling the presence of any directed network activity within each dyad in each four-hour interval. The resulting network averages about 150,000 directed edges (or 5.5 outgoing edges per computer) in each time interval, and there is substantial variation in activity levels based on time-of-day and day-of-week.

The LANL data contain a red team attack in the form of a network scanning attack from "Computer A" that begins on day 57, and we are interested in the ability of our model to recognize this activity as anomalous. Following Neil et al. (2013a), we detect potentially anomalous subgraphs of the three shapes presented in Figure 5, corresponding to common intrusion patterns. While malicious attacks may involve more nodes and activity, detecting a single subgraph involved in the attack may suffice to identify the entire attack upon further (manual) examination.

Figure 5: A 3-path (left), a 3-star (right), and a “fork” (center) representing a combination of the two.

Note the detection procedure described in this section deviates from the typical online setting since details of the red team attack were only obtained after model estimation. Flagging potential anomalies only occurs after model estimation, and the estimated probabilities used in this process assume all preceding network activity was non-anomalous.

We estimate the bilinear mixed-effects model using the Power EP approach described in Section 2.2 with the case-control approximation of Section 2.3, taking a sample of the non-edges of average size 500,000 (corresponding to a case-control rate of roughly 3.3 non-edges per edge). We slightly modify the bilinear mixed-effects model to incorporate time-of-day and day-of-week effects in the form of mean shifts, with individual terms for each time-of-day and day-of-week pair calculated directly from the mean activity levels over the 89 days. A slightly more sophisticated approach would be to include these as additional parameters to estimate in the model. This would allow the effects to change naturally over time, although there is little evidence for any such changes in the observed data. We choose to model these terms separately from the overall popularity term in order to prevent the dynamics of that parameter from being governed by periodicity effects rather than random walk behavior (Heard et al., 2014). Note that part of the periodicity effects may be due to recurrent, automated tasks (e.g. weekly at a certain time of day), so allowing for more complicated periodicity effects or removing these activities before estimation (if they are labeled or can be identified a priori) would likely improve model fit.

Before turning to anomaly detection, we assess how well the bilinear mixed-effects model and the popularity model (omitting the latent interaction terms) are able to predict LANL network activity. Figure 6 plots the area under the receiver operating characteristic curve (ROC AUC) of both models, calculated using probabilities from the predictive likelihood (33), which primarily depend on the estimated parameters from the previous time period. Both models perform quite well, with AUC around 0.995, suggesting the network communications data are inherently very structured and predictable. Model performance exhibits both time-of-day and day-of-week periodicity, suggesting a more complex approach to modeling periodicity would likely improve performance, albeit performance is consistently high despite these effects. The latent interaction terms in the bilinear mixed-effects model do substantially improve performance, with the improvements primarily driven by higher probabilities for active dyads with generally low overall popularity in the network.

Figure 6: Averaged AUC ROC for popularity model (red) and bilinear mixed-effects model (cyan) over days. AUC ROC is calculated for each 4-hour interval using probabilities derived from the predictive likelihood and results are averaged across each day.
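The per-interval AUC ROC underlying Figure 6 can be computed from the predictive probabilities and observed edge indicators via the rank-sum (Mann-Whitney) characterization of AUC: the probability that a randomly chosen observed dyad receives a higher predicted probability than a randomly chosen non-edge. A dependency-free sketch (function name ours; a library routine would be used in practice):

```python
def auc_roc(scores, labels):
    """AUC ROC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs in which the positive outscores the
    negative, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In the setting above, `scores` would hold predictive probabilities for the dyads in one 4-hour interval and `labels` the observed edge indicators; the six interval-level AUCs are then averaged within each day. (The nested loop is quadratic in the number of dyads; a sorting-based implementation would be used at scale.)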

We compute anomaly scores for subgraphs of the types shown in Figure 5 by summing the log probabilities of their edges, as described in Section 2.5. Despite the relative sparsity of the network, the number of subgraphs to consider in each time frame remains quite large. To further reduce the number of subgraphs under consideration, we examine only subgraphs consisting of edges with a log probability score of -10 or lower, and we remove some “overlapping” subgraphs: specifically, 3-paths with the same middle edge, forks with the same node, and 3-stars with the same node. Ideally, detecting one anomaly from multiple overlapping subgraphs would suffice to find the entire attack. In Figure 7, we plot the 200 most anomalous subgraphs under each model. Both sets of subgraphs contain a single 3-star (highlighted in red) involving the network scanning attack from Computer A on the first day of the red team attack. The Computer A 3-star ranks twice as high under the bilinear mixed-effects model, where this anomaly would be detectable at an average alarm rate of one subgraph per day. The difference in rank is mainly attributable to the popularity model assigning low scores to recurrent activity between low-popularity computers, which it is not flexible enough to capture; this produces low scores for certain non-anomalous subgraphs, obscuring the actual attack. Note that our detection procedure is largely unable to identify the other attacks from Computer A on the following days: once anomalous behavior such as the activity on day 57 is incorporated into the model of normal activity, subsequent attacks appear to be normal behavior.

Figure 7: Anomaly scores for the 200 lowest scoring subgraphs observed over the 89 day period. The red line corresponds to the rank of a subgraph containing part of a red team attack on day 57.
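For the 3-star case, the scoring and deduplication steps above can be sketched as follows. This is an illustrative simplification under assumed inputs (a dict of per-edge log probabilities; function name ours), not the paper's implementation: it filters edges at the -10 threshold, keeps at most one 3-star per hub node, and ranks stars by the sum of their three lowest edge log probabilities.

```python
def anomalous_3stars(edge_logprobs, threshold=-10.0, top=200):
    """Score out-star (3-star) subgraphs by summing edge log
    probabilities, keeping only edges at or below `threshold` and a
    single 3-star per hub node.

    edge_logprobs: dict mapping (sender, receiver) -> log probability
    of the observed edge under the fitted model (hypothetical format).
    Returns up to `top` (score, hub) pairs, most anomalous first."""
    # keep only "surprising" edges, grouped by sender (candidate hub)
    by_hub = {}
    for (u, v), lp in edge_logprobs.items():
        if lp <= threshold:
            by_hub.setdefault(u, []).append(lp)
    stars = []
    for hub, lps in by_hub.items():
        if len(lps) >= 3:
            # one star per hub: its three lowest-scoring out-edges
            stars.append((sum(sorted(lps)[:3]), hub))
    return sorted(stars)[:top]
```

Retaining one star per hub mirrors the overlap-removal rule above: a scanning host like Computer A generates many overlapping 3-stars, and a single alarm on any of them is enough to surface the attack.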

5 Discussion

In this paper, we present a variational approach for estimating the bilinear mixed-effects model. We adapt our approach to a dynamic, large-network setting via a case-control approximation and an autotuning procedure that mimics Gaussian random walks on the model parameters. We demonstrate the efficacy of our algorithm via a simulation study on nodes, estimate the mixed-effects model on the LANL netflow communications network involving over 25,000 computers over a period of 89 days, and detect a red team attack on the same network while requiring only half the detection rate of the popularity model.

A natural extension of the bilinear mixed-effects model considered here would be to allow for node- or edge-level covariates in the specification of the mean function. Relational event data often comes with additional details that may be useful, in conjunction with network-based predictors, for predicting activity. Along a similar line of reasoning, edge-level covariate data may distinguish between multiple types of network activity that we may wish to model jointly. In particular, the LANL netflow data includes sender port information, and utilizing this information would help distinguish between typical activity on a commonly used port and unusual activity on a rarely used port. Lastly, adapting the algorithm to handle different outcome measures may allow for a more faithful representation of the observed event data. For example, acknowledging the data’s continuous-time nature, we could model the communication activity using Poisson processes with time-dependent intensities in order to exploit detailed information about the timing of expected activity.
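To make the final extension concrete, the log-likelihood of an inhomogeneous Poisson process on an interval is the sum of log-intensities at the observed event times minus the integral of the intensity over the interval. A minimal sketch (function name and numerical quadrature are our choices, not part of the proposed method):

```python
import math


def poisson_loglik(event_times, intensity, t0, t1, grid=1000):
    """Log-likelihood of an inhomogeneous Poisson process on [t0, t1]:
    sum_i log(intensity(t_i)) - integral of intensity over [t0, t1],
    with the integral approximated by the midpoint rule on `grid` cells."""
    log_terms = sum(math.log(intensity(t)) for t in event_times)
    dt = (t1 - t0) / grid
    integral = sum(intensity(t0 + (k + 0.5) * dt) for k in range(grid)) * dt
    return log_terms - integral
```

A periodic `intensity` function would directly encode the time-of-day and day-of-week effects discussed above, and per-dyad anomaly scores could again be read off as low log-likelihoods.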

References

  • Ahmed et al. (2016) M. Ahmed, A. N. Mahmood, and J. Hu. A survey of network anomaly detection techniques. Journal of Network and Computer Applications, 60:19–31, 2016.
  • Bishop (2006) C. M. Bishop. Pattern recognition and machine learning. Springer, New York, 2006.
  • Durante and Dunson (2016) D. Durante and D. B. Dunson. Locally adaptive dynamic networks. The Annals of Applied Statistics, 10(4):2203–2232, 2016.
  • Heard et al. (2014) N. Heard, P. Rubin-Delanchy, and D. J. Lawson. Filtering automated polling traffic in computer network flow data. In 2014 IEEE Joint Intelligence and Security Informatics Conference, pages 268–271. IEEE, 2014.
  • Hoff et al. (2002) P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.
  • Hoff (2005) P. D. Hoff. Bilinear mixed effects models for dyadic data. Journal of the American Statistical Association, 100(469):286–295, 2005.
  • McCormick et al. (2012) T. H. McCormick, A. E. Raftery, D. Madigan, and R. S. Burd. Dynamic logistic regression and dynamic model averaging for binary classification. Biometrics, 68(1):23–30, 2012.
  • Metelli and Heard (2016) S. Metelli and N. Heard. Model-based clustering and new edge modelling in large computer networks. In 2016 IEEE Conference on Intelligence and Security Informatics, pages 91–96, 2016.
  • Minka (2005) T. P. Minka. Divergence measures and message passing. Technical report, Microsoft, 2005.
  • Neil et al. (2013a) J. Neil, C. Hash, A. Brugh, M. Fisk, and C. B. Storlie. Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4):403–414, 2013.
  • Neil et al. (2013b) J. C. Neil, B. D. Uphoff, C. B. Storlie, and C. L. J. Hash. Towards improved detection of attackers in computer networks: New edges, fast updating, and host agents. In 2013 6th International Symposium on Resilient Control Systems, 2013.
  • Raftery et al. (2012) A. E. Raftery, X. Niu, P. D. Hoff, and K. Y. Yeung. Fast inference for the latent space network model using a case-control approximate likelihood. Journal of Computational and Graphical Statistics, 21(4):901–919, 2012.
  • Salter-Townshend and Murphy (2013) M. Salter-Townshend and T. B. Murphy. Variational Bayesian inference for the latent position cluster model for network data. Computational Statistics and Data Analysis, 57(1):661–671, 2013.
  • Sewell and Chen (2015) D. K. Sewell and Y. Chen. Latent space models for dynamic networks. Journal of the American Statistical Association, 110(512):1646–1657, 2015.
  • Sewell and Chen (2017) D. K. Sewell and Y. Chen. Latent space approaches to community detection in dynamic networks. Bayesian Analysis, 12(2):351–377, 2017.
  • Turcotte et al. (2018) M. J. M. Turcotte, A. D. Kent, and C. Hash. Unified host and network data set. In N. Heard, N. Adams, P. Rubin-Delanchy, and M. Turcotte, editors, Data Science for Cyber-Security, chapter 1, pages 1–22. World Scientific, London, 2018.