Bayesian Models Applied to Cyber Security Anomaly Detection Problems

03/23/2020
by   José A. Perusquía, et al.
University of Kent

Nowadays cyber security is an important concern for all individuals, organisations and governments globally. Cyber attacks have become more sophisticated, frequent and dangerous than ever, and traditional anomaly detection methods have proven to be less effective when dealing with these new classes of cyber attacks. In order to address this, both classical and Bayesian statistical models offer a valid and innovative alternative to the traditional signature-based methods, motivating the increasing interest in statistical research that has been observed in recent years. In this review paper we provide a description of some typical cyber security challenges and the kind of data involved, paying special attention to Bayesian approaches for these problems.


1 Introduction

Cyber security can be broadly defined as the set of tasks and procedures required to defend computers and individuals from malicious attacks. Cyber security's origin can be traced back to 1971, a time when the Internet as we know it today had not even been born. Amongst the computer science community it is widely accepted that it all started with Bob Thomas and his harmless experimental computer program known as the Creeper. This program was designed to move through the ARPANET leaving the following message: "I'm the creeper: catch me if you can". Inspired by Bob Thomas' Creeper, Ray Tomlinson created an enhanced version, allowing the Creeper to self-replicate, thereby coding the first computer worm. Later on, he would also design the Reaper, which can be considered the first antivirus system, since it was designed to move across the ARPANET and delete the Creeper.

Despite a harmless origin, some years later the world would find out that network breaches and malicious activity were more dangerous than expected, and cyber threats became a serious matter. Nowadays, cyber security is considered a serious concern that affects people, organisations and governments alike, due not only to the growth of computer networks and Internet usage but also to the fact that cyber attacks are more sophisticated and frequent than ever. These cyber attacks represent a complex new challenge that demands more innovative solutions and hence a multi-disciplinary effort in order to be well-prepared and protected against such threats. Some of the disciplines involved in this task include computer science, computer and network architecture and statistics (Adams and Heard, 2014).

In this paper we are mainly interested in the statistical approach to cyber security problems. Therefore, a good starting point is to define key areas of cyber security research and the mathematical challenges faced in each one of them. There are many reports that outline and describe these areas and challenges, and we believe that the ones described by Dunlavy et al. (2009) are appropriate for the purpose of this paper. We will consider the following three main areas of cyber security research: modelling large-scale networks, discovering cyber threats, and network dynamics and cyber attacks. The reader should note that these areas are not mutually exclusive, since all of them are connected by the potential malicious attacks and thereby by the research done in each one of them.

At this moment the reader should have noticed that at least two of the research areas will necessarily require the study of a computer network. Their mathematical representation (just as with any other network) is done through a graph theory formulation (Newman, 2010). Historically, the Erdős–Rényi formulation of random graphs provided a mathematical model that could handle small-scale networks like the infant Internet (Chen et al., 2015). However, the Internet and almost any computer network as we know them today are examples of large-scale dynamic networks. These kinds of networks have a large number of nodes and edges evolving and changing over time in ways that are not completely random. Hence, there is a need to develop more sophisticated mathematical formulations of networks and, with them, new statistical techniques for comparing them. The reader could direct the attention to Olding and Wolfe (2014) for a review on classical graph theory methods applied to modern network data.

A second challenge we face when dealing with cyber security problems is discovering cyber threats. As said before, cyber attacks are more sophisticated and frequent than ever; hence the need for models capable of detecting malicious code and their variations, complicated multi-stage attacks and, if possible, the source of such malicious code (Dunlavy et al., 2009). Moreover, the detection methods should ideally be designed for on-line detection and be able to handle time-evolving data as well. This is the area we will be exploring in more depth in this review paper.

Finally, and due to the fact that almost all cyber attacks work by spreading malicious code through a vast number of the computer network's nodes, the last of the mathematical challenges found in cyber security is related to network dynamics and cyber attacks. This area is mainly dedicated to understanding the spreading characteristics of malicious code through a computer network, both before and after it has been detected and protections against it have been released. Some of the particular problems consist in determining the potential limit of the infection and the interplay between the malicious spreading and the protection processes. More details and issues can be found in Dunlavy et al. (2009).

From the brief description given of the three main research areas, it can be easily argued that one of the main objectives of cyber security is to be well-prepared with good explanatory and predictive models. In order to do so, there is a need for real-world data that preserves its integrity over time (Meza et al., 2009). Fortunately, there are some public and available data sets that can be used for research purposes like Microsoft’s malware data set (Ronen et al., 2018), Los Alamos National Laboratory’s data sets (Hagberg et al., 2014; Kent, 2015; Turcotte et al., 2018) and Palo Alto Networks and Shodan data sets (Amit et al., 2018).

However, having access to data is just the beginning. We also need to consider a challenge related to the inherent nature of the data used nowadays in cyber security research. As established before, there is a need to design on-line detection models; however, cyber security data is usually a high-dimensional object, so the task becomes handling high volumes of data in order to detect anomalies in real time with high accuracy and a low false positive rate. The combination of the volume and time requirements yields a significant computational challenge.

In this review we are mainly interested in the research area of anomaly detection. Cyber security threat detection systems have traditionally been built around signature-based methods; in this approach, large data sets of signatures of known malicious threats are developed and the network is constantly monitored for appearances of such signatures. These systems have proven effective for known threats but can be slow or ineffective when dealing with new ones. An example of this is the blacklisting approach used by commercial antivirus software, where if a signature is found the program is disabled; however, new malicious code can be created with slight changes to the original in order to avoid recognition (McGraw and Morrisett, 2000).

Detecting new malicious code or mutations of known ones and dealing with time-evolving threats are some of the reasons why we need to consider alternatives to signature-based methods. In order to do so, Statistics offers a wide range of options for cyber security problems; these include both classical and Bayesian approaches and, in general, can be built on either parametric or nonparametric assumptions. In this review paper we are going to centre our attention on how Bayesian models have been applied to cyber threat detection. For other approaches to anomaly detection and cyber security, such as classical statistics, machine learning and data mining, the reader could direct his or her attention to the recent reviews by Buczak and Guven (2016), Chandola et al. (2009), Gupta et al. (2014) and Adams and Heard (2014).

The remainder of the paper is organised as follows: Section 2 is mainly dedicated to the understanding of some particular problems within the cyber threat discovery framework and the data used for each one of them. Section 3 explores how Bayesian models have been used to address the problems found in Section 2. Lastly, Section 4 presents final observations and conclusions.

2 Cyber threat discovery problems

As anticipated, in this paper we discuss how the discovery of cyber threats can be considered an anomaly detection problem. Statistical anomaly detection methods usually build a model of normal behaviour to be used as a benchmark of reference, so that departures from this behaviour might be an indication that an anomaly has occurred. For our purposes we are going to consider three different classes of anomaly detection problems within cyber security research. The first one deals with volume-traffic anomaly detection, the second one deals with network anomaly detection and the third one is about malware detection and classification. In the following sections we give a gentle introduction to each of these problems and describe the kind of data used in each one.

2.1 Volume-traffic anomaly detection

The first thing we need to establish in this section is how a computer network works. One of the most common ways to do it is with the Open Systems Interconnect (OSI) Reference Model. The OSI-model is a conceptual model that provides a general set of rules for computer systems to be able to communicate with one another. This model uses seven layers to explain this communication process. These layers are: the physical layer, the data link layer, the network layer, the transport layer, the session layer, the presentation layer and the application layer. For a complete understanding of the above layers the reader can direct the attention to Myhre (2001) and Hall (2000).

These layers work in conjunction with one another to ensure correct and reliable transmission of information. As such, malicious activity could be targeted to any of the layers in order to destabilise the communication process between computer systems. For the purposes of this section we shall restrict our attention to the third layer of the OSI-model: the network layer. This layer is in charge of structuring and managing a multi-edge network including addressing, routing and traffic control (Hall, 2000). The data is transmitted by breaking it down into pieces called packets. These packets consist of the control information and the user data, the latter commonly known as payload. The control information provides data for delivering the payload: e.g., source and destination network addresses, error detection codes and segment information.

Every file is divided into packets in order to be routed from one point of the network to another, and the malicious code is no exception. One way to detect such malicious activity is through the constant surveillance of the traffic’s volume characteristics, such as the packet rate (the number of packets per time unit), the bit/byte rate or the number of events per time unit. Volume-traffic data can be useful in order to detect anomalies such as Distributed Denial of Service (DDoS) attacks, which are attacks intended to saturate the victims’ network with traffic.

Volume-traffic data sets can be obtained upon request from the Los Angeles Network Data Exchange and Repository (LANDER) project. Polunchenko et al. (2012) analysed the Internet Control Message Protocol (ICMP) reflector attack using the packet rate and the bit rate by modelling the normal traffic behaviour and the traffic data during the attack. The model they presented is defined as a $\mathcal{N}(\mu_0, \sigma_0^2)$-to-$\mathcal{N}(\mu_1, \sigma_1^2)$ change-point model, where the first Gaussian component describes the packet rate's behaviour before the attack and the second one the packet rate's behaviour during the attack. The use of this particular model is due to the fact that research has shown that the number of packets transmitted per unit of time can be modelled by a Poisson process (see: Cao et al. (2003), Karagiannis et al. (2004) and Vishwanath et al. (2009)) with a usually large mean parameter for the Poisson distribution; therefore, it can be well-approximated by a Gaussian distribution with the mean equal to the variance (i.e., $\sigma^2 = \mu$). Should we observe overdispersion (or underdispersion), so that the Poisson approximation does not hold, allowing $\sigma^2$ to be an extra parameter yields a more flexible and robust model. A simulation, with the proposed model and fitted parameters, of the packet rate before and during an attack (which takes place at time 101 and lasts 240 units of time) can be seen in Figure 1.

Figure 1: Packet rate simulation from a $\mathcal{N}(\mu_0, \sigma_0^2)$-to-$\mathcal{N}(\mu_1, \sigma_1^2)$ change-point model with fitted parameters.
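
A simulation of this kind can be reproduced in a few lines of code. The sketch below uses illustrative parameter values (not the fitted values reported by Polunchenko et al. (2012)) and draws the packet rate from the pre-change Gaussian outside the attack window and from the post-change Gaussian inside it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not the fitted values from the paper):
# pre-change rate ~ N(mu0, sigma0^2), rate during the attack ~ N(mu1, sigma1^2).
mu0, sigma0 = 1000.0, np.sqrt(1000.0)   # Poisson-like: variance equal to the mean
mu1, sigma1 = 1500.0, np.sqrt(2500.0)   # extra variance parameter allows overdispersion

change_point, attack_length, horizon = 101, 240, 500

rate = np.empty(horizon)
attack = np.zeros(horizon, dtype=bool)
attack[change_point:change_point + attack_length] = True

rate[~attack] = rng.normal(mu0, sigma0, size=(~attack).sum())   # normal traffic
rate[attack] = rng.normal(mu1, sigma1, size=attack.sum())       # traffic under attack

print(rate[:5], rate[change_point:change_point + 5])
```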

Another freely available network flow data set is described in Kent (2015). The downloadable file "flows.txt.gz" contains network flow events from 58 consecutive days within Los Alamos National Laboratory's corporate internal computer network; each event is characterised by 9 variables: time, duration, source computer, source port, destination computer, destination port, protocol, packet count and byte count. The first three events included in the file are reported, as an illustration, in Table 1.

time duration source comp. source port dest. comp. dest. port prot. packet count byte count
1 0 C1065 389 C3799 N10451 6 10 5323
1 0 C1423 N1136 C1707 N1 6 5 847
1 0 C1423 N1142 C1707 N1 6 5 847
Table 1: Extract from the network flow events (Los Alamos National Laboratory).

For this data set and for volume-traffic anomaly detection purposes one could be particularly interested in the packet count and/or byte count variables, the number of events per time unit or the duration of each event. Moreover, since for each event we have the source and destination computer, we could also be interested in analysing the isolated traffic behaviour for each source and/or destination. Doing so might be particularly useful if, for example, there is a suspicion of a rogue user or if there is a particularly sensitive and important computer that should not be involved in more than a certain number of events. A final alternative could be to develop multi-channel detectors, i.e., detection procedures that analyse the network traffic characteristics by splitting them into separate bins, which could be defined by source or destination.
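
For illustration, the sketch below aggregates the flow events into per-source packet counts per time unit, the kind of series a volume-traffic or multi-channel detector would monitor. The file name, the assumption that the file is comma-separated, and the column names are conveniences chosen here, not part of the original description.

```python
import pandas as pd

cols = ["time", "duration", "source_computer", "source_port",
        "destination_computer", "destination_port", "protocol",
        "packet_count", "byte_count"]

# Assumed location and separator for the LANL flow events file.
flows = pd.read_csv("flows.txt.gz", names=cols)

# Packet volume per time unit for the whole network ...
network_rate = flows.groupby("time")["packet_count"].sum()

# ... and one separate "channel" per source computer, as in a multi-channel detector.
per_source_rate = (flows.groupby(["source_computer", "time"])["packet_count"]
                        .sum()
                        .unstack(fill_value=0))
print(per_source_rate.head())
```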

2.2 Network anomaly detection

Although volume-traffic data is quite useful for detecting cyber attacks that create changes in the volume-traffic characteristics (Figure 1), there are more variables that we could consider for network anomaly detection, such as the source/destination computer, the protocol, and more (Table 1). The anomaly detection procedures for this kind of data differ from those intended for volume-traffic data. As we describe later, volume-traffic anomaly detection is carried out through change-point theory (Section 3.1), whereas the analysis of network data (as previously described) is mainly done through graph theory formulations and classification models (Section 3.2).

Some of the statistical models used for network anomaly detection have also been used for the analysis of computer network behaviour; therefore, we also consider it as part of this section. For the analysis of computer network behaviour we might be particularly interested in modelling the connections between computer network components from a graph theory perspective, in the sense that we create a normal pattern of connections in order to analyse the appearance of new edges within the computer network. This kind of modelling is helpful when dealing with compromised credentials, intrusion detection and DDoS attacks.

One commonly used data set for modelling computer network behaviour belongs to Los Alamos National Laboratory (Hagberg et al., 2014). This data set encompasses 9 months of successful authentication events for a total of 708,304,516 connections. The first few events are shown in Table 2, while other similar data sets are available and described in Kent (2015) and Turcotte et al. (2018).

time user computer
1 U1 C1
1 U1 C2
2 U2 C3
3 U3 C4
6 U4 C5
7 U4 C5
Table 2: User-computer authentications associations in time.

2.3 Malware detection and classification

The malware detection and classification problem is the third group of cyber threat discovery problems that we are going to consider. Malware is defined as software specifically designed to disrupt, damage or gain access to a computer system. Nowadays there are many types, e.g., spyware, adware, ransomware, etc., and many variations of them. That is why the fast detection of unknown malware is one of the biggest concerns of cyber security. However, detection is not the only task required when dealing with malicious software. Malware has to be classified into families for a better understanding of how it infects computers and of its threat level and, therefore, of how to be protected against it. Correct classification of new malware into known families may also speed up the process of reverse-engineering to fix computer systems that were infected.

In order to have good explanatory and predictive models for the malware detection and classification problem, researchers have mainly used the content of the malware in two ways. The first approach requires us to introduce a structure called an $n$-gram. An $n$-gram is usually defined as a contiguous sequence of $n$ elements; these elements, depending on the field (e.g., computer science, linguistics, probability), can be letters, numbers, words, etc. In computer science the byte is the basic unit of information for storage and processing and it is most commonly represented by a sequence of 8 binary digits (bits). Every instruction given to a computer (malicious or not) can be broken down into sequences of bytes, which form the instruction's binary code. In the malware classification and detection problem we will use as elements the hexadecimal representation of the malware binary code, i.e., each byte is going to be written as a combination of two elements of $\{0, 1, \ldots, 9, A, B, \ldots, F\}$.

For example, considering $n = 2$, a code extract given by the (purely illustrative) byte sequence 8A 0B FF 1C is mapped to the set of $n$-grams $\{8A0B, 0BFF, FF1C\}$. The elements of the set of all the different $n$-grams are assumed to completely characterise benign and malicious code through their sole presence or absence.
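
As a minimal sketch of this representation (the byte sequence below is a made-up example, not taken from any malware corpus), the following function maps a hexadecimal byte sequence to the set of its $n$-grams.

```python
def hex_ngrams(hex_bytes, n=2):
    """Return the set of n-grams (n consecutive bytes) of a hexadecimal byte sequence."""
    return {"".join(hex_bytes[i:i + n]) for i in range(len(hex_bytes) - n + 1)}

# Hypothetical extract of a program's hexadecimal representation.
code = ["8A", "0B", "FF", "1C", "0B", "FF"]

print(hex_ngrams(code, n=2))   # {'8A0B', '0BFF', 'FF1C', '1C0B'}
```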

The second approach is to use the dynamic traces of the malware, which is basically the sequence of instructions executed by the malicious code in order to infect the system (for example, a sequence of low-level instructions such as mov, push and call).

Due to the possibility of thousands of different instructions, researchers have first categorised them into different groups of common instructions. These groups and the transitions amongst them are usually modelled as a Markov chain that allows us to tell whether a computer program is benign or not. In order to do so, some models require the introduction of a latent maliciousness indicator, which can be modelled using a logistic regression model or whose posterior probability determines whether the program is malicious or not. A second approach, which will be described in Section 3.3.2, assumes a time-evolving Markovian structure in the dynamic traces, and the classification is done based on a similarity measure of the change-points and their respective regimes.

3 Bayesian statistics for cyber threat discovery

Now that we have defined the kind of problems we are interested in and the kind of data found in each one, in this section we give a general rather than exhaustive description of some of the Bayesian models that have been applied to cyber security anomaly detection. Moreover, whenever necessary, we also provide an insight into the classical formulation of the problem at hand and how it compares to its Bayesian counterpart.

3.1 Volume-traffic anomaly detection

As seen in Section 2, for volume-traffic anomaly detection we are going to be dealing with cyber attacks that produce a change in the volume-traffic characteristics, such as the packet rate. The main goal is to detect changes from the normal behaviour as fast as possible. Once a change has been detected an alarm needs to be set off so that the system can be checked and a decision made as to whether there has been an attack or whether it was a false alarm. Since false alarms could yield important interruptions in the computer network, the detection procedure also needs to achieve a low false positive rate. This yields a tradeoff between the detection delay and the false alarm rate that we need to consider for the detection procedure. The methods used to analyse these kinds of data sets are mainly based on the statistical theory of change-point analysis.

3.1.1 Change-point analysis

The main objective of change-point analysis is the accurate detection of changes in a process or system that occur at unknown moments in time. For a single change-point setting we assume that there is a sequence of random variables $X_1, X_2, \ldots$ (not necessarily independent nor identically distributed (iid)) such that

$$X_t \sim f_0, \qquad t < \nu,$$

where $\nu$ is unknown and $f_0$ is known as the pre-change density. Then, at the unknown time $\nu$ something unusual occurs and from the time $\nu$ onwards

$$X_t \sim f_1, \qquad t \geq \nu.$$

In this setting $\nu$ is known as the change-point and $f_1$ is called the post-change density. It is important to remark that, theoretically, the densities $f_0$ and $f_1$ might depend on $t$ and $\nu$ as well, that is,

$$X_t \sim f_0^{(t)} \text{ for } t < \nu \qquad \text{and} \qquad X_t \sim f_1^{(t, \nu)} \text{ for } t \geq \nu.$$

Allowing these densities to depend on $t$ and $\nu$ might help us explain in a more realistic way time-evolving data such as the one found whilst doing cyber security research. Another important observation is that the post-change density $f_1$ might only be known up to some unknown parameter or vector of parameters $\theta$. Therefore, in some applications the problem can usually be reduced to detecting changes in the mean, in the variance (or in both) of the distributions. An example can be found in Polunchenko et al. (2012), where a $\mathcal{N}(\mu_0, \sigma_0^2)$ model and a $\mathcal{N}(\mu_1, \sigma_1^2)$ model are introduced for the pre-change and post-change packet rate, respectively.

There are mainly two approaches to change-point detection problems: the first is non-sequential and the second is sequential (both have been studied from classical and Bayesian perspectives). Due to the fact that cyber security problems require constant surveillance of the computer network, there is a need for fast on-line detection procedures; therefore, we will be more interested in the sequential approach to change-point analysis. For a complete and thorough overview of the state-of-the-art in this area the reader can direct the attention to Polunchenko and Tartakovsky (2011).

3.1.2 Sequential change-point analysis

The objective of sequential change-point analysis is to decide, after each new observation, whether the observations' common probability density function (pdf) is currently $f_0$ or not. One of the main issues is that the change should be detected with as few observations as possible past the true change-point, which must be balanced against the risk of false alarms. In order to address this issue we must reach a tradeoff between the loss associated with the detection delay and the loss associated with false alarms (Polunchenko et al., 2012). It is considered that a good sequential detection procedure must minimise the average detection delay subject to a constraint on the false alarm rate (FAR).

There have been several approaches to analyse the tradeoff between early detection and the number of false alarms. As described by Tartakovsky (2014) there is a minimax formulation, where $\nu$ is unknown (but not random), and a Bayesian formulation, where $\nu$ is considered a random variable with an appropriate prior distribution assigned. From a statistical perspective we are going to be testing at each time $t$ the following hypotheses:

$$H_0 : \nu > t \qquad \text{versus} \qquad H_1 : \nu \leq t.$$

Once we have a detection statistic based on the likelihood ratio, given by

$$\mathcal{L}_t = \frac{f_1(X_t)}{f_0(X_t)},$$

we supply it to an appropriate sequential detection procedure, i.e., a stopping time $T$ with respect to the natural filtration $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$. For example, in the classical setting the cumulative sum control chart (or simply CUSUM) statistic is widely used, whereas in the Bayesian approach the most commonly used one is the Shiryaev-Roberts (SR) statistic. Once a realisation of the detection procedure takes place, let's say $T = t$, we have a false alarm if $t < \nu$; otherwise, the detection delay is given by the random variable $T - \nu$. For a more comprehensive and exhaustive review of both the minimax and the Bayesian formulation the reader could direct the attention to Tartakovsky (2014).
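
To make the two detection statistics concrete, the sketch below implements the standard recursive forms of the CUSUM and Shiryaev-Roberts statistics for known Gaussian pre- and post-change densities; the densities, thresholds and data are illustrative assumptions rather than values taken from the works cited above.

```python
import numpy as np
from scipy.stats import norm

def log_lr(x, mu0, sigma0, mu1, sigma1):
    """Log-likelihood ratio log f1(x)/f0(x) for Gaussian pre/post-change densities."""
    return norm.logpdf(x, mu1, sigma1) - norm.logpdf(x, mu0, sigma0)

def cusum(xs, mu0, sigma0, mu1, sigma1, threshold):
    """CUSUM recursion W_t = max(0, W_{t-1} + log LR_t); alarm when it exceeds the threshold."""
    w = 0.0
    for t, x in enumerate(xs, start=1):
        w = max(0.0, w + log_lr(x, mu0, sigma0, mu1, sigma1))
        if w >= threshold:
            return t
    return None

def shiryaev_roberts(xs, mu0, sigma0, mu1, sigma1, threshold):
    """Shiryaev-Roberts recursion R_t = (1 + R_{t-1}) * LR_t; alarm when it exceeds the threshold."""
    r = 0.0
    for t, x in enumerate(xs, start=1):
        r = (1.0 + r) * np.exp(log_lr(x, mu0, sigma0, mu1, sigma1))
        if r >= threshold:
            return t
    return None

# Illustrative run: change from N(1000, 1000) to N(1500, 2500) at time 101.
rng = np.random.default_rng(1)
xs = np.concatenate([rng.normal(1000, np.sqrt(1000), 100),
                     rng.normal(1500, np.sqrt(2500), 100)])
print(cusum(xs, 1000, np.sqrt(1000), 1500, np.sqrt(2500), threshold=10.0))
print(shiryaev_roberts(xs, 1000, np.sqrt(1000), 1500, np.sqrt(2500), threshold=1e4))
```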

In matters of network security the application of change-point theory is quite straightforward. However, there are some important observations we need to consider. Tartakovsky (2014) argues that the behaviour of both pre- and post-attack traffic is poorly understood; as a result, neither the pre- nor the post-change distribution is known. Therefore, it is suggested that we cannot rely on the likelihood ratio at all and we might think of using score-based statistics instead. The statistics proposed are suitable modifications of the CUSUM and SR statistics that include a score-sensitive function. The way in which we select this score function will depend on the type of change we are trying to detect. For example, for a change in mean Tartakovsky et al. (2006a, b) propose a linear memoryless score function.

Another important observation due to Tartakovsky et al. (2006a, b) is that, in certain conditions, splitting packets into bins and considering multichannel detectors helps localise and detect attacks more quickly. This multichannel setting can be thought of as a generalisation of the classical change-point detection problem, where we assume an $N$-dimensional stochastic process that is observed simultaneously and where, at a random time, only one of the entries changes its behaviour. This setting might be useful when dealing, for example, with DoS attacks, where it has been observed that an increased number of packets of a certain size occurs during the attack.

It is important to notice that in surveillance applications, such as cyber security, there is a need to repeatedly apply the detection procedures. After each false alarm or actual attack we need to start monitoring the system again; therefore, a specification of a renewal mechanism is required. For example, assuming a homogeneous process, we could start from scratch after every alarm, yielding a multi-cyclic model (Tartakovsky, 2014). However, there is a second circumstance we need to consider in this scenario: if an actual attack happens, it is very likely to happen a long time after the surveillance began. This combined scenario has led to interesting results; for example, Pollak and Tartakovsky (2009) proved that under some assumptions the multi-cyclic Shiryaev-Roberts detection procedure is optimal with respect to the stationary average delay to detection.

3.2 Network modelling and anomaly detection

Network anomaly detection and computer network modelling is the second problem we are interested in describing within the Bayesian framework. The basic idea is that we can characterise normal pattern connections within a computer network, either by the constant surveillance of packet features or through the pattern of connections between the nodes of the computer network (e.g., users and computers). An anomaly is flagged when a new connection cannot be grouped with the normal-behaviour clusters. The task of creating groups of normal behaviour can be achieved by cluster analysis. Distance-based algorithms are some of the most common ways to create such clusters; however, each of these algorithms requires the specification of what similarity between groups looks like through the choice of a distance measure (e.g., Euclidean distance and Manhattan distance). This implies that the clustering process will rely on both the algorithm and the choice of distance (or similarity measure).

Bayesian clustering models have also been used for cyber security research. In particular, Metelli and Heard (2016) use a 2-step procedure for inferring cluster configurations and at the same time modelling new edges. The first step uses a Bayesian agglomerative clustering algorithm with the choice of the multiplicative change in the posterior probability as a similarity measure. This algorithm yields an initial cluster configuration of users with similar connection behaviours, which is then used in a Bayesian Cox proportional hazards model (Cox, 1972) with time-dependent covariates for the identification of new edges within the computer network. In this case, the chosen covariates represent the overall unique number of authentications and the restriction to a subset of similar users. This 2-step procedure requires a Markov Chain Monte Carlo (MCMC) algorithm for the joint update of the initial cluster and the coefficient parameters.

As an alternative to cluster analysis we can consider probabilistic methods, such as topic modelling, which has been widely used for cyber security research, especially what is known in the literature as the latent Dirichlet allocation (LDA) model (Blei et al., 2003). In the following, we describe this method as well as some others that have been used for modelling computer network behaviour and for network anomaly detection.

3.2.1 Topic modelling and LDA

Topic models are a kind of probabilistic model that had their origins in latent semantic indexing (LSI) modelling (Deerwester et al., 1990). LSI uses the singular value decomposition of a term-document association matrix in order to create a space where terms and documents that are considered to be related are placed near one another.

Hofmann (2001), using this idea, described probabilistic latent semantic analysis (PLSA), which can be considered the first topic model. His model uses a generative latent class model to perform a probabilistic mixture decomposition. Other models, such as the LDA model, have also been introduced with the task of discovering the topics that occur in a set of documents. However, they have also been widely used in other fields where there is a need for unsupervised clustering.

The topics produced by these models are clusters of similar words, which allows one to examine a set of documents (also known as a corpus) and discover what the topic (or topics) might be. One of the usual assumptions made is that the words in a document are exchangeable (also known as the "bag-of-words" assumption). As discussed in Blei et al. (2003), these methods also assume the exchangeability of documents. These assumptions allow us to exploit de Finetti's representation theorem for exchangeable random variables.

The general setup includes:

  1. a corpus $\mathcal{D}$ with $M$ documents $d_1, \ldots, d_M$,

  2. and each document $d_i$ contains $N_i$ words, taken from a dictionary of $V$ words.

The objective is to classify the observations into $K$ possible topics. The basic idea of the LDA model is that every document in the corpus can be represented as a random mixture over latent topics, where each topic is characterised by a distribution over words. The original generative process is as follows: first we generate the length $N$ of the document using a Poisson distribution (this assumption can be changed and a more realistic document-length distribution used instead), then we sample a topic-mixture parameter $\theta$ from a Dirichlet distribution with parameter $\alpha$ and, for each of the $N$ words, we first select a topic $z_n$ from a multinomial distribution with parameter $\theta$ and from this topic we choose a word $w_n$ from a multinomial probability conditioned on the topic.

Some of the key assumptions made in this model are that the number of topics $K$ is known and fixed and that the word probabilities are characterised by an unknown but fixed matrix $\beta$ of dimensions $K \times V$ that needs to be estimated. The entry $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$ represents the probability of observing the $j$-th word given the $i$-th topic.

Given the parameter (or vector of parameters) $\alpha$ and the matrix $\beta$, the joint distribution of a topic mixture $\theta$, a set of topics z and a set of words w is given by:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta).$$

Integrating over $\theta$ and summing over $z_n$ we obtain the marginal distribution of a document:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta,$$

and taking the product over single documents we obtain the probability of a corpus:

$$p(\mathcal{D} \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d,$$

where $\alpha$ and $\beta$ (as described above) are parameters at the corpus level, so that we only sample them once in the generative process, $\theta_d$ are document-level variables, so that we have one for each document, and $z_{dn}$ and $w_{dn}$ are word-level variables, i.e., one for each word in each document.
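
As a concrete illustration of this generative process, the sketch below samples a small synthetic corpus; the values of $K$, $V$, $M$, $\alpha$ and the way $\beta$ is drawn are all illustrative assumptions, not estimates from any data set.

```python
import numpy as np

rng = np.random.default_rng(2)

K, V, M = 3, 20, 5          # topics, dictionary size, number of documents (illustrative)
alpha = np.full(K, 0.5)      # Dirichlet parameter for the topic mixtures
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # K x V matrix of word probabilities

corpus = []
for d in range(M):
    N_d = rng.poisson(50)                      # document length
    theta_d = rng.dirichlet(alpha)             # topic mixture for document d
    z_d = rng.choice(K, size=N_d, p=theta_d)   # topic for each word
    w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # word given its topic
    corpus.append(w_d)

print([len(doc) for doc in corpus])
```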

Although the latent Dirichlet allocation model was first developed for document classification problems, it has been shown to be a valid and useful technique for network anomaly detection and for modelling computer network behaviour. In network anomaly detection, Heard et al. (2016), with the objective of detecting misuse of credentials, use an LDA model to analyse computer network connection traffic data to determine the number of users present. In their approach, documents are represented by the days, the different users are the topics and the destination computers play the role of the words.

Cramer and Carin (2011) considered applying topic models to computer networks by analysing raw Ethernet packets captured using tcpdump. In this approach the data is divided into fixed time intervals and the words are the packet information aggregated over the time interval (thereby ignoring the structure of the packets and their arrival times). They also consider extending the latent Dirichlet allocation model to time-varying problems using fixed time intervals of 5, 10 or 15 minutes. Cao et al. (2016) considered a similar model, without the dynamic element, for cyber security by training on known attack-free data.

3.2.2 Poisson factorisation

Topic models are not the only probabilistic models used for cyber security research that were originally designed for other purposes. Poisson factorisation models, which are widely used for recommender systems in machine learning (see, for example, Gopalan et al. (2013) and Gopalan et al. (2014)), have also been used for network anomaly detection. Poisson factorisation is a probabilistic model of users and items that was proposed as an alternative to classical probabilistic matrix factorisation (PMF) (Salakhutdinov and Mnih, 2007).

The main assumption is that the data can be represented in a matrix; in the recommender system setting the rows are the clients and the columns are the items. Each entry of this matrix is assumed to be the rating given by a certain user to a particular item, and these are modelled using the dot-product of latent factors for both the users and the items. Probabilistic matrix factorisation assumes that each entry is normally distributed and so is each of the latent factors. This theoretically implies that the ratings could become negative, which is something we would not desire. In order to address this issue, Poisson factorisation assumes that both the ratings and the latent factors are non-negative, and so a Poisson distribution for the entries and gamma distributions for the latent factors are used instead. These assumptions make Poisson factorisation more applicable to real data sets like the Netflix® one, where we have a set of users and the rankings for each movie in the catalogue.

With respect to cyber security research, Turcotte et al. (2016a) considered Poisson factorisation models for peer-based user analysis. The basic idea is that computer users with similar roles within an organisation will have similar patterns of behaviour. This type of analysis can be particularly important for quickly detecting rogue users. The behaviour of a new user can be compared to their peers and anomalies detected. The data used is the recorded authentication events as briefly explained in Section 2.

The model is further specified by letting $N_{ij}$ be the number of times that user $i$ authenticates on machine $j$, where it is assumed that

$$N_{ij} \sim \text{Poisson}(u_i^{\top} v_j),$$

where $u_i$, for $i = 1, \ldots, n$, and $v_j$, for $j = 1, \ldots, m$, are $K$-dimensional vectors of positive values. The model is interpreted as having $K$ features, with $u_i$ representing the scores of the $i$-th user on the $K$ features and $v_j$ representing the scores of the $j$-th computer on the features.

A feature might have a high value for all machines within one department and low scores otherwise. If a user has a high score on that feature then they are likely to have many authentication events on machines in that department (perhaps representing that they work in that department). If a user has a low score on that feature then they are likely to have a very low number of authentication events. In general, the mean number of authentication events for a user on a machine is the sum over products of many features, which allows similarities between users and computers to be learnt from the data. The specification of the model is completed by assuming independent gamma prior distributions for the entries of $u_i$ and $v_j$, as is standard in Poisson factorisation. The model is fitted to a training sample and anomalies can be detected by comparing predictions from this model to observed values from a testing sample.
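
A minimal sketch of this idea is given below, with synthetic data and assumed gamma hyperparameter values: latent user and computer factors are drawn from gamma distributions, counts are Poisson with mean $u_i^{\top} v_j$, and a simple anomaly score is the upper-tail probability of an observed count under the model rate (in practice this rate would be a posterior estimate obtained from a training sample).

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)

n_users, n_computers, K = 50, 30, 5     # sizes and number of latent features (illustrative)
a, b = 1.0, 1.0                         # assumed gamma hyperparameters

# Latent non-negative factors: scores of users and computers on the K features.
U = rng.gamma(a, 1.0 / b, size=(n_users, K))
V = rng.gamma(a, 1.0 / b, size=(n_computers, K))

# Authentication counts: N_ij ~ Poisson(u_i . v_j).
rates = U @ V.T
N = rng.poisson(rates)

def surprise(i, j, observed):
    """Upper-tail probability P(N_ij >= observed) under the model rate for user i, machine j."""
    return poisson.sf(observed - 1, rates[i, j])

print(surprise(0, 0, N[0, 0] + 20))   # a much larger count than expected looks surprising
```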

3.2.3 Dirichlet process

So far, we have only described Bayesian models working under parametric assumptions. Although these models have proven effective in detecting network anomalies, there is still a missing methodology that we need to consider: the nonparametric realm. Bayesian nonparametrics has become widely popular in research areas such as finance, biology, machine learning, recommender systems and computer science, amongst others. Bayesian nonparametric models are becoming increasingly appealing mainly because they possess a flexibility that can hardly be achieved by parametric models. This flexibility comes in handy especially in problems where the dimensionality increases as more data becomes available. For cyber security research, where huge amounts of time-evolving data are available almost instantly, Bayesian nonparametric models should provide more suitable modelling techniques and offer interesting insights and solutions to the problem at hand.

Historically, it is widely accepted that Bayesian nonparametrics had its beginnings with the work of Ferguson (1973) and in recent years the research in this area has increased significantly. In this paper we do not cover the motivation and basic ideas of Bayesian nonparametrics; however, the reader could direct his attention to Hjort et al. (2010) for a thorough and comprehensive introduction to Bayesian nonparametrics and to the state-of-the-art practice.

The Dirichlet process was introduced by Ferguson (1973) and since then it has played a vital role in Bayesian nonparametrics and its applications. As described by Ghosal (2010), a nice motivation for the Dirichlet process is the inference problem of estimating a probability measure on the real line. A Bayesian solution requires the specification of a random probability measure as a prior and the detailed derivation of the posterior distribution. In a finite-dimensional setting this can be achieved through the Dirichlet-multinomial conjugate model. Therefore, for the infinite-dimensional case, the Dirichlet process works as a prior on the space of probability distributions and, just as with the Dirichlet distribution, we preserve a nice conjugacy property with the multinomial likelihood. If

$$X_1, \ldots, X_n \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim \mathrm{DP}(\alpha),$$

where $\alpha$ is a finite measure on the space $\mathbb{X}$ and can be written as $\alpha = \theta G_0$, with $\theta = \alpha(\mathbb{X})$ the total mass and $G_0$ a probability measure, then the posterior distribution follows a Dirichlet process with updated parameter:

$$P \mid X_1, \ldots, X_n \sim \mathrm{DP}\left(\alpha + \sum_{i=1}^{n} \delta_{X_i}\right).$$

Using this last expression we obtain a nice expression for the predictive distribution

$$\Pr(X_{n+1} \in \cdot \mid X_1, \ldots, X_n) = \frac{\theta}{\theta + n}\, G_0(\cdot) + \frac{1}{\theta + n} \sum_{j=1}^{k} n_j\, \delta_{X_j^*}(\cdot),$$

where $\{X_1^*, \ldots, X_k^*\}$ is the set of distinct values observed in $(X_1, \ldots, X_n)$ and $n_j$ is the number of times each has been observed. So we will be observing either an already seen value or a completely new one, which provides a natural framework for clustering problems (without an a priori restriction to a finite number of groups).

Exploiting the structure of the posterior and predictive distribution of the Dirichlet process, Heard and Rubin-Delanchy (2016) develop a Bayesian nonparametric approach to intrusion detection by assuming a Dirichlet process-based model for each message recipient on a set of computers $V$ that works as the node set of a directed graph and where the set of edges, $E$, are the directed connections amongst these computers. The first step of their anomaly detection procedure is the obtention of the predictive p-value for an observed event $y$, defined as

$$p(y) = \sum_{y' :\, \hat{P}(y') \leq \hat{P}(y)} \hat{P}(y'),$$

where $\hat{P}$ denotes the predictive distribution arising from the Dirichlet process model and the sum runs over the possible events. These p-values quantify the level of surprise of a new connection. Since the goal is to detect anomalies in each source computer, the p-values observed on an edge are reduced to a single score using Tippett's method (Tippett, 1931), i.e., by using the lower tail of a Beta distribution evaluated at the minimum of these p-values. Finally, a single score for each node is obtained using Fisher's method (Fisher, 1934), that is, by using the upper tail of a $\chi^2$ distribution whose degrees of freedom are twice the number of combined p-values. These final scores, obtained for each source computer, are ranked from the most interesting to the least interesting source nodes. This detection procedure is highly parallelisable and hence suitable for large networks.
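
To illustrate the ingredients of this procedure, the sketch below computes a Dirichlet process predictive distribution over destinations from one source's connection history, the discrete predictive p-value of a new connection, and a node-level combination of several p-values with Fisher's method. The total-mass parameter, the assumption of a uniform base measure over a finite destination set, and all data values are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np
from collections import Counter
from scipy.stats import chi2

def dp_predictive(past, theta, n_possible):
    """DP predictive over destinations: mass theta/(theta+n) spread uniformly over an
    assumed base measure of n_possible destinations, plus n_j/(theta+n) on seen ones."""
    n, counts = len(past), Counter(past)
    base = theta / (theta + n) / n_possible
    return lambda y: base + counts.get(y, 0) / (theta + n)

def predictive_p_value(past, y, theta, destinations):
    """Discrete p-value: total predictive mass of destinations no more likely than y."""
    pred = dp_predictive(past, theta, len(destinations))
    p_y = pred(y)
    return sum(pred(d) for d in destinations if pred(d) <= p_y)

destinations = [f"C{i}" for i in range(100)]          # assumed destination set
past = ["C1", "C1", "C2", "C1", "C3"]                 # assumed connection history
p_new = predictive_p_value(past, "C57", theta=1.0, destinations=destinations)

# Fisher's method to combine several p-values into one node-level score.
p_values = [p_new, 0.4, 0.9]
fisher_stat = -2 * np.sum(np.log(p_values))
node_score = chi2.sf(fisher_stat, df=2 * len(p_values))
print(p_new, node_score)
```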

Working directly along this line, Sanna Passino and Heard (2019) examined a joint modelling of a sequence of links based on the Pitman-Yor process. This stochastic process was first introduced in Perman et al. (1992) and further results were developed in Pitman and Yor (1997). The Pitman-Yor process, also known as the two-parameter Poisson-Dirichlet process, belongs to a special class of random probability measures known as Poisson-Kingman models (Pitman, 2003). This stochastic process, as the name suggests, requires two parameters, which are commonly denoted by $\sigma$ (commonly known as the discount parameter) and $\theta$ (the strength parameter). An appealing characteristic of this stochastic process is its predictive distribution, which can be obtained in closed form and is given by:

$$\Pr(X_{n+1} \in \cdot \mid X_1, \ldots, X_n) = \frac{\theta + \sigma k}{\theta + n}\, G_0(\cdot) + \frac{1}{\theta + n} \sum_{j=1}^{k} (n_j - \sigma)\, \delta_{X_j^*}(\cdot),$$

where $k$ is the number of different observations and everything else remains the same as for the Dirichlet process. From the predictive distribution we can notice that if we let $\sigma = 0$ we recover the DP; thus, the Pitman-Yor process can be thought of as a generalisation of the DP. Another interesting observation is that the probability of a new observation depends on the number of different observations, $k$, which is something we did not have for the DP.
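
The predictive weights are easy to compute directly; the sketch below (with assumed parameter values and data) also shows that setting the discount $\sigma$ to zero recovers the Dirichlet process weights.

```python
from collections import Counter

def pitman_yor_weights(past, sigma, theta):
    """Predictive weight of a new value and of each seen value under a Pitman-Yor process."""
    n, counts = len(past), Counter(past)
    k = len(counts)
    new_value = (theta + sigma * k) / (theta + n)
    seen = {x: (c - sigma) / (theta + n) for x, c in counts.items()}
    return new_value, seen

past = ["C1", "C1", "C2", "C1", "C3"]
print(pitman_yor_weights(past, sigma=0.25, theta=1.0))   # Pitman-Yor
print(pitman_yor_weights(past, sigma=0.0, theta=1.0))    # sigma = 0 recovers the DP
```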

In Sanna Passino and Heard (2019) the joint modelling of a link is achieved through the decomposition of its joint distribution into a marginal component and a conditional component, with Pitman-Yor process priors used for both. The detection procedure follows the same reasoning as in Heard and Rubin-Delanchy (2016), that is, the obtention of the predictive p-values and their further combination into a single score for each source computer. Besides the use of the Pitman-Yor process rather than the Dirichlet process, there are two other interesting results found in their work. The first one is that they do not restrict their attention to the use of p-values and also explore the use of mid p-values (the reader can direct the attention to Lancaster (1952) and Rubin-Delanchy et al. (2019) for an insight into mid p-values and why in some problems they might be preferred over p-values). Finally, they also explore the use of other p-value combiners such as Pearson's method (Pearson, 1933) and Stouffer's method (Stouffer, 1949). Analyses on the same data set yield better results with the Pitman-Yor process and with the use of mid p-values.

3.3 Malware detection and classification

The malware detection and classification problem is the third and last research area we explore in this review paper. As established before, malware is software designed to disrupt, damage or gain access to computer systems. The fast and accurate detection, and posterior classification into known families, of these malicious codes is an important task for cyber security in order to have a better understanding of how malware infects computers and of its threat level and, therefore, in order to be well-protected against it. As briefly explained in Section 2.3, there have been mainly two approaches in this area, which differ in terms of the kind of data used: either the hexadecimal representation of the binary code or the dynamic trace. For a more comprehensive structure, this subsection is divided according to how these two kinds of data sets have been addressed, rather than directly describing the methods as in previous sections.

3.3.1 Hexadecimal representation and $n$-grams

The hexadecimal representation of the binary code has been used in conjunction with what is known in the literature as $n$-grams. An $n$-gram analysis considers a collection of subsets of the original sequence and has been widely used as a probabilistic model for the prediction of the next item in a sequence (Section 2.3). However, for the purposes of malware detection the use of $n$-grams is restricted to the assumption that the presence or absence of these structures completely characterises both benign and malicious code. Therefore, these $n$-grams work as the data we are going to be using for malware detection and classification.

Working under this assumption, Kolter and Maloof (2004) provided a machine learning and data mining approach to the detection of malware in the wild. In their approach a large collection of both benign and malicious code is obtained and from this collection they select what they call the most important $n$-grams using the mutual information gain. Using the $n$-grams obtained from both the malicious and benign code they train several models (such as naive Bayes, support vector machines and decision trees, amongst others) in order to obtain a classification rule. These classification procedures are used for detecting new malware in the wild and they are further compared using the area under the ROC curve as the performance metric.

Although the methodology described above is not entirely Bayesian, we still believe their approach and the assumptions made on the data could be appealing to the Bayesian community as well. First of all, the data is represented by a binary matrix where the rows are either malicious or benign code representations and the columns are the $n$-grams that are assumed to characterise them. Moreover, since this matrix is usually a high-dimensional object in both rows and columns, from a Bayesian perspective we could try to obtain classification procedures that exploit a beta-Bernoulli construction. These classification procedures would work for detecting new malware in the wild, or they could be generalised for the second task cyber security faces with malware, that is, classification into known families.

For some malware data sets, like Microsoft's (Ronen et al., 2018), we might have information not only about the malware but also about the family it belongs to. As established before, accurate classification of new malware is an important task, and following the assumptions described in the last paragraph leads to interesting Bayesian models we could apply. In this scenario the data will also be a binary matrix with family-indexed malware in the rows and $n$-grams in the columns. These $n$-grams (due to the mutual information gain selection procedure) will be present at least once for each family; therefore, hierarchical models allowing the sharing of information across families could be attractive for research purposes.
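
As a minimal sketch of the beta-Bernoulli idea outlined above (an illustration of the kind of model that could be explored, not a model from the papers cited here), one can place independent Beta priors on the presence probability of each $n$-gram within each class and score a new program by its posterior predictive probability under each class; the same construction extends directly to several malware families. The data below are synthetic and the hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic binary presence/absence matrices: rows are programs, columns are n-grams.
X_malicious = rng.binomial(1, 0.6, size=(40, 200))
X_benign = rng.binomial(1, 0.2, size=(60, 200))

a, b = 1.0, 1.0   # Beta(1, 1) prior on each n-gram presence probability (assumption)

def posterior_presence_prob(X):
    """Posterior mean presence probability of each n-gram under a beta-Bernoulli model."""
    return (X.sum(axis=0) + a) / (X.shape[0] + a + b)

p_mal = posterior_presence_prob(X_malicious)
p_ben = posterior_presence_prob(X_benign)

def log_score(x, p):
    """Log posterior predictive probability of the binary n-gram vector x."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Equal class priors are assumed, so classification compares the two log scores directly.
x_new = rng.binomial(1, 0.6, size=200)   # a new, unseen program (synthetic)
print("malicious" if log_score(x_new, p_mal) > log_score(x_new, p_ben) else "benign")
```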

3.3.2 Dynamic traces as Markovian structures

As established in Section 2.3, the dynamic trace is the sequence of instructions called during the execution of a computer program. Following the ideas of Storlie et al. (2014), many authors have assumed that these traces have a Markovian structure, so that the interest lies in modelling and analysing the transition probability matrix. Both Bayesian parametric and nonparametric models have been used, and in the following we provide a comprehensive description of them.

As pointed out by Storlie et al. (2014), there are hundreds of commonly used instructions and thousands of them overall. Hence, modelling the one-to-one transitions is not feasible; in order to solve this issue, there is a need to create groups of similar instructions. Although not unique, the most common categorisation uses a set of 8 different groups, which include, amongst others: math, memory, stack and other. This categorisation is also followed by Bolton and Heard (2018) and Kao et al. (2015) in their respective models.

Letting $K$ be the number of instruction categories previously chosen (e.g., 8), a dynamic trace is a sequence $(x_1, x_2, \ldots)$ with $x_t \in \{1, \ldots, K\}$ that will be modelled as a Markov chain for malware detection and classification. Hence, we let $N_i$ be the transition counts matrix for the $i$-th program, $P_i$ be its probability transition matrix and $Y_i$ an indicator of whether the program is malicious or not. Storlie et al. (2014) proposed that elements of the estimated $P_i$ should be used as predictors to classify a program as malicious or not through a logistic spline regression model. The actual predictors used in the model are the entries of $\hat{P}_i$, where $\hat{P}_i$ is the posterior mean of $P_i$ obtained by giving a symmetric Dirichlet prior distribution with parameter $\alpha$ to each row of $P_i$.
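
A minimal sketch of the data-preparation step, assuming $K = 8$ instruction categories, a symmetric Dirichlet($\alpha$) prior on each row and a synthetic categorised trace, is given below; the posterior-mean transition probabilities are flattened into the predictors that a logistic (spline) regression model would then take as input.

```python
import numpy as np

K = 8            # number of instruction categories (as in Section 3.3.2)
alpha = 0.5      # symmetric Dirichlet prior parameter (assumed value)

def transition_counts(trace, K):
    """K x K matrix of transition counts of a categorised dynamic trace."""
    N = np.zeros((K, K))
    for a, b in zip(trace[:-1], trace[1:]):
        N[a, b] += 1
    return N

def posterior_mean_transitions(N, alpha):
    """Row-wise posterior mean of the transition matrix under a Dirichlet(alpha,...,alpha) prior."""
    return (N + alpha) / (N.sum(axis=1, keepdims=True) + K * alpha)

rng = np.random.default_rng(5)
trace = rng.integers(0, K, size=500)          # synthetic categorised dynamic trace
P_hat = posterior_mean_transitions(transition_counts(trace, K), alpha)

features = P_hat.flatten()                    # predictors for the classification model
print(features.shape)                         # (64,) -- one predictor per transition
```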

Working directly on this approach, Kao et al. (2015) proposed a flexible Bayesian nonparametric approach for modelling the probability transition matrices, using a mixture of Dirichlet processes (MDP) (Antoniak, 1974) as prior for the transition probability matrix in order to capture variability across programs. The MDP is chosen as prior rather than the Dirichlet process because the Dirichlet process has almost surely discrete realisations; it would therefore produce ties amongst the sampled matrices, which would be unrealistic for some applications (including the modelling of probability transition matrices).

Thus, Kao et al. (2015) specify a model for the probability transition matrices in which each $P_i$ is drawn from a random distribution with a mixture-of-Dirichlet-processes prior whose base measure follows a matrix Dirichlet distribution (MD) centred at some constant matrix $\Lambda$; a parameter $\eta$ controls the variance of $P_i$ around $\Lambda$, and the matrix $\eta \Lambda$ is the shape parameter. The model is completely specified by letting $Y_i$ be the indicator random variable of maliciousness, just like before. A new program $i^*$ is classified as malicious if $\Pr(Y_{i^*} = 1 \mid \mathbf{N})$ exceeds a predefined threshold, with $\mathbf{N}$ being the collection of all observed counts matrices. Moreover, if the program is malicious it can be further classified into a cluster with existing programs that share common features using an MCMC procedure.

A different approach for the modelling of dynamic traces was developed by Bolton and Heard (2018). Their approach follows the same assumptions, i.e., that dynamic traces, specified by the prior grouping of common instructions, have a Markovian structure so that they can be well-modelled by a Markov chain. However, they further assume that the Markov structure changes over time, with recurrent regimes of transition patterns. Therefore, each dynamic trace is modelled as a Markov chain with a time-varying transition probability matrix.

In order to detect the regime changes, three change-point models are described. The basic idea is that there are change-points that partition the dynamic trace into segments, within each of which the trace follows a homogeneous Markov chain. The methods vary in the way the probability transition matrices are defined within each segment. The first method changes the whole matrix in each segment, the second one only allows some of the rows to change and, finally, the regime-switching method allows a changed row not only to move forward to a new value but also to revert to a vector of probabilities that governed the Markov chain in earlier segments. A reversible jump MCMC algorithm is then used to sample these change-points.

Finally, the authors propose a classification procedure based on a similarity measure of the vectors of change-points and their regimes, which takes the minimum over the two traces of the proportion of instructions occurring within regimes shared by both traces. A high level of similarity requires that a large number of observations in both dynamic traces be drawn from common regimes.

4 Final remarks

Without a doubt, cyber security research from a mathematical and statistical point of view has become more appealing due to the inherent complexity of the problems and the nature of the data sets used. We believe, as many authors do, that in order to be well-prepared against current cyber threats there is a need for more sophisticated models. Bayesian statistics offers a wide range of flexible models that might be the key to a deeper understanding of the generative process underlying malicious attacks and, at the same time, to having predictive models able to handle large volumes of time-evolving data. That is why in this review paper we have presented the statistical approach to cyber security anomaly detection, placing particular emphasis on Bayesian models.

It is imperative to stress that the models described in our review are far from exhaustive. They represent the ones that are most frequently used for the general class of cyber security problems presented here. However, there are other kinds of cyber security problems that have been tackled from a Bayesian perspective. For example, in Price-Williams et al. (2018) the authors propose an alternative approach to user-activity anomaly detection by analysing the amount of user activity on a given day and the times at which these activities were realised; another interesting cyber security related problem can be found in Price-Williams et al. (2017), whose work aims to detect automated events that can be viewed as polling behaviour following an opening event originated by a user; in Turcotte et al. (2016b), the computer event logs per user are modelled by viewing them as a multivariate data stream; and Price-Williams et al. (2019) aim to detect correlated traffic patterns in computer networks in order to reduce false positives when performing anomaly detection.

It is also important to notice that, as the interest in cyber security keeps increasing, we frequently find new models that work directly along the lines of some of the ones we have presented here. The work of Metelli and Heard (2019), where they present a Bayesian model for new edge prediction and anomaly detection, is an example of this. In that paper, the authors use a Bayesian Cox regression model like the one used in Metelli and Heard (2016). However, in the more recent approach, the authors use two different classes of covariates: the first one is comprised of the time-varying out-degree of each client computer, the in-degree of each server computer and two indicator variables telling us whether the last connection and/or the last two connections were new. The second set of covariates represents what the authors describe as the notion of attraction between clients and servers. For the construction of the second set of covariates they use both hard-threshold and soft-threshold clustering models in a latent feature space.

We would also like to point out that, although there has been a genuine increase in cyber security research from a Bayesian point of view, to the best of our knowledge some areas remain far less explored than others. Most of the work we have encountered concerns either volume-traffic or network anomaly detection. Malware-related problems, such as detection and classification, are still open areas of research that deserve further development.

As a final comment, we would like the reader to note that, although it was not mentioned explicitly in each section of this review, anomaly detection for cyber security requires the analysis of high volumes of data. Whether for volume-traffic analysis, network modelling or malware detection and classification, all of these tasks require handling and learning from data sets that are usually very large. This plays a vital role in cyber security research: when developing statistical models for such problems, there is a need for algorithms that scale well, that can be parallelised and that, preferably, can operate sequentially as new data are observed.

Notes

The Advanced Research Projects Agency Network (ARPANET) was a packet switching network developed in the late 1960s that is widely considered to be the predecessor of the Internet (Oppliger, 2001).
de Finetti’s representation theorem is due to Hewitt and Savage (1955), who generalised de Finetti’s theorem for exchangeable 0-1 random variables (de Finetti, 1930).
tcpdump is a command-line packet analyser that allows the user to capture the packets transmitted or received over a computer network.

References

  • Adams and Heard (2014) Adams, N. and Heard, N., editors (2014). Data Analysis for Network Cyber-Security. Imperial College Press.
  • Amit et al. (2018) Amit, I., Matherly, J., Hewlett, W., Xu, Z., Meshi, Y., and Weinberger, Y. (2018). Machine Learning in Cyber-Security - Problems, Challenges and Data Sets. arXiv:1812.07858v3.
  • Antoniak (1974) Antoniak, C. E. (1974). Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. The Annals of Statistics, 2(6):1152 – 1174.
  • Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993 – 1022.
  • Bolton and Heard (2018) Bolton, A. and Heard, N. (2018). Malware Family Discovery Using Reversible Jump MCMC Sampling of Regimes. Journal of the American Statistical Association, 113(524):1490 – 1502.
  • Buczak and Guven (2016) Buczak, A. and Guven, E. (2016). A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Communications Surveys Tutorials, 18(2):1153 – 1176.
  • Cao et al. (2003) Cao, J., Cleveland, W., Lin, D., and Sun, D. (2003). Internet Traffic Tends Toward Poisson and Independent as the Load Increases, pages 83 – 109. Springer New York, New York, NY.
  • Cao et al. (2016) Cao, X., Chen, B., Li, H., and Fu, Y. (2016). Packet Header Anomaly Detection Using Bayesian Topic Models. IACR Cryptology ePrint Archive, 2016:40.
  • Chandola et al. (2009) Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly Detection: A Survey. ACM Computing Surveys, 41:1 – 72.
  • Chen et al. (2015) Chen, G., Wang, X., and Li, X. (2015). Fundamentals of Complex Networks. Wiley Publishing, 1st edition.
  • Cox (1972) Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187 – 220.
  • Cramer and Carin (2011) Cramer, C. and Carin, L. (2011). Bayesian Topic Models for Describing Computer Network Behaviors. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Press.
  • de Finetti (1930) de Finetti, B. (1930). Funzione Caratteristica Di Un Fenomeno Aleatorio. In Memorie della R. Accademia dei Lincei, volume 4, pages 86 – 133.
  • Deerwester et al. (1990) Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391 – 407.
  • Dunlavy et al. (2009) Dunlavy, D., Hendrickson, B., and Kolda, T. (2009). Mathematical Challenges in Cybersecurity. Technical report, Sandia National Laboratories.
  • Ferguson (1973) Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1(2):209 – 230.
  • Fisher (1934) Fisher, R. (1934). Statistical Methods For Research Workers. Oliver and Boyd, Edinburgh.
  • Ghosal (2010) Ghosal, S. (2010). The Dirichlet process, related priors and posterior asymptotics, chapter 2, pages 35 – 79. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
  • Gopalan et al. (2014) Gopalan, P., Charlin, L., and Blei, D. M. (2014). Content-based Recommendations with Poisson Factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 3176 – 3184, Cambridge, MA, USA. MIT Press.
  • Gopalan et al. (2013) Gopalan, P., Hofman, J. M., and Blei, D. (2013). Scalable Recommendation with Poisson Factorization. arXiv:1311.1704v3.
  • Gupta et al. (2014) Gupta, M., Gao, J., Aggarwal, C., and Han, J. (2014). Outlier Detection for Temporal Data: A Survey. IEEE Transactions on Knowledge and Data Engineering, 26(9):2250 – 2267.
  • Hagberg et al. (2014) Hagberg, A., Kent, A., Lemons, N., and Neil, J. (2014). Credential hopping in authentication graphs. In 2014 International Conference on Signal-Image Technology Internet-Based Systems. IEEE Computer Society.
  • Hall (2000) Hall, E. (2000). Internet Core Protocols: The Definitive Guide: Help for Network Administrators. An owner’s manual for the internet. O’Reilly Media, Incorporated.
  • Heard et al. (2016) Heard, N. A., Palla, K., and Skoularidou, M. (2016). Topic modelling of authentication events in an enterprise computer network. In 2016 IEEE Conference on Intelligence and Security Informatics. IEEE Press.
  • Heard and Rubin-Delanchy (2016) Heard, N. A. and Rubin-Delanchy, P. (2016). Network-wide anomaly detection via the Dirichlet process. In the Proceedings of the IEEE workshop on Big Data Analytics for Cyber-security Computing.
  • Hewitt and Savage (1955) Hewitt, E. and Savage, L. (1955). Symmetric measures on Cartesian products. Transactions of the American Mathematical Society, 80:470 – 501.
  • Hjort et al. (2010) Hjort, N., Holmes, C., Müller, P., and Walker, S., editors (2010). Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
  • Hofmann (2001) Hofmann, T. (2001). Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1–2):177 – 196.
  • Kao et al. (2015) Kao, Y., Reich, B., Storlie, C., and Anderson, B. (2015). Malware Detection Using Nonparametric Bayesian Clustering and Classification Techniques. Technometrics, 57(4):535 – 546.
  • Karagiannis et al. (2004) Karagiannis, T., Molle, M., Faloutsos, M., and Broido, A. (2004). A nonstationary Poisson view of Internet traffic. In IEEE International Conference on Computer Communications 2004, volume 3, pages 1558 – 1569.
  • Kent (2015) Kent, A. D. (2015). Cybersecurity Data Sources for Dynamic Network Research. In Dynamic Networks in Cybersecurity. Imperial College Press.
  • Kolter and Maloof (2004) Kolter, J. Z. and Maloof, M. (2004). Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 470 – 478, New York, NY, USA. Association for Computing Machinery.
  • Lancaster (1952) Lancaster, H. O. (1952). Statistical control of counting experiments. Biometrika, 39(3 - 4):419 – 422.
  • McGraw and Morrisett (2000) McGraw, G. and Morrisett, G. (2000). Attacking Malicious Code: A Report to the Infosec Research Council. IEEE Software, 17(5):33 – 41.
  • Metelli and Heard (2016) Metelli, S. and Heard, N. (2016). Model-based clustering and new edge modelling in large computer networks. In 2016 IEEE Conference on Intelligence and Security Informatics. IEEE Press.
  • Metelli and Heard (2019) Metelli, S. and Heard, N. (2019). On Bayesian new edge prediction and anomaly detection in computer networks. The Annals of Applied Statistics, 13(4):2586 – 2610.
  • Meza et al. (2009) Meza, J., Campbell, S., and Bailey, D. (2009). Mathematical and Statistical Opportunities in Cyber Security. arXiv:0904.1616.
  • Myhre (2001) Myhre, R. N. (2001). Introduction to Networking and the OSI Model. Prentice Hall.
  • Newman (2010) Newman, M. (2010). Networks: An Introduction. Oxford University Press, Inc.
  • Olding and Wolfe (2014) Olding, B. and Wolfe, P. (2014). Inference for Graphs and Networks: Adapting Classical Tools to Modern Data, pages 1 – 31. Imperial College Press, London.
  • Oppliger (2001) Oppliger, R. (2001). Internet and Intranet Security. Artech House, Inc., USA, 2nd edition.
  • Pearson (1933) Pearson, K. (1933). On a Method of Determining Whether a Sample of Size n Supposed to Have Been Drawn from a Parent Population Having a Known Probability Integral has Probably Been Drawn at Random. Biometrika, 25(3 - 4):379 – 410.
  • Perman et al. (1992) Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92:21 – 39.
  • Pitman (2003) Pitman, J. (2003). Poisson-Kingman Partitions. Lecture Notes-Monograph Series, 40:1 – 34.
  • Pitman and Yor (1997) Pitman, J. and Yor, M. (1997). The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator. The Annals of Probability, 25(2):855 – 900.
  • Pollak and Tartakovsky (2009) Pollak, M. and Tartakovsky, A. (2009). Optimality Properties of the Shiryaev-Roberts Procedure. Statistica Sinica, 19(4):1729 – 1739.
  • Polunchenko et al. (2012) Polunchenko, A. S., Tartakovsky, A., and Mukhopadhyay, N. (2012). Nearly Optimal Change-Point Detection with an Application to Cybersecurity. Sequential Analysis, 31:409 – 435.
  • Polunchenko and Tartakovsky (2011) Polunchenko, A. S. and Tartakovsky, A. G. (2011). State-of-the-art in sequential change-point detection. Methodology and Computing in Applied Probability, 14(3):649 – 684.
  • Price-Williams et al. (2019) Price-Williams, M., Heard, N., and Rubin-Delanchy, P. (2019). Detecting weak dependence in computer network traffic patterns by using higher criticism. Journal of the Royal Statistical Society: Series C, 68(3):641 – 655.
  • Price-Williams et al. (2017) Price-Williams, M., Heard, N., and Turcotte, M. (2017). Detecting Periodic Subsequences in Cyber Security Data. In 2017 European Intelligence and Security Informatics Conference, pages 84 – 90.
  • Price-Williams et al. (2018) Price-Williams, M., Turcotte, M., and Heard, N. (2018). Time of Day Anomaly Detection. In 2018 European Intelligence and Security Informatics Conference, pages 1 – 6.
  • Ronen et al. (2018) Ronen, R., Radu, M., Feuerstein, C., Yom-Tov, E., and Ahmadi, M. (2018). Microsoft Malware Classification Challenge. arXiv:1802.10135.
  • Rubin-Delanchy et al. (2019) Rubin-Delanchy, P., Heard, N. A., and Lawson, D. J. (2019). Meta-Analysis of Mid-p-Values: Some New Results based on the Convex Order. Journal of the American Statistical Association, 114(527):1105 – 1112.
  • Salakhutdinov and Mnih (2007) Salakhutdinov, R. and Mnih, A. (2007). Probabilistic matrix factorization. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, pages 1257 – 1264, USA. Curran Associates Inc.
  • Sanna Passino and Heard (2019) Sanna Passino, F. and Heard, N. A. (2019). Modelling dynamic network evolution as a Pitman-Yor process. Foundations of Data Science, 1:293 – 306.
  • Storlie et al. (2014) Storlie, C., Anderson, B., Vander Wiel, S., Quist, D., Hash, C., and Brown, N. (2014). Stochastic identification of malware with dynamic traces. The Annals of Applied Statistics, 8(1):1 – 18.
  • Stouffer (1949) Stouffer, S. (1949). The American soldier. Studies in social psychology in World War II. Princeton University Press.
  • Tartakovsky (2014) Tartakovsky, A. G. (2014). Rapid Detection of Attacks in Computer Networks by Quickest Changepoint Detection Methods, pages 33 – 70. Imperial College Press, London.
  • Tartakovsky et al. (2006a) Tartakovsky, A. G., Rozovskii, B. L., Blaźek, R. B., and Kim, H. (2006a). Detection of intrusions in information systems by sequential change-point methods. Statistical Methodology, 3(3):252 – 293.
  • Tartakovsky et al. (2006b) Tartakovsky, A. G., Rozovskii, B. L., Blaźek, R. B., and Kim, H. (2006b). A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Transactions on Signal Processing, 54(9):3372 – 3382.
  • Tippett (1931) Tippett, L. (1931). The Methods of Statistics. Williams and Norgate, London.
  • Turcotte et al. (2016a) Turcotte, M., Moore, J., Heard, N., and McPhall, A. (2016a). Poisson factorization for peer-based anomaly detection. In 2016 IEEE Conference on Intelligence and Security Informatics (ISI). IEEE Press.
  • Turcotte et al. (2016b) Turcotte, M. J. M., Heard, N. A., and Kent, A. D. (2016b). Modelling user behaviour in a network using computer event logs, pages 67 – 87. World Scientific.
  • Turcotte et al. (2018) Turcotte, M. J. M., Kent, A. D., and Hash, C. (2018). Unified Host and Network Data Set, chapter 1, pages 1 – 22. World Scientific.
  • Vishwanath et al. (2009) Vishwanath, A., Sivaraman, V., and Ostry, D. (2009). How Poisson is TCP traffic at short time-scales in a small buffer core network? In 2009 IEEE 3rd International Symposium on Advanced Networks and Telecommunication Systems, pages 1 – 3.