Attacker Behaviour Profiling using Stochastic Ensemble of Hidden Markov Models

05/28/2019 ∙ by Soham Deshmukh, et al. ∙ Veermata Jijabai Technological Institute Yahoo! Inc. 0

Cyber threat intelligence is one of the emerging areas of focus in information security. Much of the recent work has focused on rule-based methods and detection of network attacks using Intrusion Detection algorithms. In this paper we propose a framework for inspecting and modelling the behavioural aspect of an attacker to obtain better insight into, and predictive power over, the attacker's future actions. For modelling we propose a novel semi-supervised algorithm called Fusion Hidden Markov Model (FHMM) which is more robust to noise, requires comparatively less training time, and utilizes the benefits of ensemble learning to better model temporal relationships in data. This paper evaluates the performance of FHMM and compares it with both traditional algorithms like Markov Chain and Hidden Markov Model (HMM) and recently developed Deep Recurrent Neural Network (Deep RNN) architectures. We conduct the experiments on a dataset consisting of real attack data from a Cowrie honeypot system. FHMM provides accuracy comparable to deep RNN architectures at significantly lower training time. Given these experimental results, we recommend using FHMM for modelling discrete temporal data for significantly faster training and better performance than existing methods.




I Introduction

Existing cyber security tools rely on reactive methods and algorithms as a major part of their arsenal. In a world where organizations are highly digital, a single vulnerability can lead to a penetrative attack that negatively affects business on a large scale. Moreover, attackers now leverage automation and the cloud to scale their attacks faster and infiltrate systems in record-breaking time. Therefore, it is advisable for organizations to stay one step ahead of attackers and be able to quickly foresee where and when they will strike. Knowing the potential strike points or actions of an attacker, the organization can take the necessary steps to mitigate cyber risks to its business.

The attacker may use multiple access points to deploy an attack. Some attackers use a persistent attack strategy consisting of a sequence of attack behaviours carried out continuously until the intended target system is compromised [1]. Some attackers tend to stay in the system for a long time, performing trivial tasks to bypass the IDS and other defences, and only later perform a malicious act. These types of attacks are difficult to track and detect. Moreover, insider threats have become more prevalent, and the breach level index shows that almost 40% of breaches are due to poor employee awareness of cyber security [3]. Although raw data in the form of logs is abundantly available, it is difficult and time-consuming to extract meaningful information on which proactive measures can be based. Thus, it has become indispensable to develop a solution which provides threat intelligence capabilities to combat these attacks by being both proactive and responsive.

In order to identify and mitigate security breaches, three major types of network analysis are performed: signature based, anomaly based, and hybrid. Signature based techniques try to detect attacks based on their signature [4]. They produce fewer false positives but are useless against new (zero-day) attacks or attacks not present in their database. Anomaly based techniques model normal behaviour and detect deviations from it. They have the capacity to detect zero-day attacks. Additionally, they can be used to generate datasets for signature based methods. However, a user or employee can deviate from the standard pattern of operation, and this leads to a high number of false alarms generated by anomaly based approaches. Hybrid methods combine both signature and anomaly detection to increase detection [5] of threats and reduce false positive alarms. Most machine learning and deep learning based methods are hybrid methods. However, all of these methods are supervised learning methods where the algorithms need to be provided with labelled data, in the form of a database or time series, to make accurate predictions.

There is a need for an unsupervised or semi-supervised algorithm which can work on raw data dumps to provide advance warning of a potential attack or threat. For the algorithm to be effective and employable for predicting threats at the organization level, it must satisfy the following constraints -

  • Modelling capacity: The algorithm must have sufficient modelling capacity for detecting nonlinear patterns in sequential data.

  • Time constraint: The algorithm must be scalable and parallelizable to train on large data the organization generates. It should provide predictions with low latency and have low training time as the attackers can utilize the downtime of algorithm to its advantage.

  • Data Imbalance constraint: The algorithm should be able to handle an imbalanced data distribution where threats or breaches are a needle in a haystack of ‘normal’ user or employee behaviour patterns.

  • False Positives: The algorithm should be able to incorporate uncertainty and flag uncertain predictions as such rather than generating false positives.

With a view to satisfying these constraints, we propose a new algorithm called Fusion Hidden Markov Model which exploits the benefits of ensemble learning to produce better results. We train a set of diverse HMMs on different low-correlated partitions of data and amalgamate the predictions of these models using a nonlinear weight function. The nonlinear function uses the posterior distribution over these HMMs to generate a single output. One advantage of fitting a set of HMMs is that each model learns temporal features unique to a group of similar attack patterns, thus making FHMM less susceptible to noise. We employ this approach to effectively model attack patterns on file system based honeypots. The predictive power of this model can be used to anticipate the attacker's intent and would help in preventing systems from being compromised.

The remainder of the paper is structured as follows. Recent work in the field of cyber security has been elaborated in Section II. Section III provides a detailed problem formulation along with an overview of the proposed system architecture. In Section IV, we explain the proposed FHMM algorithm. Section V includes experiments performed. We analyze the results of these experiments and provide a comprehensive comparison of FHMM with other commonly used sequence models for modelling attacker behaviour. Finally, Section VI summarizes and concludes the paper.

II Related Work

Threat intelligence, or predictive threat intelligence, is a newly coined term which focuses on predictive methods for cyber security. [26] suggested the use of big data analytics for analyzing and handling the huge amount of data traffic. One of the most prominently used machine learning based approaches for insider threat or breach detection is user behaviour profiling, M. Al-Qurishi [36]. In this approach, the sequential actions of an employee or student are modelled to form a profile for that specific user. The most common patterns of actions are termed normal behaviour, and any deviation from this predefined path is considered an anomaly. This anomalous activity is then marked for further investigation as a potential attack. A variety of strategies are used for modelling the ‘normal’ behaviour of employees. For detecting long-term anomalies in cloud data, Owen et al. [7] proposed using the Extreme Studentized Deviate test (ESD) and employing time series decomposition and robust statistics. While other methods in the literature rely on combining user action data with behavioural and personality characteristics, Weiming Hu [2] and Jiong Zhang et al. [4] propose the use of bagging and boosting algorithms for intrusion detection. Apart from using collected or logged data on attacker actions, past research has shown that behavioural and personality characteristics can provide great insight in profiling the attack. An example is the work of Brdiczka et al. [8], which uses structural anomaly detection combined with psychological profiling in order to reduce false positives compared to traditional anomaly detection. Work by L. Spitzner [21] explores how honeypot technology can be used to detect, identify and gather information on these specific threats. Ultimately the task is to find out if there is an activity being planned and, if so, what stage the planning is in. Karl Granstrom et al. [9] showed how a Bernoulli filter can be used to detect the presence of an HMM in structured action data, along with the minimum complexity that an HMM would need to have in order to be detectable with reasonable fidelity.

Recently, a lot of work has focused on rigorous analysis of data collected by honeypots using machine learning and statistical techniques. Despite potential limitations on the availability of data in the majority of prior research, intrinsic patterns in cyber attacks have been identified. For example, Z. Zhan et al. [23] demonstrated the presence of long-range temporal dependencies in attack patterns captured by low-interaction honeypots. Owing to the presence of long-range dependencies in the rich information captured by honeypots, various frameworks have been proposed which exploit these statistical properties for modelling cyber attacks [25]. Another proposed framework [28] deals with graph-based clustering of time-series attack patterns, thereby identifying the activities of worms and botnets present in the honeypot traffic. A technique proposed in [25] employs a similarity metric called squared prediction error (SPE) for computing the distance between an observation projected into the residual space and the k-dimensional hyperspace determined by PCA. This metric was used to identify new attack patterns in the honeypot logs. On the other hand, the intent behind cyber attacks also provides a fair indication of the target of the attacker, and several approaches for intent prediction using HMMs [16] have been developed. To detect intrusions, Nong Ye [18] employed two EWMA techniques to detect anomalous changes in event intensity, one for correlated and one for uncorrelated data. Behaviour-rule based intrusion detection methods were also investigated in the domain of cyber physical systems, particularly in Smart Grid applications, by Robert Mitchell [12], who demonstrated that detecting attackers based on behaviour features led to low false positives. Dawei Shi [17] used a stochastic modelling framework, in particular a finite state hidden Markov model, to solve the joint state and attack estimation problem. Erland et al. [13] found the time feature to be extremely important in modelling attacker behaviour, and showed that the attack process can be split into three phases, namely the learning phase, the standard phase and the innovative attack phase. [22] suggested a game theory approach, formulating a Bayesian game to understand the attacker-defender interactions in honeypot-enabled networks.

The use of Hidden Markov Models for modelling normal behaviour (as opposed to attacker actions) was proposed by Tabish Rashid et al. [6] for detecting insider threats. Significant work has been done on improving the HMM and its learning algorithm. The ability of an HMM to model sequences depends on its structure. Though integration over all possible model structures is not feasible, structures suited to specific domains, like the profile HMM [10], have been developed. In the context of biological sequence analysis, researchers have used genetic algorithms to determine the structure of the HMM [11]. Mathias Johansson et al. [20] employ a unified Bayesian treatment to derive posterior probabilities for different model structures within the class of multinomial, Markov, and Hidden Markov models without assuming prior knowledge of transition probabilities. Searching for an HMM structure and then learning its parameters makes this approach infeasible in cyber security domains where latency is of utmost importance.

III Preliminary Research Context

III-A Problem Formulation

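Given the previous actions $O_0, O_1, \dots, O_t$ of an attacker up to timestep $t$, we intend to predict the next action $O_{t+1}$ the attacker will likely take. Here the $O_i$ are discrete states corresponding to attacker actions, where each $O_i$ takes one of a finite set of values.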

The goal of our system is to achieve accurate attack prediction with low prediction latency and training time. The algorithm should determine beforehand the optimal sequence lengths and number of HMMs to employ. Given the predictions of the HMMs, the algorithm must use them to accurately predict future states. Another challenging task is updating the models when necessary, so that the model is corrected without incurring a large overhead on the monitored infrastructure. We propose a possible scheme for these issues in this paper.

III-B Brief introduction to HMM

FHMM uses HMM as its backbone algorithm for learning relationships between defined discrete states. An HMM [19][29] is a statistical Markov model with hidden states. These hidden states are not directly visible to the observer. The HMM can be completely defined by the parameters $A$ (transition matrix), $B$ (observation matrix) and $\pi$ (prior probability) as

$$\lambda = (A, B, \pi)$$

Two assumptions are made by the model. The first, called the Markov assumption, states that the current state depends only on the previous state. The second, called the independence assumption, states that the output observation at time $t$ depends only on the current state; it is independent of previous observations and states. Given a set of examples from a process, we can estimate the model parameters that best describe that process. Then, we can discover the hidden state sequence that was most likely to have produced a given observation sequence. More details can be found in [19].
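To make the two assumptions concrete, the following minimal sketch evaluates the likelihood of a short observation sequence with the forward algorithm. The parameter values, the NumPy representation, and the function name `forward_likelihood` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

# Toy HMM with 2 hidden states and 3 observation symbols (illustrative values).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # A[i, j] = P(state j at t+1 | state i at t)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])   # B[i, k] = P(observing symbol k | state i)
pi = np.array([0.6, 0.4])         # pi[i] = P(state i at t = 0)


def forward_likelihood(A, B, pi, obs):
    """Return P(obs | A, B, pi) using the (unscaled) forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for symbol in obs[1:]:
        # Markov assumption: the next state depends only on the current one (A).
        # Independence assumption: the symbol depends only on that state (B).
        alpha = (alpha @ A) * B[:, symbol]
    return float(alpha.sum())


print(forward_likelihood(A, B, pi, [0, 1, 2]))
```

The unscaled recursion is adequate for short sequences; for long sessions a scaled or log-space variant would be needed to avoid numerical underflow.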

III-C HMM from Bayesian Perspective

In the Bayesian approach we assume some prior knowledge about the learning process or structure employed. For an HMM, the prior knowledge is encoded in terms of the arcs of the HMM and the model parameters. This prior knowledge, represented as a prior distribution, is used to obtain a posterior distribution over model structures and parameters. More formally, assuming a prior distribution $P(S)$ over model structures and a prior distribution $P(\theta \mid S)$ over parameters for each model structure $S$, a data set $D$ is used to form a posterior distribution over models using Bayes rule [35]:

$$P(S \mid D) = \frac{P(S) \int P(D \mid \theta, S)\, P(\theta \mid S)\, d\theta}{P(D)}$$

which averages over uncertainty in parameters. The posterior distribution over parameters is computed as:

$$P(\theta \mid S, D) = \frac{P(D \mid \theta, S)\, P(\theta \mid S)}{P(D \mid S)}$$

If we wish to predict the next observation $O_{t+1}$, based on our data and models, the Bayesian prediction

$$P(O_{t+1} \mid D) = \int P(O_{t+1} \mid \theta, S, D)\, P(\theta \mid S, D)\, P(S \mid D)\, d\theta\, dS$$

averages over both the uncertainty in the model structure and its parameters. This is known as the predictive distribution for the model [35].

For complex attacker behaviour modelling problems, using sequences with different characteristics to learn one Hidden Markov Model leads to too much generalization and thus loses the discriminating characteristics of the different attackers. Training a single model on diverse attack sessions would therefore lose information which might be unique to a small number of attack sequences. For a threat intelligence model, capturing this information is important, since such a sequence might be a crude attempt to hack into the system. The proposed approach, called fusion HMM (FHMM), partitions the training data according to the distribution of the number of attack sessions with respect to their lengths, and then trains multiple HMMs in a semi-supervised fashion. The predictions of these models are combined using a nonlinear network to provide better, more robust predictions. This approach thus aims to capture the characteristics of the attack sequences that would be lost when using a single global model.

IV Methodology

Using a single Bayesian structure like an HMM to model the joint distribution of all the observations and hidden states makes optimising $\theta$ to maximise the likelihood intractable. Particularly when the data set consists of a mixture of distributions over different sequence types, where the individual distributions might not be Gaussian in nature, optimising $\theta$ to maximise the likelihood becomes challenging. Hence we propose segmenting the data set $D$ into $K$ sub-parts such that $D = \{D_1, D_2, \dots, D_K\}$, where the $D_k$ are independent and identically distributed observation sets. An empirical methodology for segmenting the data set into $K$ parts is provided in this section. The benefits of this segmentation are made apparent in further equations. Intuitively, this is equivalent to breaking the problem down into smaller independent sub-problems, one per $D_k$, which allows using different values $\theta_k$ to reach a local minimum for each sub-problem. Modifying the original equation to incorporate the sub data sets:


In the limit of large data set and an uninformative or uniform prior over parameters, the posterior will be sharply peaked around the maxima of the likelihood, and therefore the predictions of a single maximum likelihood model will be similar to those obtained by Bayesian integration over parameters.


The maximum likelihood (ML) model parameters are obtained by maximising the likelihood or log likelihood.


Further, we obtain a limiting case of the Bayesian approach to learning if we assume a single model structure $S$ and estimate the parameter vector $\hat{\theta}$ that maximises the likelihood under that model. If the model structure is assumed constant,


In order to maximize the overall log likelihood with respect to the model structure $S$, we need to maximize the individual log likelihood terms, where the parameters $\theta_k$ in each term can differ, corresponding to the differing model parameters required to aptly represent the joint distribution of observations and hidden states.


These individual models are represented by HMMs whose parameters are estimated by the Baum-Welch algorithm, a special case of the EM algorithm. We train an HMM $\lambda_k$ independently on each sub-dataset $D_k$ using Baum-Welch, where $k = 1, \dots, K$. If $O_0, \dots, O_t$ is a sample sequence in $D_k$ then,


The denominator being independent of $O_{t+1}$, we compute the joint likelihood of the observed sequence and $O_{t+1}$ for each possible value of $O_{t+1}$. The value of $O_{t+1}$ which gives the maximum likelihood is taken as the best guess for the next observation under model $\lambda_k$. We now have $K$ estimates obtained from multiple distributions instead of a single estimate derived from a single distribution.

The true next observation is a combination of the individual HMM estimates, which might not be a linear combination. Let a function $f$ represent the nonlinear deterministic component of the approximation and $\epsilon$ the random noise; then the true output is given by:


Expressing the true outcome in this form helps in circumventing the time-invariance assumption on the state transition and emission matrices of the previously trained HMMs. Therefore the function $f$, along with incorporating the Bayesian analysis of the HMMs, now takes the timestep $t$ as an input, which acts as a deciding factor in assigning a nonlinear weight to each HMM estimate.


We choose to approximate this mapping from the HMM estimates to the true output with a neural network.

Thus, the proposed Fusion Hidden Markov Model (FHMM) algorithm can be subdivided into three major steps -

  1. Finding the optimum value of $K$ according to the distribution of input sequences and clustering the data into diverse groups.

  2. Training different Hidden Markov Models on the subgroups.

  3. Learning mapping of individual HMM predictions to a single prediction using a Neural Network.

The basic terminology used in this paper is as follows -
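       $T$ = length of the observation sequence
       $N$ = number of states in the model
       $M$ = number of observation symbols
       $Q$ = distinct states of the Markov process = $\{q_0, q_1, \dots, q_{N-1}\}$
       $V$ = set of possible observations = $\{0, 1, \dots, M-1\}$
       $\pi$ = initial state distribution
       $A$ = state transition probabilities = $\{a_{ij}\}$ with shape $N \times N$, where $a_{ij} = P(\text{state } q_j \text{ at } t+1 \mid \text{state } q_i \text{ at } t)$
       $B$ = observation probability matrix = $\{b_j(k)\}$ with shape $N \times M$, where $b_j(k) = P(\text{observation } k \text{ at } t \mid \text{state } q_j \text{ at } t)$
       $O$ = $(O_0, O_1, \dots, O_{T-1})$ = observation sequence

An HMM is completely defined by the parameters $A$, $B$ and $\pi$ and is denoted by $\lambda = (A, B, \pi)$.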

IV-A Partitioning Data into K Groups

An important part of any ensemble learning algorithm is partitioning the data into non-correlated subsets, which makes learning from them fruitful. The non-correlated subsets of data also ensure that the HMMs are fitted to different subsets of information which, when combined, characterise the whole time-series data. The number $K$ is a hyperparameter which characterizes the complexity of the data. There exists a tradeoff between the value of $K$ used and the time required to train the HMMs as the complexity of the data increases. To satisfy the above conditions FHMM uses a dissimilarity function f(F) to divide the data into subsets such that each subset captures a particular pattern of temporal data. The dissimilarity function is a custom defined distance function whose definition changes depending on the type of data used (some of the definitions are provided in the appendix). Initially dividing the data by sequence length provides several benefits -

  • Data of similar length generally originates from the same distribution. To illustrate, in the case of cyber attacks, attacks of similar length generally employ the same attack strategy and pattern. This ensures that each subset of data captures a particular attacker mindset or bot attack pattern, which helps the HMM learn the pattern quickly and with higher accuracy.

  • From an implementation and operational point of view, HMMs train comparatively faster on data of the same length than on data of variable length.

  • Diverse state transitions are present in attack sequences of significantly different lengths. Thus, it becomes difficult for a single HMM to accurately model it.

The subsets obtained by partitioning on length are then chosen for the ensemble depending on their correlation with each other. For each sub dataset we compute a frequency occurrence array, which holds the number of occurrences of each discrete state in that sub dataset divided by the total number of states seen. The frequency array characterises the sub dataset quantitatively, which makes it possible to compare its ‘similarity’ with the other sub datasets. Each sub dataset is thus represented by a $1 \times N$ frequency vector, where $N$ is the number of discrete states. The matrix formed by stacking these frequency vectors is then used to obtain, for each dataset, a set of datasets called its similarity set, which contains the datasets similar to it as measured by the Euclidean distance between the corresponding frequency arrays. These similarity sets are used to rank the datasets and select $K$ datasets from the total set of sub datasets such that the selected datasets are diverse, have low correlation with each other, and cover most of the training data.

0:  , Dissimilarity function
1:   = max length ()
2:  for all 0 in  do
3:     O where int
4:  end for
5:  for all  do
6:     Compute Frequency array where
7:  end for
8:  Construct similarity sets by computing Euclidean distance between frequency arrays
9:  Assign Ranks , where using
10:  Use to obtain lengths
11:  return  
Algorithm 1 Choosing and Training HMM
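The partitioning step can be sketched roughly as follows. Grouping by length, the normalised frequency arrays, and the Euclidean distance follow the description above; the greedy farthest-point selection used to pick diverse groups is a swapped-in assumption, since the paper's exact ranking rule is not recoverable from the text, and all function names are hypothetical.

```python
from collections import defaultdict

import numpy as np


def group_by_length(sequences):
    """Partition sequences of discrete states by their length."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[len(seq)].append(seq)
    return dict(groups)


def frequency_array(group, n_states):
    """Occurrences of each state in a group divided by the total states seen."""
    flat = np.concatenate([np.asarray(seq, dtype=int) for seq in group])
    counts = np.bincount(flat, minlength=n_states)
    return counts / counts.sum()


def select_diverse_lengths(sequences, n_states, k):
    """Pick up to k sequence lengths whose frequency arrays are far apart.

    Assumption: greedy farthest-point selection, seeded with the largest group.
    """
    groups = group_by_length(sequences)
    lengths = sorted(groups)
    freqs = np.array([frequency_array(groups[n], n_states) for n in lengths])
    # Pairwise Euclidean distances between the frequency arrays.
    dist = np.linalg.norm(freqs[:, None, :] - freqs[None, :, :], axis=-1)
    chosen = [max(range(len(lengths)), key=lambda i: len(groups[lengths[i]]))]
    while len(chosen) < min(k, len(lengths)):
        nearest = dist[:, chosen].min(axis=1)  # distance to closest chosen group
        nearest[chosen] = -1.0                 # never re-select a chosen group
        chosen.append(int(nearest.argmax()))
    return [lengths[i] for i in chosen]


sessions = [[0, 0], [0, 0], [0, 0], [1, 1, 1], [0, 0, 0, 1]]
print(select_diverse_lengths(sessions, n_states=2, k=2))
```

Seeding with the largest group is one way to favour coverage of the training data; other seeds or a coverage-weighted score would be equally plausible readings of the text.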

IV-B Training K HMMs

In the second step, each group of attack sequences $D_k$ in the training data is used to learn an HMM $\lambda_k$. Although using a small set of attack sequences to learn an HMM might lead to overfitting, in the broader context it ensures that each model captures the set of characteristics distinct to that group of sequences. Each Hidden Markov Model $\lambda_k$ is trained using the Baum-Welch algorithm [19], an instance of Expectation Maximization (EM) [34], in order to fit the parameters (transition, emission and prior probabilities) of the model. EM maximizes the likelihood of each sequence with respect to the corresponding model $\lambda_k$. Thus, each model is capable of accurately predicting future steps in a sequence which belongs to the probability distribution learnt by it. The detailed training algorithm is given below. In each model the number of hidden states can be viewed as a hyper-parameter which can be tuned.

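0:  Input: observation sequence $O$; HMM parameters $\lambda = (A, B, \pi)$
1:  Randomly initialize $\lambda = (A, B, \pi)$
2:  Forward pass: compute $\alpha_t(i)$; normalize the computed $\alpha_t(i)$
3:  Backward pass: compute $\beta_t(i)$ recursively; normalize the computed $\beta_t(i)$
4:  Compute di-gammas $\gamma_t(i, j)$ and gammas $\gamma_t(i)$
5:  Re-estimate $A$, $B$ and $\pi$
6:  Compute $\log P(O \mid \lambda)$
7:  if $\log P(O \mid \lambda)$ has increased then
8:     Go to step 2
9:  else
10:     return  $\lambda$
11:  end if
12:  End
Algorithm 2 Individual Hidden Markov Model Training and Updating Process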
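A compact, hedged sketch of the training loop is shown below. It runs scaled Baum-Welch on a single observation sequence, which is a simplification: the paper trains each HMM on a whole sub-dataset and uses a Cython implementation. The function name `baum_welch`, its arguments, and the stopping tolerance are assumptions made for illustration.

```python
import numpy as np


def baum_welch(obs, n_states, n_symbols, n_iter=50, tol=1e-6, seed=0):
    """Fit (A, B, pi) to one sequence; returns them plus the log-likelihood trace."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(obs)
    T = len(obs)
    # Random row-stochastic initialisation.
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(n_states)
    pi /= pi.sum()
    log_likelihoods = []
    for _ in range(n_iter):
        # Forward pass with per-step scaling factors c[t].
        alpha = np.zeros((T, n_states))
        c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        # Backward pass, scaled with the same factors.
        beta = np.ones((T, n_states))
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
        # Gammas (state posteriors) and di-gammas (transition posteriors).
        gamma = alpha * beta
        emit_next = (B[:, obs[1:]].T * beta[1:])[:, None, :]
        xi = alpha[:-1, :, None] * A[None] * emit_next / c[1:, None, None]
        # Log-likelihood of the parameters used in this pass.
        log_likelihoods.append(float(np.log(c).sum()))
        # Re-estimate the parameters.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
        if len(log_likelihoods) > 1 and log_likelihoods[-1] - log_likelihoods[-2] < tol:
            break
    return A, B, pi, log_likelihoods


A, B, pi, trace = baum_welch([0, 1, 0, 1, 0, 1, 2, 0, 1, 0, 1, 0, 1, 2], 2, 3)
print(trace[0], trace[-1])
```

The log-likelihood trace should be non-decreasing, which is the convergence check that step 7 of the pseudocode relies on.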
1:  Initialize: array
2:  for  in  do
3:     .append()
4:     Compute Gamma
5:     .append()
6:  end for
7:  return  argmax()
8:  Train individual HMM on where
9:  return  []
Algorithm 3 Individual Hidden Markov Model Prediction Process
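The prediction step for a single trained HMM can be sketched as follows: filter the observed prefix with the forward recursion, advance one step through the transition matrix, and take the most probable symbol. The toy parameters and the name `predict_next_observation` are assumptions for illustration only.

```python
import numpy as np


def predict_next_observation(A, B, pi, obs):
    """Return (most likely next symbol, distribution over next symbols)."""
    alpha = pi * B[:, obs[0]]
    alpha = alpha / alpha.sum()
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]
        alpha = alpha / alpha.sum()       # P(current state | observations so far)
    next_state = alpha @ A                # P(next state | observations so far)
    next_symbol = next_state @ B          # P(next symbol | observations so far)
    return int(next_symbol.argmax()), next_symbol


# Toy "sticky" HMM: each state tends to persist and to emit its own symbol.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
best, dist = predict_next_observation(A, B, pi, [0, 0, 0])
print(best, dist)
```

Returning the full distribution rather than only the argmax leaves room for the second stage to use each HMM's confidence, although the text only states that the predicted states are passed on.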

IV-C Combining Predictions of K HMMs

The resulting predictions from the $K$ Hidden Markov Models are then aggregated by a neural network to generate a combined prediction for the next step. The neural network layer consists of a linear weight component followed by a nonlinear function. It minimizes the mean squared error of the HMM predictions on the training data and assigns a probabilistic weight to each HMM. Thus, the neural network layer learns a nonlinear mapping from the predictions of the HMMs to the next state output. Each HMM prediction is assigned a weight $w_k$ such that the weights sum to one over all HMMs ($\sum_{k=1}^{K} w_k = 1$). The importance of a particular HMM's output, characterized by its weight, is learnt by iterating through the samples. The problem is now defined as: given the predictions of the $K$ HMMs at timestep $t$, what is the likely action the attacker will take at timestep $t+1$? FHMM uses a supervised neural network, as shown in figure 1, to learn the weights in the training phase from N data samples, and then uses the learned weights to estimate the future action taken. The network used is characterized by -

Fig. 1: Mapping estimates using neural network

where $a^l_j$ is the activation and $b^l_j$ is the bias of the $j$th neuron in the $l$th layer.

We define a cost which is used as a measure to indicate the offset of predicted action from actual action taken. We define the cost C over all samples n of data as:


This equation characterises the difference between the true output and the output predicted by the network. We then use backpropagation [32] to compute the gradients and update the weights through three equations: the error computation, the bias update, and the weight update.

The depth of the network required, in terms of layers, increases as the complexity of the pattern increases. Apart from the neural network's superior ability to model nonlinear, complex relationships in data, it has the following benefits over traditional parametric approaches -

  • Neural networks generalize well to unseen data and can infer patterns not initially present in the training data. This is crucial in cyber security applications, where new attacks consist of previously employed strategies spread intermittently throughout the attack. The neural network can detect this type of sub-pattern in an attack and therefore complements the ability of the HMM.

  • Unlike HMMs and other parametric techniques, a neural network does not impose restrictions on the distribution of the input variables. Moreover, neural networks can better model heteroscedasticity, while traditional models fail to model data with high volatility and non-constant variance, which is common in cyber security applications.

The training and evaluating phase is given in pseudo-code below:

0:   []
1:  for  in  do
2:      = Predictions()
3:  end for

Training phase:

0:  [],
1:  Initialize: randomly
2:   with L2 regularization
3:  Compute C = Cost()
4:  Update parameters:
5:  if Cost() < Cost() then
6:     Go to step 1
7:  else
8:     return  
9:  end if

Evaluation phase:

0:  []
Algorithm 4 Second Stage Data Collection
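The second stage can be illustrated with a deliberately small stand-in: a single layer with a linear weight component followed by a sigmoid nonlinearity, trained by gradient descent on a mean squared error cost with L2 regularization. The one-hot encoding of HMM predictions, the layer size, the learning rate, and all function names are assumptions; the network used in the experiments is larger and uses ReLU units.

```python
import numpy as np


def one_hot(indices, n_classes):
    out = np.zeros((len(indices), n_classes))
    out[np.arange(len(indices)), indices] = 1.0
    return out


def encode(hmm_preds, n_states):
    """Concatenate one-hot encodings of each HMM's predicted next state."""
    return np.concatenate(
        [one_hot(hmm_preds[:, k], n_states) for k in range(hmm_preds.shape[1])],
        axis=1,
    )


def train_combiner(hmm_preds, targets, n_states, lr=1.0, l2=1e-4, epochs=500, seed=0):
    """Fit W, b so that sigmoid(X W + b) approximates the one-hot true next state."""
    rng = np.random.default_rng(seed)
    X = encode(hmm_preds, n_states)
    Y = one_hot(targets, n_states)
    n = len(X)
    W = rng.normal(0.0, 0.01, (X.shape[1], n_states))
    b = np.zeros(n_states)
    for _ in range(epochs):
        out = 1.0 / (1.0 + np.exp(-(X @ W + b)))
        # Gradient of the MSE cost through the sigmoid (backpropagated error).
        delta = (out - Y) * out * (1.0 - out) / n
        W -= lr * (X.T @ delta + l2 * W)   # weight update with L2 penalty
        b -= lr * delta.sum(axis=0)        # bias update
    return W, b


def combine(W, b, hmm_preds, n_states):
    """Single fused prediction of the next state for each sample."""
    return (encode(hmm_preds, n_states) @ W + b).argmax(axis=1)


# Synthetic check: HMM 0 is always right, HMM 1 is uninformative noise.
rng = np.random.default_rng(1)
preds = np.stack([rng.integers(0, 3, 300), rng.integers(0, 3, 300)], axis=1)
truth = preds[:, 0]
W, b = train_combiner(preds, truth, n_states=3)
print((combine(W, b, preds, 3) == truth).mean())
```

In this synthetic setup the learned weights on the first HMM's block dominate, which mirrors the idea of assigning higher importance to the more reliable model.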

IV-D Overall Algorithm

During the training stage, the FHMM algorithm takes multiple attack sequences consisting of discrete states as input. Depending on the sequence length and the rank computed using the dissimilarity function, it selects and divides the data into the least correlated sub datasets. The predictions obtained from the HMMs trained to overfit on these sub datasets are then fed to a neural network, which learns the weight of each HMM.

During the prediction phase, the intermediate HMM predictions are fed to the neural network, which outputs a single value for the next state in the attack sequence.

Figure 2 illustrates the training and prediction phases of the FHMM algorithm.

Fig. 2: Overall FHMM Algorithm

V Experiments and Discussion

V-A Dataset Used

To collect real attack logs, we set up a Cowrie honeypot [31], which is a medium interaction SSH and telnet honeypot. A honeypot is a decoy system whose sole intention is to lure the attacker with an easy target and log the attack patterns. In the case of Cowrie, the attack patterns are logged in JSON format. A detailed description of the attack features logged and of the dataset is provided in [15] by Rahul et al. We processed the events in the Cowrie logs and divided them into 19 commands consisting of:

client.size, client.version, command.failed,

command.input/delete, command.input/dir-sudo,

command.input/other, command.input/system,

command.input/write, command.success,,

direct-tcpip.request, log.closed,,

login.failed, login.success, session.closed,

session.connect, session.file-download, session.input

We encoded these events as discrete states labelled 0 to 18. These 19 states were used for modelling and prediction in the HMM, LSTM and FHMM algorithms. The data is grouped by session id, so that each session id corresponds to one sequence of actions taken by an attacker. The assumption made here is that individual attacker characteristics do not depend on the session id, and hence that dividing the data by session id rather than by source IP does not affect modelling and prediction by a large factor.
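The encoding and grouping step might look like the sketch below. It assumes JSON-lines input with `eventid`, `session`, and `timestamp` fields, as Cowrie's JSON output provides; the three-entry event mapping is a hypothetical stand-in for the 19-state mapping, and the function name `sessions_from_log` is illustrative.

```python
import json
from collections import defaultdict

# Hypothetical subset of the event-to-state mapping (the paper uses 19 states).
EVENT_TO_STATE = {
    "cowrie.session.connect": 0,
    "cowrie.login.failed": 1,
    "cowrie.login.success": 2,
}


def sessions_from_log(lines, event_to_state):
    """Group log records by session id and return time-ordered state sequences."""
    sessions = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        state = event_to_state.get(record["eventid"])
        if state is not None:  # skip events outside the chosen state set
            sessions[record["session"]].append((record["timestamp"], state))
    # ISO 8601 timestamps in a common format sort correctly as strings.
    return {sid: [state for _, state in sorted(events)] for sid, events in sessions.items()}


log_lines = [
    '{"eventid": "cowrie.login.failed", "session": "s1", "timestamp": "2017-04-01T00:00:02Z"}',
    '{"eventid": "cowrie.session.connect", "session": "s1", "timestamp": "2017-04-01T00:00:01Z"}',
    '{"eventid": "cowrie.session.connect", "session": "s2", "timestamp": "2017-04-01T00:00:05Z"}',
    '{"eventid": "cowrie.login.success", "session": "s1", "timestamp": "2017-04-01T00:00:03Z"}',
    '{"eventid": "cowrie.login.failed", "session": "s2", "timestamp": "2017-04-01T00:00:06Z"}',
]
print(sessions_from_log(log_lines, EVENT_TO_STATE))
```

Each resulting list is one attack sequence of discrete states, which is the input format assumed by the later stages.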

The dataset used for training FHMM was extracted from the logs generated by the Cowrie honeypot from April 2017 to July 2017. By processing these logs, we generated 22,499 distinct attack sessions, each containing the sequence of steps taken by a particular source IP. The attack sessions lasted from 2 steps to over 1400 steps. These sessions are raw logs of the shell interaction performed by the attacker. We evaluate the performance of FHMM on a separate test set comprising real-time logs of attackers’ actions over 1 month.

V-B Partitioning Data into K groups

Cyber security logs contain multiple attacks of variable length from different hackers. The resulting logs vary in terms of type, pattern, sequence length, and duration of attack or infiltration. The data can be divided into sub-datasets according to any of these factors, each with its merits and demerits. Because attacks of similar length tend to follow similar patterns in the cyber security domain, we have considered the distribution of the number of attack sessions with respect to their lengths and partitioned the preprocessed training data according to length. Figure 3 shows the distribution of the number of attack sessions with respect to their lengths.

Once the data is divided into sub-datasets, we can train one HMM on each sub-dataset and pass their predictions to a neural network. However, to provide the fast response time needed to prevent a real-time attack, training and obtaining predictions from that many HMMs, which can number in the thousands, is not practically feasible. Hence we use similarity measures to reduce the number of sub-datasets to K. In order to divide the training data into K sets, we construct a 19-dimensional frequency array for each sub-dataset, where each sub-dataset consists of the sequences of one particular length. The frequency arrays describe the distribution of the different states in terms of the probability of occurrence of each state across multiple attacks of the same length. Figure 5 shows these frequency arrays reduced to a 2-dimensional vector space by Principal Component Analysis (PCA) [33]. Using a single frequency array to characterise all attacks of a particular length is sufficiently accurate to compute a dissimilarity measure between attacks of different lengths. It is also essential that the K selected sub-datasets contain maximum information while being mutually exclusive and dissimilar to one another. The Euclidean distance between these arrays, taken over all 19 dimensions, provides the dissimilarity between them. Hence we compute the Euclidean distance between these arrays to find dissimilar sets and select K datasets that cover the maximum information present in the training data. Using only the K selected datasets out of all the sub-datasets reduces training and prediction time without significantly affecting the accuracy of the deployed system.
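A rough sketch of the frequency-array and dissimilarity computation follows. The text does not spell out the exact selection rule, so the greedy farthest-point heuristic in `select_dissimilar` is an assumption made for illustration, as are all function names and the toy data.

```python
import math
from collections import Counter

NUM_STATES = 19  # dimensionality of the frequency arrays described in the text

def frequency_array(sessions, num_states=NUM_STATES):
    """Relative frequency of each state over all sessions of one length group."""
    counts = Counter(state for session in sessions for state in session)
    total = sum(counts.values())
    return [counts.get(s, 0) / total for s in range(num_states)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_dissimilar(arrays, k):
    """Greedy farthest-point selection of k group keys (an assumed heuristic)."""
    keys = sorted(arrays)
    dim = len(arrays[keys[0]])
    mean = [sum(arrays[key][i] for key in keys) / len(keys) for i in range(dim)]
    # Start from the group farthest from the mean frequency array.
    chosen = [max(keys, key=lambda key: euclidean(arrays[key], mean))]
    while len(chosen) < min(k, len(keys)):
        remaining = [key for key in keys if key not in chosen]
        # Add the group whose nearest already-chosen group is farthest away.
        chosen.append(max(
            remaining,
            key=lambda key: min(euclidean(arrays[key], arrays[c]) for c in chosen),
        ))
    return chosen

# Toy length groups: length -> list of sessions of that length.
groups = {2: [[0, 1]], 3: [[0, 1, 1]], 4: [[5, 5, 5, 5]]}
arrays = {length: frequency_array(items) for length, items in groups.items()}
selected = select_dissimilar(arrays, k=2)
```

In this toy case the group of length 4 uses a state that the others never visit, so it is picked first; the second pick is whichever remaining group lies farthest from it.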

Fig. 3: Distribution of number of attacker sessions with respect to duration
Fig. 4: Distribution of states predicted by different HMMs

Figure 7 depicts the error curve of an individual HMM during training.

V-C Training K HMMs

The K selected datasets are used to train K different HMMs. These HMMs are trained until convergence using the Baum-Welch algorithm. The HMM is implemented in Cython and the models are trained in parallel, which reduces computation time by a large factor. Each of the K HMMs then predicts the next possible state, resulting in K predictions for the next state, which are fed to the neural network.
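The prediction step of a single discrete HMM can be sketched as below. This is not the authors' Cython implementation: it assumes the parameters `pi`, `A` and `B` have already been estimated (for example by Baum-Welch), and the function name and toy parameters are illustrative only.

```python
def predict_next_observation(pi, A, B, observations):
    """Return P(next observation | observations) for a discrete HMM.

    pi[i]   : initial probability of hidden state i
    A[i][j] : transition probability from hidden state i to j
    B[i][o] : probability of emitting observation o from hidden state i
    """
    n = len(pi)
    # Forward (filtering) pass, normalised at every step to avoid underflow.
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    norm = sum(alpha)
    alpha = [a / norm for a in alpha]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
            for j in range(n)
        ]
        norm = sum(alpha)
        alpha = [a / norm for a in alpha]
    # One-step-ahead hidden state distribution, then marginalise over emissions.
    next_hidden = [sum(alpha[i] * A[i][j] for i in range(n)) for j in range(n)]
    num_obs = len(B[0])
    return [sum(next_hidden[j] * B[j][o] for j in range(n)) for o in range(num_obs)]

# Toy two-state model with two observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
distribution = predict_next_observation(pi, A, B, [0, 0, 1])
predicted = max(range(len(distribution)), key=distribution.__getitem__)
```

Running this once per trained model yields one predicted next state per HMM, which is the input the fusion network expects.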

To illustrate the practical need for splitting the data and training different HMMs, see Figures 4, 5 and 6. Figure 4 shows HMMs trained on different lengths and the distribution of output states predicted by each HMM. Each state type is represented by a different color, which makes clear why different HMMs are required to model the entire data. Moreover, the figure shows that each HMM learns different data patterns unique to an attack type.

V-D Combining Predictions of K HMMs

The predictions of these HMMs are combined by a neural network containing 60 units with ReLU activation. Selecting the optimum value of K is a classical bias-variance tradeoff. Large values of K (50) lead to poor generalization on test data. Table I shows the weight assigned to each input feature of the neural network, in descending order. Here each feature is the prediction made by an HMM trained on sequences of a particular length. The table clearly indicates that the temporal feature count has the highest importance in determining the next state taken by the attacker, followed by the predictions from the HMMs trained on lengths 11 and 44.
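To make the fusion step concrete, here is a minimal forward pass through a one-hidden-layer network with 60 ReLU units. Several details are assumptions rather than facts from the paper: the one-hot encoding of each HMM prediction, the interpretation of `count` as a scalar input, the softmax output over 19 states, and the random untrained weights (a real model would learn them by backpropagation [32]).

```python
import math
import random

NUM_STATES = 19    # size of the state space used in the paper
HIDDEN_UNITS = 60  # hidden layer width stated in the text

def one_hot(state, num_states=NUM_STATES):
    return [1.0 if s == state else 0.0 for s in range(num_states)]

def build_features(hmm_predictions, count):
    """Concatenate one-hot HMM predictions with the scalar count feature."""
    features = []
    for state in hmm_predictions:
        features.extend(one_hot(state))
    features.append(float(count))
    return features

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def fuse(features, hidden, output):
    """Forward pass: dense -> ReLU -> dense -> softmax over next states."""
    h = [max(0.0, v) for v in dense(features, *hidden)]
    logits = dense(h, *output)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
hmm_predictions = [7, 7, 12, 8]  # toy predictions from K = 4 HMMs
features = build_features(hmm_predictions, count=5)
hidden = init_layer(len(features), HIDDEN_UNITS, rng)
output = init_layer(HIDDEN_UNITS, NUM_STATES, rng)
probabilities = fuse(features, hidden, output)
```

The input width grows linearly with K, which is one way to see why a very large K can hurt generalization when training data is limited.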

Fig. 5: 2-D plot of frequency arrays
Fig. 6: Correlation plot between predictions of individual HMMs

V-E Quantitative Results

Fig. 7: Error rate while training a single HMM

Selecting the optimum value of K is a tradeoff between error rate and computational requirements. Increasing the value of K reduces the error rate. Figure 8 shows a plot of error rate versus the number of models in FHMM.

As evident, there is a significant reduction in error due to adding learners to the fusion. For our dataset, much of the reduction appears after 20-25 classifiers. One reason for this is the diverse set of features learnt by HMMs - 23, 24 and 25. This is evident from the correlation between the predictions of individual HMMs as shown in figure 6.
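One simple way to inspect how diverse the base models are is to measure how often two HMMs predict the same state. The sketch below uses a pairwise agreement rate, which is swapped in here as an easy measure for categorical predictions; it is not necessarily the correlation statistic plotted in Figure 6, and the model names and predictions are toy values.

```python
def agreement_matrix(predictions):
    """Pairwise fraction of test steps on which two models predict the same state.

    predictions[name][t] is the state predicted by model `name` at test step t.
    """
    names = sorted(predictions)
    matrix = {}
    for a in names:
        for b in names:
            pairs = list(zip(predictions[a], predictions[b]))
            matrix[(a, b)] = sum(x == y for x, y in pairs) / len(pairs)
    return matrix

# Toy predictions from three base models on five test steps.
predictions = {
    "hmm_11": [2, 3, 12, 14, 2],
    "hmm_44": [1, 3, 10, 14, 1],
    "hmm_199": [3, 4, 12, 14, 8],
}
agreement = agreement_matrix(predictions)
```

Low off-diagonal values indicate models that disagree often, which is the kind of diversity an ensemble can exploit.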

After 35-40 models, the error reduction for FHMM appears to have nearly reached a plateau. So, we primarily focus on the performance of FHMM for K = 38. Table II shows the prediction accuracy attained by FHMM with K = 25 and K = 38 along with that of other sequence models such as a Markov chain, a single HMM and an LSTM. The training time depends on how the HMMs in FHMM are trained. Because the training of each individual HMM is independent of the other HMMs, the models can be trained in parallel, which gives faster training times than a sequential training pipeline.

| Weight | Feature |
|---|---|
| 12.0151 ± 0.0308 | count |
| 4.6469 ± 0.0179 | hmm_11.0 |
| 3.9593 ± 0.0131 | hmm_44.0 |
| 3.9512 ± 0.0261 | hmm_13.0 |
| 2.9618 ± 0.0069 | hmm_188.0 |
| 2.6922 ± 0.0071 | hmm_106.0 |
| 2.6831 ± 0.0089 | hmm_23.0 |
| 2.6451 ± 0.0069 | hmm_127.0 |
| 2.5365 ± 0.0214 | hmm_28.0 |
| 2.3346 ± 0.0166 | hmm_18.0 |
| 2.2671 ± 0.0220 | hmm_76.0 |
| 2.1762 ± 0.0204 | hmm_129.0 |
| 2.0389 ± 0.0093 | hmm_199.0 |
| 1.7186 ± 0.0198 | hmm_4.0 |
| 1.6456 ± 0.0043 | hmm_80.0 |
| 1.5216 ± 0.0082 | hmm_9.0 |
| 1.4252 ± 0.0136 | hmm_6.0 |
| 1.3633 ± 0.0047 | hmm_20.0 |
| 1.3594 ± 0.0037 | hmm_69.0 |
| 1.2440 ± 0.0063 | hmm_194.0 |
| 1.2393 ± 0.0072 | hmm_197.0 |
| 1.1861 ± 0.0085 | hmm_14.0 |
| 1.1355 ± 0.0061 | hmm_31.0 |

TABLE I: Weights Assigned by Neural Network to Individual HMMs

| Model | Accuracy (%) | Training time (hours) |
|---|---|---|
| Markov Chain | 72 | 0.3 |
| HMM | 77 | 1 |
| LSTM | 86 | 5 |
| FHMM (K=25, sequential) | 87.19 | 2.3 |
| FHMM (K=38, sequential) | 90.82 | 2.5 |
| FHMM (K=25, parallel) | 87.19 | 1.3 |
| FHMM (K=38, parallel) | 90.82 | 1.5 |

TABLE II: Comparison of accuracy obtained by different models

One clear conclusion from the results is that FHMM reduces the error rate substantially compared to a single learner. Additionally, FHMM generalizes better than single models, which may be attributed to the following reasons. The training data contains considerably diverse information, and it becomes difficult for a single learner to learn a generalized joint probability distribution over the data. Thus, we use many learners, each of which performs well on part of the data. These learners may learn different distributions over the data, and combining them is a convenient choice. Training many learners also circumvents the imperfect search process of a single HMM. In an HMM, we assume some prior knowledge about the learning process and the model structure, and the desired complex input-output mapping may not be present in the hypothesis space being searched by the learning algorithm. In such cases, exploiting multiple learners provides a better estimate.
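The size of the error reduction can be read directly off Table II. The short calculation below converts the reported accuracies into error rates and computes the relative reduction achieved by FHMM with K = 38; the helper name `relative_error_reduction` is illustrative.

```python
# Accuracy in percent, copied from Table II.
accuracy = {
    "Markov Chain": 72.0,
    "HMM": 77.0,
    "LSTM": 86.0,
    "FHMM (K=38)": 90.82,
}

def relative_error_reduction(baseline_acc, model_acc):
    """Fraction of the baseline's error rate that the model removes."""
    baseline_err = 100.0 - baseline_acc
    model_err = 100.0 - model_acc
    return (baseline_err - model_err) / baseline_err

reductions = {
    name: relative_error_reduction(acc, accuracy["FHMM (K=38)"])
    for name, acc in accuracy.items()
    if name != "FHMM (K=38)"
}
```

With these figures, FHMM removes roughly 60% of the single HMM's error (23% down to 9.18%) and roughly 34% of the LSTM's error (14% down to 9.18%).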

Fig. 8: Error rate vs number of HMMs in FHMM algorithm
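
| State | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FHMM | 0 | 0 | 59.1 | 70.2 | 92.9 | 68.2 | 0 | 97.2 | 99.6 | 59 | 54.3 | 53.6 | 99.1 | 41.6 | 37.3 | 42.4 | - | 25.6 | 42.9 |
| HMM 9 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 63.2 | 35.5 | 0 | 0 | 0 | 37.2 | 0 | - | 0 | 0 |
| HMM 11 | 0 | 0 | 82.9 | 66.3 | 0 | 0 | 0 | 0 | 2.1 | 0 | 0 | 0 | 97.5 | 0 | 62.5 | 0 | - | 0 | 0 |
| HMM 44 | 0 | 100 | 0.1 | 12.7 | 0 | 0 | 0 | 0 | 6.3 | 0 | 9.7 | 13.2 | 0.8 | 0 | 37.2 | 0.1 | - | 7.1 | 0 |
| HMM 71 | 0 | 0 | 53.4 | 35.3 | 0 | 72.4 | 0 | 17.3 | 81.3 | 0 | 0 | 9.4 | 0 | 0 | 0 | 40.8 | - | 0 | 0 |
| HMM 199 | 0 | 0 | 0 | 73 | 73.2 | 0 | 0 | 0 | 96.9 | 0 | 0 | 0 | 97.5 | 0 | 62.5 | 0 | - | 0 | 0 |

TABLE III: Accuracy (%) obtained for different states by FHMM (K=38)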

V-F Limitations

Although the FHMM algorithm is robust to noise and provides a significant reduction in error rate while modelling attack sequences, it has some limitations common to other ensemble methods. The basic requirement of FHMM is that the base HMMs be diverse and have low correlation with each other in order to achieve a significant reduction in error rate on the training distribution. However, creating diverse base models is not always possible. Moreover, it requires techniques such as partitioning the data into diverse groups and initializing the base learners differently to induce heterogeneity. In addition, FHMM is complex and computationally expensive compared to simple probabilistic algorithms such as Markov chains and HMMs. With FHMM, learning time and memory constraints need to be taken into account.

V-G Other Applications

The proposed FHMM algorithm can be easily extended to other sequence problems where the goal is to predict the next state in the sequence. While the hidden state and observation spaces are discrete in the FHMM algorithm above, FHMM can also be used to model continuous observations. Potential applications include stock price prediction, speech synthesis, time-series analysis, gene prediction and part-of-speech tagging. For these applications, most of the algorithm would remain identical, with a change in the criteria for partitioning the data into groups. For this purpose, other methods such as clustering, and similarity measures such as cosine distance, can be employed depending on the training data and the application. After incorporating a suitable partitioning technique, the FHMM algorithm can be applied unchanged to other sequence tasks.
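As one example of an alternative similarity measure, cosine distance between two feature vectors can be computed as in the sketch below. The function name and the toy vectors are illustrative; which features to compare would depend on the application.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy feature vectors for three groups of sequences.
same_direction = cosine_distance([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Unlike Euclidean distance, cosine distance ignores vector magnitude, so it compares only the relative proportions of the features.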

VI Conclusion

This paper proposes the Fusion Hidden Markov Model (FHMM), which exploits the benefits of ensemble learning to model the behavioural aspects of an attacker and obtain better insight for predicting future actions. FHMM provides compelling results while modelling temporal patterns due to its higher modelling capacity, robustness to noise, and reduced training time. FHMM’s superiority is substantiated by comparison against the traditional approaches of Markov chain and HMM, as well as deep LSTM. The model is evaluated on a Cowrie honeypot dataset which consists of a large number of diverse real-time attack sessions. Keeping initial conditions and preprocessing constant, the proposed architecture outperforms the other traditional and benchmark models. In addition, we explored FHMM in depth, highlighting the contribution of individual parameters to the overall model. The architecture of FHMM allows it to be generalized to other domains where the nature of the dissimilarity between sequences is not linearly mapped.


The authors would like to acknowledge the support of Centre of Excellence (CoE) in Complex and Nonlinear Dynamical Systems (CNDS), VJTI and Larsen & Toubro Infotech (LTI) under their 1-Step CSR initiative.


  • [1] A. Razzaq, A. Hur, H. F. Ahmad and M. Masood, ”Cyber security: Threats, reasons, challenges, methodologies and state of the art solutions for industrial applications,” 2013 IEEE Eleventh International Symposium on Autonomous Decentralized Systems (ISADS), Mexico City, Mexico, 2013, pp. 1-6.
  • [2] W. Hu, W. Hu and S. Maybank, ”AdaBoost-Based Algorithm for Network Intrusion Detection,” in IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 2, pp. 577-583, April 2008.
  • [3] Gemalto. Breach level index—Data Breach database & Risk Assessment Calculator, 2016.
  • [4] C. N. Modi and K. Acha, “Virtualization layer security challenges and intrusion detection/prevention systems in cloud computing: a comprehensive review,” J. Supercomput., vol. 73, no. 3, pp. 1–43, 2016.
  • [5] E. Viegas, A. O. Santin, A. França, R. Jasinski, V. A. Pedroni, and L. S. Oliveira, “Towards an Energy-Efficient Anomaly-Based Intrusion Detection Engine for Embedded Systems,” IEEE Trans. Comput., vol. 66, no. 1, pp. 163–177, 2017.
  • [6] Tabish Rashid, Ioannis Agrafiotis and Jason R. C. Nurse, “A New Take on Detecting Insider Threats: Exploring the use of Hidden Markov Models,” Proceedings of the 8th ACM CCS International Workshop on Managing Insider Security Threats, Pages 47-56
  • [7] Owen Vallis, Jordan Hochenbaum and Arun Kejariwal, “A novel technique for long-term anomaly detection in the cloud,” HotCloud’14 Proceedings of the 6th USENIX conference on Hot Topics in Cloud Computing Pages 15-15
  • [8] Oliver Brdiczka, Juan Liu, Bob Price, Jianqiang Shen, Akshay Patil, Richard Chow, Eugene Bart and Nicolas Ducheneaut, “Proactive Insider Threat Detection through Graph Learning and Psychological Context ,” 2012 IEEE Symposium on Security and Privacy Workshops
  • [9] Karl Granstrom, Peter Willett and Yaakov Bar-Shalom, “Asymmetric Threat Modeling Using HMMs: Bernoulli Filtering and Detectability Analysis,” IEEE Transactions on Signal Processing Volume 64 Issue 10, May 2016 Page 2587-2601
  • [10] Anders Krogh, Michael Brown, I. Saira Mian, Kimmen Sjölander and David Haussler, “Hidden Markov Models in Computational Biology: Applications to Protein Modeling,” J Mol Biol. 1994 Feb 4;235(5):1501-31
  • [11] Kyoung-Jae Won, Adam Prügel-Bennett, and Anders Krogh, “Evolving the Structure of Hidden Markov Models,” IEEE Transactions on Evolutionary Computation Volume 10 Issue 1, February 2006 Page 39-49
  • [12] R. Mitchell and I. Chen, ”Behavior-Rule Based Intrusion Detection Systems for Safety Critical Smart Grid Applications,” in IEEE Transactions on Smart Grid, vol. 4, no. 3, pp. 1254-1263, Sept. 2013. doi: 10.1109/TSG.2013.2258948
  • [13] E. Jonsson and T. Olovsson, ”A quantitative model of the security intrusion process based on attacker behavior,” in IEEE Transactions on Software Engineering, vol. 23, no. 4, pp. 235-245, April 1997. doi: 10.1109/32.588541
  • [14] D.E. Denning, ”An Intrusion-Detection Model”, IEEE Trans. Software Eng., vol. 12, no. 2, pp. 222-32, 1987.
  • [15] Rade R., Deshmukh S., Nene R., Wadekar A.S., Unny A. (2019) Temporal and Stochastic Modelling of Attacker Behaviour. In: Akoglu L., Ferrara E., Deivamani M., Baeza-Yates R., Yogesh P. (eds) Advances in Data Science. ICIIT 2018. Communications in Computer and Information Science, vol 941. Springer, Singapore
  • [16] Q. Zhang, D. Man and W. Yang, ”Using HMM for Intent Recognition in Cyber Security Situation Awareness,” 2009 Second International Symposium on Knowledge Acquisition and Modeling, Wuhan, 2009, pp. 166-169.
  • [17] D. Shi, R. J. Elliott and T. Chen, ”On Finite-State Stochastic Modeling and Secure Estimation of Cyber-Physical Systems,” in IEEE Transactions on Automatic Control, vol. 62, no. 1, pp. 65-80, Jan. 2017.
  • [18] Nong Ye, S. Vilbert and Qiang Chen, ”Computer intrusion detection through EWMA for autocorrelated and uncorrelated data,” in IEEE Transactions on Reliability, vol. 52, no. 1, pp. 75-82, March 2003.
  • [19] L. Rabiner and B. Juang, ”An introduction to hidden Markov models,” in IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, Jan 1986.
  • [20] M. Johansson and T. Olofsson, ”Bayesian Model Selection for Markov, Hidden Markov, and Multinomial Models,” in IEEE Signal Processing Letters, vol. 14, no. 2, pp. 129-132, Feb. 2007.
  • [21] L. Spitzner, ”Honeypots: catching the insider threat,” 19th Annual Computer Security Applications Conference, 2003. Proceedings., Las Vegas, NV, USA, 2003, pp. 170-179.
  • [22] Q. D. La, T. Q. S. Quek, J. Lee, S. Jin and H. Zhu, ”Deceptive Attack and Defense Game in Honeypot-Enabled Networks for the Internet of Things,” in IEEE Internet of Things Journal, vol. 3, no. 6, pp. 1025-1035, Dec. 2016.
  • [23] Z. Zhan, M. Xu and S. Xu, ”Characterizing Honeypot-Captured Cyber Attacks: Statistical Framework and Case Study,” in IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1775-1789, Nov. 2013.
  • [24] Kaâniche, M., Deswarte, Y., Alata, E., Dacier, M., & Nicomette, V. (2006). Empirical analysis and statistical modeling of attack processes based on honeypots. CoRR, abs/0704.0861.
  • [25] S. Almotairi, A. Clark, G. Mohay and J. Zimmermann, ”A Technique for Detecting New Attacks in Low-Interaction Honeypot Traffic,” 2009 Fourth International Conference on Internet Monitoring and Protection, Venice, 2009, pp. 7-13.
  • [26] A. A. Cárdenas, P. K. Manadhata and S. P. Rajan, ”Big Data Analytics for Security,” in IEEE Security & Privacy, vol. 11, no. 6, pp. 74-76, Nov.-Dec. 2013.
  • [27] Z. Zhan, M. Xu and S. Xu, ”Characterizing Honeypot-Captured Cyber Attacks: Statistical Framework and Case Study,” in IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1775-1789, Nov. 2013.
  • [28] Olivier Thonnard, Marc Dacier, “A framework for attack patterns’ discovery in honeynet data”, Digital Investigation, Volume 5, Supplement, 2008, Pages S128-S139, ISSN 1742-2876.
  • [29] Alghamdi, Rubayyi. (2016). Hidden Markov Models (HMMs) and Security Applications. International Journal of Advanced Computer Science and Applications. 7. 10.14569/IJACSA.2016.070205.
  • [30] Dietterich T.G. (2000) Ensemble Methods in Machine Learning. In: Multiple Classifier Systems. MCS 2000. Lecture Notes in Computer Science, vol 1857. Springer, Berlin, Heidelberg
  • [31] Github Michel Oosterhof, “Cowrie - medium-interaction honeypot,”, last accessed on December 2015.
  • [32] Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). ”Learning representations by back-propagating errors”. Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0.
  • [33] Shlens, Jonathon. ”A tutorial on principal component analysis.” arXiv preprint arXiv:1404.1100 (2014).
  • [34] T. K. Moon, ”The expectation-maximization algorithm,” in IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47-60, Nov. 1996.
  • [35] Z. Ghahramani, An introduction to hidden markov models and bayesian networks, International journal of pattern recognition and artificial intelligence 15 (01) (2001) 9–42.
  • [36] M. Al-Qurishi, M. S. Hossain, M. Alrubaian, S. M. M. Rahman, and A. Alamri, “Leveraging Analysis of User Behavior to Identify Malicious Activities in Large-scale Social Networks,” IEEE Transactions on Industrial Informatics, 2017.
  • [37] J. Zhang, M. Zulkernine and A. Haque, ”Random-Forests-Based Network Intrusion Detection Systems,” in IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 5, pp. 649-659, Sept. 2008.