Analyzing CNN Based Behavioural Malware Detection Techniques on Cloud IaaS

02/15/2020 ∙ by Andrew McDole, et al. ∙ Tennessee Tech University, Manhattan College, UNCW

Cloud Infrastructure as a Service (IaaS) is vulnerable to malware due to its exposure to external adversaries, making it a lucrative attack vector for malicious actors. A datacenter infected with malware can cause data loss and/or major disruptions to service for its users. This paper analyzes and compares various Convolutional Neural Networks (CNNs) for online detection of malware in cloud IaaS. Detection is performed on behavioural data using process-level performance metrics such as CPU usage, memory usage, and disk usage. We use the state-of-the-art DenseNets and ResNets to effectively detect malware in an online cloud system. The CNNs extract features from data gathered from live malware running in a real cloud environment. Experiments are performed on an OpenStack (a cloud IaaS software) testbed designed to replicate a typical 3-tier web architecture. A comparative analysis across several metrics is performed for the different CNN models used in this research.







1 Introduction and Motivation

The cloud has become a popular platform due to its on-demand services, seemingly infinite resources, ubiquitous availability, and pay-as-you-go business model [20]. Infrastructure as a Service (IaaS) is the most widely offered service model, in which the resources of a large data center can be purchased by clients to perform computing tasks. Since clients can utilize any number of virtual machines, ranging from a couple to thousands, automatic monitoring of these virtual machines is necessary to ensure the security of the cloud provider and its clients. While there are several risks associated with IaaS, one of the greatest is the possibility of a virtual machine becoming infected with malware and spreading it to other virtual machines in the data center. This would endanger cloud providers and their customers, as well as end users whose data is stored or transferred on these infected virtual machines. As cloud providers increase their client base, the potential for loss also increases, and so does the providers' responsibility to invest in security mechanisms for their customers. The scale of an attack is multiplied by the similar configuration and automatic provisioning of the virtual machines (VMs) hosted by a cloud service provider: identical configurations make attacks repeatable and more likely to spread within the data center once a single machine is infected.

Static malware analysis is widely used: files are scanned before they can be executed on the system. The file is disassembled to obtain source code, which can then be examined using different tools. Although this method is fast and efficient, it can be easily dodged by malware writers, who trick disassemblers into generating incorrect code by inserting errors that hide or obfuscate the actual execution path. The binary file can also be analyzed directly, for example by extracting n-grams of the binary as features and then using machine learning techniques to locate known malicious patterns. Static analysis generally fails in the case of cloud malware, since the malware is injected into an application that was already scanned and deemed safe. Such an attack in cloud IaaS is referred to as a cloud malware injection [12]. In this case, if the application is not re-scanned at a later time, the newly injected malware will not be detected. Therefore, constant monitoring of applications running in cloud environments is essential.
While there are several works in the domain of malware detection, few research papers [2, 1, 3, 22, 7, 27, 29] deal with online malware detection specifically, and in particular with machine learning based approaches to it. The process follows a typical machine learning workflow: building a model, training it on a relevant captured dataset, and using the trained model to determine whether malware exists in the system. When building the model, features must first be selected to determine what data will be used as input. This is no different for cloud based detection methods, except that the candidate features are limited to the information that can be obtained through the hypervisor. Through careful feature selection, machine learning can provide dynamic malware analysis and detect when machines in the data center have been infected by adversaries. This kind of dynamic analysis fulfills the need for constant surveillance in cloud IaaS.
The most distinctive characteristics of cloud computing include resource pooling, on-demand self-service, and rapid elasticity, which can be fulfilled by an auto-scaling architecture. In this paper, we focus on auto-scaling, wherein machines are spawned based on demand; these VMs are usually of similar type, resulting in similar behaviour. It is likely that an injected malware will cause a behaviour deviation on a VM at some point. In this work, we seek to detect such malicious behaviour and compare state-of-the-art deep learning models on several parameters. We focus on detecting a single compromised VM, ignoring the fact that all similar VMs could be infected by an adversary in a more sophisticated attack; we plan to address that scenario as a next step.

This work is an extension of our earlier work, where only one kind of CNN model was used, with the prime goal of showing that such techniques can be effectively used for malware analysis. Here, we compare and contrast several CNN models, using the same data as [1, 2, 3] and six additional deep learning models, to determine possible use cases within a cloud IaaS scenario. For all models, the dataset consists of process-level metrics collected through the virtual machine hypervisor. Since these models are CNNs, the data is formatted as two dimensional matrices of size 120x45. Since many of our models require a three dimensional input shape, the 2D matrix is copied to fulfill the third dimension requirement.
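The channel replication described above can be sketched in a few lines (a NumPy sketch; the metric value is made up for illustration):

```python
import numpy as np

# Hypothetical sample: up to 120 processes x 45 per-process metrics,
# zero-padded for rows with no active process.
sample_2d = np.zeros((120, 45), dtype=np.float32)
sample_2d[0, 0] = 0.37  # e.g. CPU usage of the first process (made-up value)

# Models built for RGB images expect a depth axis, so the same matrix
# is stacked three times along a new last axis.
sample_3d = np.repeat(sample_2d[:, :, np.newaxis], 3, axis=2)

assert sample_3d.shape == (120, 45, 3)
assert np.array_equal(sample_3d[:, :, 0], sample_2d)
```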

The paper is organized as follows. Section 2 discusses related work in cloud online malware detection. Section 3 provides an overview of the key intuition and methodology for the experiments. Section 4 covers evaluation metrics and experimental results, whereas Section 5 presents a comparative analysis among the different CNN models used. Section 6 covers certain limitations of our approach along with a discussion of future work. Finally, Section 7 summarizes this paper.

2 Related Work

Considerable work has been done on malware detection, focusing on different aspects of several approaches. The first step in developing a machine learning based model for online malware detection is to determine which features are most relevant and should be extracted. Research papers [5, 24, 26] focus on API calls, whereas [18, 9, 7] primarily utilize system calls. Other features such as performance counters [8] or memory features [15, 30] have also been used. Although several resilience frameworks exist [25, 28, 29, 19], it is likely that novel attacks and new techniques will defeat existing detection methods.

Most of the algorithms for detecting malware, such as support vector machines (SVM), all-nearest-neighbor (ANN) classifiers [10], and naïve Bayes [11, 5], work by examining a single VM in the cloud. While a single running VM is not the expected use case of cloud environments, there is virtually no difference between a single VM and a standalone host when it comes to detecting malware on them. Generally, most works [18, 5, 24, 26, 9, 7, 8, 15, 30] focus on features that can be extracted through the hypervisor. Dawson et al. [7] collect system calls as features and are primarily concerned with rootkits. A non-linear phase-space algorithm is used in their analysis of system calls to detect anomalies, and the results are evaluated on phase-space graph dissimilarities.
Entropy-based Anomaly Testing (EbAT) was introduced in [27]. EbAT analyzed multiple metrics, such as CPU and memory utilization, for the purpose of anomaly detection, examining these metrics based on their distribution instead of a flat threshold. This approach yielded accurate detection results and the ability to scale to keep up with metric processing; however, the evaluation did not demonstrate usefulness in practical and realistic cloud environment scenarios. Azmandian et al. [6] utilize performance metrics such as disk and network input-output gathered from the hypervisor to form a new anomaly detection approach, using the unsupervised machine learning techniques k-NN and Local Outlier Factor.

Work by Abdelsalam et al. [1] showed that a black box approach can be used to detect malware. That paper used VM-level performance and resource utilization metrics. The approach worked well in detecting highly active malware, which showed up clearly in the resource utilization metrics, but was less effective in detecting malware that hid itself behind low utilization. Similarly, in [2] the authors introduce a detection method which uses a CNN model with the goal of identifying low-profile malware. This method achieved ~90% accuracy using resource metrics and was able to identify multiple low-profile malware. While these results are good, the approach is limited in that it targeted only a single virtual machine, like many other related works, without features like auto-scaling.

3 Key Intuition and Methodology

In this section, we discuss the key intuition behind our approach and describe our methodology in detail.

3.1 Key Intuition

To detect malware online using process-level information, we train a model on a dataset that contains benign and malicious samples. Each sample consists of information about a process or collection of processes, and the task is to classify the input sample as benign or malicious. To build up our dataset of benign samples, we run a Virtual Machine (VM) normally without the presence of malicious software. Malicious data samples are collected after the VM has been infected with malware.

Different malware are used for different runs of the experiment to create the dataset. We then partition our dataset into training, validation, and testing datasets. In other words, the model is trained on samples from different experiments which contained different malware. This way, the model generalizes to detect different malware through the various ways they reveal themselves in process metrics. A model's ability to generalize and predict new samples depends on its internal architecture: more complex models may achieve higher accuracy by adding more hidden layers or by connecting those hidden layers in a novel manner.
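This experiment-level partitioning (whole runs, each with a different malware, assigned to exactly one split) can be sketched as follows. The helper is hypothetical; the 60/20/20 proportions match the split used in our evaluation, and the shuffle seed is an arbitrary choice of ours:

```python
import random

def split_experiments(experiment_ids, train=0.6, val=0.2, seed=0):
    """Partition whole experiments so that no malware's samples appear
    in more than one of the train/validation/test sets."""
    ids = list(experiment_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

# 113 experiments, as in our data collection.
train_ids, val_ids, test_ids = split_experiments(range(113))
```

Splitting by experiment rather than by individual sample is what forces the model to generalize to unseen malware instead of memorizing a particular run.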

3.2 Methodology

Convolutional Neural Networks (CNNs) have been commonly used in various visual imagery tasks. A basic flowchart of a neural network is shown in Figure 1. CNNs generally take two dimensional data as input; in our case, the process level data is represented as a two dimensional array. A sample consists of rows of processes with columns of process features. Assuming p_i is a process, m_j is a process metric, and v is a virtual machine ID, then x_t^v is the sample for VM v at time t, as shown in Figure 2.

Figure 1: Neural Network Flow
Figure 2: Sample at Time t Consisting of Process Level Information

Each sample represents a single virtual machine at a given time interval, so the models learn what an infected machine "looks like" over time. Over the lifetime of an operating system, processes get created and destroyed, and since process IDs can be assigned and re-assigned to different processes, raw IDs provide no useful information for the task at hand. For this reason, we focus on a unique process, defined as a tuple that contains a process ID, the command used to run the process, and a hash of the binary executable. This unique process will be referred to simply as a process in this work. Once the training dataset has been used to train the model, the model generates predictions on an unseen test dataset that was not used during training.
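A minimal sketch of how such a sample matrix might be assembled (the helper names, the choice of SHA-256 as the hash, and the row ordering are our own illustrative assumptions; the paper specifies only the identifying tuple and the 120x45 sample shape):

```python
import hashlib
import numpy as np

MAX_PROCS, N_METRICS = 120, 45  # sample dimensions used in this work

def unique_process_key(pid, command, binary_bytes):
    # Raw PIDs get re-used by the OS over time, so a process is
    # identified by the tuple (pid, command, hash of its executable).
    return (pid, command, hashlib.sha256(binary_bytes).hexdigest())

def build_sample(metrics_by_process):
    """metrics_by_process: {unique_process_key: [45 metric floats]}.
    Rows for processes not active at this instant stay zero-padded."""
    sample = np.zeros((MAX_PROCS, N_METRICS), dtype=np.float32)
    for row, key in enumerate(sorted(metrics_by_process)[:MAX_PROCS]):
        sample[row] = metrics_by_process[key]
    return sample

key = unique_process_key(1234, "/usr/bin/python3", b"\x7fELF...")
sample = build_sample({key: [1.0] * N_METRICS})
```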

We used OpenStack, a popular cloud computing platform, to replicate a standard 3-tier web architecture consisting of a web server, application server, and database. Auto-scaling was enabled on the web and application servers, which were configured with a policy based on the average CPU utilization of the VMs. As per the policy, if the average CPU utilization is above 70%, the architecture scales out, and it scales in if the utilization is below 30%. We spawned between 2 and 10 servers in each tier depending on the traffic load. An ON/OFF Pareto distribution with the default NS2 tool parameters was used to generate the traffic load.
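The scale-out/scale-in policy can be expressed as a small threshold function (a sketch; the function and parameter names are ours, not OpenStack API names):

```python
def scaling_action(avg_cpu, n_servers, lo=0.30, hi=0.70, min_n=2, max_n=10):
    """Threshold policy from the testbed: scale out above 70% average
    CPU utilization, scale in below 30%, keeping 2-10 servers per tier."""
    if avg_cpu > hi and n_servers < max_n:
        return n_servers + 1  # scale out
    if avg_cpu < lo and n_servers > min_n:
        return n_servers - 1  # scale in
    return n_servers          # within band, or at a size limit

assert scaling_action(0.85, 4) == 5
assert scaling_action(0.20, 4) == 3
assert scaling_action(0.50, 4) == 4
```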

Figure 3 shows the data collection process. Each experiment was 1 hour long, consisting of a 30 minute clean phase and a 30 minute infected phase. During the clean phase, the virtual machines were untouched. During the infected phase, malware was injected into a virtual machine at some point after the phase started. We used 113 different malware samples, obtained from VirusTotal, to collect our dataset. The VMs were configured with full internet access and all firewalls were disabled, so that the malware could operate without interference. A sample was collected from the infected virtual machine every 10 seconds, resulting in 360 samples over the course of each experiment.

Figure 3: Data Collection Overview
Figure 4: LeNet-5 Model

3.3 Convolutional Neural Network Models

3.3.1 LeNet-5 [17]:

LeNet-5 is an example of a shallow CNN. It has few layers, so its gradients can be computed quickly. Figure 4 shows the model architecture. Note that the architecture is simple and straightforward: the output of each layer serves as the input to the next.

The input to the model is a 2 dimensional matrix of 120x45, representing a sample with a maximum of 120 processes and 45 features per process. Rows for processes that were not active at the time the sample was taken, but would become active during the course of the experiment, were padded with zeroes. The first layer of LeNet-5 is a convolutional layer with 32 kernels, each of size 5x5; its output is 32 feature maps with the same shape as the input, 120x45. A max pooling layer of size 2x2 then downsizes these feature maps to 60x23. The second convolutional layer applies 64 kernels to the 60x23 output of the first max pooling layer and is followed by another 2x2 max pooling layer, resulting in 64 feature maps of size 30x12. The final layers of LeNet-5 are fully connected, with sizes 1024, 512, and 2 respectively. The final layer has an output of size 2 since it represents a binary prediction: malicious or benign.
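These layer sizes can be checked with a quick shape walk-through (assuming 'same'-padded convolutions and ceiling rounding in the 2x2 pooling, which reproduces the 60x23 and 30x12 figures above):

```python
def pool2x2(h, w):
    # 2x2 max pooling with ceiling rounding (so 23 -> 12, not 11)
    return -(-h // 2), -(-w // 2)

h, w = 120, 45            # input: up to 120 processes x 45 metrics
# conv1: 32 'same'-padded 5x5 kernels -> 32 maps, still 120x45
h, w = pool2x2(h, w)      # pool1 -> 60x23
assert (h, w) == (60, 23)
# conv2: 64 'same'-padded kernels -> 64 maps, still 60x23
h, w = pool2x2(h, w)      # pool2 -> 30x12
assert (h, w) == (30, 12)
flattened = 64 * h * w    # feeds fully connected layers 1024 -> 512 -> 2
assert flattened == 23040
```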

All activation functions are Rectified Linear Units (ReLU) [4], placed after every convolutional and fully connected layer except the final layer. We used the Adam optimizer [16], a stochastic gradient descent algorithm with automatic learning rate adaptation, which updates the weights of the model after every mini-batch. The learning rate controls how drastically the weights change in response to backpropagation: a higher learning rate leads to faster training but can result in unstable gradient descent and inhibit convergence, while a learning rate that is too low can prevent the model from reaching higher accuracy.
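For reference, a single Adam update for one weight looks as follows (a minimal sketch of the standard update rule from [16]; the hyperparameter defaults shown are the commonly used ones, not tuned values from our experiments):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and its square (v), with bias correction, give each weight its own
    adaptive effective learning rate."""
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# One step on a single weight with gradient 2.0 from zero moment state.
w, m, v = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
```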

Figure 5: Residual Block Diagram
Figure 6: Data Input Shape with Window Size 3

3.3.2 Residual Networks:

One problem with models with a large number of layers is degradation [13]: the observation that adding more layers to a network can lead to optimization problems and therefore lower accuracy. This degradation is caused by backpropagation failing to reach the initial layers of the model. Residual networks (ResNets) solve this issue by adding skip connections, or residual connections. These shortcut paths between layers allow the gradient to flow better through the model, so deeper models can be trained without degradation.

Residual blocks, as shown in Figure 5, are the building blocks of ResNets [13]. The identity mapping is the shortcut connection, and it is what allows backpropagation to affect the initial layers and lets them learn as quickly as the final layers in the model. Three ResNets were used in our work: ResNet-50, ResNet-101, and ResNet-152. Each ResNet required the depth (window size) of the input data to be three, but the samples were all 2D matrices, so each sample was replicated twice more to form 3 dimensional data; a representation of this data is shown in Figure 6. At the end of each model, global average pooling was added.
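The effect of the shortcut connection can be illustrated with a toy fully-connected residual block (a NumPy sketch, not the convolutional blocks used in the actual ResNets):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Toy residual block: relu(W2 @ relu(W1 @ x) + x).
    The identity shortcut lets gradients reach earlier layers directly."""
    out = relu(w1 @ x)      # first weight layer + activation
    out = w2 @ out          # second weight layer
    return relu(out + x)    # add the shortcut, then final activation

x = np.array([1.0, 2.0])
zeros = np.zeros((2, 2))
# With zeroed weights the block reduces to the identity on non-negative
# inputs -- the property that makes very deep stacks trainable.
assert np.allclose(residual_block(x, zeros, zeros), x)
```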

Figure 7: Dense Networks

3.3.3 Dense Networks:

Where ResNets seek to resolve the gradient degradation problem, DenseNets [14] attempt to alleviate the vanishing gradient problem [23]. A generic DenseNet model is shown in Figure 7. DenseNets differ from ResNets in that, instead of having an identity mapping from one layer to the next, DenseNets pass the outputs of each layer to all subsequent layers. This way, each layer has the collective knowledge of all preceding layers, causing feature maps to be 'reused' by later layers. Due to this reuse, fewer feature maps are required as input because of the compounding nature of DenseNets.
Each dense block makes use of these identity mappings and feature reuse. Between dense blocks there are transition layers, comprised of a convolution and a pooling layer, meant to reduce the feature map size between dense blocks. All DenseNet models received the same input shape as the ResNet models, 120x45x3. The batch size was 64 for all models, to maintain consistency.
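The feature-reuse pattern can be illustrated with a toy dense block (a NumPy sketch; real DenseNet layers are convolutions, and the growth rate of 4 here is arbitrary):

```python
import numpy as np

def dense_block(x, layers):
    """Toy dense block: every layer receives the concatenation of the
    block input and all preceding layers' outputs (feature reuse)."""
    features = [x]
    for layer in layers:
        features.append(layer(np.concatenate(features)))
    return np.concatenate(features)

# Three layers, each emitting 4 features (the 'growth rate'):
# output size = 8 input features + 3 * 4 = 20.
out = dense_block(np.ones(8), [lambda v: np.ones(4)] * 3)
assert out.size == 20
```

Because each layer only adds a small fixed number of new feature maps on top of the shared ones, DenseNets stay parameter-efficient despite their depth.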

4 Experimental Evaluation and Results

4.1 Evaluation

For our comparative analysis, we have used four evaluation metrics:

True Positives (TP) is the number of correctly identified malicious samples. True Negatives (TN) is the number of correctly identified benign samples. False Positives (FP) is the number of samples that were benign but identified as malicious. False Negatives (FN) is the number of samples that were malicious but not identified as such by the model.

Accuracy is the proportion of correct classifications. Precision is the fraction of positive predictions that are actually positive. Precision is important because if it is low, the model is predicting many benign samples to be infected; in a cloud data center, this can hurt the availability of many services whose samples are incorrectly classified as malicious. Recall is the fraction of actual positives that are detected. This metric is important because it reveals how often infected samples get through the model without detection; recall matters most when the cost of a false negative is high, as is the case with identifying malware. The F1 score is used whenever a balance between precision and recall is needed and there is a large imbalance in the dataset.
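These four metrics follow directly from the confusion-matrix counts (the counts below are illustrative only, not taken from Table 1):

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics computed from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 80 malware caught, 20 missed, 10 false alarms.
acc, prec, rec, f1 = classification_metrics(tp=80, tn=90, fp=10, fn=20)
assert abs(acc - 0.85) < 1e-9 and abs(rec - 0.80) < 1e-9
```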

Model Accuracy Precision Recall F1 Detection Time (ms)
LeNet-5 89.2 94.7 80.9 87.2 54
ResNet-50 88.4 86.0 88.9 87.4 96
ResNet-101 86.6 82.3 89.7 85.9 130
ResNet-152 89.5 89.0 87.8 88.4 165
DenseNet-121 92.9 100 84.6 91.5 164
DenseNet-169 92.8 99.7 84.4 91.4 209
DenseNet-201 92.8 99.5 84.6 91.5 249
Table 1: Results for Different Evaluation Metrics

4.2 Experiment Results

Table 1 shows the results for each of the CNN models considered in this research. While each model was trained over the course of 100 epochs, these numbers were taken from the epoch at which the model scored highest on the validation dataset; that is, they represent the best case for each model. If these models were deployed in a cloud environment, they would be trained up to the point at which they generate their best results. This point can differ for every model, so it is important to pick out each model at its best rather than compare models at some arbitrary point, such as after a fixed number of epochs.

The dataset consisted of 113 data collection experiments, split into training (60%), validation (20%), and testing (20%) sets. The training dataset was shuffled, but the validation and testing datasets were not. The DenseNets reached the highest accuracy at almost 93% and precision at ~100%. The DenseNets also had the highest F1 scores at ~91.5%. ResNet-101 had the best recall score at 89.7%.

Figure 8: Metrics Comparison for used CNN Models
Figure 9: ROC Curves

5 Comparative Analysis and Discussion

As stated in Section 4.1, the comparative analysis is performed using four metrics. We discuss each of the metrics along with the ROC curves. Additionally, we discuss the detection time of the models. Finally, we provide an overall analysis and takeaway which sheds light on the importance of choosing the right model based on the use case and intention. Results for all performance metrics are shown in Figure 8.

5.1 Performance Analysis

Accuracy. The base model LeNet-5 reaches an accuracy of ~89%. This is expected: as a shallow model, it lacks the capacity to capture enough features. The DenseNet-121 model has the highest accuracy at ~93%, with a negligible difference compared to DenseNet-169 and DenseNet-201. This indicates that adding more layers did not increase accuracy. One reason might be that our dataset is limited (i.e., ~40k samples) and deeper networks need more data.

ResNet-152 has a slightly better accuracy than LeNet-5. Considering the substantially longer training time for ResNet-152, such slight accuracy increase from LeNet-5 might not be worthwhile in some cases. Note that ResNet might perform better considering other metrics and, in turn, might work in different scenarios. ResNet-50 and ResNet-101 have the lowest accuracy.

The DenseNets performed better than the other models likely due to the feature reuse property of the dense blocks. Also, DenseNet models are more feature efficient than the other models.

Precision. The DenseNet models highly outperformed the other models in precision. DenseNet-121 achieved a precision of 100%, meaning that every sample classified as infected was indeed infected. DenseNet-169 also achieved a high precision of 99.7% followed by DenseNet-201 with a precision of 99.5%.

The ResNet models have noticeably lower precision than all the other models, indicating that they incorrectly classify benign samples as malicious more often. LeNet-5 achieved a high precision score, so it would be more appealing than the ResNet models when some false positives can be tolerated. The high precision achieved by all the DenseNet models indicates that they correctly identified benign samples more often and were less likely to classify a sample as malicious unless they had high confidence.

Recall. Recall is the only metric where the ResNets performed better than the other models. All three ResNet models were close, but ResNet-101 was the best. The DenseNet models performed worse than the ResNet models, and LeNet-5 performed the worst by far. Since recall measures how many infected samples were missed by a model, the ResNets seem effective at identifying most infected samples. LeNet-5's low recall score suggests that the model is weak at identifying less obvious malicious samples. This would be a large problem in data centers, where the collected samples form an unbalanced dataset: there should be an overwhelming number of benign samples before machines are infected and malicious samples begin to show up, and a low-recall model would be less reliable at flagging the malware as soon as it appears. The higher recall scores demonstrated by the ResNet models stem from the models being more sensitive and classifying more samples as malicious; they predicted that a sample was malicious more often and were better at identifying malware that was not as "obvious" in the performance metrics.

F1 Score. The F1 score reflects the balance of precision and recall. In that regard, the DenseNet models scored the highest, which indicates that they have the best balance between flagging only malicious samples and identifying most of the infected samples overall.

ROC Curves. Receiver operating characteristic (ROC) analysis [21] is used for comparing models across different classification thresholds. Our ROC curves are shown in Figure 9. The ROC curve measures a model's ability to distinguish between classes; in our experiments, it measures the models' ability to detect malware. If the ROC curve for a model is close to the diagonal, the model has little ability to differentiate between classes. A common way to summarize the ROC curve is the area under the curve (AUC): the higher the AUC, the more accurately the model predicts benign samples as benign and malicious samples as malicious. The best performing models were the DenseNets, owing to their high precision scores, which involve both TP and FP values.
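AUC can be computed directly from model scores via its rank interpretation (a sketch; the labels and scores below are illustrative, not experimental values):

```python
def auc_score(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen malicious
    sample (label 1) is scored higher than a randomly chosen benign one
    (label 0); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation gives AUC 1.0; a model at the diagonal gives 0.5.
assert auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]) == 1.0
assert auc_score([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]) == 0.75
```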

5.2 Cost Analysis

Training Time. Table 2 shows the training time each model needed to reach its respective accuracy. LeNet-5 trained ten times faster than the next fastest model, making it viable as a model to quickly process large volumes of data. DenseNet-201 and DenseNet-169 took much longer to train than DenseNet-121 while reaching similar accuracy, making them less desirable.

Figure 10: Highest Validation Accuracies Achieved
Model Validation Accuracy Epoch Reached Time Elapsed (s)
LeNet-5 89.9 29 170
ResNet-50 90.7 67 1815
ResNet-101 87.0 60 2940
ResNet-152 88.7 99 7029
DenseNet-121 92.1 32 1683
DenseNet-169 91.9 81 5848
DenseNet-201 91.5 36 3060
Table 2: Time to Reach Highest Accuracy

Detection Time. Detection time shows how long, in milliseconds, each model took to produce a prediction for a given sample. The results are unsurprising: more layers cause a model to take longer to feed the input through. This matters because samples in a data center may be collected faster than a given model can process predictions. The detection time differences also suggest that some models may not be suited for lower specification hardware. Since detection time depends on how quickly the model can process the input, increasing the input size or the volume of inputs could prevent some models from scaling with large data center operations. In these cases, the models with lower detection times may be preferable.

Figure 11: Training and Validation Loss for used CNN Models

5.3 Overall Analysis

Overall, the DenseNet models were the most accurate models with the best balance between precision and recall. Their comparatively low recall scores, though, might be an issue for our use case, where allowing malicious samples to slip through could be disastrous. It is also worth noting that while most of the models had validation accuracies that converged to a value, ResNet-101 and ResNet-152 had large fluctuations and never seemed to settle into a value, as can be seen in Figure 11. If ResNet-101 and ResNet-152 had converged, they might have achieved better scores. Taking detection time into account, and assuming the volume of input does not overwhelm a model's prediction capability, the DenseNet models would be preferable due to their high accuracy and near perfect precision.
Figure 10 and Table 2 show the points in training where the models reached their highest validation accuracy. The time elapsed column shows the total time needed to reach the epoch at which those highest accuracy numbers were achieved. For example, DenseNet-121 reached its highest accuracy after 32 training epochs, which took 1683 seconds. This shows that DenseNet-121 could be trained for less time than DenseNet-169 or DenseNet-201 and still attain better accuracy.

6 Limitations and Challenges

Although our results provide a good understanding of which CNN model works best in which kind of scenario, there are some limitations we would like to highlight based on our experience. The most important limitation of using CNNs on this type of data is that they fail to capture time correlations in the dataset. When detecting malware in an already running virtual machine, it is important for a model to have some knowledge of previous samples and the behavior of the machine over time. One such scenario is when a machine begins to experience more traffic and, due to some constraint on scaling, the samples generated from that machine begin to resemble malicious samples. In this case, if the model does not learn that process metrics can scale with valid demands on the machine, the false positive rate may increase. Another scenario is when the model detects an infected sample but the malware immediately becomes dormant to hide itself. If the model does not take into account the previous sample in which the malware was detected, false negatives may increase, with the model failing to detect malware even though it is merely hidden.

These limitations can be mitigated by using Recurrent Neural Networks (RNNs). RNNs are composed of cells with a memory mechanism and can learn relationships in data with respect to time; such models are used to process sequences of data such as audio or text. Our preliminary look at RNN models suggests that they could address some of the issues discussed above by lowering false positives and false negatives in certain scenarios.

Another limitation of this paper is the number of malware samples used. We used roughly 120 malware samples; we believe that with more samples the CNN models could have performed better. The deeper networks, such as DenseNet-201 and ResNet-152, may perform better on malware that affects the system very little, and the complexity of those networks may allow them to learn from such samples better than a shallower model. Increasing the amount of malware would also give the models a broader dataset from which to generalize their predictive power. In addition, once malware is injected, there is no guarantee that the malware is exhibiting malicious behavior at any given moment without knowing what code was being executed when the sample was recorded. This can lead to samples being mislabeled as malicious or benign. This problem was addressed in [2], but without writing custom malware that beacons when malicious activity begins and ends, it is unlikely that all samples will be labeled properly.

7 Conclusion and Future Work

In this paper, we analyzed seven different convolutional neural network models to determine which is best suited for malware detection in cloud IaaS. Our analysis shows that the LeNet-5 model is quick but sacrifices accuracy. The model is still useful, as it attains close to 90% accuracy and can be used in situations where a quick prediction is needed and incorrectness is not too costly; early predictions made with LeNet-5 can also be rechecked with more complex models. Our analysis further suggests that while the residual networks performed well, averaging ~86% accuracy, the DenseNet models performed best at ~93% accuracy. The ResNet models have higher recall scores, indicating that they are better suited for cases where failing to identify malware poses a great security risk. The DenseNet models have higher accuracy and precision, indicating that they are less likely to generate false positives, which is useful in IaaS environments where service availability is extremely important. For future work, we plan to examine more malware samples, including Windows malware, as well as other architectures such as Hadoop and containers. We also plan to analyze and propose new deep learning techniques, infecting multiple VMs to replicate more sophisticated attack scenarios.


  • [1] M. Abdelsalam, R. Krishnan, and R. Sandhu (2017) Clustering-based iaas cloud monitoring. In Proc. of IEEE International Conference on Cloud Computing (CLOUD), pp. 672–679. Cited by: §1, §1, §2.
  • [2] M. Abdelsalam et al. (2018) Malware detection in cloud infrastructures using convolutional neural networks. In Proc. of IEEE International Conference on Cloud Computing (CLOUD), pp. 162–169. Cited by: §1, §1, §2, §6.
  • [3] M. Abdelsalam et al. (2019) Online malware detection in cloud auto-scaling systems using shallow convolutional neural networks. In Proc. of IFIP Annual Conference on Data and Applications Security and Privacy, Cited by: §1, §1.
  • [4] A. F. Agarap (2018) Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375. Cited by: §3.3.1.
  • [5] M. Alazab et al. (2011) Zero-day malware detection based on supervised learning algorithms of api call signatures. In Proc. of the Australasian Data Mining Conference, AUS, pp. 171–182. External Links: ISBN 9781921770029 Cited by: §2.
  • [6] F. Azmandian et al. (2011) Virtual machine monitor-based lightweight intrusion detection. ACM SIGOPS Operating Systems Review 45 (2), pp. 38–53. Cited by: §2.
  • [7] J. A. Dawson et al. (2018) Phase space detection of virtual machine cyber events through hypervisor-level system call analysis. In Proc. of IEEE International Conference on Data Intelligence and Security (ICDIS), pp. 159–167. Cited by: §1, §2.
  • [8] J. Demme et al. (2013) On the feasibility of online malware detection with performance counters. ACM SIGARCH Computer Architecture News 41 (3), pp. 559–570. Cited by: §2.
  • [9] G. Dini et al. (2012) MADAM: a multi-level anomaly detector for android malware. In Computer Network Security, I. Kotenko and V. Skormin (Eds.), Berlin, Heidelberg, pp. 240–253. External Links: ISBN 978-3-642-33704-8 Cited by: §2.
  • [10] Y. Fan, Y. Ye, and L. Chen (2016) Malicious sequential pattern mining for automatic malware detection. Expert Systems with Applications 52, pp. 16–25. Cited by: §2.
  • [11] I. Firdausi et al. (2010) Analysis of machine learning techniques used in behavior-based malware detection. In Proc. of IEEE International conference on advances in computing, control, and telecommunication technologies, pp. 201–203. Cited by: §2.
  • [12] N. Gruschka et al. (2010) Attack surfaces: a taxonomy for attacks on cloud services. In Proc. of IEEE international conference on cloud computing, pp. 276–279. Cited by: §1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: §3.3.2, §3.3.2.
  • [14] G. Huang, Z. Liu, and K. Q. Weinberger (2016) Densely connected convolutional networks. CoRR abs/1608.06993. External Links: Link, 1608.06993 Cited by: §3.3.3.
  • [15] K. N. Khasawneh et al. (2015) Ensemble learning for low-level hardware-supported malware detection. In Proc. of International Symposium on Recent Advances in Intrusion Detection, pp. 3–25. Cited by: §2.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.1.
  • [17] Y. LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3.3.1.
  • [18] P. Luckett et al. (2016-04) Neural network analysis of system call timing for rootkit detection. In Proc. of Cybersecurity Symposium (CYBERSEC), Vol. , pp. 1–6. External Links: Document, ISSN null Cited by: §2.
  • [19] A. K. Marnerides et al. (2015) A multi-level resilience framework for unified networked environments. In Proc. of IFIP/IEEE International Symposium on Integrated Network Management (IM), pp. 1369–1372. Cited by: §2.
  • [20] P. Mell, T. Grance, et al. (2011) The NIST definition of cloud computing. Cited by: §1.
  • [21] C. E. Metz (2006) Receiver operating characteristic analysis: a tool for the quantitative evaluation of observer performance and imaging systems. Journal of the American College of Radiology 3 (6), pp. 413–422. Cited by: §5.1.
  • [22] H. S. Pannu, J. Liu, and S. Fu (2012) Aad: adaptive anomaly detection system for cloud computing infrastructures. In Proc. of IEEE Symposium on Reliable Distributed Systems, pp. 396–397. Cited by: §1.
  • [23] R. Pascanu, T. Mikolov, and Y. Bengio (2012) Understanding the exploding gradient problem. CoRR abs/1211.5063. External Links: Link, 1211.5063 Cited by: §3.3.3.
  • [24] R. S. Pirscoveanu et al. (2015) Analysis of malware behavior: type classification using machine learning. In Proc. of IEEE International conference on cyber situational awareness, data analytics and assessment, pp. 1–7. Cited by: §2.
  • [25] J. P. Sterbenz et al. (2010) Resilience and survivability in communication networks: strategies, principles, and survey of disciplines. Computer Networks 54 (8), pp. 1245–1265. Cited by: §2.
  • [26] S. Tobiyama et al. (2016) Malware detection with deep neural network using process behavior. In Proc. of IEEE Annual Computer Software and Applications Conference, Vol. 2, pp. 577–582. Cited by: §2.
  • [27] C. Wang (2009) Ebat: online methods for detecting utility cloud anomalies. In Proc. of the Middleware Doctoral Symposium, pp. 1–6. Cited by: §1, §2.
  • [28] M. R. Watson et al. (2013) Towards a distributed, self-organising approach to malware detection in cloud computing. In Proc. of International Workshop on Self-Organizing Systems, pp. 182–185. Cited by: §2.
  • [29] M. R. Watson et al. (2015) Malware detection in cloud computing infrastructures. IEEE Transactions on Dependable and Secure Computing 13 (2), pp. 192–205. Cited by: §1, §2.
  • [30] Z. Xu et al. (2017) Malware detection using machine learning based analysis of virtual memory access patterns. In Proc. of IEEE Design, Automation & Test in Europe Conference & Exhibition, 2017, pp. 169–174. Cited by: §2.