Deep Learning in Information Security

09/12/2018, by Stefan Thaler et al.

Machine learning has a long tradition of helping to solve complex information security problems that are difficult to solve manually. Machine learning techniques learn models from data representations to solve a task. These data representations are hand-crafted by domain experts. Deep Learning (DL) is a sub-field of machine learning that uses models composed of multiple layers. Consequently, the representations that are used to solve a task are learned from the data instead of being manually designed. In this survey, we study the use of DL techniques within the domain of information security. We systematically reviewed 77 papers and present them from a data-centric perspective. This data-centric perspective reflects one of the most crucial advantages of DL techniques -- domain independence. If DL methods succeed in solving problems on a data type in one domain, they will most likely also succeed on similar data from another domain. Other advantages of DL methods are unrivaled scalability and efficiency, both regarding the number of examples that can be analyzed and the dimensionality of the input data. DL methods are generally capable of achieving high performance and generalizing well. However, information security is a domain with unique requirements and challenges. Based on an analysis of our reviewed papers, we point out the shortcomings of DL methods with respect to those requirements and discuss further research opportunities.


1 Introduction

Information security (InfoSec) addresses the protection of information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide confidentiality, integrity, and availability [8]. Many InfoSec challenges involve the analysis of large amounts of data. For some of these challenges, it is impractical to analyze the data manually or to write software, because either the volume of the data is too large, or formalizing all the knowledge a computer needs to solve the problem is a challenging task in itself. To overcome such problems, many methods based on machine learning have been proposed [68]. Machine learning techniques are computer algorithms that learn a model from experience to solve a task according to a performance measure [94]. Examples of tasks that can be addressed with machine learning are classification, regression, and anomaly detection.

Deep Learning (DL) is a sub-field of machine learning. DL techniques use models that consist of multiple layers of abstraction. In contrast to traditional machine learning, DL models can learn representations that are useful for solving a task from data, instead of relying on expert-designed representations. Commonly, the 'deep' refers to the depth of the model in terms of layers. In the past few years, an increase in available data and computing power has led to remarkable successes of DL techniques in many domains, e.g., image classification [78], speech recognition [58] or end-to-end machine translation [42].

DL algorithms have many properties that make them attractive for solving InfoSec problems. First, DL techniques scale very well to large amounts of training data and also scale well with respect to the number of model parameters [128]. Scalability is of great interest in InfoSec, since the amount of data that is produced and needs to be analyzed grows exponentially. The amount of data produced worldwide is projected to grow to 163 zettabytes by 2025 [115]. Secondly, DL techniques are in general capable of learning from high-dimensional data. In contrast, traditional ML algorithms suffer from the curse of dimensionality [72]. That is, high-dimensional input data is problematic for methods that require statistical significance to learn useful properties from the data. Furthermore, DL models are capable of learning distributed representations, which can be exponentially more expressive than non-distributed representations. Distributed representations are composed of many elements that can be set separately from each other. This is powerful because a representation with n features, each taking k values, can describe k^n different concepts [13]. Finally, since DL models can learn representations from data, they reduce the need for feature engineering. Many shallow machine learning approaches need hand-crafted features to solve a task. Such feature engineering is labor-intensive, and hand-crafted features are often brittle and problem-specific. The capability to learn features from data also allows solutions to be transferred from one domain to another, as long as the underlying data type is the same.

DL algorithms have many compelling properties and have recently become hugely successful in other domains. Hence, in this literature review we study the application of DL technologies to InfoSec problems. We aim to understand how DL can be used to tackle InfoSec problems, which tasks have been addressed, and which challenges remain. In detail, our contributions are:

  • We conduct a systematic literature review on DL research in InfoSec. We conduct this review from a data-centric perspective. We propose to classify the papers along three dimensions: data-type, task, and model. We present our classification scheme in Section 3. We review 66 papers that apply DL to five different data types: sequential data in Section 5.1, spatial data in Section 5.2, structured data in Section 5.3, text data in Section 5.4 and multi-modal data in Section 5.5. Additionally, we present 11 papers that study security properties of DL algorithms, such as privacy and integrity, in Section 5.6.

  • We discuss the applications in current approaches in Section 6 and highlight their strengths. We then touch on special requirements in the InfoSec domain and outline future challenges and research opportunities for DL in InfoSec.

The remainder of this paper is structured as follows. In Section 2, we introduce the necessary background for this paper. We first briefly state the fundamental goals of InfoSec and the role of machine learning in InfoSec (Section 2.1). Then, we give an overview of the fundamental components of DL techniques (Section 2.2), which is followed by an overview of common DL architectures (Section 2.3), data types (Section 2.4) and tasks (Section 2.5). Thereafter, we introduce our classification scheme (Section 3) and detail our survey methodology (Section 4). We present the actual survey in Section 5. Section 5 consists of five sub-sections, one for each data type that we distinguish. Reviewed papers are grouped first according to the data type that is analyzed, and then according to the task that is addressed. In Section 6 we discuss challenges and outline future research directions for DL in InfoSec.

2 Background

2.1 Information Security

InfoSec is the field of protection of information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction to provide confidentiality, integrity, and availability [75].

InfoSec is structured around fundamental security objectives [145]: Data confidentiality, which assures that private information is not made available or disclosed to unauthorized individuals; Privacy, which assures that individuals control or influence what information related to them may be collected and stored and by whom and to whom that information may be disclosed; Data integrity, which assures that information and programs are changed only in a specific and authorized manner; System integrity, which assures that a system performs its intended function in an unimpaired manner, free from deliberate or inadvertent unauthorized manipulation of the system; Availability, which assures that systems work promptly, and service is not denied to authorized users; Authenticity, which assures that users are who they claim to be and that each input arriving at the system came from a trusted source; and Accountability, which assures that actions of an entity are traceable uniquely to that entity.

Most problems in InfoSec are complex and challenging to tackle for many reasons. The assets that need to be continuously protected evolve and are often organized into large, complex systems. Attackers continuously develop new methods for attacking systems. Hand-crafted or statistically derived mechanisms to address modern-day security problems become increasingly difficult to design and are labor-intensive to maintain. Machine learning techniques allow computer programs to solve InfoSec challenges that would otherwise be difficult to solve by learning solutions from data. Joseph et al. provide an overview of the use of machine learning for InfoSec [68].

2.2 Deep Learning

DL is a fast-growing field, and a comprehensive overview is out of scope for this paper. Instead, in this section we briefly introduce the main components of a DL algorithm. DL-based methods commonly consist of four components: a model, data, an objective and an optimization procedure. We sketch the relationship between data, model and objective in Figure 1. In the following paragraphs, we introduce these four components.

Figure 1: DL overview. A model, which is a parametrized function f, maps data x from the input space into the target space. An objective function and the ground truth y are used to calculate a loss L. The loss measures how 'wrong' the model's prediction is for the given x.
Model

The model is a parameterized function, which maps from the input data, which often has a high dimensionality, to target data, which often has a lower dimensionality. An optimization procedure learns the parameters of such a model. Furthermore, a model can be composed in different ways, commonly referred to as its architecture. We present different ways to construct such composite models in Section 2.3.

Data

Data is an essential ingredient of DL approaches. Since DL methods learn representations that are useful for solving a task from data, the learned model can only be as representative of the problem as the data it was trained on. If the data is not representative, the model most likely is not either. Mostly, the data used to train DL models is high-dimensional. One of the fundamental assumptions is that such high-dimensional data concentrates around a low-dimensional manifold, and that the DL algorithm is capable of finding such lower-dimensional manifolds.

Objective

The objective is a function that defines how the parameters are learned. It tells a model the prediction error towards the ground truth. It is often also referred to as goal-, cost-, or loss-function. An example objective is the Mean Squared Error.
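For illustration, the following minimal sketch (plain Python/NumPy with made-up values) computes the Mean Squared Error between a model's predictions and the ground truth:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Mean Squared Error: the average squared difference between
    ground truth and model predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# The loss is 0 only when every prediction matches the ground truth.
print(mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # 0.4166...
```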

Optimization procedure

The optimization procedure describes the steps to perform the actual learning. In DL, the optimization procedure is almost always a form of mini-batch stochastic gradient descent (SGD). Mini-batch stochastic gradient descent performs the following steps for a subset of the overall data: a forward pass, which calculates the predictions given the model and the data; a backward pass, which calculates the gradients of all parameters with respect to the objective; and a parameter update step, which updates the parameters of the model. More recently, the DL community has switched to momentum-based forms of mini-batch SGD such as Adam [73] or RMSprop [155].
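The following is a minimal, illustrative sketch of these three steps for a simple linear model trained with an MSE objective; the model, learning rate, batch size and toy data are assumptions made purely for illustration, not part of any reviewed approach.

```python
import numpy as np

def sgd_train(X, y, lr=0.01, batch_size=32, epochs=50, seed=0):
    """Minimal mini-batch SGD for a linear model y_hat = X @ w + b with MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                    # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            y_hat = xb @ w + b                        # forward pass: predictions
            err = y_hat - yb
            grad_w = 2 * xb.T @ err / len(idx)        # backward pass: gradients of MSE
            grad_b = 2 * err.mean()
            w -= lr * grad_w                          # parameter update step
            b -= lr * grad_b
    return w, b

# Toy usage: approximately recover w = [2, -1], b = 0.5 from noisy data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.01 * rng.normal(size=1000)
print(sgd_train(X, y))
```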

For a more comprehensive overview of DL, we would like to point to excellent resources that attempt to provide one. Jürgen Schmidhuber provides a historical overview of the field up to now [121]. In this overview, he summarizes popular algorithms and attributes them chronologically. Bengio et al. provide an overview of the DL field [10], and later with a focus on unsupervised, representation learning approaches [13]. Furthermore, Deng et al. present another overview of DL methods and applications [37]. Goodfellow et al. published a textbook on DL [53], which is also available online at http://www.deeplearningbook.org/.

2.3 Deep Learning Architectures

DL models are composed of different "building blocks" – often called layers – and the composition of these building blocks is commonly referred to as the model architecture. Mathematically speaking, a layer is a function that maps some input x to some output y, and is usually a parameterized function. Here we list and briefly describe common DL model building blocks. Describing them or their functioning in detail is out of the scope of this survey paper, especially since many of the technologies rapidly evolve and adapt. Instead of providing details, we attempt to explain these building blocks on a high level, highlight common use cases and reference important developments in the field. In this section, we first introduce common layers, and then introduce a few popular composite architectures.

As the term "building block" hints, DL building blocks can be combined almost arbitrarily. Three examples of such building blocks are convolutional layers, recurrent layers and fully connected layers. A model architect can combine these building blocks into a model for a sequence analysis task, where the convolutional layer reduces the input dimension of the sequence and abstracts local, spatial correlations, the recurrent layer models temporal connections, and the fully connected layer models complex relationships of the temporal model that are useful for the analysis task; a sketch of such a composition is shown below.
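The following sketch illustrates such a composition, assuming PyTorch; the layer sizes, channel counts and the binary classification output are assumptions for demonstration only:

```python
import torch
import torch.nn as nn

class ConvRecurrentClassifier(nn.Module):
    """Sketch of the composition described above: a 1D convolution abstracts
    local patterns, an LSTM models temporal structure, and a fully connected
    layer maps the final state to class scores."""
    def __init__(self, in_channels=1, conv_channels=16, hidden=32, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, conv_channels, kernel_size=5, stride=2)
        self.lstm = nn.LSTM(conv_channels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, channels, seq_len)
        h = torch.relu(self.conv(x))      # (batch, conv_channels, reduced_len)
        h = h.permute(0, 2, 1)            # (batch, reduced_len, conv_channels)
        _, (h_n, _) = self.lstm(h)        # h_n: (1, batch, hidden)
        return self.fc(h_n[-1])           # one score vector per sequence

# Toy usage on random data.
model = ConvRecurrentClassifier()
print(model(torch.randn(4, 1, 100)).shape)  # torch.Size([4, 2])
```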

Which building blocks to use, how many of them, and how to combine them often depends on the data and the task at hand, as well as on the researcher addressing the task. Decisions often involve domain knowledge and knowledge about the data structure. A few guidelines and best practices exist [97], such as random search for hyper-parameter optimization [14], advice for restricted Boltzmann machines [60], tips for using stochastic gradient descent [22], design of deep architectures [11], or how to calculate the size of the receptive field of convolutional neural networks [41].

We provide an overview of common composite models in Figure 2. There are many other model architectures, for example, Generative Adversarial Networks (GANs) [52] or deep Boltzmann Machines (DBMs) [120], as well as countless variations, combinations, and improvements of different layers, e.g., attention mechanisms for RNNs [42]. A comprehensive overview is out of scope, but many models, techniques, and recent developments can be found in Bengio et al.'s survey [13] and Jürgen Schmidhuber's overview [121].

Fully-connected Layers

A fully-connected (FC) layer connects each of the input elements to each of the output elements in a non-linear way, i.e., y = f(Wx + b), with f being a non-linear activation function. Fully-connected layers are a fundamental building block and can be found in many DL model architectures. Multiple, chained fully-connected layers can learn arbitrary function mappings between the input and the output [64].
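A minimal sketch of a single fully-connected layer in NumPy; the weights, bias and tanh activation are chosen only for illustration:

```python
import numpy as np

def fully_connected(x, W, b, activation=np.tanh):
    """A single fully-connected layer: y = activation(W @ x + b).
    Every input element influences every output element through W."""
    return activation(W @ x + b)

# Toy usage: map a 4-dimensional input to a 3-dimensional output.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
print(fully_connected(x, W, b))
```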

Convolutional Layers

Convolutional layers (CO) use locally shared parameters to learn an activation map of an input. The activation map is derived by "sliding" a kernel over the input and calculating a dot product of the kernel and the input segment. Often, an activation map is passed through a non-linear activation function. Convolutional layers are fundamental building blocks of convolutional neural networks and are used to learn hierarchically structured features, especially in spatial data.
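A minimal 1D sketch of this sliding dot-product computation; as in most DL frameworks, the operation shown is technically cross-correlation, and the input and kernel values are illustrative:

```python
import numpy as np

def conv1d_activation_map(x, kernel, activation=np.tanh):
    """'Slide' a kernel over a 1D input and take a dot product at each position;
    the same kernel parameters are shared across all positions."""
    k = len(kernel)
    out = np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])
    return activation(out)

# Toy usage: a kernel that responds to local increases in the signal.
x = np.array([0.0, 0.1, 0.9, 1.0, 0.2, 0.0])
print(conv1d_activation_map(x, kernel=np.array([-1.0, 1.0])))
```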

Pooling Layers

Pooling layers (PO) reduce the size of representations, and thereby achieve desirable properties such as translation- or shift-invariance. Common ways to achieve that reduction are taking the maximum [117] or sub-sampling [79]. Pooling layers are often found in convolutional neural networks.

Recurrent Layers

Recurrent layers learn relationships in the input data by having a recursive function defined over different parts of the input. LSTMs [63] and GRUs [29] are among the most popular recurrent layers. Recurrent layers are building blocks for recurrent neural networks, which are useful for many tasks on sequential and text data, where the state of the sequence plays an important role.

Restricted Boltzmann Machines

Restricted Boltzmann Machines are energy-based models that learn a joint probability distribution of the input and output data [138]. An energy function defines this joint probability distribution. The restriction in a Restricted Boltzmann Machine is that there are no intra-layer connections in the hidden layer.

Dropout Layers

Dropout layers are parameterless layers, which stochastically add noise to the input or set parts of the input to zero [78; 144]. Dropout essentially turns one network into an ensemble of networks, thereby increasing the generalization capability of the network.
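A minimal sketch, assuming the widely used "inverted dropout" formulation (one of several possible variants):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: during training, randomly zero a fraction p of the
    inputs and rescale the survivors so the expected activation stays the same.
    At test time the input is returned unchanged."""
    if not training:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

# Toy usage: roughly half the entries become zero, the survivors become 2.0.
print(dropout(np.ones(8), p=0.5))
```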

These layers can be composed into more sophisticated models. Here, we briefly introduce six common model architectures.

Multi-Layer Perceptrons (MLP)

Multi-layer Perceptrons (MLPs) are feed-forward neural networks that are composed of multiple, fully-connected layers. This composition allows each layer to use the features of the previous layer to create more abstract features. Such a network learns to produce distributed representations that help to solve tasks efficiently. MLPs can approximate arbitrary, non-linear functions, and should not be confused with perceptrons, which can only learn simple, linear relationships between the input and the target data. One significant disadvantage of MLPs is that the model size grows proportionally with respect to the dimension of the input features.

Convolutional Neural Networks (CNN)

Feedforward neural networks make minimal assumptions about the data that they are processing. However, we often have general information about the data that we are processing. One example of such data is images. The pixels in images usually have a loose spatial correlation, i.e., a pixel will loosely influence the values of the pixels in its vicinity.

Convolutional neural networks (CNNs) are a particular type of feed-forward neural network that use this spatial correlation to design neural networks that are better suited for processing such data. CNNs combine three key ideas: local receptive fields, parameter sharing, and local sub-sampling.

Local receptive fields, also called kernels, connect small patches of the input data with one point of the output data. Local receptive fields assume that the input data are spatially correlated, i.e., that the neighborhood of a data point influences this data point and vice versa. Connecting each small patch to all outputs with its own parameters would be computationally impractical, as it would drastically increase the number of parameters to learn. Instead, the parameters of such local receptive fields are "slid" over the input, and an output is calculated for each position. The parameters are shared across positions. When training CNNs, multiple kernels per CO layer are trained and slid over the input, thereby producing multiple activation maps. This approach drastically reduces the number of parameters needed.

CNNs are typically composed of multiple convolutional layers, pooling layers, and fully-connected layers. A combination of CO and PO layers learns hierarchical representations from the data, and the FC layers learn complex interdependencies between such representations. Examples of such architectures are AlexNet [78] and VGG-Net [136]. Other compositions are possible, such as a CNN that consists purely of CO layers [143].
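A minimal sketch of such a CO/PO/FC composition, assuming PyTorch; the 1-channel 28x28 input size, channel counts and ten output classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A small CNN in the spirit of the CO/PO/FC composition described above.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # CO: learn local activation maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # PO: reduce spatial size, 28 -> 14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),   # CO: more abstract features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # PO: 14 -> 7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                    # FC: map representations to classes
)

print(cnn(torch.randn(4, 1, 28, 28)).shape)       # torch.Size([4, 10])
```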

Recurrent Neural Networks (RNN)

Another general assumption one can make about data is temporal (or sequential) interdependence, as found in time series, natural language or sound. If the data is locally sequentially correlated, 1D convolutional neural networks can be used to learn representations of such patterns. For more complicated patterns or patterns that occur over time, recurrent neural networks (RNNs) have been developed.

RNNs are neural networks that are designed to reuse the outputs of the network in later calculations, hence the name. Similarly to CNNs, recurrent neural networks rely on parameter sharing, but in a different fashion. In addition to parameter sharing, recurrent neural networks maintain a state (or context) of the network, which they pass on for further processing.

RNNs can be used to learn functions in flexible ways. They can be used to learn functions to map sequences to a single output (many-to-one), for example, to classify sequences. They can be used to learn functions that map sequences to other sequences (many-to-many), for example, to tag sequences with specific labels or for natural language translation [29]. Moreover, they can be used to map a single value to multiple outputs (one-to-many), for example, to generate descriptive text from an input image [160].

RNNs are typically composed of one or more recurrent layers (RE) to represent the sequences and one or more fully connected layers (FC) to learn complex relationships for such sequence representations. A common variant of the RNN is the bi-directional RNN, which combines two RE layers, one that processes the sequence of inputs from start to end and the other in reverse order. The output of these two layers is subsequently merged.
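A minimal sketch of a plain recurrent layer in the many-to-one setting, written in NumPy; the dimensions and the tanh activation are illustrative assumptions, and practical models would typically use LSTM or GRU cells instead:

```python
import numpy as np

def rnn_many_to_one(xs, W_xh, W_hh, b_h, W_hy, b_y):
    """A plain recurrent layer unrolled over a sequence: the hidden state h is
    updated from the previous state and the current input (shared parameters),
    and only the final state is mapped to an output (many-to-one)."""
    h = np.zeros(W_hh.shape[0])
    for x in xs:                                  # iterate over time steps
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # reuse the previous state h
    return W_hy @ h + b_y                         # map the final state to an output

# Toy usage: a length-5 sequence of 3-dim inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
out = rnn_many_to_one(xs,
                      W_xh=rng.normal(size=(4, 3)), W_hh=rng.normal(size=(4, 4)),
                      b_h=np.zeros(4), W_hy=rng.normal(size=(2, 4)), b_y=np.zeros(2))
print(out.shape)  # (2,)
```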

Autoencoders (AE)

Autoencoders (AE) are composite models that consist of two components: an encoder model and a decoder model. The task of an AE is to output a reconstruction of the input under certain constraints. The encoder and the decoder model can be any neural network, preferably one that works well with the data type. So one can imagine an AE for images where the component models are CNNs, or an AE for text where the component models are RNNs.

If an AE has sufficient capacity, it will learn two functions that simply copy the input to the output. Such functions are generally not useful. Therefore, the representations that the AE has to learn are typically constrained in specific ways, for example by sparsity or by a form of regularization. Such constraints force the encoder model to learn representations that contain potentially useful properties or regularities of the data. One use case for AEs is dimensionality reduction.
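A minimal sketch of an under-complete AE, assuming PyTorch and illustrative dimensions; the bottleneck is the constraint that prevents plain copying:

```python
import torch
import torch.nn as nn

class UndercompleteAE(nn.Module):
    """Under-complete autoencoder sketch: the bottleneck (latent_dim < input_dim)
    forces the encoder to learn useful representations instead of copying."""
    def __init__(self, input_dim=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)       # compressed representation
        return self.decoder(z)    # reconstruction of the input

ae = UndercompleteAE()
x = torch.randn(16, 64)
loss = nn.functional.mse_loss(ae(x), x)   # reconstruction objective
print(loss.item())
```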

Variational Autoencoders

Similar to an autoencoder, the Variational Autoencoder (VAE) consists of two models, an encoder model and a decoder model. The encoder model learns the mean and variance of a multivariate Gaussian distribution given a particular input. This distribution is used to sample a latent random variable z. The decoder model learns to reconstruct the input given z. The decoder model is the generative model.
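A minimal VAE sketch, assuming PyTorch; the single linear encoder and decoder layers, the dimensions, and the toy loss computation are illustrative simplifications:

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """VAE sketch: the encoder predicts the mean and log-variance of a Gaussian
    over the latent variable z, z is sampled via the reparameterization trick,
    and the decoder reconstructs x from z."""
    def __init__(self, input_dim=64, latent_dim=8):
        super().__init__()
        self.enc_mu = nn.Linear(input_dim, latent_dim)
        self.enc_logvar = nn.Linear(input_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        mu, logvar = self.enc_mu(x), self.enc_logvar(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # z ~ N(mu, sigma^2)
        return self.dec(z), mu, logvar

vae = TinyVAE()
x = torch.randn(4, 64)
recon, mu, logvar = vae(x)
# Typical training objective: reconstruction loss plus KL divergence to N(0, I).
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
loss = nn.functional.mse_loss(recon, x) + kl
print(loss.item())
```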

Deep Belief Networks (DBN)

Deep Belief Networks (DBNs) are generative models that are composed of multiple layers of restricted Boltzmann machines [61]. Each layer is trained individually and then combined to form a DBN. DBNs have mostly been used to learn input representations in an unsupervised way, for example, [159]. DBNs are trained by greedily training each component RBM in an unsupervised way. DBNs are generative models, so they can also be used to generate samples. DBNs and RBMs have mostly been replaced by other techniques [53].

Figure 2: Schematics of common DL composite architectures. Architectures sketched are Multi-layer perceptron (MLP), Convolutional Neural Network (CNN), Recurrent neural network (RNN), Autoencoder (AE), Variational Autoencoder (VAE) and restricted Boltzmann machine (RBM).

2.4 Data Types

DL algorithms learn models from data. Data can be organized in many forms. The organization depends on the domain and the task that is solved. There is no single right way to organize data, but commonly data are organized in a way that is efficient to process or natural for the domain. In this survey, we distinguish between five different data types: spatial data (PD), sequential data (QD), structured data (SD), text data (TD) and multi-modal data (MD). In the next paragraphs, we describe these five data types in more detail.

Spatial Data

Spatial data is data that is organized in such a way that individual data points are addressable by an N-dimensional coordinate system. For certain data types, the spatial location of data points carries much of the information stored in the data. An example is images, where much of the information contained in the image is stored in the location of a specific pixel. Consequently, a spatial representation of such data offers the advantage of leveraging the information that is stored in this spatial arrangement.

In InfoSec, the most prevalent type of spatial data to which DL is applied is images. Images occur naturally in domains such as biometrics, steganography, or steganalysis, where images are a data modality of central interest. However, images also occur in other domains, such as website security, where researchers attempt to break CAPTCHAs, which are often visual. Apart from that, only a few approaches in InfoSec use DL on spatial data, such as organizing audio data into a spatial arrangement using spectrograms, or transforming binary malware data into a spatial arrangement.

Sequential Data

Generally, sequential data are ordered lists of events, where an event can be represented as a symbolic value, a real numerical value, a vector of real or symbolic values, or a complex data type [165]. Sequential data can also be considered as 1D spatial data, that is, locally correlated and with each event referable via a 1D coordinate system. Furthermore, text can also be seen as sequential data, either as a sequence of characters or as a sequence of words, although the relationship between the elements of text is complex. The distinction is somewhat arbitrary and depends on the data, the treatment of the data and the chosen data analysis techniques.

In this survey, we treat data as sequential if it is analyzed with respect to its sequential nature. For example, network traffic can be treated as a sequence of packets, but each packet could also be analyzed individually, without regard to the order of the packets. Many data types in InfoSec can be treated as sequential data. In our survey, we have three main categories of sequential data: operation codes, network traffic, and audio data.

Structured Data

Structured data are data whose creation depends on a fixed data model, for example, a record in a relational database. Semi-structured data has a structured format, but no fixed data model, for example, graph data or JSON-formatted data. In this survey, we treat structured and semi-structured data together, since from a DL perspective they are very similar.

A prominent source of structured data in InfoSec is network traffic. Network traffic is usually partitioned into packets, each of which contains a fixed set of fields with specific values. Another source of structured data is event logs that are stored in a database. We also consider data where a fixed set of high-level features (the stress is on high-level) is extracted from the data as structured data, because this equals a record in a database where each field of the entry corresponds to a feature.

Among the surveyed papers, we found three tasks on structured and semi-structured data: code similarity detection, network intrusion detection, and drive-by attack detection.

Text Data

Text data are a form of unstructured, sequential data. Commonly, a text is interpreted in two ways: as a sequence of characters, or as a sequence of tokens, where tokens can be words, word parts, punctuation or stopping characters. The elements of a time series are generally related to each other in a linear, temporal fashion, i.e., they depend on elements from earlier time steps. This relationship often also holds for text data. However, the relationships between the elements of text are more complicated, often long-term and high-level.

In InfoSec, text data is mainly found in log files and in analyzing communication such as social media messages or emails. Another source of text may be the source code of software to identify vulnerabilities.

Text data is generally treated as sequential data and analyzed with similar means. One problem of analyzing text is to find suitable representations of the input words. Text data is high dimensional, which renders sparse representations computationally impractical. Hence, commonly dense representations such as Word2Vec [93] and GloVe [109] are used to represent input words.
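As a minimal illustration of such dense representations, the sketch below maps tokens of a toy vocabulary to low-dimensional vectors; the vocabulary is made up, and random vectors stand in for embeddings that would in practice come from models such as Word2Vec or GloVe:

```python
import numpy as np

# Each token in a small vocabulary is mapped to a dense, low-dimensional vector
# via an embedding matrix (random stand-ins for pre-trained embeddings).
vocab = {"user": 0, "failed": 1, "login": 2, "<unk>": 3}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))         # one 8-dim vector per token

def embed(tokens):
    """Look up the dense vector for each token; unknown tokens map to <unk>."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    return embeddings[ids]                             # (len(tokens), 8)

print(embed("user failed login twice".split()).shape)  # (4, 8)
```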

Multi-Modal Data

At times more than one data modality is used for solving a problem, or multiple modalities are derived from the same data type. An example of multiple modalities are videos that contain both a spatial and a temporal component. Such data can be analyzed by combining models like CNNs and LSTMs to represent both of the aspects of the data.

In this survey we only have two examples of multi-modal data: one that treats ECG data as both spatial and sequential data, and one that uses text and images for the detection of cyberbullying.

2.5 Tasks

Tasks are the problems that should be solved. One of the key challenges in solving a problem with machine learning is how to present the problem in such a way that the machine learning algorithm can solve it. For example, the problem of finding malware can be reduced to a classification problem, i.e., given a binary file, decide whether it is malicious or benign. How to solve a problem and which task to use is often a design decision, since the borders between different tasks are often vague, or a problem can be solved in multiple ways. For instance, malware could be detected by a classifier that classifies binaries into malicious or benign. Alternatively, malware could also be detected by clustering, which groups similar binaries close together so that malicious binaries form groups distinct from benign ones.

Here we describe four common tasks: anomaly detection, classification, clustering, and representation learning (which includes dimensionality reduction). This selection is by no means complete, as there are many other tasks such as regression, machine translation, imputation of values, manifold learning or metric learning. We do not describe all of them, since up to now many of them do not play a significant role in the papers that we have reviewed.

Anomaly detection (AD)

Anomaly detection (AD) is the task of identifying atypical objects or events within a larger group. AD techniques have frequently been applied to InfoSec problems such as intrusion detection, malware detection or insider threat detection. DL techniques have recently also been explored for anomaly detection, e.g., for detecting anomalous log lines [39]. There are additional considerations when using anomaly detection to solve an InfoSec problem. For example, when used for intrusion detection, one needs to show that anomalous data is actually malicious [49].
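As a generic illustration (not the specific method of [39]), DL-based anomaly detection is often reduced to thresholding a reconstruction error; the sketch below assumes a placeholder reconstruction function and toy data:

```python
import numpy as np

def anomaly_scores(model_reconstruct, X):
    """Generic reconstruction-error scoring: samples the model reconstructs
    poorly receive a high anomaly score."""
    recon = model_reconstruct(X)
    return np.mean((X - recon) ** 2, axis=1)

def flag_anomalies(scores, threshold):
    """Samples scoring above a chosen threshold are flagged as anomalous; in
    InfoSec settings the flags still need to be checked for maliciousness."""
    return scores > threshold

# Toy usage with a placeholder 'model' that just returns the column means.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:3] += 6.0                                   # inject a few obvious outliers
scores = anomaly_scores(lambda x: np.tile(x.mean(axis=0), (len(x), 1)), X)
print(flag_anomalies(scores, threshold=np.percentile(scores, 95)).sum())
```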

Classification

Classification is the task of mapping an input sample to a discrete category of output classes. Many problems can be formulated as a classification task, often in multiple ways. For example, biometric fingerprint matching can be framed as a classification task in two ways. One could frame it as a binary classification task: given two fingerprints, decide whether they are from the same individual. Alternatively, one could ask: given a fingerprint, to which individual does it belong? Regression tasks are closely related to classification tasks, but instead of discrete output categories, the goal is to predict a numerical value.

Clustering

Clustering (CL) is the task of finding groups and group membership in a set of objects. Objects that are similar according to a certain metric and set of attributes should be grouped together, and objects that are dissimilar should not. Clustering is related to classification in the sense that it assigns membership to a certain group. However, the main focus of clustering is to find the groups. Deciding the number of groups is an open problem, but several heuristics exist [146; 154].

In InfoSec, clustering techniques have the appealing property that they deal well with unknown groups and unknown objects. However, the same problems exist as for anomaly detection. In InfoSec, DL techniques are not directly used for clustering but instead used to learn representations that can be clustered in a second step.

Representation learning

Representation learning (RL) is the task of finding another representation for some input data. When using DL techniques, representations are always learned, because the architecture usually consists of multiple layers of representation. For example, when classifying a binary file as malicious or benign, the model that is being trained automatically learns a representation of the data. However, in some cases no other explicit task is addressed, for example when an AE is used. There, the primary focus is to find a representation of the data that fulfills specific properties, e.g., a reduction of the input dimension or a denoised version of the input.

In InfoSec, explicit representation learning using DL-techniques is often used to reduce the input dimensions and in combination with another classifier [78; 86] or to use domain knowledge from a similar task, e.g.,  [163].

3 Classification Scheme

This survey aims to understand how DL techniques have been applied to InfoSec problems. DL methods are data-centric but largely domain-independent. Hence we present this survey from a data-centric perspective, not from a topic-centric perspective. We chose this perspective because it allows researchers from other InfoSec sub-domains and other non-security domains who have a problem with similar data types to see how these problems were addressed, and potentially to reuse solutions from other fields.

Our classification consists of three dimensions: data-type, task-type, and model. Figure 3 depicts the high-level relationship between the three categorization dimensions of our survey. We use data to train a model to solve a task. Figure 4 depicts the abstract values of our categorization.

Figure 3: We use a data-centric perspective on DL in InfoSec. We learn a model using data to solve a problem.

We distinguish between the five data types that were introduced in Section 2.4: spatial data (PD), sequential data (QD), structured data (SD), text data (TD) and multi-modal data (MD). In addition to the abstract data type, we report on the concrete data type in the survey section.

The second dimension of our classification is a distinction between four abstract task types: Classification (C), Representation Learning (RL), Anomaly Detection (AD) and Clustering (CL). After this high-level, abstract categorization, we group approaches together based on concrete security problems from different sub-domains of InfoSec.

Finally, we distinguish which architecture components have been used to compose a model. The model choice depends largely on the data and the task that needs to be solved, and multiple different architectures can be used to solve a task. We distinguish between fully connected layers (FC), convolutional layers (CO), which include pooling layers, recurrent layers (RE), autoencoders (AE), restricted Boltzmann machines (RBM), variational autoencoders (VAE) and generative adversarial networks (GAN). A model may contain more than one of these components. Often, model architectures that were successful on a particular task get a name, such as LeNet-5 or AlexNet. When we found models that were heavily inspired by such architectures, we refer to that architecture.

Figure 4: Dimensions of our categorization.

4 Survey Methodology

To obtain the literature for this survey, we followed the following protocol.

We chose to restrict the sources of this survey by venue. We selected security venues that published DL approaches and machine learning venues that published security approaches involving DL. Our venue selection criteria were the following. We selected the top 20 journals and conferences from Google Scholar's rankings for Security and Cryptography, Data Mining, and Artificial Intelligence. We added the top 20 security conferences from Microsoft Academic Research. We searched for DL approaches in the security conferences and journals, and we searched for security approaches and DL approaches in the machine learning venues. We limited the venues by name and the areas of DL and Security by keywords. We obtained the DL keywords from commonly used DL techniques and the security keywords from the ACM CCS [2] and the IEEE taxonomy [66].

To refine the selection of papers for this survey, we selected papers according to the following criteria. We excluded papers that only proposed ideas but did not conduct any experiments. We excluded invited papers, talks, posters and workshop papers. We excluded papers that claimed to use DL but only reference general resources such as Goodfellow et al.'s DL book [53] without providing any additional detail that would enable researchers to reproduce results. Further, we excluded papers that were not explicitly aiming to achieve a security goal. For example, we excluded approaches on object detection in videos, even though it could be argued that these approaches are useful for surveillance purposes.

We conducted the literature search on the 9th of May 2018, and the time range of the survey was from the 1st of January 2007 to the 8th of May 2018. The beginning of 2007 was chosen because it approximately marks the advent of deep neural networks [121].

We used Scopus (https://www.scopus.com/) and Google Scholar (https://scholar.google.com/). We conducted the keyword- and venue-based literature search on Scopus, since it is the most comprehensive database of peer-reviewed computer science literature. For two venues, NDSS and Usenix Security, we used Google Scholar because Scopus did not index them.

Using the previous search criteria, we obtained 177 papers, which we reviewed more carefully. Of these papers, we kept only the 77 that fit our previously defined criteria. We list the complete list of venues that we included in our search in Appendix A.

5 Deep Learning in Information Security

Here, we review the papers that we found using the methodology of Section 4. On a high level, we group the approaches into two parts. The first part consists of Sections 5.1 to 5.5 and presents approaches that apply DL algorithms to address security and privacy issues. The second part (Section 5.6) presents approaches that address security and privacy issues of DL algorithms.

We structured the first part based on the five data categories that we defined in our classification scheme: sequential, spatial, structured, text and multi-modal data (see Section 3). For each of the data types, we group the reviewed approaches based on their task. For each of the tasks, we compare the different architectural decisions made to solve the task and draw connections to advancements in the DL community. After assessing applications of DL on these data types, we survey approaches that assess security properties of DL methods. We chose this data-type driven perspective on the use of DL in InfoSec because it reflects the fact that the successful application of DL methods is data dependent, but not domain dependent. We believe that this perspective will help researchers identify challenges and potential solutions for data from their domain more easily.

As mentioned in Section 3, clearly distinguishing between different data types is often challenging. For example, an image, which is generally referred to as 2D spatial data, can also be viewed as a sequence of image patches and analyzed with methods that are appropriate for that data type. How to view the data is ultimately a design choice made by the researcher. We group the approaches by the data type that is fed to the neural network, that is, if some form of feature processing changes the input data type, we use the processed version to categorize the approach.

5.1 Sequential Data

5.1.1 Classification on Sequential Data

Function recognition

Binary data or bytecode analysis is an important tool for malware analysis. One challenge of binary data analysis is function recognition, i.e., finding the start and end positions of software functions in a piece of code. Shin et al. tackle the challenge of function recognition in binary code by framing it as a binary classification task [131]. They treat the binary code as a sequence of one-hot-encoded bytes and predict for each byte whether it is a start or end byte of a function. They evaluate a variety of sequential models such as RNNs, LSTMs [63], GRUs [32], and bi-directional RNNs [123], which are all well-suited for finding regularities in complex sequences. They train all their models using RMSprop [155] with back-propagation through time. They validate their approach on multiple datasets, and their best-performing model, a bi-directional RNN, achieves an F1-score between 98% and 99%, which corresponds to an average improvement of about 4% over the then state-of-the-art, while training and evaluation are an order of magnitude faster than the state-of-the-art.

Instead of predicting start and end bytes, Chua et al. aim to identify the function type signature given x86/x64 machine code [30]. To this end, they treat the machine code as a sequence of instructions, which are embedded using neural word embeddings [12]. These embedded instruction sequences are then modeled using a 3-layer stacked GRU-RNN, since the sequences have variable length and the stateful nature of the recurrent neural network aids modeling sequences of instructions. The approach, which is named EKLAVYA, is evaluated on the machine code of two commonly used compilers, cc and clang, and on multiple classification tasks. EKLAVYA achieves an average accuracy of up to 0.9748 for unoptimized binaries, and up to 0.839 for optimized code.

Traffic Classification

Chen et al. propose a method for mobile app traffic classification [28]. They capture the traffic, transform the headers into a one-hot encoding and treat a sequence of packets as a sequence of vectors. They then train a 6-layer custom CNN model to analyze the traffic. They validate their approach on a dataset captured from 20 apps with 355,235 requests and achieve an average accuracy of 0.98.

Encrypted Video Classification

The MPEG-DASH streaming video standard allows streaming in an encrypted way. Schuster et al. show that it is possible to classify videos based on the packet burst patterns even though the videos are streamed encrypted [124]. A DASH time series is created from captured TCP flows by aggregating the series into 0.25-second chunks. Then, they use a 1D CNN architecture to capture the locally correlated patterns that the encrypted sequences of burst sizes contain. The model uses three convolutional layers to avoid an information bottleneck at the beginning. The convolutional layers are followed by a max-pooling layer and two dense layers. Dropout is used to prevent overfitting. The attack is validated on video datasets from different platforms, and in their best setting they achieve a recall of 0.988 with zero false positives.

Physical Side Channels

Industrial control systems are controlled by programmable logic controllers (PLCs). Han et al. propose ZEUS, a DL-based method for monitoring the integrity of the control flow of such a PLC [55]. PLCs emit electromagnetic waves when they operate. Han et al. learn a program behavior model from spectrum sequences that are derived from execution traces. The execution traces are collected unobtrusively via a side channel by an inexpensive electromagnetic sensor. The spectrum sequences are modeled with stacked LSTMs, which are well suited for modeling the sequential nature of the data. For validation, ZEUS is implemented on a commercial Allen Bradley PLC. ZEUS is capable of distinguishing between legitimate and malicious executions with an accuracy of 98.9% while causing zero overhead on the PLC.

Mobile User Authentication

Sun et al. use sequences of characters and accelerometer information to identify users on a mobile phone [147]. They call their approach DeepService. To model the sequences they use a GRU [31], which is a variant of the LSTM that uses fewer parameters, an important consideration in a mobile phone setting. DeepService is evaluated on a dataset of 40 users, which was collected by Sun et al. On this dataset, DeepService achieves an F1-score of 0.93.

Steganalysis of Speech Data

Steganography is the discipline of hiding data in other data. Steganalysis attempts to discover and recover such hidden data. Lin et al. propose an RNN-based method to detect hidden data in raw audio data [82]. They segment speech data into small audio clips and frame the steganalysis problem as a binary classification problem to differentiate between stego and cover audio. They train an LSTM variant, called RNN-SM, to classify the raw data of the audio clips. Due to their stateful nature, LSTMs are well suited for modeling such data. RNN-SM is evaluated on two self-constructed datasets. One dataset contains 41 hours of Chinese speech data and the other contains 72 hours of English speech data. The accuracy depends on the language of the audio clips. In the best-performing setting, RNN-SM achieves an accuracy of up to 99.5%.

5.1.2 Representation Learning on Sequential Data

Audio tampering detection

Adaptive multi-rate audio is a codec for compression of speech data that is commonly used in GSM networks. One way to manipulate such data is to decompress it to a raw waveform, modify the waveform and re-compress it again. Luo et al. propose to extract features from compressed audio with a stacked under-complete AE (SAE) to identify whether a speech file has been recompressed or not [86]. To detect double compression, the extracted feature sequences are classified with a universal background model, namely a Gaussian Mixture Model. The features are extracted by first splitting the audio file into frames, then normalizing them, then compressing them via an SAE and finally fine-tuning with a binary classification layer. Two SAE models are trained, one for single-compressed files and one for double-compressed files. The under-completeness of the AE encourages the extraction of the most salient features, thereby enabling differentiation between the two compression types. Luo et al. validate their approach on the TIMIT database, where they achieve 98% accuracy.

5.1.3 Clustering on Sequential Data

Speech forensics

Li et al. combine DL-based representation learning and spectral clustering to cluster mobile devices [81]. They use an under-complete AE to compress speech files that originate from different mobile devices. This under-complete AE compresses the speech files and thereby extracts representations that capture the artifacts of each type of mobile device. The AE was pre-trained with restricted Boltzmann machines. Pre-trained AEs have been shown to work well for modeling audio data [58]. The representations learned by the AE are then used to construct Gaussian super-vectors. Finally, these Gaussian super-vectors are clustered via a spectral clustering algorithm to determine the phone type of the speech recording. Li et al. evaluate their method on three datasets, MOBIPHONE, T-L-PHONE, and SCUTPHONE, and in their best configuration their method achieves a maximum classification accuracy of 94.1% among all possible label permutations.

Data | Task | Architecture details | Approach
Classification
Binary Code | Function Recognition | bi-RNN [123] | [131]
Binary Code | Function Type Signatures | Word2Vec, GRU [92; 31] | [30]
Speech Data | Speaker Identification | bi-RNN [123] | [131]
Encrypted Burst Sequences | Video Classification | 1D-CNN [124] | [124]
HTTP Traffic | App classification | 6-layer CNN [28] | [28]
Spectrum Sequences | Attack detection | stacked LSTM [50] | [55]
Keystrokes | User authentication | GRU [31] | [147]
Speech data | Steganalysis | LSTM-variant | [82]
Clustering
Speech data | Speech forensics | Undercomplete AE | [81]
Representation Learning
Speech data | Forging detection | Undercomplete SAE | [86]
Table 1: Classification using Deep Learning on sequential data in information security

5.2 Spatial data

5.2.1 Classification on Spatial Data

Breaking CAPTCHAs

CAPTCHAs are a method for discriminating between automated and human agents [162]. CAPTCHAs achieve this discrimination by posing a challenge to the agent that is easy to solve for humans but hard for computers. An example of such a challenge is to ask whether an image contains a store or not, which is easy for humans to answer but not for computers. Since their inception, CAPTCHAs have been continually evolving in an arms race with smarter bots that attempt to break them. DL algorithms have proved to be very successful in analyzing images, especially distorted and incomplete images, which eventually led to attacks on CAPTCHAs using DL methods.

Gao et al. attack hollow CAPTCHAs using a DL-based approach [47]. Hollow CAPTCHAs use contour lines to form connected characters. Their attack consists of two phases, a pre-processing phase and a recognition phase. In the pre-processing phase, they repair the contour lines using Lee's algorithm, fill the contour lines using a flooding algorithm, remove the noise component, and finally remove the contour lines. The pre-processed data is then used to train a CNN-based model, which is trained to classify the characters in the picture by numbering and classifying the strokes and stroke combinations in the CAPTCHA. The final characters are determined by a depth-first tree search for the sequence with the highest predicted scores. They selected a fully-connected version of LeNet-5's architecture [135], which provides partial invariance to translation, rotation, scale, and deformation and has been shown to work well on handwritten digits. They validate their attack on multiple datasets, and their success rate for attacks on hollow CAPTCHAs ranges from 36% to 66%.

Algwil et al. use a similar CNN architecture and method to break Chinese CAPTCHAs [3]. Chinese CAPTCHAs consist of distorted Chinese symbols which need to be recognized by the agent. Algwil et al. address this challenge with a multi-column CNN [33]. A multi-column CNN is an ensemble of CNNs whose predictions are averaged. The structure of each CNN is inspired by LeNet-5, but it uses more and larger filter maps. Another difference is that the multi-column CNN uses max-pooling operations instead of trainable sub-sampling layers. This architecture also profits from the robustness of CNNs to distortion, scaling, rotation and transformations. To attack the CAPTCHAs, they split each character into a set of radicals, i.e., composite elements, and classify input characters by their radicals. Their best-performing network is able to correctly solve a Chinese CAPTCHA challenge with 3755 categories with a test error rate of only 0.002%.

reCAPTCHA is an image-based CAPTCHA challenge developed by Google. Agents get a text-based assignment and need to click all images that correspond to the assignment, for example: "Select all images with wine on it." Sivakorn et al. propose an automated reCAPTCHA-breaking system [137]. Their approach uses reverse image search to derive labels for images, word vectors [92] to derive the similarity between the assignment and the labels, and a web-based service based on deconvolutional neural networks [175] to classify the images of the reCAPTCHA challenge. Deconvolutional neural networks use a form of sparse coding and latent switch variables that are computed for each image. These switch variables locally adapt the model's filters to learn a hierarchy of features for the input images, which enhances the classification performance on natural images. Their system is able to pass the reCAPTCHA challenge in 61.2% of the attacks.

Finally, Osadchy et al. propose a CAPTCHA generation scheme that is robust to DL-based CAPTCHA attacks [104]. Their proposed CAPTCHA task is: given an adversarial image of one class, identify all images of the same class from a set of other images. An adversarial image is an image that has been modified with adversarial noise [150]. Such adversarial noise prevents DL architectures from correctly classifying images, while the images appear visually similar to a human. They generate such adversarial images using a CNN-F architecture [27]. CNN-F is a variant of AlexNet [78], which was the winning architecture of the ImageNet LSVRC-2012 challenge. AlexNet popularized three main techniques for image analysis: ReLUs as activation functions, dropout for regularization to prevent overfitting, and overlapping pooling to reduce the network size. CNN-F is a stripped-down version of AlexNet that reduces the number of convolutional and fully-connected layers.

Biometrics from 2D Spatial Data

Biometrics are biological traits of a person which can be used to derive keys for identification in digital systems. Many biometrics are based on 2D spatial data, for example, images of fingerprints, faces or irises. One of the key challenges of creating biometrics from 2D spatial data is to be able to distinguish between subtle elements of the raw data that allow the identification.

Faces are a standard way to identify people. One of the main challenges is that in real-life situations, images of faces will have distortions, different angles, and shades. Sun et al. proposed a DL architecture that is well suited for such a task [148]. This architecture is a CNN that consists of four convolutional layers, three max-pooling layers and one fully connected layer, which yields an image representation. Additionally, they use local feature sharing in the third convolutional layer to encourage more diverse high-level features in different regions. Finally, the outputs of the last max-pooling layer and the output of the last convolutional layer are shared inputs to the fully connected layer. This connection prevents an information bottleneck caused by the last convolutional layer and is essential for learning good representations. Their approach achieves 99.15% accuracy on the LFW dataset with 5749 different identities when their network is augmented with additional, external training data. Zhong et al. use a similar approach for the task of face-attribute classification [182]. They compare VGGnet [136] with FaceNet [122] and conclude that both are suitable for extracting facial attributes. Instead of the iris as a biometric, Zhao et al. focus on the region around the eyes – the periocular region – for authentication [180]. They use AlexNet to classify the eyes and derive semantic information such as age or gender using additional CNNs that are trained on the surrounding regions of the eye. Goswami et al. use a combination of an SDAE and a DBM to extract features from multiple face images which they extract from video clips, and then use an MLP to authenticate people via their faces [54]. They pre-filter the frames with a high information content as input for the classification network. At a false-accept rate of 0.01, they achieve up to 97% accuracy when validated against the Point and Shoot Challenge database, and 95% against the YouTube Faces dataset.

Instead of directly classifying faces with a CNN for biometrics, Liu et al. propose to learn a metric space for faces [84]. Their approach is inspired by the triplet networks of Schroff et al. [122]. Triplet networks learn a metric space by solving a ranking problem between three images that are represented by the same CNN. The parameters of such a model are learned by minimizing a triplet loss. Liu et al. pre-train their model on the CASIA WebFace dataset and evaluate on the CASIA NIR-VIS 2.0 face dataset, where they achieve a rank-one accuracy of 95.74%. The second-best approach achieved an accuracy of 86.16%.

Fingerprints are another standard way to identify people. One important feature for matching fingerprint images is the local orientation field. Cao et al. frame local orientation field extraction as a DL-based classification task. Given a fingerprint image, they predict for each pixel the probability that it belongs to a local orientation field. They base their CNN model on LeNet-5, but adjust the filter size and extend the model with dropout and local response normalization [78]. In addition, they augment their training data with texture noise. They evaluate their method on the NIST SD27 dataset, where they achieve a pixel-class root mean square deviation of 13.51, which represents an improvement of 7.36% over all latents.

Segmentation is the computer vision task of splitting an image into multiple parts so that they can be processed better. In biometrics, iris segmentation aims to separate the parts of an image that belong to an iris from the remaining image, so that the iris can be used for biometric identification. Liu et al. propose a DL-based approach for biometric iris segmentation [85]. They use an ensemble of pre-trained VGG19 networks [136], fine-tuned on two iris datasets (UBIRIS.v2, CASIA.v4), to perform a pixel-based, binary classification task. VGGnet [136] advanced the state of the art with two novelties: it increased the depth of the network by using many layers of smaller, 3x3 filters, and it used 1x1 convolutions to increase the non-linearity of the decision function without reducing the size of the receptive fields. Liu et al. evaluate their approach on the UBIRIS.v2 and CASIA.v4 datasets and achieve a pixel segmentation error rate of less than 1%. Qin and Yacoubi use a similar approach for finger-vein segmentation; however, they use a custom CNN instead of a pre-trained VGG19 network [113]. Their approach is validated on two finger-vein databases and achieves an equal error rate of 1.69 in the best setting.

Peoples’ appearances are a form of soft biometrics that can be used to identify people in surveillance videos. Zhu et al. use multiple AlexNet-based CNNs to classify properties of people from surveillance images in order to identify them [183]. They use one AlexNet for a binary classification task for each of the desired attributes. They validate their approach on the GRID and VIPeR databases, and their method outperforms the baseline, an SVM-based classifier, by between 6% and 10% accuracy on average. Hand and Chellappa also classify attributes of faces. They frame the problem as a multi-class approach and design a particular architecture for the task: MCNN-AUX [56]. The MCNN-AUX consists of two shared convolution-max-pooling layers for all attributes and one CO layer followed by two fully connected layers for each task. Additionally, the data of all attributes are used for classification. Gonzalez-Sosa et al. compare a VGGnet variant with commercial off-the-shelf face attribute estimators, and find that soft biometrics improve the verification accuracy of hard biometric matchers [51].

Another soft biometric is gait, i.e., how people walk. Shiraga et al. use a CNN called GEINet to classify peoples’ gait [132]. The input to GEINet are gait energy images (GEI), a particular type of image that is extracted from multiple frames showing a person’s walk. GEINet is based on AlexNet [78]. Shiraga et al. reduce the depth of the network because of the different nature of the task: to classify GEIs, the network needs to focus on subtle inter-subject differences. During the evaluation, GEINet consistently outperforms baseline approaches by a significant margin; in their best experimental setting, GEINet achieves a rank-1 identification rate of 94.6%. Age, gender, and ethnicity are other common soft biometrics. Narang et al. empirically validate that VGGnet is suitable for age, gender, and ethnicity classification [99]. Azimpourkivi et al. propose another soft biometric for authentication based on mobile-phone pictures of objects [7]. They use an InceptionV3 CNN [151] to derive features from pictures a mobile phone user takes. The Inception architecture introduced inception modules, which consist of multiple convolutional filters of different sizes. These filters are concatenated, which allows the network to decide which filter size it needs at a given layer to solve the task at hand. To construct image hashes that are not sensitive to subtle changes, they use a particular version of locality-sensitive hashing which allows them to group similar image features next to each other. Azimpourkivi et al. demonstrate that their approach is robust to real-image, synthetic-image, synthetic-credential and object-guessing attacks, and show that the Shannon entropy of their soft biometric is higher than that of fingerprints. Additionally, unlike real biometrics, soft biometrics that are derived from object pictures can easily be exchanged.
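
The inception-module idea mentioned above, letting the network pick among several filter sizes by concatenating parallel branches, can be sketched as follows. This is a simplified, hypothetical module; the real InceptionV3 blocks also contain factorized convolutions and pooling branches:

```python
import torch
import torch.nn as nn

class SimpleInceptionModule(nn.Module):
    """Parallel 1x1, 3x3 and 5x5 convolutions whose outputs are concatenated,
    so later layers can weight whichever filter size is most useful."""
    def __init__(self, in_channels, branch_channels=32):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, branch_channels, kernel_size=1),  # bottleneck
            nn.Conv2d(branch_channels, branch_channels, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_channels, branch_channels, kernel_size=1),  # bottleneck
            nn.Conv2d(branch_channels, branch_channels, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

x = torch.randn(1, 64, 56, 56)
print(SimpleInceptionModule(64)(x).shape)  # torch.Size([1, 96, 56, 56])
```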

People’s voices are also biometric data that can be used to identify a person. Generally, audio data is presented in waveform, which is a form of sequential data. However, a common pre-processing step in speech recognition is to calculate the Mel-frequency cepstral coefficients (MFCC), which represents the speech waves as 2D spatial data. Uzan and Wolf combines MFCCs and an AlexNet-based CNN [78] to identify speakers [158]. They validate their approach on their own dataset with 101 persons and 4584 utterances and their method achieves an utterance classification accuracy of 63.5%.

Another challenge in biometrics is to determine whether a biometric is authentic, i.e., to detect whether it has been tampered with. Menotti et al. propose a DL-based framework for detecting spoofed biometrics [91]. At the heart of their framework is a CNN that performs a binary classification task on images such as iris scans, fingerprint scans or face images. The goal of this task is to distinguish between fake and real biometric images. They evaluate two different strategies: architecture optimization and filter optimization. In their architecture optimization strategy, they leave the filter weights random but change the architecture to fit the problem. In their filter optimization strategy, they fix the architecture and learn the filter weights. To do so, they create spoofnet, a CNN architecture that is based on cuda-convnet [77]. Spoofnet is designed to handle subtle changes in data better because of its modified filter sizes. In contrast to their architecture optimization strategy, they train the parameters of spoofnet with SGD. They validate their approach on seven different databases and achieve excellent results compared to the state of the art. Nogueira et al. use a pre-trained VGG19 network, fine-tuned on fingerprints, to distinguish between fake and live fingerprints [101]. Their best performing model achieves a 97.1% test accuracy. Instead of CNNs, Bharati et al. use supervised deep Boltzmann machines [18] to detect the spoofing of face images. Their method outperforms an SVM on their custom datasets. Conceptually very similar to spoofing detection is biometric liveness detection.

Detecting Privacy Infringing Material

Pictures may contain privacy-sensitive information, such as a driver's license or other confidential material. Tran et al. propose a CNN-based method to detect such material [156]. They combine a modified version of AlexNet [78] with a CNN for sentiment analysis [172], and name the resulting model PCNH. They use 200 classes from the ImageNet challenge [119] to pre-train PCNH, and then train PCNH on categories of sensitive material. They validate their approach on their own and a public dataset, achieving F1-scores of 0.85 and 0.89, respectively. Yu et al. tackle a similar challenge with a hierarchical CNN [173]. They combine a deep CNN for object detection with a tree-based classifier over the visual tree. The object identification focuses on detecting privacy-infringing objects, and the tree-based classifier combines them.

Steganalysis and Steganography

Steganalysis is the process of analyzing data to find out whether secrets have been hidden in it or not. One way to hide information is hiding it in images. One of the key challenges in detecting such hidden data is identifying subtle changes in the data. Ye et al. propose a DL-based method for image steganalysis, where they use a deep neural network to detect whether information has been hidden or not [171]. Their proposed CNN architecture has several adaptations to suit the data for the task at hand. First, the first three layers are purely convolutional, which is important so that no information is lost; no dropout or normalization is used, and instead of max-pooling layers they use mean-pooling layers. In total, their network is ten layers deep. Apart from that, they use a truncated linear unit (TLU) activation function in the first layer, which reflects that most steganographically hidden data lies between -1 and 1. The weights in the first convolutional layer are initialized with high-pass filters, which helps the CNN to ignore the image content and instead focus on the hidden information. For training their network, Ye et al. adopt a curriculum learning scheme, which selects the order of the images that are used for training. This ordering is essential for detecting information that has been hidden with a low bits-per-pixel (bpp) rate. They evaluate their method on multiple steganographic schemes and bpp rates, and their approach consistently outperforms the baseline approaches; their detection error rate ranges from 0.46 to 0.14. Zeng et al. propose to use residuals and an ensemble of CNNs to detect steganography in images [176]. As input to their network, they use quantized, truncated residuals, with three different parameters for quantization and truncation. The three different input sets are used to train three different CNNs. The CNN architecture is based on insights from the steganalysis network of Xu et al. [166], but they use ReLUs instead of TanH. Zeng et al. validate their approach on ImageNet and three steganographic algorithms, and their best performing model achieves a detection accuracy of 74.5%.
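
The truncated linear unit used in the first layer of the network described above is simple to express. The sketch below is a generic stand-in, with the truncation threshold left as a free parameter rather than the value used by Ye et al.:

```python
import torch

def truncated_linear_unit(x: torch.Tensor, threshold: float = 3.0) -> torch.Tensor:
    """TLU: identity inside [-T, T], saturated outside.
    Small residuals carrying the embedded message pass through unchanged,
    while large image-content responses are clipped."""
    return torch.clamp(x, min=-threshold, max=threshold)

responses = torch.tensor([-7.2, -0.8, 0.3, 1.0, 5.5])
print(truncated_linear_unit(responses))  # tensor([-3.0, -0.8, 0.3, 1.0, 3.0])
```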

Authenticity of Goods and Origin Determination

Sharma et al. applied AlexNet to the problem of separating genuine from counterfeit products [127]. To detect faked material such as fake leather, they use microscopic images to build a model of the texture of the objects. Then, they use AlexNet to classify different types of material. Depending on the material, the proposed method reaches a test accuracy of up to 98.3%. Similarly, Ferreira et al. propose a DL-based method to determine the device that a printed document was printed with [44]. Printed documents differ from each other in very subtle ways, depending on the manufacturer of the printer. To find these subtle differences, Ferreira et al. use a CNN architecture with two convolutional and two sub-sampling layers. They represent the input in three different ways: raw pixels, median residual filter and average residual filter. These different input representations are combined with either an early or a late fusion strategy. Early fusion combines multiple representations of the same data into one input data vector, and late fusion combines the outputs of multiple CNNs. Ferreira et al. evaluate their approach on a dataset which consists of 120 documents and ten printers. Using the best overall settings, their method attributes printers with an accuracy of 97.33%.
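
The early- and late-fusion strategies can be contrasted in a few lines. The sketch below uses hypothetical input shapes and a toy CNN, not the networks of Ferreira et al.: early fusion stacks the three representations as channels of one input, late fusion averages the per-model class probabilities.

```python
import torch
import torch.nn as nn

# Three representations of the same document patch: raw pixels and two residuals.
raw, median_res, avg_res = (torch.randn(4, 1, 64, 64) for _ in range(3))

def tiny_cnn(in_channels, num_printers=10):
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, num_printers),
    )

# Early fusion: concatenate the representations into one multi-channel input.
early_model = tiny_cnn(in_channels=3)
early_logits = early_model(torch.cat([raw, median_res, avg_res], dim=1))

# Late fusion: one CNN per representation, combine their output distributions.
late_models = [tiny_cnn(in_channels=1) for _ in range(3)]
late_probs = torch.stack([m(x).softmax(dim=1)
                          for m, x in zip(late_models, [raw, median_res, avg_res])]).mean(dim=0)

print(early_logits.shape, late_probs.shape)  # torch.Size([4, 10]) for both
```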

Forensic Image Analysis

A problem law enforcement faces when searching a suspect's computer for illegal pornographic material is the vast amount of pictures that can be stored on a computer, many of which are irrelevant. In their work, Mayer et al. evaluate DL-based image classification services on their capability to separate pornographic from non-pornographic material [88]. They evaluate Yahoo's service, Illustration2Vec, Clarifai, NudeDetect and Sightengine. They find that, when such services are used to rank images by their not-safe-for-work character, relevant images are, on average, discovered at positions eight and nine instead of position 1,463.

Watermarking

Watermarks address the problem of data attribution. For example, watermarks are added to copyright-protected material to identify the owner of this material. Kandi et al. propose to use a convolutional neural AE to add an invisible watermark to images [70]. A CNN AE is an AE that uses CNNs as its encoding and decoding networks.
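
A convolutional AE of this kind can be sketched briefly; the block below is a generic encoder/decoder pair under arbitrary size assumptions, not the watermarking network of Kandi et al.:

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Encoder compresses an image to a small feature map; the decoder
    reconstructs the image from it. A watermark could be blended into the
    bottleneck before decoding."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
images = torch.rand(2, 3, 64, 64)
reconstruction = model(images)
loss = nn.functional.mse_loss(reconstruction, images)  # standard AE objective
```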

Detecting Polymorphic Malware

Nguyen et al. propose a method for detecting polymorphic malware from binary files via a CNN [100]. To be able to use CNNs, they first extract the control flow graph from the binary. They transform the graph into an adjacency matrix, where a vertex of the control flow graph is considered a state, and three values describe each state: register, flag, and memory. The register, flag, and memory values are mapped to the red, green and blue color channels of a pixel. This representation allows describing similarity in a program state. Even if two variants of the same malware have different execution codes, the core malicious actions remain in the same area of such an image. The corresponding images will differ, but CNNs are highly suitable for detecting similarities between such images without losing spatial correlation. Nguyen et al. evaluate different architectures on a set of polymorphic malware and find that the YOLO architecture [114] performs best. The YOLO architecture can detect polymorphic malware with an accuracy of up to 97.69%.

Detecting Website Defacement

Website defacements are unauthorized changes to a website that can lead to loss of reputation or financial losses. Borgolte et al. use deep neural networks for detecting website defacements from website screenshots [21]. They train their model on screenshots from web pages, which they collect automatically using a web crawler. Since the screenshots are comparably large, they extract a window uniformly sampled from the center of the webpage. This extracted window is then compressed with a stacked denoising AE and finally classified as normal or defaced using an AlexNet-like network. They validate their approach on their own large-scale defacement dataset, and their method achieves a true positive rate of up to 98.81%.

Data Task Architecture details Approach
Images Breaking hollow CAPTCHAS LeNet-5 variant [135] [47]
Images Breaking Chinese CAPTCHAS ensemble of LeNet-5 variants [33] [3]
Images Breaking reCAPTCHAS adaptive deconvolutional Net [175] [137]
Images Generating CAPTCHAS AlexNet variant [27] [104]
Images Face verification DeepID2 [148] [148]
Images Face verification SDAE, DBM [54]
Images Fingerprint orientation field extraction LeNet-5 variant [79] [25]

Images Spoofing detection Spoofnet [91], variant of cuda-convnet [77] [91]
Images Spoofing detection DBM [120] [18]
Images Liveness detection VGG19 [136] [101]
Images Person identification multiple AlexNet [78] [183]
Images Gait classification GEINet, a reduced AlexNet [78] [132]
Images Soft biometrics classification VGG19 [136] [99]
Images Soft biometrics classification VGGNet variant [108] [51]
Images Face verification FaceNet [122]
Images Iris segmentation VGG19 [136] [85]
Images Iris verification AlexNet, custom CNN [180]
Images Iris verification AlexNet-variant [112]
Images Privacy infringing material AlexNet variant [78], CNN [172] [156]
Images Privacy infringing material Custom hierarchical CNN inspired by [78; 38] [173]
Images Steganalysis custom CNN [171] [171]
Images Steganalysis custom CNN based on [166] [176]
Images Alternative Biometrics InceptionV3 [151] [7]
Images Counterfeit detection AlexNet-variant [78] [127]
Images Forensic image analysis DL based services [88]
Images Printer attribution Custom CNN [44] [44]
Images Defacement detection Denoising AutoEncoder, AlexNet [78] [21]
CFG Graphs Malware detection Yolo [114], LeNet-5 [79] [100]
Voice speaker identification AlexNet [78] [158]
Table 2: Classification using Deep Learning on spatial data in information security

5.2.2 Representation Learning on Spatial Data

Forensic face sketches are pictures of faces of crime suspects that are hand-drawn by an artist based on descriptions from eyewitnesses. Mittal et al. propose a DL-based approach to match forensic face sketches with images of people [95]. Their approach consists of two phases. First, they learn representations from face images in an unsupervised way. Their model for representation learning is a combination of stacked denoising AEs [159] and deep belief networks [59]. They first train this model on real people's faces and fine-tune it on the face sketches. Second, they train an MLP in a supervised way to predict a match score between pairs of images. The input to this MLP is the concatenation of the learned representations of the real image and the sketch image. They evaluate their approach on multiple databases and achieve an accuracy of 60.2%, which is a significant improvement over the second best approach, which matched only 50.7% of face-sketch pairs correctly. Galea and Farrugia tackle the same matching task with a combination of VGGnet and a triplet network, which is called DEEPS [45]. They validate DEEPS on the UoM-SGFS 3D sketch database and achieve a Rank-1 matching rate of 52.17%.

Wang et al. use the representation-learning capabilities of neural networks to increase the performance of commercial off-the-shelf (COTS) forensic face matchers [163]. They use a pre-trained AlexNet [78] to derive a vector representation of face images from the last fully connected layer before the classification layer. They use cosine similarity as a distance metric between faces and rank the images using probabilistic hypergraph ranking [65]. Then, they select the top k images as input to the COTS PittPatt matcher, where k is a hyperparameter. They validate their approach on a combination of three databases, the LFW, WLFDB and PSCO databases, and their approach consistently scores a higher mean average precision than the PittPatt matcher alone.
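
Ranking gallery faces by the cosine similarity of their CNN representations, as in this pre-filtering step, is straightforward. The following NumPy sketch uses random vectors as stand-ins for the CNN features; dimensions and gallery size are arbitrary:

```python
import numpy as np

def cosine_similarity(probe: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Cosine similarity between one probe vector and a matrix of gallery vectors."""
    probe = probe / np.linalg.norm(probe)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ probe

rng = np.random.default_rng(0)
probe_feat = rng.normal(size=4096)             # feature vector of the probe face
gallery_feats = rng.normal(size=(1000, 4096))  # features of 1,000 gallery faces

scores = cosine_similarity(probe_feat, gallery_feats)
top_k = np.argsort(scores)[::-1][:50]          # indices of the 50 most similar faces
# Only these top-ranked candidates would be passed on to the COTS matcher.
```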

Spoofing Detection

Czajka compares hand-crafted features to learned features regarding their suitability for the task of detecting a spoofing attack with rotated biometrics [34]. The final classification is conducted with an SVM in both cases. They evaluate their approach on multiple datasets from different sensors and find that the hand-crafted features perform better in the cross-sensor test, while the learned features perform better in the same-sensor test.

Mobile Iris Verification

Zhang et al. explore the suitability of learned representations for biometric iris verification on a mobile phone [179]. They use a model that is based on the ideas of Zagoruyko et al., who proposed to compare two image patches with a 2-channel CNN that takes a pair of images, one per channel, as input [174]. This model has one output and is trained with a hinge loss to regress the similarity between the two patches. Zhang et al. combine two types of representations for their iris verification approach: the one derived by the 2-channel CNN and representations derived by optimized ordinal measures. They demonstrate that the FER for iris verification can be significantly decreased by combining these two features.

Face Verification

Gao et al. address the task of face identification by using a supervised, deep stacked denoising AE. This model learns face representations that are robust to differences such as illumination, expression or occlusion [48]. Their architecture consists of a two-layer denoising AE. The reconstruction target is a canonical face image of a person, i.e., a frontal face image with a neutral expression and normal illumination, while the "noisy" inputs are other images of the same person. They use their AE to extract face features, and sparse representation-based classification for the face verification task. On the LFW dataset, they achieve a mean classification accuracy of 85.48%. Bharadwaj et al. apply a similar approach to baby faces [17]. They use stacked denoising AEs to learn robust representations of baby faces, and a one-shot, single-class support vector machine to classify the baby faces. Their system achieves a verification accuracy of 63.4% at a false accept rate of 0.1%. Noroozi et al. propose to learn representations for biometrics with an architecture called SEVEN [102]. SEVEN combines a convolutional AE [87] with a Siamese network. The combination of these components allows learning both general salient and discriminative features of the data; it also enables the network to be trained in an end-to-end fashion.

Data Task Architecture details Approach
Images Forensic sketch matching SDAE [159], DBN [59] [95]
Images Forensic sketch matching FaceNet [122], VGGNet [45]
Images Face filtering AlexNet [78] [163]
Images Face verification SDAE [159] [48]
Images Baby face verification SDAE [159] [17]
Images Spoofing detection CNN [34] [34]
Images Mobile iris verification 2Channel CNN [174] [179]
Images Representation learning for biometrics Siamese network, CNN AE [102]
Table 3: Representation Learning using Deep Learning on spatial data in information security

5.3 Structured data

5.3.1 Classification on Structured Data

Alsulami et al. propose to identify the authors of source code by features that are extracted from the source code with an LSTM [4]. To achieve this, they first generate the abstract syntax tree (AST) from the source code. Then, this AST is traversed depth-first to create a sequence of nodes. Each node is embedded via an embedding layer, and a model is learned from such node sequences via a bi-directional LSTM. Alsulami et al. evaluate their method on code obtained from public source code repositories and achieve an author classification accuracy of up to 96%.
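
Turning source code into a node sequence for such a model can be illustrated with Python's own ast module. This is a minimal, Python-only sketch; it is not the traversal or feature set used by Alsulami et al.:

```python
import ast

def ast_node_sequence(source: str):
    """Depth-first (pre-order) traversal of the abstract syntax tree,
    emitting the node type names that would be fed to an embedding layer."""
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)
    return list(visit(ast.parse(source)))

code = "def add(a, b):\n    return a + b\n"
print(ast_node_sequence(code))
# e.g. ['Module', 'FunctionDef', 'arguments', 'arg', 'arg', 'Return',
#       'BinOp', 'Name', 'Load', 'Add', 'Name', 'Load']
# (exact node names may vary slightly across Python versions)
```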

DL-based PUF Verification

Physically Unclonable Functions (PUFs) can be used in an authentication system in a challenge-response based protocol. Yashiro et al. propose DAPUF, a strong arbiter-PUF-based system for authenticating chips, and show that it is resistant to DL-based attacks [170]. To carry out these attacks, they use a stacked denoising AE followed by an FC layer to differentiate between fake and genuine PUFs in a binary classification task.

Network Intrusion Detection

Network intrusion detection is the task of analyzing network traffic data or data from other components of a network to identify malign actions. Osada et al. propose a network intrusion detection method that uses latent representations of network traffic to identify malign actions [103]. They propose a semi-supervised version of VAEs, the forked variational AE, to address this task. The forked VAE learns a representation from the traffic; subsequently, the mean of the latent space defined by the VAE is used as input to a classifier that predicts whether an example is benign or not. The classification error is back-propagated and combined with the VAE reconstruction error. Details on feature extraction and on how to overcome the problems of training a discrete VAE are omitted. Osada et al. evaluate their approach on the NSL-KDD and Kyoto2006+ datasets. Adding 100 labeled examples increased the absolute recall rate by 4.4 percentage points and the total false-positive rate by 0.019%. Similarly, Aminanto et al. propose to learn features from network data to detect impersonation attacks [5]. They extract features using a sparse deep AE from the existing network data features, i.e., they compress the already complex features. They use this learned frame representation to train a feed-forward MLP. They evaluate their approach on the Aegean Wireless Network Intrusion Detection Dataset (AWID) and achieve a per-frame classification accuracy of 99.91% and a false positive rate of 0.012.

Drive-by Attack Detection

Drive-by attacks are attacks that infect clients that visit a web page by exploiting client-side vulnerabilities. Shibahara et al. propose a method for detecting such attacks by classifying sequences of URLs into benign and malicious sequences [130]. They extract 17 features from each of the URLs that are loaded when a client visits a web page and classify these sequences using a CNN. To suit the data at hand, they propose an Event Denoising CNN (EDCNN). This EDCNN has an allocation layer, which rearranges the values of the input layer to convolve over two URLs whose order is similar. Additionally, they use a spatial pooling layer to summarize sequences of different lengths [57]. They evaluate their approach on a dataset that they collected using a honey client and which consists of 17,877 malicious and 41,127 benign URL sequences. On this dataset, the EDCNN achieves a false-positive rate of 0.148 and an F1-score of 0.72, outperforming a regular CNN, which achieves a false-positive rate of 0.276 and an F1-score of 0.59.

5.3.2 Representation Learning on Structured Data

Cross-Platform Binary Code Similarity Detection

Cross-platform binary code similarity detection aims to identify whether two pieces of binary function code from different platforms implement the same function. To address this problem, Xu et al. propose to learn a metric space for functions [167]. They do so by combining a Siamese network [23] with a Structure2Vec model [36]; that is, they treat the function code as a graph and hence choose Structure2Vec as the graph embedding network in a Siamese learning setting. Xu et al. demonstrate on four datasets that their approach outperforms state-of-the-art approaches by large margins. Additionally, they show that the embedding is learned 20x faster and that it maps binary codes of the same function close to each other and codes of different functions further apart.
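
The Siamese objective used in such a setting can be sketched independently of the graph network. Below, a small linear module stands in for the graph embedding (the Structure2Vec details, feature extraction and the authors' exact loss are omitted and assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder embedding network; random vectors stand in for per-function graph features.
embed = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

def siamese_loss(graph_a, graph_b, label):
    """label = +1 if the two binary functions come from the same source function,
    -1 otherwise. Both inputs pass through the *same* embedding network, and the
    cosine similarity of the embeddings is pushed towards the label."""
    ea, eb = embed(graph_a), embed(graph_b)
    sim = F.cosine_similarity(ea, eb, dim=1)
    return ((sim - label) ** 2).mean()

a, b = torch.randn(16, 64), torch.randn(16, 64)
labels = torch.randint(0, 2, (16,)) * 2 - 1   # -1 / +1
loss = siamese_loss(a, b, labels.float())
loss.backward()
```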

Data Task Architecture details Approach
Classification
Abstract Syntax Tree Author attribution bi-LSTM [123; 63] [4]
Wireless Network traffic Intrusion Detection AE, MLP  [5]
Challenge and response pairs PUF verification SDAE [159], RBM [170]
Network traffic Intrusion Detection VAE-variant [74]  [103]
URL-Sequences Drive-by-attack detection EDCNNs [130]  [130]
Representation Learning
Execution Graph Binary Code Embedding Structure2Vec [36], Siamese Network [23] [167]
Table 4: Deep Learning in information security on structured data

5.4 Text data

5.4.1 Classification on Text Data

Password guessing attacks

Melicher et al. propose DL-based password guessing attacks, which they use to measure the strength of passwords [89]. Their idea is inspired by the work of Sutskever et al., who successfully demonstrated that RNNs can be used to build language models for text, which in turn can also be used for generating text. Melicher et al. use fine-tuned LSTMs [69] to learn a model of the characters of passwords. This model can be used to predict the likelihood of a given password, and from this likelihood they calculate the password's strength. They train their model on a large list of publicly available passwords and show that a DL-based password model predicts passwords better than Markov chains or context-free grammars.
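
A character-level model of this kind assigns a password a probability by chaining per-character predictions. The sketch below is a minimal illustration with an untrained toy LSTM (vocabulary, sizes, and the omission of a start-of-sequence symbol are simplifying assumptions, not the setup of Melicher et al.):

```python
import string
import torch
import torch.nn as nn

chars = string.printable                 # toy vocabulary; real models use a fixed charset
char2idx = {c: i for i, c in enumerate(chars)}

class CharLSTM(nn.Module):
    def __init__(self, vocab, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, 32)
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)               # logits over the next character at every step

def password_log_likelihood(model, password: str) -> float:
    """Sum of log P(c_t | c_<t); a lower likelihood suggests a stronger password.
    For brevity the first character is conditioned on, not predicted."""
    idx = torch.tensor([[char2idx[c] for c in password]])
    logits = model(idx[:, :-1])                      # predict chars 2..n from the prefix
    log_probs = logits.log_softmax(dim=-1)
    targets = idx[:, 1:]
    return log_probs.gather(2, targets.unsqueeze(-1)).sum().item()

model = CharLSTM(len(chars))             # untrained here; would be fit on leaked password lists
print(password_log_likelihood(model, "password123"))
print(password_log_likelihood(model, "X#9qLm!ae2"))
```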

Creating and Detecting Fake Reviews

E-commerce sites sell products which partly derive their reputation from user reviews. Yao et al. use a DL-based language model to create fake reviews, and also propose a defensive model [169]. Their fake review language model is a word-based language model based on LSTMs [63]; LSTMs have been shown to model statistical properties of texts well [29; 149; 42]. They use a large text corpus, the Yelp review database, for training and fine-tune the generated reviews with a simple noun replacement strategy. This noun replacement strategy adapts the review to the context of the review. Yao et al. experimentally show that about 96% of the fake reviews pass automated plagiarism detection, and that human recall and precision are 0.160 and 0.406. To defend against such automated attacks, Yao et al. propose to use a character-based language model, as they find that generated texts have a statistically detectable difference in character distribution compared to text written by real people. Their defense achieves a precision of 0.98 and a recall of 0.97.

5.4.2 Anomaly Detection on Text Data

Log analysis

System logs keep track of activities happening on a computing system. In case of system failures or attacks, system logs can be used to debug such failures or reveal knowledge about an attack. Du et al. propose a neural sequence model-based approach for anomaly detection in system logs called DeepLog [39]. Their neural sequence model consists of a stacked LSTM [63]. To build the sequence model, they manually parse log lines into specific event IDs to construct sequences of events. Such sequences describe workflows on the analyzed system, and the normal workflow model is built from a set of such workflows. The model continuously predicts the next event ID, and if the prediction error for the actually observed event exceeds a certain threshold, the system raises an alarm. Du et al. integrate user feedback to update the model if a false positive has been detected. DeepLog is evaluated on two large system logs, and it achieves an F1-score of up to 96%.
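
The core detection loop of such a sequence model can be sketched as follows. This is a simplified, hypothetical illustration (an untrained LSTM over event IDs that flags an event as anomalous when the model assigns it too little probability); DeepLog's exact thresholding, workflow modelling and feedback mechanism are more involved:

```python
import torch
import torch.nn as nn

NUM_EVENTS, WINDOW = 50, 10   # number of distinct event IDs and history length (placeholders)

class NextEventModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_EVENTS, 32)
        self.lstm = nn.LSTM(32, 64, num_layers=2, batch_first=True)  # stacked LSTM
        self.out = nn.Linear(64, NUM_EVENTS)

    def forward(self, history):                       # history: (batch, WINDOW)
        h, _ = self.lstm(self.embed(history))
        return self.out(h[:, -1])                     # logits for the next event ID

def is_anomalous(model, history, observed_event, threshold=0.01):
    """Raise an alarm if the observed next event is too unlikely under the model."""
    probs = model(history).softmax(dim=-1)
    return probs[0, observed_event].item() < threshold

model = NextEventModel()                              # would be trained on normal workflows
history = torch.randint(0, NUM_EVENTS, (1, WINDOW))   # last 10 parsed event IDs
print(is_anomalous(model, history, observed_event=7))
```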

5.4.3 Clustering on Text Data

Log Analysis

Thaler et al. propose an unsupervised method for signature extraction from forensic logs [152]. This approach groups log lines based on the print statement that originated them. Their proposed method combines a neural language model [42] with a clustering method. They train the neural language model as an RNN autoencoder to learn a representation of the log lines, which captures the complex dependency between mutable and immutable parts of a log line. These representations are then clustered. They evaluate their approach on three datasets, one self-created and two public system logs with 11,023, 474,796 and 716,577 log lines. Their method clusters log lines with a V-measure of 0.930 to 0.948.


Data Task Architecture details Approach
Classification
Characters Password strength guessing LSTM [69] [89]
Text Detecting fake reviews LSTM [63] [169]

Anomaly Detection
Logs Anomaly detection LSTM [63] [39]
Clustering
Logs Signature Extraction Neural Language Model [42] [152]
Table 5: DL in InfoSec on text data

5.5 Multi-modal data

5.5.1 Classification on Multi-Modal Data

Cyber bullying detection

Zhong et al. propose a DL-based method to detect bullying in cyberspace [181]. To do so, they build a model from images and comments on Instagram. For the classification of the images, they use a pre-trained version of AlexNet, which they fine-tune on their dataset. For representing text, they use word2vec [93] and other, shallow representations such as bag-of-words. Zhong et al. experiment with different configurations and combinations and conclude that the title of the post is one of the strongest predictors of cyberbullying.

ECG Biometrics

With cheap mobile sensors available, ECG can be used as a modality for biometrics. Da Silva Luz et al. propose a method for ECG-based biometrics which treats the ECG data as both raw sequential data and spatial data [35]. They derive the 2D spatial data from the raw ECG data by calculating a spectrogram. They build two models from these two data modalities, a 1D and a 2D CNN. The outputs of these two networks are merged using either a sum, a multiplication or a mean rule. To capture a whole ECG cycle, both CNN models have large first-layer convolutions. The method is evaluated on multiple datasets; the fusion of the two model results consistently increases the accuracy of the model.

Data Task Architecture details Approach
Text, Spatial Cyber bullying detection Word2Vec [93], AlexNet [78] [181]
Raw ECG, ECG Spectrum Biometrics 1D CNN, 2D CNN [35]
Table 6: Deep Learning in information security on multi-modal data

5.6 Security properties of DL algorithms

Privacy of DL Algorithms

Although DL algorithms are successful in many domains, in some areas such as healthcare their uptake has been limited due to privacy and confidentiality concerns of the data owners. Shokri and Shmatikov proposed a system that enables multiple parties to jointly learn and use a DL model in a privacy-preserving way [133]. The core of their idea is a novel training procedure: distributed selective stochastic gradient descent (DSSGD). In this procedure, each participant downloads the global model parameters to their local system. On their local system, they compute the gradients using stochastic gradient descent and submit only a selection of the gradients to the server. The submitted gradients are either chosen as those with the largest magnitude or randomly subsampled from the gradients whose magnitude exceeds a threshold; the fraction to share and the threshold are hyper-parameters. Their approach is not geared towards a particular DL architecture, but they validate it on two datasets, MNIST and SVHN, with two models, an MLP and a CNN. They demonstrate that the models learned by their approach guarantee differential privacy for the data owners without sacrificing predictive performance. Instead of calculating the privacy loss per parameter, Abadi et al. calculate the privacy loss per model to select the parameter updates and ensure differential privacy for stochastic gradient descent [1]. The calculation of privacy loss per model is beneficial in scenarios where a whole model will be copied to a target device, for example in the context of mobile phones. The main components of their approach are a differentially private version of the SGD algorithm, the moments accountant, and hyperparameter tuning. Abadi et al. experimentally show that differential privacy only has a modest impact on the accuracy of their baseline. Phan et al. propose a privacy-preserving version of a deep AE that enforces ϵ-differential privacy by modifying the objective function [110]. They achieve ϵ-differential privacy by replacing the objective with a functional mechanism that inserts noise [178]. Additionally, Phan et al. conduct a sensitivity analysis. Hitaj et al. develop an attack on distributed, decentralized DL and show that data of honest participants is leaked [62]. Their attack uses generative adversarial networks (GANs) [52] and assumes an insider at the victim's system. The insider uses a GAN to learn to create objects similar to one of the victim's classes and injects these images into the learning process under a different class. As a result, the victim has to reveal more gradients about the original class, thereby leaking information about the objects. Phong et al. demonstrate that the scheme proposed by Shokri and Shmatikov [133] can leak data to an honest-but-curious server [111]. They propose to address this by combining asynchronous DL with additively homomorphic encryption. In their evaluation setting, the computational overhead for adding homomorphic encryption to asynchronous DL is around 3.3 times that of not using encryption; however, no information is leaked to the server.
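
The selective-sharing step at the heart of DSSGD can be illustrated in isolation. The NumPy sketch below only shows uploading the largest-magnitude fraction of a local gradient vector; the gradient clipping, thresholded random sampling and differential-privacy noise of the full protocol are omitted:

```python
import numpy as np

def select_gradients(gradients: np.ndarray, share_fraction: float = 0.1) -> np.ndarray:
    """Return a sparse copy of the local gradient in which only the
    largest-magnitude fraction of entries is kept; only these values
    would be uploaded to the parameter server."""
    k = max(1, int(share_fraction * gradients.size))
    keep = np.argsort(np.abs(gradients))[-k:]        # indices of the k largest gradients
    shared = np.zeros_like(gradients)
    shared[keep] = gradients[keep]
    return shared

rng = np.random.default_rng(42)
local_grad = rng.normal(size=1_000_000)              # gradient of one participant
upload = select_gradients(local_grad, share_fraction=0.01)
print(np.count_nonzero(upload))                      # 10000 of 1,000,000 entries shared
```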

Adversarial Attacks on DL Algorithms

In their work, Szegedy et al. describe that it is easy to construct images that appear visually similar to a human observer, but will cause a DL model to misclassify [150]. This approach requires knowledge about the parameters and architecture of the model. As an extension to that, Papernot et al. experimentally show that it is also possible to create adversarial attacks knowing only the labels for a given input [105]. Their black-box attack has an adversarial success rate of up to 96.19%. Further, Papernot et al. introduce a class of algorithms that craft adversarial images [106], i.e., they propose an algorithm that poses a threat to the integrity of the classification. Their algorithm uses the forward derivative and a saliency map to craft adversarial images; evaluated on feed-forward deep neural networks at test time, it achieves an adversarial success rate of 97%. In addition to that, Papernot et al. propose a defensive mechanism against adversarial attacks – defensive distillation [107]. Defensive distillation is a training procedure in which additional knowledge about training points is extracted and fed back into the training algorithm, making the trained model more robust. They experimentally validate that defensive distillation can reduce the effectiveness of adversarial attacks to 0.5%. Shen et al. show that poisoning attacks are possible in a collaborative DL setting with a success rate of 99% [129]. As a defense, Shen et al. propose a collaborative system – AUROR – that detects malicious users and corrects inputs. Malicious updates are detected as gradients that follow an abnormal probability distribution. AUROR reduces the success rate of poisoning attacks to 3%. Meng et al. propose MagNet, another defense system against gray-box adversarial attacks that is independent of the target classifier and of the adversarial example generation process [90]. MagNet consists of two components, a detector network and a reformer network. The detector network is an AE that attempts to distinguish between real and adversarial images, and the reformer network, which is also an AE, attempts to reconstruct the test input. Both the detector and the reformer network are trained with Gaussian noise. Examples with a large reconstruction error in the detector network are rejected; examples that are not rejected are passed through the reformer to be moved closer to the original manifold, disturbing their adversarial capability. During the evaluation, MagNet shows a minor reduction in classification accuracy due to the loss caused by the reformer network, but it is very effective at preventing adversarial attacks.
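
Gradient-based attacks of this family are easy to state. The sketch below implements the well-known fast gradient sign method (FGSM), a simpler white-box attack in the same family, rather than the saliency-map algorithm of Papernot et al.; the toy model and the perturbation budget epsilon are placeholder assumptions:

```python
import torch
import torch.nn as nn

def fgsm_attack(model, image, label, epsilon=0.03):
    """Perturb the input in the direction of the sign of the loss gradient,
    so the per-pixel change is small but pushes the model towards misclassification."""
    image = image.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Toy classifier standing in for the victim model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
image = torch.rand(1, 1, 28, 28)
label = torch.tensor([3])
adv_image = fgsm_attack(model, image, label)
print((adv_image - image).abs().max())   # perturbation bounded by epsilon
```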

Security Property Task Architecture details Approach
Privacy Protect users privacy custom MLP, CNN [133]
Privacy Protect users privacy custom MLP, CNN [1]
Privacy Protect users privacy SDAE [110]
Privacy Attack privacy preserving systems GAN[52] [62]
Privacy Protect users privacy [111] [111]
Integrity Craft adversarial examples custom MLP [106]
Integrity Defend against adversarial examples custom MLP [107]
Integrity Black box adversarial attacks custom MLP [105]
Integrity Defend against poisoning attacks custom MLP [129]
Integrity Defend against adversarial examples MagNet, SDAE [90]
Table 7: Security properties of DL algorithms

6 Discussion

In this survey, we have reviewed 77 papers that are about DL in the domain of InfoSec. While machine learning has played a vital role in many aspects of InfoSec for a long time, the application of DL in InfoSec is a more recent development – 64 of the 77 papers were published after 2016.

A broad variety of different model architectures and methods have been applied in the reviewed papers. Generally, CNNs are the most common architecture for spatial data, and RNNs are the most common architecture for sequential, structured and text data. The observation that multiple deep architectures can successfully be applied to the same task leads to the conjecture that the exact composition of the architecture is not the only crucial aspect for achieving good results. Often, good results are achieved by particular pre-processing or data augmentation methods, or by using more data from other sources. Except for these broad categorizations, it is difficult to draw a definite conclusion about which architectures and solutions excel at specific tasks. In "traditional" machine learning, the main problem to solve was how to design features that model the data well. In DL, the problem has shifted to choosing a suitable architecture and finding the right hyper-parameters for solving a task. In the DL community, there are a few empirically motivated guidelines for choosing hyper-parameters [22; 11; 60; 97; 14], as well as automated approaches for hyper-parameter and architecture search [184; 15; 16; 139]. Within InfoSec, finding the right architecture and hyper-parameters has mainly been a manual effort, or successful architectures from other domains have been adapted.

Most of the tasks in this survey concerned image or audio data. This is not surprising, because DL methods for these two data types have been hugely successful in other domains. More than one third of the surveyed papers were about biometrics, where DL methods were used to identify individuals from sensor readings. Another prominent task was breaking CAPTCHAs, which usually involves analyzing some form of distorted image or audio clip. Other tasks, such as malware analysis or intrusion detection, which involve complex data or data that has not been studied widely outside of InfoSec, have far fewer applications.

Of the reviewed InfoSec problems, 46 were framed as classification tasks. Using DL for classification in a supervised setting has been well researched, and many results have shown that DL methods are capable of finding solutions to classification problems that generalize well [177]. The second largest use case of DL methods in InfoSec is to replace manual feature engineering by automatically learned features, i.e., to learn the representations from the data.

Only a few papers explain the results that they achieved. The other surveyed papers focus on performance rather than on an explanation of the results. Further, lessons learned are mainly about architecture decisions. While performance is indeed important, such observations do not contribute to a better understanding of the problem or the solution.

6.1 Challenges

Applying DL methods to InfoSec problems presents specific challenges. Some of these challenges exist because of additional requirements in the InfoSec domain. Others arise because of peculiarities of the data types that need to be analyzed. There are also general challenges concerning DL methods which are also of interest in InfoSec.

One such challenge for InfoSec problems is that errors are generally costly. Consider an intrusion detection system. A false negative, i.e., an undetected intrusion, can be costly because the attacker can carry out their attack, which usually has some cost associated with it. False positives, on the other hand, are costly because a security officer needs to investigate the incident. Too many false positives may lead to many hours of effort at best, and to security officers ignoring incidents in the worst case.

Another challenge that needs to be addressed for InfoSec problems is that, in many cases, the predictions of a machine learning model need to be transparent and understandable to humans, as experts need to interact with the DL models and their predictions. For example, in information forensics, the result of a model must be presentable in a court and understandable by the people there, and under some legislation, people have the right to an explanation if an automated decision has been made about them [43]. Alternatively, in an intrusion detection setting, a security officer needs to understand why a specific intrusion has happened, not only that an intrusion has happened.

Further, within the context of InfoSec, the data that is analyzed is often highly structured and carries a lot of implicit and explicit semantic meaning. For example, data that needs to be analyzed in a security context includes log files, which are often structured into columns. There is no need to use DL to figure out that a character string in the date column represents a date. Outside of InfoSec, the data types where DL algorithms excel are mainly unstructured data such as the pixels of an image, text, genetic sequences or the frequencies of a sound file.

One other major challenge is the changing nature of the adversaries in information security. Attacks and defenses evolve continually. This changing nature has to be reflected in the DL models as well. A DL model that detects malware today may not be useful in the future, as the malware may change its behavior. This is very different from many other domains where machine learning is applied, where the task to be modeled remains the same.

Then, in many cases, problems in InfoSec arise in highly artificial, human-created contexts where a lot of domain knowledge is available. For example, malware is analyzed in the context of an operating system, and the inner workings of an operating system are often available. Currently, such information is rarely used because it is challenging to combine DL models with domain knowledge.

Finally, the DL model that is learned is only as good as the data that has been used to build it. Since mistakes in InfoSec are costly, ensuring high-quality data as well as high-quality labels to build the models poses another significant challenge. Labeling the data is often non-trivial and labor-intensive, and ensuring that a model trained on some data is qualified for a task is challenging as well.

6.2 Research Directions

DL methods offer solutions to many problems that are difficult to solve by other means. However, as outlined in the previous paragraphs many challenges remain, and they are currently not very well addressed. We do believe that high performance alone is not enough and that addressing these challenges is a vital pre-condition for DL methods to be applied in a practical InfoSec setting. In this section, we outline potential research directions and point to developments in the DL community that could lead to potential solutions.

Adding domain knowledge

Currently, combining domain knowledge with DL models poses a challenge. Domain knowledge is usually incorporated into DL methods via data pre-processing or augmentation methods, or via design choices in the model architecture. DL models are built with only very general assumptions about the data; thus, adding domain knowledge is likely a hard problem.

Three potential ways to add domain knowledge to DL models are the regularization of the learning process, customization of the objective functions or changes in the learning procedure.

Model Adaptability

One capability that DL models are currently lacking is model adaptability. That is, the capability to tune a model based on the judgment of an administrator. DL models can, of course, be retrained or updated with additional data, but this does not reflect the changing nature of certain threats in InfoSec.

One-shot [161; 76; 157] or zero-shot [140; 67] learning may offer a potential solution to adaptable models. Instead of learning directly to classify, these approaches learn a metric space using DL. The models learn features that distinguish certain objects in that metric space. This metric space allows new instances of entirely different objects to be described in such a space without having to re-train the model.

Another way to achieve model adaptability is to learn DL models that represent specific, static features. Instead of modeling malware behavior using different feature types and training one model, one could train a model for each static feature type where the meaning is clearly understood. An example of such a feature is the number of file accesses over time, modeled using an RNN. The meaning of such a prediction is well understood – it predicts whether the number of file accesses is within a reasonable range or unusually high. A combination of different static models could be used to build more complicated but easy-to-adapt detection engines.

Model Interpretability

Currently, DL methods in InfoSec focus mainly on achieving high performance for the given task. However, in many cases, the predictions of a DL model need to be understood by a human operator. Model interpretability can be understood in two ways. Either, the model and its inner workings are comprehensible, or its predictions are understandable [83]. For InfoSec, we believe that the latter is more important.

Recently, work on explaining predictions of DL models has begun. Lei et al. and Ribeiro et al. worked on explaining predictions of classifiers [80; 116] by extending the model with parts that highlight the inputs that were responsible for causing the result. Ritter et al. were inspired by methods from cognitive psychology to interpret model results and to identify biases in predictions [118]. Montavon et al. summarize currently available tools for interpreting model predictions [98]. Karpathy et al. visualized the activations of RNNs to shed light on their inner workings [71].

Deep metric learning may offer another path to understandable representations of the data. Deep metric learning maps an input feature space to a representation feature space in such a way that a distance metric such as the Euclidean distance obtains a meaning. Deep metric learning has been successfully applied to signature [23] and face verification [164; 122] problems as well as to cross-platform binary code similarity detection [167]. Potentially, the learned metric space can be interpreted, though very little research has been directed towards that goal.

Variational inference may offer yet another way to learn understandable representations. Yan et al. trained variational AE in a conditional way [168]. The resulting latent space encoded various attributes of the source images, such as objects, rotation, and shading. Bengio et al. propose a similar idea [153].

Finally, reverse engineering and analyzing the predicted results provides valuable insights on how to address problems in a particular domain. An example is DeepMind's AlphaGo, which used deep reinforcement learning to learn the game of Go [134]. AlphaGo devised a set of previously unknown strategies to play the game, which also increased the capability of humans to play the game. Similarly, analyzing learned features may provide useful feedback on how to construct manual features or how to analyze data for particular problems.

Feedback Loop

Another important aspect which is currently under-researched is how to combine the work of human experts with DL models. In particular, two questions should be addressed: how to change models after an incorrect prediction has been made, and how to tune models towards certain thresholds that a human operator sets. In both cases, changes to the model need to be made in such a way that they do not disturb the quality of other predictions. For both questions, methods from active learning [125] and changes to stochastic gradient descent may provide potential answers.

Data Quality

DL models solve tasks by deriving useful representations from data. Consequently, the quality of the model depends largely on the quality of the data. Proper models can only be learned if the data is representative of the domain and the problems to be solved; hence, the data should be representative of the problem domain and free from errors and biases. Bolukbasi et al. demonstrated that DL models trained on a large text corpus inherit the biases that are within the data [20]. In InfoSec, this may pose a significant problem. The DL community currently lacks tools to decide on the quality of the underlying data; developing such data quality assurance tools and using them should become a best practice when DL methods are applied in the context of InfoSec. Research in ensuring bias-free data may also gain further traction, since legislation in some parts of the world requires automated algorithms to be discrimination-free, e.g., in the European Union [43].

Much data in InfoSec is structured data that carries a meaning. For example, the source IP field of an IP packet intrinsically carries much meaning. However, DL methods generally work best for loosely correlated and unstructured data such as images or audio files. Consequently, two potential research directions are methods for decomposing structured data or research on novel models that are well-suited for structured data. An example for such a model is structure2vec, which uses DL techniques to learn a vector representation for structured data [142].

High-quality labels are an essential ingredient for high-quality datasets, and consequently for good models. Obtaining such labels and maintaining them is often labor-intensive and error-prone. Four areas of research could be pursued to mitigate the challenge of obtaining labels in InfoSec: unsupervised learning, active learning, transfer learning and metric learning. Unsupervised learning methods such as AEs and restricted Boltzmann machines are already widely used, both within the domain of InfoSec and in other domains. Transfer learning allows training a model on a large corpus of unrelated data and then fine-tuning it on a smaller set of labeled data for the task at hand. Active learning methods make efficient use of the available labeled data by carefully selecting which data is used per training batch. Finally, deep metric learning, in particular triplet networks, may offer a solution to hard-to-obtain labels, because such networks can be trained by solving a ranking problem instead of a hard classification task. Ranking data is much easier to obtain than accurately labeled data.

Offensive DL

Except for breaking CAPTCHAs, DL methods for offensive purposes are currently rarely researched. DL methods show significant potential to be useful in side-channel attacks. Another possibility could be to use deep reinforcement learning to train an agent that automatically attacks a system [96]. Finally, DL could be used to design chatbots for phishing attacks.

7 Related Work

To the best of our knowledge, this is the first systematic review of DL in InfoSec from a data-centric perspective. From a broader perspective, machine learning has attracted many researchers in many sub-domains of InfoSec. In their workshop manifesto, Joseph et al. provide a broad perspective on machine learning in security as well as general directions for future research ([68], [40]). The use of machine learning has been investigated in other InfoSec sub-domains such as: intrusion detection ([141], [9], [24]); anomaly detection ([26], [19]); malware detection and classification ([126], [46]); and information forensics [6].

8 Conclusion

This paper presents a systematic literature review on the application of DL methods in InfoSec research. We have reviewed 77 papers and presented them from a data-centric perspective, i.e., which tasks were performed on what data type. We have categorized these papers according to five different data types: sequential data, spatial data, structured data, text data, as well as combinations thereof. Additionally, we have reviewed papers that investigate the security properties of DL algorithms, such as the privacy and integrity of the learning methods.

For some well-defined problems such as biometric matching or attacking CAPTCHAs, DL methods that are successful in other domains can readily be applied and achieve state-of-the-art results. In particular, DL methods excel at machine learning tasks that are well-defined and where sufficient labeled data is available. However, many machine learning tasks in the domain of InfoSec face a variety of unsolved challenges. Tasks are frequently hard to define, labeled data is difficult to obtain, and the data is often highly structured. Another challenge is the volatility of the machine learning tasks, e.g., attackers continuously change their behavior, which is difficult to model. Besides that, DL models in InfoSec should fulfill special requirements. Domain knowledge needs to be combined with automated models, and the predictions of a model should be humanly understandable so that a security officer can judge an automated analysis and investigate accordingly. These requirements open ample opportunities for research, such as adapting and tuning models, combining domain knowledge with models, transparency of the models, or learning to understand the predictions of DL methods.

To conclude, we want to re-iterate one of the most significant merits of DL methods, namely that DL methods derive useful representations from data, which leads to two very desirable consequences. First, it saves the manual effort that was previously required to hand-craft features, especially in domains where hand-crafting features is difficult because the domain is hard to understand. Secondly, it allows models and methods that are successful on a specific data type in one domain to also be applied to problems in other domains with a similar data type. These two merits potentially lead to synergies between domains that have previously been disconnected, for example, health care and InfoSec. Advances in one domain, e.g., on the transparency of the models, can readily be deployed in other domains.

Acknowledgment

The work presented in this paper is part of a project which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 780495.

References

  • Abadi et al. [2016] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep Learning with Differential Privacy. In Shai Halevi, editor, Proceedings of the ACM Conference on Computer and Communications Security, pages 308–318. ACM, 2016. ISBN 9781450341394. doi: 10.1145/2976749.2978318. URL http://arxiv.org/abs/1607.00133.
  • ACM [2012] ACM. ACM Computing Classification System, 2012. URL https://www.acm.org/publications/class-2012.
  • Algwil et al. [2016] Abdalnaser Algwil, Dan Ciresan, Beibei Liu, and Jeff Yan. A security analysis of automated Chinese Turing tests. In Proceedings of the 32nd Annual Conference on Computer Security Applications - ACSAC ’16, pages 520–532, 2016. ISBN 9781450347716. doi: 10.1145/2991079.2991083. URL http://dl.acm.org/citation.cfm?doid=2991079.2991083.
  • Alsulami et al. [2017] Bander Alsulami, Edwin Dauber, Richard Harang, Spiros Mancoridis, and Rachel Greenstadt. Source Code Authorship Attribution Using Long Short-Term Memory Based Networks, volume 10492 LNCS. 2017. ISBN 978-3-319-66401-9. doi: 10.1007/978-3-319-66402-6_6. URL http://link.springer.com/10.1007/978-3-319-66402-6_6.
  • Aminanto et al. [2018] Muhamad Erza Aminanto, Rakyong Choi, Harry Chandra Tanuwidjaja, Paul D. Yoo, and Kwangjo Kim. Deep Abstraction and Weighted Feature Selection for Wi-Fi Impersonation Detection. IEEE Transactions on Information Forensics and Security, 13(3):621–636, 2018. ISSN 1556-6013. doi: 10.1109/TIFS.2017.2762828. URL http://ieeexplore.ieee.org/document/8067440/.
  • Ariu et al. [2011] Davide Ariu, Giorgio Giacinto, and Fabio Roli. Machine learning in computer forensics (and the lessons learned from machine learning in computer security). In Proceedings of the 4th ACM workshop on Security and artificial intelligence AISec 11, pages 99–103. ACM, 2011. ISBN 9781450310031. doi: 10.1145/2046684.2046700. URL http://dl.acm.org/citation.cfm?id=2046700.
  • Azimpourkivi et al. [2017] Mozhgan Azimpourkivi, Umut Topkara, and Bogdan Carbunar. A Secure Mobile Authentication Alternative to Biometrics. In Proceedings of the 33rd Annual Computer Security Applications Conference on - ACSAC 2017, volume Part F1325, pages 28–41, 2017. ISBN 9781450353458. doi: 10.1145/3134600.3134619. URL http://dl.acm.org/citation.cfm?doid=3134600.3134619.
  • Barker [2003] William C Barker. Guidelines for Identifying an Information System as a National Security System. Technical report, 2003.
  • Beghdad [2008] Rachid Beghdad. Critical study of neural networks in detecting intrusions. Computers & Security, 27(5):168–175, 2008. ISSN 01674048. doi: 10.1016/j.cose.2008.06.001. URL http://dx.doi.org/10.1016/j.cose.2008.06.001.
  • Bengio [2009] Yoshua Bengio. Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127, 2009. ISSN 1935-8237. doi: 10.1561/2200000006.
  • Bengio [2012] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer, 2012.
  • Bengio et al. [2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. ISSN 0162-8828. doi: 10.1109/TPAMI.2013.50.
  • Bergstra and Bengio [2012] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • Bergstra et al. [2013] James Bergstra, Daniel Yamins, and David Daniel Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. 2013.
  • Bergstra et al. [2011] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pages 2546–2554, 2011.
  • Bharadwaj et al. [2016] Samarth Bharadwaj, Himanshu S. Bhatt, Mayank Vatsa, and Richa Singh. Domain Specific Learning for Newborn Face Recognition. IEEE Transactions on Information Forensics and Security, 11(7):1630–1641, 2016. ISSN 15566013. doi: 10.1109/TIFS.2016.2538744.
  • Bharati et al. [2016] Aparna Bharati, Richa Singh, Mayank Vatsa, and Kevin W. Bowyer. Detecting Facial Retouching Using Supervised Deep Learning. IEEE Transactions on Information Forensics and Security, 11(9):1903–1913, 2016. ISSN 15566013. doi: 10.1109/TIFS.2016.2561898. URL http://dblp.uni-trier.de/db/journals/tifs/tifs11.html#BharatiSVB16.
  • Bhuyan et al. [2014] Monowar H Bhuyan, D K Bhattacharyya, and J K Kalita. Network Anomaly Detection: Methods, Systems and Tools. Communications Surveys & Tutorials, IEEE, 16(1):303–336, 2014. ISSN 1553-877X. doi: 10.1109/SURV.2013.052213.00046. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6524462.
  • Bolukbasi et al. [2016] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357, 2016.
  • Borgolte et al. [2015] Kevin Borgolte, Christopher Kruegel, and Giovanni Vigna. Meerkat: Detecting Website Defacements through Image-based Object Recognition. In USENIX Security, pages 595–610, 2015. ISBN 9781931971232.
  • Bottou [2012] Léon Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
  • Bromley et al. [1994] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.
  • Buczak and Guven [2016] Anna L Buczak and Erhan Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2):1153–1176, 2016.
  • Cao and Jain [2015] Kai Cao and Anil K. Jain. Latent orientation field estimation via convolutional neural network. In Proceedings of 2015 International Conference on Biometrics, ICB 2015, pages 349–356, 2015. ISBN 9781479978243. doi: 10.1109/ICB.2015.7139060.
  • Chandola et al. [2009] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 41(3), 2009.
  • Chatfield et al. [2014] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. arXiv preprint arXiv:1405.3531, pages 1–11, 2014. doi: 10.5244/C.28.6. URL http://arxiv.org/abs/1405.3531.
  • Chen et al. [2017] Z Chen, B Yu, Y Zhang, J Zhang, and J Xu. Automatic mobile application traffic identification by convolutional neural networks. In Proceedings - 15th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 10th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Symposium on Parallel and Distributed Processing with Applications, pages 301–307, 2017. ISBN 9781509032051. doi: 10.1109/TrustCom.2016.0077. URL https://www.scopus.com/inward/record.uri?eid=2-s2.0-85015242258&doi=10.1109%2FTrustCom.2016.0077&partnerID=40&md5=fd7fec4b091b907e595981134509f910.
  • Cho et al. [2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014. doi: 10.3115/v1/D14-1179. URL http://arxiv.org/abs/1406.1078.
  • Chua et al. [2017] Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang. Neural Nets Can Learn Function Type Signatures From Binaries. In 26th USENIX Security Symposium (USENIX Security 17), volume 17, pages 99–116, 2017. ISBN 978-1-931971-40-9. URL https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/chua.
  • Chung et al. [2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • Chung et al. [2015] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075, 2015.
  • Ciregan et al. [2012] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.
  • Czajka et al. [2017] Adam Czajka, Kevin W. Bowyer, Michael Krumdick, and Rosaura G. Vidalmata. Recognition of Image-Orientation-Based Iris Spoofing. IEEE Transactions on Information Forensics and Security, 12(9):2184–2196, 2017. ISSN 15566013. doi: 10.1109/TIFS.2017.2701332.
  • Da Silva Luz et al. [2018] Eduardo Jose Da Silva Luz, Gladston J.P. Moreira, Luiz S. Oliveira, William Robson Schwartz, and David Menotti. Learning Deep Off-the-Person Heart Biometrics Representations. IEEE Transactions on Information Forensics and Security, 13(5):1258–1270, 2018. ISSN 15566013. doi: 10.1109/TIFS.2017.2784362.
  • Dai et al. [2016] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.
  • Deng et al. [2014] Li Deng, Dong Yu, and others. Deep learning: methods and applications. Foundations and Trends®in Signal Processing, 7(3–4):197–387, 2014.
  • Donahue et al. [2014] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Icml, volume 32, page 647–655, 2014. ISBN 9781634393973. URL http://arxiv.org/abs/1310.1531.
  • Du et al. [2017] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security - CCS ’17, pages 1285–1298, 2017. ISBN 9781450349468. doi: 10.1145/3133956.3134015. URL http://dl.acm.org/citation.cfm?doid=3133956.3134015.
  • Dua and Du [2013] Sumeet Dua and Xian Du. Data Mining and Machine Learning in Cybersecurity, volume 53. CRC press, 2013. ISBN 9788578110796. doi: 10.1017/CBO9781107415324.004.
  • Dumoulin and Visin [2016] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
  • Dzmitry Bahdana et al. [2014] Dzmitry Bahdana, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation By Jointly Learning To Align and Translate. Iclr 2015, pages 1–15, 2014. ISSN 0147-006X. doi: 10.1146/annurev.neuro.26.041002.131047. URL http://arxiv.org/abs/1409.0473v3.
  • EU [2016] EU. GENERAL DATA PROTECTION REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 27 April 2016. Official Journal of the European Union, 2016.
  • Ferreira et al. [2017] Anselmo Ferreira, Luca Bondi, Luca Baroffio, Paolo Bestagini, Jiwu Huang, Jefersson A. Dos Santos, Stefano Tubaro, and Anderson Rocha. Data-Driven Feature Characterization Techniques for Laser Printer Attribution. IEEE Transactions on Information Forensics and Security, 12(8):1860–1873, 2017. ISSN 15566013. doi: 10.1109/TIFS.2017.2692722.
  • Galea and Farrugia [2018] Christian Galea and Reuben A. Farrugia. Matching Software-Generated Sketches to Face Photographs with a Very Deep CNN, Morphed Faces, and Transfer Learning. IEEE Transactions on Information Forensics and Security, 13(6):1421–1431, 2018. ISSN 15566013. doi: 10.1109/TIFS.2017.2788002.
  • Gandotra et al. [2014] Ekta Gandotra, Divya Bansal, and Sanjeev Sofat. Malware Analysis and Classification: A Survey. Journal of Information Security, 05(02):56–64, 2014. ISSN 2153-1234. doi: 10.4236/jis.2014.52006.
  • Gao et al. [2013] Haichang Gao, Wei Wang, Jiao Qi, Xuqin Wang, Xiyang Liu, and Jeff Yan. The robustness of hollow CAPTCHAs. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security - CCS ’13, pages 1075–1086, 2013. ISBN 9781450324779. doi: 10.1145/2508859.2516732. URL http://dl.acm.org/citation.cfm?doid=2508859.2516732.
  • Gao et al. [2015] S. Gao, Y. Zhang, K. Jia, J. Lu, and Y. Zhang. Single Sample Face Recognition via Learning Deep Supervised Autoencoders. IEEE Transactions on Information Forensics and Security, 10(10):2108–2118, 2015. ISSN 15566013. doi: 10.1109/TIFS.2015.2446438.
  • Gates and Taylor [2007] Carrie Gates and Carol Taylor. Challenging the Anomaly Detection Paradigm: A Provocative Discussion. In Proceedings of the 2006 workshop on New security paradigms, pages 21–29. ACM, 2007. ISBN 9781595939234. doi: 10.1145/1278940.1278945.
  • Gers et al. [1999] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.
  • Gonzalez-Sosa et al. [2018] Ester Gonzalez-Sosa, Julian Fierrez, Ruben Vera-Rodriguez, and Fernando Alonso-Fernandez. Facial soft biometrics for recognition in the wild: Recent works, annotation, and COTS evaluation. IEEE Transactions on Information Forensics and Security, 13(8):2001–2014, 2018. ISSN 15566013. doi: 10.1109/TIFS.2018.2807791.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. ISBN 9780262035613. URL http://www.deeplearningbook.org.
  • Goswami et al. [2017] Gaurav Goswami, Mayank Vatsa, and Richa Singh. Face Verification via Learned Representation on Feature-Rich Video Frames. IEEE Transactions on Information Forensics and Security, 12(7):1686–1698, 2017. ISSN 15566013. doi: 10.1109/TIFS.2017.2668221.
  • Han et al. [2017] Yi Han, Sriharsha Etigowni, Hua Li, Saman Zonouz, and Athina Petropulu. Watch Me, but Don’t Touch Me! Contactless Control Flow Monitoring via Electromagnetic Emanations. In Proceedings of the ACM Conference on Computer and Communications Security, pages 1095–1108, 2017. ISBN 9781450349468. doi: 10.1145/3133956.3134081. URL http://arxiv.org/abs/1708.09099.
  • Hand and Chellappa [2017] E M Hand and R Chellappa. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In 31st AAAI Conference on Artificial Intelligence, AAAI 2017, pages 4068–4074, 2017. URL https://www.scopus.com/inward/record.uri?eid=2-s2.0-85030228594&partnerID=40&md5=8a379a7b3a9d70c1e565882b78be99ce.
  • He et al. [2014] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In european conference on computer vision, pages 346–361. Springer, 2014.
  • Hinton et al. [2012] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and others. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012. ISSN 1053-5888. doi: 10.1109/MSP.2012.2205597.
  • Hinton [2009] Geoffrey E Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009.
  • Hinton [2012] Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural networks: Tricks of the trade, pages 599–619. Springer, 2012.
  • Hinton et al. [2006] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
  • Hitaj et al. [2017] Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning. In Proceedings of the ACM Conference on Computer and Communications Security, pages 603–618, 2017. ISBN 9781450349468. doi: 10.1145/3133956.3134012. URL http://arxiv.org/abs/1702.07464.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hornik [1991] Kurt Hornik. Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks, 4(2):251–257, 1991.
  • Huang et al. [2010] Yuchi Huang, Qingshan Liu, Shaoting Zhang, and Dimitris N Metaxas. Image retrieval via probabilistic hypergraph ranking. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3376–3383. IEEE, 2010.
  • IEEE [2017] IEEE. IEEE Taxonomy 2017, 2017. URL https://www.ieee.org/content/dam/ieee-org/ieee/web/org/pubs/taxonomy_v101.pd.
  • Johnson et al. [2016] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, and others. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. arXiv preprint arXiv:1611.04558, 2016.
  • Joseph et al. [2013] Anthony D Joseph, Pavel Laskov, Fabio Roli, J Doug Tygar, and Blaine Nelson. Machine Learning Methods for Computer Security (Dagstuhl Perspectives Workshop 12371). Dagstuhl Manifestos, 3(1):1–30, 2013. ISSN 2193-2433. doi: http://dx.doi.org/10.4230/DagMan.3.1.1. URL http://drops.dagstuhl.de/opus/volltexte/2013/4356.
  • Jozefowicz et al. [2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pages 2342–2350, 2015.
  • Kandi et al. [2017] Haribabu Kandi, Deepak Mishra, and Subrahmanyam R.K.Sai Gorthi. Exploring the learning capabilities of convolutional neural networks for robust image watermarking. Computers and Security, 65:247–268, 2017. ISSN 01674048. doi: 10.1016/j.cose.2016.11.016.
  • Karpathy et al. [2016] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and Understanding Recurrent Networks. ICLR Workshop, pages 1–13, 2016.
  • Keogh and Mueen [2011] Eamonn Keogh and Abdullah Mueen. Curse of dimensionality. In Encyclopedia of machine learning, pages 257–258. Springer, 2011.
  • Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 12 2014. URL http://arxiv.org/abs/1412.6980.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kissel [2013] Richard Kissel. Glossary of Key Information Security Terms. 2013.
  • Koch et al. [2015] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • Krizhevsky [2012] Alex Krizhevsky. cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks. Source code available at https://github.com/akrizhevsky/cuda-convnet2 [March, 2017], 2012.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lei et al. [2016] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing Neural Predictions. EMNLP 2016, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, 2016. URL http://arxiv.org/abs/1606.04155.
  • Li et al. [2018] Yanxiong Li, Xue Zhang, Xianku Li, Yuhan Zhang, Jichen Yang, and Qianhua He. Mobile Phone Clustering from Speech Recordings Using Deep Representation and Spectral Clustering. IEEE Transactions on Information Forensics and Security, 13(4):965–977, 2018. ISSN 15566013. doi: 10.1109/TIFS.2017.2774505.
  • Lin et al. [2018] Z Lin, Y Huang, and J Wang. RNN-SM: Fast Steganalysis of VoIP Streams Using Recurrent Neural Network. IEEE Transactions on Information Forensics and Security, 13(7):1854–1868, 2018. ISSN 15566013. doi: 10.1109/TIFS.2018.2806741. URL https://www.scopus.com/inward/record.uri?eid=2-s2.0-85042114578&doi=10.1109%2FTIFS.2018.2806741&partnerID=40&md5=6461ef0f7e4450925459965dbbf24d10.
  • Lipton [2016] Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
  • Liu et al. [2016a] N Liu, H Li, M Zhang, J Liu, Z Sun, and T Tan. Accurate iris segmentation in non-cooperative environments using fully convolutional networks. In 2016 International Conference on Biometrics, ICB 2016, 2016a. doi: 10.1109/ICB.2016.7550055. URL https://www.scopus.com/inward/record.uri?eid=2-s2.0-84988416223&doi=10.1109%2FICB.2016.7550055&partnerID=40&md5=12c3beba29b901337aee3db09a7c265d.
  • Liu et al. [2016b] Nianfeng Liu, Haiqing Li, Man Zhang, Jing Liu, Zhenan Sun, and Tieniu Tan. Accurate iris segmentation in non-cooperative environments using fully convolutional networks. In 2016 International Conference on Biometrics, ICB 2016, 2016b. ISBN 9781509018697. doi: 10.1109/ICB.2016.7550055.
  • Luo et al. [2017] Da Luo, Rui Yang, Bin Li, and Jiwu Huang. Detection of double compressed AMR audio using stacked autoencoder. IEEE Transactions on Information Forensics and Security, 12(2):432–444, 2017. ISSN 15566013. doi: 10.1109/TIFS.2016.2622012.
  • Masci et al. [2011] Jonathan Masci, Ueli Meier, Dan Ciresan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011.
  • Mayer and Steinebach [2017] Felix Mayer and Martin Steinebach. Forensic Image Inspection Assisted by Deep Learning. In Proceedings of the 12th International Conference on Availability, Reliability and Security - ARES ’17, volume Part F1305, pages 1–9, 2017. ISBN 9781450352574. doi: 10.1145/3098954.3104051. URL http://dl.acm.org/citation.cfm?doid=3098954.3104051.
  • Melicher et al. [2016] William Melicher, Blase Ur, Sean M Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks. In 25th USENIX Security Symposium (USENIX Security 16), pages 175–191, 2016. ISBN 978-1-931971-32-4. doi: 10.1145/2420950.2420987. URL https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/melicher.
  • Meng and Chen [2017] Dongyu Meng and Hao Chen. MagNet: a Two-Pronged Defense against Adversarial Examples. In Proceedings of the ACM Conference on Computer and Communications Security, pages 135–147, 2017. ISBN 9781450349468. doi: 10.1145/3133956.3134057. URL http://arxiv.org/abs/1705.09064.
  • Menotti et al. [2015] David Menotti, Giovani Chiachia, Allan Pinto, William Robson Schwartz, Helio Pedrini, Alexandre Xavier Falcão, and Anderson Rocha. Deep Representations for Iris, Face, and Fingerprint Spoofing Detection. IEEE Transactions on Information Forensics and Security, 10(4):864–879, 2015. ISSN 15566013. doi: 10.1109/TIFS.2015.2398817.
  • Mikolov et al. [2013a] Tomas Mikolov, K Chen, G Corrado, and J Dean. Efficient Estimation of Word Representations in Vector Space. ArXiv e-prints, 2013a.
  • Mikolov et al. [2013b] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger, editors, Nips, pages 1–9. Curran Associates, Inc., 2013b. ISBN 2150-8097. doi: 10.1162/jmlr.2003.3.4-5.951. URL http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
  • Mitchell [1997] Tom M Mitchell. Machine Learning. McGraw Hill, Burr Ridge, IL, 1997.
  • Mittal et al. [2015] Paritosh Mittal, Mayank Vatsa, and Richa Singh. Composite sketch recognition via deep network - A transfer learning approach. In Proceedings of 2015 International Conference on Biometrics, ICB 2015, pages 251–256, 2015. ISBN 9781479978243. doi: 10.1109/ICB.2015.7139092.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei a Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. ISSN 0028-0836. doi: 10.1038/nature14236. URL http://dx.doi.org/10.1038/nature14236.
  • Montavon et al. [2012] Gregoire Montavon, Genevieve B Orr, and Klaus-Robert Müller. Neural Networks: Tricks of the Trade, Reloaded. vol. 7700 of Lecture Notes in Computer Science (LNCS), 2012.
  • Montavon et al. [2017] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. arXiv preprint arXiv:1706.07979, 2017.
  • Narang and Bourlai [2016] Neeru Narang and Thirimachos Bourlai. Gender and ethnicity classification using deep learning in heterogeneous face recognition. In 2016 International Conference on Biometrics (ICB), pages 1–8, 2016. ISBN 978-1-5090-1869-7. doi: 10.1109/ICB.2016.7550082. URL http://ieeexplore.ieee.org/document/7550082/.
  • Nguyen et al. [2018] Minh Hai Nguyen, Dung Le Nguyen, Xuan Mao Nguyen, and Tho Thanh Quan. Auto-detection of sophisticated malware using lazy-binding control flow graph and deep learning. Computers and Security, 76:128–155, 2018. ISSN 01674048. doi: 10.1016/j.cose.2018.02.006.
  • Nogueira et al. [2016] Rodrigo Frassetto Nogueira, Roberto De Alencar Lotufo, and Rubens Campos MacHado. Fingerprint Liveness Detection Using Convolutional Neural Networks. IEEE Transactions on Information Forensics and Security, 11(6):1206–1213, 2016. ISSN 15566013. doi: 10.1109/TIFS.2016.2520880.
  • Noroozi et al. [2017] Vahid Noroozi, Lei Zheng, Sara Bahaadini, Sihong Xie, and Philip S. Yu. SEVEN: Deep SEmi-supervised verification networks. In IJCAI International Joint Conference on Artificial Intelligence, pages 2571–2577, 2017. ISBN 9780999241103.
  • Osada et al. [2017] Genki Osada, Kazumasa Omote, and Takashi Nishide. Network Intrusion Detection Based on Semi-supervised Variational Auto-Encoder, volume 10493 LNCS. 2017. ISBN 9783319663982. doi: 10.1007/978-3-319-66399-9_19. URL http://link.springer.com/10.1007/978-3-319-66399-9_19.
  • Osadchy et al. [2017] Margarita Osadchy, Julio Hernandez-Castro, Stuart Gibson, Orr Dunkelman, and Daniel Perez-Cabo. No Bot Expects the DeepCAPTCHA! Introducing Immutable Adversarial Examples, with Applications to CAPTCHA Generation. IEEE Transactions on Information Forensics and Security, 12(11):2640–2653, 2017. ISSN 15566013. doi: 10.1109/TIFS.2017.2718479.
  • Papernot et al. [2016a] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical Black-Box Attacks against Machine Learning. In ASIA CCS 2017 - Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security, pages 506–519, 2016a. ISBN 9781450349444. doi: 10.1145/3052973.3053009. URL http://arxiv.org/abs/1602.02697.
  • Papernot et al. [2016b] Nicolas Papernot, Patrick Mcdaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Proceedings - 2016 IEEE European Symposium on Security and Privacy, EURO S and P 2016, pages 372–387, 2016b. ISBN 9781509017515. doi: 10.1109/EuroSP.2016.36.
  • Papernot et al. [2016c] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In Proceedings - 2016 IEEE Symposium on Security and Privacy, SP 2016, pages 582–597, 2016c. ISBN 9781509008247. doi: 10.1109/SP.2016.41.
  • Parkhi et al. [2015] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and others. Deep Face Recognition. In BMVC, volume 1, page 6, 2015.
  • Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global Vectors for Word Representation. In EMNLP, volume 14, pages 1532–1543, 2014.
  • Phan et al. [2016] N. Phan, Y. Wang, X. Wu, and D. Dou. Differential privacy preservation for deep auto-encoders: An application of human behavior prediction. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016, number Arfken 1985, pages 1309–1316, 2016. ISBN 9781577357605.
  • Phong et al. [2018] Le Trieu Phong, Yoshinori Aono, Takuya Hayashi, Lihua Wang, and Shiho Moriai. Privacy-Preserving Deep Learning via Additively Homomorphic Encryption. IEEE Transactions on Information Forensics and Security, 13(5):1333–1345, 2018. ISSN 15566013. doi: 10.1109/TIFS.2017.2787987.
  • Proença and Neves [2018] Hugo Proença and João C. Neves. Deep-PRWIS: Periocular Recognition Without the Iris and Sclera Using Deep Learning Frameworks. IEEE Transactions on Information Forensics and Security, 13(4):888–896, 2018. ISSN 15566013. doi: 10.1109/TIFS.2017.2771230.
  • Qin and El-Yacoubi [2017] Huafeng Qin and Mounim A. El-Yacoubi. Deep Representation-Based Feature Extraction and Recovering for Finger-Vein Verification. IEEE Transactions on Information Forensics and Security, 12(8):1816–1829, 2017. ISSN 15566013. doi: 10.1109/TIFS.2017.2689724.
  • Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • Reinsel et al. [2017] David Reinsel, John Gantz, and John Rydning. Data Age 2025. Technical report, Seagate, 2017. URL https://www.seagate.com/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf.
  • Ribeiro et al. [2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. arXiv preprint arXiv:1602.04938, 2016.
  • Riesenhuber and Poggio [1999] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11):1019, 1999.
  • Ritter et al. [2017] Samuel Ritter, David G. T. Barrett, Adam Santoro, and Matt M. Botvinick. Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study. 6 2017. URL http://arxiv.org/abs/1706.08606.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Salakhutdinov and Larochelle [2010] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 693–700, 2010.
  • Schmidhuber [2015] Jürgen Schmidhuber. Deep Learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. ISSN 18792782. doi: 10.1016/j.neunet.2014.09.003. URL http://arxiv.org/abs/1404.7828.
  • Schroff et al. [2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
  • Schuster and Paliwal [1997] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
  • Schuster et al. [2017] Roei Schuster, Vitaly Shmatikov, and Eran Tromer. Beauty and the Burst: Remote Identification of Encrypted Video Streams. In USENIX Security, number 2, pages 1–26, 2017. ISBN 9781931971409. URL https://beautyburst.github.io/beautyburst.pdf.
  • Settles [2012] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
  • Shabtai et al. [2009] Asaf Shabtai, Robert Moskovitch, Yuval Elovici, and Chanan Glezer. Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Information Security Technical Report, 14(1):16–29, 2009. ISSN 13634127. doi: 10.1016/j.istr.2009.03.003.
  • Sharma et al. [2017] Ashlesh Sharma, Vidyuth Srinivasan, Vishal Kanchan, and Lakshminarayanan Subramanian. The Fake vs Real Goods Problem. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17, volume Part F1296, pages 2011–2019, 2017. ISBN 9781450348874. doi: 10.1145/3097983.3098186. URL http://dl.acm.org/citation.cfm?doid=3097983.3098186.
  • Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • Shen et al. [2016] Shiqi Shen, Shruti Tople, and Prateek Saxena. Auror: defending against poisoning attacks in collaborative deep learning systems. In Proceedings of the 32nd Annual Conference on Computer Security Applications - ACSAC ’16, pages 508–519, 2016. ISBN 9781450347716. doi: 10.1145/2991079.2991125. URL http://dl.acm.org/citation.cfm?doid=2991079.2991125.
  • Shibahara et al. [2017] T. Shibahara, K. Yamanishi, Y. Takata, D. Chiba, M. Akiyama, T. Yagi, Y. Ohsita, and M. Murata. Malicious URL sequence detection using event de-noising convolutional neural network. In IEEE International Conference on Communications (ICC), pages 1–7, 2017. ISBN 9781467389990. doi: 10.1109/ICC.2017.7996831.
  • Shin et al. [2015] Eui Chul Richard Shin, Dawn Song, and Reza Moazzezi. Recognizing Functions in Binaries with Neural Networks. In 24th USENIX Security Symposium (USENIX Security 15), pages 611–626, 2015. ISBN 978-1-931971-232. doi: 10.1109/PPIC.2011.5982880. URL https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/shin.
  • Shiraga et al. [2016] K. Shiraga, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi. GEINet: View-invariant gait recognition using a convolutional neural network. In 2016 International Conference on Biometrics, ICB 2016, pages 0–7, 2016. ISBN 9781509018697. doi: 10.1109/ICB.2016.7550060.
  • Shokri and Shmatikov [2015] Reza Shokri and Vitaly Shmatikov. Privacy-Preserving Deep Learning. In Proceedings of the ACM Conference on Computer and Communications Security, pages 1310–1321, 2015. ISBN 9781450338325. doi: 10.1145/2810103.2813687. URL http://www.cs.cornell.edu/~shmat/shmat_ccs15.pdf.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. ISSN 0028-0836. doi: 10.1038/nature16961. URL http://dx.doi.org/10.1038/nature16961.
  • Simard et al. [2003] Patrice Y Simard, David Steinkraus, John C Platt, and others. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, pages 958–962, 2003.
  • Simonyan and Zisserman [2014] K Simonyan and A Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv e-prints, 2014.
  • Sivakorn et al. [2016] Suphannee Sivakorn, Iasonas Polakis, and Angelos D. Keromytis. I am Robot: (Deep) learning to break semantic image CAPTCHAs. In Proceedings - 2016 IEEE European Symposium on Security and Privacy, EURO S and P 2016, pages 388–403, 2016. ISBN 9781509017515. doi: 10.1109/EuroSP.2016.37.
  • Smolensky [1986] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, 1986.
  • Snoek et al. [2015] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180, 2015.
  • Socher et al. [2013] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pages 935–943, 2013.
  • Sommer and Paxson [2010] Robin Sommer and Vern Paxson. Outside the closed world: On using machine learning for network intrusion detection. In IEEE Symposium on Security and Privacy, volume 0, pages 305–316. IEEE, 2010. ISBN 978-1-4244-6894-2. doi: 10.1109/SP.2010.25.
  • Song [2018] Le Song. Structure2Vec: Deep Learning for Security Analytics over Graphs. 2018.
  • Springenberg et al. [2014] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Stallings et al. [2012] William Stallings, Lawrie Brown, Michael D Bauer, and Arup Kumar Bhattacharjee. Computer security: principles and practice. Pearson Education, 2012.
  • Sugar and James [2003] Catherine A Sugar and Gareth M James. Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98(463):750–763, 2003.
  • Sun et al. [2017] Lichao Sun, Yuqi Wang, Bokai Cao, Philip S. Yu, Witawas Srisa-An, and Alex D. Leow. Sequential Keystroke Behavioral Biometrics for Mobile User Identification via Multi-view Deep Learning, volume 10536 LNAI. 2017. ISBN 9783319712727. doi: 10.1007/978-3-319-71273-4_19.
  • Sun et al. [2014] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Representation by Joint Identification-Verification. In Advances in Neural Information Processing Systems, volume 3, pages 1988–1996, 2014. ISBN 978-1-4799-5118-5. doi: 10.1109/CVPR.2014.244. URL http://arxiv.org/abs/1406.4773.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • Szegedy et al. [2013] Christian Szegedy, W Zaremba, and I Sutskever. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, pages 1–10, 2013. URL http://arxiv.org/abs/1312.6199.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
  • Thaler et al. [2017] Stefan Thaler, Vlado Menkovski, and Milan Petkovic. Unsupervised Signature Extraction from Forensic Logs, volume 10536 LNAI. 2017. ISBN 9783319712727. doi: 10.1007/978-3-319-71273-4_25.
  • Thomas et al. [2017] Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently Controllable Features. arXiv preprint arXiv:1708.01289, 2017.
  • Tibshirani et al. [2001] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
  • Tieleman and Hinton [2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
  • Tran et al. [2016] Lam Tran, Deguang Kong, Hongxia Jin, and Ji Liu. Privacy-CNH: A Framework to Detect Photo Privacy with Convolutional Neural Network using Hierarchical Features. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., pages 1–7, 2016. ISBN 9781577357605.
  • Triantafillou et al. [2017] Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. Few-shot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, pages 2252–2262, 2017.
  • Uzan and Wolf [2015] Lior Uzan and Lior Wolf. I know that voice: Identifying the voice actor behind the voice. In Proceedings of 2015 International Conference on Biometrics, ICB 2015, pages 46–51, 2015. ISBN 9781479978243. doi: 10.1109/ICB.2015.7139074.
  • Vincent et al. [2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning - ICML ’08, pages 1096–1103. ACM, 2008. ISBN 9781605582054. doi: 10.1145/1390156.1390294. URL http://portal.acm.org/citation.cfm?doid=1390156.1390294.
  • Vinyals et al. [2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
  • Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, and others. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • Von Ahn et al. [2003] Luis Von Ahn, Manuel Blum, Nicholas J Hopper, and John Langford. CAPTCHA: Using hard AI problems for security. In International Conference on the Theory and Applications of Cryptographic Techniques, pages 294–311. Springer, 2003.
  • Wang and Jain [2015] Dayong Wang and Anil K. Jain. Face retriever: Pre-filtering the gallery via deep neural net. In Proceedings of 2015 International Conference on Biometrics, ICB 2015, pages 473–480, 2015. ISBN 9781479978243. doi: 10.1109/ICB.2015.7139112.
  • Wang et al. [2014] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
  • Xing et al. [2010] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification. ACM Sigkdd Explorations Newsletter, 12(1):40–48, 2010.
  • Xu et al. [2016] Guanshuo Xu, Han-Zhou Wu, and Yun-Qing Shi. Structural Design of Convolutional Neural Networks for Steganalysis. IEEE Signal Processing Letters, 23(5):708–712, 2016. ISSN 1070-9908. doi: 10.1109/LSP.2016.2548421.
  • Xu et al. [2017] Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the ACM Conference on Computer and Communications Security, pages 363–376, 2017. ISBN 9781450349468. doi: 10.1145/3133956.3134018. URL http://arxiv.org/abs/1708.06525.
  • Yan et al. [2016] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
  • Yao et al. [2017] Yuanshun Yao, Bimal Viswanath, Jenna Cryan, Haitao Zheng, and Ben Y. Zhao. Automated Crowdturfing Attacks and Defenses in Online Review Systems. In Proceedings of the ACM Conference on Computer and Communications Security, pages 1143–1158, 2017. ISBN 9781450349468. doi: 10.1145/3133956.3133990. URL http://arxiv.org/abs/1708.08151.
  • Yashiro et al. [2016] Risa Yashiro, Takanori Machida, Mitsugu Iwamoto, and Kazuo Sakiyama. Deep-learning-based security evaluation on authentication systems using arbiter PUF and its variants, volume 9836 LNCS. 2016. ISBN 9783319445236. doi: 10.1007/978-3-319-44524-3_16.
  • Ye et al. [2017] Jian Ye, Jiangqun Ni, and Yang Yi. Deep Learning Hierarchical Representations for Image Steganalysis. IEEE Transactions on Information Forensics and Security, 12(11):2545–2557, 2017. ISSN 15566013. doi: 10.1109/TIFS.2017.2710946.
  • You et al. [2015] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks. In AAAI, pages 381–388, 2015.
  • Yu et al. [2017] Jun Yu, Baopeng Zhang, Zhengzhong Kuang, Dan Lin, and Jianping Fan. IPrivacy: Image Privacy Protection by Identifying Sensitive Objects via Deep Multi-Task Learning. IEEE Transactions on Information Forensics and Security, 12(5):1005–1016, 2017. ISSN 15566013. doi: 10.1109/TIFS.2016.2636090.
  • Zagoruyko and Komodakis [2015] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 4353–4361. IEEE, 2015.
  • Zeiler et al. [2011] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2018–2025. IEEE, 2011.
  • Zeng et al. [2018] Jishen Zeng, Shunquan Tan, Bin Li, and Jiwu Huang. Large-Scale JPEG Image Steganalysis Using Hybrid Deep-Learning Framework. IEEE Transactions on Information Forensics and Security, 13(5):1200–1214, 2018. ISSN 15566013. doi: 10.1109/TIFS.2017.2779446.
  • Zhang et al. [2016a] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016a.
  • Zhang et al. [2012] Jun Zhang, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, and Marianne Winslett. Functional mechanism: regression analysis under differential privacy. Proceedings of the VLDB Endowment, 5(11):1364–1375, 2012.
  • Zhang et al. [2016b] Qi Zhang, Haiqing Li, Zhenan Sun, Zhaofeng He, and Tieniu Tan. Exploring complementary features for iris recognition on mobile devices. In 2016 International Conference on Biometrics, ICB 2016, 2016b. ISBN 9781509018697. doi: 10.1109/ICB.2016.7550079.
  • Zhao and Kumar [2017] Zijing Zhao and Ajay Kumar. Accurate Periocular Recognition under Less Constrained Environment Using Semantics-Assisted Convolutional Neural Network. IEEE Transactions on Information Forensics and Security, 12(5):1017–1030, 2017. ISSN 15566013. doi: 10.1109/TIFS.2016.2636093.
  • Zhong et al. [2016a] Haoti Zhong, Hao Li, Anna Squicciarini, Sarah Rajtmajer, Christopher Griffin, David Miller, and Cornelia Caragea. Content-driven detection of cyberbullying on the Instagram social network. In IJCAI International Joint Conference on Artificial Intelligence, pages 3952–3958, 2016a. ISBN 978-1-57735-770-4.
  • Zhong et al. [2016b] Yang Zhong, Josephine Sullivan, and Haibo Li. Face attribute prediction using off-the-shelf CNN features. In 2016 International Conference on Biometrics, ICB 2016, 2016b. ISBN 9781509018697. doi: 10.1109/ICB.2016.7550092.
  • Zhu et al. [2015] Jianqing Zhu, Shengcai Liao, Dong Yi, Zhen Lei, and Stan Z. Li. Multi-label CNN based pedestrian attribute learning for soft biometrics. In Proceedings of 2015 International Conference on Biometrics, ICB 2015, pages 535–540, 2015. ISBN 9781479978243. doi: 10.1109/ICB.2015.7139070.
  • Zoph and Le [2016] Barret Zoph and Quoc V Le. Neural Architecture Search with Reinforcement Learning. CoRR, abs/1611.0, 2016. URL http://arxiv.org/abs/1611.01578.

Appendix A Venues

As indicated in Section 4, here we list all the venues that we included in our review. The venues are listed in alphabetical order.

Security venues

We included literature from the following security conference proceedings and journals: ACNS, ACSAC, ARES, ASIACCS, ASIACRYPT, CCS, Computers and Security, CRYPTO, ESORICS, EUROCRYPT, Fast Software Encryption, FC, IACR, ICB, ICC, IEEE Transactions on Dependable and Secure Computing, IWSEC, Journal of Cryptology, Privacy Enhancing Technologies Symposium, RAID, S&P, SACMAT, SIGCOMM, SOUPS, Theory of Cryptography, Transactions on Information Forensics and Security, TrustCom, WISEC, WPES.

Machine learning venues

We included literature from the following machine learning conference proceedings and journals: ACCV, ACM SIGKDD, ACM Transactions on Intelligent Systems and Technology, ACM Transactions on Knowledge Discovery from Data, Advances in Data Analysis and Classification, AISTATS, BioData Mining, BMVC, Computer Vision and Image Understanding, CVPR, Data Mining and Knowledge Discovery, ECCV, ECML PKDD, ICCV, ICDM, ICIP, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE International Conference on Big Data, IEEE Transactions on Image Processing, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Pattern Analysis and Machine Intelligence, Image and Vision Computing, International Conference on Pattern Recognition, International Journal of Computer Vision, Journal of Visual Communication and Image Representation, Knowledge and Information Systems, Machine Vision and Applications, Medical Image Analysis, PAKDD, Pattern Recognition, Pattern Recognition Letters, RecSys, SDM, shops, Social Network Analysis and Mining, Statistical Analysis and Data Mining, WACV, WSDM.