A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends

05/30/2019 ∙ by Saptarshi Sengupta, et al. ∙ Vanderbilt University

Deep learning (DL) has solved a problem that as little as five years ago was thought by many to be intractable - the automatic recognition of patterns in data; and it can do so with accuracy that often surpasses human beings. It has solved problems beyond the realm of traditional, hand-crafted machine learning algorithms and captured the imagination of practitioners trying to make sense out of the flood of data that now inundates our society. As public awareness of the efficacy of DL increases so does the desire to make use of it. But even for highly trained professionals it can be daunting to approach the rapidly increasing body of knowledge produced by experts in the field. Where does one start? How does one determine if a particular model is applicable to their problem? How does one train and deploy such a network? A primer on the subject can be a good place to start. With that in mind, we present an overview of some of the key multilayer ANNs that comprise DL. We also discuss some new automatic architecture optimization protocols that use multi-agent approaches. Further, since guaranteeing system uptime is becoming critical to many computer applications, we include a section on using neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where DL has emerged as a game-changing technology: anomalous behavior detection in financial applications or in financial time-series forecasting, predictive and prescriptive analytics, medical image processing and analysis and power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the DL community as well as to provide a reference to researchers seeking to use it in their work for what it does best: statistical pattern recognition with unparalleled learning capacity with the ability to scale with information.


I Introduction

Artificial neural networks (ANNs), now one of the most widely used approaches to computational intelligence, started as an attempt to mimic adaptive biological nervous systems in software and customized hardware [1]. ANNs have been studied for more than 70 years [2], during which time they have waxed and waned in the attention of researchers. Recently they have made a strong resurgence as pattern recognition tools following pioneering work by a number of researchers [3]. It has been demonstrated that multilayered artificial neural architectures can learn complex, non-linear functional mappings, given sufficient computational resources and training data. Importantly, unlike more traditional approaches, their results scale with training data. Following these significant results in robust pattern recognition, the field has seen rapid growth in both academic and industrial research. Moreover, multilayer ANNs reduce much of the manual work that until now has been needed to set up classical pattern recognizers. They are, in effect, black box systems that can deliver, with minimal human attention, excellent performance in applications that require insights from unstructured, high-dimensional data [4] [5] [6] [7] [8] [9]. These facts motivate this review of the topic.

I-A What is an Artificial Neural Network?

An artificial neural network comprises many interconnected, simple functional units, or neurons, that act in concert as parallel information processors to solve classification or regression problems. That is, they can separate the input space (the range of all possible values of the inputs) into a discrete number of classes, or they can approximate the function (the black box) that maps inputs to outputs. If the network is created by stacking layers of these multiply connected neurons, the resulting computational system can:

  1. Interact with the surrounding environment by using one layer of neurons to receive information (these units are known to be part of the input layers of the neural network)

  2. Pass information back-and-forth between layers within the black-box for processing by invoking certain design goals and learning rules (these units are known to be part of the hidden layers of the neural network)

  3. Relay processed information out to the surrounding environment via some of its atomic units (these units are known to be part of the output layers of the neural network).

Within a hidden layer, each neuron is connected to the outputs of a subset (or all) of the neurons in the previous layer, each multiplied by a number called a weight. The neuron computes the sum of the products of those outputs (its inputs) and their corresponding weights. This computation is the dot product between an input vector and a weight vector, which can be thought of as the projection of one vector onto the other or as a measure of similarity between the two. Assume the input vectors and weights are both $n$-dimensional and there are $m$ neurons in the layer. Each neuron has its own weight vector, so the output of the layer is an $m$-dimensional vector computed as the input vector pre-multiplied by an $m \times n$ matrix of weights. That is, the output is an $m$-dimensional vector that is the linear transformation of an $n$-dimensional input vector. The output of each neuron is in effect a linear classifier where the weight vector defines a borderline between two classes and where the input vector lies some distance to one side of it or the other. The combined result of all $m$ neurons is an $m$-dimensional hyperplane that independently classifies the $n$ dimensions of the input into two $m$-dimensional classes in the output. If the weights are derived via least mean-squared (LMS) estimation from matched pairs of input-output data, they form a linear regression, i.e. the hyperplane that is closest in the LMS sense to all the outputs given the inputs.

The hyperplane maps new input points to output points that are consistent with the original data, in the sense that some error function between the computed outputs and the actual outputs in the training data is minimized. Multiple layers of linear maps, wherein the output of one linear classifier or regression is the input of another, are actually equivalent to a single linear classifier or regression. This is because the output of the stacked layers reduces to the multiplication of the inputs by a single matrix that is the product of the per-layer matrices.
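The collapse of stacked linear layers into one matrix can be checked numerically. The following is a minimal sketch, assuming NumPy and three hypothetical bias-free layers with arbitrarily chosen sizes (4 → 5 → 3 → 2); the names `W1`, `W2`, `W3` and `stacked_linear` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical linear layers (no bias, no activation): 4 -> 5 -> 3 -> 2.
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))
W3 = rng.standard_normal((2, 3))

def stacked_linear(x):
    """Apply the three linear layers one after another."""
    return W3 @ (W2 @ (W1 @ x))

# The single equivalent layer is the product of the per-layer matrices.
W_equivalent = W3 @ W2 @ W1

x = rng.standard_normal(4)
y_stacked = stacked_linear(x)
y_single = W_equivalent @ x   # agrees with y_stacked up to floating-point error
```

Because the two outputs agree, depth alone adds no expressive power to a purely linear network; the nonlinearity discussed next is what makes additional layers useful.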

To classify inputs non-linearly or to approximate a nonlinear function with a regression, each neuron adds a numerical bias value to the result of its input sum of products (the linear classifier) and passes that through a nonlinear activation function. The actual form of the activation function is a design parameter. The classical sigmoidal choices share the characteristic that they map the real line through a monotonically increasing function that has an inflection point at zero. In a single neuron, the bias effectively shifts the inflection point of the activation function away from zero, to the point where the sum of products equals the negative of the bias. So the sum of products is mapped through an activation function whose center is set by the bias. Any pair of activation functions so defined is capable of producing a pulse between their inflection points if each one is scaled and one is subtracted from the other. In effect each pair of neurons samples the input space and outputs a specific value for all inputs within the limits of the pulse. Given training data consisting of input-output pairs (input vectors each with a corresponding output vector), the ANN learns an approximation to the function that produced each of the outputs from its corresponding input. That approximation is the partition of the input space into samples that minimizes the error function between the output of the ANN given its training inputs and the training outputs. This is stated mathematically by the universal approximation theorem, which implies that any continuous functional mapping between input vectors and output vectors on a bounded domain can be approximated to arbitrary accuracy with an ANN, provided that it has a sufficient number of neurons in a sufficient number of layers with a specific activation function [10] [11] [12] [13].
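The pulse construction described above can be illustrated with two logistic sigmoid units. This is a rough sketch, assuming NumPy; the function `pulse` and its parameters `left`, `right`, `height` and `steepness` are hypothetical names chosen for the example, with `steepness` playing the role of the input weight and `-steepness * left` the role of the bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pulse(x, left, right, height=1.0, steepness=50.0):
    """Difference of two shifted sigmoids.

    Each sigmoid has its inflection point where steepness * (x - shift) = 0,
    so the result is close to `height` between `left` and `right` and close
    to zero elsewhere.
    """
    rising = sigmoid(steepness * (x - left))
    falling = sigmoid(steepness * (x - right))
    return height * (rising - falling)

x = np.array([-1.0, 0.5, 2.0])          # outside, inside, outside the pulse
values = pulse(x, left=0.0, right=1.0, height=3.0)
```

Summing many such pulses with different positions and heights gives a piecewise approximation of a one-dimensional function, which is the intuition behind the approximation argument in the text.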

Given the dimensions of the input and output vectors, the number of layers, the number of neurons in each layer, the form of the activation function, and an error function, the weights are computed via optimization over input-output pairs to minimize the error function. That way the resulting network is a best approximation of the known input-output data.

I-B How do these networks learn?

Fig. 1: The Perceptron Learning Model

Neural networks are capable of learning: by changing the distribution of weights, it is possible to approximate a function representative of the patterns in the input. The key idea is to re-stimulate the black box using new excitation (data) until a sufficiently well-structured representation is achieved. Each stimulation redistributes the neural weights a little, hopefully in the right direction provided the learning algorithm is appropriate, until the approximation error with respect to some well-defined metric falls below a practitioner-defined threshold. Learning, then, is the aggregation of a variable length of causal chains of neural computations [14] seeking to approximate a certain pattern recognition task through linear or nonlinear modulation of the activation of the neurons across the architecture. In instances where chains of linear activations fail to learn the underlying structure, non-linearity aids the modulation process. The term 'deep' in this context refers to the length of the aggregation chain across many hidden layers needed to learn sufficiently detailed representations. Theorists and empiricists alike have contributed to a rapid growth in studies using Deep Neural Networks, although, generally speaking, the existing constraints of the field are well acknowledged [15] [16] [17]. Deep learning has grown to be one of the principal components of contemporary research in artificial intelligence in light of its ability to scale with input data and its capacity to generalize across problems with similar underlying feature distributions, in stark contrast to the hard-coded, problem-specific pattern recognition architectures of yesteryear.
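As a concrete instance of repeated stimulation nudging the weights until the error is acceptable, the sketch below implements a perceptron-style learning rule of the kind named in the caption of Fig. 1. It assumes NumPy, binary labels in {0, 1} and a step activation; the function name `train_perceptron` and its parameters are illustrative, not taken from the article.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=1.0, max_epochs=100):
    """Perceptron rule for labels in {0, 1} with a step activation."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, target in zip(X, y):
            prediction = 1 if (w @ x_i + b) > 0 else 0
            update = learning_rate * (target - prediction)
            if update != 0:
                # Each misclassified example shifts the weights a little.
                w += update * x_i
                b += update
                errors += 1
        if errors == 0:  # every training point is classified correctly
            break
    return w, b

# Logical AND is linearly separable, so the rule converges on it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
predictions = (X @ w + b > 0).astype(int)
```

The stopping test here is zero training errors; for non-separable data a practitioner would instead stop at an error threshold or an epoch limit, as the text describes.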

People Involved | Contribution
McCulloch & Pitts | ANN models with adjustable weights (1943) [2]
Rosenblatt | The Perceptron Learning Algorithm (1957) [18]
Widrow & Hoff | Adaline (1960), Madaline Rule I (1961) & Madaline Rule II (1988) [19] [20]
Minsky & Papert | The XOR Problem (1969) [21]
Werbos (Doctoral Dissertation) | Backpropagation (1974) [22]
Hopfield | Hopfield Networks (1982) [23]
Rumelhart, Hinton & Williams | Renewed interest in backpropagation: multilayer adaptive backpropagation (1986) [24]
Vapnik & Cortes | Support Vector Networks (1995) [25]
Hochreiter & Schmidhuber | Long Short Term Memory Networks (1997) [26]
LeCun et al. | Convolutional Neural Networks (1998) [27]
Hinton & Salakhutdinov | Hierarchical Feature Learning in Deep Neural Networks (2006) [28]
TABLE I: Some Key Advances in Neural Networks Research

I-C Why are deep neural networks garnering so much attention now?

Multi-layer neural networks have been around for the better part of the latter half of the previous century. A natural question to ask is why deep neural networks have gained the attention of academics and industry practitioners alike in recent years. Many factors contribute to this rapid rise in research funding and volume. Some of these are summarized below:

  • A surge in the availability of large training data sets with high quality labels

  • Advances in parallel computing capabilities and multi-core, multi-threaded implementations

  • Niche software platforms such as PyTorch [29], Tensorflow [30], Caffe [31], Chainer [32], Keras [33], BigDL [34], etc. that allow seamless integration of architectures into a GPU computing framework without the complexity of addressing low-level details such as derivatives and environment setup. Table II provides a summary of popular Deep Learning Frameworks.

  • Better regularization techniques introduced over the years help avoid overfitting as we scale up: techniques such as batch normalization, dropout, data augmentation and early stopping are highly effective in avoiding overfitting and can improve model performance with scaling.

  • Robust optimization algorithms that produce near-optimal solutions: algorithms with adaptive learning rates (AdaGrad, RMSProp, Adam), Stochastic Gradient Descent (with standard momentum or Nesterov momentum), Particle Swarm Optimization, Differential Evolution, etc.

Software Platform | Purpose
Tensorflow [30] | Software library for high-performance numerical computation with support for Machine Learning and Deep Learning architectures, deployable on CPU, GPU and TPU. url: https://www.tensorflow.org/
Theano [35] | GPU-compatible Python library, tightly integrated with NumPy, for mathematical operations on multidimensional arrays. url: http://deeplearning.net/software/theano/
CNTK [36] | Microsoft Cognitive Toolkit (CNTK) is a Deep Learning framework describing computations through directed graphs. url: https://www.microsoft.com/en-us/cognitive-toolkit/
Keras [33] | Runs on top of Tensorflow, CNTK or Theano and is deployable on CPU and GPU. url: https://keras.io/
PyTorch [29] | Distributed training and performance evaluation platform integrated with Python and supported by major cloud platforms. url: https://pytorch.org/
Caffe [31] | Convolutional Architecture for Fast Feature Embedding (Caffe) is a Deep Learning framework with a focus on image classification and segmentation, deployable on both CPU and GPU. url: http://caffe.berkeleyvision.org/
Chainer [32] | Supports CUDA computation and multi-GPU implementation. url: https://chainer.org/
BigDL [34] | Distributed deep learning library for Apache Spark supporting the programming languages Scala and Python. url: https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark
TABLE II: A Collection of Popular Deployment Platforms
Fig. 2: Organization of the Review

I-D Review Methodology

The article, in its present form, presents a collection of notable work carried out by researchers in and around the deep learning niche. It is by no means exhaustive, and it cannot fully capture the ever-evolving web of interactions within the deep learning community. While cognizant of the difficulty of the stated goal, we have nonetheless tried to give the reader an overview of pertinent scholarly work across varied niches in a single article.

The article makes the following contributions from a practitioner’s reading perspective:

  • It walks through the foundations of biomimicry that lead from biological neural networks to artificial ones, commenting on how neural network architectures learn and why deeper layers of neural units are needed for certain pattern recognition tasks.

  • It discusses how several different deep architectures work, starting from Deep feed-forward networks (DFNNs) and Restricted Boltzmann Machines (RBMs) through Deep Belief Networks (DBNs) and Autoencoders. It also briefly covers Convolutional neural networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs) and some of the more recent deep architectures. This cluster within the article serves as a baseline for further reading or as a refresher for the sections that build on it.

  • The article surveys two major computational areas of research in the present day deep learning community that we feel have not been adequately surveyed yet - (a) Multi-agent approaches in automatic architecture generation and learning rule optimization of deep neural networks using swarm intelligence and (b) Testing, troubleshooting and robustness analysis of deep neural architectures which are of prime importance in guaranteeing up-time and ensuring fault-tolerance in mission-critical applications.

  • A general survey of developments in certain application modalities is presented. These include:

    1. Anomaly Detection in Financial Services,

    2. Financial Time Series Forecasting,

    3. Prognostics and Health Monitoring,

    4. Medical Imaging and

    5. Power Systems

Figure 2 captures a high-level hierarchical abstraction of the organization of the review with emphasis on current practices, challenges and future research directions. The content is organized as follows: Section II outlines some commonly used deep architectures with the high-level working mechanism of each, Section III discusses the infusion of swarm intelligence techniques within the context of deep learning and Section IV details diagnostic approaches for assuring fault-tolerant implementations of deep learning systems. Section V makes an exploratory survey of the application areas listed above, Section VI critically examines the general successes and pitfalls of the field as of now and Section VII concludes the article.

II Deep architectures: Working mechanisms

There are numerous deep architectures available in the literature. Comparing architectures is difficult, as different architectures have different advantages depending on the application and the characteristics of the data involved. For example, Convolutional Neural Networks [27] are preferred in vision, while Recurrent Neural Networks [37] are preferred for sequence and time series modelling. Deep learning is, however, a fast-evolving field: every year, new architectures with new learning algorithms are developed to meet the need for efficient, human-like machines in different application domains.

II-A Deep Feed-forward Networks

Fig. 3: Deep Feed-forward Neural Network with n Hidden layers, p input units and q output units with weights W.

A Deep Feed-forward Neural Network is the most basic deep architecture, in which the connections between nodes only move forward. When a multilayer neural network contains multiple hidden layers, we call it a deep neural network [38]. An example of a Deep Feed-Forward Network with n hidden layers is provided in Figure 3. Multiple hidden layers help in modelling complex nonlinear relations more efficiently than a shallow architecture. A complex function can be modelled with fewer computational units than a similarly performing shallow network, owing to the hierarchical learning made possible by multiple levels of nonlinearity [39]. Because of the simplicity of its architecture and training, it remains a popular architecture among researchers and practitioners in almost all domains of engineering. Backpropagation using gradient descent [40] is the most common learning algorithm used to train this model. The algorithm first initialises the weights randomly and then tunes them to minimise the error using gradient descent. The learning procedure alternates forward and backward passes. In the forward pass, the input is propagated towards the output through multiple hidden layers of nonlinearity, and the computed output is compared with the actual output for that input. In the backward pass, the error derivatives with respect to the network parameters are back-propagated to adjust the weights so as to minimise the error in the output. The process is repeated until the desired improvement in the model prediction is obtained. If $x_i$ is the input and $f_i$ is the nonlinear activation function in layer i, the output of layer i can be represented by

$$x_{i+1} = f_i(W_i x_i + b_i) \tag{1}$$

where the output is written $x_{i+1}$ because it becomes the input for the next layer, and $W_i$ and $b_i$ are the parameters connecting layer i with the previous layer. In the backward pass, these parameters can be updated with

$$W_{new} = W - \eta \frac{\partial E}{\partial W} \tag{2}$$
$$b_{new} = b - \eta \frac{\partial E}{\partial b} \tag{3}$$

where $W_{new}$ and $b_{new}$ are the updated parameters for W and b respectively, E is the cost function and $\eta$ is the learning rate. The cost function is chosen according to the task to be performed, such as regression or classification: root mean square error is common for regression, and cross-entropy over a softmax output is common for classification.
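The forward and backward passes can be sketched in a few lines. The example below is a minimal illustration, assuming NumPy, tanh hidden units, a linear output layer and a mean squared error cost (a simplification of the root mean square error mentioned above); the layer sizes, the learning rate and the helper names `forward`, `backward` and `mse` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, params):
    """Apply x_{i+1} = f_i(W_i x_i + b_i) layer by layer; keep every activation."""
    activations = [x]
    for i, (W, b) in enumerate(params):
        z = activations[-1] @ W.T + b
        is_last = i == len(params) - 1
        activations.append(z if is_last else np.tanh(z))  # linear output layer
    return activations

def backward(activations, target, params):
    """Back-propagate dE/dW and dE/db for E = mean squared error."""
    grads = []
    delta = 2.0 * (activations[-1] - target) / target.size  # dE/dz at the output
    for i in reversed(range(len(params))):
        W, _ = params[i]
        grads.append((delta.T @ activations[i], delta.sum(axis=0)))
        if i > 0:
            # tanh'(z) = 1 - tanh(z)^2, and activations[i] already holds tanh(z).
            delta = (delta @ W) * (1.0 - activations[i] ** 2)
    return grads[::-1]

def mse(output, target):
    return float(np.mean((output - target) ** 2))

# Hypothetical sizes: p = 2 inputs, one hidden layer of 8 units, q = 1 output.
sizes = [2, 8, 1]
params = [(0.5 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

eta = 0.05
initial_error = mse(forward(X, params)[-1], T)
for _ in range(3000):
    acts = forward(X, params)
    grads = backward(acts, T, params)
    # Gradient-descent update: W_new = W - eta * dE/dW, b_new = b - eta * dE/db.
    params = [(W - eta * dW, b - eta * db)
              for (W, b), (dW, db) in zip(params, grads)]
final_error = mse(forward(X, params)[-1], T)
```

Comparing `final_error` with `initial_error` shows the cost decreasing under repeated forward and backward passes; how close it gets to zero depends on the initialisation and the learning rate.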

Many issues, such as overfitting, getting trapped in local minima and vanishing gradients, can arise if a deep neural network is trained naively. For this reason, neural networks were forsaken by the machine learning community in the late 1990s. However, in 2006 [28, 41], with the advent of the unsupervised pretraining approach, neural networks were revived for complex tasks such as vision and speech. Since then, many other techniques, such as l1 and l2 regularisation [42], dropout [43], batch normalisation [44], good weight initialisation techniques [45, 46, 47, 48] and good activation functions [49], have been introduced to combat the issues in training deep neural networks.

II-B Restricted Boltzmann Machines

A Restricted Boltzmann Machine (RBM) [50] can be interpreted as a stochastic neural network. It is one of the popular deep learning frameworks due to its ability to learn the input probability distribution in a supervised as well as an unsupervised manner. It was first introduced by Paul Smolensky in 1986 under the name Harmonium [51]. However, it was popularised by Hinton in 2002 [52] with the advent of an improved training algorithm for RBMs. After that, it found wide application in tasks such as representation learning [53], dimensionality reduction [54] and prediction problems [55]. However, deep belief network training using the RBM as a building block [28] was the most prominent application in the history of the RBM, marking the start of the deep learning era. RBMs have also gained popularity in collaborative filtering [56] owing to their state-of-the-art performance on the Netflix dataset.

The Restricted Boltzmann Machine is a variation of the Boltzmann machine in which intra-layer connections between units are not allowed, hence the name restricted. It is an undirected graphical model containing two layers, a visible layer and a hidden layer, which form a bipartite graph. Different variations of RBMs have been introduced in the literature to improve the learning algorithm for a given task. Temporal RBM [57] and conditional RBM [58] were proposed and applied to model multivariate time series data and to generate motion captures, Gated RBM [59] to learn transformations between two input images, Convolutional RBM [60, 61] to understand the time structure of the input time series, mean-covariance RBM [62, 63, 64] to represent the covariance structure of the data, and many more such as Recurrent TRBM [65] and factored conditional RBM (fcRBM) [66]. Different types of nodes, such as Bernoulli and Gaussian [67], have been introduced to cope with the characteristics of the data used. However, the basic RBM modelling concept was introduced with Bernoulli units. Each node in an RBM is a computational unit that processes the input it receives and makes a stochastic decision whether to transmit that input or not. An RBM with m visible and n hidden units is provided in Figure 4.

Fig. 4: RBM with m visible units and n hidden units

The joint probability distribution of a standard RBM can be defined with the Gibbs distribution $p(v,h) = \frac{1}{Z} e^{-E(v,h)}$, where the energy function E(v,h) can be represented with:

$$E(v,h) = -\sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i \tag{4}$$

where m, n are the number of visible and hidden units, $v_j$, $h_i$ are the states of visible unit j and hidden unit i, $b_j$, $c_i$ are the real-valued biases corresponding to the jth visible unit and ith hidden unit respectively, and $w_{ij}$ are the real-valued weights connecting visible units with hidden units. Z is the normalisation constant (the sum of $e^{-E(v,h)}$ over all possible combinations of v and h) that ensures the probability distribution sums to 1. The restriction on intra-layer connections makes the RBM hidden layer variables independent given the states of the visible layer variables, and vice versa. This eases the complexity of modelling the probability distribution, and hence the conditional distribution of each layer factorises over its units as given below:

$$p(h \mid v) = \prod_{i=1}^{n} p(h_i \mid v) \tag{5}$$
$$p(v \mid h) = \prod_{j=1}^{m} p(v_j \mid h) \tag{6}$$

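An RBM is trained to maximise the expected probability of the training samples. The contrastive divergence algorithm proposed by Hinton [52] is popular for training RBMs. Training brings the model to a stable state by minimising its energy through updates to the model parameters. The parameters can be updated using the following equations:

$$\Delta w_{ij} = \eta \left( \langle v_j h_i \rangle_{data} - \langle v_j h_i \rangle_{model} \right) \tag{7}$$
$$\Delta b_j = \eta \left( \langle v_j \rangle_{data} - \langle v_j \rangle_{model} \right) \tag{8}$$
$$\Delta c_i = \eta \left( \langle h_i \rangle_{data} - \langle h_i \rangle_{model} \right) \tag{9}$$

where $\eta$ is the learning rate, and $\langle \cdot \rangle_{data}$ and $\langle \cdot \rangle_{model}$ denote expected values under the data and under the model respectively.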
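To make the update rule concrete, here is a minimal sketch of one-step contrastive divergence (CD-1) for an RBM with Bernoulli units, assuming NumPy. It relies on the standard result that, for Bernoulli units, the probability that a unit is on given the other layer is a logistic sigmoid of its weighted input plus bias. The model expectation is approximated with a single Gibbs step, and the function name `cd1_step`, the toy data and the hyperparameters are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, rng, eta=0.1):
    """One CD-1 update on a batch v0 of shape (batch, m).

    W: (n, m) weights, b: (m,) visible biases, c: (n,) hidden biases.
    """
    # Data-driven phase: p(h_i = 1 | v) for the training batch.
    ph0 = sigmoid(v0 @ W.T + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Model phase: a single Gibbs step stands in for the model expectation.
    pv1 = sigmoid(h0 @ W + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W.T + c)
    batch = v0.shape[0]
    W = W + eta * (ph0.T @ v0 - ph1.T @ v1) / batch   # <v h>_data - <v h>_model
    b = b + eta * (v0 - v1).mean(axis=0)              # <v>_data - <v>_model
    c = c + eta * (ph0 - ph1).mean(axis=0)            # <h>_data - <h>_model
    return W, b, c

m, n = 6, 3                                           # visible and hidden units
W = 0.01 * rng.standard_normal((n, m))
b, c = np.zeros(m), np.zeros(n)
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
for _ in range(500):
    W, b, c = cd1_step(data, W, b, c, rng)
```

Using the hidden probabilities rather than sampled states in the weight update is a common variance-reduction choice, not something prescribed by the text.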

II-C Deep Belief Networks

A deep belief network (DBN) is a generative graphical model composed of multiple layers of latent variables. The latent variables are typically binary and can represent the hidden features present in the input observations. The connection between the top two layers of a DBN is undirected, as in an RBM; hence a DBN with one hidden layer is just an RBM. All other connections in a DBN are directed towards the input layer. Because a DBN is a generative model, generating a sample follows a top-down approach. We first draw samples from the RBM formed by the top two layers, usually by Gibbs sampling, and then sample the visible units with a single pass of ancestral sampling in a top-down fashion. A standard DBN model [68] with three hidden layers is shown in Figure 5.

Fig. 5: DBN with input vector x with 3 hidden layers

Inference in a DBN is an intractable problem due to the explaining-away effect in the latent variable model. However, in 2006 Hinton [28] proposed a fast and efficient way of training DBNs by stacking Restricted Boltzmann Machines (RBMs) one above the other. During training, the lowest-level RBM learns the distribution of the input data. The next RBM learns high-order correlations between the hidden units of the previous hidden layer by sampling those hidden units. This process is repeated for each hidden layer up to the top. A DBN with L hidden layers models the joint distribution between its visible layer v and the hidden layers $h^l$, where l = 1, 2, …, L, as follows:

$$P(v, h^1, \ldots, h^L) = P(v \mid h^1) \left( \prod_{l=1}^{L-2} P(h^l \mid h^{l+1}) \right) P(h^{L-1}, h^L) \tag{10}$$
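The greedy stacking procedure can be sketched as a loop in which each RBM is trained on samples of the hidden units of the one below it. The code is a simplified illustration, assuming NumPy, Bernoulli units and full-batch CD-1 training; `train_rbm`, `pretrain_dbn`, the layer sizes and the toy data are hypothetical and chosen only to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, rng, epochs=50, eta=0.1):
    """Train one Bernoulli RBM with full-batch CD-1; returns (W, b, c)."""
    m = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, m))
    b, c = np.zeros(m), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W.T + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = (rng.random(data.shape) < sigmoid(h0 @ W + b)).astype(float)
        ph1 = sigmoid(v1 @ W.T + c)
        W += eta * (ph0.T @ data - ph1.T @ v1) / len(data)
        b += eta * (data - v1).mean(axis=0)
        c += eta * (ph0 - ph1).mean(axis=0)
    return W, b, c

def pretrain_dbn(data, hidden_sizes, seed=0):
    """Greedy layer-wise stacking of RBMs, bottom layer first."""
    rng = np.random.default_rng(seed)
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        W, b, c = train_rbm(layer_input, n_hidden, rng)
        layers.append((W, b, c))
        # Sampled hidden states become the "data" for the next RBM.
        p_hidden = sigmoid(layer_input @ W.T + c)
        layer_input = (rng.random(p_hidden.shape) < p_hidden).astype(float)
    return layers

data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
layers = pretrain_dbn(data, hidden_sizes=[4, 3, 2])
```

The returned weights could then initialise a feed-forward network for supervised fine-tuning; that step is omitted here.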

The log-probability of the training data can be improved by adding layers to the network, which, in turn, increases the true representational power of the network [69]. The DBN training proposed in 2006 [28] by Hinton led to the deep learning era of today and revived the neural network. It was the first deep architecture that could be trained efficiently; before that, it was almost impossible to train deep architectures. Deep architectures built by initialising the weights with a DBN outperformed the kernel machines that were prominent in the research landscape at that time. Besides its use as a generative model, the DBN has been widely applied as a discriminative model by appending a discrimination layer at the end and fine-tuning the model using the target labels provided [3]. In most applications, this approach of pretraining a deep architecture led to state-of-the-art performance in discriminative models [70, 28, 41, 71, 54], for example in recognising handwritten digits, detecting pedestrians and time series prediction, even when the amount of labelled data was limited [72]. It has gained immense popularity in acoustic modelling [73], as the model could provide up to 20% improvement over the state-of-the-art models of the time, the Hidden Markov Model and the Gaussian Mixture Model. The approach creates feature detectors hierarchically as “features of features” in pretraining, which provides a good set of initialised weights to the discriminative model. The initialised weights lie in a region near the optimal weights, which can improve both modelling and convergence in fine-tuning [70, 74]. The DBN has been used as an initialised model for classification in many applications, such as phone recognition [62], computer vision [63], where it is used for the training of higher-order factorized Boltzmann machines, speech recognition [75, 76, 77], for pretraining DNNs, and the pretraining of deep convolutional neural networks (CNNs) [60, 78, 61]. The improved performance is due to the ability of the hidden layers of the network to learn abstract features. Some work analysing the features to understand what is lost and what is captured during training is presented in [64, 79, 80].

II-D Autoencoders

An autoencoder is a three-layer neural network, as shown in Figure 6, that tries to reconstruct its input at its output layer. Hence, the output layer of an autoencoder contains the same number of units as the input layer. The hidden layer typically contains fewer neurons than the visible layer and tries to encode or represent the input in a more compact form. It shares the same idea as the RBM, but it typically uses deterministic units instead of the stochastic units with a particular distribution used in the RBM.

Fig. 6: Autoencoder with 3 neurons in hidden layer

Like a feedforward neural network, an autoencoder is typically trained using the backpropagation algorithm. The computation consists of two phases: encoding and decoding. In the encoding phase, the model tries to encode the input into some hidden representation using the weight matrix of the lower half of the network, and in the decoding phase, it tries to reconstruct the same input from the encoded representation using the weight matrix of the upper half. In the formulation used here, the encoding and decoding weights are tied, so that each is the transpose of the other. The encoding and decoding operations of an autoencoder can be represented by the equations below. In the encoding phase,

$$y = f(wx + b) \tag{11}$$

where w, b are the parameters to be tuned, f is the activation function, x is the input vector, and y is the hidden representation. In the decoding phase,

$$\hat{x} = f(w^{\top} y + c) \tag{12}$$

where $w^{\top}$ is the transpose of w, c is the bias to the output layer, and $\hat{x}$ is the reconstructed input at the output layer. The parameters of the autoencoder can be updated using the following equations:

$$w_{new} = w - \eta \frac{\partial E}{\partial w} \tag{13}$$
$$b_{new} = b - \eta \frac{\partial E}{\partial b} \tag{14}$$

where $w_{new}$ and $b_{new}$ are the updated parameters for w and b respectively at the end of the current iteration, $\eta$ is the learning rate, and E is the reconstruction error of the input at the output layer.
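The encoding, decoding and update steps can be sketched as follows for a tied-weight autoencoder. This is an illustrative example only, assuming NumPy, a logistic sigmoid for f, mean squared reconstruction error for E, and the same gradient rule applied to the output bias c; the function `autoencoder_step`, the toy inputs and the learning rate are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_step(x, w, b, c, eta=0.5):
    """One gradient-descent step; returns updated (w, b, c) and the error
    measured before the update. Shapes: x (batch, d), w (k, d), b (k,), c (d,)."""
    y = sigmoid(x @ w.T + b)          # encoding: hidden representation
    x_hat = sigmoid(y @ w + c)        # decoding with the transpose of w
    error = float(np.mean((x_hat - x) ** 2))

    # Gradients of the mean squared error w.r.t. the pre-activations.
    d_out = 2.0 * (x_hat - x) / x.size * x_hat * (1.0 - x_hat)
    d_hid = (d_out @ w.T) * y * (1.0 - y)

    # w is shared by encoder and decoder, so its gradient has two terms.
    grad_w = d_hid.T @ x + y.T @ d_out
    w_new = w - eta * grad_w                 # w_new = w - eta * dE/dw
    b_new = b - eta * d_hid.sum(axis=0)      # b_new = b - eta * dE/db
    c_new = c - eta * d_out.sum(axis=0)      # same rule applied to c
    return w_new, b_new, c_new, error

rng = np.random.default_rng(0)
x = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
w = 0.1 * rng.standard_normal((3, 6))        # 3 hidden units, 6 input units
b, c = np.zeros(3), np.zeros(6)

errors = []
for _ in range(1000):
    w, b, c, error = autoencoder_step(x, w, b, c)
    errors.append(error)
```

Tracking `errors` over the iterations shows the reconstruction error falling as the hidden layer learns a compact code for the two input patterns.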

An autoencoder with multiple hidden layers forms a deep autoencoder. As with deep neural networks, training a deep autoencoder may be difficult because of the multiple layers. This can be overcome by training each layer of the deep autoencoder as a simple autoencoder [28, 41]. The approach has been successfully applied to encoding documents for faster subsequent retrieval [81], image retrieval, efficient speech features [82], etc. Alongside RBM stacking to form a DBN [28] for layerwise pretraining of DNNs, the autoencoder [41] and the sparse encoding energy-based model [71] were independently developed at that time. Both were effectively used to pretrain a deep neural network, much like the DBN. Unsupervised pretraining using autoencoders has been successfully applied in many fields, such as image recognition and dimensionality reduction on MNIST [54, 82, 83], multimodal learning in speech and video images [84, 85] and many more. The autoencoder has gained immense popularity as a generative model in recent years [38, 86]. The non-probabilistic and non-generative conventional autoencoder has been generalised to generative modelling [87, 42, 88, 89, 90], so that samples can be generated meaningfully from the network.

Several variations of the autoencoder have been introduced, with quite different properties and implementations, to learn more efficient representations of data. One popular variation that is robust to input variations is the denoising autoencoder [89, 42, 90]. The model can be used to learn a good compact representation of the input when the hidden layer has fewer units than the input layer. It can also be used for robust modelling of the input distribution with a larger number of neurons in the hidden layer. Robustness in the denoising autoencoder is achieved by applying the dropout trick or by adding Gaussian noise to the input data [91, 92] or to the hidden layers [93]. This helps in several ways: it virtually enlarges the training set, which reduces overfitting, and it makes the representation of the input more robust. The sparse autoencoder [93] allows a larger number of hidden units than visible units, making it easier to represent the input distribution efficiently in the hidden layer. The larger hidden layer represents the input by turning units on and off. The variational autoencoder [86, 94], which uses a concept quite similar to the RBM, learns a stochastic distribution over the latent variables instead of a deterministic mapping. Transforming autoencoders [95] were proposed as autoencoders with a transformation-invariant property, which the encoded features can effectively reflect. The encoder, which uses the “capsule” as its building block, has been applied to image recognition [95, 96]. A “capsule” is an independent sub-network that extracts local features within a limited viewing window to determine whether a feature entity is present with a certain probability. Pretraining CNNs with regularised deep autoencoders has become popular in recent computer vision work. Robust CNN models have been obtained with the denoising autoencoder [88] and with the sparse autoencoder with pooling and local contrast normalization [97], which provides features that are invariant not only to translation but also to scaling and out-of-plane rotation.
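The input corruption used by a denoising autoencoder can be sketched as follows. This is an illustrative sketch only: the function name `corrupt` and its parameters `noise_std` and `drop_prob` are hypothetical, and the cited works may apply the noise differently.

```python
import numpy as np

def corrupt(x, rng, noise_std=0.0, drop_prob=0.0):
    """Return a corrupted copy of x for denoising-autoencoder training."""
    # Additive Gaussian noise on every input component.
    noisy = x + rng.normal(0.0, noise_std, size=x.shape)
    # Dropout-style masking: each component is zeroed with probability drop_prob.
    keep = rng.random(x.shape) >= drop_prob
    return noisy * keep

rng = np.random.default_rng(0)
x_clean = np.ones((4, 5))
x_noisy = corrupt(x_clean, rng, noise_std=0.1, drop_prob=0.3)
```

During training the network receives `x_noisy` as input while the reconstruction error is measured against `x_clean`, so each pass over the data presents a slightly different version of every sample.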

Fig. 7: Convolution and Pooling Layers in a CNN
Fig. 8: Representations of an image of handwritten digit learned by CNN

II-E Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a class of neural networks that are extremely well suited to processing images. Although the idea was proposed as early as 1998 by LeCun et al. in their paper entitled ”Gradient-based learning applied to document recognition” [98], the deep learning world saw it in action when Krizhevsky et al. won the ILSVRC-2012 competition. The architecture that Krizhevsky et al. proposed is popularly known as AlexNet [99]. This remarkable win started a new era of artificial intelligence, and the computing community witnessed the real power of CNNs. Many architectures have been proposed since and more continue to appear, and in many cases these CNN architectures have matched or exceeded human recognition performance. It is worth noting that the deep learning revolution effectively began with the use of CNNs. CNNs are extremely useful for a set of closely related computer vision tasks such as object detection, image segmentation and image classification. At a very high level, deep learning is about learning data representations, and to do so deep learning systems typically break down complex representations into a set of simpler ones. As mentioned earlier, CNNs are particularly useful for images because images have a special spatial structure. An image has several characteristics such as edges, contours, strokes, textures, gradients, orientation and colour. A CNN breaks down an image in terms of simple properties like these and learns them as representations in different layers [100]. Figure 8 illustrates this learning scheme.

The layers involved in any CNN model are the convolution layers and the subsampling/pooling layers, which allow the network to learn filters that are specific to particular parts of an image. The convolution layers help the network retain the spatial arrangement of pixels present in an image, whereas the pooling layers allow the network to summarize the pixel information [101]. There are several CNN architectures, including ZFNet, AlexNet, VGG, YOLO, SqueezeNet and ResNet, and some of these are discussed in Section II-H.
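The two layer types can be illustrated with a small NumPy sketch. It assumes a single-channel image, no padding, a stride of one and non-overlapping max pooling; the function names and the hand-written edge filter are hypothetical, whereas a real CNN learns its filters from data.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Summarize each non-overlapping size x size window by its maximum."""
    h = fmap.shape[0] // size * size
    w = fmap.shape[1] // size * size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# 6x6 image with a vertical edge between columns 2 and 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
# Filter that responds to left-to-right intensity increases.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d_valid(image, kernel)   # shape (4, 4)
pooled = max_pool2d(feature_map, size=2)    # shape (2, 2)
```

The feature map is non-zero only in the columns whose window straddles the edge, which shows how convolution preserves where a feature occurs. Pooling then halves each spatial dimension while keeping the strongest response in each window.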

II-F Recurrent Neural Networks

Fig. 9: A Recurrent Neural Network Architecture

Although Hidden Markov Models (HMMs) can express time dependencies, they become computationally infeasible when modelling long-term dependencies, which RNNs are capable of capturing. A detailed derivation of the Recurrent Neural Network from differential equations can be found in [102]. RNNs are a form of feed-forward network unrolled across adjacent time steps, such that at any time instant a node of the network takes the current data input as well as the hidden node values capturing information from previous time steps. During the backpropagation of errors across multiple time steps, the problem of vanishing and exploding gradients arises, which can be mitigated by Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber [103]. The amount of information to be retained from previous time steps is controlled by a sigmoid layer known as the ‘forget gate’, whereas the sigmoid-activated ‘input gate’ decides which new information is to be stored in the cell, and a hyperbolic-tangent-activated layer produces new candidate values. The cell state is then updated as the old state weighted by the forget gate plus the candidate values weighted by the input gate. Finally, the output is produced from the hyperbolic tangent of the cell state, controlled by the output gate.
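A single LSTM time step with forget, input and output gates can be sketched in NumPy as below. This follows the commonly used formulation without peephole connections. Stacking the four gate weight matrices into one array `W` is an assumption made for brevity, and the names `lstm_step`, `W` and `b` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * n_hid, n_hid + n_in), b has shape (4 * n_hid,)."""
    n_hid = h_prev.shape[0]
    pre = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(pre[0 * n_hid:1 * n_hid])   # forget gate: how much old state to keep
    i = sigmoid(pre[1 * n_hid:2 * n_hid])   # input gate: how much new information to store
    g = np.tanh(pre[2 * n_hid:3 * n_hid])   # candidate values
    o = sigmoid(pre[3 * n_hid:4 * n_hid])   # output gate
    c = f * c_prev + i * g                  # updated cell state
    h = o * np.tanh(c)                      # hidden state / output
    return h, c

n_in, n_hid = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):        # a toy sequence of 5 time steps
    h, c = lstm_step(x, h, c, W, b)
```

Because the cell state is updated additively (`f * c_prev + i * g`), gradients can flow across many time steps when the forget gate stays close to one, which is the usual explanation for why the LSTM eases the vanishing gradient problem.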

LSTM networks with peephole connections [104] update the three gates using the cell state information. The Gated Recurrent Unit (GRU) [105] introduces a single update gate in place of the forget and input gates and merges the hidden and cell states. Sak et al. [106] trained LSTM RNNs in a distributed way on multicore CPUs using asynchronous SGD (Stochastic Gradient Descent) optimization for acoustic modelling. They presented a two-layer deep LSTM architecture in which each layer has a linear recurrent projection layer, making more efficient use of the model parameters. Doetsch et al. [107] proposed an LSTM-based training framework for handwriting recognition in which sequence chunks form the mini-batches used for training. Along with a reduction of runtime by a factor of 3, the architecture uses modified gating units with layer-specific weights for each gate. Palangi et al. [108] implemented a sentence embedding model using an LSTM-RNN that sequentially extracts information from each word and embeds it in a semantic vector until the end of the sentence, to obtain an overall semantic representation of the entire sentence. The model, which is capable of attenuating unimportant words and identifying salient keywords, is particularly useful in web document retrieval applications.

II-G Generative Adversarial Networks

Fig. 10: A Generative Adversarial Network Architecture

Goodfellow et al. [109] introduced a novel framework, Generative Adversarial Nets, in which a generative and a discriminative model are trained simultaneously. The proposed generative model bypasses the difficulty of approximating the intractable probabilistic computations that arise in Maximum Likelihood Estimation. The generative model tries to capture the data distribution, whereas the discriminative model learns to estimate the probability that a sample came from the training data rather than from the distribution captured by the generative model. If both models are multilayer perceptrons, only the backpropagation and dropout algorithms are required to train them.

The goal in this process is to train the generative network to maximize the probability that the discriminative network makes a mistake. A unique solution exists in the function space, in which the generative model recovers the distribution of the training data and the discriminative model outputs a probability of 50% for every sample. This can be viewed as a minimax two-player game between the two models: the generative model produces adversarial examples while the discriminative model tries to identify them correctly, and both improve until the adversarial examples are indistinguishable from the original ones.
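The two-player game can be written compactly as the value function of Goodfellow et al. [109]. The notation is introduced here for illustration: $G$ is the generator, $D$ the discriminator, $p_{data}$ the data distribution, $p_z$ the noise prior and $p_g$ the distribution of generated samples.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

For a fixed generator the optimal discriminator is

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$

so that when the generator recovers the data distribution, $p_g = p_{data}$, the discriminator outputs $D^{*}(x) = 1/2$ for every sample, which is the 50% solution.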

In [110], the authors presented training procedures for GANs, focusing on producing visually sensible images. The proposed model was successful in producing MNIST samples visually indistinguishable from the original data and in learning recognizable features from the ImageNet dataset in a semi-supervised way. This work provides insight into appropriate evaluation metrics for generative models in GANs and into stable semi-supervised training. In [111], the authors identified distinct features of GANs from a Turing perspective. The discriminators were allowed to behave as interrogators, as in the Turing Test, by interacting with the processes generating the data samples, and the resulting increase in model accuracy was confirmed in two case studies. The first concerned inferring an agent’s behavior from a hidden stochastic process while managing its environment. The second concerned active self-discovery, in which a robot draws conclusions about its own sensors through controlled movements.

Wu et al. [112] proposed a 3D Generative Adversarial Network (3DGAN) for three-dimensional object generation using volumetric convolutional networks, with a mapping from a low-dimensional probabilistic space to the three-dimensional object space, so that 3D objects can be sampled or explored without any reference image. As a result, high-quality 3D objects can be generated, and the adversarial discriminator provides an efficient shape descriptor learnt in an unsupervised manner. Vondrick et al. [113] proposed a model for video recognition/classification and video generation/prediction using a Generative Adversarial Network (GAN) with a spatio-temporal convolutional architecture that separates foreground from background. The proposed model can predict plausible future frames of static images, extract meaningful features and recognize actions in video in a minimally supervised way. Learning scene dynamics from unlabeled videos using adversarial learning is thus the main objective of the proposed framework.

Another interesting application is generating images from detailed visual descriptions [114]. The authors trained a deep convolutional generative adversarial network (DC-GAN) conditioned on text features encoded by a hybrid character-level convolutional recurrent neural network, and used a manifold interpolation regularizer. The generalizability of the approach was tested by generating images with various objects and changing backgrounds.

II-H Recent Deep Architectures

When it comes to deep learning and computer vision, datasets like Cats and Dogs, ImageNet, CIFAR-10 and MNIST are used for benchmarking. Throughout this section, the ImageNet dataset is used for benchmarking results, as it is more general than the other datasets just mentioned. Every year an image classification competition named ILSVRC (ImageNet Large Scale Visual Recognition Challenge) is organized, which is based on the ImageNet dataset and is widely accepted by the deep learning community [115].
Several deep neural network architectures have been proposed in the literature, and more are still being proposed, with the objective of achieving general artificial intelligence. The LeNet architecture, for example, was proposed by LeCun et al. in 1998 as a digit classification model, and was later used to identify handwritten numbers on cheques [98]. Several architectures have been proposed after LeNet, among which AlexNet is certainly one of the most notable. It was proposed by Krizhevsky et al. in 2012 and beat all the competitors in the ILSVRC challenge. AlexNet marks a significant turn in the history of deep learning for several reasons: it incorporated dropout regularization, which had only just been developed at the time, and it made use of efficient GPU computing to reduce training time, which was among the first efforts of its kind in 2012 [99]. Soon after AlexNet, ZFNet was proposed by Zeiler et al. in 2013 and showed state-of-the-art results on the ILSVRC challenge. It was an enhancement of the AlexNet architecture that uses expanded middle convolution layers and smaller strides and filters in the first convolution layer to capture pixel information in greater detail [116]. In 2014, Google researchers proposed a better model known as GoogLeNet or the Inception network, which won the ILSVRC 2014 challenge. The key feature of this architecture is the inception layer, which convolves in parallel with different kernel sizes. This in turn allows the network to better learn fine-grained pixel information in an image [117]. The VGGNet (also called VGG) architecture, proposed by Simonyan et al., is also worth mentioning here. It was the runner-up in the ILSVRC 2014 challenge. VGG uses 3x3 kernels throughout its entire architecture and achieves strong generalization with this design [118]. The winner of the ILSVRC 2015 challenge was the ResNet architecture, proposed by He et al. This architecture is more formally known as Residual Networks; it is deeper than the VGG architecture while still being less complex than VGG. ResNet was able to beat human performance on the ImageNet dataset and is still actively used in production [119] [120].

III Swarm Intelligence in Deep Learning

The introduction of heuristic and meta-heuristic algorithms for designing complex neural network architectures and tuning network parameters to optimize the learning process has improved the performance of several deep learning frameworks. To design Artificial Neural Networks (ANNs) automatically with evolutionary computation, the Deep Evolutionary Network Structured Representation (DENSER) was proposed in [121], where the optimal design for the network is achieved through a two-level representation. The outer level deals with the number of layers and their sequence, whereas the inner level optimizes the parameters and hyperparameters associated with each layer, defined by a human-readable context-free grammar. Through automatic design of CNNs, the approach performed well on the CIFAR-10, CIFAR-100, MNIST and Fashion-MNIST datasets. Garro et al. [122], on the other hand, proposed a methodology to design ANNs automatically using basic Particle Swarm Optimization (PSO), Second Generation of Particle Swarm Optimization (SGPSO) and a New Model of PSO (NMPSO) to simultaneously evolve and optimize the synaptic weights, the transfer function of each neuron and the architecture itself. The ANNs designed in this way were evaluated over eight fitness functions. The methodology aimed at dimensionality reduction of the input pattern and was compared with traditional designs trained using the well-known Back-Propagation and Levenberg-Marquardt algorithms. Das et al. [123] used PSO to optimize the number of layers and neurons, the kind of transfer functions involved and the topology of the ANN, with the aim of building channel equalizers that perform better under all noise scenarios.

Wang et al. [124] used variable-length Particle Swarm Optimization for the automatic evolution of deep Convolutional Neural Network architectures for image classification. They proposed a novel strategy for encoding CNN layers in particle vectors and introduced a ‘disabled’ layer that hides certain dimensions of the particle vector, yielding variable-length particles. In addition, to speed up the process, the authors evaluated candidates on randomly selected partial datasets. Thus several variants of PSO, along with its hybridised versions [125] and a host of recent swarm intelligence algorithms such as the Quantum Double Delta Swarm Algorithm (QDDS) [126] and its chaotic implementation [127] proposed by Sengupta et al., can be used, among others, for the automatic generation of architectures used in deep learning applications.
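For readers unfamiliar with PSO, the basic global-best variant can be sketched as follows. This is the textbook algorithm, not the variable-length or quantum-inspired variants cited above. The sphere function is only a stand-in for a real fitness such as the validation error of a network decoded from the particle, and the function name and coefficient values are assumptions.

```python
import numpy as np

def pso_minimize(fitness, dim, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO; returns the best position found and its fitness."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()                                   # personal best positions
    pbest_val = np.array([fitness(p) for p in pos])
    g = np.argmin(pbest_val)
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]     # swarm-wide best

    for _ in range(n_iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Each particle is pulled towards its own best and the swarm's best.
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([fitness(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = np.argmin(pbest_val)
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, gbest_val

def sphere(p):
    """Toy fitness with its minimum of 0 at the origin."""
    return float(np.sum(p ** 2))

best, best_val = pso_minimize(sphere, dim=3)
```

In architecture search, each dimension of a particle would encode a design choice such as the number of filters or a layer type, and evaluating `fitness` would involve training and validating the decoded network, which is why the evaluation cost dominates in practice.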

The problem of the changing dimensionality of the information perceived by each agent in deep reinforcement learning (RL) for swarm systems has been addressed in [128] using an end-to-end learned mean feature embedding as state information. The research concluded that an end-to-end embedding using neural network features helps scale the RL architecture to increasing numbers of agents, leads to better-performing policies and ensures fast convergence.
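The idea of a mean feature embedding can be illustrated with a small sketch: each neighbor's observation is passed through the same feature map and the results are averaged, giving a fixed-size vector regardless of how many neighbors are visible. The single tanh layer with random weights used here is an assumption for illustration only; in [128] the feature map is learned end-to-end together with the policy.

```python
import numpy as np

def mean_embedding(neighbor_obs, W, b):
    """Average the per-neighbor features; neighbor_obs has shape (n_neighbors, obs_dim)."""
    features = np.tanh(neighbor_obs @ W.T + b)   # shape (n_neighbors, embed_dim)
    return features.mean(axis=0)                 # shape (embed_dim,)

rng = np.random.default_rng(0)
obs_dim, embed_dim = 2, 4
W = rng.normal(size=(embed_dim, obs_dim))
b = np.zeros(embed_dim)

obs_two = rng.normal(size=(2, obs_dim))     # agent currently sees 2 neighbors
obs_five = rng.normal(size=(5, obs_dim))    # agent currently sees 5 neighbors

state_two = mean_embedding(obs_two, W, b)
state_five = mean_embedding(obs_five, W, b)
```

Both states have the same dimensionality, and the result does not depend on the order in which neighbors are listed, so the same policy network can be used as the swarm grows or shrinks.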

IV Testing Neural Networks

Software employed in safety-critical systems needs to be rigorously tested through white-box or black-box testing. In white-box testing, the internal structure of the software/program is known and is used to generate test cases according to the test criteria/requirements. In black-box testing, by contrast, only the inputs and outputs of the program are compared, as the internal code of the software cannot be accessed. Some previous work on generating test cases that reveal faults can be found in [129], and in [130] using principal component analysis. In [131] the authors implemented a black-box testing methodology by feeding randomly generated input test cases to an original version of a real-world test program and recording the corresponding outputs, so that the resulting input-output pairs could be used to train a neural network. Each test case is then applied to a mutated, faulty version of the test program, and its output is compared against the output of the trained ANN; the distance between the two outputs indicates whether the faulty program has produced a valid or an invalid result. The ANN is thus treated as an automated ‘oracle’, which produces satisfactory results when the training set provides good coverage of the whole input range.

Y. Sun et al. [132] proposed a set of four test coverage criteria drawing inspiration from the traditional Modified Condition/Decision Coverage (MC/DC) criteria. They also proposed algorithms, built upon linear programming, for generating test cases for each criterion. A new test case (an input to the Deep Neural Network) is produced by perturbing a given one: the algorithms encode the test requirement and a fragment of the DNN by fixing the activation pattern obtained from the given input, and then minimize the difference between the new and the current inputs. The utility of this method lies in bug finding, determining DNN safety statistics, measuring testing accuracy and analysing the internal structure of the DNN. The paper discusses the sign change, value change and distance change of a neuron pair, consisting of two neurons in adjacent layers, in terms of the change in their activation values between two given test cases. Four covering methods (sign-sign cover, distance-sign cover, sign-value cover and distance-value cover) are explained, along with the test requirements and test criteria, which compute the percentage of neuron pairs covered by the test cases with respect to the covering method.

For each test requirement, an automatic test case generation algorithm based on Linear Programming (LP) is implemented. The objective is to find a test input, whose value is synthesized with LP, that has the same activation pattern as a given input. A pair of inputs that satisfy the closeness definition is called an adversarial example if only one of them is correctly labeled by the DNN. The testing criteria require that (sign or distance) changes of the condition neurons support the (sign or value) change of every decision neuron. For a pair of neurons and a specified testing criterion, two activation patterns need to be found that together exhibit the changes required by that criterion. The inputs matching these patterns are added to the final test suite. The authors report results on 10 DNNs with the Sign-Sign, Distance-Sign, Sign-Value and Distance-Value covering methods, showing that the test generation algorithms are effective, as they reach high coverage for all covering criteria. The covering methods are also shown to be useful, as a significant portion of adversarial examples is identified. To evaluate the quality of the adversarial examples obtained, a distance curve was plotted to show how close each adversarial example is to the correct input. It was observed that covering neuron pairs can become harder deeper in the DNN. In such circumstances, a larger dataset is needed when generating test pairs to improve coverage. Interestingly, most adversarial examples appear to be found around the middle layers of all the DNNs tested. As future work, the authors propose finding more efficient test case generation algorithms that do not require linear programming.

Katz et al. [133] provided methods for verifying the adversarial robustness of neural networks with the Reluplex algorithm, to prove that a small perturbation to a correctly classified input does not result in misclassification. Huang et al. [134] proposed an automated verification framework based on Satisfiability Modulo Theories (SMT) to test the safety of a neural network by searching for adversarial manipulations in the space around a given data point. The adversarial examples discovered were used to fine-tune the network.

IV-A Different Methods of Adversarial Test Generation

Despite the success of deep learning in various domains, the robustness of neural network architectures needs to be studied before they are applied in safety-critical systems. In this subsection we discuss the kinds of malicious attack that can fool or mislead a NN into making wrong decisions, and ways to overcome them. The work presented by Tuncali et al. [135] deals with generating scenarios that lead to unexpected behaviors by introducing perturbations in the testing conditions. To identify falsifying and critical system behaviors in autonomous driving systems, the authors focused on finding glancing counterexamples, which refer to borderline behavior where the system is on the verge of failing. They introduced a Signal Temporal Logic (STL) formula for the problem at hand, which in this case is a combination of predicates over the speed of the target car and the distances and relative positions of all other objects (including cars and pedestrians). A list of test cases is then created and evaluated against the STL specification. A covering array spanning all possible combinations of the values the variables can take is generated. To find a glancing behavior, the discrete parameters from the covering array that correspond to the trace minimizing the STL condition are used to create test cases, with the continuous variables chosen either uniformly at random or by a search guided by a cost function. In this way a glancing test case for a trace is obtained. The proposed closed-loop architecture integrates the controller and the Deep Neural Network (DNN) based perception system to search for critical behavior of the vehicle.

In [136], Yuan et al. discuss the adversarial falsification problem, explaining false positive and false negative attacks, white-box attacks, where there is complete knowledge of the trained NN model, and black-box attacks, where no information about the model can be accessed. With respect to adversarial specificity, there are targeted and non-targeted attacks, where the class output for the adversarial input is predefined in the first case and arbitrary in the second. They also discuss perturbation scope, where individual attacks generate a unique perturbation per input whereas universal attacks generate a single perturbation for the whole dataset. The perturbation is measured as the p-norm distance between the actual and the adversarial input. The paper discusses various attack methods, including the L-BFGS attack and the Fast Gradient Sign Method (FGSM), which performs a one-step gradient update along the direction of the sign of the gradient at every pixel, expressed as [137]:

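$$x' = x + \epsilon \, \mathrm{sign}(\nabla_x J(\theta, x, y)) \qquad (15)$$

where $\epsilon$ is the magnitude of the perturbation which, when added to an input $x$, generates an adversarial example $x'$, and $J$ is the training loss of the model with parameters $\theta$ for the true label $y$.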
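A minimal sketch of FGSM is given below for a logistic-regression classifier, chosen because its input gradient has a closed form; for a deep network the gradient would instead be obtained by backpropagation. The weights, input and perturbation size are hypothetical values for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fgsm(x, y, w, b, eps):
    """One-step FGSM: move x by eps along the sign of the loss gradient."""
    p = sigmoid(w @ x + b)
    # Gradient of the binary cross-entropy loss with respect to the input x.
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -2.0, 0.5])     # hypothetical trained weights
b = 0.0
x = np.array([0.2, -0.1, 0.4])     # clean input with true label y = 1
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.5)
p_clean = sigmoid(w @ x + b)       # about 0.65, classified as class 1
p_adv = sigmoid(w @ x_adv + b)     # about 0.24, now classified as class 0
```

Every component of the input is changed by exactly `eps`, so the perturbation is bounded in the infinity norm, yet the predicted class flips. The perturbation size here is deliberately large relative to the input so that the effect is visible in three dimensions; in image models, where there are many more pixels, a much smaller value can be sufficient.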

FGSM has been extended by the Basic Iterative Method (BIM) and the Iterative Least-Likely Class Method (ILLC). Moosavi-Dezfooli et al. [138] proposed DeepFool, an iterative attack that uses a linear approximation to overcome the nonlinearity in multidimensional cases.

IV-B Countermeasures for Adversarial Examples

The paper [136] covers reactive countermeasures such as Adversarial Detecting, Input Reconstruction and Network Verification, and proactive countermeasures such as Network Distillation, Adversarial (Re)training and Classifier Robustifying. In Network Distillation, a high-temperature softmax activation reduces the sensitivity of the model to small perturbations. In Adversarial (Re)training, adversarial examples are used during training. Adversarial Detecting estimates the probability that a given input is adversarial. In the Input Reconstruction technique, a denoising autoencoder is used to transform adversarial examples back to clean data before they are passed as input to the deep NN prediction module. In addition, Gaussian Process Hybrid Deep Neural Networks (GPDNNs) have been shown to be more robust to adversarial inputs.
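The effect of the softmax temperature used in distillation can be seen in a short sketch. Only the temperature-scaled softmax is shown, not the full distillation training procedure, and the logits and temperature values are hypothetical.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; a larger T gives a softer (flatter) distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                 # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [5.0, 1.0, 0.0]
p_sharp = softmax_with_temperature(logits, T=1.0)    # close to one-hot
p_soft = softmax_with_temperature(logits, T=20.0)    # much flatter
```

At a high temperature the class probabilities change more gradually with the logits, and the softened outputs of a first network are used as training targets for the distilled network.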

There are also ensembling defense strategies to counter multifaceted adversarial examples. However, the defense strategies discussed here are mostly applicable to computer vision tasks, whereas what is currently needed is real-time detection of adversarial inputs and corresponding safety measures for safety-critical systems.

In [139], Rouhani et al. proposed DeepFense, an online defense framework against adversarial deep learning. They formulated the defense as an unsupervised optimization problem that minimizes the rarely observed regions in the latent feature hyperspace spanned by a deep learning network, and were able to decrease the risk of integrated attacks. With an integrated design of algorithms for software and hardware, the proposed framework aims to maximize model reliability.

Robust countermeasures are needed for different types of adversarial scenarios to provide a reliable infrastructure, as no single countermeasure is universally applicable to all sorts of adversaries. A detailed list of specific attack generation methods and corresponding countermeasures can be found in [140].

Application Area | Authors
Fraud Detection in Financial Services | Pumsirirat et al. [141], Schreyer et al. [142], Wang et al. [143], Zheng et al. [144], Dong et al. [145], Gomez et al. [146], Rymantubb et al. [147], Fiore et al. [148]
Financial Time Series Forecasting | Cavalcante et al. [149], Li et al. [150], Fama et al. [151], Lu et al. [152], Tkáč & Verner [153], Pandey et al. [154], Lasfer et al. [155], Gudelek et al. [156], Fischer & Krauss [157], Santos Pinheiro & Dras [158], Bao et al. [159], Hossain et al. [160], Calvez and Cliff [161]
Prognostics and Health Monitoring | Basak et al. [162], Tamilselvan & Wang [163], Kuremoto et al. [164], Qiu et al. [165], Gugulothu et al. [166], Filonov et al. [167], Botezatu et al. [168]
Medical Image Processing | Suk, Lee & Shen [169], van Tulder & de Bruijne [170], Brosch & Tam [171], Esteva et al. [172], Rajaraman et al. [173], Kang et al. [174], Hwang & Kim [175], Andermatt et al. [176], Cheng et al. [177], Miao et al. [178], Oktay et al. [179], Golkov et al. [180]
Power Systems | Vankayala & Rao [181], Chow et al. [182], Guo et al. [183], Bourguet & Antsaklis [184], Bunn & Farmer [185], Hippert et al. [186], Kuster et al. [187], Aggarwal & Song [188], Zhai [189], Park et al. [190], Mocanu et al. [191], Chen et al. [192], Bouktif et al. [193], Dedinec et al. [194], Rahman et al. [195], Kong et al. [196], Dong et al. [197], Kalogirou et al. [198], Wang et al. [199], Das et al. [200], Dabra et al. [201], Liu et al. [202], Jang et al. [203], Gensler et al. [204], Abdel-Nasser et al. [205], Manwell et al. [206], Marugán et al. [207], Wu et al. [208], Wang et al. [209], Wang et al. [210], Feng et al. [211], Qureshi et al. [212]
TABLE III: Distribution of Articles by Application Areas

V Applications

V-A Fraud Detection in Financial Services

Fraud detection is an interesting problem in that it can be formulated in an unsupervised, a supervised or a one-class classification setting. In the unsupervised learning category, class labels are either unknown or assumed to be unknown, and clustering techniques are employed to identify (i) distinct clusters containing fraudulent samples or (ii) far-off fraudulent samples that do not belong to any cluster when all clusters contain genuine samples, in which case the task is treated as an outlier detection problem. In the supervised learning category, class labels are known and a binary classifier is built to classify fraudulent samples. Examples of these techniques include logistic regression, Naive Bayes, supervised neural networks, decision trees, support vector machines, fuzzy rule-based classifiers and rough set-based classifiers. Finally, in the one-class classification category, either only samples of the genuine class are available or fraud samples are not used for training even if available. Such models are called one-class classifiers. Examples include the one-class support vector machine (also known as support vector data description, or SVDD) and auto-associative neural networks (also known as autoencoders). In this category, models are trained on the genuine class data and tested on the fraud class. The literature abounds with studies that use traditional neural networks with various architectures for these three categories. That said, fraud (including cyber fraud) is becoming increasingly menacing, and fraudsters always appear to be a few notches ahead of organizations in finding new loopholes in the system and exploiting them effortlessly. Organizations, on the other hand, make huge investments of money, time and resources to predict fraud in near real time, if not real time, and to mitigate its consequences. Financial fraud manifests itself in various areas such as banking, insurance and investments (stock markets). It can be offline as well as online. Online fraud includes credit/debit card fraud, transaction fraud and cyber fraud involving security, while offline fraud includes accounting fraud, forgeries, etc.

Deep learning algorithms have proliferated during the last five years and have found immense application in many fields where traditional neural networks had been applied with great success. Fraud detection is one of them. In what follows, we review works that employed deep learning for fraud detection and appeared in refereed international journals, along with one article from the arXiv repository. Papers published in international conferences are excluded.

Pumsirirat et al. (2018) [141] proposed an unsupervised deep autoencoder (AE) based on the restricted Boltzmann machine (RBM) to detect novel frauds, because fraudsters constantly change their modus operandi to avoid being caught. They employed a backpropagation-trained deep autoencoder based on the RBM that can reconstruct normal transactions, so that anomalies can be identified as deviations from normal patterns. They implemented the AE and RBM using Google's TensorFlow library and the H2O deep learning framework. Results are reported in terms of mean squared error, root mean squared error and area under the curve (AUC).
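The principle of detecting fraud through reconstruction error in a one-class setting can be sketched as follows. To keep the example short and deterministic, a linear autoencoder fitted by singular value decomposition is used as a stand-in for the deep autoencoder; this substitution, the synthetic data and the thresholding rule are all assumptions for illustration and do not reproduce the method of [141].

```python
import numpy as np

def fit_linear_autoencoder(X, n_components):
    """Fit a linear encoder/decoder on genuine samples via SVD (equivalent to PCA)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(X, mean, components):
    """Mean squared reconstruction error of each row of X."""
    codes = (X - mean) @ components.T        # encode
    recon = codes @ components + mean        # decode
    return np.mean((X - recon) ** 2, axis=1)

rng = np.random.default_rng(0)
# Synthetic "genuine" transactions lying close to a line in 3-D feature space.
t = rng.normal(size=(200, 1))
genuine = t * np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, 0.01, size=(200, 3))
# A synthetic "fraud" sample that does not follow the genuine pattern.
fraud = np.array([[3.0, -2.0, 1.0]])

mean, components = fit_linear_autoencoder(genuine, n_components=1)
# Flag anything reconstructed worse than the worst genuine training sample.
threshold = reconstruction_error(genuine, mean, components).max()
is_fraud = reconstruction_error(fraud, mean, components) > threshold
```

The model is fitted on genuine samples only, which matches the one-class setting described above. In practice the threshold would be chosen on held-out data, for example as a high percentile of the genuine reconstruction errors, to trade off false alarms against missed frauds.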

Schreyer (2017) [142] observed that rules driven by business and experiential knowledge fail to generalize beyond known scenarios in large-scale accounting fraud. He therefore proposed a deep autoencoder for this purpose and tested its effectiveness on two real-world datasets. Chartered accountants appreciated the power of the deep autoencoder in predicting anomalous accounting entries.

Automobile insurance fraud has traditionally been predicted using only structured data, while the textual data present in the claims was never analyzed. Wang and Xu (2018) [143] proposed a novel method in which Latent Dirichlet Allocation (LDA) is first used to extract text features hidden in the accident descriptions appearing in the claims; these features, together with the traditional numeric features, are then used as input to train deep neural networks. On a real-world insurance fraud dataset, they concluded that their hybrid approach outperformed random forests and support vector machines.

Telecom fraud has assumed large proportions, and its impact can be seen in services involving mobile banking. Zheng et al. (2018) [144] proposed a novel generative adversarial network (GAN) based model that computes the probability of fraud for each large transfer, so that the bank can prevent potential fraud if the probability exceeds a threshold. The model uses a deep denoising autoencoder to learn the complex probabilistic relationship among the input features and employs adversarial training to accurately discriminate between positive and negative samples in a dataset. They concluded that the model outperformed traditional classifiers and that, using it, two commercial banks reduced losses by about 10 million RMB in twelve weeks, thereby significantly improving their reputation.

In today’s word-of-mouth marketing, online reviews posted by customers influence buyers’ purchase decisions more than ever before. However, fraud can be perpetrated in these reviews too, by posting fake and meaningless reviews that do not reflect customers’ genuine purchase experience and opinions. Such reviews make it hard for users to make the right choices. Therefore, it is desirable to build a fraud detection model to identify and weed out fake reviews. Dong et al. (2018) [145] presented a model combining an autoencoder and a random forest, where a stochastic decision tree model fine-tunes the parameters. Extensive experiments were conducted on a large Amazon review dataset.

Gomez et al. (2018) [146] presented a neural network based system for fraud detection in banking. They analyzed a real-world dataset and proposed an end-to-end solution from the practitioner’s perspective, focusing especially on issues such as data imbalance, data processing and cost metric evaluation. They reported that their proposed solution performed comparably with state-of-the-art solutions.

Ryman-Tubb et al. (2018) [147] observed that payment card fraud dented economies to the tune of USD 416bn in 2017. This fraud is perpetrated primarily to finance terrorism, arms and drug crime. Until recently, the patterns of fraud and the criminals’ modus operandi remained unsophisticated. However, smartphones, mobile payments, cloud computing and contactless payments have emerged almost simultaneously with large-scale data breaches, making the extant methods less effective. They surveyed extant methods and benchmarked them using transactional volumes in 2017. This benchmark showed that only eight traditional methods have performance practical enough to be deployed in industry. Further, they suggested that a cognitive computing approach and deep learning are promising research directions.

Fiore et al. (2019) [148] observed that data imbalance is a crucial issue in payment card fraud detection and that oversampling has some drawbacks. They proposed Generative Adversarial Networks (GAN) for oversampling: a GAN is trained to output mimicked minority class examples, which are then merged with the training data into an augmented training set so that the effectiveness of a classifier can be improved. They concluded that a classifier trained on the augmented set outperformed the same classifier trained on the original data, especially as far as sensitivity is concerned, resulting in an effective fraud detection mechanism.

In summary, some progress has been made in applying a few deep learning architectures to fraud detection. However, there is immense potential to contribute to this field, especially through the application of ResNet, gated recurrent units, capsule networks, etc. to detect fraud, including cyber fraud.

V-B Financial Time Series Forecasting

Advances in technology and breakthroughs in deep learning models have led to an increase in intelligent automated trading and decision support systems in financial markets, especially in the stock and foreign exchange (FOREX) markets. However, time series are difficult to predict, financial time series especially so [149]. On the other hand, NN and deep learning models have shown great success in forecasting financial time series [150], despite the contrary position of the efficient market hypothesis (EMH) [151], which holds that the FOREX and stock markets follow a random walk and that any profit made is due to chance. This success can be attributed to the ability of NNs to self-adapt to any nonlinear data set without any statistical assumption or prior knowledge of the data set [152].

Deep learning models have been trained and built using both fundamental and technical analysis data, the two most commonly used sources for financial time series forecasting [149]. Fundamental analysis uses or mines textual information such as financial news, company financial reports and other economic factors such as government policies to predict price movement. Technical analysis, on the other hand, is the analysis of historical stock and FOREX market data.

Deep Learning NN (DLNN), or Multilayer Feed-forward NN (MFF), is the most widely used algorithm for financial markets [153]. The experimental analysis by Pandey et al. [154] showed that an MFF with Bayesian learning performed better than an MFF trained with backpropagation for the FOREX market.

Deep neural networks and machine learning models are often considered black boxes because their internal workings are not fully understood. The performance of a DNN is highly influenced by its parameters for a particular domain. Lasfer et al. [155] analyzed the influence of parameters (such as the number of neurons, learning rate and activation function) on stock price forecasting. Their work showed that a larger NN produces better results than a smaller NN. However, the effect of the activation function is smaller for a large NN.

Although CNNs are best known for image recognition and have seen fewer applications in financial markets, they have also shown good performance in forecasting the stock market. As indicated by [155], the deeper the network, the better an NN can generalize to produce good results. However, the more layers an NN has, the more likely it is to overfit a given data set. CNNs, on the other hand, reduce the tendency to overfit through convolution, pooling and dropout [156].

To apply a CNN to financial markets, the input data needs to be transformed or adapted. Using a sliding window, Gudelek et al. [156] generated images by taking snapshots of the stock time series data and fed them into a 2D-CNN to perform daily predictions and trend classification (upward or downward). The model achieved 72 percent accuracy on a data set of 17 exchange traded funds. The model was not compared against other NN architectures. Fischer and Krauss [157] adapted LSTM for stock prediction and compared its performance with memory-free algorithms such as random forest, a logistic regression classifier and a deep neural network. LSTM performed better than the other algorithms; random forest, however, outperformed LSTM during the financial crisis in 2008.
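
As a rough illustration of the sliding-window step, the sketch below turns a price series into fixed-length input windows with an up/down label for the following day. It does not reproduce the image generation of [156]; the function name, the window length and the prices are hypothetical.

```python
def sliding_windows(series, window):
    """Return (inputs, labels): each input is `window` consecutive values, and the
    label is 1 if the next value is higher than the last value in the window
    (upward move), else 0."""
    inputs, labels = [], []
    for start in range(len(series) - window):
        chunk = series[start:start + window]
        next_value = series[start + window]
        inputs.append(chunk)
        labels.append(1 if next_value > chunk[-1] else 0)
    return inputs, labels

prices = [10.0, 10.5, 10.2, 10.8, 11.0, 10.7, 10.9]  # hypothetical closing prices
X, y = sliding_windows(prices, window=3)
for window_values, label in zip(X, y):
    print(window_values, "->", "up" if label else "down")
```

Each window can then be reshaped or rendered into whatever input format the chosen network expects.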

The EMH [151] holds that financial news affecting price movement is incorporated into the price either immediately or gradually. Therefore, any investor who can analyze the news first and devise a good trading strategy can profit. Based on this view and the rise of big data, there has been an upward trend in sentiment analysis and text mining research that uses blogs, financial news and social media to forecast the stock or FOREX market [149]. Santos et al. [158] explored the impact of news articles on company stock prices by implementing an LSTM neural network pre-trained by a character-level language model to predict changes in a company’s price for both inter-day and intraday trading. The results showed that a CNN with a word-wise model outperformed the other models. However, the character-level LSTM model performed better than RNN-based models and also had less architectural complexity than the other algorithms.

Moreover, hybrid architectures have been proposed that combine the strengths of more than one deep learning model to forecast financial time series. Bao et al. [159] combined wavelet transform, stacked autoencoders and LSTM for stock price prediction. The output of one network or model was fed into the next model as input. The hybrid model performed better than standalone LSTM and RNN models. Hossain et al. [160] also created a hybrid model by combining LSTM and Gated Recurrent Unit (GRU) to predict the S&P 500 stock price. The model was compared against standalone models such as LSTM and GRU with different architectural layers. The hybrid model outperformed all other algorithms.

Calvez and Cliff [161] introduced a new approach to trading on the stock market with a DLNN model, in which the DLNN learns by observing the trading behavior of traders. The authors used a limit order book (LOB) and quotes made by successful traders (both automated and human) as input data. The DLNN was able to learn and outperformed both human and automated traders. This approach to learning might be a breakthrough for intelligent automated trading in financial markets.

V-C Prognostics and Health Management

The service reliability of the ever-encompassing cyber-physical systems around us has begun to attract the attention of the prognostics community in recent years. Factors such as revenue loss, system downtime, failure in mission-critical deployments and market competitive index are emergent motivations behind making accurate predictions about the State-of-Health (SoH) and Remaining Useful Life (RUL) of components and systems. Industry niches such as manufacturing, electronics, automotive, defense and aerospace are becoming increasingly reliant on expert diagnosis of system health and smart recommender systems for maximizing system uptime and adaptively scheduling maintenance. Given the surge in sensor influx, if sufficient structured information exists in historical or transient data, accurate models describing the system evolution may be proposed. The general idea in such approaches is that there is a point in the operational cycle of a component beyond which it no longer delivers optimum performance. In this regard, the most widely used metric for determining the critical operational cycle is the Remaining Useful Life (RUL), a measure of the time from measurement to the critical cycle beyond which sub-optimal performance is anticipated. Prognostic approaches may be divided into three categories: (a) model-driven, (b) data-driven and (c) hybrid, i.e. any combination of (a) and (b). The last three decades have seen extensive use of model-driven approaches with Gaussian Processes and Sequential Monte-Carlo (SMC) methods, which continue to be popular for capturing patterns in relatively simple sensor data streams. However, one shortcoming of the model-driven approaches used to date is their dependence on physical evolution equations recommended by an expert with problem-specific domain knowledge.
For model-driven approaches to continue to perform well as problem complexity scales, the prior distribution (physical equations) needs to continue to capture the embedded causalities in the data accurately. However, it has been observed that as sensor data scales, the ability of model-driven approaches to learn the inherent structures in the data has lagged. This is due to the use of simplistic priors and updates, which are unable to capture the complex functional relationships in high-dimensional input data. With the introduction of self-regulated learning paradigms such as Deep Learning, this problem of learning the structure in sensor data was mitigated to a large extent, because it was no longer necessary for an expert to hand-design the physical evolution scheme of the system. With recent advancements in parallel computational capabilities, techniques leveraging the volume of available data have begun to shine. One key issue to keep in mind is that the performance of data-driven approaches is only as good as the labeled data available for training. While the surplus of sensor data may act as a motivation for choosing such approaches, it is critical that the precursor to the supervised part of learning, i.e. data labeling, is accurate. This often requires laborious and time-consuming effort and is not guaranteed to yield near-accurate ground truth. However, when adequate precaution is in place and a strategic implementation facilitating optimal learning is achieved, it is possible to deliver customized solutions to complex prediction problems with an accuracy unmatched by simpler, model-driven approaches. Therein lies the holy grail of deep learning: the ability to scale learning with training data.
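
The definition of RUL as the time from a measurement to the critical cycle translates directly into regression targets for a data-driven model. The following is a minimal sketch, assuming the critical cycle of a unit is already known from run-to-failure data; the function name and the numbers are hypothetical.

```python
def remaining_useful_life(cycles, critical_cycle):
    """RUL label for each measurement cycle: cycles left until the critical
    cycle, floored at zero once the critical cycle has been passed."""
    return [max(critical_cycle - t, 0) for t in cycles]

cycles = list(range(1, 9))  # measurement cycles 1..8 of one hypothetical unit
rul = remaining_useful_life(cycles, critical_cycle=6)
print(list(zip(cycles, rul)))
```

The quality of these labels depends on how accurately the critical cycle is identified, which is the labeling concern raised above.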

The recent works on device health forecasting are as follows: Basak et al. [162] carried out Remaining Useful Life (RUL) prediction of hard disks, along with a discussion of effective feature normalization strategies, on Backblaze hard disk data. A Deep Belief Network (DBN) based multisensor health diagnosis methodology was proposed in [163] and applied to aircraft engine and electric power transformer health diagnosis to show the effectiveness of the approach.

Kuremoto et al. [164] applied a DBN composed of two Restricted Boltzmann Machines (RBM) to capture the input feature distribution and then optimized the size of the network and the learning rate through Particle Swarm Optimization for time series forecasting. Qiu et al. [165] proposed an early warning model in which feature extraction through a DNN is combined with hidden state analysis by a Hidden Markov Model (HMM) for health maintenance of the equipment chain in a gas pipeline. Gugulothu et al. [166] proposed a forecasting scheme using a Recurrent Neural Network (RNN) model to generate embeddings that capture the trend of multivariate time series data, which are expected to differ between healthy and unhealthy devices. The idea of using RNNs to capture intricate dependencies among various time cycles of sensor observations is emphasized in [167] for prognostic applications. Botezatu et al. [168] derived rules for directly identifying the healthy or unhealthy state of a device, employing a disk replacement prediction algorithm with changepoint detection applied to Backblaze time series data.

A typical flow of prognostics and health management for any system under test using data-driven approaches thus starts with data collection, including instances of device performance or features under both normal and degraded operating conditions. Data preprocessing and feature selection play a crucial role before deep learning approaches are applied to learn a model capturing the degradation of the device under test, which can then be employed to diagnose faults and prognosticate future states of the device, ensuring proper maintenance. Maintaining a balance between false positives and false negatives becomes crucial under real-world industrial constraints, and accuracy measures such as precision and recall must be validated before deployment.
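
For concreteness, precision and recall can be computed from predicted and true fault labels as in the short sketch below. The label vectors are made up for illustration, with 1 denoting a faulty device.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 0, 0, 1, 1, 0, 0, 0]  # hypothetical ground truth
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]  # hypothetical model output
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A false positive here corresponds to an unnecessary maintenance action and a false negative to a missed fault, which is why both measures need to be checked.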

V-D Medical Image Processing

Deep learning techniques have pervaded the entire discipline of medical image processing, and the number of studies highlighting their application in canonical tasks such as image classification, detection, enhancement, image generation, registration and segmentation has been rising sharply. A recent survey by Litjens et al. [213] presents a collective picture of the prevalence and applications of deep learning models within the community, as does a fairly rigorous treatise by Shen et al. [214]. A concise overview of recent work in some of these canonical tasks follows.

The purpose of image/exam classification is to identify the presence of a disease based on the images of medical examinations. Over the last few years, various neural network architectures have been used in this field, including stacked auto-encoders applied to the diagnosis of Alzheimer’s disease and mild cognitive impairment, exploiting the latent non-linear relations among various features [169]; Restricted Boltzmann Machines applied to lung CT analysis, combining generative and discriminative learning techniques [170]; and Deep Belief Networks trained on three-dimensional medical images [171]. More recently, a trend towards using Convolutional Neural Networks has been observed. In 2017, Esteva et al. [172] fine-tuned the Inception v3 [215] model to classify clinical images from skin cancer examinations into benign and malignant variants. The experiments were validated by testing model performance against a good number of dermatologists. In 2018, Rajaraman et al. [173] used specialized CNN architectures such as ResNet for detecting malarial parasites in thin blood smear images. Kang et al. [174] improved on the performance of 2D CNNs by using a 3D multi-view CNN for lung nodule classification, exploiting spatial contextual information with the help of a 3D Inception-ResNet architecture.

Object/lesion detection aims to identify different parts/lesions in an image. Although object classification and object detection are quite similar, each comes with its own challenges. In object detection, class imbalance can pose a major hurdle to model performance. Object detection also involves identifying localized information (specific to different parts of an image) from the full image space. The task of object detection is therefore a combination of identification of localized information and classification [216]. In 2016, Hwang and Kim proposed a self-transfer learning (STL) framework which optimizes both aspects of the medical object detection task. They tested the STL framework on the detection of nodules in chest radiographs and lesions in mammography [175].

Fig. 11: MRI Brain Slice and its different segmentation [217]

Segmentation is one of the most common subjects of interest in the application of Deep Learning to medical image processing. Organ and substructure segmentation allows for advanced fine-grained analysis of a medical image and is widely practiced in the analysis of cardiac and brain images. A demonstration is shown in Figure 11, where different segmented parts of an MRI brain slice are shown along with the original slice. Segmentation involves both the local and the global context of pixels with respect to a given image, and the performance of a segmentation model can suffer from inconsistencies due to class imbalance. This makes segmentation a difficult task. The most widely used CNN architecture for medical image segmentation is U-Net, proposed by Ronneberger et al. [218] in 2015. U-Net takes care of the sampling required to counter class imbalance, and it can scan an entire image in a single forward pass, which enables it to consider the full context of the image. RNN-based architectures have also been proposed for segmentation tasks. In 2016, Andermatt et al. [176] presented a method to automatically segment 3D volumes of biomedical images. They used multi-dimensional gated recurrent units (GRU) as the main layers of their neural network model. The proposed method also involves on-the-fly data augmentation, which enables the model to be trained with less training data.

Other applications of deep learning in medical image processing include image registration, which involves a coordinate transformation from a reference image space to a target image space. Cheng et al. [177] used a multi-modal stacked denoising autoencoder to compute an effective similarity measure among images using normalized mutual information and local cross correlation. Miao et al. [178], on the other hand, developed CNN regressors to directly estimate the registration transformation parameters. In addition, image generation and enhancement techniques have been discussed in [179], [180].
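
To give a sense of what a cross-correlation similarity measure computes, the sketch below evaluates the zero-mean normalized cross-correlation of two equally sized patches; applying it patch by patch yields a local measure. This is a generic textbook formulation and not the learned similarity of [177]; the function name and the patch values are hypothetical.

```python
def normalized_cross_correlation(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equally sized patches,
    given as flat lists of intensities. The result lies in [-1, 1]."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    da = [a - mean_a for a in patch_a]
    db = [b - mean_b for b in patch_b]
    numerator = sum(x * y for x, y in zip(da, db))
    denominator = (sum(x * x for x in da) * sum(y * y for y in db)) ** 0.5
    return numerator / denominator if denominator else 0.0

reference = [10, 20, 30, 40]               # hypothetical patch intensities
brighter = [2 * v + 5 for v in reference]  # same structure, different contrast
inverted = [50 - v for v in reference]     # reversed intensity ordering

ncc_same = normalized_cross_correlation(reference, brighter)
ncc_inverted = normalized_cross_correlation(reference, inverted)
print(ncc_same, ncc_inverted)
```

The measure is insensitive to linear changes in brightness and contrast, which is one reason it is used when comparing images acquired under different conditions.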

So far, applications of deep learning in medical image processing have produced satisfactory results in most cases. However, in a sensitive field like medical image processing, prior knowledge should be incorporated in image detection, recognition and reconstruction so that data-driven approaches do not produce implausible results [219].

V-E Power Systems

Artificial Neural Networks (ANN) have rapidly gained popularity among power system researchers [181]. Since their introduction to the power systems area in 1988 [182], numerous applications of ANN to problems of electric power systems have been proposed. However, recent developments in Deep Learning (DL) methods have resulted in powerful tools that can handle large data-sets and often outperform traditional machine learning methods in problems related to the power sector [183]. For this reason, deep architectures are currently receiving the attention of researchers in power industry applications. Here, we focus on describing some approaches of deep ANN architectures applied to three basic problems of the power industry, i.e. load forecasting and prediction of the power output of wind and solar energy systems.

Load forecasting is one of the most important tasks for efficient power system operation. It allows the system operator to schedule spinning reserve allocation, decide on possible interchanges with other utilities and assess the system’s security [184]. A small decrease in load forecasting error may result in a significant reduction of the total operation cost of the power system [185]. Among the Artificial Intelligence techniques applied to load forecasting, methods based on ANN have undoubtedly received the largest share of attention [186]. A basic reason for their popularity lies in the fact that ANN techniques are well-suited for energy forecasting [187]; they may obtain adequate estimates in cases where data is incomplete [188] and can consistently deal with complex non-linear problems [189]. The work of Park et al. [190] was one of the first approaches to applying ANN in load forecasting. The efficiency of the proposed Multi-layer Perceptron (MLP) was demonstrated by benchmarking it against a numerical forecasting method frequently used by utilities. As an evolution of ANN forecasting techniques, DL methods are expected to increase prediction accuracy by allowing higher levels of abstraction [191]. Thus, DL methods are gradually gaining popularity due to their ability to capture data behaviour when considering complex non-linear patterns and large amounts of data. In [192], an end-to-end model based on deep residual neural networks is proposed for hourly load forecasting of a single day. Only raw data of past load and temperature were used as inputs of the model. Initially, the inputs of the model are processed by several fully connected layers to produce preliminary forecasts. These forecasts are then passed through a deep neural network structure constructed from residual blocks. The efficiency of the proposed model was demonstrated on data-sets from the North-American Utility and ISO-NE. In [193], a Long Short Term Memory (LSTM)-based neural network has been proposed for short and medium term load forecasting. In order to optimize the effectiveness of the proposed approach, a Genetic Algorithm is used to find the optimal values for the time lags and the number of layers of the LSTM model. The performance of the proposed structure was verified using electricity consumption data of metropolitan France. Mocanu et al. [191] utilized two deep learning approaches based on Restricted Boltzmann Machines (RBM), i.e. conditional RBM and factored conditional RBM, for single-meter residential load forecasting. The method was benchmarked against several shallow ANN architectures and a Support Vector Machine approach, demonstrating increased efficiency compared to the competing methods. Dedinec et al. [194] employed a Deep Belief Network (DBN) for short term load forecasting of the Former Yugoslavian Republic of Macedonia. The proposed network comprised several stacks of RBM, which were pre-trained layer-wise. Rahman et al. [195] proposed two models based on the architecture of Recurrent Neural Networks (RNN), aiming to predict the medium and long term electricity consumption in residential and commercial buildings with one-hour resolution. The approach utilized an MLP in combination with an LSTM based model using an encoder-decoder architecture. A model based on the LSTM-RNN framework with appliance consumption sequences for short term residential load forecasting has been proposed in [196]. The researchers have shown that their method outperforms other state-of-the-art methods for load forecasting. In [197] a Convolutional Neural Network (CNN) with k-means clustering has been proposed. K-means is used to partition the large amount of data into clusters, which are then used to train the networks. The method has shown improved performance compared to the case where k-means is not used.
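
The partitioning step can be sketched with a plain k-means implementation. The example below only illustrates how data would be split into clusters before a separate network is trained on each; it is not the pipeline of [197], and the features (mean and peak daily load), function names and parameters are assumptions.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Group points by their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-center assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(iters):
        clusters = assign(points, centers)
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, assign(points, centers)

# Hypothetical (mean load, peak load) features for six days: three low, three high.
days = [(1.0, 1.5), (1.2, 1.6), (0.9, 1.4), (5.0, 6.5), (5.2, 6.8), (4.9, 6.4)]
centers, clusters = kmeans(days, k=2)
for center, members in zip(centers, clusters):
    print(center, "->", members)
```

Each resulting cluster would then serve as the training set for its own forecasting network.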

The utilization of DL techniques for modelling and forecasting in systems of renewable energy is progressively increasing. Since the data in such systems are inherently noisy, they may be adequately handled with ANN architectures [198]. Moreover, because renewable energy data is complicated in nature, shallow learning models may be insufficient to identify and learn the corresponding deep non-linear and non-stationary features and traits [199]. Among the various renewable energy sources, wind and solar energy have gained more popularity due to their potential and high availability [200]. As a result, in recent years the research endeavours have been focused on developing DL techniques for the problems related to the deployment of the aforementioned renewable energy sources.

Photovoltaic (PV) energy has received much attention due to its many advantages; it is abundant, inexhaustible and clean [201]. However, due to the chaotic and erratic nature of weather systems, the power output of PV energy systems is intermittent, volatile and random [202]. These uncertainties may degrade real-time control performance, reduce system economics, and thus pose a great challenge for the management and operation of electric power and energy systems [203]. For these reasons, the accuracy of PV power output forecasting plays a major role in ensuring optimum planning and modelling of PV plants. In [199] a deep neural network architecture is proposed for deterministic and probabilistic PV power forecasting. The deep architecture for deterministic forecasting comprises a Wavelet Transform and a deep CNN. The probabilistic PV power forecasting model combines the deterministic model with a spline Quantile Regression (QR) technique. The method has been evaluated on historical PV power data-sets obtained from two PV farms in Belgium, exhibiting high forecasting stability and robustness. In Gensler et al. [204], several deep network architectures, i.e. MLP, LSTM networks, DBN and Autoencoders, have been examined with respect to their forecasting accuracy of the PV power output. The performance of the methods is validated on actual data from PV facilities in Germany. The architecture that exhibited the best performance is the Auto-LSTM network, which combines the feature extraction ability of the Autoencoder with the forecasting ability of the LSTM. In [205] an LSTM-RNN is proposed for forecasting the output power of solar PV systems. In particular, the authors examine five different LSTM network architectures in order to find the one with the highest forecasting accuracy on the examined data-sets, which were retrieved from two cities in Egypt. The network that provided the highest accuracy is the LSTM with memory between batches.

With the advantages of non-pollution, low costs and remarkable benefits of scale, wind power is considered one of the most important sources of energy [206]. ANN have been widely employed for processing large amounts of data obtained from data acquisition systems of wind turbines [207]. In recent years, many approaches based on DL architectures have been proposed for the prediction of the power output of wind power systems. In [208], a deep neural network architecture combining CNN and LSTM networks is proposed for deterministic wind power forecasting. The results of the model are further analyzed and evaluated based on the wind power forecasting error in order to perform probabilistic forecasting. The method has been validated on data obtained from a wind farm in China; it performed better than statistical methods, i.e. ARIMA and the persistence method, as well as artificial intelligence based techniques in deterministic and probabilistic wind power forecasting. Wang et al. [209] proposed a wind power forecasting method based on Wavelet Transform, CNN and an ensemble technique. Their method was compared with the persistence method and two shallow models, i.e. a Back-Propagation ANN (BPANN) and a Support Vector Machine, on data sets collected from wind farms in China. The results validate that their method outperforms the benchmark approaches in terms of reliability, sharpness and overall skill. In [210] a DBN model in conjunction with the k-means clustering algorithm is proposed for wind power forecasting. The proposed approach demonstrated significantly increased forecasting accuracy compared to a BPANN and a Morlet wavelet neural network on data-sets obtained from a wind farm in Spain. A data-driven multi-model wind forecasting methodology with deep feature selection is proposed in [211]. In particular, a two layer ensemble technique is developed; the first layer comprises multiple machine learning models, which generate individual forecasts. In the second layer a blending algorithm is used to merge the forecasts derived during the first stage. Numerical results validate the efficiency of the proposed methodology compared to models employing a single algorithm. Finally, in [212] an approach is proposed for wind power forecasting which combines deep Autoencoders, DBN and the concept of transfer learning. The method is tested on data-sets containing power measurements and meteorological forecasts related to components of wind, obtained from wind farms in Europe. Moreover, it is compared to commonly used baseline regression models, i.e. ARIMA and a Support Vector Regressor, and achieves better results in terms of MAE, RMSE and SDE than the benchmark algorithms.
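
The persistence method used as a baseline in several of these studies simply predicts that the next value equals the most recent observation. A minimal sketch of that baseline together with the MAE and RMSE error measures follows; the power readings and function names are made up for illustration.

```python
def persistence_forecast(series, horizon=1):
    """Forecast each value as the value observed `horizon` steps earlier.
    Returns (actual, predicted) aligned lists."""
    return series[horizon:], series[:-horizon]

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

power = [3.0, 4.0, 6.0, 5.0, 5.0]  # hypothetical hourly wind power readings
actual, predicted = persistence_forecast(power)
baseline_mae = mae(actual, predicted)
baseline_rmse = rmse(actual, predicted)
print(f"MAE={baseline_mae:.3f} RMSE={baseline_rmse:.3f}")
```

A proposed forecasting model is generally expected to score lower than this baseline on the same test data.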

VI Discussions

In this paper we presented several Deep Learning architectures, from the foundational architectures up to recent developments, covering their modifications and evolution over time as well as applications to specific domains. We provided a category-wise list of publicly available data repositories for Deep Learning practitioners in Table LABEL:tab:availabledata. We discussed the blend of swarm intelligence with Deep Learning approaches and how each enriches the other when applied to real-world problems. The rapidly growing use of deep learning architectures, especially in safety-critical systems, raises the question of how reliable these architectures are in providing decisions even in the presence of adversarial scenarios. To address this, we started by giving an overview of testing neural network architectures, various methods for adversarial test generation, and countermeasures to be adopted against adversarial examples. Next we moved on to specific applications of deep learning, including Medical Imaging, Prognostics and Health Management, Applications in Financial Services, Financial Time Series Forecasting and, lastly, applications in Power Systems. For each application, current research trends as well as future research directions are discussed. Table IV lists a collection of recent reviews in different fields of Deep Learning including computer vision, forecasting, image processing, adversarial cases, autonomous vehicles, natural language processing, recommender systems and big data analytics.

VII Conclusions and Future Work

In conclusion, we highlight a few open areas of research and elaborate on some of the existing lines of thought and studies that address the challenges within them.

  • Challenges with scarcity of data: With the growing availability of data as well as powerful, distributed processing units, Deep Learning architectures can be successfully applied to major industrial problems. However, deep learning is traditionally big-data driven and cannot efficiently learn abstractions from clear verbal definitions [220] unless trained with billions of training samples. Also, the heavy reliance on Convolutional Neural Networks (CNNs), especially for video recognition, could face exponential inefficiency leading to their demise [221]; this may be avoided by capsules [222], which capture critical spatial hierarchical relationships more efficiently than CNNs and with lower data requirements. To make DL work with smaller data sets, approaches in use include data augmentation, transfer learning, recursive classification techniques and synthetic data generation. One-shot learning [223] is also opening new avenues for learning from very few training examples and has already shown progress in language processing and image classification tasks. Developing more general techniques that let DL models learn from sparse or limited data is a current research thrust.

  • Adopting unsupervised approaches: A major thrust is towards combining deep learning with unsupervised learning methods. Systems that set their own goals [220] and develop problem-solving approaches as they explore the environment are a future research direction, moving beyond supervised approaches that require large amounts of data a priori. Accordingly, AI research, including Deep Learning, is moving towards meta-learning, i.e., learning to learn, which involves automated model design and decision-making capabilities of the algorithms. Meta-learning optimizes the ability to learn various tasks from less training data [224].

  • Influence of cognitive neuroscience: Inspiration drawn from cognitive neuroscience and developmental psychology to decipher human behavioral patterns can bring major breakthroughs in applications such as enabling artificial agents to learn spatial navigation on their own, which comes naturally to most living beings [225].

  • Neural networks and reinforcement learning: Meta-modeling approaches using Reinforcement Learning (RL) are being used to design problem-specific neural network architectures. In [226] the authors introduced MetaQNN, an RL-based meta-modeling algorithm that automatically generates CNN architectures for image classification using Q-learning [227] with ε-greedy exploration. AlphaGo, the computer program that combines reinforcement learning and CNNs to play the game ‘Go’, achieved great success by beating human professional ‘Go’ players. Deep convolutional neural networks can also serve as function approximators that predict ‘Q’ values in a reinforcement learning problem. So, a major thrust of current research is on the superposition

    Topic | Review Papers
    Computer Vision | Deep learning for visual understanding: A review [228]; Deep Learning for Computer Vision: A Brief Review [229]; A Survey on Deep Learning Methods for Robot Vision [230]; Deep Learning Advances in Computer Vision with 3D Data: A Survey [231]; Visualizations of Deep Neural Networks in Computer Vision: A Survey [232]
    Forecasting | Machine Learning in Financial Crisis Prediction: A Survey [233]; Deep Learning for Time-Series Analysis [234]; A Survey on Machine Learning and Statistical Techniques in Bankruptcy Prediction [235]; Time series forecasting using artificial neural networks methodologies: A systematic review [236]; A Review of Deep Learning Methods Applied on Load Forecasting [237]; Trends in Machine Learning Applied to Demand & Sales Forecasting: A Review [238]; A survey on retail sales forecasting and prediction in fashion markets [239]; Electric load forecasting: Literature survey and classification of methods [240]; A review of unsupervised feature learning and deep learning for time-series modeling [241]
    Image Processing | A Survey on Deep Learning in Medical Image Analysis [213]; A Comprehensive Survey of Deep Learning for Image Captioning [242]; Biological image analysis using deep learning-based methods: Literature review [243]; Deep learning for remote sensing image classification: A survey [244]; Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review [245]; Deep Learning for Medical Image Processing: Overview, Challenges and the Future [246]; An overview of deep learning in medical imaging focusing on MRI [247]; Deep Learning in Medical Ultrasound Analysis: A Review [248]
    Adversarial Cases | Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey [249]; Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacks [250]; Adversarial Attacks and Defenses Against Deep Neural Networks: A Survey [251]; Adversarial Machine Learning: A Literature Review [252]; Review of Artificial Intelligence Adversarial Attack and Defense Technologies [253]; A Survey of Adversarial Machine Learning in Cyber Warfare [254]
    Autonomous Vehicles | Planning and Decision-Making for Autonomous Vehicles [255]; A Review of Deep Learning Methods and Applications for Unmanned Aerial Vehicles [256]; MIT Autonomous Vehicle Technology Study: Large-Scale Deep Learning Based Analysis of Driver Behavior and Interaction with Automation [257]; Perception, Planning, Control, and Coordination for Autonomous Vehicles [258]; Survey of neural networks in autonomous driving [259]; Self-Driving Cars: A Survey [260]
    Natural Language Processing | Recent Trends in Deep Learning Based Natural Language Processing [261]; A Survey of the Usages of Deep Learning in Natural Language Processing [262]; A survey on the state-of-the-art machine learning models in the context of NLP [263]; Inflectional Review of Deep Learning on Natural Language Processing [264]; Deep learning for natural language processing: advantages and challenges [265]; Deep Learning for Natural Language Processing [266]
    Recommender Systems | Deep Learning based Recommender System: A Survey and New Perspectives [267]; A Survey of Recommender Systems Based on Deep Learning [268]; A review on deep learning for recommender systems: challenges and remedies [269]; Deep Learning Methods on Recommender System: A Survey of State-of-the-art [270]; Deep Learning-Based Recommendation: Current Issues and Challenges [271]; A Survey and Critique of Deep Learning on Recommender Systems [272]
    Big Data Analytics | Efficient Machine Learning for Big Data: A Review [273]; A survey on deep learning for big data [274]; A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective [275]; A survey of machine learning for big data processing [276]; Deep learning in big data Analytics: A comparative study [277]; Deep learning applications and challenges in big data analytics [278]
    TABLE IV: A Collection of Recent Reviews on Deep Learning
    Category | Dataset | Link
    Image Datasets | MNIST | http://yann.lecun.com/exdb/mnist/
    Image Datasets | CIFAR-100 | http://www.cs.utoronto.ca/~kriz/cifar.html
    Image Datasets | Caltech 101 | http://www.vision.caltech.edu/Image_Datasets/Caltech101/
    Image Datasets | Caltech 256 | http://www.vision.caltech.edu/Image_Datasets/Caltech256/
    Image Datasets | Imagenet | http://www.image-net.org/
    Image Datasets | COIL100 | http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
    Image Datasets | STL-10 | http://www.stanford.edu/~acoates//stl10/
    Image Datasets | Google Open images | https://ai.googleblog.com/2016/09/introducing-open-images-dataset.html
    Image Datasets | Labelme | http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php
    Speech Datasets | Google Audioset | https://research.google.com/audioset/dataset/index.html
    Speech Datasets | TIMIT | http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1
    Speech Datasets | VoxForge | http://www.voxforge.org/
    Speech Datasets | CHIME | http://spandh.dcs.shef.ac.uk/chime_challenge/data.html
    Speech Datasets | 2000 HUB5 English | https://catalog.ldc.upenn.edu/LDC2002T43
    Speech Datasets | LibriSpeech | http://www.openslr.org/12/
    Speech Datasets | VoxCeleb | http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
    Speech Datasets | Open SLR | https://www.openslr.org/51
    Speech Datasets | CALLHOME American English Speech | https://catalog.ldc.upenn.edu/LDC97S42
    Text Datasets | English Broadcast News | https://catalog.ldc.upenn.edu/LDC97S44
    Text Datasets | SQuAD | https://rajpurkar.github.io/SQuAD-explorer/
    Text Datasets | Billion Word Dataset | http://www.statmt.org/lm-benchmark/
    Text Datasets | 20 Newsgroups | http://qwone.com/~jason/20Newsgroups/
    Text Datasets | Google Books Ngrams | https://aws.amazon.com/datasets/google-books-ngrams/
    Text Datasets | UCI Spambase | https://archive.ics.uci.edu/ml/datasets/Spambase
    Text Datasets | Common Crawl | http://commoncrawl.org/the-data/
    Text Datasets | Yelp Open Dataset | https://www.yelp.com/dataset
    Natural Language Datasets | Web 1T 5-gram | https://catalog.ldc.upenn.edu/LDC2006T13
    Natural Language Datasets | Blizzard Challenge 2018 | https://www.synsig.org/index.php/Blizzard_Challenge_2018
    Natural Language Datasets | Flickr personal taxonomies | https://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
    Natural Language Datasets | Multi-Domain Sentiment Dataset | http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
    Natural Language Datasets | Enron Email Dataset | https://www.cs.cmu.edu/~./enron/
    Natural Language Datasets | Blogger Corpus | http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
    Natural Language Datasets | Wikipedia Links Data | https://code.google.com/archive/p/wiki-links/downloads
    Natural Language Datasets | Gutenberg eBooks List | http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
    Natural Language Datasets | SMS Spam Collection | http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
    Natural Language Datasets | UCI’s Spambase data | https://archive.ics.uci.edu/ml/datasets/Spambase
    Geospatial Datasets | OpenStreetMap | https://www.openstreetmap.org
    Geospatial Datasets | Landsat8 | https://landsat.gsfc.nasa.gov/landsat-8/
    Geospatial Datasets | NEXRAD | https://www.ncdc.noaa.gov/data-access/radar-data/nexrad
    Geospatial Datasets | ESRI Open data | https://hub.arcgis.com/pages/open-data
    Geospatial Datasets | USGS EarthExplorer | https://earthexplorer.usgs.gov/
    Geospatial Datasets | OpenTopography | https://opentopography.org/
    Geospatial Datasets | NASA SEDAC | https://sedac.ciesin.columbia.edu/
    Geospatial Datasets | NASA Earth Observations | https://neo.sci.gsfc.nasa.gov/
    Geospatial Datasets | Terra Populus | https://terra.ipums.org/
    Recommender Systems Datasets | Movielens | https://grouplens.org/datasets/movielens/
    Recommender Systems Datasets | Million Song Dataset | https://www.kaggle.com/c/msdchallenge
    Recommender Systems Datasets | Last.fm | https://grouplens.org/datasets/hetrec-2011/
    Recommender Systems Datasets | Book-crossing Dataset | http://www2.informatik.uni-freiburg.de/~cziegler/BX/
    Recommender Systems Datasets | Jester | https://goldberg.berkeley.edu/jester-data/
    Recommender Systems Datasets | Netflix Prize | https://www.netflixprize.com/
    Recommender Systems Datasets | Pinterest Fashion Compatibility | http://cseweb.ucsd.edu/~jmcauley/datasets.html#pinterest
    Recommender Systems Datasets | Amazon Question and Answer Data | http://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_qa
    Recommender Systems Datasets | Social Circles Data | http://cseweb.ucsd.edu/~jmcauley/datasets.html#socialcircles
    Economics and Finance Datasets | Quandl | https://www.quandl.com/
    Economics and Finance Datasets | World Bank Open Data | https://data.worldbank.org/
    Economics and Finance Datasets | IMF Data | https://www.imf.org/en/Data
    Economics and Finance Datasets | Financial Times Market Data | https://markets.ft.com/data/
    Economics and Finance Datasets | Google Trends | https://trends.google.com/trends/?q=google&ctab=0&geo=all&date=all&sort=0
    Economics and Finance Datasets | American Economic Association | https://www.aeaweb.org/resources/data/us-macro-regional
    Economics and Finance Datasets | US stock Data | https://github.com/eliangcs/pystock-data
    Economics and Finance Datasets | World Factbook | https://www.cia.gov/library/publications/download/
    Economics and Finance Datasets | Dow Jones Index Data Set | http://archive.ics.uci.edu/ml/datasets/Dow+Jones+Index
    Autonomous Vehicles Datasets | BDD100k | https://bdd-data.berkeley.edu/
    Autonomous Vehicles Datasets | Baidu Apolloscapes | http://apolloscape.auto/
    Autonomous Vehicles Datasets | Comma.ai | https://archive.org/details/comma-dataset
    Autonomous Vehicles Datasets | Oxford’s Robotic Car | https://robotcar-dataset.robots.ox.ac.uk/
    Autonomous Vehicles Datasets | Cityscape Dataset | https://www.cityscapes-dataset.com/
    Autonomous Vehicles Datasets | CSSAD Dataset | http://aplicaciones.cimat.mx/Personal/jbhayet/ccsad-dataset
    Autonomous Vehicles Datasets | KUL Belgium Traffic Sign Dataset | http://www.vision.ee.ethz.ch/~timofter/traffic_signs/
    Autonomous Vehicles Datasets | LISA | http://cvrr.ucsd.edu/LISA/datasets.html
    Autonomous Vehicles Datasets | Bosch Small Traffic Light | https://hci.iwr.uni-heidelberg.de/node/6132
    Autonomous Vehicles Datasets | LaRa Traffic Light Recognition | http://www.lara.prd.fr/benchmarks/trafficlightsrecognition
    Autonomous Vehicles Datasets | WPI Datasets | http://computing.wpi.edu/dataset.html
    TABLE V: A Collection of Data Repositories for Deep Learning Practitioners

    of neural networks and reinforcement learning geared towards problem specific requirements.
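As a rough illustration of the Q-learning idea mentioned in the last item above, the sketch below runs tabular Q-learning with epsilon-greedy exploration on a tiny chain environment. It is not MetaQNN or AlphaGo: the environment, the reward, the hyperparameters and the function names are all assumptions made for this example, and a lookup table stands in for the neural network that would approximate the ‘Q’ values in a deep RL setting.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain: start in state 0, goal is the last state.

    Actions: 0 = move left (stays put at state 0), 1 = move right.
    Reward: 1.0 on entering the goal state, 0.0 otherwise.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            if rng.random() < epsilon:
                # Explore: pick a random action.
                a = rng.randrange(2)
            else:
                # Exploit: pick a best-valued action, breaking ties at random.
                best = max(q[s])
                a = rng.choice([i for i in range(2) if q[s][i] == best])
            s_next = max(0, s - 1) if a == 0 else s + 1
            reward = 1.0 if s_next == goal else 0.0
            # Q-learning update: move Q(s, a) towards the bootstrapped target.
            target = reward + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


def greedy_policy(q):
    """Return the highest-valued action for each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


if __name__ == "__main__":
    q_table = q_learning()
    for state, row in enumerate(q_table):
        print(state, [round(v, 3) for v in row])
    print("greedy policy:", greedy_policy(q_table))
```

In a deep RL variant, the table `q` would be replaced by a network that maps a state (for example an image) to one estimated value per action, trained towards the same bootstrapped target.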

This review has aimed to help beginners as well as practitioners in the field make informed choices, and has offered an in-depth analysis of some recent deep learning architectures as well as an exploratory dissection of some pertinent application areas. It is the authors’ hope that readers find the material engaging and informative; the authors openly encourage feedback to bring the organization and content of this article more in line with a formal extension of the literature within the deep learning community.

References

  • [1] M. van Gerven and S. Bohte, “Editorial: Artificial neural networks as models of neural information processing,” Frontiers in Computational Neuroscience, vol. 11, p. 114, 2017. [Online]. Available: https://www.frontiersin.org/article/10.3389/fncom.2017.00114
  • [2] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The bulletin of mathematical biophysics, vol. 5, no. 4, pp. 115–133, Dec 1943. [Online]. Available: https://doi.org/10.1007/BF02478259
  • [3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
  • [4] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: a convolutional neural-network approach,” IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, Jan 1997.
  • [5] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3431–3440.
  • [6] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 677–691, Apr. 2017. [Online]. Available: https://doi.org/10.1109/TPAMI.2016.2599174
  • [7] X. Wu, R. He, and Z. Sun, “A lightened CNN for deep face representation,” CoRR, vol. abs/1511.02683, 2015. [Online]. Available: http://arxiv.org/abs/1511.02683
  • [8] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, and L. V. Gool, “Weakly supervised cascaded convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 5131–5139.
  • [9] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, K. Wang, J. Yan, C. Loy, and X. Tang, “Deepid-net: Object detection with deformable part based convolutional neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1320–1334, July 2017.
  • [10] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec 1989. [Online]. Available: https://doi.org/10.1007/BF02551274
  • [11] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251 – 257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
  • [12] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, “The expressive power of neural networks: A view from the width,” CoRR, vol. abs/1709.02540, 2017. [Online]. Available: http://arxiv.org/abs/1709.02540
  • [13] B. Hanin, “Universal function approximation by deep neural nets with bounded width and relu activations,” 08 2017. [Online]. Available: https://arxiv.org/abs/1708.02691
  • [14] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85 – 117, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608014002135
  • [15] G. Marcus, “Deep Learning: A Critical Appraisal,” arXiv e-prints, p. arXiv:1801.00631, Jan. 2018.
  • [16] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in 2016 IEEE European Symposium on Security and Privacy (EuroS&P), March 2016, pp. 372–387.
  • [17] E. Abbe and C. Sandon, “Provable limitations of deep learning,” arXiv e-prints, p. arXiv:1812.06369, Dec. 2018.
  • [18] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
  • [19] R. Winter and B. Widrow, “Madaline rule ii: a training algorithm for neural networks,” in IEEE 1988 International Conference on Neural Networks, July 1988, pp. 401–408 vol.1.
  • [20] B. Widrow and M. A. Lehr, “30 years of adaptive neural networks: perceptron, madaline, and backpropagation,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1415–1442, Sep. 1990.
  • [21] M. Minsky and S. Papert, “Perceptrons - an introduction to computational geometry,” 1969.
  • [22] P. J. Werbos, The roots of backpropagation: from ordered derivatives to neural networks and political forecasting.   John Wiley & Sons, 1994, vol. 1.
  • [23] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982. [Online]. Available: https://www.pnas.org/content/79/8/2554
  • [24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1,” D. E. Rumelhart, J. L. McClelland, and C. PDP Research Group, Eds.   Cambridge, MA, USA: MIT Press, 1986, ch. Learning Internal Representations by Error Propagation, pp. 318–362. [Online]. Available: http://dl.acm.org/citation.cfm?id=104279.104293
  • [25] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, Sep 1995. [Online]. Available: https://doi.org/10.1023/A:1022627411411
  • [26] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
  • [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [28] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.
  • [30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
  • [31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM ’14.   New York, NY, USA: ACM, 2014, pp. 675–678. [Online]. Available: http://doi.acm.org/10.1145/2647868.2654889
  • [32] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source framework for deep learning,” in Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015. [Online]. Available: http://learningsys.org/papers/LearningSys_2015_paper_33.pdf
  • [33] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
  • [34] J. Dai, Y. Wang, X. Qiu, D. Ding, Y. Zhang, Y. Wang, X. Jia, C. Zhang, Y. Wan, Z. Li, J. Wang, S. Huang, Z. Wu, Y. Wang, Y. Yang, B. She, D. Shi, Q. Lu, K. Huang, and G. Song, “Bigdl: A distributed deep learning framework for big data,” CoRR, vol. abs/1804.05839, 2018.
  • [35] Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688
  • [36] F. Seide and A. Agarwal, “Cntk: Microsoft’s open-source deep-learning toolkit,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16.   New York, NY, USA: ACM, 2016, pp. 2135–2135. [Online]. Available: http://doi.acm.org/10.1145/2939672.2945397
  • [37] S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, “Recurrent neural network based language modeling in meeting recognition,” in Twelfth annual conference of the international speech communication association, 2011.
  • [38] L. Deng, D. Yu, et al., “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
  • [39] Y. Bengio et al., “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
  • [40] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
  • [41] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Advances in neural information processing systems, 2007, pp. 153–160.
  • [42] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, “Advances in optimizing recurrent networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2013, pp. 8624–8628.
  • [43] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for lvcsr using rectified linear units and dropout,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.   IEEE, 2013, pp. 8609–8613.
  • [44] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [45] D. Sussillo and L. Abbott, “Random walk initialization for training very deep feedforward networks,” arXiv preprint arXiv:1412.6558, 2014.
  • [46] D. Mishkin and J. Matas, “All you need is a good init,” arXiv preprint arXiv:1511.06422, 2015.
  • [47] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.
  • [48] S. K. Kumar, “On weight initialization in deep neural networks,” arXiv preprint arXiv:1704.08863, 2017.
  • [49] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, vol. 30, no. 1, 2013, p. 3.
  • [50] A. Fischer and C. Igel, “An introduction to restricted boltzmann machines,” in Iberoamerican Congress on Pattern Recognition.   Springer, 2012, pp. 14–36.
  • [51] P. Smolensky, “Information processing in dynamical systems: Foundations of harmony theory,” COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, Tech. Rep., 1986.
  • [52] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002.
  • [53] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 215–223.
  • [54] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [55] H. Larochelle and Y. Bengio, “Classification using discriminative restricted boltzmann machines,” in Proceedings of the 25th international conference on Machine learning.   ACM, 2008, pp. 536–543.
  • [56] R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted boltzmann machines for collaborative filtering,” in Proceedings of the 24th international conference on Machine learning.   ACM, 2007, pp. 791–798.
  • [57] I. Sutskever and G. Hinton, “Learning multilevel distributed representations for high-dimensional sequences,” in Artificial Intelligence and Statistics, 2007, pp. 548–555.
  • [58] G. W. Taylor, G. E. Hinton, and S. T. Roweis, “Modeling human motion using binary latent variables,” in Advances in neural information processing systems, 2007, pp. 1345–1352.
  • [59] R. Memisevic and G. Hinton, “Unsupervised learning of image transformations,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on.   IEEE, 2007, pp. 1–8.
  • [60] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th annual international conference on machine learning.   ACM, 2009, pp. 609–616.
  • [61] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Advances in neural information processing systems, 2009, pp. 1096–1104.
  • [62] G. Dahl, A.-r. Mohamed, G. E. Hinton, et al., “Phone recognition with the mean-covariance restricted boltzmann machine,” in Advances in neural information processing systems, 2010, pp. 469–477.
  • [63] G. E. Hinton et al., “Modeling pixel means and covariances using factorized third-order boltzmann machines,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.   IEEE, 2010, pp. 2551–2558.
  • [64] A.-r. Mohamed, G. Hinton, and G. Penn, “Understanding how deep belief networks perform acoustic modelling,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on.   IEEE, 2012, pp. 4273–4276.
  • [65] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent temporal restricted boltzmann machine,” in Advances in neural information processing systems, 2009, pp. 1601–1608.
  • [66] G. W. Taylor and G. E. Hinton, “Factored conditional restricted boltzmann machines for modeling motion style,” in Proceedings of the 26th annual international conference on machine learning.   ACM, 2009, pp. 1025–1032.
  • [67] G. E. Hinton, “A practical guide to training restricted boltzmann machines,” in Neural networks: Tricks of the trade.   Springer, 2012, pp. 599–619.
  • [68] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press Cambridge, 2016, vol. 1.
  • [69] N. Le Roux and Y. Bengio, “Representational power of restricted boltzmann machines and deep belief networks,” Neural computation, vol. 20, no. 6, pp. 1631–1649, 2008.
  • [70] G. E. Hinton et al., “What kind of graphical model is the brain?” in IJCAI, vol. 5, 2005, pp. 1765–1775.
  • [71] C. Poultney, S. Chopra, Y. LeCun, et al., “Efficient learning of sparse representations with an energy-based model,” in Advances in neural information processing systems, 2007, pp. 1137–1144.
  • [72] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian detection with unsupervised multi-stage feature learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3626–3633.
  • [73] A.-r. Mohamed, G. E. Dahl, G. Hinton, et al., “Acoustic modeling using deep belief networks,” IEEE Trans. Audio, Speech & Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
  • [74] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010.
  • [75] S. M. Siniscalchi, J. Li, and C.-H. Lee, “Hermitian polynomial for speaker adaptation of connectionist speech recognition systems,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2152–2161, 2013.
  • [76] S. M. Siniscalchi, D. Yu, L. Deng, and C.-H. Lee, “Exploiting deep neural networks for detection-based speech recognition,” Neurocomputing, vol. 106, pp. 148–157, 2013.
  • [77] D. Yu, S. M. Siniscalchi, L. Deng, and C.-H. Lee, “Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on.   IEEE, 2012, pp. 4169–4172.
  • [78] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Unsupervised learning of hierarchical representations with convolutional deep belief networks,” Communications of the ACM, vol. 54, no. 10, pp. 95–103, 2011.
  • [79] J. Susskind, V. Mnih, G. Hinton, et al., “On deep generative models with applications to recognition,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.   IEEE, 2011, pp. 2857–2864.
  • [80] V. Stoyanov, A. Ropson, and J. Eisner, “Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 725–733.
  • [81] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
  • [82] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A.-r. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep auto-encoder,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • [83] L. Deng, “The mnist database of handwritten digit images for machine learning research [best of the web],” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
  • [84] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng, “Learning deep energy models,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1105–1112.
  • [85] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 689–696.
  • [86] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [87] G. Alain and Y. Bengio, “What regularized auto-encoders learn from the data-generating distribution,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.
  • [88] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  • [89] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski, “Deep generative stochastic networks trainable by backprop,” in International Conference on Machine Learning, 2014, pp. 226–234.
  • [90] Y. Bengio, “Deep learning of representations: Looking forward,” in International Conference on Statistical Language and Speech Processing.   Springer, 2013, pp. 1–37.
  • [91] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, vol. 23, no. 7, pp. 1661–1674, 2011.
  • [92] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of machine learning research, vol. 11, no. Dec, pp. 3371–3408, 2010.
  • [93] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
  • [94] C. Doersch, “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016.
  • [95] G. E. Hinton, “A better way to learn features: technical perspective,” Communications of the ACM, vol. 54, no. 10, pp. 94–94, 2011.
  • [96] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in International Conference on Artificial Neural Networks.   Springer, 2011, pp. 44–51.
  • [97] Q. V. Le, “Building high-level features using large scale unsupervised learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.   IEEE, 2013, pp. 8595–8598.
  • [98] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
  • [99] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12.   USA: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257
  • [100] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [Online]. Available: https://doi.org/10.1038/nature14539
  • [101] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016, http://www.deeplearningbook.org.
  • [102] A. Sherstinsky, “Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network,” CoRR, vol. abs/1808.03314, 2018.
  • [103] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, pp. 1735–1780, 1997.
  • [104] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3, July 2000, pp. 189–194.
  • [105] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.
  • [106] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in INTERSPEECH, 2014.
  • [107] P. Doetsch, M. Kozielski, and H. Ney, “Fast and robust training of recurrent neural networks for offline handwriting recognition,” 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 279–284, 2014.
  • [108] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. K. Ward, “Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 694–707, 2016.
  • [109] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
  • [110] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16.   USA: Curran Associates Inc., 2016, pp. 2234–2242. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157096.3157346
  • [111] R. Groß, Y. Gu, W. Li, and M. Gauci, “Generalizing gans: A turing perspective,” in NIPS, 2017.
  • [112] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in NIPS, 2016.
  • [113] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in NIPS, 2016.
  • [114] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in ICML, 2016.
  • [115] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [116] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” CoRR, vol. abs/1311.2901, 2013. [Online]. Available: http://arxiv.org/abs/1311.2901
  • [117] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
  • [118] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
  • [119] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
  • [120] L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay,” CoRR, vol. abs/1803.09820, 2018. [Online]. Available: http://arxiv.org/abs/1803.09820
  • [121] F. Assunção, N. Lourenço, P. Machado, and B. Ribeiro, “Denser: deep evolutionary network structured representation,” Genetic Programming and Evolvable Machines, pp. 1–31, 2018.
  • [122] B. A. Garro and R. A. Vázquez, “Designing artificial neural networks using particle swarm optimization algorithms,” in Comp. Int. and Neurosc., 2015.
  • [123] G. Das, P. K. Pattnaik, and S. K. Padhy, “Artificial neural network trained by particle swarm optimization for non-linear channel equalization,” Expert Systems with Applications, vol. 41, no. 7, pp. 3491–3496, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417413008701
  • [124] B. Wang, Y. Sun, B. Xue, and M. Zhang, “Evolving deep convolutional neural networks by variable-length particle swarm optimization for image classification,” 2018 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8, 2018.
  • [125] S. Sengupta, S. Basak, and R. A. Peters, “Particle swarm optimization: A survey of historical and recent developments with hybridization perspectives,” Machine Learning and Knowledge Extraction, vol. 1, no. 1, pp. 157–191, 2018. [Online]. Available: http://www.mdpi.com/2504-4990/1/1/10
  • [126] S. Sengupta, S. Basak, and R. A. Peters, “Qdds: A novel quantum swarm algorithm inspired by a double dirac delta potential,” in 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Nov 2018, pp. 704–711.
  • [127] S. Sengupta, S. Basak, and R. A. Peters, “Chaotic quantum double delta swarm algorithm using chebyshev maps: Theoretical foundations, performance analyses and convergence issues,” Journal of Sensor and Actuator Networks, vol. 8, no. 1, 2019. [Online]. Available: http://www.mdpi.com/2224-2708/8/1/9
  • [128] M. Hüttenrauch, A. Sosic, and G. Neumann, “Deep reinforcement learning for swarm systems,” CoRR, vol. abs/1807.06613, 2018.
  • [129] C. Anderson, A. V. Mayrhauser, and R. Mraz, “On the use of neural networks to guide software testing activities,” in Proceedings of 1995 IEEE International Test Conference (ITC), Oct 1995, pp. 720–729.
  • [130] T. M. Khoshgoftaar and R. M. Szabo, “Using neural networks to predict software faults during testing,” IEEE Transactions on Reliability, vol. 45, no. 3, pp. 456–462, Sep. 1996.
  • [131] M. Vanmali, M. Last, and A. Kandel, “Using a neural network in the software testing process,” Int. J. Intell. Syst., vol. 17, pp. 45–62, 2002.
  • [132] Y. Sun, X. Huang, and D. Kroening, “Testing deep neural networks,” CoRR, vol. abs/1803.04792, 2018.
  • [133] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Towards proving the adversarial robustness of deep neural networks,” in FVAV@iFM, 2017.
  • [134] X. Huang, M. Z. Kwiatkowska, S. Wang, and M. Wu, “Safety verification of deep neural networks,” in CAV, 2017.
  • [135] C. E. Tuncali, G. Fainekos, H. Ito, and J. Kapinski, “Simulation-based adversarial test generation for autonomous vehicles with machine learning components,” 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1555–1562, 2018.
  • [136] X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li, “Adversarial examples: Attacks and defenses for deep learning,” CoRR, vol. abs/1712.07107, 2017.
  • [137] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” CoRR, vol. abs/1412.6572, 2014.
  • [138] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: A simple and accurate method to fool deep neural networks,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582, 2016.
  • [139] B. D. Rouhani, M. Samragh, M. Javaheripi, T. Javidi, and F. Koushanfar, “Deepfense: Online accelerated defense against adversarial deep learning,” in Proceedings of the International Conference on Computer-Aided Design, ser. ICCAD ’18.   New York, NY, USA: ACM, 2018, pp. 134:1–134:8. [Online]. Available: http://doi.acm.org/10.1145/3240765.3240791
  • [140] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, “Adversarial attacks and defences: A survey,” CoRR, vol. abs/1810.00069, 2018.
  • [141] A. Pumsirirat and L. Yan, “Credit card fraud detection using deep learning based on auto-encoder and restricted boltzmann machine,” International Journal of Advanced Computer Science and Applications, vol. 9, no. 1, pp. 18–25, 2018.
  • [142] M. Schreyer, T. Sattarov, D. Borth, A. Dengel, and B. Reimer, “Detection of anomalies in large scale accounting data using deep autoencoder networks,” CoRR, vol. abs/1709.05254, 2017. [Online]. Available: http://arxiv.org/abs/1709.05254
  • [143] Y. Wang and W. Xu, “Leveraging deep learning with lda-based text analytics to detect automobile insurance fraud,” Decision Support Systems, vol. 105, pp. 87–95, 2018.
  • [144] Y.-J. Zheng, X.-H. Zhou, W.-G. Sheng, Y. Xue, and S.-Y. Chen, “Generative adversarial network based telecom fraud detection at the receiving bank,” Neural Networks, vol. 102, pp. 78–86, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608018300698
  • [145] M. Dong, L. Yao, X. Wang, B. Benatallah, C. Huang, and X. Ning, “Opinion fraud detection via neural autoencoder decision forest,” Pattern Recognition Letters, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865518303039
  • [146] J. A. Gómez, J. Arévalo, R. Paredes, and J. Nin, “End-to-end neural network architecture for fraud scoring in card payments,” Pattern Recognition Letters, vol. 105, pp. 175–181, 2018, Machine Learning and Applications in Artificial Intelligence. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S016786551730291X
  • [147] N. F. Ryman-Tubb, P. Krause, and W. Garn, “How artificial intelligence and machine learning research impacts payment card fraud detection: A survey and industry benchmark,” Engineering Applications of Artificial Intelligence, vol. 76, pp. 130–157, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0952197618301520
  • [148] U. Fiore, A. D. Santis, F. Perla, P. Zanetti, and F. Palmieri, “Using generative adversarial networks for improving classification effectiveness in credit card fraud detection,” Information Sciences, vol. 479, pp. 448–455, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025517311519
  • [149] R. C. Cavalcante, R. C. Brasileiro, V. L. Souza, J. P. Nobrega, and A. L. Oliveira, “Computational intelligence and financial markets: A survey and future directions,” Expert Systems with Applications, vol. 55, pp. 194–211, 2016.
  • [150] X. Li, Z. Deng, and J. Luo, “Trading strategy design in financial investment through a turning points prediction scheme,” Expert Systems with Applications, vol. 36, no. 4, pp. 7818–7826, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417408008622
  • [151] E. F. Fama, “Random walks in stock market prices,” Financial analysts journal, vol. 51, no. 1, pp. 75–80, 1995.
  • [152] C.-J. Lu, T.-S. Lee, and C.-C. Chiu, “Financial time series forecasting using independent component analysis and support vector regression,” Decision Support Systems, vol. 47, no. 2, pp. 115–125, 2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167923609000323
  • [153] M. Tkáč and R. Verner, “Artificial neural networks in business: Two decades of research,” Applied Soft Computing, vol. 38, pp. 788–804, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1568494615006122
  • [154] T. N. Pandey, A. K. Jagadev, S. Dehuri, and S.-B. Cho, “A novel committee machine and reviews of neural network and statistical models for currency exchange rate prediction: An experimental analysis,” Journal of King Saud University - Computer and Information Sciences, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1319157817303816
  • [155] A. Lasfer, H. El-Baz, and I. Zualkernan, “Neural network design parameters for forecasting financial time series,” in Modeling, Simulation and Applied Optimization (ICMSAO), 2013 5th International Conference on.   IEEE, 2013, pp. 1–4.
  • [156] M. U. Gudelek, S. A. Boluk, and A. M. Ozbayoglu, “A deep learning based stock trading model with 2-d cnn trend detection,” in Computational Intelligence (SSCI), 2017 IEEE Symposium Series on.   IEEE, 2017, pp. 1–8.
  • [157] T. Fischer and C. Krauss, “Deep learning with long short-term memory networks for financial market predictions,” European Journal of Operational Research, vol. 270, no. 2, pp. 654–669, 2018.
  • [158] L. dos Santos Pinheiro and M. Dras, “Stock market prediction with deep learning: A character-based neural language model for event-based trading,” in Proceedings of the Australasian Language Technology Association Workshop 2017, 2017, pp. 6–15.
  • [159] W. Bao, J. Yue, and Y. Rao, “A deep learning framework for financial time series using stacked autoencoders and long-short term memory,” PloS one, vol. 12, no. 7, p. e0180944, 2017.
  • [160] A. H. Mohammad, K. Rezaul, T. Ruppa, D. B. B. Neil, and W. Yang, “Hybrid deep learning model for stock price prediction,” in 2018 IEEE Symposium Series on Computational Intelligence (SSCI).   IEEE, 2018, pp. 1837–1844.
  • [161] A. le Calvez and D. Cliff, “Deep learning can replicate adaptive traders in a limit-order-book financial market,” 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1876–1883, 2018.
  • [162] S. Basak, S. Sengupta, and A. Dubey, “Mechanisms for Integrated Feature Normalization and Remaining Useful Life Estimation Using LSTMs Applied to Hard-Disks,” arXiv e-prints, p. arXiv:1810.08985, Oct 2018.
  • [163] P. Tamilselvan and P. Wang, “Failure diagnosis using deep belief learning based health state classification,” Reliability Engineering & System Safety, vol. 115, pp. 124–135, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0951832013000574
  • [164] T. Kuremoto, S. Kimura, K. Kobayashi, and M. Obayashi, “Time series forecasting using a deep belief network with restricted boltzmann machines,” Neurocomputing, vol. 137, pp. 47–56, 2014, Advanced Intelligent Computing Theories and Methodologies. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231213007388
  • [165] J. Qiu, W. Liang, L. Zhang, X. Yu, and M. Zhang, “The early-warning model of equipment chain in gas pipeline based on dnn-hmm,” Journal of Natural Gas Science and Engineering, vol. 27, pp. 1710–1722, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1875510015302407
  • [166] N. Gugulothu, T. Vishnu, P. Malhotra, L. Vig, P. Agarwal, and G. Shroff, “Predicting remaining useful life using time series embeddings based on recurrent neural networks,” CoRR, vol. abs/1709.01073, 2017.
  • [167] P. Filonov, A. Lavrentyev, and A. Vorontsov, “Multivariate industrial time series with cyber-attack simulation: Fault detection using an lstm-based predictive data model,” CoRR, vol. abs/1612.06676, 2016.
  • [168] M. M. Botezatu, I. Giurgiu, J. Bogojeska, and D. Wiesmann, “Predicting disk replacement towards reliable data centers,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16.   New York, NY, USA: ACM, 2016, pp. 39–48. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939699
  • [169] H.-I. Suk, S.-W. Lee, and D. Shen, “Latent feature representation with stacked auto-encoder for ad/mci diagnosis,” Brain Structure and Function, vol. 220, pp. 841–859, 2013.
  • [170] G. van Tulder and M. de Bruijne, “Combining generative and discriminative representation learning for lung ct analysis with convolutional restricted boltzmann machines,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1262–1272, May 2016.
  • [171] T. Brosch and R. Tam, “Manifold learning of brain mris by deep learning,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, K. Mori, I. Sakuma, Y. Sato, C. Barillot, and N. Navab, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 633–640.
  • [172] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, pp. 115–118, Jan. 2017. [Online]. Available: http://dx.doi.org/10.1038/nature21056
  • [173] S. Rajaraman, S. K. Antani, M. Poostchi, K. Silamut, M. A. Hossain, R. J. Maude, S. Jaeger, and G. R. Thoma, “Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images,” PeerJ, vol. 6, p. e4568, Apr. 2018, PMID: 29682411. [Online]. Available: https://www.ncbi.nlm.nih.gov/pubmed/29682411
  • [174] G. Kang, K. Liu, B. Hou, and N. Zhang, “3d multi-view convolutional neural networks for lung nodule classification,” in PloS one, 2017.
  • [175] S. Hwang and H. Kim, “Self-transfer learning for fully weakly supervised object localization,” CoRR, vol. abs/1602.01625, 2016. [Online]. Available: http://arxiv.org/abs/1602.01625
  • [176] S. Andermatt, S. Pezold, and P. Cattin, “Multi-dimensional gated recurrent units for the segmentation of biomedical 3d-data,” in Deep Learning and Data Labeling for Medical Applications, G. Carneiro, D. Mateus, L. Peter, A. Bradley, J. M. R. S. Tavares, V. Belagiannis, J. P. Papa, J. C. Nascimento, M. Loog, Z. Lu, J. S. Cardoso, and J. Cornebise, Eds.   Cham: Springer International Publishing, 2016, pp. 142–151.
  • [177] X. Cheng, X. Lin, and Y. Zheng, “Deep similarity learning for multimodal medical images,” CMBBE: Imaging & Visualization, vol. 6, pp. 248–252, 2018.
  • [178] S. Miao, Z. J. Wang, and R. Liao, “A cnn regression approach for real-time 2d/3d registration,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1352–1363, May 2016.
  • [179] O. Oktay, W. Bai, M. C. H. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. A. Cook, D. P. O’Regan, and D. Rueckert, “Multi-input cardiac image super-resolution using convolutional neural networks,” in MICCAI, 2016.
  • [180] V. Golkov, A. Dosovitskiy, J. I. Sperl, M. I. Menzel, M. Czisch, P. Sämann, T. Brox, and D. Cremers, “q-space deep learning: Twelve-fold shorter and model-free diffusion mri scans,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1344–1351, May 2016.
  • [181] V. S. S. Vankayala and N. D. Rao, “Artificial neural networks and their applications to power systems—a bibliographical survey,” Electric power systems research, vol. 28, no. 1, pp. 67–79, 1993.
  • [182] M.-y. Chow, P. Mangum, and R. Thomas, “Incipient fault detection in dc machines using a neural network,” in Signals, Systems and Computers, 1988. Twenty-Second Asilomar Conference on, vol. 2.   IEEE, 1988, pp. 706–709.
  • [183] Z. Guo, K. Zhou, X. Zhang, and S. Yang, “A deep learning model for short-term power load and probability density forecasting,” Energy, vol. 160, pp. 1186–1200, 2018.
  • [184] R. E. Bourguet and P. J. Antsaklis, “Artificial neural networks in electric power industry,” ISIS, vol. 94, p. 007, 1994.
  • [185] J. Sharp, “Comparative models for electrical load forecasting: D.H. Bunn and E.D. Farmer, eds. (Wiley, New York, 1985) £24.95, pp. 232,” International Journal of Forecasting, vol. 2, no. 2, pp. 241–242, 1986. [Online]. Available: https://EconPapers.repec.org/RePEc:eee:intfor:v:2:y:1986:i:2:p:241-242
  • [186] H. S. Hippert, C. E. Pedreira, and R. C. Souza, “Neural networks for short-term load forecasting: A review and evaluation,” IEEE Transactions on power systems, vol. 16, no. 1, pp. 44–55, 2001.
  • [187] C. Kuster, Y. Rezgui, and M. Mourshed, “Electrical load forecasting models: A critical systematic review,” Sustainable cities and society, vol. 35, pp. 257–270, 2017.
  • [188] R. Aggarwal and Y. Song, “Artificial neural networks in power systems. i. general introduction to neural computing,” Power Engineering Journal, vol. 11, no. 3, pp. 129–134, 1997.
  • [189] Y. Zhai, “Time series forecasting competition among three sophisticated paradigms,” Ph.D. dissertation, University of North Carolina at Wilmington, 2005.
  • [190] D. C. Park, M. El-Sharkawi, R. Marks, L. Atlas, and M. Damborg, “Electric load forecasting using an artificial neural network,” IEEE transactions on Power Systems, vol. 6, no. 2, pp. 442–449, 1991.
  • [191] E. Mocanu, P. H. Nguyen, M. Gibescu, and W. L. Kling, “Deep learning for estimating building energy consumption,” Sustainable Energy, Grids and Networks, vol. 6, pp. 91–99, 2016.
  • [192] K. Chen, K. Chen, Q. Wang, Z. He, J. Hu, and J. He, “Short-term load forecasting with deep residual networks,” IEEE Transactions on Smart Grid, 2018.
  • [193] S. Bouktif, A. Fiaz, A. Ouni, and M. Serhani, “Optimal deep learning lstm model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches,” Energies, vol. 11, no. 7, p. 1636, 2018.
  • [194] A. Dedinec, S. Filiposka, A. Dedinec, and L. Kocarev, “Deep belief network based electricity load forecasting: An analysis of macedonian case,” Energy, vol. 115, pp. 1688–1700, 2016.
  • [195] A. Rahman, V. Srikumar, and A. D. Smith, “Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks,” Applied Energy, vol. 212, pp. 372–385, 2018.
  • [196] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang, “Short-term residential load forecasting based on lstm recurrent neural network,” IEEE Transactions on Smart Grid, 2017.
  • [197] X. Dong, L. Qian, and L. Huang, “Short-term load forecasting in smart grid: A combined cnn and k-means clustering approach,” in Big Data and Smart Computing (BigComp), 2017 IEEE International Conference on.   IEEE, 2017, pp. 119–125.
  • [198] S. A. Kalogirou, “Artificial neural networks in renewable energy systems applications: a review,” Renewable and sustainable energy reviews, vol. 5, no. 4, pp. 373–401, 2001.
  • [199] H. Wang, H. Yi, J. Peng, G. Wang, Y. Liu, H. Jiang, and W. Liu, “Deterministic and probabilistic forecasting of photovoltaic power based on deep convolutional neural network,” Energy Conversion and Management, vol. 153, pp. 409–422, 2017.
  • [200] U. K. Das, K. S. Tey, M. Seyedmahmoudian, S. Mekhilef, M. Y. I. Idris, W. Van Deventer, B. Horan, and A. Stojcevski, “Forecasting of photovoltaic power generation and model optimization: A review,” Renewable and Sustainable Energy Reviews, vol. 81, pp. 912–928, 2018.
  • [201] V. Dabra, K. K. Paliwal, P. Sharma, and N. Kumar, “Optimization of photovoltaic power system: a comparative study,” Protection and Control of Modern Power Systems, vol. 2, no. 1, p. 3, 2017.
  • [202] J. Liu, W. Fang, X. Zhang, and C. Yang, “An improved photovoltaic power forecasting model with the assistance of aerosol index data,” IEEE Transactions on Sustainable Energy, vol. 6, no. 2, pp. 434–442, 2015.
  • [203] H. S. Jang, K. Y. Bae, H.-S. Park, and D. K. Sung, “Solar power prediction based on satellite images and support vector machine,” IEEE Trans. Sustain. Energy, vol. 7, no. 3, pp. 1255–1263, 2016.
  • [204] A. Gensler, J. Henze, B. Sick, and N. Raabe, “Deep learning for solar power forecasting—an approach using autoencoder and lstm neural networks,” in Systems, Man, and Cybernetics (SMC), 2016 IEEE International Conference on.   IEEE, 2016, pp. 2858–2865.
  • [205] M. Abdel-Nasser and K. Mahmoud, “Accurate photovoltaic power forecasting models using deep lstm-rnn,” Neural Computing and Applications, pp. 1–14, 2017.
  • [206] J. F. Manwell, J. G. McGowan, and A. L. Rogers, Wind energy explained: theory, design and application.   John Wiley & Sons, 2010.
  • [207] A. P. Marugán, F. P. G. Márquez, J. M. P. Perez, and D. Ruiz-Hernández, “A survey of artificial neural network in wind energy systems,” Applied energy, vol. 228, pp. 1822–1836, 2018.
  • [208] W. Wu, K. Chen, Y. Qiao, and Z. Lu, “Probabilistic short-term wind power forecasting based on deep neural networks,” in Probabilistic Methods Applied to Power Systems (PMAPS), 2016 International Conference on.   IEEE, 2016, pp. 1–8.
  • [209] H.-z. Wang, G.-q. Li, G.-b. Wang, J.-c. Peng, H. Jiang, and Y.-t. Liu, “Deep learning based ensemble approach for probabilistic wind power forecasting,” Applied energy, vol. 188, pp. 56–70, 2017.
  • [210] K. Wang, X. Qi, H. Liu, and J. Song, “Deep belief network based k-means cluster approach for short-term wind power forecasting,” Energy, vol. 165, pp. 840–852, 2018.
  • [211] C. Feng, M. Cui, B.-M. Hodge, and J. Zhang, “A data-driven multi-model methodology with deep feature selection for short-term wind forecasting,” Applied Energy, vol. 190, pp. 1245–1257, 2017.
  • [212] A. S. Qureshi, A. Khan, A. Zameer, and A. Usman, “Wind power prediction using deep neural network based meta regression and transfer learning,” Applied Soft Computing, vol. 58, pp. 742–755, 2017.
  • [213] G. J. S. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” CoRR, vol. abs/1702.05747, 2017. [Online]. Available: http://arxiv.org/abs/1702.05747
  • [214] D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual Review of Biomedical Engineering, vol. 19, no. 1, pp. 221–248, 2017, PMID: 28301734. [Online]. Available: https://doi.org/10.1146/annurev-bioeng-071516-044442
  • [215] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016.
  • [216] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multi-task sparse learning,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 2042–2049.
  • [217] “Brain mri image segmentation using stacked denoising autoencoders,” https://goo.gl/tpnDx3.
  • [218] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015. [Online]. Available: http://arxiv.org/abs/1505.04597
  • [219] A. Maier, C. Syben, T. Lasser, and C. Riess, “A gentle introduction to deep learning in medical image processing,” Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 86–101, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S093938891830120X
  • [220] G. Marcus, “Deep learning: A critical appraisal,” CoRR, vol. abs/1801.00631, 2018.
  • [221] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in NIPS 2017, 2017.
  • [222] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in ICANN, 2011.
  • [223] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16.   USA: Curran Associates Inc., 2016, pp. 3637–3645. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157382.3157504
  • [224] K. Hsu, S. Levine, and C. Finn, “Unsupervised learning via meta-learning,” CoRR, vol. abs/1810.02334, 2018.
  • [225] A. Banino, C. Barry, B. Uria, C. Blundell, T. P. Lillicrap, P. Mirowski, A. Pritzel, M. J. Chadwick, T. Degris, J. Modayil, G. Wayne, H. Soyer, F. Viola, B. Zhang, R. Goroshin, N. C. Rabinowitz, R. Pascanu, C. Beattie, S. Petersen, A. Sadik, S. Gaffney, H. King, K. Kavukcuoglu, D. Hassabis, R. Hadsell, and D. Kumaran, “Vector-based navigation using grid-like representations in artificial agents,” Nature, vol. 557, pp. 429–433, 2018.
  • [226] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” CoRR, vol. abs/1611.02167, 2016.
  • [227] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992698
  • [228] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, “Deep learning for visual understanding: A review,” Neurocomputing, vol. 187, pp. 27–48, 2016.
  • [229] A. Voulodimos, N. D. Doulamis, A. D. Doulamis, and E. Protopapadakis, “Deep learning for computer vision: A brief review,” in Comp. Int. and Neurosc., 2018.
  • [230] J. R. del Solar, P. Loncomilla, and N. Soto, “A survey on deep learning methods for robot vision,” CoRR, vol. abs/1803.10862, 2018.
  • [231] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris, “Deep learning advances in computer vision with 3d data: A survey,” ACM Computing Surveys, vol. 50, 06 2017.
  • [232] C. Seifert, A. Aamir, A. Balagopalan, D. Jain, A. Sharma, S. Grottel, and S. Gumhold, Visualizations of Deep Neural Networks in Computer Vision: A Survey.   Cham: Springer International Publishing, 2017, pp. 123–144. [Online]. Available: https://doi.org/10.1007/978-3-319-54024-5_6
  • [233] W.-Y. Lin, Y.-H. Hu, and C.-F. Tsai, “Machine learning in financial crisis prediction: A survey,” IEEE Transactions on Systems, Man, and Cybernetics - TSMC, vol. 42, pp. 421–436, 07 2012.
  • [234] J. C. B. Gamboa, “Deep learning for time-series analysis,” CoRR, vol. abs/1701.01887, 2017.
  • [235] S. Sarojini Devi and Y. Radhika, “A survey on machine learning and statistical techniques in bankruptcy prediction,” International Journal of Machine Learning and Computing, vol. 8, pp. 133–139, 04 2018.
  • [236] A. Tealab, “Time series forecasting using artificial neural networks methodologies: A systematic review,” Future Computing and Informatics Journal, vol. 3, no. 2, pp. 334–340, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2314728817300715
  • [237] A. Almalaq and G. Edwards, “A review of deep learning methods applied on load forecasting,” in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec 2017, pp. 511–516.
  • [238] J. P. Usuga Cadavid, S. Lamouri, and B. Grabot, “Trends in Machine Learning Applied to Demand & Sales Forecasting: A Review,” in International Conference on Information Systems, Logistics and Supply Chain, Lyon, France, July 2018. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01881362
  • [239] S. Beheshti-Kashi, H. Reza Karimi, K.-D. Thoben, M. Lütjen, and M. Teucke, “A survey on retail sales forecasting and prediction in fashion markets,” Systems Science & Control Engineering: An Open Access Journal, vol. 3, pp. 154–161, 01 2015.
  • [240] H. K. Alfares and M. Nazeeruddin, “Electric load forecasting: Literature survey and classification of methods,” Int. J. Systems Science, vol. 33, pp. 23–34, 2002.
  • [241] M. Längkvist, L. Karlsson, and A. Loutfi, “A review of unsupervised feature learning and deep learning for time-series modeling,” Pattern Recognition Letters, vol. 42, pp. 11 – 24, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865514000221
  • [242] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, “A comprehensive survey of deep learning for image captioning,” ACM Comput. Surv., vol. 51, no. 6, pp. 118:1–118:36, Feb. 2019. [Online]. Available: http://doi.acm.org/10.1145/3295748
  • [243] H. Wang, S. Shang, L. Long, R. Hu, Y. Wu, N. Chen, S. Zhang, F. Cong, and S. Lin, “Biological image analysis using deep learning-based methods: Literature review,” Digital Medicine, vol. 4, no. 4, pp. 157–165, 2018. [Online]. Available: http://www.digitmedicine.com/article.asp?issn=2226-8561;year=2018;volume=4;issue=4;spage=157;epage=165;aulast=Wang;t=6
  • [244] Y. Li, H. Zhang, X. Xue, Y. Jiang, and Q. Shen, “Deep learning for remote sensing image classification: A survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 6, p. e1264, 2018. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1264
  • [245] W. Rawat and Z. Wang, “Deep convolutional neural networks for image classification: A comprehensive review,” Neural Computation, vol. 29, no. 9, pp. 2352–2449, 2017, PMID: 28599112. [Online]. Available: https://doi.org/10.1162/neco_a_00990
  • [246] M. I. Razzak, S. Naz, and A. Zaib, Deep Learning for Medical Image Processing: Overview, Challenges and the Future.   Cham: Springer International Publishing, 2018, pp. 323–350. [Online]. Available: https://doi.org/10.1007/978-3-319-65981-7_12
  • [247] A. S. Lundervold and A. Lundervold, “An overview of deep learning in medical imaging focusing on MRI,” Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 102 – 127, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0939388918301181
  • [248] S. Liu, Y. Wang, X. Yang, B. Lei, L. Liu, S. X. Li, D. Ni, and T. Wang, “Deep learning in medical ultrasound analysis: A review,” Engineering, vol. 5, no. 2, pp. 261 – 275, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2095809918301887
  • [249] N. Akhtar and A. S. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” IEEE Access, vol. 6, pp. 14410–14430, 2018.
  • [250] D. J. Miller, Z. Xiang, and G. Kesidis, “Adversarial learning in statistical classification: A comprehensive review of defenses against attacks,” CoRR, vol. abs/1904.06292, 2019. [Online]. Available: http://arxiv.org/abs/1904.06292
  • [251] M. Ozdag, “Adversarial attacks and defenses against deep neural networks: A survey,” Procedia Computer Science, vol. 140, pp. 152 – 161, 2018, Cyber Physical Systems and Deep Learning, Chicago, Illinois, November 5-7, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050918319884
  • [252] S. Thomas and N. Tabrizi, Adversarial Machine Learning: A Literature Review, 07 2018, pp. 324–334.
  • [253] S. Qiu, Q. Liu, S. Zhou, and C. Wu, “Review of artificial intelligence adversarial attack and defense technologies,” Applied Sciences, vol. 9, no. 5, 2019. [Online]. Available: http://www.mdpi.com/2076-3417/9/5/909
  • [254] V. Duddu, “A survey of adversarial machine learning in cyber warfare,” 2018.
  • [255] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, 05 2018.
  • [256] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, and P. C. Cervera, “A review of deep learning methods and applications for unmanned aerial vehicles,” J. Sensors, vol. 2017, pp. 3296874:1–3296874:13, 2017.
  • [257] A. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger, J. Kindelsberger, L. Ding, S. Seaman, H. Abraham, A. Mehler, A. Sipperley, A. Pettinato, B. Seppelt, L. Angell, B. Mehler, and B. Reimer, “MIT autonomous vehicle technology study: Large-scale deep learning based analysis of driver behavior and interaction with automation,” CoRR, vol. abs/1711.06976, 2017.
  • [258] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, and M. H. Ang, “Perception, planning, control, and coordination for autonomous vehicles,” Machines, vol. 5, no. 1, 2017. [Online]. Available: http://www.mdpi.com/2075-1702/5/1/6
  • [259] G. von Zitzewitz, “Survey of neural networks in autonomous driving,” 07 2017.
  • [260] C. Badue, R. Guidolini, R. V. Carneiro, P. Azevedo, V. B. Cardoso, A. Forechi, L. F. R. Jesus, R. F. Berriel, T. M. Paixão, F. W. Mutz, T. Oliveira-Santos, and A. F. de Souza, “Self-driving cars: A survey,” CoRR, vol. abs/1901.04407, 2019.
  • [261] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing [review article],” IEEE Computational Intelligence Magazine, vol. 13, pp. 55–75, 2018.
  • [262] D. W. Otter, J. R. Medina, and J. K. Kalita, “A survey of the usages of deep learning in natural language processing,” CoRR, vol. abs/1807.10854, 2018.
  • [263] W. Khan, A. Daud, J. Nasir, and T. Amjad, “A survey on the state-of-the-art machine learning models in the context of NLP,” vol. 43, pp. 95–113, 10 2016.
  • [264] S. Fahad and A. Yahya, “Inflectional review of deep learning on natural language processing,” 07 2018.
  • [265] H. Li, “Deep learning for natural language processing: advantages and challenges,” National Science Review, vol. 5, no. 1, pp. 24–26, 09 2017. [Online]. Available: https://doi.org/10.1093/nsr/nwx110
  • [266] Y. Xie, L. Le, Y. Zhou, and V. V. Raghavan, “Deep learning for natural language processing,” 2018.
  • [267] S. Zhang, L. Yao, and A. Sun, “Deep learning based recommender system: A survey and new perspectives,” ACM Comput. Surv., vol. 52, pp. 5:1–5:38, 2019.
  • [268] R. Mu, “A survey of recommender systems based on deep learning,” IEEE Access, vol. 6, pp. 69009–69022, 2018.
  • [269] Z. Batmaz, A. Yurekli, A. Bilge, and C. Kaleli, “A review on deep learning for recommender systems: challenges and remedies,” Artificial Intelligence Review, Aug 2018. [Online]. Available: https://doi.org/10.1007/s10462-018-9654-y
  • [270] B. T. Betru, C. A. Onana, and B. Batchakui, “Deep learning methods on recommender system: A survey of state-of-the-art,” 2017.
  • [271] R. Fakhfakh, B. a. Anis, and C. Ben Amar, “Deep learning-based recommendation: Current issues and challenges,” International Journal of Advanced Computer Science and Applications, vol. 8, 01 2017.
  • [272] L. Zheng, “A survey and critique of deep learning on recommender systems,” 2016.
  • [273] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, “Efficient machine learning for big data: A review,” Big Data Research, vol. 2, no. 3, pp. 87 – 93, 2015, Big Data, Analytics, and High-Performance Computing. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579615000271
  • [274] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, “A survey on deep learning for big data,” Information Fusion, vol. 42, pp. 146 – 157, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1566253517305328
  • [275] Y. Roh, G. Heo, and S. E. Whang, “A survey on data collection for machine learning: a big data - AI integration perspective,” CoRR, vol. abs/1811.03402, 2018. [Online]. Available: http://arxiv.org/abs/1811.03402
  • [276] J. Qiu, Q. Wu, G. Ding, Y. Xu, and S. Feng, “A survey of machine learning for big data processing,” EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 67, May 2016. [Online]. Available: https://doi.org/10.1186/s13634-016-0355-x
  • [277] B. Jan, H. Farman, M. Khan, M. Imran, I. U. Islam, A. Ahmad, S. Ali, and G. Jeon, “Deep learning in big data analytics: A comparative study,” Computers & Electrical Engineering, vol. 75, pp. 275 – 287, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0045790617315835
  • [278] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, “Deep learning applications and challenges in big data analytics,” Journal of Big Data, vol. 2, no. 1, p. 1, Feb 2015. [Online]. Available: https://doi.org/10.1186/s40537-014-0007-7