I How Deep Learning Became Deep Learning
In the last few years, with the emergence of affordable parallel processing hardware and free, open source frameworks, a new machine learning approach named “Deep Learning” has drawn an extensive amount of attention. The wave started with a Deep Neural Network (DNN) named AlexNet, which won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. This new method achieved a top-five test error of 15.3%, compared to 26.2% for the second best method. After 2012, DNN models were the usual winners of every year’s ILSVRC. Several remarkable architectures for object classification have been introduced since 2012, including but not limited to ZF Net, VGG Net, GoogLeNet, and Microsoft’s ResNet, which won the ILSVRC in 2015 with an error rate of 3.6%, below the roughly 5% top-five error estimated for human annotators.
While deep learning started with object classification, the wave did not stop there. Object detection was another major application in which DNNs took a large step forward. Works such as Region-based CNNs (RCNN), Fast RCNN, Faster RCNN, You Only Look Once (YOLO), YOLO9000 and YOLOv2, and the Mask Region-based Convolutional Network (Mask R-CNN) were designed to provide a bounding box around an object in the input image and also classify the object.
While these works combine a regression problem (bounding box location) with a classification problem (object recognition), there are also multidimensional classification use cases such as semantic segmentation. For example, the celebrated SegNet model is designed to map its input image to an output of the same size in which each output pixel corresponds to one of 11 classes, while the networks presented in [12, 13] are trained to perform binary classification for segmenting low-quality iris images. In medical applications, the U-net model is trained to perform binary segmentation of neuronal structures and is also utilized for segmentation in transmitted light microscopy images (phase contrast and DIC).
Classification and regression are not the only problems for which DL provides a solution. In 2014, DNNs were applied to estimation theory with the introduction of Generative Adversarial Networks (GANs). GANs are successful implementations of deep generative models, which learn the data distribution and draw random samples from the learned distribution. Multiple variations, such as WGAN, EBGAN, BEGAN, VAR+GAN and VAC+GAN, have evolved from the original GAN by adjusting the loss function and/or the network architecture.
Alongside all these applications of DNNs, their impact on the medical imaging sciences is substantial. This article focuses on the impact of DNNs on CT imaging in general and low dose CT reconstruction in particular, since DNNs have recently been widely used in low dose CT use cases. The technology has progressed so far that the first market-stage deep learning based low dose CT reconstruction has recently been introduced. This indicates the importance of data-driven methods such as Deep Learning even in sensitive markets such as medical imaging. The success of these methods in providing valuable input to radiologists changes the direction in which imaging technology is heading. In the next section, a brief explanation of CT imaging is given, followed by a discussion of the influence of DL on CT imaging in section III. In section IV, the future of CT imaging is discussed in more detail, and the last section contains the conclusion and the discussion.
II CT Imaging
Computed tomography (CT) is a well-known imaging technique that allows for non-invasive visualization of the interior of an object. It is widely used in many applications such as medical imaging [22, 23], non-destructive testing, industrial metrology, the food industry [26, 27], and security. In CT, X-ray radiation is used to acquire a number of two dimensional (2D) images of an object from many different viewpoints. From these images, using reconstruction software, a three dimensional (3D) model of the object’s internal structure is computed and subsequently analyzed. In this section, we briefly describe the principle of X-ray CT imaging and image reconstruction.
II-A X-rays: matter interaction and detection
When an X-ray beam passes through an object, its intensity decreases due to physical mechanisms such as the photo-electric effect or elastic or inelastic scattering. Let $I_0$ denote the intensity of a monochromatic X-ray beam that leaves the X-ray source. Then the intensity $I(r, \theta)$ of the X-ray beam at position $r$ on the detector, after passing through the object along a line $L_{r,\theta}$ oriented at angle $\theta$, is given by:
$$I(r, \theta) = I_0 \exp\left(-\int_{L_{r,\theta}} \mu(x, y)\, ds\right) \qquad (1)$$
Eq. (1) describes the relationship between the observed intensity at the detector side and the unknown attenuation coefficients $\mu(x, y)$ the X-ray beam passed through. Log-normalization of this detected intensity yields the projection value
$$p(r, \theta) = -\ln\left(\frac{I(r, \theta)}{I_0}\right) = \int_{L_{r,\theta}} \mu(x, y)\, ds \qquad (2)$$
which linearly relates to the (unknown) attenuation coefficients of the object.
The main purpose of CT reconstruction methods is to recover the object’s attenuation coefficients from given projection data $p(r, \theta)$. In what follows, we describe commonly used reconstruction methods to recover the object’s attenuation values from projection data measured by directing the X-ray beam at different angles through the object.
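As a small numerical illustration of the exponential attenuation law and the log-normalization described above, the following Python sketch discretizes the line integral along a single ray. The function names, the sampling step and the attenuation value of 0.2 per cm are illustrative assumptions, not values taken from this article.

```python
import numpy as np

def detected_intensity(mu_along_ray, step, i0):
    """Monochromatic attenuation law: I = I0 * exp(-line integral of mu)."""
    line_integral = np.sum(mu_along_ray) * step  # Riemann sum along the ray
    return i0 * np.exp(-line_integral)

def projection_value(intensity, i0):
    """Log-normalization: p = -ln(I / I0), which recovers the line integral."""
    return -np.log(intensity / i0)

# Hypothetical ray: 2 cm of homogeneous material with mu = 0.2 per cm,
# sampled every 0.1 cm (20 samples). These numbers are assumptions.
mu_samples = np.full(20, 0.2)
i0 = 1000.0
intensity = detected_intensity(mu_samples, step=0.1, i0=i0)
p = projection_value(intensity, i0)
print(intensity, p)
```

The projection value `p` equals the discretized line integral of the attenuation (0.4 here, up to rounding), which is the linear relation that the reconstruction methods below rely on.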
II-B Analytical reconstruction methods
In the analytical approach, the object’s attenuation distribution is described as a function $\mu(x, y)$ that maps the spatial coordinate $(x, y)$ to its corresponding local attenuation coefficient.
II-B1 Filtered Back Projection (FBP)
The Filtered Back Projection (FBP) reconstruction method is based on the following analytical formula:
$$\mu(x, y) = \int_{0}^{\pi} \int_{-\infty}^{\infty} P(q, \theta)\, |q|\, e^{2\pi i q r}\, dq\, d\theta, \qquad r = x\cos\theta + y\sin\theta \qquad (3)$$
where $P(q, \theta)$ denotes the Fourier transform of the projection data $p(r, \theta)$ with respect to $r$. As can be observed from Eq. (3), the FBP formula gives rise to a simple two step approach for calculating a reconstruction of the scanned object based on the measured projection data $p(r, \theta)$:
1. Filter the projection data by multiplying its Fourier transform with $|q|$ and calculating the inverse Fourier transform. This step corresponds to the inner integral in Eq. (3).
2. For a particular location $(x, y)$ in the image domain, sum up all the filtered projection data that correspond to the lines with $r = x\cos\theta + y\sin\theta$. This step corresponds to the outer integral in Eq. (3).
The FBP representation assumes that projection data is available from all angles. If indeed many X-ray projections from all angles are available, FBP generally leads to high quality reconstructions. If these two conditions (many projections, covering all angles) are not satisfied, e.g. in case of limited angle scanning or related missing data problems, severe streaking artefacts appear in the reconstructed image.
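The two steps of FBP (filtering, then back projection) can be sketched in a few lines of NumPy. This is a minimal illustration for parallel beam geometry with unit detector spacing, not the implementation used in this article: the function names, the zero-padding factor, the linear interpolation and the disc test object are all assumptions made for the example. Sampling the ramp $|q|$ directly in the frequency domain is the simplest choice and is known to introduce a small intensity offset.

```python
import numpy as np

def ramp_filter(sinogram):
    """Step 1: multiply each projection's Fourier transform by |q| and invert."""
    n_det = sinogram.shape[1]
    n_pad = 2 * n_det                      # zero-padding limits wrap-around
    q = np.fft.fftfreq(n_pad)              # frequencies in cycles per detector pixel
    spectrum = np.fft.fft(sinogram, n=n_pad, axis=1)
    filtered = np.real(np.fft.ifft(spectrum * np.abs(q), axis=1))
    return filtered[:, :n_det]

def back_project(filtered, angles, size):
    """Step 2: for each pixel (x, y), sum filtered data at r = x cos(t) + y sin(t)."""
    n_det = filtered.shape[1]
    det_index = np.arange(n_det)
    coords = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    image = np.zeros((size, size))
    for row, theta in zip(filtered, angles):
        # Detector coordinate of every pixel, shifted to detector index units.
        r = x * np.cos(theta) + y * np.sin(theta) + (n_det - 1) / 2.0
        values = np.interp(r.ravel(), det_index, row, left=0.0, right=0.0)
        image += values.reshape(size, size)
    return image * np.pi / len(angles)     # d(theta) = pi / number of angles

# Toy data (assumed): a centred disc of radius 12 pixels with attenuation 1 has
# the analytic projection 2 * sqrt(R^2 - r^2), identical for every angle.
n_det, n_angles, radius = 64, 90, 12.0
r_det = np.arange(n_det) - (n_det - 1) / 2.0
projection = 2.0 * np.sqrt(np.clip(radius**2 - r_det**2, 0.0, None))
angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
sinogram = np.tile(projection, (n_angles, 1))
reconstruction = back_project(ramp_filter(sinogram), angles, size=n_det)
print(reconstruction[32, 32], reconstruction[32, 4])
```

With many angles covering the full half circle, the value inside the disc is close to 1 and the value outside is close to 0. Reducing `n_angles` or restricting the angular range in this sketch is a quick way to observe the streaking artefacts mentioned above.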
II-C Algebraic reconstruction methods
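A class of reconstruction methods better suited to cope with deviations from the above conditions are algebraic reconstruction methods (ARMs). ARMs rely on a discrete model of the object $\boldsymbol{x} \in \mathbb{R}^{N}$, as shown in Fig. 1. Their basic scheme is to iteratively minimize the difference between the computed forward projection $\boldsymbol{W}\boldsymbol{x}$ of the discrete image and the observed projection data $\boldsymbol{p} \in \mathbb{R}^{M}$, where the object is updated based on the backprojected difference. Thereby, $\boldsymbol{W} = (w_{ij}) \in \mathbb{R}^{M \times N}$, with $w_{ij}$ denoting the contribution of object pixel $j$ to the detector pixel $i$.
A commonly used ARM is the SIRT algorithm, in which, in each iteration $k$, the current estimate of the image, $\boldsymbol{x}^{(k)}$, is updated as follows:
$$\boldsymbol{x}^{(k+1)} = \boldsymbol{x}^{(k)} + \boldsymbol{C}\boldsymbol{W}^{T}\boldsymbol{R}\left(\boldsymbol{p} - \boldsymbol{W}\boldsymbol{x}^{(k)}\right) \qquad (4)$$
where, in the common formulation of SIRT, $\boldsymbol{R}$ and $\boldsymbol{C}$ are diagonal matrices containing the inverse row sums and inverse column sums of $\boldsymbol{W}$, respectively.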
An important advantage of ARMs is that, if prior knowledge is available, such as shape, object support, material density, or sparseness in some transform domain, this information can easily be integrated into such an iterative reconstruction scheme. As a result, the quality of the reconstructed image can be substantially improved compared to plain FBP. One of their most important drawbacks, however, is their computational load and slow convergence, which is the main reason why they are not commonly used in industrial applications.
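The iterative scheme of SIRT can be illustrated with a short NumPy sketch. It assumes the common formulation in which the backprojected residual is scaled by the inverse row and column sums of the projection matrix; the random non-negative matrix `W` below is only a stand-in for a real projection geometry, and all names are hypothetical.

```python
import numpy as np

def sirt(W, p, n_iter=200):
    """SIRT sketch: x <- x + C W^T R (p - W x).

    R and C hold the inverse row sums and inverse column sums of W
    (assumed formulation). Returns the estimate and the R-weighted
    residual norm recorded before each update.
    """
    row_sums = W.sum(axis=1)
    col_sums = W.sum(axis=0)
    R = np.where(row_sums > 0, 1.0 / np.maximum(row_sums, 1e-12), 0.0)
    C = np.where(col_sums > 0, 1.0 / np.maximum(col_sums, 1e-12), 0.0)
    x = np.zeros(W.shape[1])
    residual_norms = []
    for _ in range(n_iter):
        residual = p - W @ x                        # difference in projection space
        residual_norms.append(np.sqrt(np.sum(R * residual**2)))
        x = x + C * (W.T @ (R * residual))          # backproject and update
    return x, residual_norms

# Toy system (assumed): 40 detector readings of a 16-pixel object.
rng = np.random.default_rng(0)
W = rng.random((40, 16))
x_true = rng.random(16)
p = W @ x_true
x_est, residual_norms = sirt(W, p)
print(residual_norms[0], residual_norms[-1])
```

The weighted residual never increases from one iteration to the next, but it typically shrinks slowly, which reflects the slow convergence mentioned above. Prior knowledge such as non-negativity could be added in this sketch by clipping `x` after each update.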
III Deep Learning in CT
The most studied Deep Learning application in CT imaging is the reconstruction pipeline for low dose CT use cases. The main reason is the difficulty of finding an exact model of the artefacts caused by limited angle projections. Although these artefacts look similar across different reconstructions, they are nontrivial to predict without knowing the exact geometry and material properties of the object to be scanned. This is a circular problem, since the goal of the imaging is to measure precisely those properties. Data-driven methods provide a powerful tool to learn any kind of pattern that occurs repeatedly, in similar shapes and intensities, throughout a data stream. DNNs are able to learn a set of solutions from the training set and generalize them to data they have never observed before. In the case of low dose CT imaging, they give promising results in removing the streaking artefacts from the reconstructed images.
In general, DNNs provide solutions for CT reconstruction in two main ways. First, they serve as a post-processing tool to remove the artefacts caused by low dose imaging. Second, they provide an end to end model which translates the sinogram space into the image space. Both of these approaches are explained in the following sections.
III-A DNNs as helping hands
If you have already come across a deep learning article on low dose CT reconstruction, it most probably used a DNN as a post-processing step to remove the noise and artefacts left by other reconstruction techniques. In fact, the most misleading aspect of this literature lies in the titles of the articles, which give the impression that the DNN is performing the reconstruction task, whereas in almost all of these papers the deep learning technique is used as an auxiliary step after the actual reconstruction by classical methods such as FBP or SIRT. For example, a Mixed-Scale Dense (MSD) Convolutional Neural Network has been exploited to remove the artefacts from the FBP reconstruction. In [33, 34], other fully convolutional models have been used to perform the exact same task, while other approaches such as [35, 36] perform this job in the wavelet and contourlet space using fully convolutional DNNs.
Another approach, known as NNFBP, exploits a neural network to learn a filter bank for the FBP method; the final result for a given sinogram is calculated as a weighted sum of several FBP reconstructions.
There are also more sophisticated approaches, in which a GAN objective (based on the Wasserstein distance) is imposed on the network, forcing it to generate an output that follows the distribution of high-quality CT images. The network is additionally required to reduce a perceptual loss between its output and the ground truth, where this loss is computed with another pre-trained network known as VGG. Perceptual losses are designed to produce visually pleasing results and do not impose any structural and/or pixel level correctness on the reconstructed image. This is an important issue which should be considered in detail if these methods are to receive any market attention, especially in medical imaging, where sensitive decisions are made based on the acquisitions.
III-B DNNs as an end to end solution
It is very tempting to provide an end to end solution for CT reconstruction using data driven methods. This means that the model learns the mapping between the sinogram signal and the image space entirely from training data. The AUTOMAP method provides such an end to end solution. Since there is no obvious one to one, pixel level correspondence between sinogram space and image space signals, especially in low dose scenarios, the AUTOMAP technique exploits two fully connected layers at the very beginning of the model. The fully connected layers give each sinogram sample the opportunity to contribute to all the pixels in the image space. The first layer maps the input sinogram into a signal with the size of the output, and the second layer maps this tensor to another signal of the same size. These two layers account for almost all learnable parameters in the network. Let’s continue the discussion with an example.
Consider a 2-dimensional case where the X-ray sensor consists of 512 pixels and 128 signals at different projection angles are taken from the object. In this case, the sinogram space signal is an image with dimensions $512 \times 128$ pixels. If the reconstructed image is considered to be $512 \times 512$ pixels, then the first two layers of the AUTOMAP model consist of $(512 \times 128) \times (512 \times 512) + (512 \times 512)^2$ weights, which is around $8.6 \times 10^{10}$ learnable parameters. Most current deep learning platforms and GPU architectures assign 32 bits to each variable. This means that in such a simple scenario the AUTOMAP model needs more than 340 Gigabytes of memory to store its variables. The other problem is that training such a huge model needs an extremely large database with enough variation to avoid overfitting. It is worthwhile to note that we are only considering 2-dimensional use cases. For real life 3-dimensional scenarios, the required memory and computing power rise even more steeply, since the number of weights in a fully connected layer scales with the product of its input and output sizes. Besides the two fully connected layers, the AUTOMAP model contains two convolutional layers and an output layer.
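The memory estimate in this example can be checked with a few lines of Python. The sketch assumes a $512 \times 512$ reconstructed image (the output size that reproduces the figure of more than 340 Gigabytes quoted in the text), counts only the weights of the two fully connected layers, and ignores biases and the convolutional layers; the helper name is hypothetical.

```python
def dense_weight_count(n_inputs, n_outputs):
    """Number of weights in a fully connected layer (biases ignored)."""
    return n_inputs * n_outputs

sinogram_size = 512 * 128   # detector pixels times projection angles
image_size = 512 * 512      # assumed output resolution

first_layer = dense_weight_count(sinogram_size, image_size)
second_layer = dense_weight_count(image_size, image_size)
total_weights = first_layer + second_layer
memory_gb = total_weights * 4 / 1e9   # 32 bit = 4 bytes per weight

print(f"{total_weights:,} weights, about {memory_gb:.1f} GB")
```

Under these assumptions the count is 85,899,345,920 weights, or about 343.6 GB. Doubling the image side length in this sketch multiplies the second layer by sixteen, which shows how quickly the dense layers outgrow available memory.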
In order to investigate this technique in low dose scenarios, we trained an AUTOMAP model for a low dose CT problem with image size of and 20 projections with parallel beam geometry. The model was trained on more than 45000 images from the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium Pancreatic Ductal Adenocarcinoma (CPTAC-PDA) database (https://wiki.cancerimagingarchive.net/display/Public/CPTAC-PDA). We also trained an MSD network to reproduce the results of the MSD-based method explained in section III-A. The models are tested on samples from the Visible Human Project CT Datasets (https://mri.radiology.uiowa.edu/visible human datasets.html). The MSD network aims to remove the artefacts induced by the FBP algorithm in the reconstruction process. This experiment is conducted to compare the end to end method AUTOMAP with a method where DNNs are used as an auxiliary step in the reconstruction pipeline for low dose CT. Figure 2 shows the reconstructions of a test sample for the original FBP method, the FBP output repaired using the trained MSD network, and the AUTOMAP output, alongside the ground truth.
The AUTOMAP method is able to reconstruct the general shape of the object, while the details are tangled into each other. The main reason is that this approach totally ignores the geometry of the scanning process, which plays a crucial role in FBP and consequently in FBP+MSD methods. In ill-posed problems such as low dose CT reconstruction, any auxiliary information can make a difference in the final results. In the current case, knowledge of the scanning geometry provides auxiliary information which helps the model retain more detailed structures in the final reconstruction.
IV What Does the Future Look Like?
Based on the observations and discussions in the previous section, the market in the near future is unlikely to involve any end to end solution for low dose CT reconstruction. In fact, the deep learning approach has already found its way into the medical imaging industry. The first industry level deep learning based low dose CT reconstruction method follows the first approach described in section III-A. In this method, a Convolutional Deep Neural Network is utilized to remove the artefacts from the conventional reconstructions.
Figure 3 shows a potential schematic of how the training is accomplished. The networks proposed in the literature often exploit a fully convolutional architecture. A fully convolutional network is a neural network which contains only convolution, deconvolution, pooling, and unpooling layers. They usually take advantage of techniques such as Batch Normalization, dropout, skip connections and residual blocks. These methods are employed to improve training convergence, avoid overfitting and preserve high-frequency information from the input signal. After each convolutional operation, the signal passes through an activation function, also known as a nonlinearity. The most popular nonlinearity is the ReLU and its variations, such as leaky ReLU and ELU.
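For reference, the three nonlinearities named above can be written in a few lines of NumPy. The negative slope of 0.01 for leaky ReLU and the scale of 1.0 for ELU are commonly used defaults and are assumptions here, not values from this article.

```python
import numpy as np

def relu(x):
    """ReLU: passes positive values, sets negative values to zero."""
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: keeps a small linear response for negative inputs."""
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    """ELU: smooth exponential saturation towards -alpha for negative inputs."""
    # np.minimum avoids overflow in the unused branch for large positive x.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

samples = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(samples), leaky_relu(samples), elu(samples))
```

All three are identical for positive inputs; they differ only in how much signal and gradient they let through for negative inputs.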
The training is accomplished in two main steps: forward propagation and backpropagation. In the forward propagation step, the low-quality image is passed through the network and the output is acquired by calculating a set of convolutions. The next step is to compare the output of the network with its corresponding ground truth. The loss value is calculated using one or more distance functions, designed based on the problem characteristics. For the method shown in Figure 3, the most popular distance measure used in the literature is the mean squared error, given by:
$$\mathcal{L}_{MSE} = \frac{1}{n_w\, n_h\, n_b} \sum_{b=1}^{n_b} \sum_{i=1}^{n_w} \sum_{j=1}^{n_h} \left(Y_{b,i,j} - T_{b,i,j}\right)^2 \qquad (5)$$
wherein $n_w$, $n_h$, and $n_b$ are the width, height and batch size of the input signal, and $Y$ and $T$ are the output signal and the target values, respectively.
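The mean squared error over a batch can be computed directly with NumPy, as in the sketch below. The array layout (batch, height, width) and the function name are assumptions made for illustration.

```python
import numpy as np

def mse_loss(output, target):
    """Mean of squared differences over batch, height and width."""
    output = np.asarray(output, dtype=float)
    target = np.asarray(target, dtype=float)
    if output.shape != target.shape:
        raise ValueError("output and target must have the same shape")
    return float(np.mean((output - target) ** 2))

# Hypothetical batch of two 4x4 images: network output vs. ground truth.
batch_output = np.zeros((2, 4, 4))
batch_target = np.ones((2, 4, 4))
loss = mse_loss(batch_output, batch_target)
print(loss)
```

Because every pixel is weighted equally, this loss rewards pixel level agreement with the ground truth, in contrast to the perceptual losses discussed in section III-A.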
The next step is known as backpropagation. In this step, the derivative of the loss function with respect to every network parameter is calculated, and the parameters are updated using the calculated gradient to decrease the loss. There are several methods for performing the update, mostly based on gradient descent, such as Stochastic Gradient Descent (SGD), SGD with Nesterov Momentum, AdaGrad, RMSprop, and ADAM. The most popular optimization method is ADAM, since it adapts the learning rate and keeps a momentum term for each learnable parameter. The update step of ADAM is as follows:
$$\begin{aligned}
g_t &= \nabla_{\theta}\, \mathcal{L}\big(f(\cdot\,; \theta_{t-1})\big) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned} \qquad (6)$$
where $m_0 = v_0 = 0$, $\theta_t$ is the set of parameters in iteration $t$, $f$ is the model’s mapping function, and $\eta$ is the learning rate. The default values for $\beta_1$, $\beta_2$ and $\epsilon$ are 0.9, 0.999 and $10^{-8}$, respectively.
This optimization method presents a solution to problems such as vanishing learning rate, slow convergence, and fluctuations in the cost value.
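A single ADAM update can be sketched in NumPy as follows. The sketch follows the standard formulation of the optimizer with bias-corrected first and second moment estimates; the quadratic toy loss, the learning rate of 0.05 and all variable names are assumptions chosen so that the example runs quickly.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; t is the 1-based iteration counter."""
    m = beta1 * m + (1.0 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad**2        # running mean of squared gradients
    m_hat = m / (1.0 - beta1**t)                   # bias correction
    v_hat = v / (1.0 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem (assumed): minimize ||theta - target||^2, whose gradient is 2 (theta - target).
target = np.array([3.0, -2.0])
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 1001):
    grad = 2.0 * (theta - target)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)
```

Each parameter gets its own effective step size through the division by the square root of its second moment estimate, which is what makes the method robust to gradients of very different scales.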
A large community of programmers and machine learning engineers is dedicated to researching and developing faster and more versatile software platforms for Deep Learning use cases. Examples such as TensorFlow, PyTorch, and MXNet provide an appealing experience and helpful interfaces to train and test DNNs with fast and memory efficient implementations. Almost every framework includes Convolution, Deconvolution, Pooling, Fully Connected, Batch Normalization and Dropout layers, with almost all popular optimization methods implemented. In the near future, the medical imaging industry in general, and CT reconstruction in particular, will move to a new level with the approval of Machine Learning techniques and their outcomes.
V Conclusion and Discussion
In this article, the low dose CT reconstruction problem has been investigated from a machine learning point of view. CT imaging, reconstruction methods, the low dose problem and the advantages and disadvantages of model-based approaches were covered as well.
Deep neural networks play an important role in the development of modern technology and, as in every other branch of signal processing, they strongly influence medical imaging science. Currently, both the academic and the industrial worlds point to DNNs as a solution for low dose CT imaging. The main reason is the lack of a single representative model for such a problem. Deep Learning provides powerful tools for learning the artefacts caused by low SNR or an incomplete set of observations in the sinogram signal. Two main approaches to using data-driven methods for reconstruction have been proposed in the literature. The first is to exploit these techniques to remove the noise and artefacts from the reconstructed image, and the second is to train an end to end model to translate sensor signal data straight into the image space representation. Both methods were investigated in the current article, and it was concluded that the first approach gives more satisfactory results in low dose CT use cases for two reasons:
1. In the end to end approach, the size of the model grows very rapidly with the size of the input signal (the fully connected layers scale with the product of their input and output sizes), which increases the chance of overfitting and makes the implementation nontrivial in real life scenarios that require high-resolution acquisitions.
2. The end to end model totally ignores the scanning geometry. This leads to larger problems in low dose cases, where the underlying problems are highly ill-posed. Based on the observations in section III, this information plays a crucial role in retaining details in the reconstructed image.
The industrial market favors products from the first approach, which use DNNs as helping hands for model-based methods. Moreover, judging from visual observation of the published results, these products compensate for lowering the X-ray tube current rather than for limited or incomplete projection scenarios. The future of these technologies is bright thanks to affordable parallel processing hardware and free, open source software, and they will play an important role in developing more accurate and higher quality medical acquisitions.
This work is financially supported by VLAIO (Flemish Agency for Innovation and Entrepreneurship), through the ANNTOM project HBC.2017.0595.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning.” in AAAI, vol. 4, 2017, p. 12.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
-  R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
-  J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint, 2017.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.
-  S. Bazrafkan, S. Thavalengal, and P. Corcoran, “An end to end deep neural network for iris segmentation in unconstrained scenarios,” Neural Networks, vol. 106, pp. 79 – 95, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S089360801830193X
-  V. Varkarakis, S. Bazrafkan, and P. Corcoran, “Deep neural network and data augmentation methodology for off-axis iris segmentation in wearable headsets,” arXiv preprint arXiv:1903.00389, 2019.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
-  J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial network,” arXiv preprint arXiv:1609.03126, 2016.
-  D. Berthelot, T. Schumm, and L. Metz, “Began: Boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.
-  S. Bazrafkan and P. Corcoran, “Versatile auxiliary regressor with generative adversarial network (var+gan),” arXiv preprint arXiv:1805.10864, 2018.
-  S. Bazrafkan, H. Javidnia, and P. Corcoran, “Versatile auxiliary classifier with generative adversarial network (vac+gan),” arXiv preprint arXiv:1805.00316, 2018.
-  K. Boedeker, “Phd,” 2019. [Online]. Available: https://mfl.ssl.cdn.sdlmedia.com/636837173033229994OU.pdf
-  P. B. Bach, J. N. Mirkin, T. K. Oliver, C. G. Azzoli, D. A. Berry, O. W. Brawley, T. Byers, G. A. Colditz, M. K. Gould, J. R. Jett, A. L. Sabichi, R. Smith-Bindman, D. E. Wood, A. Qaseem, and F. C. Detterbeck, “Benefits and Harms of CT Screening for Lung Cancer A Systematic Review,” JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, vol. 307, no. 22, pp. 2418–2429, JUN 13 2012.
-  T. Kubo, P.-J. P. Lin, W. Stiller, M. Takahashi, H.-U. Kauczor, Y. Ohno, and H. Hatabu, “Radiation dose reduction in chest CT: A review,” AMERICAN JOURNAL OF ROENTGENOLOGY, vol. 190, no. 2, pp. 335–343, FEB 2008.
-  T. De Schryver, J. Dhaene, M. Dierick, M. N. Boone, E. Janssens, J. Sijbers, M. van Dael, P. Verboven, B. Nicolai, and L. Van Hoorebeke, “In-line NDT with X-Ray CT combining sample rotation and translation,” NDT & E INTERNATIONAL, vol. 84, pp. 89–98, DEC 2016.
-  J. Hiller and P. Hornberger, “Measurement accuracy in x-ray computed tomography metrology: Toward a systematic analysis of interference effects in tomographic imaging,” Precision Engineering, vol. 45, pp. 18 – 32, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141635915002238
-  L. Schoeman, P. Williams, A. du Plessis, and M. Manley, “X-ray micro-computed tomography (mu CT) for non-destructive characterisation of food microstructure,” TRENDS IN FOOD SCIENCE & TECHNOLOGY, vol. 47, pp. 10–24, JAN 2016.
-  E. Janssens, J. D. Beenhouwer, M. V. Dael, P. Verboven, B. Nicolai, and J. Sijbers, “Neural network based x-ray tomography for fast inspection of apples on a conveyor belt,” in IEEE International Conference on Image Processing, Sept 21-27 2015, pp. 917–921.
-  P. M. Shikhaliev, “Large-scale mv ct for cargo imaging: A feasibility study,” Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 904, pp. 35 – 43, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0168900218308490
-  G. Van Eyndhoven and J. Sijbers, Iterative reconstruction methods in X-ray CT. CRC Press, 2018, ch. 34, pp. 693–712.
-  J. Lemley, S. Bazrafkan, and P. Corcoran, “Deep learning for consumer devices and services: Pushing the limits for machine learning, artificial intelligence, and computer vision.” IEEE Consumer Electronics Magazine, vol. 6, no. 2, pp. 48–56, 2017.
-  D. Pelt, K. Batenburg, and J. Sethian, “Improving tomographic reconstruction from limited data using mixed-scale dense convolutional neural networks,” Journal of Imaging, vol. 4, no. 11, p. 128, 2018.
-  D. M. Pelt and J. A. Sethian, “A mixed-scale dense convolutional neural network for image analysis,” Proceedings of the National Academy of Sciences, vol. 115, no. 2, pp. 254–259, 2018.
-  H. Chen, Y. Zhang, M. K. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, and G. Wang, “Low-dose ct with a residual encoder-decoder convolutional neural network,” IEEE transactions on medical imaging, vol. 36, no. 12, pp. 2524–2535, 2017.
-  H. Chen, Y. Zhang, W. Zhang, P. Liao, K. Li, J. Zhou, and G. Wang, “Low-dose ct via convolutional neural network,” Biomedical optics express, vol. 8, no. 2, pp. 679–694, 2017.
-  E. Kang, J. C. Ye et al., “Wavelet domain residual network (wavresnet) for low-dose x-ray ct reconstruction,” arXiv preprint arXiv:1703.01383, 2017.
-  E. Kang, W. Chang, J. Yoo, and J. C. Ye, “Deep convolutional framelet denosing for low-dose ct via wavelet residual network,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1358–1369, 2018.
-  D. M. Pelt and K. J. Batenburg, “Fast tomographic reconstruction from limited data using artificial neural networks,” IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 5238–5251, 2013.
-  Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, Y. Zhang, L. Sun, and G. Wang, “Low dose ct image denoising using a generative adversarial network with wasserstein distance and perceptual loss,” IEEE transactions on medical imaging, 2018.
-  B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen, “Image reconstruction by domain-transform manifold learning,” Nature, vol. 555, no. 7697, p. 487, 2018.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung, “Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit,” Nature, vol. 405, no. 6789, p. 947, 2000.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. icml, vol. 30, no. 1, 2013, p. 3.
-  D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
-  Y. Nesterov, “A method for unconstrained convex minimization problem with the rate of convergence o (1/k^ 2),” in Doklady AN USSR, vol. 269, 1983, pp. 543–547.
-  J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
-  T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
-  J. Bergstra, F. Bastien, O. Breuleux, P. Lamblin, R. Pascanu, O. Delalleau, G. Desjardins, D. Warde-Farley, I. Goodfellow, A. Bergeron et al., “Theano: Deep learning on gpus with python,” in NIPS 2011, BigLearning Workshop, Granada, Spain, vol. 3. Citeseer, 2011.
-  A. Paszke, S. Chintala, R. Collobert, K. Kavukcuoglu, C. Farabet, S. Bengio, I. Melvin, J. Weston, and J. Mariethoz, “Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration, may 2017.”
-  T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.