Approximating Activation Functions

01/17/2020 ∙ by Nicholas Gerard Timmons, et al.

ReLU is widely seen as the default choice for activation functions in neural networks. However, there are cases where more complicated functions are required. In particular, recurrent neural networks (such as LSTMs) make extensive use of both the hyperbolic tangent (tanh) and sigmoid functions. These functions are expensive to compute. We used function approximation techniques to develop replacements for these functions and evaluated them empirically on three popular network configurations. We find safe approximations that yield a 10% to 37% performance improvement, are suitable for all cases we considered, and we believe are appropriate replacements for all networks using these activation functions. We also develop ranged approximations which only apply in some cases due to restrictions on their input domain. Our ranged approximations yield a performance improvement of 20% or more and considerably outperform the ad-hoc approximations used in Theano and in the implementation of Word2Vec.




1 Introduction

Function approximation is a software engineering technique in which we seek to replace complex functions with faster, but less accurate, alternatives. For example, consider the exponential function. This is a challenging and complex function to implement correctly and much research has been published on algorithms for achieving minimal error with correct rounding de Dinechin et al. (2005); De Dinechin et al. (2004); Defour et al. (2004). However, if the caller were able to tolerate some error we could replace the implementation with a few terms of its Taylor series. This yields faster execution at the cost of some loss of accuracy.
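As a concrete illustration (using exp as the worked example, since the function discussed above is not reproduced in this extract), a minimal Python sketch of a truncated Taylor series shows both the appeal and the danger: accuracy is excellent near zero but the worst-case error grows quickly away from it.

```python
import math

def exp_taylor(x, terms):
    """Truncated Taylor series for exp(x): sum of x**k / k! for k = 0 .. terms-1."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= x / (k + 1)   # next term: x**(k+1) / (k+1)!
    return total

# Near zero the truncation error is tiny...
print(exp_taylor(0.5, 10), math.exp(0.5))
# ...but further out the same number of terms is badly wrong.
print(exp_taylor(5.0, 10), math.exp(5.0))
```

This is exactly why, as the next paragraph notes, such approximations are usually accompanied by a worst-case error analysis, or restricted to applications that can tolerate the error.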

Much of the research on function approximation incorporates a significant amount of mathematical reasoning to argue that the approximation falls within a worst-case error bound in all cases: the Taylor Series example above would not be used in practice because its worst-case bound is too high.

However, there are many applications where such strict requirements are unnecessary and in this case one can make use of simpler (and considerably faster) approximations that are ‘good enough’ Sampson et al. (2013). In this paper we consider one such application: that of activation functions in neural networks. Not only is a neural network inherently tolerant to error but practitioners will admit that there is a certain art to the selection of activation functions (and other hyper-parameters) and thus a certain leeway in their accuracy.

Function approximation has been applied to activation functions before. The implementation of Word2Vec makes use of a lookup table for approximating the exponential function, and the popular machine learning library Theano includes an approximation of the sigmoid function called ultra_fast_sigmoid. Google also recently published a piecewise approximation of their ‘Swish’ activation function for use in their edge-computing TensorFlow interface, which shows the viability of approximation on limited hardware Howard et al. (2019). However, this paper is the first detailed study on the subject in this area, and shows that approximations can be used as drop-in replacements for current models. The approximations we present outperform all of the mentioned alternative approximation approaches.

In this paper we consider a range of approximations with a trade-off between complexity and accuracy. We study the overall impact on training and prediction time for popular neural network architectures. We identify two approximation options for each function: a safe approximation that works in all cases and a ranged approximation that requires some restrictions on its input domain. For all networks tested we find that our safe approximations outperform the standard implementations with a 10% to 37% improvement in training times. Where appropriate, our ranged implementations provide an improvement of 20% or more. Our training and inference benchmarks are initially run on CPU as a proof of concept.

We believe that our safe approximations are of particular relevance to library developers since they are better-performing drop-in replacements for existing functions. There is also a growing range of specialised hardware designed for accelerating machine learning tasks and in future work we would like to consider how well our approximations might translate into hardware implementations.

We provide an open-source implementation of our approach in Julia using the Flux machine learning library. We argue that these approximations are applicable to other libraries too and we include a brief study using TensorFlow Abadi et al. (2016) as evidence.

2 Activation functions

Activation functions are used in neural networks to introduce non-linearity. For activation functions with bounded results, the popular choices are logistic, trigonometric and clamping functions, with the sigmoid and hyperbolic tangent (tanh) functions being used most frequently. For unbounded or partially bounded activations, the ReLU function and its variants are most popular.

ReLU is the dominant choice for an activation function, but sigmoid and tanh still have significant use as gating functions in feed-forward and recurrent networks.

We focus on sigmoid and tanh in particular as major components of the LSTM's gating behaviour, which in its original form uses three calls to sigmoid and two calls to tanh Graves (2013).

For cases (such as LSTMs) where ReLU is not appropriate the performance impact can be notable: in micro-benchmarks measuring function execution time we found sigmoid and tanh to be 3.6x and 7.9x more expensive than ReLU respectively. In this paper we focus on these two functions given their popularity and relatively high cost.

A significant proportion of CPU time is spent in the computation of values in activation functions. For example, for a densely connected network with 51,000 trainable parameters and two equal-sized hidden layers, we measure that approximately 29% of the time is spent computing the activation function. For context, in the paper which introduced sequence-to-sequence learning for neural networks the authors use a 5-layer LSTM network which has 380 million trainable parameters Sutskever et al. (2014).

3 Function Approximation

A conventional implementation of a mathematical function will aim to compute results which arise from rounding actual values to the precision of the data-type used. For example, implementations of the C math library often come with documentation stating how many units in the last place (ULPs) of error the implementation of a specific function may exhibit in the worst case. An application might be able to tolerate considerably more error than this but it can be difficult to prove conclusively. In this paper we seek to validate our approximations empirically. We believe this is appropriate since the ‘correct’ choice of activation function is most often demonstrated empirically too.

In addition to tolerating error in the function’s outputs an application might also restrict the function’s input domain. This provides further opportunity for approximation since it is only necessary to mimic the original function within the domain of interest. We therefore develop two classes of approximation. Our ranged approximation produces higher performance under the assumption that the input domain is approximately -5.5 to 5.5. Our safe approximation makes no assumptions about input domain and so is a drop-in replacement for all networks.

We consider two different approximation approaches: 1) taking a mathematically derived approximation of the function and then using its parameters to produce a range of alternative versions at varying precision; 2) performing a numerical regression to fit the given function. The mathematically derived approach produces more stable results but can be more complex. The regression approach often produces faster and simpler functions, but the result depends heavily on the sampled inputs and the function being replaced.


We first consider an optimisation of ReLU which transforms the max operator (which may use a branch) into a sum and a division. Under the rules of IEEE floating-point arithmetic (IEEE 754, 1985) this may slightly change the returned value, but it avoids the branch if the max function is not implemented branchlessly. In our tests this resulted in an insignificant change and so we use the standard implementation of ReLU as our baseline.
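The two forms can be sketched as follows (illustrative Python; (x + |x|)/2 is one standard sum-and-division rewriting of max, as the exact form is not spelled out in this extract):

```python
def relu_max(x):
    """Standard ReLU: a max, which may compile to a branch."""
    return max(0.0, x)

def relu_branchless(x):
    """The same result via a sum and a division: max(0, x) == (x + |x|) / 2."""
    return (x + abs(x)) / 2.0
```

For ordinary finite inputs both forms agree; the rewritten version trades the possible branch for two cheap arithmetic operations.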


Figure 1: Approximation functions for tanh and their associated error, calculated as the absolute difference from the original function at each input x.

We consider four forms of approximation of tanh, which we show in Figure 1 and describe below.

Firstly we implemented tanh as Lambert's continued fraction. We limit ourselves to a finite number (n) of iterations depending on the precision we want and then simplify using a symbolic algebra package. We call this approximation tanh_cont_n.
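A sketch of this construction (before simplification by a symbolic algebra package): Lambert's continued fraction tanh(x) = x / (1 + x² / (3 + x² / (5 + …))), truncated after n partial denominators.

```python
def tanh_cont(x, n):
    """Lambert's continued fraction for tanh, truncated after n levels.
    The partial denominators are the odd numbers 1, 3, 5, ..., 2n-1."""
    x2 = x * x
    denom = 2.0 * n - 1.0           # innermost (deepest) partial denominator
    for k in range(n - 1, 0, -1):   # unwind outwards: 2k-1 for k = n-1 .. 1
        denom = (2.0 * k - 1.0) + x2 / denom
    return x / denom
```

Even at n = 4 (as in tanh_cont_4) the truncation is very accurate near zero; accuracy degrades as |x| grows, which is part of the safe-versus-ranged trade-off discussed later.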

We also consider two forms of polynomial approximation of tanh: 1) a truncated Taylor-series-style polynomial of the form a_0 + a_1 x + … + a_n x^n and; 2) Padé approximants of the form (a_0 + a_1 x + … + a_n x^n) / (b_0 + b_1 x + … + b_m x^m). We select a desired number of terms (n, and m in the Padé case), choose a uniform sample of 5000 values in the range -5.5 to 5.5 and then determine the values of the coefficients (a_i and b_j) using a least-squares fitting procedure. We choose this range based on the shape of tanh (and sigmoid), which are very close to their asymptotic values by this point. Using these two techniques yields the approximations tanh_taylor_n and tanh_pade_n_m. Minor variations in the number of points sampled and the range covered did not have a big impact on our results.
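The polynomial fitting step can be sketched as follows (illustrative Python; the paper fits 5000 samples over -5.5 to 5.5, whereas this sketch uses a narrower range and fewer points so that the plain normal-equations solver stays well conditioned):

```python
import math

def lstsq_poly_fit(f, degree, lo, hi, samples):
    """Fit a polynomial of the given degree to f on [lo, hi] by least
    squares, solving the normal equations with Gaussian elimination."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    n = degree + 1
    # Normal equations A c = b: A[j][k] = sum x^(j+k), b[j] = sum f(x) x^j
    A = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(f(x) * x ** j for x in xs) for j in range(n)]
    for col in range(n):                     # elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            m = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= m * A[col][k]
            b[r] -= m * b[col]
    c = [0.0] * n
    for j in range(n - 1, -1, -1):           # back substitution
        c[j] = (b[j] - sum(A[j][k] * c[k] for k in range(j + 1, n))) / A[j][j]
    return c                                  # c[0] + c[1] x + ... + c[degree] x^degree

def poly_eval(c, x):
    """Horner evaluation of the fitted polynomial."""
    acc = 0.0
    for coef in reversed(c):
        acc = acc * x + coef
    return acc

# Example: a degree-7 least-squares fit of tanh on [-3, 3]
coeffs = lstsq_poly_fit(math.tanh, 7, -3.0, 3.0, 400)
```

Because tanh is odd and the sample is symmetric, the even-degree coefficients come out near zero; a production fit (or a Padé fit, which divides two such polynomials) would use an orthogonal-basis or QR-based solver rather than raw normal equations.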

As a final option for tanh we consider the Serpentine function (serp). As serp does not fit tanh beyond the main gradient region, we also introduce a variant serp_clamp which is clamped to -1 or 1 for inputs outside a fixed range.


Figure 2: Approximation functions for sigmoid and their associated error, calculated as the absolute difference from the original function at each input x.

Figure 2 shows three approximation forms for the sigmoid function. These arise from approximating the exponential function inside the sigmoid implementation using the well-known limit exp(x) = lim n→∞ (1 + x/n)^n, truncated at a finite n. We call approximations of this form sigmoid_fastexp_n.
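One plausible reading of this construction (the exact form used in the paper is not reproduced in this extract) replaces e^{-x} in sigmoid(x) = 1/(1 + e^{-x}) with the truncated limit (1 - x/n)^n; choosing n = 2^k lets the power be computed with just k squarings:

```python
def sigmoid_fastexp(x, k):
    """sigmoid(x) = 1 / (1 + e^{-x}) with e^{-x} approximated by
    (1 - x/n)^n for n = 2**k, computed via k repeated squarings."""
    n = 2 ** k
    y = 1.0 - x / n
    for _ in range(k):
        y = y * y
    return 1.0 / (1.0 + y)
```

With n = 512 the approximation tracks sigmoid closely, but with n = 2 it misbehaves badly once |x| is large (e.g. sigmoid_fastexp(10, 1) is near 0 rather than near 1), which illustrates why the small-n variants are only usable as ranged approximations.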

We also generate fits for Taylor and Padé polynomials in the same manner as for tanh. This yields approximations sigmoid_taylor_n and sigmoid_pade_n_m.

We note that there is an alternative approach to the fast computation of the exponential function based on exploiting the definition of IEEE floating-point numbers Schraudolph (1999). This is no longer as effective as it once was due to its reliance on a union structure to treat the value as an integer in some places and as a floating-point number in others. The technique does not transfer easily to SIMD execution Nassimi and Sahni (1981), where arrays of numbers are processed in parallel and it is not always trivial to cast between different type representations in memory.
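The Schraudolph trick can be sketched in Python, with struct standing in for the C union; the constants (2^20 / ln 2 and the 60801 correction) are those published in Schraudolph (1999):

```python
import math
import struct

def schraudolph_exp(x):
    """Schraudolph (1999): approximate exp(x) by writing a scaled, shifted
    value directly into the exponent field of an IEEE 754 double."""
    a = 1048576 / math.log(2.0)   # 2**20 / ln 2
    b = 1072632447                # 1023 * 2**20 - 60801 (bias minus error correction)
    i = int(a * x + b) << 32      # place the value into the high 32 bits
    return struct.unpack('<d', struct.pack('<q', i))[0]
```

The result is only accurate to a few percent, but costs one multiply, one add and one bit-level reinterpretation, which is exactly the reinterpretation step that is awkward to express in SIMD code.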

4 Performance results

In practice it is the performance of the whole network which is important rather than a particular activation function. For example, when training a neural network the ideal choice of activation function will result in the lowest loss for the least training time. This means that when selecting an approximation we are looking for a trade-off between the computation cost and the resulting error. Fast approximations with high error might cause a network to take longer to converge than slower approximations with a lower error.

It is currently not possible to analytically determine the best choice of activation function for some network architecture. Similarly we are unable to prove that a particular approximation is a better choice in all cases. Instead we seek to empirically justify our choices through end-to-end measurements on three popular representative machine learning tasks covering three popular network architectures.

MNIST Classifier

We consider the task of classifying images in the MNIST dataset and use a network comprising convolutional and dense layers inspired by the design of successful neural network structures Simard et al. (2003), based on the implementation from a selection of provided models for Flux Innes (2018). While ReLU is sometimes used for convolutional image classification tasks for its speed and simplicity, sigmoid and tanh have been used more commonly in the past Kalchbrenner et al. (2014); Lawrence et al. (1997); Krizhevsky et al. (2012) and have some desirable properties which can reduce training times in some scenarios Ciresan et al. (2011).

MNIST Autoencoder

Autoencoders are common in the area of generative machine learning and are often evaluated using the MNIST dataset LeCun et al. (1998). We implemented a simple autoencoder to work with this dataset. Autoencoders are compatible with many different activations when structured with different configurations Szegedy et al. (2013); Burda et al. (2015); Ng and others (2011); Vincent et al. (2008); Toderici et al. (2015); Chen et al. (2012). Our model is the simplest example of an autoencoder and as such is compatible with ReLU, sigmoid and tanh activations.


Character-based RNN

Sequence-to-sequence problems are another common task. We selected a common LSTM network layout with 2 hidden LSTM layers for text generation that takes a text dataset and learns to generate similar text. LSTM cells make use of both sigmoid and tanh activation functions and so we considered approximations to both.

                                     Loss                Time (s)
Activation                           Abs.     Rel.       Abs.      Rel.    Choice

MNIST classifier (Convnet)
ReLU                                 25.99    1.000      486.9     1.000
sigm                                 20.23    1.000      889.5     1.000
sigm_fastexp_2                       17.21    0.851      643.5     0.723   Ranged
sigm_fastexp_512                     20.39    1.008      798.5     0.898   Safe
sigm_taylor_9                        -        -          -         -
sigm_pade_4_4                        21.00    1.038      702.9     0.790
ultra_fast_sigmoid                   20.63    1.020      743.2     0.836
word2vec                             456.1    22.55      833.9     0.937
tanh                                 16.64    1.000      1126      1.000
tanh_cont_4                          16.54    0.994      654.0     0.581
tanh_taylor_9                        -        -          -         -
tanh_pade_4_4                        13.98    0.840      712.8     0.633   Safe
serp                                 14.93    0.897      523.7     0.465   Ranged
serp_clamp                           18.89    1.135      604.8     0.537

MNIST autoencoder
ReLU                                 1.441    1.000      25.87     1.000   -
sigm                                 4.166    1.000      35.54     1.000
sigm_fastexp_2                       3.924    1.006      28.33     0.797   Ranged
sigm_fastexp_512                     4.167    1.000      31.47     0.886   Safe
sigm_taylor_9                        6.788    1.630      32.74     0.921
sigm_pade_4_4                        4.161    0.999      30.56     0.860
ultra_fast_sigmoid                   4.210    1.011      32.48     0.914
word2vec                             13.94    3.347      32.58     0.917
tanh                                 2.234    1.000      37.14     1.000
tanh_cont_4                          2.256    1.010      30.39     0.818
tanh_taylor_9                        2.237    1.001      35.73     0.962
tanh_pade_4_4                        2.242    1.004      32.48     0.875   Safe
serp                                 2.147    0.961      28.36     0.770   Ranged
serp_clamp                           2.151    0.963      31.36     0.845

Character RNN (LSTM): sigmoid variant + tanh variant
ReLU + ReLU                          NaN      -          -         -
sigm + tanh                          79.16    1.000      1502.93   1.000
sigm_fastexp_2 + tanh                82.30    1.040      1406.603  0.936
sigm_fastexp_512 + tanh              77.65    0.981      1401.893  0.933
sigm_taylor_9 + tanh                 -        -          -         -
sigm_pade_4_4 + tanh                 78.01    0.985      1462.484  0.973
sigm + tanh_cont_4                   78.68    0.994      1361.133  0.906
sigm + tanh_taylor_9                 -        -          -         -
sigm + tanh_pade_4_4                 77.99    0.985      1407.648  0.937
sigm + serp                          78.29    0.989      1303.92   0.868
sigm + serp_clamp                    78.46    0.991      1367.542  0.910
sigm_fastexp_2 + serp                79.82    1.008      1332.127  0.886   Ranged
sigm_fastexp_2 + serp_clamp          82.81    1.046      1184.179  0.788
sigm_fastexp_512 + tanh_cont_4       -        -          -         -
sigm_fastexp_512 + tanh_pade_4_4     79.34    1.002      1446.656  0.963
sigm_fastexp_512 + serp              77.76    0.982      1150.14   0.765
sigm_fastexp_512 + serp_clamp        79.24    1.001      1155.745  0.769   Safe
ultra_fast_sigmoid + tanh            78.88    0.996      1450.332  0.965
word2vec + tanh                      100.7    1.271      1561.443  1.039

Table 1: Performance results for the range of approximations considered. The Rel. columns indicate performance relative to the replaced function (smaller values are better). Entries marked '-' failed to converge. We discuss the performance of ultra_fast_sigmoid and word2vec later.

We used each approximation in turn to train our three networks for a fixed number of epochs, recording the loss and the total time taken. We then compared these values to the non-approximated activation functions to compute the relative loss and relative time taken (Table 1). Relative values are with respect to the function being replaced (rather than ReLU) and smaller relative values indicate better performance. We used an Intel Xeon E5-2673 v3 @ 2.40GHz (14GB RAM) for the MNIST workloads and a dual-core Intel Xeon E5-2673 v4 @ 2.30GHz (8GB RAM) for the RNN workload. We ran our benchmarks on Azure cloud machines and we provide virtual machine images for Azure which replicate our results.

In most cases our replacement activation functions either converged with similar loss to the original function or failed to converge entirely (marked in the table). We found a few instances (such as sigm_fastexp_2 in Convnet) for which the loss was drastically lower (44%). We highlight this case as another example of the difficulty of making definitive statements about the correct choice of activation function. In all cases where the training loss converged our approximations resulted in a reduced overall training time.

We found that two functions (sigm_fastexp_512 and tanh_pade_4_4) produced the best reduction in training time whilst working in all cases. We mark these as our chosen safe approximations. We also found functions (such as sigm_fastexp_2 and serp) which produced even better reductions in training times but have such significant divergence from the target functions that we cannot argue that they are a suitable replacement in all cases. We mark these as ranged approximations, suitable for networks (such as these) where the activation input values are roughly within the range of -5.5 to 5.5.

We also found performance improvements when using approximations for inference rather than training. For the Character-based RNN (LSTM) model our safe approximations offered ~8% savings whereas our ranged approximations allowed for savings of up to 20% when performing 1000 sequential inferences. This is potentially more significant than the reduction in training times since inference is often performed on restricted hardware such as mobile devices.

5 Comparison with ultra_fast_sigmoid

ultra_fast_sigmoid is a fast sigmoid implementation in the popular machine learning library Theano Bergstra et al. (2010). It is implemented as a piece-wise approximation and is one of only a few approximations available as standard in popular machine learning libraries.

Figure 3: Shape and relative error of Theano’s ultra_fast_sigmoid compared to sigm_fastexp_512.

We compared ultra_fast_sigmoid to sigm_fastexp_512. Figure 3 shows the much greater approximation error for ultra_fast_sigmoid. For the Autoencoder and CharRNN workloads sigm_fastexp_512 results in lower loss in less time. For the Convnet workload sigm_fastexp_512 is slightly slower but with drastically lower loss.

6 Comparison with Word2Vec lookup table

Figure 4: Loss over time for the word2vec table based sigmoid function.

Word2Vec makes use of shallow 2-layer networks to create word embeddings. In these models the authors used an approximate sigmoid implementation based on a 1000-element lookup table. We have compared this lookup table to our approximations in our test models.

Our results (Table 1) show that applying this approximation results in more loss for a similar or slightly reduced training time. When looking at the loss over time (Figure 4) we see that in some cases training fails to make progress. We believe this occurs due to the lack of interpolation in the table lookup resulting in quantisation of outputs. As a result, small incremental changes to the weight values may not result in a change of output. This could potentially be mitigated with the use of a different optimiser or by adding interpolation (at the cost of a performance reduction).
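The quantisation effect is easy to see in a sketch of the lookup scheme (using the constants from the word2vec reference implementation, EXP_TABLE_SIZE = 1000 and MAX_EXP = 6): nearby inputs index the same table entry, so small weight updates can leave the output unchanged.

```python
import math

EXP_TABLE_SIZE = 1000   # constants from the word2vec reference implementation
MAX_EXP = 6

# Precompute sigmoid over [-MAX_EXP, MAX_EXP]
_table = []
for i in range(EXP_TABLE_SIZE):
    e = math.exp((i / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP)
    _table.append(e / (e + 1.0))

def sigmoid_table(x):
    """Table-based sigmoid: clamp outside the range, nearest-entry lookup inside."""
    if x <= -MAX_EXP:
        return 0.0
    if x >= MAX_EXP:
        return 1.0
    # No interpolation: nearby inputs quantise to the same table entry
    return _table[int((x + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]
```

For example, inputs 0.001 and 0.005 map to the same entry, so a gradient step that nudges an activation input by less than the bin width produces no change in the network's output.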

Again, sigmoid_fastexp_512 results in less loss in less training time.

7 Approximations in TensorFlow

TensorFlow is one of the most commonly used machine learning libraries. Despite the fact that it is a Python library it achieves high performance through native implementation of the core functions. As such this makes experimentation with novel activation functions difficult: one must inject a low-level implementation and then provide a mechanism to reference it from the high-level code. Flux does not suffer from this issue because the entire system is implemented in Julia (a relatively high performance language) and alternative activation functions can be straightforwardly applied on a level playing field with the default options.

Despite being unable to directly evaluate our new activation functions in TensorFlow we were able to identify the optimisation space by measuring the performance of the simplest possible activation function, the identity function.

Figure 5: Training time per epoch for the MNIST classifier in TensorFlow and Flux using the tanh and identity activation functions.

Figure 5 shows the time taken to train a simple MNIST classifier (one convolution layer and two dense layers) when using the tanh and identity functions in both TensorFlow (left) and Flux (right). The approximations we have discussed in this paper fall within the region between these two lines. Even with this simple model this demonstrates that there is scope for improving model performance in TensorFlow by optimising activation functions.

8 Threats to validity

Although we have tried to evaluate our activation functions on three representative workloads it is not possible to say for sure how well they will work in the general case. However, the relative errors in our safe approximations are so small that we would expect them to be drop-in replacements.

Neural networks are commonly trained offline on large compute clusters whereas inference happens with interactive latencies and increasingly on limited hardware (such as mobile phone handsets). The majority of our results focus on training times because training loss provides a convenient measurement to check that the activation function is performing well. However, we also found that our approximations improve inference and mention this in Section 4.

Our results only consider the performance on CPUs whereas much training (and inference) happens on GPUs or specialist hardware such as TPUs. We therefore cannot say how our approximations perform in these circumstances. We argue that our safe approximations are useful even if only applied to CPUs since they generally reduce training times with no impact on loss. It would be particularly interesting to consider hardware implementations of these approximations for specialist hardware. We leave this for future work.

9 Conclusion

We have shown that approximation of activation functions in neural networks can improve the training and inference time without a negative effect on the accuracy of the network.

We investigated a range of functions and propose two safe approximations, sigm_fastexp_512 and tanh_pade_4_4, for sigmoid and tanh respectively. These approximations produce faster training and inference times without damaging prediction accuracy. Our ranged approximations produced even larger speedups but will not work for all networks.

As such we think these functions are candidates for inclusion as standard options in machine learning libraries. We also showed that these functions outperform existing approximations in these libraries.

In the future we hope to expand on this work by integrating approximation as a standard option into many machine learning libraries so that it can be used to improve training time on a larger scale. Additionally we wish to analyse the effect of approximations on large and complex networks and hardware to understand if there is a structure which may cause approximations to not be beneficial.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §1.
  • [2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio (2010) Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), Vol. 4. Cited by: §5.
  • [3] Y. Burda, R. Grosse, and R. Salakhutdinov (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519. Cited by: §4.
  • [4] M. Chen, Z. Xu, K. Weinberger, and F. Sha (2012) Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683. Cited by: §4.
  • [5] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber (2011) Flexible, high performance convolutional neural networks for image classification. In Twenty-Second International Joint Conference on Artificial Intelligence. Cited by: §4.
  • [6] F. De Dinechin, D. Defour, and C. Lauter (2004) Fast correct rounding of elementary functions in double precision using double-extended arithmetic. Ph.D. Thesis, INRIA. Cited by: §1.
  • [7] F. de Dinechin, A. V. Ershov, and N. Cast (2005) Towards the post-ultimate libm. In 17th IEEE Symposium on Computer Arithmetic (ARITH’05), pp. 288–295. Cited by: §1.
  • [8] D. Defour, G. Hanrot, V. Lefevre, J. Muller, N. Revol, and P. Zimmermann (2004) Proposal for a standardization of mathematical function implementation in floating-point arithmetic. Numerical algorithms 37 (1-4), pp. 367–375. Cited by: §1.
  • [9] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §2.
  • [10] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. arXiv preprint arXiv:1905.02244. Cited by: §1.
  • [11] (1985) IEEE standard for binary floating-point arithmetic. Institute of Electrical and Electronics Engineers, New York. Note: Standard 754–1985. Cited by: §3.1.
  • [12] M. Innes (2018) Flux: elegant machine learning with julia. Journal of Open Source Software 3 (25), pp. 602. Cited by: §4.
  • [13] N. Kalchbrenner, E. Grefenstette, and P. Blunsom (2014) A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188. Cited by: §4.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.
  • [15] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back (1997) Face recognition: a convolutional neural-network approach. IEEE transactions on neural networks 8 (1), pp. 98–113. Cited by: §4.
  • [16] Y. LeCun, C. Cortes, and C. Burges (1998) MNIST dataset. URL http://yann.lecun.com/exdb/mnist. Cited by: §4.
  • [17] D. Nassimi and S. Sahni (1981) Data broadcasting in SIMD computers. IEEE Transactions on Computers 100 (2), pp. 101–107. Cited by: §3.3.
  • [18] A. Ng et al. (2011) Sparse autoencoder. CS294A Lecture notes 72 (2011), pp. 1–19. Cited by: §4.
  • [19] A. Sampson, L. Ceze, and D. Grossman (2013) Good-enough computing. IEEE Spectrum 50 (10), pp. 54–59. Cited by: §1.
  • [20] N. N. Schraudolph (1999) A fast, compact approximation of the exponential function. Neural Computation 11 (4), pp. 853–862. Cited by: §3.3.
  • [21] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. (2003) Best practices for convolutional neural networks applied to visual document analysis.. In Icdar, Vol. 3. Cited by: §4.
  • [22] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2.
  • [23] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §4.
  • [24] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar (2015) Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085. Cited by: §4.
  • [25] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. Cited by: §4.