RNN benchmarks of pytorch, tensorflow and theano
This study provides benchmarks for different implementations of LSTM units between the deep learning frameworks PyTorch, TensorFlow, Lasagne and Keras. The comparison includes cuDNN LSTMs, fused LSTM variants and less optimized, but more flexible LSTM implementations. The benchmarks reflect two typical scenarios for automatic speech recognition, notably continuous speech recognition and isolated digit recognition. These scenarios cover input sequences of fixed and variable length as well as the loss functions CTC and cross entropy. Additionally, a comparison between four different PyTorch versions is included. The code is available online at https://github.com/stefbraun/rnn_benchmarks.
Deep learning frameworks such as PyTorch, TensorFlow, Theano-based Lasagne theano ; lasagne , Keras chollet2015keras , Chainer chainer_learningsys2015 and others wikipedia have been introduced and developed at a rapid pace. These frameworks provide neural network units, cost functions and optimizers to assemble and train neural network models. In typical research applications, network units may be combined or modified, or new network units may be introduced. In all cases the training time is a crucial factor during the experimental evaluation: faster training allows for more experiments, larger datasets or reduced time-to-results. Therefore, it is advantageous to identify deep learning frameworks that allow for fast experimentation.
This study focuses on benchmarking the widely used LSTM cell hochreiter1997long that is available in the mentioned frameworks. The LSTM architecture allows for various optimization steps such as increased parallelism, fusion of point-wise operations and others appleyard2016optimizing . While an optimized LSTM implementation trains faster, it is typically more difficult to implement (e.g. writing custom CUDA kernels) or modify (e.g. exploring new cell variants, adding normalization etc.). As a result, some frameworks provide multiple LSTM implementations that differ in training speed and flexibility towards modification. For example, TensorFlow offers 5 LSTM variants: (1) BasicLSTMCell, (2) LSTMCell, (3) LSTMBlockCell, (4) LSTMBlockFusedCell and (5) cuDNNLSTM.
To summarize, neural network researchers are confronted with a two-fold choice of framework and implementation, and identifying the fastest option helps to streamline the experimental research phase. This study aims to assist the identification process with the following goals:
Provide benchmarks for different LSTM implementations across deep learning frameworks
Share the benchmark scripts for transparency and to help people coding up neural networks in different frameworks
The deep learning community has put considerable effort into benchmarking and comparing neural-network related hardware, libraries and frameworks. A non-exhaustive list is presented below:
DeepBench deepbench . This project aims to benchmark basic neural network operations such as matrix multiplies and convolutions for different hardware platforms and neural network libraries such as cuDNN or MKL. In comparison, our study focuses solely on the LSTM unit and compares on the higher abstraction level of deep learning frameworks.
DAWNBench coleman2017dawnbench . This is a benchmark suite for complete end-to-end models that measures computation time and cost to train deep models to reach a certain accuracy. In contrast, our study is limited to measuring the training time per batch.
The experiments cover (1) a comparison between the PyTorch, TensorFlow, Lasagne and Keras frameworks and (2) a comparison between four versions of PyTorch. The experimental details are explained in the following sections.
A summary of the frameworks, backends and CUDA/cuDNN versions is given in Table 1. In total, four deep learning frameworks are involved in this comparison: (1) PyTorch, (2) TensorFlow, (3) Lasagne and (4) Keras. While PyTorch and TensorFlow can operate as standalone frameworks, the Lasagne and Keras frameworks rely on backends that handle the tensor manipulation. Keras allows for backend choice and was evaluated with the TensorFlow and Theano theano backends. Lasagne is limited to the Theano backend. Care was taken to use the same CUDA version 9.0 and a cuDNN 7 variant when possible. The frameworks were not compiled from source, but installed with the default conda or pip packages. Note that the development of Theano has been stopped theano_stop .
|Lasagne||0.2.1dev||April 2018||Theano 1.0.1||9.0||7005|
|Keras||2.1.6||April 2018||Theano 1.0.1, TensorFlow 1.8.0||9.0||7005|
The benchmarks cover 10 LSTM implementations, including the cuDNN variant appleyard2016optimizing and various optimized and basic variants. The complete list of the covered LSTM implementations is given in Table 2. The main differences between the implementations occur in the computation of a single time step and the realization of the loop over time; possible optimization steps are described in appleyard2016optimizing . While the cuDNN and optimized variants are generally faster, the basic variants are of high interest because they are easy to modify, which simplifies the exploration of new recurrent network cells for researchers.
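To illustrate the two points where implementations differ, the following is a minimal NumPy sketch of a basic LSTM: one function computes a single time step, and a plain Python for loop realizes the loop over time. It is not taken from the benchmark repository. The function names, the gate ordering (input, forget, cell candidate, output) and the weight layout are assumptions chosen for illustration, and the batch size and sequence length are reduced to keep the demo small.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """Single LSTM time step (hypothetical helper, illustrative layout).

    x_t: (N, C) input, h_prev/c_prev: (N, H) previous state,
    W: (C, 4H) input weights, U: (H, 4H) recurrent weights, b: (4H,) bias.
    """
    H = h_prev.shape[1]
    # All four gate pre-activations are computed with two matrix multiplies.
    gates = x_t @ W + h_prev @ U + b
    i = sigmoid(gates[:, 0 * H:1 * H])  # input gate
    f = sigmoid(gates[:, 1 * H:2 * H])  # forget gate
    g = np.tanh(gates[:, 2 * H:3 * H])  # cell candidate
    o = sigmoid(gates[:, 3 * H:4 * H])  # output gate
    # Point-wise operations; these are what fused variants merge.
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


def lstm_forward(x, W, U, b):
    """Loop over time with a Python for loop. x: (N, T, C) -> (N, T, H)."""
    N, T, _ = x.shape
    H = U.shape[0]
    h = np.zeros((N, H))
    c = np.zeros((N, H))
    outputs = []
    for t in range(T):
        h, c = lstm_step(x[:, t, :], h, c, W, U, b)
        outputs.append(h)
    return np.stack(outputs, axis=1)


rng = np.random.default_rng(0)
N, T, C, H = 4, 10, 123, 320  # reduced batch size and length for the demo
x = rng.standard_normal((N, T, C))
W = rng.standard_normal((C, 4 * H)) * 0.01
U = rng.standard_normal((H, 4 * H)) * 0.01
b = np.zeros(4 * H)
out = lstm_forward(x, W, U, b)
print(out.shape)
```

Optimized variants compute the same recurrence but fuse the point-wise gate operations or move the loop over time into a single operation, which is what makes them faster and also harder to modify.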
Four network configurations were evaluated as reported in Table 3.
The input data covers a short sequence length scenario (100 time steps) and a long sequence length scenario (up to 1000 time steps). The data is randomly sampled from a normal distribution.
Short & fixed length. The short input size is 64x100x123 (batch size x time steps x features) and the sequence length is fixed to 100 time steps. The target labels consist of 10 classes and 1 label is provided per sample. This setup is similar to an isolated digit recognition task on the TIDIGITS leonard1993tidigits data set. (Footnote: ASR task on TIDIGITS/isolated digit recognition, default training set (0.7 hours of speech): 123-dimensional filterbank features at 100 fps, average sequence length of 98, alphabet size of 10 digits and 1 label per sample.)
Long & fixed length. The long input size is 32x1000x123 and covers 1000 time steps. The target labels consist of 10 classes and 1 label is provided per sample.
Long & variable length. The sequence length varies between 500 and 1000 time steps, resulting in an average length of 750 time steps. All 32 samples are zero-padded to the maximum length of 1000 time steps, resulting in a 32x1000x123 input size. The target labels consist of 59 output classes, and each sample is provided with a label sequence of 100 labels. The variable length setup is similar to a typical continuous speech recognition task on the Wall Street Journal (WSJ) garofalo2007csr data set. (Footnote: ASR task on WSJ/continuous speech recognition, pre-processing with EESEN miao2015eesen on training subset si-284 (81 hours of speech): 123-dimensional filterbank features at 100 fps, average sequence length 783, alphabet size of 59 characters and average number of characters per sample 102.) The variable length information is presented in the library-specific format, i.e. as PackedSequence in PyTorch, as the sequence_length parameter of dynamic_rnn in TensorFlow and as a mask in Lasagne.
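The variable length input described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the code used for the benchmarks: the lengths are assumed to be drawn uniformly from 500 to 1000, the variable names are hypothetical, and the handling of the CTC blank label (which differs between libraries) is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_max, C = 32, 1000, 123    # batch size, padded length, features
n_classes, n_labels = 59, 100  # output classes, labels per sample

# Assumption: lengths are uniform over 500..1000 (inclusive), mean about 750.
lengths = rng.integers(500, T_max + 1, size=N)

# Normally distributed features, zero-padded beyond each sample's length.
x = np.zeros((N, T_max, C), dtype=np.float32)
for n, length in enumerate(lengths):
    x[n, :length] = rng.standard_normal((length, C)).astype(np.float32)

# Binary mask (1 = valid frame, 0 = padding) for libraries that expect a mask.
mask = (np.arange(T_max)[None, :] < lengths[:, None]).astype(np.float32)

# One random label sequence of 100 labels per sample.
labels = rng.integers(0, n_classes, size=(N, n_labels))

print(x.shape, mask.shape, labels.shape)
```

The same `lengths` array is the information that a sequence-length argument or a packed sequence representation would carry; only the container differs between libraries.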
The fixed length data is classified with the cross-entropy loss function, which is integrated in all libraries. The variable length data is classified with the CTC graves2006connectionist loss. For TensorFlow, the integrated tf.nn.ctc_loss is used. PyTorch and Lasagne do not include CTC loss functions, and so the respective bindings to Baidu’s warp-ctc baidu_ctc are used pytorch-warp-ctc ; theano_ctc .
All networks consist of LSTMs followed by an output projection. The LSTM part uses either a single layer of 320 unidirectional LSTM units, or four layers of bidirectional LSTMs with 320 units per direction. When using the cross entropy loss, the output layer consists of 10 dense units that operate on the final time step output of the last LSTM layer. In contrast, the output layer of the CTC loss variant uses 59 dense units that operate on the complete sequence output of the last LSTM layer.
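The difference between the two output layers comes down to which part of the LSTM output the dense layer sees. The NumPy sketch below uses random stand-ins for the LSTM output and a hypothetical `dense` helper. It assumes that the two directions of a bidirectional layer are concatenated (giving 640 features), and it uses a reduced batch size and sequence length to stay lightweight.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 50  # reduced from 32 x 1000 for the demo

# Random stand-ins for the output of the last LSTM layer.
uni_out = rng.standard_normal((N, T, 320))     # single unidirectional layer
bi_out = rng.standard_normal((N, T, 2 * 320))  # bidirectional, concatenated (assumption)


def dense(x, n_out, rng):
    """Hypothetical dense layer with small random weights and zero bias."""
    w = rng.standard_normal((x.shape[-1], n_out)) * 0.01
    return x @ w + np.zeros(n_out)


# Cross entropy variant: 10 dense units on the final time step only.
ce_logits = dense(uni_out[:, -1, :], 10, rng)  # shape (N, 10)

# CTC variant: 59 dense units applied to every time step.
ctc_logits = dense(bi_out, 59, rng)            # shape (N, T, 59)

print(ce_logits.shape, ctc_logits.shape)
```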
|Framework||Name||1x320/CE-short||1x320/CE-long||4x320/CE-long||4x320/CTC-long||Details|
|PyTorch||LSTMCell-basic||✓||✓||✗¹||✗¹||Custom code, pure PyTorch implementation, easy to modify. Loop over time with Python for loop|
|PyTorch||LSTMCell-fused²||✓||✓||✗¹||✗¹||LSTM with optimized kernel for single time steps. Loop over time with Python for loop|
|PyTorch||cuDNNLSTM³||✓||✓||✓||✓||Wrapper to cuDNN LSTM implementation appleyard2016optimizing|
|TensorFlow||LSTMCell||✓||✓||✓||✓||Pure TensorFlow implementation, easy to modify. Loop over time with tf.while_loop. Uses dynamic_rnn|
|TensorFlow||LSTMBlockCell||✓||✓||✓||✓||Optimized LSTM with single operation per time step. Loop over time with tf.while_loop. Uses dynamic_rnn|
|TensorFlow||LSTMBlockFusedCell||✓||✓||✗¹||✗¹||Optimized LSTM with single operation over all time steps. Loop over time is part of the operation.|
|TensorFlow||cuDNNLSTM||✓||✓||✓||✗⁴||Wrapper to cuDNN LSTM implementation appleyard2016optimizing|
|Lasagne||LSTMLayer||✓||✓||✓||✓||Pure Theano implementation, easy to modify. Loop over time with theano.scan|
|Keras||LSTM||✓||✓||✗¹||✗¹||Pure Theano/TensorFlow implementation, easy to modify. Loop over time with theano.scan or tf.while_loop|
|Keras||cuDNNLSTM||✓||✓||✗¹||✗¹||Wrapper to cuDNN LSTM implementation appleyard2016optimizing ⁵|
¹ no helper function to create multi-layer networks
² renamed from original name LSTMCell for easier disambiguation
³ renamed from original name LSTM for easier disambiguation
⁴ no support for variable sequence lengths
⁵ only available with TensorFlow backend
|Name||Layers x LSTM units||Output||Loss||Input [NxTxC]||Sequence length||Labels per sample|
|1x320/CE-short||1x320 unidirectional||10 Dense||cross entropy||64x100x123||fixed||1|
|1x320/CE-long||1x320 unidirectional||10 Dense||cross entropy||32x1000x123||fixed||1|
|4x320/CE-long||4x320 bidirectional||10 Dense||cross entropy||32x1000x123||fixed||1|
|4x320/CTC-long||4x320 bidirectional||59 Dense||CTC||32x1000x123||variable||100|
All benchmarks use the framework-specific default implementation of the widely used ADAM optimizer kingma2014adam .
The reported timings reflect the mean and standard deviation of the time needed to fully process one batch, including the forward and backward pass. The benchmarks were carried out on a machine with a Xeon W-2195 CPU (Skylake architecture), an NVIDIA GTX 1080 Founders Edition graphics card (fan speed at 100% to avoid thermal throttling) and the Ubuntu 16.04 operating system (CPU frequency governor in performance mode). The CPU scales its clock frequency between 2.3 GHz and 4.3 GHz; to reduce this fluctuation, the number of available CPU cores was restricted to 4 so that all 4 cores maintained a clock frequency of 4.0 GHz to 4.3 GHz. By default, Ubuntu 16.04 uses the powersave governor mode, which significantly decreased performance during the benchmarks. The measurements were conducted over 500 iterations; the first 100 iterations were considered warm-up and therefore discarded.
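The measurement procedure can be sketched with the Python standard library as shown below. This is a minimal illustration, not the benchmark script itself: the `benchmark` and `dummy_step` names are hypothetical, the workload is a CPU stand-in for one training batch, and the use of the sample standard deviation is an assumption.

```python
import statistics
import time


def benchmark(step_fn, n_iterations=500, n_warmup=100):
    """Time each call of step_fn, discard the warm-up iterations and
    return (mean, standard deviation, number of kept timings) in seconds."""
    timings = []
    for _ in range(n_iterations):
        start = time.perf_counter()
        step_fn()  # stands in for one batch: forward and backward pass
        timings.append(time.perf_counter() - start)
    kept = timings[n_warmup:]
    # Assumption: sample standard deviation; the source does not specify.
    return statistics.mean(kept), statistics.stdev(kept), len(kept)


def dummy_step():
    # Stand-in workload; a real benchmark would process one batch here.
    sum(i * i for i in range(1000))


mean_s, std_s, n_kept = benchmark(dummy_step)
print(f"{mean_s * 1e3:.3f} ms +/- {std_s * 1e3:.3f} ms over {n_kept} iterations")
```

When the step runs on a GPU, kernel launches are typically asynchronous, so a real measurement would need to wait for the device to finish before reading the clock; that synchronization is omitted in this CPU-only sketch.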
The limitations reflect the current state of the benchmark code, and may be fixed in future versions. Further issues can be reported in the github repository.
The benchmark scripts are carefully written, but not optimized to squeeze that last bit of performance out of them. They should reflect typical day-to-day research applications.
Due to time constraints, only the 1x320 LSTM benchmark covers all considered frameworks. For the multi-layer 4x320 networks, only implementations that provided helper functions to create stacked bidirectional networks were evaluated. An exception to this rule was made for Lasagne, in order to include a Theano-based contender for this scenario.
The TensorFlow benchmarks use the feed_dict input method that is simple to implement, but slower than the tf.data API feed_dict . Implementing a high performance input pipeline in TensorFlow is not trivial, and only the feed_dict approach allowed for a similar implementation complexity as in the PyTorch and Lasagne cases.
The TensorFlow cuDNNLSTM was not tested with variable length data as it does not support such input 6633 .
The TensorFlow benchmark uses the integrated tf.nn.ctc_loss instead of the warp-ctc library, even though there is a TensorFlow binding available baidu_ctc . The performance difference has not been measured.
PyTorch 0.4.0 merged the Tensor and Variable classes and does not need the Variable wrapper anymore. The Variable wrapper has a negligible performance impact on version 0.4.0, but is required for older PyTorch releases in the PyTorch version comparison.
The complete set of results is given in Table 4 and the most important findings are summarized below.
Fastest LSTM implementation. The cuDNNLSTM is the overall fastest LSTM implementation, for any input size and network configuration. It is up to 7.2x faster than the slowest implementation (Keras/TensorFlow LSTM, 1x320/CE-long). PyTorch, TensorFlow and Keras provide wrappers to the cuDNN LSTM implementation and the speed difference between frameworks is small (after all, they are wrapping the same implementation).
Optimized LSTM implementations. When considering only optimized LSTM implementations other than cuDNNLSTM, then the TensorFlow LSTMBlockFusedCell is the fastest variant: it is 1.3x faster than PyTorch LSTMCell-fused and 3.4x faster than TensorFlow LSTMBlockCell (1x320/CE-long). The fused variants are slower than the cuDNNLSTM but faster than the more flexible alternatives.
Basic/Flexible LSTM implementations. When considering only flexible implementations that are easy to modify, then Lasagne LSTMLayer, Keras/Theano LSTM and PyTorch LSTMCell-basic are the fastest variants with negligible speed difference. All three train 1.6x faster than TensorFlow LSTMCell, and 1.8x faster than Keras/TensorFlow LSTM (1x320/CE-long). For Keras, the Theano backend is faster than the TensorFlow backend.
100 vs 1000 time steps. The results for the short and long input sequences allow for similar conclusions, with the main exception being that the spread of training time increases: the fastest implementation is 7.2x (1x320/CE-long) vs. 5.1x (1x320/CE-short) faster than the slowest.
Quad-layer networks. The cuDNNLSTM wrappers provided by TensorFlow and PyTorch are the fastest implementations and deliver the same training speed. They are between 4.7x and 7.0x faster than the networks built with Lasagne LSTMLayer and TensorFlow LSTMBlockCell / LSTMCell (4x320/CE-long and 4x320/CTC-long).
Fixed vs. variable sequence length. Going from 4x320/CE-long to 4x320/CTC-long (fixed vs. variable sequence length, cross entropy vs. CTC loss function) slows down training by a factor of 1.1x (PyTorch cuDNNLSTM) to 1.2x (Lasagne LSTMLayer). The TensorFlow implementations stabilize at roughly the same level.
PyTorch versions. The two most recent versions 0.4.0 and 0.3.1post2 are significantly faster than the older versions, especially for the flexible LSTMCell-basic where the speedup is up to 2.2x (1x320/CE-long). The 4x320/CTC-long benchmark is very slow for version 0.2.0_4 which suffered from a slow implementation for sequences of variable length that is fixed in newer versions packed_sequence . The older PyTorch versions tend to produce larger standard deviations in processing time, which is especially visible on the 1x320/CE-short test.
This study evaluated the training time of various LSTM implementations in the PyTorch, TensorFlow, Lasagne and Keras deep learning frameworks. The following conclusions are drawn:
The overall fastest LSTM implementation is the well-optimized cuDNN variant provided by NVIDIA, which is easily accessible in PyTorch, TensorFlow and Keras, and the training speed is similar across frameworks.
When comparing deep learning frameworks and less optimized, but more customizable LSTM implementations, PyTorch trains 1.5x to 1.6x faster than TensorFlow and Lasagne trains 1.3x to 1.6x faster than TensorFlow. Keras is 1.5x to 1.7x faster than TensorFlow when using the Theano backend, but 1.1x slower than TensorFlow when using the TensorFlow backend.
The comparison of PyTorch versions showed that for PyTorch users it is a good idea to update to the most recent version.
On a final note, the interested reader is invited to visit the GitHub page and run their own benchmarks with custom network and input sizes.
I thank my PhD supervisor Shih-Chii Liu for support and encouragement in the benchmarking endeavours. I also thank the Sensors group from the Institute of Neuroinformatics, University of Zurich / ETH Zurich for feedback and discussions. This work was partially supported by Samsung Advanced Institute of Technology.
[Table 4: benchmark results (plots not reproduced). Columns: framework comparison, PyTorch version comparison. Panels:]
(a) 1x320/CE-short: 1x320 unidirectional LSTM, cross entropy loss, fixed sequence length, input 64x100x123 (NxTxC)
(b) 1x320/CE-long: 1x320 unidirectional LSTM, cross entropy loss, fixed sequence length, input 32x1000x123 (NxTxC)
(c) 4x320/CE-long: 4x320 bidirectional LSTM, cross entropy loss, fixed sequence length, input 32x1000x123 (NxTxC)
(d) 4x320/CTC-long: 4x320 bidirectional LSTM, CTC loss, variable sequence length, input 32x1000x123 (NxTxC). The CTC experiment for PyTorch 0.1.12_2 was omitted due to warp-ctc compilation issues.
N = batch size, T = time steps, C = feature channels.
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012.