I Introduction
In recent years, deep learning has been successfully applied to many problems. The successful use of transfer learning for computer vision problems has enabled many applications: large CNNs such as VGG [6] and ResNets [7] are pretrained on a large image dataset such as ImageNet [8, 9] and then utilized as the backbone for other computer vision tasks. These models are able to extract meaningful features for new tasks without needing to be trained from scratch for each task [10, 11, 12, 2].
Recent work has shown promising results from unsupervised language modeling, followed by transfer learning to natural language tasks [3, 13]. However, neural language models have not benefited from scale and transfer learning in the same way as convolutional image models. Historically, natural language processing has leveraged large-scale transfer learning through word embeddings pretrained on large corpora [14, 15, 16]. Transferring only the embeddings limits the scope of the transfer, since word embeddings do not capture sequential information in a section of text. We would like to transfer whole NLP models capable of processing a text sequence.
However, transfer learning in this context is difficult because of the time it takes to train large language models on large datasets. Several recent publications seek to address long training times by leveraging distributed data parallelism and increasing the effective batch size during training [17, 1, 18, 19, 20], taking advantage of advances in distributed deep learning and improvements in the memory size and compute capability of available high performance computing (HPC) resources. This work often focuses on computer vision and rarely on natural language tasks, let alone RNN-based language models, which are numerically difficult to train and suffer from poor parallelization due to their sequential nature. We do have evidence that RNNs for language modeling, speech recognition, and neural machine translation continue to provide accuracy improvements as they are trained on larger datasets [21]. Accordingly, techniques for efficiently training large RNN models will lead to improved accuracy on many natural language tasks.
We focus on training a single-layer 4096-neuron multiplicative LSTM-based (mLSTM) character language model [4] on the Amazon Reviews dataset, one of the largest publicly available NLP datasets, and transfer the model to the downstream tasks of sentiment classification on the Binary Stanford Sentiment Treebank (SST) and IMDB movie review datasets. We train our recurrent models with mixed precision FP16/FP32 arithmetic, which speeds up training on a single V100 by 4.2x over training in FP32.
We then train the mixed precision model using a 32k batch size via distributed data parallelism across 128 GPUs. This achieves a 109x increase in training data throughput relative to the single GPU case. However, with such a large batch size, we require additional epochs to train the model to a similar level of accuracy, bringing the total training time to 4 hours.
In addition, we train an 8192-neuron mLSTM capable of beating state-of-the-art performance in Amazon review language modeling with a bits per character (BPC) of 1.038 and SST classification accuracy of 93.8%.
We analyze how distributed data parallelism scales with larger models. While utilizing distributed data parallelism for training RNNs, we observe some problems common to training with large batches. We investigate the relationship between dataset size, batch size, and learning rate schedule to determine how to use large batch training effectively on commonly available large NLP datasets.
II Language Model Pretraining and Transfer
Separately trained word embeddings [14, 15, 16] are commonly used to transfer learning from large datasets to specific tasks. However, word embeddings function only as a lookup table for in-vocabulary words. They do not transfer well to multi-word sequences and contexts.
Works such as Semi-supervised Sequence Learning [22], context2vec [23], Contextualized Word Vectors (CoVe) [24], and Deep Contextualized Word Representations (ELMo) [25] seek to remedy this by computing embeddings of words in a sequence using a pretrained recurrent neural language model. In these approaches, the surrounding words provide context which is used to produce an embedding that represents the meaning of a given word. These works approach the transfer learning problem with a whole neural language model capable of modeling the compositional nature of language rather than a lookup table that considers all words independently.
This pretraining and transfer work has motivated follow-on work that tries to increase the scope of neural language model pretraining and transfer [3, 26, 27, 28, 29, 13], in which the authors explore new types of language models, multiple types of language model pretraining, and the effect these two have on a wide variety of downstream language tasks. A common theme between these different research efforts, however, is that downstream transfer success is predicated on the pretraining corpus size. Larger text corpora produce more powerful language models, which then improve transfer learning.
II-A Pretraining Tasks and Datasets
As part of pretraining there are three components that determine pretraining success: the task used for pretraining, pretraining dataset quality, and pretraining dataset size.
The first of these requires careful consideration as it affects the other two. A number of language pretraining tasks can be considered generative pretraining tasks, where the language models are trained to generate some language as output. Some of these include sequence to sequence (Seq2Seq) tasks such as Skip-Thought pretraining [30, 27] and Neural Machine Translation [31, 27]. However, we instead choose to focus on unsupervised text reconstruction as our pretraining task: predict the next character of text, given the previous characters. Text reconstruction captures the fundamental components of sequence modeling required by other language modeling tasks.
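As a minimal sketch of what "the data provides its own labels" means for next-character prediction, the snippet below builds inputs and targets by shifting one stream by a single position. The function name `next_char_pairs` and the use of UTF-8 bytes as tokens are assumptions made for illustration, not details confirmed by the paper.

```python
def next_char_pairs(text):
    """Return (inputs, targets) for next-character prediction.

    Tokens are UTF-8 bytes here, which is an assumption for this sketch.
    """
    data = text.encode("utf-8")
    inputs = list(data[:-1])   # what the model sees at each position
    targets = list(data[1:])   # the same stream shifted left by one
    return inputs, targets


inputs, targets = next_char_pairs("great")
print(inputs)
print(targets)
```

No labeling effort is involved: every position in the corpus yields one training example.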
With text reconstruction, the data provides its own labels, and given the data has undergone reasonable cleaning, we can focus on dataset size rather than dataset type or quality. Several corpora successfully utilized for unsupervised text pretraining in prior work are the BooksCorpus [32], GigaWord [33], 1-Billion Word [34], and Amazon Reviews [5] datasets. Similar to [3, 35], we focus our pretraining efforts on the largest of the four datasets (see Fig. 1), by training an mLSTM on an aggressively deduplicated copy of the Amazon Reviews dataset totaling 82 million reviews (40 GB). The generality of our task and the size of our dataset allow the insights developed in this work to be applied to other large scale language tasks.
Dataset                 Corpus size (GB)
1-Billion Word          3
BooksCorpus             5
GigaWord                26
Amazon Reviews Dataset  41
III Large Batch Training
Given the size of the Amazon corpus, pretraining a large state-of-the-art neural language model is a time-consuming process. Running such a workload on a single GPU is not practical, as state-of-the-art models tend to be large and can only fit a modest training batch size per GPU. In order to enable effective pretraining and transfer of large language models, we employ multi-GPU parallelism. We focus on scaling to multiple GPUs with data parallelism, meaning that we partition the batch during training across multiple GPUs. We don’t use model parallelism, which partitions the neural network itself across multiple processors, because it’s less flexible and places more constraints on software, although it remains an interesting avenue for further parallelism.
We use synchronous data parallelism, where a large batch is distributed evenly amongst all participating worker processes, at which point the worker processes run forward and backward propagation, communicate the resulting gradients with each other, and update the model before receiving a new data batch. Depending on model size and communication latency, data parallelism allows for near-linear speedup by scaling batch size linearly with respect to the number of available GPUs. Taking advantage of such scaling, the computer vision community has been able to reduce the training time of AlexNet and ResNet-50 models on the ImageNet benchmark from hours to the order of minutes [17, 1, 18, 19].
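The following toy sketch illustrates why synchronous data parallelism reproduces the large-batch update: when the batch is split evenly, the average of the per-worker gradients equals the gradient of the full batch. The scalar model `w * x`, the squared-error loss, and all function names are assumptions chosen to keep the example small; they are not the paper's model.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_step(w, xs, ys, num_workers, lr):
    """One synchronous step: shard the batch, average worker gradients, update."""
    shard = len(xs) // num_workers  # assumes the batch divides evenly
    local_grads = [
        grad_mse(w, xs[k * shard:(k + 1) * shard], ys[k * shard:(k + 1) * shard])
        for k in range(num_workers)
    ]
    avg_grad = sum(local_grads) / num_workers  # stands in for the gradient exchange
    return w - lr * avg_grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w_parallel = data_parallel_step(0.0, xs, ys, num_workers=4, lr=0.01)
w_single = 0.0 - 0.01 * grad_mse(0.0, xs, ys)
print(w_parallel, w_single)
```

Because every worker applies the same averaged gradient, all model replicas stay identical after each step.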
However, these projects have focused on convolutional networks for image classification, and comparatively less work has been published on large batch training of language models. Ott et al. [20] employ data parallelism to speed up Seq2Seq neural machine translation. However, similar to prior work, Ott et al. train convolutional models with large batches.
In order to enable large batch pretraining of an arbitrary language model it is important to explicitly analyze the effects of large batch training with RNN-based language models. The sequential nature of recurrent neural networks makes the training landscape difficult to optimize, due to saddle points, local minima, and numerical instabilities in the RNN computation itself [36, 37, 38]. These complexities necessitate analysis of large batch training with RNNs.
Large batch training is itself not without difficulties. Identical hyperparameters at different batch sizes regularly produce models which generalize differently. Recent work analyzes the relationship between large batch size, learning rates, and generalization, showing how to achieve similar evaluation results when training across different batch sizes [39, 40, 17].
By analyzing the noise scale of gradient-descent optimization, these methods modify the learning rate $\epsilon$ proportionally to the batch size $B$, with a linear scaling rule $\epsilon \propto B$ provided that $B \ll N$, where $N$ is the dataset size. The authors find that learning rate scaling leads to models that generalize well across various batch sizes. Additionally, Smith et al. [39, 40] proposed scaling momentum as a function of batch size; however, we do not investigate such scaling in this work.
In order to enable large batch training of RNN language models, we explore the effects of this linear scaling rule as well as a softer square root scaling rule, $\epsilon \propto \sqrt{B}$, proposed by Hoffer et al. [41].
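The two scaling rules can be sketched as follows. The reference point (a learning rate of 5e-4 at a global batch size of 128) is the one the paper reports for its baseline; the function name `scaled_lr` and its string argument are assumptions for illustration.

```python
import math

BASE_LR = 5e-4     # learning rate at the reference batch size
BASE_BATCH = 128   # reference global batch size


def scaled_lr(batch_size, rule):
    """Scale the base learning rate with the batch size by the named rule."""
    ratio = batch_size / BASE_BATCH
    if rule == "linear":
        return BASE_LR * ratio
    if rule == "sqrt":
        return BASE_LR * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")


for batch in (2048, 4096, 8192, 16384, 32768):
    print(batch, scaled_lr(batch, "linear"), scaled_lr(batch, "sqrt"))
```

At a batch size of 2048 this gives 8e-3 for the linear rule and 2e-3 for the square root rule, which shows how much more aggressive linear scaling becomes as the batch grows.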
Additionally, we investigate the scalability of data parallelism with different interconnects and model sizes, so as to assess the effectiveness of data parallelism for an arbitrary neural language model.
IV Distributed Deep Learning Setup
We use NVIDIA DGX-1V systems built from 16 GB Tesla V100 GPUs. For intra-node and inter-node communication we leverage the NCCL2 (NVIDIA Collective Communications Library) library, which uses the DGX-1V’s underlying NVLink and InfiniBand connections for GPU to GPU communication.
We do not use a central parameter server for managing gradient reduction and updating the model. In order to efficiently perform updates to the model, the group of worker processes performs a ring reduce of the gradients, and each worker independently updates the model parameters. To reduce the necessary communication bandwidth, NCCL2 also supports native communication of FP16 values, with no FP16 emulation overhead when reducing FP16 parameter gradients across GPUs.
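To make the ring reduction concrete, here is a pure-Python simulation of a generic ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is a sketch of the textbook algorithm, not of NCCL2's internal implementation, which the paper does not describe. The function name and the requirement that the gradient length divide evenly by the worker count are assumptions.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient lists across workers over a simulated ring."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    assert size % n == 0, "sketch assumes the gradient splits into n equal chunks"
    chunk = size // n
    bufs = [list(g) for g in worker_grads]

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: after n-1 steps worker r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [((r - step) % n, r) for r in range(n)]
        payloads = [[bufs[r][i] for i in span(c)] for c, r in sends]
        for (c, r), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                bufs[dst][i] += v

    # All-gather: pass the completed chunks around so every worker has every sum.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, r) for r in range(n)]
        payloads = [[bufs[r][i] for i in span(c)] for c, r in sends]
        for (c, r), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                bufs[dst][i] = v
    return bufs


grads = [[float(10 * r + i) for i in range(8)] for r in range(4)]
print(ring_allreduce(grads)[0])
```

Each worker sends and receives only one chunk per step, so per-worker traffic stays roughly constant as workers are added; no single process has to collect every gradient, which is the role a parameter server would otherwise play.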
V Mixed Precision Training
FP16 is not only useful for reducing communication overhead; it also plays a key role in directly accelerating training on processors like the V100 that support higher-throughput mixed-precision arithmetic. The V100 provides 15.6 TFLOPS in single precision, but 125 TFLOPS with mixed-precision arithmetic (FP16 storage and multiplication, FP32 accumulation). Using FP16 reduces the dynamic range and precision of the computations being performed. This presents a unique set of training difficulties, which, if not addressed, lead to convergence issues while training.
Drawing from [42, 43], we use automatic loss scaling to effectively increase the dynamic range of the training process. Automatic loss scaling seeks to ameliorate numerical underflow by multiplying the training loss by a scalar “loss scale” factor $\alpha$, performing backpropagation with all intermediary gradients multiplied by $\alpha$, and dividing the final weight gradients by $\alpha$. This multiplication shifts small gradient values into the range permitted by FP16, thereby ensuring that they do not vanish during backpropagation.
We choose $\alpha$ dynamically by starting at a large value, performing backpropagation, and checking for an overflow in the weight gradients. If an overflow is detected, then the weight update for the batch is skipped, and $\alpha$ is halved. After the algorithm finds a suitable $\alpha$, it tries to increase $\alpha$ after a sufficient number of iterations have passed without overflow, and again backs off if overflow occurs. The algorithm repeats this process throughout training, iteratively updating the loss scale, hence the name automatic loss scaling.
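The update logic of automatic loss scaling can be sketched in a few lines. The class below is a hypothetical helper, not the authors' code: the paper states only that the scale starts large, is halved on overflow, and is raised again after enough overflow-free iterations, so the initial value, the growth factor of 2, and the growth interval are assumptions.

```python
import math


class DynamicLossScaler:
    """Hypothetical helper: halve on overflow, grow after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=1000, growth_factor=2.0):
        self.scale = init_scale                 # assumed starting value
        self.growth_interval = growth_interval  # assumed number of clean steps
        self.growth_factor = growth_factor      # assumed growth multiplier
        self._clean_steps = 0

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if this step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= 2.0          # overflow: back off and skip the update
            self._clean_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            self.scale *= self.growth_factor   # try a larger scale again
            self._clean_steps = 0
        return grads


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
print(scaler.unscale([float("inf"), 1.0]), scaler.scale)
print(scaler.unscale([4.0, 8.0]), scaler.scale)
```

In a training loop the loss would be multiplied by `scaler.scale` before backpropagation, and the optimizer step would run only when `unscale` returns gradients.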
Without automatic loss scaling, we found that our models did not train to convergence. Although the computationally intensive parts of training were performed in mixed precision, a minority of the work still remained in FP32 in order to converge properly:

Gradients are accumulated into a “master” FP32 copy of the parameters. The division by the loss scale $\alpha$ occurs on the gradients of these master copies.

Reductions are performed in FP32; it only takes a few large values to cause an overflow in FP16.

Accumulation of the summation in the $L_2$ norm computation required by weight normalization should be done in FP32 to avoid overflow. The final norm value is output in FP16.

Softmax loss is computed in FP32, operating on FP32 logits in order to avoid numerical issues when exponentiating FP16 values.
These techniques working in conjunction allowed for successful training of the mLSTM language model in mixed precision.
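The numerical hazards behind these choices can be reproduced with the standard library alone. The sketch below rounds values to IEEE half precision with `struct`'s `"e"` format; the helper names are assumptions, and the specific magnitudes (a gradient of 1e-8, a scale of 1024, one hundred values of 1000) are illustrative rather than taken from the paper.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def sum_fp16(values):
    """Accumulate with the running total rounded to FP16 after every add."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


tiny_grad = 1e-8
print(to_fp16(tiny_grad))                  # underflows without loss scaling
print(to_fp16(tiny_grad * 1024) / 1024)    # survives when scaled, then unscaled in higher precision
print(sum_fp16([1000.0] * 100))            # FP16 accumulation overflows
print(sum([1000.0] * 100))                 # higher-precision accumulation does not
```

The first pair of lines shows why the loss is scaled before backpropagation; the second pair shows why reductions and norm accumulations are kept in FP32.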
VI Experiments
All experiments are set up following [3] and run with PyTorch’s v0.4 release [44]. The Amazon Reviews dataset is shuffled and split into training, validation, and test sets. The model is trained using truncated backpropagation through time (TBTT) [45] on sequences of 256 characters. We persist hidden state across each minibatch during training and evaluation.
VI-A Data Sharding
In order to create the training, validation, and test sets, the dataset is split proportionally by a ratio of 1000, 1, and 1 allocated for train, validation, and test sets respectively. Within these sets we create batch size number of shards for evaluation, and shards for training. A shard is defined as a subset of strings sampled without replacement from one of the dataset splits; this subset is concatenated together into one large string to form a shard. These shards are used for all training epochs with no further shuffling. Hidden state is initialized to zero at the beginning of a shard and persisted throughout the shard.
When constructing a minibatch $b_i$, data is sampled such that between two consecutive minibatches $b_i$ and $b_{i+1}$, minibatch index $j$ contains contiguous subsequences from within a shard. This contiguity across minibatches enables hidden state persistence across truncated subsequences in TBTT.
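A minimal sketch of this batching scheme, assuming one shard per batch row: each minibatch takes the next fixed-length window from every shard, so the same row of two consecutive minibatches is contiguous text and the hidden state for that row can be carried forward. The function name and the toy shards are assumptions for illustration.

```python
def make_minibatches(shards, seq_len):
    """Yield minibatches where row j continues row j of the previous minibatch."""
    num_steps = min(len(s) for s in shards) // seq_len
    for i in range(num_steps):
        start = i * seq_len
        yield [shard[start:start + seq_len] for shard in shards]


shards = ["abcdefgh", "ijklmnop"]   # toy stand-ins for concatenated review text
batches = list(make_minibatches(shards, seq_len=4))
for batch in batches:
    print(batch)
```

Joining row 0 of consecutive minibatches recovers the first shard in order, which is the property that hidden state persistence relies on.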
VI-B Weight Normalization
VI-C Optimization and Learning Rate (LR) Schedule
As in [3], Adam [47] is utilized for optimization, along with a learning rate schedule that decays linearly to zero over the course of training. For a global batch size of 128, a learning rate of 5e-4 is used, and it is scaled up according to the batch size using either the linear or square root scaling rule.
VI-D Evaluation
Two metrics for evaluating training are considered:

A bits per character (BPC) metric calculated on the immediate task of predicting the next character given the current character on the Amazon Reviews test set. We calculate the average BPC across 16 random shards of the test set by using an evaluation batch size of 16. Since our model operates directly on character-level tokens, calculation of BPC is simply $l \cdot \log_2 e$, where $l$ is the softmax cross entropy loss averaged over the entire sequence.

Accuracy from the downstream tasks of binary sentiment classification on the Binary SST and IMDB Movie Review datasets. To perform transfer, the model weights are taken at the end of Amazon training, frozen, and used to featurize text samples from the classification dataset. A simple binary logistic regression classifier from scikit-learn [48] is trained to classify these text features as having positive or negative sentiment. The transfer process is negligible computationally because of the simple model we use on the downstream task.
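For the first metric, the conversion from an average cross entropy loss to bits per character is a change of logarithm base. The sketch below assumes the loss is measured in nats (natural logarithm), as is the default in most deep learning frameworks; the function name is hypothetical.

```python
import math


def bits_per_character(loss_nats):
    """Convert an average per-character cross entropy in nats to bits."""
    return loss_nats * math.log2(math.e)


# Sanity check: a uniform guess over 256 byte values costs exactly 8 bits.
print(bits_per_character(math.log(256)))
```

Read in the other direction, a BPC of 1.038 corresponds to an average loss of roughly 0.72 nats per character.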
VII Analysis of Mixed Precision vs FP32 Training

Mixed precision training allows for faster computation as well as a 2x increase in effective batch size during training, because FP16 storage is 2x smaller. In this section we analyze performance gains and convergence for training networks with mixed precision arithmetic, comparing it to single precision training. This allows us to validate the correctness of the remaining experiments, which are all trained in mixed precision.
Using the techniques described in Sections IV and V, we train a model on the Amazon Reviews dataset using a full DGX-1V node with 8 GPUs. We initially begin with a batch size of 128 per GPU, for a global batch size of 1024, and compare the relative speedup granted by mixed precision arithmetic. Next, we quantify the benefits of the reduced memory footprint by doubling the batch size to 256 per GPU (2048 global) in order to better saturate the GPU. Additionally, we utilize the softer square root scaling rule [41] to modify the learning rate as a function of batch size.
Figure 2 shows that training in mixed precision and single precision both produce similar training curves and converge to similar numbers for both language modeling and transfer evaluation. We find that moving to mixed precision not only achieves similar training results, but it also provides a 3x speedup in training. By taking advantage of the reduced memory footprint of FP16 storage, we increase the batch size twofold to 256 per GPU, better saturating the GPU, and achieve an additional speedup of 40% on top of our original speedup. This provides approximately a 4.2x speedup when switching from single precision arithmetic to mixed precision.
Overall, this yields a speedup from one month of training as in [3] to 18 hours. We accomplished this using 8 Tesla V100 GPUs, a larger batch size, and mixed precision arithmetic.
VIII Distributed Data Parallel Scaling
To train a language model in hours, not in days, we further parallelize the training process by using multiple nodes and additional data parallelism. We first analyze the effect of communication overhead on the scalability of multi-GPU training at various batch sizes and processor counts.
The model is trained in mixed precision on 1, 8, 16, 32, 64, and 128 GPUs with a local batch size of 256 per GPU and 8 GPUs per DGX-1V node. In Fig. 3 we observe that NCCL2 provides near-linear scaling with minimal overhead when scaling from 1 to 8 GPUs within a node. InfiniBand efficiently handles inter-node communication for the 4096-neuron mLSTM with effectively constant overhead with respect to the number of participating nodes. This allows for a total speedup of 109x when scaling to 128 GPUs across 16 DGX-1V nodes. More concretely, we complete one epoch of training on the Amazon Reviews dataset in only 1.2 hours.

VIII-A Scaling Large Model Training
Not every problem calls for training a 4096-d mLSTM. Smaller models will train faster and may converge to a good enough BPC, while larger models may be necessary for state-of-the-art performance. To illustrate this, we train an mLSTM with hidden state sizes of 256, 1024, 4096, and 8192, a global training batch size of 2048 split across 1 DGX-1V node, and a learning rate of 2e-3. In the case of the 8192-d hidden state mLSTM we use a per-GPU batch size of 96 (768 total) due to memory constraints. In that case, we use a learning rate of 7.8e-4 that observes the square root scaling rule. In Fig. 4 we can see the benefit of training larger models, with the 8192-d mLSTM achieving state-of-the-art language modeling comparable to [35], albeit at the cost of additional compute and memory.
We investigate the scalability of a larger 8192-d mLSTM model compared to the baseline 4096-d mLSTM model in Fig. 3. The 8192-d model has 0.72 GB of parameters in FP16, while the 4096-d model has 0.18 GB of parameters. When training the 8192-d model, we see a speedup factor of 120.8x across 128 GPUs. Even though the larger model has correspondingly larger gradients, it is also more computationally intensive, leading to better scaling than the baseline model on the same hardware.
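The reported speedups translate into parallel efficiency (speedup divided by GPU count) as in the short calculation below. The speedup factors, GPU count, and 1.2-hour epoch time are the figures reported in this section; the function name is an assumption, and the implied single-GPU epoch time is an estimate derived from those figures rather than a number the paper states.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_gpus


eff_4096 = scaling_efficiency(109.0, 128)   # baseline 4096-d mLSTM
eff_8192 = scaling_efficiency(120.8, 128)   # larger 8192-d mLSTM
single_gpu_epoch_hours = 1.2 * 109.0        # implied estimate for the 4096-d model

print(f"4096-d efficiency: {eff_4096:.1%}")
print(f"8192-d efficiency: {eff_8192:.1%}")
print(f"implied single-GPU epoch: {single_gpu_epoch_hours:.0f} hours")
```

The larger model spends more time computing per byte of gradient communicated, which is consistent with its higher efficiency.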
IX Analysis of Large Batch Training
Distributed data parallel training allows for near-linear scalability with respect to available GPUs by increasing the batch size. However, as seen already in this work (see Section VII), training with large batches may run faster than training with small batches, but it may not converge to the same validation accuracy. Using the same setup as Section VIII and 8, 16, 32, 64, and 128 GPUs, we examine how the learning rate schedule affects convergence.
IX-A Learning Rate Scaling
Prior work in this space trained a 4096-d mLSTM with an initial learning rate of 5e-4, batches of 128, and a learning rate that decays linearly to zero over one epoch of training [3]. As expected and shown in Fig. 5, keeping this same learning rate schedule as we increase batch size leads to worse accuracy, or a higher BPC. (Footnote 2: We train our models to convergence, not for one epoch, but results over one epoch are representative and easier to compare.)
Recent work scaling image CNN models with SGD suggests that learning rates could be scaled linearly as batch size increases without a noticeable loss in model accuracy [39]. However, we found that our mLSTM model, optimized with Adam, diverged for large batches when we scaled up the 5e-4 initial learning rate with either a linear or a square root rule as we increased the batch size.
We observed that for batch sizes of 2k-8k, the model converged reasonably well with an initial learning rate of 3e-3 decayed to zero over one epoch. Thus we kept 3e-3 as the initial learning rate for all other experiments.
Batch  Iters  Rule    LR      BPC    SST   IMDB
2048   72.6k  linear  8e-3    1.280  79.4  77.6
2048   72.6k  sqrt    2e-3    1.117  90.2  91.9
2048   72.6k  -       5e-4    1.130  89.1  90.8
2048   72.6k  -       3e-3    1.110  89.0  92.1
4096   37.3k  linear  1.6e-2  1.275  78.3  77.6
4096   37.3k  sqrt    2.8e-3  1.122  89.6  91.0
4096   37.3k  -       5e-4    1.146  89.3  90.9
4096   37.3k  -       3e-3    1.119  89.2  91.8
8192   18.6k  linear  3.2e-2  1.476  65.4  67.3
8192   18.6k  sqrt    4e-3    1.133  89.7  90.8
8192   18.6k  -       5e-4    1.175  87.3  89.6
8192   18.6k  -       3e-3    1.132  89.5  91.4
16384  9.3k   linear  6.4e-2  Div    -     -
16384  9.3k   sqrt    5.8e-3  Div    -     -
16384  9.3k   -       5e-4    1.254  85.1  86.4
16384  9.3k   -       3e-3    1.162  89.0  90.1
32768  4.6k   linear  1.3e-1  Div    -     -
32768  4.6k   sqrt    8e-3    Div    -     -
32768  4.6k   -       5e-4    1.380  75.2  74.8
32768  4.6k   -       3e-3    1.218  87.1  87.9
IX-B Learning Rate Schedule
When training to convergence, we used the same learning rate schedule for all batch sizes:

Set an initial learning rate of 3e-3.

Linearly decay learning rate to zero over 100,000 iterations.

Stop training at 3 epochs over the dataset, if fewer than 100,000 iterations.
This schedule, constant across all batch sizes, avoided the divergence observed in Fig. 5 from scaling the learning rate too much, and it also performed better than keeping the initial 5e-4 learning rate constant across batch sizes.
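The three steps of this schedule can be sketched as follows. The initial rate, the 100,000-iteration decay horizon, and the 3-epoch cap come from the list above; the function names are assumptions, and the per-epoch iteration counts used in the example (4.6k at a 32k batch, 72.6k at a 2k batch) are the rounded values the paper reports for one epoch.

```python
def learning_rate(step, init_lr=3e-3, decay_steps=100_000):
    """Linear decay from init_lr to zero over decay_steps iterations."""
    return init_lr * max(0.0, 1.0 - step / decay_steps)


def total_steps(iters_per_epoch, max_epochs=3, decay_steps=100_000):
    """Stop at whichever comes first: max_epochs of data or the end of the decay."""
    return min(max_epochs * iters_per_epoch, decay_steps)


for iters in (4_600, 72_600):
    last = total_steps(iters)
    print(iters, last, learning_rate(last))
```

With about 4.6k iterations per epoch, training stops after 13.8k steps, when the learning rate has fallen by only about 14 percent; with 72.6k iterations per epoch the decay horizon is reached first and the rate ends at zero. The effective steepness of the decay therefore differs by batch size even though the schedule itself is fixed.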
Using this learning rate schedule for the model with different batch sizes, Fig. 6 shows that large batch training for this problem can converge to a similar evaluation BPC as smaller batch training given a good training schedule. However, adjusting the learning rate schedule is not as simple as modifying the learning rate according to batch size. In our experiments we found that controlling the steepness of decay was also required.

X Discussion
We were able to converge our model in mixed precision to a similar value as the FP32 baseline. This speeds up training and substantially reduces our memory footprint without a measurable change in accuracy, as shown in Fig. 2. We further speed up training by saturating up to 128 GPUs with distributed data parallelism, which we can do with a near-linear scaling factor (Fig. 3).
However, as batch size increases from 128 in [3] to 32k, the model needs more training steps to converge, and it does not converge to quite as good a validation BPC as the small batch model (Fig. 6).
With longer training, 3 epochs of the Amazon Reviews dataset rather than 1, we do converge the 32k batch model close to the small batch model, doing so in a few hours instead of days or weeks. We also show that downstream task transfer to sentiment extraction is comparable when using converged large batch models (Fig. 5(b)).
Smith et al. [39] suggest that to scale the learning rate without a loss in generalization quality, given a batch size $B$ and total amount of data $N$, $N$ must be sufficiently large so that $B \ll N$. It is possible that a batch size of 32k and the amount of available Amazon data do not satisfy this requirement, since an epoch of the Amazon Reviews dataset is reduced to about 4.6k iterations when we scale up to a 32k batch size. This observation opens up new research questions for future work.
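A quick calculation shows how fast the ratio of dataset size to batch size shrinks. The sketch takes the 72.6k iterations per epoch reported for a batch size of 2048 as exact, which is an assumption: the reported values are rounded, so the estimate at a batch size of 32768 (about 4.5k) differs slightly from the 4.6k the paper reports. The constant and function names are hypothetical.

```python
ITERS_AT_2048 = 72_600   # reported iterations per epoch at batch size 2048


def iters_per_epoch(batch_size):
    """Estimate N / B, taking N as the number of training sequences per epoch."""
    sequences = ITERS_AT_2048 * 2048
    return sequences / batch_size


for batch in (2048, 4096, 8192, 16384, 32768):
    print(batch, iters_per_epoch(batch))
```

Each doubling of the batch size halves the number of parameter updates available per epoch, so at the largest batch the optimizer sees only a few thousand updates per pass over the data.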
XI Future Work
We have shown that distributed data parallelism scales for large RNN text models. However, we start to see diminishing returns on wall time convergence at very large batches, possibly because each epoch is reduced to a small number of training iterations. Now that we can train a language model on the 40 GB Amazon reviews dataset in hours, a next step could be to train on larger text datasets. Orders of magnitude larger text datasets could be constructed by collecting web pages, news articles, Reddit comments, and tweets, for example.
In addition to larger text datasets, we could further improve Amazon Reviews BPC (and presumably accuracy on transfer tasks) with some of the following:

Training for more than 3 epochs.

Data shuffling between epochs.

Larger RNN models, with more layers and larger hidden states.

Alternative language models, such as the Transformer network [49].

Hyperparameter search for an ideal large batch learning rate schedule.
As shown in Fig. 5(b), our best large batch training runs did not decay the learning rate to zero by the end of 3 epochs. As long as the initial learning rate is low enough not to cause training divergence, it may be possible to keep the learning rate high through several epochs of training. Recent language modeling work with the Transformer network has shown that triangular learning rate warmup and nonlinear learning rate decay (cosine annealing) can lead to a better learning rate schedule with the Adam optimizer [13]. We showed that a simple learning rate schedule can work for large batch training, but further work on learning rate schedules will likely improve convergence.
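For reference, a schedule with linear warmup followed by cosine annealing can be sketched as below. This is a generic formulation of the idea, not the exact schedule used in [13] and not something evaluated in this work; the function name and every parameter value in the example are assumptions.

```python
import math


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear ramp to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 50, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, peak_lr=3e-3, warmup_steps=100, total_steps=1000))
```

Compared with a purely linear decay, the cosine shape holds the learning rate near its peak for longer before falling off, which is one way to control the steepness of decay discussed above.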
Increasing the mLSTM size from 4096-d to 8192-d reduces the per-GPU batch size by a factor of four. Using gradient checkpointing [50] would allow training larger models with larger batches without being constrained by memory capacity.
In order to get maximal text understanding from these larger models, we could modify the unsupervised task to include additional objectives, along with language modeling. Auxiliary tasks may include predicting a review’s star rating, the title or topic of a piece of text, or any other freely available structural text label. Since the purpose of unsupervised training is to build a model with deep conceptual understanding of the text, auxiliary tasks that leverage metadata available with the text could provide additional understanding.
XII Conclusion
We set out to investigate large scale training for recurrent models in the natural language domain. With mixed precision training we can successfully converge a model 4.2x faster with double the batch size compared to FP32 training. By leveraging distributed deep learning with NCCL2, NVLink, and InfiniBand interconnect, we achieve near-linear scaling of 109x with 128 GPUs, as we grow the batch size proportionately to the number of available machines.
In addition to pushing wall time scalability by decreasing the time needed to converge a language model on the Amazon Reviews dataset, we analyze the convergence of models trained with large batches. We find that training with very large batches leads to somewhat worse generalization, requiring more data to converge to a similar validation BPC and transfer accuracy as small batch training. Learning rate schedule modifications are necessary to help with convergence. Without such techniques evaluation quality begins to decline as batch size increases, or the model fails to converge if the learning rate is scaled too high.
With further modification to the learning rate schedule and additional training it is possible to train models with large batches comparable to models trained with smaller batches. Our experiments lead to two insights:

The relationship between batch size and learning regime is complex and learning rate scaling alone is not always enough to converge a model.

Even with the largest public text corpus available, it may not be feasible to satisfy the batch size requirement needed to effectively train with the largest batches that modern hardware allows.
We look forward to more work investigating large scale language model training and using it in transferred tasks to solve difficult natural language problems.
References
 [1] T. Akiba, S. Suzuki, and K. Fukuda, “Extremely large minibatch SGD: training resnet-50 on imagenet in 15 minutes,” CoRR, vol. abs/1711.04325, 2017.
 [2] S. Kornblith, J. Shlens, and Q. V. Le, “Do better imagenet models transfer better?” 2018.
 [3] A. Radford, R. Józefowicz, and I. Sutskever, “Learning to generate reviews and discovering sentiment,” CoRR, vol. abs/1704.01444, 2017.
 [4] B. Krause, L. Lu, I. Murray, and S. Renals, “Multiplicative LSTM for sequence modelling,” CoRR, vol. abs/1609.07959, 2016.
 [5] J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, “Image-based recommendations on styles and substitutes,” SIGIR, 2015.
 [6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
 [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
 [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” 2009.
 [9] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “Imagenet large scale visual recognition challenge,” CoRR, vol. abs/1409.0575, 2014.

 [10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” International Conference on Machine Learning, 2013.
 [11] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2014.
 [12] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” IEEE International Conference on Computer Vision (ICCV), 2017.
 [13] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pretraining,” 2018. [Online]. Available: https://blog.openai.com/languageunsupervised/

 [14] J. Turian, L. Ratinov, and Y. Bengio, “Word representations: A simple and general method for semi-supervised learning,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ser. ACL ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 384–394.
 [15] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” CoRR, vol. abs/1310.4546, 2013.
 [16] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” 2014. [Online]. Available: https://www.aclweb.org/anthology/D141162
 [17] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training imagenet in 1 hour,” CoRR, vol. abs/1706.02677, 2017.
 [18] Y. You, Z. Zhang, C. Hsieh, and J. Demmel, “100-epoch imagenet training with alexnet in 24 minutes,” CoRR, vol. abs/1709.05011, 2017.
 [19] Y. You, I. Gitman, and B. Ginsburg, “Scaling SGD batch size to 32k for imagenet training,” CoRR, vol. abs/1708.03888, 2017.
 [20] M. Ott, S. Edunov, D. Grangier, and M. Auli, “Scaling neural machine translation,” 2018.
 [21] J. Hestness, S. Narang, N. Ardalani, G. F. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou, “Deep learning scaling is predictable, empirically,” CoRR, vol. abs/1712.00409, 2017.
 [22] A. M. Dai and Q. V. Le, “Semi-supervised sequence learning,” CoRR, vol. abs/1511.01432, 2015.
 [23] O. Melamud, J. Goldberger, and I. Dagan, “context2vec: Learning generic context embedding with bidirectional lstm,” in Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 01 2016, pp. 51–61.
 [24] B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in translation: Contextualized word vectors,” CoRR, vol. abs/1708.00107, 2017.
 [25] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” CoRR, vol. abs/1802.05365, 2018.
 [26] J. Howard and S. Ruder, “Fine-tuned language models for text classification,” CoRR, vol. abs/1801.06146, 2018.
 [27] S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal, “Learning general purpose distributed sentence representations via large scale multi-task learning,” CoRR, vol. abs/1804.00079, 2018.
 [28] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil, “Universal sentence encoder,” CoRR, vol. abs/1803.11175, 2018.
 [29] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer, “Generating wikipedia by summarizing long sequences,” ICLR, 2018.
 [30] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler, “Skip-thought vectors,” CoRR, vol. abs/1506.06726, 2015.
 [31] P. Ramachandran, P. J. Liu, and Q. V. Le, “Unsupervised pretraining for sequence to sequence learning,” CoRR, vol. abs/1611.02683, 2016.
 [32] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” CoRR, vol. abs/1506.06724, 2015.
 [33] R. Parker, D. Graff, J. Kong, K. Chen, and K. Maeda, “English gigaword fifth edition,” 2011. [Online]. Available: https://catalog.ldc.upenn.edu/ldc2011t07
 [34] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, and P. Koehn, “One billion word benchmark for measuring progress in statistical language modeling,” CoRR, vol. abs/1312.3005, 2013.
 [35] S. Gray, A. Radford, and D. P. Kingma, “Gpu kernels for block-sparse weights,” 2017. [Online]. Available: https://blog.openai.com/blocksparsegpukernels/
 [36] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, Efficient BackProp. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 9–48. [Online]. Available: https://doi.org/10.1007/9783642352898_3
 [37] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” CoRR, vol. abs/1211.5063, 2012.
 [38] A. Karpathy, J. Johnson, and F. Li, “Visualizing and understanding recurrent networks,” CoRR, vol. abs/1506.02078, 2015.
 [39] S. L. Smith and Q. V. Le, “A bayesian perspective on generalization and stochastic gradient descent,” CoRR, vol. abs/1710.06451, 2017.
 [40] S. L. Smith, P. Kindermans, and Q. V. Le, “Don’t decay the learning rate, increase the batch size,” CoRR, vol. abs/1711.00489, 2017.
 [41] E. Hoffer, I. Hubara, and D. Soudry, “Train longer, generalize better: closing the generalization gap in large batch training of neural networks,” 2017.
 [42] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” CoRR, vol. abs/1710.03740, 2017.
 [43] NVIDIA. (2018) Mixed precision training: Choosing a scaling factor. [Online]. Available: https://docs.nvidia.com/deeplearning/sdk/mixedprecisiontraining/index.html#scalefactor
 [44] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
 [45] I. Sutskever, “Training recurrent neural networks,” 2013. [Online]. Available: https://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf
 [46] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” CoRR, vol. abs/1602.07868, 2016.
 [47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [48] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.
 [50] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” CoRR, vol. abs/1604.06174, 2016.