, GPT-2 (Radford et al., 2019), MT-DNN (Liu et al., 2019a), and RoBERTa (Liu et al., 2019b) reached state-of-the-art performance on tasks such as machine translation (Arivazhagan et al., 2019), language modelling (Radford et al., 2019), and text classification benchmarks like GLUE (Wang et al., 2018). However, these models require large amounts of memory and computation, which makes them hard to deploy on memory-constrained devices such as mobile phones, watches and IoT devices. Recently, there has been interest in making BERT lighter and faster (Sanh et al., 2019; McCarley, 2019). In parallel, recent on-device works like SGNN (Ravi and Kozareva, 2018) and SGNN++ (Ravi and Kozareva, 2019) produce lightweight models with an extremely low memory footprint. They employ a modified form of LSH projection to dynamically generate a fixed binary projection representation for the input text using word or character n-gram and skip-gram features, followed by a 2-layer MLP + softmax layer for classification. As shown in Ravi and Kozareva (2018), these models are suitable for short sentences, as they compute a single $T$-bit LSH projection vector to represent the entire sentence. However, Kozareva and Ravi (2019) showed that such models cannot handle long text due to significant information loss in the projection operation.
On the other hand, recurrent architectures represent long sentences well, but the sequential nature of their computation increases latency and makes on-device deployment difficult. Recently, self-attention based architectures like BERT (Devlin et al., 2018) have demonstrated remarkable success in capturing long-term dependencies in the input text purely via attention mechanisms. BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation in Vaswani et al. (2017). The self-attention scores can be computed in parallel because they involve no recurrent mechanism. However, these architectures are usually very deep, and the amount of computation is in the order of $O(L \cdot n^2)$, where $L$ is the number of layers (Transformer blocks) and $n$ is the input sentence length. Straightforward solutions like reducing the number of layers are insufficient to launch transformers on-device because of the large memory and quadratic computation requirements.
In this paper, we introduce ProFormer, a projection-based neural architecture designed to (a) be efficient and learn compact neural representations, (b) handle out-of-vocabulary words and misspellings, (c) drastically reduce the embedding memory footprint from hundreds of megabytes to a few kilobytes, and (d) reduce the computation overhead quadratically by introducing a local attention layer that reduces the intermediate sequence length by a constant factor, $K$. We achieve this by combining the best of both worlds: LSH projection based representations (for a low memory footprint) and self-attention based architectures (to model dependencies in long sentences). To tackle the computation overhead of transformer-based models, we reduce the number of self-attention layers and additionally introduce an intermediate local projection attention (LPA) layer that quadratically reduces the number of self-attention operations. The main contributions of our paper are:
- We propose a novel on-device neural network called ProFormer, which combines LSH projection based text representations with a transformer architecture and a locally projected self-attention mechanism that captures long-range sentence dependencies while yielding a low memory footprint and low computation overhead.
- ProFormer reduces the computation overhead and latency in multiple ways: by reducing the number of layers from twelve to two, and by introducing a new local projection attention layer that decreases the number of self-attention operations by a quadratic factor.
- ProFormer is a lightweight, compact on-device model, whereas BERT on-device still needs a huge embedding table (92.16 MB for BERT-base’s vocabulary size $|V|$ and embedding dimension $d$; see Section 2.1) and a number of computation FLOPs in the order of $O(L \cdot n^2)$, where $L$ is the number of layers and $n$ is the number of words in the input sentence.
- We conduct empirical evaluations and comparisons against state-of-the-art on-device and prior deep learning approaches for short and long text classification. ProFormer reached state-of-the-art performance for short text and comparable performance for long text, while maintaining a small memory footprint and low computation requirements.
2 ProFormer: LSH Projection based Transformers
Figure 1 shows the overall architecture of ProFormer. ProFormer consists of four parts: (1) a word-level Locality Sensitive Hashing (LSH) projection layer, (2) a local projection attention (LPA) layer, (3) a transformer layer (Devlin et al., 2018), and (4) a max-pooling + classifier layer. Next, we describe each layer in detail.
2.1 LSH Projection Layer
It is common practice to represent each word in the input sentence as an embedding vector looked up from its one-hot representation. Instead, we adopt the LSH projection layer from Ravi (2017, 2019), which dynamically generates a $T$-bit representation for each input word based on its morphological features, such as n-grams and skip-grams from the current and context words, parts-of-speech tags, etc.
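To make the idea of a table-free word representation concrete, the following is a minimal sketch of a generic sign-based LSH projection over character n-gram and skip-gram features. It is not the exact projection function of Ravi (2017, 2019): the feature extractors, the hash function (BLAKE2b), the padding markers and the names `char_ngrams`, `skip_grams` and `lsh_projection` are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(word, n_max=5):
    """Character n-grams of length 1..n_max over a padded word (illustrative)."""
    padded = f"^{word}$"
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


def skip_grams(word, skip=1):
    """Character pairs that skip `skip` characters in between (illustrative)."""
    padded = f"^{word}$"
    return [padded[i] + "_" + padded[i + skip + 1]
            for i in range(len(padded) - skip - 1)]


def _signed_hash(feature, bit_index):
    """Deterministic pseudo-random +1/-1 for a (feature, bit) pair."""
    data = f"{bit_index}:{feature}".encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return 1 if digest[0] & 1 else -1


def lsh_projection(word, num_bits=16):
    """Map a word to `num_bits` bits without any embedding lookup table.

    Each bit is the sign of a pseudo-random +1/-1 sum over the word's features,
    so words that share most features tend to agree on most bits.
    """
    features = char_ngrams(word) + skip_grams(word)
    bits = []
    for b in range(num_bits):
        total = sum(_signed_hash(f, b) for f in features)
        bits.append(1 if total >= 0 else 0)
    return bits


if __name__ == "__main__":
    print(lsh_projection("transformer"))
    print(lsh_projection("transfromer"))  # a misspelling still gets a representation
```

Because the bits are computed on the fly from surface features, unseen or misspelled words receive a representation just like in-vocabulary words, and no per-word parameters need to be stored.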
Since the LSH projection based approach does not rely on embedding lookup tables to compute word representations, we obtain significant memory savings in the order of $O(|V| \cdot d)$, where $|V|$ is the vocabulary size and $d$ is the embedding dimension. For instance, the embedding lookup table of BERT-base (Devlin et al., 2018) occupies 92.16 MB, while the LSH projection layer requires only 1.7 KB, as shown in Table 1.
|ProFormer (our model)|
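The memory comparison above can be checked with simple arithmetic. The sketch below is one accounting that is consistent with the quoted figures, under assumptions that are not stated in the text: 32-bit floats, a vocabulary of 30,000 entries, an embedding dimension of 768, and a projection size of 420 values. Treat all four as hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters stored as 32-bit floats


def embedding_table_bytes(vocab_size, embedding_dim,
                          bytes_per_value=BYTES_PER_FLOAT32):
    """Memory of a dense lookup table: one vector per vocabulary entry."""
    return vocab_size * embedding_dim * bytes_per_value


def projection_bytes(projection_size, bytes_per_value=BYTES_PER_FLOAT32):
    """Memory that grows only with the projection size, not the vocabulary."""
    return projection_size * bytes_per_value


if __name__ == "__main__":
    # Assumed values (hypothetical): |V| = 30,000, d = 768, T = 420.
    table_mb = embedding_table_bytes(30_000, 768) / 1e6
    proj_kb = projection_bytes(420) / 1e3
    print(f"embedding table: {table_mb:.2f} MB")
    print(f"LSH projection:  {proj_kb:.2f} KB")
```

The point of the comparison is the scaling: the lookup table grows with $|V| \cdot d$, whereas the projection cost is independent of the vocabulary size.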
2.2 Local Projection Attention (LPA) Layer
The LPA layer, shown in Figure 2, consists of a single multi-headed self-attention layer, similar to the Transformer architecture of Vaswani et al. (2017), followed by a max-pooling layer that yields a compressed representation of the input words.
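The LPA layer transforms the $N$ word-level projections, $\mathbb{P}(w_1), \dots, \mathbb{P}(w_N)$, into a sequence of $N/K$ representations, as in Equation 1:

$$\big[\mathrm{LPA}(\mathbb{P}(w_1), \dots, \mathbb{P}(w_K)),\ \dots,\ \mathrm{LPA}(\mathbb{P}(w_{N-K+1}), \dots, \mathbb{P}(w_N))\big] \tag{1}$$

where $\mathrm{LPA}$ consists of the self-attention and max-pooling operations and $K$ is a group factor. We equally divide the $N$ word-level LSH projection representations into $N/K$ groups of size $K$. The LPA layer compresses each group of $K$ word representations into a single representation, yielding $N/K$ representations in total. The LPA layer thereby reduces the self-attention computation overhead in the subsequent transformer layer (Vaswani et al., 2017) by $O(K^2)$.

Footnote 1: We choose $K$ such that $N$ is divisible by $K$.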
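The grouping and compression performed by the LPA layer can be sketched as follows. This is a simplified illustration, not the layer used in the paper: it uses a single attention head with identity query, key and value maps instead of learned multi-head projections, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with identity Q/K/V maps."""
    d = len(vectors[0])
    scale = math.sqrt(d)
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors))
                        for j in range(d)])
    return outputs


def max_pool(vectors):
    """Element-wise maximum over a group of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def local_projection_attention(projections, group_size):
    """Compress N word representations into N / group_size representations."""
    n = len(projections)
    if n % group_size != 0:
        raise ValueError("sequence length must be divisible by the group size")
    return [max_pool(self_attention(projections[i:i + group_size]))
            for i in range(0, n, group_size)]


if __name__ == "__main__":
    words = [[float((i + j) % 2) for j in range(4)] for i in range(8)]
    print(len(local_projection_attention(words, 4)))  # 8 words, groups of 4
```

With a group size of 1, each group contains a single word, attention assigns it weight 1, and max-pooling returns it unchanged, so the layer passes its input through.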
2.3 Transformer Layer
This layer consists of a 2-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017). It transforms the $N/K$ input representations from the LPA layer described in the previous subsection into $N/K$ output representations. In this layer, we reduce both the computation overhead and the memory footprint by reducing the number of layers from $L$ to 2, which reduces the computation overhead by a factor of $L/2$ (6 times in the case of the 12-layer BERT-base model (Devlin et al., 2018)).
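The two sources of savings, fewer layers and a shorter sequence, can be combined in a back-of-the-envelope estimate. The sketch below counts only pairwise attention-score operations per layer; it ignores constant factors, the hidden size, and the cost of the LPA layer itself. The sentence length of 128 and the function names are assumptions for illustration.

```python
def attention_cost(num_layers, seq_len):
    """Pairwise attention-score operations: one seq_len x seq_len map per layer."""
    return num_layers * seq_len ** 2


def speedup(baseline_layers, seq_len, num_layers=2, group_factor=1):
    """Ratio of baseline attention cost to the reduced model's attention cost."""
    if seq_len % group_factor != 0:
        raise ValueError("seq_len must be divisible by group_factor")
    reduced_len = seq_len // group_factor
    return (attention_cost(baseline_layers, seq_len)
            / attention_cost(num_layers, reduced_len))


if __name__ == "__main__":
    # 12 layers -> 2 layers, no grouping: a factor of 12 / 2.
    print(speedup(12, 128))
    # Grouping by 4 shortens the sequence 4x, adding a further 4 ** 2 factor.
    print(speedup(12, 128, group_factor=4))
```

Under this simplified count, the layer reduction contributes a linear factor and the group factor contributes a quadratic one, which is why the two are reported separately in the text.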
2.4 Max-Pooling and Classification Layer
We summarize the representations from the transformer layer into a single fixed-size vector by max-pooling across the time steps, followed by a softmax layer that predicts the output class.
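As a small illustration of this final step, the sketch below max-pools a sequence of vectors over time and applies a linear layer followed by softmax. The weights, bias and function names are placeholders chosen for the example, not parameters from the paper.

```python
import math


def max_pool_over_time(states):
    """Element-wise maximum across time steps, giving one fixed-size vector."""
    return [max(column) for column in zip(*states)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(states, weights, bias):
    """Pool the sequence, apply a linear layer, and return (class, probabilities)."""
    pooled = max_pool_over_time(states)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


if __name__ == "__main__":
    states = [[0.1, 2.0], [1.5, -1.0]]   # two time steps, two features
    weights = [[1.0, 0.0], [0.0, 1.0]]   # placeholder classifier weights
    bias = [0.0, 0.0]
    print(classify(states, weights, bias))
```

Max-pooling makes the classifier input independent of the sequence length, so the same softmax layer serves short and long inputs.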
3 Datasets & Experimental Setup
In this section, we describe our datasets and experimental setup. We use text classification datasets from state-of-the-art on-device evaluations: MRDA (Shriberg et al., 2004), ATIS (Tür et al., 2010), AG News (Zhang et al., 2015b), and Yahoo! Answers (Zhang et al., 2015b). Table 2 shows the characteristics of each dataset.
|Task||# Classes||Avg. length||Train||Test|
|MRDA (Dialog act)|| || ||78k||15k|
|ATIS (Intent prediction)|| || || || |
|AG (News Categorization)||4||38||120k||7.6k|
|Y!A (Yahoo! Answers Categorization)||10||108||1400k||60k|
We train ProFormer on each classification task individually and report accuracy on the corresponding test set. For the LSH projection operation, we fix the projection size, the n-gram size (5), and the skip-gram size (1). For the LPA layer, we experiment with two values of the group factor, $K=1$ and $K=4$, where $K=1$ corresponds to the null operation in the LPA layer, which simply passes the word LSH projection representations to the transformer layer. For the transformer layer, we fix the number of layers to 2 and use a single size for all layers (including the intermediate size of the dense layer).

Footnote 2: The rest of the parameters are the same as those used in the BERT-base model (Devlin et al., 2018).
We compare our model with previous state-of-the-art neural architectures, including on-device approaches. We also fine-tune the pretrained 12-layer BERT-base model (Devlin et al., 2018) on all classification tasks and compare it to our model. BERT-base consists of 12 layers of transformer blocks (Vaswani et al., 2017) and is pretrained in an unsupervised manner on a large corpus (BooksCorpus (Zhu et al., 2015) and English Wikipedia) using a masked language model objective. We fine-tune the pretrained BERT-base (Devlin et al., 2018) on each of the classification tasks. For training, we use Adam with weight decay, learning rate warmup over the initial training steps, and linear decay of the learning rate. We apply dropout on all layers.
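The warmup-then-linear-decay schedule mentioned above can be sketched as a function of the training step. The base learning rate, warmup length and total step count used here are placeholders, not the values used in the experiments, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Warmup phase: scale up proportionally to the step count.
        return base_lr * step / warmup_steps
    # Decay phase: shrink linearly with the number of remaining steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Placeholder settings for illustration only.
    for step in (0, 50, 100, 550, 1000):
        print(step, learning_rate(step, base_lr=1.0, warmup_steps=100,
                                  total_steps=1000))
```

The schedule peaks exactly at the end of warmup and reaches zero at the final step; steps beyond `total_steps` are clamped to zero.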
Tables 3 and 4 show the results on the MRDA & ATIS short text classification and AG & Y!A long text classification tasks. We compare our approach, ProFormer, against prior state-of-the-art on-device works, BERT-base, and other non-on-device neural approaches.
Overall, ProFormer improved upon non-on-device neural models while keeping a very small memory footprint and high accuracy. This is notable because ProFormer can be deployed directly to memory-constrained devices like phones, watches and IoT devices while still maintaining high accuracy. ProFormer also improved upon prior on-device state-of-the-art neural approaches like SGNN (Ravi and Kozareva, 2018) and SGNN++ (Ravi and Kozareva, 2019), reaching over 35% improvement on long text classification. Similarly, it improved over the on-device ProSeqo models (Kozareva and Ravi, 2019) on ATIS, AG and Y!A, and reached comparable performance on MRDA. In addition to the quality improvements, ProFormer also keeps a smaller memory footprint than ProSeqo, SGNN and SGNN++.
In addition to the non-on-device and on-device neural comparisons, we also compare against BERT-base. Our experiments show that although the 12-layer fine-tuned BERT-base (Devlin et al., 2018) model converged to the state-of-the-art in almost all of the tasks, ProFormer converges to % BERT-base’s performance on average while occupying only % of BERT-base’s memory. ProFormer has million parameters, while BERT-base has million. For fair comparison, we also test ProFormer with , which only occupies % the memory footprint of -layer BERT-base model and reduces the computation overhead by times. The embedding lookup table occupies nearly million parameters out of million parameters in the -layer BERT model. We notice that the $K=4$ model performs slightly worse than the $K=1$ model, indicating information loss in the LPA layer. Overall, our experiments demonstrate that ProFormer reaches better performance than prior non-on-device and on-device neural approaches, and comparable performance to BERT-base models, while preserving a smaller memory footprint.
Table 3: Accuracy on the short text classification tasks (MRDA and ATIS).
|Model||MRDA||ATIS|
|ProFormer ($K$=1) (our model)||89.3||98.2|
|ProFormer ($K$=4) (our model)||86.7||97.0|
|BERT-base (Devlin et al., 2018)||90.1||98.3|
|ProSeqo (Kozareva and Ravi, 2019) (on-device)||90.1||97.8|
|SGNN++ (Ravi and Kozareva, 2019) (on-device)||87.3||93.7|
|SGNN (Ravi and Kozareva, 2018) (on-device)||86.7||88.9|
|RNN (Khanpour et al., 2016)||86.8||-|
|RNN+Attention (Ortega and Vu, 2017)||84.3||-|
|CNN (Lee and Dernoncourt, 2016)||84.6||-|
|GatedIntentAtten. (Goo et al., 2018)||-||94.1|
|GatedFullAtten. (Goo et al., 2018)||-||93.6|
|JointBiLSTM (Hakkani-Tur et al., 2016)||-||92.6|
|Atten.RNN (Liu and Lane, 2016)||-||91.1|
Table 4: Accuracy on the long text classification tasks (AG and Y!A).
|Model||AG||Y!A|
|ProFormer ($K$=1) (our model)||92.0||72.8|
|ProFormer ($K$=4) (our model)||91.5||71.1|
|BERT-base (Devlin et al., 2018)||94.5||73.8|
|ProSeqo (Kozareva and Ravi, 2019) (on-device)||91.5||72.4|
|SGNN (Ravi and Kozareva, 2018) (on-device)||57.6||36.5|
|FastText-full (Joulin et al., 2016)||92.5||72.3|
|CharCNNLargeWithThesau. (Zhang et al., 2015a)||90.6||71.2|
|CNN+NGM (Bui et al., 2018)||86.9||-|
|LSTM-full (Zhang et al., 2015a)||86.1||70.8|
We proposed ProFormer, a novel on-device neural network that combines LSH projection based text representations with a transformer architecture and a locally projected self-attention mechanism that captures long-range sentence dependencies. Overall, ProFormer yields a low memory footprint and reduces computation quadratically. In a series of experimental evaluations on short and long text classification, we show that ProFormer improved upon prior neural models and on-device work such as SGNN (Ravi and Kozareva, 2018), SGNN++ (Ravi and Kozareva, 2019) and ProSeqo (Kozareva and Ravi, 2019). ProFormer reached comparable performance to our BERT-base implementation while producing considerably more compact models, which demonstrates both the effectiveness and the compactness of our neural model.
- Massively multilingual neural machine translation in the wild: findings and challenges. CoRR abs/1907.05019.
- Neural graph learning: training neural networks using graphs. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, pp. 64–71.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 753–757.
- Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association (INTERSPEECH 2016).
- FastText.zip: compressing text classification models. CoRR abs/1612.03651.
- Dialogue act classification in domain-independent conversations using a deep recurrent neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2012–2021.
- ProSeqo: projection sequence networks for on-device text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3892–3901.
- Sequential short-text classification with recurrent and convolutional neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 515–520.
- Attention-based recurrent neural network models for joint intent detection and slot filling. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association (INTERSPEECH 2016).
- Multi-task deep neural networks for natural language understanding. CoRR abs/1901.11504.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- Pruning a BERT-based question answering model. ArXiv abs/1910.06360.
- Neural-based context representation learning for dialog act classification. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 247–252.
- Language models are unsupervised multitask learners.
- Self-governing neural networks for on-device short text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 804–810.
- On-device structured and context partitioned projection networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3784–3793.
- ProjectionNet: learning efficient on-device deep networks using neural projections. CoRR abs/1708.00630.
- Efficient on-device models using neural projections. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 5370–5379.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv abs/1910.01108.
- The ICSI meeting recorder dialog act (MRDA) corpus. In Proceedings of the SIGDIAL 2004 Workshop, The 5th Annual Meeting of the Special Interest Group on Discourse and Dialogue, April 30 - May 1, 2004, Cambridge, Massachusetts, USA, pp. 97–100.
- What is left to be understood in ATIS? In Proceedings of 2010 IEEE Spoken Language Technology Workshop (SLT), pp. 19–24.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR abs/1804.07461.
- XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.
- Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 649–657.
- Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657.
- Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 19–27.