Tiny Neural Models for Seq2Seq

08/07/2021 · by Arun Kandoor, et al. · Google

Semantic parsing models with applications in task-oriented dialog systems require efficient sequence-to-sequence (seq2seq) architectures to run on-device. To this end, we propose a projection-based encoder-decoder model referred to as pQRNN-MAtt. Previous studies of projection methods were restricted to encoder-only models, and we believe this is the first study extending them to seq2seq architectures. The resulting quantized models are less than 3.5MB in size and are well suited for on-device latency-critical applications. We show that on MTOP, a challenging multilingual semantic parsing dataset, the average model performance surpasses that of an LSTM-based seq2seq model using pre-trained embeddings, despite being 85x smaller. Furthermore, the model can be an effective student for distilling large pre-trained models such as T5/BERT.




1 Introduction

Privacy concerns and connectivity issues have spurred interest in on-device neural applications. Neural semantic parsing is one such problem: it converts natural language into machine-executable logical forms usable in applications such as voice assistants. Though there is much research on advancing the state of the art in neural semantic parsing, there is little research on achieving the same high-quality results within the compute and memory capabilities of edge devices.

Neural seq2seq models employ an encoder-decoder architecture, in which the encoder converts word tokens to a latent representation. This is then fed to a decoder to generate output tokens from a target vocabulary. Initial seq2seq models encoded all the input sequence information into a single state, which was provided to the decoder for generating the target sequence (Sutskever et al., 2014). More recent approaches, such as Bahdanau et al. (2016), employed an attention mechanism that makes use of all the encoder outputs, thereby improving model performance.

Current state-of-the-art models for semantic parsing are attention-based pointer networks (Vinyals et al., 2015; Rongali et al., 2020; Li et al., 2020) that achieve impressive performance on servers, with minimal to no restrictions on model size and inference time. The same models, however, are not suitable for inference on edge devices.

Another aspect of these models is the use of embedding tables for word representations. Embedding tables grow with the vocabulary and do not scale well for on-device applications. As previous studies have demonstrated (Kaliamoorthi et al., 2019, 2021; Ravi, 2017; Ravi and Kozareva, 2018), text projection is an effective alternative to embedding tables for on-device natural language processing problems. In this work, we extend text projections to seq2seq problems such as neural semantic parsing by combining them with efficient decoder architectures.

A main motivation of this work is to identify effective neural encoding and decoding architectures that operate on textual input and are suitable for on-device applications. In our experiments, and based on previous work (Kaliamoorthi et al., 2021), a projection layer combined with a QRNN (Bradbury et al., 2016) encoder proves to be an effective combination for text classification and labeling tasks. We extend this model with a Merged Attention (MAtt) decoder (Zhang et al., 2019) and demonstrate that the resulting architecture, which we refer to as pQRNN-MAtt, is a promising candidate for on-device neural semantic parsing and code generation tasks. Experiments on the multilingual MTOP dataset show that the average exact match accuracy of the pQRNN-MAtt model is higher than that of LSTM models with pre-trained XLU embeddings, despite the former being 85x smaller than the latter.

2 Related Work

Recent work on neural semantic parsing is largely based on encoder-decoder models and has shown promising results on tasks such as machine translation (Sutskever et al., 2014) and image captioning (Vinyals et al., 2014; Bahdanau et al., 2016). Luong et al. (2015) improved these architectures by implementing an attention mechanism in the decoder.

One major drawback of these models was their inability to learn good parameters for long-tail entities. This was addressed with the advent of Pointer Networks (Vinyals et al., 2015), in which the decoder decides to either copy a token from the input query or generate a token from the output vocabulary. Rongali et al. (2020) and Li et al. (2020) employed this model to achieve impressive results on public datasets.
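The copy-versus-generate decision can be illustrated with a minimal sketch (toy scores, vocabulary, and function names are assumptions for illustration, not the cited implementations):

```python
import numpy as np

def decode_step(scores, vocab, source_tokens):
    """One illustrative pointer-network decoding step.

    `scores` concatenates generation logits over the output vocabulary
    with copy logits over the source positions; the argmax decides
    whether to generate a vocabulary token or copy a source token.
    """
    best = int(np.argmax(scores))
    if best < len(vocab):
        return vocab[best]                      # generate from output vocabulary
    return source_tokens[best - len(vocab)]     # copy from the input query

vocab = ["[IN:GET_WEATHER", "[SL:LOCATION", "]"]
source = ["weather", "in", "paris"]
# Three generation logits followed by three copy logits; the highest
# score falls on the last copy slot, so the source token is copied.
scores = np.array([0.1, 0.2, 0.05, 0.3, 0.1, 2.0])
decode_step(scores, vocab, source)
```

In practice the scores come from a softmax over the joint generate/copy space, but the selection logic is the same.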

All these studies use architectures based on recurrent neural networks or Transformers (Vaswani et al., 2017) and some form of pre-trained token representations (Mikolov et al., 2017), which are not well suited for on-device applications because the model size is dominated by the embedding table; a large embedding table is necessary to reach high quality with these architectures.

Projection-based methods (Kaliamoorthi et al., 2019, 2021; Ravi, 2017; Ravi and Kozareva, 2018) have been studied extensively for on-device applications, essentially replacing embedding tables with hashing-based techniques. While promising results have been shown on problems that can be solved with a neural encoder alone, to our knowledge the applicability of these methods to seq2seq tasks like semantic parsing has not been studied.

To address this shortcoming, we complement projection-based methods with efficient decoder architectures like Merged Attention (MAtt) (Zhang et al., 2019) and study the overall performance of the model on the semantic parsing task.

3 Model Architecture

Figure 1: Pointer generator network with pQRNN encoder and MAtt decoder.

Our model architecture combines a pQRNN encoder (Kaliamoorthi et al., 2021) with a MAtt decoder (Zhang et al., 2019), arranged as a pointer-generator network as proposed in Rongali et al. (2020).

As illustrated in Figure 1, the encoder block consists of a projection stage that converts source tokens to a sequence of ternary vectors (Kaliamoorthi et al., 2021). The ternary representation is then fed to a dense layer (bottleneck) with activation. Since the projection features are not trainable, the bottleneck layer allows the network to learn the semantic similarity needed for the task. A stack of bidirectional QRNN layers (Bradbury et al., 2016) is then used to learn a contextual representation of the input.
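The projection stage can be illustrated with a toy hashing scheme (a simplification, not the seq_flow_lite operator; the hash function, bit-mapping, and feature dimension here are assumptions):

```python
import hashlib
import numpy as np

def ternary_projection(token, dim=16):
    """Toy ternary projection: hash each token to a fixed-size vector
    with entries in {-1, 0, +1}. No trainable embedding table is needed,
    and the output size is independent of the vocabulary size.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    feats = np.empty(dim, dtype=np.int8)
    for i in range(dim):
        # Take two bits of hash output per feature and map them
        # into the ternary alphabet {-1, 0, +1}.
        two_bits = (digest[i // 4] >> (2 * (i % 4))) & 0b11
        feats[i] = {0: -1, 1: 0, 2: 1, 3: 0}[two_bits]
    return feats
```

Because the mapping is a deterministic hash, identical tokens always produce identical vectors, and the downstream bottleneck layer is what learns task-specific semantics.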

An important modification to the decoder is that, since projections lack semantic and contextual information, the encoder hidden states are used as the decoder input embeddings for copy tokens. So while decoding at time step t, if the previously decoded token is a copy of the source token at encoder step i, the corresponding encoder hidden state is chosen as the input to the decoder.
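This input selection can be sketched as follows (names and shapes are assumptions for illustration):

```python
import numpy as np

def next_decoder_input(prev_token_id, vocab_size, encoder_states, embeddings):
    """Pick the decoder input for the next step. Generated tokens use a
    learned embedding; copy tokens reuse the encoder hidden state at the
    copied source position, since projections carry no semantics.
    """
    if prev_token_id < vocab_size:
        # Generated token: look up its learned embedding.
        return embeddings[prev_token_id]
    # Copy token: ids past the vocabulary index into source positions.
    return encoder_states[prev_token_id - vocab_size]
```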

On the decoder output at step t, with decoder hidden state d_t, as proposed in Li et al. (2020), the final output distribution is a mixture of the generation and copy distributions:

P(y_t) = p_t · P_gen(y_t) + (1 − p_t) · P_copy(y_t)

where P_gen is a softmax over the output vocabulary, P_copy is derived from the attention weights over the source positions, and the mixture weight p_t is predicted from d_t.

Figure 1 illustrates the end-to-end flow for decoding at step t. The embedding for the decoder output token y_t, which is the input for the next step, is chosen based on whether y_t is a copy or a generate token.
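The mixture of the two distributions can be sketched as follows (the copy distribution is represented here as extra slots appended after the vocabulary; the gate value and function names are assumptions for illustration):

```python
import numpy as np

def output_distribution(p_gen, gen_probs, copy_attn):
    """Final pointer-generator output distribution (illustrative sketch).

    Generation probabilities are scaled by the gate p_gen and the copy
    attention weights by (1 - p_gen); since both inputs sum to 1, the
    mixed distribution still sums to 1.
    """
    return np.concatenate([p_gen * gen_probs, (1.0 - p_gen) * copy_attn])

dist = output_distribution(0.7, np.array([0.5, 0.5]), np.array([0.2, 0.8]))
```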

Exact Match Accuracy
pQRNN+MAtt #Params en es fr de hi th avg
Top1 3.3M (8bit) 77.8 69.9 66.6 63.7 64.5 64.3 67.8
Top2 3.3M (8bit) 81.2 73.7 70.9 68.7 68.3 69.2 72.0
Top3 3.3M (8bit) 81.9 74.8 72.0 70.1 69.6 70.5 73.2
Top4 3.3M (8bit) 82.2 75.3 72.3 70.6 70.3 70.9 73.6
XLU 70M (float) 77.8 66.5 65.6 61.5 61.5 62.8 66.0
XLM-R 550M (float) 83.9 76.9 74.7 71.2 70.2 71.2 74.7
Table 1: Exact Match Accuracy on compositional decoupled representation for 6 languages. Reference numbers have been taken from Table 3 Li et al. (2020).
Intent Accuracy / Slot F1
pQRNN+MAtt en es fr de hi th
Top1 94.7/88.0 93.1/80.1 91.1/78.4 91.5/76.2 91.0/76.5 91.5/79.5
Top2 95.6/89.2 94.0/81.6 92.5/80.0 92.7/78.6 92.0/78.3 92.7/81.3
Top3 95.9/89.4 94.3/82.0 92.9/80.6 93.2/79.2 92.5/78.7 93.1/81.9
Top4 96.0/89.5 94.4/82.2 93.0/80.7 93.5/79.4 92.7/79.1 93.2/82.1
Table 2: Intent Accuracy / Slot F1 for models in Table 1.

3.1 Quantization

Effective quantization techniques allow end-to-end models to run inference using integer-only arithmetic and reduce model footprints. We adapted the quantization scheme proposed in Jacob et al. (2017), which allows us to simulate quantization during training and learn the ranges for weights and activations in the model.

In this setup, training happens in floating-point arithmetic, but the forward pass also simulates 8-bit integer quantization using fake-quantization TensorFlow ops (Abadi et al., 2015), which allows us to collect weight and activation statistics. The TensorFlow Lite converter tool then uses these statistics to construct an 8-bit TensorFlow Lite model, which is used for running inference.
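The training-time simulation amounts to a quantize-dequantize round trip in the style of Jacob et al. (2017). A plain NumPy sketch (a simplification of the actual TensorFlow fake-quant ops, with ranges taken from the tensor rather than learned):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Map float values onto an affine 8-bit grid and back to float.

    During training this lets the forward pass see quantization error
    while gradients still flow in floating point, so the model learns
    ranges that survive integer-only inference.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale  # dequantized float values

x = np.array([-1.0, -0.25, 0.0, 0.5, 1.0])
x_q = fake_quantize(x)
```

The round-trip error of each value is bounded by roughly one quantization step, which is what the training loss gets to observe and compensate for.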

4 Experiments

We evaluate model performance on the MTOP dataset from Li et al. (2020) on all 6 languages, using only target-language training data. Exact match accuracy, intent accuracy, and slot F1 metrics are reported for all models. As we could not verify whether the metrics presented in Li et al. (2020) were the Top1 result from the decoder output, we chose to present TopK (K=4) results for comparison. We conduct the experiments using the compositional decoupled representation as labels.

4.1 Model configuration

The model uses the open source projection operator (https://github.com/tensorflow/models/tree/master/research/seq_flow_lite) with feature dimension . The projection output is then fed to a dense layer (bottleneck) with output width . The dense layer output is then fed to the QRNN stack of layers, each with state size and convolution kernel width set to .

The decoder input embedding size is set to , which is followed by a MAtt decoder stack of size . Each decoder layer has model dimension set to and the hidden dimension in the feed-forward network set to . We averaged across 4 heads when computing the copy probabilities.
5 Results

Table 1 shows the TopK exact match accuracy for all 6 languages on the compositional decoupled representation. On average, the Top1 results outperform the LSTM baseline model with XLU embeddings from Li et al. (2020). For K>1, the exact match accuracy approaches that of the large pre-trained XLM-R model.

The model's efficiency is indicated by the #Params column, which roughly maps to the model footprint and inference time.

Table 2 shows the top-level intent accuracy and slot F1 metrics for the TopK results on the compositional decoupled representation. Li et al. (2020) does not report this combination, but overall the model proves effective at generating the intent and arguments pertaining to the given query.

6 Conclusion

We extend projection-based representations to on-device seq2seq models using a QRNN encoder and a MAtt decoder. Evaluations on the MTOP dataset show the model to be highly effective compared to LSTM models trained with pre-trained embeddings, despite being 85x smaller.

Future directions include employing distillation techniques (Kaliamoorthi et al., 2021) to improve the model further and exploring different tokenization schemes for multilingual projections.


We would like to thank our colleagues Prabhu Kaliamoorthi, Erik Vee, Edgar Gonzàlez i Pellicer, Evgeny Livshits, Ashwini Venkatesh, Derik Clive, Edward Li, Milan Lee and the Learn2Compress team for helpful discussions related to this work. We would also like to thank Amarnag Subramanya, Andrew Tomkins and Rushin Shah for their leadership and support.


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: §3.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2016) Neural machine translation by jointly learning to align and translate. External Links: 1409.0473 Cited by: §1, §2.
  • J. Bradbury, S. Merity, C. Xiong, and R. Socher (2016) Quasi-recurrent neural networks. CoRR abs/1611.01576. External Links: Link, 1611.01576 Cited by: §1, §3.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and D. Kalenichenko (2017) Quantization and training of neural networks for efficient integer-arithmetic-only inference. CoRR abs/1712.05877. External Links: Link, 1712.05877 Cited by: §3.1.
  • P. Kaliamoorthi, S. Ravi, and Z. Kozareva (2019) PRADO: projection attention networks for document classification on-device. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5012–5021. External Links: Link, Document Cited by: §1, §2.
  • P. Kaliamoorthi, A. Siddhant, E. Li, and M. Johnson (2021) Distilling large language models into tiny and effective students using pQRNN. CoRR abs/2101.08890. External Links: Link, 2101.08890 Cited by: Tiny Neural Models for Seq2Seq, §1, §1, §2, §3, §3, §6.
  • H. Li, A. Arora, S. Chen, A. Gupta, S. Gupta, and Y. Mehdad (2020) MTOP: A comprehensive multilingual task-oriented semantic parsing benchmark. CoRR abs/2008.09335. External Links: Link, 2008.09335 Cited by: Tiny Neural Models for Seq2Seq, §1, §2, Table 1, §3, §4, §5, §5.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. CoRR abs/1508.04025. External Links: Link, 1508.04025 Cited by: §2.
  • T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin (2017) Advances in pre-training distributed word representations. CoRR abs/1712.09405. External Links: Link, 1712.09405 Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR abs/1910.10683. External Links: Link, 1910.10683 Cited by: Tiny Neural Models for Seq2Seq.
  • S. Ravi and Z. Kozareva (2018) Self-governing neural networks for on-device short text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 887–893. External Links: Link, Document Cited by: §1, §2.
  • S. Ravi (2017) ProjectionNet: learning efficient on-device deep networks using neural projections. CoRR abs/1708.00630. External Links: Link, 1708.00630 Cited by: §1, §2.
  • S. Rongali, L. Soldaini, E. Monti, and W. Hamza (2020) Don’t parse, generate! A sequence to sequence architecture for task-oriented semantic parsing. CoRR abs/2001.11458. External Links: Link, 2001.11458 Cited by: §1, §2, §3.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. CoRR abs/1409.3215. External Links: Link, 1409.3215 Cited by: §1, §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: §2.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In NIPS, pp. 2692–2700. External Links: Link Cited by: §1, §2.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2014) Show and tell: A neural image caption generator. CoRR abs/1411.4555. External Links: Link, 1411.4555 Cited by: §2.
  • B. Zhang, I. Titov, and R. Sennrich (2019) Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 898–909. External Links: Link, Document Cited by: §1, §2, §3.