Privacy concerns and connectivity issues have spurred interest in on-device neural applications. Neural semantic parsing is one such problem: it converts natural language into machine-executable logical forms usable in applications such as voice assistants. Although there is much research on advancing the state of the art in neural semantic parsing, there is little on achieving the same high-quality results within the compute and memory constraints of edge devices.
Neural seq2seq models employ an encoder-decoder architecture, in which the encoder converts word tokens into a latent representation that is then fed to a decoder to generate output tokens over a target vocabulary. Early seq2seq models encoded all of the input-sequence information into a single state that was provided to the decoder for generating the target sequence (Sutskever et al., 2014). More recent approaches such as Bahdanau et al. (2016) employ an attention mechanism that makes use of all the encoder outputs, thereby improving model performance.
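For concreteness, the attention step described above can be sketched with a minimal dot-product variant (a simplification of the additive mechanism of Bahdanau et al.; the function names and dimensions here are illustrative, not from any of the cited papers):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(encoder_outputs, decoder_state):
    """Dot-product attention: score each encoder output against the
    current decoder state and return a weighted context vector."""
    scores = encoder_outputs @ decoder_state   # (src_len,)
    weights = softmax(scores)                  # attention distribution
    context = weights @ encoder_outputs        # weighted sum -> (dim,)
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))   # 5 source tokens, hidden size 8
dec = rng.normal(size=(8,))
context, weights = attend(enc, dec)
```

The decoder can then condition each output step on `context` instead of a single fixed encoder state.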
Current state-of-the-art models for semantic parsing are attention-based pointer networks (Vinyals et al., 2015; Rongali et al., 2020; Li et al., 2020) and achieve impressive performance on servers, with minimal to no restrictions on model size and inference time. The same models, however, are not suitable for inference on edge devices.
Another aspect of these models is the use of embedding tables for word representations. Embedding tables grow with the vocabulary and do not scale well for on-device applications. As previous studies have demonstrated (Kaliamoorthi et al., 2019, 2021; Ravi, 2017; Ravi and Kozareva, 2018), text projection is an effective alternative to embedding tables for on-device natural language processing. In this work, we extend text projections to seq2seq problems such as neural semantic parsing by combining them with efficient decoder architectures.
A main motivation of this work is to identify effective neural encoding and decoding architectures that operate on textual input and are suitable for on-device applications. In our experiments, and consistent with previous work (Kaliamoorthi et al., 2021), projections combined with a QRNN (Bradbury et al., 2016) encoder prove to be an effective combination for text classification and labeling tasks. We extend this model with a Merged Attention (MAtt) decoder (Zhang et al., 2019) and demonstrate that the resulting architecture, which we refer to as pQRNN-MAtt, is a promising candidate for on-device neural semantic parsing and code generation. Experiments on the multilingual MTOP dataset show that the average exact match accuracy of the pQRNN-MAtt model is higher than that of LSTM models with pre-trained XLU embeddings, despite the former being 85x smaller than the latter.
2 Related Work
Recent work on neural semantic parsing is largely based on encoder-decoder models, which have shown promising results on tasks such as machine translation (Sutskever et al., 2014) and image captioning (Vinyals et al., 2014; Bahdanau et al., 2016). Luong et al. (2015) improved these architectures with an attention mechanism in the decoder.
One major drawback of these models was their inability to learn good parameters for long-tail entities. This was addressed by pointer networks (Vinyals et al., 2015), in which the decoder decides either to copy a token from the input query or to generate a token from the output vocabulary. Rongali et al. (2020) and Li et al. (2020) employed this model to achieve impressive results on public datasets.
All these studies use architectures based on recurrent neural networks or Transformers (Vaswani et al., 2017) together with some form of pre-trained token representations (Mikolov et al., 2017). They are not well suited for on-device applications because the model size is dominated by the embedding table, and a large embedding table is necessary to reach high quality with these architectures.
Projection-based methods (Kaliamoorthi et al., 2019, 2021; Ravi, 2017; Ravi and Kozareva, 2018) have been studied extensively for on-device applications; they essentially replace embedding tables with hashing-based techniques. While promising results have been shown on problems that can be solved with a neural encoder alone, the applicability of these methods to seq2seq tasks like semantic parsing has not been studied.
To address this shortcoming, we complement projection-based methods with efficient decoder architectures such as Merged Attention (MAtt) (Zhang et al., 2019) and study the overall performance of the model on the semantic parsing task.
3 Model Architecture
As illustrated in Figure 1, the encoder block consists of a projection stage that converts source tokens into a sequence of ternary vectors (Kaliamoorthi et al., 2021). The ternary representation is then fed to a dense layer (bottleneck) with an activation. Since the projection features are not trainable, the bottleneck layer allows the network to learn the semantic similarity needed for the task. A stack of bidirectional QRNNs (Bradbury et al., 2016) then learns a contextual representation of the input.
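A hash-based ternary projection can be sketched as follows. This is a simplified stand-in for the open-source seq_flow_lite operator, not its actual implementation: the hash function, bit layout, and dimension are chosen purely for illustration.

```python
import hashlib
import numpy as np

def ternary_projection(tokens, dim=16):
    """Map each token to a fixed {-1, 0, +1}^dim vector derived from
    its hash, so no trainable embedding table is needed.  Two hash
    bits per feature give values in {-1, 0, 0, +1}."""
    out = np.zeros((len(tokens), dim), dtype=np.int8)
    for i, tok in enumerate(tokens):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        for j in range(dim):
            bits = (digest[(2 * j) // 8] >> ((2 * j) % 8)) & 0b11
            out[i, j] = (bits & 1) - ((bits >> 1) & 1)
    return out

feats = ternary_projection(["set", "an", "alarm"])
```

Because the features are deterministic functions of the token, the vocabulary never has to be stored; the trainable bottleneck layer on top is what recovers task-specific semantics.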
An important modification to the decoder arises because projections lack semantic and contextual information: the encoder hidden states are used as decoder input embeddings for copy tokens. That is, while decoding at time step t, if the previously decoded token is a copy token pointing to encoder step j, the corresponding encoder hidden state h_j is chosen as the input to the decoder.
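This input substitution amounts to a simple lookup switch. A sketch, assuming token ids at or above a `copy_offset` denote copy tokens (the naming and id convention are ours, for illustration only):

```python
import numpy as np

def decoder_input(prev_token, embeddings, encoder_states, copy_offset):
    """Pick the decoder input for the next step: generate-vocabulary
    tokens use a learned embedding, while a copy token pointing at
    source position j reuses the encoder hidden state h_j."""
    if prev_token >= copy_offset:
        src_pos = prev_token - copy_offset
        return encoder_states[src_pos]   # encoder state as embedding
    return embeddings[prev_token]        # ordinary target-vocab embedding

emb = np.eye(4)                          # toy 4-token generate vocabulary
enc = np.arange(12.0).reshape(3, 4)      # 3 encoder hidden states
x = decoder_input(1, emb, enc, copy_offset=4)   # generate token 1
y = decoder_input(5, emb, enc, copy_offset=4)   # copy token -> source pos 1
```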
On the decoder output side, at step t with decoder hidden state d_t, the final output distribution is, as proposed in Li et al. (2020), a mixture of the generation and copy distributions:

P(y_t) = p_gen * P_gen(y_t) + (1 - p_gen) * P_copy(y_t)

where P_gen(y_t) is a softmax over the target vocabulary, P_copy(y_t) is derived from the decoder-encoder attention weights over the source tokens, and the mixture weight p_gen is computed from the decoder hidden state d_t.
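A pointer-generator mixture of this form can be sketched as follows. This is a generic sketch, not the paper's exact parameterization; `p_gen` is assumed to be already computed from the decoder state.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixture_distribution(gen_logits, attn_weights, src_token_ids,
                         vocab_size, p_gen):
    """Blend the softmax over the target vocabulary with a copy
    distribution obtained by scattering the decoder-encoder attention
    weights onto the source token ids."""
    p_vocab = softmax(gen_logits)                    # generation distribution
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_token_ids, attn_weights)   # copy distribution
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

dist = mixture_distribution(
    gen_logits=np.array([0.1, 0.2, 0.3, 0.4]),
    attn_weights=np.array([0.5, 0.5]),   # attention over 2 source tokens
    src_token_ids=np.array([2, 3]),      # their ids in the joint vocab
    vocab_size=4,
    p_gen=0.7,
)
```

`np.add.at` is used instead of plain indexing so that repeated source tokens accumulate their attention mass.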
Figure 1 illustrates the end-to-end flow for decoding at step t. The embedding for the decoder output token y_t, which is the input for the next step, is chosen based on whether y_t is a copy or a generate token.
3.1 Quantization

Effective quantization techniques allow end-to-end models to run inference using integer-only arithmetic and reduce the model footprint. We adapted the quantization scheme proposed in Jacob et al. (2017), which allows us to simulate quantization during training and learn the ranges for the weights and activations in the model.
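The affine scheme of Jacob et al. (2017) can be illustrated with a minimal "fake quantization" pass over a tensor. This is a sketch of the simulation idea only, not the paper's training setup: floats are mapped to 8-bit integers via a scale and zero point, then dequantized, so training sees the rounding error.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulated affine quantization: quantize to num_bits integers
    and immediately dequantize.  The range is forced to contain zero
    so that zero is representable exactly."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 2.0])
xq = fake_quantize(x)
```

During training the ranges `lo`/`hi` would be learned or tracked per tensor; at inference only the integer values, scale, and zero point are needed.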
4 Experiments

We evaluate model performance on the MTOP dataset from Li et al. (2020) for all six languages, using only target-language training data. Exact match accuracy, intent accuracy, and slot F1 are reported for all models. As we could not verify whether the metrics presented in Li et al. (2020) were Top1 results from the decoder output, we present TopK (K=4) results for comparison. We conduct the experiments using the compositional decoupled representation as labels.
4.1 Model configuration
The model uses the open-source projection operator (https://github.com/tensorflow/models/tree/master/research/seq_flow_lite) with feature dimension . The projection output is fed to a dense layer (bottleneck) with output width . The dense-layer output is then fed to a QRNN stack of layers, each with state size and convolution kernel width set to . The decoder input embedding size is set to , followed by a MAtt decoder stack of size . Each decoder layer has its model dimension set to and the hidden dimension of its feed-forward network set to . We averaged across 4 heads when computing the copy probabilities.
5 Results

Table 1 shows the TopK exact match accuracy for all six languages on the compositional decoupled representation. On average, the Top1 results outperform the LSTM baseline with XLU embeddings from Li et al. (2020). For K>1, the exact match accuracy approaches that of the large pre-trained XLM-R model.
The model's efficiency is indicated by the Params column, which maps roughly to model footprint and inference time.
6 Conclusion

We extend projection-based representations to on-device seq2seq models using a QRNN encoder and a MAtt decoder. Evaluations on the MTOP dataset show the model to be highly effective compared to LSTM models trained with pre-trained embeddings, despite being 85x smaller.
Future directions include employing distillation techniques (Kaliamoorthi et al., 2021) to improve the model further and exploring different tokenization schemes for multilingual projections.
Acknowledgments

We would like to thank our colleagues Prabhu Kaliamoorthi, Erik Vee, Edgar Gonzàlez i Pellicer, Evgeny Livshits, Ashwini Venkatesh, Derik Clive, Edward Li, Milan Lee and the Learn2Compress team for helpful discussions related to this work. We would also like to thank Amarnag Subramanya, Andrew Tomkins and Rushin Shah for their leadership and support.
References

- Martín Abadi et al. (2015). TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio (2016). Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher (2016). Quasi-recurrent neural networks. CoRR abs/1611.01576.
- Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko (2017). Quantization and training of neural networks for efficient integer-arithmetic-only inference. CoRR abs/1712.05877.
- Prabhu Kaliamoorthi, Sujith Ravi, and Zornitsa Kozareva (2019). PRADO: projection attention networks for document classification on-device. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5012–5021.
- Prabhu Kaliamoorthi, Aditya Siddhant, Edward Li, and Melvin Johnson (2021). Distilling large language models into tiny and effective students using pQRNN. CoRR abs/2101.08890.
- Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad (2020). MTOP: a comprehensive multilingual task-oriented semantic parsing benchmark. CoRR abs/2008.09335.
- Minh-Thang Luong, Hieu Pham, and Christopher D. Manning (2015). Effective approaches to attention-based neural machine translation. CoRR abs/1508.04025.
- Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin (2017). Advances in pre-training distributed word representations. CoRR abs/1712.09405.
- Colin Raffel et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR abs/1910.10683.
- Sujith Ravi and Zornitsa Kozareva (2018). Self-governing neural networks for on-device short text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 887–893.
- Sujith Ravi (2017). ProjectionNet: learning efficient on-device deep networks using neural projections. CoRR abs/1708.00630.
- Subendhu Rongali, Luca Soldaini, Emilio Monti, and Wael Hamza (2020). Don't parse, generate! A sequence to sequence architecture for task-oriented semantic parsing. CoRR abs/2001.11458.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le (2014). Sequence to sequence learning with neural networks. CoRR abs/1409.3215.
- Ashish Vaswani et al. (2017). Attention is all you need. CoRR abs/1706.03762.
- Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly (2015). Pointer networks. In NIPS, pp. 2692–2700.
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan (2014). Show and tell: a neural image caption generator. CoRR abs/1411.4555.
- Biao Zhang, Ivan Titov, and Rico Sennrich (2019). Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 898–909.