Modularity is a universal requirement for large-scale software and system design, where “a module is a unit whose structural elements are powerfully connected among themselves and relatively weakly connected to elements in other units” Baldwin and Clark (1999). In addition to independence, good software architecture emphasises interchangeability of modules, a clear understanding of the function of each module, and an unambiguous interface through which each module interacts with the larger system. In this paper, we demonstrate that widely adopted seq2seq models lack modularity, and introduce new ways of training these models with independent and interchangeable encoder and decoder modules that do not sacrifice overall system performance.
Fully differentiable seq2seq models Chan et al. (2016); Bahdanau et al. (2016, 2015); Vaswani et al. (2017) play a critical role in a wide range of NLP and speech tasks, but fail to satisfy even very basic measures of modularity between the encoder and decoder components. The decoder cross-attention averages over the continuous output representations of the encoder, and the parameters of both modules are jointly optimized through backpropagation. This causes a tight coupling and prevents a clear understanding of the function of each part. As we will show empirically, current seq2seq models lack modular interchangeability: retraining a single model with different random seeds causes the encoder and decoder modules to learn very different functions, so much so that interchanging them radically degrades overall model performance. Such tight coupling makes it difficult to measure the contributions of the individual modules or to transfer components across different domains and tasks.
In this paper, we introduce a new method that guarantees encoder-decoder modularity while keeping the model fully differentiable. We constrain the encoder outputs to a predefined discrete vocabulary space using the connectionist temporal classification (CTC) loss Graves et al. (2006), which is jointly optimized with the token-level cross-entropy loss at the decoder output. This novel use of the CTC loss discretizes the encoder output units while respecting their sequential nature. By grounding the discrete encoder output in a real-world vocabulary space, we are able to measure and analyze the encoder performance independently. We present two proposals for extending the decoder cross-attention to ingest probability distributions, either using the probability scores of different hypotheses or using their rank within a fixed beam. Combining these techniques enables us to train seq2seq models that pass three measures of modularity: clarity of interface, independence, and interchangeability.
The proposed approach combines the best of the end-to-end and the classic sequence transduction approaches by splitting models into grounded encoder modules performing translation or acoustic modeling, depending on the task, followed by language generation decoders, while preserving the full differentiability of the overall system. We present extensive experiments on the standard Switchboard speech recognition task. Our best model, while having modular encoder and decoder components, achieves competitive WERs on the standard 300h Switchboard and CallHome benchmarks.
2 Baseline Seq2seq models
2.1 Attention-Based encoder-decoder models
Encoder-decoder models Sutskever et al. (2014) factorize the joint target sequence probability into a product of per-time-step probabilities, each conditioned on the previously generated output tokens and the full input sequence. They are trained by minimizing the token-level cross-entropy (CE) loss between the true and the decoder-predicted distributions. Input sequence information is encoded into the decoder output through an attention mechanism Bahdanau et al. (2015), which is conditioned on the current decoder states and runs over the encoder output representations.
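As a concrete illustration, the cross-attention averaging described above can be sketched in a few lines of NumPy. The single-head dot-product form and all shapes here are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_state, enc_out):
    """Single-head dot-product cross-attention: the decoder state queries the
    encoder output representations and returns their weighted average."""
    scores = enc_out @ dec_state / np.sqrt(dec_state.shape[-1])  # (T,)
    weights = softmax(scores)            # soft alignment over input time steps
    return weights @ enc_out             # context vector, shape (d,)

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(7, 4))        # 7 encoder time steps, dimension 4
dec_state = rng.normal(size=(4,))        # current decoder state
ctx = cross_attention(dec_state, enc_out)
```

Because `ctx` is a weighted average of continuous encoder vectors, the decoder's behavior depends directly on the encoder's internal representation space, which is exactly the coupling this paper seeks to break.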
2.2 Models optimized with the CTC loss
Rather than producing a soft alignment between the input and target sequences, the Connectionist Temporal Classification (CTC) loss Graves et al. (2006) maximizes the log conditional likelihood by integrating over all possible monotonic alignments between both sequences.
Formally, $p(y|x) = \sum_{a \in \beta^{-1}(y)} \prod_{t=1}^{T} p(a_t|x)$, where a linear projection maps the encoder representations into the output vocabulary space, $y$ is the output label sequence, $\beta^{-1}(y)$ is the space of all possible monotonic alignments of $y$ onto $T$ time steps, and the probability of an alignment $a$ is the product of locally normalized per-time-step output probabilities $p(a_t|x)$ (we omit the extra CTC blank symbol here for clarity of presentation; Graves et al. (2006) provide the full technical treatment). The forward-backward algorithm is used for efficient computation of the marginalization sum. Only one inference step is required to generate the full target sequence in a non-autoregressive fashion through the encoder-only model.
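For intuition, the marginalization over monotonic alignments can be verified by brute force on a toy example, omitting the blank symbol as in the simplified discussion above; real implementations use the forward-backward algorithm rather than enumeration:

```python
import itertools
import numpy as np

def collapse(path):
    """Collapse consecutive repeats: (a, a, b, b, a) -> (a, b, a).
    The CTC blank symbol is omitted in this simplified sketch."""
    out = []
    for s in path:
        if not out or out[-1] != s:
            out.append(s)
    return tuple(out)

def ctc_prob_bruteforce(log_probs, target):
    """p(y|x): sum over all length-T label paths that collapse to `target`
    of the product of per-step probabilities."""
    T, V = log_probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path) == tuple(target):
            total += np.exp(sum(log_probs[t, s] for t, s in enumerate(path)))
    return total

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 3))  # T=4 encoder steps, V=3 labels (toy sizes)
log_probs = logits - np.log(np.exp(logits).sum(1, keepdims=True))
p = ctc_prob_bruteforce(log_probs, [0, 2])
loss = -np.log(p)                 # the (blank-free) CTC loss for target (0, 2)
```

Summing `ctc_prob_bruteforce` over every distinct collapsed sequence recovers exactly 1, since each path contributes to exactly one target.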
2.3 Joint CTC and Attention-Based models
In Kim et al. (2017); Karita et al. (2019), the encoder-decoder cross-entropy loss is augmented with an auxiliary CTC loss, applied through an extra linear projection of the encoder output representation into the output target space, to guide learning in early optimization phases when gradients do not yet flow smoothly from the decoder output to the encoder parameters due to misaligned cross-attention. The decoder cross-attention still acts over the encoder output representation, maintaining the tight coupling between the encoder and decoder modules.
3 Enforcing modularity in Seq2Seq models
Establishing an interpretable interface between the encoder and decoder components is the first step towards relaxing their tight coupling in seq2seq models. To achieve this goal, we force the encoder to output distributions over a pre-defined discrete vocabulary rather than communicating continuous vector representations to the decoder. This creates an information bottleneck Tishby et al. (1999) in the model, where the decoder can access the encoder's learned representations only through probability distributions over this discrete vocabulary. In addition to being interpretable, grounding the encoder outputs offers an opportunity to measure their quality independently of the decoder, if the encoder vocabulary can be mapped to the ground-truth decoder targets.
We choose an encoder output vocabulary of sub-word units derived from the target label sequences; this vocabulary may deviate from the decoder output vocabulary. To force the encoder to output probabilities in the desired vocabulary space, we use the Connectionist Temporal Classification (CTC) loss. This is a novel usage of the CTC loss: not as the main loss driving the model learning process, but as a supervised function that discretizes the encoder output space into a pre-defined discrete vocabulary. Even if the input-output relationship does not adhere to the monotonicity assumption of the CTC loss, the encoder component, as a module in the system, is not expected to solve the full problem; the decoder module should correct any mismatch in the alignment assumption through its auto-regressive generation process.
The decoder design needs to change to cope with cross-attention over probability distributions rather than continuous hidden representations. We introduce the AttPrep component inside the decoder module to prepare the internal decoder representation needed for attention over the input sequence. The AttPrep step enables us to contain the cross-attention operation entirely inside the decoder module.
The encoder module has a softmax normalization layer at the end, so that each row of its output is a probability distribution over the vocabulary. To discretize the encoder output, the CTC loss is jointly optimized with the decoder cross-entropy loss. Having distributions over a discrete vocabulary at the input of the decoder opens the space for many interesting ideas on how to harness the temporal correlations between encoder output units and common confusion patterns. We present two variants of the AttPrep component: the weighted embedding and the beam convolution.
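A minimal sketch of this interface and the joint objective; the interpolation weight `lam` is a purely illustrative placeholder, not a value from the paper:

```python
import numpy as np

def softmax(x):
    z = x - x.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

# Encoder output: one probability distribution over the vocabulary per time step.
rng = np.random.default_rng(0)
enc_probs = softmax(rng.normal(size=(5, 12)))   # (T=5 steps, |V|=12 units)
assert np.allclose(enc_probs.sum(axis=1), 1.0)  # each row is a distribution

def joint_loss(ctc_nll, ce_nll, lam=0.3):
    """Interpolate the encoder CTC loss with the decoder cross-entropy.
    `lam` is an assumed placeholder weight."""
    return lam * ctc_nll + (1.0 - lam) * ce_nll
```

The decoder never sees the encoder's hidden vectors, only `enc_probs`, which is what makes the interface interpretable and the modules separable.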
3.1 Weighted Embedding AttPrep
Given the encoder output distribution, the weighted embedding AttPrep (WEmb) computes an expected embedding per encoder step, combines it with sinusoidal positional encodings (PE) Vaswani et al. (2017), then applies a multi-head self-attention operation (MHA) to aggregate information over all time steps.
where $E \in \mathbb{R}^{|V| \times d}$ is a learned embedding matrix and $d$ is the decoder input dimension. The first operation, computing the expected embedding $\sum_{v} p_t(v) E_v$, is in fact a 1-D time-convolution operation with a receptive field of 1. It can be extended to larger receptive fields, offering the opportunity to learn local confusion patterns from the encoder output.
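A minimal NumPy sketch of the expected-embedding computation; the embedding table `E` and all dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    z = x - x.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

T, V, d = 6, 10, 8                    # time steps, vocab size, decoder dim (toy)
rng = np.random.default_rng(0)
P = softmax(rng.normal(size=(T, V)))  # encoder distributions, one row per step
E = rng.normal(size=(V, d))           # learned embedding table (assumed shape)

# Expected embedding per step: a matrix product, equivalently a 1-D time
# convolution with receptive field 1 whose kernel is the embedding table.
wemb = P @ E                          # shape (T, d), fed to PE + self-attention
```

Widening the convolution's receptive field would mix distributions from neighboring steps, which is how local confusion patterns could be captured.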
One variant we experimented with relaxes the softmax operation at the encoder output, which harshly suppresses most of the encoder output units, by applying the 1-D convolution operation above over log probabilities (WlogEmb), allowing more information to flow between encoder and decoder.
3.2 Beam Convolution AttPrep
Rather than using the encoder output probability values, the beam convolution AttPrep (BeamConv) uses the ranks of the top-k hypotheses per time step. It forces a fixed bandwidth on the communication channel between the encoder and decoder, relaxing the dependence on the shape of the encoder output probability distribution. Since the top-k list does not preserve the unit ordering of the encoder output vector, each vocabulary unit is represented by a learned embedding vector. Similar to the weighted embedding AttPrep, a 1-D convolution operation is applied over time steps to aggregate local information, followed by a multi-head self-attention operation.
where the resulting representation has beam size $k$ and unit embedding dimension $d$.
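A sketch of the rank-based representation, under assumed toy shapes; note the features depend only on the ranking of hypotheses, not on the probability values themselves:

```python
import numpy as np

T, V, k, d = 6, 10, 3, 8                 # steps, vocab, beam size, embedding dim
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(V), size=T)    # encoder distributions, one per step

topk = np.argsort(-P, axis=1)[:, :k]     # ids of the top-k hypotheses per step
emb = rng.normal(size=(V, d))            # unit embedding table (assumed)
beam_feats = emb[topk]                   # (T, k, d): ranks only, no probabilities
x = beam_feats.reshape(T, k * d)         # input to 1-D conv + self-attention
```

Because only the ranking enters, any monotone rescaling of the encoder's per-step scores leaves `beam_feats` unchanged, which is the fixed-bandwidth property described above.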
For our speech recognition experiments, we follow the standard Switchboard setup, with the LDC97S62 300h training set and the Switchboard (SWB) and CallHome (CH) subsets of the HUB5 Eval2000 set (LDC2002S09, LDC2000T43) for testing. Following the data preparation setup of ESPnet Watanabe et al. (2018), we use mean- and variance-normalized 83-dimensional log-mel filterbank and pitch features from 16kHz upsampled audio. As model targets, we experiment with 100 and 2000 sub-word units Kudo and Richardson (2018).
We use FairSeq Ott et al. (2019) for all our experiments, with the Adam optimizer Kingma and Ba (2014) and batch sizes measured in utterances. The learning rate is warmed up to a peak value, held fixed, then linearly decreased. We follow the strong Switchboard data augmentation policy from Park et al. (2019), but without time-warping. For inference, we use neither an external LM nor joint decoding over the encoder and decoder outputs Watanabe et al. (2018).
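The warmup-hold-decay learning-rate schedule can be sketched as follows; every hyperparameter value passed in is a placeholder, since the paper's exact settings are not reproduced here:

```python
def tri_stage_lr(step, peak_lr, init_lr, final_lr, warmup, hold, decay):
    """Warm up linearly from init_lr to peak_lr over `warmup` steps, hold
    peak_lr for `hold` steps, then decay linearly to final_lr over `decay`
    steps. All argument values used below are illustrative placeholders."""
    if step < warmup:
        return init_lr + (peak_lr - init_lr) * step / warmup
    if step < warmup + hold:
        return peak_lr
    t = min(step - warmup - hold, decay)
    return peak_lr + (final_lr - peak_lr) * t / decay

# Example: ramp to 1e-3 over 10k steps, hold 20k, decay to 1e-5 over 30k.
lr_mid = tri_stage_lr(15_000, 1e-3, 1e-7, 1e-5, 10_000, 20_000, 30_000)
```

This is the common tri-stage shape used in speech recipes; frameworks such as fairseq ship schedulers of this form.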
Input speech features are processed using two 2-D convolution blocks with 3x3 kernels, 64 and 128 feature maps respectively, 2x2 max-pooling, and ReLU non-linearity. Both the encoder and decoder transformer blocks use 1024-dimensional representations, 16 heads, and a 4096-dimensional feed-forward network. Sinusoidal positional embeddings are added to the output of the encoder's 2-D convolutional context layers.
| BPE Units | Beam | Loss Criterion | Eval 2000 |
| Our baseline implementation |
| LAS + SpecAugment Park et al. (2019) |
| ESPNET Karita et al. (2019) |
| Kaldi hybrid system Povey et al. (2016) (results from Kaldi's recent best recipe on GitHub) |
4.1 Baseline models performance
Table 1 shows the word error rates (WER) of our baseline ASR implementations employing the three approaches to seq2seq modeling, along with current SOTA systems from the literature. In line with Irie et al. (2019), the auto-regressive encoder-decoder models benefit from larger modeling units, as opposed to the CTC-optimized ones, which work best with shorter linguistic units. Encoder-decoder models trained by joint optimization of the CTC and cross-entropy losses benefit from a hybrid setup with two different vocabulary sets.
The problem of tight coupling between the encoder and decoder components of the seq2seq model is highlighted in table 2. Because the decoder cross-attention operates over the encoder hidden representation, the decoder is conditioned not only on the encoder outputs but also on the encoder's architectural decisions and internal hidden representations. The whole ASR system falls apart under the interchangeability test, i.e. switching an encoder with another similar one that differs only in its initial random seed, which underscores our point about the lack of modularity in encoder-decoder models.
| Enc. Swap 1 | ✗ | ✓ | 569.0 | 892.0 |
| Enc. Swap 2 | ✗ | ✓ | 597.2 | 759.3 |
| Enc. Swap 1 | ✓ | ✓ | 850.5 | 942.8 |
| Enc. Swap 2 | ✓ | ✓ | 747.4 | 1094.5 |
4.2 Performance of the proposed modular Seq2Seq models
| Model 1 | Model 2 | Enc2 Dec1 | Enc1 Dec2 |
Tables 3 and 4 show that the proposed modular seq2seq models are competitive with SOTA performance levels, and that the models are highly modular: performance does not degrade when exchanging encoders and decoders trained from different initial seeds or with different architecture choices.
The information bottleneck at the encoder outputs is critical for this result, as shown by the WlogEmb architecture. Relaxing the bandwidth constraint on the encoder-decoder connection by utilizing the log distribution lets the decoder rely on specific error patterns in the tail of the encoder output distribution for its final decisions. This improves the overall performance but breaks modularity when a different encoder is used, as shown in table 3.
4.3 Advantages of Modularity
Building modular seq2seq models brings many advantages, including an interpretable modular interface, functional independence of each module, and interchangeability. In addition to opening up the space for designing modules that implement specific functions, grounding each module's output in an interpretable discrete vocabulary allows for debugging and measuring the quality of each modular component in the system. In our speech recognition experiments, the encoder acts like an acoustic model by mapping input acoustic evidence into low-level linguistic units, while the decoder, acting like a language model, aggregates distributions of such units to generate the most likely full sentence. For example, table 4 shows how the overall performance improves going from the encoder to the decoder module in the last two columns.
Another benefit of modular independence, which we enjoy in software design but not in building fully-differentiable seq2seq models, is the ability to carefully build one critical module to higher levels of performance and then swap it into the full system without any model fine-tuning. The new module may reflect a new architectural design, e.g. moving from LSTMs to Transformers, or simply more training data that has become available for that module. Such a “modular upgrade” capability is demonstrated in table 5, where the upgraded encoder performance is reflected in the overall system WER. In our case, we simply used an encoder model trained independently with the CTC loss for a larger number of updates.
| BeamConv | 1 | 10 | 11.0 | 21.7 | 8.3 8.7 | 17.6 16.6 |
| BeamConv | 3 | 50 | 11.5 | 21.7 | 9.9 9.3 | 17.0 16.4 |
| WEmb | 1 | - | 12.2 | 23.1 | 8.7 8.9 | 18.0 17.8 |
| WEmb | 5 | - | 12.0 | 23.0 | 8.4 8.8 | 17.9 17.4 |
Table 6 presents experiments for a slightly different scenario, dubbed PostEdit, where one module is trained from scratch conditioned on the output of its parent module, whose parameters are frozen. The beam convolution AttPrep architecture, which uses only the rank of encoder hypotheses rather than probability values, shows much more resilience and ability to fix the frozen parent module's errors compared to the weighted embedding architecture. There is still a slight degradation in final decoder performance when it is trained conditionally on the encoder output without joint fine-tuning. The reason is the loss of the data augmentation effect when training the decoder module, a side effect of modular components: the encoder is trained to be invariant to augmentation when producing its final probability distribution. This can be addressed by designing data augmentation techniques suitable for the input of each module, which we leave to future work.
Modularity provides the ability to create an ensemble from an exponential number of models: e.g., by training 3 different modular seq2seq systems, we end up with 9 possible encoder-decoder combinations. In table 7 we show that a modular ensemble of 4 provides further improvement over the WER of an ensemble of the original 2 models.
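The combinatorics are straightforward to sketch: with n interchangeable systems, every encoder can be paired with every decoder (the module names below are placeholders):

```python
from itertools import product

# Three independently trained modular seq2seq systems (hypothetical names).
encoders = ["enc_a", "enc_b", "enc_c"]
decoders = ["dec_a", "dec_b", "dec_c"]

# Modular interchangeability turns n trained systems into n*n usable
# encoder-decoder pairs, all without fine-tuning.
ensemble = [(e, d) for e, d in product(encoders, decoders)]
```

The three original systems are the diagonal of this grid; the other six pairs exist only because the modules are interchangeable.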
| Ensemble of 2 | 1/3 | 10/- | 7.7 | 16.1 |
| + 2 using modular swap | 1/3 | 10/- | 7.7 | 15.8 |
5 Related work
This work applies the component modularity notion from the design and analysis of complex systems Baldwin and Clark (1999) to fully-differentiable seq2seq models, which have achieved impressive levels of performance across many tasks Chan et al. (2016); Bahdanau et al. (2016, 2015); Vaswani et al. (2017). The Connectionist Temporal Classification (CTC) loss Graves et al. (2006) has been applied as a sequence-level loss for training encoder-only speech recognition models Graves and Jaitly (2014); Hannun et al. (2014), and as a joint loss in attention-based systems to encourage monotonic alignment between input and output sequences Kim et al. (2017). In our work, the CTC loss serves the purpose of introducing an information bottleneck Tishby et al. (1999) by discretizing the encoder output into an interpretable vocabulary space.
By enforcing modularity between the encoder and decoder components in seq2seq models, the decoder module can be viewed as a post-edit module for the recognition output of the encoder. The decoder can also be viewed as an instance of a differentiable beam-search decoder Collobert et al. (2019).
There is a long history of research in learning disentangled, distributed hidden representations Rumelhart et al. (1986); Hinton et al. (1986), and in unsupervised discovery of abstract factors of variation within the training data Bengio (2013); Mathieu et al. (2016); Chen et al. (2016); Higgins et al. (2017). This line of research is complementary to our work, which enforces modularity only at the link connecting two big components in a seq2seq system. In this work, a component is defined as a deep and complex network with multiple layers of representations that serves a specific function within the bigger system and outputs distributions over interpretable vocabulary units.
Another related line of research centers around inducing a modular structure on the space of learned concepts by hierarchically gating information flow or via high-level concept blueprints Andreas et al. (2016); Devin et al. (2016); Purushwalkam et al. (2019), to enable zero- and few-shot transfer learning Andreas et al. (2017); Socher et al. (2013), and multi-lingual and cross-lingual learning Adams et al. (2019); Dalmia et al. (2018); Swietojanski et al. (2012).
Motivated by the modular software and system design literature, we presented a method for inducing modularity in attention-based seq2seq models by discretizing the encoder output into real-world vocabulary units. The Connectionist Temporal Classification (CTC) loss is applied to the encoder outputs to ground them in the predefined vocabulary while respecting their sequential nature. The learned models adhere to the three properties of modular systems – independence, interchangeability, and clarity of interface – while achieving competitive WER performance on the SWB and CH subsets of the standard 300h Switchboard task. Our future work focuses on extending this approach to machine translation and other sequence-to-sequence language processing tasks, as well as exploring the benefits of modular transfer in multi-task and multi-modal settings.
The authors would like to thank Paul Michel, Dmytro Okhonko, and Matthew Wiesner for their helpful discussions and comments.
- Adams et al. (2019) Oliver Adams, Matthew Wiesner, Shinji Watanabe, and David Yarowsky. 2019. Massively Multilingual Adversarial Speech Recognition. In Proc. NAACL-HLT.
- Andreas et al. (2017) Jacob Andreas, Dan Klein, and Sergey Levine. 2017. Modular Multitask Reinforcement Learning with Policy Sketches. In Proc. ICML.
- Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural Module Networks. In Proc. CVPR.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. Proc. ICLR.
- Bahdanau et al. (2016) Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-End Attention-based Large Vocabulary Speech Recognition. In Proc. ICASSP.
- Baldwin and Clark (1999) Carliss Y. Baldwin and Kim B. Clark. 1999. Design Rules: The Power of Modularity Volume 1. MIT Press.
- Bengio (2013) Yoshua Bengio. 2013. Deep learning of representations: Looking forward. CoRR.
- Chan et al. (2016) William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition. In Proc. ICASSP.
- Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Proc. NeurIPS.
- Collobert et al. (2019) Ronan Collobert, Awni Hannun, and Gabriel Synnaeve. 2019. A fully differentiable beam search decoder. In Proc. ICML.
- Dalmia et al. (2018) Siddharth Dalmia, Ramon Sanabria, Florian Metze, and Alan W Black. 2018. Sequence-Based Multi-Lingual Low Resource Speech Recognition. In Proc. ICASSP.
- Devin et al. (2016) Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. 2016. Learning modular neural network policies for multi-task and multi-robot transfer. CoRR.
- Gales and Young. Mark Gales and Steve Young. The application of hidden Markov models in speech recognition. Found. Trends Signal Process., 1(3).
- Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proc. ICML.
- Graves and Jaitly (2014) Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML.
- Hannun et al. (2014) Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep speech: Scaling up end-to-end speech recognition. ArXiv.
- Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proc. ICLR.
- Hinton et al. (2012) G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6).
- Hinton et al. (1986) Geoffrey E. Hinton et al. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society.
- Irie et al. (2019) Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, and Patrick Nguyen. 2019. On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition. In Proc. InterSpeech.
- Karita et al. (2019) Shigeki Karita, Nanxin Chen, Tomoki Hayashi, et al. 2019. A Comparative Study on Transformer vs RNN in Speech Applications. In Proc. ASRU.
- Kim et al. (2017) Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning. In Proc. ICASSP.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Lei Ba. 2014. Adam: A Method for Stochastic Optimization. In Proc. ICLR.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. EMNLP: System Demonstrations.
- Mathieu et al. (2016) Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. 2016. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems 29.
- Mohamed et al. (2019) Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer. 2019. Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proc. NAACL-HLT: Demonstrations.
- Park et al. (2019) Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech.
- Povey et al. (2016) Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Proc. Interspeech.
- Purushwalkam et al. (2019) Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc’Aurelio Ranzato. 2019. Task-driven modular networks for zero-shot compositional learning. CoRR.
- Rumelhart et al. (1986) David E. Rumelhart, James L. McClelland, and the PDP Research Group, editors. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press.
- Socher et al. (2013) Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-Shot Learning Through Cross-Modal Transfer. In Proc. NeurIPS.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Swietojanski et al. (2012) Pawel Swietojanski, Arnab Ghoshal, and Steve Renals. 2012. Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR. In Proc. SLT.
- Tishby et al. (1999) Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.
- Watanabe et al. (2018) Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, et al. 2018. ESPnet: End-to-End Speech Processing Toolkit. In Proc. InterSpeech.