Neural models for NLP tasks, such as language modeling and machine translation, require large vocabularies for generality (chelba2013one; bahdanau2014neural; luong2015effective; merity2016pointer). These models often employ a similar architecture: words, represented as one-hot vectors, are mapped to a dense continuous space; they are then processed by a context model; finally, the contextualized representations are mapped back to a vocabulary-sized vector for computing next-token probabilities. A language modeling example is shown in Figure 0(a). The mapping in the first and last steps often uses a shared learned look-up table, referred to as an embedding layer, which takes every word in the vocabulary to a fixed-dimensional vector. One drawback of this approach is that the number of parameters in the embedding layer increases as the vocabulary size grows, limiting us to small embedding dimensions over large vocabularies. Researchers have sought to improve the efficiency of the embedding layer by assigning lower-frequency words smaller-dimensional vectors; however, significant parameter reductions come at the cost of performance (morin2005hierarchical; grave2017efficientSoft; baevski2018adaptive). In all these approaches, word embedding is approximated with a linear function from words to vectors.
In this work, we introduce DEep Factorized INput word Embeddings (DeFINE) for neural sequence modeling. DeFINE approximates the complicated word embedding function with far fewer parameters compared to standard methods. DeFINE allows for lower-dimensional input and output mappings in sequence models, reducing their computational burden without reducing performance. The representations produced by DeFINE are more powerful than those of other factorization techniques and even standard embedding layers. To accomplish this, DeFINE leverages a hierarchical group transformation (HGT) that learns deep representations efficiently and effectively. HGT connects different subsets of the input using sparse and dense connections. To improve the flow of information, DeFINE introduces a new skip-connection that establishes a direct link with the input layer at every level of its hierarchy, allowing gradient to flow back directly to the input via multiple paths. DeFINE replaces standard word embedding layers, leaving the rest of the model untouched, and so it can be used with a wide variety of sequence modeling architectures. Figure 1 shows how we incorporate DeFINE with Transformer-XL (dai2019transformer), a state-of-the-art Transformer-based language model and the resulting reduction in total parameters.
Our experiments show that both LSTM- and Transformer-based sequence models benefit from the use of DeFINE. On the WikiText-103 dataset, an LSTM-based language model with DeFINE provides a 9 point improvement over a full capacity model while using half as many parameters. When combined with adaptive input (baevski2018adaptive) and output (grave2017efficientSoft) representations, DeFINE improves the performance by about 3 points across LSTM-based (see Table 0(a)) and Transformer-XL-based (see Table 2) language models with a minimal increase in training parameters. Computation time at inference is unaffected, because embeddings learned using DeFINE can be cached. Incorporating DeFINE into the popular AWD-LSTM language model (merity2018regularizing) without finetuning results in a test perplexity of 54.2 on the Penn Treebank dataset, outperforming both the original and fine-tuned AWD-LSTM models as well as Transformer-XL and MoS (yang2018breaking). For machine translation, DeFINE improves the efficiency of a Transformer model (vaswani2017attention) by 26% while maintaining translation quality. We provide substantive experiments which detail the impact of our architecture decisions and demonstrate the effectiveness of DeFINE across models of varying capacities.
2 Related Work
Many sequence modeling tasks – including language modeling and machine translation – have a large vocabulary. As a consequence, the majority of a model’s parameters are located in the input (or embedding) and the output (or classification) layers. To reduce the computational load presented by these layers, press2017using and inan2016tying introduce an effective mechanism called weight-tying that enables learning input and output representations jointly while significantly reducing the number of network parameters. To further reduce the computational load from these layers, factorization-based methods, such as projective embeddings (dai2019transformer), grouped embeddings (chen2018groupreduce; grave2017efficientSoft; goodmanclasses; mnih2009scalable; morin2005hierarchical), and slim embeddings (li2018slim), have been proposed. Projective embeddings approximate a large embedding matrix with two smaller matrices while grouped embeddings cluster input tokens by frequency and assign different capacities to different clusters using projective embedding methods. We note that projective embeddings is a special case of grouped embeddings when the number of clusters is one. The adaptive input method of baevski2018adaptive generalizes projective and grouped embedding methods and proposes a factorization method that allows for faster, memory-efficient end-to-end training while providing similar or better benefits compared to existing post-training methods which require a pretrained embedding matrix (chen2018groupreduce). Unlike projective and grouped embeddings, li2018slim extends group transformation (kuchaiev2017factorization; mehta2018pyramidal) with the shuffling algorithm of fisher1943statistical to factorize these layers. Other techniques such as codebook learning (shu2017compressing; chen2016compressing; acharya2019online) and quantization (rastegari2016xnor; hubara2017quantized) can be used to further improve efficiency, especially in terms of storage requirements. 
DeFINE is orthogonal to these methods; our empirical results in Section 4 show improved performance compared to these methods alone.
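To make the parameter savings of factorization concrete, the following sketch counts the weights of a standard look-up table and of a projective (two-matrix) factorization. The vocabulary size and dimensions are illustrative assumptions chosen for this example, not settings from our experiments, and the function names are hypothetical.

```python
def standard_embedding_params(vocab_size: int, dim: int) -> int:
    """Weights in a plain look-up table: one dim-sized row per word."""
    return vocab_size * dim


def projective_embedding_params(vocab_size: int, small_dim: int, dim: int) -> int:
    """Weights in a two-matrix factorization: a (vocab_size x small_dim)
    look-up table followed by a (small_dim x dim) projection."""
    return vocab_size * small_dim + small_dim * dim


# Illustrative sizes (assumed for this example only).
vocab_size, dim, small_dim = 260_000, 1024, 128

full = standard_embedding_params(vocab_size, dim)
factored = projective_embedding_params(vocab_size, small_dim, dim)
print(f"standard:   {full:,} weights")
print(f"projective: {factored:,} weights ({factored / full:.1%} of standard)")
```

The look-up table dominates both counts, which is why shrinking its width is the main lever for reducing parameters at large vocabulary sizes.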
Recent advances in sequence modeling, such as Transformers and multi-layer RNNs, demonstrate the power of deep architectures in NLP (jozefowicz2016exploring; vaswani2017attention; merity2018analysis). But while significant attention has been given to modeling the interactions between words with deep architectures (e.g. ELMo (Peters2018deep) and BERT (devlin2019bert)), context-free word representations are typically modeled with only corpus statistics (pennington2014glove) or a single linear transformation (mikolov2013distributed; mccann2017learned). Character-level models (kim2016character) also effect deep representations of words as a convolution over characters; however, these models often require more capacity to deliver performance comparable to word-level models (baevski2018adaptive). Still, DeFINE can be used to learn deep representations of a variety of token types, including words, characters, or byte-pair encodings (sennrich2015neural).
Word embedding is often treated as a simple function from a one-hot vector to a dense continuous space. The embedding layer can thus be thought of as a wide, shallow network consisting of a single linear transformation. At its heart, the function that this network approximates takes a word from its orthographic form to a representation of those of its syntactic and semantic properties which are relevant for modeling an arbitrary number of contexts in which the word can occur. Most NLP research assumes that a simple embedding layer can sufficiently approximate this intractable function. We hypothesize that, due to the complexity of the function, a shallow network would require exceptional capacity to learn a good approximation. Time and data constraints prohibit learning such a high-capacity shallow network. We propose, based on recent theoretical results of liang2017deep, that a deeper network can approximate the function with significantly fewer parameters than a shallow network. (liang2017deep prove that, for a large class of functions, the number of neurons needed by a shallow network to approximate a function is exponentially larger than the corresponding number of neurons needed by a deep network. We make the assumption that the embedding function is in this class of functions.) The validity of this assumption is evidenced by our experimental results in Section 4.
In this work, we introduce DeFINE, an effective way of learning deep word-level representations in high-dimensional space with a minimum of additional parameters. Our method is based on a Map-Expand-Reduce (MER) principle, described in Section 3.1, that first maps an input word to a low dimensional embedding vector, then transforms it to a high-dimensional space using a computationally efficient hierarchical group transformation (HGT, Section 3.2), which is sketched in Figure 1(c). The resultant vector is then transformed to a low-dimensional space. Over the course of these transformations, we make use of a new connectivity pattern that establishes a direct link between the input and output layers (Figure 3), promoting feature reuse, and improving gradient flow (Section 3.3). The output layer of DeFINE can then be used in place of a traditional embedding as an input to sequence modeling tasks. We detail the various aspects of the architecture below.
3.1 The Map-Expand-Reduce Principle (MER)
The first step in MER, Map, is similar to standard sequence models. Every input word in the vocabulary is mapped to a fixed-dimensional vector. However, in our case, the dimension of this vector is small (say 64 or 128, compared to typical dimensions of 400 or more). The next step, Expand, takes the mapped vector as an input and applies a hierarchical group transformation (HGT) to produce a very high-dimensional vector, whose dimension is much larger than that of the mapped vector. Unlike a stack of fully connected layers, HGT learns deep representations efficiently from different subsets of the input using sparse and dense connections. The last step, Reduce, projects the high-dimensional vector to a lower-dimensional space to produce the final embedding vector for a given input word. The dimension of the final embedding vector can be matched to contextual representation models, such as LSTMs or Transformers, allowing DeFINE to serve as an input layer for these models.
3.2 Hierarchical group transformation (HGT)
We introduce a hierarchical group transformation (HGT), sketched in Figure 1(c), to learn deep word-level representations efficiently. HGT comprises a stack of layers. At each layer, HGT uses a different number of groups, which allows it to learn representations from different subsets of the input. HGT starts with the maximum number of groups at the first layer and then decreases the number of groups by a factor of 2 at each level. This hierarchical grouping mechanism sparsifies the connections in fully connected (or linear) layers and allows us to learn representations efficiently with fewer parameters. Similar to a stack of fully connected layers, the last layer in HGT has access to every input element of the first layer through multiple paths, thereby allowing it to learn effective representations. Group linear transformations (GLT), originally introduced to improve the efficiency of the LSTM, also sparsify the connections in fully connected layers and significantly reduce computational costs (kuchaiev2017factorization; mehta2018pyramidal). However, if we stack multiple GLT layers, the outputs of a certain group are only derived from a small fraction of the input, thus learning weak representations. The hierarchical grouping mechanism in HGT allows the last layer to obtain input data from multiple paths, enabling HGT to learn stronger representations. A comparison of different transformations is given in Figure 2. We can see that HGT is both efficient and has better access to the input. Note that linear and group linear transforms are special cases of HGT in which the number of groups is one at every layer and fixed across layers, respectively.
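To transform the mapped input vector $\mathbf{e}_i$ to the high-dimensional vector $\hat{\mathbf{e}}_i$, HGT first samples the space between the input and output dimensions linearly to construct intermediate layers of increasing dimensionality. Therefore, the output vector produced by the $(l+1)$-th layer will have higher dimensionality than that of the $l$-th layer. Assuming that the linearly spaced vector dimensions are divisible by the number of groups, we transform $\mathbf{e}_i$ to $\hat{\mathbf{e}}_i$ as follows:

$$\hat{\mathbf{e}}_i^{\,l} = \begin{cases} \mathcal{F}_G\left(\mathbf{e}_i, \mathbf{W}^l, g^l\right), & l = 1 \\ \mathcal{F}_G\left(\hat{\mathbf{e}}_i^{\,l-1}, \mathbf{W}^l, g^l\right), & \text{otherwise} \end{cases} \qquad (1)$$

where $g^l = \max\left(\left\lfloor g_{max} / 2^{l-1} \right\rfloor, 1\right)$ is the number of groups at the $l$-th layer, $g_{max}$ is the number of groups at the first layer, $\mathbf{W}^l$ are the weights learned at the $l$-th layer, and $\mathcal{F}_G$ is a group transformation function defined in mehta2018pyramidal. Group transformation splits the input into groups, each of which is processed independently using a linear transformation. The outputs of these groups are then concatenated to produce the final output. See Section A.1 for details.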
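The halving group schedule and the linearly spaced layer sizes can be sketched as follows. This is a minimal illustration under stated assumptions: the dimensions, depth, and maximum number of groups are example values rather than our experimental settings, biases and the skip-connection of Section 3.3 are ignored, and the helper names are hypothetical.

```python
import math
from typing import List


def group_schedule(max_groups: int, depth: int) -> List[int]:
    """Groups per layer: start at max_groups, halve at each level, never below 1."""
    return [max(max_groups // (2 ** level), 1) for level in range(depth)]


def dimension_schedule(in_dim: int, out_dim: int, depth: int, multiple: int) -> List[int]:
    """Linearly spaced layer output sizes from in_dim up to out_dim, rounded up
    to a multiple of `multiple` so that every layer splits evenly into groups."""
    dims = []
    for level in range(1, depth + 1):
        raw = in_dim + (out_dim - in_dim) * level / depth
        dims.append(math.ceil(raw / multiple) * multiple)
    return dims


def stack_params(in_dim: int, dims: List[int], groups: List[int]) -> int:
    """Total weights of a stack of group linear layers (biases ignored).
    A layer with g groups holds g blocks of size (d_in / g) x (d_out / g)."""
    total, prev = 0, in_dim
    for out_dim, g in zip(dims, groups):
        assert prev % g == 0 and out_dim % g == 0
        total += g * (prev // g) * (out_dim // g)
        prev = out_dim
    return total


# Example values (assumptions for illustration only).
in_dim, out_dim, depth, max_groups = 128, 1024, 4, 16
dims = dimension_schedule(in_dim, out_dim, depth, max_groups)
hgt_groups = group_schedule(max_groups, depth)

linear = stack_params(in_dim, dims, [1] * depth)          # fully connected stack
glt = stack_params(in_dim, dims, [max_groups] * depth)    # fixed number of groups
hgt = stack_params(in_dim, dims, hgt_groups)              # halving schedule
print("layer sizes:", dims, "groups:", hgt_groups)
print("linear:", linear, "GLT:", glt, "HGT:", hgt)
```

Under these example values the hierarchical schedule sits between the fully connected stack and the fixed-group stack in weight count, while its later layers (with few groups) mix information across the subsets processed separately by earlier layers.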
3.3 DeFINE unit
The DeFINE unit is composed of HGT transformations that are designed using the MER principle. Though HGT layers are an efficient approximation to computationally expensive fully connected layers, they might impede training as the depth of the DeFINE unit grows. Residual connections (he2016deep) have proved to be very effective at mitigating this issue; however, such connections are difficult to implement in HGT because the input and output dimensions of each layer are different.
To maximize the flow of information and facilitate training with deeper DeFINE units, we introduce a simple new skip-connection that establishes a direct link between any layer in HGT and the input. Figure 3 visualizes the DeFINE unit with a depth of two. To enable the sparse connections in HGT to have access to both the input and the output of the previous layer, we chunk the input and the output into groups using a split layer. The chunked input and output vectors are then mixed such that the first chunk of the input and the first chunk of the previous layer's output are put together as the input for the first group transformation in the current layer, and so on until the inputs for all groups have been constructed. The resultant vector is then fed to the current layer. This mechanism promotes input feature reuse efficiently. Additionally, it establishes a direct link with the input, allowing gradient to flow back to the input via multiple paths and resulting in improved performance.
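The split-and-mix step can be illustrated on plain lists. The sketch below is an assumption-laden toy: the function names are hypothetical, and the ordering inside each group (input chunk first, then previous-output chunk) is chosen for illustration.

```python
from typing import List


def split(vector: List[float], groups: int) -> List[List[float]]:
    """Chunk a vector into `groups` equal, contiguous pieces."""
    assert len(vector) % groups == 0
    size = len(vector) // groups
    return [vector[i * size:(i + 1) * size] for i in range(groups)]


def mix(layer_input: List[float], prev_output: List[float], groups: int) -> List[float]:
    """Build the next layer's input: for each group, place that group's chunk
    of the original input next to its chunk of the previous layer's output."""
    mixed: List[float] = []
    for in_chunk, out_chunk in zip(split(layer_input, groups), split(prev_output, groups)):
        mixed.extend(in_chunk + out_chunk)
    return mixed


original_input = [1.0, 2.0, 3.0, 4.0]                  # 4-dimensional input
previous_output = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # 6-dimensional layer output
mixed = mix(original_input, previous_output, groups=2)
print(mixed)
```

Each of the two groups now sees two input dimensions and three dimensions of the previous output, so every group transformation in the next layer has a direct path back to the input.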
3.4 DeFINE for Sequence Modeling
The DeFINE unit can be easily integrated with any new or existing sequence models. Sequence models typically consist of a stack of an input layer (embedding or adaptive input layer), a contextual model (e.g. LSTM or Transformer), and a classification layer (a fully-connected or adaptive softmax). Since DeFINE learns deep word-level representations, we can easily stack it immediately after the input. An example is shown in Figure 1, where DeFINE is integrated with Transformer-XL, a state-of-the-art language model. DeFINE enables the use of relatively lower dimensions in the input layer, thus reducing network parameters.
The word-level representations that a neural model learns for each word (the mapped, expanded, and reduced vectors of Section 3.1) are independent of other words. This allows us to create another independent look-up table (after training a model) that caches the mapping between the input word and the output of the DeFINE unit, resulting in a mechanism that allows us to skip the computations of the DeFINE unit at inference time.
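The caching idea can be sketched in a few lines. Here `toy_define_unit` is a hypothetical stand-in for a trained Map-Expand-Reduce pipeline (it is not the actual DeFINE computation), and the call counter exists only to show that the unit is not re-run at inference.

```python
from typing import Callable, Dict, List


def build_cache(vocab: List[str],
                define_unit: Callable[[str], List[float]]) -> Dict[str, List[float]]:
    """Run the trained, context-free embedding function once per word."""
    return {word: define_unit(word) for word in vocab}


calls = 0


def toy_define_unit(word: str) -> List[float]:
    """Hypothetical stand-in for Map -> Expand -> Reduce; counts its calls."""
    global calls
    calls += 1
    return [float(len(word)), float(ord(word[0]))]


vocab = ["deep", "word", "embedding"]
cache = build_cache(vocab, toy_define_unit)  # one call per vocabulary word

# At inference, embeddings are read from the table; the unit is never re-run.
sentence = ["deep", "word", "embedding", "word"]
embedded = [cache[token] for token in sentence]
print("unit calls:", calls, "tokens embedded:", len(embedded))
```

This works only because the output depends on the word alone; any context-dependent component would have to stay outside the cached table.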
4 Experimental Results
We demonstrate the performance of DeFINE on two sequence modeling tasks: language modeling (Section 4.1) and machine translation (Section 4.2). We compare the performance of DeFINE with existing factorization and compression-based methods in Section 4.3. We also provide ablations in Section 4.4 to show the effectiveness of our design decisions. Throughout this section, we describe each configuration by the dimensions of the mapped, expanded, and reduced vectors of Section 3.1 and by the depth of DeFINE.
4.1 Language Modeling
In this section, we study the performance of our models with LSTM- and Transformer-based language models on two datasets: WikiText-103 (merity2016pointer) and the Penn Treebank (marcus1994ptb). On both datasets, we show that DeFINE is parameter efficient and improves the performance of existing language models.
4.1.1 WikiText-103 (WT-103)
Data and models:
The WikiText-103 dataset (merity2016pointer) consists of 103M/217K/245K tokens for training, validation, and test respectively and has a vocabulary size of about 260K. This dataset is composed of Wikipedia articles and retains punctuation, numbers, and case. To evaluate the effectiveness of DeFINE, we study two different kinds of contextual models: LSTM, and Transformer (Transformer-XL (dai2019transformer)). We measure the performance of these models in terms of perplexity, a standard metric for language modeling. Lower values of perplexity indicate better performance. Following recent works, including merity2018analysis, baevski2018adaptive, and dai2019transformer, we use adaptive inputs as a mapping function in DeFINE and adaptive softmax for classification with tied weights. See A.3 for more details.
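For reference, perplexity is the exponential of the average negative log-probability that the model assigns to each token. The sketch below uses made-up probabilities purely to illustrate the computation; the function name is hypothetical.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that spreads probability uniformly over 4 choices at every step
# is as uncertain as a fair 4-way guess, so its perplexity is 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))

# Assigning higher probability to the observed tokens lowers perplexity.
confident = [math.log(0.5)] * 10
print(perplexity(confident))
```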
Results of LSTM-based language models:
Table 1 summarizes the results of LSTM-based language models. Though the adaptive input (baevski2018adaptive) and output (grave2017efficientSoft) methods are effective and reduce the number of parameters significantly, our method further improves performance by about 3 points while learning only 1.25% (or 0.4 million) more parameters. It is important to note that the computational complexity of models in R2 and R3 is the same because our method allows caching outputs of DeFINE for use at inference (see Section 3.4).
When we scale the depth of DeFINE from 3 to 11 layers (Table 0(b); we scale the input and the output dimensions to uniformly increase the network complexity), the performance improves by a further 6 points, delivering competitive performance to existing RNN-based methods with fewer parameters (e.g. as many parameters as merity2018analysis). The performance of our model is better than existing methods such as dauphin2017language and BaiTCN2018.
Results of Transformer-based model:
Table 2 compares the performance of Transformer-XL, a state-of-the-art Transformer-based model, with and without DeFINE. Table 1(a) shows our method is able to attain similar performance to dai2019transformer while learning 10M fewer parameters. It is interesting to note that DeFINE enables us to reduce the computational burden from the input and output layers by a large amount with minimal impact on performance. With DeFINE, the performance of Transformer-XL drops by only about 2 points while the number of parameters is reduced by 50%. For a similar reduction in the number of parameters, the performance of the original Transformer-XL drops by 5 points, suggesting the proposed method for learning word-level representations is effective. Table 1(b) highlights the fact that Transformer-XL with DeFINE is able to achieve comparable perplexity to a standard Transformer-XL with projective embeddings while using significantly fewer parameters.
4.1.2 Penn Treebank (PTB)
Data and models:
The Penn Treebank dataset (marcus1994ptb) contains about 929K/74K/82K tokens in its train, validation, and test sets respectively. It has a vocabulary size of about 10K. Following recent works, we use the processed version provided by mikolov2010recurrent. To evaluate the effectiveness of our model, we compare to AWD-LSTM (merity2018regularizing). Our model replaces the embedding layer in AWD-LSTM with a DeFINE unit with the following settings: , , , and . We use the same hyper-parameters and PyTorch version as the original AWD-LSTM.
Results are summarized in Table 0(c). The proposed method improves the performance of AWD-LSTM by 4 points while simultaneously reducing the number of parameters by 4 million. Without any finetuning, AWD-LSTM + DeFINE achieves comparable performance to state-of-the-art methods, including Transformer-XL, with fewer parameters.
4.2 Machine Translation
Data and models:
We use the WMT 2014 English-German (EN-DE) dataset (luong2015effective) for training. Following vaswani2017attention, we encode the sentences using byte-pair encoding (britz2017massive) and use newstest2014 and newstest2017 as validation and test sets, respectively. We integrate DeFINE with the state-of-the-art Transformer model (vaswani2017attention) with the following parameters: , , , and . We use the implementation in OpenNMT-py (klein2017opennmt) for training and evaluation with the recommended hyper-parameters.
Table 3 summarizes the results. DeFINE improves the performance of the Transformer model without checkpoint averaging by 2% while simultaneously reducing the total number of parameters by 26%, suggesting that DeFINE is effective.
|Model||Checkpoint averaging?||Parameters||newstest2014 (validation)||newstest2017 (test)|
|Transformer + SRU (lei2018sru)||✓||90 M||27.1||28.30|
|Transformer (OpenNMT impl.) (klein2017opennmt)||✓||92 M||26.89||28.09|
|Transformer + DeFINE||✗||68 M||27.01||28.25|
Table 3: Results on the task of neural machine translation. DeFINE attains similar performance to checkpoint averaging, but with fewer parameters.
4.3 Comparison with different methods
Table 4 compares the performance of different factorization methods for different sequence models. With DeFINE, the performance and efficiency of sequence models improves across different tasks. This is likely because the output of DeFINE more closely approximates the correlation pattern of a standard embedding layer compared to other embeddings (see Figure 4 and Appendix B). Furthermore, we see that strong correlations between dimensions in the mapping layer of DeFINE are reduced over the course of the expansion layers (see Figures 8, 9, and 10 in Appendix). Figure 11 in Appendix shows that groups within an expansion layer of DeFINE are not correlated, suggesting these matrices are learning different representations of their input.
Impact of compression-based methods:
Compression-based methods allow for efficiently discretizing the continuous 32-bit full-precision embedding vectors, thus reducing the memory footprint of the input layer. With DeFINE, we also learn a continuous full precision 32-bit floating-point embedding vector (similar to baevski2018adaptive and dai2019transformer). Therefore, compression-based methods, such as (shu2017compressing), can be applied to sequence models with DeFINE and other factorization methods. Table 5 shows that DeFINE embeddings can be compressed similarly to standard embeddings without loss of performance.
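|Sequence model||Task||Input-Output layers||Parameters||Performance|
|LSTM (Table 0(a))||Language Modeling||Standard||92 M||44.12|
|LSTM (Table 0(a))||Language Modeling||Adaptive||33 M||44.87|
|AWD-LSTM (Table 0(c))||Language Modeling||Standard||24 M||58.8|
|AWD-LSTM (Table 0(c))||Language Modeling||DeFINE||20 M||54.2|
|Transformer-XL (Table 2)||Language Modeling||Standard||139 M||27.06|
|Transformer-XL (Table 2)||Language Modeling||Projective||71 M||29.16|
|Transformer (Table 3)||Machine Translation||Standard||92 M||25.81|
|Transformer (Table 3)||Machine Translation||DeFINE||68 M||28.25|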
|Dimension of ()||Input-Output Layers||Compression Used?||Look-up Table Size (in MB)||Perplexity||Inference Time (in ms/batch)|
4.4 Ablation studies on WikiText-103 dataset
In this section, we provide an analysis of our design choices using an LSTM-based language model. In our ablations, we choose LSTM- over Transformer-based language models because they are less sensitive to hyper-parameters and can be trained on a single GPU. We use the same hyper-parameters for training as described in Section 4.1.1, specifically , , , and .
Impact of different transformations:
Table 6 summarizes our results. HGT is as effective as linear transformation while learning two million fewer parameters. Compared to group linear transform (GLT), HGT improves perplexity by about 5 points while learning a similar number of parameters. Furthermore, when we establish a direct connection with the input (see Section 3.3 for details), the performance further improves by 2.9 points with a minimal impact on the number of parameters, suggesting that DeFINE learns good representations.
Impact of scaling depth and width:
Table 7 summarizes the results of our scaling experiments. For the same width, the performance of the language model improves as the depth increases. However, when we scale the width for a fixed depth, the performance does not improve. This is likely because, as we increase the width, more neurons receive their input from the same subset of dimensions and thus learn many redundant parameters.
DeFINE with different connections:
Table 7(a) demonstrates the impact of residual connections in DeFINE. In order to facilitate residual connections inside DeFINE, we fix the dimension of each layer in DeFINE to be the same instead of linearly spanning from the input dimension to the expanded dimension. We can clearly see that the proposed skip-connections are more effective.
Impact of reduce operation in MER:
In the MER strategy (Section 3.1), we project the high-dimensional vector to a low-dimensional space before feeding it to a contextual model, such as an LSTM. We empirically found that the performance with and without this reduction step is similar; however, a model without the reduction step learns more parameters (Table 7(b)).
DeFINE uses a deep, hierarchical, sparse network with new skip connections to learn better word embeddings efficiently. Sequence models with DeFINE (e.g. Transformer and LSTM) perform comparably to or better than state-of-the-art methods with fewer parameters. Our experiments show that the proposed architectural decisions each contribute to the effectiveness of the DeFINE unit. We believe neural sequence models with DeFINE can be further improved with extended hyper-parameter search, similar to melis2018on. In future work, we will apply DeFINE to other sequence modeling tasks. For instance, we believe that pretrained language model architectures such as ELMo and BERT can benefit from incorporating DeFINE to improve efficiency and performance. Another direction is to use the components of DeFINE (specifically MER, HGT, and mixing layers) in neural architecture search processes. We have shown the promise of these components here, but a thorough architecture search may discover better configurations in the large search space defined by the depth, grouping, and connectivity parameters.
This research was supported by ONR N00014-18-1-2826, DARPA N66001-19-2-403, NSF (IIS-1616112, IIS1252835), ARO (W911NF-16-10121), an Allen Distinguished Investigator Award, Samsung GRO and gifts from Allen Institute for AI, Google, and Amazon. Authors would also like to thank members of the H2Lab at the University of Washington, Seattle for their valuable feedback and comments.
Appendix A Appendix
A.1 Group Linear Transformation Function
To produce an output from an input and a weight matrix, the group transformation function first chunks the input into groups and then concatenates the chunked parts into a matrix with one row per group. Each row is then multiplied with the weight matrix of its group, and the resultant matrix is flattened to produce the output vector. When the number of groups is one, we obtain the linear transform.
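A minimal pure-Python sketch of a group linear transformation follows. It is written for readability rather than speed, the function name and the weight layout (one small matrix per group) are assumptions made for this illustration, and biases are omitted.

```python
from typing import List

Matrix = List[List[float]]


def group_linear(x: List[float], weights: List[Matrix], groups: int) -> List[float]:
    """Split x into `groups` chunks, multiply each chunk by its own weight
    matrix of shape (len(x) / groups) x (out_dim / groups), and concatenate."""
    assert len(x) % groups == 0 and len(weights) == groups
    size = len(x) // groups
    out: List[float] = []
    for g in range(groups):
        chunk = x[g * size:(g + 1) * size]
        w = weights[g]
        assert len(w) == size
        for col in range(len(w[0])):
            out.append(sum(chunk[row] * w[row][col] for row in range(size)))
    return out


x = [1.0, 2.0, 3.0, 4.0]

# Two groups: the first chunk passes through unchanged, the second is doubled.
two_group_weights = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
print(group_linear(x, two_group_weights, groups=2))

# One group reduces to an ordinary linear transform (here: sum of all inputs).
one_group_weights = [[[1.0], [1.0], [1.0], [1.0]]]
print(group_linear(x, one_group_weights, groups=1))
```

Note that in the two-group case no output depends on inputs from the other group, which is the sparsity that reduces the parameter count.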
A.2 Block level diagrams of different skip-connections in DeFINE
Block level diagrams of different variants of DeFINE are given in Figure 5. Figure 4(a) stacks transformation layer (Eq. 1) and is the same as HGT in Figure 1(c). Figure 4(b) adds a residual connection to Figure 4(a). Figure 4(c) is the same as Figure 3 while Figure 4(d) is the same as Figure 4(c), but without split and mixer functionality.
A.3 Hyper-parameters for training language models
For training LSTM-based language models, we use a single NVIDIA GTX 1080 Ti GPU with 11 GB of GPU memory, while for training Transformer-XL, we use four GeForce RTX 2080 Ti GPUs, each with 11 GB of GPU memory (as recommended by the authors). Following recent works, including merity2018analysis and baevski2018adaptive, we use adaptive inputs as a mapping function in DeFINE and adaptive softmax for classification for our experiments with RNN-based sequence models. We also tie weights between the adaptive inputs and outputs. For Transformer-XL (dai2019transformer), we use projective embeddings (as done by the authors). We train our models using PyTorch (v1.2). For LSTM-based language models, we use hyper-parameters similar to those of merity2018analysis, which are summarized in the table below.
|# of GPUs||1|
|LR reduction (factor, steps)||10, |
|LSTM Hidden Dimension||1024|
|# of LSTM Layers||4|
|Max. dimension of ()||1024|
|Dropout||Same as merity2018analysis|
A.4 Performance of Transformer-XL on WikiText-103
Figure 6 plots the validation perplexity of Transformer-XL on the WikiText-103 as a function of training steps. We can see that DeFINE enables Transformer-XL to deliver similar performance with fewer parameters.
Appendix B Correlation map visualization for Transformer-XL on WikiText-103
Computing correlation map:
Let us say that we have an arbitrary look-up table that maps every word in the vocabulary to a fixed-dimensional vector space. We compute the correlation map over the dimensions of this look-up table and normalize it between 0 and 1. If the correlation map is the identity, then it suggests that the dimensions of the look-up table are independent. To encode better contextual representations among words using context models such as LSTMs and Transformers, embedding dimensions should be independent.
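As an illustration of a correlation map of this kind, the sketch below assumes the map is the product of the transposed look-up table with itself, followed by min-max normalization to the range [0, 1]. This specific formula and the function name are assumptions made for the example; the toy table is made up.

```python
from typing import List

Matrix = List[List[float]]


def correlation_map(table: Matrix) -> Matrix:
    """Assumed form: (table^T @ table), min-max normalized to [0, 1].
    `table` has one row per word and one column per embedding dimension."""
    dims = len(table[0])
    gram = [[sum(row[a] * row[b] for row in table) for b in range(dims)]
            for a in range(dims)]
    lo = min(min(r) for r in gram)
    hi = max(max(r) for r in gram)
    span = hi - lo
    if span == 0:  # constant map: nothing to normalize
        return [[0.0] * dims for _ in range(dims)]
    return [[(v - lo) / span for v in r] for r in gram]


# Toy look-up table: 3 words, 2 dimensions that never co-activate.
table = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
]
for row in correlation_map(table):
    print(row)
```

In this toy table the off-diagonal entries are zero because the two dimensions never fire together; correlated dimensions would instead produce large off-diagonal values.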
Can DeFINE approximate the standard embedding layer?
Figure 7 visualizes the correlation maps of embeddings learned using a standard embedding layer (top row), projective embeddings (acharya2019online; dai2019transformer) (middle row), and DeFINE embeddings (bottom row) at different dimensions of the mapping layer in DeFINE. Compared to projective embeddings, DeFINE is able to approximate the standard embedding layer efficiently and effectively (see Table 2 for efficiency and performance comparison).
Furthermore, we provide a layer-wise comparison for DeFINE at different mapping-layer dimensions in Figures 8, 9, and 10. The mapping layer in DeFINE is in a low-dimensional space and has correlations. As we learn deeper representations using DeFINE, these correlations are reduced and we obtain a correlation matrix similar to that of a standard embedding layer. This suggests that DeFINE is effective in approximating the standard embedding layer. Importantly, the groups at different expansion layers in DeFINE are independent (see Figure 11), suggesting these matrices are learning different representations of their input.
[Figures 8, 9, and 10: layer-wise correlation maps for DeFINE. Panels in each figure: Map layer, Expansion layer 1, Expansion layer 2, Reduction layer.]
[Figure 11: correlation maps of groups within expansion layers. Panels: Expansion layer 1 in DeFINE with 4 groups, Expansion layer 2 in DeFINE with 2 groups.]