1 Introduction
Transformer-based models [20, 4, 22, 14, 7, 15] have become one of the most effective approaches to modeling sequence data of variable length. Transformers have shown wide applicability to many natural language processing (NLP) tasks such as language modeling [14], neural machine translation (NMT) [20], and language understanding [4]. Unlike traditional recurrent-based models (e.g., RNN or LSTM), the Transformer uses a non-recurrent but self-attentive neural architecture to model the dependency among elements at different positions in the sequence, which leads to better parallelization on modern hardware and alleviates the vanishing/exploding gradient problem of traditional recurrent models.
[23] prove that the design of the self-attentive architecture leads to a family of permutation equivalence functions. Thus, for applications where the ordering of the elements matters, how to properly encode position information is crucial for Transformer-based models. There have been many attempts to encode position information for the Transformer. In the original Transformer paper [20], a family of predefined sinusoidal functions was adopted to construct a set of embeddings for each position. These fixed position embeddings are then added to the word embeddings of the input sequence accordingly. To construct these position embeddings in a more data-driven way, many recent Transformer variants such as [4, 9] include these embeddings as learnable model parameters in the training stage. This data-driven approach comes at the cost of a fixed maximum input sequence length $L_{\max}$ and the computational/memory overhead of $L_{\max} \times d$ additional parameters, where $L_{\max}$ is usually set to 512 in many applications and $d$ is the dimension of the embeddings. [18] propose a relative position representation that reduces the number of parameters to $(2K+1)d$ by dropping the interactions between tokens with a distance greater than $K$. In addition to just the input layer, [3] and [7] suggest that injecting position information into every layer leads to even better performance for the Transformer.
An ideal position encoding approach should satisfy the following three properties:

Inductive: the ability to handle sequences longer than any sequence seen during training.

Data-driven: the position encoding should be learnable from the data.

Parameter efficient: the number of trainable parameters introduced by the encoding should be limited to avoid increasing the model size, which could hurt generalization.
In Table 1, we summarize some of the existing position encoding approaches in terms of these three properties.
In this paper, we propose a new method to encode position with minimum cost. The main idea is to model position encoding as a continuous dynamical system, so we only need to learn the system dynamics instead of learning the embeddings for each position independently. By doing so, our method enjoys the best of both worlds: we bring back the inductive bias, and the encoding method is freely trainable while being parameter efficient. To enable training of this dynamical system with backpropagation, we adopt recent progress in continuous neural networks [1], officially called Neural ODE. In some of the generative modeling literature, it is also called the free-form flow model [5], so we call our model FLOw-bAsed TransformER (FLOATER). We highlight our contributions as follows:
We propose FLOATER, a new position encoder for the Transformer, which models the position information via a continuous dynamical model in a data-driven and parameter-efficient manner.

Due to the use of a continuous dynamical model, FLOATER can handle sequences of any length. This property makes inference more flexible.

With careful design, our position encoder is compatible with the original Transformer; i.e., the original Transformer can be regarded as a special case of our proposed position encoding approach. As a result, we are not only able to train a Transformer model with FLOATER from scratch but also plug FLOATER into most existing pretrained Transformer models such as BERT, RoBERTa, etc.

We demonstrate that FLOATER yields consistent improvements over baseline models across a variety of NLP tasks ranging from machine translation to language understanding and question answering.
2 Background and Related Work
2.1 Importance of Position Encoding for Transformer
We use a simplified self-attentive sequence encoder to illustrate the importance of position encoding in the Transformer. Without position encoding, the Transformer architecture can be viewed as a stack of $N$ blocks $B_n$, $n = 1, \ldots, N$, each containing a self-attentive layer $A_n$ and a feed-forward layer $F_n$. By dropping the residual connections and layer normalization, the architecture of a simplified Transformer encoder can be represented as follows.
$$\mathrm{Encode}(x) = B_N \circ B_{N-1} \circ \cdots \circ B_1(x), \tag{1}$$
$$B_n(x) = F_n \circ A_n(x), \tag{2}$$
where $x = [x_1, x_2, \ldots, x_L]^\top \in \mathbb{R}^{L \times d}$, $L$ is the length of the sequence, and $d$ is the dimension of the word embedding. $A_n$ and $F_n$ are the self-attentive and feed-forward layers in the $n$-th block $B_n$, respectively.
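Each row of $A_n(x)$ can be regarded as a weighted sum of the value matrix $V_n$, with the weights determined by similarity scores between the key matrix $K_n$ and the query matrix $Q_n$ as follows:
$$A_n(x) = \mathrm{Softmax}\!\left(\frac{Q_n K_n^\top}{\sqrt{d}}\right) V_n, \qquad Q_n = x W_n^Q + b_n^Q, \quad K_n = x W_n^K + b_n^K, \quad V_n = x W_n^V + b_n^V, \tag{3}$$
where $W_n^{Q/K/V}$ and $b_n^{Q/K/V}$ are the weight and bias parameters introduced in the self-attentive function $A_n$. The output of the feed-forward function $F_n$ used in the Transformer is also a matrix with $L$ rows. In particular, the $i$-th row is obtained as follows.
$$F_n(x)_i = W_n^2\, \sigma\!\left(W_n^1 x_i + b_n^1\right) + b_n^2, \tag{4}$$
where $W_n^{1,2}$ and $b_n^{1,2}$ are the weights and biases of the linear transforms, and $\sigma(\cdot)$ is the activation function. It is not hard to see from (3) and (4) that both $A_n$ and $F_n$ are permutation equivalent. Thus, we can conclude that the entire function defined in (1) is also permutation equivalent, i.e., $P \cdot \mathrm{Encode}(x) = \mathrm{Encode}(P x)$ for any $L \times L$ permutation matrix $P$. This permutation equivalence property prevents a Transformer without position information from modeling sequences where the ordering of elements matters.
2.2 Position Encoding in Transformer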
As mentioned in Section 1, there have been many attempts to inject position information into self-attentive components. Most of them can be described in the following form:
$$B_n(x) = F_n \circ A_n \circ \Phi_n(x), \tag{5}$$
where $\Phi_n(x)$ is a position encoding function.
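[20] propose to keep $\Phi_n(x) = x$ for all $n \ge 2$ and inject position information only at the input block with a family of predefined sinusoidal functions: $\Phi_1(x) = x + p^{(1)}$, where $p^{(1)} = [p^{(1)}_1, p^{(1)}_2, \ldots, p^{(1)}_L]$ is a position embedding matrix with the $i$-th row corresponding to the $i$-th position in the input sequence. In particular, the $j$-th dimension of the $i$-th row is defined as follows.
$$p^{(1)}_i[j] = \begin{cases} \sin\!\left(i \cdot c^{\,j/d}\right) & \text{if } j \text{ is even}, \\ \cos\!\left(i \cdot c^{\,(j-1)/d}\right) & \text{if } j \text{ is odd}, \end{cases} \tag{6}$$
where $c = 10^{-4}$. [3] and [7] observe better performance by further injecting the position information at each block, i.e., $\Phi_n(x) = x + p^{(n)}$, as follows: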
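$$p^{(n)}_i[j] = \begin{cases} \sin\!\left(i \cdot c^{\,j/d}\right) + \sin\!\left(n \cdot c^{\,j/d}\right) & \text{if } j \text{ is even}, \\ \cos\!\left(i \cdot c^{\,(j-1)/d}\right) + \cos\!\left(n \cdot c^{\,(j-1)/d}\right) & \text{if } j \text{ is odd}. \end{cases} \tag{7}$$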
Note that for the above two approaches, the position encoding functions are fixed for all applications. Neither approach introduces additional parameters into the model, and both are inductive and can handle input sequences of variable length.
Many successful variants of pretrained Transformer models, such as BERT [4] and RoBERTa [9], include the entire embedding matrix $p^{(1)} \in \mathbb{R}^{L_{\max} \times d}$ in $\Phi_1(x)$ as trainable parameters. As the number of trainable parameters needs to be fixed, the maximum length of a sequence, $L_{\max}$, must be determined before training. Although it lacks the inductive property, this data-driven approach is found to be effective for many NLP tasks. Note that, unlike the fixed sinusoidal position encoding, there has been no attempt to inject a learnable position embedding matrix at each block of the Transformer, because of the large number of additional parameters ($L_{\max} \times d \times N$).
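The permutation-equivalence argument of Section 2.1 and the sinusoidal encoding in (6) can be checked numerically. The following is a minimal sketch, not the implementation used in the experiments: it assumes a parameter-free attention layer with $Q = K = V = x$, positions indexed from 0, and toy inputs; the helper names (`sinusoidal_encoding`, `self_attention`, `permute`) are hypothetical.

```python
import math

def sinusoidal_encoding(length, dim, c=1e-4):
    """Row i, dimension j: sin(i * c**(j/dim)) if j is even, cos(i * c**((j-1)/dim)) if odd."""
    table = []
    for i in range(length):  # assumption: positions start at 0
        row = []
        for j in range(dim):
            if j % 2 == 0:
                row.append(math.sin(i * c ** (j / dim)))
            else:
                row.append(math.cos(i * c ** ((j - 1) / dim)))
        table.append(row)
    return table

def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (a simplification for illustration)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def permute(rows, perm):
    return [rows[k] for k in perm]

def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def close(a, b, tol=1e-9):
    return all(abs(u - v) < tol for ra, rb in zip(a, b) for u, v in zip(ra, rb))

x = [[0.1, 0.4, -0.2, 0.3], [0.5, -0.1, 0.2, 0.0], [-0.3, 0.2, 0.1, 0.6]]
perm = [2, 0, 1]

# Without position information, permuting the input only permutes the output rows.
equivariant = close(self_attention(permute(x, perm)), permute(self_attention(x), perm))

# Once the encoding is added, the same permutation produces a genuinely different output.
p = sinusoidal_encoding(len(x), len(x[0]))
position_sensitive = not close(
    self_attention(add(permute(x, perm), p)),
    permute(self_attention(add(x, p)), perm),
)
print(equivariant, position_sensitive)
```

Because `sinusoidal_encoding` has no trainable state, it can be evaluated for any `length`, which is the sense in which the fixed encoding is inductive.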
3 FLOATER: Our Proposed Position Encoder
We introduce our method in three steps. In the first step, we only look at one Transformer block, and describe how to learn the position representation driven by a dynamical system; in the second step, we show how to save parameters if we add position signals to every layer; lastly, we slightly change the architecture to save trainable parameters further and make FLOATER “compatible” with the original Transformer [20]. The compatibility means our model is a strict superset of the vanilla Transformer so that it can be initialized from the Transformer.
3.1 Position Encoding with Dynamical Systems
Position representations in Transformer models are a sequence of vectors $\{p_i \in \mathbb{R}^d : i = 1, \ldots, L\}$ to be added to the sequence of input representations $\{x_i : i = 1, \ldots, L\}$. Existing position encoding approaches either apply a fixed sinusoidal function to obtain $\{p_i\}$, or include them as uncorrelated learnable parameters. Both fail to capture the dependency or dynamics among these position representations $\{p_i\}$. In this paper, we propose to use a dynamical system to model these position representations; that is, there is a "latent force" denoted by $h_i$ that drives the changes from $p_i$ to $p_{i+1}$. To encourage smoothness, we consider $p(t): \mathbb{R}_+ \to \mathbb{R}^d$ as the continuous version of the discrete sequence $\{p_i\}$. In particular, our proposed continuous dynamical system is characterized as follows:
$$p(t) = p(s) + \int_s^t h\big(\tau, p(\tau); \theta_h\big)\, d\tau, \qquad 0 \le s \le t < \infty, \tag{8}$$
together with an initial vector $p(0)$, where $h(\tau, p(\tau); \theta_h)$ is a neural network parameterized by $\theta_h$ that takes the previous state $(\tau, p(\tau))$. Notice that the domain of $p(\cdot)$ is $\mathbb{R}_+$. The position sequence $\{p_i\}$ can be obtained by evaluating $p(\cdot)$ on a series of points $\{t_i : 0 \le t_1 < \cdots < t_L\}$: $p_i = p(t_i)$. One simple strategy is to set $t_i = i \cdot \Delta$ so that the points are equidistant, where $\Delta$ is a hyperparameter (e.g., ). With this strategy, we are implicitly assuming the position signals evolve steadily as we go through each token in a sentence. In general, $\{t_i\}$ can be any monotonically increasing series, which allows us to extend our work to more applications where the elements in the sequence are not always observed at the same interval. More discussion of the applicability of this general setting is included in the Supplementary material. For the NLP applications discussed in this paper, we choose $t_i = i \cdot \Delta$.
Eq. (8) is equivalent to the ODE problem $\frac{dp(t)}{dt} = h(t, p(t); \theta_h)$, which is guaranteed to have a unique solution under mild conditions [19]. We follow the efficient approach of [1] to calculate the gradients of $\theta_h$ with respect to the overall training loss, which allows us to include this parameterized dynamical position encoder in the end-to-end training of Transformer models. More details can be found in the Supplementary material.
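To make the construction in (8) concrete, the sketch below generates position vectors by integrating a small dynamics function with a fixed-step Runge-Kutta scheme and reading the state at equidistant points. Everything specific here is an assumption made for illustration: the one-hidden-layer `tanh` network standing in for the dynamics, its random untrained weights, the step `delta=0.1`, and the helper names.

```python
import math
import random

def make_dynamics(dim, hidden, seed=0):
    """Hypothetical stand-in for h(t, p; theta): a one-hidden-layer tanh network."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.5) for _ in range(dim + 1)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def h(t, p):
        inp = [t] + list(p)  # the network sees both time and state
        hid = [math.tanh(sum(w * v for w, v in zip(row, inp))) for row in w1]
        return [sum(w * v for w, v in zip(row, hid)) for row in w2]

    return h

def rk4_step(h, t, p, dt):
    """One classical fourth-order Runge-Kutta step for dp/dt = h(t, p)."""
    k1 = h(t, p)
    k2 = h(t + dt / 2, [pi + dt / 2 * k for pi, k in zip(p, k1)])
    k3 = h(t + dt / 2, [pi + dt / 2 * k for pi, k in zip(p, k2)])
    k4 = h(t + dt, [pi + dt * k for pi, k in zip(p, k3)])
    return [pi + dt / 6 * (a + 2 * b + 2 * c + d)
            for pi, a, b, c, d in zip(p, k1, k2, k3, k4)]

def position_encodings(h, p0, length, delta):
    """Return [p(delta), p(2*delta), ..., p(length*delta)] starting from p(0) = p0."""
    out, p = [], list(p0)
    for i in range(length):
        p = rk4_step(h, i * delta, p, delta)
        out.append(p)
    return out

h = make_dynamics(dim=4, hidden=8)
p0 = [0.0, 0.0, 0.0, 0.0]
short = position_encodings(h, p0, length=10, delta=0.1)
long = position_encodings(h, p0, length=50, delta=0.1)
print(len(short), len(long), long[:10] == short)
```

The same parameters produce encodings for a sequence of any length, and the first ten vectors of the longer sequence coincide with the shorter one, since both follow the same trajectory from the same initial vector.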
3.2 Parameter Sharing among Blocks
As mentioned in Section 2, injecting position information into each block of the Transformer leads to better performance [3, 7] in some language understanding tasks. Our proposed position encoder FLOATER (8) can also be injected into each block. The idea is illustrated in Figure 1. Typically there are $N = 6$ blocks in a sequence-to-sequence Transformer and $12$ or $24$ blocks in BERT. We add a superscript $(n)$ to denote the dynamics at the $n$-th block:
$$p^{(n)}(t) = p^{(n)}(s) + \int_s^t h^{(n)}\big(\tau, p^{(n)}(\tau); \theta_h^{(n)}\big)\, d\tau, \qquad n = 1, \ldots, N. \tag{9}$$
As we can imagine, having a different dynamical model for each block can introduce too many parameters and cause significant training overhead. Instead, we address this issue by sharing parameters across all the blocks, namely
$$\theta_h^{(1)} = \theta_h^{(2)} = \cdots = \theta_h^{(N)}. \tag{10}$$
Note that (10) does not imply that all the $p^{(n)}(t)$ are the same, as we assign different initial values to each block, that is, $p^{(n_1)}(0) \ne p^{(n_2)}(0)$ for $n_1 \ne n_2$.
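A rough parameter count illustrates why sharing matters. The sketch below compares three options under assumed sizes ($L_{\max} = 512$, $d = 512$, $N = 6$) and a hypothetical dynamics network made of two $d \times d$ linear layers with biases; the actual FLOATER architecture and its parameter count may differ, so the figures are indicative only.

```python
def learned_embedding_params(l_max, d, n_blocks):
    """One learnable l_max x d embedding table per block."""
    return l_max * d * n_blocks

def dynamics_params(d):
    """Hypothetical dynamics network: two d x d linear layers, each with a bias vector."""
    return 2 * (d * d + d)

def floater_params(d, n_blocks, shared=True):
    """Dynamics parameters plus one d-dimensional initial vector per block."""
    initial_vectors = n_blocks * d
    if shared:
        return dynamics_params(d) + initial_vectors
    return n_blocks * dynamics_params(d) + initial_vectors

l_max, d, n_blocks = 512, 512, 6  # assumed sizes for illustration
per_block_tables = learned_embedding_params(l_max, d, n_blocks)
shared_dynamics = floater_params(d, n_blocks, shared=True)
unshared_dynamics = floater_params(d, n_blocks, shared=False)
print(per_block_tables, shared_dynamics, unshared_dynamics)
```

Under these assumptions, sharing keeps the cost of the dynamics independent of the number of blocks, while each block still receives its own trajectory through its initial vector.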
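Table 2: BLEU scores on WMT14 En-De and En-Fr.

| Position encoder | Transformer-Base En-De | Transformer-Base En-Fr | Transformer-Large En-De | Transformer-Large En-Fr |
|---|---|---|---|---|
| Position encoders at all blocks | | | | |
| FLOATER | 28.6 | 41.6 | 29.2 | 42.7 |
| Predefined Sinusoidal Position Encoder | 28.2 | 40.6 | 28.4 | 42.0 |
| Fixed-length Position Embedding | 26.9 | 40.9 | 28.3 | 42.0 |
| Position encoder only at input block | | | | |
| FLOATER | 28.3 | 41.1 | 29.1 | 42.4 |
| Predefined Sinusoidal Position Encoder | 27.9 | 40.4 | 28.4 | 41.8 |
| Fixed-length Position Embedding | 27.8 | 40.9 | 28.5 | 42.4 |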
3.3 Compatibility and Warm-start Training
In this section, we change the way to add position encoding so that our FLOATER can be directly initialized from Transformer. As an example, we use the standard Transformer model, which has a fixed sinusoidal encoding at the input block and no position encoding at deeper levels. Note that this technique can be extended to other variants of Transformers with different position encoding methods, such as embedding matrix. We first examine the standard Transformer model, the query matrix at block is
(11) 
where and are parameters in (3); is the sinusoidal encoding; is the th row of . Here we add a tilde sign to indicate the sinusoidal vectors. Formulas for and have a very similar form and are omitted for brevity.
Now we consider the case of FLOATER, where new position encodings are added
(12)  
It is easy to see that changing the position embedding from to is equivalent to adding a position-aware bias vector into each self-attentive layer . As a result, we can instead apply (8) to model the dynamics of . In particular, we have the following dynamical system:
(13) 
After that, we set . We can see that if and , then . This implies (12) degenerates to (11). Note that (13) has the same form as (8), except that we are now modeling the bias terms in (3). We will apply the same technique to and .
To summarize, our model has a tight connection to the original Transformer: if we set all dynamical models to zero, which means , then our FLOATER model will be equivalent to the original Transformer with the sinusoidal encoding. The same trick also works for Transformer with position embedding such as BERT [4].
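The equivalence between swapping the position embedding and adding a position-aware bias can be verified on a toy example. The sketch below assumes the query is computed as $(x_i + p_i) W^Q + b^Q$ for a single position, which follows the form of (3) but is an assumption about the exact notation; the weights, the vectors, and the helper names are made up for illustration.

```python
def row_times_matrix(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def query(x, p, w, b):
    """Query vector for one position: (x + p) W + b."""
    summed = [xi + pi for xi, pi in zip(x, p)]
    return [q + bj for q, bj in zip(row_times_matrix(summed, w), b)]

w_q = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]  # toy 3 x 2 weight matrix
b_q = [0.05, -0.02]
x_i = [1.0, -2.0, 0.5]
p_fixed = [0.0, 1.0, 0.0]   # stands in for the original (e.g. sinusoidal) encoding
p_new = [0.3, 0.7, -0.2]    # stands in for the new encoding

# Position-aware bias induced by replacing p_fixed with p_new.
bias = row_times_matrix([a - b for a, b in zip(p_new, p_fixed)], w_q)

with_new_encoding = query(x_i, p_new, w_q, b_q)
with_extra_bias = [q + e for q, e in zip(query(x_i, p_fixed, w_q, b_q), bias)]

# If the new encoding equals the old one, the extra bias vanishes.
zero_bias = row_times_matrix([0.0, 0.0, 0.0], w_q)
print(with_new_encoding, with_extra_bias, zero_bias)
```

This is why a model whose extra bias starts at zero and has zero dynamics reproduces the original Transformer, which makes warm-starting from an existing checkpoint possible.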
We strive to make our model compatible with the original Transformer for the following reasons. First of all, the original Transformer is faster to train as it does not contain any recurrent computation; this is in contrast to our dynamical model (8), where the next position depends on the previous one. By leveraging the compatibility of the model architecture, we can directly initialize a FLOATER model from a pretrained Transformer checkpoint and then finetune on the downstream task for a few more epochs. By doing so, we enjoy all the benefits of our FLOATER model while still maintaining an acceptable training budget. Likewise, for models such as BERT or Transformer-XL, well-organized checkpoints are already available out of the box for downstream tasks. These models are costly to train from scratch, and since our goal is to examine whether our proposed position representation method can improve over the original one, we decided to copy the weights layer by layer for the attention and FFN layers, and randomly initialize the dynamical model.
4 Experimental Results
In this section, we perform experiments to see whether FLOATER can improve over existing position encoding approaches for a given Transformer model on various NLP tasks. Thus, all the metrics reported in this paper are computed from a single (not ensemble) Transformer model on each NLP evaluation task. Although lower than the top scores on the leaderboards, these metrics give a clearer signal for judging the effectiveness of the proposed position encoder.
All the code used for the experiments in this paper is based on the Transformer implementations in the fairseq [11] package. Implementation details can be found in the Supplementary material. Our experimental code will be made publicly available.
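Table 3: GLUE benchmark results. Task groups: Single Sentence (CoLA, SST-2); Similarity and Paraphrase (MRPC, QQP, STS-B); Natural Language Inference (MNLI, QNLI, RTE).

| Model | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI | QNLI | RTE |
|---|---|---|---|---|---|---|---|---|
| Base model | | | | | | | | |
| RoBERTa | 63.6 | 94.8 | 88.2 | 91.9 | 91.2 | 87.6 | 92.8 | 78.7 |
| FLOATER | 63.4 | 95.1 | 89.0 | 91.7 | 91.5 | 87.7 | 93.1 | 80.5 |
| Large model | | | | | | | | |
| RoBERTa | 68.0 | 96.4 | 90.9 | 92.2 | 92.4 | 90.2 | 94.7 | 86.6 |
| FLOATER | 69.0 | 96.7 | 91.4 | 92.2 | 92.5 | 90.4 | 94.8 | 87.0 |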
Table 4: RACE benchmark results (single model on test, large model).

| Model | Accuracy | Middle | High |
|---|---|---|---|
| RoBERTa | 82.8 | 86.5 | 81.3 |
| FLOATER | 83.3 | 87.1 | 81.7 |
4.1 Neural Machine Translation
Neural Machine Translation (NMT) is the first application that demonstrated the superiority of a sequence-to-sequence Transformer model over conventional recurrent sequence models. We include the following three additive position encoders: .

Data-driven FLOATER: is generated by our proposed continuous dynamical models with data-driven parameters described in (8).

To better demonstrate the parameter efficiency brought by FLOATER, for each of the above encoders, we also include two experimental settings: position encoder at all blocks or only at the input block (i.e., ).
In Table 2, we present the BLEU scores on the WMT14 En-De and En-Fr datasets with both the Transformer-base and Transformer-large models described in [20]. Across all the data/model combinations, our proposed FLOATER at all blocks outperforms the other two position encoders.
On the other hand, we also observe that adding position encoders at all blocks yields better performance than adding them only at the input block, with the exception of the fixed-length position embedding approach. We suspect that this exception is due to overfitting caused by the learnable parameters introduced by that approach. In contrast, our proposed FLOATER is parameter efficient (more discussion in Section 4.3), so the performance can be improved by injecting the position encoder at all the blocks of the Transformer without much additional overhead.
4.2 Language Understanding and Question Answering
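Table 5: SQuAD benchmark results (single models on dev, w/o data augmentation).

| Model | SQuAD 1.1 EM | SQuAD 1.1 F1 | SQuAD 2.0 EM | SQuAD 2.0 F1 |
|---|---|---|---|---|
| RoBERTa | 88.9 | 94.6 | 86.5 | 89.4 |
| FLOATER | 88.9 | 94.6 | 86.6 | 89.5 |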
Pretrained Transformer models such as BERT and RoBERTa have become the key to achieving state-of-the-art performance on various language understanding and question answering tasks. In this section, we evaluate the effectiveness of the proposed FLOATER on these tasks. In particular, we focus on three language understanding benchmark sets: GLUE [21], RACE [6], and SQuAD [17]. As mentioned in Section 3.3, FLOATER is carefully designed to be compatible with existing Transformer models. Thus, we can easily use pretrained Transformer models to warm-start a FLOATER model and then finetune it on these NLP tasks. In this paper, we download the same pretrained RoBERTa model from the official repository as our pretrained Transformer model for all NLP tasks discussed in this section.
GLUE benchmark. This benchmark is commonly used to evaluate the language understanding skills of NLP models. Experimental results in Table 3 show that our FLOATER model outperforms RoBERTa on most datasets, even though the only difference is the choice of positional encoding.
RACE benchmark. Similar to the GLUE benchmark, the RACE benchmark is another widely used test suite for language understanding. Compared with GLUE, each item in RACE contains a significantly longer context, which we believe makes it more important to capture accurate position information. As in the GLUE benchmark, we finetune the model from the same pretrained RoBERTa checkpoint. We also keep the hyperparameters, such as batch size and learning rate, the same. Table 4 shows the experimental results. We again see consistent improvement of FLOATER across all subtasks.
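Table 6: Comparison with RNN-based position encoders on WMT14 En-De.

| Model | BLEU | #Parameters |
|---|---|---|
| FLOATER | 28.57 | 526.3K |
| 1-layer RNN + scalar | 27.99 | 263.2K |
| 2-layer RNN + scalar | 28.16 | 526.3K |
| 1-layer RNN + vector | 27.99 | 1,050.0K |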
SQuAD benchmark. The SQuAD benchmark [17, 16] is another challenging task for evaluating the question answering skills of NLP models. In this dataset, each item contains a lengthy paragraph of facts and several questions related to the paragraph. The model needs to predict the range of characters that answer the questions. In SQuAD v2, the problem becomes more challenging in that the questions might be unanswerable from the context. We follow the same data processing script as BERT/RoBERTa for a fair comparison; more details about the training process are described in the Supplementary material. The experimental results are presented in Table 5. As we can see, the FLOATER model beats the baseline RoBERTa model consistently across most datasets. The improvement is significant, considering that both models are finetuned from the same pretrained checkpoint.
4.3 More Discussions and Analysis
How inductive is FLOATER?
FLOATER is designed to be inductive through a data-driven dynamical model (8). To see how inductive FLOATER is compared to existing approaches, we design the following experiment. We first notice that in the WMT14 En-De dataset, of the training sentences are shorter than tokens. Based on that, we make a new dataset called En-De short to long (or S2L for brevity): this dataset takes all the short sentences ( tokens) as the training split and all the long sentences ( tokens) as the testing split. We further divide the testing split into four bins according to the source length falling in . BLEU scores are calculated in each bin, and the results are presented in Figure 2.
Our FLOATER model performs particularly well on long sentences, even though only short sentences are seen by the model during training. This empirical observation supports our conjecture that FLOATER model is inductive: the dynamics learned from shorter sequences can be appropriately generalized to longer sequences.
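The short-to-long split described above can be sketched as follows. The length threshold and the bin edges are placeholders, since the exact values are not reproduced here, and the convention that "short" means strictly fewer source tokens than the threshold is an assumption; `short_to_long_split` is a hypothetical helper.

```python
def short_to_long_split(pairs, threshold, bin_edges):
    """Split (source_tokens, target_tokens) pairs by source length.

    Pairs with fewer than `threshold` source tokens go to train; the rest go to
    test and are grouped into half-open length bins [lo, hi) built from bin_edges.
    """
    train = [p for p in pairs if len(p[0]) < threshold]
    test = [p for p in pairs if len(p[0]) >= threshold]
    bins = {edges: [] for edges in zip(bin_edges[:-1], bin_edges[1:])}
    for pair in test:
        n = len(pair[0])
        for lo, hi in bins:
            if lo <= n < hi:
                bins[(lo, hi)].append(pair)
                break
    return train, test, bins

# Toy corpus: sentence pairs whose source lengths are 2, 3, 4, 5, 7, 9 and 11 tokens.
toy = [(["src"] * n, ["tgt"] * n) for n in [2, 3, 4, 5, 7, 9, 11]]
train, test, bins = short_to_long_split(toy, threshold=4, bin_edges=[4, 6, 8, 10, 12])
bin_sizes = {edges: len(items) for edges, items in bins.items()}
print(len(train), len(test), bin_sizes)
```

BLEU would then be computed separately on the sentences in each bin, so that degradation with increasing length becomes visible.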
Is RNN a good alternative to model the dynamics?
Recurrent neural networks (RNNs) are commonly used for sequential modeling. RNNs and our continuous dynamical model (8) indeed share some commonalities. Computing the value at the $i$-th step relies on the results at the $(i-1)$-st step. Further, both contain trainable parameters, allowing them to adapt to each particular task. Lastly, both can be extrapolated to any length as needed. To see if an RNN works equally well, we model the sequence with RNN models:
(14) 
where is the input to the RNN model at index . Recall that in RNN language models, is the word embedding or hidden feature of the th token. In our case, since we apply the RNN to learn the encodings as opposed to hidden features, sensible inputs are a scalar value or a vectorized value given by the sinusoidal encoding. We tried both choices on WMT14 En-De data and found that the vectorized value generally works better, though not as well as our FLOATER model. Detailed results can be found in Table 6.
What does each position encoding look like?
To better understand how different position encodings affect sequence modeling, in Figure 3 we visualize the position embedding matrix obtained from four different position encoding approaches for the Transformer-base backbone on the WMT14 En-De dataset. We can see that the sinusoidal encoding (3(a)) is the most structured, while the position embedding (3(b)) is quite chaotic. Our FLOATER model learns position representations completely from data, but still exhibits some regularities (3(c)). Finally, the RNN model (3(d)) fails to extract sufficient positional information, probably due to the vanishing gradient problem. Another finding is that in (3(b)) the vectors are nearly constant among different large positions (near the bottom of Figure 3(b), we see patterns of vertical lines with the same color). This phenomenon arises because long sentences are scarce in the dataset, so the positional information carried by lower indices cannot be extrapolated to higher indices. In contrast, the dynamical model proposed in this paper enjoys the best of both worlds: it adapts to the dataset distribution, and it is inductive, handling sequences longer than those in the training split.
4.4 Remarks on Training and Testing Efficiency
It is not surprising that during training, our flow-based method adds a non-negligible time and memory overhead; this is because solving the Neural ODE precisely involves times forward and backward propagations of the flow model. Even though we deliberately designed a small flow model (consisting of only two FFN layers and one nonlinearity layer), stacking them together still increases training time substantially. To make it possible to train big models, we use the following optimizations:

Initialize with pretrained models that do not contain flow-based dynamics, as discussed in Section 3.3.

From (8), we know that if is close to zero, then the position information diminishes (derived in the appendix). In this way, our model degenerates to the original Transformer. Inspired by this property, we can initialize FLOATER with smaller weights. Combined with the previous trick, this gives an informed initialization that incurs lower training loss at the beginning.

We observed that the weights in are more stable and easier to train. Thus, we can separate the weights of from the remaining parts of the Transformer model. Concretely, we can 1) cache the positional bias vectors for some iterations without recomputing them, 2) update the weights of the flow models less frequently than other parts of the Transformer, and 3) update the flow models with a larger learning rate to accelerate convergence.

For the RoBERTa model, we adopt an even more straightforward strategy: we first download a pretrained RoBERTa model, plug in some flow-based encoding layers, and retrain the encoding layers on the WikiText103 dataset for one epoch. When finetuning on GLUE datasets, we can choose to freeze the encoding layers.
Combining those tricks, we successfully train our proposed models with only % overhead compared to traditional models, and virtually no overhead when finetuning RoBERTa model on GLUE benchmarks. Moreover, there is no overhead during the inference stage if we store the precalculated positional bias vectors in the checkpoints.
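The caching trick listed above, reusing positional bias vectors for several iterations instead of solving the ODE at every step, might be organized as in the following sketch. The class name, the refresh interval, and the stand-in `compute_fn` are hypothetical and do not reflect the actual training code.

```python
class CachedPositionBias:
    """Recompute the positional bias vectors only every `refresh_every` training steps."""

    def __init__(self, compute_fn, refresh_every):
        self.compute_fn = compute_fn        # the expensive call, e.g. an ODE solve
        self.refresh_every = refresh_every
        self._cache = None
        self.num_solves = 0

    def get(self, step):
        if self._cache is None or step % self.refresh_every == 0:
            self._cache = self.compute_fn()
            self.num_solves += 1
        return self._cache

def fake_ode_solve():
    """Stand-in for the real solver: returns 8 position vectors of dimension 4."""
    return [[0.0] * 4 for _ in range(8)]

encoder = CachedPositionBias(fake_ode_solve, refresh_every=10)
for step in range(100):
    bias_vectors = encoder.get(step)
print(encoder.num_solves)
```

At inference time the same idea applies in its limiting form: the vectors are computed once, stored with the checkpoint, and never recomputed.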
5 Conclusions
In this paper, we have shown that learning position encoding with a dynamical model can be an advantageous approach to improve Transformer models. Our proposed position encoding approach is inductive, datadriven, and parameter efficient. We have also demonstrated the superiority of our proposed model over existing position encoding approaches on various natural language processing tasks such as neural machine translation, language understanding, and question answering tasks.
References
[1] (2018) Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
[2] (2018) Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 6571–6583.
[3] (2018) Universal transformers. arXiv preprint arXiv:1807.03819.
[4] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[5] (2018) FFJORD: free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.
[6] (2017) RACE: large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
[7] (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
[8] (2019) Hierarchical transformers for multi-document summarization. arXiv preprint arXiv:1905.13164.
[9] (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
[10] (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
[11] (2019) Fairseq: a fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
[12] (2018) Scaling neural machine translation. arXiv preprint arXiv:1806.00187.
[13] (1992) Numerical recipes in C++. The Art of Scientific Computing 2, pp. 1002.
[14] (2019) Language models are unsupervised multitask learners.
[15] (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
[16] (2018) Know what you don't know: unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
[17] (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
[18] (2018) Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468.
[19] (1985) Ordinary differential equations: an elementary textbook for students of mathematics, engineering, and the sciences. Dover Books on Mathematics, Dover Publications. ISBN 9780486649405.
[20] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[21] (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
[22] (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
[23] (2019) Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077.
[24] (2019) HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization. CoRR abs/1905.06566.
Appendix A Training a Neural ODE model in Transformer
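We discuss the details of training the dynamical model $h(\tau, p(\tau); \theta_h)$. Recall that in our FLOATER model, the function $h$ joins the computational graph implicitly by generating a sequence of position encoding vectors $\{p_1, p_2, \ldots, p_L\}$, conditioned on a freely initialized vector $p_0$. The generation steps are computed iteratively as follows (suppose we choose the interval between two consecutive tokens to be $\Delta$):
$$p_1 = p_0 + \int_0^{\Delta} h\big(\tau, p(\tau); \theta_h\big)\, d\tau, \quad p_2 = p_1 + \int_{\Delta}^{2\Delta} h\big(\tau, p(\tau); \theta_h\big)\, d\tau, \quad \ldots, \quad p_L = p_{L-1} + \int_{(L-1)\Delta}^{L\Delta} h\big(\tau, p(\tau); \theta_h\big)\, d\tau. \tag{15}$$
Finally, the loss $\mathcal{L}$ of this sequence is a function of all position encoding results, $\mathcal{L} = \mathcal{L}(p_0, p_1, \ldots, p_L)$, which is in turn a function of the model parameters $\theta_h$. The question is how to calculate the gradient $d\mathcal{L}/d\theta_h$ through backpropagation. This question is fully solved by the Neural ODE method [1] with an efficient adjoint ODE solver. To illustrate the principle, we draw a diagram showing the forward and backward propagation in Figure 4.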
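From [1], we know that the gradients $d\mathcal{L}/d\theta_h$ can be computed by
$$\frac{d\mathcal{L}}{d\theta_h} = -\int_t^s a(\tau)^\top \frac{\partial h\big(\tau, p(\tau); \theta_h\big)}{\partial \theta_h}\, d\tau, \tag{16}$$
where $a(\tau)$, defined for $\tau \in [s, t]$, is called the "adjoint state" of the ODE, which can be computed by solving another ODE
$$\frac{da(\tau)}{d\tau} = -a(\tau)^\top \frac{\partial h\big(\tau, p(\tau); \theta_h\big)}{\partial p(\tau)}. \tag{17}$$
Note that the computation of (17) only involves a Jacobian-vector product, so it can be efficiently calculated by automatic differentiation.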
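A scalar example shows how (16) and (17) work in practice. The sketch below assumes the toy dynamics $dp/d\tau = \theta p$ with loss $\mathcal{L} = p(T)$, for which the exact gradient is $T p_0 e^{\theta T}$; it integrates the state, the adjoint, and the gradient accumulator backward in time with a fixed-step Runge-Kutta solver. The dynamics, the constants, and the helper `rk4` are illustrative choices, not part of the FLOATER model.

```python
import math

def rk4(f, t0, y0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 (t1 < t0 integrates backward in time)."""
    dt = (t1 - t0) / steps
    t, y = t0, list(y0)
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, [yi + dt / 2 * k for yi, k in zip(y, k1)])
        k3 = f(t + dt / 2, [yi + dt / 2 * k for yi, k in zip(y, k2)])
        k4 = f(t + dt, [yi + dt * k for yi, k in zip(y, k3)])
        y = [yi + dt / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += dt
    return y

theta, p0, T = 0.7, 1.5, 2.0  # toy values

# Forward pass: solve dp/dt = theta * p from 0 to T.
(p_T,) = rk4(lambda t, y: [theta * y[0]], 0.0, [p0], T, 200)

def augmented(t, y):
    """Backward dynamics of (state, adjoint, gradient accumulator)."""
    p, a, g = y
    return [
        theta * p,   # state: dp/dt = h(p) = theta * p
        -a * theta,  # adjoint: da/dt = -a * dh/dp
        -a * p,      # accumulator: dg/dt = -a * dh/dtheta
    ]

# Loss L = p(T), so a(T) = dL/dp(T) = 1; integrate from T back to 0.
p_back, a_0, grad = rk4(augmented, T, [p_T, 1.0, 0.0], 0.0, 200)

exact = T * p0 * math.exp(theta * T)
print(grad, exact)
```

The backward pass recovers both the initial state and the gradient without storing the forward trajectory, which is what makes the adjoint approach memory efficient.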
Appendix B Implementation details
B.1 Settings of ODE solver
To set up the ODE solver, we first need to choose the numerical algorithm [13]. We use different setups for different datasets. For neural machine translation problems (WMT14 En-De and En-Fr), we use the more accurate Runge-Kutta scheme with discretization step to solve the adjoint equation (recall that we set the interval of two neighboring tokens to be globally). For datasets with long sentences, such as the GLUE and RACE benchmarks, we found that solving the adjoint equation with a high-order scheme is too slow; in such cases we adopt the simple midpoint method with discretization step , and the gradients are calculated by automatic differentiation rather than the adjoint method. The third-party implementation of the ODE solver can be found at https://github.com/rtqichen/torchdiffeq.
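The trade-off between the two schemes can be seen on a toy problem. The sketch below integrates $dp/dt = -p$, $p(0) = 1$ up to $t = 1$ with ten steps of each method and counts how often the dynamics function is evaluated; the test equation and step count are arbitrary illustrative choices and unrelated to the settings used in the experiments.

```python
import math

def midpoint_step(f, t, p, dt):
    """Second-order midpoint method: two evaluations of f per step."""
    k1 = f(t, p)
    return p + dt * f(t + dt / 2, p + dt / 2 * k1)

def rk4_step(f, t, p, dt):
    """Fourth-order Runge-Kutta method: four evaluations of f per step."""
    k1 = f(t, p)
    k2 = f(t + dt / 2, p + dt / 2 * k1)
    k3 = f(t + dt / 2, p + dt / 2 * k2)
    k4 = f(t + dt, p + dt * k3)
    return p + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def solve(step, t_end, n_steps):
    """Integrate dp/dt = -p from p(0) = 1 and count function evaluations."""
    calls = [0]

    def f(t, p):
        calls[0] += 1
        return -p

    p, dt = 1.0, t_end / n_steps
    for i in range(n_steps):
        p = step(f, i * dt, p, dt)
    return p, calls[0]

exact = math.exp(-1.0)
p_mid, calls_mid = solve(midpoint_step, 1.0, 10)
p_rk4, calls_rk4 = solve(rk4_step, 1.0, 10)
err_mid, err_rk4 = abs(p_mid - exact), abs(p_rk4 - exact)
print(calls_mid, err_mid, calls_rk4, err_rk4)
```

The midpoint method needs half as many evaluations of the dynamics per step at the price of a larger error, which is consistent with preferring it when sequences are long and each evaluation is a neural network call.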
B.2 Training NMT tasks
We run the same preprocessing script provided by fairseq [11], which is also used in ScalingNMT [12]. With the standard training script, we first successfully reproduce all the results in Transformer paper [20]. Based on that we execute the following protocol to get our results:

Train the original Transformer model for 30 epochs.

Randomly initialize a FLOATER model with the same shape configuration.

Copy tensors from the best-performing checkpoint (on the validation set) to initialize the FLOATER model. Initialize the weights in the dynamical model with small values.

Halve the peak learning rate (e.g., in Transformer-base + En-De, the peak learning rate is changed from to ).

With the warm-initialized FLOATER checkpoint, retrain on the same dataset for 10 epochs (En-De) or 1 epoch (En-Fr).

Average the last 5 checkpoints and compute the BLEU score on the test split.
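The final step of the protocol, averaging the last few checkpoints, amounts to an element-wise mean over parameter tensors. The sketch below represents each checkpoint as a dictionary from parameter names to flat lists of floats, which is a simplification of real checkpoint files; `average_checkpoints` here is a hypothetical helper, not the fairseq script.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameters across checkpoints with identical keys and shapes."""
    n = len(checkpoints)
    first = checkpoints[0]
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(first[name]))]
        for name in first
    }

# Five toy checkpoints whose values grow linearly with the epoch index k.
last_five = [
    {"attention.weight": [1.0 * k, 2.0 * k], "flow.weight": [0.5 * k]}
    for k in range(1, 6)
]
averaged = average_checkpoints(last_five)
print(averaged)
```

BLEU is then computed once on the test split using the averaged parameters rather than any single checkpoint.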
B.3 Training language understanding tasks
For the GLUE/SQuAD/RACE benchmarks, our experiments are all conducted on RoBERTa, for which both base and large configurations are available. Due to resource constraints (and to show the compatibility with existing models), we initialize our FLOATER model with pretrained RoBERTa, similar to the NMT task. However, the weights in the dynamic function are not trained on a large corpus. Given that the GLUE/SQuAD/RACE datasets are too small to train the dynamics from scratch, we decided to pretrain alone on WikiText103 [10] data using the masked language modeling loss. We have found that when we train alone, it takes only a few hours (2x Titan V100) and one epoch to converge.
Once we have the pretrained FLOATER model, we can run the following downstream tasks and compare with RoBERTa under the same setting:
GLUE benchmark.
This benchmark consists of eight datasets, each of which has different hyperparameter settings. For hyperparameters such as learning rate, batch size, training iterations, warmup iterations, etc., we use the values recommended by the official RoBERTa repository (available at: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md).
SQuAD benchmark.
For this benchmark we wrote our own finetuning code because there is currently no official code available. During implementation, we mainly referred to third-party repositories (mainly https://github.com/ecchochan/robertasquad and https://github.com/huggingface/transformers). We were not able to exactly match the official result reported in the RoBERTa paper, but came quite close ( difference in F1). For our FLOWER model, we use the same hyperparameters as RoBERTa.
RACE benchmark.
This benchmark has the longest context and sequence length. We follow the official training script (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.race.md) and reproduce the result. As with the other benchmarks, we then repeat the training process using exactly the same training hyperparameters to make a fair comparison. In this benchmark we freeze the weights of the dynamic function and only finetune the weights of RoBERTa.
Appendix C Cases suitable for non-equidistant discretization
Although our model allows continuous values of and in (8), limiting the scope to text modeling tasks, positions are discrete values as . Once the continuous version of position representation is obtained, we simply take the discretized as the actual values to feed into the Transformer model, where is a hyperparameter (e.g. ). By choosing positions equidistantly, we implicitly assume that the position signal evolves steadily as we go through each token in a sentence. More generally, the dynamics in (8) can deal with the case in which positions are not integers etc., but an arbitrary monotonically increasing series that may not be equidistant. Below, we exemplify this general situation with several widely deployed tasks; we regard this as an interesting future direction. This makes our model particularly suitable for the following scenarios, which traditional position representations may not handle well:

Hierarchical Transformer model [8, 24]. The model is a direct extension of the hierarchical RNN and is often used in long document processing. It works by first running a word-level Transformer model on each sentence to extract the sentence embedding, and then applying a sentence-level Transformer that scans through the sentence embeddings sequentially. We argue that when processing at the sentence level, it could be better to set the increment of position index proportional to the length of the th sentence. This is because longer sentences tend to carry more information, so is likely to move farther from .

Transformer for time-series events. As measurement time is continuous, time-series data is another scenario where a continuous position makes more sense than a discrete counterpart. More importantly, to predict future values by modeling historical values observed on irregular time grids, it is better to consider the length of the time horizon between two consecutive measurements. A successful previous work is the Latent ODE [2], except that it uses an RNN as the backbone and models the hidden states rather than position representations with a Neural ODE (because the RNN itself provides positional bias).
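Both scenarios amount to evaluating a continuous position signal on a non-equidistant, monotonically increasing grid. The sketch below illustrates this under strong simplifying assumptions: the position representation is a scalar, the dynamics is a toy function rather than a learned network, and the names `sentence_positions` and `evolve`, the `scale` factor, and the number of substeps are all hypothetical. For time-series data, the measurement timestamps would be passed directly as `times`.

```python
from itertools import accumulate


def sentence_positions(sentence_lengths, scale=1.0):
    """Sentence-level positions with length-proportional increments.

    The first sentence sits at position 0; the gap between sentence i and
    sentence i + 1 is scale * (length of sentence i).
    """
    increments = [scale * n for n in sentence_lengths[:-1]]
    return [0.0] + list(accumulate(increments))


def evolve(f, p0, times, substeps=4):
    """Evaluate p(t) at arbitrary increasing times with the midpoint method."""
    out = [p0]
    p, t = p0, times[0]
    for t_next in times[1:]:
        h = (t_next - t) / substeps
        for _ in range(substeps):
            k1 = f(t, p)
            p = p + h * f(t + h / 2, p + h / 2 * k1)
            t += h
        t = t_next  # snap to the grid point to avoid rounding drift
        out.append(p)
    return out


# Toy scalar dynamics dp/dt = -p; a stand-in for the learned dynamics.
def toy_dynamics(t, p):
    return -p


positions = sentence_positions([5, 20, 10], scale=0.1)
reps = evolve(toy_dynamics, 1.0, positions)
```

Here the long second sentence pushes the third position much farther from the second than the second is from the first, so the position representation changes more across that gap.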
In this paper, we are not going to explore the more general cases discussed above. Instead, we decided to leave them as interesting future work.