1 Introduction
Recent years have witnessed great success of deep neural networks for natural language processing tasks, such as language modeling (Zaremba et al., 2014; Merity et al., 2018) and Neural Machine Translation (Bahdanau et al., 2015; Vaswani et al., 2017). The excellent task performance they achieve has sparked interest in interpreting their underlying mechanisms. Since linguistic knowledge is crucial in natural languages, an emerging body of literature uses probes (Conneau et al., 2018; Alt et al., 2020; Saleh et al., 2020; Cao et al., 2021) to investigate whether a standard model trained towards better task performance also captures linguistic information. From the perspective of information theory, Voita and Titov (2020) and Pimentel et al. (2020b) show that probes can be used to estimate the amount of linguistic information captured by a fixed model.
However, the above probing only extracts linguistic information from a fixed standard model, which does little to illuminate the relationship between task performance and the linguistic information encoded by the model. For example, under their methodology, it is difficult to answer the following two questions: first, would adding linguistic information be beneficial for an NLP model; second, is it harmful when this linguistic information is reduced? Therefore, it remains an open and intriguing question how task performance changes with respect to different amounts of linguistic information.
To this end, this paper proposes a novel viewpoint to study the relationship between task performance and the amount of linguistic information, inspired by the criterion of Pareto Optimality, which is widely used in economics (Greenwald and Stiglitz, 1986). Our main idea is to obtain Pareto-optimal models on a test set in terms of both linguistic information and task performance, and then to visualize the relationship along these optimal models. By comparing a standard model with these optimal models, it becomes clear whether adding the encoded information helps improve task performance over the standard model, as illustrated in Figure 1, where the points on the line are Pareto-optimal and the red triangle denotes the standard model with the best performance.
Nevertheless, it is typically intractable to obtain the Pareto-optimal models according to both dimensions on test data. To address this challenge, we propose a principled method that approximately optimizes the Pareto-optimal models on the training data, which can be expected to generalize well to test sets according to statistical learning theory (Vapnik, 1999). Formally, the approach can be regarded as a multi-objective optimization problem: during the learning procedure, it optimizes two objectives, i.e., the task performance and the extracted linguistic information. In addition, we develop a computationally efficient algorithm to address the optimization problem. By inspecting the trend of the Pareto-optimal points, the relationship between task performance and linguistic information can be clearly illustrated. Returning to our questions, we consider two instances within the proposed methodology: one aims to maximize the amount of linguistic information (i.e., adding) while the other tries to minimize it (i.e., reducing). We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and choose three different linguistic properties, including two syntactic properties (Part-of-Speech and dependency labels) and one phonetic property. We investigate the relationship between NMT performance and each kind of syntactic information, and the relationship between LM performance and phonetic information. For machine translation, we use LSTM, i.e., RNNsearch (Bahdanau et al., 2015), and Transformer (Vaswani et al., 2017) as the main model architectures, and conduct our experiments on En→De and Zh→En tasks. For language modeling, we employ the LSTM model and conduct experiments on the Penn Treebank dataset. The experimental results show that: i) syntactic information encoded by NMT models is important for the MT task, and reducing it leads to sharply decreased performance; ii) the standard NMT model obtained by maximum likelihood estimation (MLE) is Pareto-optimal for the Transformer but not for LSTM-based NMT; iii) reducing the phonetic information encoded by LM models only makes task performance drop slightly.
In summary, our contributions are threefold:


We make an initial attempt to study the relationship between encoded linguistic information and task performance, i.e., how changes in the amount of linguistic information affect model performance.

We propose a new viewpoint from Pareto Optimality, as well as a principled approach formulated as a multi-objective optimization problem, to visualize the relationship.

Our experimental results show that encoding more linguistic information does not necessarily yield better task performance; the effect depends on the specific model architecture.
2 Related Work
Probe
With the impressive performance of neural network models for NLP tasks (Sutskever et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Devlin et al., 2019; Xu et al., 2020), people have become interested in understanding neural models (Ding et al., 2017; Li et al., 2019, 2020). One popular interpretation method is the probe (Conneau et al., 2018), also known as auxiliary prediction (Adi et al., 2017) and diagnostic classification (Hupkes et al., 2018), which aims to understand how neural models work and what information they have encoded and used. From the perspective of information theory, Voita and Titov (2020) and Pimentel et al. (2020b) show that probes can be used to estimate the amount of linguistic information captured by a model. However, recent studies point out that probes fail to demonstrate whether the information is actually used by models. For example, Hewitt and Liang (2019) show that a probe can also achieve high accuracy in predicting randomly generated tags, which are useless for the task. And Ravichander et al. (2021) show that representations encode linguistic properties even when those properties are invariant and not required for the task. Instead of studying the encoded linguistic information by training a probe on fixed representations, in this work we study how changes in the amount of linguistic information affect the performance of NLP tasks.
Information Removal
Information removal is crucial in the areas of transfer learning (Ganin and Lempitsky, 2015; Tzeng et al., 2017; Long et al., 2018) and fairness learning (Xie et al., 2017; Elazar and Goldberg, 2018), where the goal is to remove domain information or bias from learned representations. One popular method is adversarial learning (Goodfellow et al., 2014; Ganin and Lempitsky, 2015), which trains a classifier to predict properties of the representations, e.g., domain information or gender bias, while the feature extractor tries to fool the classifier. In this work, when using our method to reduce the linguistic information in the representations, we find that our multi-objective loss function has the same form as adversarial learning, which provides a theoretical justification for using adversarial learning to find Pareto-optimal solutions to a multi-objective problem.
Recently, Elazar et al. (2020) also propose to study the role of linguistic properties with the idea of information removal (Ravfogel et al., 2020). However, the representations obtained by their method may not be Pareto-optimal, because their method only minimizes the mutual information and ignores the objective of task performance. In contrast, our proposed method optimizes towards both objectives, so our results can be used to visualize the relationship between linguistic properties and task performance.
Pareto Optimality
The idea of Pareto Optimality (Mas-Colell et al., 1995) is an important criterion in economics, where the goal is to characterize situations in which no variable can be made better off without making at least one other variable worse off. It has also been widely used in sociology and game theory (Beckman et al., 2002; Chinchuluun et al., 2008). In artificial intelligence, Martínez et al. (2020) use Pareto optimality to address the group fairness problem, and Duh et al. (2012) propose to optimize an MT system on multiple metrics based on the theory of Pareto optimality. In particular, Pimentel et al. (2020a) propose a variant of probing on the hidden representations of deep models and, similar to our work, consider Pareto optimality in terms of both objectives. Compared with their work, one difference is the choice of objectives. Another significant difference is that they optimize the probing model in a conventional fashion, and thus are unable to study the relationship between linguistic information and task performance.
3 Visualizing Relationship via Pareto Optimality
We consider the relationship between linguistic information and task performance for two popular NLP tasks, i.e., machine translation and language modeling. Let x = (x_1, ..., x_n) be a sentence and z = (z_1, ..., z_n) be the labels of a linguistic property of x, where z_i is the label for x_i, e.g., its POS tag. On both tasks, a deep model typically encodes x into a hidden representation h with a sub-network f parameterized by θ: h = f(x; θ), and then uses another sub-network g parameterized by ψ to map h into an output.
3.1 Background
Models and Loss in NMT
An NMT architecture aims to output a target sentence y for a given source sentence x according to P(y | x; θ) (Zaremba et al., 2014; Vaswani et al., 2017), where θ denotes the parameters of a sequence-to-sequence neural network, which contains an encoder f and a decoder g. We define h = f(x; θ) as the output of the encoder. To train θ, the MLE loss is usually minimized on a training dataset D. For NMT, the loss is defined as follows:

L_mle(θ) = - Σ_{(x,y) ∈ D} log P(y | x; θ)   (1)
In our experiments, we consider two models, namely the LSTM (Bahdanau et al., 2015) and the Transformer (Vaswani et al., 2017).
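As a concrete illustration, the loss in Eq. (1) can be sketched in a few lines of Python. This is our own toy sketch, not the paper's code; the per-token probabilities below are hypothetical stand-ins for the outputs of a real NMT model.

```python
import math

def mle_loss(token_probs):
    """Negative log-likelihood over a corpus, as in Eq. (1).

    token_probs: one list per sentence pair, holding the model's
    probabilities P(y_t | y_<t, x) for the reference target tokens
    (hypothetical numbers, standing in for a real NMT model).
    """
    return -sum(math.log(p) for sent in token_probs for p in sent)

# Two toy target sentences with made-up per-token probabilities.
corpus = [[0.5, 0.25], [0.5]]
loss = mle_loss(corpus)
```

Minimizing this quantity over θ is exactly what standard MLE training does; the probabilities would come from the decoder's softmax in practice.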
Models and Loss in LM
For the language modeling task, a deep model typically generates a token x_t based on the preceding tokens x_{<t} according to P(x_t | x_{<t}; θ). Here the sub-network f is set as one hidden layer that encodes x_{<t} into h, and g is set as the sub-network that generates x_t on top of h. The parameter θ is optimized by the following MLE loss:

L_mle(θ) = - Σ_{x ∈ D} Σ_t log P(x_t | x_{<t}; θ)
To make the notation consistent for both NMT and LM, in the rest of this paper we follow the form of Eq. (1) and rewrite the L_mle in LM as L_mle(θ) = - Σ_{(x,y) ∈ D} log P(y | x; θ), where y is a shifted version of x, i.e., y_t = x_{t+1}.
Encoded Information
Let I(h; z) denote the linguistic information in the representation h, i.e., the mutual information between h and the linguistic labels z. Since the probability P(z | h) is unknown, it is intractable to compute I(h; z) exactly. Following Pimentel et al. (2020b), we approximately estimate I(h; z) by using a probing model as follows:

I(h; z) = H(z) - H(z | h) ≈ H(z) - L_probe(θ, φ)   (2)

where H(z) is the entropy of the linguistic labels, H(z | h) is the ideal cross entropy, and L_probe(θ, φ) is the cross-entropy loss of the probe model parameterized by φ.
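The estimate in Eq. (2) can be sketched as follows (our own illustration, not the paper's code): the empirical label entropy minus the probe's average cross-entropy gives the estimated amount of encoded information. The label names and probe outputs below are hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Empirical entropy H(z) of the linguistic labels, in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def probe_cross_entropy(labels, probe_dists):
    """Average cross-entropy of the probe's predicted distributions."""
    return -sum(math.log(d[z]) for z, d in zip(labels, probe_dists)) / len(labels)

def mi_estimate(labels, probe_dists):
    """Eq. (2): approximate I(h; z) by H(z) minus the probe loss."""
    return label_entropy(labels) - probe_cross_entropy(labels, probe_dists)

# Hypothetical POS labels for four tokens; a perfect probe recovers
# all of H(z), while a weaker probe recovers less.
labels = ["NOUN", "VERB", "NOUN", "DET"]
```

A probe that puts probability 1 on every correct label has zero cross-entropy, so the estimate attains its upper value H(z); any weaker probe yields a smaller (lower-bound) estimate.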
Theory of Pareto Optimality
Pareto optimality (Mas-Colell et al., 1995) essentially concerns the multi-objective optimization problem. Suppose that we have K different objectives O_1, ..., O_K to evaluate a parameter θ, i.e.,

max_θ (O_1(θ), ..., O_K(θ))   (3)
There are two important concepts in Pareto optimality as follows:
Definition 1. Pareto Optimal: A parameter θ* is Pareto-optimal iff there exists no θ such that O_k(θ) ≥ O_k(θ*) for all k ∈ {1, ..., K} and O_j(θ) > O_j(θ*) for some j.
Definition 2. Pareto Frontier: The set of all Pareto-optimal parameters is called the Pareto frontier.
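For a finite set of candidate models evaluated on the two objectives, these definitions suggest a simple filtering procedure. The sketch below is our own illustration, with hypothetical (information, BLEU) pairs; both coordinates are treated as quantities to maximize.

```python
def pareto_frontier(points):
    """Return the non-dominated points (Definition 1), treating both
    coordinates, e.g., (linguistic information, task performance),
    as quantities to maximize."""
    frontier = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier

# Hypothetical (information, BLEU) pairs for four trained models.
models = [(0.2, 30.0), (0.5, 31.5), (0.8, 30.5), (0.4, 29.0)]
```

Here the first and last models are dominated, so only the two non-dominated points survive; visualizing the survivors is exactly how we plot the frontiers in our figures.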
3.2 Viewpoint via Pareto Optimality
Motivation
Suppose θ is a given model parameter, P(θ) is its task performance on a test set, and I(θ) is the amount of linguistic information encoded in its hidden representation. Conventionally, if one could find a function g such that P(θ) = g(I(θ)) for any θ, it would be trivial to study the relationship by visualizing g. Unfortunately, in complicated situations such as the one illustrated in Figure 1, no such function exists to represent the relationship between the two variables, due to a large number of many-to-many correspondences.
Our Viewpoint
Pareto Optimality, a well-known criterion in economics (Mas-Colell et al., 1995), is widely used to analyze the relationship among multiple variables in a complicated environment (Chinchuluun et al., 2008). In our context, it is also a powerful tool to reveal the relationship between the encoded linguistic information and task performance. Taking the Pareto frontier in Figure 1 as an example, since the capacity of a model is fixed and linguistic information may compete with other kinds of information, capturing more linguistic information may reduce the amount of information from other sources that are also helpful for the model. Conversely, if increasing the amount of linguistic information consistently led to performance gains, i.e., if linguistic information were complementary to translation, only one Pareto-optimal point would exist, at the top right corner.
Therefore, in this paper, we propose to study the relationship between I(θ) and P(θ) from the viewpoint of Pareto Optimality. Our key idea is to take into account only Pareto-optimal models, rather than all models as the conventional method would. Thanks to the definition of Pareto optimality, there are no many-to-many correspondences between the two variables along the Pareto frontier. Hence their relationship can be visualized by the trend of the frontier points, as shown in Figure 1. Taking Figure 1 as an example, to answer the questions raised before, we can see that adding more information can increase task performance compared with a standard model. Under this viewpoint, the core challenge is how to obtain a set of models that are Pareto-optimal on a test dataset.
It is natural to employ a heuristic method to approximately obtain the Pareto-optimal models, as follows. We can first randomly select a number of checkpoints during standard training and probe each checkpoint by optimizing its corresponding probing model φ, as shown in Eq. (2). Second, we can record the task performance and the amount of linguistic information of the selected models on a test set. Finally, we can find the Pareto-optimal points and obtain the Pareto frontier. However, when using this method in our experiments, we find that the amounts of encoded linguistic information for all checkpoints are similar and that the task performance of those checkpoints is worse than that of the optimal model. Hence, in the next section, a new method is presented to approximately derive the Pareto-optimal models.
4 Methodology
4.1 Multi-Objective Optimization
To study the relationship between linguistic information and task performance, our goal is to obtain a set of models that are Pareto-optimal on test data in terms of both objectives. Inspired by statistical learning theory (Vapnik, 1999), we propose an approach that optimizes the Pareto-optimal models towards both objectives on a given training dataset; these models are expected to generalize well to unseen test data, i.e., to be Pareto-optimal on unseen test data. Formally, our approach can be formulated as the following multi-objective optimization problem:
( min_θ L_mle(θ),  max_θ I(h; z) )   (4)

where minimizing L_mle(θ) aims to promote the task performance, and maximizing I(h; z) encourages a model to encode more linguistic information in the representation. Once we obtain a set of Pareto-optimal models, we can observe how increasing the encoded linguistic information affects task performance.
To further study how reducing the encoded linguistic information affects task performance, we optimize a similar multi-objective problem:

( min_θ L_mle(θ),  min_θ I(h; z) )   (5)
The only difference between Eq. (4) and Eq. (5) is that the former maximizes I(h; z) while the latter minimizes it.
Since H(z) is a constant term, we can plug Eq. (2) into the above two equations and obtain the following reduced multi-objective problems:

( min_θ L_mle(θ),  min_{θ, φ} L_probe(θ, φ) )   (6)

( min_θ L_mle(θ),  max_θ min_φ L_probe(θ, φ) )   (7)
Notice that in the above equations, minimizing L_probe over φ resembles conventional probing when h is a fixed representation. However, unlike standard probing, which is applied on top of a fixed h determined by the standard model, here h is the representation obtained from an encoder parameterized by θ, which is optimized jointly. It is also worth noting that the Pareto frontiers obtained from Eq. (6) and Eq. (7) are independent, even though they involve similar measurements, because Pareto optimality is only defined with respect to a fixed set of objectives.
4.2 Optimization Algorithm
To solve the above multi-objective problems, we leverage the linear-combination method to find a set of solutions, and then filter the non-Pareto-optimal points from the set to get the Pareto frontier. The details of our algorithm are given below.
Optimization Process
Since the optimization method for Eq. (6) is similar to that for Eq. (7), in the following we take Eq. (6) as an example to describe the optimization method. Inspired by Duh et al. (2012), we employ a two-step strategy to find the Pareto frontier of the multi-objective problem.

In the first step, we find a set of approximately Pareto-optimal solutions to the problem. There are several methods for doing so, such as linear combination, PMO (Duh et al., 2012), and APStar (Martínez et al., 2020). In this work, we adopt the linear-combination method because of its simplicity. Specifically, we select a coefficient set Λ = {λ_1, ..., λ_m} and minimize the following interpolating function for each coefficient λ ∈ Λ:

L(θ, φ; λ) = L_mle(θ) + λ · L_probe(θ, φ)   (8)

(Eq. (8) is similar to the loss of standard multi-task learning (MTL) (Dong et al., 2015; Lee et al., 2020). However, the solutions to the MTL loss are weaker than our solutions in terms of Pareto optimality, and MTL could not remove linguistic information in our preliminary experiments.)
Notice that the first term of the loss function is a function of both the encoder and decoder parameters in θ, while the second term is a function of only the encoder parameters and the probe parameters φ. Therefore, when minimizing Eq. (8), we apply a Gradient-Multiple (GM) Layer to the representations before feeding them into the probe model. As shown in Fig. 2, in the forward propagation the GM Layer acts as an identity transform, while in the backward propagation it multiplies the gradient by a constant factor and passes it to the preceding layers. Note that when the multiplier is -1, the GM Layer is the same as the Gradient Reversal Layer (Ganin and Lempitsky, 2015).
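The GM Layer's behavior can be sketched without any deep learning framework (our illustration; a real implementation would be a custom autograd function):

```python
class GradientMultipleLayer:
    """Sketch of the GM Layer: identity in the forward pass, gradient
    scaled by `factor` in the backward pass. With factor = -1 this
    reduces to the Gradient Reversal Layer (Ganin and Lempitsky, 2015).
    """

    def __init__(self, factor):
        self.factor = factor

    def forward(self, h):
        # Identity transform: representations pass through unchanged.
        return h

    def backward(self, grad_output):
        # Scale the incoming gradient before it reaches the encoder.
        return [self.factor * g for g in grad_output]
```

Placing this layer between the encoder output and the probe lets a single backward pass weight (or reverse) the probe-loss gradient flowing into the encoder, without affecting the forward computation.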
Suppose S is the solution set obtained by minimizing Eq. (8) over all coefficients in Λ. In the second step, to get more accurate solutions, we filter out the non-Pareto-optimal points of S. Finally, we obtain the Pareto frontier of the multi-objective problem according to the definition of Pareto optimality.
Detailed Algorithm
The overall optimization algorithm for Eq. (6) is shown in Algorithm 1. In principle, when minimizing Eq. (8), at every step that updates θ we should retrain the probe model for many steps to minimize L_probe, in order to estimate I(h; z) precisely. However, this is time-consuming and inefficient. Instead, after updating θ, we update φ by only one step (see line 7 of Algorithm 1). Empirically, we find that optimizing in this way is very effective.
In addition, as mentioned by Elazar and Goldberg (2018), information leakage may occur when minimizing the mutual information. Therefore, after the training process is finished, we fix the deep model and retrain another probe model to estimate I(h; z) more precisely (line 9 in Algorithm 1). When maximizing the mutual information, we find no difference between the estimates given by jointly trained and retrained probe models.
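The alternating update scheme can be illustrated on a toy problem. This is our own sketch with made-up scalar surrogates for L_mle and L_probe, not the paper's models; it only shows the update pattern of Algorithm 1.

```python
def pareto_point(lam, lr=0.1, steps=300):
    """Toy version of the alternating updates in Algorithm 1.

    Scalar surrogates (hypothetical, for illustration only):
      task loss   L_task(theta)       = (theta - 2)^2
      probe loss  L_probe(theta, phi) = (phi - theta)^2
    For each update of theta on the interpolated loss of Eq. (8),
    phi is updated by a single gradient step.
    """
    theta, phi = 0.0, 5.0
    for _ in range(steps):
        # One gradient step on theta for L_task + lam * L_probe.
        grad_theta = 2 * (theta - 2) + lam * 2 * (theta - phi)
        theta -= lr * grad_theta
        # A single probe step on phi (cf. line 7 of Algorithm 1).
        grad_phi = 2 * (phi - theta)
        phi -= lr * grad_phi
    return theta, phi
```

Sweeping `lam` over a coefficient set yields one candidate solution per coefficient, which are then filtered down to the Pareto frontier.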
5 Experimental Settings
5.1 Dataset
We conduct experiments on both machine translation and language modeling tasks. For machine translation, we conduct experiments on En→De and Zh→En translation tasks. For the En→De task, we use the WMT14 corpus, which contains 4M sentence pairs. For the Zh→En task, we use the LDC corpus, which consists of 1.25M sentence pairs; we choose NIST02 as our validation set and NIST06 as our test set. For the language modeling task, we use the Penn Treebank dataset (https://deepai.org/dataset/penn-treebank). We preprocess our data using byte-pair encoding (Sennrich et al., 2016) and keep all tokens in the vocabulary. For machine translation, we use the case-insensitive 4-gram BLEU score (Papineni et al., 2002) to measure task performance, which has been shown to correlate well with the MLE loss (Lee et al., 2020). For language modeling, we directly use the MLE loss to evaluate task performance.
5.2 Linguistic Properties
For machine translation, we study part-of-speech (POS) tags and dependency labels in this work. Since there are no gold labels for the MT datasets, we use the Stanza toolkit (https://github.com/stanfordnlp/stanza; Qi et al., 2020) to annotate source sentences and use the pseudo labels for running our algorithm, following Sennrich and Haddow (2016) and Li et al. (2018). We clean the labels and remove from the dataset the sentences that Stanza fails to parse. To study whether all kinds of linguistic information are critical for neural models, we also investigate phonetic information on the language modeling task. More precisely, the probing model needs to predict the first character of the International Phonetic Alphabet transcription of each word. For example, given the input sentence "This dog is so cute", the probing model is asked to predict "ð d I s k".
We obtain the labels with the open-source toolkit English-to-IPA (https://github.com/mphilli/English-to-IPA). We use mutual information to evaluate the amount of information in the representations. Since H(z) is a constant, we only compare L_probe in the experiments. Note that I(h; z) is estimated by our probe model φ.

5.3 Implementation Details
All of our models are implemented with Fairseq (https://github.com/pytorch/fairseq; Ott et al., 2019). For the NMT experiments, our LSTM model consists of a bidirectional 2-layer encoder with 256 hidden units and a 2-layer decoder with 512 hidden units, and the probe model is a 2-layer MLP with 512 hidden units. Our Transformer model consists of a 6-layer encoder and a 6-layer decoder, with the same hyperparameters as the base model of Vaswani et al. (2017), and the probe model is a 6-layer Transformer encoder. For the LM experiments, our model is a 2-layer LSTM with 256 hidden units, and the probe model is a 2-layer MLP with 256 hidden units. More training details are given in Appendix A.
6 Experiment Results
In the following experiments, "Model + Property", e.g., "Transformer+POS", corresponds to Eq. (4) and studies how adding the linguistic property's information affects task performance. Conversely, "Model - Property", e.g., "Transformer-POS", corresponds to Eq. (5) and studies how removing the linguistic property's information affects task performance. It is worth noting that merging the two frontiers of + Property and - Property would lead to trivial results, because the Pareto-optimal points of + Property are more likely to dominate. However, the frontier of - Property is helpful for answering the question of whether reducing the encoded linguistic information affects model performance. Therefore, we plot the Pareto frontiers for the two objectives independently.
6.1 Soundness of Methodology
The heuristic method mentioned before can be considered a simple and straightforward baseline for measuring the relationship. To set up this baseline, we first save checkpoints every 1,000 steps when training a standard model. Second, we randomly sample 30 checkpoints for probing and plot a scatter diagram in terms of BLEU and encoded linguistic information.
As shown in Figure 4, we compare our proposed method with the heuristic method in the setting of "Transformer+POS". Compared with the baseline method, the frontier obtained by our method is better: for each model explored by the baseline, there exists at least one model explored by our method that is larger in both objectives, i.e., encoded linguistic information and BLEU score. The main reason is that the baseline objective only considers task performance, and most checkpoints contain similar amounts of encoded linguistic information. Therefore, the models optimized by our multi-objective method are closer to the globally Pareto-optimal points (it is worth mentioning that no algorithm can guarantee globally Pareto-optimal solutions in our scenario on the training data; although the globally Pareto-optimal solutions are unknown, our solutions are certainly closer to them than the baseline's), making the revealed relationship between encoded linguistic information and task performance more reliable. Therefore, in the next subsection, our proposed method is used to visualize the relationship between encoded linguistic information and task performance for neural models.
6.2 Visualization Results
Results on NMT
The results of machine translation on the WMT dataset are shown in Figure 3. For LSTM-based NMT, we observe that the standard model (the highlighted point in Figure 3) is not on the Pareto frontier in Figure 3 (a,c). In other words, when adding linguistic information to the LSTM model, it is possible to obtain a model that contains more POS or DEP information and meanwhile achieves a better BLEU score than the model obtained by standard training. In contrast, for Transformer-based NMT, the standard model is on the Pareto frontier, as shown in Figure 3 (e,g). This finding offers an explanation for a pattern in NMT research: many efforts (Luong et al., 2016; Nădejde et al., 2017; Bastings et al., 2017; Hashimoto and Tsuruoka, 2017; Eriguchi et al., 2017) have been devoted to improving LSTM-based NMT architectures by explicitly modeling linguistic properties, but comparatively little has been done for Transformer-based NMT (McDonald and Chiang, 2021; Currey and Heafield, 2019). In addition, when removing the linguistic information from the LSTM or the Transformer, the standard model is very close to the lower-right end of the Pareto frontier, or even on the frontier, as shown in Figure 3 (b,d,f,h). This result shows that removing linguistic information always hurts the performance of NMT models for both the LSTM and the Transformer, indicating that encoding POS and DEP information is important for the NMT task. Similar trends are observed on the LDC datasets, as shown in Figure 5. More details about the effect of randomness on our approach are given in Appendix B.
Results on LM
The above experiments have shown that both kinds of syntactic information are important for NMT models; a natural follow-up question is whether all kinds of linguistic information are important for neural models. To answer this question, we investigate the influence of phonetic information on a language model. Figure 6 depicts the relationship between encoded phonetic information and task performance for an LSTM-based language model. In Figure 6 (a), we find that the performance of Pareto-optimal models drops slightly when forcing an LSTM model to encode more phonetic information. Besides, as the Pareto frontier in Figure 6 (b) shows, removing phonetic information from an LSTM model only leads to a slight change in performance. These experiments demonstrate that the encoded phonetic information may not be that critical for an LSTM-based language model. This finding suggests that not all kinds of linguistic information are crucial for LSTM-based LM, and that it is not promising to further improve language modeling with phonetic information.
7 Conclusion
This paper studies the relationship between linguistic information and task performance and proposes a new viewpoint inspired by the criterion of Pareto Optimality. We formulate this goal as a multi-objective problem and present an effective method to address it by leveraging the theory of Pareto optimality. We conduct experiments on both MT and LM tasks and study their performance with respect to different sources of linguistic information. Experimental results show that the presented approach is more plausible than a baseline method, in the sense that it explores better models in terms of both encoded linguistic information and task performance. In addition, we obtain some valuable findings: i) syntactic information encoded by NMT models is important for the MT task, and reducing it leads to sharply decreased performance; ii) the standard NMT model obtained by minimizing the MLE loss is Pareto-optimal for the Transformer but not for LSTM-based NMT; iii) reducing the phonetic information encoded by LM models only leads to a slight performance drop.
Acknowledgement
We would like to thank the anonymous reviewers for their constructive comments. L. Liu is the corresponding author.
References
Adi et al. (2017). Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, April 24-26, 2017.

Alt et al. (2020). Probing linguistic features of sentence-level representations in neural relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1534-1545.

Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, May 7-9, 2015.

Bastings et al. (2017). Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1957-1967.

Beckman et al. (2002). Envy, malice and Pareto efficiency: an experimental examination. Social Choice and Welfare 19(2), pp. 349-367.

Cao et al. (2021). Low-complexity probing via finding subnetworks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 960-966.

Chen et al. (2018). The best of both worlds: combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 76-86.

Chinchuluun et al. (2008). Pareto Optimality, Game Theory and Equilibria. Springer.

Conneau et al. (2018). What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 2126-2136.

Currey and Heafield (2019). Incorporating source syntax into transformer-based neural machine translation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy, pp. 24-33.
Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, pp. 4171-4186.

Ding et al. (2017). Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1150-1159.

Dong et al. (2015). Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1723-1732.

Duh et al. (2012). Learning to translate with multiple objectives. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 1-10.

Elazar and Goldberg (2018). Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 11-21.

Elazar et al. (2020). When BERT forgets how to POS: amnesic probing of linguistic properties and MLM predictions. arXiv preprint arXiv:2006.00995.
 Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 72–78. External Links: Document Cited by: §6.2.

Unsupervised domain adaptation by backpropagation
. InProceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 611 July 2015
, F. R. Bach and D. M. Blei (Eds.), JMLR Workshop and Conference Proceedings, Vol. 37, pp. 1180–1189. Cited by: §2, §4.2.  Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 813 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. Cited by: §2.
Externalities in economies with imperfect information and incomplete markets. The Quarterly Journal of Economics 101 (2), pp. 229–264. Cited by: §1.
Neural machine translation with source-side latent graph parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 125–135. Cited by: §6.2.
Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2733–2743. Cited by: §2.
Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research 61, pp. 907–926. Cited by: §2.
On the discrepancy between density estimation and sequence generation. In Proceedings of the Fourth Workshop on Structured Prediction for NLP, Online, pp. 84–94. Cited by: footnote 1.
Evaluating explanation methods for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 365–375. Cited by: §2.
On the word alignment from neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1293–1303. Cited by: §2.
Target foresight based attention for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, pp. 1380–1390. Cited by: §5.2.
Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 1647–1657. Cited by: §2.
Multi-task sequence to sequence learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.). Cited by: §6.2.
Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. Cited by: §2.
Minimax Pareto fairness: a multi-objective perspective. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 6755–6764. Cited by: §2, §4.2.
Microeconomic theory. Vol. 1, Oxford University Press, New York. Cited by: §2, §3.1, §3.2.
Syntax-based attention masking for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Online, pp. 47–52. Cited by: §6.2.
Regularizing and optimizing LSTM language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. Cited by: §1.
Predicting target language CCG supertags improves neural machine translation. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, pp. 68–79. Cited by: §6.2.
Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 48–53. Cited by: §5.3.
Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. Cited by: §5.1.
Pareto probing: trading off accuracy for complexity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, pp. 3138–3153. Cited by: §2.
Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4609–4622. Cited by: §1, §2, §3.1.
Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, pp. 101–108. Cited by: §5.2.
Null it out: guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7237–7256. Cited by: §2.
Probing the probing paradigm: does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Online, pp. 3363–3377. Cited by: §2.
Probing neural dialog models for conversational understanding. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Online, pp. 132–143. Cited by: §1.
Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, pp. 1715–1725. Cited by: §5.1.
Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation, Berlin, Germany, pp. 83–91. Cited by: §5.2.
Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112. Cited by: §2.
Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp. 2962–2971. Cited by: §2.
An overview of statistical learning theory. IEEE Transactions on Neural Networks 10 (5), pp. 988–999. Cited by: §1, §4.1.
Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §1, §2, §3.1, §5.3.
Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 183–196. Cited by: §1, §2.
Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 585–596. Cited by: §2.
LayoutLM: pre-training of text and layout for document image understanding. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23–27, 2020, R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (Eds.), pp. 1192–1200. Cited by: §2.
Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. Cited by: §1, §3.1.
Appendix A Training Details
On the WMT14 corpus, training one LSTM model with 4 V100 GPUs takes 5 hours, and training one Transformer with 8 V100 GPUs takes 8 hours. On the LDC corpus, training one LSTM model with 4 V100 GPUs takes 3 hours, and training one Transformer with 8 V100 GPUs takes 3 hours. On the PTB dataset, training one LSTM model with 1 V100 GPU takes 6 minutes.
When running our algorithm, we empirically observe that when the trade-off coefficient in Eq. 8 is below 0.01, the optimized models show little difference compared with the standard model, and when it is larger than 0.1, the proposed algorithm becomes unstable and fails to converge well to Pareto-optimal solutions. Therefore, we take ten values from 0.1 to 0.01 at equal intervals and train ten models, one per coefficient value, for each condition. We then plot all the models and the Pareto frontier of these models in the experiments.
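The sweep-then-filter procedure above can be sketched as follows. This is a minimal illustration, not the paper's code: `pareto_frontier` is a hypothetical helper, and each trained model is represented only by its (BLEU, H(POS|h)) pair, where higher BLEU and lower conditional entropy are both desirable.

```python
def pareto_frontier(models):
    """Return the models that are not dominated by any other model.

    Each model is a (bleu, entropy) pair: a model dominates another if it
    is at least as good on both objectives (higher BLEU, lower H(POS|h))
    and strictly better on at least one.
    """
    frontier = []
    for i, (b_i, h_i) in enumerate(models):
        dominated = any(
            b_j >= b_i and h_j <= h_i and (b_j > b_i or h_j < h_i)
            for j, (b_j, h_j) in enumerate(models)
            if j != i
        )
        if not dominated:
            frontier.append((b_i, h_i))
    return frontier

# The coefficient sweep described above: ten equally spaced values
# from 0.1 down to 0.01, one trained model per value.
coefficients = [round(0.1 - 0.01 * i, 2) for i in range(10)]
```

For example, with three models `[(21.0, 0.11), (21.5, 0.12), (20.0, 0.13)]`, the third is dominated by the first (lower BLEU and higher entropy), so only the first two lie on the frontier.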
Appendix B Effects of Randomness
Table 1: Mean and variance of BLEU and H(POS|h) over a window of 3 checkpoints, for 4 randomly selected LSTM + POS models.

BLEU               H(POS|h)
mean     var       mean     var
21.08    0.00407   0.1113   0
21.32    0.01536   0.1093   0
21.49    0.01847   0.108    0
21.52    0.00060   0.1123   0
Following the method of Chen et al. (2018), we check whether randomness affects our experimental results. Specifically, we select a window of size 3 around the best checkpoint and report the mean and variance of each metric over this window. The results are shown in Table 1. Because repeating the experiments under all settings would be too costly, we randomly select 4 models from the LSTM + POS setting. As shown in the table, all the variances are small, and the variances of the entropy are even 0. This suggests that the random disturbance in our experiments is small and thus our results are reliable.
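This robustness check can be sketched in a few lines. The function name and the choice of population variance are our assumptions for illustration, not the authors' implementation; the input is simply a list of per-checkpoint metric values (e.g., BLEU).

```python
from statistics import mean, pvariance

def window_stats(checkpoint_scores, best_idx, size=3):
    """Mean and population variance of a metric over a window of `size`
    checkpoints centered on the best checkpoint."""
    half = size // 2
    lo = max(0, best_idx - half)  # clamp so the window stays in range
    window = checkpoint_scores[lo:lo + size]
    return mean(window), pvariance(window)
```

A near-zero variance over the window, as reported in Table 1, indicates that neighboring checkpoints behave almost identically, i.e., the result is not an artifact of picking one lucky checkpoint.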