Visualizing the Relationship Between Encoded Linguistic Information and Task Performance

03/29/2022
by   Jiannan Xiang, et al.
Tencent
Carnegie Mellon University
USTC

Probing is a popular method for analyzing whether linguistic information is captured by a well-trained deep neural model, but it is hard to answer how changes in the encoded linguistic information affect task performance. To this end, we study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality. The key idea is to obtain a set of models that are Pareto-optimal in terms of both objectives. From this viewpoint, we propose a method to optimize the Pareto-optimal models by formalizing the problem as multi-objective optimization. We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performance. Experimental results demonstrate that the proposed method is better than a baseline method. Our empirical findings suggest that some syntactic information is helpful for NLP tasks, whereas encoding more syntactic information does not necessarily lead to better performance, because the model architecture is also an important factor.


1 Introduction

Recent years have witnessed the great success of deep neural networks on natural language processing tasks, such as language modeling (Zaremba et al., 2014; Merity et al., 2018) and neural machine translation (Bahdanau et al., 2015; Vaswani et al., 2017). The excellent task performance they achieve has sparked interest in interpreting their underlying mechanisms. Since linguistic knowledge is crucial in natural languages, an emerging body of literature uses probes (Conneau et al., 2018; Alt et al., 2020; Saleh et al., 2020; Cao et al., 2021) to investigate whether a standard model trained towards better task performance also captures linguistic information. From the perspective of information theory, Voita and Titov (2020) and Pimentel et al. (2020b) show that probes can be used to estimate the amount of linguistic information captured by a fixed model.

However, the above probing only extracts linguistic information from a fixed standard model, which helps little in understanding the relationship between task performance and the linguistic information encoded by the model. For example, under this methodology, it is difficult to answer the following two questions: first, would adding linguistic information be beneficial for an NLP model; second, is it harmful when this linguistic information is reduced? Therefore, it remains an open and intriguing question how task performance changes with respect to different amounts of linguistic information.

Figure 1: Illustration of the Pareto frontier with a toy example. The triangle corresponds to the standard checkpoint with the best performance and each circle corresponds to a sampled checkpoint. The y-axis indicates the linguistic information encoded by the model, and the x-axis indicates the negative loss value.

To this end, this paper proposes a novel viewpoint for studying the relationship between task performance and the amount of linguistic information, inspired by the criterion of Pareto Optimality, which is widely used in economics (Greenwald and Stiglitz, 1986). Our main idea is to obtain Pareto-optimal models on a test set in terms of both linguistic information and task performance, and then visualize their relationship along these optimal models. By comparing a standard model with these optimal models, it becomes clear whether adding the encoded information helps improve task performance over the standard model, as illustrated in Figure 1, where the points on the line are Pareto-optimal and the red triangle denotes the standard model with the best performance.

Nevertheless, it is typically intractable to obtain the Pareto-optimal models according to both dimensions on test data. To address this challenge, we propose a principled method to approximately optimize the Pareto-optimal models on the training data, which can be expected to generalize well on test sets according to statistical learning theory (Vapnik, 1999). Formally, the approach can be regarded as a multi-objective optimization problem: during the learning procedure, it optimizes two objectives, i.e., task performance and the amount of extracted linguistic information. In addition, we develop a computationally efficient algorithm to solve the optimization problem. By inspecting the trend of the Pareto-optimal points, the relationship between task performance and linguistic information can be clearly illustrated. Returning to our questions, we consider two instances of the proposed methodology: one aims to maximize the amount of linguistic information (i.e., adding) while the other tries to minimize it (i.e., reducing).

We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and choose three linguistic properties: two syntactic properties (part-of-speech and dependency labels) and one phonetic property. We investigate the relationship between NMT performance and each kind of syntactic information, and the relationship between LM performance and phonetic information. For machine translation, we use an LSTM model, i.e., RNN-search (Bahdanau et al., 2015), and the Transformer (Vaswani et al., 2017) as the main model architectures, and conduct our experiments on En→De and Zh→En tasks. For language modeling, we employ the LSTM model and conduct experiments on the Penn Treebank dataset. The experimental results show that: i) syntactic information encoded by NMT models is important for the MT task, and reducing it leads to sharply decreased performance; ii) the standard NMT model obtained by maximum likelihood estimation (MLE) is Pareto-optimal for the Transformer, but this is not the case for LSTM-based NMT; iii) reducing the phonetic information encoded by LM models only makes task performance drop slightly.

In summary, our contributions are three-fold:

  1. We make an initial attempt to study the relationship between encoded linguistic information and task performance, i.e., how changes in linguistic information affect the performance of models.

  2. We propose a new viewpoint based on Pareto Optimality, as well as a principled approach formulated as a multi-objective optimization problem, to visualize the relationship.

  3. Our experimental results show that encoding more linguistic information does not necessarily yield better task performance; the effect depends on the specific model architecture.

2 Related Work

Probe

With the impressive performance of neural network models on NLP tasks (Sutskever et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Devlin et al., 2019; Xu et al., 2020), people have become interested in understanding neural models (Ding et al., 2017; Li et al., 2019, 2020). One popular interpretation method is the probe (Conneau et al., 2018), also known as auxiliary prediction (Adi et al., 2017) and diagnostic classification (Hupkes et al., 2018), which aims to understand how neural models work and what information they have encoded and used. From the perspective of information theory, Voita and Titov (2020) and Pimentel et al. (2020b) show that probes can be used to estimate the amount of linguistic information captured by a model. However, recent studies point out that probes fail to demonstrate whether the information is actually used by models. For example, Hewitt and Liang (2019) show that a probe can also achieve high accuracy in predicting randomly generated tags, which are useless for the task. And Ravichander et al. (2021) show that representations encode linguistic properties even when those properties are not required for the task. Instead of studying the encoded linguistic information by training a probe on fixed representations, in this work we study how changes in the amount of linguistic information affect performance on NLP tasks.

Information Removal

Information removal is crucial in the areas of transfer learning (Ganin and Lempitsky, 2015; Tzeng et al., 2017; Long et al., 2018) and fairness learning (Xie et al., 2017; Elazar and Goldberg, 2018), where one wants to remove domain information or bias from learned representations. One popular method is adversarial learning (Goodfellow et al., 2014; Ganin and Lempitsky, 2015), which trains a classifier to predict properties of the representations, e.g., domain information or gender bias, while the feature extractor tries to fool the classifier. In this work, when using our method to reduce the linguistic information in the representations, we find that our multi-objective loss function has the same form as the adversarial learning objective, which provides a theoretical justification for using adversarial learning to find Pareto-optimal solutions to a multi-objective problem.

Recently, Elazar et al. (2020) also proposed to study the role of linguistic properties with the idea of information removal (Ravfogel et al., 2020). However, the representations obtained by their method may not be Pareto-optimal, because the method only minimizes the mutual information and ignores the objective of task performance. In contrast, our proposed method optimizes towards both objectives, so our results can be used to visualize the relationship between linguistic properties and task performance.

Pareto Optimality

The idea of Pareto Optimality (Mas-Colell et al., 1995) is an important criterion in economics, where the goal is to characterize situations in which no variable can be made better off without making at least one other variable worse off. It has also been widely used in sociology and game theory (Beckman et al., 2002; Chinchuluun et al., 2008). In artificial intelligence, Martínez et al. (2020) use Pareto optimality to address the group fairness problem, and Duh et al. (2012) propose to optimize an MT system on multiple metrics based on the theory of Pareto optimality. In particular, Pimentel et al. (2020a) propose a variant of probing on the hidden representations of deep models and, similar to our work, consider Pareto optimality in terms of two objectives. Compared with their work, one difference is the choice of objectives. Another significant difference is that they optimize the probing model in a conventional fashion, and are thus unable to study the relationship between linguistic information and task performance.

3 Visualizing Relationship via Pareto Optimality

We consider the relationship between linguistic information and task performance for two popular tasks in NLP, i.e., machine translation and language modeling. Let x = (x_1, …, x_n) be a sentence and t = (t_1, …, t_n) be the labels of the linguistic property of x, where t_i is the label for x_i, e.g., its POS tag. On both tasks, a deep model typically encodes x into a hidden representation r with a sub-network f parameterized by θ1, i.e., r = f(x; θ1), and then uses another sub-network g parameterized by θ2 to map r into an output.

3.1 Background

Model and Loss in NMT

An NMT architecture aims to output a target sentence y for a given source sentence x according to p(y | x; θ) (Zaremba et al., 2014; Vaswani et al., 2017), where θ = (θ1, θ2) denotes the parameters of a sequence-to-sequence neural network, which contains an encoder f and a decoder g. We define r = f(x; θ1) as the output of the encoder. To train θ, the MLE loss is usually minimized on a training dataset D. For NMT, the loss is defined as follows:

    L_mle(θ) = − ∑_{(x, y) ∈ D} log p(y | x; θ).    (1)

In our experiments, we consider two models, namely the LSTM (Bahdanau et al., 2015) and the Transformer (Vaswani et al., 2017).

Model and Loss in LM

For the language modeling task, a deep model typically generates a token y_i based on the history y_{<i} according to p(y_i | y_{<i}; θ). Here the sub-network f is set as one hidden layer that encodes y_{<i} into r_i, and g is set as the sub-network that generates y_i on top of r_i. The parameter θ is optimized by the following MLE loss:

    L_mle(θ) = − ∑_{y ∈ D} ∑_i log p(y_i | y_{<i}; θ).

To make notations consistent for both NMT and LM, in the rest of this paper we follow the form of Eq. (1) and re-write the loss in LM as L_mle(θ) = − ∑_{(x, y) ∈ D} log p(y | x; θ), where x is a shifted version of y, i.e., x_i = y_{i−1}.

Encoded Information

Let I(T; R) denote the linguistic information in the representation r, i.e., the mutual information between the representation R and the linguistic labels T. Since the probability p(t | r) is unknown, it is intractable to compute I(T; R) exactly. Following Pimentel et al. (2020b), we approximately estimate I(T; R) by using a probing model q_φ as follows:

    I(T; R) = H(T) − H(T | R) ≈ H(T) − L_probe(φ),    (2)

where H(T) is the entropy of the linguistic labels, H(T | R) is the ideal cross entropy, and L_probe(φ) = − E[log q_φ(t | r)] is the cross-entropy loss of the probe model parameterized by φ.
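Once a probe's predictive distributions are available, the estimate in Eq. (2) reduces to simple counting and averaging. Below is a minimal plain-Python sketch of this computation; the function and variable names (`probe_probs`, `gold_labels`) are illustrative and not taken from the paper's implementation.

```python
import math

def entropy(label_counts):
    """Empirical entropy H(T) (in nats) of the linguistic labels."""
    total = sum(label_counts.values())
    return -sum((c / total) * math.log(c / total) for c in label_counts.values())

def probe_cross_entropy(probe_probs, gold_labels):
    """Average cross-entropy -1/N * sum log q_phi(t|r) over the probe's predictions."""
    return -sum(math.log(p[t]) for p, t in zip(probe_probs, gold_labels)) / len(gold_labels)

def estimated_information(label_counts, probe_probs, gold_labels):
    """I(T;R) ~= H(T) - L_probe, as in Eq. (2); clipped at 0 since MI is non-negative."""
    return max(0.0, entropy(label_counts) - probe_cross_entropy(probe_probs, gold_labels))
```

A better probe (lower cross-entropy) yields a higher information estimate, which is exactly why the probe must be (re)trained to convergence before I(T; R) is read off.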

Theory of Pareto Optimality

Pareto optimality (Mas-Colell et al., 1995) is essentially involved in multi-objective optimization. Suppose that we have n different objectives f_1, …, f_n to evaluate a parameter θ, i.e.,

    min_θ ( f_1(θ), f_2(θ), …, f_n(θ) ).    (3)

There are two important concepts in Pareto optimality:
Definition 1 (Pareto Optimal). A parameter θ* is Pareto-optimal iff there exists no θ such that f_i(θ) ≤ f_i(θ*) for all i and f_j(θ) < f_j(θ*) for some j.
Definition 2 (Pareto Frontier). The set of all Pareto-optimal parameters is called the Pareto frontier.
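Definitions 1 and 2 translate directly into a dominance check over a finite set of candidate solutions. The following is a minimal sketch, using the minimization convention of Eq. (3) (lower is better on every objective); the names are illustrative.

```python
def dominates(a, b):
    """a dominates b iff a is no worse on every objective and strictly
    better on at least one (minimization convention)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_frontier(points):
    """Keep exactly the points not dominated by any other point (Definition 2)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

This quadratic-time filter is sufficient for the small candidate sets produced by a coefficient sweep; specialized algorithms exist for large point sets.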

3.2 Viewpoint via Pareto Optimality

Motivation

Suppose θ is a given model parameter, P(θ) is its task performance on a test set, and I(θ) is the amount of linguistic information encoded in its hidden representation. Conventionally, if one could find a function h such that I(θ) = h(P(θ)) for any θ, it would be trivial to study their relationship by visualizing h. Unfortunately, in complicated situations such as the one illustrated in Figure 1, no such function exists to represent the relationship between the two variables, due to a large number of many-to-many correspondences.

Our Viewpoint

Pareto Optimality, a well-known criterion in economics (Mas-Colell et al., 1995), is widely used to analyze the relationship among multiple variables in a complicated environment (Chinchuluun et al., 2008). In our context, it is also a powerful tool to reveal the relationship between encoded linguistic information and task performance. Taking the Pareto frontier in Figure 1 as an example: since the capacity of a model is fixed and linguistic information may compete with other kinds of information, capturing more linguistic information may reduce the amount of information from other sources that are also helpful for the model. Nevertheless, if increasing the amount of linguistic information consistently led to performance gains, i.e., if linguistic information were complementary to translation, only one Pareto-optimal point would exist, at the top right corner.

Therefore, in this paper, we propose to study the relationship between I and P from the viewpoint of Pareto Optimality. Our key idea is to take into account only Pareto-optimal models rather than all models, as the conventional method does. Thanks to the definition of Pareto optimality, there are no many-to-many correspondences between the two variables along the Pareto frontier, so their relationship can be visualized by the trend of the frontier points, as shown in Figure 1. Taking Figure 1 as an example, to answer the questions raised above, we can see that adding more information can increase task performance compared with a standard model. Under this viewpoint, the core challenge is how to obtain a set of models that are Pareto-optimal on a test dataset.

It is natural to employ a heuristic method to approximately obtain the Pareto-optimal models, as follows. First, we can randomly select a number of checkpoints during standard training and probe each checkpoint by optimizing its corresponding probing model, as shown in Eq. (2). Second, we can record the task performance and the amount of linguistic information of each selected model on a test set. Finally, we can find the Pareto-optimal points and obtain the Pareto frontier. However, when using this method in our experiments, we find that the amounts of encoded linguistic information are similar for all checkpoints and that the task performance of those checkpoints is worse than that of the optimal model. Hence, in the next section, a new method is presented to approximately derive the Pareto-optimal models.

4 Methodology

4.1 Multi-Objective Optimization

To study the relationship between linguistic information and task performance, our goal is to obtain a set of models that are Pareto-optimal on test data in terms of both objectives. Inspired by statistical learning theory (Vapnik, 1999), we propose an approach that optimizes the Pareto-optimal models towards both objectives on a given training dataset; these models can be expected to generalize well, i.e., to be Pareto-optimal on unseen test data. Formally, our approach can be formulated as the following multi-objective optimization problem:

    min_θ ( L_mle(θ), − I(T; R) ),    (4)

where minimizing L_mle(θ) aims to promote task performance and maximizing I(T; R) encourages the model to encode more linguistic information in the representation. Once we obtain a set of Pareto-optimal models, we can observe how increasing the encoded linguistic information affects task performance.

To further study how reducing the encoded linguistic information affects task performance, we optimize a similar multi-objective problem:

    min_θ ( L_mle(θ), I(T; R) ).    (5)

The only difference between Eq. (4) and Eq. (5) is that the former maximizes I(T; R) while the latter minimizes it.

Since H(T) is a constant term, we can plug Eq. (2) into the above two equations and obtain the following reduced multi-objective problems:

    min_{θ, φ} ( L_mle(θ), L_probe(θ1, φ) ),     (6)
    min_{θ, φ} ( L_mle(θ), − L_probe(θ1, φ) ).   (7)

Notice that in the above equations, minimizing L_probe(θ1, φ) over φ resembles conventional probing if r is a fixed representation. However, unlike standard probing, which is applied on top of a fixed r determined by the standard model, here r is the representation obtained from an encoder parameterized by θ1, which is itself being optimized. It is also worth noting that the Pareto frontiers obtained from Eq. (6) and Eq. (7) are independent, although they share a similar measurement, because Pareto optimality is only defined with respect to a fixed set of objectives.

4.2 Optimization Algorithm

To solve the above multi-objective problems, we leverage the linear-combination method to find a set of solutions, and then filter out the non-Pareto-optimal points to get the Pareto frontier. The details of our algorithm are given below.

Optimization Process

Since the optimization method for Eq. (6) is similar to that for Eq. (7), in the following we take Eq. (6) as an example to describe it. Inspired by Duh et al. (2012), we employ a two-step strategy to find the Pareto frontier of the multi-objective problem.

Figure 2: Overview of our multi-objective optimization method. In the back propagation, the GM Layer multiplies the gradient by μλ, i.e., by λ for Eq. (6) and by −λ for Eq. (7).

In the first step, we adopt a method to find Pareto-optimal solutions to the problem. There are several methods for doing so, such as linear combination, PMO (Duh et al., 2012), and APStar (Martínez et al., 2020). In this work, we adopt the linear-combination method because of its simplicity. Specifically, we select a coefficient set Λ = {λ_1, …, λ_K} and minimize the following interpolated loss for each coefficient λ ∈ Λ:¹

    L(θ, φ; λ) = L_mle(θ) + μ λ L_probe(θ1, φ),    (8)

where μ = 1 for Eq. (6) and μ = −1 for Eq. (7).

¹ Eq. (8) is similar to the loss of standard multi-task learning (MTL) (Dong et al., 2015; Lee et al., 2020). However, the solutions to the MTL loss are weaker than our solutions according to Pareto optimality, and it could not remove linguistic information in our preliminary experiments.

Notice that the first term of the loss function depends on both the encoder parameters θ1 and the decoder parameters θ2, while the second term depends only on θ1 and φ. Therefore, when minimizing Eq. (8), we apply a Gradient-Multiple (GM) Layer on the representation before feeding it into the probe model. As shown in Fig. 2, in the forward propagation the GM Layer acts as an identity transform, while in the backward propagation it multiplies the gradient by μλ and passes it to the preceding layers. Note that when the multiplier is negative, the GM Layer is the same as the Gradient Reversal Layer (Ganin and Lempitsky, 2015).
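The GM Layer itself is a one-line idea: identity in the forward pass, scaled gradient in the backward pass. The sketch below is framework-agnostic pure Python with an explicit `backward` method to make the behavior concrete; in practice it would be implemented as a custom autograd operation (e.g., a `torch.autograd.Function`). The class and method names are illustrative.

```python
class GradientMultipleLayer:
    """Identity in the forward pass; scales the incoming gradient by
    `multiplier` in the backward pass. With multiplier = lambda this
    matches Eq. (6); with multiplier = -lambda it matches Eq. (7) and
    coincides with the Gradient Reversal Layer."""

    def __init__(self, multiplier):
        self.multiplier = multiplier

    def forward(self, representation):
        # Acts as an identity transform on the encoder output.
        return representation

    def backward(self, grad_from_probe):
        # Scale the probe's gradient before it reaches the encoder.
        return [self.multiplier * g for g in grad_from_probe]
```

With a negative multiplier, the probe still learns to predict the labels (its own parameters receive the unscaled gradient), while the encoder is pushed in the opposite direction, i.e., to discard the probed information.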

Suppose S is the set of solutions obtained by minimizing Eq. (8) over all λ ∈ Λ. In the second step, to get more accurate solutions, we filter out the non-Pareto-optimal points from S. Finally, according to the definition of Pareto optimality, we obtain the Pareto frontier of the multi-objective problem.

1:  Require: coefficient set Λ, learning rate η
2:  Ensure: Pareto frontier set F
3:  S ← ∅  ▷ empty model set
4:  for λ ∈ Λ do  ▷ minimize Eq. (8)
5:      Randomly initialize θ1, θ2, and φ
6:      while not converged do
7:          μ ← 1  ▷ μ = 1 is for Eq. (6); changing it to μ = −1 would optimize Eq. (7)
8:          φ ← φ − η ∇_φ L_probe(θ1, φ)
9:          θ ← θ − η ∇_θ ( L_mle(θ) + μ λ L_probe(θ1, φ) )
10:     end while
11:     Re-train a probe model φ′ on top of the fixed encoder θ1
12:     Add (θ, φ′) into S
13: end for
14: F ← ∅  ▷ Pareto frontier set
15: for all s ∈ S do
16:     if IsParetoOptimal(s, S) then
17:         Add s into F
18:     end if
19: end for
Algorithm 1 Optimization Algorithm
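To make the two-step strategy of Algorithm 1 concrete, the toy sketch below replays it on a pair of competing quadratic objectives standing in for L_mle and L_probe: each λ in the coefficient set yields one solution of the interpolated loss via gradient descent (step 1), and dominated points are then filtered out (step 2). The function name and the toy objectives are illustrative, not the paper's actual models.

```python
def toy_pareto_sweep(lambdas, steps=200, lr=0.05):
    """Two-step strategy of Algorithm 1 on a toy problem with two competing
    quadratic objectives f1(t) = t^2 ("task loss") and f2(t) = (t - 1)^2
    ("probe loss"): the joint minimizer of f1 + lam*f2 is t = lam/(1 + lam),
    so every lam traces out a different trade-off point."""
    f1 = lambda t: t * t
    f2 = lambda t: (t - 1.0) ** 2
    solutions = []
    for lam in lambdas:                            # step 1: linear combination
        t = 0.5                                    # fixed "random" init
        for _ in range(steps):
            grad = 2 * t + lam * 2 * (t - 1.0)     # d/dt [f1 + lam * f2]
            t -= lr * grad
        solutions.append((f1(t), f2(t)))
    # step 2: filter out dominated points (Definition 1, minimization)
    return [p for p in solutions
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in solutions)]
```

On this toy problem every λ produces a distinct Pareto-optimal point, so the sweep directly traces the frontier; with neural objectives, some λ settings may yield dominated solutions, which is exactly why the second filtering step is needed.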

Detailed Algorithm

The overall optimization algorithm for Eq. (6) is shown in Algorithm 1. Theoretically, when minimizing Eq. (8), at every step that updates θ we should retrain the probe model to minimize L_probe over φ for many steps, in order to estimate I(T; R) precisely. However, this is time-consuming and inefficient. Instead, after updating θ, we update φ by only one step (line 8 in Algorithm 1). Empirically, we find that optimization in this way is very effective.

In addition, as mentioned by Elazar and Goldberg (2018), information leakage may occur when minimizing the mutual information. Therefore, after the training process is finished, we fix the deep model and retrain another probe model to estimate L_probe more precisely (line 11 in Algorithm 1). When maximizing the mutual information, we find no difference between the estimates obtained from the jointly trained and the retrained probe models.

5 Experimental Settings

Figure 3: Experiments on the WMT14 corpus. The triangle denotes the model trained by minimizing the MLE loss, each circle denotes a model obtained by our method, and the models on the line form the Pareto frontier.

5.1 Dataset

We conduct experiments on both machine translation and language modeling tasks. For machine translation, we conduct experiments on En→De and Zh→En translation tasks. For the En→De task, we use the WMT14 corpus, which contains 4M sentence pairs. For the Zh→En task, we use the LDC corpus, which consists of 1.25M sentence pairs; we choose NIST02 as our validation set and NIST06 as our test set. For the language modeling task, we use the Penn Treebank dataset (https://deepai.org/dataset/penn-treebank). We preprocess our data using byte-pair encoding (Sennrich et al., 2016) and keep all tokens in the vocabulary. For machine translation, we use the case-insensitive 4-gram BLEU score (Papineni et al., 2002) to measure task performance, which has been shown to correlate well with the MLE loss (Lee et al., 2020). For language modeling, we directly use the MLE loss to evaluate task performance.

5.2 Linguistic Properties

For machine translation, we study part-of-speech (POS) tags and dependency labels in this work. Since there are no gold labels for the MT datasets, we use the Stanza toolkit (https://github.com/stanfordnlp/stanza) (Qi et al., 2020) to annotate source sentences and run our algorithm on the resulting pseudo labels, following Sennrich and Haddow (2016) and Li et al. (2018). We clean the labels and remove from the dataset the sentences that Stanza fails to parse. To study whether all kinds of linguistic information are critical for neural models, we also investigate phonetic information on the language modeling task. More precisely, the probing model needs to predict the first character of the International Phonetic Alphabet transcription of each word. (For example, given the input sentence "This dog is so cute", the probing model is asked to predict "ð d I s k".)

We obtain the labels with the open-source toolkit English-to-IPA (https://github.com/mphilli/English-to-IPA). We use the mutual information I(T; R) to evaluate the amount of information in the representations. Since H(T) is a constant, we only compare −L_probe(φ) in the experiments. Note that I(T; R) is estimated by our probe model q_φ.

5.3 Implementation Details

All of our models are implemented with Fairseq (https://github.com/pytorch/fairseq) (Ott et al., 2019). For the NMT experiments, our LSTM model consists of a bi-directional 2-layer encoder with 256 hidden units and a 2-layer decoder with 512 hidden units, and the probe model is a 2-layer MLP with 512 hidden units. Our Transformer model consists of a 6-layer encoder and a 6-layer decoder, with the same hyper-parameters as the base model in Vaswani et al. (2017), and the probe model is a 6-layer Transformer encoder. For the LM experiments, our model is a 2-layer LSTM with 256 hidden units, and the probe model is a 2-layer MLP with 256 hidden units. More training details are given in Appendix A.

Figure 4: Comparison with the baseline method. The triangle denotes the standard model obtained by minimizing the MLE loss. The green line and blue line are the frontiers obtained from the baseline method and our method, respectively.
Figure 5: Experimental results on the LDC corpus. The format is the same as in Fig. 3.

6 Experiment Results

In the following experiments, "Model + Property", e.g., "Transformer+Pos", corresponds to Eq. (4) and studies how adding the linguistic property's information affects task performance. Conversely, "Model − Property", e.g., "Transformer−Pos", corresponds to Eq. (5) and studies how removing the information affects task performance. It is worth noting that merging the two frontiers of + Property and − Property would lead to trivial results, because the Pareto-optimal points of + Property are more likely to dominate. However, we believe the frontier of − Property helps answer the question of whether reducing the encoded linguistic information affects model performance. Therefore, we plot the Pareto frontiers for the two objectives independently.

6.1 Soundness of Methodology

The heuristic method mentioned before can be considered a simple and straightforward baseline for measuring the relationship. To set up this baseline, we first save a checkpoint every 1,000 steps while training a standard model. Second, we randomly sample 30 checkpoints for probing and plot a scatter diagram in terms of BLEU and encoded linguistic information.

As shown in Figure 4, we compare our proposed method with the heuristic method in the "Transformer+Pos" setting. Compared with the baseline, the frontier obtained by our method is better: for each model explored by the baseline, there exists at least one model explored by our method that is better on both objectives, i.e., encoded linguistic information and BLEU score. The main reason is that the baseline's objective considers only task performance, and most checkpoints contain similar amounts of encoded linguistic information. Therefore, the models optimized by our multi-objective method are closer to the globally Pareto-optimal points (it is worth mentioning that no algorithm can guarantee globally Pareto-optimal solutions in our scenario on the training data; although the globally Pareto-optimal solutions are unknown, our solutions are certainly closer to them than those of the baseline), making the revealed relationship between encoded linguistic information and task performance more reliable. Therefore, in the next subsection, our proposed method is used to visualize the relationship between encoded linguistic information and task performance for neural models.

6.2 Visualization Results

Results on NMT

The machine translation results on the WMT dataset are shown in Figure 3. For LSTM-based NMT, we observe that the standard model, i.e., the triangle in Figure 3, is not on the Pareto frontier in Figure 3 (a,c). In other words, when adding linguistic information to the LSTM model, it is possible to obtain a model that contains more POS or DEP information and meanwhile achieves a better BLEU score than the standard model trained in the standard way. In contrast, for Transformer-based NMT, the standard model is on the Pareto frontier, as shown in Figure 3 (e,g). This finding provides an explanation for a fact in NMT: many efforts (Luong et al., 2016; Nădejde et al., 2017; Bastings et al., 2017; Hashimoto and Tsuruoka, 2017; Eriguchi et al., 2017) have been devoted to improving the LSTM-based NMT architecture by explicitly modeling linguistic properties, but little has been done for Transformer-based NMT (McDonald and Chiang, 2021; Currey and Heafield, 2019). In addition, when removing linguistic information from the LSTM or the Transformer, the standard model is very close to the lower right of the Pareto frontier, or even on the frontier, as shown in Figure 3 (b,d,f,h). This result shows that removing linguistic information always hurts the performance of NMT models, for both the LSTM and the Transformer, indicating that encoding POS and DEP information is important for the NMT task. Similar trends are observed on the LDC datasets, as shown in Figure 5. More details about the effect of randomness on our approach are given in Appendix B.

Figure 6: Experimental results on the PTB dataset.

Results on LM

The above experiments have shown that both kinds of syntactic information are important for NMT models; a natural follow-up question is whether all kinds of linguistic information are important for neural models. To answer this question, we investigate the influence of phonetic information on a language model. Figure 6 depicts the relationship between encoded phonetic information and task performance for an LSTM-based language model. In Figure 6 (a), we find that the performance of Pareto-optimal models drops slightly when forcing an LSTM model to encode more phonetic information. Besides, as the Pareto frontier in Figure 6 (b) shows, removing phonetic information from an LSTM model only leads to a slight change in performance. These experiments demonstrate that the encoded phonetic information may not be that critical for an LSTM-based language model. This finding suggests that not all kinds of linguistic information are crucial for LSTM-based LM, and that further improving language modeling with phonetic information is not promising.

7 Conclusion

This paper studies the relationship between linguistic information and task performance and proposes a new viewpoint inspired by the criterion of Pareto Optimality. We formulate this goal as a multi-objective problem and present an effective method to address it by leveraging the theory of Pareto optimality. We conduct experiments on both MT and LM tasks and study their performance with respect to different linguistic information sources. Experimental results show that the presented approach is more plausible than a baseline method in the sense that it explores better models in terms of both encoded linguistic information and task performance. In addition, we obtain the following findings: i) syntactic information encoded by NMT models is important for the MT task, and reducing it leads to sharply decreased performance; ii) the standard NMT model obtained by minimizing the MLE loss is Pareto-optimal for the Transformer, but this is not the case for LSTM-based NMT; iii) reducing the phonetic information encoded by LM models only leads to a slight performance drop.

Acknowledgement

We would like to thank the anonymous reviewers for their constructive comments. L. Liu is the corresponding author.

References

  • Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg (2017) Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §2.
  • C. Alt, A. Gabryszak, and L. Hennig (2020) Probing linguistic features of sentence-level representations in neural relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1534–1545. External Links: Document Cited by: §1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §1, §1, §3.1.
  • J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Sima’an (2017) Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1957–1967. External Links: Document Cited by: §6.2.
  • S. R. Beckman, J. P. Formby, W. J. Smith, and B. Zheng (2002) Envy, malice and pareto efficiency: an experimental examination. Social Choice and Welfare 19 (2), pp. 349–367. Cited by: §2.
  • S. Cao, V. Sanh, and A. Rush (2021) Low-complexity probing via finding subnetworks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 960–966. External Links: Document Cited by: §1.
  • M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, and M. Hughes (2018) The best of both worlds: combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 76–86. External Links: Link, Document Cited by: Appendix B.
  • A. Chinchuluun, P. M. Pardalos, A. Migdalas, and L. Pitsoulis (2008) Pareto optimality, game theory and equilibria. Springer. Cited by: §2, §3.2.
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single $&!#* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 2126–2136. External Links: Document Cited by: §1, §2.
  • A. Currey and K. Heafield (2019) Incorporating source syntax into transformer-based neural machine translation. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy, pp. 24–33. External Links: Document Cited by: §6.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, pp. 4171–4186. External Links: Document Cited by: §2.
  • Y. Ding, Y. Liu, H. Luan, and M. Sun (2017) Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1150–1159. External Links: Link, Document Cited by: §2.
  • D. Dong, H. Wu, W. He, D. Yu, and H. Wang (2015) Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1723–1732. External Links: Link, Document Cited by: footnote 1.
  • K. Duh, K. Sudoh, X. Wu, H. Tsukada, and M. Nagata (2012) Learning to translate with multiple objectives. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 1–10. External Links: Link Cited by: §2, §4.2, §4.2.
  • Y. Elazar and Y. Goldberg (2018) Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 11–21. External Links: Document Cited by: §2, §4.2.
  • Y. Elazar, S. Ravfogel, A. Jacovi, and Y. Goldberg (2020) When bert forgets how to pos: amnesic probing of linguistic properties and mlm predictions. arXiv preprint arXiv:2006.00995. Cited by: §2.
  • A. Eriguchi, Y. Tsuruoka, and K. Cho (2017) Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 72–78. External Links: Document Cited by: §6.2.
  • Y. Ganin and V. S. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, F. R. Bach and D. M. Blei (Eds.), JMLR Workshop and Conference Proceedings, Vol. 37, pp. 1180–1189. Cited by: §2, §4.2.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. Cited by: §2.
  • B. C. Greenwald and J. E. Stiglitz (1986) Externalities in economies with imperfect information and incomplete markets. The quarterly journal of economics 101 (2), pp. 229–264. Cited by: §1.
  • K. Hashimoto and Y. Tsuruoka (2017) Neural machine translation with source-side latent graph parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 125–135. External Links: Document Cited by: §6.2.
  • J. Hewitt and P. Liang (2019) Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2733–2743. External Links: Document Cited by: §2.
  • D. Hupkes, S. Veldhoen, and W. Zuidema (2018) Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research 61, pp. 907–926. Cited by: §2.
  • J. Lee, D. Tran, O. Firat, and K. Cho (2020) On the discrepancy between density estimation and sequence generation. In Proceedings of the Fourth Workshop on Structured Prediction for NLP, Online, pp. 84–94. External Links: Link Cited by: footnote 1.
  • J. Li, L. Liu, H. Li, G. Li, G. Huang, and S. Shi (2020) Evaluating explanation methods for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 365–375. External Links: Link, Document Cited by: §2.
  • X. Li, G. Li, L. Liu, M. Meng, and S. Shi (2019) On the word alignment from neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1293–1303. External Links: Link, Document Cited by: §2.
  • X. Li, L. Liu, Z. Tu, S. Shi, and M. Meng (2018) Target foresight based attention for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, pp. 1380–1390. External Links: Document Cited by: §5.2.
  • M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 1647–1657. Cited by: §2.
  • M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser (2016) Multi-task sequence to sequence learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §6.2.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. External Links: Document Cited by: §2.
  • N. Martínez, M. Bertrán, and G. Sapiro (2020) Minimax pareto fairness: A multi objective perspective. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 6755–6764. Cited by: §2, §4.2.
  • A. Mas-Colell, M. D. Whinston, J. R. Green, et al. (1995) Microeconomic theory. Vol. 1, Oxford university press New York. Cited by: §2, §3.1, §3.2.
  • C. McDonald and D. Chiang (2021) Syntax-based attention masking for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Online, pp. 47–52. External Links: Document Cited by: §6.2.
  • S. Merity, N. S. Keskar, and R. Socher (2018) Regularizing and optimizing LSTM language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §1.
  • M. Nădejde, S. Reddy, R. Sennrich, T. Dwojak, M. Junczys-Dowmunt, P. Koehn, and A. Birch (2017) Predicting target language CCG supertags improves neural machine translation. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, pp. 68–79. External Links: Document Cited by: §6.2.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 48–53. External Links: Document Cited by: §5.3.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Document Cited by: §5.1.
  • T. Pimentel, N. Saphra, A. Williams, and R. Cotterell (2020a) Pareto probing: Trading off accuracy for complexity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, pp. 3138–3153. External Links: Document Cited by: §2.
  • T. Pimentel, J. Valvoda, R. Hall Maudslay, R. Zmigrod, A. Williams, and R. Cotterell (2020b) Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4609–4622. External Links: Document Cited by: §1, §2, §3.1.
  • P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning (2020) Stanza: a python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, pp. 101–108. External Links: Document Cited by: §5.2.
  • S. Ravfogel, Y. Elazar, H. Gonen, M. Twiton, and Y. Goldberg (2020) Null it out: guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7237–7256. External Links: Document Cited by: §2.
  • A. Ravichander, Y. Belinkov, and E. Hovy (2021) Probing the probing paradigm: does probing accuracy entail task relevance?. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Online, pp. 3363–3377. External Links: Link Cited by: §2.
  • A. Saleh, T. Deutsch, S. Casper, Y. Belinkov, and S. Shieber (2020) Probing neural dialog models for conversational understanding. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Online, pp. 132–143. External Links: Document Cited by: §1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, pp. 1715–1725. External Links: Document Cited by: §5.1.
  • R. Sennrich and B. Haddow (2016) Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation, Berlin, Germany, pp. 83–91. External Links: Document Cited by: §5.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112. Cited by: §2.
  • E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2962–2971. External Links: Document Cited by: §2.
  • V. N. Vapnik (1999) An overview of statistical learning theory. IEEE transactions on neural networks 10 (5), pp. 988–999. Cited by: §1, §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §1, §1, §2, §3.1, §5.3.
  • E. Voita and I. Titov (2020) Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 183–196. External Links: Document Cited by: §1, §2.
  • Q. Xie, Z. Dai, Y. Du, E. H. Hovy, and G. Neubig (2017) Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 585–596. Cited by: §2.
  • Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou (2020) LayoutLM: pre-training of text and layout for document image understanding. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (Eds.), pp. 1192–1200. Cited by: §2.
  • W. Zaremba, I. Sutskever, and O. Vinyals (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. Cited by: §1, §3.1.

Appendix A Training Details

On the WMT14 corpus, training one LSTM model with 4 V100 GPUs costs 5 hours, and training one Transformer with 8 V100 GPUs costs 8 hours. On the LDC corpus, training one LSTM model with 4 V100 GPUs costs 3 hours, and training one Transformer with 8 V100 GPUs costs 3 hours. On the PTB dataset, training one LSTM model with 1 V100 GPU costs 6 minutes.

When running our algorithm, we empirically observe that when the trade-off coefficient in Eq. 8 is below 0.01, the optimized models show little difference compared with the standard model, and when it is larger than 0.1, the proposed algorithm becomes unstable and cannot converge well to Pareto-optimal solutions. Therefore, we take ten values from 0.1 to 0.01 at equal intervals for the coefficient in Eq. 8, and train ten models with different values for each condition. Then we plot all the models and their Pareto frontier in the experiments.
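The sweep and frontier extraction described above can be sketched as follows. This is an illustrative assumption rather than the paper's actual implementation: we take BLEU (to be maximized) and the probing entropy H(POS|h) (to be minimized) as the two objectives, and `pareto_frontier` is a hypothetical helper that keeps the non-dominated models.

```python
import numpy as np

def pareto_frontier(points):
    """Return the indices of Pareto-optimal (BLEU, entropy) pairs.

    A model is dominated if some other model has BLEU at least as high
    and probing entropy at least as low, with strict improvement in one.
    """
    points = np.asarray(points, dtype=float)
    keep = []
    for i, (bleu_i, h_i) in enumerate(points):
        dominated = any(
            bleu_j >= bleu_i and h_j <= h_i and (bleu_j > bleu_i or h_j < h_i)
            for j, (bleu_j, h_j) in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# Ten trade-off coefficients from 0.1 down to 0.01 at equal intervals,
# one model trained per value (as described above).
lambdas = np.linspace(0.1, 0.01, 10)
```

Given the (BLEU, entropy) pair of each trained model, `pareto_frontier` selects the models that would lie on the plotted frontier.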

Appendix B Effects of Randomness

        BLEU                H(POS|h)
  mean      var         mean      var
  21.08     0.00407     0.1113    0
  21.32     0.01536     0.1093    0
  21.49     0.01847     0.108     0
  21.52     0.00060     0.1123    0
Table 1: Experiment results from the LSTM + POS setting. "mean" and "var" denote the mean and the variance over the selected checkpoint window.

Following the method of Chen et al. (2018), we check whether randomness affects our experimental results. Specifically, we select a window of size 3 around the best checkpoint and report the mean and variance over the selected window. The results are shown in Table 1. Because repeating experiments under all settings would be too expensive, we randomly select only 4 models from the LSTM + POS setting. As shown in the table, all the variances are small, and the variances of the entropy even reach 0. This suggests that the random disturbance in our experiments is small and thus our results are reliable.
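The windowed statistic above can be sketched as follows. The helper name `window_stats` is hypothetical, and using the population variance is an assumption about which variance the table reports; the window of size 3 corresponds to the best checkpoint plus one neighbour on each side.

```python
import statistics

def window_stats(scores, best_idx, radius=1):
    """Mean and variance of checkpoint scores in a window of size
    2*radius + 1 centred on the best checkpoint (size 3 for radius=1),
    clipped at the ends of the checkpoint sequence."""
    lo = max(0, best_idx - radius)
    hi = min(len(scores), best_idx + radius + 1)
    window = scores[lo:hi]
    return statistics.mean(window), statistics.pvariance(window)
```

For example, with BLEU scores `[20.0, 21.0, 21.5, 21.3, 20.8]` and the best checkpoint at index 2, the statistics are computed over `[21.0, 21.5, 21.3]`.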