Predicting Argumenthood of English Preposition Phrases

09/20/2018 ∙ by Najoung Kim, et al. ∙ Johns Hopkins University

Distinguishing between core and non-core dependents (i.e., arguments and adjuncts) of a verb is a longstanding, nontrivial problem. In natural language processing, argumenthood information is important in tasks such as semantic role labeling (SRL) and preposition phrase (PP) attachment disambiguation. In theoretical linguistics, many diagnostic tests for argumenthood exist but they often yield conflicting and potentially gradient results. This is especially the case for syntactically oblique items such as PPs. We propose two PP argumenthood prediction tasks branching from these two motivations: (1) binary argument/adjunct classification of PPs in VerbNet, and (2) gradient argumenthood prediction using human judgments as gold standard, and report results from prediction models that use pretrained word embeddings and other linguistically informed features. Our best results on each task are (1) acc.=0.955, F_1=0.954 (ELMo+BiLSTM) and (2) Pearson's r=0.624 (word2vec+MLP). Furthermore, we demonstrate the utility of argumenthood prediction in improving sentence representations via performance gains on SRL when a sentence encoder is pretrained with our tasks.






1 Introduction

In theoretical linguistics, a formal distinction is made between “core” and “non-core” elements with respect to a verb. For instance, in the sentence below, the window is the core dependent of the verb open, whereas this morning and with Mary are both non-core dependents.

John opened [the window] [this morning] [with Mary].

What distinguishes core dependents (arguments or complements) from non-core dependents (adjuncts or modifiers), and why is this distinction important? (Various combinations of the terminology are found in the literature, with subtle domain-specific preferences. We use arguments and adjuncts, with a rough definition of arguments as elements specifically selected or subcategorized by the verb.) Theoretically, the distinct representations given to arguments and adjuncts manifest in different formal behaviors (Chomsky, 1993; Steedman, 2000). There is also a range of psycholinguistic evidence which supports the psychological reality of the distinction (Tutunjian and Boland, 2008). In natural language processing (NLP), argumenthood information is useful in various applied tasks such as automatic parsing (Briscoe and Carroll, 1998) and PP attachment disambiguation (Merlo and Ferrer, 2006). In particular, automatic distinction of argumenthood could prove useful in improving structure-aware semantic role labeling, which has been shown to outperform structure-agnostic models in recent work (Marcheggiani and Titov, 2017). However, the argument/adjunct distinction is one of the most difficult linguistic properties to annotate, and has remained unmarked in popular resources including the Penn TreeBank. PropBank (Palmer et al., 2005) addresses this issue to an extent by providing arg-n labels, but does not provide full coverage (Hockenmaier and Steedman, 2002). Thus, there are both theoretical and practical motivations for developing a systematic approach to predicting argumenthood.

We focus on PP dependents in this paper, which are known to be among the most challenging verbal dependents to classify correctly (Abend and Rappoport, 2010). The paper is structured as follows. First, we discuss the theoretical and practical motivations for PP argumenthood prediction in more detail and review related work. Second, we formulate two different argumenthood tasks (binary and gradient) and describe how each dataset is constructed. Results for each task using various word embeddings and linguistic features as predictors are reported. Finally, we investigate whether better PP argumenthood prediction is indeed useful for NLP. Through a controlled evaluation setup, we demonstrate that training sentence encoders on our proposed tasks improves the quality of learned representations.

2 Argumenthood prediction

Theoretical motivation.

Although arguments and adjuncts are theoretically and practically important concepts, distinguishing arguments from adjuncts in practice is not a trivial problem even for linguists (Schütze, 1995). Numerous diagnostic tests have been proposed in the literature; for instance, omissibility and iterability tests are commonly used (Pollard and Sag, 1987). However, none of the existing diagnostic tests (or a set of tests) provide necessary or sufficient criteria to determine the status of a dependent. Moreover, it has long been noted that argumenthood is a gradient phenomenon rather than a strict dichotomy (e.g., some arguments are less argument-like than others, resulting in different syntactic and semantic behaviors of the dependents (Schütze, 1995; Rissman et al., 2015)). This raises many interesting theoretical questions such as what kinds of lexical and contextual information affect these judgments, and whether the judgments would be predictable in a principled way given that information. By building prediction models for gradient judgments using lexical features, we hope to gain insights about what factors explain gradience and to what degree they do so.

Utility in NLP.

Automatic parsing will likely benefit from argumenthood information. For instance, the issue of PP attachment is responsible for a large portion of the errors in parsers (Dasigi et al., 2017). It has been shown that reducing PP attachment errors leads to higher parsing accuracy (Agirre et al., 2008; Belinkov et al., 2014), and also that argument-adjunct distinction is useful for PP attachment disambiguation (Merlo and Ferrer, 2006).

Moreover, argument-adjunct status is explicitly encoded in many formal grammars including Combinatory Categorial Grammar (CCG) (Steedman, 2000). This often makes translation between grammars with and without mandatory representation of argumenthood difficult (Gildea and Hockenmaier, 2003). Being able to predict argumenthood would facilitate and improve the quality of resources being ported between different grammars, such as CCGBank.

Argument-adjunct distinction is also closely connected to Semantic Role Labeling (SRL). He et al. (2017) report that even state-of-the-art deep models for SRL still suffer from argument-adjunct distinction errors as well as PP attachment errors. They also observe that errors in widely-used automatic parsers pose challenges to improving performance in syntax-aware neural models (Marcheggiani and Titov, 2017). This suggests that improving parsers with better argumenthood distinction would lead to better SRL performance.

Related work.

Our tasks share a similar objective with Villavicencio (2002), which is to distinguish PP arguments from adjuncts by an informed selection of linguistic features. However, we do not use logical forms or an explicit formal grammar in our models, although the use of distributional word representations may capture some syntax. The scale and data collection procedure of our binary classification task (Experiment 1) are more comparable to those of Merlo and Ferrer (2006) or Belinkov et al. (2014), where the authors construct a PP attachment database from Penn TreeBank data. Our binary classification dataset is similar in scale, but is based on VerbNet (Kipper-Schuler, 2005) frames. Experiment 2, which is a smaller-scale experiment on predicting gradient argumenthood judgment data from humans, is a novel task to the best of our knowledge. The crowdsourcing protocol for collecting human judgments is inspired by Rissman et al. (2015).

Our evaluation setup for measuring downstream task performance gains from PP argumenthood information (Section 5) is inspired by a recent line of work on probing tasks for evaluating the quality of sentence representations (Ribeiro et al., 2018; Ettinger et al., 2016; Gulordava et al., 2018). To investigate whether the PP argument/adjunct distinction task has a practical application, we attempt to improve performance on downstream tasks such as SRL, with the ultimate goal of making sentence representations better and more generalizable using a set of linguistically motivated tasks (as opposed to representations that are extremely fine-tuned for a very specific task). The method is to pretrain a sentence encoder on a linguistic task of interest, fix the encoder weights, and then train a classifier for tasks other than the pretraining task using the representations from the frozen encoder. This enables us to compare the utility of information from different pretraining tasks (e.g., PP argument/adjunct distinction) on another task (e.g., SRL) or even on multiple external tasks.

3 Experiment 1: Binary classification

3.1 Task formulation

                                          w2v           GloVe         fastText      ELMo
Classification model                      Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1
BiLSTM + MLP                              94.0   94.0   94.5   94.4   94.6   94.6   95.5   95.4
Concatenation + MLP                       92.4   92.3   92.8   92.6   93.6   93.4   94.0   93.9
BoW + MLP                                 91.4   91.4   91.4   91.3   91.4   91.3   93.4   93.3
Concatenation + Random forest             91.5   91.2   90.8   90.8   90.6   90.5   91.1   90.9
Concatenation + SVM (no hparam tuning)    81.6   81.3   82.3   82.1   83.1   83.1   87.2   87.2
Majority class (== chance)                50.0   50.0   50.0   50.0   50.0   50.0   50.0   50.0
Table 1: Test set performance on the binary classification task.

Class labels.

We use VerbNet subcategorization frames to define the argument-adjunct status of a verb-PP combination. This means if a certain PP is listed as a subcategorization frame under a certain verb entry, the PP is considered to be an argument of the verb. If it is not listed as a subcategorization frame, it is considered an adjunct of the verb. This way of defining argumenthood has been proposed and evaluated by McConville and Dzikovska (2008). Although it should be noted that VerbNet is not an exhaustive list of subcategorization frames and not all frames listed are strictly arguments, McConville and Dzikovska (2008)’s evaluation suggests that VerbNet frame membership is a reasonable proxy of argumenthood. We chose VerbNet over PropBank arg-n and am labels, which are also a popular proxy for argumenthood (Abend and Rappoport, 2010), since VerbNet’s design goal of exhaustively listing frames for each verb better matches our task that requires broad type-level coverage of V-PP constructions.

Figure 1: Examples of PP.x frames in VerbNet.

Verbs and prepositions.

We use 2714 unique verbs and 60 unique prepositions to generate all possible combinations of {Verb, Prep.}. These are all unique verb entries that pertain to a single VerbNet class, and all prepositions that appear in VerbNet PP frames excluding multi-word prepositions (e.g., all over, on top of). Some frames are only defined by features such as {+spatial}, without specifying which specific prepositions these features correspond to (see Figure 1). For such featurally-marked frames, a manual mapping was made to specific preposition sets constructed approximately based on PrepWiki sense annotations (Schneider et al., 2015). As a result, we obtain a set of distinct {Verb, Prep.} tuples, each of which either corresponds to (arg) or does not correspond to (adj) a subcategorization frame.

Dataset and task objective.

Since each verb subcategorizes only a handful of prepositions out of the possible 60, the distribution of labels arg and adj is heavily skewed towards adj. The ratio of arg:adj labels in the whole dataset is approximately 1:10. For this reason, we use randomized subsampling of the negative cases to construct a balanced dataset: as many label 0 datapoints were randomly subsampled as there were label 1 datapoints in the whole set. This balanced dataset is randomly split into 70:15:15 train:dev:test sets.
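The dataset construction described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the function names, the toy verbs, prepositions and frames, and the random seed are all hypothetical, and real VerbNet frame lookup is replaced by a plain set of pairs.

```python
import itertools
import random


def build_balanced_dataset(verbs, preps, frames, seed=0):
    """Label every (verb, prep) pair, then subsample negatives to balance.

    `frames` is a set of (verb, prep) pairs standing in for VerbNet
    subcategorization frames (hypothetical toy data here). Membership
    gives label 1 (arg); absence gives label 0 (adj).
    """
    rng = random.Random(seed)
    pairs = list(itertools.product(verbs, preps))
    positives = [(v, p, 1) for v, p in pairs if (v, p) in frames]
    negatives = [(v, p, 0) for v, p in pairs if (v, p) not in frames]
    # Keep exactly as many adjunct examples as there are argument examples.
    negatives = rng.sample(negatives, len(positives))
    data = positives + negatives
    rng.shuffle(data)
    return data


def split_70_15_15(data):
    """Split into train/dev/test; integer arithmetic avoids float rounding."""
    n_train = len(data) * 70 // 100
    n_dev = len(data) * 15 // 100
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]


verbs = ["put", "rely", "sleep", "hide"]
preps = ["on", "in", "with", "from", "into"]
frames = {("put", "on"), ("put", "in"), ("rely", "on"),
          ("hide", "from"), ("hide", "in")}

data = build_balanced_dataset(verbs, preps, frames)
train, dev, test = split_70_15_15(data)
```

With 20 candidate pairs and 5 listed frames, the toy data has a 1:3 arg:adj ratio before subsampling and a 1:1 ratio afterwards.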

The task is to predict whether a given {Verb, Prep.} pair is an argument or an adjunct construction (i.e., whether it is an existing VerbNet subcategorization frame or not). Performance is measured by classification accuracy (acc.) and F1 on the test set.
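For reference, the two metrics can be computed as in the sketch below. The source does not say whether F1 is computed for the positive class only or averaged over classes; this example assumes positive-class (arg = 1) F1, and the function names and toy labels are illustrative.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_score(gold, pred, positive=1):
    """F1 for the `positive` class (assumed here to be arg = 1)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1]
acc = accuracy(gold, pred)
f1 = f1_score(gold, pred)
```

On a perfectly balanced test set, a classifier that always predicts one label reaches 50% accuracy, which is why majority-class and chance baselines coincide for this task.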

Full-sentence variants of the task.

The meaning of the complement noun phrase (NP) of the preposition (NP under PP) can also be a clue to determining argumenthood. We did not include the NP under PP as a part of the input in our main task dataset because the argumenthood labels we are using are type-level (or frame-level) labels rather than labels for individual instantiations of the frames (token-level). However, the NP meanings, especially thematic roles, do play crucial roles in argumenthood. To address this concern, we propose variants of the main task that give prediction models access to information from the NP by providing full-sentence inputs. We report additional results on one of the full-sentence variants (ternary classification) using two of the best-performing model setups for the main task (Table 2).

The full-sentence variants of the main task dataset are constructed by performing a heuristic search through the Stanford Natural Language Inference (SNLI; Bowman et al., 2015) and Multi-genre Natural Language Inference (MNLI; Williams et al., 2018) datasets using the syntactic parse trees provided, to find sentences that contain a particular {Verb, Prep.} construction. Note that the full-sentence data is noisier than the main dataset because (1) the trees are parser outputs and (2) the original type-level gold labels given to the {Verb, Prep.} were unchanged regardless of what the NP may be. Duplicate entries for the same {Verb, Prep.} were permitted as long as the sentences themselves were different. In the first task variant, for cases where no example of the input pair is found in the dataset, we retained the original label. In the second variant, we assign a new (unobserved) label to such cases. This helps filter out overgenerated adjunct labels in the original dataset, where the {Verb, Prep.} is not listed as a frame because it is an impossible combination. It also eliminates the need for subsampling, since the ratio of arg:adj:unobserved labels was reasonably balanced.
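The label assignment for the second (ternary) variant can be sketched as below. This is an illustration under assumptions: the heuristic parse-tree search is replaced by a precomputed dictionary of sentences per pair, the function name and toy data are hypothetical, and representing an unobserved pair with no sentence (`None`) is a choice made only for this example.

```python
def build_ternary_examples(type_labels, sentences_by_pair):
    """Attach type-level labels to sentences; mark pairs with no sentence.

    type_labels:       {(verb, prep): "arg" or "adj"} from the main dataset
    sentences_by_pair: {(verb, prep): [sentence, ...]} found in the corpus
    """
    examples = []
    for pair, label in type_labels.items():
        # dict.fromkeys drops duplicate sentences while preserving order.
        sentences = list(dict.fromkeys(sentences_by_pair.get(pair, [])))
        if not sentences:
            # No corpus evidence for this pair: use the new third label.
            examples.append((pair, None, "unobserved"))
        for sentence in sentences:
            # The type-level label is kept regardless of the NP in the sentence.
            examples.append((pair, sentence, label))
    return examples


type_labels = {
    ("rely", "on"): "arg",
    ("sleep", "in"): "adj",
    ("sleep", "into"): "adj",
}
sentences_by_pair = {
    ("rely", "on"): ["They rely on the bus.", "They rely on the bus.",
                     "We rely on him."],
    ("sleep", "in"): ["The cat sleeps in the sun."],
}
examples = build_ternary_examples(type_labels, sentences_by_pair)
```

Here the implausible pair ("sleep", "into") moves from adj to unobserved, which is the filtering effect described above.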

We chose NLI datasets as the source of full-sentence inputs over other parsed corpora such as the Penn TreeBank for the following two reasons. First, we wanted to avoid using the same source text as several downstream tasks we test in Section 5 (e.g., CoNLL-2005 SRL (Carreras and Màrquez, 2005) uses sections of the Penn TreeBank), in order to separate out the benefits of seeing the source text at train time from the benefits of structural knowledge gained from learning to distinguish PP arguments and adjuncts. Second, we wanted both simple, short sentences (SNLI) and complex, naturally occurring sentences (MNLI) from datasets of a consistent structure.

3.2 Model

Input representation.

We report results using four different types of word embeddings (word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fastText (Bojanowski et al., 2016), ELMo (Peters et al., 2018)) to represent the input tuples. Publicly available pretrained embeddings provided by the respective authors are used.

                        w2v           GloVe         fastText      ELMo
Classification model    Acc.   F1     Acc.   F1     Acc.   F1     Acc.   F1
BiLSTM + MLP            95.6   94.3   97.0   96.1   96.9   96.0   97.4   96.6
BoW + MLP               93.6   91.7   93.7   91.8   93.6   91.7   95.9   94.7
Majority class          38.8   38.8   38.8   38.8   38.8   38.8   38.8   38.8
Chance                  33.4   33.4   33.4   33.4   33.4   33.4   33.4   33.4
Table 2: Test set performance on the full-sentence variant of the original classification task with the unobserved label included (now ternary classification).


Our current best model uses a combination of a bidirectional LSTM (BiLSTM) and a multi-layer perceptron (MLP) for the binary classification task. We use the LSTM implementation from PyTorch (Eq. 1) to obtain a representation of the given {Verb, Prep.} sequence and then train an MLP classifier (Eq. 2) on top to perform the actual classification. The MLP classifier is a simple feedforward neural network with a single hidden layer. We also test models that use the same MLP classifier (Eq. 2) with concatenated input vectors (Concatenation + MLP) or bag-of-words encodings of the input (BoW + MLP).

In Eq. 2, the classifier applies a linear projection to the output of Eq. 1 (or to the BoW encoder output or a concatenated word vector), followed by max pooling and a softmax activation. The output of Eq. 2 is the label (1 or 0) that is more likely given the input. The models are trained using Adadelta (Zeiler, 2012) with cross-entropy loss. Several other off-the-shelf implementations from the Python library scikit-learn were also tested in place of the MLP, but since the MLP consistently outperformed the other classifiers, we list only a subset of the non-MLP models we tested (Table 1).
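The difference between the two non-recurrent input encodings can be illustrated with toy vectors. This is a sketch under assumptions: the source does not specify how the bag-of-words encoder combines word vectors, so averaging is assumed here, and the three-dimensional vectors are invented for illustration rather than taken from any pretrained embedding.

```python
def concat_encode(vectors):
    """Concatenate word vectors in order; the result depends on word order."""
    return [x for vec in vectors for x in vec]


def bow_encode(vectors):
    """Average word vectors dimension-wise; word order is discarded.

    Averaging is an assumption made for this sketch.
    """
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


# Hypothetical 3-dimensional embeddings for a verb and a preposition.
verb = [0.2, -0.1, 0.7]
prep = [0.5, 0.3, -0.4]

concat_vp = concat_encode([verb, prep])   # 6 dimensions, order-sensitive
concat_pv = concat_encode([prep, verb])
bow_vp = bow_encode([verb, prep])         # 3 dimensions, order-insensitive
bow_pv = bow_encode([prep, verb])
```

Concatenation doubles the input dimensionality for a two-word input and grows with sequence length, whereas the bag-of-words encoding keeps a fixed size but cannot tell the verb from the preposition by position.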

3.3 Results

Table 1 compares the performance of the tested models. The most trivial baseline is chance-level classification, which would yield 50% accuracy and F1. Since the labels in the dataset are perfectly balanced by randomized subsampling, majority class classification is equivalent to chance-level classification. All nontrivial models outperform chance, and ELMo yields the best performance for every MLP-based model. Using concatenated inputs that preserve the linear order of the inputs increases both accuracy and F1 by around 1.5 percentage points over bag-of-words, and using a BiLSTM to encode the V+P representation gives another improvement of around 1.5 percentage points across the board.

We additionally report results on the full-sentence, ternary-classification variant of the task (description in Section 3.1) from the best model on the main task. The results are given in Table 2. We did not test the concatenation model since the dimensionality of the vectors would be too large with full sentence inputs. All tested models perform over chance, with ELMo achieving the best performance once again. We observe a similar gain of around 3%p by replacing BoW with BiLSTM as in the main task.

4 Experiment 2: Gradient argumenthood prediction

As discussed in Section 2, much work in the theoretical linguistics literature suggests that argument-adjuncthood is a continuum rather than a dichotomy. We propose a gradient argumenthood prediction task and test regression models that use a combination of embeddings and lexical features. Since there is no publicly available gradient argumenthood data, we collected our own dataset via crowdsourcing. (See Supplemental Material for examples of questions given to participants. The detailed protocol and theoretical analysis of the data are omitted; they will be discussed in a separate theoretically-oriented paper in preparation.) Due to the resource-consuming nature of human judgment collection, the size of the dataset is limited (n=305, 25-way redundant). This task serves as a small-scale, proof-of-concept test of whether a reasonable prediction of gradient argumenthood judgments is possible with informed selection of lexical features (and if so, how well different models perform and what features are informative).

Features: all    w2v-googlenews             GloVe                      fastText
Model            r       R²      Adj. R²    r       R²      Adj. R²    r       R²      Adj. R²
Simple MLP       0.554   0.255   0.231      0.568   0.268   0.245      0.582   0.281   0.257
Linear           0.579   0.280   0.257      0.560   0.243   0.219      0.591   0.291   0.268
SVM              0.561   0.243   0.218      0.463   0.110   0.082      0.580   0.267   0.243

                 w2v-wiki                   ELMo
Model            r       R²      Adj. R²    r       R²      Adj. R²
Simple MLP       0.624   0.330   0.309      0.609   0.304   0.281
Linear           0.609   0.311   0.289      0.586   0.293   0.270
SVM              0.549   0.237   0.213      0.337   0.052   0.022
Table 3: 10-fold cross-validation results on the gradient argumenthood prediction task (r is Pearson's r).
Figure 2: Distribution of argumenthood scores in our gradient argumenthood dataset.

4.1 Data

The gradient argumenthood judgment dataset consists of 305 sentences that contain a single main verb and a PP dependent of that verb. All sentences are adaptations from example sentences in VerbNet PP subcategorization frames or sentences containing arg-n PPs in PropBank. To capture a full range of the argumenthood spectrum from fully argument-like to fully adjunct-like, we manually augmented the dataset by adding more strongly adjunct-like examples. These examples are generated by substituting the PP of the original sentence with a felicitous adjunct. This step is necessary since PPs listed as a subcategorization frame or arg-n are more argument-like, as discussed in Section 3.1.

25 participants were recruited to produce argumenthood judgments about these sentences on a 7-point Likert scale, using protocols adapted from prior work on collecting similar judgments (Rissman et al., 2015). Larger numbers are interpreted as more argument-like and smaller numbers as more adjunct-like. The results were z-normalized within-subject to adjust for individual differences in the use of the scale, and the final argumenthood scores were then computed by averaging the normalized scores of all participants. The result is a set of values on a continuous scale, with one value for each sentence in the dataset. The scores are roughly centered around zero. The distribution is slightly left-skewed (right-leaning), with more positive values than negative values (see Figure 2). Here are some examples with varying degrees of argumenthood, indicated by the numbers:

I whipped the sugar [with cream]. (0.35)
The witch turned him [into a frog]. (0.57)
The children hid [in a hurry]. (-0.41)
It clamped [on his ankle]. (0.66)
Amanda shuttled the children [from home]. (-0.1)

We refrain from assigning definitive interpretations to the absolute values of the scores, but how the scores compare to each other gives us insight into the relative difference in argumenthood. For instance, shuttled from home with a score of -0.1 is more adjunct-like than a higher-scoring construction such as clamped on his ankle (0.66), but more argument-like than hid in a hurry with a score of -0.41. This matches the intuition that a locative PP from home would be more argument-like to a change-of-location predicate shuttle than a manner PP like in a hurry is to hide. However, it is still less argument-like than a more clearly argument-like PP on his ankle that is selected by clamp.
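The score computation described in this section (within-subject z-normalization followed by averaging across participants) can be sketched as follows. The ratings are invented toy data, the function names are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, pstdev


def z_normalize(ratings):
    """Z-normalize one participant's ratings (population std assumed).

    A participant who gives every sentence the same rating has zero
    spread and would need to be handled separately.
    """
    mu = mean(ratings)
    sigma = pstdev(ratings)
    return [(r - mu) / sigma for r in ratings]


def argumenthood_scores(ratings_by_participant):
    """Average the normalized ratings across participants, per sentence."""
    normalized = [z_normalize(r) for r in ratings_by_participant]
    return [mean(column) for column in zip(*normalized)]


# Toy 7-point Likert ratings: 3 participants x 3 sentences.
ratings_by_participant = [
    [7, 4, 1],   # uses the full scale
    [6, 5, 4],   # uses a narrow band near the top
    [5, 3, 1],   # uses the lower half
]
scores = argumenthood_scores(ratings_by_participant)
```

All three toy participants rank the sentences identically but use the scale differently; after normalization their ratings coincide, which is the individual difference the within-subject step is meant to remove.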

4.2 Model


We use the same sets of word embeddings we used in Experiment 1 as base features, but we reduce their dimensionality via Principal Component Analysis (PCA). This reduction step is necessary because the word vectors are high-dimensional relative to the small size of our dataset (n=305). Various features in addition to the embeddings of verbs and prepositions were also tested. The features we experimented with include semantic proto-role property scores (Reisinger et al., 2015) of the target PP (normalized mean across 5 annotators), mutual information (MI) (Aldezabal et al., 2002), word embeddings of the nominal head token of the NP under the PP in question, existence of a direct object, and various interaction terms between the features (e.g., additive, subtractive, inner/outer products). The following lexical features were selected for the final models based on dev set performance: embeddings of the verb, preposition and nominal head, mutual information, and existence of a direct object (D.O.). The intuition behind including D.O. is that if there is a direct object in the given sentence, the syntactically oblique PP dependent would seem less argument-like in comparison with the direct object. This feature is expected to reduce the noise introduced by different argument structures of the main verbs.

                w2v-wiki                   ELMo                       No embeddings
Model           r       R²      Adj. R²    r       R²      Adj. R²    r       R²      Adj. R²
Embeddings      0.430   0.064   0.046      0.404   0.063   0.044      -       -       -
+MI             0.464   0.158   0.138      0.458   0.153   0.133      0.376   0.083   0.079
+D.O.           0.575   0.245   0.227      0.530   0.202   0.183      0.301   0.029   0.025
+diag.          0.449   0.125   0.104      0.440   0.138   0.117      0.268   0.027   0.023
+MI +D.O.       0.586   0.278   0.258      0.607   0.297   0.277      0.466   0.165   0.158
+MI +diag.      0.515   0.205   0.182      0.512   0.193   0.170      0.436   0.141   0.134
+diag. +D.O.    0.572   0.265   0.245      0.535   0.224   0.203      0.392   0.114   0.107
+all            0.624   0.330   0.309      0.609   0.304   0.281      0.516   0.215   0.206
Table 4: Ablation results from the two best models and a model with non-embedding features only (10-fold cross-validation; r is Pearson's r).

We also include a diagnostics feature which is a weighted combination of two different traditional diagnostic test results (omissibility and pseudo-cleftability) produced by a linguist with expertise in theoretical syntax. Unlike all other features in our feature set, this diagnostics feature is not straightforwardly computable from corpus data. We add this feature in order to examine how powerful traditional linguistic diagnostics are in capturing gradient argumenthood.

Regression model.

The selected features are given as inputs to an MLP regressor which is equivalent to Eq. 2 in Experiment 1 except that it outputs a continuous value. The regressor consists of an input layer with one unit per feature, m hidden units and a single output unit. Smooth L1 loss is used in order to reduce sensitivity to outliers, and m, the activation function (ReLU, Tanh or Sigmoid), optimizers and learning rates are all tuned with the dev set for each individual model using Hyperband (Li et al., 2016). We limit ourselves to a simpler MLP-only model for this experiment; the BiLSTM encoder model suffered from overfitting.
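To make the role of the loss concrete, the sketch below implements the standard smooth L1 (Huber-style) loss for a single prediction. The threshold `beta=1.0` is an assumption that mirrors the common default; the source does not report the value used, and the example predictions are invented.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one prediction.

    Quadratic for small errors (|error| < beta) and linear for large
    ones, so outliers contribute less than under squared error.
    """
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


small_error = smooth_l1(0.5, 0.0)   # quadratic regime: 0.5 * 0.25
large_error = smooth_l1(3.0, 0.0)   # linear regime: 3.0 - 0.5
squared_large = (3.0 - 0.0) ** 2    # squared error for comparison
```

For the large error, the smooth L1 value (2.5) is much smaller than the squared error (9.0), which is the reduced sensitivity to outliers mentioned above.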

Evaluation metrics.

We use 15% of the dataset as a development set, and train/test using 10-fold cross-validation on the remaining 85% rather than reporting performance on a fixed test split, because performance on a single test split may be unreliable given the small sample size. The main metric is Pearson's r, averaged over the 10 folds using the Fisher z-transformation. Mean R² and adjusted R² are also reported to account for the potentially differing number of predictors in the ablation experiment.
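The two aggregation steps named above can be sketched as follows: averaging correlation coefficients through the Fisher z-transformation, and the standard adjusted R² correction for the number of predictors. The function names and the example fold values are illustrative, not results from the paper.

```python
import math


def fisher_mean(correlations):
    """Average Pearson's r values via the Fisher z-transformation.

    Each r is mapped to z = atanh(r), the z values are averaged, and
    the mean is mapped back with tanh.
    """
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))


def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the predictor count."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical per-fold correlations.
fold_rs = [0.2, 0.8]
avg_r = fisher_mean(fold_rs)
plain_mean = sum(fold_rs) / len(fold_rs)
```

The Fisher-averaged value differs from the plain arithmetic mean because the transformation stretches correlations near ±1; when all folds agree, the two coincide.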

                                         Sentence encoder pretraining tasks
Test tasks                               Random   Arg     Arg fullsent   Arg fullsent 3-way
SRL-CoNLL2005 (WSJ)                      81.7     83.9    84.7           84.5
SRL-CoNLL2012 (OntoNotes)                77.3     80.2    80.4           80.7
PP attachment (Belinkov et al., 2014)    87.5     87.6    88.2           87.0
Table 5: Performance gains over random initialization from pretraining sentence encoders on PP argumenthood task variants.

4.3 Results and discussion

Table 3 reports performance on the regression task. Results from several off-the-shelf regressors are reported for comparison. ELMo embeddings again produced the best results among the embeddings used in Experiment 1. We speculated that the dimensionality of the embeddings may have impacted the results, and ran additional experiments using higher-dimensional embeddings. Higher-dimensional embeddings did indeed lead to performance improvements, even though the actual inputs given to the models were all PCA-reduced to the same dimensionality. Based on this observation, we were able to further improve upon the initial ELMo results. Results from the best model (w2v-wiki) are given in addition to the set of results using the same embeddings as the models in Experiment 1. This model uses 1000-dimensional word2vec features with additional interaction features (multiplicative, subtractive) that improved dev set performance.


We conducted ablation experiments with the two best-performing models to examine the contribution of non-embedding features discussed in Section 4.2. Table 4 indicates that any linguistic feature contributes positively towards performance, with the direct object feature helping both word2vec and ELMo models the most. This supports our initial hypothesis that adding the direct object feature would help reduce noise in the data. When only the linguistic features are used without embeddings as base features, mutual information is the most informative. This suggests that there is some (but not complete) redundancy in information captured by word embeddings and mutual information. The diagnostics feature is informative but is a comparatively weak predictor, which aligns with the current state of diagnostic acceptability tests—they are sometimes useful but not always, especially with respect to syntactically oblique items such as PPs. This behavior of the diagnostics predictor adds credibility to our data collection protocol.

5 Why is this a useful standalone task?

In motivating our tasks, we suggested that PP argumenthood information could improve performance on downstream tasks such as SRL and parsing. We investigate whether this claim is well grounded by testing two separate hypotheses: (1) whether the task is indeed useful, and if so, (2) whether it is worthwhile as a standalone task. We leave the issue of gradient argumenthood to future work, since the dataset is currently small and the notion of gradient argumenthood is not yet compatible with formulations of many NLP tasks.

5.1 Improving representations with pretraining

We first test the utility of the binary argumenthood task in improving performances on existing NLP tasks. We selected three tasks that may benefit from PP argumenthood information: SRL on labeled Wall Street Journal data (WSJ) (CoNLL 2005 shared task; Carreras and Màrquez 2005), SRL on OntoNotes Corpus (CoNLL 2012 data; Pradhan et al. 2012), and PP attachment disambiguation on WSJ data (Belinkov et al., 2014).

Our test setup is as follows. We first train a BiLSTM sentence encoder with the objective of maximizing dev set performance on our PP argumenthood tasks. Then we freeze the encoder weights and train an MLP classifier to perform new tasks using only the representations produced by the frozen-weight sentence encoder. If learning to make correct PP argumenthood distinctions teaches models knowledge that is generalizable to the new tasks, the classifier trained on top of the fixed-weight encoder will perform better on those tasks than a classifier trained on top of an encoder with randomly initialized weights. Improvements over the randomly initialized setup from pretraining on our main PP argumenthood task (Arg) and its full-sentence variants (Arg fullsent and Arg fullsent 3-way; see Section 3.1 for details) are shown in Table 5. Only statistically significant improvements over the random encoder model are bolded, with significance levels calculated via Approximate Randomization (Yeh, 2000). The models trained on PP argumenthood tasks perform significantly better than the random initialization model in both SRL tasks, which supports our initial claim that argumenthood tasks can be useful for SRL. However, we did not observe significant improvements for the PP attachment disambiguation task. We speculate that since the task as formulated in Belinkov et al. (2014) requires the model to understand PP dependents of NPs as well as VPs, our tasks that focus on verbal dependents may not provide the full set of linguistic knowledge necessary to solve this task. Nevertheless, our models are not significantly worse than the baseline, and the accuracy of the Arg fullsent model (88.2%) was comparable to that of a model that uses an encoder directly trained on PP attachment (88.7%).
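A paired approximate randomization test of the kind cited above can be sketched as follows. This is a generic illustration, not the authors' evaluation script: it assumes per-example correctness (0 or 1) as the unit of comparison and an accuracy difference as the test statistic, whereas the statistic used in the paper is not stated; the function name, trial count, seed and toy data are all hypothetical.

```python
import random


def approximate_randomization(correct_a, correct_b, trials=2000, seed=0):
    """Estimate a p-value for the accuracy difference of two systems.

    correct_a / correct_b hold 1 (correct) or 0 (wrong) for the same
    test examples. In each trial, the two systems' outcomes are swapped
    per example with probability 0.5, and the shuffled difference is
    compared with the observed one.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    observed = abs(sum(correct_a) - sum(correct_b)) / n
    at_least_as_large = 0
    for _ in range(trials):
        total_a = total_b = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (trials + 1)


same = [1, 0, 1, 1, 0, 1, 0, 1]
p_same = approximate_randomization(same, same)

always_right = [1] * 40
always_wrong = [0] * 40
p_diff = approximate_randomization(always_right, always_wrong)
```

Identical systems give a p-value of 1.0, while two systems that disagree on every example in the same direction give a p-value near the smallest value the trial count allows.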

Secondly, we discuss whether it is indeed useful to formulate PP argumenthood prediction as a separate task. The questions that need to be answered are (1) whether it would be the same or better to use a different pretraining task that would provide similar information (e.g., PP attachment disambiguation), and (2) whether the performance gain can be attributed to simply seeing more datapoints at train time rather than to the regularities we hope the models would learn through our task. Table 6 addresses both questions; we compare models pretrained on argumenthood tasks to a model pretrained directly on the PP attachment task listed in Table 5. All models trained on PP argumenthood prediction outperform the model trained on PP attachment, despite the fact that the latter has an unfair advantage for SRL2005 since the tasks share the same source text (WSJ). Furthermore, the variance in the sizes of the datasets indicates that the reported performance gains cannot solely be due to the increased number of datapoints seen during training.

Pretraining task   PP att.   Arg    Arg full   Arg full 3-way
Size               32k       19k    58k        87k
SRL2005            80.2      83.9   84.7       84.5
SRL2012            79.8      80.2   80.3       80.7
Table 6: Comparison against using PP attachment directly as a pretraining task ().
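The dataset-size argument can be checked directly against the numbers in Table 6. The sketch below is only an illustration: the dictionary layout and the helper names are hypothetical, and the values are copied from the table.

```python
# Values copied from Table 6; sizes are in thousands of examples.
TABLE6 = {
    "PP att.":        {"size_k": 32, "SRL2005": 80.2, "SRL2012": 79.8},
    "Arg":            {"size_k": 19, "SRL2005": 83.9, "SRL2012": 80.2},
    "Arg full":       {"size_k": 58, "SRL2005": 84.7, "SRL2012": 80.3},
    "Arg full 3-way": {"size_k": 87, "SRL2005": 84.5, "SRL2012": 80.7},
}


def wins_with_less_data(table, candidate, baseline,
                        tasks=("SRL2005", "SRL2012")):
    """True if `candidate` used fewer examples yet scores higher on all tasks."""
    c, b = table[candidate], table[baseline]
    return c["size_k"] < b["size_k"] and all(c[t] > b[t] for t in tasks)


def size_order_matches_score_order(table, task):
    """True if ranking the pretraining tasks by size equals ranking by score."""
    by_size = sorted(table, key=lambda name: table[name]["size_k"])
    by_score = sorted(table, key=lambda name: table[name][task])
    return by_size == by_score


print(wins_with_less_data(TABLE6, "Arg", "PP att."))
print(size_order_matches_score_order(TABLE6, "SRL2005"))
```

The Arg task has the smallest dataset yet scores above PP attachment on both SRL tasks, and the ordering by size does not reproduce the ordering by score, which is consistent with the claim that data quantity alone does not explain the gains.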

6 Conclusion

We have proposed two different tasks (binary and gradient) for predicting PP argumenthood, and reported results on each using four different types of word embeddings as base predictors. We obtain 95.5 accuracy and 95.4 F1 on the binary classification task with BiLSTM and ELMo, and Pearson's r = 0.624 for the gradient human argumenthood judgment prediction task. Our overall contribution is threefold: first, we have demonstrated that a principled prediction of both binary and gradient argumenthood judgments is possible with informed selection of lexical features; second, we have justified the utility of our binary PP argumenthood classification as a standalone task by reporting performance gains on multiple SRL tasks through encoder pretraining; finally, we have conducted a small-scale, proof-of-concept study with a novel gradient argumenthood prediction task, paired with a new dataset that we plan to release.

6.1 Future work

The pretraining approach holds much promise for understanding and improving neural network models of language. Especially for end-to-end models, this method has a substantial advantage over architecture engineering or hyperparameter tuning in terms of interpretability. That is, we can attribute the source of the performance gain on end tasks to the knowledge necessary to do well on the pretraining task. For instance, in Section 5 we can infer that knowing how to make correct PP argumenthood distinctions helps models encode representations that are more useful for SRL. Furthermore, we believe it is important to contribute to the recent efforts to design better probing tasks for understanding what machines really know about natural language (as opposed to taking downstream performance as the metric of 'better' models). We hope to scale up our preliminary experiments and will continue to work on developing a set of linguistically informed probing and pretraining tasks for higher-quality, more generalizable sentence representations.

References

  • Abend and Rappoport (2010) Abend, O. and Rappoport, A. (2010). Fully unsupervised core-adjunct argument classification. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 226–236.
  • Agirre et al. (2008) Agirre, E., Baldwin, T., and Martinez, D. (2008). Improving parsing and PP attachment performance with sense information. In Proceedings of ACL-08: HLT, pages 317–325.
  • Aldezabal et al. (2002) Aldezabal, I., Aranzabe, M., Gojenola, K., Sarasola, K., and Atutxa, A. (2002). Learning argument/adjunct distinction for Basque. In Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition-Volume 9, pages 42–50.
  • Belinkov et al. (2014) Belinkov, Y., Lei, T., Barzilay, R., and Globerson, A. (2014). Exploring compositional architectures and word vector representations for prepositional phrase attachment. Transactions of the Association for Computational Linguistics, 2:561–572.
  • Bojanowski et al. (2016) Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
  • Bowman et al. (2015) Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.
  • Briscoe and Carroll (1998) Briscoe, T. and Carroll, J. (1998). Can subcategorization probabilities help a statistical parser? In Proceedings of the ACL/SIGDAT workshop in Very Large Corpora. Association for Computational Linguistics.
  • Carreras and Màrquez (2005) Carreras, X. and Màrquez, L. (2005). Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the ninth conference on computational natural language learning, pages 152–164. Association for Computational Linguistics.
  • Chomsky (1993) Chomsky, N. (1993). Lectures on government and binding: The Pisa lectures. Number 9. Walter de Gruyter.
  • Dasigi et al. (2017) Dasigi, P., Ammar, W., Dyer, C., and Hovy, E. (2017). Ontology-aware token embeddings for prepositional phrase attachment. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2089–2098.
  • Ettinger et al. (2016) Ettinger, A., Elgohary, A., and Resnik, P. (2016). Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139.
  • Gildea and Hockenmaier (2003) Gildea, D. and Hockenmaier, J. (2003). Identifying semantic roles using combinatory categorial grammar. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 57–64.
  • Gulordava et al. (2018) Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. arXiv preprint arXiv:1803.11138.
  • He et al. (2017) He, L., Lee, K., Lewis, M., and Zettlemoyer, L. (2017). Deep semantic role labeling: What works and what’s next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 473–483.
  • Hockenmaier and Steedman (2002) Hockenmaier, J. and Steedman, M. (2002). Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02).
  • Kipper-Schuler (2005) Kipper-Schuler, K. (2005). VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. Thesis, University of Pennsylvania.
  • Li et al. (2016) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560.
  • Marcheggiani and Titov (2017) Marcheggiani, D. and Titov, I. (2017). Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515.
  • McConville and Dzikovska (2008) McConville, M. and Dzikovska, M. O. (2008). Evaluating complement-modifier distinctions in a semantically annotated corpus. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’08).
  • Merlo and Ferrer (2006) Merlo, P. and Ferrer, E. E. (2006). The notion of argument in prepositional phrase attachment. Computational Linguistics, 32(3):341–378.
  • Mikolov et al. (2013) Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
  • Palmer et al. (2005) Palmer, M., Gildea, D., and Kingsbury, P. (2005). The proposition bank: An annotated corpus of semantic roles. Computational linguistics, 31(1):71–106.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing, pages 1532–1543.
  • Peters et al. (2018) Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2227–2237.
  • Pollard and Sag (1987) Pollard, C. and Sag, I. (1987). Information-Based Syntax and Semantics, Vol. 1. CSLI Publications.
  • Pradhan et al. (2012) Pradhan, S., Moschitti, A., Xue, N., Uryupina, O., and Zhang, Y. (2012). CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL-Shared Task, pages 1–40. Association for Computational Linguistics.
  • Reisinger et al. (2015) Reisinger, D., Rudinger, R., Ferraro, F., Harman, C., Rawlins, K., and Van Durme, B. (2015). Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488.
  • Ribeiro et al. (2018) Ribeiro, M. T., Singh, S., and Guestrin, C. (2018). Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.
  • Rissman et al. (2015) Rissman, L., Rawlins, K., and Landau, B. (2015). Using instruments to understand argument structure: Evidence for gradient representation. Cognition, 142:266–290.
  • Schneider et al. (2015) Schneider, N., Srikumar, V., Hwang, J. D., and Palmer, M. (2015). A hierarchy with, of, and for preposition supersenses. In Proceedings of The 9th Linguistic Annotation Workshop, pages 112–123.
  • Schütze (1995) Schütze, C. T. (1995). PP attachment and argumenthood. MIT working papers in linguistics, 26(95):151.
  • Steedman (2000) Steedman, M. (2000). The syntactic process, volume 24. MIT Press.
  • Tutunjian and Boland (2008) Tutunjian, D. and Boland, J. E. (2008). Do we need a distinction between arguments and adjuncts? evidence from psycholinguistic studies of comprehension. Language and Linguistics Compass, 2(4):631–646.
  • Villavicencio (2002) Villavicencio, A. (2002). Learning to distinguish PP arguments from adjuncts. In Proceedings of the 6th conference on Natural language learning-Volume 20, pages 1–7. Association for Computational Linguistics.
  • Williams et al. (2018) Williams, A., Nangia, N., and Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1112–1122.
  • Yeh (2000) Yeh, A. (2000). More accurate tests for the statistical significance of result differences. In Proceedings of the 18th conference on Computational linguistics-Volume 2, pages 947–953. Association for Computational Linguistics.
  • Zeiler (2012) Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.