
Instance-Based Neural Dependency Parsing

by Hiroki Ouchi, et al.
Nara Institute of Science and Technology
Preferred Infrastructure

Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve accuracy competitive with standard neural models and that their instance-based explanations are reasonably plausible.



1 Introduction

While deep neural networks have improved prediction accuracy in various tasks, rationales underlying the predictions have been more challenging for humans to understand lei-etal-2016-rationalizing. In practical applications, interpretable rationales play a crucial role in driving humans’ decisions and promoting human-machine cooperation ribeiro2016should. From this perspective, the utility of instance-based learning aha1991instance, a traditional machine learning method, has been realized again.
Instance-based learning is a method that learns similarities between training instances and infers a value or class for a test instance on the basis of its similarities to the training instances. On the one hand, standard neural models encode all the knowledge in the parameters, making it challenging to determine what knowledge is stored and used for predictions guu2020realm. On the other hand, models with instance-based inference explicitly use training instances for predictions and can exhibit the instances that significantly contribute to the predictions. The instances serve as an explanation that answers the question: why did the model make such a prediction? This type of explanation is called instance-based explanation caruana1999case; baehrens2010explain; plumb2018model, which facilitates users’ understanding of model predictions and allows users to make decisions with higher confidence kolodneer1991improving; ribeiro2016should.

It is not trivial to combine neural networks with instance-based inference processes while keeping high prediction accuracy. Recent studies in image recognition seek to develop such methods wang2014learning; hoffer2015deep; liu2017sphereface; wang2018cosface; deng2019arcface. This paradigm is called deep metric learning. Compared to image recognition, there are far fewer studies on deep metric learning in natural language processing (NLP). As a few exceptions, wiseman-stratos-2019-label and ouchi-etal-2020-instance developed neural models that have an instance-based inference process for sequence labeling tasks. They reported that their models have high explainability without sacrificing the prediction accuracy.

As a next step from targeting consecutive tokens, we study instance-based neural models for relations between discontinuous elements. To correctly recognize relations, systems need to capture associations between elements. As an example of relation recognition, we address dependency parsing, where systems seek to recognize binary relations between tokens (hereafter edges). Traditionally, dependency parsers have been a useful tool for text analysis. An unstructured text of interest is parsed, and its structure leads users to a deeper understanding of the text. By successfully introducing instance-based models to dependency parsing, users can extract dependency edges along with similar edges as a rationale for the parse, which further helps the process of text analysis.

In this paper, we develop new instance-based neural models for dependency parsing, equipped with two inference modes: (i) explainable mode and (ii) fast mode. In the explainable mode, our models make use of similarities between the candidate edge and each edge in a training set. By looking at the similarities, users can quickly check which training edges significantly contribute to the prediction. In the fast mode, our models run as fast as standard neural models, while general instance-based models are much slower than standard neural models because of the dependence on the number of training instances. The fast mode is motivated by the actual situation: in many cases, users want only predictions, and when the predictions seem suspicious, they want to check the rationales. So, the fast mode does not offer rationales, but instead, it enables faster parsing that outputs exactly the same predictions as the explainable mode. Users can freely switch between the explainable and fast modes according to their purposes. This property is realized by taking advantage of the linearity of score computation in our models and avoids comparing a candidate edge to each training edge one by one for computing the score at test time (see Section 4.4 for details).

Our experiments on multilingual datasets show that our models can achieve accuracy competitive with standard neural models. In addition, we shed light on the plausibility of instance-based explanations, which has been underinvestigated in dependency parsing. We verify whether our models meet a minimal requirement related to the plausibility hanawa2021evaluation. Additional analysis reveals the existence of hubs radovanovic2010hubs, a small number of specific training instances that often appear as nearest neighbors, and shows that hubs severely degrade the plausibility. Our main contributions are as follows:

  • This is the first work to develop and study instance-based neural models for dependency parsing (Section 4) (Footnote 1: Our code is publicly available.);

  • Our empirical results show that our instance-based models achieve competitive accuracy with standard neural models (Section 6.1);

  • Our analysis reveals that L2-normalization for edge representations suppresses the hubs’ occurrence, and as a result, succeeds in improving the plausibility of instance-based explanations (Section 6.2 and the subsequent analysis).

2 Related Work

2.1 Dependency Parsing

There are two major paradigms for dependency parsing kubler2009dependency: (i) transition-based paradigm nivre2003efficient; yamada2003statistical and (ii) graph-based paradigm mcdonald-etal-2005-non. Recent literature often adopts the graph-based paradigm and achieves high accuracy dozat2017deep; zhang-etal-2017-dependency-parsing; hashimoto-etal-2017-joint; clark-etal-2018-semi; ji-etal-2019-graph; zhang-etal-2020-efficient. The first-order edge-factored models under this paradigm factorize the score of a dependency tree into independent scores of single edges mcdonald-etal-2005-non. The score of each edge is computed on the basis of its edge feature. This decomposable property is preferable for our work because we want to model similarities between single edges. Thus, we adopt the basic framework of the first-order edge-factored models for our instance-based models.

2.2 Instance-Based Methods in NLP

Traditionally, instance-based methods (memory-based learning) have been applied to a variety of NLP tasks daelemans2005memory, such as part of speech tagging daelemans-etal-1996-mbt, NER tjong-kim-sang-2002-memory; de-meulder-daelemans-2003-memory; hendrickx-van-den-bosch-2003-memory, partial parsing daelemans1999memory; sang2002memory, phrase-structure parsing lebowitz1983memory; scha1999memory; kubler2004memory; bod2009exemplar, word sense disambiguation veenstra2000memory, semantic role labeling akbik-li-2016-k, and machine translation (MT) nagao1984framework; sumita-iida-1991-experiments.

nivre-etal-2004-memory proposed an instance-based (memory-based) method for transition-based dependency parsing. The subsequent actions of a transition-based parser are selected at each step by comparing the current parser configuration to each of the configurations in the training set. Here, each parser configuration is treated as an instance and serves as a rationale for predicted actions but not for predicted edges. Generally, parser configurations are not mapped one-to-one to predicted edges, so it is difficult to interpret which configurations significantly contribute to edge predictions. By contrast, since we adopt the graph-based paradigm, our models can naturally treat each edge as an instance and exhibit similar edges as rationales for edge predictions.

2.3 Instance-Based Neural Methods in NLP

Most of the studies above were published before the current deep learning era. Very recently, instance-based methods have been revisited and combined with neural models in language modeling khandelwal2019generalization, MT khandelwal2020nearest, and question answering lewis2020retrieval. They augment a main neural model with a non-parametric sub-module that retrieves auxiliary objects, such as similar tokens and documents. guu2020realm proposed to parameterize and learn the sub-module for a target task.

These studies assume a different setting from ours. There is no ground-truth supervision signal for retrieval in their setting, so they adopt non-parametric approaches or indirectly train the sub-module to help a main neural model from the supervision signal of the target task. In our setting, the main neural model plays a role in retrieval and is directly trained with ground-truth objects (annotated dependency edges). Thus, our findings and insights are orthogonal to theirs.

2.4 Deep Metric Learning

From a methodological perspective, our work can be categorized as deep metric learning research. Although the origins of metric learning can be traced back to some earlier work short1981optimal; friedman1994flexible; hastie1996discriminant, the pioneering work is xing2002distance (Footnote 2: For a more detailed history of metric learning, see bellet2013survey.). Since then, many methods using neural networks for metric learning have been proposed and studied.

Deep metric learning methods can be categorized into two classes from the training loss perspective sun2020circle: (i) learning with class-level labels and (ii) learning with pair-wise labels. Given class-level labels, the first one learns to classify each training instance to its target class with a classification loss, e.g., Neighbourhood Component Analysis (NCA) goldberger2005neighbourhood, L2-constrained softmax loss ranjan2017l2, SphereFace liu2017sphereface, CosFace wang2018cosface, and ArcFace deng2019arcface. Given pair-wise labels, the second one learns pair-wise similarity (the similarity between a pair of instances), e.g., contrastive loss hadsell2006dimensionality, triplet loss wang2014learning; hoffer2015deep, N-pair loss sohn2016improved, and Multi-similarity loss wang2019multi. Our method is categorized into the first one because it adopts a classification loss (Section 4).

2.5 Neural Models Closely Related to Ours

Among the metric learning methods above, NCA goldberger2005neighbourhood shares the same spirit as our models. In this framework, models learn to map instances with the same label to the neighborhood in a feature space. wiseman-stratos-2019-label and ouchi-etal-2020-instance developed NCA-based neural models for sequence labeling. We discuss the differences between their models and ours later in more detail (Section 4.5).

3 Dependency Parsing Framework

We adopt a two-stage approach mcdonald-etal-2006-multilingual; zhang-etal-2017-dependency-parsing: we first identify dependency edges (unlabeled dependency parsing) and then classify the identified edges (labeled dependency parsing). More specifically, we solve edge identification as head selection and solve edge classification as multi-class classification. (Footnote 3: Although some previous studies adopt multi-task learning methods for edge identification and classification tasks dozat2017deep; hashimoto-etal-2017-joint, we independently train a model for each task because the interaction effects produced by multi-task learning make it challenging to analyze models’ behaviors.)

3.1 Edge Identification

To identify unlabeled edges, we adopt the head selection approach zhang-etal-2017-dependency-parsing, in which a model learns to select the correct head of each token in a sentence. This simple approach enables us to train accurate parsing models in a GPU-friendly way. We learn the representation for each edge to be discriminative for identifying correct heads.

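Let $\mathbf{x} = (x_0, x_1, \dots, x_T)$ denote a tokenized input sentence, where $x_0$ is a special root token and $x_1, \dots, x_T$ are the original $T$ tokens, and let $\langle x_j, x_i \rangle$ denote an edge from head token $x_j$ to dependent token $x_i$. We define the probability of token $x_j$ being the head of token $x_i$ in the sentence $\mathbf{x}$ as:

$$P(x_j \mid x_i) = \frac{\exp(s_{\mathrm{edge}}(x_j, x_i))}{\sum_{k=0}^{T} \exp(s_{\mathrm{edge}}(x_k, x_i))} \qquad (1)$$

Here, the scoring function $s_{\mathrm{edge}}$ can be any neural network-based scoring function (see Section 4.1).

At inference time, we choose the most likely head $\hat{y}_i$ for each token $x_i$:

$$\hat{y}_i = \operatorname*{arg\,max}_{x_k :\, 0 \le k \le T} P(x_k \mid x_i) \qquad (2)$$

(Footnote 4: While this greedy formulation has no guarantee to produce well-formed trees, we can produce well-formed ones by using the Chu-Liu-Edmonds algorithm in the same way as zhang-etal-2017-dependency-parsing. In this work, we focus on the representation of each edge and evaluate the quality of the learned edge representations one by one. With this motivation, we adopt the greedy formulation.)

At training time, we minimize the negative log-likelihood of training data:

$$\mathcal{L}_{\mathrm{edge}} = - \sum_{n=1}^{N} \sum_{i=1}^{T^{(n)}} \log P(y_i^{(n)} \mid x_i^{(n)}) \qquad (3)$$

Here, $\mathcal{D} = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}, \mathbf{r}^{(n)})\}_{n=1}^{N}$ is a training set, where $x_i^{(n)} \in \mathbf{x}^{(n)}$ is each input token, $y_i^{(n)} \in \mathbf{y}^{(n)}$ is the gold (ground-truth) head for token $x_i^{(n)}$, and $r_i^{(n)} \in \mathbf{r}^{(n)}$ is the label for edge $\langle y_i^{(n)}, x_i^{(n)} \rangle$.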
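The head selection procedure can be illustrated with a minimal sketch: a softmax over candidate heads, a greedy arg max at inference time, and the negative log-likelihood of the gold heads at training time. The score matrix, the function names, and the toy sentence below are hypothetical and chosen only for illustration; in the actual models the scores would come from a neural scoring function.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def select_heads(score_matrix):
    """Greedy head selection.

    score_matrix[i][j] is an assumed score for token j being the head of
    token i. Row 0 belongs to the root token, which has no head.
    """
    heads = {}
    for i in range(1, len(score_matrix)):
        probs = softmax(score_matrix[i])
        heads[i] = max(range(len(probs)), key=probs.__getitem__)
    return heads

def head_nll(score_matrix, gold_heads):
    """Negative log-likelihood of the gold heads for one sentence."""
    loss = 0.0
    for i, gold in gold_heads.items():
        loss -= math.log(softmax(score_matrix[i])[gold])
    return loss

# Toy sentence: ROOT(0) She(1) wrote(2) novels(3); all scores are made up.
scores = [
    [0.0, 0.0, 0.0, 0.0],   # root row, unused
    [0.1, -5.0, 2.0, 0.3],  # "She": highest score for head "wrote"
    [3.0, 0.2, -5.0, 0.1],  # "wrote": highest score for head ROOT
    [0.0, 0.1, 2.5, -5.0],  # "novels": highest score for head "wrote"
]
pred = select_heads(scores)
loss = head_nll(scores, {1: 2, 2: 0, 3: 2})
```

Because every token picks its head independently, nothing in this sketch prevents cycles; a tree-constrained decoder would be needed for guaranteed well-formed output.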

3.2 Label Classification

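We adopt a simple multi-class classification approach for labeling each unlabeled edge. We define the probability that each label $r$ among all possible labels is assigned to the edge $\langle x_j, x_i \rangle$:

$$P(r \mid x_j, x_i) = \frac{\exp(s_{\mathrm{label}}(r, x_j, x_i))}{\sum_{r' \in \mathcal{R}} \exp(s_{\mathrm{label}}(r', x_j, x_i))} \qquad (4)$$

Here, the scoring function $s_{\mathrm{label}}$ can be any neural network-based scoring function (see Section 4.2).

At inference time, we choose the most likely class label from the set of all possible labels $\mathcal{R}$:

$$\hat{r}_i = \operatorname*{arg\,max}_{r \in \mathcal{R}} P(r \mid \hat{y}_i, x_i) \qquad (5)$$

Here, $\hat{y}_i$ is the head token identified by a head selection model.

At training time, we minimize the negative log-likelihood of training data:

$$\mathcal{L}_{\mathrm{label}} = - \sum_{n=1}^{N} \sum_{i=1}^{T^{(n)}} \log P(r_i^{(n)} \mid y_i^{(n)}, x_i^{(n)}) \qquad (6)$$

Here, $r_i^{(n)}$ is the gold (ground-truth) relation label for gold edge $\langle y_i^{(n)}, x_i^{(n)} \rangle$.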

4 Instance-Based Scoring Methods

For the scoring functions in Eqs. 1 and 4, we describe our proposed instance-based models.

4.1 Edge Scoring

We would like to assign a higher score to the correct edge than other candidates (Eq. 1). Here, we compute similarities between each candidate edge and ground-truth edges in a training set (hereafter training edge). By summing the similarities, we then obtain the score that indicates how likely the candidate edge is the correct one.

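Specifically, we first construct a set of training edges, called the support set, $\mathcal{A}(\mathcal{D})$:

$$\mathcal{A}(\mathcal{D}) = \{\langle y_i, x_i \rangle \mid x_i \in \mathbf{x},\ y_i \in \mathbf{y},\ (\mathbf{x}, \mathbf{y}, \mathbf{r}) \in \mathcal{D}\} \qquad (7)$$

Here, $y_i$ is the ground-truth head token of token $x_i$. We then compute and sum similarities between a candidate edge and each edge in the support set:

$$s_{\mathrm{edge}}(x_j, x_i) = \sum_{\langle y_k, x_k \rangle \in \mathcal{A}(\mathcal{D})} \mathrm{sim}(h_{\langle x_j, x_i \rangle}, h_{\langle y_k, x_k \rangle}) \qquad (8)$$

Here, $h_{\langle x_j, x_i \rangle}, h_{\langle y_k, x_k \rangle} \in \mathbb{R}^{d}$ are $d$-dimensional edge representations (Section 4.3), and $\mathrm{sim}$ is a similarity function. Following recent studies of deep metric learning, we adopt the dot product and the cosine similarity:

$$\mathrm{sim}_{\mathrm{dot}}(a, b) = a^{\top} b, \qquad \mathrm{sim}_{\mathrm{cos}}(a, b) = \tau\, u_a^{\top} u_b, \quad u_a = \frac{a}{\lVert a \rVert},\ u_b = \frac{b}{\lVert b \rVert}$$

The cosine similarity is the same as the dot product between two unit vectors, i.e., $\lVert u_a \rVert = \lVert u_b \rVert = 1$. As we will discuss later in our analysis, this property suppresses the occurrence of hubs, compared with the dot product between unnormalized vectors. Note that, following the previous studies of deep metric learning wang2018cosface; deng2019arcface, we rescale the cosine similarity by using the scaling factor $\tau$, which works as the temperature parameter in the softmax function. (Footnote 5: In our preliminary experiments, we selected $\tau$ from a set of candidate values; Table 2 lists the value used. The prediction accuracy was stable across the candidates.)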
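As a rough illustration of instance-based edge scoring, the sketch below sums the similarities between a candidate edge vector and every vector in a support set, using either the dot product or the rescaled cosine similarity. The vectors are made-up toy values, the function names are hypothetical, and the default `tau=64.0` simply mirrors the rescaling factor listed in Table 2. The helper `top_contributors` shows how the per-edge similarities could be surfaced as rationales.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(a):
    """L2-normalize a vector."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]

def sim_dot(a, b):
    return dot(a, b)

def sim_cos(a, b, tau=64.0):
    """Cosine similarity rescaled by tau (assumed value, see Table 2)."""
    return tau * dot(unit(a), unit(b))

def edge_score(candidate, support, sim):
    """Sum of similarities between a candidate edge and all support edges."""
    return sum(sim(candidate, h) for h in support)

def top_contributors(candidate, support, sim, k=2):
    """Indices of the k support edges with the highest similarity."""
    sims = [(sim(candidate, h), idx) for idx, h in enumerate(support)]
    return [idx for _, idx in sorted(sims, reverse=True)[:k]]

# Toy support set of gold training edges (hypothetical 2-d representations).
support = [[1.0, 0.0], [0.8, 0.6], [0.0, 2.0]]
cand_a = [2.0, 0.1]    # points roughly where the support edges point
cand_b = [-1.0, 0.2]   # points away from most support edges

score_a = edge_score(cand_a, support, sim_cos)
score_b = edge_score(cand_b, support, sim_cos)
rationale = top_contributors(cand_a, support, sim_cos, k=1)
```

In this toy setting `cand_a` receives the higher score, and the support edge at index 0 is the one that contributes most to it.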

4.2 Label Scoring

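Similarly to the scoring function above, we also design our instance-based label scoring function in Eq. 4. We first construct a support set $\mathcal{A}(\mathcal{D}; r)$ for each relation label $r \in \mathcal{R}$:

$$\mathcal{A}(\mathcal{D}; r) = \{\langle y_i, x_i \rangle \mid x_i \in \mathbf{x},\ y_i \in \mathbf{y},\ r_i \in \mathbf{r},\ r_i = r,\ (\mathbf{x}, \mathbf{y}, \mathbf{r}) \in \mathcal{D}\} \qquad (9)$$

Here, only the edges with label $r$ are collected from the training set. We then compute and sum similarities between a candidate edge and each edge of the support set:

$$s_{\mathrm{label}}(r, x_j, x_i) = \sum_{\langle y_k, x_k \rangle \in \mathcal{A}(\mathcal{D}; r)} \mathrm{sim}(h_{\langle x_j, x_i \rangle}, h_{\langle y_k, x_k \rangle}) \qquad (10)$$

Here is the intuition: if the edge is more similar to the edges with label $r$ than those with other labels, the edge is more likely to have the label $r$.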
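The label scoring intuition can be sketched as follows, assuming edge representations are already available as plain vectors. The training edges, labels, and function names are hypothetical; the dot product stands in for the similarity function.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_support_sets(training_edges):
    """Group training edge vectors by their gold relation label."""
    support = {}
    for vec, label in training_edges:
        support.setdefault(label, []).append(vec)
    return support

def label_scores(candidate, support):
    """Per-label sum of similarities to the support edges of that label."""
    return {label: sum(dot(candidate, h) for h in vecs)
            for label, vecs in support.items()}

def predict_label(candidate, support):
    scores = label_scores(candidate, support)
    return max(scores, key=scores.get)

# Toy training edges: (edge representation, gold label); values are made up.
training_edges = [
    ([1.0, 0.0], "obj"),
    ([0.9, 0.1], "obj"),
    ([0.0, 1.0], "det"),
]
support = build_support_sets(training_edges)
candidate = [0.8, 0.2]
scores = label_scores(candidate, support)
predicted = predict_label(candidate, support)
```

Note that the score is a sum rather than a mean, so labels with many support edges can accumulate larger scores; that behavior follows the summation described above and is not corrected for in this sketch.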

4.3 Edge Representation

In the proposed models (Eqs. 8 and 10), we use $d$-dimensional edge representations. We define the representation for each edge $\langle x_j, x_i \rangle$ as follows:

$$h_{\langle x_j, x_i \rangle} = f(h_i^{\mathrm{dep}}, h_j^{\mathrm{head}}) \qquad (11)$$

Here, $h_i^{\mathrm{dep}}, h_j^{\mathrm{head}} \in \mathbb{R}^{d}$ are $d$-dimensional feature vectors for the dependent and head, respectively. These vectors are created from a neural encoder (Section 5.2). When designing $f$, it is desirable to capture the interaction between the two vectors. By referring to the insights into feature representations of relations on knowledge bases bordes2013translating; yang2014embedding; nickel2016holographic, we adopt a multiplicative composition, a major composition technique for two vectors (Footnote 6: In our preliminary experiments, we also tried an additive composition and the concatenation of the two vectors. The accuracies of these techniques for unlabeled dependency parsing, however, were both much lower than that of the multiplicative composition.):

$$f(h_i^{\mathrm{dep}}, h_j^{\mathrm{head}}) = W (h_i^{\mathrm{dep}} \odot h_j^{\mathrm{head}})$$

Here, the interaction between $h_i^{\mathrm{dep}}$ and $h_j^{\mathrm{head}}$ is captured by element-wise multiplication $\odot$. These are composed into one vector, which is then transformed by a weight matrix $W$ into $h_{\langle x_j, x_i \rangle}$.
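A minimal sketch of the multiplicative composition: the dependent and head vectors are multiplied element-wise, and the result is transformed by a weight matrix. The matrix `W` and both input vectors below are arbitrary toy values, not learned parameters, and the function names are hypothetical.

```python
def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]

def matvec(matrix, vec):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def edge_representation(h_dep, h_head, weight):
    """Compose dependent and head features into one edge vector."""
    return matvec(weight, hadamard(h_dep, h_head))

# Toy values; in practice these would come from the encoder and training.
W = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]
h_dep = [1.0, 2.0, 3.0]
h_head = [0.5, 1.0, -1.0]
h_edge = edge_representation(h_dep, h_head, W)
```

Here the element-wise product is `[0.5, 2.0, -3.0]`, so the resulting edge vector is `[-2.5, 4.0]`.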

4.4 Fast Mode

Do users want rationales for all the predictions? Maybe not. In many cases, all they want to do is to parse sentences as fast as possible. Only when they find a suspicious prediction will they check the rationale for it. To fulfill this demand, our parser provides two modes: (i) explainable mode and (ii) fast mode. The explainable mode, as described in the previous subsections, can exhibit similar training instances as rationales, but its time complexity depends on the size of the training set. By contrast, the fast mode does not provide rationales, but instead, it enables faster parsing than the explainable mode and outputs exactly the same predictions as the explainable mode. Thus, at test time, users can freely switch between the modes: e.g., they first use the fast mode, and if they find a suspicious prediction, they then use the explainable mode to obtain the rationale for it.

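Formally, when using the dot product or the cosine similarity as the similarity function in Eq. 8, the explainable mode can be rewritten as the fast mode:

$$s_{\mathrm{edge}}(x_j, x_i) = \sum_{\langle y_k, x_k \rangle \in \mathcal{A}(\mathcal{D})} h_{\langle x_j, x_i \rangle}^{\top} h_{\langle y_k, x_k \rangle} = h_{\langle x_j, x_i \rangle}^{\top} \sum_{\langle y_k, x_k \rangle \in \mathcal{A}(\mathcal{D})} h_{\langle y_k, x_k \rangle} = h_{\langle x_j, x_i \rangle}^{\top} h^{\mathrm{sum}}_{\mathcal{A}(\mathcal{D})} \qquad (12)$$

where $h^{\mathrm{sum}}_{\mathcal{A}(\mathcal{D})} = \sum_{\langle y_k, x_k \rangle \in \mathcal{A}(\mathcal{D})} h_{\langle y_k, x_k \rangle}$. For the cosine similarity, each vector is replaced by its unit vector and the result is multiplied by $\tau$. In this way, once you sum all the vectors in the training set $\mathcal{D}$, you can reuse the summed vector without searching the training set again. At test time, you can precompute this summed vector before running the model on a test set, which reduces the exhaustive similarity computation over the training set to the simple dot product between the two vectors $h_{\langle x_j, x_i \rangle}$ and $h^{\mathrm{sum}}_{\mathcal{A}(\mathcal{D})}$. (Footnote 7: In the same way as Eq. 12, we can transform $s_{\mathrm{label}}$ in Eq. 10 to the fast mode.)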
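The equivalence between the two modes follows from the linearity of the dot product and can be checked numerically. The sketch below uses randomly generated toy vectors (not real edge representations) and hypothetical function names; it compares the per-instance summation with a single dot product against a precomputed summed vector, for both the dot product and the rescaled cosine similarity (with an assumed scaling factor of 64, as in Table 2).

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(a):
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]

def vec_sum(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]

def explainable_score(candidate, support):
    # One similarity per training edge: cost grows with the support size.
    return sum(dot(candidate, h) for h in support)

def fast_score(candidate, h_sum):
    # A single dot product against the precomputed summed vector.
    return dot(candidate, h_sum)

random.seed(0)
dim = 8
support = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]
candidate = [random.gauss(0, 1) for _ in range(dim)]

# Dot-product similarity.
h_sum = vec_sum(support)
s_slow = explainable_score(candidate, support)
s_fast = fast_score(candidate, h_sum)

# Rescaled cosine similarity: normalize first, then sum the unit vectors.
tau = 64.0
unit_support = [unit(h) for h in support]
u_sum = vec_sum(unit_support)
c_slow = sum(tau * dot(unit(candidate), u) for u in unit_support)
c_fast = tau * dot(unit(candidate), u_sum)
```

For the cosine case, the support vectors must be normalized before they are summed; normalizing the summed vector afterwards would give a different score.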

4.5 Relations to Existing Models

The closest models to ours.

wiseman-stratos-2019-label and ouchi-etal-2020-instance proposed instance-based models using Neighbourhood Components Analysis (NCA) goldberger2005neighbourhood for sequence labeling. Given an input sentence of $T$ tokens, $\mathbf{x} = (x_1, \dots, x_T)$, the model first computes the probability that a token (or span) $x_i$ in the sentence selects each of all the tokens $x_j$ in the training set $\mathcal{D}$ as its neighbor:

$$P(x_j \mid x_i) = \frac{\exp(\mathrm{sim}(x_i, x_j))}{\sum_{x_k \in \mathcal{D}} \exp(\mathrm{sim}(x_i, x_k))}$$

The model then constructs a set of only the tokens associated with a sequence label $\ell$: $\mathcal{X}(\mathcal{D}; \ell) = \{x_j \mid x_j \in \mathcal{D},\ \ell_j = \ell\}$, and computes the probability that each token $x_i$ will be assigned the label $\ell$:

$$P(\ell \mid x_i) = \sum_{x_j \in \mathcal{X}(\mathcal{D}; \ell)} P(x_j \mid x_i)$$

The point is that while our models first sum the similarities (Eq. 8) and then put the summed score into exponential form as $\exp(\sum \mathrm{sim}(\cdot))$, their model puts each similarity into exponential form as $\sum \exp(\mathrm{sim}(\cdot))$ before the summation. The different order of applying the exponential function makes it impossible to rewrite their model as the fast mode, so their model always has to compare a token to each token of the training set $\mathcal{D}$. This is the biggest difference between their model and ours. While we leave the performance comparison between the NCA-based models and ours for future work, our models have an advantage over the NCA-based models in that our models offer two options, the explainable and fast modes.
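A small numerical sketch can make the ordering argument concrete. Two hypothetical support sets below have the same summed vector, so a sum-then-exponentiate score cannot tell them apart, whereas an exponentiate-then-sum score (the unnormalized numerator of an NCA-style probability) differs between them. This is why the latter cannot be computed from a precomputed summed vector alone. The vectors and function names are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vec_sum(vectors):
    return [sum(col) for col in zip(*vectors)]

def sum_then_exp(query, support):
    """exp of the summed similarities: depends only on the summed vector."""
    return math.exp(dot(query, vec_sum(support)))

def exp_then_sum(query, support):
    """Sum of exponentiated similarities: depends on every support vector."""
    return sum(math.exp(dot(query, h)) for h in support)

# Two toy support sets with identical sums but different members.
support_a = [[1.0, 0.0], [-1.0, 0.0]]
support_b = [[0.0, 0.0], [0.0, 0.0]]
query = [2.0, 0.0]

same_sum = vec_sum(support_a) == vec_sum(support_b)
ours_a, ours_b = sum_then_exp(query, support_a), sum_then_exp(query, support_b)
nca_a, nca_b = exp_then_sum(query, support_a), exp_then_sum(query, support_b)
```

Here `ours_a` and `ours_b` are both 1.0, while `nca_a` equals e^2 + e^-2 and `nca_b` equals 2.0.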

Standard models using weights.

Typically, neural models use the following scoring functions:

$$s_{\mathrm{edge}}(x_j, x_i) = \mathrm{sim}(w, h_{\langle x_j, x_i \rangle}) \qquad (13)$$

$$s_{\mathrm{label}}(r, x_j, x_i) = \mathrm{sim}(w_r, h_{\langle x_j, x_i \rangle}) \qquad (14)$$

Here, $w$ is a learnable weight vector and $w_r$ is a learnable weight vector associated with label $r$. In previous work zhang-etal-2017-dependency-parsing, this form is used for dependency parsing. We call such models weight-based models. caruana1999case proposed to combine weight-based models with instance-based inference: at inference time, the weights are discarded, and only the trained encoder is used to extract feature representations for instance-based inference. Such a combination has been reported to be effective for image recognition ranjan2017l2; liu2017sphereface; wang2018cosface; deng2019arcface. In dependency parsing, it has not been investigated. Since such a combination can be a promising method, we investigate its utility (Section 6).

5 Experimental Setup

5.1 Data

Language   Treebank    Family   Order   Train
Arabic     PADT        non-IE   VSO     6.1k
Basque     BDT         non-IE   SOV     5.4k
Chinese    GSD         non-IE   SVO     4.0k
English    EWT         IE       SVO     12.5k
Finnish    TDT         non-IE   SVO     12.2k
Hebrew     HTB         non-IE   SVO     5.2k
Hindi      HDTB        IE       SOV     13.3k
Italian    ISDT        IE       SVO     13.1k
Japanese   GSD         non-IE   SOV     7.1k
Korean     GSD         non-IE   SOV     4.4k
Russian    SynTagRus   IE       SVO     48.8k
Swedish    Talbanken   IE       SVO     4.3k
Turkish    IMST        non-IE   SOV     3.7k
Table 1: Dataset information. “Family” indicates if Indo-European (IE) or not. “Order” indicates dominant word orders according to WALS haspelmath2005world. “Train” is the number of training sentences.

We use English PennTreebank (PTB) marcus-etal-1993-building and Universal Dependencies (UD) mcdonald-etal-2013-universal. Following previous studies kulmizev2019deep; smith2018investigation; de2017old, we choose a variety of 13 languages from UD v2.7 (Footnote 8: These languages were selected to cover different language families, degrees of morphological complexity, training sizes, and domains.). Table 1 shows information about each dataset. We follow the standard training-development-test splits.

5.2 Neural Encoder Architecture

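To compute $h^{\mathrm{dep}}$ and $h^{\mathrm{head}}$ (in Eq. 11), we adopt the encoder architecture proposed by dozat2017deep. First, we map the input sequence $\mathbf{x} = (x_0, \dots, x_T)$ to a sequence of token representations, $\mathbf{h}^{\mathrm{token}}_{0:T} = (h^{\mathrm{token}}_0, \dots, h^{\mathrm{token}}_T)$, each of which is $h^{\mathrm{token}}_t = e^{\mathrm{word}}_t \oplus e^{\mathrm{char}}_t \oplus e^{\mathrm{bert}}_t$, where $\oplus$ denotes concatenation and $e^{\mathrm{word}}_t$, $e^{\mathrm{char}}_t$, and $e^{\mathrm{bert}}_t$ are computed by word embeddings, character-level CNN, and BERT devlin-etal-2019-bert, respectively. Second, the sequence $\mathbf{h}^{\mathrm{token}}_{0:T}$ is fed to a bidirectional LSTM (BiLSTM) graves:13 for computing contextual ones: $\mathbf{h}^{\mathrm{lstm}}_{0:T} = (h^{\mathrm{lstm}}_0, \dots, h^{\mathrm{lstm}}_T)$. Finally, $h^{\mathrm{lstm}}_t$ is transformed as $h^{\mathrm{dep}}_t = W^{\mathrm{dep}} h^{\mathrm{lstm}}_t$ and $h^{\mathrm{head}}_t = W^{\mathrm{head}} h^{\mathrm{lstm}}_t$, where $W^{\mathrm{dep}}$ and $W^{\mathrm{head}}$ are parameter matrices.

(Footnote 9: We use the gold tokenized sequences in PTB and UD.)

(Footnote 10: For PTB, we use 300 dimensional GloVe pennington-etal-2014-glove. For UD, we use 300 dimensional fastText grave2018learning. During training, we fix them.)

(Footnote 11: We first conduct subword segmentation for each token of the input sequence. Then, the BERT encoder takes as input the subword-segmented sequences and computes the representation for each subword. Here, we use the (last layer) representation of the first subword within each token as its token representation. For PTB, we use “BERT-Base, Cased.” For UD, we use “BERT-Base, Multilingual Cased.”)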

5.3 Mini-Batching

We train models with the mini-batch stochastic gradient descent method. To make the current mini-batch at each time step, we follow a standard technique for training instance-based models hadsell2006dimensionality; oord2018representation.

At training time, we make the mini-batch that consists of query and support sentences at each time step. A model encodes the sentences and computes the edge representations used for computing similarities between each candidate edge in the query sentences and each gold edge in the support sentences. Here, due to the memory limitation of GPUs, we randomly sample a subset $\mathcal{D}'$ from the training set $\mathcal{D}$ at each time step: i.e., $\mathcal{D}' \subset \mathcal{D}$. In edge identification, for query sentences, we randomly sample a subset of sentences from $\mathcal{D}$. For support sentences, we randomly sample another subset $\mathcal{D}'_s$ of sentences from $\mathcal{D}$, and construct and use the support set $\mathcal{A}(\mathcal{D}'_s)$ instead of $\mathcal{A}(\mathcal{D})$ in Eq. 7. In label classification, we would like to guarantee that the support set in every mini-batch always contains at least one edge for each label. To do so, we randomly sample a subset $\mathcal{A}'(\mathcal{D}; r)$ of edges from the support set for each label $r$: i.e., $\mathcal{A}'(\mathcal{D}; r) \subseteq \mathcal{A}(\mathcal{D}; r)$ in Eq. 9. Note that each edge $\langle y_i^{(n)}, x_i^{(n)} \rangle$ is in the $n$-th sentence $\mathbf{x}^{(n)}$ in the training set $\mathcal{D}$, so we put the sentence $\mathbf{x}^{(n)}$ into the mini-batch to compute the representation for the edge. In practice, we use a fixed number of query sentences in both edge identification and label classification, a fixed number of support sentences in edge identification, and a fixed number of support edges (sentences) for each label in label classification. As a result, the whole mini-batch consists of the query sentences plus the support sentences.
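The sampling scheme can be sketched as below. The batch sizes, the toy label sequences, and the function names are all hypothetical; the sketch only shows the mechanics of drawing query and support sentences and of guaranteeing at least one support edge per label.

```python
import random
from collections import defaultdict

def sample_edge_batch(num_sentences, n_query, n_support, rng):
    """Sample sentence indices for the query and support parts of a batch."""
    query = rng.sample(range(num_sentences), n_query)
    support = rng.sample(range(num_sentences), n_support)
    return query, support

def sample_label_support(sentences, edges_per_label, rng):
    """Sample support edges per label so that every label is represented.

    sentences[n] is the list of gold relation labels of sentence n; an edge
    is identified by (sentence index, token index).
    """
    by_label = defaultdict(list)
    for n, labels in enumerate(sentences):
        for i, label in enumerate(labels):
            by_label[label].append((n, i))
    return {label: rng.sample(edges, min(edges_per_label, len(edges)))
            for label, edges in by_label.items()}

# Toy training set: one label per token (made-up data).
sentences = [
    ["nsubj", "root", "obj"],
    ["det", "nsubj", "root"],
    ["root", "obj"],
]
rng = random.Random(0)
query_ids, support_ids = sample_edge_batch(len(sentences), 2, 1, rng)
label_support = sample_label_support(sentences, 1, rng)

# Sentences that must be encoded to obtain the sampled support edges.
support_sentence_ids = sorted({n for edges in label_support.values()
                               for n, _ in edges})
```

Sampling per label rather than per sentence is what ensures that rare labels always appear in the support set of a mini-batch.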

At test time, we encode each test (query) sentence and compute the representation for each candidate edge on-the-fly. The representation is then compared to the precomputed support edge representation, $h^{\mathrm{sum}}$ in Eq. 12. To precompute $h^{\mathrm{sum}}$, we first encode all the training sentences and obtain the edge representations. Then, in edge identification, we sum all of them and obtain one support edge representation $h^{\mathrm{sum}}_{\mathcal{A}(\mathcal{D})}$. In label classification, similarly, we sum only the edge representations with label $r$ and obtain one support representation $h^{\mathrm{sum}}_{\mathcal{A}(\mathcal{D}; r)}$ for each label. (Footnote 14: The total number of the support edge representations is equal to the size of the label set $|\mathcal{R}|$.)
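A minimal sketch of the test-time precomputation for label classification, assuming the training edge representations have already been produced by an encoder: the vectors are summed per label once, and each query edge is then scored with one dot product per label. The vectors, labels, and function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vec_sum(vectors):
    return [sum(col) for col in zip(*vectors)]

def precompute_label_sums(training_edges):
    """Sum the training edge vectors of each label into one vector."""
    grouped = {}
    for vec, label in training_edges:
        grouped.setdefault(label, []).append(vec)
    return {label: vec_sum(vecs) for label, vecs in grouped.items()}

def predict_label_fast(candidate, label_sums):
    """One dot product per label against the precomputed sums."""
    return max(label_sums, key=lambda label: dot(candidate, label_sums[label]))

# Toy training edges: (edge representation, gold label); values are made up.
training_edges = [
    ([1.0, 0.0], "obj"),
    ([0.9, 0.1], "obj"),
    ([0.0, 1.0], "det"),
]
label_sums = precompute_label_sums(training_edges)  # done once, offline
prediction = predict_label_fast([0.8, 0.2], label_sums)
```

After precomputation, the number of stored support vectors equals the number of labels, independent of the training set size.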

5.4 Training Configuration

Name               Value
Word Embedding     GloVe (PTB) / fastText (UD)
CNN window size    3
CNN filters        30
BiLSTM layers      2
BiLSTM units       300 dimensions
Optimization       Adam
Learning rate      0.001
Rescaling factor   64
Dropout ratio      {0.1, 0.2, 0.3}
Table 2: Hyperparameters used in the experiments.

Table 2 lists the hyperparameters. To optimize the parameters, we use Adam kingma:14. The initial learning rate is 0.001 (Table 2) and is updated on each epoch as a function of the number of completed epochs. We apply gradient clipping pascanu2013difficulty. We save the parameters that achieve the best score on each development set and evaluate them on each test set. It takes less than one day to train on a single Tesla V100 GPU of an NVIDIA DGX-1.

6 Results and Discussion

6.1 Prediction Accuracy on Benchmark Tests

Learning                      Weight-based  Weight-based  Weight-based    Weight-based    Instance-based  Instance-based
Inference                     Weight-based  Weight-based  Instance-based  Instance-based  Instance-based  Instance-based
System ID      Kulmizev+’19   WWd           WWc           WId             WIc             IId             IIc
PTB-English    –              96.4/95.3     96.4/95.3     96.4/94.4       93.0/91.8       96.4/95.3       96.4/95.3
UD-Average     –/84.9         89.0/85.6     89.0/85.6     89.0/85.2       83.0/79.5       89.3/85.7       89.0/85.5
UD-Arabic      –/81.8         87.8/82.1     87.8/82.1     87.8/81.6       84.9/79.0       88.0/82.1       87.6/81.9
UD-Basque      –/79.8         84.9/81.1     84.9/80.9     84.9/80.6       82.0/77.9       85.1/80.9       85.0/80.8
UD-Chinese     –/83.4         85.6/82.3     85.8/82.4     85.7/81.6       80.9/77.3       86.3/82.8       85.9/82.5
UD-English     –/87.6         90.9/88.1     90.7/88.0     90.9/87.8       88.1/85.3       91.1/88.3       91.0/88.2
UD-Finnish     –/83.9         89.4/86.6     89.1/86.3     89.3/86.1       84.1/81.2       89.6/86.6       89.4/86.4
UD-Hebrew      –/85.9         89.4/86.4     89.5/86.5     89.4/85.9       82.7/79.7       89.8/86.7       89.6/86.6
UD-Hindi       –/90.8         94.8/91.7     94.8/91.7     94.8/91.4       91.4/88.0       94.9/91.8       94.9/91.6
UD-Italian     –/91.7         94.1/92.0     94.2/92.1     94.1/91.9       91.5/89.4       94.3/92.2       94.1/92.0
UD-Japanese    –/92.1         94.3/92.8     94.5/93.0     94.3/92.7       92.5/90.9       94.6/93.1       94.4/92.8
UD-Korean      –/84.2         88.0/84.4     87.9/84.3     88.0/84.2       84.3/80.4       88.1/84.4       88.2/84.5
UD-Russian     –/91.0         94.2/92.7     94.1/92.7     94.2/92.4       57.7/56.5       94.3/92.8       94.1/92.6
UD-Swedish     –/86.9         90.3/87.6     90.3/87.5     90.4/87.1       88.6/85.8       90.5/87.5       90.4/87.5
UD-Turkish     –/64.9         73.0/65.3     73.2/65.4     73.1/64.5       69.9/61.9       73.7/65.5       72.9/64.7
Table 3: Comparison between weight-based and instance-based systems. Cells show unlabeled attachment scores (UAS) before the slash and labeled attachment scores (LAS) after the slash on each test set. System IDs stand for the first letters of the options: e.g., WId stands for “W”eight-based learning and “I”nstance-based inference using the “d”ot product. The system ID, Kulmizev+’19, is the graph-based parser with BERT in kulmizev2019deep.
             Weight-Based      Instance-Based
Emails       81.7    81.7      81.6    81.4
Newsgroups   83.1    83.3      83.1    82.9
Reviews      88.5    88.7      88.7    88.8
Weblogs      81.9    80.9      80.9    81.9
Average      83.8    83.7      83.6    83.8
Table 4: UAS in out-of-domain settings, where each model is trained on the source domain “Yahoo! Answers” and tested on each of the four target domains.
Emails       81.5    81.4    81.5    81.5
Newsgroups   82.8    83.0    82.9    82.9
Reviews      88.7    88.7    88.8    88.8
Weblogs      81.8    82.1    82.0    81.9
Average      83.7    83.8    83.8    83.8
Table 5: UAS by the instance-based system using the cosine similarity (IIc) and randomly sampled support training sentences.

We report averaged unlabeled attachment scores (UAS) and labeled attachment scores (LAS) across three different runs of the model training with random seeds. We compare 6 systems, each of which consists of two models for edge identification and label classification, respectively. For reference, we list the results by the graph-based parser with BERT in kulmizev2019deep, whose architecture is the most similar to ours.
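For reference, the two metrics can be computed as in the following sketch: UAS counts tokens whose predicted head is correct, and LAS counts tokens whose predicted head and label are both correct. The toy annotations and the function name are made up for illustration; evaluation details such as punctuation handling are not covered.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent.

    gold and pred are equal-length lists of (head index, relation label),
    one entry per token.
    """
    assert len(gold) == len(pred)
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total

# Toy four-token sentence: one wrong label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "det")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (1, "det")]
uas, las = attachment_scores(gold, pred)
```

In this example three of four heads are correct (UAS 75.0) and two of four head-label pairs are correct (LAS 50.0).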

Table 3 shows UAS and LAS by these systems. The systems WWd and WWc are the standard ones that consistently use the weight-based scores (Eqs. 13 and 14) during learning and inference. Between these systems, the choice of similarity function does not produce a gap in accuracy. In other words, the dot product and the cosine similarity are on par in terms of accuracy. The systems WId and WIc use the weight-based scores during learning and the instance-based ones during inference. While the system WId, which uses the dot product, achieved UAS and LAS competitive with those of the standard weight-based system WWd, the system WIc, which uses the cosine similarity, achieved lower accuracies than the system WWc. The systems IId and IIc consistently use the instance-based scores during learning and inference. Both of them achieved accuracies competitive with those of the standard weight-based systems WWd and WWc.

Out-of-domain robustness.

We evaluate the robustness of our instance-based models in out-of-domain settings by using the five domains of UD-English: we train each model on the training set of the source domain “Yahoo! Answers” and test it on each test set of the target domains, Emails, Newsgroups, Reviews and Weblogs. As Table 4 shows, the out-of-domain robustness of our instance-based models is comparable to the weight-based models. This tendency is observed when using different source domains.

Sensitivity to the number of support sentences for inference.

In the experiments above, we used all the training sentences for support sentences at test time. What if we reduce the number of support sentences? Here, in the same out-of-domain settings above, we evaluate the instance-based system using the cosine similarity IIc with support sentences randomly sampled at each time step. Intuitively, if using a smaller number of randomly sampled support sentences, the prediction accuracies would drop. Surprisingly, however, Table 5 shows that the accuracies do not drop even when the number of support sentences is reduced. This tendency is also observed when using the other three systems WId, WIc and IId. One possible reason for it is that the feature space is appropriately learned: i.e., because positive edges are close to each other and far from negative edges in the feature space, the accuracies do not drop even if randomly sampling a single support sentence and using the edges.

6.2 Sanity Check for Plausible Explanations

Figure 1: Valid and invalid examples of unlabeled edges for the identical subclass test.

It is an open question how to evaluate the “plausibility” of explanations: i.e., whether or not the retrieved instances as explanations are convincing for humans. As a reasonable compromise, hanawa2021evaluation designed the identical subclass test for evaluating the plausibility. This test is based on a minimal requirement that interpretable models should at least satisfy: training instances to be presented as explanations should belong to the same latent (sub)class as the test instance. Consider the examples in Figure 1. The predicted unlabeled edge “wrote novels” in the test sentence has the (unobserved) latent label, obj. To this edge, two training instances are given as explanations: the upper one seems more convincing than the lower one because “published books” has the same latent label, obj, as that of “wrote novels” while “novels the” has a different one, det. As these examples show, the agreement between the latent classes is likely to correlate with plausibility. Note that this test is not perfect for the plausibility assessment, but it works as a sanity check for verifying whether models make obvious violations in terms of plausibility.

This test can be used for assessing unlabeled parsing models because the (unobserved) relation labels can be regarded as the latent subclasses of positive unlabeled edges. We follow three steps: (i) identifying unlabeled edges in a development set; (ii) retrieving the nearest training edge for each identified edge; (iii) calculating LAS, i.e., if the labels of the query and retrieved edges are identical, we regard them as correct. (Footnote 15: If the parsed edge is incorrect, we regard it as incorrect.)
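The three steps can be sketched as follows, under the assumption that each parsed edge comes with a representation vector, a flag saying whether its head is correct, and its gold label. The nearest training edge is retrieved by cosine similarity here, the data are made-up toy values, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def identical_subclass_score(queries, support):
    """Percentage of query edges whose nearest training edge shares the label.

    queries: list of dicts with keys "vec", "head_correct", "label".
    support: list of (vector, label) pairs for gold training edges.
    An edge with an incorrect head always counts as incorrect.
    """
    correct = 0
    for query in queries:
        if not query["head_correct"]:
            continue
        _, nearest_label = max(support,
                               key=lambda s: cosine(query["vec"], s[0]))
        if nearest_label == query["label"]:
            correct += 1
    return 100.0 * correct / len(queries)

support = [([1.0, 0.0], "obj"), ([0.0, 1.0], "det")]
queries = [
    {"vec": [0.9, 0.1], "head_correct": True, "label": "obj"},   # match
    {"vec": [0.2, 0.8], "head_correct": True, "label": "obj"},   # mismatch
    {"vec": [0.1, 0.9], "head_correct": True, "label": "det"},   # match
    {"vec": [1.0, 0.0], "head_correct": False, "label": "obj"},  # wrong head
]
score = identical_subclass_score(queries, support)
```

With these toy inputs, two of the four query edges are counted as correct, giving a score of 50.0.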

Learning       Weight-Based      Instance-Based
System ID      WId     WIc       IId     IIc
PTB-English    1.8     67.5      7.0     71.6
UD-English     16.4    51.5      3.9     54.0
Table 6: Results of the identical subclass test. Each cell indicates labeled attachment scores (LAS) on each development set. All the models are trained with head selection supervision and without labeling supervision.

Table 6 shows LAS on PTB and UD-English. The systems using instance-based inference with the cosine similarity, WIc and IIc, succeeded in retrieving the support training edges with the same label as the queries. Surprisingly, the system IIc achieved over 70% LAS on PTB without label supervision. The results suggest that systems using instance-based inference with the cosine similarity meet the minimal requirement, and the retrieved edges are promising as plausible explanations.