Factoring Statutory Reasoning as Language Understanding Challenges

by Nils Holzenberger, et al.
Johns Hopkins University

Statutory reasoning is the task of determining whether a legal statute, stated in natural language, applies to the text description of a case. Prior work introduced a resource that approached statutory reasoning as a monolithic textual entailment problem, with neural baselines performing nearly at-chance. To address this challenge, we decompose statutory reasoning into four types of language-understanding challenge problems, through the introduction of concepts and structure found in Prolog programs. Augmenting an existing benchmark, we provide annotations for the four tasks, and baselines for three of them. Models for statutory reasoning are shown to benefit from the additional structure, improving on prior baselines. Further, the decomposition into subtasks facilitates finer-grained model diagnostics and clearer incremental progress.






1 Introduction

As more data becomes available, Natural Language Processing (NLP) techniques are increasingly being applied to the legal domain, including for the prediction of case outcomes (xiao2018cail; vacek-etal-2019-litigation; chalkidis-etal-2019-neural). In the US, cases are decided based on previous case outcomes, but also on the legal statutes compiled in the US code. For our purposes, a case is a set of facts described in natural language, as in Figure 1, in blue. The US code is a set of documents called statutes, themselves decomposed into subsections. Taken together, subsections can be viewed as a body of interdependent rules specified in natural language, prescribing how case outcomes are to be determined. Statutory reasoning is the task of determining whether a given subsection of a statute applies to a given case, where both are expressed in natural language. Subsections are implicitly framed as predicates, which may be true or false of a given case. holzenberger2020dataset introduced SARA, a benchmark for the task of statutory reasoning, as well as two different approaches to solving this problem. First, a manually-crafted symbolic reasoner based on Prolog is shown to perfectly solve the task, at the expense of experts writing the Prolog code and translating the natural language case descriptions into Prolog-understandable facts. The second approach is based on statistical machine learning models. While these models can be induced computationally, they perform poorly because the complexity of the task far surpasses the amount of training data available.

We posit that statutory reasoning as presented to statistical models is underspecified, in that it was cast as Recognizing Textual Entailment (dagan2005pascal) and linear regression. Taking inspiration from the structure of Prolog programs, we re-frame statutory reasoning as a sequence of four tasks, prompting us to introduce a novel extension of the SARA dataset (Section 2), referred to as SARA v2. Beyond improving the model’s performance, as shown in Section 3, the additional structure makes it more interpretable, and so more suitable for practical applications. We put our results in perspective in Section 4 and review related work in Section 5.

2 SARA v2

The symbolic solver requires experts translating the statutes and each new case’s description into Prolog. In contrast, a machine learning-based model has the potential to generalize to unseen cases and to changing legislation, a significant advantage for a practical application. In the following, we argue that legal statutes share features with the symbolic solver’s first-order logic. We formalize this connection in a series of four challenge tasks, described in this section, and depicted in Figure 1. We hope they provide structure to the problem, and a more efficient inductive bias for machine learning algorithms. The annotations mentioned throughout the remainder of this section were developed by the authors, entirely by hand, with regular guidance from a legal scholar (the dataset can be found under https://nlp.jhu.edu/law/). Examples for each task are given in Appendix A. Statistics are shown in Figure 2 and further detailed in Appendix B.

Figure 1: Decomposing statutory reasoning into four tasks. The flowchart on the right indicates the ordering, inputs and outputs of the tasks. In the statutes in the yellow box, argument placeholders are underlined, and superscripts indicate argument coreference. The green box shows the logical structure of the statutes just above it. In blue are two examples of argument instantiation.

Argument identification

This first task, in conjunction with the second, aims to identify the arguments of the predicate that a given subsection represents. Some terms in a subsection refer to something concrete, such as “the United States” or “April 24th, 2017”. Other terms can take a range of values depending on the case at hand, and act as placeholders. For example, in the top left box of Figure 1, the terms “a taxpayer” and “the taxable year” can take different values based on the context, while the terms “section 152” and “this paragraph” have concrete, immutable values. Formally, given a sequence of tokens $x_1, \dots, x_n$, the task is to return a set of start and end indices $(s, e) \in \{1, \dots, n\}^2$, where each pair represents a span. We borrow from the terminology of predicate argument alignment (roth2012aligning; wolfe-etal-2013-parma) and call these placeholders arguments. The first task, which we call argument identification, is tagging which parts of a subsection denote such placeholders. We provide annotations for argument identification as character-level spans representing arguments. Since each span is a pointer to the corresponding argument, we made each span the shortest meaningful phrase. Figure 2(b) shows corpus statistics about placeholders.

Argument coreference

Some arguments detected in the previous task may appear multiple times within the same subsection. For instance, in the top left of Figure 1, the variable representing the taxpayer in §2(a)(1)(B) is referred to twice. We refer to the task of resolving this coreference problem at the level of the subsection as argument coreference. While this coreference can span across subsections, as is the case in Figure 1, we intentionally leave it to the next task. Keeping the notation of the above paragraph, given a set of spans $\{(s_i, e_i)\}_{i=1}^{S}$, the task is to return a matrix $C \in \{0, 1\}^{S \times S}$ where $C_{i,j} = 1$ if spans $(s_i, e_i)$ and $(s_j, e_j)$ denote the same variable, and $C_{i,j} = 0$ otherwise. Corpus statistics about argument coreference can be found in Figure 2(a). After these first two tasks, we can extract a set of arguments for every subsection. In Figure 1, for §2(a)(1)(A), that would be {Taxp, Taxy, Spouse, Years}, as shown in the bottom left of Figure 1.
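To make the output format of argument coreference concrete, here is a minimal sketch that builds the binary matrix from a clustering of spans. The function name, the span offsets and the cluster assignment are illustrative assumptions, not part of the SARA v2 release.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) offsets of one placeholder


def coreference_matrix(spans: List[Span], clusters: List[List[int]]) -> List[List[int]]:
    """Return C with C[i][j] = 1 iff spans i and j denote the same argument."""
    cluster_of: Dict[int, int] = {}
    for cluster_id, members in enumerate(clusters):
        for index in members:
            cluster_of[index] = cluster_id
    n = len(spans)
    return [
        [1 if cluster_of[i] == cluster_of[j] else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical subsection with three placeholder spans; spans 0 and 2 corefer.
spans = [(4, 14), (20, 36), (50, 62)]
clusters = [[0, 2], [1]]
matrix = coreference_matrix(spans, clusters)
```

The matrix is symmetric with ones on the diagonal, so a single-mention argument corresponds to a row whose only non-zero entry is the diagonal one.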

Figure 2: Corpus statistics about arguments. “Random statutes” are 9 sections sampled from the US code.

Structure extraction

A prominent feature of legal statutes is the presence of references, implicit and explicit, to other parts of the statutes. Resolving references and their logical connections, and passing arguments appropriately from one subsection to the other, are major steps in statutory reasoning. We refer to this as structure extraction. This mapping can be trivial, with the taxpayer and taxable year generally staying the same across subsections. Some mappings are more involved, such as the taxpayer from §152(b)(1) becoming the dependent in §152(a). Providing annotations for this task in general requires expert knowledge, as many references are implicit, and some must be resolved using guidance from Treasury Regulations. Our approach contrasts with recent efforts in breaking down complex questions into atomic questions, with the possibility of referring to previous answers (wolfson-etal-2020-break). Statutes contain their own breakdown into atomic questions. In addition, our structure is interpretable by a Prolog engine.

We provide structure extraction annotations for SARA in the style of Horn clauses (horn1951), using common logical operators, as shown in the bottom left of Figure 1. We also provide character offsets for the start and end of each subsection. Argument identification and coreference, and structure extraction can be done with the statutes only. They correspond to extracting a shallow version of the symbolic solver of holzenberger2020dataset.

Argument instantiation

We frame legal statutes as a set of predicates specified in natural language. Each subsection has a number of arguments, provided by the preceding tasks. Given the description of a case, each argument may or may not be associated with a value. Each subsection has an @truth argument, with possible values True or False, reflecting whether the subsection applies or not. Concretely, the input is (1) the string representation of the subsection, (2) the annotations from the first three tasks, and (3) values for some or all of its arguments. Arguments and values are represented as an array of key-value pairs, where the names of arguments specified in the structure annotations are used as keys. In Figure 1, compare the names of arguments in the green box with the key names in the blue boxes. The output is values for its arguments, in particular for the @truth argument. In the example of the top right in Figure 1, the input values are taxpayer = Alice and taxable year = 2017, and one expected output is @truth = True. We refer to this task as argument instantiation. Values for arguments can be found as spans in the case description, or must be predicted based on the case description. The latter happens often for dollar amounts, where incomes must be added, or tax must be computed. Figure 1 shows two examples of this task, in blue.

Before determining whether a subsection applies, it may be necessary to infer the values of unspecified arguments. For example, in the top of Figure 1, it is necessary to determine who Alice’s deceased spouse and who the dependent mentioned in §2(a)(1)(B) are. If applicable, we provide values for these arguments, not as inputs, but as additional supervision for the model. We provide manual annotations for all (subsection, case) pairs in SARA. In addition, we run the Prolog solver of holzenberger2020dataset to generate annotations for all possible (subsection, case) pairs, to be used as a silver standard, in contrast to the gold manual annotations. We exclude from the silver data any (subsection, case) pair where the case is part of the test set. This increases the amount of available training data by a factor of 210.

3 Baseline models

We provide baselines for three tasks, omitting structure extraction because it is the one task with the highest return on human annotation effort (code for the experiments can be found under https://github.com/SgfdDttt/sara_v2). In other words, if humans could annotate for any of these four tasks, structure extraction is where we posit their involvement would be the most worthwhile. Further, pertierra2017towards have shown that the related task of semantic parsing of legal statutes is a difficult task, calling for a complex model.

3.1 Argument identification

We run the Stanford parser (socher2013parsing) on the statutes, and extract all noun phrases as spans – specifically, all NNP, NNPS, PRP$, NP and NML constituents. While de-formatting legal text can boost parser performance (morgenstern2014toward), we found it made little difference in our case.

As an orthogonal approach, we train a BERT-based CRF model for the task of BIO tagging. With the 9 sections in the SARA v2 statutes, we create 7 equally-sized splits by grouping §68, 3301 and 7703 into a single split. We run a 7-fold cross-validation, using 1 split as a dev set, 1 split as a test set, and the remaining as training data. We embed each paragraph using BERT, classify each contextual subword embedding into a 3-dimensional logit with a linear layer, and run a CRF (lafferty2001conditional). The model is trained with gradient descent to maximize the log-likelihood of the sequence of gold tags. We experiment with using Legal BERT (holzenberger2020dataset) and BERT-base-cased (devlin2018bert) as our BERT model. We freeze its parameters and optionally unfreeze the last layer. We use a batch size of 32 paragraphs, a learning rate of and the Adam optimizer (kingma2014adam). Based on F1 score measured on the dev set, the best model uses Legal BERT and unfreezes its last layer. Test results are shown in Table 1.

                 avg    stddev   macro
Parser-based
  precision     17.6      4.4    16.6
  recall        77.9      5.0    77.3
  F1            28.6      6.2    27.3
BERT-based
  precision     64.7     15.0    65.1
  recall        69.0     24.2    59.8
  F1            66.2     20.5    62.4
Table 1: Argument identification results. Average and standard deviations are computed across test splits.
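The BIO tagging formulation used by the BERT-based model can be illustrated with a small sketch that converts argument spans into one of three tags per token. It assumes token-level spans with inclusive end indices, whereas the released annotations are character-level; the function name and the example spans are hypothetical.

```python
from typing import List, Tuple


def spans_to_bio(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert inclusive (start, end) token spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        tags[start] = "B"  # first token of an argument placeholder
        for position in range(start + 1, end + 1):
            tags[position] = "I"  # continuation of the same placeholder
    return tags


tokens = ["a", "taxpayer", "who", "files", "for", "the", "taxable", "year"]
# Hypothetical gold spans: "a taxpayer" and "the taxable year".
bio_tags = spans_to_bio(len(tokens), [(0, 1), (5, 7)])
```

The three tag types correspond to the 3-dimensional logit that the linear layer produces for each subword before the CRF.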

3.2 Argument coreference

Argument coreference differs from the usual coreference task (pradhan-EtAl:2014:P14-2), even though we use similar terminology and frame it in a similar way. In argument coreference, it is as important to link two coreferent argument mentions as it is not to link two different arguments. In contrast, regular coreference emphasizes the prediction of links between mentions. We thus report a different metric in Tables 2 and 4, exact match coreference, which gives credit for returning a cluster of mentions that corresponds exactly to an argument. In Figure 1, a system would be rewarded for linking together both mentions of the taxpayer in §2(a)(1)(B), but not if either of the two mentions were linked to any other mention within §2(a)(1)(B). This custom metric gives as much credit for correctly linking a single-mention argument (no links) as for a 5-mention argument (10 links).
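One plausible reading of exact match coreference is sketched below: a predicted cluster counts as correct only if it is identical to a gold cluster. The function name and the toy clusters are assumptions made for illustration, and the released evaluation code may differ in its details.

```python
from typing import FrozenSet, Set, Tuple

Cluster = FrozenSet[int]  # indices of the mentions grouped into one argument


def exact_match_scores(predicted: Set[Cluster], gold: Set[Cluster]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over clusters that match a gold cluster exactly."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy subsection: mentions 0 and 2 are the same argument, 1 and 3 are singletons.
gold = {frozenset({0, 2}), frozenset({1}), frozenset({3})}
# A system that predicts no links returns every mention as its own cluster.
predicted = {frozenset({0}), frozenset({2}), frozenset({1}), frozenset({3})}
precision, recall, f1 = exact_match_scores(predicted, gold)
```

In this toy case the two singleton arguments are credited and the two-mention argument is missed, which shows why a system predicting no links can still score well under this metric.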

Single mention baseline

Here, we predict no coreference links. Under usual coreference metrics, this system can have low performance.

String matching baseline

This baseline predicts a coreference link if the placeholder strings of two arguments are identical, up to the presence of the words such, a, an, the, any, his and every.
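A minimal sketch of this string matching rule follows. Lowercasing and whitespace tokenization are assumptions added here, and the function names and example mentions are hypothetical.

```python
from typing import Dict, List

# Words ignored when comparing placeholder strings, as listed above.
IGNORED_WORDS = {"such", "a", "an", "the", "any", "his", "every"}


def normalize(placeholder: str) -> str:
    """Drop the ignored words so that "a taxpayer" and "such taxpayer" match."""
    words = placeholder.lower().split()
    return " ".join(word for word in words if word not in IGNORED_WORDS)


def string_match_clusters(placeholders: List[str]) -> List[List[int]]:
    """Group mention indices whose normalized strings are identical."""
    groups: Dict[str, List[int]] = {}
    for index, text in enumerate(placeholders):
        groups.setdefault(normalize(text), []).append(index)
    return list(groups.values())


mentions = ["a taxpayer", "the taxable year", "such taxpayer", "his spouse"]
clusters = string_match_clusters(mentions)
```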

                   avg    stddev   macro
Single mention
  precision       81.7     28.9    68.2
  recall          86.9     21.8    82.7
  F1              83.8     26.0    74.8
String matching
  precision       91.2     20.0    85.5
  recall          92.8     16.8    89.4
  F1              91.8     18.6    87.4
Table 2: Exact match coreference results. Average and standard deviations are computed across subsections.

We also provide usual coreference metrics in Table 3, using the code associated with pradhan-EtAl:2014:P14-2. The string matching baseline perfectly resolves coreference for 80.8% of subsections, versus 68.9% for the single mention baseline.

           Single mention         String matching
MUC         0.0 /  0.0 /  0.0     82.1 / 64.0 / 71.9
CEAF-m     82.5 / 82.5 / 82.5     92.1 / 92.1 / 92.1
CEAF-e     77.3 / 93.7 / 84.7     90.9 / 95.2 / 93.0
BLANC      50.0 / 50.0 / 50.0     89.3 / 81.0 / 84.7
Table 3: Argument coreference baselines scored with usual metrics. Results are shown as Precision / Recall / F1.

In addition, we provide a cascade of the best methods for argument identification and coreference, and report results in Table 4. The cascade perfectly resolves a subsection’s arguments in only 16.4% of cases. This setting, which groups the first two tasks together, offers a significant challenge.

                 avg    stddev   macro
Cascade
  precision     54.5     35.6    58.0
  recall        53.5     37.2    52.4
  F1            54.7     33.4    55.1
Table 4: Exact match coreference results for BERT-based argument identification followed by string matching-based argument coreference. Average and standard deviations are computed across subsections.

3.3 Argument instantiation

Argument instantiation takes into account the information provided by previous tasks. We start by instantiating the arguments of a single subsection, without regard to the structure of the statutes. We then describe how the structure information is incorporated into the model.

Single subsection

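1: Input: argument spans with coreference information, input argument-value pairs, subsection text, case description
2: Output: argument-value pairs
3: function ArgInstantiation(argument spans, input argument-value pairs, subsection text, case description)
5:     for each argument other than @truth, in order of appearance, do
6:         replace the placeholders of arguments with known values in the subsection text (InsertValues)
7:         embed the case description concatenated with the modified subsection text using BERT
8:         compute the argument’s attentive representation and augmented feature vectors (ComputeAttentiveReps)
9:         predict the argument’s value as an integer or as a span from the case description (PredictValue)
10:        add the predicted value to the set of predictions
11:     end for
12-13: embed the case description concatenated with the fully grounded subsection text using BERT
14:    predict the value of @truth with a linear predictor on the [CLS] representation
17: end function
Algorithm 1 Argument instantiation for a single subsection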

We follow the paradigm of chen2020reading, where we iteratively modify the text of the subsection by inserting argument values, and predict values for uninstantiated arguments. Throughout the following, we refer to Algorithm 1 and to its notation.

For each argument whose value is provided, we replace the argument’s placeholders in the subsection text by the argument’s value, using InsertValues (line 6). This yields mostly grammatical sentences, with occasional hiccups. With §2(a)(1)(A) and the top right case from Figure 1, we obtain “(A) Alice spouse died during either of the two years immediately preceding 2017”.
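The value insertion step can be sketched as plain string replacement over annotated spans. This is an illustrative assumption of how InsertValues might behave, not the released implementation; the subsection wording, the placeholder spans and the argument names used below are hypothetical and chosen so that the output matches the example above.

```python
from typing import Dict, List, Tuple

Placeholder = Tuple[int, int, str]  # (start, end, argument name), end exclusive


def insert_values(text: str, placeholders: List[Placeholder], values: Dict[str, str]) -> str:
    """Replace each placeholder whose argument has a known value."""
    # Work from right to left so earlier character offsets stay valid.
    for start, end, name in sorted(placeholders, reverse=True):
        if name in values:
            text = text[:start] + values[name] + text[end:]
    return text


# Hypothetical wording of the subsection and of its placeholder spans.
subsection = (
    "(A) whose spouse died during either of the two years "
    "immediately preceding the taxable year"
)
placeholders = []
for phrase, name in [("whose", "Taxp"), ("the taxable year", "Taxy")]:
    start = subsection.index(phrase)
    placeholders.append((start, start + len(phrase), name))

grounded = insert_values(subsection, placeholders, {"Taxp": "Alice", "Taxy": "2017"})
```

Arguments without a known value keep their placeholder text, which is what lets the model predict them in a later iteration.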

We concatenate the text of the case with the modified text of the subsection, and embed it using BERT (line 7), yielding a sequence of contextual subword embeddings. Following chen2020reading, assume that the embedded case and the embedded subsection are each represented by a sequence of vectors. For a given argument, compute its attentive representation and its augmented feature vectors. This operation, described by chen2020reading, is performed by ComputeAttentiveReps (line 8). The augmented feature vectors represent the argument’s placeholder, conditioned on the text of the statute and case.

Based on the name of the argument span, we predict its value either as an integer or a span from the case description, using PredictValue (line 9). For integers, as part of the model training, we run k-means clustering on the set of all integer values in the training set, with enough centroids such that returning the closest centroid instead of the true value yields a numerical accuracy of 1 (see below). For any argument requiring an integer (e.g. tax), the model returns a weighted average of the centroids. The weights are predicted by a linear layer followed by a softmax, taking as input an average-pooling and a max-pooling of the augmented feature vectors. For a span from the case description, we follow the standard procedure for fine-tuning BERT on SQuAD (devlin2018bert). The unnormalized probability of the span from token $i$ to token $j$ is given by $\exp(l \cdot x_i + r \cdot x_j)$, where $x_i$ and $x_j$ are the vectors at positions $i$ and $j$, and $l$ and $r$ are learnable parameters.

The predicted value is added to the set of predictions (line 10), and will be used in subsequent iterations to replace the argument’s placeholder in the subsection. We repeat this process until a value has been predicted for every argument, except @truth (lines 5-11). Arguments are processed in order of appearance in the subsection. Finally, we concatenate the case and fully grounded subsection and embed them with BERT (lines 12-13), then use a linear predictor on top of the representation for the [CLS] token to predict the value for the @truth argument (line 14).

Subsection with dependencies

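1: Input: argument spans with coreference information, structure information, input argument-value pairs, subsection, case description
2: Output: argument-value pairs
3: function ArgInstantiationFull(argument spans, structure information, input argument-value pairs, subsection, case description)
4-6:   build the dependency tree of the subsection (BuildDependencyTree), propagate argument values from parent to child (PopulateArgValues), and optionally cap the depth of the tree
7:     for node in the tree, traversed depth first, do
8:          if node is a subsection and a leaf node then
9-11:            resolve node with ArgInstantiation
12:          else if node is a subsection and not a leaf node then
14:               find its child node (GetChild)
15:               get the child’s argument-value pairs other than @truth (GetArgValuePairs)
16:               merge them with the argument-value pairs of node
17-18:            resolve node with ArgInstantiation
19:          else if node is a logical operation then
20:               get its children (GetChildren)
21:               apply the operation to the children (DoOperation)
22:          end if
23:     end for
27: end function
Algorithm 2 Argument instantiation with dependencies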

To describe our procedure at a high-level, we use the structure of the statutes to build out a computational graph, where nodes are either subsections with argument-value pairs, or logical operations. We resolve nodes one by one, depth first. We treat the single-subsection model described above as a function, taking as input a set of argument-value pairs, a string representation of a subsection, and a string representation of a case, and returning a set of argument-value pairs. Algorithm 2 and Figure 3 summarize the following.

We start by building out the subsection’s dependency tree, as specified by the structure annotations (lines 4-6). First, we build the tree structure using BuildDependencyTree. Then, values for arguments are propagated from parent to child, from the root down, with PopulateArgValues. The tree is optionally capped to a predefined depth. Each node is either an input for the single-subsection function or its output, or a logical operation. We then traverse the tree depth first, performing the following operations, and replacing the node with the result of the operation:

  • If the node is a leaf, resolve it using the single-subsection function ArgInstantiation (lines 8-11 in Algorithm 2; step 1 in Figure 3).

  • If the node is a subsection that is not a leaf, find its child node (GetChild, line 14) and the corresponding argument-value pairs other than @truth (GetArgValuePairs, line 15). Merge them with the argument-value pairs of the main node (line 16). Finally, resolve the parent node using the single-subsection function (lines 17-18; step 3 in Figure 3).

  • If the node is a logical operation (line 19), get its children (GetChildren, line 20), to which the operation will be applied with DoOperation (line 21) as follows:

    • If the operation is NOT, assign the negation of the child’s @truth value to the node.

    • If the operation is OR, pick its child with the highest @truth value, and assign its arguments’ values to the node.

    • If the operation is AND, transfer the argument-value pairs from all its children to the node. In case of conflicting values, use the value associated with the lower @truth value. This operation can be seen in step 4 of Figure 3.

This procedure follows the formalism of neural module networks (andreas2016neural) and is illustrated in Figure 3. Reentrancy into the dependency tree is not possible, so that a decision made earlier cannot be backtracked on at a later stage. One could imagine doing joint inference, or using heuristics for revisiting decisions, for example with a limited number of reentrancies. Humans are generally able to resolve this task in the order of the text, and we assume it should be possible for a computational model too. Our solution is meant to be computationally efficient, with the hope of not sacrificing too much performance. Revisiting this assumption is left for future work.
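The three logical operations applied during the tree traversal can be sketched as follows. This is a hedged illustration rather than the released code: it assumes that each node is a dictionary of argument-value pairs, that @truth is stored as a score between 0 and 1, and that negation is computed as one minus that score.

```python
from typing import Dict, List, Union

ArgValues = Dict[str, Union[str, float]]  # includes a numeric "@truth" score


def do_operation(operation: str, children: List[ArgValues]) -> ArgValues:
    """Combine the argument-value pairs of child nodes under NOT, OR or AND."""
    if operation == "NOT":
        # Assumed form of negation for a truth score in [0, 1].
        return {"@truth": 1.0 - children[0]["@truth"]}
    if operation == "OR":
        # Keep the child with the highest @truth value.
        return dict(max(children, key=lambda child: child["@truth"]))
    if operation == "AND":
        merged: ArgValues = {}
        # Visit children from highest to lowest @truth so that, on conflict,
        # the value from the child with the lower @truth is written last.
        for child in sorted(children, key=lambda c: c["@truth"], reverse=True):
            merged.update(child)
        return merged
    raise ValueError(f"unknown operation: {operation}")


# Hypothetical child nodes with conflicting values for Spouse.
first = {"@truth": 0.9, "Taxp": "Alice", "Spouse": "Bob"}
second = {"@truth": 0.4, "Taxp": "Alice", "Spouse": "Charlie"}
and_result = do_operation("AND", [first, second])
or_result = do_operation("OR", [first, second])
not_result = do_operation("NOT", [first])
```

With this ordering, the @truth value of an AND node ends up being the minimum over its children, and that of an OR node the maximum.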

Figure 3: Argument instantiation with the top example from Figure 1. At each step, nodes to be processed are in blue, nodes being processed in yellow, and nodes already processed in green. The last step was omitted, and involves determining the truth value of the root node’s @truth argument.

Metrics and evaluation

Arguments whose value needs to be predicted fall into three categories. The @truth argument calls for a binary truth value, and we score a model’s output using binary accuracy. The values of some arguments, such as gross income, are dollar amounts. We score such values using numerical accuracy, which is $1$ if the prediction $\hat{y}$ falls within a tolerance of the target $y$, and $0$ otherwise. All other argument values are treated as strings. In those cases, we compute accuracy as exact match between predicted and gold value. Each of these three metrics defines a form of accuracy. We average the three metrics, weighted by the number of samples, to obtain a unified accuracy metric, used to compare the performance of models.
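The unified metric is a sample-weighted average of the three accuracies, which can be sketched as below. The accuracies and sample counts are made-up numbers chosen for easy arithmetic, not values from the dataset, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def unified_accuracy(per_metric: Dict[str, Tuple[float, int]]) -> float:
    """Average accuracies (in %), weighting each by its number of samples."""
    total = sum(count for _, count in per_metric.values())
    weighted = sum(accuracy * count for accuracy, count in per_metric.values())
    return weighted / total


# Hypothetical (accuracy in %, number of samples) for each argument type.
scores = {
    "@truth": (60.0, 100),
    "dollar amount": (20.0, 50),
    "string": (40.0, 50),
}
unified = unified_accuracy(scores)
```

Because the weights are sample counts, the argument type with the most samples dominates the unified score.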

                              @truth        dollar amount   string         unified     |  binary     numerical
baseline                      58.3 ± 7.5    18.2 ± 11.5      4.4 ± 7.4     43.3 ± 6.2  |  50 ± 8.3   30 ± 18.1
+ silver                      58.3 ± 7.5    39.4 ± 14.6      4.4 ± 7.4     47.2 ± 6.2  |  50 ± 8.3   45 ± 19.7
BERT                          59.2 ± 7.5    23.5 ± 12.5     37.5 ± 17.3    49.4 ± 6.2  |  51 ± 8.3   30 ± 18.1
- pre-training                57.5 ± 7.5    20.6 ± 11.9     37.5 ± 17.3    47.8 ± 6.2  |  49 ± 8.3   30 ± 18.1
- structure                   65.8 ± 7.2    20.6 ± 11.9     33.3 ± 16.8    52.8 ± 6.2  |  59 ± 8.2   30 ± 18.1
- pre-training, - structure   60.8 ± 7.4    20.6 ± 11.9     33.3 ± 16.8    49.4 ± 6.2  |  53 ± 8.3   30 ± 18.1
Table 5: Argument instantiation (best results in bold). We report accuracies, in %, and the 90% confidence interval. Right of the bar are accuracy metrics proposed with the initial release of the dataset. Blue cells use the silver data, brown cells do not. “BERT” is the model described in Section 3.3. Ablations to it are marked with a “-” sign.


Based on the type of value expected, we use different loss functions. For @truth, we use binary cross-entropy. For numerical values, we use a hinge loss. For strings, let $S$ be all the spans in the case description equal to the expected value. The loss function is the negative log of the total probability assigned to the spans in $S$ (clark2018simple). The model is trained end-to-end with gradient descent.

We start by training models on the silver data, as a pre-training step. We sweep the learning rate and the batch size over fixed sets of values. We try both BERT-base-cased and Legal BERT, allowing updates to the parameters of its top layer. We set aside 10% of the silver data as a dev set, and select the best model based on the unified accuracy on the dev set. Training is split up into three stages. The single-subsection model iteratively inserts values for arguments into the text of the subsection. In the first stage, regardless of the predicted value, we insert the gold value for the argument, as in teacher forcing (kolen2001field). In the second and third stages, we insert the value predicted by the model. When initializing the model from one stage to the next, we pick the model with the highest unified accuracy on the dev set. In the first two stages, we ignore the structure of the statutes, which effectively caps the depth of each dependency tree at 1.

Picking the best model from this pre-training step, we perform fine-tuning on the gold data. We take a k-fold cross-validation approach (stone1974cross). We randomly split the SARA v2 training set into 10 splits, taking care to put pairs of cases testing the same subsection into the same split. Each split contains nearly the same proportion of binary and numerical cases. We sweep the values of the learning rate and batch size in the same ranges as above, and optionally allow updates to the parameters of BERT’s top layer. For a given set of hyperparameters, we run training on each split, using the dev set and the unified metric for early stopping. We use the performance on the dev set averaged across the 10 splits to evaluate the performance of a given set of hyperparameters. Using that criterion, we pick the best set of hyperparameters. We then pick the final model as that which achieves median performance on the dev set, across the 10 splits. We report the performance of that model on the test set.
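The constraint that cases testing the same subsection land in the same split can be sketched as follows. Note that this sketch swaps the random assignment described above for a deterministic greedy one, and it does not balance binary and numerical cases; the function name and the case identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def grouped_splits(case_to_subsection: Dict[str, str], num_splits: int) -> List[List[str]]:
    """Assign cases to splits so that cases sharing a subsection stay together."""
    groups = defaultdict(list)
    for case_id, subsection in sorted(case_to_subsection.items()):
        groups[subsection].append(case_id)
    splits: List[List[str]] = [[] for _ in range(num_splits)]
    # Greedily place each group, largest first, into the currently smallest split.
    for subsection in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = min(range(num_splits), key=lambda i: len(splits[i]))
        splits[smallest].extend(groups[subsection])
    return splits


# Hypothetical case identifiers mapped to the subsection each case tests.
cases = {
    "case1_pos": "s2_a", "case1_neg": "s2_a",
    "case2_pos": "s151_b", "case2_neg": "s151_b",
    "case3_pos": "s63_c", "case3_neg": "s63_c",
}
splits = grouped_splits(cases, num_splits=3)
```

Keeping a positive and negative pair in one split avoids training on one member of a pair and validating on the other.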

In Table 5, we report the relevant argument instantiation metrics, under @truth, dollar amount and string. For comparison, we also report binary and numerical accuracy metrics defined in holzenberger2020dataset. The reported baseline has three parameters. For @truth, it returns the most common value for that argument on the train set. For arguments that call for a dollar amount, it returns the one number that minimizes the dollar amount hinge loss on the training set. For all other arguments, it returns the most common string answer in the training set. Those parameters vary depending on whether the training set is augmented with the silver data.

4 Discussion

Our goal in providing the baselines of Section 3 is to identify performance bottlenecks in the proposed sequence of tasks. Argument identification poses a moderate challenge, with a language model-based approach achieving non-trivial F1 score. The simple parser-based method is not a sufficient solution, but with its high recall could serve as the backbone to a statistical method. Argument coreference is a simpler task, with string matching perfectly resolving about 80% of the subsections. This is in line with the intuition that legal language is very explicit about disambiguating coreference. As reported in Table 3, usual coreference metrics seem lower, but only reflect a subset of the full task: coreference metrics are only concerned with links, so that arguments appearing exactly once bear no weight under that metric, unless they are wrongly linked to another argument.

Argument instantiation is by far the most challenging task, as the model needs strong natural language understanding capabilities. Simple baselines can achieve accuracies above 50% for @truth, since for all numerical cases, @truth = True. We receive a slight boost in binary accuracy from using the proposed paradigm, departing from previous results on this benchmark. As compared to the baseline, the models mostly lag behind for the dollar amount and numerical accuracies, which can be explained by the lack of a dedicated numerical solver, and sparse data. Further, we have made a number of simplifying assumptions, which may be keeping the model from taking advantage of the structure information: arguments are instantiated in order of appearance, forbidding joint prediction; revisiting past predictions is disallowed, forcing the model to commit to wrong decisions made earlier; the depth of the dependency tree is capped at 3; and finally, information is being passed along the dependency tree in the form of argument values, as opposed to dense, high-dimensional vector representations. The latter limits both the flow of information and the learning signal. This could also explain why the use of dependencies is detrimental in some cases. Future work would involve joint prediction (chan2019kermit), and more careful use of structure information.

Looking at the errors made by the best model in Table 5 for binary accuracy, we note that for 39 positive and negative case pairs, it answers each pair identically, thus yielding 39 correct answers. In the remaining 11 pairs, there are 10 pairs where it gets both cases right. This suggests it may be guessing randomly on 39 pairs, and understanding 10. The best BERT-based model for dollar amounts predicts the same number for each case, as does the baseline. The best models for string arguments generally make predictions that match the category of the expected answer (date, person, etc) while failing to predict the correct string.
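As a sanity check on these counts, the arithmetic can be written out. It assumes that the binary test set consists of the 50 pairs mentioned above (39 + 11), and that the one remaining pair, where the two answers differ without both being right, contributes no correct answer.

```latex
\[
\underbrace{39 \times 1}_{\text{pairs answered identically}}
+ \underbrace{10 \times 2}_{\text{pairs with both cases right}}
+ \underbrace{1 \times 0}_{\text{remaining pair}}
= 59 \ \text{correct answers out of} \ 2 \times 50 = 100 \ \text{cases}
\]
```

This agrees with the highest binary accuracy reported in Table 5, which is 59%.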

Performance gains from silver data are noticeable and generally consistent, as can be seen by comparing brown and blue cells in Table 5. The silver data came from running a human-written Prolog program, which is costly to produce. A possible substitute is to find mentions of applicable statutes in large corpora of legal cases (caselawproject), for example using high-precision rules (ratner2017snorkel), which has been successful for extracting information from cases (boniol2020performance).

In this work, each task uses the gold annotations from upstream tasks. Ultimately, the goal is to pass the outputs of models from one task to the next.

5 Related Work

Law-related NLP tasks have flourished in the past years, with applications including answering bar exam questions (yoshioka2018overview; zhong2019jec), information extraction (chalkidis2019large; boniol2020performance; lam2020gap), managing contracts (elwany2019bert; liepina2020explaining; nyarko2021stickiness) and analyzing court decisions (sim2014utility; lee2017judging). Case-based reasoning has been approached with expert systems (popp1974judith; hellawell1980computer; vdl1983design), high-level hand-annotated features (ashley2009automatically) and transformer-based models (rabelo2019combining). Closest to our work is saeidi2018interpretation, where a dialog agent’s task is to answer a user’s question about a set of regulations. The task relies on a set of questions provided within the dataset.

clark2019f as well as preceding work (friedland2004project; gunning2010project) tackle a similar problem in the science domain, with the goal of using the prescriptive knowledge from science textbooks to answer exam questions. The core of their model relies on several NLP and specialized reasoning techniques, with contextualized language models playing a major role.

clark2019f take the route of sorting questions into different types, and working on specialized solvers. In contrast, our approach is to treat each question identically, but to decompose the process of answering into a sequence of subtasks.

The language of statutes is related to procedural language, which describes steps in a process. zhang2012automatically collect how-to instructions in a variety of domains, while wambsganss2019mining focus on automotive repair instructions. branavan-etal-2012-learning exploit instructions in a game manual to improve an agent’s performance. dalvi-etal-2019-everything and amini2020procedural turn to modeling textual descriptions of physical and biological mechanisms. weller2020learning propose models that generalize to new task descriptions.

The tasks proposed in this work are germane to standard NLP tasks, such as named entity recognition (ratinov2009design), part-of-speech tagging (petrov2011universal; akbik2018contextual), and coreference resolution (pradhan-EtAl:2014:P14-2). Structure extraction is conceptually similar to syntactic (socher2013parsing) and semantic parsing (berant2013semantic), which pertierra2017towards attempt for a subsection of tax law.

Argument instantiation is closest to the task of aligning predicate argument structures (roth2012aligning; wolfe-etal-2013-parma). We frame argument instantiation as iteratively completing a statement in natural language. chen2020reading refine generic statements by copying strings from input text, with the goal of detecting events. chan2019kermit extend transformer-based language models to permit inserting tokens anywhere in a sequence, thus allowing an existing sequence to be modified. For argument instantiation, we make use of neural module networks (andreas2016neural), which are used in the visual (yi2018neural) and textual (gupta2019neural) domains. In that context, arguments and their values can be thought of as the hints from khot2020text. The Prolog-based data augmentation is related to data augmentation for semantic parsing (campagna2019genie; weir2019dbpal).

6 Conclusion

Solutions to tackle statutory reasoning may range from high-structure, high-human involvement expert systems, to less structured, largely self-supervised language models. Here, taking inspiration from Prolog programs, we introduce a novel paradigm, by breaking statutory reasoning down into a sequence of tasks. Each task can be annotated with far less expertise than would be required to translate legal language into code, and comes with its own performance metrics. Our contribution enables finer-grained scoring and debugging of models for statutory reasoning, which facilitates incremental progress and identification of performance bottlenecks. In addition, argument instantiation and explicit resolution of dependencies introduce further interpretability. This novel approach could inform the design of models that reason with rules specified in natural language, for the domain of legal NLP and beyond.


The authors thank Andrew Blair-Stanek for helpful comments, and Ryan Culkin for help with the parser-based argument identification baseline.


Appendix A Task examples

In the following, we provide several examples for each of the tasks defined in Section 2.

A.1 Argument identification

For ease of reading, the spans mentioned in the output are underlined in the input.

Input 1 (§3306(a)(1)(B))

(B) on each of some 10 days during the calendar year or during the preceding calendar year, each day being in a different calendar week, employed at least one individual in employment for some portion of the day.

Output 1

Input 2 (§63(c)(5))

In the case of an individual with respect to whom a deduction under section 151 is allowable to another taxpayer for a taxable year beginning in the calendar year in which the individual’s taxable year begins, the basic standard deduction applicable to such individual for such individual’s taxable year shall not exceed the greater of-

Output 2

Input 3 (§1(d)(iv))

(iv) $31,172, plus 36% of the excess over $115,000 if the taxable income is over $115,000 but not over $250,000;

Output 3

A.2 Argument coreference

We report the full coreference matrix. In addition, for ease of reading, coreference clusters are marked with superscripts in the input.

Input 1 (§3306(a)(1)(B))

(B) on each of some 10 days during the calendar year or during the preceding calendar year, each day being in a different calendar week, employed at least one individual in employment for some portion of the day.

Output 1

Input 2 (§63(c)(5))

In the case of an individual with respect to whom a deduction under section 151 is allowable to another taxpayer for a taxable year beginning in the calendar year in which the individual’s taxable year begins, the basic standard deduction applicable to such individual for such individual’s taxable year shall not exceed the greater of-

Output 2

Input 3 (§1(d)(iv))

(iv) $31,172, plus 36% of the excess over $115,000 if the taxable income is over $115,000 but not over $250,000;

Output 3

A.3 Structure extraction

To clarify the link between the input and the output, we are adding superscripts to argument names in the output. While the output is represented as plain text, a graph-based representation would likely be used in a practical system, to facilitate learning and inference. Arguments are keyword based. For example, in Output 2, the value of the Taxp argument of §63(c)(5) is passed to the Spouse argument of §151(b). If no equal sign is specified, it means the argument names match. For example, part of Output 2 could have been rewritten more explicitly as §151(b)(Spouse=Taxp, Taxp=S45, Taxy=Taxy).
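The keyword-argument convention described above can be illustrated with a short sketch. It assumes that the argument list is the last parenthesized group of a call, that arguments are comma-separated, and that `Callee=Caller` means the caller's value is passed to the callee's argument. The helper names `expand_call_args` and `rewrite_explicit` are hypothetical and not part of the dataset tooling.

```python
def expand_call_args(call):
    """Split a call such as '§151(b)(Spouse=Taxp, Taxp=S45, Taxy)' into the
    subsection name and an explicit {callee_argument: caller_argument} map."""
    call = call.strip()
    open_idx = call.rindex("(")      # the last group holds the arguments
    name = call[:open_idx]
    arg_list = call[open_idx + 1:-1]
    explicit = {}
    for arg in arg_list.split(","):
        arg = arg.strip()
        if not arg:                  # empty list, as in '§63(c)(5)(A)()'
            continue
        callee, sep, caller = arg.partition("=")
        # Without an equal sign, the argument names match on both sides.
        explicit[callee.strip()] = caller.strip() if sep else callee.strip()
    return name, explicit


def rewrite_explicit(call):
    """Render a call with every argument written as Callee=Caller."""
    name, explicit = expand_call_args(call)
    args = ", ".join(f"{k}={v}" for k, v in explicit.items())
    return f"{name}({args})"


print(rewrite_explicit("§151(b)(Spouse=Taxp, Taxp=S45, Taxy)"))
# §151(b)(Spouse=Taxp, Taxp=S45, Taxy=Taxy)
```

A dictionary per call is also a convenient starting point for the graph-based representation mentioned above, with one edge per argument binding.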

Input 1 (§3306(a)(1)(B))

(B) on each of some 10 days during the calendar year or during the preceding calendar year, each day being in a different calendar week, employed at least one individual in employment for some portion of the day.

Output 1

§3306(a)(1)(B)( Caly, S16, Workday, Employment,
Preccaly, Employee, S13A, Employer,
Service) :-
§3306(c)(Employee, Employer, Service).

Input 2 (§63(c)(5))

In the case of an individual with respect to whom a deduction under section 151 is allowable to another taxpayer for a taxable year beginning in the calendar year in which the individual’s taxable year begins, the basic standard deduction applicable to such individual for such individual’s taxable year shall not exceed the greater of-

Output 2

§63(c)(5)( Bassd, Grossinc, S45, Taxp, Taxy,
S44B, S46B, S47, S48) :-
§151(b)(Spouse=Taxp, Taxp=S45, Taxy) OR
§151(c)(S24A=Taxp, Taxp=S45, Taxy)
§63(c)(5)(A)() AND
§63(c)(5)(B)(Grossinc, Taxp).

Input 3 (§1(d)(iv))

(iv) $31,172, plus 36% of the excess over $115,000 if the taxable income is over $115,000 but not over $250,000;

Output 3

§1(d)(iv)(Tax, Taxinc).
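To make the relation between the two arguments concrete, the clause in Input 3 can be sketched as a function from Taxinc to Tax. This is an illustration of the statute text only: the function name is hypothetical, integer dollar amounts and rounding down are assumptions, and returning None when the clause does not apply is a design choice of this sketch rather than the behavior of the Prolog program.

```python
def tax_1_d_iv(taxinc):
    """§1(d)(iv): $31,172, plus 36% of the excess over $115,000, if the
    taxable income is over $115,000 but not over $250,000."""
    if 115_000 < taxinc <= 250_000:
        # Integer arithmetic avoids floating-point error; rounding down
        # is an assumption, the statute excerpt does not specify it.
        return 31_172 + 36 * (taxinc - 115_000) // 100
    return None  # the clause does not apply to this income


print(tax_1_d_iv(200_000))  # 61772
print(tax_1_d_iv(100_000))  # None
```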

A.4 Argument instantiation

The following are example cases. In addition to the case description, subsection to apply and input argument-value pairs, the agent has access to the output of Argument identification, Argument coreference and Structure extraction, for the entirety of the statutes.

Input 1: case §3306(a)(1)(B)-positive

Case description: Alice has employed Bob on various occasions during the year 2017: Jan 24, Feb 4, Mar 3, Mar 19, Apr 2, May 9, Oct 15, Oct 25, Nov 8, Nov 22, Dec 1, Dec 3.

Subsection to apply: §3306(a)(1)(B)

Argument-value pairs: {Employer=“Alice”, Caly=“2017”}

Output 1

{Workday=[“Jan 24”, “Feb 4”, “Mar 3”, “Mar 19”, “Apr 2”, “May 9”, “Oct 15”, “Oct 25”, “Nov 8”, “Nov 22”, “Dec 1”, “Dec 3”], Employee=“Bob”, Employment=“has employed”, S13A=[4, 5, 9, 11, 13, 19, 41, 43, 45, 47], @truth=True}
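The list of ten numbers in Output 1 is consistent with the calendar weeks of the listed workdays. The sketch below recomputes them, assuming ISO 8601 week numbering (the week convention is not stated in this appendix) and reading the "different calendar week" condition of §3306(a)(1)(B) as requiring at least ten distinct weeks. The names `MONTHS`, `distinct_weeks` and `WORKDAYS` are illustrative.

```python
from datetime import date

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

WORKDAYS = ["Jan 24", "Feb 4", "Mar 3", "Mar 19", "Apr 2", "May 9",
            "Oct 15", "Oct 25", "Nov 8", "Nov 22", "Dec 1", "Dec 3"]


def distinct_weeks(days, year):
    """Return the distinct ISO week numbers of the given days, in order."""
    weeks = []
    for day in days:
        month, day_of_month = day.split()
        week = date(year, MONTHS[month], int(day_of_month)).isocalendar()[1]
        if week not in weeks:
            weeks.append(week)
    return weeks


weeks = distinct_weeks(WORKDAYS, 2017)
print(weeks[:10])        # [4, 5, 9, 11, 13, 19, 41, 43, 45, 47]
print(len(weeks) >= 10)  # True: ten days fall in ten different weeks
```

Dec 1 and Dec 3 both fall in week 48 under this convention, so the twelve workdays span eleven distinct weeks, of which the first ten match the list in Output 1.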

Input 2: case §63(c)(5)-negative

Case description: In 2017, Alice was paid $33200. Alice and Bob have been married since Feb 3rd, 2017. Bob earned $10 in 2017. Alice and Bob file separate returns. Alice is not entitled to a deduction for Bob under section 151.

Subsection to apply: §63(c)(5)

Argument-value pairs: {Taxp=“Bob”, Taxy=“2017”, Bassd=500}

Output 2


Input 3: tax case 5

Case description: In 2017, Alice’s gross income was $326332. Alice and Bob have been married since Feb 3rd, 2017, and have had the same principal place of abode since 2015. Alice was born March 2nd, 1950 and Bob was born March 3rd, 1955. Alice and Bob file separately in 2017. Bob has no gross income that year. Alice takes the standard deduction.

Subsection to apply: Tax

Argument-value pairs: {Taxy=“2017”, Taxp=“Alice”}

Output 3

{Tax=116066, @truth=True}

Appendix B Dataset statistics

B.1 Argument identification

In Table 6, we report statistics on the annotations for the argument identification task. The numbers in that table were used to plot the top histogram in Figure 2(a).

Counts    SARA   Random
0           33       24
1           39       16
2           34       21
3           32       21
4           13       11
5           11       12
6            7       13
7            4        6
8           10        4
9            5        1
10           2        7
11           2        3
12           1        3
13           0        2
14           1        0
15           0        0
16           0        2
total      194      146
average    3.0      4.0
stddev     2.8      3.6
median       2        3
Table 6: Number of argument placeholders per subsection. “Counts” reports the number of subsections (right columns) containing a specific number of placeholders (left column). “Random” refers to 9 sections drawn at random from the Tax Code, and annotated.
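The summary rows of Table 6 can be recomputed from the histogram itself. The sketch below does so under the assumption that the reported standard deviation is the population standard deviation; at one decimal place the sample estimate gives the same values. The names `SARA`, `RANDOM` and `summarize` are illustrative.

```python
import statistics

# Number of subsections with 0, 1, ..., 16 placeholders (Table 6).
SARA = [33, 39, 34, 32, 13, 11, 7, 4, 10, 5, 2, 2, 1, 0, 1, 0, 0]
RANDOM = [24, 16, 21, 21, 11, 12, 13, 6, 4, 1, 7, 3, 3, 2, 0, 0, 2]


def summarize(histogram):
    """Expand a histogram into one value per subsection and summarize it."""
    values = [k for k, n in enumerate(histogram) for _ in range(n)]
    return {
        "total": len(values),
        "average": round(statistics.mean(values), 1),
        "stddev": round(statistics.pstdev(values), 1),
        "median": statistics.median(values),
    }


print(summarize(SARA))
print(summarize(RANDOM))
```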

B.2 Argument coreference

In Tables 7 and 8, we report statistics on the annotations for the argument coreference task. The numbers in Table 7 (resp. 8) were used to plot the middle (resp. bottom) histogram in Figure 2(a).

Counts    SARA   Random
0           33       24
1           40       22
2           44       18
3           30       23
4           15       18
5           10       14
6           13       13
7            5        8
8            4        2
9            0        3
10           0        1
total      161      146
average    2.4      3.1
stddev     2.0      2.4
median       2        3
Table 7: Number of arguments per subsection. “Counts” reports the number of subsections (right columns) containing a specific number of arguments (left column). “Random” refers to 9 sections drawn at random from the Tax Code, and annotated.

Counts    SARA   Random
1          391      360
2           70       73
3            6       16
4            6        6
5            0        1
total      473      456
average    1.2      1.3
stddev     0.5      0.6
median       1        1
Table 8: Number of mentions per argument. “Counts” reports the number of arguments (right columns) mentioned a specific number of times (left column). “Random” refers to 9 sections drawn at random from the Tax Code, and annotated.

B.3 Structure extraction

Table 9 reports statistics on the annotations for the structure extraction task. These numbers for arguments differ from those in Table 6, because any subsection is allowed to contain the arguments of any subsections it refers to.

Counts    Arguments   Dependencies
0                 9             80
1                13             42
2                40             28
3                60             18
4                24              8
5                13              2
6                14              3
7                 7              7
8                 7              1
9                 5              1
10                -              0
11                -              0
12                -              2
total           192            192
average         3.0            1.0
stddev          2.6            2.4
median            3              1
Table 9: Number of arguments and dependencies of each subsection, as represented in the structure annotations. “Counts” reports the number of subsections (right columns) with a specific number of arguments or dependencies (left column).

B.4 Argument instantiation

Tables 10 and 11 show statistics for the annotations for the argument instantiation task. In the gold data, we separate training and test data, to show that both distributions are close.

Counts    Gold train   Gold test   Gold all   Silver
0                  7           8         15     1197
1                 24          13         37     5487
2                177          73        250    35629
3                 41          24         65    32751
4                  5           2          7      447
5                  2           0          2       32
total            256         120        376    75543
average          2.1         2.0        2.0      2.3
stddev           0.7         0.8        0.7      0.7
median             2           1          2        2
Table 10: Number of argument-value pairs in the input to the argument instantiation task. “Counts” reports the number of examples (right columns) with a specific number of argument-value pairs (left column). “Gold” refers to the manually annotated data, and “Silver” to the data produced automatically through the Prolog program.

Counts    Gold train   Gold test   Gold all   Silver
1                131          78        209    41248
2                 96          33        129    17051
3                 12           4         16     8712
4                  7           3         10     6656
5                  8           2         10     1573
6                  1           0          1      242
7                  1           0          1       51
8                  0           0          0        8
9                  0           0          0        2
total            256         120        376    75543
average          1.7         1.5        1.6      1.8
stddev           1.0         0.8        1.0      1.1
median             1           1          1        1
Table 11: Number of argument-value pairs in the output of the argument instantiation task. “Counts” reports the number of examples (right columns) with a specific number of argument-value pairs (left column). “Gold” refers to the manually annotated data, and “Silver” to the data produced automatically through the Prolog program.