Legal documents are long and hard to understand. These documents, whether in the form of court orders, contracts, or terms of service, involve elaborate sentences, formal grammar, and dense text full of free-flowing legal jargon [bhattacharya2019comparative, jain2021summarization]. Reading legal texts is thus challenging for both ordinary people and legal experts. Given the increasing number of legal texts released every day, there is an urgent need for automated legal text summarization capable of shortening the original texts without losing the documents' critical information.
Before the deep learning era, classical summarization methods used handcrafted features and simple statistics designed for specific types of case judgments [farzindar2004letsum, polsley2016casesummarizer]. With the rise of deep learning and the public release of legal documents, there have been various attempts to train automated end-to-end legal summarization systems [jain2021summarization]. One straightforward way is to adopt powerful domain-independent summarization models, casting the problem as ranking [zhong2020extractive] or sentence classification [liu2019fine]. All of these prior works trained their models with a differentiable loss (i.e., cross-entropy) to maximize the likelihood of the ground-truth summaries, showing better results than classical methods on some legal datasets [anand2019effective, kornilova2019billsum].
However, the performance of general summarization methods is still limited and does not satisfy the requirements of the legal industry [jain2021summarization]. This is attributed to a well-known problem in text summarization [narayan2018ranking]: the mismatch between the learning objective and the evaluation criterion, namely Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [lin2004rouge]. While the learning objective aims to maximize the likelihood of the ground-truth summary, ROUGE relies heavily on the lexical correspondence between the ground-truth and candidate summaries. The problem is even worse in the legal domain, since a legal document is often very long compared to its summary, making the loss difficult to optimize.
To address this problem, we use reinforcement learning (RL) to train summarization models. By using RL, we can design complementary reward schemes that guide learning toward objectives beyond traditional likelihood maximization. Here we focus on keyword-level semantics to generate summaries that contain the critical legal terms and phrases of the document. This allows our model to simulate the way humans summarize long legal documents. Our method is illustrated in Figure 1. We first train a backbone model with standard supervised training to predict whether a sentence from the original document should be included in the summary. Then, we finetune the model with RL using a novel reward model that smoothly integrates lexical, sentence-level, and keyword-level semantics into one reward function, unifying the different perspectives that constitute a good legal summary. Moreover, to ensure a stable finetuning process, we use proximal policy optimization (PPO) and keep the exploring policy close to the supervised model by using Kullback-Leibler (KL) divergence as additional intermediate rewards. We examine our proposed reward model on three different summarization backbones and validate the performance of our approach on three public legal datasets with different characteristics. Experimental results show that our training method consistently boosts the performance of the backbone summarization models, both quantitatively and qualitatively.
Our contributions are three-fold: (1) we pioneer the use of reinforcement learning (RL) to train summarization models for the legal domain; (2) we construct a new training objective in the form of RL rewards that accommodates both semantic and lexical requirements; and (3) we evaluate the proposed method on diverse legal document types. The proposed method achieves significant improvements over other baselines across three datasets.
2.1 Problem formulation
We formulate the problem as extractive summarization. Given a document D with n sentences {s_1, ..., s_n}, the model extracts m salient sentences (where m < n) and re-organizes them as a summary S. In most cases, the summarization model can be viewed as a binary classifier that predicts a label y_i in {0, 1} for each sentence (y_i = 1 denotes that the i-th sentence should be included in the summary). To do this, the model learns to assign a score p(y_i | s_i, D; θ) that quantifies the importance of each sentence, where θ is the learned parameter of a neural network. After training, the model selects the top-m sentences with the highest scores among s_1, ..., s_n as a summary.
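This inference step can be sketched in a few lines; the sentences and scores below are toy stand-ins for a scored legal document:

```python
def extract_summary(sentences, scores, k):
    """Pick the k highest-scoring sentences, then restore document order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

doc = ["The court dismissed the appeal.",
       "Costs were awarded to the respondent.",
       "The hearing took place in March.",
       "The appellant sought judicial review."]
scores = [0.91, 0.40, 0.15, 0.77]
summary = extract_summary(doc, scores, k=2)  # keeps the two top-scored sentences
```

Sorting the chosen indices before assembling the output preserves the original document order, which keeps the extracted summary readable.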
2.2 Pitfalls of current reinforcement training for summarization
To train a summarization model with RL, prior works normally use a simple policy gradient method, REINFORCE, as the optimization algorithm and directly adopt ROUGE scores as a global reward. In this section, we give a background on these traditional methods and point out their limitations.
REINFORCE [williams1992simple] relies on direct differentiation of the RL objective, the expected total reward J(θ) = E_{τ∼π_θ}[R(τ)], where τ = (s_1, a_1, ..., s_T, a_T) denotes a sampled trajectory, s_t is the state of the agent, and a_t is the action taken by the agent at time step t. The key idea of REINFORCE is to push up the probabilities of actions that lead to a higher total reward, and push down the probabilities of actions that lead to a lower total reward, until the model obtains an optimal policy.
The gradient update in REINFORCE is highly sensitive to the choice of learning rate. If the parameters are updated with large steps, the policy can change drastically; conversely, a learning rate that is too small leads to hopelessly slow learning progress. This matters for summarization because summaries collected from a bad policy will guide subsequent learning and gradually push the policy away from the optimal solution.
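A minimal sketch of one REINFORCE update in this per-sentence include/exclude setting, assuming an independent Bernoulli policy per sentence; the reward function and learning rate are placeholders:

```python
import math
import random

def reinforce_step(logits, reward_fn, lr):
    """One REINFORCE update for a per-sentence include/exclude policy.

    Each sentence has an independent Bernoulli policy with parameter
    sigmoid(logit); the gradient of the log-probability of a sampled
    action a w.r.t. its logit is (a - p).
    """
    probs = [1.0 / (1.0 + math.exp(-t)) for t in logits]
    actions = [1 if random.random() < p else 0 for p in probs]
    reward = reward_fn(actions)  # scalar reward for the whole sampled summary
    new_logits = [t + lr * reward * (a - p)
                  for t, a, p in zip(logits, actions, probs)]
    return new_logits, actions, reward
```

Note that the whole update is scaled by both `lr` and the episode reward, which is exactly why an ill-chosen learning rate can swing the policy too far in one step.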
ROUGE as a reward
ROUGE scores [lin2004rouge] are often used as the reward for reinforcement training of summarization models [narayan2018ranking], e.g., r = (ROUGE-1 + ROUGE-2 + ROUGE-L) / 3, where each term denotes the F1-score of the corresponding ROUGE variant. However, ROUGE does not assess the fluency of the summary. It only tries to assess adequacy, by simply counting the n-grams overlapping between the extracted summary and the gold summary. Moreover, n-gram matching fails to credit synonyms or equivalent phrases appearing in a summary.
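For illustration, the unigram core of ROUGE-1 can be computed as below; this toy version skips stemming and the other options of the official toolkit:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1, the core of ROUGE-1 (no stemming, for illustration)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Note how a synonym pair such as "judge" vs. "magistrate" scores zero despite near-identical meaning, which is exactly the weakness discussed above.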
3 Summarization with the Unified Reward
3.1 Supervised summarization backbone model
We follow BERTSUM [liu2019text] to encode the input sentences of a document. To do so, a special [CLS] token (short for "classification") is inserted at the start of every sentence. We also modify the interval segment embeddings by assigning embedding E_A or E_B to the i-th sentence, depending on whether its position is odd or even. This way, both token-level relations and inter-sentence relations are learned simultaneously through the different layers of the Transformer encoder. The output of BERT is a representation for each token, and the embedding of the i-th [CLS] token from the top layer of the Transformer encoder (denoted as T_i) is used as the representation of sentence s_i.
Given the sentence representations, neural networks are used to predict whether to select each sentence as part of the summary. We experiment with three architectures (linear, LSTM, and Transformer) to create three different summarization backbones. The output of each backbone is followed by a sigmoid function to compute the score for each sentence, as follows.
a linear layer: ŷ_i = σ(W T_i + b);
an LSTM layer [hochreiter1997long]: h_i = LSTM(T_i), ŷ_i = σ(W h_i + b);
a Transformer encoder block with multi-head attention followed by a two-layer fully-connected feed-forward network (FFN) [vaswani2017attention]. The attention is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where Q, K, and V are the query, key, and value matrices computed from the embeddings of all tokens in the i-th sentence.
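A minimal numpy sketch of the scaled dot-product attention used in this block (single head, no masking):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V
```

With all-zero queries the attention weights are uniform, so each output row is simply the mean of the value rows, a handy sanity check on the softmax.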
3.2 Finetuning with Proximal Policy Optimization
After training the backbone model with the standard cross-entropy loss, we treat the trained backbone as the initial policy and continue finetuning it with RL. Here we choose Proximal Policy Optimization (PPO) [schulman2017proximal] as our RL algorithm because it has been shown to work well as an optimization algorithm in the NLP domain [stiennon2020learning, ziegler2019fine]. PPO controls how much the policy changes at each iteration, so that the policy does not move too far. We hypothesize that this benefits our policy in extracting better summaries.
We follow [ziegler2019fine, stiennon2020learning] to define the reward scheme for RL. Let π_SL denote the supervised-trained backbone model and π_RL the one that we optimize with RL. We calculate the reward at time step t as:

r_t = -β KL(π_SL(· | s_t) || π_RL(· | s_t)) for t < T, and r_T = r_unified(S),

where β is the KL coefficient and T is the final time step, corresponding to the total number of sentences in the document D.
For the intermediate time steps (t < T), the reward is just the negative KL divergence between the output distribution of the backbone model and that of the current policy. This prevents the current policy from generating outputs that are too different from the outputs of the backbone model. For the final time step (t = T), when the model obtains the entire summary S, we add a reward term designed to measure the quality of the extracted summary as a whole. We introduce this reward in Section 3.3.
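A per-step reward of this shape can be sketched as follows; the probability vectors, the β value, and the final summary reward used in the example are toy stand-ins:

```python
import math

def step_reward(p_backbone, p_policy, beta, summary_reward=None):
    """Reward at one time step: the negative KL divergence between the
    backbone's and the current policy's output distributions; at the
    final step, the whole-summary reward is added on top."""
    kl = sum(p * math.log(p / q)
             for p, q in zip(p_backbone, p_policy) if p > 0.0)
    reward = -beta * kl
    if summary_reward is not None:  # final time step t = T
        reward += summary_reward
    return reward
```

Since KL divergence is non-negative, every intermediate reward is a penalty: the further the policy drifts from the backbone, the more it pays at each step.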
3.3 The unified reward
As mentioned, a reward based only on ROUGE scores considers just the n-gram overlap between an extracted summary and the ground-truth reference; it ignores semantics at the word and sentence levels. We argue that the reward function should encode semantics to guide the summarization model toward summaries that approach human quality. To exploit the semantic aspect, we introduce a unified reward function in Eq. (6), which combines several important aspects of a good summary.
r_unified = λ_1 · r_rouge + λ_2 · r_kw + λ_3 · r_sen, (6)

where λ_1, λ_2, and λ_3 are control coefficients; r_rouge is the ROUGE-score function; r_kw considers keyword semantics; and r_sen captures the semantics of the whole sequence. The new reward function thus includes three components. The ROUGE function encodes the word overlap between an extracted summary and the gold reference; we use it to directly force the backbone model to extract important sentences that tend to be similar to the gold reference [narayan2018ranking]. The r_kw term supports the ROUGE function in terms of semantics: the ROUGE function only considers n-gram overlap, yet in many cases the words of an extracted summary and the gold reference differ on the surface while sharing similar meaning. We design r_kw to address this problem. Finally, r_sen helps the backbone model extract sequences semantically similar to the target text.
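The combination can be sketched as a simple weighted sum; the coefficient names, the per-component signatures, and the lambdas in the example are illustrative assumptions:

```python
def unified_reward(summary, reference, document,
                   f_rouge, f_keyword, f_sentence,
                   lam1, lam2, lam3):
    """Weighted sum of the three reward components (the shape of Eq. 6).

    f_rouge and f_sentence compare the summary against the gold reference;
    f_keyword compares summary keywords against document keywords.
    """
    return (lam1 * f_rouge(summary, reference)
            + lam2 * f_keyword(summary, document)
            + lam3 * f_sentence(summary, reference))
```

Keeping the components as callables makes it easy to ablate one term by zeroing its coefficient, as the ablation study later does.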
The keyword semantic function
To compute r_kw, we first use BERT to embed phrases and use the cosine function to compute the similarity among the embedded vectors. Then, the method shown in Algorithm 1 is used to produce two sets of keywords from the original document D and the summary S. For each keyword in the summary set, we find the most similar keyword in the document set. Finally, the keyword reward is computed as the average of all such similarities.
In contrast to DSR [li2019deep], we select a list of keywords before comparing them. This is essential for legal documents, as keywords play a vital role in a text's meaning, and selecting them first amplifies their effect on the reward function. In addition, several terms used in the legal domain can share the same meaning. Therefore, to reduce redundancy and inaccuracy, when choosing keywords we take the phrase that is most different from the current set of keywords. In practice, the size of the keyword set is set to 3 in all experiments.
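A sketch of this greedy diverse selection and of the keyword reward, assuming phrase embeddings are already available; seeding the selection with the first phrase is an assumption (Algorithm 1 may seed differently):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pick_diverse(embeddings, k):
    """Greedily build a keyword set of size k: starting from the first
    phrase, repeatedly add the phrase least similar to the set chosen
    so far, which suppresses near-synonymous legal terms."""
    chosen = [0]
    while len(chosen) < k:
        rest = [i for i in range(len(embeddings)) if i not in chosen]
        best = min(rest, key=lambda i: max(cosine(embeddings[i], embeddings[j])
                                           for j in chosen))
        chosen.append(best)
    return chosen

def keyword_reward(summary_kw, doc_kw):
    """For each summary keyword, take its best cosine match among the
    document keywords, then average the similarities."""
    return sum(max(cosine(s, d) for d in doc_kw)
               for s in summary_kw) / len(summary_kw)
```

A near-duplicate of an already-chosen phrase (e.g., two spellings of the same legal term) scores a high maximum similarity to the chosen set, so the greedy step skips it in favor of a genuinely different phrase.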
The semantic sentence function
Since r_kw only encourages similar keywords to appear in the summary, it does not guarantee the coherence of the whole summary. Thus, we define r_sen to enforce semantic similarity between the final summary S and the gold reference S*. We use Word Mover's Distance [kusner2015word] on Word2Vec embeddings [mikolov2013linguistic] to measure the distance d between the two texts. Concretely, the reward is defined as r_sen = exp(-d), where d = WMD(S, S*).
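A toy sketch of turning a document-level semantic distance into a bounded reward; both the exp(-d) mapping and the mean-vector cosine distance below are illustrative assumptions (the paper relies on a Word Mover's-style distance over Word2Vec embeddings, which needs an optimal-transport solver):

```python
import math
import numpy as np

def sentence_reward(summary_vecs, reference_vecs):
    """Map a document-level distance d to a bounded reward exp(-d).

    As a lightweight stand-in for the Word Mover's-style distance, d is
    taken here as 1 - cosine similarity of the mean word vectors."""
    s = np.asarray(summary_vecs).mean(axis=0)
    r = np.asarray(reference_vecs).mean(axis=0)
    d = 1.0 - float(s @ r / (np.linalg.norm(s) * np.linalg.norm(r)))
    return math.exp(-d)
```

The exponential keeps the reward in (0, 1], with 1 reached only when the two texts coincide semantically, so it composes cleanly with the other bounded reward terms.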
3.4 Training and inference
The backbone model was trained to initialize the policy. Due to resource limitations, the model is based on BERT-base with 12 Transformer blocks, a hidden size of 768, and 110M parameters. During RL training, the PPO algorithm was adopted to optimize the policy. The Adam optimizer [kingma2014adam] was used with a learning rate of 1e-5 to optimize the reward in Eq. (6). At test time, the model selects the top sentences with the highest probabilities predicted by the trained policy to form a summary.
4 Experimental Setup
We used three legal datasets for evaluation as follows.
4.1.1 Plain English Summarization of Contracts (PESC)
is a legal dataset written in plain English, in which a text snippet is paired with an abstractive summary [manor2019plain]. Each legal snippet contains approximately 595 characters. The length of a summary is around 202 characters.
4.1.2 BillSum
consists of 22,218 US Congressional bills with human-written reference summaries collected from the United States Government Publishing Office [kornilova2019billsum]. The data is split into 18,949 training bills and 3,269 testing bills. Each document contains around 46 sentences, and each summary includes around 6 sentences.
4.1.3 Legal Case Reports dataset (LCR)
contains 3,890 Australian legal cases from the Federal Court of Australia [galgani2012combining]. Each case contains around 221 sentences and 8.3 catchphrases, which are treated as a summary. We divided the data into three sets: training (2,590 samples), validation (800), and testing (500).
4.2 Settings and evaluation metrics
The summarization model is based on BERT-base with 12 Transformer blocks and a hidden size of 768. We used one layer for the LSTM backbone and two linear layers for the Transformer backbone, with an output vector of size 768. All models were trained with a batch size of 256 on a single Tesla T4 GPU. In our experiments, we set λ_1, λ_2, λ_3, and the KL coefficient β by random search tuning on the validation set. At test time, the number of selected sentences was set to 3 for the PESC, 6 for the BillSum, and 8 for the LCR datasets. We used ROUGE scores (parameters: -c 95 -m -r 1000 -n 2) to measure the quality of the summaries, reporting the F-scores of ROUGE-1, ROUGE-2, and ROUGE-L as the main metrics.
5 Experimental Results
5.1 Legal Cases: Working with different summarization backbones
In this section, we report quantitative improvements over the backbone baselines on two formal and complex legal datasets: BillSum and LCR. Here, Backbone is the BERTSum model with three different sentence selectors: Linear, LSTM, and Transformer. In addition, we also compare our backbones to classical legal summarization baselines: CaseSummarizer [polsley2016casesummarizer] and Restricted Boltzmann Machine (RBM) [verma2018extractive].
Table 1 reports the ROUGE scores of the backbones before and after training with RL. Notably, our backbones achieve strong results, significantly outperforming traditional legal summarizers. The classical methods use handcrafted features and only work for certain types of legal documents, thus showing poor results on these datasets. We observe that after training with our reward function, the performance of the summarization backbones improves by a substantial amount. The improvements indicate that the proposed method can effectively improve all variants of the pretrained supervised model. The new reward function in Eq. (6) forces the summarization model to select salient sentences that are good in terms of both lexical and semantic aspects at the word and sentence levels. Overall, the Transformer backbone gives the best results, possibly because it uses a more complicated architecture than Linear and LSTM. Given its high accuracy, we use the Transformer as the main backbone in the remaining experiments.
5.2 Terms of Service: State-of-the-art results on PESC
In this section, we compare our method to strong methods from the literature on the PESC dataset. Lead-k extracts the first k sentences of a document to form a summary [manor2019plain]. ROUGE [lin2004rouge] uses ROUGE-L as the final reward to optimize the policy [li2019deep, pasunuru2018multi]. DSR [li2019deep] is a strong reward method that boosts the performance of the backbone. BLANC [vasilyev2020fill] is a metric that estimates the quality of a summary; we use it to provide the reward signal when optimizing the policy. REFRESH [narayan2018ranking] uses an encoder-decoder to rank sentences, and the top-ranked sentences are assembled into a summary; the model is updated using the REINFORCE algorithm.
Table 2 reports the results on the PESC dataset. For the reward-based methods, all models in this experiment share the same configuration except for the final reward. Noticeably, all of these models improve over the supervised backbone (Transformer). Our model with r_unified achieves the SOTA result on the dataset. There is also a large performance gap between our method and the REFRESH and Lead-k baselines: our method scores 2.23%, 1.3%, and 1.45% higher than REFRESH on ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Compared with Lead-k, the best model reported in [manor2019plain], our approach performs nearly 5 ROUGE-L points higher.
5.3 Ablation studies
Model          ROUGE-1  ROUGE-2  ROUGE-L
Ours w/o KL    24.83    8.95     21.36
Ours w/o PPO   24.79    9.13     21.17
We conduct further evaluations to assess the impact of the KL reward and PPO on the proposed method. We also ablate the components of our unified reward, including r_kw and r_sen. Table 3 shows that the proposed approach benefits significantly from the KL rewards: without them, ROUGE scores drop by approximately 1%. We observe similar drops in performance when we replace PPO with the classical REINFORCE, which underlines the importance of using proximal policy optimization. Training without r_kw and r_sen also reduces performance. Since the ROUGE score is our evaluation objective, combining it with our r_kw and r_sen provides a sufficient training signal to the backbone. As expected, the full unified reward performs best.
5.4 A case study
In this section, we take a closer look at an extracted summary produced by the RL model and compare it to the output of the supervised-trained model (SL summary) and the gold summary.
Catchphrases (gold summary): appeal from a decision of the federal magistrates court . application for a protection visa . serious personal assault . whether tribunal failed to comply with s 430 of the migration act . whether tribunal failed to take adequately into account relevant material . whether tribunal failed adequately to take into account integer of appellant husband ’s claims . reviewable error established . migration
SL summary: the tribunal ’s decision 3 the appellant husband provided to the tribunal a comprehensive statutory declaration of 69 paragraphs assembled on 27 october 2004 , … whereby his honour dismissed the appellants ’ application for judicial review of the decision of the refugee review tribunal ( ‘ the tribunal ’ ) … he had enjoyed an association with the leadership of the united national party ( ‘ unp ’ ) in sri lanka , being an association which had largely formed the political context to the present claims to refugee status of the appellants . …
RL summary: the second appellant is the wife of the first appellant ’s second marriage which took place in 1988 , and both of the appellants are of sinhalese ethnicity . the tribunal ’s decision 3 the appellant husband provided to the tribunal a comprehensive statutory declaration of 69 paragraphs assembled on 27 october 2004 , … whereby his honour dismissed the appellants ’ application for judicial review of the decision of the refugee review tribunal ( ‘ the tribunal ’ ) made on 11 april 2005 and handed down on 4 may 2005 . …
Table 4 shows an example from the LCR dataset. Thanks to the keyword-level semantic reward, the important term "appellant(s)" (highlighted in red) that appears in the gold summary occurs more often in the RL summary. The KL reward also ensures that the RL summary is not too different from the SL summary. The only sentence in the RL summary that differs from the SL summary is highlighted in blue; it gives readers a clearer context for this particular legal case.
6 Related Works
Recently, there has been a large body of work on document summarization using reinforcement learning. Most of this research employs discrete functions like ROUGE as part of the reward [wu2018learning, li2019deep]. [wu2018learning] trained a model to evaluate the coherence of the current text and used the coherence score as an intermediate reward. In contrast to the original ROUGE score, which treats all words in the text equally, [pasunuru2018multi] introduced a novel saliency reward that gives high weight to critical words in the summary, together with an entailment scorer that rewards logically entailed summaries. The work most similar to ours is [li2019deep], which proposed a distributional semantic reward to capture semantic relations between similar words. However, that study focuses on the semantics of the entire text; in contrast, we try to simulate the summarization process of legal experts, so our reward focuses on keyword-level semantics.
In the legal domain, many approaches to text summarization have been presented. FLEXICON [gelbart1991flexicon] is the pioneer in this field: it references keywords of the original text against a large database to find a candidate summary. [kim2012summarization] considered a document as a weighted graph in which sentences are represented as nodes, and the summary is a collection of high-value nodes. [galgani2012combining] introduced a rule-based system using manual knowledge acquisition that combines different summarization techniques. Recently, [pandya2019automatic] proposed combining K-means clustering and a TF-IDF word vectorizer to summarize legal case reports. However, to the best of our knowledge, the adaptation of reinforcement learning to legal document summarization remains an open question.
7 Conclusion
In this paper, we have presented a new method for summarizing various types of legal documents.
Our approach of training the summarization model with a novel reward scheme reaches a ROUGE score of 25.70% on the PESC dataset, achieving the SOTA result. To further validate our approach, we experimented extensively on different datasets with different configurations. Experimental results show that the method is consistently better than strong baselines on the additional BillSum and Legal Case Reports datasets.
Future work will expand this research to tasks such as case entailment: using reinforcement learning with novel reward designs to find precedents in a law database given a query case, aiming for expert-level results.