Searching for Legal Clauses by Analogy. Few-shot Semantic Retrieval Shared Task

11/10/2019 ∙ by Łukasz Borchmann, et al.

We introduce a novel shared task for semantic retrieval from legal texts, in which one is expected to perform so-called contract discovery: extract specified legal clauses from documents, given a few examples of similar clauses from other legal acts. The task differs substantially from conventional NLI and legal information extraction shared tasks. Its specification is followed by an evaluation of multiple k-NN based solutions within a unified framework proposed for this branch of methods. It is shown that state-of-the-art pretrained encoders fail to provide satisfactory results on the proposed task, whereas Language Model based solutions perform well, especially when unsupervised fine-tuning is applied. In addition to the ablation studies, the question of how the accuracy of detecting relevant text fragments depends on the number of examples available was addressed. Along with the dataset and reference results, legal-specialized LMs were made publicly available.




Related Work

There is a tremendous amount of work related to information retrieval in general; however, following gillick2018endtoend, we consider the problem stated in an end-to-end manner, where nearest neighbor search is performed on dense document representations. Under this assumption, the main issue is to obtain reliable representations of documents, where by document we mean any self-contained unit that can be returned to the user as a search result [books/daglib/0031897]. We use the term segment with the same meaning whenever needed to achieve clarity. Many approaches considered in the literature rely on word embeddings and an aggregation strategy. Simple methods include averaging, as in the continuous bag-of-words (CBOW) model [41224], or frequency-weighted averaging with a decomposition method applied [arora2017asimple]. More sophisticated schemes utilize multiple weights, such as a novelty score, a significance score and corpus-wise uniqueness [Yang2018ZerotrainingSE], or compute a vector of locally-aggregated descriptors [ionescu2019vector]. Most of the proposed methods are orderless; their limitations were recently discussed by DBLP:journals/corr/abs-1902-06423. There are, however, pooling approaches preserving spatial information, such as the hierarchical pooling operation [shen2018baseline]. Other methods of obtaining sentence representations from word embeddings include training an autoencoder on a large collection of unlabeled data [zhang2018learning] or utilizing random encoders [wieting2019training]. Despite the shortcomings of the CBOW model and the availability of many sophisticated alternatives, it is commonly used due to its ability to ensure strong results on many downstream tasks.

Different approaches assume training encoders that provide document embeddings in an unsupervised or supervised manner, without the need for explicit aggregation. The former include Skip-Thought Vectors, trained with the objective of reconstructing the surrounding sentences of an encoded passage [NIPS2015_5950]. Although this method was outperformed by supervised models trained on a single NLI task [conneau2017supervised], on paraphrase corpora [jiao-etal-2018-convolutional] or on multiple tasks [DBLP:journals/corr/abs-1804-00079], the objective of predicting the next sentence is used as an additional objective in multiple novel models, such as the Universal Sentence Encoder [DBLP:journals/corr/abs-1803-11175]. Even though many transformer-based Language Models implement their own pooling strategy for generating sentence representations (special token pooling), they were shown to yield weak sentence embeddings, as described recently by reimers-2019-sentence-bert. The authors proposed a superior method of fine-tuning the pretrained BERT network with Siamese and triplet network structures to obtain sentence embeddings. There were attempts to utilize semantic similarity methods explicitly in the legal domain, e.g. for case law entailment within the COLIEE shared task. During a recent edition, Rabelo:2019:CST:3322640.3326741 utilized a BERT model fine-tuned on the provided train set in a supervised manner and achieved the highest F-score among all teams. However, due to reasons described in Section Document, their approach is not consistent with the nearest neighbor search we are aiming at. The presented selection of approaches does not cover the whole body of research; it outlines, however, the variety of methods considered in the literature. Those chosen as reference for the proposed shared task are described briefly in Section Document.

Contract Discovery Shared Task

The aim of this task is to provide substrings from the requested documents representing clauses analogous (semantically and functionally equivalent) to the examples provided from other documents. First, subsets of Corporate Bond and Non-disclosure Agreement documents from the US, as well as Annual Reports from UK charities, were annotated in such a way that clauses of the same type were selected (e.g. a clause determining governing law; clause types depend on the type of legal act). Clauses can consist of a single sentence, multiple sentences or sentence parts. The exact type of clause is not important during the evaluation, since no full-featured training is allowed and one has to rely solely on a set of a few sample clauses during execution. Each document was annotated by two regular annotators and then reviewed (or resolved) by a super-annotator, who also decided the gold standard. The average score of the regular annotators when compared to the gold standard (after super-annotation) was taken to establish human baseline performance. Overall statistics regarding the dataset are presented in Table Document. The detailed list of clauses and their examples can be found in Table LABEL:tab:clauses1, Table LABEL:tab:clauses2 and Table LABEL:tab:clauses3. The dataset is available publicly on GitHub, as well as at a git-based platform [gralinski:2016:gonito] (the dataset is available as a git repository: git://; see also the web site), where all readers are encouraged to submit their own solutions.

File structure

The content of documents can be found in reference.tsv files. Input files (in.tsv) consist of tab-separated fields: Target ID (e.g. 57), Clause considered (e.g. governing-law), Example #1 (e.g. 59 15215-15453), …, Example #N. Each example consists of a document ID and a character range. Ranges can be discontinuous; in such a case their parts are separated with a comma, e.g. 4103-4882,12127-12971. The file with answers (expected.tsv) contains one answer per line, consisting of an entity name (to be copied from the input) and a character range in the same format as described above. The reference file contains 2 tab-separated fields: document ID and document content.
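As an illustration of the format above, a minimal parser for one in.tsv line might look as follows (the helper names are ours, for illustration only, and are not part of any official tooling):

```python
def parse_ranges(spec):
    """Parse a (possibly discontinuous) character-range spec,
    e.g. '4103-4882,12127-12971' -> [(4103, 4882), (12127, 12971)]."""
    return [tuple(int(x) for x in part.split("-")) for part in spec.split(",")]

def parse_input_line(line):
    """Split one in.tsv line into target ID, clause type and seed examples.
    Each example field holds a document ID and a character-range spec."""
    fields = line.rstrip("\n").split("\t")
    target_id, clause = fields[0], fields[1]
    examples = []
    for field in fields[2:]:
        doc_id, spec = field.split(" ", 1)
        examples.append((doc_id, parse_ranges(spec)))
    return target_id, clause, examples
```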

Documents annotated 586
Average document length (words) 24284
Clause types 21
Average clause length (words) 110
Clause instances 2663
Table : Basic statistics regarding released dataset.

Legal Corpus

In addition, we release a large, cleaned plain-text corpus of legal texts for the purposes of unsupervised model training or fine-tuning. It is based on US EDGAR documents and consists of approx. 1M documents and 2B words in total (1.5G of text after xz compression).


Solutions based on networks consuming pairs of sequences, such as BERT in the sentence pair classification setting [devlin2018bert], are considered out of the scope of this paper, since they are suboptimal in terms of performance: they require expensive encoding of all seed (example) and target combinations, making such solutions unsuitable for semantic similarity search due to the combinatorial explosion [reimers-2019-sentence-bert]. Instead, in this section we describe simple k-NN based approaches that we propose for the problem stated. They assume pre-encoding of all candidate segments and can be described within a unified framework consisting of segmenters, vectorizers, projectors, aggregators, scorers and choosers. This taxonomy, consistent with the assumptions made by gillick2018endtoend, is presented in order to highlight the similarities and differences between particular solutions during their introduction and ablation studies.

  • Segmenter is utilized to split text into candidate sub-sequences to be encoded and considered in further steps. All the described solutions rely on candidate sentences and n-grams of sentences, determined with a spaCy CNN model.

  • Vectorizer produces vector representations of texts on either word, sub-word or segment (e.g. sentence) level. Examples include sparse TF-IDF representations, static word embeddings and neural sentence encoders.

  • Projector projects embeddings into a different space (e.g. decomposition methods such as PCA or ICA).

  • Aggregator uses word or sub-word unit embeddings to create a segment embedding (e.g. embedding mean, inverse frequency weighting, autoencoder).

  • Scorer compares two or more embeddings and returns computed similarities. Since we often compare multiple seed embeddings with one embedding of a candidate segment, the scorer includes policies to aggregate scores obtained for comparisons with multiple seeds into the final candidate score (e.g. mean of individual cosine similarities or max pooling over Word Mover's Distances).

  • Chooser determines whether to return a candidate segment with a given score (e.g. a threshold, one best per document, or a union of these). For simplicity, during evaluation we restricted ourselves to a chooser returning only the single most similar candidate.
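The components above can be read as composable functions. The following toy pipeline is purely illustrative: every concrete implementation below is a stand-in of our own devising, not one of the methods evaluated in the paper (e.g. the segmenter here is a naive period split, whereas the evaluated solutions use a spaCy model).

```python
import math

def segmenter(text):
    # Toy segmenter: naive sentence split on periods.
    return [s.strip() for s in text.split(".") if s.strip()]

def vectorizer(segment):
    # Toy word-level vectorizer: two character-count features per token.
    return [[len(tok), sum(c in "aeiou" for c in tok)] for tok in segment.split()]

def aggregator(word_vectors):
    # Mean pooling over word vectors -> one segment embedding.
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def scorer(seed_embeddings, candidate_embedding):
    # Policy: mean of individual cosine similarities over all seeds.
    sims = [cosine(s, candidate_embedding) for s in seed_embeddings]
    return sum(sims) / len(sims)

def chooser(scored_segments):
    # One best candidate per document, as in the evaluation setting.
    return max(scored_segments, key=lambda pair: pair[1])

def retrieve(seeds, target_text):
    seed_embs = [aggregator(vectorizer(s)) for s in seeds]
    candidates = segmenter(target_text)
    scored = [(c, scorer(seed_embs, aggregator(vectorizer(c)))) for c in candidates]
    return chooser(scored)
```

The point of the sketch is the interface: swapping any single component (say, the vectorizer for a language model, or the chooser for a thresholded one) leaves the rest of the pipeline unchanged.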

The next section describes vectorizers, aggregators and scorers utilized during evaluation.


Most machine learning algorithms require data to be represented as vectors of numbers; thus, one needs to identify a way to encode text fragments so that they can be processed optimally. In recent years, there has been a tremendous increase in the quality of NLP task solutions due to incorporating dense vector representations of tokens and longer fragments of text. Further subsections describe the representations that were tested.


From the variety of methods operating on whole documents (segments), we decided to rely on two that may be considered state-of-the-art, as well as on one sparse, preneural representation for reference (TF-IDF).

  • TF-IDF555It can also be viewed as a word-level vectorizer with a sum aggregator. Word-level TF-IDF is, however, rather useless. — one of the most widely used vectorization methods is Term Frequency–Inverse Document Frequency (TF-IDF). In this method, we embed a document or sentence into a vector whose size equals the vocabulary size, where the i-th position represents a score connected with the i-th word from the vocabulary. TF-IDF assigns each word a score representing its importance: the TF part simply checks how often a given word is used in a given document, while the IDF part yields high scores for tokens that occur only in a limited number of documents.

  • Universal Sentence Encoder — a transformer-based encoder where the element-wise sum of word representations is treated as the sentence embedding [DBLP:journals/corr/abs-1803-11175], trained with a multi-task objective. The original models released by the authors were used for the purpose of evaluation.

  • Sentence-BERT — a modification of the pretrained BERT network, utilizing Siamese and triplet network structures to derive sentence embeddings, trained with the explicit objective of making them comparable with cosine similarity [reimers-2019-sentence-bert]. The original models released by the authors were used for the purposes of evaluation.
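To make the simplest of these vectorizers concrete, a minimal TF-IDF with cosine scoring can be sketched as follows. Note the simplifying assumptions: raw-count TF, unsmoothed IDF, unigram tokenization by whitespace; the variant evaluated in the paper uses a binary TF term over 1–2 grams.

```python
import math
from collections import Counter

def fit_idf(documents):
    """IDF per token: log(N / df), computed over whitespace-tokenized docs."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}

def tfidf_vector(document, idf):
    """Sparse TF-IDF vector as a dict token -> weight."""
    tf = Counter(document.lower().split())
    return {tok: freq * idf.get(tok, 0.0) for tok, freq in tf.items()}

def sparse_cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```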


For the purposes of evaluation, multiple contextual embeddings from transformer-based Language Models were used, as well as static (context-free) word embeddings for reference.

  • GloVe — Global Vectors for word representation [Pennington14glove:global] is a method of transforming tokens from the vocabulary into fixed-size vectors. To obtain a vector for a word, the co-occurrence of the given word with other words is considered, since according to the distributional hypothesis, words sharing contexts tend to share similar meanings [doi:10.1080/00437956.1954.11659520].

  • Transformer-based LMs — many approaches to generating context-dependent vector representations have been proposed in recent years (e.g. Peters:2018, DBLP:journals/corr/VaswaniSPUJGKP17). One important advantage over static embeddings is the fact that every occurrence of the same word is assigned a different embedding vector depending on the context in which the word is used. Thus, it is much easier to address issues arising with pretrained static embeddings (e.g. taking into consideration the polysemy of words). For the purposes of evaluation we relied on transformer-based models provided by the authors of particular architectures, utilizing the Transformers library [Wolf2019HuggingFacesTS]. These include BERT [DBLP:journals/corr/abs-1810-04805], GPT-1 [Radford2018ImprovingLU], GPT-2 [gpt2] and RoBERTa [liu2019roberta]. They differ substantially and introduce many innovations; however, all are based on either the encoder or the decoder from the original model proposed for sequence-to-sequence problems [DBLP:journals/corr/VaswaniSPUJGKP17]. Selected models were fine-tuned on legal texts and re-evaluated.

From (Sub)word- to Document-level with Scorers and Aggregators

Beyond conceptually simple methods such as averaging or max-pooling operations, multiple solutions utilizing (sub)word embeddings to compare documents can be used. They can be based either on different methods of aggregation, or implement a document-level similarity (distance) measure relying on (sub)word embeddings.

  • Smooth Inverse Frequency (SIF) — a method proposed by arora2017asimple, where the representation of a document is obtained in two steps. First, each word embedding is weighted by a/(a + p(w)), where p(w) stands for the underlying word's relative frequency and a is a weight parameter. Then, the projections on the first tSVD-calculated principal component are subtracted, providing the final representations.

  • Word Mover's Distance (WMD) — a method of calculating similarity between documents. For two documents, embeddings calculated for each word (e.g. with GloVe) are matched between the documents so that semantically similar pairs of words are aligned. This matching procedure generally leads to better results than simply averaging the embeddings of each document and taking the similarity between the resulting centers of mass as the document similarity [pmlr-v37-kusnerb15]. Recently, zhao2019moverscore showed it may be beneficial to use it with contextual word embeddings.

  • Discrete Cosine Transform (DCT) — a method for generating document-level representations in an order-preserving manner, adapted from image compression to NLP by almarwani2019efficient. After mapping an input sequence of real numbers to the coefficients of orthogonal cosine basis functions, low-order coefficients can be used as document embeddings, outperforming vector averaging on most tasks, as shown by the authors.
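The SIF scheme above can be sketched in a few lines. The word vectors and background frequencies are assumed inputs; this is an illustration of the weighting and common-component removal as described by arora2017asimple, not the implementation evaluated here.

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """sentences: lists of tokens; word_vectors: token -> np.ndarray;
    word_freq: token -> relative frequency in a background corpus."""
    embs = []
    for sent in sentences:
        toks = [t for t in sent if t in word_vectors]
        # Step 1: weight each word vector by a / (a + p(w)), then average.
        weights = np.array([a / (a + word_freq.get(t, 0.0)) for t in toks])
        vecs = np.stack([word_vectors[t] for t in toks])
        embs.append(weights @ vecs / len(toks))
    embs = np.stack(embs)
    # Step 2: subtract the projection on the first tSVD-computed
    # principal component (the common component).
    pc = np.linalg.svd(embs, full_matrices=False)[2][0]
    return embs - np.outer(embs @ pc, pc)
```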


Documents were split into halves to form validation and test sets. Evaluation is performed with a repeated random sub-sampling validation procedure. Sub-samples (combinations drawn for each of the 21 clause types) from the considered set of annotations are split into seed documents and a target document; one is expected to return clauses similar to the seeds from the target. The selected interval results in 1-shot to 5-shot learning, considered few-shot learning [Wang2019GeneralizingFA], whereas with the chosen number of sub-samples we expect improvements of 0.01 to be statistically significant.


A soft F1 metric on character-level spans is used for the purpose of evaluation, as implemented in the GEval tool [gralinski-etal-2019-geval]. Roughly speaking, it is the conventional F1 measure with the precision and recall definitions altered to reflect the partial success of returning entities. For example, if the expected clause spans a single character range and the system returns two ranges (assuming the clause occurs twice within the document), recall equals the fraction of the relevant span covered by the returned ranges (the part of the relevant item that was selected), and precision the fraction of the returned characters that turned out to be relevant. The Hungarian algorithm is employed to solve the problem of assigning expected to returned ranges. The soft F1 metric has the advantage of being based on a widely-utilized measure, while abandoning the binary nature of a match, which is undesirable in the task described.
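The soft precision and recall described above can be illustrated for the case of a single expected span; the names below are ours, and the full metric additionally matches multiple expected spans to returned ones via the Hungarian algorithm, which is omitted here.

```python
def span_overlap(a, b):
    """Length of the overlap between two (start, end) character ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def soft_precision_recall(expected, returned):
    """Soft precision/recall for one expected span vs. a list of returned
    spans (assumed non-overlapping)."""
    covered = sum(span_overlap(expected, r) for r in returned)
    expected_len = expected[1] - expected[0]
    returned_len = sum(r[1] - r[0] for r in returned)
    recall = covered / expected_len if expected_len else 0.0
    precision = covered / returned_len if returned_len else 0.0
    return precision, recall

def soft_f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For instance, if the expected clause spans characters 100–300 and the system returns the ranges 100–200 and 400–500, half of the relevant span is covered and half of the returned characters are relevant, so both soft precision and soft recall equal 0.5.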


Segmenter Vectorizer Projector Scorer Aggregator dev-0 test-A
sentence TF-IDF (1–2 grams, binary TF term) mean cosine 0.40 0.38
tSVD666TF-IDF with truncated SVD decomposition is commonly referred to as Latent Semantic Analysis [Halko:2011:FSR:2078879.2078881]. mean cosine 0.41 0.39
sentence GloVe (300d, Wikipedia & Gigaword) mean cosine mean 0.34 0.34
mean WMD 0.37 0.35
SIF tSVD777SVD in the SIF method is used to remove a single common component [arora2017asimple]. mean cosine SIF 0.39 0.37
sentence GloVe (300d, EDGAR) mean cosine mean 0.37 0.36
mean WMD 0.37 0.35
SIF tSVD mean cosine SIF 0.42 0.41
sentence Sentence-BERT (base-nli-mean) mean cosine mean888built-in 0.33 0.31
sentence USE (multilingual) mean cosine 0.39 0.38
sentence BERT, last layer (large-cased) mean cosine mean 0.21 0.21
sentence GPT-1, last layer mean cosine mean 0.37 0.36
sentence GPT-2, last layer (large) mean cosine mean 0.42 0.41
sentence RoBERTa, last layer (large) mean cosine mean 0.31 0.31
sentence GPT-1, last layer (fine-tuned) mean cosine mean 0.44 0.43
sentence GPT-1, last layer (fine-tuned) fICA mean cosine mean 0.46 0.44
sentence GPT-2, last layer (large, fine-tuned) mean cosine mean 0.45 0.44
sentence GPT-2, last layer (large, fine-tuned) fICA mean cosine mean 0.47 0.45
1–3 sen. GPT-1, last layer (fine-tuned) mean cosine mean 0.48 0.47
1–3 sen. GPT-1, last layer (fine-tuned) fICA (n=500) mean cosine mean 0.51 0.49
1–3 sen. GPT-2, last layer (large, fine-tuned) mean cosine mean 0.47 0.46
1–3 sen. GPT-2, last layer (large, fine-tuned) fICA (n=400) mean cosine mean 0.52 0.51
human 0.85 0.84
Table : Selected results when returning a single, most similar segment, determined with the given segmenters, vectorizers, projectors, scorers and aggregators. All the distributed models were evaluated, but only the best ones from each architecture are presented here for simplicity.

Table Document recapitulates the most important results of the conducted evaluation. Sentence-BERT and the Universal Sentence Encoder were not able to outperform the simple TF-IDF approach, especially when SVD decomposition was applied (a setting commonly referred to as Latent Semantic Analysis). Static word embeddings with SIF weighting performed similarly to TF-IDF, or better in case they were trained on the legal text corpus instead of general English. It could not be clearly confirmed that utilization of WMD or DCT is beneficial. For the latter, the best results were achieved when only the lowest-order coefficient was used, which in the case of the k-NN algorithm leads to exactly the same answers as mean pooling. Interestingly, of all the released USE models, the multilingual ones performed best: for the monolingual universal-sentence-encoder-large model, scores were 10 percentage points lower. The best Sentence-BERT model performed significantly worse than the best USE; note that the authors of Sentence-BERT compared themselves to the monolingual models released earlier, which they indeed outperform. In the case of averaging (sub)word embeddings from the last layer of neural Language Models, the results were either comparable or inferior to TF-IDF. The best-performing language models were GPT-1 and GPT-2. Fine-tuning these on a subsample of the legal text corpus improved the results significantly, by 3–7 points. LMs seem to benefit neither from SIF nor from removal of a single common component; their performance can, however, be mildly improved with conventionally used decompositions, such as ICA [Hyvrinen2000IndependentCA]. Considerable improvement can be achieved by considering segments different from a single sentence, such as n-grams of sentences. Figure Document presents how the performance of particular methods changes as a function of the number of examples available, within the simple similarity averaging scheme used in all the presented solutions. In general, methods benefit substantially from the availability of a second example. The presence of more examples leads to decreased variance but no improvement of the median score.


Figure : Performance as a function of the number of examples available.


The brief evaluation presented in the previous section has multiple limitations. First, it assumed retrieval of a single, most similar segment, whereas one might be expected to return multiple clauses. We consider this restriction justifiable, however, during a preliminary comparison of applicable methods; multiple alternative choosers can be proposed in the future. Secondly, all the methods evaluated assume scoring with the policy of averaging individual similarities. We encourage readers to experiment with different pooling methods or meta-learning strategies. Moreover, even the LM-based methods we studied the most were underexploited: note that, for example, only embeddings from the last layer were evaluated, even though higher layers may better capture semantics. Finally, it is in principle possible to address the task in entirely different ways, for example by performing no segmentation or aggregation of word embeddings at all, and matching clauses on the word level instead, which may be an interesting research direction.

Summary and Conclusions

We introduced a novel shared task for semantic retrieval from legal texts, which differs substantially from conventional NLI. It is heavily inspired by enterprise solutions referred to as contract discovery, focused on ensuring the inclusion of relevant clauses or on their retrieval for further analysis. What distinguishes Searching for Legal Clauses by Analogy is, among other things, conceptual:

  • it assumes mining for candidate sequences in real texts (in contrast to providing them explicitly);

  • it is suited for few-shot methods, filling the gap between conventional sentence classification and NLI tasks based on sentence pairs.

We considered the problem stated in an end-to-end manner, where nearest neighbor search is performed on document representations. Under this assumption, the main issue was to obtain representations of text fragments, which we referred to as segments. The specification of the task was followed by an evaluation of multiple k-NN based solutions within a unified framework which may be used to describe future solutions. Moreover, the practical justification of handling the problem with k-NN was briefly introduced. It was shown that in this particular setting pretrained, universal encoders fail to provide satisfactory results; one may suspect this is a result of the difference between the domain they were trained on and the legal domain. During the evaluation, solutions based on Language Models performed well, especially when unsupervised fine-tuning was applied. In addition to the ability to fine-tune a method on legal texts, the most important success indicator so far was the consideration of multiple, sometimes overlapping substrings instead of single sentences. Moreover, it has been demonstrated that methods benefit substantially from the availability of a second example, and the presence of more examples leads to decreased variance, even when a simple similarity averaging scheme is considered. During the discussion of the presented methods and their limitations, possible directions towards improving the baseline methods were briefly outlined. In addition to the dataset and reference results, legal-specialized LMs were made publicly available to help the research community perform further experiments.