CAIL2019-SCM: A Dataset of Similar Case Matching in Legal Domain

by Chaojun Xiao, et al.

In this paper, we introduce CAIL2019-SCM, the Chinese AI and Law 2019 Similar Case Matching dataset. CAIL2019-SCM contains 8,964 triplets of cases published by the Supreme People's Court of China. CAIL2019-SCM focuses on detecting similar cases: participants are required to check which two cases in each triplet are more similar. There are 711 teams that participated in this year's competition, and the best team reached a score of 71.88. We have also implemented several baselines to help researchers better understand this task. The dataset and more details can be found from




1 Introduction

Similar Case Matching (SCM) plays a major role in the legal system, especially in common law systems, where the most similar past cases determine the judgment results of new cases. As a result, legal professionals often spend much time finding and judging similar cases to prove fairness in judgment. Since automatically finding similar cases can benefit the legal system, we select SCM as one of the tasks of CAIL2019.

The Chinese AI and Law Challenge (CAIL) is a competition on applying artificial intelligence technology to legal tasks. The goal of the competition is to use AI to help the legal system. CAIL was first held in 2018, and the main task of CAIL2018 Xiao et al. (2018); Zhong et al. (2018b) is predicting the judgment results from the fact description. The judgment results include the accusation, the applicable articles, and the term of penalty. CAIL2019 contains three different tasks: Legal Question Answering, Legal Case Element Prediction, and Similar Case Matching. We focus on SCM in this paper.

More specifically, CAIL2019-SCM contains 8,964 triplets of legal documents. Every legal document is collected from China Judgments Online. To ensure the similarity of the cases in one triplet, all selected documents are related to Private Lending. Every document in a triplet contains the fact description. CAIL2019-SCM requires researchers to decide which two cases in a triplet are more similar. By detecting similar cases in triplets, the same approach can be applied to rank all documents and find the most similar document in a database. A total of 247 teams participated in CAIL2019-SCM, and the best team reached a test score of 72.66, about 3 points higher than the best baseline (see Table 2). The results show that existing methods have made great progress on this task, but there is still much room for improvement.

In other words, CAIL2019-SCM can benefit the research of legal case matching. There are several main challenges in CAIL2019-SCM: (1) The difference between the documents in a triplet may be small, so it is hard to decide which two documents are more similar. Moreover, the similarity is defined by legal workers, so we must incorporate legal knowledge into this task rather than compute similarity only at the lexical level. (2) The documents are quite long, which makes it hard for existing methods to capture document-level information.

In the following parts, we will give more details about CAIL2019-SCM, including related works about SCM, the task definition, the construction of the dataset, and several experiments on the dataset.

2 Related Work

2.1 Semantic Text Matching

SCM aims to measure the similarity between legal case documents. Essentially, it is an application of semantic text matching, which is central to many tasks in natural language processing, such as question answering, information retrieval, and natural language inference. Take information retrieval as an example: given a query and a database, a semantic matching model is required to judge the semantic similarity between the query and the documents in the database. Tasks related to semantic matching have attracted the attention of many researchers in recent decades.

Intuitively, traditional approaches calculate word-to-word similarity with the vector space model, e.g., term frequency-inverse document frequency (TF-IDF) Wu et al. (2008) and bag-of-words Bilotti et al. (2007). However, due to the variety of words in different texts, these approaches achieve limited success on the task.
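As a self-contained illustration of this vector-space approach, the sketch below builds TF-IDF vectors over tokenized documents and compares them with cosine similarity. The tokenization and weighting details are simplified assumptions for exposition, not the exact schemes used in the cited works.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: (tf[w] / len(doc)) * idf[w] for w in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)
```

On toy documents, two lending cases that share vocabulary score higher against each other than against an unrelated case, which is exactly the behavior (and the limitation) of lexical-level matching discussed above.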

Recently, with the development of deep learning in natural language processing, researchers have attempted to apply neural models to encode text into distributed representations. The Siamese structure Bromley et al. (1994) for metric learning achieves great success and is widely applied Amiri et al. (2016); Liu et al. (2018); Mueller and Thyagarajan (2016); Neculoiu et al. (2016); Wan et al. (2016); He et al. (2015). Besides, many researchers put emphasis on integrating syntactic structure into semantic matching Liu et al. (2018); Chen et al. (2017) and on multi-level text matching with attention-aware representations Duan et al. (2018); Tan et al. (2018); Yin et al. (2016).

Nevertheless, most previous studies are designed for identifying the relationship between two sentences of limited length.

2.2 Legal Intelligence

Legal intelligence has attracted wide attention from researchers, and applying NLP techniques to legal problems has become increasingly popular in recent years. Previous works Kort (1957); Keown (1980); Lauderdale and Clark (2012) focus on analyzing existing cases with mathematical tools. With the development of deep learning, more researchers have devoted effort to predicting the judgment results of legal cases Luo et al. (2017); Hu et al. (2018); Zhong et al. (2018a); Chalkidis et al. (2019); Jiang et al. (2018); Yang et al. (2019). Furthermore, there are many works on generating court views to interpret charge results Ye et al. (2018), information extraction from legal text Vacek and Schilder (2017); Vacek et al. (2019), legal event detection Yan et al. (2017), identifying applicable law articles Liu et al. (2015); Liu and Hsieh (2006), and legal question answering Kim et al. (2015); Fawei et al. (2018).

Meanwhile, retrieving legal documents related to a query has been studied for decades and is a critical issue in applications of legal intelligence. Raghav et al. (2016) emphasize exploiting paragraph-level and citation information. Kano et al. (2017) and Zhong et al. (2018b) held legal information extraction and entailment competitions to promote progress in legal case retrieval.

3 Overview of Dataset

3.1 Task Definition

We first define the task of CAIL2019-SCM. The input of CAIL2019-SCM is a triplet (A, B, C), where A, B, and C are fact descriptions of three cases. Here we define a function sim(·, ·) for measuring the similarity between two cases. The task of CAIL2019-SCM is then to predict whether sim(A, B) > sim(A, C) or sim(A, B) < sim(A, C).
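Since the task reduces to comparing two pairwise similarities, the decision rule and its evaluation can be sketched as follows; the word-overlap similarity here is a toy stand-in for a real matching model, and the triplet/label format is an illustrative assumption rather than the dataset's exact schema.

```python
def predict(sim, a, b, c):
    """Return 'B' if case b is judged more similar to a than c is, else 'C'."""
    return "B" if sim(a, b) > sim(a, c) else "C"

def accuracy(sim, triplets):
    """triplets: iterable of (a, b, c, gold) tuples with gold in {'B', 'C'}."""
    hits = sum(predict(sim, a, b, c) == gold for a, b, c, gold in triplets)
    return hits / len(triplets)

def overlap_sim(x, y):
    """Toy similarity: Jaccard overlap of whitespace-separated tokens."""
    xs, ys = set(x.split()), set(y.split())
    return len(xs & ys) / max(len(xs | ys), 1)
```

Any pairwise scoring model (lexical or neural) can be plugged in as `sim`; only the relative order of the two scores matters for the triplet decision.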

3.2 Dataset Construction and Details

To ensure the quality of the dataset, we take several steps in constructing it. First, we select many documents within the range of Private Lending. However, although all cases are related to Private Lending, they still vary widely, so many cases are not similar at all. If the cases in a triplet are not similar, it does not make sense to compare their similarities. To produce qualified triplets, we first annotated some crucial elements of Private Lending for each document. The elements include:

  • The properties of lender and borrower, whether they are a natural person, a legal person, or some other organization.

  • The type of guarantee, including no guarantee, guarantee, mortgage, pledge, and others.

  • The usage of the loan, including personal life, family life, enterprise production and operation, crime, and others.

  • The lending intention, including regular lending, transfer loan, and others.

  • Conventional interest rate method, including no interest, simple interest, compound interest, unclear agreement, and others.

  • Interest during the agreed period, grouped into several agreed interest-rate brackets, and others.

  • Borrowing delivery form, including no lending, cash, bank transfer, online electronic remittance, bill, online loan platform, authorization to control a specific fund account, unknown or fuzzy, and others.

  • Repayment form, including unpaid, partial repayment, cash, bank transfer, online electronic remittance, bill, unknown or fuzzy, and others.

  • Loan agreement, including loan contract, or borrowing, “WeChat, SMS, phone or other chat records”, receipt, irrigation, repayment commitment, guarantee, unknown or fuzzy and others.

After annotating these elements, we assume that cases with similar elements are quite similar. So when we construct the triplets, we calculate the tf-idf similarity and the elemental similarity between cases and select similar cases to construct our dataset. We have constructed 8,964 triplets in total by these methods, and the statistics can be found in Table 1. Then, legal professionals annotate every triplet to decide whether sim(A, B) > sim(A, C) or sim(A, B) < sim(A, C). Furthermore, to ensure the quality of annotation, every document and triplet is annotated by at least three legal professionals to reach an agreement.
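A hedged sketch of the element-based filtering described above: the element names and the 0.5 threshold are illustrative assumptions, not values taken from the paper.

```python
def element_similarity(e1, e2):
    """Fraction of annotated element slots on which two cases agree.
    e1, e2: dicts mapping element names (e.g., 'guarantee_type') to labels."""
    keys = set(e1) | set(e2)
    if not keys:
        return 0.0
    return sum(e1.get(k) == e2.get(k) for k in keys) / len(keys)

def similar_candidates(query_elements, pool, threshold=0.5):
    """Keep cases whose element similarity to the query reaches the threshold,
    so triplets are built only from genuinely comparable Private Lending cases."""
    return [case for case in pool
            if element_similarity(query_elements, case["elements"]) >= threshold]
```

In practice this element score would be combined with the tf-idf similarity mentioned above before sampling candidate pairs for a triplet.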

Type Count
Small Train 500
Small Test 326
Large Train 5,102
Large Valid 1,500
Large Test 1,536
Total 8,964
Table 1: The number of triplets in different stages of CAIL2019-SCM.

4 Experiments

To assess the challenge of the similar case matching task, we evaluate several baselines on our dataset. The experimental results show that even state-of-the-art systems perform poorly at evaluating the similarity between different cases.

Baselines. All the baseline models are trained on Large Train and tested on Large Valid and Large Test. We adapt the Siamese framework Bromley et al. (1994) to our scenario with different encoders, e.g., CNN Kim (2014), LSTM Hochreiter and Schmidhuber (1997), and BERT Devlin et al. (2019), for encoding the legal documents. We elaborate on the details of the framework below.

Given a triplet of fact descriptions (A, B, C), we first encode each of them into a distributed vector with the same encoder and then compute the similarity scores between the query case A and the candidate cases B and C with a linear layer. Assume that each document d consists of n words, i.e., d = {w_1, w_2, ..., w_n}.

For the CNN/LSTM encoders, we first employ THULAC Sun et al. (2016) for word segmentation and then transform each word w_i into a distributed representation x_i with GloVe Pennington et al. (2014), where x_i ∈ R^k and k is the dimension of the word embeddings. Next, the encoder layer and a max-pooling layer transform the embedding sequence {x_1, x_2, ..., x_n} into a feature vector h ∈ R^m, where m is the dimension of the hidden vector. For the BERT encoder, we feed the document into the model at the character level to obtain the feature vector h.


Afterward, we calculate the similarity with a linear layer followed by a softmax activation, (y_B, y_C) = softmax(W[h_A; h_B; h_C]), where W is a weight matrix to be learned and y_B (resp. y_C) is the predicted probability that B (resp. C) is the more similar case.

For the learning objective, we apply the binary cross-entropy loss with the ground-truth label ŷ ∈ {0, 1} (ŷ = 1 when B is more similar): L = −(ŷ log y_B + (1 − ŷ) log(1 − y_B)).
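As a concrete NumPy sketch of this scoring and loss computation: the mean-pooled encoder, the embedding dimension, and the random weights below are placeholders standing in for the trained CNN/LSTM/BERT encoders, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens, emb, dim=8):
    """Placeholder encoder: mean-pool randomly initialized word embeddings.
    A real baseline would use the CNN/LSTM/BERT encoders described above."""
    vecs = [emb.setdefault(t, rng.standard_normal(dim)) for t in tokens]
    return np.mean(vecs, axis=0)

def score_triplet(h_a, h_b, h_c, W):
    """Linear layer over the concatenated features, softmax over (B, C)."""
    logits = W @ np.concatenate([h_a, h_b, h_c])
    e = np.exp(logits - logits.max())
    return e / e.sum()  # index 0: P(B more similar), index 1: P(C more similar)

def bce_loss(p_b, b_is_more_similar):
    """Binary cross-entropy on the probability that B is the more similar case."""
    y = 1.0 if b_is_more_similar else 0.0
    return -(y * np.log(p_b) + (1.0 - y) * np.log(1.0 - p_b))
```

The softmax output sums to one over the two candidates, so training with this loss directly optimizes the triplet decision described in Section 3.1.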


Method Valid Test
baselines CNN 62.27 69.53
LSTM 62.00 68.00
BERT 61.93 67.32
Teams AlphaCourt 70.07 72.66
backward 67.73 71.81
11.2 yuan 66.73 72.07
Table 2: Results of baselines and scores of top 3 participants on valid and test datasets.

Model Performance. We use accuracy as the metric in our experiments. Table 2 shows the results of the baselines and the top 3 participant teams on the Large Valid and Large Test datasets, from which we draw the following conclusions: 1) The participants achieve promising progress compared to the baseline models. 2) Both the baseline systems and the participant teams perform poorly on the dataset, partly due to their limited utilization of prior legal knowledge. It is still challenging to utilize legal knowledge and simulate legal reasoning on this dataset.

5 Conclusion

In this paper, we propose a new dataset, CAIL2019-SCM, which focuses on the task of similar case matching in the legal domain. Compared with existing datasets, CAIL2019-SCM can benefit case matching in the legal domain and help legal practitioners work better. Experimental results also show that there is still plenty of room for improvement.


  • H. Amiri, P. Resnik, J. Boyd-Graber, and H. Daumé III (2016) Learning text pair similarity with context-sensitive autoencoders. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1882–1892. Cited by: §2.1.
  • M. W. Bilotti, P. Ogilvie, J. Callan, and E. Nyberg (2007) Structured retrieval for question answering. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 351–358. Cited by: §2.1.
  • J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pp. 737–744. Cited by: §2.1, §4.
  • I. Chalkidis, I. Androutsopoulos, and N. Aletras (2019) Neural legal judgment prediction in English. In Proceedings of ACL. Cited by: §2.2.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1657–1668. Cited by: §2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §4.
  • C. Duan, L. Cui, X. Chen, F. Wei, C. Zhu, and T. Zhao (2018) Attention-fused deep matching network for natural language inference.. In IJCAI, pp. 4033–4040. Cited by: §2.1.
  • B. Fawei, J. Z. Pan, M. Kollingbaum, and A. Z. Wyner (2018) A methodology for a criminal law and procedure ontology for legal question answering. In Proceedings of JIST. Cited by: §2.2.
  • H. He, K. Gimpel, and J. Lin (2015) Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1576–1586. Cited by: §2.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.
  • Z. Hu, X. Li, C. Tu, Z. Liu, and M. Sun (2018) Few-shot charge prediction with discriminative legal attributes. Cited by: §2.2.
  • X. Jiang, H. Ye, Z. Luo, W. Chao, and W. Ma (2018) Interpretable rationale augmented charge prediction system. In Proceedings of COLING. Cited by: §2.2.
  • Y. Kano, M. Kim, R. Goebel, and K. Satoh (2017) Overview of coliee 2017.. In COLIEE@ ICAIL, pp. 1–8. Cited by: §2.2.
  • R. Keown (1980) Mathematical models for legal prediction. Computer/lj 2, pp. 829. Cited by: §2.2.
  • M. Kim, R. Goebel, and S. Ken (2015) COLIEE-2015: evaluation of legal question answering. In Proceedings of JURISIN. Cited by: §2.2.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Cited by: §4.
  • F. Kort (1957) Predicting supreme court decisions mathematically: a quantitative analysis of the "right to counsel" cases. The American Political Science Review 51 (1), pp. 1–12. Cited by: §2.2.
  • B. E. Lauderdale and T. S. Clark (2012) The supreme court’s many median justices. American Political Science Review 106 (4), pp. 847–866. Cited by: §2.2.
  • B. Liu, T. Zhang, F. X. Han, D. Niu, K. Lai, and Y. Xu (2018) Matching natural language sentences with hierarchical sentence factorization. In Proceedings of the 2018 World Wide Web Conference, pp. 1237–1246. Cited by: §2.1.
  • C. Liu and C. Hsieh (2006) Exploring phrase-based classification of judicial documents for criminal charges in Chinese. In Proceedings of the 16th International Conference on Foundations of Intelligent Systems. Cited by: §2.2.
  • Y. Liu, Y. Chen, and W. Ho (2015) Predicting associated statutes for legal problems. Information Processing & Management 51 (1), pp. 194–211. Cited by: §2.2.
  • B. Luo, Y. Feng, J. Xu, X. Zhang, and D. Zhao (2017) Learning to predict charges for criminal cases with legal basis. In Proceedings of EMNLP. Cited by: §2.2.
  • J. Mueller and A. Thyagarajan (2016) Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §2.1.
  • P. Neculoiu, M. Versteegh, and M. Rotaru (2016) Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 148–157. Cited by: §2.1.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §4.
  • K. Raghav, P. K. Reddy, and V. B. Reddy (2016) Analyzing the extraction of relevant legal judgments using paragraph-level and citation information. In AI4J – Artificial Intelligence for Justice, pp. 30. Cited by: §2.2.
  • M. Sun, X. Chen, K. Zhang, Z. Guo, and Z. Liu (2016) THULAC: an efficient lexical analyzer for Chinese. Technical Report. Cited by: §4.
  • C. Tan, F. Wei, W. Wang, W. Lv, and M. Zhou (2018) Multiway attention networks for modeling sentence pairs.. In IJCAI, pp. 4411–4417. Cited by: §2.1.
  • T. Vacek, R. Teo, D. Song, C. Cowling, F. Schilder, T. Nugent, and C. Wharf (2019) Litigation analytics: case outcomes extracted from US federal court dockets. In Proceedings of NAACL-HLT. Cited by: §2.2.
  • T. Vacek and F. Schilder (2017) A sequence approach to case outcome detection. In Proceedings of ICAIL. Cited by: §2.2.
  • S. Wan, Y. Lan, J. Guo, J. Xu, L. Pang, and X. Cheng (2016) A deep architecture for semantic matching with multiple positional sentence representations. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §2.1.
  • H. C. Wu, R. W. P. Luk, K. F. Wong, and K. L. Kwok (2008) Interpreting tf-idf term weights as making relevance decisions. ACM Transactions on Information Systems (TOIS) 26 (3), pp. 13. Cited by: §2.1.
  • C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, et al. (2018) CAIL2018: a large-scale legal dataset for judgment prediction. arXiv preprint arXiv:1807.02478. Cited by: §1.
  • Y. Yan, D. Zheng, Z. Lu, and S. Song (2017) Event identification as a decision process with non-linear representation of text. arXiv preprint arXiv:1710.00969. Cited by: §2.2.
  • W. Yang, W. Jia, X. Zhou, and Y. Luo (2019) Legal judgment prediction via multi-perspective bi-feedback network. Cited by: §2.2.
  • H. Ye, X. Jiang, Z. Luo, and W. Chao (2018) Interpretable charge predictions for criminal cases: learning to generate court views from fact descriptions. In Proceedings of NAACL. Cited by: §2.2.
  • W. Yin, H. Schütze, B. Xiang, and B. Zhou (2016) Abcnn: attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics 4, pp. 259–272. Cited by: §2.1.
  • H. Zhong, Z. Guo, C. Tu, C. Xiao, Z. Liu, and M. Sun (2018a) Legal judgment prediction via topological learning. In Proceedings of EMNLP. Cited by: §2.2.
  • H. Zhong, C. Xiao, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, et al. (2018b) Overview of CAIL2018: legal judgment prediction competition. arXiv preprint arXiv:1810.05851. Cited by: §1, §2.2.