Log In Sign Up

Overview of CAIL2018: Legal Judgment Prediction Competition

by   Haoxi Zhong, et al.

In this paper, we give an overview of the Legal Judgment Prediction (LJP) competition at Chinese AI and Law challenge (CAIL2018). This competition focuses on LJP which aims to predict the judgment results according to the given facts. Specifically, in CAIL2018 , we proposed three subtasks of LJP for the contestants, i.e., predicting relevant law articles, charges and prison terms given the fact descriptions. CAIL2018 has attracted several hundreds participants (601 teams, 1, 144 contestants from 269 organizations). In this paper, we provide a detailed overview of the task definition, related works, outstanding methods and competition results in CAIL2018.


page 1

page 2

page 3

page 4


CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction

In this paper, we introduce the Chinese AI and Law challenge dataset (CA...

The Deterministic Part of IPC-4: An Overview

We provide an overview of the organization and results of the determinis...

CAIL2019-SCM: A Dataset of Similar Case Matching in Legal Domain

In this paper, we introduce CAIL2019-SCM, Chinese AI and Law 2019 Simila...

A Summary of the ALQAC 2021 Competition

We summarize the evaluation of the first Automated Legal Question Answer...

Distinguish Confusing Law Articles for Legal Judgment Prediction

Legal Judgment Prediction (LJP) is the task of automatically predicting ...

Le droit du numérique : une histoire à préserver

Although the history of informatics is recent, this field poses unusual ...

Legal Judgment Prediction with Multi-Stage CaseRepresentation Learning in the Real Court Setting

Legal judgment prediction(LJP) is an essential task for legal AI. While ...

1 Introduction

Legal Judgment Prediction is a traditional task in the combination of artificial intelligence and laws. It aims to train a machine judge to predict the judgment results (e.g., relevant law articles, charges, prison terms and so on) automatically according to the facts. A well-performed LJP system can not only benefit those who are not familiar with laws but also provide a reference to professionals, e.g., lawyers and judges.

In order to promote the development of legal intelligence, this year’s AI and Law challenge, CAIL2018 , focuses on how artificial intelligence can help the LJP system. Firstly, we published a large-scale criminal dataset constructed from Chinese law documents Xiao et al. (2018). Based on this dataset, we propose three subtasks of LJP for contestants, including predicting relevant law articles, charges and prison terms given the fact descriptions from law documents.

The goal of CAIL2018 is to explore how NLP techniques and legal knowledge benefit the performance of LJP. For the three subtasks in LJP, there are several major challenges for contestants as follows:

  • The distributions of various law articles, charges, and prison terms are quite imbalanced. According to the statistics,the top- charges covers over cases while the bottom- charges only cover about cases. The imbalanced distribution makes it difficult to predict low-frequency categories.

  • Predicting the prison terms via the fact descriptions is more challenging than other subtasks. In real-world scenarios, when deciding the prison terms of a case, the judge will be affected by plenty of factors, e.g., ages of defendants, amount of money involved in the case and so on. It’s challenging for a machine to define and extract sufficient features from fact description.

  • There are usually complex logic dependencies between subtasks. For example, the charges of the criminals should refer to the relevant articles as in Chinese Criminal Law, and the decision of prison terms should accord with the stipulations in law articles. So it is crucial for the contestants to understand the rules contained in law articles and discover the logic dependencies among subtasks.

  • There exists many confusing categories pairs in these subtasks, such as the charges of robbery and theft. In Chinese Criminal Law, there are only a few differences between the definitions of many charge pairs, which make it difficult to distinguish these confusing charges.

In this year’s competition, there are teams who have submitted their models to the contests, and the best-performed models reach in the three subtasks. Comparing with the performance at the early stage of this competition, all the subtasks achieve significant improvements.

In the following parts, we will give a detailed introduction to CAIL2018 including the task definition and evaluation metrics. In addition, we will introduce the best-performed models submitted by contestants and discuss the reminding challenges.

2 Related Work

LJP is a hot topic in the field of legal intelligence and has been studied for several years. In early years, the studies of LJP usually concentrate on how to utilize mathematical and statistical methods to build LJP systems in some specific scenarios Kort (1957); Ulmer (1963); Nagel (1963); Keown (1980); Segal (1984); Lauderdale and Clark (2012).

With the development of machine learning techniques, more works propose to employ existing machine learning models to improve the performance on LJP. In these works, they usually formalize LJP as a text classification problem and focus on extracting efficient shallow features from the given facts and additional resources 

Liu and Hsieh (2006); Lin et al. (2012); Aletras et al. (2016); Sulea et al. (2017). These works integrate machine learning methods into LJP tasks and achieve a promising performance of LJP. However, these conventional methods can only extract well-defined shallow textual features from the fact descriptions.

In recent years, with the successful usage of deep learning techniques on NLP tasks 

Kim (2014a); Baharudin et al. (2010); Tang et al. (2015), researchers propose to employ neural models to solve LJP tasks. For example, Luo et al. (2017) adopt attention mechanism between facts and relevant law articles for charge prediction. Hu et al. (2018) introduce several charge attributes to predict few-shot and confusing charges. Jiang et al. (2018)

employ deep reinforcement learning to extract rationales for interpretable charge prediction.

Zhong et al. (2018) model the dependencies among the different subtasks in LJP as a Directed Acyclic Graph (DAG), and propose a topological learning model to solve these tasks simultaneously. Ye et al. (2018) integrate Seq2Seq model and predicted charges to generate the court view with fact descriptions.

3 Task Definition and Evaluation Metrics

In this section, we give the detailed dataset construction, task definition, and evaluation metrics of this competition. All the details can also be achieved from

3.1 Dataset Construction

We construct the CAIL2018 dataset from criminal documents collected from China Judgment Online111 As all the law documents are written in a standard format, it is easy to extract the fact description and the judgment results from these documents. During the preprocessing period, we filter out some case documents with low-frequency categories or multiple defendants. Finally, there are different criminal law articles and different charges in this dataset.

We randomly selected documents as the training set. There are two stages in the contest. In the first stage, we selected documents for testing. After all participants confirmed their final models, we collected emerging documents for testing in the second stage.

3.2 Task Definition

LJP takes the fact description of a specific case as the input and predicts the judgment results as the output. The judgment results consist of three parts as follows:

  • Law articles. The contestants should give a list of relevant articles as there might be multiple law articles relevant to one case.

  • Charges. The contestants should give a list of charges that the defendant in the case is convicted of.

  • Prison terms. The contestants should give the prison term that the defendant in the case is sentenced to. The prison terms should be an integer which stands for how much months the prison terms should be.

We denote the prediction of law articles, charges, and prison terms as task , , and respectively.

3.3 Evaluation Metrics

For task and task , we take them as text classification problems. For a specific task, suppose there are categories and documents in total. We denote the ground truth category as and the predicted label as . If the -th documents are annotated with the -th category, then should be and otherwise. Then we can get the following metrics for all classes:


These four metrics represent the true positive, false positive, false negative and true negative value for the -th category. Then we can calculate the precision, recall and value for the -th category as follows:


Here, and

represent precision and recall respectively. With these evaluation results for all categories, we can calculate the macro-level

value as follows:


Besides, we also evaluate the performance in micro-level. For micro-level evaluation, we first calculate:


Similarly, we can calculate the precision, recall, and values in the micro-level as follows:


Finally, we calculate overall score as


For task , we employ the difference of the predicted prison terms and the ground-truth ones as the evaluation metric. Assume that the ground-truth prison term of the -th case is and the predicted result is . Then, we define the difference as


After that, we define the score function as :


Then the final score of task should be:


4 Approach Overview

Tasks Law Articles Charges Prison Terms
Evaluation Metrics Score
nevermore 0.958 0.781 0.962 0.836 77.57
jiachx 0.952 0.748 0.958 0.815 69.64
xlzhang 0.952 0.760 0.958 0.811 69.64
HFL 0.953 0.769 0.958 0.811 77.70
大师兄 0.945 0.757 0.951 0.816 73.16
安徽高院类案指引研发团队 0.946 0.756 0.950 0.803 72.24
AI_judge 0.952 0.766 0.956 0.811
只看看不说话 0.948 0.738 0.954 0.801 77.54
DG 0.945 0.717 0.949 0.755 76.18
SXU_AILAW 0.940 0.728 0.950 0.791 76.49
中电28所联合部落 0.934 0.740 0.937 0.772 75.77
Table 1: Performance of participants on CAIL2018 .

There are over teams who have registered for CAIL2018 and submitted their final models. The final scores show that neural models can achieve considerable results on task and task , but it is still challenging to predict the prison terms. In Table 1, we list the scores of top- participants of each subtask. Here, we evaluate these models on the testing set in the second stage, which contains cases.

We have collected the technical reports of these contestants. In the following parts, we summarize their methods and tricks according to these reports.

4.1 General Architecture

Pre-processing. For most contestants, they conduct the following pre-processing steps to transform the raw documents into the format which is suitable for their models.

  • Word Segmentation. As all the documents are written in Chinese, it is important for the contestants to conduct a high-quality word segmentation. For word segmentation, the contestants usually choose jieba222, ICTCLAS333, THULAC444 or other Chinese word segmentation tools.

  • Word Embedding. After word segmentation, we need to transform the discrete word symbols into continuous word embeddings. Generally, the contestants employ word2vec Mikolov et al. (2013), Glove Pennington et al. (2014), or FastText Joulin et al. (2017) to pre-train word embeddings on these criminal cases.

Text Classification Models

. After preprocessing, we need to classify these processed fact descriptions into corresponding categories. For most contestants, they employ existing neural network based text classification models to extract efficient text features. The most commonly used text classification models are listed as follows:

  • Text-CNN Kim (2014b): CNN with multiple filter widths.

  • LSTM Hochreiter and Schmidhuber (1997)) or bidirectional LSTM.

  • GRU

    , Gated Recurrent Unit 

    Cho et al. (2014).

  • HAN, Hierarchical Attention Networks Yang et al. (2016).

  • RCNN

    , Recurrent Convolutional Neural Networks 

    Lai et al. (2015).

  • DPCNN, Deep Pyramid Convolutional Neural Networks Johnson and Zhang (2017).

According to the technical reports of contestants, it has been proven that these neural models can achieve good performance in high-frequency categories.

4.2 Promising Tricks

In predicting relevant law articles and charges, these traditional models can achieve promising results in high-frequency categories. However, due to the imbalance issue, it is challenging to reach a good performance on the low-frequency ones. Therefore, how to address the problem of imbalanced data becomes the most important thing in the first two subtasks.

In the task of predicting prison terms, simple linear regression methods perform poorly than classification models. Thus, most participants still treat it as a text classification problem. However, how to divide the intervals is challenging and will badly influence the classification performance. Meanwhile, the prison terms are affected by many factors and explicit features, rather than implicit semantic meanings in the text. All these issues make the task

the most difficult subtask.

According to the technical reports, there are some useful tricks which can address these issues and improve the text classification models significantly. We summarize them as follows:

  • Word Embeddings. It has been proven by participants that a better word embedding model, such as ELMO Peters et al. (2018) could achieve a better performance than Skip-GramMikolov et al. (2013). Moreover, training word embeddings on a larger legal corpus can also improve the performance of LJP models.

  • Data Balance. Undersampling and oversampling methods are the most common ways to address the imbalance issue of categories in this competition.

  • Joint Learning. As there are dependencies among these subtasks, some participants employ multi-task learning models to solve them jointly.

  • Additional Attributes. Inspired by Hu et al. (2018), participants improve their performance on few-shot and confusing category pairs by predicting their legal attributes.

  • Additional Features. Many participants attempted to extract features manually, e.g., amount involved, named entities, ages and so on. These manually defined features can improve the performance of task greatly.

  • Loss Function.

    Most models use cross-entropy as their loss functions. However, some models adopt more promising loss functions, such as focal loss 

    Lin et al. (2018)

    , to enhance the performance on low-frequency categories. Besides, the loss weights of various categories and the activation functions of the output layer also have great influence on the final performance.

  • Ensemble. Most participants train several different classification models and combine them with simple voting or weighted average strategies to combine their predicting results.

4.3 Conclusion

In CAIL2018, we employ Legal Judgement Prediction as the competition topic. In this competition, we construct and release a large-scale LJP dataset. The performance of LJP subtasks significantly raised with the efforts of over participants. In this paper, we summarize the general architecture and promising tricks they employed, which are expected to benefit further researches on legal intelligence.


  • Aletras et al. (2016) Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preotiuc-Pietro, and Vasileios Lampos. 2016.

    Predicting judicial decisions of the european court of human rights: A natural language processing perspective.

    PeerJ Computer Science, 2.
  • Baharudin et al. (2010) Baharum Baharudin, Lam Hong Lee, and Khairullah Khan. 2010. A review of machine learning algorithms for text-documents classification. Journal of Advances in Information Technology, 1(1):4–20.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. Computer Science.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Hu et al. (2018) Zikun Hu, Xiang Li, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Few-shot charge prediction with discriminative legal attributes. In Proceedings of COLING.
  • Jiang et al. (2018) Xin Jiang, Hai Ye, Zhunchen Luo, WenHan Chao, and Wenjia Ma. 2018. Interpretable rationale augmented charge prediction system. In Proceedings of COLING: System Demonstrations, pages 146–151.
  • Johnson and Zhang (2017) Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of ACL, volume 1, pages 562–570.
  • Joulin et al. (2017) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of EACL, volume 2, pages 427–431.
  • Keown (1980) R Keown. 1980. Mathematical models for legal prediction. Computer/LJ, 2:829.
  • Kim (2014a) Yoon Kim. 2014a. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.
  • Kim (2014b) Yoon Kim. 2014b. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Kort (1957) Fred Kort. 1957. Predicting supreme court decisions mathematically: A quantitative analysis of the ”right to counsel” cases. American Political Science Review, 51(1):1–12.
  • Lai et al. (2015) Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of AAAI, volume 333, pages 2267–2273.
  • Lauderdale and Clark (2012) Benjamin E Lauderdale and Tom S Clark. 2012. The supreme court’s many median justices. American Political Science Review, 106(4):847–866.
  • Lin et al. (2018) Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2018. Focal loss for dense object detection. IEEE TPAMI.
  • Lin et al. (2012) Wanchen Lin, Tsung Ting Kuo, and Tung Jia Chang. 2012. Exploiting machine learning models for chinese legal documents labeling, case classification, and sentencing prediction. In Processdings of ROCLING, page 140.
  • Liu and Hsieh (2006) Chaolin Liu and Chwen Dar Hsieh. 2006. Exploring phrase-based classification of judicial documents for criminal charges in chinese. In Proceedings of ISMIS, pages 681–690.
  • Luo et al. (2017) Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to predict charges for criminal cases with legal basis. In Proceedings of EMNLP.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.
  • Nagel (1963) Stuart S Nagel. 1963. Applying correlation analysis to case prediction. Texas Law Review, 42:1006.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014.

    Glove: Global vectors for word representation.

    In Proceedings of EMNLP, pages 1532–1543.
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Segal (1984) Jeffrey A Segal. 1984. Predicting supreme court cases probabilistically: The search and seizure cases, 1962-1981. American Political Science Review, 78(4):891–900.
  • Sulea et al. (2017) Octavia Maria Sulea, Marcos Zampieri, Mihaela Vela, and Josef Van Genabith. 2017. Exploring the use of text classi cation in the legal domain. In Proceedings of ASAIL workshop.
  • Tang et al. (2015) Duyu Tang, Bing Qin, and Ting Liu. 2015.

    Document modeling with gated recurrent neural network for sentiment classification.

    In Proceedings of EMNLP, pages 1422–1432.
  • Ulmer (1963) S Sidney Ulmer. 1963. Quantitative analysis of judicial processes: Some practical and theoretical applications. Law and Contemporary Problems, 28:164.
  • Xiao et al. (2018) Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, et al. 2018. Cail2018: A large-scale legal dataset for judgment prediction. arXiv preprint arXiv:1807.02478.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL, pages 1480–1489.
  • Ye et al. (2018) Hai Ye, Xin Jiang, Zhunchen Luo, and Wenhan Chao. 2018. Interpretable charge predictions for criminal cases: Learning to generate court views from fact descriptions. In Proceedings of NAACL.
  • Zhong et al. (2018) Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Chaojun Xiao, Zhiyuan Liu, and Maosong Sun. 2018. Legal judgment prediction via topological learning. In Proceedings of EMNLP.