Equality before the Law: Legal Judgment Consistency Analysis for Fairness

In a legal system, judgment consistency is regarded as one of the most important manifestations of fairness. However, due to the complexity of the factual elements that affect sentencing in real-world scenarios, little work has quantitatively measured judgment consistency on real-world data. In this paper, we propose an evaluation metric for judgment inconsistency, the Legal Inconsistency Coefficient (LInCo), which evaluates inconsistency between data groups divided by specific features (e.g., gender, region, race). We propose to simulate judges from different groups with legal judgment prediction (LJP) models and measure judicial inconsistency as the disagreement among the judgment results given by LJP models trained on different groups. Experimental results on synthetic data verify the effectiveness of LInCo. We further employ LInCo to explore inconsistency in real cases and make the following observations: (1) Both regional and gender inconsistency exist in the legal system, but gender inconsistency is much smaller than regional inconsistency; (2) The level of regional inconsistency varies little across different time periods; (3) In general, judicial inconsistency is negatively correlated with the severity of the criminal charges. Besides, we use LInCo to evaluate the performance of several de-biasing methods, such as adversarial learning, and find that these mechanisms can effectively help LJP models mitigate data bias.



1 Introduction

Legal judgment consistency indicates the degree to which similar cases yield similar court decisions in real legal systems. It is an essential component of judicial fairness, whether in the Civil Law system or the Common Law system (Flynn, 2013). However, even in a well-developed legal system, cases are not always judged consistently in reality. Table 1 gives an example where two cases are judged inconsistently. The two criminals behaved similarly, but they received quite different verdicts. The inconsistency in legal judgment may be caused by many complex factors, such as regional differences, public opinion, or judges' personal subjective thoughts. Different judgments for similar cases violate the principle of equality before the law and undermine the parties' rights and interests. Therefore, it is essential to develop efficient and effective methods to evaluate legal judgment inconsistency in the real world.

Case A: In a snack bar in Region A, Alice stole the victim’s iPhone. After identification, the value of the stolen phone was RMB 4000.
Imprisonment: 6 months.   Fine: RMB 1000.
Case B: When working in a restaurant in Region G, Bob found that the victim who was eating had hung his jacket on a stool, and a Samsung mobile phone (valued at RMB 4000) was in his jacket pocket, so he stole the cell phone while no one was watching.
Imprisonment: 9 months.   Fine: RMB 4000.
Table 1: An example of inconsistency between two cases from different regions. In these two cases, the criminals behaved similarly, and the stolen items were of equal value, but they received different verdicts.

Legal inconsistency analysis has been studied for decades. Anderlini et al. (2020) use a mathematical approach to analyze and compare judicial consistency under different legal systems in theory. Other approaches focus on analysing real-world data from a case-by-case perspective (Reamer, 2005; Li, 2014; Edgely, 2009; Anderlini et al., 2014). However, few works focus on macro-level consistency analysis of real-world data, which would be an important complement and support to qualitative research.

In this paper, we explore the quantitative analysis of judgment consistency. Judicial fairness requires similar cases to be judged similarly. However, it is difficult to decide whether two cases are similar in the real world (Xiao et al., 2019; Zhong et al., 2020b), which makes it challenging to analyze legal inconsistency by comparing the judgment results of similar cases. Therefore, we propose the Legal Inconsistency Coefficient, LInCo, which focuses on the judicial inconsistency between groups divided by specific features (e.g., gender, region, race). We simulate judges from different groups with legal judgment prediction (LJP) models, which can predict the term of penalty from a case's fact description. Since LJP models are able to capture the characteristics within datasets and reflect the bias from historical judgments in their predictions (Grgić-Hlača et al., 2018), we can train multiple LJP models as “virtual judges”, one on each data group, and define the legal inconsistency coefficient as their disagreement on the same cases. Specifically, given several groups of cases divided by specific features, such as region or gender, calculating LInCo includes two steps: (1) Train an LJP model as a virtual judge for each group and predict the term of penalty for all cases in the test sets; (2) Calculate LInCo as the average disagreement between the virtual judges on the test cases.

In experiments, we choose the Chinese legal system for analysis. China is a country with a large population and a vast territory, and there exist many complex factors (e.g., race, region) that cause inconsistency. Therefore exploring judicial inconsistency in the Chinese legal system is an urgent need. Moreover, the Chinese government has published a large-scale dataset of formatted legal documents, which provides data support for our research.

Our contributions are threefold:

(1) We propose LInCo to measure legal inconsistency in real-world data. The results of the simulation study verify the effectiveness of LInCo and show that the correlation coefficient between LInCo and the inconsistency factor is high, which supports the reliability of LInCo. (Section 4.2)

(2) We conduct a series of region-specific (cases are judged in different regions) and gender-specific (defendants have different genders) experiments on real-world datasets. The results show that there exists inconsistency between different genders or different provincial-level administrative regions in the Chinese legal system. However, gender inconsistency is much less than regional inconsistency. We also discover that, in general, judicial inconsistency is negatively correlated with the severity of the criminal charges. (Section 4.3)

(3) We use LInCo to evaluate the performance of several de-biasing methods, including adversarial learning. The results show that the methods are effective in decreasing LInCo, which indicates that the methods can mitigate the data bias for existing LJP models. (Section 4.4)

We will release the datasets and source code once accepted to help the researchers make improvements on legal consistency analysis.

2 Related Work

2.1 Legal Judgment Consistency

Judicial inconsistency is a serious threat to the authority of the legal system and has attracted the attention of many researchers in the legal field. Most of them focus on the importance of consistency, causes of inconsistency, solutions, etc., from a case-by-case or theoretical perspective (Reamer, 2005; Li, 2014; Edgely, 2009; Anderlini et al., 2014, 2020). Meanwhile, many efforts have been devoted to developing fairness-aware machine learning algorithms (Zafar et al., 2017; Dwork et al., 2012; Speicher et al., 2018; Berk et al., 2018), which focus on designing models to produce fair outcomes rather than analyzing the unfairness in existing data or legal systems. There are also some existing inconsistency coefficients (Li et al., 2012; Jezewski et al., 2003; Korenius et al., 2004), but they involve knowledge or task settings from other specific domains or tasks (e.g., the medical domain, the chemical domain, hierarchical clustering algorithms), so they are neither applicable nor comparable in legal inconsistency analysis. To sum up, legal judgment consistency analysis over large-scale data remains to be explored.

2.2 Legal AI

Recently, owing to the availability of a large amount of high-quality legal text, many researchers have explored employing NLP technology to help lawyers and other practitioners in the legal field. There are many works on generating the court's view to interpret charge results (Ye et al., 2018), retrieving relevant cases and law articles (Chen et al., 2013; Raghav et al., 2016), legal information extraction (Shen et al., 2020; Chen et al., 2020), legal question answering (Zhong et al., 2020c) and legal debate dialogue summarization (Duan et al., 2019).

Besides, many efforts have been devoted to predicting judgments from fact descriptions (Zhong et al., 2018; Hu et al., 2018; He et al., 2019; Chen et al., 2019; Zhong et al., 2020a). These works show that neural models can effectively learn judgment patterns from large-scale data and thus reflect the data bias, which provides the technical basis for quantitative study. However, few researchers have evaluated judgment consistency in large-scale real-world data with Legal AI technology. To the best of our knowledge, we are the first to analyse judgment consistency in real-world data with LJP models.

3 Problem Formulation and Methodology

In this section, we will introduce the problem formulation and the definition of the proposed metric LInCo.

3.1 Problem Formulation

Our task in this paper is to calculate the degree of disagreement among court decisions across different groups. Let $g$ denote the independent variable, a discrete integer ranging from $1$ to $K$ that describes the feature (e.g., region, gender) used to group the dataset. Let $y$ denote the term of penalty and $d$ denote the fact description. Each case can be represented as a triplet $(g, y, d)$. We use $f$ to denote the judgment function, which takes the fact description $d$ as input and outputs the term of penalty $y$.

Given a set of legal cases $\mathcal{D}$, we can divide it into $K$ groups $\mathcal{D}_1, \dots, \mathcal{D}_K$ according to the value of $g$, where $g_i = k$ for each case $(g_i, y_i, d_i) \in \mathcal{D}_k$. Then, the task is to calculate the degree of judicial disagreement between these groups.

Notably, we can treat any feature as the independent variable and thus analyze fairness from multiple perspectives. For instance, in the following experiments, we let $g$ represent the source region of cases or the gender of defendants.

3.2 Legal Inconsistency Coefficient

Figure 1: The overview of LInCo. We divide the dataset into groups, train virtual judges, and measure their disagreement to evaluate the inconsistency.

In this section, we define the metric LInCo, the Legal Inconsistency Coefficient for measuring judicial inconsistency. Figure 1 illustrates the computation. We first train neural models to estimate the judgment functions of different groups. Then we calculate LInCo as the average disagreement on the cases in the test sets.

Estimation of Judgment Functions. We first train LJP models to estimate the judgment functions for different groups. The judgment function $f_k$ of group $k$ can be seen as a virtual judge, which obeys both the fair judgment rules and the judgment bias of group $k$. With these functions, we can estimate the results for all cases following the rules and biases of different groups. We train our models to predict the term of penalty and treat this task as a regression problem. Given a case whose fact description $d$ consists of a sequence of words, we first adopt an encoder (BERT (Devlin et al., 2019) or any other model) to encode the fact description into a hidden vector. A linear layer is then employed to calculate the term of penalty. We use the log-scaled mean squared error between the predicted and ground-truth terms as the loss function for optimization.

We split each group into a training set and a test set. The training sets are used to train models, and the test sets are used to calculate LInCo. We can get different models with different training sets. To minimize impact from other factors that may affect consistency, we balance all train sets to have the same number of cases.
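The log-scaled mean squared error used for training can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `log_mse` and the shift by 1 inside the logarithm (so that a zero-month term stays defined) are assumptions.

```python
import math


def log_mse(predicted, target):
    """Mean squared error between log-scaled terms of penalty (in months).

    Assumption: terms are shifted by 1 before the log so that 0 is allowed.
    """
    squared_errors = [
        (math.log(p + 1.0) - math.log(t + 1.0)) ** 2
        for p, t in zip(predicted, target)
    ]
    return sum(squared_errors) / len(squared_errors)


# Halving a short term and halving a long term are penalised equally,
# because the loss depends on the ratio of the shifted terms.
short_case = log_mse([5.0], [11.0])     # shifted terms: 6 vs 12
long_case = log_mse([59.0], [119.0])    # shifted terms: 60 vs 120
```

One motivation for a log-scaled loss in this setting is that a few very long sentences would otherwise dominate a plain squared error.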

Definition of LInCo. After obtaining the judgment functions for different groups, we utilize these models to calculate LInCo. We define LInCo as the average disagreement over all cases in the test sets:

$$\mathrm{LInCo} = \frac{1}{N} \sum_{i=1}^{N} s_i,$$

where $N$ is the number of test cases and $s_i$ is the disagreement on the $i$-th case. We argue that if legal judgment is consistent across different groups, the judgment functions should output close, or even identical, results for the same case, and vice versa. In other words, if the judgment is consistent, the disagreement for a case is supposed to be close to $0$. Therefore, we formally define $s_i$ as the standard deviation of the results of the $K$ virtual judges:

$$s_i = \operatorname{Std}\left(\tilde{y}_i^{(1)}, \dots, \tilde{y}_i^{(K)}\right),$$

with which we can compute the disagreement for all cases. Here $\tilde{y}_i^{(k)}$ represents the standardized prediction of the $k$-th virtual judge for case $i$; standardization makes the results comparable across different test groups. Let $\hat{y}_i^{(k)}$ be the raw prediction, and let $\mu_j$ and $\sigma_j^2$ be the mean and variance of the ground truth in the test group $j$ that contains case $i$. The standardization can then be formalized as:

$$\tilde{y}_i^{(k)} = \frac{\hat{y}_i^{(k)} - \mu_j}{\sigma_j}.$$
Intuitively, models trained on a particular group can represent the characteristics and biases of that group and make judgments close to the reality of that group. By calculating the average disagreement of these “virtual judges” on the test set, we can measure the inconsistency of the groups.
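The computation described in this section can be sketched in a few lines, assuming the predictions of the trained virtual judges are already available. The function name `linco`, the argument layout, and the use of the population standard deviation are illustrative assumptions and not the authors' released code.

```python
import statistics


def linco(predictions, test_group, ground_truth):
    """Average disagreement of virtual judges over all test cases.

    predictions[i][k]: term predicted by virtual judge k for test case i
    test_group[i]:     id of the test group that case i belongs to
    ground_truth[i]:   real term of penalty of case i
    """
    # Mean and standard deviation of the ground truth in each test group.
    terms_by_group = {}
    for group, term in zip(test_group, ground_truth):
        terms_by_group.setdefault(group, []).append(term)
    group_stats = {
        group: (statistics.fmean(terms), statistics.pstdev(terms))
        for group, terms in terms_by_group.items()
    }

    disagreements = []
    for judge_outputs, group in zip(predictions, test_group):
        mu, sigma = group_stats[group]
        # Standardize so that cases from different test groups are comparable.
        standardized = [(p - mu) / sigma for p in judge_outputs]
        # Disagreement on one case: spread of the judges' standardized outputs.
        disagreements.append(statistics.pstdev(standardized))
    return statistics.fmean(disagreements)


# Toy example with two virtual judges and two test groups.
truth = [6.0, 12.0, 24.0, 36.0]
groups = [0, 0, 1, 1]
agreeing = [[6.0, 6.0], [12.0, 12.0], [24.0, 24.0], [36.0, 36.0]]
mixed = [[6.0, 12.0], [12.0, 12.0], [24.0, 36.0], [36.0, 36.0]]
```

In the toy data, `agreeing` gives every case identical outputs from both judges, while `mixed` makes the judges differ on half of the cases.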

4 Experiments

In this section, we first verify the effectiveness of LInCo through a simulation study (Section 4.2). Then we conduct experiments on realistic datasets to analyze the judgment inconsistency in the Chinese legal system (Section 4.3). Moreover, we employ LInCo to evaluate the performance of several de-biasing strategies (Section 4.4).

4.1 Dataset

In all experiments, we construct the datasets, both synthetic and real-world, based on CAIL (Xiao et al., 2018), which is the largest dataset for legal judgment prediction (LJP). This dataset consists of millions of legal cases published by the Supreme People's Court of China, each including a fact description, applicable law articles, charges, and the term of penalty. CAIL only includes criminal cases. According to the Chinese Criminal Law, criminal judgments should be uniform across the country, so CAIL cases are supposed to be consistent in theory. Based on this dataset, we construct synthetic data and real-world data for evaluation. Statistics of the dataset for every single experiment can be found in the Appendix.

Moreover, several settings apply to all experiments in this paper: (1) Datasets for each group are balanced to the same size. (2) For each dataset, we randomly hold out a portion of the data for testing and use the remainder for training.

Please refer to the following sections for the details of synthetic and real-world datasets.

4.2 Simulation Study

To verify the effectiveness of LInCo, we construct synthetic datasets with controlled perturbations of the cases' terms of penalty. A crowd-sourced human evaluation might seem a natural way to support this verification. However, the main cause of judgment inconsistency is that judges themselves cannot be completely consistent, and human annotators, who are professionals just like real judges, cannot be guaranteed to judge consistently across different cases either.

Therefore, we do a simulation study instead, in which the inconsistency of different groups is well controlled. Thus, we can evaluate the reliability of LInCo by computing the correlation between LInCo and the controlled inconsistency.

4.2.1 Experimental Settings

Dataset Construction. We construct the synthetic data of different groups by keeping facts the same and perturbing the term of penalty. In other words, each data group uses the same set of facts, but the ground truths are perturbed for each group. To simulate the inconsistency between groups, we make the perturbations vary for each data group.

We build a synthetic dataset, controlled by an inconsistency factor, with the following steps:

(1) We first randomly select cases from the CAIL dataset.

(2) Then, for each case in each group, we keep the facts unchanged and perturb the term of penalty. The perturbation applies a group-level scaling together with a case-level random effect sampled from a normal distribution, and the inconsistency factor controls the strength of the perturbation.

Intuitively, we give each group its own scaling and apply an additional random effect to each case.

We argue that if LInCo is sufficient to evaluate legal judgment inconsistency, the correlation coefficient between LInCo and the inconsistency factor should be close to 1.
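The idea of the simulation can be illustrated with a toy check. The perturbation below (a per-group Gaussian scaling plus a small per-case Gaussian effect, both multiplied by the inconsistency factor `beta`) is a hypothetical stand-in, since the exact formula and parameter values are not reproduced here. Standardizing with the unperturbed statistics is a further simplification, and `golden_linco` and `pearson` are illustrative names.

```python
import random
import statistics


def golden_linco(base_terms, beta, num_groups=3, seed=0):
    """Golden-value style LInCo: each virtual judge outputs its group's perturbed term."""
    rng = random.Random(seed)  # same seed => same random draws for every beta
    group_effect = [rng.gauss(0.0, 1.0) for _ in range(num_groups)]
    mu = statistics.fmean(base_terms)
    sigma = statistics.pstdev(base_terms)
    disagreements = []
    for term in base_terms:
        judged = []
        for k in range(num_groups):
            case_effect = rng.gauss(0.0, 0.1)
            # Hypothetical perturbation: group scaling times per-case effect.
            perturbed = term * (1.0 + beta * group_effect[k]) * (1.0 + beta * case_effect)
            judged.append((perturbed - mu) / sigma)
        disagreements.append(statistics.pstdev(judged))
    return statistics.fmean(disagreements)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)


base_terms = [6.0, 9.0, 12.0, 18.0, 24.0, 36.0, 48.0, 60.0]  # months, made up
betas = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
values = [golden_linco(base_terms, beta) for beta in betas]
correlation = pearson(betas, values)
```

With no perturbation (`beta = 0`) all groups share identical terms, so the toy coefficient is zero, and it grows as `beta` increases.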

Models. We employ two strategies to estimate the judgment functions for different groups.

  • DGN. We adopt DGN (Chen et al., 2019), which is designed for predicting the term of penalty, as our encoder to train virtual judges. Compared with general NLP models, DGN mainly focuses on the relationship between charge and term. DGN stacks multiple blocks of an LSTM layer and a charge-specific gating layer for generating a focused charge-based representation of the case. Therefore, this work can make good use of charge information to help judgment prediction.

  • Golden Value. As mentioned before, LInCo uses the disagreement of models to reflect inconsistency. However, LInCo is influenced by both judgment inconsistency and model misspecification. Therefore, we propose to use the synthetic ground truth to calculate the golden value of LInCo for comparison. More specifically, when given a case, the virtual judge of a group outputs that group's synthetic ground-truth value for the case.

Training Settings. We use trainable character-level embeddings for DGN and use Adam (Kingma and Ba, 2015) to train all models. We implement DGN using PyTorch. The code can be found in the attached files.

4.2.2 Experimental Results and Analysis

Figure 2: The curves show how LInCo changes as the inconsistency factor increases. The reported correlation coefficient is computed between the inconsistency factor and LInCo. Each result is averaged over multiple runs (this setting applies to all experiments in this paper).

The main results of the simulation study are shown in Figure 2. Please refer to the Appendix for the detailed values for each point in the figure. From the results, we can observe that:

(1) With the other settings fixed, a larger inconsistency factor leads to a larger LInCo, as expected. We also calculate the correlation coefficient between the inconsistency factor and LInCo. The correlation coefficients for both the golden value and LInCo with DGN are relatively high, which indicates that LInCo is reliable in theory and in practice. Thus LInCo can be used to evaluate judicial inconsistency in the real world.

(2) There is a difference between LInCo with DGN and the golden value. For example, when the inconsistency factor is 0, which means the judgment is consistent across all groups, LInCo with DGN is still greater than 0. This indicates that neural models introduce bias due to model misspecification. In reality, the model-introduced bias is inevitable because we are unable to know the underlying perfect model for prediction. However, the difference between the golden value and LInCo with DGN is small, which means model misspecification has little impact on LInCo. The correlation coefficients between the inconsistency factor and LInCo with DGN also remain high, which further supports the robustness of LInCo.

4.3 LInCo on Real-World Dataset

In this section, we analyze judgment inconsistency on CAIL with LInCo. We mainly focus on regional inconsistency and gender inconsistency. We match all data with cases on China Judgment Online to obtain the source region and the defendant's gender for each case. On the one hand, when studying regional consistency, we select data from different provincial-level administrative regions and use the data from these regions as the groups for the experiments. (The provincial-level administrative region is the first-level administrative division unit of China. In China, there are 34 provincial-level administrative regions, including 23 provinces, 5 autonomous regions, 4 municipalities, and 2 special administrative regions.) For anonymity reasons, we do not list the names of these regions. On the other hand, for gender inconsistency, we group the data by the defendant's gender.

We first conduct experiments on nine crimes to explore the relationship between legal judgment consistency and the severity of crimes. Detailed descriptions of these charges can be found in the Appendix. In the Chinese legal system, the severity of individual cases is divided into three levels according to the length of the prison sentence. Therefore, we calculate the proportion of each crime's actual judgments falling in these three ranges to reflect the severity of the charge.

| Sentence share (lightest range) | Sentence share (middle range) | Sentence share (heaviest range) | Crime | LInCo |
|---|---|---|---|---|
| 55% | 34% | 11% | Fraud | 0.457 |
| 67% | 25% | 8% | Drug Trafficking | 0.487 |
| 73% | 22% | 5% | Possession of Illegal Drugs | 0.541 |
| 84% | 14% | 2% | Intentional Injury | 0.274 |
| 86% | 14% | 0% | Traffic Offence | 0.553 |
| 95% | 5% | 0% | Theft | 0.376 |
| 97% | 3% | 0% | Picking Quarrels and Provoking Trouble | 0.698 |
| 100% | 0% | 0% | Disrupting Public Service | 0.725 |
| 100% | 0% | 0% | Providing Venues for Drug Users | 0.870 |

Table 2: Results of LInCo on nine crimes, compared with the real-world sentence distributions for these crimes. The distributions reflect the crimes' severity.

The results are shown in Table 2. We notice that the judicial inconsistency is negatively correlated with the severity of the selected crime in general. In other words, overall, the sentences for the more serious crimes are more consistent, which indicates that the judges may be more careful about felony trials than misdemeanor ones.
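As a rough numerical check of this trend, one can correlate the share of sentences in the lightest range (a proxy for low severity) with LInCo, using the values from Table 2. The choice of Pearson correlation and the helper name `pearson` are ours, not part of the original analysis, and nine data points only support a coarse reading.

```python
import statistics

# (share of sentences in the lightest range, LInCo), copied from Table 2.
table2 = [
    (0.55, 0.457),  # Fraud
    (0.67, 0.487),  # Drug Trafficking
    (0.73, 0.541),  # Possession of Illegal Drugs
    (0.84, 0.274),  # Intentional Injury
    (0.86, 0.553),  # Traffic Offence
    (0.95, 0.376),  # Theft
    (0.97, 0.698),  # Picking Quarrels and Provoking Trouble
    (1.00, 0.725),  # Disrupting Public Service
    (1.00, 0.870),  # Providing Venues for Drug Users
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)


light_share = [row[0] for row in table2]
linco_values = [row[1] for row in table2]
r = pearson(light_share, linco_values)
```

A positive `r` here means that crimes with mostly light sentences tend to have a higher LInCo, which matches a negative relation between severity and inconsistency. Intentional Injury and Theft are visible exceptions in the table.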

| Crime | Cross-region | Single region |
|---|---|---|
| Fraud | 0.457 | 0.371 |
| Drug Trafficking | 0.487 | 0.304 |
| Traffic Offence | 0.553 | 0.127 |
| Theft | 0.376 | 0.262 |
| Picking Quarrels and Provoking Trouble | 0.698 | 0.145 |

Table 3: LInCo calculated on crimes with sufficient data. All data is from one particular region and divided into groups. Results for cross-region inconsistency are also shown for comparison.

To demonstrate that region is indeed the main factor influencing the results in Table 2, we calculate the inconsistency without regional differences for comparison. We select several crimes with sufficient data and randomly divide the data from one particular region into groups for the experiments. The results of LInCo are shown in Table 3. It can be observed that LInCo within one single region is much smaller than LInCo between different regions. This reveals that region is an important factor for the inconsistency in Table 2. Besides, some other factors may have an impact on the consistency like GDP and education levels, especially for some of the crimes like fraud or theft.

| Crime | 2013 – 2015 | 2016 – 2018 |
|---|---|---|
| Fraud | 0.504 | 0.516 |
| Drug Trafficking | 0.523 | 0.518 |
| Intentional Injury | 0.305 | 0.285 |
| Traffic Offence | 0.627 | 0.629 |
| Theft | 0.356 | 0.340 |
| Picking Quarrels and Provoking Trouble | 0.738 | 0.755 |

Table 4: Regional legal judgment inconsistency between different years.

Moreover, we conduct experiments on data in different years in an attempt to verify whether regional judgment inconsistency changes over time, and the results are shown in Table 4. It can be observed that the regional inconsistency is nearly the same across the years, which indicates that inconsistency does not change significantly over time.

| Sentence share (lightest range) | Sentence share (middle range) | Sentence share (heaviest range) | Crime | LInCo |
|---|---|---|---|---|
| 55% | 34% | 11% | Fraud | 0.216 |
| 67% | 25% | 8% | Drug Trafficking | 0.182 |
| 73% | 22% | 5% | Possession of Illegal Drugs | 0.258 |
| 84% | 14% | 2% | Intentional Injury | 0.163 |
| 86% | 14% | 0% | Traffic Offence | 0.251 |
| 95% | 5% | 0% | Theft | 0.220 |
| 97% | 3% | 0% | Picking Quarrels and Provoking Trouble | 0.351 |
| 100% | 0% | 0% | Disrupting Public Service | 0.353 |
| 100% | 0% | 0% | Providing Venues for Drug Users | 0.338 |

Table 5: Results of gender inconsistency, compared with the real-world sentence distributions for these crimes.

We also calculate LInCo to detect gender inconsistency, and the results are shown in Table 5. Notably, sexual harassment and other gender-sensitive charges are not included, so there should be complete equality between men and women. From the table, we can observe that the negative correlation between inconsistency and the crime's severity generally persists. Furthermore, gender inconsistency is much smaller than regional inconsistency, which means biased sentencing due to gender discrimination is rare, or at least less common than that due to regional differences.

4.4 De-biasing Strategies Evaluation

In the previous section, we have defined LInCo for measuring judicial inconsistency with the help of LJP, and we find that inconsistency does exist in real datasets.

However, inconsistency should be avoided both in real-world judgments and in LJP models. Therefore, in this section, we use LInCo to evaluate the performance of several de-biasing methods. We introduce two training strategies for de-biasing and compare them with the regular one. As in the previous section, we focus on regional inconsistency as an example.

4.4.1 Training Strategies and Optimization

In this part, we describe the three different training strategies. To summarize: Vanilla Strategy fully follows the method of LJP task to train virtual judges with regional differences; Universal Strategy tries to reduce regional differences by pre-training a shared encoder over regions; Adversarial Strategy is similar to Universal Strategy except that an adversarial pre-trained encoder is used to eliminate regional differences further. An overview of all strategies can be found in Figure 3.

Vanilla Strategy. Vanilla Strategy (V-Strategy) only utilizes encoder and predictor, which means it is equal to the traditional LJP model and does not do any de-biasing at all.

Universal Strategy. The same as V-Strategy, Universal Strategy (U-Strategy) utilizes encoder and predictor. The difference is that we first use the entire dataset to pre-train an encoder and then train the models separately with the fixed pre-trained encoder (only optimize the predictor) on each train set. By this strategy, models can share the parameters of embedding for de-biasing.

Adversarial Strategy. Adversarial Strategy (A-Strategy) is similar to U-Strategy, except that we introduce a discriminator when pre-training the encoder. The discriminator predicts which group the input data belongs to, i.e., from which region, and we can formalize the task of the discriminator as a multi-class classification problem. We use a linear layer in the discriminator and use cross-entropy to optimize the models in our experiments. Specifically, we use a cross-entropy loss to optimize the discriminator's linear layer and an adversarial loss, defined on the group probabilities output by the discriminator, to optimize the encoder so that it confuses the discriminator.

For training, we first use the entire dataset to train the model with a discriminator, which aims to obtain a pre-trained encoder that can confuse the discriminator. Then we train independent virtual judges with this pre-trained encoder, in the same way as U-Strategy.
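The two objectives of the adversarial pre-training can be sketched without any deep learning framework. The discriminator loss below is the standard cross-entropy over regions. The encoder objective shown here, the negated discriminator loss, is one common adversarial choice and is an assumption, since the exact adversarial loss is not reproduced in this text. The function names are illustrative.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against the true class `label`."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_normalizer = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_normalizer - logits[label]


def adversarial_losses(region_logits, true_region):
    """Return (discriminator loss, encoder loss) for a single case."""
    discriminator_loss = cross_entropy(region_logits, true_region)
    # Assumption: the encoder is rewarded when the discriminator is wrong.
    encoder_loss = -discriminator_loss
    return discriminator_loss, encoder_loss


# A discriminator that cannot tell four regions apart outputs uniform logits.
confused_d, confused_e = adversarial_losses([0.0, 0.0, 0.0, 0.0], true_region=2)
# A discriminator that recognises the region easily has a loss near zero.
confident_d, confident_e = adversarial_losses([10.0, 0.0, 0.0, 0.0], true_region=0)
```

Under this reading, a well-trained encoder pushes the discriminator toward the confused case, where its loss approaches the logarithm of the number of regions.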

Figure 3: An overview of the training framework and three strategies.

4.4.2 Experimental Settings

In the experiments of verifying the strategies, we mainly follow the experimental settings mentioned in previous experiments. In addition, besides the specific term predicting model DGN, we employ more NLP models for classification and regression here.

DPCNN (Johnson and Zhang, 2017): This method is based on Convolutional Neural Networks. It uses region embedding layers and convolutional blocks, combined with the shortcut connections proposed in ResNet (He et al., 2016). The structure of DPCNN makes deep models for NLP classification tasks possible, and it can effectively extract long-distance relationship features in text.

GRU (Cho et al., 2014): This model is a variation of LSTM. GRU adds a gating mechanism in the recurrent neural network unit, which couples the input and forget gates to decrease the number of parameters. Even with fewer parameters, GRU can still achieve performance similar to LSTM.

BERT (Devlin et al., 2019): BERT is a model formed by multiple bidirectional Transformer layers. The parameters of BERT have been fully pre-trained on large-scale text corpora. Recently, BERT has achieved state-of-the-art results in many NLP tasks, including classification, reading comprehension, and question answering.

To ensure a fair comparison between different models, we use trainable character-level embeddings for every model. We use Adam (Kingma and Ba, 2015) to train all models except BERT, for which we use BertAdam (Devlin et al., 2019). We use separate learning rates for BERT, for DPCNN, and for all other models. As above, we implement all the models using PyTorch, and the code can be found in the attached files.

4.4.3 Inconsistency Evaluation with Different Strategies

In this part, we present the experimental results of the three strategies. We evaluate LInCo of these strategies using different models on the real dataset. We pick the two charges with the most data, i.e., theft and intentional injury, to prevent models overfitting on small datasets.

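| Model | Theft (V) | Theft (U) | Theft (A) | Intentional Injury (V) | Intentional Injury (U) | Intentional Injury (A) |
|---|---|---|---|---|---|---|
| DGN | 0.376 | 0.056 | 0.128 | 0.274 | 0.053 | 0.113 |
| DPCNN | 0.307 | 0.191 | 0.175 | 0.255 | 0.127 | 0.119 |
| GRU | 0.298 | 0.084 | 0.117 | 0.219 | 0.076 | 0.084 |
| BERT | 0.305 | 0.194 | 0.175 | 0.251 | 0.227 | 0.16 |

Table 6: Results of LInCo on two representative crimes using three strategies (V: Vanilla, U: Universal, A: Adversarial).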

We calculate LInCo for all models in each strategy, shown in Table 6. From the table, we find that overall, models trained with the U-Strategy and A-Strategy have lower LInCo than those with V-Strategy. It is also worth noting that the de-biasing effect of U-Strategy is better than A-Strategy for the two sequential models, DGN and GRU. In contrast, A-Strategy is better for the two non-sequential models. This phenomenon suggests that we need to use different strategies to de-bias different models.
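The comparison can be made concrete by computing, from the Table 6 values for theft, the relative reduction in LInCo that each de-biasing strategy achieves over the vanilla strategy. The helper names below are illustrative, and relative reduction is our way of summarising the table rather than a metric defined in the paper.

```python
# LInCo on theft from Table 6, per model and strategy (V, U, A).
theft_linco = {
    "DGN": {"V": 0.376, "U": 0.056, "A": 0.128},
    "DPCNN": {"V": 0.307, "U": 0.191, "A": 0.175},
    "GRU": {"V": 0.298, "U": 0.084, "A": 0.117},
    "BERT": {"V": 0.305, "U": 0.194, "A": 0.175},
}


def relative_reduction(scores, strategy):
    """Fraction of the vanilla LInCo removed by the given strategy."""
    return 1.0 - scores[strategy] / scores["V"]


def best_debiasing(scores):
    """De-biasing strategy (U or A) with the lower LInCo."""
    return min(("U", "A"), key=lambda strategy: scores[strategy])


best = {model: best_debiasing(scores) for model, scores in theft_linco.items()}
dgn_u_reduction = relative_reduction(theft_linco["DGN"], "U")
```

For theft, `best` selects U for DGN and GRU and A for DPCNN and BERT, in line with the observation about sequential and non-sequential models.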

Figure 4: Heat maps of the inconsistency matrix with DGN as encoder on the crime of theft.

To better interpret the results, we compute a matrix that measures the inconsistency among virtual judges on each test set and use it for visualization. This matrix intuitively and roughly reflects the trend of LInCo. We choose DGN as the encoder and the crime of theft, and visualize the heat map of the matrix in Figure 4. From the figure, we can see that the inconsistency of the three strategies when using DGN corresponds to our analysis of the sequential models, i.e., V-Strategy has the worst consistency, and U-Strategy has the best consistency.

From the figure we can find that for three different strategies, the inconsistency of the model predictions gradually decreases as we expected, which reflects: (1) Encoders inevitably catch regional diversity during the training process of V-Strategy; (2) U-Strategy tends to extract general features instead of specific regional features, which improves prediction consistency; (3) A-Strategy almost eliminates the regional diversity learned by the entire network, thus even when train sets from different regions are used, the network can still maintain high consistency in prediction. In addition, we can further confirm LInCo can directly reveal judgment inconsistency quantitatively and accurately.

Crimes Theft Intentional Injury
Strategy V U A V U A
DGN 6.74 5.18 7.40 8.57 7.60 9.39
DPCNN 6.22 7.95 8.30 9.44 11.35 13.72
GRU 6.27 5.05 6.80 7.58 6.90 9.58
BERT 5.48 5.01 7.76 8.08 7.53 12.53
Table 7: The average L1 distance between ground truth and prediction of models on two crimes (unit: month).

We also examine the performance of each network under each strategy. Specifically, we calculate the average L1 distance between prediction and ground truth to measure the accuracy of each strategy. The results are shown in Table 7. They indicate that U-Strategy often leads to the best performance and A-Strategy to the worst. U-Strategy is usually the best because it learns from more data, so the encoders are strong enough to remember the features of every region and reach better performance. In contrast, since A-Strategy eliminates these regional features, it understandably performs worse, as the judgment results between regions are inherently inconsistent.
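The accuracy measure used here, the average L1 distance between predicted and ground-truth terms, can be sketched as follows. The function name and the example values are hypothetical and serve only to illustrate the computation; terms are assumed to be expressed in months.

```python
def average_l1_distance(predicted_months, true_months):
    """Mean absolute difference between predicted and true terms of penalty."""
    if len(predicted_months) != len(true_months):
        raise ValueError("predictions and ground truth must have the same length")
    if not predicted_months:
        raise ValueError("at least one case is required")
    total = sum(abs(p - t) for p, t in zip(predicted_months, true_months))
    return total / len(predicted_months)

# Hypothetical example: four cases, terms in months.
example_error = average_l1_distance([10, 14, 6, 24], [12, 12, 9, 24])
```

A lower value means the predicted terms are closer to the actual sentences on average.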

As a result, if we want to apply LJP technology to the real legal system, we must use some strategy to reduce the inconsistency of models. Otherwise, we cannot guarantee the fairness of the legal system. On the other hand, a model that performs too poorly cannot be used in the system either. A strategy applied in the real world should therefore strike a balance between inconsistency and performance, as U-Strategy does.

5 Discussion

Our experimental results show that inconsistency does exist in real datasets. However, many possible factors may lead to inconsistency, such as regional differences in economic conditions and education levels, differences between courts or judges, and interference from public opinion. The reasons for inconsistency may be very complicated, and we leave their analysis to future work.

Besides, we list several questions and answers in the Appendix to help readers understand our paper better. Here we show the six most typical questions:

Q: Measuring judgment inconsistency is already difficult for humans. Why do it automatically? A: Humans cannot accurately perform macro-analysis over such a huge amount of data, whereas this is exactly what computers are good at.

Q: What is the motivation of LInCo? A: The motivation of LInCo can be illustrated with the following example. If we want to measure regional consistency, the best way is to have local judges independently adjudicate the same set of cases and observe their disagreement. However, this is unrealistic, so we train LJP models to simulate local virtual judges to achieve similar results.

Q: Since judgment consistency means the degree to which similar cases yield similar court decisions, why not measure the similarity between cases to help solve the problem? A: There are two reasons. (1) Facing groups with cases in each group, the time complexity of measuring the similarity between cases will be , which is really time-consuming. (2) It is hard to define the similarity between two different cases (Xiao et al., 2019; Zhong et al., 2020b). For example, an act of self-defense can make two textually similar cases very different, and many more factors should be considered when evaluating the similarity between cases.

Q: For the LJP model, why do we select the term of penalty rather than charges or applicable articles as the task? A: Relevant charges, applicable articles, and the term of penalty are all subtasks of LJP. However, the first two are qualitative, while the term is quantitative. Comparatively, qualitative judgment results are less likely to be inconsistent. For example, it is rare for a case that should have been convicted of theft to be convicted of fraud. In contrast, inconsistencies are often reflected in the term of penalty. Even similar cases may receive penalties of various lengths, as shown in the first table in the main text. Therefore, the term of penalty better reflects consistency, which is why we select it as the major task.

Q: Why not do a case study? A: LInCo, as a macro indicator, presents the overall inconsistency of a certain amount of data. In contrast, case studies show differences between individual cases, so they are not suited to the goal of this paper.

Q: Why not use unsupervised methods, such as clustering, to solve the problem? A: Traditional clustering methods can measure consistency between texts. However, our study focuses more on judgment, which is the intermediate process from the factual text to the judgment result. It is hard to represent the judgment process in terms of vectors, so clustering cannot address the issue. Besides, clustering cannot learn the document representation well and is therefore too weak to be applied.

6 Conclusions

In this paper, we address the lack of methods for measuring the inconsistency of legal judgment and propose LInCo, an approach that quantifies the judgment inconsistency between different data groups. We verify the effectiveness of LInCo on the synthetic datasets and make sure that LInCo can be applied to measure legal judgment inconsistency. Then, we conduct experiments on the real-world datasets and discover that judicial inconsistency indeed exists in the Chinese legal system. We also discover several interesting phenomena, including that felonies tend to be judged more consistently than misdemeanors, and gender inconsistency is much less than the regional one. Moreover, we use LInCo to evaluate the performance of several de-biasing strategies (such as adversarial learning).

We will focus on exploring the following directions in the future: (1) We will explore the relationship between judicial inconsistency and other factors and discover patterns in the legal system with the help of LInCo. (2) We will also investigate which factors cause legal inconsistency, which will be important for improving the fairness of the judicial system.

We hope that with more research on judgment consistency analysis, judicial fairness and consistency can be better quantified so that the entire legal system can be better monitored, and the development of judgment fairness can be promoted.

This work was supported by the National Natural Science Foundation of China (Grant Nos. 00000000 and 11111111).


  • L. Anderlini, L. Felli, and A. Riboni (2014) Why stare decisis. Review of Economic Dynamics 17 (4), pp. 726–738. Cited by: §1, §2.1.
  • L. Anderlini, L. Felli, and A. Riboni (2020) Legal efficiency and consistency. European Economic Review 121, pp. 103323. Cited by: §1, §2.1.
  • R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth (2018) Fairness in criminal justice risk assessments: the state of the art. Sociological Methods & Research, pp. 0049124118782533. Cited by: §2.1.
  • H. Chen, D. Cai, W. Dai, Z. Dai, and Y. Ding (2019) Charge-based prison term prediction with deep gating network. In Proceedings of EMNLP, Cited by: §2.2, 1st item.
  • Y. Chen, Y. Sun, Z. Yang, and H. Lin (2020) Joint entity and relation extraction for legal documents with legal feature enhancement. In Proceedings of COLING, pp. 1561–1571. Cited by: §2.2.
  • Y. Chen, Y. Liu, and W. Ho (2013) A text mining approach to assist the general public in the retrieval of legal documents. Journal of the American Society for Information Science and Technology 64 (2), pp. 280–290. Cited by: §2.2.
  • K. Cho, B. V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of EMNLP, Cited by: §4.4.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, Cited by: §3.2, §4.4.2, §4.4.2.
  • X. Duan, Y. Zhang, L. Yuan, X. Zhou, X. Liu, T. Wang, R. Wang, Q. Zhang, C. Sun, and F. Wu (2019) Legal summarization for multi-role debate dialogue via controversy focus mining and multi-task learning. In Proceedings of CIKM, Cited by: §2.2.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel (2012) Fairness through awareness. In Proceedings of ITCS, Cited by: §2.1.
  • M. Edgely (2009) Common law sentencing of mentally impaired offenders in australian courts: a call for coherence and consistency. Psychiatry, Psychology and Law 16 (2), pp. 240–261. Cited by: §1, §2.1.
  • E. Flynn (2013) Making human rights meaningful for people with disabilities: advocacy, access to justice and equality before the law. The International Journal of Human Rights 17 (4), pp. 491–510. Cited by: §1.
  • N. Grgichlaca, E. M. Redmiles, K. P. Gummadi, and A. Weller (2018) Human perceptions of fairness in algorithmic decision making: a case study of criminal risk prediction. In Proceedings of WWW, Cited by: §1.
  • C. He, L. Peng, Y. Le, J. He, and X. Zhu (2019) SECaps: a sequence enhanced capsule model for charge prediction. In International Conference on Artificial Neural Networks, pp. 227–239. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of CVPR, pp. 770–778. Cited by: §4.4.2.
  • Z. Hu, X. Li, C. Tu, Z. Liu, and M. Sun (2018) Few-shot charge prediction with discriminative legal attributes. In Proceedings of COLING, Cited by: §2.2.
  • J. Jezewski, K. Horoba, A. Gasek, J. Wrobel, A. Matonia, and T. Kupka (2003) Analysis of nonstationarities in fetal heart rate signal: inconsistency measures of baselines using acceleration/deceleration patterns. In Seventh International Symposium on Signal Processing and Its Applications, 2003. Proceedings., Vol. 2, pp. 9–12. Cited by: §2.1.
  • R. Johnson and T. Zhang (2017) Deep pyramid convolutional neural networks for text categorization. In Proceedings of ACL, Cited by: §4.4.2.
  • D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of ICLR, Cited by: §4.2.1, §4.4.2.
  • T. Korenius, J. Laurikkala, K. Järvelin, and M. Juhola (2004) Stemming and lemmatization in the clustering of finnish text documents. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 625–633. Cited by: §2.1.
  • L. Li (2014) A sentencing study of the criminal cases with similar conditions but different sentence (in chinese). Journal of Zhongzhou University, pp. 34–37. Cited by: §1, §2.1.
  • T. Li, C. Lin, and Q. Chen (2012) Inconsistency analysis of LiFePO4 battery packing. Journal of Tsinghua University Science and Technology 52 (7), pp. 1001–1006. Cited by: §2.1.
  • K. Raghav, P. K. Reddy, and V. B. Reddy (2016) Analyzing the extraction of relevant legal judgments using paragraph-level and citation information. Artificial Intelligence for Justice 30. Cited by: §2.2.
  • F. G. Reamer (2005) Ethical and legal standards in social work: consistency and conflict. Families in society-The journal of contemporary social services 86 (2), pp. 163–169. Cited by: §1, §2.1.
  • S. Shen, G. Qi, Z. Li, S. Bi, and L. Wang (2020) Hierarchical chinese legal event extraction via pedal attention mechanism. In Proceedings of COLING, pp. 100–113. Cited by: §2.2.
  • T. Speicher, H. Heidari, N. Grgichlaca, K. P. Gummadi, A. Singla, A. Weller, and M. B. Zafar (2018) A unified approach to quantifying algorithmic unfairness: measuring individual & group unfairness via inequality indices. In Knowledge Discovery and Data Mining, pp. 2239–2248. Cited by: §2.1.
  • C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, et al. (2018) Cail2018: a large-scale legal dataset for judgment prediction. arXiv preprint arXiv:1807.02478. Cited by: §4.1.
  • C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, T. Zhang, X. Han, Z. Hu, H. Wang, et al. (2019) CAIL2019-scm: a dataset of similar case matching in legal domain. arXiv: Computation and Language. Cited by: §1, §5.
  • H. Ye, X. Jiang, Z. Luo, and W. Chao (2018) Interpretable charge predictions for criminal cases: learning to generate court views from fact descriptions. In Proceedings of NAACL, Cited by: §2.2.
  • M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi (2017) Fairness constraints: mechanisms for fair classification. In Proceedings of AISTATS, Cited by: §2.1.
  • H. Zhong, Y. Wang, C. Tu, T. Zhang, Z. Liu, and M. Sun (2020a) Iteratively questioning and answering for interpretable legal judgment prediction. In Proceedings of AAAI, Cited by: §2.2.
  • H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, and M. Sun (2020b) How does nlp benefit legal system: a summary of legal artificial intelligence. In Proceedings of ACL, Cited by: §1, §5.
  • H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, and M. Sun (2020c) JEC-qa: a legal-domain question answering dataset. In Proceedings of AAAI, Cited by: §2.2.
  • H. Zhong, G. Zhipeng, C. Tu, C. Xiao, Z. Liu, and M. Sun (2018) Legal judgment prediction via topological learning. In Proceedings of EMNLP, Cited by: §2.2.

Appendix A Data Volume

In this section, we show the data volume of each experiment.

Again, it is important to emphasize that we keep the amount of data uniform for each group in each experiment. Hence, the dataset sizes used for each experiment we report below are for each group.
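Keeping the amount of data uniform across groups can be sketched as a simple subsampling step. This is an illustrative sketch rather than our exact preprocessing code: the function name `balance_groups`, the fixed random seed, and the toy group names are assumptions.

```python
import random

def balance_groups(groups, size_per_group=None, seed=0):
    """Subsample every group to the same number of cases.

    groups: dict mapping a group name (e.g., a region) to a list of cases.
    size_per_group: target size; defaults to the size of the smallest group.
    """
    if size_per_group is None:
        size_per_group = min(len(cases) for cases in groups.values())
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    balanced = {}
    for name, cases in groups.items():
        if len(cases) < size_per_group:
            raise ValueError(f"group {name!r} has fewer than {size_per_group} cases")
        balanced[name] = rng.sample(cases, size_per_group)
    return balanced

# Hypothetical toy groups of unequal size.
toy_groups = {"region_a": list(range(10)), "region_b": list(range(100, 107))}
toy_balanced = balance_groups(toy_groups)
```

Sampling without replacement ensures that no case is duplicated within a group, so models trained on different groups see the same number of distinct cases.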

In Section 4.2 (Simulation Study) and Section 4.4 (De-biasing Strategies Evaluation) of the main text, for the experiments on both theft and intentional injury, we control the size of the artificial dataset to 5000 per group. Besides, for the simulation study, each experiment uses cases with only one criminal charge. The complete experimental results are shown in Appendix B, and the results we report in the main text are for the crime of theft.

Table 8 shows the dataset sizes used for the different crime experiments in Section 4.3 (LInCo on Realistic Dataset) of the main text. In addition to this, for experiments using data from one specific region (i.e., the experiments in Table 3 of the main text), the amount of data for the individual charge experiments is also the same as that presented in Table 8.

Crime Data per Region Data per Gender
Fraud 2000 5000
Drug Trafficking 2000 5000
Possession of Illegal Drugs 500 3031
Intentional Injury 5000 5000
Traffic Offence 2000 5000
Theft 5000 5000
Picking Quarrels and Provoking Trouble 1000 1740
Disrupting Public Service 500 5000
Providing Venues for Drug Users 500 5000
Table 8: Dataset sizes for the experiments of Table 2 and Table 3 in the main text.

Also, in Section 4.3, the amount of data used for each crime in the experiments on regional inconsistency across different years (i.e., the experiments in Table 4 of the main text) is shown in Table 9.

Crime 2013 – 2015 2016 – 2018
Fraud 1000 1000
Drug Trafficking 1000 1000
Intentional Injury 4800 4800
Traffic Offence 1000 1000
Theft 5000 5000
Picking Quarrels and Provoking Trouble 500 500
Table 9: Statistics for the data set of the Table 5 experiment in the main text.

Appendix B Experimental Results of The Simulation Study

The entire experimental results of the simulation study are displayed in Table 10. Note that the results we report in the main text are for the crime of theft.

Table 10 (two panels, both with DGN): The result of different and . For each cell, we conduct multiple experiments and report the mean and standard deviation of LInCo.
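Reporting the mean and standard deviation over repeated runs can be sketched as below. The LInCo values are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not specify which is reported.

```python
import statistics

def summarize_runs(linco_values):
    """Return (mean, sample standard deviation) of LInCo over repeated runs."""
    if len(linco_values) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    mean = statistics.mean(linco_values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator);
    # statistics.pstdev would give the population version instead.
    std = statistics.stdev(linco_values)
    return mean, std

# Hypothetical LInCo values from three repeated runs of one setting.
run_mean, run_std = summarize_runs([0.10, 0.12, 0.14])
```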

Appendix C Description of Charges

In this section, we describe the criminal charges mentioned in this paper. All the descriptions refer to Chinese Criminal Law.

Fraud: The act of swindling public or private money or property for the purpose of illegal possession, where the amount is relatively large.

Drug Trafficking: The act of knowingly smuggling, trafficking, transporting or manufacturing drugs.

Possession of Illegal Drugs: The act of knowingly and illegally possessing drugs (e.g., opium, heroin, methylaniline) in relatively large quantities.

Intentional Injury: The act that intentionally inflicts injury upon another person.

Traffic Offence: The act of violating regulations governing traffic and transportation, thereby causing a serious accident that results in serious injuries, deaths, or heavy losses of public or private property.

Theft: The act of stealing a relatively large amount of public or private property, or committing theft repeatedly, for the purpose of illegal possession.

Picking Quarrels and Provoking Trouble: Committing any of the following acts of creating disturbances, thus disrupting public order: (1) beating another person at will and to a flagrant extent; (2) chasing, intercepting or hurling insults to another person to a flagrant extent; (3) forcibly taking or demanding, willfully damaging, destroying or occupying public or private money or property to a serious extent; or (4) creating disturbances in a public place, thus causing serious disorder in such place.

Disrupting Public Service: The act that, by means of violence or threat, obstructs a functionary of a State organ from carrying out his functions according to law.

Providing Venues for Drug Users: The act that provides shelter for another person to ingest or inject narcotic drugs.