Exploiting artificial intelligence techniques to assist legal judgment has become popular in recent years. Legal judgment prediction (LJP) aims to predict a case’s judgment results, such as applicable law articles, charges, and terms of penalty, based on its fact description, as illustrated in Figure 1. LJP can assist judiciary workers in processing cases and offer legal consultancy services to the public. In the literature, LJP is usually formulated as a text classification problem, and several rule-based methods Liu et al. (2004); Lin et al. (2012) and neural-based methods Hu et al. (2018); Luo et al. (2017); Zhong et al. (2018) have been proposed.
The main drawback of existing methods is that they cannot solve the confusing charges issue. That is, due to the high similarity of several law articles, their corresponding law cases can be easily misjudged. For example, in Figure 2, both Article 385 and Article 163 describe offenses of accepting bribes, and their subtle difference is whether the guilty parties are state staff or not. The key to solving the confusing charges issue is how to capture important but rare features for distinguishing confusing law articles. Hu et al. (2018) defined 10 discriminative attributes to distinguish confusing charges. However, their method relies heavily on expert knowledge, which hinders its application to a large number of laws. In practice, we desire a method that can automatically extract textual features from law articles to assist LJP. The most relevant existing work to this requirement is Luo et al. (2017), which used an attention mechanism to extract features from fact descriptions with respect to a specific law article. As shown in Figure 3a, for each law article, an attention vector is computed, which is used to extract features from the fact description of a law case to predict whether the law article is applicable to the case. Nevertheless, the weakness is that each law article’s attention vector is learned independently, which may result in similar attention vectors being learned for semantically close law articles; hence, the method is ineffective in distinguishing confusing charges.
To solve the confusing charges issue, we propose an end-to-end framework, i.e., Law Article Distillation based Attention Network (LADAN). LADAN uses the difference among similar law articles to attentively extract features from law cases’ fact descriptions, which is more effective in distinguishing confusing law articles and improves the performance of LJP. To obtain the difference among similar law articles, a straightforward way is to remove duplicated texts between two law articles and only use the leftover texts for the attention mechanism. However, we find that this method may generate the same leftover texts for different law articles, and thus provide misleading information to LJP. As shown in Fig. 2, if we remove the duplicated phrases and sentences between Article 163 and Article 385 (i.e., the red text in Fig. 2), and between Article 164 and Article 389 (i.e., the pink text in Fig. 2), respectively, then Article 385 and Article 389 will be almost the same as each other (i.e., the blue text in Fig. 2).
We design LADAN based on the following observation: it is usually easy to distinguish dissimilar law articles as sufficient distinctions exist, but difficult to discriminate similar law articles due to the few effective features. We first group law articles into different communities, and law articles in the same community are highly similar to each other. Then we propose a graph-based representation learning method to automatically explore the difference among law articles and compute an attention vector for each community. For an input law case, we learn both macro- and micro-level features. Macro-level features are used for predicting which community includes the applicable law articles. Micro-level features are attentively extracted by the attention vector of the selected community for distinguishing confusing law articles within the same community. Our main contributions are summarized as follows:
(1) We develop an end-to-end framework, i.e., LADAN, to solve the LJP task. It addresses the confusing charges issue by mining similarities between fact descriptions and law articles as well as the distinctions between confusing law articles.
(2) We propose a novel graph distillation operator (GDO) to extract discriminative features for effectively distinguishing confusing law articles.
(3) We conduct extensive experiments on real-world datasets. The experimental results show that our model outperforms all state-of-the-art methods.
2 Problem Formulation
In this section, we introduce some notations and terminologies, and then formulate the LJP task.
Each law case consists of a fact description and several judgment results (cf. Figure 1). The fact description is represented as a text document, denoted by $f$. The judgment results may include applicable law articles, charges, terms of penalty, etc. Assume there are $t$ kinds of judgment results, and the $i$-th judgment result is represented as a categorical variable $y_i$ which takes value from set $Y_i$. Then, a law case can be represented by a tuple $(f, y_1, \ldots, y_t)$.
Law cases are often analyzed and adjudicated according to a legislature’s statutory law (also known as written law). Formally, we denote the statutory law as a set of law articles $\mathcal{L} = \{L_1, \ldots, L_m\}$, where $m$ is the number of law articles. Similar to the fact description of cases, we also represent each law article $L_i$ as a document.
Legal Judgment Prediction.
In this paper, we consider three kinds of judgment results: applicable law articles, charges, and terms of penalty. Given a training dataset $D = \{(f, y_1, y_2, y_3)_z\}_{z=1}^{q}$ of size $q$, we aim to train a model $F(\cdot)$ that can predict the judgment results for any test law case with a fact description $f_{test}$, i.e., $F(f_{test}) = (\hat{y}_1, \hat{y}_2, \hat{y}_3)$, where $\hat{y}_i \in Y_i$, $i = 1, 2, 3$. Following Zhong et al. (2018); Yang et al. (2019), we assume each case has only one applicable law article.
3 Our Method
3.1 Overview
In our framework LADAN (cf. Fig. 4a), the fact description of a case is represented by two parts: a basic representation, denoted by $v^b_f$, and a distinguishable representation, denoted by $v^d_f$. The basic representation contains basic semantic information for matching a group of law articles that may apply to the case. In contrast, the distinguishable representation captures features that can effectively distinguish confusing law articles. The concatenation of $v^b_f$ and $v^d_f$ is fed into subsequent classifiers to predict the labels of the LJP tasks.
As we mentioned, it is easy to distinguish dissimilar law articles as sufficient distinctions exist, and the difficulty in solving confusing charges lies in extracting distinguishable features of similar law articles. To obtain the basic representation $v^b_f$, therefore, we use one of the popular document encoding methods (e.g., CNN encoder Kim (2014) and Bi-RNN encoder Yang et al. (2016)). To learn the distinguishable representation $v^d_f$, we use a law distillation module to first divide law articles into several communities to ensure that the law articles in each community are highly similar, and then extract each community $g_i$’s distinction vector (or, distinguishable features) $\beta_i$ from the basic representation of law articles in community $g_i$. Given the case’s fact description, from all communities’ distinction vectors, we select the most relevant one (i.e., $\beta_{\hat{c}}$ in Fig. 4(a)) for attentively extracting the distinguishable features in the fact re-encode module. In the following, we elaborate on the law distillation module (Sec. 3.2) and the fact re-encode module (Sec. 3.3) respectively.
3.2 Distilling Law Articles
A case might be misjudged due to the high similarity of some law articles. To alleviate this problem, we design a law distillation module (cf. Fig. 4b) to extract distinguishable and representative information from all law articles. Specifically, it first uses a graph construction layer (GCL) to divide law articles into different communities. For each law article community, a graph distillation layer is applied to learn its discriminative representation, hereinafter called the distinction vector.
3.2.1 Graph Construction Layer
To find probably confusing law articles, we first construct a fully-connected graph $G^*$ for all law articles $\mathcal{L}$, where the weight on the edge between a pair of law articles $L_i, L_j \in \mathcal{L}$ is defined as the cosine similarity between their TF-IDF (Term Frequency-Inverse Document Frequency) representations $\mathrm{tf\_idf}_i$ and $\mathrm{tf\_idf}_j$. Since confusing law articles are usually semantically similar and there exists sufficient information to distinguish dissimilar law articles, we remove the edges with weights less than a predefined threshold $\tau$ from graph $G^*$. By setting an appropriate $\tau$, we obtain a new graph $G = \{g_i\}_{i=1}^{M}$ composed of several disconnected subgraphs $g_1, \ldots, g_M$ (or, communities), where each $g_i$, $i = 1, \ldots, M$, contains a specific community of probably confusing articles. Our later experimental results demonstrate that this easy-to-implement method effectively improves the performance of LADAN.
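The graph construction step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the smoothed TF-IDF weighting, the function names (`tfidf_vectors`, `cosine`, `build_communities`), and the toy articles are all assumptions made for the example, and the law articles are assumed to be already tokenised.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenised document to a sparse {term: tf-idf weight} dict."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Smoothed idf (an assumed variant; the paper does not specify one).
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_communities(docs, tau):
    """Keep edges with similarity >= tau, then return the adjacency map
    and the connected components (the law article communities)."""
    vectors = tfidf_vectors(docs)
    n = len(docs)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= tau:
                adjacency[i].add(j)
                adjacency[j].add(i)
    seen, communities = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search over the thresholded graph
            node = stack.pop()
            component.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        communities.append(sorted(component))
    return adjacency, communities


# Toy articles (hypothetical text, for illustration only).
articles = [
    "state staff accept bribes property".split(),
    "company staff accept bribes property".split(),
    "steal public property".split(),
]
_, communities = build_communities(articles, tau=0.3)
print(communities)  # [[0, 1], [2]]
```

With a threshold of 0 every edge survives and all articles fall into one community; with a threshold above 1 every article becomes its own community.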
3.2.2 Graph Distillation Layer
To extract the distinguishable information from each community $g_i$, a straightforward way is to delete duplicate words and sentences presented in law articles within the community (as described in Sec. 1). In addition to introducing large errors, this simple method cannot be plugged into end-to-end neural architectures due to its non-differentiability. To overcome the above issues, inspired by the popular graph convolution operator (GCO) Kipf and Welling (2017); Hamilton et al. (2017); Veličković et al. (2017), we propose a graph distillation operator (GDO) to effectively extract distinguishable features. Different from GCO, which computes the message propagation between neighbors and aggregates these messages to enrich the representations of nodes in the graph, the basic idea behind our GDO is to learn effective features with distinction by removing similar features between nodes.
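Specifically, for an arbitrary law article $L_i$, GDO uses a trainable weight matrix $\Psi$ to capture similar information between it and its neighbors in graph $G$, and a matrix $\Phi$ to extract effective semantic features of $L_i$. At each layer $l \geq 0$, the aggregation of similar information between $L_i$ and its neighbors is removed from its representation, that is,
$$v^{(l+1)}_{L_i} = \Phi^{(l)} v^{(l)}_{L_i} - \frac{1}{|N_i|} \sum_{L_j \in N_i} \Psi^{(l)} \left[ v^{(l)}_{L_i}, v^{(l)}_{L_j} \right] + b^{(l)},$$
where $v^{(l)}_{L_i} \in \mathbb{R}^{d_l}$ refers to the representation of law $L_i$ in the $l$-th graph distillation layer, $N_i$ refers to the neighbor set of $L_i$ in graph $G$, $b^{(l)}$ is the bias, and $\Phi^{(l)} \in \mathbb{R}^{d_{l+1} \times d_l}$ and $\Psi^{(l)} \in \mathbb{R}^{d_{l+1} \times 2 d_l}$ are the trainable self weighted matrix and the neighbor similarity extracting matrix respectively. Note that $d_l$ is the dimension of the feature vector in the $l$-th graph distillation layer. We set $d_0 = d_s$, where $d_s$ is the dimension of basic representations $v^b_f$ and $v_{L_i}$. Similar to GCO, our GDO also supports multi-layer stacking.
Using GDO with $H$ layers, we output the law article representation of the last layer, i.e., $v^{(H)}_{L_i} \in \mathbb{R}^{d_H}$, which contains rich distinguishable features that can distinguish law article $L_i$ from the articles within the same community. To further improve law articles’ distinguishable features, for each subgraph $g_i$, $i = 1, \ldots, M$, in graph $G$, we compute its distinction vector $\beta_i$ by using pooling operators to aggregate the distinguishable features of articles in $g_i$. Formally, $\beta_i$ is computed as:
$$\beta_i = \left[ \mathrm{MaP}\left(\{v^{(H)}_{L_j}\}_{L_j \in g_i}\right), \mathrm{MiP}\left(\{v^{(H)}_{L_j}\}_{L_j \in g_i}\right) \right],$$
where $\mathrm{MaP}(\cdot)$ and $\mathrm{MiP}(\cdot)$ are the element-wise max pooling and element-wise min pooling operators respectively.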
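To make the distillation idea concrete, the following is a rough sketch of one GDO layer followed by the max/min pooling that yields a community's distinction vector. It assumes that the shared information is the mean, over neighbors, of the neighbor-similarity matrix applied to the concatenation of a node's vector with each neighbor's vector, and that an isolated node has nothing subtracted; the helper names (`matvec`, `gdo_layer`, `distinction_vector`) and the hand-picked weights are hypothetical, whereas a trained model would learn the matrices.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gdo_layer(reps, neighbours, phi, psi, bias):
    """One distillation step: keep self features (phi) and subtract the
    averaged information shared with neighbours (psi).

    reps:       {node: vector of length d}
    neighbours: {node: set of neighbouring nodes}
    phi:        d_out x d matrix, psi: d_out x 2d matrix, bias: length d_out
    """
    updated = {}
    for node, vec in reps.items():
        self_part = matvec(phi, vec)
        shared = [0.0] * len(bias)
        for other in neighbours[node]:
            # psi sees the concatenation [v_node, v_other].
            message = matvec(psi, vec + reps[other])
            shared = [s + m for s, m in zip(shared, message)]
        degree = len(neighbours[node])
        if degree:  # assumption: isolated nodes have nothing to remove
            shared = [s / degree for s in shared]
        updated[node] = [p - s + b for p, s, b in zip(self_part, shared, bias)]
    return updated


def distinction_vector(reps, community):
    """Concatenate element-wise max pooling and min pooling over a community."""
    columns = list(zip(*(reps[node] for node in community)))
    return [max(col) for col in columns] + [min(col) for col in columns]


# Two articles that agree on the second feature and differ on the first.
reps = {0: [1.0, 2.0], 1: [3.0, 2.0]}
neighbours = {0: {1}, 1: {0}}
phi = [[1.0, 0.0], [0.0, 1.0]]                       # identity: keep own features
psi = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]   # average of the pair
bias = [0.0, 0.0]

distilled = gdo_layer(reps, neighbours, phi, psi, bias)
print(distilled)                              # {0: [-1.0, 0.0], 1: [1.0, 0.0]}
print(distinction_vector(distilled, [0, 1]))  # [1.0, 0.0, -1.0, 0.0]
```

In this toy setting the feature shared by both articles is cancelled to zero, while the feature on which they differ survives with opposite signs, which is the behaviour the operator is meant to encourage.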
3.3 Re-encoding Fact with Distinguishable Attention
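To capture a law case’s distinguishable features from its fact description $f$, we first define the following linear function with softmax normalization, which is used to predict its most related community $g_{\hat{c}}$ in graph $G$:
$$\hat{X} = \mathrm{softmax}(W_g v^b_f + b_g),$$
where $v^b_f$ is the basic representation of fact description $f$, and $W_g \in \mathbb{R}^{M \times d_s}$ and $b_g \in \mathbb{R}^{M}$ are the trainable weight matrix and bias respectively. Each element $\hat{X}_i \in \hat{X}$, $i = 1, \ldots, M$, reflects the closeness between fact description $f$ and law article community $g_i$. The most relevant community $g_{\hat{c}}$ is computed as
$$\hat{c} = \arg\max_{i = 1, \ldots, M} \hat{X}_i.$$
Then, we use the corresponding community’s distinction vector $\beta_{\hat{c}}$ to attentively extract distinguishable features from fact description $f$.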
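The community selection step can be illustrated with a small sketch. It assumes that the linear scores are normalised with a softmax before taking the arg max (the normalisation does not change which community wins); the function names and the example weights are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_community(fact_vector, weights, bias):
    """Score every community from the basic fact representation and
    return (probabilities, index of the most relevant community)."""
    logits = [
        sum(w * x for w, x in zip(row, fact_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, best


# Three communities, a 2-dimensional fact representation (toy values).
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
probs, best = select_community([2.0, -1.0], weights, bias)
print(best)  # 0
```

The selected index is then used to look up the distinction vector of that community for the re-encoding step.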
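Inspired by Yang et al. (2016), we attentively extract distinguishable features based on word-level and sentence-level Bi-directional Gated Recurrent Units (Bi-GRUs). Specifically, for each input sentence $S_i = [w_{i,1}, \ldots, w_{i,n_i}]$ in fact description $f$, word-level Bi-GRUs will output a hidden state sequence, that is,
$$h_{i,j} = \left[ \overrightarrow{\mathrm{GRU}}(w_{i,j}), \overleftarrow{\mathrm{GRU}}(w_{i,j}) \right], \quad j = 1, \ldots, n_i,$$
where $w_{i,j}$ represents the word embedding of word $w_{i,j}$ and $h_{i,j} \in \mathbb{R}^{d_w}$. Based on this hidden state sequence and the distinction vector $\beta_{\hat{c}}$, we calculate an attentive vector $[\alpha_{i,1}, \ldots, \alpha_{i,n_i}]$, where each $\alpha_{i,j}$ evaluates the discrimination ability of word $w_{i,j} \in S_i$. $\alpha_{i,j}$ is formally computed as:
$$\alpha_{i,j} = \frac{\exp\left( \tanh(W_w h_{i,j})^{\top} (W_{gw} \beta_{\hat{c}}) \right)}{\sum_{j'} \exp\left( \tanh(W_w h_{i,j'})^{\top} (W_{gw} \beta_{\hat{c}}) \right)},$$
where $W_w$ and $W_{gw}$ are trainable weight matrices. Then, we get a representation of sentence $S_i$ as:
$$v_{s_i} = \sum_{j=1}^{n_i} \alpha_{i,j} h_{i,j},$$
where $n_i$ denotes the word number in sentence $S_i$.
By the above word-level Bi-GRUs, we get a sentence representation sequence $[v_{s_1}, \ldots, v_{s_{n_f}}]$, where $n_f$ refers to the number of sentences in the fact description $f$. Based on this sequence, similarly, we build sentence-level Bi-GRUs and calculate a sentence-level attentive vector $[\alpha_1, \ldots, \alpha_{n_f}]$ that reflects the discrimination ability of each sentence, and then get the fact’s distinguishable representation $v^d_f \in \mathbb{R}^{d_s}$. Our sentence-level Bi-GRUs are formulated as:
$$h_i = \left[ \overrightarrow{\mathrm{GRU}}(v_{s_i}), \overleftarrow{\mathrm{GRU}}(v_{s_i}) \right], \quad i = 1, \ldots, n_f,$$
$$\alpha_i = \frac{\exp\left( \tanh(W_s h_i)^{\top} (W_{gs} \beta_{\hat{c}}) \right)}{\sum_{i'} \exp\left( \tanh(W_s h_{i'})^{\top} (W_{gs} \beta_{\hat{c}}) \right)}, \qquad v^d_f = \sum_{i=1}^{n_f} \alpha_i h_i.$$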
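The attention step itself, with the distinction vector acting as the query, can be sketched as follows. The sketch takes the Bi-GRU hidden states as given inputs rather than computing them, and assumes a score of the form tanh-projected hidden state dotted with a projected distinction vector, normalised by a softmax; the function and parameter names are hypothetical. The same routine would apply at both the word level and the sentence level.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distinguishable_attention(hidden_states, beta, w_hidden, w_beta):
    """Weight each hidden state by its agreement with the distinction
    vector and return (attention weights, attention-pooled vector)."""
    query = matvec(w_beta, beta)
    scores = [
        sum(math.tanh(a) * q for a, q in zip(matvec(w_hidden, h), query))
        for h in hidden_states
    ]
    alpha = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(a * h[k] for a, h in zip(alpha, hidden_states)) for k in range(dim)
    ]
    return alpha, pooled


# Two toy hidden states; the distinction vector points along the first axis.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 1.0]]
alpha, pooled = distinguishable_attention(hidden, [1.0, 0.0], identity, identity)
print(alpha[0] > alpha[1])  # True
```

The hidden state aligned with the distinction vector receives the larger weight, so the pooled representation is dominated by the positions that help separate the confusing articles.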
3.4 Prediction and Training
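We concatenate the basic representation $v^b_f$ and the distinguishable representation $v^d_f$ as the final representation of fact description $f$, i.e., $\tilde{v}_f = [v^b_f, v^d_f]$. Based on $\tilde{v}_f$, we generate a corresponding feature vector $\tilde{v}^j_f$ for each subtask $t_j$, $j = 1, 2, 3$, mentioned in Sec. 2, i.e., $t_1$: law article prediction; $t_2$: charge prediction; $t_3$: term of penalty prediction. To obtain the prediction for each subtask, we use a linear classifier:
$$\hat{y}_j = \mathrm{softmax}(W^j_p \tilde{v}^j_f + b^j_p),$$
where $W^j_p$ and $b^j_p$ are parameters specific to task $t_j$. For training, we compute a cross-entropy loss function for each subtask and take the loss sum of all subtasks as the overall prediction loss:
$$\mathcal{L}_p = - \sum_{j=1}^{3} \sum_{k=1}^{|Y_j|} y_{j,k} \log(\hat{y}_{j,k}),$$
where $|Y_j|$ denotes the number of different classes (or, labels) for task $t_j$ and $[y_{j,1}, y_{j,2}, \ldots, y_{j,|Y_j|}]$ refers to the ground-truth vector of task $t_j$. Besides, we also consider the loss of law article community prediction (cf. Sec. 3.3):
$$\mathcal{L}_c = - \sum_{j=1}^{M} X_j \log(\hat{X}_j),$$
where $[X_1, X_2, \ldots, X_M]$ is the ground-truth vector of the community including the correct law article applied to the law case. In summary, our final overall loss function is:
$$\mathcal{L} = \mathcal{L}_p + \mathcal{L}_c.$$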
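As a worked illustration of the training objective, the sketch below sums one cross-entropy term per subtask and adds the community prediction term. It assumes one-hot ground truth (so each term reduces to the negative log-probability of the gold class) and an unweighted sum of the two parts; the function names and the toy probabilities are hypothetical.

```python
import math


def cross_entropy(probs, gold_index):
    """Cross-entropy against a one-hot target: -log p(gold class)."""
    return -math.log(probs[gold_index])


def overall_loss(task_probs, task_gold, community_probs, community_gold):
    """Prediction loss over all subtasks plus the community prediction loss."""
    prediction_loss = sum(
        cross_entropy(probs, gold) for probs, gold in zip(task_probs, task_gold)
    )
    community_loss = cross_entropy(community_probs, community_gold)
    return prediction_loss + community_loss


# Toy predicted distributions for article, charge, and term of penalty.
task_probs = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
task_gold = [0, 1, 0]
loss = overall_loss(task_probs, task_gold, [0.5, 0.5], 1)
print(round(loss, 4))  # 1.674
```

Here the loss is $2 \ln 2 + \ln(4/3) \approx 1.674$: the two uncertain predictions at probability 0.5 contribute $\ln 2$ each, the 0.75 prediction contributes $\ln(4/3)$, and the fully confident correct prediction contributes nothing.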
4 Experiments
4.1 Datasets
To evaluate the performance of our method, we use the publicly available datasets of the Chinese AI and Law challenge (CAIL2018, http://cail.cipsc.org.cn/index.html) Xiao et al. (2018): CAIL-small (the exercise stage dataset) and CAIL-big (the first stage dataset). The case samples in both datasets contain fact descriptions, applicable law articles, charges, and terms of penalty. For data processing, we first filter out samples with fewer than 10 meaningful words. To be consistent with state-of-the-art methods, we filter out the case samples with multiple applicable law articles or multiple charges. Meanwhile, referring to Zhong et al. (2018), we only keep the law articles and charges that apply to at least 100 corresponding case samples and divide the terms of penalty into non-overlapping intervals. The detailed statistics of the datasets are shown in Table 1.
|Dataset|CAIL-small|CAIL-big|
|#Training Set Cases|101,619|1,587,979|
|#Test Set Cases|26,749|185,120|
|#Term of Penalty|11|11|
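The filtering rules described for data processing can be sketched as below. This is only an illustration under stated assumptions: the record layout (`words`, `articles`, `charges` keys) is hypothetical, the frequency filter is applied in a single pass, and the discretisation of penalty terms into intervals is omitted.

```python
from collections import Counter


def filter_cases(cases, min_words=10, min_label_freq=100):
    """Drop short cases and multi-label cases, then keep only cases whose
    law article and charge both occur at least min_label_freq times."""
    kept = [
        case for case in cases
        if len(case["words"]) >= min_words
        and len(case["articles"]) == 1
        and len(case["charges"]) == 1
    ]
    article_freq = Counter(case["articles"][0] for case in kept)
    charge_freq = Counter(case["charges"][0] for case in kept)
    return [
        case for case in kept
        if article_freq[case["articles"][0]] >= min_label_freq
        and charge_freq[case["charges"][0]] >= min_label_freq
    ]


# Toy records; a small frequency threshold is used so the example is readable.
toy_cases = [
    {"words": ["w"] * 12, "articles": [385], "charges": ["bribery"]},
    {"words": ["w"] * 12, "articles": [385], "charges": ["bribery"]},
    {"words": ["w"] * 5, "articles": [385], "charges": ["bribery"]},        # too short
    {"words": ["w"] * 12, "articles": [385, 163], "charges": ["bribery"]},  # two articles
    {"words": ["w"] * 12, "articles": [264], "charges": ["theft"]},         # rare label
]
print(len(filter_cases(toy_cases, min_label_freq=2)))  # 2
```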
4.2 Baselines and Settings
We compare LADAN with some baselines, including:
CNN Kim (2014): a CNN-based model with multiple filter window widths for text classification.
HARNN Yang et al. (2016): an RNN-based neural network with a hierarchical attention mechanism for document classification.
FLA Luo et al. (2017): a charge prediction method that uses an attention mechanism to capture the interaction between fact description and applicable laws.
TOPJUDGE Zhong et al. (2018): a topological multi-task learning framework for LJP, which formalizes the explicit dependencies over subtasks in a directed acyclic graph.
MPBFN-WCA Yang et al. (2019): a multi-task learning framework for LJP with multi-perspective forward prediction and backward verification, which is the state-of-the-art method.
We train the baselines CNN, HARNN and FLA using a multi-task framework (recorded as MTL) and select the best set of experimental parameters within the parameter ranges given in their original papers. Besides, we use our method LADAN with the same multi-task frameworks (i.e., LADAN+MTL, LADAN+TOPJUDGE, and LADAN+MPBFN) to demonstrate our superiority in feature extraction.
We use THULAC Sun et al. (2016) for word segmentation and the Skip-Gram model Mikolov et al. (2013) to pre-train word embeddings on these case documents, where the model’s embedding size and frequency threshold are set to 200 and 25 respectively. Meanwhile, we set the maximum document length to 512 words for the CNN-based baselines, and set the maximum sentence length to 100 words and the maximum document length to 15 sentences for the LSTM-based models. As for hyperparameter settings, we set the dimension of all latent states (i.e., , , and ) as 256 and the threshold $\tau$ as . In our method LADAN, we use two graph distillation layers, and adopt a Bi-GRU with a randomly initialized attention vector as the basic document encoder. For training, we set the learning rate of the Adam optimizer to , and the batch size to 128. After training every model for 16 epochs, we choose the best model on the validation set for testing. Our source code is available at https://www.dropbox.com/sh/3nmeu81xrgtkfkx/AAB23hvE_cJkyXSceu6HgMOOa?dl=0
4.3 Experimental Results
To compare the performance of the baselines and our methods, we use metrics including accuracy (Acc.), macro-precision (MP), macro-recall (MR), and macro-F1 (F1), which are widely used for multi-class classification tasks. Tables 2 and 3 show the experimental results on datasets CAIL-small and CAIL-big respectively. Our method LADAN performs the best in terms of all evaluation metrics. Because both CAIL-small and CAIL-big are imbalanced datasets, we focus on comparing the F1-score, which more objectively reflects the effectiveness of our LADAN and other baselines. Compared with the state-of-the-art MPBFN-WCA, LADAN improved the F1-scores of law article prediction, charge prediction, and term of penalty prediction on dataset CAIL-small by %, % and % respectively, and by about %, % and % on dataset CAIL-big. Meanwhile, the comparison under the same multi-task framework (i.e., MTL, TOPJUDGE, and MPBFN) shows that our LADAN extracted more effective features from fact descriptions than all baselines. The highest MP- and MR-scores of LADAN also demonstrate its ability to distinguish confusing law articles. Note that all methods perform better on dataset CAIL-big than on CAIL-small, because the training set of CAIL-big is larger.
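For readers less familiar with macro-averaged metrics, the sketch below shows one common way to compute them. It assumes that macro-F1 is the unweighted mean of per-class F1 scores (another convention takes the harmonic mean of MP and MR), and that a class with no predictions or no gold instances scores 0; the function name is hypothetical.

```python
def macro_scores(gold, pred):
    """Return (accuracy, macro-precision, macro-recall, macro-F1)."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# An imbalanced toy example: three cases of class 0, one of class 1.
acc, mp, mr, f1 = macro_scores([0, 0, 0, 1], [0, 0, 1, 1])
print(acc, mp)  # 0.75 0.75
```

Because every class contributes equally to the macro averages regardless of its size, rare law articles and charges weigh as much as frequent ones, which is why these metrics are informative on imbalanced datasets.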
Table 2: Results on CAIL-small (tasks: Law Articles, Charges, Term of Penalty).
Table 3: Results on CAIL-big (tasks: Law Articles, Charges, Term of Penalty).
4.4 Ablation Experiments
To further illustrate the significance of considering the difference between law articles, we conducted ablation experiments on model LADAN+MTL with dataset CAIL-small. To prove the effect of our graph construction layer (GCL), we build a LADAN model with the GCL’s removing threshold $\tau = 0$ (i.e., “-no GCL” in Table 4), which directly applies the GDO on the fully-connected graph to generate a global distinction vector for re-encoding the fact description. To verify the effectiveness of our graph distillation operator (GDO), we build a no-GDO LADAN model (i.e., “-no GDO” in Table 4), which directly pools each subgraph to a distinction vector without GDOs. To evaluate the importance of considering the difference among law articles, we remove both GCL and GDO from LADAN by setting $\tau = 1.0$ (i.e., “-no both” in Table 4), so that each law article independently extracts the attentive feature from the fact description. In Table 4, we see that both GCL and GDO effectively improve the performance of LADAN. GCL is more critical than GDO, because GDO has limited effect when the law article communities obtained by GCL are not accurate. When removing both GCL and GDO, the accuracy of LADAN decreases to that of HARNN+MTL, which strongly demonstrates the effectiveness of exploiting differences among similar law articles.
4.5 Case Study
To intuitively verify that LADAN effectively extracts distinguishable features, we visualize the attention of LADAN’s encoders. Figure 5 shows two law case examples, one for Article 385 and one for Article 163, where the darker the word is, the higher the attention weight it gets in the corresponding encoder, i.e., its information is more important to the encoder. For the basic encoder, we see that the important information in these two cases is very similar: both contain phrases like “use position”, “accept benefit”, “accept … cash”, etc. Therefore, when using just the representation of the basic encoder to predict applicable law articles, charges and terms of penalty, these two cases tend to be misjudged. As we mentioned in Sec. 3.3, with the distinction vector, our fact re-encoder focuses on extracting distinguishable features like defendants’ identity information (e.g., “company manager”, “working in the Cadastral Unit of Luocheng Branch of Luohe City Land and Resources Bureau” in our examples), which effectively distinguish the applicable law articles and charges of these two cases.
5 Related Work
5.1 Legal Judgment Prediction
Existing approaches for legal judgment prediction (LJP) are mainly divided into three categories. Early works usually focused on analyzing existing legal cases in specific scenarios with mathematical and statistical algorithms Kort (1957); Nagel (1963); Keown (1980); Lauderdale and Clark (2012). However, these methods are limited to small datasets with few labels. Later, a number of machine learning-based methods Lin et al. (2012); Liu et al. (2004); Sulea et al. (2017) were developed to solve the problem of LJP, most of which combine manually designed features with a linear classifier to improve the performance of case classification. The shortcoming is that these methods rely heavily on manual features, which suffer from the generalization problem.
In recent years, researchers have tended to exploit neural networks to solve LJP tasks. Luo et al. (2017) propose a hierarchical attentional network to capture the relation between fact description and relevant law articles to improve charge prediction. Zhong et al. (2018) model the explicit dependencies among subtasks with scalable directed acyclic graph forms and propose a topological multi-task learning framework for effectively solving these subtasks together. Yang et al. (2019) further refine this framework by adding backward dependencies between the prediction results of subtasks. To the best of our knowledge, Hu et al. (2018) are the first to study the problem of discriminating confusing charges for automatically predicting applicable charges. They manually define 10 discriminative attributes and propose to enhance the representation of the case fact description by learning these attributes. This method relies too much on experts and cannot be easily extended to different law systems. To solve this issue, we propose a novel attention framework that automatically extracts differences between similar law articles to enhance the representation of the fact description.
5.2 Graph Neural Network (GNN)
In general, existing GNNs focus on proposing different aggregation schemes to fuse features from the neighborhood of each node in the graph for extracting richer and more comprehensive information: Kipf and Welling (2017) propose graph convolution networks which use mean pooling to pool neighborhood information; GraphSAGE Hamilton et al. (2017) concatenates the node’s features and applies mean/max/LSTM operators to pool neighborhood information for inductively learning node embeddings; MR-GNN Xu et al. (2019) aggregates the multi-resolution features of each node to exploit node information, subgraph information and global information together. In addition, Message Passing Neural Networks Gilmer et al. (2017) further consider edge information when doing the aggregation. However, the aggregation schemes lead to the over-smoothing issue of graph neural networks Li et al. (2018), i.e., the aggregated node representations become indistinguishable, which is completely contrary to our goal of extracting distinguishable information. So in this paper, we propose our distillation operation, based on a distillation strategy instead of aggregation schemes, to extract the distinguishable features between similar law articles.
6 Conclusion
In this paper, we present an end-to-end model, LADAN, to solve the issue of confusing charges in LJP. In LADAN, a novel attention mechanism is proposed to attentively extract the key features for distinguishing confusing law articles. Our attention mechanism considers not only the interaction between fact description and law articles but also the differences among similar law articles, which are effectively extracted by the graph distillation layer (GDL) proposed in this paper. The experimental results on real-world datasets show that our LADAN raises the F1-score of state-of-the-art by up to %. In the future, we plan to study complicated situations such as a law case with multiple defendants and charges.
Acknowledgments
The research presented in this paper is supported in part by National Key R&D Program of China (2018YFC0830500), Shenzhen Basic Research Grant (JCYJ20170816100819428), National Natural Science Foundation of China (61922067, U1736205, 61902305), MoE-CMCC “Artificial Intelligence” Project (MCM20190701), National Science Basic Research Plan in Shaanxi Province of China (2019JM-159), and National Science Basic Research Plan in Zhejiang Province of China (LGG18F020016).
References
- Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In ICML.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS.
- Hu et al. (2018) Zikun Hu, Xiang Li, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Few-shot charge prediction with discriminative legal attributes. In COLING.
- Keown (1980) R Keown. 1980. Mathematical models for legal prediction. Computer/lj, 2:829.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.
- Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
- Kort (1957) Fred Kort. 1957. Predicting supreme court decisions mathematically: A quantitative analysis of the “right to counsel” cases. American Political Science Review, 51(1):1–12.
- Lauderdale and Clark (2012) Benjamin E Lauderdale and Tom S Clark. 2012. The supreme court’s many median justices. American Political Science Review, 106(4):847–866.
- Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
- Lin et al. (2012) Wan-Chen Lin, Tsung-Ting Kuo, Tung-Jia Chang, Chueh-An Yen, Chao-Ju Chen, and Shou-de Lin. 2012. Exploiting machine learning models for Chinese legal documents labeling, case classification, and sentencing prediction. In Proceedings of ROCLING.
- Liu et al. (2004) Chao-Lin Liu, Cheng-Tsung Chang, and Jim-How Ho. 2004. Case instance generation and refinement for case-based criminal summary judgments in Chinese.
- Luo et al. (2017) Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to predict charges for criminal cases with legal basis. arXiv preprint arXiv:1707.09168.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NeurIPS.
- Nagel (1963) Stuart S Nagel. 1963. Applying correlation analysis to case prediction. Tex. L. Rev., 42:1006.
- Sulea et al. (2017) Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P Dinu, and Josef van Genabith. 2017. Exploring the use of text classification in the legal domain. arXiv preprint arXiv:1710.09306.
- Sun et al. (2016) Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, and Zhiyuan Liu. 2016. THULAC: An efficient lexical analyzer for Chinese.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
- Xiao et al. (2018) Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, et al. 2018. CAIL2018: A large-scale legal dataset for judgment prediction. arXiv preprint arXiv:1807.02478.
- Xu et al. (2019) Nuo Xu, Pinghui Wang, Long Chen, Jing Tao, and Junzhou Zhao. 2019. MR-GNN: Multi-resolution and dual graph neural network for predicting structured entity interactions. In IJCAI.
- Yang et al. (2019) Wenmian Yang, Weijia Jia, Xiaojie Zhou, and Yutao Luo. 2019. Legal judgment prediction via multi-perspective bi-feedback network. arXiv preprint arXiv:1905.03969.
- Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In NAACL.
- Zhong et al. (2018) Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Chaojun Xiao, Zhiyuan Liu, and Maosong Sun. 2018. Legal judgment prediction via topological learning. In EMNLP.