1 Introduction and Motivation
The legal documents are heavily interconnected by references. Both of us are currently engaged in the project [ANONYMISED FOR REVIEWS]. While building the training dataset was relatively easy in terms of obtaining the necessary amount of documents, obtaining large (preferably complete) datasets of decisions of Czech apex courts (Constitutional Court, Supreme Administrative Court, Supreme Court) proved to be time-consuming and overall tedious. Every Czech apex court currently runs a website containing a database of its decisions. Additionally, court decisions are accessible via commercial legal information systems. In spite of that, getting access to the whole dataset of respective court requires freedom of information requests, negotiation with court officials, and in the past, due to poor information management choices, also required remuneration for preparation of dataset. Moreover, datasets are sometimes given only for research purposes and with request for non-publication of dataset. Understandably, not every research team interested in language analysis of court decisions is able or willing to overcome these obstacles. Our goal is to make the court decisions available online, in a consistent format.
Czech law is part of the continental legal system, which relies primarily on codified legal rules. The role of judicial decisions is limited compared to the United States or the United Kingdom. However, there is a general consent in the legal community that judicial decisions of apex courts are argumentatively binding. This follows the principle of legal certainty, where a person has a right to have her case decided similarly to previous cases, provided they were similar. In part, this mechanism serves as protection against arbitrary decision-making by judges. This leads to judicial decisions being widely used in legal domain by legal practitioners, judges, students and other. Lawyers use decisions for support of their legal positions, researchers and academics use decisions as a source of legal knowledge and information. Court decisions provide us with valuable information about the law in the Czech Republic. In spite of that, access to them is quite limited. Currently, the Czech Republic does not provide public with access to bulk court decisions, and even in proprietary legal information systems, tools for analysis of sets of decisions are scarce. Individual consultation of decisions still remain the main modus operandi for most lawyers. In 2016, the Czech Ombudsman concluded that state fails to provide its citizens with legal information in suitable quality and quantity. 111Expert study of the Czech Ombudsman on accessibility of Czech court decisions available in Czech at http://eso.ochrance.cz/Nalezene/Edit/4496 We aim to address this issue by publishing a large corpus of decisions of the three Czech apex courts.
2 Related Work
Publicly available corpora of legal documents are essential for corpus linguistics, data processing tasks or just for mere availability of data. There are several well established NLP tasks for legal domain, such as reference recognition, text summarisation or argument extraction. In order to make progress in automated processing of legal documents, we need manually annotated corpora and evidence that these corpora are accurate .
Also, access to selected and curated legal texts fuels other areas as well and is useful in adjudication, education, legal research, and legislation . Unfortunately, access is often hindered by pre-designed user interface [12, p. 1351] or restrictive data mining policies on court websites. ”Freeing” the court decisions in this regard may require significant effort.
There is already a significant amount of datasets focused on the legal domain. Non-exhaustive list includes the HOLJ corpus , The British Law Report Corpus (BLRC) , Corpus of Historical English Law Reports (CHELAR) , German Juristiches Referenzkorpus , DS21 Corpus and Swiss Legislation Corpus (SLC) , The United Nations Parallel Corpus , JRC-Acquis , or US Case Law Access Project .
Corpora focusing on the legal domain and the Czech language are understandably scarcer. JRC-Acquis , focused on legal documents in official languages of the European Union, contains documents in Czech. ,  and  all made their training datasets available through LINDAT-CLARIN repository. Available parallel corpora also include CzEng, Czech-English Parallel Corpus, currently in version 1.7, .
Different corpora serve different purposes - parallel corpora are used for translation [14, 11, 6, 1], while some serve as training datasets for specific tasks, such as reference recognition [5, 8] or summarising of documents .
However, some corpora and datasets have no specific purpose beside making the data available. Either in their completeness or in the scope of more or less representative pieces. Such cases include among others datasets of historical legal documents - Swiss DS21 contains document from year 754 and overall spans more than 1000 years . Some datasets cover contains both historical and contemporary documents, such as  spanning from 1535 to 1999 or  spanning from 1638 to 2018. Some corpora focus predominantly on contemporary legal documents, such as German Juristiches Referenzkorpus , The British Law Report Corpus , and Swiss Legislation Corpus .
Any corpus has to have a coherent methodology for processing the data involved. Because the three apex courts do not share a common standard, all the steps were highly pragmatic and shared the same goal - plaintext version of court decisions enriched by basic metadata describing individual decisions.
3.1 Data Collection
All the three Czech apex courts publish decisions online on website of respective court222Constitutional Court at https://nalus.usoud.cz, Supreme Administrative Court at http://nssoud.cz, and Supreme Court at http://www.nsoud.cz/.. Court decisions can be individually consulted through the interface. Unfortunately, searching categories are quite limited. Furthermore, it is not possible to access all the decisions (either for bulk download or consultation) or whole categories of decisions at once. Access is limited to individual decisions in different data formats. That is significantly different from situation in other European countries (for example in Netherlands333https://www.rechtspraak.nl/). or European Union itself, which publishes all the legal sources online and provides also open access option to download data in XML format straight from the website 444https://eur-lex.europa.eu or for bulk download via API. Moreover, these websites generally do not contain all the decisions of respective court. Whole dataset of judicial decisions is preferable for any statistical analysis or text analysis of the court decision data. To obtain whole datasets of all decisions we needed to file a freedom of information requests under the Czech law.
Constitutional Court provided all the decisions as a set of .RTF documents, and Supreme Administrative Court provided all the decisions as a set of .PDF documents. Unfortunately, Supreme Court was not able to provide the dataset for free under the freedom of infomation request. The online database of Supreme Court was used as a source of dataset and all the decisions published there were downloaded. As we know from the Expert Study of the Czech Ombudsman, the online database of Supreme Court does not contain all decisions. Specific types of decisions, such as decision on recognition of foreign judgments are not published online. 555Expert study of the Czech Ombudsman available in Czech at http://eso.ochrance.cz/Nalezene/Edit/4496. Understandably, these limitations apply to our corpus as well. Decisions of the Supreme Court were obtained in plaintext format.
All the decisions we obtained were anonymized and were identified by their docket numbers. Docket number is is a unique number consisting of identification of court or senate of the court handling the case, number assigned based on order by which the cases arrive to the court, and by the year of delivery of the case to the court.
CzCDC currently contains documents published no later than 30th September 2018.
3.2 Data Unification
Because the court decision were obtained from different sources and under different court-dependent standards, it was necessary to process them and achieve the use of single standard. We decided to make the court decisions available in plain text. This decision is based on the availability, as it does not require any special software for processing. We used Apache Tika to obtain plain text of the Constitutional and Supreme Administrative Court decisions. We then extracted docket numbers and dates of decisions of individual cases from obtained plain texts.
3.3 Corpus Description
We intend our corpus for both lawyers and non-lawyers (computer scientists, linguists), and for different purposes. Therefore, we tried to remain as non-specific in our description of dataset and selected metadata categories, as possible.
Every decision included in the CzCDC is described by the following metadata contained in the accompanying CSV file: docket number of court decisions, name of the relevant .txt file in corpus, data of decisions and identification of court. Identification of court divided corpus into three different sub-corpora. We chose a CSV file to organise metadata for its availability, simplicity, and for its low requirements given the large number of decisions included. Every decision is marked as belonging to sub-corpus by an abbreviation of respective court: ’ConCo’ for Constitutional Court, ’SupCo’ for Supreme Court and ’SupAdmCo’ for Supreme Administrative Court. Date of decisions is provided in ISO 8601 data format (YYYY-MM-DD).
The docket numbers and file names of individual decisions remain in their original format. In addition to the explanations of metadata labels, the instructions also contain basic explanations of system of docket number. Docket numbers contain additional information about decisions and it is possible to use them for further sub-categorization of decision, eg. on civil cases and criminal cases, albeit this approach is rather crude.
4 Dataset Statistics
All the decisions in the CzCDC were published from 1st January 1993 to the 30th September 2018. Corpus contains 231,548 decisions of Constitutional, Supreme and Supreme Administrative Court. The corpus is then composed of a 521,247,702 words.
4.1 Sub-corpus of Constitutional Court
Constitutional court rules on individual violations of human rights by courts and other public bodies. It is also concerned with derogation of unconstitutional acts and local bylaws. It is not part of traditional appellate structures of Czech courts, but serves as the guardian of the Constitution.
Sub-corpus of Constitutional Court provides digitized and anonymized decisions from 1 st January 1993 until 30th September 2018. It consists of 73,086 judicial decisions. These decisions contain 98,623,753 words in all of the decisions.
4.2 Sub-corpus of Supreme Court
Supreme Court is the final instance for civil and criminal cases. It deals with appeals against decisions of lower courts. As part of the Czech court system, it also fulfils the role of unifying the decision-making of the lower courts in civil and criminal matters.
Judicial decisions of Supreme Court are dated from 1st January 1993 until 30th September 2018 because of the previously described reasons. The total number of decisions is 111,977 and this sub-corpus contains 224,061,129 words.
4.3 Sub-corpus of Supreme Administrative Court
Supreme Administrative Court is the final instance for administrative matters, such as cases brought forth by asylum seekers, data protection matters, electoral complaints etc. As part of the Czech court system, it also fulfils the role of unifying the decision-making of the lower courts in administrative matters.
Supreme Administrative Court was founded only in 2003. Therefore, court decisions are dated from 1st January 2003 until 30th September 2018. Sub-corpus of Supreme Administrative Court consists of 52,660 decisions. These decisions contain 137,839,985 words.
5 Conclusion and future work
In this paper, we described the situation and inadequate publication of court decisions by the state. We also described our motivation to create this corpus - and we named the availability as the first step.
At this point, we have developed a minimalist version of CzCDC to fulfill one basic issue - providing access to court decisions with basic metadata, such as docket numbers and date of publication. We are providing the data in plain text in order to provide the broadest possible access options. We also chose fundamental metadata that were easy to extract without any reliability issues. This way we believe we achieved our goal to make the corpus of court decision available.
This paper frames what we understand as the beginning of a longer-time commitment. The first step, publication of plain dataset that accompanies this paper, supplies the public with access to otherwise problematically accessible court decisions. We plan to develop this corpus in the future as well. At first, we would like to identify the ’domain relevant’ documents, which is task dependent on specific legal knowledge.
As a future work we suggest to include more metadata providing more information within the attached database. Suitable metadata are references to other decisions or grammar features to establish this database as a source for legal linguistics. Using information extraction method for the raw texts of judicial decisions has a lot of potential in legal domain, but also overall in linguistics.
We would like to finish this paper as a plea to other research teams. CzCDC in its current version makes available 237,723 court decision, divided into three sub-corpora. Contribution from other research team, in terms of further processing either the whole corpus or one or more of the sub-corpora would lead to enrichment of CzCDC. Also, it might eventually lead to CzCDC containing plethora of other metadata or sub-corpora for other tasks useful for community. Ultimately, it may lift the fog. of inaccessibility and unavailability that currently covers the legal domain
Regarding previous suggestions as a possible future work directions we find this corpus to be a first step towards easier access to legal sources and a better understanding of the functioning of the judicial system and of court decisions.
[NOTE: Should the submission be accepted, corpus and its documentation will be made available for download via repository. Hyperlink for its download will be added to Conclusion.]
[ANONYMISED FOR REVIEWS]
-  Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, and Dušan Variš. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockeres. Proceedings of TSD 2016, pp. 231–238.
-  Caselaw Access Project. 2018, https://case.law/api/.
-  Claire Grover, Ben Hachey, Ian Hughson. The HOLJ Corpus: supporting summarisation of legal texts. Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora, 2004, pp. 47–53.
-  Hanjo Hamann, Friedemann Vogel and Isabelle Gauer. Computer Assisted Legal Linguistics (CAL2). Proceedings of JURIX 2016, pp. 195–198.
-  Jakub Harašta, Jaromír Šavelka, František Kasl, Adéla Kotková, Pavel Loutocký, Jakub Míšek, Daniela Procházková, Helena Pullmannová, Petr Semenišín, Tamara Šejnová, Nikola Šimková, Michal Vosinek, Lucie Zadavilová, and Jan Zibner. Annotated Corpus of Czech Case Law for Reference Recognition Tasks. Proceedings of TSD 2018, pp. 239–250.
-  Duc Tam Hoang, and Ondřej Bojar. TmTriangulate: A Tool for Phrase Table Triangulation. Prague Bulletion of Mathematical Linguistics, 2015, no. 104, pp. 75–86.
-  Stefan Höfler and Michael Piotrowski. Building Corpora for the Philological Study of Swiss Legal Texts. Journal for Language Technology an Computational Linguistics, 2011, vol. 26, n. 2, pp. 77–89.
-  Vincent Kríž, Barbora Hladká, Jan Dědek and Martin Nečaský. Statistical Recognition of References in Czech Court Decisions. Proceedings of MICAI 2014, Part I, pp. 51–61.
-  José Marín Peréz and Camino Rea Rizzo. Structure and Design of the British Law Report Corpus (BLRC): A Legal Corpus of Judicial Decisions from the UK. Journal of English Studies, 2012, vol. 10, pp. 131–145.
-  Paula Rodríguez-Puente. Introducing the Corpus of Historical English Law Reports: Structure and compilation techniques. Revistas de Lenguas para Fines Específicos, 2011, vol. 17, pp. 99–120.
-  Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski, and Signe Gilbro. An Overview of the European Union’s Highly Multilingual Parallel Corpora. Language Resources and Evaluation, 2014, vol. 48, no. 4, pp. 679–707.
-  Friedemann Vogel, Hanjo Hammann, and Isabelle Gauer. Computer-Assisted Legal Linguistics: Corpus Analysis as a New Tool for Legal Studies. Law & Social Inquiry, 2017, early view.
-  Vern R. Walker. The Need for Annotated Corpora from Legal Documents, and for (Human) Protocols for Creating them: the Attribution Problem. In: Natural Language Argumentation: Mining, Processing, and Reasoning over Textual Arguments (Dagstuhl Seminar 16161), ed. Elena Cabrio, Hirst Graeme, Serena Villata and Adam Wyner. 2016.
-  Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. The United Nations Parallel Corpus. Proceedings of LREC 2016, pp. 3530–3534.
-  Marc van Opijnen, and Cristiana Santos. On the concept of relevance in legal information retrieval. Artificial Intelligence and Law, 2017, vol. 25, no. 1, pp 65–87.