Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout

State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or the document-level context for short documents. However, these solutions still struggle with real-world longer documents, where information is encoded in the spatial structure of the document, in elements like tables, forms, headers, openings or footers, or in the complex layout of a page or of multiple pages. To encourage progress on deeper and more complex information extraction, we present a new task (named Kleister) with two new datasets. Based on textual and structural layout features, an NLP system must find the most important information about various types of entities in formal long documents. These entities are not only the classes known from standard named entity recognition (NER) systems (e.g. location, date, or amount) but also the roles of the entities in the whole document (e.g. company town address, report date, income amount).


1 Introduction

Information Extraction (IE) requires quick but careful skimming through a whole document. We often have not only to search for pieces of information, but also to correlate them with each other. In practice, this means that the results should be presented in an appropriate form (e.g. as normalized output) and that it should be explained why certain pieces of information have been correlated, for instance by pointing to the relevant fragments of the input text. This process is tedious and difficult for humans. Thus, we need automated systems that can cope with multiple documents and extract the required information in a simple and effective way.

However, the disparity between what can be done with the current state of the art in IE and what is required by real-world business use cases is still large. From the point of view of business users, systems that automatically gather information about individuals, their roles, significant dates, addresses and amounts from invoices, company reports and contracts would be useful [Holt and Chisholm2018], [Katti et al.2018], [Wróblewska et al.2018], [Sunder et al.2019]. Furthermore, such systems should be reliable and should be able to assess their own certainty about the extracted entities.

As far as the current state of the art is concerned, there are many machine learning models which, however, must be trained for specific named entities to be effective [Peters et al.2018], [Akbik et al.2018], [Devlin et al.2018]. To further increase training efficiency, we can restrict ourselves to documents with a previously defined layout, so that the models can learn how to extract a particular piece of information [Zhao et al.2019], [Denk and Reisswig2019], [Liu et al.2019a], [Sarkhel and Nandi2019]. On the other hand, more general extractors are still needed to deal with a wider variety of information.

In this paper, we describe two novel datasets for Information Extraction from long real-life documents. We begin by explaining the need for datasets containing authentic scenarios and by reviewing similar tasks and datasets (Section 2). Then, we describe our data-gathering methods and some statistics and characteristics of the datasets (Section 3). Subsequently, we describe baseline methods and their results, obtained with the Pipeline approach presented in Section 4. Finally, we discuss the challenges involved in extracting the proper entities (Section 5).

Figure 1: Three different examples of layouts from the Kleister-charity dataset.

2 Review of Information Extraction Datasets

Our main idea when preparing the new datasets was to develop a strategy for the main challenges we face in business conditions: complex layout, specific business logic (the way the content is formulated), OCR quality, document-level extraction, and normalization.

The dataset most similar to ours in the NLP field is the WikiReading dataset and its related challenges [Hewlett et al.2016]. It is a large-scale natural language understanding task with 18 million entities and 4.7 million documents. The goal of the task is to predict textual values from the structured knowledge base (Wikidata) by reading the text of the corresponding Wikipedia articles. Some entities can be extracted directly from the given text, but some have to be inferred. Thus, similarly to our setting, the task contains a rich variety of challenging extraction sub-tasks and is also well-suited for end-to-end models. Both sets also involve output normalization challenges, e.g. for dates and names.

However, our datasets are even more difficult to process, because they have complex document layouts and noisy OCR-ed input (produced by an Optical Character Recognition system). These are the main issues that distinguish them from WikiReading and that make our task about more than just language understanding.

A list of challenges similar to some degree to our goal is also available at the International Conference on Document Analysis and Recognition (ICDAR) 2019 (http://icdar2019.org/competitions-2/). However, the authors focus mainly on understanding tables, various classes and a range of document layouts, not on extracting particular information from the data. There is also a dataset called Form Understanding in Noisy Scanned Documents (FUNSD) [Guillaume Jaume2019]. FUNSD aims at extracting and structuring the textual content of forms. Unfortunately, the dataset comprises only 200 annotated real scanned forms and the annotations are very general, i.e. question, answer, header.

Another interesting dataset from ICDAR 2019 is a set of scanned receipts (SROIE). The authors prepared 1000 whole scanned receipt images with annotations of the company name, date, address and total payment amount (https://rrc.cvc.uab.es/?ch=13). However, receipts are short documents and have quite a uniform layout and information structure (they start with the company name, date, invoice number, etc.).

Finally, there are datasets for the information extraction task based on invoices, which are not publicly available to the community [Holt and Chisholm2018], [Katti et al.2018]. Documents of this kind contain common entities like ‘Invoice date’, ‘Invoice number’, ‘Net amount’ and ‘Vendor Name’, which are extracted using a combination of Natural Language Processing and Computer Vision techniques, because spatial information is important to properly understand such documents. However, since invoices are usually short, the same information is rarely repeated, so there is less need to understand a broad context.

3 Kleister: New Datasets

The main goal of the two gathered datasets is to emphasize business value and to focus on problems related to layout analysis and Information Extraction, as well as Natural Language Understanding (several entities have to be inferred from the whole document context). Thus, they can also be approached as an end-to-end task and used in real-life robotic process automation of information extraction from long formal documents.

We collected datasets of long formal documents that are US non-disclosure agreements (Kleister-NDA) and annual financial reports of charitable foundations in the UK (Kleister-Charity).

As mentioned above, our Kleister datasets have a multi-modal input (i.e. text versions were obtained from OCR-ed noisy documents, some of which were illustrated and some were scans) and a list of entities to be found. The reference values are not marked in the input documents: this is not a NER task, in which we would be interested in determining where a given piece of information or entity occurs in the text; we are interested in the information itself. Moreover, in our datasets some documents may be missing some entities, and some entities may have more than one gold value.

The input of the datasets comprises images of document pages and text versions of the documents produced by an open-source Optical Character Recognition system. The documents were OCR-ed with Tesseract [et al. 2019] 4.1.1-rc1-7-gb36c, run with the --oem 2 -l eng --dpi 300 flags (meaning that both the new and the legacy OCR engines were used simultaneously, and the language and pixel density were forced for better results).
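The OCR step can be reproduced with the standard Tesseract command-line interface. The sketch below is a minimal Python illustration using the flags quoted above; the pdftoppm rendering step, the file paths and the helper name ocr_pdf are assumptions for illustration, not the exact pipeline used to build the datasets.

```python
import subprocess
from pathlib import Path

def ocr_pdf(pdf_path: str, out_dir: str = "ocr_out") -> str:
    """Render a PDF to page images and OCR them with Tesseract.

    Flags follow the paper: --oem 2 (legacy + LSTM engines together),
    -l eng (English), --dpi 300 (forced pixel density).
    The rendering step via pdftoppm (poppler-utils) is an assumption.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)

    # Render PDF pages to 300 dpi PNG images.
    subprocess.run(
        ["pdftoppm", "-r", "300", "-png", pdf_path, str(out / "page")],
        check=True,
    )

    texts = []
    for page in sorted(out.glob("page-*.png")):
        # 'stdout' as the output base makes Tesseract print text to stdout.
        result = subprocess.run(
            ["tesseract", str(page), "stdout", "--oem", "2",
             "-l", "eng", "--dpi", "300"],
            capture_output=True, text=True, check=True,
        )
        texts.append(result.stdout)
    return "\n\f\n".join(texts)  # form feed between pages

if __name__ == "__main__":
    # Hypothetical input file name.
    print(ocr_pdf("example_charity_report.pdf")[:500])
```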

These two datasets have been gathered in different ways because of their repository structures. Also, the reasons why they were published on the Internet were different. The most important difference between them is that the NDA dataset was digital-born but the Charity dataset needed to be OCR-ed.

Detailed information about the aforementioned open datasets (which are the most popular ones in the domain) and our Kleister datasets is presented in Table 1.

Dataset name | CoNLL 2003 | WikiReading | FUNSD (ICDAR 2019) | SROIE (ICDAR 2019) | Kleister-NDA | Kleister-Charity
Source | Reuters news | WikiData/Wikipedia | scanned forms | scanned receipts | EDGAR | UK Charity Commission
Annotation | manual | automatic | manual | manual | manual | semi-automatic
Documents | 1 393 | 4.7M | 199 | 973 | 540 | 2 778
Entities | 35 089 | 18M | 9 743 | 3 892 | 2 160 | 21 612
train docs | 946 | 16.03M | 149 | 626 | 254 | 1 729
dev docs | 216 | 1.89M | — | — | 83 | 440
test docs | 231 | 0.95M | 50 | 347 | 203 | 609
Entity classes | 4 | 867 (top 20 cover 75%) | 3 | 4 | 4 | 8
Mean pages/doc | — | 1/Wikipedia article | 1 | 1/receipt | 5.98 | 22.19
Mean words/doc | 216.4 | 489.2 | 158.2 | 45 | 2540 | 5149
Mean entities/doc | 25.2 | 5.31 | 49.0 | 4 | 4.0 | 7.8
Complex layout | N | N | Y | Y | Y/N | Y/N
Table 1: Summary of the existing datasets and the Kleister sets.

3.1 NDA Dataset

The NDA dataset contains Non-Disclosure Agreements, also known as Confidentiality Agreements. These are legally binding contracts between two or more parties, through which the parties agree not to disclose information covered by the agreement. The NDAs were collected from the Electronic Data Gathering, Analysis and Retrieval system (EDGAR) via the Google search engine. Then, a list of entities was established (see Table 2) and the documents were manually annotated by a team of linguists, which ensured good annotation quality.

Figure 2 shows an example of a problematic entity in a Non-Disclosure Agreement: the effective date. It is the date on which the contract enters into force. In general, it coincides with the date of the contract or the date on which it was signed. It happens, however, that these dates differ; in such cases the date of entry into force is specially marked, e.g. as ‘Effective date’. However, none of the dates in the figure is specified in this way. Most NDAs contain a special clause that indicates the date of entry into force of the contract, usually immediately before the signatures of the parties. In this case, the correct answer is November 1, 2008, because the agreement contains the clause: ‘IN WITNESS WHEREOF, the parties hereto have executed this Agreement on the date first written above.’

Entities | Description | Total | % of all entities
NDA dataset
party | Parties appearing in the agreement (each of them is treated as a separate entity) | 1035 | 47.9
jurisdiction | State or country whose law governs the agreement | 531 | 24.6
effective_date | Date on which the contract becomes legally binding | 400 | 18.5
term | Duration of the agreement | 194 | 9.0
Charity dataset
address__post_town | Post town (part of a charity address) | 2692 | 12.5
address__postcode | Postcode (part of a charity address) | 2717 | 12.6
address__street_line | Street with the house number (part of a charity address) | 2414 | 11.1
charity_name | Name of the charitable organization | 2778 | 12.9
charity_number | Identification number in the charity register | 2763 | 12.8
report_date | Date of reporting | 2776 | 12.8
income_annually | Annual income in British pounds (GBP) | 2741 | 12.7
spending_annually | Annual spending in British pounds (GBP) | 2731 | 12.6
Table 2: Summary of the entities in the NDA and Charity datasets.
Figure 2: Examples of problematic entities in documents from the Kleister-NDA and Kleister-Charity datasets.

3.2 Charity Dataset

The Charity dataset consists of the annual financial reports that all charities registered in England and Wales are required to submit to the Charity Commission for England and Wales. The Commission makes them publicly available via its website (https://apps.charitycommission.gov.uk/showcharity/registerofcharities/RegisterHomePage.aspx). Charity reports were collected from the UK Charity Commission website, along with annotations for these documents. The entity list was established on the basis of information that we were able to obtain automatically from the tables on the page describing the content of the reports (see Table 2); a detailed description of the data collection method can be found in the Appendix.

The quality of the automatically obtained entities was checked by a team of annotators on 100 random reports. After analyzing these documents, the following annotations were corrected: the names of the organizations (normalization of Ltd.) and the amounts (we fixed entities by adding the decimal part of the value) in a part of the development set and in the whole test set (the development and test sets are important in the context of measuring actual model performance). Then we repeated the annotation check on 200 random documents from the train and development sets (we assume that the annotation of the test set is excellent: Cohen’s kappa coefficient, calculated on the basis of a double validation of 100 random documents from the test set, had a 95% confidence interval of 0.764–0.905). Our preliminary and final results of the quality control procedure are presented in Table 3. The results for the train set are considerably lower, but at the same time this set is four times larger than the other two and, unlike them, only a small part of it was manually annotated.

Entities | Correct initial annotations [%] (entire dataset) | Correct final annotations [%] (train / dev / test) | Cohen’s κ
address 23 55 93 0.831
address__post_town 83 99 0.823
address__postcode 78 98 1.000
address__street_line 67 93 0.809
charity_name 86 81 92 0.904
charity_number 99 95 100 0.490
charity_date 99 98 100 1.000
income_annually 82 90 91 0.900
spending_annually 78 86 92 0.750
Table 3: Results of the manual verification of the Charity dataset.

Figure 2 also shows problems with two entities in charity reports: the charity address and the charity number. Both can co-occur in many variants for the same organization, even within the same document. In these cases, it was necessary to refer to the business logic, so the correct answers are the ‘Registered address’ and the charity number for England and Wales.

4 Baselines

Figure 3: Our process of preparing the Kleister datasets and training baselines. Initially, we gathered PDF documents and the required entities’ values; an important part of the process (which can be reproduced or improved) is the OCR. Then, based only on textual data, we prepared pipeline solutions. The pipeline process is illustrated in the second frame and consists of the following stages: auto-tagging, standard NER, text normalization, and final choice of entity values.

The Kleister datasets pose challenging information extraction tasks that do not exactly match any existing solutions in the current NLP world. In this paper, our aim is to produce strong baselines based on text treated as a sequence, without using additional spatial information. We propose a Pipeline technique to solve the extraction problems. Our baseline Pipeline method is a chain of processes with a named entity recognition (NER) model as the crucial one: it indicates a given entity in the text, then the entities are normalized to canonical forms, and finally all results are aggregated into one value adequate to the given entity type. The Flair [Akbik et al.2018], BERT-base [Devlin et al.2018] and RoBERTa-base [Liu et al.2019b] models are used for this.

4.1 Pipeline

The core idea of this method is to select specific parts of the text in a document that denote the objects that we are looking for. The whole process is presented in Fig. 3 with the following stages:

  1. Auto-tagging: this stage involves extracting all the fragments that refer to the same or different entities by using sets of regular expressions combined with a gold-standard value for each general entity type (date, organization, amount, etc.), e.g. when we try to detect a report_date entity, we must handle different date formats: ‘November 29, 2019’, ‘11/29/19’ or ‘11-29-2019’. This step is performed only during training (to get data on which a NER model can be trained).

  2. Named Entity Recognition: using the auto-tagged dataset, we train a NER model and then, at the evaluation stage, we use it for the detection of all occurrences of entities in the text being processed.

  3. Normalization: at this stage objects are normalized to the canonical form which we have defined in the Kleister datasets. We use almost the same regular expressions as during auto-tagging, e.g. all detected report_date occurrences are normalized from ‘November 29, 2019’, ‘11/29/19’ and ‘11-29-2019’ into ‘2019-11-29’.

  4. Aggregation: we produce a single output from multiple candidates detected by the NER model. In our case, the technique is simple: we return the value with the highest total score, where scores are summed over candidates grouped by their normalized forms.

Certainly, almost every stage of the above process can be implemented with a wide range of techniques, from regular expressions to more advanced machine learning models and deep neural networks.
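As a concrete illustration of stages 1, 3 and 4, the minimal sketch below auto-tags report_date-like fragments with regular expressions, normalizes them to the canonical YYYY-MM-DD form, and aggregates candidates by summing scores over normalized values. The helper names, the exact patterns and the two-digit-year handling are illustrative assumptions, not the exact rules used to build the datasets.

```python
import re
from collections import defaultdict

# Stage 1 (auto-tagging): regular expressions covering a few date formats,
# e.g. 'November 29, 2019', '11/29/19' or '11-29-2019'.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}
DATE_PATTERNS = [
    re.compile(r"\b(?P<m>[A-Z][a-z]+) (?P<d>\d{1,2}), (?P<y>\d{4})\b"),
    re.compile(r"\b(?P<m>\d{1,2})[/-](?P<d>\d{1,2})[/-](?P<y>\d{2,4})\b"),
]

def find_dates(text):
    """Return (span, match) pairs for every date-like fragment."""
    return [(m.span(), m) for p in DATE_PATTERNS for m in p.finditer(text)]

# Stage 3 (normalization): canonical form YYYY-MM-DD.
def normalize_date(m):
    month = m.group("m")
    month = MONTHS[month.lower()] if month.isalpha() else int(month)
    year = int(m.group("y"))
    year += 2000 if year < 100 else 0  # naive two-digit-year handling
    return f"{year:04d}-{month:02d}-{int(m.group('d')):02d}"

# Stage 4 (aggregation): sum NER scores per normalized form, return the best.
def aggregate(candidates):
    """candidates: iterable of (normalized_value, ner_score) pairs."""
    totals = defaultdict(float)
    for value, score in candidates:
        totals[value] += score
    return max(totals, key=totals.get) if totals else None

if __name__ == "__main__":
    text = "Report signed on November 29, 2019. Filed 11/29/19."
    cands = [(normalize_date(m), 1.0) for _, m in find_dates(text)]
    print(aggregate(cands))  # -> 2019-11-29
```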

4.2 Pipeline based on Flair

The Flair model (based on a stacked character-level Bi-LSTM language model and GloVe word embeddings [Pennington et al.2014]) is used as the encoder, and a Bi-LSTM with a CRF layer on top as the output decoder. The hyperparameter setup was selected based on many experiments on the NDA and Charity development sets (with some values differing between NDA and Charity). Moreover, each document was split into chunks of 100 words with overlapping parts of 10 words. Predictions from the overlapping parts were merged by taking the mean of the probabilities for each word from both overlapping chunks.
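The overlapping-chunk scheme can be sketched as follows. This is a simplified illustration with hypothetical per-word probability vectors, not the exact Flair integration: the document is split into 100-word windows with a 10-word overlap, and probabilities in the overlapping regions are averaged.

```python
def chunk_words(words, size=100, overlap=10):
    """Split a word sequence into overlapping chunks of `size` words."""
    step = size - overlap
    return [(start, words[start:start + size])
            for start in range(0, max(len(words) - overlap, 1), step)]

def merge_probabilities(chunk_probs, doc_len):
    """Average per-word probabilities where chunks overlap.

    chunk_probs: list of (start, probs), where probs[i] is the model's
    probability (e.g. for the predicted tag) of word start + i.
    """
    sums = [0.0] * doc_len
    counts = [0] * doc_len
    for start, probs in chunk_probs:
        for i, p in enumerate(probs):
            sums[start + i] += p
            counts[start + i] += 1
    return [s / c for s, c in zip(sums, counts) if c > 0]

if __name__ == "__main__":
    words = [f"w{i}" for i in range(250)]
    chunks = chunk_words(words)
    # Pretend the model returned probability 1.0 for every word in a chunk.
    probs = [(start, [1.0] * len(chunk)) for start, chunk in chunks]
    merged = merge_probabilities(probs, len(words))
    print(len(chunks), len(merged))  # 3 chunks, 250 merged probabilities
```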

4.3 Pipeline based on BERT/RoBERTa

The BERT/RoBERTa models are fine-tuning approaches based on the Bidirectional Encoder Representations from Transformers language model. The best experimental setup was found after many experiments on the NDA and Charity datasets for both models. Moreover, each document was split into chunks of 510 tokens (plus two special tokens: [CLS] and [SEP]) with overlapping parts of 100 tokens. Predictions from the overlapping parts were merged by taking the mean of the probabilities for each token from both overlapping chunks.
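The corresponding window construction for the Transformer models can be sketched as below. The function and parameter names are illustrative assumptions; it only shows how 510-token windows with a 100-token overlap could be wrapped with [CLS] and [SEP], while the overlap averaging works as in the word-level sketch above.

```python
def chunk_token_ids(token_ids, cls_id, sep_id, size=510, overlap=100):
    """Wrap overlapping 510-token windows with [CLS] ... [SEP].

    Returns (start_offset, input_ids) pairs; start_offset indexes the
    original token sequence so overlapping predictions can be averaged
    afterwards, as in the word-level sketch above.
    """
    step = size - overlap
    windows = []
    for start in range(0, max(len(token_ids) - overlap, 1), step):
        window = token_ids[start:start + size]
        windows.append((start, [cls_id] + window + [sep_id]))
    return windows

if __name__ == "__main__":
    # Hypothetical special-token ids (e.g. RoBERTa-style <s>/</s>).
    ids = list(range(10, 10 + 1200))
    for start, inp in chunk_token_ids(ids, cls_id=0, sep_id=2):
        print(start, len(inp))  # each input has at most 512 ids
```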

4.4 Results

The results for the two Kleister datasets obtained with the Pipeline method based on the Flair, BERT and RoBERTa models are shown in Table 4. The differences in F-score between the Flair and RoBERTa models are not substantial in either challenge. Moreover, the RoBERTa model is much better as far as amounts are concerned (the income_annually and spending_annually entities in the Charity dataset).

Models most often have problems with entity types that involve hard normalization (e.g. term in NDA), reasoning (e.g. effective_date in NDA) or visual features (e.g. income in Charity). We can also observe that entities appearing in sequential contexts achieve a higher F-score. Moreover, after going through the results obtained on the development set, we observed that, to some extent, OCR errors can cause a low F-score (e.g. the OCR engine sometimes misrecognizes ‘1’ as ‘I’ and, as a result, we cannot properly predict the report_date entity).

Kleister-NDA dataset
Entity type | Flair | BERT-base | RoBERTa-base | Human baseline
effective_date | 77.67 ± 0.81 | 75.27 ± 0.74 | 76.93 ± 2.53 | 100.00 %
party | 65.87 ± 0.03 | 62.03 ± 0.95 | 72.57 ± 2.55 | 98.02 %
jurisdiction | 95.00 ± 0.32 | 92.43 ± 1.51 | 92.90 ± 1.66 | 100.00 %
term | 61.63 ± 5.34 | 37.33 ± 1.13 | 43.00 ± 2.08 | 87.50 %
ALL | 75.17 ± 0.12 | 68.30 ± 0.41 | 74.2 ± 1.9 | 97.86 %
Kleister-Charity dataset
Entity type | Flair | BERT-base | RoBERTa-base | Human baseline
address__post_town | 82.90 ± 0.01 | 71.73 ± 1.55 | 76.82 ± 1.6 | 97.92 %
address__postcode | 82.35 ± 0.72 | 76.53 ± 1.25 | 78.53 ± 1.04 | 100.00 %
address__street_line | 61.75 ± 0.06 | 53.40 ± 13.02 | 66.17 ± 2.04 | 95.70 %
charity_name | 70.00 ± 1.00 | 76.43 ± 0.69 | 77.30 ± 0.45 | 99.00 %
charity_number | 94.20 ± 0.01 | 95.93 ± 0.33 | 95.67 ± 0.17 | 98.00 %
income_annually | 49.40 ± 1.96 | 49.73 ± 0.82 | 57.30 ± 2.55 | 96.97 %
report_date | 95.50 ± 0.25 | 94.13 ± 0.05 | 94.70 ± 0.14 | 100.00 %
spending_annually | 47.40 ± 2.89 | 48.47 ± 1.17 | 55.53 ± 1.11 | 91.92 %
ALL | 74.80 ± 0 | 71.37 ± 1.25 | 75.70 ± 0.57 | 97.45 %
Table 4: Results of our baselines for Kleister challenges test-sets.

5 Discussion and Challenges

In Table 1, we gathered the most important information about open datasets and, in particular, outlined the differences between our datasets and the other sets. Additionally, we prepared descriptions of problems related to the Kleister tasks (see Table 5). Thus, the Kleister datasets are more focused on real-life scenarios, where layout, document-level context, OCR quality, business logic and normalization problems need to be resolved in order to obtain good results.

Summing up, the datasets are useful for testing real-life applications that address the challenge of robotic process automation tackled by machine learning techniques. (We will release the datasets for public use as a benchmark; several documents with gold standard annotations from both Kleister datasets are already included in the Supplement.)

Normalization: differences between the way entities are expressed in the expected values and in the documents.
NDA effective_date: October 24, 2012, 10/24/12 or 24th day of October, 2012
term: 2 years, 24 months, two (2) years, two years or second anniversary
Charity charity_name: 1. Ltd vs Limited: King’s Schools Taunton LTD [expected] vs King’s Schools Taunton Limited [document]; 2. The vs non-The: The League of Friends of the Exmouth Hospital [expected] vs League of Friends of Exmouth Hospital [document]
Layout: understand complex layout properly
NDA all entities: four types of layout: 1. Simple layout (one column), 2. Simple layout (two columns), 3. E-mail, 4. Plain text. See the Appendix.
Charity all entities: three types of layouts: 1. Simple document, 2. Report with tables, graphic elements and pictures, 3. Form. See the Appendix.
Document-level context: understand the document as a whole
NDA term: The term informs about the duration of the contract. Information on this is generally found in the “Term” chapter. However, this section may also include other periods of validity of certain provisions of the contract.
Example: “Term. This Agreement will be effective for a period of one (1) year after the Effective Date. The restrictions on use and disclosure of the Discloser’s Confidential Information by the Recipient shall survive any expiration or termination of this Agreement and shall continue in full force and effect for a period of five (5) years thereafter.”
Charity income_annually, spending_annually: Co-occurrence of exact and rounded values in one document. See fig. in Appendix
Business logic: apply certain rules in case of ambiguity
NDA term: Co-occurrence of two terms in one document. In such a case, the one constituting the duration of the renewed contract was considered inappropriate.
Example: ‘Term; Termination. The term of the employment agreement set forth in this shall be for a period commencing at the Effective Date and continuing for three (3) years thereafter (the “Scheduled Term”). Following the Scheduled Term, the Agreement shall automatically renew for successive one-year terms (each a “Renewal Term”).’
Charity address__*: Co-occurrence of different addresses (e.g. Principal address, Registered office, Administrative address, etc.) next to each other in one document, or the lack of a clear identification of the charity’s address. In such a case the Registered address was considered to be the main one.
OCR quality: process scanned documents
NDA N/A — digital born documents.
Charity all entities: Handwriting in the document, pages upside down or poor scan quality.
Table 5: Common problems in Kleister datasets with examples.

As described above, working with the proposed datasets involves challenges related to Information Retrieval and Natural Language Understanding, including challenges related to page layout understanding (i.e. tables, rich graphics, etc.). To address these challenges, we presented the Pipeline approach, which helps to deal with the specific problems.

Most of these stages are described in the process of building baselines and are shown in Fig. 3.

Using the presented challenges, we are also able to study the impact of each stage of the full process on the final results. This is useful in a production environment, where we can establish a baseline and then assess which improvements should be prioritized to improve the final results.

6 Conclusions

Our datasets (named Kleister) have been prepared to challenge the business usability of Information Extraction models and processes. In this article, we described in detail how they were prepared (manually for Kleister-NDA and semi-automatically for Kleister-Charity). Due to their multi-modal nature, we had to face various problems and needed to develop methods to improve the quality of the datasets.

We believe our datasets and tasks will help the community to extend the understanding of documents with substantial length, various reasoning problems and complex layouts. Moreover, the community can use our methodology to extend the datasets or to prepare similar sets.

In addition, we prepared baseline solutions on the basis of the textual data from the datasets. This benchmark shows the weaknesses of models working on pure text (i.e. the input is a sequence of words) without any additional layout features and without document-understanding-specific methods.

References

Appendix

In this supplement we describe our datasets and the annotation processes in more detail, in Appendix A (NDA) and Appendix B (Charity), respectively.

Appendix A NDA Dataset

a.1 Data Detailed Description

The NDA agreements prevent the disclosure of confidential information by one of the parties to a third party. Such agreements, even in oral form, are often found in everyday life (e.g. in the patient-doctor relationship). In business, they usually have a written form, signed by a representative of the legal profession and another person (legal or natural). In our database, we have collected business contracts without differentiating them either by their form (these are both independent contracts and contracts annexed to other contracts), by the way they were concluded (all contracts were concluded in writing, some of them by e-mail), or by the number of parties (the dataset contains unilateral, bilateral and multilateral agreements).

The NDAs can take various forms (contract attachments, emails, etc.), but they all generally have a similar structure. First, the circumstances of the contract are determined, i.e. the parties to the contract are presented and the date from which the contract becomes effective is provided. Then they usually contain the following elements:

  • a definition of confidential information, including exceptions to this definition;

  • description of the disclosure procedure (also during court and administrative proceedings);

  • procedures related to non-compliance with confidentiality obligations;

  • term of the contract (termination date);

  • the period during which the information remains confidential (confidential period);

  • information about the jurisdiction to which the contract is subject;

  • information about the possibility of making legally binding copies of the contract;

  • due to the fact that confidential information can be used to recruit new employees or contractors of one party by another, the NDA often also includes non-compete clauses in force for a certain period of time.

a.2 Data Collection Method

During the collection of the NDAs, we focused on contracts concluded by public companies in the United States. All public companies (i.e. those with shareholders) in the US are supervised by the United States Securities and Exchange Commission (SEC). Companies are required to submit a number of reports and forms, the attachments of which are often contracts concluded by these companies, including NDAs. This is done through the Electronic Data Gathering, Analysis and Retrieval system (EDGAR), which is also a public database of these documents, as they must be made public (https://www.sec.gov/edgar.shtml). As a result, EDGAR is a huge NDA base. Unfortunately, NDAs are usually attachments to other contracts or forms submitted to EDGAR, so it is not possible to simply aggregate them from this database. Thus, the process of gathering the dataset had to be manual, with weak model supervision.

The NDAs were collected via the Google search engine by two computational linguists. Two collections were created: the first contained 170 contracts and the second 330 contracts; 117 duplicates were found, so that ultimately the dataset counted a total of 383 documents. After the first tests on the already annotated dataset, it turned out that machine learning models achieved quite poor results for information on jurisdiction. Analysis of the dataset showed that this was due to the under-representation of documents prepared in accordance with non-US law (e.g. that of China, India or Israel). Since no more such documents were obtained, the 68 previously obtained ones were removed from the dataset, which reduced it to 315 documents. In the next step, the collection was supplemented with an additional 127 documents consistent with the others in terms of applicable law (i.e. US law).

The original files were HTML documents, but they were transformed into PDF files to keep processing simple and similar to how the other dataset was created. The transformation was performed with the puppeteer library, which in turn uses the “Print to PDF” functionality of the Chrome web browser. Subsequently, the transformed PDFs were processed with the Tesseract OCR engine.

a.3 Annotation Procedure

The whole dataset was annotated in two ways. Its first part, i.e. 315 documents, was annotated by linguists, except that only selected contexts, preselected by an in-house system based on semantic similarity, were taken into account (to make the annotation easier and faster). The second, i.e. 127 documents, was entirely annotated by hand. When preparing the dataset, we wanted to find out if the semantic similarity methods could be used to limit the time it would take to perform annotation procedures (this solution saved about 50% of the time compared to fully manual annotation).

The annotation of the dataset consisted of listing the extracted entities. The entities themselves may appear repeatedly in the document, but this did not matter for the annotation procedure (contrary to NER, we are not interested in the exact location(s) of an entity). The following entities have been normalized according to standards adopted by us: (a) parties — commas have been removed before acronyms referring to organization types, and the format has been unified, e.g. LHA LONDON LTD; (b) effective date — the format has been standardized according to ISO 8601, i.e. YYYY-MM-DD; (c) terms — standardized to the following format: number of units followed by a unit, e.g. 2 years; (d) jurisdiction and counterparts did not require standardization. Then the annotations were checked by the super-annotator on 45 random documents (10% of the whole dataset). All the super-annotated entities were correct and did not need to be changed.

Appendix B Charity Dataset

b.1 Data Detailed Description

There is no rule about how such a charity report should look. Therefore, some take the form of reports richly illustrated with photos and charts, where financial information constitutes a small part of the entire report, while others have only a few pages, where only basic data on revenues and expenses in a given calendar year are given (see Figure 4). However, each of these reports should contain at least the following information (although there may be exceptions to this rule):

  • organization’s address, name and number;

  • the date of submission of the report;

  • total income in the reporting year;

  • total expenditure in the reporting year.

b.2 Data Collection Method

Figure 4: An organization’s page on the Charity Commission’s website (left: an organization whose annual income is between 25k and 500k GBP; right: over 500k GBP). The information on the website is laid out differently in the two cases, and the same holds within the documents. Entities are underlined in red and the names of entities are circled.

The decision to create a dataset from the financial reports of British charities was driven by the following goal: to find a publicly available collection of English-language, multi-page documents on the Internet, accompanied by easy-to-extract information about the data contained in these documents (e.g. as a separate XML file or a table on a website). We decided that the database of financial reports of British charity organizations would be the best of all the options considered. Not only does the Charity Commission website host a database of all the charity organizations registered in England and Wales, but each of these organizations also has a separate subpage on the Commission’s website, where it is easy to find the most important information about them (see Fig. 4):

  • Charity’s name and number;

  • main activities;

  • current address parts (post town, postcode and street line);

  • a list of the current trustees of the organization;

  • basic financial data for the past year, i.e. income and expenditure (these data are more detailed in the case of organizations with revenues of over 500,000 GBP a year);

  • the date of submission of the report.

This information partly overlaps with what the reports actually contain (although it might happen that some entities are not to be found in the reports, e.g. a list of trustees is given on the website, but it does not have to be included in the report). For this reason, we decided to extract only those entities which also appear in the form of a brief description on the website.

The reports can be found on the website as PDF files (this does not apply to organizations with an income below 25,000 GBP a year, as they are only required to submit a condensed financial report). The information available on the website, together with the attached documents, therefore made this database a perfect fit for the objectives outlined above. In this way, 3414 documents were obtained.

During the analysis of the documents, it turned out that several reports were in Welsh. As we are interested in the English language only, all documents in other languages were found and removed from the collection. In addition, documents that contained reports for more than one organization, were handwritten, or had low OCR quality were deleted. As a result, the collection contains 2778 documents.

b.3 Annotation Procedure

There was no need to manually annotate documents, because basic information about the reporting organizations could be obtained directly from the website where these documents were located.

Only a random sample of 100 documents was checked manually (see Table 6). The permissible error limit for a given entity was set at 15%. This limit was exceeded for the charity name (18% of errors and minor differences) and for the charity address (76% of errors and minor differences). However, a detailed analysis showed that only few entities were actually erroneous (5% and 9%, respectively), while the remaining discrepancies were mostly due to differences in the way the data is presented on the page and in the document. These minor differences were corrected manually and automatically, as described below.

Entities | Correct [%] | Minor differences [%] | Error [%]
charity_name | 82 | 13 | 5
charity_address | 24 | 67 | 9
charity_number | 98 | 0 | 2
report_date | 99 | 0 | 1
income_annually | 86 | 3 (including two cases of non-rounding of the amount and one amount given in USD instead of GBP) | 11
spending_annually | 86 | 3 (as above) | 11
Table 6: Comparison of data on the Charity Commission’s website and in charity reports.

For instance, the charity’s name could be noted with the term Limited (shortened to LTD) on the website but without it in the documents, or the other way round. This problem was eliminated by manually annotating all documents in which the name of the charity organization co-occurred with the word Limited or LTD. As a result, 366 documents were analyzed manually in this way.

In the case of the charity’s address, the most problematic were the names of counties, districts, towns and cities, which were sometimes specified on the website but not in the documents, and sometimes the other way round. This problem was solved by splitting the address data into the three separate entities that we considered the most important: postcode, post town name, and street or road name. The postal code was used as the key element of the address, on the basis of which the city name and street name could be determined (postal codes in the UK were aggregated from the streetlist.co.uk website).