Automated Identification of Climate Risk Disclosures in Annual Corporate Reports

08/03/2021 ∙ by David Friederich, et al. ∙ ETH Zurich

It is important for policymakers to understand which financial policies are effective in increasing climate risk disclosure in corporate reporting. We use machine learning to automatically identify disclosures of five different types of climate-related risks. For this purpose, we have created a dataset of over 120 manually-annotated annual reports by European firms. Applying our approach to reporting of 337 firms over the last 20 years, we find that risk disclosure is increasing. Disclosure of transition risks grows more dynamically than physical risks, and there are marked differences across industries. Country-specific dynamics indicate that regulatory environments potentially have an important role to play for increasing disclosure.




1 Introduction

Climate-related financial disclosures aim to increase transparency in order to guide investment and lending decisions in the financial sector. They reflect not only the physical effects of a changing climate, but also the transition processes to decarbonize the economy, which bear financial risks and opportunities. Such transition risks and opportunities include policy and legal change, technology and market shifts, and reputation-related risks (climatewise2019transition; doi:10.1002/wcc.678). Transition risks are found to be more imminent than physical risks from climate change, and there are indications that some of those risks are already financially priced (kolbel2020does). In fact, the G20 Financial Stability Board has established a Task Force on Climate-related Financial Disclosures (TCFD) to issue recommendations for disclosure of climate-related risks and opportunities (CRO) in corporate reporting (tcfd2019). However, recent analyses of the current standard of reporting have found insufficient clarity of CRO reporting (meaning that disclosures are not specific enough to allow a judgement whether CRO are material, i.e. significant, for future financial performance), with some recent improvements (tcfd2019; cdsb2020falling; demaria2019). To establish more stringent reporting standards on CRO, policymakers and financial regulators have begun establishing regulations (steffen2021comparative), for example the 2014 EU Non-Financial Reporting Directive (EU directive 2014/95/EU) and subsequent guidelines (2017/C 215/01 and 2019/C 209/01) (EU2021law), Art. 173 of the French Energy Transitions law from 2015 (france2015law), and a recent Executive Order on Climate-Related Financial Risk in the US (EO2021US).

To monitor the state of climate-related disclosure and establish the effectiveness of related regulation, financial reports need to be analyzed to assess the extent and quality of such disclosures by companies and organizations. Typically, these analyses are conducted manually, which is time-intensive: recent examples covered only the 40 (demaria2019) and 50 largest companies (cdsb2020falling) of different listings. Constraining the analysis to a small set of the largest companies risks introducing bias or missing important insights. First, the financially strongest companies might not sufficiently represent the sectors that are most relevant for low-carbon transitions (e.g., freight shipping companies are typically very small while having a significant carbon footprint (teter2017future)). Second, there are equity concerns if the CRO of smaller companies are not monitored (e.g., they could experience significant transition-related opportunities but investments are not directed to them). This bias might cause the financial impacts of climate change and transitions to be incorrectly priced, and result in an inefficient allocation of capital during the low-carbon transition (doi:10.1002/wcc.678).

2 Problem Statement and Related Work

Recent work has shown the first successful computerized analyses of climate-related financial disclosures. For example, the TCFD has conducted an “AI review,” using a supervised learning approach that is not further detailed. They identified compliance with the TCFD Recommended Disclosures, but assessed neither the quality of the disclosed information nor the type of risk (tcfd2019). This approach was refined by luccioni2020analyzing, who developed a question answering approach to identify passages in climate disclosures that answer the 14 TCFD recommendations, and made their trained model accessible as a tool for sustainability analysts. bingler2021cheap developed “ClimateBERT” to analyze compliance with TCFD recommendations in a variety of corporate reporting globally, and found mostly disclosure of non-material TCFD categories. kolbel2020does were able to identify an increase in disclosure of transition risks in 10-K reports that outpaced that of physical risks, based on their measure of climate disclosure using a fine-tuned BERT model. Finally, sautner2020firm use a rule-based approach for identifying CRO-related language in corporate conference calls. They use machine learning (ML) for expanding their set of keywords, which was also proposed by luccioni2019using for analyzing climate-related disclosure.

All of these studies have analyzed the number of mentions of climate-related disclosures. However, the quality and materiality of the disclosures remain largely unclear. Analyzing the types of reported risks is a step in this direction, allowing potential investors to better judge the materiality of reported risks. We expand on kolbel2020does by introducing more fine-grained risk categories and detecting them in free text such as European annual reports (instead of 10-Ks). While most previous work has classified at the sentence level, we observe that more context is needed to identify risk disclosures, and we therefore classify at the paragraph level.

To carry out our project, we create a novel dataset based on a refined labeling scheme to distinguish different types of climate-related risks (Section 3). We then train different classification algorithms to identify and categorize paragraphs in free-text annual reports that disclose such climate-related risks (Section 4). Finally, we apply the model to analyze climate-related disclosure in annual reports of 337 European firms over the past 20 years (Section 5).

3 Data

We created our own labeled dataset (available upon request) for the task of classifying paragraphs according to their disclosure of climate risks. We built a corpus of annual reports from the 50 largest publicly traded companies (STOXX Europe 50) and more than half of the European firms in the STOXX Europe 600 index for the last 20 years (where available), which we obtained from the companies’ investor relations websites and Refinitiv Eikon (refinitv2021). We then parsed the PDF files using the Apache Tika package and split the text of each page into paragraphs using a rule-based approach (regex).
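The rule-based paragraph split can be illustrated with a minimal sketch. The actual rules are not given in this paper, so the blank-line delimiter, the de-hyphenation step, and the `min_chars` cutoff below are assumptions chosen for illustration, and `split_paragraphs` is a hypothetical helper operating on text that has already been extracted from a PDF page.

```python
import re


def split_paragraphs(page_text: str, min_chars: int = 20) -> list[str]:
    """Split the extracted text of one page into paragraphs (assumed rules)."""
    # Assumption: one or more blank lines separate paragraphs.
    blocks = re.split(r"\n[ \t]*\n+", page_text)
    paragraphs = []
    for block in blocks:
        # Re-join words that were hyphenated across a line break.
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", block)
        # Collapse remaining line breaks and repeated spaces.
        text = re.sub(r"\s+", " ", text).strip()
        # Assumption: very short blocks are page numbers or stray headers.
        if len(text) >= min_chars:
            paragraphs.append(text)
    return paragraphs


page = (
    "Climate change may dis-\nrupt our supply chain.\n\n"
    "12\n\n"
    "New carbon taxes could raise our operating costs."
)
paragraphs = split_paragraphs(page)
for p in paragraphs:
    print(p)
```

In this toy page, the lone page number is dropped and the hyphenated word is restored, leaving two paragraphs.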

The paragraphs were annotated by student assistants familiar with climate policy, who were trained by the authors and followed our code book (see Appendix). The five risk categories include two types of physical risks and three types of transition risks. “Acute” and “chronic” physical risks denote those from increases in extreme weather events and those that develop slowly like changes in precipitation patterns, respectively. Transition risks include those related to the potential introduction or strengthening of climate policies (“policy & legal”), to changing market and technological environments (“tech. & market”), and to the reputation of corporations or products (“reputation”).

For the test and validation datasets, we labeled 120 STOXX Europe 50 reports in their entirety. We used stratified sampling by year and industry to avoid bias in the dataset and to later evaluate model performance across industries and time. To reduce the number of pages to screen, we pre-selected those pages that included at least one match with an extensive list of relevant keywords (see Appendix), together with their neighboring pages. On average, a report consisted of 34 relevant pages with 16 paragraphs each. All paragraphs on the selected pages were then annotated with the five categories, allowing multiple labels per paragraph. Paragraphs without risk disclosure on relevant pages were considered “negative examples”, and perceived edge cases were labeled as “hard negatives.” We randomly split the dataset into test and validation data, ensuring that each contains a separate set of companies to avoid spill-overs.
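The page pre-selection step can be sketched as follows. The keyword list here is a short hypothetical stand-in for the list in the Appendix, and the one-page `window` for "neighboring pages" is an assumption.

```python
import re

# Hypothetical keywords; the list actually used is given in the Appendix.
KEYWORDS = ["climate change", "carbon", "extreme weather"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def select_pages(pages: list[str], window: int = 1) -> list[int]:
    """Return sorted indices of pages with a keyword match plus their neighbors."""
    selected = set()
    for i, text in enumerate(pages):
        if PATTERN.search(text):
            first = max(0, i - window)
            last = min(len(pages) - 1, i + window)
            selected.update(range(first, last + 1))
    return sorted(selected)


pages = [
    "Letter to shareholders",
    "Governance",
    "Carbon pricing risk",
    "Outlook",
    "Financial statements",
    "Notes",
]
print(select_pages(pages))  # [1, 2, 3]
```

Only the third page matches a keyword, so it is kept together with the page before and the page after it.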

The classes are highly imbalanced, which is why we employed a greedier approach for the training dataset, focused on covering the variance among positive examples and including only hard negative examples. We extracted relevant pages from annual reports of STOXX Europe 600 companies using a more tailored keyword list than for the test dataset, and then selectively labeled relevant paragraphs.

The resulting datasets are summarized in Table 1. To assess inter-coder reliability, two coders independently labeled 20 reports, and Krippendorff’s alpha was computed over the 5 classes for both the union and the intersection of the assigned labels.

                          Train     Val    Test
Physical risks
   Acute                    133      15      28
   Chronic                   54       5      19
Transition risks
   Policy & Legal            43      40      60
   Tech. & Market            37      17      21
   Reputation                23      14      14
Unique pos. paragraphs      205      72      97
Neg. paragraphs             295  39,007  40,878
   of these hard neg.       295      73      55
Table 1: Number of labeled paragraphs in dataset (some paragraphs have several labels)
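For readers unfamiliar with the inter-coder reliability measure mentioned above, the following is a minimal sketch of Krippendorff's alpha for nominal labels with exactly two coders and no missing values. How the multi-label annotations were aggregated (union versus intersection) is not specified here, so the example simply treats one class as a binary label per paragraph; the label vectors are invented.

```python
from collections import Counter


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must label the same units")
    n = 2 * len(coder_a)  # total number of assigned values
    # Each unit yields two ordered value pairs, so a disagreement counts twice.
    observed = 2 * sum(1 for a, b in zip(coder_a, coder_b) if a != b)
    totals = Counter(coder_a) + Counter(coder_b)
    # Number of mismatching pairs expected if values were paired by chance.
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    )
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - (n - 1) * observed / expected


# Invented labels for one class (1 = risk disclosed) on ten paragraphs.
coder_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
coder_b = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
alpha = krippendorff_alpha_nominal(coder_a, coder_b)
print(round(alpha, 3))  # 0.604
```

A value of 1 indicates perfect agreement, while 0 indicates agreement no better than chance.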

4 Methods

4.1 Tasks and Models

We divide the task of classifying climate risks into three tasks of increasing difficulty: binary classification (“risk” or “no risk”), multi-label classification with two classes (physical and transition risks), and multi-label classification with five classes (all risk categories). On all of these tasks we evaluate a baseline model, pretrained DistilBERT (sanh2019distilbert), and RoBERTa (liu2019roberta).
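The relation between the three tasks can be made concrete with a small sketch that derives the two-class and binary targets from the five-class annotation. The mapping follows the category definitions in Section 3; the function names are hypothetical.

```python
# Parent category of each of the five risk types (see Section 3).
PARENT = {
    "acute": "physical",
    "chronic": "physical",
    "policy & legal": "transition",
    "tech. & market": "transition",
    "reputation": "transition",
}


def to_two_class(labels: set[str]) -> set[str]:
    """Collapse five-class labels into physical/transition labels."""
    return {PARENT[label] for label in labels}


def to_binary(labels: set[str]) -> str:
    """A paragraph counts as 'risk' if it carries at least one label."""
    return "risk" if labels else "no risk"


paragraph_labels = {"acute", "policy & legal"}
print(sorted(to_two_class(paragraph_labels)))  # ['physical', 'transition']
print(to_binary(paragraph_labels))             # risk
print(to_binary(set()))                        # no risk
```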

As a baseline model, we selected a support vector machine (SVM) (cortes1995support) as a one-versus-rest classifier and applied standard preprocessing to the input, such as stop-word removal, lemmatization, and TF-IDF weighting. We addressed class imbalances between negatives and positives with class weights and used the precision-recall AUC on the validation set for scoring.
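The exact weighting scheme is not stated. One common choice, shown here as an assumption, is inverse-frequency ("balanced") weighting, where each class receives the weight n_samples / (n_classes * n_class_samples); this is the same formula scikit-learn applies for `class_weight="balanced"`. The counts are the training-set figures from Table 1 for the binary task.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Training set from Table 1: 205 positive and 295 (hard) negative paragraphs.
labels = ["risk"] * 205 + ["no risk"] * 295
weights = balanced_class_weights(labels)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

The rarer class receives the larger weight, so errors on positive paragraphs are penalized more heavily during training.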

To leverage context-specific word embeddings, we fine-tuned different variants of pretrained BERT-related models (devlin2019bert) such as DistilBERT (sanh2019distilbert) and RoBERTa Large (liu2019roberta) on our training dataset using negative log-likelihood loss with a softmax activation function for binary classification and a binary cross-entropy loss for the multi-label classification tasks. Again, we calculated class weights to address the class imbalance. We trained the models for 4 epochs, using early stopping and a limited hyperparameter search on the validation set. We also determined the optimal class probability thresholds by maximizing the F1-score on the validation set. Training is estimated to have emitted less than

in total.
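The threshold selection described above can be sketched for a single class as a simple search over candidate thresholds. The validation labels and predicted probabilities below are invented, and using the observed probabilities as candidate thresholds is an assumption; the search procedure actually used is not specified.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs):
    """Pick the probability threshold that maximizes F1 on validation data."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):  # candidates: the observed probabilities
        preds = [p >= t for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Invented validation labels and model probabilities for one class.
y_val = [1, 0, 0, 1, 0, 1, 0, 0]
p_val = [0.9, 0.4, 0.35, 0.3, 0.2, 0.6, 0.1, 0.05]
threshold, f1 = best_threshold(y_val, p_val)
print(threshold, round(f1, 3))  # 0.6 0.8
```

In the multi-label tasks, the same search would be repeated independently for each class.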

4.2 Experiments and Validation

We evaluate the models on the three tasks defined above as well as in the following settings: (1) discriminatory where no negative paragraphs are present, (2) hard negatives setting with paragraphs that are edge cases, and (3) realistic setting with all negatives from pre-selected pages. We choose the best model for inference based on the F1-score on the validation set for the realistic setting and five risk categories.
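The macro-averaged F1-score used for model selection weights every class equally, regardless of how many paragraphs carry it. A minimal sketch for the multi-label case, with invented gold and predicted label sets, is:

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over multi-label predictions."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


classes = ["physical", "transition"]
# Invented gold labels and predictions for five paragraphs (empty set = negative).
y_true = [{"physical"}, {"transition"}, set(), {"physical", "transition"}, set()]
y_pred = [{"physical"}, set(), {"transition"}, {"transition"}, set()]
score = macro_f1(y_true, y_pred, classes)
print(round(score, 3))  # 0.583
```

In this toy example a single false positive on a negative paragraph lowers the transition F1 to 0.5, which hints at why settings with many negatives are harder.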

RoBERTa achieves the best performance in the 5-class/realistic setting (Table 2). As RoBERTa is the largest among the models compared, this confirms expectations. Notably, DistilBERT performs slightly better in the 2-class/realistic setting, which could indicate that with more training examples it might be sufficient to rely on a smaller model. Comparing across settings, the realistic case appears the most difficult, and the models perform best on the discriminatory case (without negatives). Remarkably, SVM outperforms RoBERTa in the easiest setting. We also added additional negative training examples in the realistic setting, which did not improve performance.

4.3 Test Results

We evaluate the model performance on held-out test data in the realistic setting (Table 3). In general, the model suffers from relatively low recall, and its precision is also limited. This is explained by the fact that the task of identifying disclosures requires domain expertise and is also rather difficult for humans. For comparison, after refining the coding scheme, we reviewed the test and validation data; the precision of the preliminary human coding on the binary task was lower than what the model achieves on the same task. We find a large variance in performance across classes, with reputational transition risks being the hardest to identify (F1-score of 0.140) and acute physical risks the easiest (F1-score of 0.537). We also observe that physical risk classes exhibit a considerably higher precision and lower recall than transition risk classes, which are more balanced.

Experiment            SVM   DistilBERT   RoBERTa
5 Classes
    Realistic        0.204       0.241     0.356
    Hard neg.        0.457       0.431     0.528
    Discriminatory   0.599       0.558     0.596
2 Classes
    Realistic        0.351       0.497     0.446
Binary
    Realistic        0.290       0.444     0.496
Table 2: Validation performance (F1-score macro-avg.)

                     Precision   Recall      F1
Physical risks
   Acute                 0.846    0.393   0.537
   Chronic               0.833    0.263   0.400
Transition risks
   Policy & Legal        0.291    0.383   0.331
   Tech. & Market        0.400    0.476   0.435
   Reputation            0.093    0.286   0.140
Avg. 5 classes           0.493    0.360   0.369
Avg. Binary              0.695    0.423   0.526
Table 3: Test performance for RoBERTa (best model) in the realistic setting for 5 classes and binary.
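As a reading aid for Table 3, the short check below recomputes each F1-score from the reported precision and recall via F1 = 2PR / (P + R) and reproduces the 5-class average as the unweighted mean of the per-class values. It uses only the numbers printed in the table; small deviations are expected because the inputs are rounded to three decimals.

```python
# (precision, recall, reported F1) per class, copied from Table 3.
table3 = {
    "Acute": (0.846, 0.393, 0.537),
    "Chronic": (0.833, 0.263, 0.400),
    "Policy & Legal": (0.291, 0.383, 0.331),
    "Tech. & Market": (0.400, 0.476, 0.435),
    "Reputation": (0.093, 0.286, 0.140),
}


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_from_pr(p, r) for name, (p, r, _) in table3.items()}
for name, (_, _, reported) in table3.items():
    print(f"{name}: reported {reported:.3f}, recomputed {recomputed[name]:.3f}")

# Macro average = unweighted mean of the per-class scores.
macro_precision = sum(p for p, _, _ in table3.values()) / len(table3)
macro_recall = sum(r for _, r, _ in table3.values()) / len(table3)
macro_f1 = sum(f for _, _, f in table3.values()) / len(table3)
print(f"macro: {macro_precision:.3f} {macro_recall:.3f} {macro_f1:.3f}")
```

Note that the averaged F1 (0.369) is the mean of the per-class F1-scores, not the harmonic mean of the averaged precision and recall, which would be higher.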

5 Applying the Model

We determined the number of risk mentions for 4,498 annual reports by all 337 companies in the dataset by performing inference on pages in proximity of a keyword match. Out of 2.7 million analyzed paragraphs, 3,892 paragraphs were predicted to contain at least one risk (a total of 5,501 risk mentions).

Figure 1(a) shows the average number of mentions per report of physical and transition risks over time, which grew slowly until 2015, after which it increased rapidly. This growth is particularly high for transition risks, resulting in about three times as many mentions compared to physical risks in 2019. The analysis of risk subcategories (Figure 1(b)) reveals that the growth was mainly driven by “policy & legal” and “reputation” risks.
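The confidence band in Figure 1(a) is described in its caption as a 95% bootstrapped confidence interval. A minimal sketch of a percentile bootstrap for the average number of mentions per report in a single year is given below; the per-report counts, the number of resamples, and the choice of the percentile method are all assumptions for illustration.

```python
import random


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(values)
    # Resample reports with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


# Invented transition-risk mention counts for ten reports from one year.
mentions = [0, 0, 1, 0, 3, 2, 0, 5, 1, 0]
mean_mentions = sum(mentions) / len(mentions)
low, high = bootstrap_ci(mentions)
print(f"mean = {mean_mentions:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Repeating this for every year and risk type would yield a band of the kind shown around the yearly averages.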

Given different regulatory environments, we compare the dynamics in four countries in Figure 2. Companies in France, which has a disclosure mandate, and the United Kingdom saw a marked rise in both transition and physical risk reporting since 2015, while Germany and Switzerland exhibited a lower (but still clearly visible) growth during the same period. Comparing different industries (Figure 3), we find that especially the energy, basic materials, and utilities industries disclose transition risks. These are sectors with high emission intensities, which are particularly affected by climate policies. For physical climate risk, no clear industry pattern is visible in our data.

Figure 1: Average mentions per report over time. (a) 2 classes, with a 95% bootstrapped confidence interval (CI); (b) 5 classes.

Figure 2: Average number of climate risk mentions per report for selected countries.

Figure 3: Distribution of the average number of climate risk mentions per report over the time frame 2015-2020 by industry.

6 Discussion and Conclusion

In the present article, we developed an approach to automatically identify climate risk disclosures in corporate annual reports, and used it to analyze the disclosure of 337 European companies over 20 years. We find that the number of risk mentions (especially of transition risks) started to rise sharply around 2015. It appears likely that public policies played a role in this development, as numerous policies to encourage or mandate climate risk reporting have been enacted in Europe since 2015 (steffen2021comparative). Assessing whether specific policies indeed caused this development, however, requires further research. Potential empirical designs to that end include difference-in-differences approaches, or models with country- and industry-fixed effects. Our approach is well suited to deliver the dependent variable for such analyses.
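To indicate how the predicted mention counts could serve as the dependent variable, the simplest two-group, two-period difference-in-differences estimate is sketched below. All numbers are invented, the grouping into a country with a disclosure mandate and one without is hypothetical, and a credible study would additionally need to test the parallel-trends assumption and control for confounders.

```python
def mean(values):
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Invented average mentions per report, before and after a hypothetical mandate.
treated_pre = [0.5, 0.7, 0.6]    # firms in a country that introduces a mandate
treated_post = [2.0, 2.4, 2.2]
control_pre = [0.4, 0.6, 0.5]    # firms in a country without a mandate
control_post = [1.0, 1.2, 1.1]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(round(effect, 2))  # 1.0
```

Here the treated firms gain 1.6 mentions per report and the control firms 0.6, so the estimated effect attributed to the mandate is 1.0 additional mentions per report.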

Next steps for refining our analysis will focus on appropriately quantifying the uncertainty of model predictions, and working to reduce it further by exploring a hierarchical classification approach, and adding more training data. The analysis can also be expanded to a broader set of company types and communication channels beyond annual reports.

Finally, it should be kept in mind that improved transparency on climate risks should not automatically be expected to change investor behavior in a meaningful way (ameli2020climate). More research is needed to understand how capital is (re)allocated based on better climate risk disclosures; in this context our approach can be useful to deliver the explanatory variable for such future analyses. Ultimately, both the effectiveness of policies to trigger climate risk disclosures, and the effectiveness of such disclosures to change investment behavior, are required for financial investments to help achieve the targets of the Paris Agreement.


The project has received funding from the European Union’s Horizon 2020 research and innovation programme, European Research Council (ERC) (grant agreement No 948220, project GREENFIN).