EDGAR-CORPUS: Billions of Tokens Make The World Go Round

09/29/2021 ∙ by Lefteris Loukas, et al. ∙ Ernst & Young ∙ Athens University of Economics and Business

We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.




1 Introduction

Natural Language Processing (NLP) for economics and finance is a rapidly developing research area (Hahn et al., 2018, 2019; Chen et al., 2020; El-Haj et al., 2020). While financial data are usually reported in tables, much valuable information also lies in text. A prominent source of such textual data is the Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) on the website of the US Securities and Exchange Commission (SEC), which hosts filings of publicly traded companies (see https://www.sec.gov/edgar/searchedgar/companysearch.html for more information). To maintain transparency and regulate exchanges, the SEC requires all public companies to periodically upload various reports describing their financial status, as well as important events such as acquisitions and bankruptcies.

Financial documents from EDGAR have been useful in a variety of tasks such as stock price prediction (Lee et al., 2014), risk analysis (Kogan et al., 2009), financial distress prediction (Gandhi et al., 2019), and identification of merger participants (Katsafados et al., 2021). However, there has not been an open-source, efficient tool to retrieve textual information from EDGAR. Researchers interested in economics and NLP often rely on expensive subscription services or try to build web crawlers from scratch, often unsuccessfully. In the latter case, there are many technical challenges, especially when aiming to retrieve the annual reports of a specific firm for particular years or to extract the most relevant item sections from documents that may contain hundreds of pages. Thus, developing a web-crawling toolkit for EDGAR and releasing a large financial corpus in a clean, easy-to-use form would foster research in financial NLP.

Corpora                      Filings   Tokens   Companies   Years
-----------------------------------------------------------------------
Händschke et al. (2018)      Various   242M     270         2000-2015
Daudert and Ahmadi (2019)    Various   188M     60          1995-2018
-----------------------------------------------------------------------
Lee et al. (2014)            8-K       27.9M    500         2002-2012
Kogan et al. (2009)          10-K      247.7M   10,492      1996-2006
Tsai et al. (2016)           10-K      359M     7,341       1996-2013
EDGAR-CORPUS (ours)          10-K      6.5B     38,009      1993-2020

Table 1: Financial corpora derived from the SEC (lower part) and other sources (upper part).

In this paper, we release EDGAR-CORPUS, a novel financial corpus containing all the US annual reports (10-K filings) from 1993 to 2020 (EDGAR-CORPUS is available at https://zenodo.org/record/5528490). Each report is provided in an easy-to-use JSON format containing all 20 sections and subsections (items) of an SEC annual report; different items provide useful information for different tasks in financial NLP. To the best of our knowledge, EDGAR-CORPUS is the largest publicly available financial corpus (Table 1). In addition, we use EDGAR-CORPUS to train and release word2vec embeddings, dubbed EDGAR-W2V. We experimentally show that the new embeddings are more useful for financial NLP tasks than generic GloVe embeddings (Pennington et al., 2014) and other previously released financial word2vec embeddings (Tsai et al., 2016). Finally, to further facilitate future research in financial NLP, we open-source EDGAR-CRAWLER, the Python toolkit we developed to download and extract the text from the annual reports of publicly traded companies available at EDGAR (EDGAR-CRAWLER is available at https://github.com/nlpaueb/edgar-crawler).

2 Related Work

There are few textual financial resources in the NLP literature. Händschke et al. (2018) published JOCO, a corpus of non-SEC annual and social responsibility reports for the top 270 US, UK, and German companies. Daudert and Ahmadi (2019) released CoFiF, the first financial corpus in the French language, comprising annual, semi-annual, quarterly, and reference business documents.

While some previous work has published document collections from EDGAR, those collections come with certain limitations. Kogan et al. (2009) published a collection of the Management's Discussion and Analysis sections (Item 7) for all SEC company annual reports from 1996 to 2006. Tsai et al. (2016) updated that collection to include reports up to 2013 while also providing word2vec embeddings. Finally, Lee et al. (2014) released a collection of 8-K reports from EDGAR, which announce significant firm events such as acquisitions or director resignations, from 2002 until 2012.

Compared to previous work, EDGAR-CORPUS contains all 20 items of the annual reports from all publicly traded companies in the US, covering the period from 1993 to 2020. We believe that releasing the complete annual reports (with all 20 items) will facilitate several research directions in financial NLP (Loughran and McDonald, 2016). Also, EDGAR-CORPUS is much larger than previously published financial corpora in terms of tokens, number of companies, and year range (Table 1).

3 Creating EDGAR-CORPUS

3.1 Data and toolkit

Publicly listed companies are required to submit 10-K filings (annual reports) every year. Each 10-K filing is a complete description of the company's economic activity during the corresponding fiscal year. Such reports also provide a full outline of risks, liabilities, corporate agreements, and operations. Furthermore, the documents provide an extensive analysis of the relevant industry sector and the marketplace as a whole.

A 10-K report is organized in 4 parts and 20 different items (Table 2). Extracting specific items from documents with hundreds of pages usually requires manual work, which is time- and resource-intensive. To promote research in all possible directions, we extracted all available items using an extensive pre-processing and extraction pipeline.

Part       Item       Section Name
----------------------------------------------------------------------------
Part I     Item 1     Business
           Item 1A    Risk Factors
           Item 1B    Unresolved Staff Comments
           Item 2     Properties
           Item 3     Legal Proceedings
           Item 4     Mine Safety Disclosures
Part II    Item 5     Market
           Item 6     Consolidated Financial Data
           Item 7     Management's Discussion and Analysis
           Item 7A    Quantitative and Qualitative Disclosures about Market Risks
           Item 8     Financial Statements
           Item 9     Changes in and Disagreements With Accountants
           Item 9A    Controls and Procedures
           Item 9B    Other Information
Part III   Item 10    Directors, Executive Officers and Corporate Governance
           Item 11    Executive Compensation
           Item 12    Security Ownership of Certain Beneficial Owners
           Item 13    Certain Relationships and Related Transactions
           Item 14    Principal Accounting Fees and Services
Part IV    Item 15    Exhibits and Financial Statement Schedules Signatures

Table 2: The 20 different items of a 10-K report.

In more detail, we developed EDGAR-CRAWLER, which we used to download the 10-K reports of all publicly traded companies in the US between the years 1993 and 2020. We then removed all tables to keep only the textual data, which were HTML-stripped using Beautiful Soup (https://beautiful-soup-4.readthedocs.io/en/latest), cleaned, and split into the different items by using regular expressions. The resulting dataset is EDGAR-CORPUS.
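The item-splitting step can be illustrated with a minimal sketch. It is not the pipeline used to build the corpus: the heading pattern, the `split_into_items` helper, and the `item_1a`-style keys are assumptions made for illustration, and real filings need far more robust patterns (for example, to skip the table of contents, whose entries would otherwise overwrite or be overwritten by the real sections).

```python
import re

# Hypothetical heading pattern: "Item 1A." (or ":" / "-") at the start of a line.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[ab]?)\s*[.:\-]",
    re.IGNORECASE | re.MULTILINE,
)


def split_into_items(text: str) -> dict[str, str]:
    """Map 'item_1a'-style keys to the text between consecutive item headings."""
    matches = list(ITEM_HEADING.finditer(text))
    items = {}
    for i, match in enumerate(matches):
        # An item runs until the next heading, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        key = "item_" + match.group(1).lower()
        items[key] = text[match.end():end].strip()
    return items


# Toy input standing in for HTML-stripped report text.
sample = """Item 1. Business
We sell widgets.
Item 1A. Risk Factors
Demand may fall.
Item 2. Properties
We lease one office."""

sections = split_into_items(sample)
for key, body in sections.items():
    print(key, "->", body.splitlines()[0])
```

Keying the result by item makes it straightforward to keep only the sections a given experiment needs.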

While there exist toolkits to download annual filings from EDGAR, they do not support the extraction of specific item sections (for example, sec-edgar can download complete HTML reports, with images and tables, but it does not produce clean item-specific text). This is particularly important since researchers often rely on certain items in their experimental setup. For example, Goel and Gangolly (2012), Purda and Skillicorn (2015), and Goel and Uzuner (2016) perform textual analysis on Item 7 to detect corporate fraud. Katsafados et al. (2020) combine Item 7 and Item 1A to detect Initial Public Offering (IPO) underpricing, while Moriarty et al. (2019) combine Item 1 with Item 7 to predict mergers and acquisitions.

Apart from EDGAR-CORPUS, we also release EDGAR-CRAWLER, the toolkit we developed to create EDGAR-CORPUS, to facilitate future research based on textual data from EDGAR. EDGAR-CRAWLER consists of two Python modules that support its main functions:

  • edgar_crawler.py is used to download 10-K reports in batch or for specific companies that are of interest to the user.

  • extract_items.py extracts the text of all or particular items from 10-K reports. Each item's text becomes a separate JSON key-value pair (Figure 1).

Figure 1: An example of a 10-K report in JSON format as downloaded and extracted by EDGAR-CRAWLER; filename is the downloaded 10-K file; cik is the Central Index Key; year is the corresponding fiscal year.
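As a rough illustration of how such a JSON report might be consumed, the sketch below builds a toy report and selects a few items from it. The filename, cik, and year keys follow the caption above; the item key names (`item_1`, `item_7`), the sample values, and the `get_items` helper are assumptions made for this example and should be checked against the released files.

```python
import json

# Toy report shaped as the caption describes. All values are made up,
# and the item key names are assumed for illustration.
raw = json.dumps({
    "filename": "example_10K.htm",
    "cik": "0000000",
    "year": "2020",
    "item_1": "We design and sell widgets.",
    "item_7": "Revenue grew because widget demand increased.",
})

report = json.loads(raw)


def get_items(report: dict, wanted: list[str]) -> dict[str, str]:
    """Return the text of the requested items, skipping absent or empty ones."""
    return {key: report[key] for key in wanted if report.get(key)}


# Request three items; the toy report has no "item_1a", so it is skipped.
selected = get_items(report, ["item_1", "item_1a", "item_7"])
print(report["cik"], report["year"], sorted(selected))
```

Skipping absent items rather than raising an error is a design choice for this sketch, since not every filing necessarily contains every item.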

3.2 Word embeddings

To facilitate financial NLP research, we used EDGAR-CORPUS to train word2vec embeddings (EDGAR-W2V), which can be used for downstream tasks, such as financial text classification or summarization. We used word2vec's skip-gram model (Mikolov et al., 2013a, b) with default parameters as implemented by gensim (Řehůřek and Sojka, 2010) to generate 200-dimensional word2vec embeddings for a vocabulary of 100K words. The word tokens are generated using spaCy (Honnibal et al., 2020). We also release EDGAR-W2V (the embeddings are available at https://zenodo.org/record/5524358).

To illustrate the quality of EDGAR-W2V embeddings, in Figure 2 we visualize sampled words from seven different entity types, i.e., location, industry, company, year, month, number, and financial term, after applying dimensionality reduction with the UMAP algorithm (McInnes et al., 2018). The financial terms are randomly sampled from the Investopedia Financial Terms Dictionary (https://www.investopedia.com/financial-term-dictionary-4769738). In addition, companies and industries are randomly sampled from well-known industry sectors and publicly traded stocks. Finally, the words belonging to the remaining entity types are randomly sampled from gazetteers. Figure 2 shows that words belonging to the same entity type form clear clusters in the 2-dimensional space, indicating that EDGAR-W2V embeddings manage to capture the underlying financial semantics of the vocabulary.

Figure 2: Visualization of the EDGAR-W2V embeddings. Different colors indicate different entity types.

economy     competitor    market        national      investor
------------------------------------------------------------------
downturn    competitive   marketplace   association   institutional
recession   competing     industry      regional      shareholder
slowdown    dominant      prices        nationwide    relations
sluggish    advantages    illiquidity   american      purchaser
stagnant    competition   prevailing    zions         creditor

Table 3: Sample words from EDGAR-W2V embeddings (top row) and their corresponding nearest neighbors (columns) based on cosine similarity.

To further highlight the semantics captured by EDGAR-W2V embeddings, we retrieved the 5 nearest neighbors, according to cosine similarity, for commonly used financial terms (Table 3). We exclude obvious top-scoring neighbors that form singular/plural pairs, such as market/markets or investor/investors. As shown, all the nearest neighbors are highly related to the corresponding term. For instance, the word economy is correctly associated with terms indicating the slowdown of the economy during the past few years, e.g., downturn, recession, or slowdown. Also, market is correctly related to words such as marketplace, industry, and prices.
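The nearest-neighbor retrieval described above can be sketched in a few lines of plain Python. The three-dimensional toy vectors and the helper names are made up for illustration; the released embeddings are 200-dimensional, and in practice a vectorized library would be used instead of Python loops.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word: str, vectors: dict, k: int = 5) -> list[str]:
    """Return the k words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


# Toy vectors, invented for this example only.
toy = {
    "economy": [1.0, 0.1, 0.0],
    "recession": [0.9, 0.2, 0.0],
    "downturn": [0.8, 0.3, 0.1],
    "shareholder": [0.0, 0.1, 1.0],
}

neighbors = nearest_neighbors("economy", toy, k=2)
print(neighbors)
```

Filtering out singular/plural pairs, as done for Table 3, would be an additional step applied to the ranked list.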

4 Experiments on financial NLP tasks

We compare EDGAR-W2V embeddings against generic GloVe embeddings (Pennington et al., 2014) and the financial embeddings of Tsai et al. (2016) in three financial NLP tasks. For GloVe, we use the 200-dimensional embeddings from https://nlp.stanford.edu/data/glove.6B.zip. For each task, we use the same model and alter only the embeddings component. In addition, we use the same pre-processing during the creation of the vocabulary of the embeddings in each case.

FinSim-3 (Juyeon Kang and Gan, 2021) provides a set of business and economic terms, and the task is to classify them into the most relevant hypernym from a set of 17 possible hypernyms from the Financial Industry Business Ontology (FIBO; https://spec.edmcouncil.org/fibo/). Example hypernyms include Credit Index, Bonds, and Stocks. We tackle the problem with a multinomial logistic regression model, which, given the embedding of an economic term, classifies the term to one of the 17 possible hypernyms. Since FinSim-3 is a recently completed challenge, the true labels for the test data were not available. Therefore, we use a stratified 10-fold cross-validation. We report accuracy and the average rank of the correct hypernym as evaluation measures. For the latter, the 17 hypernyms are sorted according to the model's probabilities. A perfect model, i.e., one that would always rank the correct hypernym first, would have an average rank of 1.
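The two evaluation measures can be made concrete with a short sketch. The probability dictionaries below are invented and use only three of the hypernyms for brevity; the function names are likewise illustrative.

```python
def rank_of_correct(probs: dict[str, float], gold: str) -> int:
    """1-based rank of the gold hypernym when labels are sorted by probability."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(gold) + 1


def accuracy_and_mean_rank(predictions: list, gold_labels: list) -> tuple:
    """Accuracy (share of rank-1 predictions) and average rank of the gold label."""
    ranks = [rank_of_correct(p, g) for p, g in zip(predictions, gold_labels)]
    accuracy = sum(r == 1 for r in ranks) / len(ranks)
    mean_rank = sum(ranks) / len(ranks)
    return accuracy, mean_rank


# Invented model outputs for two terms.
preds = [
    {"Bonds": 0.7, "Stocks": 0.2, "Credit Index": 0.1},
    {"Bonds": 0.5, "Stocks": 0.3, "Credit Index": 0.2},
]
gold = ["Bonds", "Stocks"]

acc, mean_rank = accuracy_and_mean_rank(preds, gold)
print(acc, mean_rank)
```

In this toy case the first term is ranked correctly and the second has its gold label in second place, so the average rank lies between 1 and 2.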

Financial tagging (FINT) is an in-house sequence labeling problem for financial documents. The task is to annotate financial reports with word-level tags from an accounting taxonomy. To tackle the problem, we use a BiLSTM encoder operating over word embeddings with a shared multinomial logistic regression that predicts the correct tag for each word from the corresponding BiLSTM state. We report the F1 score micro-averaged across all tags.
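One common way to compute a micro-averaged F1 score for word-level tagging is sketched below. The exact evaluation protocol is not described in the text, so this is an assumption: counts are pooled over all tags, an outside tag "O" is assumed to mark untagged words and is excluded, and the tag names are invented.

```python
def micro_f1(gold: list[str], pred: list[str], outside: str = "O") -> float:
    """Token-level F1 with counts pooled over all tags except the outside tag."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g != outside:
                tp += 1  # correct non-outside tag
            continue
        if p != outside:
            fp += 1  # predicted a tag that is not the gold one
        if g != outside:
            fn += 1  # missed the gold tag
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented tags for a five-word sequence.
gold_tags = ["O", "Revenue", "Revenue", "O", "Assets"]
pred_tags = ["O", "Revenue", "O", "Assets", "Assets"]

f1 = micro_f1(gold_tags, pred_tags)
print(round(f1, 3))
```

Pooling counts before computing precision and recall means frequent tags weigh more than rare ones, which is what distinguishes micro- from macro-averaging.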

FiQA Open Challenge (Maia et al., 2018) is a sentiment analysis regression challenge over financial texts. It contains financial tweets annotated by domain experts with a sentiment score in the range [-1, 1], with 1 denoting the most positive score. For this problem, we employ a BiLSTM encoder, which operates over word embeddings, and a linear regressor operating over the last hidden state of the BiLSTM. Since we do not have access to the test set in this task, we use a 10-fold cross-validation. We evaluate the results using Mean Squared Error (MSE) and R-squared (R²).
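For reference, the two regression measures can be computed as in the following sketch, which uses the standard definitions (MSE as the mean of squared residuals, R² as one minus the ratio of residual to total sum of squares). The sentiment scores are invented.

```python
def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean of squared differences between gold and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """1 - SS_res / SS_tot; assumes the gold scores are not all identical."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented sentiment scores in [-1, 1].
gold_scores = [-1.0, 0.0, 1.0]
pred_scores = [-0.5, 0.0, 0.5]

error = mse(gold_scores, pred_scores)
r2 = r_squared(gold_scores, pred_scores)
print(round(error, 4), r2)
```

Lower MSE and higher R² both indicate a better fit.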

Across all tasks, EDGAR-W2V outperforms GloVe, showing that in-domain knowledge is critical in financial NLP problems (Table 4). The gains are more substantial in FinSim-3 and FINT, which rely to a larger extent on understanding highly technical economics discourse. Interestingly, the in-domain embeddings of Tsai et al. (2016) are comparable to the generic GloVe embeddings in two of the three tasks. One possible reason is that Tsai et al. (2016) employed stemming during the creation of the embeddings vocabulary, which might have contributed noise to the models due to loss of information.

                      FinSim-3         FINT    FiQA
                      Acc.    Rank     F1      MSE      R²
---------------------------------------------------------------
GloVe                 85.3    1.26     75.8    0.151    0.119
Tsai et al. (2016)    84.9    1.27     75.3    0.142    0.169
EDGAR-W2V (ours)      87.9    1.21     77.3    0.141    0.176

Table 4: Results across financial NLP tasks, with different word embeddings. We report averages over 3 runs with different random seeds. The standard deviations were very small and are omitted for brevity.

5 Conclusions and Future Work

We introduced and released EDGAR-CORPUS, a novel NLP corpus for the financial domain. To the best of our knowledge, EDGAR-CORPUS is the largest financial corpus available. It contains textual data from annual reports published in EDGAR, the repository for all US publicly traded companies, covering a period of more than 25 years. All the reports are split into their corresponding items (sections) and are provided in a clean, easy-to-use JSON format. We also released EDGAR-CRAWLER, a toolkit for downloading and extracting the reports. To showcase the impact of EDGAR-CORPUS, we used it to train and release EDGAR-W2V, which are financial word2vec embeddings. After illustrating the quality of EDGAR-W2V embeddings, we also showed their usefulness in three financial NLP tasks, where they outperformed generic GloVe embeddings and other financial embeddings.

In future work, we plan to extend EDGAR-CRAWLER to support additional types of documents (e.g., 10-Q, 8-K) and to leverage EDGAR-CORPUS to explore transfer learning for the financial domain, which is vastly understudied.


This publication contains information in summary form and is therefore intended for general guidance only. It is not intended to be a substitute for detailed research or the exercise of professional judgment. Member firms of the global EY organization cannot accept responsibility for loss to any person relying on this article.


References

  • C. Chen, H. Huang, H. Takamura, and H. Chen (Eds.) (2020) Proceedings of the second workshop on financial technology and natural language processing. Kyoto, Japan.
  • T. Daudert and S. Ahmadi (2019) CoFiF: a corpus of financial reports in French language. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing, Macao, China, pp. 21–26.
  • M. El-Haj, V. Athanasakou, S. Ferradans, C. Salzedo, A. Elhag, H. Bouamor, M. Litvak, P. Rayson, G. Giannakopoulos, and N. Pittaras (Eds.) (2020) Proceedings of the 1st joint workshop on financial narrative processing and multiling financial summarisation. Barcelona, Spain (Online).
  • P. Gandhi, T. Loughran, and B. McDonald (2019) Using Annual Report Sentiment as a Proxy for Financial Distress in U.S. Banks. Journal of Behavioral Finance 20 (4), pp. 424–436.
  • S. Goel and J. Gangolly (2012) Beyond the numbers: mining the annual reports for hidden cues indicative of financial statement fraud. Intelligent Systems in Accounting, Finance and Management 19 (2), pp. 75–89.
  • S. Goel and Ö. Uzuner (2016) Do sentiments matter in fraud detection? estimating semantic orientation of annual reports. Intelligent Systems in Accounting, Finance and Management 23, pp. 215–239.
  • U. Hahn, V. Hoste, and M. Tsai (Eds.) (2018) Proceedings of the first workshop on economics and natural language processing. Melbourne, Australia.
  • U. Hahn, V. Hoste, and Z. Zhang (Eds.) (2019) Proceedings of the second workshop on economics and natural language processing. Hong Kong.
  • S. G.M. Händschke, S. Buechel, J. Goldenstein, P. Poschmann, T. Duan, P. Walgenbach, and U. Hahn (2018) A corpus of corporate annual and social responsibility reports: 280 million tokens of balanced organizational writing. In Proceedings of the First Workshop on Economics and Natural Language Processing, Melbourne, Australia, pp. 20–31.
  • M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd (2020) spaCy: Industrial-strength Natural Language Processing in Python.
  • S. B. Juyeon Kang and M. Gan (2021) FinSim-3: the 3rd shared task on learning semantic similarities for the financial domain. In Proceedings of the Third Workshop on Financial Technology and Natural Language Processing, Held Online.
  • A. G. Katsafados, I. Androutsopoulos, I. Chalkidis, E. Fergadiotis, G. N. Leledakis, and E. G. Pyrgiotakis (2021) Using textual analysis to identify merger participants: evidence from the u.s. banking industry. Finance Research Letters, pp. 101949. ISSN 1544-6123.
  • A. G. Katsafados, I. Androutsopoulos, I. Chalkidis, M. Fergadiotis, G. N. Leledakis, and E. G. Pyrgiotakis (2020) Textual Information and IPO Underpricing: A Machine Learning Approach. MPRA Paper Technical Report 103813, University Library of Munich, Germany.
  • S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith (2009) Predicting risk from financial reports with regression. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado, pp. 272–280.
  • H. Lee, M. Surdeanu, B. MacCartney, and D. Jurafsky (2014) On the importance of text analysis for stock price prediction. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, pp. 1170–1175.
  • T. Loughran and B. McDonald (2016) Textual analysis in accounting and finance: a survey. Journal of Accounting Research 54 (4), pp. 1187–1230.
  • M. Maia, S. Handschuh, A. Freitas, B. Davis, R. McDermott, M. Zarrouk, and A. Balahur (2018) WWW'18 open challenge: financial opinion mining and question answering. Companion Proceedings of the The Web Conference 2018.
  • L. McInnes, J. Healy, N. Saul, and L. Grossberger (2018) UMAP: uniform manifold approximation and projection. The Journal of Open Source Software 3 (29), pp. 861.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 3111–3119.
  • R. Moriarty, H. Ly, E. Lan, and S. McIntosh (2019) Deal or no deal: predicting mergers and acquisitions at scale. 2019 IEEE International Conference on Big Data (Big Data), pp. 5552–5558.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
  • L. Purda and D. Skillicorn (2015) Accounting variables, deception, and a bag of words: assessing the tools of fraud detection. Contemporary Accounting Research 32 (3), pp. 1193–1223. https://onlinelibrary.wiley.com/doi/pdf/10.1111/1911-3846.12089
  • R. Řehůřek and P. Sojka (2010) Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50 (English). http://is.muni.cz/publication/884893/en
  • M. Tsai, C. Wang, and P. Chien (2016) Discovering finance keywords via continuous-space language models. ACM Transactions on Management Information Systems 7 (3). ISSN 2158-656X.