EDGAR-CORPUS: Billions of Tokens Make The World Go Round

09/29/2021
by Lefteris Loukas, et al.

We release EDGAR-CORPUS, a novel corpus comprising annual reports from all publicly traded companies in the US, spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, a set of WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and show that they outperform generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.
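
To make the "items in JSON" idea concrete, here is a minimal sketch of reading one report and counting the tokens in each item. The record layout and every field name (`company`, `year`, `items`, `item_1`, and so on) are assumptions made for illustration, not the corpus's documented schema, and the sample text is invented.

```python
import json

# Hypothetical report record. The field names are assumptions for
# illustration only; check the released files for the actual schema.
sample_report = json.loads("""
{
  "company": "Example Corp",
  "year": 2020,
  "items": {
    "item_1": "Business. Example Corp sells widgets.",
    "item_1A": "Risk factors. Demand for widgets may fall.",
    "item_7": ""
  }
}
""")


def section_token_counts(report):
    """Return whitespace-token counts for each non-empty item (section)."""
    return {
        name: len(text.split())
        for name, text in report["items"].items()
        if text.strip()  # skip items that were empty in the filing
    }


counts = section_token_counts(sample_report)
print(counts)
```

Splitting on whitespace is only a rough stand-in for a real tokenizer, but it is enough to show how per-item text could be fed into a downstream step such as embedding training.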


