Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes

10/17/2022
by Istiak Ahmad, et al.

Knowledge is central to human and scientific development. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial ingredient of NLP and machine learning, and the scarcity of open datasets is a well-known problem in machine and deep learning research. This is true of textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging, and the number of large datasets for NLP research is practically nil. We present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology), and each article carries five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset contains 185.51 million words and 12.57 million sentences across 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset a second (balanced) dataset comprising 320,000 news articles, with 40,000 articles in each of the eight news categories. Potrika contains both datasets (raw and balanced) to suit a wide range of NLP research. To the best of our knowledge, Potrika is by far the largest and most extensive dataset for news classification.
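To make the record layout and the balancing target concrete, the following is a minimal Python sketch, not the authors' pipeline. The field names in `Article` are assumed names for the five attributes, and `balance` is a hypothetical helper. The abstract does not say which augmentation techniques were used, so the sketch substitutes plain random resampling: it downsamples large categories and duplicates articles in small ones.

```python
import random
from collections import Counter, defaultdict
from dataclasses import dataclass

# The eight categories listed in the abstract.
CATEGORIES = [
    "National", "Sports", "International", "Entertainment",
    "Economy", "Education", "Politics", "Science & Technology",
]


@dataclass(frozen=True)
class Article:
    """One record with the five attributes (field names are assumptions)."""
    text: str              # News Article
    category: str          # Category
    headline: str          # Headline
    publication_date: str  # Publication Date
    source: str            # Newspaper Source


def category_counts(articles):
    """Count how many articles fall into each category."""
    return Counter(a.category for a in articles)


def balance(articles, per_category, seed=0):
    """Return a list with exactly `per_category` articles per category.

    Stand-in for the paper's augmentation step: categories with too many
    articles are randomly downsampled, and categories with too few are
    padded with randomly chosen duplicates.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for article in articles:
        by_category[article.category].append(article)

    balanced = []
    for items in by_category.values():
        if len(items) >= per_category:
            balanced.extend(rng.sample(items, per_category))
        else:
            padding = rng.choices(items, k=per_category - len(items))
            balanced.extend(items + padding)
    return balanced
```

With `per_category=40_000` and all eight categories present, the result would hold 8 × 40,000 = 320,000 articles, which matches the size of the balanced dataset stated above. Duplication adds no new text, so a real augmentation method would be needed to reproduce the published dataset.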

