Cem Mil Podcasts: A Spoken Portuguese Document Corpus

09/23/2022
by Edgar Tanaka, et al.

This document describes the Portuguese-language podcast dataset released by Spotify for academic research purposes. We give an overview of how the data was sampled, some basic statistics on the collection, and brief information on its distribution over Brazilian and Portuguese dialects.


