Addestramento con Dataset Sbilanciati (Training with Imbalanced Datasets)

08/18/2020
by Massimiliano Morrelli, et al.

This document compares several methods for balancing a dataset before training a model. The training dataset consists of short and medium-length sentences, such as simple phrases or excerpts from conversations held on web channels. The models are trained with the facilities provided by the Apache Spark framework. They may later support a solution that classifies sentences in a distributed environment, as described in "Nuova frontiera della classificazione testuale: Big data e calcolo distribuito" ("New frontier of textual classification: Big data and distributed calculation") by Massimiliano Morrelli et al.
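The abstract does not name the balancing methods being compared. As a rough illustration of what "balancing a dataset" can mean, the sketch below shows random oversampling, one common technique, in plain Python rather than Apache Spark. The function name `random_oversample`, the sample sentences, and the labels are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, seed=0):
    """Return a copy of `samples` in which every class has as many
    examples as the largest class, by duplicating minority examples.

    `samples` is a list of (text, label) pairs. This is an illustrative
    assumption about the data layout, not the paper's format.
    """
    if not samples:
        return []

    rng = random.Random(seed)

    # Group the examples by their class label.
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    # The size of the largest class is the target size for every class.
    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced data: 6 "billing" sentences, 2 "support" sentences.
data = [(f"billing question {i}", "billing") for i in range(6)]
data += [(f"support request {i}", "support") for i in range(2)]

balanced = random_oversample(data)
counts = Counter(label for _, label in balanced)
```

After the call, `counts` holds the same number of examples for each label. Oversampling keeps all of the original data but repeats minority sentences, which can encourage overfitting to them. Undersampling the majority class is the usual alternative when the dataset is large enough to afford discarding examples.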


research
06/28/2019

Nuova frontiera della classificazione testuale: Big data e calcolo distribuito

This document was created in order to study the algorithms for the categ...
