Studio e confronto delle strutture di Apache Spark (Study and Comparison of Apache Spark Structures)

10/29/2018
by Massimiliano Morrelli, et al.

This document studies the data structures available in the Apache Spark framework and evaluates which perform best when implementing solutions; in particular, it assesses the advantages and disadvantages of using Datasets when designing jobs. The observed results provide further support for evaluating Datasets as an alternative to RDDs, in order to understand their strengths and weaknesses. The results come from two test cases designed specifically for this purpose and implemented in Java 1.8. The jobs are executed in a suitable distributed environment, and the study concludes with a comparison of execution times and of the results obtained.
