ds-array: A Distributed Data Structure for Large Scale Machine Learning

Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Data sets from these research areas are often so large that they need to be processed on distributed platforms. This can be done using one of the various distributed machine learning libraries available. One of these libraries is dislib, a distributed machine learning library for Python designed specifically to process large-scale data sets on HPC clusters, which makes it an ideal candidate for analyzing scientific data. However, dislib's main distributed data structure, called Dataset, has some limitations, including poor performance in certain operations and low flexibility and usability. In this paper, we propose a novel distributed data structure for dislib, called ds-array, that addresses dislib's main limitations in data management. Ds-arrays simplify distributed data management in dislib by exposing a NumPy-like API, provide more flexibility, and reduce the computational complexity of some operations. This results in performance improvements of up to two orders of magnitude over Datasets, while also greatly improving scalability and usability.
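The idea of a block-partitioned array behind a NumPy-like interface can be illustrated with a small, single-process sketch. This is not dislib's implementation or API: the class name `BlockArray`, its methods, and the block layout are hypothetical and chosen only for illustration, and the blocks here are plain Python lists rather than data distributed across a cluster. The sketch assumes a non-empty rectangular input and non-negative indices.

```python
class BlockArray:
    """Toy 2-D array stored as a grid of blocks (hypothetical, not dislib's ds-array)."""

    def __init__(self, rows, block_size):
        self.shape = (len(rows), len(rows[0]) if rows else 0)
        self.block_size = block_size
        br, bc = block_size
        # blocks[i][j] holds the sub-matrix starting at row i*br, column j*bc.
        self.blocks = [
            [[row[c:c + bc] for row in rows[r:r + br]]
             for c in range(0, self.shape[1], bc)]
            for r in range(0, self.shape[0], br)
        ]

    def __getitem__(self, key):
        """NumPy-style a[i, j]; only the block that owns the element is read."""
        i, j = key
        br, bc = self.block_size
        return self.blocks[i // br][j // bc][i % br][j % bc]

    def transpose(self):
        """Transpose block by block: block (i, j) becomes block (j, i), transposed."""
        out = BlockArray.__new__(BlockArray)
        out.shape = (self.shape[1], self.shape[0])
        out.block_size = (self.block_size[1], self.block_size[0])
        n_block_rows = len(self.blocks)
        n_block_cols = len(self.blocks[0]) if self.blocks else 0
        out.blocks = [
            [[list(col) for col in zip(*self.blocks[i][j])]
             for i in range(n_block_rows)]
            for j in range(n_block_cols)
        ]
        return out

    def collect(self):
        """Gather all blocks back into one list of rows."""
        rows = []
        for block_row in self.blocks:
            for k in range(len(block_row[0])):
                rows.append([x for block in block_row for x in block[k]])
        return rows


# Usage: a 5x6 matrix split into blocks of at most 2x3 elements.
a = BlockArray([[r * 6 + c for c in range(6)] for r in range(5)], (2, 3))
t = a.transpose()
```

Because each operation works on individual blocks, a distributed version could in principle process the blocks independently. How ds-arrays actually schedule and store blocks is described in the full paper, not in this sketch.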
