Benchmarking SciDB Data Import on HPC Systems

09/24/2016
by Siddharth Samsi, et al.

SciDB is a scalable, computational database management system that uses an array model for data storage. This array data model makes SciDB well suited to storing and managing large amounts of imaging data. SciDB is designed to support advanced analytics inside the database, reducing the need to extract data for analysis. It is designed to be massively parallel and can run on commodity hardware in a high performance computing (HPC) environment. In this paper, we present the performance of SciDB using simulated image data. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a cluster running the MIT SuperCloud software stack. A peak performance of 2.2M database inserts per second was achieved on a single node of this system. This performance was achieved by using parallel inserts, in-database merging of arrays, and supercomputing techniques such as distributed arrays and single-program-multiple-data programming. We also show that SciDB and the D4M toolbox provide more efficient ways to access random sub-volumes of massive datasets than the traditional approach of reading volumetric data from individual files. This work describes the D4M and SciDB tools we developed and presents the initial performance results.
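To make the single-program-multiple-data idea in the abstract concrete, here is a minimal sketch of how the z-slices of a simulated image volume might be block-distributed across workers and how an inserts-per-second figure could be computed. It does not use the SciDB or D4M APIs: the names `block_range`, `insert_slices`, and `run_benchmark` are hypothetical, a Python dict stands in for the database, and the workers are simulated sequentially in one process, so the measured rate says nothing about real SciDB throughput.

```python
import time


def block_range(n_items, n_procs, rank):
    """Return the half-open [start, stop) range of items owned by `rank`."""
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def insert_slices(store, volume_shape, rank, n_procs):
    """Insert this rank's share of z-slices into `store`; return cells inserted."""
    nx, ny, nz = volume_shape
    z0, z1 = block_range(nz, n_procs, rank)
    for z in range(z0, z1):
        # Stand-in for a database insert of one simulated image slice
        # (assumption: a dict keyed by z replaces the real array store).
        store[z] = [[(x + y + z) % 256 for y in range(ny)] for x in range(nx)]
    return (z1 - z0) * nx * ny


def run_benchmark(volume_shape=(32, 32, 10), n_procs=4):
    """Run every simulated rank in turn and report the aggregate insert rate."""
    store = {}
    t0 = time.perf_counter()
    total = sum(
        insert_slices(store, volume_shape, rank, n_procs)
        for rank in range(n_procs)
    )
    elapsed = time.perf_counter() - t0
    rate = total / elapsed if elapsed > 0 else float("inf")
    return store, total, rate


if __name__ == "__main__":
    _, total, rate = run_benchmark()
    print(f"inserted {total} cells at {rate:,.0f} cells/s (simulated)")
```

In a real SPMD run each rank would execute the same program as a separate process and write to the shared database concurrently; the block partitioning shown here is only one common way to assign work.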


