Deploying a sharded MongoDB cluster as a queued job on a shared HPC architecture

09/06/2022
by Aaron Saxton, et al.

Data stores are the foundation on which data science, in all its variations, is built. They provide a queryable interface to structured and unstructured data. Data science often starts by leveraging these query features to perform initial data preparation. However, most data stores are designed to run continuously to service disparate user requests with little or no downtime. Many HPC architectures process user requests through a job queue scheduler and maintain a shared filesystem to store a job's persistent data. We deploy a MongoDB sharded cluster with a run script that is designed to run a data science workload concurrently. As our test piece, we run data ingest and data queries to measure performance with different configurations on the Blue Waters supercomputer.

