Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scientific Computing

04/18/2020
by I. Sfiligoi, et al.

Scientific computing needs are growing dramatically over time and are expanding into science domains that were previously not compute intensive. When compute workflows spike well in excess of the capacity of the local compute resource, capacity should be temporarily provisioned from elsewhere, both to meet deadlines and to increase scientific output. Public Clouds have become an attractive option due to their ability to be provisioned with minimal advance notice. However, the available capacity of cost-effective instances is not well understood. This paper presents the expansion of IceCube's production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained about 15k GPUs for a whole workday, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
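
The headline figures in the abstract can be cross-checked with some simple arithmetic. The sketch below is only a back-of-the-envelope illustration, not the authors' accounting: the 8-hour length of the "workday" and the reading of the quoted 60k as US dollars are assumptions, the function and variable names are hypothetical, and the cost is treated as if it covered only the sustained plateau.

```python
# Back-of-the-envelope check of the abstract's headline figures.
# Inputs taken from the abstract: ~15k GPUs, ~170 PFLOP32s, ~60k total cost.
GPUS = 15_000
SUSTAINED_PFLOP32S = 170.0
COST_USD = 60_000.0      # ASSUMPTION: the quoted 60k is in US dollars
WORKDAY_HOURS = 8.0      # ASSUMPTION: the abstract only says "a whole workday"


def summarize(gpus, pflop32s, cost_usd, hours):
    """Derive per-GPU and per-hour figures from the aggregate numbers."""
    tflop32s_per_gpu = pflop32s * 1_000 / gpus   # 1 PFLOP32s = 1000 TFLOP32s
    eflop32_hours = pflop32s / 1_000 * hours     # 1 EFLOP32s = 1000 PFLOP32s
    gpu_hours = gpus * hours
    return {
        "tflop32s_per_gpu": tflop32s_per_gpu,
        "eflop32_hours": eflop32_hours,
        "usd_per_gpu_hour": cost_usd / gpu_hours,
        "usd_per_eflop32_hour": cost_usd / eflop32_hours,
    }


if __name__ == "__main__":
    for name, value in summarize(
        GPUS, SUSTAINED_PFLOP32S, COST_USD, WORKDAY_HOURS
    ).items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the integrated compute comes out above one EFLOP32 hour, consistent with the abstract's claim, and the average works out to roughly 11 TFLOP32s per GPU. The per-GPU-hour cost is only indicative, since the real bill also includes ramp-up, ramp-down and non-GPU charges.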

