Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows

04/17/2018
by   Alok Singh, et al.

Distributed computing platforms provide a robust mechanism for performing large-scale computations by splitting tasks and data among multiple locations, possibly thousands of miles apart. Although distributing resources in this way brings benefits, it also brings problems: rampant duplication of file transfers that increases congestion, long job completion times, unexpected site crashes, suboptimal data transfer rates, unpredictable reliability over a given time range, and suboptimal usage of storage elements. In addition, each subsystem becomes a potential point of failure that can trigger system-wide disruptions. In this vision paper, we outline our approach to leveraging Deep Learning algorithms to discover solutions to the unique problems that arise in a system whose computational infrastructure is spread over a wide area. The presented vision, motivated by a real scientific use case from Belle II experiments, is to develop multilayer neural networks to tackle forecasting, anomaly detection, and optimization challenges in a complex and distributed data movement environment. Through this vision based on Deep Learning principles, we aim to achieve fewer congestion events, faster file transfer rates, and enhanced site reliability.

research
01/01/2010

A distributed file system for a wide-area high performance computing infrastructure

We describe our work in implementing a wide-area distributed file system...
research
12/08/2017

OneDataShare: A Vision for Cloud-hosted Data Transfer Scheduling and Optimization as a Service

Fast, reliable, and efficient data transmission across wide-area network...
research
03/04/2022

A streamable large-scale clinical EEG dataset for Deep Learning

Deep Learning has revolutionized various fields, including Computer Visi...
research
10/16/2019

Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

Training and deploying deep learning models in real-world applications r...
research
12/03/2018

Hoard: A Distributed Data Caching System to Accelerate Deep Learning Training on the Cloud

Deep Learning system architects strive to design a balanced system where...
research
11/23/2022

SciAI4Industry – Solving PDEs for industry-scale problems with deep learning

Solving partial differential equations with deep learning makes it possi...
research
03/29/2023

A Subset of the CERN Virtual Machine File System: Fast Delivering of Complex Software Stacks for Supercomputing Resources

Delivering a reproducible environment along with complex and up-to-date ...
