BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes

06/22/2021
by Zhengchun Liu, et al.

Supercomputer scheduling policies based on first-come, first-served (FCFS) ordering result in many transient idle nodes, a phenomenon that is only partially alleviated by backfill scheduling methods that promote small jobs to run before large jobs. Here we describe how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training. This important workload is easily organized as many small fragments that can be configured dynamically to fit essentially any node×time hole in a supercomputer's schedule. We describe how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed integer linear programming (MILP)-based resource allocation algorithm, and show that this MILP problem can be solved efficiently at run time. We show further how this MILP problem can be adapted to optimize for administrator- or user-defined metrics. We validate our method with supercomputer scheduler logs and different DNN training scenarios, and demonstrate efficiencies of up to 93% compared with running the same training tasks on dedicated nodes. Our method thus enables substantial supercomputer resources to be allocated to DNN training with no impact on other applications.
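To make the allocation problem concrete, the following is a minimal sketch of the kind of integer decision involved: given a number of currently idle nodes and a set of rescalable training jobs, choose a node count for each job so that aggregate throughput is maximized. It is not the paper's MILP formulation. It uses exhaustive search in place of an MILP solver, ignores rescaling costs and future changes in node availability, and the function name `allocate`, the job names, and the throughput figures are all hypothetical.

```python
from itertools import product


def allocate(idle_nodes, jobs):
    """Pick a node count per job that maximizes total throughput.

    jobs maps a job name to a table {node_count: samples_per_second}
    listing the sizes that job can run at (hypothetical numbers).
    Exhaustive search stands in for an MILP solver here.
    """
    names = list(jobs)
    # Each job may be paused (0 nodes) or run at one of its allowed sizes.
    options = [[0] + sorted(jobs[name]) for name in names]

    best_alloc = {name: 0 for name in names}
    best_rate = 0
    for combo in product(*options):
        # Capacity constraint: never use more nodes than are idle.
        if sum(combo) > idle_nodes:
            continue
        rate = sum(jobs[name][n] for name, n in zip(names, combo) if n > 0)
        if rate > best_rate:
            best_alloc, best_rate = dict(zip(names, combo)), rate
    return best_alloc, best_rate


# Hypothetical jobs with sublinear scaling (samples/s per node count).
jobs = {
    "A": {1: 10, 2: 18, 4: 30},
    "B": {2: 25, 4: 40},
    "C": {1: 8, 2: 15},
}

alloc, rate = allocate(5, jobs)
print(alloc, rate)
```

Because the toy throughput tables scale sublinearly, spreading nodes across several jobs can beat giving one job the largest size it supports. Exhaustive search grows exponentially with the number of jobs, which is one reason a solver-based formulation is attractive at realistic scale.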

