TACC: A Full-stack Cloud Computing Infrastructure for Machine Learning Tasks

10/04/2021
by   Kaiqiang Xu, et al.

In Machine Learning (ML) systems research, efficient resource scheduling and utilization have long been important topics, given the compute-intensive nature of ML applications. In this paper, we introduce the design of TACC, a full-stack cloud infrastructure that efficiently manages and executes large-scale ML applications in compute clusters. TACC implements a 4-layer application workflow abstraction through which system optimization techniques can be dynamically combined and applied to various types of ML applications. TACC is also tailored to the lifecycle of ML applications, with an efficient process for managing, deploying, and scaling ML tasks. TACC's design simplifies the integration of the latest ML systems research into cloud infrastructures, which we hope will benefit more ML researchers and promote ML systems research.

research
05/02/2020

Minerva: A Portable Machine Learning Microservice Framework for Traditional Enterprise SaaS Applications

In traditional SaaS enterprise applications, microservices are an essent...
research
10/02/2022

Approximate Computing and the Efficient Machine Learning Expedition

Approximate computing (AxC) has been long accepted as a design alternati...
research
02/26/2020

A Simple and Agile Cloud Infrastructure to Support Cybersecurity Oriented Machine Learning Workflows

Generating up to date, well labeled datasets for machine learning (ML) s...
research
02/11/2022

Compute Trends Across Three Eras of Machine Learning

Compute, data, and algorithmic advances are the three fundamental factor...
research
05/04/2022

SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training

In today's production machine learning (ML) systems, models are continuo...
research
03/18/2020

ContainerStress: Autonomous Cloud-Node Scoping Framework for Big-Data ML Use Cases

Deploying big-data Machine Learning (ML) services in a cloud environment...
research
06/12/2022

MLLess: Achieving Cost Efficiency in Serverless Machine Learning Training

Function-as-a-Service (FaaS) has raised a growing interest in how to "ta...
