Objcache: An Elastic Filesystem over External Persistent Storage for Container Clusters

09/04/2023
by Takeshi Yoshimura, et al.

Container virtualization lets emerging AI workloads, such as model serving, highly parallelized training, and machine learning pipelines, scale easily on demand on elastic cloud infrastructure. These workloads require persistent storage for data such as training inputs, models, and checkpoints. An external storage system such as cloud object storage is a common choice because of its elasticity and scalability. Caching on a local filesystem is an essential technique for mitigating access latency to external storage. However, local caches on scaling clusters must cope with explosive disk usage, redundant networking, and unexpected failures. We propose objcache, an elastic filesystem over external storage. Objcache introduces an internal transaction protocol over Raft logging to enable atomic updates of distributed persistent states with consistent hashing. The proposed transaction protocol also manages inode dirtiness by maintaining consistency between the local cache and external storage. Objcache supports scaling down to zero by automatically evicting dirty files to external storage. Our evaluation shows that objcache sped up model serving startup by 98.9% compared to direct copies via S3 interfaces. Scaling up with 1,024 dirty files completed in 2 to 14 seconds.
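
The abstract mentions consistent hashing as the way distributed state is spread across cluster nodes. The following is a minimal sketch of that general technique only, not the authors' implementation: the class `HashRing`, the node names, the key format, and the choice of 64 virtual nodes per server are all hypothetical, and the sketch does not model Raft logging or the transaction protocol. It illustrates why consistent hashing suits an elastic cluster: when a node joins, the only keys that change owner are the ones the new node takes over.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, illustrative)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node is placed at several positions to even out the load.
        for i in range(self.vnodes):
            point = ring_position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        # A key belongs to the first ring position clockwise from its hash.
        idx = bisect.bisect_right(self._points, ring_position(key))
        return self._owner[self._points[idx % len(self._points)]]


# Hypothetical keys standing in for cached file chunks.
keys = [f"bucket/model/chunk-{i}" for i in range(1000)]

ring = HashRing(["node-0", "node-1", "node-2"])
before = {k: ring.lookup(k) for k in keys}

# Scale up: only keys claimed by the new node change owner.
ring.add_node("node-3")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Scale down again: every key returns to its previous owner.
ring.remove_node("node-3")
restored = {k: ring.lookup(k) for k in keys}
```

In this sketch, `moved` holds the keys that would have to be migrated on scale-up, and `restored` matches `before` after the node leaves. A real system would additionally need to move the data itself atomically, which is the role the abstract assigns to the transaction protocol.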

research
09/03/2022

Sion: Elastic Serverless Cloud Storage

Cloud object storage such as AWS S3 is cost-effective and highly elastic...
research
08/13/2021

Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage

Cloud computing provides a powerful yet low-cost environment for distrib...
research
05/08/2018

Round-Hashing for Data Storage: Distributed Servers and External-Memory Tables

This paper proposes round-hashing, which is suitable for data storage on...
research
04/03/2021

Nova-LSM: A Distributed, Component-based LSM-tree Key-value Store

The cloud infrastructure motivates disaggregation of monolithic data sto...
research
10/07/2019

Assise: Performance and Availability via NVM Colocation in a Distributed File System

The adoption of very low latency persistent memory modules (PMMs) upends...
research
02/19/2021

Cornus: One-Phase Commit for Cloud Databases with Storage Disaggregation

Two-phase commit (2PC) has been widely used in distributed databases to ...
research
10/20/2018

MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ...
