tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads

08/10/2020
by   Steven W. D. Chien, et al.
0

Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We can extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow profiler. We visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshan's existing implementation. We illustrate tf-Darshan by performing two case studies on ImageNet image and Malware classification. We show that by guiding optimization using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19 selecting data for staging on fast tier storage. We also show that Darshan has the potential of being used as a runtime library for profiling and providing information for future optimization.

READ FULL TEXT

page 1

page 7

page 8

page 9

page 10

research
01/16/2019

TensorFlow.js: Machine Learning for the Web and Beyond

TensorFlow.js is a library for building and executing machine learning a...
research
08/25/2019

Extending TensorFlow's Semantics with Pipelined Execution

TensorFlow is a popular cloud computing framework that targets machine l...
research
03/11/2019

TensorFlow Doing HPC

TensorFlow is a popular emerging open-source programming framework suppo...
research
11/13/2018

FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs

In recent years, there is a surge on machine learning applications in in...
research
10/21/2018

Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training

Training neural network often uses a machine learning framework such as ...
research
03/20/2019

TATi-Thermodynamic Analytics ToolkIt: TensorFlow-based software for posterior sampling in machine learning applications

We describe a TensorFlow-based library for posterior sampling and explor...
research
08/26/2022

TensorFlow as a DSL for stencil-based computation on the Cerebras Wafer Scale Engine

The Cerebras Wafer Scale Engine (WSE) is an accelerator that combines hu...

Please sign up or login with your details

Forgot password? Click here to reset