CosmoFlow: Using Deep Learning to Learn the Universe at Scale

08/14/2018
by   Amrita Mathuriya, et al.

Deep learning is a promising tool to determine the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework. CosmoFlow uses efficient implementations of 3D convolution and pooling primitives, together with improvements in threading for many element-wise operations, to improve training performance on Intel(C) Xeon Phi(TM) processors. We also utilize the Cray PE Machine Learning Plugin for efficient scaling to multiple nodes. We demonstrate fully synchronous data-parallel training on 8192 nodes of Cori with 77% parallel efficiency, achieving 3.5 Pflop/s sustained performance. To our knowledge, this is the first large-scale science application of the TensorFlow framework at supercomputer scale with fully synchronous training. These enhancements enable us to process large 3D dark matter distributions and predict the cosmological parameters Ω_M, σ_8 and n_s with unprecedented accuracy.
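
The scaling figures in the abstract can be related to one another with a little arithmetic. The sketch below reads the efficiency figure as 77% and assumes that parallel efficiency means measured aggregate throughput divided by (number of nodes × single-node throughput). That definition and the function names are assumptions made for illustration, not taken from the paper, and the single-node rate it derives is only what those assumptions imply, not a reported measurement.

```python
# Illustrative arithmetic only: the efficiency definition used here is an
# assumption, and the function names are hypothetical.

def per_node_rate(aggregate_flops: float, nodes: int) -> float:
    """Average sustained throughput per node, in flop/s."""
    return aggregate_flops / nodes


def implied_single_node_rate(aggregate_flops: float, nodes: int,
                             efficiency: float) -> float:
    """Single-node throughput that ideal (100% efficient) scaling would
    start from, given the measured aggregate rate and efficiency."""
    return aggregate_flops / (nodes * efficiency)


def parallel_efficiency(aggregate_flops: float, nodes: int,
                        single_node_flops: float) -> float:
    """Measured aggregate rate divided by the ideal aggregate rate."""
    return aggregate_flops / (nodes * single_node_flops)


if __name__ == "__main__":
    AGGREGATE = 3.5e15   # 3.5 Pflop/s sustained, from the abstract
    NODES = 8192         # node count, from the abstract
    EFFICIENCY = 0.77    # 77% parallel efficiency, from the abstract

    avg = per_node_rate(AGGREGATE, NODES)
    base = implied_single_node_rate(AGGREGATE, NODES, EFFICIENCY)
    print(f"average per-node rate: {avg / 1e9:.1f} Gflop/s")
    print(f"implied single-node rate: {base / 1e9:.1f} Gflop/s")
    print(f"round-trip efficiency: {parallel_efficiency(AGGREGATE, NODES, base):.2f}")
```

Under this definition, the gap between the average per-node rate and the implied single-node rate is the cost of synchronizing 8192 nodes at every training step.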

research
12/26/2017

Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer

We explore scaling of the standard distributed Tensorflow with GRPC prim...
research
11/30/2018

TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank

TensorFlow Ranking is the first open source library for solving large-sc...
research
11/14/2017

Reinforcement Learning in a large scale photonic Recurrent Neural Network

Photonic Neural Network implementations have been gaining considerable a...
research
03/16/2019

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

This paper reports our efforts on swCaffe, a highly efficient parallel f...
research
03/27/2018

Diagonalwise Refactorization: An Efficient Training Method for Depthwise Convolutions

Depthwise convolutions provide significant performance benefits owing to...
research
11/08/2021

Accelerating GAN training using highly parallel hardware on public cloud

With the increasing number of Machine and Deep Learning applications in ...
research
05/19/2023

A Generic Performance Model for Deep Learning in a Distributed Environment

Performance modelling of a deep learning application is essential to imp...
