Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

by Weijie Zhao et al.

Neural networks in ads systems usually take input from multiple sources, e.g., query-ad relevance, ad features, and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in the online advertising industry can have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory of a computing node. For example, a sponsored online advertising system can contain more than 10^11 sparse features, making the neural network a massive model with around 10 TB of parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive-scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory, and SSD as a 3-layer hierarchical storage. All neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than that of an MPI-cluster solution.
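To make the 3-layer storage idea concrete, here is a minimal sketch (not the paper's actual implementation) of a hierarchical parameter lookup in which hot embedding vectors live in a GPU-HBM-sized cache, warm ones in CPU main memory, and the full model in an SSD-backed store. The class name, capacities, and the crude eviction policy are all illustrative assumptions.

```python
# Illustrative sketch of a 3-tier hierarchical parameter lookup:
# GPU HBM cache -> CPU main memory cache -> SSD store.
# Embedding vectors are plain Python lists; dicts stand in for each
# storage tier, and a FIFO-ish eviction stands in for a real policy.

class HierarchicalParameterServer:
    def __init__(self, gpu_capacity, cpu_capacity, ssd_store):
        self.gpu_cache = {}          # hot parameters, "GPU HBM"
        self.cpu_cache = {}          # warm parameters, "CPU main memory"
        self.ssd_store = ssd_store   # cold parameters, full model on "SSD"
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    def _evict_if_full(self, cache, capacity):
        # Evict the oldest-inserted entry when the tier is full
        # (a stand-in for LRU/LFU or the paper's actual policy).
        if len(cache) >= capacity:
            cache.pop(next(iter(cache)))

    def lookup(self, feature_id):
        # Fast path: parameter already materialized in the GPU tier,
        # so training computation can stay entirely on the GPU.
        if feature_id in self.gpu_cache:
            return self.gpu_cache[feature_id]
        # HBM miss: try CPU main memory, else read from the SSD store.
        if feature_id in self.cpu_cache:
            vec = self.cpu_cache[feature_id]
        else:
            vec = self.ssd_store[feature_id]
            self._evict_if_full(self.cpu_cache, self.cpu_capacity)
            self.cpu_cache[feature_id] = vec
        # Promote to the GPU tier for subsequent mini-batches.
        self._evict_if_full(self.gpu_cache, self.gpu_capacity)
        self.gpu_cache[feature_id] = vec
        return vec

# Usage: a toy 100-feature "model" on SSD, tiny GPU/CPU tiers.
store = {i: [float(i)] * 4 for i in range(100)}
ps = HierarchicalParameterServer(gpu_capacity=8, cpu_capacity=32,
                                 ssd_store=store)
vec = ps.lookup(42)   # SSD -> CPU -> GPU on first access
```

Because per-example nonzero features are sparse, only a small working set of the 10 TB parameter table needs to sit in the upper tiers at any time, which is what makes this layout effective.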




