
DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning
Games are abstractions of the real world, where artificial agents learn ...
1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
Scalable training of large models (like BERT and GPT-3) requires careful...
APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm
Adam is the important optimization algorithm to guarantee efficiency and...
Stochastic Recursive Momentum for Policy Gradient Methods
In this paper, we propose a novel algorithm named STOchastic Recursive M...
Stochastic Recursive Variance Reduction for Efficient Smooth Non-Convex Compositional Optimization
Stochastic compositional optimization arises in many important machine l...
DeepSqueeze: Decentralization Meets Error-Compensated Compression
Communication is a key bottleneck in distributed training. Recently, an ...
DeepSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
Communication is a key bottleneck in distributed training. Recently, an ...
DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
A standard approach in large scale machine learning is distributed stoch...
Revisit Batch Normalization: New Understanding from an Optimization View and a Refinement via Composition Optimization
Batch Normalization (BN) has been used extensively in deep learning to a...
D^2: Decentralized Training over Decentralized Data
While training a machine learning model using multiple workers, each of ...
Asynchronous Decentralized Parallel Stochastic Gradient Descent
Recent work shows that decentralized parallel stochastic gradient descent...
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
Most distributed machine learning systems nowadays, including TensorFlow...
Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization
Asynchronous parallel implementations of stochastic gradient (SG) have b...
Xiangru Lian
