Distributed Bandit Learning: How Much Communication is Needed to Achieve (Near) Optimal Regret
We study the communication complexity of distributed multi-armed bandits (MAB) and distributed linear bandits for regret minimization. We propose communication protocols that achieve near-optimal regret bounds and result in optimal speed-up under mild conditions. We measure the communication cost of a protocol by the total number of real numbers communicated. For multi-armed bandits, we give two low-communication protocols: the cost of one is independent of the time horizon T, and the cost of the other is independent of the number of arms K. In particular, for a distributed K-armed bandit with M agents, our protocols achieve near-optimal regret O(√(MKT log T)) with communication cost O(M log T) and O(MK log M), respectively. We also propose two protocols for d-dimensional distributed linear bandits that achieve near-optimal regret with communication cost O(M^1.5 d^3) and O((Md + d log log d) log T), respectively. The communication cost can thus be made independent of T, or almost linear in d.
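To make the regret/communication trade-off concrete, below is a minimal Python simulation of one natural baseline: M agents each run UCB on pooled arm statistics and synchronize only at doubling epochs. This is an illustrative sketch under assumptions of our own, not the paper's protocol; the epoch schedule, function names, and Bernoulli rewards are all assumed, and this baseline's communication cost, O(MK log T) numbers, is looser than the bounds stated above.

```python
import numpy as np

def distributed_ucb(mu, M, T, rng=None):
    """Toy distributed K-armed bandit: M agents pull arms in parallel and
    synchronize their statistics only at doubling epochs (t = 1, 2, 4, ...),
    so each agent communicates O(K) numbers O(log T) times. Illustrative
    only; not the protocol from the paper.

    mu : true mean reward of each arm (Bernoulli rewards).
    Returns cumulative regret over all M*T pulls.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(mu)
    # Globally synchronized statistics (updated only at sync points).
    g_counts = np.zeros(K)
    g_sums = np.zeros(K)
    # Per-agent local statistics accumulated since the last sync.
    l_counts = np.zeros((M, K))
    l_sums = np.zeros((M, K))
    regret, next_sync = 0.0, 1
    for t in range(1, T + 1):
        for m in range(M):
            n = g_counts + l_counts[m]  # agent m's view of pull counts
            s = g_sums + l_sums[m]
            # UCB index; unpulled arms get +inf so they are tried first.
            ucb = np.where(
                n > 0,
                s / np.maximum(n, 1)
                + np.sqrt(2 * np.log(max(t * M, 2)) / np.maximum(n, 1)),
                np.inf,
            )
            a = int(np.argmax(ucb))
            r = float(rng.random() < mu[a])  # Bernoulli reward
            l_counts[m, a] += 1
            l_sums[m, a] += r
            regret += mu.max() - mu[a]
        if t >= next_sync:  # sync at doubling epochs: t = 1, 2, 4, 8, ...
            g_counts += l_counts.sum(axis=0)  # each agent uploads 2K numbers
            g_sums += l_sums.sum(axis=0)
            l_counts[:] = 0.0
            l_sums[:] = 0.0
            next_sync *= 2
    return regret

print(distributed_ucb(np.array([0.5, 0.6, 0.7]), M=4, T=2000))
```

The doubling schedule is what keeps the number of synchronization rounds logarithmic in T; the protocols in the paper go further by also reducing how much is sent per round.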