UniRank: Unimodal Bandit Algorithm for Online Ranking
We tackle an emerging problem: finding an optimal monopartite matching in a weighted graph. The semi-bandit version, where a full matching is sampled at each iteration, has been addressed by <cit.>, who propose an algorithm with an expected regret in $O\left(\frac{L\log(L)}{\Delta}\log(T)\right)$ with $2L$ players, $T$ iterations and a minimum reward gap $\Delta$. We reduce this bound in two steps. First, as in <cit.> and <cit.>, we use the unimodality property of the expected reward on the appropriate graph to design an algorithm with a regret in $O\left(L\frac{1}{\Delta}\log(T)\right)$. Second, we show that by shifting the focus to the main question `Is user $i$ better than user $j$?' this regret becomes $O\left(L\frac{\Delta}{\tilde{\Delta}^2}\log(T)\right)$, where $\tilde{\Delta} > \Delta$ derives from a better way of comparing users. Finally, experimental results corroborate these theoretical findings in practice.
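To get a feel for how the three regret orders quoted in the abstract relate to one another, the following minimal sketch evaluates their leading terms with all hidden constants set to 1. The function names and the values chosen for $L$, $\Delta$, $\tilde{\Delta}$ and $T$ are hypothetical and purely illustrative; they are not taken from the paper, and the outputs are growth rates rather than actual regret values.

```python
import math


def semi_bandit_order(L, delta, T):
    # Leading term of O(L log(L) / Delta * log(T)), constants dropped.
    return L * math.log(L) / delta * math.log(T)


def unimodal_order(L, delta, T):
    # Leading term of O(L / Delta * log(T)), constants dropped.
    return L / delta * math.log(T)


def pairwise_order(L, delta, delta_tilde, T):
    # Leading term of O(L * Delta / Delta_tilde^2 * log(T)), constants dropped.
    return L * delta / delta_tilde ** 2 * math.log(T)


# Hypothetical values, chosen only for illustration (delta_tilde > delta).
L, delta, delta_tilde, T = 50, 0.05, 0.2, 10 ** 6

b_semi = semi_bandit_order(L, delta, T)
b_uni = unimodal_order(L, delta, T)
b_pair = pairwise_order(L, delta, delta_tilde, T)

# Step 1 removes the log(L) factor; step 2 gains a factor (delta_tilde / delta)^2.
gain_step1 = b_semi / b_uni
gain_step2 = b_uni / b_pair

print(f"semi-bandit order: {b_semi:.1f}")
print(f"unimodal order:    {b_uni:.1f}")
print(f"pairwise order:    {b_pair:.1f}")
print(f"gain from step 1:  {gain_step1:.2f}")
print(f"gain from step 2:  {gain_step2:.2f}")
```

Under these assumptions, the first step improves the order by a factor of $\log(L)$ and the second by a factor of $(\tilde{\Delta}/\Delta)^2$, so the second gain grows as $\tilde{\Delta}$ exceeds $\Delta$ by a wider margin.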