Poster

Regret-Optimal List Replicable Bandit Learning: Matching Upper and Lower Bounds

Michael Chen · A. Pavan · N. V. Vinodchandran · Ruosong Wang · Lin Yang

Hall 3 + Hall 2B #446
Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract: This paper investigates *list replicability* [Dixon et al., 2023] in the context of multi-armed (and also linear) bandits (MAB). We define an algorithm $A$ for MAB to be $(\ell, \delta)$-list replicable if, with probability at least $1 - \delta$, $A$ has at most $\ell$ traces across independent executions even with different random bits, where a trace is the sequence of arms played during an execution. For $k$-armed bandits, although the total number of traces can be $\Omega(k^T)$ for a time horizon $T$, we present several surprising upper bounds that are either independent of or logarithmic in $T$: (1) a $(2^k, \delta)$-list replicable algorithm with near-optimal regret $\widetilde{O}(\sqrt{kT})$; (2) an $(O(k/\delta), \delta)$-list replicable algorithm with regret $\widetilde{O}\left(\frac{k}{\delta}\sqrt{kT}\right)$; (3) a $((k+1)^{B-1}, \delta)$-list replicable algorithm with regret $\widetilde{O}(k^{3/2} T^{1/2 + 2^{-(B+1)}})$ for any integer $B > 1$. On the other hand, for the sublinear-regret regime, we establish a matching lower bound on the list complexity (the parameter $\ell$): we prove that there is no $(k-1, \delta)$-list replicable algorithm with $o(T)$ regret. This is optimal in list complexity in the sublinear-regret regime, as there is a $(k, 0)$-list replicable algorithm with $O(T^{2/3})$ regret. We further show that for linear bandits with $d$-dimensional features, there is a $\widetilde{O}(d^2 T^{1/2 + 2^{-(B+1)}})$-regret algorithm with $((2d+1)^{B-1}, \delta)$-list replicability for $B > 1$, even when the number of possible arms is infinite.
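To make the definition concrete: the abstract notes that a $(k, 0)$-list replicable algorithm with $O(T^{2/3})$ regret exists, and an explore-then-commit policy with a fixed round-robin exploration schedule and a deterministic commit rule has exactly this character, since every trace is determined solely by which arm wins the exploration phase. Below is a minimal, self-contained Python sketch of that idea together with a routine that empirically counts distinct traces over independent runs; all names (`BernoulliBandit`, `explore_then_commit`, `count_traces`) are ours for illustration and are not taken from the paper.

```python
import random
from collections import Counter

class BernoulliBandit:
    """k-armed bandit with Bernoulli rewards (hypothetical test environment)."""
    def __init__(self, means):
        self.means = means

    def pull(self, arm, rng):
        return 1.0 if rng.random() < self.means[arm] else 0.0

def explore_then_commit(env, horizon, rng):
    """Play round-robin for a fixed budget, then commit to the empirically
    best arm. The schedule and commit rule are deterministic, so the trace
    (sequence of arms played) is determined solely by which arm wins the
    exploration phase -- at most k distinct traces are possible."""
    k = len(env.means)
    per_arm = max(1, int(horizon ** (2 / 3)) // k)  # O(T^{2/3}) total exploration
    trace, totals, counts = [], [0.0] * k, [0] * k
    commit = None
    for t in range(horizon):
        if t < k * per_arm:
            arm = t % k                      # identical prefix across all runs
        else:
            if commit is None:               # deterministic tie-breaking via max
                commit = max(range(k), key=lambda a: totals[a] / counts[a])
            arm = commit
        reward = env.pull(arm, rng)
        totals[arm] += reward
        counts[arm] += 1
        trace.append(arm)
    return tuple(trace)

def count_traces(policy, env, horizon, runs=200):
    """Count distinct traces over independent executions: an (ell, delta)-list
    replicable algorithm yields at most ell of them with probability >= 1 - delta."""
    return Counter(policy(env, horizon, random.Random(seed)) for seed in range(runs))

if __name__ == "__main__":
    env = BernoulliBandit([0.9, 0.5, 0.4])
    traces = count_traces(explore_then_commit, env, horizon=300)
    print(f"{len(traces)} distinct traces over 200 independent runs")
```

Since the round-robin prefix is identical across seeds and the commit rule is deterministic, at most $k = 3$ distinct traces can appear here regardless of the reward noise, matching the $(k, 0)$ list-complexity claim in the abstract.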
