

Poster

Learning from Imperfect Human Feedback: A Tale from Corruption-Robust Dueling

Yuwei Cheng · Fan Yao · Xuefeng Liu · Haifeng Xu

Hall 3 + Hall 2B #450
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: This paper studies Learning from Imperfect Human Feedback (LIHF), addressing the potential irrationality or imperfect perception involved in learning from comparative human feedback. Building on evidence that human imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit under a restricted form of corruption: the corruption scale decays over time as $t^{\rho-1}$ for some "imperfection rate" $\rho \in [0, 1]$. With $T$ as the total number of iterations, we establish a regret lower bound of $\Omega(\max\{\sqrt{T}, T^{\rho}\})$ for LIHF, even when $\rho$ is known. For the same setting, we develop the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves a nearly optimal regret of $\tilde{O}(\max\{\sqrt{T}, T^{\rho}\})$. Core to our analysis is a novel framework for analyzing gradient-based dueling bandit algorithms under corruption, and we demonstrate its general applicability by showing how it can be easily applied to obtain corruption-robust guarantees for other popular gradient-based dueling bandit algorithms. Our theoretical results are validated by extensive experiments.
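To make the corruption model concrete, the minimal sketch below simulates a one-dimensional concave-utility dueling bandit whose comparative feedback is corrupted with a scale decaying as $t^{\rho-1}$, as described in the abstract. The quadratic utility, the logistic link, and the one-sided adversary are hypothetical illustrative choices; only the decay rate comes from the paper. This is not the RoSMID algorithm or the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def utility(x):
    # Hypothetical concave utility, maximized at x = 0.3.
    return -(x - 0.3) ** 2

def link(d):
    # Logistic link: probability that x beats x' given utility gap d (an assumption).
    return 1.0 / (1.0 + np.exp(-d))

def duel(x, x_prime, t, rho):
    """Return 1 if x wins the round-t comparison, under decaying corruption."""
    p = link(utility(x) - utility(x_prime))
    c_t = t ** (rho - 1.0)  # corruption scale decays as t^(rho - 1), rho in [0, 1]
    # Hypothetical adversary: bias the comparison against x by up to c_t.
    p_corrupt = np.clip(p - c_t, 0.0, 1.0)
    return int(rng.random() < p_corrupt)

# Corruption vanishes like 1/t when rho = 0 and stays constant when rho = 1,
# matching the intuition that larger rho means slower human improvement.
for rho in (0.0, 0.5, 1.0):
    print(rho, [round(t ** (rho - 1.0), 3) for t in (1, 10, 100)])
```

Under this reading, $\rho = 0$ corresponds to feedback that rapidly becomes reliable, while $\rho = 1$ keeps a constant corruption level, which is consistent with the $T^{\rho}$ term dominating the regret bound for large $\rho$.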
