Poster
Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
Yuheng Zhang · Nan Jiang
Hall 3 + Hall 2B #459
We investigate off-policy evaluation (OPE), a central and fundamental problem in reinforcement learning (RL), in the challenging setting of Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. Recent works of Uehara et al. (2023a); Zhang & Jiang (2024) developed a model-free framework and identified important coverage assumptions (called belief and outcome coverage) that enable accurate OPE of memoryless policies with polynomial sample complexities, but handling more general target policies that depend on the entire observable history remained an open problem. In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm—whose analysis has surprisingly eluded the literature despite the algorithm's simplicity—demonstrating provable separation between model-free and model-based OPE in POMDPs.