Policy Optimization with Second-Order Advantage Information

Workshop

Jiajin Li · Baoxiang Wang

East Meeting Level 8 + 15 #20

Mon 30 Apr, 11 a.m. PDT

Policy optimization on high-dimensional action spaces is difficult because policy gradient estimators have high variance. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and control variates (CV) into a unified framework to reduce this variance. To invoke RB, the algorithm learns the underlying factorization structure of the action space from the second-order gradient of the advantage function with respect to the action. Empirical studies demonstrate performance improvements on high-dimensional synthetic settings and on OpenAI Gym's MuJoCo continuous control tasks.
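
The abstract does not spell out how the factorization is extracted, so the following is only a minimal sketch of the underlying idea, not the authors' algorithm. It assumes that two action dimensions can be placed in separate subspaces when the cross second derivative of the advantage with respect to them is (near) zero. The function names, the toy advantage, the finite-difference Hessian, and the threshold `tol` are all hypothetical choices made for illustration.

```python
import numpy as np


def hessian_fd(f, a, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at action a."""
    a = np.asarray(a, dtype=float)
    d = a.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ej = np.zeros(d)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (
                f(a + ei + ej) - f(a + ei - ej)
                - f(a - ei + ej) + f(a - ei - ej)
            ) / (4.0 * eps ** 2)
    return H


def action_subspaces(H, tol=1e-3):
    """Group action dimensions into connected components of the graph
    that has an edge (i, j) whenever |H[i, j]| > tol (assumed criterion)."""
    d = H.shape[0]
    adj = np.abs(H) > tol
    seen, groups = set(), []
    for start in range(d):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:  # depth-first search over coupled dimensions
            i = stack.pop()
            comp.append(i)
            for j in range(d):
                if j not in seen and (adj[i, j] or adj[j, i]):
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(comp))
    return groups


def toy_advantage(a):
    """Hypothetical advantage at a fixed state: dims 0 and 1 interact,
    dim 2 is curved on its own, dim 3 enters only linearly."""
    return a[0] * a[1] - a[2] ** 2 + 0.5 * a[3]


a0 = np.array([0.3, -0.2, 0.5, 0.1])
H = hessian_fd(toy_advantage, a0)
groups = action_subspaces(H)
print(groups)  # [[0, 1], [2], [3]]
```

In this toy case only dimensions 0 and 1 are coupled through the Hessian, so they form one subspace and the remaining dimensions stand alone. A learned advantage network would typically use automatic differentiation rather than finite differences, and the threshold would need tuning.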
