ICLR Poster On Penalty Methods for Nonconvex Bilevel Optimization and First-Order Stochastic Approximation

Spotlight Poster

Jeongyeol Kwon · Dohyun Kwon · Stephen Wright · Robert Nowak

Halle B #283

Abstract: In this work, we study first-order algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper- and lower-level objectives are combined in a weighted sum with penalty parameter $\sigma > 0$. In particular, we establish a strong connection between the penalty function and the hyper-objective by explicitly characterizing the conditions under which the values and derivatives of the two must be $O(\sigma)$-close. A by-product of our analysis is an explicit formula for the gradient of the hyper-objective when the lower-level problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as an $O(\sigma)$-approximation of the original BO, we propose first-order algorithms that find an $\epsilon$-stationary solution by optimizing the penalty formulation with $\sigma = O(\epsilon)$. When the perturbed lower-level problem uniformly satisfies the small-error proximal error-bound (EB) condition, we propose a first-order algorithm that converges to an $\epsilon$-stationary point of the penalty function using in total $O(\epsilon^{-7})$ accesses to first-order stochastic gradient oracles. Under an additional assumption on stochastic oracles, we show that the algorithm can be implemented in a fully single-loop manner, i.e., with $O(1)$ samples per iteration, and achieves the improved oracle complexity of $O(\epsilon^{-5})$.
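
To make the $O(\sigma)$-closeness claim concrete, here is a minimal numerical sketch on a toy problem. It is not the authors' algorithm. Everything in it is an assumption made for illustration: the quadratic objectives `f` and `g`, the unconstrained scalar variables, and the penalty form $\min_y \{ f(x,y) + (g(x,y) - \min_z g(x,z))/\sigma \}$, which is one common way to write a penalty function and which the abstract itself does not spell out. The lower-level problem here has a unique solution, so this is the easy case rather than the multiple-solution setting the paper analyzes.

```python
# Hypothetical toy bilevel problem, chosen only because it is easy to check:
#   upper level  f(x, y) = (y - 1)^2
#   lower level  g(x, y) = (y - x)^2, unique minimizer y*(x) = x, min value 0


def f(x, y):
    return (y - 1.0) ** 2


def g(x, y):
    return (y - x) ** 2


def hyper_objective(x):
    """psi(x) = f(x, y*(x)), using y*(x) = x for this toy problem."""
    return f(x, x)


def penalty_value(x, sigma, steps=200, lr=0.25):
    """Assumed penalty form: min_y [ f(x, y) + (g(x, y) - min_z g(x, z)) / sigma ].

    The inner minimization runs plain gradient descent on the rescaled
    objective sigma * f + g, which has the same minimizer in y and stays
    well conditioned as sigma shrinks. For this toy g, min_z g(x, z) = 0.
    The default step size is only suitable for sigma below 1.
    """
    y = 0.0
    for _ in range(steps):
        grad = 2.0 * sigma * (y - 1.0) + 2.0 * (y - x)
        y -= lr * grad
    return f(x, y) + g(x, y) / sigma


def gap(x, sigma):
    """Absolute difference between the penalty value and the hyper-objective."""
    return abs(penalty_value(x, sigma) - hyper_objective(x))


x = 3.0
for sigma in (1e-1, 1e-2, 1e-3):
    value_gap = gap(x, sigma)
    print(f"sigma={sigma:g}  gap={value_gap:.6f}  gap/sigma={value_gap / sigma:.4f}")
```

For this particular toy problem the gap can also be computed by hand: the inner minimizer is $y = (x + \sigma)/(1 + \sigma)$, which gives a gap of $\sigma (x-1)^2 / (1 + \sigma)$, so the ratio gap/sigma stays bounded as $\sigma \to 0$. Choosing $\sigma = O(\epsilon)$, as the abstract does, then keeps the penalty value within $O(\epsilon)$ of the hyper-objective in this example.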
