Poster in Workshop: Bridging the Gap Between Practice and Theory in Deep Learning
In-context Newton’s method for regression: Transformers can provably converge
Angeliki Giannou · Liu Yang · Tianhao Wang · Dimitris Papailiopoulos · Jason Lee
Abstract:
Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into the underlying mechanisms. Recent studies suggest that Transformers can implement first-order optimization algorithms, such as gradient descent, for in-context learning. Inspired by these works, we study whether Transformers can also perform higher-order optimization methods. We substantiate this by first demonstrating that even linear attention-only Transformers can implement a single step of Newton's iteration for matrix inversion with merely two layers. We use this result to establish that linear attention Transformers with ReLU layers can approximate second-order optimization algorithms for the task of logistic regression, achieving $\epsilon$ error with only logarithmically more layers.
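For reference, a standard instantiation of Newton's iteration for inverting a matrix $A$ (the abstract does not spell out the exact variant used in the paper, so this is only an illustrative form) is the Newton–Schulz update
$$X_{k+1} = X_k\,(2I - A X_k),$$
which converges quadratically to $A^{-1}$ whenever $\|I - A X_0\| < 1$. The result above states that two linear-attention layers suffice to implement one such update.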