Oral
in
Workshop: 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities

Learning the RoPEs: Better 2D and 3D Position Encodings with STRING

Connor Schenck · Isaac Reid · Mithun George Jacob · Alex Bewley · Joshua Ainslie · David Rendleman · Deepali Jain · Mohit Sharma · Kumar Dubey · Ayzaan Wahid · Sumeet Singh · René Wagner · Tianli Ding · Chuyuan Fu · Arunkumar Byravan · Jake Varley · Alexey Gritsenko · Matthias Minderer · Dmitry Kalashnikov · Jonathan Tompson · Vikas Sindhwani · Krzysztof Choromanski

Project Page [ OpenReview]

Abstract

We introduce $\textbf{STRING}$: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides translation invariance, including token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods. Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics.

Chat is not available.