Oral in Workshop: 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities
Learning the RoPEs: Better 2D and 3D Position Encodings with STRING
Connor Schenck · Isaac Reid · Mithun George Jacob · Alex Bewley · Joshua Ainslie · David Rendleman · Deepali Jain · Mohit Sharma · Kumar Dubey · Ayzaan Wahid · Sumeet Singh · René Wagner · Tianli Ding · Chuyuan Fu · Arunkumar Byravan · Jake Varley · Alexey Gritsenko · Matthias Minderer · Dmitry Kalashnikov · Jonathan Tompson · Vikas Sindhwani · Krzysztof Choromanski
presentation:
7th Robot Learning Workshop: Towards Robots with Human-Level Abilities
Sat 26 Apr 5:55 p.m. PDT — 3 a.m. PDT
Abstract:
We introduce $\textbf{STRING}$: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides translation invariance, including for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods. Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics.
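
To illustrate the translation-invariance property the abstract refers to, here is a minimal NumPy sketch. It implements plain axis-wise Rotary Position Encodings on 3D token coordinates, not STRING itself; the function names (`rope_rotate`, `axial_rope`), the frequency base, and the even split of features across axes are all assumptions made for illustration. It checks that the query-key score is unchanged when both token positions are shifted by the same offset.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard 1D rotary encoding of vector x at scalar position pos.

    Hypothetical helper: consecutive feature pairs are rotated by
    angles proportional to pos, with geometrically spaced frequencies.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def axial_rope(x, coords):
    """Assumed multi-dimensional variant: split the features evenly
    across coordinate axes and apply 1D RoPE independently per axis."""
    chunks = np.split(x, len(coords))
    return np.concatenate([rope_rotate(c, p) for c, p in zip(chunks, coords)])

rng = np.random.default_rng(0)
q = rng.normal(size=12)              # query features (4 per axis)
k = rng.normal(size=12)              # key features
p_q = np.array([1.0, 2.0, 0.5])      # 3D position of the query token
p_k = np.array([-0.3, 4.0, 2.5])     # 3D position of the key token
shift = np.array([7.0, -3.0, 1.25])  # common translation of both tokens

score = axial_rope(q, p_q) @ axial_rope(k, p_k)
score_shifted = axial_rope(q, p_q + shift) @ axial_rope(k, p_k + shift)

# The score depends only on the relative position p_k - p_q.
print(np.isclose(score, score_shifted))
```

Because each feature pair is rotated by an angle linear in the coordinate, the dot product of two encoded vectors depends only on the difference of their positions, which is the invariance that STRING is stated to preserve in its more general framework.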