Environment Predictive Coding for Visual Navigation

Santhosh Kumar Ramakrishnan · Tushar Nagarajan · Ziad Al-Halah · Kristen Grauman

Keywords: [ self-supervised learning ] [ visual navigation ] [ representation learning ]

[ Abstract ]
[ Visit Poster at Spot G2 in Virtual World ] [ OpenReview
Mon 25 Apr 10:30 a.m. PDT — 12:30 p.m. PDT


We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for individual images, we aim to encode a 3D environment using a series of images observed by an agent moving in it. We learn these representations via a masked-zone prediction task, which segments an agent’s trajectory into zones and then predicts features of randomly masked zones, conditioned on the agent’s camera poses. This explicit spatial conditioning encourages learning representations that capture the geometric and semantic regularities of 3D environments. We learn such representations on a collection of video walkthroughs and demonstrate successful transfer to multiple downstream navigation tasks. Our experiments on the real-world scanned 3D environments of Gibson and Matterport3D show that our method obtains 2 - 6× higher sample-efficiency and up to 57% higher performance over standard image-representation learning.

Chat is not available.