
Workshop: Reincarnating Reinforcement Learning

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Arjun Majumdar · Karmesh Yadav · Sergio Arnaud · Yecheng Jason Ma · Claire Chen · Sneha Silwal · Aryan Jain · Vincent-Pierre Berges · Pieter Abbeel · Dhruv Batra · Yixin Lin · Oleksandr Maksymets · Aravind Rajeswaran · Franziska Meier


We present the largest and most comprehensive empirical study of visual foundation models for Embodied AI (EAI). First, we curate CORTEXBENCH, consisting of 17 different EAI tasks spanning locomotion, navigation, dexterous manipulation, and mobile manipulation. Next, we systematically evaluate existing visual foundation models and find that none is universally dominant. To study the effect of pre-training data scale and diversity, we combine ImageNet with over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. These models required over 10,000 GPU-hours to train and will be open-sourced to the community. We find that scaling dataset size and diversity does not improve performance across all tasks but does so on average. Finally, we show that adding a second pre-training step on a small in-domain dataset improves performance, matching or outperforming the best known results in this setting.
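The core of the MAE pre-training step mentioned above is random patch masking: a large fraction of image patches is hidden, and the model is trained to reconstruct only the hidden ones. The sketch below illustrates just the masking step; the 75% mask ratio and the function name are assumptions taken from the original MAE recipe, not from the authors' released code.

```python
# Minimal sketch of MAE-style random patch masking.
# Assumptions (not from the paper's code): 75% mask ratio, a 14x14
# patch grid as used by a standard ViT on 224px inputs.
import random

def mask_patches(num_patches, mask_ratio=0.75, seed=0):
    """Return (visible, masked) patch-index lists via random masking."""
    rng = random.Random(seed)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    n_masked = int(num_patches * mask_ratio)
    # The encoder only sees the visible patches; the reconstruction
    # loss is computed on the masked patches.
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])

visible, masked = mask_patches(196)  # 14 x 14 = 196 patches
print(len(visible), len(masked))     # 49 visible, 147 masked
```

During pre-training, the encoder processes only the visible patches (which is what makes high mask ratios computationally cheap), and a lightweight decoder reconstructs pixel values for the masked ones.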
