Robusto-2: Benchmarking Humans and VLMs for Autonomous Driving in Lima & NYC
Abstract
As self-driving cars continue to be deployed in cities around the world, how well will these systems generalize when exposed to new geographies? Moreover, how well can current multi-modal Vision Language Models (VLMs) understand and act when faced with bizarre edge-case scenarios? In this talk I will aim to answer these questions through a Visual Question Answering (VQA) framework, in which we show humans and VLMs a series of our own recorded dashcam footage from Lima and New York City and test where their responses converge and diverge. We test for these similarities and divergences in a factorial analysis with three groups (humans from NYC, humans from Lima, and VLMs) and two sets of first-person dashcam data, one recorded in Lima and one in New York City.