Poster
in
Workshop: I Can't Believe It's Not Better: Where Large Language Models need to improve

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

Apratim Bhattacharyya ⋅ Bicheng Xu ⋅ Sanjay Haresh ⋅ Reza Pourreza ⋅ Litian Liu ⋅ Sunny Panchal ⋅ Pulkit Madan ⋅ Leonid Sigal ⋅ Roland Memisevic

Project Page [ Poster] [ OpenReview]

Abstract

Current state of the art multi-modal Large Language Models (LLM) have advanced conversational abilities. However, their effectiveness as coaches for learning everyday skills by providing live, interactive step-by-step guidance is still untested. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real-time. To evaluate such capabilities, we introduce \benchmark{}, a new benchmark and dataset built upon CaptainCook4D, features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. Extensive evaluation shows that current state of the art multi-modal LLMs struggle with providing live, interactive step-by-step guidance.

Chat is not available.