Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Abstract
Current state of the art multi-modal Large Language Models (LLM) have advanced conversational abilities. However, their effectiveness as coaches for learning everyday skills by providing live, interactive step-by-step guidance is still untested. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real-time. To evaluate such capabilities, we introduce \benchmark{}, a new benchmark and dataset built upon CaptainCook4D, features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. Extensive evaluation shows that current state of the art multi-modal LLMs struggle with providing live, interactive step-by-step guidance.