ICLR Poster Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Poster

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Orr Zohar · Xiaohan Wang · Yonatan Bitton · Idan Szpektor · Serena Yeung

Hall 3 + Hall 2B #91

[ Abstract ] [ Project Page ]

Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: The performance and reasoning capabilities of Large Multi-modal Models (LMMs) is dependent on the size and quality of their training datasets. However, collecting datasets that support chain-of-thought instruction tuning is highly challenging. Existing video instruction tuning datasets are often derived by prompting large language models with video captions to generate question-answer pairs, which makes them predominantly descriptive rather than reasoning-focused. Meanwhile, many labeled video datasets with diverse labels and supervision exist -- however, we find that their integration into LMMs is non-trivial. Herein, we present

$\underline{\text{Video}}$

$\underline{\text{S}}\text{elf}$ -

$\underline{\text{T}}\text{raining}$

$\text{with}$

$\underline{\text{a}}\text{ugmented}$

$\underline{\text{R}}\text{easoning}$ (Video-STaR), the first self-training approach for video instruction tuning. Video-STaR allows the utilization of *any* labeled video dataset for video instruction tuning.In Video-STaR, an LMM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LMMs to novel downstream tasks with existing supervision. During instruction generation, an LMM is prompted to propose an answer. The answers are then filtered only to those that contain the original video labels, and the LMM is then re-trained on the generated dataset. By training exclusively on generated answers containing the correct video labels, Video-STaR leverages these existing labels as weak supervision for video instruction tuning.Our results demonstrate that Video-STaR-augmented LMMs achieve notable improvements in (I) general Video QA, where TempCompass performance improved by 6.1%, *and* (II) downstream tasks, with a 9.9% increase in Kinetics700-QA accuracy and a 4.0% improvement in action quality assessment on FineDiving, while also exhibiting better interpretability.

Live content is unavailable. Log in and register to view live content