Split Happens (But Your Video Model Can Be Edited)
Abstract
Recognition models are typically trained on fixed taxonomies. Yet these taxonomies can be too coarse, collapsing meaningful distinctions under a single label (e.g., the action “open” may vary by object, manner, or outcome), and they also evolve as new distinctions become relevant. Collecting annotations and retraining to accommodate such changes is costly. We introduce category splitting, a new task in which an existing classifier is edited to refine a coarse class into finer subcategories while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video models to expose fine-grained distinctions without additional data. We also show that low-shot fine-tuning, though simple, is highly effective and benefits further from zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the remaining classes.