Poster
Visually Consistent Hierarchical Image Classification
Seulki Park · Youren Zhang · Stella Yu · Sara Beery · Jonathan Huang
Hall 3 + Hall 2B #80
Hierarchical classification requires predicting an entire taxonomy tree rather than a single flat level, which demands both accurate predictions at each level and consistency across levels. However, solving hierarchical classification often compromises fine-grained accuracy compared to flat classification because each level requires distinct features, making it a multi-task problem. For example, the fine-grained classification of "Green Hermit" and "Ruby-throated Hummingbird" demands more specific details, while distinguishing between "bird" and "plant" at the coarse level requires broader features. Prior methods address this by using separate blocks for each level to learn distinct features. However, this approach struggles to resolve inconsistencies, as classifiers tend to focus on different, unrelated regions. Our key insight is that classifiers across levels should be grounded in consistent visual cues. For example, the fine-grained classifier may focus on details such as the beak and wings to identify a "Green Hermit", and the coarse classifier then identifies "bird" by grouping these details into the overall "bird" shape. Therefore, we propose a novel hierarchical model that grounds fine-to-coarse semantic parsing on consistent hierarchical visual segmentation. We also introduce a tree-path KL divergence loss to enforce semantic consistency across levels. Our approach significantly outperforms zero-shot CLIP and other state-of-the-art methods on common hierarchical classification benchmarks.
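The abstract names a tree-path KL divergence loss but does not define it. The sketch below shows one plausible reading, not the authors' formulation: the logits of all levels are concatenated and passed through a single softmax, and the result is compared by KL divergence against a target that places mass 1/L on each node of the ground-truth path through an L-level taxonomy. The function names (`softmax`, `tree_path_kl`), the toy two-level taxonomy, and the logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tree_path_kl(level_logits, path):
    """KL(target || prediction) over the concatenated taxonomy levels.

    level_logits: one list of logits per level, ordered coarse to fine.
    path: ground-truth class index at each level (same order).
    ASSUMPTION: single softmax over all levels and a uniform 1/L target
    on the ground-truth path; the paper may define the loss differently.
    """
    num_levels = len(level_logits)

    # Concatenate every level's logits into one prediction vector.
    flat = [x for logits in level_logits for x in logits]
    pred = softmax(flat)

    # Target distribution: mass 1/L on each node of the ground-truth path.
    target = []
    for logits, idx in zip(level_logits, path):
        target.extend(
            1.0 / num_levels if i == idx else 0.0 for i in range(len(logits))
        )

    # Terms with zero target mass contribute nothing to the KL sum.
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)


# Hypothetical toy taxonomy: coarse = [bird, plant],
# fine = [Green Hermit, Ruby-throated Hummingbird, fern].
true_path = [0, 0]  # bird -> Green Hermit

consistent = [[5.0, 0.0], [5.0, 0.0, 0.0]]    # bird + Green Hermit
inconsistent = [[0.0, 5.0], [5.0, 0.0, 0.0]]  # plant + Green Hermit

loss_consistent = tree_path_kl(consistent, true_path)
loss_inconsistent = tree_path_kl(inconsistent, true_path)
```

Under this reading, a prediction that pairs "plant" with "Green Hermit" is penalized more heavily than one that follows the true path at both levels, because probability mass taken away from any node on the path raises the divergence.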