Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
Abstract
DINOv2 sees the world well enough to guide robots and segment images, but we still do not know what it sees. We conduct the first comprehensive analysis of DINOv2’s representational structure using overcomplete dictionary learning, extracting over 32,000 visual concepts in the largest interpretability study of any vision foundation model to date. This method provides the backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits “Elsewhere” concepts that fire everywhere except on the target object, implementing learned negations; segmentation relies exclusively on boundary detectors that form coherent subspaces; depth estimation draws on three distinct families of monocular cues that match principles from visual neuroscience. Turning to concept geometry and statistics, we find that the learned dictionary deviates from the ideal near-orthogonal (Grassmannian) structure, exhibiting higher coherence than random baselines. Concept atoms are not aligned with the neuron basis, confirming distributed encoding. We discover antipodal concept pairs that encode opposite semantics (e.g., “white shirt” vs. “black shirt”), creating signed semantic axes. Separately, we identify concepts that activate exclusively on register tokens, revealing that these tokens encode global scene properties such as motion blur and illumination. Across layers, positional information collapses toward a 2D sheet, yet within single images token geometry remains smooth and clustered even after position is removed, calling into question a purely sparse-coding view of the representation. To resolve this paradox, we advance a different view: tokens are formed by combining convex mixtures of a few archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). Multi-head attention directly implements this construction, with activations behaving like sums of convex regions. In this picture, concepts are expressed by proximity to landmarks and by membership in regions, not by unbounded linear directions. We call this the Minkowski Representation Hypothesis (MRH), and we examine its empirical signals and consequences for how we study, steer, and interpret vision-transformer representations.
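As a toy illustration of the Minkowski Representation Hypothesis sketched above, the following numpy snippet (not taken from the paper; the families, archetype counts, and dimension are purely illustrative) forms a token embedding as a sum of convex mixtures of per-family archetypes, i.e., a point in the Minkowski sum of the families’ convex hulls:

```python
# Toy sketch (assumptions, not the paper's code): a token as a Minkowski sum
# of convex mixtures, one convex combination per concept family.
import numpy as np

rng = np.random.default_rng(0)
d = 768  # illustrative token width

# Hypothetical archetype dictionaries: a few landmark vectors per family.
families = {
    "animal":  rng.normal(size=(5, d)),
    "color":   rng.normal(size=(4, d)),
    "texture": rng.normal(size=(3, d)),
}

def convex_mixture(archetypes, rng):
    """Sample a sparse convex combination of one family's archetypes."""
    k = archetypes.shape[0]
    weights = rng.dirichlet(np.full(k, 0.3))  # nonnegative, sums to 1
    return weights @ archetypes, weights

# The token is the sum over families of one convex mixture each,
# i.e., a point in the Minkowski sum of the families' convex hulls.
token = np.zeros(d)
for name, archetypes in families.items():
    mix, w = convex_mixture(archetypes, rng)
    token += mix
    print(f"{name}: dominant archetype {w.argmax()} (weight {w.max():.2f})")

print("token norm:", float(np.linalg.norm(token)))
```

Under this picture, reading a concept off the token amounts to locating it relative to a family’s archetypes (proximity to landmarks, membership in a region) rather than projecting onto an unbounded linear direction.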