ProCLIP: Product Space Multimodal Contrastive Alignment
Abstract
Contrastive learning has become a dominant paradigm for multimodal representation learning, aligning paired observations (e.g., image-text pairs) in a shared embedding space. Most existing approaches, however, adopt a single latent geometry, implicitly assuming that one manifold can capture the heterogeneous structure of multimodal semantics. In this work, we propose a mixed-curvature product embedding space that combines hyperbolic, Euclidean, and spherical factors within a unified latent manifold. We equip this product space with a weighted product metric and define a geometry-aware CLIP objective that measures cross-modal similarity under this metric, yielding a drop-in replacement for standard cosine-based alignment while respecting manifold constraints. We instantiate the ProCLIP framework with lightweight manifold-specific projection heads and evaluate it on standard image-text retrieval benchmarks. Empirically, mixed-curvature product embeddings consistently improve cross-modal retrieval and alignment over single-manifold baselines, and our analyses highlight regimes in which heterogeneous curvature provides the largest gains.
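As an illustration of the idea summarized above, the following is a minimal sketch of a weighted product distance over one hyperbolic, one Euclidean, and one spherical factor, plugged into a symmetric CLIP-style contrastive loss. The abstract does not give the exact formulas, so everything here is an assumption: the Poincare-ball model with curvature -1, the unit sphere, the squared-distance weighting, the use of negative distance divided by a temperature as similarity, and all function and parameter names (`product_dist`, `clip_loss`, `weights`, `temperature`) are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def spherical_dist(u, v):
    # Great-circle distance between unit-norm vectors (assumed normalized);
    # the dot product is clamped to [-1, 1] for numerical safety.
    return math.acos(max(-1.0, min(1.0, dot(u, v))))

def poincare_dist(u, v):
    # Geodesic distance in the Poincare ball of curvature -1
    # (assumed model; both points must have norm < 1).
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - dot(u, u)) * (1.0 - dot(v, v))
    return math.acosh(1.0 + 2.0 * diff2 / denom)

def product_dist(x, y, weights=(1.0, 1.0, 1.0)):
    # x and y are (hyperbolic, euclidean, spherical) triples of coordinate lists.
    # Assumed form: square root of the weighted sum of squared factor distances.
    w_h, w_e, w_s = weights
    d2 = (w_h * poincare_dist(x[0], y[0]) ** 2
          + w_e * euclidean_dist(x[1], y[1]) ** 2
          + w_s * spherical_dist(x[2], y[2]) ** 2)
    return math.sqrt(d2)

def log_softmax_at(logits, idx):
    # Numerically stable log-softmax evaluated at a single index.
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[idx] - lse

def clip_loss(images, texts, weights=(1.0, 1.0, 1.0), temperature=0.1):
    # Symmetric InfoNCE over a batch of paired embeddings; similarity is the
    # negative product distance scaled by a temperature (an assumption).
    n = len(images)
    logits = [[-product_dist(images[i], texts[j], weights) / temperature
               for j in range(n)] for i in range(n)]
    loss_i2t = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    loss_t2i = -sum(log_softmax_at([logits[i][j] for i in range(n)], j)
                    for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Two toy embeddings, each a (hyperbolic, euclidean, spherical) triple.
a = ([0.0, 0.0], [0.0], [1.0, 0.0])
b = ([0.5, 0.0], [1.0], [0.0, 1.0])

aligned = clip_loss([a, b], [a, b])      # each image matches its own text
misaligned = clip_loss([a, b], [b, a])   # pairs are swapped
```

In this toy setup the correctly paired batch yields a much smaller loss than the swapped one, and setting `weights=(0.0, 1.0, 0.0)` reduces `product_dist` to the plain Euclidean distance on the middle factor, which is one way to view a single-manifold baseline as a special case of the product space.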