A Framework for Aligning Human Linguistics and AI Perception
Abstract
Grounding natural language in perceptual representations is central to both human cognition and AI reasoning, yet it remains challenging under ambiguity and partial information. We present a computational framework that models key aspects of human referential interpretation by aligning linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The approach approximates human perceptual categorization using Scale-Invariant Feature Transform (SIFT) alignment and the Universal Quality Index (UQI), while lightweight linguistic preprocessing captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a benchmark designed to probe perceptual ambiguity and coordination in human communication. The system achieves robust grounding: it requires 65\% fewer utterances than human interlocutors to establish stable mappings and correctly identifies the target from a single utterance 41.66\% of the time, compared with 20\% for humans. These results suggest that relatively simple perceptual–linguistic alignment mechanisms can exhibit human-competitive behavior on a classic cognitive task, offering insights into grounded reasoning, perceptual inference, and cross-modal concept formation.
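For readers unfamiliar with the Universal Quality Index mentioned above, the following is a minimal sketch of the standard global UQI formula (Wang and Bovik), Q = 4 σ_xy x̄ ȳ / ((σ_x² + σ_y²)(x̄² + ȳ²)), applied to two equal-length sequences of pixel intensities. It is illustrative only: the function name `uqi`, the flat-list input format, and the choice to return NaN when the index is undefined are assumptions made here, and the sketch does not reproduce the sliding-window averaging of the original index or the way this framework combines UQI with SIFT alignment, neither of which is specified in the abstract.

```python
import math
from statistics import fmean


def uqi(x, y):
    """Global Universal Quality Index between two equal-length
    sequences of pixel intensities (hypothetical helper).

    Values lie in [-1, 1]; 1 is reached when the two inputs are identical.
    """
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two samples")

    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)

    # Unbiased sample variances and covariance.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

    denominator = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denominator == 0:
        # The index is undefined here (for example, two constant images).
        # Returning NaN is an assumption of this sketch.
        return math.nan

    return 4 * cov_xy * mean_x * mean_y / denominator


if __name__ == "__main__":
    reference = [10, 20, 30, 40]
    print(uqi(reference, reference))        # identical inputs
    print(uqi(reference, [40, 30, 20, 10]))  # reversed intensities
    print(uqi(reference, [20, 40, 60, 80]))  # same structure, doubled brightness
```

The index factors into correlation, luminance similarity, and contrast similarity, so a candidate image that matches the reference in structure but differs in brightness or contrast scores below 1 without being rejected outright.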