Grounding the "Not": Symbolic Representation of Negation for Logical Reasoning in VLMs
Inha Kang ⋅ Seonho Lee ⋅ Jiho Choi ⋅ Junsuk Choe ⋅ Hyunjung Shim
Abstract
Despite their remarkable capabilities in natural language understanding, Vision-Language Models (VLMs) exhibit critical bottlenecks in fundamental logical reasoning, particularly in processing the logical operator of negation. This deficiency frequently results in self-contradictory predictions, where a model fails to differentiate between a concept and its negation ($A$ vs. $\neg A$), a phenomenon often observed as "affirmative bias" in visual contexts. In this work, we use Described Object Detection (DOD) as a rigorous testbed to evaluate and resolve these logical inconsistencies, and we make two contributions. First, we introduce CoVAND, a dataset constructed via a deductive chain-of-thought (CoT) reasoning pipeline that synthesizes consistent, instance-grounded logical propositions. Second, we present NegToMe, a novel token merging module that acts as a symbolic representation mechanism. NegToMe directly mitigates the structural loss of logical operators caused by standard tokenization: by explicitly binding negation cues to their target operands (e.g., merging "not" and "girl" into a single, structurally coherent $\neg \text{girl}$ token), it preserves strict logical polarity at the input representation level. Evaluated on rigorous consistency benchmarks, our lightweight adaptation approach significantly reduces self-contradictory false positives and boosts NMS-AP by up to 10.8 points on OVDEval. This work demonstrates an effective framework for embedding symbolic logical operations into VLMs, paving the way for more reliable deductive reasoning in multimodal applications.
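To make the idea of binding a negation cue to its operand concrete, the following is a minimal illustrative sketch and not the authors' NegToMe implementation. It assumes a hypothetical cue list (`NEGATION_CUES`), a hypothetical function name (`merge_negation_tokens`), and a simple weighted average (`cue_weight`) as the merge rule; the abstract does not specify any of these. It also assumes the operand is the token immediately after the cue, which a real system would need to determine more carefully.

```python
from typing import List, Sequence, Tuple

# Hypothetical cue list, chosen only for illustration.
NEGATION_CUES = {"not", "no", "without"}


def merge_negation_tokens(
    tokens: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    cue_weight: float = 0.5,
) -> Tuple[List[str], List[List[float]]]:
    """Merge each negation cue with the token that follows it.

    The merged embedding is a weighted average of the cue and operand
    embeddings. This merge rule is an assumption of the sketch, not a
    detail taken from the paper.
    """
    if len(tokens) != len(embeddings):
        raise ValueError("tokens and embeddings must have the same length")

    merged_tokens: List[str] = []
    merged_embeddings: List[List[float]] = []
    i = 0
    while i < len(tokens):
        is_cue = tokens[i].lower() in NEGATION_CUES
        if is_cue and i + 1 < len(tokens):
            cue_vec, operand_vec = embeddings[i], embeddings[i + 1]
            # One token now carries both the operator and its operand.
            merged_tokens.append("¬" + tokens[i + 1])
            merged_embeddings.append(
                [cue_weight * c + (1.0 - cue_weight) * o
                 for c, o in zip(cue_vec, operand_vec)]
            )
            i += 2  # skip the operand, it has been absorbed
        else:
            merged_tokens.append(tokens[i])
            merged_embeddings.append(list(embeddings[i]))
            i += 1
    return merged_tokens, merged_embeddings


# Toy 2-dimensional embeddings, made up for the example.
example_tokens = ["person", "not", "girl"]
example_embeddings = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0]]
out_tokens, out_embeddings = merge_negation_tokens(
    example_tokens, example_embeddings
)
```

In this toy run the three input tokens collapse to two, with "not" and "girl" represented by one merged vector, so later layers cannot attend to "girl" without also seeing its polarity. How the actual module selects operands and weights the merge is not stated in the abstract.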