Understanding the Limits of Text-Only Molecular Reasoning: A Case Study in Synthetic Chain-of-Thought Supervision
Abstract
While large language models (LLMs) show promise for scientific reasoning, their applicability to molecular property prediction remains unclear. We present Mol2Synth, a controlled study that examines whether synthetic chain-of-thought supervision can enable text-only LLMs to match conventional topological fingerprint methods for toxicity prediction. Our results reveal fundamental limitations: even with tool-grounded reasoning and optimized representations, our best configuration (F1=0.88) underperforms classical ECFP fingerprints (F1=0.96), suggesting an inherent information bottleneck in textual molecular representations. Through systematic ablations across molecular representations (SMILES vs. IUPAC), data scaling, and tool-grounded generation, we demonstrate that reasoning-augmented fine-tuning stabilizes training and improves performance over zero-shot LLMs and label-only supervision, but cannot overcome structural parsing failures inherent to text-only inputs. Our qualitative analysis reveals that the primary failure mode is not faulty chemical reasoning but unreliable SMILES-to-structure interpretation, a bottleneck that tool integration partially addresses but cannot eliminate. These findings establish both the utility and the fundamental limits of synthetic chain-of-thought supervision for molecular tasks, motivating hybrid architectures that combine natural language reasoning with explicit structural encoders.