Higher-order grammar representations for molecular topology generation and understanding
Abstract
Molecular learning models are strongly shaped by their underlying representations, yet standard sequential and graph representations struggle to explicitly encode higher-order molecular topology such as ring systems and motifs. To address this gap, we introduce the Higher-order Grammar Representation (HGR), a principled, topology-aware framework that lifts molecules to combinatorial complexes and parses each complex into a compact sequence of production rules under a context-free higher-order grammar. We devise three complementary lifting strategies, which induce distinct grammars that balance topological expressiveness against rule compactness. To mitigate evaluation biases in existing benchmarks, we further construct RingDiv, a ring-enriched benchmark of 1.18M molecules, and curate a higher-quality 300k-molecule subset. In de novo molecular generation, HGR-based models achieve 100% validity and outperform strong SMILES- and diffusion-based baselines. Building on the same representation, we develop a molecular foundation model that integrates symbolic production rules with explicit structural inductive biases, yielding robust and transferable representations across downstream benchmarks.
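To make the notion of lifting concrete, the following is a minimal sketch of one possible ring-based lift, written under our own assumptions and not taken from the HGR implementation. It treats atoms as rank-0 cells, bonds as rank-1 cells, and the rings of a fundamental cycle basis as rank-2 cells. The function names `ring_cells` and `lift_to_complex` and the choice of cycle basis are hypothetical, and the sketch does not cover the parsing of a complex into production rules.

```python
from collections import deque


def _path_to_root(parent, node):
    """Walk from a node up to the root of the BFS spanning tree."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def ring_cells(num_atoms, bonds):
    """Return a fundamental cycle basis as frozensets of atom indices.

    Assumption: rings are taken from a BFS spanning tree, which need not
    coincide with the chemically preferred ring set in fused systems.
    """
    adj = {i: set() for i in range(num_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)

    parent, seen_edges, rings = {}, set(), []
    for root in range(num_atoms):
        if root in parent:
            continue
        parent[root] = None
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in sorted(adj[u]):
                edge = frozenset((u, v))
                if edge in seen_edges:
                    continue
                seen_edges.add(edge)
                if v not in parent:
                    parent[v] = u          # tree edge
                    queue.append(v)
                else:
                    # Non-tree edge: it closes a ring through the
                    # lowest common ancestor of u and v.
                    pu = _path_to_root(parent, u)
                    pv = _path_to_root(parent, v)
                    ancestors_u = set(pu)
                    lca = next(n for n in pv if n in ancestors_u)
                    ring = pu[: pu.index(lca) + 1] + pv[: pv.index(lca)]
                    rings.append(frozenset(ring))
    return rings


def lift_to_complex(num_atoms, bonds):
    """Map each rank to its cells: 0 = atoms, 1 = bonds, 2 = rings."""
    return {
        0: [frozenset((i,)) for i in range(num_atoms)],
        1: [frozenset(b) for b in bonds],
        2: ring_cells(num_atoms, bonds),
    }


# Toluene-like skeleton: a six-membered ring (atoms 0-5) plus one substituent.
toluene_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
cells = lift_to_complex(7, toluene_bonds)
print({rank: len(c) for rank, c in cells.items()})  # {0: 7, 1: 7, 2: 1}
```

In this toy lift every cell is a set of atoms and rank increases with inclusion (atom within bond within ring), which is the property that lets a ring be addressed as a single unit instead of being inferred from pairwise bonds.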