Higher-order grammar representations for molecular generation and learning
Abstract
Molecular learning models are strongly shaped by their underlying representations, yet standard sequential and graph representations struggle to explicitly encode higher-order molecular topology such as ring systems and motifs. To address this gap, we introduce a Higher-order Grammar Representation (HGR), a principled, topology-aware framework that lifts molecules to combinatorial complexes and parses them into a compact sequence of production rules under a context-free higher-order grammar. We devise three complementary lifting strategies, which induce distinct grammars that balance topological expressiveness with rule compactness. To mitigate evaluation biases in existing benchmarks, we construct RingDiv, a ring-enriched benchmark of 1.18M molecules, and curate a higher-quality 300k subset. Across de novo molecular generation, HGR-based models achieve 100% validity and outperform strong SMILES- and diffusion-based baselines. We further develop a molecular foundation model that integrates HGR with higher-order structural inductive biases, yielding robust and transferable representations across downstream benchmarks.
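To make the lifting step concrete, the following is a minimal sketch of how a molecular graph might be lifted to a three-rank complex, with atoms as rank-0 cells, bonds as rank-1 cells, and rings as rank-2 cells. It is an illustration under stated assumptions, not the HGR implementation: the function name `lift_to_complex` is hypothetical, rings are approximated by the fundamental cycles of a breadth-first spanning forest (which need not coincide with a chemically preferred ring set), and neither the three lifting strategies nor the grammar parsing mentioned above is reproduced.

```python
from collections import deque


def lift_to_complex(num_atoms, bonds):
    """Toy lifting (hypothetical helper, not the paper's method).

    Atoms become rank-0 cells, bonds rank-1 cells, and the fundamental
    cycles of a BFS spanning forest become rank-2 (ring) cells.
    """
    adj = {a: [] for a in range(num_atoms)}
    for u, v in bonds:
        adj[u].append(v)
        adj[v].append(u)

    # Build a BFS spanning forest, recording each atom's parent.
    parent = {}
    tree_edges = set()
    for root in range(num_atoms):
        if root in parent:
            continue
        parent[root] = None
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    tree_edges.add(frozenset((u, v)))
                    queue.append(v)

    def path_to_root(a):
        path = [a]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path

    # Every non-tree bond closes exactly one fundamental cycle.
    rings = []
    for u, v in bonds:
        if frozenset((u, v)) in tree_edges:
            continue
        pu, pv = path_to_root(u), path_to_root(v)
        on_pu = set(pu)
        # Lowest common ancestor: first atom on v's path that is also on u's.
        lca = next(n for n in pv if n in on_pu)
        ring = pu[: pu.index(lca) + 1] + pv[: pv.index(lca)][::-1]
        rings.append(frozenset(ring))

    return {
        0: [frozenset((a,)) for a in range(num_atoms)],
        1: [frozenset(b) for b in bonds],
        2: rings,
    }


# Toluene skeleton: a six-membered ring (atoms 0-5) plus one substituent (atom 6).
toluene_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
cells = lift_to_complex(7, toluene_bonds)
print({rank: len(c) for rank, c in cells.items()})  # {0: 7, 1: 7, 2: 1}
```

In this toy setting the ring appears as an explicit rank-2 cell rather than being implied by bond connectivity, which is the kind of higher-order structure that a plain graph or string representation leaves implicit.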