ACCURATE TOKENIZATION OF 3D SMALL ORGANIC MOLECULES
Filya Geikyan ⋅ Hrant Khachatrian
Abstract
Atom-level generation of chemical structures, including drug-like molecules, is an increasingly active research direction. Because atomic coordinates are continuous, 3D structure generation has mostly been done with diffusion-style methods, with only a few attempts at leveraging autoregressive models. In this work we develop CoordToken, a simple recipe for training tokenizers for 3D molecules using the Finite Scalar Quantization (FSQ) method. We train CoordToken on two datasets: (i) on $\nabla^2$DFT, where we obtain a reconstruction error of $0.059$ Å, a $3.5\times$ reduction compared to prior methods, and (ii) on a large corpus of 196M molecules, where we obtain an average RMSD of $0.045$ Å across all test sets, including an error of $0.044$ Å on $\nabla^2$DFT, while maintaining near-perfect physical plausibility. The tokenizers will be publicly released.
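To make the two technical ingredients named in the abstract concrete, the following is a minimal sketch of Finite Scalar Quantization and of an RMSD reconstruction metric. It is not the CoordToken implementation: the function names (`fsq_quantize`, `codes_to_token`, `rmsd`), the choice of odd level counts, and the absence of structural alignment before computing RMSD are all assumptions made for illustration. The abstract does not specify the encoder, the level configuration, or how RMSD is aligned.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Bound each latent dimension with tanh, then round to the nearest integer.

    Assumption: every entry of `levels` is odd, so the grid is symmetric
    around zero (e.g. 5 levels -> integers in {-2, ..., 2}).
    """
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    bounded = np.tanh(z) * half
    return np.round(bounded).astype(np.int64)

def codes_to_token(codes, levels):
    """Map per-dimension codes to a single token id via mixed-radix encoding."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = codes + half                      # shift to {0, ..., L-1}
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

def rmsd(a, b):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays.

    Assumption: the two structures are already aligned; no superposition
    is performed here.
    """
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=-1))))

# Hypothetical example: a 3-dimensional latent with 5 levels per dimension,
# giving an implicit codebook of 5 * 5 * 5 = 125 tokens.
levels = [5, 5, 5]
codes = fsq_quantize(np.array([0.0, 10.0, -10.0]), levels)
token = codes_to_token(codes, levels)
```

In this sketch the codebook is implicit: the vocabulary size is the product of the per-dimension level counts, so no codebook embeddings need to be learned. A training setup would additionally need a straight-through gradient estimator around the rounding step, which is omitted here.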