Poster in Workshop: AI4MAT-ICLR-2025: AI for Accelerated Materials Design
SMI-TED: A large-scale foundation model for materials and chemistry
Emilio Vital Brazil · Eduardo Soares · Victor Shirasuna · Renato Cerqueira · Dmitry Zubarev · Kristin Schmidt
Keywords: [ mixture-of-experts ] [ foundation model ] [ QM9 ] [ synthesis ]
Abstract:
We present SMI-TED, a large-scale encoder–decoder foundation model for materials and chemistry, trained on 91 million SMILES samples from PubChem using self-supervised learning. Our encoder–decoder architecture supports a wide range of complex tasks, including the prediction of quantum chemical properties and reaction yields. We provide two model variants, with 289M and 8×289M parameters, to accommodate different use cases. SMI-TED achieves state-of-the-art performance across multiple benchmark datasets. Latent space analyses reveal signs of compositionality and separability, key properties for higher-level reasoning and few-shot learning. In particular, SMI-TED demonstrates its ability to capture chemically meaningful structure–property relationships without task-specific fine-tuning, as shown by the clustering of nitrogen-containing molecules with high HOMO energies. Compared to an encoder-only baseline, SMI-TED achieves a lower Davies–Bouldin index, highlighting the benefits of its reconstruction-based training objective. To support further research and applications, we publicly release the model weights and source code on HuggingFace and GitHub.
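
The Davies–Bouldin comparison described in the abstract can be sketched with scikit-learn, which provides davies_bouldin_score (lower values indicate tighter, better-separated clusters). The code below is a minimal, illustrative outline: the embedding arrays are synthetic stand-ins, and the helper name separability is an assumption for this sketch; real embeddings would come from the released SMI-TED encoder and the encoder-only baseline.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    # Synthetic stand-ins for molecular embeddings; in practice these would
    # be produced by the released SMI-TED encoder and an encoder-only baseline.
    rng = np.random.default_rng(0)
    smi_ted_embeddings = rng.normal(size=(1000, 64))
    baseline_embeddings = rng.normal(size=(1000, 64))

    def separability(embeddings: np.ndarray, n_clusters: int = 8) -> float:
        """Cluster the latent space and score it with the Davies-Bouldin
        index; a lower score means better-separated clusters."""
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(embeddings)
        return davies_bouldin_score(embeddings, labels)

    print("SMI-TED DB index: ", separability(smi_ted_embeddings))
    print("Baseline DB index:", separability(baseline_embeddings))

With real embeddings, a lower score for SMI-TED than for the baseline would correspond to the separability claim made above.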