DELBERT: Fingerprint Language Modeling For Generalizable Hit Discovery in DNA-Encoded Libraries
Arman Seyed-Ahmadi ⋅ Bing Hu ⋅ Armin Geraili ⋅ Helen Chen ⋅ Shana Kelley ⋅ BO WANG
Abstract
DNA-Encoded Libraries (DELs) enable high-throughput exploration of vast chemical spaces for drug discovery, yet machine learning progress in this domain is limited by the scarcity of publicly available data. Initiatives such as AIRCHECK have begun releasing public DEL data as molecular fingerprints (FPs) only, which preserves the confidentiality of proprietary chemical structures while still enabling model development. However, current open-source FP-based approaches typically rely on supervised decision-tree models that show limited generalization to out-of-distribution (OOD) chemical space. We introduce DELBERT, a transformer encoder that treats molecular FPs as a language of discrete tokens, enabling self-supervised pretraining via masked language modeling without access to the underlying molecular structures. Under comprehensive library-based OOD evaluation across four protein targets (WDR91, LRRK2, SETDB1, DCAF7), DELBERT significantly outperforms baseline ensemble models on three of the four targets, with 1.6-2.7$\times$ improvements in key early-enrichment metrics. Our results demonstrate that self-supervised learning over FPs alone can substantially enhance generalization for hit identification, unlocking confidentiality-preserving collaboration for accelerated drug discovery in data-constrained settings.
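To make the idea of treating a fingerprint as a token sequence concrete, the following is a minimal sketch of how masked-language-modeling inputs could be prepared from a binary fingerprint. It is an illustration under stated assumptions, not DELBERT's actual tokenization: the scheme of one token per set bit, the `bit_<index>` token names, the `[MASK]` symbol, the 15% masking rate, and the helper names `fingerprint_to_tokens` and `mask_tokens` are all hypothetical and do not come from the abstract.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical special token, not taken from the paper


def fingerprint_to_tokens(bits):
    """Turn a binary fingerprint into tokens, one per set bit (assumed scheme)."""
    return [f"bit_{i}" for i, b in enumerate(bits) if b]


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace each token with MASK_TOKEN with probability mask_prob.

    Returns (masked, labels). labels[i] holds the original token at masked
    positions and None elsewhere, so a model would be trained to predict
    only the hidden tokens.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    fp = [0, 1, 1, 0, 1, 0, 0, 1]  # toy 8-bit fingerprint
    tokens = fingerprint_to_tokens(fp)
    masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
    print(tokens)
    print(masked)
    print(labels)
```

No molecular structure is needed at any step: the masked sequence and its labels are derived from the fingerprint alone, which is the property that makes this kind of pretraining compatible with FP-only data releases.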