A Transcriptomic Benchmark for Foundation Models in Immunology and Inflammation Drug Development
Abstract
Foundation models for transcriptomics are increasingly evaluated on technical metrics disconnected from drug development. We introduce an immunology and inflammation (I&I) benchmark of 35 tasks across 8 diseases, organized along the drug development pipeline: target discovery, preclinical translation, and clinical applications. Tasks span treatment response, clinical severity, molecular perturbations, and patient endotypes, with cross-species, cross-disease, and cross-platform transfer to test translational generalization. Patient sample sizes range from 9 to 713, reflecting data-limited regimes typical of early clinical research. We evaluate general-purpose and domain-specific foundation models against statistical baselines. Foundation models achieve their largest gains on translational tasks (perturbation prediction and cross-species transfer), where baselines fail. Treatment outcome prediction and patient stratification also favor foundation models, while feature-selected regression remains competitive on clinical severity prediction. A domain-specific model (EVA) pretrained on I&I data outperforms general-purpose models across most task categories. Benchmark performance improves with the number of pretraining steps without saturating, suggesting that the benchmark can serve as a diagnostic for model development.