Poster in Workshop: Machine Learning for Genomics Explorations (MLGenX)
BEYOND SEQUENCE-ONLY MODELS: LEVERAGING STRUCTURAL CONSTRAINTS FOR ANTIBIOTIC RESISTANCE PREDICTION IN SPARSE GENOMIC DATASETS
Mahbuba Tasmin · Anna G. Green
Abstract:
To combat the rise of antibiotic-resistant $\textit{Mycobacterium tuberculosis}$, genotype-based diagnosis of resistance is critical, as it could substantially shorten the time to treatment. However, machine learning efforts at genotype-based resistance prediction are hindered by limited sequence diversity and high redundancy in genomic datasets, which complicate model generalization. Here, we introduce a dataset of $\textit{M. tuberculosis}$ sequences for nine key resistance-associated genes with corresponding resistance phenotypes, and we apply genotype de-duplication to mitigate data leakage. We evaluate three computational approaches for resistance prediction: baseline Ridge regression, zero-shot mutation effect prediction using ESM-2 embeddings, and a Fused Ridge approach that moves beyond sequence-only prediction by introducing protein structure constraints. Our results show that Fused Ridge achieves the highest mean AUC (0.766), outperforming Ridge regression (0.755) and ESM-2-based log-likelihood ratio scoring (0.603). Fused Ridge also exhibits enhanced precision and recall in identifying resistance-conferring variants, particularly for genes such as $\textit{gyrA}$ and $\textit{rpoB}$, likely because of a strong association between the 3D location of mutations and resistance. The fusion penalty enforces smoothness in regression coefficients for spatially adjacent residues, embedding biological knowledge into the predictive framework and improving generalization in sparse and highly redundant datasets.
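To make the fusion penalty concrete, the following is a minimal sketch and not the authors' implementation. It assumes one feature per residue position, a squared-difference penalty between the coefficients of residues that lie within a distance cutoff, and a closed-form solve. The function names, the 8 angstrom cutoff, the toy coordinates, and the hyperparameters `lam` and `gamma` are all hypothetical choices made for illustration.

```python
import numpy as np


def contact_laplacian(coords, cutoff=8.0):
    """Graph Laplacian of a residue contact graph.

    coords: (n_residues, 3) array of 3D positions (assumed, e.g. C-alpha atoms).
    Two residues are treated as adjacent if they are closer than `cutoff`.
    """
    diffs = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    adj = (dist < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-contacts
    return np.diag(adj.sum(axis=1)) - adj


def fused_ridge_fit(X, y, laplacian, lam=1.0, gamma=1.0):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 + gamma * w^T L w.

    w^T L w equals the sum of (w_i - w_j)^2 over adjacent residue pairs,
    so a larger gamma pulls coefficients of neighboring residues together.
    Setting gamma = 0 recovers ordinary ridge regression.
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features) + gamma * laplacian
    return np.linalg.solve(A, X.T @ y)


# Toy data: residues 0-2 form a spatial cluster, residue 3 is far away.
rng = np.random.default_rng(0)
coords = np.array([[0.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0],
                   [6.0, 0.0, 0.0],
                   [30.0, 0.0, 0.0]])
L = contact_laplacian(coords)

# Binary mutation indicators per residue and a noisy synthetic phenotype.
X = rng.integers(0, 2, size=(50, 4)).astype(float)
y = X @ np.array([1.0, 1.0, 1.0, 0.0]) + 0.1 * rng.standard_normal(50)

w_ridge = fused_ridge_fit(X, y, L, lam=1.0, gamma=0.0)
w_fused = fused_ridge_fit(X, y, L, lam=1.0, gamma=10.0)
```

In this sketch the structural prior enters only through the Laplacian term, so the same solver serves as the plain ridge baseline when `gamma` is zero. Other forms of fusion penalty (for example, absolute rather than squared differences) would require an iterative solver instead of a single linear solve.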