Poster in Workshop: AI for Nucleic Acids (AI4NA)
VirProBERT: A Sequence Language Model for Predicting Viral Hosts
Blessy Antony · Maryam Haghani · Anuj Karpatne · T. Murali
Accurately predicting the hosts of viruses is crucial for understanding and anticipating human infectious diseases that originate in animals. We develop a machine learning model that predicts the host infected by a virus, given only the sequence of a protein encoded by that virus's genome. Our approach, VirProBERT, is the first that applies to multiple hosts and generalizes to hosts and viruses unseen during training. VirProBERT is a transformer-based architecture coupled with hierarchical self-attention that can accept sequences of highly diverse lengths. We integrate VirProBERT with a prototype-based few-shot learning classifier to predict rare classes. We demonstrate the accuracy, robustness, and generalizability of VirProBERT through a comprehensive series of experiments. In particular, we show that VirProBERT achieves a median AUPRC of 0.67 when predicting common hosts. Moreover, VirProBERT retains this AUPRC value even for rare hosts (median prevalence as low as 0.09%). Our model performs on par with state-of-the-art foundation models that are 65 to 5,000 times larger, and outperforms them in identifying hosts of SARS-CoV-2 variants of concern.
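To make the prototype-based few-shot idea concrete, here is a minimal sketch of prototypical-network-style classification: each host class is represented by the mean ("prototype") of its support-set embeddings, and a query sequence embedding is assigned to the class with the nearest prototype. This is an illustrative toy, not the authors' implementation; the 2-D vectors, host names, and Euclidean distance are all assumptions standing in for the encoder's actual embedding space.

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Compute one prototype per class: the mean of that class's support embeddings."""
    protos = {}
    for c in set(labels):
        protos[c] = np.mean(
            [e for e, l in zip(embeddings, labels) if l == c], axis=0
        )
    return protos

def predict_host(query, protos):
    """Assign the query embedding to the class whose prototype is nearest (Euclidean)."""
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))

# Toy 2-D embeddings standing in for protein-sequence encoder outputs
# (hypothetical host labels; a rare host needs only a few support examples).
support = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
labels = ["human", "human", "bat", "bat"]

protos = class_prototypes(support, labels)
print(predict_host(np.array([0.1, 0.0]), protos))  # → human
```

Because a prototype is just a mean over however many examples a class has, this style of classifier handles rare hosts gracefully: a class seen only a handful of times still gets a usable prototype, which is the motivation for pairing it with the VirProBERT encoder.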