

Poster in Workshop: Machine Learning for Genomics Explorations (MLGenX)

DARKIN: A zero-shot classification benchmark and an evaluation of protein language models

Emine Ayşe Sunar · Zeynep Işık · Mert Pekey · Ramazan Gokberk Cinbis · Oznur Tastan


Abstract:

Protein language models (pLMs) aim to capture the complex information embedded in protein sequences and are useful for downstream protein prediction tasks. With many pLMs now available, there is a critical need to benchmark their performance across diverse tasks. Here, we introduce a biologically relevant zero-shot prediction benchmark focused on dark kinase-phosphosite associations. Kinases are the enzymes responsible for protein phosphorylation, and they play vital roles in cellular signaling. While phosphoproteomics allows large-scale identification of phosphosites, determining which kinase catalyzes each phosphorylation event remains challenging. We present DARKIN, a zero-shot classification benchmark dataset for assigning phosphosites to one of the understudied kinases (dark kinases). DARKIN provides train, validation, and test folds split according to the zero-shot classification setting, kinase groups, and sequence similarity. Evaluating pLMs with a novel training-free k-NN-based zero-shot classifier and a bilinear zero-shot classifier reveals superior performance by ESM models, ProtT5-XL, and the recently introduced structure-based SaProt model. We believe this biologically relevant yet challenging benchmark will further facilitate assessing the efficacy of pLMs and aid the exploration of dark kinases.
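
The abstract names a training-free k-NN-based zero-shot classifier but does not spell out how it works. The sketch below shows one plausible way such a classifier could operate; it is an illustrative assumption and not the authors' implementation. Each test phosphosite is matched to its k most similar training phosphosites, the embeddings of those neighbors' (seen) kinases are averaged into a proxy vector, and the unseen kinase whose embedding is most similar to that proxy is predicted. The function name `knn_zero_shot_predict`, the cosine similarity, the averaging step, and the toy 2-D embeddings are all hypothetical; in practice the embeddings would come from a pLM.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def knn_zero_shot_predict(test_sites, train_sites, train_labels,
                          seen_kinases, unseen_kinases, k=2):
    """Hypothetical training-free k-NN zero-shot classifier (an assumption).

    test_sites     : (n_test, d_site) embeddings of test phosphosites
    train_sites    : (n_train, d_site) embeddings of training phosphosites
    train_labels   : (n_train,) index into seen_kinases for each training site
    seen_kinases   : (n_seen, d_kin) embeddings of kinases seen in training
    unseen_kinases : (n_unseen, d_kin) embeddings of the zero-shot kinases
    Returns the index of the predicted unseen kinase for each test site.
    """
    test_sites = l2_normalize(test_sites)
    train_sites = l2_normalize(train_sites)
    seen = l2_normalize(seen_kinases)
    unseen = l2_normalize(unseen_kinases)

    # Cosine similarity between every test site and every training site.
    site_sim = test_sites @ train_sites.T
    # Indices of the k most similar training sites for each test site.
    nn_idx = np.argsort(-site_sim, axis=1)[:, :k]
    # Average the embeddings of the neighbors' kinases into one proxy vector.
    proxy = seen[train_labels[nn_idx]].mean(axis=1)
    # Score unseen kinases against the proxy and pick the best match.
    scores = proxy @ unseen.T
    return scores.argmax(axis=1)


# Toy 2-D embeddings, invented purely for illustration.
train_sites = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = np.array([0, 0, 1, 1])
seen_kinases = np.array([[1.0, 0.0], [0.0, 1.0]])
unseen_kinases = np.array([[0.8, 0.2], [0.2, 0.8]])
test_sites = np.array([[1.0, 0.05], [0.05, 1.0]])

preds = knn_zero_shot_predict(test_sites, train_sites, train_labels,
                              seen_kinases, unseen_kinases, k=2)
print(preds.tolist())
```

No parameters are fitted in this sketch, which is what makes it training-free: the only inputs are precomputed embeddings and the training labels. A bilinear zero-shot classifier, by contrast, would learn a compatibility matrix between phosphosite and kinase embeddings.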
