TripleThreat: Benchmarking Functional Sensitivity in Protein Representations with Paralog-Ortholog Triplets
Abstract
Understanding why some sequence changes alter protein function while others conserve it remains a central challenge in protein biology. A meaningful protein representation should be sensitive to this distinction, so we introduce TripleThreat, a benchmark dataset that evaluates this capability. We construct test cases from natural examples of functional divergence in paralogs and functional conservation in orthologs, assembling protein–paralog–ortholog triplets. By controlling for confounders (sequence identity, length, and species) at varying levels, we generate six dataset subsets that trade off dataset size against stringency. Because protein language models (pLMs) are commonly used as protein representations, we evaluate five widely used pLMs on this benchmark. We find that performance declines as confounding variables are more tightly controlled. Further, by varying confounder stringency, we identify which confounding signals overshadow functional signals; for example, we observe that an alignment-based pLM encodes species identity more prominently than function. In sum, this work offers a framework for testing whether protein representation spaces capture fine-grained functional relationships beyond confounding signals. We make our benchmark publicly available at https://github.com/mohinimisra26/triple-threat.
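The abstract does not state how an individual triplet is scored. The sketch below assumes one plausible rule: each protein has already been embedded as a fixed-length vector (for example, a pooled pLM embedding), and a triplet counts as correct when the anchor protein is more cosine-similar to its ortholog (assumed to share function) than to its paralog (assumed to have diverged). The function names `cosine_similarity`, `triplet_correct`, and `triplet_accuracy` are hypothetical and are not taken from the TripleThreat repository.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero-norm embedding")
    return dot / (norm_a * norm_b)


def triplet_correct(anchor: Vector, paralog: Vector, ortholog: Vector) -> bool:
    """Assumed scoring rule: the anchor should sit closer to its ortholog
    (function conserved) than to its paralog (function diverged)."""
    return cosine_similarity(anchor, ortholog) > cosine_similarity(anchor, paralog)


def triplet_accuracy(triplets: Sequence[Tuple[Vector, Vector, Vector]]) -> float:
    """Fraction of (anchor, paralog, ortholog) embedding triplets scored correctly."""
    if not triplets:
        raise ValueError("no triplets to score")
    return sum(triplet_correct(a, p, o) for a, p, o in triplets) / len(triplets)


# Toy 2-D embeddings, purely illustrative (not benchmark data).
toy_triplets = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),    # ortholog closer: correct
    ([1.0, 0.0], [0.95, 0.05], [0.5, 0.5]),  # paralog closer: incorrect
]
print(triplet_accuracy(toy_triplets))  # prints 0.5
```

Under this assumed rule, a representation dominated by a confounder (for instance, species identity) would tend to pull the anchor toward whichever partner shares that confounder, which is why tightening confounder controls can lower the measured accuracy.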