Fast Proteome-Scale Protein Interaction Retrieval via Residue-Level Factorization
Jianan Zhao · Zhihao Zhan · Narendra Chaudhary · Xinyu Yuan · Zuobai Zhang · Qian Cong · Jian Zhou · Sanchit Misra · Jian Tang
Abstract
Protein-protein interactions (PPIs) are mediated at the residue level. Most sequence-based PPI models consider residue-residue interactions across the two proteins, an approach that can yield accurate interaction scores but is too slow to scale. At proteome scale, identifying candidate PPIs requires evaluating nearly *all possible protein pairs*. For $N$ proteins of average length $L$, exhaustive all-against-all search requires $\mathcal{O}(N^2L^2)$ computation, rendering conventional approaches computationally impractical. We introduce RaftPPI, a scalable framework that approximates residue-level PPI modeling while enabling efficient large-scale retrieval. RaftPPI represents residue interactions with a Gaussian kernel, approximated efficiently via structured random Fourier features, and applies a low-rank factorized attention mechanism that admits pooling into a compact embedding per protein. Each protein is encoded once into an indexable embedding, allowing approximate nearest-neighbor search to replace exhaustive pairwise scoring, reducing proteome-wide retrieval from *months* to *minutes* on a single GPU. On the human proteome with the D-SCRIPT dataset, RaftPPI retrieves the top 20\% candidate pairs ($\sim$200M) in 6 GPU minutes, covering 75.1\% of the true interacting pairs, compared to 4.9 GPU months for the best prior method (61.2\%). Across seven benchmarks with sequence- and degree-controlled splits, RaftPPI achieves state-of-the-art PPI classification and retrieval performance, while enabling residue-aware, retrieval-friendly screening at proteome scale.
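The factorization idea in the abstract can be illustrated with a minimal sketch. It is not RaftPPI itself: it uses plain (unstructured) random Fourier features and uniform sum pooling in place of the structured features and low-rank factorized attention described above, and the residue vectors are random stand-ins for learned residue embeddings. All names (`make_rff`, `pool`, `protein_a`, and so on) are hypothetical. The point it shows is that a sum of Gaussian kernel values over all residue pairs can be approximated by a single dot product between two per-protein embeddings, each computed once.

```python
import math
import random


def make_rff(dim, num_features, sigma, seed=0):
    """Build a map phi with dot(phi(x), phi(y)) ~ exp(-||x - y||^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1/sigma^2), phases from U[0, 2*pi).
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def phi(x):
        return [scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return phi


def gaussian_kernel(x, y, sigma):
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def dot(u, v):
    return sum(a * c for a, c in zip(u, v))


def pool(residues, phi):
    """Sum residue features into one fixed-size protein embedding."""
    return [sum(column) for column in zip(*(phi(r) for r in residues))]


# Toy "proteins": random residue vectors (assumption: stand-ins for learned embeddings).
data_rng = random.Random(1)
dim, sigma, num_features = 3, 1.5, 5000
protein_a = [[data_rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
protein_b = [[data_rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]

phi = make_rff(dim, num_features, sigma)

# Exact score: one kernel evaluation per residue pair, O(L_a * L_b) per protein pair.
exact = sum(gaussian_kernel(x, y, sigma) for x in protein_a for y in protein_b)

# Factorized score: each protein is embedded once, then a single dot product.
emb_a = pool(protein_a, phi)
emb_b = pool(protein_b, phi)
approx = dot(emb_a, emb_b)

print("exact pairwise kernel sum:", exact)
print("pooled dot-product estimate:", approx)
```

Because the score reduces to an inner product between fixed-size vectors, embeddings like `emb_a` could in principle be stored in an approximate nearest-neighbor index, which is the property the abstract relies on to avoid exhaustive pairwise scoring.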