Fast Proteome-Scale Protein Interaction Retrieval via Residue-Level Factorization
Jianan Zhao ⋅ Zhihao Zhan ⋅ Narendra Chaudhary ⋅ Xinyu Yuan ⋅ Zuobai Zhang ⋅ Qian Cong ⋅ Jian Zhou ⋅ Sanchit Misra ⋅ Jian Tang
Abstract
Protein-protein interactions (PPIs) are mediated at the residue level. Most sequence-based PPI models consider residue-residue interactions across two proteins, which can yield accurate interaction scores but are too slow to scale. At proteome scale, identifying candidate PPIs requires evaluating nearly *all possible protein pairs*. For $N$ proteins of average length $L$, exhaustive all-against-all search requires $\mathcal{O}(N^2L^2)$ computation, rendering conventional approaches computationally impractical. We introduce RaftPPI, a scalable framework that approximates residue-level PPI modeling while enabling efficient large-scale retrieval. RaftPPI represents residue interactions with a Gaussian kernel, approximated efficiently via structured random Fourier features, and applies a low-rank factorized attention mechanism that admits pooling into a compact embedding per protein. Each protein is encoded once into an indexable embedding, allowing approximate nearest-neighbor search to replace exhaustive pairwise scoring, reducing proteome-wide retrieval from *months* to *minutes* on a single GPU or CPU. On the human proteome with the D-SCRIPT dataset, RaftPPI retrieves the top 20% of pairs from $\sim$200M candidate pairs in 5.7 minutes on an A100 GPU, or 3.3 minutes on an Intel Xeon 6980P CPU, covering 75.1% of the true interacting pairs, compared to 4.9 GPU-months for the best prior method (61.2% coverage). Across seven benchmarks with sequence- and degree-controlled splits, RaftPPI achieves state-of-the-art PPI classification and retrieval performance, while enabling residue-aware, retrieval-friendly screening at proteome scale.
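The factorization idea in the abstract can be illustrated with a minimal sketch. It is not the RaftPPI implementation: it uses plain (unstructured) random Fourier features, omits the low-rank attention mechanism entirely, and substitutes a brute-force dot-product search for approximate nearest-neighbor search. All names, dimensions, and the bandwidth `sigma` are assumptions chosen for illustration. The point is only that a Gaussian kernel summed over all residue pairs, $\sum_{i,j} k(x_i, y_j)$, is approximated by a dot product of two sum-pooled feature vectors, so each protein is encoded once and pair scoring no longer depends on sequence length.

```python
import numpy as np


def rff_features(residues, weights, phases):
    """Map residue vectors (n_res, dim) to random Fourier features (n_res, n_feat)."""
    n_feat = weights.shape[1]
    return np.sqrt(2.0 / n_feat) * np.cos(residues @ weights + phases)


def protein_embedding(residues, weights, phases):
    """Sum-pool residue features into one fixed-size vector per protein."""
    return rff_features(residues, weights, phases).sum(axis=0)


def exact_kernel_score(res_a, res_b, sigma):
    """Reference O(L_a * L_b) score: Gaussian kernel summed over all residue pairs."""
    sq_dist = ((res_a[:, None, :] - res_b[None, :, :]) ** 2).sum(axis=-1)
    return float(np.exp(-sq_dist / (2.0 * sigma**2)).sum())


rng = np.random.default_rng(0)
dim, n_feat, sigma = 8, 4096, 2.0  # hypothetical sizes, not from the paper

# Frequencies drawn from N(0, sigma^-2 I) and uniform phases give features
# whose dot product approximates exp(-||x - y||^2 / (2 sigma^2)).
weights = rng.normal(scale=1.0 / sigma, size=(dim, n_feat))
phases = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)

# Toy "proteome": random residue vectors stand in for learned residue encodings.
lengths = [30, 40, 25, 35, 50]
proteins = [rng.normal(size=(n, dim)) for n in lengths]

# Encode each protein once; every pair score is then a single dot product.
embeddings = np.stack([protein_embedding(p, weights, phases) for p in proteins])

# Brute-force all-against-all scoring (an ANN index would replace this at scale).
scores = embeddings @ embeddings.T
np.fill_diagonal(scores, -np.inf)  # ignore self-pairs in this toy example
top_partner = scores.argmax(axis=1)

# Compare the factorized score against the exact residue-pair sum for one pair.
approx = float(embeddings[0] @ embeddings[1])
exact = exact_kernel_score(proteins[0], proteins[1], sigma)
rel_err = abs(approx - exact) / exact
print(f"approx={approx:.2f} exact={exact:.2f} rel_err={rel_err:.3f}")
```

In this sketch the per-pair cost drops from a sum over $L_a L_b$ residue pairs to one dot product of length `n_feat`, which is what makes the embeddings indexable. The approximation error shrinks as `n_feat` grows, at the cost of larger embeddings.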