Joint Variable Selection in Proteomics Survival Models
Abstract
The incidence of most neurodegenerative, cancer, and metabolic diseases increases exponentially with age. In large-scale biobanks, linking time-to-diagnosis information in electronic health records to multiple genomic (``multiomics'') measures has the potential to reveal the genes and biological pathways involved in disease onset and progression. To date, association testing has commonly been conducted one variable at a time using semiparametric Cox proportional hazards (CoxPH) models, an approach that ignores the correlation structure among variables and increases the risk of false discoveries. To address these issues, we introduce vampW, a novel fully parametric computational method based on the Bayesian Vector Approximate Message Passing framework applied to a Weibull model. vampW jointly models correlated features while providing an interpretable hazard structure, producing a continuous survival curve, and incorporating prior knowledge. In an extensive simulation study, we demonstrate that joint modeling of proteomics data and time-to-event outcomes using vampW substantially reduces false discoveries compared with marginal testing and with joint CoxPH models. Applying vampW to 2,924 proteins across 24 diseases in 53,018 individuals from the UK Biobank identifies 219 protein associations, the majority of which are not among the top marginal discoveries. vampW also significantly improves the prediction of disease onset times: across 14 medical outcomes, it reduces the root mean squared error by over 32% and 26% compared with CoxPH variants and the deep learning approach DeepSurv, respectively. In addition, vampW outperforms deep learning methods in the data-scarce regime on common survival benchmarking datasets.
In summary, vampW offers accurate and interpretable variable selection and out-of-sample prediction within a single computational framework, making it a powerful tool for dissecting the proteomic architecture of human health span.