Spotlight 1: GPC - Deep generative model of genetic variation data improves imputation accuracy in private populations
Prateek Anand ⋅ Anji Liu ⋅ Meihua Dang ⋅ Boyang Fu ⋅ Xinzhu Wei ⋅ Guy Van den Broeck ⋅ Sriram Sankararaman
Abstract
Artificial genomes (AGs) are increasingly used to benchmark genomic pipelines, test population genetic hypotheses, and construct reference panels for genotype imputation, while avoiding the restrictions associated with sharing real genomes. However, existing approaches often struggle to jointly achieve realism, computational efficiency, and privacy preservation. We introduce Genetic Probabilistic Circuits (GPC), a deep generative model for genetic variation data based on hidden Chow--Liu trees represented as probabilistic circuits. GPC captures long-range dependencies among SNPs and is simple to train. We evaluate GPC across multiple ancestries in two large-scale datasets, the 1000 Genomes Project and UK Biobank. GPC matches or exceeds prior methods in generating AGs that resemble real genomes, and its AGs retain the population structure of the training genomes. The AGs from GPC also reproduce patterns of linkage disequilibrium (LD; correlations between nearby genetic variants) more faithfully across length scales. GPC consistently improves imputation accuracy by 3--33\% in $r^2$ over the next best generative model, with gains of 13--279\% for low-frequency variants (MAF $<$1\%). For underrepresented populations, GPC improves accuracy by 12--96\% over European-only reference panels. Finally, we demonstrate that GPC provides improved privacy-utility tradeoffs compared to existing approaches, enabling accurate inference when sharing real genomes is restricted.
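The abstract reports imputation accuracy in $r^2$ and singles out variants with MAF below 1\%, but does not define either quantity. The sketch below shows one common convention, assumed here rather than taken from the paper: $r^2$ as the squared Pearson correlation between true genotypes (0, 1, or 2 alternate alleles) and imputed dosages at a single variant, and MAF as the smaller of the two allele frequencies. The function names `imputation_r2` and `minor_allele_frequency` are hypothetical and are not part of GPC.

```python
import math


def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Assumed convention (not stated in the abstract): one variant at a time,
    genotypes coded as alternate-allele counts, dosages as expected counts.
    """
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    # The correlation is undefined when either vector has no variation.
    if var_t == 0 or var_d == 0:
        return math.nan
    return (cov * cov) / (var_t * var_d)


def minor_allele_frequency(genotypes):
    """Minor allele frequency for diploid genotypes coded 0/1/2."""
    if len(genotypes) == 0:
        raise ValueError("genotypes must be non-empty")
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)
```

Under this convention, a per-variant $r^2$ would typically be averaged within MAF bins to compare reference panels; whether the paper aggregates this way is not stated in the abstract.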