Ancestry Inference with GNNs on IBD Graphs for Genetically Similar Populations
Abstract
Graph Neural Networks (GNNs) have recently proven effective at analyzing structured graph data across diverse domains. At the same time, accurate inference of ancestry from genetic data, especially among genetically similar populations, remains challenging due to the intrinsic complexity of genetic relationships and the high dimensionality of SNP data. To address these challenges, we propose a novel GNN-based method for inferring an individual's ancestry from a graph that represents the genetic relatedness between individuals. Genetic relatedness between two individuals is measured by their shared identity-by-descent (IBD) segments, which are segments of the genome inherited from a recent common ancestor. In this context, the ancestry inference task is formalized as node classification on a graph whose vertices are individuals and whose edges encode IBD sharing. We present three key contributions. First, we advance population genetics methodology with a GNN-based framework for ancestry inference among closely related populations. Second, we present a novel GNN architecture that improves training stability and predictive performance for ancestry inference on IBD graphs. Third, we demonstrate that augmenting the dataset with unlabeled vertices (individuals with unknown ancestry) significantly improves prediction scores, because message passing in GNNs effectively propagates ancestry-related information throughout the network.
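To make the node-classification framing concrete, the following is a minimal sketch, not the architecture proposed in this paper. It replaces a trained GNN with plain weighted-mean propagation of class scores, which is only a stand-in for message passing. The helper names (`build_ibd_graph`, `propagate`, `predict`), the toy individuals, and the IBD lengths in centimorgans are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_ibd_graph(ibd_pairs):
    """Build an undirected weighted adjacency map from (id_a, id_b, shared_cm) tuples."""
    graph = defaultdict(dict)
    for a, b, shared_cm in ibd_pairs:
        graph[a][b] = shared_cm
        graph[b][a] = shared_cm
    return graph


def propagate(graph, labels, classes, rounds=2):
    """Synchronous weighted-mean propagation of class scores.

    Labeled nodes are clamped to their one-hot label; unlabeled nodes start
    at zero and take the IBD-weighted mean of their neighbors' scores.
    Assumes every edge weight is positive.
    """
    scores = {
        node: [1.0 if labels.get(node) == c else 0.0 for c in classes]
        for node in graph
    }
    for _ in range(rounds):
        updated = {}
        for node, nbrs in graph.items():
            if node in labels:
                updated[node] = scores[node]  # keep known ancestry fixed
                continue
            total = sum(nbrs.values())
            updated[node] = [
                sum(w * scores[n][k] for n, w in nbrs.items()) / total
                for k in range(len(classes))
            ]
        scores = updated
    return scores


def predict(scores, classes):
    """Assign each node the class with the highest propagated score."""
    return {
        node: classes[max(range(len(classes)), key=lambda k: s[k])]
        for node, s in scores.items()
    }


# Hypothetical toy data: a1, a2 belong to population A; b1, b2 to population B;
# u1 and u2 are individuals with unknown ancestry.
ibd_pairs = [
    ("a1", "a2", 40.0),
    ("a2", "u1", 30.0),
    ("u1", "u2", 25.0),
    ("b1", "b2", 35.0),
    ("b2", "u2", 5.0),
]
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
classes = ["A", "B"]

graph = build_ibd_graph(ibd_pairs)
one_round = predict(propagate(graph, labels, classes, rounds=1), classes)
two_rounds = predict(propagate(graph, labels, classes, rounds=2), classes)
print(one_round["u2"], two_rounds["u2"])
```

In this toy graph, `u2` has only a weak direct link to a labeled individual (`b2`) but a strong link to the unlabeled `u1`. After one round `u2` can only see `b2` and is assigned B; after a second round, information from population A reaches it through `u1` and the assignment changes to A. This illustrates, under the stated assumptions, why unlabeled vertices can carry useful signal between labeled ones.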