Benchmarking COX1 Embeddings across Genomic Foundation Models, kmers, and Imbalance-Aware Losses
Abstract
Generative AI in genomics increasingly relies on pretrained genomic foundation models (gLMs) as reusable sequence encoders, yet practical deployment faces persistent barriers: tokenization that is mismatched with biological signal, domain shift between pretraining corpora and target assays, and extreme long-tail label distributions that stress standard training objectives. We study these challenges in the ecologically central COI/COX1 gene by benchmarking an alignment-free pipeline that converts nucleotide sequences into fixed-length embeddings (mean-pooled hidden states) and trains lightweight MLP classifiers that predict each taxonomic rank independently, from Domain to Species. We evaluate two complementary regimes that together probe the limits of scalable genomic representation learning: eKOI (15,947 sequences; protist-rich; 11,047 species) and MetaCOXI (5.6M metazoan sequences; 743,671 species). Across diverse gLM families (autoregressive decoders and masked-language encoders) and explicit compositional baselines (overlapping kmer frequencies up to k=6), we find that the effective motif length induced by tokenization is a dominant driver of fine-rank separability, while corpus alignment (eukaryote vs. prokaryote pretraining) materially affects transfer even under identical tokenization. Finally, imbalance-aware objectives (weighted cross-entropy and a hybrid weighted+contrastive loss) can stabilize performance on rare taxa, but their benefit remains representation-dependent.
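As an illustration of the compositional baseline mentioned above, the following is a minimal sketch of overlapping kmer frequency features. The function names `kmer_frequencies` and `kmer_features` are hypothetical, and two choices are assumptions rather than details taken from the benchmark: windows containing ambiguous bases (such as N) are skipped, and the vectors for k = 1 through 6 are concatenated into one feature vector.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return normalized counts of all overlapping k-mers over A, C, G, T."""
    # Fixed lexicographic ordering of the 4**k possible k-mers.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):  # stride 1 gives overlapping windows
        j = index.get(seq[i:i + k])
        if j is not None:  # assumption: skip windows with ambiguous bases
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def kmer_features(seq, max_k=6):
    """Concatenate frequency vectors for k = 1..max_k (an assumed layout)."""
    features = []
    for k in range(1, max_k + 1):
        features.extend(kmer_frequencies(seq, k))
    return features
```

Under these assumptions, `kmer_features(seq)` yields a vector of 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 entries, which could be fed to the same kind of lightweight classifier as the pooled gLM embeddings.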