ON THE IMPACT OF EMBEDDING ANISOTROPY IN GENOMIC LANGUAGE MODELS FOR BACTERIAL TAXONOMY
Abstract
Genomic language models have emerged as powerful tools for representing DNA sequences, yet the impact of intrinsic properties of pre-trained embeddings, such as anisotropy, on downstream genomic tasks remains underexplored. In this work, we examine the geometric structure of DNABERT-2 embeddings derived from full-length 16S rRNA gene sequences and analyze how anisotropy affects bacterial taxonomic classification. We compare raw embeddings with post-processed representations obtained through a simple whitening transformation and evaluate their performance using distance-based classification across multiple taxonomic ranks. Our results show that DNABERT-2 embeddings exhibit severe anisotropy and that whitening substantially improves isotropy and consistently enhances classification performance, particularly at finer-grained taxonomic levels. These findings highlight the importance of embedding geometry when deploying genomic language models for downstream biological analysis.
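As a rough illustration of the two operations named above, the following sketch measures anisotropy and applies a whitening transformation with NumPy. It is a minimal example under stated assumptions, not the procedure used in this work: the abstract does not specify the anisotropy metric or the whitening variant, so mean pairwise cosine similarity and PCA-style whitening are assumed here, the function names are hypothetical, and the data are synthetic stand-ins rather than DNABERT-2 embeddings.

```python
import numpy as np


def mean_pairwise_cosine(X):
    """Average cosine similarity over all distinct pairs of rows.

    Assumed anisotropy proxy: values near 1 mean the vectors crowd into a
    narrow cone; values near 0 suggest a more isotropic spread.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n = X.shape[0]
    # Exclude the diagonal (self-similarity is always 1).
    return (S.sum() - np.trace(S)) / (n * (n - 1))


def fit_whitening(X, eps=1e-8):
    """Estimate the mean and a PCA-whitening matrix from reference embeddings."""
    mu = X.mean(axis=0, keepdims=True)
    cov = np.cov(X - mu, rowvar=False)
    U, s, _ = np.linalg.svd(cov)
    # Scale each principal direction by 1/sqrt(variance); eps guards against
    # division by zero for near-degenerate directions.
    W = U / np.sqrt(s + eps)
    return mu, W


def whiten(X, mu, W):
    """Center and whiten embeddings with previously fitted statistics."""
    return (X - mu) @ W


# Synthetic stand-in for embeddings: unequal per-dimension variance plus a
# large shared offset, which yields a strongly anisotropic point cloud.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16)) * np.linspace(0.1, 3.0, 16) + 5.0

mu, W = fit_whitening(X)
Z = whiten(X, mu, W)

before = mean_pairwise_cosine(X)
after = mean_pairwise_cosine(Z)
print(f"mean pairwise cosine before whitening: {before:.3f}")
print(f"mean pairwise cosine after whitening:  {after:.3f}")
```

In a classification setting, the whitening statistics would presumably be fitted on reference embeddings only and then applied unchanged to query embeddings before computing distances, so that no information from the queries leaks into the transformation.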