

Poster in Workshop: Machine Learning for Genomics Explorations (MLGenX)

Recurrent memory augmentation of GENA-LM improves performance on long DNA sequence tasks

Yuri Kuratov · Aleksei Shmelev · Veniamin Fishman · Olga Kardymon · Mikhail Burtsev


Abstract: DNA language models based on the transformer architecture represent a significant advancement in computational genomics. However, these models face a critical limitation: their maximum input lengths fall well short of individual vertebrate genes (ranging from $10^4$ to $10^5$ nucleotides) and complete genomes (typically around $10^9$ nucleotides). Among publicly available transformer-based DNA language models, the architecture with the longest sequence input, GENA-LM, is constrained to a maximum of only $3\cdot10^4$ nucleotides. In this study, we investigate the efficacy of the Recurrent Memory Transformer (RMT) in extending GENA-LM to genomic analysis tasks that require processing long DNA sequence inputs. Our results demonstrate that augmenting GENA-LMs with RMT substantially improves performance, particularly on tasks such as species classification and prediction of epigenetic features. This underscores the value of the recurrent memory approach for computational genomics and its potential for addressing the critical challenge of processing long sequence inputs.
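The core idea of the RMT augmentation can be illustrated with a minimal sketch: a long sequence is split into segments that fit the base model's input limit, a fixed number of memory tokens is prepended to each segment, and the updated memory states are carried over to the next segment. This is only an illustrative outline, not the actual GENA-LM/RMT implementation; the `encode` stand-in, `SEGMENT_LEN`, and `MEMORY_SIZE` are hypothetical placeholders.

```python
# Illustrative sketch of RMT-style segment-level recurrence.
# `encode` stands in for a transformer encoder; here it is the
# identity so the memory-passing logic is the focus.

SEGMENT_LEN = 512   # hypothetical per-segment input limit of the base model
MEMORY_SIZE = 10    # hypothetical number of memory tokens

def encode(tokens):
    # Placeholder for a transformer forward pass over one segment.
    return list(tokens)

def rmt_forward(sequence):
    """Process a long token sequence segment by segment, carrying a
    fixed-size memory between segments."""
    memory = [0] * MEMORY_SIZE          # initial (empty) memory tokens
    outputs = []
    for start in range(0, len(sequence), SEGMENT_LEN):
        segment = sequence[start:start + SEGMENT_LEN]
        # Memory tokens are concatenated with the segment's input tokens.
        hidden = encode(memory + segment)
        # The memory positions of the output become the next segment's memory.
        memory = hidden[:MEMORY_SIZE]
        outputs.extend(hidden[MEMORY_SIZE:])
    return outputs, memory

outputs, final_memory = rmt_forward(list(range(2000)))
```

Because the memory is a fixed number of tokens regardless of sequence length, the effective context grows linearly with the number of segments while each forward pass stays within the base model's input limit.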
