Methylation-Aware Embedding Geometry Emerges from Bisulfite Pretraining in DNA Language Models
Jiajie Xiao ⋅ Salwan Butrus ⋅ Nathan Hunkapiller
Abstract
DNA methylation encodes regulatory information beyond the DNA sequence, but most genomic language models (gLMs) miss this modality because they are pretrained on native DNA only. We test whether a widely used DNA checkpoint can be retrofitted into a methylation-aware model by continual pretraining on bisulfite sequencing (BS-seq) reads, in which methylation is implicitly encoded in token identities via C$\rightarrow$T conversion. Rather than proposing a new architecture, we look for compact, interpretable evidence that methylation is encoded in representation space. Using DNABERT2 continually pretrained on a multi-tissue BS-seq atlas, we report two simple geometric diagnostics: (i) per-read embedding norms become bimodal and align with hypo- and hypermethylated contexts, and (ii) cosine distances between genomically matched tumor--normal read pairs increase substantially after BS-seq adaptation, relative to the native checkpoint. These results suggest that simple BS-seq retrofitting can endow a standard DNA gLM with biologically meaningful, label-light epigenetic sensitivity.
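The two diagnostics and the C$\rightarrow$T encoding can be illustrated with a minimal Python sketch. This is not the authors' code: the helper names (`bisulfite_convert`, `l2_norm`, `cosine_distance`) are hypothetical, the conversion considers only a single strand, and the embeddings are assumed to be pooled per-read vectors obtained from whichever checkpoint is being compared.

```python
import math


def bisulfite_convert(read, methylated_positions):
    """Simulate bisulfite conversion on one strand (illustrative only).

    Unmethylated cytosines are read out as T; methylated cytosines
    (indices listed in `methylated_positions`) remain C.
    """
    converted = []
    for i, base in enumerate(read):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def l2_norm(vec):
    """Euclidean norm of a per-read embedding (diagnostic i)."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_distance(a, b):
    """1 - cosine similarity between two paired read embeddings (diagnostic ii)."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (l2_norm(a) * l2_norm(b))


# The same sequence yields different token strings depending on methylation.
hyper = bisulfite_convert("ACGTCG", {1, 4})   # both CpG cytosines methylated
hypo = bisulfite_convert("ACGTCG", set())     # no cytosines methylated
```

Under these assumptions, diagnostic (i) would amount to plotting a histogram of `l2_norm` over many read embeddings, and diagnostic (ii) to comparing `cosine_distance` for matched tumor--normal pairs before and after BS-seq adaptation.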