Poster in Workshop: AI for Nucleic Acids (AI4NA)
Understanding DNA Discrete Diffusion for Engineering Regulatory DNA Sequences
Anirban Sarkar · Yijie Kang · Nirali Somia · Peter Koo
Engineering regulatory DNA sequences with precise activity levels remains a major challenge in medicine and biotechnology due to the vast combinatorial space of possible sequences and the complex regulatory grammars governing gene expression. DNA discrete diffusion (D3) has emerged as a promising approach for learning these distributions and generating biologically relevant sequences, yet several key aspects of its capabilities remain unexplored. Here we systematically investigate D3’s performance in biologically relevant, understudied scenarios. First, we demonstrate that D3 maintains robust performance even with limited training data, highlighting its practical utility in real-world applications where data are scarce. Second, we extend D3’s conditional generation capabilities for categorical data, employing classifier-free guidance to improve the quality and specificity of generated sequences. Third, we analyze sequence trajectories during the diffusion process, providing insights into how discrete diffusion navigates the sequence-function landscape. Together, these findings expand our understanding of D3’s strengths and limitations, while introducing new methodological advances for engineering functional regulatory DNA sequences.