Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models
Abstract
Masked Diffusion Language Models (MDLMs) have emerged as an alternative to autoregressive language models, with a denoising objective that in principle enables more uniform context utilisation. We study the context comprehension of MDLMs and identify two key limitations. First, despite a more global training objective, MDLMs exhibit a strong locality bias: performance depends heavily on the proximity of relevant information to the prediction target. Second, we show that appending mask tokens, which generation requires, can substantially degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, impairing the model’s ability to process relevant context. To mitigate this effect, we propose a mask-agnostic loss that enforces prediction invariance to the number of appended masks. Fine-tuning with this objective significantly improves robustness to appended masks. Overall, our results reveal important shortcomings of current MDLMs and suggest concrete directions for improving context comprehension.