Unifying Autoregressive and Discrete Diffusion Language Modeling via Cross-Regressive Decoding
Dmitry Abulkhanov ⋅ Daniil Strizhakov ⋅ Maxim Panov
Abstract
Inference acceleration can unintentionally change model behavior, complicating alignment-sensitive deployments where the effects of post-training (such as RLHF) should be preserved. We introduce $\textbf{Cross-Regression}$, a decoding-time method that accelerates generation while providing an explicit mechanism to preserve or relax distributional fidelity. Cross-Regression augments a pretrained autoregressive transformer with a dual-stream design: a frozen control stream computes exact next-token probabilities, and a predictive stream proposes multi-token drafts in parallel. An energy-based acceptance test, derived from the per-token log-probability ratio between the control and predictive streams, determines how many proposed tokens can be safely committed. The method exposes an explicit control that trades off between $\textit{lossless sampling}$ and a faster $\textit{lossy regime}$ with controllable deviation. Across models from 1.5B to 70B parameters, we observe strong scaling of acceptance length and realize $3$–$6\times$ speedups with near-complete quality retention across reasoning, code, and dialogue benchmarks, and we demonstrate modality transfer by accelerating Whisper decoding.
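The abstract does not spell out the energy-based acceptance test, so the following is only a minimal sketch of how a per-token log-probability ratio could decide how many drafted tokens to commit. It swaps in the standard speculative-sampling acceptance rule (accept a token with probability min(1, p_control / p_draft)) rather than the authors' actual test. The function name `accepted_prefix_length` and the tolerance `tau` are hypothetical: `tau = 0` gives the unrelaxed rule, and `tau > 0` is an assumed way to model a lossy regime that accepts more tokens.

```python
import math
import random


def accepted_prefix_length(control_logprobs, draft_logprobs, tau=0.0, rng=None):
    """Return how many leading draft tokens are accepted.

    control_logprobs[i]: log-probability of draft token i under the control stream.
    draft_logprobs[i]:   log-probability of the same token under the predictive stream.
    tau: hypothetical tolerance (an assumption, not from the paper); 0.0 is the
         plain speculative-sampling rule, larger values accept more tokens.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible

    accepted = 0
    for lp_control, lp_draft in zip(control_logprobs, draft_logprobs):
        # Per-token log ratio between the control and predictive streams.
        log_ratio = lp_control - lp_draft
        # Acceptance probability min(1, exp(log_ratio + tau)), computed in log space.
        accept_prob = math.exp(min(0.0, log_ratio + tau))
        if rng.random() >= accept_prob:
            break  # first rejection ends the committed prefix
        accepted += 1
    return accepted


# Usage: the streams agree on the first two tokens and disagree on the third.
n = accepted_prefix_length(
    control_logprobs=[-0.2, -1.1, -6.0],
    draft_logprobs=[-0.2, -1.1, -0.3],
)
print(n)
```

In the standard speculative-sampling setting, the unrelaxed rule preserves the control distribution only when a rejected position is resampled from the normalized residual distribution, a step this sketch omits. Whether Cross-Regression uses that correction or a different one is not stated in the abstract.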