ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training
Abstract
Token-level attention tuning -- a class of training-free methods including Post-hoc Attention Steering (PASTA) and Attention Calibration (ACT) -- has emerged as a promising approach for improving frozen LLMs via interpretable interventions. However, these methods rely on auxiliary heuristics to identify important task-specific tokens, which can introduce bias and limit applicability when token importance is ambiguous or when optimized kernels make attention maps inaccessible. We propose a simpler alternative: intervening only on the initial token (e.g., the BOS token in LLaMA). We theoretically show that adding lightweight biases to this token's attention logits systematically shifts and reshapes downstream attention patterns -- an effect amplified by its natural role as an attention sink. Empirically, we find that this tuning can improve LLM performance and better elicit pretrained knowledge, with stronger effects in early layers and distinct scaling preferences across attention heads. Building on these findings, we introduce ZeroTuning, a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token, with no parameter updates. We present two variants: a supervised mode that calibrates on validation examples, and an unsupervised mode that directly minimizes output entropy. ZeroTuning requires no KV-cache or decoding changes and is kernel-agnostic (it works with SDPA and FlashAttention). It requires only four lines of modification to standard \texttt{LlamaAttention} code, achieves gains across 15 datasets, and outperforms prior, more complex methods. For example, on Llama-3.1-8B, it yields relative improvements of 19.9% on classification, 4.5% on question answering, and 2.1% on dialogue. ZeroTuning also works out of the box with quantized inference and maintains its improvements as context length increases.
Our work provides a lightweight tool for inference-time improvement, advancing both optimization and interpretability. Our code and runnable demo are available at https://anonymous.4open.science/r/ZeroTuning.
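To make the core intervention concrete, the following is a minimal NumPy sketch of the mechanism the abstract describes: adding a per-head scalar bias to the attention logit of the initial token (key position 0) before the softmax, so that positive biases pull attention mass toward the sink token and negative biases push it away. This is an illustrative sketch only, not the authors' implementation; the function name, tensor shapes, and `head_bias` parameter are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_initial_token_bias(q, k, v, head_bias):
    """Scaled dot-product attention with a head-specific scalar bias
    added to the logit of the initial token (key position 0).

    q, k, v: (heads, seq, dim); head_bias: (heads,).
    Shapes and names are illustrative, not the paper's code.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (heads, seq, seq)
    logits[:, :, 0] += head_bias[:, None]           # bias only the first key column
    return softmax(logits, axis=-1) @ v

# Toy usage: one head up-weights the initial token, the other down-weights it.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 8))
k = rng.normal(size=(2, 4, 8))
v = rng.normal(size=(2, 4, 8))
out = attention_with_initial_token_bias(q, k, v, np.array([1.0, -1.0]))
```

Because the bias enters additively before the softmax, it only rescales how much each query attends to the sink token; the relative weights among all other tokens are preserved, which matches the abstract's claim that the method needs no KV-cache or decoding changes.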