BUDDY: Blending Training and Deployment Data with Weighted Expert Ensembles for Post-hoc LLM Calibration
Abstract
Large language models (LLMs) are increasingly deployed in high-stakes domains such as healthcare, but poor calibration and brittleness under distribution shift at deployment limit their utility. While existing approaches such as fine-tuning and conformal prediction can improve calibration, they are often computationally expensive or unreliable in small-sample regimes. We propose a post-hoc calibration framework for a setting in which we have access to a large corpus dataset and a small dataset drawn from a target deployment distribution. We assume the two datasets correspond to the same task and overlap to some extent, but that the target distribution is shifted from the corpus distribution. Our method learns an ensemble calibrator on the corpus distribution, applies group-conditional temperature scaling to correct structured miscalibration, and then performs lightweight target adaptation by optimizing a discriminator-weighted, tilted empirical risk that emphasizes target-like examples. We provide finite-sample guarantees for groupwise temperature estimation and for the resulting groupwise calibration error. Empirically, across multiple-choice and open-ended question answering (QA) datasets, our method consistently improves calibration over baselines under distribution shift with few target samples.
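The two post-hoc components named above, group-conditional temperature scaling and a discriminator-weighted tilted empirical risk, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the grid-search fitting procedure, the exponential-tilt form of the risk, and all function names are assumptions introduced here for clarity.

```python
import numpy as np

def fit_group_temperatures(logits, labels, groups,
                           grid=np.linspace(0.25, 4.0, 76)):
    """Fit one temperature per group by minimizing the held-out negative
    log-likelihood over a fixed grid of candidate temperatures.
    Returns a dict mapping group id -> fitted temperature."""
    fitted = {}
    for g in np.unique(groups):
        mask = groups == g
        z_g, y_g = logits[mask], labels[mask]
        best_t, best_nll = 1.0, np.inf
        for t in grid:
            z = z_g / t
            z = z - z.max(axis=1, keepdims=True)          # stabilize softmax
            logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
            nll = -logp[np.arange(len(y_g)), y_g].mean()  # cross-entropy
            if nll < best_nll:
                best_t, best_nll = t, nll
        fitted[g] = best_t
    return fitted

def tilted_weighted_risk(losses, target_weights, tilt=1.0):
    """Discriminator-weighted, exponentially tilted empirical risk:
    (1/tilt) * log( sum_i w_i * exp(tilt * loss_i) ),
    where w_i are normalized discriminator scores that upweight
    target-like examples; larger tilt emphasizes high-loss examples."""
    w = target_weights / target_weights.sum()
    return np.log(np.sum(w * np.exp(tilt * losses))) / tilt
```

As a usage example, feeding `fit_group_temperatures` logits that are systematically overconfident (e.g. true logits scaled up by a constant factor) recovers a softening temperature greater than 1 for the affected group; by Jensen's inequality, `tilted_weighted_risk` always upper-bounds the plain weighted mean loss, which is the sense in which the tilt emphasizes hard, target-like examples.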