FedAgree: Leveraging Federated Checkpoints for Label-Free OOD Evaluation via Agreement
Abstract
Federated Learning (FL) has emerged as a popular paradigm in many domains, enabling collaborative model training across partners while preserving their privacy. However, distribution shifts under realistic conditions can substantially degrade performance when models are deployed at new sites. Out-of-distribution (OOD) performance estimation is thus critical, but obtaining labelled OOD data is frequently impractical. In healthcare, for example, shifts across hospitals are common due to different acquisition devices, patient populations, or clinical protocols; assessing the resulting degradation is essential, yet labelled evaluation data are often scarce, time-consuming to collect, or too expensive. Agreement-on-the-Line (AotL) (Baek et al., 2022) addresses this by predicting OOD accuracy without labelled data from the agreement between pairs of model checkpoints, but obtaining multiple models for this purpose is computationally expensive. We observe that FL naturally resolves this: diverse client checkpoints are already produced during training at no additional cost. We thus propose FedAgree, a method for agreement-based OOD evaluation in federated settings that leverages both local and cross-client checkpoints. We introduce five checkpoint strategies that progressively expand the use of cross-client information and evaluate them on standard OOD benchmarks and diverse medical imaging modalities (dermoscopy, retinopathy, histopathology), under both IID and non-IID settings. Our empirical results demonstrate that FedAgree consistently outperforms AotL and confidence-based baselines, confirming that federated settings offer an ideal environment for practical, label-free OOD evaluation.
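To make the agreement-based idea concrete, the following is a minimal sketch of label-free OOD accuracy estimation from pairwise checkpoint agreement, in the spirit of AotL as we understand it: agreement rates are probit-transformed, a line is fitted from in-distribution (ID) agreement to OOD agreement, and the same line is then applied to each checkpoint's ID accuracy. It is not the FedAgree procedure, and it does not implement any of the five checkpoint strategies. The function names (`probit`, `agreement`, `estimate_ood_accuracy`) and the input layout (one array of predicted class indices per checkpoint) are assumptions made for illustration only.

```python
from itertools import combinations
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()  # standard normal, used for the probit transform


def probit(p, eps=1e-6):
    """Inverse standard-normal CDF, clipped away from 0 and 1 to stay finite."""
    p = min(max(float(p), eps), 1.0 - eps)
    return _STD_NORMAL.inv_cdf(p)


def agreement(preds_a, preds_b):
    """Fraction of samples on which two checkpoints predict the same class."""
    return float(np.mean(np.asarray(preds_a) == np.asarray(preds_b)))


def estimate_ood_accuracy(id_preds, ood_preds, id_labels):
    """Estimate each checkpoint's OOD accuracy without OOD labels.

    id_preds, ood_preds: lists with one 1-D array of predicted classes per
    checkpoint (hypothetical layout; at least three checkpoints are assumed
    so that the line is fitted on several pairs).
    id_labels: ground-truth labels for the ID evaluation set.
    """
    pairs = list(combinations(range(len(id_preds)), 2))
    # Probit-scaled agreement of every checkpoint pair, on ID and OOD data.
    x = np.array([probit(agreement(id_preds[i], id_preds[j])) for i, j in pairs])
    y = np.array([probit(agreement(ood_preds[i], ood_preds[j])) for i, j in pairs])
    # Fit the agreement line: probit(agr_ood) ~ slope * probit(agr_id) + bias.
    slope, bias = np.polyfit(x, y, deg=1)

    id_labels = np.asarray(id_labels)
    estimates = []
    for preds in id_preds:
        id_acc = float(np.mean(np.asarray(preds) == id_labels))
        # Reuse the agreement line for accuracy, then map back to [0, 1].
        estimates.append(_STD_NORMAL.cdf(slope * probit(id_acc) + bias))
    return estimates
```

In a federated setting, the entries of `id_preds` and `ood_preds` would be the predictions of client checkpoints that training already produces; which checkpoints are paired, and how cross-client information is used, is exactly what the proposed strategies vary.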