Keywords: [ calibration ] [ histogram binning ] [ uncertainty quantification ]
We investigate the relationship between commonly considered notions of multiclass calibration and the calibration algorithms used to achieve these notions, leading to two broad contributions. First, we propose a new and arguably natural notion of top-label calibration, which requires the reported probability of the most likely label to be calibrated. Along the way, we highlight certain philosophical issues with the closely related and popular notion of confidence calibration. Second, we outline general 'wrapper' multiclass-to-binary (M2B) algorithms that can be used to achieve confidence, top-label, and class-wise calibration, using underlying binary calibration routines. Our wrappers can also be generalized to other notions of calibration, if required for certain practical applications. We instantiate these wrappers with the binary histogram binning (HB) algorithm, and show that the overall procedure has distribution-free calibration guarantees. In an empirical evaluation, we find that with the right M2B wrapper, HB performs significantly better than other calibration approaches. Code for this work is available at https://github.com/aigen/df-posthoc-calibration.