Towards Statistical Verification for Trustworthy AI
Abstract
Although modern machine learning systems such as large language models perform well in low-stakes settings, their deployment in autonomous decision-making roles raises serious safety and reliability concerns. Contemporary alignment strategies (e.g., RLHF and RLAIF) are widely used to shape system behavior, but they typically rely on empirical validation and offer little protection against falsely certifying that safety constraints hold under statistical uncertainty. A growing class of alignment methods addresses this limitation by enforcing probabilistic behavioral constraints. We analyze these approaches through the lens of statistical certification and introduce a unifying proof framework that captures a broad family of probabilistic alignment methods while making their statistical assumptions explicit. Using this framework, we identify a critical failure mode in modern alignment pipelines: reliance on proxy data (e.g., inferred feedback or learned evaluators) can introduce dependencies that violate the assumptions required for valid certification. As a result, a system may appear statistically certified while still violating its safety constraints. We characterize how such dependency violations compromise the trustworthiness of probabilistic guarantees and derive sufficient conditions under which valid certification can be recovered.