SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
Abstract
Recursive self-improvement is shifting from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce \textbf{SAHOO}, a practical framework for monitoring and controlling drift through three complementary safeguards: (i) the \emph{Goal Drift Index} (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) \emph{constraint preservation} checks that enforce safety-critical invariants (e.g., syntactic correctness, non-hallucination); and (iii) \emph{regression-risk} quantification to flag when improvement cycles undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains (e.g., +18.3\% on code, +16.8\% on reasoning) while preserving constraints in two domains and keeping violation rates low in truthfulness, with thresholds calibrated on a small validation set (18 tasks, 3 cycles). We further map the capability–alignment frontier, showing that early cycles are efficient but alignment costs rise in later cycles, and exposing domain-specific tensions (e.g., fluency vs.\ factuality). SAHOO thus makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.