The Selective Safety Trap: How LLMs Scaling and Alignment Fail to Generalize Across Minority Demographics
Abstract
We challenge the prevailing assumption that large language models (LLMs) safety alignment generalizes as a semantic capability across protected groups. By conducting a controlled adversarial stress test with 44,000 prompts balanced across 16 demographics and two languages, we demonstrate that current models rely on selective memorization rather than universal principles. We identify a rigid "two-tiered" safety hierarchy: models robustly protect high-visibility groups in US discourse (e.g., LGBTQIA+) while systematically neglecting marginalized communities (e.g., disabilities), with defense rates varying by up to 33\% for identical attack vectors. Crucially, we report an Inverse Scaling of Equity: contrary to standard scaling laws, increasing model parameters exacerbates these disparities, linearly increasing the variance in safety performance between groups. These findings suggest that current alignment techniques incentivize the overfitting of dominant safety priors, where scaling functions as a bias magnifier rather than a solution to robustness.