The Viability Boundary of Differential Privacy
Arinbjörn Kolbeinsson ⋅ Benedikt Kolbeinsson
Abstract
Differentially private synthetic data can enable data sharing without compromising individual privacy, but DP-SGD adds noise that can destroy utility when training data is scarce. How much data is enough is poorly understood. We characterise a sharp \emph{viability boundary}, a training set size below which DP models produce random-chance output and above which they approach non-private baselines. Across six tabular datasets spanning healthcare, census and ecology domains, we find that the ratio $N/d$ (training samples per encoded dimension) consistently predicts this transition, with viability emerging between $N/d \approx 50$ and $300$. The boundary is insensitive to model size. The data cost of strong privacy is sublinear, with $\varepsilon = 1$ requiring only ${\sim}2.5\times$ more data than $\varepsilon = 10$, well below formal DP-ERM predictions. A controlled dimension-reduction experiment confirms that $N/d$, not $N$ alone, drives viability. These results give practitioners an actionable heuristic: check $N/d$ before investing in DP synthetic data generation, and prefer feature engineering over data collection when the ratio is too low.
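The $N/d$ check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The function names are hypothetical, and the encoding rule (one dimension per numeric column, one-hot for categorical columns) is an assumption, since the abstract does not specify how $d$ is computed. The thresholds 50 and 300 are the approximate bounds quoted in the abstract for where viability emerges.

```python
def encoded_dimension(n_numeric, categorical_cardinalities):
    """Encoded dimension d.

    Assumption: each numeric column contributes one dimension and each
    categorical column is one-hot encoded (one dimension per category).
    """
    return n_numeric + sum(categorical_cardinalities)


def samples_per_dimension(n_samples, d):
    """The ratio N/d: training samples per encoded dimension."""
    if d <= 0:
        raise ValueError("encoded dimension d must be positive")
    return n_samples / d


def viability_band(ratio, low=50.0, high=300.0):
    """Place N/d relative to the approximate range reported in the abstract.

    The labels are illustrative only; where the boundary falls inside
    [low, high] depends on the dataset.
    """
    if ratio < low:
        return "below"
    if ratio <= high:
        return "transition"
    return "above"


# Hypothetical table: 6 numeric columns and three categorical columns
# with 2, 7 and 16 categories, trained on 3100 rows.
d = encoded_dimension(6, [2, 7, 16])
ratio = samples_per_dimension(3100, d)
print(d, ratio, viability_band(ratio))
```

Because $d$ sits in the denominator, dropping or merging high-cardinality categorical columns raises $N/d$ without collecting any new rows, which is the feature-engineering route the abstract recommends when the ratio is too low.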