

Poster in Workshop: Privacy Regulation and Protection in Machine Learning

PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Charlie Hou · Akshat Shrivastava · Hongyuan Zhan · Rylan Conway · Trang Le · Adithya Sagar · Giulia Fanti · Daniel Lazar


Abstract: On-device training is the most common way to use private user data to train machine learning (ML) models. This has major drawbacks: (1) user devices are too small to train large models on-device, (2) it is communication- and computation-intensive for users, and (3) it can be hard to deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device in the high-privacy regime ($\epsilon = 1.29$). We achieve these results while using 7x less total client computation and 40x less communication than on-device training. Altogether, these results suggest that in the high-privacy regime, training on DP synthetic data may be a better option than training models on-device on private distributed data.
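The abstract does not spell out the generation procedure, but PrE-Text is named after the Private Evolution framework, which iteratively refines server-side synthetic samples using only differentially private aggregate feedback from clients. Below is a minimal, hypothetical Python sketch of such a loop; the function names (`vary`, `dp_client_votes`), loop structure, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a Private-Evolution-style synthetic text loop:
# the server evolves a pool of synthetic texts, and clients only return a
# noisy (differentially private) histogram of which candidates are closest
# to their own data. Raw user text never leaves the devices.
from typing import Callable, List


def pre_text_synthetic_data(
    seed_samples: List[str],
    vary: Callable[[str], str],                      # assumed: an LLM paraphrase/variation step
    dp_client_votes: Callable[[List[str]], List[float]],  # assumed: DP nearest-neighbor histogram from clients
    num_rounds: int = 10,
    variations_per_sample: int = 4,
    survivors: int = 20,
) -> List[str]:
    """Evolve a pool of synthetic texts using only DP aggregate feedback."""
    pool = list(seed_samples)
    for _ in range(num_rounds):
        # Expand the pool with server-side variations of the current survivors.
        candidates = pool + [vary(s) for s in pool for _ in range(variations_per_sample)]
        # Clients privately vote for the candidates closest to their data;
        # only the noisy histogram of votes is sent back to the server.
        scores = dp_client_votes(candidates)
        ranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
        pool = [text for _, text in ranked[:survivors]]
    return pool
```

In the workflow the abstract describes, the resulting synthetic pool would then be used to train the small (on-device-sized) model centrally on the server, replacing the on-device federated training step.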
