

Synthetic Data Generation: Quality, Privacy, Bias

Sergul Aydore · Krishnaram Kenthapadi · Haipeng Chen · Edward Choi · Jamie Hayes · Mario Fritz · Rachel Cummings

Fri 7 May, 7 a.m. PDT

Data are the most valuable ingredient of machine learning models that help researchers and companies make informed decisions. However, access to rich, diverse, and clean datasets is not always possible. One reason is the substantial time needed for data collection, especially when manual annotation is required. Another is the need to protect privacy whenever raw data contain sensitive information about individuals and hence cannot be shared directly. A powerful solution that can address both of these scenarios is generating synthetic data. Thanks to recent advances in generative models, it is possible to create realistic synthetic samples that closely match the distribution of complex, real data. When labeled data are limited, synthetic data can augment the training data to mitigate overfitting. When privacy must be protected, data curators can share the synthetic data instead of the original data, preserving the utility of the original data while protecting privacy. Despite these substantial benefits, synthetic data generation remains an ongoing technical challenge. Although the two scenarios of limited data and privacy concerns share similar technical challenges such as quality and fairness, they are often studied separately. We will bring together researchers from both fields to discuss challenges and advances in synthetic data generation.
