

Workshop

Will Synthetic Data Finally Solve the Data Access Problem?

Zheng Xu · Peter Kairouz · Herbie Bradley · Rachel Cummings · Giulia Fanti · Lipika Ramaswamy · Chulin Xie

Peridot 202-203

Sat 26 Apr, 5:55 p.m. PDT

Access to large-scale, high-quality data has been shown to be one of the most important factors in the performance of machine learning models. Recent work shows that large (language) models benefit greatly from training on massive data drawn from diverse (domain-specific) sources and from alignment with user intent. However, the use of certain data sources can raise privacy, fairness, copyright, and safety concerns. The impressive performance of generative artificial intelligence has popularized the use of synthetic data, and many recent works suggest that (guided) synthesis can be useful for both general-purpose and domain-specific applications. For example, Yu et al. 2024, Xie et al. 2024, and Hou et al. 2024 demonstrate promising preliminary results in synthesizing private-like data, while Wu et al. 2024 highlight remaining gaps and challenges. As techniques like self-instruct (Wang et al. 2021) and self-alignment (Li et al. 2024) gain traction, researchers are questioning the implications of synthetic data (Alemohammad et al. 2023, Dohmatob et al. 2024, Shumailov et al. 2024). Will synthetic data ultimately solve the data access problem for machine learning?

This workshop seeks to address that question by highlighting both the limitations and the opportunities of synthetic data. It aims to bring together researchers working on algorithms and applications of synthetic data, general data access for machine learning, privacy-preserving methods such as federated learning and differential privacy, and large-model training, to discuss lessons learned and chart important future directions.
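To make the idea of privacy-preserving synthetic data concrete, the following is a minimal toy sketch (not any method from the cited papers) of one of the simplest differentially private generators: sample from a Laplace-noised histogram of the real data. Each record contributes to exactly one count, so the histogram has sensitivity 1 and Laplace noise of scale 1/epsilon yields epsilon-DP; the function name, bin count, and epsilon value are illustrative assumptions.

```python
import numpy as np

def dp_synthetic_sample(data, bins, epsilon, n_samples, seed=0):
    """Toy epsilon-DP synthetic data via a Laplace-noised histogram.

    Each record falls into exactly one bin, so every count query has
    sensitivity 1; adding Laplace(1/epsilon) noise to the counts makes
    the histogram (and anything sampled from it) epsilon-DP.
    """
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(data, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Clip negative noisy counts and renormalize into a distribution.
    probs = np.clip(noisy, 0.0, None)
    probs = probs / probs.sum()
    # Pick a bin per sample, then draw uniformly within that bin.
    idx = rng.choice(len(probs), size=n_samples, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])

real = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=10_000)
synth = dp_synthetic_sample(real, bins=20, epsilon=1.0, n_samples=1_000)
```

The sketch also shows why synthetic data is not a free lunch: the noise that buys privacy distorts the distribution, and the tension between fidelity and privacy is exactly the kind of gap the workshop abstract points to.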

