Synthetic Data Generation: Quality, Privacy, Bias

Sergul Aydore, Krishnaram Kenthapadi, Haipeng Chen, Edward Choi, Jamie Hayes, Mario Fritz, Rachel Cummings


Data are the most valuable ingredient of machine learning models that help researchers and companies make informed decisions. However, access to rich, diverse, and clean datasets is not always possible. One reason for the lack of rich datasets is the substantial time needed for data collection, especially when manual annotation is required. Another is the need to protect privacy whenever raw data contain sensitive information about individuals and hence cannot be shared directly. A powerful solution that can address both of these scenarios is generating synthetic data. Thanks to recent advances in generative models, it is possible to create realistic synthetic samples that closely match the distribution of complex, real data. When labeled data are limited, synthetic data can augment the training data to mitigate overfitting. When privacy must be protected, data curators can share synthetic data instead of the original data, preserving the utility of the original data while protecting privacy. Despite the substantial benefits of synthetic data, generating it remains an ongoing technical challenge. Although the two scenarios of limited data and privacy concerns share similar technical challenges, such as quality and fairness, they are often studied separately. We will bring together researchers from both fields to discuss challenges and advances in synthetic data generation.


Fri 7:00 a.m. - 7:10 a.m.

Opening Remarks

Sergul Aydore
Fri 7:10 a.m. - 7:35 a.m.

Mihaela van der Schaar is John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge, a Turing Faculty Fellow at The Alan Turing Institute in London, and a Chancellor’s Professor at UCLA. Mihaela was elected IEEE Fellow in 2009. She has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), an NSF Career Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award. Mihaela’s work has also led to 35 USA patents (many widely cited and adopted in standards) and 45+ contributions to international standards for which she received 3 International ISO (International Organization for Standardization) Awards.

Mihaela van der Schaar
Fri 7:35 a.m. - 7:40 a.m.
Q&A with Mihaela van der Schaar (Q&A)
Fri 7:40 a.m. - 7:42 a.m.
Introducing contributed talks 1-2 (Intro)
Jamie Hayes
Fri 7:42 a.m. - 7:51 a.m.
Contributed Talk: Synthetic Data for Model Selection (Contributed Talk)
Nadav Bhonker, Alon Shoshan
Fri 7:51 a.m. - 8:00 a.m.
Contributed Talk: Ensembles of GANs for synthetic training data generation (Contributed Talk)   
Gabriel Eilertsen
Fri 8:00 a.m. - 8:01 a.m.
Introducing Jan Kautz (Intro)
Jamie Hayes
Fri 8:01 a.m. - 8:25 a.m.

Jan Kautz is VP of Learning and Perception Research at NVIDIA. Jan and his team pursue fundamental research in the areas of computer vision and deep learning, including visual perception, geometric vision, generative models, and efficient deep learning. His team's work has been recognized with various awards and has been regularly featured in the media. Before joining NVIDIA in 2013, Jan was a tenured faculty member at University College London. He holds a BSc in Computer Science from the University of Erlangen-Nürnberg (1999) and an MMath from the University of Waterloo (1999), received his PhD from the Max-Planck-Institut für Informatik (2003), and worked as a post-doctoral researcher at the Massachusetts Institute of Technology (2003-2006).

Jan Kautz
Fri 8:25 a.m. - 8:30 a.m.
Q&A with Jan Kautz (Q&A)
Fri 8:30 a.m. - 9:00 a.m.

Please join us in GatherTown (using Firefox or Chrome) for our first poster session:

  • Bayesian Perspective on Visual Data Augmentation for Efficient Utilization of Sub-sampled Data

  • One-shot GAN: Learning to Generate Samples from Single Images and Videos

  • You only need adversarial supervision for semantic image synthesis

  • Towards creativity characterization of generative models via group-based subset scanning

  • Unconditional Synthesis of Complex Scenes Using a Semantic Bottleneck

  • Evaluating the Quality of Synthetic Images of Porous Media: A morphological and physics-based approach

  • Joint Text and Label Generation for Spoken Language Understanding

  • What if I don’t have in-domain unlabeled data for semi-supervised learning? Well, generate some!

Fri 9:00 a.m. - 9:01 a.m.
Introducing Jinsung Yoon (Intro)
Edward Choi
Fri 9:01 a.m. - 9:25 a.m.

Jinsung Yoon is a research scientist at Google Cloud AI. Prior to Google Cloud, Jinsung was a PhD student in the Electrical and Computer Engineering Department at UCLA. He received his PhD from UCLA in 2020; his thesis, on machine learning for medicine, was titled "End-to-End Machine Learning Frameworks for Medicine: Data Imputation, Model Interpretation and Synthetic Data Generation". His main research interests have been synthetic data generation with privacy guarantees, data imputation, model interpretation, and transfer learning using adversarial learning and reinforcement learning frameworks. He has published various papers and served as a reviewer for top-tier machine learning conferences (NeurIPS, ICML, ICLR, AAAI).

Jinsung Yoon
Fri 9:25 a.m. - 9:30 a.m.
Q&A with Jinsung Yoon (Q&A)
Fri 9:30 a.m. - 9:32 a.m.
Introducing contributed talks 3-4 (Intro)
Jamie Hayes
Fri 9:32 a.m. - 9:41 a.m.
Contributed Talk: Few-shot learning via tensor hallucination (Contributed Talk)   
Michalis Lazarou
Fri 9:41 a.m. - 9:50 a.m.
Contributed Talk: Leveraging Public Data for Practical Private Query Release (Contributed Talk)   
Terrance Liu
Fri 9:50 a.m. - 9:51 a.m.
Introducing Manuela M. Veloso (Intro)
Sergul Aydore
Fri 9:51 a.m. - 10:15 a.m.

Manuela M. Veloso joined J.P. Morgan as Managing Director to create and head the Artificial Intelligence Research Lab. With her group, she investigates opportunities for automated, optimized, and novel approaches to AI in Finance. Veloso is on leave from Carnegie Mellon University (CMU), where she is Herbert A. Simon University Professor in the School of Computer Science and was the Head of the Machine Learning Department. She conducts research in AI, Robotics, and Machine Learning. At CMU, she founded and directs the CORAL research laboratory, for the study of autonomous agents that Collaborate, Observe, Reason, Act, and Learn. Veloso and her students research a variety of autonomous robots, including mobile service robots and soccer robots. Veloso is a Fellow of the AAAI, AAAS, ACM, and IEEE. She is Einstein Chair Professor of the Chinese National Academy of Science, the co-founder and past President of RoboCup, and past President of AAAI. As of now, Professor Veloso has graduated 40 PhD students and co-authored more than 300 journal and conference publications.

Manuela Veloso
Fri 10:15 a.m. - 10:20 a.m.
Q&A with Manuela M. Veloso (Q&A)
Fri 10:20 a.m. - 10:50 a.m.

Please join us in GatherTown (using Firefox or Chrome) for our second poster session:

  • Privacy Preserving Object Detection

  • Differentially Private Query Release through adaptive projection

  • Leveraging Public Data for Practical Private Query Release

  • Extremely Private Supervised Learning

  • Overcoming Barriers to Data Sharing with Medical Image Generation: A Comprehensive Evaluation

  • PrivSyn: Differentially Private Data Synthesis

  • Privacy-preserving High-dimensional Data Collection with Federated Generative Autoencoder

  • FFPDG: Fast, Fair and Private Data Generation

Fri 10:50 a.m. - 10:51 a.m.
Introducing Stefano Ermon (Intro)
Krishnaram Kenthapadi
Fri 10:51 a.m. - 11:15 a.m.

Stefano Ermon is an Assistant Professor of Computer Science in the CS Department at Stanford University, where he is affiliated with the Artificial Intelligence Laboratory, and a fellow of the Woods Institute for the Environment. His research is centered on techniques for probabilistic modeling of data and is motivated by applications in the emerging field of computational sustainability. He has won several awards, including four Best Paper Awards (AAAI, UAI and CP), an NSF Career Award, ONR and AFOSR Young Investigator Awards, a Sony Faculty Innovation Award, a Hellman Faculty Fellowship, a Microsoft Research Fellowship, a Sloan Fellowship, and the IJCAI Computers and Thought Award. Stefano earned his Ph.D. in Computer Science at Cornell University in 2015.

Stefano Ermon
Fri 11:15 a.m. - 11:20 a.m.
Q&A with Stefano Ermon (Q&A)
Fri 11:20 a.m. - 11:23 a.m.
Introducing contributed talks 5-6-7 (Intro)
Haipeng Chen
Fri 11:23 a.m. - 11:32 a.m.
Contributed Talk: FFPDG: Fast, Fair and Private Data Generation (Contributed Talk)   
Weijie Xu
Fri 11:32 a.m. - 11:41 a.m.
Contributed Talk: Overcoming Barriers to Data Sharing with Medical Image Generation: A Comprehensive Evaluation (Contributed Talk)   
Stefan Bauer, August DuMont Schütte
Fri 11:41 a.m. - 11:50 a.m.
Contributed Talk: Imperfect ImaGANation: Implications of GANs Exacerbating Biases on Facial Data (Contributed Talk)   
Alberto Olmo, Niharika Jain
Fri 11:50 a.m. - 11:51 a.m.
Introducing Sander Dieleman (Intro)
Haipeng Chen
Fri 11:51 a.m. - 12:15 p.m.

Sander Dieleman is a research scientist at DeepMind in London, UK, where he has worked on the AlphaGo and WaveNet projects. His research interests include generative modelling and representation learning in the audio and visual domains, with a particular focus on music, as well as recommender systems and equivariance in neural networks. He obtained his Ph.D. in Computer Science from Ghent University in 2016, working on feature learning and deep learning techniques for learning hierarchical representations of musical audio signals.

Sander Dieleman
Fri 12:15 p.m. - 12:20 p.m.
Q&A with Sander Dieleman (Q&A)
Fri 12:20 p.m. - 12:50 p.m.

Please join us in GatherTown (using Firefox or Chrome) for our third poster session:

  • Imperfect ImaGANation: Implications of GANs Exacerbating Biases on Facial Data

  • Transitioning from real to synthetic data: quantifying the bias in model

  • Representative and Fair Synthetic Data

  • Synthetic Data for Model Selection

  • Ensembles of GANs for synthetic training data generation

  • Few Shot Learning via Tensor Hallucination

  • A Scriptable Tool for Photo Realistic Synthetic Image Generation

  • Improving augmentation and evaluation schemes for semantic image synthesis

Fri 12:50 p.m. - 12:51 p.m.
Introducing Emily Denton (Intro)
Krishnaram Kenthapadi
Fri 12:51 p.m. - 1:15 p.m.

Emily Denton is a Research Scientist on Google’s Ethical AI team where they examine the societal impacts of AI technology. Their recent research centers on critically examining the norms, values, and work practices that structure the development and use of machine learning datasets. Prior to joining Google, Emily received their PhD in machine learning from the Courant Institute of Mathematical Sciences at New York University, where they focused on unsupervised learning and generative modeling of images and video.

Emily Denton
Fri 1:15 p.m. - 1:20 p.m.
Q&A with Emily Denton (Q&A)
Fri 1:20 p.m. - 2:20 p.m.
Discussion Panel by All invited speakers (Discussion Panel)
Mario Fritz
Fri 2:20 p.m. - 2:30 p.m.
Closing Remarks and Award Ceremony (Remark)
Jamie Hayes