Navigating and Addressing Data Problems for Foundation Models (DPFM)

Workshop

Ruoxi Jia · Tatsunori Hashimoto · Pang Wei Koh · Jerone Andrews · Sang Michael Xie · Lingjiao Chen · Myeongseob Ko · Feiyang Kang

Stolz 0

Sat 11 May, midnight PDT

[ Contact: fyk@vt.edu ]

Foundation Models (FMs, e.g., GPT-3/4, LLaMA, DALL-E, Stable Diffusion) have achieved sweeping success on a wide range of tasks. As researchers strive to keep pace with the rapid evolution of FMs and to understand their capabilities, limitations, and implications, attention is shifting to the emerging notion of data-centric AI. The curation of training data has been shown to be crucially important for the performance and reliability of FMs, and a wealth of recent work demonstrates that data-perspective research offers a promising direction on critical issues such as safety, alignment, efficiency, security, privacy, and interpretability. Recent years have seen a surge of individual works exploring many frontiers of this topic, which makes this an excellent opportunity to bring researchers together to search for a systematic framework and roadmap for research. This workshop aims to discuss and build a better understanding of the new paradigm for research on data problems for foundation models.

Our technical agenda is composed of four modules with 12 confirmed speakers:
- A. Data Quality, Dataset Curation, and Data Generation: Recent Achievements and Current Efforts
- B. A Data Perspective to Efficiency, Interpretability, and Alignment: Latest Advancement and Breakthroughs
- C. A Data Perspective to Safety and Ethics: Risks, Limitations, and Opportunities
- D. Copyright, Legal Issues, and Data Economy: A Broader Landscape

We strive to build a community behind this essential topic. Noting that the current data practices of foundation models are largely opaque, one mission of this workshop is to create a community effort on open-source data at the pretraining stage itself. Subsequent efforts include creating datasets, benchmarks, and dedicated venues to promote research on data problems for foundation models and, ultimately, to facilitate the widespread deployment of FMs in a sociotechnically friendly way that provides broad benefit. Examples of our target communities include researchers on data problems (e.g., data-centric AI, dataset/data curation, data markets) and foundation models (alignment, safety/trustworthiness, fairness/ethics), practitioners of downstream applications, tech companies providing innovative solutions, and beyond.

Timezone: America/Los_Angeles

Schedule

Fri 11:50 p.m. - 12:00 a.m.	Opening Remarks ( Intro )
Sat 12:00 a.m. - 12:30 a.m.	Invited Talk #1 - Bridging the Gap Between Pre-training Data and Alignment [Speaker: Mike Lewis (Meta AI)] ( Invited Talk )	Mike Lewis
Sat 12:30 a.m. - 12:45 a.m.	Best Paper Oral Presentation #1 - Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models [Speaker: Ken Liu (Stanford University)] ( Oral Presentation )	Ken Liu
Sat 12:45 a.m. - 1:00 a.m.	Best Paper Oral Presentation #2 - The Science of Data Filtering: Data Curation cannot be Compute Agnostic [Speakers: Sachin Goyal & Pratyush Maini (CMU)] ( Oral Presentation )	Sachin Goyal · Pratyush Maini
Sat 1:00 a.m. - 2:00 a.m.	Poster Session I & Coffee Break (ALL posters) ( Poster Session )
Sat 2:00 a.m. - 2:30 a.m.	Invited Talk #2 - A data-centric view on reliable generalization: From ImageNet to LAION-5B & DataComp [Speaker: Ludwig Schmidt (Anthropic, Stanford, and U Washington)] ( Invited Talk )	Ludwig Schmidt
Sat 2:30 a.m. - 2:45 a.m.	Best Paper Oral Presentation #3 - VideoCon: Robust Video-Language Alignment via Contrast Captions [Speaker: Hritik Bansal (UCLA)] ( Oral Presentation )	Hritik Bansal
Sat 2:45 a.m. - 3:00 a.m.	Best Paper Oral Presentation #4 - What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety [Speaker: Luxi He (Princeton University)] ( Oral Presentation )	Luxi He
Sat 3:00 a.m. - 4:00 a.m.	Lunch Break ( Lunchtime )
Sat 4:00 a.m. - 4:30 a.m.	Invited Talk #3 - Making “GPT-Next” Trustworthy Through Data [Speaker: Eric Wallace (OpenAI)] ( Invited Talk )	Eric Wallace
Sat 4:30 a.m. - 4:45 a.m.	Best Paper Oral Presentation #5 - Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis [Speaker: Lukas Struppek (TU Darmstadt)] ( Oral Presentation )	Lukas Struppek
Sat 4:45 a.m. - 5:00 a.m.	Best Paper Oral Presentation #6 - Computational Copyright: Towards A Royalty Model for AI Music Generation Platforms [Speaker: Jiaqi Ma (UIUC)] ( Oral Presentation )	Jiaqi Ma
Sat 5:00 a.m. - 6:00 a.m.	Poster Session II & Coffee Break (ALL posters) ( Poster Session )
Sat 6:00 a.m. - 6:30 a.m.	Invited Talk #4 - Characterizing Machine Unlearning through Definitions and Implementations [Speaker: Nicolas Papernot (University of Toronto & Vector Institute)] ( Invited Talk )	Nicolas Papernot
Sat 6:30 a.m. - 7:00 a.m.	Invited Talk #5 - Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models [Speaker: Luke Zettlemoyer (U Washington/Meta)] ( Invited Talk )	Luke Zettlemoyer
Sat 7:00 a.m. - 7:30 a.m.	Interactive Panel Discussion ( Panel Discussion )
Sat 7:30 a.m. - 7:35 a.m.	Closing Remarks ( Remarks )
-	Label-free Neural Semantic Image Synthesis ( Poster )	Jiayi Wang · Kevin Laube · Yumeng Li · Jan Hendrik Metzen · Shin-I Cheng · Julio Borges · Anna Khoreva
-	What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety ( Poster )	Luxi He · Mengzhou Xia · Peter Henderson
-	Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models ( Poster )	Minhao Jiang · Ken Liu · Ming Zhong · Rylan Schaeffer · Siru Ouyang · Jiawei Han · Sanmi Koyejo
-	Perplexed by Perplexity: Perplexity-Based Pruning with Small Reference Models ( Poster )	Zachary Ankner · Cody Blakeney · Kartik Sreenivasan · Max M Marion · Matthew Leavitt · Mansheej Paul
-	Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models ( Poster )	Yuancheng Xu · Jiarui Yao · Manli Shu · Yanchao Sun · Zichu Wu · Ning Yu · Tom Goldstein · Furong Huang
-	[Online Presentation] Distributional Dataset Distillation with Subtask Decomposition ( Poster )	Tian Qin · Zhiwei Deng · David Alvarez-Melis
-	Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates ( Poster )	Avanika Narayan · Mayee Chen · Kush Bhatia · Christopher Re
-	Evaluating Large Language Models in an Emerging Domain: A Pilot Study in Decentralized Finance ( Poster )	Joshua Pearlson · Xiaoyuan Liu · Chengsong Huang · Kripa George · Dawn Song · Chenguang Wang
-	QuRating: Selecting High-Quality Data for Training Language Models ( Poster )	Alexander Wettig · Aatmik Gupta · Saumya Malik · Danqi Chen
-	Toward Data-driven Skill Identification for General-purpose Vision-language Models ( Poster )	Anthony Tiong · Junqi Zhao · Junnan Li · Steven Hoi · Caiming Xiong · Boyang Albert Li
-	TOFU: A Task of Fictitious Unlearning for LLMs ( Poster )	Pratyush Maini · Zhili Feng · Avi Schwarzschild · Zachary Lipton · J Kolter
-	Incentivizing Inclusive Data Contributions in Personalized Federated Learning ( Poster )	Enpei Zhang · Jingyi Chai · Rui Ye · Yanfeng Wang · Siheng Chen
-	How to Craft Backdoors with Unlabeled Data Alone? ( Poster )	Yifei Wang · Wenhan Ma · Stefanie Jegelka · Yisen Wang
-	Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines ( Poster )	Yuchen Li · Alexandre Kirchmeyer · Aashay Mehta · Yilong Qin · Boris Dadachev · Kishore Papineni · Sanjiv Kumar · Andrej Risteski
-	Feedback-guided Data Synthesis for Imbalanced Classification ( Poster )	Reyhane Askari Hemmat · Mohammad Pezeshki · Florian Bordes · Michal Drozdzal · Adriana Romero-Soriano
-	Efficient Global Data Attribution for Diffusion Models ( Poster )	MingYu Lu · Chris Lin · Su-In Lee
-	Scalable Data Extraction from Retrieval-Augmented Generation Systems ( Poster )	Zhenting Qi · Hanlin Zhang · Eric Xing · Sham Kakade · Hima Lakkaraju
-	Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases ( Poster )	Elad Levi · Eli Brosh · Matan Friedmann
-	A Tale of Tails: Model Collapse as a Change of Scaling Laws ( Poster )	Yunzhen Feng · Elvis Dohmatob · Pu Yang · François Charton · Julia Kempe
-	AdaDemo: Adaptive Online Demonstration Expansion for Multi-task Visual Policy Learning ( Poster )	Tongzhou Mu · Yijie Guo · Jie Xu · Ankit Goyal · Hao Su · Dieter Fox · Animesh Garg
-	Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models ( Poster )	Yupan Huang · Zaiqiao Meng · Fangyu Liu · Yixuan Su · Nigel Collier · Yutong Lu
-	Autonomous Data Selection with Language Models for Mathematical Texts ( Poster )	Yifan Zhang · Yifan Luo · Yang Yuan · Andrew Yao
-	Enhancing Data Quality in Federated Fine-Tuning of Foundation Models ( Poster )	Wanru Zhao · Yaxin Du · Nic Lane · Siheng Chen · Yanfeng Wang
-	Multimodal Dataset Upgrading: a New Challenge for Data Annotation ( Poster )	Haiwen Huang · Dan Zhang · Andreas Geiger
-	On the Scalability of GNNs for Molecular Graphs ( Poster )	Maciej Sypetkowski · Frederik Wenkel · Farimah Poursafaei · Nia Dickson · Karush Suri · Philip Fradkin · Dominique Beaini
-	Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis ( Poster )	Lukas Struppek · Dominik Hintersdorf · Felix Friedrich · Manuel Brack · Patrick Schramowski · Kristian Kersting
-	Improving Practical Counterfactual Fairness with Limited Causal Knowledge ( Poster )	Zeyu Zhou · Ruqi Bai · David Inouye
-	Vision-Language Dataset Distillation ( Poster )	Xindi Wu · Byron Zhang · Zhiwei Deng · Olga Russakovsky
-	Data Alignment for Zero-Shot Concept Generation in Dermatology AI ( Poster )	Soham Gadgil · Mahtab Bigverdi
-	CollabEdit: Towards Non-destructive Collaborative Knowledge Editing ( Poster )	Jiamu Zheng · Jinghuai Zhang · Futing Wang · Tianyu Du · Tao Lin
-	Scaling Laws for Downstream Task Performance of Large Language Models ( Poster )	Berivan Isik · Natalia Ponomareva · Hussein Hazimeh · Dimitris Paparas · Sergei Vassilvitskii · Sanmi Koyejo
-	Hallucination Augmented Recitations for Language Models ( Poster )	Abdullatif Köksal · Renat Aksitov · Chung-Ching Chang
-	LongForm: Effective Instruction Tuning with Reverse Instructions ( Poster )	Abdullatif Köksal · Timo Schick · Anna Korhonen · Hinrich Schuetze
-	Model & Data Insights using Pre-trained Language Models ( Poster )	Saeid Asgari · Aliasghar Khani · Amir Khasahmadi · Aditya Sanghi · Karl Willis · Ali Mahdavi Amiri
-	LESS: Selecting Influential Data for Targeted Instruction Tuning ( Poster )	Mengzhou Xia · Sadhika Malladi · Suchin Gururangan · Sanjeev Arora · Danqi Chen
-	Towards Unbiased Evaluation of Detecting Unanswerable Questions in EHRSQL ( Poster )	Yongjin Yang · Sihyeon Kim · SangMook Kim · Gyubok Lee · Se-Young Yun · Edward Choi
-	Virtual Classifier: A Reversed Approach for Robust Image Evaluation ( Poster )	Jizhe Zhang · Yifei Wang · Yisen Wang
-	[Online Presentation] DELE: Data Efficient LLM Evaluation ( Poster )	Gayathri Saranathan · Mahammad Parwez Alam · James Lim · Suparna Bhattacharya · Soon Wong · Martin Foltin · Cong Xu
-	CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution ( Poster )	Alex Gu · Baptiste Roziere · Hugh Leather · Armando Solar-Lezama · Gabriel Synnaeve · Sida Wang
-	Computational Copyright: Towards A Royalty Model for AI Music Generation Platforms ( Poster )	Junwei Deng · Jiaqi Ma
-	Prompt Optimization with Logged Bandit Data ( Poster )	Haruka Kiyohara · Yuta Saito · Daniel Cao · Thorsten Joachims
-	The Science of Data Filtering: Data Curation cannot be Compute Agnostic ( Poster )	Sachin Goyal · Pratyush Maini · Zachary Lipton · Aditi Raghunathan · J Kolter
-	West-of-N: Synthetic Preference Generation for Improved Reward Modeling ( Poster )	Alizée Pace · Jonathan Mallinson · Eric Malmi · Sebastian Krause · Aliaksei Severyn
-	Data Debiasing via Model-free Data Pruning ( Poster )	Lei Hsiung · Yaoqing Yang
-	Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget ( Poster )	Florian Eddie Dorner · Moritz Hardt
-	Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling ( Poster )	Pratyush Maini · Skyler Seto · He Bai · David Grangier · Yizhe Zhang · Navdeep Jaitly
-	Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models ( Poster )	Avi Singh · John Co-Reyes · Rishabh Agarwal
-	Pre-training Concept Frequency is predictive of CLIP Zero-shot Performance ( Poster )	Vishaal Udandarao · Ameya Prabhu · Philip Torr · Adel Bibi · Samuel Albanie · Matthias Bethge
-	Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models ( Poster )	Hritik Bansal · John Dang · Aditya Grover
-	VideoCon: Robust Video-Language Alignment via Contrast Captions ( Poster )	Hritik Bansal · Yonatan Bitton · Idan Szpektor · Kai-Wei Chang · Aditya Grover
-	Augmenting Math Word Problems via Iterative Question Composing ( Poster )	Haoxiong Liu · Yifan Zhang · Yifan Luo · Andrew Yao
-	OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning ( Poster )	Rui Ye · WenHao Wang · Jingyi Chai · Dihan Li · Zexi Li · Yinda Xu · Yaxin Du · Yanfeng Wang · Siheng Chen