Workshop
The Future of Machine Learning Data Practices and Repositories
Rachel Longjohn · Markelle Roesti · Meera Desai · Shivani Kapania · Maria Antoniak · Padhraic Smyth · Sameer Singh · Joaquin Vanschoren · Amy Winecoff · Daniel Katz
Datasets are a central pillar of machine learning (ML) research, from pretraining to evaluation and benchmarking. However, a growing body of work highlights serious issues throughout the ML data ecosystem, including the under-valuing of data work, ethical issues in datasets that go undiscovered, a lack of standardized dataset deprecation procedures, the (mis)use of datasets out of context, an overemphasis on single metrics rather than holistic model evaluation, and the overuse of the same few benchmark datasets. Thus, developing guidelines, goals, and standards for data practices is critical; beyond this, many researchers have pointed to a need for a more fundamental culture shift surrounding data and benchmarking in ML. At present, it is not clear how to mobilize the ML community for such a transformation. In this workshop, we aim to explore this question, including by examining the role of data repositories in the ML data landscape. These repositories have received relatively little attention in this context, despite their key role in the storage, documentation, and sharing of ML datasets. We envision that these repositories, as central purveyors of ML datasets, have the potential to instigate far-reaching changes to ML data and benchmarking culture via the features they implement and the standards they enforce (e.g., minting DOIs, requiring licenses, facilitating the provision of structured metadata).
Schedule
Sat 6:45 p.m. - 7:00 p.m. | Opening Remarks (Opening Remarks)
Sat 7:00 p.m. - 8:00 p.m. | Spotlight Paper Presentations (Presentation)
Sat 8:00 p.m. - 9:00 p.m. | Poster Session 1 (Poster Session)
Sat 9:00 p.m. - 10:30 p.m. | Lunch
Sat 10:30 p.m. - 11:00 p.m. | Policy of Dataset Transparency (Invited Talk)
Sat 11:05 p.m. - 11:40 p.m. | Open Machine Learning (Invited Talk)
Sat 11:45 p.m. - 12:00 a.m. | Break
Sun 12:00 a.m. - 12:30 a.m. | Consent in Crisis: The Rapid Decline of the AI Data Commons (Invited Talk)
Sun 12:35 a.m. - 1:05 a.m. | Ethical Considerations for Responsible Data Curation (Invited Talk)
Sun 1:10 a.m. - 1:45 a.m. | The Role of Annotation and Dataset Documentation in the ARIA Program (Invited Talk)
Sun 1:45 a.m. - 2:30 a.m. | Poster Session 2 (Poster Session)
- Rethinking Dataset Pruning From A Generalization Perspective (Poster) | Furui Xu · Shaobo Wang · Luo Zhongwei · Linfeng Zhang
- DRUPI: Dataset Reduction Using Privileged Information (Poster) | Shaobo Wang · Yantai Yang · Shuaiyu Zhang · Chenghao Sun · Weiya Li · Xuming Hu · Linfeng Zhang
- Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation (Poster) | Yucheng Li
- A Guide to Misinformation Detection Datasets (Poster) | Camille Thibault · Jacob-Junqi Tian · Gabrielle Péloquin-Skulski · Taylor Curtis · Florence Laflamme · James Zhou · Yuxiang Guan · Reihaneh Rabbany · Jean-François Godbout · Kellin Pelrine
- Unreflected Use of Tabular Data Repositories Can Undermine Research Quality (Poster) | Andrej Tschalzev · Lennart Purucker · Stefan Lüdtke · Frank Hutter · Christian Bartelt · Heiner Stuckenschmidt
- Machine Learners Should Acknowledge the Legal Implications of Large Language Models as Personal Data (Poster) | Henrik Nolte · Michèle Finck · Kristof Meding
- Towards Operationalizing Right to Data Protection (Poster) | Simra Shahid · Abhinav Java · Chirag Agarwal
- Data Curation for Pluralistic Alignment (Poster) | Dalia Ali · Aysenur Kocak · Michèle Wieland · Dora Zhao · Allison Koenecke · Orestis Papakyriakopoulos
- The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications (Poster) | Philippe Brouillard · Chandler Squires · Jonas Wahl · Konrad P Kording · Karen Sachs · Alexandre Drouin · Dhanya Sridhar
- AutoML Benchmark with shorter time constraints and early stopping (Poster) | Israel Jurado · Pieter Gijsbers · Joaquin Vanschoren
- Revisiting Multi-Modal LLM Evaluation (Poster) | Jian Lu · Shikhar Srivastava · Junyu Chen · Robik Shrestha · Manoj Acharya · Kushal Kafle · Christopher Kanan
- Tracing Scientific Evolution: A 30-Year Cross-disciplinary Analysis (Poster) | Yiqiao Jin · Yijia Xiao · Yiyang Wang · Jindong Wang