Workshop

The Future of Machine Learning Data Practices and Repositories

Rachel Longjohn · Markelle Roesti · Meera Desai · Shivani Kapania · Maria Antoniak · Padhraic Smyth · Sameer Singh · Joaquin Vanschoren · Amy Winecoff · Daniel Katz

Hall 4 #2

Sat 26 Apr, 6:45 p.m. PDT

Datasets are a central pillar of machine learning (ML) research—from pretraining to evaluation and benchmarking. However, a growing body of work highlights serious issues throughout the ML data ecosystem, including the under-valuing of data work, ethical issues in datasets that go undiscovered, a lack of standardized dataset deprecation procedures, the (mis)use of datasets out-of-context, an overemphasis on single metrics rather than holistic model evaluation, and the overuse of the same few benchmark datasets. Thus, developing guidelines, goals, and standards for data practices is critical; beyond this, many researchers have pointed to a need for a more fundamental culture shift surrounding data and benchmarking in ML. At present it is not clear how to mobilize the ML community for such a transformation. In this workshop, we aim to explore this question, including by examining the role of data repositories in the ML data landscape. These repositories have received relatively little attention in this context, despite their key role in the storage, documentation, and sharing of ML datasets. We envision that these repositories, as central purveyors of ML datasets, have the potential to instigate far-reaching changes to ML data and benchmarking culture via the features they implement and the standards they enforce (e.g., minting DOIs, requiring licenses, facilitating the provision of structured metadata).

Timezone: America/Los_Angeles