The aim of the workshop is to discuss and propose standards for evaluating ML research, in order to better identify promising new directions and to accelerate real progress in the field. Doing so requires understanding which practices add to or detract from the generalizability or reliability of reported results, and the incentives for researchers to follow best practices. We may draw inspiration from adjacent scientific fields, from statistics, or from the history of science. Acknowledging that there is no consensus on best practices for ML, the workshop will focus on panel discussions and a few invited talks representing a variety of perspectives. The call for papers welcomes opinion papers as well as more technical papers on the evaluation of ML methods. We plan to summarize the findings and topics that emerge during the workshop in a short report.
Call for Papers: https://ml-eval.github.io/call-for-papers/
Submission Site: https://cmt3.research.microsoft.com/SMILES2022
| Welcome Note | |
| Invited Talk (Thomas Wolf) (Invited Talk) | |
| Q&A for Thomas Wolf (Q&A) | |
| Invited Talk (Frank Schneider) (Invited Talk) | |
| Q&A for Phillip Henning and Frank Schneider (Q&A) | |
| Invited Talk (Rotem Dror) (Invited Talk) | |
| Q&A for Rotem Dror (Q&A) | |
| Experimental Standards for Deep Learning Research: A Natural Language Processing Perspective (Oral (Contributed Talk)) | |
| A Case for Better Evaluation Standards in NLG (Oral (Contributed Talk)) | |
| Reproducibility and Rigor in ML (Panel) (Panel Discussion) | |
| Poster Session 1 (Gather.Town) (Poster Session) | |
| Invited Talk (James Evans) (Invited Talk) | |
| Q&A for James Evans (Q&A) | |
| Slow vs Fast Science (Panel) (Panel Discussion) | |
| Coffee Break (Break) | |
| Invited Talk (Melanie Mitchell) (Invited Talk) | |
| Q&A for Melanie Mitchell (Q&A) | |
| Invited Talk (Katherine Heller) (Invited Talk) | |
| Q&A for Katherine Heller (Q&A) | |
| Invited Talk (Corinna Cortes) (Invited Talk) | |
| Q&A for Corinna Cortes (Q&A) | |
| A Siren Song of Open Source Reproducibility (Oral (Contributed Talk)) | |
| Integrating Rankings into Quantized Scores in Peer Review (Oral (Contributed Talk)) | |
| Tradeoffs in Preventing Manipulation in Paper Bidding for Reviewer Assignment (Oral (Contributed Talk)) | |
| Incentives for Better Evaluation (Panel) (Panel Discussion) | |
| Poster Session 2 & Closing Remarks (Gather.Town) | |
| A Quality-Diversity-based Evaluation Strategy for Symbolic Music Generation (Poster) | |
| Towards Yet Another Checklist for New Datasets (Poster) | |
| Does the Market of Citations Reward Reproducible Work? (Poster) | |
| Strengthening Subcommunities: Towards Sustainable Growth in AI Research (Poster) | |
| System Analysis for Responsible Design of Modern AI/ML Systems (Poster) | |
| Rethinking Machine Learning Model Evaluation in Pathology (Poster) | |
| Integrating Rankings into Quantized Scores in Peer Review (Poster) | |
| A Siren Song of Open Source Reproducibility (Poster) | |
| What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability (Poster) | |
| A Brief Guide to Designing and Evaluating Human-Centered Interactive Machine Learning (Poster) | |
| Experimental Standards for Deep Learning Research: A Natural Language Processing Perspective (Poster) | |
| Reproducible Subjective Evaluation (Poster) | |
| Machine Learning State-of-the-Art with Uncertainties (Poster) | |
| A Case for Better Evaluation Standards in NLG (Poster) | |
| A meta analysis of data-driven newsvendor approaches (Poster) | |
| CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance (Poster) | |
| A Survey On Uncertainty Toolkits For Deep Learning (Poster) | |
| Incentivizing Empirical Science in Machine Learning: Problems and Proposals (Poster) | |
| Rethinking Streaming Machine Learning Evaluation (Poster) | |
| Tradeoffs in Preventing Manipulation in Paper Bidding for Reviewer Assignment (Poster) | |
| Why External Validity Matters for Machine Learning Evaluation: Motivation and Open Problems (Poster) | |
| deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks (Poster) | |
| Increasing Confidence in Adversarial Robustness Evaluations (Poster) | |
| Setting Clear Expectations for Uncertainty Estimation (Poster) | |
| A Revealing Large-Scale Evaluation of Unsupervised Anomaly Detection Algorithms (Poster) | |
| Are Ground Truth Labels Reproducible? An Empirical Study (Poster) | |