Please bring an ID or a credit card and your registration receipt QR code to check in. Avoid bringing your passport to the convention center.
The emerging science of benchmarks
Benchmarks are the keystone that holds the machine learning community together. They have grown as a research paradigm since the 1980s, and while we have done much with them, we know little about them. In this talk, I will trace the rudiments of an emerging science of benchmarks through selected empirical and theoretical observations. Specifically, we'll discuss the role of annotator errors, the external validity of model rankings, and the promise of multi-task benchmarks. The results in each case challenge conventional wisdom and underscore the benefits of developing a science of benchmarks.
Blog Track Session 7
Tiny Papers Poster Session 7
Hanwang Zhang is currently Associate Professor at the School of Computer Science and Engineering, NTU. He joined NTU as a Nanyang Assistant Professor in 2018. He was a research scientist (postdoc) at Columbia University in 2017-2018, and a senior research fellow at NUS in 2014-2016. He received a Ph.D. from NUS in 2014 and a B.Eng from Zhejiang University, China in 2009, both in Computer Science. His research interests include Computer Vision, Natural Language Processing, Causal Inference, and their combinations. For his contributions to applied causality, he has received numerous awards including the Singapore President Award Young Scientist 2021, IEEE AI's-10-To-Watch 2020, the Alibaba Innovative Research Award 2019, a Nanyang Assistant Professorship 2018, and several best paper awards.
Alexander "Sasha" Rush is an Associate Professor at Cornell Tech and a researcher at Hugging Face. His research interest is in the study of language models with applications in controllable text generation, efficient inference, and applications in summarization and information extraction. In addition to research, he has written several popular open-source software projects supporting NLP research, programming for deep learning, and virtual academic conferences. His projects have received paper and demo awards at major NLP, ML, visualization, and hardware conferences, an NSF Career Award, and a Sloan Fellowship.
Rosanne Liu is the Co-founder and Executive Director of ML Collective, a non-profit organization providing research training for all, and is concurrently doing science and being a manager at Google DeepMind (previously Brain). She was a founding member of Uber AI. Rosanne obtained her PhD in Computer Science at Northwestern University, and has published well-cited research at NeurIPS, ICLR, ICML, Nature, and other top venues. She builds communities for researchers around the world, organizes symposia and workshops, and runs the long-running weekly reading group “Deep Learning: Classics and Trends.” She serves as the Diversity, Equity & Inclusion chair of ICLR 2022-2024 and NeurIPS 2023.
Reliable evaluations are critical for improving language models, but they're difficult to achieve. Traditional automated benchmarks often fail to reflect real-world settings, and open-source evaluation sets are prone to being overfitted in practice. Conducting evaluations in-house is burdensome and demands significant human effort from model builders.
To tackle these issues, Scale AI has created a set of evaluation prompt datasets in areas like instruction following, coding, math, multilinguality, and safety. Summer Yue (Chief of Staff, AI; Director of Safety and Standards at Scale AI) will discuss these eval sets, as well as the launch of a new platform that allows researchers to gain insights into their models' performance. Furthermore, she will introduce a unique feature that warns developers of potential overfitting on these sets.
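As one way to picture such a warning (purely an illustrative sketch, not Scale AI's actual mechanism), a check could compare a model's score on the public prompts against a private, never-released split; a large gap hints that the public set has leaked into training. All names below are hypothetical.

```python
# Hypothetical overfitting check: compare accuracy on public vs. private eval splits.
# This is an illustrative sketch of the concept, not Scale AI's implementation.

def overfitting_warning(public_scores, private_scores, threshold=0.05):
    """Flag potential overfitting when the public-set score exceeds the private-set score."""
    public_acc = sum(public_scores) / len(public_scores)     # per-prompt scores in [0, 1]
    private_acc = sum(private_scores) / len(private_scores)
    gap = public_acc - private_acc
    return {"public": public_acc, "private": private_acc,
            "gap": gap, "warn": gap > threshold}

# Example: a 12-point gap between public and private splits triggers a warning.
print(overfitting_warning([1, 1, 1, 0, 1], [1, 0, 1, 0, 1]))
```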
Your new Scholar profile
Google Scholar is widely used to form opinions about researchers, but it is not a passive measuring tool. Its deliberate decisions on what to show and what to hide have a massive impact on how science is done today: they influence what researchers decide to work on, their methodologies, and their career advancement.
We believe Google Scholar profiles are not serving science in the best way. We wish to share our vision of a Better Scholar for the future and gather your observations and feedback.
Moritz Hardt is a director at the Max Planck Institute for Intelligent Systems. Prior to joining the institute, he was an Associate Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley. His research contributes to the scientific foundations of machine learning and algorithmic decision making with a focus on social questions.
Tiny Papers Oral Session 4
Luke Zettlemoyer is a Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, and a Research Director at Meta. His research focuses on empirical methods for natural language semantics, and involves designing machine learning algorithms, introducing new tasks and datasets, and, most recently, studying how to best develop self-supervision signals for pre-training. His honors include being named an ACL Fellow as well as winning a PECASE award, an Allen Distinguished Investigator award, and multiple best paper awards. Luke received his PhD from MIT and was a postdoc at the University of Edinburgh.
Tatsunori Hashimoto is an Assistant Professor in the Computer Science Department at Stanford University. He is a member of the statistical machine learning and natural language processing groups at Stanford, and his work focuses on statistical approaches to improving and understanding language models. Work from his group spans many areas, including instruction-following and controllable language models, differentially private fine-tuning, and benchmarks for LM safety and capabilities. He received his Ph.D. from MIT under the supervision of Tommi Jaakkola and David Gifford. He is a Kavli fellow and a Sony and Amazon research award winner, and his work has been recognized with best paper awards at ICML and CHI.
Claire is a Group Leader at the University of Tuebingen, in the Cluster of Excellence Machine Learning for Science. She was awarded an Emmy Noether award under the AI Initiative call in 2022. Her research is on sequential decision making. It mostly spans bandit problems and theoretical Reinforcement Learning, but her research interests extend to Learning Theory and principled learning algorithms. While keeping in mind concrete problems, she focuses on theoretical approaches, aiming for provably optimal algorithms. Between November 2018 and December 2022, she was a Research Scientist at DeepMind in London, UK, in the Foundations team led by Prof. Csaba Szepesvari. She did a post-doc in 2018 with Prof. Alexandra Carpentier at the University of Magdeburg in Germany while working part-time as an Applied Scientist at Amazon in Berlin. She received her PhD from Telecom ParisTech in October 2017, under the guidance of Prof. Olivier Cappé.
Xuezhi Wang is a Research Scientist at Google Brain. Her primary interests are robustness and fairness in NLP models, and enabling systematic generalization in language models. Xuezhi received her PhD from the Computer Science Department at Carnegie Mellon University in 2016.
Together with two other co-founders, Rich Bonneau and Vlad Gligorijevic, I founded Prescient Design in January 2021 to build a lab-in-the-loop protein design platform based on our earlier research. Prescient Design was fully acquired by Genentech (Roche) in August 2021 and began to focus more specifically on antibody design. It has been more than three years since its founding and more than 2.5 years since the acquisition. In this talk, I will share Prescient Design's lab-in-the-loop antibody design, both the platform and its outcomes, as well as what went into building this platform from the perspective of machine learning.
Blog Track Session 8
Tiny Papers Poster Session 8
Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning
Human beings can make adaptive decisions in a preparatory manner, i.e., by making preparations in advance, which offers significant advantages in scenarios where both online and offline experiences are expensive and limited. Meanwhile, current reinforcement learning methods commonly rely on numerous environment interactions yet rarely obtain generalizable policies. In this paper, we introduce the idea of rehearsal into policy optimization, where the agent plans for all possible outcomes in mind and acts adaptively according to actual responses from the environment. To rehearse effectively, we propose ReDM, an algorithm that generates a diverse and eligible set of dynamics models and then rehearses the policy via adaptive training on the generated model set. Rehearsal enables the policy to make decision plans for various hypothetical dynamics and to naturally generalize to previously unseen environments. Our experimental results demonstrate that ReDM is capable of learning a valid policy solely through rehearsal, even with zero interaction data. We further extend ReDM to scenarios where limited or mismatched interaction data is available, and our experimental results reveal that ReDM produces high-performing policies compared to other offline RL baselines.
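A minimal sketch of the rehearsal idea as described in the abstract (not the authors' released code): train one policy across a diverse set of hypothetical dynamics models, then at deployment condition it on whichever model best explains the observed transitions. The dynamics-model interface, the train_on_model subroutine, and the gym-style environment below are assumptions for illustration.

```python
import random

def rehearse(policy, candidate_models, train_on_model, n_rounds=100):
    """Adaptively train a single policy against a diverse set of dynamics models."""
    for _ in range(n_rounds):
        model = random.choice(candidate_models)   # pick one hypothetical dynamics to rehearse in
        train_on_model(policy, model)             # assumed RL update performed inside that model
    return policy

def act_adaptively(policy, env, candidate_models, horizon=200):
    """At deployment, identify the model most consistent with observed transitions
    and let the policy condition its actions on that inferred context."""
    state, context = env.reset(), None            # gym-style env assumed for illustration
    for _ in range(horizon):
        action = policy(state, context)
        next_state, reward, done, _ = env.step(action)
        # Assumed interface: each model scores how well it predicts the observed transition.
        context = min(candidate_models,
                      key=lambda m: m.prediction_error(state, action, next_state))
        state = next_state
        if done:
            break
```

The design choice mirrored here is that generalization comes from rehearsing against many plausible dynamics rather than from collecting more interaction data, so the deployment loop only needs to recognize which rehearsed dynamics it is facing.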