Workshop
Workshop on Enormous Language Models: Perspectives and Benchmarks
Colin Raffel · Adam Roberts · Amanda Askell · Daphne Ippolito · Ethan Dyer · Guy Gur-Ari · Jared Kaplan · Jascha Sohl-Dickstein · Katherine Lee · Melanie Subbiah · Sam McCandlish · Tom Brown · William Fedus · Vedant Misra · Ambrose Slone · Daniel Freeman
Fri 7 May, 7:45 a.m. PDT
Language models that have been trained on unlabeled text data are a cornerstone of modern natural language processing (NLP) research, and many recent state-of-the-art results in NLP were achieved by leveraging these self-supervised models. The success of this recipe is largely thanks to scalability: Better results can often be obtained by training larger models on larger amounts of unlabeled text data. This places our field at a crossroads. Will scaling lead to models that outperform humans on all text-based tasks, or are there limits to the scalability of these models? Should we focus on simply scaling these models, or should we design more sophisticated architectures and training schemes? Do our current benchmark effectively test capabilities that humans can master but large language models lack? How can we address the legal and ethical issues that arise from using unstructured web crawls for training language models? What can we learn from the fields of cognition, linguistics, and philosophy as we attempt to measure the “intelligence” of machines? The goal of this workshop is to find answers to these questions by inviting a diverse group of researchers to critically examine the state of giant language models.
This workshop will have a non-standard submission format: Rather than submitting research papers, participants will be invited to contribute diverse tasks that they believe measure uniquely human or particularly challenging capabilities for large language models. Teams at Google and OpenAI have committed to evaluate this task set on their best-performing model architectures, across models spanning from tens of thousands through hundreds of billions or more of parameters. Researchers will also be invited to contribute and evaluate their own models on these tasks. We will analyze these experiments, and report the results at the workshop, with a particular focus on how model performance on different task types scales with model size. By inviting contributions of tasks or models, we provide a means for researchers to participate whether or not they have the (cost-prohibitive) computational resources to train giant language models. The end result will be the Beyond the Imitation Game Benchmark (BIG Bench): A novel participant-driven test of the limits of giant language models. Find out more about BIG Bench and participate here.