Reliable evaluations are critical for improving language models, but they are difficult to achieve. Traditional automated benchmarks often fail to reflect real-world settings, and open-source evaluation sets are frequently overfitted in practice. Conducting evaluations in-house is burdensome, demanding significant human effort from model builders.
To tackle these issues, Scale AI has created a set of evaluation prompt datasets in areas such as instruction following, coding, math, multilinguality, and safety. Summer Yue (Chief of Staff, AI; Director of Safety and Standards at Scale AI) will discuss these eval sets, as well as the launch of a new platform that gives researchers insight into their models' performance. She will also introduce a feature that warns developers of potential overfitting on these sets.