

Poster

SaMer: A Scenario-aware Multi-dimensional Evaluator for Large Language Models

Kehua Feng · Keyan Ding · Jing Yu · Yiwen Qu · Zhiwen Chen · Chengfei Lv · Gang Yu · Qiang Zhang · Huajun Chen

Hall 3 + Hall 2B #266
[ Project Page ]
Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Evaluating the response quality of large language models (LLMs) on open-ended questions poses a significant challenge, especially given the subjectivity and multi-dimensionality of "quality" in natural language generation. Existing LLM evaluators often neglect that different scenarios require distinct evaluation criteria. In this work, we propose SaMer, a scenario-aware multi-dimensional evaluator designed to provide both overall and fine-grained assessments of LLM-generated responses. Unlike fixed-dimension evaluation approaches, SaMer adapts to different scenarios by automatically identifying and prioritizing the evaluation dimensions relevant to the given query. To achieve this, we construct a large-scale fine-grained preference dataset spanning multiple real-world scenarios, each with distinct evaluation dimensions. We then leverage a text embedding model combined with three specialized heads to predict the appropriate evaluation dimensions and corresponding scores, as well as the respective weights that contribute to the overall score. The resulting model offers fine-grained and interpretable evaluations and shows robust adaptability across diverse scenarios. Extensive experiments on eight single-rating and pairwise-comparison datasets demonstrate that SaMer outperforms existing baselines across a variety of evaluation tasks, showcasing its robustness, versatility, and generalizability.
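To make the three-head design concrete, below is a minimal, hypothetical sketch of how an embedding backbone with a dimension-relevance head, a scoring head, and a weighting head could be combined into an overall score. The class and attribute names, the number of dimensions, and the exact aggregation (relevance-masked weighted sum) are assumptions for illustration only; the abstract does not specify these details.

```python
import torch
import torch.nn as nn


class SaMerSketch(nn.Module):
    """Hypothetical sketch of a scenario-aware evaluator head stack:
    (1) predict which evaluation dimensions are relevant to the query,
    (2) score the response on each dimension,
    (3) weight the relevant dimensions into an overall score.
    Sizes and aggregation scheme are assumptions, not the paper's spec."""

    def __init__(self, embed_dim: int = 1024, num_dims: int = 16):
        super().__init__()
        # Head 1: multi-label relevance of each evaluation dimension to the query.
        self.dim_head = nn.Linear(embed_dim, num_dims)
        # Head 2: per-dimension quality score for the response.
        self.score_head = nn.Linear(embed_dim, num_dims)
        # Head 3: per-dimension weight contributing to the overall score.
        self.weight_head = nn.Linear(embed_dim, num_dims)

    def forward(self, embedding: torch.Tensor):
        # embedding: (batch, embed_dim) from a text embedding model applied to
        # the query-response pair (backbone omitted in this sketch).
        relevance = torch.sigmoid(self.dim_head(embedding))        # (B, D) in [0, 1]
        scores = self.score_head(embedding)                        # (B, D) raw scores
        weights = torch.softmax(self.weight_head(embedding), -1)   # (B, D), sums to 1

        # Overall score: weighted sum of per-dimension scores, gated by relevance.
        overall = (relevance * weights * scores).sum(dim=-1)       # (B,)
        return relevance, scores, weights, overall
```

In this sketch, the relevance gate is what makes the evaluation scenario-aware: dimensions judged irrelevant to the query contribute little to the overall score, while the remaining heads supply per-dimension scores and their weights for an interpretable breakdown.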
