Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation
Reliable and Efficient Amortized Model-based Evaluation
Sang Truong · Yuheng Tu · Percy Liang · Bo Li · Sanmi Koyejo
Current generative model evaluations are costly and sensitive to test set selection, making iterative evaluation impractical. To address this issue, we employ a model-based evaluation framework using Item Response Theory (IRT), which decouples model performance from test characteristics, improving reliability and efficiency. To operationalize IRT for generative model evaluation, we propose (1) amortized calibration to reduce the cost of estimating question characteristics and (2) a conditional question generator based on a large language model to automate diverse question generation. Our experiments on 25 common natural language benchmarks and 184 large language models show that this approach is more reliable and efficient than current common practice, offering a practical solution for evaluating generative models. Our implementation is available at https://anonymous.4open.science/r/reeval-1187.
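To illustrate the idea of decoupling model ability from item characteristics, here is a minimal Python sketch assuming the standard two-parameter logistic (2PL) IRT model: each question has a calibrated discrimination and difficulty, and a model's ability is estimated by maximum likelihood from its graded responses. The specific parameterization, the amortized calibration network, and the conditional question generator from the paper are not shown; all names and numbers below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability `theta` answers an
    item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b):
    """Maximum-likelihood ability estimate from binary responses
    (1 = correct) on items with known (calibrated) parameters."""
    def neg_log_lik(theta):
        p = p_correct(theta, a, b)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Toy usage: five calibrated items, one model's graded answers.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])   # discrimination
b = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])  # difficulty
responses = np.array([1, 1, 1, 0, 0])
print(estimate_ability(responses, a, b))  # ability on the latent scale
```

Because the item parameters (a, b) are estimated separately from model ability, the resulting ability score is comparable across different test sets; amortized calibration, as described in the abstract, replaces per-item fitting with a predictor of these parameters, reducing calibration cost for newly generated questions.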