Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation
Reliable and Efficient Amortized Model-based Evaluation
Sang Truong · Yuheng Tu · Percy Liang · Bo Li · Sanmi Koyejo
Current generative model evaluations are costly and sensitive to test set selection, making iterative evaluation impractical. To address this issue, we employ a model-based evaluation framework using Item Response Theory (IRT), which decouples model performance from test characteristics, improving reliability and efficiency. To operationalize IRT for generative model evaluation, we propose (1) amortized calibration to reduce the cost of estimating question characteristics and (2) a conditional question generator based on a large language model to automate diverse question generation. Our experiments on 25 common natural language benchmarks and 184 large language models show that this approach is more reliable and efficient than current common practice, offering a practical solution for evaluating generative models. Our implementation is available at https://anonymous.4open.science/r/reeval-1187.
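To illustrate the idea of decoupling model ability from item characteristics, here is a minimal Python sketch assuming the standard two-parameter logistic (2PL) IRT model: each question has a calibrated discrimination and difficulty, and a model's ability is estimated by maximum likelihood from its graded responses. The specific parameterization, the amortized calibration network, and the conditional question generator from the paper are not shown; all names and numbers below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability `theta` answers an
    item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b):
    """Maximum-likelihood ability estimate from binary responses
    (1 = correct) on items with known (calibrated) parameters."""
    def neg_log_lik(theta):
        p = p_correct(theta, a, b)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Toy usage: five calibrated items, one model's graded answers.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])   # discrimination
b = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])  # difficulty
responses = np.array([1, 1, 1, 0, 0])
print(estimate_ability(responses, a, b))  # ability on the latent scale
```

Because the item parameters (a, b) are estimated separately from model ability, the resulting ability score is comparable across different test sets; amortized calibration, as described in the abstract, replaces per-item fitting with a predictor of these parameters, reducing calibration cost for newly generated questions.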