

Poster

MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions

Jian Wu · Linyi Yang · Dongyuan Li · Yuliang Ji · Manabu Okumura · Yue Zhang

Hall 3 + Hall 2B #572
Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT
 
Oral presentation: Oral Session 2B
Thu 24 Apr 12:30 a.m. PDT — 2 a.m. PDT

Abstract:

While large language models (LLMs) have made strides in understanding tabular data, current tabular evaluation benchmarks, such as WikiTableQuestions and WikiSQL, focus on single-table scenarios, which do not necessarily reflect the complexity of real-world applications. To bridge this gap, we present a Multi-table and Multi-hop Question Answering (MMQA) dataset to assess LLMs' understanding and reasoning capabilities in handling multi-table tasks. The MMQA dataset requires models to perform multiple inference steps by drawing evidence from several interconnected tables, identifying and exploiting relationships such as primary and foreign keys. We then introduce a comprehensive evaluation framework tailored to assess LLMs' capabilities across several aspects, including Multi-Table Retrieval, Text-to-SQL Generation, Multi-Table QA, Primary Key Selection, and Foreign Key Selection. Finally, we propose a novel multi-table retrieval method that achieves state-of-the-art (SOTA) performance on the MMQA dataset compared to several strong baselines. Our experimental results reveal that, compared with human performance, both open-source and commercial LLMs leave significant room for improvement in multi-table understanding and reasoning tasks. We believe that the MMQA benchmark will facilitate and advance LLMs' multi-table capabilities in real-world scenarios.
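To make the setting concrete, the following minimal sketch illustrates the kind of multi-hop, multi-table reasoning the benchmark targets. The tables, question, and SQL query are hypothetical examples (not drawn from MMQA): answering the question requires joining two tables through a primary/foreign key pair, here expressed as a Text-to-SQL style query over an in-memory SQLite database.

# Illustrative sketch (hypothetical tables and question, not taken from MMQA):
# a multi-hop question whose answer requires joining two tables via a
# primary/foreign key relationship.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table 1: authors, keyed by author_id (primary key).
cur.execute(
    "CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT, affiliation TEXT)"
)
# Table 2: papers, linked to authors via the foreign key author_id.
cur.execute(
    "CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, title TEXT, "
    "author_id INTEGER REFERENCES authors(author_id))"
)
cur.executemany(
    "INSERT INTO authors VALUES (?, ?, ?)",
    [(1, "Alice", "Univ A"), (2, "Bob", "Univ B")],
)
cur.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [(10, "Tables and LLMs", 1), (11, "Graphs and LLMs", 2)],
)

# Question: "Which affiliation published 'Tables and LLMs'?"
# Answering it takes two hops: papers -> authors (via author_id) -> affiliation.
sql = """
SELECT a.affiliation
FROM papers p
JOIN authors a ON p.author_id = a.author_id
WHERE p.title = 'Tables and LLMs'
"""
print(cur.execute(sql).fetchone()[0])  # -> Univ A

In MMQA's framing, a model must first retrieve the relevant tables, identify the key columns that connect them, and then either generate such a SQL query (Text-to-SQL) or answer the question directly (Multi-Table QA).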
