Poster in Workshop: Workshop on Large Language Models for Agents

Are Machines Better at Slow Thinking? Unveiling Human-Machine Inference Gaps in Entailment Verification

Soumya Sanyal · Tianyi Xiao · Jiacheng Liu · Wenya Wang · Xiang Ren


Abstract:

Humans make numerous inferences during text comprehension to grasp its meaning. This paper aims to understand the similarities and differences between humans and state-of-the-art Large Language Models (LLMs) in their ability to judge valid inferences. To this end, we leverage a comprehensively curated entailment verification benchmark that includes datasets from three NLP domains (NLI, contextual QA, and rationales), containing multi-sentence premises and requiring different types of knowledge. Our findings reveal that LLMs are superior at multi-hop reasoning across extended contexts that demands slow thinking, while humans excel at simple deductive reasoning tasks. Using these insights, we introduce a fine-tuned Flan-T5 model that outperforms GPT-3.5 and rivals GPT-4, offering a superior open-source LLM for entailment verification. As a practical application, we showcase the efficacy of our fine-tuned model in enhancing self-consistency over model-generated CoT rationales, resulting in an average 6% performance boost across three multiple-choice question-answering datasets.
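A minimal sketch of the two ideas in the abstract: scoring whether a premise entails a hypothesis with a Flan-T5 model, and using that score to filter sampled CoT rationales before a self-consistency majority vote. The checkpoint name, prompt template, threshold, and the premise/hypothesis construction below are illustrative assumptions, not the paper's exact setup or released model.

```python
from collections import Counter

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical stand-in checkpoint; the paper's released verifier may differ.
MODEL_NAME = "google/flan-t5-large"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
model.eval()


def entailment_score(premise: str, hypothesis: str) -> float:
    """Return the probability mass the model puts on "yes" vs "no"
    when asked whether the premise entails the hypothesis.

    The prompt template is a guess at a Flan-style verification prompt,
    not the paper's exact format.
    """
    prompt = (
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes or no."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        # Score only the first decoded token, starting from the
        # decoder start token.
        out = model(
            **inputs,
            decoder_input_ids=torch.tensor(
                [[model.config.decoder_start_token_id]]
            ),
        )
    logits = out.logits[0, -1]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()


def filter_and_vote(question, rationale_answer_pairs, threshold=0.5):
    """Keep CoT samples whose rationale the verifier judges as entailing
    the stated answer, then majority-vote over the survivors. A hard
    threshold is a simplification of whatever weighting the paper uses.
    """
    kept = [
        ans
        for rationale, ans in rationale_answer_pairs
        if entailment_score(f"{question} {rationale}", ans) >= threshold
    ]
    if not kept:  # fall back to a plain self-consistency vote
        kept = [ans for _, ans in rationale_answer_pairs]
    return Counter(kept).most_common(1)[0][0]
```

Usage would follow the standard self-consistency recipe: sample several CoT rationales and answers from the task model, then call filter_and_vote(question, samples) to discard rationales the verifier scores as non-entailing before voting.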
