Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

Confidence-Gated RAG for Adaptive Retrieval in Sequential Agents

Srikanth Devarakonda ⋅ RAJESH LINGAM ⋅ Vagdevi Challa


Abstract

Large language model (LLM) agents increasingly rely on external retrieval to mitigate incomplete knowledge and uncertainty during multi-step decision-making. However, existing retrieval strategies in agentic systems typically follow static policies—either retrieving at every step or performing a single upfront query—without accounting for evolving execution risk under partial observability. We formulate retrieval as a risk-aware test-time control problem in which evidence acquisition is treated as an adaptive action conditioned on predicted downstream failure. We introduce a confidence-gated retrieval framework that maintains an explicit step-wise confidence score derived from self-consistency, contradiction detection, and dependency risk signals, implemented as a weighted heuristic risk estimator. This confidence signal guides decisions to retrieve, execute, or replan by approximating expected failure likelihood while balancing retrieval cost. We evaluate the approach on a hidden-constraint cloud-configuration planning benchmark and report both a lightweight task-completion proxy metric and a strict semantic judge metric for hidden-constraint satisfaction. On a 20-task hard-small subset, multi-seed results (3 seeds, mean ± 95% CI) show that confidence-gated retrieval attains the highest task-completion proxy at substantially lower retrieval cost than always-retrieve; we include a Self-RAG-style baseline (LLM decides per step whether to retrieve) for comparison. Under TF-IDF, semantic success is 0–2%; under dense retrieval (sentence-transformers + FAISS), it reaches 3–7% for some agents, consistent with higher retrieval coverage. We report retrieval quality (recall@1/5/10 and mean oracle rank) for both retrievers—TF-IDF yields 0% recall, dense yields 2% recall@10 and mean oracle rank 7—and validate the judge with a small human evaluation (100% agreement on 20 examples).
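
The gating idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the weights, the thresholds, the signal ranges, or the way retrieval cost enters the decision, so everything named below (`StepSignals`, `confidence`, `gate`, the default weights, and both thresholds) is a hypothetical assumption for illustration and not the authors' implementation. The three signals are assumed to be pre-normalized to [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSignals:
    """Hypothetical per-step signals, each assumed to lie in [0, 1]."""
    self_consistency: float  # agreement among sampled answers (higher is safer)
    contradiction: float     # strength of detected contradiction (higher is riskier)
    dependency_risk: float   # how much later steps depend on this one (higher is riskier)


def confidence(s: StepSignals,
               w_sc: float = 0.5,
               w_con: float = 0.3,
               w_dep: float = 0.2) -> float:
    """Weighted heuristic: confidence = 1 - weighted risk.

    The weights are illustrative assumptions; they sum to 1 so that the
    result stays in [0, 1] when the inputs do.
    """
    risk = (w_sc * (1.0 - s.self_consistency)
            + w_con * s.contradiction
            + w_dep * s.dependency_risk)
    return 1.0 - risk


def gate(s: StepSignals,
         retrieve_below: float = 0.7,
         replan_below: float = 0.3) -> str:
    """Map confidence to one of three actions.

    Assumed policy: very low confidence triggers a replan, moderate
    confidence triggers retrieval, and high confidence executes without
    paying the retrieval cost. Lowering `retrieve_below` is one simple way
    to account for a higher retrieval cost.
    """
    c = confidence(s)
    if c < replan_below:
        return "replan"
    if c < retrieve_below:
        return "retrieve"
    return "execute"


if __name__ == "__main__":
    print(gate(StepSignals(1.0, 0.0, 0.0)))
    print(gate(StepSignals(0.5, 0.2, 0.5)))
    print(gate(StepSignals(0.0, 1.0, 1.0)))
```

In this sketch the retrieval cost is handled only indirectly through the `retrieve_below` threshold; an explicit cost term in the decision rule would be an equally plausible reading of the abstract.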
