The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection
Vyzantinos Repantis · Harshvardhan Singh · Tony Joseph · Cien Zhang · Akash Vishwakarma · Svetlana Karslioglu · Michael Thot · Ameya Gawde
Abstract
For most of information retrieval's history, search results were designed for human consumers who could scan, filter, and discard irrelevant content. This shaped retrieval systems to optimize for finding and ranking relevant documents, but not for minimizing noise, because humans served as the final filter. Retrieval-augmented generation (RAG) and tool-using agents flip these assumptions. Now the consumer is often an LLM, not a person, and the model does not skim. In practice, introducing excessive or irrelevant context into the input can dilute the model's ability to identify and focus on the most critical information. We define selectivity as the ability of a retrieval system to surface relevant items while excluding irrelevant ones, measured relative to random chance. We introduce Bits-over-Random (BoR), a measure of retrieval selectivity that reveals when high success rates mask random-level performance. A system with high selectivity finds needles without bringing along the haystack. BoR uses a logarithmic scale where each bit represents a doubling in selectivity. The framework is grounded in information theory: $\text{BoR} = \log_2(P_{\text{obs}}/P_{\text{rand}})$, where $P_{\text{obs}}$ is the observed success rate (we use Success@K) and $P_{\text{rand}}$ is the expected success rate of random selection; BoR is measured in bits. By studying reported system performance in the literature for the MS MARCO dataset and by testing two further datasets (BEIR SciFact and 20 Newsgroups classification), we demonstrate how to measure selectivity in retrieval and LLM-based systems. On MS MARCO at $K=1000$, we analyzed the reported performance of 41 retrieval systems spanning three decades of retrieval technology. The BM25 baseline (85.7% recall) achieves 12.89 bits, while the state-of-the-art SimLM (98.7% recall) achieves 13.09 bits: a difference of only 0.20 bits despite a 13-point recall gap.
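The BoR computation above can be sketched in a few lines. The corpus size (8,841,823 passages) and the assumption of one relevant passage per query are illustrative values for an MS MARCO-style setting, not the paper's exact configuration; with them, the reported bit counts fall out directly.

```python
import math

def p_random(N, R, K):
    """Probability that a uniform random sample of K items from a
    corpus of N contains at least one of the R relevant items:
    1 - C(N-R, K) / C(N, K), computed as a stable running product."""
    miss = 1.0
    for i in range(K):
        miss *= (N - R - i) / (N - i)
    return 1.0 - miss

def bor(p_obs, N, R, K):
    """Bits-over-Random: log2(P_obs / P_rand)."""
    return math.log2(p_obs / p_random(N, R, K))

# Illustrative MS MARCO-style setting (assumed corpus size,
# one relevant passage per query)
N, R, K = 8_841_823, 1, 1000
print(round(bor(0.857, N, R, K), 2))  # BM25-level recall   -> 12.89
print(round(bor(0.987, N, R, K), 2))  # SimLM-level recall  -> 13.09
print(round(bor(1.0, N, R, K), 2))    # theoretical ceiling -> 13.11
```

Note how the ceiling of 13.11 bits depends only on $P_{\text{rand}}$, i.e. on $N$, $R$, and $K$, never on the retriever.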
All 41 systems cluster close to the theoretical ceiling of 13.11 bits, suggesting diminishing returns from retriever improvements alone for this dataset at this scale and depth. We see similar results on BEIR SciFact. In our 20 Newsgroups retrieval task, each query has over 500 relevant items on average; we use this as a stress test because it closely resembles agentic tool-selection setups. Increasing retrieval depth from $K=10$ to $K=100$ raises Success@K to 100%, indicating near-perfect retrieval. However, LLM classification accuracy drops by 10-16%, and token costs increase tenfold. Traditional metrics fail to detect this failure mode, even though retrieval at this depth is barely better than random chance; BoR reveals it clearly by dropping to nearly zero at this task and depth. The "collapse zone" is the regime where meaningful selectivity becomes mathematically impossible regardless of system quality. It occurs when $\lambda = \frac{K \cdot \bar{R}_q}{N}$ reaches 3-5, where $K$ is retrieval depth, $\bar{R}_q$ is the average number of relevant items per query, and $N$ is corpus size. When $\lambda$ exceeds this threshold, even perfect systems achieve near-zero BoR because random selection already succeeds most of the time. The collapse boundary has critical implications for LLM agent tool selection. Industry-reported case studies describe systems that present their full suite of tools to an LLM (for example, $N=58$, $K=58$, $\bar{R}_q \approx 4$); such a system operates at $\lambda \approx 4.0$, deep inside the collapse zone. Through the lens of BoR, we analyze such cases and conclude that even a perfect tool selector achieves little selectivity over random chance (in one case, only ~0.02 bits). This explains why wrong tool selection is the most common failure mode in agentic tool-use systems. The same pattern affects any selection problem with small $N$ and relatively large $K$ and $\bar{R}_q$, including API endpoints, agent skills, and multi-hop retrieval chains.
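The collapse-zone bound can be checked numerically. The sketch below assumes the random-hit rate follows a Poisson approximation, $P_{\text{rand}} \approx 1 - e^{-\lambda}$, which is one plausible reading consistent with the ~0.02-bit figure quoted above; the $N=58$, $K=58$, $\bar{R}_q \approx 4$ numbers come from the case study in the text.

```python
import math

def bor_ceiling(N, R_bar, K):
    """Upper bound on BoR for a perfect system (P_obs = 1), using a
    Poisson approximation of the random-hit rate: P_rand ~ 1 - e^(-lambda)."""
    lam = K * R_bar / N
    p_rand = 1 - math.exp(-lam)
    return lam, math.log2(1 / p_rand)

# "All tools in context" agent setup from the case study
lam, ceiling = bor_ceiling(N=58, R_bar=4, K=58)
print(f"lambda = {lam:.1f}, BoR ceiling = {ceiling:.3f} bits")
```

At $\lambda = 4.0$, random selection already hits a relevant tool ~98% of the time, so even a perfect selector has almost no headroom above chance.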
We also establish the "doubling rule": once the success rate has plateaued, doubling $K$ loses approximately 1 bit of selectivity, while a $10\times$ increase loses ~3.3 bits. This quantifies the hidden cost of "just retrieve more", a common but potentially harmful strategy in LLM systems. BoR works with various success conditions (Success@K, Recall@K, and rules requiring multiple relevant items). Our work reveals three critical insights: 1. Performance ceilings exist even for perfect systems, determined entirely by the random baseline. 2. The collapse zone makes selectivity impossible when $\lambda$ reaches 3-5. 3. Depth-selectivity trade-offs become explicit by measuring the difference in BoR ($\Delta\text{BoR}$) between depths. For practitioners, BoR offers operational guidance: monitor $\lambda$ to avoid the collapse zone, stop increasing $K$ when $\text{BoR}_{\text{max}}$ drops below ~0.1 bits, and apply aggressive filtering in tool-based agents, where small $N$ makes collapse inevitable.
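The doubling rule follows from the definitions: once $P_{\text{obs}}$ is flat, BoR changes only through $P_{\text{rand}}$, which in the small-$\lambda$ regime scales linearly with $K$. The sketch below uses illustrative values ($N = 10^6$, one relevant item, a plateaued 95% success rate), not figures from our experiments.

```python
import math

def bor_plateau(p_obs, N, R, K):
    """BoR after the success rate has plateaued, in the small-lambda
    regime where P_rand ~ K * R / N."""
    return math.log2(p_obs / (K * R / N))

N, R, p = 1_000_000, 1, 0.95  # illustrative corpus and plateaued success rate
for K in (100, 200, 1000):
    print(K, round(bor_plateau(p, N, R, K), 2))
# Doubling K (100 -> 200) costs exactly log2(2) = 1 bit;
# a 10x increase (100 -> 1000) costs log2(10) ~ 3.32 bits.
```

Because the loss is $\log_2$ of the depth ratio, it is independent of the corpus size and the plateau level; only the ratio of depths matters.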