Poster
Needle Threading: Can LLMs Follow Threads Through Near-Million-Scale Haystacks?
Jonathan Roberts · Kai Han · Samuel Albanie
Hall 3 + Hall 2B #314
As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably thread-safe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared, since they often correspond to substantially different numbers of written characters. We release our code and long context experimental data.