

Poster

ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Peng Xu · Wei Ping · Xianchao Wu · Chejian Xu · Zihan Liu · Mohammad Shoeybi · Bryan Catanzaro

Hall 3 + Hall 2B #285
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

In this work, we introduce ChatQA 2, a Llama 3.0-based model with a 128K context window, designed to bridge the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo-2024-04-09) in long context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are complementary to each other and essential for LLMs to process large volumes of information that cannot fit into a single prompt. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model outperforms most existing state-of-the-art models, including GPT-4-Turbo-2024-04-09, Qwen2-72B-Instruct, and Llama3.1-70B-Instruct, on ultra-long tasks beyond 100K tokens, as well as on the RAG benchmark using only a 4K context window, showing strong long context capability across varying sequence lengths. We further provide extensive comparisons between direct long-context and RAG solutions using the same state-of-the-art long-context LLMs. Interestingly, we find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks. With a large set of top-k chunks, RAG consistently outperforms direct long-context solutions using the same state-of-the-art long-context models (e.g., Llama3-ChatQA-2-70B and Qwen2-72B-Instruct) on both 32K and 128K benchmarks. We open-source the model weights, training data, and the evaluation setup for the community: https://chatqa2-project.github.io/
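The top-k chunk retrieval the abstract describes can be illustrated with a minimal sketch: rank candidate chunks against the query, keep the top-k, and pack as many as fit into a fixed context budget (e.g., a 4K-token window). Everything here is hypothetical: the term-overlap scorer, the whitespace token approximation, and the function names are illustrative stand-ins, not the retriever or tokenizer used in the paper.

```python
def score(query: str, chunk: str) -> int:
    """Toy relevance score: count query terms that appear in the chunk.
    A real RAG system would use a learned dense or sparse retriever."""
    q_terms = set(query.lower().split())
    return sum(1 for t in chunk.lower().split() if t in q_terms)


def top_k_context(query: str, chunks: list[str], k: int, budget_tokens: int) -> list[str]:
    """Rank chunks by score, keep the top-k, then pack them greedily
    into the context budget. Tokens are approximated by word count."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    packed, used = [], 0
    for c in ranked:
        n = len(c.split())
        if used + n > budget_tokens:
            break  # stop once the next chunk would overflow the window
        packed.append(c)
        used += n
    return packed


chunks = [
    "long context window extension to 128K tokens",
    "three-stage instruction tuning recipe",
    "retrieval augmented generation with top-k chunks",
]
ctx = top_k_context("top-k chunks for retrieval", chunks, k=2, budget_tokens=32)
```

Under this scheme, raising k trades retrieval recall against context-window pressure, which is the knob the abstract reports tuning in its RAG-versus-long-context comparison.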
