ICLR Poster $\text{D}_{2}\text{O}$: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models

Poster

$\text{D}_{2}\text{O}$: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models

Zhongwei Wan · Xinjian Wu · Yu Zhang · Yi Xin · Chaofan Tao · Zhihong Zhu · Xin Wang · Siqi Luo · Jing Xiong · Longyue Wang · Mi Zhang

Hall 3 + Hall 2B #225


Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: Efficient generative inference in Large Language Models (LLMs) is impeded by the growing memory demands of Key-Value (KV) cache, especially for longer sequences. Traditional KV Cache eviction strategies, which discard less critical KV-pairs based on attention scores, often degrade generation quality, leading to issues such as context loss or hallucinations. To address this, we introduce **D**ynamic **D**iscriminative **O**perations ($\mathbf{D_2 O}$), a novel method that optimizes KV cache size dynamically and discriminatively at two levels without fine-tuning, while preserving essential context. At **layer-level**, by observing the varying densities of attention weights between shallow and deep layers, we dynamically determine which layers should avoid excessive eviction via our proposed ***dynamic allocation strategy*** to minimize information loss. At **token-level**, for the eviction strategy in each layer, $\mathbf{D_2 O}$ innovatively incorporates a ***compensation mechanism*** that maintains a similarity threshold to re-discriminate the importance of currently discarded tokens, determining whether they should be recalled and merged with similar tokens. Extensive experiments on various benchmarks and LLM architectures have shown that $\mathbf{D_2 O}$ not only achieves significant memory savings and enhances inference throughput by more than 3$\times$ but also maintains high-quality long-text generation.
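The token-level idea in the abstract (recall a discarded token and merge it with a similar retained one when a similarity threshold is met) can be illustrated with a small sketch. This is not the authors' implementation: the function name `compensate`, the use of cosine similarity between keys, the fixed `threshold` argument, and the plain averaging used for merging are all assumptions made for illustration, and the abstract does not specify how the threshold is maintained or how merged entries are weighted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compensate(kept_keys, kept_values, evicted_keys, evicted_values, threshold):
    """Hypothetical compensation step for one layer.

    Each evicted KV pair is compared with the retained keys. If its best
    cosine similarity reaches `threshold`, it is merged into that retained
    slot (assumption: plain averaging); otherwise it is dropped for good.
    Assumes `kept_keys` is non-empty.
    """
    # Running sums and counts, so each merged slot ends up as a plain average.
    key_sums = [list(k) for k in kept_keys]
    val_sums = [list(v) for v in kept_values]
    counts = [1] * len(kept_keys)
    dropped = 0

    for ek, ev in zip(evicted_keys, evicted_values):
        # Similarity is measured against the original retained keys.
        sims = [cosine(ek, k) for k in kept_keys]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            key_sums[best] = [s + x for s, x in zip(key_sums[best], ek)]
            val_sums[best] = [s + x for s, x in zip(val_sums[best], ev)]
            counts[best] += 1
        else:
            dropped += 1

    merged_keys = [[s / c for s in row] for row, c in zip(key_sums, counts)]
    merged_values = [[s / c for s in row] for row, c in zip(val_sums, counts)]
    return merged_keys, merged_values, dropped
```

The cache size stays fixed at the number of retained slots, which is the point of merging rather than re-inserting recalled tokens: information from similar discarded tokens is folded into existing entries instead of growing the cache.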
