

Poster

Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Vikranth Srivatsa · Zijian He · Reyna Abhyankar · Dongming Li · Yiying Zhang

Hall 3 + Hall 2B #260
Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practice is to include domain-specific instructions, illustrations of tool usage, and/or long context such as textbook chapters in prompts. As a result, many parts of prompts are repetitive across requests. Recent works propose to cache and reuse the KV state of prompts. However, they are all confined to single-GPU optimization, while production LLM serving systems are distributed by nature. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We designed a distributed scheduling system that co-optimizes KV state reuse and computation load balancing with a new scheduling algorithm and a hierarchical scheduling mechanism. Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms SOTA serving systems by 1.5× to 14.5× on average latency and 2× to 10× on p99 latency.
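To make the co-optimization concrete, below is a minimal sketch of prefix-aware, load-aware request routing in the spirit the abstract describes: route a prompt to a GPU that already caches much of its KV state, but fall back to the least-loaded GPU when cache reuse is small. The names (`GpuState`, `schedule`), the token-count load estimate, and the 0.5 match threshold are illustrative assumptions for this sketch, not Preble's actual algorithm.

```python
# Hypothetical sketch of cache-vs-load scheduling; not Preble's real scheduler.
from dataclasses import dataclass, field

@dataclass
class GpuState:
    gpu_id: int
    cached_prefixes: list = field(default_factory=list)  # token sequences with cached KV state
    pending_tokens: int = 0  # crude load estimate: tokens queued for prefill

def shared_prefix_len(a, b) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def schedule(prompt, gpus, threshold=0.5) -> GpuState:
    """Route one prompt, trading KV-state reuse against load balance."""
    # Find the GPU whose cache covers the largest prefix of this prompt.
    best_gpu, best_match = gpus[0], 0
    for gpu in gpus:
        match = max((shared_prefix_len(prompt, p) for p in gpu.cached_prefixes), default=0)
        if match > best_match:
            best_gpu, best_match = gpu, match

    if best_match / max(len(prompt), 1) >= threshold:
        chosen = best_gpu  # large cache hit: reuse KV state, skip recomputing the prefix
    else:
        chosen = min(gpus, key=lambda g: g.pending_tokens)  # small hit: load-balance instead

    # Only the uncached suffix must be prefilled on a cache-hit GPU.
    chosen.pending_tokens += len(prompt) - (best_match if chosen is best_gpu else 0)
    chosen.cached_prefixes.append(prompt)
    return chosen

if __name__ == "__main__":
    gpus = [GpuState(0), GpuState(1)]
    shared = ["You", "are", "a", "helpful", "assistant", "."]
    print(schedule(shared + ["What", "is", "KV", "cache", "?"], gpus).gpu_id)
    print(schedule(shared + ["Explain", "load", "balancing", "."], gpus).gpu_id)
```

The second request shares a long prefix with the first, so the sketch routes it to the same GPU to reuse KV state; a prompt with little overlap would instead go to whichever GPU is least loaded.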
