Poster in Workshop: Workshop on Large Language Models for Agents
Expressing and Exploiting Parallelism in Language Model Decoding
Tian Jin · Ellie Cheng · Michael Carbin
For autoregressive language models, decoding naturally occurs sequentially, generating tokens one after another. Recent attempts to introduce parallelism require a pre-determined structure to implement parallel generation, such as generating an outline and dividing the response into parallel sub-tasks. In this work, we explore a new technique that automates parallel generation by dynamically exploiting parallel structure in the semantics of the language model's response. Specifically, we introduce a simple annotation language, MSG, that allows language models to express parallelism in their outputs. We then develop an interpreter for MSG that performs on-the-fly parallel generation during decoding, exploiting the parallelism expressed in the MSG-annotated outputs. We demonstrate that our approach improves tokens generated per second by 21% while maintaining the same output quality.
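To make the idea concrete, below is a minimal Python sketch of how such an interpreter might work. The tag names (`<fork>`, `<branch>`) and the `generate()` stub are illustrative assumptions only; the abstract does not specify MSG's actual syntax or how the interpreter integrates with the decoder.

```python
# Hypothetical sketch of an MSG-style interpreter. The <fork>/<branch> tags and the
# generate() stub are assumptions for illustration; they are not the paper's MSG syntax.
import re
from concurrent.futures import ThreadPoolExecutor


def generate(prompt: str) -> str:
    """Stand-in for autoregressive decoding of one branch (a real system would call the LM)."""
    return f"[completion of: {prompt!r}]"


def interpret(annotated: str) -> str:
    """Expand each fork block by decoding its branches concurrently, then splice results in order."""
    fork = re.compile(r"<fork>(.*?)</fork>", re.DOTALL)

    def expand(match: re.Match) -> str:
        branches = re.findall(r"<branch>(.*?)</branch>", match.group(1), re.DOTALL)
        # Branches marked as independent can be decoded in parallel because their
        # semantics do not depend on one another.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(generate, branches))
        return "\n".join(results)

    return fork.sub(expand, annotated)


if __name__ == "__main__":
    text = ("Intro. <fork><branch>Describe step A.</branch>"
            "<branch>Describe step B.</branch></fork> Conclusion.")
    print(interpret(text))
```

In this toy version, the speedup comes from decoding independent branches concurrently rather than strictly left to right; the abstract's interpreter does this on the fly during decoding rather than on a completed annotated string.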