ICLR Poster Jamba: Hybrid Transformer-Mamba Language Models

Barak Lenz · Opher Lieber · Alan Arazi · Amir Bergman · Avshalom Manevich · Barak Peleg · Ben Aviram · Chen Almagor · Clara Fridman · Dan Padnos · Daniel Gissin · Daniel Jannai · Dor Muhlgay · Dor Zimberg · Edden Gerber · Elad Dolev · Eran Krakovsky · Erez Sa · Erez Schwartz · Gal Cohen · Gal Shachaf · Haim Rozenblum · Hofit Bata · Ido Blass · Inbal Magar · Itay Dalmedigos · Jhonathan Osin · Julie Fadlon · Maria Rozman · Matan Danos · Michael Gokhman · Mor Zusman · Naama Gidron · Nir Ratner · Noam Gat · Noam Rozen · Oded Fried · Ohad Leshno · Omer Antverg · Omri Abend · Or Dagan · Orit Cohavi · Raz Alon · Ro'i Belson · Roi Cohen · Rom Gilad · Roman Glozman · Shahar Lev · Shai Shalev-Shwartz · Shaked Meirom · Tal Delbari · Tal Ness · Tomer Asida · Tom Ben Gal · Tom Braude · Uriya Pumerantz · Joshua Cohen · Yonatan Belinkov · Yuval Globerson · Yuval Levy · Yoav Shoham

Hall 3 + Hall 2B #248

Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

We present Jamba, a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. We implement two configurations: Jamba-1.5-Large, with 94B active parameters, and Jamba-1.5-Mini, with 12B active parameters. Built at large scale, Jamba models provide high throughput and a small memory footprint compared to vanilla Transformers, especially on long-context tasks, with an effective context length of 256K tokens, the largest amongst open-weight models. At the same time, they are also competitive on standard language modeling and chatbot benchmarks. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large-scale modeling. To support cost-effective inference, we introduce ExpertsInt8, a novel quantization technique that allows fitting Jamba-1.5-Large on a machine with eight 80GB GPUs when processing 256K-token contexts without loss of quality. We also describe several interesting properties of this architecture that the training and evaluation of Jamba have revealed. The model weights are publicly available.
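
The small memory footprint at long context can be illustrated with a rough calculation: only attention layers keep a key-value cache that grows with sequence length, so replacing most of them with Mamba layers shrinks that cache proportionally. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The 32-layer depth, the one-in-eight attention ratio, the MoE-every-second-layer rule, the head counts, and the names `LayerSpec`, `build_layout`, and `kv_cache_bytes` are all hypothetical and are not taken from the abstract. The fixed-size state that Mamba layers carry is ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    mixer: str  # "attention" or "mamba"
    ffn: str    # "moe" or "dense"


def build_layout(n_layers, attn_every, moe_every):
    """Plan a hybrid stack: one attention layer per `attn_every` layers,
    and an MoE feed-forward in one of every `moe_every` layers."""
    layout = []
    for i in range(n_layers):
        # Place the attention layer in the middle of each group (an arbitrary choice).
        mixer = "attention" if i % attn_every == attn_every // 2 else "mamba"
        ffn = "moe" if i % moe_every == moe_every - 1 else "dense"
        layout.append(LayerSpec(mixer, ffn))
    return layout


def kv_cache_bytes(n_attn_layers, seq_len, n_kv_heads, head_dim, bytes_per_value=2):
    """Cache size for one sequence: keys and values (factor 2) per attention layer."""
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
layout = build_layout(n_layers=32, attn_every=8, moe_every=2)
n_attn = sum(spec.mixer == "attention" for spec in layout)
n_moe = sum(spec.ffn == "moe" for spec in layout)

seq_len = 256 * 1024  # a 256K-token context
hybrid = kv_cache_bytes(n_attn, seq_len, n_kv_heads=8, head_dim=128)
all_attention = kv_cache_bytes(len(layout), seq_len, n_kv_heads=8, head_dim=128)

print(f"attention layers: {n_attn}, MoE layers: {n_moe}")
print(f"hybrid KV cache: {hybrid / 2**30:.0f} GiB")
print(f"all-attention KV cache: {all_attention / 2**30:.0f} GiB")
```

With these assumed numbers the hybrid stack keeps 4 of 32 layers as attention, so its key-value cache is one eighth the size of an all-attention stack of the same depth (4 GiB versus 32 GiB at 16-bit precision).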
