Poster in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Scaling Laws and Efficient Inference for Ternary Language Models
Tejas Vaidhya · Ayush Kaushal · Vineet Jain · Francis Couture-Harpin · Prashant Shishodia · Majid Behbahani · Irina Rish · Yuriy Nevmyvaka
Abstract:
Large language models (LLMs) are increasingly deployed across research and industry applications, yet their high inference cost poses a major challenge. In this work, we investigate ternary language models (TriLMs), which employ quantization-aware training to significantly reduce memory requirements, as a potential solution. We present three key contributions: (1) a comprehensive scaling law analysis showing that these models benefit more from scaling training data than their floating-point counterparts; (2) the introduction of Spectra-1.1, an open-source family of state-of-the-art TriLMs trained on up to 1.2 trillion tokens, demonstrating performance competitive with Llama-1 7B; and (3) ternary kernels for efficient inference, utilizing novel 1.6-bit and 2-bit packing schemes. Notably, our GPU kernel using 2-bit packing, called TriRun, achieves up to an 8$\times$ speedup over float16 baselines, enabling efficient inference in memory-constrained environments. We will release the Spectra-1.1 models along with optimized inference kernels to encourage further research on TriLMs.
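To make the packing idea concrete, below is a minimal, illustrative sketch (not the paper's actual TriRun GPU kernels) of how ternary weights {-1, 0, +1} can be stored at 2 bits per weight (four weights per byte) and at 1.6 bits per weight (five weights per byte via base-3 encoding, since 3^5 = 243 ≤ 256). The function names and NumPy implementation are assumptions for illustration only.

```python
import numpy as np

def pack_2bit(ternary):
    """Pack ternary weights {-1, 0, +1} into 2 bits each (4 weights per byte)."""
    codes = (ternary + 1).astype(np.uint8).reshape(-1, 4)   # map {-1,0,+1} -> {0,1,2}
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed):
    """Recover ternary weights from 2-bit codes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11               # (n_bytes, 4)
    return codes.astype(np.int8).reshape(-1) - 1             # map {0,1,2} -> {-1,0,+1}

def pack_1p6bit(ternary):
    """Pack 5 ternary weights per byte via base-3 encoding (8/5 = 1.6 bits per weight)."""
    codes = (ternary + 1).astype(np.uint8).reshape(-1, 5)
    powers = np.array([1, 3, 9, 27, 81], dtype=np.uint8)
    return (codes * powers).sum(axis=1).astype(np.uint8)     # max value 242 fits in a byte

def unpack_1p6bit(packed):
    """Recover 5 ternary weights from each base-3 encoded byte."""
    powers = np.array([1, 3, 9, 27, 81], dtype=np.int32)
    codes = (packed[:, None] // powers) % 3
    return codes.astype(np.int8).reshape(-1) - 1

if __name__ == "__main__":
    w = np.random.choice([-1, 0, 1], size=20).astype(np.int8)
    assert np.array_equal(unpack_2bit(pack_2bit(w)), w)
    assert np.array_equal(unpack_1p6bit(pack_1p6bit(w)), w)
    print("2-bit:  ", w.nbytes, "bytes ->", pack_2bit(w).nbytes, "bytes")
    print("1.6-bit:", w.nbytes, "bytes ->", pack_1p6bit(w).nbytes, "bytes")
```

The memory savings translate directly into the inference speedups reported for TriRun, since memory bandwidth typically dominates LLM decoding in memory-constrained settings; the actual kernels fuse unpacking with the matrix multiplication on the GPU rather than materializing the unpacked weights as done here.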