Blog Track Poster Thu, Apr 23, 2026 • 11:15 AM – 1:45 PM PDT Pavilion 3 P3-#1518

UnigramLM: An Attempt at Writing The Missing Manual

Clara Meister

[ OpenReview]

Abstract

This post is my attempt to write down the UnigramLM tokenization algorithm cleanly and explicitly because, well, I still haven't found such a derivation and I think understanding the theory behind the method could help us make it better. I'll formalize the generative model around which the algorithm is based, derive the EM updates, explain why pruning is needed (and how it's done), and point out the spots where the practical implementation defined by the SentencePiece library diverges from the pretty mathematical models.

Video

Chat is not available.