jina-vlm: Small Multilingual Vision Language Model
Andreas Koukounas ⋅ Georgios Mastrapas ⋅ Florian Hönicke ⋅ Sedigheh Eslami ⋅ Scott Martens ⋅ Han Xiao
Abstract
We present jina-vlm, a token-efficient 2.4B-parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and uses image tiling and attention pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
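The leave-one-out protocol described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the category names, the `leave_one_out_mixtures` and `contribution` helpers, and the assumption that each category is a separable subset of the training data are all hypothetical, chosen only to show how one ablation run per removed category is enumerated.

```python
from typing import Dict, List, Tuple

# Hypothetical category labels along the four axes named in the abstract.
# The actual categories used in the paper may differ.
CATEGORIES: Dict[str, List[str]] = {
    "task": ["vqa", "captioning", "ocr"],
    "domain": ["documents", "natural_images"],
    "modality": ["text_only", "image_text"],
    "language": ["en", "multilingual"],
}


def leave_one_out_mixtures(
    categories: Dict[str, List[str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Build one ablation run per category: the full mixture minus that category."""
    full = [(axis, name) for axis, names in categories.items() for name in names]
    runs = {}
    for axis, name in full:
        # Keep every category except the one being ablated.
        runs[f"no_{axis}:{name}"] = [c for c in full if c != (axis, name)]
    return runs


def contribution(baseline_score: float, ablated_score: float) -> float:
    """Score drop when a category is removed; a value near zero suggests redundancy."""
    return baseline_score - ablated_score


runs = leave_one_out_mixtures(CATEGORIES)
```

Under these assumptions, each entry in `runs` would be trained and evaluated separately, and `contribution` compares its benchmark score with that of the full-mixture baseline. A large drop marks a category as necessary, while a negligible drop marks it as a candidate for removal.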