Poster in Workshop: 3rd Workshop on Navigating and Addressing Data Problems For Foundation Models (DATA-FM)

jina-vlm: Small Multilingual Vision Language Model

Andreas Koukounas ⋅ Georgios Mastrapas ⋅ Florian Hönicke ⋅ Sedigheh Eslami ⋅ Scott Martens ⋅ Han Xiao

Abstract

We present jina-vlm, a token-efficient 2.4B-parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and uses image tiling and attention pooling to process arbitrary-resolution images with few visual tokens. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
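As a rough illustration of the leave-one-out design described in the abstract, the sketch below enumerates one ablated mixture per data category. It is not the authors' code: the category names, the weights, and the choice to renormalize the remaining weights so they sum to 1 are all assumptions made for the example.

```python
from typing import Dict


def leave_one_out_mixtures(mixture: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Build one ablated mixture per category.

    Each ablated mixture drops a single category and rescales the remaining
    sampling weights to sum to 1 (renormalization is an assumption here; the
    abstract does not say how the removed share is redistributed).
    """
    ablations: Dict[str, Dict[str, float]] = {}
    for held_out in mixture:
        remaining = {name: w for name, w in mixture.items() if name != held_out}
        total = sum(remaining.values())
        if total <= 0:
            raise ValueError(f"no weight left after removing {held_out!r}")
        ablations[held_out] = {name: w / total for name, w in remaining.items()}
    return ablations


# Hypothetical baseline mixture; names and weights are illustrative only.
baseline = {"vqa": 0.4, "ocr": 0.2, "captioning": 0.2, "text_only": 0.2}
ablations = leave_one_out_mixtures(baseline)

for held_out, weights in ablations.items():
    print(f"without {held_out}: {weights}")
```

Training one model per ablated mixture and comparing each against the full-mixture baseline on the same benchmarks would then indicate which categories matter: a large drop suggests the removed category is necessary, while little or no change suggests it is redundant.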
