[Short] Downstream Effects of Translation Scale with Language Difficulty
Aditya Vikas Kulkarni ⋅ ⋅ Ammar Ahmed Pallikonda Latheef ⋅ Pritam Mukherjee ⋅ Jacob Luber ⋅ Paul Yi
Abstract
Large language models (LLMs) perform best in high-resource languages, motivating machine translation (MT) as a preprocessing step for multilingual inference. However, translation may alter task-relevant linguistic cues and thereby degrade downstream model performance. It remains unclear whether such degradation is arbitrary or systematic across languages. We quantify translation-induced downstream drift using round-trip translation (English to pivot language to English) across eight pivot languages from Europe (German, Spanish, French, Italian, Portuguese) and Asia (Chinese, Hindi, Thai), holding the source texts and downstream models fixed. Across two downstream tasks (radiology finding extraction from clinical reports and text retrieval), translation introduces performance drops that increase with language difficulty (U.S. State Department categories; Spearman $\lvert\rho\rvert \geq 0.83$), suggesting systematic rather than random drift and providing an external, pre-translation diagnostic. Over repeated round-trip translations, performance drops early and then stabilizes in subsequent round trips. Semantic similarity metrics (COMET) can track this drift, providing a lightweight post-translation diagnostic. Our findings suggest that preprocessing non-English texts with MT may introduce systematic biases that degrade downstream trained models and tasks, highlighting an important pitfall in the equitable multilingual use of LLMs at scale.
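The correlation analysis described in the abstract can be illustrated with a minimal sketch, assuming each pivot language has one difficulty category and one downstream score after a round trip. Everything named here is hypothetical: the language-to-category mapping follows the commonly cited State Department grouping and is not taken from the paper, and the scores are placeholder values chosen only to make the example run, not reported results. The rank correlation is implemented directly (average ranks for ties) to avoid external dependencies; it is not necessarily the authors' implementation.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def drift_vs_difficulty(category, baseline, round_trip_scores):
    """Correlate difficulty category with the drop from the baseline score."""
    langs = sorted(round_trip_scores)
    difficulty = [category[lang] for lang in langs]
    drops = [baseline - round_trip_scores[lang] for lang in langs]
    return spearman_rho(difficulty, drops)


# ASSUMPTION: category mapping is illustrative, not taken from the paper.
CATEGORY = {"es": 1, "fr": 1, "it": 1, "pt": 1, "de": 2, "hi": 3, "th": 3, "zh": 4}

# PLACEHOLDER scores (not real results): downstream metric after one round trip.
PLACEHOLDER_SCORES = {
    "es": 0.89, "fr": 0.88, "it": 0.89, "pt": 0.88,
    "de": 0.87, "hi": 0.84, "th": 0.83, "zh": 0.82,
}

if __name__ == "__main__":
    rho = drift_vs_difficulty(CATEGORY, baseline=0.90,
                              round_trip_scores=PLACEHOLDER_SCORES)
    print(f"Spearman rho (placeholder data): {rho:.2f}")
```

Because the difficulty categories contain many ties (several languages share a category), average ranks are used rather than the tie-free shortcut formula, which would give a biased value here.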