Learning From Dictionary: Enhancing Robustness of Machine-Generated Text Detection in Zero-Shot Languages via Adversarial Training
Abstract
Machine-generated text (MGT) detection is critical for safeguarding online content integrity and preventing the spread of misleading information. Although existing detectors achieve high accuracy in monolingual settings, they suffer severe performance degradation on zero-shot languages and are vulnerable to adversarial attacks. To tackle these challenges, we propose a robust adversarial training framework named \textbf{T}ranslation-based \textbf{A}ttacker \textbf{S}trengthens Mul\textbf{T}ilingual Def\textbf{E}nder (\detectorname). \detectorname comprises two core components: an attacker that performs code-switching by querying translation dictionaries to generate adversarial examples, and a detector trained to resist these attacks while generalizing to unseen languages. We further introduce a novel Language-Agnostic Adversarial Loss (LAAL), which encourages the detector to learn language-invariant feature representations, thereby enhancing zero-shot detection performance and robustness against unseen attacks. In addition, the attacker and detector are updated synchronously, enabling continuous improvement of the detector's defensive capabilities. Experimental results on 9 languages and 8 attack types show that \detectorname surpasses 8 state-of-the-art (SOTA) detectors, improving the average F1 score by \textbf{0.064} and reducing the average Attack Success Rate (ASR) by \textbf{3.8\%}. Our framework offers a promising approach to building robust multilingual MGT detectors that generalize to real-world adversarial scenarios. We will release our code, models, and dataset upon acceptance.