Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practice to resist LLM social media bots
Abstract
The capacity of Large Language Models (LLMs) to generate large quantities of text at speed has intensified both the scale and the strategic manipulation of political discourse on social media, contributing to the propagation of conflict escalation narratives. Existing literature largely focuses on platform-led moderation as a countermeasure, yet platform-level approaches have been found to face significant challenges in combatting misinformation. In this paper, we propose a user-centric view of ``jailbreaking'' as an emergent, non-violent de-escalation practice. Jailbreaking in this setting involves online users engaging with suspected LLM-powered accounts to circumvent LLM safeguards, thereby exposing automated behaviour and disrupting the circulation of misleading narratives. In this way, jailbreaking supports user-led efforts to unveil inauthentic accounts and contributes to peace building endeavours.