ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
Rishav Rishav ⋅ Pushpak Pujari ⋅ Pushpendre Rastogi
Abstract
Existing prompt optimization methods either analyze individual failures in isolation or compare prompt variants across examples. Both approaches operate on single execution traces and have no access to the reasoning process that distinguishes success from failure on the same input. We introduce \textbf{ContraPrompt}, built on the observation that when a language model fails on a task but succeeds on a subsequent retry with feedback, the difference between its two \emph{chain-of-thought traces} constitutes an optimization signal that methods operating on single traces or on final-output comparisons do not capture. Unlike prior contrastive methods that compare final outputs or prompt variants, we compare complete intermediate reasoning processes: the two traces share model, input, and base prompt, so the only differences that remain are the reasoning strategy and (as a consequence of the retry mechanism) the appended error feedback. We call this operation \emph{dyadic reasoning trace analysis}. The multi-attempt solving phase is structured as an instrumented agentic retry loop that generates this contrastive data automatically, without human annotation. Extracted rules are organized into an input-aware decision tree that routes instructions by observable input characteristics. Evaluated on four reasoning and compliance benchmarks, ContraPrompt outperforms GEPA~\citep{agrawal2025gepa} on all four, with absolute gains of $+8.29$ pp on HotPotQA ($+20.8\%$ rel.), $+2.21$ pp on GDPR-Bench ($+18.2\%$ rel.), $+7.14$ pp on GPQA~Diamond ($+10.6\%$ rel.), and $+0.74$ pp on BBH ($+0.85\%$ rel.). Ablations confirm that dyadic trace contrastivity is the critical component: removing it reduces average performance by $16\%$ relative.
The mechanism generalizes beyond prompt optimization. On 53 EvalSet black-box optimization problems at equal budget, ContraPrompt beats GEPA head-to-head on 11 problems, ties on 41, and loses on 1. On FiNER-139 financial named entity recognition~\citep{loukas2022finer} (a 139-class high-cardinality classification task), ContraPrompt achieves $+7.77$ pp over the unoptimized baseline ($+11.6\%$ rel.) and $+1.94$ pp over GEPA ($+2.66\%$ rel.), with the input-aware tree producing branch conditions that align with standard US GAAP financial-instrument categories. We release artefacts, including the optimized prompts, for reproduction.\footnote{\href{https://github.com/rishvv/contraprompt_artefacts/}{https://github.com/rishvv/contraprompt\_artefacts/}}
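The retry loop described above can be illustrated with a minimal sketch. It is not the released implementation: the names `Attempt`, `TracePair`, `collect_trace_pairs`, and `toy_solve` are hypothetical, the feedback string is a placeholder, exact-match checking stands in for a task-specific scorer, and pairing the most recent failure with the first success is an assumption. The sketch only shows how a (failed trace, successful trace) pair on the same input could be harvested without annotation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    trace: str      # chain-of-thought text produced for this attempt
    answer: str     # final answer extracted from the attempt
    correct: bool   # whether the answer matched the reference


@dataclass
class TracePair:
    task_input: str
    failed: Attempt      # attempt immediately preceding the success
    succeeded: Attempt   # first attempt that matched the reference
    feedback: str        # error feedback appended before the success


def collect_trace_pairs(
    examples: List[Tuple[str, str]],
    solve: Callable[[str, Optional[str]], Tuple[str, str]],
    max_attempts: int = 3,
) -> List[TracePair]:
    """Run a retry loop and keep fail-then-succeed pairs on the same input."""
    pairs: List[TracePair] = []
    for task_input, gold in examples:
        feedback: Optional[str] = None
        last_failure: Optional[Attempt] = None
        for _ in range(max_attempts):
            trace, answer = solve(task_input, feedback)
            attempt = Attempt(trace, answer, answer == gold)
            if attempt.correct:
                # Only inputs that failed first yield a contrastive pair.
                if last_failure is not None and feedback is not None:
                    pairs.append(
                        TracePair(task_input, last_failure, attempt, feedback)
                    )
                break
            last_failure = attempt
            # Placeholder feedback; a real system would describe the error.
            feedback = f"Your previous answer {answer!r} was marked incorrect."
    return pairs


def toy_solve(task_input: str, feedback: Optional[str]) -> Tuple[str, str]:
    """Stand-in for a language model call (hypothetical, deterministic)."""
    if task_input == "2+2":
        return ("add the numbers", "4")
    if feedback is None:
        return ("guess quickly", "5")
    return ("recompute step by step", "6")


demo_pairs = collect_trace_pairs(
    [("2+2", "4"), ("3+3", "6"), ("1+1", "2")], toy_solve
)
```

In this toy run only the input `"3+3"` contributes a pair: `"2+2"` succeeds on the first attempt and `"1+1"` never succeeds, so neither offers a failure and a success on the same input to contrast.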