

Poster
in
Workshop: ICLR 2025 Workshop on Human-AI Coevolution

PATTERNS AND MECHANISMS OF CONTRASTIVE ACTIVATION ENGINEERING

Yixiong Hao · Ayush Panda · Stepan Shabalin · Sheikh Abdur Raheem Ali


Abstract:

Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs through targeted modifications to their internal representations. Applied at inference time with zero cost, CAE has the potential to introduce a new paradigm of flexible, task-specific LLM behavior tuning. We analyze the performance of CAE in in-distribution and out-of-distribution settings, evaluate its drawbacks, and begin to develop comprehensive guidelines for its effective deployment. We find that: 1. CAE is only reliably effective when applied in in-distribution contexts. 2. Increasing the number of samples used to generate steering vectors has diminishing returns at around 80 samples. 3. Steering vectors are susceptible to adversarial inputs that reverse the behavior being steered for. 4. Steering vectors harm overall model perplexity. 5. Larger models are more resistant to steering-induced degradation.
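The abstract does not spell out the mechanics, but the standard CAE recipe in the steering-vector literature is: take mean activations over a set of "positive" prompts and a set of "negative" prompts at a chosen layer, subtract them to get a steering vector, then add a scaled copy of that vector to the hidden states at inference time. A minimal sketch with toy NumPy activations (the shapes, sample count of 80, and scale `alpha` are illustrative assumptions, not values from the paper):

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations between contrastive prompt sets.

    pos_acts, neg_acts: (num_samples, hidden_dim) arrays of layer
    activations collected on positive / negative example prompts.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden: np.ndarray, vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to each token's hidden state."""
    return hidden + alpha * vec

# Toy demonstration with random stand-ins for real model activations.
rng = np.random.default_rng(0)
hidden_dim = 8
pos = rng.normal(loc=1.0, size=(80, hidden_dim))   # 80 samples, echoing the paper's saturation point
neg = rng.normal(loc=-1.0, size=(80, hidden_dim))

vec = steering_vector(pos, neg)                    # shape: (hidden_dim,)
tokens = rng.normal(size=(5, hidden_dim))          # 5 token positions at some layer
steered = apply_steering(tokens, vec, alpha=0.5)   # broadcast add, shape preserved
```

In a real deployment this addition would typically be registered as a forward hook on one transformer layer, which is what makes the method zero-cost at inference time relative to fine-tuning.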
