

Poster in Workshop: ICLR 2025 Workshop on Bidirectional Human-AI Alignment

Patterns and Mechanisms of Contrastive Activation Engineering

Yixiong Hao · Ayush Panda · Stepan Shabalin · Sheikh Abdur Raheem Ali


Abstract:

Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs through targeted modifications to their internal representations. Applied at inference time with zero cost and token-level control, CAE has the potential to introduce a new paradigm of flexible, task-specific LLM behavior tuning. This paper presents early results from a systematic investigation of CAE behavior in practical applications and begins to develop comprehensive guidelines for its effective deployment. We find that (1) CAE is only reliably effective when applied in in-distribution settings; (2) the marginal value of using more samples to generate steering vectors diminishes at around 80 samples; (3) steering vectors are susceptible to adversarial inputs; (4) steering vectors harm model perplexity; and (5) larger models are more resistant to steering-induced degradation.
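
To make the mechanism concrete, below is a minimal sketch of contrastive activation steering: a steering vector is computed as the difference of mean residual-stream activations between contrastive prompt sets, then added to the residual stream at inference time via a forward hook. This is an illustrative sketch, not the authors' code; it assumes a LLaMA-style Hugging Face model whose decoder layers live at `model.model.layers`, and the model name, layer index, scaling coefficient, and prompt pairs are placeholder assumptions.

```python
# Illustrative sketch of contrastive activation steering (not the paper's implementation).
# Assumes a LLaMA-style Hugging Face model (decoder layers at model.model.layers);
# the model name, layer index, coefficient, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 13  # illustrative intermediate layer

def mean_residual(prompts, layer=LAYER):
    """Mean residual-stream activation at `layer`, taken at each prompt's last token."""
    captured, acts = {}, []

    def hook(_module, _inputs, output):
        # Decoder layers typically return a tuple; hidden states are the first element.
        captured["h"] = output[0] if isinstance(output, tuple) else output

    handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(model.device)
            model(**ids)
            acts.append(captured["h"][0, -1, :].float())
    handle.remove()
    return torch.stack(acts).mean(dim=0)

# Contrastive prompt sets: behavior to promote vs. behavior to suppress (placeholders).
pos_prompts = ["I am always honest and helpful."]
neg_prompts = ["I deceive people whenever I can."]

steering_vec = mean_residual(pos_prompts) - mean_residual(neg_prompts)

def steer_hook(_module, _inputs, output):
    # Add the scaled steering vector to every token position at inference time.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steering_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
ids = tok("Tell me about your plans.", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=50)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Because the vector is injected only through a forward hook, steering costs nothing at training time and can be enabled, rescaled, or removed per request, which is the inference-time, token-level control property the abstract refers to.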
