Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation
PATTERNS AND MECHANISMS OF CONTRASTIVE ACTIVATION ENGINEERING
Yixiong Hao · Ayush Panda · Stepan Shabalin · Sheikh Abdur Raheem Ali
To this day, controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs at inference time through targeted modifications to their internal representations. This study presents early results from a systematic investigation of CAE behavior and practical applications and begins to develop comprehensive guidelines for its effective deployment. We find that: (1) CAE is only reliably effective when applied in in-distribution settings; (2) the marginal value of using more samples to generate steering vectors diminishes at around 80 samples; (3) steering vectors are susceptible to adversarial inputs; (4) steering vectors harm model perplexity; (5) larger models are more resistant to steering-induced degradation. We also (6) provide a lightweight out-of-distribution evaluation method for steering vectors.
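The core CAE operation described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the authors' implementation): a steering vector is computed as the difference of mean activations between contrastive positive and negative prompt sets at one layer, then added, scaled by a coefficient, to the hidden states at inference time. Array shapes, the `alpha` coefficient, and the toy data are illustrative assumptions.

```python
import numpy as np


def compute_steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive steering vector: difference of the mean activations
    of positive vs. negative prompts at a chosen layer.
    pos_acts, neg_acts: (n_samples, d_model)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def apply_steering(hidden_states: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every token position's
    hidden state at the intervention layer during inference.
    hidden_states: (seq_len, d_model)."""
    return hidden_states + alpha * v


# Toy demonstration with synthetic "activations" (d_model = 8).
rng = np.random.default_rng(0)
pos_acts = rng.normal(loc=1.0, size=(80, 8))   # ~80 contrastive samples, roughly the
neg_acts = rng.normal(loc=-1.0, size=(80, 8))  # point of diminishing returns noted above
v = compute_steering_vector(pos_acts, neg_acts)
steered = apply_steering(rng.normal(size=(5, 8)), v, alpha=0.5)
```

In practice the activations would come from a forward hook on a transformer layer rather than synthetic arrays, and `alpha` trades off steering strength against the perplexity degradation the abstract reports.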