Auditing Preference-Based Post-Training of LLMs via Strong Membership Inference Attacks
Abstract
Preference-based post-training is critical for aligning large language models (LLMs) with human intent. However, it raises privacy concerns, because the instruction and feedback data used in this stage may contain sensitive information such as personal identifiers or user-specific preferences. While membership inference attacks (MIAs) have been widely studied for pre-training and supervised fine-tuning, their effectiveness against preference-based post-training remains underexplored. In this work, we systematically evaluate privacy vulnerabilities in modern post-training pipelines using strong MIAs. We introduce LiRA-J, a preference-aware variant of the likelihood ratio attack (LiRA) for membership inference on preference data. Through experiments across a range of datasets and model families, we compare the most prevalent post-training approaches and identify patterns in their vulnerability. Our analysis further examines key factors that affect privacy risk in preference-based post-training, including regularization strategies. Our findings highlight privacy vulnerabilities in preference-based post-training and underscore the need to audit aligned models with preference-aware membership inference protocols.