SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From
Abstract
Fingerprinting Large Language Models (LLMs) is essential for provenance verification and model attribution. Existing methods typically extract post-hoc signatures from training dynamics, data exposure, or hyperparameters: properties that emerge only after substantial training. As a result, prior evaluations largely focus on lineage verification after fine-tuning, where detection is considerably easier, potentially giving a false sense of security. In contrast, we propose a stronger and more intrinsic notion of LLM fingerprinting: SeedPrints, a method that leverages random-initialization biases as persistent, seed-dependent identifiers present even before training begins. We show that untrained models exhibit reproducible prediction biases induced by their initialization seed. Although weak in magnitude, these biases remain statistically detectable throughout training, enabling high-confidence lineage verification. Unlike prior techniques, which are unreliable before convergence or vulnerable to distribution shifts, SeedPrints remains effective at all training stages and robust under domain shifts and parameter modifications. Experiments on LLaMA-style and Qwen-style models demonstrate seed-level distinguishability and enable identity verification from a model's "birth" through its entire lifecycle, akin to a biometric fingerprint. Evaluations on large-scale pretrained models and fingerprinting benchmarks further confirm its effectiveness under prolonged training and realistic deployment scenarios. Together, these results suggest that initialization itself imprints a unique and persistent identity on an LLM, forming a true "Galtonian" fingerprint.
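The core intuition, that a random initialization alone induces a reproducible, seed-dependent preference ordering over output tokens, can be illustrated with a minimal toy sketch. This is a hypothetical stand-in using a random linear scorer over a small vocabulary, not the paper's actual model or detection statistic; the function name `toy_logit_bias` and all parameters are illustrative assumptions.

```python
import random

def toy_logit_bias(seed: int, vocab: int = 50, dim: int = 16) -> list:
    """Toy stand-in for an untrained LM head: the random init alone
    induces a deterministic, seed-dependent ranking over tokens.
    (Illustrative only; not the method proposed in the paper.)"""
    rng = random.Random(seed)
    # Fixed probe input, analogous to feeding the same prompt to each model.
    probe = [1.0] * dim
    scores = []
    for _ in range(vocab):
        # Small-variance Gaussian init, as is typical for LM weights.
        w = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        scores.append(sum(p * x for p, x in zip(probe, w)))
    # The "fingerprint": the token ranking induced purely by initialization.
    return sorted(range(vocab), key=lambda t: -scores[t])

# The same seed reproduces the same ranking exactly; different seeds
# yield different rankings with overwhelming probability.
print(toy_logit_bias(0) == toy_logit_bias(0))  # → True
print(toy_logit_bias(0) == toy_logit_bias(1))
```

In the paper's setting the bias signal is weak and must be aggregated statistically across many predictions, but the sketch captures the underlying point: identity is fixed at initialization, before any training occurs.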