When an LLM generates text, how can we later prove that it did? Watermarking embeds imperceptible signals into generated text — provable, scalable, and increasingly deployed in production systems serving hundreds of millions of users. This post maps 12+ schemes from 30 research papers spanning 2023–2025.
12+ watermark schemes · 4 watermark types · 100% TPR @ 0% FPR (data provenance) · 20M+ texts marked (SynthID)
Core insight: Unlike steganography (hiding a message in content), LLM watermarking modifies the sampling distribution during generation — making certain token patterns statistically unlikely to appear in unwatermarked text. Detection is a hypothesis test: was this text generated with a watermarked model?
Why Watermark?
Prove AI authorship for copyright attribution, detect AI-generated misinformation at scale, enable policy enforcement (EU AI Act mandates AI text disclosure), and protect model IP against extraction attacks.
The Detection Problem
A watermark detector runs a statistical hypothesis test on candidate text. It needs to answer: could this token sequence have occurred by random chance, or does it show a statistically significant green-token bias? The z-score threshold governs the tradeoff between false positives and missed detections.
Legal Admissibility
Data Provenance Auditing (2025) achieves 100% TPR at 0% FPR using Unicode cue-reply pairs — a court-admissible standard. SynthID-Text (Google, Nature 2024) is already deployed across Gemini APIs at production scale.
How does watermarking differ from traditional steganography?
Steganography hides a pre-defined message inside existing content (e.g., flipping LSBs in image pixels) without changing the visible content. LLM watermarking is different: there is no existing content to hide in — the model generates content fresh at inference time. Instead, watermarking biases the sampling process: the vocabulary is split into green/red lists, and the model samples preferentially from the green list. The "message" is implicit in the statistical deviation from expected sampling behaviour. This means watermarks are probabilistic — they can be detected but not decoded like steganographic messages unless bit-level schemes (REMARK-LLM, PersonaMark) are used.
What 4 types of watermark schemes exist?
1. Token-Level: Biases next-token probabilities during decoding (Kirchenbauer KGW, DiPmark). Fastest, widest adoption, but fragile to paraphrasing.
2. Semantic: Embeds signal in semantic/syntactic structure rather than surface tokens (REMARK-LLM, PersonaMark). More robust to surface-level edits, higher bit capacity.
3. Attribution / Provenance: Designed to identify specific users or sources (WASA, TRACE, Data Provenance Auditing). Uses entropy-gated injection or cue-reply pairs.
4. Production-Scale: End-to-end systems for real deployment (SynthID-Text via speculative sampling, ModelShield for model IP protection). Optimise for near-zero quality degradation.
Mechanisms · Kirchenbauer 2023
How Token-Level Watermarks Work
The Kirchenbauer (KGW) scheme — the most cited watermarking paper in this space — partitions the vocabulary into green and red lists using the previous token as a pseudorandom seed. During generation the model adds a hardness bias δ to green-token logits, skewing sampling without changing the visible topic.
Green/Red Vocabulary Partition
For each generated token position, a hash of the previous token seeds a PRNG that splits the full vocabulary into a green list (~50%) and a red list (~50%). The LM's logit distribution is then shifted: green-token logits receive +δ (hardness parameter). At inference the model still samples probabilistically — but green tokens are now more likely. Over a sequence of T tokens, the fraction of green tokens will be significantly above 0.5 if a watermark is present.
Worked example (γ = 0.50, δ = 2.0, T = 200): under H₀ the expected green count is γT = 100 tokens. With the watermark active, the observed green count is roughly 141, giving z = (141 − 100)/√(200 · 0.5 · 0.5) ≈ 5.8 — well past a z > 4 detection threshold, where the one-sided false-positive rate is only ≈ 0.003%.
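The partition-and-bias mechanism above can be sketched in a few lines. This is an illustrative reconstruction, not the reference implementation: the hash construction and default γ, δ values are stand-ins.

```python
import hashlib
import random

def green_list(prev_token: int, key: str, vocab_size: int, gamma: float = 0.5) -> set:
    """Seed a PRNG from (secret key, previous token), shuffle the vocabulary,
    and take the first gamma * |V| token ids as the green list."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    vocab = list(range(vocab_size))
    rng.shuffle(vocab)
    return set(vocab[: int(gamma * vocab_size)])

def watermark_logits(logits, prev_token, key, delta=2.0, gamma=0.5):
    """KGW 'soft' watermark: add +delta to every green-token logit.
    Sampling then proceeds as usual over the biased logits."""
    greens = green_list(prev_token, key, len(logits), gamma)
    return [v + delta if i in greens else v for i, v in enumerate(logits)]
```

Because the seed depends on the previous token and the secret key, the green list changes at every position; a detector holding the key simply recomputes each partition and counts green hits.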
Why Previous-Token Seeding?
Using the previous token as the PRNG seed means the green/red split changes at every position — an attacker cannot learn a fixed "green vocabulary" to exploit. The partition is context-dependent and unpredictable without the secret key. KGW uses a one-way hash so the key cannot be recovered from observed tokens.
Quality Impact
With δ=2.0 and γ=0.5, KGW reduces MAUVE text quality scores by ~5–8%. The quality-detectability tradeoff is the core tension: higher δ makes watermarks more detectable but degrades fluency. DiPmark (2023) eliminates this tradeoff by preserving the expected token probability distribution while still biasing green tokens.
KGW vs DiPmark: what's the key difference?
KGW adds a flat +δ to all green token logits before softmax, which distorts the probability distribution — words that happen to be in the red list become systematically underrepresented regardless of their semantic appropriateness. DiPmark instead applies a multiplicative reweighting that preserves the rank order of token probabilities within green/red groups. The expected output distribution is the same as the original model, meaning DiPmark is distribution-preserving (hence the name). This comes at no cost to detection power — both use the same z-score test — and DiPmark shows near-zero measurable quality degradation on MAUVE and PPL benchmarks.
Mechanisms · 6 Key Schemes
Watermark Scheme Gallery
Six schemes span the design space: token-level (fast, fragile), semantic (robust, slow), attribution (user-traceable), and production-scale (deployed).
Key insight: No single scheme dominates all 5 axes. SynthID-Text leads on scalability and quality (production-proven across 20M+ texts), while REMARK-LLM leads on robustness and bit capacity. TRACE leads for attribution in black-box settings. Choosing a watermark scheme is an engineering tradeoff, not a technical optimum.
Analysis · Hypothesis Testing
Detection & Verification
Watermark detection is a one-sided hypothesis test. The null hypothesis H₀ is "this text was generated without a watermark" — i.e., green-token fraction ≈ γ by random chance. The alternative H₁ is "the watermark bias is present." A z-score above threshold triggers detection.
Example settings: detection threshold z* = 4.0, token sequence length T = 200. At these values the false positive rate is ≈ 0.003% and the true positive rate ≈ 94%; reliable detection needs roughly 50 tokens minimum at this threshold.
Z-Score Formula
z = (|s|₊ - γT) / √(Tγ(1-γ))
Where |s|₊ = observed green token count, T = sequence length, γ = expected green fraction. Under H₀, z ~ N(0,1).
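The detector itself is a few lines given the formula above: count green tokens, compute z, and convert to a one-sided p-value. A minimal sketch (function names are my own):

```python
import math

def green_z_score(green_count: int, T: int, gamma: float = 0.5) -> float:
    """z = (|s|+ - gamma*T) / sqrt(T * gamma * (1 - gamma)); ~N(0,1) under H0."""
    return (green_count - gamma * T) / math.sqrt(T * gamma * (1 - gamma))

def one_sided_p(z: float) -> float:
    """Upper-tail probability of a standard normal, via the complementary
    error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For example, 130 green tokens out of T = 200 at γ = 0.5 gives z ≈ 4.24, just past a z* = 4 threshold, and one_sided_p(4.0) ≈ 3.2 × 10⁻⁵ — the ~0.003% false-positive rate quoted above.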
Multi-Bit Detection
REMARK-LLM and PersonaMark embed actual bit strings (not just presence/absence). Detection uses a correlation score between the observed token sequence and each possible bit pattern. Enables user-level attribution — who generated this text.
Black-Box vs White-Box
White-box detection requires access to model logits (exact green/red lists). Black-box detection (TRACE, DE-COP) only needs the output text — achievable via API calls or document comparison. TRACE achieves 72% attribution accuracy without model weights.
How does TRACE's entropy-gated approach work?
TRACE (2025) observes that high-entropy positions (where the model is uncertain between many tokens) are more flexible — any of several tokens would be semantically valid. Low-entropy positions (where one token is heavily favoured) are "committed" choices that cannot be changed without semantic distortion. TRACE only injects watermark bias at high-entropy positions, leaving low-entropy tokens untouched. This means the watermark signal is invisible in the most "fixed" parts of the text and detectable only by knowing which positions were high-entropy — knowledge only the watermark authority has. Black-box detection works by re-running the target model on the same prompt and checking entropy patterns.
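The entropy-gating decision can be sketched as below. The 1.5-bit threshold is an arbitrary illustration; TRACE's actual gating rule and threshold may differ.

```python
import math

def shannon_entropy(probs) -> float:
    """Entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def should_inject(probs, threshold_bits: float = 1.5) -> bool:
    """Bias only high-entropy positions; low-entropy ('committed') tokens
    are left untouched. Threshold is illustrative, not from the paper."""
    return shannon_entropy(probs) >= threshold_bits
```

A near-uniform distribution over eight plausible tokens (3 bits) would get watermarked, while a peaked distribution like (0.97, 0.01, 0.01, 0.01) (≈0.24 bits) passes through unmodified.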
What is the Data Provenance Auditing 100% TPR claim?
Data Provenance Auditing (2025) uses a different paradigm: Unicode-based cue-reply pairs embedded in training data. When the model is fine-tuned on watermarked data, it learns to reproduce specific Unicode character sequences (invisible to humans) in response to trigger prompts. Detection queries the model with these cues and checks for the expected Unicode reply. In controlled experiments this achieves 100% TPR at 0% FPR — essentially a keyed authentication code rather than a probabilistic watermark. The tradeoff: it requires controlling the training data (not applicable to existing models) and the approach is fragile if the training set is modified.
Analysis · Attack Vectors
Robustness vs Quality Tradeoffs
Every watermarking scheme faces a two-front challenge: maintaining text quality (users must not notice the watermark) and surviving adversarial attacks (attackers try to remove or spoof the watermark). The table below shows attack success rate — how often each attack strips the watermark signal.
The Waterfall insight (2024): Rather than fighting paraphrase attacks, Waterfall uses an LLM paraphraser as the watermark medium itself. The watermark is embedded in the paraphraser's output distribution, making it training-free and inherently paraphrase-resistant. This sidesteps the token-level vulnerability entirely.
Why Paraphrase Attacks Are Hard to Stop
Token-level watermarks (KGW, DiPmark) embed the signal in which specific tokens appear. A paraphrase that replaces 60–70% of tokens while preserving meaning will effectively destroy the green-token bias. Semantic watermarks (REMARK-LLM) are more resistant because the signal is in sentence structure and semantic choices — harder to change without altering meaning.
The RegionMarker Approach (2025)
RegionMarker targets Embedding-as-a-Service (EaaS) providers. Instead of watermarking output text, it embeds a watermark in a trigger region of the low-dimensional embedding space. The embedding model maps certain semantic regions to predictable coordinates, enabling dataset-level attribution without modifying generated text at all.
Applications · SynthID-Text & Attribution
Production & Attribution at Scale
From research prototype to 20 million texts: how Google's SynthID-Text closes the gap between watermarking theory and production deployment — and how attribution systems like WASA and Data Provenance Auditing handle the legal chain-of-custody problem.
SynthID-Text (Google DeepMind, Nature 2024)
Deployed across 20M+ Gemini API responses. Uses tournament-sampling: at each position, multiple candidate tokens are sampled and the one with the best score on a pseudorandom scoring function is selected. Unlike KGW which adds a logit bias, tournament sampling provably preserves the exact output distribution — there is no measurable quality degradation in human preference studies. Detection uses a likelihood-ratio test against the scoring function. Publicly available via Google Cloud Vertex AI.
Nature 2024 · 20M+ texts · zero quality loss
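A single-layer sketch of tournament sampling: draw several candidates from the model's unmodified distribution and keep the one scoring highest under a keyed pseudorandom function. The real SynthID-Text tournament is multi-layer and its non-distortion guarantee is more subtle (it holds in expectation over keys and contexts); this only shows the core selection idea, with invented function names.

```python
import hashlib

def g_score(token, context, key) -> float:
    """Keyed pseudorandom score in [0, 1) for a candidate token."""
    h = hashlib.sha256(f"{key}:{context}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def tournament_sample(sample_fn, context, key, m: int = 4):
    """Sample m candidates from the model's own distribution (sample_fn),
    then select the candidate with the highest keyed g-score. Detection
    later tests whether observed tokens have anomalously high g-scores."""
    candidates = [sample_fn() for _ in range(m)]
    return max(candidates, key=lambda t: g_score(t, context, key))
```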
Speculative Sampling Integration
SynthID-Text integrates with speculative decoding (which uses a small draft model to propose tokens, then a large model to verify). The watermark scoring function is applied during the verification step — no additional latency is added to the inference pipeline. This makes it compatible with production serving infrastructure without architectural changes.
WASA: Who Are You Sampling?
WASA (2023) focuses on source attribution for training data. If a model's output matches a particular dataset's style and vocabulary, WASA's attribution model identifies the source. Uses contrastive analysis across reference corpora, achieving meaningful attribution signals even for indirect memorisation where verbatim reproduction is absent.
ModelShield (IEEE TIFS 2025): Protects against model extraction attacks — where adversaries query an API model extensively to train a "stolen" clone. ModelShield embeds a self-watermark in the model's output that persists through fine-tuning and distillation. When the stolen model is later queried, the watermark is still detectable, proving the clone's lineage. This is the only scheme designed to survive model-to-model copying.
How does PersonaMark enable per-user attribution?
PersonaMark (2024) assigns each user a unique secret key derived from their user ID via a one-way hash. This key seeds the green/red partition for that user's generations. When suspicious text is found, the operator runs detection with each registered user's key — only the actual generator's key will yield a significant z-score. The scheme uses sentence-level structure (not individual tokens) as the watermark medium, making it more robust to character-level attacks. Crucially, a user cannot detect or remove their own watermark because they do not know their key.
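The per-user attribution loop described above can be sketched as key derivation plus a keyed detector run once per registered user. Here `detect_z` stands in for the scheme's keyed detection routine; the function names and threshold are assumptions for illustration.

```python
import hashlib

def user_key(user_id: str, master_secret: str) -> str:
    """One-way derivation: a user cannot recover their key from the ID alone."""
    return hashlib.sha256(f"{master_secret}:{user_id}".encode()).hexdigest()

def attribute(tokens, registered_users, master_secret, detect_z, z_threshold=4.0):
    """Run the keyed detector with each registered user's key; only the
    actual generator's key should yield a significant z-score."""
    best_user, best_z = None, z_threshold
    for uid in registered_users:
        z = detect_z(tokens, user_key(uid, master_secret))
        if z > best_z:
            best_user, best_z = uid, z
    return best_user  # None if no key exceeds the threshold
```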
Open Challenges
The Arms Race & Open Problems
Every watermarking scheme published since 2023 has been followed by an attack demonstrating its limitations. The field is in active adversarial competition — and several fundamental challenges remain unsolved as of 2025.
1. Paraphrase Fragility
Token-level watermarks are destroyed by two to three rounds of paraphrasing. Neural paraphrasers (GPT-4o, Claude 3) achieve near-native fluency while eliminating the green-token signal. No token-level scheme has shown reliable robustness to adaptive paraphrase attacks from frontier LLMs.
2. Spoofing Attacks
An adversary who knows the watermark scheme (but not the key) can craft texts that appear watermarked using trial and error. Spoofing is particularly dangerous for attribution systems — false positives could wrongly attribute human writing to an AI, or frame a specific user.
3. Multi-Modal Gap
Watermarking for text-to-image (SynthID-Image) and text-to-audio exists separately. There is no unified framework that watermarks multimodal outputs — an LLM that generates both text and image descriptions in the same response cannot be consistently watermarked across modalities.
4. Quantization Brittleness
4-bit quantization of LLMs changes token probability distributions — potentially destroying watermarks embedded via logit biasing. Research (from the machine unlearning domain) shows 4-bit quantization recovers 83% of "forgotten" knowledge, suggesting quantization similarly degrades watermark signals.
5. Short-Text Detection
Z-score tests require a minimum token count (~50–200) for statistical significance. Short outputs (tweets, code comments, single sentences) cannot be reliably watermarked with current token-level schemes. Semantic and attribution approaches have higher per-token overhead, making short-text watermarking an active research gap.
6. Dataset-Level Watermarks
Wei et al. (2024) demonstrated dataset-level watermarks on BLOOM-176B using hypothesis testing — detecting if a model was trained on a specific dataset, not just if individual texts were generated by it. This is the frontier for copyright enforcement: proving a model's training violated copyright without needing the model's weights.
Research trajectory (2023–2025): The field began with Kirchenbauer's green/red token partition (2023) and has rapidly evolved through distribution-preserving variants (DiPmark), semantic embedding (REMARK-LLM), entropy-gated attribution (TRACE), and production deployment (SynthID-Text, Nature 2024). The next frontier is multi-modal and adversarially robust watermarking that can survive frontier-model paraphrase attacks while remaining legally admissible.
What does the EU AI Act require regarding watermarking?
The EU AI Act (in force August 2024) requires providers of general-purpose AI systems to mark AI-generated content in a machine-readable format where "technically feasible." The Act does not mandate a specific watermarking scheme but establishes liability for providers who fail to implement disclosure mechanisms. SynthID-Text's deployment across Gemini ahead of the Act's enforcement timeline suggests major providers are already building compliance infrastructure. The challenge is that "technically feasible" has no clear threshold — a scheme that works for long-form text may not satisfy the requirement for short outputs.
Can watermarking replace membership inference for copyright enforcement?
No — they answer different questions. Watermarking asks: "was this text generated by a watermarked model?" Membership Inference attacks ask: "was this text in the model's training data?" For copyright enforcement, you typically need both: first prove the model was trained on copyrighted data (MIA), then prove that specific outputs are derived from that data (watermarking for generation attribution). Data Provenance Auditing bridges this gap by using training-data watermarks that affect generation behaviour — but requires controlling the training process, making it inapplicable to existing deployed models.