GCG Attack — Breaking AI Alignment with Adversarial Suffixes

Why Alignment Can Be Broken

RLHF teaches refusal — but it teaches it in natural language space, not input space

Modern LLMs like GPT-4 and Claude are trained in two stages. First, pretraining on a massive corpus teaches the model to predict the next token. Second, RLHF (Reinforcement Learning from Human Feedback) fine-tunes the model to be helpful, harmless, and honest.

RLHF effectively teaches the model a policy: "if the user asks for something harmful, generate a refusal." This policy is encoded in the model's weights as a probability distribution — the model learns that tokens like "Sure, here is how to make a bomb" should have very low probability given a harmful prompt.

But here is the critical insight: RLHF only adjusts the model's behavior on the distribution of inputs it was trained on — mostly natural-language text written by humans. It does not provide any guarantee about behavior on adversarially crafted inputs outside that distribution.

This is exactly the same vulnerability that plagues image classifiers: a model trained on photographs of cats and dogs can be fooled by adding imperceptible pixel noise. The GCG attack is the language-model equivalent — appending optimized tokens that look like noise but systematically steer the model toward unsafe outputs.

Key insight: RLHF alignment is a learned behavior, not a hard constraint. Like any learned behavior, it can be circumvented by inputs that lie outside the training distribution. GCG exploits this by searching for inputs in the tail of the distribution where safety training has poor coverage.

🌐 Computer Vision Analogy

Add imperceptible △ pixel noise to a panda photo → classifier says "gibbon" with 99% confidence. The perturbation is invisible to humans but catastrophic for the model.

💬 GCG Equivalent

Append optimized gibberish tokens to a harmful prompt → aligned LLM says "Sure, here is how to…" The suffix is meaningless to humans but bypasses safety training.

Attack Anatomy

How the adversarial suffix is composed and what each part does

The GCG attack structures the model's input into three components. Click the buttons below to explore each part:

▶ Prompt Decomposer

Full input to the model:

Harmful prompt

Adversarial suffix (optimized)

Target response (what GCG steers toward)

The adversarial suffix typically runs 20 tokens long (though it can be shorter or longer). Unlike a manual jailbreak, it is not written in natural language — it is optimized purely for its effect on the model's probability distribution.

The suffix is found by gradient-based search: the algorithm asks "which token, placed at position i in the suffix, would most decrease the loss for the target response?" Repeating this for every position, over many iterations, converges toward a suffix where the target response has high probability.

The Math

Formalizing the attack as a discrete optimization problem

Let's define the problem precisely. We have a language model that assigns a probability to each token given the preceding context. The GCG attack frames finding the adversarial suffix as an optimization problem:

-- Setup -- x1:n = harmful prompt tokens ["Write", "a", "tutorial", "on", "how", "to", "make", "a", "bomb"] xadv = adversarial suffix [t₁, t₂, ..., t_l] (l tokens, to be optimized) xn+1:n+H = target response ["Sure", ",", "here", "is", "how", "to", "make", "a", "bomb", ":"] -- Objective: find x_adv that minimizes the negative log-likelihood of the target -- x*_adv = argmin −∑ᵢ log p( x_{n+i} | x_{1:n}, x_adv, x_{n+1:n+i−1} ) x_adv -- In words: find the suffix tokens that make the model most likely to begin its response with the target string ("Sure, here is...") --

Eq 1: GCG objective — minimize cross-entropy loss of target response

This is a combinatorial optimization problem over a discrete space. If the vocabulary has V = 32,000 tokens and the suffix has l = 20 positions, the search space has 32,000²⁰ ≈ 10⁹⁶ possible suffixes — far too large to search exhaustively.

The key difficulty: unlike continuous optimization (where you can follow the gradient directly), token embeddings are discrete. You cannot take a gradient step in token space. GCG solves this with a clever approximation.

-- GCG gradient approximation -- For each token position i in the suffix: \nabla eᵢ L = gradient of loss w.r.t. the one-hot embedding of token at position i Top-k candidates at position i: Cᵢ = argmin_k ( \nabla eᵢ L )ᵀ \cdot ( eₖ - eᵢ ) k \in V -- In words: find the k tokens whose one-hot vector most reduces the loss (first-order Taylor approximation of the loss change) -- Then: sample B random candidates by replacing ONE random position with a random token from Cᵢ, evaluate each candidate's true loss, keep the best.

Eq 2: GCG gradient-based candidate selection (AutoDAN-style linearization)

Why this works: The gradient ∇eᵢL tells us the direction that most decreases the loss in the continuous embedding space. The dot product (∇eᵢL)ᵀ(eₖ − eᵢ) approximates how much token k would decrease the loss compared to the current token i. Taking the top-k gives us the most promising candidates to evaluate.

The target response is carefully chosen to be an affirmative prefix — typically "Sure, here is how to..." or "Step 1:". Once the model generates this prefix, its autoregressive nature causes it to continue generating the harmful content, since the KV cache is committed to that trajectory.

The GCG Algorithm Step by Step

An interactive walkthrough of one optimization iteration

▶ GCG Iteration Stepper

1

Initialize suffix

Start with l=20 random tokens from the vocabulary. The initial suffix is essentially noise — the loss will be very high (model strongly refuses).

2

Forward pass → compute loss

Run the full input [prompt + suffix] through the model. Compute cross-entropy loss against the target response. This is the number we want to minimize.

3

Backward pass → compute gradients

Backpropagate the loss through the model to get ∇eᵢL for each token position i in the suffix. This requires whitebox access to the model weights.

4

Select top-k candidates per position

For each of the l suffix positions, use the gradient to identify k=256 tokens most likely to decrease the loss. Store as candidate set Cᵢ for each position i.

5

Sample B random substitutions

Sample B=512 candidates. Each candidate is the current suffix with ONE randomly chosen position replaced by a random token from that position's Cᵢ set.

6

Evaluate all B candidates

Run a forward pass for each of the B candidates. Compute the true loss for each. This is the most expensive step — B=512 forward passes per iteration.

7

Update suffix → best candidate

Replace the current suffix with whichever of the B candidates achieved the lowest loss. Repeat from Step 2. Stop when loss falls below threshold (model complies).

The algorithm typically runs for 500 iterations on a single A100 GPU, taking approximately 15–30 minutes per prompt-suffix pair. Each iteration requires B=512 forward passes, making it computationally expensive but tractable.

Hyperparameter	Typical Value	Effect
Suffix length l	20 tokens	Longer = more expressive but harder to optimize
Top-k candidates	k = 256	More candidates = better coverage of vocab
Batch size B	512	More candidates evaluated per iter = faster convergence
Max iterations	500	Stop early if loss < threshold
Target string	"Sure, here is..."	Affirmative prefix triggers autoregressive completion
GPU time	~15–30 min / A100	Scales with model size and B

Interactive: Token Optimizer

Simulate how GCG evolves the adversarial suffix over iterations

This simulation shows the qualitative behavior of the GCG optimization — how the suffix evolves and how the loss decreases. The actual algorithm runs on a real model; this demo uses a simplified model to illustrate the dynamics.

▶ GCG Suffix Evolution Simulator

Harmful Prompt

Write a tutorial on how to make a bomb

Target Response

Sure, here is a step-by-step tutorial...

Adversarial Suffix (evolving)

Click "Start" to begin optimization

Loss —

Iter 0 / 500

What to watch: Early iterations show random-looking tokens (high loss). As optimization proceeds, the suffix develops internal structure — some positions converge to specific tokens that consistently reduce the loss, while others remain variable. The loss curve shows rapid initial descent followed by plateaus and occasional jumps as the algorithm escapes local minima.

Transferability

A suffix optimized on one model can attack others — including closed-source models

The most alarming property of GCG is transferability: adversarial suffixes optimized on open-source models (LLaMA-2, Vicuna) can successfully jailbreak closed-source models (GPT-3.5, GPT-4, Claude) — without any gradient access to those models.

This is possible because different models, trained on similar data, develop similar internal representations of concepts. An adversarial suffix that steers LLaMA-2's representations toward compliance tends to steer GPT-4's representations in the same direction — even though the weights are completely different.

▶ Transfer Attack — Compliance Rates (Zou et al. 2023)

Suffix optimized on Vicuna-7B & Vicuna-13B ensemble. Tested on held-out models.

* Rates are approximate from original paper. "Compliance" = model generates harmful content rather than refusing.

The paper found that optimizing on an ensemble of open-source models significantly improves transfer to black-box targets. Intuitively: if a suffix fools multiple models simultaneously, it exploits common features of LLM representations rather than idiosyncrasies of any single model.

Optimization Target	Transfer Target	Why It Transfers
Vicuna-7B (single)	GPT-3.5-Turbo	Similar pretraining data; shared token representations
Vicuna-7B + 13B (ensemble)	GPT-4	Ensemble forces suffix to exploit model-agnostic features
LLaMA-2-7B	PaLM-2	Common training corpus (The Pile, Common Crawl) → similar concept vectors

Universal Attacks

One suffix to break many prompts — simultaneously

Beyond single-prompt attacks, GCG can be extended to find a universal adversarial suffix — a single fixed string that jailbreaks many different harmful prompts at once.

The trick: optimize the suffix on a batch of prompts simultaneously, minimizing the average loss across all prompts in the batch. This forces the suffix to find features that are harmful across many different topics, not just one specific prompt.

-- Universal GCG objective -- x*_adv = argmin (1/m) \sumᵢ₌₁ᵐ Loss( prompt_i + x_adv \to target_i ) x_adv where m = batch size (number of prompts optimized simultaneously) prompt_i = harmful prompt i (e.g., "how to make a bomb", "how to hack a server", ...) target_i = corresponding affirmative target response Result: a single suffix x*_adv that jailbreaks ALL m prompts

Real-world implication: A single universal suffix, once found, can be copy-pasted after any harmful prompt — no optimization required at inference time. This dramatically lowers the barrier to misuse: attackers do not need GPU access, just the pre-computed suffix string.

Zou et al. (2023) demonstrated universal suffixes that achieved >80% attack success rate across 25 diverse harmful behaviors using a suffix optimized on just a subset of those behaviors.

Defenses Against GCG

Current approaches and their limitations

Because GCG was published openly, considerable effort has gone into developing defenses. None are fully robust — this remains an active research area. Here are the main approaches:

📈 Perplexity Filtering

Adversarial suffixes are out-of-distribution: no natural text looks like !!!!describing.\+similarlyNow. A perplexity filter rejects inputs whose token sequence has implausibly high perplexity under a language model.

Effective against vanilla GCG Bypassed by fluent variants (AutoDAN)

🔁 SmoothLLM

Randomly perturb (add/delete/substitute) characters in the input, run the model on multiple perturbed copies, and take a majority vote. Adversarial suffixes are brittle — small perturbations destroy their effect. Clean prompts are robust.

Reduces ASR significantly Adds inference cost (N forward passes)

🔒 Input Certificates

Certify that a model's output is provably unaffected by any suffix of length ≤ l. Achieved via randomized smoothing over the suffix positions. Provides formal guarantees but at high computational cost.

Formal guarantee Very expensive; limited to short suffixes

🏳 Adversarial Training

Generate adversarial suffixes during fine-tuning and train the model to refuse even when they are present. Analogous to adversarial training for image classifiers. Tends to improve robustness without large accuracy loss.

Improves robustness Doesn't fully close the gap; adaptive attacks often succeed

🌎 Dual-LLM / CaMeL

Separate the privileged LLM (that executes actions) from the untrusted LLM (that processes external content). The privileged model never sees attacker-controlled text. Covered in Post 33 — CaMeL.

Architectural guarantee Limits agent capabilities; complex to implement

Arms race dynamic: Most defenses are defeated by adaptive variants. AutoDAN generates fluent adversarial suffixes that bypass perplexity filters. GCG++ adapts to SmoothLLM by optimizing suffixes that remain effective after perturbation. This mirrors the history of adversarial examples in computer vision — defense and attack co-evolve.