DSPy (Declarative Self-improving Python) treats LLM pipeline development as a machine learning problem. Instead of writing brittle prompt strings by hand, you declare what you want — and a compiler figures out how to achieve it.
+65%
Improvement over standard few-shot (Llama2-13b)
+46%
Over expert-written demos (GPT-3.5-turbo)
40%
Token reduction while maintaining accuracy
35×
Fewer rollouts than GRPO for same performance
The Old Way: Prompting
Hand-craft prompt strings, iterate manually, break when you switch models, spend days tuning wording. A single word change can drop accuracy by 10%. There’s no principled way to improve.
The DSPy Way: Programming
Declare your pipeline in Python. Specify inputs, outputs, and metrics. Run the compiler. It automatically discovers the best prompts, instructions, and few-shot examples for your specific model and task.
Why It Works
Optimization is a solved problem in ML. DSPy applies that discipline to prompt engineering — turning an art form into a science. Programs become portable, versioned, and reproducible across model changes.
The Prompting Problem — Fragile, Expensive, and Unscalable
Traditional prompt engineering is riddled with fundamental flaws: high sensitivity to wording, model-specific brittleness, and an inability to scale. Click each failure mode to explore it.
What Goes Wrong
Swapping "Answer the following question:" for "Please answer this question:" can shift accuracy by 10-15 percentage points. Systems break silently. Teams spend weeks tuning prompts, only for a model update to reset their work.
What DSPy Does Instead
Compiled DSPy programs show minimal variation when prompts are paraphrased because the semantics are learned, not hand-coded. Switching from GPT-3.5 to Llama only requires recompiling — the program structure stays identical.
Switch the LLM backend and watch what happens. Manual prompts break silently. DSPy recompiles and adapts. This is the core portability argument made concrete.
A Signature is a natural-language typed function declaration. It says what a text transformation should accomplish without specifying how the LM should be prompted. Field names carry semantic meaning that guides compilation.
Key insight: The field names "question" and "answer" tell DSPy this is a QA task. It generates different prompts than "query" and "response" — because the compiler understands semantic intent.
Portability
The same signature works with GPT-4, Llama-3, and Claude. When you compile for a different model, DSPy generates model-appropriate prompts from the same signature specification.
Type Safety
Signatures support typed outputs: bool, int, float, list[str]. DSPy automatically parses and validates LM outputs into the declared types.
Reusability
One signature can serve multiple modules. A Predict and a ChainOfThought can share the same signature but produce different prompting strategies.
Type field names and descriptions below — watch DSPy generate the exact prompt template in real time. This is what the compiler sees before optimization begins.
Define Your Signature
Live Preview
Notice: ChainOfThought automatically adds a Reasoning field between your inputs and outputs. You never write this — DSPy injects it based on the module type.
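The injection can be pictured in plain Python. The sketch below is a toy illustration of the mechanism, not DSPy's actual template-rendering code; the function name and wording are assumptions:

```python
# Toy sketch: how a signature's field names become a prompt template.
# Not DSPy's real rendering code, just an illustration of the mechanism.

def render_template(inputs: list[str], outputs: list[str],
                    chain_of_thought: bool = False) -> str:
    """Build a field-by-field prompt template from a signature."""
    fields = list(inputs)
    if chain_of_thought:
        # ChainOfThought injects a reasoning field before the outputs.
        fields.append("reasoning")
    fields += outputs
    lines = [f"{name.capitalize()}: ${{{name}}}" for name in fields]
    return "\n".join(lines)

# "question -> answer" compiled as a ChainOfThought module:
template = render_template(["question"], ["answer"], chain_of_thought=True)
print(template)
# Question: ${question}
# Reasoning: ${reasoning}
# Answer: ${answer}
```

A plain Predict module over the same signature would render only the question and answer fields; the signature is shared, the strategy differs.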
DSPy modules are composable building blocks that abstract prompting techniques. Like PyTorch layers, you stack them into programs. Each has learnable parameters: instructions, demonstrations, and optionally LM weights.
Learnable Parameters
Every module stores three types of learnable parameters: (1) LM instructions — the task description prepended to prompts, (2) Demonstrations — few-shot examples automatically selected, (3) Optionally LM weights for fine-tuning.
Composition
Modules compose with standard Python. Loops, conditionals, and function calls all work naturally. A program with 20 LM calls is just a Python class with 20 module instances — no special framework syntax.
Optimization-aware
When you compile a program, each module’s parameters are optimized independently. The optimizer traces execution, identifies failures, and updates instructions + demonstrations for each module separately.
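The three kinds of learnable state can be pictured as a plain Python object. This is a hypothetical sketch of what a module carries, not DSPy's internal class layout:

```python
# Sketch of a module's learnable state. Hypothetical structure;
# DSPy's real Predict/ChainOfThought modules store equivalents internally.

class SketchModule:
    def __init__(self, signature: str):
        self.signature = signature        # fixed: declared by the programmer
        self.instructions = ""            # learned: task description text
        self.demos: list[dict] = []       # learned: few-shot examples
        self.finetuned_weights = None     # optional: only weight-level optimizers touch this

# The optimizer updates each module's parameters independently:
qa = SketchModule("question -> answer")
qa.instructions = "Answer the question concisely and factually."
qa.demos.append({"question": "Capital of France?", "answer": "Paris"})
```

A 20-module program is just 20 such parameter bundles; compilation traces execution and updates each bundle separately.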
A DSPy program is a computation graph in which nodes are modules and edges are data flow. Programs are expressed in ordinary Python: control flow, loops, and branches all define the graph structure at runtime.
Click any node to inspect inputs/outputs
Select a node
Click any node in the diagram to see what that module does in the pipeline.
Compilation Pipeline — From Declaration to Optimized Program
Compilation is a one-time process that transforms a high-level DSPy program into an optimized version with learned instructions and few-shot demonstrations. The compiled program is just a Python object — no special runtime needed.
Press "Step Through" to walk through the DSPy compilation process step by step.
What Gets Optimized
Instructions (the task description in the prompt), Demonstrations (the few-shot examples), and optionally LM weights (for BootstrapFinetune). The program structure and Python logic remain unchanged.
Compile Once, Run Forever
Compilation happens during development, not at inference time. The compiled program runs as a regular Python object. No optimization overhead at serving time — just the learned prompts embedded in the modules.
Version Control
Compiled programs are serializable Python objects. You can save them as JSON, version them in Git, and redeploy without recompilation. Rollback is just loading a previous version.
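The save/load round trip can be sketched with the standard library. The state shape here is illustrative, assuming a simple instructions-plus-demos dictionary; DSPy's real save format differs in detail:

```python
import json

# Illustrative state of one compiled module: learned instructions + demos.
# The schema is an assumption, not DSPy's exact JSON layout.
compiled_state = {
    "generate_answer": {
        "instructions": "Answer the question using the given context.",
        "demos": [
            {"question": "Capital of France?", "answer": "Paris"},
        ],
    }
}

# Save alongside your code, commit to Git...
blob = json.dumps(compiled_state, indent=2)

# ...and redeploy (or roll back) by loading a previous version.
restored = json.loads(blob)
assert restored == compiled_state
```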
Watch a real BootstrapFewShot compilation unfold: trainset examples pass through the program, successful traces are selected as demonstrations, instructions are refined, and accuracy climbs.
The compilation trace shows how BootstrapFewShot selects training examples, filters for successful traces, and builds few-shot demonstrations — all automatically.
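The selection loop can be sketched in a few lines of plain Python. The program and metric below are stand-ins, not DSPy APIs; the real BootstrapFewShot traces full multi-module programs:

```python
# Toy bootstrap: run the uncompiled program over the trainset, keep only
# traces the metric accepts, and use the first k successes as demos.

def toy_program(question: str) -> str:
    # Stand-in for a real LM call.
    answers = {"2+2?": "4", "Capital of France?": "Paris", "3*3?": "8"}
    return answers.get(question, "unknown")

def exact_match(predicted: str, gold: str) -> bool:
    return predicted == gold

def bootstrap_demos(trainset, program, metric, k=2):
    demos = []
    for question, gold in trainset:
        predicted = program(question)
        if metric(predicted, gold):      # keep only successful traces
            demos.append({"question": question, "answer": predicted})
        if len(demos) == k:              # stop once k demos are collected
            break
    return demos

trainset = [("2+2?", "4"), ("3*3?", "9"), ("Capital of France?", "Paris")]
demos = bootstrap_demos(trainset, toy_program, exact_match)
# the failed "3*3?" trace is filtered out; the two successes become demos
```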
DSPy provides a suite of optimizers (also called teleprompters) with different cost-performance tradeoffs. The right optimizer depends on your dataset size, budget, and quality requirements.
Toggle which context components to include in a prompt and watch the token count grow, then compare how DSPy's compact compiled signatures reduce context overhead versus manually assembled prompts.
Assertions & Constraints — Teaching LMs to Self-Correct
DSPy Assertions are computational constraints that LMs must satisfy. When a constraint fails, DSPy backtracks — injecting the failure reason into the prompt so the model can self-correct. This enables principled self-refinement without manual retry logic.
164%
More constraints passed with Assertions
37%
Higher quality responses
16.7%
Citation faithfulness improvement
dspy.Assert (Hard)
Halts execution if the constraint is still violated after max_backtracking_attempts retries. Raises AssertionError. Use during development to catch logical failures early. Triggers backtracking on each failure.
dspy.Suggest (Soft)
Same backtracking mechanism, but execution continues if the constraint is still violated after the final retry; the failure is logged instead. Use in production for graceful degradation: best-effort constraint satisfaction with monitoring.
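The backtracking loop can be sketched without DSPy. On each failed check, the failure reason is appended to the prompt so the next attempt sees it; all names here are illustrative stand-ins:

```python
# Toy version of assertion-driven backtracking; the real mechanism
# lives inside dspy.Assert / dspy.Suggest.

def generate(prompt: str) -> str:
    # Stand-in for an LM call: it only produces a citation once it has
    # been told why the previous answer failed.
    return "Paris [1]" if "failed" in prompt else "Paris"

def has_citation(answer: str) -> bool:
    return "[" in answer and "]" in answer

def assert_with_backtracking(prompt, constraint, message, max_attempts=2):
    for _ in range(max_attempts + 1):
        answer = generate(prompt)
        if constraint(answer):
            return answer
        # Inject the failure reason so the model can self-correct.
        prompt += f"\nPrevious attempt failed: {message}"
    # Hard assert halts here; a soft Suggest would log and continue.
    raise AssertionError(message)

answer = assert_with_backtracking(
    "Answer with a citation: capital of France?",
    has_citation,
    "answer must include a [n] citation",
)
# succeeds on the second attempt, after one backtrack
```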
Run a constrained DSPy program and watch it fail, backtrack, and self-correct. The key: each retry injects the failure reason into the prompt — the model learns from its own mistakes within a single call.
Every optimizer you've seen so far (BootstrapFewShot, MIPROv2, GEPA) only changes what is written in the prompt — the model weights stay frozen. DSPy also ships BootstrapFinetune: an optimizer that takes the same bootstrapped traces and uses them as supervised fine-tuning data, actually updating the model's weights.
2
Optimization levels: Prompt & Weights
10×
Cheaper inference after fine-tuning
0
Few-shot prompt tokens needed at runtime (fine-tuned model)
1
Unified DSPy program: same code, both paths
The key insight: Both prompt optimization and weight optimization start identically — bootstrapping successful traces. The difference is what you do with those traces: stuff them into the prompt context (few-shot), or use them as fine-tuning examples to update the model's weights. Same traces, different destinations.
Two Paths, One Program
Both paths start from the same DSPy program and the same bootstrapped traces. Select a tab to explore each approach.
What Actually Changes
In prompt optimization, the weights (billions of float32 parameters inside the transformer) stay completely frozen. Only the text prepended to your query changes. In weight optimization, gradient descent runs on those float32 values — the model literally learns new behaviour at the matrix multiplication level.
The Distillation Angle
BootstrapFinetune is essentially LLM distillation. A large teacher model (e.g., GPT-4) generates high-quality traces via bootstrapping. Those traces are used to fine-tune a small student model (e.g., Llama-3-8b). The student internalizes the teacher's behaviour — without needing the teacher at inference time.
DSPy Unifies Both
The same DSPy program code runs both paths. Switching from prompt optimization to weight optimization is one line: replace BootstrapFewShot(...) with BootstrapFinetune(...). The compiled program structure, signatures, and modules are identical — only the optimizer changes.
BootstrapFinetune in code
# Same program as always
class RAGProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought(
            "context, question -> answer"
        )

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Switch optimizer to weight-level
optimizer = dspy.BootstrapFinetune(
    metric=exact_match,
    num_threads=4,
    teacher_settings=dict(lm=gpt4),    # teacher
    student_settings=dict(lm=llama8b)  # student
)
compiled = optimizer.compile(
    RAGProgram(), trainset=trainset
)
# compiled now uses fine-tuned llama8b weights
# no few-shot context needed at inference
What the fine-tuning data looks like
After bootstrapping, each successful trace becomes one SFT training example:
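A plausible shape for one such example, assuming a prompt/completion fine-tuning format; field names are illustrative and the exact schema depends on the fine-tuning backend:

```python
# One bootstrapped trace, flattened into a supervised fine-tuning pair.
# Illustrative shape only; real formats vary by provider.
trace = {
    "context": "Paris is the capital and largest city of France.",
    "question": "What is the capital of France?",
    "reasoning": "The context states Paris is the capital of France.",
    "answer": "Paris",
}

sft_example = {
    "prompt": (
        f"Context: {trace['context']}\n"
        f"Question: {trace['question']}\nAnswer:"
    ),
    # The completion keeps the reasoning so the student model internalizes
    # chain-of-thought behaviour, not just the final answer.
    "completion": f" {trace['reasoning']} So the answer is {trace['answer']}.",
}
```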
DSPy treats retrieval as just another module in the computation graph. dspy.Retrieve() is a learnable component whose query formulation gets optimized alongside the generation prompts. Result: retrieval + generation quality improve together.
Retrieve Module
dspy.Retrieve(k=3) queries any connected vector database (Qdrant, Weaviate, Chroma, Pinecone). The query itself is a learnable parameter — compilation optimizes how to phrase retrieval queries for your specific corpus.
Joint Optimization
Unlike LangChain RAG where you tune prompts manually, DSPy optimizes both retrieval queries and generation prompts together. SemanticF1 improved from 42% → 61% in real benchmarks through MIPROv2 optimization.
Multi-hop Retrieval
Complex questions require chained retrieval. The answer to step 1 informs the query for step 2. DSPy’s module composition makes this natural Python — a loop with Retrieve + Predict creates a multi-hop reasoning chain.
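The loop itself is ordinary Python. A sketch with stub components (the real version would use dspy.Retrieve for search and a learned module for query generation):

```python
# Toy two-hop retrieval: the passage found in hop 1 shapes the query for hop 2.
# `search` and `next_query` are stand-ins, not DSPy APIs.

CORPUS = {
    "Who wrote the book the film Blade Runner is based on?": "Philip K. Dick",
    "Philip K. Dick": "Philip K. Dick wrote Do Androids Dream of Electric Sheep?",
}

def search(query: str) -> str:
    # Stand-in for a vector-database lookup.
    return CORPUS.get(query, "")

def next_query(question: str, passages: list[str]) -> str:
    # Stand-in for a learned query-generation module: reuse the last finding.
    return passages[-1] if passages else question

def multi_hop(question: str, hops: int = 2) -> list[str]:
    passages, query = [], question
    for _ in range(hops):
        passages.append(search(query))          # retrieve for this hop
        query = next_query(question, passages)  # hop 2 query depends on hop 1
    return passages

hops = multi_hop("Who wrote the book the film Blade Runner is based on?")
# hop 1 finds the author; hop 2 retrieves a passage about the author
```

In real DSPy, compilation would optimize the query-generation step jointly with the final answer prompt, since both are learnable modules in the same graph.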
The original DSPy paper (Khattab et al., 2023) benchmarked on HotPotQA, FEVER, GSM8K, and more. More recent work shows even larger gains with advanced optimizers like GEPA and MIPROv2.
HotPotQA
Multi-hop question answering over Wikipedia. DSPy BootstrapFewShot: 71% vs 68% baseline (+3 points). The gain comes entirely from learned demonstrations — no architecture changes.
FEVER
Fact verification (3-way: supports/refutes/not enough info). DSPy: 91% vs 85% baseline (+6 points). Largest absolute gains on tasks with complex multi-step reasoning requirements.
MATH (GEPA 2025)
67% (unoptimized CoT) → 93% (GEPA-optimized) — a 26-point improvement. GEPA also outperforms GRPO by 20% with 35× fewer rollouts, showing optimization efficiency matters as much as capability.
Eight tasks from the DSPy paper and follow-up research. Hover any point to see task details, baseline accuracy, and DSPy improvement. Use the filters to explore by task type or optimizer.
DSPy is not the right tool for every job. Understanding its tradeoffs vs traditional prompting and LangChain helps you choose the right approach for your use case.
Choose DSPy When
Complex multi-step reasoning (3+ LM calls), you have evaluation metrics, prompt optimization consumes development time, model migrations happen, and you need reproducible production systems.
Choose LangChain When
Multiple data source integrations needed, rapid prototype-to-demo, team familiar with the framework, diverse agent workflows, and established community support matters for your project.
Choose Manual Prompting When
One-off POC or demo, minimal LM calls (1-2 per request), fastest possible iteration, or evaluation data doesn’t exist yet. DSPy’s value increases with system complexity and production longevity.