DSPy (Declarative Self-improving Python) treats LLM pipeline development as a machine learning problem. Instead of writing brittle prompt strings by hand, you declare what you want — and a compiler figures out how to achieve it.
+65%
Improvement over standard few-shot (Llama2-13b)
+46%
Over expert-written demos (GPT-3.5-turbo)
40%
Token reduction while maintaining accuracy
35×
Fewer rollouts than GRPO for same performance
The Old Way: Prompting
Hand-craft prompt strings, iterate manually, break when you switch models, spend days tuning wording. A single word change can drop accuracy by 10%. There’s no principled way to improve.
The DSPy Way: Programming
Declare your pipeline in Python. Specify inputs, outputs, and metrics. Run the compiler. It automatically discovers the best prompts, instructions, and few-shot examples for your specific model and task.
Why It Works
Optimization is a solved problem in ML. DSPy applies that discipline to prompt engineering — turning an art form into a science. Programs become portable, versioned, and reproducible across model changes.
The Prompting Problem — Fragile, Expensive, and Unscalable
Traditional prompt engineering is riddled with fundamental flaws: high sensitivity to wording, model-specific brittleness, and an inability to scale. Click each failure mode to explore it.
What Goes Wrong
Swapping "Answer the following question:" for "Please answer this question:" can shift accuracy by 10-15 percentage points. Systems break silently. Teams spend weeks tuning prompts, only for a model update to reset their work.
What DSPy Does Instead
Compiled DSPy programs show minimal variation when prompts are paraphrased because the semantics are learned, not hand-coded. Switching from GPT-3.5 to Llama only requires recompiling — the program structure stays identical.
Switch the LLM backend and watch what happens. Manual prompts break silently. DSPy recompiles and adapts. This is the core portability argument made concrete.
A Signature is a natural-language typed function declaration. It says what a text transformation should accomplish without specifying how the LM should be prompted. Field names carry semantic meaning that guides compilation.
Key insight: The field names "question" and "answer" tell DSPy this is a QA task. It generates different prompts than "query" and "response" — because the compiler understands semantic intent.
Portability
The same signature works with GPT-4, Llama-3, and Claude. When you compile for a different model, DSPy generates model-appropriate prompts from the same signature specification.
Type Safety
Signatures support typed outputs: bool, int, float, list[str]. DSPy automatically parses and validates LM outputs into the declared types.
Reusability
One signature can serve multiple modules. A Predict and a ChainOfThought can share the same signature but produce different prompting strategies.
Type field names and descriptions below — watch DSPy generate the exact prompt template in real time. This is what the compiler sees before optimization begins.
Define Your Signature
Live Preview
Notice: ChainOfThought automatically adds a Reasoning field between your inputs and outputs. You never write this — DSPy injects it based on the module type.
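The injection can be pictured in plain Python. The sketch below is a toy illustration of the mechanism, not DSPy's actual template-rendering code; the function name and wording are assumptions:

```python
# Toy sketch: how a signature's field names become a prompt template.
# Not DSPy's real rendering code, just an illustration of the mechanism.

def render_template(inputs: list[str], outputs: list[str],
                    chain_of_thought: bool = False) -> str:
    """Build a field-by-field prompt template from a signature."""
    fields = list(inputs)
    if chain_of_thought:
        # ChainOfThought injects a reasoning field before the outputs.
        fields.append("reasoning")
    fields += outputs
    lines = [f"{name.capitalize()}: ${{{name}}}" for name in fields]
    return "\n".join(lines)

# "question -> answer" compiled as a ChainOfThought module:
template = render_template(["question"], ["answer"], chain_of_thought=True)
print(template)
# Question: ${question}
# Reasoning: ${reasoning}
# Answer: ${answer}
```

A plain Predict module over the same signature would render only the question and answer fields; the signature is shared, the strategy differs.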
DSPy modules are composable building blocks that abstract prompting techniques. Like PyTorch layers, you stack them into programs. Each has learnable parameters: instructions, demonstrations, and optionally LM weights.
Learnable Parameters
Every module stores three types of learnable parameters: (1) LM instructions — the task description prepended to prompts, (2) Demonstrations — few-shot examples automatically selected, (3) Optionally LM weights for fine-tuning.
Composition
Modules compose with standard Python. Loops, conditionals, and function calls all work naturally. A program with 20 LM calls is just a Python class with 20 module instances — no special framework syntax.
Optimization-aware
When you compile a program, each module’s parameters are optimized independently. The optimizer traces execution, identifies failures, and updates instructions + demonstrations for each module separately.
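The three kinds of learnable state can be pictured as a plain Python object. This is a hypothetical sketch of what a module carries, not DSPy's internal class layout:

```python
# Sketch of a module's learnable state. Hypothetical structure;
# DSPy's real Predict/ChainOfThought modules store equivalents internally.

class SketchModule:
    def __init__(self, signature: str):
        self.signature = signature        # fixed: declared by the programmer
        self.instructions = ""            # learned: task description text
        self.demos: list[dict] = []       # learned: few-shot examples
        self.finetuned_weights = None     # optional: only weight-level optimizers touch this

# The optimizer updates each module's parameters independently:
qa = SketchModule("question -> answer")
qa.instructions = "Answer the question concisely and factually."
qa.demos.append({"question": "Capital of France?", "answer": "Paris"})
```

A 20-module program is just 20 such parameter bundles; compilation traces execution and updates each bundle separately.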
A DSPy program is a computation graph in which nodes are modules and edges are data flow. Programs are expressed in ordinary Python: control flow, loops, and branches all define the graph structure at runtime.
Click any node to inspect inputs/outputs
Select a node
Click any node in the diagram to see what that module does in the pipeline.
Compilation Pipeline — From Declaration to Optimized Program
Compilation is a one-time process that transforms a high-level DSPy program into an optimized version with learned instructions and few-shot demonstrations. The compiled program is just a Python object — no special runtime needed.
Press "Step Through" to walk through the DSPy compilation process step by step.
What Gets Optimized
Instructions (the task description in the prompt), Demonstrations (the few-shot examples), and optionally LM weights (for BootstrapFinetune). The program structure and Python logic remain unchanged.
Compile Once, Run Forever
Compilation happens during development, not at inference time. The compiled program runs as a regular Python object. No optimization overhead at serving time — just the learned prompts embedded in the modules.
Version Control
Compiled programs are serializable Python objects. You can save them as JSON, version them in Git, and redeploy without recompilation. Rollback is just loading a previous version.
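The save/load round trip can be sketched with the standard library. The state shape here is illustrative, assuming a simple instructions-plus-demos dictionary; DSPy's real save format differs in detail:

```python
import json

# Illustrative state of one compiled module: learned instructions + demos.
# The schema is an assumption, not DSPy's exact JSON layout.
compiled_state = {
    "generate_answer": {
        "instructions": "Answer the question using the given context.",
        "demos": [
            {"question": "Capital of France?", "answer": "Paris"},
        ],
    }
}

# Save alongside your code, commit to Git...
blob = json.dumps(compiled_state, indent=2)

# ...and redeploy (or roll back) by loading a previous version.
restored = json.loads(blob)
assert restored == compiled_state
```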
Watch a real BootstrapFewShot compilation unfold: trainset examples pass through the program, successful traces are selected as demonstrations, instructions are refined, and accuracy climbs.
The compilation trace shows how BootstrapFewShot selects training examples, filters for successful traces, and builds few-shot demonstrations — all automatically.
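The selection loop can be sketched in a few lines of plain Python. The program and metric below are stand-ins, not DSPy APIs; the real BootstrapFewShot traces full multi-module programs:

```python
# Toy bootstrap: run the uncompiled program over the trainset, keep only
# traces the metric accepts, and use the first k successes as demos.

def toy_program(question: str) -> str:
    # Stand-in for a real LM call.
    answers = {"2+2?": "4", "Capital of France?": "Paris", "3*3?": "8"}
    return answers.get(question, "unknown")

def exact_match(predicted: str, gold: str) -> bool:
    return predicted == gold

def bootstrap_demos(trainset, program, metric, k=2):
    demos = []
    for question, gold in trainset:
        predicted = program(question)
        if metric(predicted, gold):      # keep only successful traces
            demos.append({"question": question, "answer": predicted})
        if len(demos) == k:              # stop once k demos are collected
            break
    return demos

trainset = [("2+2?", "4"), ("3*3?", "9"), ("Capital of France?", "Paris")]
demos = bootstrap_demos(trainset, toy_program, exact_match)
# the failed "3*3?" trace is filtered out; the two successes become demos
```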
DSPy provides a suite of optimizers (also called teleprompters) with different cost-performance tradeoffs. The right optimizer depends on your dataset size, budget, and quality requirements.
Toggle which context components to include in a prompt and watch the token count grow, then compare how DSPy's compact compiled signatures reduce context overhead versus manually assembled prompts.
Assertions & Constraints — Teaching LMs to Self-Correct
DSPy Assertions are computational constraints that LMs must satisfy. When a constraint fails, DSPy backtracks — injecting the failure reason into the prompt so the model can self-correct. This enables principled self-refinement without manual retry logic.
164%
More constraints passed with Assertions
37%
Higher quality responses
16.7%
Citation faithfulness improvement
dspy.Assert (Hard)
Halts execution if the constraint is still violated after max_backtracking_attempts retries. Raises AssertionError. Use during development to catch logical failures early. Triggers backtracking on each failure.
dspy.Suggest (Soft)
Same backtracking mechanism, but execution continues if the constraint is still violated after the final retry; the failure is logged instead. Use in production for graceful degradation: best-effort constraint satisfaction with monitoring.
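The backtracking loop can be sketched without DSPy. On each failed check, the failure reason is appended to the prompt so the next attempt sees it; all names here are illustrative stand-ins:

```python
# Toy version of assertion-driven backtracking; the real mechanism
# lives inside dspy.Assert / dspy.Suggest.

def generate(prompt: str) -> str:
    # Stand-in for an LM call: it only produces a citation once it has
    # been told why the previous answer failed.
    return "Paris [1]" if "failed" in prompt else "Paris"

def has_citation(answer: str) -> bool:
    return "[" in answer and "]" in answer

def assert_with_backtracking(prompt, constraint, message, max_attempts=2):
    for _ in range(max_attempts + 1):
        answer = generate(prompt)
        if constraint(answer):
            return answer
        # Inject the failure reason so the model can self-correct.
        prompt += f"\nPrevious attempt failed: {message}"
    # Hard assert halts here; a soft Suggest would log and continue.
    raise AssertionError(message)

answer = assert_with_backtracking(
    "Answer with a citation: capital of France?",
    has_citation,
    "answer must include a [n] citation",
)
# succeeds on the second attempt, after one backtrack
```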
Run a constrained DSPy program and watch it fail, backtrack, and self-correct. The key: each retry injects the failure reason into the prompt — the model learns from its own mistakes within a single call.
Every optimizer you've seen so far (BootstrapFewShot, MIPROv2, GEPA) only changes what is written in the prompt — the model weights stay frozen. DSPy also ships BootstrapFinetune: an optimizer that takes the same bootstrapped traces and uses them as supervised fine-tuning data, actually updating the model's weights.
2
Optimization levels: Prompt & Weights
10×
Cheaper inference after fine-tuning
0
Few-shot prompt tokens needed at runtime (fine-tuned model)
1
Unified DSPy program: same code, both paths
The key insight: Both prompt optimization and weight optimization start identically — bootstrapping successful traces. The difference is what you do with those traces: stuff them into the prompt context (few-shot), or use them as fine-tuning examples to update the model's weights. Same traces, different destinations.
Two Paths, One Program
Both paths start from the same DSPy program and the same bootstrapped traces. Select a tab to explore each approach.
What Actually Changes
In prompt optimization, the weights (billions of float32 parameters inside the transformer) stay completely frozen. Only the text prepended to your query changes. In weight optimization, gradient descent runs on those float32 values — the model literally learns new behaviour at the matrix multiplication level.
The Distillation Angle
BootstrapFinetune is essentially LLM distillation. A large teacher model (e.g., GPT-4) generates high-quality traces via bootstrapping. Those traces are used to fine-tune a small student model (e.g., Llama-3-8b). The student internalizes the teacher's behaviour — without needing the teacher at inference time.
DSPy Unifies Both
The same DSPy program code runs both paths. Switching from prompt optimization to weight optimization is one line: replace BootstrapFewShot(...) with BootstrapFinetune(...). The compiled program structure, signatures, and modules are identical — only the optimizer changes.
BootstrapFinetune in code
# Same program as always
class RAGProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought(
            "context, question -> answer"
        )

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Switch optimizer to weight-level
optimizer = dspy.BootstrapFinetune(
    metric=exact_match,
    num_threads=4,
    teacher_settings=dict(lm=gpt4),    # teacher
    student_settings=dict(lm=llama8b)  # student
)
compiled = optimizer.compile(
    RAGProgram(), trainset=trainset
)
# compiled now uses fine-tuned llama8b weights
# no few-shot context needed at inference
What the fine-tuning data looks like
After bootstrapping, each successful trace becomes one SFT training example:
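A plausible shape for one such example, assuming a prompt/completion fine-tuning format; field names are illustrative and the exact schema depends on the fine-tuning backend:

```python
# One bootstrapped trace, flattened into a supervised fine-tuning pair.
# Illustrative shape only; real formats vary by provider.
trace = {
    "context": "Paris is the capital and largest city of France.",
    "question": "What is the capital of France?",
    "reasoning": "The context states Paris is the capital of France.",
    "answer": "Paris",
}

sft_example = {
    "prompt": (
        f"Context: {trace['context']}\n"
        f"Question: {trace['question']}\nAnswer:"
    ),
    # The completion keeps the reasoning so the student model internalizes
    # chain-of-thought behaviour, not just the final answer.
    "completion": f" {trace['reasoning']} So the answer is {trace['answer']}.",
}
```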
DSPy treats retrieval as just another module in the computation graph. dspy.Retrieve() is a learnable component whose query formulation gets optimized alongside the generation prompts. Result: retrieval + generation quality improve together.
Retrieve Module
dspy.Retrieve(k=3) queries any connected vector database (Qdrant, Weaviate, Chroma, Pinecone). The query itself is a learnable parameter — compilation optimizes how to phrase retrieval queries for your specific corpus.
Joint Optimization
Unlike LangChain RAG where you tune prompts manually, DSPy optimizes both retrieval queries and generation prompts together. SemanticF1 improved from 42% → 61% in real benchmarks through MIPROv2 optimization.
Multi-hop Retrieval
Complex questions require chained retrieval. The answer to step 1 informs the query for step 2. DSPy’s module composition makes this natural Python — a loop with Retrieve + Predict creates a multi-hop reasoning chain.
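The loop itself is ordinary Python. A sketch with stub components (the real version would use dspy.Retrieve for search and a learned module for query generation):

```python
# Toy two-hop retrieval: the passage found in hop 1 shapes the query for hop 2.
# `search` and `next_query` are stand-ins, not DSPy APIs.

CORPUS = {
    "Who wrote the book the film Blade Runner is based on?": "Philip K. Dick",
    "Philip K. Dick": "Philip K. Dick wrote Do Androids Dream of Electric Sheep?",
}

def search(query: str) -> str:
    # Stand-in for a vector-database lookup.
    return CORPUS.get(query, "")

def next_query(question: str, passages: list[str]) -> str:
    # Stand-in for a learned query-generation module: reuse the last finding.
    return passages[-1] if passages else question

def multi_hop(question: str, hops: int = 2) -> list[str]:
    passages, query = [], question
    for _ in range(hops):
        passages.append(search(query))          # retrieve for this hop
        query = next_query(question, passages)  # hop 2 query depends on hop 1
    return passages

hops = multi_hop("Who wrote the book the film Blade Runner is based on?")
# hop 1 finds the author; hop 2 retrieves a passage about the author
```

In real DSPy, compilation would optimize the query-generation step jointly with the final answer prompt, since both are learnable modules in the same graph.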
The original DSPy paper (Khattab et al., 2023) benchmarked on HotPotQA, FEVER, GSM8K, and more. More recent work shows even larger gains with advanced optimizers like GEPA and MIPROv2.
HotPotQA
Multi-hop question answering over Wikipedia. DSPy BootstrapFewShot: 71% vs 68% baseline (+3 points). The gain comes entirely from learned demonstrations — no architecture changes.
FEVER
Fact verification (3-way: supports/refutes/not enough info). DSPy: 91% vs 85% baseline (+6 points). Largest absolute gains on tasks with complex multi-step reasoning requirements.
MATH (GEPA 2025)
67% (unoptimized CoT) → 93% (GEPA-optimized) — a 26-point improvement. GEPA also outperforms GRPO by 20% with 35× fewer rollouts, showing optimization efficiency matters as much as capability.
Eight tasks from the DSPy paper and follow-up research. Hover any point to see task details, baseline accuracy, and DSPy improvement. Use the filters to explore by task type or optimizer.
DSPy is not the right tool for every job. Understanding its tradeoffs vs traditional prompting and LangChain helps you choose the right approach for your use case.
Choose DSPy When
Complex multi-step reasoning (3+ LM calls), you have evaluation metrics, prompt optimization consumes development time, model migrations happen, and you need reproducible production systems.
Choose LangChain When
Multiple data source integrations needed, rapid prototype-to-demo, team familiar with the framework, diverse agent workflows, and established community support matters for your project.
Choose Manual Prompting When
One-off POC or demo, minimal LM calls (1-2 per request), fastest possible iteration, or evaluation data doesn’t exist yet. DSPy’s value increases with system complexity and production longevity.