SkillOpt — Teaching AI Agents to Optimize Their Own Skills

Overview

›

Problem

›

Text-Space Opt

›

4-Step Loop

›

Stability

›

Landscape

›

Results

›

Playground

Post 53 Agents & Systems Microsoft · May 2026 Self-Evolving Agents

SkillOpt
Teaching AI Agents to Optimize Their Own Skills

Agent skills are typically hand-crafted, generated once, or loosely self-revised. SkillOpt changes this: it treats skill documents as external state to be optimized with the same rigor applied to model weights — bounded text-space edits, strict validation gates, and stability mechanisms borrowed from gradient descent. The result: consistent improvements across all 52 model × benchmark × harness combinations tested.

The Core Idea

If model weights can be optimized with gradient descent, why can't agent skills be optimized with text-space descent? SkillOpt introduces exactly this: a separate optimizer model proposes bounded add/delete/replace edits to a skill document, accepted only when validation strictly improves.

Why It Matters

Optimized skills transfer across models, harnesses, and related benchmarks — with zero inference overhead at deployment. A skill optimized on GPT-5.5 is a standalone markdown file any model can use. No fine-tuning. No extra API calls. All cost is in optimization, not serving.

The Problem with Static Skills

Why hand-crafted and one-shot skill generation keeps failing in production

An agent skill is a reusable document that tells an agent how to approach a class of tasks — which tools to use, in what order, how to handle edge cases, what mistakes to avoid. Think of it as a standing operating procedure written for an AI.

The problem: most skills are either hand-crafted by humans (expensive, brittle, doesn't scale), generated once by an LLM and frozen (never improves, drifts as tasks evolve), or subject to loose self-revision — agents rewriting their own skills with no convergence guarantee, no stability mechanism, and no way to tell if the edit helped or hurt.

The failure mode is the same in all three cases: skills are treated as static artifacts, not as something that can be systematically improved over time. The missing ingredient is a principled optimizer for text.

✗ Hand-Crafted Skills

Written by domain experts. Correct at creation. Brittle when tasks shift. Cannot incorporate failure signals from the agent's own execution history. Doesn't scale across diverse task domains.

✗ One-Shot Generation

LLM generates a skill from a prompt or a few examples. Fast. Zero improvement after creation. The skill that worked on task 1 may fail on task 47 — no feedback loop to fix it.

✗ Loose Self-Revision

Agent rewrites its own skill based on recent failures. No edit budget. No validation gate. An edit that helps on recent tasks may overwrite procedures that worked on older ones — catastrophic forgetting in text space.

What's needed is something analogous to what made weight-space learning so powerful: a principled optimizer with bounded updates, a validation criterion, and mechanisms to prevent unstable convergence. SkillOpt brings all three to the text domain.

Skills as External State — Text-Space Optimization

The analogy that unlocks systematic skill learning

The central insight of SkillOpt is a tight analogy between weight-space optimization and text-space optimization. In weight-space learning, a loss is computed on a training batch, gradients are propagated, weights are updated by a bounded step (learning rate), and the update is validated on held-out data. The process is systematic, convergent, and reproducible.

SkillOpt applies the same structure to skill documents. The "weights" are the words and procedures in the skill. The "gradient" is a natural-language reflection generated by an optimizer model that has read successful and failed rollouts. The "learning rate" is a textual budget bounding how much the skill can change per step. The "validation set" is a held-out task split that must show strict improvement before any edit is accepted.

Critically, the target model is frozen throughout. SkillOpt never touches the model. It only evolves the skill document — a separate, portable, model-agnostic artifact. This enables zero inference overhead at deployment and direct transferability across model families.

Concept	Weight-Space (Gradient Descent)	Text-Space (SkillOpt)
Trainable artifact	Model weights (floats)	Skill document (markdown text)
Training signal	Loss on training batch	Scored rollouts (successes + failures)
Update mechanism	Gradient × learning rate	Optimizer reflection → bounded edits
Update bounds	Learning rate clipping	Textual LR budget (token diff cap)
Validation gate	Held-out val loss	Held-out task score (strict improvement)
Slow / periodic updates	Slow weights / EMA	Epoch-wise meta updates
Momentum memory	Gradient momentum buffer	Rejected-edit buffer
Inference overhead	None (weights baked in)	None (skill is a static markdown file)

The Four-Step Optimization Loop

Rollout → Reflect → Edit → Gate — click each step to explore

1 · Rollout

2 · Reflect

3 · Edit

4 · Gate

Step 1 of 4

Rollout — Generate Scored Trajectories

The frozen target model executes a minibatch of tasks using the current version of the skill document. Each execution produces a trajectory: the sequence of tool calls, intermediate states, and final outputs. Every trajectory is scored by a task-specific evaluator — did the formula return the correct value? did the code patch pass the test suite?

Both successful trajectories and failed trajectories are collected and passed forward. Failures are not discarded — they are the primary learning signal. This mirrors supervised learning where negative examples are as important as positive ones.

Example (SWE-bench task):
Current skill: "Use grep to locate the relevant function before editing."
Rollout result: 7 successes, 3 failures — all failures involve multi-file changes where grep missed secondary definitions in imported modules.
Batch scores: [1, 1, 0, 1, 0, 1, 1, 0, 1, 1] → mean 0.70

Step 2 of 4

Reflect — Extract Reusable Lessons

A separate optimizer model — distinct from the frozen target — reads the minibatch of scored trajectories. It does not see the current skill yet. Its job: identify what consistently separates success from failure in this minibatch, then express that distinction as a reusable, procedural insight.

This reflection is the "gradient" in text space: it captures the direction the skill needs to move. The optimizer produces a natural-language description of the proposed change — not the change itself, but the reasoning that should motivate it.

Example (continued):
Reflection: "All 3 failures involved multi-file changes. The skill instructs using grep for location, but grep searches one file at a time. Failures could have been avoided by checking imports before starting edits. The skill should add: when the task involves class inheritance or imported modules, expand search to the full repo before editing."

Step 3 of 4

Edit — Apply Bounded Modifications

The optimizer model now sees both the current skill document and its reflection. It proposes concrete modifications: add new steps or rules, delete outdated or counterproductive instructions, or replace existing steps with improved versions.

All edits are constrained by a textual learning-rate budget — a ceiling on the total token count of changes per optimization step. This prevents the optimizer from rewriting the entire skill at once, which would destroy all accumulated knowledge. Small, targeted edits converge more reliably than large rewrites.

Example (continued):
+ Step 2b: If the task involves class inheritance or imported modules, run `grep -r <symbol> .` before editing. Check all files that import the affected class.

[No deletions this step. Budget used: 18 / 50 tokens — well within bounds.]

Step 4 of 4

Gate — Accept Only Strict Improvements

The proposed edited skill is evaluated on a held-out validation set — separate from the minibatch used in the rollout. The edit is accepted and replaces the current skill only if the validation score is strictly higher than the current skill's score.

If the edit fails the gate, it enters the rejected-edit buffer — a memory of directions the optimizer should avoid re-proposing. This is one of the three core stability mechanisms. The loop then repeats with the next minibatch, keeping the current skill unchanged.

Example (continued):
Current skill validation score: 0.70
Edited skill validation score: 0.81

Decision: ACCEPTED — edited skill replaces current skill. Loop repeats with new rollouts using the updated skill.

Three Stability Mechanisms

What stops text-space optimization from diverging — click each to explore

① Textual LR Budget

② Rejected-Edit Buffer

③ Slow / Meta Updates

Textual Learning-Rate Budget

In gradient descent, the learning rate caps how far weights move per step — preventing oscillation and divergence. SkillOpt introduces the analogous concept for text: a token-diff budget that caps the total size of edits (adds + deletes + replacements) per optimization step.

Without this budget, the optimizer would rewrite entire sections of the skill in one step — destroying all accumulated procedures from prior iterations. Large rewrites also make it impossible to isolate which specific change caused performance to improve or degrade, breaking the optimization signal.

Skill Document — Edit Under Budget Constraint (50 token budget)

# Skill: Code Review Agent ## Step 1: Identify changed files Run `git diff --name-only HEAD~1` to list modified files. ## Step 2: Check for tests Look for files matching *_test.py or test_*.py. Look for files matching *_test.py, test_*.py, or tests/ directory. If no tests exist, flag as high-priority review item. ## Step 3: Review diff Summarize changes per file, focusing on logic branches. ## Step 3b: Check edge cases For each changed function, verify: null inputs, empty collections, boundary values.

+ 28 tokens added − 7 tokens removed Budget used: 35 / 50 ✓

Rejected-Edit Buffer

When a proposed edit fails the validation gate, it isn't simply discarded — it is stored in a rejected-edit buffer, a memory of optimization directions to avoid. In subsequent steps, the optimizer model reads the buffer and is instructed not to re-propose semantically similar edits.

This mirrors gradient momentum in weight-space optimization: knowing which directions made the loss worse and downweighting them. Without this buffer, the optimizer tends to cycle — proposing the same failing edit repeatedly when similar failure patterns generate similar reflections.

Rejected-Edit Buffer — Avoiding Optimization Cycles

Step 3

Proposed: "Add a retry loop when tool call returns null"
Gate: REJECTED — val score 0.68 < current 0.71. Buffered.

Step 6

Proposed: "When tool returns empty response, retry up to 3 times"
Buffer match detected. Optimizer informed: retry-loop variants have already failed. Redirected.

Step 7

Proposed: "When tool returns null, log and skip — do not retry, move to fallback"
Gate: ACCEPTED — val score 0.78 > current 0.71.

Epoch-Wise Slow / Meta Updates

Within each optimization epoch, edits are small and targeted — controlled by the LR budget. At the end of each epoch, SkillOpt performs a meta-update: a larger, holistic review of the full skill document in light of all the epoch's accumulated learning.

This mirrors slow-weight updates or EMA in weight-space training — periodic consolidation that smooths noise and integrates incremental changes. The meta-update removes contradictions introduced by successive small edits and reorganizes the skill for clarity. It still must pass the strict validation gate before replacing the current skill.

Epoch Structure

Steps 1–N

Mini edits

(LR budget)

→

Epoch End

Meta-update

(holistic)

→

Gate

Strict improve

required

→

Next Epoch

Clean, consistent

skill document

The Skill Evolution Landscape

How agent skill learning evolved from 2023 to 2026 — click any system to expand

2023

Wang et al. · NeurIPS 2023

Voyager — Skills as Executable Code

First to demonstrate autonomous skill library construction in open-ended environments.

Mechanism: GPT-4 writes executable JavaScript programs for Minecraft tasks. Successful programs are stored in a vector DB and retrieved for future tasks. Skills compose — complex behaviors built from simpler ones via curriculum-guided exploration.

Key result: 3.3× more unique items collected, 15.3× faster tech-tree progression than prior SOTA. Skills transfer to novel Minecraft worlds.

Limitation: Skills are domain-specific executable code (JavaScript), not general natural-language procedures. No systematic skill refinement — skills are written once and stored. No validation gate or stability mechanism.

2024

Zhao et al. · AAAI 2024

ExpeL — Experience as Natural Language Insight

Trial-and-error learning without fine-tuning via accumulated experience pools.

Mechanism: Agents collect successful and failed trajectories in an experience pool. A synthesis step extracts abstract procedural insights in natural language — e.g. "always check return type before casting" — injected into future prompts. Works with closed API models (GPT-4, Claude) where weights are inaccessible.

Key result: Consistent improvement across tasks with no weight updates. Generalizes across task types through accumulated insight.

Limitation: Insights are injected at inference time — growing context overhead. No systematic edit gating or budget. No mechanism to remove outdated or contradictory insights.

2026

arXiv 2603 · 2026

Trace2Skill — Parallel Trajectory Distillation

Fleet of specialist sub-agents distills diverse trajectories into transferable skill patches.

Mechanism: Success-analyst and error-analyst sub-agents process trajectory batches in parallel. Proposals are conflict-free merged hierarchically into a unified skill directory. Prevents local overfitting through diversity of analysis perspectives.

Key result: Transfers across LLM scales, generalizes out-of-distribution. Outperforms Anthropic's official skills on spreadsheet, vision QA, and math reasoning tasks.

Limitation: Parallel fleet adds inference cost during optimization. No textual LR budget — edits can be large and destabilizing. No rejected-edit memory to prevent optimization cycles.

2026

arXiv 2603 · 2026

EvoSkill — Failure-Focused Multi-File Evolution

Iterative skill discovery from failed trajectories, generating structured multi-file skill packages.

Mechanism: Applies textual feedback descent specifically to failure cases. Proposes multiple skill and prompt mutations jointly. Generates multi-file, structured skill packages — not just a single document. Information-isolated surrogate verification for failure diagnostics.

Key result: Outperforms standard prompt optimization (GEPA, TextGrad) on agent tasks. Multi-file skill packages are more modular than single-document skills.

Limitation: No bounded textual learning rate. No rejected-edit memory. Less stable convergence than SkillOpt in ablation comparisons.

2026

Yang et al. · Microsoft Research · May 2026

SkillOpt — First Systematic Text-Space Optimizer ★

Principled text-space optimization with LR budgeting, rejected-edit buffer, and meta-updates.

Mechanism: Frozen target model + separate optimizer model. Four-step loop: rollout → reflect → bounded edit → strict validation gate. Three stability mechanisms prevent divergence. Skills deployed as standalone markdown with zero inference overhead.

Key result: Best or tied on all 52 model × benchmark × harness combinations. +23.5 pts GPT-5.5 direct chat, +24.8 Codex agentic loop, +19.1 Claude Code.

Advantage over predecessors: Only system with all three stability mechanisms. Only system with cross-model, cross-harness, and cross-benchmark transferability tested at scale. Zero inference overhead at deployment time.

Benchmark Results

52 model × benchmark × harness combinations — SkillOpt best or tied on all 52

Direct Chat Harness

GPT-5.5

+23.5

points improvement over no-skill baseline

Codex Agentic Loop

GPT-5.5

+24.8

points improvement over no-skill baseline

Claude Code Harness

GPT-5.5

+19.1

points improvement over no-skill baseline

The evaluation covers 6 benchmarks (OSWorld, SWE-bench, WebArena, GAIA, and domain-specific office task suites), 7 target models (GPT-5.5, Qwen 3.5, Qwen 3.6 variants, and others), and 3 execution harnesses (direct chat, Codex agentic loop, Claude Code). Across all 52 cells, SkillOpt achieves best or tied score.

Baselines tested: TextGrad (gradient-based text optimizer), GEPA (genetic-evolution prompt optimizer), EvoSkill (failure-focused skill evolution), Trace2Skill (parallel trajectory distillation). SkillOpt beats all four per-cell across all 52 combinations.

Relative Performance vs Baselines (illustrative, GPT-5.5 / SWE-bench)

SkillOpt

Trace2Skill

EvoSkill

GEPA

TextGrad

Key Properties Validated

Cross-Model Transfer

A skill optimized on GPT-5.5 transfers to Qwen 3.5 and Qwen 3.6 without re-optimization. SkillOpt finds model-agnostic procedural knowledge, not model-specific prompt hacks.

Cross-Harness Transfer

Skills transfer from direct chat to agentic loop harnesses (Codex, Claude Code) without modification. The skill encodes procedure, not harness-specific formatting.

Cross-Benchmark Transfer

Skills generalize to nearby benchmarks in the same domain without retraining. Optimized on task distribution A → improves performance on related distribution B.

Zero Inference Overhead

At deployment, the optimized skill is a standalone markdown file. No extra LLM calls, no runtime optimizer. All compute cost is in the optimization phase, not serving.

Live Playground

Run a real SkillOpt loop on Text-to-SQL tasks — Rollout → Optimize → Re-run — using your OpenAI API key (gpt-4o-mini, ~$0.001 per full run)

0 Setup — OpenAI API Key

Your key is stored only in your browser (localStorage) and sent directly to api.openai.com. It never touches any other server.

1 Rollout — Agent runs with initial skill

This is the starting skill document — it contains three deliberately wrong rules about SQL. Click Run Rollout to watch gpt-4o-mini answer 3 Text-to-SQL questions guided by this flawed skill.

Initial Skill Document

Rollout Results

Initial Score

—

The initial skill has three wrong rules: use WHERE for aggregates, prefer INNER JOIN, check for zeros/NULLs on joins. The optimizer will reflect on these failures and rewrite the skill with correct SQL rules.

2 Optimize — Reflect & Rewrite Skill

The optimizer LLM reads the rollout failures and rewrites the skill document — this is SkillOpt's Reflect → Edit step in action.

Skill Diff — What the Optimizer Changed

Green lines were added · Red lines were removed · Gray lines are unchanged

3 Re-run — Validate Optimized Skill

Run the same 3 tasks with the optimized skill. If the score improves, the Validation Gate accepts the new skill.

Re-run Results

Score Comparison — Gate Decision