🔬
Visual Summary
Post 54 · Agents & Systems · Advanced
Post 54 Agents & Systems Microsoft Research · May 2026 Skill Lifecycle
SkillLens
The Full Lifecycle Study of Agent Skills
Agents repeat the same mistakes because they never retain what they learned. Skills are supposed to fix that — but nobody had studied the entire pipeline: how skills are extracted from raw experience, what makes them reusable, how to consume them effectively, and when they actually hurt. SkillLens is the first systematic answer, spanning 5 domains, 3 extraction families, and 3 consumption strategies across state-of-the-art models.
5 Task Domains · 3 Extraction Methods · 3 Consumption Strategies · 70% Optimal Quality Threshold · 3–10 Optimal Skill Count
What SkillLens Studied
The complete agent skill lifecycle — from raw task trajectories through skill extraction to downstream agent consumption — treated as a unified, measurable system for the first time.
The Surprising Finding
Model scale does not predict skill quality. Larger models do not reliably produce better skills or consume them more effectively — generation method matters more than model size.
The Problem
Why agent skills remain poorly understood despite years of research
No Persistent Learning
Standard LLM agents reset at the start of each task. Even after solving a complex problem, the agent retains nothing — the next similar task starts from scratch.
Fragmented Prior Work
Existing research studied only slices of the pipeline: skill extraction in isolation, or skill consumption in isolation. Nobody measured the full lifecycle end-to-end.
Negative Transfer is Real
Low-quality or mismatched skills don't just fail to help — they actively degrade agent performance by 5–10%. Without systematic study, there was no way to predict when skills would hurt.
The Dual-Perspective Gap
Prior evaluations conflated two different questions:
Supply side: Is this skill internally well-formed? (extraction quality)
Demand side: Does this skill actually help an agent complete tasks? (practical utility)

A skill can score high on supply-side metrics yet provide zero, or even negative, demand-side value. SkillLens is the first framework to measure both simultaneously, revealing that the two are correlated but not interchangeable (R² ≈ 0.78).
The SkillLens Framework
Three stages, measured end-to-end for the first time
Stage 1 — Experience Generation
Stage 2 — Skill Extraction
Stage 3 — Skill Consumption
Raw Experience Generation
Agents execute tasks and produce trajectories — step-by-step records of observations, reasoning, actions, and outcomes. These are the raw material for skill learning.
What goes in
Task instructions + environment observations
What comes out
Execution traces: action sequences, tool calls, intermediate results, final outcomes (success/failure)
Quality signal
Trajectory quality directly caps skill quality — garbage in, garbage out. Only 55–70% of generated trajectories produce reusable skills.
Example Trajectory Fragment
Task: Search for quarterly revenue data
       and create a summary chart

Step 1: navigate_to("finance-dashboard")
  → Observation: loaded, 4 datasets visible

Step 2: query("Q3 revenue by region")
  → Observation: returned 847 rows, CSV

Step 3: filter(region="APAC", year=2025)
  → Observation: 212 rows remaining

Step 4: create_chart(type="bar", x="month",
         y="revenue")
  → Observation: chart rendered ✓

Outcome: SUCCESS (task completed in 4 steps)
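A trajectory like the one above can be held in a small record type. The following is a minimal sketch, not the representation SkillLens uses: the class names `Step` and `Trajectory` and their fields are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str        # the call the agent issued, kept as text
    observation: str   # what the environment returned


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)
    success: bool = False

    def add(self, action: str, observation: str) -> None:
        """Append one action/observation pair to the trace."""
        self.steps.append(Step(action, observation))


# Rebuild the example fragment above as a Trajectory.
traj = Trajectory(task="Search for quarterly revenue data and create a summary chart")
traj.add('navigate_to("finance-dashboard")', "loaded, 4 datasets visible")
traj.add('query("Q3 revenue by region")', "returned 847 rows, CSV")
traj.add('filter(region="APAC", year=2025)', "212 rows remaining")
traj.add('create_chart(type="bar", x="month", y="revenue")', "chart rendered")
traj.success = True
```

Keeping the outcome flag next to the steps lets a later extraction stage separate successful traces from failed ones.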
Skill Extraction
Raw trajectories are distilled into structured skill units — reusable, domain-applicable knowledge artifacts that future agents can retrieve and apply.
Skill Representation Components
Natural language description + Code or action template + Reasoning chain + Success/failure examples (demonstrations)
Most Critical Components (ablation)
Descriptions (40–50% drop if removed) and Demonstrations (50–60% retention) dominate. Reasoning chains matter most for novel tasks.
Three Extraction Families
Refinement-based · Evolution-based · Trace-to-Skill — each with different speed/quality/generalizability tradeoffs (see Generation section)
Example Extracted Skill
# Skill: Dashboard Data Query & Chart

## Description
Query a finance dashboard for structured
data and create a summary visualization.

## Steps
1. Navigate to the target dashboard
2. Issue structured query with filters
   (region, date range, metric)
3. Validate row count before proceeding
4. Create chart with appropriate axis mapping

## When to apply
Tasks requiring data extraction + visual
summary from tabular dashboards.

## Example (success)
Input: "Q3 APAC revenue summary chart"
Result: 4-step completion, bar chart ✓
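One way to store a skill so that it renders back into a document like the example above is sketched below. The `Skill` class, its field names, and the `quality` score are hypothetical; the post does not specify a schema.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str
    steps: list[str]
    when_to_apply: str
    examples: list[str] = field(default_factory=list)
    quality: float = 0.0  # hypothetical supply-side score in [0, 1]

    def render(self) -> str:
        """Render the skill as a plain-text document with the headings used above."""
        lines = [f"# Skill: {self.name}", "", "## Description", self.description, "", "## Steps"]
        lines += [f"{i}. {s}" for i, s in enumerate(self.steps, start=1)]
        lines += ["", "## When to apply", self.when_to_apply]
        for ex in self.examples:
            lines += ["", "## Example (success)", ex]
        return "\n".join(lines)


skill = Skill(
    name="Dashboard Data Query & Chart",
    description="Query a finance dashboard for structured data and create a summary visualization.",
    steps=[
        "Navigate to the target dashboard",
        "Issue structured query with filters (region, date range, metric)",
        "Validate row count before proceeding",
        "Create chart with appropriate axis mapping",
    ],
    when_to_apply="Tasks requiring data extraction + visual summary from tabular dashboards.",
    examples=['Input: "Q3 APAC revenue summary chart"'],
    quality=0.8,
)
doc = skill.render()
```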
Skill Consumption
Extracted skills are fed to a downstream agent for a new task. Three consumption strategies are studied, each with distinct performance profiles.
In-Context Learning (+60–72%)
Skills are appended to the agent's context window as reference examples. Simple and broadly effective.
Skill-Aware Planning (+65–75%)
Skills inform the planning step before execution begins — highest gains on complex, multi-step tasks.
Direct Execution (+55–70%)
Skill code or action templates are executed directly. Fast but brittle — requires high-precision skill extraction.
Prepare → Select → Execute
1. Prepare: Index extracted skills; embed for retrieval; apply quality threshold filter
2. Select: Retrieve top-k relevant skills for the current task (optimal k = 3–10)
3. Execute: Apply chosen consumption strategy; monitor for negative transfer signals
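The three steps can be sketched end to end in a few lines. This is an illustrative toy, not the SkillLens implementation: the bag-of-words `embed` stands in for a real embedding model, the dictionary layout of a skill is an assumption, and `execute` shows only the in-context strategy.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag of words (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def prepare(skills: list[dict], threshold: float = 0.70) -> list[dict]:
    """Prepare: drop skills below the quality threshold and index the rest."""
    return [{**s, "vec": embed(s["description"])} for s in skills if s["quality"] >= threshold]


def select(index: list[dict], task: str, k: int = 5) -> list[dict]:
    """Select: return the k indexed skills most similar to the task text."""
    q = embed(task)
    return sorted(index, key=lambda s: cosine(q, s["vec"]), reverse=True)[:k]


def execute(task: str, chosen: list[dict]) -> str:
    """Execute (in-context variant): prepend the chosen skills to the task prompt."""
    refs = "\n".join(f"- {s['description']}" for s in chosen)
    return f"Reference skills:\n{refs}\n\nTask: {task}"


# Hypothetical library; the quality scores are made up for the example.
library = [
    {"description": "query a finance dashboard and create a chart", "quality": 0.85},
    {"description": "solve a quadratic equation step by step", "quality": 0.90},
    {"description": "scrape a dashboard table quickly", "quality": 0.40},
]
task = "create a revenue chart from the finance dashboard"
index = prepare(library)
chosen = select(index, task, k=1)
prompt = execute(task, chosen)
```

Filtering before indexing means a low-quality skill can never be retrieved, however well it matches the task text.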
Skill Generation Methods
Three families of extraction, each with distinct speed, quality, and generalizability tradeoffs
Refinement-Based
Iteratively clean and polish extracted skills through critique-and-refine loops. Each round, an LLM critiques the current skill document and produces an improved version.
Examples
AutoRefine, Praxis
Best for
Production pipelines where speed matters and the task domain is narrow and well-defined
Performance Profile
Speed: Very Fast · Quality: 75% · Generalize: Moderate · Complexity: Low
70–80% task success rate on same-domain tasks. Generalization drops sharply outside the training domain.
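The critique-and-refine loop described above has a simple shape. The sketch below assumes two callables, `critique` and `revise`, which in a real pipeline would be LLM calls; here they are replaced by hand-written stubs so the loop can run on its own. It is not the AutoRefine or Praxis algorithm.

```python
from typing import Callable


def refine(skill: str,
           critique: Callable[[str], list[str]],
           revise: Callable[[str, list[str]], str],
           max_rounds: int = 3) -> str:
    """Run critique-and-refine until the critic has no issues or rounds run out."""
    for _ in range(max_rounds):
        issues = critique(skill)
        if not issues:
            break
        skill = revise(skill, issues)
    return skill


# Stub critic and reviser standing in for LLM calls (hypothetical rules).
def toy_critique(skill: str) -> list[str]:
    issues = []
    if "## When to apply" not in skill:
        issues.append("missing applicability section")
    if "APAC" in skill:
        issues.append("trajectory-specific detail")
    return issues


def toy_revise(skill: str, issues: list[str]) -> str:
    if "trajectory-specific detail" in issues:
        skill = skill.replace("APAC", "<region>")
    if "missing applicability section" in issues:
        skill += "\n## When to apply\nTabular dashboard tasks."
    return skill


draft = "## Steps\n1. Query the dashboard\n2. Filter to APAC"
final = refine(draft, toy_critique, toy_revise)
```

The `max_rounds` cap matters in practice: an LLM critic can keep finding issues indefinitely, so the loop needs a hard stop.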
Evolution-Based
Uses evolutionary search — mutation, crossover, and selection — to discover skill variants that generalize beyond the training distribution. Slower but finds skills that work across task families.
Examples
CoEvoSkills, EvoSkill
Best for
Skills intended to transfer across related domains; longer-horizon planning tasks; research settings where compute is not the bottleneck
Performance Profile
Speed: Slow · Quality: 82% · Generalize: Best · Complexity: High
75–85% success rate. Best cross-domain transfer of all three families. 2–3× more compute than refinement-based.
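Mutation, crossover, and selection can be illustrated on skills represented as fixed-length step lists. Everything here is a toy under stated assumptions: the step vocabulary, the starting population, and the `fitness` function are invented for the example, whereas a real system would score each variant by running agents on held-out tasks. It does not reproduce CoEvoSkills or EvoSkill.

```python
import random


def mutate(steps, pool, rng):
    """Replace one randomly chosen step with a step drawn from the pool."""
    child = list(steps)
    child[rng.randrange(len(child))] = rng.choice(pool)
    return child


def crossover(a, b, rng):
    """Single-point crossover between two step lists of equal length."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(population, fitness, pool, generations=20, seed=0):
    """Evolve step lists; the fitter half survives unchanged each generation."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[: max(2, size // 2)]
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), pool, rng))
        population = parents + children
    return max(population, key=fitness)


# Hypothetical step vocabulary and target procedure for the toy fitness.
pool = ["navigate", "query", "validate", "chart", "filter", "export"]
target = ["navigate", "query", "validate", "chart"]


def fitness(steps):
    """Toy fitness: number of positions that match the target procedure."""
    return sum(s == t for s, t in zip(steps, target))


population = [
    ["query", "query", "chart", "chart"],
    ["navigate", "filter", "filter", "export"],
    ["export", "query", "validate", "filter"],
    ["filter", "export", "navigate", "chart"],
]
best = evolve(population, fitness, pool)
```

Because the fitter half is carried over unchanged, the best fitness never decreases from one generation to the next. The extra compute cost comes from evaluating `fitness` on every variant in every generation.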
Trace-to-Skill
Directly parse execution traces into structured skill representations using an LLM as a parser. No iterative refinement — one forward pass converts a trajectory into a skill document.
Examples
Trace2Skill (2026)
Best for
High-volume pipelines; cases where trajectory data is abundant and fast extraction matters more than generalizability
Performance Profile
Speed: Fastest · Quality: 68% · Generalize: Limited · Complexity: Lowest
68–78% accuracy. Lowest complexity, easiest to deploy. Weakest generalization — skills are highly trajectory-specific.
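The single-pass idea can be shown with a rule-based parser standing in for the LLM. This is a sketch only: the regular expressions assume the "Task / Step N / Outcome" layout of the example trajectory earlier in this post, and the output keys are hypothetical rather than the Trace2Skill format.

```python
import re


def trace_to_skill(trace: str) -> dict:
    """Single pass: pull the task and the action names out of a trajectory text."""
    task = re.search(r"Task:\s*(.+)", trace)
    actions = re.findall(r"Step \d+:\s*(\w+)\(", trace)
    outcome = re.search(r"Outcome:\s*(\w+)", trace)
    return {
        "description": task.group(1).strip() if task else "",
        "steps": actions,  # action names only; arguments are dropped
        "from_success": bool(outcome and outcome.group(1) == "SUCCESS"),
    }


trace = """Task: Search for quarterly revenue data
Step 1: navigate_to("finance-dashboard")
Step 2: query("Q3 revenue by region")
Step 3: filter(region="APAC", year=2025)
Step 4: create_chart(type="bar", x="month", y="revenue")
Outcome: SUCCESS (task completed in 4 steps)"""
skill = trace_to_skill(trace)
```

With no critique or search step, whatever is specific to this one trace goes straight into the skill, which is one way to read the weak generalization noted above.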
Side-by-Side Summary
| Method | Speed | Quality | Generalization | Best Use |
| --- | --- | --- | --- | --- |
| Refinement | Fast | 70–80% | Moderate | Same-domain production |
| Evolution | Slow | 75–85% | Best | Cross-domain research |
| Trace-to-Skill | Fastest | 68–78% | Limited | High-volume pipelines |
Skill Consumption & Tuning
How many skills to retrieve, what quality threshold to apply, and which strategy to use
Optimal Skill Count
Too few: insufficient coverage. Too many: context overhead and noise. Sweet spot is 3–10 skills.
1 skill: +12% · 3–5 skills: +30% · 7–10 skills: +34% · 15+ skills: +31%
Recommended range: 3–10 skills. Beyond 10 skills, context window overhead begins to erode gains.
Quality Threshold Ablation
Filter skills below a quality score. Too strict removes useful skills; too loose admits noise. Optimal: ~70%.
[Chart: performance gain (0% to +30%) versus quality threshold (50% to 90%); peak gain marked at +28%]
Performance vs quality threshold across all 5 domains. The inverted-U shape is consistent across all tested extraction methods.
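A threshold ablation of this kind amounts to a sweep: filter the library at each candidate threshold, evaluate what remains, and keep the best setting. The sketch below assumes a caller-supplied `evaluate` function. The utilities in the toy library are hand-picked to give an interior optimum and are not data from the study.

```python
def sweep_thresholds(skills, evaluate, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return (best_threshold, results), where results maps threshold -> score."""
    results = {}
    for t in thresholds:
        kept = [s for s in skills if s["quality"] >= t]
        results[t] = evaluate(kept)
    best = max(results, key=results.get)
    return best, results


# Toy library: "utility" is a made-up per-skill contribution, negative for noisy skills.
library = [
    {"quality": 0.55, "utility": -3},
    {"quality": 0.65, "utility": -1},
    {"quality": 0.72, "utility": 4},
    {"quality": 0.85, "utility": 5},
    {"quality": 0.95, "utility": 2},
]


def toy_evaluate(kept):
    """Stand-in for running the agent on a validation set with the kept skills."""
    return sum(s["utility"] for s in kept)


best, results = sweep_thresholds(library, toy_evaluate)
```

In this toy, a loose threshold admits the negative-utility skills and a strict one discards useful ones, which is the mechanism behind the inverted-U shape.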
Cross-Domain Transfer
Skills learned in one domain rarely transfer to dissimilar domains
[Heatmap: transfer rate by source domain (rows) and target domain (columns); domains: Web, Code, Math, Sheets, Structured]
Transfer rate legend: Low (<40%) · Medium (40–65%) · High (65–80%) · Same-domain (>80%)
Same-Domain Transfer
75–90% effectiveness when source and target domains match. Skills encode domain-specific procedural knowledge that remains valid within-domain.
Cross-Domain Transfer
Only 20–35% effectiveness for dissimilar domains (e.g., Math → Web). Domain specificity is the single largest bottleneck in the skill ecosystem.
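A measured transfer rate can be mapped onto the legend buckets to decide whether a source-domain library is worth reusing. This is a sketch under assumptions: the legend leaves the boundary values 65% and 80% ambiguous, so the handling below is a choice, and `should_reuse` is a hypothetical gate rather than anything the study prescribes.

```python
def transfer_bucket(rate: float) -> str:
    """Map a transfer rate in percent to a legend label.

    Boundary handling is an assumption: 65 and 80 are both treated as "High".
    """
    if not 0 <= rate <= 100:
        raise ValueError("rate must be a percentage between 0 and 100")
    if rate < 40:
        return "Low"
    if rate < 65:
        return "Medium"
    if rate <= 80:
        return "High"
    return "Same-domain"


def should_reuse(rate: float, minimum: str = "Medium") -> bool:
    """Hypothetical gate: reuse a source-domain library only at or above a bucket."""
    order = ["Low", "Medium", "High", "Same-domain"]
    return order.index(transfer_bucket(rate)) >= order.index(minimum)
```

With the default gate, a dissimilar pair in the 20–35% range reported above falls in the "Low" bucket and is rejected.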
Key Findings
Five results that change how you should think about building agent skill systems
Finding 01
Skills help on average — but negative transfer is real and unpredictable
Across all 5 domains, skill-augmented agents outperform skill-free baselines by 15–35%. However, low-quality or mismatched skills degrade performance by 5–10%. Without quality gating, a skill library can do more harm than good.
Finding 02
Model scale does not predict skill quality
Larger models do not reliably produce better skills or consume them more effectively. Performance varies unpredictably across model sizes. Generation method matters more than model size. This directly challenges the intuition that scaling alone solves skill learning.
Finding 03
Domain specificity is the biggest bottleneck
Same-domain skill transfer: 75–90%. Cross-domain (dissimilar): only 20–35%. Skills encode domain-specific procedural knowledge so tightly that they rarely generalize across task families. Evolution-based methods partially close this gap, but the problem remains unsolved.
Finding 04
Generated skills reach ~80% of human-crafted quality
The best extraction methods produce skills at 75–85% of the quality of hand-crafted expert skills. The remaining gap comes from trajectory noise, missing domain context, and the absence of human judgement about edge cases. This gap narrows as trajectory quality improves.
Finding 05
Supply quality ≠ demand utility (R² ≈ 0.78)
A skill that scores well on internal quality metrics (clarity, completeness, correctness) is not guaranteed to improve agent task performance. The correlation is real (R² ≈ 0.78) but imperfect — ~22% of the variance in utility is unexplained by quality scores alone. Task-match and consumption strategy account for much of the gap.
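The arithmetic behind that statement is a quick check. The second expression assumes a simple one-predictor linear fit, where R² equals the square of the Pearson correlation; the post does not say how R² was computed, so treat that part as an assumption.

```latex
1 - R^2 \approx 1 - 0.78 = 0.22
\qquad\text{and}\qquad
|r| = \sqrt{R^2} \approx \sqrt{0.78} \approx 0.88
```

A correlation near 0.88 is strong, yet roughly a fifth of the variance in utility still lies outside what quality scores capture.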
SkillLens vs Prior Work
| Prior Method | Coverage | SkillLens Advantage |
| --- | --- | --- |
| SkillsBench | Partial evaluation | +30–40% more evaluation dimensions; full lifecycle coverage |
| AutoRefine | Extraction only | Similar quality, 2–3× faster generation; better generalization measures |
| Trace2Skill | Extraction only | +10–15% accuracy on complex trajectories; dual-perspective evaluation |
| Agent-SkillOS | Consumption only | +25–35% agent performance with SkillLens-generated skills |
Practice Exercises
Apply what you've learned — three browser exercises and one live lab (optional OpenAI key)
1  Pipeline Stage Sort
Click any item to cycle it through the three SkillLens stages. Assign all 6 correctly, then click Check Answers.
Stage 1: Experience
Stage 2: Extraction
Stage 3: Consumption
2  Pick the Right Extraction Method
For each scenario, select the best extraction approach. Click a method to lock in your answer and see instant feedback.
3  The Goldilocks Lab — Find the Optimal Configuration
Adjust both sliders to maximise simulated agent performance. There is a sweet spot — can you find it?
[Sliders: Skill Count (1 to 20, default 5) and Quality Threshold (40% to 95%, default 70%); Simulated Performance Gain at the defaults: +30%]
  Live Lab — Extract a Skill from a Raw Trajectory
Watch the SkillLens pipeline in action: paste any agent trajectory → gpt-4o-mini extracts a structured, reusable skill document. Optional — requires your OpenAI key (stored only in your browser, ~$0.0005 per run).
Raw Agent Trajectory (pre-filled — edit freely)
Extracted Skill Document
This is what a downstream agent would retrieve from the skill library. The model preserved the reusable procedure and stripped the trajectory noise — exactly what SkillLens Stage 2 produces.