SkillLens — The Full Lifecycle Study of Agent Skills

Overview

›

Problem

›

Framework

›

Generation

›

Consumption

›

Transfer

›

Findings

›

Exercises

Post 54 Agents & Systems Microsoft Research · May 2026 Skill Lifecycle

SkillLens
The Full Lifecycle Study of Agent Skills

Agents repeat the same mistakes because they never retain what they learned. Skills are supposed to fix that — but nobody had studied the entire pipeline: how skills are extracted from raw experience, what makes them reusable, how to consume them effectively, and when they actually hurt. SkillLens is the first systematic answer, spanning 5 domains, 3 extraction families, and 3 consumption strategies across state-of-the-art models.

Task Domains

Extraction Methods

Consumption Strategies

70%

Optimal Quality Threshold

3–10

Optimal Skill Count

What SkillLens Studied

The complete agent skill lifecycle — from raw task trajectories through skill extraction to downstream agent consumption — treated as a unified, measurable system for the first time.

The Surprising Finding

Model scale does not predict skill quality. Larger models do not reliably produce better skills or consume them more effectively — generation method matters more than model size.

The Problem

Why agent skills remain poorly understood despite years of research

No Persistent Learning

Standard LLM agents reset at the start of each task. Even after solving a complex problem, the agent retains nothing — the next similar task starts from scratch.

Fragmented Prior Work

Existing research studied only slices of the pipeline: skill extraction in isolation, or skill consumption in isolation. Nobody measured the full lifecycle end-to-end.

Negative Transfer is Real

Low-quality or mismatched skills don't just fail to help — they actively degrade agent performance by 5–10%. Without systematic study, there was no way to predict when skills would hurt.

The Dual-Perspective Gap

Prior evaluations conflated two different questions:
• Supply side: Is this skill internally well-formed? (extraction quality)
• Demand side: Does this skill actually help an agent complete tasks? (practical utility)

A skill can score high on supply-side metrics yet provide zero — or negative — demand-side value. SkillLens is the first framework to measure both simultaneously, revealing that they are only loosely correlated (R² ≈ 0.78).

The SkillLens Framework

Three stages, measured end-to-end for the first time — click each stage to explore

Stage 1 — Experience Generation

Stage 2 — Skill Extraction

Stage 3 — Skill Consumption

Raw Experience Generation

Agents execute tasks and produce trajectories — step-by-step records of observations, reasoning, actions, and outcomes. These are the raw material for skill learning.

What goes in

Task instructions + environment observations

What comes out

Execution traces: action sequences, tool calls, intermediate results, final outcomes (success/failure)

Quality signal

Trajectory quality directly caps skill quality — garbage in, garbage out. Only 55–70% of generated trajectories produce reusable skills.

Example Trajectory Fragment

Task: Search for quarterly revenue data
       and create a summary chart

Step 1: navigate_to("finance-dashboard")
  → Observation: loaded, 4 datasets visible

Step 2: query("Q3 revenue by region")
  → Observation: returned 847 rows, CSV

Step 3: filter(region="APAC", year=2025)
  → Observation: 212 rows remaining

Step 4: create_chart(type="bar", x="month",
         y="revenue")
  → Observation: chart rendered ✓

Outcome: SUCCESS (task completed in 4 steps)

Skill Extraction

Raw trajectories are distilled into structured skill units — reusable, domain-applicable knowledge artifacts that future agents can retrieve and apply.

Skill Representation Components

Natural language description + Code or action template + Reasoning chain + Success/failure examples (demonstrations)

Most Critical Components (ablation)

Descriptions (40–50% drop if removed) and Demonstrations (50–60% retention) dominate. Reasoning chains matter most for novel tasks.

Three Extraction Families

Refinement-based · Evolution-based · Trace-to-Skill — each with different speed/quality/generalizability tradeoffs (see Generation section)

Example Extracted Skill

# Skill: Dashboard Data Query & Chart

## Description
Query a finance dashboard for structured
data and create a summary visualization.

## Steps
1. Navigate to the target dashboard
2. Issue structured query with filters
   (region, date range, metric)
3. Validate row count before proceeding
4. Create chart with appropriate axis mapping

## When to apply
Tasks requiring data extraction + visual
summary from tabular dashboards.

## Example (success)
Input: "Q3 APAC revenue summary chart"
Result: 4-step completion, bar chart ✓

Skill Consumption

Extracted skills are fed to a downstream agent for a new task. Three consumption strategies are studied, each with distinct performance profiles.

In-Context Learning (+60–72%)

Skills are appended to the agent's context window as reference examples. Simple and broadly effective.

Skill-Aware Planning (+65–75%)

Skills inform the planning step before execution begins — highest gains on complex, multi-step tasks.

Direct Execution (+55–70%)

Skill code or action templates are executed directly. Fast but brittle — requires high-precision skill extraction.

Prepare → Select → Execute

Prepare

Index extracted skills; embed for retrieval; apply quality threshold filter

Select

Retrieve top-k relevant skills for the current task (optimal k = 3–10)

Execute

Apply chosen consumption strategy; monitor for negative transfer signals

Skill Generation Methods

Three families of extraction, each with distinct speed, quality, and generalizability tradeoffs

Refinement-Based

Evolution-Based

Trace-to-Skill

Refinement-Based

Iteratively clean and polish extracted skills through critique-and-refine loops. Each round, an LLM critiques the current skill document and produces an improved version.

Examples

AutoRefine, Praxis

Best for

Production pipelines where speed matters and the task domain is narrow and well-defined

Performance Profile

Speed

Very Fast

Quality

75%

Generalize

Moderate

Complexity

Low

70–80% task success rate on same-domain tasks. Generalization drops sharply outside training domain.

Evolution-Based

Uses evolutionary search — mutation, crossover, and selection — to discover skill variants that generalize beyond the training distribution. Slower but finds skills that work across task families.

Examples

CoEvoSkills, EvoSkill

Best for

Skills intended to transfer across related domains; longer-horizon planning tasks; research settings where compute is not the bottleneck

Performance Profile

Speed

Slow

Quality

82%

Generalize

Best

Complexity

High

75–85% success rate. Best cross-domain transfer of all three families. 2–3× more compute than refinement-based.

Trace-to-Skill

Directly parse execution traces into structured skill representations using an LLM as a parser. No iterative refinement — one forward pass converts a trajectory into a skill document.

Examples

Trace2Skill (2026)

Best for

High-volume pipelines; cases where trajectory data is abundant and fast extraction is needed over high generalizability

Performance Profile

Speed

Fastest

Quality

68%

Generalize

Limited

Complexity

Lowest

68–78% accuracy. Lowest complexity, easiest to deploy. Weakest generalization — skills are highly trajectory-specific.

Side-by-Side Summary

Method	Speed	Quality	Generalization	Best Use
Refinement	Fast	70–80%	Moderate	Same-domain production
Evolution	Slow	75–85%	Best	Cross-domain research
Trace-to-Skill	Fastest	68–78%	Limited	High-volume pipelines

Skill Consumption & Tuning

How many skills to retrieve, what quality threshold to apply, and which strategy to use

Optimal Skill Count

Too few: insufficient coverage. Too many: context overhead and noise. Sweet spot is 3–10 skills.

1 skill

+12%

3–5 skills

+30%

7–10 skills

+34%

15+ skills

+31%

Highlighted bars = recommended range. Beyond 10 skills, context window overhead begins to erode gains.

Quality Threshold Ablation

Filter skills below a quality score. Too strict removes useful skills; too loose admits noise. Optimal: ~70%.

Performance vs quality threshold across all 5 domains. The inverted-U shape is consistent across all tested extraction methods.

Cross-Domain Transfer

Skills learned in one domain rarely transfer to dissimilar domains — hover any cell to see the transfer rate and why

Source ↓ / Target →	Web	Code	Math	Sheets	Structured

Transfer rate:

Low (<40%)

Medium (40–65%)

High (65–80%)

Same-domain (>80%)

Same-Domain Transfer

75–90% effectiveness when source and target domains match. Skills encode domain-specific procedural knowledge that remains valid within-domain.

Cross-Domain Transfer

Only 20–35% effectiveness for dissimilar domains (e.g., Math → Web). Domain specificity is the single largest bottleneck in the skill ecosystem.

Key Findings

Five results that change how you should think about building agent skill systems

Finding 01

Skills help on average — but negative transfer is real and unpredictable

Across all 5 domains, skill-augmented agents outperform skill-free baselines by +15–35%. However, low-quality or mismatched skills cause −5–10% degradation. Without quality gating, skill libraries are actively dangerous.

Finding 02

Model scale does not predict skill quality

Larger models do not reliably produce better skills or consume them more effectively. Performance varies unpredictably across model sizes. Generation method matters more than model size. This directly challenges the intuition that scaling alone solves skill learning.

Finding 03

Domain specificity is the biggest bottleneck

Same-domain skill transfer: 75–90%. Cross-domain (dissimilar): only 20–35%. Skills encode domain-specific procedural knowledge so tightly that they rarely generalize across task families. Evolution-based methods partially close this gap, but the problem remains unsolved.

Finding 04

Generated skills reach ~80% of human-crafted quality

The best extraction methods produce skills at 75–85% the quality of hand-crafted expert skills. The remaining gap comes from trajectory noise, missing domain context, and the absence of human judgement about edge cases. This gap narrows as trajectory quality improves.

Finding 05

Supply quality ≠ demand utility (R² ≈ 0.78)

A skill that scores well on internal quality metrics (clarity, completeness, correctness) is not guaranteed to improve agent task performance. The correlation is real (R² ≈ 0.78) but imperfect — ~22% of the variance in utility is unexplained by quality scores alone. Task-match and consumption strategy account for much of the gap.

SkillLens vs Prior Work

Prior Method	Coverage	SkillLens Advantage
SkillsBench	Partial evaluation	+30–40% more evaluation dimensions; full lifecycle coverage
AutoRefine	Extraction only	Similar quality, 2–3× faster generation; better generalization measures
Trace2Skill	Extraction only	+10–15% accuracy on complex trajectories; dual-perspective evaluation
Agent-SkillOS	Consumption only	+25–35% agent performance with SkillLens-generated skills

Practice Exercises

Apply what you've learned — three browser exercises and one live lab (optional OpenAI key)

1 Pipeline Stage Sort

Click any item to cycle it through the three SkillLens stages. Assign all 6 correctly, then click Check Answers.

Stage 1: Experience

Stage 2: Extraction

Stage 3: Consumption

2 Pick the Right Extraction Method

For each scenario, select the best extraction approach. Click a method to lock in your answer and see instant feedback.

3 The Goldilocks Lab — Find the Optimal Configuration

Adjust both sliders to maximise simulated agent performance. There is a sweet spot — can you find it?

Skill Count: 5

15101520

Quality Threshold: 70%

40%55%70%85%95%

Simulated Performance Gain

+30%

★ Live Lab — Extract a Skill from a Raw Trajectory

Watch the SkillLens pipeline in action: paste any agent trajectory → gpt-4o-mini extracts a structured, reusable skill document. Optional — requires your OpenAI key (stored only in your browser, ~$0.0005 per run).

Raw Agent Trajectory (pre-filled — edit freely)

Extracted Skill Document

This is what a downstream agent would retrieve from the skill library. The model preserved the reusable procedure and stripped the trajectory noise — exactly what SkillLens Stage 2 produces.

Build your understanding of agent skill systems and persistent agent learning

Post 53 — SkillOpt

The optimization engine for skill documents — treats skills like model weights and applies gradient-descent analogues. Directly complements SkillLens: SkillLens studies what makes skills good; SkillOpt automatically makes them better.

Post 52 — LLM Agent Orchestration

Coordination patterns for multi-agent systems. Skills (studied by SkillLens) are the knowledge layer that agents carry into orchestrated pipelines — understanding both is essential for production agent architectures.

Post 50 — Memory in LLM Agents

Survey of agent memory types: inside-trial, cross-trial, external. SkillLens's skill libraries are cross-trial external memory — the paper provides the most thorough empirical analysis of this memory type to date.

← Previous Post

Post 53 — SkillOpt

Post 55 — Securing MCP