The Self-Improving Agent Architecture · Visual Summary
SOUL.md ⟶ Memory ⟶ Skills ⟶ Curator ⟶ GEPA
Introduction
Hermes Agent: The Agent That Gets Better
Hermes Agent crossed 148,000 GitHub stars by May 2026 — reaching 90K in just two months. It ships with a learning loop that no other open-source agent combines: runtime skill learning, persistent multi-layer memory, and an offline optimization pipeline.
148K: GitHub stars as of May 2026
90K: stars in the first 2 months
47: built-in tools dispatched
687: skills across 18 categories
The One-Line Pitch
"An agent that gets better the longer you use it."
Three usually separate capabilities sit in one framework: runtime skill learning, persistent multi-layer memory, and an optional weight-free training pipeline. No other open-source agent ships all three.
The 4-Part Learning Loop
1. Remembers across sessions. Three-tier memory persists facts, procedures, and conversation history.
2. Writes its own reusable skills. One-time discoveries become permanent procedural memory (SKILL.md files).
3. Prunes them in the background. The Curator consolidates stale and overlapping skills automatically.
4. Validates them offline via GEPA. Genetic-Pareto Prompt Evolution tests and improves skills without GPUs.
Architecture
How It's Built
Everything flows through a single AIAgent class in run_agent.py. CLI, messaging gateway, IDE, batch runner — all entry points into the same core agent. This is what makes "platform-agnostic" actually work.
The terminal backend can be local, Docker, SSH, Modal, Daytona, or Singularity: same code, one config change. Move from laptop to cloud GPU without touching the agent.
3 API Modes (Any Model)
A translation layer routes any provider through one of three API formats. Swap Claude → GPT → Gemini → Ollama with one command.
90-Turn Hard Cap
Without it, a stuck agent (retrying a failing API, re-reading the same file) silently burns credits. Subagents share the same budget — no runaway delegation chains.
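The shared-budget idea can be sketched in a few lines. This is a minimal illustration, not the Hermes implementation: the `TurnBudget` class, `run_turns` helper, and exception name are hypothetical, and only the 90-turn figure comes from the text.

```python
from dataclasses import dataclass


class TurnBudgetExceeded(RuntimeError):
    """Raised when the shared turn budget is exhausted."""


@dataclass
class TurnBudget:
    # 90 comes from the text; the class itself is a hypothetical illustration.
    limit: int = 90
    used: int = 0

    def spend(self) -> None:
        """Consume one turn, or raise if the cap has been reached."""
        if self.used >= self.limit:
            raise TurnBudgetExceeded(f"hard cap of {self.limit} turns reached")
        self.used += 1


def run_turns(budget: TurnBudget, wanted: int) -> int:
    """Try to run `wanted` turns; return how many actually ran."""
    done = 0
    try:
        for _ in range(wanted):
            budget.spend()
            done += 1
    except TurnBudgetExceeded:
        pass
    return done


budget = TurnBudget()
parent_ran = run_turns(budget, 60)    # parent agent uses 60 turns
subagent_ran = run_turns(budget, 50)  # subagent draws on the same object
```

Because the parent and the subagent hold the same `TurnBudget` object, the subagent is cut off after 30 turns rather than getting a fresh allowance of its own.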
OpenClaw is the closest open-source comparison. Both are persistent and messaging-friendly, but they make opposite architectural choices. Post 11 covers OpenClaw in depth.
"Hermes packages a gateway around a learning agent. OpenClaw packages an agent around a messaging gateway." — Kilo blog
OpenClaw
Gateway-first. WebSocket routing with agent attached. 50+ messaging channels, 13,700+ community skills. Memory is plain Markdown files.
Skills stay static — no learning loop. But the community skill ecosystem is vastly larger.
Hermes Agent: zero agent-specific CVEs as of April 2026, with a snapshot taken for rollback before every file operation.
| Dimension | OpenClaw | Hermes Agent |
| --- | --- | --- |
| Architecture | Gateway-first, WebSocket routing | Agent-first, learning runtime |
| Channel breadth | 50+ messaging channels | 18+ focused channels |
| Skill ecosystem | 13,700+ community skills | ~120 bundled + skills.sh + GitHub taps |
| Learning loop | Skills stay static | Self-evolve, Curator prunes, GEPA optimizes |
| Memory | Plain Markdown files | Three-tier: bounded MD, FTS5 search, 8 providers |
| Safety | 6 CVEs, 341+ malicious skills flagged | Zero agent CVEs, snapshot rollback |
When to choose Hermes: You want an agent that improves over time with your workflow, care about safety, and don't need 50 messaging channels. When to choose OpenClaw: You need the broadest channel support or the largest pre-built skill library, and don't need runtime learning.
Identity Layer
SOUL.md — Before Memory, Before Skills
Memory is what the agent knows. Skills are how it does things. But neither tells you who it is when it shows up. SOUL.md solves this — it's the identity that everything else flows through.
What SOUL.md Is
1. Slot #1 in the system prompt. Loaded before MEMORY.md, USER.md, or any skills. The fixed foundation everything else builds on.
2. Hand-authored and static. You write it once and tweak it over time. It is never auto-generated, unlike memories and skills.
3. Defines personality, tone, limits. Communication style, hard limits, what to optimize for, what to refuse.
4. Per-profile isolation. Each profile (designer, programmer, researcher) has its own SOUL.md. Same engine, genuinely different agents.
# ~/.hermes/SOUL.md (default engineer)
You are a pragmatic senior engineer
with strong taste.
You optimize for truth, clarity,
and usefulness over politeness theater.

# ~/.hermes/profiles/programmer/SOUL.md
You are my staff engineer.
Terse, direct, pragmatic.
You read code before you write code.
You write the smallest change that
solves the problem. Standard library
over dependencies, boring tech over
shiny tech, explicit over clever.
Always check: does this already exist?
Are there tests? What breaks if this fails?
Run the tests before saying "done."
If SOUL.md is missing, Hermes falls back to a built-in default identity. But the fallback is generic — the self-improvement story loses its frame without a custom SOUL.md.
Why it matters for self-improvement: Every memory the agent writes, every skill it creates, every way it consolidates knowledge — all happen through the lens of this identity. SOUL.md is the fixed frame. Memory and skills are the moving parts inside it.
Memory System
Three-Tier Memory: Three Speeds
Hermes doesn't have a single "memory." Each tier trades speed for capacity. The agent picks the right one for the question.
Tier 1 — In-Prompt
Tiny Markdown Files
MEMORY.md (2,200 chars) + USER.md (1,375 chars). Injected as a frozen snapshot at session start. Prefix-cache safe.
Speed: Instant · Capacity: Tiny
Changes written mid-session persist to disk immediately but don't appear in the system prompt until next session. At ~80% capacity, agent must consolidate — merging related entries into denser forms.
Tier 2 — Session Search
SQLite + FTS5
Every conversation (CLI and messaging) stored in SQLite with full-text search. Agent can search weeks of past conversations on demand. Summaries via Gemini Flash.
Speed: On-demand · Capacity: ∞
Requires active search + LLM summarization — not free to access. Critical facts live in Tier 1. Everything else is searchable here.
Tier 3 — External Providers
8 Pluggable Backends
8 memory providers that run alongside built-in memory (never replacing it). Only one active at a time. Prefetches before each turn, syncs after response, extracts on session end.
Speed: Slower · Capacity: Deep
Each provider has unique features: Honcho (dialectic user modeling), Holographic (HRR algebra + trust scoring), Supermemory (context fencing, multi-container).
| Provider | Storage | Cost | Unique Feature |
| --- | --- | --- | --- |
| Honcho | Cloud | Paid | Dialectic user modeling + session-scoped context |
| OpenViking | Self-hosted | Free | Filesystem hierarchy + tiered loading |
| Mem0 | Cloud | Paid | Server-side LLM extraction |
| Hindsight | Cloud/Local | Free/Paid | Knowledge graph + reflect synthesis |
| Holographic | Local | Free | HRR algebra + trust scoring |
| RetainDB | Cloud | $20/mo | Delta compression |
| ByteRover | Local/Cloud | Free/Paid | Pre-compression extraction |
| Supermemory | Cloud | Paid | Context fencing + session graph + multi-container |
Interactive
Memory System Explorer
Step through what gets stored in each tier, when it's accessed, and how consolidation works.
MEMORY.md (2,200 chars max)
## Environment
- Python 3.12, uv package manager
- Repo: ~/projects/api-server
## Tool Quirks
- docker ps hangs on VPN — use ssh backend
## Lessons Learned
- Always run uv sync before testing
- main branch needs PR, never force push
## Project State
- Auth refactor in progress (branch: feat/auth)
[capacity: 71%]
USER.md (1,375 chars max)
## Profile
- Name: Bhaskarjit
- Timezone: IST (UTC+5:30)
- Skill level: Senior engineer
## Preferences
- Terse responses — no summaries
- Prefer standard library
- Dark mode code blocks
## Avoid
- Emoji in code comments
- Long explanations when brief works
[capacity: 58%]
Both files are loaded as a frozen snapshot when the session starts. Mid-session writes persist to disk but don't enter the active context until next session. This is intentional — it prevents mid-conversation context drift.
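The frozen-snapshot behavior and the capacity check can be sketched as follows. This is an illustration under stated assumptions: `MemoryFile`, `capacity`, and `needs_consolidation` are hypothetical names, and only the 2,200-character limit and the roughly 80% threshold come from the text.

```python
MEMORY_LIMIT = 2200    # character cap for MEMORY.md, from the text
CONSOLIDATE_AT = 0.80  # approximate threshold, from the text


def capacity(text: str, limit: int) -> float:
    """Fraction of the character budget currently used."""
    return len(text) / limit


def needs_consolidation(text: str, limit: int = MEMORY_LIMIT) -> bool:
    """True once the file has reached the consolidation threshold."""
    return capacity(text, limit) >= CONSOLIDATE_AT


class MemoryFile:
    """Hypothetical wrapper: disk content changes, the prompt snapshot does not."""

    def __init__(self, initial: str) -> None:
        self.on_disk = initial
        self.snapshot = initial  # frozen at session start

    def write(self, entry: str) -> None:
        self.on_disk += entry  # persisted immediately, not injected into the prompt

    def start_new_session(self) -> None:
        self.snapshot = self.on_disk  # the next session picks up earlier writes


mem = MemoryFile("- Python 3.12, uv package manager\n")
mem.write("- Always run uv sync before testing\n")
visible_before = "uv sync" in mem.snapshot  # False: snapshot is frozen
mem.start_new_session()
visible_after = "uv sync" in mem.snapshot   # True: next session sees it
```

Keeping the snapshot separate from the on-disk text is what lets the system prompt stay byte-identical for a whole session, which is the property a prefix cache depends on.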
Every conversation is written to SQLite with FTS5 full-text search indexing. The agent uses a session_search tool to query it.
# Example: agent recalls a past debugging session
session_search("docker VPN hang fix", limit=3)
# Returns:
"2026-04-12: Discovered docker ps hangs on corp VPN. Fix: switch terminal backend to ssh. Confirmed working."
Tier 2 has unlimited capacity but requires active search + LLM summarization — it's not free to access. The agent decides when to search based on task context. Critical facts should be in Tier 1 for instant access.
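A rough idea of what an FTS5-backed search looks like, using Python's built-in `sqlite3` module. The table name, column names, and this `session_search` signature are assumptions made for illustration (the actual Hermes schema is not shown in this post), and the LLM summarization step is left out. It also assumes the local SQLite build includes the FTS5 extension, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per stored message.
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(session_date, content)")
conn.executemany(
    "INSERT INTO messages (session_date, content) VALUES (?, ?)",
    [
        ("2026-04-12", "docker ps hangs on corp VPN; switched terminal backend to ssh"),
        ("2026-04-15", "ran uv sync before testing the auth refactor"),
    ],
)


def session_search(query: str, limit: int = 3) -> list[tuple[str, str]]:
    """Return (date, content) rows matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT session_date, content FROM messages "
        "WHERE messages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


hits = session_search("docker AND vpn")
```

With the default tokenizer, matching is case-insensitive but not stemmed, so a query term like "hang" would not match "hangs". That is one reason a raw full-text hit list usually needs a summarization pass on top.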
When a Tier 3 provider is active, Hermes runs a 3-phase lifecycle automatically — no manual calls needed:
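PRE: Prefetch before each turn. The provider returns relevant memories based on the incoming message, and they are injected into context.
SYNC: Sync after each response. Conversation turns are written to the provider's store for future retrieval.
Extract on session end. When the session closes, the provider extracts memories from it.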
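The lifecycle (prefetch before each turn, sync after each response, extract on session end) maps naturally onto a small interface. The sketch below is hypothetical: the `MemoryProvider` protocol, its method names, and the toy `InMemoryProvider` are illustrative and not taken from the Hermes codebase.

```python
from typing import Protocol


class MemoryProvider(Protocol):
    """Hypothetical interface; method names are illustrative."""

    def prefetch(self, message: str) -> list[str]: ...
    def sync(self, user_msg: str, reply: str) -> None: ...
    def extract(self) -> None: ...


class InMemoryProvider:
    """Toy provider that stores turns in a list and matches by shared words."""

    def __init__(self) -> None:
        self.turns: list[str] = []
        self.extracted = False

    def prefetch(self, message: str) -> list[str]:
        words = set(message.lower().split())
        return [t for t in self.turns if words & set(t.lower().split())]

    def sync(self, user_msg: str, reply: str) -> None:
        self.turns.append(f"{user_msg} -> {reply}")

    def extract(self) -> None:
        self.extracted = True  # a real provider would distill long-term memories


def run_session(provider: MemoryProvider, messages: list[str]) -> list[list[str]]:
    """Drive the three phases; returns what was prefetched for each turn."""
    prefetched = []
    for msg in messages:
        prefetched.append(provider.prefetch(msg))  # before the turn
        reply = f"ack: {msg}"                      # stand-in for the model call
        provider.sync(msg, reply)                  # after the response
    provider.extract()                             # on session end
    return prefetched


provider = InMemoryProvider()
context = run_session(provider, ["docker hangs on VPN", "why does docker hang?"])
```

The first turn prefetches nothing because the store is empty; the second turn gets the first exchange back because both mention "docker".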
When MEMORY.md hits ~80% capacity (shown in the system prompt header), the agent consolidates automatically:
Before (80%+ full)
- Python 3.12 installed
- Python version is 3.12
- uv is package manager
- Using uv, not pip
- docker hangs on VPN
- Don't use docker on VPN
- Always use ssh backend on VPN
→
After (consolidated)
- Stack: Python 3.12, uv
- On corp VPN: use ssh terminal backend (docker ps hangs)
Only useful information survives. The agent must actively compress — there's no infinite memory. This forces good epistemic hygiene: low-value notes get dropped, high-signal facts get denser.
Procedural Memory
Self-Evolving Skills
Memory handles facts. Skills handle procedures. They are Markdown files with YAML frontmatter — the agent's procedural memory that it writes, reads, and improves autonomously.
Anatomy of a Skill File
---
name: k8s-pod-debug
description: >
  Activate for crashing pods, CrashLoopBackOff,
  "why is my pod restarting", container failures.
version: 1.2.0
author: agent
platforms: [linux, macos]
---
## Procedure
1. Get pod status → check events → pull logs
2. Look for OOMKilled, ImagePullBackOff, config errors
## Pitfalls
- Forgetting --previous flag on restarted containers
## Verification
- Pod stays Running with 0 restarts for 5+ minutes
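To make the file layout concrete, here is a small standard-library sketch that separates the frontmatter from the Markdown body. The `split_frontmatter` helper is hypothetical and deliberately naive: it reads only top-level `key: value` lines and skips indented continuation lines, so a real loader would more likely use a proper YAML parser.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into simple `key: value` metadata and the Markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    end = lines.index("---", 1)  # closing delimiter; raises ValueError if absent
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        # Only top-level pairs; indented continuation lines are skipped.
        if line and not line[0].isspace() and ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return meta, body


SKILL = """---
name: k8s-pod-debug
version: 1.2.0
author: agent
platforms: [linux, macos]
---
## Procedure
1. Get pod status, check events, pull logs
"""

meta, body = split_frontmatter(SKILL)
```

The metadata is what a catalog would index, while the body is the procedure the agent reads once the skill is selected.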
Progressive Disclosure — Token Efficiency
Level 0
Names + descriptions only. ~3k tokens for the full catalog of 687 skills. Always in context.
Level 1
Full skill content loaded when the agent determines this skill is needed. On-demand.
Level 2
Drill into specific reference files within a skill (scripts, configs, golden test sets).
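The three disclosure levels can be pictured as three lookups of increasing cost. The sketch below is an assumption-laden illustration: the `Skill` dataclass and the `level0_catalog`, `level1_load`, and `level2_reference` helpers are hypothetical names, not Hermes APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str
    body: str                                                  # full SKILL.md content (Level 1)
    references: dict[str, str] = field(default_factory=dict)   # extra files (Level 2)


def level0_catalog(skills: list[Skill]) -> str:
    """Names and descriptions only: the part that always stays in context."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


def level1_load(skills: list[Skill], name: str) -> str:
    """Full skill content, fetched only when the skill is selected."""
    return next(s.body for s in skills if s.name == name)


def level2_reference(skills: list[Skill], name: str, filename: str) -> str:
    """A single reference file inside one skill."""
    return next(s.references[filename] for s in skills if s.name == name)


skills = [
    Skill(
        name="k8s-pod-debug",
        description="Activate for crashing pods and CrashLoopBackOff.",
        body="## Procedure\n1. Get pod status, check events, pull logs",
        references={"checklist.txt": "kubectl logs --previous"},
    )
]
catalog = level0_catalog(skills)
```

The token saving comes from the fact that `catalog` never contains a skill body: the procedure text enters the context only after a Level 1 load.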
When Skills Are Created
The agent creates skills autonomously via the skill_manage tool. Creation triggers when:
When updating an existing skill, patch is preferred because it is token-efficient: a targeted fix, not a full rewrite.
The Self-Improvement Loop
Problem (e.g. CrashLoopBackOff) → Trial & Error (5+ tool calls, retries) → Working Solution (found) → skill_manage (create / patch) → SKILL.md Saved (~/.hermes/skills/) → Next Session (loads at Level 0)
One-time discoveries become permanent procedural memory.
Garbage Collection
The Curator — Background Skill Maintenance
Without maintenance, agent-created skills pile up — dozens of narrow, overlapping playbooks that waste tokens and pollute the catalog. The Curator handles this automatically.
Trigger Conditions
The Curator runs on an inactivity check, not a cron daemon:
Condition 1: 7+ days since last Curator run
AND
Condition 2: Agent idle for 2+ hours
↓
Background fork spins up with its own prompt cache — never touching the active conversation.
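The trigger amounts to two time comparisons joined by AND. A minimal sketch, assuming the thresholds are inclusive; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

MIN_SINCE_LAST_RUN = timedelta(days=7)  # Condition 1, from the text
MIN_IDLE = timedelta(hours=2)           # Condition 2, from the text


def should_run_curator(now: datetime, last_run: datetime, last_activity: datetime) -> bool:
    """Both conditions must hold: enough time since the last pass AND an idle agent."""
    overdue = now - last_run >= MIN_SINCE_LAST_RUN
    idle = now - last_activity >= MIN_IDLE
    return overdue and idle


now = datetime(2026, 5, 10, 12, 0)
overdue_but_busy = should_run_curator(
    now, last_run=datetime(2026, 5, 1, 12, 0), last_activity=datetime(2026, 5, 10, 11, 30)
)
overdue_and_idle = should_run_curator(
    now, last_run=datetime(2026, 5, 1, 12, 0), last_activity=datetime(2026, 5, 10, 9, 0)
)
```

Evaluating this whenever the agent checks for inactivity is what removes the need for a separate cron daemon.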
Two Operating Phases
Phase 1 — Deterministic (no LLM)
Skills unused for 30 days → Stale. Skills unused for 90 days → Archived. No judgement, just timestamp math.
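Phase 1 is plain date arithmetic, which the following sketch makes explicit. The `classify` function and the state strings are hypothetical, and treating the 30-day and 90-day boundaries as inclusive is an assumption.

```python
from datetime import date

STALE_AFTER_DAYS = 30    # from the text
ARCHIVE_AFTER_DAYS = 90  # from the text


def classify(last_used: date, today: date) -> str:
    """Map days since last use to a lifecycle state, with no LLM involved."""
    unused_days = (today - last_used).days
    if unused_days >= ARCHIVE_AFTER_DAYS:
        return "archived"
    if unused_days >= STALE_AFTER_DAYS:
        return "stale"
    return "active"


today = date(2026, 5, 10)
states = [
    classify(date(2026, 5, 1), today),   # 9 days unused
    classify(date(2026, 4, 1), today),   # 39 days unused
    classify(date(2026, 1, 1), today),   # 129 days unused
]
```

Because the result depends only on timestamps, the same inputs always give the same classification, which is what "deterministic" means for this phase.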
Phase 2 — LLM Review (up to 8 iterations)
A forked agent surveys all agent-created skills and decides per skill: keep, patch, consolidate, or archive.
Safety Constraints
✓ Never touches bundled or hub-installed skills. Only agent-authored ones.
✓ Never auto-deletes. Worst outcome: archival to ~/.hermes/skills/.archive/, with one command to restore.
✓ tar.gz snapshot before every pass. Rollback is one command. Rollbacks are themselves reversible.
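A snapshot-then-rollback cycle can be sketched with the standard `tarfile` module. This shows the general technique only: the `snapshot` and `restore` helpers, the archive label, and the temporary directory layout are hypothetical, and the actual Curator commands are not shown in this post.

```python
import tarfile
import tempfile
from pathlib import Path


def snapshot(skills_dir: Path, out_dir: Path, label: str) -> Path:
    """Write skills_dir into out_dir/<label>.tar.gz and return the archive path."""
    archive = out_dir / f"{label}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(skills_dir, arcname=skills_dir.name)
    return archive


def restore(archive: Path, target_parent: Path) -> None:
    """Unpack a snapshot under target_parent, overwriting files it contains."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_parent)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    skills = root / "skills"
    skills.mkdir()
    skill_file = skills / "k8s-pod-debug.md"
    skill_file.write_text("v1")
    archive = snapshot(skills, root, "pre-curator")  # taken before the pass
    skill_file.write_text("v2 (bad edit)")           # the pass changes a skill
    restore(archive, root)                           # rollback
    restored = skill_file.read_text()
```

Taking another snapshot immediately before calling `restore` would be one way to make the rollback itself reversible, as the text describes.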
Pin critical skills: hermes curator pin <skill> permanently protects a skill from archival and deletion. Patches and edits still go through, so the agent can improve a pinned skill without unpinning it first.
Interactive
Skill Lifecycle Simulator
📌 Pinned: protected
⚡ Active: used recently
📅 Stale: 30 days unused
📦 Archived: 90 days unused
↺ Restored: 1 command
Offline Optimization
GEPA — Genetic-Pareto Prompt Evolution
The in-agent learning loop has a known weakness: the agent tends toward self-congratulation. GEPA solves this by reading execution traces instead of asking the agent how it did.
The Problem with Self-Evaluation: The agent almost always thinks it performed well, even when it didn't. Community feedback on Hermes has confirmed this. Worse, the system that auto-generates skills can also overwrite manual customizations with worse versions.
What GEPA Is
GEPA is not built into the Hermes runtime. It lives in a companion repo (NousResearch/hermes-agent-self-evolution) and operates as an offline optimization pipeline.
Core idea: instead of asking "did you do well?", GEPA reads execution traces to understand why things failed, then proposes targeted improvements through evolutionary search.
No GPU required · API calls only · $2–10 per run
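The "Pareto" in the name points at a selection rule in which a candidate survives unless some other candidate is at least as good on every evaluation case and strictly better on one. The sketch below shows generic Pareto-front filtering over per-case scores; it is an assumption about the flavor of selection involved, not code from the companion repo, and the variant names and scores are made up for illustration.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[str, list[float]]) -> list[str]:
    """Keep every candidate that no other candidate dominates."""
    return [
        name
        for name, own in scores.items()
        if not any(dominates(other, own) for rival, other in scores.items() if rival != name)
    ]


# Hypothetical per-case scores for three skill variants on two evaluation cases.
scores = {
    "variant-a": [0.9, 0.2],
    "variant-b": [0.5, 0.8],
    "variant-c": [0.4, 0.1],
}
front = pareto_front(scores)
```

Here `variant-a` and `variant-b` both survive because each wins on a different case, while `variant-c` is dropped because `variant-a` beats it on both. Keeping specialists alive like this preserves diversity for later rounds of the search.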
The 6-Step Pipeline
1
Read the current skill from the Hermes repo
2
Generate evaluation dataset (Claude Opus synthetic cases, real SQLite history, or curated golden sets)
Evaluate candidates with LLM-as-judge rubrics (not binary pass/fail)
5
Apply constraint gates: 100% test suite pass, <15KB skill size, semantic purpose must not drift
6
Best variant goes out as a PR against the Hermes repo — never a direct commit
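The constraint gates in step 5 can be read as a conjunction of three checks. In this sketch the `Candidate` dataclass and `passes_gates` function are hypothetical, the semantic-drift result is taken as a precomputed boolean, and interpreting "<15KB" as 15 × 1024 bytes of UTF-8 is an assumption.

```python
from dataclasses import dataclass

MAX_SKILL_BYTES = 15 * 1024  # "<15KB" from the text; exact byte count is an assumption


@dataclass
class Candidate:
    text: str
    tests_passed: int
    tests_total: int
    purpose_preserved: bool  # would come from a separate semantic check


def passes_gates(c: Candidate) -> bool:
    """A candidate must clear every gate; one failure rejects it."""
    all_tests_pass = c.tests_total > 0 and c.tests_passed == c.tests_total
    small_enough = len(c.text.encode("utf-8")) < MAX_SKILL_BYTES
    return all_tests_pass and small_enough and c.purpose_preserved


good = Candidate("## Procedure\n1. Check pod events", 12, 12, True)
flaky = Candidate("## Procedure\n1. Check pod events", 11, 12, True)
bloated = Candidate("x" * 20_000, 12, 12, True)
results = [passes_gates(good), passes_gates(flaky), passes_gates(bloated)]
```

Gates differ from the rubric scores in step 4: a rubric ranks candidates, while a gate is a hard yes or no that a high score cannot override.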
Optimization Comparison
GEPA vs GRPO — Same Rollout, Different Signal
GRPO (used in Blog 44's RLHF section and Blog 6's GRPO deep-dive) and GEPA both improve agent behavior from experience. But they handle the signal very differently.
GRPO discards the signal. GEPA reads it. Try GEPA before moving to full fine-tuning or RL-based fine-tuning. It's a useful intermediate step when you've hit a wall on skill quality and want to avoid the infrastructure cost of GRPO.
Summary
The Full Hermes Learning Loop
Four layers, each with a distinct role. Remove any one of them and you lose a meaningful piece of the self-improvement story.
🧐
SOUL.md
Sets the identity. The fixed frame through which everything else operates.
🔄
Runtime Loop
Captures experience. ReAct loop that writes memories and creates skills autonomously.
🧹
Curator
Keeps the library clean. Prunes stale skills and consolidates overlapping ones.
🧬
GEPA
Makes sure what's in the library actually works. Offline validation via execution traces.