🧠
Hermes Agent
The Self-Improving Agent Architecture · Visual Summary
Incorrect password. Try again.
SOUL.md
Memory
Skills
Curator
GEPA

Hermes Agent: The Agent That Gets Better

Hermes Agent crossed 148,000 GitHub stars by May 2026 — reaching 90K in just two months. It ships with a learning loop that no other open-source agent combines: runtime skill learning, persistent multi-layer memory, and an offline optimization pipeline.

148K
GitHub stars
as of May 2026
90K
Stars in
first 2 months
47
Built-in tools
dispatched
687
Skills across
18 categories
The One-Line Pitch
"An agent that gets better the longer you use it."
Three usually separate capabilities sit in one framework: runtime skill learning, persistent multi-layer memory, and an optional weight-free training pipeline. No other open-source agent ships all three.
The 4-Part Learning Loop
1
Remembers across sessions
Three-tier memory persists facts, procedures, and conversation history.
2
Writes its own reusable skills
One-time discoveries become permanent procedural memory (SKILL.md files).
3
Prunes them in the background
The Curator consolidates stale and overlapping skills automatically.
4
Validates them offline via GEPA
Genetic-Pareto Prompt Evolution tests and improves skills without GPUs.

How It's Built

Everything flows through a single AIAgent class in run_agent.py. CLI, messaging gateway, IDE, batch runner — all entry points into the same core agent. This is what makes "platform-agnostic" actually work.

The Core Loop — ReAct Style
Build prompt Check compression API call (interruptible) Execute tool calls Loop (max 90 turns)
Entry Points
❯_
CLI
Terminal chat interface, primary mode
•••
Gateway
Telegram, Slack, Discord (18+ channels)
</>
ACP (IDE)
VS Code, Cursor, JetBrains integration
Batch
Automated pipeline runs
API Server
REST endpoint for programmatic access
Key Design Decisions
6 Execution Backends
Local terminal, Docker, SSH, Modal, Daytona, or Singularity — same code, one config change. Move from laptop to cloud GPU without touching the agent.
3 API Modes (Any Model)
A translation layer routes any provider through one of three API formats. Swap Claude → GPT → Gemini → Ollama with one command.
90-Turn Hard Cap
Without it, a stuck agent (retrying a failing API, re-reading the same file) silently burns credits. Subagents share the same budget — no runaway delegation chains.
Tool Backends (47 tools)
Terminal (6 backends) Browser Web MCP File Vision TTS SQLite skill_manage session_search cronjob delegate_task

Hermes vs OpenClaw

The closest open-source comparison. Both are persistent and messaging-friendly, but they make opposite architectural choices. Post 11 covers OpenClaw in depth →

"Hermes packages a gateway around a learning agent. OpenClaw packages an agent around a messaging gateway."
— Kilo blog
OpenClaw
Gateway-first. WebSocket routing with agent attached. 50+ messaging channels, 13,700+ community skills. Memory is plain Markdown files.
Skills stay static — no learning loop. But the community skill ecosystem is vastly larger.
6 CVEs in 2026, 341+ malicious skills flagged, 135K+ exposed Shodan instances.
Hermes Agent
Agent-first. Learning runtime with gateway as one entry point. 18+ channels, ~120 bundled skills + GitHub taps. Memory is three-tier (Markdown + SQLite + 8 providers).
Skills self-evolve, Curator prunes, GEPA optimizes offline. Smaller channel count but deeper learning.
Zero agent-specific CVEs as of April 2026. Snapshot rollback before every file op.
DimensionOpenClawHermes Agent
ArchitectureGateway-first, WebSocket routingAgent-first, learning runtime
Channel breadth50+ messaging channels18+ focused channels
Skill ecosystem13,700+ community skills~120 bundled + skills.sh + GitHub taps
Learning loopSkills stay staticSelf-evolve, Curator prunes, GEPA optimizes
MemoryPlain Markdown filesThree-tier: bounded MD, FTS5 search, 8 providers
Safety6 CVEs, 341+ malicious skills flaggedZero agent CVEs, snapshot rollback
When to choose Hermes: You want an agent that improves over time with your workflow, care about safety, and don't need 50 messaging channels. When to choose OpenClaw: You need the broadest channel support or the largest pre-built skill library, and don't need runtime learning.

SOUL.md — Before Memory, Before Skills

Memory is what the agent knows. Skills are how it does things. But neither tells you who it is when it shows up. SOUL.md solves this — it's the identity that everything else flows through.

What SOUL.md Is
1
Slot #1 in the system prompt
Loaded before MEMORY.md, USER.md, or any skills. The fixed foundation everything else builds on.
2
Hand-authored and static
You write it once, tweak it over time. Never auto-generated — unlike memories and skills.
3
Defines personality, tone, limits
Communication style, hard limits, what to optimize for, what to refuse.
4
Per-profile isolation
Each profile (designer, programmer, researcher) has its own SOUL.md. Same engine, genuinely different agents.
# ~/.hermes/SOUL.md (default engineer) You are a pragmatic senior engineer with strong taste. You optimize for truth, clarity, and usefulness over politeness theater.
# ~/.hermes/profiles/programmer/SOUL.md You are my staff engineer. Terse, direct, pragmatic. You read code before you write code. You write the smallest change that solves the problem. Standard library over dependencies, boring tech over shiny tech, explicit over clever. Always check: does this already exist? Are there tests? What breaks if this fails? Run the tests before saying "done."
If SOUL.md is missing, Hermes falls back to a built-in default identity. But the fallback is generic — the self-improvement story loses its frame without a custom SOUL.md.
Why it matters for self-improvement: Every memory the agent writes, every skill it creates, every way it consolidates knowledge — all happen through the lens of this identity. SOUL.md is the fixed frame. Memory and skills are the moving parts inside it.

Three-Tier Memory: Three Speeds

Hermes doesn't have a single "memory." Each tier trades speed for capacity. The agent picks the right one for the question.

Tier 1 — In-Prompt
Tiny Markdown Files
MEMORY.md (2,200 chars) + USER.md (1,375 chars). Injected as a frozen snapshot at session start. Prefix-cache safe.
Speed: Instant Capacity: Tiny
Changes written mid-session persist to disk immediately but don't appear in the system prompt until next session. At ~80% capacity, agent must consolidate — merging related entries into denser forms.
Tier 2 — Session Search
SQLite + FTS5
Every conversation (CLI and messaging) stored in SQLite with full-text search. Agent can search weeks of past conversations on demand. Summaries via Gemini Flash.
Speed: On-demand Capacity: ∞
Requires active search + LLM summarization — not free to access. Critical facts live in Tier 1. Everything else is searchable here.
Tier 3 — External Providers
8 Pluggable Backends
8 memory providers that run alongside built-in memory (never replacing it). Only one active at a time. Prefetches before each turn, syncs after response, extracts on session end.
Speed: Slower Capacity: Deep
Each provider has unique features: Honcho (dialectic user modeling), Holographic (HRR algebra + trust scoring), Supermemory (context fencing, multi-container).
ProviderStorageCostUnique Feature
HonchoCloudPaidDialectic user modeling + session-scoped context
OpenVikingSelf-hostedFreeFilesystem hierarchy + tiered loading
Mem0CloudPaidServer-side LLM extraction
HindsightCloud/LocalFree/PaidKnowledge graph + reflect synthesis
HolographicLocalFreeHRR algebra + trust scoring
RetainDBCloud$20/moDelta compression
ByteRoverLocal/CloudFree/PaidPre-compression extraction
SupermemoryCloudPaidContext fencing + session graph + multi-container

Memory System Explorer

Step through what gets stored in each tier, when it's accessed, and how consolidation works.

MEMORY.md (2,200 chars max)
## Environment - Python 3.12, uv package manager - Repo: ~/projects/api-server ## Tool Quirks - docker ps hangs on VPN — use ssh backend ## Lessons Learned - Always run uv sync before testing - main branch needs PR, never force push ## Project State - Auth refactor in progress (branch: feat/auth) [capacity: 71%]
USER.md (1,375 chars max)
## Profile - Name: Bhaskarjit - Timezone: IST (UTC+5:30) - Skill level: Senior engineer ## Preferences - Terse responses — no summaries - Prefer standard library - Dark mode code blocks ## Avoid - Emoji in code comments - Long explanations when brief works [capacity: 58%]
Both files are loaded as a frozen snapshot when the session starts. Mid-session writes persist to disk but don't enter the active context until next session. This is intentional — it prevents mid-conversation context drift.

Self-Evolving Skills

Memory handles facts. Skills handle procedures. They are Markdown files with YAML frontmatter — the agent's procedural memory that it writes, reads, and improves autonomously.

Anatomy of a Skill File
--- name: k8s-pod-debug description: > Activate for crashing pods, CrashLoopBackOff, "why is my pod restarting", container failures. version: 1.2.0 author: agent platforms: [linux, macos] --- ## Procedure 1. Get pod status → check events → pull logs 2. Look for OOMKilled, ImagePullBackOff, config errors ## Pitfalls - Forgetting --previous flag on restarted containers ## Verification - Pod stays Running with 0 restarts for 5+ minutes
Progressive Disclosure — Token Efficiency
Level 0
Names + descriptions only. ~3k tokens for the full catalog of 687 skills. Always in context.
Level 1
Full skill content loaded when the agent determines this skill is needed. On-demand.
Level 2
Drill into specific reference files within a skill (scripts, configs, golden test sets).
When Skills Are Created
The agent creates skills autonomously via the skill_manage tool. Creation triggers when:
Complex task completed (5+ tool calls)
Errors or dead ends found and resolved
User corrects the agent's approach
Non-trivial workflow discovered
skill_manage Actions
create patch (preferred) edit (full rewrite) delete write_file remove_file
patch is preferred — it's token-efficient (targeted fix, not full rewrite).
The Self-Improvement Loop
Problem
e.g. CrashLoopBackOff
Trial & Error
5+ tool calls, retries
Working Solution
Found
skill_manage
create / patch
SKILL.md Saved
~/.hermes/skills/
Next Session
Loads at Level 0
One-time discoveries become permanent procedural memory.

The Curator — Background Skill Maintenance

Without maintenance, agent-created skills pile up — dozens of narrow, overlapping playbooks that waste tokens and pollute the catalog. The Curator handles this automatically.

Trigger Conditions
The Curator runs on an inactivity check, not a cron daemon:
Condition 1: 7+ days since last Curator run
AND
Condition 2: Agent idle for 2+ hours
Background fork spins up with its own prompt cache — never touching the active conversation.
Two Operating Phases
Phase 1 — Deterministic (no LLM)
Skills unused for 30 days → Stale. Skills unused for 90 days → Archived. No judgement, just timestamp math.
Phase 2 — LLM Review (up to 8 iterations)
A forked agent surveys all agent-created skills and decides per-skill: keep patch consolidate archive
Safety Constraints
Never touches bundled or hub-installed skills. Only agent-authored ones.
Never auto-deletes. Worst outcome: archival to ~/.hermes/skills/.archive/ — one command to restore.
tar.gz snapshot before every pass. Rollback is one command. Rollbacks are themselves reversible.
Pin critical skills: hermes curator pin <skill> protects a skill from archival and deletion permanently. Patches and edits still go through — the agent can improve a pinned skill without unpinning it first.

Skill Lifecycle Simulator

Click each state to see what triggers the transition and what it means.

📌
Pinned
Protected
Active
Used recently
📅
Stale
30d unused
📦
Archived
90d unused
Restored
1 command
Click a state above to learn about it.

GEPA — Genetic-Pareto Prompt Evolution

The in-agent learning loop has a known weakness: the agent tends toward self-congratulation. GEPA solves this by reading execution traces instead of asking the agent how it did.

The Problem with Self-Evaluation: The agent almost always thinks it performed well, even when it didn't. Community feedback on Hermes has confirmed this. Worse, the system that auto-generates skills can also overwrite manual customizations with worse versions.
What GEPA Is
GEPA is not built into the Hermes runtime. It lives in a companion repo (NousResearch/hermes-agent-self-evolution) and operates as an offline optimization pipeline.
Core idea: instead of asking "did you do well?", GEPA reads execution traces to understand why things failed, then proposes targeted improvements through evolutionary search.
No GPU required API calls only $2–10 per run
The 6-Step Pipeline
1
Read the current skill from the Hermes repo
2
Generate evaluation dataset (Claude Opus synthetic cases, real SQLite history, or curated golden sets)
3
Run GEPA optimizer: read traces → understand failure points → generate candidate variants
4
Evaluate candidates with LLM-as-judge rubrics (not binary pass/fail)
5
Apply constraint gates: 100% test suite pass, <15KB skill size, semantic purpose must not drift
6
Best variant goes out as a PR against the Hermes repo — never a direct commit

GEPA vs GRPO — Same Rollout, Different Signal

GRPO (used in Blog 44's RLHF section and Blog 6's GRPO deep-dive) and GEPA both improve agent behavior from experience. But they handle the signal very differently.

GRPO
IN
Reads execution trace (reasoning steps, tool calls, compiler errors, judge rationale)
Reduces to a scalar reward (+1 or -1). Discards the full trace signal.
Spreads policy gradient across all tokens in the rollout. Which module broke? Unclear.
OUT
Updates model weights. Needs ~24,000 rollouts. Opaque — hard to inspect what changed.
Signal kept: 1 bit (scalar reward)
GEPA
IN
Reads the full execution trace with a reflection LLM. Keeps all the signal.
Diagnosis localized to one module. Understands exactly which step failed and why.
Generates a new prompt for only that module. Module A, B, or C — only the broken one updates.
OUT
Updates SKILL.md prompts (not weights). Needs ~678 rollouts. Readable — diff the SKILL.md.
Signal kept: full trace
DimensionGRPOGEPA
Rollouts needed~24,000~678
What updatesModel weightsSkill/prompt files
InterpretabilityOpaque (weight delta)Readable (diff the SKILL.md)
GPU requiredYes (large clusters)No (API calls only)
Cost$$$$ (compute)$2–10 per run
Signal used1 bit (scalar)Full trace
GEPA discards the signal. GEPA reads it. Try GEPA before moving to full fine-tuning or RL-based fine-tuning. It's a great intermediate step when you've hit a wall on skill quality without the infrastructure cost of GRPO.

The Full Hermes Learning Loop

Four layers, each with a distinct role. Remove any one of them and you lose a meaningful piece of the self-improvement story.

🧐
SOUL.md
Sets the identity. The fixed frame through which everything else operates.
🔄
Runtime Loop
Captures experience. ReAct loop that writes memories and creates skills autonomously.
🧹
Curator
Keeps the library clean. Prunes stale skills and consolidates overlapping ones.
🧬
GEPA
Makes sure what's in the library actually works. Offline validation via execution traces.
vs Other Agent Approaches
ApproachIdentityMemoryLearns Skills?Prunes Skills?Offline Validation?
Plain ReAct AgentSystem promptNone / windowNoNoNo
OpenClawSystem promptMarkdown filesNo (static)NoNo
Hermes AgentSOUL.md (slot #1)3-tier (MD + SQLite + 8 providers)Yes (skill_manage)Yes (Curator)Yes (GEPA)
← Previous Post
Post 45 — Foundations of RL
Next Post →
Post 47 — GCG Attack