Introduction

Hermes Agent: The Agent That Gets Better

Hermes Agent crossed 148,000 GitHub stars by May 2026, reaching 90K in its first two months. It ships with a learning loop that combines three pieces no other open-source agent ships together: runtime skill learning, persistent multi-layer memory, and an offline optimization pipeline.

148K GitHub stars (as of May 2026)
90K stars in the first 2 months
47 built-in tools dispatched
687 skills across 18 categories

The One-Line Pitch

"An agent that gets better the longer you use it."

Three usually separate capabilities sit in one framework: runtime skill learning, persistent multi-layer memory, and an optional weight-free training pipeline. No other open-source agent ships all three.

The 4-Part Learning Loop

1. Remembers across sessions
Three-tier memory persists facts, procedures, and conversation history.

2. Writes its own reusable skills
One-time discoveries become permanent procedural memory (SKILL.md files).

3. Prunes them in the background
The Curator consolidates stale and overlapping skills automatically.

4. Validates them offline via GEPA
Genetic-Pareto Prompt Evolution tests and improves skills without GPUs.

Architecture

How It's Built

Everything flows through a single AIAgent class in run_agent.py. CLI, messaging gateway, IDE, batch runner — all entry points into the same core agent. This is what makes "platform-agnostic" actually work.

The Core Loop — ReAct Style

Build prompt → Check compression → API call (interruptible) → Execute tool calls → Loop (max 90 turns)
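The loop above can be sketched in a few lines. This is a minimal illustration, not the code in run_agent.py: `ModelReply`, `run_loop`, the `call_model` callable, and the "keep the last 50 entries" compression stub are all assumptions made for the example. Only the step order and the 90-turn cap come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_TURNS = 90  # hard cap stated in the article


@dataclass
class ModelReply:
    """Hypothetical shape of one model response."""
    text: str
    tool_calls: list[tuple[str, dict]] = field(default_factory=list)


def run_loop(
    call_model: Callable[[list[str]], ModelReply],
    tools: dict[str, Callable[..., str]],
    user_message: str,
    max_turns: int = MAX_TURNS,
) -> str:
    history = [f"user: {user_message}"]
    for _ in range(max_turns):
        # Build prompt and check compression (stub: keep the last 50 entries).
        if len(history) > 50:
            history = history[-50:]
        # API call (interruptible in the real agent; a plain call here).
        reply = call_model(history)
        history.append(f"assistant: {reply.text}")
        # No tool calls means the model has produced its final answer.
        if not reply.tool_calls:
            return reply.text
        # Execute tool calls and feed the results back into the history.
        for name, args in reply.tool_calls:
            history.append(f"tool[{name}]: {tools[name](**args)}")
    return "stopped: turn limit reached"
```

A model that keeps requesting tools is cut off after `max_turns` iterations instead of looping indefinitely, which is the failure mode the cap exists to contain.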
    

Entry Points

CLI: Terminal chat interface, primary mode

Gateway: Telegram, Slack, Discord (18+ channels)

ACP (IDE): VS Code, Cursor, JetBrains integration

Batch: Automated pipeline runs

API Server: REST endpoint for programmatic access

Key Design Decisions

6 Execution Backends

Local terminal, Docker, SSH, Modal, Daytona, or Singularity — same code, one config change. Move from laptop to cloud GPU without touching the agent.

3 API Modes (Any Model)

A translation layer routes any provider through one of three API formats. Swap Claude → GPT → Gemini → Ollama with one command.

90-Turn Hard Cap

Without it, a stuck agent (retrying a failing API, re-reading the same file) silently burns credits. Subagents share the same budget — no runaway delegation chains.
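One way to picture the shared budget is a single counter object handed to every subagent. The sketch below is an assumption about the mechanism, not Hermes source: `TurnBudget` and `run_agent` are hypothetical names, and real delegation goes through the delegate_task tool.

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    """Hypothetical counter shared by a parent agent and its subagents."""
    remaining: int = 90

    def spend(self) -> bool:
        """Consume one turn; return False once the budget is exhausted."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def run_agent(budget: TurnBudget, turns_wanted: int, delegate_turns: int = 0) -> int:
    """Run up to `turns_wanted` turns, optionally delegating to one subagent.

    Returns the number of turns actually executed, including the subagent's.
    """
    used = 0
    for _ in range(turns_wanted):
        if not budget.spend():
            break
        used += 1
    if delegate_turns:
        # The subagent receives the same budget object, not a fresh one.
        used += run_agent(budget, delegate_turns)
    return used
```

With this shape, a parent that spends 60 turns and then delegates a 60-turn task leaves the subagent only the 30 turns that remain, so delegation cannot multiply the spend.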

Tool Backends (47 tools)

Terminal (6 backends), Browser, Web, MCP, File, Vision, TTS, SQLite, skill_manage, session_search, cronjob, delegate_task

Framework Comparison

Hermes vs OpenClaw

OpenClaw is the closest open-source comparison. Both are persistent and messaging-friendly, but they make opposite architectural choices. Post 11 covers OpenClaw in depth.

"Hermes packages a gateway around a learning agent. OpenClaw packages an agent around a messaging gateway."
— Kilo blog

OpenClaw

Gateway-first. WebSocket routing with agent attached. 50+ messaging channels, 13,700+ community skills. Memory is plain Markdown files.

Skills stay static — no learning loop. But the community skill ecosystem is vastly larger.

6 CVEs in 2026, 341+ malicious skills flagged, 135K+ exposed Shodan instances.

Hermes Agent

Agent-first. Learning runtime with gateway as one entry point. 18+ channels, ~120 bundled skills + GitHub taps. Memory is three-tier (Markdown + SQLite + 8 providers).

Skills self-evolve, Curator prunes, GEPA optimizes offline. Smaller channel count but deeper learning.

Zero agent-specific CVEs as of April 2026. Snapshot rollback before every file op.

Dimension	OpenClaw	Hermes Agent
Architecture	Gateway-first, WebSocket routing	Agent-first, learning runtime
Channel breadth	50+ messaging channels	18+ focused channels
Skill ecosystem	13,700+ community skills	~120 bundled + skills.sh + GitHub taps
Learning loop	Skills stay static	Self-evolve, Curator prunes, GEPA optimizes
Memory	Plain Markdown files	Three-tier: bounded MD, FTS5 search, 8 providers
Safety	6 CVEs, 341+ malicious skills flagged	Zero agent CVEs, snapshot rollback

When to choose Hermes: You want an agent that improves over time with your workflow, care about safety, and don't need 50 messaging channels.

When to choose OpenClaw: You need the broadest channel support or the largest pre-built skill library, and don't need runtime learning.

Identity Layer

SOUL.md — Before Memory, Before Skills

Memory is what the agent knows. Skills are how it does things. But neither tells you who it is when it shows up. SOUL.md solves this — it's the identity that everything else flows through.

What SOUL.md Is

1. Slot #1 in the system prompt
Loaded before MEMORY.md, USER.md, or any skills. The fixed foundation everything else builds on.

2. Hand-authored and static
You write it once, tweak it over time. Never auto-generated, unlike memories and skills.

3. Defines personality, tone, limits
Communication style, hard limits, what to optimize for, what to refuse.

4. Per-profile isolation
Each profile (designer, programmer, researcher) has its own SOUL.md. Same engine, genuinely different agents.

# ~/.hermes/SOUL.md  (default engineer)

You are a pragmatic senior engineer
with strong taste.
You optimize for truth, clarity,
and usefulness over politeness theater.

# ~/.hermes/profiles/programmer/SOUL.md

You are my staff engineer.
Terse, direct, pragmatic.

You read code before you write code.
You write the smallest change that
solves the problem. Standard library
over dependencies, boring tech over
shiny tech, explicit over clever.

Always check: does this already exist?
Are there tests? What breaks if this fails?
Run the tests before saying "done."

If SOUL.md is missing, Hermes falls back to a built-in default identity. But the fallback is generic — the self-improvement story loses its frame without a custom SOUL.md.

Why it matters for self-improvement: Every memory the agent writes, every skill it creates, every way it consolidates knowledge — all happen through the lens of this identity. SOUL.md is the fixed frame. Memory and skills are the moving parts inside it.

Memory System

Three-Tier Memory: Three Speeds

Hermes doesn't have a single "memory." Each tier trades speed for capacity. The agent picks the right one for the question.

Tier 1 — In-Prompt

Tiny Markdown Files

MEMORY.md (2,200 chars) + USER.md (1,375 chars). Injected as a frozen snapshot at session start. Prefix-cache safe.

Speed: Instant | Capacity: Tiny

Changes written mid-session persist to disk immediately but don't appear in the system prompt until the next session. At ~80% capacity, the agent must consolidate, merging related entries into denser forms.
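The capacity rule is plain arithmetic: 80% of 2,200 characters is 1,760, and 80% of 1,375 is 1,100. A minimal sketch of the check follows, assuming capacity is measured in characters as the limits suggest. The function names and the exact `>=` comparison are illustrative, since the article only gives an approximate threshold.

```python
MEMORY_LIMIT = 2_200   # chars, MEMORY.md (from the article)
USER_LIMIT = 1_375     # chars, USER.md (from the article)
CONSOLIDATE_AT = 0.80  # approximate threshold from the article


def capacity(text: str, limit: int) -> float:
    """Fraction of the character budget currently used."""
    return len(text) / limit


def needs_consolidation(text: str, limit: int) -> bool:
    """True once the file has reached the consolidation threshold."""
    return capacity(text, limit) >= CONSOLIDATE_AT
```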

Tier 2 — Session Search

SQLite + FTS5

Every conversation (CLI and messaging) stored in SQLite with full-text search. Agent can search weeks of past conversations on demand. Summaries via Gemini Flash.

Speed: On-demand | Capacity: ∞

Requires active search + LLM summarization — not free to access. Critical facts live in Tier 1. Everything else is searchable here.
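For readers unfamiliar with FTS5, here is a small self-contained example of full-text search over stored messages using Python's built-in sqlite3 module. The table name, columns, and sample rows are invented for illustration and are not Hermes's actual schema. It also assumes the local SQLite build has FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; requires an SQLite build with FTS5 enabled.
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(session_id, role, content)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("s1", "user", "docker ps hangs when the VPN is connected"),
        ("s1", "assistant", "switch to the ssh backend for container commands"),
        ("s2", "user", "how do I run the auth refactor tests"),
    ],
)


def search_sessions(query: str, limit: int = 5) -> list[tuple[str, str]]:
    """Return (session_id, content) rows ranked by FTS5 relevance."""
    rows = conn.execute(
        "SELECT session_id, content FROM messages "
        "WHERE messages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Calling `search_sessions("vpn")` returns the first row, because the default tokenizer matches case-insensitively. In the flow the article describes, hits like this would then be passed to an LLM for summarization.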

Tier 3 — External Providers

8 Pluggable Backends

Eight memory providers run alongside built-in memory and never replace it. Only one is active at a time. The active provider prefetches before each turn, syncs after each response, and extracts on session end.

Speed: Slower | Capacity: Deep

Each provider has unique features: Honcho (dialectic user modeling), Holographic (HRR algebra + trust scoring), Supermemory (context fencing, multi-container).

Provider	Storage	Cost	Unique Feature
Honcho	Cloud	Paid	Dialectic user modeling + session-scoped context
OpenViking	Self-hosted	Free	Filesystem hierarchy + tiered loading
Mem0	Cloud	Paid	Server-side LLM extraction
Hindsight	Cloud/Local	Free/Paid	Knowledge graph + reflect synthesis
Holographic	Local	Free	HRR algebra + trust scoring
RetainDB	Cloud	$20/mo	Delta compression
ByteRover	Local/Cloud	Free/Paid	Pre-compression extraction
Supermemory	Cloud	Paid	Context fencing + session graph + multi-container

Memory System Explorer

Step through what gets stored in each tier, when it's accessed, and how consolidation works.

MEMORY.md (2,200 chars max)

## Environment
- Python 3.12, uv package manager
- Repo: ~/projects/api-server

## Tool Quirks
- docker ps hangs on VPN — use ssh backend

## Lessons Learned
- Always run uv sync before testing
- main branch needs PR, never force push

## Project State
- Auth refactor in progress (branch: feat/auth)
[capacity: 71%]

USER.md (1,375 chars max)

## Profile
- Name: Bhaskarjit
- Timezone: IST (UTC+5:30)
- Skill level: Senior engineer

## Preferences
- Terse responses — no summaries
- Prefer standard library
- Dark mode code blocks

## Avoid
- Emoji in code comments
- Long explanations when brief works
[capacity: 58%]

Both files are loaded as a frozen snapshot when the session starts. Mid-session writes persist to disk but don't enter the active context until next session. This is intentional — it prevents mid-conversation context drift.
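The frozen-snapshot behavior can be illustrated with a small class. `MemorySession` is a hypothetical name, and the real agent's file handling may differ. The sketch only shows the two properties described above: writes reach disk at once, and the in-prompt copy does not change until a new session starts.

```python
import tempfile
from pathlib import Path


class MemorySession:
    """Hypothetical wrapper: snapshot at start, writes go to disk only."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Read once; this string is what the system prompt sees all session.
        self.snapshot = path.read_text() if path.exists() else ""

    def write(self, entry: str) -> None:
        # Persist immediately, but leave the in-prompt snapshot untouched.
        with self.path.open("a") as f:
            f.write(entry + "\n")


memory_file = Path(tempfile.mkdtemp()) / "MEMORY.md"
memory_file.write_text("- Python 3.12, uv package manager\n")

first = MemorySession(memory_file)
first.write("- Always run uv sync before testing")
second = MemorySession(memory_file)  # the next session picks up the write
```

Because `first.snapshot` never changes, the system prompt prefix stays byte-identical for the whole session, which is what keeps it prefix-cache safe.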

Procedural Memory

Self-Evolving Skills

Memory handles facts. Skills handle procedures. They are Markdown files with YAML frontmatter — the agent's procedural memory that it writes, reads, and improves autonomously.

Anatomy of a Skill File

---
name: k8s-pod-debug
description: >
  Activate for crashing pods, CrashLoopBackOff,
  "why is my pod restarting", container failures.
version: 1.2.0
author: agent
platforms: [linux, macos]
---

## Procedure
1. Get pod status → check events → pull logs
2. Look for OOMKilled, ImagePullBackOff, config errors

## Pitfalls
- Forgetting --previous flag on restarted containers

## Verification
- Pod stays Running with 0 restarts for 5+ minutes

Progressive Disclosure — Token Efficiency

Level 0

Names + descriptions only. ~3k tokens for the full catalog of 687 skills. Always in context.

Level 1

Full skill content loaded when the agent determines this skill is needed. On-demand.

Level 2

Drill into specific reference files within a skill (scripts, configs, golden test sets).
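A rough sketch of the first two disclosure levels follows. The `Skill` dataclass and both function names are assumptions for illustration. The point is only that the always-in-context catalog carries names and descriptions, while the full body is fetched by name when needed.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    body: str  # full SKILL.md content below the frontmatter


def level0_catalog(skills: list[Skill]) -> str:
    """Names and descriptions only; always kept in context."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)


def level1_load(skills: list[Skill], name: str) -> str:
    """Full content for one skill, loaded on demand."""
    for s in skills:
        if s.name == name:
            return s.body
    raise KeyError(name)


skills = [
    Skill(
        name="k8s-pod-debug",
        description="Activate for crashing pods, CrashLoopBackOff.",
        body="## Procedure\n1. Get pod status, check events, pull logs\n",
    )
]
```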

When Skills Are Created

The agent creates skills autonomously via the skill_manage tool. Creation triggers when:

▸ Complex task completed (5+ tool calls)

▸ Errors or dead ends found and resolved

▸ User corrects the agent's approach

▸ Non-trivial workflow discovered
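Read as a rule, the triggers above amount to a simple predicate. In practice the agent itself presumably makes this call through its prompt rather than through fixed code, so treat the sketch below as a paraphrase. `TaskOutcome` and its fields are hypothetical, and combining the triggers with OR is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    """Hypothetical summary of a finished task."""
    tool_calls: int
    resolved_errors: bool = False
    user_corrected: bool = False
    nontrivial_workflow: bool = False


def should_create_skill(outcome: TaskOutcome) -> bool:
    """True if any one of the listed triggers fired (assumed OR semantics)."""
    return (
        outcome.tool_calls >= 5
        or outcome.resolved_errors
        or outcome.user_corrected
        or outcome.nontrivial_workflow
    )
```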

skill_manage Actions

create, patch (preferred), edit (full rewrite), delete, write_file, remove_file

patch is preferred — it's token-efficient (targeted fix, not full rewrite).

The Self-Improvement Loop

Problem (e.g. CrashLoopBackOff)
→ Trial & Error (5+ tool calls, retries)
→ Working Solution (found)
→ skill_manage (create / patch)
→ SKILL.md Saved (~/.hermes/skills/)
→ Next Session (loads at Level 0)

One-time discoveries become permanent procedural memory.

Garbage Collection

The Curator — Background Skill Maintenance

Without maintenance, agent-created skills pile up — dozens of narrow, overlapping playbooks that waste tokens and pollute the catalog. The Curator handles this automatically.

Trigger Conditions

The Curator runs on an inactivity check, not a cron daemon:

Condition 1: 7+ days since last Curator run
AND
Condition 2: Agent idle for 2+ hours

When both conditions hold, a background fork spins up with its own prompt cache, never touching the active conversation.
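The trigger is a conjunction of two time comparisons. A minimal sketch, with hypothetical names and timestamps passed in explicitly so the check is easy to test:

```python
from datetime import datetime, timedelta

MIN_SINCE_LAST_RUN = timedelta(days=7)
MIN_IDLE = timedelta(hours=2)


def curator_due(now: datetime, last_run: datetime, last_activity: datetime) -> bool:
    """Both conditions must hold: 7+ days since last run AND 2+ hours idle."""
    return (now - last_run >= MIN_SINCE_LAST_RUN) and (now - last_activity >= MIN_IDLE)
```

Because the check is evaluated against the agent's own activity rather than a wall-clock schedule, no separate daemon is needed to run it.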

Two Operating Phases

Phase 1 — Deterministic (no LLM)

Skills unused for 30 days → Stale. Skills unused for 90 days → Archived. No judgement, just timestamp math.
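The "timestamp math" of Phase 1 might look like the following. The function name, the string state labels, and the inclusive `>=` boundaries are assumptions. The 30-day and 90-day thresholds come from the text, as does the exemption for pinned skills.

```python
from datetime import datetime, timedelta


def classify(last_used: datetime, now: datetime, pinned: bool = False) -> str:
    """Map time since last use to a lifecycle state, per the 30/90-day rule."""
    if pinned:
        return "pinned"  # pinned skills are exempt from archival
    idle = now - last_used
    if idle >= timedelta(days=90):
        return "archived"
    if idle >= timedelta(days=30):
        return "stale"
    return "active"
```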

Phase 2 — LLM Review (up to 8 iterations)

A forked agent surveys all agent-created skills and decides per-skill: keep, patch, consolidate, or archive.

Safety Constraints

✓ Never touches bundled or hub-installed skills. Only agent-authored ones.

✓ Never auto-deletes. Worst outcome: archival to ~/.hermes/skills/.archive/, with one command to restore.

✓ tar.gz snapshot before every pass. Rollback is one command. Rollbacks are themselves reversible.
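A pre-pass snapshot of this kind can be produced with Python's standard tarfile module. The sketch below is illustrative only: the function name, archive naming scheme, and temporary directory layout are assumptions, not the Curator's actual implementation.

```python
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def snapshot_skills(skills_dir: Path, backup_dir: Path) -> Path:
    """Write skills_dir into a timestamped tar.gz and return the archive path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = backup_dir / f"skills-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(skills_dir, arcname="skills")
    return archive


# Demo against a throwaway directory rather than ~/.hermes/skills/.
root = Path(tempfile.mkdtemp())
(root / "skills" / "k8s-pod-debug").mkdir(parents=True)
(root / "skills" / "k8s-pod-debug" / "SKILL.md").write_text("## Procedure\n")
archive = snapshot_skills(root / "skills", root / "backups")
```

Keeping the backup directory outside the skills directory avoids archiving earlier snapshots into later ones.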

Pin critical skills: hermes curator pin <skill> protects a skill from archival and deletion permanently. Patches and edits still go through — the agent can improve a pinned skill without unpinning it first.

Skill Lifecycle Simulator

📌 Pinned: Protected
⚡ Active: Used recently
📅 Stale: 30d unused
📦 Archived: 90d unused
↺ Restored: 1 command

Offline Optimization

GEPA — Genetic-Pareto Prompt Evolution

The in-agent learning loop has a known weakness: the agent tends toward self-congratulation. GEPA solves this by reading execution traces instead of asking the agent how it did.

The Problem with Self-Evaluation: The agent almost always thinks it performed well, even when it didn't. Community feedback on Hermes has confirmed this. Worse, the system that auto-generates skills can also overwrite manual customizations with worse versions.

What GEPA Is

GEPA is not built into the Hermes runtime. It lives in a companion repo (NousResearch/hermes-agent-self-evolution) and operates as an offline optimization pipeline.

Core idea: instead of asking "did you do well?", GEPA reads execution traces to understand why things failed, then proposes targeted improvements through evolutionary search.

No GPU required, API calls only, $2–10 per run.

The 6-Step Pipeline

1. Read the current skill from the Hermes repo.
2. Generate an evaluation dataset (Claude Opus synthetic cases, real SQLite history, or curated golden sets).
3. Run the GEPA optimizer: read traces → understand failure points → generate candidate variants.
4. Evaluate candidates with LLM-as-judge rubrics (not binary pass/fail).
5. Apply constraint gates: 100% test suite pass, <15KB skill size, semantic purpose must not drift.
6. The best variant goes out as a PR against the Hermes repo, never a direct commit.
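The constraint gates in step 5 and the selection in step 6 can be sketched as a filter followed by a max. Everything named here is hypothetical, including `Candidate` and its fields. The sketch reads "<15KB" as 15 KiB and assumes the semantic-drift check is computed elsewhere and arrives as a boolean. The companion repo's real data structures are not described in the text.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SKILL_BYTES = 15 * 1024  # "<15KB", read here as 15 KiB (an assumption)


@dataclass
class Candidate:
    """Hypothetical record for one evolved skill variant."""
    text: str
    tests_passed: int
    tests_total: int
    purpose_preserved: bool  # assumed to come from a separate semantic check
    judge_score: float       # rubric score from the LLM judge


def passes_gates(c: Candidate) -> bool:
    """All three gates must hold for a candidate to be eligible."""
    return (
        c.tests_total > 0
        and c.tests_passed == c.tests_total                # 100% test suite pass
        and len(c.text.encode("utf-8")) < MAX_SKILL_BYTES  # size limit
        and c.purpose_preserved                            # no semantic drift
    )


def best_variant(candidates: list[Candidate]) -> Optional[Candidate]:
    """Highest-scoring candidate among those that clear every gate."""
    eligible = [c for c in candidates if passes_gates(c)]
    return max(eligible, key=lambda c: c.judge_score, default=None)
```

Treating the gates as hard filters means a variant with a higher judge score still loses if it fails a single test or grows past the size limit.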

Optimization Comparison

GEPA vs GRPO — Same Rollout, Different Signal

GRPO (used in Blog 44's RLHF section and Blog 6's GRPO deep-dive) and GEPA both improve agent behavior from experience. But they handle the signal very differently.

GRPO

IN: Reads execution trace (reasoning steps, tool calls, compiler errors, judge rationale)
⟶ Reduces to a scalar reward (+1 or -1). Discards the full trace signal.
⟶ Spreads policy gradient across all tokens in the rollout. Which module broke? Unclear.
OUT: Updates model weights. Needs ~24,000 rollouts. Opaque: hard to inspect what changed.

Signal kept: 1 bit (scalar reward)

GEPA

IN: Reads the full execution trace with a reflection LLM. Keeps all the signal.
⟶ Diagnosis localized to one module. Understands exactly which step failed and why.
⟶ Generates a new prompt for only that module. Module A, B, or C: only the broken one updates.
OUT: Updates SKILL.md prompts (not weights). Needs ~678 rollouts. Readable: diff the SKILL.md.

Signal kept: full trace

Dimension	GRPO	GEPA
Rollouts needed	~24,000	~678
What updates	Model weights	Skill/prompt files
Interpretability	Opaque (weight delta)	Readable (diff the SKILL.md)
GPU required	Yes (large clusters)	No (API calls only)
Cost	$$$$ (compute)	$2–10 per run
Signal used	1 bit (scalar)	Full trace
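Taking the two rollout counts in the table at face value, the implied sample-efficiency ratio is simple arithmetic:

```latex
\frac{24{,}000}{678} \approx 35.4
```

On these figures, GEPA needs roughly 35 times fewer rollouts than GRPO. Both counts are approximate, so the ratio is too.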

GRPO discards the signal. GEPA reads it. Try GEPA before moving to full fine-tuning or RL-based fine-tuning. It's a useful intermediate step when you've hit a wall on skill quality and want to avoid the infrastructure cost of GRPO.

Summary

The Full Hermes Learning Loop

Four layers, each with a distinct role. Remove any one of them and you lose a meaningful piece of the self-improvement story.

🧐 SOUL.md: Sets the identity. The fixed frame through which everything else operates.

🔄 Runtime Loop: Captures experience. ReAct loop that writes memories and creates skills autonomously.

🧹 Curator: Keeps the library clean. Prunes stale skills and consolidates overlapping ones.

🧬 GEPA: Makes sure what's in the library actually works. Offline validation via execution traces.

vs Other Agent Approaches

Approach	Identity	Memory	Learns Skills?	Prunes Skills?	Offline Validation?
Plain ReAct Agent	System prompt	None / window	No	No	No
OpenClaw	System prompt	Markdown files	No (static)	No	No
Hermes Agent	SOUL.md (slot #1)	3-tier (MD + SQLite + 8 providers)	Yes (skill_manage)	Yes (Curator)	Yes (GEPA)