Post 50 — Memory in LLM Agents
Post 50 · Agents & Systems

Memory in LLM Agents
How Machines Remember, Manage & Recall

A deep dive into the memory mechanisms that power LLM-based agents. Covers the three-dimensional taxonomy (sources × forms × operations), key architectures from Generative Agents to MemGPT, retrieval strategies, evaluation methods, and the open frontiers of multi-agent and lifelong memory.

Based on: Zhang et al. 2024 · arXiv:2404.13501
Category: Agents & Systems
Level: Intermediate
Post: 50 of 50

The Memory Problem in AI Agents

Without memory, every conversation starts from zero.

A standalone LLM is stateless — it processes a fixed context window, then forgets everything. But real-world agents need to accumulate experience, recall relevant facts, and improve over time. Memory is the bridge between a language model and an intelligent agent.

Consider a personal assistant that needs to remember your movie preferences across sessions, a Minecraft agent that stores successful crafting strategies for future runs, or a medical system that learns from thousands of patient interactions. Each demands a different memory architecture with different tradeoffs.

Paper Overview
Zhang et al. (2024) survey 50+ agent systems and propose a unified three-dimensional framework: Memory Sources (what goes in), Memory Forms (how it's stored), and Memory Operations (how it's accessed). This gives us a coherent vocabulary to compare architectures that would otherwise look very different.
  • 3 Memory Sources (where the agent's information comes from): Inside-trial, Cross-trial, External Knowledge
  • 2 Memory Forms (how information persists): Textual (context-injected) vs. Parametric (weight-encoded)
  • 3 Memory Operations (the universal pipeline from observation to action): Write → Manage → Read

Memory Sources

Where does the agent's knowledge come from? Three orthogonal sources feed into any agent's memory system.

Every piece of information an agent stores originates from one of three sources. Most real systems blend multiple sources — a ReAct agent reads a Wikipedia page (external) during a multi-step task (inside-trial) and reuses the strategy from a previous failed run (cross-trial).

Source 1
Inside-Trial Information
Historical steps within a single session or task — the most intuitive and universally adopted source. Every agent that maintains a conversation history is using inside-trial memory. It forms the "short-term" episodic buffer for the current task.
Example: Dialogue Agent
User: "My name is Alice and I prefer action films."
The agent stores this in its session history. Later in the same conversation, when asked for a recommendation, it recalls Alice's preferences from those earlier turns.
Example: Game Agent (Minecraft)
The agent mines stone, fails to build a shelter, collects wood. This action-observation-failure sequence within one run constitutes inside-trial memory used to plan the next action.
Systems using this source:
MemGPT, Generative Agents, ReAct, MemoChat, SCM, and most other agents
Source 2
Cross-Trial Information
Accumulated knowledge across multiple attempts — past successes, failures, and extracted lessons that persist beyond individual sessions. This is what enables true learning from experience: an agent that failed a task yesterday should not repeat the same mistake today.
Example: Reflexion (Verbal RL)
Trial 1: Agent attempts a HotpotQA question, answers incorrectly. The failure reason — "I searched for the wrong entity" — is stored as a verbal insight. Trial 2: The agent reads this insight before starting, changes strategy, answers correctly.
Example: Voyager (Minecraft)
Successful code snippets from previous episodes are stored as skills. A new run can reuse the "mine_iron_ore" function developed in run 3 without relearning it from scratch.
Systems using this source:
Reflexion, ExpeL, Voyager, MemoryBank, GITM, Retroformer
Source 3
External Knowledge
Static information from outside the agent-environment loop, accessed via APIs, tool calls, or database lookups. This source provides factual grounding that goes beyond what the agent can learn from its own experience — encyclopedic facts, domain knowledge, real-time data.
Example: ReAct on HotpotQA
The agent calls Wikipedia's search API mid-task to look up "Marie Curie's birthplace." This retrieved text is injected into context as external knowledge, enabling the agent to answer factual questions it couldn't otherwise ground.
Example: Medical Agents (Huatuo)
A medical diagnosis agent queries PubMed for recent treatment guidelines. The result is processed into the context as external knowledge, providing up-to-date clinical information beyond what was encoded in model weights at training time.
Systems using this source:
ReAct, GITM, ChatDB, RET-LLM, Huatuo, InvestLM

Memory Forms

How information is stored determines what tradeoffs the agent makes between interpretability, cost, and adaptability.

The survey identifies two fundamentally different ways to store memory: injecting it into the LLM's context as text (textual memory), or encoding it permanently into the model's weights (parametric memory). Each comes with distinct engineering tradeoffs that affect everything from latency to catastrophic forgetting risk.

Textual Memory, by aspect:
  • Effectiveness: comprehensive and flexible, but constrained by context window length
  • Write Speed: fast (just append, summarize, or embed)
  • Read Speed: requires retrieval + injection into prompt
  • Interpretability: high (stored as natural language, human-readable)
  • Forgetting Risk: none (doesn't affect weights; old info stays in storage)
  • Best For: conversations, episodic memory, recency-sensitive tasks
Complete Interactions — stores the entire agent-environment history verbatim. Maximally informative but quickly exhausts the context window. Suffers from position bias (LLMs attend unevenly to early vs. late tokens) and truncation loss when history exceeds the limit. Baseline strategy used by most simple chatbots.
Recent Interactions (Sliding Window): maintains only the most recent K steps via a FIFO cache, inspired by the principle of temporal locality. Efficient and avoids token overflow, but loses distant critical information. SCM (Self-Controlled Memory) uses a memory controller to gate what gets kept, preserving the t-1 most important recent steps rather than a blind window.
Retrieved Interactions — selects memory by relevance, recency, or importance using embedding-based retrieval (cosine similarity, LSH, FAISS). Top-K most relevant entries are fetched per query. Ensures that even distant but critical memories remain accessible. Used by MemoryBank (dual-tower + FAISS) and Generative Agents (multi-criterion scoring).
External Knowledge Text — API-acquired information (Wikipedia, medical databases, Minecraft Wiki) transformed into natural language and injected into the context window. Provides factual grounding beyond agent experience. Risk: the source data may be inaccurate, outdated, or privacy-violating.
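The sliding-window strategy above can be sketched in a few lines. This is a minimal illustration, not code from any of the surveyed systems: the class name `SlidingWindowMemory`, its methods, and the prompt layout are assumptions made for the example.

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the K most recent (action, observation) steps (FIFO)."""

    def __init__(self, k: int) -> None:
        # A deque with maxlen drops the oldest item automatically on overflow
        self.steps = deque(maxlen=k)

    def write(self, action: str, observation: str) -> None:
        self.steps.append((action, observation))

    def read(self) -> str:
        # Render the window as prompt text, oldest step first
        return "\n".join(
            f"Action: {a}\nObservation: {o}" for a, o in self.steps
        )


memory = SlidingWindowMemory(k=2)
memory.write("mine stone", "+1 stone")
memory.write("build shelter", "failed: not enough wood")
memory.write("collect wood", "+3 oak logs")
window_text = memory.read()  # the "mine stone" step has been evicted
```

The eviction is blind: the oldest step is dropped regardless of how important it was, which is exactly the weakness that controller-gated and retrieval-based strategies try to address.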

Memory Operations: Write → Manage → Read

A unified formal framework captures every agent's memory pipeline in three composable operators.

The paper's key contribution is a formal unified framework — three mathematical operators (W, P, R) that describe the complete pipeline from raw observation to informed action. This lets researchers compare wildly different systems (from Generative Agents to ChatDB) using the same language.

Unified Agent Action Formula
$$a_{t+1}^k = \mathrm{LLM}\{\, R(\, P(M_{t-1}^k,\ W(\{a_t^k, o_t^k\})),\ c_{t+1}^k \,) \,\}$$

where k = trial index, t = step index, a = action, o = observation, c = action context
[Figure: agent loop. Environment → observation $o_t^k$ + action $a_t^k$ → Write W → Manage P → Read R → LLM → $a_{t+1}^k$]

The cycle begins when the agent takes an action in the environment and receives an observation in return. This (action, observation) pair, denoted $\{a_t^k, o_t^k\}$, is the raw material that flows into the memory pipeline.

In a game agent: action = "mine stone block", observation = "+1 stone added to inventory". In a dialogue agent: action = agent utterance, observation = user's reply.

[Figure: Write step. Environment → $(a_t^k, o_t^k)$ → Write W → $m_t^k$ → Manage P → Read R → LLM]
$$m_t^k = W(\{a_t^k, o_t^k\})$$

Memory Writing projects raw observations into a concise stored memory entry. W is a projecting function — it decides what to remember and how to represent it.

  • Raw storage — copy the (action, observation) verbatim into memory
  • Summarization — compress a long interaction into a few key sentences (MemoChat)
  • Structured entry — extract entity relations and store as a typed database record (TiM)
  • Autonomous update — the LLM itself decides when and what to write (MemGPT)
  • Topic indexing — tag the memory with a key for later lookup (MemoChat, TiM)
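Two of the write strategies listed above, raw storage and summarization with a topic key, can be sketched as follows. Everything here is hypothetical scaffolding: `MemoryEntry`, `write_raw`, `write_summary`, and the stand-in summarizer `first_sentence` are names invented for illustration, and a real system would call an LLM where the stand-in is used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemoryEntry:
    step: int
    topic: str    # lookup key (topic indexing)
    content: str  # what is actually stored


def write_raw(step: int, action: str, observation: str) -> MemoryEntry:
    """Raw storage: keep the (action, observation) pair verbatim."""
    return MemoryEntry(step, topic="raw", content=f"{action} -> {observation}")


def write_summary(step: int, action: str, observation: str,
                  summarize: Callable[[str], str], topic: str) -> MemoryEntry:
    """Summarization + topic indexing: compress, then tag with a key."""
    text = summarize(f"{action} -> {observation}")
    return MemoryEntry(step, topic=topic, content=text)


def first_sentence(text: str) -> str:
    # Stand-in summarizer (assumption): keeps the text up to the first period
    return text.split(".")[0].strip() + "."


raw = write_raw(1, "mine stone block", "+1 stone added to inventory")
short = write_summary(
    2,
    "agent: Any genre you like?",
    "user: I prefer action films. Also, I live in Oslo.",
    first_sentence,
    topic="preferences",
)
```

Passing the summarizer in as a callable keeps the write operator independent of any particular model, so the same entry format works for raw and compressed memories.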
[Figure: Manage step. Write W → $m_t^k$ → Manage P → $M_t^k$ → Read R → LLM]
$$M_t^k = P(M_{t-1}^k,\ m_t^k)$$

Memory Management processes and maintains the quality of stored memory. P takes the existing memory store $M_{t-1}^k$ and the new entry $m_t^k$ and produces an updated store $M_t^k$. Three core management operations:

  • Reflection — generate high-level abstractions from raw events. Generative Agents trigger reflection when cumulative "importance scores" of events exceed a threshold, distilling daily events into personality insights.
  • Merging — consolidate redundant entries and build common reference patterns across multiple plans (GITM merges key actions; Voyager refines executable skills via environment feedback).
  • Forgetting — remove stale or irrelevant entries. Reduces negative interference from outdated memory. Can be implemented via knowledge editing or LLM-directed pruning.
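As a rough sketch of how two of these operations might fit together, the code below combines age-based forgetting with a reflection step that fires once accumulated importance crosses a threshold. The threshold, the age limit, the importance assigned to insights, and the `reflect` callable are all assumptions chosen for illustration; they are not values taken from Generative Agents or any other system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Memory:
    text: str
    importance: float  # assumed to be scored elsewhere, e.g. by an LLM
    step: int


@dataclass
class MemoryStore:
    entries: List[Memory] = field(default_factory=list)
    reflection_threshold: float = 15.0  # assumed value
    max_age: int = 100                  # assumed value, in steps
    pending_importance: float = 0.0

    def manage(self, new: Memory, reflect: Callable[[List[str]], str]) -> None:
        """P(M_{t-1}, m_t): add the entry, forget stale ones, maybe reflect."""
        self.entries.append(new)
        self.pending_importance += new.importance
        # Forgetting: drop entries older than max_age steps
        self.entries = [
            m for m in self.entries if new.step - m.step <= self.max_age
        ]
        # Reflection: fires once accumulated importance crosses the threshold
        if self.pending_importance >= self.reflection_threshold:
            insight = reflect([m.text for m in self.entries])
            self.entries.append(Memory(insight, importance=10.0, step=new.step))
            self.pending_importance = 0.0


def fake_reflect(texts: List[str]) -> str:
    # Stand-in for an LLM call that abstracts over recent memories
    return f"Insight drawn from {len(texts)} memories"


store = MemoryStore(reflection_threshold=10.0, max_age=5)
store.manage(Memory("ate breakfast", 2.0, step=1), fake_reflect)
store.manage(Memory("argued with Bob", 9.0, step=2), fake_reflect)  # triggers reflection
store.manage(Memory("walked home", 1.0, step=7), fake_reflect)      # forgets step 1
```

Merging is left out to keep the sketch short; it would slot into `manage` as a third pass that consolidates near-duplicate entries.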
[Figure: Read step. $M_t^k$ → Read R → $\hat{M}_t^k$, combined with $c_{t+1}^k$ → LLM]
$$\hat{M}_t^k = R(M_t^k,\ c_{t+1}^k)$$

Memory Reading retrieves the most relevant subset of memory M̂ from the full store M, conditioned on the action context c for the next step. R is typically a similarity-based function.

  • Cosine similarity — embed both the query context and each memory entry; return top-K by cosine score
  • Multi-criterion scoring — weighted combination of relevance + recency + importance (Generative Agents)
  • SQL retrieval — for structured databases, generate and execute SQL queries (ChatDB)
  • FAISS lookup — approximate nearest-neighbor search over a large vector store (MemoryBank, ExpeL)
[Figure: Act step. $\hat{M}_t^k$ + $c_{t+1}^k$ → LLM → $a_{t+1}^k$]
$$a_{t+1}^k = \mathrm{LLM}\{\, \hat{M}_t^k \,\Vert\, c_{t+1}^k \,\}$$

The retrieved memory $\hat{M}_t^k$ is concatenated with the action context $c_{t+1}^k$ and passed to the LLM. The LLM then generates the next action $a_{t+1}^k$, which feeds back into the environment to produce the next observation, completing the loop.

This is where memory makes a difference: the same LLM, given retrieved context about a prior failure or a retrieved fact from Wikipedia, will generate a better action than it would without that context. The quality of every upstream operation (Write, Manage, Read) directly determines the quality of this final generation.
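To make the composition of the three operators concrete, here is a toy end-to-end loop. It is a sketch under heavy assumptions: `W` stores plain text, `P` is a capacity-capped append, `R` ranks by word overlap rather than embeddings, and `echo_llm` is a stand-in for a real model call. None of these choices come from the survey; only the W → P → R → LLM ordering does.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation)


def W(step: Step) -> str:
    """Write: project a raw step into a memory entry (here: plain text)."""
    action, observation = step
    return f"{action} -> {observation}"


def P(store: List[str], entry: str, capacity: int = 50) -> List[str]:
    """Manage: append the entry and forget the oldest beyond capacity."""
    return (store + [entry])[-capacity:]


def R(store: List[str], context: str, k: int = 2) -> List[str]:
    """Read: rank entries by word overlap with the context (toy similarity)."""
    query = set(context.lower().split())
    ranked = sorted(
        store,
        key=lambda m: len(query & set(m.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def agent_step(store: List[str], step: Step, context: str,
               llm: Callable[[str], str]) -> Tuple[List[str], str]:
    """One iteration: write, manage, read, then call the LLM."""
    store = P(store, W(step))
    retrieved = R(store, context)
    prompt = "\n".join(retrieved + [context])  # retrieved memory || context
    return store, llm(prompt)


def echo_llm(prompt: str) -> str:
    # Stand-in for a real model call: reports how many lines it was given
    return f"next action chosen from {len(prompt.splitlines())} prompt lines"


store: List[str] = []
store, _ = agent_step(store, ("mine stone", "+1 stone"), "what next?", echo_llm)
store, action = agent_step(
    store, ("collect wood", "+3 oak logs"), "craft a wood pickaxe", echo_llm
)
```

Because each operator is a plain function, any one of them can be swapped (for example, an embedding-based `R`) without touching the loop itself.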

Key Systems

Six landmark agent architectures that illustrate different memory design philosophies.


Retrieval Strategies

How an agent decides which memories are relevant shapes both performance and computational cost.

Not all stored memories are equally useful for any given query. The retrieval function R must identify which subset of M is worth loading into the prompt. The survey identifies three main retrieval paradigms, each with distinct tradeoffs.

Query Context $c_{t+1}$
"What do I need to craft a sword in Minecraft?"
Memory Store: Cosine Similarity Scores
  • 0.92: "Crafted wooden pickaxe using 3 planks + 2 sticks"
  • 0.87: "Smelted iron ore into 2 iron ingots at furnace"
  • 0.44: "Found a village to the east, traded wheat for emeralds"
  • 0.21: "Built a shelter from oak logs near spawn point"
Top-2 Retrieved (threshold: 0.7)
The memories about crafting tools and smelting iron ingots are retrieved. The agent now knows it has iron ingots and basic crafting knowledge, which is enough to plan the remaining steps toward a sword.

Similarity-based retrieval embeds both query and memories using a shared encoder. Returns top-K by cosine score. Fast with FAISS approximate search. Used by: MemoryBank, RET-LLM, ExpeL.
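A minimal sketch of top-K cosine retrieval with a score threshold follows. The 3-dimensional vectors are hand-written stand-ins for encoder output, and the function names `cosine` and `top_k` are made up for this example; a production system would use a real embedding model and an index such as FAISS instead of the linear scan shown here.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          memories: List[Tuple[str, Sequence[float]]],
          k: int = 2, threshold: float = 0.7) -> List[Tuple[float, str]]:
    """Return up to k (score, text) pairs whose cosine score clears the threshold."""
    scored = [(cosine(query_vec, vec), text) for text, vec in memories]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return sorted(scored, reverse=True)[:k]


# Toy vectors (assumption): a real encoder would produce these
memories = [
    ("crafted wooden pickaxe", [0.9, 0.1, 0.0]),
    ("smelted iron ingots", [0.8, 0.3, 0.1]),
    ("traded wheat in village", [0.1, 0.9, 0.2]),
    ("built oak shelter", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]  # stands in for "how do I craft a sword?"
results = top_k(query, memories)
```

With these toy vectors, only the two crafting-related memories clear the 0.7 threshold, mirroring the example above.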

Query Context $c_{t+1}$
"What should I do at the party this evening?"
Memory Store: Multi-Criterion Scoring (Generative Agents)
  • 0.91: "Alice invited me to a party tonight" (recency: 0.9, relevance: 0.95, importance: 0.8)
  • 0.82: "I enjoy dancing and meeting new people at social events" (relevance: 0.75, importance: 0.9)
  • 0.38: "I went to the grocery store 3 days ago" (recency: 0.2, relevance: 0.1)
Weighted Score: α·Relevance + β·Recency + γ·Importance
Multi-criterion scoring prevents purely old-but-important memories from dominating and purely recent-but-trivial memories from flooding the context. Generative Agents use all three signals with equal weights.

Generative Agents combine three normalized scores (relevance, recency, importance) to select memories. Each dimension is normalized 0–1 across the memory store. Recency uses exponential decay on last-access time.
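The weighted-sum scoring described above might look like the following sketch. The decay rate of 0.99 per hour, the field names, and the sample values are assumptions for illustration; only the overall shape (min-max normalize each signal, apply exponential decay for recency, sum with weights) follows the description in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScoredMemory:
    text: str
    relevance: float           # e.g. cosine similarity to the query
    importance: float          # e.g. rated by an LLM
    hours_since_access: float


def min_max(values: List[float]) -> List[float]:
    """Normalize to 0-1 across the store; all-equal inputs map to 0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def rank(memories: List[ScoredMemory], alpha: float = 1.0, beta: float = 1.0,
         gamma: float = 1.0, decay: float = 0.99) -> List[Tuple[float, str]]:
    # Recency: exponential decay on time since last access (decay is assumed)
    recency = min_max([decay ** m.hours_since_access for m in memories])
    relevance = min_max([m.relevance for m in memories])
    importance = min_max([m.importance for m in memories])
    scores = [
        alpha * rel + beta * rec + gamma * imp
        for rel, rec, imp in zip(relevance, recency, importance)
    ]
    return sorted(zip(scores, (m.text for m in memories)), reverse=True)


ranked = rank([
    ScoredMemory("grocery store trip", 0.10, 0.2, hours_since_access=72),
    ScoredMemory("I enjoy dancing", 0.75, 0.9, hours_since_access=48),
    ScoredMemory("Alice invited me to a party", 0.95, 0.8, hours_since_access=2),
])
```

Setting `alpha`, `beta`, and `gamma` to 1.0 corresponds to the equal-weight case; changing them shifts the balance between relevant, recent, and important memories.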

Query Context $c_{t+1}$
"How many times did I buy coffee in the last week?"
LLM-Generated SQL Query (ChatDB approach)
SELECT COUNT(*) FROM memory
WHERE category = 'purchase'
AND item LIKE '%coffee%'
AND date >= date('now', '-7 days')
Structured Query Result
Result: 4 coffee purchases found. The LLM receives this structured result as context, enabling precise quantitative answers that would be impossible with fuzzy similarity search.

ChatDB stores memories as relational database records. The agent generates SQL queries to retrieve specific information. Enables precise quantitative lookups, joins, and aggregations — impossible with embedding-based search. Trades flexibility for precision.
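The query above can be tried against an in-memory SQLite database using Python's standard `sqlite3` module. The table layout and the sample rows are assumptions made up for this sketch (ChatDB's actual schema is not given here), and in a real agent the SQL string would be generated by the LLM rather than hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory (category TEXT, item TEXT, date TEXT)")

# Sample rows (assumed): dates are stored relative to today via SQLite's date()
rows = [
    ("purchase", "iced coffee", "-1 days"),
    ("purchase", "coffee beans", "-3 days"),
    ("purchase", "green tea", "-2 days"),
    ("purchase", "coffee", "-10 days"),   # outside the 7-day window
    ("visit", "coffee shop", "-1 days"),  # wrong category
]
conn.executemany("INSERT INTO memory VALUES (?, ?, date('now', ?))", rows)

# In an agent, this string would come from the LLM
query = """
    SELECT COUNT(*) FROM memory
    WHERE category = 'purchase'
      AND item LIKE '%coffee%'
      AND date >= date('now', '-7 days')
"""
(count,) = conn.execute(query).fetchone()
conn.close()
```

The count excludes both the old purchase and the non-purchase row, which is the kind of exact filtering that similarity search cannot guarantee.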

Evaluation Landscape

How do you know if memory is working? The survey distinguishes direct evaluation, covered here by five metrics (two subjective, three objective), from indirect evaluation through task performance.

  • Coherence (Direct · Subjective)
  • Rationality (Direct · Subjective)
  • Result Correctness (Direct · Objective)
  • Reference F1 (Direct · Objective)
  • Time & Hardware Cost (Direct · Objective)
  • Task Performance (Indirect)
The Evaluation Gap
Most papers evaluate memory only indirectly — by measuring task success — without isolating the memory component's contribution. This makes it hard to compare memory architectures fairly. A system that performs well on HotpotQA might succeed because of a better prompt, not better memory. Direct memory evaluation (coherence, F1) is rarer but more informative.
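One plausible reading of a direct F1-style metric is to compare the set of memory entries the agent retrieved against a hand-labeled reference set. The sketch below assumes that interpretation; the function name `retrieval_f1` and the memory IDs are hypothetical, and individual papers may define their F1 differently (for example, at the token level).

```python
from typing import Set


def retrieval_f1(retrieved: Set[str], reference: Set[str]) -> float:
    """F1 between retrieved memory IDs and a gold reference set."""
    hits = len(retrieved & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)  # share of retrieved items that are correct
    recall = hits / len(reference)     # share of gold items that were retrieved
    return 2 * precision * recall / (precision + recall)


# Hypothetical IDs: 2 of 3 retrieved entries appear in the 4-item reference set
score = retrieval_f1({"m1", "m2", "m7"}, {"m1", "m2", "m3", "m4"})
```

A metric like this isolates the read operation: it can drop while task accuracy stays flat, or the reverse, which is the gap the paragraph above describes.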

Challenges & Future Directions

Seven open problems define the frontier of agent memory research.

📚
Underdeveloped Parametric Memory
Fine-tuning and knowledge editing methods receive far less attention than textual memory in the agent literature. Catastrophic forgetting, high training costs, and the difficulty of online updates remain unsolved. Future work should explore efficient knowledge editing and lightweight personalization methods.
♾️
No Lifelong Learning Framework
No current agent architecture cleanly handles indefinite continuous operation — accumulating memory over months of deployment without degrading or requiring periodic wipes. Lifelong memory with graceful forgetting of irrelevant content while retaining critical long-term facts is an open challenge.
🤝
Multi-Agent Memory Undefined
When multiple agents share an environment, how should they share, merge, or conflict-resolve their individual memories? No standard protocol exists for shared memory banks, distributed memory consistency, or consensus mechanisms for conflicting agent beliefs. MetaGPT and ChatDev touch this but don't fully solve it.
Memory Contradictions
A single agent can simultaneously have inside-trial, cross-trial, and external knowledge that contradict each other ("The user said they dislike coffee" vs. a retrieved database entry "User ordered coffee 4 times this week"). No robust strategy exists for detecting contradictions and deciding which source to trust.
📍
Position Bias & Context Efficiency
LLMs attend unevenly to different positions in long contexts — information at the beginning and end is weighted more than the middle ("lost in the middle" phenomenon). This creates systematic retrieval failure even when the right memory is loaded. Long-context architectures need position-aware attention solutions.
🤖
Embodied & Multimodal Memory
Physical-world agents (robotics, AR glasses) need to integrate sensor streams, images, and spatial coordinates into memory. Pure text memory is insufficient. Multimodal memory systems that handle vision + language + spatial context are in early stages, with no unified framework.
🔒
Knowledge Reliability & Privacy
External knowledge sources can be inaccurate, biased, or contain private user data. Agents that blindly store and retrieve external content risk propagating misinformation. Privacy-preserving memory architectures and source credibility scoring remain underexplored in the agent memory literature.