Post 50 · Memory in LLM Agents

Why It Matters

The Memory Problem in AI Agents

Without memory, every conversation starts from zero.

A standalone LLM is stateless — it processes a fixed context window, then forgets everything. But real-world agents need to accumulate experience, recall relevant facts, and improve over time. Memory is the bridge between a language model and an intelligent agent.

Consider a personal assistant that needs to remember your movie preferences across sessions, a Minecraft agent that stores successful crafting strategies for future runs, or a medical system that learns from thousands of patient interactions. Each demands a different memory architecture with different tradeoffs.

Paper Overview

Zhang et al. (2024) survey 50+ agent systems and propose a unified three-dimensional framework: Memory Sources (what goes in), Memory Forms (how it's stored), and Memory Operations (how it's accessed). This gives us a coherent vocabulary to compare architectures that would otherwise look very different.

3 Memory Sources: inside-trial, cross-trial, and external knowledge (where the agent's information comes from)

2 Memory Forms: textual (context-injected) vs. parametric (weight-encoded), i.e. how information persists

3 Memory Operations: Write → Manage → Read, the universal pipeline from observation to action

Dimension 1

Memory Sources

Where does the agent's knowledge come from? Three orthogonal sources feed into any agent's memory system.

Every piece of information an agent stores originates from one of three sources. Most real systems blend several of them: a ReAct-style agent might read a Wikipedia page (external) during a multi-step task (inside-trial) while applying a lesson learned from a previous failed run (cross-trial).

Source 1

Inside-Trial Information

Historical steps within a single session or task — the most intuitive and universally adopted source. Every agent that maintains a conversation history is using inside-trial memory. It forms the "short-term" episodic buffer for the current task.

Example: Dialogue Agent

User: "My name is Alice and I prefer action films."
Agent stores this in session history. Later in the same conversation, when asked for a recommendation, the agent recalls Alice's preferences from the earlier turns.

Example: Game Agent (Minecraft)

The agent mines stone, fails to build a shelter, collects wood. This action-observation-failure sequence within one run constitutes inside-trial memory used to plan the next action.

Systems using this source:

MemGPT, Generative Agents, ReAct, MemoChat, SCM, and most other agents

Source 2

Cross-Trial Information

Accumulated knowledge across multiple attempts — past successes, failures, and extracted lessons that persist beyond individual sessions. This is what enables true learning from experience: an agent that failed a task yesterday should not repeat the same mistake today.

Example: Reflexion (Verbal RL)

Trial 1: Agent attempts a HotpotQA question, answers incorrectly. The failure reason — "I searched for the wrong entity" — is stored as a verbal insight. Trial 2: The agent reads this insight before starting, changes strategy, answers correctly.
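The trial-to-trial loop in this example can be sketched in a few lines. This is a minimal illustration, not Reflexion's actual implementation: `run_trials`, `toy_attempt`, and the tuple returned by an attempt are hypothetical names chosen here, and the "LLM" is a stub that succeeds only once it has an insight to read.

```python
from typing import Callable, List, Tuple

# An attempt receives the question plus every insight stored so far and
# returns (answer, succeeded, reflection_on_failure). Hypothetical signature.
Attempt = Callable[[str, List[str]], Tuple[str, bool, str]]


def run_trials(attempt: Attempt, question: str, max_trials: int = 3):
    """Retry a task, carrying verbal insights from failed trials forward."""
    insights: List[str] = []  # cross-trial memory: survives between trials
    for trial in range(1, max_trials + 1):
        answer, succeeded, reflection = attempt(question, insights)
        if succeeded:
            return answer, trial, insights
        insights.append(reflection)  # store the lesson for the next trial
    return None, max_trials, insights


def toy_attempt(question: str, insights: List[str]):
    # Stand-in for an LLM call (assumption): fails until an insight exists.
    if not insights:
        return "wrong answer", False, "I searched for the wrong entity"
    return "right answer", True, ""


result = run_trials(toy_attempt, "example question")
```

The only state that outlives a trial is the `insights` list, which is what makes this cross-trial rather than inside-trial memory.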

Example: Voyager (Minecraft)

Successful code snippets from previous episodes are stored as skills. A new run can reuse the "mine_iron_ore" function developed in run 3 without relearning it from scratch.

Systems using this source:

Reflexion, ExpeL, Voyager, MemoryBank, GITM, Retroformer

Source 3

External Knowledge

Static information from outside the agent-environment loop, accessed via APIs, tool calls, or database lookups. This source provides factual grounding that goes beyond what the agent can learn from its own experience — encyclopedic facts, domain knowledge, real-time data.

Example: ReAct on HotpotQA

The agent calls Wikipedia's search API mid-task to look up "Marie Curie's birthplace." This retrieved text is injected into context as external knowledge, enabling the agent to answer factual questions it couldn't otherwise ground.

Example: Medical Agents (Huatuo)

A medical diagnosis agent queries PubMed for recent treatment guidelines. The result is processed into the context as external knowledge, providing up-to-date clinical information beyond what was encoded in model weights at training time.

Systems using this source:

ReAct, GITM, ChatDB, RET-LLM, Huatuo, InvestLM

Dimension 2

Memory Forms

How information is stored determines what tradeoffs the agent makes between interpretability, cost, and adaptability.

The survey identifies two fundamentally different ways to store memory: injecting it into the LLM's context as text (textual memory), or encoding it permanently into the model's weights (parametric memory). Each comes with distinct engineering tradeoffs that affect everything from latency to catastrophic forgetting risk.

Aspect | Textual Memory
Effectiveness | Comprehensive and flexible, but constrained by context window length
Write Speed | Fast: just append, summarize, or embed
Read Speed | Requires retrieval + injection into prompt
Interpretability | High: stored as natural language, human-readable
Forgetting Risk | None: doesn't affect weights; old info stays in storage
Best For | Conversations, episodic memory, recency-sensitive tasks

Complete Interactions — stores the entire agent-environment history verbatim. Maximally informative but quickly exhausts the context window. Suffers from position bias (LLMs attend unevenly to early vs. late tokens) and truncation loss when history exceeds the limit. Baseline strategy used by most simple chatbots.

Recent Interactions (Sliding Window): maintains only the most recent K steps in a FIFO cache, following the principle of temporal locality. Efficient and avoids token overflow, but loses distant critical information. SCM (Self-Controlled Memory) adds a memory controller that decides which stored memories to bring back alongside the most recent step, rather than relying on a blind window.

Retrieved Interactions — selects memory by relevance, recency, or importance using embedding-based retrieval (cosine similarity, LSH, FAISS). Top-K most relevant entries are fetched per query. Ensures that even distant but critical memories remain accessible. Used by MemoryBank (dual-tower + FAISS) and Generative Agents (multi-criterion scoring).

External Knowledge Text — API-acquired information (Wikipedia, medical databases, Minecraft Wiki) transformed into natural language and injected into the context window. Provides factual grounding beyond agent experience. Risk: the source data may be inaccurate, outdated, or privacy-violating.
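The sliding-window form described above maps naturally onto a bounded queue. The sketch below assumes memory entries are simple (action, observation) tuples; the class name `SlidingWindowMemory` and its methods are illustrative, not taken from any of the cited systems.

```python
from collections import deque
from typing import Deque, List, Tuple

Step = Tuple[str, str]  # (action, observation)


class SlidingWindowMemory:
    """Keep only the K most recent steps; older ones are evicted FIFO."""

    def __init__(self, k: int) -> None:
        # deque with maxlen drops the oldest item automatically on append
        self._steps: Deque[Step] = deque(maxlen=k)

    def write(self, action: str, observation: str) -> None:
        self._steps.append((action, observation))

    def read(self) -> List[Step]:
        return list(self._steps)  # oldest first, newest last


window = SlidingWindowMemory(k=3)
for i in range(1, 5):
    window.write(f"action {i}", f"observation {i}")
```

After four writes with K = 3, the first step is gone, which is exactly the "loses distant critical information" tradeoff noted above.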

Dimension 3

Memory Operations: Write → Manage → Read

A unified formal framework captures every agent's memory pipeline in three composable operators.

The paper's key contribution is a formal unified framework — three mathematical operators (W, P, R) that describe the complete pipeline from raw observation to informed action. This lets researchers compare wildly different systems (from Generative Agents to ChatDB) using the same language.

Unified Agent Action Formula

a_{t+1}^k = LLM{ R( P(M_{t-1}^k, W({a_t^k, o_t^k})), c_{t+1}^k ) }

where k = trial index, t = step index, a = action, o = observation, c = action context

Pipeline: Environment → Observation o_t^k + Action a_t^k → Write W → Manage P → Read R → LLM → a_{t+1}^k

The cycle begins when the agent takes an action in the environment and receives an observation in return. This raw (action, observation) pair — denoted {a_t^k, o_t^k} — is the raw material that flows into the memory pipeline.

In a game agent: action = "mine stone block", observation = "+1 stone added to inventory". In a dialogue agent: action = agent utterance, observation = user's reply.

Environment → (a_t^k, o_t^k) → Write W → m_t^k → Manage P → Read R → LLM

m_t^k = W({a_t^k, o_t^k})

Memory Writing projects raw observations into a concise stored memory entry. W is a projecting function — it decides what to remember and how to represent it.

Raw storage — copy the (action, observation) verbatim into memory
Summarization — compress a long interaction into a few key sentences (MemoChat)
Structured entry — extract entity relations and store as a typed database record (TiM)
Autonomous update — the LLM itself decides when and what to write (MemGPT)
Topic indexing — tag the memory with a key for later lookup (MemoChat, TiM)
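Two of the write strategies listed above, raw storage and summarization with a topic key, can be sketched as follows. The `MemoryEntry` type and the `first_sentence` summarizer are assumptions made for illustration; a real system such as MemoChat would call an LLM to summarize.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemoryEntry:
    text: str
    topic: str  # key used later for lookup (topic indexing)


def write_raw(action: str, observation: str, topic: str = "general") -> MemoryEntry:
    """Raw storage: keep the (action, observation) pair verbatim."""
    return MemoryEntry(f"action: {action} | observation: {observation}", topic)


def write_summary(action: str, observation: str,
                  summarize: Callable[[str], str],
                  topic: str = "general") -> MemoryEntry:
    """Summarization: compress the interaction before storing it."""
    return MemoryEntry(summarize(f"{action}\n{observation}"), topic)


def first_sentence(text: str) -> str:
    # Crude stand-in for an LLM summarizer (assumption): keep the first sentence.
    return text.split(".")[0].strip() + "."


raw = write_raw("mine stone block", "+1 stone added to inventory", topic="mining")
short = write_summary("I like action films. Also comedies.", "Noted.",
                      first_sentence, topic="movies")
```

Both functions play the role of W: they take one (action, observation) pair and return a single entry m_t^k.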

Write W → m_t^k → Manage P → M_t^k → Read R → LLM

M_t^k = P(M_{t-1}^k, m_t^k)

Memory Management processes stored memory and maintains its quality. P takes the existing memory store M_{t-1}^k and the new entry m_t^k and produces an updated store M_t^k. Three core management operations:

Reflection — generate high-level abstractions from raw events. Generative Agents trigger reflection when cumulative "importance scores" of events exceed a threshold, distilling daily events into personality insights.
Merging — consolidate redundant entries and build common reference patterns across multiple plans (GITM merges key actions; Voyager refines executable skills via environment feedback).
Forgetting — remove stale or irrelevant entries. Reduces negative interference from outdated memory. Can be implemented via knowledge editing or LLM-directed pruning.
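As a rough sketch of P, the function below combines the simplest possible versions of merging (dropping an exact duplicate) and forgetting (evicting the oldest entries past a capacity). Both rules are deliberately naive assumptions; the systems named above use LLM-driven reflection and merging rather than string equality.

```python
from typing import List


def manage(store: List[str], new_entry: str, capacity: int = 100) -> List[str]:
    """Return the updated store M_t given M_{t-1} and the new entry m_t."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    # Merging (naive): an exact duplicate is replaced by the fresh copy
    updated = [entry for entry in store if entry != new_entry]
    updated.append(new_entry)
    # Forgetting (naive): keep only the newest `capacity` entries
    return updated[-capacity:]


merged = manage(["found village", "built shelter"], "found village", capacity=2)
evicted = manage(["found village", "built shelter"], "mined iron", capacity=2)
```

The function returns a new list instead of mutating its input, which mirrors the formula: M_t^k is a function of M_{t-1}^k and m_t^k.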

M_t^k → Read R → M̂_t^k, combined with c_{t+1}^k → LLM

M̂_t^k = R(M_t^k, c_{t+1}^k)

Memory Reading retrieves the most relevant subset of memory M̂ from the full store M, conditioned on the action context c for the next step. R is typically a similarity-based function.

Cosine similarity — embed both the query context and each memory entry; return top-K by cosine score
Multi-criterion scoring — weighted combination of relevance + recency + importance (Generative Agents)
SQL retrieval — for structured databases, generate and execute SQL queries (ChatDB)
FAISS lookup — approximate nearest-neighbor search over a large vector store (MemoryBank, ExpeL)

M̂_t^k + c_{t+1}^k → LLM → a_{t+1}^k

a_{t+1}^k = LLM{ M̂_t^k ∥ c_{t+1}^k }

The retrieved memory M̂_t^k is concatenated with the action context c_{t+1}^k and passed to the LLM. The LLM then generates the next action a_{t+1}^k, which feeds back into the environment to produce the next observation, completing the loop.

This is where memory makes a difference: the same LLM, given retrieved context about a prior failure or a retrieved fact from Wikipedia, will generate a better action than it would without that context. The quality of every upstream operation (Write, Manage, Read) directly determines the quality of this final generation.
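Putting the four stages together, one step of the loop can be written as a single function that composes W, P, R, and the LLM call. Everything here is a sketch under stated assumptions: `agent_step` is a hypothetical name, the operators are passed in as plain callables, and the "LLM" is a stub that only reports how many lines of prompt it received.

```python
from typing import Callable, List, Tuple


def agent_step(llm: Callable[[str], str],
               write: Callable[[str, str], str],
               manage: Callable[[List[str], str], List[str]],
               read: Callable[[List[str], str], List[str]],
               store: List[str],
               action: str, observation: str, context: str) -> Tuple[str, List[str]]:
    entry = write(action, observation)         # m_t   = W({a_t, o_t})
    new_store = manage(store, entry)           # M_t   = P(M_{t-1}, m_t)
    retrieved = read(new_store, context)       # M_hat = R(M_t, c_{t+1})
    prompt = "\n".join(retrieved + [context])  # M_hat concatenated with c_{t+1}
    return llm(prompt), new_store              # a_{t+1} and the updated store


# Toy stand-ins for each operator (assumptions for illustration only)
toy_write = lambda a, o: f"{a} -> {o}"
toy_manage = lambda store, m: store + [m]
toy_read = lambda store, c: store[-2:]  # "retrieve" the two newest entries
toy_llm = lambda prompt: f"next action given {len(prompt.splitlines())} lines"

next_action, store = agent_step(
    toy_llm, toy_write, toy_manage, toy_read,
    [], "mine stone block", "+1 stone added to inventory",
    "What should I do next?",
)
```

Swapping any one callable (say, a similarity-based `read` for the recency-based one here) changes the memory design without touching the rest of the loop, which is the practical payoff of the unified formulation.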

Architecture Tour

Key Systems

Six landmark agent architectures that illustrate different memory design philosophies.

Generative Agents · Park et al. (2023) · Stanford
Reflexion · Shinn et al. (2023) · Verbal RL
MemGPT · Packer et al. (2023) · OS-Inspired
MemoryBank · Zhong et al. (2023) · Dense Retrieval
ExpeL · Zhao et al. (2024) · Experience Learning
Voyager · Wang et al. (2023) · Minecraft

Memory Reading Deep Dive

Retrieval Strategies

How an agent decides which memories are relevant shapes both performance and computational cost.

Not all stored memories are equally useful for any given query. The retrieval function R must identify which subset of M is worth loading into the prompt. The survey identifies three main retrieval paradigms, each with distinct tradeoffs.

Query Context c_t+1

"What do I need to craft a sword in Minecraft?"

Memory Store: Cosine Similarity Scores

0.92 · "Crafted wooden pickaxe using 3 planks + 2 sticks"
0.87 · "Smelted iron ore into 2 iron ingots at furnace"
0.44 · "Found a village to the east, traded wheat for emeralds"
0.21 · "Built a shelter from oak logs near spawn point"

Top-2 Retrieved (threshold: 0.7)

Memories about crafting tools and iron ingots are retrieved. The agent now knows it has two iron ingots and basic crafting knowledge, which is enough to plan an iron sword (two ingots plus one stick).

Similarity-based retrieval embeds both query and memories using a shared encoder. Returns top-K by cosine score. Fast with FAISS approximate search. Used by: MemoryBank, RET-LLM, ExpeL.
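A minimal version of this retrieval step, using only the standard library, might look like the following. The two-dimensional vectors are made-up placeholders for real encoder outputs, and `top_k` is an illustrative helper; a production system would use a learned encoder and an index such as FAISS instead of a linear scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_u * norm_v)


def top_k(query: Sequence[float],
          memories: List[Tuple[str, Sequence[float]]],
          k: int = 2, threshold: float = 0.0) -> List[Tuple[float, str]]:
    """Score every memory against the query, filter, and keep the best k."""
    scored = [(cosine(query, vec), text) for text, vec in memories]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Hypothetical embeddings, chosen only to make the arithmetic easy to follow
memories = [
    ("crafting note", (1.0, 0.0)),
    ("smelting note", (1.0, 1.0)),
    ("shelter note", (0.0, 1.0)),
]
hits = top_k((1.0, 0.0), memories, k=2, threshold=0.7)
```

With these toy vectors the shelter note scores 0 and is filtered out by the 0.7 threshold, leaving the crafting and smelting notes in that order.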

Query Context c_t+1

"What should I do at the party this evening?"

Memory Store: Multi-Criterion Scoring (Generative Agents)

0.91 · "Alice invited me to a party tonight" (recent: 0.9, relevance: 0.95, importance: 0.8)
0.82 · "I enjoy dancing and meeting new people at social events" (relevance: 0.75, importance: 0.9)
0.38 · "I went to the grocery store 3 days ago" (recent: 0.2, relevance: 0.1)

Weighted Score: α·Relevance + β·Recency + γ·Importance

Multi-criterion scoring prevents purely old-but-important memories from dominating and purely recent-but-trivial memories from flooding the context. Generative Agents use all three signals with equal weights.

Generative Agents combine three normalized scores (relevance, recency, importance) to select memories. Each dimension is normalized 0–1 across the memory store. Recency uses exponential decay on last-access time.
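The scoring rule can be sketched as below. The structure (min-max normalization, exponential recency decay, equal weights) follows the description above; the specific decay constant, the field names, and the sample values are assumptions for illustration and are not the numbers shown in the example.

```python
from typing import Dict, List, Tuple


def recency_score(hours_since_access: float, decay: float = 0.995) -> float:
    # Exponential decay on time since last access; decay value is an assumption
    return decay ** hours_since_access


def min_max(values: List[float]) -> List[float]:
    """Rescale values to the 0-1 range across the memory store."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # all equal: no signal, treat as tied
    return [(v - lo) / (hi - lo) for v in values]


def rank(memories: List[Dict], alpha: float = 1.0, beta: float = 1.0,
         gamma: float = 1.0) -> List[Tuple[float, str]]:
    relevance = min_max([m["relevance"] for m in memories])
    recency = min_max([recency_score(m["hours_since_access"]) for m in memories])
    importance = min_max([m["importance"] for m in memories])
    scored = [
        (alpha * rel + beta * rec + gamma * imp, m["text"])
        for m, rel, rec, imp in zip(memories, relevance, recency, importance)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical memory store
store = [
    {"text": "grocery trip", "relevance": 0.10, "hours_since_access": 72, "importance": 2},
    {"text": "party invite", "relevance": 0.95, "hours_since_access": 1, "importance": 8},
    {"text": "enjoys dancing", "relevance": 0.75, "hours_since_access": 48, "importance": 9},
]
ranking = rank(store)
```

Because each signal is normalized before weighting, no single criterion can dominate just by having a larger raw scale, such as importance on a 1 to 10 scale versus cosine relevance between 0 and 1.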

Query Context c_t+1

"How many times did I buy coffee in the last week?"

LLM-Generated SQL Query (ChatDB approach)

SELECT COUNT(*) FROM memory WHERE category = 'purchase' AND item LIKE '%coffee%' AND date >= date('now', '-7 days')

Structured Query Result

Result: 4 coffee purchases found. The LLM receives this structured result as context, enabling precise quantitative answers that would be impossible with fuzzy similarity search.

ChatDB stores memories as relational database records. The agent generates SQL queries to retrieve specific information. Enables precise quantitative lookups, joins, and aggregations — impossible with embedding-based search. Trades flexibility for precision.
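To make the structured approach concrete, the sketch below runs the same query against an in-memory SQLite database using Python's built-in `sqlite3` module. The `memory` table schema and the sample rows are assumptions; in ChatDB the SQL text would be generated by the LLM rather than hard-coded.

```python
import sqlite3

QUERY = (
    "SELECT COUNT(*) FROM memory "
    "WHERE category = 'purchase' AND item LIKE '%coffee%' "
    "AND date >= date('now', '-7 days')"
)


def count_recent_coffee(conn: sqlite3.Connection) -> int:
    """Execute the query and return the single aggregate value."""
    return conn.execute(QUERY).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory (category TEXT, item TEXT, date TEXT)")

# Hypothetical rows: (category, item, days ago)
rows = [
    ("purchase", "iced coffee", 1),
    ("purchase", "coffee beans", 3),
    ("purchase", "tea", 2),          # wrong item
    ("purchase", "coffee", 30),      # too old
    ("visit", "coffee shop", 1),     # wrong category
]
for category, item, days_ago in rows:
    # Let SQLite compute the ISO date so it matches date('now', ...) in QUERY
    conn.execute(
        "INSERT INTO memory VALUES (?, ?, date('now', ?))",
        (category, item, f"-{days_ago} days"),
    )

coffee_count = count_recent_coffee(conn)
```

Only two rows satisfy all three conditions, so the count is exact in a way that a top-K similarity search over the same five memories could not guarantee.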

Measuring Memory

Evaluation Landscape

How do you know if memory is working? The survey identifies direct and indirect evaluation approaches: five direct metrics, plus task performance as an indirect signal.

Coherence · Direct · Subjective
Rationality · Direct · Subjective
Result Correctness · Direct · Objective
Reference F1 · Direct · Objective
Time & Hardware Cost · Direct · Objective
Task Performance · Indirect

The Evaluation Gap

Most papers evaluate memory only indirectly — by measuring task success — without isolating the memory component's contribution. This makes it hard to compare memory architectures fairly. A system that performs well on HotpotQA might succeed because of a better prompt, not better memory. Direct memory evaluation (coherence, F1) is rarer but more informative.
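One way to evaluate the read step directly is to compare the set of retrieved memory IDs against a hand-labeled reference set and compute F1. The function below is a generic precision/recall/F1 calculation; treating memory entries as integer IDs and the name `retrieval_f1` are assumptions made for this sketch.

```python
from typing import Iterable


def retrieval_f1(retrieved: Iterable[int], reference: Iterable[int]) -> float:
    """F1 between the retrieved memory IDs and the reference (gold) IDs."""
    retrieved_set, reference_set = set(retrieved), set(reference)
    true_positives = len(retrieved_set & reference_set)
    if true_positives == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = true_positives / len(retrieved_set)
    recall = true_positives / len(reference_set)
    return 2 * precision * recall / (precision + recall)


# Retrieved 3 memories, 2 of which are among the 4 reference memories:
# precision = 2/3, recall = 1/2, F1 = 4/7
score = retrieval_f1([1, 2, 3], [2, 3, 4, 5])
```

A score like this isolates the memory component: it can drop even when end-task accuracy stays flat, which is the signal that indirect evaluation misses.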

Open Problems

Challenges & Future Directions

Seven open problems define the frontier of agent memory research.

Underdeveloped Parametric Memory

Fine-tuning and knowledge editing methods receive far less attention than textual memory in the agent literature. Catastrophic forgetting, high training costs, and the difficulty of online updates remain unsolved. Future work should explore efficient knowledge editing and lightweight personalization methods.

No Lifelong Learning Framework

No current agent architecture cleanly handles indefinite continuous operation — accumulating memory over months of deployment without degrading or requiring periodic wipes. Lifelong memory with graceful forgetting of irrelevant content while retaining critical long-term facts is an open challenge.

Multi-Agent Memory Undefined

When multiple agents share an environment, how should they share, merge, or reconcile their individual memories? No standard protocol exists for shared memory banks, distributed memory consistency, or consensus mechanisms for conflicting agent beliefs. MetaGPT and ChatDev touch on this but don't fully solve it.

Memory Contradictions

A single agent can simultaneously have inside-trial, cross-trial, and external knowledge that contradict each other ("The user said they dislike coffee" vs. a retrieved database entry "User ordered coffee 4 times this week"). No robust strategy exists for detecting contradictions and deciding which source to trust.

Position Bias & Context Efficiency

LLMs attend unevenly to different positions in long contexts — information at the beginning and end is weighted more than the middle ("lost in the middle" phenomenon). This creates systematic retrieval failure even when the right memory is loaded. Long-context architectures need position-aware attention solutions.

Embodied & Multimodal Memory

Physical-world agents (robotics, AR glasses) need to integrate sensor streams, images, and spatial coordinates into memory. Pure text memory is insufficient. Multimodal memory systems that handle vision + language + spatial context are in early stages, with no unified framework.

Knowledge Reliability & Privacy

External knowledge sources can be inaccurate, biased, or contain private user data. Agents that blindly store and retrieve external content risk propagating misinformation. Privacy-preserving memory architectures and source credibility scoring remain largely unexplored research directions.
