Memory Management
Also known as: Agent Memory, Long-Term Memory, Working Memory + Long-Term Memory
Tier the agent into working, episodic, and semantic stores it can read and edit.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| +Apply when conversations or task threads span multiple sessions and the agent must recall prior interactions, learned facts, or user preferences after the context window resets. | −When the entire interaction fits comfortably in one context window and never needs to outlive the session, the tiering pays no rent and adds a write path that can fail silently. |
| +Use where the working set exceeds the context window in tokens or in cost — long-running assistants, project-scoped coding agents, customer support copilots that accumulate ticket history. | −Without a write policy and an eviction policy, the long-term store grows monotonically and retrieval pulls in stale or contradictory facts the agent will dutifully act on. |
| +Reach for it when state has clear schema: episodes you can timestamp, facts you can normalise, procedures you can name and retrieve by intent. | −When the data the agent would persist is regulated or sensitive and the storage layer has no per-user namespace, encryption, or deletion path, the pattern becomes a compliance liability rather than a feature. |
| +Prefer it when retention, eviction, and namespacing matter to the product — per-user memory in a multi-tenant assistant, redaction on request, audit trails on what the agent learned and when. | |
In the wild
| Source | Claim |
|---|---|
| langchain-ai.github.io → | LangGraph ships a long-term memory store keyed by namespace tuple and thread id, exposing get/put/search primitives that agents call inside graph nodes — the canonical worked example of the persistent-store half of this pattern. |
| mastra.ai → | Mastra documents an agent memory layer that combines a recent-messages working window, a semantic-recall index for older turns, and a working-memory document the agent rewrites between sessions — a full three-tier implementation behind a single API surface. |
| github.com → | Letta (the company spun out of the MemGPT paper) ships an open-source server whose agents page facts in and out of an external store under the model’s own control, exposing the OS-style memory hierarchy as a runnable product. |
Reader gotcha
A long-term memory the model writes to is also a long-term memory an attacker can write to. Documented prompt-injection demos persist hostile instructions into ChatGPT memory through a single conversation, and the agent then re-reads them on every later turn — including after the operator believes the session ended. Treat any tool that writes durable memory as untrusted input, namespace by user, and confirm writes through a reviewer for sensitive scopes.
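That defence can be sketched as a guard on the write path. The names here (`guardMemoryWrite`, the regex list) are illustrative assumptions, not any library's API: writes that cross a user namespace are rejected outright, and writes that read like instructions rather than facts are parked for human review.

```typescript
// Hypothetical guard for durable-memory writes: enforce per-user
// namespacing and flag instruction-like content for review.
type WriteDecision = { allowed: boolean; needsReview: boolean; reason?: string }

// Toy heuristics; a production filter would be broader and tested.
const INJECTION_PATTERNS = [
  /ignore (all|previous) instructions/i,
  /always (respond|reply|answer)/i,
  /from now on/i,
]

function guardMemoryWrite(
  sessionUserId: string,
  targetUserId: string,
  value: string,
): WriteDecision {
  // A session may only write into its own namespace.
  if (sessionUserId !== targetUserId) {
    return { allowed: false, needsReview: false, reason: 'cross-user write' }
  }
  // Content that looks like an instruction rather than a fact goes to review.
  if (INJECTION_PATTERNS.some((p) => p.test(value))) {
    return { allowed: false, needsReview: true, reason: 'instruction-like content' }
  }
  return { allowed: true, needsReview: false }
}
```

The guard sits in front of the `remember` tool's execute step, so the model never writes directly to the store.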
Implementation sketch
```ts
import { generateText, tool } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'

// In-memory stand-ins for the episodic and semantic tiers.
type Episode = { ts: number; userId: string; text: string }
type Fact = { userId: string; key: string; value: string }

const episodes: Episode[] = []
const facts: Fact[] = []

// Read path: durable facts plus the most recent matching episodes.
const recall = tool({
  description: 'Retrieve facts and recent episodes for the current user.',
  parameters: z.object({ userId: z.string(), query: z.string() }),
  execute: async ({ userId, query }) => ({
    facts: facts.filter((f) => f.userId === userId),
    episodes: episodes
      .filter((e) => e.userId === userId && e.text.includes(query))
      .slice(-5),
  }),
})

// Write path: persist a fact into the semantic tier.
const remember = tool({
  description: 'Persist a durable fact about the user.',
  parameters: z.object({ userId: z.string(), key: z.string(), value: z.string() }),
  execute: async ({ userId, key, value }) => {
    facts.push({ userId, key, value })
    return { ok: true }
  },
})

async function turn(userId: string, message: string): Promise<string> {
  // Every turn is appended to the episodic log before the model runs.
  episodes.push({ ts: Date.now(), userId, text: message })
  const { text } = await generateText({
    model: openai('gpt-4o'),
    tools: { recall, remember },
    maxSteps: 4, // allow tool calls plus a final answer within one turn
    prompt: `User ${userId}: ${message}`,
  })
  return text
}

export { turn }
```
Libraries
- LangGraph
- Mastra
- LangChain
References
- Packer et al. · 2023 · DOI: 10.48550/arXiv.2310.08560
  foundational MemGPT paper; frames memory as paging between fast and slow tiers under model control
- Park et al. · 2023 · UIST 2023 · DOI: 10.48550/arXiv.2304.03442
  introduces the episodic / semantic / reflection memory tiers most agent stacks now mirror
- Anthropic · 2024 · Prompt caching documentation · accessed
  caches the working-tier prefix between turns; the cheap optimisation the pattern enables
- LangChain team · 2024 · accessed
- Mastra team · 2024 · accessed
- Antonio Gulli · 2026 · Springer · pp. 122–142
Overview · 1-paragraph mechanism
Memory Management treats the agent's state as a tiered system rather than a single ever-growing prompt. The working tier is whatever currently fits in the context window: the system message, the most recent turns, the active tool outputs. Around it sit one or more out-of-context stores the agent reads from and writes to: an episodic log keyed by session or task, a semantic store that holds learned facts about the user or the world, and sometimes a procedural store of routines the agent has learned to invoke. The tiers are connected by explicit operations the agent calls — append, summarise, retrieve, evict — rather than by hidden window-management heuristics.
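The explicit operations above can be sketched as an interface over an in-memory store. The names (`MemoryStore`, `append`, `retrieve`, `summarise`, `evict`) are illustrative, not taken from any particular framework:

```typescript
// Illustrative tier operations; a real store would be namespaced
// per user and backed by a database, not a Map.
interface Episode { ts: number; text: string }

interface MemoryStore {
  append(ns: string, e: Episode): void                            // episodic write
  retrieve(ns: string, query: string): Episode[]                  // read into the working tier
  summarise(ns: string, fold: (es: Episode[]) => string): string  // collapse episodes into a fact
  evict(ns: string, olderThan: number): number                    // drop stale episodes, return count
}

class InMemoryStore implements MemoryStore {
  private log = new Map<string, Episode[]>()

  append(ns: string, e: Episode): void {
    const es = this.log.get(ns) ?? []
    es.push(e)
    this.log.set(ns, es)
  }

  retrieve(ns: string, query: string): Episode[] {
    return (this.log.get(ns) ?? []).filter((e) => e.text.includes(query))
  }

  summarise(ns: string, fold: (es: Episode[]) => string): string {
    return fold(this.log.get(ns) ?? [])
  }

  evict(ns: string, olderThan: number): number {
    const es = this.log.get(ns) ?? []
    const kept = es.filter((e) => e.ts >= olderThan)
    this.log.set(ns, kept)
    return es.length - kept.length
  }
}
```

The point of making the operations explicit is that each one becomes a place to enforce policy: `append` can redact, `retrieve` can scope by namespace, `evict` can honour a TTL.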
Background · context and trade-offs
The pattern earns its keep when conversations or task threads outlive a single context window. MemGPT framed this as paging between fast and slow memory under the agent's own control; LangGraph and Mastra ship the same idea as a store keyed by `(thread_id, namespace)` that survives process restarts. The episodic tier records what happened. The semantic tier records what was learned. A summariser collapses old episodes into stable facts so the working tier stays small. Without the summariser the long-term store grows without bound and retrieval drowns in stale context; without the episodic log the agent relearns the same lesson every session.
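The summarisation step can be sketched deterministically. In production the fold over old episodes would be an LLM call; here a toy `key: value` extraction stands in for it so the mechanics are visible: old episodes become durable facts, recent ones stay in the working set.

```typescript
// Compact the episodic log: episodes older than `cutoff` are folded
// into facts and dropped; recent episodes are kept verbatim.
type Episode = { ts: number; text: string }
type Fact = { key: string; value: string; learnedAt: number }

function compact(episodes: Episode[], cutoff: number): { facts: Fact[]; kept: Episode[] } {
  const old = episodes.filter((e) => e.ts < cutoff)
  const kept = episodes.filter((e) => e.ts >= cutoff)
  // Toy extraction: any "key: value" episode becomes a fact.
  // A real summariser would prompt a model over the old episodes.
  const facts: Fact[] = []
  for (const e of old) {
    const m = e.text.match(/^(\w+): (.+)$/)
    if (m) facts.push({ key: m[1], value: m[2], learnedAt: e.ts })
  }
  return { facts, kept }
}
```

Run periodically (or when the working tier nears its token budget), this is what keeps the long-term store from growing without bound.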
The pattern is distinct from Reflexion's memory layer and from RAG. Reflexion stores critiques across attempts of the same task — a narrow loop. RAG retrieves over a corpus the agent did not write. Memory Management is the broader case: the agent reads and writes its own state across sessions, and the schema is designed up front (what counts as an episode, what counts as a fact, when to evict). The cost is operational: someone decides retention and TTL, namespaces by user, prevents prompt-injection writes, and reconciles contradictory facts written months apart. Most production failures live in those choices, not in the retrieval itself.
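Reconciling contradictory facts is one of those operational choices. A common minimal policy, assumed here rather than mandated by the pattern, is last-write-wins per `(userId, key)`, keeping the write timestamp so an audit trail of what the agent learned and when survives:

```typescript
// Last-write-wins reconciliation keyed by (userId, key).
type Fact = { userId: string; key: string; value: string; writtenAt: number }

function reconcile(facts: Fact[]): Map<string, Fact> {
  const latest = new Map<string, Fact>()
  for (const f of facts) {
    const k = `${f.userId}/${f.key}`
    const cur = latest.get(k)
    // Keep whichever write is newer; older versions can be archived
    // rather than deleted if the audit trail needs them.
    if (!cur || f.writtenAt > cur.writtenAt) latest.set(k, f)
  }
  return latest
}
```

Alternatives include asking the model to merge conflicting values at read time, or surfacing the conflict to a reviewer; last-write-wins is simply the cheapest policy that is still deterministic.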