AI Agent Memory — How to Make Your Chatbot and Agent Remember Users Across Sessions
AI agent memory is the mechanism that lets a language model retain information beyond a single conversation — because the LLM itself is stateless and forgets everything once the session ends. In practice you build it at two levels: short-term memory is managing the context window within one conversation (through summaries and compression), and long-term memory is storing facts and events in an external database (usually a vector store) and retrieving them in later sessions using the RAG pattern. If your agent needs to remember a customer's preferences, the context of previous tickets, or conclusions from earlier analyses — you need long-term memory, not a bigger context window. Ready-made frameworks (Mem0, Zep, Letta, LangMem) give you this without building from scratch.
The complete guide to AI agent memory: why LLMs are stateless, how working, episodic, semantic and procedural memory differ, how to manage the context window through compression and summarization, how the Mem0, Zep, Letta and LangMem frameworks work, how to implement memory in code and n8n, and how to reconcile it with GDPR.
A customer writes to your chatbot: "the same problem as last time again." The agent replies: "Could you describe the problem?" The customer already knows they are talking to a machine with no memory — and trust evaporates. Meanwhile a human support agent would open the history and say: "I see you reported this two weeks ago, let's check whether the fix worked."
That difference — remembering context between conversations — separates a demo toy from an agent you can deploy in a company. And because a language model inherently remembers nothing between calls, memory has to be designed and built separately. This article shows how: from memory types, through context management and frameworks, to code, n8n and GDPR.
Why an LLM does not remember — the statelessness problem
Every call to a language model is independent. The model has no "state" between requests — it takes text in, returns text out and immediately forgets. The impression that ChatGPT "remembers" a conversation is an illusion: each time, the application sends the entire conversation history so far to the model as part of the prompt. It is not the model remembering — it is the application re-attaching the history.
This mechanism works up to a point, then hits three hard walls:
- Context window limit — even models with 200k–1M token windows have an upper bound; a long conversation or large documents eventually will not fit
- Cost and latency — you pay for every input token on every call; re-attaching the full history to each request means cost grows quadratically with conversation length
- No continuity across sessions — when the user closes the chat and returns tomorrow, the history is empty; without external storage the agent starts from zero
AI agent memory solves all three: instead of re-attaching everything, it stores information externally and retrieves only what is relevant to the current decision. This is exactly the same idea as RAG for company knowledge (I covered it in the article on building a knowledge base) — just applied to interaction history and facts about the user.
The four types of AI agent memory
/// AI AGENT MEMORY ARCHITECTURE
4 types of agent memory — each lives elsewhere
Research on agent architecture (including "Cognitive Architectures for Language Agents") distinguishes four memory types borrowed from cognitive psychology. Understanding the differences is crucial, because each type needs a different store and a different retrieval strategy — and the most common mistake is dumping everything into one vector database.
- Working memory — the active context window: the current conversation, loaded files, tool results from this session. You manage it like a token budget, not a search problem — through compression and prioritization, not similarity search
- Episodic memory — a record of what the agent did and when: conversation traces, decisions made, action trails. Used for auditing, debugging and learning from history. The key is chronological storage, not similarity search
- Semantic memory — facts about the world, the user and the domain: customer preferences, industry knowledge, company data. RAG was built for this — content-similarity retrieval is the right approach here
- Procedural memory — how to perform a task: learned procedures, reusable skills, routines. In practice stored in system prompts, tool definitions and the agent's code
The most important design rule: do not mix episodic memory (event logs) with semantic memory (facts) in one vector index. Similarity search over event logs degrades retrieval quality for both — a log "user clicked X at 14:32" and a fact "user prefers email contact" require completely different access strategies.
Short-term memory — managing the context window
Short-term memory is managing what fits in the context window during one session. As a conversation or an agent's actions grow longer, tool observations can consume 70–80% of the token budget — and they need intelligent reduction. Here are the main techniques:
| Technique | How it works | When to use |
|---|---|---|
| Sliding window | Keep the last N turns in full, discard older ones | Simple chats where only fresh context matters |
| Rolling summary | Last N turns in full + a concise summary of everything older | Long conversations where early context still matters |
| Compaction | At a token threshold an LLM compresses history, preserving decisions | Multi-step agents with many tool calls |
| Prompt compression | Token-level pruning (e.g. LLMLingua) removes low-information tokens | When you need maximum reduction while keeping content |
| Tool result limiting | Cap tool response length before it enters the context | Tools returning large JSON blobs or whole documents |
In production, two approaches are separated. "Prevention" agents structurally bound context growth — they limit message scope and trim tool results immediately. "Cure" agents let context grow and compress only past a token threshold, triggering LLM-based summarization. For most business use cases a rolling summary is enough: keep the last 8–10 exchanges in full and maintain everything older as a living, updated summary.
Watch the trap: compression is lossy. Every summary loses detail, and if the agent summarizes summaries, after a few iterations it is left with a vague caricature of the conversation. That is why you should extract important facts (order number, agreements, decisions) into long-term memory as structured data before they go under the compression knife.
Long-term memory — how an agent remembers across sessions
Long-term memory lives outside the context window — in an external database, usually a vector store — and survives closing the chat, restarting the server or the user returning a week later. Its lifecycle has three phases:
- 1.Write — during or after a conversation the agent extracts important facts and stores them. Key: you do not store the raw transcript, but extracted, structured information ("customer X prefers courier delivery", "project Y has a June 30 deadline")
- 2.Retrieve — before the agent responds, it searches semantic memory for facts relevant to the current query and injects them into the context window. This is the core RAG pattern: keep knowledge outside, pull in only what is needed
- 3.Update — when new information contradicts old information, memory must be updated, not just appended. Otherwise you accumulate conflicting facts ("prefers email" and "prefers phone") and the agent will guess
The last phase is the hardest and most often skipped. Good memory frameworks do deduplication and conflict resolution automatically — they detect that a new fact replaces an old one and overwrite it instead of multiplying versions. That is why it is worth reaching for a ready-made framework instead of building memory on a raw vector database: writing and retrieving you can code in an hour, but managing the lifecycle of facts is months of refinement.
Memory frameworks: Mem0, Zep, Letta, LangMem
/// MEM0 vs ZEP vs LETTA vs LANGMEM — WHICH MEMORY FRAMEWORK?
By 2026 the agent memory market has matured enough that you do not have to build it yourself. Four frameworks dominate, each with a different architectural approach:
| Framework | Architecture | Strength | Best for |
|---|---|---|---|
| Mem0 | Vectors + optional graph | Fastest start, ~48k stars, up to 80% token reduction | Most companies — the default choice |
| Zep (Graphiti) | Temporal knowledge graph | Top accuracy (63.8% LongMemEval), relationships over time | Apps with entities and relations (CRM, contact networks) |
| Letta (MemGPT) | Self-editing memory in an agent runtime | Full control over a long-running agent's state | Complex, autonomous agents |
| LangMem | Memory SDK for LangChain | Native LangGraph integration | Teams already building on LangChain |
The decision rule in practice:
- Start with Mem0 unless you have a strong reason to choose otherwise — lowest entry barrier, the managed SaaS removes graph-database and scaling overhead, and self-hosting is available for privacy requirements
- Choose Zep when understanding relationships and how they change over time is key — who works with whom, how preferences evolve, links between entities; here a temporal graph beats a plain vector store
- Consider Letta when you are building an autonomous agent running for hours or days and need memory as a first-class runtime element, not a library bolted on the side
- Use LangMem if your stack is already LangGraph — you get memory tooling that fits your existing orchestration without adding a new dependency
There is no single winner — the choice depends on whether you value deployment speed (Mem0), accuracy and relationships (Zep), control (Letta) or stack consistency (LangMem).
How to implement memory in code — a Mem0 example
The simplest route is Mem0. Below is a customer support agent that writes and retrieves memory per user — just a few lines beyond a normal model call:
from mem0 import Memoryfrom openai import OpenAImemory = Memory()client = OpenAI()def chat(user_id: str, message: str) -> str: # 1. Retrieve facts relevant to the current query relevant = memory.search(query=message, user_id=user_id, limit=5) context = "\n".join(m["memory"] for m in relevant["results"]) # 2. Inject memory into the system prompt system = ( "You are a customer support assistant. " "Known facts about the user:\n" + (context or "(none)") ) resp = client.chat.completions.create( model="gpt-4o-mini", messages=[ {"role": "system", "content": system}, {"role": "user", "content": message}, ], ) answer = resp.choices[0].message.content # 3. Store new facts from this exchange (extraction happens automatically) memory.add( messages=[ {"role": "user", "content": message}, {"role": "assistant", "content": answer}, ], user_id=user_id, ) return answerThree things that make a difference here:- **user_id isolates memory** — each customer has their own space; Mem0 never mixes two users' facts, which is critical for privacy- **memory.add() does not store the raw conversation** — under the hood an LLM extracts facts worth remembering and discards small talk; you do not clutter the database with "hello" and "thanks"- **search() retrieves only the top 5** — you inject a few of the most relevant facts into context, not the whole history; cost and latency stay constant no matter how long the customer has been with youFor self-hosting (article #40) you swap OpenAI for a local Ollama or vLLM endpoint, and configure Mem0 to use a local model for extraction and a local vector store — all memory stays in your infrastructure.
Memory in n8n and no-code tools
Not every deployment needs code. In n8n the AI Agent node has built-in memory options you configure by clicking:
- Simple Memory (window buffer) — keeps the last N messages in instance memory; short-term memory within one workflow, the simplest but volatile (gone after a restart)
- Memory in an external database — connect Postgres, Redis or a vector store as the store; memory survives restarts and works across sessions
- Mem0 / Zep via an HTTP node or a dedicated integration — full long-term memory with fact extraction, without writing code
The key in n8n is the session key — the identifier the agent uses to know whose memory to load. Usually it is a CRM customer ID, a phone number or an email address. Without a stable session key every conversation is anonymous and memory does not work across sessions. This is exactly the same mechanism as user_id in code — just set in the interface.
For simple cases (a FAQ assistant remembering context within one conversation) Simple Memory is enough. For an agent that should recognize a returning customer and their history — you need memory in an external database with a sensible session key.
Memory security and privacy (GDPR)
Agent memory is by definition a store of personal data — preferences, contact history, sometimes sensitive data. That places it squarely under GDPR and requires designing compliance from the start, not after the fact:
- Right to be forgotten (Art. 17) — you must be able to delete a specific user's entire memory on request; this is why isolation by user_id is not just good practice but a requirement — without it you cannot erase one person's data without touching others'
- Data minimization (Art. 5) — store only facts needed for the agent to function; fact extraction (instead of raw transcripts) helps, but configure it so it does not capture sensitive data without a legal basis
- Tenant isolation (multi-tenancy) — in an app serving multiple companies, one client's memory must never leak into another's context; test this deliberately, because an isolation bug is a data breach
- Encryption and data location — for sensitive data consider self-hosting memory (a local vector store + a local extraction model) so data never leaves your infrastructure — this is an argument for self-hosted Mem0/Zep over managed SaaS
- Prompt injection via memory — if the agent writes user-supplied content to memory, an attacker can inject an instruction that executes on the next retrieval; memory is another vector from the prompt injection article (#39) — treat stored content as untrusted data
The practical takeaway: before deploying memory to production, answer the question "how do I delete one user's data on request?" If you do not have a simple answer, the memory architecture is not GDPR-ready.
AI agent memory deployment checklist
- 1.Separate memory types: working (context), episodic (logs), semantic (facts) — do not dump everything into one vector database
- 2.For short-term memory start with a rolling summary: last 8–10 exchanges in full + a living summary of older ones
- 3.Extract important facts into long-term memory as structured data before compression loses them
- 4.Choose a framework: Mem0 as the default, Zep for relationships over time, Letta for autonomous agents, LangMem for a LangGraph stack
- 5.Isolate memory by user_id / session key from day one — it is the foundation of privacy and GDPR
- 6.Configure fact extraction instead of storing raw transcripts — smaller database, lower cost, less personal data
- 7.Handle update and deduplication: a new fact should overwrite a conflicting old one, not pile up beside it
- 8.Retrieve only the top-K relevant facts (3–5) — cost and latency stay constant regardless of history
- 9.For sensitive data consider self-hosting memory (local store + local extraction model)
- 10.Treat stored user content as untrusted data — memory is a prompt injection vector
- 11.Implement per-user memory deletion (right to be forgotten) before production
- 12.Measure quality: does the agent retrieve the right facts? Test on realistic returning-user scenarios
Key takeaways
An LLM is stateless — memory must be built separately, at two levels. Short-term is managing the context window through summaries and compression; long-term is writing facts to an external store and retrieving them with the RAG pattern. Separate the four memory types (working, episodic, semantic, procedural) and do not dump everything into one vector database. Do not build from scratch — Mem0 is the default choice, Zep for relationships over time, Letta for autonomous agents, LangMem for LangGraph. From day one, isolate memory by user_id, extract facts instead of transcripts and plan per-user deletion — because agent memory is a store of personal data under the full GDPR regime and another prompt injection vector.
---
I help companies design and deploy memory for AI agents and chatbots — from choosing the architecture and framework (Mem0, Zep, Letta), through integration with code or n8n, to GDPR compliance and security. Get in touch — I start with a free 30-minute analysis of your use case.
/// RELATED_RECORDS
How AI Reads Invoices from Email and Enters Them into ERP
AI can automatically read an invoice from an email attachment — PDF, scan, or phone photo — and enter the data directly into an ERP system without any manual retyping. Full automation of cost invoice processing: from the mailbox to accounting.
Where to Start with AI Implementation in Your Company
AI implementation starts not with choosing a tool, but with identifying one repetitive process that wastes the most human time. Learn step by step how to select, map, and automate that process.
How to Build a Company Internal Knowledge Base with AI (RAG in Practice)
An internal knowledge base built on RAG lets you create your own company chatbot that answers only from your company's documents — not the model's guesses. Safe, up-to-date, precise AI with full control over your data.
Signal received?
Terminate
Silence
Initiate protocol. Establish connection. Let's build something loud.
