Key takeaways
- The category splits into two approaches — memory layers (Mem0, Letta, Zep, Hindsight, Cognee, Supermemory) that persist knowledge across sessions, and evolution engines (ACE, Agentic Context Engine) that actively improve instructions from execution feedback
- The money arrived: Mem0 closed a $24M Series A, Cognee raised $7.5M from Pebblebed, and new entrants Hindsight (16K stars in 7 months) and Supermemory (26.8K stars) crowded the benchmark wars further
- The existential event is first-party: Anthropic's "Dreaming" (May 2026) reviews agent sessions between runs, extracts patterns, and rewrites memory stores — a built-in version of what this category sells
- LangGraph lock-in is the hidden cost — LangMem requires LangGraph, while Mem0, Zep, Hindsight, and Cognee are framework-agnostic. Choose based on your existing stack, not just features.
FAQ
What's the best memory layer for AI agents?
Mem0 for framework-agnostic simplicity with a managed SaaS option. Letta for full stateful agent runtime with tiered memory. Zep for temporal knowledge graphs and relationship tracking. LangMem if you already use LangGraph.
Can AI agents actually self-improve?
Yes, with caveats. Stanford's ACE research showed a 10.6% improvement through evolving contexts. But self-improvement requires persistent memory and consistent outcome recording, and most agents today still operate statelessly.
What's the difference between agent memory and self-improvement?
Memory persists facts and context across sessions (who the user is, what happened before). Self-improvement goes further — the agent changes its own instructions and strategies based on what worked and what failed.
Are these tools production-ready?
Mem0 and Zep have commercial SaaS offerings used in production. Letta has a managed platform. The evolution engines (ACE, Agentic Context Engine) are earlier stage — promising research with limited production testimonials.
Executive Summary
AI agents forget everything between sessions. This category exists because that's a problem.
Agent self-improvement tools solve two related challenges: memory (persisting knowledge across sessions) and evolution (agents that actively improve their own instructions based on execution outcomes). The space splits cleanly into memory layers that store what happened and evolution engines that learn from what happened.
10 tools reviewed: Mem0 (58.4K ⭐, $24M raised), Zep (Graphiti 27.3K ⭐), Supermemory (26.8K ⭐), Letta (23.3K ⭐), Cognee (17.7K ⭐, $7.5M seed), Hindsight (16.2K ⭐), Agentic Context Engine (2.3K ⭐), LangMem (1.5K ⭐), ACE SaaS (hosted), Microsoft Amplifier (3.1K ⭐, DISCOVERIES.md pattern)
The category is young — most tools launched in 2025 — and the benchmark wars are already heated. Mem0, Letta, Zep, and now Hindsight have each published benchmarks showing they outperform the others, making independent evaluation essential. And the first-party threat arrived: Anthropic's "Dreaming" (May 2026) builds session-reviewing, memory-rewriting self-improvement directly into Claude Managed Agents — Harvey reportedly saw ~6x task-completion gains.
Which Tool Should You Use?
If you want a drop-in memory layer with minimal setup → Mem0. Framework-agnostic, managed SaaS option, largest community. Watch out: benchmark claims are disputed by competitors.
If you need a full stateful agent runtime → Letta. Not just memory — it's an entire agent platform with tiered memory (core/recall/archival). Watch out: heavier than a memory layer; you're adopting a runtime, not a library.
If you need temporal reasoning and relationship tracking → Zep. Knowledge graph architecture tracks how relationships change over time. Watch out: smaller community than Mem0, credit-based pricing can be opaque.
If you already use LangGraph → LangMem. Native integration, three memory types (semantic/episodic/procedural). Watch out: LangGraph lock-in is real — useless outside the LangChain ecosystem.
If you want memory that learns, not just recalls → Hindsight. Biomimetic retain/recall/reflect over typed memory networks on plain Postgres+pgvector; claims SOTA on LongMemEval (vendor-reported). Watch out: pre-1.0, thin independent validation.
If your memory problem is really a data problem → Cognee. ECL pipeline turns scattered data into a self-improving knowledge graph over relational+vector+graph storage; $7.5M Pebblebed seed, Bayer among 70+ companies. Watch out: critics question cost at scale.
If you want ingestion breadth (PDFs, audio, email) → Supermemory. Single API from raw data to agent memory, 26.8K stars, notable angels. Watch out: solo young founder, vendor-stated benchmarks.
If you want agents that improve their own instructions → ACE SaaS for hosted playbook evolution, or Agentic Context Engine for the open-source equivalent (now multi-runtime with hosted kayba.ai). Watch out: requires consistent outcome recording; the improvement loop only works with data.
The Two Approaches
Memory Layers
Store and retrieve knowledge across sessions. The agent remembers but doesn't change its behavior.
| Tool | Stars | Architecture | Setup | Pricing | Watch Out |
|---|---|---|---|---|---|
| Mem0 | 58.4K | Dual store (vector + graph) | Minutes | Free + $79 Growth / $249 Pro | Disputed benchmarks |
| Zep | 27.3K (Graphiti) | Temporal knowledge graph | Minutes | Annual-only: $1,250+/yr | Thin funding vs rivals |
| Supermemory | 26.8K | Vector-graph + broad ingestion | Minutes | Free + $19-399/mo | Solo founder, vendor benchmarks |
| Letta | 23.3K | Tiered memory (core/recall/archival) | Hours | Free + $20 Pro | Full runtime; converging on Letta Code |
| Cognee | 17.7K | ECL → knowledge graph | Hours | OSS + cloud $35-200/mo | Cost at scale contested |
| Hindsight | 16.2K | Retain/recall/reflect typed networks | Minutes | OSS (MIT) + Vectorize cloud | Pre-1.0, vendor benchmarks |
| LangMem | 1.5K | Three memory types via LangGraph | Minutes | Open source (free) | LangGraph lock-in; slow releases |
Mem0 is the market leader by stars and has YC backing. It offers both a Python SDK and managed SaaS with a claimed 26% accuracy improvement and 90% token savings.
Letta (formerly MemGPT) takes the most ambitious approach: it's not just a memory layer but a full stateful agent platform. The tiered memory system mirrors how humans organize information: core memory for always-available context, recall memory for conversation history, and archival memory for long-term storage.
Zep differentiates with its Graphiti-powered temporal knowledge graph. It doesn't just store facts; it tracks how relationships evolve over time, which matters for enterprise use cases like contract management or customer relationship tracking.
LangMem is the simplest option if you're already in the LangChain ecosystem. It adds semantic, episodic, and procedural memory to LangGraph agents with hot-path and background processing modes.
Evolution Engines
Agents that actively change their own instructions based on execution outcomes.
| Tool | Stars | Approach | Setup | Pricing | Watch Out |
|---|---|---|---|---|---|
| ACE SaaS | N/A | Managed playbook evolution | Minutes | $9-79/mo | Requires outcome recording |
| Agentic Context Engine | 2.3K | Open-source ACE implementation | Hours | Free | Alpha stage |
| Amplifier | 3.1K | DISCOVERIES.md pattern | Hours | Free | Research-only, not production |
The evolution approach is newer and based on Stanford/SambaNova's ACE research. The core insight: instead of fine-tuning models (expensive, slow), improve performance by evolving the context, meaning the instructions and strategies the agent receives. The paper reported a 10.6% improvement on agent benchmarks.
ACE SaaS (aceagent.io) productizes this into a managed service with MCP integration. Agentic Context Engine is the popular open-source implementation with LangChain/LlamaIndex/CrewAI integrations. Amplifier takes a different approach: agents write DISCOVERIES.md files that log solutions so the same mistakes are not repeated.
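The DISCOVERIES.md idea is simple enough to sketch with the standard library. This is a minimal illustration only: the entry layout and the function names `log_discovery` and `find_discoveries` are assumptions made for the example, not Amplifier's actual file format or API.

```python
from __future__ import annotations

from datetime import date
from pathlib import Path


def log_discovery(path: Path, problem: str, solution: str, when: date | None = None) -> str:
    """Append one problem/solution entry to a markdown discoveries log."""
    when = when or date.today()
    # Hypothetical entry layout: a dated "## " heading followed by the fix.
    entry = f"## {when.isoformat()}: {problem}\n\n{solution}\n\n"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


def find_discoveries(path: Path, keyword: str) -> list[str]:
    """Return the headings of entries whose text mentions the keyword."""
    if not path.exists():
        return []
    text = path.read_text(encoding="utf-8")
    hits = []
    # Entries start with "## "; skip whatever precedes the first heading.
    for chunk in text.split("## ")[1:]:
        if keyword.lower() in chunk.lower():
            hits.append(chunk.splitlines()[0])
    return hits
```

An agent would call `find_discoveries` before starting a task and `log_discovery` after resolving a problem, so the file accumulates knowledge across sessions.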
Adoption Metrics
| Tool | Stars | Last Push | Status (June 2026) |
|---|---|---|---|
| Mem0 | 58,400 | Jun 11, 2026 | Active; $24M Series A; AWS Agent SDK provider |
| Zep | 27,300 (Graphiti) | Jun 2026 | Active; moved upmarket, annual-only pricing |
| Supermemory | 26,759 | Jun 2026 | Active; 50K+ app users |
| Letta | 23,270 | May 2026 | Active; converging on Letta Code |
| Cognee | 17,700 | Jun 2026 | Active; v1.1.2, 166 contributors |
| Hindsight | 16,200 | Jun 9, 2026 | Very active; 60 releases in 7 months |
| Microsoft Amplifier | 3,098 | Jun 9, 2026 | Active but flat; research-only |
| Agentic Context Engine | 2,344 | Jun 9, 2026 | Active; v0.12, hosted kayba.ai launched |
| LangMem | 1,500 | Jun 7, 2026 | Maintained; no release in 7 months |
| ACE (open source) | 1,150 | May 19, 2026 | Active; paper revised v3 |
What stands out:
- Everything is alive — unusual for the categories we track; the memory problem is real enough to sustain ten active projects
- The star race compressed — four projects between 16K and 27K, where February had a lone leader
- LangMem has shipped no release in 7 months, yet still draws ~746K monthly downloads by riding LangGraph's distribution
The Benchmark Wars
This category has a credibility problem. Each vendor publishes benchmarks showing they win:
- Mem0 claims 26% higher accuracy and 91% lower latency vs competitors — and rewrote its algorithm in April 2026 claiming further temporal/multi-hop gains
- Letta published rebuttals questioning Mem0's benchmark methodology
- Zep claims 18.5% improvement on LongMemEval, outperforming MemGPT (Letta's predecessor)
- Hindsight claims SOTA on LongMemEval (91.4%), beating Supermemory and GPT-5 baselines
- LangMem benchmarks against its own previous versions, not competitors
A comprehensive survey from December 2025 found that agent memory research is "increasingly fragmented" with no standard evaluation framework. Until the field converges on shared benchmarks, treat all vendor-published numbers with skepticism.
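Until shared benchmarks exist, one practical option is a small in-house harness that feeds the same facts and questions to each candidate. The sketch below assumes a hypothetical `MemoryBackend` adapter interface (`add` and `answer`); no vendor exposes exactly this, so each real tool would need its own wrapper. Substring matching is a deliberately crude scoring rule, and `KeywordBackend` is only a toy baseline.

```python
from dataclasses import dataclass
from typing import Protocol


class MemoryBackend(Protocol):
    """Hypothetical adapter interface; each real tool would need its own wrapper."""

    def add(self, text: str) -> None: ...
    def answer(self, question: str) -> str: ...


@dataclass
class Case:
    facts: list[str]   # statements fed to memory before asking
    question: str
    expected: str      # substring a correct answer must contain


def evaluate(make_backend, cases: list[Case]) -> float:
    """Return the fraction of cases answered correctly, with a fresh backend per case."""
    if not cases:
        return 0.0
    correct = 0
    for case in cases:
        backend = make_backend()
        for fact in case.facts:
            backend.add(fact)
        if case.expected.lower() in backend.answer(case.question).lower():
            correct += 1
    return correct / len(cases)


class KeywordBackend:
    """Toy baseline: returns the stored fact sharing the most words with the question."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def add(self, text: str) -> None:
        self.facts.append(text)

    def answer(self, question: str) -> str:
        q = set(question.lower().split())
        return max(self.facts, key=lambda f: len(q & set(f.lower().split())), default="")
```

Running every candidate through the same `evaluate` call, on cases drawn from your own workload, gives numbers that are at least comparable with each other, which vendor-published figures are not.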
Key Patterns
1. The Memory Hierarchy
Every mature tool implements some version of tiered memory:
- Hot memory — always in context (Letta's core, Mem0's working memory)
- Warm memory — retrievable on demand (Letta's recall, Zep's graph queries)
- Cold memory — archived, searchable (Letta's archival, Mem0's long-term store)
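The three tiers above can be sketched as a toy data structure. This is an illustration under stated assumptions, not any vendor's implementation: the class name `TieredMemory`, the eviction rule (oldest warm turn moves to cold), and keyword search over the cold tier are all choices made for the example.

```python
from collections import deque


class TieredMemory:
    """Toy three-tier store: hot (always in prompt), warm (recent turns), cold (archive)."""

    def __init__(self, warm_size: int = 3) -> None:
        self.hot: dict[str, str] = {}     # pinned facts, always included
        self.warm: deque[str] = deque()   # recent turns, bounded by warm_size
        self.cold: list[str] = []         # evicted turns, searched on demand
        self.warm_size = warm_size

    def pin(self, key: str, value: str) -> None:
        self.hot[key] = value

    def record(self, turn: str) -> None:
        """Add a turn to warm memory, demoting the oldest turn to cold when full."""
        self.warm.append(turn)
        while len(self.warm) > self.warm_size:
            self.cold.append(self.warm.popleft())

    def search_cold(self, keyword: str) -> list[str]:
        return [t for t in self.cold if keyword.lower() in t.lower()]

    def build_context(self, query: str) -> list[str]:
        """Assemble prompt context: hot facts, warm turns, then matching cold turns."""
        hot = [f"{k}: {v}" for k, v in self.hot.items()]
        return hot + list(self.warm) + self.search_cold(query)
```

Production systems typically replace the keyword search with embedding or graph retrieval, but the promotion and demotion logic between tiers follows the same shape.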
2. Graph vs Vector
The architecture divide:
- Vector-first (LangMem): fast similarity search, good for "find relevant context"
- Graph-first (Zep's Graphiti, Cognee's knowledge graph): relationship tracking, good for "how has this changed over time"
- Hybrid (Mem0's dual store): both, at the cost of complexity
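The difference between the two retrieval styles is easiest to see side by side. The sketch below is a toy under stated assumptions: the vectors are hand-written stand-ins for real embeddings, and `Edge`, `nearest`, and `as_of` are hypothetical names, not the API of any tool discussed here.

```python
import math
from dataclasses import dataclass
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list[float], store: dict[str, list[float]]) -> str:
    """Vector-style lookup: return the stored text whose embedding is closest to the query."""
    return max(store, key=lambda text: cosine(query, store[text]))


@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_from: int                  # toy timestamp, e.g. a year
    valid_to: Optional[int] = None   # None means the fact still holds


def as_of(edges: list[Edge], subject: str, relation: str, t: int) -> Optional[str]:
    """Graph-style lookup: what did this relation point to at time t?"""
    for e in edges:
        if e.subject == subject and e.relation == relation:
            if e.valid_from <= t and (e.valid_to is None or t < e.valid_to):
                return e.obj
    return None
```

A vector store answers "what is similar to this?" but has no notion of a fact being superseded. The validity interval on each edge is what lets a temporal graph answer "who owned this account in 2024?" after the owner has changed.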
3. Self-Editing Instructions
The evolution pattern from ACE and Amplifier:
- Record outcomes after each task
- Reflect on what worked vs failed
- Update playbooks/instructions automatically
- Version control the changes
This is the frontier — agents that don't just remember but actively improve their own behavior.
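The record, reflect, update, and version steps can be sketched as a small loop. This is not the ACE algorithm: the published approach uses a language model to reflect on outcomes, whereas this example substitutes a simple frequency count so it stays self-contained. `Playbook`, `Outcome`, `reflect`, and `evolve` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """Instructions plus a version history, so every automated edit can be rolled back."""
    rules: list[str]
    history: list[list[str]] = field(default_factory=list)

    def update(self, new_rules: list[str]) -> None:
        self.history.append(list(self.rules))   # snapshot before changing
        self.rules = new_rules

    def rollback(self) -> None:
        if self.history:
            self.rules = self.history.pop()


@dataclass
class Outcome:
    task: str
    success: bool
    note: str   # short lesson, written by the agent or a reviewer


def reflect(outcomes: list[Outcome], min_failures: int = 2) -> list[str]:
    """Turn lessons that recur across failed tasks into candidate rules."""
    counts: dict[str, int] = {}
    for o in outcomes:
        if not o.success:
            counts[o.note] = counts.get(o.note, 0) + 1
    return [note for note, n in counts.items() if n >= min_failures]


def evolve(playbook: Playbook, outcomes: list[Outcome]) -> bool:
    """Append new lessons to the playbook; return True if anything changed."""
    new = [r for r in reflect(outcomes) if r not in playbook.rules]
    if not new:
        return False
    playbook.update(playbook.rules + new)
    return True
```

The `min_failures` threshold is the important design choice: requiring a lesson to recur before it becomes a rule keeps one-off failures from polluting the instructions, and the history list makes a bad update cheap to undo.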
Implications for Agent Orchestration
Memory and self-improvement become infrastructure concerns at scale:
- Shared memory across agents — when multiple agents work on the same codebase, they need shared context. No tool handles multi-agent shared memory well yet.
- Memory-aware routing — route tasks to agents that have relevant memory, not just relevant skills
- Institutional knowledge — Amplifier's DISCOVERIES.md pattern lets teams accumulate knowledge that persists across individual agent sessions
- The integration gap — memory layers and skills frameworks are separate products today. The winning stack will combine them: skills for what to do, memory for what happened, evolution for getting better.
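Memory-aware routing from the list above can be approximated with very little machinery. The sketch below assumes each agent's memory is available as a list of plain-text notes and scores agents by word overlap with the task; both the data shape and the `route` function are hypothetical, and a real orchestrator would more likely use embedding similarity.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words longer than two characters, with edge punctuation stripped."""
    return {w.strip(".,:;!?").lower() for w in text.split() if len(w) > 2}


def route(task: str, agent_memories: dict[str, list[str]]) -> str:
    """Pick the agent whose stored memories overlap most with the task description."""
    task_words = tokenize(task)

    def score(agent: str) -> int:
        remembered: set[str] = set()
        for memory in agent_memories[agent]:
            remembered |= tokenize(memory)
        return len(task_words & remembered)

    # Ties go to the alphabetically first agent so routing is deterministic.
    return min(agent_memories, key=lambda a: (-score(a), a))
```

For example, a task mentioning "billing webhook" would go to the agent whose notes already mention fixing the billing webhook, rather than to whichever agent happens to be idle.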
The First-Party Threat: Anthropic's Dreaming
The biggest category event since February came from a lab, not a vendor. Anthropic shipped persistent memory for Claude Managed Agents (public beta, April 2026), then "Dreaming" at Code with Claude (May 2026): a scheduled between-session process that reviews agent sessions, extracts patterns and recurring mistakes, and rewrites memory stores — including synthesizing shared learnings across multi-agent deployments. Harvey reportedly saw ~6x task-completion improvement; Netflix and Rakuten are adopters.
That is a direct first-party version of what every vendor here sells. The independents' answers: framework-agnosticism (Dreaming is Claude-only), data control (your Postgres, not Anthropic's store), and ingestion breadth. Whether those hold as moats is the category's defining question for 2027.
Notable Others
- Dynamic Cheatsheet (Stanford, 255 ⭐) — The research predecessor to ACE; test-time learning with adaptive memory
- Memobase — User-profile personalization memory (deliberately excluded: memory for users, not agent learning loops)
- Addy Osmani's "Self-Improving Coding Agents" — Practical guide covering AGENTS.md as a self-improvement mechanism
Bottom Line
What works today: Mem0 or Zep as a drop-in memory layer for production agents — with Hindsight and Cognee as credible newer challengers. All are framework-agnostic with managed options. This alone — making agents remember between sessions — is a significant improvement over stateless operation.
What's promising but early: Evolution engines (ACE, Agentic Context Engine) that improve agent instructions from execution feedback. The Stanford research is compelling, but production evidence is thin. Worth experimenting with on repetitive workflows.
What's aspirational: Agents that combine persistent memory, self-improving instructions, and shared knowledge across multi-agent teams. Nobody has this fully integrated. The pieces exist but the glue doesn't.
The honest take: Most agents today still operate statelessly — every session starts fresh. Even basic memory persistence is a meaningful upgrade. Start there before chasing self-improvement. A well-configured AGENTS.md file that a human updates after each session is still more reliable than any automated evolution system.
Research by Ry Walker Research • methodology
Sources
- [1] mem0ai/mem0 on GitHub
- [2] letta-ai/letta on GitHub
- [3] getzep/zep on GitHub
- [4] langchain-ai/langmem on GitHub
- [5] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
- [6] ace-agent/ace on GitHub
- [7] kayba-ai/agentic-context-engine on GitHub
- [8] Memory in the Age of AI Agents: A Survey
- [9] Self-Improving Coding Agents (Addy Osmani)
- [10] vectorize-io/hindsight on GitHub
- [11] topoteretes/cognee on GitHub
- [12] supermemoryai/supermemory on GitHub
- [13] Anthropic brings persistent memory to Claude Managed Agents