Key takeaways
- The category splits into two approaches — memory layers (Mem0, Letta, Zep) that persist knowledge across sessions, and evolution engines (ACE, Agentic Context Engine) that actively improve instructions from execution feedback
- Mem0 leads with 47.8K stars and YC backing, but benchmark claims are disputed — Letta and Zep both published rebuttals of comparative benchmarks
- The Stanford ACE paper (arXiv 2510.04618) showed that evolving context can improve agent performance by 10.6% without fine-tuning — spawning both open-source and commercial implementations
- LangGraph lock-in is the hidden cost — LangMem requires LangGraph, while Mem0 and Zep are framework-agnostic. Choose based on your existing stack, not just features.
FAQ
What's the best memory layer for AI agents?
Mem0 for framework-agnostic simplicity with a managed SaaS option. Letta for full stateful agent runtime with tiered memory. Zep for temporal knowledge graphs and relationship tracking. LangMem if you already use LangGraph.
Can AI agents actually self-improve?
Yes, with caveats. Stanford's ACE research showed 10.6% improvement through evolving contexts. But self-improvement requires persistent memory and consistent outcome recording — most agents today still operate statelessly.
What's the difference between agent memory and self-improvement?
Memory persists facts and context across sessions (who the user is, what happened before). Self-improvement goes further — the agent changes its own instructions and strategies based on what worked and what failed.
Are these tools production-ready?
Mem0 and Zep have commercial SaaS offerings used in production. Letta has a managed platform. The evolution engines (ACE, Agentic Context Engine) are earlier stage — promising research with limited production testimonials.
Executive Summary
AI agents forget everything between sessions. This category exists because that's a problem.
Agent self-improvement tools solve two related challenges: memory (persisting knowledge across sessions) and evolution (actively improving an agent's instructions based on execution outcomes). The space splits cleanly into memory layers that store what happened and evolution engines that learn from what happened.
7 tools reviewed: Mem0 (47.8K ⭐), Letta (21.2K ⭐), Zep (4.1K ⭐), Agentic Context Engine (1.9K ⭐), LangMem (1.3K ⭐), ACE SaaS (hosted), Microsoft Amplifier (3K ⭐, DISCOVERIES.md pattern)
The category is young — most tools launched in 2025 — and the benchmark wars are already heated. Mem0, Letta, and Zep have each published benchmarks showing they outperform the others, making independent evaluation essential.
Which Tool Should You Use?
If you want a drop-in memory layer with minimal setup → Mem0. Framework-agnostic, managed SaaS option, largest community. Watch out: benchmark claims are disputed by competitors.
If you need a full stateful agent runtime → Letta. Not just memory — it's an entire agent platform with tiered memory (core/recall/archival). Watch out: heavier than a memory layer; you're adopting a runtime, not a library.
If you need temporal reasoning and relationship tracking → Zep. Knowledge graph architecture tracks how relationships change over time. Watch out: smaller community than Mem0, credit-based pricing can be opaque.
If you already use LangGraph → LangMem. Native integration, three memory types (semantic/episodic/procedural). Watch out: LangGraph lock-in is real — useless outside the LangChain ecosystem.
If you want agents that improve their own instructions → ACE SaaS for hosted playbook evolution, or Agentic Context Engine for the open-source equivalent. Watch out: requires consistent outcome recording; the improvement loop only works with data.
The Two Approaches
Memory Layers
Store and retrieve knowledge across sessions. The agent remembers but doesn't change its behavior.
| Tool | Stars | Architecture | Setup | Pricing | Watch Out |
|---|---|---|---|---|---|
| Mem0 | 47.8K | Dual store (vector + graph) | Minutes | Free tier + $99-499/mo | Disputed benchmarks |
| Letta | 21.2K | Tiered memory (core/recall/archival) | Hours | Free tier + $20-200/mo | Full runtime, not just memory |
| Zep | 4.1K | Temporal knowledge graph | Minutes | Free tier + $25+/mo | Smaller community |
| LangMem | 1.3K | Three memory types via LangGraph | Minutes | Open source (free) | LangGraph lock-in |
Mem0 is the market leader by stars and has YC backing. It offers both a Python SDK and managed SaaS with a claimed 26% accuracy improvement and 90% token savings.
Letta (formerly MemGPT) takes the most ambitious approach — it's not just a memory layer but a full stateful agent platform. The tiered memory system mirrors how humans organize information: core memory for always-available context, recall memory for conversation history, and archival memory for long-term storage.
Zep differentiates with its Graphiti-powered temporal knowledge graph. It doesn't just store facts — it tracks how relationships evolve over time, which matters for enterprise use cases like contract management or customer relationship tracking.
LangMem is the simplest option if you're already in the LangChain ecosystem. It adds semantic, episodic, and procedural memory to LangGraph agents with hot-path and background processing modes.
Evolution Engines
Agents that actively change their own instructions based on execution outcomes.
| Tool | Stars | Approach | Setup | Pricing | Watch Out |
|---|---|---|---|---|---|
| ACE SaaS | N/A | Managed playbook evolution | Minutes | $9-79/mo | Requires outcome recording |
| Agentic Context Engine | 1.9K | Open-source ACE implementation | Hours | Free | Alpha stage |
| Amplifier | 3K | DISCOVERIES.md pattern | Hours | Free | Research-only, not production |
The evolution approach is newer and based on Stanford/SambaNova's ACE research. The core insight: instead of fine-tuning models (expensive, slow), improve performance by evolving the context — the instructions and strategies the agent receives. The paper reported 10.6% improvement on complex tasks.
ACE SaaS (aceagent.io) productizes this into a managed service with MCP integration. Agentic Context Engine is the popular open-source implementation with LangChain/LlamaIndex/CrewAI integrations. Amplifier takes a different approach — agents write DISCOVERIES.md files logging solutions to avoid repeating mistakes.
Adoption Metrics
| Tool | Stars | Forks | Fork Ratio | Last Push | Age |
|---|---|---|---|---|---|
| Mem0 | 47,800 | 5,306 | 11.1% | Feb 22, 2026 | 10+ months |
| Letta | 21,210 | 2,216 | 10.4% | Jan 29, 2026 | 12+ months |
| Zep | 4,081 | 574 | 14.1% | Feb 14, 2026 | 12+ months |
| Context Engine | 1,907 | 237 | 12.4% | Feb 21, 2026 | 3 months |
| LangMem | 1,299 | 154 | 11.9% | Oct 2025 | 10+ months |
| ACE (open source) | 630 | 84 | 13.3% | Feb 18, 2026 | 3 months |
What stands out:
- Zep has the highest fork ratio (14.1%) — people are customizing it, not just starring
- LangMem hasn't been pushed since October 2025 — stalled or stable?
- Letta hasn't been pushed since January — concerning for a VC-backed company
- Context Engine is the most active recent project — pushed yesterday, growing fast for a 3-month-old repo
The Benchmark Wars
This category has a credibility problem. Each vendor publishes benchmarks showing they win:
- Mem0 claims 26% higher accuracy and 91% lower latency vs competitors
- Letta published rebuttals questioning Mem0's benchmark methodology
- Zep claims 18.5% improvement on LongMemEval, outperforming MemGPT (Letta's predecessor)
- LangMem benchmarks against its own previous versions, not competitors
A comprehensive survey from December 2025 found that agent memory research is "increasingly fragmented" with no standard evaluation framework. Until the field converges on shared benchmarks, treat all vendor-published numbers with skepticism.
Key Patterns
1. The Memory Hierarchy
Every mature tool implements some version of tiered memory:
- Hot memory — always in context (Letta's core, Mem0's working memory)
- Warm memory — retrievable on demand (Letta's recall, Zep's graph queries)
- Cold memory — archived, searchable (Letta's archival, Mem0's long-term store)
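The three tiers above can be sketched in a few lines of Python. This is an illustrative toy, not any vendor's API: real implementations add embeddings, persistence, and smarter eviction policies.

```python
from collections import deque

class TieredMemory:
    """Toy three-tier memory: hot (always in context), warm (recent,
    retrievable), cold (archived, searchable)."""

    def __init__(self, hot_limit=5, warm_limit=50):
        self.hot = deque(maxlen=hot_limit)    # always injected into the prompt
        self.warm = deque(maxlen=warm_limit)  # recent history, fetched on demand
        self.cold = []                        # long-term archive

    def remember(self, fact: str):
        # New facts enter hot memory; overflow demotes the oldest one tier down.
        if len(self.hot) == self.hot.maxlen:
            demoted = self.hot[0]
            if len(self.warm) == self.warm.maxlen:
                self.cold.append(self.warm[0])
            self.warm.append(demoted)
        self.hot.append(fact)  # deque with maxlen drops the demoted item itself

    def recall(self, query: str):
        # Naive keyword match over warm + cold; real tools use vector search.
        pool = list(self.warm) + self.cold
        return [f for f in pool if query.lower() in f.lower()]

mem = TieredMemory(hot_limit=2)
for f in ["user prefers Python", "project uses Postgres", "deploy on Fridays is banned"]:
    mem.remember(f)
print(list(mem.hot))         # the two newest facts
print(mem.recall("python"))  # the demoted fact, found in warm memory
```

The point of the sketch is the demotion path: nothing is deleted, it just gets cheaper to store and more expensive to retrieve as it ages.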
2. Graph vs Vector
The architecture divide:
- Vector-first (Mem0, LangMem) — fast similarity search, good for "find relevant context"
- Graph-first (Zep, Letta) — relationship tracking, good for "how has this changed over time"
- Hybrid (Mem0's dual store) — both, at the cost of complexity
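A toy contrast between the two query styles, in plain Python with hand-written 2-D "embeddings" and a list of timestamped edges. None of this is a vendor API; Zep's temporal graph and Mem0's vector store are far richer, but the shape of the two questions is the same.

```python
import math

# Vector-first: "find relevant context" via similarity search.
docs = {
    "user prefers dark mode": (0.9, 0.1),
    "invoice #42 is overdue": (0.1, 0.9),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def vector_search(query_vec):
    # Return the single most similar document.
    return max(docs, key=lambda d: cosine(docs[d], query_vec))

# Graph-first: "how has this changed over time" via edges with validity spans.
edges = [
    # (subject, relation, object, valid_from, valid_to); None = still valid
    ("alice", "works_at", "acme", "2023-01", "2024-06"),
    ("alice", "works_at", "globex", "2024-07", None),
]

def employer_at(person, month):
    # ISO-formatted months compare correctly as strings.
    for src, rel, dst, start, end in edges:
        if src == person and rel == "works_at" and start <= month and (end is None or month <= end):
            return dst

print(vector_search((0.8, 0.2)))
print(employer_at("alice", "2024-01"), employer_at("alice", "2025-01"))
```

Similarity search can only answer "what is relevant now"; the temporal edges can answer "what was true then", which is why the graph-first tools target relationship-heavy enterprise cases.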
3. Self-Editing Instructions
The evolution pattern from ACE and Amplifier:
- Record outcomes after each task
- Reflect on what worked vs failed
- Update playbooks/instructions automatically
- Version control the changes
This is the frontier — agents that don't just remember but actively improve their own behavior.
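The loop above can be sketched as follows. The reflection step here is a hard-coded frequency rule purely for illustration; real evolution engines like ACE delegate reflection to an LLM, and every name below is hypothetical.

```python
from collections import Counter

PLAYBOOK = ["Prefer small, reviewable diffs."]  # the evolving instructions
OUTCOMES = []                                   # execution log the loop learns from

def record_outcome(task, strategy, success):
    """Step 1: record what was tried and whether it worked."""
    OUTCOMES.append({"task": task, "strategy": strategy, "success": success})

def reflect_and_update():
    """Steps 2-3: reflect on outcomes and promote repeatedly successful
    strategies into the playbook. A real engine would also version the
    change (step 4) and use an LLM judgment instead of a count."""
    wins = Counter(o["strategy"] for o in OUTCOMES if o["success"])
    for strategy, count in wins.items():
        if count >= 2 and strategy not in PLAYBOOK:
            PLAYBOOK.append(strategy)

record_outcome("fix flaky test", "run the test 10x before and after", True)
record_outcome("fix race condition", "run the test 10x before and after", True)
record_outcome("refactor module", "rewrite everything at once", False)
reflect_and_update()
print(PLAYBOOK)  # winning strategy promoted; failed one never enters
```

Even this toy shows why outcome recording is the bottleneck the tool tables warn about: with an empty `OUTCOMES` log, the playbook never changes.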
Implications for Agent Orchestration
Memory and self-improvement become infrastructure concerns at scale:
- Shared memory across agents — when multiple agents work on the same codebase, they need shared context. No tool handles multi-agent shared memory well yet.
- Memory-aware routing — route tasks to agents that have relevant memory, not just relevant skills
- Institutional knowledge — Amplifier's DISCOVERIES.md pattern lets teams accumulate knowledge that persists across individual agent sessions
- The integration gap — memory layers and skills frameworks are separate products today. The winning stack will combine them: skills for what to do, memory for what happened, evolution for getting better.
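The DISCOVERIES.md idea from the list above is simple enough to sketch directly: after solving a non-obvious problem, the agent appends a structured entry that future sessions read before starting work. The entry format below is made up for illustration; Amplifier defines its own conventions.

```python
from datetime import date
from pathlib import Path

def log_discovery(path, problem, root_cause, fix):
    """Append a discovery entry so later agent sessions don't re-solve it."""
    entry = (
        f"\n## {problem}\n"
        f"- Date: {date.today().isoformat()}\n"
        f"- Root cause: {root_cause}\n"
        f"- Fix: {fix}\n"
    )
    p = Path(path)
    if not p.exists():
        p.write_text("# DISCOVERIES\n")
    with p.open("a") as f:
        f.write(entry)

log_discovery(
    "DISCOVERIES.md",
    "CI fails only on Linux",
    "case-sensitive import path",
    "rename Utils.py to utils.py and update imports",
)
print(Path("DISCOVERIES.md").read_text())
```

Because the artifact is a plain markdown file in the repo, it gets version control, code review, and human editing for free, which is exactly the institutional-knowledge property the bullet describes.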
Notable Others
- Cognee — Knowledge graph memory with ECL (Extract-Cognify-Load) pipeline
- Dynamic Cheatsheet (Stanford, 255 ⭐) — The research predecessor to ACE; test-time learning with adaptive memory
- Supermemory — Open-source memory layer used by OpenClaw and other personal agent platforms
- Addy Osmani's "Self-Improving Coding Agents" — Practical guide covering AGENTS.md as a self-improvement mechanism
Bottom Line
What works today: Mem0 or Zep as a drop-in memory layer for production agents. Both have managed SaaS options, reasonable free tiers, and are framework-agnostic. This alone — making agents remember between sessions — is a significant improvement over stateless operation.
What's promising but early: Evolution engines (ACE, Agentic Context Engine) that improve agent instructions from execution feedback. The Stanford research is compelling, but production evidence is thin. Worth experimenting with on repetitive workflows.
What's aspirational: Agents that combine persistent memory, self-improving instructions, and shared knowledge across multi-agent teams. Nobody has this fully integrated. The pieces exist but the glue doesn't.
The honest take: Most agents today still operate statelessly — every session starts fresh. Even basic memory persistence is a meaningful upgrade. Start there before chasing self-improvement. A well-configured AGENTS.md file that a human updates after each session is still more reliable than any automated evolution system.
About This Research
This analysis was produced by Claw, an AI research agent built on OpenClaw and operated by Ry Walker. Individual profiles were researched by sub-agents and reviewed for accuracy. Star counts pulled from GitHub API on February 22, 2026.
Research by Claw • February 22, 2026
Sources
- [1] mem0ai/mem0 on GitHub
- [2] letta-ai/letta on GitHub
- [3] getzep/zep on GitHub
- [4] langchain-ai/langmem on GitHub
- [5] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
- [6] ace-agent/ace on GitHub
- [7] kayba-ai/agentic-context-engine on GitHub
- [8] Memory in the Age of AI Agents: A Survey
- [9] Self-Improving Coding Agents (Addy Osmani)