Key takeaways
- The category splits into two approaches — memory layers (Mem0, Letta, Zep) that persist knowledge across sessions, and evolution engines (ACE, Agentic Context Engine) that actively improve instructions from execution feedback
- Mem0 leads with 47.8K stars and YC backing, but benchmark claims are disputed — Letta and Zep both published rebuttals of comparative benchmarks
- The Stanford ACE paper (arXiv 2510.04618) showed that evolving context can improve agent performance by 10.6% without fine-tuning — spawning both open-source and commercial implementations
- LangGraph lock-in is the hidden cost — LangMem requires LangGraph, while Mem0 and Zep are framework-agnostic. Choose based on your existing stack, not just features.
FAQ
What's the best memory layer for AI agents?
Mem0 for framework-agnostic simplicity with a managed SaaS option. Letta for full stateful agent runtime with tiered memory. Zep for temporal knowledge graphs and relationship tracking. LangMem if you already use LangGraph.
Can AI agents actually self-improve?
Yes, with caveats. Stanford's ACE research showed 10.6% improvement through evolving contexts. But self-improvement requires persistent memory and consistent outcome recording — most agents today still operate statelessly.
What's the difference between agent memory and self-improvement?
Memory persists facts and context across sessions (who the user is, what happened before). Self-improvement goes further — the agent changes its own instructions and strategies based on what worked and what failed.
Are these tools production-ready?
Mem0 and Zep have commercial SaaS offerings used in production. Letta has a managed platform. The evolution engines (ACE, Agentic Context Engine) are earlier stage — promising research with limited production testimonials.
Executive Summary
AI agents forget everything between sessions. This category exists because that's a problem.
Agent self-improvement tools solve two related challenges: memory (persisting knowledge across sessions) and evolution (actively improving an agent's instructions based on execution outcomes). The space splits cleanly into memory layers that store what happened and evolution engines that learn from what happened.
7 tools reviewed: Mem0 (47.8K ⭐), Letta (21.2K ⭐), Zep (4.1K ⭐), Agentic Context Engine (1.9K ⭐), LangMem (1.3K ⭐), ACE SaaS (hosted), Microsoft Amplifier (3K ⭐, DISCOVERIES.md pattern)
The category is young — most tools launched in 2025 — and the benchmark wars are already heated. Mem0, Letta, and Zep have each published benchmarks showing they outperform the others, making independent evaluation essential.
Which Tool Should You Use?
If you want a drop-in memory layer with minimal setup → Mem0. Framework-agnostic, managed SaaS option, largest community. Watch out: benchmark claims are disputed by competitors.
If you need a full stateful agent runtime → Letta. Not just memory — it's an entire agent platform with tiered memory (core/recall/archival). Watch out: heavier than a memory layer; you're adopting a runtime, not a library.
If you need temporal reasoning and relationship tracking → Zep. Knowledge graph architecture tracks how relationships change over time. Watch out: smaller community than Mem0, credit-based pricing can be opaque.
If you already use LangGraph → LangMem. Native integration, three memory types (semantic/episodic/procedural). Watch out: LangGraph lock-in is real — useless outside the LangChain ecosystem.
If you want agents that improve their own instructions → ACE SaaS for hosted playbook evolution, or Agentic Context Engine for the open-source equivalent. Watch out: requires consistent outcome recording; the improvement loop only works with data.
The Two Approaches
Memory Layers
Store and retrieve knowledge across sessions. The agent remembers but doesn't change its behavior.
| Tool | Stars | Architecture | Setup | Pricing | Watch Out |
|---|---|---|---|---|---|
| Mem0 | 47.8K | Dual store (vector + graph) | Minutes | Free tier + $99-499/mo | Disputed benchmarks |
| Letta | 21.2K | Tiered memory (core/recall/archival) | Hours | Free tier + $20-200/mo | Full runtime, not just memory |
| Zep | 4.1K | Temporal knowledge graph | Minutes | Free tier + $25+/mo | Smaller community |
| LangMem | 1.3K | Three memory types via LangGraph | Minutes | Open source (free) | LangGraph lock-in |
Mem0 is the market leader by stars and has YC backing. It offers both a Python SDK and managed SaaS with a claimed 26% accuracy improvement and 90% token savings.
Letta (formerly MemGPT) takes the most ambitious approach — it's not just a memory layer but a full stateful agent platform. The tiered memory system mirrors how humans organize information: core memory for always-available context, recall memory for conversation history, and archival memory for long-term storage.
Zep differentiates with its Graphiti-powered temporal knowledge graph. It doesn't just store facts — it tracks how relationships evolve over time, which matters for enterprise use cases like contract management or customer relationship tracking.
LangMem is the simplest option if you're already in the LangChain ecosystem. It adds semantic, episodic, and procedural memory to LangGraph agents with hot-path and background processing modes.
Evolution Engines
Agents that actively change their own instructions based on execution outcomes.
| Tool | Stars | Approach | Setup | Pricing | Watch Out |
|---|---|---|---|---|---|
| ACE SaaS | N/A | Managed playbook evolution | Minutes | $9-79/mo | Requires outcome recording |
| Agentic Context Engine | 1.9K | Open-source ACE implementation | Hours | Free | Alpha stage |
| Amplifier | 3K | DISCOVERIES.md pattern | Hours | Free | Research-only, not production |
The evolution approach is newer and based on Stanford/SambaNova's ACE research. The core insight: instead of fine-tuning models (expensive, slow), improve performance by evolving the context — the instructions and strategies the agent receives. The paper reported 10.6% improvement on complex tasks.
ACE SaaS (aceagent.io) productizes this into a managed service with MCP integration. Agentic Context Engine is the popular open-source implementation with LangChain/LlamaIndex/CrewAI integrations. Amplifier takes a different approach — agents write DISCOVERIES.md files logging solutions to avoid repeating mistakes.
Adoption Metrics
| Tool | Stars | Forks | Fork Ratio | Last Push | Age |
|---|---|---|---|---|---|
| Mem0 | 47,800 | 5,306 | 11.1% | Feb 22, 2026 | 10+ months |
| Letta | 21,210 | 2,216 | 10.4% | Jan 29, 2026 | 12+ months |
| Zep | 4,081 | 574 | 14.1% | Feb 14, 2026 | 12+ months |
| Context Engine | 1,907 | 237 | 12.4% | Feb 21, 2026 | 3 months |
| LangMem | 1,299 | 154 | 11.9% | Oct 2025 | 10+ months |
| ACE (open source) | 630 | 84 | 13.3% | Feb 18, 2026 | 3 months |
What stands out:
- Zep has the highest fork ratio (14.1%) — people are customizing it, not just starring
- LangMem hasn't been pushed since October 2025 — stalled or stable?
- Letta hasn't been pushed since January — concerning for a VC-backed company
- Context Engine is the most active recent project — pushed yesterday, growing fast for a 3-month-old repo
The Benchmark Wars
This category has a credibility problem. Each vendor publishes benchmarks showing they win:
- Mem0 claims 26% higher accuracy and 91% lower latency vs competitors
- Letta published rebuttals questioning Mem0's benchmark methodology
- Zep claims 18.5% improvement on LongMemEval, outperforming MemGPT (Letta's predecessor)
- LangMem benchmarks against its own previous versions, not competitors
A comprehensive survey from December 2025 found that agent memory research is "increasingly fragmented" with no standard evaluation framework. Until the field converges on shared benchmarks, treat all vendor-published numbers with skepticism.
Key Patterns
1. The Memory Hierarchy
Every mature tool implements some version of tiered memory:
- Hot memory — always in context (Letta's core, Mem0's working memory)
- Warm memory — retrievable on demand (Letta's recall, Zep's graph queries)
- Cold memory — archived, searchable (Letta's archival, Mem0's long-term store)
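The three tiers above can be sketched in a few lines of Python. This is an illustrative toy, not any vendor's API: real implementations add embeddings, persistence, and smarter eviction policies.

```python
from collections import deque

class TieredMemory:
    """Toy three-tier memory: hot (always in context), warm (recent,
    retrievable), cold (archived, searchable)."""

    def __init__(self, hot_limit=5, warm_limit=50):
        self.hot = deque(maxlen=hot_limit)    # always injected into the prompt
        self.warm = deque(maxlen=warm_limit)  # recent history, fetched on demand
        self.cold = []                        # long-term archive

    def remember(self, fact: str):
        # New facts enter hot memory; overflow demotes the oldest one tier down.
        if len(self.hot) == self.hot.maxlen:
            demoted = self.hot[0]
            if len(self.warm) == self.warm.maxlen:
                self.cold.append(self.warm[0])
            self.warm.append(demoted)
        self.hot.append(fact)  # deque with maxlen drops the demoted item itself

    def recall(self, query: str):
        # Naive keyword match over warm + cold; real tools use vector search.
        pool = list(self.warm) + self.cold
        return [f for f in pool if query.lower() in f.lower()]

mem = TieredMemory(hot_limit=2)
for f in ["user prefers Python", "project uses Postgres", "deploy on Fridays is banned"]:
    mem.remember(f)
print(list(mem.hot))         # the two newest facts
print(mem.recall("python"))  # the demoted fact, found in warm memory
```

The point of the sketch is the demotion path: nothing is deleted, it just gets cheaper to store and more expensive to retrieve as it ages.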
2. Graph vs Vector
The architecture divide:
- Vector-first (Mem0, LangMem) — fast similarity search, good for "find relevant context"
- Graph-first (Zep, Letta) — relationship tracking, good for "how has this changed over time"
- Hybrid (Mem0's dual store) — both, at the cost of complexity
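A toy contrast between the two query styles, in plain Python with hand-written 2-D "embeddings" and a list of timestamped edges. None of this is a vendor API; Zep's temporal graph and Mem0's vector store are far richer, but the shape of the two questions is the same.

```python
import math

# Vector-first: "find relevant context" via similarity search.
docs = {
    "user prefers dark mode": (0.9, 0.1),
    "invoice #42 is overdue": (0.1, 0.9),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def vector_search(query_vec):
    # Return the single most similar document.
    return max(docs, key=lambda d: cosine(docs[d], query_vec))

# Graph-first: "how has this changed over time" via edges with validity spans.
edges = [
    # (subject, relation, object, valid_from, valid_to); None = still valid
    ("alice", "works_at", "acme", "2023-01", "2024-06"),
    ("alice", "works_at", "globex", "2024-07", None),
]

def employer_at(person, month):
    # ISO-formatted months compare correctly as strings.
    for src, rel, dst, start, end in edges:
        if src == person and rel == "works_at" and start <= month and (end is None or month <= end):
            return dst

print(vector_search((0.8, 0.2)))
print(employer_at("alice", "2024-01"), employer_at("alice", "2025-01"))
```

Similarity search can only answer "what is relevant now"; the temporal edges can answer "what was true then", which is why the graph-first tools target relationship-heavy enterprise cases.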
3. Self-Editing Instructions
The evolution pattern from ACE and Amplifier:
- Record outcomes after each task
- Reflect on what worked vs failed
- Update playbooks/instructions automatically
- Version control the changes
This is the frontier — agents that don't just remember but actively improve their own behavior.
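The loop above can be sketched as follows. The reflection step here is a hard-coded frequency rule purely for illustration; real evolution engines like ACE delegate reflection to an LLM, and every name below is hypothetical.

```python
from collections import Counter

PLAYBOOK = ["Prefer small, reviewable diffs."]  # the evolving instructions
OUTCOMES = []                                   # execution log the loop learns from

def record_outcome(task, strategy, success):
    """Step 1: record what was tried and whether it worked."""
    OUTCOMES.append({"task": task, "strategy": strategy, "success": success})

def reflect_and_update():
    """Steps 2-3: reflect on outcomes and promote repeatedly successful
    strategies into the playbook. A real engine would also version the
    change (step 4) and use an LLM judgment instead of a count."""
    wins = Counter(o["strategy"] for o in OUTCOMES if o["success"])
    for strategy, count in wins.items():
        if count >= 2 and strategy not in PLAYBOOK:
            PLAYBOOK.append(strategy)

record_outcome("fix flaky test", "run the test 10x before and after", True)
record_outcome("fix race condition", "run the test 10x before and after", True)
record_outcome("refactor module", "rewrite everything at once", False)
reflect_and_update()
print(PLAYBOOK)  # winning strategy promoted; failed one never enters
```

Even this toy shows why outcome recording is the bottleneck the tool tables warn about: with an empty `OUTCOMES` log, the playbook never changes.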
Implications for Agent Orchestration
Memory and self-improvement become infrastructure concerns at scale:
- Shared memory across agents — when multiple agents work on the same codebase, they need shared context. No tool handles multi-agent shared memory well yet.
- Memory-aware routing — route tasks to agents that have relevant memory, not just relevant skills
- Institutional knowledge — Amplifier's DISCOVERIES.md pattern lets teams accumulate knowledge that persists across individual agent sessions
- The integration gap — memory layers and skills frameworks are separate products today. The winning stack will combine them: skills for what to do, memory for what happened, evolution for getting better.
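The DISCOVERIES.md idea from the list above is simple enough to sketch directly: after solving a non-obvious problem, the agent appends a structured entry that future sessions read before starting work. The entry format below is made up for illustration; Amplifier defines its own conventions.

```python
from datetime import date
from pathlib import Path

def log_discovery(path, problem, root_cause, fix):
    """Append a discovery entry so later agent sessions don't re-solve it."""
    entry = (
        f"\n## {problem}\n"
        f"- Date: {date.today().isoformat()}\n"
        f"- Root cause: {root_cause}\n"
        f"- Fix: {fix}\n"
    )
    p = Path(path)
    if not p.exists():
        p.write_text("# DISCOVERIES\n")
    with p.open("a") as f:
        f.write(entry)

log_discovery(
    "DISCOVERIES.md",
    "CI fails only on Linux",
    "case-sensitive import path",
    "rename Utils.py to utils.py and update imports",
)
print(Path("DISCOVERIES.md").read_text())
```

Because the artifact is a plain markdown file in the repo, it gets version control, code review, and human editing for free, which is exactly the institutional-knowledge property the bullet describes.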
Notable Others
- Cognee — Knowledge graph memory with ECL (Extract-Cognify-Load) pipeline
- Dynamic Cheatsheet (Stanford, 255 ⭐) — The research predecessor to ACE; test-time learning with adaptive memory
- Supermemory — Open-source memory layer used by OpenClaw and other personal agent platforms
- Addy Osmani's "Self-Improving Coding Agents" — Practical guide covering AGENTS.md as a self-improvement mechanism
Bottom Line
What works today: Mem0 or Zep as a drop-in memory layer for production agents. Both have managed SaaS options, reasonable free tiers, and are framework-agnostic. This alone — making agents remember between sessions — is a significant improvement over stateless operation.
What's promising but early: Evolution engines (ACE, Agentic Context Engine) that improve agent instructions from execution feedback. The Stanford research is compelling, but production evidence is thin. Worth experimenting with on repetitive workflows.
What's aspirational: Agents that combine persistent memory, self-improving instructions, and shared knowledge across multi-agent teams. Nobody has this fully integrated. The pieces exist but the glue doesn't.
The honest take: Most agents today still operate statelessly — every session starts fresh. Even basic memory persistence is a meaningful upgrade. Start there before chasing self-improvement. A well-configured AGENTS.md file that a human updates after each session is still more reliable than any automated evolution system.
About This Research
This analysis was produced by Claw, an AI research agent built on OpenClaw and operated by Ry Walker. Individual profiles were researched by sub-agents and reviewed for accuracy. Star counts pulled from GitHub API on February 22, 2026.
Research by Claw • February 22, 2026
Sources
- [1] mem0ai/mem0 on GitHub
- [2] letta-ai/letta on GitHub
- [3] getzep/zep on GitHub
- [4] langchain-ai/langmem on GitHub
- [5] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
- [6] ace-agent/ace on GitHub
- [7] kayba-ai/agentic-context-engine on GitHub
- [8] Memory in the Age of AI Agents: A Survey
- [9] Self-Improving Coding Agents (Addy Osmani)