Engineering · 4 min read

Memory Isn't Context: Why Universal LLM Memory Systems Fail

New benchmarks show 'smart' memory systems like Mem0 and Zep perform worse than simply feeding context directly. Kodebase takes a different approach entirely.


Miguel Carvalho

Founder


Sebastian Lund just published a devastating benchmark of two popular LLM memory systems.

The results should make anyone rethink their AI architecture.

Long-context baseline: 84.6% precision, $1.98 total cost.

Mem0 (vector-based): 49.3% precision, $24.88 cost.

Zep (graph-based): 51.6% precision, ~$152.60 cost.

The "smart" memory systems performed worse than simply feeding context directly. And they cost 12x to 77x more to run.


The LLM-on-Write Problem

Both Mem0 and Zep use what Lund calls an LLM-on-Write architecture. Every message triggers background LLM processes:

Mem0 runs three parallel extraction jobs per interaction. Update the timeline. Extract facts to vector storage. Check for contradictions.

Zep's Graphiti extracts entities and relationships, then recursively updates a knowledge graph. Cascading LLM calls. Lund's experiment burned 1.17 million tokens per test case before he aborted after 9 hours.

The fundamental flaw: these systems rely on LLMs to interpret raw data into "facts" at write-time. This introduces hallucinations before the data even reaches the database.

Your primary LLM is at the mercy of the extractor LLM's accuracy.
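
To make the pattern concrete, here's a minimal sketch of an LLM-on-write pipeline in TypeScript. The function names and prompts are hypothetical, not Mem0's or Zep's actual code; the point is the shape: every write fans out into extractor calls whose output becomes stored "truth".

```typescript
// Hypothetical sketch of an LLM-on-write pipeline (not Mem0 or Zep source code).
// Every incoming message fans out into background LLM calls, and whatever the
// extractor returns is persisted as "fact" before the primary LLM ever sees it.

type Message = { role: "user" | "assistant"; content: string };

// Stand-in for a call to the extractor model.
async function extractorLLM(instruction: string, message: Message): Promise<string> {
  return `[extractor output about "${message.content}" for: ${instruction}]`;
}

// Stand-in for the write to a vector store or knowledge graph.
async function persist(records: string[]): Promise<void> {
  console.log("stored as ground truth:", records);
}

async function onWrite(message: Message): Promise<void> {
  // Three parallel extraction jobs per interaction, as described in the benchmark.
  const records = await Promise.all([
    extractorLLM("update the conversation timeline", message),
    extractorLLM("extract durable facts for vector storage", message),
    extractorLLM("check for contradictions with stored facts", message),
  ]);
  // Any hallucination above is now baked into the database.
  await persist(records);
}

onWrite({ role: "user", content: "The config file moved to /src/app.config.ts" });
```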


Two Problems Confused as One

Lund identifies the core mistake: these systems conflate two fundamentally different requirements.

Semantic memory is user-focused. Preferences, history, rapport. It can be fuzzy. "The user prefers dark mode" is useful even if slightly imprecise.

Working memory is agent-focused. File paths, variable states, execution logs. It must be lossless and exact. "The file is at /src/utils/helper.ts" cannot tolerate interpretation.
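
A rough way to see the difference is in types. The names here are mine, not Lund's: one side tolerates lossy summaries, the other has to round-trip exactly.

```typescript
// Illustrative types only; the names are hypothetical, not from Lund's post.

// Semantic memory: user-focused. Fuzzy, summarized values are acceptable.
interface SemanticMemory {
  preferences: string[]; // "prefers dark mode" is useful even if imprecise
  rapportNotes: string;  // a lossy summary does no real harm here
}

// Working memory: agent-focused. Must be lossless and exact.
interface WorkingMemory {
  filePaths: string[];                    // "/src/utils/helper.ts" cannot tolerate interpretation
  variableState: Record<string, string>;  // exact values, stored as written
  executionLog: string[];                 // verbatim lines, never paraphrased by an LLM
}
```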

The quote that stuck with me: "State is the application. You cannot compress your way to reliability."


Why Kodebase Doesn't Use "Memory"

When I first saw this benchmark, my reaction was: we should test Kodebase against it.

Then I realized the question doesn't apply.

Kodebase doesn't use memory systems. We use structured context.

No vector database. Artifacts are YAML files in Git. No embeddings. No semantic search over fuzzy representations.

No LLM-on-write. When you create an artifact, it's stored exactly as written. No extraction. No interpretation. No hallucination at write-time.

No graph extraction. Relationships between artifacts are explicit in metadata. blocks: ["A.1.2"] means exactly what it says. No LLM inferring connections.

File-system as truth. The source of truth is human-readable files you can inspect, version, and diff. Not a database you have to query to understand.
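
To show what "explicit in metadata" looks like, here's an illustrative sketch of an artifact's structure in TypeScript. The field names are hypothetical and only mirror the idea of the YAML files, not Kodebase's exact schema.

```typescript
// Rough sketch of a structured-context artifact. Field names are hypothetical
// and only mirror the spirit of Kodebase's YAML files, not the exact schema.

interface Artifact {
  id: string;
  title: string;
  status: "open" | "blocked" | "done";
  blocks: string[];    // explicit relationships; no LLM inferring connections
  blockedBy: string[];
}

// Stored exactly as written: no extraction, no interpretation at write-time.
const artifact: Artifact = {
  id: "A.1.3",
  title: "Add retry logic to the sync worker",
  status: "blocked",
  blocks: ["A.1.2"],   // means exactly what it says
  blockedBy: ["A.1.1"],
};
```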

This isn't accidental. It's the result of recognizing that context decay is the core problem in AI-assisted development. And the solution isn't smarter compression. It's better structure.


The Long-Context Baseline Wins

The most striking result from Lund's benchmark: the simple approach won.

Feed the context directly. No fancy extraction. No knowledge graphs. No vector similarity.

84.6% precision at $1.98 vs. 49-51% precision at $25-$150.

This matches our experience building Kodebase. The executable documentation approach works because it gives AI exactly the context it needs. No more, no less. No lossy transformation.

When you have good structure, you don't need smart memory. You just need to feed the right context.
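
For contrast with the LLM-on-write sketch above, a long-context baseline can be as simple as this. It's an illustration under my own assumptions, not Lund's harness: read the files as written and put them in the prompt.

```typescript
// Minimal sketch of the long-context baseline: no extraction, no embeddings,
// no graph. This is an illustration, not Lund's benchmark harness.
import { readFileSync } from "node:fs";

function buildPrompt(question: string, artifactPaths: string[]): string {
  // Artifacts are human-readable files; include them verbatim.
  const context = artifactPaths
    .map((path) => `--- ${path} ---\n${readFileSync(path, "utf8")}`)
    .join("\n\n");

  // Interpretation happens at read-time, where the output can be validated,
  // not at write-time, where a hallucination would be stored as truth.
  return `${context}\n\nQuestion: ${question}`;
}
```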


What This Means for AI Architecture

If you're building AI systems, the implications are clear:

Don't add memory systems to fix context problems. You'll pay more, get worse results, and add complexity. Fix the context problem directly.

Structure beats intelligence. A well-organized file system outperforms a "smart" knowledge graph. Explicit relationships beat inferred ones.

Avoid LLM-on-write. Every LLM interpretation is a potential hallucination. Store data as-written. Let the LLM interpret at read-time, when you can validate the output.

Benchmark before adopting. Mem0 and Zep sound compelling. The benchmarks tell a different story. Test claims against your actual use case.


The Bottom Line

Universal LLM memory doesn't exist. The systems promising it perform worse than the baseline and cost dramatically more.

The winning strategy is almost embarrassingly simple: organize your context well, then feed it directly.

That's what Kodebase does. Not because we're clever. Because we tried the clever approaches and they didn't work.

Context is everything. Structure is how you preserve it.

context-decay · ai · methodology · llm-memory · architecture