---
tags: ai, memory systems, context management
dave_pos: right
date: 2026-05-26
description: Context compression is a trap. Build a three-stage memory system that cuts noise, preserves signal, and keeps your AI agent sharp across long conversations.
image: /asset/prune-context.png
template: blog
dave_react: surprised
title: A Better Approach to AI Memory
keywords: AI, memory, context, Claude, agents, Arland
---

When you're building a chatbot or agent that needs continuity over long conversations, you hit a real wall pretty quickly. Context windows are finite, and the usual move is obvious: when you run out of space, summarize the old conversation and replace it with a compact version. It sounds reasonable on the surface, but that's where the problems start.

## The Sawtooth Problem

Every time you compress context into a summary, something gets lost in translation. Details blur together, nuance evaporates, and the agent loses the thread of what mattered most. Context builds up turn by turn, then drops off a cliff at each compression, over and over, which is the sawtooth. The immersion breaks, and your user notices immediately. Trust tanks fast.

![Sawtooth compression diagram](/asset/sawtooth-compression.png)

I ran into this exact wall building Arland, a chatbot powered by Claude. Every compression meant Arland would lose context, repeat himself, or suddenly not understand why we were talking about something specific. It worked technically, but it *felt* broken, and that matters more than you'd think.

The summaries came out as nicely formatted markdown notes that Arland could keep in storage, but generating them with a separate prompt meant the summarizer injected its own personality into the mix and, worse, lost vital information in the process. Summarization is also expensive. You're running separate inference passes on top of your main conversation loop, which means you keep the context window uncomfortably small to avoid those costly compressions. Small context means you compress more often, which compounds everything. It's a spiral.

## Another Way

You don't actually need to compress context. You can prune it instead. Most conversation is noise: clarifications, tangents that went nowhere, corrections to earlier corrections, repeated confirmations. None of that needs to be summarized. It needs to be deleted.

Instead of trying to extract meaning from old conversation, you just remove the garbage and keep what mattered. The signal stays clean, and you sidestep the whole compression problem.

## The Three-Stage System

I ended up with a memory architecture loosely modeled on how human memory works, and it's held up well so far.

![Three-stage memory diagram](/asset/three-stage-memory.png)

### Short-term memory

The last few conversation turns stay verbatim: no compression, no pruning, full fidelity. Think of this as active awareness, the thing you're paying attention to right now.
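A minimal sketch of that verbatim buffer, assuming turns are plain strings. The `ShortTermMemory` class and its `add` method are hypothetical names for illustration, not Arland's actual code. The point it shows is that a turn leaving short-term memory isn't thrown away; it's handed back so the next stage can decide what to do with it.

```python
from collections import deque
from typing import Optional


class ShortTermMemory:
    """Keeps the most recent turns verbatim and hands back whatever ages out."""

    def __init__(self, max_turns: int = 3) -> None:
        self.max_turns = max_turns
        self.turns: deque[str] = deque()

    def add(self, turn: str) -> Optional[str]:
        """Append a turn. Returns the turn that aged out, or None if there was room."""
        self.turns.append(turn)
        if len(self.turns) > self.max_turns:
            return self.turns.popleft()
        return None


stm = ShortTermMemory(max_turns=3)
aged_out = [stm.add(f"turn {i}") for i in range(1, 6)]
# The first three adds return None; after that the oldest turn comes back each time.
```

Whatever `add` returns is the candidate for working memory.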

### Working memory

The next batch of turns, say 10 or so, stays in pruned form. As each turn comes in, Claude scores it for relevance using criteria that matter for your specific agent. That scoring happens inline in the same prompt, so it costs few extra tokens and adds almost no overhead. When working memory hits capacity, the lowest-scoring turn gets removed (the oldest one if several tie), and everything else ages down a step. Scores also decay over time, so even highly relevant turns fade as they get older. It's like remembering yesterday better than last month.
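One way inline scoring could look, as a sketch under assumptions: the model is asked to end each reply with a small tag, and the app strips the tag before showing the reply. The `<relevance>` tag, the 0-10 scale, the instruction wording, and the `split_reply` helper are all made up for this example; the real criteria depend on your agent.

```python
import re

# Hypothetical instruction appended to the system prompt. The tag name and the
# 0-10 scale are assumptions for this sketch.
SCORING_INSTRUCTION = (
    "After your reply, rate how relevant this exchange is to the ongoing "
    "conversation on a 0-10 scale, in the form <relevance>N</relevance>."
)

_TAG = re.compile(r"<relevance>\s*(\d+)\s*</relevance>")


def split_reply(raw: str, default_score: int = 5) -> tuple[str, int]:
    """Separate the user-visible reply from the inline relevance score."""
    match = _TAG.search(raw)
    if match is None:
        # The model left the tag out: keep the text, fall back to a neutral score.
        return raw.strip(), default_score
    score = min(int(match.group(1)), 10)  # clamp values above the scale
    visible = _TAG.sub("", raw).strip()
    return visible, score


reply, score = split_reply(
    "Sure, the deploy is set for Friday.\n<relevance>8</relevance>"
)
```

Because the score rides along with the normal reply, there's no second inference pass to pay for.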

![Conversation flow diagram](/asset/conversation-flow.png)
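The working-memory rules (score on arrival, decay with age, drop the weakest turn at capacity) can be sketched roughly like this. The multiplicative decay factor is an assumption, as are the `ScoredTurn` and `WorkingMemory` names; any decay curve that shrinks older scores would fit the same shape.

```python
from dataclasses import dataclass


@dataclass
class ScoredTurn:
    text: str
    score: float  # relevance assigned when the turn arrived
    age: int = 0  # how many turns have arrived since


class WorkingMemory:
    def __init__(self, capacity: int = 6, decay: float = 0.9) -> None:
        self.capacity = capacity
        self.decay = decay  # assumed per-turn multiplier, tune for your agent
        self.turns: list[ScoredTurn] = []

    def effective(self, turn: ScoredTurn) -> float:
        """Relevance after decay: older turns count for less."""
        return turn.score * self.decay ** turn.age

    def add(self, text: str, score: float) -> list[str]:
        """Age existing turns, store the new one, and prune down to capacity."""
        for turn in self.turns:
            turn.age += 1
        self.turns.append(ScoredTurn(text, score))
        pruned = []
        while len(self.turns) > self.capacity:
            # Lowest decayed score goes first; among ties, the oldest.
            victim = min(self.turns, key=lambda t: (self.effective(t), -t.age))
            self.turns.remove(victim)
            pruned.append(victim.text)
        return pruned


wm = WorkingMemory(capacity=3, decay=0.5)
wm.add("User's name is Sam", 9)
wm.add("ok", 1)
wm.add("Deadline is Friday", 8)
first_pruned = wm.add("Prefers short answers", 6)  # the filler turn "ok" goes
second_pruned = wm.add("thanks", 2)  # the old high-scoring turn has decayed most
```

The second eviction is the interesting one: a turn that scored 9 still fades out once it's old enough, which is exactly why durable facts like a name belong in long-term memory instead.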

### Long-term memory

Persistent facts and context stay around indefinitely. Instead of running a separate summarization process, the agent updates its own long-term notes directly using tools, so it decides what's important enough to remember rather than you guessing on its behalf through a summarization prompt. This keeps the agent in the driver's seat.
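Here's a rough sketch of what a note-taking tool might look like. The definition uses the `name` / `description` / `input_schema` shape that Claude's tool-use API expects, but the `update_notes` tool, its fields, and the `LongTermMemory` class are hypothetical, and the in-memory dict stands in for whatever storage you actually use.

```python
# Tool definition in the name / description / input_schema shape used for
# Claude tool use. The tool name and its fields are hypothetical.
REMEMBER_TOOL = {
    "name": "update_notes",
    "description": "Save or overwrite a long-term note worth keeping.",
    "input_schema": {
        "type": "object",
        "properties": {
            "key": {"type": "string", "description": "Short label for the note"},
            "note": {"type": "string", "description": "The fact to remember"},
        },
        "required": ["key", "note"],
    },
}


class LongTermMemory:
    def __init__(self) -> None:
        self.notes: dict[str, str] = {}  # stand-in for real persistent storage

    def handle_tool_call(self, name: str, tool_input: dict) -> str:
        """Apply a tool call the agent made and return the tool result text."""
        if name != REMEMBER_TOOL["name"]:
            return f"unknown tool: {name}"
        self.notes[tool_input["key"]] = tool_input["note"]
        return "saved"

    def render(self) -> str:
        """Format the notes for inclusion in the system prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in self.notes.items())


ltm = LongTermMemory()
ltm.handle_tool_call("update_notes", {"key": "name", "note": "User is called Sam"})
ltm.handle_tool_call("update_notes", {"key": "name", "note": "User prefers Samantha"})
rendered = ltm.render()
```

Keying notes by a short label lets the agent overwrite a fact when it changes instead of piling up contradictory entries.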

## Why This Works

You're not fighting the agent's context window. You're working with how attention and memory actually function in humans. The agent stays coherent because you preserve the signal and trim the noise: no sawtooth, no sudden context loss. And because scoring happens inline, you're not burning tokens on separate summarization passes, so the system is leaner and more responsive overall.

It's early days with this approach, but it already beats the old summarization-based system in ways that matter: immersion stays high, consistency improves noticeably, and cost drops. I've lowered short-term to 3 turns and working memory to 6 turns, and Arland keeps conversation flow better than most people. Your mileage may vary, of course, and system prompting matters here a lot, but the signal-to-noise ratio is what makes it work.
