
Why Your AI Agent Forgets Everything

Part 1 of the “Building AI Memory Systems” series


Your AI coding assistant has a dirty secret: it forgets everything between sessions. That brilliant conversation where it learned your coding style, understood your architecture, and made perfect suggestions? Gone the moment you close the chat.

I discovered this the hard way when I checked my Memory MCP server and found 225 entities—and 90% of them were garbage.

The Problem: Context Windows Are Not Memory

When you chat with Claude, GPT, or any LLM, there’s no persistent memory. What feels like “remembering” is just the conversation history being fed back into the model each time.

You: "Remember, I prefer snake_case for Python"
AI: "Got it! I'll use snake_case going forward"

--- New Session ---

You: "Write a Python function"
AI: [Uses camelCase because it has NO IDEA what you said before]

The context window is your only memory, and it’s:

  • Finite (even the newer 1-million-token context windows fill up fast with code)
  • Session-bound (dies when you close the chat)
  • Expensive (every token costs money to process)
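
To make "fed back into the model each time" concrete: a chat-style API is stateless, so the client resends the entire history on every request. A minimal generic sketch (no particular SDK assumed):

# The "memory" is just a list the client keeps and resends on every call.
history = [
    {"role": "user", "content": "Remember, I prefer snake_case for Python"},
    {"role": "assistant", "content": "Got it! I'll use snake_case going forward"},
]

def send(messages: list[dict]) -> str:
    ...  # placeholder: call whichever LLM provider you use

# Same session: the preference rides along only because WE resend it.
send(history + [{"role": "user", "content": "Write a Python function"}])

# New session: a fresh list. The model has no trace of the earlier exchange.
send([{"role": "user", "content": "Write a Python function"}])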

The Standard Solutions (And Why They Fail)

1. Sliding Window

How it works: Drop oldest messages when context fills up.

The problem: Critical decisions made early get deleted. You lose the “why” behind the “what.”

Turn 1: "Let's use Redis for caching because PostgreSQL was too slow"
...150 turns later...
[Turn 1 deleted]

AI: "Should we cache this in PostgreSQL?"
You: 😤
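
Under the hood, a sliding window is usually nothing more than a token-budget trim over the message list. A minimal sketch (the 8,000-token budget and the 4-characters-per-token estimate are made-up illustration values):

def estimate_tokens(message: dict) -> int:
    # Rough heuristic: ~4 characters per token; real systems use a tokenizer.
    return len(message["content"]) // 4

def sliding_window(history: list[dict], budget: int = 8000) -> list[dict]:
    # Keep the newest messages that fit the budget; older ones are silently
    # dropped, including that Turn 1 decision about Redis.
    kept, used = [], 0
    for message in reversed(history):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))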

2. Summarization

How it works: Use another LLM to compress old messages into summaries.

The problem: Summaries are lossy. Specific facts (“use port 8080”, “the API key is in .env.local”) get smoothed over. JetBrains Research found this can actually hurt performance—agents using summarization ran 15% longer because they lost the signals telling them when to stop.1
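
Mechanically, summarization means folding older turns into a running summary and keeping only the recent messages verbatim. A sketch, where summarize_with_llm stands in for whichever cheap model you'd call:

def summarize_with_llm(messages: list[dict]) -> str:
    ...  # placeholder: ask a second model to compress these turns into a paragraph

def compact(history: list[dict], keep_recent: int = 20) -> list[dict]:
    old, recent = history[:-keep_recent], history[-keep_recent:]
    if not old:
        return history
    summary = summarize_with_llm(old)
    # Specifics like "use port 8080" tend to dissolve into the summary prose.
    return [{"role": "system", "content": f"Earlier conversation, summarized: {summary}"}] + recent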

3. RAG (Retrieval-Augmented Generation)

How it works: Embed your documents, search for relevant chunks, inject into context.

The problem: Great for documentation, terrible for personal facts. RAG doesn’t know that YOU prefer tabs over spaces, or that YOUR production server is at 192.168.1.50.
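
For reference, the retrieval step is just similarity search over embedded chunks. A sketch using cosine similarity, where embed() is a placeholder for your embedding model (real systems precompute chunk vectors in a vector DB):

import numpy as np

def embed(text: str) -> np.ndarray:
    ...  # placeholder: embedding model of your choice

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    # Rank document chunks by cosine similarity to the query embedding.
    q = embed(query)
    scored = []
    for chunk in chunks:
        v = embed(chunk)
        scored.append((float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), chunk))
    return [chunk for _, chunk in sorted(scored, reverse=True)[:top_k]]

It answers "what do the docs say about X", not "what does this particular user prefer", and nothing here ever writes anything back.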

The Real Problem: Tools Are Wrong for Memory

Here’s what I tried: expose memory operations as tools. Let the AI decide when to save and retrieve.

# `memory` here stands for a pre-configured store (vector DB or MCP memory server).

@tool
def save_to_memory(entity: str, fact: str) -> None:
    """Save a fact about an entity."""
    memory.add(entity, fact)

@tool
def search_memory(query: str) -> list[str]:
    """Search for relevant memories."""
    return memory.search(query)

Sounds reasonable, right?

The result: garbage data. My memory was full of:

  • Random file names the AI thought were “entities”
  • Duplicate facts saved multiple times
  • Generic observations nobody needs (“user is working on code”)
  • None of the stuff that actually mattered

Why? Because the AI has to decide when to use tools. That’s cognitive overhead. It’s trying to solve your coding problem AND manage its own memory at the same time. Memory management becomes an afterthought.

What Actually Works: Hybrid Architecture

After researching this problem (and getting external validation from Gemini 3), I landed on a different approach:

Don’t make the AI manage its own memory. Do it automatically.

┌─────────────────────────────────────────────────────┐
│                    Your Message                     │
└───────────────────────┬─────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────┐
│  PRE-PROCESSOR (Small Local Model - GPT-OSS)        │
│  ┌─────────────────────────────────────────────┐    │
│  │ 1. Extract entities from message            │    │
│  │ 2. Generate memory search query             │    │
│  │ 3. Retrieve relevant memories               │    │
│  │ 4. Inject context into prompt               │    │
│  └─────────────────────────────────────────────┘    │
└───────────────────────┬─────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────┐
│ PRIMARY LLM (Frontier Model) - With Enriched Context│
└───────────────────────┬─────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────┐
│  POST-PROCESSOR (Small Local Model)                 │
│  ┌─────────────────────────────────────────────┐    │
│  │ 1. Extract new facts from conversation      │    │
│  │ 2. Store in vector database (async)         │    │
│  └─────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────┘

The key insights:

  1. Dual-model architecture: Use a fast, cheap local model (like GPT-OSS 20B via Ollama) for memory processing. Reserve the expensive cloud model for actual conversation.

  2. Automatic, not optional: Memory retrieval happens on EVERY message via hooks/middleware, not when the AI “decides” to search (a sketch of this flow follows this list).

  3. Structured schema: Define what’s worth remembering (Decisions, Preferences, Goals) vs. what’s noise (random file contents).

  4. Async storage: Don’t block the conversation to save memories. Do it in the background.
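
Wired together, the flow looks roughly like this. A minimal sketch, assuming placeholder helpers (call_local_model, call_primary_llm, MemoryStore) rather than the real Strands/Ollama/Qdrant integrations covered later in the series:

import asyncio
from dataclasses import dataclass

@dataclass
class MemoryFact:
    # Structured schema: only durable fact types get persisted.
    entity: str
    fact: str
    kind: str  # "decision" | "preference" | "goal"

# --- Stand-ins for the real pieces ---
async def call_local_model(prompt: str) -> str:
    ...  # small local model (e.g. GPT-OSS 20B via Ollama)

async def call_primary_llm(prompt: str) -> str:
    ...  # frontier model, called with the enriched prompt

class MemoryStore:  # stands in for a vector DB collection (e.g. Qdrant)
    async def search(self, query: str, limit: int = 5) -> list[str]: ...
    async def add(self, facts: list[MemoryFact]) -> None: ...

memory_store = MemoryStore()

def parse_facts(raw: str) -> list[MemoryFact]:
    ...  # placeholder: parse the local model's output into MemoryFact objects

async def handle_message(user_message: str) -> str:
    # PRE-PROCESS (blocking, target <700ms): build a search query, fetch memories.
    query = await call_local_model(f"Entities + memory search query for: {user_message}")
    memories = await memory_store.search(query)

    # PRIMARY LLM: answer with the retrieved memories injected into the prompt.
    enriched = f"Relevant memories:\n{memories}\n\nUser: {user_message}"
    reply = await call_primary_llm(enriched)

    # POST-PROCESS (non-blocking): extract and store new facts in the background.
    asyncio.create_task(extract_and_store(user_message, reply))
    return reply

async def extract_and_store(user_message: str, reply: str) -> None:
    raw = await call_local_model(f"Extract decisions/preferences/goals from:\n{user_message}\n{reply}")
    await memory_store.add(parse_facts(raw))

The only synchronous cost is the pre-processing step; storage runs as a background task, which is where the latency budget below comes from.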

The Latency Budget

“But won’t this be slow?”

Here’s the math (validated by Gemini’s review):

Step                      Target   Feasibility
GPT-OSS query generation  <500ms   ✅ Local, small output
Qdrant vector search      <200ms   ✅ Standard for vector DBs
Total sync overhead       <700ms   ✅ Sub-second
Extraction + storage      ~1s      ✅ Async (non-blocking)

Less than a second of overhead for persistent memory across sessions. Worth it.

What’s Next

This is Part 1 of a series where I build this system for real. Coming up:

  • Part 2: Dual-LLM Architecture - GPT-OSS as cognitive preprocessor
  • Part 3: Entity Schema Design - What to remember, what to forget
  • Part 4: Building the Memory Processor with Strands hooks
  • Part 5: Qdrant integration and search optimization
  • Part 6: Lessons learned and performance tuning

If you’re building AI agents and frustrated by the “goldfish memory” problem, follow along. The full spec and implementation will be open source.


References

  1. JetBrains Research, “Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents” (December 2025). Agents using LLM summarization ran 15% longer trajectories without solving more problems.


This is Part 1 of the Building AI Memory Systems series. Next: Dual-LLM Architecture: Analysis vs Synthesis

Found this useful? Follow @NightOwlCoder for more AI engineering content.



