Event-Sourced Associative Recall for Agent Memory

The starting point

An agent that runs across many sessions accumulates knowledge about a project: conventions, prior decisions, recurring pitfalls. The usual reflex is to embed every note into a vector store and retrieve by cosine similarity. We took a different route, and after the change in #188 the recall path has no embeddings at all.

Two problems pushed us off the vector path. First, the original memory_recall was stateless: it ran a fresh keyword search and graph traversal every session, with no signal that a node had been useful before. A note recalled fifty times ranked exactly like one never opened. Second, there was no write path for intent: neither the user nor the model could say "this matters," even though the scoring code already read an importance field that nothing populated.

What recall actually runs on

Memory is plain Markdown under .loopal/memory/, one topic per file with frontmatter and [[wikilink]] references. On session start a scanner folds those files into SQLite: a memory_nodes table, a memory_edges table, and an FTS5 virtual table (memory_fts) kept in sync by triggers. There is no embedding column.

Retrieval is three lexical-and-structural passes combined in one memory_recall call:

Direct hits — FTS5 MATCH over name / description / body, or explicit anchor slugs.
Neighbors — bidirectional BFS over the edge graph, depth 2 by default.
Co-occurring — synthesized edges from TF-IDF token/slug clustering, capped at 5.
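The neighbor pass can be pictured with a small sketch. This is not the project's traversal code: the function name neighbors_within and the slice-of-pairs edge representation are assumptions made for illustration, and edge types and scoring are left out entirely. It only shows what "bidirectional, depth 2" means in practice.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Collect every node within `max_depth` hops of `start`, walking edges in
/// both directions. Returns (node, depth) pairs in BFS order, nearest first.
/// Hypothetical helper: the name and the edge representation are illustrative.
pub fn neighbors_within(
    edges: &[(&str, &str)],
    start: &str,
    max_depth: usize,
) -> Vec<(String, usize)> {
    // Index each edge under both endpoints so a link is walkable from either end.
    let mut adj: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(from, to) in edges {
        adj.entry(from).or_default().push(to);
        adj.entry(to).or_default().push(from);
    }

    let mut seen: HashSet<&str> = HashSet::new();
    seen.insert(start);
    let mut queue: VecDeque<(&str, usize)> = VecDeque::new();
    queue.push_back((start, 0));
    let mut out = Vec::new();

    while let Some((node, depth)) = queue.pop_front() {
        if depth > 0 {
            out.push((node.to_string(), depth));
        }
        if depth == max_depth {
            continue; // do not expand past the depth limit
        }
        if let Some(next_nodes) = adj.get(node) {
            for &next in next_nodes {
                // `insert` returns false for nodes already visited.
                if seen.insert(next) {
                    queue.push_back((next, depth + 1));
                }
            }
        }
    }
    out
}
```

With edges a-b, b-c, c-d and x-a, a depth-2 walk from a reaches b and x at depth 1 and c at depth 2, and never touches d. Note that x is found even though its edge points at a rather than away from it.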

Edges are typed, and the type sets a weight:

pub fn recall_edge_weight(kind: EdgeKind) -> f32 {
    match kind {
        EdgeKind::References   => 1.0,
        EdgeKind::ContainedIn  => 0.9,
        EdgeKind::DerivedFrom  => 0.8,
        EdgeKind::SupersededBy => 0.7,
        EdgeKind::Contradicts  => 0.6,
        EdgeKind::CoOccursSlug  => 0.55,
        EdgeKind::CoOccursToken => 0.45,
    }
}

A hand-written references edge carries more weight than an inferred co_occurs_token one. That distinction is the thing a flat vector index throws away: it cannot tell editorial intent from coincidental textual overlap. Here the BFS trail preserves provenance (frontmatter, inline-link, synthesized), so a derived link never outranks one the author wrote.
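To make the effect of the weights concrete, here is a rough sketch of how an edge type might feed a path score. The weights are the ones listed above. Everything else is an assumption: the HOP_DECAY constant, the path_score helper, and the choice to multiply weights along the path are illustrative, not the project's actual formula.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum EdgeKind {
    References,
    ContainedIn,
    DerivedFrom,
    SupersededBy,
    Contradicts,
    CoOccursSlug,
    CoOccursToken,
}

/// Weights as given in the text above.
pub fn recall_edge_weight(kind: EdgeKind) -> f32 {
    match kind {
        EdgeKind::References => 1.0,
        EdgeKind::ContainedIn => 0.9,
        EdgeKind::DerivedFrom => 0.8,
        EdgeKind::SupersededBy => 0.7,
        EdgeKind::Contradicts => 0.6,
        EdgeKind::CoOccursSlug => 0.55,
        EdgeKind::CoOccursToken => 0.45,
    }
}

/// Assumed per-hop decay; the real value and form are not stated in the text.
const HOP_DECAY: f32 = 0.5;

/// Hypothetical path score: multiply (edge weight * decay) over every hop.
pub fn path_score(path: &[EdgeKind]) -> f32 {
    path.iter()
        .map(|&kind| recall_edge_weight(kind) * HOP_DECAY)
        .product()
}
```

Under these assumptions a one-hop References neighbor scores 0.5 and a one-hop CoOccursToken neighbor scores 0.225, so at equal distance the authored link always sorts first.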

Reinforcement as an event log

The change in #188 makes recall behavior accumulate across sessions. Every recall and every importance tag appends one JSON line to a per-session file under .loopal/memory-events/:

// Field names only; types are omitted in this listing.
pub enum EventKind {
    QueryEvent   { qid, query, anchor, result_count, latency_ms, caller },
    RecallHit    { qid, node, rank, score, source },
    ImportanceTag { node, importance, tags, note },
}

The log is append-only and per-session by design — two sessions write different files, so concurrent agents never contend on a row, and the files merge under git like any other source. On the next session start the events fold into an in-memory map:

pub struct RecallStats {
    pub recall_count: u32,
    pub last_recalled_at: i64,
    pub importance: i8,
    pub importance_ts: i64,
}

RecallHit increments recall_count; ImportanceTag is last-write-wins on importance_ts. Folding is a left fold over an immutable event stream — the same input always yields the same state, and a corrupt line is skipped rather than failing the file. State is derived, never mutated in place. That is the property a mutable recall_count column in SQLite would not give for free.
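A minimal sketch of that fold, under stated assumptions: the Event enum below is a trimmed stand-in for EventKind that keeps only the fields the fold needs, the ts field is an assumed per-event timestamp, and a corrupt line is modeled as None rather than as a failed JSON parse. The fold_events name is hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Default, Clone, PartialEq)]
pub struct RecallStats {
    pub recall_count: u32,
    pub last_recalled_at: i64,
    pub importance: i8,
    pub importance_ts: i64,
}

/// Trimmed stand-in for the real event type; `ts` is an assumed timestamp.
pub enum Event {
    RecallHit { node: String, ts: i64 },
    ImportanceTag { node: String, importance: i8, ts: i64 },
}

/// Left fold over parsed lines. `None` stands for a line that failed to parse
/// and is skipped instead of aborting the whole file.
pub fn fold_events<I>(lines: I) -> HashMap<String, RecallStats>
where
    I: IntoIterator<Item = Option<Event>>,
{
    lines
        .into_iter()
        .flatten() // drops the `None` (corrupt) entries
        .fold(HashMap::<String, RecallStats>::new(), |mut acc, event| {
            match event {
                Event::RecallHit { node, ts } => {
                    let stats = acc.entry(node).or_default();
                    stats.recall_count += 1;
                    stats.last_recalled_at = stats.last_recalled_at.max(ts);
                }
                Event::ImportanceTag { node, importance, ts } => {
                    let stats = acc.entry(node).or_default();
                    // Last write wins by timestamp, not by position in the log,
                    // so merged files fold the same way in any order.
                    if ts >= stats.importance_ts {
                        stats.importance = importance;
                        stats.importance_ts = ts;
                    }
                }
            }
            acc
        })
}
```

Comparing timestamps rather than relying on log order is a design choice in this sketch: it keeps the result stable when two sessions' files are merged and replayed in a different sequence.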

Two derived terms then enter the neighbor score:

pub fn recall_reinforcement_bonus(stats: Option<&RecallStats>) -> f32 {
    stats.map_or(0.0, |s| (1.0 + s.recall_count as f32).ln() * 0.15)
}

pub fn importance_bonus(stats: Option<&RecallStats>) -> f32 {
    stats.map_or(0.0, |s| s.importance as f32 * 0.20)
}

Reinforcement is logarithmic on purpose: a node recalled often gets a bump, but each additional recall adds less than the one before, so a popular note cannot drown out a directly relevant one. It rides alongside BFS decay, an exponential recency term (90-day half-life), a type weight, and a TTL penalty for stale entries. None of these is a magic constant at the call site; they live in a single policy.rs.
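A few worked numbers show how flat the curve is. The reinforcement function below restates the formula from recall_reinforcement_bonus on a bare count. The recency_factor helper is an assumption: the text only says the recency term is exponential with a 90-day half-life, so the exact form here is a guess at the simplest function with that property.

```rust
/// Reinforcement term from the text: ln(1 + recall_count) * 0.15.
pub fn reinforcement(recall_count: u32) -> f32 {
    (1.0 + recall_count as f32).ln() * 0.15
}

/// Assumed form of the recency term: halves every `half_life_days`.
pub fn recency_factor(age_days: f32, half_life_days: f32) -> f32 {
    0.5_f32.powf(age_days / half_life_days)
}
```

Five recalls give a bonus of about 0.27 and fifty give about 0.59, so ten times the usage buys a little more than double the bonus. With a 90-day half-life, a node last touched 90 days ago keeps half its recency weight and one touched 180 days ago keeps a quarter.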

The importance tool

memory_set_importance is the write path that was missing. It takes a slug and an integer 1–10, and its only effect is to append an ImportanceTag event:

self.graph.record_event(EventKind::ImportanceTag {
    node: params.node.clone(),
    importance: params.importance,
    tags: params.tags,
    note: params.note,
});

At IMPORTANCE_SCALE = 0.20, a tag of 10 adds 2.0 to a node's score — enough to override weak BFS ranking, by design. The tool description tells the model when to reach for it: a strong user preference, a repeated concern, an incident the system must not forget. Because the effect is an event, it survives across sessions exactly like reinforcement does; the score path that had a reader but no writer now has both.
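The shape of the tool can be sketched in a few lines. This is a simplified stand-in, not the real handler: the in-memory Vec replaces the JSONL file, the ImportanceTag struct drops the tags and note fields, and rejecting out-of-range values (rather than clamping them) is an assumption, since the text only states that the input is an integer from 1 to 10.

```rust
pub const IMPORTANCE_SCALE: f32 = 0.20;

/// Trimmed stand-in for the ImportanceTag event.
#[derive(Debug, PartialEq)]
pub struct ImportanceTag {
    pub node: String,
    pub importance: i8,
}

/// Hypothetical handler: validate the range, then append one event.
/// Nothing else is touched; scores change only when the log is next folded.
pub fn set_importance(
    log: &mut Vec<ImportanceTag>,
    node: &str,
    importance: i8,
) -> Result<(), String> {
    if !(1..=10).contains(&importance) {
        return Err(format!("importance must be in 1..=10, got {}", importance));
    }
    log.push(ImportanceTag {
        node: node.to_string(),
        importance,
    });
    Ok(())
}

/// Score contribution of a folded importance value.
pub fn importance_bonus(importance: i8) -> f32 {
    importance as f32 * IMPORTANCE_SCALE
}
```

For scale, matching an importance-10 bonus of 2.0 through reinforcement alone would need ln(1 + n) * 0.15 = 2.0, which works out to n in the hundreds of thousands of recalls. An explicit tag is a far stronger signal than usage, which matches the stated intent.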

Measuring it

The eval harness runs 58 ground-truth queries (30 keyword, 14 anchor, 14 mixed) over a fixed fixture corpus, and reports Recall@K, MRR, and nDCG. It compares cold (no events) against warmed states (5x/20x reinforcement, importance +5/+10) on the same queries.
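For readers who want the metrics pinned down, here is a compact sketch of the three per-query scores using their standard definitions with binary relevance. The function names and the slice-of-slugs signatures are illustrative, not the harness's API. MRR is the mean of reciprocal_rank over all queries.

```rust
/// Recall@K: fraction of the relevant items that appear in the top `k` results.
pub fn recall_at_k(ranked: &[&str], relevant: &[&str], k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = relevant
        .iter()
        .filter(|&&wanted| ranked.iter().take(k).any(|&got| got == wanted))
        .count();
    hits as f64 / relevant.len() as f64
}

/// Reciprocal rank: 1 / (1-based position of the first relevant result),
/// or 0.0 when nothing relevant was returned.
pub fn reciprocal_rank(ranked: &[&str], relevant: &[&str]) -> f64 {
    ranked
        .iter()
        .position(|got| relevant.contains(got))
        .map_or(0.0, |index| 1.0 / (index as f64 + 1.0))
}

/// nDCG@K with binary relevance: discounted gain of the actual ranking,
/// divided by the gain of an ideal ranking that lists relevant items first.
pub fn ndcg_at_k(ranked: &[&str], relevant: &[&str], k: usize) -> f64 {
    // Position i (0-based) is discounted by log2(i + 2).
    let discount = |index: usize| 1.0 / (index as f64 + 2.0).log2();
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, got)| relevant.contains(*got))
        .map(|(index, _)| discount(index))
        .sum();
    let ideal: f64 = (0..relevant.len().min(k)).map(discount).sum();
    if ideal == 0.0 {
        0.0
    } else {
        dcg / ideal
    }
}
```

For a ranking a, b, c, d with b and d relevant, Recall@2 is 0.5, the reciprocal rank is 0.5, and nDCG@4 is about 0.65.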

Cold baseline: R@5 = 0.579, MRR = 0.753. With per-query reinforcement the lift is +7.4% MRR and +16.1% R@10. The integration tests pin the mechanism rather than the aggregate: one recalls a node five times, folds, reopens in a new session, and asserts recall_count >= 5 persisted and that the reinforced neighbor now outranks an equidistant sibling; another tags a node importance=5 and asserts its score strictly exceeds the cold score.

A warning the harness surfaced: warming every relevant node globally barely moves the aggregate, because cross-query pollution lifts noise alongside signal. The honest measurement is per-query — reset stats, warm only the current query's targets. The global number flatters the change; the per-query number is the one we shipped against.

Bugs the event log introduced

Append-only logs trade row contention for file-lifecycle hazards, and the multi-angle sweep on #188 found six, including:

A crash between rename and remove during compression left .jsonl and .jsonl.gz side by side, and folding both doubled recall_count. The fix treats the .gz as authoritative and drops the orphan source on next GC.
Two sessions compressing at once raced on a fixed .tmp name; the suffix is now pid + nanos.
A mid-file IO error dropped the whole file's already-parsed events; folding now applies the partial batch.
Compression read entire multi-GB session files into memory; it now streams via io::copy.
An existing non-UTF-8 .gitignore caused rules to be re-appended on every start.

These are not exotic. They are the standard failure surface of any append-then-compact store, and they argue for keeping the GC paths small and explicitly tested — eight recovery tests guard them now. Compression triggers at 90 days, archival at 365, both configurable.
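Two of those fixes are small enough to sketch. Both helpers below are hypothetical: files_to_fold and unique_tmp_name are names invented for this example, and the file-name layout is assumed from the .jsonl / .jsonl.gz pairing described above. The first shows the ".gz is authoritative" rule as a pure function over file names; the second shows a temp name built from pid plus nanoseconds.

```rust
use std::collections::HashSet;
use std::time::{SystemTime, UNIX_EPOCH};

/// Pick the files to fold. When `x.jsonl` and `x.jsonl.gz` both exist
/// (a crash between rename and remove), only the `.gz` is kept, so the
/// same events are never counted twice.
pub fn files_to_fold<'a>(names: &[&'a str]) -> Vec<&'a str> {
    // Every name that has a compressed sibling, e.g. "s1.jsonl" for "s1.jsonl.gz".
    let has_compressed_copy: HashSet<&str> = names
        .iter()
        .copied()
        .filter_map(|name| name.strip_suffix(".gz"))
        .collect();
    names
        .iter()
        .copied()
        .filter(|name| !has_compressed_copy.contains(name))
        .collect()
}

/// Temp name that differs per process and per call, so two sessions
/// compressing at once do not write to the same path.
pub fn unique_tmp_name(base: &str) -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|elapsed| elapsed.as_nanos())
        .unwrap_or(0); // clock before the epoch: fall back to 0
    format!("{}.{}.{}.tmp", base, std::process::id(), nanos)
}
```

Keeping the selection rule as a pure function over names makes the crash-recovery case testable without touching the filesystem.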

What the index is for

The retrieval mechanics only pay off if the corpus is dense with signal, which is a curation problem, not a ranking one. The sidebar agent that writes memory is a Knowledge Manager (#98), not a note-taker, and it runs against two named axioms (#142):

SNR per entry — refuse anything a future agent could reconstruct from code, git log, or project docs in ~30 seconds. This overrides explicit user save requests.
Latent structure across entries — the index encodes causes, not symptoms; three observations of one pattern collapse into one entry naming the pattern. The index should read as an orthogonal factorization, not a flat log.

Observations are debounced over a 2-second window so rapid-fire writes batch into one agent spawn, which cut spawns by 50–80%.
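The batching effect is easy to see on a toy timeline. The sketch below assumes a trailing-edge debounce, where each new observation restarts the quiet window and the agent spawns once the window elapses with no further writes. The text does not say whether the real implementation resets the window this way, and debounced_spawns is a hypothetical name.

```rust
/// Count agent spawns for observation timestamps given in ascending
/// milliseconds. Observations closer than `window_ms` to the previous one
/// join the pending batch; a longer gap starts a new batch and a new spawn.
pub fn debounced_spawns(timestamps_ms: &[u64], window_ms: u64) -> usize {
    let mut spawns = 0;
    let mut previous: Option<u64> = None;
    for &now in timestamps_ms {
        match previous {
            // Still inside the quiet window: fold into the pending batch.
            Some(last) if now.saturating_sub(last) < window_ms => {}
            // First observation, or the window elapsed: a new batch begins.
            _ => spawns += 1,
        }
        previous = Some(now);
    }
    spawns
}
```

With observations at 0, 500, 1200, 5000, 5100 and 9000 ms and a 2000 ms window, six writes collapse into three spawns.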

Takeaway

For a single-project agent memory of this size, a typed graph plus an append-only event log beat a vector store on two axes that matter: recall reinforces across sessions because usage is recorded as derivable events, and typed edges keep authored links above inferred ones. The cost is GC discipline around the log. Reach for embeddings when the corpus is large and the queries are genuinely semantic; for a curated, link-rich knowledge base, lexical retrieval over an explicit graph — with reinforcement folded from events — was the higher-leverage design, and the +7.4% MRR / +16.1% R@10 lift came without an embedding model in the loop.