EverOS: SOTA Results Across Four Memory Benchmarks and What It Means for LLM Agents

Our latest research on EverOS is now available on arXiv. Large language models are quickly evolving from “single-turn chatbots” into long-term interactive agents. But as soon as an agent is expected to stay coherent across weeks of conversations, it runs into a practical ceiling: a limited context window and fragmented memory. Even with retrieval, many systems still behave as if they are pulling isolated snippets, often missing conflicts, failing to update user state, or giving inconsistent guidance over time. EverOS is a self-organizing memory operating system that treats memory not as a flat store but as a lifecycle, inspired by biological “engram” principles, so that agents can continuously transform raw interactions into structured, evolving knowledge.

EverMind researchers

Jan 5, 2026



The “cognitive wall”: why more context isn’t enough

A straightforward solution to long-term coherence is to expand the context window. But ultra-long contexts can be expensive and still degrade in effectiveness (e.g., “lost-in-the-middle” behavior). More importantly, many real failures aren’t caused by missing information—they’re caused by poor integration: the agent may retrieve relevant facts but fail to consolidate them into stable concepts, detect contradictions, or maintain a consistent user model.

EverOS is built around a simple thesis:

The future of long-term agents depends more on structured memory organization than on brute-force context expansion.

EverOS in one line

EverOS is a Memory Operating System that turns unbounded interaction streams into a structured “digital brain” via a three-phase memory lifecycle:

Episodic Trace Formation
Semantic Consolidation
Reconstructive Recollection

Phase I — Episodic Trace Formation: from dialogue streams to MemCells

EverOS introduces a core memory primitive: the MemCell, an atomic unit that bridges low-level logs and high-level semantics.

A MemCell is defined as a tuple (E, F, P, M):

E (Episode): a concise third-person narrative of what happened (a stable semantic anchor)
F (Atomic Facts): discrete, verifiable statements derived from the episode for high-precision matching
P (Foresight): forward-looking inferences (plans, temporary states) annotated with validity intervals [t_start, t_end] for temporal awareness
M (Metadata): timestamps and source pointers for grounding
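
As a rough illustration of the tuple above, a MemCell could be modeled with plain dataclasses. This is a minimal sketch, not the EverOS implementation: the class names, field names, the open-ended `t_end=None` convention, and the sample data are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Foresight:
    """A forward-looking inference with a validity interval [t_start, t_end]."""
    statement: str
    t_start: datetime
    t_end: Optional[datetime] = None  # assumption: None means "no known expiry"

    def is_valid_at(self, t: datetime) -> bool:
        # Valid when t lies inside the closed interval; an open end never expires.
        return self.t_start <= t and (self.t_end is None or t <= self.t_end)


@dataclass
class Metadata:
    timestamp: datetime
    source_ids: List[str] = field(default_factory=list)  # pointers to raw turns


@dataclass
class MemCell:
    episode: str                # E: concise third-person narrative
    facts: List[str]            # F: discrete, verifiable statements
    foresight: List[Foresight]  # P: time-bounded inferences
    metadata: Metadata          # M: timestamps and source pointers

    def active_foresight(self, now: datetime) -> List[Foresight]:
        """Return only the foresight items whose interval contains `now`."""
        return [p for p in self.foresight if p.is_valid_at(now)]


# Hypothetical sample data, invented for illustration only.
cell = MemCell(
    episode="The user told the assistant they started a 10-day antibiotic course.",
    facts=["The user started a 10-day antibiotic course on 2026-01-05."],
    foresight=[
        Foresight(
            "The user should avoid alcohol.",
            datetime(2026, 1, 5),
            datetime(2026, 1, 15),
        )
    ],
    metadata=Metadata(datetime(2026, 1, 5, 9, 30), ["turn-12", "turn-13"]),
)
```

With this shape, `cell.active_foresight(datetime(2026, 1, 10))` returns the alcohol warning, while the same call for a date in February returns an empty list, which is the kind of temporal awareness the validity interval is meant to support.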

To create MemCells robustly from noisy conversations, EverOS uses a pipeline that includes:

Semantic boundary detection (to segment continuous streams into coherent episodes),
Narrative synthesis (resolve coreferences / ambiguity into a clean episode),
Structured derivation of atomic facts + time-bounded foresight signals.
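
To make the first step of this pipeline concrete, here is a toy sketch of boundary detection. It is deliberately not the EverOS method, which this post does not specify: it swaps in a simple lexical-overlap (Jaccard) heuristic so the example runs without a model. The `segment` function, the `threshold` value, and the sample turns are all assumptions.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for a semantic representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def segment(turns: List[str], threshold: float = 0.1) -> List[List[str]]:
    """Split a stream of turns into episodes.

    A turn joins the current episode when its overlap with the episode's
    vocabulary reaches `threshold`; otherwise it opens a new episode.
    """
    episodes: List[List[str]] = []
    vocab: Set[str] = set()
    for turn in turns:
        t = tokens(turn)
        if episodes and jaccard(t, vocab) >= threshold:
            episodes[-1].append(turn)
            vocab |= t
        else:
            episodes.append([turn])
            vocab = set(t)
    return episodes


turns = [
    "I need to book a flight to Tokyo next month.",
    "Which airline has the cheapest flight to Tokyo?",
    "Also, my knee still hurts after running.",
    "Should I see a doctor about my knee?",
]
episodes = segment(turns)
```

On this input the stream splits into two episodes of two turns each (travel, then the knee). A production system would presumably replace the overlap score with a model-based judgment of topical continuity; the control flow is the part worth noting.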

Phase II — Semantic Consolidation: self-organizing “MemScenes” + profile evolution

If MemCells are atoms, then MemScenes are the themes that keep an agent coherent.

In Semantic Consolidation, EverOS performs online incremental clustering:

When a new MemCell arrives, it compares the cell to existing MemScene centroids.
If similarity exceeds a threshold τ, the MemCell is assimilated; otherwise, a new MemScene is created.
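
The assimilate-or-create rule above can be sketched in a few lines. This is a minimal sketch under stated assumptions: cosine similarity as the similarity measure, a running-mean centroid update, the default `tau=0.8`, and the names `MemScene` and `consolidate` as used here are illustrative choices, not details confirmed by the post.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class MemScene:
    """A cluster of MemCells represented by the mean of their embeddings."""

    def __init__(self, embedding: List[float], cell_id: str) -> None:
        self.centroid = list(embedding)
        self.cell_ids = [cell_id]

    def assimilate(self, embedding: List[float], cell_id: str) -> None:
        # Running mean: fold the new embedding into the existing centroid.
        n = len(self.cell_ids)
        self.centroid = [
            (c * n + e) / (n + 1) for c, e in zip(self.centroid, embedding)
        ]
        self.cell_ids.append(cell_id)


def consolidate(
    scenes: List[MemScene], embedding: List[float], cell_id: str, tau: float = 0.8
) -> int:
    """Assign one incoming MemCell and return the index of its scene."""
    best_idx, best_sim = -1, -1.0
    for i, scene in enumerate(scenes):
        sim = cosine(embedding, scene.centroid)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx >= 0 and best_sim > tau:
        scenes[best_idx].assimilate(embedding, cell_id)
        return best_idx
    scenes.append(MemScene(embedding, cell_id))
    return len(scenes) - 1


scenes: List[MemScene] = []
assigned = [
    consolidate(scenes, [1.0, 0.0], "c1"),
    consolidate(scenes, [0.9, 0.1], "c2"),  # close to c1, joins its scene
    consolidate(scenes, [0.0, 1.0], "c3"),  # orthogonal, opens a new scene
]
```

Because each cell is compared only against scene centroids, the cost per arrival grows with the number of scenes rather than the number of stored cells, which is what makes this kind of clustering practical to run online.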

Crucially, consolidation also drives profile evolution:

Instead of prompting over raw chat logs, EverOS updates a compact User Profile from scene summaries, helping separate stable traits from transient states and track conflicts over time.

This is the part many “flat retrieval” memory systems miss: structured consolidation as a first-class system behavior.

Phase III — Reconstructive Recollection: “necessary and sufficient” context, not maximal recall

In EverOS, retrieval is not treated as a one-shot lookup. It’s modeled as an active reconstruction process guided by a principle of:

Necessity and sufficiency: retrieve only what’s needed to answer well—no more, no less.

At a high level, EverOS:

selects relevant MemScenes,
retrieves episodes (MemCells) using hybrid retrieval,
and uses iterative checks (e.g., sufficiency verification + query rewriting) to avoid both under-recall and “prompt bloat.”
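
The iterative part of this process can be sketched as a small control loop. This is a hedged illustration, not the EverOS API: `recollect` and its three callables are hypothetical names, the scene-selection step is omitted, and the keyword-based stand-ins below replace what would in practice be hybrid retrieval and model-based sufficiency checks and query rewriting.

```python
import re
from typing import Callable, List


def recollect(
    query: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rewrite: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> List[str]:
    """Grow the context round by round and stop as soon as it is sufficient."""
    context: List[str] = []
    current = query
    for _ in range(max_rounds):
        for item in retrieve(current):
            if item not in context:  # avoid duplicates across rounds
                context.append(item)
        if is_sufficient(query, context):
            break
        current = rewrite(query, context)
    return context


# Toy stand-ins so the sketch runs end to end (all hypothetical).
memory = [
    "The user moved to Berlin in March.",
    "The user's sister lives in Lyon.",
    "The user prefers trains over flights.",
]


def keywords(text: str) -> set:
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 4}


def retrieve(q: str) -> List[str]:
    return [m for m in memory if keywords(q) & keywords(m)]


def is_sufficient(q: str, ctx: List[str]) -> bool:
    # Pretend the answer needs a stated travel preference.
    return any("prefers" in c for c in ctx)


def rewrite(q: str, ctx: List[str]) -> str:
    return "travel preferences trains flights"


context = recollect(
    "How should the user travel to visit their sister?",
    retrieve,
    is_sufficient,
    rewrite,
)
```

In this toy run the first round finds only the sister's location, the sufficiency check fails, and the rewritten query pulls in the travel preference. The unrelated Berlin fact is never added, which is the "no more, no less" behavior the principle describes.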

Results

EverOS achieves state-of-the-art (SOTA) results across four major long-term memory benchmarks:

LoCoMo: Reached 93.05% overall accuracy, outperforming all existing memory systems and even full-context large models while using drastically fewer tokens.

LongMemEval: Achieved a leading 83.00% accuracy, showing particularly strong gains in Knowledge Updates and temporal reasoning.

HaluMem: Set a new standard for memory integrity and accuracy (90.04% recall).

PersonaMem v2: Demonstrated superior performance in deep personalization and behavioral consistency across diverse scenarios.

Why this matters for real agents (beyond benchmarks)

Today’s benchmarks focus heavily on answer-level correctness. But real assistants must also handle:

conflicting preferences vs. new constraints,
stable personalization,
time-bounded states (medications, deadlines, temporary plans),
and proactive, experience-grounded “foresight.”

EverOS explicitly builds memory representations (like time-valid Foresight) and system behaviors (semantic consolidation) to support these agent requirements, and illustrates them via qualitative case studies.

What’s next (and how to try it)

EverOS is designed as a system-level foundation: a memory OS that can be attached to different agent stacks and tasks, while keeping a consistent lifecycle contract for building and using memory.

Paper + Code:

arXiv page: https://arxiv.org/abs/2601.02163

Code: https://github.com/EverMind-AI/EverOS