EverOS: SOTA Results Across Four Memory Benchmarks and What It Means for LLM Agents

EverOS
long term memory
RAG
context
LoCoMo
LongMemEval
PersonaMem
sota

The “cognitive wall”: why more context isn’t enough

The obvious approach to long-term coherence is to expand the context window. But ultra-long contexts are expensive, and their effectiveness still degrades (e.g., “lost-in-the-middle” behavior). More importantly, many real failures aren’t caused by missing information; they’re caused by poor integration. The agent may retrieve the relevant facts but fail to consolidate them into stable concepts, detect contradictions, or maintain a consistent user model.

EverOS is built around a simple thesis:

The future of long-term agents depends more on structured memory organization than on brute-force context expansion.

EverOS in one line

EverOS is a Memory Operating System that turns unbounded interaction streams into a structured “digital brain” via a three-phase memory lifecycle:

  1. Episodic Trace Formation

  2. Semantic Consolidation

  3. Reconstructive Recollection 

Phase I — Episodic Trace Formation: from dialogue streams to MemCells

EverOS introduces a core memory primitive: the MemCell, an atomic unit that bridges low-level logs and high-level semantics.

A MemCell is defined as a tuple (E, F, P, M):

  • E (Episode): a concise third-person narrative of what happened (a stable semantic anchor)

  • F (Atomic Facts): discrete, verifiable statements derived from the episode for high-precision matching

  • P (Foresight): forward-looking inferences (plans, temporary states) annotated with validity intervals [t_start, t_end] for temporal awareness

  • M (Metadata): timestamps and source pointers for grounding
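
As a rough illustration of the tuple above, here is a minimal Python sketch of a MemCell. The class and field names (Foresight, Metadata, source_ids, active_foresight) and the choice to treat a missing t_end as open-ended are assumptions made for this example, not the EverOS schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Foresight:
    """A forward-looking inference with a validity interval [t_start, t_end]."""
    statement: str
    t_start: datetime
    t_end: Optional[datetime] = None  # assumption: None means open-ended

    def is_valid_at(self, t: datetime) -> bool:
        if t < self.t_start:
            return False
        return self.t_end is None or t <= self.t_end


@dataclass
class Metadata:
    timestamp: datetime
    source_ids: List[str] = field(default_factory=list)  # hypothetical source pointers


@dataclass
class MemCell:
    episode: str                # E: concise third-person narrative
    facts: List[str]            # F: discrete, verifiable statements
    foresight: List[Foresight]  # P: time-bounded inferences
    metadata: Metadata          # M: grounding information

    def active_foresight(self, now: datetime) -> List[Foresight]:
        """Return only the foresight entries whose interval covers `now`."""
        return [p for p in self.foresight if p.is_valid_at(now)]


cell = MemCell(
    episode="The user told the assistant they started a 10-day antibiotic course.",
    facts=["The user is taking antibiotics."],
    foresight=[
        Foresight(
            "The user should avoid alcohol.",
            t_start=datetime(2025, 1, 1),
            t_end=datetime(2025, 1, 10),
        )
    ],
    metadata=Metadata(timestamp=datetime(2025, 1, 1), source_ids=["msg-17"]),
)
```

The validity interval is what lets a temporary state expire: asking for active foresight on January 5 returns the alcohol note, while asking on February 1 returns nothing.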

To create MemCells robustly from noisy conversations, EverOS uses a pipeline that includes:

  • Semantic boundary detection (to segment continuous streams into coherent episodes),

  • Narrative synthesis (resolve coreferences / ambiguity into a clean episode),

  • Structured derivation of atomic facts + time-bounded foresight signals.
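
To make the first pipeline step concrete, the sketch below segments a stream of turns into episodes. It swaps in a simple lexical-overlap heuristic (Jaccard similarity over word sets) in place of a real semantic boundary detector; the post does not describe how EverOS detects boundaries, so the function names and the min_overlap threshold are purely illustrative.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set of a turn."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def segment(turns: List[str], min_overlap: float = 0.1) -> List[List[str]]:
    """Start a new episode when a turn shares too few words with the current one."""
    episodes: List[List[str]] = []
    current: List[str] = []
    context: Set[str] = set()  # vocabulary of the episode being built
    for turn in turns:
        t = tokens(turn)
        if current and jaccard(context, t) < min_overlap:
            episodes.append(current)
            current, context = [], set()
        current.append(turn)
        context |= t
    if current:
        episodes.append(current)
    return episodes


turns = [
    "I booked flights to Lisbon for May.",
    "Which Lisbon hotels are near the flights terminal?",
    "Unrelated: my cat needs a vet appointment.",
    "Can the vet see my cat on Friday?",
]
episodes = segment(turns)
```

On this toy input the travel turns and the vet turns land in separate episodes. A production system would presumably compare embeddings or ask a model to judge topic shifts rather than count shared words.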

Phase II — Semantic Consolidation: self-organizing “MemScenes” + profile evolution

If MemCells are atoms, then MemScenes are the themes that keep an agent coherent.

In Semantic Consolidation, EverOS performs online incremental clustering:

  • When a new MemCell arrives, it compares the cell to existing MemScene centroids.

  • If similarity exceeds a threshold τ, the MemCell is assimilated; otherwise, a new MemScene is created.
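
The assimilate-or-create rule above can be sketched in a few lines of Python. This is an assumption-laden illustration: it uses cosine similarity, picks the best-matching scene, and updates the centroid as a running mean, none of which is confirmed by the post; the names MemScene, consolidate, and tau are likewise hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class MemScene:
    def __init__(self, embedding: List[float]):
        self.centroid = list(embedding)
        self.size = 1

    def assimilate(self, embedding: List[float]) -> None:
        # Running-mean centroid update (an assumption of this sketch).
        self.size += 1
        self.centroid = [
            c + (x - c) / self.size for c, x in zip(self.centroid, embedding)
        ]


def consolidate(scenes: List[MemScene], embedding: List[float], tau: float = 0.8) -> int:
    """Assign a MemCell embedding to a scene and return the scene's index."""
    best_idx, best_sim = -1, -1.0
    for i, scene in enumerate(scenes):
        sim = cosine(scene.centroid, embedding)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx >= 0 and best_sim > tau:
        scenes[best_idx].assimilate(embedding)
        return best_idx
    scenes.append(MemScene(embedding))  # no scene is similar enough
    return len(scenes) - 1


scenes: List[MemScene] = []
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
assignments = [consolidate(scenes, e) for e in stream]
```

Here the first two embeddings fall into the same scene and the third, nearly orthogonal one opens a new scene. Because each cell is processed once as it arrives, the cost per update depends on the number of scenes, not on the full history.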

Crucially, consolidation also drives profile evolution:

  • Instead of prompting over raw chat logs, EverOS updates a compact User Profile from scene summaries, helping separate stable traits from transient states and track conflicts over time.

This is the part many “flat retrieval” memory systems miss: structured consolidation as a first-class system behavior.

Phase III — Reconstructive Recollection: “necessary and sufficient” context, not maximal recall

In EverOS, retrieval is not treated as a one-shot lookup. It’s modeled as an active reconstruction process guided by a principle of:

Necessity and sufficiency: retrieve only what’s needed to answer well, no more and no less.

At a high level, EverOS:

  • selects relevant MemScenes,

  • retrieves episodes (MemCells) using hybrid retrieval,

  • and uses iterative checks (e.g., sufficiency verification + query rewriting) to avoid both under-recall and “prompt bloat.”
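
The iterative part of this process can be pictured as a small control loop: retrieve, check sufficiency, rewrite the query, repeat. The sketch below is a toy under stated assumptions. The function recollect and its parameters are hypothetical, scene selection and hybrid retrieval are collapsed into a single keyword lookup, and the sufficiency check and rewriter are hard-coded stand-ins for what would likely be model calls.

```python
from typing import Callable, List


def recollect(
    query: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rewrite: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> List[str]:
    """Grow the context round by round and stop as soon as it is sufficient."""
    context: List[str] = []
    current = query
    for _ in range(max_rounds):
        for item in retrieve(current):
            if item not in context:
                context.append(item)
        if is_sufficient(query, context):
            break
        current = rewrite(query, context)
    return context


# Toy memory and stand-in components (all hypothetical).
MEMORY = [
    "The user's sister is named Ana.",
    "Ana lives in Porto.",
    "The user likes espresso.",
]


def retrieve(q: str) -> List[str]:
    keys = [k for k in ("sister", "Ana") if k in q]
    return [m for m in MEMORY if any(k in m for k in keys)]


def is_sufficient(q: str, ctx: List[str]) -> bool:
    return any("lives in" in c for c in ctx)


def rewrite(q: str, ctx: List[str]) -> str:
    return "Where does Ana live?" if any("named Ana" in c for c in ctx) else q


context = recollect("Where does the user's sister live?", retrieve, is_sufficient, rewrite)
```

The first round finds only the sister's name, which fails the sufficiency check; the rewritten query then pulls in where Ana lives and the loop stops. The unrelated espresso memory is never added, which is the "no more, no less" behavior in miniature.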

Results

EverOS has achieved State-of-the-Art (SOTA) results across four major long-term memory benchmarks:

LoCoMo: Reached 93.05% overall accuracy, outperforming all existing memory systems and even full-context large models while using drastically fewer tokens.

LongMemEval: Achieved a leading 83.00% accuracy, showing particularly strong gains in Knowledge Updates and temporal reasoning.

HaluMem: Set a new standard for memory integrity and accuracy (90.04% recall).

PersonaMem v2: Demonstrated superior performance in deep personalization and behavioral consistency across diverse scenarios.

Why this matters for real agents (beyond benchmarks)

Today’s benchmarks focus heavily on answer-level correctness. But real assistants must also handle:

  • conflicting preferences vs. new constraints,

  • stable personalization,

  • time-bounded states (medications, deadlines, temporary plans),

  • and proactive, experience-grounded “foresight.”

EverOS explicitly builds memory representations (like time-valid Foresight) and system behaviors (semantic consolidation) to support these agent requirements, and illustrates them via qualitative case studies.

What’s next (and how to try it)

EverOS is designed as a system-level foundation: a memory OS that can be attached to different agent stacks and tasks, while keeping a consistent lifecycle contract for building and using memory.

Paper + Code:

arXiv page:  https://arxiv.org/abs/2601.02163

code:        https://github.com/EverMind-AI/EverOS

