EverMemOS: SOTA Results Across Four Memory Benchmarks and What It Means for LLM Agents


The “cognitive wall”: why more context isn’t enough

A straightforward solution to long-term coherence is to expand the context window. But ultra-long contexts can be expensive and still degrade in effectiveness (e.g., “lost-in-the-middle” behavior). More importantly, many real failures aren’t caused by missing information—they’re caused by poor integration: the agent may retrieve relevant facts but fail to consolidate them into stable concepts, detect contradictions, or maintain a consistent user model.

EverMemOS is built around a simple thesis:

The future of long-term agents depends more on structured memory organization than on brute-force context expansion.

EverMemOS in one line

EverMemOS is a Memory Operating System that turns unbounded interaction streams into a structured “digital brain” via a three-phase memory lifecycle:

  1. Episodic Trace Formation

  2. Semantic Consolidation

  3. Reconstructive Recollection 

Phase I — Episodic Trace Formation: from dialogue streams to MemCells

EverMemOS introduces a core memory primitive: the MemCell, an atomic unit that bridges low-level logs and high-level semantics.

A MemCell is defined as a tuple:

  • E (Episode): a concise third-person narrative of what happened (a stable semantic anchor)

  • F (Atomic Facts): discrete, verifiable statements derived from the episode for high-precision matching

  • P (Foresight): forward-looking inferences (plans, temporary states) annotated with validity intervals [t_start, t_end] for temporal awareness

  • M (Metadata): timestamps and source pointers for grounding
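To make the tuple concrete, here is a minimal sketch of the MemCell structure in Python. The class and field names (`Foresight`, `is_valid_at`, etc.) are illustrative, not the paper's actual implementation; the key point is that foresight items carry a validity interval that can be checked against the current time.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Foresight:
    """A forward-looking inference with a validity interval [t_start, t_end]."""
    statement: str
    t_start: datetime
    t_end: datetime

    def is_valid_at(self, t: datetime) -> bool:
        # A foresight item only applies while `t` falls inside its interval.
        return self.t_start <= t <= self.t_end

@dataclass
class MemCell:
    """Atomic memory unit: (Episode, Facts, Foresight, Metadata)."""
    episode: str                                  # E: concise third-person narrative
    facts: list[str]                              # F: discrete, verifiable statements
    foresight: list[Foresight] = field(default_factory=list)  # P: time-bounded inferences
    metadata: dict = field(default_factory=dict)  # M: timestamps, source pointers
```

Checking `is_valid_at` at recall time is what lets temporary states (a deadline, a medication course) expire instead of lingering as stale facts.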

To create MemCells robustly from noisy conversations, EverMemOS uses a pipeline that includes:

  • Semantic boundary detection (segmenting continuous streams into coherent episodes),

  • Narrative synthesis (resolving coreferences and ambiguity into a clean episode),

  • Structured derivation of atomic facts and time-bounded foresight signals.
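The three stages above can be sketched as a skeleton pipeline. In EverMemOS these stages are model-driven; the functions below are hypothetical placeholders (e.g., splitting on an explicit topic-shift marker) that only show how the stages compose, not how the real system implements them.

```python
def detect_boundaries(turns: list[str]) -> list[list[str]]:
    """Segment a continuous dialogue stream into coherent episodes.
    Placeholder: split on an explicit `<topic_shift>` marker; the real
    system would detect semantic boundaries with a model."""
    episodes, current = [], []
    for turn in turns:
        if turn == "<topic_shift>":
            if current:
                episodes.append(current)
                current = []
        else:
            current.append(turn)
    if current:
        episodes.append(current)
    return episodes

def synthesize_narrative(episode_turns: list[str]) -> str:
    """Resolve coreferences/ambiguity into a clean third-person episode.
    Placeholder: join the turns; a real system would use an LLM here."""
    return " ".join(episode_turns)

def derive_structure(narrative: str) -> tuple[list[str], list]:
    """Derive atomic facts and time-bounded foresight from the narrative.
    Placeholder: one fact per sentence, no foresight extraction."""
    facts = [s.strip() for s in narrative.split(".") if s.strip()]
    return facts, []

def form_memcells(turns: list[str]) -> list[dict]:
    """Compose the three stages: boundaries -> narrative -> structure."""
    cells = []
    for episode_turns in detect_boundaries(turns):
        narrative = synthesize_narrative(episode_turns)
        facts, foresight = derive_structure(narrative)
        cells.append({"episode": narrative, "facts": facts, "foresight": foresight})
    return cells
```

The value of the composition is that each downstream stage sees a cleaner input than raw chat logs: boundary detection isolates one topic, narrative synthesis removes ambiguity, and fact derivation then becomes a much easier extraction problem.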

Phase II — Semantic Consolidation: self-organizing “MemScenes” + profile evolution

If MemCells are atoms, then MemScenes are the themes that keep an agent coherent.

In Semantic Consolidation, EverMemOS performs online incremental clustering:

  • When a new MemCell arrives, it compares the cell to existing MemScene centroids.

  • If similarity exceeds a threshold τ, the MemCell is assimilated; otherwise, a new MemScene is created.
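The assimilate-or-create rule can be written down directly. The sketch below uses cosine similarity over embedding vectors and a running-mean centroid update; these are common choices for online incremental clustering, but the specific similarity function, update rule, and threshold in EverMemOS may differ.

```python
import numpy as np

def assign_to_memscene(cell_vec: np.ndarray, scenes: list[dict],
                       tau: float = 0.8) -> int:
    """Assimilate a new MemCell embedding into the most similar MemScene if
    cosine similarity >= tau; otherwise open a new MemScene. Each scene is a
    dict with a running 'centroid' and a member 'count'. Returns the scene
    index. (Hypothetical sketch of the assimilate-or-create rule.)"""
    v = cell_vec / np.linalg.norm(cell_vec)
    best_i, best_sim = -1, -1.0
    for i, scene in enumerate(scenes):
        c = scene["centroid"] / np.linalg.norm(scene["centroid"])
        sim = float(v @ c)
        if sim > best_sim:
            best_i, best_sim = i, sim
    if best_sim >= tau:
        # Assimilate: update the centroid as a running mean of member vectors.
        s = scenes[best_i]
        s["centroid"] = (s["centroid"] * s["count"] + cell_vec) / (s["count"] + 1)
        s["count"] += 1
        return best_i
    # No scene is similar enough: this MemCell seeds a new theme.
    scenes.append({"centroid": cell_vec.astype(float), "count": 1})
    return len(scenes) - 1
```

Because the update is incremental, memory organization stays online: each new cell costs one pass over scene centroids rather than a full re-clustering of history.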

Crucially, consolidation also drives profile evolution:

  • Instead of prompting over raw chat logs, EverMemOS updates a compact User Profile from scene summaries, helping separate stable traits from transient states and track conflicts over time.

This is the part many “flat retrieval” memory systems miss: structured consolidation as a first-class system behavior.

Phase III — Reconstructive Recollection: “necessary and sufficient” context, not maximal recall

In EverMemOS, retrieval is not treated as a one-shot lookup. It’s modeled as an active reconstruction process guided by a principle of:

Necessity and sufficiency: retrieve only what’s needed to answer well—no more, no less.

At a high level, EverMemOS:

  • selects relevant MemScenes,

  • retrieves episodes (MemCells) using hybrid retrieval,

  • and uses iterative checks (e.g., sufficiency verification + query rewriting) to avoid both under-recall and “prompt bloat.”
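The loop structure behind these three steps can be sketched as follows. Every callable here (`retrieve`, `is_sufficient`, `rewrite`) is a hypothetical stand-in for what is a model-driven component in EverMemOS; the sketch only shows the control flow that balances under-recall against prompt bloat.

```python
def reconstructive_recollect(query, scenes, retrieve, is_sufficient, rewrite,
                             max_rounds: int = 3) -> list:
    """Iteratively build a 'necessary and sufficient' context: retrieve
    MemCells from relevant scenes, check sufficiency, and rewrite the query
    until the evidence is adequate or the round budget is spent."""
    context, q = [], query
    for _ in range(max_rounds):
        # Hybrid retrieval over the selected scenes; deduplicate the context.
        context += [c for c in retrieve(q, scenes) if c not in context]
        if is_sufficient(q, context):
            break                    # enough evidence: stop to avoid prompt bloat
        q = rewrite(q, context)      # under-recall: reformulate and retry
    return context
```

Stopping at sufficiency, rather than maximizing recall, is what keeps the final prompt small; the rewrite step is what keeps a too-narrow first query from causing under-recall.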

Results

EverMemOS has achieved State-of-the-Art (SOTA) results across four major long-term memory benchmarks:

  • LoCoMo: outperformed all existing memory systems and even full-context large models, while using drastically fewer tokens (93.05% overall accuracy).

  • LongMemEval: achieved a leading 83.00% accuracy, with particularly strong gains in Knowledge Updates and temporal reasoning.

  • HaluMem: set a new standard for memory integrity and accuracy (90.04% recall).

  • PersonaMem v2: demonstrated superior performance in deep personalization and behavioral consistency across diverse scenarios.

Why this matters for real agents (beyond benchmarks)

Today’s benchmarks focus heavily on answer-level correctness. But real assistants must also handle:

  • conflicting preferences vs. new constraints,

  • stable personalization,

  • time-bounded states (medications, deadlines, temporary plans),

  • and proactive, experience-grounded “foresight.”

EverMemOS explicitly builds memory representations (like time-valid Foresight) and system behaviors (semantic consolidation) to support these agent requirements, and illustrates them via qualitative case studies.

What’s next (and how to try it)

EverMemOS is designed as a system-level foundation: a memory OS that can be attached to different agent stacks and tasks, while keeping a consistent lifecycle contract for building and using memory.

Paper + Code:

arXiv page:  https://arxiv.org/abs/2601.02163

code:        https://github.com/EverMind-AI/EverMemOS

