EverMemOS: SOTA Results Across Four Memory Benchmarks and What It Means for LLM Agents




The “cognitive wall”: why more context isn’t enough
A straightforward solution to long-term coherence is to expand the context window. But ultra-long contexts can be expensive and still degrade in effectiveness (e.g., “lost-in-the-middle” behavior). More importantly, many real failures aren’t caused by missing information—they’re caused by poor integration: the agent may retrieve relevant facts but fail to consolidate them into stable concepts, detect contradictions, or maintain a consistent user model.
EverMemOS is built around a simple thesis:
The future of long-term agents depends more on structured memory organization than on brute-force context expansion.
EverMemOS in one line
EverMemOS is a Memory Operating System that turns unbounded interaction streams into a structured “digital brain” via a three-phase memory lifecycle:
Episodic Trace Formation
Semantic Consolidation
Reconstructive Recollection
Phase I — Episodic Trace Formation: from dialogue streams to MemCells
EverMemOS introduces a core memory primitive: the MemCell, an atomic unit that bridges low-level logs and high-level semantics.
A MemCell is defined as a tuple:
E (Episode): a concise third-person narrative of what happened (a stable semantic anchor)
F (Atomic Facts): discrete, verifiable statements derived from the episode for high-precision matching
P (Foresight): forward-looking inferences (plans, temporary states) annotated with validity intervals [tstart, tend] for temporal awareness
M (Metadata): timestamps and source pointers for grounding
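The tuple above can be sketched as a small data structure. This is an illustrative rendering, not the paper's implementation; the field names, the `Foresight` helper class, and the `facts_valid_at` accessor are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Foresight:
    """A forward-looking inference with a validity interval [t_start, t_end]."""
    statement: str
    t_start: float                  # validity start (e.g., a Unix timestamp)
    t_end: Optional[float] = None   # None = open-ended

@dataclass
class MemCell:
    """Atomic memory unit bridging low-level logs and high-level semantics."""
    episode: str                    # E: concise third-person narrative
    facts: list[str]                # F: discrete, verifiable statements
    foresight: list[Foresight]      # P: time-bounded inferences
    metadata: dict = field(default_factory=dict)  # M: timestamps, source pointers

    def facts_valid_at(self, t: float) -> list[str]:
        """Return foresight statements whose validity interval covers time t."""
        return [f.statement for f in self.foresight
                if f.t_start <= t and (f.t_end is None or t <= f.t_end)]
```

The validity interval is what gives the cell temporal awareness: a query at time `t` only surfaces foresight that is still in force.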
To create MemCells robustly from noisy conversations, EverMemOS uses a pipeline that includes:
Semantic boundary detection (to segment continuous streams into coherent episodes),
Narrative synthesis (to resolve coreferences and ambiguity into a clean episode),
Structured derivation of atomic facts and time-bounded foresight signals.
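The first of these steps, semantic boundary detection, can be illustrated with a toy detector that splits a turn stream wherever similarity between consecutive turn embeddings drops. This is a simple stand-in for whatever detector EverMemOS actually uses; the `drop` threshold and the cosine heuristic are assumptions.

```python
import numpy as np

def segment_stream(turn_embeddings: list[np.ndarray],
                   drop: float = 0.5) -> list[list[int]]:
    """Split a stream of turns (given as embeddings) into candidate episodes.

    A boundary is placed wherever cosine similarity between consecutive
    turns falls below `drop` -- a crude proxy for a topic shift.
    """
    episodes, current = [], [0]
    for i in range(1, len(turn_embeddings)):
        a, b = turn_embeddings[i - 1], turn_embeddings[i]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        if sim < drop:           # topic shift: close the current episode
            episodes.append(current)
            current = []
        current.append(i)
    episodes.append(current)
    return episodes
```

Each resulting segment would then go through narrative synthesis and structured derivation to become a MemCell.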
Phase II — Semantic Consolidation: self-organizing “MemScenes” + profile evolution
If MemCells are atoms, then MemScenes are the themes that keep an agent coherent.
In Semantic Consolidation, EverMemOS performs online incremental clustering:
When a new MemCell arrives, it compares the cell to existing MemScene centroids.
If similarity exceeds a threshold τ, the MemCell is assimilated; otherwise, a new MemScene is created.
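The assimilate-or-create rule can be sketched as threshold-based online clustering. The cosine metric, the running-mean centroid update, and the default τ value are assumptions for illustration; the paper only specifies the threshold rule itself.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class MemScene:
    """A theme cluster over MemCells, tracked by a running centroid."""
    def __init__(self, first_embedding: np.ndarray):
        self.centroid = first_embedding.copy()
        self.size = 1

    def assimilate(self, embedding: np.ndarray) -> None:
        # Incremental mean: shift the centroid toward the new member.
        self.size += 1
        self.centroid += (embedding - self.centroid) / self.size

def route_memcell(embedding: np.ndarray, scenes: list,
                  tau: float = 0.75):
    """Assimilate a new MemCell embedding into its best MemScene,
    or open a new scene if nothing is similar enough."""
    best = max(scenes, key=lambda s: cosine(embedding, s.centroid), default=None)
    if best is not None and cosine(embedding, best.centroid) >= tau:
        best.assimilate(embedding)
        return best
    scene = MemScene(embedding)
    scenes.append(scene)
    return scene
```

Because routing only touches centroids, the scene set grows and reorganizes online, without ever re-clustering the full memory store.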
Crucially, consolidation also drives profile evolution:
Instead of prompting over raw chat logs, EverMemOS updates a compact User Profile from scene summaries, helping separate stable traits from transient states and track conflicts over time.
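A minimal sketch of what such a profile might track, assuming a key-value shape with separate stores for stable traits and transient states (in the real system, updates would be distilled from scene summaries by an LLM, not set directly):

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Compact user model: stable traits vs. transient states, with conflicts logged."""
    stable_traits: dict = field(default_factory=dict)     # e.g., "diet": "vegetarian"
    transient_states: dict = field(default_factory=dict)  # e.g., "deadline": "Friday"
    conflicts: list = field(default_factory=list)         # (key, old, new) records

    def update(self, key: str, value, stable: bool = False) -> None:
        store = self.stable_traits if stable else self.transient_states
        if key in store and store[key] != value:
            # Record the contradiction instead of silently overwriting it.
            self.conflicts.append((key, store[key], value))
        store[key] = value
```

Keeping the conflict log explicit is what lets the system track preference changes over time rather than losing them to the latest overwrite.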
This is the part many “flat retrieval” memory systems miss: structured consolidation as a first-class system behavior.
Phase III — Reconstructive Recollection: “necessary and sufficient” context, not maximal recall
In EverMemOS, retrieval is not treated as a one-shot lookup. It’s modeled as an active reconstruction process guided by a principle of:
Necessity and sufficiency: retrieve only what’s needed to answer well—no more, no less.
At a high level, EverMemOS:
selects relevant MemScenes,
retrieves episodes (MemCells) using hybrid retrieval,
and uses iterative checks (e.g., sufficiency verification and query rewriting) to avoid both under-recall and "prompt bloat."
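The loop above can be sketched as an iterative retrieve-verify-rewrite cycle. The four callables are placeholders for components the system would supply (scene selection, hybrid retrieval, an LLM-based sufficiency judge, and a query rewriter); their names and the `max_rounds` cap are assumptions.

```python
from typing import Callable

def recollect(query: str,
              select_scenes: Callable[[str], list],
              retrieve_cells: Callable[[str, list], list],
              is_sufficient: Callable[[str, list], bool],
              rewrite_query: Callable[[str, list], str],
              max_rounds: int = 3) -> list:
    """Iteratively assemble a 'necessary and sufficient' context for `query`."""
    context: list = []
    for _ in range(max_rounds):
        scenes = select_scenes(query)
        new = [c for c in retrieve_cells(query, scenes) if c not in context]
        context.extend(new)
        if is_sufficient(query, context):
            break                               # stop early: no prompt bloat
        query = rewrite_query(query, context)   # refine to fill the gap
    return context
```

The early stop enforces sufficiency (no more context than needed), while the rewrite step fights under-recall when the first query misses a facet of the question.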
Results
EverMemOS has achieved State-of-the-Art (SOTA) results across four major long-term memory benchmarks:
LoCoMo: 93.05% overall accuracy, outperforming all existing memory systems and even full-context large models while using far fewer tokens.
LongMemEval: A leading 83.00% accuracy, with particularly strong gains in knowledge updates and temporal reasoning.
HaluMem: 90.04% recall, setting a new standard for memory integrity and accuracy.
PersonaMem v2: Superior performance in deep personalization and behavioral consistency across diverse scenarios.
Why this matters for real agents (beyond benchmarks)
Today’s benchmarks focus heavily on answer-level correctness. But real assistants must also handle:
conflicts between standing preferences and new constraints,
stable personalization,
time-bounded states (medications, deadlines, temporary plans),
and proactive, experience-grounded “foresight.”
EverMemOS explicitly builds memory representations (like time-valid Foresight) and system behaviors (semantic consolidation) to support these agent requirements, and illustrates them via qualitative case studies.
What’s next (and how to try it)
EverMemOS is designed as a system-level foundation: a memory OS that can be attached to different agent stacks and tasks, while keeping a consistent lifecycle contract for building and using memory.
Paper + Code:
Paper: https://arxiv.org/abs/2601.02163
Code: https://github.com/EverMind-AI/EverMemOS

