
Storage, Consistency, and Resiliency

Storage model

  • Every table is sharded across the cluster.
  • Optional partitioning splits table data further by partition keys (for lifecycle and pruning).
  • Each shard persists as a Lucene-backed index composed of immutable segments.
  • Writes are first recorded in the translog (write-ahead journal) and then materialized into new segments during refresh cycles.
  • Background merge policies compact smaller segments into larger ones to maintain query efficiency.
  • Replicas store independent shard copies for fault tolerance and read scalability.
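
The storage-model knobs above can be sketched in DDL. This is a minimal illustration, not a verbatim MonkDB reference: the table and column names are invented, and the clause syntax assumes CrateDB-style DDL.

```sql
-- Illustrative table: sharded, partitioned by day, with one replica.
CREATE TABLE doc.events (
    id TEXT,
    day TIMESTAMP,
    payload OBJECT
) CLUSTERED INTO 6 SHARDS        -- shard count per partition
  PARTITIONED BY (day)           -- enables cheap lifecycle pruning by day
  WITH (number_of_replicas = 1); -- one independent copy of each shard
```

Each partition gets its own set of shards, so dropping a partition removes whole shards rather than individual rows.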

Write path

Write-path notes:

  • Primary shard validates and applies writes first.
  • Replication can operate in synchronous or quorum acknowledgement modes.
  • The write is considered successful once the configured number of shard copies (primary + replicas) confirm persistence.
  • Under backpressure/failures, retries and recovery paths preserve consistency semantics at shard level.
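
The acknowledgement threshold in the notes above is typically a table setting. A sketch, assuming CrateDB-style setting names (verify the exact key in the MonkDB reference):

```sql
-- Require all shard copies (primary + replicas) to acknowledge a write
-- before it is reported successful. Setting name is an assumption
-- based on CrateDB-style table settings.
ALTER TABLE doc.orders SET ("write.wait_for_active_shards" = 'all');
```

Lowering the threshold (e.g. to 1) trades durability guarantees for write availability when replicas are temporarily unassigned.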

Read path types

  • Primary-key direct lookup path: immediate visibility semantics for key-based fetch/update loops.
  • Search-style path (secondary attributes/full scans): near-real-time visibility with refresh cycles.

Use REFRESH TABLE when deterministic immediate search visibility is required.
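
The two visibility paths can be contrasted in a short sketch (table name and values are illustrative):

```sql
INSERT INTO doc.orders (id, amount) VALUES ('o-2', 75.0);

-- Primary-key lookup: immediately visible, no refresh required.
SELECT amount FROM doc.orders WHERE id = 'o-2';

-- Search-style filter on a non-key column: only guaranteed visible
-- after the next refresh cycle (or an explicit REFRESH TABLE).
SELECT id FROM doc.orders WHERE amount > 50.0;
```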

Read/visibility decision table

Requirement → recommended pattern (tradeoff):

  • Immediate key-based verification after a write → primary-key lookup path (works for key access, not broad search scans).
  • Immediate search visibility for a user-facing query → REFRESH TABLE after the write batch (extra refresh overhead if overused).
  • Maximum ingest throughput → rely on the natural refresh cycle (search visibility is near-real-time, not immediate).

Consistency model in practice

MonkDB does not provide multi-statement ACID transactions across arbitrary rows/tables.

Operationally important guarantees:

  • Atomicity at the level of a single row/document write.
  • Durability via the translog plus shard replication.
  • Near-real-time consistency for search readers.
  • Stronger, immediate visibility for point lookups by key.

MVCC and optimistic concurrency

MonkDB uses MVCC-style versioned document semantics suitable for optimistic concurrency patterns.

Typical pattern:

  1. Read current row version/state.
  2. Write with expected version constraints (application/API path).
  3. Retry if conflict occurs.
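
One way to express this pattern in SQL, assuming CrateDB-style `_seq_no` and `_primary_term` system columns (verify the exact column names in the MonkDB reference; the version values shown are illustrative):

```sql
-- 1. Read the current state together with its version metadata.
SELECT amount, _seq_no, _primary_term
FROM doc.orders WHERE id = 'o-1';

-- 2. Conditional write: succeeds only if the row is unchanged since
--    the read (illustrative version values from step 1).
UPDATE doc.orders
SET amount = 130.0
WHERE id = 'o-1' AND _seq_no = 4 AND _primary_term = 1;

-- 3. If the UPDATE reports zero affected rows, a concurrent writer
--    won the race: re-read and retry.
```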

Refresh and visibility model

A refresh makes newly indexed segments visible to search readers without requiring a full disk flush. For deterministic user flows:

  • Trigger REFRESH TABLE <table> after critical writes when immediate search visibility is required.
  • Keep refresh usage scoped to avoid excessive refresh overhead.

Example:

INSERT INTO doc.orders (id, amount) VALUES ('o-1', 120.0);
REFRESH TABLE doc.orders;
SELECT * FROM doc.orders WHERE id = 'o-1';

Resiliency mechanics

  • Replica promotion on primary loss.
  • Automatic shard recovery and relocation after node events.
  • Cluster coordination node election using quorum-based consensus.

Failure recovery flow

Recovery and allocation diagnostics

Use system tables to track recovery health:

SELECT table_name, id, routing_state, state,
       recovery['stage'], recovery['size']['percent']
FROM sys.shards
WHERE routing_state IN ('INITIALIZING', 'RELOCATING')
ORDER BY table_name, id;

SELECT table_name, shard_id, node_id, explanation
FROM sys.allocations
WHERE explanation IS NOT NULL
ORDER BY table_name, shard_id;

Data lifecycle controls

  • Partitioned retention: drop old partitions cheaply.
  • Tiering strategy: hot/warm/cold storage placement via operational shard moves.
  • Snapshot/restore for disaster recovery and rollback workflows.
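
Partitioned retention in practice: deleting by the full partition key drops the partition's underlying shards outright instead of marking individual rows as deleted, which is why it is cheap. A sketch, assuming CrateDB-style partitioned tables and the illustrative doc.events table:

```sql
-- Removes the whole 2024-01-01 partition (its shards are dropped),
-- rather than row-by-row deletion.
DELETE FROM doc.events WHERE day = '2024-01-01';
```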

Shard distribution and replica placement

Backup and disaster recovery posture

Recommended baseline:

  1. Create repository early (CREATE REPOSITORY).
  2. Schedule periodic snapshots (CREATE SNAPSHOT).
  3. Test restore workflows (RESTORE SNAPSHOT) in staging.
  4. Monitor snapshot state via sys.snapshots and repository metadata via sys.repositories.
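
The baseline above can be sketched end to end. Repository name, location, snapshot naming, and the fs repository type are illustrative assumptions based on CrateDB-style syntax; check the MonkDB reference for supported repository types and parameters.

```sql
-- 1. Register a filesystem-backed repository (path is illustrative).
CREATE REPOSITORY backups TYPE fs WITH (location = '/mnt/backups');

-- 2. Take a snapshot of selected tables and wait for it to finish.
CREATE SNAPSHOT backups.nightly_orders TABLE doc.orders
    WITH (wait_for_completion = true);

-- 3. In staging, restore the snapshot and verify the data.
RESTORE SNAPSHOT backups.nightly_orders TABLE doc.orders
    WITH (wait_for_completion = true);

-- 4. Monitor snapshot and repository state.
SELECT name, state FROM sys.snapshots;
SELECT name, type FROM sys.repositories;
```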