MonkDB Documentation

Monitoring and Observability

A production MonkDB setup should combine:

SQL-native observability (sys.*, information_schema.*)
JVM and breaker telemetry (JMX)
Time-series dashboards and alerts (Prometheus + Grafana)

Layer 1: SQL-native observability

Health status

SELECT description, severity
FROM sys.checks
WHERE NOT passed
ORDER BY severity DESC;
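A scheduled health probe can turn the rows returned by this query into a monitoring exit code. The following is a minimal sketch, not part of MonkDB: the function name `checks_exit_code`, the default cutoff of 3 for a critical severity, and the 0/1/2 exit codes are all assumptions chosen for illustration. It works on rows already fetched by whichever SQL client you use.

```python
from typing import Iterable, Tuple


def checks_exit_code(
    rows: Iterable[Tuple[str, int]], critical_severity: int = 3
) -> int:
    """Map failed sys.checks rows to a probe exit code.

    rows: (description, severity) tuples for checks that did not pass.
    critical_severity: assumed cutoff; adjust it to your severity scale.
    Returns 0 (no failed checks), 1 (failures below the cutoff),
    or 2 (at least one failure at or above the cutoff).
    """
    severities = [severity for _description, severity in rows]
    if not severities:
        return 0
    if max(severities) >= critical_severity:
        return 2
    return 1
```

Returning a small integer keeps the helper usable from cron jobs or container health checks, which usually only inspect the process exit status.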

Active workload

SELECT id, stmt, started FROM sys.jobs;

SELECT node['name'], job_id, name, used_bytes, started
FROM sys.operations
ORDER BY used_bytes DESC;
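The `sys.jobs` rows above can be filtered client-side to surface statements that have been running longer than expected. This is a hedged sketch: `long_running_jobs` is a hypothetical helper, and it assumes your client returns `started` as a timezone-aware `datetime` (some clients return epoch milliseconds instead, which would need converting first). The sample rows are hand-written, not real MonkDB output.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

JobRow = Tuple[str, str, datetime]  # (id, stmt, started)


def long_running_jobs(
    rows: Iterable[JobRow], now: datetime, max_age: timedelta
) -> List[JobRow]:
    """Return jobs that started more than max_age before now, oldest first."""
    slow = [row for row in rows if now - row[2] > max_age]
    return sorted(slow, key=lambda row: row[2])


# Hand-written example rows for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    ("job-a", "SELECT 1", now - timedelta(seconds=5)),
    ("job-b", "COPY t FROM ...", now - timedelta(minutes=10)),
    ("job-c", "SELECT * FROM big", now - timedelta(minutes=30)),
]
print([job[0] for job in long_running_jobs(rows, now, timedelta(minutes=1))])
# prints ['job-c', 'job-b']
```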

Node pressure

SELECT name,
       load['1'] AS load_1m,
       mem['used_percent'] AS mem_used_pct,
       heap['used'] AS heap_used,
       fs['total'],
       fs['used']
FROM sys.nodes
ORDER BY name;

Shard state

SELECT table_name, id, routing_state, state,
       recovery['stage'], recovery['size']['percent']
FROM sys.shards
ORDER BY table_name, id;
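For dashboards or alerts, the per-shard rows are usually reduced to a count per state plus a list of shards that need attention. The sketch below assumes the column order of the query above and treats `STARTED` as the healthy value for both `routing_state` and `state`; that value and the helper name `summarize_shards` are assumptions, so verify them against the values your cluster actually reports.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# (table_name, id, routing_state, state, recovery_stage, recovery_percent)
ShardRow = Tuple[str, int, str, str, Optional[str], Optional[float]]


def summarize_shards(
    rows: Iterable[ShardRow], healthy: str = "STARTED"
) -> Tuple[Counter, List[ShardRow]]:
    """Count shards per (routing_state, state) and list the unhealthy ones.

    A shard is considered unhealthy when either its routing_state or its
    state differs from the assumed healthy value.
    """
    rows = list(rows)
    counts = Counter((row[2], row[3]) for row in rows)
    unhealthy = [row for row in rows if row[2] != healthy or row[3] != healthy]
    return counts, unhealthy
```

The counts map naturally onto a gauge per state, while the unhealthy list carries the recovery stage and percentage needed to tell a slow recovery from a stuck one.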

Layer 2: JMX instrumentation

Track at minimum:

JVM heap/non-heap usage
GC pause and throughput
Thread pools
Circuit breaker utilization/trips
Request/operation latency trends

Layer 3: Prometheus + Grafana

Recommended pipeline:

Run a JMX exporter as a sidecar or agent on each node.
Add each MonkDB node's exporter endpoint as a Prometheus scrape target.
Build Grafana dashboards with SLO-focused panels.

Recommended dashboards:

Cluster health and node pressure
Query throughput, p95/p99 latency
Breaker headroom/trip rates
Shard relocation/recovery activity
Governance/audit/lineage sink metrics

Governance and sink metrics

When these features are enabled, sample their metrics tables:

SELECT * FROM sys.policy_audit_sink_metrics LIMIT 1;
SELECT * FROM sys.governance_contract_metrics LIMIT 1;
SELECT * FROM sys.lineage_sink_metrics LIMIT 1;

Alerting baselines

Sustained high heap usage
Rising circuit breaker trip rate
Growing queue depth in sink metrics
Stuck shard relocation or slow recovery
High ratio of failed queries
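Several of these baselines reduce to three small calculations: a threshold that must hold for several consecutive samples, a rate derived from a counter, and a failure ratio. The sketch below illustrates that logic in plain Python under stated assumptions: the function names are hypothetical, and the thresholds are yours to choose. In a Prometheus deployment the same conditions would normally be written as alert rules rather than client code.

```python
from typing import Sequence


def sustained_above(
    samples: Sequence[float], threshold: float, min_consecutive: int
) -> bool:
    """True if the most recent min_consecutive samples all exceed threshold."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def counter_rate(previous: float, current: float, interval_seconds: float) -> float:
    """Per-second rate between two readings of an increasing counter.

    A lower current value is treated as a counter reset (for example after
    a node restart), so the current value itself is taken as the increase.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    increase = current - previous if current >= previous else current
    return increase / interval_seconds


def failure_ratio(failed: int, total: int) -> float:
    """Fraction of failed queries; 0.0 when there were no queries at all."""
    return failed / total if total else 0.0
```

For example, `sustained_above(heap_fractions, 0.85, 5)` would fire only after five consecutive heap samples above 85 percent, which avoids alerting on a single spike before garbage collection.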