Monitoring and Observability
A production MonkDB setup should combine:
- SQL-native observability (
sys.*,information_schema.*) - JVM and breaker telemetry (JMX)
- Time-series dashboards and alerts (Prometheus + Grafana)
Layer 1: SQL-native observability
Health status
SELECT description, severity
FROM sys.checks
WHERE NOT passed
ORDER BY severity DESC;
Active workload
SELECT id, stmt, started FROM sys.jobs;
SELECT node['name'], job_id, name, used_bytes, started
FROM sys.operations
ORDER BY used_bytes DESC;
Node pressure
SELECT name,
load['1'] AS load_1m,
mem['used_percent'] AS mem_used_pct,
heap['used'] AS heap_used,
fs['total'],
fs['used']
FROM sys.nodes
ORDER BY name;
Shard state
SELECT table_name, id, routing_state, state,
recovery['stage'], recovery['size']['percent']
FROM sys.shards
ORDER BY table_name, id;
Layer 2: JMX instrumentation
Track at minimum:
- JVM heap/non-heap usage
- GC pause and throughput
- thread pools
- circuit breaker utilization/trips
- request/operation latency trends
Layer 3: Prometheus + Grafana
Recommended pipeline:
- JMX exporter sidecar/agent per node.
- Prometheus scrape targets for MonkDB nodes.
- Grafana dashboards with SLO-focused panels.
Recommended dashboards:
- Cluster health and node pressure
- Query throughput, p95/p99 latency
- Breaker headroom/trip rates
- Shard relocation/recovery activity
- Governance/audit/lineage sink metrics
Governance and sink metrics
When enabled:
SELECT * FROM sys.policy_audit_sink_metrics LIMIT 1;
SELECT * FROM sys.governance_contract_metrics LIMIT 1;
SELECT * FROM sys.lineage_sink_metrics LIMIT 1;
Alerting baselines
- Sustained high heap usage
- Breaker trip rate increase
- Growing queue depth in sink metrics
- Relocation stuck/slow recovery
- High ratio of failed queries