Kubernetes Etcd monitoring
Etcd acts as the single source of truth for your Kubernetes control plane. It stores all cluster data, including node information, ConfigMaps, secrets, and service discovery details. With Site24x7’s Kubernetes etcd monitoring, you can continuously track the performance and health of this distributed key-value store to maintain cluster stability and responsiveness.
Gain granular visibility into key etcd metrics, such as request latency, leader election events, database size, and snapshot durations. These insights help you:
- Identify performance bottlenecks and slow disk writes that can impact API server responsiveness.
- Detect leader instability, data consistency issues, and failed quorum states before they affect your cluster.
- Monitor disk usage and compaction activity to prevent storage bloat and ensure smooth write operations.
- Set proactive thresholds and alerts to avoid etcd downtime and maintain control plane availability.
Stay ahead of issues and ensure your Kubernetes cluster runs smoothly by maintaining a healthy, well-monitored etcd layer with Site24x7.
Supported versions
This feature is supported from Linux server monitoring agent version 21.0.0.
Control plane monitoring and the other latest features require you to upgrade your Kubernetes agent to the latest version.
If you haven't added a Kubernetes monitor yet, follow these steps to add one.
Etcd monitor
As soon as you upgrade your agent, the Site24x7 Kubernetes monitoring agent will fetch all the etcd metrics.
To navigate to your Etcd monitor:
- Log in to your Site24x7 account.
- Navigate to K8s, select the cluster, and then click Etcd. This opens the list of etcd monitors in that cluster. Click a monitor to view its detailed insights.
Supported metrics
Utilization
Metric | Description | Unit |
---|---|---|
Configurations | ||
Version | The current etcd server version | Text |
Role | The role of the etcd member in the cluster (leader or follower) | Text |
Server ID | The unique identifier for the etcd server instance | Text |
Leader Transition Events | ||
Leader Transition Events | The number of leadership changes in the etcd cluster | Count |
Etcd Proposals | ||
Proposals Committed | The number of proposals committed in the Raft log; indicates successful consensus writes | Count |
Proposals Applied | The number of committed proposals applied to the state machine; this should closely track Proposals Committed | Count |
Proposals Pending | The number of proposals waiting to be committed; high numbers suggest a bottleneck, such as high client load or a member that cannot commit proposals | Count |
Proposals Failed | The number of proposals that failed during commit or apply | Count |
Read Index Failures and Slow Applies | ||
Failed Read Indexes | Failed read index requests | Count |
Slow Apply Requests | Apply operations exceeding the latency threshold | Count |
Slow Read Indexes | Read index requests exceeding the latency threshold | Count |
Backend Storage Quota Size | ||
Backend Storage Quota Size | The configured maximum storage quota for the back-end database (DB) | Bytes |
Etcd Server Client Requests | ||
Etcd Server Client Requests | Total client requests handled by the server | Count |
Server Lease Expired | ||
Server Lease Expired | The number of expired leases detected | Count |
Health Checks | ||
Health Check Failures | The number of failed health checks | Count |
Heartbeat Send Failures | Failed heartbeat transmissions | Count |
Successful Health Checks | Successful health check responses | Count |
Go Usage | ||
Go Threads | The number of OS threads created by Go runtime | Count |
Go Routines | The number of active goroutines in the process | Count |
Process CPU Time | ||
Process CPU Time | The total CPU time consumed by the process | Seconds |
Process Memory Usage | ||
Process Resident Memory | Memory currently in use in RAM | Bytes |
Process Virtual Memory | Total virtual memory allocated | Bytes |
Open File Descriptors | ||
Process Open File Descriptors | The number of file descriptors currently open by the process; if file descriptors are exhausted, etcd may panic because it cannot create new write-ahead log (WAL) files | Count |
Maximum Open File Descriptors | The maximum file descriptors allowed for a process | Count |
OS File Descriptors | ||
File Descriptors Used | The current file descriptors used at the OS level | Count |
File Descriptors | The total available file descriptors at the OS level | Count |
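As a rough illustration of how a few of the utilization metrics relate to each other, the sketch below parses a small sample of Prometheus-style text and derives the gap between committed and applied proposals, plus the share of process file descriptors in use. The metric names are the ones etcd and the Go Prometheus client expose on the `/metrics` endpoint; their mapping to the Site24x7 metric names in the table is an assumption, and the sample values are made up.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into a {series: value} dict.

    Assumes one sample per line with no trailing timestamp; labelled
    series are keyed by their full 'name{labels}' string.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Made-up sample values, for illustration only.
SAMPLE = """\
# TYPE etcd_server_proposals_committed_total gauge
etcd_server_proposals_committed_total 120500
etcd_server_proposals_applied_total 120498
etcd_server_proposals_pending 3
process_open_fds 512
process_max_fds 1024
"""


def utilization_summary(samples):
    """Derive apply lag and file descriptor usage from parsed samples."""
    committed = samples["etcd_server_proposals_committed_total"]
    applied = samples["etcd_server_proposals_applied_total"]
    return {
        # Committed proposals not yet applied to the state machine.
        "apply_lag": committed - applied,
        "pending": samples["etcd_server_proposals_pending"],
        # Share of the per-process file descriptor limit in use.
        "fd_used_percent": 100.0 * samples["process_open_fds"] / samples["process_max_fds"],
    }


summary = utilization_summary(parse_metrics(SAMPLE))
print(summary)
```

A persistently growing apply lag or a file descriptor percentage approaching 100 are the kinds of conditions the Proposals and Open File Descriptors rows above are meant to surface.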
Peers
Metric | Description | Unit |
---|---|---|
Peer to Peer Check | ||
Peer Name | The name of the peer in cluster communication | Text |
Average P2P Round Trip Latency | The average latency between peers | Seconds |
P2P Round Trip Latency | The current round trip latency between peers | Seconds |
P2P Traffic In | Bytes received from the peer | Bytes |
P2P Traffic Out | Bytes sent to the peer | Bytes |
P2P Traffic In Failures | Failed inbound peer communications | Count |
P2P Traffic Out Failures | Failed outbound peer communications | Count |
Active P2P Connections | The number of active peer connections | Count |
Disconnected P2P | The number of disconnected peers | Count |
gRPC
Metric | Description | Unit |
---|---|---|
gRPC Proxy Cache | ||
gRPC Proxy Cache Hits | The number of successful cache hits in the gRPC proxy | Count |
gRPC Proxy Cache Keys | The number of keys currently in the gRPC proxy cache | Count |
gRPC Proxy Cache Misses | The number of cache misses in the gRPC proxy | Count |
gRPC Proxy Coalescing | ||
gRPC Proxy Events Coalescing | The number of coalesced events in the proxy | Count |
gRPC Proxy Watchers Coalescing | The number of watchers merged in the proxy | Count |
Network Utilization | ||
Network Client gRPC Received | Bytes received via gRPC client connections | Bytes |
Network Client gRPC Sent | Bytes sent via gRPC client connections | Bytes |
Disk and snapshot
Metric | Description | Unit |
---|---|---|
Average WAL Fsync Duration | ||
Average WAL Fsync Duration | The average time to fsync write-ahead logs | Seconds |
WAL Fsync | ||
WAL Fsync | The number of fsync operations for WAL | Count |
Average Disk Backend Commit Duration | ||
Average Disk Backend Commit Duration | The average time to commit back-end DB changes | Seconds |
Disk Backend Commit | ||
Disk Backend Commit | The number of back-end commit operations | Count |
Disk WAL Write Bytes | ||
Disk WAL Write Bytes | Bytes written to WAL | Bytes |
Snapshot Fsync and Save Duration | ||
Average Snapshot DB Fsync Duration | The average time to fsync a snapshot of the DB | Seconds |
Average Snapshot Save Duration | The average time to save a snapshot | Seconds |
Average Snapshot Fsync Duration | The average time to fsync a snapshot file | Seconds |
Snapshot Details | ||
Snapshot DB Fsync | The number of fsyncs for a snapshot of a DB; high values indicate disk latency or issues, potentially destabilizing the cluster | Count |
Snapshot Save | The number of snapshot save operations | Count |
Snapshot Fsync | The number of fsyncs for a snapshot file | Count |
Average Snap Save Marshalling Duration | ||
Average Snap Save Marshalling Duration | The average time to marshal data before saving a snapshot | Seconds |
Snap Save Marshalling | ||
Snap Save Marshalling | The number of marshalling operations for a snapshot save | Count |
Average Snap Save Total Duration | ||
Average Snap Save Total Duration | The average total time to save a snapshot (etcd_debugging namespace metrics) | Seconds |
Snap Save Total | ||
Snap Save Total | The total number of snapshot saves (etcd_debugging namespace metrics) | Count |
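Each "Average ... Duration" metric in this table is paired with a matching operation count. etcd exposes such durations as Prometheus histograms with `_sum` and `_count` series, so an average over an interval can be derived from two scrapes. The sketch below shows that calculation for the WAL fsync histogram; that Site24x7 computes its averages this way is an assumption, and the scrape values are invented.

```python
def average_duration(prev, curr, base):
    """Average seconds per operation between two scrapes of a histogram.

    prev and curr map series names to values; base is the histogram name
    without the _sum/_count suffix. Returns None when no operations
    happened in the interval or a counter was reset.
    """
    d_sum = curr[base + "_sum"] - prev[base + "_sum"]
    d_count = curr[base + "_count"] - prev[base + "_count"]
    if d_count <= 0 or d_sum < 0:
        return None
    return d_sum / d_count


BASE = "etcd_disk_wal_fsync_duration_seconds"

# Invented values from two consecutive scrapes.
prev = {BASE + "_sum": 4.0, BASE + "_count": 2000.0}
curr = {BASE + "_sum": 4.5, BASE + "_count": 2100.0}

avg = average_duration(prev, curr, BASE)
print(f"average WAL fsync duration: {avg * 1000:.1f} ms over the interval")
```

Using the difference between scrapes, rather than the lifetime `_sum / _count`, keeps the average responsive to recent disk behavior instead of smoothing it over the whole process uptime.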
MVCC and Store
Metric | Description | Unit |
---|---|---|
Events | ||
MVCC Events | The number of MVCC events generated | Count |
MVCC Pending Events | The number of pending MVCC events | Count |
MVCC Database | ||
MVCC Database Size | The total size of the MVCC database | Bytes |
MVCC Database Utilization | The percentage of allocated DB space used | Percent |
MVCC Keys and Range | ||
MVCC Keys | The number of keys stored in the MVCC DB | Count |
MVCC Range (Debug) | The count of range queries for debugging | Count |
Active Watchers and Streams | ||
MVCC Slow Watcher | The number of watchers exceeding the latency threshold | Count |
MVCC Watch Stream | The number of active watch streams | Count |
MVCC Watcher | The total number of watchers | Count |
MVCC Operations | ||
MVCC Delete | The number of delete operations | Count |
MVCC Put | The number of put operations | Count |
MVCC Transactions | The number of transaction operations | Count |
MVCC DB Compaction Keys | ||
MVCC DB Compaction Keys | The number of keys compacted in the DB | Count |
Store Operations | ||
Store Expires | The number of key expirations | Count |
Store Reads | The number of read operations | Count |
Store Writes | The number of write operations | Count |
Store Watch Requests | The number of watch requests | Count |
Store Watchers | The number of active watchers in the store | Count |
Database and Index Compaction Duration | ||
Average MVCC DB Compaction Pause Duration | The average pause duration during DB compaction | Seconds |
Average MVCC DB Compaction Duration | The average duration of DB compaction | Seconds |
Average MVCC Index Compaction Pause Duration | The average pause duration during index compaction | Seconds |
Database and Index Compaction | ||
MVCC DB Compaction Pause | The number of pauses during DB compaction | Count |
MVCC DB Compaction Total | Total DB compaction operations | Count |
MVCC Index Compaction Pause | The number of pauses during index compaction | Count |
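To make the database utilization figure concrete, the sketch below relates a database size to the backend storage quota and flags it against a warning threshold. It assumes utilization is the database size divided by the configured quota (in etcd terms, `etcd_mvcc_db_total_size_in_bytes` over `etcd_server_quota_backend_bytes`); Site24x7 may define the percentage differently, and the 80 percent threshold and the function names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def db_utilization_percent(db_size_bytes, quota_bytes):
    """Database size as a percentage of the backend storage quota."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    return 100.0 * db_size_bytes / quota_bytes


def needs_attention(db_size_bytes, quota_bytes, warn_percent=80.0):
    """True when utilization reaches the (hypothetical) warning threshold."""
    return db_utilization_percent(db_size_bytes, quota_bytes) >= warn_percent


quota = 2 * GIB  # example quota value
usage = db_utilization_percent(512 * MIB, quota)
print(f"utilization: {usage:.1f}%")
print("warn:", needs_attention(1800 * MIB, quota))
```

Tracking this ratio alongside the compaction metrics above gives early warning before the database reaches its quota.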