Kubernetes Etcd monitoring   

Etcd acts as the single source of truth for your Kubernetes control plane. It stores all cluster data, including node information, ConfigMaps, secrets, and service discovery details. With Site24x7’s Kubernetes etcd monitoring, you can continuously track the performance and health of this distributed key-value store to maintain cluster stability and responsiveness.

Gain granular visibility into key etcd metrics, such as request latency, leader election events, database size, and snapshot durations. These insights help you:

  • Identify performance bottlenecks and slow disk writes that can impact API server responsiveness.

  • Detect leader instability, data consistency issues, and failed quorum states before they affect your cluster.

  • Monitor disk usage and compaction activity to prevent storage bloat and ensure smooth write operations.

  • Set proactive thresholds and alerts to avoid etcd downtime and maintain control plane availability.

Stay ahead of issues and ensure your Kubernetes cluster runs smoothly by maintaining a healthy, well-monitored etcd layer with Site24x7.

Supported versions

This feature is supported from Linux server monitoring agent version 21.0.0.

Control plane monitoring and other recently added features require the latest version of the Kubernetes agent.

Note

If you haven't added a Kubernetes monitor yet, follow these steps to add one.

Etcd monitor   

As soon as you upgrade your agent, the Site24x7 Kubernetes monitoring agent will fetch all the etcd metrics. 
To navigate to your Etcd monitor:

  1. Log in to your Site24x7 account.
  2. Navigate to K8s > select the Cluster > Etcd. This opens the list of etcd monitors in that cluster. Click a monitor to view its detailed insights.

Supported metrics  

Utilization

| Metric | Description | Unit |
|---|---|---|
| Configurations | | |
| Version | The current etcd server version | Text |
| Role | The role of the etcd member in the cluster (leader or follower) | Text |
| Server ID | The unique identifier for the etcd server instance | Text |
| Leader Transition Events | | |
| Leader Transition Events | The number of leadership changes in the etcd cluster | Count |
| Etcd Proposals | | |
| Proposals Committed | The number of proposals committed in the Raft log; indicates successful consensus writes | Count |
| Proposals Applied | The number of committed proposals applied to the state machine; confirms that committed proposals are being applied | Count |
| Proposals Pending | The number of proposals waiting to be committed; high numbers suggest bottlenecks, such as high client load or a member that cannot commit proposals | Count |
| Proposals Failed | The number of proposals that failed during commit or apply | Count |
| Read Index Failures and Slow Applies | | |
| Failed Read Indexes | The number of failed read index requests | Count |
| Slow Apply Requests | The number of apply operations exceeding the latency threshold | Count |
| Slow Read Indexes | The number of read index requests exceeding the latency threshold | Count |
| Backend Storage Quota Size | | |
| Backend Storage Quota Size | The configured maximum storage quota for the back-end database (DB) | Bytes |
| Etcd Server Client Requests | | |
| Etcd Server Client Requests | The total number of client requests handled by the server | Count |
| Server Lease Expired | | |
| Server Lease Expired | The number of expired leases detected | Count |
| Health Checks | | |
| Health Check Failures | The number of failed health checks | Count |
| Heartbeat Send Failures | The number of failed heartbeat transmissions | Count |
| Successful Health Checks | The number of successful health check responses | Count |
| Go Usage | | |
| Go Threads | The number of OS threads created by the Go runtime | Count |
| Go Routines | The number of active goroutines in the process | Count |
| Process CPU Time | | |
| Process CPU Time | The total CPU time consumed by the process | Seconds |
| Process Memory Usage | | |
| Process Resident Memory | The memory currently resident in RAM | Bytes |
| Process Virtual Memory | The total virtual memory allocated | Bytes |
| Open File Descriptors | | |
| Process Open File Descriptors | The number of file descriptors opened by the process; if the file descriptors are exhausted, etcd may panic because it cannot create new write-ahead logging (WAL) files | Count |
| Maximum Open File Descriptors | The maximum number of file descriptors allowed for a process | Count |
| OS File Descriptors | | |
| File Descriptors Used | The number of file descriptors currently used at the OS level | Count |
| File Descriptors | The total number of file descriptors available at the OS level | Count |
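
To make the relationship between a few of these metrics concrete, the following is a minimal sketch in Python. It assumes the values come from etcd's Prometheus-format /metrics output and that the display names above correspond to the etcd metric names used in the sample; that mapping, the helper `parse_metrics`, and all sample values are illustrative assumptions, not a description of how the Site24x7 agent works internally.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into a {name: value} dict.

    Simplification: assumes each sample line is "<name> <value>" with
    no trailing timestamp, and skips comment lines starting with "#".
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Made-up sample; the metric names are assumed to back the table rows.
SAMPLE = """\
# TYPE etcd_server_proposals_committed_total gauge
etcd_server_proposals_committed_total 1200
etcd_server_proposals_applied_total 1195
etcd_server_proposals_pending 3
process_open_fds 120
process_max_fds 65536
"""

m = parse_metrics(SAMPLE)

# Gap between committed and applied proposals; a gap that keeps growing
# would suggest the member is falling behind on applying writes.
apply_lag = (m["etcd_server_proposals_committed_total"]
             - m["etcd_server_proposals_applied_total"])

# Share of the process file descriptor limit currently in use.
fd_used_pct = 100.0 * m["process_open_fds"] / m["process_max_fds"]

print(f"apply lag: {apply_lag:.0f} proposals")
print(f"file descriptors used: {fd_used_pct:.2f}%")
```

Reading Proposals Committed and Proposals Applied together, rather than in isolation, is the point of the sketch: the absolute counts matter less than whether the gap between them stays small.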

Peers

| Metric | Description | Unit |
|---|---|---|
| Peer to Peer Check | | |
| Peer Name | The name of the peer in cluster communication | Text |
| Average P2P Round Trip Latency | The average latency between peers | Seconds |
| P2P Round Trip Latency | The current round trip latency between peers | Seconds |
| P2P Traffic In | Bytes received from the peer | Bytes |
| P2P Traffic Out | Bytes sent to the peer | Bytes |
| P2P Traffic In Failures | Failed inbound peer communications | Count |
| P2P Traffic Out Failures | Failed outbound peer communications | Count |
| Active P2P Connections | The number of active peer connections | Count |
| Disconnected P2P | The number of disconnected peers | Count |

gRPC

| Metric | Description | Unit |
|---|---|---|
| gRPC Proxy Cache | | |
| gRPC Proxy Cache Hits | The number of successful cache hits in the gRPC proxy | Count |
| gRPC Proxy Cache Keys | The number of keys currently in the gRPC proxy cache | Count |
| gRPC Proxy Cache Misses | The number of cache misses in the gRPC proxy | Count |
| gRPC Proxy Coalescing | | |
| gRPC Proxy Events Coalescing | The number of coalesced events in the proxy | Count |
| gRPC Proxy Watchers Coalescing | The number of watchers merged in the proxy | Count |
| Network Utilization | | |
| Network Client gRPC Received | Bytes received via gRPC client connections | Bytes |
| Network Client gRPC Sent | Bytes sent via gRPC client connections | Bytes |

Disk and snapshot

| Metric | Description | Unit |
|---|---|---|
| Average WAL Fsync Duration | | |
| Average WAL Fsync Duration | The average time to fsync write-ahead logs | Seconds |
| WAL Fsync | | |
| WAL Fsync | The number of fsync operations for WAL | Count |
| Average Disk Backend Commit Duration | | |
| Average Disk Backend Commit Duration | The average time to commit back-end DB changes | Seconds |
| Disk Backend Commit | | |
| Disk Backend Commit | The number of back-end commit operations | Count |
| Disk WAL Write Bytes | | |
| Disk WAL Write Bytes | Bytes written to WAL | Bytes |
| Snapshot Fsync and Save Duration | | |
| Average Snapshot DB Fsync Duration | The average time to fsync a snapshot of the DB | Seconds |
| Average Snapshot Save Duration | The average time to save a snapshot | Seconds |
| Average Snapshot Fsync Duration | The average time to fsync a snapshot file | Seconds |
| Snapshot Details | | |
| Snapshot DB Fsync | The number of fsyncs for a snapshot of a DB; high values indicate disk latency or issues, potentially destabilizing the cluster | Count |
| Snapshot Save | The number of snapshot save operations | Count |
| Snapshot Fsync | The number of fsyncs for a snapshot file | Count |
| Average Snap Save Marshalling Duration | | |
| Average Snap Save Marshalling Duration | The average time to marshal data before saving a snapshot | Seconds |
| Snap Save Marshalling | | |
| Snap Save Marshalling | The number of marshalling operations for a snapshot save | Count |
| Average Snap Save Total Duration | | |
| Average Snap Save Total Duration | The average total time to save a snapshot (etcd_debugging namespace metrics) | Seconds |
| Snap Save Total | | |
| Snap Save Total | The total number of snapshot saves (etcd_debugging namespace metrics) | Count |
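
The "Average ... Duration" and matching count rows above pair naturally: etcd publishes its disk timings as Prometheus histograms, which carry a cumulative `_sum` (total seconds) and `_count` (number of operations). The sketch below shows one plausible way to turn two consecutive scrapes into an average duration for the interval between them. Whether Site24x7 computes its averages this way is an assumption, the helper `average_duration` is hypothetical, and the sample numbers are made up.

```python
def average_duration(prev, curr, metric):
    """Average seconds per operation between two scrapes of a histogram.

    prev and curr map metric names to cumulative values. Returns None
    when no operations happened in the interval or a counter was reset.
    """
    d_sum = curr[f"{metric}_sum"] - prev[f"{metric}_sum"]
    d_count = curr[f"{metric}_count"] - prev[f"{metric}_count"]
    if d_count <= 0:
        return None
    return d_sum / d_count


# etcd's WAL fsync histogram; the values below are invented.
METRIC = "etcd_disk_wal_fsync_duration_seconds"
prev = {f"{METRIC}_sum": 12.0, f"{METRIC}_count": 5000}
curr = {f"{METRIC}_sum": 12.5, f"{METRIC}_count": 5200}

# 0.5 seconds spent on 200 fsyncs during the interval.
avg_wal_fsync = average_duration(prev, curr, METRIC)
print(f"average WAL fsync: {avg_wal_fsync * 1000:.1f} ms")
```

Working from the difference between scrapes, rather than from the lifetime totals, keeps the average sensitive to recent disk slowdowns instead of diluting them across the whole process uptime.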

MVCC and Store

| Metric | Description | Unit |
|---|---|---|
| Events | | |
| MVCC Events | The number of MVCC events generated | Count |
| MVCC Pending Events | The number of pending MVCC events | Count |
| MVCC Database | | |
| MVCC Database Size | The total size of the MVCC database | Bytes |
| MVCC Database Utilization | The percentage of allocated DB space used | Percent |
| MVCC Keys and Range | | |
| MVCC Keys | The number of keys stored in the MVCC DB | Count |
| MVCC Range (Debug) | The count of range queries for debugging | Count |
| Active Watchers and Streams | | |
| MVCC Slow Watcher | The number of watchers exceeding the latency threshold | Count |
| MVCC Watch Stream | The number of active watch streams | Count |
| MVCC Watcher | The total number of watchers | Count |
| MVCC Operations | | |
| MVCC Delete | The number of delete operations | Count |
| MVCC Put | The number of put operations | Count |
| MVCC Transactions | The number of transaction operations | Count |
| MVCC DB Compaction Keys | | |
| MVCC DB Compaction Keys | The number of keys compacted in the DB | Count |
| Store Operations | | |
| Store Expires | The number of key expirations | Count |
| Store Reads | The number of read operations | Count |
| Store Writes | The number of write operations | Count |
| Store Watch Requests | The number of watch requests | Count |
| Store Watchers | The number of active watchers in the store | Count |
| Database and Index Compaction Duration | | |
| Average MVCC DB Compaction Pause Duration | The average pause duration during DB compaction | Seconds |
| Average MVCC DB Compaction Duration | The average duration of DB compaction | Seconds |
| Average MVCC Index Compaction Pause Duration | The average pause duration during index compaction | Seconds |
| Database and Index Compaction | | |
| MVCC DB Compaction Pause | The number of pauses during DB compaction | Count |
| MVCC DB Compaction Total | The total number of DB compaction operations | Count |
| MVCC Index Compaction Pause | The number of pauses during index compaction | Count |
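
As a worked example of the database size figures above, the sketch below derives a utilization percentage from the allocated and in-use DB sizes and compares the allocated size with the backend storage quota. It assumes MVCC Database Utilization is the in-use size divided by the allocated size, and that the inputs correspond to the etcd metrics named in the comments; both are assumptions for illustration, and the byte values are invented.

```python
def percent(part, whole):
    """Return part as a percentage of whole, or None if whole is not positive."""
    if whole <= 0:
        return None
    return 100.0 * part / whole


# Made-up sample values in bytes.
db_total = 1_073_741_824   # assumed source: etcd_mvcc_db_total_size_in_bytes
db_in_use = 536_870_912    # assumed source: etcd_mvcc_db_total_size_in_use_in_bytes
quota = 2_147_483_648      # assumed source: etcd_server_quota_backend_bytes

# Share of the allocated DB file that holds live data.
db_utilization = percent(db_in_use, db_total)

# How close the allocated DB file is to the configured quota.
quota_used = percent(db_total, quota)

# Allocated but unused space, which defragmentation could return to the OS.
reclaimable = db_total - db_in_use

print(f"DB utilization: {db_utilization:.1f}%")
print(f"quota used: {quota_used:.1f}%")
print(f"reclaimable: {reclaimable} bytes")
```

A low utilization percentage alongside a high quota percentage would point toward compaction having freed keys without the file itself shrinking, which is the situation where defragmentation tends to help.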
