Kubernetes Etcd monitoring

Etcd acts as the single source of truth for your Kubernetes control plane. It stores all cluster data, including node information, ConfigMaps, secrets, and service discovery details. With Site24x7’s Kubernetes etcd monitoring, you can continuously track the performance and health of this distributed key-value store to maintain cluster stability and responsiveness.

Gain granular visibility into key etcd metrics, such as request latency, leader election events, database size, and snapshot durations. These insights help you:

Identify performance bottlenecks and slow disk writes that can impact API server responsiveness.
Detect leader instability, data consistency issues, and loss of quorum before they affect your cluster.
Monitor disk usage and compaction activity to prevent storage bloat and ensure smooth write operations.
Set proactive thresholds and alerts to avoid etcd downtime and maintain control plane availability.

Stay ahead of issues and ensure your Kubernetes cluster runs smoothly by maintaining a healthy, well-monitored etcd layer with Site24x7.

Supported versions

This feature is supported in Linux server monitoring agent version 21.0.0 and later.

To use control plane monitoring and other recent features, upgrade your Kubernetes agent to the latest version.

Note

If you haven't added a Kubernetes monitor yet, follow these steps to add one.

Etcd monitor

Once you upgrade your agent, the Site24x7 Kubernetes monitoring agent starts collecting etcd metrics.
To navigate to your Etcd monitor:

Log in to your Site24x7 account.
Navigate to K8s, select the cluster, and then select Etcd. This opens the list of etcd monitors in that cluster. Click a monitor to view its detailed insights.

Supported metrics

Utilization

Metric	Description	Unit
Configurations
Version	The current etcd server version	Text
Role	The role of the etcd member in the cluster (leader or follower)	Text
Server ID	The unique identifier for the etcd server instance	Text
Leader Transition Events
Leader Transition Events	The number of leadership changes in the etcd cluster	Count
Etcd Proposals
Proposals Committed	The number of proposals committed in the Raft log; indicates successful consensus writes	Count
Proposals Applied	The number of committed proposals applied to the state machine; tracks whether committed proposals are being applied	Count
Proposals Pending	The number of proposals waiting to be committed; a high number suggests a bottleneck, such as heavy client load or a member that cannot commit proposals	Count
Proposals Failed	The number of proposals that failed during commit or apply	Count
Read Index Failures and Slow Applies
Failed Read Indexes	Failed read index requests	Count
Slow Apply Requests	Apply operations exceeding the latency threshold	Count
Slow Read Indexes	Read index requests exceeding the latency threshold	Count
Backend Storage Quota Size
Backend Storage Quota Size	The configured maximum storage quota for the back-end database (DB)	Bytes
Etcd Server Client Requests
Etcd Server Client Requests	Total client requests handled by the server	Count
Server Lease Expired
Server Lease Expired	The number of expired leases detected	Count
Health Checks
Health Check Failures	The number of failed health checks	Count
Heartbeat Send Failures	Failed heartbeat transmissions	Count
Successful Health Checks	Successful health check responses	Count
Go Usage
Go Threads	The number of OS threads created by the Go runtime	Count
Go Routines	The number of active goroutines in the process	Count
Process CPU Time
Process CPU Time	The total CPU time consumed by the process	Seconds
Process Memory Usage
Process Resident Memory	Memory currently in use in RAM	Bytes
Process Virtual Memory	Total virtual memory allocated	Bytes
Open File Descriptors
Process Open File Descriptors	The number of file descriptors currently open by the process; if file descriptors are exhausted, etcd may panic because it cannot create new write-ahead log (WAL) files	Count
Maximum Open File Descriptors	The maximum file descriptors allowed for a process	Count
OS File Descriptors
File Descriptors Used	The current file descriptors used at the OS level	Count
File Descriptors	The total available file descriptors at the OS level	Count
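
Two checks can be derived from the raw values in the table above: the gap between committed and applied proposals, and the share of the process file descriptor limit in use. The following Python sketch illustrates both. The function names and sample readings are hypothetical and are not part of Site24x7 or etcd.

```python
def proposal_apply_lag(committed: int, applied: int) -> int:
    """Return how many committed proposals have not yet been applied."""
    # Applied normally trails committed only slightly. Clamp at zero in case
    # the two counters were sampled at slightly different moments.
    return max(committed - applied, 0)


def fd_utilization_percent(open_fds: int, max_fds: int) -> float:
    """Return the percentage of the process file descriptor limit in use."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return 100.0 * open_fds / max_fds


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(proposal_apply_lag(committed=1050, applied=1042))
    print(fd_utilization_percent(open_fds=512, max_fds=2048))
```

A lag that keeps growing over successive polls suggests that the member is overloaded, and descriptor utilization approaching 100 percent is worth investigating before etcd runs out of descriptors.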

Peers

Metric	Description	Unit
Peer to Peer Check
Peer Name	The name of the peer in cluster communication	Text
Average P2P Round Trip Latency	The average latency between peers	Seconds
P2P Round Trip Latency	The current round trip latency between peers	Seconds
P2P Traffic In	Bytes received from the peer	Bytes
P2P Traffic Out	Bytes sent to the peer	Bytes
P2P Traffic In Failures	Failed inbound peer communications	Count
P2P Traffic Out Failures	Failed outbound peer communications	Count
Active P2P Connections	The number of active peer connections	Count
Disconnected P2P	The number of disconnected peers	Count

gRPC

Metric	Description	Unit
gRPC Proxy Cache
gRPC Proxy Cache Hits	The number of successful cache hits in the gRPC proxy	Count
gRPC Proxy Cache Keys	The number of keys currently in the gRPC proxy cache	Count
gRPC Proxy Cache Misses	The number of cache misses in the gRPC proxy	Count
gRPC Proxy Coalescing
gRPC Proxy Events Coalescing	The number of coalesced events in the proxy	Count
gRPC Proxy Watchers Coalescing	The number of watchers merged in the proxy	Count
Network Utilization
Network Client gRPC Received	Bytes received via gRPC client connections	Bytes
Network Client gRPC Sent	Bytes sent via gRPC client connections	Bytes
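
The cache hit and miss counts above are often easier to interpret as a ratio. The following Python sketch is one way to derive it; the function name and sample values are hypothetical, and Site24x7 does not necessarily report this ratio itself.

```python
from typing import Optional


def cache_hit_ratio_percent(hits: int, misses: int) -> Optional[float]:
    """Return the gRPC proxy cache hit ratio as a percentage.

    Returns None when there have been no lookups, so that an idle proxy
    is not reported as a 0 percent hit ratio.
    """
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(cache_hit_ratio_percent(hits=900, misses=100))
```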

Disk and snapshot

Metric	Description	Unit
Average WAL Fsync Duration
Average WAL Fsync Duration	The average time to fsync write-ahead logs	Seconds
WAL Fsync
WAL Fsync	The number of fsync operations for WAL	Count
Average Disk Backend Commit Duration
Average Disk Backend Commit Duration	The average time to commit back-end DB changes	Seconds
Disk Backend Commit
Disk Backend Commit	The number of back-end commit operations	Count
Disk WAL Write Bytes
Disk WAL Write Bytes	Bytes written to WAL	Bytes
Snapshot Fsync and Save Duration
Average Snapshot DB Fsync Duration	The average time to fsync a snapshot of the DB	Seconds
Average Snapshot Save Duration	The average time to save a snapshot	Seconds
Average Snapshot Fsync Duration	The average time to fsync a snapshot file	Seconds
Snapshot Details
Snapshot DB Fsync	The number of fsyncs for a snapshot of a DB; high values indicate disk latency or issues, potentially destabilizing the cluster	Count
Snapshot Save	The number of snapshot save operations	Count
Snapshot Fsync	The number of fsyncs for a snapshot file	Count
Average Snap Save Marshalling Duration
Average Snap Save Marshalling Duration	The average time to marshal data before saving a snapshot	Seconds
Snap Save Marshalling
Snap Save Marshalling	The number of marshalling operations for a snapshot save	Count
Average Snap Save Total Duration
Average Snap Save Total Duration	The average total time to save a snapshot (etcd_debugging namespace metrics)	Seconds
Snap Save Total
Snap Save Total	The total number of snapshot saves (etcd_debugging namespace metrics)	Count
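
Upstream etcd exposes its duration metrics as Prometheus histograms, each with a `_sum` and a `_count` series. An average duration such as those listed above can be derived by dividing one by the other. The Python sketch below assumes that derivation; how Site24x7 computes its averages internally is not stated here. The parser is deliberately minimal (it assumes no timestamps on sample lines), and the sample payload is made up.

```python
from typing import Dict, Optional


def parse_samples(text: str) -> Dict[str, float]:
    """Parse 'name value' lines from Prometheus text output, skipping comments."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so that label sets stay part of the name.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def average_duration(samples: Dict[str, float], metric: str) -> Optional[float]:
    """Return sum/count for a histogram, or None if nothing was observed."""
    count = samples.get(f"{metric}_count", 0.0)
    if count == 0:
        return None
    return samples[f"{metric}_sum"] / count


# Made-up payload, for illustration only.
SAMPLE = """\
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_sum 12.5
etcd_disk_wal_fsync_duration_seconds_count 5000
"""

if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    print(average_duration(parsed, "etcd_disk_wal_fsync_duration_seconds"))
```

Because `_sum` and `_count` are cumulative, this yields the average since the process started. To get the average over a polling interval, apply the same division to the differences between two successive readings.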

MVCC and Store

Metric	Description	Unit
Events
MVCC Events	The number of MVCC events generated	Count
MVCC Pending Events	The number of pending MVCC events	Count
MVCC Database
MVCC Database Size	The total size of the MVCC database	Bytes
MVCC Database Utilization	The percentage of allocated DB space used	Percent
MVCC Keys and Range
MVCC Keys	The number of keys stored in the MVCC DB	Count
MVCC Range (Debug)	The count of range queries for debugging	Count
Active Watchers and Streams
MVCC Slow Watcher	The number of watchers exceeding the latency threshold	Count
MVCC Watch Stream	The number of active watch streams	Count
MVCC Watcher	The total number of watchers	Count
MVCC Operations
MVCC Delete	The number of delete operations	Count
MVCC Put	The number of put operations	Count
MVCC Transactions	The number of transaction operations	Count
MVCC DB Compaction Keys
MVCC DB Compaction Keys	The number of keys compacted in the DB	Count
Store Operations
Store Expires	The number of key expirations	Count
Store Reads	The number of read operations	Count
Store Writes	The number of write operations	Count
Store Watch Requests	The number of watch requests	Count
Store Watchers	The number of active watchers in the store	Count
Database and Index Compaction Duration
Average MVCC DB Compaction Pause Duration	The average pause duration during DB compaction	Seconds
Average MVCC DB Compaction Duration	The average duration of DB compaction	Seconds
Average MVCC Index Compaction Pause Duration	The average pause duration during index compaction	Seconds
Database and Index Compaction
MVCC DB Compaction Pause	The number of pauses during DB compaction	Count
MVCC DB Compaction Total	Total DB compaction operations	Count
MVCC Index Compaction Pause	The number of pauses during index compaction	Count
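
MVCC Database Utilization is described above as the percentage of allocated DB space used. One plausible reading is the in-use size divided by the total allocated size. Upstream etcd reports these as `etcd_mvcc_db_total_size_in_use_in_bytes` and `etcd_mvcc_db_total_size_in_bytes`. The Python sketch below assumes that reading and also compares the total size against the back-end quota. The helper name and sample values are hypothetical.

```python
def percent_of(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical readings, for illustration only.
    db_in_use = 1536 * MIB   # logically used space
    db_total = 2048 * MIB    # space allocated on disk
    quota = 8192 * MIB       # configured back-end storage quota

    # Low utilization after compaction suggests free space that
    # defragmentation could reclaim.
    print(percent_of(db_in_use, db_total))
    # A total size close to the quota means writes may soon be rejected.
    print(percent_of(db_total, quota))
```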
