Key metrics for monitoring Amazon EC2: Site24x7

Amazon Elastic Compute Cloud (EC2) on the Amazon Web Services (AWS) platform provides resizable computing capacity to help you run and scale business applications in the cloud. EC2 lets users provision resources in the form of virtual servers called instances.

Different instance types offer different capacities for CPU, memory, storage, and networking. With EC2, users can adjust resource capacity quickly and launch instances in specific locations to match regional demand.

Basic infrastructure-level metrics are collected by querying the CloudWatch API at the configured polling interval. EC2 IT automations can also be integrated with other AWS services.
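As a rough illustration of what such a polling query can look like, the sketch below builds the parameters for a CloudWatch GetMetricStatistics request for one EC2 metric. The helper name build_metric_query, the placeholder instance ID, and the five-minute period are assumptions made for this example; the dictionary keys follow the boto3 get_metric_statistics call.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(instance_id, metric_name, period_seconds=300,
                       lookback_minutes=60, now=None):
    """Build keyword arguments for a CloudWatch GetMetricStatistics request.

    Hypothetical helper for illustration; it only assembles the request.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=lookback_minutes),
        "EndTime": end,
        "Period": period_seconds,  # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }


# Placeholder instance ID, not a real resource.
query = build_metric_query("i-0123456789abcdef0", "CPUUtilization")
print(query["Namespace"], query["MetricName"], query["Period"])
```

With boto3 installed and AWS credentials configured, the dictionary could be passed on as boto3.client("cloudwatch").get_metric_statistics(**query).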

Metrics to look for in EC2 monitoring

Even when you run many individual instances, you should keep track of the basic system-level metrics across your infrastructure. EC2 metrics fall into the following categories:

CPU credit metrics
Resource usage metrics
EBS metrics for Nitro-based instances
Elastic Inference metrics
Elastic Graphics metrics
Instance status checks

CPU credit metrics

EC2 instances have one or more virtual CPUs (vCPUs), and tracking CPU usage helps you match resources to your workload. Although CloudWatch monitors the utilization and processing capacity of an instance, it does not monitor the CPU usage of the hardware layer on which the instance is hosted. Burstable T2/T3 instances provide a baseline level of CPU performance and can burst above it by spending CPU credits.

CPU credit usage

This measures the number of CPU credits spent by the instance. One CPU credit is equivalent to one vCPU running at 100% utilization for one minute.
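The credit definition above is simple arithmetic, and a short sketch can make it concrete. The function name cpu_credits_used is made up for this example, and it assumes utilization is constant over the interval.

```python
def cpu_credits_used(vcpus, utilization_percent, minutes):
    """Credits spent, given that one credit is one vCPU at 100% for one minute."""
    return vcpus * (utilization_percent / 100.0) * minutes


# One vCPU at 100% for one minute spends one credit.
print(cpu_credits_used(1, 100, 1))   # 1.0
# One vCPU at 50% for two minutes also spends one credit.
print(cpu_credits_used(1, 50, 2))    # 1.0
# Two vCPUs at 25% for an hour.
print(cpu_credits_used(2, 25, 60))   # 30.0
```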

CPU credit balance

This measures the number of earned CPU credits that the instance has accrued. The balance grows whenever the instance spends fewer credits than it earns, that is, while it runs below its baseline CPU performance level.

CPU surplus credit balance

This measures the number of surplus credits spent by a T2/T3 Unlimited instance. When the CPU credit balance is exhausted, the instance spends surplus credits to sustain CPU usage above its baseline.

CPU surplus credits charged

This measures the number of spent surplus credits that are not paid down by earned CPU credits and therefore incur an additional charge.
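How the four credit metrics relate can be sketched with a toy ledger. This is a deliberately simplified model, not AWS's actual accounting: it assumes an unlimited-mode instance, ignores the cap on the credit balance and on surplus credits, and does not model billing. The function name step and the sample numbers are invented for the example.

```python
def step(balance, surplus, earned, spent):
    """Advance a toy credit ledger by one interval.

    Returns the new (credit balance, surplus credit balance).
    """
    net = earned - spent
    if net >= 0:
        # Credits earned beyond usage first pay down earlier surplus spending.
        paid = min(net, surplus)
        return balance + (net - paid), surplus - paid
    # Usage beyond earnings draws on the balance first, then on surplus credits.
    shortfall = -net
    from_balance = min(shortfall, balance)
    return balance - from_balance, surplus + (shortfall - from_balance)


state = (10, 0)                            # 10 credits banked, no surplus
state = step(*state, earned=6, spent=20)   # burst: balance exhausted
print(state)                               # (0, 4)
state = step(*state, earned=6, spent=3)    # quieter: surplus partly paid down
print(state)                               # (0, 1)
state = step(*state, earned=6, spent=0)    # idle: surplus cleared, balance grows
print(state)                               # (5, 0)
```

In this model, any surplus that is never paid down corresponds to what the charged metric would report.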

Resource usage metrics

Resource usage metrics are some of the most prominent host-level metrics for monitoring applications that have consistently high utilization levels.

CPU utilization

This metric measures the percentage of allocated CPU units that are being used by the instance.

Disk read and write operations

These metrics count the completed read and write operations on all instance store volumes available to the instance. They can also help you determine whether performance degradation is caused by high IOPS creating a bottleneck.
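Operation counts are reported as totals per period, so comparing them against an IOPS limit means dividing by the period length in seconds. The sketch below shows that conversion; the function name and the sample totals are made up for illustration.

```python
def average_per_second(period_total, period_seconds):
    """Convert a per-period total (operations or bytes) to a per-second rate."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return period_total / period_seconds


# Hypothetical five-minute totals for read and write operations.
read_ops, write_ops, period = 90_000, 30_000, 300
print(average_per_second(read_ops, period))               # 300.0
print(average_per_second(read_ops + write_ops, period))   # 400.0
```

The same division turns byte totals into average throughput in bytes per second.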

Network input/output

These metrics measure the number of bytes received and sent on all of the instance's network interfaces.

Metadata no token

This metric counts the number of times the instance metadata service was successfully accessed without a token, that is, using Instance Metadata Service Version 1 (IMDSv1).
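One practical use of this metric is finding instances where something still reads metadata without a token. The sketch below assumes you have already collected per-period counts for each instance; the function name, instance IDs, and numbers are hypothetical.

```python
def instances_using_tokenless_metadata(no_token_counts):
    """Return the IDs of instances with at least one tokenless metadata call."""
    return sorted(
        instance_id
        for instance_id, counts in no_token_counts.items()
        if sum(counts) > 0
    )


# Made-up per-period counts keyed by placeholder instance IDs.
samples = {
    "i-aaaa1111": [0, 0, 0],
    "i-bbbb2222": [0, 3, 1],
}
print(instances_using_tokenless_metadata(samples))  # ['i-bbbb2222']
```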

EBS metrics for Nitro-based instances

Amazon Elastic Block Store (EBS) is a scalable, high performance block storage service for use with EC2. An EBS volume provides persistent storage, unlike an instance store volume, whose data is lost when the instance stops.

EBS read and write operations

These metrics help you measure the count of the completed read and write operations for all EBS volumes attached to the instance within a specific period of time.

EBS read and write bytes

These metrics measure the bytes read and written for all EBS volumes attached to the instance within a specific period of time.

EBS balance percent

This metric shows the percentage of I/O or throughput credits remaining in the burst bucket.
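A common way to act on a balance percentage is to map it to alert levels. The sketch below does that with thresholds of 50% and 10%, which are arbitrary values chosen for this example, not recommendations from AWS or Site24x7.

```python
def burst_balance_state(balance_percent, warn_below=50.0, critical_below=10.0):
    """Classify a remaining burst-bucket percentage (thresholds are assumptions)."""
    if not 0.0 <= balance_percent <= 100.0:
        raise ValueError("balance_percent must be between 0 and 100")
    if balance_percent < critical_below:
        return "critical"
    if balance_percent < warn_below:
        return "warning"
    return "ok"


for value in (80.0, 35.0, 5.0):
    print(value, burst_balance_state(value))
```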

Elastic Inference metrics

Amazon Elastic Inference is a resource you can attach to your EC2 instances to accelerate your deep learning inference workloads. Through Elastic Inference metrics, you can monitor the connectivity and performance of your Elastic Inference accelerator connected to your EC2 instance.

Accelerator health check

This metric checks whether the Elastic Inference accelerator has passed a status health check in the previous minute. A value of zero (0) indicates the status check has failed, and a value of one (1) indicates the status check has passed.

Accelerator connectivity check

This metric checks whether the connectivity to the Elastic Inference accelerator is active or has failed. A value of zero (0) indicates a failed connection, and a value of one (1) indicates a successful connection.
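Because both checks report 0 or 1 per minute, averaging a window of datapoints gives the fraction of minutes in which the check passed. A minimal sketch, with a hypothetical function name and sample data:

```python
def pass_ratio(datapoints):
    """Fraction of 0/1 check results that passed, or None if there is no data."""
    if not datapoints:
        return None
    return sum(datapoints) / len(datapoints)


# Four one-minute results: the check failed in the third minute.
print(pass_ratio([1, 1, 0, 1]))  # 0.75
```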

Accelerator memory usage

This metric measures the memory used by the Elastic Inference accelerator.

Elastic Graphics metrics

Amazon Elastic Graphics provides flexible, low-cost, high performance graphics acceleration for your Windows instances. With Elastic Graphics metrics, you can monitor the connectivity and performance of your Elastic Graphics accelerator connected to your EC2 instance.

GPU connectivity check

GPU connectivity is the backbone of graphics acceleration, and this metric allows you to check whether the connectivity to the Elastic Graphics accelerator is active or has failed. A value of zero (0) indicates a failed connection, and a value of one (1) indicates a successful connection.

GPU health check

This metric checks whether the Elastic Graphics accelerator has passed a status health check in the previous minute. A value of zero (0) indicates the status check has failed, and a value of one (1) indicates the status check has passed.

GPU memory utilization

This metric reports the amount of GPU memory in use, measured in MiB.

Instance status checks

EC2 instance status checks help you check on the status of an individual instance and the AWS systems hosting it. They are available at one-minute intervals, giving you an accurate indication of an instance’s health. This lets you determine whether the problem is with the AWS infrastructure, the software, or the network configuration of the instance.

Status check failed

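This metric reports whether the instance has passed both the instance reachability check and the system reachability check in the previous minute. A value of zero (0) indicates that both checks passed, and a value of one (1) indicates that at least one of them failed.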

Status check failed_instance

This reports whether the instance has failed the instance reachability check in the previous minute. These failures usually stem from problems you need to resolve yourself, such as incorrect software or network configuration on the instance.

Status check failed_system

This metric reports whether the instance has failed the system reachability check in the previous minute. Usually, these failures are due to problems outside of your control, such as power loss on the underlying host. They can often be resolved by stopping and starting the instance, which moves it to a new host.
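The three status check values can be combined into a quick diagnosis. The sketch below assumes, as AWS documents, that a failed system check points to the underlying AWS host and a failed instance check points to the instance's own software or network configuration. The function name and message wording are invented for the example.

```python
def diagnose(status_check_failed, failed_instance, failed_system):
    """Summarize the three 0/1 status check metrics for one minute."""
    if not status_check_failed:
        return "healthy"
    causes = []
    if failed_system:
        causes.append("system check (underlying AWS host)")
    if failed_instance:
        causes.append("instance check (software or network configuration)")
    if not causes:
        return "failed: cause not reported"
    return "failed: " + " and ".join(causes)


print(diagnose(0, 0, 0))
print(diagnose(1, 0, 1))
print(diagnose(1, 1, 0))
```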

We have looked at several metrics that are vital for monitoring EC2 and tracking the health of your applications. EC2’s varied range of instances lets you build customized infrastructure for any of the above use cases and scale, change, or downsize your instances as needed.
