Amazon’s T2 instance family is unique in that it allows CPU utilization to burst during periods of high usage. This capability has implications for determining the appropriate size, but there is a set of metrics that can be used to create and confirm rightsizing recommendations.
Bursting is good, but it’s all about the baseline
In a previous blog post, I explained how T2 instances work, and what the benefits are. To recap, each T2 instance size receives CPU credits at an hourly rate determined by size. Each CPU credit is good for one minute at 100% of a single CPU, or two minutes at 50%, and so on. An idle T2 instance will eventually top out its credit balance at 24 times the hourly allocation. When the instance is active it consumes credits. If utilization is below the baseline, consumption is less than the replenishment rate, so the balance increases or remains at the maximum. As CPU usage goes above the baseline, the credit balance diminishes. The balance recovers during idle periods.
A quick example:
A t2.micro receives a flow of six CPU credits per hour. Because each credit equals one minute at 100% utilization, you can use 6/60 to determine the baseline CPU as 10%. While utilization is below the baseline, credits accumulate; when it is higher, the instance burns credits to sustain the higher utilization. As long as utilization averages out at or below the baseline, the instance will run well, because the flow of credits satisfies the requirements of the CPU.
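The arithmetic above can be sketched in a few lines of Python. This is a simplified model rather than AWS's actual accounting: it assumes hourly granularity, a balance that starts at zero, and a hard cap at 24 times the hourly allocation. The function names and the sample utilization series are hypothetical.

```python
def baseline_utilization(credits_per_hour, vcpus=1):
    """Fraction of each vCPU that the hourly credit flow can sustain."""
    return credits_per_hour / (60 * vcpus)


def simulate_balance(credits_per_hour, hourly_utilization,
                     start_balance=0.0, vcpus=1):
    """Return the credit balance at the end of each hour.

    hourly_utilization holds average CPU fractions (0.0 to 1.0), one per hour.
    Simplification: credits are earned and spent once per hour.
    """
    max_balance = 24 * credits_per_hour  # cap described in the article
    balance = start_balance
    history = []
    for utilization in hourly_utilization:
        spent = utilization * 60 * vcpus  # one credit = one vCPU-minute at 100%
        balance = balance + credits_per_hour - spent
        balance = max(0.0, min(max_balance, balance))
        history.append(balance)
    return history


T2_MICRO_CREDITS_PER_HOUR = 6  # figure quoted in the article

# Hypothetical day: ten quiet hours at 2% CPU, then two hours at 50%.
idle_then_burst = [0.02] * 10 + [0.50] * 2
history = simulate_balance(T2_MICRO_CREDITS_PER_HOUR, idle_then_burst)

print(f"baseline: {baseline_utilization(T2_MICRO_CREDITS_PER_HOUR):.0%}")
print(f"balance after quiet period: {history[9]:.1f}")
print(f"balance after burst: {history[-1]:.1f}")
```

In this sketch the ten quiet hours bank roughly 48 credits, and two hours at 50% spend them again, which is the ebb and flow the baseline is meant to absorb.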
Enter T2 Unlimited
Recently, Amazon announced the T2 Unlimited feature. T2 Unlimited adds two capabilities:
- The ability to “borrow” up to 24 hours of credits which have not yet accrued. This is useful for sustained bursts which occur several days apart, or weekly. The instance can burst for twice as long, but will take longer to recover.
- The ability to continue to burst at a fee. If the instance is still requesting capacity over the baseline after credits are exhausted, AWS will allow the instance to burst at a cost of $0.05 per CPU hour. This means that a T2 Unlimited instance, such as a t2.nano, could use 100% of CPU for as long as needed, at an additional cost of $0.05 per hour.
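As a rough illustration of the second capability, the sketch below estimates the extra charge for a sustained burst once credits are exhausted. It rests on two assumptions that are not stated above: that only the portion of utilization above the baseline is billed (which makes the $0.05 per hour figure an upper bound), and that a t2.nano earns three credits per hour. The function name is hypothetical.

```python
SURPLUS_RATE_USD_PER_CPU_HOUR = 0.05  # rate quoted in the article


def surplus_charge(utilization, baseline, hours, vcpus=1,
                   rate=SURPLUS_RATE_USD_PER_CPU_HOUR):
    """Estimate the fee for bursting after the credit balance is exhausted.

    Assumption: only CPU time above the baseline is billed.
    utilization and baseline are fractions of a vCPU (0.0 to 1.0).
    """
    excess = max(0.0, utilization - baseline)
    return excess * vcpus * hours * rate


# Assumed t2.nano allocation: 3 credits per hour, i.e. a 5% baseline.
nano_baseline = 3 / 60

one_day_flat_out = surplus_charge(utilization=1.0, baseline=nano_baseline, hours=24)
print(f"24 hours at 100% CPU: ${one_day_flat_out:.2f}")
```

Under these assumptions a full day at 100% CPU on a single vCPU costs a little over a dollar in burst fees, which is why Unlimited works as a hedge rather than as a sizing strategy.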
Rightsizing T2 Instances—what is the right metric?
In rightsizing, we use the metrics we get from the instance to determine whether an opportunity exists to rightsize, and the appropriate target(s) for the instance. The idea is to normalize the metrics across the various instance offerings and make the best fit. The key metrics used in rightsizing are CPU, memory, and network. With each of these metrics you can use the average or maximum in rightsizing.
With T2, Average CPU % is the best metric. This is because T2 allows you to burst for periods of time as long as your average over time is at or below the baseline CPU, governed by the CPU credits per hour for the given instance type. If the average CPU is below the baseline, the ebbs and flows of usage will consume and accrue credits, and the system will be able to burst where needed.
If you were to rightsize using maximum utilization, you would be defeating the whole point of T2: you would be sizing for the maximum burst level, so the instance would likely sit at its maximum credit balance all the time and cost more than is optimal.
T2 Unlimited reinforces this, as the ability to “borrow” credits allows for longer bursts which occur infrequently. The ability to purchase CPU if needed is a hedge against sudden or periodic increases in activity, but the best match is when the baseline is higher than the average CPU utilization.
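To make the difference between the two metrics concrete, here is a minimal sketch that picks the smallest T2 size whose hourly credit flow covers a given CPU demand. The credits-per-hour table is an assumption based on AWS's published T2 figures and should be checked against current documentation. The utilization series is hypothetical, and the model ignores vCPU counts and memory.

```python
from statistics import mean

# Assumed credits earned per hour for each T2 size (verify against AWS docs).
CREDITS_PER_HOUR = {
    "t2.nano": 3,
    "t2.micro": 6,
    "t2.small": 12,
    "t2.medium": 24,
    "t2.large": 36,
    "t2.xlarge": 54,
    "t2.2xlarge": 81,
}


def smallest_size_for(cpu_fraction):
    """Smallest size whose credit flow sustains cpu_fraction of one vCPU."""
    credits_needed = cpu_fraction * 60  # vCPU-minutes required per hour
    for size, credits in sorted(CREDITS_PER_HOUR.items(), key=lambda kv: kv[1]):
        if credits >= credits_needed:
            return size
    return None  # demand exceeds every T2 size in the table


# Hypothetical day: mostly idle at 3%, with a two-hour spike.
hourly_cpu = [0.03] * 22 + [0.60, 0.85]

by_average = smallest_size_for(mean(hourly_cpu))
by_maximum = smallest_size_for(max(hourly_cpu))
print(f"sized on average: {by_average}")
print(f"sized on maximum: {by_maximum}")
```

With this series, sizing on the average lands on a t2.micro, while sizing on the maximum jumps to a t2.xlarge that would spend nearly the whole day idle with a full credit balance.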
T2 Rightsizing with CloudHealth
When rightsizing T2 instances it is important to remember that while the compute resources for T2 are burstable, and can go above baseline, memory is static, and cannot burst. If an instance runs out of memory it may begin to swap, significantly degrading performance.
A best practice is to rightsize based on Average CPU % and Maximum Memory %, so the instance will not run out of memory after rightsizing. This is easily set up in CloudHealth Policies. Within the Instance Rightsizing Policy you can set the CPU and memory criteria to average or maximum.
This sets rightsizing to optimize based on the baseline CPU and peak memory utilization. In reality a system that runs out of memory and starts to swap will be hindered more than an instance limited to baseline CPU.
Another best practice is to confirm recommendations by clicking on one of the score meters to look at the performance metrics over time. To see the dynamics of usage, switch the time granularity to hourly.
Here we see a t2.large which has been recommended to move to a t2.medium. As you can see, the CPU could be supported by a t2.small (peaking at ~20%), but the memory could not (continuously at ~40%), so the t2.medium recommendation looks correct.
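The logic behind a recommendation like this one can be sketched as a two-constraint search: the target must cover average CPU demand and peak memory at the same time. This is not CloudHealth's implementation. The size table is an assumption based on AWS's published T2 figures, and the demand numbers are hypothetical values chosen to mirror the scenario above.

```python
# (name, credits per hour, memory in GiB), ordered smallest to largest.
# Assumed values; verify against current AWS documentation.
T2_SIZES = [
    ("t2.nano", 3, 0.5),
    ("t2.micro", 6, 1.0),
    ("t2.small", 12, 2.0),
    ("t2.medium", 24, 4.0),
    ("t2.large", 36, 8.0),
    ("t2.xlarge", 54, 16.0),
    ("t2.2xlarge", 81, 32.0),
]


def recommend(avg_cpu_credits_per_hour, max_memory_gib):
    """Smallest size that satisfies both average CPU and peak memory."""
    for name, credits, memory in T2_SIZES:
        if credits >= avg_cpu_credits_per_hour and memory >= max_memory_gib:
            return name
    return None  # nothing in the T2 family fits


# Hypothetical demand: about 10 credits per hour of CPU and a peak of
# 3.2 GiB of memory (40% of a t2.large's assumed 8 GiB).
cpu_only = recommend(avg_cpu_credits_per_hour=10, max_memory_gib=0)
cpu_and_memory = recommend(avg_cpu_credits_per_hour=10, max_memory_gib=3.2)
print(f"CPU alone: {cpu_only}")
print(f"CPU and memory: {cpu_and_memory}")
```

Here CPU alone would allow a t2.small, but the memory constraint pushes the result up to a t2.medium, matching the reasoning in the example.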
Using CPU credit metrics to monitor T2 sizing
You can track the CPU credit balance in metrics. This is an instance which is right on the edge, in that the average is very close to the baseline. A period of high utilization took out its credit balance and it hasn’t recovered. T2 Unlimited would have allowed the instance to continue at the high utilization for a longer period. If the burst continued beyond that, Unlimited would allow the CPU to use what it needed at a cost per CPU hour.
Looking at the same instance for a longer period of time we can see the cycle of usage, where the usage and credit balance fluctuate periodically, for which the longer burst capability of T2 Unlimited could be very effective.
Another best practice is to set up policies to track the credit balance. The policy below will alert if the average balance falls below 10 for 12 hours. In the previous example, an alert would have been raised. If the sustained load had increased, the instance size could then be adjusted. Otherwise, the situation could be monitored to see whether usage returns to the historical level and the balance recovers, or whether the balance stays low, in which case upsizing would be in order.
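The alert condition can also be expressed outside of a policy engine. The sketch below is one reading of the rule, not CloudHealth's implementation: it assumes one credit balance sample per hour and alerts when the mean of the most recent 12 samples is below 10. The function name and sample data are hypothetical.

```python
from statistics import mean


def credit_balance_alert(hourly_balances, threshold=10, window_hours=12):
    """True when the average balance over the trailing window is below threshold.

    Assumption: hourly_balances holds one sample per hour, oldest first.
    Returns False until a full window of samples is available.
    """
    if len(hourly_balances) < window_hours:
        return False
    recent = hourly_balances[-window_hours:]
    return mean(recent) < threshold


# Hypothetical series: a full balance, then a burst that drains it.
depleted = [144] * 12 + [4] * 12
# Hypothetical series: a brief dip that recovers within the window.
recovering = [0] * 6 + [20] * 6

print(credit_balance_alert(depleted))
print(credit_balance_alert(recovering))
```

Averaging over a window, rather than alerting on a single low sample, keeps short bursts that recover on their own from raising an alert.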
Summary
Organizations are gradually moving T2 instances up the food chain of system loads. We are seeing a trend of large fleets of T2 instances with the larger sizes being used in addition to the traditional smaller T2 usage. As T2 becomes a more substantial portion of compute expense, it is important to understand the proper sizing for T2 instances. Often this is best done with a combination of Average CPU % and Maximum Memory Usage %. Policies should also be used to guard against T2 undersizing by monitoring the CPU credit balance, and alerting when it falls below your threshold. CloudHealth, a market leader in cloud management, gives you all the capabilities to do this.