The FinOps Journey:
(EDITORIAL NOTE - Annotated in 2019 to show this post's role on the path to the FinOps cloud operating model.)
For AWS infrastructure, EC2 optimization has always been a crucial step in cloud cost management, but optimization isn’t limited to EC2, or even to AWS. It has grown even more important as cloud infrastructures have grown, so much so that it’s an entire phase in the cloud operating model of FinOps. In FinOps, the Optimize phase includes actions applicable to EC2 (such as removing underutilized resources, automating resources or rightsizing instances) and broader optimization actions, including RI purchasing or discount optimization.
To find out more about FinOps and optimization, check out FinOps: A New Approach to Cloud Financial Management.
Welcome back to the Five Stages of AWS Cost Efficiency. Today, we’ll be talking about Stage III: optimizing EC2 usage. Before diving into this step, be sure you’ve established basic cost visibility and implemented cost allocation and chargeback. Ready? Let’s get started.
It’s time to stop treating the cloud like a datacenter. There are 168 hours in a week, and 108 of them are nights and weekends (counting a 12-hour working day, five days a week). In spite of this, it’s far too common for companies to over-provision resources and leave everything on all the time. The goal of this stage is to avoid this pitfall: to identify what resources can be turned off, sized down, or autoscaled back during non-peak hours. In doing so, you’ll also make some great progress in determining which instances you should buy reservations for in Stage IV.
Here are the ABCs of optimizing EC2 usage efficiency:
A) Let tags guide your way
You should have already implemented tags in Stage II: Cost Allocation and Chargeback. If so, well done: tagging is crucial to this stage. Without tags, you won’t know with confidence what each of your instances is doing, or which instances you can turn off.
For the purposes of Stage III, use a Role tag to tell you if an instance is part of your web, app or database tier. Use a Name tag to concatenate data about the service, node, or cluster it’s a part of—or apply discrete tag keys for each of those for even more granularity.
These tags buy you two distinct efficiency wins:
- The ability to set up role-specific usage profiles for each tier of your infrastructure—because you’ll want to identify different CPU thresholds for your web tier than your database tier, for example.
- Visibility into what a questionable instance is doing before you turn it off or resize it.
This is so important that some of our customers implement tag-or-terminate rules that automatically shut down instances not tagged within 24 hours.
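As a rough illustration of what a tag-or-terminate check might look like, here is a minimal sketch. It is not a customer's actual rule: the instance records are plain dictionaries with hypothetical field names (`id`, `launch_time`, `tags`), and the required tag keys and the 24-hour grace period are assumptions taken from the discussion above. It only lists candidates; wiring it to a real shutdown call is left out on purpose.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy: both tags discussed in this stage are required.
REQUIRED_TAGS = {"Role", "Name"}
# Assumed grace period, matching the 24-hour window mentioned above.
GRACE_PERIOD = timedelta(hours=24)


def missing_tags(instance):
    """Return the required tag keys that are absent or empty on an instance."""
    tags = instance.get("tags", {})
    return {key for key in REQUIRED_TAGS if not tags.get(key)}


def instances_to_stop(instances, now):
    """Return IDs of instances still missing required tags after the grace period."""
    flagged = []
    for inst in instances:
        age = now - inst["launch_time"]
        if age > GRACE_PERIOD and missing_tags(inst):
            flagged.append(inst["id"])
    return flagged


# Hypothetical fleet used only to exercise the check.
now = datetime(2019, 6, 3, 12, 0, tzinfo=timezone.utc)
fleet = [
    {"id": "i-aaa", "launch_time": now - timedelta(hours=30),
     "tags": {"Role": "web", "Name": "shop-web-01"}},
    {"id": "i-bbb", "launch_time": now - timedelta(hours=30),
     "tags": {"Name": "mystery-box"}},
    {"id": "i-ccc", "launch_time": now - timedelta(hours=2), "tags": {}},
]
print(instances_to_stop(fleet, now))  # ['i-bbb']
```

Only `i-bbb` is flagged: it is past the grace period and has no Role tag. The untagged `i-ccc` is only two hours old, so its owner still has time to tag it.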
B) Look for underutilized instances
One of the quickest efficiency wins is to simply turn off underused instances. Start by looking for a combination of low CPU utilization, low bandwidth, and low disk I/O. As mentioned in the last section, though, the right thresholds vary by instance role: your database servers will likely use more I/O than your web workers. Here are some of the metrics you should look at:
- Low CPU: CPU utilization below a threshold of roughly 5-15% (depending on role) often indicates that an instance is oversized. Utilization below 1% may mean it’s not being used at all. You’ll need to know what Role it falls under to be sure, though.
- Low Bandwidth: Less than 100 MB of total bandwidth in a week may indicate the box is not being heavily used. Profiling the instance role is important here too, as certain roles are CPU-heavy but bandwidth-light.
- Low I/O: If an instance shows fewer than 1,000 disk I/O operations in the last week, you may want to take a closer look at it. Profiling comes into play here as well, since different types of machines call for different thresholds.
- Days Alive: The number of days or hours the instance has been on is a helpful heuristic. A day-old instance with no usage is less worrisome than one that has been online for nine months and is not being used.
- Estimated Cost: The higher the cost, the more cause for concern if it’s not being well utilized.
- Name tag: To tell you what the instance is doing and whether it really needs to be there.
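The metrics above can be combined into a simple report. The sketch below is one possible way to do it, not a description of how any particular tool works. It assumes each instance's metrics have already been aggregated over the last week into a dictionary; the field names, the per-role threshold values, and the seven-day minimum age are all hypothetical placeholders to tune against your own profiles.

```python
# Hypothetical per-role thresholds. The CPU values sit inside the 5-15% range
# discussed above; bandwidth and disk I/O are weekly totals.
THRESHOLDS = {
    "web": {"cpu_pct": 15.0, "bandwidth_mb": 100.0, "disk_io": 1000},
    "app": {"cpu_pct": 10.0, "bandwidth_mb": 100.0, "disk_io": 1000},
    "db": {"cpu_pct": 5.0, "bandwidth_mb": 100.0, "disk_io": 1000},
}
# Fallback for instances with a missing or unknown Role tag.
DEFAULT_THRESHOLDS = {"cpu_pct": 5.0, "bandwidth_mb": 100.0, "disk_io": 1000}
# Assumed cutoff: ignore instances too young to have a full week of data.
MIN_DAYS_ALIVE = 7


def low_signals(inst):
    """Return the names of metrics that fall below the thresholds for the instance's role."""
    limits = THRESHOLDS.get(inst.get("role"), DEFAULT_THRESHOLDS)
    return [metric for metric, limit in limits.items() if inst[metric] < limit]


def underutilized_report(instances):
    """Return instances that are low on all three metrics, costliest first."""
    rows = []
    for inst in instances:
        if inst["days_alive"] < MIN_DAYS_ALIVE:
            continue
        if len(low_signals(inst)) == 3:
            rows.append(inst)
    return sorted(rows, key=lambda i: i["monthly_cost"], reverse=True)


# Hypothetical weekly aggregates used only to exercise the report.
fleet = [
    {"name": "web-01", "role": "web", "cpu_pct": 40.0, "bandwidth_mb": 50000.0,
     "disk_io": 200000, "days_alive": 200, "monthly_cost": 140.0},
    {"name": "db-old", "role": "db", "cpu_pct": 0.5, "bandwidth_mb": 12.0,
     "disk_io": 300, "days_alive": 270, "monthly_cost": 560.0},
    {"name": "app-idle", "role": "app", "cpu_pct": 3.0, "bandwidth_mb": 40.0,
     "disk_io": 500, "days_alive": 30, "monthly_cost": 70.0},
    {"name": "web-new", "role": "web", "cpu_pct": 0.2, "bandwidth_mb": 1.0,
     "disk_io": 10, "days_alive": 1, "monthly_cost": 140.0},
]
print([inst["name"] for inst in underutilized_report(fleet)])  # ['db-old', 'app-idle']
```

Sorting by estimated cost puts the most expensive idle instances at the top of the list, and the Name tag on each row tells you who to ask before turning anything off.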
C) Turn off the lights at night
Roughly 65% of the hours in a month are nights and weekends. Unless you have around-the-clock offshore dev teams or are relying exclusively on ephemeral storage, it’s likely you can turn some of your non-production resources off some of the time.
Chances are, you’ll find your instance count at 2am to be the same as 2pm. Using an orchestration tool like Puppet or Chef, explore autoscaling some of your resources down after your office closes for the day and turning them back on before it opens.
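To make the schedule idea concrete, here is a minimal sketch of the decision a nightly job might make. It assumes a 12-hour office window (07:00 to 19:00, Monday to Friday), which is one set of hours consistent with the 108-of-168 figure earlier in this post; the environment labels and function names are hypothetical, and the actual start and stop calls to your orchestration tool are not shown.

```python
from datetime import datetime, time

# Assumed office window: 12 hours a day, Monday (0) through Friday (4).
OFFICE_OPEN = time(7, 0)
OFFICE_CLOSE = time(19, 0)
WORKDAYS = {0, 1, 2, 3, 4}


def should_be_running(environment, local_now):
    """Decide whether an instance should be on at the given local time."""
    if environment == "production":
        return True  # production stays on around the clock
    is_workday = local_now.weekday() in WORKDAYS
    in_office_hours = OFFICE_OPEN <= local_now.time() < OFFICE_CLOSE
    return is_workday and in_office_hours


def weekly_off_hours():
    """Hours per week that fall outside the office window."""
    on_hours = len(WORKDAYS) * (OFFICE_CLOSE.hour - OFFICE_OPEN.hour)
    return 7 * 24 - on_hours


print(should_be_running("dev", datetime(2019, 6, 3, 14, 0)))  # Monday 2pm: True
print(should_be_running("dev", datetime(2019, 6, 3, 2, 0)))   # Monday 2am: False
print(should_be_running("dev", datetime(2019, 6, 8, 14, 0)))  # Saturday 2pm: False
print(weekly_off_hours())                                     # 108
```

Under these assumed hours, a dev or test instance that follows the schedule runs 60 hours a week instead of 168.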
You can ensure comprehensive completion of Stage III by following these steps:
- Define role-specific utilization SLAs based on profiles
- Generate an underutilized instance report based on CPU, bandwidth, disk I/O, and Days Alive using Cloudability Usage Analytics
- Identify test/dev/stage resources that don't need to be running 24/7
- Implement API access to Usage data for Ops / Eng Dashboards
- Provide the reports to each product team
Most companies try to skip this stage and go straight to buying Reserved Instances. But by optimizing your EC2 usage first, you’ll put yourself in a select group who can make the most efficient RI buys possible, and save more money along the way. Your finance team will thank you.
For more information about the Five Stages of AWS Cost Efficiency, check out these blog posts: