For AWS infrastructure, EC2 optimization has always been a crucial step in cloud cost management, but it’s not limited just to EC2 — or just to AWS. Optimization has grown even more important as cloud infrastructures have grown, so much so that it’s an entire phase in the cloud operating model of FinOps. In FinOps, the Optimize phase includes actions applicable to EC2 (such as removing underutilized resources, automating resources or rightsizing instances) and broader optimization actions, including RI purchasing or discount optimization.
To find out more about FinOps and optimization, check out FinOps: A New Approach to Cloud Financial Management.
Welcome back to the Five Stages of AWS Cost Efficiency. Today, we’ll be talking about Stage III: optimizing EC2 usage. Before diving into this step, make sure you’ve established basic cost visibility and implemented cost allocation and chargeback. Ready? Let’s get started.
It’s time to stop treating the cloud like a datacenter. There are 168 hours in a week, and 108 of them are nights and weekends. In spite of this, it’s far too common for companies to over-provision resources and leave everything on all the time. The goal of this stage is to avoid this pitfall: to identify what resources can be turned off, sized down, or autoscaled back during non-peak hours. In doing so, you’ll also make some great progress in determining which instances you should buy reservations for in Stage IV.
Here are the ABCs of optimizing EC2 usage efficiency:
You should have already implemented tags in Stage II: Cost Allocation and Chargeback. If so, well done: tagging is crucial to this stage. Without tags, you won’t know with confidence what each of your instances is doing or which instances you can turn off.
For the purposes of Stage III, use a Role tag to tell you whether an instance is part of your web, app, or database tier. Use a Name tag to concatenate data about the service, node, or cluster it belongs to, or apply discrete tag keys for each of those for even more granularity.
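As a rough illustration of what a Role tag makes possible, here is a minimal Python sketch that groups instances by role. The inventory is made up; the dict layout only mirrors the `Tags` list of key/value pairs that EC2 describe calls return, and the helper names `tag_value` and `group_by_role` are hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory: IDs and tag values are invented for illustration.
instances = [
    {"InstanceId": "i-0a1", "Tags": [{"Key": "Role", "Value": "web"},
                                     {"Key": "Name", "Value": "checkout-web-01"}]},
    {"InstanceId": "i-0b2", "Tags": [{"Key": "Role", "Value": "database"},
                                     {"Key": "Name", "Value": "checkout-db-01"}]},
    {"InstanceId": "i-0c3", "Tags": []},  # never tagged
]


def tag_value(instance, key, default="untagged"):
    """Return the value of one tag key, or a default when the key is missing."""
    for tag in instance.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return default


def group_by_role(instances):
    """Map each Role tag value to the instance IDs that carry it."""
    groups = defaultdict(list)
    for inst in instances:
        groups[tag_value(inst, "Role")].append(inst["InstanceId"])
    return dict(groups)


roles = group_by_role(instances)
```

Anything that lands in the `untagged` bucket is an instance whose purpose you cannot state with confidence.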
These tags buy you two distinct efficiency wins:
Tagging is so important that some of our customers implement tag-or-terminate rules that automatically shut down instances not tagged within 24 hours.
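A tag-or-terminate rule could be sketched roughly as follows. This is an assumption-laden illustration, not any customer's actual policy: the required tag keys, the use of launch time as the start of the 24-hour window, and the function name `untagged_past_grace` are all hypothetical. The instance dicts again mirror the shape of EC2 describe output (`InstanceId`, `LaunchTime`, `Tags`).

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"Role", "Name"}     # assumed policy: both keys must be present
GRACE_PERIOD = timedelta(hours=24)   # the 24-hour window from the text


def untagged_past_grace(instances, now):
    """Return IDs of instances missing a required tag after the grace period."""
    offenders = []
    for inst in instances:
        present = {tag["Key"] for tag in inst.get("Tags", [])}
        missing = REQUIRED_TAGS - present
        age = now - inst["LaunchTime"]
        if missing and age > GRACE_PERIOD:
            offenders.append(inst["InstanceId"])
    return offenders


# Made-up fleet, evaluated at a fixed point in time so the result is repeatable.
now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
fleet = [
    {"InstanceId": "i-old-untagged", "LaunchTime": now - timedelta(hours=30), "Tags": []},
    {"InstanceId": "i-new-untagged", "LaunchTime": now - timedelta(hours=2), "Tags": []},
    {"InstanceId": "i-old-tagged", "LaunchTime": now - timedelta(hours=30),
     "Tags": [{"Key": "Role", "Value": "web"}, {"Key": "Name", "Value": "web-01"}]},
]

to_stop = untagged_past_grace(fleet, now)  # only "i-old-untagged" qualifies
```

The sketch only produces a list of candidates. Whether you stop, terminate, or simply notify the owner is a policy decision; with boto3, for example, the list could be passed to the EC2 client's `stop_instances(InstanceIds=...)` call.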
One of the quickest efficiency wins is to simply turn off underused instances. Start by looking for a combination of low CPU, low bandwidth, and low disk I/O. As mentioned in the last section, though, these will vary by instance role: your database servers will likely use more I/O than your web workers. Here are some of the metrics you should look at:
Roughly 65% of the hours in a month are nights and weekends. Unless you have around-the-clock offshore dev teams or are relying exclusively on ephemeral storage, it’s likely you can turn some of your non-production resources off some of the time.
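The nights-and-weekends arithmetic is easy to check. The sketch below assumes resources are needed only on weekdays, for either 12 or 8 hours a day. The hourly rate and instance count are invented purely to show the shape of the calculation.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def off_hours_share(on_hours_per_weekday, weekdays=5):
    """Fraction of the week that falls outside working hours."""
    on_hours = on_hours_per_weekday * weekdays
    return (HOURS_PER_WEEK - on_hours) / HOURS_PER_WEEK


def weekly_savings(hourly_rate, instance_count, on_hours_per_weekday, weekdays=5):
    """Cost avoided per week by stopping instances outside working hours."""
    off_hours = HOURS_PER_WEEK - on_hours_per_weekday * weekdays
    return hourly_rate * instance_count * off_hours


share_12h = off_hours_share(12)  # 108 / 168, about 64%
share_8h = off_hours_share(8)    # 128 / 168, about 76%

# Hypothetical fleet: 20 non-production instances at $0.10 per hour.
savings = weekly_savings(0.10, 20, 12)  # 20 * 0.10 * 108 hours
```

Even with a generous 12-hour working day, close to two-thirds of the weekly bill for an always-on non-production instance pays for hours when nobody is using it.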
Chances are, you’ll find your instance count at 2 a.m. is the same as at 2 p.m. Using an orchestration tool like Puppet or Chef, explore scaling some of your resources down after your office closes for the day and scaling them back up before it opens.
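The scheduling decision itself is simple enough to sketch in a few lines of Python. The office hours, instance counts, and the function name `desired_capacity` below are assumptions for illustration. In practice the result would be fed to whatever tool actually resizes your fleet.

```python
from datetime import datetime

OFFICE_OPEN_HOUR = 8     # assumed local opening time
OFFICE_CLOSE_HOUR = 20   # assumed local closing time


def desired_capacity(now, peak_count, off_peak_count):
    """Pick an instance count based on whether the office is open."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    in_office_hours = OFFICE_OPEN_HOUR <= now.hour < OFFICE_CLOSE_HOUR
    if is_weekday and in_office_hours:
        return peak_count
    return off_peak_count


# 3 January 2024 is a Wednesday; 6 January 2024 is a Saturday.
at_2pm = desired_capacity(datetime(2024, 1, 3, 14, 0), peak_count=10, off_peak_count=2)
at_2am = desired_capacity(datetime(2024, 1, 3, 2, 0), peak_count=10, off_peak_count=2)
on_saturday = desired_capacity(datetime(2024, 1, 6, 14, 0), peak_count=10, off_peak_count=2)
```

Here the count drops from 10 to 2 overnight and on weekends, so the 2 a.m. and 2 p.m. numbers no longer match.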
You can ensure comprehensive completion of Stage III by following these steps:
- Define role-specific utilization SLAs based on profiles
- Generate an underutilized instance report based on CPU, bandwidth, disk I/O, and days alive using Cloudability Usage Analytics
- Identify test/dev/stage resources that don't need to be running 24/7
- Implement API access to usage data for Ops and Engineering dashboards
- Provide the reports to each product team
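The first two checklist items, role-specific utilization profiles and an underutilized instance report, might look something like this in plain Python. It is a stand-in for illustration only, not the Cloudability API: the thresholds, field names, and the seven-day minimum age are all assumptions, and the metrics are invented averages.

```python
# Hypothetical role-specific ceilings: an instance is "underutilized" only if
# it sits below every ceiling for its role.
THRESHOLDS = {
    "web":      {"cpu_pct": 10.0, "net_mbps": 5.0, "disk_iops": 50.0},
    "database": {"cpu_pct": 10.0, "net_mbps": 5.0, "disk_iops": 500.0},
}
MIN_DAYS_ALIVE = 7  # assumed: ignore instances too new to judge


def is_underutilized(inst):
    """True when an instance is old enough and below all ceilings for its role."""
    limits = THRESHOLDS.get(inst["role"])
    if limits is None or inst["days_alive"] < MIN_DAYS_ALIVE:
        return False
    return all(inst[metric] < ceiling for metric, ceiling in limits.items())


def underutilized_report(instances):
    """Return the sorted IDs of instances worth reviewing."""
    return sorted(inst["id"] for inst in instances if is_underutilized(inst))


# Made-up average metrics over the reporting window.
fleet = [
    {"id": "web-idle", "role": "web", "cpu_pct": 3.0, "net_mbps": 1.0,
     "disk_iops": 10.0, "days_alive": 30},
    {"id": "web-busy", "role": "web", "cpu_pct": 55.0, "net_mbps": 40.0,
     "disk_iops": 20.0, "days_alive": 30},
    {"id": "db-quiet", "role": "database", "cpu_pct": 4.0, "net_mbps": 2.0,
     "disk_iops": 300.0, "days_alive": 30},
    {"id": "web-new", "role": "web", "cpu_pct": 1.0, "net_mbps": 0.5,
     "disk_iops": 5.0, "days_alive": 2},
]

report = underutilized_report(fleet)
```

Note that `db-quiet` is flagged even at 300 IOPS because the database profile expects far more disk activity than the web profile, while `web-new` is skipped until it has been alive long enough to judge.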
Most companies try to skip this stage and go straight to buying Reserved Instances. But by optimizing your EC2 usage first, you’ll put yourself in a select group who can make the most efficient RI buys possible, and save more money along the way. Your finance team will thank you.
For more information about the Five Stages of AWS Cost Efficiency, check out these blog posts: