Principal Systems Engineer Alok Singh walks through switching from EC2 I2 to I3 instances to achieve immediate savings of about 50%, along with key performance improvements that are only possible in the cloud. Get the full story and see how your team can do the same.
Engineers know that to build systems that scale well, we must choose components that not only handle current workloads but also leave enough room for expected traffic growth. Switching from EC2 I2 to I3 instances let us improve performance, achieve some serious savings, and position our compute to handle future loads.
AWS released the EC2 I3 instance back in February of this year. We covered it from a cost-efficiency perspective in a previous piece; take a look at that article if you need a quick refresher on what I3 instances are capable of.
How Much Did We Save on AWS EC2 Costs?
As noted in our previous analysis of I3, there’s a big opportunity for engineers using I2 instances to not only save with I3 but also get some serious performance increases. So far, we’ve saved about 50% on our I3 cluster costs.
Even better, even without any Reserved Instances (RIs), I3s would still be a better deal than running I2s. But assuming you have the right compute cost and usage data to back the purchase, buying RIs for your I3s delivers even more savings.
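To make the comparison concrete, here’s a back-of-envelope sketch in Python. The hourly rates below are made up for illustration (they are not actual AWS prices); plug in your own cost and usage data for a real decision.

```python
"""Hypothetical illustration of how RI pricing compounds with the I2 -> I3
move. All hourly rates are invented round numbers, not real AWS pricing."""

def annual_cost(hourly_rate: float, hours: int = 8760) -> float:
    """Annual cost of running one instance 24/7 at the given hourly rate."""
    return hourly_rate * hours

i2_on_demand = annual_cost(1.70)  # hypothetical I2 on-demand $/hr
i3_on_demand = annual_cost(0.85)  # hypothetical I3 on-demand $/hr (~50% less)
i3_reserved  = annual_cost(0.55)  # hypothetical effective 1-yr RI $/hr
```

The point is the ordering, not the exact figures: I3 on-demand already undercuts I2, and an RI commitment stacks a further discount on top of that.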
PRO TIP: For more tips on RI planning, check out our comprehensive e-book.
Why Did We Decide to Migrate?
Due to the nature of our Hadoop/HBase workloads, we need high peak I/O bandwidth across the cluster. With the I2 instance family, the way to achieve these bandwidth requirements was to use AWS Placement Groups (PGs).
However, PGs are limited to a single Availability Zone (AZ). This restriction reduced our availability SLAs, as we could not recover from a full AZ failure without significant downtime. A high-availability (HA) Hadoop cluster is usually configured with a “rack”-aware topology, where each “rack” is located in a different AZ, along with a backup/secondary NameNode.
While PGs are a great way to gain significant throughput and bandwidth between instances, having all cluster nodes in the same AZ limited the availability SLA we could achieve. Though infrequent, it is possible to lose connectivity to one or more AZs for extended periods, and when designing an HA system, you must account for and test such scenarios.
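One way Hadoop realizes this “rack”-per-AZ design is a rack-topology script, wired in via `net.topology.script.file.name` in `core-site.xml`. Below is a minimal sketch; the subnet-to-AZ table is a hypothetical example, not our actual VPC layout, and a real deployment might instead query the EC2 metadata service.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack-topology script: map each node to a "rack"
named after its Availability Zone, so HDFS spreads replicas across AZs.
The subnet-to-AZ mapping below is hypothetical."""
import sys

# Hypothetical mapping: each VPC subnet lives in exactly one AZ.
SUBNET_TO_AZ = {
    "10.0.1": "us-east-1a",
    "10.0.2": "us-east-1b",
    "10.0.3": "us-east-1c",
}

def rack_for(host_ip: str) -> str:
    """Return a rack path like /us-east-1a for the given node IP."""
    subnet = ".".join(host_ip.split(".")[:3])
    return "/" + SUBNET_TO_AZ.get(subnet, "default-rack")

if __name__ == "__main__":
    # Hadoop invokes the script with one or more IPs/hostnames and
    # expects one rack path per argument, space-separated, on stdout.
    print(" ".join(rack_for(ip) for ip in sys.argv[1:]))
```

With racks mapped to AZs this way, HDFS’s standard rack-aware replica placement keeps at least one copy of each block outside any single AZ.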
Multi-AZ Cross Connections Make I3 Shine
Switching our I/O-intensive workload from I2s to I3s improves the resilience of our system, because I3 instances can be cross-connected across multiple AZs. We can spin up I3s in multiple AZs but have them be part of the same cluster, drastically improving uptime. Maintaining uptime and meeting critical performance standards are both important if your business must meet criteria for certain levels of support or SLAs.
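The “one logical cluster, many AZs” idea can be sketched as simple round-robin placement; the node names and AZ list below are illustrative, not our actual topology.

```python
"""Sketch: spread a cluster's nodes evenly across multiple AZs instead of
pinning them all in a single placement group. Names are illustrative."""
from itertools import cycle

def spread_across_azs(node_names, azs):
    """Round-robin each node into an AZ; returns {node_name: az}."""
    az_cycle = cycle(azs)
    return {node: az for node, az in zip(node_names, az_cycle)}

# A hypothetical 6-node cluster over three AZs ends up with 2 nodes per AZ.
placement = spread_across_azs(
    [f"datanode-{i}" for i in range(6)],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)
```

Losing any single AZ then takes out only a third of the nodes, which the rack-aware replication described above is designed to survive.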
PRO TIP: Deciding whether to spin up a new EC2 instance or migrate to a different family involves a few cloud cost and performance considerations. Check out our free e-book on choosing the right EC2 instances to get some help.
Notable Performance Improvements
While saving on cloud costs is great, we’re very excited about the high level of availability and performance improvements that I3s bring to our infrastructure. So far we’ve noticed a 2x performance improvement on the workloads that we run on those I3 clusters. We’ve also noticed:
- Dramatically improved Disk I/O: Sustained multi-gigabyte-per-sec write and read throughput per instance
- Higher reliability through cross-AZ, multi-gigabit interconnects: see above about building our high-performance clusters across multiple AZs instead of in one PG in one AZ
- Improved encryption: the newer I3s feature data encryption at rest (this was not available with the previous generations in the family)
Let’s take a closer look at the improved Disk I/O at work:
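As a stand-in for a full benchmark, here’s a minimal Python sketch of measuring sequential write throughput, the metric behind the improvement above. A real test would use a dedicated tool such as fio with direct I/O against the I3’s NVMe devices; the file size here is deliberately tiny and the numbers it produces are only illustrative.

```python
"""Minimal sequential-write throughput sketch (MB/s). Not a substitute
for a proper benchmark tool such as fio."""
import os
import tempfile
import time

def write_throughput_mb_s(size_mb: int = 64, chunk_mb: int = 4) -> float:
    """Write size_mb of zeros in chunk_mb chunks, fsync, return MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        start = time.perf_counter()
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force data to disk before stopping the clock
        elapsed = time.perf_counter() - start
    os.unlink(f.name)  # clean up the scratch file
    return size_mb / elapsed
```

Comparing the same measurement on an I2 and an I3 node is what surfaced the sustained multi-gigabyte-per-second difference noted in the list above.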
Along with price, these kinds of performance enhancements in the cloud are a welcome sight as we scale our infrastructure. Improved performance, data security, and reliability – not a bad compute upgrade, AWS!
Maintaining Cloud Compute Efficiency
Immediate savings are nice, but a growing cloud brings all kinds of complexities in balancing performance with reliability. Here’s how we’re making sure we get the most out of our I3 clusters.
Knowing Our Workload Performance Limits
There’s nothing to stop us from spinning up an entire cluster of the most expensive I3 instances – but we won’t, because our team believes in rightsizing compute to fit the workload. You don’t want too snug a “fit,” either. Striking a balance between the size of the workload and the size of the cluster is key.
In building a distributed cluster, it is important to understand the failure modes and design a system that can function well in those modes. As mentioned earlier, one of the ways we handle failures is to place cluster nodes in “racks.” Each “rack” should have sufficient redundant capacity to handle the loss of another “rack” in the cluster. This presents an interesting problem when using the largest I3 instance type (i3.16xlarge).
Though we could have configured our cluster to use the 16xlarge instance type, we realized that providing the needed redundancy would leave us with significant overcapacity in CPU and memory. So instead, we decided to use 8xlarge instances, which are a better fit for both the required redundancy and performance.
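The trade-off above can be sketched numerically: to survive the loss of one “rack” (AZ), the surviving racks must absorb its share, so total capacity needs to be peak demand times n/(n-1), rounded up to whole instances per rack. Smaller instances round up less wastefully. The “units” and demand figure below are hypothetical, not our actual workload numbers.

```python
"""Sketch of rack-redundant capacity planning: bigger instances mean
coarser rounding and more stranded capacity. Numbers are illustrative."""
import math

def instances_needed(peak_demand_units: int, n_racks: int,
                     units_per_instance: int) -> int:
    """Total instances so the cluster survives one full rack failure."""
    # Surviving n_racks - 1 racks must still cover peak demand.
    required_total = peak_demand_units * n_racks / (n_racks - 1)
    per_rack = math.ceil(required_total / n_racks / units_per_instance)
    return per_rack * n_racks

# Hypothetical workload needing 100 "units" across 3 racks, where a
# 16xlarge supplies 16 units and an 8xlarge supplies 8.
total_16xl = instances_needed(100, 3, 16)  # 12 instances -> 192 units
total_8xl = instances_needed(100, 3, 8)    # 21 instances -> 168 units
```

In this toy example the 16xlarge layout strands 192 − 150 = 42 units of capacity versus 18 for the 8xlarge layout, which mirrors why the smaller size fit us better.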
The Metrics That Matter to This Workload
When designing the cluster, we paid close attention to peak disk I/O and expected storage requirements. In addition to the peak I/O demands, we also took into account the backup requirements and the spare capacity needed to handle failures. In our specific case, this means we run our cluster at ~30% of its peak capacity in terms of I/O and storage. This is not a hard-and-fast rule, and it will vary significantly depending on your performance and availability SLAs; it’s just a “sweet spot” we’ve defined for our workload at Cloudability based on our historic operational cloud compute usage data.
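A back-of-envelope way to see how a figure like ~30% falls out: multiplicative headroom factors for peak bursts, backups, and failure spare capacity stack up. The individual factors below are illustrative stand-ins, not Cloudability’s exact figures.

```python
"""Sketch: steady-state utilization target after stacking headroom factors.
All factor values are hypothetical placeholders."""

def steady_state_utilization(peak_to_avg: float = 2.0,
                             backup_overhead: float = 1.2,
                             failure_spare: float = 1.5) -> float:
    """Fraction of total capacity used on an average day, given
    multiplicative headroom for peaks, backups, and rack failure."""
    return 1.0 / (peak_to_avg * backup_overhead * failure_spare)

util = steady_state_utilization()  # 1 / 3.6, roughly 28% of peak capacity
```

With these placeholder factors, the math lands near the ~30% operating point described above; your own factors, driven by your SLAs, will land somewhere else.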
PRO TIP: Monitor this using Cloudability’s Rightsizing feature, where you can dig into key compute metrics – like CPU, memory, disk I/O, and bandwidth – to identify instances that could be modified to improve cost efficiency.
You Can Save Like This and Build a Leaner Cloud
These same principles can apply to any of AWS’s compute instance offerings. It’s all about using your operational cost and usage data to make sound planning decisions for today’s and tomorrow’s workloads.
Get in touch with our cloud compute and efficiency experts today for a look at how you can put your own cloud cost and usage data to work.