Tracking Waste on Kubernetes Clusters

Colin Spargo, Sasha Jeltuhin
November 15, 2023

Similar to virtualization, containers offer another level of flexibility to run and schedule applications on hardware. They also introduce an additional FinOps challenge: the clusters need to be managed, and the allocation of containers needs to be optimized to minimize unused resources, also known as waste. As scale increases, there are more ways to allocate resources and reduce underutilization. However, there are also more opportunities for small “underutilization gaps” to appear. While small on their own, at scale these gaps add up to a significant cost and take real effort to resolve. In this post, we’ll look at how this challenge can be solved effectively.

Choosing a FinOps KPI for containers

We will use waste as the primary KPI to measure and track our optimization progress. We chose waste because it relates directly to efficiency and can be understood at all levels of an organization. Waste is also never beneficial, so by measuring even small amounts, we drive maximum efficiency.

Measuring waste for containers

How do you measure waste in Kubernetes? In CPU or memory units? On the Apptio Kubernetes Platform (AKP), we measure it in dollars, a unit that everyone throughout the business can understand. Dollars also let us assign the right amount of effort or tailor our approach to resolving the waste, because it is critical not to spend more on resolving the waste than the waste itself costs. Hardware units (such as CPU or memory) can lack meaning across the business and can also vary depending on the vendor, the type of hardware, or even the version of the hardware.

Understanding cost with Cloudability

We use our own tool, Cloudability, to track infrastructure spending and measure the progress of our FinOps programs. Cloudability allows us to zoom into the hosting cost of each individual service. It leverages resource tags to view the spending across different dimensions, such as an application, a service, or an instance type. The cost of application containers, for example, is derived from the cost of the underlying compute instances and is based on the share of the compute resources that the container is requesting.
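As a rough illustration of request-based allocation, the sketch below prices a container as a share of its node's hourly cost. The post does not describe the exact formula Cloudability uses, so the weighting here (an even blend of the CPU and memory request fractions) is an assumption, and the `Node`, `ContainerRequest`, and `container_hourly_cost` names, along with all figures, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    hourly_cost: float  # dollars per hour for the compute instance
    cpu_cores: float    # CPU cores available on the node
    memory_gib: float   # memory available on the node, in GiB


@dataclass
class ContainerRequest:
    service: str        # value of the container's service tag
    cpu_cores: float    # CPU requested by the container
    memory_gib: float   # memory requested by the container, in GiB


def container_hourly_cost(node: Node, req: ContainerRequest,
                          cpu_weight: float = 0.5) -> float:
    """Allocate node cost to a container by its share of requested resources.

    ASSUMPTION: the share is a weighted blend of the CPU and memory request
    fractions. The real allocation model may differ.
    """
    cpu_share = req.cpu_cores / node.cpu_cores
    mem_share = req.memory_gib / node.memory_gib
    share = cpu_weight * cpu_share + (1 - cpu_weight) * mem_share
    return node.hourly_cost * share


# Hypothetical node and container request
node = Node(hourly_cost=0.40, cpu_cores=8, memory_gib=32)
req = ContainerRequest(service="billing-api", cpu_cores=2, memory_gib=4)
cost = container_hourly_cost(node, req)
print(f"{req.service}: ${cost:.4f} per hour")
```

Because the allocation is driven by requests rather than actual usage, a container that requests far more than it needs still looks "used" under this model. That is a separate optimization question from the node-level waste discussed in this post.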

Our approach to measuring waste

Each cluster on our platform has several instance groups that are designated for different types of workloads. Each instance group carries a label of the form <cluster platform>-<instance group>-<service>. The label identifies the group as part of the container platform, its function within the platform, and the service or application it belongs to. Each cluster node has a service tag whose value is inherited from the instance group label, <cluster platform>-<instance group>-<service>. In addition, all application containers have a service tag that identifies the service that is running in the container. Cloudability allocates cost against the cluster node tag as soon as the node is provisioned and has joined the cluster. Cost allocation against the service tag starts as soon as an application container with that tag is deployed.
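To make the tag format concrete, here is a minimal sketch of splitting a node tag into its three parts. It assumes that the platform and instance group names contain no hyphens, while the service name may. The post does not state this rule, and the `NodeTag` type, the `parse_node_tag` helper, and the example tag value are all hypothetical.

```python
from typing import NamedTuple


class NodeTag(NamedTuple):
    platform: str
    instance_group: str
    service: str


def parse_node_tag(value: str) -> NodeTag:
    """Split '<cluster platform>-<instance group>-<service>' into its parts.

    ASSUMPTION: only the service component may itself contain hyphens,
    so the string is split on the first two hyphens only.
    """
    parts = value.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected node tag format: {value!r}")
    return NodeTag(*parts)


# Hypothetical tag value
tag = parse_node_tag("akp-general-billing-api")
print(tag.platform, tag.instance_group, tag.service)
```

Grouping spend by the parsed `service` field on nodes, and by the service tag on containers, gives the two cost figures that the waste calculation compares.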

Using this approach and tagging scheme, we now know the cost of all the underlying compute resources and the cost of the resources that are being used for a specific service.

We can then calculate the waste as:

Waste = Cost of the underlying resources – Cost of the resources used for the services

The difference between these two values is the cost of unused resources on the cluster node, which is an indicator of inefficient node utilization and container scheduling.
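The calculation can be sketched in a few lines. The function names and the dollar figures below are hypothetical, and in practice both inputs would come from cost reports grouped by the node tag and the service tag.

```python
from typing import Mapping


def node_waste(node_cost: float, service_costs: Mapping[str, float]) -> float:
    """Waste = cost of the underlying resources - cost of the resources used."""
    return node_cost - sum(service_costs.values())


def waste_ratio(node_cost: float, service_costs: Mapping[str, float]) -> float:
    """Waste as a fraction of the underlying node cost."""
    if node_cost <= 0:
        raise ValueError("node_cost must be positive")
    return node_waste(node_cost, service_costs) / node_cost


# Hypothetical daily figures for one cluster node, in dollars
node_cost = 9.60
service_costs = {"billing-api": 3.20, "report-worker": 4.00}

w = node_waste(node_cost, service_costs)
r = waste_ratio(node_cost, service_costs)
print(f"waste: ${w:.2f} per day ({r:.0%} of node cost)")
```

Reporting the ratio alongside the dollar amount is a design choice for this sketch: the dollar figure shows where the largest savings are, while the ratio makes nodes of different sizes comparable.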

In the blog post Blending FinOps with Observability, we showed an example of how we used Cloudability to track waste and detect a costly problem with a large-scale deployment.
