Managing K8s Agent Updates at Scale with Helm and Terraform

Learn how Helm and Terraform can help platform teams roll out agent updates across multiple Kubernetes clusters with a repeatable, reviewable, and scalable deployment pattern.

Managing an agent on a single Kubernetes cluster is usually straightforward. Managing that same agent across five, ten, or fifty clusters is where things get harder. When you need to roll out an agent update consistently across environments, manual installs can be difficult to audit, prone to drift, and time-consuming to repeat.

That challenge shows up quickly when platform teams need to standardize agent deployment, validate updates safely, and avoid treating every cluster like a one-off project. The larger the environment gets, the more important it becomes to use a deployment pattern that is repeatable, reviewable, and easy to scale.

In this post, we’ll look at how Helm and Terraform can work together to make agent rollouts more manageable at scale, using the move to IBM Cloudability’s new IBM FinOps Agent as a practical example.

What is Helm, in plain English

Helm is the package manager for Kubernetes. It gives teams a standard way to install, configure, upgrade, and manage software in a cluster.

Instead of hand-authoring and reapplying raw YAML for every deployment, Helm packages Kubernetes resources into a chart with configurable values. That makes it easier to install the same software repeatedly, apply environment-specific settings cleanly, and manage upgrades in a more structured way.

That matters here because the IBM FinOps Agent is packaged as a Helm chart.

Where Terraform fits in

Helm solves package deployment inside Kubernetes. Terraform uses infrastructure as code to help teams provision and manage infrastructure across an organization. With Terraform, teams can manage the Helm deployment process itself as code across many environments.

That distinction matters. Terraform is not replacing Helm. It is orchestrating Helm in a way that makes large-scale rollouts easier to manage.

With Terraform, you can:

  • Define the desired deployment once
  • Pass cluster-specific values at runtime
  • Commit changes to version control
  • Make upgrades and rollbacks reviewable
  • Avoid manually repeating Helm commands for every cluster

That makes Terraform especially useful when a team is managing many agent deployments and wants a more standardized rollout model.

A practical example: rolling out the IBM FinOps Agent

The IBM FinOps Agent is the new unified agent across products within the IBM FinOps Suite. It improves reliability and resilience, supports secure credential handling through Kubernetes Secrets, and is the path forward for IBM Cloudability Advanced Containers and future cost optimization capabilities.

For platform teams managing many Kubernetes clusters, the challenge is rolling it out in a way that is consistent, easy to validate, and straightforward to upgrade later.

That is exactly where Helm and Terraform become useful. If you’ve never used Terraform before, don’t worry—we’ll walk through every file and explain what each piece does.

1. What you need before starting

Before you begin, make sure you have the tooling and access needed to deploy across all target clusters.

Tool        Minimum version   What it’s for
terraform   1.3+              Running your infrastructure code
helm        3.8.0+            Inspecting releases from the command line (the Terraform Helm provider installs the chart itself)
kubectl     any recent        Verifying the deployment afterward

You’ll also need:

  • Kubeconfig access to each cluster you want to deploy to. Run kubectl config get-contexts to see what’s available.
  • An existing Kubernetes secret on each cluster that holds the federated storage configuration. This example assumes that secret is already in place—Terraform will reference it by name but won’t create or manage it.

New to Terraform? Run terraform -version to check if it’s installed. If not, the easiest way to install it is via the official downloads page or brew install terraform on macOS.
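The minimums in the table can be checked before the first run. The sketch below is one way to do that, not an official preflight script: version_ge and require_tool are illustrative names, and the comparison assumes sort -V is available (GNU coreutils or a recent macOS).

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; the names are illustrative only.

# version_ge A B: succeeds when dotted version A is >= dotted version B.
version_ge() {
  # sort -V orders version strings; if B sorts first (or ties), then A >= B.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# require_tool NAME: succeeds when NAME is on PATH, otherwise prints a hint.
require_tool() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}
```

For example, `require_tool terraform && require_tool helm && require_tool kubectl` confirms the binaries are present, and `version_ge 3.12.1 3.8.0` succeeds while `version_ge 3.7.2 3.8.0` fails.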

Cloudability users migrating from the legacy metrics agent: This walkthrough focuses on scaling the Helm deployment pattern with Terraform. Before using it in production, complete the specific IBM FinOps Agent prerequisites from the provisioning docs.

2. How we’ll organize our Terraform code

Since each cluster gets its own Terraform run, the code is refreshingly simple. The root module calls the agent module exactly once, and the cluster-specific values come in as variables at run time.

finops-agent/
├── providers.tf          # one config per provider, no aliases
├── variables.tf          # cluster_id, kube_context, chart_version
├── main.tf               # one module call
├── deploy.sh             # the loop that runs this for every cluster
└── modules/
    └── finops-agent/
        ├── variables.tf  # what the module accepts
        ├── main.tf       # creates namespace + installs chart
        └── outputs.tf    # release status

The module stays simple and unchanged across every cluster. What varies is only the values passed in at run time by deploy.sh.
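If you want to start from empty files, the layout above can be created in one step. This is a convenience sketch only; scaffold_layout is an illustrative name, and the file names are taken from the tree above.

```shell
#!/usr/bin/env bash
# Sketch: create the empty file layout shown above under a base directory.
# scaffold_layout is a hypothetical helper, not part of Terraform or Helm.
scaffold_layout() {
  local base="${1:-finops-agent}"
  mkdir -p "$base/modules/finops-agent"
  # Root module files plus the deploy loop.
  touch "$base/providers.tf" "$base/variables.tf" "$base/main.tf" "$base/deploy.sh"
  # Reusable module files.
  touch "$base/modules/finops-agent/variables.tf" \
        "$base/modules/finops-agent/main.tf" \
        "$base/modules/finops-agent/outputs.tf"
}
```

Running `scaffold_layout` with no arguments creates finops-agent/ in the current directory. Because touch does not truncate, re-running it leaves existing files intact.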

3. Building the reusable module

Let’s write the three files inside modules/finops-agent/. Once written, you won’t need to touch them again regardless of how many clusters you add.

Variables — what the module needs to know

The module needs three things: which cluster it is, which version to install, and the name of the existing secret that holds the storage config. A fourth variable, namespace, is optional and defaults to ibm-finops-agent.

variable "cluster_id" {
  description = "A globally unique name for this cluster. If you reuse cluster names across regions, append the region — e.g. prod-us-east-1."
  type        = string
}

variable "chart_version" {
  description = "The Helm chart version to install. Must match the image tag exactly."
  type        = string
}

variable "namespace" {
  description = "The Kubernetes namespace to deploy into."
  type        = string
  default     = "ibm-finops-agent"
}

variable "storage_secret_name" {
  description = "Name of the existing Kubernetes secret that holds the federated storage config. This secret must already exist on the cluster before applying."
  type        = string
  default     = "finops-federated-storage"
}
modules/finops-agent/variables.tf

For Cloudability Legacy Agent Migrations: cluster_id should match the existing CLOUDABILITY_CLUSTER_NAME to avoid cost ingestion issues.

Notice that storage_secret_name has a default value. If all your clusters use the same secret name—which is the common case—you never need to pass this in explicitly.
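Since Terraform only references the secret by name, it can help to confirm the secret is present before applying. The sketch below is an assumption-laden example rather than a required step: secret_exists is an illustrative name, its defaults mirror the module defaults above, and it assumes the secret lives in the agent's namespace.

```shell
#!/usr/bin/env bash
# Sketch: confirm the storage secret exists on a cluster before applying.
# secret_exists is a hypothetical helper. It assumes the secret is stored in
# the agent namespace; adjust the namespace argument if yours lives elsewhere.
secret_exists() {
  local context="$1"
  local namespace="${2:-ibm-finops-agent}"
  local secret="${3:-finops-federated-storage}"
  kubectl get secret "$secret" \
    --namespace "$namespace" \
    --context "$context" >/dev/null 2>&1
}
```

A call such as `secret_exists "$kube_context" || echo "secret missing on $kube_context"` could run at the top of each loop iteration in deploy.sh.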

Main — the deployment logic

This file does two things: creates a namespace, then installs the Helm chart. The storage secret already exists on the cluster, so we just tell the agent its name.

# Step 1: Create the namespace.
# If the namespace was created outside of Terraform (e.g. by a previous manual
# install), the first apply fails with an "already exists" conflict. To adopt
# the existing namespace into state instead, run this from the root module
# (the directory where deploy.sh runs terraform) before applying, passing the
# same -state and -var flags that deploy.sh uses for that cluster:
#   terraform import module.finops_agent.kubernetes_namespace.this ibm-finops-agent
# Alternatively, remove this resource, then set namespace = var.namespace and
# create_namespace = true in the helm_release block below.

resource "kubernetes_namespace" "this" {
  metadata {
    name = var.namespace
  }
}

# Step 2: Install the FinOps Agent Helm chart.
resource "helm_release" "finops_agent" {
  name       = "ibm-finops-agent"
  repository = "https://kubecost.github.io/finops-agent-chart"
  chart      = "finops-agent"
  version    = var.chart_version
  namespace  = kubernetes_namespace.this.metadata[0].name

  # Don't try to create the namespace — we already did that above.
  create_namespace = false

  # If the install or upgrade fails, Helm purges or rolls back automatically.
  # atomic also implies wait = true, so Terraform won't report success until
  # the agent pod is actually Running. Default timeout is 5 minutes.
  atomic = true

  # The two required parameters every agent installation must have.
  set {
    name  = "global.clusterId"
    value = var.cluster_id
  }

  set {
    name  = "global.federatedStorage.existingSecret"
    value = var.storage_secret_name
  }

  depends_on = [kubernetes_namespace.this] # create namespace before running Helm install
}
modules/finops-agent/main.tf
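The import step described in the comments above can be scripted per cluster. The following is a sketch under stated assumptions: adopt_namespace is an illustrative name, and it assumes the per-cluster states/ directory and the cluster_id and kube_context variables used by deploy.sh later in this post. If the storage secret was created in the agent namespace ahead of time, the namespace already exists, so this path is likely to apply.

```shell
#!/usr/bin/env bash
# Sketch: import a pre-existing namespace into one cluster's state file.
# adopt_namespace is a hypothetical helper; the resource address comes from
# the module above, and the -state/-var flags mirror deploy.sh.
adopt_namespace() {
  local cluster_id="$1"
  local kube_context="$2"
  local namespace="${3:-ibm-finops-agent}"
  local state="states/${cluster_id}.tfstate"
  local addr="module.finops_agent.kubernetes_namespace.this"

  # Already tracked in this cluster's state: nothing to do.
  if terraform state list -state="$state" 2>/dev/null | grep -Fxq -- "$addr"; then
    return 0
  fi

  # Namespace not present on the cluster: terraform apply will create it.
  kubectl get namespace "$namespace" --context "$kube_context" >/dev/null 2>&1 || return 0

  terraform import \
    -state="$state" \
    -var="cluster_id=${cluster_id}" \
    -var="kube_context=${kube_context}" \
    "$addr" "$namespace"
}
```

Calling `adopt_namespace "$cluster_id" "$kube_context"` before each apply makes the first run tolerant of namespaces left behind by manual installs.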

Outputs — what the module reports back

output "release_status" {
  description = "The Helm release status (e.g. 'deployed')."
  value       = helm_release.finops_agent.status
}
modules/finops-agent/outputs.tf

4. The root configuration

Providers — one cluster, no aliases

Because each Terraform run targets a single cluster, the provider config is clean. The kubeconfig context is just a variable passed in at run time.

terraform {
  required_version = ">= 1.3"

  required_providers {
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11" # stay on 2.x; provider 3.x changes the set {} and kubernetes {} block syntax used here
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.23"
    }
  }
}

provider "kubernetes" {
  config_path    = "~/.kube/config"
  config_context = var.kube_context
}

provider "helm" {
  kubernetes {
    config_path    = "~/.kube/config"
    config_context = var.kube_context
  }
}
providers.tf
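Because the providers receive only a context name, a typo in that name surfaces as a provider error during plan or apply. A quick check against the local kubeconfig can catch it earlier. This is a minimal sketch; context_exists is an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the named context is present in the kubeconfig.
# context_exists is a hypothetical helper built on
# `kubectl config get-contexts -o name`, which prints one context per line.
context_exists() {
  kubectl config get-contexts -o name | grep -Fxq -- "$1"
}
```

For example, `context_exists "$kube_context" || { echo "unknown context: $kube_context" >&2; exit 1; }` fails fast before Terraform starts.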

Variables — your inputs

variable "kube_context" {
  description = "The kubeconfig context for the cluster being deployed to. Run 'kubectl config get-contexts' to see available contexts."
  type        = string
}

variable "cluster_id" {
  description = "A globally unique identifier for this cluster."
  type        = string
}

variable "chart_version" {
  description = "FinOps Agent Helm chart version to install."
  type        = string
  default     = "1.0.15"
}
variables.tf

Main — one module call

With everything parameterized, main.tf is just a single module call. No cluster-specific logic lives here at all.

module "finops_agent" {
  source = "./modules/finops-agent"

  cluster_id    = var.cluster_id
  chart_version = var.chart_version
}
main.tf

5. Deploying at scale

Run terraform init once to download the provider plugins. You’ll need to re-run it any time you change provider version constraints or add a new provider.

terraform init

Now for the loop. Create a deploy.sh script at the root of your project. It holds your full cluster list and runs Terraform against each one, keeping a separate state file per cluster so they’re fully independent.

#!/usr/bin/env bash
set -euo pipefail

CHART_VERSION="1.0.15"

# Add a line here for every cluster.
# Format: "cluster_id|kube_context"
clusters=(
  "prod-us-east-1|arn:aws:eks:us-east-1:123456789012:cluster/prod-a"
  "prod-us-west-2|arn:aws:eks:us-west-2:123456789012:cluster/prod-b"
  "prod-eu-west-1|arn:aws:eks:eu-west-1:123456789012:cluster/prod-c"
)

for entry in "${clusters[@]}"; do
  cluster_id="${entry%%|*}"    # everything before the |
  kube_context="${entry##*|}"  # everything after the |

  echo "==> Deploying to $cluster_id"

  terraform apply \
    -auto-approve \
    -state="states/${cluster_id}.tfstate" \
    -var="cluster_id=${cluster_id}" \
    -var="kube_context=${kube_context}" \
    -var="chart_version=${CHART_VERSION}"

  echo "==> Done: $cluster_id"
done
deploy.sh

Make it executable, then run it:

chmod +x deploy.sh
mkdir -p states
./deploy.sh

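Each cluster gets its own states/<cluster-id>.tfstate file, so a failure on one cluster has no effect on the state or releases of the others. Note that because deploy.sh uses set -e, the script stops at the first failed apply: clusters already processed are unaffected, and the remaining clusters are not attempted until you re-run it.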
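If you would rather attempt every cluster and collect the failures at the end, one option is sketched below. deploy_one and deploy_all are illustrative names; deploy_one wraps the same terraform apply call used in deploy.sh, and the entries use the same "cluster_id|kube_context" format.

```shell
#!/usr/bin/env bash
# Sketch: a deploy loop variant that keeps going after a failed cluster.
# deploy_one and deploy_all are hypothetical helpers, not part of Terraform.
CHART_VERSION="${CHART_VERSION:-1.0.15}"

# deploy_one CLUSTER_ID KUBE_CONTEXT: same apply call as deploy.sh.
deploy_one() {
  terraform apply \
    -auto-approve \
    -state="states/$1.tfstate" \
    -var="cluster_id=$1" \
    -var="kube_context=$2" \
    -var="chart_version=${CHART_VERSION}"
}

# deploy_all ENTRY...: each ENTRY is "cluster_id|kube_context".
# Prints the failed cluster ids and returns non-zero if any apply failed.
deploy_all() {
  local failed="" entry cluster_id kube_context
  for entry in "$@"; do
    cluster_id="${entry%%|*}"    # everything before the |
    kube_context="${entry##*|}"  # everything after the |
    if deploy_one "$cluster_id" "$kube_context"; then
      echo "==> Done: $cluster_id"
    else
      echo "==> FAILED: $cluster_id" >&2
      failed="$failed $cluster_id"
    fi
  done
  [ -z "$failed" ] || { echo "failed:$failed"; return 1; }
}
```

With the array from deploy.sh, the call would be `deploy_all "${clusters[@]}"`. Running deploy_one inside an if keeps a failed apply from aborting the whole loop.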

Adding a new cluster: Just add a line to the clusters array in deploy.sh and re-run the script. Terraform still runs against every cluster, but it reports no changes for those already up to date and only acts on the new one.
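Because the state file name is derived from cluster_id, two entries with the same id would share one state file. A small guard can catch that as the list grows. This is a sketch only; duplicate_ids is an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: print any cluster_id that appears more than once in the entry list.
# duplicate_ids is a hypothetical helper; entries are "cluster_id|kube_context".
duplicate_ids() {
  local entry
  for entry in "$@"; do
    printf '%s\n' "${entry%%|*}"   # keep only the cluster_id part
  done | sort | uniq -d
}
```

In deploy.sh this could run once before the loop, for example `dups="$(duplicate_ids "${clusters[@]}")"` followed by an exit if the result is non-empty.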

Using remote state in production: Local state files are fine to start with; in a team setting you’ll want a shared, locked backend like S3. Add a partial backend "s3" {} block inside the terraform {} block in providers.tf, and supply the bucket and region either in that block or through additional -backend-config arguments. Then, in deploy.sh, run terraform init -reconfigure -backend-config="key=${cluster_id}/terraform.tfstate" before each apply and drop the -state flag. The per-cluster key gives you the same isolation as the local states/ directory.
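The per-cluster init and apply sequence for a remote backend might look like the sketch below. backend_key and deploy_remote are illustrative names, and the sketch assumes the bucket and region are already set in the backend block, so only the key changes per cluster.

```shell
#!/usr/bin/env bash
# Sketch: per-cluster init + apply against a shared backend.
# backend_key and deploy_remote are hypothetical helpers. Assumes the backend
# block already carries bucket/region, so only the key varies per cluster.

# backend_key CLUSTER_ID: the per-cluster state key described above.
backend_key() {
  printf '%s/terraform.tfstate' "$1"
}

# deploy_remote CLUSTER_ID KUBE_CONTEXT CHART_VERSION
deploy_remote() {
  local cluster_id="$1" kube_context="$2" chart_version="$3"
  # Re-point the backend at this cluster's key before applying.
  terraform init -reconfigure -input=false \
    -backend-config="key=$(backend_key "$cluster_id")" || return 1
  terraform apply -auto-approve \
    -var="cluster_id=${cluster_id}" \
    -var="kube_context=${kube_context}" \
    -var="chart_version=${chart_version}"
}
```

Inside the loop, `deploy_remote "$cluster_id" "$kube_context" "$CHART_VERSION"` would replace the terraform apply call that uses -state.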

6. Verifying it worked

After the script completes, check that the agent is running on each cluster:

# Check pod status
kubectl get pods -n ibm-finops-agent --context <your-context-name> 

# Expected output:
# NAME                                  READY   STATUS    RESTARTS   AGE
# ibm-finops-agent-57b9c8b699-7dpth     1/1     Running   0          2m

# Tail the logs to confirm it's exporting data
kubectl logs -n ibm-finops-agent \
  -l app.kubernetes.io/name=finops-agent \
  --context <your-context-name> \
  --tail=50

If the pod is in CrashLoopBackOff, check the logs for a panic message. The most common cause is the storage secret not existing yet on the cluster, or the clusterId being unset. The agent will tell you exactly what it can’t find.
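Checking clusters one at a time gets tedious as the list grows. The sketch below loops over contexts and reports which ones have a Ready agent pod. verify_cluster and verify_all are illustrative names; the namespace and label selector are the ones used in the kubectl commands above.

```shell
#!/usr/bin/env bash
# Sketch: report agent readiness across many contexts.
# verify_cluster and verify_all are hypothetical helpers.

# verify_cluster CONTEXT [TIMEOUT]: waits for the agent pod to become Ready.
verify_cluster() {
  kubectl wait pod \
    --for=condition=Ready \
    --selector app.kubernetes.io/name=finops-agent \
    --namespace ibm-finops-agent \
    --context "$1" \
    --timeout="${2:-120s}"
}

# verify_all CONTEXT...: prints one line per context, returns non-zero on any failure.
verify_all() {
  local rc=0 ctx
  for ctx in "$@"; do
    if verify_cluster "$ctx" >/dev/null 2>&1; then
      printf 'ok   %s\n' "$ctx"
    else
      printf 'FAIL %s\n' "$ctx"
      rc=1
    fi
  done
  return "$rc"
}
```

A context marked FAIL is the place to run the kubectl logs command above and look for the missing secret or unset clusterId.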

For Cloudability Legacy Agent Migrations: The first successful upload log can take up to 10 minutes to appear. We recommend keeping the legacy metrics agent running until the new IBM FinOps Agent has been uploading successfully for 24 hours. After validation is confirmed, the legacy metrics agent can be removed.

7. Upgrading the agent

To upgrade every cluster, update CHART_VERSION at the top of deploy.sh and re-run it. Terraform compares the desired version against each cluster’s state file and only applies the change where needed.

CHART_VERSION="1.0.16"  # bump this, then re-run ./deploy.sh
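To preview which clusters an upgrade would touch before applying anything, terraform plan can be run with -detailed-exitcode, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending. The sketch below wraps that; plan_status is an illustrative name, and the flags mirror the apply call in deploy.sh.

```shell
#!/usr/bin/env bash
# Sketch: classify one cluster as unchanged / changes / error without applying.
# plan_status is a hypothetical helper around `terraform plan -detailed-exitcode`.
# Usage: plan_status CLUSTER_ID KUBE_CONTEXT CHART_VERSION
plan_status() {
  local rc
  terraform plan -detailed-exitcode -input=false \
    -state="states/$1.tfstate" \
    -var="cluster_id=$1" \
    -var="kube_context=$2" \
    -var="chart_version=$3" >/dev/null 2>&1 && rc=0 || rc=$?
  case "$rc" in
    0) echo "unchanged" ;;   # state already matches the desired version
    2) echo "changes" ;;     # an apply would modify this cluster
    *) echo "error" ;;       # the plan itself failed
  esac
}
```

Looping this over the cluster list gives a reviewable summary of the rollout before ./deploy.sh is run.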

Because atomic = true is set in the module, if the new version fails to start on any cluster, Helm automatically rolls the release back to the previous working version for that cluster, and that cluster’s terraform apply reports an error. The releases and state files of the other clusters are not affected.

Version matching: The chart version must match the container image tag: chart 1.0.16 must use image tag v1.0.16. The Helm chart warns when the two do not match, so keep this pairing in mind when reading release notes.
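The pairing rule is simple enough to encode as a check. The sketch below only restates the "chart 1.0.16, image v1.0.16" rule from the note above; expected_image_tag and versions_match are illustrative names, and the chart's own warning remains the authoritative check.

```shell
#!/usr/bin/env bash
# Sketch: encode the chart-version / image-tag pairing stated above.
# expected_image_tag and versions_match are hypothetical helpers.

# expected_image_tag CHART_VERSION: prints the image tag that pairs with it.
expected_image_tag() {
  printf 'v%s' "$1"
}

# versions_match CHART_VERSION IMAGE_TAG: succeeds when the two pair up.
versions_match() {
  [ "$(expected_image_tag "$1")" = "$2" ]
}
```

For instance, `versions_match 1.0.16 v1.0.16` succeeds, while `versions_match 1.0.16 v1.0.15` fails.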

Closing

Helm is the standard deployment model for Kubernetes for a reason: it brings structure and consistency to how software is installed and managed in clusters.

Terraform builds on that by giving teams a scalable way to roll out the same pattern across many clusters, manage updates more consistently, and reduce the operational risk that comes with manual repetition.

For IBM Cloudability customers migrating to the new IBM FinOps Agent, that combination can make the move from the legacy metrics agent much easier to manage. You can refer to our migration documentation for guidance. If you need additional support, please contact your account manager or IBM Support.
