KubeHero docs

Quickstart

Five minutes from helm install to your first waste scan. Works on any cluster running Kubernetes 1.28+.

This page gets you from zero to "I can see which pods are wasting money". For the full architecture, see Overview.

Prerequisites

  • Kubernetes 1.28+ (kind, k3d, minikube all work locally)
  • kubectl authenticated to the cluster
  • helm 3.12+
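
If you want to script the version check, a portable way to compare against the 3.12 floor is `sort -V`. A minimal sketch; the version string is hard-coded here for illustration, so substitute the real output of `helm version --template '{{.Version}}'`:

```shell
# Check that the local helm meets the 3.12+ requirement.
# "have" is hard-coded for illustration; in practice use:
#   have=$(helm version --template '{{.Version}}')
have="v3.14.2"
need="v3.12.0"

# sort -V orders version strings; if "need" sorts first, "have" is new enough.
if [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n1)" = "$need" ]; then
  echo "helm OK ($have)"
else
  echo "helm too old ($have), need $need+" >&2
fi
```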

If you want the dashboards to auto-load, also install kube-prometheus-stack first. The stack-install script below handles this.

Install the full stack

For a greenfield cluster, one interactive script handles everything (cert-manager, Prometheus, ClickHouse operator, CloudNativePG, Valkey, Dex, Trivy, Tetragon, KubeHero):

git clone https://github.com/kubehero/kubehero-platform
cd kubehero-platform
./infra/demo/stack-install.sh

Or skip the interactive prompts and install everything at once:

./infra/demo/stack-install.sh --all

Install just KubeHero on an existing cluster

If your cluster already runs kube-prometheus-stack plus whatever storage backends you use, install the chart directly:

helm repo add kubehero https://charts.kubehero.io
helm install kubehero kubehero/kubehero \
  --namespace kubehero-system --create-namespace \
  --set prometheus.release=kube-prometheus-stack
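
The `--set` flag above can equivalently live in a values file, which is easier to keep under version control. This is the same key expressed as YAML, nothing more:

```yaml
# values.yaml — equivalent to --set prometheus.release=kube-prometheus-stack
prometheus:
  release: kube-prometheus-stack
```

Then pass `-f values.yaml` to `helm install` in place of the `--set` flag.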

Confirm telemetry is flowing

kubectl -n kubehero-system get pods
kubectl -n kubehero-system logs ds/kubehero-collector --tail=20

You should see attribution ok · pods=NNN · nodes=NN within 30 seconds of the agent starting. If not, see Troubleshooting.
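
If you want to script this check (for CI, say), the readiness line is easy to grep for. A minimal sketch, with the log line hard-coded to the format quoted above; the timestamp and counts are made up:

```shell
# Hypothetical health check against a sample collector log line.
# In practice, feed in: kubectl -n kubehero-system logs ds/kubehero-collector --tail=20
log_line='2025-01-01T00:00:00Z INFO attribution ok · pods=312 · nodes=14'

if printf '%s\n' "$log_line" | grep -q 'attribution ok'; then
  # Pull the pod count out of the readiness line.
  pods=$(printf '%s\n' "$log_line" | sed -n 's/.*pods=\([0-9]*\).*/\1/p')
  echo "collector healthy, tracking $pods pods"
fi
```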

Your first scan

The CLI runs against both Cloud and Self-hosted control planes:

kubehero cluster list

kubehero scan --cluster prod-us-east-1 --report waste

Expected output (excerpt, color-coded in a real terminal):

WASTE REPORT    cluster: prod-us-east-1
───────────────────────────────────────────────────────────────
● vectordb-ingress       cpu.request=16  used=0.41   $8,640/mo recoverable
● model-server-a100      gpu=8           util=12%    $18,200/mo recoverable
⚠ jobs-etl-nightly       limit=32cpu     burst=2.1   overcommit risk: HIGH
✓ frontend-gateway       cpu.request=2   used=1.6    right-sized
───────────────────────────────────────────────────────────────
total    47 pods flagged · $38,940/mo recoverable
         run `kubehero rightsize` to apply
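
How does a line like vectordb-ingress turn into a dollar figure? The arithmetic below is an assumed model (idle cores times an hourly core price times 730 hours per month), not necessarily KubeHero's exact formula, and the $0.04/core-hour price is invented for illustration:

```shell
# Sketch of the recoverable-cost arithmetic. The pricing model and the
# core_price value are illustrative assumptions, not KubeHero's real rates.
requested=16
used=0.41
core_price=0.04   # USD per core-hour (made up for this example)

awk -v r="$requested" -v u="$used" -v p="$core_price" \
  'BEGIN { printf "recoverable: $%.2f/mo\n", (r - u) * p * 730 }'
# → recoverable: $455.23/mo
```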

Apply a rightsizing recommendation

Dry-run first (default):

kubehero rightsize vectordb-ingress --dry-run=true

Live:

kubehero rightsize vectordb-ingress --dry-run=false

The operator applies the recommended request, audit-logs the previous spec, and starts the 10-minute cooldown. If something breaks, kubehero undo <audit-id> restores the original in one call.

See it in Grafana

kubectl -n monitoring port-forward svc/kps-grafana 3000:80

Open http://localhost:3000 (admin / kubehero-demo if you used the kind demo), then Dashboards → KubeHero folder. Three dashboards are pre-loaded:

  • Chargeback by team — cost per team, 30-day projection, nodepool breakdown, top workloads
  • Fleet — total spend, recoverable, per-cluster time series
  • GPU panel — utilization heatmap + per-GPU idle cost ranking

Write your first BudgetPolicy

Save as budget.yaml:

apiVersion: kubehero.kubehero.io/v1
kind: BudgetPolicy
metadata:
  name: prod-monthly
spec:
  scope:
    clusterSelector:
      matchLabels: { env: prod }
  ceiling: "$100000/mo"
  hardStop: true
  # humanArm defaults to true — policy observes until armed.
  escalation:
    - action: hpa.cap
      ratioPercent: 50
      waitAfter: "2m"
    - action: alert
      channels: ["slack://ops-oncall"]
  alertChannels:
    - "slack://ops"
    - "pagerduty://prod-p1"

Apply it:

kubectl apply -f budget.yaml

# Observe — policy is alert-only until armed:
kubectl get budgetpolicy prod-monthly -o yaml | grep -A5 status:

# Arm for active escalation:
kubehero cap --arm --policy prod-monthly
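
For lower-stakes environments, the same fields compose into an alert-only policy. A sketch reusing only fields shown above, assuming `escalation` is optional; the staging names and ceiling are examples:

```yaml
apiVersion: kubehero.kubehero.io/v1
kind: BudgetPolicy
metadata:
  name: staging-monthly
spec:
  scope:
    clusterSelector:
      matchLabels: { env: staging }
  ceiling: "$10000/mo"
  hardStop: false          # alert on overruns, never throttle
  alertChannels:
    - "slack://platform-staging"
```

Apply it with `kubectl apply -f` the same way as above.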

Next steps

  • Concepts — attribution, rightsizing, ceilings explained
  • CRD reference — every field of every CRD
  • Chargeback — team / nodepool rollup via your existing Kubernetes labels
  • Integrations — wiring details for each dependency