Quickstart
Five minutes from helm install to your first waste scan. Works on any Kubernetes 1.28+ cluster.
This page gets you from zero to "I can see which pods are wasting money" in under five minutes. For the long-form architecture, see Overview.
Prerequisites
- Kubernetes 1.28+ (kind, k3d, minikube all work locally)
- kubectl authenticated to the cluster
- helm 3.12+
If you want the dashboards to auto-load, also install kube-prometheus-stack first. The stack-install script below handles this.
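A quick way to sanity-check the version prerequisites before installing (a sketch: `ver_ge` is our own helper, not part of any tooling here, and the commented commands assume `jq` is available):

```shell
# ver_ge A B succeeds when version A >= version B (sort -V is GNU coreutils).
ver_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# On a workstation with cluster access:
# ver_ge "$(kubectl version -o json | jq -r .serverVersion.gitVersion | tr -dc '0-9.')" 1.28 \
#   || echo "cluster older than 1.28"
# ver_ge "$(helm version --template '{{.Version}}' | tr -dc '0-9.')" 3.12 \
#   || echo "helm older than 3.12"
```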
Install the full stack
For a greenfield cluster, one interactive script handles everything (cert-manager, Prometheus, ClickHouse operator, CloudNativePG, Valkey, Dex, Trivy, Tetragon, KubeHero):
git clone https://github.com/kubehero/kubehero-platform
cd kubehero-platform
./infra/demo/stack-install.sh
Or skip the interactive prompts and install everything at once:
./infra/demo/stack-install.sh --all
Install just KubeHero on an existing cluster
If your cluster already runs kube-prometheus-stack + whatever storage you use, install the chart directly:
helm repo add kubehero https://charts.kubehero.io
helm install kubehero kubehero/kubehero \
--namespace kubehero-system --create-namespace \
--set prometheus.release=kube-prometheus-stack
Confirm telemetry is flowing
kubectl -n kubehero-system get pods
kubectl -n kubehero-system logs ds/kubehero-collector --tail=20
You should see attribution ok · pods=NNN · nodes=NN within 30 seconds of the agent starting. If not, see Troubleshooting.
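If you would rather script the readiness check than eyeball the logs, the counts can be pulled out of that line with standard tools (a sketch: `parse_attrib` is our own helper and assumes the log layout shown above):

```shell
# Prints "<pods> <nodes>" for a line like: attribution ok · pods=412 · nodes=18
parse_attrib() {
  sed -n 's/.*pods=\([0-9][0-9]*\).*nodes=\([0-9][0-9]*\).*/\1 \2/p'
}

# On a live cluster:
# kubectl -n kubehero-system logs ds/kubehero-collector --tail=200 \
#   | grep 'attribution ok' | tail -n1 | parse_attrib
```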
Your first scan
The CLI runs against both Cloud and Self-hosted control planes:
kubehero cluster list
kubehero scan --cluster prod-us-east-1 --report waste
Expected output (excerpt, color-coded in a real terminal):
WASTE REPORT cluster-prod-us-east-1
───────────────────────────────────────────────────────────────
● vectordb-ingress cpu.request=16 used=0.41 $8,640/mo recoverable
● model-server-a100 gpu=8 util=12% $18,200/mo recoverable
⚠ jobs-etl-nightly limit=32cpu burst=2.1 overcommit risk: HIGH
✓ frontend-gateway cpu.request=2 used=1.6 right-sized
───────────────────────────────────────────────────────────────
total 47 pods flagged · $38,940/mo recoverable
run `kubehero rightsize` to apply
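Because the report is plain text, the recoverable figures can be cross-checked with standard tools (a sketch: `sum_recoverable` is our own helper and assumes the layout above, so pipe in only the per-pod `●` rows, not the `total` footer):

```shell
# Sums the $/mo figures on lines flagged "recoverable".
sum_recoverable() {
  grep recoverable | grep -o '\$[0-9,]*' | tr -d '$,' | awk '{s+=$1} END {print s}'
}

# kubehero scan --cluster prod-us-east-1 --report waste | grep '●' | sum_recoverable
```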
Apply a rightsizing recommendation
Dry-run first (default):
kubehero rightsize vectordb-ingress --dry-run=true
Live:
kubehero rightsize vectordb-ingress --dry-run=false
The operator applies the recommended request, audit-logs the previous spec, and starts the 10-minute cooldown. If something breaks, kubehero undo <audit-id> restores the original in one call.
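If you automate rightsizing, it can help to gate the live call behind an explicit confirmation (a sketch: `confirm` is our own helper, not part of the KubeHero CLI; only the two documented `--dry-run` flags are used):

```shell
# Prompts on stderr; succeeds only on an exact "y".
confirm() { printf '%s [y/N] ' "$1" >&2; read -r reply; [ "$reply" = "y" ]; }

# kubehero rightsize vectordb-ingress --dry-run=true
# confirm "Apply for real?" && kubehero rightsize vectordb-ingress --dry-run=false
```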
See it in Grafana
kubectl -n monitoring port-forward svc/kps-grafana 3000:80
Open http://localhost:3000 (admin / kubehero-demo if you used the kind demo), then Dashboards → KubeHero folder. Three dashboards are pre-loaded:
- Chargeback by team — cost per team, 30-day projection, nodepool breakdown, top workloads
- Fleet — total spend, recoverable, per-cluster time series
- GPU panel — utilization heatmap + per-GPU idle cost ranking
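If you installed kube-prometheus-stack yourself rather than via the kind demo, the admin password usually lives in the chart's Grafana secret instead of being `kubehero-demo`. A sketch (the secret name and key are assumptions matching the `kps-grafana` service above; adjust to your release name):

```shell
# kubectl's jsonpath output for secret data is base64-encoded, so decode it.
decode() { base64 -d; }

# kubectl -n monitoring get secret kps-grafana \
#   -o jsonpath='{.data.admin-password}' | decode; echo
```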
Write your first BudgetPolicy
Save as budget.yaml:
apiVersion: kubehero.kubehero.io/v1
kind: BudgetPolicy
metadata:
  name: prod-monthly
spec:
  scope:
    clusterSelector:
      matchLabels: { env: prod }
  ceiling: "$100000/mo"
  hardStop: true
  # humanArm defaults to true — policy observes until armed.
  escalation:
    - action: hpa.cap
      ratioPercent: 50
      waitAfter: "2m"
    - action: alert
      channels: ["slack://ops-oncall"]
  alertChannels:
    - "slack://ops"
    - "pagerduty://prod-p1"
Apply it:
kubectl apply -f budget.yaml
# Observe — policy is alert-only until armed:
kubectl get budgetpolicy prod-monthly -o yaml | grep -A5 status:
# Arm for active escalation:
kubehero cap --arm --policy prod-monthly
Next steps
- Concepts — attribution, rightsizing, ceilings explained
- CRD reference — every field of every CRD
- Chargeback — team / nodepool rollup via your existing Kubernetes labels
- Integrations — wiring details for each dependency