Metrics reference
Every metric KubeHero exports, the labels it carries, and the PromQL recording rules we ship.
This is the complete contract the collector's /metrics endpoint publishes. Our PrometheusRule, Grafana dashboards, and every Connect-RPC reader assume these exact names and label sets. A schema change here breaks everything downstream, which is why there's a test in services/collector/internal/metrics/ that pins it.
Labels — the chargeback axis
Every pod-level metric carries these labels. They are the primary grouping dimension for every dashboard and recording rule.
| Label | Populated by | Example | Notes |
|---|---|---|---|
| namespace | Pod's namespace | ml-inference | |
| pod | Pod's name | vectordb-ingress-7b9fc | |
| team | kubehero.io/team pod label | retrieval | falls back to the chargeback.defaultTeam value |
| cost_center | kubehero.io/cost-center pod label | ml-platform | optional BU-level rollup |
| nodepool | cloud provider's nodepool label on the host node | aks-nc24ads · gke-g2 · eks-p5 | chart auto-detects; override with chargeback.nodepoolLabel |
| cloud | derived from the node's provider label | aws · gcp · azure | |
| region | node's topology.kubernetes.io/region | us-east-1 · westeurope | |
| cluster | cluster name from control-plane config | eks-use1-prod | |
| gpu_kind | only when the pod requests a GPU | A100 80GB · H100 80GB · L4 24GB | via DCGM device metadata |
The team, cost_center, and nodepool labels form the primary chargeback axis. See Chargeback for how they're used.
Series catalogue
Cost (the headline numbers)
kubehero_pod_cost_usd_per_second (gauge)
Attributed $/sec for one pod-second of compute. The product of:
- the pod's share of its node's resources (blended 50/50 across CPU and memory)
- the node's per-second list price from the Pricing engine
- the active lifecycle discount (on-demand / spot / savings-plan / committed)
GPU pods additionally attribute their node's GPU cost proportional to their reserved count (or MIG-slice fraction).
```promql
sum(rate(kubehero_pod_cost_usd_per_second[5m])) by (team) * 3600
```
→ "Team's $/hour, averaged over the last 5 minutes."
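As a sketch of that product with hypothetical numbers — a pod holding half its node's CPU and a quarter of its memory, on a $1.80/hour on-demand node (no discount):

```
share = 0.5 * cpu_share + 0.5 * mem_share
      = 0.5 * 0.50      + 0.5 * 0.25        = 0.375
cost  = share * node_$/hour / 3600 * discount
      = 0.375 * 1.80      / 3600 * 1.0      ≈ $0.0001875/sec
```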
kubehero_pod_recoverable_usd_per_second (gauge)
Portion of pod cost reclaimable via rightsizing: (requested − used) resources, priced out. Equals zero when the pod is already within its p95 + headroom target.
```promql
sum(kubehero:team_recoverable_usd:rate1h) by (team)
```
kubehero_node_cost_usd_per_hour (gauge)
Per-hour cost of a node given its SKU + lifecycle + region. Populated by the Pricing engine, refreshed every 6 hours by default.
Labels: node, nodepool, cloud, region, sku, lifecycle.
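For example — a sketch query, not a shipped recording rule — the node-level series can break the cluster's run rate down by lifecycle, to sanity-check the spot vs on-demand mix:

```promql
# Cluster $/hour per cloud and lifecycle (on-demand / spot / savings-plan / committed).
sum(kubehero_node_cost_usd_per_hour) by (cloud, lifecycle)
```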
Resource usage
kubehero_pod_cpu_millicores (gauge)
Pod CPU usage in millicores, eBPF-attributed at 1-second resolution. Pairs with the node's allocatable CPU for share calculation.
kubehero_pod_memory_bytes (gauge)
Resident set size of the pod's cgroup, in bytes.
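A quick rollup of the two usage series onto the chargeback axis (illustrative only, not shipped as a rule):

```promql
sum(kubehero_pod_cpu_millicores) by (team) / 1000    # cores in use per team
sum(kubehero_pod_memory_bytes)   by (team) / 2^30    # GiB in use per team
```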
GPU (only when pod uses GPUs)
kubehero_pod_gpu_util_ratio (gauge)
GPU utilization, 0.0–1.0. For MIG-partitioned devices, this is the utilization of the allocated slice, not the full device.
```promql
1 - avg(kubehero_pod_gpu_util_ratio) by (team)
```
→ "Idle fraction per team — the chargeback penalty for provisioned-but-unused GPU."
kubehero_pod_gpu_memory_bytes (gauge)
VRAM usage in bytes, DCGM-reported.
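One way to spot the biggest VRAM consumers (a sketch, not a shipped rule):

```promql
# Top 10 pods by VRAM in use, keeping gpu_kind for context.
topk(10, sum(kubehero_pod_gpu_memory_bytes) by (namespace, pod, gpu_kind))
```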
Liveness
kubehero_up (gauge)
Value is 1 when the emitting process is healthy. Labeled with service (collector, control-plane, pricing-engine, operator).
```promql
up{service="kubehero-collector"} == 0
```
→ classic Prometheus alerting signal.
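A slightly stronger liveness check — a sketch, since absent() behavior depends on your scrape setup — also catches the series disappearing entirely, which an == 0 comparison alone misses:

```promql
# Fires when the collector reports unhealthy, or stops reporting at all.
kubehero_up{service="collector"} == 0
  or
absent(kubehero_up{service="collector"})
```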
Recording rules (shipped in the PrometheusRule)
The chart installs these via templates/prometheusrule.yaml. They exist so Grafana queries and alerts don't recompute aggregates at read time.
Chargeback
```promql
kubehero:pod_cost_usd:rate1m =
  sum(rate(kubehero_pod_cost_usd_per_second[1m]))
    by (namespace, pod, team, cost_center, nodepool, cloud, region, gpu_kind)
  * 60

kubehero:team_cost_usd:rate1h =
  sum(rate(kubehero_pod_cost_usd_per_second[5m])) by (team) * 3600

kubehero:team_cost_usd:rate24h =
  sum(rate(kubehero_pod_cost_usd_per_second[5m])) by (team) * 86400

kubehero:team_cost_usd:rate30d =
  sum(rate(kubehero_pod_cost_usd_per_second[5m])) by (team) * 86400 * 30

kubehero:nodepool_cost_usd:rate1h =
  sum(rate(kubehero_pod_cost_usd_per_second[5m])) by (nodepool, cloud, region) * 3600

kubehero:cost_center_cost_usd:rate1h =
  sum(rate(kubehero_pod_cost_usd_per_second[5m])) by (cost_center) * 3600

kubehero:team_gpu_idle_cost_usd:rate1h =
  sum(
    rate(kubehero_pod_cost_usd_per_second{gpu_kind!=""}[5m])
      * on(pod, namespace) group_left
    (1 - max(kubehero_pod_gpu_util_ratio) by (pod, namespace))
  ) by (team) * 3600

kubehero:team_recoverable_usd:rate1h =
  sum(rate(kubehero_pod_recoverable_usd_per_second[5m])) by (team) * 3600
```
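These rules compose at read time; for example, a team's share of total cluster spend (illustrative, not shipped):

```promql
# scalar() collapses the all-teams total so the per-team vector can divide by it.
kubehero:team_cost_usd:rate1h
  / scalar(sum(kubehero:team_cost_usd:rate1h))
```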
Alerts
```yaml
- alert: KubeHeroTeamOverBudgetProjected
  expr: |
    predict_linear(kubehero:team_cost_usd:rate24h[6h], 86400 * 30)
      > on(team) group_left
    kubehero_team_budget_usd
  for: 15m
  labels: { severity: warning, team_scope: "true" }
  annotations:
    summary: Team on track to exceed monthly budget
- alert: KubeHeroGPUIdleExcessive
  expr: kubehero:team_gpu_idle_cost_usd:rate1h > 500
  for: 1h
  labels: { severity: warning }
  annotations:
    summary: Team burning > $500/hr on idle GPUs
```
Cardinality notes
The label set is intentionally small. We do not emit:
- Per-container series (pod-level is the contract).
- Per-node kernel telemetry (cluster-level aggregates instead).
- Request-level latency (that's APM territory; use your existing OTel setup).
Target cardinality for a 500-node cluster is roughly 50k active series. Most of that is kubehero_pod_cost_usd_per_second at one series per running pod.
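To compare that estimate against reality, count active series per metric name — a standard Prometheus idiom, not KubeHero-specific:

```promql
# Active series per KubeHero metric; sum the result against the ~50k budget.
count by (__name__) ({__name__=~"kubehero_.+"})
```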
If you need finer labels, query ClickHouse directly — that's the intended escape hatch for custom analytics. See Architecture · ClickHouse.