KubeHero docs

Metrics reference

Every metric KubeHero exports, the labels it carries, and the PromQL recording rules we ship.

This is the complete contract the collector's /metrics endpoint publishes. Our PrometheusRule, Grafana dashboards, and every Connect-RPC reader assume these exact names and label sets. A schema change here breaks everything downstream, which is why there's a test in services/collector/internal/metrics/ that pins it.

Labels — the chargeback axis

Every pod-level metric carries these labels. They are the primary grouping dimension for every dashboard and recording rule.

| Label | Populated by | Example | Notes |
| --- | --- | --- | --- |
| namespace | Pod's namespace | ml-inference | |
| pod | Pod's name | vectordb-ingress-7b9fc | |
| team | kubehero.io/team pod label | retrieval | Falls back to the chargeback.defaultTeam value |
| cost_center | kubehero.io/cost-center pod label | ml-platform | Optional BU-level rollup |
| nodepool | Provider-native nodepool label on the host node | aks-nc24ads · gke-g2 · eks-p5 | Chart auto-detects; override with chargeback.nodepoolLabel |
| cloud | Derived from the node's provider label | aws · gcp · azure | |
| region | Node's topology.kubernetes.io/region | us-east-1 · westeurope | |
| cluster | Cluster name from control-plane config | eks-use1-prod | |
| gpu_kind | Set only when the pod requests a GPU | A100 80GB · H100 80GB · L4 24GB | Via DCGM device metadata |

The team, cost_center, and nodepool labels form the primary chargeback axis. See Chargeback for how they're used.
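
The team fallback behaves like the following sketch. The function name and the "unassigned" default are illustrative; the actual default comes from the chargeback.defaultTeam chart value.

```python
def resolve_team(pod_labels: dict, default_team: str) -> str:
    """Return the chargeback team for a pod.

    Mirrors the documented behavior: use the kubehero.io/team pod label
    when present, otherwise fall back to the configured default.
    (Illustrative sketch, not the collector's actual code.)
    """
    return pod_labels.get("kubehero.io/team", default_team)

resolve_team({"kubehero.io/team": "retrieval"}, "unassigned")  # → "retrieval"
resolve_team({"app": "vectordb"}, "unassigned")                # → "unassigned"
```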

Series catalogue

Cost (the headline numbers)

kubehero_pod_cost_usd_per_second (gauge)

Attributed $/sec for one pod-second of compute. The product of:

  • the pod's share of its node's resources (blended 50/50 across CPU and memory)
  • the node's per-second list price from the Pricing engine
  • the active lifecycle discount (on-demand / spot / savings-plan / committed)

GPU pods additionally attribute their node's GPU cost proportional to their reserved count (or MIG-slice fraction).
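
Putting the factors together, here is a minimal sketch of the attribution arithmetic. Parameter names, the single discount applied to both terms, and the flat GPU treatment are illustrative assumptions, not the collector's internals.

```python
def pod_cost_usd_per_second(
    pod_cpu_millicores: float,
    pod_mem_bytes: float,
    node_cpu_millicores: float,
    node_mem_bytes: float,
    node_usd_per_second: float,
    discount: float = 0.0,          # lifecycle discount, e.g. 0.6 for a 60%-off spot node
    gpu_usd_per_second: float = 0.0,
    gpu_fraction: float = 0.0,      # reserved GPUs / node GPUs, or MIG-slice fraction
) -> float:
    """Sketch: blended 50/50 CPU/memory share x node list price x
    lifecycle discount, plus a GPU share proportional to reservation."""
    cpu_share = pod_cpu_millicores / node_cpu_millicores
    mem_share = pod_mem_bytes / node_mem_bytes
    share = 0.5 * cpu_share + 0.5 * mem_share
    compute = share * node_usd_per_second * (1.0 - discount)
    gpu = gpu_fraction * gpu_usd_per_second * (1.0 - discount)
    return compute + gpu

# A pod using 2 of 8 cores and 8 of 32 GiB on a $0.001/sec on-demand node:
pod_cost_usd_per_second(2000, 8 * 2**30, 8000, 32 * 2**30, 0.001)  # → 0.00025
```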

sum(avg_over_time(kubehero_pod_cost_usd_per_second[5m])) by (team) * 3600

→ "Each team's $/hour, averaged over the last 5 minutes." The metric is a gauge, so it is windowed with avg_over_time rather than rate.

kubehero_pod_recoverable_usd_per_second (gauge)

Portion of pod cost reclaimable via rightsizing: (requested − used) resources, priced out. Equals zero when the pod is already within its p95 + headroom target.

kubehero:team_recoverable_usd:rate1h

→ "Reclaimable $/hour per team." The recording rule is already aggregated by team, so no further sum is needed.
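
A sketch of the gating logic, simplified to CPU only. The 15% headroom default and the per-millicore pricing granularity are assumptions, not KubeHero's actual defaults.

```python
def recoverable_usd_per_second(
    requested_millicores: float,
    used_p95_millicores: float,
    usd_per_millicore_second: float,
    headroom: float = 0.15,  # assumed headroom fraction; the real target is configurable
) -> float:
    """CPU-only sketch: price out capacity requested above the
    p95-plus-headroom target; report zero when already within target."""
    target = used_p95_millicores * (1.0 + headroom)
    if requested_millicores <= target:
        return 0.0
    return (requested_millicores - target) * usd_per_millicore_second

# Requests 4 cores, p95 usage 2 cores: 1700 mc above the 2300 mc target.
recoverable_usd_per_second(4000, 2000, 1e-7)
# Requests 2.2 cores against the same usage: within target, nothing to reclaim.
recoverable_usd_per_second(2200, 2000, 1e-7)  # → 0.0
```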

kubehero_node_cost_usd_per_hour (gauge)

Per-hour cost of a node given its SKU + lifecycle + region. Populated by the Pricing engine, refreshed every 6 hours by default.

Labels: node, nodepool, cloud, region, sku, lifecycle.

Resource usage

kubehero_pod_cpu_millicores (gauge)

Pod CPU usage in millicores, eBPF-attributed at 1-second resolution. Pairs with the node's allocatable CPU for share calculation.

kubehero_pod_memory_bytes (gauge)

Resident set size of the pod's cgroup, in bytes.

GPU (only when pod uses GPUs)

kubehero_pod_gpu_util_ratio (gauge)

GPU utilization, 0.0–1.0. For MIG-partitioned devices, this is the utilization of the allocated slice, not the full device.

1 - avg(kubehero_pod_gpu_util_ratio) by (team)

→ "Idle fraction per team — the chargeback penalty for provisioned-but-unused GPU."
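
The same idle-fraction arithmetic as a plain Python sketch, with one utilization sample per GPU pod:

```python
def team_gpu_idle_fraction(utils: list[float]) -> float:
    """Mirror of the PromQL above: idle fraction is one minus the
    average utilization across a team's GPU pods (each sample 0.0-1.0)."""
    return 1.0 - sum(utils) / len(utils)

# Three GPU pods at 90%, 10%, and 20% utilization:
team_gpu_idle_fraction([0.9, 0.1, 0.2])  # ≈ 0.6 idle
```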

kubehero_pod_gpu_memory_bytes (gauge)

VRAM usage in bytes, DCGM-reported.

Liveness

kubehero_up (gauge)

Value is 1 when the emitting process is healthy. Labeled with service (collector, control-plane, pricing-engine, operator).

kubehero_up{service="collector"} == 0

→ classic liveness alerting signal, analogous to Prometheus's built-in up.

Recording rules (shipped in the PrometheusRule)

The chart's templates/prometheusrule.yaml installs these at install time. They exist so Grafana dashboards and alerts read precomputed series instead of recomputing these aggregations at query time.

Chargeback

kubehero:pod_cost_usd:rate1m
  = sum(avg_over_time(kubehero_pod_cost_usd_per_second[1m]))
    by (namespace, pod, team, cost_center, nodepool, cloud, region, gpu_kind)
    * 60

kubehero:team_cost_usd:rate1h
  = sum(avg_over_time(kubehero_pod_cost_usd_per_second[5m])) by (team) * 3600

kubehero:team_cost_usd:rate24h
  = sum(avg_over_time(kubehero_pod_cost_usd_per_second[5m])) by (team) * 86400

kubehero:team_cost_usd:rate30d
  = sum(avg_over_time(kubehero_pod_cost_usd_per_second[5m])) by (team) * 86400 * 30

kubehero:nodepool_cost_usd:rate1h
  = sum(avg_over_time(kubehero_pod_cost_usd_per_second[5m])) by (nodepool, cloud, region) * 3600

kubehero:cost_center_cost_usd:rate1h
  = sum(avg_over_time(kubehero_pod_cost_usd_per_second[5m])) by (cost_center) * 3600

kubehero:team_gpu_idle_cost_usd:rate1h
  = sum(
      avg_over_time(kubehero_pod_cost_usd_per_second{gpu_kind!=""}[5m])
      * on(pod, namespace) group_left
      (1 - max(kubehero_pod_gpu_util_ratio) by (pod, namespace))
    ) by (team) * 3600

kubehero:team_recoverable_usd:rate1h
  = sum(avg_over_time(kubehero_pod_recoverable_usd_per_second[5m])) by (team) * 3600
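
The team-level cost rules differ only in the constant multiplier applied to the same $/sec base, which makes them easy to sanity-check. The $/sec figure below is made up for illustration:

```python
# One team's summed pod cost, in $/sec (illustrative figure).
usd_per_sec = 0.0125

windows = {
    "rate1h": 3600,         # seconds per hour
    "rate24h": 86400,       # seconds per day
    "rate30d": 86400 * 30,  # seconds per 30 days
}
# Scale the same base by each window's multiplier, as the rules do.
projected = {name: usd_per_sec * factor for name, factor in windows.items()}
# roughly: rate1h 45, rate24h 1080, rate30d 32400
```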

Alerts

- alert: KubeHeroTeamOverBudgetProjected
  expr: |
    predict_linear(kubehero:team_cost_usd:rate24h[6h], 86400 * 30)
      > on(team) group_left
    kubehero_team_budget_usd
  for: 15m
  labels: { severity: warning, team_scope: "true" }
  annotations:
    summary: Team on track to exceed monthly budget

- alert: KubeHeroGPUIdleExcessive
  expr: kubehero:team_gpu_idle_cost_usd:rate1h > 500
  for: 1h
  labels: { severity: warning }
  annotations:
    summary: Team burning > $500/hr on idle GPUs

Cardinality notes

The label set is intentionally small. We do not emit:

  • Per-container series (pod-level is the contract).
  • Per-node kernel telemetry (cluster-level aggregates instead).
  • Request-level latency (that's APM territory; use your existing OTel setup).

Target cardinality for a 500-node cluster is roughly 50k active series. Most of that is kubehero_pod_cost_usd_per_second at one series per running pod.
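
The 50k target can be sanity-checked with a back-of-envelope estimate. The pod density and per-pod series counts below are assumptions for illustration, not measured numbers:

```python
# Back-of-envelope series budget for a 500-node cluster.
nodes = 500
pods_per_node = 25       # assumed average density
per_pod_series = 4       # cost, recoverable, cpu, memory (GPU pods add two more)
node_series = nodes      # kubehero_node_cost_usd_per_hour: one series per node

total_series = nodes * pods_per_node * per_pod_series + node_series
# 500 * 25 * 4 + 500 = 50500, in line with the ~50k target
```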

If you need finer labels, query ClickHouse directly — that's the intended escape hatch for custom analytics. See Architecture · ClickHouse.