# Troubleshooting

Known failure modes and their runbooks. If your issue isn't listed, check the GitHub issues or reach out to support@kubehero.io.
## Agent pods aren't ready

```shell
kubectl -n kubehero-system get pods -l app.kubernetes.io/component=collector
```
### Check 1 — RBAC

The collector needs read access to pods, nodes, services, and deployments. `helm install` creates a ClusterRole when `rbac.create: true` is set. If you disabled that, or run in a locked-down cluster:

```shell
kubectl auth can-i --as=system:serviceaccount:kubehero-system:kubehero-collector \
  list pods --all-namespaces
# Expected: yes
```
If the answer is `no`: apply `templates/rbac.yaml` from the chart manually, or set `rbac.create: true` and re-run `helm upgrade`.
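If you can't re-run helm, the grant can also be recreated imperatively. A sketch — the role and binding names below are illustrative, and the chart's `templates/rbac.yaml` remains the source of truth for the exact resource list:

```shell
# Recreate a read-only grant for the collector by hand.
# Names are illustrative; match them to your install.
kubectl create clusterrole kubehero-collector-read \
  --verb=get,list,watch \
  --resource=pods,nodes,services,deployments
kubectl create clusterrolebinding kubehero-collector-read \
  --clusterrole=kubehero-collector-read \
  --serviceaccount=kubehero-system:kubehero-collector
```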
### Check 2 — hostPID / hostNetwork

The collector requests `hostPID: true` so it can attribute CPU usage to pod cgroups. GKE Autopilot and some hardened OpenShift clusters disallow this. Fallback:

```yaml
# values.yaml
collector:
  useCadvisorFallback: true
  # Disables eBPF attribution in favor of cadvisor scraping — resolution
  # drops from 1s to 60s, but the collector runs in any cluster.
```
### Check 3 — kernel version

eBPF programs need Linux kernel ≥ 5.10. Check:

```shell
kubectl -n kubehero-system exec ds/kubehero-collector -- uname -r
```

If the kernel is older: use the cadvisor fallback above, or upgrade your nodes.
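If you're checking many nodes, the comparison is easy to script with a version sort. A sketch — `kernel_ok` is a hypothetical helper, not part of the KubeHero CLI:

```shell
# kernel_ok: succeed if a kernel version string meets the 5.10 minimum.
# Hypothetical helper — not part of the KubeHero CLI.
kernel_ok() {
  # strip the distro suffix ("5.4.0-150-generic" -> "5.4.0"), then ask
  # `sort -V -C` whether "5.10 <= version" holds in version order
  printf '5.10\n%s\n' "${1%%-*}" | sort -V -C
}

kernel_ok "$(uname -r)" && echo "eBPF supported" || echo "enable useCadvisorFallback"
```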
## "no telemetry at control plane"

```shell
kubectl -n kubehero-system logs ds/kubehero-collector --tail=30 | grep -i error
```

### `ebpf: load: invalid argument`

The kernel is too old or BPF is restricted (AppArmor / SELinux / Talos hardening). Fall back to cadvisor (see above).
### `x509: certificate signed by unknown authority`

The mTLS cert doesn't match the control plane's CA. Rotate it:

```shell
kubehero cluster rotate-cert --name <cluster>
```

Re-apply the updated secret, then restart the collector DaemonSet.
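The restart can be done with a rollout; a sketch, assuming the default DaemonSet name from the chart:

```shell
# Restart the collector so it picks up the re-applied secret,
# then wait for the rollout to complete.
kubectl -n kubehero-system rollout restart ds/kubehero-collector
kubectl -n kubehero-system rollout status ds/kubehero-collector --timeout=120s
```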
### `rpc error: code = Unauthenticated`

The cluster token expired. For self-hosted installs, cert-manager rotates it weekly by default. For Cloud:

```shell
kubehero auth rotate
# then kubectl edit secret kubehero-cloud-creds and paste the new token
```
## Dashboards don't render in Grafana

### Check 1 — dashboard ConfigMaps

```shell
kubectl -n monitoring get configmap -l grafana_dashboard=1 | grep kubehero
```

You should see three ConfigMaps (chargeback / fleet / gpu). If they're absent, either `grafana.enabled` is false or your Grafana sidecar isn't watching this namespace.
### Check 2 — sidecar label match

kube-prometheus-stack's Grafana sidecar watches ConfigMaps carrying a specific label, `grafana_dashboard=1` by default. Your chart may use a different label — check:

```shell
kubectl -n monitoring get deploy kps-grafana -o yaml | grep -A2 LABEL
```

Match it in KubeHero's values:

```yaml
grafana:
  sidecarLabel: "your_label_here"
  sidecarLabelValue: "1"
```
### Check 3 — PromQL returns empty

Open Grafana → Explore → paste `kubehero:team_cost_usd:rate1h`. Empty?

- No data source: make sure the dashboard variable `${DS_PROMETHEUS}` is bound to your Prometheus datasource.
- Rules didn't install: run `kubectl get prometheusrule -A | grep kubehero-chargeback` — if nothing matches, the prometheus-operator CRDs aren't in the cluster or the discovery label doesn't match (see Prometheus integration).
- Metrics not flowing: port-forward the collector and curl `/metrics` — you should see `kubehero_pod_cost_usd_per_second` series.
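The last check can be scripted; a sketch, where port 9090 is an assumption — substitute whatever metrics port your collector exposes:

```shell
# Grab one collector pod and check its /metrics endpoint directly.
# Port 9090 is an assumption — match your collector's metrics port.
POD=$(kubectl -n kubehero-system get pods \
  -l app.kubernetes.io/component=collector \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kubehero-system port-forward "$POD" 9090:9090 &
PF_PID=$!; sleep 2
curl -s localhost:9090/metrics | grep -c kubehero_pod_cost_usd_per_second
kill "$PF_PID"
```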
## "policies not firing"

A policy observes by default; nothing runs until it's armed.

```shell
kubectl get budgetpolicy -A -o wide
# NAME                  HUMANARM   ARMED   STATUS
# prod-monthly-ceiling  true       false   awaiting arm
```

Arm it:

```shell
kubehero cap --arm --policy prod-monthly-ceiling
```
If you don't want the `humanArm` gate:

```yaml
spec:
  humanArm: false  # not recommended for production — disables the safety gate
```
## Recording rules don't evaluate

Prometheus picks up PrometheusRule resources via a label selector, and a selector mismatch is the most common reason rules are ignored.

```shell
kubectl get prometheusrule -A -l release=kube-prometheus-stack | grep kubehero
```

If this returns nothing, set `prometheus.release: <your-kps-release-name>` in KubeHero's values and run `helm upgrade`. Your Prometheus CR's `spec.ruleSelector.matchLabels.release` has to match.
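To see which labels your Prometheus actually expects, read the selector straight off the Prometheus CR (the CR name varies per install):

```shell
# Print the rule selector each Prometheus CR uses to discover PrometheusRules.
kubectl get prometheus -A \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.ruleSelector.matchLabels}{"\n"}{end}'
```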
## Postgres / ClickHouse isn't reachable

Check the connection string from within the control-plane pod:

```shell
kubectl -n kubehero-system exec deploy/kubehero-control-plane -- \
  sh -c 'psql "$DATABASE_URL" -c "select 1"'
kubectl -n kubehero-system exec deploy/kubehero-control-plane -- \
  sh -c 'clickhouse-client --host=$CLICKHOUSE_HOST --secure -q "SELECT 1"'
```
Both should return `1`. If either fails:
- Verify the `existingSecret` value references a real Secret in the right namespace.
- If using `embedded: true`, confirm the subchart (CloudNativePG or clickhouse-operator) installed its own resources in the expected namespace.
- Network policies may block egress to the DB namespace. The chart ships NetworkPolicies for this; re-apply them.
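For reference, an egress rule of the kind the chart ships looks roughly like this. A sketch only — the namespace name, pod labels, and port are assumptions; mirror the chart's actual policies:

```yaml
# Sketch: allow the control plane egress to a Postgres namespace on 5432.
# Labels and namespace names are assumptions — match your install.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kubehero-control-plane-db-egress
  namespace: kubehero-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: kubehero-control-plane
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: databases
      ports:
        - protocol: TCP
          port: 5432
```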
## Pricing quotes return `not_found`

The pricing engine refreshes its SKU catalog every 6 hours, so on a first install pricing data may be absent for up to ~6 hours. Force a refresh:

```shell
kubectl -n kubehero-system create job pricing-refresh-now \
  --from=cronjob/kubehero-pricing-engine
```
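You can then block until the one-off job finishes; a sketch, assuming the job name from the command above:

```shell
# Wait for the forced refresh to complete (up to 5 minutes).
kubectl -n kubehero-system wait --for=condition=complete \
  job/pricing-refresh-now --timeout=300s
```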
## Audit log is empty

The audit log persists to Postgres. Check that forwarders are configured:

```shell
kubehero audit status
```

If no forwarders are attached, events still accumulate in Postgres — you just don't get pushes to syslog / webhook / S3. Configure them per Production · Backups.
## Getting support

- Enterprise tier: your dedicated support channel.
- Community: GitHub Discussions.
- Security: security@kubehero.io (PGP key on the Security page).