FAQ
Answers to the questions design partners ask before committing.
Every question here came out of a real design-partner call. If yours isn't listed, open a discussion.
How accurate is your cost attribution compared to our billing record?
The agent reports at 1-second resolution with cgroup-accurate CPU attribution via eBPF. We reconcile nightly against the cloud billing export (AWS Cost Explorer, GCP Billing Export, Azure Cost Management). Our internal accuracy threshold is ±1% of the billing record over a 30-day window after mid-month reservation replay. Most FinOps tools that run on cAdvisor land at ±10–15%.
What's the agent's overhead?
Target: under 0.5% CPU and under 50 MiB RSS per node. We benchmark against this on every PR. Typical production measurement on a 4-vCPU / 16-GiB node: 0.12% CPU, 38 MiB RSS, 0.4 Mbps out.
Do you need a GPU driver modification?
No. We consume metrics from DCGM Exporter, NVIDIA's own telemetry component, which runs as a separate DaemonSet or as a sidecar in your collector. For MIG slices, we read the partition table via DCGM's standard API. AMD GPUs have no DCGM, so we read rocm-smi instead. For Google TPUs, we consume the cloudtpu.googleapis.com metrics.
Do you require privileged containers?
The agent requests hostPID: true to attribute CPU to pod cgroups. It does not run as root, does not need privileged: true, and requires no capabilities beyond what a standard Kubernetes runtime grants. If your security team disallows hostPID, enable the cAdvisor fallback — you lose the 1-second resolution but keep everything else.
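As a rough illustration of what that looks like in Helm values — field names here are assumptions for the sketch, not the chart's actual schema (check the chart's values.yaml):

```yaml
# Illustrative only; key names are hypothetical.
agent:
  hostPID: true          # needed for cgroup-accurate CPU attribution
  securityContext:
    privileged: false    # never requested
    runAsNonRoot: true
  # If your security policy disallows hostPID, a fallback collector
  # mode might look like:
  # collector:
  #   mode: cadvisor     # coarser resolution, same feature set otherwise
```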
How does KubeHero handle mid-month Savings Plans?
When a Savings Plan / Reserved Instance / Committed-Use Discount is purchased mid-month, we reprocess the affected time range: the pricing engine emits a pricing.commitment.activated event, the control plane enqueues a ClickHouse replay, and every pod-second in scope gets re-priced against the new effective rate. Historical numbers in your dashboard update within minutes. A timeline event records what was restated and why. See Concepts · Retroactive cost for the full machinery.
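The replay step above can be sketched as a simple re-pricing pass: every pod-second whose timestamp falls inside the commitment's effective window gets the committed rate, everything else keeps the on-demand rate. This is a minimal illustration with invented names, not the actual pricing engine.

```python
# Hedged sketch of retroactive re-pricing. All names are illustrative.
from dataclasses import dataclass

@dataclass
class PodSecond:
    ts: int           # unix seconds
    cpu_cores: float  # cores attributed during this second

def replay(pod_seconds, window_start, window_end,
           on_demand_rate, committed_rate):
    """Total cost after re-pricing seconds inside the commitment window.

    Rates are per core-hour; each pod-second contributes rate/3600.
    """
    total = 0.0
    for ps in pod_seconds:
        in_window = window_start <= ps.ts < window_end
        rate = committed_rate if in_window else on_demand_rate
        total += ps.cpu_cores * rate / 3600
    return total
```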
Can we run KubeHero fully air-gapped?
Yes. The values.airgap.yaml overlay disables all outbound traffic, and every image can be mirrored to your internal registry. Pricing catalog snapshots can be imported via kubehero pricing import --from snapshot.json so you don't need to reach the public pricing APIs. See Production · Air-gap install.
What happens if the KubeHero control plane goes down?
The agent keeps collecting metrics locally (up to 60s in-memory buffer). The operator keeps reconciling local CRDs. Enforcement continues — BudgetPolicy and CeilingPolicy objects are evaluated in-cluster by the operator even when the central control plane is unreachable. Dashboard queries degrade to "cached data" mode with a clear staleness indicator. See Production · High availability.
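The key property is that policy evaluation needs only local data. A toy sketch of that decision, under assumed semantics (thresholds and states are invented for illustration):

```python
# Hypothetical sketch: an in-cluster operator deciding a budget action
# from locally accumulated spend, with no control-plane round trip.
def evaluate_budget(spend_to_date: float, budget: float,
                    warn_at: float = 0.8) -> str:
    """Return 'ok', 'warn', or 'enforce' from purely local state."""
    if spend_to_date >= budget:
        return "enforce"
    if spend_to_date >= warn_at * budget:
        return "warn"
    return "ok"
```

The point is architectural: because the inputs are local, enforcement survives a control-plane outage.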
Do you mutate our workloads?
Only under a RightsizingPolicy you apply, with mode: apply, or through a BudgetPolicy / CeilingPolicy that you have armed via kubehero cap --arm (or the dashboard toggle). Every action is reversible via kubehero undo <audit-id> within the cooldown window. The default is observe + recommend.
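For orientation, a RightsizingPolicy might look roughly like the following — the apiVersion and field names are assumptions for this sketch, not the published CRD schema:

```yaml
# Hypothetical shape; consult the actual CRD reference before use.
apiVersion: kubehero.io/v1alpha1
kind: RightsizingPolicy
metadata:
  name: batch-workers
spec:
  mode: observe        # observe | recommend | apply; only apply mutates
  targetRef:
    kind: Deployment
    selector:
      matchLabels:
        team: batch
```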
How is this open-source exactly?
- Apache 2.0 — agent, CLI, collector, cost-model library, Protobuf schemas. Everything that runs in your cluster collecting telemetry.
- BSL 1.1 → Apache 2.0 after 3 years — control plane, operator, pricing engine, dashboard. The orchestration brain is commercial during the first 3 years, then auto-opens.
Customers with compliance requirements can audit every line of what runs in their cluster. The value layer has a commercial license during the ramp, then becomes OSS after three years.
What's the smallest install that makes sense?
A single-cluster Cloud install: just the agent, plus our hosted control plane. Under 25 nodes, it's free forever. Helm install takes 90 seconds; first scan takes under 2 minutes. See Quickstart.
What about our existing Kubecost / OpenCost install?
Two paths:
- Coexist — our agent runs alongside; you see both tools' numbers and compare. Most design partners do this for 2–4 weeks.
- Import their allocation rules — kubehero import opencost --from <url> pulls your existing label-based allocation rules so teams don't have to relearn a new chargeback model.
You can run both indefinitely. We aren't trying to kick out another tool you like — we're offering a different accuracy tier and a policy surface they don't have.
Do you support our identity provider?
If it speaks OIDC, yes. The chart ships Dex as the proxy — connectors for Okta, Azure AD, Google Workspace, GitHub, GitLab, LDAP, and generic OIDC are in the standard Dex distribution. For SaaS-hosted KubeHero Cloud, we use WorkOS, which covers SSO + SCIM for every major enterprise IdP. See Integrations · Identity.
How do you handle multi-cluster?
One control plane, many clusters. See Production · Federation. You register each cluster with kubehero cluster add, get a per-cluster mTLS cert, and drop that into the edge cluster's agent Helm install. Policies written in the hub replicate to every matched cluster via label-based scope selectors.
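The registration flow might look roughly like this — the exact subcommands and flags are assumptions for illustration (only kubehero cluster add appears above; see Production · Federation for the real steps):

```shell
# Illustrative flow; flags and the cert subcommand are hypothetical.
kubehero cluster add edge-eu-1                 # register, mint per-cluster mTLS cert
kubehero cluster cert edge-eu-1 > edge-eu-1.pem
helm install kubehero-agent kubehero/agent \
  --set-file controlPlane.clientCert=edge-eu-1.pem
```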
What's your pricing, concretely?
- Cloud — $10 per node per month, first 25 nodes free. No seat tax. No limits on users or clusters.
- Self-hosted · Free tier — Apache 2.0 components only, 3-cluster / 7-day retention limits. BSL components require the Enterprise license at scale.
- Self-hosted · Enterprise — BSL 1.1 commercial license; unlimited scale, SSO/SCIM/RBAC, audit export, federation. Custom pricing per footprint.
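The Cloud-tier arithmetic is simple enough to state exactly: nodes beyond the first 25 are billed at $10 each per month.

```python
def cloud_monthly_cost(node_count: int,
                       rate: float = 10.0,
                       free_nodes: int = 25) -> float:
    """Cloud tier: $10 per node per month, first 25 nodes free."""
    return max(0, node_count - free_nodes) * rate
```

So a 40-node footprint pays for 15 nodes, i.e. $150/month.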
Who's behind this?
A DevOps / FinOps / HPC engineer with 15+ years on Kubernetes and ML infra. Design partners today are operators running real multi-cloud AKS / GKE / EKS footprints with GPU fleets. We prefer small, hands-on engagements over enterprise-sales theatre.
Is there an SLA?
Not during pre-launch. At GA (planned Q4 2026), Cloud customers get 99.9% on the control-plane API and 99.95% on telemetry ingest. Self-hosted is customer-operated — the SLA there covers software support response, not your cluster's uptime.