Production deployment
Sizing, HA, backups, disaster recovery, air-gap, and federation for enterprise KubeHero installs.
This page covers production deployment concerns — the things you'd want answered before running KubeHero as a P1 system. For a five-minute install, see Quickstart.
Reference topology
A production self-hosted deployment runs in a dedicated kubehero-system namespace with the following components:
kubehero-system
├── kubehero-control-plane × 2 replicas (HA via leader election)
├── kubehero-operator × 1 replica (leader-elected, exclusive)
├── kubehero-pricing-engine × 1 replica (CronJob, runs every 6h)
├── kubehero-collector DaemonSet (one per node, eBPF)
└── kubehero-dashboard × 2 replicas (stateless)
storage (separate namespaces)
├── cnpg-system CloudNativePG cluster ×3 replicas
├── clickhouse-system ClickHouse ×3 shards ×2 replicas
└── valkey sentinel mode ×3 replicas
auth
└── dex ×2 replicas (connected to corp IdP)
Sizing
| Component | Sizing per 1,000 nodes |
|---|---|
| Control plane | 2× (2 vCPU / 2 GiB) |
| Collector | 100m CPU / 50 MiB per node (overhead budget) |
| Operator | 1× (500m CPU / 512 MiB) |
| Pricing engine | 1× CronJob (500m CPU / 1 GiB during run) |
| Dashboard | 2× (500m CPU / 1 GiB) |
| PostgreSQL | 3× (4 vCPU / 16 GiB / 200 GiB SSD) |
| ClickHouse | 3 shards × 2 replicas (8 vCPU / 32 GiB / 2 TiB NVMe per pod) |
| Valkey | 3× (500m CPU / 1 GiB) |
| Dex | 2× (200m CPU / 256 MiB) |
Scale ClickHouse first as you grow — it's the hot path for dashboard queries. Everything else is comparatively flat.
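For a concrete sense of the collector line in that table, the per-node budget multiplies out like this (pure shell arithmetic; the 1,000-node fleet size is the table's reference point, not a limit):

```shell
# Back-of-envelope collector overhead for a 1,000-node fleet,
# using the per-node budget from the sizing table (100m CPU / 50 MiB).
NODES=1000
CPU_MILLICORES=$((NODES * 100))   # 100000m = 100 vCPU across the fleet
MEM_MIB=$((NODES * 50))           # 50000 MiB, roughly 48.8 GiB
echo "collector fleet budget: $((CPU_MILLICORES / 1000)) vCPU, ${MEM_MIB} MiB"
```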
High availability
Control plane
Runs with --leader-elect via controller-runtime. Two replicas: one serves RPC, the other is a warm standby. Leader-election leases are stored in PostgreSQL.
Operator
Same leader-elect pattern. One active, one warm. CRD reconciliation is idempotent so a failover is invisible to customers.
Collector
DaemonSet — every node runs one; there is no election. If a collector crashes, its node's telemetry stops until the pod restarts. We buffer up to 60s of ticks in memory to cover brief restarts; longer outages show as a visible gap in the kubehero_pod_cost_usd_per_second series.
Dashboard
Stateless Next.js; scale horizontally behind your existing ingress.
Postgres
CloudNativePG runs 3 replicas with streaming replication + automated failover. Backup strategy below.
ClickHouse
Altinity's operator manages 3 shards × 2 replicas. ZooKeeper or ClickHouse Keeper provides consensus. Reads prefer the local replica; writes go to the leader of each shard.
Backups
Postgres
CloudNativePG ships WAL-archive + periodic base backups to object storage (S3-compatible):
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata: { name: kubehero-postgres }
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://acme-backups/kubehero-postgres
      s3Credentials:
        accessKeyId: { name: pg-backup-creds, key: access_key }
        secretAccessKey: { name: pg-backup-creds, key: secret_key }
    retentionPolicy: 30d
RPO: 5 minutes (WAL archive interval). RTO: ~10 minutes (PITR restore + cluster re-bootstrap).
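The restore half of that RTO uses CNPG's recovery bootstrap against the same object store. A minimal sketch — cluster and secret names reuse the backup example above; the recovery target time is illustrative:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata: { name: kubehero-postgres-restored }
spec:
  instances: 3
  bootstrap:
    recovery:
      source: kubehero-postgres
      recoveryTarget:
        targetTime: "2026-04-24 06:00:00"   # PITR target, illustrative
  externalClusters:
    - name: kubehero-postgres
      barmanObjectStore:
        destinationPath: s3://acme-backups/kubehero-postgres
        s3Credentials:
          accessKeyId: { name: pg-backup-creds, key: access_key }
          secretAccessKey: { name: pg-backup-creds, key: secret_key }
```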
ClickHouse
Run BACKUP TABLE to S3 daily; use an s3_disk storage policy to tier partitions older than 90 days to cold object storage. Note that ClickHouse's S3() backup target takes an HTTPS endpoint, not an s3:// URI:
BACKUP TABLE kubehero.pod_cost_1s TO S3('https://acme-backups.s3.amazonaws.com/kubehero-clickhouse/2026-04-24', '...', '...');
RPO depends on backup schedule. RTO: hours for a full restore, but the time-series data is replayable from any raw ingest archive you kept.
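Restoring is the mirror-image statement. The endpoint below assumes a standard S3 virtual-hosted URL for the same bucket; credentials stay elided as above:

```sql
RESTORE TABLE kubehero.pod_cost_1s
FROM S3('https://acme-backups.s3.amazonaws.com/kubehero-clickhouse/2026-04-24', '...', '...');
```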
Audit log
Always tier to immutable storage — every action's signed record lands in S3-compatible object storage via the continuous audit exporter:
kubehero audit forward --s3 s3://acme-compliance/kubehero-audit/
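To make that bucket genuinely immutable rather than append-only by convention, S3 Object Lock in compliance mode is one option. The retention period below is illustrative, and the bucket must have been created with Object Lock enabled:

```shell
# Default compliance-mode retention on the audit bucket:
# objects cannot be deleted or overwritten for 365 days, even by root.
aws s3api put-object-lock-configuration \
  --bucket acme-compliance \
  --object-lock-configuration \
  '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}}}'
```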
Disaster recovery
Recovery scenarios we've rehearsed:
Single control-plane pod loss
Leader lease expires within 15s; standby takes over. No data loss. No dashboard interruption beyond a brief 503.
Full cluster loss
Redeploy from the Helm chart. Restore Postgres from PITR, ClickHouse from the latest snapshot. Audit log survives in object storage. Agents reconnect to the new ingress endpoint automatically via the kubehero-cluster-cert ConfigMap.
Region-level cloud outage
Configure Postgres cross-region replication (CloudNativePG supports this via externalClusters). Primary control plane in the surviving region. ClickHouse needs a separate decision — full cross-region replication is expensive; most customers accept a few hours of gap.
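The CNPG side of that setup can be sketched as a replica cluster in the standby region streaming from the primary; the hostname and replication user below are assumptions, not chart defaults:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata: { name: kubehero-postgres-dr }
spec:
  instances: 3
  replica:
    enabled: true              # this cluster follows; it accepts no writes
    source: kubehero-postgres
  externalClusters:
    - name: kubehero-postgres
      connectionParameters:
        host: kubehero-postgres-rw.primary-region.internal   # assumed DNS name
        user: streaming_replica
        sslmode: verify-full
```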
Air-gap install
Pre-mirror every image to an internal registry, then install with the air-gap overlay:
helm install kubehero deploy/helm/kubehero \
  -f deploy/helm/kubehero/values.yaml \
  -f deploy/helm/kubehero/values.airgap.yaml \
  --set image.registry=registry.internal.acme.com \
  --set imagePullSecrets[0].name=acme-regcred
No outbound traffic required. Pricing catalog can be pre-populated via:
kubectl -n kubehero-system exec deploy/kubehero-pricing-engine -- \
  kubehero pricing import --from ./pricing-snapshot.json
Customers in defense or regulated industries mirror this snapshot monthly.
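The pre-mirroring step can be scripted. The sketch below only prints the copy commands for review — drop the echo to execute; the image names, upstream registry, and tag are illustrative, so pull the real list from the chart:

```shell
SRC=ghcr.io/kubehero                   # assumed upstream registry
DST=registry.internal.acme.com/kubehero
TAG=v1.0.0                             # pin to the chart's appVersion
for img in control-plane operator pricing-engine collector dashboard; do
  # skopeo copies registry-to-registry, no local Docker daemon needed
  echo skopeo copy "docker://${SRC}/${img}:${TAG}" "docker://${DST}/${img}:${TAG}"
done
```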
Federation — multi-cluster
One control plane. Many clusters. This is the default shape for customers running more than a handful of Kubernetes clusters.
Architecture
┌──────────────── federation hub (KubeHero control plane) ───────────────────┐
│ │
│ ┌──── Postgres ─── ClickHouse ─── Valkey ─── Dex ─── Dashboard ────┐ │
│ │ │ │
│ └── ingress.kubehero.internal ─── mTLS listener ───────────────────┘ │
│ ▲ │
└──────────────────────────────────────────┼──────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌────────┼──┐ ┌────────┼──┐ ┌────────┼──┐
│cluster-a │ │cluster-b │ │cluster-c │
│ agent DS │ │ agent DS │ │ agent DS │
│ operator │ │ operator │ │ operator │
└───────────┘ └───────────┘ └───────────┘
AWS GCP Azure
Each edge cluster runs the agent DaemonSet + operator (for local CRD enforcement). The control plane runs once, usually in a dedicated platform cluster.
Registration
kubehero cluster add \
  --name prod-us-east-1 \
  --cloud aws --region us-east-1
This issues a per-cluster mTLS cert, which you drop into the agent Helm install on the edge cluster:
helm install kubehero-agent kubehero/kubehero \
  --namespace kubehero-system --create-namespace \
  --set controlPlane.enabled=false \
  --set operator.enabled=true \
  --set dashboard.enabled=false \
  --set cloud.enabled=true \
  --set cloud.hubEndpoint=https://ingress.kubehero.internal \
  --set-file cloud.clusterCert=./cluster-cert.pem
Cross-cluster policies
Write a BudgetPolicy once in the hub with scope.clusterSelector — it replicates to every matching cluster's operator:
apiVersion: kubehero.kubehero.io/v1
kind: BudgetPolicy
metadata: { name: all-prod-ceiling }
spec:
  scope:
    clusterSelector:
      matchLabels: { env: prod }   # matches every prod-* cluster
  ceiling: "$500000/mo"
  hardStop: true
The hub's operator watches the policy and pushes it to each matched cluster's operator over the same Connect-RPC transport the agents use. Each edge operator then treats the policy as its own and applies it locally.
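Because each edge operator holds the policy as a local custom resource, replication can be spot-checked from any edge cluster with plain kubectl (the context name is illustrative, and cluster-scoped resource naming is an assumption):

```shell
kubectl --context prod-us-east-1 get budgetpolicies.kubehero.kubehero.io all-prod-ceiling
```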
Multi-tenant isolation
RBAC scopes limit who can arm policies per cluster:
auth:
  rbac:
    clusterScopes:
      - name: acme-sre-east
        groups: ["acme-sre-east"]
        clusters: ["eks-use1-prod", "eks-use1-staging"]
        role: operator
      - name: acme-ml-platform
        groups: ["acme-ml"]
        clusters: ["aks-westeu-prod-01"]
        role: operator
SRE East can't arm ML Platform's policies and vice versa, even though both appear in the same dashboard.
Hardening
- Network policies — the chart ships default-deny NetworkPolicies you can enable via networkPolicies.enabled: true.
- Pod Security Standards — every manifest conforms to the restricted profile.
- Image signing — chart references are all @sha256:pinned; images are Cosign-signed and verified by the chart's admission webhook if enabled.
- SBOM — every image has a Syft-generated SBOM attached as a Cosign artifact.
- Supply chain — our build pipelines are reproducible via Dagger; see infra/dagger/.
Upgrading
Chart versions follow semver. Minor upgrades are always zero-downtime if you run the HA topology. Major upgrades (v1 → v2) may have CRD migrations — we ship a helm upgrade --atomic path plus a one-command kubehero migrate v1-v2 that walks your existing CRDs through the new schema.
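Under those guarantees, a v1 → v2 upgrade is roughly two commands. This is a sketch of the ordering, not the canonical runbook; flags beyond --atomic are illustrative:

```shell
# 1. Walk existing CRDs through the v2 schema (command from this page)
kubehero migrate v1-v2

# 2. Atomic chart upgrade: Helm rolls back automatically on failure
helm upgrade kubehero deploy/helm/kubehero \
  --namespace kubehero-system \
  --atomic \
  -f deploy/helm/kubehero/values.yaml
```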
Monitoring KubeHero itself
We eat our own dog food. The chart ships PrometheusRules for:
- Control-plane liveness — up{service="kubehero-control-plane"} == 0
- Collector coverage — fires when a node's collector stops reporting for >60s
- Policy evaluation latency — alerts if the operator's reconcile loop grows beyond 2s
- Database connection pool saturation — both Postgres and ClickHouse
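As a sketch, the collector-coverage rule might look like the following — the shipped rule's exact expression and labels may differ, and the job label is an assumption:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubehero-collector-coverage
  namespace: kubehero-system
spec:
  groups:
    - name: kubehero.collectors
      rules:
        - alert: KubeHeroCollectorSilent
          expr: up{job="kubehero-collector"} == 0
          for: 1m                # matches the >60s threshold above
          labels:
            severity: warning
          annotations:
            summary: "Collector {{ $labels.instance }} has stopped reporting"
```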