spoke-dr - BRAC POC OCP Wiki

Status

Recorded current state

GitOps: spoke-dr-cluster-config recorded Synced/Healthy.
Storage: ODF desired/live spec uses compact LVMS topology and StorageCluster is ready.
OADP: lab-dpa reconciled, BSL available, and the latest scheduled daily backup completed in the recorded run. One older partially failed backup remains in the backup history.
Local app middleware: Demo middleware was retired from spoke desired state. Non-core app middleware is not tracked as part of the OpenShift core operations scope.
AI platform: RHOAI operator installed, but no DataScienceCluster or user AI workloads exist.
User workload metrics: Enabled; user workload Prometheus, Thanos ruler, and the Prometheus operator should run here.
Vault / ESO: SecretStore/rke2-vault is Ready=True and ExternalSecret/eso-vault-smoke is synced through the kubernetes-spoke-dr Vault auth mount.
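
For orientation, an ExternalSecret such as the eso-vault-smoke check might look roughly like the sketch below. Only the names eso-vault-smoke and rke2-vault come from the recorded state; the API version, namespace, refresh interval, target Secret name, and Vault key path are assumptions for illustration.

```yaml
# Sketch only: every value marked "assumed" or "hypothetical" is not from the recorded state.
apiVersion: external-secrets.io/v1beta1   # assumed; newer ESO releases serve external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: eso-vault-smoke
  namespace: eso-smoke            # assumed namespace
spec:
  refreshInterval: 1h             # assumed
  secretStoreRef:
    kind: SecretStore
    name: rke2-vault              # the SecretStore recorded as Ready=True
  target:
    name: eso-vault-smoke         # assumed name of the resulting Kubernetes Secret
  data:
    - secretKey: value            # assumed key inside the resulting Secret
      remoteRef:
        key: smoke/test           # hypothetical Vault path
        property: value           # hypothetical field within that Vault secret
```

The Vault auth mount (kubernetes-spoke-dr) is configured on the SecretStore, not on the ExternalSecret, so it does not appear in this fragment.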

OSSM 3

Mesh platform state

servicemeshoperator3.v3.3.2 and kiali-operator.v2.22.2 CSVs recorded Succeeded.
Istio/default, IstioCNI/default, ZTunnel/default, Kiali, OSSM Console, CNI pods, and ingress gateway are recorded running.
Ambient components are pinned to OSSM 3.3 version v1.28.5.
No application namespace has opted into ambient yet; use istio.io/dataplane-mode=ambient for app onboarding.
RHOAI ServiceMesh capability is recorded False/MissingOperator because it expects the old OSSM v2 operator; this does not block OSSM 3 itself.
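
The ambient opt-in mentioned above is a namespace label. A minimal sketch, assuming a hypothetical application namespace named example-app:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-app                      # hypothetical namespace, not an existing workload
  labels:
    istio.io/dataplane-mode: ambient     # opts the namespace's pods into the ambient data plane
```

Removing the label takes the namespace back out of ambient mode, so onboarding can be trialled one namespace at a time.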

Decision recorded 2026-05-07

Standby semantics: platform standby

spoke-dr remains platform standby: operators, OSSM 3, ESO, OADP, and storage stay ready, but no application workloads run in normal operation. lab-workloads/clusters/spoke-dr/kustomization.yaml intentionally sets resources: [], and the workload ApplicationSet matches spoke-dc only. The full rationale and the alternatives considered are in ADR-0001.

This avoids the cost of cross-cluster session replication, dual image promotion, and split-brain handling that hot-standby would imply. The trade-off is non-trivial activation time during a regional spoke-dc loss; that's accepted in exchange for operational simplicity.
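
In standby, the spoke-dr workload overlay would look roughly like the fragment below. Only the file path and resources: [] are recorded; the apiVersion and kind header lines are the standard Kustomize ones and are assumed here.

```yaml
# lab-workloads/clusters/spoke-dr/kustomization.yaml (standby state)
apiVersion: kustomize.config.k8s.io/v1beta1   # assumed header
kind: Kustomization
resources: []   # intentionally empty: no application workloads on spoke-dr in normal operation
```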

DR activation procedure (runbook-driven, owner-approved):

Confirm spoke-dc is unrecoverable (or accepted-loss for the drill).
Owner opens a PR on lab-workloads populating clusters/spoke-dr/kustomization.yaml with the workload references currently on spoke-dc.
Merge the PR. The workload ApplicationSet selector is then widened (or a one-time manual Application is created per workload) so that the hub Argo CD generates spoke-dr-workloads, which the Argo CD Agent on spoke-dr reconciles. Images are already warm thanks to the pre-pull controller running on the cluster.
Owner flips the public DNS / LB to spoke-dr's ingress.
Capture evidence (Argo sync time, route reachability, smoke tests) for the drill record.
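
The GitOps changes in the procedure above can be sketched as two files, shown together here and separated by "---". This is an illustration under assumptions, not the repository's actual content: the workload paths, label key, ApplicationSet name, namespace, project, repository URL, and branch are all hypothetical, and destination handling under the Argo CD Agent may differ from the plain cluster-generator form shown.

```yaml
# 1) lab-workloads/clusters/spoke-dr/kustomization.yaml after the activation PR
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../apps/example-frontend    # hypothetical workload reference copied from spoke-dc
  - ../../apps/example-api         # hypothetical workload reference copied from spoke-dc
---
# 2) Workload ApplicationSet with the cluster selector widened to include spoke-dr
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: workloads                  # hypothetical name
  namespace: openshift-gitops      # assumed hub Argo CD namespace
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: example.com/cluster-name   # hypothetical label on the cluster secrets
              operator: In
              values:
                - spoke-dc
                - spoke-dr         # added for DR activation
  template:
    metadata:
      name: '{{name}}-workloads'   # yields spoke-dr-workloads if the cluster is registered as spoke-dr
    spec:
      project: default             # assumed
      source:
        repoURL: https://git.example.com/lab/lab-workloads.git   # hypothetical URL
        targetRevision: main       # assumed branch
        path: 'clusters/{{name}}'
      destination:
        server: '{{server}}'
```

Reverting both changes after a drill returns spoke-dr to the empty standby state.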
