Drill gates

Current sequence

  1. Fresh ACM hub backup: all critical backup streams must meet the agreed RPO.
  2. Image readiness: wait for hub pre-pull to finish, then build the durable mirror/IDMS path.
  3. Activation preflight: prove hub-dr has no active BackupSchedule/Restore and restore manifests dry-run cleanly.
  4. Controlled activation: proceed only after explicit user approval because ownership can move.
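The RPO gate in step 1 can be checked mechanically once a stream's last backup time is known. The following is a minimal sketch, not the established tooling: the function name backup_within_rpo and the 24-hour window in the usage comment are assumptions (the agreed RPO is not stated here), and it relies on GNU date for timestamp parsing.

```shell
# Sketch only: decide whether a backup timestamp falls inside an RPO window.
# Assumes GNU date (for `date -d`) and RFC 3339 UTC timestamps such as
# 2026-05-07T02:00:00Z.

# backup_within_rpo <last_backup_timestamp> <rpo_seconds> [now_epoch]
# Returns 0 when the backup is fresh enough, 1 otherwise.
backup_within_rpo() {
  local last_backup="$1" rpo_seconds="$2" now_epoch="${3:-$(date -u +%s)}"
  local backup_epoch age
  # An empty timestamp means "no backup recorded", so fail closed.
  [ -n "$last_backup" ] || return 1
  # An unparsable timestamp also fails closed.
  backup_epoch="$(date -u -d "$last_backup" +%s 2>/dev/null)" || return 1
  age=$(( now_epoch - backup_epoch ))
  # Reject timestamps in the future as well as stale ones.
  [ "$age" -ge 0 ] && [ "$age" -le "$rpo_seconds" ]
}

# Hypothetical usage against one stream (86400 s = 24 h is an assumed RPO):
#   last="$(oc -n open-cluster-management-backup get backupschedule hub-backup-daily \
#     -o jsonpath='{.status.veleroScheduleResources.status.lastBackup}')"
#   backup_within_rpo "$last" 86400 || echo "resources stream is outside the RPO"
```

Repeating the call for the credentials, managed-clusters, and resources streams covers the "all critical backup streams" wording of the gate.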

OADP

General backup state

Object storage

External MinIO bucket layout

Consumer                      Bucket                     Notes
ACM Observability on hub-dc   acm-observability-hub-dc   Dedicated bucket and scoped MinIO user.
ACM Observability on hub-dr   acm-observability-hub-dr   Dedicated bucket and scoped MinIO user. Do not share with hub-dc.
General OADP                  oadp-<cluster>             One bucket per cluster.
ACM hub backup                acm-cluster-backup         Shared by the active/restore hub backup workflow.

The observability object-store credentials are seeded as local-only Kubernetes Secrets and are intentionally not stored in Git.

Spoke regional DR

spoke-dc to spoke-dr activation procedure

spoke-dr is platform standby (decision 2026-05-07): platform services stay ready, app workloads activate only when an owner-approved drill or real outage moves traffic. The procedure is deliberately runbook-driven, not automated — see spoke-dr for the rationale.

  1. Confirm: spoke-dc is unrecoverable (or accepted-loss for the drill). Capture the cluster state.
  2. Image pre-pull check: spoke-dr's pre-pull DaemonSet has the workload images warm; if not, wait for the pull to complete before activation.
  3. Vault role: confirm the kubernetes-spoke-dr auth mount has the per-app role bound to the workload SAs (a role already exists for the platform smoke path; per-app roles are added at activation time).
  4. Workload activation PR: owner opens a PR on lab-workloads populating clusters/spoke-dr/kustomization.yaml with the workload references currently on spoke-dc. Merge to main.
  5. ApplicationSet picks up: hub Argo CD generates spoke-dr-workloads; the cluster's Argo reconciles. Track status on the spoke's local Argo CD UI (authoritative for managed-pull spokes per the GitOps model).
  6. DNS / LB cutover: owner flips the public route to spoke-dr's ingress.
  7. Evidence: capture activation timeline, sync time, route reachability, and smoke test results for the drill record.
  8. Post-drill: revert clusters/spoke-dr/kustomization.yaml to resources: [] when traffic returns to spoke-dc.
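Steps 4 and 8 both hinge on whether clusters/spoke-dr/kustomization.yaml is populated or back to resources: []. A small check like the one below could confirm the standby state before and after a drill. It is a sketch under assumptions: the function name kustomization_is_standby is hypothetical, and it only recognises the empty list when written on one line as resources: [], which is the form this procedure uses.

```shell
# Sketch only: report whether a kustomization file is in the standby state,
# meaning it declares an empty resources list on a single line.

# kustomization_is_standby <path>
# Returns 0 for standby (resources: []), 1 when resources are populated
# or the key is missing, 2 when the file does not exist.
kustomization_is_standby() {
  local file="$1"
  [ -f "$file" ] || return 2
  # Plain text match; tolerates extra spaces and a trailing comment.
  grep -Eq '^resources:[[:space:]]*\[[[:space:]]*\][[:space:]]*(#.*)?$' "$file"
}

# Hypothetical usage from a checkout of lab-workloads:
#   if kustomization_is_standby clusters/spoke-dr/kustomization.yaml; then
#     echo "spoke-dr is in standby: no workloads referenced"
#   else
#     echo "spoke-dr references workloads or the file is missing"
#   fi
```

A text match keeps the check dependency-free; a YAML-aware tool would be more robust if the file ever uses a different layout for the empty list.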

Alerting

Backup health controls

Validation

Read-only checks

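# hub-dc: backup configuration, schedule status, and monitoring objects.
export KUBECONFIG=<hub-dc-kubeconfig>
# DataProtectionApplication and BackupStorageLocation status.
oc -n open-cluster-management-backup get dpa,bsl
# Schedule phase, then the last backup time of the credentials, managed-clusters, and resources streams.
oc -n open-cluster-management-backup get backupschedule hub-backup-daily -o jsonpath="{.status.phase}{'\n'}{.status.veleroScheduleCredentials.status.lastBackup}{'\n'}{.status.veleroScheduleManagedClusters.status.lastBackup}{'\n'}{.status.veleroScheduleResources.status.lastBackup}{'\n'}"
# Most recent Velero backups, sorted by start time.
oc -n open-cluster-management-backup get backups.velero.io --sort-by=.status.startTimestamp | tail -12
# ServiceMonitor and PrometheusRule objects in both backup namespaces.
oc -n open-cluster-management-backup get servicemonitor,prometheusrule
oc -n openshift-adp get servicemonitor,prometheusrule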

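# hub-dr: expect no active BackupSchedule or Restore before activation.
export KUBECONFIG=<hub-dr-kubeconfig>
oc -n open-cluster-management-backup get backupschedule,restore
# Image pre-pull DaemonSet and its pods.
oc -n openshift-image-prepull get ds,pods
# Mirror configuration (IDMS, ITMS, and legacy ICSP).
oc get imagedigestmirrorset,imagetagmirrorset,imagecontentsourcepolicy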
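
The hub-dr check for BackupSchedule and Restore objects can be turned into a pass/fail gate for the activation preflight. The sketch below is one conservative way to do that and is not part of the documented procedure: the function name assert_no_objects is hypothetical, and it treats any existing object as a failure instead of inspecting each object's phase, which may be stricter than "no active" strictly requires.

```shell
# Sketch only: fail when `oc get ... -o name` lists any object.
# Reads object names on stdin, one per line; blank lines are ignored.

# assert_no_objects
# Returns 0 when stdin carries no object names, 1 otherwise
# (listing the offending objects on stderr).
assert_no_objects() {
  local found
  found="$(grep -v '^[[:space:]]*$' || true)"
  if [ -n "$found" ]; then
    printf 'preflight failed, objects present:\n%s\n' "$found" >&2
    return 1
  fi
}

# Hypothetical usage with KUBECONFIG pointing at hub-dr:
#   oc -n open-cluster-management-backup get backupschedule,restore -o name \
#     | assert_no_objects && echo "preflight: no BackupSchedule or Restore on hub-dr"
```

If completed Restore objects are expected to remain on hub-dr, the gate would need to filter on phase before counting, which this sketch deliberately leaves out.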