Resilience
Backup and DR gates
The next major operational milestone is a controlled hub-dc to hub-dr ACM activation drill. Do not start it until the gates below are clean.
Drill gates
Current sequence
- Fresh ACM hub backup: all critical backup streams must meet the agreed RPO.
- Image readiness: wait for hub pre-pull to finish, then build the durable mirror/IDMS path.
- Activation preflight: confirm that hub-dr has no active BackupSchedule or Restore and that the restore manifests dry-run cleanly.
- Controlled activation: proceed only after explicit user approval because ownership can move.
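The backup-freshness gate can be sketched as a small shell helper. This is a minimal illustration, not the lab's actual check: the function name `backup_within_rpo` and its exit codes are hypothetical, and the RPO value in the usage comment is a placeholder, since the agreed RPO is not stated here.

```shell
# Sketch: decide whether one backup stream is fresh enough for the drill gate.
# Arguments: $1 = last backup time (epoch seconds)
#            $2 = current time (epoch seconds)
#            $3 = RPO in seconds (placeholder; use the agreed value)
# Exit status: 0 = within RPO, 1 = stale, 2 = timestamp in the future
#              (clock skew or bad input).
backup_within_rpo() {
  age=$(( $2 - $1 ))
  if [ "$age" -lt 0 ]; then
    return 2
  fi
  [ "$age" -le "$3" ]
}

# Hypothetical usage, assuming GNU date and an RFC 3339 timestamp in $last_backup:
#   backup_within_rpo "$(date -u -d "$last_backup" +%s)" "$(date -u +%s)" 86400
```

Working in epoch seconds keeps the comparison itself portable; only the timestamp conversion in the usage comment depends on GNU `date`.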
OADP
General backup state
- General OADP daily schedules are staggered: hub-dc at 0 2 * * *, hub-dr at 20 2 * * *, spoke-dc at 40 2 * * *, and spoke-dr at 0 3 * * *.
- Velero has retry hardening: AWS_RETRY_MODE=standard, AWS_MAX_ATTEMPTS=10, and a 512Mi memory request.
- Latest recorded daily series completed across all four clusters.
- Backup health alerting now watches stale, failed, partially failed, and warning-producing lab-daily backups.
- External Vault logical export now uses a scoped periodic token and self-renewing replication scripts; this supports future OpenShift secret recovery checks without relying on a root-token repair.
- Residual watch item: hub-dr archive persistence to shared MinIO can be slow; cluster-to-MinIO health checks had intermittent timeout samples.
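To make the stagger concrete, the four cron expressions above can be converted into minutes after midnight. The helper below is an illustrative sketch only: `cron_minute_of_day` is a hypothetical name, and it handles only plain numeric minute and hour fields without leading zeros, which is all these schedules use.

```shell
# Sketch: convert the minute and hour fields of a simple daily cron
# expression into minutes after midnight. Ranges, steps, and lists in the
# first two fields are not supported.
cron_minute_of_day() {
  printf '%s\n' "$1" | {
    read -r minute hour _rest
    echo $(( hour * 60 + minute ))
  }
}

# Print the start offset for each cluster, using the schedules listed above.
for entry in \
  "hub-dc=0 2 * * *" \
  "hub-dr=20 2 * * *" \
  "spoke-dc=40 2 * * *" \
  "spoke-dr=0 3 * * *"
do
  cluster=${entry%%=*}
  schedule=${entry#*=}
  printf '%s starts %s minutes after midnight\n' \
    "$cluster" "$(cron_minute_of_day "$schedule")"
done
```

The offsets come out as 120, 140, 160, and 180, so each cluster starts 20 minutes after the previous one.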
Object storage
External MinIO bucket layout
| Consumer | Bucket | Notes |
|---|---|---|
| ACM Observability on hub-dc | acm-observability-hub-dc | Dedicated bucket and scoped MinIO user. |
| ACM Observability on hub-dr | acm-observability-hub-dr | Dedicated bucket and scoped MinIO user. Do not share with hub-dc. |
| General OADP | oadp-<cluster> | One bucket per cluster. |
| ACM hub backup | acm-cluster-backup | Shared by the active/restore hub backup workflow. |
The observability object-store credentials are seeded as local-only Kubernetes Secrets and are intentionally not stored in Git.
Spoke regional DR
spoke-dc → spoke-dr activation procedure
spoke-dr is platform standby (decision 2026-05-07): platform services stay ready, app workloads activate only when an owner-approved drill or real outage moves traffic. The procedure is deliberately runbook-driven, not automated — see spoke-dr for the rationale.
- Confirm: spoke-dc is unrecoverable (or accepted-loss for the drill). Capture the cluster state.
- Image pre-pull check: spoke-dr's pre-pull DaemonSet has the workload images warm; if not, wait for the pull to complete before activation.
- Vault role: confirm kubernetes-spoke-dr auth mount has the per-app role bound to the workload SAs (the role exists for the platform smoke path; per-app roles are added at activation time).
- Workload activation PR: owner opens a PR on lab-workloads populating clusters/spoke-dr/kustomization.yaml with the workload references currently on spoke-dc. Merge to main.
- ApplicationSet picks up: hub Argo CD generates spoke-dr-workloads; the cluster's Argo reconciles. Track status on the spoke's local Argo CD UI (authoritative for managed-pull spokes per the GitOps model).
- DNS / LB cutover: owner flips the public route to spoke-dr's ingress.
- Evidence: capture activation timeline, sync time, route reachability, and smoke test results for the drill record.
- Post-drill: revert clusters/spoke-dr/kustomization.yaml to resources: [] when traffic returns to spoke-dc.
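The post-drill revert can be checked mechanically. The sketch below assumes the standby state is written on a single line as `resources: []`, as in the step above; the function name `kustomization_is_empty` is hypothetical, and a YAML-aware tool would be more robust than `grep` if the file is formatted differently.

```shell
# Sketch: succeed only when the kustomization file declares an empty
# resources list on one line, i.e. "resources: []".
kustomization_is_empty() {
  grep -Eq '^resources:[[:space:]]*\[\][[:space:]]*$' "$1"
}

# Hypothetical usage from a lab-workloads checkout:
#   if kustomization_is_empty clusters/spoke-dr/kustomization.yaml; then
#     echo "spoke-dr is back in standby"
#   else
#     echo "spoke-dr still references workloads"
#   fi
```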
Alerting
Backup health controls
- oadp-backup-health runs in openshift-adp on all four clusters.
- acm-backup-health runs in open-cluster-management-backup on the active hub.
- Platform Prometheus has confirmed Velero backup timestamp metrics for the new scrape targets.
- Validate alert behavior through an actual stale/failure cycle before treating this as a completed DR gate.
Validation
Read-only checks
export KUBECONFIG=<hub-dc-kubeconfig>
oc -n open-cluster-management-backup get dpa,bsl
oc -n open-cluster-management-backup get backupschedule hub-backup-daily -o jsonpath="{.status.phase}{'\n'}{.status.veleroScheduleCredentials.status.lastBackup}{'\n'}{.status.veleroScheduleManagedClusters.status.lastBackup}{'\n'}{.status.veleroScheduleResources.status.lastBackup}{'\n'}"
oc -n open-cluster-management-backup get backups.velero.io --sort-by=.status.startTimestamp | tail -12
oc -n open-cluster-management-backup get servicemonitor,prometheusrule
oc -n openshift-adp get servicemonitor,prometheusrule
export KUBECONFIG=<hub-dr-kubeconfig>
oc -n open-cluster-management-backup get backupschedule,restore
oc -n openshift-image-prepull get ds,pods
oc get imagedigestmirrorset,imagetagmirrorset,imagecontentsourcepolicy
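The hub-dr check for BackupSchedule and Restore objects can be turned into a pass/fail preflight. The sketch below is deliberately stricter than the "no active" gate: it treats any listed object as something to review by hand, because deciding whether an existing Restore is still active needs a look at its status. The function name `no_resources_listed` is hypothetical.

```shell
# Sketch: read resource names on stdin (one per line, as printed by
# `oc get ... -o name`) and succeed only when none are listed.
no_resources_listed() {
  # grep -c exits non-zero when it counts zero lines, hence the "|| true".
  count=$(grep -c . || true)
  [ "$count" -eq 0 ]
}

# Hypothetical usage against hub-dr:
#   oc -n open-cluster-management-backup get backupschedule,restore -o name \
#     | no_resources_listed \
#     || echo "hub-dr has BackupSchedule/Restore objects; review before activation"
```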