Backup and DR - BRAC POC OCP Wiki

Drill gates

Current sequence

Fresh ACM hub backup: all critical backup streams must meet the agreed RPO.
Image readiness: wait for hub pre-pull to finish, then build the durable mirror/IDMS path.
Activation preflight: prove that hub-dr has no active BackupSchedule or Restore, and that the restore manifests pass a dry run cleanly.
Controlled activation: proceed only after explicit user approval because ownership can move.
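
The RPO gate in the first step reduces to comparing a backup timestamp with the current time. A minimal sketch, assuming GNU `date` and an RPO expressed in seconds; the function name `backup_within_rpo` and the 24-hour value in the usage comment are illustrative and are not the agreed RPO.

```shell
# backup_within_rpo <timestamp> <rpo-seconds> [now-epoch]
# Returns 0 when the backup is no older than the RPO, 1 when it is stale,
# and 2 when the timestamp is empty or cannot be parsed.
# Assumption: GNU date is available (the -d flag is not portable to BSD date).
backup_within_rpo() {
  local ts="$1" rpo="$2" now="${3:-$(date -u +%s)}"
  local backup_epoch age
  # GNU date treats an empty string as "today 00:00", so reject it explicitly.
  [ -n "$ts" ] || return 2
  backup_epoch=$(date -u -d "$ts" +%s 2>/dev/null) || return 2
  age=$(( now - backup_epoch ))
  [ "$age" -le "$rpo" ]
}

# Example (hypothetical 24h RPO), fed from the BackupSchedule status:
# last=$(oc -n open-cluster-management-backup get backupschedule hub-backup-daily \
#   -o jsonpath='{.status.veleroScheduleResources.status.lastBackup}')
# backup_within_rpo "$last" 86400 || echo "resources stream is stale" >&2
```

The same call would need to be repeated for the credentials and managed-clusters streams, since the gate requires every critical stream to be fresh.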

OADP

General backup state

General OADP daily schedules are staggered 20 minutes apart: hub-dc at `0 2 * * *` (02:00), hub-dr at `20 2 * * *` (02:20), spoke-dc at `40 2 * * *` (02:40), and spoke-dr at `0 3 * * *` (03:00).
Velero has retry hardening: AWS_RETRY_MODE=standard, AWS_MAX_ATTEMPTS=10, and a 512Mi memory request.
The latest recorded daily backup series completed on all four clusters.
Backup health alerting now watches stale, failed, partially failed, and warning-producing lab-daily backups.
External Vault logical export now uses a scoped periodic token and self-renewing replication scripts; this supports future OpenShift secret recovery checks without relying on a root-token repair.
Residual watch item: hub-dr archive persistence to the shared MinIO can be slow, and cluster-to-MinIO health checks showed intermittent timeouts.
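
The stagger between the daily schedules can be checked mechanically if a schedule is ever added or moved. A rough sketch, assuming each cron expression uses plain numeric minute and hour fields as the four listed above do; `cron_to_minutes` and `smallest_gap` are hypothetical helper names, and the check ignores wrap-around at midnight.

```shell
# cron_to_minutes "<min> <hour> ..." -> minutes after midnight on stdout.
# Returns 1 for anything other than plain numbers without leading zeros
# (steps, ranges, and lists such as */5 or 1,2 are out of scope here).
cron_to_minutes() {
  local expr="$1" min rest hour f
  min=${expr%% *}
  rest=${expr#* }
  hour=${rest%% *}
  for f in "$min" "$hour"; do
    case "$f" in
      ''|*[!0-9]*|0?*) return 1 ;;
    esac
  done
  echo $(( hour * 60 + min ))
}

# smallest_gap <minutes>... -> smallest difference between neighbouring start times.
smallest_gap() {
  printf '%s\n' "$@" | sort -n | awk '
    NR > 1 { d = $1 - prev; if (NR == 2 || d < min) min = d }
    { prev = $1 }
    END { if (NR > 1) print min }'
}

# Example with the four schedules from this page:
# smallest_gap "$(cron_to_minutes '0 2 * * *')" "$(cron_to_minutes '20 2 * * *')" \
#              "$(cron_to_minutes '40 2 * * *')" "$(cron_to_minutes '0 3 * * *')"
```

For the listed schedules the start times are 120, 140, 160, and 180 minutes after midnight, so the smallest gap is 20 minutes.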

Object storage

External MinIO bucket layout

Consumer	Bucket	Notes
ACM Observability on `hub-dc`	`acm-observability-hub-dc`	Dedicated bucket and scoped MinIO user.
ACM Observability on `hub-dr`	`acm-observability-hub-dr`	Dedicated bucket and scoped MinIO user. Do not share with `hub-dc`.
General OADP	`oadp-<cluster>`	One bucket per cluster.
ACM hub backup	`acm-cluster-backup`	Shared by the active/restore hub backup workflow.

The observability object-store credentials are seeded as local-only Kubernetes Secrets and are intentionally not stored in Git.

Spoke regional DR

`spoke-dc` → `spoke-dr` activation procedure

spoke-dr is a platform standby (decision 2026-05-07): platform services stay ready, and app workloads activate only when an owner-approved drill or a real outage moves traffic. The procedure is deliberately runbook-driven rather than automated; see spoke-dr for the rationale.

Confirm: spoke-dc is unrecoverable (or accepted-loss for the drill). Capture the cluster state.
Image pre-pull check: spoke-dr's pre-pull DaemonSet has the workload images warm; if not, wait for the pull to complete before activation.
Vault role: confirm that the kubernetes-spoke-dr auth mount has the per-app role bound to the workload service accounts (a role already exists for the platform smoke path; per-app roles are added at activation time).
Workload activation PR: owner opens a PR on lab-workloads populating clusters/spoke-dr/kustomization.yaml with the workload references currently on spoke-dc. Merge to main.
ApplicationSet pickup: the hub Argo CD generates spoke-dr-workloads, and the spoke's own Argo CD reconciles it. Track status in the spoke's local Argo CD UI, which is authoritative for managed-pull spokes per the GitOps model.
DNS / LB cutover: owner flips the public route to spoke-dr's ingress.
Evidence: capture activation timeline, sync time, route reachability, and smoke test results for the drill record.
Post-drill: revert clusters/spoke-dr/kustomization.yaml to resources: [] when traffic returns to spoke-dc.
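
The evidence and post-drill steps lend themselves to two small helpers. This is a sketch only: `log_event`, `kustomization_is_empty`, and the record file name `drill-evidence.tsv` are hypothetical, and the revert check assumes that `resources: []` is written on a single line as in the step above.

```shell
# log_event <record-file> <label>
# Appends "<UTC timestamp><TAB><label>" so the activation timeline can be rebuilt later.
log_event() {
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" >> "$1"
}

# kustomization_is_empty <path-to-kustomization.yaml>
# Returns 0 only when the file contains a line that is exactly "resources: []"
# (optional whitespace allowed), i.e. no workloads are referenced.
kustomization_is_empty() {
  grep -Eq '^resources:[[:space:]]*\[[[:space:]]*\][[:space:]]*$' "$1"
}

# Example, run from a lab-workloads checkout:
# log_event drill-evidence.tsv "activation PR merged"
# log_event drill-evidence.tsv "route flipped to spoke-dr"
# kustomization_is_empty clusters/spoke-dr/kustomization.yaml \
#   || echo "spoke-dr still references workloads" >&2
```

Logging each step as it happens keeps the timeline independent of anyone's memory of the drill, and the revert check gives a quick confirmation that spoke-dr is back to standby.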

Alerting

Backup health controls

oadp-backup-health runs in openshift-adp on all four clusters.
acm-backup-health runs in open-cluster-management-backup on the active hub.
Platform Prometheus has confirmed Velero backup timestamp metrics for the new scrape targets.
Validate alert behavior through an actual stale/failure cycle before treating this as a completed DR gate.

Validation

Read-only checks

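# hub-dc: backup configuration, schedule state, and alerting objects.
export KUBECONFIG=<hub-dc-kubeconfig>
# DataProtectionApplication and BackupStorageLocation status.
oc -n open-cluster-management-backup get dpa,bsl
# BackupSchedule phase, then the last backup time of the credentials, managed-clusters, and resources schedules.
oc -n open-cluster-management-backup get backupschedule hub-backup-daily -o jsonpath="{.status.phase}{'\n'}{.status.veleroScheduleCredentials.status.lastBackup}{'\n'}{.status.veleroScheduleManagedClusters.status.lastBackup}{'\n'}{.status.veleroScheduleResources.status.lastBackup}{'\n'}"
# The most recent Velero backups, sorted by start time.
oc -n open-cluster-management-backup get backups.velero.io --sort-by=.status.startTimestamp | tail -12
# Monitoring objects for the ACM backup and general OADP health checks.
oc -n open-cluster-management-backup get servicemonitor,prometheusrule
oc -n openshift-adp get servicemonitor,prometheusrule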

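# hub-dr: expect no active BackupSchedule or Restore before activation.
export KUBECONFIG=<hub-dr-kubeconfig>
oc -n open-cluster-management-backup get backupschedule,restore
# Image pre-pull DaemonSet and pod status.
oc -n openshift-image-prepull get ds,pods
# Image mirror configuration objects.
oc get imagedigestmirrorset,imagetagmirrorset,imagecontentsourcepolicy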
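
The hub-dr BackupSchedule/Restore check can also be wrapped as a pass/fail gate for the activation preflight. The sketch below is deliberately stricter than "no active" objects: it assumes that any ACM BackupSchedule or Restore object in the namespace should block activation until someone has looked at it. The function name `hub_dr_is_quiet` is hypothetical; the fully qualified resource names are used so that Velero's own `restores.velero.io` objects are not counted.

```shell
# hub_dr_is_quiet [namespace]
# Returns 0 when no ACM BackupSchedule or Restore objects exist,
# 1 when at least one exists (names are printed to stderr),
# 2 when the oc call itself fails.
# Assumption: KUBECONFIG already points at hub-dr.
hub_dr_is_quiet() {
  local ns="${1:-open-cluster-management-backup}" found
  found=$(oc -n "$ns" get \
    backupschedules.cluster.open-cluster-management.io,restores.cluster.open-cluster-management.io \
    -o name 2>/dev/null) || return 2
  if [ -n "$found" ]; then
    printf 'blocking objects in %s:\n%s\n' "$ns" "$found" >&2
    return 1
  fi
}

# Example:
# export KUBECONFIG=<hub-dr-kubeconfig>
# hub_dr_is_quiet && echo "preflight: no BackupSchedule or Restore on hub-dr"
```

A finished or intentionally passive Restore would also trip this gate, so a real preflight may want to inspect `.status.phase` before failing; that refinement is left out here.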
