Resilience
Backup and DR gates
The next major operational milestone is a controlled hub-dc to hub-dr ACM activation drill. Do not start it until every gate below passes.
Drill gates
Current sequence
- Fresh ACM hub backup: all critical backup streams must meet the agreed RPO.
- ACM/MCE image readiness: mirror or pre-pull required images on hub-dr before activation.
- Activation preflight: prove that hub-dr has no active BackupSchedule or Restore, and that the restore manifests dry-run cleanly.
- Controlled activation: proceed only after explicit user approval, because activation can move managed-cluster ownership to hub-dr.
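The freshness gate can be checked mechanically by comparing each backup stream's last-backup timestamp with the RPO. The following is a minimal sketch, assuming GNU date is available; backup_within_rpo is a hypothetical helper name, and the RPO in seconds is a placeholder because the agreed value is not stated here.

```shell
#!/usr/bin/env bash
# Sketch only: backup_within_rpo is a hypothetical helper, not part of ACM or OADP.
# Requires GNU date (uses `date -d` to parse RFC 3339 timestamps).

# Succeeds (exit 0) when the backup is no older than the RPO.
#   $1  last-backup timestamp, e.g. 2024-01-01T02:00:00Z
#   $2  RPO in seconds (placeholder; use the agreed value)
#   $3  optional "now" as epoch seconds, defaults to the current time
backup_within_rpo() {
  local last_backup_ts="$1"
  local rpo_seconds="$2"
  local now_epoch="${3:-$(date -u +%s)}"
  local backup_epoch age

  # An empty timestamp means the stream has never reported a backup.
  [ -n "$last_backup_ts" ] || return 2
  backup_epoch="$(date -u -d "$last_backup_ts" +%s)" || return 2

  age=$(( now_epoch - backup_epoch ))
  echo "age_seconds=${age} rpo_seconds=${rpo_seconds}"
  [ "$age" -le "$rpo_seconds" ]
}
```

Each lastBackup timestamp reported in the BackupSchedule status on hub-dc can be passed as the first argument; the gate holds only if every stream returns success.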
OADP
General backup state
- General OADP daily schedules are staggered: hub-dc at 0 2 * * *, hub-dr at 20 2 * * *, spoke-dc at 40 2 * * *, and spoke-dr at 0 3 * * *.
- Velero has retry hardening: AWS_RETRY_MODE=standard, AWS_MAX_ATTEMPTS=10, and a 512Mi memory request.
- Latest recorded daily series completed across all four clusters.
- Residual watch item: hub-dr archive persistence to shared MinIO can be slow, and cluster-to-MinIO health checks showed intermittent timeouts.
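Intermittent timeouts are easier to track when a probe is sampled repeatedly and the failures are counted. The following is a minimal sketch; sample_probe and PROBE_INTERVAL_SECONDS are hypothetical names introduced for illustration, and the probe command itself is supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: sample_probe is a hypothetical helper for counting probe failures.

# Runs the given command N times and reports how many runs failed.
#   $1       number of samples
#   $2...    probe command and its arguments
# PROBE_INTERVAL_SECONDS (default 1) sets the pause between samples.
sample_probe() {
  local samples="$1"; shift
  local interval="${PROBE_INTERVAL_SECONDS:-1}"
  local failures=0 i

  for (( i = 1; i <= samples; i++ )); do
    # Any non-zero exit (including a timeout) counts as a failed sample.
    if ! "$@" >/dev/null 2>&1; then
      failures=$(( failures + 1 ))
    fi
    # Skip the pause after the final sample.
    if [ "$i" -lt "$samples" ]; then
      sleep "$interval"
    fi
  done

  echo "samples=${samples} failures=${failures}"
  [ "$failures" -eq 0 ]
}
```

One possible probe is MinIO's liveness endpoint, for example sample_probe 20 curl -fsS --max-time 5 -o /dev/null https://<minio-endpoint>/minio/health/live, where the endpoint, sample count, and timeout are placeholders. To reflect the cluster-to-MinIO path, the probe would need to run from inside the cluster rather than from a workstation.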
Validation
Read-only checks
export KUBECONFIG=<hub-dc-kubeconfig>
oc -n open-cluster-management-backup get dpa,bsl
oc -n open-cluster-management-backup get backupschedule hub-backup-daily -o jsonpath="{.status.phase}{'\n'}{.status.veleroScheduleCredentials.status.lastBackup}{'\n'}{.status.veleroScheduleManagedClusters.status.lastBackup}{'\n'}{.status.veleroScheduleResources.status.lastBackup}{'\n'}"
oc -n open-cluster-management-backup get backups.velero.io --sort-by=.status.startTimestamp | tail -12
export KUBECONFIG=<hub-dr-kubeconfig>
oc -n open-cluster-management-backup get backupschedule,restores.cluster.open-cluster-management.io
oc get imagedigestmirrorset,imagetagmirrorset,imagecontentsourcepolicy
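The hub-dr preflight can be turned into a pass/fail check by failing whenever the listing returns any object. The following is a minimal sketch; assert_none_listed is a hypothetical helper, and it is deliberately stricter than the gate: it flags any existing BackupSchedule or Restore, not only active ones, so a completed Restore still needs a manual look.

```shell
#!/usr/bin/env bash
# Sketch only: assert_none_listed is a hypothetical helper, not an oc feature.

# Reads `oc get ... -o name` output on stdin and fails if any object is listed.
assert_none_listed() {
  local found
  # Keep only non-blank lines; grep exits 1 when nothing matches, which is fine here.
  found="$(grep -v '^[[:space:]]*$' || true)"

  if [ -n "$found" ]; then
    echo "found existing objects:"
    echo "$found"
    return 1
  fi
  echo "none found"
}
```

With KUBECONFIG pointing at hub-dr, a possible invocation is oc -n open-cluster-management-backup get backupschedule,restores.cluster.open-cluster-management.io -o name | assert_none_listed. When no objects exist, oc prints its "No resources found" message on stderr, so stdin is empty and the check passes.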