What a disaster recovery plan actually is
A disaster recovery (DR) plan is the documented, tested procedure your team follows to restore IT systems and data after an outage — a failed server, ransomware, a deleted database, a flooded data centre, or a cloud-region failure. It is not the same as a backup. A backup is a copy of data; a DR plan is the playbook that turns that copy back into a running business.
Two numbers anchor every DR plan: RTO and RPO. Get these right and the rest of the plan almost writes itself.
RTO vs RPO: the two numbers that matter
- Recovery Time Objective (RTO) is the maximum acceptable time a system can be down before it causes serious harm. If your e-commerce site has an RTO of 2 hours, recovery must complete within 2 hours of the incident.
- Recovery Point Objective (RPO) is the maximum acceptable amount of data, measured in time, you can afford to lose. An RPO of 15 minutes means your backups (or replication) must be no older than 15 minutes, so you lose at most 15 minutes of work.
A simple way to remember it: RTO looks forward ("how long until we're back?") and RPO looks backward ("how much recent data can we lose?").
| Concept | Question it answers | Driven by | A tighter target requires |
|---|---|---|---|
| RTO | How fast must we recover? | Failover/restore speed | More automation, warm standby |
| RPO | How much data can we lose? | Backup/replication frequency | More frequent backups, replication |
Tighter targets cost more, so set them per workload, not for everything at once.
Step 1: Run a business impact analysis (BIA)
List every system and ask: what happens to the business each hour this is down, and how much data loss is tolerable? Rank workloads into tiers.
| Tier | Example workload | RTO | RPO |
|---|---|---|---|
| 1 — Critical | Payment/ERP database, customer-facing app | < 1 hr | < 15 min |
| 2 — Important | Internal apps, business email | < 4 hr | < 1 hr |
| 3 — Standard | File shares, reporting, dev/test | < 24 hr | < 24 hr |
Be honest. A 5-minute RTO for a marketing blog is wasted money; a 24-hour RPO for a payment ledger is a business-ending mistake.
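To make the tiering table usable by scripts, the targets can be encoded once and looked up per workload. The following is a minimal sketch, not a prescribed tool: the function name `tier_targets` and the workload names are hypothetical, and the minute values are simply the upper bounds from the table above.

```shell
# Hypothetical helper: map a BIA tier to its RTO/RPO ceiling in minutes.
# Prints "<rto_minutes> <rpo_minutes>" using the upper bounds from the tier table.
tier_targets() {
  case "$1" in
    1) echo "60 15" ;;      # Critical: RTO < 1 hr, RPO < 15 min
    2) echo "240 60" ;;     # Important: RTO < 4 hr, RPO < 1 hr
    3) echo "1440 1440" ;;  # Standard: RTO < 24 hr, RPO < 24 hr
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Illustrative inventory in "workload:tier" form (names are assumptions).
for entry in payments-db:1 business-email:2 file-shares:3; do
  workload=${entry%%:*}
  tier=${entry##*:}
  set -- $(tier_targets "$tier")   # $1 = RTO minutes, $2 = RPO minutes
  printf '%-15s tier=%s rto=%smin rpo=%smin\n' "$workload" "$tier" "$1" "$2"
done
```

Keeping the targets in one place means later checks (backup freshness, drill results) can compare against the same numbers the business agreed to.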
Step 2: Choose a recovery strategy per tier
Match each tier to an architecture:
- Backup and restore (cheapest, slowest): scheduled backups to object storage; rebuild on demand. Fits Tier 3 and many Tier 2 workloads.
- Pilot light / warm standby: a minimal copy of the environment kept running in a second location, scaled up during failover. Fits Tier 1 with moderate RTO.
- Hot standby / active-active: a fully running replica that takes over near-instantly. Highest cost; reserve for Tier 1 systems where minutes of downtime are unacceptable.
For Saudi organisations, also confirm where each copy physically lives. Under the Personal Data Protection Law (PDPL) and National Cybersecurity Authority (NCA) guidance, keeping both primary and DR copies on in-Kingdom cloud infrastructure generally keeps you on the right side of data-residency expectations and can avoid cross-border transfer reviews.
Step 3: Apply the 3-2-1 backup rule
Whatever the tier, design backups around the proven 3-2-1 rule:
- 3 copies of your data (1 primary + 2 backups)
- 2 different media or storage types
- 1 copy off-site (a different region or provider)
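A modern variant, 3-2-1-1-0, adds 1 immutable/offline copy (protects against ransomware) and 0 errors verified on restore. Immutability matters because ransomware operators often try to encrypt or delete backups before hitting production, and an immutable copy cannot be altered during its retention period.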
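The basic 3-2-1 rule is easy to check mechanically once the copies are written down. The sketch below assumes a simple inventory format of one copy per line (`<name> <media> <onsite|offsite>`); the function name `check_321` and the sample entries are hypothetical, and it does not cover the immutable-copy or restore-verification parts of 3-2-1-1-0.

```shell
# Hypothetical check: reads one copy per line as "<name> <media> <onsite|offsite>"
# and reports whether the set has at least 3 copies, 2 media types, 1 off-site copy.
check_321() {
  awk '
    NF >= 3 { copies++; media[$2] = 1; if ($3 == "offsite") offsite++ }
    END {
      kinds = 0
      for (m in media) kinds++
      printf "copies=%d media=%d offsite=%d\n", copies, kinds, offsite
      # awk exit status 0 means the rule is satisfied
      exit !(copies >= 3 && kinds >= 2 && offsite >= 1)
    }'
}

# Illustrative inventory for one workload (names are assumptions).
printf '%s\n' \
  'primary   block-disk     onsite' \
  'nightly   object-storage onsite' \
  'replica   object-storage offsite' | check_321 && echo "3-2-1 satisfied"
```

A non-zero exit status makes the check easy to wire into a scheduled job or a review checklist.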
Here is a reliable, deduplicating, encrypted backup of a database and app directory using restic to S3-compatible object storage:
# Install restic (Debian/Ubuntu)
sudo apt-get update && sudo apt-get install -y restic
# Point restic at your object storage bucket
export RESTIC_REPOSITORY="s3:https://s3.example.com/dr-backups"
export AWS_ACCESS_KEY_ID="<access-key>"
export AWS_SECRET_ACCESS_KEY="<secret-key>"
export RESTIC_PASSWORD="<strong-repo-passphrase>" # keep this safe & off-server
# One-time: initialise the encrypted repository
restic init
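# Dump the database, then back up the dump + app files
# DB_PASS must already be set in the environment (it is not defined in this example);
# a MySQL client option file avoids exposing the password in the process list.
mysqldump --single-transaction --routines --triggers \
  -u backup -p"$DB_PASS" appdb > /var/backups/appdb.sql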
restic backup /var/backups/appdb.sql /srv/app --tag nightly
# Enforce retention (bounds storage use and sets how far back you can restore)
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
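Your backup frequency must satisfy your RPO. If RPO is 1 hour, schedule the job at least hourly via cron and confirm each run finishes within a few minutes, because the worst-case data loss is the interval between backups plus the time a run takes to complete: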
# /etc/cron.d/dr-backup — hourly backup at minute 5
5 * * * * root /usr/local/bin/dr-backup.sh >> /var/log/dr-backup.log 2>&1
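The cron entry above calls `/usr/local/bin/dr-backup.sh`, which is not shown. One possible shape for it is sketched below under several assumptions: the `RESTIC_*` and `AWS_*` variables are exported elsewhere, database credentials live in a hypothetical option file `/etc/mysql/dr-backup.cnf`, and the `DRY_RUN` and `LOCK_DIR` variables are inventions of this sketch. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of a hypothetical /usr/local/bin/dr-backup.sh for the cron entry above.
# DRY_RUN defaults to 1 so the sketch only prints commands; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}
LOCK_DIR=${LOCK_DIR:-/tmp/dr-backup.lock}
DUMP=/var/backups/appdb.sql

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

dr_backup() {
  # mkdir is atomic, so an overlapping second run fails here and exits early.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "previous backup still running, skipping" >&2
    return 1
  fi
  status=0
  start=$(date +%s)
  # --defaults-extra-file must be the first option; the path is an assumption.
  run mysqldump --defaults-extra-file=/etc/mysql/dr-backup.cnf \
      --single-transaction --routines --triggers \
      --result-file="$DUMP" appdb &&
  run restic backup "$DUMP" /srv/app --tag hourly || status=$?
  rmdir "$LOCK_DIR"
  # The logged duration shows whether runs stay well inside the backup interval.
  echo "backup finished: status=$status duration=$(( $(date +%s) - start ))s"
  return "$status"
}

dr_backup
```

The lock prevents a slow run from overlapping the next one, and the logged duration gives a simple signal for whether the schedule still fits the RPO.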
For tighter RPOs on databases, use continuous methods — MySQL/MariaDB binary-log shipping or PostgreSQL WAL archiving / streaming replication — rather than relying on snapshots alone.
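As one illustration of a continuous method, PostgreSQL WAL archiving is driven by a handful of server settings. The sketch below prints a settings fragment modelled on the simple local-copy example in the PostgreSQL documentation; the helper name `wal_archive_conf` and the archive directory are hypothetical, and a production setup would more likely ship segments to remote or object storage. Changing `archive_mode` requires a server restart.

```shell
# Hypothetical helper: print PostgreSQL settings for WAL archiving to a directory.
#   $1 archive directory   $2 archive_timeout in seconds (default 60)
wal_archive_conf() {
  archive_dir=$1
  timeout_s=${2:-60}
  cat <<EOF
wal_level = replica
archive_mode = on
archive_command = 'test ! -f $archive_dir/%f && cp %p $archive_dir/%f'
archive_timeout = $timeout_s
EOF
}

# Example: force a WAL segment switch at least every 300 seconds, which
# bounds how much committed work can sit unarchived on a quiet server.
wal_archive_conf /mnt/wal-archive 300
```

The output could be written to a configuration include file and reviewed before being applied.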
Step 4: Write the recovery runbook
A backup nobody can restore under pressure is worthless. Document the exact restore steps as a runbook anyone on call can follow:
- Declare the incident and notify the DR coordinator.
- Provision recovery infrastructure (a standby cloud server or hosting environment you can spin up fast).
- Restore data from the latest verified backup.
- Verify integrity, then redirect traffic (DNS/load balancer).
- Communicate status to stakeholders.
A restore from the restic example above is simply:
# List recovery points, then restore the latest into a target path
restic snapshots
restic restore latest --target /srv/recovery
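Before redirecting traffic, it helps to confirm the restored dump is complete. The check below is a minimal sketch with a hypothetical function name, `verify_dump`; it relies on the `-- Dump completed` trailer that mysqldump writes as its last line by default (the trailer is absent if comments were disabled at dump time), and it does not replace loading the dump into a test database.

```shell
# Hypothetical check for a restored mysqldump file: it must exist, be non-empty,
# and end with the "-- Dump completed" trailer that mysqldump writes by default.
verify_dump() {
  dump=$1
  if [ ! -s "$dump" ]; then
    echo "FAIL: $dump is missing or empty" >&2
    return 1
  fi
  if ! tail -n 1 "$dump" | grep -q '^-- Dump completed'; then
    echo "FAIL: $dump looks truncated (no completion trailer)" >&2
    return 1
  fi
  echo "OK: $dump"
}

# restic recreates the original path under the target directory, so the dump
# from the example above would be checked like this (not executed here):
#   verify_dump /srv/recovery/var/backups/appdb.sql
```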
Keep the runbook, with names, phone numbers, and the locations of credentials (not the credentials themselves), somewhere reachable even if your main systems are down.
Step 5: Test, then test again
DR plans rot. Schedule recovery drills at least quarterly for Tier 1 systems:
- Tabletop test: walk through the runbook on paper.
- Restore test: actually restore a backup into an isolated environment and confirm the data is correct (the "0" in 3-2-1-1-0).
- Full failover: cut over to the DR site and run on it.
After each drill, measure your actual recovery time and data loss against your RTO and RPO. If you missed them, fix the gap — more frequent backups, more automation, or a warmer standby — and re-test.
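The comparison described above is simple arithmetic on three timestamps. The following sketch assumes times are recorded as Unix epoch seconds (for example with `date +%s` at each stage); the function name `drill_score` and the sample numbers are hypothetical.

```shell
# Hypothetical drill scorecard. All times are Unix epoch seconds.
#   $1 last good backup   $2 incident start   $3 service restored
#   $4 RTO in minutes     $5 RPO in minutes
drill_score() {
  # Round up to whole minutes so a near miss is never reported as a pass.
  loss_min=$(( ($2 - $1 + 59) / 60 ))      # data written after the last backup
  downtime_min=$(( ($3 - $2 + 59) / 60 ))  # incident start to service restored
  result=PASS
  [ "$downtime_min" -le "$4" ] || result=FAIL
  [ "$loss_min" -le "$5" ] || result=FAIL
  echo "downtime=${downtime_min}min (RTO $4) loss=${loss_min}min (RPO $5) $result"
  [ "$result" = PASS ]
}

# Illustrative Tier 1 drill: backup at t=0, incident 10 minutes later,
# service restored 90 minutes after the incident, against RTO 60 / RPO 15.
drill_score 0 600 6000 60 15 || echo "targets missed: fix the gap and re-test"
```

In this example the data loss (10 minutes) is within the RPO but the 90-minute downtime exceeds the 60-minute RTO, so the drill fails on recovery time alone.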
Putting it together
A workable DR plan is short: a one-page tiering table, per-tier RTO/RPO targets, a 3-2-1-1-0 backup design, a tested runbook, and a drill schedule. Start with your most critical workload, prove the restore works end-to-end, then expand tier by tier.
If you want backups and standby infrastructure that stay inside the Kingdom for PDPL and NCA compliance, explore the cloud backup cluster for related guides, or create a Skyline Cloud account and provision in-Kingdom object storage, cloud servers, and backup in minutes.