Backup & Restore
Backup cadence, retention, and the verify-restore disaster-recovery drill.
KuberCoin operational data lives in MariaDB (per-surface schemas) and
on disk under data/ for the worker services. This page
documents the backup cadence, retention policy and the verify-restore
drill that proves the backups are usable.
Cadence and retention
| Source | Tool | Cadence | Retention (local) | Retention (offsite) |
|---|---|---|---|---|
| MariaDB schemas (explorer, wallet, rpc, open, alerter) | mariadb-dump --single-transaction | Hourly | 48 hours | 14 days (encrypted, off-host) |
| MariaDB schemas (full snapshot) | mariabackup | Daily 03:00 UTC | 7 days | 90 days (encrypted, off-host) |
| Worker on-disk state (sync/data/, backfill/data/) | tar + zstd | Daily | 14 days | 90 days |
| Configuration (*.kuber-coin.com/config.php) | Source-controlled (private repo) | On commit | Forever | Forever |
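The hourly logical dump and its 48-hour local retention could be scripted roughly as follows. This is a minimal sketch, not the project's actual job: the BACKUP_ROOT default mirrors the /var/backups/kubercoin path used later on this page, but the file-naming scheme, the function names and the one-directory-per-surface layout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed layout: one directory per surface under BACKUP_ROOT.
BACKUP_ROOT="${BACKUP_ROOT:-/var/backups/kubercoin}"

# dump_path SURFACE STAMP: build a timestamped dump path (hypothetical naming scheme).
dump_path() {
  local surface="$1" stamp="$2"
  printf '%s/%s/%s-%s.sql.zst\n' "$BACKUP_ROOT" "$surface" "$surface" "$stamp"
}

# hourly_dump SURFACE DB: take one logical dump and compress it with zstd.
hourly_dump() {
  local surface="$1" db="$2" out
  out="$(dump_path "$surface" "$(date -u +%Y%m%dT%H%M%SZ)")"
  mkdir -p "$(dirname "$out")"
  # Write to a .partial file first so a failed dump never looks like a finished one.
  mariadb-dump --single-transaction "$db" | zstd -q > "$out.partial"
  mv "$out.partial" "$out"
}

# prune_local SURFACE [MAX_AGE_MINUTES]: delete dumps past the local retention
# window. The default of 2880 minutes is the 48-hour window from the table.
prune_local() {
  local surface="$1" max_age_min="${2:-2880}"
  find "$BACKUP_ROOT/$surface" -type f -name '*.sql.zst' -mmin "+$max_age_min" -delete
}
```

A scheduler entry would then call something like hourly_dump explorer <db> followed by prune_local explorer. Offsite copies follow their own retention and are not touched by this sketch.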
RPO and RTO
- Recovery Point Objective. 1 hour for transactional surfaces (last hourly logical dump) and 0 for source-controlled config.
- Recovery Time Objective. 30 minutes for a single-surface restore from the latest local dump; 4 hours for a full-stack restore from offsite.
These targets assume the most recent local backup is intact. The verify-restore drill below exists specifically to detect silent corruption before it matters.
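One way to notice an RPO breach early is to check the age of the newest local dump against the 1-hour target. The sketch below is illustrative only: the function names are hypothetical, it assumes dumps are *.sql.zst files in a single directory, and it relies on GNU stat (-c %Y) for the modification time.

```shell
#!/usr/bin/env bash
set -eu

# newest_dump DIR: print the most recently modified *.sql.zst file in DIR
# (prints an empty line if there is none).
newest_dump() {
  local dir="$1" newest="" f
  for f in "$dir"/*.sql.zst; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  printf '%s\n' "$newest"
}

# rpo_ok DIR [MAX_AGE_SECONDS]: succeed only if the newest dump is no older
# than the limit. The default of 3600 seconds is the 1-hour RPO.
rpo_ok() {
  local dir="$1" max_age="${2:-3600}" newest age
  newest="$(newest_dump "$dir")"
  [ -n "$newest" ] || return 1
  age=$(( $(date +%s) - $(stat -c %Y "$newest") ))
  [ "$age" -le "$max_age" ]
}
```

A freshness check of this kind only shows that a recent dump exists. It says nothing about whether the dump is restorable, which is what the drill covers.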
Restoring a single surface
- Confirm the surface is fully drained: traffic switched at the edge or the relevant systemd unit stopped.
- Pick the dump:
  ls -lt /var/backups/kubercoin/<surface>/*.sql.zst | head
- Decompress and replay:
  zstdcat <dump> | mariadb --default-character-set=utf8mb4 <db>
- Run node scripts/verify-restore.mjs --surface <name> to confirm row-count parity against the dump's manifest.
- Re-enable traffic and watch /readyz + the SLO dashboard for 15 minutes before considering the restore complete.
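The replay and verification steps above can be wrapped in a small script. This is a sketch under stated assumptions: restore_surface, run and the DRY_RUN switch are hypothetical names introduced here, and the script deliberately does not drain or re-enable traffic, which stays a manual decision.

```shell
#!/usr/bin/env bash
set -eu

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# restore_surface SURFACE DB DUMP: replay a chosen dump, then verify it.
# The surface must already be drained before this is called.
restore_surface() {
  local surface="$1" db="$2" dump="$3"
  [ -r "$dump" ] || { echo "dump not readable: $dump" >&2; return 1; }
  # The replay pipeline runs in a child bash with pipefail so that a zstdcat
  # failure is not masked by a successful mariadb exit status.
  run bash -o pipefail -c \
    'zstdcat "$1" | mariadb --default-character-set=utf8mb4 "$2"' _ "$dump" "$db"
  run node scripts/verify-restore.mjs --surface "$surface"
}
```

Running it first with DRY_RUN=1 prints the two commands without touching the database, which gives a chance to confirm the dump path and schema name before the real replay.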
Verify-restore drill
Once per quarter, the operations working group runs the disaster-recovery drill to prove that backups can actually be restored. The drill is fully automated:
pwsh scripts/dr-drill.ps1
The script spins up a temporary SQLite database (CI mode) or MariaDB
schema (production mode), replays the most recent dump, invokes
scripts/verify-restore.mjs to compare row counts and a
sample of primary-key checksums against the dump manifest, and asserts
the restore completed in <5 minutes (the RTO budget). A failed drill
is treated as a SEV-2 incident; the next drill is run within 14 days
regardless of the rotation schedule.
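The two assertions the drill makes, row-count parity and the 5-minute limit, can be illustrated in isolation. The internals of scripts/verify-restore.mjs and the manifest format are not shown on this page, so the sketch below assumes a hypothetical format of one "table<TAB>row_count" line per table and uses made-up function names. It does not cover the primary-key checksum sample.

```shell
#!/usr/bin/env bash
set -eu

# compare_counts MANIFEST ACTUAL: succeed only if both files list exactly the
# same tables with the same row counts. Line order does not matter.
compare_counts() {
  local manifest="$1" actual="$2"
  diff <(sort "$manifest") <(sort "$actual") > /dev/null
}

# within_budget START_EPOCH END_EPOCH [BUDGET_SECONDS]: succeed only if the
# elapsed time is strictly under the budget (default 300 s, i.e. <5 minutes).
within_budget() {
  local start="$1" end="$2" budget="${3:-300}"
  [ $(( end - start )) -lt "$budget" ]
}
```

A drill wrapper would record date +%s before and after the replay, then call both functions and fail the run if either returns non-zero.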
Encryption and key custody
Offsite backups are encrypted with age using a fixed set
of recipient public keys held by the operations working group. The
recipient list is checked into ops/backups/recipients.txt
and is rotated whenever a working-group member leaves. See the
secret rotation runbook for the
procedure that mirrors HMAC and database credential rotation.
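Encrypting a backup to the checked-in recipient list might look like the sketch below. The age flags used (-R for a recipients file, -d to decrypt, -i for an identity file, -o for the output path) are part of the age CLI. The function names and the RECIPIENTS override are assumptions for illustration, and the actual offsite upload step is not shown.

```shell
#!/usr/bin/env bash
set -eu

# count_recipients FILE: number of recipient lines in the file. age ignores
# blank lines and lines starting with '#', so they are not counted here either.
count_recipients() {
  grep -Ecv '^[[:space:]]*(#|$)' "$1" || true
}

# encrypt_for_offsite INPUT OUTPUT: encrypt INPUT to every listed recipient.
encrypt_for_offsite() {
  local in="$1" out="$2"
  age -R "${RECIPIENTS:-ops/backups/recipients.txt}" -o "$out" "$in"
}

# decrypt_backup IDENTITY INPUT OUTPUT: decrypt with one member's private key.
decrypt_backup() {
  local identity="$1" in="$2" out="$3"
  age -d -i "$identity" -o "$out" "$in"
}
```

Because age encrypts to the recipients present at encryption time, removing a key from the list only affects backups made afterwards. Archives written earlier stay readable by the removed key unless they are re-encrypted.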