Backup & Restore

Backup cadence, retention, and the verify-restore disaster-recovery drill.

KuberCoin operational data lives in MariaDB (per-surface schemas) and on disk under `data/` for the worker services. This page documents the backup cadence, the retention policy, and the verify-restore drill that proves the backups are usable.

Cadence and retention

Source	Tool	Cadence	Retention (local)	Retention (offsite)
MariaDB schemas (explorer, wallet, rpc, open, alerter)	`mariadb-dump --single-transaction`	Hourly	48 hours	14 days (encrypted, off-host)
MariaDB schemas (full snapshot)	`mariabackup`	Daily 03:00 UTC	7 days	90 days (encrypted, off-host)
Worker on-disk state (`sync/data/`, `backfill/data/`)	tar + zstd	Daily	14 days	90 days
Configuration (`*.kuber-coin.com/config.php`)	Source-controlled (private repo)	On commit	Forever	Forever
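
As an illustration of how the local retention column could be enforced, the sketch below selects backups that have aged out of their window. The `kind` labels and the `{ path, kind, createdAt }` record shape are assumptions made for this example, not the format used by the actual backup tooling.

```javascript
// Local retention windows from the table above, in hours.
// The key names are hypothetical labels chosen for this sketch.
const LOCAL_RETENTION_HOURS = {
  hourlyDump: 48,        // mariadb-dump --single-transaction, hourly
  dailySnapshot: 7 * 24, // mariabackup, daily 03:00 UTC
  workerState: 14 * 24,  // tar + zstd, daily
};

// Return the backups that are older than the local retention window
// for their kind. `createdAt` is milliseconds since the Unix epoch.
function selectExpired(backups, now = Date.now()) {
  return backups.filter((b) => {
    const hours = LOCAL_RETENTION_HOURS[b.kind];
    if (hours === undefined) return false; // unknown kind: never prune
    return now - b.createdAt > hours * 3600 * 1000;
  });
}
```

Leaving unknown kinds untouched is a deliberately conservative choice: a mislabeled backup is kept rather than deleted.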

RPO and RTO

Recovery Point Objective (RPO). 1 hour for transactional surfaces (the last hourly logical dump) and 0 for source-controlled config.
Recovery Time Objective (RTO). 30 minutes for a single-surface restore from the latest local dump; 4 hours for a full-stack restore from offsite.

These targets assume the most recent local backup is intact. The verify-restore drill below exists specifically to detect silent corruption before it matters.
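
To make the RPO concrete, the following sketch computes the current data-loss exposure from a list of dump timestamps and checks it against the 1-hour target. The function name and the input format (timestamps in milliseconds since the Unix epoch) are assumptions for illustration only.

```javascript
const RPO_MS = 60 * 60 * 1000; // 1 hour, the RPO for transactional surfaces

// Given dump creation times (ms since epoch), report how much data would
// be lost if the surface failed at `now`, and whether that is within RPO.
function rpoStatus(dumpTimestamps, now = Date.now()) {
  if (dumpTimestamps.length === 0) {
    // No dump at all: exposure is unbounded.
    return { latest: null, exposureMs: Infinity, withinRpo: false };
  }
  const latest = Math.max(...dumpTimestamps);
  const exposureMs = now - latest;
  return { latest, exposureMs, withinRpo: exposureMs <= RPO_MS };
}
```

A monitoring job could call this with the modification times of the files in the local dump directory and alert when `withinRpo` is false.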

Restoring a single surface

1. Confirm the surface is fully drained: traffic switched at the edge or the relevant systemd unit stopped.
2. Pick the dump: `ls -lt /var/backups/kubercoin/<surface>/*.sql.zst | head`.
3. Decompress and replay: `zstdcat <dump> | mariadb --default-character-set=utf8mb4 <db>`.
4. Run `node scripts/verify-restore.mjs --surface <name>` to confirm row-count parity against the dump's manifest.
5. Re-enable traffic and watch `/readyz` and the SLO dashboard for 15 minutes before considering the restore complete.
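
The dump selection and replay steps could be scripted along the following lines. This is a minimal sketch: `planRestore` and the `{ path, mtimeMs }` listing format are hypothetical, and the function only plans the two commands rather than running them.

```javascript
// Pick the newest dump and describe the two processes to pipe together.
// `files` is a hypothetical listing of { path, mtimeMs } entries for
// /var/backups/kubercoin/<surface>/, for example gathered with fs.stat.
function planRestore(files, db) {
  const dumps = files
    .filter((f) => f.path.endsWith('.sql.zst'))
    .sort((a, b) => b.mtimeMs - a.mtimeMs); // newest first, like `ls -lt`
  if (dumps.length === 0) {
    throw new Error('no .sql.zst dumps found');
  }
  const dump = dumps[0].path;
  return {
    dump,
    // Together these are equivalent to:
    //   zstdcat <dump> | mariadb --default-character-set=utf8mb4 <db>
    decompress: ['zstdcat', dump],
    replay: ['mariadb', '--default-character-set=utf8mb4', db],
  };
}
```

Returning argument arrays rather than a shell string lets a caller hand them to `child_process.spawn` and pipe the first process's stdout into the second's stdin, which avoids shell quoting problems with unusual file names.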

Verify-restore drill

Once per quarter, the operations working group runs the disaster-recovery drill to prove that backups can actually be restored. The drill is fully automated:

pwsh scripts/dr-drill.ps1

The script spins up a temporary SQLite database (CI mode) or MariaDB schema (production mode) and replays the most recent dump. It then invokes `scripts/verify-restore.mjs` to compare row counts and a sample of primary-key checksums against the dump manifest, and asserts that the restore completed in under 5 minutes (the drill's time budget, well within the 30-minute single-surface RTO). A failed drill is treated as a SEV-2 incident; the next drill is run within 14 days regardless of the rotation schedule.
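
The checks the drill performs can be pictured with the sketch below. It is not the contents of `scripts/verify-restore.mjs`: the manifest shape (table name mapped to `{ rows, checksum }`) is an assumption, and a single checksum per table stands in for the sampled primary-key checksums.

```javascript
const DRILL_BUDGET_MS = 5 * 60 * 1000; // 5-minute limit asserted by the drill

// Compare restored table statistics against the dump manifest and check
// the elapsed restore time. Both maps are hypothetical:
//   { tableName: { rows: number, checksum: string } }
function checkRestore(manifest, restored, elapsedMs) {
  const failures = [];
  for (const [table, expected] of Object.entries(manifest)) {
    const actual = restored[table];
    if (!actual) {
      failures.push(`${table}: missing after restore`);
      continue;
    }
    if (actual.rows !== expected.rows) {
      failures.push(`${table}: row count ${actual.rows} != ${expected.rows}`);
    }
    if (actual.checksum !== expected.checksum) {
      failures.push(`${table}: checksum mismatch`);
    }
  }
  if (elapsedMs >= DRILL_BUDGET_MS) {
    failures.push(`restore took ${elapsedMs} ms, budget is ${DRILL_BUDGET_MS} ms`);
  }
  return { ok: failures.length === 0, failures };
}
```

Collecting every failure instead of stopping at the first gives the incident responder the full picture from a single drill run.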

Encryption and key custody

Offsite backups are encrypted with age to a set of recipient public keys held by the operations working group. The recipient list is checked into `ops/backups/recipients.txt` and is rotated whenever a working-group member leaves. See the secret rotation runbook for the procedure, which mirrors HMAC and database credential rotation.
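
A minimal sketch of how the recipients file might be consumed follows. It assumes the standard age recipients-file format (one recipient per line, with blank lines and lines starting with `#` ignored) and the age CLI's `-R` and `-o` flags; the helper names are hypothetical.

```javascript
// Parse the text of a recipients file into a list of recipient strings.
function parseRecipients(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#'));
}

// List recipients present before a rotation but absent afterwards,
// useful for confirming that a departed member's key was removed.
function removedRecipients(beforeText, afterText) {
  const after = new Set(parseRecipients(afterText));
  return parseRecipients(beforeText).filter((r) => !after.has(r));
}

// Build the argument vector for encrypting `input` to every recipient
// in the file, e.g. to pass to child_process.spawn.
function ageEncryptArgs(recipientsFile, input, output) {
  return ['age', '--encrypt', '-R', recipientsFile, '-o', output, input];
}
```

For example, `ageEncryptArgs('ops/backups/recipients.txt', 'dump.sql.zst', 'dump.sql.zst.age')` yields the arguments for encrypting one dump; the file names here are placeholders.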
