KuberCoin Docs

Backup & Restore

Backup cadence, retention, and the verify-restore disaster-recovery drill.

KuberCoin operational data lives in MariaDB (per-surface schemas) and on disk under data/ for the worker services. This page documents the backup cadence, retention policy and the verify-restore drill that proves the backups are usable.

Cadence and retention

| Source | Tool | Cadence | Retention (local) | Retention (offsite) |
| --- | --- | --- | --- | --- |
| MariaDB schemas (explorer, wallet, rpc, open, alerter) | mariadb-dump --single-transaction | Hourly | 48 hours | 14 days (encrypted, off-host) |
| MariaDB schemas (full snapshot) | mariabackup | Daily 03:00 UTC | 7 days | 90 days (encrypted, off-host) |
| Worker on-disk state (sync/data/, backfill/data/) | tar + zstd | Daily | 14 days | 90 days |
| Configuration (*.kuber-coin.com/config.php) | Source-controlled (private repo) | On commit | Forever | Forever |
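
The local retention windows above could be enforced by a pruning pass after each backup run. The following is a minimal sketch, not the project's actual tooling: the function name `selectExpired`, the `{ path, createdAt }` record shape, and the example file names are assumptions made for illustration. Only the 48-hour window for hourly logical dumps comes from the table.

```javascript
// Hypothetical helper: pick the backups that have aged out of a retention window.
// `backups` is an array of { path, createdAt }, where createdAt is a Date.
function selectExpired(backups, retentionHours, now = new Date()) {
  const cutoffMs = now.getTime() - retentionHours * 60 * 60 * 1000;
  return backups
    .filter((b) => b.createdAt.getTime() < cutoffMs)
    .map((b) => b.path);
}

// Example: hourly logical dumps are kept locally for 48 hours (per the table).
// File names below are made up for the example.
const pruneNow = new Date("2024-01-03T12:00:00Z");
const localDumps = [
  { path: "explorer-a.sql.zst", createdAt: new Date("2024-01-01T11:00:00Z") }, // 49 h old
  { path: "explorer-b.sql.zst", createdAt: new Date("2024-01-01T13:00:00Z") }, // 47 h old
  { path: "explorer-c.sql.zst", createdAt: new Date("2024-01-03T11:00:00Z") }, // 1 h old
];
const expiredDumps = selectExpired(localDumps, 48, pruneNow);
console.log(expiredDumps); // only the 49-hour-old dump is selected
```

The sketch only selects candidates and leaves deletion out, so the selection can be logged or dry-run before anything is removed.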

RPO and RTO

  • Recovery Point Objective. 1 hour for transactional surfaces (last hourly logical dump) and 0 for source-controlled config.
  • Recovery Time Objective. 30 minutes for a single-surface restore from the latest local dump; 4 hours for a full-stack restore from offsite.

These targets assume the most recent local backup is intact. The verify-restore drill below exists specifically to detect silent corruption before it matters.
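
Because the 1-hour RPO depends on the hourly dump actually having run, a freshness check on the newest dump is one way to notice a stalled backup job. The sketch below is illustrative only: `checkRpo`, its parameters, and the 10-minute grace period (to allow for the time the dump itself takes) are assumptions, not documented KuberCoin behaviour.

```javascript
// Hypothetical freshness check: is the newest dump recent enough to meet the RPO?
// graceMinutes is an assumed allowance for dump duration and scheduling jitter.
function checkRpo(latestDumpAt, now, rpoMinutes = 60, graceMinutes = 10) {
  const ageMinutes = (now.getTime() - latestDumpAt.getTime()) / 60000;
  return {
    ageMinutes,
    withinRpo: ageMinutes <= rpoMinutes + graceMinutes,
  };
}

const checkedAt = new Date("2024-01-03T11:45:00Z");
const freshDump = checkRpo(new Date("2024-01-03T11:00:00Z"), checkedAt); // 45 minutes old
const staleDump = checkRpo(new Date("2024-01-03T09:00:00Z"), checkedAt); // 165 minutes old
console.log(freshDump.withinRpo, staleDump.withinRpo);
```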

Restoring a single surface

  1. Confirm the surface is fully drained: traffic switched at the edge or the relevant systemd unit stopped.
  2. Pick the dump: ls -lt /var/backups/kubercoin/<surface>/*.sql.zst | head.
  3. Decompress and replay: zstdcat <dump> | mariadb --default-character-set=utf8mb4 <db>.
  4. Run node scripts/verify-restore.mjs --surface <name> to confirm row-count parity against the dump's manifest.
  5. Re-enable traffic and watch /readyz + the SLO dashboard for 15 minutes before considering the restore complete.
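
The row-count parity check in step 4 can be pictured as a comparison of two maps from table name to row count. This is a sketch under stated assumptions, not the contents of scripts/verify-restore.mjs: the manifest format is not described on this page, so the plain `{ table: count }` shape, the function name `compareRowCounts`, and the table names are all hypothetical.

```javascript
// Hypothetical parity check: compare per-table row counts recorded in a dump
// manifest with counts measured on the restored database.
function compareRowCounts(manifestCounts, restoredCounts) {
  const mismatches = [];
  const tables = new Set([
    ...Object.keys(manifestCounts),
    ...Object.keys(restoredCounts),
  ]);
  for (const table of [...tables].sort()) {
    const expected = manifestCounts[table] ?? null; // null: table absent from manifest
    const actual = restoredCounts[table] ?? null;   // null: table absent after restore
    if (expected !== actual) {
      mismatches.push({ table, expected, actual });
    }
  }
  return mismatches;
}

// Example with made-up table names: one table is missing, one is a row short.
const manifestExample = { blocks: 1200, transactions: 5400, addresses: 310 };
const restoredExample = { blocks: 1200, transactions: 5399 };
const parityReport = compareRowCounts(manifestExample, restoredExample);
console.log(parityReport);
```

An empty result would mean parity holds. Reporting every mismatch instead of stopping at the first one makes it easier to tell a truncated replay from a single damaged table.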

Verify-restore drill

Once per quarter, the operations working group runs the disaster-recovery drill to prove that backups can actually be restored. The drill is fully automated:

pwsh scripts/dr-drill.ps1

The script spins up a temporary SQLite database (CI mode) or MariaDB schema (production mode) and replays the most recent dump. It then invokes scripts/verify-restore.mjs to compare row counts and a sample of primary-key checksums against the dump manifest, and asserts that the restore completed in under 5 minutes (the drill's RTO budget, well inside the 30-minute single-surface target). A failed drill is treated as a SEV-2 incident, and the next drill is run within 14 days regardless of the rotation schedule.
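
The drill's pass/fail rules can be summarised in a few lines. The sketch below is an assumption-laden illustration, not an excerpt from scripts/dr-drill.ps1: the `evaluateDrill` function and the `restoreMs`, `rowCountMismatches`, and `checksumMismatches` field names are hypothetical, while the 5-minute budget, the SEV-2 classification, and the 14-day re-run window come from the paragraph above.

```javascript
// Hypothetical summary of a drill run. The inputs are assumed to be collected
// by the drill script; none of these field names come from the real tooling.
const DRILL_BUDGET_MS = 5 * 60 * 1000; // restore must finish in under 5 minutes
const RETRY_WINDOW_DAYS = 14;          // a failed drill is repeated within 14 days

function evaluateDrill(run, finishedAt) {
  const failures = [];
  if (run.restoreMs >= DRILL_BUDGET_MS) failures.push("restore exceeded 5-minute budget");
  if (run.rowCountMismatches > 0) failures.push("row counts differ from manifest");
  if (run.checksumMismatches > 0) failures.push("primary-key checksum sample differs");

  const passed = failures.length === 0;
  const retryBy = passed
    ? null
    : new Date(finishedAt.getTime() + RETRY_WINDOW_DAYS * 24 * 60 * 60 * 1000);
  return { passed, severity: passed ? null : "SEV-2", failures, retryBy };
}

// Example: data matches, but the restore took 6 minutes, so the drill fails.
const drillResult = evaluateDrill(
  { restoreMs: 6 * 60 * 1000, rowCountMismatches: 0, checksumMismatches: 0 },
  new Date("2024-04-01T09:00:00Z"),
);
console.log(drillResult);
```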

Encryption and key custody

Offsite backups are encrypted with age using a fixed set of recipient public keys belonging to the operations working group. The recipient list is checked into ops/backups/recipients.txt and is rotated whenever a working-group member leaves. See the secret rotation runbook for the procedure, which mirrors HMAC and database credential rotation.