Operations
Health-check contract, cold-start behaviour, and deployment notes.
This page describes the operational contract that every KuberCoin PHP surface honours. It exists so that orchestrator authors, on-call operators and CI integrators can rely on a single shape across all services.
Health checks
Every surface exposes two HTTP endpoints, both returning JSON:
- GET /healthz: health probe. Served before bootstrap so the response time is < 5 ms even on a cold process. Returns {"status":"ok","service":"<name>"} with HTTP 200.
- GET /readyz: readiness probe. Runs registered dependency checks (node RPC, database, etc.). Returns {"status":"ok|degraded|fail","service":"<name>","checks":{...}} with HTTP 200 when all checks pass and HTTP 503 otherwise.
The full schema lives in ops/contracts/health.openapi.yaml and is enforced on every CI run by tests-e2e-cross/health-readyz-contract.spec.ts.
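As a rough illustration of the contract, the following TypeScript sketch checks a response against the shapes quoted above. It is not the project's contract test: the helper names are hypothetical, the contents of `checks` are left opaque because this page does not define them, and pairing `degraded` and `fail` with HTTP 503 is an assumption inferred from "HTTP 200 when all checks pass".

```typescript
// Shapes copied from the contract text; `checks` stays opaque on purpose.
type ReadyStatus = "ok" | "degraded" | "fail";

interface HealthzBody {
  status: "ok";
  service: string;
}

interface ReadyzBody {
  status: ReadyStatus;
  service: string;
  checks: Record<string, unknown>;
}

// True for plain JSON objects (not null, not arrays).
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper: does this status/body pair satisfy the /healthz contract?
function isValidHealthz(httpStatus: number, body: unknown): body is HealthzBody {
  return (
    httpStatus === 200 &&
    isRecord(body) &&
    body["status"] === "ok" &&
    typeof body["service"] === "string"
  );
}

// Hypothetical helper: does this status/body pair satisfy the /readyz contract?
function isValidReadyz(httpStatus: number, body: unknown): body is ReadyzBody {
  if (!isRecord(body)) return false;
  const status = body["status"];
  if (status !== "ok" && status !== "degraded" && status !== "fail") return false;
  if (typeof body["service"] !== "string" || !isRecord(body["checks"])) return false;
  // Assumption: "ok" pairs with 200; "degraded" and "fail" pair with 503.
  return status === "ok" ? httpStatus === 200 : httpStatus === 503;
}
```

A consumer would parse the JSON body, pass it in together with the HTTP status code, and treat a `false` result as a contract violation.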
Cold-start behaviour
PHP-FPM serves each surface with a small pool of long-lived workers. The first request to a freshly forked worker pays a one-time cost of roughly 50–150 ms, covering Composer autoload, configuration parsing and the first PDO connection. Subsequent requests on the same worker reuse the autoloader and the connection.
Two design choices keep cold-start invisible to health checks:
- /healthz is dispatched by the front controller before Composer autoload runs, so health checks never trigger autoload.
- When APCu is available, opcode and userland caches are warmed across worker generations, cutting the cold-start cost roughly in half. The process_apcu_available gauge in /metrics exposes this state per surface.
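The dispatch ordering in the first point can be sketched as follows. The real surfaces are PHP; this TypeScript version only illustrates the idea of answering /healthz before any expensive setup runs. `createFrontController` and the `bootstrap` callback are hypothetical names standing in for the front controller and for the autoload, configuration and connection work.

```typescript
interface HttpReply {
  status: number;
  body: string;
}

type AppHandler = (path: string) => HttpReply;

// Builds a front controller. `bootstrap` represents the one-time, expensive
// setup and returns the handler used for every non-health route.
function createFrontController(serviceName: string, bootstrap: () => AppHandler): AppHandler {
  let app: AppHandler | null = null;

  return (path: string): HttpReply => {
    // Fast path: answered before bootstrap is ever called.
    if (path === "/healthz") {
      return {
        status: 200,
        body: JSON.stringify({ status: "ok", service: serviceName }),
      };
    }
    // Slow path: bootstrap once per worker, then reuse the result.
    if (app === null) {
      app = bootstrap();
    }
    return app(path);
  };
}
```

Because the health route returns before `bootstrap` is reached, a probe that hits a fresh worker never pays the cold-start cost, and the first real request still triggers setup exactly once.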
Metrics
Every surface exposes Prometheus text on GET /metrics. The baseline metric set is:
- http_requests_total{service,route,method,status}: counter.
- http_request_duration_seconds{service,route,method,status}: histogram.
- process_apcu_available{service}: gauge, 1 when APCu is loaded, 0 otherwise.
- process_start_seconds{service}: gauge, Unix timestamp of when the worker began serving requests.
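To show how a consumer might read one of these gauges, here is a minimal TypeScript sketch that pulls a single labelled sample out of Prometheus text output. It is a simplification rather than a full parser: `readGauge` is a hypothetical helper, it matches the `service` label by substring, and it assumes label values contain no `}` character.

```typescript
// Returns the value of the first sample of `metric` whose label set contains
// service="<service>", or null when no such sample is present.
function readGauge(exposition: string, metric: string, service: string): number | null {
  for (const rawLine of exposition.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if (line === "" || line.startsWith("#")) continue;
    // Require the exact metric name followed by a label block.
    if (!line.startsWith(metric + "{")) continue;
    const close = line.indexOf("}");
    if (close === -1) continue;
    const labels = line.slice(metric.length + 1, close);
    if (!labels.includes(`service="${service}"`)) continue;
    // The sample value is the first token after the label block.
    const token = line.slice(close + 1).trim().split(/\s+/)[0];
    if (token === undefined || token === "") continue;
    const value = Number(token);
    return Number.isNaN(value) ? null : value;
  }
  return null;
}
```

Called with the scraped text, the metric name `process_apcu_available` and a service name, it yields 1 or 0 when the gauge is present and `null` when the surface did not report it.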
Deployment notes
- Configure orchestrator health probes to hit /healthz with a 1s timeout and a 3s period.
- Configure readiness probes to hit /readyz with a 5s timeout and a 10s period; treat 503 as not-ready.
- Scrape /metrics every 15s. Use service as the partitioning label in Grafana.
- Roll workers when process_apcu_available drops to 0 unexpectedly; this indicates the extension was unloaded or rebuilt.
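The notes above can be captured as data plus two small decision helpers. This TypeScript sketch is illustrative only: the type and function names are hypothetical, the numbers are copied from the list, and treating every non-200 status (not just 503) as not-ready is an assumption.

```typescript
interface ProbeSettings {
  path: string;
  timeoutSeconds: number;
  periodSeconds: number;
}

// Values copied from the deployment notes.
const HEALTH_PROBE: ProbeSettings = { path: "/healthz", timeoutSeconds: 1, periodSeconds: 3 };
const READINESS_PROBE: ProbeSettings = { path: "/readyz", timeoutSeconds: 5, periodSeconds: 10 };
const METRICS_SCRAPE_SECONDS = 15;

// Only HTTP 200 counts as ready. The notes name 503 explicitly; rejecting
// every other status as well is an assumption of this sketch.
function isReady(httpStatus: number): boolean {
  return httpStatus === 200;
}

// True when process_apcu_available went from 1 to 0 between two consecutive
// scrapes, which is the condition the notes give for rolling workers.
function apcuDropped(previous: number, current: number): boolean {
  return previous === 1 && current === 0;
}
```

A gauge that has been 0 since the worker started does not trip `apcuDropped`; only a transition from 1 to 0 does, which matches the "drops to 0 unexpectedly" wording.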