Runbook
This runbook documents the procedures for the most common GOVERN operational incidents. All procedures assume you have admin console access and Cloudflare dashboard access.
API latency spike
Symptoms: P95 latency above 500ms, assessment throughput dropping.
Diagnosis:
- Check Cloudflare Workers dashboard for CPU time and error rate
- Check Supabase dashboard for connection pool saturation and slow queries
- Check Upstash Redis for connection count and eviction rate
Resolution:
- If Supabase connections are saturated → scale the connection pool via the Supabase dashboard
- If a slow query is the culprit → identify it via Supabase Query Performance and add an index if needed
- If Redis evictions are high → increase Redis plan or reduce TTLs
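To make the 500ms symptom threshold concrete, here is a minimal sketch of a nearest-rank P95 calculation over raw latency samples. It assumes you can export one latency value in milliseconds per line (for example from request logs). The function names p95 and p95_breached are hypothetical helpers, not part of any GOVERN tooling.

```shell
#!/usr/bin/env bash
# Sketch only: p95 and p95_breached are illustrative helpers, not GOVERN tools.

# Reads one latency sample (ms) per line on stdin and prints the
# nearest-rank 95th percentile. Exits non-zero if there is no input.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest-rank index: ceil(0.95 * NR), done in integer arithmetic
      r = int((NR * 95 + 99) / 100)
      print v[r]
    }'
}

# Returns 0 if the P95 of the samples on stdin exceeds the threshold
# (default 500 ms, the symptom threshold above), 1 if not, 2 on empty input.
p95_breached() {
  threshold="${1:-500}"
  value=$(p95) || return 2
  awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v > t) }'
}
```

For example, printf '120\n480\n510\n300\n' | p95 prints 510, so p95_breached would report a breach for those four samples.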
Assessment queue backup
Symptoms: Assessment volume chart shows batches instead of steady throughput.
Diagnosis: Check for Durable Object errors in the Cloudflare Workers log for the Coordinator DO.
Resolution: The Coordinator DO self-heals on restart. If the queue is persistently backed up, trigger a manual flush:
curl -X POST https://govern-api.archetypal.ai/api/internal/flush-queue \
  -H "Authorization: Bearer $GOVERN_ADMIN_TOKEN"
Monitoring agent offline
Symptoms: One or more agents showing red in the admin fleet view.
Diagnosis: SSH to the host running the agent and check:
docker logs govern-agent --tail 50
Resolution:
- Network connectivity issue → verify the host can reach govern-api.archetypal.ai on port 443
- Credential rotation → update the GOVERN_API_KEY env var and restart the container
- Container crash → restart with docker restart govern-agent
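The checks above could be combined into a small triage helper run on the agent host. This is a sketch, not an official tool: the function names are hypothetical, and it assumes curl and docker are installed on the host. It cannot detect a stale GOVERN_API_KEY by itself, so it only points you back to the logs in that case.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not GOVERN tooling.

API_HOST="govern-api.archetypal.ai"   # host named in the runbook
CONTAINER="govern-agent"              # container name used in the runbook

# Returns 0 if an HTTPS connection to the API host (port 443) can be
# established within 5 seconds. Any HTTP response counts as reachable.
can_reach_api() {
  curl --silent --output /dev/null --connect-timeout 5 "https://${API_HOST}"
}

# Prints "running", "stopped", or "missing" for the agent container.
container_state() {
  if ! running=$(docker inspect --format '{{.State.Running}}' "$CONTAINER" 2>/dev/null); then
    echo "missing"
  elif [ "$running" = "true" ]; then
    echo "running"
  else
    echo "stopped"
  fi
}

# Walks the resolution list: network first, then container state.
triage_agent() {
  if ! can_reach_api; then
    echo "network: cannot reach ${API_HOST}:443"
    return 1
  fi
  case "$(container_state)" in
    running) echo "container running; check logs for credential errors" ;;
    stopped) echo "container stopped; try: docker restart ${CONTAINER}" ;;
    missing) echo "container not found on this host" ;;
  esac
}
```

Calling triage_agent prints a single line naming the most likely cause, checking the network before the container so that a blocked host is not mistaken for a crash.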
Database disk space warning
Symptoms: Supabase disk usage above 80%.
Resolution: Run the assessment archival job to move old assessments to cold storage, first with --dry-run and then for real:
govern-admin maintenance archive-assessments --before 2025-01-01 --dry-run
govern-admin maintenance archive-assessments --before 2025-01-01
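To reduce the chance of skipping the dry run, the two commands can be wrapped so the real run only happens after a successful dry run and an explicit opt-in. This is a sketch under assumptions: archive_before, GOVERN_ADMIN, and ARCHIVE_CONFIRM are hypothetical names introduced here, and it assumes govern-admin exits non-zero when the dry run fails.

```shell
#!/usr/bin/env bash
# Sketch only: archive_before, GOVERN_ADMIN and ARCHIVE_CONFIRM are
# illustrative names, not part of the govern-admin CLI.

# Command to invoke; overridable so the wrapper can be exercised with a stub.
GOVERN_ADMIN="${GOVERN_ADMIN:-govern-admin}"

# Usage: archive_before YYYY-MM-DD
archive_before() {
  cutoff="$1"
  # Reject anything that is not shaped like YYYY-MM-DD
  case "$cutoff" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "usage: archive_before YYYY-MM-DD" >&2; return 2 ;;
  esac

  # Always run the dry run first; stop if it fails
  "$GOVERN_ADMIN" maintenance archive-assessments --before "$cutoff" --dry-run || return 1

  # Require an explicit opt-in before the real run
  if [ "${ARCHIVE_CONFIRM:-no}" != "yes" ]; then
    echo "dry run only; set ARCHIVE_CONFIRM=yes to archive"
    return 0
  fi
  "$GOVERN_ADMIN" maintenance archive-assessments --before "$cutoff"
}
```

Running archive_before 2025-01-01 performs only the dry run; ARCHIVE_CONFIRM=yes archive_before 2025-01-01 performs the dry run and then the real archival.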