Runbook

This runbook documents the procedures for the most common GOVERN operational incidents. All procedures assume you have admin console access and Cloudflare dashboard access.

API latency spike

Symptoms: P95 latency above 500 ms, assessment throughput dropping.

Diagnosis:

Check the Cloudflare Workers dashboard for CPU time and error rate
Check the Supabase dashboard for connection pool saturation and slow queries
Check Upstash Redis for connection count and eviction rate
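To confirm the P95 symptom from raw numbers rather than a chart, a small helper can compute it from a list of latency samples. This is a minimal sketch, assuming you have exported one latency value in milliseconds per line; the function name p95 is hypothetical and it uses the nearest-rank method, which may differ slightly from how the dashboards calculate percentiles.

```shell
# Sketch only: p95 is a hypothetical helper, not part of any GOVERN tooling.
# Reads one latency sample (ms) per line on stdin and prints the
# nearest-rank 95th percentile. Exits non-zero if there is no input.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest-rank index: ceil(0.95 * NR), done in integer arithmetic.
      i = int((NR * 95 + 99) / 100)
      print v[i]
    }'
}

# Example: printf "%s\n" 120 80 700 90 | p95
```

Samples could also be collected with repeated curl calls using --write-out '%{time_total}\n'. Note that curl reports seconds, so the threshold to compare against would be 0.5 rather than 500.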

Resolution:

If Supabase connections are saturated → scale the connection pool via the Supabase dashboard
If a slow query is the culprit → identify it via Supabase Query Performance and add an index if needed
If Redis evictions are high → increase Redis plan or reduce TTLs

Assessment queue backup

Symptoms: The assessment volume chart shows assessments completing in batches instead of at a steady rate.

Diagnosis: Check the Cloudflare Workers log for errors from the Coordinator Durable Object (DO).

Resolution: The Coordinator DO self-heals on restart. If the queue is persistently backed up, trigger a manual flush:

curl -X POST https://govern-api.archetypal.ai/api/internal/flush-queue \
  -H "Authorization: Bearer $GOVERN_ADMIN_TOKEN"
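During an incident it is easy to send the flush request with an unset token or to miss a failed response. The wrapper below is a minimal sketch of the same call with two guards added; the function name flush_queue and its exit code 2 are assumptions for illustration, and the endpoint's response format is not documented here.

```shell
# Sketch only: flush_queue is a hypothetical wrapper around the manual flush call.
flush_queue() {
  # Refuse to send a request with an empty bearer token.
  if [ -z "${GOVERN_ADMIN_TOKEN:-}" ]; then
    echo "GOVERN_ADMIN_TOKEN is not set" >&2
    return 2
  fi
  # --fail makes curl exit non-zero on HTTP 4xx/5xx responses;
  # --silent --show-error hides the progress meter but keeps error messages.
  curl --fail --silent --show-error -X POST \
    "https://govern-api.archetypal.ai/api/internal/flush-queue" \
    -H "Authorization: Bearer $GOVERN_ADMIN_TOKEN"
}
```

Calling flush_queue and checking its exit status tells you whether the request was accepted at the HTTP level. Whether the queue actually drained still needs to be confirmed on the assessment volume chart.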

Monitoring agent offline

Symptoms: One or more agents show red in the admin fleet view.

Diagnosis: SSH to the host running the agent and check:

docker logs govern-agent --tail 50

Resolution:

If the cause is a network connectivity issue → verify the host can reach govern-api.archetypal.ai on port 443
If credentials were rotated → update the GOVERN_API_KEY env var and restart the container
If the container crashed → restart it with docker restart govern-agent
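The connectivity check on port 443 can be run from the agent host without extra tooling. This is a minimal sketch, assuming bash and the coreutils timeout command are available on the host; the function name can_reach and the 5-second limit are arbitrary choices for illustration.

```shell
# Sketch only: can_reach is a hypothetical helper.
# Returns 0 if a TCP connection to host:port opens within 5 seconds.
can_reach() {
  local host="$1" port="${2:-443}"
  # bash's /dev/tcp pseudo-device opens a TCP socket when used in a redirection.
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example: can_reach govern-api.archetypal.ai 443 && echo reachable || echo unreachable
```

A successful result only shows that the TCP port is open. It does not validate TLS or the API key, so a rejected GOVERN_API_KEY would still need to be ruled out from the agent logs.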

Database disk space warning

Symptoms: Supabase disk usage above 80%.

Resolution: Run the assessment archival job to move old assessments to cold storage. Run it with --dry-run first and review the output, then run it again without the flag:

govern-admin maintenance archive-assessments --before 2025-01-01 --dry-run
govern-admin maintenance archive-assessments --before 2025-01-01
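The two-step sequence above can be wrapped so the real run never happens without a reviewed dry run. This is a minimal sketch, assuming GNU date is available for validating the cutoff; the function name archive_before, the confirmation prompt, and the exit codes are assumptions, and only the govern-admin invocations come from this runbook.

```shell
# Sketch only: archive_before is a hypothetical wrapper.
archive_before() {
  local cutoff="$1" answer
  # Accept only a real calendar date in YYYY-MM-DD form (GNU date assumed).
  if [ "$(date -u -d "$cutoff" +%F 2>/dev/null)" != "$cutoff" ]; then
    echo "invalid cutoff date: $cutoff" >&2
    return 2
  fi
  # Step 1: dry run, so the operator can review it before anything changes.
  govern-admin maintenance archive-assessments --before "$cutoff" --dry-run || return 1
  # Step 2: require explicit confirmation before the real run.
  printf 'Proceed with archival before %s? [y/N] ' "$cutoff"
  read -r answer
  [ "$answer" = "y" ] || { echo "aborted"; return 1; }
  govern-admin maintenance archive-assessments --before "$cutoff"
}

# Example: archive_before 2025-01-01
```

Validating the date up front guards against a mistyped cutoff, which matters because the dry run and the real run must use the same value for the preview to be meaningful.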