Runbook

This runbook documents the procedures for the most common GOVERN operational incidents. All procedures assume you have admin console access and Cloudflare dashboard access.

API latency spike

Symptoms: P95 latency above 500ms, assessment throughput dropping.

Diagnosis:

  1. Check Cloudflare Workers dashboard for CPU time and error rate
  2. Check Supabase dashboard for connection pool saturation and slow queries
  3. Check Upstash Redis for connection count and eviction rate
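
For a quick client-side spot check alongside the dashboards, the sketch below samples request latency with curl and computes a nearest-rank P95. It measures from wherever it runs, so it includes network time and will not match the server-side P95 exactly. The function names `sample_latency_ms` and `p95` are illustrative, and the URL you pass in is an assumption: this runbook does not name a health endpoint.

```shell
# Print the nearest-rank 95th percentile of the numbers on stdin (one per line).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # ceil(NR * 0.95), computed without relying on 0.95 being exact
      i = int((NR * 95 + 99) / 100)
      print v[i]
    }'
}

# Print the total request time in whole milliseconds for N requests to URL.
# URL is supplied by the caller; N defaults to 20 (an arbitrary sample size).
sample_latency_ms() {
  url=$1
  n=${2:-20}
  i=0
  while [ "$i" -lt "$n" ]; do
    # %{time_total} is curl's total transfer time in seconds
    curl -s -o /dev/null -w '%{time_total}\n' "$url" |
      awk '{ printf "%d\n", $1 * 1000 }'
    i=$((i + 1))
  done
}
```

Example: `sample_latency_ms https://govern-api.archetypal.ai/ 20 | p95` prints a single millisecond figure to compare against the 500 ms threshold. The root path used here is a placeholder.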

Resolution:

  • If Supabase connections are saturated → scale the connection pool via the Supabase dashboard
  • If a slow query is the culprit → identify it in Supabase Query Performance and add an index if needed
  • If Redis evictions are high → upgrade the Redis plan or reduce TTLs

Assessment queue backup

Symptoms: The assessment volume chart shows assessments arriving in bursts instead of at a steady rate.

Diagnosis: Check for Durable Object errors in the Cloudflare Workers log for the Coordinator DO.

Resolution: The Coordinator DO self-heals on restart. If the queue is persistently backed up, trigger a manual flush:

curl -X POST https://govern-api.archetypal.ai/api/internal/flush-queue \
-H "Authorization: Bearer $GOVERN_ADMIN_TOKEN"
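
When the flush is run from a script, it helps to fail loudly instead of silently ignoring an error response. The wrapper below is a minimal sketch: the function name `flush_queue` is illustrative, and treating any 2xx status as success is an assumption, since this runbook does not document the endpoint's response.

```shell
# Trigger a manual queue flush and report the HTTP status.
# Assumption: a 2xx response means the flush was accepted.
flush_queue() {
  # Refuse to run with an empty token rather than send a blank bearer header.
  if [ -z "${GOVERN_ADMIN_TOKEN:-}" ]; then
    echo "GOVERN_ADMIN_TOKEN is not set" >&2
    return 2
  fi

  # Discard the body and capture only the status code; curl prints 000
  # when no HTTP response was received at all.
  status=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 30 \
    -X POST "https://govern-api.archetypal.ai/api/internal/flush-queue" \
    -H "Authorization: Bearer $GOVERN_ADMIN_TOKEN")

  case "$status" in
    2??) echo "flush accepted (HTTP $status)" ;;
    *)   echo "flush failed (HTTP $status)" >&2; return 1 ;;
  esac
}
```

Calling `flush_queue` returns non-zero on any non-2xx status, so it can gate later steps in a script.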

Monitoring agent offline

Symptoms: One or more agents show red in the admin fleet view.

Diagnosis: SSH to the host running the agent and check:

docker logs govern-agent --tail 50

Resolution:

  • Network connectivity issue → verify the host can reach govern-api.archetypal.ai on port 443
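  • Credential rotation → update the GOVERN_API_KEY env var and restart the container; if the variable is passed with -e or --env-file, recreate the container instead, because docker restart reuses the environment the container was created with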
  • Container crash → restart with docker restart govern-agent
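
The first and third checks above can be sketched as a small triage script run on the agent host. The function names are illustrative, and requesting the root path `/` is an assumption made only to test reachability; any HTTP response, including an error status, counts as reachable.

```shell
# Succeeds when an HTTPS request to the API host gets any HTTP response.
# Without --fail, curl exits non-zero only for connection-level problems
# (DNS, TCP, TLS, timeout), which is what this check is after.
api_reachable() {
  curl -sS -o /dev/null --connect-timeout 5 --max-time 10 \
    "https://govern-api.archetypal.ai/"
}

# Print a one-line hint about which resolution path applies.
triage_agent() {
  if ! api_reachable 2>/dev/null; then
    echo "network: host cannot reach govern-api.archetypal.ai on port 443"
  elif [ "$(docker inspect --format '{{.State.Running}}' govern-agent 2>/dev/null)" != "true" ]; then
    echo "container: govern-agent is not running"
  else
    echo "ok: API reachable and container running; check the logs for credential errors"
  fi
}
```

Running `triage_agent` narrows the problem to one bullet; the credential case still needs a look at the `docker logs` output, since the agent's log format is not documented here.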

Database disk space warning

Symptoms: Supabase disk usage above 80%.

Resolution: Run the assessment archival job to move old assessments to cold storage:

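# Preview first: review the dry-run output before archiving anything
govern-admin maintenance archive-assessments --before 2025-01-01 --dry-run
# Then run the archival with the same cutoff date
govern-admin maintenance archive-assessments --before 2025-01-01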
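
To keep the dry run and the real run on the same cutoff date, the two commands can be wrapped in one function that asks for confirmation in between. This is a sketch only: `archive_before` is an illustrative name, the YYYY-MM-DD check simply mirrors the date format used above, and it assumes `govern-admin` exits non-zero when the dry run fails.

```shell
# Dry-run the archival for a cutoff date, then archive only after confirmation.
archive_before() {
  cutoff=$1

  # Accept only the YYYY-MM-DD shape shown in the runbook example.
  case "$cutoff" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "cutoff must be YYYY-MM-DD" >&2; return 2 ;;
  esac

  # Stop here if the dry run itself fails.
  govern-admin maintenance archive-assessments --before "$cutoff" --dry-run || return 1

  printf 'Archive assessments before %s? [y/N] ' "$cutoff"
  read -r answer
  if [ "$answer" != "y" ]; then
    echo "aborted"
    return 0
  fi

  govern-admin maintenance archive-assessments --before "$cutoff"
}
```

Usage: `archive_before 2025-01-01`. Anything other than `y` at the prompt leaves the data untouched.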