Operational Runbooks
This page contains operational runbooks for the GOVERN platform. Each runbook is a step-by-step procedure for a specific operational scenario.
Runbook: Deploy to Production
When to use: Releasing a new version of a GOVERN component to production.
Prerequisites: Gate II (V(Q) >= 0.85) and Gate IV (5-point polish) must both be open.
Step 1 — Verify gates are open
# Run the full QA score check
pnpm run qa:score
# Expected output: V(Q) >= 0.85 for all components being deployed
Do not proceed if any gate shows BLOCKED.
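The threshold check in this step can be scripted. The following is a minimal sketch, assuming the V(Q) score has already been extracted as a plain decimal; the output format of `pnpm run qa:score` is not described on this page, and `vq_gate_open` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 when a V(Q) score meets the gate threshold.
# Assumes the score is a plain decimal such as 0.91; parsing the real
# qa:score output is left out because its format is not specified here.
vq_gate_open() {
  score="$1"
  threshold="${2:-0.85}"
  # awk performs the floating-point comparison that POSIX test cannot do.
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# Example with a hard-coded score:
if vq_gate_open 0.91; then
  echo "Gate II: OPEN"
else
  echo "Gate II: BLOCKED"
fi
```

Returning an exit status rather than printing a verdict lets the helper guard later steps, for example `vq_gate_open "$SCORE" || exit 1`.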
Step 2 — Tag the release
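Before tagging, it can help to confirm that the tag name has the `vMAJOR.MINOR.PATCH` shape used in this runbook. A minimal sketch, assuming pre-release and build suffixes are not used; `is_release_tag` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: accept only tags of the form vMAJOR.MINOR.PATCH
# with no leading zeros in any numeric part (e.g. v0.12.0).
is_release_tag() {
  printf '%s\n' "$1" |
    grep -Eq '^v(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$'
}

# Example: refuse to continue with a malformed tag.
TAG="v0.12.0"
if is_release_tag "$TAG"; then
  echo "Tag $TAG looks valid"
else
  echo "Tag $TAG is not a vMAJOR.MINOR.PATCH tag" >&2
fi
```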
# Create a semver release tag
git tag v0.12.0 -m "Release v0.12.0 — [brief description]"
git push origin v0.12.0
Step 3 — Deploy the API Gateway
cd packages/api-gateway
# Deploy to production Workers
npx wrangler deploy --env production
# Verify health immediately after deploy
curl https://jarvis-api-gateway.ben-c1f.workers.dev/health | jq .
# Expected: { "status": "ok" }
Step 4 — Deploy frontend packages
# Build all frontend packages
pnpm build
# Deploy Customer Dashboard (Cloudflare Pages)
npx wrangler pages deploy packages/govern-app/dist \
  --project-name=govern-app \
  --branch=main
# Deploy Internal Dashboard
npx wrangler pages deploy packages/govern-dashboard/dist \
  --project-name=govern-dashboard \
  --branch=main
Step 5 — Post-deploy verification
# Wait 60 seconds for health checks to run
sleep 60
# Check deploy watchdog status
curl "$JARVIS_API_URL/api/monitoring/deploys/health" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq '.targets'
# All targets should show "healthy"
Step 6 — Emit deploy build event
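The version string appears twice in the payload for this step, so building the JSON from a single variable avoids mismatches. This is a sketch only: `deploy_event_payload` is a hypothetical helper, the field names are copied from this runbook's example, and the version is assumed to contain no characters that need JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper: print the deploy build-event payload for one version.
# Field names and fixed values mirror the example payload in this runbook.
# Assumes the version needs no JSON escaping (true for tags like v0.12.0).
deploy_event_payload() {
  version="$1"
  printf '{"type":"deploy","archetypeIds":["jarvis"],"skillsExercised":["deployment-orchestration"],"description":"Production deploy: %s","quality":1.0,"metadata":{"version":"%s","targets":["api-gateway","govern-app"]}}\n' \
    "$version" "$version"
}

# Print the payload for inspection before sending it.
deploy_event_payload v0.12.0

# Possible use with the endpoint from this runbook (not executed here):
# curl -s -X POST "$JARVIS_API_URL/api/build-events" \
#   -H "Authorization: Bearer $AUTH_SECRET" \
#   -H "Content-Type: application/json" \
#   -d "$(deploy_event_payload v0.12.0)"
```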
curl -s -X POST "$JARVIS_API_URL/api/build-events" \
  -H "Authorization: Bearer $AUTH_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "deploy",
    "archetypeIds": ["jarvis"],
    "skillsExercised": ["deployment-orchestration"],
    "description": "Production deploy: v0.12.0",
    "quality": 1.0,
    "metadata": {
      "version": "v0.12.0",
      "targets": ["api-gateway", "govern-app"]
    }
  }'
Runbook: Emergency Rollback
When to use: A deploy has caused production degradation. Typical signs are failing health checks, a spiking error rate, or a significant drop in V(Q).
Step 1 — Identify the last good deploy
curl "$JARVIS_API_URL/api/monitoring/deploys?status=success&limit=10" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq '[.[] | {id, version, completedAt, vqScoreAfter}]'
Step 2 — Roll back API Gateway
# Find the last good commit hash from the deploy record
GOOD_COMMIT=<hash from deploy record>
# Check out the good commit
git checkout "$GOOD_COMMIT"
# Deploy immediately
cd packages/api-gateway
npx wrangler deploy --env production
Step 3 — Roll back Cloudflare Pages
For Pages deployments, use the Cloudflare dashboard:
- Go to dash.cloudflare.com → Pages → govern-app
- Click “Deployments”
- Find the last successful deployment before the problematic one
- Click “Roll back to this deployment”
Step 4 — Verify recovery
# Health check
curl https://jarvis-api-gateway.ben-c1f.workers.dev/health | jq .
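A single health check can fail while the rollback is still propagating, so a bounded retry is one way to avoid a false alarm. The sketch below is illustrative: `retry_until` is a hypothetical helper, and the attempt count and delay in the usage comment are assumptions, not values defined by this runbook.

```shell
#!/bin/sh
# Hypothetical helper: retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success and 1 if every attempt fails.
retry_until() {
  max="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Possible use with the health endpoint from this runbook (not executed here).
# curl -f makes HTTP error responses count as failures:
# retry_until 10 30 curl -fsS https://jarvis-api-gateway.ben-c1f.workers.dev/health
```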
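# Watch for V(Q) recovery
# JARVIS_API_URL and AUTH_SECRET must be exported: the single-quoted command is expanded by the shell that watch spawns
watch -n 30 'curl -s "$JARVIS_API_URL/api/monitoring/deploys/health" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq ".targets"'
Step 5 — Emit rollback build event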
# $GOOD_COMMIT is not expanded inside single quotes, so each use closes and reopens the quoted JSON
curl -s -X POST "$JARVIS_API_URL/api/build-events" \
  -H "Authorization: Bearer $AUTH_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "error",
    "archetypeIds": ["jarvis", "alvin"],
    "skillsExercised": ["diagnostic-reasoning"],
    "description": "Emergency rollback: degraded deploy rolled back to '"$GOOD_COMMIT"'",
    "metadata": {
      "rolledBackVersion": "v0.12.0",
      "recoveredTo": "'"$GOOD_COMMIT"'"
    }
  }'
Runbook: Incident Response
When to use: Production is degraded, customers are reporting issues, or alerts are firing.
Severity levels
| Level | Criteria | Response time |
|---|---|---|
| P1 | Complete service outage | Immediate |
| P2 | Degraded service (> 10% error rate) | 15 minutes |
| P3 | Single feature broken | 2 hours |
| P4 | Cosmetic or minor issue | Next business day |
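For scripts or chat bots that announce incidents, the table above can be encoded as a small lookup. This is a minimal sketch; `response_time` is a hypothetical helper name, and the values are copied from the table.

```shell
#!/bin/sh
# Hypothetical helper: map a severity level to its target response time.
# Values are copied from the severity table in this runbook.
response_time() {
  case "$1" in
    P1) echo "Immediate" ;;
    P2) echo "15 minutes" ;;
    P3) echo "2 hours" ;;
    P4) echo "Next business day" ;;
    *)
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}

# Example: build a short status line for a P2 incident.
echo "P2 response target: $(response_time P2)"
```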
P1/P2 incident procedure
Step 1 — Acknowledge
Post in #ops-incidents: “Acknowledging P[1/2] incident: [brief description]. [Your name] is on it.”
Step 2 — Diagnose
# Check API health
curl https://jarvis-api-gateway.ben-c1f.workers.dev/health
# Check recent errors in Cloudflare Analytics
# dash.cloudflare.com → Workers & Pages → jarvis-api-gateway → Analytics
# Check Supabase status
curl https://status.supabase.com/api/v2/summary.json | jq '.status'
# Check recent deploys (was a deploy the trigger?)
curl "$JARVIS_API_URL/api/monitoring/deploys?limit=5" \
  -H "Authorization: Bearer $AUTH_SECRET"
Step 3 — Contain
If a recent deploy is suspected: execute the Emergency Rollback runbook.
If no recent deploy: identify the failing component and determine if it can be isolated.
Step 4 — Resolve
Fix the root cause. Deploy the fix following the Deploy to Production runbook (even during an incident — gates still apply, but can be expedited).
Step 5 — Post-incident report
Within 24 hours, write a post-incident report covering:
- What happened
- Root cause
- Impact (customers affected, duration)
- Timeline
- Resolution
- Prevention measures
Post the report in #post-incident and link it from the Internal Dashboard incident log.
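To keep reports consistent, the six sections listed above can be generated as a skeleton. The sketch below assumes a plain-text report is acceptable; `incident_report_skeleton` is a hypothetical helper name, and no particular report format is prescribed by this runbook.

```shell
#!/bin/sh
# Hypothetical helper: print a plain-text report skeleton containing the
# six sections required by the post-incident report step.
incident_report_skeleton() {
  title="$1"
  printf 'Post-incident report: %s\n\n' "$title"
  for section in "What happened" "Root cause" \
                 "Impact (customers affected, duration)" \
                 "Timeline" "Resolution" "Prevention measures"; do
    # Each section heading ends with a colon, followed by a blank line to fill in.
    printf '%s:\n\n' "$section"
  done
}

# Example: print a skeleton for an incident with a placeholder title.
incident_report_skeleton "[brief description]"
```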
Runbook: Database Migration
When to use: Deploying a new Supabase migration file.
Prerequisites: Migration file has been reviewed, tested locally, and Gate II is open.
# Apply migration to production Supabase
cd Chairman-Infrastructure
supabase db push
# Verify migration applied
supabase db diff --use-migra
# Expected: no diff between migration files and production schema
# Verify RLS policies are correct (see Database Operations runbook)
Runbook: Wrangler Secrets Rotation
When to use: Rotating API keys, auth tokens, or other secrets stored in Wrangler.
# List current secrets
wrangler secret list --env production
# Rotate a secret
echo "NEW_SECRET_VALUE" | wrangler secret put SECRET_NAME --env production
# Verify the worker picked up the new secret (may require re-deploy)
npx wrangler deploy --env production
# Test with the new secret
curl https://jarvis-api-gateway.ben-c1f.workers.dev/health \
  -H "Authorization: Bearer NEW_SECRET_VALUE"
Important: After rotating AUTH_SECRET, update all CI/CD pipelines and the Internal Dashboard’s stored credentials.
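This runbook does not say how replacement secret values are generated. One common approach, shown here as a sketch rather than the platform's required method, is to read random bytes from the kernel and hex-encode them; `generate_secret` is a hypothetical helper name and the 32-byte length is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print a 64-character lowercase hex string derived
# from 32 random bytes read from /dev/urandom.
generate_secret() {
  # od prints the bytes as space-separated hex pairs; tr strips the
  # spaces and newlines so a single unbroken token remains.
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# Example: check the length without printing the secret itself.
NEW_SECRET="$(generate_secret)"
echo "generated a secret of ${#NEW_SECRET} characters"
```

Piping the value straight into the secret store, rather than echoing it, keeps it out of shell history and terminal scrollback.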