Operational Runbooks

This page contains operational runbooks for the GOVERN platform. Each runbook is a step-by-step procedure for a specific operational scenario.

Runbook: Deploy to Production

When to use: Releasing a new version of a GOVERN component to production.

Prerequisites: Gate II (V(Q) >= 0.85) and Gate IV (5-point polish) must both be open.

Step 1 — Verify gates are open

# Run the full QA score check
pnpm run qa:score

# Expected output: V(Q) >= 0.85 for all components being deployed

Do not proceed if any gate shows BLOCKED.
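The threshold check can also be scripted so that a deploy script stops on its own. The following is a minimal sketch: gate_open is a hypothetical helper, and how the score is extracted from the qa:score output is left open because that output format is not documented here. Only the 0.85 threshold comes from this runbook.

```shell
# Hypothetical helper: succeeds when a V(Q) score meets the Gate II threshold.
# $1 = score as a decimal number, $2 = optional threshold (defaults to 0.85).
gate_open() {
  score=$1
  threshold=${2:-0.85}
  # Reject anything that is not a plain decimal number (for example "BLOCKED").
  printf '%s\n' "$score" | grep -Eq '^[0-9]+(\.[0-9]+)?$' || return 2
  # awk does the floating-point comparison; exit status 0 means the gate is open.
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 >= t + 0) }'
}
```

A deploy script could then run something like gate_open "$SCORE" || exit 1 before continuing, where SCORE is whatever value the QA check reported.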

Step 2 — Tag the release

# Create a semver release tag
git tag v0.12.0 -m "Release v0.12.0 — [brief description]"
git push origin v0.12.0
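Before pushing, it can help to confirm that the tag name has the expected vMAJOR.MINOR.PATCH shape. This is a sketch with a hypothetical helper name. It accepts only the three-number core used in this runbook and deliberately rejects pre-release or build suffixes.

```shell
# Hypothetical helper: succeeds when the argument looks like vMAJOR.MINOR.PATCH.
# Leading zeros (for example v01.2.3) are rejected, as in semantic versioning.
is_release_tag() {
  printf '%s\n' "$1" |
    grep -Eq '^v(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$'
}
```

For example, is_release_tag v0.12.0 succeeds, while is_release_tag 0.12.0 fails because the leading v is missing.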

Step 3 — Deploy the API Gateway

cd packages/api-gateway

# Deploy to production Workers
npx wrangler deploy --env production

# Verify health immediately after deploy
curl https://jarvis-api-gateway.ben-c1f.workers.dev/health | jq .
# Expected: { "status": "ok" }
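To turn the visual check into an exit code that a script can act on, the response body can be tested for the expected status. The sketch below assumes the health response contains a "status" field, as in the expected output above. health_ok is a hypothetical helper, and the pattern match is deliberately simple rather than a full JSON parse.

```shell
# Hypothetical helper: reads a JSON body on stdin and succeeds only if it
# contains "status": "ok" (whitespace around the colon is tolerated).
health_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Example use (not run here):
#   curl -fsS https://jarvis-api-gateway.ben-c1f.workers.dev/health | health_ok \
#     || echo "health check failed" >&2
```

Where jq is available, jq -e '.status == "ok"' gives an equivalent exit status with proper JSON parsing.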

Step 4 — Deploy frontend packages

# Build all frontend packages
pnpm build

# Deploy Customer Dashboard (Cloudflare Pages)
npx wrangler pages deploy packages/govern-app/dist \
  --project-name=govern-app \
  --branch=main

# Deploy Internal Dashboard
npx wrangler pages deploy packages/govern-dashboard/dist \
  --project-name=govern-dashboard \
  --branch=main

Step 5 — Post-deploy verification

# Wait 60 seconds for health checks to run
sleep 60

# Check deploy watchdog status
curl "$JARVIS_API_URL/api/monitoring/deploys/health" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq '.targets'

# All targets should show "healthy"
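Instead of a single fixed sleep, the check can be retried until it passes or a limit is reached. This is a sketch of a generic polling helper. The name wait_for and its argument order are illustrative and not part of the GOVERN tooling.

```shell
# Hypothetical helper: run a command up to N times, pausing between attempts.
# $1 = maximum attempts, $2 = delay in seconds, remaining args = command to run.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the command succeeds.
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, wait_for 6 10 curl -fsS -o /dev/null https://jarvis-api-gateway.ben-c1f.workers.dev/health retries the health endpoint for up to roughly a minute and fails if it never returns a successful HTTP status.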

Step 6 — Emit deploy build event

curl -s -X POST "$JARVIS_API_URL/api/build-events" \
  -H "Authorization: Bearer $AUTH_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "deploy",
    "archetypeIds": ["jarvis"],
    "skillsExercised": ["deployment-orchestration"],
    "description": "Production deploy: v0.12.0",
    "quality": 1.0,
    "metadata": { "version": "v0.12.0", "targets": ["api-gateway", "govern-app"] }
  }'
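The version string appears twice in the request body, so it is easy to update one and miss the other. As a sketch, the body can be generated from a single variable. deploy_event_payload is a hypothetical helper, the field names are copied from the request above, and neither argument is escaped, so this assumes they contain no double quotes.

```shell
# Hypothetical helper: prints a deploy build-event body for a given version.
# $1 = version tag, $2 = JSON array literal listing the deployed targets.
deploy_event_payload() {
  version=$1
  targets_json=$2
  # The heredoc delimiter is unquoted, so $version and $targets_json expand.
  cat <<EOF
{
  "type": "deploy",
  "archetypeIds": ["jarvis"],
  "skillsExercised": ["deployment-orchestration"],
  "description": "Production deploy: $version",
  "quality": 1.0,
  "metadata": { "version": "$version", "targets": $targets_json }
}
EOF
}
```

The output can be piped into the same curl command by replacing the inline body with -d @-, which tells curl to read the request body from standard input.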

Runbook: Emergency Rollback

When to use: A deploy has caused production degradation. Typical signs are failing health checks, a spiking error rate, or a significant drop in V(Q).

Step 1 — Identify the last good deploy

curl "$JARVIS_API_URL/api/monitoring/deploys?status=success&limit=10" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq '[.[] | {id, version, completedAt, vqScoreAfter}]'

Step 2 — Roll back API Gateway

# Set this to the last good commit hash from the deploy record
GOOD_COMMIT="<hash from deploy record>"

# Check out the good commit (this leaves the working tree in a detached HEAD state)
git checkout "$GOOD_COMMIT"

# Deploy immediately
cd packages/api-gateway
npx wrangler deploy --env production
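During an incident it is easy to leave GOOD_COMMIT empty or still set to a placeholder. A small guard can catch that before anything is checked out or deployed. This is a sketch with a hypothetical helper. It checks only the shape of the value (7 to 40 lowercase hex characters), not whether the commit exists.

```shell
# Hypothetical helper: succeeds when the argument looks like a git commit hash,
# either abbreviated (7 or more hex characters) or full (40 characters).
is_commit_hash() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{7,40}$'
}
```

For example, is_commit_hash "$GOOD_COMMIT" || exit 1 stops a rollback script early. To confirm that the commit is actually present in the repository, git rev-parse --verify "$GOOD_COMMIT^{commit}" can be run afterwards.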

Step 3 — Roll back Cloudflare Pages

For Pages deployments, use the Cloudflare dashboard:

Go to dash.cloudflare.com → Pages → govern-app
Click “Deployments”
Find the last successful deployment before the problematic one
Click “Roll back to this deployment”

Step 4 — Verify recovery

# Health check
curl https://jarvis-api-gateway.ben-c1f.workers.dev/health | jq .

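# Watch target health until it recovers (refreshes every 30 seconds).
# watch runs the command in a new shell, so JARVIS_API_URL and AUTH_SECRET must be exported.
watch -n 30 'curl -s "$JARVIS_API_URL/api/monitoring/deploys/health" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq ".targets"'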

Step 5 — Emit rollback build event

# The body is passed as an unquoted heredoc so that $GOOD_COMMIT is expanded.
# Inside single quotes it would be sent as the literal text "$GOOD_COMMIT".
curl -s -X POST "$JARVIS_API_URL/api/build-events" \
  -H "Authorization: Bearer $AUTH_SECRET" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "type": "error",
  "archetypeIds": ["jarvis", "alvin"],
  "skillsExercised": ["diagnostic-reasoning"],
  "description": "Emergency rollback: degraded deploy rolled back to $GOOD_COMMIT",
  "metadata": { "rolledBackVersion": "v0.12.0", "recoveredTo": "$GOOD_COMMIT" }
}
EOF

Runbook: Incident Response

When to use: Production is degraded, customers are reporting issues, or alerts are firing.

Severity levels

Level	Criteria	Response time
P1	Complete service outage	Immediate
P2	Degraded service (> 10% error rate)	15 minutes
P3	Single feature broken	2 hours
P4	Cosmetic or minor issue	Next business day
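The P2 criterion is the only numeric one in the table. As a worked example, 120 failed requests out of 1,000 is a 12% error rate and crosses the threshold, while 100 out of 1,000 is exactly 10% and does not. The sketch below encodes that comparison. The helper name is hypothetical, and where the request counts come from is an assumption that depends on the monitoring in use.

```shell
# Hypothetical helper: succeeds when errors/total is strictly above 10%.
# $1 = failed requests, $2 = total requests over the same window.
exceeds_p2_error_rate() {
  errors=$1
  total=$2
  # With no traffic there is no meaningful rate; signal that with status 2.
  [ "$total" -gt 0 ] || return 2
  # errors / total > 0.10 is equivalent to errors * 10 > total in integers.
  [ $((errors * 10)) -gt "$total" ]
}
```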

P1/P2 incident procedure

Step 1 — Acknowledge

Post in #ops-incidents: “Acknowledging P[1/2] incident: [brief description]. [Your name] is on it.”

Step 2 — Diagnose

# Check API health
curl https://jarvis-api-gateway.ben-c1f.workers.dev/health

# Check recent errors in Cloudflare Analytics
# dash.cloudflare.com → Workers & Pages → jarvis-api-gateway → Analytics

# Check Supabase status
curl https://status.supabase.com/api/v2/summary.json | jq '.status'

# Check recent deploys (was a deploy the trigger?)
curl "$JARVIS_API_URL/api/monitoring/deploys?limit=5" \
  -H "Authorization: Bearer $AUTH_SECRET"

Step 3 — Contain

If a recent deploy is suspected: execute the Emergency Rollback runbook.

If no recent deploy: identify the failing component and determine if it can be isolated.

Step 4 — Resolve

Fix the root cause, then deploy the fix by following the Deploy to Production runbook. The gates still apply during an incident, although they can be expedited.

Step 5 — Post-incident report

Within 24 hours, write a post-incident report covering:

What happened
Root cause
Impact (customers affected, duration)
Timeline
Resolution
Prevention measures

Post the report in #post-incident and link it from the Internal Dashboard incident log.

Runbook: Database Migration

When to use: Deploying a new Supabase migration file.

Prerequisites: Migration file has been reviewed, tested locally, and Gate II is open.

# Apply migration to production Supabase
cd Chairman-Infrastructure
supabase db push

# Verify migration applied (--linked compares against the linked production project)
supabase db diff --linked --use-migra

# Expected: no diff between migration files and production schema

# Verify RLS policies are correct (see Database Operations runbook)

Runbook: Wrangler Secrets Rotation

When to use: Rotating API keys, auth tokens, or other secrets stored in Wrangler.

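# Run from packages/api-gateway so wrangler picks up the worker's configuration
cd packages/api-gateway

# List current secrets
npx wrangler secret list --env production

# Rotate a secret
echo "NEW_SECRET_VALUE" | npx wrangler secret put SECRET_NAME --env production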

# Verify the worker picked up the new secret (may require re-deploy)
npx wrangler deploy --env production

# Test with the new secret
curl https://jarvis-api-gateway.ben-c1f.workers.dev/health \
  -H "Authorization: Bearer NEW_SECRET_VALUE"

Important: After rotating AUTH_SECRET, update all CI/CD pipelines and the Internal Dashboard’s stored credentials.
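When the new value is something you generate yourself rather than a key issued by a third party, it can be produced from the system random source instead of being typed by hand. This is a minimal sketch. generate_secret is a hypothetical helper, and the 32-byte length is an assumption, not a documented GOVERN requirement.

```shell
# Hypothetical helper: prints 32 random bytes as 64 lowercase hex characters.
generate_secret() {
  # od prints the bytes as space-separated hex; tr strips spaces and newlines.
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}
```

The value can be captured once, for example NEW_VALUE=$(generate_secret), and then piped to the secret put command with printf '%s' "$NEW_VALUE", so that the same value is available for updating the CI/CD pipelines and the Internal Dashboard.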