Deploy Watchdog

The Deploy Watchdog monitors every GOVERN deployment across all targets. It tracks deploy success/failure, correlates deploys with quality score changes, maintains rollback history, and alerts on failure patterns.

Deploy Targets

GOVERN deploys to multiple targets. The watchdog monitors all of them:

Target	Deploy mechanism	Watchdog check
API Gateway	`wrangler deploy` to Cloudflare Workers	Health endpoint poll
Customer Dashboard	Cloudflare Pages	Pages deployment status API
Internal Dashboard	Cloudflare Pages	Pages deployment status API
GOVERN Docs	Cloudflare Pages	Pages deployment status API
Supabase Migrations	`supabase db push`	Migration table check

Deploy Health Indicators

For each target, the watchdog tracks:

Last deploy time — When was the most recent successful deploy?
Deploy success rate — % of deploys that succeeded in the last 30 days
Average deploy duration — How long deploys take (healthy: < 3 minutes)
Post-deploy health — Did the health check pass after the deploy?
Quality score delta — Did V(Q) go up or down after the deploy?
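
As a rough illustration of how these indicators could be derived from a list of deploy records for one target, here is a minimal sketch. The names `DeploySample`, `TargetIndicators`, and `computeIndicators` are hypothetical, and leaving pending and in-progress deploys out of the success rate is an assumption rather than documented watchdog behavior. Post-deploy health is omitted because it comes from a separate check.

```typescript
type DeployStatus = 'pending' | 'in-progress' | 'success' | 'failed' | 'rolled-back';

// Subset of DeployRecord fields that the indicators need.
interface DeploySample {
  status: DeployStatus;
  startedAt: string; // ISO 8601 timestamp
  durationMs?: number;
  vqScoreBefore?: number;
  vqScoreAfter?: number;
}

interface TargetIndicators {
  lastSuccessfulDeploy: string | null;
  successRate: number | null;      // fraction in [0, 1]; null when nothing finished in the window
  avgDurationMs: number | null;
  durationHealthy: boolean | null; // "healthy" means under 3 minutes
  latestVqDelta: number | null;    // after minus before for the newest scored deploy
}

const WINDOW_MS = 30 * 24 * 60 * 60 * 1000;
const HEALTHY_DURATION_MS = 3 * 60 * 1000;

function computeIndicators(records: DeploySample[], now: Date): TargetIndicators {
  // Newest first, so the first match is always the most recent one.
  const sorted = [...records].sort(
    (a, b) => Date.parse(b.startedAt) - Date.parse(a.startedAt),
  );
  const recent = sorted.filter(r => Date.parse(r.startedAt) > now.getTime() - WINDOW_MS);

  // Assumption: pending and in-progress deploys are left out of the success rate.
  const finished = recent.filter(r => r.status !== 'pending' && r.status !== 'in-progress');
  const succeeded = finished.filter(r => r.status === 'success').length;

  const durations = recent
    .map(r => r.durationMs)
    .filter((d): d is number => d !== undefined);
  const avgDurationMs = durations.length > 0
    ? durations.reduce((sum, d) => sum + d, 0) / durations.length
    : null;

  const lastSuccess = sorted.find(r => r.status === 'success');
  const scored = sorted.find(
    r => r.vqScoreBefore !== undefined && r.vqScoreAfter !== undefined,
  );

  return {
    lastSuccessfulDeploy: lastSuccess ? lastSuccess.startedAt : null,
    successRate: finished.length > 0 ? succeeded / finished.length : null,
    avgDurationMs,
    durationHealthy: avgDurationMs === null ? null : avgDurationMs < HEALTHY_DURATION_MS,
    latestVqDelta:
      scored && scored.vqScoreBefore !== undefined && scored.vqScoreAfter !== undefined
        ? scored.vqScoreAfter - scored.vqScoreBefore
        : null,
  };
}
```

Passing `now` explicitly keeps the function deterministic, which makes the 30-day window easy to test.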

Deploy Record Schema

interface DeployRecord {
  id: string;
  target: 'api-gateway' | 'customer-dashboard' | 'internal-dashboard' | 'docs';
  version: string;           // Git commit hash or semver tag
  deployedBy: string;        // User ID or 'ci-automated'
  status: 'pending' | 'in-progress' | 'success' | 'failed' | 'rolled-back';
  startedAt: string;
  completedAt?: string;
  durationMs?: number;
  healthCheckPassed: boolean;
  vqScoreBefore?: number;    // V(Q) before this deploy
  vqScoreAfter?: number;     // V(Q) after this deploy (measured 5 min post-deploy)
  rollbackDeployId?: string; // Set on a rollback deploy: ID of the failed deploy it replaces (matches the MTTR query join)
  failureReason?: string;
  metadata: {
    commitHash: string;
    branch: string;
    changedPackages: string[];
    buildDurationMs?: number;
  };
}
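
The SQL queries in this document use snake_case columns (`started_at`, `duration_ms`, `vq_score_before`), while the interface above is camelCase. A minimal sketch of a mapper between the two follows. `DeployRow`, `DeployRecordCore`, and `fromRow` are hypothetical names, the `completed_at` column is assumed by analogy with `started_at`, and the `metadata` field is left out for brevity.

```typescript
type DeployStatus = 'pending' | 'in-progress' | 'success' | 'failed' | 'rolled-back';

// Assumed shape of a deploy_records row (snake_case, NULL for missing values).
interface DeployRow {
  id: string;
  target: string;
  version: string;
  deployed_by: string;
  status: DeployStatus;
  started_at: string;
  completed_at: string | null;
  duration_ms: number | null;
  health_check_passed: boolean;
  vq_score_before: number | null;
  vq_score_after: number | null;
  rollback_deploy_id: string | null;
  failure_reason: string | null;
}

// Trimmed-down version of DeployRecord (no metadata).
interface DeployRecordCore {
  id: string;
  target: string;
  version: string;
  deployedBy: string;
  status: DeployStatus;
  startedAt: string;
  completedAt?: string;
  durationMs?: number;
  healthCheckPassed: boolean;
  vqScoreBefore?: number;
  vqScoreAfter?: number;
  rollbackDeployId?: string;
  failureReason?: string;
}

// SQL NULL becomes undefined so optional fields behave as the interface expects.
function orUndefined<T>(value: T | null): T | undefined {
  return value === null ? undefined : value;
}

function fromRow(row: DeployRow): DeployRecordCore {
  return {
    id: row.id,
    target: row.target,
    version: row.version,
    deployedBy: row.deployed_by,
    status: row.status,
    startedAt: row.started_at,
    completedAt: orUndefined(row.completed_at),
    durationMs: orUndefined(row.duration_ms),
    healthCheckPassed: row.health_check_passed,
    vqScoreBefore: orUndefined(row.vq_score_before),
    vqScoreAfter: orUndefined(row.vq_score_after),
    rollbackDeployId: orUndefined(row.rollback_deploy_id),
    failureReason: orUndefined(row.failure_reason),
  };
}
```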

Deploy Watchdog API

# Recent deploys
curl "$JARVIS_API_URL/api/monitoring/deploys?limit=10" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq .

# Deploy health summary
curl "$JARVIS_API_URL/api/monitoring/deploys/health" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq .

# Response:
# {
#   "targets": {
#     "api-gateway": { "status": "healthy", "lastDeploy": "...", "successRate": 0.97 },
#     "customer-dashboard": { "status": "healthy", "lastDeploy": "...", "successRate": 1.00 },
#     "internal-dashboard": { "status": "degraded", "lastDeploy": "...", "successRate": 0.88 }
#   },
#   "alerts": [
#     { "target": "internal-dashboard", "type": "low_success_rate", "message": "..." }
#   ]
# }
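
For callers that consume the health summary programmatically, the response shape shown above can be typed and filtered as in the sketch below. The type and function names are hypothetical, and the only status values seen in the sample response are "healthy" and "degraded", so anything other than "healthy" is treated as needing attention. The HTTP call itself is left out. The functions operate on the parsed JSON body.

```typescript
interface TargetHealth {
  status: string;       // "healthy" and "degraded" appear in the sample response
  lastDeploy: string;
  successRate: number;  // fraction in [0, 1]
}

interface DeployAlert {
  target: string;
  type: string;
  message: string;
}

interface DeployHealthSummary {
  targets: Record<string, TargetHealth>;
  alerts: DeployAlert[];
}

// Names of targets whose status is anything other than "healthy".
function targetsNeedingAttention(summary: DeployHealthSummary): string[] {
  return Object.keys(summary.targets).filter(
    name => summary.targets[name].status !== 'healthy',
  );
}

// Alerts raised for a single target.
function alertsFor(summary: DeployHealthSummary, target: string): DeployAlert[] {
  return summary.alerts.filter(alert => alert.target === target);
}
```

A caller would parse the body returned by `/api/monitoring/deploys/health` into `DeployHealthSummary` and pass it to these helpers.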

Post-Deploy Health Check

Every deploy triggers an automatic health check 60 seconds after completion:

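// Post-deploy health check (run in waitUntil so it does not block the deploy response).
// Note: Cloudflare Workers allows waitUntil work to continue for roughly 30 seconds after
// the response, so the 60-second delay below may need a Queue, Cron Trigger, or Durable
// Object alarm instead. `supabase`, `checkTargetHealth`, and `triggerDeployAlert` are
// defined elsewhere.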
async function postDeployHealthCheck(deployId: string, target: DeployTarget) {
  await new Promise(resolve => setTimeout(resolve, 60_000));

  const health = await checkTargetHealth(target);

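  // supabase-js returns errors in the result instead of throwing, so check explicitly
  const { error } = await supabase
    .from('deploy_records')
    .update({
      health_check_passed: health.passed,
      health_check_response_ms: health.latencyMs,
    })
    .eq('id', deployId);

  if (error) {
    console.error(`Failed to record health check for deploy ${deployId}: ${error.message}`);
  }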

  if (!health.passed) {
    await triggerDeployAlert({
      severity: 'critical',
      target,
      deployId,
      message: `Post-deploy health check failed: ${health.error}`,
    });
  }
}

Rollback Procedure

When a deploy fails or degrades quality, follow this rollback procedure.

Automatic rollback triggers

The watchdog auto-initiates rollback when:

Post-deploy health check fails (health endpoint returns non-200)
V(Q) score drops more than 0.10 within 5 minutes of deploy
Error rate in Cloudflare Analytics exceeds 5% of requests
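
The three triggers above can be expressed as a single decision function. This is a sketch under stated assumptions, not the watchdog's actual code: `PostDeploySignals`, `RollbackReason`, and `rollbackReasons` are hypothetical names, and the error rate is assumed to arrive as a fraction of requests.

```typescript
interface PostDeploySignals {
  healthStatusCode: number; // HTTP status returned by the health endpoint
  vqScoreBefore?: number;
  vqScoreAfter?: number;    // measured within 5 minutes of the deploy
  errorRate: number;        // fraction of requests that errored, e.g. 0.02 for 2%
}

type RollbackReason = 'health_check_failed' | 'vq_drop' | 'error_rate';

const MAX_VQ_DROP = 0.10;
const MAX_ERROR_RATE = 0.05;

// Returns every trigger that fired; an empty array means no automatic rollback.
function rollbackReasons(signals: PostDeploySignals): RollbackReason[] {
  const reasons: RollbackReason[] = [];

  if (signals.healthStatusCode !== 200) {
    reasons.push('health_check_failed');
  }

  // The V(Q) trigger only applies when both measurements exist.
  if (
    signals.vqScoreBefore !== undefined &&
    signals.vqScoreAfter !== undefined &&
    signals.vqScoreBefore - signals.vqScoreAfter > MAX_VQ_DROP
  ) {
    reasons.push('vq_drop');
  }

  if (signals.errorRate > MAX_ERROR_RATE) {
    reasons.push('error_rate');
  }

  return reasons;
}
```

Returning the list of reasons rather than a boolean lets the rollback alert say which trigger fired.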

Manual rollback

# Identify the last good deploy
curl "$JARVIS_API_URL/api/monitoring/deploys?target=api-gateway&status=success&limit=5" \
  -H "Authorization: Bearer $AUTH_SECRET" | jq '.[0]'

# Roll back to a specific commit
git checkout <last-good-commit-hash>

# For Cloudflare Workers
cd packages/api-gateway
wrangler deploy --env production

# For Cloudflare Pages (roll back via dashboard)
# Navigate to: dash.cloudflare.com → Pages → <project> → Deployments → Roll back

Rollback record

Every rollback is recorded as a new deploy record. The failed deploy's status is set to 'rolled-back', and rollbackDeployId is set on the new (rollback) deploy. The Deploy Watchdog shows the full rollback chain.
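
One way to reconstruct a rollback chain from stored records is to follow `rollbackDeployId` links. The sketch below assumes that a rollback deploy's `rollbackDeployId` holds the ID of the deploy it replaced, which is how the MTTR query under Deploy History Queries joins the table. `ChainNode` and `rollbackChain` are hypothetical names.

```typescript
interface ChainNode {
  id: string;
  status: string;
  rollbackDeployId?: string; // assumed: ID of the deploy this rollback replaced
}

// Walks from a rollback deploy back through the deploys it replaced.
// Returns IDs ordered from the starting deploy to the oldest replaced deploy.
function rollbackChain(records: ChainNode[], startId: string): string[] {
  const byId = new Map<string, ChainNode>();
  for (const record of records) {
    byId.set(record.id, record);
  }

  const chain: string[] = [];
  const seen = new Set<string>(); // guards against accidental cycles in the data
  let current = byId.get(startId);

  while (current !== undefined && !seen.has(current.id)) {
    chain.push(current.id);
    seen.add(current.id);
    current = current.rollbackDeployId !== undefined
      ? byId.get(current.rollbackDeployId)
      : undefined;
  }

  return chain;
}
```

For example, if deploy `d3` rolled back `d2`, which had itself rolled back `d1`, then `rollbackChain(records, 'd3')` yields `['d3', 'd2', 'd1']`.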

Quality Score Correlation

The Deploy Watchdog correlates each deploy with the V(Q) score measured before it and 5 minutes after it, so a quality regression can be traced to a specific version.

Deploy timeline with quality overlay:

v0.10.0  v0.11.0         v0.11.1  v0.12.0
|        |               |        |
|        ↓ V(Q): 0.94    |        ↓ V(Q): 0.97
|        ──────────────  |        ────────────
|  0.91  |    0.94       |  0.91  |    0.97
─────────|               ─────────|

A deploy that lowers V(Q) is flagged as soon as the post-deploy score is recorded (5 minutes after the deploy). If V(Q) falls below 0.85 after a deploy, the watchdog raises a CRITICAL alert and suggests rollback.
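
The flagging rule can be sketched as a small classifier. The names `VqVerdict`, `classifyVqChange`, and `annotateTimeline` are hypothetical, and treating a score below 0.85 as critical only when it also fell is one reading of "drops below 0.85", labeled here as an assumption.

```typescript
type VqVerdict = 'ok' | 'flagged' | 'critical';

const VQ_CRITICAL_FLOOR = 0.85;

function classifyVqChange(before: number, after: number): VqVerdict {
  // Assumption: CRITICAL requires both a drop and a score under the floor.
  if (after < before && after < VQ_CRITICAL_FLOOR) {
    return 'critical';
  }
  if (after < before) {
    return 'flagged';
  }
  return 'ok';
}

interface TimelinePoint {
  version: string;
  vqBefore: number;
  vqAfter: number;
}

interface AnnotatedPoint {
  version: string;
  delta: number;
  verdict: VqVerdict;
}

// Attaches the score delta and verdict to each deploy on a timeline.
function annotateTimeline(points: TimelinePoint[]): AnnotatedPoint[] {
  return points.map(point => ({
    version: point.version,
    delta: point.vqAfter - point.vqBefore,
    verdict: classifyVqChange(point.vqBefore, point.vqAfter),
  }));
}
```

Reading the timeline above as v0.11.0 moving V(Q) from 0.91 to 0.94, v0.11.1 from 0.94 to 0.91, and v0.12.0 from 0.91 to 0.97, only v0.11.1 would be flagged.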

Deploy History Queries

-- Deploy success rate by target (last 30 days)
SELECT
  target,
  COUNT(*) AS total_deploys,
  COUNT(*) FILTER (WHERE status = 'success') AS successful,
  ROUND(
    COUNT(*) FILTER (WHERE status = 'success')::numeric / COUNT(*) * 100,
    1
  ) AS success_rate_pct,
  AVG(duration_ms) / 1000 AS avg_duration_sec
FROM deploy_records
WHERE started_at > NOW() - INTERVAL '30 days'
  AND status NOT IN ('pending', 'in-progress')  -- count finished deploys only
GROUP BY target;

-- Deploys that triggered rollback
SELECT
  d.target,
  d.version,
  d.deployed_by,
  d.started_at,
  d.vq_score_before,
  d.vq_score_after,
  d.failure_reason
FROM deploy_records d
WHERE d.status = 'rolled-back'
  AND d.started_at > NOW() - INTERVAL '90 days'
ORDER BY d.started_at DESC;

-- Mean time to recover (MTTR) from failed deploys
SELECT
  f.target,
  AVG(
    EXTRACT(EPOCH FROM (r.started_at - f.started_at)) / 60
  ) AS avg_recovery_minutes
FROM deploy_records f                                 -- f: the failed deploy
JOIN deploy_records r ON r.rollback_deploy_id = f.id  -- r: the rollback that replaced it
GROUP BY f.target;

Deploy Alerts

The watchdog sends alerts via Slack when:

Condition	Severity	Channel
Deploy failed	ERROR	#ops-alerts
Health check failed post-deploy	CRITICAL	#ops-alerts + #on-call
V(Q) dropped > 0.10 after deploy	WARNING	#ops-alerts
Rollback initiated	CRITICAL	#ops-alerts + #on-call
No deploy in > 7 days (staleness check)	INFO	#ops-digest

Alerts include: target, version, deploy ID, V(Q) delta, failure reason (if any), and link to the Internal Dashboard deploy detail view.
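
The routing table above maps naturally onto a lookup structure. The following sketch is illustrative only: the condition keys, `ALERT_ROUTES`, `isStale`, and `formatAlert` are hypothetical names, the message layout is invented for the example, and the dashboard link is passed in because its URL format is not specified here.

```typescript
type AlertCondition =
  | 'deploy_failed'
  | 'health_check_failed'
  | 'vq_drop'
  | 'rollback_initiated'
  | 'stale_target';

type Severity = 'INFO' | 'WARNING' | 'ERROR' | 'CRITICAL';

interface AlertRoute {
  severity: Severity;
  channels: string[];
}

// Mirrors the condition/severity/channel table.
const ALERT_ROUTES: Record<AlertCondition, AlertRoute> = {
  deploy_failed: { severity: 'ERROR', channels: ['#ops-alerts'] },
  health_check_failed: { severity: 'CRITICAL', channels: ['#ops-alerts', '#on-call'] },
  vq_drop: { severity: 'WARNING', channels: ['#ops-alerts'] },
  rollback_initiated: { severity: 'CRITICAL', channels: ['#ops-alerts', '#on-call'] },
  stale_target: { severity: 'INFO', channels: ['#ops-digest'] },
};

const STALE_AFTER_MS = 7 * 24 * 60 * 60 * 1000;

// Staleness check: true when the last deploy is more than 7 days old.
function isStale(lastDeployAt: string, now: Date): boolean {
  return now.getTime() - Date.parse(lastDeployAt) > STALE_AFTER_MS;
}

interface AlertContext {
  target: string;
  version: string;
  deployId: string;
  vqDelta?: number;
  failureReason?: string;
  detailUrl: string; // link to the Internal Dashboard deploy detail view
}

function formatAlert(condition: AlertCondition, ctx: AlertContext): string {
  const route = ALERT_ROUTES[condition];
  const parts = [
    `[${route.severity}] ${condition}`,
    `target=${ctx.target}`,
    `version=${ctx.version}`,
    `deploy=${ctx.deployId}`,
  ];
  if (ctx.vqDelta !== undefined) {
    parts.push(`vq_delta=${ctx.vqDelta.toFixed(2)}`);
  }
  if (ctx.failureReason !== undefined) {
    parts.push(`reason=${ctx.failureReason}`);
  }
  parts.push(ctx.detailUrl);
  return parts.join(' | ');
}
```

Typing `ALERT_ROUTES` as `Record<AlertCondition, AlertRoute>` makes the compiler reject a new condition that has no route.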