# Production Go-Live Procedure — Phase 10-07, Task 5

**Date:** 2026-03-06
**Status:** DRAFT (TO BE TESTED ON STAGING)
**Owner:** DevOps / Deployment Lead
**Pre-requisites:** Complete PRODUCTION_READINESS.md checklist items #1-4

---
## Overview

This document defines the step-by-step procedure for deploying Gravl to production and verifying system health.

**Estimated Duration:** 2-3 hours (plus verification window)
**Rollback Window:** <15 minutes (using the ROLLBACK.md procedure)
**Required Team:** DevOps (2), Backend (1), Frontend Lead (1)

---
## Pre-Flight Checklist (T-30 minutes)

- [ ] Production cluster access verified (kubectl configured)
- [ ] All team members on call (Slack + video bridge open)
- [ ] Backup of production database exists (snapshot or automated backup running)
- [ ] Monitoring dashboards loaded and ready (Grafana open in separate browser tabs)
- [ ] Rollback procedure briefed to the team (5-minute review of ROLLBACK.md)
- [ ] Production domain DNS propagated (check DNS resolution)
- [ ] TLS certificates ready, or cert-manager deployed and tested
- [ ] Alert thresholds reviewed (no overly sensitive alerts during deployment)
- [ ] Staging environment running the last validated build
- [ ] Load balancer health checks configured
- [ ] Incident communication channel created (Slack #gravl-incident)
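Several of these checks can be scripted so the T-30 review is mechanical rather than eyeballed. A minimal sketch, assuming bash; the domain is a placeholder and the helper names are ours, not part of any existing tooling:

```shell
#!/usr/bin/env bash
# Pre-flight helper sketch. gravl.example.com is a placeholder; the namespace
# matches the one created in Phase 1.
set -u

check() {  # check <description> <command...>
  local desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
  fi
}

preflight() {
  check "kubectl reachable"           kubectl version --request-timeout=5s
  check "production namespace exists" kubectl get ns gravl-production
  check "DNS resolves"                nslookup gravl.example.com
}

# Run the checks only when executed directly (the file can also be sourced).
[[ "${BASH_SOURCE[0]}" == "$0" ]] && preflight
```

Any FAIL line is a reason to hold the deployment and re-run after fixing.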
---

## Phase 1: Environment & Infrastructure Setup (T-60 to T-30 minutes)

### 1.1 Create Kubernetes Namespace & RBAC

```bash
# Apply production namespace configuration
kubectl apply -f k8s/production/namespace.yaml

# Apply RBAC for production deployments
kubectl apply -f k8s/production/rbac.yaml

# Verify the namespace and service account exist
kubectl get ns gravl-production
kubectl get serviceaccount -n gravl-production gravl-deployer
```

**Verification:**
- [ ] Namespace exists
- [ ] ServiceAccount exists
- [ ] RBAC role bound
### 1.2 Apply Network Policies

```bash
# Apply default-deny plus explicit allow rules
kubectl apply -f k8s/production/network-policy.yaml

# Verify policies (expect 5+ NetworkPolicies)
kubectl get networkpolicies -n gravl-production
```

**Verification:**
- [ ] Default deny ingress in place
- [ ] Backend, frontend, database, and monitoring policies visible
### 1.3 Deploy Secrets (Sealed or External)

**Option A: Sealed Secrets** (if the sealed-secrets controller is deployed)

```bash
# Apply the SealedSecret manifests; the in-cluster controller decrypts them
# into regular Secrets. (kubeseal is only used to *create* sealed-secrets.yaml,
# not to apply it.)
kubectl apply -f k8s/production/sealed-secrets.yaml

# Verify secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret postgres-secret -n gravl-production
```

**Option B: External Secrets Operator** (if AWS Secrets Manager or Vault is used)

```bash
# Apply ExternalSecret definitions
kubectl apply -f k8s/production/external-secrets.yaml

# Verify ExternalSecrets synced (status should report synced)
kubectl get externalsecrets -n gravl-production
kubectl describe externalsecret postgres-secret -n gravl-production
```

**Verification:**
- [ ] postgres-secret contains POSTGRES_PASSWORD
- [ ] app-secret contains JWT_SECRET
- [ ] registry-pull-secret exists (if a private registry is used)
- [ ] staging-tls exists (or cert-manager will auto-create it)
### 1.4 Deploy cert-manager (if not already on the cluster)

```bash
# Install cert-manager (one-time, if needed)
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true \
  --version v1.13.0

# Create the ClusterIssuer for Let's Encrypt (production)
kubectl apply -f k8s/production/cert-manager-issuer.yaml

# Verify the issuer is ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod
```

**Verification:**
- [ ] cert-manager pods running in the cert-manager namespace
- [ ] ClusterIssuer status is Ready (True)

---
## Phase 2: Database & Storage (T-30 to T-10 minutes)

### 2.1 Deploy PostgreSQL StatefulSet

```bash
# Deploy PostgreSQL to production
kubectl apply -f k8s/production/postgres-statefulset.yaml

# Watch for pod readiness (should take 30-60 seconds)
kubectl rollout status statefulset/postgres -n gravl-production

# Verify the pod is running and ready (2/2 containers)
kubectl get pods -n gravl-production -l component=database
```

**Verification:**
- [ ] Pod status: Running, Ready 2/2
- [ ] PersistentVolumeClaim bound
- [ ] No errors in pod logs: `kubectl logs postgres-0 -n gravl-production`
### 2.2 Run Database Migrations

```bash
# Port-forward to the database (run in the background)
kubectl port-forward postgres-0 5432:5432 -n gravl-production &

# Run migrations from a separate terminal
cd backend
npm run db:migrate:prod

# Alternatively, if migrations run as a Kubernetes Job, monitor its logs instead:
kubectl logs -n gravl-production -f job/db-migration

# Kill the port-forward when done
kill %1
```

**Verification:**
- [ ] Migrations completed successfully
- [ ] No migration errors in the logs
- [ ] Database schema matches expected version
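The "schema matches expected version" check can be made concrete by querying the migration tool's bookkeeping table over the same port-forward. A sketch, assuming a node-pg-migrate style `pgmigrations` table with a `run_on` timestamp; substitute whatever table your migration runner actually maintains:

```shell
# Hypothetical schema-version probe. Table and column names are assumptions
# (node-pg-migrate defaults); adjust to your migration tool.
latest_migration() {
  PGPASSWORD="${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD first}" \
    psql -h localhost -p 5432 -U gravl_user -d gravl -tA \
    -c "SELECT name FROM pgmigrations ORDER BY run_on DESC LIMIT 1"
}

# Usage: latest_migration   # prints the most recently applied migration's name
```

Compare the printed name against the newest file in the migrations directory before ticking the box.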
### 2.3 Verify Database Connectivity

```bash
# Create a throwaway pod to verify DB access
# (psql will prompt for the gravl_user password)
kubectl run psql-test -it --rm \
  --image=postgres:15 \
  --restart=Never \
  -n gravl-production \
  -- psql -h postgres -U gravl_user -d gravl -c "SELECT version();"

# Should return the PostgreSQL version string
```

**Verification:**
- [ ] Database connection successful
- [ ] PostgreSQL version visible

---
---

## Phase 3: Deploy Application Services (T-10 to T+20 minutes)

### 3.1 Deploy Backend Deployment

```bash
# Deploy the backend service
kubectl apply -f k8s/production/backend-deployment.yaml

# Wait for the rollout (typically 2-3 minutes)
kubectl rollout status deployment/backend -n gravl-production

# Verify pods are running
kubectl get pods -n gravl-production -l component=backend
```

**Verification:**
- [ ] Pods running and ready (matching the replica count, e.g. 3 replicas = 3/3 ready)
- [ ] No CrashLoopBackOff errors
- [ ] Service endpoint registered: `kubectl get svc backend -n gravl-production`
### 3.2 Deploy Frontend Deployment

```bash
# Deploy the frontend service
kubectl apply -f k8s/production/frontend-deployment.yaml

# Wait for the rollout
kubectl rollout status deployment/frontend -n gravl-production

# Verify pods
kubectl get pods -n gravl-production -l component=frontend
```

**Verification:**
- [ ] Frontend pods running and ready
- [ ] Service endpoint registered
### 3.3 Apply Ingress with TLS Termination

```bash
# Deploy the ingress (cert-manager auto-provisions TLS if the
# cert-manager.io/cluster-issuer annotation is set)
kubectl apply -f k8s/production/ingress.yaml

# Wait for the ingress to get an external IP / DNS name (typically 30-60 seconds)
kubectl get ingress -n gravl-production -w

# Check ingress status and the TLS certificate
kubectl describe ingress gravl-ingress -n gravl-production
```

**Verification:**
- [ ] Ingress has an external IP or DNS name assigned
- [ ] TLS certificate present (auto-created by cert-manager if configured)
- [ ] Certificate is not self-signed (check with OpenSSL):

```bash
echo | openssl s_client -servername gravl.example.com \
  -connect $(kubectl get ingress gravl-ingress -n gravl-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):443 2>/dev/null \
  | grep -i subject
```
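Beyond the issuer check, it is worth confirming how long the certificate remains valid. A sketch using openssl; note that `date -d` is GNU-specific (BSD/macOS `date` differs), and the function name is ours:

```shell
# Print the number of days until the served certificate expires.
# Assumes GNU date; the host argument should be the production domain.
cert_days_left() {  # cert_days_left <host>
  local host="$1" end
  end=$(echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Usage: cert_days_left gravl.example.com
```

A freshly issued Let's Encrypt certificate should show roughly 90 days; anything near zero means renewal is misconfigured.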
---

## Phase 4: Service Integration Verification (T+20 to T+40 minutes)

### 4.1 Test Service-to-Service Communication

```bash
# Exec into a backend pod to test the database connection
BACKEND_POD=$(kubectl get pod -n gravl-production -l component=backend -o jsonpath='{.items[0].metadata.name}')

kubectl exec -it $BACKEND_POD -n gravl-production -- \
  curl http://postgres:5432 -v 2>&1 | head -5

# Expected: some indication that the postgres port responds (a protocol error
# or timeout is fine); "connection refused" is the failure case
```

**Verification:**
- [ ] Backend can reach the database (a timeout is acceptable; "connection refused" is not)
- [ ] Backend logs show no database errors: `kubectl logs $BACKEND_POD -n gravl-production | grep -i error | head -10`
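Since PostgreSQL does not speak HTTP, the curl output above can be confusing to read. A plain TCP probe is less ambiguous; this is a sketch using bash's `/dev/tcp` pseudo-device, and it assumes bash is available in the backend image:

```shell
# TCP reachability probe. "open" means the TCP handshake completed;
# "closed" covers refused, filtered, and timed-out connections alike.
tcp_probe() {  # tcp_probe <host> <port>
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# e.g. kubectl exec $BACKEND_POD -n gravl-production -- \
#        bash -c 'timeout 3 bash -c "exec 3<>/dev/tcp/postgres/5432" && echo open'
```

"closed" from inside the backend pod while the database pod is Ready usually points at a NetworkPolicy gap from section 1.2.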
### 4.2 Health Check Endpoint

```bash
# Get the backend service's cluster IP
BACKEND_SVC=$(kubectl get svc backend -n gravl-production -o jsonpath='{.spec.clusterIP}')

# Test the health endpoint from a throwaway pod
kubectl run curl-test -it --rm \
  --image=curlimages/curl \
  --restart=Never \
  -n gravl-production \
  -- curl http://$BACKEND_SVC:3000/health

# Expected response: {"status":"ok"} or similar
```

**Verification:**
- [ ] Health endpoint responds (HTTP 200)
- [ ] No error messages in response
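During a rollout the first probe can race pod startup, so a short retry loop is more reliable than a single curl. A sketch; the URL and attempt count are placeholders, and the function name is ours:

```shell
# Poll a health endpoint until it returns success, or give up.
wait_for_health() {  # wait_for_health <url> [attempts]
  local url="$1" attempts="${2:-30}" i
  for ((i = 1; i <= attempts; i++)); do
    if curl -sf -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 2
  done
  echo "unhealthy after $attempts attempts" >&2
  return 1
}

# e.g. wait_for_health "http://$BACKEND_SVC:3000/health" 15
```

The non-zero return code on failure makes this usable as a gate in a deployment script.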
### 4.3 External Endpoint Test (via Ingress)

```bash
# Wait for DNS propagation (if using a DNS name rather than an IP),
# then test external access
curl -k https://gravl.example.com/api/health

# Expected: HTTP 200 with the health status
# (-k skips certificate verification; drop it once the certificate is confirmed valid)
```

**Verification:**
- [ ] HTTPS responds through the ingress
- [ ] Backend health status returned
---

## Phase 5: Monitoring & Alerting Setup (T+40 to T+60 minutes)

### 5.1 Verify Prometheus Scraping

```bash
# Port-forward to Prometheus (run in the background)
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &

# Open http://localhost:9090/targets in a browser and
# verify all gravl-production targets are "UP"

kill %1
```

**Verification:**
- [ ] All production targets showing as UP
- [ ] No "DOWN" endpoints
### 5.2 Verify Grafana Dashboards

```bash
# Port-forward to Grafana (run in the background)
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &

# Open http://localhost:3000, log in (default credentials or the stored secret),
# navigate to the Gravl dashboards, and verify graphs show production metrics

kill %1
```

**Verification:**
- [ ] Gravl dashboards visible
- [ ] Metrics flowing (no empty graphs)
- [ ] CPU, memory, and request-rate graphs showing data
### 5.3 Verify Alertmanager

```bash
# Check Alertmanager configuration (should include production severity levels)
kubectl get alertmanagerconfig -n gravl-monitoring
kubectl describe alertmanagerconfig -n gravl-monitoring
```

**Verification:**
- [ ] Alerts configured with production thresholds
- [ ] Notification channels (Slack, PagerDuty, etc.) configured
### 5.4 Test Alert Trigger

```bash
# Send a test alert through Alertmanager
kubectl exec -it -n gravl-monitoring alertmanager-0 -- \
  amtool alert add test_alert severity=info --alertmanager.url=http://localhost:9093

# Check the Slack / notification channel (the alert should arrive within a minute)
```

**Verification:**
- [ ] Test alert received in the notification channel
- [ ] Alert formatting correct
- [ ] No excessive duplicate alerts
---

## Phase 6: Load Test & Baseline (T+60 to T+90 minutes)

### 6.1 Run Load Test on Production (Low Traffic)

```bash
# Generate light load using k6 (or a comparable tool such as Apache Bench)
k6 run --vus 10 --duration 5m k8s/production/load-test.js

# Expected results:
# - p95 latency: <200ms
# - Throughput: >100 req/s
# - Error rate: <0.1%
```

**Verification:**
- [ ] p95 latency <200ms
- [ ] Error rate <0.1%
- [ ] No pod restarts during the test
### 6.2 Capture Baseline Metrics

```bash
# Record current metrics as the baseline
kubectl top nodes > /tmp/baseline-nodes.txt
kubectl top pods -n gravl-production > /tmp/baseline-pods.txt

# Store these for later comparison (alert if usage exceeds 2x baseline)
```

**Verification:**
- [ ] Node CPU/memory usage within the expected range
- [ ] Pod CPU/memory usage within resource requests
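The "2x baseline" rule can be sketched as an awk comparison over two `kubectl top pods` snapshots. The function name is ours, and the column layout assumes standard `kubectl top pods` output (NAME, CPU in millicores, MEMORY):

```shell
# Flag pods whose current CPU is more than double the recorded baseline.
# Both arguments are raw `kubectl top pods` output files.
compare_to_baseline() {  # compare_to_baseline <baseline-file> <current-file>
  awk 'NR == FNR { base[$1] = $2 + 0; next }      # first file: record baseline CPU
       $1 in base && ($2 + 0) > 2 * base[$1] {    # second file: compare
         printf "ALERT: %s CPU %dm exceeds 2x baseline (%dm)\n", $1, $2 + 0, base[$1]
       }' "$1" "$2"
}

# e.g. compare_to_baseline /tmp/baseline-pods.txt <(kubectl top pods -n gravl-production)
```

Run it periodically during the monitoring window; any ALERT line warrants a look at the pod before it trips resource limits.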
---

## Phase 7: Production Sign-Off (T+90 minutes)

### 7.1 Final Checklist

- [ ] All pre-flight checks passed
- [ ] Database healthy and migrated
- [ ] All services running and ready
- [ ] Ingress responding (TLS valid)
- [ ] Health checks passing
- [ ] Monitoring metrics flowing
- [ ] Alerts functional
- [ ] Load test passed
- [ ] Team lead review: ✅ READY TO GO LIVE
### 7.2 Change Log Entry

```bash
# Write the deployment log inside the repository (not /tmp), so git can track it
cat > PRODUCTION_DEPLOY.log << 'DEPLOY_LOG'
---
date: 2026-03-06
time: ~09:30 UTC
environment: production
namespace: gravl-production
services:
  - backend: v1.x.x
  - frontend: v1.x.x
  - postgres: 15.x
  - ingress: nginx
  - certificates: cert-manager (Let's Encrypt)
pre_flight_status: ✅ PASSED
security_review: ✅ APPROVED
monitoring_status: ✅ OPERATIONAL
load_test_result: ✅ PASSED
sign_off_by: [DevOps Lead]
DEPLOY_LOG

git add PRODUCTION_DEPLOY.log
git commit -m "Production deployment log - 2026-03-06"
```
### 7.3 Notify Team

- [ ] Send the deployment completion notice to Slack #gravl-announce:

```
🚀 **Gravl Production Deployment COMPLETE**
- Timestamp: 2026-03-06 09:30 UTC
- All systems operational
- Monitoring dashboards: [link]
- Status page: [link]
```

- [ ] Update the status page (if external-facing)
- [ ] Notify stakeholders (product, marketing)

---
## Rollback Decision Tree

**If a critical failure occurs at any point:**
1. Do NOT proceed.
2. Trigger the ROLLBACK.md procedure.
3. Investigate the root cause post-incident (blameless postmortem).

**Critical Failure Indicators:**
- Database connection failures after 3 retries
- More than 2 pod crashes during a rollout
- Invalid ingress TLS certificate
- Health checks failing on all pods
- Alerts firing at production thresholds
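"More than 2 pod crashes" is easiest to judge from a number rather than scanning pod listings. A quick counter sketch (the function name is ours) that totals the RESTARTS column across the namespace:

```shell
# Total container restarts across the production namespace.
# RESTARTS is the fourth column of default `kubectl get pods` output.
restart_count() {
  kubectl get pods -n gravl-production --no-headers \
    | awk '{ s += $4 } END { print s + 0 }'
}

# Usage: record restart_count before the rollout and compare afterwards;
# a jump of more than 2 is a rollback trigger.
```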
---

## Post-Deployment (T+120 minutes and beyond)

### 7.4 Sustained Monitoring Window (Next 24 Hours)

- [ ] Assign the on-call rotation (24h monitoring)
- [ ] Set up the escalation policy (alert → on-call → incident lead)
- [ ] Daily review of logs and metrics for the first week
- [ ] Monitor customer feedback (support tickets, user reports)

### 7.5 Post-Deployment Review (After 24 Hours)

- [ ] Team retrospective (what went well, what to improve)
- [ ] Update runbooks based on findings
- [ ] Document any manual interventions as automation candidates
- [ ] Plan optimization and hardening work for the next phase

---
**Document Version:** 1.0
**Last Updated:** 2026-03-06 08:50
**Next Update:** After first production deployment attempt
|