Production Go-Live Procedure — Phase 10-07, Task 5

Date: 2026-03-06
Status: DRAFT (TO BE TESTED ON STAGING)
Owner: DevOps / Deployment Lead
Pre-requisites: Complete PRODUCTION_READINESS.md checklist items #1-4


Overview

This document defines the step-by-step procedure for deploying Gravl to production and verifying system health.

Estimated Duration: 2-3 hours (plus verification window)
Rollback Window: <15 minutes (with ROLLBACK.md procedure)
Required Team: DevOps (2), Backend (1), Frontend Lead (1)


Pre-Flight Checklist (T-30 minutes)

  • Production cluster access verified (kubectl configured)
  • All team members on call (Slack + video bridge open)
  • Backup of production database exists (snapshot/automated backup running)
  • Monitoring dashboards loaded and ready (Grafana open in separate browser tabs)
  • Rollback procedure briefed to team (5-minute review of ROLLBACK.md)
  • Production domain DNS propagated (check DNS resolution)
  • TLS certificates ready or cert-manager deployed and tested
  • Alert thresholds reviewed (no overly sensitive alerts during deployment)
  • Staging environment running last validated build
  • Load balancer health checks configured
  • Incident communication channel created (Slack #gravl-incident)
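Several of the bullets above can be pre-verified from a terminal before T-30. A minimal sketch; `preflight` is a hypothetical helper, and the namespace and hostname are this runbook's placeholder values:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: verifies cluster/namespace access and DNS resolution.
set -eu

preflight() {
  local ns="$1" host="$2"
  kubectl get ns "$ns" >/dev/null 2>&1 \
    || { echo "FAIL: namespace $ns missing or no cluster access"; return 1; }
  getent hosts "$host" >/dev/null \
    || { echo "FAIL: $host does not resolve"; return 1; }
  echo "PASS: pre-flight basics"
}

# preflight gravl-production gravl.example.com
```

Backups, dashboards, and team readiness still need a human check; this only covers the mechanical items.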

Phase 1: Environment & Infrastructure Setup (T-60 to T-30 minutes)

1.1 Create Kubernetes Namespace & RBAC

# Apply production namespace configuration
kubectl apply -f k8s/production/namespace.yaml

# Apply RBAC for production deployments
kubectl apply -f k8s/production/rbac.yaml

# Verify namespace created
kubectl get ns gravl-production
kubectl get serviceaccount -n gravl-production gravl-deployer

Verification:

  • Namespace exists
  • ServiceAccount exists
  • RBAC role bound

1.2 Apply Network Policies

# Apply default deny + explicit allow rules
kubectl apply -f k8s/production/network-policy.yaml

# Verify policies (should see 5+ NetworkPolicies)
kubectl get networkpolicies -n gravl-production

Verification:

  • Default deny ingress in place
  • Backend, frontend, database, monitoring policies visible

1.3 Deploy Secrets (Sealed or External)

Option A: Sealed Secrets (if the sealed-secrets controller is deployed)

# Apply the SealedSecret manifests; the in-cluster controller decrypts them into regular Secrets
kubectl apply -f k8s/production/sealed-secrets.yaml

# Verify secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret postgres-secret -n gravl-production

Option B: External Secrets Operator (if AWS/Vault used)

# Apply ExternalSecret definitions
kubectl apply -f k8s/production/external-secrets.yaml

# Verify ExternalSecrets synced (should see status: synced)
kubectl get externalsecrets -n gravl-production
kubectl describe externalsecret postgres-secret -n gravl-production

Verification:

  • postgres-secret contains POSTGRES_PASSWORD
  • app-secret contains JWT_SECRET
  • registry-pull-secret exists (if private registry used)
  • Production TLS secret exists (or cert-manager will auto-create it)
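These secret checks can be scripted so that a missing key fails loudly before any app pods start. A sketch, using the secret and key names from the bullets above (`secret_has_key` is a hypothetical helper):

```shell
# Assert that a Secret contains a non-empty key; fails the step otherwise.
secret_has_key() {
  local ns="$1" secret="$2" key="$3"
  kubectl get secret "$secret" -n "$ns" -o jsonpath="{.data.$key}" | grep -q . \
    && echo "OK: $secret/$key" \
    || { echo "MISSING: $secret/$key"; return 1; }
}

# secret_has_key gravl-production postgres-secret POSTGRES_PASSWORD
# secret_has_key gravl-production app-secret JWT_SECRET
```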

1.4 Deploy cert-manager (if not already on cluster)

# Install cert-manager (one-time, if needed)
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true \
  --version v1.13.0

# Create ClusterIssuer for Let's Encrypt (production)
kubectl apply -f k8s/production/cert-manager-issuer.yaml

# Verify issuer ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod

Verification:

  • cert-manager pods running in cert-manager namespace
  • ClusterIssuer status is READY (True)
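Instead of eyeballing `kubectl describe`, the Ready condition can be polled until it flips to True. A sketch; `wait_for_issuer` is a hypothetical helper, and the issuer name is the one created above:

```shell
# Poll the ClusterIssuer until its Ready condition is True, or time out.
wait_for_issuer() {
  local issuer="$1" tries="${2:-30}"
  for i in $(seq 1 "$tries"); do
    status=$(kubectl get clusterissuer "$issuer" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ] && { echo "READY: $issuer"; return 0; }
    sleep 5
  done
  echo "TIMEOUT: $issuer not ready after $tries checks"
  return 1
}

# wait_for_issuer letsencrypt-prod
```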

Phase 2: Database & Storage (T-30 to T-10 minutes)

2.1 Deploy PostgreSQL StatefulSet

# Deploy PostgreSQL to production
kubectl apply -f k8s/production/postgres-statefulset.yaml

# Watch for Pod readiness (should take 30-60 seconds)
kubectl rollout status statefulset/postgres -n gravl-production

# Verify pod is running and ready (2/2 containers)
kubectl get pods -n gravl-production -l component=database

Verification:

  • Pod status: Running, Ready 2/2
  • PersistentVolumeClaim bound
  • No errors in pod logs: kubectl logs postgres-0 -n gravl-production

2.2 Run Database Migrations

# Option 1: run migrations locally through a temporary port-forward
kubectl port-forward postgres-0 5432:5432 -n gravl-production &

# In a separate terminal
cd backend
npm run db:migrate:prod

# Kill the port-forward when done
kill %1

# Option 2: if migrations run as a Kubernetes Job instead, monitor it directly
kubectl logs -n gravl-production -f job/db-migration

Verification:

  • Migration job completed successfully
  • No migration errors in logs
  • Database schema matches expected version
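One way to check the applied schema version non-interactively. This assumes the migration tool behind `npm run db:migrate:prod` keeps a bookkeeping table named `migrations` with `id` and `name` columns; verify the actual table name for the tool in use:

```shell
# Print the most recently applied migration (table name is tool-dependent -- an assumption here).
latest_migration() {
  kubectl exec postgres-0 -n gravl-production -- \
    psql -U gravl_user -d gravl -tA \
    -c "SELECT name FROM migrations ORDER BY id DESC LIMIT 1;"
}
```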

2.3 Verify Database Connectivity

# Fetch the database password so psql can run non-interactively
POSTGRES_PASSWORD=$(kubectl get secret postgres-secret -n gravl-production \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)

# Create a test pod to verify DB access
kubectl run -it --rm --image=postgres:15 \
  --restart=Never \
  -n gravl-production \
  --env="PGPASSWORD=$POSTGRES_PASSWORD" \
  psql-test \
  -- psql -h postgres -U gravl_user -d gravl -c "SELECT version();"

# Should return PostgreSQL version

Verification:

  • Database connection successful
  • PostgreSQL version visible

Phase 3: Deploy Application Services (T-10 to T+20 minutes)

3.1 Deploy Backend Deployment

# Deploy backend service
kubectl apply -f k8s/production/backend-deployment.yaml

# Wait for rollout (typically 2-3 minutes)
kubectl rollout status deployment/backend -n gravl-production

# Verify pods running
kubectl get pods -n gravl-production -l component=backend

Verification:

  • Pods running and ready (depends on replicas, e.g., 3 replicas = 3/3 ready)
  • No CrashLoopBackOff errors
  • Service endpoint registered: kubectl get svc backend -n gravl-production

3.2 Deploy Frontend Deployment

# Deploy frontend service
kubectl apply -f k8s/production/frontend-deployment.yaml

# Wait for rollout
kubectl rollout status deployment/frontend -n gravl-production

# Verify pods
kubectl get pods -n gravl-production -l component=frontend

Verification:

  • Frontend pods running and ready
  • Service endpoint registered

3.3 Apply Ingress with TLS Termination

# Deploy ingress (cert-manager will auto-provision TLS if the cert-manager.io/cluster-issuer annotation is set)
kubectl apply -f k8s/production/ingress.yaml

# Wait for ingress to get external IP / DNS name (typically 30-60 seconds)
kubectl get ingress -n gravl-production -w

# Check ingress status and TLS certificate
kubectl describe ingress gravl-ingress -n gravl-production

Verification:

  • Ingress has external IP or DNS name assigned
  • TLS certificate present (cert-manager auto-created if configured)
  • SSL certificate not self-signed (check with OpenSSL; modern s_client prints lowercase subject=/issuer= lines):
    echo | openssl s_client -servername gravl.example.com \
      -connect $(kubectl get ingress gravl-ingress -n gravl-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):443 2>/dev/null | grep -iE 'subject=|issuer='

Phase 4: Service Integration Verification (T+20 to T+40 minutes)

4.1 Test Service-to-Service Communication

# Exec into backend pod to test database connection
BACKEND_POD=$(kubectl get pod -n gravl-production -l component=backend -o jsonpath='{.items[0].metadata.name}')

kubectl exec -it $BACKEND_POD -n gravl-production -- \
  curl http://postgres:5432 -v 2>&1 | head -5

# Expected: a protocol error or empty reply (the TCP connection was accepted);
# "connection refused" means the Service or NetworkPolicy is misconfigured

Verification:

  • Backend can reach database (even if timeout, not connection refused)
  • Backend logs show no database errors: kubectl logs $BACKEND_POD -n gravl-production | grep -i error | head -10

4.2 Health Check Endpoint

# Get backend service IP
BACKEND_SVC=$(kubectl get svc backend -n gravl-production -o jsonpath='{.spec.clusterIP}')

# Test health endpoint (from another pod)
kubectl run -it --rm --image=curlimages/curl \
  --restart=Never \
  -n gravl-production \
  curl-test \
  --command \
  -- curl http://$BACKEND_SVC:3000/health

# Expected response: {"status":"ok"} or similar

Verification:

  • Health endpoint responds (HTTP 200)
  • No error messages in response
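A single curl can race freshly-Ready pods; polling for a short window is more forgiving. A sketch (`wait_healthy` is a hypothetical helper):

```shell
# Poll a URL until it returns HTTP 200, or give up after N tries.
wait_healthy() {
  local url="$1" tries="${2:-30}"
  local code=""
  for i in $(seq 1 "$tries"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    [ "$code" = "200" ] && { echo "HEALTHY: $url"; return 0; }
    sleep 2
  done
  echo "UNHEALTHY: $url (last code: $code)"
  return 1
}

# wait_healthy "http://$BACKEND_SVC:3000/health"
```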

4.3 External Endpoint Test (via Ingress)

# Wait for DNS propagation (if using DNS name, not IP)
# Then test external access
curl -k https://gravl.example.com/api/health

# Expected: HTTP 200 with health status

Verification:

  • HTTPS responds (-k skips certificate validation; once cert-manager has issued the certificate, the check must also pass without -k)
  • Backend responds through ingress
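To confirm the -k flag is no longer needed, the issuer of the certificate actually served on port 443 can be inspected directly. A sketch; the domain is the placeholder used throughout this document:

```shell
# Print the issuer of the certificate served on :443 for the given hostname.
cert_issuer() {
  local host="$1"
  echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -issuer
}

# cert_issuer gravl.example.com
# Expect a Let's Encrypt issuer, not the ingress controller's default "Fake Certificate".
```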

Phase 5: Monitoring & Alerting Setup (T+40 to T+60 minutes)

5.1 Verify Prometheus Scraping

# Check Prometheus targets (should show gravl-production scrape configs)
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &

# Open http://localhost:9090/targets in browser
# Verify all gravl-production targets are "UP"

kill %1

Verification:

  • All production targets showing as UP
  • No "DOWN" endpoints
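The browser check can also be scripted against the Prometheus HTTP API. A sketch assuming `jq` is installed and the port-forward above is active:

```shell
# List scrape targets that are not healthy; empty output means every target is up.
down_targets() {
  curl -s http://localhost:9090/api/v1/targets \
    | jq -r '.data.activeTargets[] | select(.health != "up") | .labels.instance'
}
```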

5.2 Verify Grafana Dashboards

# Access Grafana
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &

# Open http://localhost:3000
# Login with default credentials (or stored secret)
# Navigate to Gravl dashboards
# Verify graphs showing production metrics

kill %1

Verification:

  • Gravl dashboards visible
  • Metrics flowing (not empty graphs)
  • CPU, memory, request rate graphs showing data

5.3 Verify AlertManager

# Check AlertManager configuration (should have production severity levels)
kubectl get alertmanagerconfig -n gravl-monitoring
kubectl describe alertmanagerconfig -n gravl-monitoring

Verification:

  • Alerts configured for production thresholds
  • Notification channels (Slack, PagerDuty, etc.) configured

5.4 Test Alert Trigger

# Send test alert through AlertManager
kubectl exec -it -n gravl-monitoring alertmanager-0 -- \
  amtool alert add test_alert severity=info --alertmanager.url=http://localhost:9093

# Check Slack / notification channel for alert (should arrive within 1 minute)

Verification:

  • Test alert received in notification channel
  • Alert formatting correct
  • No excessive duplicate alerts
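Besides watching Slack, the alert's presence can be confirmed through AlertManager's API. A rough sketch (assumes a port-forward to :9093; the grep over raw JSON is a quick check, not a full API client):

```shell
# Check whether an alert with the given alertname is currently active in AlertManager.
alert_visible() {
  local name="$1"
  curl -s http://localhost:9093/api/v2/alerts \
    | grep -q "\"alertname\":\"$name\"" \
    && echo "VISIBLE: $name" \
    || { echo "ABSENT: $name"; return 1; }
}

# alert_visible test_alert
```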

Phase 6: Load Test & Baseline (T+60 to T+90 minutes)

6.1 Run Load Test on Production (Low Traffic)

# Generate light load using k6 or Apache Bench
k6 run --vus 10 --duration 5m k8s/production/load-test.js

# Expected results:
# - p95 latency: <200ms
# - Throughput: >100 req/s
# - Error rate: <0.1%

Verification:

  • p95 latency <200ms
  • Error rate <0.1%
  • No pod restarts during test

6.2 Baseline Metrics Captured

# Log current metrics for baseline
kubectl top nodes > /tmp/baseline-nodes.txt
kubectl top pods -n gravl-production > /tmp/baseline-pods.txt

# Store for comparison (alert if exceeds 2x baseline)

Verification:

  • Node CPU/Memory usage within expected range
  • Pod CPU/Memory usage within resource requests
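The "2x baseline" comparison can be automated against the two snapshot files. A sketch that assumes the `kubectl top pods` column layout (NAME CPU MEMORY) and compares CPU only; the awk numeric coercion strips the "m" suffix:

```shell
# Report pods whose current CPU (millicores) exceeds 2x their recorded baseline.
over_baseline() {
  local baseline="$1" current="$2"
  sort "$baseline" > /tmp/_baseline.sorted
  sort "$current"  > /tmp/_current.sorted
  # join on pod name: fields become NAME BASE_CPU BASE_MEM CUR_CPU CUR_MEM
  join /tmp/_baseline.sorted /tmp/_current.sorted \
    | awk '{ b = $2 + 0; c = $4 + 0; if (c > 2 * b) print $1, "baseline:", $2, "now:", $4 }'
  rm -f /tmp/_baseline.sorted /tmp/_current.sorted
}

# over_baseline /tmp/baseline-pods.txt /tmp/current-pods.txt
```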

Phase 7: Production Sign-Off (T+90 minutes)

7.1 Final Checklist

  • All pre-flight checks passed
  • Database healthy and migrated
  • All services running and ready
  • Ingress responding (TLS valid)
  • Health checks passing
  • Monitoring metrics flowing
  • Alerts functional
  • Load test passed
  • Team lead review: READY TO GO LIVE

7.2 Change Log Entry

# Log deployment to version control (write inside the repo so git can track it)
cat > docs/PRODUCTION_DEPLOY.log << 'DEPLOY_LOG'
---
date: 2026-03-06
time: ~09:30 UTC
environment: production
namespace: gravl-production
services:
  - backend: v1.x.x
  - frontend: v1.x.x
  - postgres: 15.x
  - ingress: nginx
  - certificates: cert-manager (Let's Encrypt)
pre_flight_status: ✅ PASSED
security_review: ✅ APPROVED
monitoring_status: ✅ OPERATIONAL
load_test_result: ✅ PASSED
sign_off_by: [DevOps Lead]
DEPLOY_LOG

git add docs/PRODUCTION_DEPLOY.log
git commit -m "Production deployment log - 2026-03-06"

7.3 Notify Team

  • Send deployment completion notice to Slack #gravl-announce

    🚀 **Gravl Production Deployment COMPLETE**
    - Timestamp: 2026-03-06 09:30 UTC
    - All systems operational
    - Monitoring dashboards: [link]
    - Status page: [link]
    
  • Update status page (if external-facing)

  • Notify stakeholders (product, marketing)


Rollback Decision Tree

If at any point a critical failure occurs:

  1. Do NOT proceed
  2. Trigger ROLLBACK.md procedure
  3. Investigate root cause post-incident (blameless postmortem)

Critical Failure Indicators:

  • Database connection failures after 3 retries
  • More than 2 pod crashes during rollout
  • Ingress TLS certificate invalid
  • Health checks failing on all pods
  • Alerts firing for production thresholds
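The "more than 2 pod crashes" trigger can be checked numerically rather than by scanning `kubectl get pods` output. A sketch that sums container restart counts across the namespace:

```shell
# Sum container restartCount across all pods in the production namespace.
restart_count() {
  kubectl get pods -n gravl-production \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[*].restartCount}{"\n"}{end}' \
    | awk '{ for (i = 1; i <= NF; i++) s += $i } END { print s + 0 }'
}

# [ "$(restart_count)" -gt 2 ] && echo "ROLLBACK TRIGGER: pod crash threshold exceeded"
```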

Post-Deployment (T+120 minutes and beyond)

7.4 Sustained Monitoring Window (Next 24 hours)

  • Assign on-call rotation (24h monitoring)
  • Set up escalation policy (alert → on-call → incident lead)
  • Daily review of logs and metrics for first week
  • Customer feedback monitoring (support tickets, user reports)

7.5 Post-Deployment Review (24 hours)

  • Team retrospective (what went well, what to improve)
  • Update runbooks based on findings
  • Document any manual interventions for automation
  • Plan optimization and hardening work for next phase

Document Version: 1.0
Last Updated: 2026-03-06 08:50
Next Update: After first production deployment attempt