Phase 06 Tier 1: Complete Backend Implementation - Recovery Tracking & Swap System

COMPLETED TASKS:
 06-01: Workout Swap System
   - Added swapped_from_id to workout_logs
   - Created workout_swaps table for history
   - POST /api/workouts/:id/swap endpoint
   - GET /api/workouts/available endpoint
   - Reversible swaps with audit trail

 06-02: Muscle Group Recovery Tracking
   - Created muscle_group_recovery table
   - Implemented calculateRecoveryScore() function
   - GET /api/recovery/muscle-groups endpoint
   - GET /api/recovery/most-recovered endpoint
   - Auto-tracking on workout log completion

 06-03: Smart Workout Recommendations
   - GET /api/recommendations/smart-workout endpoint
   - 7-day workout analysis algorithm
   - Recovery-based filtering (>30% threshold)
   - Top 3 recommendations, each with context-aware reasoning messages (example calls below)
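
Example calls against the new endpoints (base URL, token, workout IDs, and the swap request body are placeholders, not the actual contract):
```bash
BASE=http://localhost:3000   # placeholder host/port
TOKEN="<jwt>"                # placeholder auth token

# 06-01: swap a scheduled workout (body shape is illustrative)
curl -X POST "$BASE/api/workouts/123/swap" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"new_workout_id": 456}'
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/workouts/available"

# 06-02: recovery scores per muscle group
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/recovery/muscle-groups"
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/recovery/most-recovered"

# 06-03: top 3 recommendations with reasoning context
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/recommendations/smart-workout"
```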

DATABASE CHANGES:
- Added 4 new tables: muscle_group_recovery, workout_swaps, custom_workouts, custom_workout_exercises
- Extended workout_logs with: swapped_from_id, source_type, custom_workout_id, custom_workout_exercise_id
- Created 7 new indexes for performance (verification sketch below)
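
A hedged sketch for verifying the schema changes landed (connection details are placeholders; assumes PostgreSQL):
```bash
# Placeholders: host, user, database (set PGPASSWORD first if auth requires it)
psql -h localhost -U gravl_user -d gravl <<'SQL'
-- all four new tables should be listed
SELECT table_name FROM information_schema.tables
WHERE table_name IN ('muscle_group_recovery', 'workout_swaps',
                     'custom_workouts', 'custom_workout_exercises');
-- workout_logs should carry the four new columns
SELECT column_name FROM information_schema.columns
WHERE table_name = 'workout_logs'
  AND column_name IN ('swapped_from_id', 'source_type',
                      'custom_workout_id', 'custom_workout_exercise_id');
SQL
```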

IMPLEMENTATION:
- Recovery service with 4 core functions
- 2 new route handlers (recovery, smartRecommendations)
- Updated workouts router with swap endpoints
- Integrated recovery tracking into POST /api/logs
- Full error handling and logging

TESTING:
- Test file created: /backend/test/phase-06-tests.js
- Ready for E2E and staging validation

STATUS: Ready for frontend integration and production review
Branch: feature/06-phase-06
# Production Go-Live Procedure — Phase 10-07, Task 5
**Date:** 2026-03-06
**Status:** DRAFT (TO BE TESTED ON STAGING)
**Owner:** DevOps / Deployment Lead
**Pre-requisites:** Complete PRODUCTION_READINESS.md checklist items #1-4
---
## Overview
This document defines the step-by-step procedure for deploying Gravl to production and verifying system health.
**Estimated Duration:** 2-3 hours (plus verification window)
**Rollback Window:** <15 minutes (with ROLLBACK.md procedure)
**Required Team:** DevOps (2), Backend (1), Frontend Lead (1)
---
## Pre-Flight Checklist (T-30 minutes)
- [ ] Production cluster access verified (kubectl configured)
- [ ] All team members on call (Slack + video bridge open)
- [ ] Backup of production database exists (snapshot/automated backup running)
- [ ] Monitoring dashboards loaded and ready (Grafana open in separate browser tabs)
- [ ] Rollback procedure briefed to team (5-minute review of ROLLBACK.md)
- [ ] Production domain DNS propagated (DNS check script below)
- [ ] TLS certificates ready or cert-manager deployed and tested
- [ ] Alert thresholds reviewed (no overly sensitive alerts during deployment)
- [ ] Staging environment running last validated build
- [ ] Load balancer health checks configured
- [ ] Incident communication channel created (Slack #gravl-incident)
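A quick script for the DNS and TLS items (requires `dig` and `curl`; the domain is the placeholder used throughout this runbook):
```bash
DOMAIN=gravl.example.com   # placeholder domain

# DNS propagated? Answers from a public resolver and the default resolver should match
dig +short "$DOMAIN" @1.1.1.1
dig +short "$DOMAIN"

# TLS endpoint reachable (expected to fail until the ingress is live; useful on re-runs)
curl -sS -o /dev/null -w '%{http_code}\n' "https://$DOMAIN" || true
```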
---
## Phase 1: Environment & Infrastructure Setup (T-60 to T-30 minutes)
### 1.1 Create Kubernetes Namespace & RBAC
```bash
# Apply production namespace configuration
kubectl apply -f k8s/production/namespace.yaml
# Apply RBAC for production deployments
kubectl apply -f k8s/production/rbac.yaml
# Verify namespace created
kubectl get ns gravl-production
kubectl get serviceaccount -n gravl-production gravl-deployer
```
**Verification:**
- [ ] Namespace exists
- [ ] ServiceAccount exists
- [ ] RBAC role bound (spot-check below)
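A spot-check that the binding grants what the deployer needs; the verb/resource pair here is an example, so tailor it to the RBAC manifest:
```bash
# Should print "yes" if the ServiceAccount can manage deployments in the namespace
kubectl auth can-i create deployments \
  --as=system:serviceaccount:gravl-production:gravl-deployer \
  -n gravl-production
```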
### 1.2 Apply Network Policies
```bash
# Apply default deny + explicit allow rules
kubectl apply -f k8s/production/network-policy.yaml
# Verify policies (should see 5+ NetworkPolicies)
kubectl get networkpolicies -n gravl-production
```
**Verification:**
- [ ] Default deny ingress in place
- [ ] Backend, frontend, database, monitoring policies visible
### 1.3 Deploy Secrets (Sealed or External)
**Option A: Sealed Secrets** (if kubeseal is deployed)
```bash
# Apply SealedSecret manifests; the in-cluster sealed-secrets controller decrypts them
kubectl apply -f k8s/production/sealed-secrets.yaml
# Verify secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret postgres-secret -n gravl-production
```
**Option B: External Secrets Operator** (if AWS/Vault used)
```bash
# Apply ExternalSecret definitions
kubectl apply -f k8s/production/external-secrets.yaml
# Verify ExternalSecrets synced (should see status: synced)
kubectl get externalsecrets -n gravl-production
kubectl describe externalsecret postgres-secret -n gravl-production
```
**Verification:**
- [ ] postgres-secret contains POSTGRES_PASSWORD
- [ ] app-secret contains JWT_SECRET (non-destructive key check below)
- [ ] registry-pull-secret exists (if private registry used)
- [ ] TLS secret exists (named staging-tls in the manifests; cert-manager will auto-create it if configured)
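To confirm the keys are populated without echoing secret values, check byte counts only:
```bash
# Non-zero counts confirm the keys exist and are non-empty
kubectl get secret postgres-secret -n gravl-production \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d | wc -c
kubectl get secret app-secret -n gravl-production \
  -o jsonpath='{.data.JWT_SECRET}' | base64 -d | wc -c
```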
### 1.4 Deploy cert-manager (if not already on cluster)
```bash
# Install cert-manager (one-time, if needed)
helm repo add jetstack https://charts.jetstack.io && helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true \
--version v1.13.0
# Create ClusterIssuer for Let's Encrypt (production)
kubectl apply -f k8s/production/cert-manager-issuer.yaml
# Verify issuer ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod
```
**Verification:**
- [ ] cert-manager pods running in cert-manager namespace
- [ ] ClusterIssuer status is READY (True)
---
## Phase 2: Database & Storage (T-30 to T-10 minutes)
### 2.1 Deploy PostgreSQL StatefulSet
```bash
# Deploy PostgreSQL to production
kubectl apply -f k8s/production/postgres-statefulset.yaml
# Watch for Pod readiness (should take 30-60 seconds)
kubectl rollout status statefulset/postgres -n gravl-production
# Verify pod is running and ready (2/2 containers)
kubectl get pods -n gravl-production -l component=database
```
**Verification:**
- [ ] Pod status: Running, Ready 2/2
- [ ] PersistentVolumeClaim bound (check below)
- [ ] No errors in pod logs: `kubectl logs postgres-0 -n gravl-production`
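For the PVC item:
```bash
# STATUS should read "Bound" for the postgres data claim
kubectl get pvc -n gravl-production
```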
### 2.2 Run Database Migrations
```bash
# Port-forward to the database so migrations can run from this machine
kubectl port-forward postgres-0 5432:5432 -n gravl-production &
# Run migrations in a separate terminal
cd backend
npm run db:migrate:prod
# Alternatively, if migrations run as a Kubernetes Job, monitor that instead:
kubectl logs -n gravl-production -f job/db-migration
# Kill the port-forward when done
kill %1
```
**Verification:**
- [ ] Migrations completed successfully (or the migration Job succeeded, if one is used)
- [ ] No migration errors in logs
- [ ] Database schema matches expected version (query sketch below)
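A hedged query for the schema-version item, assuming the migration tool records applied migrations in a table (`migrations` is a guess; knex uses `knex_migrations`, node-pg-migrate uses `pgmigrations`):
```bash
# Re-uses the port-forward from above; table name depends on the migration tool
PGPASSWORD=$(kubectl get secret postgres-secret -n gravl-production \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d) \
  psql -h localhost -U gravl_user -d gravl \
  -c "SELECT * FROM migrations ORDER BY 1 DESC LIMIT 5;"
```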
### 2.3 Verify Database Connectivity
```bash
# Create a test pod to verify DB access (psql reads the password from PGPASSWORD,
# pulled here from the postgres-secret created in step 1.3)
POSTGRES_PASSWORD=$(kubectl get secret postgres-secret -n gravl-production \
-o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
kubectl run -it --rm --image=postgres:15 \
--restart=Never \
--env="PGPASSWORD=$POSTGRES_PASSWORD" \
-n gravl-production \
psql-test \
-- psql -h postgres -U gravl_user -d gravl -c "SELECT version();"
# Should return PostgreSQL version
```
**Verification:**
- [ ] Database connection successful
- [ ] PostgreSQL version visible
---
## Phase 3: Deploy Application Services (T-10 to T+20 minutes)
### 3.1 Deploy Backend Deployment
```bash
# Deploy backend service
kubectl apply -f k8s/production/backend-deployment.yaml
# Wait for rollout (typically 2-3 minutes)
kubectl rollout status deployment/backend -n gravl-production
# Verify pods running
kubectl get pods -n gravl-production -l component=backend
```
**Verification:**
- [ ] Pods running and ready (depends on replicas, e.g., 3 replicas = 3/3 ready)
- [ ] No CrashLoopBackOff errors
- [ ] Service endpoint registered: `kubectl get svc backend -n gravl-production`
### 3.2 Deploy Frontend Deployment
```bash
# Deploy frontend service
kubectl apply -f k8s/production/frontend-deployment.yaml
# Wait for rollout
kubectl rollout status deployment/frontend -n gravl-production
# Verify pods
kubectl get pods -n gravl-production -l component=frontend
```
**Verification:**
- [ ] Frontend pods running and ready
- [ ] Service endpoint registered
### 3.3 Apply Ingress with TLS Termination
```bash
# Deploy ingress (cert-manager auto-provisions TLS if the cert-manager.io/cluster-issuer annotation is set)
kubectl apply -f k8s/production/ingress.yaml
# Wait for ingress to get external IP / DNS name (typically 30-60 seconds)
kubectl get ingress -n gravl-production -w
# Check ingress status and TLS certificate
kubectl describe ingress gravl-ingress -n gravl-production
```
**Verification:**
- [ ] Ingress has external IP or DNS name assigned
- [ ] TLS certificate present (cert-manager auto-created if configured)
- [ ] SSL certificate not self-signed (check with OpenSSL):
```bash
echo | openssl s_client -servername gravl.example.com \
-connect $(kubectl get ingress gravl-ingress -n gravl-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):443 2>/dev/null \
| openssl x509 -noout -subject -issuer
```
---
## Phase 4: Service Integration Verification (T+20 to T+40 minutes)
### 4.1 Test Service-to-Service Communication
```bash
# Exec into backend pod to test database connection
BACKEND_POD=$(kubectl get pod -n gravl-production -l component=backend -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $BACKEND_POD -n gravl-production -- \
curl http://postgres:5432 -v 2>&1 | head -5
# Expected: Some indication that postgres port is responding (or timeout), not "connection refused"
```
**Verification:**
- [ ] Backend can reach database (a timeout is acceptable, "connection refused" is not; alternative TCP probe below)
- [ ] Backend logs show no database errors: `kubectl logs $BACKEND_POD -n gravl-production | grep -i error | head -10`
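If curl's output is ambiguous, a plain TCP probe distinguishes an open port from a refused connection without extra tooling (assumes the backend image ships bash):
```bash
kubectl exec -it $BACKEND_POD -n gravl-production -- \
  bash -c 'timeout 3 bash -c "exec 3<>/dev/tcp/postgres/5432" && echo "port open" || echo "closed or filtered"'
```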
### 4.2 Health Check Endpoint
```bash
# Get backend service IP
BACKEND_SVC=$(kubectl get svc backend -n gravl-production -o jsonpath='{.spec.clusterIP}')
# Test health endpoint (from another pod)
kubectl run -it --rm --image=curlimages/curl \
--restart=Never \
-n gravl-production \
curl-test \
-- curl http://$BACKEND_SVC:3000/health
# Expected response: {"status":"ok"} or similar
```
**Verification:**
- [ ] Health endpoint responds (HTTP 200)
- [ ] No error messages in response
### 4.3 External Endpoint Test (via Ingress)
```bash
# Wait for DNS propagation (if using DNS name, not IP)
# Then test external access
curl -k https://gravl.example.com/api/health
# Expected: HTTP 200 with health status
```
**Verification:**
- [ ] HTTPS responds (the -k flag is acceptable while cert-manager is still provisioning the certificate)
- [ ] Backend responds through ingress
---
## Phase 5: Monitoring & Alerting Setup (T+40 to T+60 minutes)
### 5.1 Verify Prometheus Scraping
```bash
# Check Prometheus targets (should show gravl-production scrape configs)
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &
# Open http://localhost:9090/targets in browser
# Verify all gravl-production targets are "UP"
kill %1
```
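For a scriptable alternative to the browser check, the Prometheus HTTP API exposes the same target health (assumes `jq` is installed locally):
```bash
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &
sleep 3
# Prints any target that is not "up"; empty output means all targets are healthy
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.health != "up") | "\(.labels.job): \(.lastError)"'
kill %1
```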
**Verification:**
- [ ] All production targets showing as UP
- [ ] No "DOWN" endpoints
### 5.2 Verify Grafana Dashboards
```bash
# Access Grafana
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &
# Open http://localhost:3000
# Login with default credentials (or stored secret)
# Navigate to Gravl dashboards
# Verify graphs showing production metrics
kill %1
```
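Grafana's unauthenticated health endpoint adds a scriptable sanity check alongside the visual review:
```bash
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &
sleep 3
# Expects {"database":"ok",...}; no login required for this endpoint
curl -s http://localhost:3000/api/health
kill %1
```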
**Verification:**
- [ ] Gravl dashboards visible
- [ ] Metrics flowing (not empty graphs)
- [ ] CPU, memory, request rate graphs showing data
### 5.3 Verify AlertManager
```bash
# Check AlertManager configuration (should have production severity levels)
kubectl get alertmanagerconfig -n gravl-monitoring
kubectl describe alertmanagerconfig -n gravl-monitoring
```
**Verification:**
- [ ] Alerts configured for production thresholds
- [ ] Notification channels (Slack, PagerDuty, etc.) configured
### 5.4 Test Alert Trigger
```bash
# Send test alert through AlertManager
kubectl exec -it -n gravl-monitoring alertmanager-0 -- \
amtool alert add test_alert severity=info --alertmanager.url=http://localhost:9093
# Check Slack / notification channel for alert (should arrive within 1 minute)
```
**Verification:**
- [ ] Test alert received in notification channel
- [ ] Alert formatting correct
- [ ] No excessive duplicate alerts
---
## Phase 6: Load Test & Baseline (T+60 to T+90 minutes)
### 6.1 Run Load Test on Production (Low Traffic)
```bash
# Generate light load using k6 or Apache Bench
k6 run --vus 10 --duration 5m k8s/production/load-test.js
# Expected results:
# - p95 latency: <200ms
# - Throughput: >100 req/s
# - Error rate: <0.1%
```
**Verification:**
- [ ] p95 latency <200ms
- [ ] Error rate <0.1%
- [ ] No pod restarts during test (restart-count check below)
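For the restart-count item:
```bash
# Every value in the RESTARTS column should be 0
kubectl get pods -n gravl-production \
  -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount'
```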
### 6.2 Baseline Metrics Captured
```bash
# Log current metrics for baseline
kubectl top nodes > /tmp/baseline-nodes.txt
kubectl top pods -n gravl-production > /tmp/baseline-pods.txt
# Store for comparison (alert if exceeds 2x baseline)
```
**Verification:**
- [ ] Node CPU/Memory usage within expected range
- [ ] Pod CPU/Memory usage within resource requests
---
## Phase 7: Production Sign-Off (T+90 minutes)
### 7.1 Final Checklist
- [ ] All pre-flight checks passed
- [ ] Database healthy and migrated
- [ ] All services running and ready
- [ ] Ingress responding (TLS valid)
- [ ] Health checks passing
- [ ] Monitoring metrics flowing
- [ ] Alerts functional
- [ ] Load test passed
- [ ] Team lead review: ✅ READY TO GO LIVE
### 7.2 Change Log Entry
```bash
# Log deployment to version control (written inside the repo so git can stage it;
# the deploy-logs/ path is illustrative, adjust to your repo's convention)
mkdir -p deploy-logs
cat > deploy-logs/PRODUCTION_DEPLOY-2026-03-06.log << 'DEPLOY_LOG'
---
date: 2026-03-06
time: ~09:30 UTC
environment: production
namespace: gravl-production
services:
- backend: v1.x.x
- frontend: v1.x.x
- postgres: 15.x
- ingress: nginx
- certificates: cert-manager (Let's Encrypt)
pre_flight_status: ✅ PASSED
security_review: ✅ APPROVED
monitoring_status: ✅ OPERATIONAL
load_test_result: ✅ PASSED
sign_off_by: [DevOps Lead]
DEPLOY_LOG
git add deploy-logs/PRODUCTION_DEPLOY-2026-03-06.log
git commit -m "Production deployment log - 2026-03-06"
```
### 7.3 Notify Team
- [ ] Send deployment completion notice to Slack #gravl-announce
```
🚀 **Gravl Production Deployment COMPLETE**
- Timestamp: 2026-03-06 09:30 UTC
- All systems operational
- Monitoring dashboards: [link]
- Status page: [link]
```
- [ ] Update status page (if external-facing)
- [ ] Notify stakeholders (product, marketing)
---
## Rollback Decision Tree
**If at any point a critical failure occurs:**
1. Do NOT proceed
2. Trigger ROLLBACK.md procedure (minimal sketch after the indicator list below)
3. Investigate root cause post-incident (blameless postmortem)
**Critical Failure Indicators:**
- Database connection failures after 3 retries
- More than 2 pod crashes during rollout
- Ingress TLS certificate invalid
- Health checks failing on all pods
- Alerts firing for production thresholds
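A minimal application-rollback sketch; ROLLBACK.md remains the authoritative procedure, and note that `kubectl rollout undo` does not revert database schema changes:
```bash
# Revert backend and frontend to their previous ReplicaSets
kubectl rollout undo deployment/backend -n gravl-production
kubectl rollout undo deployment/frontend -n gravl-production
# Confirm the rollback completed
kubectl rollout status deployment/backend -n gravl-production
kubectl rollout status deployment/frontend -n gravl-production
```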
---
## Post-Deployment (T+120 minutes and beyond)
### 7.4 Sustained Monitoring Window (Next 24 hours)
- [ ] Assign on-call rotation (24h monitoring)
- [ ] Set up escalation policy (alert → on-call → incident lead)
- [ ] Daily review of logs and metrics for first week
- [ ] Customer feedback monitoring (support tickets, user reports)
### 7.5 Post-Deployment Review (24 hours)
- [ ] Team retrospective (what went well, what to improve)
- [ ] Update runbooks based on findings
- [ ] Document any manual interventions for automation
- [ ] Plan optimization and hardening work for next phase
---
**Document Version:** 1.0
**Last Updated:** 2026-03-06 08:50
**Next Update:** After first production deployment attempt