Production Go-Live Procedure — Phase 10-07, Task 5
Date: 2026-03-06
Status: DRAFT (TO BE TESTED ON STAGING)
Owner: DevOps / Deployment Lead
Pre-requisites: Complete PRODUCTION_READINESS.md checklist items #1-4
Overview
This document defines the step-by-step procedure for deploying Gravl to production and verifying system health.
Estimated Duration: 2-3 hours (plus verification window)
Rollback Window: <15 minutes (with ROLLBACK.md procedure)
Required Team: DevOps (2), Backend (1), Frontend Lead (1)
Pre-Flight Checklist (T-30 minutes)
- Production cluster access verified (kubectl configured)
- All team members on call (Slack + video bridge open)
- Backup of production database exists (snapshot/automated backup running)
- Monitoring dashboards loaded and ready (Grafana open in separate browser tabs)
- Rollback procedure briefed to team (5-minute review of ROLLBACK.md)
- Production domain DNS propagated (check DNS resolution)
- TLS certificates ready or cert-manager deployed and tested
- Alert thresholds reviewed (no overly sensitive alerts during deployment)
- Staging environment running last validated build
- Load balancer health checks configured
- Incident communication channel created (Slack #gravl-incident)
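Part of the checklist above can be automated so nothing is skipped under time pressure. A minimal sketch (the tool list and the commented-out context name gravl-prod are assumptions, not from the repo):

```shell
#!/usr/bin/env bash
# Pre-flight tool check: report any missing CLI dependency before starting.
missing=0
require() {
  command -v "$1" >/dev/null 2>&1 || { echo "MISSING: $1" >&2; missing=1; }
}
require kubectl
require helm
require curl
# Context check -- uncomment once the real production context name is confirmed:
# [ "$(kubectl config current-context)" = "gravl-prod" ] || { echo "wrong kubectl context" >&2; missing=1; }
echo "pre-flight tool check: $([ "$missing" -eq 0 ] && echo OK || echo FAILED)"
```

Run it at T-30 and abort the window if it prints FAILED.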
Phase 1: Environment & Infrastructure Setup (T-60 to T-30 minutes)
1.1 Create Kubernetes Namespace & RBAC
# Apply production namespace configuration
kubectl apply -f k8s/production/namespace.yaml
# Apply RBAC for production deployments
kubectl apply -f k8s/production/rbac.yaml
# Verify namespace created
kubectl get ns gravl-production
kubectl get serviceaccount -n gravl-production gravl-deployer
Verification:
- Namespace exists
- ServiceAccount exists
- RBAC role bound
1.2 Apply Network Policies
# Apply default deny + explicit allow rules
kubectl apply -f k8s/production/network-policy.yaml
# Verify policies (should see 5+ NetworkPolicies)
kubectl get networkpolicies -n gravl-production
Verification:
- Default deny ingress in place
- Backend, frontend, database, monitoring policies visible
1.3 Deploy Secrets (Sealed or External)
Option A: Sealed Secrets (if the sealed-secrets controller is deployed)
# Apply the sealed secrets; the in-cluster controller decrypts them into
# regular Secrets. Do NOT pipe through kubeseal here -- kubeseal encrypts,
# it does not decrypt.
kubectl apply -f k8s/production/sealed-secrets.yaml
# Verify secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret postgres-secret -n gravl-production
Option B: External Secrets Operator (if AWS/Vault used)
# Apply ExternalSecret definitions
kubectl apply -f k8s/production/external-secrets.yaml
# Verify ExternalSecrets synced (should see status: synced)
kubectl get externalsecrets -n gravl-production
kubectl describe externalsecret postgres-secret -n gravl-production
Verification:
- postgres-secret contains POSTGRES_PASSWORD
- app-secret contains JWT_SECRET
- registry-pull-secret exists (if private registry used)
- TLS secret exists (the manifests currently name it staging-tls), or cert-manager will auto-create it
1.4 Deploy cert-manager (if not already on cluster)
# Install cert-manager (one-time, if needed)
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true \
--version v1.13.0
# Create ClusterIssuer for Let's Encrypt (production)
kubectl apply -f k8s/production/cert-manager-issuer.yaml
# Verify issuer ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod
Verification:
- cert-manager pods running in cert-manager namespace
- ClusterIssuer status is READY (True)
Phase 2: Database & Storage (T-30 to T-10 minutes)
2.1 Deploy PostgreSQL StatefulSet
# Deploy PostgreSQL to production
kubectl apply -f k8s/production/postgres-statefulset.yaml
# Watch for Pod readiness (should take 30-60 seconds)
kubectl rollout status statefulset/postgres -n gravl-production
# Verify pod is running and ready (2/2 containers)
kubectl get pods -n gravl-production -l component=database
Verification:
- Pod status: Running, Ready 2/2
- PersistentVolumeClaim bound
- No errors in pod logs:
kubectl logs postgres-0 -n gravl-production
2.2 Run Database Migrations
# Option A: run migrations locally over a port-forward to the database
kubectl port-forward postgres-0 5432:5432 -n gravl-production &
cd backend
npm run db:migrate:prod
# Kill port-forward when done
kill %1
# Option B: if migrations run as a Kubernetes Job instead, skip the
# port-forward and monitor the Job directly
kubectl logs -n gravl-production -f job/db-migration
kubectl wait --for=condition=complete job/db-migration \
-n gravl-production --timeout=300s
Verification:
- Migration job completed successfully
- No migration errors in logs
- Database schema matches expected version
2.3 Verify Database Connectivity
# Create a test pod to verify DB access
kubectl run -it --rm --image=postgres:15 \
--restart=Never \
-n gravl-production \
psql-test \
-- psql -h postgres -U gravl_user -d gravl -c "SELECT version();"
# Should return PostgreSQL version
Verification:
- Database connection successful
- PostgreSQL version visible
Phase 3: Deploy Application Services (T-10 to T+20 minutes)
3.1 Deploy Backend Deployment
# Deploy backend service
kubectl apply -f k8s/production/backend-deployment.yaml
# Wait for rollout (typically 2-3 minutes)
kubectl rollout status deployment/backend -n gravl-production
# Verify pods running
kubectl get pods -n gravl-production -l component=backend
Verification:
- Pods running and ready (depends on replicas, e.g., 3 replicas = 3/3 ready)
- No CrashLoopBackOff errors
- Service endpoint registered:
kubectl get svc backend -n gravl-production
3.2 Deploy Frontend Deployment
# Deploy frontend service
kubectl apply -f k8s/production/frontend-deployment.yaml
# Wait for rollout
kubectl rollout status deployment/frontend -n gravl-production
# Verify pods
kubectl get pods -n gravl-production -l component=frontend
Verification:
- Frontend pods running and ready
- Service endpoint registered
3.3 Apply Ingress with TLS Termination
# Deploy ingress (cert-manager will auto-provision TLS if the cert-manager.io/cluster-issuer annotation is set)
kubectl apply -f k8s/production/ingress.yaml
# Wait for ingress to get external IP / DNS name (typically 30-60 seconds)
kubectl get ingress -n gravl-production -w
# Check ingress status and TLS certificate
kubectl describe ingress gravl-ingress -n gravl-production
Verification:
- Ingress has external IP or DNS name assigned
- TLS certificate present (cert-manager auto-created if configured)
- SSL certificate not self-signed (check with OpenSSL):
echo | openssl s_client -servername gravl.example.com \
-connect $(kubectl get ingress gravl-ingress -n gravl-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):443 2>/dev/null | grep Subject
Phase 4: Service Integration Verification (T+20 to T+40 minutes)
4.1 Test Service-to-Service Communication
# Exec into backend pod to test database connection
BACKEND_POD=$(kubectl get pod -n gravl-production -l component=backend -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $BACKEND_POD -n gravl-production -- \
curl http://postgres:5432 -v 2>&1 | head -5
# Expected: Some indication that postgres port is responding (or timeout), not "connection refused"
Verification:
- Backend can reach database (even if timeout, not connection refused)
- Backend logs show no database errors:
kubectl logs $BACKEND_POD -n gravl-production | grep -i error | head -10
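curl may not be present in a slim backend image. If it is missing, bash's built-in /dev/tcp redirection works as a minimal TCP probe; a sketch (the postgres host and port match the Service above):

```shell
#!/usr/bin/env bash
# tcp_probe HOST PORT: exit 0 if a TCP connection succeeds within 3 seconds.
# Distinguishes "connection refused" / timeout (nonzero) from success (zero)
# without needing curl or nc in the container image.
tcp_probe() {
  timeout 3 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null
}
# Example, from inside the backend pod:
# tcp_probe postgres 5432 && echo "db reachable" || echo "db NOT reachable"
```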
4.2 Health Check Endpoint
# Get backend service IP
BACKEND_SVC=$(kubectl get svc backend -n gravl-production -o jsonpath='{.spec.clusterIP}')
# Test health endpoint (from another pod)
kubectl run -it --rm --image=curlimages/curl \
--restart=Never \
-n gravl-production \
curl-test \
-- curl http://$BACKEND_SVC:3000/health
# Expected response: {"status":"ok"} or similar
Verification:
- Health endpoint responds (HTTP 200)
- No error messages in response
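Immediately after a rollout, the first health probe can race pod readiness and report a false negative. A small retry wrapper (a sketch, not from the repo) avoids aborting on a transient miss:

```shell
#!/usr/bin/env bash
# retry N CMD...: run CMD up to N times, sleeping 1s between attempts;
# exit 0 on the first success, 1 if all attempts fail.
retry() {
  attempts=$1; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$attempts" ] && return 1
    sleep 1
  done
}
# Example: retry 10 curl -fsS "http://$BACKEND_SVC:3000/health"
```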
4.3 External Endpoint Test (via Ingress)
# Wait for DNS propagation (if using DNS name, not IP)
# Then test external access
curl -k https://gravl.example.com/api/health
# Expected: HTTP 200 with health status
Verification:
- HTTPS responds (use -k only while cert-manager is still provisioning; the final Let's Encrypt certificate should validate without it)
- Backend responds through ingress
Phase 5: Monitoring & Alerting Setup (T+40 to T+60 minutes)
5.1 Verify Prometheus Scraping
# Check Prometheus targets (should show gravl-production scrape configs)
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &
# Open http://localhost:9090/targets in browser
# Verify all gravl-production targets are "UP"
kill %1
Verification:
- All production targets showing as UP
- No "DOWN" endpoints
5.2 Verify Grafana Dashboards
# Access Grafana
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &
# Open http://localhost:3000
# Login with default credentials (or stored secret)
# Navigate to Gravl dashboards
# Verify graphs showing production metrics
kill %1
Verification:
- Gravl dashboards visible
- Metrics flowing (not empty graphs)
- CPU, memory, request rate graphs showing data
5.3 Verify AlertManager
# Check AlertManager configuration (should have production severity levels)
kubectl get alertmanagerconfig -n gravl-monitoring
kubectl describe alertmanagerconfig -n gravl-monitoring
Verification:
- Alerts configured for production thresholds
- Notification channels (Slack, PagerDuty, etc.) configured
5.4 Test Alert Trigger
# Send test alert through AlertManager
kubectl exec -it -n gravl-monitoring alertmanager-0 -- \
amtool alert add test_alert severity=info --alertmanager.url=http://localhost:9093
# Check Slack / notification channel for alert (should arrive within 1 minute)
Verification:
- Test alert received in notification channel
- Alert formatting correct
- No excessive duplicate alerts
Phase 6: Load Test & Baseline (T+60 to T+90 minutes)
6.1 Run Load Test on Production (Low Traffic)
# Generate light load using k6 or Apache Bench
k6 run --vus 10 --duration 5m k8s/production/load-test.js
# Expected results:
# - p95 latency: <200ms
# - Throughput: >100 req/s
# - Error rate: <0.1%
Verification:
- p95 latency <200ms
- Error rate <0.1%
- No pod restarts during test
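If k6 is unavailable during the window, a rough p95 spot-check can be done in the shell; a sketch (the health URL is a placeholder, and this is a sanity probe, not a substitute for the k6 run):

```shell
#!/usr/bin/env bash
# p95: read newline-separated latencies (seconds) on stdin, print the value
# at the 95th percentile position.
p95() {
  sort -n | awk '{ a[NR] = $1 } END { idx = int(NR * 0.95); if (idx < 1) idx = 1; print a[idx] }'
}
# Spot-check: 100 sequential requests through the ingress
# for i in $(seq 100); do
#   curl -s -o /dev/null -w '%{time_total}\n' https://gravl.example.com/api/health
# done | p95
```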
6.2 Baseline Metrics Captured
# Log current metrics for baseline
kubectl top nodes > /tmp/baseline-nodes.txt
kubectl top pods -n gravl-production > /tmp/baseline-pods.txt
# Store for comparison (alert if exceeds 2x baseline)
Verification:
- Node CPU/Memory usage within expected range
- Pod CPU/Memory usage within resource requests
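The "alert if exceeds 2x baseline" comparison can be scripted instead of eyeballed. A sketch that diffs two `kubectl top pods` snapshots on CPU millicores (file names are placeholders):

```shell
#!/usr/bin/env bash
# compare_cpu BASELINE CURRENT: flag pods whose CPU (millicores) now exceeds
# 2x the baseline snapshot. Both files are raw `kubectl top pods` output:
#   NAME        CPU(cores)   MEMORY(bytes)
#   backend-x   120m         256Mi
compare_cpu() {
  awk 'NR == FNR { if (FNR > 1) { sub(/m$/, "", $2); base[$1] = $2 }; next }
       FNR > 1 {
         sub(/m$/, "", $2)
         if (($1 in base) && $2 + 0 > 2 * base[$1])
           printf "ALERT %s cpu=%sm baseline=%sm\n", $1, $2, base[$1]
       }' "$1" "$2"
}
# Usage after capturing a fresh snapshot:
# compare_cpu /tmp/baseline-pods.txt /tmp/current-pods.txt
```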
Phase 7: Production Sign-Off (T+90 minutes)
7.1 Final Checklist
- All pre-flight checks passed
- Database healthy and migrated
- All services running and ready
- Ingress responding (TLS valid)
- Health checks passing
- Monitoring metrics flowing
- Alerts functional
- Load test passed
- Team lead review: ✅ READY TO GO LIVE
7.2 Change Log Entry
# Log deployment to version control (write the file inside the repo
# checkout so it can be committed; a file in /tmp is outside the
# working tree and cannot be staged)
cat > PRODUCTION_DEPLOY.log << 'DEPLOY_LOG'
---
date: 2026-03-06
time: ~09:30 UTC
environment: production
namespace: gravl-production
services:
- backend: v1.x.x
- frontend: v1.x.x
- postgres: 15.x
- ingress: nginx
- certificates: cert-manager (Let's Encrypt)
pre_flight_status: ✅ PASSED
security_review: ✅ APPROVED
monitoring_status: ✅ OPERATIONAL
load_test_result: ✅ PASSED
sign_off_by: [DevOps Lead]
DEPLOY_LOG
git add PRODUCTION_DEPLOY.log
git commit -m "Production deployment log - 2026-03-06"
7.3 Notify Team
- Send deployment completion notice to Slack #gravl-announce:
  🚀 **Gravl Production Deployment COMPLETE**
  - Timestamp: 2026-03-06 09:30 UTC
  - All systems operational
  - Monitoring dashboards: [link]
  - Status page: [link]
- Update status page (if external-facing)
- Notify stakeholders (product, marketing)
Rollback Decision Tree
If at any point a critical failure occurs:
- Do NOT proceed
- Trigger ROLLBACK.md procedure
- Investigate root cause post-incident (blameless postmortem)
Critical Failure Indicators:
- Database connection failures after 3 retries
- More than 2 pod crashes during rollout
- Ingress TLS certificate invalid
- Health checks failing on all pods
- Alerts firing for production thresholds
Post-Deployment (T+120 minutes and beyond)
7.4 Sustained Monitoring Window (Next 24 hours)
- Assign on-call rotation (24h monitoring)
- Set up escalation policy (alert → on-call → incident lead)
- Daily review of logs and metrics for first week
- Customer feedback monitoring (support tickets, user reports)
7.5 Post-Deployment Review (24 hours)
- Team retrospective (what went well, what to improve)
- Update runbooks based on findings
- Document any manual interventions for automation
- Plan optimization and hardening work for next phase
Document Version: 1.0
Last Updated: 2026-03-06 08:50
Next Update: After first production deployment attempt