Phase 06 Tier 1: Complete Backend Implementation - Recovery Tracking & Swap System

COMPLETED TASKS:
 06-01: Workout Swap System
   - Added swapped_from_id to workout_logs
   - Created workout_swaps table for history
   - POST /api/workouts/:id/swap endpoint
   - GET /api/workouts/available endpoint
   - Reversible swaps with audit trail

 06-02: Muscle Group Recovery Tracking
   - Created muscle_group_recovery table
   - Implemented calculateRecoveryScore() function
   - GET /api/recovery/muscle-groups endpoint
   - GET /api/recovery/most-recovered endpoint
   - Auto-tracking on workout log completion

 06-03: Smart Workout Recommendations
   - GET /api/recommendations/smart-workout endpoint
   - 7-day workout analysis algorithm
   - Recovery-based filtering (>30% threshold)
   - Top 3 recommendations, each with context-aware reasoning messages (example calls below)
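
Example calls against the new endpoints (base URL, token, workout IDs, and the swap request body are placeholders, not the actual contract):
```bash
BASE=http://localhost:3000   # placeholder host/port
TOKEN="<jwt>"                # placeholder auth token

# 06-01: swap a scheduled workout (body shape is illustrative)
curl -X POST "$BASE/api/workouts/123/swap" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"new_workout_id": 456}'
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/workouts/available"

# 06-02: recovery scores per muscle group
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/recovery/muscle-groups"
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/recovery/most-recovered"

# 06-03: top 3 recommendations with reasoning context
curl -H "Authorization: Bearer $TOKEN" "$BASE/api/recommendations/smart-workout"
```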

DATABASE CHANGES:
- Added 4 new tables: muscle_group_recovery, workout_swaps, custom_workouts, custom_workout_exercises
- Extended workout_logs with: swapped_from_id, source_type, custom_workout_id, custom_workout_exercise_id
- Created 7 new indexes for performance (verification sketch below)
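
A hedged sketch for verifying the schema changes landed (connection details are placeholders; assumes PostgreSQL):
```bash
# Placeholders: host, user, database (set PGPASSWORD first if auth requires it)
psql -h localhost -U gravl_user -d gravl <<'SQL'
-- all four new tables should be listed
SELECT table_name FROM information_schema.tables
WHERE table_name IN ('muscle_group_recovery', 'workout_swaps',
                     'custom_workouts', 'custom_workout_exercises');
-- workout_logs should carry the four new columns
SELECT column_name FROM information_schema.columns
WHERE table_name = 'workout_logs'
  AND column_name IN ('swapped_from_id', 'source_type',
                      'custom_workout_id', 'custom_workout_exercise_id');
SQL
```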

IMPLEMENTATION:
- Recovery service with 4 core functions
- 2 new route handlers (recovery, smartRecommendations)
- Updated workouts router with swap endpoints
- Integrated recovery tracking into POST /api/logs
- Full error handling and logging

TESTING:
- Test file created: /backend/test/phase-06-tests.js
- Ready for E2E and staging validation

STATUS: Ready for frontend integration and production review
Branch: feature/06-phase-06
# Production Go-Live Procedure — Phase 10-07, Task 5
**Date:** 2026-03-06
**Status:** DRAFT (TO BE TESTED ON STAGING)
**Owner:** DevOps / Deployment Lead
**Pre-requisites:** Complete PRODUCTION_READINESS.md checklist items #1-4
---
## Overview
This document defines the step-by-step procedure for deploying Gravl to production and verifying system health.
**Estimated Duration:** 2-3 hours (plus verification window)
**Rollback Window:** <15 minutes (with ROLLBACK.md procedure)
**Required Team:** DevOps (2), Backend (1), Frontend Lead (1)
---
## Pre-Flight Checklist (T-30 minutes)
- [ ] Production cluster access verified (kubectl configured)
- [ ] All team members on call (Slack + video bridge open)
- [ ] Backup of production database exists (snapshot/automated backup running)
- [ ] Monitoring dashboards loaded and ready (Grafana open in separate browser tabs)
- [ ] Rollback procedure briefed to team (5-minute review of ROLLBACK.md)
- [ ] Production domain DNS propagated (DNS check script below)
- [ ] TLS certificates ready or cert-manager deployed and tested
- [ ] Alert thresholds reviewed (no overly sensitive alerts during deployment)
- [ ] Staging environment running last validated build
- [ ] Load balancer health checks configured
- [ ] Incident communication channel created (Slack #gravl-incident)
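A quick script for the DNS and TLS items (requires `dig` and `curl`; the domain is the placeholder used throughout this runbook):
```bash
DOMAIN=gravl.example.com   # placeholder domain

# DNS propagated? Answers from a public resolver and the default resolver should match
dig +short "$DOMAIN" @1.1.1.1
dig +short "$DOMAIN"

# TLS endpoint reachable (expected to fail until the ingress is live; useful on re-runs)
curl -sS -o /dev/null -w '%{http_code}\n' "https://$DOMAIN" || true
```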
---
## Phase 1: Environment & Infrastructure Setup (T-60 to T-30 minutes)
### 1.1 Create Kubernetes Namespace & RBAC
```bash
# Apply production namespace configuration
kubectl apply -f k8s/production/namespace.yaml
# Apply RBAC for production deployments
kubectl apply -f k8s/production/rbac.yaml
# Verify namespace created
kubectl get ns gravl-production
kubectl get serviceaccount -n gravl-production gravl-deployer
```
**Verification:**
- [ ] Namespace exists
- [ ] ServiceAccount exists
- [ ] RBAC role bound (spot-check below)
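A spot-check that the binding grants what the deployer needs; the verb/resource pair here is an example, so tailor it to the RBAC manifest:
```bash
# Should print "yes" if the ServiceAccount can manage deployments in the namespace
kubectl auth can-i create deployments \
  --as=system:serviceaccount:gravl-production:gravl-deployer \
  -n gravl-production
```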
### 1.2 Apply Network Policies
```bash
# Apply default deny + explicit allow rules
kubectl apply -f k8s/production/network-policy.yaml
# Verify policies (should see 5+ NetworkPolicies)
kubectl get networkpolicies -n gravl-production
```
**Verification:**
- [ ] Default deny ingress in place
- [ ] Backend, frontend, database, monitoring policies visible
### 1.3 Deploy Secrets (Sealed or External)
**Option A: Sealed Secrets** (if kubeseal is deployed)
```bash
# Apply SealedSecret manifests; the in-cluster sealed-secrets controller decrypts them
kubectl apply -f k8s/production/sealed-secrets.yaml
# Verify secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret postgres-secret -n gravl-production
```
**Option B: External Secrets Operator** (if AWS/Vault used)
```bash
# Apply ExternalSecret definitions
kubectl apply -f k8s/production/external-secrets.yaml
# Verify ExternalSecrets synced (should see status: synced)
kubectl get externalsecrets -n gravl-production
kubectl describe externalsecret postgres-secret -n gravl-production
```
**Verification:**
- [ ] postgres-secret contains POSTGRES_PASSWORD
- [ ] app-secret contains JWT_SECRET (non-destructive key check below)
- [ ] registry-pull-secret exists (if private registry used)
- [ ] TLS secret exists (named staging-tls in the manifests; cert-manager will auto-create it if configured)
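To confirm the keys are populated without echoing secret values, check byte counts only:
```bash
# Non-zero counts confirm the keys exist and are non-empty
kubectl get secret postgres-secret -n gravl-production \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d | wc -c
kubectl get secret app-secret -n gravl-production \
  -o jsonpath='{.data.JWT_SECRET}' | base64 -d | wc -c
```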
### 1.4 Deploy cert-manager (if not already on cluster)
```bash
# Install cert-manager (one-time, if needed)
helm repo add jetstack https://charts.jetstack.io && helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true \
--version v1.13.0
# Create ClusterIssuer for Let's Encrypt (production)
kubectl apply -f k8s/production/cert-manager-issuer.yaml
# Verify issuer ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod
```
**Verification:**
- [ ] cert-manager pods running in cert-manager namespace
- [ ] ClusterIssuer status is READY (True)
---
## Phase 2: Database & Storage (T-30 to T-10 minutes)
### 2.1 Deploy PostgreSQL StatefulSet
```bash
# Deploy PostgreSQL to production
kubectl apply -f k8s/production/postgres-statefulset.yaml
# Watch for Pod readiness (should take 30-60 seconds)
kubectl rollout status statefulset/postgres -n gravl-production
# Verify pod is running and ready (2/2 containers)
kubectl get pods -n gravl-production -l component=database
```
**Verification:**
- [ ] Pod status: Running, Ready 2/2
- [ ] PersistentVolumeClaim bound (check below)
- [ ] No errors in pod logs: `kubectl logs postgres-0 -n gravl-production`
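For the PVC item:
```bash
# STATUS should read "Bound" for the postgres data claim
kubectl get pvc -n gravl-production
```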
### 2.2 Run Database Migrations
```bash
# Port-forward to the database so migrations can run from this machine
kubectl port-forward postgres-0 5432:5432 -n gravl-production &
# Run migrations in a separate terminal
cd backend
npm run db:migrate:prod
# Alternatively, if migrations run as a Kubernetes Job, monitor that instead:
kubectl logs -n gravl-production -f job/db-migration
# Kill the port-forward when done
kill %1
```
**Verification:**
- [ ] Migrations completed successfully (or the migration Job succeeded, if one is used)
- [ ] No migration errors in logs
- [ ] Database schema matches expected version (query sketch below)
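A hedged query for the schema-version item, assuming the migration tool records applied migrations in a table (`migrations` is a guess; knex uses `knex_migrations`, node-pg-migrate uses `pgmigrations`):
```bash
# Re-uses the port-forward from above; table name depends on the migration tool
PGPASSWORD=$(kubectl get secret postgres-secret -n gravl-production \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d) \
  psql -h localhost -U gravl_user -d gravl \
  -c "SELECT * FROM migrations ORDER BY 1 DESC LIMIT 5;"
```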
### 2.3 Verify Database Connectivity
```bash
# Create a test pod to verify DB access (psql reads the password from PGPASSWORD,
# pulled here from the postgres-secret created in step 1.3)
POSTGRES_PASSWORD=$(kubectl get secret postgres-secret -n gravl-production \
-o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
kubectl run -it --rm --image=postgres:15 \
--restart=Never \
--env="PGPASSWORD=$POSTGRES_PASSWORD" \
-n gravl-production \
psql-test \
-- psql -h postgres -U gravl_user -d gravl -c "SELECT version();"
# Should return PostgreSQL version
```
**Verification:**
- [ ] Database connection successful
- [ ] PostgreSQL version visible
---
## Phase 3: Deploy Application Services (T-10 to T+20 minutes)
### 3.1 Deploy Backend Deployment
```bash
# Deploy backend service
kubectl apply -f k8s/production/backend-deployment.yaml
# Wait for rollout (typically 2-3 minutes)
kubectl rollout status deployment/backend -n gravl-production
# Verify pods running
kubectl get pods -n gravl-production -l component=backend
```
**Verification:**
- [ ] Pods running and ready (depends on replicas, e.g., 3 replicas = 3/3 ready)
- [ ] No CrashLoopBackOff errors
- [ ] Service endpoint registered: `kubectl get svc backend -n gravl-production`
### 3.2 Deploy Frontend Deployment
```bash
# Deploy frontend service
kubectl apply -f k8s/production/frontend-deployment.yaml
# Wait for rollout
kubectl rollout status deployment/frontend -n gravl-production
# Verify pods
kubectl get pods -n gravl-production -l component=frontend
```
**Verification:**
- [ ] Frontend pods running and ready
- [ ] Service endpoint registered
### 3.3 Apply Ingress with TLS Termination
```bash
# Deploy ingress (cert-manager auto-provisions TLS if the cert-manager.io/cluster-issuer annotation is set)
kubectl apply -f k8s/production/ingress.yaml
# Wait for ingress to get external IP / DNS name (typically 30-60 seconds)
kubectl get ingress -n gravl-production -w
# Check ingress status and TLS certificate
kubectl describe ingress gravl-ingress -n gravl-production
```
**Verification:**
- [ ] Ingress has external IP or DNS name assigned
- [ ] TLS certificate present (cert-manager auto-created if configured)
- [ ] SSL certificate not self-signed (check with OpenSSL):
```bash
echo | openssl s_client -servername gravl.example.com \
-connect $(kubectl get ingress gravl-ingress -n gravl-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):443 2>/dev/null \
| openssl x509 -noout -subject -issuer
```
---
## Phase 4: Service Integration Verification (T+20 to T+40 minutes)
### 4.1 Test Service-to-Service Communication
```bash
# Exec into backend pod to test database connection
BACKEND_POD=$(kubectl get pod -n gravl-production -l component=backend -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $BACKEND_POD -n gravl-production -- \
curl http://postgres:5432 -v 2>&1 | head -5
# Expected: Some indication that postgres port is responding (or timeout), not "connection refused"
```
**Verification:**
- [ ] Backend can reach database (a timeout is acceptable, "connection refused" is not; alternative TCP probe below)
- [ ] Backend logs show no database errors: `kubectl logs $BACKEND_POD -n gravl-production | grep -i error | head -10`
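If curl's output is ambiguous, a plain TCP probe distinguishes an open port from a refused connection without extra tooling (assumes the backend image ships bash):
```bash
kubectl exec -it $BACKEND_POD -n gravl-production -- \
  bash -c 'timeout 3 bash -c "exec 3<>/dev/tcp/postgres/5432" && echo "port open" || echo "closed or filtered"'
```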
### 4.2 Health Check Endpoint
```bash
# Get backend service IP
BACKEND_SVC=$(kubectl get svc backend -n gravl-production -o jsonpath='{.spec.clusterIP}')
# Test health endpoint (from another pod)
kubectl run -it --rm --image=curlimages/curl \
--restart=Never \
-n gravl-production \
curl-test \
-- curl http://$BACKEND_SVC:3000/health
# Expected response: {"status":"ok"} or similar
```
**Verification:**
- [ ] Health endpoint responds (HTTP 200)
- [ ] No error messages in response
### 4.3 External Endpoint Test (via Ingress)
```bash
# Wait for DNS propagation (if using DNS name, not IP)
# Then test external access
curl -k https://gravl.example.com/api/health
# Expected: HTTP 200 with health status
```
**Verification:**
- [ ] HTTPS responds (the -k flag is acceptable while cert-manager is still provisioning the certificate)
- [ ] Backend responds through ingress
---
## Phase 5: Monitoring & Alerting Setup (T+40 to T+60 minutes)
### 5.1 Verify Prometheus Scraping
```bash
# Check Prometheus targets (should show gravl-production scrape configs)
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &
# Open http://localhost:9090/targets in browser
# Verify all gravl-production targets are "UP"
kill %1
```
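For a scriptable alternative to the browser check, the Prometheus HTTP API exposes the same target health (assumes `jq` is installed locally):
```bash
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &
sleep 3
# Prints any target that is not "up"; empty output means all targets are healthy
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.health != "up") | "\(.labels.job): \(.lastError)"'
kill %1
```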
**Verification:**
- [ ] All production targets showing as UP
- [ ] No "DOWN" endpoints
### 5.2 Verify Grafana Dashboards
```bash
# Access Grafana
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &
# Open http://localhost:3000
# Login with default credentials (or stored secret)
# Navigate to Gravl dashboards
# Verify graphs showing production metrics
kill %1
```
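Grafana's unauthenticated health endpoint adds a scriptable sanity check alongside the visual review:
```bash
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &
sleep 3
# Expects {"database":"ok",...}; no login required for this endpoint
curl -s http://localhost:3000/api/health
kill %1
```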
**Verification:**
- [ ] Gravl dashboards visible
- [ ] Metrics flowing (not empty graphs)
- [ ] CPU, memory, request rate graphs showing data
### 5.3 Verify AlertManager
```bash
# Check AlertManager configuration (should have production severity levels)
kubectl get alertmanagerconfig -n gravl-monitoring
kubectl describe alertmanagerconfig -n gravl-monitoring
```
**Verification:**
- [ ] Alerts configured for production thresholds
- [ ] Notification channels (Slack, PagerDuty, etc.) configured
### 5.4 Test Alert Trigger
```bash
# Send test alert through AlertManager
kubectl exec -it -n gravl-monitoring alertmanager-0 -- \
amtool alert add test_alert severity=info --alertmanager.url=http://localhost:9093
# Check Slack / notification channel for alert (should arrive within 1 minute)
```
**Verification:**
- [ ] Test alert received in notification channel
- [ ] Alert formatting correct
- [ ] No excessive duplicate alerts
---
## Phase 6: Load Test & Baseline (T+60 to T+90 minutes)
### 6.1 Run Load Test on Production (Low Traffic)
```bash
# Generate light load using k6 or Apache Bench
k6 run --vus 10 --duration 5m k8s/production/load-test.js
# Expected results:
# - p95 latency: <200ms
# - Throughput: >100 req/s
# - Error rate: <0.1%
```
**Verification:**
- [ ] p95 latency <200ms
- [ ] Error rate <0.1%
- [ ] No pod restarts during test (restart-count check below)
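For the restart-count item:
```bash
# Every value in the RESTARTS column should be 0
kubectl get pods -n gravl-production \
  -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount'
```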
### 6.2 Baseline Metrics Captured
```bash
# Log current metrics for baseline
kubectl top nodes > /tmp/baseline-nodes.txt
kubectl top pods -n gravl-production > /tmp/baseline-pods.txt
# Store for comparison (alert if exceeds 2x baseline)
```
**Verification:**
- [ ] Node CPU/Memory usage within expected range
- [ ] Pod CPU/Memory usage within resource requests
---
## Phase 7: Production Sign-Off (T+90 minutes)
### 7.1 Final Checklist
- [ ] All pre-flight checks passed
- [ ] Database healthy and migrated
- [ ] All services running and ready
- [ ] Ingress responding (TLS valid)
- [ ] Health checks passing
- [ ] Monitoring metrics flowing
- [ ] Alerts functional
- [ ] Load test passed
- [ ] Team lead review: ✅ READY TO GO LIVE
### 7.2 Change Log Entry
```bash
# Log deployment to version control (written inside the repo so git can stage it;
# the deploy-logs/ path is illustrative, adjust to your repo's convention)
mkdir -p deploy-logs
cat > deploy-logs/PRODUCTION_DEPLOY-2026-03-06.log << 'DEPLOY_LOG'
---
date: 2026-03-06
time: ~09:30 UTC
environment: production
namespace: gravl-production
services:
- backend: v1.x.x
- frontend: v1.x.x
- postgres: 15.x
- ingress: nginx
- certificates: cert-manager (Let's Encrypt)
pre_flight_status: ✅ PASSED
security_review: ✅ APPROVED
monitoring_status: ✅ OPERATIONAL
load_test_result: ✅ PASSED
sign_off_by: [DevOps Lead]
DEPLOY_LOG
git add deploy-logs/PRODUCTION_DEPLOY-2026-03-06.log
git commit -m "Production deployment log - 2026-03-06"
```
### 7.3 Notify Team
- [ ] Send deployment completion notice to Slack #gravl-announce
```
🚀 **Gravl Production Deployment COMPLETE**
- Timestamp: 2026-03-06 09:30 UTC
- All systems operational
- Monitoring dashboards: [link]
- Status page: [link]
```
- [ ] Update status page (if external-facing)
- [ ] Notify stakeholders (product, marketing)
---
## Rollback Decision Tree
**If at any point a critical failure occurs:**
1. Do NOT proceed
2. Trigger ROLLBACK.md procedure (minimal sketch after the indicator list below)
3. Investigate root cause post-incident (blameless postmortem)
**Critical Failure Indicators:**
- Database connection failures after 3 retries
- More than 2 pod crashes during rollout
- Ingress TLS certificate invalid
- Health checks failing on all pods
- Alerts firing for production thresholds
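A minimal application-rollback sketch; ROLLBACK.md remains the authoritative procedure, and note that `kubectl rollout undo` does not revert database schema changes:
```bash
# Revert backend and frontend to their previous ReplicaSets
kubectl rollout undo deployment/backend -n gravl-production
kubectl rollout undo deployment/frontend -n gravl-production
# Confirm the rollback completed
kubectl rollout status deployment/backend -n gravl-production
kubectl rollout status deployment/frontend -n gravl-production
```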
---
## Post-Deployment (T+120 minutes and beyond)
### 7.4 Sustained Monitoring Window (Next 24 hours)
- [ ] Assign on-call rotation (24h monitoring)
- [ ] Set up escalation policy (alert → on-call → incident lead)
- [ ] Daily review of logs and metrics for first week
- [ ] Customer feedback monitoring (support tickets, user reports)
### 7.5 Post-Deployment Review (24 hours)
- [ ] Team retrospective (what went well, what to improve)
- [ ] Update runbooks based on findings
- [ ] Document any manual interventions for automation
- [ ] Plan optimization and hardening work for next phase
---
**Document Version:** 1.0
**Last Updated:** 2026-03-06 08:50
**Next Update:** After first production deployment attempt