# Rollback Procedure — Phase 10-07, Task 5

**Date:** 2026-03-06
**Status:** DRAFT (TO BE TESTED)
**Owner:** DevOps / On-Call Lead
**Target RTO (Recovery Time Objective):** <15 minutes
**Target RPO (Recovery Point Objective):** <5 minutes

---

## Overview

This document defines how to roll back Gravl from production if a critical failure is discovered post-deployment.

**When to Rollback:**

- Database migration failures (data integrity at risk)
- More than 2 pods in CrashLoopBackOff
- Ingress / networking down (service unavailable)
- Security breach or incident requiring immediate action
- Customer-facing API errors (>5% error rate for >5 minutes)

**When NOT to Rollback:**

- Single pod restart (normal Kubernetes behavior)
- Slow response times but no errors (<5% error rate)
- DNS delays (these usually resolve on their own)
- Single replica pod failure (covered by the HA setup)
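The ">5% error rate for >5 minutes" threshold above can be encoded as a quick helper for the on-call engineer. This is a minimal sketch — the function name and the integer arithmetic are illustrative, not part of any existing Gravl tooling:

```shell
#!/bin/sh
# Hypothetical helper: decide rollback from error counts over a window.
# Usage: should_rollback ERRORS TOTAL_REQUESTS WINDOW_MINUTES
# Returns 0 (rollback) when error rate > 5% sustained for >= 5 minutes.
should_rollback() {
  errors=$1; total=$2; minutes=$3
  [ "$total" -gt 0 ] || return 1            # no traffic, no verdict
  rate_tenths=$(( errors * 1000 / total ))  # rate in tenths of a percent
  [ "$rate_tenths" -gt 50 ] && [ "$minutes" -ge 5 ]
}

should_rollback 80 1000 6 && echo "ROLLBACK"   # 8% for 6 min -> rollback
should_rollback 30 1000 6 || echo "hold"       # 3% -> fix in place
```

Integer tenths-of-a-percent avoid floating point in POSIX sh; feed it counts from your metrics backend.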

---

## Pre-Requisites for Rollback

**Before deploying to production, ensure:**

1. **Previous version image tag is known:**

   ```bash
   # Save these BEFORE deploying the new version
   BACKEND_PREVIOUS_IMAGE=gravl-backend:v1.2.3
   FRONTEND_PREVIOUS_IMAGE=gravl-frontend:v1.2.3
   POSTGRES_PREVIOUS_VERSION=15.2
   ```

2. **Database backup exists (automated or manual):**

   ```bash
   # Verify the backup job ran before deployment
   kubectl logs -n gravl-monitoring job/backup-job | tail -20
   ```

3. **Kubernetes YAML configs for the previous version are available:**
   - k8s/production/backend-deployment.yaml (v1.2.3)
   - k8s/production/frontend-deployment.yaml (v1.2.3)
   - Database initialization scripts (v1.2.3)

4. **Monitoring & alerting configured** (to detect failures)
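Capturing the previous image tag (step 1) can be scripted so it is never forgotten. A sketch, assuming the helpers below (which are illustrative, not existing scripts); the kubectl query is shown in a comment and replaced by a stand-in value so the snippet runs anywhere:

```shell
#!/bin/sh
# Hypothetical helpers: split an image reference like gravl-backend:v1.2.3
# into repository and tag, so the previous tag can be recorded pre-deploy.
image_repo() { printf '%s\n' "${1%:*}"; }
image_tag()  { printf '%s\n' "${1##*:}"; }

# In a real pre-deploy step, read the currently running image:
# CURRENT=$(kubectl get deploy backend -n gravl-production \
#   -o jsonpath='{.spec.template.spec.containers[0].image}')
CURRENT=gravl-backend:v1.2.3   # stand-in for the kubectl output
echo "BACKEND_PREVIOUS_IMAGE=$CURRENT"   # record this line somewhere durable
echo "previous tag: $(image_tag "$CURRENT")"
```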

---

## Decision: Is This a Rollback Situation?

Ask yourself:

1. **Is data integrity at risk?**
   - Database corruption or migration failure → YES, rollback
   - Lost data → YES, rollback (then restore from backup)

2. **Is the service unavailable to users?**
   - All pods crashed → YES, rollback
   - Some pods crashing but service still partially up → WAIT 2 minutes; probably not a rollback yet
   - Users seeing errors → CHECK ERROR RATE; if >5%, rollback

3. **Can we fix it without rolling back?**
   - Restart pods → try this first
   - Scale up replicas → try this first
   - DNS issue → fix DNS, don't rollback
   - Config issue (secrets, env vars) → fix config, restart pods, don't rollback

4. **Do we have a known-good previous version?**
   - If no recent backup or previous version is available → DON'T rollback (call in an expert)

---

## Incident Response Checklist (Before Rollback)

Do these in parallel while deciding on rollback:

- [ ] **ALERT:** Page the on-call engineer + incident lead to the bridge
- [ ] **COMMUNICATE:** Slack #gravl-incident: "Investigating production issue"
- [ ] **ASSESS:** Check logs, dashboards, alerts

  ```bash
  kubectl logs -n gravl-production -l component=backend --tail=100 | grep -i error
  kubectl get events -n gravl-production --sort-by='.lastTimestamp'
  ```

- [ ] **DECIDE:** Rollback or fix-in-place? (30-second decision)
- [ ] **NOTIFY:** If rolling back, notify stakeholders immediately
- [ ] **EXECUTE:** Rollback procedure (15 minutes)
- [ ] **VERIFY:** Post-rollback health checks (5 minutes)
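The ASSESS step can be partially mechanized by counting error lines rather than eyeballing them. A sketch — `error_count` is a hypothetical helper, and a canned sample stands in for live `kubectl logs` output so the snippet is self-contained:

```shell
#!/bin/sh
# Hypothetical helper: count error lines on stdin. Real use:
#   kubectl logs -n gravl-production -l component=backend --tail=100 | error_count
error_count() { grep -ci 'error' || true; }   # || true: zero matches is not a failure

# Inline sample standing in for live log output:
COUNT=$(printf '%s\n' \
  'INFO  request served in 12ms' \
  'ERROR Database connection refused' \
  'error: config not found' | error_count)
echo "errors in sample: $COUNT"
```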

---

## Rollback Scenarios

### Scenario 1: Pod Crash After Deployment (Most Common)

**Symptoms:**
- Backend pods in CrashLoopBackOff
- Errors in logs such as "Database connection refused" or "Config not found"

**Rollback Steps:**

```bash
# 1. Alert team
# (already in progress from the decision step above)

# 2. Scale down the failing deployment to stop the restarts
kubectl scale deployment backend --replicas=0 -n gravl-production

# 3. Revert to the previous image version
kubectl set image deployment/backend \
  backend=gravl-backend:v1.2.3 \
  -n gravl-production

# 4. Scale back up
kubectl scale deployment backend --replicas=3 -n gravl-production

# 5. Monitor the rollout
kubectl rollout status deployment/backend -n gravl-production

# 6. Verify pods are running
kubectl get pods -n gravl-production -l component=backend
```

**Expected Timeline:**
- 0-1 min: Scale down (restarts stop)
- 1-2 min: Image pull + container start
- 2-3 min: Pod ready + health check pass
- 3-5 min: Full rollout complete

**Verification:**
- [ ] All backend pods running and ready
- [ ] No error messages in pod logs
- [ ] Health check endpoint responds
- [ ] Service latency returning to normal
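Steps 2-5 above can be wrapped in a single helper that prints the exact command sequence before anyone runs it — useful for a second pair of eyes during an incident. A dry-run sketch; `plan_rollback` is hypothetical, not an existing script:

```shell
#!/bin/sh
# Hypothetical wrapper: print the image-revert sequence for review.
# Usage: plan_rollback NAMESPACE DEPLOYMENT IMAGE REPLICAS
plan_rollback() {
  ns=$1; deploy=$2; image=$3; replicas=$4
  cat <<EOF
kubectl scale deployment $deploy --replicas=0 -n $ns
kubectl set image deployment/$deploy $deploy=$image -n $ns
kubectl scale deployment $deploy --replicas=$replicas -n $ns
kubectl rollout status deployment/$deploy -n $ns
EOF
}

plan_rollback gravl-production backend gravl-backend:v1.2.3 3
```

Pipe the output to `sh` only after the incident lead has read it.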

---

### Scenario 2: Database Migration Failure

**Symptoms:**
- Backend pods stuck in Init (waiting for the migration)
- Error in logs: "Migration failed: duplicate key value"
- Database migration job failed

**Rollback Steps:**

```bash
# 1. STOP ALL BACKEND PODS (prevent further schema changes)
kubectl scale deployment backend --replicas=0 -n gravl-production

# 2. CHECK DATABASE STATUS
kubectl exec -it postgres-0 -n gravl-production -- \
  psql -U gravl_user -d gravl -c "SELECT version();"

# 3. RESTORE FROM BACKUP (if the schema is corrupted)
# This depends on your backup system (e.g., AWS RDS snapshots, Velero, pg_dump)

# Example: AWS RDS snapshot restore
# aws rds restore-db-instance-from-db-snapshot \
#   --db-instance-identifier gravl-production-restored \
#   --db-snapshot-identifier gravl-prod-snapshot-2026-03-06-09-00

# Example: pg_dump restore
# kubectl exec -it postgres-0 -- \
#   psql -U gravl_user -d gravl < /backup/gravl-schema-v1.2.3.sql

# 4. ROLL BACK THE DEPLOYMENT TO THE PREVIOUS VERSION
kubectl set image deployment/backend \
  backend=gravl-backend:v1.2.3 \
  -n gravl-production

# 5. RESTART THE MIGRATION JOB WITH THE PREVIOUS VERSION
# (assumes the migration job uses the image tag from the deployment)
kubectl delete job db-migration -n gravl-production
kubectl apply -f k8s/production/db-migration-job.yaml

# Monitor the migration
kubectl logs -f job/db-migration -n gravl-production

# 6. SCALE UP THE BACKEND WHEN THE MIGRATION SUCCEEDS
kubectl scale deployment backend --replicas=3 -n gravl-production
```

**Expected Timeline:**
- 0-1 min: Scale down + stop pods
- 1-5 min: Database restore (varies by snapshot size; could be 5-30 min)
- 5-10 min: Migration rollback
- 10-15 min: Scale up and stabilize

**Verification:**
- [ ] Database restoration successful (check row counts in critical tables)
- [ ] Migration job completed without errors
- [ ] Backend pods running and connected to the database
- [ ] Health checks passing

---

### Scenario 3: Ingress / Network Failure

**Symptoms:**
- External users cannot reach the API
- Ingress status shows no endpoints
- Backend pods running but no traffic reaching them

**Rollback Steps:**

```bash
# 1. Check ingress status
kubectl describe ingress gravl-ingress -n gravl-production

# 2. Check service endpoints
kubectl get endpoints -n gravl-production

# 3. If the TLS cert is the issue, revert to the previous cert
kubectl delete secret staging-tls -n gravl-production
kubectl create secret tls staging-tls \
  --cert=path/to/previous-cert.crt \
  --key=path/to/previous-key.key \
  -n gravl-production

# 4. If the ingress config is broken, revert to the previous version
kubectl apply -f k8s/production/ingress-v1.2.3.yaml --force

# 5. Verify the ingress is up
kubectl get ingress -n gravl-production -w
```

**Expected Timeline:**
- 0-1 min: Diagnose issue
- 1-2 min: Revert ingress or cert
- 2-3 min: DNS propagation (if needed)

**Verification:**
- [ ] Ingress has a valid IP / DNS
- [ ] TLS certificate valid: `echo | openssl s_client -servername gravl.example.com -connect <ingress-ip>:443 2>/dev/null | grep Subject`
- [ ] Health endpoint responds via HTTPS

---

### Scenario 4: Secrets / Configuration Issue

**Symptoms:**
- Backend pods running but logs show "secret not found" or "env var missing"
- Service starts but crashes immediately on the first request

**Rollback Steps:**

```bash
# 1. Check that the secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret app-secret -n gravl-production

# 2. If secrets are missing, restore from the sealed-secrets backup
kubectl apply -f k8s/production/sealed-secrets.yaml

# 3. OR, if using the External Secrets Operator, force a sync of the secret
kubectl annotate externalsecret app-secret \
  externalsecrets.external-secrets.io/force-sync=true \
  --overwrite -n gravl-production

# 4. Restart pods to pick up the secrets
kubectl rollout restart deployment/backend -n gravl-production

# 5. Monitor
kubectl rollout status deployment/backend -n gravl-production
```

**Expected Timeline:**
- 0-1 min: Detect missing secrets
- 1-2 min: Restore secrets
- 2-4 min: Pod restart + readiness

**Verification:**
- [ ] Secrets present: `kubectl get secrets -n gravl-production`
- [ ] Pods restarted and healthy
- [ ] No "secret not found" errors in logs

---

## Full Rollback (Nuclear Option)

**Use only if the scenarios above don't apply or don't resolve the issue.**

```bash
# 1. STOP ALL GRAVL SERVICES
kubectl scale deployment backend --replicas=0 -n gravl-production
kubectl scale deployment frontend --replicas=0 -n gravl-production

# 2. VERIFY THE DATABASE IS SAFE (CHECK BACKUP)
# Don't delete anything yet!

# 3. DELETE THE PRODUCTION NAMESPACE (CAREFUL!)
# kubectl delete namespace gravl-production
# (Only if you have an offsite backup and are 100% sure)

# 4. RESTORE FROM BACKUP
# This depends on your backup solution:

# Option A: Velero (cluster-wide backup)
# velero restore create --from-backup gravl-prod-2026-03-06-08-00

# Option B: Manual restore (infrastructure as code)
# kubectl apply -f k8s/production/namespace.yaml
# kubectl apply -f k8s/production/rbac.yaml
# kubectl apply -f k8s/production/secrets.yaml
# kubectl apply -f k8s/production/statefulsets.yaml
# ... (all resources for v1.2.3)

# 5. RESTORE THE DATABASE FROM BACKUP
# aws rds restore-db-instance-from-db-snapshot ...
# OR restore from a pg_dump / backup file

# 6. VERIFY EVERYTHING
kubectl get all -n gravl-production
kubectl logs -n gravl-production -l component=backend | grep -i error | head -10
```

**Expected Timeline:** 15-60 minutes (depending on backup size and complexity)

---

## Post-Rollback Actions

### 1. Verify Service Health (5 minutes)

```bash
# Check all endpoints
curl https://gravl.example.com/api/health

# Verify dashboards
# (Log in to Grafana, ensure metrics are flowing)

# Check alert status
# (There should be no firing alerts related to the rollback)
```

### 2. Communicate Status (Immediately)

```bash
# Slack #gravl-incident
# "✅ Rollback complete. Service restored to v1.2.3. RCA scheduled for [tomorrow]"

# Update the status page (if external-facing)
# "Production: Operational (rolled back to previous version)"
```

### 3. Root Cause Analysis (Within 24 hours)

- [ ] What went wrong in v1.3.0?
- [ ] Why didn't we catch this in staging?
- [ ] How do we prevent this in the future?
- [ ] Blameless postmortem (focus on process, not people)

### 4. Fix & Re-deploy (Next 24-72 hours)

- [ ] Fix the issue
- [ ] Test thoroughly in staging
- [ ] Peer review of changes
- [ ] Plan the new deployment (with team consensus)
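The health verification is better done as a retry loop than a single curl. A generic sketch — `wait_healthy` is a hypothetical helper, and the demo runs it against `true`/`false` so the snippet works without network access; the real invocation is shown in a comment:

```shell
#!/bin/sh
# Hypothetical poller: retry a health command until it succeeds or give up.
# Usage: wait_healthy RETRIES SLEEP_SECONDS CMD [ARGS...]
wait_healthy() {
  retries=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$retries" ]; do
    "$@" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Real use: wait_healthy 10 30 curl -fsS https://gravl.example.com/api/health
wait_healthy 3 0 true  && echo "healthy"
wait_healthy 2 0 false || echo "still down"
```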

---

## Rollback Checklist (Keep in the Cockpit During an Incident)

```
INCIDENT RESPONSE
[ ] Page on-call engineer
[ ] Slack alert to #gravl-incident
[ ] Check monitoring dashboard
[ ] Review error logs
[ ] Assess: Fix-in-place or rollback?

IF ROLLBACK:
[ ] Identify previous version (backend, frontend, database)
[ ] Verify backup exists and is recent
[ ] Alert team: "Rolling back to vX.Y.Z"
[ ] Execute rollback (see scenarios above)
[ ] Monitor rollout (every 30 seconds)
[ ] Health checks passing? (API, DB, ingress)
[ ] External test (curl the health endpoint)
[ ] Metrics returning to normal?

POST-ROLLBACK
[ ] Slack: Service status update
[ ] Update status page (if applicable)
[ ] Create incident ticket for RCA
[ ] Schedule postmortem for tomorrow
[ ] Document what happened + what to improve
```

---

## Automation & Testing

### Rollback Drill (Monthly)

Test the rollback procedure in staging without actually rolling back production:

1. Deploy the new version to staging
2. Follow the rollback steps (but against the staging namespace)
3. Verify it works
4. Document any issues found
5. Update this runbook

### Backup Verification (Weekly)

Ensure backups are recent and restorable:

1. Check the last backup timestamp
2. Test a restore to staging from the backup
3. Verify data integrity
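Step 1 of the backup verification (recency) can be checked mechanically against the 5-minute RPO stated at the top of this runbook. A sketch using epoch seconds — `backup_fresh` is a hypothetical helper, and the "last backup" timestamp is simulated here:

```shell
#!/bin/sh
# Hypothetical check: is the last backup recent enough for the RPO?
# Usage: backup_fresh LAST_BACKUP_EPOCH NOW_EPOCH MAX_AGE_SECONDS
backup_fresh() {
  [ $(( $2 - $1 )) -le "$3" ]
}

NOW=$(date +%s)
# In practice LAST would come from backup metadata (e.g. the backup job's
# completion time); here it is simulated as 2 minutes ago.
LAST=$(( NOW - 120 ))
if backup_fresh "$LAST" "$NOW" 300; then
  echo "backup within RPO (<5 min old)"
else
  echo "backup too old -- do NOT proceed without a fresh one"
fi
```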

---

## Support & Escalation

**If you're unsure about rollback:**
1. Page a senior engineer (don't hesitate)
2. Isolate the problem (stop creating new pods; scale to 0)
3. Preserve logs (don't delete anything until the RCA is done)
4. Get expert help before rolling back

**Post-Incident Contacts:**
- Incident lead: [NAME/SLACK]
- On-call manager: [NAME/SLACK]
- Database expert: [NAME/SLACK]

---

**Document Version:** 1.0
**Last Updated:** 2026-03-06 08:50
**Next Review:** After the first production rollback or after 30 days (whichever comes first)