# Rollback Procedure — Phase 10-07, Task 5

**Date:** 2026-03-06
**Status:** DRAFT (TO BE TESTED)
**Owner:** DevOps / On-Call Lead
**Target RTO (Recovery Time Objective):** <15 minutes
**Target RPO (Recovery Point Objective):** <5 minutes

---

## Overview

This document defines how to roll back Gravl from production if a critical failure is discovered post-deployment.

**When to Rollback:**

- Database migration failures (data integrity at risk)
- More than 2 pods in CrashLoopBackOff
- Ingress / networking down (service unavailable)
- Security breach or incident requiring immediate action
- Customer-facing API errors (>5% error rate for >5 minutes)

**When NOT to Rollback:**

- Single pod restart (normal Kubernetes behavior)
- Slow response times but no errors (<5% error rate)
- DNS delays (these usually resolve on their own)
- Single replica pod failure (covered by the HA setup)
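The ">5% error rate for >5 minutes" threshold above can be encoded as a quick helper for the on-call engineer. This is a minimal sketch — the function name and the integer arithmetic are illustrative, not part of any existing Gravl tooling:

```shell
#!/bin/sh
# Hypothetical helper: decide rollback from error counts over a window.
# Usage: should_rollback ERRORS TOTAL_REQUESTS WINDOW_MINUTES
# Returns 0 (rollback) when error rate > 5% sustained for >= 5 minutes.
should_rollback() {
  errors=$1; total=$2; minutes=$3
  [ "$total" -gt 0 ] || return 1            # no traffic, no verdict
  rate_tenths=$(( errors * 1000 / total ))  # rate in tenths of a percent
  [ "$rate_tenths" -gt 50 ] && [ "$minutes" -ge 5 ]
}

should_rollback 80 1000 6 && echo "ROLLBACK"   # 8% for 6 min -> rollback
should_rollback 30 1000 6 || echo "hold"       # 3% -> fix in place
```

Integer tenths-of-a-percent avoid floating point in POSIX sh; feed it counts from your metrics backend.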

---

## Pre-Requisites for Rollback

**Before deploying to production, ensure:**

1. **Previous version image tag is known:**

   ```bash
   # Save these BEFORE deploying the new version
   BACKEND_PREVIOUS_IMAGE=gravl-backend:v1.2.3
   FRONTEND_PREVIOUS_IMAGE=gravl-frontend:v1.2.3
   POSTGRES_PREVIOUS_VERSION=15.2
   ```

2. **Database backup exists (automated or manual):**

   ```bash
   # Verify the backup job ran before deployment
   kubectl logs -n gravl-monitoring job/backup-job | tail -20
   ```

3. **Kubernetes YAML configs for the previous version are available:**
   - k8s/production/backend-deployment.yaml (v1.2.3)
   - k8s/production/frontend-deployment.yaml (v1.2.3)
   - Database initialization scripts (v1.2.3)

4. **Monitoring & alerting configured** (to detect failures)
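Capturing the previous image tag (step 1) can be scripted so it is never forgotten. A sketch, assuming the helpers below (which are illustrative, not existing scripts); the kubectl query is shown in a comment and replaced by a stand-in value so the snippet runs anywhere:

```shell
#!/bin/sh
# Hypothetical helpers: split an image reference like gravl-backend:v1.2.3
# into repository and tag, so the previous tag can be recorded pre-deploy.
image_repo() { printf '%s\n' "${1%:*}"; }
image_tag()  { printf '%s\n' "${1##*:}"; }

# In a real pre-deploy step, read the currently running image:
# CURRENT=$(kubectl get deploy backend -n gravl-production \
#   -o jsonpath='{.spec.template.spec.containers[0].image}')
CURRENT=gravl-backend:v1.2.3   # stand-in for the kubectl output
echo "BACKEND_PREVIOUS_IMAGE=$CURRENT"   # record this line somewhere durable
echo "previous tag: $(image_tag "$CURRENT")"
```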

---

## Decision: Is This a Rollback Situation?

Ask yourself:

1. **Is data integrity at risk?**
   - Database corruption or migration failure → YES, rollback
   - Lost data → YES, rollback (then restore from backup)

2. **Is the service unavailable to users?**
   - All pods crashed → YES, rollback
   - Some pods crashing but service still partially up → WAIT 2 minutes; probably not a rollback yet
   - Users seeing errors → CHECK ERROR RATE; if >5%, rollback

3. **Can we fix it without rolling back?**
   - Restart pods → try this first
   - Scale up replicas → try this first
   - DNS issue → fix DNS, don't rollback
   - Config issue (secrets, env vars) → fix config, restart pods, don't rollback

4. **Do we have a known-good previous version?**
   - If no recent backup or previous version is available → DON'T rollback (call in an expert)

---

## Incident Response Checklist (Before Rollback)

Do these in parallel while deciding on rollback:

- [ ] **ALERT:** Page the on-call engineer + incident lead to the bridge
- [ ] **COMMUNICATE:** Slack #gravl-incident: "Investigating production issue"
- [ ] **ASSESS:** Check logs, dashboards, alerts

  ```bash
  kubectl logs -n gravl-production -l component=backend --tail=100 | grep -i error
  kubectl get events -n gravl-production --sort-by='.lastTimestamp'
  ```

- [ ] **DECIDE:** Rollback or fix-in-place? (30-second decision)
- [ ] **NOTIFY:** If rolling back, notify stakeholders immediately
- [ ] **EXECUTE:** Rollback procedure (15 minutes)
- [ ] **VERIFY:** Post-rollback health checks (5 minutes)
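The ASSESS step can be partially mechanized by counting error lines rather than eyeballing them. A sketch — `error_count` is a hypothetical helper, and a canned sample stands in for live `kubectl logs` output so the snippet is self-contained:

```shell
#!/bin/sh
# Hypothetical helper: count error lines on stdin. Real use:
#   kubectl logs -n gravl-production -l component=backend --tail=100 | error_count
error_count() { grep -ci 'error' || true; }   # || true: zero matches is not a failure

# Inline sample standing in for live log output:
COUNT=$(printf '%s\n' \
  'INFO  request served in 12ms' \
  'ERROR Database connection refused' \
  'error: config not found' | error_count)
echo "errors in sample: $COUNT"
```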

---

## Rollback Scenarios

### Scenario 1: Pod Crash After Deployment (Most Common)

**Symptoms:**
- Backend pods in CrashLoopBackOff
- Errors in logs such as "Database connection refused" or "Config not found"

**Rollback Steps:**

```bash
# 1. Alert team
# (already in progress from the decision step above)

# 2. Scale down the failing deployment to stop the restarts
kubectl scale deployment backend --replicas=0 -n gravl-production

# 3. Revert to the previous image version
kubectl set image deployment/backend \
  backend=gravl-backend:v1.2.3 \
  -n gravl-production

# 4. Scale back up
kubectl scale deployment backend --replicas=3 -n gravl-production

# 5. Monitor the rollout
kubectl rollout status deployment/backend -n gravl-production

# 6. Verify pods are running
kubectl get pods -n gravl-production -l component=backend
```

**Expected Timeline:**
- 0-1 min: Scale down (restarts stop)
- 1-2 min: Image pull + container start
- 2-3 min: Pod ready + health check pass
- 3-5 min: Full rollout complete

**Verification:**
- [ ] All backend pods running and ready
- [ ] No error messages in pod logs
- [ ] Health check endpoint responds
- [ ] Service latency returning to normal
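Steps 2-5 above can be wrapped in a single helper that prints the exact command sequence before anyone runs it — useful for a second pair of eyes during an incident. A dry-run sketch; `plan_rollback` is hypothetical, not an existing script:

```shell
#!/bin/sh
# Hypothetical wrapper: print the image-revert sequence for review.
# Usage: plan_rollback NAMESPACE DEPLOYMENT IMAGE REPLICAS
plan_rollback() {
  ns=$1; deploy=$2; image=$3; replicas=$4
  cat <<EOF
kubectl scale deployment $deploy --replicas=0 -n $ns
kubectl set image deployment/$deploy $deploy=$image -n $ns
kubectl scale deployment $deploy --replicas=$replicas -n $ns
kubectl rollout status deployment/$deploy -n $ns
EOF
}

plan_rollback gravl-production backend gravl-backend:v1.2.3 3
```

Pipe the output to `sh` only after the incident lead has read it.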

---

### Scenario 2: Database Migration Failure

**Symptoms:**
- Backend pods stuck in Init (waiting for the migration)
- Error in logs: "Migration failed: duplicate key value"
- Database migration job failed

**Rollback Steps:**

```bash
# 1. STOP ALL BACKEND PODS (prevent further schema changes)
kubectl scale deployment backend --replicas=0 -n gravl-production

# 2. CHECK DATABASE STATUS
kubectl exec -it postgres-0 -n gravl-production -- \
  psql -U gravl_user -d gravl -c "SELECT version();"

# 3. RESTORE FROM BACKUP (if the schema is corrupted)
# This depends on your backup system (e.g., AWS RDS snapshots, Velero, pg_dump)

# Example: AWS RDS snapshot restore
# aws rds restore-db-instance-from-db-snapshot \
#   --db-instance-identifier gravl-production-restored \
#   --db-snapshot-identifier gravl-prod-snapshot-2026-03-06-09-00

# Example: pg_dump restore
# kubectl exec -it postgres-0 -- \
#   psql -U gravl_user -d gravl < /backup/gravl-schema-v1.2.3.sql

# 4. ROLL BACK THE DEPLOYMENT TO THE PREVIOUS VERSION
kubectl set image deployment/backend \
  backend=gravl-backend:v1.2.3 \
  -n gravl-production

# 5. RESTART THE MIGRATION JOB WITH THE PREVIOUS VERSION
# (assumes the migration job uses the image tag from the deployment)
kubectl delete job db-migration -n gravl-production
kubectl apply -f k8s/production/db-migration-job.yaml

# Monitor the migration
kubectl logs -f job/db-migration -n gravl-production

# 6. SCALE UP THE BACKEND WHEN THE MIGRATION SUCCEEDS
kubectl scale deployment backend --replicas=3 -n gravl-production
```

**Expected Timeline:**
- 0-1 min: Scale down + stop pods
- 1-5 min: Database restore (varies by snapshot size; could be 5-30 min)
- 5-10 min: Migration rollback
- 10-15 min: Scale up and stabilize

**Verification:**
- [ ] Database restoration successful (check row counts in critical tables)
- [ ] Migration job completed without errors
- [ ] Backend pods running and connected to the database
- [ ] Health checks passing

---

### Scenario 3: Ingress / Network Failure

**Symptoms:**
- External users cannot reach the API
- Ingress status shows no endpoints
- Backend pods running but no traffic reaching them

**Rollback Steps:**

```bash
# 1. Check ingress status
kubectl describe ingress gravl-ingress -n gravl-production

# 2. Check service endpoints
kubectl get endpoints -n gravl-production

# 3. If the TLS cert is the issue, revert to the previous cert
kubectl delete secret staging-tls -n gravl-production
kubectl create secret tls staging-tls \
  --cert=path/to/previous-cert.crt \
  --key=path/to/previous-key.key \
  -n gravl-production

# 4. If the ingress config is broken, revert to the previous version
kubectl apply -f k8s/production/ingress-v1.2.3.yaml --force

# 5. Verify the ingress is up
kubectl get ingress -n gravl-production -w
```

**Expected Timeline:**
- 0-1 min: Diagnose issue
- 1-2 min: Revert ingress or cert
- 2-3 min: DNS propagation (if needed)

**Verification:**
- [ ] Ingress has a valid IP / DNS
- [ ] TLS certificate valid: `echo | openssl s_client -servername gravl.example.com -connect <ingress-ip>:443 2>/dev/null | grep Subject`
- [ ] Health endpoint responds via HTTPS

---

### Scenario 4: Secrets / Configuration Issue

**Symptoms:**
- Backend pods running but logs show "secret not found" or "env var missing"
- Service starts but crashes immediately on the first request

**Rollback Steps:**

```bash
# 1. Check that the secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret app-secret -n gravl-production

# 2. If secrets are missing, restore from the sealed-secrets backup
kubectl apply -f k8s/production/sealed-secrets.yaml

# 3. OR, if using the External Secrets Operator, force a sync of the secret
kubectl annotate externalsecret app-secret \
  externalsecrets.external-secrets.io/force-sync=true \
  --overwrite -n gravl-production

# 4. Restart pods to pick up the secrets
kubectl rollout restart deployment/backend -n gravl-production

# 5. Monitor
kubectl rollout status deployment/backend -n gravl-production
```

**Expected Timeline:**
- 0-1 min: Detect missing secrets
- 1-2 min: Restore secrets
- 2-4 min: Pod restart + readiness

**Verification:**
- [ ] Secrets present: `kubectl get secrets -n gravl-production`
- [ ] Pods restarted and healthy
- [ ] No "secret not found" errors in logs

---

## Full Rollback (Nuclear Option)

**Use only if the scenarios above don't apply or don't resolve the issue.**

```bash
# 1. STOP ALL GRAVL SERVICES
kubectl scale deployment backend --replicas=0 -n gravl-production
kubectl scale deployment frontend --replicas=0 -n gravl-production

# 2. VERIFY THE DATABASE IS SAFE (CHECK BACKUP)
# Don't delete anything yet!

# 3. DELETE THE PRODUCTION NAMESPACE (CAREFUL!)
# kubectl delete namespace gravl-production
# (Only if you have an offsite backup and are 100% sure)

# 4. RESTORE FROM BACKUP
# This depends on your backup solution:

# Option A: Velero (cluster-wide backup)
# velero restore create --from-backup gravl-prod-2026-03-06-08-00

# Option B: Manual restore (infrastructure as code)
# kubectl apply -f k8s/production/namespace.yaml
# kubectl apply -f k8s/production/rbac.yaml
# kubectl apply -f k8s/production/secrets.yaml
# kubectl apply -f k8s/production/statefulsets.yaml
# ... (all resources for v1.2.3)

# 5. RESTORE THE DATABASE FROM BACKUP
# aws rds restore-db-instance-from-db-snapshot ...
# OR restore from a pg_dump / backup file

# 6. VERIFY EVERYTHING
kubectl get all -n gravl-production
kubectl logs -n gravl-production -l component=backend | grep -i error | head -10
```

**Expected Timeline:** 15-60 minutes (depending on backup size and complexity)

---

## Post-Rollback Actions

### 1. Verify Service Health (5 minutes)

```bash
# Check all endpoints
curl https://gravl.example.com/api/health

# Verify dashboards
# (Log in to Grafana, ensure metrics are flowing)

# Check alert status
# (There should be no firing alerts related to the rollback)
```

### 2. Communicate Status (Immediately)

```bash
# Slack #gravl-incident
# "✅ Rollback complete. Service restored to v1.2.3. RCA scheduled for [tomorrow]"

# Update the status page (if external-facing)
# "Production: Operational (rolled back to previous version)"
```

### 3. Root Cause Analysis (Within 24 hours)

- [ ] What went wrong in v1.3.0?
- [ ] Why didn't we catch this in staging?
- [ ] How do we prevent this in the future?
- [ ] Blameless postmortem (focus on process, not people)

### 4. Fix & Re-deploy (Next 24-72 hours)

- [ ] Fix the issue
- [ ] Test thoroughly in staging
- [ ] Peer review of changes
- [ ] Plan the new deployment (with team consensus)
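The health verification is better done as a retry loop than a single curl. A generic sketch — `wait_healthy` is a hypothetical helper, and the demo runs it against `true`/`false` so the snippet works without network access; the real invocation is shown in a comment:

```shell
#!/bin/sh
# Hypothetical poller: retry a health command until it succeeds or give up.
# Usage: wait_healthy RETRIES SLEEP_SECONDS CMD [ARGS...]
wait_healthy() {
  retries=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$retries" ]; do
    "$@" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Real use: wait_healthy 10 30 curl -fsS https://gravl.example.com/api/health
wait_healthy 3 0 true  && echo "healthy"
wait_healthy 2 0 false || echo "still down"
```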

---

## Rollback Checklist (Keep in the Cockpit During an Incident)

```
INCIDENT RESPONSE
[ ] Page on-call engineer
[ ] Slack alert to #gravl-incident
[ ] Check monitoring dashboard
[ ] Review error logs
[ ] Assess: Fix-in-place or rollback?

IF ROLLBACK:
[ ] Identify previous version (backend, frontend, database)
[ ] Verify backup exists and is recent
[ ] Alert team: "Rolling back to vX.Y.Z"
[ ] Execute rollback (see scenarios above)
[ ] Monitor rollout (every 30 seconds)
[ ] Health checks passing? (API, DB, ingress)
[ ] External test (curl the health endpoint)
[ ] Metrics returning to normal?

POST-ROLLBACK
[ ] Slack: Service status update
[ ] Update status page (if applicable)
[ ] Create incident ticket for RCA
[ ] Schedule postmortem for tomorrow
[ ] Document what happened + what to improve
```

---

## Automation & Testing

### Rollback Drill (Monthly)

Test the rollback procedure in staging without actually rolling back production:

1. Deploy the new version to staging
2. Follow the rollback steps (but against the staging namespace)
3. Verify it works
4. Document any issues found
5. Update this runbook

### Backup Verification (Weekly)

Ensure backups are recent and restorable:

1. Check the last backup timestamp
2. Test a restore to staging from the backup
3. Verify data integrity
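Step 1 of the backup verification (recency) can be checked mechanically against the 5-minute RPO stated at the top of this runbook. A sketch using epoch seconds — `backup_fresh` is a hypothetical helper, and the "last backup" timestamp is simulated here:

```shell
#!/bin/sh
# Hypothetical check: is the last backup recent enough for the RPO?
# Usage: backup_fresh LAST_BACKUP_EPOCH NOW_EPOCH MAX_AGE_SECONDS
backup_fresh() {
  [ $(( $2 - $1 )) -le "$3" ]
}

NOW=$(date +%s)
# In practice LAST would come from backup metadata (e.g. the backup job's
# completion time); here it is simulated as 2 minutes ago.
LAST=$(( NOW - 120 ))
if backup_fresh "$LAST" "$NOW" 300; then
  echo "backup within RPO (<5 min old)"
else
  echo "backup too old -- do NOT proceed without a fresh one"
fi
```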

---

## Support & Escalation

**If you're unsure about rollback:**
1. Page a senior engineer (don't hesitate)
2. Isolate the problem (stop creating new pods; scale to 0)
3. Preserve logs (don't delete anything until the RCA is done)
4. Get expert help before rolling back

**Post-Incident Contacts:**
- Incident lead: [NAME/SLACK]
- On-call manager: [NAME/SLACK]
- Database expert: [NAME/SLACK]

---

**Document Version:** 1.0
**Last Updated:** 2026-03-06 08:50
**Next Review:** After the first production rollback or after 30 days (whichever comes first)