Rollback Procedure — Phase 10-07, Task 5
Date: 2026-03-06
Status: DRAFT (TO BE TESTED)
Owner: DevOps / On-Call Lead
Target RTO (Recovery Time Objective): <15 minutes
Target RPO (Recovery Point Objective): <5 minutes
Overview
This document defines how to roll back Gravl from production if a critical failure is discovered post-deployment.
When to Rollback:
- Database migration failures (data integrity at risk)
- More than 2 pods in CrashLoopBackOff
- Ingress / networking down (service unavailable)
- Security breach or incident requiring immediate action
- Customer-facing API errors (>5% error rate for >5 minutes; see the query sketch below)
When NOT to Rollback:
- Single pod restart (normal Kubernetes behavior)
- Slow response times but no errors (<5% error rate)
- DNS delays (usually resolves itself)
- Single replica pod failure (covered by HA setup)
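The >5% error-rate threshold above can be checked quickly against Prometheus. A sketch, assuming Prometheus runs in gravl-monitoring and the backend exposes an http_requests_total counter (metric name and address are assumptions; adjust to your setup):
# Hypothetical Prometheus address and metric name
curl -sG http://prometheus.gravl-monitoring:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5..",namespace="gravl-production"}[5m])) / sum(rate(http_requests_total{namespace="gravl-production"}[5m]))'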
Prerequisites for Rollback
Before deploying to production, ensure:
- Previous version image tag is known (a capture sketch follows this list):
  # Save these BEFORE deploying new version
  BACKEND_PREVIOUS_IMAGE=gravl-backend:v1.2.3
  FRONTEND_PREVIOUS_IMAGE=gravl-frontend:v1.2.3
  POSTGRES_PREVIOUS_VERSION=15.2
- Database backup exists (automated or manual):
  # Verify backup job ran before deployment
  kubectl logs -n gravl-monitoring job/backup-job | tail -20
- Kubernetes YAML configs for previous version available:
  - k8s/production/backend-deployment.yaml (v1.2.3)
  - k8s/production/frontend-deployment.yaml (v1.2.3)
  - Database initialization scripts (v1.2.3)
- Monitoring & alerting configured (to detect failures)
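The previous-version variables in the first item can be captured from the live cluster just before a deploy. A sketch, assuming single-container deployments named backend and frontend:
# Record currently deployed images before rolling out the new version
BACKEND_PREVIOUS_IMAGE=$(kubectl get deployment backend -n gravl-production \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
FRONTEND_PREVIOUS_IMAGE=$(kubectl get deployment frontend -n gravl-production \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "$BACKEND_PREVIOUS_IMAGE $FRONTEND_PREVIOUS_IMAGE" >> rollback-versions.log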
Decision: Is This a Rollback Situation?
Ask yourself:
1. Is data integrity at risk?
   - Database corruption or migration failure → YES, rollback
   - Lost data → YES, rollback (then restore from backup)
2. Is the service unavailable to users?
   - All pods crashed → YES, rollback
   - Some pods crashing, service still partially up → WAIT 2 minutes and reassess before rolling back
   - Users seeing errors → CHECK ERROR RATE; if >5%, rollback
3. Can we fix it without rolling back? (quick attempts sketched after this list)
   - Restart pods → try this first
   - Scale up replicas → try this first
   - DNS issue → fix DNS, don't rollback
   - Config issue (secrets, env vars) → fix config, restart pods, don't rollback
4. Do we have a known-good previous version?
   - If no recent backup or previous version is available → DON'T rollback (call in an expert)
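Minimal fix-in-place attempts for question 3, using standard kubectl commands only:
# Restart pods in place (also picks up fixed config/secrets)
kubectl rollout restart deployment/backend -n gravl-production
# Or add replicas if the service is up but degraded
kubectl scale deployment backend --replicas=5 -n gravl-production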
Incident Response Checklist (Before Rollback)
Do these in parallel while deciding on rollback:
- ALERT: Page on-call engineer + incident lead to bridge
- COMMUNICATE: Slack #gravl-incident: "Investigating production issue"
- ASSESS: Check logs, dashboards, alerts
kubectl logs -n gravl-production -l component=backend --tail=100 | grep -i error
kubectl get events -n gravl-production --sort-by='.lastTimestamp'
- DECIDE: Rollback or fix-in-place? (30-second decision)
- NOTIFY: If rolling back, notify stakeholders immediately
- EXECUTE: Rollback procedure (15 minutes)
- VERIFY: Post-rollback health checks (5 minutes)
Rollback Scenarios
Scenario 1: Pod Crash After Deployment (Most Common)
Symptoms:
- Backend pods in CrashLoopBackOff
- Error in logs: "Database connection refused" or "Config not found"
Rollback Steps:
# 1. Alert team
# (already in progress from decision above)
# 2. Scale down failing deployment to stop restarts
kubectl scale deployment backend --replicas=0 -n gravl-production
# 3. Revert to previous image version
kubectl set image deployment/backend \
backend=gravl-backend:v1.2.3 \
-n gravl-production
# 4. Scale back up
kubectl scale deployment backend --replicas=3 -n gravl-production
# 5. Monitor rollout
kubectl rollout status deployment/backend -n gravl-production
# 6. Verify pods are running
kubectl get pods -n gravl-production -l component=backend
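If the deployment's rollout history is intact, steps 2-4 collapse into a single command. A sketch, assuming the previous ReplicaSet is still in the history:
# One-step alternative: revert to the previous rollout revision
kubectl rollout undo deployment/backend -n gravl-production
kubectl rollout status deployment/backend -n gravl-production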
Expected Timeline:
- 0-1 min: Scale down (restarts stop)
- 1-2 min: Image pull + container start
- 2-3 min: Pod ready + health check pass
- 3-5 min: Full rollout complete
Verification:
- All backend pods running and ready
- No error messages in pod logs
- Health check endpoint responds
- Service latency returning to normal
Scenario 2: Database Migration Failure
Symptoms:
- Backend pods stuck in Init (waiting for migration)
- Error in logs: "Migration failed: duplicate key value"
- Database migration job failed
Rollback Steps:
# 1. STOP ALL BACKEND PODS (prevent further schema changes)
kubectl scale deployment backend --replicas=0 -n gravl-production
# 2. CHECK DATABASE STATUS
kubectl exec -it postgres-0 -n gravl-production -- \
psql -U gravl_user -d gravl -c "SELECT version();"
# 3. RESTORE FROM BACKUP (if schema corrupted)
# This depends on your backup system (e.g., AWS RDS snapshots, Velero, pg_dump)
## Example: AWS RDS backup
# aws rds restore-db-instance-from-db-snapshot \
# --db-instance-identifier gravl-production-restored \
# --db-snapshot-identifier gravl-prod-snapshot-2026-03-06-09-00
## Example: pg_dump restore (use -i, not -it, when piping a file via stdin)
# kubectl exec -i postgres-0 -n gravl-production -- \
#   psql -U gravl_user -d gravl < /backup/gravl-schema-v1.2.3.sql
# 4. ROLLBACK DEPLOYMENT TO PREVIOUS VERSION
kubectl set image deployment/backend \
backend=gravl-backend:v1.2.3 \
-n gravl-production
# 5. RESTART MIGRATION JOB WITH PREVIOUS VERSION
# (assume migration job uses image tag from deployment)
kubectl delete job db-migration -n gravl-production
kubectl apply -f k8s/production/db-migration-job.yaml
# Monitor migration
kubectl logs -f job/db-migration -n gravl-production
# 6. SCALE UP BACKEND WHEN MIGRATION SUCCEEDS
kubectl scale deployment backend --replicas=3 -n gravl-production
Expected Timeline:
- 0-1 min: Scale down + stop pods
- 1-5 min: Database restore for small snapshots (large snapshots can take 5-30 min and push total recovery past the RTO)
- 5-10 min: Migration rollback
- 10-15 min: Scale up and stabilize
Verification:
- Database restoration successful (check row counts in critical tables)
- Migration job completed without errors
- Backend pods running and connected to database
- Health checks passing
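A spot-check sketch for the row counts mentioned above. Table names here are hypothetical; substitute your own critical tables:
# Hypothetical table names -- replace with your critical tables
kubectl exec -it postgres-0 -n gravl-production -- \
  psql -U gravl_user -d gravl -c \
  "SELECT 'users' AS tbl, count(*) FROM users UNION ALL SELECT 'logs', count(*) FROM logs;"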
Scenario 3: Ingress / Network Failure
Symptoms:
- External users cannot reach API
- Ingress status shows no endpoints
- Backend pods running but no traffic reaching them
Rollback Steps:
# 1. Check ingress status
kubectl describe ingress gravl-ingress -n gravl-production
# 2. Check service endpoints
kubectl get endpoints -n gravl-production
# 3. If TLS cert is the issue, revert to previous cert
# (secret name below must match the one referenced in the ingress tls section)
kubectl delete secret staging-tls -n gravl-production
kubectl create secret tls staging-tls \
--cert=path/to/previous-cert.crt \
--key=path/to/previous-key.key \
-n gravl-production
# 4. If ingress config is broken, revert to previous version
kubectl apply -f k8s/production/ingress-v1.2.3.yaml --force
# 5. Verify ingress is up
kubectl get ingress -n gravl-production -w
Expected Timeline:
- 0-1 min: Diagnose issue
- 1-2 min: Revert ingress or cert
- 2-3 min: DNS propagation (if needed)
Verification:
- Ingress has valid IP / DNS
- TLS certificate valid:
  echo | openssl s_client -servername gravl.example.com -connect <ingress-ip>:443 2>/dev/null | grep Subject
- Health endpoint responds via HTTPS
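An end-to-end reachability check from outside the cluster (domain per the health-check examples later in this document):
# DNS resolution + HTTPS status in one pass
dig +short gravl.example.com
curl -sk -o /dev/null -w 'HTTP %{http_code}\n' https://gravl.example.com/api/health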
Scenario 4: Secrets / Configuration Issue
Symptoms:
- Backend pods running but logs show "secret not found" or "env var missing"
- Service starts but crashes immediately on first request
Rollback Steps:
# 1. Check secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret app-secret -n gravl-production
# 2. If secrets are missing, restore from sealed-secrets backup or External Secrets
kubectl apply -f k8s/production/sealed-secrets.yaml
# 3. OR if using External Secrets Operator, force a sync of the secret
kubectl annotate externalsecret app-secret \
  force-sync=$(date +%s) \
  --overwrite -n gravl-production
# 4. Restart pods to pick up secrets
kubectl rollout restart deployment/backend -n gravl-production
# 5. Monitor
kubectl rollout status deployment/backend -n gravl-production
Expected Timeline:
- 0-1 min: Detect missing secrets
- 1-2 min: Restore secrets
- 2-4 min: Pod restart + readiness
Verification:
- Secrets present:
  kubectl get secrets -n gravl-production
- Pods restarted and healthy
- No "secret not found" errors in logs
Full Rollback (Nuclear Option)
Use only if above scenarios don't apply or don't resolve issue.
# 1. STOP ALL GRAVL SERVICES
kubectl scale deployment backend --replicas=0 -n gravl-production
kubectl scale deployment frontend --replicas=0 -n gravl-production
# 2. VERIFY DATABASE IS SAFE (CHECK BACKUP)
# Don't delete anything yet!
# 3. DELETE PRODUCTION NAMESPACE (CAREFUL!)
# kubectl delete namespace gravl-production
# (Only if you have offsite backup and are 100% sure)
# 4. RESTORE FROM BACKUP
# This depends on your backup solution:
## Option A: Velero (cluster-wide backup)
# velero restore create --from-backup gravl-prod-2026-03-06-08-00
## Option B: Manual restore (infrastructure as code)
# kubectl apply -f k8s/production/namespace.yaml
# kubectl apply -f k8s/production/rbac.yaml
# kubectl apply -f k8s/production/secrets.yaml
# kubectl apply -f k8s/production/statefulsets.yaml
# ... (all resources for v1.2.3)
# 5. RESTORE DATABASE FROM BACKUP
# aws rds restore-db-instance-from-db-snapshot ...
# OR restore from pg_dump / backup file
# 6. VERIFY EVERYTHING
kubectl get all -n gravl-production
kubectl logs -n gravl-production -l component=backend | grep -i error | head -10
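If Option A (Velero) was used, restore progress can be watched with the Velero CLI; the restore name is autogenerated, so list restores first:
velero restore get
velero restore describe <restore-name> --details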
Expected Timeline: 15-60 minutes (depending on backup size and complexity)
Post-Rollback Actions
1. Verify Service Health (5 minutes)
# Check all endpoints
curl https://gravl.example.com/api/health
# Verify dashboards
# (Login to Grafana, ensure metrics flowing)
# Check alert status
# (Should have no firing alerts related to rollback)
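A simple poll to confirm the health endpoint stays green for about a minute:
# Poll the health endpoint 10 times, 6 seconds apart
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{http_code} ' https://gravl.example.com/api/health
  sleep 6
done; echo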
2. Communicate Status (Immediately)
# Slack #gravl-incident
# "✅ Rollback complete. Service restored to v1.2.3. RCA scheduled for [tomorrow]"
# Update status page (if external-facing)
# "Production: Operational (rolled back to previous version)"
3. Root Cause Analysis (Within 24 hours)
- What went wrong in v1.3.0?
- How did we not catch this in staging?
- How do we prevent this in the future?
- Blameless postmortem (focus on process, not people)
4. Fix & Re-deploy (Next 24-72 hours)
- Fix the issue
- Thorough testing in staging
- Peer review of changes
- Plan new deployment (with team consensus)
Rollback Checklist (Keep In Cockpit During Incident)
INCIDENT RESPONSE
[ ] Page on-call engineer
[ ] Slack alert to #gravl-incident
[ ] Check monitoring dashboard
[ ] Review error logs
[ ] Assess: Fix-in-place or rollback?
IF ROLLBACK:
[ ] Identify previous version (backend, frontend, database)
[ ] Verify backup exists and is recent
[ ] Alert team: "Rolling back to vX.Y.Z"
[ ] Execute rollback (see scenarios above)
[ ] Monitor rollout (every 30 seconds)
[ ] Health checks passing? (API, DB, ingress)
[ ] External test (curl health endpoint)
[ ] Metrics returning to normal?
POST-ROLLBACK
[ ] Slack: Service status update
[ ] Update status page (if applicable)
[ ] Create incident ticket for RCA
[ ] Schedule postmortem for tomorrow
[ ] Document what happened + what to improve
Automation & Testing
Rollback Drill (Monthly)
# Test rollback procedure in staging without actually rolling back production
# 1. Deploy new version to staging
# 2. Follow rollback steps (but against staging namespace)
# 3. Verify it works
# 4. Document any issues found
# 5. Update this runbook
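A minimal drill sketch, assuming a gravl-staging namespace that mirrors production:
# Run the Scenario 1 steps against staging instead of production
kubectl set image deployment/backend backend=gravl-backend:v1.2.3 -n gravl-staging
kubectl rollout status deployment/backend -n gravl-staging
kubectl get pods -n gravl-staging -l component=backend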
Backup Verification (Weekly)
# Ensure backups are recent and restorable
# 1. Check last backup timestamp
# 2. Test restore to staging from backup
# 3. Verify data integrity
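A starting point for the timestamp check, assuming backups run as Jobs in gravl-monitoring (as in the prerequisites section):
# Most recent backup jobs, newest last
kubectl get jobs -n gravl-monitoring --sort-by=.status.startTime | tail -3
kubectl logs -n gravl-monitoring job/backup-job | tail -5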
Support & Escalation
If you're unsure about rollback:
- Page senior engineer (don't hesitate)
- Isolate the problem (stop creating new pods, scale to 0)
- Preserve logs (don't delete anything until RCA is done)
- Get expert help before rolling back
Post-Incident Contact:
- Incident lead: [NAME/SLACK]
- On-call manager: [NAME/SLACK]
- Database expert: [NAME/SLACK]
Document Version: 1.0
Last Updated: 2026-03-06 08:50
Next Review: After first production rollback or after 30 days (whichever comes first)