# Rollback Procedure — Phase 10-07, Task 5

**Date:** 2026-03-06

**Status:** DRAFT (TO BE TESTED)

**Owner:** DevOps / On-Call Lead

**Target RTO (Recovery Time Objective):** <15 minutes

**Target RPO (Recovery Point Objective):** <5 minutes

---

## Overview

This document defines how to roll back Gravl from production if a critical failure is discovered post-deployment.

**When to Roll Back:**

- Database migration failures (data integrity at risk)
- More than 2 pods in CrashLoopBackOff
- Ingress / networking down (service unavailable)
- Security breach or incident requiring immediate action
- Customer-facing API errors (>5% error rate for >5 minutes)

**When NOT to Roll Back:**

- Single pod restart (normal Kubernetes behavior)
- Slow response times but no errors (<5% error rate)
- DNS delays (usually resolve on their own)
- Single replica pod failure (covered by the HA setup)

---

## Prerequisites for Rollback

**Before deploying to production, ensure:**

1. **Previous version image tag is known** (see the capture sketch after this list):

   ```bash
   # Save these BEFORE deploying new version
   BACKEND_PREVIOUS_IMAGE=gravl-backend:v1.2.3
   FRONTEND_PREVIOUS_IMAGE=gravl-frontend:v1.2.3
   POSTGRES_PREVIOUS_VERSION=15.2
   ```

2. **Database backup exists (automated or manual):**

   ```bash
   # Verify backup job ran before deployment
   kubectl logs -n gravl-monitoring job/backup-job | tail -20
   ```

3. **Kubernetes YAML configs for previous version available:**

   - k8s/production/backend-deployment.yaml (v1.2.3)
   - k8s/production/frontend-deployment.yaml (v1.2.3)
   - Database initialization scripts (v1.2.3)

4. **Monitoring & alerting configured** (to detect failures)
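
For item 1, the tags can be captured straight from the cluster rather than copied by hand. A minimal sketch, assuming one container per deployment (adjust the jsonpath if you run sidecars):

```bash
# Record the live image tags before deploying the new version
for d in backend frontend; do
  echo "$d: $(kubectl get deploy "$d" -n gravl-production \
    -o jsonpath='{.spec.template.spec.containers[0].image}')"
done
```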

---

## Decision: Is This a Rollback Situation?

Ask yourself:

1. **Is data integrity at risk?**
   - Database corruption or migration failure → YES, roll back
   - Lost data → YES, roll back (then restore from backup)

2. **Is the service unavailable to users?**
   - All pods crashed → YES, roll back
   - Some pods crashing but service still partially up → WAIT 2 minutes and reassess before rolling back
   - Users seeing errors → CHECK ERROR RATE; if >5%, roll back (a query sketch follows this list)

3. **Can we fix it without rolling back?**
   - Restart pods → try this first
   - Scale up replicas → try this first
   - DNS issue → fix DNS; don't roll back
   - Config issue (secrets, env vars) → fix config, restart pods; don't roll back

4. **Do we have a known-good previous version?**
   - If no recent backup or previous version is available → DON'T roll back (call in an expert)
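
To put a number on the error-rate check in question 2, a minimal sketch is below. It assumes Prometheus is reachable at the in-cluster URL shown and that the backend exports a counter like `http_requests_total` with a `status` label — the URL and the metric/label names are assumptions to adjust to your stack:

```bash
# Hypothetical 5xx error-rate query (URL and metric names are assumptions)
PROM_URL=http://prometheus.gravl-monitoring.svc:9090
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1]'
# A value above 0.05 sustained for 5 minutes meets the rollback threshold
```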
---

## Incident Response Checklist (Before Rollback)

Do these in parallel while deciding on rollback:

- [ ] **ALERT:** Page on-call engineer + incident lead to bridge
- [ ] **COMMUNICATE:** Slack #gravl-incident: "Investigating production issue"
- [ ] **ASSESS:** Check logs, dashboards, alerts

  ```bash
  kubectl logs -n gravl-production -l component=backend --tail=100 | grep -i error
  kubectl get events -n gravl-production --sort-by='.lastTimestamp'
  ```

- [ ] **DECIDE:** Rollback or fix-in-place? (30-second decision)
- [ ] **NOTIFY:** If rolling back, notify stakeholders immediately
- [ ] **EXECUTE:** Rollback procedure (15 minutes)
- [ ] **VERIFY:** Post-rollback health checks (5 minutes)

---

## Rollback Scenarios

### Scenario 1: Pod Crash After Deployment (Most Common)

**Symptoms:**
- Backend pods in CrashLoopBackOff
- Error in logs: "Database connection refused" or "Config not found"

**Rollback Steps:**

```bash
# 1. Alert team
#    (already in progress from the decision step above)

# 2. Scale down failing deployment to stop restarts
kubectl scale deployment backend --replicas=0 -n gravl-production

# 3. Revert to previous image version
kubectl set image deployment/backend \
  backend=gravl-backend:v1.2.3 \
  -n gravl-production

# 4. Scale back up
kubectl scale deployment backend --replicas=3 -n gravl-production

# 5. Monitor rollout
kubectl rollout status deployment/backend -n gravl-production

# 6. Verify pods are running
kubectl get pods -n gravl-production -l component=backend
```

**Expected Timeline:**
- 0-1 min: Scale down (restarts stop)
- 1-2 min: Image pull + container start
- 2-3 min: Pod ready + health check passes
- 3-5 min: Full rollout complete

**Verification:**
- [ ] All backend pods running and ready
- [ ] No error messages in pod logs
- [ ] Health check endpoint responds (a polling sketch follows this list)
- [ ] Service latency returning to normal
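
A small polling loop can stand in for re-running the health check by hand during the 3-5 minute rollout window. A minimal sketch, assuming the public health endpoint used elsewhere in this runbook:

```bash
# Poll the health endpoint every 10s for up to 5 minutes
for i in $(seq 1 30); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://gravl.example.com/api/health)
  [ "$code" = "200" ] && echo "healthy (attempt $i)" && break
  sleep 10
done
```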
---

### Scenario 2: Database Migration Failure

**Symptoms:**
- Backend pods stuck in Init (waiting for migration)
- Error in logs: "Migration failed: duplicate key value"
- Database migration job failed

**Rollback Steps:**

```bash
# 1. STOP ALL BACKEND PODS (prevent further schema changes)
kubectl scale deployment backend --replicas=0 -n gravl-production

# 2. CHECK DATABASE STATUS
kubectl exec -it postgres-0 -n gravl-production -- \
  psql -U gravl_user -d gravl -c "SELECT version();"

# 3. RESTORE FROM BACKUP (if schema corrupted)
# This depends on your backup system (e.g., AWS RDS snapshots, Velero, pg_dump)

## Example: AWS RDS backup
# aws rds restore-db-instance-from-db-snapshot \
#   --db-instance-identifier gravl-production-restored \
#   --db-snapshot-identifier gravl-prod-snapshot-2026-03-06-09-00

## Example: pg_dump restore (use -i, not -it: stdin is redirected from a file)
# kubectl exec -i postgres-0 -- \
#   psql -U gravl_user -d gravl < /backup/gravl-schema-v1.2.3.sql

# 4. ROLL BACK DEPLOYMENT TO PREVIOUS VERSION
kubectl set image deployment/backend \
  backend=gravl-backend:v1.2.3 \
  -n gravl-production

# 5. RESTART MIGRATION JOB WITH PREVIOUS VERSION
# (assumes the migration job uses the image tag from the deployment)
kubectl delete job db-migration -n gravl-production
kubectl apply -f k8s/production/db-migration-job.yaml

# Monitor migration
kubectl logs -f job/db-migration -n gravl-production

# 6. SCALE UP BACKEND WHEN MIGRATION SUCCEEDS
kubectl scale deployment backend --replicas=3 -n gravl-production
```

**Expected Timeline:**
- 0-1 min: Scale down + stop pods
- 1-5 min: Database restore (varies by snapshot size; could be 5-30 min)
- 5-10 min: Migration rollback
- 10-15 min: Scale up and stabilize

**Verification:**
- [ ] Database restoration successful (check row counts in critical tables — see the sketch below)
- [ ] Migration job completed without errors
- [ ] Backend pods running and connected to database
- [ ] Health checks passing
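
For the row-count check, `pg_stat_user_tables` gives a quick approximation without naming every table up front (the counts are estimates; run `SELECT count(*)` against truly critical tables if exact numbers matter):

```bash
# Approximate live row counts for the ten largest tables
kubectl exec -i postgres-0 -n gravl-production -- \
  psql -U gravl_user -d gravl -c \
  "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"
```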
---

### Scenario 3: Ingress / Network Failure

**Symptoms:**
- External users cannot reach API
- Ingress status shows no endpoints
- Backend pods running but no traffic reaching them

**Rollback Steps:**

```bash
# 1. Check ingress status
kubectl describe ingress gravl-ingress -n gravl-production

# 2. Check service endpoints
kubectl get endpoints -n gravl-production

# 3. If TLS cert is the issue, revert to previous cert
kubectl delete secret staging-tls -n gravl-production
kubectl create secret tls staging-tls \
  --cert=path/to/previous-cert.crt \
  --key=path/to/previous-key.key \
  -n gravl-production

# 4. If ingress config is broken, revert to previous version
kubectl apply -f k8s/production/ingress-v1.2.3.yaml --force

# 5. Verify ingress is up
kubectl get ingress -n gravl-production -w
```

**Expected Timeline:**
- 0-1 min: Diagnose issue
- 1-2 min: Revert ingress or cert
- 2-3 min: DNS propagation (if needed)

**Verification:**
- [ ] Ingress has a valid IP / DNS record
- [ ] TLS certificate valid: `echo | openssl s_client -servername gravl.example.com -connect <ingress-ip>:443 2>/dev/null | grep Subject`
- [ ] Health endpoint responds via HTTPS (see the `--resolve` sketch below)
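
If DNS has not propagated yet, curl's `--resolve` flag exercises the ingress IP directly under the real hostname (substitute the actual `<ingress-ip>`, as in the openssl check above):

```bash
# Test HTTPS against the ingress before DNS catches up
curl -sv https://gravl.example.com/api/health \
  --resolve gravl.example.com:443:<ingress-ip> 2>&1 | tail -5
```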
---

### Scenario 4: Secrets / Configuration Issue

**Symptoms:**
- Backend pods running but logs show "secret not found" or "env var missing"
- Service starts but crashes immediately on first request

**Rollback Steps:**

```bash
# 1. Check secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret app-secret -n gravl-production

# 2. If secrets are missing, restore from sealed-secrets backup
kubectl apply -f k8s/production/sealed-secrets.yaml

# 3. OR if using External Secrets Operator, force a sync
#    (ESO reconciles when the force-sync annotation changes; check your
#    operator version's docs for the exact convention)
kubectl annotate externalsecret app-secret \
  force-sync=$(date +%s) \
  --overwrite -n gravl-production

# 4. Restart pods to pick up secrets
kubectl rollout restart deployment/backend -n gravl-production

# 5. Monitor
kubectl rollout status deployment/backend -n gravl-production
```

**Expected Timeline:**
- 0-1 min: Detect missing secrets
- 1-2 min: Restore secrets
- 2-4 min: Pod restart + readiness

**Verification:**
- [ ] Secrets present: `kubectl get secrets -n gravl-production`
- [ ] Pods restarted and healthy (a printenv sketch follows this list)
- [ ] No "secret not found" errors in logs
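
To confirm a restored secret actually reached the containers, print the pod environment. `DATABASE_URL` is an illustrative key — substitute a variable your app really consumes (for file-mounted secrets, `ls` the mount path instead):

```bash
# Verify an expected env var is present in a running backend pod
kubectl exec -n gravl-production deploy/backend -- printenv | grep -c DATABASE_URL
```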
---

## Full Rollback (Nuclear Option)

**Use only if the scenarios above don't apply or don't resolve the issue.**

```bash
# 1. STOP ALL GRAVL SERVICES
kubectl scale deployment backend --replicas=0 -n gravl-production
kubectl scale deployment frontend --replicas=0 -n gravl-production

# 2. VERIFY DATABASE IS SAFE (CHECK BACKUP)
# Don't delete anything yet!

# 3. DELETE PRODUCTION NAMESPACE (CAREFUL!)
# kubectl delete namespace gravl-production
# (Only if you have offsite backup and are 100% sure)

# 4. RESTORE FROM BACKUP
# This depends on your backup solution:

## Option A: Velero (cluster-wide backup)
# velero restore create --from-backup gravl-prod-2026-03-06-08-00

## Option B: Manual restore (infrastructure as code)
# kubectl apply -f k8s/production/namespace.yaml
# kubectl apply -f k8s/production/rbac.yaml
# kubectl apply -f k8s/production/secrets.yaml
# kubectl apply -f k8s/production/statefulsets.yaml
# ... (all resources for v1.2.3)

# 5. RESTORE DATABASE FROM BACKUP
# aws rds restore-db-instance-from-db-snapshot ...
# OR restore from pg_dump / backup file

# 6. VERIFY EVERYTHING
kubectl get all -n gravl-production
kubectl logs -n gravl-production -l component=backend | grep -i error | head -10
```

**Expected Timeline:** 15-60 minutes (depending on backup size and complexity)
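
Before running the Velero restore in Option A, confirm the backup exists and completed; both commands are standard velero CLI (the backup name is the example used above):

```bash
# List available backups and inspect the one you intend to restore
velero backup get
velero backup describe gravl-prod-2026-03-06-08-00 --details
```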
---

## Post-Rollback Actions

### 1. Verify Service Health (5 minutes)

```bash
# Check all endpoints
curl https://gravl.example.com/api/health

# Verify dashboards
# (Log in to Grafana and confirm metrics are flowing)

# Check alert status
# (There should be no firing alerts related to the rollback)
```
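
The "no firing alerts" check can be scripted as well. A sketch using Alertmanager's v2 API — the in-cluster URL is an assumption to match to your monitoring namespace:

```bash
# Count currently firing alerts (0 is the goal)
curl -s "http://alertmanager.gravl-monitoring.svc:9093/api/v2/alerts?active=true" | jq length
```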

### 2. Communicate Status (Immediately)

```bash
# Slack #gravl-incident
# "✅ Rollback complete. Service restored to v1.2.3. RCA scheduled for [tomorrow]"

# Update status page (if external-facing)
# "Production: Operational (rolled back to previous version)"
```

### 3. Root Cause Analysis (Within 24 hours)

- [ ] What went wrong in v1.3.0?
- [ ] Why didn't we catch this in staging?
- [ ] How do we prevent this in the future?
- [ ] Blameless postmortem (focus on process, not people)

### 4. Fix & Re-deploy (Next 24-72 hours)

- [ ] Fix the issue
- [ ] Test thoroughly in staging
- [ ] Peer-review the changes
- [ ] Plan the new deployment (with team consensus)

---

## Rollback Checklist (Keep at Hand During an Incident)

```
INCIDENT RESPONSE
[ ] Page on-call engineer
[ ] Slack alert to #gravl-incident
[ ] Check monitoring dashboard
[ ] Review error logs
[ ] Assess: Fix-in-place or rollback?

IF ROLLBACK:
[ ] Identify previous version (backend, frontend, database)
[ ] Verify backup exists and is recent
[ ] Alert team: "Rolling back to vX.Y.Z"
[ ] Execute rollback (see scenarios above)
[ ] Monitor rollout (every 30 seconds)
[ ] Health checks passing? (API, DB, ingress)
[ ] External test (curl health endpoint)
[ ] Metrics returning to normal?

POST-ROLLBACK
[ ] Slack: Service status update
[ ] Update status page (if applicable)
[ ] Create incident ticket for RCA
[ ] Schedule postmortem for tomorrow
[ ] Document what happened + what to improve
```

---

## Automation & Testing

### Rollback Drill (Monthly)

```bash
# Test the rollback procedure in staging without actually rolling back production
# 1. Deploy new version to staging
# 2. Follow rollback steps (but against the staging namespace)
# 3. Verify it works
# 4. Document any issues found
# 5. Update this runbook
```
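
A concrete drill pass might look like the sketch below (`gravl-staging` is an assumed namespace name). It also exercises `kubectl rollout undo`, Kubernetes' built-in one-step rollback, which is a useful complement to the explicit `set image` flow this runbook uses:

```bash
# Practice the Scenario 1 flow against staging
kubectl -n gravl-staging set image deployment/backend backend=gravl-backend:v1.2.3
kubectl -n gravl-staging rollout status deployment/backend --timeout=5m
# Then practice the built-in undo path as well
kubectl -n gravl-staging rollout undo deployment/backend
kubectl -n gravl-staging rollout status deployment/backend --timeout=5m
```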

### Backup Verification (Weekly)

```bash
# Ensure backups are recent and restorable
# 1. Check last backup timestamp
# 2. Test restore to staging from backup
# 3. Verify data integrity
```
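
One way to script the freshness check in step 1 — assuming backups land in an S3 bucket, whose name here is made up:

```bash
# Show the newest backup object (bucket name is an assumption)
aws s3 ls s3://gravl-backups/ --recursive | sort | tail -1
```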

---

## Support & Escalation

**If you're unsure about rollback:**

1. Page a senior engineer (don't hesitate)
2. Isolate the problem (stop creating new pods; scale to 0)
3. Preserve logs (don't delete anything until the RCA is done)
4. Get expert help before rolling back

**Post-Incident Contacts:**

- Incident lead: [NAME/SLACK]
- On-call manager: [NAME/SLACK]
- Database expert: [NAME/SLACK]

---

**Document Version:** 1.0

**Last Updated:** 2026-03-06 08:50

**Next Review:** After first production rollback or after 30 days (whichever comes first)