Rollback Procedure — Phase 10-07, Task 5
Date: 2026-03-06
Status: DRAFT (TO BE TESTED)
Owner: DevOps / On-Call Lead
Target RTO (Recovery Time Objective): <15 minutes
Target RPO (Recovery Point Objective): <5 minutes
Overview
This document defines how to roll back Gravl from production if a critical failure is discovered post-deployment.
When to Rollback:
- Database migration failures (data integrity at risk)
- More than 2 pods in CrashLoopBackOff
- Ingress / networking down (service unavailable)
- Security breach or incident requiring immediate action
- Customer-facing API errors (>5% error rate for >5 minutes; see the query sketch below)
When NOT to Rollback:
- Single pod restart (normal Kubernetes behavior)
- Slow response times but no errors (<5% error rate)
- DNS delays (usually resolves itself)
- Single replica pod failure (covered by HA setup)
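The >5% error-rate threshold above can be checked quickly against Prometheus. A sketch, assuming Prometheus runs in gravl-monitoring and the backend exposes an http_requests_total counter (metric name and address are assumptions; adjust to your setup):
# Hypothetical Prometheus address and metric name
curl -sG http://prometheus.gravl-monitoring:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5..",namespace="gravl-production"}[5m])) / sum(rate(http_requests_total{namespace="gravl-production"}[5m]))'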
Prerequisites for Rollback
Before deploying to production, ensure:
- Previous version image tag is known (a capture sketch follows this list):
  # Save these BEFORE deploying new version
  BACKEND_PREVIOUS_IMAGE=gravl-backend:v1.2.3
  FRONTEND_PREVIOUS_IMAGE=gravl-frontend:v1.2.3
  POSTGRES_PREVIOUS_VERSION=15.2
- Database backup exists (automated or manual):
  # Verify backup job ran before deployment
  kubectl logs -n gravl-monitoring job/backup-job | tail -20
- Kubernetes YAML configs for previous version available:
  - k8s/production/backend-deployment.yaml (v1.2.3)
  - k8s/production/frontend-deployment.yaml (v1.2.3)
  - Database initialization scripts (v1.2.3)
- Monitoring & alerting configured (to detect failures)
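The previous-version variables in the first item can be captured from the live cluster just before a deploy. A sketch, assuming single-container deployments named backend and frontend:
# Record currently deployed images before rolling out the new version
BACKEND_PREVIOUS_IMAGE=$(kubectl get deployment backend -n gravl-production \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
FRONTEND_PREVIOUS_IMAGE=$(kubectl get deployment frontend -n gravl-production \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "$BACKEND_PREVIOUS_IMAGE $FRONTEND_PREVIOUS_IMAGE" >> rollback-versions.log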
Decision: Is This a Rollback Situation?
Ask yourself:
1. Is data integrity at risk?
   - Database corruption or migration failure → YES, rollback
   - Lost data → YES, rollback (then restore from backup)
2. Is the service unavailable to users?
   - All pods crashed → YES, rollback
   - Some pods crashing, service still partially up → WAIT 2 minutes and reassess before rolling back
   - Users seeing errors → CHECK ERROR RATE; if >5%, rollback
3. Can we fix it without rolling back? (quick attempts sketched after this list)
   - Restart pods → try this first
   - Scale up replicas → try this first
   - DNS issue → fix DNS, don't rollback
   - Config issue (secrets, env vars) → fix config, restart pods, don't rollback
4. Do we have a known-good previous version?
   - If no recent backup or previous version is available → DON'T rollback (call in an expert)
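Minimal fix-in-place attempts for question 3, using standard kubectl commands only:
# Restart pods in place (also picks up fixed config/secrets)
kubectl rollout restart deployment/backend -n gravl-production
# Or add replicas if the service is up but degraded
kubectl scale deployment backend --replicas=5 -n gravl-production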
Incident Response Checklist (Before Rollback)
Do these in parallel while deciding on rollback:
- ALERT: Page on-call engineer + incident lead to bridge
- COMMUNICATE: Slack #gravl-incident: "Investigating production issue"
- ASSESS: Check logs, dashboards, alerts
kubectl logs -n gravl-production -l component=backend --tail=100 | grep -i error
kubectl get events -n gravl-production --sort-by='.lastTimestamp'
- DECIDE: Rollback or fix-in-place? (30-second decision)
- NOTIFY: If rolling back, notify stakeholders immediately
- EXECUTE: Rollback procedure (15 minutes)
- VERIFY: Post-rollback health checks (5 minutes)
Rollback Scenarios
Scenario 1: Pod Crash After Deployment (Most Common)
Symptoms:
- Backend pods in CrashLoopBackOff
- Error in logs: "Database connection refused" or "Config not found"
Rollback Steps:
# 1. Alert team
# (already in progress from decision above)
# 2. Scale down failing deployment to stop restarts
kubectl scale deployment backend --replicas=0 -n gravl-production
# 3. Revert to previous image version
kubectl set image deployment/backend \
backend=gravl-backend:v1.2.3 \
-n gravl-production
# 4. Scale back up
kubectl scale deployment backend --replicas=3 -n gravl-production
# 5. Monitor rollout
kubectl rollout status deployment/backend -n gravl-production
# 6. Verify pods are running
kubectl get pods -n gravl-production -l component=backend
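If the deployment's rollout history is intact, steps 2-4 collapse into a single command. A sketch, assuming the previous ReplicaSet is still in the history:
# One-step alternative: revert to the previous rollout revision
kubectl rollout undo deployment/backend -n gravl-production
kubectl rollout status deployment/backend -n gravl-production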
Expected Timeline:
- 0-1 min: Scale down (restarts stop)
- 1-2 min: Image pull + container start
- 2-3 min: Pod ready + health check pass
- 3-5 min: Full rollout complete
Verification:
- All backend pods running and ready
- No error messages in pod logs
- Health check endpoint responds
- Service latency returning to normal
Scenario 2: Database Migration Failure
Symptoms:
- Backend pods stuck in Init (waiting for migration)
- Error in logs: "Migration failed: duplicate key value"
- Database migration job failed
Rollback Steps:
# 1. STOP ALL BACKEND PODS (prevent further schema changes)
kubectl scale deployment backend --replicas=0 -n gravl-production
# 2. CHECK DATABASE STATUS
kubectl exec -it postgres-0 -n gravl-production -- \
psql -U gravl_user -d gravl -c "SELECT version();"
# 3. RESTORE FROM BACKUP (if schema corrupted)
# This depends on your backup system (e.g., AWS RDS snapshots, Velero, pg_dump)
## Example: AWS RDS backup
# aws rds restore-db-instance-from-db-snapshot \
# --db-instance-identifier gravl-production-restored \
# --db-snapshot-identifier gravl-prod-snapshot-2026-03-06-09-00
## Example: pg_dump restore (use -i, not -it, when piping a file via stdin)
# kubectl exec -i postgres-0 -n gravl-production -- \
#   psql -U gravl_user -d gravl < /backup/gravl-schema-v1.2.3.sql
# 4. ROLLBACK DEPLOYMENT TO PREVIOUS VERSION
kubectl set image deployment/backend \
backend=gravl-backend:v1.2.3 \
-n gravl-production
# 5. RESTART MIGRATION JOB WITH PREVIOUS VERSION
# (assume migration job uses image tag from deployment)
kubectl delete job db-migration -n gravl-production
kubectl apply -f k8s/production/db-migration-job.yaml
# Monitor migration
kubectl logs -f job/db-migration -n gravl-production
# 6. SCALE UP BACKEND WHEN MIGRATION SUCCEEDS
kubectl scale deployment backend --replicas=3 -n gravl-production
Expected Timeline:
- 0-1 min: Scale down + stop pods
- 1-5 min: Database restore for small snapshots (large snapshots can take 5-30 min and push total recovery past the RTO)
- 5-10 min: Migration rollback
- 10-15 min: Scale up and stabilize
Verification:
- Database restoration successful (check row counts in critical tables)
- Migration job completed without errors
- Backend pods running and connected to database
- Health checks passing
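A spot-check sketch for the row counts mentioned above. Table names here are hypothetical; substitute your own critical tables:
# Hypothetical table names -- replace with your critical tables
kubectl exec -it postgres-0 -n gravl-production -- \
  psql -U gravl_user -d gravl -c \
  "SELECT 'users' AS tbl, count(*) FROM users UNION ALL SELECT 'logs', count(*) FROM logs;"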
Scenario 3: Ingress / Network Failure
Symptoms:
- External users cannot reach API
- Ingress status shows no endpoints
- Backend pods running but no traffic reaching them
Rollback Steps:
# 1. Check ingress status
kubectl describe ingress gravl-ingress -n gravl-production
# 2. Check service endpoints
kubectl get endpoints -n gravl-production
# 3. If TLS cert is the issue, revert to previous cert
# (secret name below must match the one referenced in the ingress tls section)
kubectl delete secret staging-tls -n gravl-production
kubectl create secret tls staging-tls \
--cert=path/to/previous-cert.crt \
--key=path/to/previous-key.key \
-n gravl-production
# 4. If ingress config is broken, revert to previous version
kubectl apply -f k8s/production/ingress-v1.2.3.yaml --force
# 5. Verify ingress is up
kubectl get ingress -n gravl-production -w
Expected Timeline:
- 0-1 min: Diagnose issue
- 1-2 min: Revert ingress or cert
- 2-3 min: DNS propagation (if needed)
Verification:
- Ingress has valid IP / DNS
- TLS certificate valid:
  echo | openssl s_client -servername gravl.example.com -connect <ingress-ip>:443 2>/dev/null | grep Subject
- Health endpoint responds via HTTPS
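An end-to-end reachability check from outside the cluster (domain per the health-check examples later in this document):
# DNS resolution + HTTPS status in one pass
dig +short gravl.example.com
curl -sk -o /dev/null -w 'HTTP %{http_code}\n' https://gravl.example.com/api/health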
Scenario 4: Secrets / Configuration Issue
Symptoms:
- Backend pods running but logs show "secret not found" or "env var missing"
- Service starts but crashes immediately on first request
Rollback Steps:
# 1. Check secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret app-secret -n gravl-production
# 2. If secrets are missing, restore from sealed-secrets backup or External Secrets
kubectl apply -f k8s/production/sealed-secrets.yaml
# 3. OR if using External Secrets Operator, force a sync of the secret
kubectl annotate externalsecret app-secret \
  force-sync=$(date +%s) \
  --overwrite -n gravl-production
# 4. Restart pods to pick up secrets
kubectl rollout restart deployment/backend -n gravl-production
# 5. Monitor
kubectl rollout status deployment/backend -n gravl-production
Expected Timeline:
- 0-1 min: Detect missing secrets
- 1-2 min: Restore secrets
- 2-4 min: Pod restart + readiness
Verification:
- Secrets present:
  kubectl get secrets -n gravl-production
- Pods restarted and healthy
- No "secret not found" errors in logs
Full Rollback (Nuclear Option)
Use only if above scenarios don't apply or don't resolve issue.
# 1. STOP ALL GRAVL SERVICES
kubectl scale deployment backend --replicas=0 -n gravl-production
kubectl scale deployment frontend --replicas=0 -n gravl-production
# 2. VERIFY DATABASE IS SAFE (CHECK BACKUP)
# Don't delete anything yet!
# 3. DELETE PRODUCTION NAMESPACE (CAREFUL!)
# kubectl delete namespace gravl-production
# (Only if you have offsite backup and are 100% sure)
# 4. RESTORE FROM BACKUP
# This depends on your backup solution:
## Option A: Velero (cluster-wide backup)
# velero restore create --from-backup gravl-prod-2026-03-06-08-00
## Option B: Manual restore (infrastructure as code)
# kubectl apply -f k8s/production/namespace.yaml
# kubectl apply -f k8s/production/rbac.yaml
# kubectl apply -f k8s/production/secrets.yaml
# kubectl apply -f k8s/production/statefulsets.yaml
# ... (all resources for v1.2.3)
# 5. RESTORE DATABASE FROM BACKUP
# aws rds restore-db-instance-from-db-snapshot ...
# OR restore from pg_dump / backup file
# 6. VERIFY EVERYTHING
kubectl get all -n gravl-production
kubectl logs -n gravl-production -l component=backend | grep -i error | head -10
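If Option A (Velero) was used, restore progress can be watched with the Velero CLI; the restore name is autogenerated, so list restores first:
velero restore get
velero restore describe <restore-name> --details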
Expected Timeline: 15-60 minutes (depending on backup size and complexity)
Post-Rollback Actions
1. Verify Service Health (5 minutes)
# Check all endpoints
curl https://gravl.example.com/api/health
# Verify dashboards
# (Login to Grafana, ensure metrics flowing)
# Check alert status
# (Should have no firing alerts related to rollback)
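A simple poll to confirm the health endpoint stays green for about a minute:
# Poll the health endpoint 10 times, 6 seconds apart
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{http_code} ' https://gravl.example.com/api/health
  sleep 6
done; echo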
2. Communicate Status (Immediately)
# Slack #gravl-incident
# "✅ Rollback complete. Service restored to v1.2.3. RCA scheduled for [tomorrow]"
# Update status page (if external-facing)
# "Production: Operational (rolled back to previous version)"
3. Root Cause Analysis (Within 24 hours)
- What went wrong in v1.3.0?
- How did we not catch this in staging?
- How do we prevent this in the future?
- Blameless postmortem (focus on process, not people)
4. Fix & Re-deploy (Next 24-72 hours)
- Fix the issue
- Thorough testing in staging
- Peer review of changes
- Plan new deployment (with team consensus)
Rollback Checklist (Keep In Cockpit During Incident)
INCIDENT RESPONSE
[ ] Page on-call engineer
[ ] Slack alert to #gravl-incident
[ ] Check monitoring dashboard
[ ] Review error logs
[ ] Assess: Fix-in-place or rollback?
IF ROLLBACK:
[ ] Identify previous version (backend, frontend, database)
[ ] Verify backup exists and is recent
[ ] Alert team: "Rolling back to vX.Y.Z"
[ ] Execute rollback (see scenarios above)
[ ] Monitor rollout (every 30 seconds)
[ ] Health checks passing? (API, DB, ingress)
[ ] External test (curl health endpoint)
[ ] Metrics returning to normal?
POST-ROLLBACK
[ ] Slack: Service status update
[ ] Update status page (if applicable)
[ ] Create incident ticket for RCA
[ ] Schedule postmortem for tomorrow
[ ] Document what happened + what to improve
Automation & Testing
Rollback Drill (Monthly)
# Test rollback procedure in staging without actually rolling back production
# 1. Deploy new version to staging
# 2. Follow rollback steps (but against staging namespace)
# 3. Verify it works
# 4. Document any issues found
# 5. Update this runbook
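A minimal drill sketch, assuming a gravl-staging namespace that mirrors production:
# Run the Scenario 1 steps against staging instead of production
kubectl set image deployment/backend backend=gravl-backend:v1.2.3 -n gravl-staging
kubectl rollout status deployment/backend -n gravl-staging
kubectl get pods -n gravl-staging -l component=backend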
Backup Verification (Weekly)
# Ensure backups are recent and restorable
# 1. Check last backup timestamp
# 2. Test restore to staging from backup
# 3. Verify data integrity
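A starting point for the timestamp check, assuming backups run as Jobs in gravl-monitoring (as in the prerequisites section):
# Most recent backup jobs, newest last
kubectl get jobs -n gravl-monitoring --sort-by=.status.startTime | tail -3
kubectl logs -n gravl-monitoring job/backup-job | tail -5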
Support & Escalation
If you're unsure about rollback:
- Page senior engineer (don't hesitate)
- Isolate the problem (stop creating new pods, scale to 0)
- Preserve logs (don't delete anything until RCA is done)
- Get expert help before rolling back
Post-Incident Contact:
- Incident lead: [NAME/SLACK]
- On-call manager: [NAME/SLACK]
- Database expert: [NAME/SLACK]
Document Version: 1.0
Last Updated: 2026-03-06 08:50
Next Review: After first production rollback or after 30 days (whichever comes first)