# Rollback Procedure — Phase 10-07, Task 5
**Date:** 2026-03-06
**Status:** DRAFT (TO BE TESTED)
**Owner:** DevOps / On-Call Lead
**Target RTO (Recovery Time Objective):** <15 minutes
**Target RPO (Recovery Point Objective):** <5 minutes
---
## Overview
This document defines how to roll back Gravl from production if a critical failure is discovered post-deployment.
**When to Roll Back:**
- Database migration failures (data integrity at risk)
- More than 2 pods in CrashLoopBackOff
- Ingress / networking down (service unavailable)
- Security breach or incident requiring immediate action
- Customer-facing API errors (>5% error rate for >5 minutes)
**When NOT to Roll Back:**
- Single pod restart (normal Kubernetes behavior)
- Slow response times with the error rate still below the 5% threshold
- DNS propagation delays (these usually resolve on their own)
- Single replica pod failure (covered by HA setup)
---
## Pre-Requisites for Rollback
**Before deploying to production, ensure:**
1. **Previous version image tag is known:**
```bash
# Save these BEFORE deploying new version
BACKEND_PREVIOUS_IMAGE=gravl-backend:v1.2.3
FRONTEND_PREVIOUS_IMAGE=gravl-frontend:v1.2.3
POSTGRES_PREVIOUS_VERSION=15.2
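# If you did not record them, one way to read what is currently deployed before the new
# rollout (a sketch; assumes backend/frontend are the first container in each pod spec):
kubectl get deployment backend -n gravl-production \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl get deployment frontend -n gravl-production \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'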
```
2. **Database backup exists (automated or manual):**
```bash
# Verify backup job ran before deployment
kubectl logs -n gravl-monitoring job/backup-job | tail -20
```
3. **Kubernetes YAML configs for previous version available:**
- k8s/production/backend-deployment.yaml (v1.2.3)
- k8s/production/frontend-deployment.yaml (v1.2.3)
- Database initialization scripts (v1.2.3)
4. **Monitoring & alerting configured** (to detect failures)
---
## Decision: Is This a Rollback Situation?
Ask yourself:
1. **Is data integrity at risk?**
- Database corruption or migration failure → YES, rollback
- Lost data → YES, rollback (then restore from backup)
2. **Is the service unavailable to users?**
- All pods crashed → YES, rollback
- Some pods crashing but the service still partially available → WAIT 2 minutes before deciding; a rollback may not be needed
- Users seeing errors → CHECK ERROR RATE (see the query sketch after this list); if >5% for >5 minutes → rollback
3. **Can we fix it without rolling back?**
- Restart pods → try this first
- Scale up replicas → try this first
- DNS issue → fix DNS, don't roll back
- Config issue (secrets, env vars) → fix config, restart pods, don't roll back
4. **Do we have a known-good previous version?**
- If there is no recent backup or previous version available → DON'T roll back (call in an expert)
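If you want a hard number for the ">5% error rate" check rather than eyeballing dashboards, one hedged option is to query Prometheus directly. The sketch below assumes a Prometheus Operator install in `gravl-monitoring` (service `prometheus-operated`) and the ingress-nginx `nginx_ingress_controller_requests` metric; substitute whatever service and metric your monitoring stack actually exposes.
```bash
# Port-forward Prometheus locally, then compute the 5xx ratio over the last 5 minutes.
kubectl -n gravl-monitoring port-forward svc/prometheus-operated 9090:9090 &
sleep 2
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) / sum(rate(nginx_ingress_controller_requests[5m]))'
```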
---
## Incident Response Checklist (Before Rollback)
Do these in parallel while deciding on rollback:
- [ ] **ALERT:** Page on-call engineer + incident lead to bridge
- [ ] **COMMUNICATE:** Slack #gravl-incident: "Investigating production issue"
- [ ] **ASSESS:** Check logs, dashboards, alerts
```bash
kubectl logs -n gravl-production -l component=backend --tail=100 | grep -i error
kubectl get events -n gravl-production --sort-by='.lastTimestamp'
```
- [ ] **DECIDE:** Rollback or fix-in-place? (30-second decision)
- [ ] **NOTIFY:** If rolling back, notify stakeholders immediately
- [ ] **EXECUTE:** Rollback procedure (15 minutes)
- [ ] **VERIFY:** Post-rollback health checks (5 minutes)
---
## Rollback Scenarios
### Scenario 1: Pod Crash After Deployment (Most Common)
**Symptoms:**
- Backend pods in CrashLoopBackOff
- Error in logs: "Database connection refused" or "Config not found"
**Rollback Steps:**
```bash
# 1. Alert team
# (already in progress from decision above)
# 2. Scale down failing deployment to stop restarts
kubectl scale deployment backend --replicas=0 -n gravl-production
# 3. Revert to previous image version
kubectl set image deployment/backend \
backend=gravl-backend:v1.2.3 \
-n gravl-production
# 4. Scale back up
kubectl scale deployment backend --replicas=3 -n gravl-production
# 5. Monitor rollout
kubectl rollout status deployment/backend -n gravl-production
# 6. Verify pods are running
kubectl get pods -n gravl-production -l component=backend
```
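If the previous ReplicaSet is still retained in the deployment's revision history, `kubectl rollout undo` reaches the same end state without typing the image tag by hand; a sketch, assuming default revision-history settings:
```bash
# Inspect recorded revisions, then roll back to the previous one.
kubectl rollout history deployment/backend -n gravl-production
kubectl rollout undo deployment/backend -n gravl-production
# Or pin an explicit revision number taken from the history output:
# kubectl rollout undo deployment/backend -n gravl-production --to-revision=<N>
kubectl rollout status deployment/backend -n gravl-production
```
Note: `rollout undo` does not change the replica count, so if you already scaled the deployment to 0 in step 2, scale it back up afterwards.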
**Expected Timeline:**
- 0-1 min: Scale down (restarts stop)
- 1-2 min: Image pull + container start
- 2-3 min: Pod ready + health check pass
- 3-5 min: Full rollout complete
**Verification:**
- [ ] All backend pods running and ready
- [ ] No error messages in pod logs
- [ ] Health check endpoint responds
- [ ] Service latency returning to normal
---
### Scenario 2: Database Migration Failure
**Symptoms:**
- Backend pods stuck in Init (waiting for migration)
- Error in logs: "Migration failed: duplicate key value"
- Database migration job failed
**Rollback Steps:**
```bash
# 1. STOP ALL BACKEND PODS (prevent further schema changes)
kubectl scale deployment backend --replicas=0 -n gravl-production
# 2. CHECK DATABASE STATUS
kubectl exec -it postgres-0 -n gravl-production -- \
psql -U gravl_user -d gravl -c "SELECT version();"
# 3. RESTORE FROM BACKUP (if schema corrupted)
# This depends on your backup system (e.g., AWS RDS snapshots, Velero, pg_dump)
## Example: AWS RDS backup
# aws rds restore-db-instance-from-db-snapshot \
# --db-instance-identifier gravl-production-restored \
# --db-snapshot-identifier gravl-prod-snapshot-2026-03-06-09-00
## Example: pg_dump restore (stream the backup file into psql; use -i, not -t, when redirecting stdin)
# kubectl exec -i postgres-0 -n gravl-production -- \
#   psql -U gravl_user -d gravl < /backup/gravl-schema-v1.2.3.sql
# 4. ROLLBACK DEPLOYMENT TO PREVIOUS VERSION
kubectl set image deployment/backend \
backend=gravl-backend:v1.2.3 \
-n gravl-production
# 5. RESTART MIGRATION JOB WITH PREVIOUS VERSION
# (assume migration job uses image tag from deployment)
kubectl delete job db-migration -n gravl-production
kubectl apply -f k8s/production/db-migration-job.yaml
# Monitor migration
kubectl logs -f job/db-migration -n gravl-production
# 6. SCALE UP BACKEND WHEN MIGRATION SUCCEEDS
kubectl scale deployment backend --replicas=3 -n gravl-production
```
**Expected Timeline:**
- 0-1 min: Scale down + stop pods
- 1-5 min: Database restore for small backups (large snapshots can take 5-30 min and will exceed the 15-minute RTO)
- 5-10 min: Migration rollback
- 10-15 min: Scale up and stabilize
**Verification:**
- [ ] Database restoration successful (check row counts in critical tables; see the sketch after this list)
- [ ] Migration job completed without errors
- [ ] Backend pods running and connected to database
- [ ] Health checks passing
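For the row-count check above, one hedged option is Postgres' own statistics view (`n_live_tup` is an estimate; run `SELECT count(*)` on critical tables if you need exact numbers):
```bash
# Rough per-table row counts straight from Postgres statistics.
kubectl exec -it postgres-0 -n gravl-production -- \
  psql -U gravl_user -d gravl -c \
  "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 15;"
```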
---
### Scenario 3: Ingress / Network Failure
**Symptoms:**
- External users cannot reach API
- Ingress status shows no endpoints
- Backend pods running but no traffic reaching them
**Rollback Steps:**
```bash
# 1. Check ingress status
kubectl describe ingress gravl-ingress -n gravl-production
# 2. Check service endpoints
kubectl get endpoints -n gravl-production
# 3. If TLS cert is the issue, revert to previous cert
kubectl delete secret staging-tls -n gravl-production
kubectl create secret tls staging-tls \
--cert=path/to/previous-cert.crt \
--key=path/to/previous-key.key \
-n gravl-production
# 4. If ingress config is broken, revert to previous version
kubectl apply -f k8s/production/ingress-v1.2.3.yaml --force
# 5. Verify ingress is up
kubectl get ingress -n gravl-production -w
```
**Expected Timeline:**
- 0-1 min: Diagnose issue
- 1-2 min: Revert ingress or cert
- 2-3 min: DNS propagation (if needed)
**Verification:**
- [ ] Ingress has valid IP / DNS
- [ ] TLS certificate valid: `echo | openssl s_client -servername gravl.example.com -connect <ingress-ip>:443 2>/dev/null | openssl x509 -noout -subject -enddate`
- [ ] Health endpoint responds via HTTPS (sketch below)
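To confirm the health endpoint really answers through the ingress (and not via a stale DNS answer), a sketch that pins the hostname to the ingress IP explicitly; the hostname and path are the same assumptions used elsewhere in this runbook:
```bash
# Resolve gravl.example.com to the ingress IP ourselves and hit the health endpoint over TLS.
# (Some load balancers report .hostname instead of .ip in the ingress status.)
INGRESS_IP=$(kubectl get ingress gravl-ingress -n gravl-production \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -sv --resolve "gravl.example.com:443:${INGRESS_IP}" \
  https://gravl.example.com/api/health
```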
---
### Scenario 4: Secrets / Configuration Issue
**Symptoms:**
- Backend pods running but logs show "secret not found" or "env var missing"
- Service starts but crashes immediately on first request
**Rollback Steps:**
```bash
# 1. Check secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret app-secret -n gravl-production
# 2. If secrets are missing, restore from sealed-secrets backup or External Secrets
kubectl apply -f k8s/production/sealed-secrets.yaml
# 3. OR if using External Secrets Operator, sync the secret
kubectl annotate externalsecret app-secret \
  force-sync=$(date +%s) \
  --overwrite -n gravl-production
# 4. Restart pods to pick up secrets
kubectl rollout restart deployment/backend -n gravl-production
# 5. Monitor
kubectl rollout status deployment/backend -n gravl-production
```
**Expected Timeline:**
- 0-1 min: Detect missing secrets
- 1-2 min: Restore secrets
- 2-4 min: Pod restart + readiness
**Verification:**
- [ ] Secrets present: `kubectl get secrets -n gravl-production` (key-listing sketch after this list)
- [ ] Pods restarted and healthy
- [ ] No "secret not found" errors in logs
---
## Full Rollback (Nuclear Option)
**Use only if the scenarios above don't apply or don't resolve the issue.**
```bash
# 1. STOP ALL GRAVL SERVICES
kubectl scale deployment backend --replicas=0 -n gravl-production
kubectl scale deployment frontend --replicas=0 -n gravl-production
# 2. VERIFY DATABASE IS SAFE (CHECK BACKUP)
# Don't delete anything yet!
# 3. DELETE PRODUCTION NAMESPACE (CAREFUL!)
# kubectl delete namespace gravl-production
# (Only if you have offsite backup and are 100% sure)
# 4. RESTORE FROM BACKUP
# This depends on your backup solution:
## Option A: Velero (cluster-wide backup)
# velero restore create --from-backup gravl-prod-2026-03-06-08-00
## Option B: Manual restore (infrastructure as code)
# kubectl apply -f k8s/production/namespace.yaml
# kubectl apply -f k8s/production/rbac.yaml
# kubectl apply -f k8s/production/secrets.yaml
# kubectl apply -f k8s/production/statefulsets.yaml
# ... (all resources for v1.2.3)
# 5. RESTORE DATABASE FROM BACKUP
# aws rds restore-db-instance-from-db-snapshot ...
# OR restore from pg_dump / backup file
# 6. VERIFY EVERYTHING
kubectl get all -n gravl-production
kubectl logs -n gravl-production -l component=backend | grep -i error | head -10
```
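If the restore goes through Velero (Option A above), its CLI can report progress while you wait; a sketch assuming the `velero` CLI is installed and pointed at this cluster:
```bash
# Find the restore created from the backup, then watch its progress and pull any warnings.
velero restore get
velero restore describe <restore-name> --details
velero restore logs <restore-name>
```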
**Expected Timeline:** 15-60 minutes (depending on backup size and complexity)
---
## Post-Rollback Actions
### 1. Verify Service Health (5 minutes)
```bash
# Check all endpoints
curl https://gravl.example.com/api/health
# Verify dashboards
# (Login to Grafana, ensure metrics flowing)
# Check alert status
# (Should have no firing alerts related to rollback)
```
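A single successful curl can be a fluke mid-rollout; a short poll against the same hypothetical health path gives a better signal that the service has actually stabilised:
```bash
# Poll the health endpoint every 5 seconds for a minute; expect a steady run of 200s.
for i in $(seq 1 12); do
  curl -s -o /dev/null -w "%{http_code} " https://gravl.example.com/api/health
  sleep 5
done; echo
```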
### 2. Communicate Status (Immediately)
```bash
# Slack #gravl-incident
# "✅ Rollback complete. Service restored to v1.2.3. RCA scheduled for [tomorrow]"
# Update status page (if external-facing)
# "Production: Operational (rolled back to previous version)"
```
### 3. Root Cause Analysis (Within 24 hours)
- [ ] What went wrong in v1.3.0?
- [ ] How did we not catch this in staging?
- [ ] How do we prevent this in the future?
- [ ] Blameless postmortem (focus on process, not people)
### 4. Fix & Re-deploy (Next 24-72 hours)
- [ ] Fix the issue
- [ ] Thorough testing in staging
- [ ] Peer review of changes
- [ ] Plan new deployment (with team consensus)
---
## Rollback Checklist (Keep In Cockpit During Incident)
```
INCIDENT RESPONSE
[ ] Page on-call engineer
[ ] Slack alert to #gravl-incident
[ ] Check monitoring dashboard
[ ] Review error logs
[ ] Assess: Fix-in-place or rollback?
IF ROLLBACK:
[ ] Identify previous version (backend, frontend, database)
[ ] Verify backup exists and is recent
[ ] Alert team: "Rolling back to vX.Y.Z"
[ ] Execute rollback (see scenarios above)
[ ] Monitor rollout (every 30 seconds)
[ ] Health checks passing? (API, DB, ingress)
[ ] External test (curl health endpoint)
[ ] Metrics returning to normal?
POST-ROLLBACK
[ ] Slack: Service status update
[ ] Update status page (if applicable)
[ ] Create incident ticket for RCA
[ ] Schedule postmortem for tomorrow
[ ] Document what happened + what to improve
```
---
## Automation & Testing
### Rollback Drill (Monthly)
```bash
# Test rollback procedure in staging without actually rolling back production
# 1. Deploy new version to staging
# 2. Follow rollback steps (but against staging namespace)
# 3. Verify it works
# 4. Document any issues found
# 5. Update this runbook
```
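One concrete way to run the drill, as a sketch, is to exercise the same commands against a staging namespace; the `gravl-staging` namespace and image tags below are assumptions, substitute your own:
```bash
NS=gravl-staging
# Deploy the "new" version, then practice the rollback path end to end.
kubectl -n "$NS" set image deployment/backend backend=gravl-backend:v1.3.0
kubectl -n "$NS" rollout status deployment/backend
kubectl -n "$NS" set image deployment/backend backend=gravl-backend:v1.2.3
kubectl -n "$NS" rollout status deployment/backend
kubectl -n "$NS" get pods -l component=backend
```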
### Backup Verification (Weekly)
```bash
# Ensure backups are recent and restorable
# 1. Check last backup timestamp
# 2. Test restore to staging from backup
# 3. Verify data integrity
```
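For step 1 (checking the last backup timestamp), a minimal sketch against the backup job referenced earlier in this runbook; the Job naming is an assumption:
```bash
# Most recently completed backup jobs, newest last.
kubectl get jobs -n gravl-monitoring --sort-by=.status.completionTime
# Tail of the last backup job's log, to confirm it actually finished.
kubectl logs -n gravl-monitoring job/backup-job | tail -5
```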
---
## Support & Escalation
**If you're unsure about rollback:**
1. Page senior engineer (don't hesitate)
2. Isolate the problem (stop creating new pods, scale to 0)
3. Preserve logs (don't delete anything until RCA is done)
4. Get expert help before rolling back
**Post-Incident Contact:**
- Incident lead: [NAME/SLACK]
- On-call manager: [NAME/SLACK]
- Database expert: [NAME/SLACK]
---
**Document Version:** 1.0
**Last Updated:** 2026-03-06 08:50
**Next Review:** After first production rollback or after 30 days (whichever comes first)