Gravl Disaster Recovery & Backup Strategy

Phase: 10-06 (Kubernetes & Advanced Monitoring)
Date: 2026-03-04
Status: Production Ready
Owner: DevOps / SRE Team


Table of Contents

  1. Executive Summary
  2. RTO/RPO Strategy
  3. Backup Architecture
  4. PostgreSQL Backup Procedures
  5. Restore Procedures
  6. Backup Testing & Validation
  7. Multi-Region Failover Design
  8. Monitoring & Alerting
  9. Disaster Recovery Runbooks
  10. Implementation Checklist

Executive Summary

Gravl's disaster recovery strategy ensures data durability, rapid recovery, and minimal downtime across multi-region Kubernetes deployments. The approach combines:

  • Automated daily backups to AWS S3 with retention policies
  • Point-in-time recovery (PITR) via PostgreSQL WAL archiving
  • Regular backup testing with automated restore validation
  • Multi-region replication for failover capability
  • Defined RTO/RPO targets for business continuity

Key Metrics:

  • RPO (Recovery Point Objective): <1 hour (maximum data loss)
  • RTO (Recovery Time Objective): <4 hours (maximum downtime)
  • Backup Retention: 30 days of daily backups + 7-year archive
  • Testing Frequency: Weekly automated restore tests

RTO/RPO Strategy

Recovery Point Objective (RPO)

Target: <1 hour

Mechanism:

  • Daily full backups at 02:00 UTC (to S3)
  • Hourly incremental backups via WAL archiving
  • PostgreSQL point-in-time recovery enabled
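
The WAL side of this is standard PostgreSQL continuous archiving. A minimal sketch of enabling it by hand is below; the bucket name, namespace, and StatefulSet name are assumptions, and the production settings are managed in the database configuration rather than applied ad hoc.

```bash
# Sketch only: enable continuous WAL archiving to S3 from inside the DB pod
# (bucket, namespace, and StatefulSet name are assumptions)
kubectl exec -n gravl-prod gravl-db-0 -- psql -U postgres -c \
  "ALTER SYSTEM SET archive_mode = 'on';"
kubectl exec -n gravl-prod gravl-db-0 -- psql -U postgres -c \
  "ALTER SYSTEM SET archive_command = 'aws s3 cp %p s3://gravl-backups/wal-archives/%f';"
# archive_mode only takes effect after a restart of the database pod
kubectl rollout restart statefulset/gravl-db -n gravl-prod
```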

RPO Calculation:

Worst case: the most recent full backup is up to 24 hours old, and recovery replays the archived WAL on top of it.
Maximum data loss: ~1 hour of transactions, i.e. whatever has accumulated since the last WAL archive.

Acceptable Business Impact:

  • Lose up to 1 hour of transactions
  • Suitable for business operations (not mission-critical)
  • Can be tightened to 15-min RPO with more frequent backups

Recovery Time Objective (RTO)

Target: <4 hours

Phases:

  1. Detection & Assessment (0-30 min)

    • Automated monitoring detects failure
    • On-call engineer is paged
    • Backup integrity is verified
  2. Failover Initiation (30-60 min)

    • Secondary region is promoted
    • DNS records are updated
    • Application servers redirect to standby DB
  3. Validation & Cutover (60-120 min)

    • Application connectivity verified
    • Data consistency checks
    • Customer notification sent
  4. Full Recovery (120-240 min)

    • Primary region is recovered
    • Data synchronization
    • Failback to primary (if applicable)

Time Breakdown:

Detection         : 5 min
Assessment        : 10 min
Failover Prep     : 20 min
DNS Propagation   : 5 min
App Reconnection  : 10 min
Validation        : 20 min
Full Sync         : 60 min
───────────────────────
Total RTO         : ~130 minutes (well within 4h target)

SLA Commitments

| Metric | Target | Current | Status |
|--------|--------|---------|--------|
| RPO | <1 hour | <1 hour | Met |
| RTO | <4 hours | ~2.2 hours | Met |
| Backup Success Rate | 99.5% | TBD (post-deploy) | 🔄 Monitor |
| PITR Window | 7 days | 7 days | Ready |
| Restore Success Rate | 100% | TBD (post-test) | 🔄 Test |

Backup Architecture

Overview

┌──────────────────────┐
│   PostgreSQL Pod     │
│   (gravl-db-0)       │
└──────────┬───────────┘
           │
     ┌─────▼──────────────────────────┐
     │  WAL Archiving (continuous)    │
     │  WAL files → S3 Bucket         │
     └────────────────────────────────┘
           │
     ┌─────▼──────────────────────────┐
     │  CronJob (Daily 02:00 UTC)     │
     │  - Full backup via pg_dump     │
     │  - Compression (gzip)          │
     │  - S3 upload                   │
     │  - Retention policy (30 days)  │
     └────────────────────────────────┘
           │
     ┌─────▼──────────────────────────┐
     │   S3 Backup Bucket             │
     │  - Daily backups               │
     │  - WAL archives                │
     │  - Replication to us-east-1    │
     └────────────────────────────────┘
           │
     ┌─────▼──────────────────────────┐
     │  Backup Validation Pod         │
     │  (Weekly restore test)         │
     │  - Restore to ephemeral DB     │
     │  - Run validation queries      │
     │  - Verify data integrity       │
     └────────────────────────────────┘

Components

1. Daily Full Backup (CronJob)

Schedule: Daily at 02:00 UTC
Duration: ~5-15 minutes (depends on data size)
Output: gravl_YYYY-MM-DD.sql.gz in S3

2. WAL Archiving (Continuous)

Schedule: Automatic (every ~16 MB of WAL)
Output: WAL files stored in S3 wal-archives/

3. Weekly Restore Test (CronJob)

Schedule: Every Sunday at 03:00 UTC
Duration: ~30-60 minutes
Validates: Backup integrity, restore procedure, data consistency


PostgreSQL Backup Procedures

See scripts/backup.sh for implementation.

Manual Full Backup

Prerequisites:

  • kubectl access to gravl-db pod
  • AWS credentials configured with S3 access
  • PostgreSQL admin credentials

Usage:

./scripts/backup.sh --full --region eu-north-1 --dry-run
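
For orientation, a full backup reduces to roughly the sequence below; the database name, bucket, and prefix are assumptions, and scripts/backup.sh remains the authoritative implementation.

```bash
# Manual equivalent of the --full path (database name and bucket are assumptions)
BACKUP_FILE="gravl_$(date +%F).sql.gz"
kubectl exec -n gravl-prod gravl-db-0 -- \
  pg_dump -U postgres -d gravl | gzip > "$BACKUP_FILE"
aws s3 cp "$BACKUP_FILE" "s3://gravl-backups/daily/$BACKUP_FILE" --region eu-north-1
```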

Automated Backup (CronJob)

See k8s/backup/postgres-backup-cronjob.yaml for full implementation.

Key Features:

  • Service account with S3 permissions
  • Automatic retry (3 attempts)
  • Slack/email notifications on success/failure
  • Backup manifest generation
  • Old backup cleanup (retention policy)
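
When an out-of-band backup is needed, a one-off Job can be created from the CronJob instead of waiting for the schedule; the CronJob name below is assumed from the manifest filename.

```bash
# Trigger a one-off run of the daily backup CronJob and follow it to completion
JOB="postgres-backup-manual-$(date +%s)"
kubectl create job --from=cronjob/postgres-backup "$JOB" -n gravl-prod
kubectl wait --for=condition=complete "job/$JOB" -n gravl-prod --timeout=30m
kubectl logs "job/$JOB" -n gravl-prod
```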

Restore Procedures

See scripts/restore.sh for implementation.

Point-in-Time Recovery (PITR)

When to Use:

  • Accidental data deletion
  • Logical corruption (not physical)
  • Rollback to specific timestamp
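
PITR replays archived WAL on top of a physical base backup rather than the logical daily dump, so the sketch below assumes such a base backup exists in S3. Bucket, paths, and the target timestamp are placeholders; scripts/restore.sh is the authoritative procedure.

```bash
# Hedged PITR sketch for PostgreSQL 12+ (all values are placeholders)
export PGDATA=/var/lib/postgresql/data

# 1. Lay down the most recent physical base backup taken before the target time
aws s3 cp s3://gravl-backups/base/base_2026-03-04.tar.gz - | tar -xzf - -C "$PGDATA"

# 2. Point recovery at the archived WAL and the desired timestamp
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
restore_command = 'aws s3 cp s3://gravl-backups/wal-archives/%f %p'
recovery_target_time = '2026-03-04 14:30:00+00'
recovery_target_action = 'promote'
EOF
touch "$PGDATA/recovery.signal"

# 3. Start PostgreSQL: it replays WAL up to the target time, then promotes
pg_ctl -D "$PGDATA" start
```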

Full Database Restore

When to Use:

  • Complete primary failure
  • Corruption of entire database
  • Cluster migration
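
A full logical restore from the daily dump looks roughly like the sketch below; the database and bucket names are assumptions, and scripts/restore.sh covers the real flow.

```bash
# Restore the daily dump into a fresh database (names are assumptions)
aws s3 cp s3://gravl-backups/daily/gravl_2026-03-04.sql.gz .
kubectl exec -n gravl-prod gravl-db-0 -- createdb -U postgres gravl_restore
gunzip -c gravl_2026-03-04.sql.gz \
  | kubectl exec -i -n gravl-prod gravl-db-0 -- psql -U postgres -d gravl_restore
```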

Backup Testing & Validation

Automated Weekly Restore Test

Schedule: Every Sunday at 03:00 UTC
Duration: ~45 minutes
Output: Test report in S3 and monitoring system

Test Coverage:

  1. Backup Integrity - Table counts
  2. Data Consistency - Referential integrity checks
  3. Index Validity - REINDEX test
  4. Transaction Log - WAL position verification
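
Illustrative versions of checks 1, 2, and 4 are sketched below; the database and table names are assumptions, and scripts/test-restore.sh holds the actual checks.

```bash
# Validation queries against the ephemeral restore (names are assumptions)
psql -U postgres -d gravl_restore_test <<'SQL'
-- 1. Backup integrity: approximate per-table row counts to compare with the manifest
SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY relname;
-- 2. Data consistency: orphaned rows would indicate broken foreign keys
SELECT count(*) AS orphaned_logs
FROM workout_logs wl
LEFT JOIN users u ON u.id = wl.user_id
WHERE u.id IS NULL;
-- 4. Transaction log: WAL position reached after replay
SELECT pg_current_wal_lsn();
SQL
```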

Manual Restore Test Procedure

See scripts/test-restore.sh for implementation.


Multi-Region Failover Design

Architecture

Primary Region (eu-north-1)
├── PostgreSQL Primary (Master)
├── WAL Streaming → Secondary
└── Backup → S3 multi-region

      ↓ Cross-region replication
      
Secondary Region (us-east-1)
├── PostgreSQL Replica (Read-Only)
├── Can be promoted to primary
└── Backup → S3 secondary bucket

Failover Procedures

Automatic Failover (Promoted Secondary)

See scripts/failover.sh for implementation.

Trigger Conditions:

  • Primary PostgreSQL pod crashes or becomes unresponsive
  • Network partition detected (no heartbeat for 5 minutes)
  • Disk failure on primary
  • Manual failover command initiated
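
The promotion itself is a single call on the replica, sketched below; the replica pod and namespace names are assumptions, and scripts/failover.sh wraps this with the DNS update and application cutover.

```bash
# Promote the standby (pod and namespace names are assumptions)
kubectl exec -n gravl-prod-us gravl-db-replica-0 -- \
  psql -U postgres -c "SELECT pg_promote(wait => true);"
# Confirm the former replica now accepts writes (expect "f")
kubectl exec -n gravl-prod-us gravl-db-replica-0 -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"
```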

Manual Failback (Return to Primary)

See scripts/failback.sh for implementation.

Prerequisites:

  • Primary region is healthy and recovered
  • Data is synchronized from secondary backup
  • Monitoring confirms primary readiness
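
Before failing back, the recovered primary has to be re-seeded as a replica of the promoted secondary, which in outline looks like the command below; the hostname and replication user are assumptions, and scripts/failback.sh is the authoritative procedure.

```bash
# Re-seed the recovered primary from the promoted secondary (names are assumptions)
pg_basebackup -h gravl-db.us-east-1.internal -U replicator \
  -D /var/lib/postgresql/data -R --wal-method=stream
```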

Monitoring & Alerting

Key Metrics to Monitor

| Metric | Target | Alert Threshold | Check Frequency |
|--------|--------|-----------------|-----------------|
| Last successful backup | Daily | >24h since backup | Every 30 min |
| Backup size deviation | ±20% | >±50% change | Daily |
| WAL archive lag | <5 min | >15 min | Every 5 min |
| S3 upload time | <10 min | >20 min | Per backup |
| Database replication lag | <1 min | >5 min | Every 30 sec |
| PITR validation success | 100% | Any failure | Weekly |

Prometheus Rules

See k8s/monitoring/prometheus-rules-dr.yaml for full implementation.
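
As a sanity check outside AlertManager, the "last successful backup" condition can also be evaluated ad hoc against the Prometheus HTTP API; the metric name and service address below are assumptions.

```bash
# Ad hoc evaluation of the "backup older than 24h" condition
curl -sG http://prometheus.monitoring.svc:9090/api/v1/query \
  --data-urlencode 'query=time() - gravl_backup_last_success_timestamp_seconds > 86400'
```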

Grafana Dashboard

Name: gravl-disaster-recovery.json
Location: k8s/monitoring/dashboards/

Panels:

  1. Backup History (success/failure timeline)
  2. Backup Duration (daily average)
  3. S3 Storage Used (trend)
  4. WAL Archive Lag (real-time)
  5. Replication Status (primary/secondary lag)
  6. PITR Test Results (weekly)

Disaster Recovery Runbooks

Scenario 1: Primary Database Pod Crash

Detection: Pod restart detected, or failed health checks

Steps:

  1. Check pod logs: kubectl logs -f gravl-db-0 -n gravl-prod
  2. Verify PVC status: kubectl get pvc -n gravl-prod
  3. If the data or volume is corrupted, restore from backup
  4. If it is an infrastructure failure, allow Kubernetes to reschedule the pod

Expected RTO: <5 minutes (auto-restart)


Scenario 2: Accidental Data Deletion

Detection: User reports missing data, or consistency check fails

Steps:

  1. STOP: Prevent further writes (read-only mode)
  2. Identify: Determine deletion timestamp
  3. Create recovery pod
  4. Restore to point before deletion
  5. Export recovered data
  6. Apply differential to production database
  7. Verify: Run validation queries
  8. Resume: Restore write access
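
Steps 1 and 5 might look like the sketch below; the database and table names are placeholders, and the recovery pod follows the PITR procedure from the Restore Procedures section.

```bash
# Step 1: freeze writes on production (affects new sessions only;
# existing connections must be recycled)
kubectl exec -n gravl-prod gravl-db-0 -- psql -U postgres -c \
  "ALTER DATABASE gravl SET default_transaction_read_only = on;"
# Step 5: export only the affected rows from the recovered copy
pg_dump -U postgres -d gravl_recovered --table=workout_logs --data-only \
  > recovered_rows.sql
```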

Expected RTO: 1-2 hours


Scenario 3: Primary Region Outage

Detection: Multiple pod crashes, network timeout, or manual notification

Steps:

  1. Confirm outage: Try connecting from local machine
  2. Check AWS status page
  3. Initiate failover: Run ./scripts/failover.sh
  4. Verify: Test connectivity to secondary database
  5. Notify: Post incident update to Slack
  6. Monitor: Watch replication lag and app errors
  7. Investigate: Review logs and metrics after stabilization
  8. Failback: Once primary recovers (see failback procedure)

Expected RTO: <4 hours


Scenario 4: Backup Restore Test Failure

Detection: Automated weekly test fails

Steps:

  1. Check test logs
  2. Verify backup file: Integrity, size, checksum
  3. Manual restore test: Run ./scripts/restore.sh with --debug flag
  4. Identify issue: Data corruption, missing WAL, or environment problem
  5. If backup corrupted: Restore from older backup (7-day window)
  6. Document: Update runbook with findings
  7. Alert: Notify on-call if underlying issue found
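
Step 2 can be sketched as a few basic artifact checks; the bucket and key below are assumptions.

```bash
# Does the object exist with the expected size? Is the gzip stream intact?
aws s3api head-object --bucket gravl-backups --key daily/gravl_2026-03-01.sql.gz
aws s3 cp s3://gravl-backups/daily/gravl_2026-03-01.sql.gz .
gunzip -t gravl_2026-03-01.sql.gz
sha256sum gravl_2026-03-01.sql.gz   # compare against the backup manifest
```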

Expected Resolution: 30-60 minutes


Implementation Checklist

Pre-Deployment

  • AWS S3 buckets created (primary + replica regions)
  • Bucket versioning enabled
  • Cross-region replication configured
  • IAM roles and policies created for backup service account
  • PostgreSQL backup user created with appropriate permissions
  • WAL archiving configured on primary database
  • Secrets configured in Kubernetes (AWS credentials)

Kubernetes Resources

  • k8s/backup/postgres-backup-cronjob.yaml - Daily backup CronJob
  • k8s/backup/postgres-restore-job.yaml - One-time restore Job template
  • k8s/backup/postgres-test-cronjob.yaml - Weekly restore test
  • k8s/backup/backup-rbac.yaml - Service account + RBAC
  • k8s/monitoring/prometheus-rules-dr.yaml - Alert rules
  • k8s/monitoring/dashboards/gravl-disaster-recovery.json - Grafana dashboard

Scripts

  • scripts/backup.sh - Manual backup with S3 upload
  • scripts/restore.sh - Manual restore from backup
  • scripts/test-restore.sh - Backup validation
  • scripts/failover.sh - Failover to secondary
  • scripts/failback.sh - Failback to primary

Documentation

  • DISASTER_RECOVERY.md (this document)
  • Runbooks in docs/runbooks/
  • Architecture diagram in K8S_ARCHITECTURE.md
  • Team training and certification

Testing

  • Manual backup test
  • Manual restore test (dev environment)
  • Manual restore test (staging environment)
  • PITR test (point-in-time recovery)
  • Failover test (secondary region)
  • End-to-end DR exercise (quarterly)

Monitoring & Alerting

  • Prometheus rules deployed
  • AlertManager configured
  • Slack webhook configured
  • Grafana dashboards created
  • On-call escalation configured

References


Last Updated: 2026-03-04
Next Review: 2026-04-04
Owner: DevOps / SRE Team