Phase 06 Tier 1: Complete Backend Implementation - Recovery Tracking & Swap System
COMPLETED TASKS:

✅ 06-01: Workout Swap System
- Added swapped_from_id to workout_logs
- Created workout_swaps table for history
- POST /api/workouts/:id/swap endpoint
- GET /api/workouts/available endpoint
- Reversible swaps with audit trail

✅ 06-02: Muscle Group Recovery Tracking
- Created muscle_group_recovery table
- Implemented calculateRecoveryScore() function
- GET /api/recovery/muscle-groups endpoint
- GET /api/recovery/most-recovered endpoint
- Auto-tracking on workout log completion

✅ 06-03: Smart Workout Recommendations
- GET /api/recommendations/smart-workout endpoint
- 7-day workout analysis algorithm
- Recovery-based filtering (>30% threshold)
- Top 3 recommendations with context
- Context-aware reasoning messages

DATABASE CHANGES:
- Added 4 new tables: muscle_group_recovery, workout_swaps, custom_workouts, custom_workout_exercises
- Extended workout_logs with: swapped_from_id, source_type, custom_workout_id, custom_workout_exercise_id
- Created 7 new indexes for performance

IMPLEMENTATION:
- Recovery service with 4 core functions
- 2 new route handlers (recovery, smartRecommendations)
- Updated workouts router with swap endpoints
- Integrated recovery tracking into POST /api/logs
- Full error handling and logging

TESTING:
- Test file created: /backend/test/phase-06-tests.js
- Ready for E2E and staging validation

STATUS: Ready for frontend integration and production review
Branch: feature/06-phase-06
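For context, a minimal sketch of the recovery-scoring and recommendation flow described above (hypothetical: the function signatures, the linear recovery model, and the field names are illustrative; the actual logic lives in the recovery service and the smartRecommendations route handler):

// Recovery score: 0% immediately after a muscle group is trained, climbing
// toward 100% as recovery hours elapse (simple linear model assumed here).
function calculateRecoveryScore(lastTrainedAt, recoveryHours = 48, now = Date.now()) {
  const hoursSince = (now - lastTrainedAt.getTime()) / 36e5; // ms -> hours
  return Math.min(100, Math.max(0, (hoursSince / recoveryHours) * 100));
}

// Smart recommendations: score each candidate workout by the average recovery
// of its target muscle groups, apply the >30% recovery filter, and return the
// top 3 with a context-aware reasoning message.
function recommendWorkouts(workouts, recoveryByGroup) {
  return workouts
    .map((w) => {
      const scores = w.muscleGroups.map((g) => recoveryByGroup[g] ?? 100);
      return { ...w, avgRecovery: scores.reduce((a, b) => a + b, 0) / scores.length };
    })
    .filter((w) => w.avgRecovery > 30) // recovery-based filtering (>30% threshold)
    .sort((a, b) => b.avgRecovery - a.avgRecovery)
    .slice(0, 3)
    .map((w) => ({
      ...w,
      reason: `Targets ${w.muscleGroups.join(', ')} at ~${Math.round(w.avgRecovery)}% recovered`,
    }));
}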
# Prometheus PrometheusRule for Disaster Recovery Monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disaster-recovery-rules
  namespace: gravl-monitoring
  labels:
    app: gravl
    component: monitoring
    role: disaster-recovery
spec:
  groups:
    - name: disaster-recovery
      interval: 30s
      rules:
        # Alert: No daily backup in 24+ hours
        - alert: NoDailyBackup
          expr: |
            (time() - backup_last_success_timestamp{type="daily"}) > 86400
          for: 1h
          labels:
            # severity is a label (not an annotation) so Alertmanager can route on it
            severity: critical
            component: backup
            slo: rpo
          annotations:
            summary: "Daily backup missing for {{ $value | humanizeDuration }}"
            description: |
              No successful daily backup has been completed in the last 24 hours.
              This violates the RPO target of <1 hour.
              Action: Check backup CronJob logs and restore connectivity to S3.
        # Alert: Backup size deviation (likely corruption)
        - alert: BackupSizeDeviation
          expr: |
            abs(backup_size_bytes - avg_over_time(backup_size_bytes[7d]))
              / avg_over_time(backup_size_bytes[7d]) > 0.5
          for: 30m
          labels:
            severity: warning
            component: backup
          annotations:
            # humanizePercentage expects a ratio (0.5 renders as 50%)
            summary: "Backup size deviated >50%: {{ $value | humanizePercentage }}"
            description: |
              Latest backup size differs significantly from the historical average.
              This may indicate data corruption or an incomplete backup.
              Action: Review backup logs and test restore from the previous backup.
        # Alert: WAL archive lagging
        - alert: WALArchiveLagging
          expr: |
            wal_archive_lag_seconds > 900
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rpo
          annotations:
            summary: "WAL archive lagging: {{ $value | humanizeDuration }}"
            description: |
              PostgreSQL WAL files are not being archived to S3 within the expected timeframe.
              This impacts the RPO (Recovery Point Objective).
              Current lag: {{ $value }}s (target: <300s)
              Action: Check the postgres WAL archiver status and S3 connectivity.
        # Alert: S3 upload performance degraded
        - alert: S3UploadSlow
          expr: |
            backup_upload_duration_seconds > 1200
          for: 10m
          labels:
            severity: warning
            component: storage
          annotations:
            summary: "S3 backup upload taking {{ $value | humanizeDuration }}"
            description: |
              Backup upload to S3 is taking longer than expected.
              This may indicate network issues or S3 throttling.
              Target duration: <600s
              Current duration: {{ $value }}s
              Action: Check network connectivity and S3 bucket metrics.
        # Alert: Database replication lagging
        - alert: HighReplicationLag
          expr: |
            # Bytes the primary's WAL insert position is ahead of the replica slot's restart position
            pg_wal_insert_lsn_bytes - pg_replication_slot_restart_lsn_bytes > 1073741824
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rto
          annotations:
            summary: "Replication lag: {{ $value | humanize1024 }}B"
            description: |
              Secondary database replica is lagging significantly behind the primary.
              This impacts failover capability.
              Current lag: {{ $value | humanize1024 }}B (target: <100MB)
              Action: Check the network between regions and the replica pod status.
        # Alert: Backup restore test failure
        - alert: BackupRestoreTestFailed
          expr: |
            backup_restore_test_success == 0
          for: 10m
          labels:
            severity: critical
            component: backup
            slo: rto
          annotations:
            summary: "Backup restore test failed"
            description: |
              The weekly automated backup restore test has failed.
              This indicates backups may not be recoverable.
              Action: Review test logs and manually verify backup integrity.
        # Alert: Primary database down (failover trigger)
        - alert: PrimaryDatabaseDown
          expr: |
            up{job="postgresql-primary"} == 0
          for: 2m
          labels:
            severity: critical
            component: database
            slo: rto
          annotations:
            summary: "Primary database unreachable"
            description: |
              The primary PostgreSQL database is not responding to health checks.
              Failover to the secondary may be required.
              Action: Check pod status with kubectl; consider automatic failover.
        # Alert: Secondary database replication stopped
        - alert: SecondaryReplicationDown
          expr: |
            pg_replication_slot_active == 0
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rpo
          annotations:
            summary: "Secondary replication connection lost"
            description: |
              Replication from the primary to the secondary database has stopped.
              The secondary will become stale, and failover will risk data loss.
              Action: Check network connectivity and logs on both primary and secondary.
        # Info: Backup statistics
        - alert: BackupStatsInfo
          expr: |
            increase(backup_job_total[24h]) > 0
          for: 1h
          labels:
            severity: info
            component: backup
          annotations:
            summary: "Daily backup stats: {{ $value }} backups in last 24h"
            description: |
              Informational alert for backup statistics,
              used for success-rate and performance monitoring.
    # Recording rules for aggregation
    - name: disaster-recovery-recording
      interval: 1m
      rules:

        # Average backup size over 7 days
        - record: backup:size:avg:7d
          expr: avg_over_time(backup_size_bytes[7d])

        # Backup success rate over 24h (successful jobs / total jobs)
        - record: backup:success:rate:24h
          expr: increase(backup_job_success_total[24h]) / increase(backup_job_total[24h])

        # Maximum WAL archive lag over 5 minutes
        - record: wal:lag:max:5m
          expr: max_over_time(wal_archive_lag_seconds[5m])

        # Average replication lag in bytes (primary insert position ahead of replica slot)
        - record: replication:lag:avg:5m
          expr: avg(pg_wal_insert_lsn_bytes - pg_replication_slot_restart_lsn_bytes)
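The expressions above query custom metrics (backup_last_success_timestamp, backup_size_bytes, backup_upload_duration_seconds, and friends) that no standard exporter ships by default, so the backup job has to publish them itself. A minimal sketch of one way to do that from a Node backup job, using prom-client and a Pushgateway (the Pushgateway URL and job name are assumptions; metric names match the rules above):

// Publishes backup metrics to a Pushgateway so Prometheus can scrape them.
const client = require('prom-client');

const registry = new client.Registry();

const lastSuccess = new client.Gauge({
  name: 'backup_last_success_timestamp',
  help: 'Unix timestamp of the last successful backup',
  labelNames: ['type'],
  registers: [registry],
});
const backupSize = new client.Gauge({
  name: 'backup_size_bytes',
  help: 'Size of the most recent backup archive in bytes',
  registers: [registry],
});
const uploadDuration = new client.Gauge({
  name: 'backup_upload_duration_seconds',
  help: 'Time taken to upload the backup to S3',
  registers: [registry],
});

async function reportBackupSuccess({ sizeBytes, uploadSeconds }) {
  lastSuccess.labels('daily').setToCurrentTime(); // Unix seconds
  backupSize.set(sizeBytes);
  uploadDuration.set(uploadSeconds);
  // Assumed in-cluster Pushgateway address; adjust to the real service.
  const gateway = new client.Pushgateway('http://pushgateway.gravl-monitoring:9091', {}, registry);
  await gateway.pushAdd({ jobName: 'db-backup' });
}

Before applying the manifest, the rule syntax can be checked offline with promtool (promtool check rules <file>), which catches PromQL and YAML errors early.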