Phase 06 Tier 1: Complete Backend Implementation - Recovery Tracking & Swap System
COMPLETED TASKS:

✅ 06-01: Workout Swap System
- Added swapped_from_id to workout_logs
- Created workout_swaps table for history
- POST /api/workouts/:id/swap endpoint
- GET /api/workouts/available endpoint
- Reversible swaps with audit trail

✅ 06-02: Muscle Group Recovery Tracking
- Created muscle_group_recovery table
- Implemented calculateRecoveryScore() function
- GET /api/recovery/muscle-groups endpoint
- GET /api/recovery/most-recovered endpoint
- Auto-tracking on workout log completion

✅ 06-03: Smart Workout Recommendations
- GET /api/recommendations/smart-workout endpoint
- 7-day workout analysis algorithm
- Recovery-based filtering (>30% threshold)
- Top 3 recommendations with context
- Context-aware reasoning messages

DATABASE CHANGES:
- Added 4 new tables: muscle_group_recovery, workout_swaps, custom_workouts, custom_workout_exercises
- Extended workout_logs with: swapped_from_id, source_type, custom_workout_id, custom_workout_exercise_id
- Created 7 new indexes for performance

IMPLEMENTATION:
- Recovery service with 4 core functions
- 2 new route handlers (recovery, smartRecommendations)
- Updated workouts router with swap endpoints
- Integrated recovery tracking into POST /api/logs
- Full error handling and logging

TESTING:
- Test file created: /backend/test/phase-06-tests.js
- Ready for E2E and staging validation

STATUS: Ready for frontend integration and production review
Branch: feature/06-phase-06
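For context, a minimal sketch of the recovery-scoring and recommendation flow described above (hypothetical: the function signatures, the linear recovery model, and the field names are illustrative; the actual logic lives in the recovery service and the smartRecommendations route handler):

// Recovery score: 0% immediately after a muscle group is trained, climbing
// toward 100% as recovery hours elapse (simple linear model assumed here).
function calculateRecoveryScore(lastTrainedAt, recoveryHours = 48, now = Date.now()) {
  const hoursSince = (now - lastTrainedAt.getTime()) / 36e5; // ms -> hours
  return Math.min(100, Math.max(0, (hoursSince / recoveryHours) * 100));
}

// Smart recommendations: score each candidate workout by the average recovery
// of its target muscle groups, apply the >30% recovery filter, and return the
// top 3 with a context-aware reasoning message.
function recommendWorkouts(workouts, recoveryByGroup) {
  return workouts
    .map((w) => {
      const scores = w.muscleGroups.map((g) => recoveryByGroup[g] ?? 100);
      return { ...w, avgRecovery: scores.reduce((a, b) => a + b, 0) / scores.length };
    })
    .filter((w) => w.avgRecovery > 30) // recovery-based filtering (>30% threshold)
    .sort((a, b) => b.avgRecovery - a.avgRecovery)
    .slice(0, 3)
    .map((w) => ({
      ...w,
      reason: `Targets ${w.muscleGroups.join(', ')} at ~${Math.round(w.avgRecovery)}% recovered`,
    }));
}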
# Prometheus PrometheusRule for Disaster Recovery Monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disaster-recovery-rules
  namespace: gravl-monitoring
  labels:
    app: gravl
    component: monitoring
    role: disaster-recovery
spec:
  groups:
    - name: disaster-recovery
      interval: 30s
      rules:
        # Alert: No daily backup in 24+ hours
        - alert: NoDailyBackup
          expr: |
            (time() - backup_last_success_timestamp{type="daily"}) > 86400
          for: 1h
          labels:
            # severity is a label (not an annotation) so Alertmanager can route on it
            severity: critical
            component: backup
            slo: rpo
          annotations:
            summary: "Daily backup missing for {{ $value | humanizeDuration }}"
            description: |
              No successful daily backup has been completed in the last 24 hours.
              This violates the RPO target of <1 hour.
              Action: Check backup CronJob logs and restore connectivity to S3.
        # Alert: Backup size deviation (likely corruption)
        - alert: BackupSizeDeviation
          expr: |
            abs(backup_size_bytes - avg_over_time(backup_size_bytes[7d]))
              / avg_over_time(backup_size_bytes[7d]) > 0.5
          for: 30m
          labels:
            severity: warning
            component: backup
          annotations:
            # humanizePercentage expects a ratio (0.5 renders as 50%)
            summary: "Backup size deviated >50%: {{ $value | humanizePercentage }}"
            description: |
              Latest backup size differs significantly from the historical average.
              This may indicate data corruption or an incomplete backup.
              Action: Review backup logs and test restore from the previous backup.
        # Alert: WAL archive lagging
        - alert: WALArchiveLagging
          expr: |
            wal_archive_lag_seconds > 900
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rpo
          annotations:
            summary: "WAL archive lagging: {{ $value | humanizeDuration }}"
            description: |
              PostgreSQL WAL files are not being archived to S3 within the expected timeframe.
              This impacts the RPO (Recovery Point Objective).
              Current lag: {{ $value }}s (target: <300s)
              Action: Check the postgres WAL archiver status and S3 connectivity.
        # Alert: S3 upload performance degraded
        - alert: S3UploadSlow
          expr: |
            backup_upload_duration_seconds > 1200
          for: 10m
          labels:
            severity: warning
            component: storage
          annotations:
            summary: "S3 backup upload taking {{ $value | humanizeDuration }}"
            description: |
              Backup upload to S3 is taking longer than expected.
              This may indicate network issues or S3 throttling.
              Target duration: <600s
              Current duration: {{ $value }}s
              Action: Check network connectivity and S3 bucket metrics.
        # Alert: Database replication lagging
        - alert: HighReplicationLag
          expr: |
            # Bytes the primary's WAL insert position is ahead of the replica slot's restart position
            pg_wal_insert_lsn_bytes - pg_replication_slot_restart_lsn_bytes > 1073741824
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rto
          annotations:
            summary: "Replication lag: {{ $value | humanize1024 }}B"
            description: |
              Secondary database replica is lagging significantly behind the primary.
              This impacts failover capability.
              Current lag: {{ $value | humanize1024 }}B (target: <100MB)
              Action: Check the network between regions and the replica pod status.
        # Alert: Backup restore test failure
        - alert: BackupRestoreTestFailed
          expr: |
            backup_restore_test_success == 0
          for: 10m
          labels:
            severity: critical
            component: backup
            slo: rto
          annotations:
            summary: "Backup restore test failed"
            description: |
              The weekly automated backup restore test has failed.
              This indicates backups may not be recoverable.
              Action: Review test logs and manually verify backup integrity.
        # Alert: Primary database down (failover trigger)
        - alert: PrimaryDatabaseDown
          expr: |
            up{job="postgresql-primary"} == 0
          for: 2m
          labels:
            severity: critical
            component: database
            slo: rto
          annotations:
            summary: "Primary database unreachable"
            description: |
              The primary PostgreSQL database is not responding to health checks.
              Failover to the secondary may be required.
              Action: Check pod status with kubectl; consider automatic failover.
        # Alert: Secondary database replication stopped
        - alert: SecondaryReplicationDown
          expr: |
            pg_replication_slot_active == 0
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rpo
          annotations:
            summary: "Secondary replication connection lost"
            description: |
              Replication from the primary to the secondary database has stopped.
              The secondary will become stale, and failover will risk data loss.
              Action: Check network connectivity and logs on both primary and secondary.
        # Info: Backup statistics
        - alert: BackupStatsInfo
          expr: |
            increase(backup_job_total[24h]) > 0
          for: 1h
          labels:
            severity: info
            component: backup
          annotations:
            summary: "Daily backup stats: {{ $value }} backups in last 24h"
            description: |
              Informational alert for backup statistics,
              used for success-rate and performance monitoring.
    # Recording rules for aggregation
    - name: disaster-recovery-recording
      interval: 1m
      rules:

        # Average backup size over 7 days
        - record: backup:size:avg:7d
          expr: avg_over_time(backup_size_bytes[7d])

        # Backup success rate over 24h (successful jobs / total jobs)
        - record: backup:success:rate:24h
          expr: increase(backup_job_success_total[24h]) / increase(backup_job_total[24h])

        # Maximum WAL archive lag over 5 minutes
        - record: wal:lag:max:5m
          expr: max_over_time(wal_archive_lag_seconds[5m])

        # Average replication lag in bytes (primary insert position ahead of replica slot)
        - record: replication:lag:avg:5m
          expr: avg(pg_wal_insert_lsn_bytes - pg_replication_slot_restart_lsn_bytes)
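The expressions above query custom metrics (backup_last_success_timestamp, backup_size_bytes, backup_upload_duration_seconds, and friends) that no standard exporter ships by default, so the backup job has to publish them itself. A minimal sketch of one way to do that from a Node backup job, using prom-client and a Pushgateway (the Pushgateway URL and job name are assumptions; metric names match the rules above):

// Publishes backup metrics to a Pushgateway so Prometheus can scrape them.
const client = require('prom-client');

const registry = new client.Registry();

const lastSuccess = new client.Gauge({
  name: 'backup_last_success_timestamp',
  help: 'Unix timestamp of the last successful backup',
  labelNames: ['type'],
  registers: [registry],
});
const backupSize = new client.Gauge({
  name: 'backup_size_bytes',
  help: 'Size of the most recent backup archive in bytes',
  registers: [registry],
});
const uploadDuration = new client.Gauge({
  name: 'backup_upload_duration_seconds',
  help: 'Time taken to upload the backup to S3',
  registers: [registry],
});

async function reportBackupSuccess({ sizeBytes, uploadSeconds }) {
  lastSuccess.labels('daily').setToCurrentTime(); // Unix seconds
  backupSize.set(sizeBytes);
  uploadDuration.set(uploadSeconds);
  // Assumed in-cluster Pushgateway address; adjust to the real service.
  const gateway = new client.Pushgateway('http://pushgateway.gravl-monitoring:9091', {}, registry);
  await gateway.pushAdd({ jobName: 'db-backup' });
}

Before applying the manifest, the rule syntax can be checked offline with promtool (promtool check rules <file>), which catches PromQL and YAML errors early.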