d81e403f01
COMPLETED TASKS:

✅ 06-01: Workout Swap System
- Added swapped_from_id to workout_logs
- Created workout_swaps table for history
- POST /api/workouts/:id/swap endpoint
- GET /api/workouts/available endpoint
- Reversible swaps with audit trail

✅ 06-02: Muscle Group Recovery Tracking
- Created muscle_group_recovery table
- Implemented calculateRecoveryScore() function
- GET /api/recovery/muscle-groups endpoint
- GET /api/recovery/most-recovered endpoint
- Auto-tracking on workout log completion

✅ 06-03: Smart Workout Recommendations
- GET /api/recommendations/smart-workout endpoint
- 7-day workout analysis algorithm
- Recovery-based filtering (>30% threshold)
- Top 3 recommendations with context
- Context-aware reasoning messages

DATABASE CHANGES:
- Added 4 new tables: muscle_group_recovery, workout_swaps, custom_workouts, custom_workout_exercises
- Extended workout_logs with: swapped_from_id, source_type, custom_workout_id, custom_workout_exercise_id
- Created 7 new indexes for performance

IMPLEMENTATION:
- Recovery service with 4 core functions
- 2 new route handlers (recovery, smartRecommendations)
- Updated workouts router with swap endpoints
- Integrated recovery tracking into POST /api/logs
- Full error handling and logging

TESTING:
- Test file created: /backend/test/phase-06-tests.js
- Ready for E2E and staging validation

STATUS: Ready for frontend integration and production review
Branch: feature/06-phase-06
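The 06-03 recommendation flow described above (recovery-based filtering at a >30% threshold, top 3 results, context-aware reasoning) can be sketched as follows. All names here — recommendWorkouts, muscleGroup, the shape of recoveryByGroup — are illustrative assumptions, not the actual Phase 06 implementation:

```javascript
// Illustrative sketch only — names (recommendWorkouts, muscleGroup,
// recoveryByGroup) are assumptions, not the actual Phase 06 code.

const RECOVERY_THRESHOLD = 0.30; // skip muscle groups below 30% recovery
const TOP_N = 3;                 // return at most 3 recommendations

function recommendWorkouts(workouts, recoveryByGroup) {
  return workouts
    // keep only workouts whose muscle group is sufficiently recovered
    .filter(w => (recoveryByGroup[w.muscleGroup] ?? 1) > RECOVERY_THRESHOLD)
    // rank the most-recovered groups first
    .sort((a, b) =>
      (recoveryByGroup[b.muscleGroup] ?? 1) - (recoveryByGroup[a.muscleGroup] ?? 1))
    .slice(0, TOP_N)
    // attach a context-aware reasoning message
    .map(w => ({
      ...w,
      reason: `${w.muscleGroup} is ${Math.round(
        (recoveryByGroup[w.muscleGroup] ?? 1) * 100)}% recovered`,
    }));
}
```

Groups with no recovery entry default to fully recovered here; the real service may treat missing data differently.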
---
# Prometheus PrometheusRule for Disaster Recovery Monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disaster-recovery-rules
  namespace: gravl-monitoring
  labels:
    app: gravl
    component: monitoring
    rules: disaster-recovery
spec:
  groups:
    - name: disaster-recovery
      interval: 30s
      rules:
        # Alert: No daily backup in 24+ hours
        - alert: NoDailyBackup
          expr: |
            (time() - backup_last_success_timestamp{type="daily"}) > 86400
          for: 1h
          labels:
            severity: critical
            component: backup
            slo: rpo
          annotations:
            summary: "Daily backup missing for {{ $value | humanizeDuration }}"
            description: |
              No successful daily backup has been completed in the last 24 hours.
              This violates the RPO target of <1 hour.
              Action: Check backup CronJob logs and restore connectivity to S3.

        # Alert: Backup size deviation (likely corruption)
        - alert: BackupSizeDeviation
          expr: |
            abs(backup_size_bytes - avg_over_time(backup_size_bytes[7d])) / avg_over_time(backup_size_bytes[7d]) > 0.5
          for: 30m
          labels:
            severity: warning
            component: backup
          annotations:
            summary: "Backup size deviated >50%: {{ $value | humanizePercentage }}"
            description: |
              The latest backup size differs by more than 50% from the 7-day average.
              This may indicate data corruption or an incomplete backup.
              Action: Review backup logs and test a restore from the previous backup.

        # Alert: WAL archive lagging
        - alert: WALArchiveLagging
          expr: |
            wal_archive_lag_seconds > 900
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rpo
          annotations:
            summary: "WAL archive lagging: {{ $value | humanizeDuration }}"
            description: |
              PostgreSQL WAL files are not being archived to S3 within the expected timeframe.
              This impacts the RPO (Recovery Point Objective).
              Current lag: {{ $value }}s (target: <300s; alert fires above 900s)
              Action: Check the postgres WAL archiver status and S3 connectivity.

        # Alert: S3 upload performance degraded
        - alert: S3UploadSlow
          expr: |
            backup_upload_duration_seconds > 1200
          for: 10m
          labels:
            severity: warning
            component: storage
          annotations:
            summary: "S3 backup upload taking {{ $value | humanizeDuration }}"
            description: |
              Backup upload to S3 is taking longer than expected.
              This may indicate network issues or S3 throttling.
              Target duration: <600s (alert fires above 1200s)
              Current duration: {{ $value }}s
              Action: Check network connectivity and S3 bucket metrics.

        # Alert: Database replication lagging
        - alert: HighReplicationLag
          # Lag = primary WAL insert position minus replica restart LSN; 1073741824 bytes = 1 GiB
          expr: |
            pg_wal_insert_lsn_bytes - pg_replication_slot_restart_lsn_bytes > 1073741824
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rto
          annotations:
            summary: "Replication lag: {{ $value | humanize1024 }}B"
            description: |
              The secondary database replica is lagging significantly behind the primary.
              This impacts failover capability.
              Current lag: {{ $value | humanize1024 }}B (target: <100MB; alert fires above 1 GiB)
              Action: Check the network between regions and the replica pod status.

        # Alert: Backup restore test failure
        - alert: BackupRestoreTestFailed
          expr: |
            backup_restore_test_success == 0
          for: 10m
          labels:
            severity: critical
            component: backup
            slo: rto
          annotations:
            summary: "Backup restore test failed"
            description: |
              The weekly automated backup restore test has failed.
              This indicates backups may not be recoverable.
              Action: Review the test logs and manually verify backup integrity.

        # Alert: Primary database down (failover trigger)
        - alert: PrimaryDatabaseDown
          expr: |
            up{job="postgresql-primary"} == 0
          for: 2m
          labels:
            severity: critical
            component: database
            slo: rto
          annotations:
            summary: "Primary database unreachable"
            description: |
              The primary PostgreSQL database is not responding to health checks.
              Failover to the secondary may be required.
              Action: Check pod status with kubectl; consider triggering failover.

        # Alert: Secondary database replication stopped
        - alert: SecondaryReplicationDown
          expr: |
            pg_replication_slot_active == 0
          for: 5m
          labels:
            severity: warning
            component: database
            slo: rpo
          annotations:
            summary: "Secondary replication connection lost"
            description: |
              Replication from the primary to the secondary database has stopped.
              The secondary will become stale, and failover would risk data loss.
              Action: Check network connectivity and logs on both primary and secondary.

        # Info: Backup statistics
        - alert: BackupStatsInfo
          expr: |
            increase(backup_job_total[24h]) > 0
          for: 1h
          labels:
            severity: info
            component: backup
          annotations:
            summary: "Daily backup stats: {{ $value }} backups in last 24h"
            description: |
              Informational alert reporting the number of backup jobs run in the
              last 24 hours, for success-rate and performance monitoring.

    # Recording rules for aggregation
    - name: disaster-recovery-recording
      interval: 1m
      rules:
        # Average backup size over 7 days
        - record: backup:size:avg:7d
          expr: avg_over_time(backup_size_bytes[7d])

        # Backup success rate
        - record: backup:success:rate:24h
          expr: rate(backup_job_success_total[24h])

        # Maximum WAL lag
        - record: wal:lag:max:5m
          expr: max_over_time(wal_archive_lag_seconds[5m])

        # Average replication lag
        - record: replication:lag:avg:5m
          expr: avg(pg_wal_insert_lsn_bytes - pg_replication_slot_restart_lsn_bytes)
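Before deploying, the rule groups can be linted locally. This is a sketch, not part of the manifest: it assumes mikefarah's yq (v4) and promtool are installed, and the filename `disaster-recovery-rules.yaml` is illustrative. It works because the CRD's `.spec` is exactly the plain rule-file format (a top-level `groups:` key) that promtool expects.

```bash
# Extract the PrometheusRule spec and lint it with promtool.
# Assumes yq v4 and promtool on PATH; filename is illustrative.
yq eval '.spec' disaster-recovery-rules.yaml > /tmp/dr-rules.yaml
promtool check rules /tmp/dr-rules.yaml
```

promtool will report parse errors in the PromQL expressions and malformed group/rule fields before the Prometheus Operator ever loads the resource.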