Blocking Issues Remediation Guide
Date: 2026-03-06
Status: READY TO IMPLEMENT
Priority: Critical path to production launch
Overview
Three blocking issues identified during production readiness review (Task 10-07-05):
- Loki storage misconfiguration (CrashLoopBackOff)
- Backup cronjob not deployed
- AlertManager endpoints not configured
This guide provides step-by-step fixes for each. Estimated total remediation time: 2-3 hours.
Issue #1: Loki Storage Misconfiguration
Symptom
kubectl get pods -n gravl-logging
# loki-0 0/1 CrashLoopBackOff 161 (4m37s ago) 13h
# promtail-7d8qf 0/1 CrashLoopBackOff 199 (70s ago) 16h
Root Cause
Loki StatefulSet configured to use StorageClass standard, but K3s only provides local-path.
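A quick way to confirm (the PVC name follows the StatefulSet <template>-<pod> convention, matching the commands later in this section); a PVC requesting a non-existent StorageClass typically sits Pending with a "storageclass not found"-style event:
# Confirm the PVC is stuck on the missing StorageClass
kubectl get pvc -n gravl-logging
kubectl describe pvc loki-storage-loki-0 -n gravl-logging | tail -5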
Fix Option A: emptyDir (Staging Only - Logs Discarded on Pod Restart)
# volumeClaimTemplates on a StatefulSet are immutable, so this cannot be
# changed with kubectl edit alone; delete the StatefulSet and re-apply it
kubectl delete statefulset loki -n gravl-logging
# In the Loki manifest, replace volumeClaimTemplates with an emptyDir volume (STAGING ONLY)
# Before:
#   volumeClaimTemplates:
#   - metadata:
#       name: loki-storage
#     spec:
#       storageClassName: standard
#       accessModes: [ "ReadWriteOnce" ]
#       resources:
#         requests:
#           storage: 10Gi
# After (under spec.template.spec):
#   volumes:
#   - name: loki-storage
#     emptyDir: {}
# Re-apply the edited manifest (path depends on your repo layout)
kubectl apply -f k8s/logging/loki-statefulset.yaml
kubectl rollout status statefulset/loki -n gravl-logging
# Remove the now-unused PVC left over from the old template
kubectl delete pvc loki-storage-loki-0 -n gravl-logging
Verification:
kubectl logs loki-0 -n gravl-logging | tail -20
# Should show Loki starting cleanly (e.g. "Loki started"), with no restart loop
Fix Option B: Use Existing local-path StorageClass (Recommended for Production)
# Verify available StorageClass
kubectl get storageclass
# NAME PROVISIONER RECLAIMPOLICY
# local-path (default) rancher.io/local-path Delete
# volumeClaimTemplates are immutable, so patching them in place is rejected
# by the API server; delete and recreate the StatefulSet instead
kubectl delete statefulset loki -n gravl-logging
# Delete the old PVC so a fresh one is provisioned by local-path
kubectl delete pvc loki-storage-loki-0 -n gravl-logging
# In the manifest, set volumeClaimTemplates[].spec.storageClassName: local-path,
# then re-apply (path depends on your repo layout)
kubectl apply -f k8s/logging/loki-statefulset.yaml
kubectl rollout status statefulset/loki -n gravl-logging
Verification:
kubectl get pvc -n gravl-logging
# loki-storage-loki-0 Bound pvc-xxx 10Gi local-path
kubectl logs loki-0 -n gravl-logging | tail -5
# Should show Loki starting cleanly (e.g. "Loki started")
Fix Option C: Deploy External Storage Provisioner (Production Best Practice)
If you have AWS/Azure/external storage available:
# Example: AWS EBS provisioner
helm repo add ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm install aws-ebs-csi-driver ebs-csi-driver/aws-ebs-csi-driver -n kube-system
# Create StorageClass
cat << 'YAML' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
YAML
# Update the Loki manifest to storageClassName: ebs-gp3 and recreate the
# StatefulSet as in Option B (volumeClaimTemplates cannot be patched in place)
Timeline:
- Option A (emptyDir): 5 minutes
- Option B (local-path): 15 minutes
- Option C (external provisioner): 1 hour
Recommendation: Use Option A for staging (immediate), Option B or C for production (ensure persistent storage).
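Once Loki is healthy, Promtail should stop crash-looping on its own; if it does not, restarting the DaemonSet is a safe nudge (the DaemonSet name is inferred from the pod name in the symptom output above):
kubectl rollout restart daemonset promtail -n gravl-logging
kubectl get pods -n gravl-logging | grep promtail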
Issue #2: Backup Cronjob Not Deployed
Symptom
kubectl get cronjob -A | grep backup
# (no results)
Root Cause
Backup cronjob manifest exists (k8s/backup/postgres-backup-cronjob.yaml) but has never been applied to the cluster.
Fix
Step 1: Review backup manifest
cat k8s/backup/postgres-backup-cronjob.yaml | head -50
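For orientation, a minimal sketch of the shape such a CronJob usually takes. The repo manifest is authoritative; the image, connection details, and PVC name below are illustrative assumptions:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup-cronjob
  namespace: gravl-production
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:16   # assumed; match your Postgres major version
            # credentials omitted; typically injected as PGPASSWORD from a Secret
            command:
            - sh
            - -c
            - pg_dump -h postgres -U gravl_user gravl > /backups/gravl-backup-$(date +%Y%m%d).sql
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: postgres-backups   # assumed; see the storage note below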
Step 2: Apply cronjob to cluster
kubectl apply -f k8s/backup/postgres-backup-cronjob.yaml
Step 3: Verify deployment
kubectl get cronjob -n gravl-production
# NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE
# postgres-backup-cronjob 0 2 * * * False 0 <none>
kubectl describe cronjob postgres-backup-cronjob -n gravl-production
# Schedule: 0 2 * * * (Daily at 2 AM UTC)
# Concurrency Policy: Allow
# Suspend: False
Step 4: Test backup job (create one-time run)
kubectl create job --from=cronjob/postgres-backup-cronjob postgres-backup-test -n gravl-production
# Monitor job
kubectl logs job/postgres-backup-test -n gravl-production -f
# Verify backup file was created
kubectl exec -it postgres-0 -n gravl-production -- ls -la /backups/
# Should show backup file with timestamp
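A quick sanity check on the newest dump (the file-naming pattern comes from whatever the manifest uses; *.sql is an assumption here):
kubectl exec -it postgres-0 -n gravl-production -- \
  sh -c 'f=$(ls -t /backups/*.sql | head -1); ls -la "$f"; head -3 "$f"'
# A valid plain-format dump starts with lines like "-- PostgreSQL database dump"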
Step 5: Test backup restoration (in staging)
# Restore from a backup file inside the pod; psql -f reads the path in the
# container (a plain shell redirect would read from the local machine instead)
kubectl exec -it postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -f /backups/gravl-backup-latest.sql
# Verify data integrity
kubectl exec -it postgres-0 -n gravl-staging -- \
psql -U gravl_user -d gravl -c "SELECT COUNT(*) FROM exercises;"
# Should return a non-zero count
Timeline: 15 minutes (5 min deploy + 10 min test)
Note: Backup storage may be local PVC (emptyDir) or external (S3, NFS). Verify storage configuration in manifest before deploying to production.
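One way to check, assuming the volumes are declared in the job template of the manifest:
grep -B2 -A6 'volumes:' k8s/backup/postgres-backup-cronjob.yaml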
Issue #3: AlertManager Endpoints Not Configured
Symptom
kubectl describe configmap alertmanager-config -n gravl-monitoring
# Slack receiver defined but no webhook URL
# Email receiver defined but no SMTP server
Root Cause
AlertManager configuration template includes receiver definitions but lacks actual credentials/endpoints.
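To see exactly which fields are empty, the embedded config can be pulled out of the ConfigMap (the data key name alertmanager.yml is an assumption; adjust to match yours):
kubectl get configmap alertmanager-config -n gravl-monitoring \
  -o jsonpath='{.data.alertmanager\.yml}' | grep -En 'api_url|smarthost|smtp'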
Fix Option A: Slack Integration
Step 1: Create Slack webhook
- Go to https://api.slack.com/apps
- Create new app → "From scratch" → select your workspace
- Go to "Incoming Webhooks" → Enable
- Click "Add New Webhook to Workspace"
- Select target channel (e.g., #gravl-incidents)
- Copy webhook URL (e.g., https://hooks.slack.com/services/T123/B456/xyz...)
Step 2: Update AlertManager config
# Get the current config and keep a pristine backup for rollback
kubectl get configmap alertmanager-config -n gravl-monitoring -o yaml > alertmanager-config.yaml
cp alertmanager-config.yaml alertmanager-config.backup.yaml
# Edit the file to add your webhook URL in the receiver's 'api_url' field:
# receivers:
# - name: 'slack-notifications'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
#     channel: '#gravl-incidents'
#     title: 'Alert'
#     text: '{{ .GroupLabels }} - {{ .Alerts.Firing | len }} firing'
# Apply updated config
kubectl apply -f alertmanager-config.yaml
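Optionally, validate the applied config with amtool before reloading (amtool ships with AlertManager; the data key name alertmanager.yml is an assumption):
kubectl get configmap alertmanager-config -n gravl-monitoring \
  -o jsonpath='{.data.alertmanager\.yml}' > /tmp/alertmanager.yml
amtool check-config /tmp/alertmanager.yml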
Step 3: Reload AlertManager
# Mounted ConfigMaps can take up to a minute to sync into the pod; once synced,
# send SIGHUP to AlertManager to reload the config without restarting
kubectl exec -it alertmanager-0 -n gravl-monitoring -- \
  kill -HUP 1
# Verify the config loaded
kubectl logs alertmanager-0 -n gravl-monitoring | grep -i "loading of configuration"
Step 4: Test alert
# Trigger test alert
cat << 'YAML' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: gravl-monitoring
spec:
  groups:
  - name: test
    interval: 15s
    rules:
    - alert: TestAlert
      expr: vector(1)
      for: 0s
      labels:
        severity: critical
      annotations:
        summary: "Test alert firing"
YAML
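If the alert never appears, the Prometheus Operator may be selecting rules by label; check the ruleSelector and, if it is non-empty, add matching labels to the PrometheusRule metadata:
kubectl get prometheus -n gravl-monitoring -o jsonpath='{.items[0].spec.ruleSelector}'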
# Monitor AlertManager for firing alert
kubectl port-forward -n gravl-monitoring svc/alertmanager 9093:9093
# Go to http://localhost:9093 → should see firing alert
# Check Slack channel for notification
# Should receive alert message within 30 seconds
# Clean up test alert
kubectl delete prometheusrule test-alert -n gravl-monitoring
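If the Prometheus Operator isn't available, a synthetic alert can also be pushed straight to the AlertManager v2 API (with the port-forward above still running):
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"critical"},"annotations":{"summary":"Test alert via API"}}]'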
Fix Option B: Email Integration
Step 1: Configure SMTP
# Create Kubernetes secret for SMTP credentials
kubectl create secret generic alertmanager-smtp \
--from-literal=username=your-email@gmail.com \
--from-literal=password=your-app-password \
-n gravl-monitoring
Step 2: Update AlertManager config
# Edit alertmanager-config.yaml
# global:
#   resolve_timeout: 5m
#   smtp_from: 'alerts@gravl.example.com'
#   smtp_smarthost: 'smtp.gmail.com:587'
#   smtp_auth_username: 'your-email@gmail.com'
#   smtp_auth_password: 'your-app-password'
#   # To consume the Kubernetes secret created above instead of an inline
#   # password, mount it into the pod and set smtp_auth_password_file
#   # (supported in recent AlertManager releases)
#
# receivers:
# - name: 'email-notifications'
#   email_configs:
#   - to: 'team@gravl.example.com'
#     headers:
#       Subject: 'Gravl Alert: {{ .GroupLabels.alertname }}'
kubectl apply -f alertmanager-config.yaml
Step 3: Reload and test
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
# Reuse the test-alert procedure from Option A to confirm email delivery
Fix Option C: Both Slack + Email
# Modify the route and receivers sections; the global smtp_* settings from
# Option B are assumed for the email receiver
global:
  resolve_timeout: 5m
route:
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'slack-notifications'
    continue: true
  - match:
      severity: warning
    receiver: 'email-notifications'
receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
    channel: '#gravl-incidents'
- name: 'email-notifications'
  email_configs:
  - to: 'team@gravl.example.com'
    smarthost: 'smtp.gmail.com:587'
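As with Options A and B, apply the updated ConfigMap and reload:
kubectl apply -f alertmanager-config.yaml
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1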
Timeline:
- Option A (Slack only): 30 minutes
- Option B (Email only): 30 minutes
- Option C (Both): 45 minutes
Recommendation: Use Slack + Email. Slack for immediate visibility, email for audit trail.
Consolidated Remediation Checklist
Pre-Flight (5 minutes)
- Team notified of remediation work
- On-call engineer on standby
- Monitoring dashboard open (watch for pod restarts)
Issue #1: Loki Storage (15 minutes)
- Choose fix option (recommend: Option B local-path)
- Apply fix
- Verify Loki pod running (no CrashLoopBackOff)
- Verify Promtail pods running (depends on Loki)
Issue #2: Backup Cronjob (15 minutes)
- Apply cronjob manifest
- Verify cronjob scheduled
- Create test backup job
- Verify backup file created
Issue #3: AlertManager Endpoints (30 minutes)
- Create Slack webhook (if using Slack)
- Create SMTP credentials (if using email)
- Update AlertManager config
- Test alert delivery
- Clean up test alert
Post-Remediation (5 minutes)
- All pods healthy (see the sweep command after this list)
- All services responding
- Document any manual steps for runbook
- Sign-off: Ready for production deployment
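A quick sweep of the three namespaces touched above, as a final check:
for ns in gravl-logging gravl-monitoring gravl-production; do
  echo "--- $ns"; kubectl get pods -n "$ns"
done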
Rollback Plan (If Remediation Fails)
If Loki fix fails:
# Loki is non-blocking for launch; remove it for now and deploy without
# centralized logging
kubectl delete statefulset loki -n gravl-logging
If Backup deployment fails:
# Remove the failing cronjob
kubectl delete cronjob postgres-backup-cronjob -n gravl-production
# Schedule a manual backup before production launch
If AlertManager config breaks:
# ConfigMaps have no rollout history; re-apply the pristine backup saved in Step 2
kubectl apply -f alertmanager-config.backup.yaml
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
Success Criteria
✅ Loki operational (pod running, no CrashLoopBackOff)
✅ Promtail operational (logs flowing)
✅ Backup cronjob deployed (scheduled, tested)
✅ AlertManager endpoints configured (test alert received)
✅ No new pod restarts (stable for 5 minutes)
Document Version: 1.0
Created: 2026-03-06 20:16 UTC
Estimated Implementation Time: 2-3 hours
Priority: Critical path to production