# Blocking Issues Remediation Guide
**Date:** 2026-03-06
**Status:** READY TO IMPLEMENT
**Priority:** Critical path to production launch
---
## Overview
Three blocking issues were identified during the production readiness review (Task 10-07-05):
1. Loki storage misconfiguration (CrashLoopBackOff)
2. Backup cronjob not deployed
3. AlertManager endpoints not configured
This guide provides step-by-step fixes for each. Estimated total remediation time: **2-3 hours**.
---
## Issue #1: Loki Storage Misconfiguration
### Symptom
```bash
kubectl get pods -n gravl-logging
# loki-0 0/1 CrashLoopBackOff 161 (4m37s ago) 13h
# promtail-7d8qf 0/1 CrashLoopBackOff 199 (70s ago) 16h
```
### Root Cause
Loki StatefulSet configured to use StorageClass `standard`, but K3s only provides `local-path`.
### Fix Option A: emptyDir (Staging Only - Logs Discarded on Pod Restart)
```bash
# volumeClaimTemplates are immutable on a live StatefulSet, so export
# the manifest, edit it, and recreate it (--cascade=orphan keeps pods):
kubectl get statefulset loki -n gravl-logging -o yaml > loki-statefulset.yaml

# In loki-statefulset.yaml, replace the volumeClaimTemplates entry
# Before:
#   volumeClaimTemplates:
#   - metadata:
#       name: loki-storage
#     spec:
#       storageClassName: standard
#       accessModes: [ "ReadWriteOnce" ]
#       resources:
#         requests:
#           storage: 10Gi
# After (an emptyDir volume in the pod template, STAGING ONLY):
#   volumes:
#   - name: loki-storage
#     emptyDir: {}
kubectl delete statefulset loki -n gravl-logging --cascade=orphan
kubectl apply -f loki-statefulset.yaml

# Restart the pod to pick up changes
kubectl delete pod loki-0 -n gravl-logging
kubectl rollout status statefulset/loki -n gravl-logging
```
**Verification:**
```bash
kubectl logs loki-0 -n gravl-logging | tail -20
# Should show Loki starting cleanly (e.g. msg="Loki started"), with the
# pod no longer in CrashLoopBackOff
```
### Fix Option B: Use Existing local-path StorageClass (Recommended for Production)
```bash
# Verify the available StorageClass
kubectl get storageclass
# NAME                   PROVISIONER             RECLAIMPOLICY
# local-path (default)   rancher.io/local-path   Delete

# volumeClaimTemplates are immutable on a live StatefulSet, so export
# the manifest, edit it, and recreate it (--cascade=orphan keeps pods):
kubectl get statefulset loki -n gravl-logging -o yaml > loki-statefulset.yaml
# Edit loki-statefulset.yaml: set
#   spec.volumeClaimTemplates[0].spec.storageClassName: local-path
kubectl delete statefulset loki -n gravl-logging --cascade=orphan

# Delete the old pod and PVC so a new claim is provisioned via local-path
kubectl delete pod loki-0 -n gravl-logging
kubectl delete pvc loki-storage-loki-0 -n gravl-logging
kubectl apply -f loki-statefulset.yaml
kubectl rollout status statefulset/loki -n gravl-logging
```
**Verification:**
```bash
kubectl get pvc -n gravl-logging
# loki-storage-loki-0   Bound   pvc-xxx   10Gi   local-path
kubectl logs loki-0 -n gravl-logging | tail -5
# Should show Loki starting cleanly (e.g. msg="Loki started")
```
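For reference, the corrected `volumeClaimTemplates` section should look roughly like this (names and the 10Gi size taken from the commands above; adjust to your actual manifest):

```yaml
volumeClaimTemplates:
- metadata:
    name: loki-storage
  spec:
    storageClassName: local-path
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 10Gi
```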
### Fix Option C: Deploy External Storage Provisioner (Production Best Practice)
If you have AWS/Azure/external storage available:
```bash
# Example: AWS EBS provisioner
helm repo add ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm install aws-ebs-csi-driver ebs-csi-driver/aws-ebs-csi-driver -n kube-system
# Create StorageClass
cat << 'YAML' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
YAML

# Update Loki to use ebs-gp3: volumeClaimTemplates cannot be patched in
# place, so recreate the StatefulSet with storageClassName: ebs-gp3
# (export, edit, delete with --cascade=orphan, re-apply)
```
**Timeline:**
- Option A (emptyDir): 5 minutes
- Option B (local-path): 15 minutes
- Option C (external provisioner): 1 hour
**Recommendation:** Use **Option A for staging** (immediate), **Option B or C for production** (ensure persistent storage).
---
## Issue #2: Backup Cronjob Not Deployed
### Symptom
```bash
kubectl get cronjob -A | grep backup
# (no results)
```
### Root Cause
Backup cronjob manifest exists (`k8s/backup/postgres-backup-cronjob.yaml`) but has never been applied to the cluster.
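For review purposes, a minimal `pg_dump` CronJob looks roughly like the sketch below. The image tag, secret name, host, and PVC name are placeholders; the actual manifest in `k8s/backup/` is authoritative.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup-cronjob
  namespace: gravl-production
spec:
  schedule: "0 2 * * *"            # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:16     # placeholder tag
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h postgres -U gravl_user gravl > /backups/gravl-backup-$(date +%F).sql
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # assumed secret name
                  key: password
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: postgres-backups      # assumed PVC name
```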
### Fix
**Step 1: Review backup manifest**
```bash
cat k8s/backup/postgres-backup-cronjob.yaml | head -50
```
**Step 2: Apply cronjob to cluster**
```bash
kubectl apply -f k8s/backup/postgres-backup-cronjob.yaml
```
**Step 3: Verify deployment**
```bash
kubectl get cronjob -n gravl-production
# NAME                      SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE
# postgres-backup-cronjob   0 2 * * *   False     0        <none>
kubectl describe cronjob postgres-backup-cronjob -n gravl-production
# Schedule: 0 2 * * * (Daily at 2 AM UTC)
# Concurrency Policy: Allow
# Suspend: False
```
**Step 4: Test backup job (create one-time run)**
```bash
kubectl create job --from=cronjob/postgres-backup-cronjob postgres-backup-test -n gravl-production
# Monitor job
kubectl logs job/postgres-backup-test -n gravl-production -f
# Verify backup file was created
kubectl exec -it postgres-0 -n gravl-production -- ls -la /backups/
# Should show backup file with timestamp
```
**Step 5: Test backup restoration (in staging)**
```bash
# Assuming backup file exists in pod; -f runs the file inside the pod
# (a shell `<` redirect would read from the local machine instead)
kubectl exec -it postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -f /backups/gravl-backup-latest.sql
# Verify data integrity
kubectl exec -it postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -c "SELECT COUNT(*) FROM exercises;"
# Should return a non-zero count
```
**Timeline:** 15 minutes (5 min deploy + 10 min test)
**Note:** Backup storage may be local PVC (emptyDir) or external (S3, NFS). Verify storage configuration in manifest before deploying to production.
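As a quick post-deploy check, a small helper like the following (hypothetical, for a local backup directory such as `/backups`) can confirm the newest dump is non-empty and recent:

```shell
#!/bin/sh
# newest_backup DIR: prints the most recently modified .sql file in DIR,
# or nothing if none exist
newest_backup() {
  ls -1t "$1"/*.sql 2>/dev/null | head -1
}

# check_backup DIR MAX_AGE_MIN: succeeds if the newest dump in DIR is
# non-empty and was modified within the last MAX_AGE_MIN minutes
check_backup() {
  f=$(newest_backup "$1")
  [ -n "$f" ] && [ -s "$f" ] || return 1
  [ -n "$(find "$f" -mmin -"$2" 2>/dev/null)" ]
}

# Usage (e.g. from inside the postgres pod, allowing ~25h for a daily job):
#   check_backup /backups 1500 && echo "backup OK" || echo "backup STALE/MISSING"
```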
---
## Issue #3: AlertManager Endpoints Not Configured
### Symptom
```bash
kubectl describe configmap alertmanager-config -n gravl-monitoring
# Slack receiver defined but no webhook URL
# Email receiver defined but no SMTP server
```
### Root Cause
AlertManager configuration template includes receiver definitions but lacks actual credentials/endpoints.
### Fix Option A: Slack Integration
**Step 1: Create Slack webhook**
1. Go to https://api.slack.com/apps
2. Create new app → "From scratch" → select your workspace
3. Go to "Incoming Webhooks" → Enable
4. Click "Add New Webhook to Workspace"
5. Select target channel (e.g., #gravl-incidents)
6. Copy webhook URL (e.g., https://hooks.slack.com/services/T123/B456/xyz...)
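Before wiring the webhook into AlertManager, it can be smoke-tested directly. `build_payload` below is a tiny hypothetical helper that wraps a message in the JSON shape Slack incoming webhooks expect:

```shell
#!/bin/sh
# build_payload MSG: emits the minimal JSON body for a Slack incoming webhook
build_payload() {
  printf '{"text":"%s"}' "$1"
}

# Smoke test (substitute your real webhook URL):
# curl -s -X POST -H 'Content-type: application/json' \
#   --data "$(build_payload 'AlertManager webhook smoke test')" \
#   https://hooks.slack.com/services/T123/B456/xyz
# Slack replies with the literal string "ok" on success
```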
**Step 2: Update AlertManager config**
```bash
# Get current config
kubectl get configmap alertmanager-config -n gravl-monitoring -o yaml > alertmanager-config.yaml
# Edit the file to add the Slack webhook. In the receivers section,
# find slack_configs and set api_url:
# receivers:
# - name: 'slack-notifications'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
#     channel: '#gravl-incidents'
#     title: 'Alert'
#     text: '{{ .GroupLabels }} - {{ .Alerts.Firing | len }} firing'
# Apply updated config
kubectl apply -f alertmanager-config.yaml
```
**Step 3: Reload AlertManager**
```bash
# Wait for the kubelet to sync the updated ConfigMap into the pod (this
# can take up to a minute), then send SIGHUP to reload without a restart
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
# Verify config loaded
kubectl logs alertmanager-0 -n gravl-monitoring | grep "configuration loaded"
```
**Step 4: Test alert**
```bash
# Trigger test alert
cat << 'YAML' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: gravl-monitoring
spec:
  groups:
  - name: test
    interval: 15s
    rules:
    - alert: TestAlert
      expr: vector(1)
      for: 0s
      labels:
        severity: critical
      annotations:
        summary: "Test alert firing"
YAML
# Monitor AlertManager for firing alert
kubectl port-forward -n gravl-monitoring svc/alertmanager 9093:9093
# Go to http://localhost:9093 → should see firing alert
# Check Slack channel for notification
# Should receive alert message within 30 seconds
# Clean up test alert
kubectl delete prometheusrule test-alert -n gravl-monitoring
```
### Fix Option B: Email Integration
**Step 1: Configure SMTP**
```bash
# Create Kubernetes secret for SMTP credentials
kubectl create secret generic alertmanager-smtp \
--from-literal=username=your-email@gmail.com \
--from-literal=password=your-app-password \
-n gravl-monitoring
```
**Step 2: Update AlertManager config**
```bash
# Edit alertmanager-config.yaml
# global:
#   resolve_timeout: 5m
#   smtp_from: 'alerts@gravl.example.com'
#   smtp_smarthost: 'smtp.gmail.com:587'
#   smtp_auth_username: 'your-email@gmail.com'
#   smtp_auth_password: 'your-app-password'  # Or reference from secret
#
# receivers:
# - name: 'email-notifications'
#   email_configs:
#   - to: 'team@gravl.example.com'
#     from: 'alerts@gravl.example.com'
#     smarthost: 'smtp.gmail.com:587'
#     auth_username: 'your-email@gmail.com'
#     auth_password: 'your-app-password'
#     headers:
#       Subject: 'Gravl Alert: {{ .GroupLabels.alertname }}'
kubectl apply -f alertmanager-config.yaml
```
**Step 3: Reload and test**
```bash
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
# Test with command-line tool or create test alert (see above)
```
### Fix Option C: Both Slack + Email
```yaml
# Modify route and receivers section
global:
  resolve_timeout: 5m
route:
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'slack-notifications'
    continue: true
  - match:
      severity: warning
    receiver: 'email-notifications'
receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
    channel: '#gravl-incidents'
- name: 'email-notifications'
  email_configs:
  - to: 'team@gravl.example.com'
    smarthost: 'smtp.gmail.com:587'
**Timeline:**
- Option A (Slack only): 30 minutes
- Option B (Email only): 30 minutes
- Option C (Both): 45 minutes
**Recommendation:** Use **Slack + Email**. Slack for immediate visibility, email for audit trail.
---
## Consolidated Remediation Checklist
### Pre-Flight (5 minutes)
- [ ] Team notified of remediation work
- [ ] On-call engineer on standby
- [ ] Monitoring dashboard open (watch for pod restarts)
### Issue #1: Loki Storage (15 minutes)
- [ ] Choose fix option (recommend: Option B local-path)
- [ ] Apply fix
- [ ] Verify Loki pod running (no CrashLoopBackOff)
- [ ] Verify Promtail pods running (depends on Loki)
### Issue #2: Backup Cronjob (15 minutes)
- [ ] Apply cronjob manifest
- [ ] Verify cronjob scheduled
- [ ] Create test backup job
- [ ] Verify backup file created
### Issue #3: AlertManager Endpoints (30 minutes)
- [ ] Create Slack webhook (if using Slack)
- [ ] Create SMTP credentials (if using email)
- [ ] Update AlertManager config
- [ ] Test alert delivery
- [ ] Clean up test alert
### Post-Remediation (5 minutes)
- [ ] All pods healthy
- [ ] All services responding
- [ ] Document any manual steps for runbook
- [ ] Sign-off: Ready for production deployment
---
## Rollback Plan (If Remediation Fails)
**If Loki fix fails:**
```bash
# Loki is non-blocking for launch; if the fix fails, remove it and
# deploy without centralized logging for now
kubectl delete statefulset loki -n gravl-logging
```
**If Backup deployment fails:**
```bash
# Remove the failed cronjob
kubectl delete cronjob postgres-backup-cronjob -n gravl-production
# Schedule manual backup before production launch
```
**If AlertManager config breaks:**
```bash
# ConfigMaps do not support `kubectl rollout undo`; re-apply a known-good
# copy of the config saved before editing (hypothetical backup filename)
kubectl apply -f alertmanager-config-backup.yaml
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
```
---
## Success Criteria
- [ ] **Loki operational** (pod running, no CrashLoopBackOff)
- [ ] **Promtail operational** (logs flowing)
- [ ] **Backup cronjob deployed** (scheduled, tested)
- [ ] **AlertManager endpoints configured** (test alert received)
- [ ] **No new pod restarts** (stable for 5 minutes)
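The "all pods healthy" check can be scripted. `count_unhealthy` below is a small hypothetical helper that filters standard `kubectl get pods -A` output (it assumes the default column layout, where STATUS is the 4th field):

```shell
#!/bin/sh
# count_unhealthy: reads `kubectl get pods -A` output on stdin and prints
# the number of pods whose STATUS is neither Running nor Completed
count_unhealthy() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Usage against the live cluster (expect 0 before sign-off):
#   kubectl get pods -A | count_unhealthy
```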
---
**Document Version:** 1.0
**Created:** 2026-03-06 20:16 UTC
**Estimated Implementation Time:** 2-3 hours
**Priority:** Critical path to production