
Blocking Issues Remediation Guide

Date: 2026-03-06
Status: READY TO IMPLEMENT
Priority: Critical path to production launch


Overview

Three blocking issues identified during production readiness review (Task 10-07-05):

  1. Loki storage misconfiguration (CrashLoopBackOff)
  2. Backup cronjob not deployed
  3. AlertManager endpoints not configured

This guide provides step-by-step fixes for each. Estimated total remediation time: 2-3 hours.


Issue #1: Loki Storage Misconfiguration

Symptom

kubectl get pods -n gravl-logging
# loki-0                  0/1     CrashLoopBackOff   161 (4m37s ago)   13h
# promtail-7d8qf          0/1     CrashLoopBackOff   199 (70s ago)     16h

Root Cause

The Loki StatefulSet is configured to use StorageClass standard, but K3s only provides local-path, so the PVC can never bind.
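
One way to confirm this: the Loki PVC stays Pending because no StorageClass named standard exists (PVC name taken from the StatefulSet's volumeClaimTemplates):

kubectl get pvc -n gravl-logging
# loki-storage-loki-0   Pending
kubectl describe pvc loki-storage-loki-0 -n gravl-logging | tail -3
# Events should include: storageclass.storage.k8s.io "standard" not found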

Fix Option A: emptyDir (Staging Only - Logs Discarded on Pod Restart)

# volumeClaimTemplates are immutable, so the StatefulSet cannot be
# edited in place: delete it, make the manifest change below, then
# re-apply (STAGING ONLY)
kubectl delete statefulset loki -n gravl-logging

# Change volumeClaimTemplates to emptyDir (STAGING ONLY)
# Before:
# volumeClaimTemplates:
# - metadata:
#     name: loki-storage
#   spec:
#     storageClassName: standard
#     accessModes: [ "ReadWriteOnce" ]
#     resources:
#       requests:
#         storage: 10Gi

# After:
# volumes:
# - name: loki-storage
#   emptyDir: {}

# Re-apply the updated manifest and wait for the pod
kubectl apply -f k8s/logging/loki-statefulset.yaml   # path assumed
kubectl rollout status statefulset/loki -n gravl-logging

Verification:

kubectl logs loki-0 -n gravl-logging | tail -20
# Should show msg="Loki started" (and no further CrashLoopBackOff)

Fix Option B: local-path StorageClass (K3s Default, Recommended)

# Verify available StorageClass
kubectl get storageclass
# NAME                   PROVISIONER             RECLAIMPOLICY
# local-path (default)   rancher.io/local-path   Delete

# volumeClaimTemplates are immutable, so recreate the StatefulSet
# instead of patching it: delete it while leaving the pod orphaned,
# set storageClassName: local-path in the manifest, and re-apply
kubectl delete statefulset loki -n gravl-logging --cascade=orphan
kubectl apply -f k8s/logging/loki-statefulset.yaml   # path assumed

# Delete old PVC and restart pod
kubectl delete pvc loki-storage-loki-0 -n gravl-logging
kubectl delete pod loki-0 -n gravl-logging
kubectl rollout status statefulset/loki -n gravl-logging

Verification:

kubectl get pvc -n gravl-logging
# loki-storage-loki-0   Bound    pvc-xxx   10Gi   local-path

kubectl logs loki-0 -n gravl-logging | tail -5
# Should show msg="Loki started"

Fix Option C: Deploy External Storage Provisioner (Production Best Practice)

If you have AWS/Azure/external storage available:

# Example: AWS EBS provisioner
helm repo add ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm install aws-ebs-csi-driver ebs-csi-driver/aws-ebs-csi-driver -n kube-system

# Create StorageClass
cat << 'YAML' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
YAML

# Update Loki to use ebs-gp3: as in Option B, volumeClaimTemplates are
# immutable, so delete the StatefulSet with --cascade=orphan, set
# storageClassName: ebs-gp3 in the manifest, and re-apply
kubectl delete statefulset loki -n gravl-logging --cascade=orphan
kubectl apply -f k8s/logging/loki-statefulset.yaml   # path assumed

Timeline:

  • Option A (emptyDir): 5 minutes
  • Option B (local-path): 15 minutes
  • Option C (external provisioner): 1 hour

Recommendation: Use Option A for staging (immediate), Option B or C for production (ensure persistent storage).


Issue #2: Backup Cronjob Not Deployed

Symptom

kubectl get cronjob -A | grep backup
# (no results)

Root Cause

Backup cronjob manifest exists (k8s/backup/postgres-backup-cronjob.yaml) but has never been applied to the cluster.
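
A quick confirmation: kubectl diff prints the entire object as an addition (and exits non-zero) when the manifest has never been applied to the cluster.

kubectl diff -f k8s/backup/postgres-backup-cronjob.yaml
# Entire CronJob shown as new → it does not exist in the cluster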

Fix

Step 1: Review backup manifest

head -50 k8s/backup/postgres-backup-cronjob.yaml

Step 2: Apply cronjob to cluster

kubectl apply -f k8s/backup/postgres-backup-cronjob.yaml

Step 3: Verify deployment

kubectl get cronjob -n gravl-production
# NAME                      SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE
# postgres-backup-cronjob   0 2 * * *     False     0        <none>

kubectl describe cronjob postgres-backup-cronjob -n gravl-production
# Schedule:  0 2 * * * (Daily at 2 AM UTC)
# Concurrency Policy:  Allow
# Suspend:  False

Step 4: Test backup job (create one-time run)

kubectl create job --from=cronjob/postgres-backup-cronjob postgres-backup-test -n gravl-production

# Monitor job
kubectl logs job/postgres-backup-test -n gravl-production -f

# Verify backup file was created
kubectl exec -it postgres-0 -n gravl-production -- ls -la /backups/
# Should show backup file with timestamp

Step 5: Test backup restoration (in staging)

# Assuming the backup file exists in the pod; use psql -f so the path
# resolves inside the pod (a local '<' redirect would read from the
# workstation's filesystem instead)
kubectl exec -i postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -f /backups/gravl-backup-latest.sql

# Verify data integrity
kubectl exec -it postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -c "SELECT COUNT(*) FROM exercises;"
# Should return a non-zero count

Timeline: 15 minutes (5 min deploy + 10 min test)

Note: Backup storage may be local PVC (emptyDir) or external (S3, NFS). Verify storage configuration in manifest before deploying to production.
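
A quick way to check the destination (field names follow the standard CronJob schema; adjust if the manifest is laid out differently):

# Show volumes and mounts to see whether backups land on a PVC,
# hostPath, or an object-storage sidecar
grep -n -A 6 -E 'volumes:|volumeMounts:' k8s/backup/postgres-backup-cronjob.yaml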


Issue #3: AlertManager Endpoints Not Configured

Symptom

kubectl describe configmap alertmanager-config -n gravl-monitoring
# Slack receiver defined but no webhook URL
# Email receiver defined but no SMTP server

Root Cause

AlertManager configuration template includes receiver definitions but lacks actual credentials/endpoints.
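
A quick way to surface the gaps (key names follow the standard AlertManager schema):

# Empty or placeholder values here confirm the missing endpoints
kubectl get configmap alertmanager-config -n gravl-monitoring -o yaml \
  | grep -E 'api_url|smarthost|slack_configs|email_configs'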

Fix Option A: Slack Integration

Step 1: Create Slack webhook

  1. Go to https://api.slack.com/apps
  2. Create new app → "From scratch" → select your workspace
  3. Go to "Incoming Webhooks" → Enable
  4. Click "Add New Webhook to Workspace"
  5. Select target channel (e.g., #gravl-incidents)
  6. Copy webhook URL (e.g., https://hooks.slack.com/services/T123/B456/xyz...)

Step 2: Update AlertManager config

# Get current config and keep a pristine copy for rollback
kubectl get configmap alertmanager-config -n gravl-monitoring -o yaml > alertmanager-config.yaml
cp alertmanager-config.yaml alertmanager-config-backup.yaml

# Edit the file to add Slack webhook
# Find the 'slack_api_url' field and add your URL:
# receivers:
# - name: 'slack-notifications'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
#     channel: '#gravl-incidents'
#     title: 'Alert'
#     text: '{{ .GroupLabels }} - {{ .Alerts.Firing | len }} firing'

# Apply updated config
kubectl apply -f alertmanager-config.yaml

Step 3: Reload AlertManager

# Send SIGHUP to AlertManager to reload config (without restarting)
kubectl exec -it alertmanager-0 -n gravl-monitoring -- \
  kill -HUP 1

# Verify the config was reloaded (AlertManager logs
# "Completed loading of configuration file")
kubectl logs alertmanager-0 -n gravl-monitoring | grep -i 'loading of configuration'
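
AlertManager also exposes an HTTP reload endpoint, so an alternative to SIGHUP is a POST through a temporary port-forward:

# POST to /-/reload triggers the same config reload as SIGHUP
kubectl port-forward -n gravl-monitoring alertmanager-0 9093:9093 &
curl -X POST http://localhost:9093/-/reload
kill %1   # stop the background port-forward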

Step 4: Test alert

# Trigger test alert
cat << 'YAML' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: gravl-monitoring
spec:
  groups:
  - name: test
    interval: 15s
    rules:
    - alert: TestAlert
      expr: vector(1)
      for: 0s
      labels:
        severity: critical
      annotations:
        summary: "Test alert firing"
YAML

# Monitor AlertManager for firing alert
kubectl port-forward -n gravl-monitoring svc/alertmanager 9093:9093
# Go to http://localhost:9093 → should see firing alert

# Check Slack channel for notification
# Should receive alert message within 30 seconds

# Clean up test alert
kubectl delete prometheusrule test-alert -n gravl-monitoring

Fix Option B: Email Integration

Step 1: Configure SMTP

# Create Kubernetes secret for SMTP credentials
kubectl create secret generic alertmanager-smtp \
  --from-literal=username=your-email@gmail.com \
  --from-literal=password=your-app-password \
  -n gravl-monitoring

Step 2: Update AlertManager config

# Edit alertmanager-config.yaml
# global:
#   resolve_timeout: 5m
#   smtp_from: 'alerts@gravl.example.com'
#   smtp_smarthost: 'smtp.gmail.com:587'
#   smtp_auth_username: 'your-email@gmail.com'
#   smtp_auth_password: 'your-app-password'  # or mount the secret and use smtp_auth_password_file
#
# receivers:
# - name: 'email-notifications'
#   email_configs:
#   - to: 'team@gravl.example.com'
#     from: 'alerts@gravl.example.com'
#     smarthost: 'smtp.gmail.com:587'
#     auth_username: 'your-email@gmail.com'
#     auth_password: 'your-app-password'
#     headers:
#       Subject: 'Gravl Alert: {{ .GroupLabels.alertname }}'

kubectl apply -f alertmanager-config.yaml

Step 3: Reload and test

kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1

# Test with amtool (sketch below) or create a test alert (see Option A, Step 4)
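
A minimal synthetic alert via amtool, the CLI that ships with AlertManager (assumes a port-forward to localhost:9093 as in Option A, Step 4; the alert name is arbitrary):

# Fire a one-off warning-severity alert at the running AlertManager
amtool alert add TestEmailAlert severity=warning \
  --annotation=summary="Email delivery test" \
  --alertmanager.url=http://localhost:9093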

Fix Option C: Both Slack + Email

# Modify route and receivers section
global:
  resolve_timeout: 5m

route:
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'slack-notifications'
    continue: true
  - match:
      severity: warning
    receiver: 'email-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
    channel: '#gravl-incidents'
    
- name: 'email-notifications'
  email_configs:
  - to: 'team@gravl.example.com'
    smarthost: 'smtp.gmail.com:587'
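
Before applying, the merged config can be validated with amtool (filename assumed from wherever you saved the edited ConfigMap data):

# Parses routes and receivers; exits non-zero on config errors
amtool check-config alertmanager.yml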

Timeline:

  • Option A (Slack only): 30 minutes
  • Option B (Email only): 30 minutes
  • Option C (Both): 45 minutes

Recommendation: Use Slack + Email. Slack for immediate visibility, email for audit trail.


Consolidated Remediation Checklist

Pre-Flight (5 minutes)

  • Team notified of remediation work
  • On-call engineer on standby
  • Monitoring dashboard open (watch for pod restarts)

Issue #1: Loki Storage (15 minutes)

  • Choose fix option (recommend: Option B local-path)
  • Apply fix
  • Verify Loki pod running (no CrashLoopBackOff)
  • Verify Promtail pods running (depends on Loki)

Issue #2: Backup Cronjob (15 minutes)

  • Apply cronjob manifest
  • Verify cronjob scheduled
  • Create test backup job
  • Verify backup file created

Issue #3: AlertManager Endpoints (30 minutes)

  • Create Slack webhook (if using Slack)
  • Create SMTP credentials (if using email)
  • Update AlertManager config
  • Test alert delivery
  • Clean up test alert

Post-Remediation (5 minutes)

  • All pods healthy (one-line sweep below)
  • All services responding
  • Document any manual steps for runbook
  • Sign-off: Ready for production deployment
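
A one-line sweep for the pod-health check above:

# Lists any pod not Running or Completed across all namespaces;
# empty output means the cluster is clean
kubectl get pods -A --no-headers | grep -vE 'Running|Completed'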

Rollback Plan (If Remediation Fails)

If Loki fix fails:

# Loki is non-blocking: the platform can launch without centralized
# logging. Remove the crashing StatefulSet to stop the restart noise.
kubectl delete statefulset loki -n gravl-logging

If Backup deployment fails:

# Remove the cronjob
kubectl delete cronjob postgres-backup-cronjob -n gravl-production
# Schedule a manual backup before production launch

If AlertManager config breaks:

# Re-apply the pristine copy saved in Step 2 (before editing)
kubectl apply -f alertmanager-config-backup.yaml
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1

Success Criteria

  • Loki operational (pod running, no CrashLoopBackOff)
  • Promtail operational (logs flowing)
  • Backup cronjob deployed (scheduled, tested)
  • AlertManager endpoints configured (test alert received)
  • No new pod restarts (stable for 5 minutes)


Document Version: 1.0
Created: 2026-03-06 20:16 UTC
Estimated Implementation Time: 2-3 hours
Priority: Critical path to production