# Blocking Issues Remediation Guide

**Date:** 2026-03-06
**Status:** READY TO IMPLEMENT
**Priority:** Critical path to production launch

---

## Overview

Three blocking issues identified during production readiness review (Task 10-07-05):

1. Loki storage misconfiguration (CrashLoopBackOff)
2. Backup cronjob not deployed
3. AlertManager endpoints not configured

This guide provides step-by-step fixes for each. Estimated total remediation time: **2-3 hours**.

---

## Issue #1: Loki Storage Misconfiguration

### Symptom
```bash
kubectl get pods -n gravl-logging
# loki-0           0/1   CrashLoopBackOff   161 (4m37s ago)   13h
# promtail-7d8qf   0/1   CrashLoopBackOff   199 (70s ago)     16h
```

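To pick out the affected pods from a longer listing, the same output can be filtered on its STATUS column. This is a minimal sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE); the function name `crashlooping_pods` is illustrative and not part of the project. Usage, assuming `kubectl` points at the cluster: `kubectl get pods -n gravl-logging --no-headers | crashlooping_pods`.

```bash
# Print "<pod-name> <restart-count>" for every pod in CrashLoopBackOff.
# Reads `kubectl get pods --no-headers` output on stdin, so it can be
# exercised offline with sample text.
crashlooping_pods() {
  # $1 = NAME, $3 = STATUS, $4 = RESTARTS in the default column layout
  awk '$3 == "CrashLoopBackOff" { print $1, $4 }'
}
```
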
### Root Cause
Loki StatefulSet configured to use StorageClass `standard`, but K3s only provides `local-path`.

### Fix Option A: emptyDir (Staging Only - Logs Discarded on Pod Restart)

```bash
# volumeClaimTemplates is immutable on an existing StatefulSet, so an in-place
# `kubectl edit` of that field is rejected. Export the manifest, edit the file,
# and recreate the StatefulSet instead.
kubectl get statefulset loki -n gravl-logging -o yaml > loki-statefulset.yaml

# In loki-statefulset.yaml, remove status and the server-set metadata fields
# (resourceVersion, uid, creationTimestamp), then change
# volumeClaimTemplates to emptyDir (STAGING ONLY)
# Before (under spec):
# volumeClaimTemplates:
#   - metadata:
#       name: loki-storage
#     spec:
#       storageClassName: standard
#       accessModes: [ "ReadWriteOnce" ]
#       resources:
#         requests:
#           storage: 10Gi

# After (under spec.template.spec):
# volumes:
#   - name: loki-storage
#     emptyDir: {}

# Recreate the StatefulSet to pick up the changes
kubectl delete statefulset loki -n gravl-logging
kubectl apply -f loki-statefulset.yaml
kubectl rollout status statefulset/loki -n gravl-logging
```

**Verification:**
```bash
kubectl logs loki-0 -n gravl-logging | tail -20
# Should show no storage or startup errors; the pod should be Running (no CrashLoopBackOff)
```

### Fix Option B: Use Existing local-path StorageClass (Recommended for Production)

```bash
# Verify available StorageClass
kubectl get storageclass
# NAME                   PROVISIONER             RECLAIMPOLICY
# local-path (default)   rancher.io/local-path   Delete

# volumeClaimTemplates is immutable on an existing StatefulSet, so a
# `kubectl patch` of that field is rejected. Export the manifest instead.
kubectl get statefulset loki -n gravl-logging -o yaml > loki-statefulset.yaml
# Edit loki-statefulset.yaml: set storageClassName: local-path and remove
# status, metadata.resourceVersion, metadata.uid, metadata.creationTimestamp

# Delete the StatefulSet and the old PVC, then recreate from the edited manifest
kubectl delete statefulset loki -n gravl-logging
kubectl delete pvc loki-storage-loki-0 -n gravl-logging
kubectl apply -f loki-statefulset.yaml
kubectl rollout status statefulset/loki -n gravl-logging
```

**Verification:**
```bash
kubectl get pvc -n gravl-logging
# loki-storage-loki-0   Bound   pvc-xxx   10Gi   local-path

kubectl logs loki-0 -n gravl-logging | tail -5
# Should show no storage or startup errors; the pod should be Running
```

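The PVC check above can also be scripted so it returns a pass/fail status instead of requiring a visual scan. The sketch below is hedged: it assumes the PVC name `loki-storage-loki-0` from this guide, and the helper name `pvc_bound_on_class` is illustrative. It reads the claim's phase and storage class through `kubectl get -o jsonpath`. Usage: `pvc_bound_on_class loki-storage-loki-0 gravl-logging local-path && echo "PVC OK"`.

```bash
# Succeed only if the PVC is Bound and uses the expected StorageClass.
pvc_bound_on_class() {
  local pvc="$1" namespace="$2" expected_class="$3" actual
  # Prints "<phase> <storageClassName>", for example "Bound local-path"
  actual=$(kubectl get pvc "$pvc" -n "$namespace" \
    -o jsonpath='{.status.phase} {.spec.storageClassName}') || return 1
  [ "$actual" = "Bound ${expected_class}" ]
}
```
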
### Fix Option C: Deploy External Storage Provisioner (Production Best Practice)

If you have AWS/Azure/external storage available:

```bash
# Example: AWS EBS provisioner
helm repo add ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm install aws-ebs-csi-driver ebs-csi-driver/aws-ebs-csi-driver -n kube-system

# Create StorageClass
cat << 'YAML' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
YAML

# Update Loki to use ebs-gp3. volumeClaimTemplates is immutable, so a
# `kubectl patch` of that field is rejected: export, edit, and recreate.
kubectl get statefulset loki -n gravl-logging -o yaml > loki-statefulset.yaml
# Edit loki-statefulset.yaml now: set storageClassName: ebs-gp3 and remove
# status, metadata.resourceVersion, metadata.uid, metadata.creationTimestamp
kubectl delete statefulset loki -n gravl-logging
kubectl delete pvc loki-storage-loki-0 -n gravl-logging
kubectl apply -f loki-statefulset.yaml
```

**Timeline:**
- Option A (emptyDir): 5 minutes
- Option B (local-path): 15 minutes
- Option C (external provisioner): 1 hour

**Recommendation:** Use **Option A for staging** (immediate), **Option B or C for production** (ensure persistent storage).

---

## Issue #2: Backup Cronjob Not Deployed

### Symptom
```bash
kubectl get cronjob -A | grep backup
# (no results)
```

### Root Cause
Backup cronjob manifest exists (`k8s/backup/postgres-backup-cronjob.yaml`) but has never been applied to the cluster.

### Fix

**Step 1: Review backup manifest**
```bash
head -50 k8s/backup/postgres-backup-cronjob.yaml
```

**Step 2: Apply cronjob to cluster**
```bash
kubectl apply -f k8s/backup/postgres-backup-cronjob.yaml
```

**Step 3: Verify deployment**
```bash
kubectl get cronjob -n gravl-production
# NAME                      SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE
# postgres-backup-cronjob   0 2 * * *   False     0        <none>

kubectl describe cronjob postgres-backup-cronjob -n gravl-production
# Schedule: 0 2 * * * (Daily at 2 AM UTC)
# Concurrency Policy: Allow
# Suspend: False
```

**Step 4: Test backup job (create one-time run)**
```bash
kubectl create job --from=cronjob/postgres-backup-cronjob postgres-backup-test -n gravl-production

# Monitor job
kubectl logs job/postgres-backup-test -n gravl-production -f

# Verify backup file was created
kubectl exec -it postgres-0 -n gravl-production -- ls -la /backups/
# Should show backup file with timestamp
```

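Reading the `ls` output by eye is easy to get wrong when older backups are already present. As a hedged sketch, the check can be reduced to "is there a non-empty file modified in the last N minutes". The helper name `recent_backup_exists` is illustrative, and the `find` options used (`-maxdepth`, `-mmin`, `-size`) assume GNU or BSD `find`. The function works on a local directory; the same `find` expression could be run inside the pod through `kubectl exec`, assuming the container image ships a compatible `find`.

```bash
# Succeed if DIR contains at least one non-empty regular file that was
# modified within the last MAX_AGE_MIN minutes (default: 60).
recent_backup_exists() {
  local dir="$1" max_age_min="${2:-60}" match
  # -size +0 skips empty files; -mmin -N means "modified less than N minutes ago"
  match=$(find "$dir" -maxdepth 1 -type f -size +0 -mmin "-${max_age_min}" 2>/dev/null | head -n 1)
  [ -n "$match" ]
}
```
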
**Step 5: Test backup restoration (in staging)**
```bash
# Assuming the backup file exists inside the staging pod.
# -f makes psql read the file inside the pod; a shell redirect (<) would be
# resolved on the local machine instead
kubectl exec postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -f /backups/gravl-backup-latest.sql

# Verify data integrity
kubectl exec postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -c "SELECT COUNT(*) FROM exercises;"
# Should return a non-zero count
```

**Timeline:** 15 minutes (5 min deploy + 10 min test)

**Note:** Backup storage may be local (a PVC or an emptyDir volume) or external (S3, NFS). An emptyDir volume is lost when its pod is deleted, so verify the storage configuration in the manifest before deploying to production.

---

## Issue #3: AlertManager Endpoints Not Configured

### Symptom
```bash
kubectl describe configmap alertmanager-config -n gravl-monitoring
# Slack receiver defined but no webhook URL
# Email receiver defined but no SMTP server
```

### Root Cause
AlertManager configuration template includes receiver definitions but lacks actual credentials/endpoints.

### Fix Option A: Slack Integration

**Step 1: Create Slack webhook**
1. Go to https://api.slack.com/apps
2. Create new app → "From scratch" → select your workspace
3. Go to "Incoming Webhooks" → Enable
4. Click "Add New Webhook to Workspace"
5. Select target channel (e.g., #gravl-incidents)
6. Copy webhook URL (e.g., https://hooks.slack.com/services/T123/B456/xyz...)

**Step 2: Update AlertManager config**
```bash
# Get current config
kubectl get configmap alertmanager-config -n gravl-monitoring -o yaml > alertmanager-config.yaml

# Edit the file to add the Slack webhook
# Set 'api_url' under the receiver's slack_configs to your URL:
# receivers:
#   - name: 'slack-notifications'
#     slack_configs:
#       - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
#         channel: '#gravl-incidents'
#         title: 'Alert'
#         text: '{{ .GroupLabels }} - {{ .Alerts.Firing | len }} firing'

# Apply updated config
kubectl apply -f alertmanager-config.yaml
```

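Because the sample webhook URL is easy to leave in place by accident, a quick pre-apply check on the edited file can catch it. This is a sketch under assumptions: the helper name `has_placeholder_webhook` is illustrative, and it only looks for the example URL used in this guide or an empty `api_url` value. It is not a syntax validator; Alertmanager's own `amtool check-config` covers that for a plain Alertmanager config file. Usage: `has_placeholder_webhook alertmanager-config.yaml && echo "Replace the sample webhook before applying"`.

```bash
# Succeed (exit 0) if FILE still contains the sample webhook URL from this
# guide, or an api_url key with no value or an empty quoted string.
has_placeholder_webhook() {
  local file="$1"
  # Sample URL copied without being replaced
  grep -q 'hooks\.slack\.com/services/T123/B456' "$file" && return 0
  # api_url followed by nothing, '' or ""
  grep -Eq "api_url:[[:space:]]*(''|\"\")?[[:space:]]*\$" "$file"
}
```
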
**Step 3: Reload AlertManager**
```bash
# Send SIGHUP to AlertManager to reload config (without restarting)
kubectl exec -it alertmanager-0 -n gravl-monitoring -- \
  kill -HUP 1

# Verify config loaded
kubectl logs alertmanager-0 -n gravl-monitoring | grep "Completed loading of configuration file"
```

**Step 4: Test alert**
```bash
# Trigger test alert
cat << 'YAML' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: gravl-monitoring
spec:
  groups:
    - name: test
      interval: 15s
      rules:
        - alert: TestAlert
          expr: vector(1)
          for: 0s
          labels:
            severity: critical
          annotations:
            summary: "Test alert firing"
YAML

# Monitor AlertManager for firing alert (port-forward blocks; run it in a separate terminal)
kubectl port-forward -n gravl-monitoring svc/alertmanager 9093:9093
# Go to http://localhost:9093 → should see firing alert

# Check Slack channel for notification
# Should receive alert message within 30 seconds

# Clean up test alert
kubectl delete prometheusrule test-alert -n gravl-monitoring
```

### Fix Option B: Email Integration

**Step 1: Configure SMTP**
```bash
# Create Kubernetes secret for SMTP credentials
kubectl create secret generic alertmanager-smtp \
  --from-literal=username=your-email@gmail.com \
  --from-literal=password=your-app-password \
  -n gravl-monitoring
```

**Step 2: Update AlertManager config**
```bash
# Edit alertmanager-config.yaml
# global:
#   resolve_timeout: 5m
#   smtp_from: 'alerts@gravl.example.com'
#   smtp_smarthost: 'smtp.gmail.com:587'
#   smtp_auth_username: 'your-email@gmail.com'
#   smtp_auth_password: 'your-app-password'  # Or reference from secret
#
# receivers:
#   - name: 'email-notifications'
#     email_configs:
#       - to: 'team@gravl.example.com'
#         from: 'alerts@gravl.example.com'
#         smarthost: 'smtp.gmail.com:587'
#         auth_username: 'your-email@gmail.com'
#         auth_password: 'your-app-password'
#         headers:
#           Subject: 'Gravl Alert: {{ .GroupLabels.alertname }}'

kubectl apply -f alertmanager-config.yaml
```

**Step 3: Reload and test**
```bash
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1

# Test with command-line tool or create test alert (see above)
```

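For the command-line route, one option is to post a synthetic alert straight to Alertmanager's v2 HTTP API, which bypasses Prometheus and exercises only routing and delivery. The sketch below is illustrative: the helper names are not part of the project, and the default URL assumes the `kubectl port-forward ... 9093:9093` session from Option A is running. Usage: `send_test_alert http://localhost:9093 TestAlert warning`.

```bash
# Build the JSON body for POST /api/v2/alerts: a list containing one alert.
test_alert_payload() {
  local name="${1:-TestAlert}" severity="${2:-critical}"
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Test alert firing"}}]' \
    "$name" "$severity"
}

# Send the alert to an Alertmanager reachable at BASE_URL.
# Optional second and third arguments: alert name and severity.
send_test_alert() {
  local base_url="${1:-http://localhost:9093}"
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "$(test_alert_payload "${@:2}")" "${base_url}/api/v2/alerts"
}
```
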
### Fix Option C: Both Slack + Email

```yaml
# Modify route and receivers section
global:
  resolve_timeout: 5m

route:
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-notifications'
      continue: true
    - match:
        severity: warning
      receiver: 'email-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
        channel: '#gravl-incidents'

  - name: 'email-notifications'
    email_configs:
      - to: 'team@gravl.example.com'
        smarthost: 'smtp.gmail.com:587'
        # also needs 'from' and the SMTP auth settings from Option B
        # (here or as global smtp_* values)
```

**Timeline:**
- Option A (Slack only): 30 minutes
- Option B (Email only): 30 minutes
- Option C (Both): 45 minutes

**Recommendation:** Use **Slack + Email**. Slack for immediate visibility, email for audit trail.

---

## Consolidated Remediation Checklist

### Pre-Flight (5 minutes)
- [ ] Team notified of remediation work
- [ ] On-call engineer on standby
- [ ] Monitoring dashboard open (watch for pod restarts)

### Issue #1: Loki Storage (15 minutes)
- [ ] Choose fix option (recommend: Option B local-path)
- [ ] Apply fix
- [ ] Verify Loki pod running (no CrashLoopBackOff)
- [ ] Verify Promtail pods running (depends on Loki)

### Issue #2: Backup Cronjob (15 minutes)
- [ ] Apply cronjob manifest
- [ ] Verify cronjob scheduled
- [ ] Create test backup job
- [ ] Verify backup file created

### Issue #3: AlertManager Endpoints (30 minutes)
- [ ] Create Slack webhook (if using Slack)
- [ ] Create SMTP credentials (if using email)
- [ ] Update AlertManager config
- [ ] Test alert delivery
- [ ] Clean up test alert

### Post-Remediation (5 minutes)
- [ ] All pods healthy
- [ ] All services responding
- [ ] Document any manual steps for runbook
- [ ] Sign-off: Ready for production deployment

---

## Rollback Plan (If Remediation Fails)

**If Loki fix fails:**
```bash
# Loki is non-blocking, so production can deploy without it.
# Remove the failing StatefulSet instead of leaving it in CrashLoopBackOff
kubectl delete statefulset loki -n gravl-logging
```

**If Backup deployment fails:**
```bash
# Remove the cronjob
kubectl delete cronjob postgres-backup-cronjob -n gravl-production
# Schedule manual backup before production launch
```

**If AlertManager config breaks:**
```bash
# ConfigMaps have no rollout history, so `kubectl rollout undo` does not apply.
# Re-apply a copy of the config saved before editing
# (the file name below is a placeholder for that copy)
kubectl apply -f alertmanager-config.backup.yaml
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
```

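Rolling back the AlertManager config only works if a copy of the ConfigMap was taken before it was edited, because ConfigMaps keep no revision history. A minimal sketch of such a snapshot step follows; the helper name `backup_configmap` and the timestamped file naming are illustrative choices, not project conventions. It strips the server-set metadata so the file can be re-applied later without a stale `resourceVersion` conflict. Usage: `backup_configmap alertmanager-config gravl-monitoring` prints the name of the file it wrote, which can later be restored with `kubectl apply -f`.

```bash
# Save the current ConfigMap to a timestamped YAML file and print its name.
backup_configmap() {
  local name="$1" namespace="$2" out yaml
  out="${name}.$(date -u +%Y%m%dT%H%M%SZ).yaml"
  yaml=$(kubectl get configmap "$name" -n "$namespace" -o yaml) || return 1
  # Drop server-set metadata fields (two-space indent under metadata:)
  printf '%s\n' "$yaml" \
    | grep -Ev '^  (resourceVersion|uid|creationTimestamp):' > "$out" \
    && echo "$out"
}
```
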
---

## Success Criteria

✅ **Loki operational** (pod running, no CrashLoopBackOff)
✅ **Promtail operational** (logs flowing)
✅ **Backup cronjob deployed** (scheduled, tested)
✅ **AlertManager endpoints configured** (test alert received)
✅ **No new pod restarts** (stable for 5 minutes)

---

**Document Version:** 1.0
**Created:** 2026-03-06 20:16 UTC
**Estimated Implementation Time:** 2-3 hours
**Priority:** Critical path to production