# Blocking Issues Remediation Guide

**Date:** 2026-03-06
**Status:** READY TO IMPLEMENT
**Priority:** Critical path to production launch

---

## Overview

Three blocking issues were identified during the production readiness review (Task 10-07-05):

1. Loki storage misconfiguration (CrashLoopBackOff)
2. Backup cronjob not deployed
3. AlertManager endpoints not configured

This guide provides step-by-step fixes for each. Estimated total remediation time: **2-3 hours**.

---

## Issue #1: Loki Storage Misconfiguration

### Symptom

```bash
kubectl get pods -n gravl-logging
# loki-0           0/1   CrashLoopBackOff   161 (4m37s ago)   13h
# promtail-7d8qf   0/1   CrashLoopBackOff   199 (70s ago)     16h
```

### Root Cause

The Loki StatefulSet is configured to use StorageClass `standard`, but K3s only provides `local-path`, so the volume claim can never bind.

### Fix Option A: emptyDir (Staging Only - Logs Discarded on Pod Restart)

Note: `volumeClaimTemplates` on an existing StatefulSet are immutable, so the StatefulSet must be deleted and recreated rather than edited or patched in place.

```bash
# Export the current StatefulSet spec
kubectl get statefulset loki -n gravl-logging -o yaml > loki-statefulset.yaml

# Edit loki-statefulset.yaml (STAGING ONLY): remove the volumeClaimTemplates
# block and add an emptyDir volume under spec.template.spec instead.
# Before:
#   volumeClaimTemplates:
#   - metadata:
#       name: loki-storage
#     spec:
#       storageClassName: standard
#       accessModes: [ "ReadWriteOnce" ]
#       resources:
#         requests:
#           storage: 10Gi
#
# After (under spec.template.spec):
#   volumes:
#   - name: loki-storage
#     emptyDir: {}

# Recreate the StatefulSet with the new spec
kubectl delete statefulset loki -n gravl-logging
kubectl apply -f loki-statefulset.yaml
kubectl rollout status statefulset/loki -n gravl-logging
```

**Verification:**

```bash
kubectl get pods -n gravl-logging
# loki-0 should reach 1/1 Running (no CrashLoopBackOff)

kubectl logs loki-0 -n gravl-logging | tail -20
# Startup should complete cleanly with no panic or storage errors
```

### Fix Option B: Use Existing local-path StorageClass (Recommended for Production)

```bash
# Verify the available StorageClass
kubectl get storageclass
# NAME                   PROVISIONER             RECLAIMPOLICY
# local-path (default)   rancher.io/local-path   Delete

# volumeClaimTemplates are immutable, so update storageClassName to local-path
# in the Loki StatefulSet manifest, then delete and recreate the StatefulSet.
kubectl delete statefulset loki -n gravl-logging
kubectl delete pvc loki-storage-loki-0 -n gravl-logging
kubectl apply -f loki-statefulset.yaml   # updated manifest with storageClassName: local-path
kubectl rollout status statefulset/loki -n gravl-logging
```

**Verification:**

```bash
kubectl get pvc -n gravl-logging
# loki-storage-loki-0   Bound   pvc-xxx   10Gi   local-path

kubectl logs loki-0 -n gravl-logging | tail -5
# Startup should complete cleanly with no storage errors
```

### Fix Option C: Deploy External Storage Provisioner (Production Best Practice)

If you have AWS/Azure/external storage available:

```bash
# Example: AWS EBS CSI driver
helm repo add ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm install aws-ebs-csi-driver ebs-csi-driver/aws-ebs-csi-driver -n kube-system

# Create StorageClass
cat << 'YAML' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
YAML

# Update storageClassName to ebs-gp3 in the Loki StatefulSet manifest,
# then delete and recreate the StatefulSet as in Option B.
```

**Timeline:**

- Option A (emptyDir): 5 minutes
- Option B (local-path): 15 minutes
- Option C (external provisioner): 1 hour

**Recommendation:** Use **Option A for staging** (immediate) and **Option B or C for production** (persistent storage required).
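For Option B, the re-applied manifest only needs `storageClassName` changed. The sketch below shows the relevant StatefulSet shape; everything outside `volumeClaimTemplates` (image tag, labels, mount path) is an assumption for illustration, not the repo's actual Loki manifest.

```yaml
# Sketch only: values outside volumeClaimTemplates are assumptions, not the
# actual Loki manifest in this repo. The storageClassName change is the point.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: gravl-logging
spec:
  serviceName: loki
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:2.9.4        # assumed tag
          ports:
            - containerPort: 3100
          volumeMounts:
            - name: loki-storage
              mountPath: /loki
  volumeClaimTemplates:
    - metadata:
        name: loki-storage
      spec:
        storageClassName: local-path       # was: standard
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```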
---

## Issue #2: Backup Cronjob Not Deployed

### Symptom

```bash
kubectl get cronjob -A | grep backup
# (no results)
```

### Root Cause

The backup cronjob manifest exists (`k8s/backup/postgres-backup-cronjob.yaml`) but has never been applied to the cluster.

### Fix

**Step 1: Review the backup manifest**

```bash
head -50 k8s/backup/postgres-backup-cronjob.yaml
```

**Step 2: Apply the cronjob to the cluster**

```bash
kubectl apply -f k8s/backup/postgres-backup-cronjob.yaml
```

**Step 3: Verify deployment**

```bash
kubectl get cronjob -n gravl-production
# NAME                      SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE
# postgres-backup-cronjob   0 2 * * *   False     0        <none>

kubectl describe cronjob postgres-backup-cronjob -n gravl-production
# Schedule:            0 2 * * * (daily at 2 AM UTC)
# Concurrency Policy:  Allow
# Suspend:             False
```

**Step 4: Test the backup job (create a one-time run)**

```bash
kubectl create job --from=cronjob/postgres-backup-cronjob postgres-backup-test -n gravl-production

# Monitor the job
kubectl logs job/postgres-backup-test -n gravl-production -f

# Verify the backup file was created
kubectl exec -it postgres-0 -n gravl-production -- ls -la /backups/
# Should show a backup file with a timestamp
```

**Step 5: Test backup restoration (in staging)**

```bash
# Assuming the backup file is present at /backups/ inside the staging pod
kubectl exec -it postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -f /backups/gravl-backup-latest.sql

# Verify data integrity
kubectl exec -it postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -c "SELECT COUNT(*) FROM exercises;"
# Should return a non-zero count
```

**Timeline:** 15 minutes (5 min deploy + 10 min test)

**Note:** Backup storage may be a local PVC (or emptyDir) or external (S3, NFS). Verify the storage configuration in the manifest before deploying to production; a sketch of the expected manifest shape follows below.
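For the manifest review in Step 1 and the storage check in the note above, the CronJob should look roughly like this sketch. It is an illustration only: the image tag, database host, Secret name, and PVC name are assumptions, not values taken from `k8s/backup/postgres-backup-cronjob.yaml`.

```yaml
# Minimal sketch -- compare against the real manifest before applying.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup-cronjob
  namespace: gravl-production
spec:
  schedule: "0 2 * * *"            # daily at 02:00 UTC, as described above
  concurrencyPolicy: Allow
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16   # assumed image/tag
              command: ["/bin/sh", "-c"]
              args:
                - >-
                  pg_dump -h postgres -U gravl_user -d gravl
                  -f /backups/gravl-backup-$(date +%Y%m%d-%H%M%S).sql
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials   # assumed Secret name
                      key: password
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: postgres-backups        # assumed PVC name
```

If the real manifest writes to an emptyDir rather than a PVC or external target, backups will not survive pod deletion, which is exactly the storage concern flagged in the note above.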
---

## Issue #3: AlertManager Endpoints Not Configured

### Symptom

```bash
kubectl describe configmap alertmanager-config -n gravl-monitoring
# Slack receiver defined but no webhook URL
# Email receiver defined but no SMTP server
```

### Root Cause

The AlertManager configuration template includes receiver definitions but lacks the actual credentials/endpoints.

### Fix Option A: Slack Integration

**Step 1: Create a Slack webhook**

1. Go to https://api.slack.com/apps
2. Create new app → "From scratch" → select your workspace
3. Go to "Incoming Webhooks" → Enable
4. Click "Add New Webhook to Workspace"
5. Select the target channel (e.g., #gravl-incidents)
6. Copy the webhook URL (e.g., https://hooks.slack.com/services/T123/B456/xyz...)

**Step 2: Update the AlertManager config**

```bash
# Export the current config and keep an untouched copy for rollback
kubectl get configmap alertmanager-config -n gravl-monitoring -o yaml > alertmanager-config.yaml
cp alertmanager-config.yaml alertmanager-config.yaml.orig

# Edit alertmanager-config.yaml: add the webhook URL, either as the global
# 'slack_api_url' or as 'api_url' on the receiver:
# receivers:
# - name: 'slack-notifications'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
#     channel: '#gravl-incidents'
#     title: 'Alert'
#     text: '{{ .GroupLabels }} - {{ .Alerts.Firing | len }} firing'

# Apply the updated config
kubectl apply -f alertmanager-config.yaml
```

**Step 3: Reload AlertManager**

```bash
# The kubelet can take up to ~1 minute to sync the updated ConfigMap into the pod.
# Then send SIGHUP to AlertManager to reload the config without restarting:
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1

# Verify the config loaded
kubectl logs alertmanager-0 -n gravl-monitoring | grep -i "loading of configuration"
# Should show the configuration file being (re)loaded with no errors
```

**Step 4: Test alert**

```bash
# Trigger a test alert (assumes Prometheus Operator; the rule must carry any
# labels your Prometheus ruleSelector requires)
cat << 'YAML' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: gravl-monitoring
spec:
  groups:
  - name: test
    interval: 15s
    rules:
    - alert: TestAlert
      expr: vector(1)
      for: 0s
      labels:
        severity: critical
      annotations:
        summary: "Test alert firing"
YAML

# Monitor AlertManager for the firing alert
kubectl port-forward -n gravl-monitoring svc/alertmanager 9093:9093
# Go to http://localhost:9093 → should see the firing alert

# Check the Slack channel for the notification
# Should receive the alert message within 30 seconds

# Clean up the test alert
kubectl delete prometheusrule test-alert -n gravl-monitoring
```

### Fix Option B: Email Integration

**Step 1: Configure SMTP**

```bash
# Create a Kubernetes secret for SMTP credentials
kubectl create secret generic alertmanager-smtp \
  --from-literal=username=your-email@gmail.com \
  --from-literal=password=your-app-password \
  -n gravl-monitoring
```

**Step 2: Update the AlertManager config**

AlertManager cannot read Kubernetes Secrets directly; either mount the secret into the pod and reference it via a `*_file` option (supported in recent AlertManager versions), or template the password into the config at deploy time.

```bash
# Edit alertmanager-config.yaml
# global:
#   resolve_timeout: 5m
#   smtp_from: 'alerts@gravl.example.com'
#   smtp_smarthost: 'smtp.gmail.com:587'
#   smtp_auth_username: 'your-email@gmail.com'
#   smtp_auth_password: 'your-app-password'   # or reference a mounted secret file
#
# receivers:
# - name: 'email-notifications'
#   email_configs:
#   - to: 'team@gravl.example.com'
#     from: 'alerts@gravl.example.com'
#     smarthost: 'smtp.gmail.com:587'
#     auth_username: 'your-email@gmail.com'
#     auth_password: 'your-app-password'
#     headers:
#       Subject: 'Gravl Alert: {{ .GroupLabels.alertname }}'

kubectl apply -f alertmanager-config.yaml
```

**Step 3: Reload and test**

```bash
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
# Test by creating a test alert (see Step 4 above, or the direct API check below)
```

### Fix Option C: Both Slack + Email

```yaml
# Modify the route and receivers sections
global:
  resolve_timeout: 5m

route:
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'slack-notifications'
    continue: true
  - match:
      severity: warning
    receiver: 'email-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
    channel: '#gravl-incidents'
- name: 'email-notifications'
  email_configs:
  - to: 'team@gravl.example.com'
    smarthost: 'smtp.gmail.com:587'
```

**Timeline:**

- Option A (Slack only): 30 minutes
- Option B (Email only): 30 minutes
- Option C (Both): 45 minutes

**Recommendation:** Use **Slack + Email**: Slack for immediate visibility, email for the audit trail.
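As a complement to the PrometheusRule test, delivery can also be spot-checked by pushing a synthetic alert straight to the AlertManager v2 API. This sketch assumes the `kubectl port-forward` from Step 4 is still running and that `severity: critical` routes to a configured receiver; the alert name is arbitrary.

```bash
# Push a synthetic alert directly to AlertManager (v2 API), bypassing Prometheus.
# Assumes: kubectl port-forward -n gravl-monitoring svc/alertmanager 9093:9093
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {
          "alertname": "ManualDeliveryCheck",
          "severity": "critical"
        },
        "annotations": {
          "summary": "Manual AlertManager delivery check"
        }
      }]'

# Confirm the alert is registered, then check the Slack channel / inbox
curl -s http://localhost:9093/api/v2/alerts | grep -o 'ManualDeliveryCheck'
```

Because no `endsAt` is supplied, the alert should expire on its own after the configured `resolve_timeout`; no cleanup is needed.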
---

## Consolidated Remediation Checklist

### Pre-Flight (5 minutes)

- [ ] Team notified of remediation work
- [ ] On-call engineer on standby
- [ ] Monitoring dashboard open (watch for pod restarts)

### Issue #1: Loki Storage (15 minutes)

- [ ] Choose fix option (recommended: Option B, local-path)
- [ ] Apply fix
- [ ] Verify Loki pod running (no CrashLoopBackOff)
- [ ] Verify Promtail pods running (depends on Loki)

### Issue #2: Backup Cronjob (15 minutes)

- [ ] Apply cronjob manifest
- [ ] Verify cronjob scheduled
- [ ] Create test backup job
- [ ] Verify backup file created

### Issue #3: AlertManager Endpoints (30 minutes)

- [ ] Create Slack webhook (if using Slack)
- [ ] Create SMTP credentials (if using email)
- [ ] Update AlertManager config
- [ ] Test alert delivery
- [ ] Clean up test alert

### Post-Remediation (5 minutes)

- [ ] All pods healthy
- [ ] All services responding
- [ ] Document any manual steps for the runbook
- [ ] Sign-off: ready for production deployment

---

## Rollback Plan (If Remediation Fails)

**If the Loki fix fails:**

```bash
# Loki is not on the request path, so the launch can proceed without it.
# Remove the StatefulSet and deploy without centralized log storage:
kubectl delete statefulset loki -n gravl-logging
```

**If the backup deployment fails:**

```bash
# Remove the cronjob
kubectl delete cronjob postgres-backup-cronjob -n gravl-production

# Schedule a manual backup before the production launch
```

**If the AlertManager config breaks:**

```bash
# Re-apply the untouched copy saved before editing (see Issue #3, Step 2)
kubectl apply -f alertmanager-config.yaml.orig
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
```

---

## Success Criteria

✅ **Loki operational** (pod running, no CrashLoopBackOff)
✅ **Promtail operational** (logs flowing)
✅ **Backup cronjob deployed** (scheduled, tested)
✅ **AlertManager endpoints configured** (test alert received)
✅ **No new pod restarts** (stable for 5 minutes)

---

**Document Version:** 1.0
**Created:** 2026-03-06 20:16 UTC
**Estimated Implementation Time:** 2-3 hours
**Priority:** Critical path to production