# Blocking Issues Remediation Guide
**Date:** 2026-03-06
**Status:** READY TO IMPLEMENT
**Priority:** Critical path to production launch
---
## Overview
Three blocking issues were identified during the production readiness review (Task 10-07-05):
1. Loki storage misconfiguration (CrashLoopBackOff)
2. Backup cronjob not deployed
3. AlertManager endpoints not configured
This guide provides step-by-step fixes for each. Estimated total remediation time: **2-3 hours**.
---
## Issue #1: Loki Storage Misconfiguration
### Symptom
```bash
kubectl get pods -n gravl-logging
# loki-0 0/1 CrashLoopBackOff 161 (4m37s ago) 13h
# promtail-7d8qf 0/1 CrashLoopBackOff 199 (70s ago) 16h
```
### Root Cause
Loki StatefulSet configured to use StorageClass `standard`, but K3s only provides `local-path`.
### Fix Option A: emptyDir (Staging Only - Logs Discarded on Pod Restart)
```bash
# volumeClaimTemplates are immutable on a live StatefulSet, so export
# the manifest, edit it, and recreate it (--cascade=orphan keeps pods):
kubectl get statefulset loki -n gravl-logging -o yaml > loki-statefulset.yaml

# In loki-statefulset.yaml, replace the volumeClaimTemplates entry
# Before:
#   volumeClaimTemplates:
#   - metadata:
#       name: loki-storage
#     spec:
#       storageClassName: standard
#       accessModes: [ "ReadWriteOnce" ]
#       resources:
#         requests:
#           storage: 10Gi
# After (an emptyDir volume in the pod template, STAGING ONLY):
#   volumes:
#   - name: loki-storage
#     emptyDir: {}
kubectl delete statefulset loki -n gravl-logging --cascade=orphan
kubectl apply -f loki-statefulset.yaml

# Restart the pod to pick up changes
kubectl delete pod loki-0 -n gravl-logging
kubectl rollout status statefulset/loki -n gravl-logging
```
**Verification:**
```bash
kubectl logs loki-0 -n gravl-logging | tail -20
# Should show Loki starting cleanly (e.g. msg="Loki started"), with the
# pod no longer in CrashLoopBackOff
```
### Fix Option B: Use Existing local-path StorageClass (Recommended for Production)
```bash
# Verify the available StorageClass
kubectl get storageclass
# NAME                   PROVISIONER             RECLAIMPOLICY
# local-path (default)   rancher.io/local-path   Delete

# volumeClaimTemplates are immutable on a live StatefulSet, so export
# the manifest, edit it, and recreate it (--cascade=orphan keeps pods):
kubectl get statefulset loki -n gravl-logging -o yaml > loki-statefulset.yaml
# Edit loki-statefulset.yaml: set
#   spec.volumeClaimTemplates[0].spec.storageClassName: local-path
kubectl delete statefulset loki -n gravl-logging --cascade=orphan

# Delete the old pod and PVC so a new claim is provisioned via local-path
kubectl delete pod loki-0 -n gravl-logging
kubectl delete pvc loki-storage-loki-0 -n gravl-logging
kubectl apply -f loki-statefulset.yaml
kubectl rollout status statefulset/loki -n gravl-logging
```
**Verification:**
```bash
kubectl get pvc -n gravl-logging
# loki-storage-loki-0   Bound   pvc-xxx   10Gi   local-path
kubectl logs loki-0 -n gravl-logging | tail -5
# Should show Loki starting cleanly (e.g. msg="Loki started")
```
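For reference, the corrected `volumeClaimTemplates` section should look roughly like this (names and the 10Gi size taken from the commands above; adjust to your actual manifest):

```yaml
volumeClaimTemplates:
- metadata:
    name: loki-storage
  spec:
    storageClassName: local-path
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 10Gi
```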
### Fix Option C: Deploy External Storage Provisioner (Production Best Practice)
If you have AWS/Azure/external storage available:
```bash
# Example: AWS EBS provisioner
helm repo add ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm install aws-ebs-csi-driver ebs-csi-driver/aws-ebs-csi-driver -n kube-system
# Create StorageClass
cat << 'YAML' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
YAML

# Update Loki to use ebs-gp3: volumeClaimTemplates cannot be patched in
# place, so recreate the StatefulSet with storageClassName: ebs-gp3
# (export, edit, delete with --cascade=orphan, re-apply)
```
**Timeline:**
- Option A (emptyDir): 5 minutes
- Option B (local-path): 15 minutes
- Option C (external provisioner): 1 hour
**Recommendation:** Use **Option A for staging** (immediate), **Option B or C for production** (ensure persistent storage).
---
## Issue #2: Backup Cronjob Not Deployed
### Symptom
```bash
kubectl get cronjob -A | grep backup
# (no results)
```
### Root Cause
Backup cronjob manifest exists (`k8s/backup/postgres-backup-cronjob.yaml`) but has never been applied to the cluster.
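For review purposes, a minimal `pg_dump` CronJob looks roughly like the sketch below. The image tag, secret name, host, and PVC name are placeholders; the actual manifest in `k8s/backup/` is authoritative.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup-cronjob
  namespace: gravl-production
spec:
  schedule: "0 2 * * *"            # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:16     # placeholder tag
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h postgres -U gravl_user gravl > /backups/gravl-backup-$(date +%F).sql
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # assumed secret name
                  key: password
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: postgres-backups      # assumed PVC name
```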
### Fix
**Step 1: Review backup manifest**
```bash
cat k8s/backup/postgres-backup-cronjob.yaml | head -50
```
**Step 2: Apply cronjob to cluster**
```bash
kubectl apply -f k8s/backup/postgres-backup-cronjob.yaml
```
**Step 3: Verify deployment**
```bash
kubectl get cronjob -n gravl-production
# NAME                      SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE
# postgres-backup-cronjob   0 2 * * *   False     0        <none>
kubectl describe cronjob postgres-backup-cronjob -n gravl-production
# Schedule: 0 2 * * * (Daily at 2 AM UTC)
# Concurrency Policy: Allow
# Suspend: False
```
**Step 4: Test backup job (create one-time run)**
```bash
kubectl create job --from=cronjob/postgres-backup-cronjob postgres-backup-test -n gravl-production
# Monitor job
kubectl logs job/postgres-backup-test -n gravl-production -f
# Verify backup file was created
kubectl exec -it postgres-0 -n gravl-production -- ls -la /backups/
# Should show backup file with timestamp
```
**Step 5: Test backup restoration (in staging)**
```bash
# Assuming backup file exists in pod; -f runs the file inside the pod
# (a shell `<` redirect would read from the local machine instead)
kubectl exec -it postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -f /backups/gravl-backup-latest.sql
# Verify data integrity
kubectl exec -it postgres-0 -n gravl-staging -- \
  psql -U gravl_user -d gravl -c "SELECT COUNT(*) FROM exercises;"
# Should return a non-zero count
```
**Timeline:** 15 minutes (5 min deploy + 10 min test)
**Note:** Backup storage may be local PVC (emptyDir) or external (S3, NFS). Verify storage configuration in manifest before deploying to production.
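As a quick post-deploy check, a small helper like the following (hypothetical, for a local backup directory such as `/backups`) can confirm the newest dump is non-empty and recent:

```shell
#!/bin/sh
# newest_backup DIR: prints the most recently modified .sql file in DIR,
# or nothing if none exist
newest_backup() {
  ls -1t "$1"/*.sql 2>/dev/null | head -1
}

# check_backup DIR MAX_AGE_MIN: succeeds if the newest dump in DIR is
# non-empty and was modified within the last MAX_AGE_MIN minutes
check_backup() {
  f=$(newest_backup "$1")
  [ -n "$f" ] && [ -s "$f" ] || return 1
  [ -n "$(find "$f" -mmin -"$2" 2>/dev/null)" ]
}

# Usage (e.g. from inside the postgres pod, allowing ~25h for a daily job):
#   check_backup /backups 1500 && echo "backup OK" || echo "backup STALE/MISSING"
```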
---
## Issue #3: AlertManager Endpoints Not Configured
### Symptom
```bash
kubectl describe configmap alertmanager-config -n gravl-monitoring
# Slack receiver defined but no webhook URL
# Email receiver defined but no SMTP server
```
### Root Cause
AlertManager configuration template includes receiver definitions but lacks actual credentials/endpoints.
### Fix Option A: Slack Integration
**Step 1: Create Slack webhook**
1. Go to https://api.slack.com/apps
2. Create new app → "From scratch" → select your workspace
3. Go to "Incoming Webhooks" → Enable
4. Click "Add New Webhook to Workspace"
5. Select target channel (e.g., #gravl-incidents)
6. Copy webhook URL (e.g., https://hooks.slack.com/services/T123/B456/xyz...)
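Before wiring the webhook into AlertManager, it can be smoke-tested directly. `build_payload` below is a tiny hypothetical helper that wraps a message in the JSON shape Slack incoming webhooks expect:

```shell
#!/bin/sh
# build_payload MSG: emits the minimal JSON body for a Slack incoming webhook
build_payload() {
  printf '{"text":"%s"}' "$1"
}

# Smoke test (substitute your real webhook URL):
# curl -s -X POST -H 'Content-type: application/json' \
#   --data "$(build_payload 'AlertManager webhook smoke test')" \
#   https://hooks.slack.com/services/T123/B456/xyz
# Slack replies with the literal string "ok" on success
```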
**Step 2: Update AlertManager config**
```bash
# Get current config
kubectl get configmap alertmanager-config -n gravl-monitoring -o yaml > alertmanager-config.yaml
# Edit the file to add the Slack webhook. In the receivers section,
# find slack_configs and set api_url:
# receivers:
# - name: 'slack-notifications'
#   slack_configs:
#   - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
#     channel: '#gravl-incidents'
#     title: 'Alert'
#     text: '{{ .GroupLabels }} - {{ .Alerts.Firing | len }} firing'
# Apply updated config
kubectl apply -f alertmanager-config.yaml
```
**Step 3: Reload AlertManager**
```bash
# Wait for the kubelet to sync the updated ConfigMap into the pod (this
# can take up to a minute), then send SIGHUP to reload without a restart
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
# Verify config loaded
kubectl logs alertmanager-0 -n gravl-monitoring | grep "configuration loaded"
```
**Step 4: Test alert**
```bash
# Trigger test alert
cat << 'YAML' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: gravl-monitoring
spec:
  groups:
  - name: test
    interval: 15s
    rules:
    - alert: TestAlert
      expr: vector(1)
      for: 0s
      labels:
        severity: critical
      annotations:
        summary: "Test alert firing"
YAML
# Monitor AlertManager for firing alert
kubectl port-forward -n gravl-monitoring svc/alertmanager 9093:9093
# Go to http://localhost:9093 → should see firing alert
# Check Slack channel for notification
# Should receive alert message within 30 seconds
# Clean up test alert
kubectl delete prometheusrule test-alert -n gravl-monitoring
```
### Fix Option B: Email Integration
**Step 1: Configure SMTP**
```bash
# Create Kubernetes secret for SMTP credentials
kubectl create secret generic alertmanager-smtp \
--from-literal=username=your-email@gmail.com \
--from-literal=password=your-app-password \
-n gravl-monitoring
```
**Step 2: Update AlertManager config**
```bash
# Edit alertmanager-config.yaml
# global:
#   resolve_timeout: 5m
#   smtp_from: 'alerts@gravl.example.com'
#   smtp_smarthost: 'smtp.gmail.com:587'
#   smtp_auth_username: 'your-email@gmail.com'
#   smtp_auth_password: 'your-app-password'  # Or reference from secret
#
# receivers:
# - name: 'email-notifications'
#   email_configs:
#   - to: 'team@gravl.example.com'
#     from: 'alerts@gravl.example.com'
#     smarthost: 'smtp.gmail.com:587'
#     auth_username: 'your-email@gmail.com'
#     auth_password: 'your-app-password'
#     headers:
#       Subject: 'Gravl Alert: {{ .GroupLabels.alertname }}'
kubectl apply -f alertmanager-config.yaml
```
**Step 3: Reload and test**
```bash
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
# Test with command-line tool or create test alert (see above)
```
### Fix Option C: Both Slack + Email
```yaml
# Modify route and receivers section
global:
  resolve_timeout: 5m
route:
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'slack-notifications'
    continue: true
  - match:
      severity: warning
    receiver: 'email-notifications'
receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T123/B456/xyz...'
    channel: '#gravl-incidents'
- name: 'email-notifications'
  email_configs:
  - to: 'team@gravl.example.com'
    smarthost: 'smtp.gmail.com:587'
**Timeline:**
- Option A (Slack only): 30 minutes
- Option B (Email only): 30 minutes
- Option C (Both): 45 minutes
**Recommendation:** Use **Slack + Email**. Slack for immediate visibility, email for audit trail.
---
## Consolidated Remediation Checklist
### Pre-Flight (5 minutes)
- [ ] Team notified of remediation work
- [ ] On-call engineer on standby
- [ ] Monitoring dashboard open (watch for pod restarts)
### Issue #1: Loki Storage (15 minutes)
- [ ] Choose fix option (recommend: Option B local-path)
- [ ] Apply fix
- [ ] Verify Loki pod running (no CrashLoopBackOff)
- [ ] Verify Promtail pods running (depends on Loki)
### Issue #2: Backup Cronjob (15 minutes)
- [ ] Apply cronjob manifest
- [ ] Verify cronjob scheduled
- [ ] Create test backup job
- [ ] Verify backup file created
### Issue #3: AlertManager Endpoints (30 minutes)
- [ ] Create Slack webhook (if using Slack)
- [ ] Create SMTP credentials (if using email)
- [ ] Update AlertManager config
- [ ] Test alert delivery
- [ ] Clean up test alert
### Post-Remediation (5 minutes)
- [ ] All pods healthy
- [ ] All services responding
- [ ] Document any manual steps for runbook
- [ ] Sign-off: Ready for production deployment
---
## Rollback Plan (If Remediation Fails)
**If Loki fix fails:**
```bash
# Loki is non-blocking for launch; if the fix fails, remove it and
# deploy without centralized logging for now
kubectl delete statefulset loki -n gravl-logging
```
**If Backup deployment fails:**
```bash
# Remove the failed cronjob
kubectl delete cronjob postgres-backup-cronjob -n gravl-production
# Schedule manual backup before production launch
```
**If AlertManager config breaks:**
```bash
# ConfigMaps do not support `kubectl rollout undo`; re-apply a known-good
# copy of the config saved before editing (hypothetical backup filename)
kubectl apply -f alertmanager-config-backup.yaml
kubectl exec -it alertmanager-0 -n gravl-monitoring -- kill -HUP 1
```
---
## Success Criteria
- [ ] **Loki operational** (pod running, no CrashLoopBackOff)
- [ ] **Promtail operational** (logs flowing)
- [ ] **Backup cronjob deployed** (scheduled, tested)
- [ ] **AlertManager endpoints configured** (test alert received)
- [ ] **No new pod restarts** (stable for 5 minutes)
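The "all pods healthy" check can be scripted. `count_unhealthy` below is a small hypothetical helper that filters standard `kubectl get pods -A` output (it assumes the default column layout, where STATUS is the 4th field):

```shell
#!/bin/sh
# count_unhealthy: reads `kubectl get pods -A` output on stdin and prints
# the number of pods whose STATUS is neither Running nor Completed
count_unhealthy() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Usage against the live cluster (expect 0 before sign-off):
#   kubectl get pods -A | count_unhealthy
```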
---
**Document Version:** 1.0
**Created:** 2026-03-06 20:16 UTC
**Estimated Implementation Time:** 2-3 hours
**Priority:** Critical path to production