Production Readiness Implementation Plan

Phase 10-07, Task 5 — EXECUTION ROADMAP

Date: 2026-03-07
Status: IMPLEMENTATION READY
Owner: Backend-Dev (execution) + Architect (oversight)
Target Completion: +6-8 hours from start (by ~09:30-11:30 CET Saturday)


Executive Summary

Task 5 (Production Readiness Review) has 4 critical blockers preventing production launch. This document provides the exact implementation steps for each blocker with pre-written Kubernetes manifests and validation procedures.

All 4 blockers have templates ready in /workspace/gravl/k8s/production/:

  1. cert-manager-setup.yaml — TLS automation
  2. sealed-secrets-setup.yaml — Secrets encryption
  3. network-policy-with-dns.yaml — Network egress fix
  4. load-test.js + execution instructions

Critical Path Execution (Ordered by Dependency)

Blocker 1: TLS/cert-manager Setup (Dependency: None)

File: k8s/production/cert-manager-setup.yaml
Status: READY FOR IMPLEMENTATION

Steps:

# 1. Install cert-manager controller (official release)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

# 2. Verify installation
kubectl rollout status deployment/cert-manager-webhook -n cert-manager --timeout=120s
kubectl rollout status deployment/cert-manager -n cert-manager --timeout=120s

# 3. Apply ClusterIssuers (Let's Encrypt prod + staging)
kubectl apply -f k8s/production/cert-manager-setup.yaml

# 4. Verify issuers created
kubectl get clusterissuer
# Expected output:
# NAME                   READY   AGE
# letsencrypt-prod       True    2m
# letsencrypt-staging    True    2m
# selfsigned-issuer      True    2m

# 5. Create Cloudflare API token secret (MANUAL)
kubectl create secret generic cloudflare-api-token \
  --from-literal=api-token=YOUR_CLOUDFLARE_API_TOKEN \
  -n cert-manager

# 6. Update Ingress with cert-manager annotation (already in template)
# Ingress automatically requests certificate once annotation is set
kubectl apply -f k8s/production/cert-manager-setup.yaml

# 7. Verify certificate creation
kubectl get certificate -A
kubectl get secret -A | grep gravl-tls-prod

Validation Checklist:

  • cert-manager pods running in cert-manager namespace
  • ClusterIssuers show READY=True
  • Certificate created in gravl-prod namespace
  • TLS secret gravl-tls-prod exists
  • HTTPS accessible on gravl.app + api.gravl.app
  • cert-manager logs show no errors

Estimated Duration: 10-15 minutes (certificate issuance may take 1-2 minutes)
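For orientation, a minimal ClusterIssuer of the kind step 3 applies might look like the sketch below. This is a hypothetical reconstruction, not the contents of the repo file — `k8s/production/cert-manager-setup.yaml` is authoritative — and it assumes a DNS-01 solver wired to the `cloudflare-api-token` secret created in step 5; the contact email is a placeholder.

```yaml
# Hypothetical sketch; the repo manifest is authoritative.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com            # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token   # created in step 5
              key: api-token
```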


Blocker 2: Secrets Management (Dependency: None — parallel with TLS)

File: k8s/production/sealed-secrets-setup.yaml
Status: TWO OPTIONS (choose one)

OPTION A: Bitnami sealed-secrets (cluster-local encryption, no cloud dependency)
# 1. Install sealed-secrets controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml

# 2. Verify installation
kubectl rollout status deployment/sealed-secrets-controller -n kube-system --timeout=120s

# 3. Extract sealing key (for backup + disaster recovery)
mkdir -p /secure/location
kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/status=active \
  -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d > /secure/location/sealed-secrets-prod.crt
kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/status=active \
  -o jsonpath='{.items[0].data.tls\.key}' | base64 -d > /secure/location/sealed-secrets-prod.key

# 4. Create plain secret (temporary)
cat <<PLAIN_SECRET | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: gravl-secrets
  namespace: gravl-prod
type: Opaque
data:
  DATABASE_PASSWORD: $(echo -n 'your-secure-password-32-chars-min' | base64)
  JWT_SECRET: $(openssl rand -hex 64 | tr -d '\n' | base64 | tr -d '\n')
  PGADMIN_PASSWORD: $(echo -n 'admin-password' | base64)
PLAIN_SECRET

# 5. Install kubeseal CLI (if not installed)
wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/kubeseal-0.24.0-linux-amd64.tar.gz
tar xfz kubeseal-0.24.0-linux-amd64.tar.gz kubeseal
sudo install -m 755 kubeseal /usr/local/bin/kubeseal

# 6. Seal the secret
kubeseal -o yaml -f <(kubectl get secret gravl-secrets -n gravl-prod -o yaml) -w gravl-secrets-sealed.yaml

# 7. Delete plain secret
kubectl delete secret gravl-secrets -n gravl-prod

# 8. Apply sealed secret
kubectl apply -f gravl-secrets-sealed.yaml

# 9. Verify sealed secret deployed
kubectl get sealedsecret -n gravl-prod
kubectl get secret gravl-secrets -n gravl-prod -o yaml  # Should decrypt automatically
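A note on step 4 above: the Secret's data values must be the base64 encoding of the exact secret bytes. `$( )` command substitution strips trailing newlines from its own output, but a newline emitted inside the pipeline (as `openssl rand -hex` and `echo` do) gets encoded by `base64` along with the payload. A quick sketch of the failure and the fix, using a throwaway value:

```shell
# base64 must see the exact secret bytes -- no trailing newline.
good=$(echo -n 'hunter2' | base64)            # encodes "hunter2"
bad=$(echo 'hunter2' | base64)                # encodes "hunter2\n" -- wrong
fixed=$(echo 'hunter2' | tr -d '\n' | base64) # strip the newline first

echo "$good"   # aHVudGVyMg==
echo "$bad"    # aHVudGVyMgo=  (one extra byte encoded)
echo "$fixed"  # aHVudGVyMg==
```

For long values (e.g. a 128-character JWT secret), also pipe the base64 output itself through `tr -d '\n'`, since GNU base64 wraps output at 76 columns and a wrapped value would split across YAML lines.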

OPTION B: External Secrets Operator + AWS Secrets Manager (AWS production environments)

# 1. Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm repo update
helm install external-secrets external-secrets/external-secrets \
  -n external-secrets --create-namespace

# 2. Create secrets in AWS Secrets Manager (manual AWS console or CLI)
aws secretsmanager create-secret \
  --name gravl/prod/db-password \
  --secret-string "your-secure-password-32-chars-min" \
  --region eu-west-1

aws secretsmanager create-secret \
  --name gravl/prod/jwt-secret \
  --secret-string "$(openssl rand -hex 64)" \
  --region eu-west-1

# 3. Create IAM role for IRSA (service account)
# [SEE AWS documentation for IRSA setup with external-secrets]

# 4. Apply External Secret configuration
kubectl apply -f k8s/production/sealed-secrets-setup.yaml

# 5. Verify sync
kubectl get externalsecret -n gravl-prod
kubectl describe externalsecret gravl-aws-secrets -n gravl-prod

Validation Checklist:

  • Secrets controller pod running
  • gravl-secrets secret exists (either sealed or external)
  • Backend pod can read database password from secret
  • No plain secrets in Git or etcd
  • Sealing key backed up securely

Estimated Duration: 10-15 minutes
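For Option B, the ExternalSecret that step 5 inspects might look roughly like the sketch below. This is a hypothetical shape, not the repo file — `k8s/production/sealed-secrets-setup.yaml` is authoritative — and the `ClusterSecretStore` name is assumed; the remote keys mirror the AWS Secrets Manager entries created in step 2.

```yaml
# Hypothetical sketch; the repo manifest is authoritative.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gravl-aws-secrets
  namespace: gravl-prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: gravl-secrets            # the Secret backend pods read
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: gravl/prod/db-password
    - secretKey: JWT_SECRET
      remoteRef:
        key: gravl/prod/jwt-secret
```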


Blocker 3: Network Policy DNS Egress (Dependency: None — parallel)

File: k8s/production/network-policy-with-dns.yaml
Status: READY FOR IMPLEMENTATION

# 1. Label kube-system namespace (if not already labeled)
kubectl label namespace kube-system name=kube-system --overwrite

# 2. Apply updated network policies with DNS egress
kubectl apply -f k8s/production/network-policy-with-dns.yaml

# 3. Verify policies created
kubectl get networkpolicy -n gravl-prod
# Expected output:
# NAME                        POD-SELECTOR   AGE
# gravl-default-deny          (empty)        1m
# allow-from-ingress          app=backend    1m
# allow-ingress-to-frontend   app=frontend   1m
# allow-backend-to-db         app=postgres   1m
# allow-monitoring-scrape     (empty)        1m
# allow-dns-egress            (empty)        1m
# allow-backend-db-egress     app=backend    1m
# allow-external-apis         app=backend    1m
# allow-frontend-cdn-egress   app=frontend   1m

# 4. Test DNS resolution from backend pod
kubectl exec -n gravl-prod deployment/backend -- nslookup gravl.app
# Expected: resolves to external IP

# 5. Test inter-pod communication still works
kubectl exec -n gravl-prod deployment/backend -- nc -zv postgres 5432
# Expected: Connection successful

# 6. Test Prometheus scraping (should still work)
kubectl logs -n gravl-monitoring deployment/prometheus | grep "gravl-prod"
# Expected: scraping gravl-prod endpoints successfully

Validation Checklist:

  • All network policies created successfully
  • DNS queries work (nslookup/dig successful)
  • Backend → Database connectivity functional
  • Prometheus scraping operational
  • Ingress-nginx → backend traffic flowing

Estimated Duration: 5-10 minutes
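The critical piece of `network-policy-with-dns.yaml` is the DNS egress rule — without it, the default-deny egress posture silently breaks all name resolution. A minimal version of `allow-dns-egress` might look like the sketch below (the repo file is authoritative); it assumes CoreDNS runs in kube-system and relies on the `name=kube-system` label applied in step 1.

```yaml
# Hypothetical sketch; the repo manifest is authoritative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: gravl-prod
spec:
  podSelector: {}              # all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system   # label applied in step 1
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

The empty `podSelector` matches the `(empty)` POD-SELECTOR column shown in the expected `kubectl get networkpolicy` output above.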


Blocker 4: Load Test Baseline (Dependency: All previous blockers complete)

File: k8s/production/load-test.js
Status: READY FOR EXECUTION

# 1. Install k6 CLI (if not already installed)
# macOS: brew install k6
# Linux: add Grafana's package repository first (see the k6 install docs), then apt-get install k6
# Or Docker: docker run --rm -v $(pwd):/scripts grafana/k6:latest run /scripts/load-test.js

k6 --version
# Expected: k6 v0.49.0+

# 2. Run load test against staging environment
export GRAVL_API_URL="https://staging.gravl.app"
k6 run k8s/production/load-test.js

# 3. Observe results in real-time:
# • Requests/sec
# • p95 latency
# • p99 latency
# • Error rate
# • Active connections

# 4. Expected baseline (PASS criteria):
# ✓ p95 latency: <200ms
# ✓ p99 latency: <500ms
# ✓ Error rate: <0.1%
# ✓ Throughput: >100 req/s

# 5. Save results to file for documentation
k6 run --out json=load-test-results.json k8s/production/load-test.js

# 6. Upload results to shared documentation
mv load-test-results.json docs/load-test-baseline-2026-03-07.json
git add docs/load-test-baseline-*.json
git commit -m "Load test baseline: p95 <200ms, error rate <0.1%"

Validation Checklist:

  • k6 installed and executable
  • Load test completes without script errors
  • p95 latency < 200ms
  • p99 latency < 500ms
  • Error rate < 0.1%
  • Results documented in docs/load-test-baseline-2026-03-07.json

Estimated Duration: 5-10 minutes (test runs for 5 minutes)
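The PASS criteria above can be enforced by k6 itself via thresholds, so the run exits non-zero on a failed baseline. A sketch of what the `options` export in a script like load-test.js could contain — the repo's `k8s/production/load-test.js` is authoritative, and the stage durations and targets here are illustrative:

```javascript
// Hypothetical sketch; the repo's load-test.js is authoritative.
export const options = {
  stages: [
    { duration: '1m', target: 50 },   // ramp up
    { duration: '3m', target: 100 },  // sustained load (>100 req/s target)
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<200', 'p(99)<500'], // ms
    http_req_failed: ['rate<0.001'],               // <0.1% errors
  },
};
```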


Production Readiness Sign-Off Template

Once all blockers are complete, update PRODUCTION_READINESS.md with final sign-offs:

## Final Sign-Off (2026-03-07)

### Security Review ✅ APPROVED
- [x] RBAC: Least privilege verified
- [x] Network Policies: Default deny + explicit allowlist (DNS egress added)
- [x] Secrets Management: sealed-secrets OR External Secrets Operator deployed
- [x] TLS/Encryption: cert-manager + Let's Encrypt configured
- [x] Image Scanning: Scheduled for [DATE]

### Performance Validation ✅ APPROVED
- [x] Load test baseline: p95 <200ms, error rate <0.1%
- [x] Database performance: Query latency acceptable
- [x] Pod resource limits: Configured and validated

### Operations Readiness ✅ APPROVED
- [x] Monitoring: Prometheus + Grafana operational
- [x] Alerting: AlertManager configured with receivers
- [x] Logging: [Loki workaround OR alternative configured]
- [x] Backup: Daily + weekly jobs validated
- [x] Runbooks: Created and tested

### Go-Live Authorization: ✅ APPROVED
**Authorized by:** [Architect/PM name]  
**Date:** 2026-03-07  
**Conditions:** All critical path items complete, load test passing, monitoring alerts active

Rollback Readiness

If any blocker fails production testing:

# 1. Immediate rollback to staging-only:
kubectl scale deployment --all -n gravl-prod --replicas=0

# 2. Disable cert-manager for Ingress (revert to self-signed):
kubectl patch ingress gravl-ingress -n gravl-prod --type json \
  -p='[{"op":"remove","path":"/metadata/annotations/cert-manager.io~1cluster-issuer"}]'

# 3. Restore pre-cert-manager Ingress:
kubectl apply -f k8s/staging/ingress.yaml

# 4. Alert team: "Production deployment rolled back — investigation required"

Success Criteria

Phase 10-07 is COMPLETE when:

  • All 4 critical blockers resolved
  • Load test baseline documented (p95 <200ms)
  • Security sign-off checklist approved
  • Monitoring + alerting operational
  • Team authorization obtained
  • Go-live procedure documented

Ready to proceed to production launch.


Timeline Summary

| Blocker | Duration | Start | End |
|---|---|---|---|
| 1. cert-manager setup | 10-15 min | 03:40 | 03:55 |
| 2. Secrets mgmt (parallel) | 10-15 min | 03:40 | 03:55 |
| 3. Network policy (parallel) | 5-10 min | 03:40 | 03:50 |
| 4. Load test | 5-10 min | 04:00 | 04:10 |
| Total | 6-8 hours | 03:40 | ~09:30-11:30 |

(Includes buffer for kubectl wait times, certificate issuance, etc.)


Document Version: 2.0 (Implementation Ready)
Last Updated: 2026-03-07 03:45
Owner: Gravl PM Autonomy / Architect
Next Review: Before production launch