# Production Readiness Implementation Plan

# Phase 10-07, Task 5 — EXECUTION ROADMAP

**Date:** 2026-03-07

**Status:** IMPLEMENTATION READY

**Owner:** Backend-Dev (execution) + Architect (oversight)

**Target Completion:** +6-8 hours from start (by ~09:30-11:30 CET Saturday)

---

## Executive Summary

Task 5 (Production Readiness Review) has **4 critical blockers** preventing production launch. This document provides the exact implementation steps for each blocker with pre-written Kubernetes manifests and validation procedures.
**All 4 blockers have templates ready in `/workspace/gravl/k8s/production/`:**

1. `cert-manager-setup.yaml` — TLS automation
2. `sealed-secrets-setup.yaml` — Secrets encryption
3. `network-policy-with-dns.yaml` — Network egress fix
4. `load-test.js` + execution instructions

---
## Critical Path Execution (Ordered by Dependency)

### ✅ Blocker 1: TLS/cert-manager Setup (Dependency: None)

**File:** `k8s/production/cert-manager-setup.yaml`

**Status:** READY FOR IMPLEMENTATION

#### Steps:
```bash
# 1. Install cert-manager controller (official release)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

# 2. Verify installation
kubectl rollout status deployment/cert-manager-webhook -n cert-manager --timeout=120s
kubectl rollout status deployment/cert-manager -n cert-manager --timeout=120s

# 3. Apply ClusterIssuers (Let's Encrypt prod + staging)
kubectl apply -f k8s/production/cert-manager-setup.yaml

# 4. Verify issuers created (ClusterIssuers are cluster-scoped; no namespace flag needed)
kubectl get clusterissuer
# Expected output:
# NAME                  READY   AGE
# letsencrypt-prod      True    2m
# letsencrypt-staging   True    2m
# selfsigned-issuer     True    2m

# 5. Create Cloudflare API token secret (MANUAL)
kubectl create secret generic cloudflare-api-token \
  --from-literal=api-token=YOUR_CLOUDFLARE_API_TOKEN \
  -n cert-manager

# 6. Re-apply to update the Ingress with the cert-manager annotation (already in template).
#    The Ingress automatically requests a certificate once the annotation is set.
kubectl apply -f k8s/production/cert-manager-setup.yaml

# 7. Verify certificate creation
kubectl get certificate -A
kubectl get secret -A | grep gravl-tls-prod
```
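For reference, the `letsencrypt-prod` issuer applied in step 3 presumably looks like the sketch below. This is an illustration only, assuming the template uses the ACME DNS-01 solver with Cloudflare (which would match the `cloudflare-api-token` secret created in step 5); it is not a copy of the actual manifest, and the contact email is a placeholder:

```yaml
# Hypothetical sketch; the real manifest lives in k8s/production/cert-manager-setup.yaml.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@gravl.app                     # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key     # ACME account key storage
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token     # created in step 5
              key: api-token
```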
#### Validation Checklist:

- [ ] cert-manager pods running in cert-manager namespace
- [ ] ClusterIssuers show READY=True
- [ ] Certificate created in gravl-prod namespace
- [ ] TLS secret `gravl-tls-prod` exists
- [ ] HTTPS accessible on gravl.app + api.gravl.app
- [ ] cert-manager logs show no errors

**Estimated Duration:** 10-15 minutes (certificate issuance may take 1-2 minutes)
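The HTTPS checklist items can be spot-checked from the command line. A minimal sketch, assuming GNU `date` and `openssl` on the PATH; the `check_not_expired` helper is invented here for illustration, not part of the repo:

```shell
#!/usr/bin/env bash
# Return 0 if a "notAfter=..." line from `openssl x509 -enddate` is in the future.
check_not_expired() {
  local not_after="${1#notAfter=}"          # strip the "notAfter=" prefix
  local expiry_epoch now_epoch
  expiry_epoch=$(date -d "$not_after" +%s)  # GNU date parses openssl's date format
  now_epoch=$(date +%s)
  [ "$expiry_epoch" -gt "$now_epoch" ]
}

# Live usage (requires network access to the cluster's Ingress):
#   openssl s_client -connect gravl.app:443 -servername gravl.app </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate | { read -r line; check_not_expired "$line" && echo OK; }
check_not_expired "notAfter=Dec 31 23:59:59 2099 GMT" && echo "cert valid"
```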

---

### ✅ Blocker 2: Secrets Management (Dependency: None — parallel with TLS)

**File:** `k8s/production/sealed-secrets-setup.yaml`

**Status:** TWO OPTIONS (choose one)

#### OPTION A: sealed-secrets (kubeseal) — RECOMMENDED for simplicity
```bash
# 1. Install sealed-secrets controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml

# 2. Verify installation
kubectl rollout status deployment/sealed-secrets-controller -n kube-system --timeout=120s

# 3. Extract sealing key (for backup + disaster recovery)
mkdir -p /secure/location
kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/status=active \
  -o jsonpath='{.items[0].data.tls\.crt}' | base64 -d > /secure/location/sealed-secrets-prod.crt
kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/status=active \
  -o jsonpath='{.items[0].data.tls\.key}' | base64 -d > /secure/location/sealed-secrets-prod.key

# 4. Write the plain secret to a LOCAL file only. Do not apply it to the cluster,
#    or the plaintext lands in etcd before it is ever sealed.
#    Note: base64 -w0 disables line wrapping, which would otherwise corrupt the
#    longer values inside the YAML data block.
cat <<PLAIN_SECRET > gravl-secrets-plain.yaml
apiVersion: v1
kind: Secret
metadata:
  name: gravl-secrets
  namespace: gravl-prod
type: Opaque
data:
  DATABASE_PASSWORD: $(echo -n 'your-secure-password-32-chars-min' | base64 -w0)
  JWT_SECRET: $(openssl rand -hex 64 | tr -d '\n' | base64 -w0)
  PGADMIN_PASSWORD: $(echo -n 'admin-password' | base64 -w0)
PLAIN_SECRET

# 5. Install kubeseal CLI (if not installed)
wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/kubeseal-0.24.0-linux-amd64.tar.gz
tar xfz kubeseal-0.24.0-linux-amd64.tar.gz kubeseal
mv kubeseal /usr/local/bin/

# 6. Seal the secret (kubeseal fetches the public sealing cert from the controller)
kubeseal -f gravl-secrets-plain.yaml -w gravl-secrets-sealed.yaml

# 7. Delete the plain file
rm gravl-secrets-plain.yaml

# 8. Apply sealed secret
kubectl apply -f gravl-secrets-sealed.yaml

# 9. Verify sealed secret deployed
kubectl get sealedsecret -n gravl-prod
kubectl get secret gravl-secrets -n gravl-prod -o yaml  # Controller decrypts it automatically
```
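The `base64 -w0` flags in step 4 matter: GNU `base64` wraps output at 76 characters by default, and a wrapped value inside a YAML `data:` block is invalid. A quick demonstration in plain shell (no cluster needed; the 128-character stand-in mimics what `openssl rand -hex 64` produces):

```shell
#!/usr/bin/env bash
# A 128-char string encodes to ~172 base64 chars, past the default 76-char wrap,
# so the default encoding gains embedded newlines.
long_value=$(printf 'a%.0s' {1..128})               # 128 'a' characters as a stand-in

wrapped=$(printf '%s' "$long_value" | base64)       # default: wrapped at 76 chars
unwrapped=$(printf '%s' "$long_value" | base64 -w0) # -w0: single line

echo "wrapped line count:   $(printf '%s\n' "$wrapped" | wc -l)"
echo "unwrapped line count: $(printf '%s\n' "$unwrapped" | wc -l)"

# The content round-trips correctly either way; only the layout differs.
[ "$(printf '%s' "$unwrapped" | base64 -d)" = "$long_value" ] && echo "round-trip OK"
```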

#### OPTION B: External Secrets Operator + AWS Secrets Manager (AWS production environments)
```bash
# 1. Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm repo update
helm install external-secrets external-secrets/external-secrets \
  -n external-secrets --create-namespace

# 2. Create secrets in AWS Secrets Manager (manual AWS console or CLI)
aws secretsmanager create-secret \
  --name gravl/prod/db-password \
  --secret-string "your-secure-password-32-chars-min" \
  --region eu-west-1

aws secretsmanager create-secret \
  --name gravl/prod/jwt-secret \
  --secret-string "$(openssl rand -hex 64)" \
  --region eu-west-1

# 3. Create IAM role for IRSA (service account)
# [SEE AWS documentation for IRSA setup with external-secrets]

# 4. Apply External Secret configuration
kubectl apply -f k8s/production/sealed-secrets-setup.yaml

# 5. Verify sync
kubectl get externalsecret -n gravl-prod
kubectl describe externalsecret gravl-aws-secrets -n gravl-prod
```
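For orientation, the `gravl-aws-secrets` resource verified in step 5 would take roughly the shape below. This is a sketch, not the shipped manifest: the `ClusterSecretStore` name is an assumption invented here, while the remote keys match those created in step 2:

```yaml
# Hypothetical sketch; the real resource ships in k8s/production/sealed-secrets-setup.yaml.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gravl-aws-secrets
  namespace: gravl-prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager          # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: gravl-secrets                # Kubernetes Secret the operator materialises
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: gravl/prod/db-password    # created in step 2
    - secretKey: JWT_SECRET
      remoteRef:
        key: gravl/prod/jwt-secret
```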

#### Validation Checklist:

- [ ] Secrets controller pod running
- [ ] `gravl-secrets` secret exists (either sealed or external)
- [ ] Backend pod can read database password from secret
- [ ] No plain secrets in Git or etcd
- [ ] Sealing key backed up securely

**Estimated Duration:** 10-15 minutes

---

### ✅ Blocker 3: Network Policy DNS Egress (Dependency: None — parallel)

**File:** `k8s/production/network-policy-with-dns.yaml`

**Status:** READY FOR IMPLEMENTATION
```bash
# 1. Label kube-system namespace (if not already labeled)
kubectl label namespace kube-system name=kube-system --overwrite

# 2. Apply updated network policies with DNS egress
kubectl apply -f k8s/production/network-policy-with-dns.yaml

# 3. Verify policies created
kubectl get networkpolicy -n gravl-prod
# Expected output (an empty pod-selector is displayed as <none>):
# NAME                        POD-SELECTOR   AGE
# gravl-default-deny          <none>         1m
# allow-from-ingress          app=backend    1m
# allow-ingress-to-frontend   app=frontend   1m
# allow-backend-to-db         app=postgres   1m
# allow-monitoring-scrape     <none>         1m
# allow-dns-egress            <none>         1m
# allow-backend-db-egress     app=backend    1m
# allow-external-apis         app=backend    1m
# allow-frontend-cdn-egress   app=frontend   1m

# 4. Test DNS resolution from backend pod
kubectl exec -n gravl-prod deployment/backend -- nslookup gravl.app
# Expected: resolves to external IP

# 5. Test inter-pod communication still works
kubectl exec -n gravl-prod deployment/backend -- nc -zv postgres 5432
# Expected: Connection successful

# 6. Test Prometheus scraping (should still work)
kubectl logs -n gravl-monitoring deployment/prometheus | grep "gravl-prod"
# Expected: scraping gravl-prod endpoints successfully
```
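The `allow-dns-egress` policy listed in step 3 is the critical fix (without it, a default-deny egress posture silently breaks all name resolution). A sketch of what such a policy typically looks like, assuming CoreDNS runs in `kube-system` and relying on the `name=kube-system` label applied in step 1; the real policy lives in the template file:

```yaml
# Hypothetical sketch; the applied policy lives in k8s/production/network-policy-with-dns.yaml.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: gravl-prod
spec:
  podSelector: {}                  # all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system    # label applied in step 1
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP            # DNS falls back to TCP for large responses
          port: 53
```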

#### Validation Checklist:

- [ ] All network policies created successfully
- [ ] DNS queries work (nslookup/dig successful)
- [ ] Backend → Database connectivity functional
- [ ] Prometheus scraping operational
- [ ] Ingress-nginx → backend traffic flowing

**Estimated Duration:** 5-10 minutes

---
### ✅ Blocker 4: Load Test Baseline (Dependency: All previous blockers complete)

**File:** `k8s/production/load-test.js`

**Status:** READY FOR EXECUTION
```bash
# 1. Install k6 CLI (if not already installed)
# macOS: brew install k6
# Linux: via Grafana's package repository (k6 is not in the default apt repos; see the k6 install docs)
# Or Docker: docker run --rm -v $(pwd):/scripts grafana/k6:latest run /scripts/load-test.js

k6 version
# Expected: k6 v0.49.0+

# 2. Run load test against staging environment
export GRAVL_API_URL="https://staging.gravl.app"
k6 run k8s/production/load-test.js

# 3. Observe results in real-time:
#   • Requests/sec
#   • p95 latency
#   • p99 latency
#   • Error rate
#   • Active connections

# 4. Expected baseline (PASS criteria):
#   ✓ p95 latency: <200ms
#   ✓ p99 latency: <500ms
#   ✓ Error rate: <0.1%
#   ✓ Throughput: >100 req/s

# 5. Save results to file for documentation
k6 run --out json=load-test-results.json k8s/production/load-test.js

# 6. Upload results to shared documentation
mv load-test-results.json docs/load-test-baseline-2026-03-07.json
git add docs/load-test-baseline-*.json
git commit -m "Load test baseline: p95 <200ms, error rate <0.1%"
```
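Percentile thresholds are normally asserted inside the k6 script itself, but when sanity-checking numbers by hand the p95 arithmetic is simple: sort the samples and take the value at rank ceil(0.95 × n). A minimal sketch in plain shell; the `p95` helper and the sample latencies are illustrative, not from the repo:

```shell
#!/usr/bin/env bash
# Nearest-rank p95: index ceil(n * 95 / 100), computed with integer math
# to avoid floating-point rounding at exact boundaries.
p95() {
  sort -n | awk '{ v[NR] = $1 } END { idx = int((NR * 95 + 99) / 100); print v[idx] }'
}

# 20 illustrative latency samples in ms; ceil(20 * 0.95) = 19, so p95 is the
# 19th value of the sorted list.
printf '%s\n' 120 95 110 102 98 130 105 99 101 97 115 108 103 96 100 107 112 104 94 180 | p95
# prints 130
```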

#### Validation Checklist:

- [ ] k6 installed and executable
- [ ] Load test completes without script errors
- [ ] p95 latency < 200ms ✅
- [ ] p99 latency < 500ms ✅
- [ ] Error rate < 0.1% ✅
- [ ] Results documented in `docs/load-test-baseline-2026-03-07.json`

**Estimated Duration:** 5-10 minutes (test runs for 5 minutes)

---
## Production Readiness Sign-Off Template

Once all blockers are complete, update `PRODUCTION_READINESS.md` with final sign-offs:
```markdown
## Final Sign-Off (2026-03-07)

### Security Review ✅ APPROVED
- [x] RBAC: Least privilege verified
- [x] Network Policies: Default deny + explicit allowlist (DNS egress added)
- [x] Secrets Management: sealed-secrets OR External Secrets Operator deployed
- [x] TLS/Encryption: cert-manager + Let's Encrypt configured
- [x] Image Scanning: Scheduled for [DATE]

### Performance Validation ✅ APPROVED
- [x] Load test baseline: p95 <200ms, error rate <0.1%
- [x] Database performance: Query latency acceptable
- [x] Pod resource limits: Configured and validated

### Operations Readiness ✅ APPROVED
- [x] Monitoring: Prometheus + Grafana operational
- [x] Alerting: AlertManager configured with receivers
- [x] Logging: [Loki workaround OR alternative configured]
- [x] Backup: Daily + weekly jobs validated
- [x] Runbooks: Created and tested

### Go-Live Authorization: ✅ APPROVED
**Authorized by:** [Architect/PM name]
**Date:** 2026-03-07
**Conditions:** All critical path items complete, load test passing, monitoring alerts active
```

---

## Rollback Readiness

If any blocker fails production testing:
```bash
# 1. Immediate rollback to staging-only (scale every deployment in the namespace to zero;
#    --all is required, since kubectl scale otherwise needs an explicit deployment name):
kubectl scale deployment --all -n gravl-prod --replicas=0

# 2. Disable cert-manager for Ingress (revert to self-signed):
kubectl patch ingress gravl-ingress -n gravl-prod --type json \
  -p='[{"op":"remove","path":"/metadata/annotations/cert-manager.io~1cluster-issuer"}]'

# 3. Restore pre-cert-manager Ingress:
kubectl apply -f k8s/staging/ingress.yaml

# 4. Alert team: "Production deployment rolled back — investigation required"
```

---

## Success Criteria

Phase 10-07 is **COMPLETE** when:

- ✅ All 4 critical blockers resolved
- ✅ Load test baseline documented (p95 <200ms)
- ✅ Security sign-off checklist approved
- ✅ Monitoring + alerting operational
- ✅ Team authorization obtained
- ✅ Go-live procedure documented

**Ready to proceed to production launch.**

---
## Timeline Summary

| Blocker | Duration | Start | End |
|---------|----------|-------|-----|
| 1. cert-manager setup | 10-15 min | 03:40 | 03:55 |
| 2. Secrets mgmt (parallel) | 10-15 min | 03:40 | 03:55 |
| 3. Network policy (parallel) | 5-10 min | 03:40 | 03:50 |
| 4. Load test | 5-10 min | 04:00 | 04:10 |
| **Total** | **6-8 hours** | **03:40** | **~09:30-11:30** |

*(Includes buffer for kubectl wait times, certificate issuance, etc.)*
---
|
|
|
|
**Document Version:** 2.0 (Implementation Ready)
|
|
**Last Updated:** 2026-03-07 03:45
|
|
**Owner:** Gravl PM Autonomy / Architect
|
|
**Next Review:** Before production launch
|