# Production Go-Live Procedure — Phase 10-07, Task 5

**Date:** 2026-03-06
**Status:** DRAFT (TO BE TESTED ON STAGING)
**Owner:** DevOps / Deployment Lead
**Pre-requisites:** Complete PRODUCTION_READINESS.md checklist items #1-4

---

## Overview

This document defines the step-by-step procedure for deploying Gravl to production and verifying system health.

**Estimated Duration:** 2-3 hours (plus verification window)
**Rollback Window:** <15 minutes (with ROLLBACK.md procedure)
**Required Team:** DevOps (2), Backend (1), Frontend Lead (1)

---

## Pre-Flight Checklist (T-30 minutes)

- [ ] Production cluster access verified (kubectl configured)
- [ ] All team members on call (Slack + video bridge open)
- [ ] Backup of production database exists (snapshot/automated backup running)
- [ ] Monitoring dashboards loaded and ready (Grafana open in separate browser tabs)
- [ ] Rollback procedure briefed to team (5-minute review of ROLLBACK.md)
- [ ] Production domain DNS propagated (check DNS resolution)
- [ ] TLS certificates ready, or cert-manager deployed and tested
- [ ] Alert thresholds reviewed (no overly sensitive alerts during deployment)
- [ ] Staging environment running the last validated build
- [ ] Load balancer health checks configured
- [ ] Incident communication channel created (Slack #gravl-incident)

---

## Phase 1: Environment & Infrastructure Setup (T-60 to T-30 minutes)

### 1.1 Create Kubernetes Namespace & RBAC

```bash
# Apply production namespace configuration
kubectl apply -f k8s/production/namespace.yaml

# Apply RBAC for production deployments
kubectl apply -f k8s/production/rbac.yaml

# Verify namespace created
kubectl get ns gravl-production
kubectl get serviceaccount -n gravl-production gravl-deployer
```

**Verification:**

- [ ] Namespace exists
- [ ] ServiceAccount exists
- [ ] RBAC role bound

### 1.2 Apply Network Policies

```bash
# Apply default deny + explicit allow rules
kubectl apply -f k8s/production/network-policy.yaml

# Verify policies (should see 5+ NetworkPolicies)
kubectl get networkpolicies -n gravl-production
```

**Verification:**

- [ ] Default deny ingress in place
- [ ] Backend, frontend, database, monitoring policies visible

### 1.3 Deploy Secrets (Sealed or External)

**Option A: Sealed Secrets** (if the Sealed Secrets controller is deployed)

```bash
# Apply the SealedSecret manifests; the in-cluster controller decrypts
# them into regular Secrets (kubeseal is only used when *creating* the
# sealed manifests, not when deploying them)
kubectl apply -f k8s/production/sealed-secrets.yaml

# Verify secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret postgres-secret -n gravl-production
```

**Option B: External Secrets Operator** (if AWS/Vault is used)

```bash
# Apply ExternalSecret definitions
kubectl apply -f k8s/production/external-secrets.yaml

# Verify ExternalSecrets synced (status should show synced)
kubectl get externalsecrets -n gravl-production
kubectl describe externalsecret postgres-secret -n gravl-production
```

**Verification:**

- [ ] postgres-secret contains POSTGRES_PASSWORD
- [ ] app-secret contains JWT_SECRET
- [ ] registry-pull-secret exists (if a private registry is used)
- [ ] Production TLS secret exists (or cert-manager will auto-create it)

### 1.4 Deploy cert-manager (if not already on cluster)

```bash
# Install cert-manager (one-time, if needed)
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true \
  --version v1.13.0

# Create ClusterIssuer for Let's Encrypt (production)
kubectl apply -f k8s/production/cert-manager-issuer.yaml

# Verify issuer ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod
```

**Verification:**

- [ ] cert-manager pods running in the cert-manager namespace
- [ ] ClusterIssuer status is READY (True)

---

## Phase 2: Database & Storage (T-30 to T-10 minutes)

### 2.1 Deploy PostgreSQL StatefulSet

```bash
# Deploy PostgreSQL to production
kubectl apply -f k8s/production/postgres-statefulset.yaml

# Watch for Pod readiness (should take 30-60 seconds)
kubectl rollout status statefulset/postgres -n gravl-production

# Verify pod is running and ready (2/2 containers)
kubectl get pods -n gravl-production -l component=database
```

**Verification:**

- [ ] Pod status: Running, Ready 2/2
- [ ] PersistentVolumeClaim bound
- [ ] No errors in pod logs: `kubectl logs postgres-0 -n gravl-production`

### 2.2 Run Database Migrations

```bash
# Port-forward to the database (for running migrations locally)
kubectl port-forward postgres-0 5432:5432 -n gravl-production &

# Run migrations in a separate terminal
cd backend
npm run db:migrate:prod

# If migrations run as a Kubernetes Job instead, monitor its logs
kubectl logs -n gravl-production -f job/db-migration

# Kill the port-forward when done
kill %1
```

**Verification:**

- [ ] Migration job completed successfully
- [ ] No migration errors in logs
- [ ] Database schema matches the expected version

### 2.3 Verify Database Connectivity

```bash
# Create a temporary pod to verify DB access
kubectl run -it --rm --image=postgres:15 \
  --restart=Never \
  -n gravl-production \
  psql-test \
  -- psql -h postgres -U gravl_user -d gravl -c "SELECT version();"

# Should return the PostgreSQL version
```

**Verification:**

- [ ] Database connection successful
- [ ] PostgreSQL version visible

---

## Phase 3: Deploy Application Services (T-10 to T+20 minutes)

### 3.1 Deploy Backend

```bash
# Deploy backend service
kubectl apply -f k8s/production/backend-deployment.yaml

# Wait for rollout (typically 2-3 minutes)
kubectl rollout status deployment/backend -n gravl-production

# Verify pods running
kubectl get pods -n gravl-production -l component=backend
```

**Verification:**

- [ ] Pods running and ready (matches the replica count, e.g. 3 replicas = 3/3 ready)
- [ ] No CrashLoopBackOff errors
- [ ] Service endpoint registered: `kubectl get svc backend -n gravl-production`

### 3.2 Deploy Frontend

```bash
# Deploy frontend service
kubectl apply -f k8s/production/frontend-deployment.yaml

# Wait for rollout
kubectl rollout status deployment/frontend -n gravl-production

# Verify pods
kubectl get pods -n gravl-production -l component=frontend
```
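The rollout checks in 3.1 and 3.2 can also be scripted so the deploy fails fast instead of relying on eyeballing `kubectl get pods`. A minimal sketch, assuming the namespace and labels used above; the `kubectl` pipeline in the trailing comment is illustrative, not part of the helper:

```shell
# check_ready: read READY column values such as "3/3" from stdin and
# succeed only if every pod reports all of its containers ready.
check_ready() {
  local line ready total ok=1
  while read -r line; do
    [ -z "$line" ] && continue
    ready=${line%%/*}   # containers ready
    total=${line#*/}    # containers total
    [ "$ready" = "$total" ] || ok=0
  done
  [ "$ok" -eq 1 ]
}

# Assumed usage against the cluster (column 2 of `kubectl get pods` is READY):
#   kubectl get pods -n gravl-production -l component=frontend --no-headers \
#     | awk '{print $2}' | check_ready || echo "frontend pods not ready"
```

The same pipeline covers the backend by swapping the label selector.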
**Verification:**

- [ ] Frontend pods running and ready
- [ ] Service endpoint registered

### 3.3 Apply Ingress with TLS Termination

```bash
# Deploy ingress (cert-manager auto-provisions TLS if the
# cert-manager.io/cluster-issuer annotation is set)
kubectl apply -f k8s/production/ingress.yaml

# Wait for the ingress to get an external IP / DNS name (typically 30-60 seconds)
kubectl get ingress -n gravl-production -w

# Check ingress status and TLS certificate
kubectl describe ingress gravl-ingress -n gravl-production
```

**Verification:**

- [ ] Ingress has an external IP or DNS name assigned
- [ ] TLS certificate present (auto-created by cert-manager if configured)
- [ ] Certificate is not self-signed (check with OpenSSL):

```bash
echo | openssl s_client -servername gravl.example.com \
  -connect $(kubectl get ingress gravl-ingress -n gravl-production \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):443 2>/dev/null \
  | grep Subject
```

---

## Phase 4: Service Integration Verification (T+20 to T+40 minutes)

### 4.1 Test Service-to-Service Communication

```bash
# Exec into a backend pod to test the database connection
BACKEND_POD=$(kubectl get pod -n gravl-production -l component=backend \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $BACKEND_POD -n gravl-production -- \
  curl http://postgres:5432 -v 2>&1 | head -5

# Expected: some indication that the postgres port is responding
# (a protocol error or timeout is fine; "connection refused" is not)
```

**Verification:**

- [ ] Backend can reach the database (a timeout is acceptable; "connection refused" is not)
- [ ] Backend logs show no database errors:
  `kubectl logs $BACKEND_POD -n gravl-production | grep -i error | head -10`

### 4.2 Health Check Endpoint

```bash
# Get the backend service's cluster IP
BACKEND_SVC=$(kubectl get svc backend -n gravl-production -o jsonpath='{.spec.clusterIP}')

# Test the health endpoint from a temporary pod
kubectl run -it --rm --image=curlimages/curl \
  --restart=Never \
  -n gravl-production \
  curl-test \
  -- curl http://$BACKEND_SVC:3000/health

# Expected response: {"status":"ok"} or similar
```

**Verification:**

- [ ] Health endpoint responds (HTTP 200)
- [ ] No error messages in the response

### 4.3 External Endpoint Test (via Ingress)

```bash
# Wait for DNS propagation (if using a DNS name, not an IP),
# then test external access
curl -k https://gravl.example.com/api/health

# Expected: HTTP 200 with health status
```

**Verification:**

- [ ] HTTPS responds (a self-signed-certificate warning is acceptable at this stage; hence `-k`)
- [ ] Backend responds through the ingress

---

## Phase 5: Monitoring & Alerting Setup (T+40 to T+60 minutes)

### 5.1 Verify Prometheus Scraping

```bash
# Check Prometheus targets (should show gravl-production scrape configs)
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &

# Open http://localhost:9090/targets in a browser and
# verify all gravl-production targets are "UP"

kill %1
```

**Verification:**

- [ ] All production targets showing as UP
- [ ] No "DOWN" endpoints

### 5.2 Verify Grafana Dashboards

```bash
# Access Grafana
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &

# Open http://localhost:3000, log in (default credentials or stored secret),
# navigate to the Gravl dashboards, and verify graphs show production metrics

kill %1
```

**Verification:**

- [ ] Gravl dashboards visible
- [ ] Metrics flowing (not empty graphs)
- [ ] CPU, memory, and request-rate graphs showing data

### 5.3 Verify AlertManager

```bash
# Check AlertManager configuration (should have production severity levels)
kubectl get alertmanagerconfig -n gravl-monitoring
kubectl describe alertmanagerconfig -n gravl-monitoring
```

**Verification:**

- [ ] Alerts configured for production thresholds
- [ ] Notification channels (Slack, PagerDuty, etc.) configured

### 5.4 Test Alert Trigger

```bash
# Send a test alert through AlertManager
kubectl exec -it -n gravl-monitoring alertmanager-0 -- \
  amtool alert add test_alert severity=info --alertmanager.url=http://localhost:9093

# Check the Slack / notification channel (the alert should arrive within 1 minute)
```

**Verification:**

- [ ] Test alert received in the notification channel
- [ ] Alert formatting correct
- [ ] No excessive duplicate alerts

---

## Phase 6: Load Test & Baseline (T+60 to T+90 minutes)

### 6.1 Run Load Test on Production (Low Traffic)

```bash
# Generate light load using k6 (or a comparable tool such as Apache Bench)
k6 run --vus 10 --duration 5m k8s/production/load-test.js

# Expected results:
# - p95 latency: <200ms
# - Throughput: >100 req/s
# - Error rate: <0.1%
```

**Verification:**

- [ ] p95 latency <200ms
- [ ] Error rate <0.1%
- [ ] No pod restarts during the test

### 6.2 Capture Baseline Metrics

```bash
# Log current metrics for the baseline
kubectl top nodes > /tmp/baseline-nodes.txt
kubectl top pods -n gravl-production > /tmp/baseline-pods.txt

# Store for comparison (alert if usage exceeds 2x baseline)
```

**Verification:**

- [ ] Node CPU/memory usage within the expected range
- [ ] Pod CPU/memory usage within resource requests

---

## Phase 7: Production Sign-Off (T+90 minutes)

### 7.1 Final Checklist

- [ ] All pre-flight checks passed
- [ ] Database healthy and migrated
- [ ] All services running and ready
- [ ] Ingress responding (TLS valid)
- [ ] Health checks passing
- [ ] Monitoring metrics flowing
- [ ] Alerts functional
- [ ] Load test passed
- [ ] Team lead review: ✅ READY TO GO LIVE

### 7.2 Change Log Entry

```bash
# Write the deployment log inside the repository (a file under /tmp
# cannot be added to git) and commit it
cat > PRODUCTION_DEPLOY.log << 'DEPLOY_LOG'
---
date: 2026-03-06
time: ~09:30 UTC
environment: production
namespace: gravl-production
services:
  - backend: v1.x.x
  - frontend: v1.x.x
  - postgres: 15.x
  - ingress: nginx
  - certificates: cert-manager (Let's Encrypt)
pre_flight_status: ✅ PASSED
security_review: ✅ APPROVED
monitoring_status: ✅ OPERATIONAL
load_test_result: ✅ PASSED
sign_off_by: [DevOps Lead]
DEPLOY_LOG

git add PRODUCTION_DEPLOY.log
git commit -m "Production deployment log - 2026-03-06"
```

### 7.3 Notify Team

- [ ] Send a deployment completion notice to Slack #gravl-announce:

  ```
  🚀 **Gravl Production Deployment COMPLETE**
  - Timestamp: 2026-03-06 09:30 UTC
  - All systems operational
  - Monitoring dashboards: [link]
  - Status page: [link]
  ```

- [ ] Update the status page (if external-facing)
- [ ] Notify stakeholders (product, marketing)

---

## Rollback Decision Tree

**If at any point a critical failure occurs:**

1. Do NOT proceed.
2. Trigger the ROLLBACK.md procedure.
3. Investigate the root cause post-incident (blameless postmortem).

**Critical Failure Indicators:**

- Database connection failures after 3 retries
- More than 2 pod crashes during rollout
- Ingress TLS certificate invalid
- Health checks failing on all pods
- Alerts firing for production thresholds

---

## Post-Deployment (T+120 minutes and beyond)

### Sustained Monitoring Window (Next 24 Hours)

- [ ] Assign on-call rotation (24h monitoring)
- [ ] Set up escalation policy (alert → on-call → incident lead)
- [ ] Daily review of logs and metrics for the first week
- [ ] Customer feedback monitoring (support tickets, user reports)

### Post-Deployment Review (24 Hours)

- [ ] Team retrospective (what went well, what to improve)
- [ ] Update runbooks based on findings
- [ ] Document any manual interventions for automation
- [ ] Plan optimization and hardening work for the next phase

---

**Document Version:** 1.0
**Last Updated:** 2026-03-06 08:50
**Next Update:** After first production deployment attempt
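---

## Appendix: Scripted Smoke Check (optional)

The external health check in 4.3 and the rollback triggers above can be combined into one scripted smoke check for the sustained monitoring window. This is a sketch only: `gravl.example.com` is carried over as the placeholder hostname from the earlier examples, and the retry count and delay are arbitrary defaults.

```shell
# healthy: succeed only when the JSON payload passed as $1 reports
# "status":"ok" (the expected /health response shown in Phase 4).
healthy() {
  printf '%s' "$1" | grep -q '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# smoke_check: poll the health endpoint until healthy, or fail after
# $1 attempts with $2 seconds between them. HEALTH_URL is a placeholder;
# override it with the real production hostname.
smoke_check() {
  local attempts=${1:-10} delay=${2:-6} body i
  local url=${HEALTH_URL:-https://gravl.example.com/api/health}
  for ((i = 1; i <= attempts; i++)); do
    body=$(curl -fsS --max-time 5 "$url" 2>/dev/null || true)
    if healthy "$body"; then
      echo "health OK after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "health check FAILED after $attempts attempts" >&2
  return 1
}

# Assumed usage during the 24h monitoring window:
#   smoke_check 10 30 || echo "trigger the ROLLBACK.md procedure"
```

A non-zero exit maps directly onto the "Critical Failure Indicators" above: repeated failures are a rollback signal, not something to retry indefinitely.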