# Production Go-Live Procedure — Phase 10-07, Task 5

**Date:** 2026-03-06
**Status:** DRAFT (TO BE TESTED ON STAGING)
**Owner:** DevOps / Deployment Lead
**Pre-requisites:** Complete PRODUCTION_READINESS.md checklist items #1-4

---
## Overview

This document defines the step-by-step procedure for deploying Gravl to production and verifying system health.

**Estimated Duration:** 2-3 hours (plus verification window)
**Rollback Window:** <15 minutes (using the ROLLBACK.md procedure)
**Required Team:** DevOps (2), Backend (1), Frontend Lead (1)

---
## Pre-Flight Checklist (T-30 minutes)

- [ ] Production cluster access verified (kubectl configured)
- [ ] All team members on call (Slack + video bridge open)
- [ ] Backup of production database exists (snapshot or automated backup running)
- [ ] Monitoring dashboards loaded and ready (Grafana open in separate browser tabs)
- [ ] Rollback procedure briefed to the team (5-minute review of ROLLBACK.md)
- [ ] Production domain DNS propagated (check DNS resolution)
- [ ] TLS certificates ready, or cert-manager deployed and tested
- [ ] Alert thresholds reviewed (no overly sensitive alerts during deployment)
- [ ] Staging environment running the last validated build
- [ ] Load balancer health checks configured
- [ ] Incident communication channel created (Slack #gravl-incident)
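Several of these checks can be scripted so the T-30 review is mechanical rather than eyeballed. A minimal sketch, assuming bash; the domain is a placeholder and the helper names are ours, not part of any existing tooling:

```shell
#!/usr/bin/env bash
# Pre-flight helper sketch. gravl.example.com is a placeholder; the namespace
# matches the one created in Phase 1.
set -u

check() {  # check <description> <command...>
  local desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
  fi
}

preflight() {
  check "kubectl reachable"           kubectl version --request-timeout=5s
  check "production namespace exists" kubectl get ns gravl-production
  check "DNS resolves"                nslookup gravl.example.com
}

# Run the checks only when executed directly (the file can also be sourced).
[[ "${BASH_SOURCE[0]}" == "$0" ]] && preflight
```

Any FAIL line is a reason to hold the deployment and re-run after fixing.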
---

## Phase 1: Environment & Infrastructure Setup (T-60 to T-30 minutes)

### 1.1 Create Kubernetes Namespace & RBAC

```bash
# Apply production namespace configuration
kubectl apply -f k8s/production/namespace.yaml

# Apply RBAC for production deployments
kubectl apply -f k8s/production/rbac.yaml

# Verify the namespace and service account exist
kubectl get ns gravl-production
kubectl get serviceaccount -n gravl-production gravl-deployer
```

**Verification:**
- [ ] Namespace exists
- [ ] ServiceAccount exists
- [ ] RBAC role bound
### 1.2 Apply Network Policies

```bash
# Apply default-deny plus explicit allow rules
kubectl apply -f k8s/production/network-policy.yaml

# Verify policies (expect 5+ NetworkPolicies)
kubectl get networkpolicies -n gravl-production
```

**Verification:**
- [ ] Default deny ingress in place
- [ ] Backend, frontend, database, and monitoring policies visible
### 1.3 Deploy Secrets (Sealed or External)

**Option A: Sealed Secrets** (if the sealed-secrets controller is deployed)

```bash
# Apply the SealedSecret manifests; the in-cluster controller decrypts them
# into regular Secrets. (kubeseal is only used to *create* sealed-secrets.yaml,
# not to apply it.)
kubectl apply -f k8s/production/sealed-secrets.yaml

# Verify secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret postgres-secret -n gravl-production
```

**Option B: External Secrets Operator** (if AWS Secrets Manager or Vault is used)

```bash
# Apply ExternalSecret definitions
kubectl apply -f k8s/production/external-secrets.yaml

# Verify ExternalSecrets synced (status should report synced)
kubectl get externalsecrets -n gravl-production
kubectl describe externalsecret postgres-secret -n gravl-production
```

**Verification:**
- [ ] postgres-secret contains POSTGRES_PASSWORD
- [ ] app-secret contains JWT_SECRET
- [ ] registry-pull-secret exists (if a private registry is used)
- [ ] staging-tls exists (or cert-manager will auto-create it)
### 1.4 Deploy cert-manager (if not already on the cluster)

```bash
# Install cert-manager (one-time, if needed)
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true \
  --version v1.13.0

# Create the ClusterIssuer for Let's Encrypt (production)
kubectl apply -f k8s/production/cert-manager-issuer.yaml

# Verify the issuer is ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod
```

**Verification:**
- [ ] cert-manager pods running in the cert-manager namespace
- [ ] ClusterIssuer status is Ready (True)

---
## Phase 2: Database & Storage (T-30 to T-10 minutes)

### 2.1 Deploy PostgreSQL StatefulSet

```bash
# Deploy PostgreSQL to production
kubectl apply -f k8s/production/postgres-statefulset.yaml

# Watch for pod readiness (should take 30-60 seconds)
kubectl rollout status statefulset/postgres -n gravl-production

# Verify the pod is running and ready (2/2 containers)
kubectl get pods -n gravl-production -l component=database
```

**Verification:**
- [ ] Pod status: Running, Ready 2/2
- [ ] PersistentVolumeClaim bound
- [ ] No errors in pod logs: `kubectl logs postgres-0 -n gravl-production`
### 2.2 Run Database Migrations

```bash
# Port-forward to the database (run in the background)
kubectl port-forward postgres-0 5432:5432 -n gravl-production &

# Run migrations from a separate terminal
cd backend
npm run db:migrate:prod

# Alternatively, if migrations run as a Kubernetes Job, monitor its logs instead:
kubectl logs -n gravl-production -f job/db-migration

# Kill the port-forward when done
kill %1
```

**Verification:**
- [ ] Migrations completed successfully
- [ ] No migration errors in the logs
- [ ] Database schema matches expected version
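The "schema matches expected version" check can be made concrete by querying the migration tool's bookkeeping table over the same port-forward. A sketch, assuming a node-pg-migrate style `pgmigrations` table with a `run_on` timestamp; substitute whatever table your migration runner actually maintains:

```shell
# Hypothetical schema-version probe. Table and column names are assumptions
# (node-pg-migrate defaults); adjust to your migration tool.
latest_migration() {
  PGPASSWORD="${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD first}" \
    psql -h localhost -p 5432 -U gravl_user -d gravl -tA \
    -c "SELECT name FROM pgmigrations ORDER BY run_on DESC LIMIT 1"
}

# Usage: latest_migration   # prints the most recently applied migration's name
```

Compare the printed name against the newest file in the migrations directory before ticking the box.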
### 2.3 Verify Database Connectivity

```bash
# Create a throwaway pod to verify DB access
# (psql will prompt for the gravl_user password)
kubectl run psql-test -it --rm \
  --image=postgres:15 \
  --restart=Never \
  -n gravl-production \
  -- psql -h postgres -U gravl_user -d gravl -c "SELECT version();"

# Should return the PostgreSQL version string
```

**Verification:**
- [ ] Database connection successful
- [ ] PostgreSQL version visible

---
---

## Phase 3: Deploy Application Services (T-10 to T+20 minutes)

### 3.1 Deploy Backend Deployment

```bash
# Deploy the backend service
kubectl apply -f k8s/production/backend-deployment.yaml

# Wait for the rollout (typically 2-3 minutes)
kubectl rollout status deployment/backend -n gravl-production

# Verify pods are running
kubectl get pods -n gravl-production -l component=backend
```

**Verification:**
- [ ] Pods running and ready (matching the replica count, e.g. 3 replicas = 3/3 ready)
- [ ] No CrashLoopBackOff errors
- [ ] Service endpoint registered: `kubectl get svc backend -n gravl-production`
### 3.2 Deploy Frontend Deployment

```bash
# Deploy the frontend service
kubectl apply -f k8s/production/frontend-deployment.yaml

# Wait for the rollout
kubectl rollout status deployment/frontend -n gravl-production

# Verify pods
kubectl get pods -n gravl-production -l component=frontend
```

**Verification:**
- [ ] Frontend pods running and ready
- [ ] Service endpoint registered
### 3.3 Apply Ingress with TLS Termination

```bash
# Deploy the ingress (cert-manager auto-provisions TLS if the
# cert-manager.io/cluster-issuer annotation is set)
kubectl apply -f k8s/production/ingress.yaml

# Wait for the ingress to get an external IP / DNS name (typically 30-60 seconds)
kubectl get ingress -n gravl-production -w

# Check ingress status and the TLS certificate
kubectl describe ingress gravl-ingress -n gravl-production
```

**Verification:**
- [ ] Ingress has an external IP or DNS name assigned
- [ ] TLS certificate present (auto-created by cert-manager if configured)
- [ ] Certificate is not self-signed (check with OpenSSL):

```bash
echo | openssl s_client -servername gravl.example.com \
  -connect $(kubectl get ingress gravl-ingress -n gravl-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):443 2>/dev/null \
  | grep -i subject
```
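Beyond the issuer check, it is worth confirming how long the certificate remains valid. A sketch using openssl; note that `date -d` is GNU-specific (BSD/macOS `date` differs), and the function name is ours:

```shell
# Print the number of days until the served certificate expires.
# Assumes GNU date; the host argument should be the production domain.
cert_days_left() {  # cert_days_left <host>
  local host="$1" end
  end=$(echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Usage: cert_days_left gravl.example.com
```

A freshly issued Let's Encrypt certificate should show roughly 90 days; anything near zero means renewal is misconfigured.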
---

## Phase 4: Service Integration Verification (T+20 to T+40 minutes)

### 4.1 Test Service-to-Service Communication

```bash
# Exec into a backend pod to test the database connection
BACKEND_POD=$(kubectl get pod -n gravl-production -l component=backend -o jsonpath='{.items[0].metadata.name}')

kubectl exec -it $BACKEND_POD -n gravl-production -- \
  curl http://postgres:5432 -v 2>&1 | head -5

# Expected: some indication that the postgres port responds (a protocol error
# or timeout is fine); "connection refused" is the failure case
```

**Verification:**
- [ ] Backend can reach the database (a timeout is acceptable; "connection refused" is not)
- [ ] Backend logs show no database errors: `kubectl logs $BACKEND_POD -n gravl-production | grep -i error | head -10`
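Since PostgreSQL does not speak HTTP, the curl output above can be confusing to read. A plain TCP probe is less ambiguous; this is a sketch using bash's `/dev/tcp` pseudo-device, and it assumes bash is available in the backend image:

```shell
# TCP reachability probe. "open" means the TCP handshake completed;
# "closed" covers refused, filtered, and timed-out connections alike.
tcp_probe() {  # tcp_probe <host> <port>
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# e.g. kubectl exec $BACKEND_POD -n gravl-production -- \
#        bash -c 'timeout 3 bash -c "exec 3<>/dev/tcp/postgres/5432" && echo open'
```

"closed" from inside the backend pod while the database pod is Ready usually points at a NetworkPolicy gap from section 1.2.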
### 4.2 Health Check Endpoint

```bash
# Get the backend service's cluster IP
BACKEND_SVC=$(kubectl get svc backend -n gravl-production -o jsonpath='{.spec.clusterIP}')

# Test the health endpoint from a throwaway pod
kubectl run curl-test -it --rm \
  --image=curlimages/curl \
  --restart=Never \
  -n gravl-production \
  -- curl http://$BACKEND_SVC:3000/health

# Expected response: {"status":"ok"} or similar
```

**Verification:**
- [ ] Health endpoint responds (HTTP 200)
- [ ] No error messages in response
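During a rollout the first probe can race pod startup, so a short retry loop is more reliable than a single curl. A sketch; the URL and attempt count are placeholders, and the function name is ours:

```shell
# Poll a health endpoint until it returns success, or give up.
wait_for_health() {  # wait_for_health <url> [attempts]
  local url="$1" attempts="${2:-30}" i
  for ((i = 1; i <= attempts; i++)); do
    if curl -sf -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 2
  done
  echo "unhealthy after $attempts attempts" >&2
  return 1
}

# e.g. wait_for_health "http://$BACKEND_SVC:3000/health" 15
```

The non-zero return code on failure makes this usable as a gate in a deployment script.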
### 4.3 External Endpoint Test (via Ingress)

```bash
# Wait for DNS propagation (if using a DNS name rather than an IP),
# then test external access
curl -k https://gravl.example.com/api/health

# Expected: HTTP 200 with the health status
# (-k skips certificate verification; drop it once the certificate is confirmed valid)
```

**Verification:**
- [ ] HTTPS responds through the ingress
- [ ] Backend health status returned
---

## Phase 5: Monitoring & Alerting Setup (T+40 to T+60 minutes)

### 5.1 Verify Prometheus Scraping

```bash
# Port-forward to Prometheus (run in the background)
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &

# Open http://localhost:9090/targets in a browser and
# verify all gravl-production targets are "UP"

kill %1
```

**Verification:**
- [ ] All production targets showing as UP
- [ ] No "DOWN" endpoints
### 5.2 Verify Grafana Dashboards

```bash
# Port-forward to Grafana (run in the background)
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &

# Open http://localhost:3000, log in (default credentials or the stored secret),
# navigate to the Gravl dashboards, and verify graphs show production metrics

kill %1
```

**Verification:**
- [ ] Gravl dashboards visible
- [ ] Metrics flowing (no empty graphs)
- [ ] CPU, memory, and request-rate graphs showing data
### 5.3 Verify Alertmanager

```bash
# Check Alertmanager configuration (should include production severity levels)
kubectl get alertmanagerconfig -n gravl-monitoring
kubectl describe alertmanagerconfig -n gravl-monitoring
```

**Verification:**
- [ ] Alerts configured with production thresholds
- [ ] Notification channels (Slack, PagerDuty, etc.) configured
### 5.4 Test Alert Trigger

```bash
# Send a test alert through Alertmanager
kubectl exec -it -n gravl-monitoring alertmanager-0 -- \
  amtool alert add test_alert severity=info --alertmanager.url=http://localhost:9093

# Check the Slack / notification channel (the alert should arrive within a minute)
```

**Verification:**
- [ ] Test alert received in the notification channel
- [ ] Alert formatting correct
- [ ] No excessive duplicate alerts
---

## Phase 6: Load Test & Baseline (T+60 to T+90 minutes)

### 6.1 Run Load Test on Production (Low Traffic)

```bash
# Generate light load using k6 (or a comparable tool such as Apache Bench)
k6 run --vus 10 --duration 5m k8s/production/load-test.js

# Expected results:
# - p95 latency: <200ms
# - Throughput: >100 req/s
# - Error rate: <0.1%
```

**Verification:**
- [ ] p95 latency <200ms
- [ ] Error rate <0.1%
- [ ] No pod restarts during the test
### 6.2 Capture Baseline Metrics

```bash
# Record current metrics as the baseline
kubectl top nodes > /tmp/baseline-nodes.txt
kubectl top pods -n gravl-production > /tmp/baseline-pods.txt

# Store these for later comparison (alert if usage exceeds 2x baseline)
```

**Verification:**
- [ ] Node CPU/memory usage within the expected range
- [ ] Pod CPU/memory usage within resource requests
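The "2x baseline" rule can be sketched as an awk comparison over two `kubectl top pods` snapshots. The function name is ours, and the column layout assumes standard `kubectl top pods` output (NAME, CPU in millicores, MEMORY):

```shell
# Flag pods whose current CPU is more than double the recorded baseline.
# Both arguments are raw `kubectl top pods` output files.
compare_to_baseline() {  # compare_to_baseline <baseline-file> <current-file>
  awk 'NR == FNR { base[$1] = $2 + 0; next }      # first file: record baseline CPU
       $1 in base && ($2 + 0) > 2 * base[$1] {    # second file: compare
         printf "ALERT: %s CPU %dm exceeds 2x baseline (%dm)\n", $1, $2 + 0, base[$1]
       }' "$1" "$2"
}

# e.g. compare_to_baseline /tmp/baseline-pods.txt <(kubectl top pods -n gravl-production)
```

Run it periodically during the monitoring window; any ALERT line warrants a look at the pod before it trips resource limits.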
---

## Phase 7: Production Sign-Off (T+90 minutes)

### 7.1 Final Checklist

- [ ] All pre-flight checks passed
- [ ] Database healthy and migrated
- [ ] All services running and ready
- [ ] Ingress responding (TLS valid)
- [ ] Health checks passing
- [ ] Monitoring metrics flowing
- [ ] Alerts functional
- [ ] Load test passed
- [ ] Team lead review: ✅ READY TO GO LIVE
### 7.2 Change Log Entry

```bash
# Write the deployment log inside the repository (not /tmp), so git can track it
cat > PRODUCTION_DEPLOY.log << 'DEPLOY_LOG'
---
date: 2026-03-06
time: ~09:30 UTC
environment: production
namespace: gravl-production
services:
  - backend: v1.x.x
  - frontend: v1.x.x
  - postgres: 15.x
  - ingress: nginx
  - certificates: cert-manager (Let's Encrypt)
pre_flight_status: ✅ PASSED
security_review: ✅ APPROVED
monitoring_status: ✅ OPERATIONAL
load_test_result: ✅ PASSED
sign_off_by: [DevOps Lead]
DEPLOY_LOG

git add PRODUCTION_DEPLOY.log
git commit -m "Production deployment log - 2026-03-06"
```
### 7.3 Notify Team

- [ ] Send the deployment completion notice to Slack #gravl-announce:

```
🚀 **Gravl Production Deployment COMPLETE**
- Timestamp: 2026-03-06 09:30 UTC
- All systems operational
- Monitoring dashboards: [link]
- Status page: [link]
```

- [ ] Update the status page (if external-facing)
- [ ] Notify stakeholders (product, marketing)

---
## Rollback Decision Tree

**If a critical failure occurs at any point:**
1. Do NOT proceed.
2. Trigger the ROLLBACK.md procedure.
3. Investigate the root cause post-incident (blameless postmortem).

**Critical Failure Indicators:**
- Database connection failures after 3 retries
- More than 2 pod crashes during a rollout
- Invalid ingress TLS certificate
- Health checks failing on all pods
- Alerts firing at production thresholds
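"More than 2 pod crashes" is easiest to judge from a number rather than scanning pod listings. A quick counter sketch (the function name is ours) that totals the RESTARTS column across the namespace:

```shell
# Total container restarts across the production namespace.
# RESTARTS is the fourth column of default `kubectl get pods` output.
restart_count() {
  kubectl get pods -n gravl-production --no-headers \
    | awk '{ s += $4 } END { print s + 0 }'
}

# Usage: record restart_count before the rollout and compare afterwards;
# a jump of more than 2 is a rollback trigger.
```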
---

## Post-Deployment (T+120 minutes and beyond)

### 7.4 Sustained Monitoring Window (Next 24 Hours)

- [ ] Assign the on-call rotation (24h monitoring)
- [ ] Set up the escalation policy (alert → on-call → incident lead)
- [ ] Daily review of logs and metrics for the first week
- [ ] Monitor customer feedback (support tickets, user reports)

### 7.5 Post-Deployment Review (After 24 Hours)

- [ ] Team retrospective (what went well, what to improve)
- [ ] Update runbooks based on findings
- [ ] Document any manual interventions as automation candidates
- [ ] Plan optimization and hardening work for the next phase

---
**Document Version:** 1.0
**Last Updated:** 2026-03-06 08:50
**Next Update:** After first production deployment attempt
|