Production Go-Live Procedure — Phase 10-07, Task 5

Date: 2026-03-06
Status: DRAFT (TO BE TESTED ON STAGING)
Owner: DevOps / Deployment Lead
Pre-requisites: Complete PRODUCTION_READINESS.md checklist items #1-4


Overview

This document defines the step-by-step procedure for deploying Gravl to production and verifying system health.

Estimated Duration: 2-3 hours (plus verification window)
Rollback Window: <15 minutes (with ROLLBACK.md procedure)
Required Team: DevOps (2), Backend (1), Frontend Lead (1)


Pre-Flight Checklist (T-30 minutes)

  • Production cluster access verified (kubectl configured)
  • All team members on call (Slack + video bridge open)
  • Backup of production database exists (snapshot/automated backup running)
  • Monitoring dashboards loaded and ready (Grafana open in separate browser tabs)
  • Rollback procedure briefed to team (5-minute review of ROLLBACK.md)
  • Production domain DNS propagated (check DNS resolution)
  • TLS certificates ready or cert-manager deployed and tested
  • Alert thresholds reviewed (no overly sensitive alerts during deployment)
  • Staging environment running last validated build
  • Load balancer health checks configured
  • Incident communication channel created (Slack #gravl-incident)
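Several of the bullets above can be pre-verified from a terminal before T-30. A minimal sketch; `preflight` is a hypothetical helper, and the namespace and hostname are this runbook's placeholder values:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: verifies cluster/namespace access and DNS resolution.
set -eu

preflight() {
  local ns="$1" host="$2"
  kubectl get ns "$ns" >/dev/null 2>&1 \
    || { echo "FAIL: namespace $ns missing or no cluster access"; return 1; }
  getent hosts "$host" >/dev/null \
    || { echo "FAIL: $host does not resolve"; return 1; }
  echo "PASS: pre-flight basics"
}

# preflight gravl-production gravl.example.com
```

Backups, dashboards, and team readiness still need a human check; this only covers the mechanical items.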

Phase 1: Environment & Infrastructure Setup (T-60 to T-30 minutes)

1.1 Create Kubernetes Namespace & RBAC

# Apply production namespace configuration
kubectl apply -f k8s/production/namespace.yaml

# Apply RBAC for production deployments
kubectl apply -f k8s/production/rbac.yaml

# Verify namespace created
kubectl get ns gravl-production
kubectl get serviceaccount -n gravl-production gravl-deployer

Verification:

  • Namespace exists
  • ServiceAccount exists
  • RBAC role bound

1.2 Apply Network Policies

# Apply default deny + explicit allow rules
kubectl apply -f k8s/production/network-policy.yaml

# Verify policies (should see 5+ NetworkPolicies)
kubectl get networkpolicies -n gravl-production

Verification:

  • Default deny ingress in place
  • Backend, frontend, database, monitoring policies visible

1.3 Deploy Secrets (Sealed or External)

Option A: Sealed Secrets (if the sealed-secrets controller is deployed)

# Apply the SealedSecret manifests; the in-cluster controller decrypts them into regular Secrets
kubectl apply -f k8s/production/sealed-secrets.yaml

# Verify secrets exist
kubectl get secrets -n gravl-production
kubectl describe secret postgres-secret -n gravl-production

Option B: External Secrets Operator (if AWS/Vault used)

# Apply ExternalSecret definitions
kubectl apply -f k8s/production/external-secrets.yaml

# Verify ExternalSecrets synced (should see status: synced)
kubectl get externalsecrets -n gravl-production
kubectl describe externalsecret postgres-secret -n gravl-production

Verification:

  • postgres-secret contains POSTGRES_PASSWORD
  • app-secret contains JWT_SECRET
  • registry-pull-secret exists (if private registry used)
  • Production TLS secret exists (or cert-manager will auto-create it)
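These secret checks can be scripted so that a missing key fails loudly before any app pods start. A sketch, using the secret and key names from the bullets above (`secret_has_key` is a hypothetical helper):

```shell
# Assert that a Secret contains a non-empty key; fails the step otherwise.
secret_has_key() {
  local ns="$1" secret="$2" key="$3"
  kubectl get secret "$secret" -n "$ns" -o jsonpath="{.data.$key}" | grep -q . \
    && echo "OK: $secret/$key" \
    || { echo "MISSING: $secret/$key"; return 1; }
}

# secret_has_key gravl-production postgres-secret POSTGRES_PASSWORD
# secret_has_key gravl-production app-secret JWT_SECRET
```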

1.4 Deploy cert-manager (if not already on cluster)

# Install cert-manager (one-time, if needed)
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true \
  --version v1.13.0

# Create ClusterIssuer for Let's Encrypt (production)
kubectl apply -f k8s/production/cert-manager-issuer.yaml

# Verify issuer ready
kubectl get clusterissuer
kubectl describe clusterissuer letsencrypt-prod

Verification:

  • cert-manager pods running in cert-manager namespace
  • ClusterIssuer status is READY (True)
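Instead of eyeballing `kubectl describe`, the Ready condition can be polled until it flips to True. A sketch; `wait_for_issuer` is a hypothetical helper, and the issuer name is the one created above:

```shell
# Poll the ClusterIssuer until its Ready condition is True, or time out.
wait_for_issuer() {
  local issuer="$1" tries="${2:-30}"
  for i in $(seq 1 "$tries"); do
    status=$(kubectl get clusterissuer "$issuer" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ] && { echo "READY: $issuer"; return 0; }
    sleep 5
  done
  echo "TIMEOUT: $issuer not ready after $tries checks"
  return 1
}

# wait_for_issuer letsencrypt-prod
```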

Phase 2: Database & Storage (T-30 to T-10 minutes)

2.1 Deploy PostgreSQL StatefulSet

# Deploy PostgreSQL to production
kubectl apply -f k8s/production/postgres-statefulset.yaml

# Watch for Pod readiness (should take 30-60 seconds)
kubectl rollout status statefulset/postgres -n gravl-production

# Verify pod is running and ready (2/2 containers)
kubectl get pods -n gravl-production -l component=database

Verification:

  • Pod status: Running, Ready 2/2
  • PersistentVolumeClaim bound
  • No errors in pod logs: kubectl logs postgres-0 -n gravl-production

2.2 Run Database Migrations

# Option 1: run migrations locally through a temporary port-forward
kubectl port-forward postgres-0 5432:5432 -n gravl-production &

# In a separate terminal
cd backend
npm run db:migrate:prod

# Kill the port-forward when done
kill %1

# Option 2: if migrations run as a Kubernetes Job instead, monitor it directly
kubectl logs -n gravl-production -f job/db-migration

Verification:

  • Migration job completed successfully
  • No migration errors in logs
  • Database schema matches expected version
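One way to check the applied schema version non-interactively. This assumes the migration tool behind `npm run db:migrate:prod` keeps a bookkeeping table named `migrations` with `id` and `name` columns; verify the actual table name for the tool in use:

```shell
# Print the most recently applied migration (table name is tool-dependent -- an assumption here).
latest_migration() {
  kubectl exec postgres-0 -n gravl-production -- \
    psql -U gravl_user -d gravl -tA \
    -c "SELECT name FROM migrations ORDER BY id DESC LIMIT 1;"
}
```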

2.3 Verify Database Connectivity

# Fetch the database password so psql can run non-interactively
POSTGRES_PASSWORD=$(kubectl get secret postgres-secret -n gravl-production \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)

# Create a test pod to verify DB access
kubectl run -it --rm --image=postgres:15 \
  --restart=Never \
  -n gravl-production \
  --env="PGPASSWORD=$POSTGRES_PASSWORD" \
  psql-test \
  -- psql -h postgres -U gravl_user -d gravl -c "SELECT version();"

# Should return PostgreSQL version

Verification:

  • Database connection successful
  • PostgreSQL version visible

Phase 3: Deploy Application Services (T-10 to T+20 minutes)

3.1 Deploy Backend Deployment

# Deploy backend service
kubectl apply -f k8s/production/backend-deployment.yaml

# Wait for rollout (typically 2-3 minutes)
kubectl rollout status deployment/backend -n gravl-production

# Verify pods running
kubectl get pods -n gravl-production -l component=backend

Verification:

  • Pods running and ready (depends on replicas, e.g., 3 replicas = 3/3 ready)
  • No CrashLoopBackOff errors
  • Service endpoint registered: kubectl get svc backend -n gravl-production

3.2 Deploy Frontend Deployment

# Deploy frontend service
kubectl apply -f k8s/production/frontend-deployment.yaml

# Wait for rollout
kubectl rollout status deployment/frontend -n gravl-production

# Verify pods
kubectl get pods -n gravl-production -l component=frontend

Verification:

  • Frontend pods running and ready
  • Service endpoint registered

3.3 Apply Ingress with TLS Termination

# Deploy ingress (cert-manager will auto-provision TLS if the cert-manager.io/cluster-issuer annotation is set)
kubectl apply -f k8s/production/ingress.yaml

# Wait for ingress to get external IP / DNS name (typically 30-60 seconds)
kubectl get ingress -n gravl-production -w

# Check ingress status and TLS certificate
kubectl describe ingress gravl-ingress -n gravl-production

Verification:

  • Ingress has external IP or DNS name assigned
  • TLS certificate present (cert-manager auto-created if configured)
  • SSL certificate not self-signed (check with OpenSSL; modern s_client prints lowercase subject=/issuer= lines):
    echo | openssl s_client -servername gravl.example.com \
      -connect $(kubectl get ingress gravl-ingress -n gravl-production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):443 2>/dev/null | grep -iE 'subject=|issuer='

Phase 4: Service Integration Verification (T+20 to T+40 minutes)

4.1 Test Service-to-Service Communication

# Exec into backend pod to test database connection
BACKEND_POD=$(kubectl get pod -n gravl-production -l component=backend -o jsonpath='{.items[0].metadata.name}')

kubectl exec -it $BACKEND_POD -n gravl-production -- \
  curl http://postgres:5432 -v 2>&1 | head -5

# Expected: a protocol error or empty reply (the TCP connection was accepted);
# "connection refused" means the Service or NetworkPolicy is misconfigured

Verification:

  • Backend can reach database (even if timeout, not connection refused)
  • Backend logs show no database errors: kubectl logs $BACKEND_POD -n gravl-production | grep -i error | head -10

4.2 Health Check Endpoint

# Get backend service IP
BACKEND_SVC=$(kubectl get svc backend -n gravl-production -o jsonpath='{.spec.clusterIP}')

# Test health endpoint (from another pod)
kubectl run -it --rm --image=curlimages/curl \
  --restart=Never \
  -n gravl-production \
  curl-test \
  --command \
  -- curl http://$BACKEND_SVC:3000/health

# Expected response: {"status":"ok"} or similar

Verification:

  • Health endpoint responds (HTTP 200)
  • No error messages in response
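A single curl can race freshly-Ready pods; polling for a short window is more forgiving. A sketch (`wait_healthy` is a hypothetical helper):

```shell
# Poll a URL until it returns HTTP 200, or give up after N tries.
wait_healthy() {
  local url="$1" tries="${2:-30}"
  local code=""
  for i in $(seq 1 "$tries"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    [ "$code" = "200" ] && { echo "HEALTHY: $url"; return 0; }
    sleep 2
  done
  echo "UNHEALTHY: $url (last code: $code)"
  return 1
}

# wait_healthy "http://$BACKEND_SVC:3000/health"
```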

4.3 External Endpoint Test (via Ingress)

# Wait for DNS propagation (if using DNS name, not IP)
# Then test external access
curl -k https://gravl.example.com/api/health

# Expected: HTTP 200 with health status

Verification:

  • HTTPS responds (-k skips certificate validation; once cert-manager has issued the certificate, the check must also pass without -k)
  • Backend responds through ingress
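To confirm the -k flag is no longer needed, the issuer of the certificate actually served on port 443 can be inspected directly. A sketch; the domain is the placeholder used throughout this document:

```shell
# Print the issuer of the certificate served on :443 for the given hostname.
cert_issuer() {
  local host="$1"
  echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -issuer
}

# cert_issuer gravl.example.com
# Expect a Let's Encrypt issuer, not the ingress controller's default "Fake Certificate".
```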

Phase 5: Monitoring & Alerting Setup (T+40 to T+60 minutes)

5.1 Verify Prometheus Scraping

# Check Prometheus targets (should show gravl-production scrape configs)
kubectl port-forward -n gravl-monitoring svc/prometheus 9090:9090 &

# Open http://localhost:9090/targets in browser
# Verify all gravl-production targets are "UP"

kill %1

Verification:

  • All production targets showing as UP
  • No "DOWN" endpoints
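The browser check can also be scripted against the Prometheus HTTP API. A sketch assuming `jq` is installed and the port-forward above is active:

```shell
# List scrape targets that are not healthy; empty output means every target is up.
down_targets() {
  curl -s http://localhost:9090/api/v1/targets \
    | jq -r '.data.activeTargets[] | select(.health != "up") | .labels.instance'
}
```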

5.2 Verify Grafana Dashboards

# Access Grafana
kubectl port-forward -n gravl-monitoring svc/grafana 3000:3000 &

# Open http://localhost:3000
# Login with default credentials (or stored secret)
# Navigate to Gravl dashboards
# Verify graphs showing production metrics

kill %1

Verification:

  • Gravl dashboards visible
  • Metrics flowing (not empty graphs)
  • CPU, memory, request rate graphs showing data

5.3 Verify AlertManager

# Check AlertManager configuration (should have production severity levels)
kubectl get alertmanagerconfig -n gravl-monitoring
kubectl describe alertmanagerconfig -n gravl-monitoring

Verification:

  • Alerts configured for production thresholds
  • Notification channels (Slack, PagerDuty, etc.) configured

5.4 Test Alert Trigger

# Send test alert through AlertManager
kubectl exec -it -n gravl-monitoring alertmanager-0 -- \
  amtool alert add test_alert severity=info --alertmanager.url=http://localhost:9093

# Check Slack / notification channel for alert (should arrive within 1 minute)

Verification:

  • Test alert received in notification channel
  • Alert formatting correct
  • No excessive duplicate alerts
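Besides watching Slack, the alert's presence can be confirmed through AlertManager's API. A rough sketch (assumes a port-forward to :9093; the grep over raw JSON is a quick check, not a full API client):

```shell
# Check whether an alert with the given alertname is currently active in AlertManager.
alert_visible() {
  local name="$1"
  curl -s http://localhost:9093/api/v2/alerts \
    | grep -q "\"alertname\":\"$name\"" \
    && echo "VISIBLE: $name" \
    || { echo "ABSENT: $name"; return 1; }
}

# alert_visible test_alert
```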

Phase 6: Load Test & Baseline (T+60 to T+90 minutes)

6.1 Run Load Test on Production (Low Traffic)

# Generate light load using k6 or Apache Bench
k6 run --vus 10 --duration 5m k8s/production/load-test.js

# Expected results:
# - p95 latency: <200ms
# - Throughput: >100 req/s
# - Error rate: <0.1%

Verification:

  • p95 latency <200ms
  • Error rate <0.1%
  • No pod restarts during test

6.2 Baseline Metrics Captured

# Log current metrics for baseline
kubectl top nodes > /tmp/baseline-nodes.txt
kubectl top pods -n gravl-production > /tmp/baseline-pods.txt

# Store for comparison (alert if exceeds 2x baseline)

Verification:

  • Node CPU/Memory usage within expected range
  • Pod CPU/Memory usage within resource requests
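The "2x baseline" comparison can be automated against the two snapshot files. A sketch that assumes the `kubectl top pods` column layout (NAME CPU MEMORY) and compares CPU only; the awk numeric coercion strips the "m" suffix:

```shell
# Report pods whose current CPU (millicores) exceeds 2x their recorded baseline.
over_baseline() {
  local baseline="$1" current="$2"
  sort "$baseline" > /tmp/_baseline.sorted
  sort "$current"  > /tmp/_current.sorted
  # join on pod name: fields become NAME BASE_CPU BASE_MEM CUR_CPU CUR_MEM
  join /tmp/_baseline.sorted /tmp/_current.sorted \
    | awk '{ b = $2 + 0; c = $4 + 0; if (c > 2 * b) print $1, "baseline:", $2, "now:", $4 }'
  rm -f /tmp/_baseline.sorted /tmp/_current.sorted
}

# over_baseline /tmp/baseline-pods.txt /tmp/current-pods.txt
```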

Phase 7: Production Sign-Off (T+90 minutes)

7.1 Final Checklist

  • All pre-flight checks passed
  • Database healthy and migrated
  • All services running and ready
  • Ingress responding (TLS valid)
  • Health checks passing
  • Monitoring metrics flowing
  • Alerts functional
  • Load test passed
  • Team lead review: READY TO GO LIVE

7.2 Change Log Entry

# Log deployment to version control (write inside the repo so git can track it)
cat > docs/PRODUCTION_DEPLOY.log << 'DEPLOY_LOG'
---
date: 2026-03-06
time: ~09:30 UTC
environment: production
namespace: gravl-production
services:
  - backend: v1.x.x
  - frontend: v1.x.x
  - postgres: 15.x
  - ingress: nginx
  - certificates: cert-manager (Let's Encrypt)
pre_flight_status: ✅ PASSED
security_review: ✅ APPROVED
monitoring_status: ✅ OPERATIONAL
load_test_result: ✅ PASSED
sign_off_by: [DevOps Lead]
DEPLOY_LOG

git add docs/PRODUCTION_DEPLOY.log
git commit -m "Production deployment log - 2026-03-06"

7.3 Notify Team

  • Send deployment completion notice to Slack #gravl-announce

    🚀 **Gravl Production Deployment COMPLETE**
    - Timestamp: 2026-03-06 09:30 UTC
    - All systems operational
    - Monitoring dashboards: [link]
    - Status page: [link]
    
  • Update status page (if external-facing)

  • Notify stakeholders (product, marketing)


Rollback Decision Tree

If at any point a critical failure occurs:

  1. Do NOT proceed
  2. Trigger ROLLBACK.md procedure
  3. Investigate root cause post-incident (blameless postmortem)

Critical Failure Indicators:

  • Database connection failures after 3 retries
  • More than 2 pod crashes during rollout
  • Ingress TLS certificate invalid
  • Health checks failing on all pods
  • Alerts firing for production thresholds
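The "more than 2 pod crashes" trigger can be checked numerically rather than by scanning `kubectl get pods` output. A sketch that sums container restart counts across the namespace:

```shell
# Sum container restartCount across all pods in the production namespace.
restart_count() {
  kubectl get pods -n gravl-production \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[*].restartCount}{"\n"}{end}' \
    | awk '{ for (i = 1; i <= NF; i++) s += $i } END { print s + 0 }'
}

# [ "$(restart_count)" -gt 2 ] && echo "ROLLBACK TRIGGER: pod crash threshold exceeded"
```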

Post-Deployment (T+120 minutes and beyond)

7.4 Sustained Monitoring Window (Next 24 hours)

  • Assign on-call rotation (24h monitoring)
  • Set up escalation policy (alert → on-call → incident lead)
  • Daily review of logs and metrics for first week
  • Customer feedback monitoring (support tickets, user reports)

7.5 Post-Deployment Review (24 hours)

  • Team retrospective (what went well, what to improve)
  • Update runbooks based on findings
  • Document any manual interventions for automation
  • Plan optimization and hardening work for next phase

Document Version: 1.0
Last Updated: 2026-03-06 08:50
Next Update: After first production deployment attempt