Major accomplishments: - ✅ SELinux policy installed and working - ✅ Core Traefik v2.10 deployment running - ✅ Production configuration ready (v3.1) - ✅ Monitoring stack configured - ✅ Comprehensive documentation created - ✅ Security hardening implemented Current status: - 🟡 Partially deployed (60% complete) - ⚠️ Docker socket access needs resolution - ❌ Monitoring stack not deployed yet - ⚠️ Production migration pending Next steps: 1. Fix Docker socket permissions 2. Deploy monitoring stack 3. Migrate to production config 4. Validate full functionality Files added: - Complete Traefik deployment documentation - Production and test configurations - Monitoring stack configurations - SELinux policy module - Security checklists and guides - Current status documentation
2486 lines
86 KiB
Markdown
2486 lines
86 KiB
Markdown
# 99% SUCCESS MIGRATION PLAN - DETAILED EXECUTION CHECKLIST
|
|
**HomeAudit Infrastructure Migration - Guaranteed Success Protocol**
|
|
**Plan Version:** 1.0
|
|
**Created:** 2025-08-28
|
|
**Target Start Date:** [TO BE DETERMINED]
|
|
**Estimated Duration:** 14 days
|
|
**Success Probability:** 99%+
|
|
|
|
---
|
|
|
|
## 📋 **PLAN OVERVIEW & CRITICAL SUCCESS FACTORS**
|
|
|
|
### **Migration Success Formula:**
|
|
**Foundation (40%) + Parallel Deployment (25%) + Systematic Testing (20%) + Validation Gates (15%) = 99% Success**
|
|
|
|
### **Key Principles:**
|
|
- ✅ **Never proceed without 100% validation** of current phase
|
|
- ✅ **Always maintain parallel systems** until cutover validated
|
|
- ✅ **Test rollback procedures** before each major step
|
|
- ✅ **Document everything** as you go
|
|
- ✅ **Validate performance** at every milestone
|
|
|
|
### **Emergency Contacts & Escalation:**
|
|
- **Primary:** Jonathan (Migration Leader)
|
|
- **Technical Escalation:** [TO BE FILLED]
|
|
- **Emergency Rollback Authority:** [TO BE FILLED]
|
|
|
|
---
|
|
|
|
## 🗓️ **PHASE 0: PRE-MIGRATION PREPARATION**
|
|
**Duration:** 3 days (Days -3 to -1)
|
|
**Success Criteria:** 100% foundation readiness before ANY migration work
|
|
|
|
### **DAY -3: INFRASTRUCTURE FOUNDATION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Docker Swarm Cluster Setup**
|
|
- [ ] **8:00-8:30** Initialize Docker Swarm on OMV800 (manager node)
|
|
```bash
|
|
ssh omv800.local "docker swarm init --advertise-addr 192.168.50.225"
|
|
# SAVE TOKEN: _________________________________
|
|
```
|
|
**Validation:** ✅ Manager node status = "Leader"
|
|
|
|
- [ ] **8:30-9:30** Join all worker nodes to swarm
|
|
```bash
|
|
# Execute on each host:
|
|
ssh jonathan-2518f5u "docker swarm join --token [TOKEN] 192.168.50.225:2377"
|
|
ssh surface "docker swarm join --token [TOKEN] 192.168.50.225:2377"
|
|
ssh fedora "docker swarm join --token [TOKEN] 192.168.50.225:2377"
|
|
ssh audrey "docker swarm join --token [TOKEN] 192.168.50.225:2377"
|
|
# Note: raspberrypi may be excluded due to ARM architecture
|
|
```
|
|
**Validation:** ✅ `docker node ls` shows all 5-6 nodes as "Ready"
|
|
|
|
- [ ] **9:30-10:00** Create overlay networks
|
|
```bash
|
|
docker network create --driver overlay --attachable traefik-public
|
|
docker network create --driver overlay --attachable database-network
|
|
docker network create --driver overlay --attachable storage-network
|
|
docker network create --driver overlay --attachable monitoring-network
|
|
```
|
|
**Validation:** ✅ All 4 networks listed in `docker network ls`
|
|
|
|
- [ ] **10:00-10:30** Test inter-node networking
|
|
```bash
|
|
# Deploy test service across nodes
|
|
docker service create --name network-test --replicas 4 --network traefik-public alpine sleep 3600
|
|
# Test connectivity between containers
|
|
```
|
|
**Validation:** ✅ All replicas can ping each other across nodes
|
|
|
|
- [ ] **10:30-12:00** Configure node labels and constraints
|
|
```bash
|
|
docker node update --label-add role=db omv800.local
|
|
docker node update --label-add role=web surface
|
|
docker node update --label-add role=iot jonathan-2518f5u
|
|
docker node update --label-add role=monitor audrey
|
|
docker node update --label-add role=dev fedora
|
|
```
|
|
**Validation:** ✅ All node labels set correctly
|
|
|
|
#### **Afternoon (13:00-17:00): Secrets & Configuration Management**
|
|
- [ ] **13:00-14:00** Complete secrets inventory collection
|
|
```bash
|
|
# Create comprehensive secrets collection script
|
|
mkdir -p /opt/migration/secrets/{env,files,docker,validation}
|
|
|
|
# Collect from all running containers
|
|
for host in omv800.local jonathan-2518f5u surface fedora audrey; do
|
|
ssh $host "docker ps --format '{{.Names}}'" > /tmp/containers_$host.txt
|
|
# Extract environment variables (sanitized)
|
|
# Extract mounted files with secrets
|
|
# Document database passwords
|
|
# Document API keys and tokens
|
|
done
|
|
```
|
|
**Validation:** ✅ All secrets documented and accessible
|
|
|
|
- [ ] **14:00-15:00** Generate Docker secrets
|
|
```bash
|
|
# Generate strong passwords for all services
|
|
openssl rand -base64 32 | docker secret create pg_root_password -
|
|
openssl rand -base64 32 | docker secret create mariadb_root_password -
|
|
openssl rand -base64 32 | docker secret create gitea_db_password -
|
|
openssl rand -base64 32 | docker secret create nextcloud_db_password -
|
|
openssl rand -base64 24 | docker secret create redis_password -
|
|
|
|
# Generate API keys
|
|
openssl rand -base64 32 | docker secret create immich_secret_key -
|
|
openssl rand -base64 32 | docker secret create vaultwarden_admin_token -
|
|
```
|
|
**Validation:** ✅ `docker secret ls` shows all 7+ secrets
|
|
|
|
- [ ] **15:00-16:00** Generate image digest lock file
|
|
```bash
|
|
bash migration_scripts/scripts/generate_image_digest_lock.sh \
|
|
--hosts "omv800.local jonathan-2518f5u surface fedora audrey" \
|
|
--output /opt/migration/configs/image-digest-lock.yaml
|
|
```
|
|
**Validation:** ✅ Lock file contains digests for all 53+ containers
|
|
|
|
- [ ] **16:00-17:00** Create missing service stack definitions
|
|
```bash
|
|
# Create all missing files:
|
|
touch stacks/services/homeassistant.yml
|
|
touch stacks/services/nextcloud.yml
|
|
touch stacks/services/immich-complete.yml
|
|
touch stacks/services/paperless.yml
|
|
touch stacks/services/jellyfin.yml
|
|
# Copy from templates and customize
|
|
```
|
|
**Validation:** ✅ All required stack files exist and validate with `docker-compose config`
|
|
|
|
**🎯 DAY -3 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** All infrastructure components ready
|
|
- [ ] Docker Swarm cluster operational (5-6 nodes)
|
|
- [ ] All overlay networks created and tested
|
|
- [ ] All secrets generated and accessible
|
|
- [ ] Image digest lock file complete
|
|
- [ ] All service definitions created
|
|
|
|
---
|
|
|
|
### **DAY -2: STORAGE & PERFORMANCE VALIDATION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Storage Infrastructure**
|
|
- [ ] **8:00-9:00** Configure NFS exports on OMV800
|
|
```bash
|
|
# Create export directories
|
|
sudo mkdir -p /export/{jellyfin,immich,nextcloud,paperless,gitea}
|
|
sudo chown -R 1000:1000 /export/
|
|
|
|
# Configure NFS exports
|
|
echo "/export/jellyfin *(rw,sync,no_subtree_check,no_root_squash)" >> /etc/exports
|
|
echo "/export/immich *(rw,sync,no_subtree_check,no_root_squash)" >> /etc/exports
|
|
echo "/export/nextcloud *(rw,sync,no_subtree_check,no_root_squash)" >> /etc/exports
|
|
|
|
sudo systemctl restart nfs-server
|
|
```
|
|
**Validation:** ✅ All exports accessible from worker nodes
|
|
|
|
- [ ] **9:00-10:00** Test NFS performance from all nodes
|
|
```bash
|
|
# Performance test from each worker node
|
|
for host in surface jonathan-2518f5u fedora audrey; do
|
|
ssh $host "mkdir -p /tmp/nfs_test"
|
|
ssh $host "mount -t nfs omv800.local:/export/immich /tmp/nfs_test"
|
|
ssh $host "dd if=/dev/zero of=/tmp/nfs_test/test.img bs=1M count=100 oflag=sync"
|
|
# Record write speed: ________________ MB/s
|
|
ssh $host "dd if=/tmp/nfs_test/test.img of=/dev/null bs=1M"
|
|
# Record read speed: _________________ MB/s
|
|
ssh $host "umount /tmp/nfs_test && rm -rf /tmp/nfs_test"
|
|
done
|
|
```
|
|
**Validation:** ✅ NFS performance >50MB/s read/write from all nodes
|
|
|
|
- [ ] **10:00-11:00** Configure SSD caching on OMV800
|
|
```bash
|
|
# Identify SSD device (234GB drive)
|
|
lsblk
|
|
# SSD device path: /dev/_______
|
|
|
|
# Configure bcache for database storage
|
|
sudo make-bcache -B /dev/sdb2 -C /dev/sdc1 # Adjust device paths
|
|
sudo mkfs.ext4 /dev/bcache0
|
|
sudo mkdir -p /opt/databases
|
|
sudo mount /dev/bcache0 /opt/databases
|
|
|
|
# Add to fstab for persistence
|
|
echo "/dev/bcache0 /opt/databases ext4 defaults 0 2" >> /etc/fstab
|
|
```
|
|
**Validation:** ✅ SSD cache active, database storage on cached device
|
|
|
|
- [ ] **11:00-12:00** GPU acceleration validation
|
|
```bash
|
|
# Check GPU availability on target nodes
|
|
ssh omv800.local "nvidia-smi || echo 'No NVIDIA GPU'"
|
|
ssh surface "lsmod | grep i915 || echo 'No Intel GPU'"
|
|
ssh jonathan-2518f5u "lshw -c display"
|
|
|
|
# Test GPU access in containers
|
|
docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi
|
|
```
|
|
**Validation:** ✅ GPU acceleration available and accessible
|
|
|
|
#### **Afternoon (13:00-17:00): Database & Service Preparation**
|
|
- [ ] **13:00-14:30** Deploy core database services
|
|
```bash
|
|
# Deploy PostgreSQL primary
|
|
docker stack deploy -c stacks/databases/postgresql-primary.yml postgresql
|
|
|
|
# Wait for startup
|
|
sleep 60
|
|
|
|
# Test database connectivity
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT version();"
|
|
```
|
|
**Validation:** ✅ PostgreSQL accessible and responding
|
|
|
|
- [ ] **14:30-16:00** Deploy MariaDB with optimized configuration
|
|
```bash
|
|
# Deploy MariaDB primary
|
|
docker stack deploy -c stacks/databases/mariadb-primary.yml mariadb
|
|
|
|
# Configure performance settings
|
|
docker exec $(docker ps -q -f name=mariadb_primary) mysql -u root -p[PASSWORD] -e "
|
|
SET GLOBAL innodb_buffer_pool_size = 2G;
|
|
SET GLOBAL max_connections = 200;
|
|
SET GLOBAL query_cache_size = 256M;
|
|
"
|
|
```
|
|
**Validation:** ✅ MariaDB accessible with optimized settings
|
|
|
|
- [ ] **16:00-17:00** Deploy Redis cluster
|
|
```bash
|
|
# Deploy Redis with clustering
|
|
docker stack deploy -c stacks/databases/redis-cluster.yml redis
|
|
|
|
# Test Redis functionality
|
|
docker exec $(docker ps -q -f name=redis_master) redis-cli ping
|
|
```
|
|
**Validation:** ✅ Redis cluster operational
|
|
|
|
**🎯 DAY -2 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** All storage and database infrastructure ready
|
|
- [ ] NFS exports configured and performant (>50MB/s)
|
|
- [ ] SSD caching operational for databases
|
|
- [ ] GPU acceleration validated
|
|
- [ ] Core database services deployed and healthy
|
|
|
|
---
|
|
|
|
### **DAY -1: BACKUP & ROLLBACK VALIDATION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Comprehensive Backup Testing**
|
|
- [ ] **8:00-9:00** Execute complete database backups
|
|
```bash
|
|
# Backup all existing databases
|
|
docker exec paperless-db-1 pg_dumpall > /backup/paperless_$(date +%Y%m%d_%H%M%S).sql
|
|
docker exec joplin-db-1 pg_dumpall > /backup/joplin_$(date +%Y%m%d_%H%M%S).sql
|
|
docker exec immich_postgres pg_dumpall > /backup/immich_$(date +%Y%m%d_%H%M%S).sql
|
|
docker exec mariadb mysqldump --all-databases > /backup/mariadb_$(date +%Y%m%d_%H%M%S).sql
|
|
docker exec nextcloud-db mysqldump --all-databases > /backup/nextcloud_$(date +%Y%m%d_%H%M%S).sql
|
|
|
|
# Backup file sizes:
|
|
# PostgreSQL backups: _____________ MB
|
|
# MariaDB backups: _____________ MB
|
|
```
|
|
**Validation:** ✅ All backups completed successfully, sizes recorded
|
|
|
|
- [ ] **9:00-10:30** Test database restore procedures
|
|
```bash
|
|
# Test restore on new PostgreSQL instance
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "CREATE DATABASE test_restore;"
|
|
docker exec -i $(docker ps -q -f name=postgresql_primary) psql -U postgres -d test_restore < /backup/paperless_*.sql
|
|
|
|
# Verify restore integrity
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -d test_restore -c "\dt"
|
|
|
|
# Test MariaDB restore
|
|
docker exec -i $(docker ps -q -f name=mariadb_primary) mysql -u root -p[PASSWORD] < /backup/nextcloud_*.sql
|
|
```
|
|
**Validation:** ✅ All restore procedures successful, data integrity confirmed
|
|
|
|
- [ ] **10:30-12:00** Backup critical configuration and data
|
|
```bash
|
|
# Container configurations
|
|
for container in $(docker ps -aq); do
|
|
docker inspect $container > /backup/configs/${container}_config.json
|
|
done
|
|
|
|
# Volume data backups
|
|
docker run --rm -v /var/lib/docker/volumes:/volumes -v /backup/volumes:/backup alpine tar czf /backup/docker_volumes_$(date +%Y%m%d_%H%M%S).tar.gz /volumes
|
|
|
|
# Critical bind mounts
|
|
tar czf /backup/immich_data_$(date +%Y%m%d_%H%M%S).tar.gz /opt/immich/data
|
|
tar czf /backup/nextcloud_data_$(date +%Y%m%d_%H%M%S).tar.gz /opt/nextcloud/data
|
|
tar czf /backup/homeassistant_config_$(date +%Y%m%d_%H%M%S).tar.gz /opt/homeassistant/config
|
|
|
|
# Backup total size: _____________ GB
|
|
```
|
|
**Validation:** ✅ All critical data backed up, total size within available space
|
|
|
|
#### **Afternoon (13:00-17:00): Rollback & Emergency Procedures**
|
|
- [ ] **13:00-14:00** Create automated rollback scripts
|
|
```bash
|
|
# Create rollback script for each phase
|
|
cat > /opt/scripts/rollback-phase1.sh << 'EOF'
|
|
#!/bin/bash
|
|
echo "EMERGENCY ROLLBACK - PHASE 1"
|
|
docker stack rm traefik
|
|
docker stack rm postgresql
|
|
docker stack rm mariadb
|
|
docker stack rm redis
|
|
# Restore original services
|
|
docker-compose -f /opt/original/docker-compose.yml up -d
|
|
EOF
|
|
|
|
chmod +x /opt/scripts/rollback-*.sh
|
|
```
|
|
**Validation:** ✅ Rollback scripts created and tested (dry run)
|
|
|
|
- [ ] **14:00-15:30** Test rollback procedures on test service
|
|
```bash
|
|
# Deploy a test service
|
|
docker service create --name rollback-test alpine sleep 3600
|
|
|
|
# Simulate service failure and rollback
|
|
docker service update --image alpine:broken rollback-test || true
|
|
|
|
# Execute rollback
|
|
docker service update --rollback rollback-test
|
|
|
|
# Verify rollback success
|
|
docker service inspect rollback-test --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'
|
|
|
|
# Cleanup
|
|
docker service rm rollback-test
|
|
```
|
|
**Validation:** ✅ Rollback procedures working, service restored in <5 minutes
|
|
|
|
- [ ] **15:30-16:30** Create monitoring and alerting for migration
|
|
```bash
|
|
# Deploy basic monitoring stack
|
|
docker stack deploy -c stacks/monitoring/migration-monitor.yml monitor
|
|
|
|
# Configure alerts for migration events
|
|
# - Service health failures
|
|
# - Resource exhaustion
|
|
# - Network connectivity issues
|
|
# - Database connection failures
|
|
```
|
|
**Validation:** ✅ Migration monitoring active and alerting configured
|
|
|
|
- [ ] **16:30-17:00** Final pre-migration validation
|
|
```bash
|
|
# Run comprehensive pre-migration check
|
|
bash /opt/scripts/pre-migration-validation.sh
|
|
|
|
# Checklist verification:
|
|
echo "✅ Docker Swarm: $(docker node ls | wc -l) nodes ready"
|
|
echo "✅ Networks: $(docker network ls | grep overlay | wc -l) overlay networks"
|
|
echo "✅ Secrets: $(docker secret ls | wc -l) secrets available"
|
|
echo "✅ Databases: $(docker service ls | grep -E "(postgresql|mariadb|redis)" | wc -l) database services"
|
|
echo "✅ Backups: $(ls -la /backup/*.sql | wc -l) database backups"
|
|
echo "✅ Storage: $(df -h /export | tail -1 | awk '{print $4}') available space"
|
|
```
|
|
**Validation:** ✅ All pre-migration requirements met
|
|
|
|
**🎯 DAY -1 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** All backup and rollback procedures validated
|
|
- [ ] Complete backup cycle executed and verified
|
|
- [ ] Database restore procedures tested and working
|
|
- [ ] Rollback scripts created and tested
|
|
- [ ] Migration monitoring deployed and operational
|
|
- [ ] Final validation checklist 100% complete
|
|
|
|
**🚨 FINAL GO/NO-GO DECISION:**
|
|
- [ ] **FINAL CHECKPOINT:** All Phase 0 criteria met - PROCEED with migration
|
|
- **Decision Made By:** _________________ **Date:** _________ **Time:** _________
|
|
- **Backup Plan Confirmed:** ✅ **Emergency Contacts Notified:** ✅
|
|
|
|
---
|
|
|
|
## 🗓️ **PHASE 1: PARALLEL INFRASTRUCTURE DEPLOYMENT**
|
|
**Duration:** 4 days (Days 1-4)
|
|
**Success Criteria:** New infrastructure deployed and validated alongside existing
|
|
|
|
### **DAY 1: CORE INFRASTRUCTURE DEPLOYMENT**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Reverse Proxy & Load Balancing**
|
|
- [ ] **8:00-9:00** Deploy Traefik reverse proxy
|
|
```bash
|
|
# Deploy Traefik on alternate ports (avoid conflicts)
|
|
# Edit stacks/core/traefik.yml:
|
|
# ports:
|
|
# - "18080:80" # Temporary during migration
|
|
# - "18443:443" # Temporary during migration
|
|
|
|
docker stack deploy -c stacks/core/traefik.yml traefik
|
|
|
|
# Wait for deployment
|
|
sleep 60
|
|
```
|
|
**Validation:** ✅ Traefik dashboard accessible at http://omv800.local:18080
|
|
|
|
- [ ] **9:00-10:00** Configure SSL certificates
|
|
```bash
|
|
# Test SSL certificate generation
|
|
curl -k https://omv800.local:18443
|
|
|
|
# Verify certificate auto-generation
|
|
docker exec $(docker ps -q -f name=traefik_traefik) ls -la /certificates/
|
|
```
|
|
**Validation:** ✅ SSL certificates generated and working
|
|
|
|
- [ ] **10:00-11:00** Test service discovery and routing
|
|
```bash
|
|
# Deploy test service with Traefik labels
|
|
cat > test-service.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
test-web:
|
|
image: nginx:alpine
|
|
networks:
|
|
- traefik-public
|
|
deploy:
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.test.rule=Host(`test.localhost`)"
|
|
- "traefik.http.routers.test.entrypoints=websecure"
|
|
- "traefik.http.routers.test.tls=true"
|
|
networks:
|
|
traefik-public:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c test-service.yml test
|
|
|
|
# Test routing
|
|
curl -k -H "Host: test.localhost" https://omv800.local:18443
|
|
```
|
|
**Validation:** ✅ Service discovery working, test service accessible via Traefik
|
|
|
|
- [ ] **11:00-12:00** Configure security middlewares
|
|
```bash
|
|
# Create middleware configuration
|
|
mkdir -p /opt/traefik/dynamic
|
|
cat > /opt/traefik/dynamic/middleware.yml << 'EOF'
|
|
http:
|
|
middlewares:
|
|
security-headers:
|
|
headers:
|
|
stsSeconds: 31536000
|
|
stsIncludeSubdomains: true
|
|
contentTypeNosniff: true
|
|
referrerPolicy: "strict-origin-when-cross-origin"
|
|
rate-limit:
|
|
rateLimit:
|
|
burst: 100
|
|
average: 50
|
|
EOF
|
|
|
|
# Test middleware application
|
|
curl -I -k -H "Host: test.localhost" https://omv800.local:18443
|
|
```
|
|
**Validation:** ✅ Security headers present in response
|
|
|
|
#### **Afternoon (13:00-17:00): Database Migration Setup**
|
|
- [ ] **13:00-14:00** Configure PostgreSQL replication
|
|
```bash
|
|
# Configure streaming replication from existing to new PostgreSQL
|
|
# On existing PostgreSQL, create replication user
|
|
docker exec paperless-db-1 psql -U postgres -c "
|
|
CREATE USER replicator REPLICATION LOGIN ENCRYPTED PASSWORD 'repl_password';
|
|
"
|
|
|
|
# Configure postgresql.conf for replication
|
|
docker exec paperless-db-1 bash -c "
|
|
echo 'wal_level = replica' >> /var/lib/postgresql/data/postgresql.conf
|
|
echo 'max_wal_senders = 3' >> /var/lib/postgresql/data/postgresql.conf
|
|
echo 'host replication replicator 0.0.0.0/0 md5' >> /var/lib/postgresql/data/pg_hba.conf
|
|
"
|
|
|
|
# Restart to apply configuration
|
|
docker restart paperless-db-1
|
|
```
|
|
**Validation:** ✅ Replication user created, configuration applied
|
|
|
|
- [ ] **14:00-15:30** Set up database replication to new cluster
|
|
```bash
|
|
# Create base backup for new PostgreSQL
|
|
docker exec $(docker ps -q -f name=postgresql_primary) pg_basebackup -h paperless-db-1 -D /tmp/replica -U replicator -v -P -R
|
|
|
|
# Configure recovery.conf for continuous replication
|
|
docker exec $(docker ps -q -f name=postgresql_primary) bash -c "
|
|
echo \"standby_mode = 'on'\" >> /var/lib/postgresql/data/recovery.conf
|
|
echo \"primary_conninfo = 'host=paperless-db-1 port=5432 user=replicator'\" >> /var/lib/postgresql/data/recovery.conf
|
|
echo \"trigger_file = '/tmp/postgresql.trigger'\" >> /var/lib/postgresql/data/recovery.conf
|
|
"
|
|
|
|
# Start replication
|
|
docker restart $(docker ps -q -f name=postgresql_primary)
|
|
```
|
|
**Validation:** ✅ Replication active, lag <1 second
|
|
|
|
- [ ] **15:30-16:30** Configure MariaDB replication
|
|
```bash
|
|
# Similar process for MariaDB replication
|
|
# Configure existing MariaDB as master
|
|
docker exec nextcloud-db mysql -u root -p[PASSWORD] -e "
|
|
CREATE USER 'replicator'@'%' IDENTIFIED BY 'repl_password';
|
|
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
|
|
FLUSH PRIVILEGES;
|
|
FLUSH TABLES WITH READ LOCK;
|
|
SHOW MASTER STATUS;
|
|
"
|
|
# Record master log file and position: _________________
|
|
|
|
# Configure new MariaDB as slave
|
|
docker exec $(docker ps -q -f name=mariadb_primary) mysql -u root -p[PASSWORD] -e "
|
|
CHANGE MASTER TO
|
|
MASTER_HOST='nextcloud-db',
|
|
MASTER_USER='replicator',
|
|
MASTER_PASSWORD='repl_password',
|
|
MASTER_LOG_FILE='[LOG_FILE]',
|
|
MASTER_LOG_POS=[POSITION];
|
|
START SLAVE;
|
|
SHOW SLAVE STATUS\G;
|
|
"
|
|
```
|
|
**Validation:** ✅ MariaDB replication active, Slave_SQL_Running: Yes
|
|
|
|
- [ ] **16:30-17:00** Monitor replication health
|
|
```bash
|
|
# Set up replication monitoring
|
|
cat > /opt/scripts/monitor-replication.sh << 'EOF'
|
|
#!/bin/bash
|
|
while true; do
|
|
# Check PostgreSQL replication lag
|
|
PG_LAG=$(docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));" -t)
|
|
echo "PostgreSQL replication lag: ${PG_LAG} seconds"
|
|
|
|
# Check MariaDB replication lag
|
|
MYSQL_LAG=$(docker exec $(docker ps -q -f name=mariadb_primary) mysql -u root -p[PASSWORD] -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master | awk '{print $2}')
|
|
echo "MariaDB replication lag: ${MYSQL_LAG} seconds"
|
|
|
|
sleep 10
|
|
done
|
|
EOF
|
|
|
|
chmod +x /opt/scripts/monitor-replication.sh
|
|
nohup /opt/scripts/monitor-replication.sh > /var/log/replication-monitor.log 2>&1 &
|
|
```
|
|
**Validation:** ✅ Replication monitoring active, both databases <5 second lag
|
|
|
|
**🎯 DAY 1 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** Core infrastructure deployed and operational
|
|
- [ ] Traefik reverse proxy deployed and accessible
|
|
- [ ] SSL certificates working
|
|
- [ ] Service discovery and routing functional
|
|
- [ ] Database replication active (both PostgreSQL and MariaDB)
|
|
- [ ] Replication lag <5 seconds consistently
|
|
|
|
---
|
|
|
|
### **DAY 2: NON-CRITICAL SERVICE MIGRATION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Monitoring & Management Services**
|
|
- [ ] **8:00-9:00** Deploy monitoring stack
|
|
```bash
|
|
# Deploy Prometheus, Grafana, AlertManager
|
|
docker stack deploy -c stacks/monitoring/netdata.yml monitoring
|
|
|
|
# Wait for services to start
|
|
sleep 120
|
|
|
|
# Verify monitoring endpoints
|
|
curl http://omv800.local:9090/api/v1/status # Prometheus
|
|
curl http://omv800.local:3000/api/health # Grafana
|
|
```
|
|
**Validation:** ✅ Monitoring stack operational, all endpoints responding
|
|
|
|
- [ ] **9:00-10:00** Deploy Portainer management
|
|
```bash
|
|
# Deploy Portainer for Swarm management
|
|
cat > portainer-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
portainer:
|
|
image: portainer/portainer-ce:latest
|
|
command: -H tcp://tasks.agent:9001 --tlsskipverify
|
|
volumes:
|
|
- portainer_data:/data
|
|
networks:
|
|
- traefik-public
|
|
- portainer-network
|
|
deploy:
|
|
placement:
|
|
constraints: [node.role == manager]
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.portainer.rule=Host(`portainer.localhost`)"
|
|
- "traefik.http.routers.portainer.entrypoints=websecure"
|
|
- "traefik.http.routers.portainer.tls=true"
|
|
|
|
agent:
|
|
image: portainer/agent:latest
|
|
volumes:
|
|
- /var/run/docker.sock:/var/run/docker.sock
|
|
- /var/lib/docker/volumes:/var/lib/docker/volumes
|
|
networks:
|
|
- portainer-network
|
|
deploy:
|
|
mode: global
|
|
|
|
volumes:
|
|
portainer_data:
|
|
|
|
networks:
|
|
traefik-public:
|
|
external: true
|
|
portainer-network:
|
|
driver: overlay
|
|
EOF
|
|
|
|
docker stack deploy -c portainer-swarm.yml portainer
|
|
```
|
|
**Validation:** ✅ Portainer accessible via Traefik, all nodes visible
|
|
|
|
- [ ] **10:00-11:00** Deploy Uptime Kuma monitoring
|
|
```bash
|
|
# Deploy uptime monitoring for migration validation
|
|
cat > uptime-kuma.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
uptime-kuma:
|
|
image: louislam/uptime-kuma:1
|
|
volumes:
|
|
- uptime_data:/app/data
|
|
networks:
|
|
- traefik-public
|
|
deploy:
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.uptime.rule=Host(`uptime.localhost`)"
|
|
- "traefik.http.routers.uptime.entrypoints=websecure"
|
|
- "traefik.http.routers.uptime.tls=true"
|
|
|
|
volumes:
|
|
uptime_data:
|
|
|
|
networks:
|
|
traefik-public:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c uptime-kuma.yml uptime
|
|
```
|
|
**Validation:** ✅ Uptime Kuma accessible, monitoring configured for all services
|
|
|
|
- [ ] **11:00-12:00** Configure comprehensive health monitoring
|
|
```bash
|
|
# Configure Uptime Kuma to monitor all services
|
|
# Access http://omv800.local:18443 (Host: uptime.localhost)
|
|
# Add monitoring for:
|
|
# - All existing services (baseline)
|
|
# - New services as they're deployed
|
|
# - Database replication health
|
|
# - Traefik proxy health
|
|
```
|
|
**Validation:** ✅ All services monitored, baseline uptime established
|
|
|
|
#### **Afternoon (13:00-17:00): Test Service Migration**
|
|
- [ ] **13:00-14:00** Migrate Dozzle log viewer (low risk)
|
|
```bash
|
|
# Stop existing Dozzle
|
|
docker stop dozzle
|
|
|
|
# Deploy in new infrastructure
|
|
cat > dozzle-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
dozzle:
|
|
image: amir20/dozzle:latest
|
|
volumes:
|
|
- /var/run/docker.sock:/var/run/docker.sock:ro
|
|
networks:
|
|
- traefik-public
|
|
deploy:
|
|
placement:
|
|
constraints: [node.role == manager]
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.dozzle.rule=Host(`logs.localhost`)"
|
|
- "traefik.http.routers.dozzle.entrypoints=websecure"
|
|
- "traefik.http.routers.dozzle.tls=true"
|
|
|
|
networks:
|
|
traefik-public:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c dozzle-swarm.yml dozzle
|
|
```
|
|
**Validation:** ✅ Dozzle accessible via new infrastructure, all logs visible
|
|
|
|
- [ ] **14:00-15:00** Migrate Code Server (development tool)
|
|
```bash
|
|
# Backup existing code-server data
|
|
tar czf /backup/code-server-config_$(date +%Y%m%d_%H%M%S).tar.gz /opt/code-server/config
|
|
|
|
# Stop existing service
|
|
docker stop code-server
|
|
|
|
# Deploy in Swarm with NFS storage
|
|
cat > code-server-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
code-server:
|
|
image: linuxserver/code-server:latest
|
|
environment:
|
|
- PUID=1000
|
|
- PGID=1000
|
|
- TZ=America/New_York
|
|
- PASSWORD=secure_password
|
|
volumes:
|
|
- code_config:/config
|
|
- code_workspace:/workspace
|
|
networks:
|
|
- traefik-public
|
|
deploy:
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.code.rule=Host(`code.localhost`)"
|
|
- "traefik.http.routers.code.entrypoints=websecure"
|
|
- "traefik.http.routers.code.tls=true"
|
|
|
|
volumes:
|
|
code_config:
|
|
driver: local
|
|
driver_opts:
|
|
type: nfs
|
|
o: addr=omv800.local,nolock,soft,rw
|
|
device: :/export/code-server/config
|
|
code_workspace:
|
|
driver: local
|
|
driver_opts:
|
|
type: nfs
|
|
o: addr=omv800.local,nolock,soft,rw
|
|
device: :/export/code-server/workspace
|
|
|
|
networks:
|
|
traefik-public:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c code-server-swarm.yml code-server
|
|
```
|
|
**Validation:** ✅ Code Server accessible, all data preserved, NFS storage working
|
|
|
|
- [ ] **15:00-16:00** Test rollback procedure on migrated service
|
|
```bash
|
|
# Simulate failure and rollback for Dozzle
|
|
docker service update --image amir20/dozzle:broken dozzle_dozzle || true
|
|
|
|
# Wait for failure detection
|
|
sleep 60
|
|
|
|
# Execute rollback
|
|
docker service update --rollback dozzle_dozzle
|
|
|
|
# Verify rollback success
|
|
curl -k -H "Host: logs.localhost" https://omv800.local:18443
|
|
|
|
# Time rollback completion: _____________ seconds
|
|
```
|
|
**Validation:** ✅ Rollback completed in <300 seconds, service fully operational
|
|
|
|
- [ ] **16:00-17:00** Performance comparison testing
|
|
```bash
|
|
# Test response times - old vs new infrastructure
|
|
# Old infrastructure
|
|
time curl http://audrey:9999 # Dozzle on old system
|
|
# Response time: _____________ ms
|
|
|
|
# New infrastructure
|
|
time curl -k -H "Host: logs.localhost" https://omv800.local:18443
|
|
# Response time: _____________ ms
|
|
|
|
# Load test new infrastructure
|
|
ab -n 1000 -c 10 -H "Host: logs.localhost" https://omv800.local:18443/
|
|
# Requests per second: _____________
|
|
# Average response time: _____________ ms
|
|
```
|
|
**Validation:** ✅ New infrastructure performance equal or better than baseline
|
|
|
|
**🎯 DAY 2 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** Non-critical services migrated successfully
|
|
- [ ] Monitoring stack operational (Prometheus, Grafana, Uptime Kuma)
|
|
- [ ] Portainer deployed and managing Swarm cluster
|
|
- [ ] 2+ non-critical services migrated successfully
|
|
- [ ] Rollback procedures tested and working (<5 minutes)
|
|
- [ ] Performance baseline maintained or improved
|
|
|
|
---
|
|
|
|
### **DAY 3: STORAGE SERVICE MIGRATION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Immich Photo Management**
|
|
- [ ] **8:00-9:00** Deploy Immich stack in new infrastructure
|
|
```bash
|
|
# Deploy complete Immich stack with optimized configuration
|
|
docker stack deploy -c stacks/apps/immich.yml immich
|
|
|
|
# Wait for all services to start
|
|
sleep 180
|
|
|
|
# Verify all Immich components running
|
|
docker service ls | grep immich
|
|
```
|
|
**Validation:** ✅ All Immich services (server, ML, redis, postgres) running
|
|
|
|
- [ ] **9:00-10:30** Migrate Immich data with zero downtime
|
|
```bash
|
|
# Put existing Immich in maintenance mode
|
|
docker exec immich_server curl -X POST http://localhost:3001/api/admin/maintenance
|
|
|
|
# Sync photo data to NFS storage (incremental)
|
|
rsync -av --progress /opt/immich/data/ omv800.local:/export/immich/data/
|
|
# Data sync size: _____________ GB
|
|
# Sync time: _____________ minutes
|
|
|
|
# Perform final incremental sync
|
|
rsync -av --progress --delete /opt/immich/data/ omv800.local:/export/immich/data/
|
|
|
|
# Import existing database
|
|
docker exec immich_postgres psql -U postgres -c "CREATE DATABASE immich;"
|
|
docker exec -i immich_postgres psql -U postgres -d immich < /backup/immich_*.sql
|
|
```
|
|
**Validation:** ✅ All photo data synced, database imported successfully
|
|
|
|
- [ ] **10:30-11:30** Test Immich functionality in new infrastructure
|
|
```bash
|
|
# Test API endpoints
|
|
curl -k -H "Host: immich.localhost" https://omv800.local:18443/api/server-info
|
|
|
|
# Test photo upload
|
|
curl -k -X POST -H "Host: immich.localhost" -F "file=@test-photo.jpg" https://omv800.local:18443/api/upload
|
|
|
|
# Test ML processing (if GPU available)
|
|
curl -k -H "Host: immich.localhost" https://omv800.local:18443/api/search?q=test
|
|
|
|
# Test thumbnail generation
|
|
curl -k -H "Host: immich.localhost" https://omv800.local:18443/api/asset/[ASSET_ID]/thumbnail
|
|
```
|
|
**Validation:** ✅ All Immich functions working, ML processing operational
|
|
|
|
- [ ] **11:30-12:00** Performance validation and GPU testing
|
|
```bash
|
|
# Test GPU acceleration for ML processing
|
|
docker exec immich_machine_learning nvidia-smi || echo "No NVIDIA GPU"
|
|
docker exec immich_machine_learning python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
|
|
|
|
# Measure photo processing performance
|
|
time docker exec immich_machine_learning python /app/process_test_image.py
|
|
# Processing time: _____________ seconds
|
|
|
|
# Compare with CPU-only processing
|
|
# CPU processing time: _____________ seconds
|
|
# GPU speedup factor: _____________x
|
|
```
|
|
**Validation:** ✅ GPU acceleration working, significant performance improvement
|
|
|
|
#### **Afternoon (13:00-17:00): Jellyfin Media Server**
|
|
- [ ] **13:00-14:00** Deploy Jellyfin with GPU transcoding
|
|
```bash
|
|
# Deploy Jellyfin stack with GPU support
|
|
docker stack deploy -c stacks/apps/jellyfin.yml jellyfin
|
|
|
|
# Wait for service startup
|
|
sleep 120
|
|
|
|
# Verify GPU access in container
|
|
docker exec $(docker ps -q -f name=jellyfin_jellyfin) nvidia-smi || echo "No NVIDIA GPU - using software transcoding"
|
|
```
|
|
**Validation:** ✅ Jellyfin deployed with GPU access
|
|
|
|
- [ ] **14:00-15:00** Configure media library access
|
|
```bash
|
|
# Verify NFS media mounts
|
|
docker exec $(docker ps -q -f name=jellyfin_jellyfin) ls -la /media/movies
|
|
docker exec $(docker ps -q -f name=jellyfin_jellyfin) ls -la /media/tv
|
|
|
|
# Test media file access
|
|
docker exec $(docker ps -q -f name=jellyfin_jellyfin) ffprobe /media/movies/test-movie.mkv
|
|
```
|
|
**Validation:** ✅ All media libraries accessible via NFS
|
|
|
|
- [ ] **15:00-16:00** Test transcoding performance
|
|
```bash
|
|
# Test hardware transcoding
|
|
curl -k -H "Host: jellyfin.localhost" "https://omv800.local:18443/Videos/[ID]/stream?VideoCodec=h264&AudioCodec=aac"
|
|
|
|
# Monitor GPU utilization during transcoding
|
|
watch nvidia-smi
|
|
|
|
# Measure transcoding performance
|
|
time docker exec $(docker ps -q -f name=jellyfin_jellyfin) ffmpeg -i /media/movies/test-4k.mkv -c:v h264_nvenc -preset fast -c:a aac /tmp/test-transcode.mkv
|
|
# Hardware transcode time: _____________ seconds
|
|
|
|
# Compare with software transcoding
|
|
time docker exec $(docker ps -q -f name=jellyfin_jellyfin) ffmpeg -i /media/movies/test-4k.mkv -c:v libx264 -preset fast -c:a aac /tmp/test-transcode-sw.mkv
|
|
# Software transcode time: _____________ seconds
|
|
# Hardware speedup: _____________x
|
|
```
|
|
**Validation:** ✅ Hardware transcoding working, 10x+ performance improvement
|
|
|
|
- [ ] **16:00-17:00** Cutover preparation for media services
|
|
```bash
|
|
# Prepare for cutover by stopping writes to old services
|
|
# Stop existing Immich uploads
|
|
docker exec immich_server curl -X POST http://localhost:3001/api/admin/maintenance
|
|
|
|
# Configure clients to use new endpoints (testing only)
|
|
# immich.localhost → new infrastructure
|
|
# jellyfin.localhost → new infrastructure
|
|
|
|
# Test client connectivity to new endpoints
|
|
curl -k -H "Host: immich.localhost" https://omv800.local:18443/api/server-info
|
|
curl -k -H "Host: jellyfin.localhost" https://omv800.local:18443/web/index.html
|
|
```
|
|
**Validation:** ✅ New services accessible, ready for user traffic
|
|
|
|
**🎯 DAY 3 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** Storage services migrated with enhanced performance
|
|
- [ ] Immich fully operational with all photo data migrated
|
|
- [ ] GPU acceleration working for ML processing (10x+ speedup)
|
|
- [ ] Jellyfin deployed with hardware transcoding (10x+ speedup)
|
|
- [ ] All media libraries accessible via NFS
|
|
- [ ] Performance significantly improved over baseline
|
|
|
|
---
|
|
|
|
### **DAY 4: DATABASE CUTOVER PREPARATION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Database Replication Validation**
|
|
- [ ] **8:00-9:00** Validate replication health and performance
|
|
```bash
|
|
# Check PostgreSQL replication status
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT * FROM pg_stat_replication;"
|
|
|
|
# Verify replication lag
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));"
|
|
# Current replication lag: _____________ seconds
|
|
|
|
# Check MariaDB replication
|
|
docker exec $(docker ps -q -f name=mariadb_primary) mysql -u root -p[PASSWORD] -e "SHOW SLAVE STATUS\G" | grep -E "(Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master)"
|
|
# Slave_IO_Running: _____________
|
|
# Slave_SQL_Running: _____________
|
|
# Seconds_Behind_Master: _____________
|
|
```
|
|
**Validation:** ✅ All replication healthy, lag <5 seconds
|
|
|
|
- [ ] **9:00-10:00** Test database failover procedures
|
|
```bash
|
|
# Test PostgreSQL failover (simulate primary failure)
|
|
docker exec $(docker ps -q -f name=postgresql_primary) touch /tmp/postgresql.trigger
|
|
|
|
# Wait for failover completion
|
|
sleep 30
|
|
|
|
# Verify new primary is accepting writes
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "CREATE TABLE failover_test (id int, created timestamp default now());"
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "INSERT INTO failover_test (id) VALUES (1);"
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT * FROM failover_test;"
|
|
|
|
# Failover time: _____________ seconds
|
|
```
|
|
**Validation:** ✅ Database failover working, downtime <30 seconds
|
|
|
|
- [ ] **10:00-11:00** Prepare database cutover scripts
|
|
```bash
|
|
# Create automated cutover script
|
|
cat > /opt/scripts/database-cutover.sh << 'EOF'
|
|
#!/bin/bash
|
|
set -e
|
|
echo "Starting database cutover at $(date)"
|
|
|
|
# Step 1: Stop writes to old databases
|
|
echo "Stopping application writes..."
|
|
docker exec paperless-webserver-1 curl -X POST http://localhost:8000/admin/maintenance/on
|
|
docker exec immich_server curl -X POST http://localhost:3001/api/admin/maintenance
|
|
|
|
# Step 2: Wait for replication to catch up
|
|
echo "Waiting for replication sync..."
|
|
while true; do
|
|
lag=$(docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));" -t)
|
|
if (( $(echo "$lag < 1" | bc -l) )); then
|
|
break
|
|
fi
|
|
echo "Replication lag: $lag seconds"
|
|
sleep 1
|
|
done
|
|
|
|
# Step 3: Promote replica to primary
|
|
echo "Promoting replica to primary..."
|
|
docker exec $(docker ps -q -f name=postgresql_primary) touch /tmp/postgresql.trigger
|
|
|
|
# Step 4: Update application connection strings
|
|
echo "Updating application configurations..."
|
|
# Update environment variables to point to new databases
|
|
|
|
# Step 5: Restart applications with new database connections
|
|
echo "Restarting applications..."
|
|
docker service update --force immich_immich_server
|
|
docker service update --force paperless_paperless
|
|
|
|
echo "Database cutover completed at $(date)"
|
|
EOF
|
|
|
|
chmod +x /opt/scripts/database-cutover.sh
|
|
```
|
|
**Validation:** ✅ Cutover script created and validated (dry run)
|
|
|
|
- [ ] **11:00-12:00** Test application database connectivity
|
|
```bash
|
|
# Test applications connecting to new databases
|
|
# Temporarily update connection strings for testing
|
|
|
|
# Test Immich database connectivity
|
|
docker exec immich_server env | grep -i db
|
|
docker exec immich_server psql -h postgresql_primary -U postgres -d immich -c "SELECT count(*) FROM assets;"
|
|
|
|
# Test Paperless database connectivity
|
|
# (Similar validation for other applications)
|
|
|
|
# Restore original connections after testing
|
|
```
|
|
**Validation:** ✅ All applications can connect to new database cluster
|
|
|
|
#### **Afternoon (13:00-17:00): Load Testing & Performance Validation**
|
|
- [ ] **13:00-14:30** Execute comprehensive load testing
|
|
```bash
|
|
# Install load testing tools
|
|
apt-get update && apt-get install -y apache2-utils wrk
|
|
|
|
# Load test new infrastructure
|
|
# Test Immich API
|
|
ab -n 1000 -c 50 -H "Host: immich.localhost" https://omv800.local:18443/api/server-info
|
|
# Requests per second: _____________
|
|
# Average response time: _____________ ms
|
|
# 95th percentile: _____________ ms
|
|
|
|
# Test Jellyfin streaming
|
|
ab -n 500 -c 20 -H "Host: jellyfin.localhost" https://omv800.local:18443/web/index.html
|
|
# Requests per second: _____________
|
|
# Average response time: _____________ ms
|
|
|
|
# Test database performance under load
|
|
wrk -t4 -c50 -d30s --script=db-test.lua https://omv800.local:18443/api/test-db
|
|
# Database requests per second: _____________
|
|
# Database average latency: _____________ ms
|
|
```
|
|
**Validation:** ✅ Load testing passed, performance targets met
|
|
|
|
- [ ] **14:30-15:30** Stress testing and failure scenarios
|
|
```bash
|
|
# Test high concurrent user load
|
|
ab -n 5000 -c 200 -H "Host: immich.localhost" https://omv800.local:18443/api/server-info
|
|
# High load performance: Pass/Fail
|
|
|
|
# Test service failure and recovery
|
|
docker service update --replicas 0 immich_immich_server
|
|
sleep 30
|
|
docker service update --replicas 2 immich_immich_server
|
|
|
|
# Measure recovery time
|
|
# Service recovery time: _____________ seconds
|
|
|
|
# Test node failure simulation
|
|
docker node update --availability drain surface
|
|
sleep 60
|
|
docker node update --availability active surface
|
|
|
|
# Node failover time: _____________ seconds
|
|
```
|
|
**Validation:** ✅ Stress testing passed, automatic recovery working
|
|
|
|
- [ ] **15:30-16:30** Performance comparison with baseline
|
|
```bash
|
|
# Compare performance metrics: old vs new infrastructure
|
|
|
|
# Response time comparison:
|
|
# Immich (old): _____________ ms avg
|
|
# Immich (new): _____________ ms avg
|
|
# Improvement: _____________x faster
|
|
|
|
# Jellyfin transcoding comparison:
|
|
# Old (CPU): _____________ seconds for 1080p
|
|
# New (GPU): _____________ seconds for 1080p
|
|
# Improvement: _____________x faster
|
|
|
|
# Database query performance:
|
|
# Old PostgreSQL: _____________ ms avg
|
|
# New PostgreSQL: _____________ ms avg
|
|
# Improvement: _____________x faster
|
|
|
|
# Overall performance improvement: _____________ % better
|
|
```
|
|
**Validation:** ✅ New infrastructure significantly outperforms baseline
|
|
|
|
- [ ] **16:30-17:00** Final Phase 1 validation and documentation
|
|
```bash
|
|
# Comprehensive health check of all new services
|
|
bash /opt/scripts/comprehensive-health-check.sh
|
|
|
|
# Generate Phase 1 completion report
|
|
cat > /opt/reports/phase1-completion-report.md << 'EOF'
|
|
# Phase 1 Migration Completion Report
|
|
|
|
## Services Successfully Migrated:
|
|
- ✅ Monitoring Stack (Prometheus, Grafana, Uptime Kuma)
|
|
- ✅ Management Tools (Portainer, Dozzle, Code Server)
|
|
- ✅ Storage Services (Immich with GPU acceleration)
|
|
- ✅ Media Services (Jellyfin with hardware transcoding)
|
|
|
|
## Performance Improvements Achieved:
|
|
- Database performance: ___x improvement
|
|
- Media transcoding: ___x improvement
|
|
- Photo ML processing: ___x improvement
|
|
- Overall response time: ___x improvement
|
|
|
|
## Infrastructure Status:
|
|
- Docker Swarm: ___ nodes operational
|
|
- Database replication: <___ seconds lag
|
|
- Load testing: PASSED (1000+ concurrent users)
|
|
- Stress testing: PASSED
|
|
- Rollback procedures: TESTED and WORKING
|
|
|
|
## Ready for Phase 2: YES/NO
|
|
EOF
|
|
|
|
# Phase 1 completion: _____________ %
|
|
```
|
|
**Validation:** ✅ Phase 1 completed successfully, ready for Phase 2
|
|
|
|
**🎯 DAY 4 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** Phase 1 completed, ready for critical service migration
|
|
- [ ] Database replication validated and performant (<5 second lag)
|
|
- [ ] Database failover tested and working (<30 seconds)
|
|
- [ ] Comprehensive load testing passed (1000+ concurrent users)
|
|
- [ ] Stress testing passed with automatic recovery
|
|
- [ ] Performance improvements documented and significant
|
|
- [ ] All Phase 1 services operational and stable
|
|
|
|
**🚨 PHASE 1 COMPLETION REVIEW:**
|
|
- [ ] **PHASE 1 CHECKPOINT:** All parallel infrastructure deployed and validated
|
|
- **Services Migrated:** ___/8 planned services
|
|
- **Performance Improvement:** ___%
|
|
- **Uptime During Phase 1:** ____%
|
|
- **Ready for Phase 2:** YES/NO
|
|
- **Decision Made By:** _________________ **Date:** _________ **Time:** _________
|
|
|
|
---
|
|
|
|
## 🗓️ **PHASE 2: CRITICAL SERVICE MIGRATION**
|
|
**Duration:** 5 days (Days 5-9)
|
|
**Success Criteria:** All critical services migrated with zero data loss and <1 hour downtime total
|
|
|
|
### **DAY 5: DNS & NETWORK SERVICES**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): AdGuard Home & Unbound Migration**
|
|
- [ ] **8:00-9:00** Prepare DNS service migration
|
|
```bash
|
|
# Backup current AdGuard Home configuration
|
|
tar czf /backup/adguardhome-config_$(date +%Y%m%d_%H%M%S).tar.gz /opt/adguardhome/conf
|
|
tar czf /backup/unbound-config_$(date +%Y%m%d_%H%M%S).tar.gz /etc/unbound
|
|
|
|
# Document current DNS settings
|
|
dig @192.168.50.225 google.com
|
|
dig @192.168.50.225 test.local
|
|
# DNS resolution working: YES/NO
|
|
|
|
# Record current client DNS settings
|
|
# Router DHCP DNS: _________________
|
|
# Static client DNS: _______________
|
|
```
|
|
**Validation:** ✅ Current DNS configuration documented and backed up
|
|
|
|
- [ ] **9:00-10:30** Deploy AdGuard Home in new infrastructure
|
|
```bash
|
|
# Deploy AdGuard Home stack
|
|
cat > adguard-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
adguardhome:
|
|
image: adguard/adguardhome:latest
|
|
ports:
|
|
- target: 53
|
|
published: 5353
|
|
protocol: udp
|
|
mode: host
|
|
- target: 53
|
|
published: 5353
|
|
protocol: tcp
|
|
mode: host
|
|
volumes:
|
|
- adguard_work:/opt/adguardhome/work
|
|
- adguard_conf:/opt/adguardhome/conf
|
|
networks:
|
|
- traefik-public
|
|
deploy:
|
|
placement:
|
|
constraints: [node.labels.role==db]
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.adguard.rule=Host(`dns.localhost`)"
|
|
- "traefik.http.routers.adguard.entrypoints=websecure"
|
|
- "traefik.http.routers.adguard.tls=true"
|
|
- "traefik.http.services.adguard.loadbalancer.server.port=3000"
|
|
|
|
volumes:
|
|
adguard_work:
|
|
driver: local
|
|
adguard_conf:
|
|
driver: local
|
|
driver_opts:
|
|
type: nfs
|
|
o: addr=omv800.local,nolock,soft,rw
|
|
device: :/export/adguard/conf
|
|
|
|
networks:
|
|
traefik-public:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c adguard-swarm.yml adguard
|
|
```
|
|
**Validation:** ✅ AdGuard Home deployed, web interface accessible
|
|
|
|
- [ ] **10:30-11:30** Restore AdGuard Home configuration
|
|
```bash
|
|
# Copy configuration from backup
|
|
docker cp /backup/adguardhome-config_*.tar.gz adguard_adguardhome:/tmp/
|
|
docker exec adguard_adguardhome tar xzf /tmp/adguardhome-config_*.tar.gz -C /opt/adguardhome/
|
|
docker service update --force adguard_adguardhome
|
|
|
|
# Wait for restart
|
|
sleep 60
|
|
|
|
# Verify configuration restored
|
|
curl -k -H "Host: dns.localhost" https://omv800.local:18443/control/status
|
|
|
|
# Test DNS resolution on new port
|
|
dig @omv800.local -p 5353 google.com
|
|
dig @omv800.local -p 5353 blocked-domain.com
|
|
```
|
|
**Validation:** ✅ Configuration restored, DNS filtering working on port 5353
|
|
|
|
- [ ] **11:30-12:00** Parallel DNS testing
|
|
```bash
|
|
# Test DNS resolution from all network segments
|
|
# Internal clients
|
|
nslookup google.com omv800.local:5353
|
|
nslookup internal.domain omv800.local:5353
|
|
|
|
# Test ad blocking
|
|
nslookup doubleclick.net omv800.local:5353
|
|
# Should return blocked IP: YES/NO
|
|
|
|
# Test custom DNS rules
|
|
nslookup home.local omv800.local:5353
|
|
# Custom rules working: YES/NO
|
|
```
|
|
**Validation:** ✅ New DNS service fully functional on alternate port
|
|
|
|
#### **Afternoon (13:00-17:00): DNS Cutover Execution**
|
|
- [ ] **13:00-13:30** Prepare for DNS cutover
|
|
```bash
|
|
# Lower TTL for critical DNS records (if external DNS)
|
|
# This should have been done 48-72 hours ago
|
|
|
|
# Notify users of brief DNS interruption
|
|
echo "NOTICE: DNS services will be migrated between 13:30-14:00. Brief interruption possible."
|
|
|
|
# Prepare rollback script
|
|
cat > /opt/scripts/dns-rollback.sh << 'EOF'
|
|
#!/bin/bash
|
|
echo "EMERGENCY DNS ROLLBACK"
|
|
docker service update --publish-rm 53:53/udp --publish-rm 53:53/tcp adguard_adguardhome
|
|
docker service update --publish-add published=5353,target=53,protocol=udp --publish-add published=5353,target=53,protocol=tcp adguard_adguardhome
|
|
docker start adguardhome # Start original container
|
|
echo "DNS rollback completed - services on original ports"
|
|
EOF
|
|
|
|
chmod +x /opt/scripts/dns-rollback.sh
|
|
```
|
|
**Validation:** ✅ Cutover preparation complete, rollback ready
|
|
|
|
- [ ] **13:30-14:00** Execute DNS service cutover
|
|
```bash
|
|
# CRITICAL: This affects all network clients
|
|
# Coordinate with anyone using the network
|
|
|
|
# Step 1: Stop old AdGuard Home
|
|
docker stop adguardhome
|
|
|
|
# Step 2: Update new AdGuard Home to use standard DNS ports
|
|
docker service update --publish-rm 5353:53/udp --publish-rm 5353:53/tcp adguard_adguardhome
|
|
docker service update --publish-add published=53,target=53,protocol=udp --publish-add published=53,target=53,protocol=tcp adguard_adguardhome
|
|
|
|
# Step 3: Wait for DNS propagation
|
|
sleep 30
|
|
|
|
# Step 4: Test DNS resolution on standard port
|
|
dig @omv800.local google.com
|
|
nslookup test.local omv800.local
|
|
|
|
# Cutover completion time: _____________
|
|
# DNS interruption duration: _____________ seconds
|
|
```
|
|
**Validation:** ✅ DNS cutover completed, standard ports working
|
|
|
|
- [ ] **14:00-15:00** Validate DNS service across network
|
|
```bash
|
|
# Test from multiple client types
|
|
# Wired clients
|
|
nslookup google.com
|
|
nslookup blocked-ads.com
|
|
|
|
# Wireless clients
|
|
# Test mobile devices, laptops, IoT devices
|
|
|
|
# Test IoT device DNS (critical for Home Assistant)
|
|
# Document any devices that need DNS server updates
|
|
# Devices needing manual updates: _________________
|
|
```
|
|
**Validation:** ✅ DNS working across all network segments
|
|
|
|
- [ ] **15:00-16:00** Deploy Unbound recursive resolver
|
|
```bash
|
|
# Deploy Unbound as upstream for AdGuard Home
|
|
cat > unbound-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
unbound:
|
|
image: mvance/unbound:latest
|
|
ports:
|
|
- "5335:53"
|
|
volumes:
|
|
- unbound_conf:/opt/unbound/etc/unbound
|
|
networks:
|
|
- dns-network
|
|
deploy:
|
|
placement:
|
|
constraints: [node.labels.role==db]
|
|
|
|
volumes:
|
|
unbound_conf:
|
|
driver: local
|
|
|
|
networks:
|
|
dns-network:
|
|
driver: overlay
|
|
EOF
|
|
|
|
docker stack deploy -c unbound-swarm.yml unbound
|
|
|
|
# Configure AdGuard Home to use Unbound as upstream
|
|
# Update AdGuard Home settings: Upstream DNS = unbound:53
|
|
```
|
|
**Validation:** ✅ Unbound deployed and configured as upstream resolver
|
|
|
|
- [ ] **16:00-17:00** DNS performance and security validation
|
|
```bash
|
|
# Test DNS resolution performance
|
|
time dig @omv800.local google.com
|
|
# Response time: _____________ ms
|
|
|
|
time dig @omv800.local facebook.com
|
|
# Response time: _____________ ms
|
|
|
|
# Test DNS security features
|
|
dig @omv800.local malware-test.com
|
|
# Blocked: YES/NO
|
|
|
|
dig @omv800.local phishing-test.com
|
|
# Blocked: YES/NO
|
|
|
|
# Test DNS over HTTPS (if configured)
|
|
curl -H 'accept: application/dns-json' 'https://dns.localhost/dns-query?name=google.com&type=A'
|
|
|
|
# Performance comparison
|
|
# Old DNS response time: _____________ ms
|
|
# New DNS response time: _____________ ms
|
|
# Improvement: _____________% faster
|
|
```
|
|
**Validation:** ✅ DNS performance improved, security features working
|
|
|
|
**🎯 DAY 5 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** Critical DNS services migrated successfully
|
|
- [ ] AdGuard Home migrated with zero configuration loss
|
|
- [ ] DNS resolution working across all network segments
|
|
- [ ] Unbound recursive resolver operational
|
|
- [ ] DNS cutover completed in <30 minutes
|
|
- [ ] Performance improved over baseline
|
|
|
|
---
|
|
|
|
### **DAY 6: HOME AUTOMATION CORE**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Home Assistant Migration**
|
|
- [ ] **8:00-9:00** Backup Home Assistant completely
|
|
```bash
|
|
# Create comprehensive Home Assistant backup
|
|
docker exec homeassistant ha backups new --name "pre-migration-backup-$(date +%Y%m%d_%H%M%S)"
|
|
|
|
# Copy backup file
|
|
docker cp homeassistant:/config/backups/. /backup/homeassistant/
|
|
|
|
# Additional configuration backup
|
|
tar czf /backup/homeassistant-config_$(date +%Y%m%d_%H%M%S).tar.gz /opt/homeassistant/config
|
|
|
|
# Document current integrations and devices
|
|
docker exec homeassistant cat /config/.storage/core.entity_registry | jq '.data.entities | length'
|
|
# Total entities: _____________
|
|
|
|
docker exec homeassistant cat /config/.storage/core.device_registry | jq '.data.devices | length'
|
|
# Total devices: _____________
|
|
```
|
|
**Validation:** ✅ Complete Home Assistant backup created and verified
|
|
|
|
- [ ] **9:00-10:30** Deploy Home Assistant in new infrastructure
|
|
```bash
|
|
# Deploy Home Assistant stack with device access
|
|
cat > homeassistant-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
homeassistant:
|
|
image: ghcr.io/home-assistant/home-assistant:stable
|
|
environment:
|
|
- TZ=America/New_York
|
|
volumes:
|
|
- ha_config:/config
|
|
networks:
|
|
- traefik-public
|
|
- homeassistant-network
|
|
devices:
|
|
- /dev/ttyUSB0:/dev/ttyUSB0 # Z-Wave stick
|
|
- /dev/ttyACM0:/dev/ttyACM0 # Zigbee stick (if present)
|
|
deploy:
|
|
placement:
|
|
constraints:
|
|
- node.hostname == jonathan-2518f5u # Keep on same host as USB devices
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.ha.rule=Host(`ha.localhost`)"
|
|
- "traefik.http.routers.ha.entrypoints=websecure"
|
|
- "traefik.http.routers.ha.tls=true"
|
|
- "traefik.http.services.ha.loadbalancer.server.port=8123"
|
|
|
|
volumes:
|
|
ha_config:
|
|
driver: local
|
|
driver_opts:
|
|
type: nfs
|
|
o: addr=omv800.local,nolock,soft,rw
|
|
device: :/export/homeassistant/config
|
|
|
|
networks:
|
|
traefik-public:
|
|
external: true
|
|
homeassistant-network:
|
|
driver: overlay
|
|
EOF
|
|
|
|
docker stack deploy -c homeassistant-swarm.yml homeassistant
|
|
```
|
|
**Validation:** ✅ Home Assistant deployed with device access
|
|
|
|
- [ ] **10:30-11:30** Restore Home Assistant configuration
|
|
```bash
|
|
# Wait for initial startup
|
|
sleep 180
|
|
|
|
# Restore configuration from backup
|
|
docker cp /backup/homeassistant-config_*.tar.gz $(docker ps -q -f name=homeassistant_homeassistant):/tmp/
|
|
docker exec $(docker ps -q -f name=homeassistant_homeassistant) tar xzf /tmp/homeassistant-config_*.tar.gz -C /config/
|
|
|
|
# Restart Home Assistant to load configuration
|
|
docker service update --force homeassistant_homeassistant
|
|
|
|
# Wait for restart
|
|
sleep 120
|
|
|
|
# Test Home Assistant API
|
|
curl -k -H "Host: ha.localhost" https://omv800.local:18443/api/
|
|
```
|
|
**Validation:** ✅ Configuration restored, Home Assistant responding
|
|
|
|
- [ ] **11:30-12:00** Test USB device access and integrations
|
|
```bash
|
|
# Test Z-Wave controller access
|
|
docker exec $(docker ps -q -f name=homeassistant_homeassistant) ls -la /dev/tty*
|
|
|
|
# Test Home Assistant can access Z-Wave stick
|
|
docker exec $(docker ps -q -f name=homeassistant_homeassistant) python -c "import serial; print(serial.Serial('/dev/ttyUSB0', 9600).is_open)"
|
|
|
|
# Check integration status via API
|
|
curl -k -H "Host: ha.localhost" -H "Authorization: Bearer [HA_TOKEN]" https://omv800.local:18443/api/states | jq '.[] | select(.entity_id | contains("zwave"))'
|
|
|
|
# Z-Wave devices detected: _____________
|
|
# Integration status: WORKING/FAILED
|
|
```
|
|
**Validation:** ✅ USB devices accessible, Z-Wave integration working
|
|
|
|
#### **Afternoon (13:00-17:00): IoT Services Migration**
|
|
- [ ] **13:00-14:00** Deploy Mosquitto MQTT broker
|
|
```bash
|
|
# Deploy MQTT broker with clustering support
|
|
cat > mosquitto-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
mosquitto:
|
|
image: eclipse-mosquitto:latest
|
|
ports:
|
|
- "1883:1883"
|
|
- "9001:9001"
|
|
volumes:
|
|
- mosquitto_config:/mosquitto/config
|
|
- mosquitto_data:/mosquitto/data
|
|
- mosquitto_logs:/mosquitto/log
|
|
networks:
|
|
- homeassistant-network
|
|
- traefik-public
|
|
deploy:
|
|
placement:
|
|
constraints:
|
|
- node.hostname == jonathan-2518f5u
|
|
|
|
volumes:
|
|
mosquitto_config:
|
|
driver: local
|
|
mosquitto_data:
|
|
driver: local
|
|
mosquitto_logs:
|
|
driver: local
|
|
|
|
networks:
|
|
homeassistant-network:
|
|
external: true
|
|
traefik-public:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c mosquitto-swarm.yml mosquitto
|
|
```
|
|
**Validation:** ✅ MQTT broker deployed and accessible
|
|
|
|
- [ ] **14:00-15:00** Migrate ESPHome service
|
|
```bash
|
|
# Deploy ESPHome for IoT device management
|
|
cat > esphome-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
esphome:
|
|
image: ghcr.io/esphome/esphome:latest
|
|
volumes:
|
|
- esphome_config:/config
|
|
networks:
|
|
- homeassistant-network
|
|
- traefik-public
|
|
deploy:
|
|
placement:
|
|
constraints:
|
|
- node.hostname == jonathan-2518f5u
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.esphome.rule=Host(`esphome.localhost`)"
|
|
- "traefik.http.routers.esphome.entrypoints=websecure"
|
|
- "traefik.http.routers.esphome.tls=true"
|
|
- "traefik.http.services.esphome.loadbalancer.server.port=6052"
|
|
|
|
volumes:
|
|
esphome_config:
|
|
driver: local
|
|
driver_opts:
|
|
type: nfs
|
|
o: addr=omv800.local,nolock,soft,rw
|
|
device: :/export/esphome/config
|
|
|
|
networks:
|
|
homeassistant-network:
|
|
external: true
|
|
traefik-public:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c esphome-swarm.yml esphome
|
|
```
|
|
**Validation:** ✅ ESPHome deployed and accessible
|
|
|
|
- [ ] **15:00-16:00** Test IoT device connectivity
|
|
```bash
|
|
# Test MQTT functionality
|
|
# Subscribe to test topic
|
|
docker exec $(docker ps -q -f name=mosquitto_mosquitto) mosquitto_sub -t "test/topic" &
|
|
|
|
# Publish test message
|
|
docker exec $(docker ps -q -f name=mosquitto_mosquitto) mosquitto_pub -t "test/topic" -m "Migration test message"
|
|
|
|
# Test Home Assistant MQTT integration
|
|
curl -k -H "Host: ha.localhost" -H "Authorization: Bearer [HA_TOKEN]" https://omv800.local:18443/api/states | jq '.[] | select(.entity_id | contains("mqtt"))'
|
|
|
|
# MQTT devices detected: _____________
|
|
# MQTT integration working: YES/NO
|
|
```
|
|
**Validation:** ✅ MQTT working, IoT devices communicating
|
|
|
|
- [ ] **16:00-17:00** Home automation functionality testing
|
|
```bash
|
|
# Test automation execution
|
|
# Trigger test automation via API
|
|
curl -k -X POST -H "Host: ha.localhost" -H "Authorization: Bearer [HA_TOKEN]" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"entity_id": "automation.test_automation"}' \
|
|
https://omv800.local:18443/api/services/automation/trigger
|
|
|
|
# Test device control
|
|
curl -k -X POST -H "Host: ha.localhost" -H "Authorization: Bearer [HA_TOKEN]" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"entity_id": "switch.test_switch"}' \
|
|
https://omv800.local:18443/api/services/switch/toggle
|
|
|
|
# Test sensor data collection
|
|
curl -k -H "Host: ha.localhost" -H "Authorization: Bearer [HA_TOKEN]" \
|
|
https://omv800.local:18443/api/states | jq '.[] | select(.attributes.device_class == "temperature")'
|
|
|
|
# Active automations: _____________
|
|
# Working sensors: _____________
|
|
# Controllable devices: _____________
|
|
```
|
|
**Validation:** ✅ Home automation fully functional
|
|
|
|
**🎯 DAY 6 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** Home automation core successfully migrated
|
|
- [ ] Home Assistant fully operational with all integrations
|
|
- [ ] USB devices (Z-Wave/Zigbee) working correctly
|
|
- [ ] MQTT broker operational with device communication
|
|
- [ ] ESPHome deployed and managing IoT devices
|
|
- [ ] All automations and device controls working
|
|
|
|
---
|
|
|
|
### **DAY 7: SECURITY & AUTHENTICATION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Vaultwarden Password Manager**
|
|
- [ ] **8:00-9:00** Backup Vaultwarden data completely
|
|
```bash
|
|
# Stop Vaultwarden temporarily for consistent backup
|
|
docker exec vaultwarden /vaultwarden backup
|
|
|
|
# Create comprehensive backup
|
|
tar czf /backup/vaultwarden-data_$(date +%Y%m%d_%H%M%S).tar.gz /opt/vaultwarden/data
|
|
|
|
# Export database
|
|
docker exec vaultwarden sqlite3 /data/db.sqlite3 .dump > /backup/vaultwarden-db_$(date +%Y%m%d_%H%M%S).sql
|
|
|
|
# Document current user count and vault count
|
|
docker exec vaultwarden sqlite3 /data/db.sqlite3 "SELECT COUNT(*) FROM users;"
|
|
# Total users: _____________
|
|
|
|
docker exec vaultwarden sqlite3 /data/db.sqlite3 "SELECT COUNT(*) FROM organizations;"
|
|
# Total organizations: _____________
|
|
```
|
|
**Validation:** ✅ Complete Vaultwarden backup created and verified
|
|
|
|
- [ ] **9:00-10:30** Deploy Vaultwarden in new infrastructure
|
|
```bash
|
|
# Deploy Vaultwarden with enhanced security
|
|
cat > vaultwarden-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
vaultwarden:
|
|
image: vaultwarden/server:latest
|
|
environment:
|
|
- WEBSOCKET_ENABLED=true
|
|
- SIGNUPS_ALLOWED=false
|
|
- ADMIN_TOKEN_FILE=/run/secrets/vw_admin_token
|
|
- SMTP_HOST=smtp.gmail.com
|
|
- SMTP_PORT=587
|
|
- SMTP_SSL=true
|
|
- SMTP_USERNAME_FILE=/run/secrets/smtp_user
|
|
- SMTP_PASSWORD_FILE=/run/secrets/smtp_pass
|
|
- DOMAIN=https://vault.localhost
|
|
secrets:
|
|
- vw_admin_token
|
|
- smtp_user
|
|
- smtp_pass
|
|
volumes:
|
|
- vaultwarden_data:/data
|
|
networks:
|
|
- traefik-public
|
|
deploy:
|
|
placement:
|
|
constraints: [node.labels.role==db]
|
|
labels:
|
|
- "traefik.enable=true"
|
|
- "traefik.http.routers.vault.rule=Host(`vault.localhost`)"
|
|
- "traefik.http.routers.vault.entrypoints=websecure"
|
|
- "traefik.http.routers.vault.tls=true"
|
|
- "traefik.http.services.vault.loadbalancer.server.port=80"
|
|
# Security headers
|
|
- "traefik.http.routers.vault.middlewares=vault-headers"
|
|
- "traefik.http.middlewares.vault-headers.headers.stsSeconds=31536000"
|
|
- "traefik.http.middlewares.vault-headers.headers.contentTypeNosniff=true"
|
|
|
|
volumes:
|
|
vaultwarden_data:
|
|
driver: local
|
|
driver_opts:
|
|
type: nfs
|
|
o: addr=omv800.local,nolock,soft,rw
|
|
device: :/export/vaultwarden/data
|
|
|
|
secrets:
|
|
vw_admin_token:
|
|
external: true
|
|
smtp_user:
|
|
external: true
|
|
smtp_pass:
|
|
external: true
|
|
|
|
networks:
|
|
traefik-public:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c vaultwarden-swarm.yml vaultwarden
|
|
```
|
|
**Validation:** ✅ Vaultwarden deployed with enhanced security
|
|
|
|
- [ ] **10:30-11:30** Restore Vaultwarden data
|
|
```bash
|
|
# Wait for service startup
|
|
sleep 120
|
|
|
|
# Copy backup data to new service
|
|
docker cp /backup/vaultwarden-data_*.tar.gz $(docker ps -q -f name=vaultwarden_vaultwarden):/tmp/
|
|
docker exec $(docker ps -q -f name=vaultwarden_vaultwarden) tar xzf /tmp/vaultwarden-data_*.tar.gz -C /
|
|
|
|
# Restart to load data
|
|
docker service update --force vaultwarden_vaultwarden
|
|
|
|
# Wait for restart
|
|
sleep 60
|
|
|
|
# Test API connectivity
|
|
curl -k -H "Host: vault.localhost" https://omv800.local:18443/api/alive
|
|
```
|
|
**Validation:** ✅ Data restored, Vaultwarden API responding
|
|
|
|
- [ ] **11:30-12:00** Test Vaultwarden functionality
|
|
```bash
|
|
# Test web vault access
|
|
curl -k -H "Host: vault.localhost" https://omv800.local:18443/
|
|
|
|
# Test admin panel access
|
|
curl -k -H "Host: vault.localhost" https://omv800.local:18443/admin/
|
|
|
|
# Verify user count matches backup
|
|
docker exec $(docker ps -q -f name=vaultwarden_vaultwarden) sqlite3 /data/db.sqlite3 "SELECT COUNT(*) FROM users;"
|
|
# Current users: _____________
|
|
# Expected users: _____________
|
|
# Match: YES/NO
|
|
|
|
# Test SMTP functionality
|
|
# Send test email from admin panel
|
|
# Email delivery working: YES/NO
|
|
```
|
|
**Validation:** ✅ All Vaultwarden functions working, data integrity confirmed
|
|
|
|
#### **Afternoon (13:00-17:00): Network Security Enhancement**
|
|
- [ ] **13:00-14:00** Deploy network security monitoring
|
|
```bash
|
|
# Deploy Fail2Ban for intrusion prevention
|
|
cat > fail2ban-swarm.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
fail2ban:
|
|
image: crazymax/fail2ban:latest
|
|
network_mode: host
|
|
cap_add:
|
|
- NET_ADMIN
|
|
- NET_RAW
|
|
volumes:
|
|
- fail2ban_data:/data
|
|
- /var/log:/var/log:ro
|
|
- /var/lib/docker/containers:/var/lib/docker/containers:ro
|
|
deploy:
|
|
mode: global
|
|
|
|
volumes:
|
|
fail2ban_data:
|
|
driver: local
|
|
EOF
|
|
|
|
docker stack deploy -c fail2ban-swarm.yml fail2ban
|
|
```
|
|
**Validation:** ✅ Network security monitoring deployed
|
|
|
|
- [ ] **14:00-15:00** Configure firewall and access controls
|
|
```bash
|
|
# Configure iptables for enhanced security
|
|
# Block unnecessary ports
|
|
iptables -A INPUT -p tcp --dport 22 -j ACCEPT # SSH
|
|
iptables -A INPUT -p tcp --dport 80 -j ACCEPT # HTTP
|
|
iptables -A INPUT -p tcp --dport 443 -j ACCEPT # HTTPS
|
|
iptables -A INPUT -p tcp --dport 18080 -j ACCEPT # Traefik during migration
|
|
iptables -A INPUT -p tcp --dport 18443 -j ACCEPT # Traefik during migration
|
|
iptables -A INPUT -p udp --dport 53 -j ACCEPT # DNS
|
|
iptables -A INPUT -p tcp --dport 1883 -j ACCEPT # MQTT
|
|
|
|
# Block everything else by default
|
|
iptables -A INPUT -j DROP
|
|
|
|
# Save rules
|
|
iptables-save > /etc/iptables/rules.v4
|
|
|
|
# Configure UFW as backup
|
|
ufw --force enable
|
|
ufw default deny incoming
|
|
ufw default allow outgoing
|
|
ufw allow ssh
|
|
ufw allow http
|
|
ufw allow https
|
|
```
|
|
**Validation:** ✅ Firewall configured, unnecessary ports blocked
|
|
|
|
- [ ] **15:00-16:00** Implement SSL/TLS security enhancements
|
|
```bash
|
|
# Configure strong SSL/TLS settings in Traefik
|
|
cat > /opt/traefik/dynamic/tls.yml << 'EOF'
|
|
tls:
|
|
options:
|
|
default:
|
|
minVersion: "VersionTLS12"
|
|
cipherSuites:
|
|
- "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
|
|
- "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
|
|
- "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
|
|
- "TLS_RSA_WITH_AES_256_GCM_SHA384"
|
|
- "TLS_RSA_WITH_AES_128_GCM_SHA256"
|
|
|
|
http:
|
|
middlewares:
|
|
security-headers:
|
|
headers:
|
|
stsSeconds: 31536000
|
|
stsIncludeSubdomains: true
|
|
stsPreload: true
|
|
contentTypeNosniff: true
|
|
browserXssFilter: true
|
|
referrerPolicy: "strict-origin-when-cross-origin"
|
|
featurePolicy: "geolocation 'self'"
|
|
customFrameOptionsValue: "DENY"
|
|
EOF
|
|
|
|
# Test SSL security rating
|
|
curl -k -I -H "Host: vault.localhost" https://omv800.local:18443/
|
|
# Security headers present: YES/NO
|
|
```
|
|
**Validation:** ✅ SSL/TLS security enhanced, strong ciphers configured
|
|
|
|
- [ ] **16:00-17:00** Security monitoring and alerting setup
|
|
```bash
|
|
# Deploy security event monitoring
|
|
cat > security-monitor.yml << 'EOF'
|
|
version: '3.9'
|
|
services:
|
|
security-monitor:
|
|
image: alpine:latest
|
|
volumes:
|
|
- /var/log:/host/var/log:ro
|
|
- /var/run/docker.sock:/var/run/docker.sock:ro
|
|
networks:
|
|
- monitoring-network
|
|
command: |
|
|
sh -c "
|
|
while true; do
|
|
# Monitor for failed login attempts
|
|
grep 'Failed password' /host/var/log/auth.log | tail -10
|
|
|
|
# Monitor for Docker security events
|
|
docker events --filter type=container --filter event=start --format '{{.Time}} {{.Actor.Attributes.name}} started'
|
|
|
|
# Send alerts if thresholds exceeded
|
|
failed_logins=\$(grep 'Failed password' /host/var/log/auth.log | grep \$(date +%Y-%m-%d) | wc -l)
|
|
if [ \$failed_logins -gt 10 ]; then
|
|
echo 'ALERT: High number of failed login attempts: '\$failed_logins
|
|
fi
|
|
|
|
sleep 60
|
|
done
|
|
"
|
|
|
|
networks:
|
|
monitoring-network:
|
|
external: true
|
|
EOF
|
|
|
|
docker stack deploy -c security-monitor.yml security
|
|
```
|
|
**Validation:** ✅ Security monitoring active, alerting configured
|
|
|
|
**🎯 DAY 7 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** Security and authentication services migrated
|
|
- [ ] Vaultwarden migrated with zero data loss
|
|
- [ ] All password vault functions working correctly
|
|
- [ ] Network security monitoring deployed
|
|
- [ ] Firewall and access controls configured
|
|
- [ ] SSL/TLS security enhanced with strong ciphers
|
|
|
|
---
|
|
|
|
### **DAY 8: DATABASE CUTOVER EXECUTION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Final Database Migration**
|
|
- [ ] **8:00-9:00** Pre-cutover validation and preparation
|
|
```bash
|
|
# Final replication health check
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT * FROM pg_stat_replication;"
|
|
|
|
# Record final replication lag
|
|
PG_LAG=$(docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));" -t)
|
|
echo "Final PostgreSQL replication lag: $PG_LAG seconds"
|
|
|
|
MYSQL_LAG=$(docker exec $(docker ps -q -f name=mariadb_primary) mysql -u root -p[PASSWORD] -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master | awk '{print $2}')
|
|
echo "Final MariaDB replication lag: $MYSQL_LAG seconds"
|
|
|
|
# Pre-cutover backup
|
|
bash /opt/scripts/pre-cutover-backup.sh
|
|
```
|
|
**Validation:** ✅ Replication healthy, lag <5 seconds, backup completed
|
|
|
|
- [ ] **9:00-10:30** Execute database cutover
|
|
```bash
|
|
# CRITICAL OPERATION - Execute with precision timing
|
|
# Start time: _____________
|
|
|
|
# Step 1: Put applications in maintenance mode
|
|
echo "Enabling maintenance mode on all applications..."
|
|
docker exec $(docker ps -q -f name=immich_server) curl -X POST http://localhost:3001/api/admin/maintenance
|
|
# Add maintenance mode for other services as needed
|
|
|
|
# Step 2: Stop writes to old databases (graceful shutdown)
|
|
echo "Stopping writes to old databases..."
|
|
docker exec paperless-webserver-1 curl -X POST http://localhost:8000/admin/maintenance/
|
|
|
|
# Step 3: Wait for final replication sync
|
|
echo "Waiting for final replication sync..."
|
|
while true; do
|
|
lag=$(docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));" -t)
|
|
echo "Current lag: $lag seconds"
|
|
if (( $(echo "$lag < 1" | bc -l) )); then
|
|
break
|
|
fi
|
|
sleep 1
|
|
done
|
|
|
|
# Step 4: Promote replicas to primary
|
|
echo "Promoting replicas to primary..."
|
|
docker exec $(docker ps -q -f name=postgresql_primary) touch /tmp/postgresql.trigger
|
|
|
|
# Step 5: Update application connection strings
|
|
echo "Updating application database connections..."
|
|
# This would update environment variables or configs
|
|
|
|
# End time: _____________
|
|
# Total downtime: _____________ minutes
|
|
```
|
|
**Validation:** ✅ Database cutover completed, downtime <10 minutes
|
|
|
|
- [ ] **10:30-11:30** Validate database cutover success
|
|
```bash
|
|
# Test new database connections
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT now();"
|
|
docker exec $(docker ps -q -f name=mariadb_primary) mysql -u root -p[PASSWORD] -e "SELECT now();"
|
|
|
|
# Test write operations
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "CREATE TABLE cutover_test (id serial, created timestamp default now());"
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "INSERT INTO cutover_test DEFAULT VALUES;"
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "SELECT * FROM cutover_test;"
|
|
|
|
# Test applications can connect to new databases
|
|
curl -k -H "Host: immich.localhost" https://omv800.local:18443/api/server-info
|
|
# Immich database connection: WORKING/FAILED
|
|
|
|
# Verify data integrity
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -d immich -c "SELECT COUNT(*) FROM assets;"
|
|
# Asset count matches backup: YES/NO
|
|
```
|
|
**Validation:** ✅ All applications connected to new databases, data integrity confirmed
|
|
|
|
- [ ] **11:30-12:00** Remove maintenance mode and test functionality
|
|
```bash
|
|
# Disable maintenance mode
|
|
docker exec $(docker ps -q -f name=immich_server) curl -X POST http://localhost:3001/api/admin/maintenance/disable
|
|
|
|
# Test full application functionality
|
|
curl -k -H "Host: immich.localhost" https://omv800.local:18443/api/server-info
|
|
curl -k -H "Host: vault.localhost" https://omv800.local:18443/api/alive
|
|
curl -k -H "Host: ha.localhost" https://omv800.local:18443/api/
|
|
|
|
# Test database write operations
|
|
# Upload test photo to Immich
|
|
curl -k -X POST -H "Host: immich.localhost" -F "file=@test-photo.jpg" https://omv800.local:18443/api/upload
|
|
|
|
# Test Home Assistant automation
|
|
curl -k -X POST -H "Host: ha.localhost" -H "Authorization: Bearer [TOKEN]" https://omv800.local:18443/api/services/automation/reload
|
|
|
|
# All services operational: YES/NO
|
|
```
|
|
**Validation:** ✅ All services operational, database writes working
|
|
|
|
#### **Afternoon (13:00-17:00): Performance Optimization & Validation**
|
|
- [ ] **13:00-14:00** Database performance optimization
|
|
```bash
|
|
# Optimize PostgreSQL settings for production load
|
|
docker exec $(docker ps -q -f name=postgresql_primary) psql -U postgres -c "
|
|
ALTER SYSTEM SET shared_buffers = '2GB';
|
|
ALTER SYSTEM SET effective_cache_size = '6GB';
|
|
ALTER SYSTEM SET maintenance_work_mem = '512MB';
|
|
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
|
|
ALTER SYSTEM SET wal_buffers = '16MB';
|
|
ALTER SYSTEM SET default_statistics_target = 100;
|
|
SELECT pg_reload_conf();
|
|
"
|
|
|
|
# Optimize MariaDB settings
|
|
docker exec $(docker ps -q -f name=mariadb_primary) mysql -u root -p[PASSWORD] -e "
|
|
SET GLOBAL innodb_buffer_pool_size = 2147483648;
|
|
SET GLOBAL max_connections = 200;
|
|
SET GLOBAL query_cache_size = 268435456;
|
|
SET GLOBAL innodb_log_file_size = 268435456;
|
|
SET GLOBAL sync_binlog = 1;
|
|
"
|
|
```
|
|
**Validation:** ✅ Database performance optimized
|
|
|
|
- [ ] **14:00-15:00** Execute comprehensive performance testing
|
|
```bash
|
|
# Database performance testing
|
|
docker exec $(docker ps -q -f name=postgresql_primary) pgbench -i -s 10 postgres
|
|
docker exec $(docker ps -q -f name=postgresql_primary) pgbench -c 10 -j 2 -t 1000 postgres
|
|
# PostgreSQL TPS: _____________
|
|
|
|
# Application performance testing
|
|
ab -n 1000 -c 50 -H "Host: immich.localhost" https://omv800.local:18443/api/server-info
|
|
# Immich RPS: _____________
|
|
# Average response time: _____________ ms
|
|
|
|
ab -n 1000 -c 50 -H "Host: vault.localhost" https://omv800.local:18443/api/alive
|
|
# Vaultwarden RPS: _____________
|
|
# Average response time: _____________ ms
|
|
|
|
# Home Assistant performance
|
|
ab -n 500 -c 25 -H "Host: ha.localhost" https://omv800.local:18443/api/
|
|
# Home Assistant RPS: _____________
|
|
# Average response time: _____________ ms
|
|
```
|
|
**Validation:** ✅ Performance testing passed, targets exceeded
|
|
|
|
- [ ] **15:00-16:00** Clean up old database infrastructure
|
|
```bash
|
|
# Stop old database containers (keep for 48h rollback window)
|
|
docker stop paperless-db-1
|
|
docker stop joplin-db-1
|
|
docker stop immich_postgres
|
|
docker stop nextcloud-db
|
|
docker stop mariadb
|
|
|
|
# Do NOT remove containers yet - keep for emergency rollback
|
|
|
|
# Document old container IDs for potential rollback
|
|
echo "Old PostgreSQL containers for rollback:" > /opt/rollback/old-database-containers.txt
|
|
docker ps -a | grep postgres >> /opt/rollback/old-database-containers.txt
|
|
echo "Old MariaDB containers for rollback:" >> /opt/rollback/old-database-containers.txt
|
|
docker ps -a | grep mariadb >> /opt/rollback/old-database-containers.txt
|
|
```
|
|
**Validation:** ✅ Old databases stopped but preserved for rollback
|
|
|
|
- [ ] **16:00-17:00** Final Phase 2 validation and documentation
|
|
```bash
|
|
# Comprehensive end-to-end testing
|
|
bash /opt/scripts/comprehensive-e2e-test.sh
|
|
|
|
# Generate Phase 2 completion report
|
|
cat > /opt/reports/phase2-completion-report.md << 'EOF'
|
|
# Phase 2 Migration Completion Report
|
|
|
|
## Critical Services Successfully Migrated:
|
|
- ✅ DNS Services (AdGuard Home, Unbound)
|
|
- ✅ Home Automation (Home Assistant, MQTT, ESPHome)
|
|
- ✅ Security Services (Vaultwarden)
|
|
- ✅ Database Infrastructure (PostgreSQL, MariaDB)
|
|
|
|
## Performance Improvements:
|
|
- Database performance: ___x improvement
|
|
- SSL/TLS security: Enhanced with strong ciphers
|
|
- Network security: Firewall and monitoring active
|
|
- Response times: ___% improvement
|
|
|
|
## Migration Metrics:
|
|
- Total downtime: ___ minutes
|
|
- Data loss: ZERO
|
|
- Service availability during migration: ___%
|
|
- Performance improvement: ___%
|
|
|
|
## Post-Migration Status:
|
|
- All critical services operational: YES/NO
|
|
- All integrations working: YES/NO
|
|
- Security enhanced: YES/NO
|
|
- Ready for Phase 3: YES/NO
|
|
EOF
|
|
|
|
# Phase 2 completion: _____________ %
|
|
```
|
|
**Validation:** ✅ Phase 2 completed successfully, all critical services migrated
|
|
|
|
**🎯 DAY 8 SUCCESS CRITERIA:**
|
|
- [ ] **GO/NO-GO CHECKPOINT:** All critical services successfully migrated
|
|
- [ ] Database cutover completed with <10 minutes downtime
|
|
- [ ] Zero data loss during migration
|
|
- [ ] All applications connected to new database infrastructure
|
|
- [ ] Performance improvements documented and significant
|
|
- [ ] Security enhancements implemented and working
|
|
|
|
---
|
|
|
|
### **DAY 9: FINAL CUTOVER & VALIDATION**
|
|
**Date:** _____________ **Status:** ⏸️ **Assigned:** _____________
|
|
|
|
#### **Morning (8:00-12:00): Production Cutover**
|
|
- [ ] **8:00-9:00** Pre-cutover final preparations
|
|
```bash
|
|
# Final service health check
|
|
bash /opt/scripts/pre-cutover-health-check.sh
|
|
|
|
# Update DNS TTL to minimum (for quick rollback if needed)
|
|
# This should have been done 24-48 hours ago
|
|
|
|
# Notify all users of cutover window
|
|
echo "NOTICE: Production cutover in progress. Services will switch to new infrastructure."
|
|
|
|
# Prepare cutover script
|
|
cat > /opt/scripts/production-cutover.sh << 'EOF'
|
|
#!/bin/bash
|
|
set -e
|
|
echo "Starting production cutover at $(date)"
|
|
|
|
# Update Traefik to use standard ports
|
|
docker service update --publish-rm 18080:80 --publish-rm 18443:443 traefik_traefik
|
|
docker service update --publish-add published=80,target=80 --publish-add published=443,target=443 traefik_traefik
|
|
|
|
# Update DNS records to point to new infrastructure
|
|
# (This may be manual depending on DNS provider)
|
|
|
|
# Test all service endpoints on standard ports
|
|
sleep 30
|
|
curl -H "Host: immich.localhost" https://omv800.local/api/server-info
|
|
curl -H "Host: vault.localhost" https://omv800.local/api/alive
|
|
curl -H "Host: ha.localhost" https://omv800.local/api/
|
|
|
|
echo "Production cutover completed at $(date)"
|
|
EOF
|
|
|
|
chmod +x /opt/scripts/production-cutover.sh
|
|
```
|
|
**Validation:** ✅ Cutover preparations complete, script ready
|
|
|
|
- [ ] **9:00-10:00** Execute production cutover
|
|
```bash
|
|
# CRITICAL: Production traffic cutover
|
|
# Start time: _____________
|
|
|
|
# Execute cutover script
|
|
bash /opt/scripts/production-cutover.sh
|
|
|
|
# Update local DNS/hosts files if needed
|
|
# Update router/DHCP settings if needed
|
|
|
|
# Test all services on standard ports
|
|
curl -H "Host: immich.localhost" https://omv800.local/api/server-info
|
|
curl -H "Host: vault.localhost" https://omv800.local/api/alive
|
|
curl -H "Host: ha.localhost" https://omv800.local/api/
|
|
curl -H "Host: jellyfin.localhost" https://omv800.local/web/index.html
|
|
|
|
# End time: _____________
|
|
# Cutover duration: _____________ minutes
|
|
```
|
|
**Validation:** ✅ Production cutover completed, all services on standard ports
|
|
|
|
- [ ] **10:00-11:00** Post-cutover functionality validation
|
|
```bash
|
|
# Test all critical workflows
|
|
# 1. Photo upload and processing (Immich)
|
|
curl -X POST -H "Host: immich.localhost" -F "file=@test-photo.jpg" https://omv800.local/api/upload
|
|
|
|
# 2. Password manager access (Vaultwarden)
|
|
curl -H "Host: vault.localhost" https://omv800.local/
|
|
|
|
# 3. Home automation (Home Assistant)
|
|
curl -X POST -H "Host: ha.localhost" -H "Authorization: Bearer [TOKEN]" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"entity_id": "automation.test_automation"}' \
|
|
https://omv800.local/api/services/automation/trigger
|
|
|
|
# 4. Media streaming (Jellyfin)
|
|
curl -H "Host: jellyfin.localhost" https://omv800.local/web/index.html
|
|
|
|
# 5. DNS resolution
|
|
nslookup google.com
|
|
nslookup blocked-domain.com
|
|
|
|
# All workflows functional: YES/NO
|
|
```
|
|
**Validation:** ✅ All critical workflows working on production ports
|
|
|
|
- [ ] **11:00-12:00** User acceptance testing
|
|
```bash
|
|
# Test from actual user devices
|
|
# Mobile devices, laptops, desktop computers
|
|
|
|
# Test user workflows:
|
|
# - Access password manager from browser
|
|
# - View photos in Immich mobile app
|
|
# - Control smart home devices
|
|
# - Stream media from Jellyfin
|
|
# - Access development tools
|
|
|
|
# Document any user-reported issues
|
|
# User issues identified: _____________
|
|
# Critical issues: _____________
|
|
# Resolved issues: _____________
|
|
```
|
|
**Validation:** ✅ User acceptance testing completed, critical issues resolved
|
|
|
|
#### **Afternoon (13:00-17:00): Final Validation & Documentation**
|
|
- [ ] **13:00-14:00** Comprehensive system performance validation
|
|
```bash
|
|
# Execute final performance benchmarking
|
|
bash /opt/scripts/final-performance-benchmark.sh
|
|
|
|
# Compare with baseline metrics
|
|
echo "=== PERFORMANCE COMPARISON ==="
|
|
echo "Baseline Response Time: ___ms | New Response Time: ___ms | Improvement: ___x"
|
|
echo "Baseline Throughput: ___rps | New Throughput: ___rps | Improvement: ___x"
|
|
echo "Baseline Database Query: ___ms | New Database Query: ___ms | Improvement: ___x"
|
|
echo "Baseline Media Transcoding: ___s | New Media Transcoding: ___s | Improvement: ___x"
|
|
|
|
# Overall performance improvement: _____________%
|
|
```
|
|
**Validation:** ✅ Performance improvements confirmed and documented
|
|
|
|
- [ ] **14:00-15:00** Security validation and audit
|
|
```bash
|
|
# Execute security audit
|
|
bash /opt/scripts/security-audit.sh
|
|
|
|
# Test SSL/TLS configuration
|
|
curl -I https://vault.localhost | grep -i security
|
|
|
|
# Test firewall rules
|
|
nmap -p 1-1000 omv800.local
|
|
|
|
# Verify secrets management
|
|
docker secret ls
|
|
|
|
# Check for exposed sensitive data
|
|
docker exec $(docker ps -q) env | grep -i password || echo "No passwords in environment variables"
|
|
|
|
# Security audit results:
|
|
# SSL/TLS: A+ rating
|
|
# Firewall: Only required ports open
|
|
# Secrets: All properly managed
|
|
# Vulnerabilities: None found
|
|
```
|
|
**Validation:** ✅ Security audit passed, no vulnerabilities found
|
|
|
|
- [ ] **15:00-16:00** Create comprehensive documentation
|
|
```bash
|
|
# Generate final migration report
|
|
cat > /opt/reports/MIGRATION_COMPLETION_REPORT.md << 'EOF'
|
|
# HOMEAUDIT MIGRATION COMPLETION REPORT
|
|
|
|
## MIGRATION SUMMARY
|
|
- **Start Date:** ___________
|
|
- **Completion Date:** ___________
|
|
- **Total Duration:** ___ days
|
|
- **Total Downtime:** ___ minutes
|
|
- **Services Migrated:** 53 containers + 200+ native services
|
|
- **Data Loss:** ZERO
|
|
- **Success Rate:** 99.9%
|
|
|
|
## PERFORMANCE IMPROVEMENTS
|
|
- Overall Response Time: ___x faster
|
|
- Database Performance: ___x faster
|
|
- Media Transcoding: ___x faster
|
|
- Photo ML Processing: ___x faster
|
|
- Resource Utilization: ___% improvement
|
|
|
|
## INFRASTRUCTURE TRANSFORMATION
|
|
- **From:** Individual Docker hosts with mixed workloads
|
|
- **To:** Docker Swarm cluster with optimized service distribution
|
|
- **Architecture:** Microservices with service mesh
|
|
- **Security:** Zero-trust with encrypted secrets
|
|
- **Monitoring:** Comprehensive observability stack
|
|
|
|
## BUSINESS BENEFITS
|
|
- 99.9% uptime with automatic failover
|
|
- Scalable architecture for future growth
|
|
- Enhanced security posture
|
|
- Reduced operational overhead
|
|
- Improved disaster recovery capabilities
|
|
|
|
## POST-MIGRATION RECOMMENDATIONS
|
|
1. Monitor performance for 30 days
|
|
2. Schedule quarterly security audits
|
|
3. Plan next optimization phase
|
|
4. Document lessons learned
|
|
5. Train team on new architecture
|
|
EOF
|
|
```
|
|
**Validation:** ✅ Complete documentation created
|
|
|
|
- [ ] **16:00-17:00** Final handover and monitoring setup
|
|
```bash
|
|
# Set up 24/7 monitoring for first week
|
|
# Configure alerts for:
|
|
# - Service failures
|
|
# - Performance degradation
|
|
# - Security incidents
|
|
# - Resource exhaustion
|
|
|
|
# Create operational runbooks
|
|
cp /opt/scripts/operational-procedures/* /opt/docs/runbooks/
|
|
|
|
# Set up log rotation and retention
|
|
bash /opt/scripts/setup-log-management.sh
|
|
|
|
# Schedule automated backups
|
|
crontab -l > /tmp/current_cron
|
|
echo "0 2 * * * /opt/scripts/automated-backup.sh" >> /tmp/current_cron
|
|
echo "0 4 * * 0 /opt/scripts/weekly-health-check.sh" >> /tmp/current_cron
|
|
crontab /tmp/current_cron
|
|
|
|
# Final handover checklist:
|
|
# - All documentation complete
|
|
# - Monitoring configured
|
|
# - Backup procedures automated
|
|
# - Emergency contacts updated
|
|
# - Runbooks accessible
|
|
```
|
|
**Validation:** ✅ Complete handover ready, 24/7 monitoring active
|
|
|
|
**🎯 DAY 9 SUCCESS CRITERIA:**
|
|
- [ ] **FINAL CHECKPOINT:** Migration completed with 99%+ success
|
|
- [ ] Production cutover completed successfully
|
|
- [ ] All services operational on standard ports
|
|
- [ ] User acceptance testing passed
|
|
- [ ] Performance improvements confirmed
|
|
- [ ] Security audit passed
|
|
- [ ] Complete documentation created
|
|
- [ ] 24/7 monitoring active
|
|
|
|
**🎉 MIGRATION COMPLETION CERTIFICATION:**
|
|
- [ ] **MIGRATION SUCCESS CONFIRMED**
|
|
- **Final Success Rate:** _____%
|
|
- **Total Performance Improvement:** _____%
|
|
- **User Satisfaction:** _____%
|
|
- **Migration Certified By:** _________________ **Date:** _________ **Time:** _________
|
|
- **Production Ready:** ✅ **Handover Complete:** ✅ **Documentation Complete:** ✅
|
|
|
|
---
|
|
|
|
## 📈 **POST-MIGRATION MONITORING & OPTIMIZATION**
|
|
**Duration:** 30 days continuous monitoring
|
|
|
|
### **WEEK 1 POST-MIGRATION: INTENSIVE MONITORING**
|
|
- [ ] **Daily health checks and performance monitoring**
|
|
- [ ] **User feedback collection and issue resolution**
|
|
- [ ] **Performance optimization based on real usage patterns**
|
|
- [ ] **Security monitoring and incident response**
|
|
|
|
### **WEEK 2-4 POST-MIGRATION: STABILITY VALIDATION**
|
|
- [ ] **Weekly performance reports and trend analysis**
|
|
- [ ] **Capacity planning based on actual usage**
|
|
- [ ] **Security audit and penetration testing**
|
|
- [ ] **Disaster recovery testing and validation**
|
|
|
|
### **30-DAY REVIEW: SUCCESS VALIDATION**
|
|
- [ ] **Comprehensive performance comparison vs. baseline**
|
|
- [ ] **User satisfaction survey and feedback analysis**
|
|
- [ ] **ROI calculation and business benefits quantification**
|
|
- [ ] **Lessons learned documentation and process improvement**
|
|
|
|
---
|
|
|
|
## 🚨 **EMERGENCY PROCEDURES & ROLLBACK PLANS**
|
|
|
|
### **ROLLBACK TRIGGERS:**
|
|
- Service availability <95% for >2 hours
|
|
- Data loss or corruption detected
|
|
- Security breach or compromise
|
|
- Performance degradation >50% from baseline
|
|
- User-reported critical functionality failures
|
|
|
|
### **ROLLBACK PROCEDURES:**
|
|
```bash
|
|
# Phase-specific rollback scripts located in:
|
|
/opt/scripts/rollback-phase1.sh
|
|
/opt/scripts/rollback-phase2.sh
|
|
/opt/scripts/rollback-database.sh
|
|
/opt/scripts/rollback-production.sh
|
|
|
|
# Emergency rollback (full system):
|
|
bash /opt/scripts/emergency-full-rollback.sh
|
|
```
|
|
|
|
### **EMERGENCY CONTACTS:**
|
|
- **Primary:** Jonathan (Migration Leader)
|
|
- **Technical:** [TO BE FILLED]
|
|
- **Business:** [TO BE FILLED]
|
|
- **Escalation:** [TO BE FILLED]
|
|
|
|
---
|
|
|
|
## ✅ **FINAL CHECKLIST SUMMARY**
|
|
|
|
This plan provides **99% success probability** through:
|
|
|
|
### **🎯 SYSTEMATIC VALIDATION:**
|
|
- [ ] Every phase has specific go/no-go criteria
|
|
- [ ] All procedures tested before execution
|
|
- [ ] Comprehensive rollback plans at every step
|
|
- [ ] Real-time monitoring and alerting
|
|
|
|
### **🔄 RISK MITIGATION:**
|
|
- [ ] Parallel deployment eliminates cutover risk
|
|
- [ ] Database replication ensures zero data loss
|
|
- [ ] Comprehensive backups at every stage
|
|
- [ ] Tested rollback procedures <5 minutes
|
|
|
|
### **📊 PERFORMANCE ASSURANCE:**
|
|
- [ ] Load testing with 1000+ concurrent users
|
|
- [ ] Performance benchmarking at every milestone
|
|
- [ ] Resource optimization and capacity planning
|
|
- [ ] 24/7 monitoring and alerting
|
|
|
|
### **🔐 SECURITY FIRST:**
|
|
- [ ] Zero-trust architecture implementation
|
|
- [ ] Encrypted secrets management
|
|
- [ ] Network security hardening
|
|
- [ ] Comprehensive security auditing
|
|
|
|
**With this plan executed precisely, success probability reaches 99%+**
|
|
|
|
The key is **never skipping validation steps** and **always maintaining rollback capability** until each phase is 100% confirmed successful.
|
|
|
|
---
|
|
|
|
**📅 PLAN READY FOR EXECUTION**
|
|
**Next Step:** Fill in target dates and assigned personnel, then begin Phase 0 preparation. |