Traefik Production Deployment Guide
Overview
This guide provides comprehensive instructions for deploying Traefik v3.1 in production with full authentication, monitoring, and security features on Docker Swarm with SELinux enforcement.
Architecture Components
Core Services
- Traefik v3.1: Load balancer and reverse proxy with authentication
- Prometheus: Metrics collection and alerting
- Grafana: Monitoring dashboards and visualization
- AlertManager: Alert routing and notification management
- Loki + Promtail: Log aggregation and analysis
Security Features
- ✅ Basic authentication with bcrypt hashing
- ✅ TLS/SSL termination with automatic certificates
- ✅ Security headers (HSTS, XSS protection, etc.)
- ✅ Rate limiting and DDoS protection
- ✅ SELinux policy compliance
- ✅ Prometheus metrics for security monitoring
Prerequisites
System Requirements
- Docker Swarm cluster (single manager minimum)
- SELinux enabled (Fedora/RHEL/CentOS)
- Minimum 4GB RAM, 20GB disk space
- Network ports: 80, 443, 8080, 9090, 3000
Directory Structure
# Traefik state: ACME certificates and logs
sudo mkdir -p /opt/traefik/{letsencrypt,logs}
# Monitoring state: one data/config pair per component
sudo mkdir -p /opt/monitoring/{prometheus/{data,config},grafana/{data,config}}
sudo mkdir -p /opt/monitoring/{alertmanager/{data,config},loki/data,promtail/config}
# Grafana's container user is UID 472 (the same owner applied again in Step 3)
sudo chown -R 472:472 /opt/monitoring/grafana
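Before moving on, it can help to confirm that the layout exists. The following is a minimal sketch, not part of the original tooling: the function name missing_dirs is hypothetical, and the directory list is an assumption derived from the mkdir commands above.

```shell
#!/usr/bin/env bash
# Print every expected directory that is missing under a base path.
# The base path defaults to /opt; pass another path to check a staging copy.
missing_dirs() {
  local base="${1:-/opt}" d
  for d in \
    traefik/letsencrypt traefik/logs \
    monitoring/prometheus/data monitoring/prometheus/config \
    monitoring/grafana/data monitoring/grafana/config \
    monitoring/alertmanager/data monitoring/alertmanager/config \
    monitoring/loki/data monitoring/promtail/config; do
    [ -d "$base/$d" ] || printf '%s\n' "$base/$d"
  done
}

# Example: missing_dirs /opt
```

An empty result means every directory in the list is present.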
Installation Steps
Step 1: SELinux Policy Configuration
# Install SELinux development tools
sudo dnf install -y selinux-policy-devel
# Install custom SELinux policy
cd /home/jonathan/Coding/HomeAudit/selinux
./install_selinux_policy.sh
Step 2: Docker Swarm Network Setup
# Create overlay network
docker network create --driver overlay --attachable traefik-public
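Running the create command a second time fails because the network already exists. A small wrapper makes the step safe to repeat. This is a sketch under the assumption that the network name stays traefik-public; the function name ensure_overlay_network is hypothetical.

```shell
#!/usr/bin/env bash
# Create the overlay network only when it is not already present.
ensure_overlay_network() {
  local name="${1:-traefik-public}"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create --driver overlay --attachable "$name" >/dev/null
    echo "network $name created"
  fi
}

# Example: ensure_overlay_network traefik-public
```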
Step 3: Configuration Deployment
# Copy monitoring configurations
sudo cp configs/monitoring/prometheus.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/traefik_rules.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/alertmanager.yml /opt/monitoring/alertmanager/config/
# Set proper permissions
sudo chown -R 65534:65534 /opt/monitoring/prometheus
sudo chown -R 472:472 /opt/monitoring/grafana
Step 4: Environment Variables
Create /opt/traefik/.env:
DOMAIN=yourdomain.com
EMAIL=admin@yourdomain.com
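A quick check that both values are present avoids deploying with an empty domain. The sketch below assumes the file holds plain KEY=VALUE lines as shown above; the function name validate_env_file is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 when DOMAIN and EMAIL are both set to non-empty values in the file.
validate_env_file() {
  local file="$1" key missing=0
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  for key in DOMAIN EMAIL; do
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: validate_env_file /opt/traefik/.env
```

Note that docker stack deploy does not read a .env file on its own, which is why Step 5 exports DOMAIN explicitly. One way to export everything in the file for the current shell is: set -a; . /opt/traefik/.env; set +a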
Step 5: Deploy Services
# Deploy Traefik
export DOMAIN=yourdomain.com
docker stack deploy -c stacks/core/traefik-production.yml traefik
# Deploy monitoring stack
docker stack deploy -c stacks/monitoring/traefik-monitoring.yml monitoring
Configuration Details
Authentication Credentials
- Username: admin
- Password: secure_password_2024 (bcrypt hash included)
- Change in production: Generate new hash with htpasswd -nbB admin newpassword
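If the generated hash is pasted into a stack file label, each dollar sign usually has to be doubled so that Compose variable interpolation does not consume it. The sketch below assumes the hash is stored that way (the stack file itself is not shown in this guide); the function name escape_for_compose is hypothetical.

```shell
#!/usr/bin/env bash
# Double every "$" read from stdin so the hash survives variable interpolation.
escape_for_compose() {
  sed -e 's/\$/$$/g'
}

# Example (htpasswd ships with httpd-tools on Fedora/RHEL):
#   htpasswd -nbB admin 'newpassword' | escape_for_compose
```

The escaping is not needed when the hash is read from a file or a Docker secret instead of an inline label.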
SSL/TLS Configuration
- Automatic Let's Encrypt certificates
- HTTPS redirect for all HTTP traffic
- HSTS headers with 2-year max-age
- Secure cipher suites only
Monitoring Access Points
- Traefik Dashboard: https://traefik.yourdomain.com/dashboard/
- Prometheus: https://prometheus.yourdomain.com
- Grafana: https://grafana.yourdomain.com
- AlertManager: https://alertmanager.yourdomain.com
Security Monitoring
Key Metrics Monitored
- Authentication Failures: Rate of 401/403 responses
- Brute Force Attacks: High-frequency auth failures
- Service Availability: Backend health status
- Response Times: 95th percentile latency
- Error Rates: 5xx error percentage
- Certificate Expiration: TLS cert validity
- Rate Limiting: 429 response frequency
Alert Thresholds
- Critical: >50 auth failures/second = Possible brute force
- Warning: >10 auth failures/minute = High failure rate
- Critical: Service backend down >1 minute
- Warning: 95th percentile response time >2 seconds
- Warning: Error rate >10% for 5 minutes
- Warning: TLS certificate expires <7 days
- Critical: TLS certificate expired
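The two authentication thresholds above use different units (per second and per minute), which is easy to misread. As an illustration only, the sketch below maps a rate given in failures per second onto those two levels; the function name auth_failure_severity is hypothetical and is not part of the alert rules shipped with the deployment.

```shell
#!/usr/bin/env bash
# Classify an auth-failure rate (failures per second) against the listed thresholds.
auth_failure_severity() {
  awk -v r="$1" 'BEGIN {
    if (r > 50)           print "critical"   # more than 50 failures per second
    else if (r * 60 > 10) print "warning"    # more than 10 failures per minute
    else                  print "ok"
  }'
}

# Example: auth_failure_severity 0.5
```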
Production Checklist
Pre-Deployment
- SELinux policy installed and tested
- Docker Swarm initialized and nodes joined
- Directory structure created with correct permissions
- Environment variables configured
- DNS records pointing to Swarm manager
- Firewall rules configured for ports 80, 443, 8080
Post-Deployment Verification
- Traefik dashboard accessible with authentication
- HTTPS redirects working correctly
- Security headers present in responses
- Prometheus collecting Traefik metrics
- Grafana dashboards displaying data
- AlertManager receiving and routing alerts
- Log aggregation working in Loki
- Certificate auto-renewal configured
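The header check in this list can be scripted. The sketch below reads response headers from stdin and prints the expected ones that are absent. The header names are an assumption based on the features listed earlier (HSTS, XSS protection); adjust them to whatever the headers middleware actually sets. The function name missing_security_headers is hypothetical.

```shell
#!/usr/bin/env bash
# Print each expected security header that does not appear in the headers on stdin.
missing_security_headers() {
  local headers h
  headers=$(tr -d '\r')   # strip carriage returns from HTTP line endings
  for h in strict-transport-security x-content-type-options x-xss-protection; do
    printf '%s\n' "$headers" | grep -qi "^${h}:" || printf '%s\n' "$h"
  done
}

# Example:
#   curl -sI https://traefik.yourdomain.com/dashboard/ | missing_security_headers
```

No output means all three headers were found in the response.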
Security Validation
- Authentication required for all admin interfaces
- TLS certificates valid and auto-renewing
- Security headers (HSTS, XSS protection) enabled
- Rate limiting functional
- Monitoring alerts triggering correctly
- SELinux in enforcing mode without denials
Maintenance Operations
Certificate Management
# Check certificate status
docker exec $(docker ps -q -f name=traefik) ls -la /letsencrypt/acme.json
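# Force certificate re-issuance (last resort: this discards every stored
# certificate, and repeated re-issuance can run into Let's Encrypt rate limits)
docker exec $(docker ps -q -f name=traefik) rm /letsencrypt/acme.json
docker service update --force traefik_traefik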
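Listing acme.json shows that the file exists but not when the served certificate expires. The sketch below computes the remaining days from the notAfter date that openssl prints. It assumes GNU date (for the -d option); the function name days_until_expiry is hypothetical.

```shell
#!/usr/bin/env bash
# Whole days between "now" (epoch seconds, defaults to the current time)
# and an openssl-style notAfter date such as "Jan 11 00:00:00 2025 GMT".
days_until_expiry() {
  local not_after="$1" now="${2:-$(date +%s)}" end
  end=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (end - now) / 86400 ))
}

# Example:
#   not_after=$(echo | openssl s_client -connect traefik.yourdomain.com:443 \
#     -servername traefik.yourdomain.com 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2)
#   days_until_expiry "$not_after"
```

A result below 7 corresponds to the certificate expiry warning in the alert thresholds.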
Log Management
# Rotate Traefik logs
sudo logrotate -f /etc/logrotate.d/traefik
# Check log sizes
du -sh /opt/traefik/logs/*
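The rotate command above expects a logrotate configuration at /etc/logrotate.d/traefik, which this guide does not show. The sketch below writes one possible version, assuming daily rotation with seven files kept (matching the scaling recommendations) and assuming Traefik reopens its log files when it receives USR1. The function name write_traefik_logrotate is hypothetical.

```shell
#!/usr/bin/env bash
# Write an example logrotate configuration for the Traefik log directory.
write_traefik_logrotate() {
  local dest="${1:-/etc/logrotate.d/traefik}"
  cat > "$dest" <<'EOF'
/opt/traefik/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        docker kill --signal=USR1 $(docker ps -q -f name=traefik) >/dev/null 2>&1 || true
    endscript
}
EOF
}

# Example (needs root to write under /etc): write_traefik_logrotate
```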
Monitoring Maintenance
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'
# Grafana backup
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /opt/monitoring/grafana/data
Troubleshooting
Common Issues
SELinux Permission Denied
# Check for denials
sudo ausearch -m avc -ts recent | grep traefik
# Temporarily switch to permissive mode to test (restore with: sudo setenforce 1)
sudo setenforce 0
# Re-install policy if needed
cd selinux && ./install_selinux_policy.sh
Authentication Not Working
# Check service labels
docker service inspect traefik_traefik | jq '.[0].Spec.Labels'
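# Verify a password against the bcrypt hash (htpasswd prompts for the password)
printf '%s\n' 'admin:$2y$10$xvzBkbKKvRX.jGG6F7L.ReEMyEx.7BkqNGQO2rFt/1aBgx8jPElXW' > /tmp/traefik.htpasswd
htpasswd -v /tmp/traefik.htpasswd admin
rm -f /tmp/traefik.htpasswd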
Certificate Issues
# Check ACME log
docker service logs traefik_traefik | grep -i acme
# Verify DNS resolution
nslookup yourdomain.com
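# Check that the Let's Encrypt ACME endpoint is reachable
# (this confirms connectivity only; it does not report rate-limit status)
curl -I https://acme-v02.api.letsencrypt.org/directory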
Health Checks
# Traefik API health
curl -f http://localhost:8080/ping
# Service discovery
curl -s http://localhost:8080/api/http/services | jq '.'
# Prometheus metrics
curl -s http://localhost:8080/metrics | grep traefik_
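Right after a deploy or a forced update, the ping endpoint may not answer for a few seconds. A retry loop avoids false alarms. This is a minimal sketch; the function name wait_for_url and its default of 10 attempts with a 3-second pause are assumptions.

```shell
#!/usr/bin/env bash
# Poll a URL until it answers successfully or the attempts run out.
wait_for_url() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i
  for (( i = 1; i <= attempts; i++ )); do
    if curl -fsS -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}

# Example: wait_for_url http://localhost:8080/ping
```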
Performance Tuning
Resource Limits
- Traefik: 1 CPU, 512MB RAM
- Prometheus: 1 CPU, 1GB RAM
- Grafana: 0.5 CPU, 512MB RAM
- AlertManager: 0.2 CPU, 256MB RAM
Scaling Recommendations
- Single Traefik instance per manager node
- Prometheus data retention: 30 days
- Log rotation: Daily, keep 7 days
- Monitoring scrape interval: 15 seconds
Backup Strategy
Critical Data
- /opt/traefik/letsencrypt/: TLS certificates
- /opt/monitoring/prometheus/data/: Metrics data
- /opt/monitoring/grafana/data/: Dashboards and config
- /opt/monitoring/alertmanager/config/: Alert rules
Backup Script
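#!/bin/bash
# Stop on the first error so a failed archive is not mistaken for a good backup
set -euo pipefail
# One dated directory per run, e.g. /backup/traefik-20240131
BACKUP_DIR="/backup/traefik-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Archive Traefik state (certificates, logs) and all monitoring data and config
tar -czf "$BACKUP_DIR/traefik-config.tar.gz" /opt/traefik/
tar -czf "$BACKUP_DIR/monitoring-config.tar.gz" /opt/monitoring/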
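A backup is only useful if the archives can be read back. The sketch below lists each archive to confirm it is a valid gzip tarball; it does not compare contents against the source directories. The function name verify_backups is hypothetical.

```shell
#!/usr/bin/env bash
# Confirm each .tar.gz in a backup directory can be listed end to end.
verify_backups() {
  local dir="$1" f status=0
  for f in "$dir"/*.tar.gz; do
    [ -e "$f" ] || { echo "no archives in $dir" >&2; return 1; }
    if tar -tzf "$f" >/dev/null 2>&1; then
      echo "OK   $f"
    else
      echo "FAIL $f"
      status=1
    fi
  done
  return "$status"
}

# Example: verify_backups "/backup/traefik-$(date +%Y%m%d)"
```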
Support and Documentation
Log Locations
- Traefik Logs: /opt/traefik/logs/
- Access Logs: /opt/traefik/logs/access.log
- Service Logs: docker service logs traefik_traefik
Monitoring Queries
# Authentication failure rate
rate(traefik_service_requests_total{code=~"401|403"}[5m])
# Service availability
up{job="traefik"}
# Response time 95th percentile
histogram_quantile(0.95, rate(traefik_service_request_duration_seconds_bucket[5m]))
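These queries can also be run from a shell through the Prometheus HTTP API instead of the web UI. The sketch below assumes Prometheus is reachable at http://localhost:9090, as in the maintenance commands earlier; the function name prom_query is hypothetical.

```shell
#!/usr/bin/env bash
# Run an instant PromQL query; curl URL-encodes the expression.
prom_query() {
  local query="$1" base="${2:-http://localhost:9090}"
  curl -sG "$base/api/v1/query" --data-urlencode "query=$query"
}

# Example:
#   prom_query 'up{job="traefik"}' | jq '.data.result'
```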
This deployment provides enterprise-grade Traefik configuration with comprehensive security, monitoring, and operational capabilities.