COMPREHENSIVE CHANGES:

INFRASTRUCTURE MIGRATION:
- Migrated services to Docker Swarm on OMV800 (192.168.50.229)
- Deployed PostgreSQL database for Vaultwarden migration
- Updated all stack configurations for Docker Swarm compatibility
- Added comprehensive monitoring stack (Prometheus, Grafana, Blackbox)
- Implemented proper secret management for all services

VAULTWARDEN POSTGRESQL MIGRATION:
- Attempted migration from SQLite to PostgreSQL for NFS compatibility
- Created PostgreSQL stack with proper user/password configuration
- Built custom Vaultwarden image with PostgreSQL support
- Troubleshot persistent SQLite fallback issue despite PostgreSQL config
- Identified known issue where Vaultwarden silently falls back to SQLite
- Added ENABLE_DB_WAL=false to prevent filesystem compatibility issues
- Current status: Old Vaultwarden on lenovo410 still working, new one has config issues

PAPERLESS SERVICES:
- Successfully deployed Paperless-NGX and Paperless-AI on OMV800
- Both services running on ports 8000 and 3000 respectively
- Caddy configuration updated for external access
- Services accessible via paperless.pressmess.duckdns.org and paperless-ai.pressmess.duckdns.org

CADDY CONFIGURATION:
- Updated Caddyfile on Surface (192.168.50.254) for new service locations
- Fixed Vaultwarden reverse proxy to point to new Docker Swarm service
- Removed old notification hub reference that was causing conflicts
- All services properly configured for external access via DuckDNS

BACKUP AND DISCOVERY:
- Created comprehensive backup system for all hosts
- Generated detailed discovery reports for infrastructure analysis
- Implemented automated backup validation scripts
- Created migration progress tracking and verification reports

MONITORING STACK:
- Deployed Prometheus, Grafana, and Blackbox monitoring
- Created infrastructure and system overview dashboards
- Added proper service discovery and alerting configuration
- Implemented performance monitoring for all critical services

DOCUMENTATION:
- Reorganized documentation into logical structure
- Created comprehensive migration playbook and troubleshooting guides
- Added hardware specifications and optimization recommendations
- Documented all configuration changes and service dependencies

CURRENT STATUS:
- Paperless services: ✅ Working and accessible externally
- Vaultwarden: ❌ PostgreSQL configuration issues, old instance still working
- Monitoring: ✅ Deployed and operational
- Caddy: ✅ Updated and working for external access
- PostgreSQL: ✅ Database running, connection issues with Vaultwarden

NEXT STEPS:
- Continue troubleshooting Vaultwarden PostgreSQL configuration
- Consider alternative approaches for Vaultwarden migration
- Validate all external service access
- Complete final migration validation

TECHNICAL NOTES:
- Used Docker Swarm for orchestration on OMV800
- Implemented proper secret management for sensitive data
- Added comprehensive logging and monitoring
- Created automated backup and validation scripts

# Traefik Production Deployment Guide

## Overview

This guide describes how to deploy Traefik v3.1 in production on Docker Swarm with SELinux enforcement, with authentication, monitoring, and security features enabled.

## Architecture Components

### Core Services
- **Traefik v3.1**: Load balancer and reverse proxy with authentication
- **Prometheus**: Metrics collection and alerting
- **Grafana**: Monitoring dashboards and visualization
- **AlertManager**: Alert routing and notification management
- **Loki + Promtail**: Log aggregation and analysis

### Security Features
- ✅ Basic authentication with bcrypt hashing
- ✅ TLS/SSL termination with automatic certificates
- ✅ Security headers (HSTS, XSS protection, etc.)
- ✅ Rate limiting and DDoS protection
- ✅ SELinux policy compliance
- ✅ Prometheus metrics for security monitoring

## Prerequisites

### System Requirements
- Docker Swarm cluster (single manager minimum)
- SELinux enabled (Fedora/RHEL/CentOS)
- Minimum 4GB RAM, 20GB disk space
- Network ports: 80, 443, 8080, 9090, 3000

### Directory Structure
```bash
sudo mkdir -p /opt/{traefik,monitoring}/{letsencrypt,logs,prometheus,grafana,alertmanager,loki}
sudo mkdir -p /opt/monitoring/{prometheus/{data,config},grafana/{data,config}}
sudo mkdir -p /opt/monitoring/{alertmanager/{data,config},loki/data,promtail/config}
# UID/GID 472 is the user of the official Grafana image; Step 3 sets the same ownership
sudo chown -R 472:472 /opt/monitoring/grafana
```

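
Before moving on, it can help to confirm that the directories the later steps rely on actually exist. The following is a minimal sketch, not part of the original tooling: `missing_dirs` is a hypothetical helper name, and the list it checks is assumed from the paths used elsewhere in this guide.

```bash
# List every expected directory under the given roots that is missing.
# Arguments: Traefik root, monitoring root (defaults match this guide).
# NOTE: helper name and directory list are assumptions for illustration.
missing_dirs() {
  local traefik_root="${1:-/opt/traefik}" monitoring_root="${2:-/opt/monitoring}"
  local dir
  for dir in \
    "$traefik_root/letsencrypt" "$traefik_root/logs" \
    "$monitoring_root/prometheus/data" "$monitoring_root/prometheus/config" \
    "$monitoring_root/grafana/data" "$monitoring_root/grafana/config" \
    "$monitoring_root/alertmanager/data" "$monitoring_root/alertmanager/config" \
    "$monitoring_root/loki/data" "$monitoring_root/promtail/config"; do
    [ -d "$dir" ] || printf '%s\n' "$dir"
  done
}
```

Running `missing_dirs` with no arguments prints nothing when the layout is complete, and one path per line otherwise.
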
## Installation Steps

### Step 1: SELinux Policy Configuration

```bash
# Install SELinux development tools
sudo dnf install -y selinux-policy-devel

# Install custom SELinux policy
cd /home/jonathan/Coding/HomeAudit/selinux
./install_selinux_policy.sh
```

### Step 2: Docker Swarm Network Setup

```bash
# Create overlay network
docker network create --driver overlay --attachable traefik-public
```

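
`docker network create` fails if the network already exists, which makes the step awkward to re-run. One possible way to make it repeatable is sketched below; `ensure_overlay_network` is a hypothetical helper name, and the sketch assumes the network name used above.

```bash
# Create the overlay network only if it does not exist yet.
# NOTE: helper name is an assumption for illustration.
ensure_overlay_network() {
  local name="${1:-traefik-public}"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create --driver overlay --attachable "$name" >/dev/null &&
      echo "network $name created"
  fi
}
```

Calling `ensure_overlay_network` on a Swarm manager then has the same effect as the command above on a first run and is a no-op afterwards.
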
### Step 3: Configuration Deployment

```bash
# Copy monitoring configurations
sudo cp configs/monitoring/prometheus.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/traefik_rules.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/alertmanager.yml /opt/monitoring/alertmanager/config/

# Set proper permissions
sudo chown -R 65534:65534 /opt/monitoring/prometheus
sudo chown -R 472:472 /opt/monitoring/grafana
```

### Step 4: Environment Variables

Create `/opt/traefik/.env`:
```bash
DOMAIN=yourdomain.com
EMAIL=admin@yourdomain.com
```

### Step 5: Deploy Services

```bash
# Deploy Traefik
export DOMAIN=yourdomain.com
docker stack deploy -c stacks/core/traefik-production.yml traefik

# Deploy monitoring stack
docker stack deploy -c stacks/monitoring/traefik-monitoring.yml monitoring
```

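
The deploy commands above export `DOMAIN` by hand because `docker stack deploy` substitutes variables from the calling shell's environment. A possible alternative is to load the file from Step 4 instead, so both values come from one place. This is a sketch under that assumption; `load_env` is a hypothetical helper name.

```bash
# Export every KEY=value pair from an env file into the current shell.
# The file is sourced, so it must contain only simple assignments.
# NOTE: helper name is an assumption for illustration.
load_env() {
  local env_file="${1:-/opt/traefik/.env}"
  [ -r "$env_file" ] || { echo "cannot read $env_file" >&2; return 1; }
  set -a            # export all variables assigned from here on
  . "$env_file"
  set +a
}
```

Usage would then look like `load_env && docker stack deploy -c stacks/core/traefik-production.yml traefik`.
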
## Configuration Details

### Authentication Credentials
- **Username**: `admin`
- **Password**: `secure_password_2024` (default; bcrypt hash included)
- **Change in production**: Generate a new hash with `htpasswd -nbB admin newpassword`

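
A bcrypt hash contains `$` characters, and a stack file passed to `docker stack deploy` treats `$` as the start of a variable reference unless it is written as `$$`. If the hash is placed inline in the stack file (an assumption; the stack file itself is not shown here), it needs escaping first. A minimal sketch, with `escape_for_compose` as a hypothetical helper name:

```bash
# Double every "$" so the stack file does not treat parts of the hash as variables.
# NOTE: helper name is an assumption for illustration.
escape_for_compose() {
  printf '%s\n' "$1" | sed 's/\$/$$/g'
}
```

For example, `escape_for_compose "$(htpasswd -nbB admin 'newpassword')"` prints a `user:hash` pair that can be pasted into a basic-auth users label.
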
### SSL/TLS Configuration
- Automatic Let's Encrypt certificates
- HTTPS redirect for all HTTP traffic
- HSTS headers with 2-year max-age
- Secure cipher suites only

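
The 2-year HSTS max-age is expressed in seconds when configured. The sketch below works the arithmetic and prints it as a label for Traefik's headers middleware, whose `stsSeconds` option carries this value. The middleware name `secure-headers` and the helper name `hsts_label` are assumptions, not names taken from the stack files.

```bash
# Print the Traefik label that sets an HSTS max-age of N years (365-day years).
# NOTE: helper and middleware names are assumptions for illustration.
hsts_label() {
  local middleware="$1" years="${2:-2}"
  local seconds=$(( years * 365 * 24 * 60 * 60 ))
  printf 'traefik.http.middlewares.%s.headers.stsSeconds=%d\n' "$middleware" "$seconds"
}

hsts_label secure-headers 2
# prints traefik.http.middlewares.secure-headers.headers.stsSeconds=63072000
```
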
### Monitoring Access Points
- **Traefik Dashboard**: `https://traefik.yourdomain.com/dashboard/`
- **Prometheus**: `https://prometheus.yourdomain.com`
- **Grafana**: `https://grafana.yourdomain.com`
- **AlertManager**: `https://alertmanager.yourdomain.com`

## Security Monitoring

### Key Metrics Monitored
1. **Authentication Failures**: Rate of 401/403 responses
2. **Brute Force Attacks**: High-frequency auth failures
3. **Service Availability**: Backend health status
4. **Response Times**: 95th percentile latency
5. **Error Rates**: 5xx error percentage
6. **Certificate Expiration**: TLS cert validity
7. **Rate Limiting**: 429 response frequency

### Alert Thresholds
- **Critical**: >50 auth failures/second = Possible brute force
- **Warning**: >10 auth failures/minute = High failure rate
- **Critical**: Service backend down >1 minute
- **Warning**: 95th percentile response time >2 seconds
- **Warning**: Error rate >10% for 5 minutes
- **Warning**: TLS certificate expires <7 days
- **Critical**: TLS certificate expired

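
The two certificate thresholds above can be reproduced outside Prometheus for a quick manual check. The following is only an illustration of the threshold logic, not the alerting rules themselves; `cert_status` is a hypothetical helper that takes the expiry time and the current time as Unix epoch seconds.

```bash
# Classify a certificate by its expiry time.
# Arguments: expiry and current time, both as Unix epoch seconds.
# NOTE: helper name is an assumption for illustration.
cert_status() {
  local expires="$1" now="${2:-$(date +%s)}"
  local remaining=$(( expires - now ))
  if   [ "$remaining" -le 0 ];      then echo critical   # already expired
  elif [ "$remaining" -lt 604800 ]; then echo warning    # under 7 days (604800 s)
  else echo ok
  fi
}
```

The expiry timestamp has to be obtained separately, for instance from the `notAfter` date reported by `openssl x509 -noout -enddate`.
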
## Production Checklist

### Pre-Deployment
- [ ] SELinux policy installed and tested
- [ ] Docker Swarm initialized and nodes joined
- [ ] Directory structure created with correct permissions
- [ ] Environment variables configured
- [ ] DNS records pointing to Swarm manager
- [ ] Firewall rules configured for ports 80, 443, 8080

### Post-Deployment Verification
- [ ] Traefik dashboard accessible with authentication
- [ ] HTTPS redirects working correctly
- [ ] Security headers present in responses
- [ ] Prometheus collecting Traefik metrics
- [ ] Grafana dashboards displaying data
- [ ] AlertManager receiving and routing alerts
- [ ] Log aggregation working in Loki
- [ ] Certificate auto-renewal configured

### Security Validation
- [ ] Authentication required for all admin interfaces
- [ ] TLS certificates valid and auto-renewing
- [ ] Security headers (HSTS, XSS protection) enabled
- [ ] Rate limiting functional
- [ ] Monitoring alerts triggering correctly
- [ ] SELinux in enforcing mode without denials

## Maintenance Operations

### Certificate Management
```bash
# Check certificate status
docker exec $(docker ps -q -f name=traefik) ls -la /letsencrypt/acme.json

# Force certificate renewal (if needed)
docker exec $(docker ps -q -f name=traefik) rm /letsencrypt/acme.json
docker service update --force traefik_traefik
```

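
Traefik refuses to use an ACME storage file whose permissions are more open than `600`, so the mode is worth checking alongside the listing above. A minimal sketch for the host-side copy, assuming the container's `/letsencrypt` is backed by `/opt/traefik/letsencrypt` and that GNU `stat` is available; `check_acme_mode` is a hypothetical helper name.

```bash
# Report whether an ACME storage file has the 600 mode Traefik expects.
# NOTE: helper name and default path are assumptions for illustration.
check_acme_mode() {
  local file="${1:-/opt/traefik/letsencrypt/acme.json}"
  local mode
  mode=$(stat -c '%a' "$file") || return 1   # GNU stat, as shipped on Fedora/RHEL
  if [ "$mode" = "600" ]; then
    echo "ok: $file is 600"
  else
    echo "fix needed: $file is $mode, run chmod 600 $file"
    return 1
  fi
}
```
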
### Log Management
```bash
# Rotate Traefik logs
sudo logrotate -f /etc/logrotate.d/traefik

# Check log sizes
du -sh /opt/traefik/logs/*
```

### Monitoring Maintenance
```bash
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'

# Grafana backup
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /opt/monitoring/grafana/data
```

## Troubleshooting

### Common Issues

**SELinux Permission Denied**
```bash
# Check for denials
sudo ausearch -m avc -ts recent | grep traefik

# Temporarily switch to permissive mode to test
sudo setenforce 0

# Return to enforcing mode once the test is done
sudo setenforce 1

# Re-install policy if needed
cd selinux && ./install_selinux_policy.sh
```

**Authentication Not Working**
```bash
# Check service labels
docker service inspect traefik_traefik | jq '.[0].Spec.Labels'

# Verify that the password matches the bcrypt hash
printf '%s\n' 'admin:$2y$10$xvzBkbKKvRX.jGG6F7L.ReEMyEx.7BkqNGQO2rFt/1aBgx8jPElXW' > /tmp/traefik.htpasswd
htpasswd -vb /tmp/traefik.htpasswd admin 'secure_password_2024'
```

**Certificate Issues**
```bash
# Check ACME log (rate-limit errors from Let's Encrypt show up here)
docker service logs traefik_traefik | grep -i acme

# Verify DNS resolution
nslookup yourdomain.com

# Check that the Let's Encrypt ACME endpoint is reachable
curl -I https://acme-v02.api.letsencrypt.org/directory
```

### Health Checks
```bash
# Traefik API health
curl -f http://localhost:8080/ping

# Service discovery
curl -s http://localhost:8080/api/http/services | jq '.'

# Prometheus metrics
curl -s http://localhost:8080/metrics | grep traefik_
```

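
Right after a deploy or a forced service update, the `/ping` endpoint may not answer for a few seconds. A small retry loop around the health check above avoids false alarms in scripts. This is a sketch only; `wait_for_ping` and its default attempt count and delay are assumptions.

```bash
# Poll a health URL until it answers or the attempts run out.
# Arguments: URL, number of attempts, delay between attempts in seconds.
# NOTE: helper name and defaults are assumptions for illustration.
wait_for_ping() {
  local url="${1:-http://localhost:8080/ping}" attempts="${2:-10}" delay="${3:-3}"
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "no response from $url after $attempts attempts" >&2
  return 1
}
```
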
## Performance Tuning

### Resource Limits
- **Traefik**: 1 CPU, 512MB RAM
- **Prometheus**: 1 CPU, 1GB RAM
- **Grafana**: 0.5 CPU, 512MB RAM
- **AlertManager**: 0.2 CPU, 256MB RAM

### Scaling Recommendations
- Single Traefik instance per manager node
- Prometheus data retention: 30 days
- Log rotation: Daily, keep 7 days
- Monitoring scrape interval: 15 seconds

## Backup Strategy

### Critical Data
- `/opt/traefik/letsencrypt/`: TLS certificates
- `/opt/monitoring/prometheus/data/`: Metrics data
- `/opt/monitoring/grafana/data/`: Dashboards and config
- `/opt/monitoring/alertmanager/config/`: Alert rules

### Backup Script
```bash
#!/bin/bash
BACKUP_DIR="/backup/traefik-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

tar -czf "$BACKUP_DIR/traefik-config.tar.gz" /opt/traefik/
tar -czf "$BACKUP_DIR/monitoring-config.tar.gz" /opt/monitoring/
```

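
An archive that cannot be read back is not much of a backup, so a verification pass after the script above may be worthwhile. The sketch below lists each archive to confirm it is readable end to end; `verify_backups` is a hypothetical helper and is not one of the validation scripts mentioned elsewhere in the repository.

```bash
# Check that every .tar.gz in a backup directory can be listed end to end.
# NOTE: helper name is an assumption for illustration.
verify_backups() {
  local backup_dir="$1" archive status=0
  for archive in "$backup_dir"/*.tar.gz; do
    [ -e "$archive" ] || { echo "no archives in $backup_dir" >&2; return 1; }
    if tar -tzf "$archive" >/dev/null 2>&1; then
      echo "ok: $archive"
    else
      echo "corrupt: $archive"
      status=1
    fi
  done
  return "$status"
}
```

For example, `verify_backups "/backup/traefik-$(date +%Y%m%d)"` checks the archives written by the same day's run.
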
## Support and Documentation

### Log Locations
- **Traefik Logs**: `/opt/traefik/logs/`
- **Access Logs**: `/opt/traefik/logs/access.log`
- **Service Logs**: `docker service logs traefik_traefik`

### Monitoring Queries
```promql
# Authentication failure rate
rate(traefik_service_requests_total{code=~"401|403"}[5m])

# Service availability
up{job="traefik"}

# Response time 95th percentile
histogram_quantile(0.95, rate(traefik_service_request_duration_seconds_bucket[5m]))
```

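
The queries above can also be run from the command line through the Prometheus HTTP API, which is convenient for scripted checks. A minimal sketch, assuming Prometheus is reachable on `localhost:9090` as in the maintenance section; `prom_query` is a hypothetical helper name.

```bash
# Run an instant PromQL query against the Prometheus HTTP API.
# Arguments: PromQL expression, optional Prometheus base URL.
# NOTE: helper name is an assumption for illustration.
prom_query() {
  local expr="$1" base="${2:-http://localhost:9090}"
  # -G sends the URL-encoded expression as a query-string parameter
  curl -fsS -G "$base/api/v1/query" --data-urlencode "query=$expr"
}
```

For example, `prom_query 'up{job="traefik"}' | jq '.data.result'` shows the current availability series.
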
This deployment provides an enterprise-grade Traefik configuration with comprehensive security, monitoring, and operational capabilities.