HomeAudit/dev_documentation/monitoring/TRAEFIK_DEPLOYMENT_GUIDE.md
admin 705a2757c1 Major infrastructure migration and Vaultwarden PostgreSQL troubleshooting
2025-08-30 20:18:44 -04:00

# Traefik Production Deployment Guide
## Overview
This guide provides comprehensive instructions for deploying Traefik v3.1 in production with full authentication, monitoring, and security features on Docker Swarm with SELinux enforcement.
## Architecture Components
### Core Services
- **Traefik v3.1**: Load balancer and reverse proxy with authentication
- **Prometheus**: Metrics collection and alerting
- **Grafana**: Monitoring dashboards and visualization
- **AlertManager**: Alert routing and notification management
- **Loki + Promtail**: Log aggregation and analysis
### Security Features
- ✅ Basic authentication with bcrypt hashing
- ✅ TLS/SSL termination with automatic certificates
- ✅ Security headers (HSTS, XSS protection, etc.)
- ✅ Rate limiting and DDoS protection
- ✅ SELinux policy compliance
- ✅ Prometheus metrics for security monitoring
## Prerequisites
### System Requirements
- Docker Swarm cluster (single manager minimum)
- SELinux enabled (Fedora/RHEL/CentOS)
- Minimum 4GB RAM, 20GB disk space
- Network ports: 80, 443, 8080, 9090, 3000
### Directory Structure
```bash
sudo mkdir -p /opt/{traefik,monitoring}/{letsencrypt,logs,prometheus,grafana,alertmanager,loki}
sudo mkdir -p /opt/monitoring/{prometheus/{data,config},grafana/{data,config}}
sudo mkdir -p /opt/monitoring/{alertmanager/{data,config},loki/data,promtail/config}
sudo chown -R 472:472 /opt/monitoring/grafana  # UID 472 is the Grafana container user
```
## Installation Steps
### Step 1: SELinux Policy Configuration
```bash
# Install SELinux development tools
sudo dnf install -y selinux-policy-devel
# Install custom SELinux policy (adjust the path to your checkout of the HomeAudit repository)
cd /home/jonathan/Coding/HomeAudit/selinux
./install_selinux_policy.sh
```
### Step 2: Docker Swarm Network Setup
```bash
# Create overlay network
docker network create --driver overlay --attachable traefik-public
```
### Step 3: Configuration Deployment
```bash
# Copy monitoring configurations
sudo cp configs/monitoring/prometheus.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/traefik_rules.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/alertmanager.yml /opt/monitoring/alertmanager/config/
# Set proper permissions
sudo chown -R 65534:65534 /opt/monitoring/prometheus
sudo chown -R 472:472 /opt/monitoring/grafana
```
### Step 4: Environment Variables
Create `/opt/traefik/.env`:
```bash
DOMAIN=yourdomain.com
EMAIL=admin@yourdomain.com
```
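Before deploying, it can help to confirm that the file actually defines both variables. The following is a minimal sketch, not part of the original tooling; the function name `load_env` is hypothetical, and it assumes the file contains plain `KEY=VALUE` lines as shown above.

```bash
# Load KEY=VALUE pairs from an env file, export them, and check that the
# variables this guide relies on (DOMAIN, EMAIL) are non-empty.
# load_env is a hypothetical helper name used for illustration.
load_env() {
  local env_file="$1" name value
  [ -r "$env_file" ] || { echo "cannot read $env_file" >&2; return 1; }
  set -a            # export every variable assigned while sourcing
  . "$env_file"
  set +a
  for name in DOMAIN EMAIL; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Usage (path from Step 4):
#   load_env /opt/traefik/.env && echo "Deploying for $DOMAIN"
```

Exporting the values in the current shell matters because `docker stack deploy` substitutes variables from the environment only.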
### Step 5: Deploy Services
```bash
# Deploy Traefik
# `docker stack deploy` does not read .env files, so export the variables first
set -a; . /opt/traefik/.env; set +a
docker stack deploy -c stacks/core/traefik-production.yml traefik
# Deploy monitoring stack
docker stack deploy -c stacks/monitoring/traefik-monitoring.yml monitoring
```
## Configuration Details
### Authentication Credentials
- **Username**: `admin`
- **Password**: `secure_password_2024` (bcrypt hash included)
- **Change in production**: Replace the default password before exposing any service. Generate a new hash with `htpasswd -nbB admin newpassword`
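
When a bcrypt hash is pasted into a stack file label, each `$` in it has to be doubled, otherwise Compose-style variable interpolation will treat parts of the hash as variable references. A minimal sketch of that escaping step follows; the helper name `escape_for_compose` is hypothetical, and it assumes the hash is stored inline in a label rather than in a users file or secret.

```bash
# Double every "$" so the hash survives variable interpolation when it is
# placed in a stack file label. escape_for_compose is a hypothetical helper.
escape_for_compose() {
  printf '%s\n' "$1" | sed -e 's/\$/$$/g'
}

# Usage (htpasswd prompts for the new password):
#   escape_for_compose "$(htpasswd -nB admin)"
```

The escaped output is what goes into the basic-auth label; the unescaped form is what `htpasswd -v` expects in a password file.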
### SSL/TLS Configuration
- Automatic Let's Encrypt certificates
- HTTPS redirect for all HTTP traffic
- HSTS headers with 2-year max-age
- Secure cipher suites only
### Monitoring Access Points
- **Traefik Dashboard**: `https://traefik.yourdomain.com/dashboard/`
- **Prometheus**: `https://prometheus.yourdomain.com`
- **Grafana**: `https://grafana.yourdomain.com`
- **AlertManager**: `https://alertmanager.yourdomain.com`
## Security Monitoring
### Key Metrics Monitored
1. **Authentication Failures**: Rate of 401/403 responses
2. **Brute Force Attacks**: High-frequency auth failures
3. **Service Availability**: Backend health status
4. **Response Times**: 95th percentile latency
5. **Error Rates**: 5xx error percentage
6. **Certificate Expiration**: TLS cert validity
7. **Rate Limiting**: 429 response frequency
### Alert Thresholds
- **Critical**: >50 auth failures/second = Possible brute force
- **Warning**: >10 auth failures/minute = High failure rate
- **Critical**: Service backend down >1 minute
- **Warning**: 95th percentile response time >2 seconds
- **Warning**: Error rate >10% for 5 minutes
- **Warning**: TLS certificate expires <7 days
- **Critical**: TLS certificate expired
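
The two authentication thresholds above use different units (per second and per minute). The sketch below normalises both to failures per second and classifies a measured rate. It is an illustration of the arithmetic only, not the alerting rules shipped in `traefik_rules.yml`, and the function name `classify_auth_failures` is hypothetical.

```bash
# Classify an authentication-failure rate (failures per second) against the
# two thresholds listed above. awk handles the floating-point comparison.
classify_auth_failures() {
  awk -v rate="$1" 'BEGIN {
    critical = 50        # >50 failures per second
    warning  = 10 / 60   # >10 failures per minute, expressed per second
    if (rate > critical)      print "critical"
    else if (rate > warning)  print "warning"
    else                      print "ok"
  }'
}

# Usage:
#   classify_auth_failures 0.5
```

Expressed per second, the warning threshold is roughly 0.17 failures/second, which is 300 times lower than the critical one.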
## Production Checklist
### Pre-Deployment
- [ ] SELinux policy installed and tested
- [ ] Docker Swarm initialized and nodes joined
- [ ] Directory structure created with correct permissions
- [ ] Environment variables configured
- [ ] DNS records pointing to Swarm manager
- [ ] Firewall rules configured for ports 80, 443, 8080
### Post-Deployment Verification
- [ ] Traefik dashboard accessible with authentication
- [ ] HTTPS redirects working correctly
- [ ] Security headers present in responses
- [ ] Prometheus collecting Traefik metrics
- [ ] Grafana dashboards displaying data
- [ ] AlertManager receiving and routing alerts
- [ ] Log aggregation working in Loki
- [ ] Certificate auto-renewal configured
### Security Validation
- [ ] Authentication required for all admin interfaces
- [ ] TLS certificates valid and auto-renewing
- [ ] Security headers (HSTS, XSS protection) enabled
- [ ] Rate limiting functional
- [ ] Monitoring alerts triggering correctly
- [ ] SELinux in enforcing mode without denials
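
The security-header items in these checklists can be checked from the command line. The sketch below reads response headers on stdin and reports which expected headers are absent. The helper name and the header list (`Strict-Transport-Security`, `X-XSS-Protection`) are assumptions based on the features named in this guide; adjust the list to what your middleware actually sets.

```bash
# Read raw HTTP response headers on stdin and print each expected security
# header that is absent. Exit status is 0 only when none are missing.
# The header list is an assumption; edit it to match your configuration.
missing_security_headers() {
  local headers name missing=0
  headers="$(cat)"
  for name in Strict-Transport-Security X-XSS-Protection; do
    if ! printf '%s\n' "$headers" | grep -qi "^${name}:"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (hostname is a placeholder):
#   curl -sI https://traefik.yourdomain.com/dashboard/ | missing_security_headers
```

The match is case-insensitive because HTTP/2 responses carry lowercase header names.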
## Maintenance Operations
### Certificate Management
```bash
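# Check that the ACME certificate store exists (permissions should be 600)
docker exec $(docker ps -q -f name=traefik_traefik) ls -la /letsencrypt/acme.json
# Force re-issuance of ALL certificates (last resort: this also deletes the ACME
# account and counts against Let's Encrypt rate limits)
docker exec $(docker ps -q -f name=traefik_traefik) rm /letsencrypt/acme.json
docker service update --force traefik_traefik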
```
### Log Management
```bash
# Rotate Traefik logs
sudo logrotate -f /etc/logrotate.d/traefik
# Check log sizes
du -sh /opt/traefik/logs/*
```
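The `logrotate -f` command above assumes that `/etc/logrotate.d/traefik` already exists. As a sketch only, the helper below writes one possible policy (daily rotation, seven files kept, matching the retention stated under Scaling Recommendations). The function name is hypothetical, and the `postrotate` step assumes the container name contains `traefik_traefik`; Traefik reopens its log files when it receives `USR1`.

```bash
# Write a logrotate policy for the Traefik log directory to the given path.
# write_traefik_logrotate is a hypothetical helper; review the policy before use.
write_traefik_logrotate() {
  cat > "$1" <<'EOF'
/opt/traefik/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        # Ask Traefik to reopen its log files after rotation.
        docker kill --signal=USR1 $(docker ps -q -f name=traefik_traefik) >/dev/null 2>&1 || true
    endscript
}
EOF
}

# Usage:
#   write_traefik_logrotate /tmp/traefik.logrotate
#   sudo install -m 0644 /tmp/traefik.logrotate /etc/logrotate.d/traefik
```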
### Monitoring Maintenance
```bash
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'
# Grafana backup
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /opt/monitoring/grafana/data
```
## Troubleshooting
### Common Issues
**SELinux Permission Denied**
```bash
# Check for denials
sudo ausearch -m avc -ts recent | grep traefik
# Temporarily switch to permissive mode to test (re-enable with: sudo setenforce 1)
sudo setenforce 0
# Re-install policy if needed
cd selinux && ./install_selinux_policy.sh
```
**Authentication Not Working**
```bash
# Check service labels
docker service inspect traefik_traefik | jq '.[0].Spec.Labels'
# Verify a password against the bcrypt hash (htpasswd prompts for the password)
printf '%s\n' 'admin:$2y$10$xvzBkbKKvRX.jGG6F7L.ReEMyEx.7BkqNGQO2rFt/1aBgx8jPElXW' > /tmp/traefik.htpasswd
htpasswd -v /tmp/traefik.htpasswd admin
```
**Certificate Issues**
```bash
# Check ACME log
docker service logs traefik_traefik | grep -i acme
# Verify DNS resolution
nslookup yourdomain.com
# Confirm the Let's Encrypt ACME API is reachable (this does not report rate-limit usage)
curl -I https://acme-v02.api.letsencrypt.org/directory
```
### Health Checks
```bash
# Traefik API health
curl -f http://localhost:8080/ping
# Service discovery
curl -s http://localhost:8080/api/http/services | jq '.'
# Prometheus metrics
curl -s http://localhost:8080/metrics | grep traefik_
```
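Right after a deploy or a forced service update, the ping endpoint may fail for a few seconds while the task starts. A small retry wrapper, sketched below, avoids treating that window as a failure. The function name `retry` and the attempt and delay values in the usage line are illustrative assumptions.

```bash
# Run a command until it succeeds, up to a maximum number of attempts,
# sleeping a fixed delay between tries. Returns 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage: wait up to about a minute for Traefik to answer on /ping
#   retry 30 2 curl -fsS -o /dev/null http://localhost:8080/ping
```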
## Performance Tuning
### Resource Limits
- **Traefik**: 1 CPU, 512MB RAM
- **Prometheus**: 1 CPU, 1GB RAM
- **Grafana**: 0.5 CPU, 512MB RAM
- **AlertManager**: 0.2 CPU, 256MB RAM
### Scaling Recommendations
- Single Traefik instance per manager node
- Prometheus data retention: 30 days
- Log rotation: Daily, keep 7 days
- Monitoring scrape interval: 15 seconds
## Backup Strategy
### Critical Data
- `/opt/traefik/letsencrypt/`: TLS certificates
- `/opt/monitoring/prometheus/data/`: Metrics data
- `/opt/monitoring/grafana/data/`: Dashboards and config
- `/opt/monitoring/alertmanager/config/`: Alertmanager routing configuration
### Backup Script
```bash
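#!/bin/bash
set -euo pipefail  # stop on the first failed command
BACKUP_DIR="/backup/traefik-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Archive paths relative to / so tar does not warn about stripping the leading slash
tar -czf "$BACKUP_DIR/traefik-config.tar.gz" -C / opt/traefik
tar -czf "$BACKUP_DIR/monitoring-config.tar.gz" -C / opt/monitoring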
```
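A backup is only useful if the archives can be read back. The following sketch lists every archive in a backup directory and fails if any of them is unreadable. The helper name `verify_backup` is hypothetical, and the check covers archive integrity only, not whether the contents are complete.

```bash
# Check that every .tar.gz archive in a backup directory can be listed.
# Returns non-zero if the directory has no archives or any archive is unreadable.
verify_backup() {
  local dir="$1" archive found=0
  for archive in "$dir"/*.tar.gz; do
    [ -e "$archive" ] || continue
    found=1
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
      echo "corrupt archive: $archive" >&2
      return 1
    fi
  done
  [ "$found" -eq 1 ]
}

# Usage:
#   verify_backup "/backup/traefik-$(date +%Y%m%d)"
```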
## Support and Documentation
### Log Locations
- **Traefik Logs**: `/opt/traefik/logs/`
- **Access Logs**: `/opt/traefik/logs/access.log`
- **Service Logs**: `docker service logs traefik_traefik`
### Monitoring Queries
```promql
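# Authentication failure rate (entrypoint metric, because responses generated by
# middlewares such as basicAuth never reach a service)
sum(rate(traefik_entrypoint_requests_total{code=~"401|403"}[5m]))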
# Service availability
up{job="traefik"}
# Response time 95th percentile, per service
histogram_quantile(0.95, sum by (le, service) (rate(traefik_service_request_duration_seconds_bucket[5m])))
```
This deployment provides a production Traefik configuration with authentication, TLS, monitoring, alerting, and the maintenance procedures described above.