Traefik Production Deployment Guide

Overview

This guide provides comprehensive instructions for deploying Traefik v3.1 in production with full authentication, monitoring, and security features on Docker Swarm with SELinux enforcement.

Architecture Components

Core Services

  • Traefik v3.1: Load balancer and reverse proxy with authentication
  • Prometheus: Metrics collection and alerting
  • Grafana: Monitoring dashboards and visualization
  • AlertManager: Alert routing and notification management
  • Loki + Promtail: Log aggregation and analysis

Security Features

  • Basic authentication with bcrypt hashing
  • TLS/SSL termination with automatic certificates
  • Security headers (HSTS, XSS protection, etc.)
  • Rate limiting and DDoS protection
  • SELinux policy compliance
  • Prometheus metrics for security monitoring

Prerequisites

System Requirements

  • Docker Swarm cluster (single manager minimum)
  • SELinux enabled (Fedora/RHEL/CentOS)
  • Minimum 4GB RAM, 20GB disk space
  • Network ports: 80, 443, 8080, 9090, 3000

Directory Structure

sudo mkdir -p /opt/traefik/{letsencrypt,logs}
sudo mkdir -p /opt/monitoring/{prometheus/{data,config},grafana/{data,config}}
sudo mkdir -p /opt/monitoring/{alertmanager/{data,config},loki/data,promtail/config}
# The official Grafana image runs as UID/GID 472
sudo chown -R 472:472 /opt/monitoring/grafana

Installation Steps

Step 1: SELinux Policy Configuration

# Install SELinux development tools
sudo dnf install -y selinux-policy-devel

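# Install custom SELinux policy. The subshell keeps the working directory at the
# repository root for the later steps. Adjust the path to your own checkout.
(cd /home/jonathan/Coding/HomeAudit/selinux && ./install_selinux_policy.sh)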

Step 2: Docker Swarm Network Setup

# Create overlay network (run on a Swarm manager node)
docker network create --driver overlay --attachable traefik-public

Step 3: Configuration Deployment

# Copy monitoring configurations
sudo cp configs/monitoring/prometheus.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/traefik_rules.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/alertmanager.yml /opt/monitoring/alertmanager/config/

# Set proper permissions
sudo chown -R 65534:65534 /opt/monitoring/prometheus
sudo chown -R 472:472 /opt/monitoring/grafana

Step 4: Environment Variables

Create /opt/traefik/.env:

DOMAIN=yourdomain.com
EMAIL=admin@yourdomain.com
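
A missing or empty value in this file tends to surface only later, as a broken router rule or a failed ACME registration. The following is a minimal pre-flight sketch, not part of the original tooling: check_env_file is a hypothetical helper name, and it assumes the file uses plain KEY=value lines as shown above.

```shell
#!/bin/sh
# Hypothetical helper: confirm that every required KEY has a non-empty value
# in a KEY=value env file. Prints missing keys to stderr and returns 1 if any are absent.
check_env_file() {
    env_file=$1
    shift
    missing=0
    for key in "$@"; do
        # Match "KEY=<at least one character>" at the start of a line.
        if ! grep -Eq "^${key}=.+" "$env_file"; then
            echo "missing or empty: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: check_env_file /opt/traefik/.env DOMAIN EMAIL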

Step 5: Deploy Services

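# Deploy Traefik. docker stack deploy does not read .env files,
# so export DOMAIN and EMAIL into the shell first.
set -a; . /opt/traefik/.env; set +a
docker stack deploy -c stacks/core/traefik-production.yml traefik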

# Deploy monitoring stack
docker stack deploy -c stacks/monitoring/traefik-monitoring.yml monitoring

Configuration Details

Authentication Credentials

  • Username: admin
  • Password: secure_password_2024 (default; bcrypt hash included)
  • Change before production use: the default password is published in this guide, so generate a new hash with htpasswd -nbB admin newpassword
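
If the hash is placed in a label inside the stack file (an assumption; this guide does not show where the hash lives), each "$" in the bcrypt string must be doubled, because Compose-format files treat a single "$" as the start of a variable reference. A minimal sketch follows, with escape_for_compose as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: double every "$" so Compose variable interpolation
# leaves the bcrypt hash intact when it is pasted into a stack file.
escape_for_compose() {
    printf '%s\n' "$1" | sed 's/\$/$$/g'
}
```

Usage: escape_for_compose "$(htpasswd -nbB admin 'newpassword')". No escaping should be needed if the credentials are instead loaded from a file, for example through Traefik's usersFile option.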

SSL/TLS Configuration

  • Automatic Let's Encrypt certificates
  • HTTPS redirect for all HTTP traffic
  • HSTS headers with 2-year max-age
  • Secure cipher suites only
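
A two-year HSTS max-age corresponds to 2 × 365 × 24 × 3600 = 63072000 seconds. The sketch below reads response headers on stdin and checks the advertised value against that figure. The function names are hypothetical, and it assumes the header carries a numeric max-age directive.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the
# HSTS max-age in seconds (empty output if the header is absent).
hsts_max_age() {
    tr -d '\r' | tr 'A-Z' 'a-z' |
        sed -n 's/^strict-transport-security:.*max-age=\([0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# Succeed only if the advertised max-age is at least two years.
hsts_is_two_years() {
    two_years=$((2 * 365 * 24 * 60 * 60))   # 63072000 seconds
    age=$(hsts_max_age)
    [ -n "$age" ] && [ "$age" -ge "$two_years" ]
}
```

Usage: curl -sI https://traefik.yourdomain.com | hsts_is_two_years && echo "HSTS OK"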

Monitoring Access Points

  • Traefik Dashboard: https://traefik.yourdomain.com/dashboard/
  • Prometheus: https://prometheus.yourdomain.com
  • Grafana: https://grafana.yourdomain.com
  • AlertManager: https://alertmanager.yourdomain.com

Security Monitoring

Key Metrics Monitored

  1. Authentication Failures: Rate of 401/403 responses
  2. Brute Force Attacks: High-frequency auth failures
  3. Service Availability: Backend health status
  4. Response Times: 95th percentile latency
  5. Error Rates: 5xx error percentage
  6. Certificate Expiration: TLS cert validity
  7. Rate Limiting: 429 response frequency

Alert Thresholds

  • Critical: >50 auth failures/second = Possible brute force
  • Warning: >10 auth failures/minute = High failure rate
  • Critical: Service backend down >1 minute
  • Warning: 95th percentile response time >2 seconds
  • Warning: Error rate >10% for 5 minutes
  • Warning: TLS certificate expires in <7 days
  • Critical: TLS certificate expired
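
The two certificate thresholds can also be spot-checked outside Prometheus. The sketch below maps an expiry time to those levels; cert_alert_level is a hypothetical name, and both arguments are assumed to be Unix timestamps in seconds.

```shell
#!/bin/sh
# Hypothetical helper: map a certificate expiry time to the alert levels
# listed above. Both arguments are Unix timestamps (seconds since the epoch).
cert_alert_level() {
    expiry=$1
    now=$2
    if [ "$expiry" -le "$now" ]; then
        echo critical            # certificate already expired
    elif [ $((expiry - now)) -lt $((7 * 24 * 60 * 60)) ]; then
        echo warning             # fewer than 7 days left
    else
        echo ok
    fi
}
```

On a system with GNU date, the expiry timestamp could be obtained by passing the notAfter value printed by openssl x509 -noout -enddate to date -d "..." +%s, and the current time with date +%s.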

Production Checklist

Pre-Deployment

  • SELinux policy installed and tested
  • Docker Swarm initialized and nodes joined
  • Directory structure created with correct permissions
  • Environment variables configured
  • DNS records pointing to Swarm manager
  • Firewall rules configured for ports 80, 443, 8080

Post-Deployment Verification

  • Traefik dashboard accessible with authentication
  • HTTPS redirects working correctly
  • Security headers present in responses
  • Prometheus collecting Traefik metrics
  • Grafana dashboards displaying data
  • AlertManager receiving and routing alerts
  • Log aggregation working in Loki
  • Certificate auto-renewal configured

Security Validation

  • Authentication required for all admin interfaces
  • TLS certificates valid and auto-renewing
  • Security headers (HSTS, XSS protection) enabled
  • Rate limiting functional
  • Monitoring alerts triggering correctly
  • SELinux in enforcing mode without denials

Maintenance Operations

Certificate Management

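# Check certificate storage (run on the node hosting the Traefik task)
docker exec $(docker ps -q -f name=traefik_traefik) ls -la /letsencrypt/acme.json

# Force certificate re-issuance (last resort: this discards every stored certificate,
# and repeated re-issuance can hit Let's Encrypt rate limits)
docker exec $(docker ps -q -f name=traefik_traefik) rm /letsencrypt/acme.json
docker service update --force traefik_traefik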

Log Management

# Rotate Traefik logs
sudo logrotate -f /etc/logrotate.d/traefik

# Check log sizes
du -sh /opt/traefik/logs/*
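
The rotate command above expects /etc/logrotate.d/traefik to exist, but this guide does not show that file. The following sketch generates one possible policy matching the "daily, keep 7 days" recommendation. The function name is hypothetical and the policy contents are an assumption, not the original configuration. The postrotate step sends USR1 because Traefik reopens its log files on that signal.

```shell
#!/bin/sh
# Hypothetical helper: write a logrotate policy for the Traefik log directory
# to the path given as the first argument.
# The postrotate command signals Traefik (USR1) to reopen its log files.
write_traefik_logrotate() {
    cat > "$1" <<'EOF'
/opt/traefik/logs/*.log {
    daily
    rotate 7
    missingok
    notifempty
    compress
    delaycompress
    postrotate
        docker kill --signal=USR1 $(docker ps -q -f name=traefik_traefik) >/dev/null 2>&1 || true
    endscript
}
EOF
}
```

Usage: write_traefik_logrotate /tmp/traefik.logrotate, then review the file, install it with sudo install -m 644 /tmp/traefik.logrotate /etc/logrotate.d/traefik, and dry-run it with sudo logrotate -d /etc/logrotate.d/traefik.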

Monitoring Maintenance

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'

# Grafana backup
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /opt/monitoring/grafana/data

Troubleshooting

Common Issues

SELinux Permission Denied

# Check for denials
sudo ausearch -m avc -ts recent | grep traefik

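# Temporarily switch to permissive mode to confirm SELinux is the cause
sudo setenforce 0
# Retest, then return to enforcing mode
sudo setenforce 1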

# Re-install policy if needed
cd selinux && ./install_selinux_policy.sh

Authentication Not Working

# Check service labels
docker service inspect traefik_traefik | jq '.[0].Spec.Labels'

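# Verify a password against the bcrypt hash (htpasswd -v prompts for the password; requires htpasswd from Apache 2.4.5 or later)
printf '%s\n' 'admin:$2y$10$xvzBkbKKvRX.jGG6F7L.ReEMyEx.7BkqNGQO2rFt/1aBgx8jPElXW' > /tmp/traefik.htpasswd
htpasswd -v /tmp/traefik.htpasswd admin
rm -f /tmp/traefik.htpasswd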

Certificate Issues

# Check ACME log
docker service logs traefik_traefik | grep -i acme

# Verify DNS resolution
nslookup yourdomain.com

# Check that the Let's Encrypt ACME endpoint is reachable (this does not report rate-limit status)
curl -I https://acme-v02.api.letsencrypt.org/directory

Health Checks

# Traefik API health
curl -f http://localhost:8080/ping

# Service discovery
curl -s http://localhost:8080/api/http/services | jq '.'

# Prometheus metrics
curl -s http://localhost:8080/metrics | grep traefik_
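
Right after a deploy or a forced service update, these endpoints may refuse connections for a few seconds. A small retry wrapper avoids false alarms in scripts. This is a generic sketch; retry is a hypothetical helper name, not something provided by Docker or Traefik.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds, up to ATTEMPTS times,
# sleeping DELAY seconds between tries. Returns 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}
```

Usage: retry 30 2 curl -fsS http://localhost:8080/ping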

Performance Tuning

Resource Limits

  • Traefik: 1 CPU, 512MB RAM
  • Prometheus: 1 CPU, 1GB RAM
  • Grafana: 0.5 CPU, 512MB RAM
  • AlertManager: 0.2 CPU, 256MB RAM

Scaling Recommendations

  • Single Traefik instance per manager node
  • Prometheus data retention: 30 days
  • Log rotation: Daily, keep 7 days
  • Monitoring scrape interval: 15 seconds

Backup Strategy

Critical Data

  • /opt/traefik/letsencrypt/: TLS certificates
  • /opt/monitoring/prometheus/data/: Metrics data
  • /opt/monitoring/grafana/data/: Dashboards and config
  • /opt/monitoring/prometheus/config/: Scrape configuration and alert rules
  • /opt/monitoring/alertmanager/config/: Alert routing configuration

Backup Script

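#!/bin/bash
# Stop on the first error so a failed archive is not mistaken for a good backup
set -euo pipefail
# acme.json holds private keys, so keep the backup readable by its owner only
umask 077

BACKUP_DIR="/backup/traefik-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

tar -czf "$BACKUP_DIR/traefik-config.tar.gz" /opt/traefik/
tar -czf "$BACKUP_DIR/monitoring-config.tar.gz" /opt/monitoring/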
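
A backup is only useful if the archive can be read back. The sketch below is a minimal integrity check that could run after the script above. verify_backup is a hypothetical name, and the check only confirms that the archive is non-empty and readable end to end, not that its contents are complete.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the archive exists, is non-empty,
# and tar can list it from start to finish.
verify_backup() {
    [ -s "$1" ] && tar -tzf "$1" >/dev/null 2>&1
}
```

Usage: verify_backup "/backup/traefik-$(date +%Y%m%d)/traefik-config.tar.gz" || echo "backup unreadable" >&2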

Support and Documentation

Log Locations

  • Traefik Logs: /opt/traefik/logs/
  • Access Logs: /opt/traefik/logs/access.log
  • Service Logs: docker service logs traefik_traefik

Monitoring Queries

# Authentication failure rate
rate(traefik_service_requests_total{code=~"401|403"}[5m])

# Service availability
up{job="traefik"}

# Response time 95th percentile, per service (buckets are summed across the other labels first)
histogram_quantile(0.95, sum by (le, service) (rate(traefik_service_request_duration_seconds_bucket[5m])))

This deployment provides a production Traefik configuration with authentication, TLS, security monitoring, and documented maintenance and backup procedures.