admin 705a2757c1 Major infrastructure migration and Vaultwarden PostgreSQL troubleshooting

2025-08-30 20:18:44 -04:00

Traefik Production Deployment Guide

Overview

This guide provides comprehensive instructions for deploying Traefik v3.1 in production with full authentication, monitoring, and security features on Docker Swarm with SELinux enforcement.

Architecture Components

Core Services

Traefik v3.1: Load balancer and reverse proxy with authentication
Prometheus: Metrics collection and alerting
Grafana: Monitoring dashboards and visualization
AlertManager: Alert routing and notification management
Loki + Promtail: Log aggregation and analysis

Security Features

✅ Basic authentication with bcrypt hashing
✅ TLS/SSL termination with automatic certificates
✅ Security headers (HSTS, XSS protection, etc.)
✅ Rate limiting and DDoS protection
✅ SELinux policy compliance
✅ Prometheus metrics for security monitoring

Prerequisites

System Requirements

Docker Swarm cluster (single manager minimum)
SELinux enabled (Fedora/RHEL/CentOS)
Minimum 4GB RAM, 20GB disk space
Network ports: 80, 443, 8080, 9090, 3000

Directory Structure

sudo mkdir -p /opt/{traefik,monitoring}/{letsencrypt,logs,prometheus,grafana,alertmanager,loki}
sudo mkdir -p /opt/monitoring/{prometheus/{data,config},grafana/{data,config}}
sudo mkdir -p /opt/monitoring/{alertmanager/{data,config},loki/data,promtail/config}
# The official Grafana image runs as UID/GID 472
sudo chown -R 472:472 /opt/monitoring/grafana

Installation Steps

Step 1: SELinux Policy Configuration

# Install SELinux development tools
sudo dnf install -y selinux-policy-devel

# Install custom SELinux policy (adjust the path to your checkout of the repository)
cd /home/jonathan/Coding/HomeAudit/selinux
./install_selinux_policy.sh

Step 2: Docker Swarm Network Setup

# Create overlay network
docker network create --driver overlay --attachable traefik-public

Step 3: Configuration Deployment

# Copy monitoring configurations
sudo cp configs/monitoring/prometheus.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/traefik_rules.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/alertmanager.yml /opt/monitoring/alertmanager/config/

# Set proper permissions
sudo chown -R 65534:65534 /opt/monitoring/prometheus
sudo chown -R 472:472 /opt/monitoring/grafana

Step 4: Environment Variables

Create /opt/traefik/.env:

DOMAIN=yourdomain.com
EMAIL=admin@yourdomain.com
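
Because a leftover placeholder value would only surface later as a failed certificate request, it can help to validate the file before deploying. The following is a minimal sketch, assuming DOMAIN and EMAIL are the only required variables; validate_env is a hypothetical helper name and is not part of the repository.

```shell
# validate_env FILE: source FILE in a subshell and check that DOMAIN and
# EMAIL are set and look plausible. Hypothetical helper for illustration.
validate_env() {
  local file="$1"
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
  (
    set -a            # export every variable the file defines
    . "$file"
    set +a
    case "${DOMAIN:-}" in
      ""|yourdomain.com)
        echo "DOMAIN is unset or still the placeholder" >&2; exit 1 ;;
      *.*) ;;
      *)
        echo "DOMAIN does not look like a hostname: $DOMAIN" >&2; exit 1 ;;
    esac
    case "${EMAIL:-}" in
      *@*.*) ;;
      *)
        echo "EMAIL does not look like an address: ${EMAIL:-}" >&2; exit 1 ;;
    esac
  )
}

# Typical use:
#   validate_env /opt/traefik/.env && echo "environment file looks sane"
```

The checks run in a subshell so that a rejected file does not leave half-loaded variables in the calling shell.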

Step 5: Deploy Services

# Deploy Traefik (docker stack deploy reads variables from the shell, not from .env)
set -a; . /opt/traefik/.env; set +a
docker stack deploy -c stacks/core/traefik-production.yml traefik

# Deploy monitoring stack
docker stack deploy -c stacks/monitoring/traefik-monitoring.yml monitoring

Configuration Details

Authentication Credentials

Username: admin
Password: secure_password_2024 (bcrypt hash included)
Change before production use: generate a new hash with htpasswd -nbB admin newpassword
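
If the user:hash pair is placed in a stack file (for example in a service label), Compose-style variable interpolation treats $ specially, so each $ in the bcrypt hash is normally doubled. A small sketch of that step follows; escape_for_stack_file is a hypothetical helper, and whether the credentials live in a label or in a separate users file depends on the stack file, which is not shown in this guide.

```shell
# escape_for_stack_file: read "user:hash" lines on stdin and double every "$"
# so the value survives variable interpolation in a stack/compose file.
escape_for_stack_file() {
  sed -e 's/\$/$$/g'
}

# Typical use (htpasswd ships in httpd-tools on Fedora/RHEL and in
# apache2-utils on Debian/Ubuntu; -n prints to stdout, -B selects bcrypt):
#   htpasswd -nB admin | escape_for_stack_file
```

Leaving out -b makes htpasswd prompt for the password, which keeps it out of the shell history.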

SSL/TLS Configuration

Automatic Let's Encrypt certificates
HTTPS redirect for all HTTP traffic
HSTS headers with 2-year max-age
Secure cipher suites only
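
A 2-year HSTS policy corresponds to max-age=63072000 if it is expressed as 2 x 365 x 86400 seconds. The sketch below extracts the value from response headers so it can be compared against that figure; hsts_max_age is a hypothetical helper name.

```shell
# hsts_max_age: read HTTP response headers on stdin and print the max-age
# value of the Strict-Transport-Security header (prints nothing if absent).
hsts_max_age() {
  tr -d '\r' |
    grep -i '^strict-transport-security:' |
    grep -io 'max-age=[0-9]*' |
    head -n 1 |
    cut -d= -f2
}

# Typical use, with DOMAIN set as in the deployment steps:
#   curl -sI "https://traefik.${DOMAIN}/" | hsts_max_age
```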

Monitoring Access Points

Traefik Dashboard: https://traefik.yourdomain.com/dashboard/
Prometheus: https://prometheus.yourdomain.com
Grafana: https://grafana.yourdomain.com
AlertManager: https://alertmanager.yourdomain.com

Security Monitoring

Key Metrics Monitored

Authentication Failures: Rate of 401/403 responses
Brute Force Attacks: High-frequency auth failures
Service Availability: Backend health status
Response Times: 95th percentile latency
Error Rates: 5xx error percentage
Certificate Expiration: TLS cert validity
Rate Limiting: 429 response frequency

Alert Thresholds

Critical: >50 auth failures/second = Possible brute force
Warning: >10 auth failures/minute = High failure rate
Critical: Service backend down >1 minute
Warning: 95th percentile response time >2 seconds
Warning: Error rate >10% for 5 minutes
Warning: TLS certificate expires <7 days
Critical: TLS certificate expired
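
Two of the thresholds above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch only: the repository's traefik_rules.yml is not reproduced in this guide, so the alert names, the "for" durations, and the output path are assumptions. The metric names are the ones used under Monitoring Queries.

```shell
# Write an illustrative rules file; alert names and durations are assumptions.
RULES_FILE="${RULES_FILE:-./traefik_rules.example.yml}"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: traefik-security-example
    rules:
      - alert: TraefikPossibleBruteForce
        # Critical: more than 50 auth failures per second
        expr: sum(rate(traefik_service_requests_total{code=~"401|403"}[1m])) > 50
        for: 1m
        labels:
          severity: critical
      - alert: TraefikSlowResponses
        # Warning: 95th percentile response time above 2 seconds
        expr: histogram_quantile(0.95, sum by (le) (rate(traefik_service_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
EOF

# Validate the syntax if promtool is installed.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_FILE"
fi
```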

Production Checklist

Pre-Deployment

SELinux policy installed and tested
Docker Swarm initialized and nodes joined
Directory structure created with correct permissions
Environment variables configured
DNS records pointing to Swarm manager
Firewall rules configured for ports 80, 443, 8080

Post-Deployment Verification

Traefik dashboard accessible with authentication
HTTPS redirects working correctly
Security headers present in responses
Prometheus collecting Traefik metrics
Grafana dashboards displaying data
AlertManager receiving and routing alerts
Log aggregation working in Loki
Certificate auto-renewal configured
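
The HTTPS redirect item above can be checked from response headers alone. A minimal sketch follows; is_https_redirect is a hypothetical helper that reads the output of curl -sI.

```shell
# is_https_redirect: read HTTP response headers on stdin; succeed only if the
# status is a redirect and the Location header points at an https:// URL.
is_https_redirect() {
  local headers status location
  headers="$(tr -d '\r')"
  status="$(printf '%s\n' "$headers" | awk 'NR==1 {print $2}')"
  location="$(printf '%s\n' "$headers" |
    awk 'tolower($1)=="location:" {print $2; exit}')"
  case "$status" in
    301|302|307|308) ;;
    *) return 1 ;;
  esac
  case "$location" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Typical use, with DOMAIN set as in the deployment steps:
#   curl -sI "http://traefik.${DOMAIN}/" | is_https_redirect && echo "redirect OK"
```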

Security Validation

Authentication required for all admin interfaces
TLS certificates valid and auto-renewing
Security headers (HSTS, XSS protection) enabled
Rate limiting functional
Monitoring alerts triggering correctly
SELinux in enforcing mode without denials

Maintenance Operations

Certificate Management

# Check certificate status
docker exec $(docker ps -q -f name=traefik) ls -la /letsencrypt/acme.json

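# Force certificate re-issuance (last resort: this deletes every stored
# certificate and the ACME account key, and re-requesting them counts
# against Let's Encrypt rate limits)
docker exec $(docker ps -q -f name=traefik) rm /letsencrypt/acme.json
docker service update --force traefik_traefik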
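
Before deleting acme.json, it may be worth confirming how close the served certificate actually is to expiry. The sketch below assumes GNU date and the notAfter= line printed by openssl x509 -enddate; days_until is a hypothetical helper name.

```shell
# days_until NOT_AFTER [NOW_EPOCH]: print the whole days between now and the
# given expiry timestamp (any format that GNU "date -d" accepts).
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch="$(date -u -d "$1" +%s)" || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Typical use (7 days is the warning threshold under Alert Thresholds):
#   not_after="$(echo | openssl s_client -connect "traefik.${DOMAIN}:443" \
#       -servername "traefik.${DOMAIN}" 2>/dev/null |
#       openssl x509 -noout -enddate | cut -d= -f2)"
#   [ "$(days_until "$not_after")" -lt 7 ] && echo "certificate expires soon"
```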

Log Management

# Rotate Traefik logs
sudo logrotate -f /etc/logrotate.d/traefik

# Check log sizes
du -sh /opt/traefik/logs/*
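
The contents of /etc/logrotate.d/traefik are not shown in this guide. A plausible sketch that matches the "daily, keep 7 days" recommendation under Scaling Recommendations is given below. It writes to a local example file for review, and the container lookup in postrotate is an assumption. Traefik reopens its log files when it receives the USR1 signal.

```shell
# Write an example logrotate policy to a local file for review before
# installing it as /etc/logrotate.d/traefik (the contents are an assumption).
OUT="${OUT:-./traefik.logrotate.example}"

cat > "$OUT" <<'EOF'
/opt/traefik/logs/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        # Ask Traefik to reopen its log files after rotation.
        docker kill --signal=USR1 $(docker ps -q -f name=traefik) >/dev/null 2>&1 || true
    endscript
}
EOF

# To dry-run the policy without rotating anything:
#   sudo logrotate -d ./traefik.logrotate.example
```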

Monitoring Maintenance

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'

# Grafana backup
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /opt/monitoring/grafana/data

Troubleshooting

Common Issues

SELinux Permission Denied

# Check for denials
sudo ausearch -m avc -ts recent | grep traefik

# Temporarily switch to permissive mode to test
sudo setenforce 0
# ...reproduce the issue, then return to enforcing mode
sudo setenforce 1

# Re-install policy if needed
cd selinux && ./install_selinux_policy.sh
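
When ausearch returns many denials, grouping them by permission and object class shows what the policy is missing. The sketch below post-processes ausearch output; summarize_denials is a hypothetical helper and relies only on the "{ permissions }" and tclass= fields of AVC records.

```shell
# summarize_denials: read AVC records on stdin and print a count per
# "permission(s) tclass" pair, most frequent first.
summarize_denials() {
  grep 'avc:' |
    sed -n 's/.*{ \([^}]*\) }.*tclass=\([^ ]*\).*/\1 \2/p' |
    sort | uniq -c | sort -rn
}

# Typical use:
#   sudo ausearch -m avc -ts recent | grep traefik | summarize_denials
```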

Authentication Not Working

# Check service labels
docker service inspect traefik_traefik | jq '.[0].Spec.Labels'

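# Verify a password against the bcrypt hash (prompts for the password)
printf '%s\n' 'admin:$2y$10$xvzBkbKKvRX.jGG6F7L.ReEMyEx.7BkqNGQO2rFt/1aBgx8jPElXW' > /tmp/traefik-users
htpasswd -v /tmp/traefik-users admin
rm /tmp/traefik-users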

Certificate Issues

# Check ACME log
docker service logs traefik_traefik | grep -i acme

# Verify DNS resolution
nslookup yourdomain.com

# Check that the Let's Encrypt ACME endpoint is reachable (this does not report rate-limit usage)
curl -I https://acme-v02.api.letsencrypt.org/directory

Health Checks

# Traefik API health
curl -f http://localhost:8080/ping

# Service discovery
curl -s http://localhost:8080/api/http/services | jq '.'

# Prometheus metrics
curl -s http://localhost:8080/metrics | grep traefik_
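
Right after a deploy or a forced service update, the endpoints above may take a few seconds to come up, so a single curl can fail spuriously. A small retry wrapper is sketched below; wait_for is a hypothetical helper name.

```shell
# wait_for ATTEMPTS DELAY CMD...: run CMD until it succeeds, sleeping DELAY
# seconds between tries; return 1 after ATTEMPTS unsuccessful tries.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Typical use: up to 30 tries, 2 seconds apart.
#   wait_for 30 2 curl -fsS http://localhost:8080/ping && echo "Traefik is up"
```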

Performance Tuning

Resource Limits

Traefik: 1 CPU, 512MB RAM
Prometheus: 1 CPU, 1GB RAM
Grafana: 0.5 CPU, 512MB RAM
AlertManager: 0.2 CPU, 256MB RAM
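
These limits could be applied to running services with docker service update, although they would more commonly live under deploy.resources.limits in the stack files, which are not shown here. The sketch prints the commands rather than running them. Only traefik_traefik appears in this guide; the monitoring_* service names are assumptions derived from the stack name.

```shell
# limit_cmd SERVICE CPUS MEMORY: print the docker command that would apply
# the given limits, so it can be reviewed before being run.
limit_cmd() {
  printf 'docker service update --limit-cpu %s --limit-memory %s %s\n' \
    "$2" "$3" "$1"
}

limit_cmd traefik_traefik         1   512M
limit_cmd monitoring_prometheus   1   1G     # service name is an assumption
limit_cmd monitoring_grafana      0.5 512M   # service name is an assumption
limit_cmd monitoring_alertmanager 0.2 256M   # service name is an assumption
```

Piping the output to sh on a manager node would apply the limits once the service names have been confirmed with docker service ls.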

Scaling Recommendations

Single Traefik instance per manager node
Prometheus data retention: 30 days
Log rotation: Daily, keep 7 days
Monitoring scrape interval: 15 seconds

Backup Strategy

Critical Data

/opt/traefik/letsencrypt/: TLS certificates
/opt/monitoring/prometheus/data/: Metrics data
/opt/monitoring/grafana/data/: Dashboards and config
/opt/monitoring/alertmanager/config/: Alert routing configuration
/opt/monitoring/prometheus/config/: Scrape configuration and alert rules

Backup Script

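#!/bin/bash
# Archive Traefik and monitoring data into a dated backup directory.
set -euo pipefail

# acme.json holds private keys, so keep the backups readable by the owner only.
umask 077
BACKUP_DIR="/backup/traefik-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

tar -czf "$BACKUP_DIR/traefik-config.tar.gz" -C / opt/traefik
tar -czf "$BACKUP_DIR/monitoring-config.tar.gz" -C / opt/monitoring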
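
An archive that cannot be read back is not a backup, so a verification pass after the script runs is a reasonable addition. A minimal sketch follows; verify_backup is a hypothetical helper that lists each archive end to end without extracting it.

```shell
# verify_backup DIR: check that every .tar.gz in DIR can be read completely.
# Returns non-zero if DIR has no archives or any archive is unreadable.
verify_backup() {
  local dir="$1" archive status=0
  for archive in "$dir"/*.tar.gz; do
    [ -e "$archive" ] || { echo "no archives in $dir" >&2; return 1; }
    if tar -tzf "$archive" >/dev/null 2>&1; then
      echo "ok      $archive"
    else
      echo "CORRUPT $archive" >&2
      status=1
    fi
  done
  return "$status"
}

# Typical use:
#   verify_backup "/backup/traefik-$(date +%Y%m%d)"
```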

Support and Documentation

Log Locations

Traefik Logs: /opt/traefik/logs/
Access Logs: /opt/traefik/logs/access.log
Service Logs: docker service logs traefik_traefik

Monitoring Queries

# Authentication failure rate
rate(traefik_service_requests_total{code=~"401|403"}[5m])

# Service availability
up{job="traefik"}

# Response time 95th percentile
histogram_quantile(0.95, rate(traefik_service_request_duration_seconds_bucket[5m]))

This deployment provides a production Traefik configuration with security hardening, monitoring, and operational tooling.
