HomeAudit/dev_documentation/monitoring/TRAEFIK_DEPLOYMENT_GUIDE.md

# Traefik Production Deployment Guide

## Overview
This guide provides comprehensive instructions for deploying Traefik v3.1 in production with full authentication, monitoring, and security features on Docker Swarm with SELinux enforcement.

## Architecture Components

### Core Services
- **Traefik v3.1**: Load balancer and reverse proxy with authentication
- **Prometheus**: Metrics collection and alerting
- **Grafana**: Monitoring dashboards and visualization
- **AlertManager**: Alert routing and notification management
- **Loki + Promtail**: Log aggregation and analysis

### Security Features
- ✅ Basic authentication with bcrypt hashing
- ✅ TLS/SSL termination with automatic certificates
- ✅ Security headers (HSTS, XSS protection, etc.)
- ✅ Rate limiting and DDoS protection
- ✅ SELinux policy compliance
- ✅ Prometheus metrics for security monitoring

## Prerequisites

### System Requirements
- Docker Swarm cluster (single manager minimum)
- SELinux enabled (Fedora/RHEL/CentOS)
- Minimum 4GB RAM, 20GB disk space
- Network ports: 80, 443, 8080, 9090, 3000

### Directory Structure
```bash
sudo mkdir -p /opt/{traefik,monitoring}/{letsencrypt,logs,prometheus,grafana,alertmanager,loki}
sudo mkdir -p /opt/monitoring/{prometheus/{data,config},grafana/{data,config}}
sudo mkdir -p /opt/monitoring/{alertmanager/{data,config},loki/data,promtail/config}
sudo chown -R 1000:1000 /opt/monitoring/grafana
```

## Installation Steps

### Step 1: SELinux Policy Configuration

```bash
# Install SELinux development tools
sudo dnf install -y selinux-policy-devel

# Install custom SELinux policy
cd /home/jonathan/Coding/HomeAudit/selinux
./install_selinux_policy.sh
```

### Step 2: Docker Swarm Network Setup

```bash
# Create overlay network
docker network create --driver overlay --attachable traefik-public
```

### Step 3: Configuration Deployment

```bash
# Copy monitoring configurations
sudo cp configs/monitoring/prometheus.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/traefik_rules.yml /opt/monitoring/prometheus/config/
sudo cp configs/monitoring/alertmanager.yml /opt/monitoring/alertmanager/config/

# Set proper permissions
sudo chown -R 65534:65534 /opt/monitoring/prometheus
sudo chown -R 472:472 /opt/monitoring/grafana
```

### Step 4: Environment Variables

Create `/opt/traefik/.env`:
```bash
DOMAIN=yourdomain.com
EMAIL=admin@yourdomain.com
```

### Step 5: Deploy Services

```bash
# Deploy Traefik
export DOMAIN=yourdomain.com
docker stack deploy -c stacks/core/traefik-production.yml traefik

# Deploy monitoring stack
docker stack deploy -c stacks/monitoring/traefik-monitoring.yml monitoring
```

## Configuration Details

### Authentication Credentials
- **Username**: `admin`
- **Password**: `secure_password_2024` (bcrypt hash included)
- **Change in production**: Generate new hash with `htpasswd -nbB admin newpassword`

### SSL/TLS Configuration
- Automatic Let's Encrypt certificates
- HTTPS redirect for all HTTP traffic
- HSTS headers with 2-year max-age
- Secure cipher suites only

### Monitoring Access Points
- **Traefik Dashboard**: `https://traefik.yourdomain.com/dashboard/`
- **Prometheus**: `https://prometheus.yourdomain.com`
- **Grafana**: `https://grafana.yourdomain.com`
- **AlertManager**: `https://alertmanager.yourdomain.com`

## Security Monitoring

### Key Metrics Monitored
1. **Authentication Failures**: Rate of 401/403 responses
2. **Brute Force Attacks**: High-frequency auth failures
3. **Service Availability**: Backend health status
4. **Response Times**: 95th percentile latency
5. **Error Rates**: 5xx error percentage
6. **Certificate Expiration**: TLS cert validity
7. **Rate Limiting**: 429 response frequency

### Alert Thresholds
- **Critical**: >50 auth failures/second = Possible brute force
- **Warning**: >10 auth failures/minute = High failure rate
- **Critical**: Service backend down >1 minute
- **Warning**: 95th percentile response time >2 seconds
- **Warning**: Error rate >10% for 5 minutes
- **Warning**: TLS certificate expires <7 days
- **Critical**: TLS certificate expired

## Production Checklist

### Pre-Deployment
- [ ] SELinux policy installed and tested
- [ ] Docker Swarm initialized and nodes joined
- [ ] Directory structure created with correct permissions
- [ ] Environment variables configured
- [ ] DNS records pointing to Swarm manager
- [ ] Firewall rules configured for ports 80, 443, 8080

### Post-Deployment Verification
- [ ] Traefik dashboard accessible with authentication
- [ ] HTTPS redirects working correctly
- [ ] Security headers present in responses
- [ ] Prometheus collecting Traefik metrics
- [ ] Grafana dashboards displaying data
- [ ] AlertManager receiving and routing alerts
- [ ] Log aggregation working in Loki
- [ ] Certificate auto-renewal configured

### Security Validation
- [ ] Authentication required for all admin interfaces
- [ ] TLS certificates valid and auto-renewing
- [ ] Security headers (HSTS, XSS protection) enabled
- [ ] Rate limiting functional
- [ ] Monitoring alerts triggering correctly
- [ ] SELinux in enforcing mode without denials

## Maintenance Operations

### Certificate Management
```bash
# Check certificate status
docker exec $(docker ps -q -f name=traefik) ls -la /letsencrypt/acme.json

# Force certificate renewal (if needed)
docker exec $(docker ps -q -f name=traefik) rm /letsencrypt/acme.json
docker service update --force traefik_traefik
```

### Log Management
```bash
# Rotate Traefik logs
sudo logrotate -f /etc/logrotate.d/traefik

# Check log sizes
du -sh /opt/traefik/logs/*
```

### Monitoring Maintenance
```bash
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'

# Grafana backup
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /opt/monitoring/grafana/data
```

## Troubleshooting

### Common Issues

**SELinux Permission Denied**
```bash
# Check for denials
sudo ausearch -m avc -ts recent | grep traefik

# Temporarily disable to test
sudo setenforce 0

# Re-install policy if needed
cd selinux && ./install_selinux_policy.sh
```

**Authentication Not Working**
```bash
# Check service labels
docker service inspect traefik_traefik | jq '.[0].Spec.Labels'

# Verify bcrypt hash
echo 'admin:$2y$10$xvzBkbKKvRX.jGG6F7L.ReEMyEx.7BkqNGQO2rFt/1aBgx8jPElXW' | htpasswd -i -v /dev/stdin admin
```

**Certificate Issues**
```bash
# Check ACME log
docker service logs traefik_traefik | grep -i acme

# Verify DNS resolution
nslookup yourdomain.com

# Check rate limits
curl -I https://acme-v02.api.letsencrypt.org/directory
```

### Health Checks
```bash
# Traefik API health
curl -f http://localhost:8080/ping

# Service discovery
curl -s http://localhost:8080/api/http/services | jq '.'

# Prometheus metrics
curl -s http://localhost:8080/metrics | grep traefik_
```

## Performance Tuning

### Resource Limits
- **Traefik**: 1 CPU, 512MB RAM
- **Prometheus**: 1 CPU, 1GB RAM
- **Grafana**: 0.5 CPU, 512MB RAM
- **AlertManager**: 0.2 CPU, 256MB RAM

### Scaling Recommendations
- Single Traefik instance per manager node
- Prometheus data retention: 30 days
- Log rotation: Daily, keep 7 days
- Monitoring scrape interval: 15 seconds

## Backup Strategy

### Critical Data
- `/opt/traefik/letsencrypt/`: TLS certificates
- `/opt/monitoring/prometheus/data/`: Metrics data
- `/opt/monitoring/grafana/data/`: Dashboards and config
- `/opt/monitoring/alertmanager/config/`: Alert rules

### Backup Script
```bash
#!/bin/bash
BACKUP_DIR="/backup/traefik-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

tar -czf "$BACKUP_DIR/traefik-config.tar.gz" /opt/traefik/
tar -czf "$BACKUP_DIR/monitoring-config.tar.gz" /opt/monitoring/
```

## Support and Documentation

### Log Locations
- **Traefik Logs**: `/opt/traefik/logs/`
- **Access Logs**: `/opt/traefik/logs/access.log`
- **Service Logs**: `docker service logs traefik_traefik`

### Monitoring Queries
```promql
# Authentication failure rate
rate(traefik_service_requests_total{code=~"401|403"}[5m])

# Service availability
up{job="traefik"}

# Response time 95th percentile
histogram_quantile(0.95, rate(traefik_service_request_duration_seconds_bucket[5m]))
```

This deployment provides enterprise-grade Traefik configuration with comprehensive security, monitoring, and operational capabilities.