HomeAudit/stacks/monitoring/final-monitoring.yml
admin 45363040f3 feat: Complete infrastructure cleanup phase documentation and status updates
## Major Infrastructure Milestones Achieved

### ✅ Service Migrations Completed
- Jellyfin: Successfully migrated to Docker Swarm with latest version
- Vaultwarden: Running in Docker Swarm on OMV800 (eliminated duplicate)
- Nextcloud: Operational with database optimization and cron setup
- Paperless services: Both NGX and AI running successfully

### 🚨 Duplicate Service Analysis Complete
- Identified MariaDB conflict (OMV800 Swarm vs lenovo410 standalone)
- Identified Vaultwarden duplication (now resolved)
- Documented PostgreSQL and Redis consolidation opportunities
- Mapped monitoring stack optimization needs
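The duplicate analysis above boils down to spotting host ports claimed more than once. A minimal sketch of that check, using sample lines in place of live `docker ps` output (on a real node, feed it `docker ps --format '{{.Names}} {{.Ports}}'` instead; for cross-node conflicts like the MariaDB one, run it on each node):

```shell
# Sketch: flag host-published ports that appear more than once.
# The sample lines below are illustrative stand-ins for live output of:
#   docker ps --format '{{.Names}} {{.Ports}}'
sample='mariadb 0.0.0.0:3306->3306/tcp
vaultwarden 0.0.0.0:8222->80/tcp
mariadb-swarm 0.0.0.0:3306->3306/tcp'

# Extract each host binding, then keep only the duplicated ones
dupes=$(printf '%s\n' "$sample" \
  | grep -o '0\.0\.0\.0:[0-9]*' | sort | uniq -d)
echo "conflicting bindings: $dupes"
```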

### 🏗️ Infrastructure Status Documentation
- Updated README with current cleanup phase status
- Enhanced Service Analysis with duplicate service inventory
- Updated Quick Start guide with immediate action items
- Documented current container distribution across 6 nodes

### 📋 Action Plan Documentation
- Phase 1: Immediate service conflict resolution (this week)
- Phase 2: Service migration and load balancing (next 2 weeks)
- Phase 3: Database consolidation and optimization (future)

### 🔧 Current Infrastructure Health
- Docker Swarm: All 6 nodes operational and healthy
- Caddy Reverse Proxy: Fully operational with SSL certificates
- Storage: MergerFS healthy, local storage for databases
- Monitoring: Prometheus + Grafana + Uptime Kuma operational
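The health claims above can be spot-checked against the same endpoints the stack's own healthchecks use. A sketch, assuming the probes run from the manager node and the host-port mappings in the compose file (9091 for Prometheus, 9115 for blackbox-exporter, 3002 for Grafana):

```shell
# Sketch: probe the stack's health endpoints from the manager node.
# Hostnames and ports are assumptions based on the compose port mappings.
urls="http://localhost:9091/-/healthy
http://localhost:9115/-/healthy
http://localhost:3002/api/health"

checked=0
for u in $urls; do
  checked=$((checked + 1))
  # Degrade gracefully when curl is missing or the service is down
  if command -v curl >/dev/null 2>&1 && curl -fsS "$u" >/dev/null 2>&1; then
    echo "OK   $u"
  else
    echo "FAIL $u (service down or curl unavailable)"
  fi
done
```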

### 📊 Container Distribution Status
- OMV800: 25+ containers (needs load balancing)
- lenovo410: 9 containers (cleanup in progress)
- fedora: 1 container (ready for additional services)
- audrey: 4 containers (well-balanced, monitoring hub)
- lenovo420: 7 containers (balanced, can assist)
- surface: 9 containers (specialized, reverse proxy)
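The per-node counts above can be regenerated rather than maintained by hand. A sketch of the tally; the live command requires a manager node, so the demonstration below substitutes sample node names for real output:

```shell
# Sketch: tally running Swarm tasks per node. On a live manager:
#   docker node ps $(docker node ls -q) --filter desired-state=running \
#     --format '{{.Node}}' | sort | uniq -c | sort -rn
# Demonstration on sample node names in place of live output:
tally=$(printf '%s\n' OMV800 OMV800 OMV800 lenovo410 fedora \
  | sort | uniq -c | sort -rn)
echo "$tally"
```

The `sort | uniq -c | sort -rn` pipeline puts the busiest node first, which is what the load-balancing decision needs.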

### 🎯 Next Steps
1. Remove lenovo410 MariaDB (eliminate port 3306 conflict)
2. Clean up lenovo410 Vaultwarden (256MB space savings)
3. Verify no service conflicts exist
4. Begin service migration from OMV800 to fedora/audrey
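Steps 1–3 can be sketched as a single pass on lenovo410. The container names `mariadb` and `vaultwarden` are assumptions — confirm them with `docker ps` before removing anything:

```shell
# Hypothetical Phase 1 cleanup on lenovo410. Container names are
# assumptions; named volumes are left intact by `docker rm`.
if command -v docker >/dev/null 2>&1; then
  # 1. See what currently holds port 3306 before touching anything
  docker ps --filter publish=3306 --format '{{.Names}} {{.Ports}}'
  # 2. Stop and remove the standalone MariaDB (eliminates the port conflict)
  docker stop mariadb && docker rm mariadb
  # 3. Remove the duplicate Vaultwarden, then reclaim its image (~256 MB)
  docker stop vaultwarden && docker rm vaultwarden
  docker image prune -f
  phase1=done
else
  phase1=skipped
  echo "docker not available; run this on lenovo410"
fi
```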

Status: Infrastructure 99% complete, entering cleanup and optimization phase
2025-09-01 16:50:37 -04:00

```yaml
version: '3.9'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/etc/prometheus/console_libraries
      - --web.console.templates=/etc/prometheus/consoles
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle
      - --web.enable-admin-api
    volumes:
      - prometheus_data:/prometheus
      - /opt/configs/monitoring/prometheus-production.yml:/etc/prometheus/prometheus.yml:ro
    networks:
      - monitoring-network
      - caddy-public
    ports:
      - "9091:9090"
    healthcheck:
      test:
        - CMD
        - wget
        - --no-verbose
        - --tries=1
        - --spider
        - http://localhost:9090/-/healthy
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
        reservations:
          memory: 512M
          cpus: '0.25'
      placement:
        constraints:
          - node.role == manager
      labels:
        - caddy.enable=true
        - caddy.http.routers.prometheus.rule=Host(`prometheus.pressmess.duckdns.org`)
        - caddy.http.routers.prometheus.entrypoints=websecure
        - caddy.http.routers.prometheus.tls=true
        - caddy.http.services.prometheus.loadbalancer.server.port=9090

  node-exporter:
    image: prom/node-exporter:v1.6.1
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/host/root
      - --web.listen-address=:9100
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
      - --collector.filesystem.fs-types-exclude=^(sys|proc|auto)fs$$
      - --collector.netdev.device-exclude=^(lo|docker0|veth.*)$$
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host/root:ro
    networks:
      - monitoring-network
    ports:
      - "9100:9100"
    healthcheck:
      # node_exporter does not serve /-/healthy; probe its landing page instead
      test:
        - CMD
        - wget
        - --no-verbose
        - --tries=1
        - --spider
        - http://localhost:9100/
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          memory: 256M
          cpus: '0.25'
        reservations:
          memory: 128M
          cpus: '0.1'
      placement:
        constraints:
          - node.role == manager

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.24.0
    command:
      - --config.file=/etc/blackbox_exporter/blackbox.yml
    volumes:
      - /opt/configs/monitoring/blackbox.yml:/etc/blackbox_exporter/blackbox.yml:ro
    networks:
      - monitoring-network
    ports:
      - "9115:9115"
    healthcheck:
      test:
        - CMD
        - wget
        - --no-verbose
        - --tries=1
        - --spider
        - http://localhost:9115/-/healthy
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          memory: 256M
          cpus: '0.25'
        reservations:
          memory: 128M
          cpus: '0.1'
      placement:
        constraints:
          - node.role == manager

  grafana:
    image: grafana/grafana:10.1.2
    environment:
      # was GF_PROVISIONING_PATH, which Grafana ignores; the [paths] section
      # maps to GF_PATHS_PROVISIONING
      GF_PATHS_PROVISIONING: /etc/grafana/provisioning
      GF_INSTALL_PLUGINS: grafana-clock-panel,grafana-simple-json-datasource,grafana-piechart-panel
      GF_FEATURE_TOGGLES_ENABLE: publicDashboards
      GF_SECURITY_ADMIN_PASSWORD: admin123  # change before exposing publicly
      GF_SERVER_HTTP_PORT: "3001"
    volumes:
      - grafana_data:/var/lib/grafana
      - /opt/configs/monitoring/provisioning/datasources:/etc/grafana/provisioning/datasources:ro
      - /opt/configs/monitoring/provisioning/dashboards:/etc/grafana/provisioning/dashboards:ro
    networks:
      - monitoring-network
      - caddy-public
    ports:
      - "3002:3001"
    healthcheck:
      test:
        - CMD-SHELL
        - curl -f http://localhost:3001/api/health || exit 1
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
        reservations:
          memory: 512M
          cpus: '0.25'
      placement:
        constraints:
          - node.role == manager
      labels:
        - caddy.enable=true
        - caddy.http.routers.grafana.rule=Host(`grafana.pressmess.duckdns.org`)
        - caddy.http.routers.grafana.entrypoints=websecure
        - caddy.http.routers.grafana.tls=true
        - caddy.http.services.grafana.loadbalancer.server.port=3001

volumes:
  prometheus_data:
    driver: local
  grafana_data:
    driver: local

networks:
  monitoring-network:
    external: true
  caddy-public:
    external: true
```
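First-time deployment of this stack can be sketched as follows. The stack name `monitoring` is an assumption; both networks are declared `external: true` in the compose file, so they must exist before the deploy runs:

```shell
# Sketch: deploy the monitoring stack on a Swarm manager.
# The stack name "monitoring" is an assumption.
if command -v docker >/dev/null 2>&1; then
  # External networks must pre-exist; ignore "already exists" errors
  docker network create --driver overlay --attachable monitoring-network 2>/dev/null || true
  docker network create --driver overlay --attachable caddy-public 2>/dev/null || true
  docker stack deploy -c final-monitoring.yml monitoring
  docker stack services monitoring  # confirm replicas reach 1/1
  outcome=deployed
else
  outcome=skipped
  echo "docker not available; run this on a Swarm manager node"
fi
```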