# Monitoring Stack Deployment Guide

## Overview

The HomeAudit monitoring stack provides comprehensive infrastructure monitoring using industry-standard tools:

- **Prometheus**: Metrics collection and storage
- **Grafana**: Data visualization and dashboards
- **Node Exporter**: System metrics collection
- **Blackbox Exporter**: Service health monitoring

## Architecture

### Components

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Prometheus    │     │     Grafana     │     │  Node Exporter  │
│   (Port 9091)   │     │   (Port 3002)   │     │   (Port 9100)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                       ┌───────────────────┐
                       │ Blackbox Exporter │
                       │    (Port 9115)    │
                       └───────────────────┘
```

### Network Configuration

- **Monitoring Network**: Internal communication between components
- **Caddy Public Network**: External access via reverse proxy
- **Host Network Access**: Node Exporter accesses system metrics

## Deployment

### Prerequisites

1. **Docker Swarm**: Initialized and operational
2. **Networks**: `monitoring-network` and `caddy-public` created
3. **Storage**: Persistent volumes for Prometheus and Grafana data

### Deployment Commands

```bash
# Deploy monitoring stack
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"

# Check service status
ssh root@192.168.50.229 "docker service ls | grep monitoring"

# View service logs
ssh root@192.168.50.229 "docker service logs monitoring_prometheus"
```

### Service Configuration

#### Prometheus

- **Image**: `prom/prometheus:v2.47.0`
- **Port**: 9091 (external), 9090 (internal)
- **Storage**: 30-day retention
- **Scrape Interval**: 15-60 seconds
- **Configuration**: `/opt/configs/monitoring/prometheus-production.yml`

#### Grafana

- **Image**: `grafana/grafana:10.1.2`
- **Port**: 3002 (external), 3000 (internal)
- **Login**: admin/admin123
- **Plugins**: Clock, Simple JSON, Pie Chart
- **Provisioning**: Auto-configured datasources and dashboards

#### Node Exporter

- **Image**: `prom/node-exporter:v1.6.1`
- **Port**: 9100
- **Access**: Host filesystem for system metrics
- **Filters**: Excludes system and container filesystems

#### Blackbox Exporter

- **Image**: `prom/blackbox-exporter:v0.24.0`
- **Port**: 9115
- **Modules**: HTTP, TCP, ICMP health checks
- **Configuration**: `/opt/configs/monitoring/blackbox.yml`

## Metrics Collection

### System Metrics (Node Exporter)

#### CPU Metrics

```promql
# CPU Usage Percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU Load Average
node_load1
node_load5
node_load15
```

#### Memory Metrics

```promql
# Memory Usage Percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Memory Breakdown
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_Cached_bytes
node_memory_Buffers_bytes
```

#### Disk Metrics

```promql
# Disk Usage Percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})) * 100

# Disk I/O
rate(node_disk_io_time_seconds_total[5m])
```

#### Network Metrics

```promql
# Network I/O
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
```

### Service Health Metrics (Blackbox Exporter)

#### HTTP Health Checks

```promql
# Service Availability
probe_success{job="http-service-health"}

# Response Time
probe_duration_seconds{job="http-service-health"}

# HTTP Status Codes
probe_http_status_code{job="http-service-health"}
```

#### TCP Health Checks

```promql
# Database Connectivity
probe_success{job="tcp-service-health"}

# Connection Time
probe_duration_seconds{job="tcp-service-health"}
```

## Dashboards

### Infrastructure Overview Dashboard

**Purpose**: Service health and availability monitoring

**Panels**:

1. **HTTP Service Health Status**: Visual status of web services
2. **TCP Service Health Status**: Database and backend service status
3. **Service Response Time**: Performance tracking over time
4. **HTTP Service Availability Summary**: Count of healthy services
5. **Service Details Table**: Detailed status of all monitored services

**Metrics Used**:

- `probe_success{job="http-service-health"}`
- `probe_success{job="tcp-service-health"}`
- `probe_duration_seconds{job="http-service-health"}`
- `sum(probe_success{job="http-service-health"})`

### System Overview Dashboard

**Purpose**: Comprehensive system resource monitoring

**Panels**:

1. **CPU Usage**: Real-time CPU utilization trends
2. **Memory Usage**: Memory consumption and availability
3. **Disk Usage**: Storage space and I/O monitoring
4. **Network I/O**: Network traffic analysis
5. **System Load**: Load average tracking
6. **System Info**: Hardware and OS information

**Metrics Used**:

- `100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
- `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100`
- `(1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})) * 100`
- `rate(node_network_receive_bytes_total{device!="lo"}[5m])`
- `node_load1`, `node_load5`, `node_load15`

## Configuration Files

### Prometheus Configuration

**File**: `/opt/configs/monitoring/prometheus-production.yml`

**Key Sections**:

- **Global**: Scrape intervals and evaluation settings
- **Scrape Configs**: Target definitions for each monitoring job
- **Relabel Configs**: Metric transformation and labeling

### Blackbox Configuration

**File**: `/opt/configs/monitoring/blackbox.yml`

**Modules**:

- **http_2xx**: HTTP health checks with status code validation
- **tcp_connect**: TCP connectivity testing
- **icmp**: Network ping testing

### Grafana Configuration

**Datasources**: `/opt/configs/monitoring/provisioning/datasources/`

- Auto-configured Prometheus data source

**Dashboards**: `/opt/configs/monitoring/provisioning/dashboards/`

- Infrastructure Overview dashboard
- System Overview dashboard

## Monitoring Targets

### Current Targets (15 total)

#### Service Health (6 targets)

- Paperless-NGX (192.168.50.229:8000)
- Paperless-AI (192.168.50.229:3000)
- Nextcloud (192.168.50.229:8081)
- Home Assistant (192.168.50.181:8123)
- Portainer (192.168.50.181:9000)
- AppFlowy (192.168.50.66:9080)

#### Database Health (4 targets)

- Redis (192.168.50.229:6379)
- PostgreSQL (192.168.50.229:5432)
- MariaDB (192.168.50.229:3306)
- Mosquitto (192.168.50.229:1883)

#### System Monitoring (5 targets)

- Prometheus (192.168.50.229:9091)
- Grafana (192.168.50.229:3002)
- Node Exporter (192.168.50.229:9100)
- Blackbox Exporter (192.168.50.229:9115)
- Prometheus Internal (localhost:9090)

## Performance Characteristics

### Data Collection

- **Total Metrics**: 784 different metrics
- **Scrape Intervals**: 15-60 seconds per job
- **Data Retention**: 30 days
- **Storage**: Local persistent volumes

### Resource Usage

- **Prometheus**: 1GB memory, 0.5 CPU cores
- **Grafana**: 1GB memory, 0.5 CPU cores
- **Node Exporter**: 256MB memory, 0.25 CPU cores
- **Blackbox Exporter**: 256MB memory, 0.25 CPU cores

### Network Impact

- **Internal Traffic**: Minimal (monitoring network)
- **External Access**: Via Caddy reverse proxy only
- **Data Transfer**: Compressed metrics over HTTP

## Troubleshooting

### Common Issues

#### Service Not Starting

```bash
# Check service status
docker service ls | grep monitoring

# View service logs
docker service logs monitoring_prometheus
docker service logs monitoring_grafana
```

#### Metrics Not Collecting

```bash
# Check Prometheus targets
curl "http://192.168.50.229:9091/api/v1/targets"

# Test individual exporters
curl "http://192.168.50.229:9100/metrics" | head -10
curl "http://192.168.50.229:9115/metrics" | head -10
```

#### Dashboard Not Loading

```bash
# Check Grafana logs
docker service logs monitoring_grafana

# Verify datasource configuration
curl "http://192.168.50.229:3002/api/datasources" -u admin:admin123
```

### Health Checks

#### Prometheus Health

```bash
curl "http://192.168.50.229:9091/-/healthy"
```

#### Grafana Health

```bash
curl "http://192.168.50.229:3002/api/health"
```

#### Node Exporter Health

```bash
# Node Exporter has no dedicated health endpoint; a successful /metrics response shows it is up
curl -fsS "http://192.168.50.229:9100/metrics" | head -5
```

#### Blackbox Exporter Health

```bash
curl "http://192.168.50.229:9115/-/healthy"
```

## Maintenance

### Regular Tasks

#### Update Configurations

```bash
# Copy updated configs
scp configs/monitoring/*.yml root@192.168.50.229:/opt/configs/monitoring/
scp configs/monitoring/dashboards/*.json root@192.168.50.229:/opt/configs/monitoring/provisioning/dashboards/

# Redeploy stack
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"
```

#### Backup Monitoring Data

```bash
# Single quotes make $(pwd) and $(date) expand on the remote host,
# so each archive is written to the remote working directory.

# Backup Prometheus data
ssh root@192.168.50.229 'docker run --rm -v monitoring_prometheus_data:/data -v "$(pwd)":/backup alpine tar czf /backup/prometheus_backup_$(date +%Y%m%d).tar.gz -C /data .'

# Backup Grafana data
ssh root@192.168.50.229 'docker run --rm -v monitoring_grafana_data:/data -v "$(pwd)":/backup alpine tar czf /backup/grafana_backup_$(date +%Y%m%d).tar.gz -C /data .'
```

#### Clean Up Old Data

```bash
# Prometheus automatically manages retention (30 days)
# Grafana data is persistent and should be backed up regularly
```

### Scaling Considerations

#### Horizontal Scaling

- **Prometheus**: Can be scaled with remote storage (Thanos/Cortex)
- **Grafana**: Can be scaled with external database
- **Node Exporter**: One per host (already optimal)
- **Blackbox Exporter**: Can be scaled for high-frequency checks

#### Vertical Scaling

- **Memory**: Increase limits for high-metric environments
- **CPU**: Adjust based on scrape frequency and target count
- **Storage**: Expand volumes for longer retention

## Security Considerations

### Network Security

- **Internal Communication**: Isolated monitoring network
- **External Access**: HTTPS-only via Caddy
- **Authentication**: Grafana login required for dashboard access

### Data Security

- **Metrics**: No sensitive data in metrics
- **Logs**: Monitor for sensitive information
- **Backups**: Encrypt backup files

### Access Control

- **Grafana**: Admin user with a strong password (replace the `admin123` login listed under Service Configuration)
- **Prometheus**: Read-only access via web interface
- **Exporters**: No authentication (internal network only)

## Future Enhancements

### Planned Improvements

1. **AlertManager**: Add alerting and notification system
2. **cAdvisor**: Container resource monitoring
3. **Application Exporters**: Service-specific metrics
4. **Centralized Logging**: Log aggregation with Loki

### Optional Enhancements

1. **Distributed Tracing**: Request flow tracking
2. **APM**: Application performance monitoring
3. **Synthetic Monitoring**: User journey testing
4. **Automated Incident Response**: Self-healing capabilities

---

**Last Updated**: August 30, 2025

**Version**: 1.0

**Status**: Production Ready
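
The individual checks in the Health Checks section can be run in one pass. The script below is a minimal sketch, not part of the deployed stack: the function names `health_url` and `check_all` and the script name `check-monitoring.sh` are hypothetical, the ports and paths are the ones listed in this guide, and Node Exporter is probed through the `/metrics` endpoint already used under Metrics Not Collecting.

```bash
#!/usr/bin/env bash
# Sketch: run the health checks from this guide against one host.
# health_url, check_all and the script name are illustrative assumptions.

# Print the health URL for a component on the given host.
# Ports and paths come from this guide; Node Exporter is probed via /metrics.
health_url() {
  local host="$1" component="$2"
  case "$component" in
    prometheus) echo "http://${host}:9091/-/healthy" ;;
    grafana)    echo "http://${host}:3002/api/health" ;;
    node)       echo "http://${host}:9100/metrics" ;;
    blackbox)   echo "http://${host}:9115/-/healthy" ;;
    *)          return 1 ;;
  esac
}

# Probe every component; return non-zero if any check fails.
check_all() {
  local host="$1" failed=0 component url
  for component in prometheus grafana node blackbox; do
    url="$(health_url "$host" "$component")"
    # -f: treat HTTP errors as failures, -sS: quiet but report errors,
    # --max-time: give up after 5 seconds instead of hanging
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   ${component} (${url})"
    else
      echo "FAIL ${component} (${url})"
      failed=1
    fi
  done
  return "$failed"
}

# Probe only when a host is passed, e.g. ./check-monitoring.sh 192.168.50.229
if [ "$#" -ge 1 ]; then
  check_all "$1"
fi
```

Run as `./check-monitoring.sh 192.168.50.229`, it prints one `OK` or `FAIL` line per component and exits non-zero if any probe fails, so it could be called from cron or a CI job.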