# Monitoring Stack Deployment Guide

## Overview

The HomeAudit monitoring stack collects host metrics and service health checks and visualizes them with four standard open-source components:

- **Prometheus**: Metrics collection and storage
- **Grafana**: Data visualization and dashboards
- **Node Exporter**: System metrics collection
- **Blackbox Exporter**: Service health monitoring

## Architecture

### Components

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Prometheus    │    │     Grafana     │    │  Node Exporter  │
│   (Port 9091)   │    │   (Port 3002)   │    │   (Port 9100)   │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                       ┌────────┴────────┐
                       │Blackbox Exporter│
                       │   (Port 9115)   │
                       └─────────────────┘
```

### Network Configuration

- **Monitoring Network**: Internal communication between components
- **Caddy Public Network**: External access via reverse proxy
- **Host Network Access**: Node Exporter accesses system metrics

## Deployment

### Prerequisites

1. **Docker Swarm**: Initialized and operational
2. **Networks**: `monitoring-network` and `caddy-public` created
3. **Storage**: Persistent volumes for Prometheus and Grafana data
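
One way to confirm the network prerequisite is to compare the existing Docker networks against the two required names. The sketch below assumes a POSIX shell; `missing_networks` is a hypothetical helper, and the `--driver overlay --attachable` options in the usage comment are an assumption about how the networks should be created, not something taken from the stack file.

```bash
# Print each required network name that is absent from the list on stdin.
# stdin: one existing network name per line; arguments: required names.
missing_networks() {
    existing=$(cat)
    for net in "$@"; do
        printf '%s\n' "$existing" | grep -xF -- "$net" >/dev/null \
            || printf '%s\n' "$net"
    done
}

# Example use on a Swarm manager (not executed here):
#   docker network ls --format '{{.Name}}' \
#     | missing_networks monitoring-network caddy-public \
#     | while read -r net; do
#         docker network create --driver overlay --attachable "$net"
#       done
```

An empty result means both networks already exist and the stack can be deployed.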

### Deployment Commands

```bash
# Deploy monitoring stack
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"

# Check service status
ssh root@192.168.50.229 "docker service ls | grep monitoring"

# View service logs
ssh root@192.168.50.229 "docker service logs monitoring_prometheus"
```
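
After `docker stack deploy`, services can take a moment to reach their desired replica count. As a minimal sketch, the hypothetical helper below filters `NAME REPLICAS` lines (the shape produced by `docker service ls --format '{{.Name}} {{.Replicas}}'`) and prints only the services whose running count differs from the desired count.

```bash
# Read "NAME CURRENT/DESIRED" lines on stdin and print the services
# whose running replica count does not match the desired count.
unready_services() {
    awk '{
        split($2, r, "/")
        if (r[1] + 0 != r[2] + 0) print $1, $2
    }'
}

# Example use against the Swarm manager (not executed here):
#   ssh root@192.168.50.229 "docker service ls --format '{{.Name}} {{.Replicas}}'" \
#     | grep '^monitoring_' | unready_services
```

No output means every monitoring service is running at its desired replica count.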

### Service Configuration

#### Prometheus
- **Image**: `prom/prometheus:v2.47.0`
- **Port**: 9091 (external), 9090 (internal)
- **Storage**: 30-day retention
- **Scrape Interval**: 15-60 seconds
- **Configuration**: `/opt/configs/monitoring/prometheus-production.yml`

#### Grafana
- **Image**: `grafana/grafana:10.1.2`
- **Port**: 3002 (external), 3000 (internal)
- **Login**: admin/admin123 (initial credentials; change them after first login)
- **Plugins**: Clock, Simple JSON, Pie Chart
- **Provisioning**: Auto-configured datasources and dashboards

#### Node Exporter
- **Image**: `prom/node-exporter:v1.6.1`
- **Port**: 9100
- **Access**: Host filesystem for system metrics
- **Filters**: Excludes system and container filesystems

#### Blackbox Exporter
- **Image**: `prom/blackbox-exporter:v0.24.0`
- **Port**: 9115
- **Modules**: HTTP, TCP, ICMP health checks
- **Configuration**: `/opt/configs/monitoring/blackbox.yml`

## Metrics Collection

### System Metrics (Node Exporter)

#### CPU Metrics
```promql
# CPU Usage Percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU Load Average (1, 5, and 15 minutes; each is a separate query)
node_load1
node_load5
node_load15
```
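
To make the CPU formula concrete: `node_cpu_seconds_total{mode="idle"}` counts idle seconds, so its per-second rate is the fraction of time spent idle, and subtracting that from 100% gives utilization. The arithmetic for a single core between two samples can be sketched as follows; the sample values and the `cpu_usage_pct` helper are illustrative only.

```bash
# Usage: cpu_usage_pct IDLE_START IDLE_END ELAPSED_SECONDS
# Mirrors 100 - rate(idle) * 100 for one core between two samples.
cpu_usage_pct() {
    awk -v a="$1" -v b="$2" -v dt="$3" \
        'BEGIN { printf "%.1f\n", 100 - ((b - a) / dt) * 100 }'
}

# The idle counter advanced 12 s during a 15 s window: 80% idle, 20% busy.
cpu_usage_pct 1000 1012 15
```

The PromQL expression does the same per core and then averages the idle fraction across cores with `avg by (instance)`.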

#### Memory Metrics
```promql
# Memory Usage Percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Memory Breakdown
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_Cached_bytes
node_memory_Buffers_bytes
```

#### Disk Metrics
```promql
# Disk Usage Percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})) * 100

# Disk I/O
rate(node_disk_io_time_seconds_total[5m])
```

#### Network Metrics
```promql
# Network I/O
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
```

### Service Health Metrics (Blackbox Exporter)

#### HTTP Health Checks
```promql
# Service Availability
probe_success{job="http-service-health"}

# Response Time
probe_duration_seconds{job="http-service-health"}

# HTTP Status Codes
probe_http_status_code{job="http-service-health"}
```

#### TCP Health Checks
```promql
# Database Connectivity
probe_success{job="tcp-service-health"}

# Connection Time
probe_duration_seconds{job="tcp-service-health"}
```

## Dashboards

### Infrastructure Overview Dashboard

**Purpose**: Service health and availability monitoring

**Panels**:
1. **HTTP Service Health Status**: Visual status of web services
2. **TCP Service Health Status**: Database and backend service status
3. **Service Response Time**: Performance tracking over time
4. **HTTP Service Availability Summary**: Count of healthy services
5. **Service Details Table**: Detailed status of all monitored services

**Metrics Used**:
- `probe_success{job="http-service-health"}`
- `probe_success{job="tcp-service-health"}`
- `probe_duration_seconds{job="http-service-health"}`
- `sum(probe_success{job="http-service-health"})`

### System Overview Dashboard

**Purpose**: Comprehensive system resource monitoring

**Panels**:
1. **CPU Usage**: Real-time CPU utilization trends
2. **Memory Usage**: Memory consumption and availability
3. **Disk Usage**: Storage space and I/O monitoring
4. **Network I/O**: Network traffic analysis
5. **System Load**: Load average tracking
6. **System Info**: Hardware and OS information

**Metrics Used**:
- `100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
- `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100`
- `(1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})) * 100`
- `rate(node_network_receive_bytes_total{device!="lo"}[5m])`
- `node_load1`, `node_load5`, `node_load15`

## Configuration Files

### Prometheus Configuration

**File**: `/opt/configs/monitoring/prometheus-production.yml`

**Key Sections**:
- **Global**: Scrape intervals and evaluation settings
- **Scrape Configs**: Target definitions for each monitoring job
- **Relabel Configs**: Metric transformation and labeling
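
For the blackbox jobs, the relabel step is what routes each probe through the exporter: the target address becomes the `target` URL parameter and the scrape itself is redirected to the exporter. The helper below prints one such scrape job as YAML. It is a sketch of the common blackbox relabeling pattern, not a copy of `prometheus-production.yml`; `blackbox_job` is a hypothetical name, the `http://` scheme on the target is assumed, and `blackbox-exporter:9115` is an assumed service address.

```bash
# Usage: blackbox_job JOB_NAME MODULE TARGET...
# Prints a scrape_configs entry that probes each target via the exporter.
blackbox_job() {
    job=$1
    module=$2
    shift 2
    cat <<EOF
  - job_name: '$job'
    metrics_path: /probe
    params:
      module: [$module]
    static_configs:
      - targets:
EOF
    for target in "$@"; do
        printf "          - '%s'\n" "$target"
    done
    cat <<'EOF'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target    # probed URL goes into ?target=
      - source_labels: [__param_target]
        target_label: instance          # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed exporter address
EOF
}

blackbox_job http-service-health http_2xx http://192.168.50.229:8000
```

The printed fragment is indented to sit under `scrape_configs:`; the same helper with `tcp-service-health` and `tcp_connect` would cover the TCP targets.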

### Blackbox Configuration

**File**: `/opt/configs/monitoring/blackbox.yml`

**Modules**:
- **http_2xx**: HTTP health checks with status code validation
- **tcp_connect**: TCP connectivity testing
- **icmp**: Network ping testing
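
The three modules could be defined along these lines. This is a minimal sketch that prints a candidate `blackbox.yml`; the 5-second timeouts are assumed values and `blackbox_modules` is a hypothetical helper, so the deployed file may differ.

```bash
# Print a minimal blackbox exporter config with the three modules.
blackbox_modules() {
    cat <<'EOF'
modules:
  http_2xx:
    prober: http
    timeout: 5s                # assumed value
    http:
      valid_status_codes: []   # empty list accepts any 2xx response
  tcp_connect:
    prober: tcp
    timeout: 5s                # assumed value
  icmp:
    prober: icmp
    timeout: 5s                # assumed value
EOF
}

blackbox_modules
```

ICMP probes generally require the exporter container to be allowed to open raw sockets, which is worth checking if `icmp` results always fail.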

### Grafana Configuration

**Datasources**: `/opt/configs/monitoring/provisioning/datasources/`
- Auto-configured Prometheus data source

**Dashboards**: `/opt/configs/monitoring/provisioning/dashboards/`
- Infrastructure Overview dashboard
- System Overview dashboard
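
A provisioned Prometheus datasource is usually a short YAML file in the datasources directory. The sketch below prints one; `grafana_datasource` is a hypothetical helper, and `http://prometheus:9090` assumes Grafana reaches Prometheus by its service name on the monitoring network using the internal port.

```bash
# Usage: grafana_datasource PROMETHEUS_URL
# Prints a Grafana datasource provisioning file for Prometheus.
grafana_datasource() {
    cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $1
    isDefault: true
EOF
}

grafana_datasource http://prometheus:9090   # assumed in-network address
```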

## Monitoring Targets

### Current Targets (15 total)

#### Service Health (6 targets)
- Paperless-NGX (192.168.50.229:8000)
- Paperless-AI (192.168.50.229:3000)
- Nextcloud (192.168.50.229:8081)
- Home Assistant (192.168.50.181:8123)
- Portainer (192.168.50.181:9000)
- AppFlowy (192.168.50.66:9080)

#### Database and Broker Health (4 targets)
- Redis (192.168.50.229:6379)
- PostgreSQL (192.168.50.229:5432)
- MariaDB (192.168.50.229:3306)
- Mosquitto (192.168.50.229:1883)

#### System Monitoring (5 targets)
- Prometheus (192.168.50.229:9091)
- Grafana (192.168.50.229:3002)
- Node Exporter (192.168.50.229:9100)
- Blackbox Exporter (192.168.50.229:9115)
- Prometheus Internal (localhost:9090)

## Performance Characteristics

### Data Collection
- **Total Metrics**: 784 different metrics
- **Scrape Intervals**: 15-60 seconds per job
- **Data Retention**: 30 days
- **Storage**: Local persistent volumes
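
Disk needs for the 30-day retention can be estimated with the rule of thumb from the Prometheus documentation: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2). The series count below is a placeholder, not a measurement from this stack, and the number of active series is generally larger than the number of metric names.

```bash
# Usage: prom_disk_bytes RETENTION_DAYS ACTIVE_SERIES SCRAPE_INTERVAL_S BYTES_PER_SAMPLE
# Estimate = retention seconds * (series / interval) * bytes per sample.
prom_disk_bytes() {
    awk -v days="$1" -v series="$2" -v interval="$3" -v bps="$4" \
        'BEGIN { printf "%.0f\n", days * 86400 * series * bps / interval }'
}

# Hypothetical: 10000 active series scraped every 15 s at 2 bytes per sample.
prom_disk_bytes 30 10000 15 2
```

With those placeholder inputs the estimate is about 3.5 GB; the real series count can be read from Prometheus itself before sizing the volume.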

### Resource Usage
- **Prometheus**: 1GB memory, 0.5 CPU cores
- **Grafana**: 1GB memory, 0.5 CPU cores
- **Node Exporter**: 256MB memory, 0.25 CPU cores
- **Blackbox Exporter**: 256MB memory, 0.25 CPU cores

### Network Impact
- **Internal Traffic**: Minimal (monitoring network)
- **External Access**: Via Caddy reverse proxy only
- **Data Transfer**: Compressed metrics over HTTP

## Troubleshooting

### Common Issues

#### Service Not Starting
```bash
# Check service status
docker service ls | grep monitoring

# View service logs
docker service logs monitoring_prometheus
docker service logs monitoring_grafana
```

#### Metrics Not Collecting
```bash
# Check Prometheus targets
curl "http://192.168.50.229:9091/api/v1/targets"

# Test individual exporters
curl -s "http://192.168.50.229:9100/metrics" | head -10
curl -s "http://192.168.50.229:9115/metrics" | head -10
```
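
The targets response is verbose, so a per-state count is often easier to read. The helper below is a rough text-based sketch that assumes the compact JSON form Prometheus returns (`"health":"up"` without spaces); `target_health_counts` is a hypothetical name, and a JSON-aware tool is the more robust choice where one is available.

```bash
# Read a /api/v1/targets response on stdin and print "state=count" lines.
target_health_counts() {
    grep -o '"health":"[a-z]*"' \
        | cut -d '"' -f 4 \
        | sort \
        | uniq -c \
        | awk '{ print $2 "=" $1 }'
}

# Example use (not executed here):
#   curl -s "http://192.168.50.229:9091/api/v1/targets" | target_health_counts
```

Any `down` or `unknown` count points to a target whose `lastError` field in the full response explains the failure.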

#### Dashboard Not Loading
```bash
# Check Grafana logs
docker service logs monitoring_grafana

# Verify datasource configuration
curl "http://192.168.50.229:3002/api/datasources" -u admin:admin123
```

### Health Checks

#### Prometheus Health
```bash
curl "http://192.168.50.229:9091/-/healthy"
```

#### Grafana Health
```bash
curl "http://192.168.50.229:3002/api/health"
```

#### Node Exporter Health
```bash
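# Node Exporter has no /-/healthy endpoint; a successful /metrics scrape is the liveness signal
curl -fsS "http://192.168.50.229:9100/metrics" | head -5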
```

#### Blackbox Exporter Health
```bash
curl "http://192.168.50.229:9115/-/healthy"
```
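
The individual checks above can be combined into one pass. The sketch below takes `name=url` pairs and reports each as OK or FAIL; `check_endpoints` and the overridable `PROBE` variable are hypothetical, and the curl options are one reasonable choice rather than a requirement.

```bash
# Usage: check_endpoints NAME=URL...
# Prints "NAME OK" or "NAME FAIL" per endpoint; returns 1 if any check failed.
# PROBE may be overridden (for example for a dry run); by default it is curl.
check_endpoints() {
    failed=0
    for pair in "$@"; do
        name=${pair%%=*}    # text before the first "="
        url=${pair#*=}      # text after the first "="
        if ${PROBE:-curl -fsS -o /dev/null --max-time 5} "$url"; then
            printf '%s OK\n' "$name"
        else
            printf '%s FAIL\n' "$name"
            failed=1
        fi
    done
    return "$failed"
}

# Example use (not executed here):
#   check_endpoints \
#     prometheus=http://192.168.50.229:9091/-/healthy \
#     grafana=http://192.168.50.229:3002/api/health \
#     node-exporter=http://192.168.50.229:9100/metrics \
#     blackbox=http://192.168.50.229:9115/-/healthy
```

The non-zero return status makes the function usable from cron jobs or other scripts that should react to a failed check.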

## Maintenance

### Regular Tasks

#### Update Configurations
```bash
# Copy updated configs
scp configs/monitoring/*.yml root@192.168.50.229:/opt/configs/monitoring/
scp configs/monitoring/dashboards/*.json root@192.168.50.229:/opt/configs/monitoring/provisioning/dashboards/

# Redeploy stack
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"
```

#### Backup Monitoring Data
```bash
# Single quotes defer $(pwd) and $(date ...) to the remote host, so the
# archive is written to the remote working directory, not a local path.

# Backup Prometheus data
ssh root@192.168.50.229 'docker run --rm -v monitoring_prometheus_data:/data -v "$(pwd)":/backup alpine tar czf "/backup/prometheus_backup_$(date +%Y%m%d).tar.gz" -C /data .'

# Backup Grafana data
ssh root@192.168.50.229 'docker run --rm -v monitoring_grafana_data:/data -v "$(pwd)":/backup alpine tar czf "/backup/grafana_backup_$(date +%Y%m%d).tar.gz" -C /data .'
```

#### Clean Up Old Data
```bash
# Prometheus automatically manages retention (30 days)
# Grafana data is persistent and should be backed up regularly
```

### Scaling Considerations

#### Horizontal Scaling
- **Prometheus**: Can be scaled with remote storage (Thanos/Cortex)
- **Grafana**: Can be scaled with external database
- **Node Exporter**: One per host (already optimal)
- **Blackbox Exporter**: Can be scaled for high-frequency checks

#### Vertical Scaling
- **Memory**: Increase limits for high-metric environments
- **CPU**: Adjust based on scrape frequency and target count
- **Storage**: Expand volumes for longer retention

## Security Considerations

### Network Security
- **Internal Communication**: Isolated monitoring network
- **External Access**: HTTPS-only via Caddy
- **Authentication**: Grafana login required for dashboard access

### Data Security
- **Metrics**: No sensitive data in metrics
- **Logs**: Monitor for sensitive information
- **Backups**: Encrypt backup files

### Access Control
- **Grafana**: Admin user; replace the initial `admin123` password with a strong one
- **Prometheus**: Read-only access via web interface
- **Exporters**: No authentication, so keep ports 9100 and 9115 reachable from trusted networks only

## Future Enhancements

### Planned Improvements
1. **AlertManager**: Add alerting and notification system
2. **cAdvisor**: Container resource monitoring
3. **Application Exporters**: Service-specific metrics
4. **Centralized Logging**: Log aggregation with Loki

### Optional Enhancements
1. **Distributed Tracing**: Request flow tracking
2. **APM**: Application performance monitoring
3. **Synthetic Monitoring**: User journey testing
4. **Automated Incident Response**: Self-healing capabilities

---

- **Last Updated**: August 30, 2025
- **Version**: 1.0
- **Status**: Production Ready