
# Monitoring Stack Deployment Guide
## Overview
The HomeAudit monitoring stack tracks host resources and service availability with four components:
- **Prometheus**: Metrics collection and storage
- **Grafana**: Data visualization and dashboards
- **Node Exporter**: System metrics collection
- **Blackbox Exporter**: Service health monitoring
## Architecture
### Components
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Prometheus    │     │     Grafana     │     │  Node Exporter  │
│   (Port 9091)   │     │   (Port 3002)   │     │   (Port 9100)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                        ┌─────────────────┐
                        │Blackbox Exporter│
                        │   (Port 9115)   │
                        └─────────────────┘
```
### Network Configuration
- **Monitoring Network**: Internal communication between components
- **Caddy Public Network**: External access via reverse proxy
- **Host Network Access**: Node Exporter accesses system metrics
## Deployment
### Prerequisites
1. **Docker Swarm**: Initialized and operational
2. **Networks**: `monitoring-network` and `caddy-public` created
3. **Storage**: Persistent volumes for Prometheus and Grafana data
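
The network prerequisite can be checked before deploying. The following is a minimal sketch, not part of the original stack: `missing_networks` is a hypothetical helper that reads existing network names on stdin and prints any required network that is absent.

```bash
# Print each required network that is absent from the list on stdin.
# Arguments: the network names that must exist.
missing_networks() {
  existing="$(cat)"            # existing network names, one per line
  for net in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string
    if ! printf '%s\n' "$existing" | grep -qxF -- "$net"; then
      printf '%s\n' "$net"
    fi
  done
}
```

On the Swarm manager this could be fed from `docker network ls --format '{{.Name}}' | missing_networks monitoring-network caddy-public`. A missing network would then be created with something like `docker network create --driver overlay --attachable <name>`; the overlay driver and the `--attachable` flag are assumptions about how the stack's networks were defined.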
### Deployment Commands
```bash
# Deploy monitoring stack
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"
# Check service status
ssh root@192.168.50.229 "docker service ls | grep monitoring"
# View service logs
ssh root@192.168.50.229 "docker service logs monitoring_prometheus"
```
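After a deploy it can help to confirm that every service reached its desired replica count. This is a sketch under the assumption that the input lines look like `<name> <running>/<desired>`; `unready_services` is a hypothetical helper, not something shipped with the stack.

```bash
# Read "<name> <running>/<desired>" lines on stdin and print every service
# whose running replica count differs from the desired count.
unready_services() {
  awk '{
    split($2, r, "/")          # r[1] = running, r[2] = desired
    if (r[1] != r[2]) print $1
  }'
}
```

One way to produce that input is `docker service ls --filter name=monitoring --format '{{.Name}} {{.Replicas}}' | unready_services`. Empty output means all listed services are at their desired count.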
### Service Configuration
#### Prometheus
- **Image**: `prom/prometheus:v2.47.0`
- **Port**: 9091 (external), 9090 (internal)
- **Storage**: 30-day retention
- **Scrape Interval**: 15-60 seconds
- **Configuration**: `/opt/configs/monitoring/prometheus-production.yml`
#### Grafana
- **Image**: `grafana/grafana:10.1.2`
- **Port**: 3002 (external), 3000 (internal)
- **Login**: admin/admin123
- **Plugins**: Clock, Simple JSON, Pie Chart
- **Provisioning**: Auto-configured datasources and dashboards
#### Node Exporter
- **Image**: `prom/node-exporter:v1.6.1`
- **Port**: 9100
- **Access**: Host filesystem for system metrics
- **Filters**: Excludes system and container filesystems
#### Blackbox Exporter
- **Image**: `prom/blackbox-exporter:v0.24.0`
- **Port**: 9115
- **Modules**: HTTP, TCP, ICMP health checks
- **Configuration**: `/opt/configs/monitoring/blackbox.yml`
## Metrics Collection
### System Metrics (Node Exporter)
#### CPU Metrics
```promql
# CPU Usage Percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU Load Average (1, 5, and 15 minutes; run each as a separate query)
node_load1
node_load5
node_load15
```
#### Memory Metrics
```promql
# Memory Usage Percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Memory Breakdown
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_Cached_bytes
node_memory_Buffers_bytes
```
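To sanity-check the memory usage percentage against a host, the same arithmetic can be reproduced locally. The sketch below assumes a Linux `/proc/meminfo`-style file; both function names are illustrative. Because the formula is a ratio, it does not matter that the file reports kB while the exporter reports bytes.

```bash
# Compute (1 - available / total) * 100, rounded to one decimal place.
mem_used_pct() {
  awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Read MemAvailable and MemTotal from a meminfo-style file and apply the formula.
mem_used_pct_from_file() {
  avail="$(awk '/^MemAvailable:/ { print $2 }' "$1")"
  total="$(awk '/^MemTotal:/ { print $2 }' "$1")"
  mem_used_pct "$avail" "$total"
}
```

For example, `mem_used_pct 2000 8000` prints `75.0`, and `mem_used_pct_from_file /proc/meminfo` gives a figure to compare with the dashboard value.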
#### Disk Metrics
```promql
# Disk Usage Percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})) * 100
# Disk I/O
rate(node_disk_io_time_seconds_total[5m])
```
#### Network Metrics
```promql
# Network I/O
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
```
### Service Health Metrics (Blackbox Exporter)
#### HTTP Health Checks
```promql
# Service Availability
probe_success{job="http-service-health"}
# Response Time
probe_duration_seconds{job="http-service-health"}
# HTTP Status Codes
probe_http_status_code{job="http-service-health"}
```
#### TCP Health Checks
```promql
# Database Connectivity
probe_success{job="tcp-service-health"}
# Connection Time
probe_duration_seconds{job="tcp-service-health"}
```
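The `probe_*` metrics come from the Blackbox Exporter's `/probe` endpoint, which takes `module` and `target` query parameters. A single probe can be run by hand to see what Prometheus would scrape. The helper names below are hypothetical, and the sketch assumes the target contains no characters that need URL encoding.

```bash
# Build the URL for a one-off probe through the Blackbox Exporter.
# $1 = exporter host:port, $2 = module name, $3 = target to probe
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}

# Run the probe and print the value of the probe_success sample
# (1 when the probe succeeded, 0 when it failed).
probe_success_value() {
  curl -s "$(probe_url "$1" "$2" "$3")" | awk '$1 == "probe_success" { print $2 }'
}
```

For instance, `probe_success_value 192.168.50.229:9115 http_2xx 192.168.50.229:8000` would probe Paperless-NGX through the exporter listed in this guide.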
## Dashboards
### Infrastructure Overview Dashboard
**Purpose**: Service health and availability monitoring

**Panels**:
1. **HTTP Service Health Status**: Visual status of web services
2. **TCP Service Health Status**: Database and backend service status
3. **Service Response Time**: Performance tracking over time
4. **HTTP Service Availability Summary**: Count of healthy services
5. **Service Details Table**: Detailed status of all monitored services

**Metrics Used**:
- `probe_success{job="http-service-health"}`
- `probe_success{job="tcp-service-health"}`
- `probe_duration_seconds{job="http-service-health"}`
- `sum(probe_success{job="http-service-health"})`
### System Overview Dashboard
**Purpose**: Comprehensive system resource monitoring

**Panels**:
1. **CPU Usage**: Real-time CPU utilization trends
2. **Memory Usage**: Memory consumption and availability
3. **Disk Usage**: Storage space and I/O monitoring
4. **Network I/O**: Network traffic analysis
5. **System Load**: Load average tracking
6. **System Info**: Hardware and OS information

**Metrics Used**:
- `100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
- `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100`
- `(1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})) * 100`
- `rate(node_network_receive_bytes_total{device!="lo"}[5m])`
- `node_load1`, `node_load5`, `node_load15`
## Configuration Files
### Prometheus Configuration
**File**: `/opt/configs/monitoring/prometheus-production.yml`

**Key Sections**:
- **Global**: Scrape intervals and evaluation settings
- **Scrape Configs**: Target definitions for each monitoring job
- **Relabel Configs**: Metric transformation and labeling
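
The three sections fit together most visibly in a Blackbox probe job, where relabeling redirects each scrape to the exporter. The sketch below writes an illustrative job to a file; it is not the contents of `prometheus-production.yml`. The 30-second interval, the single target, and the `blackbox-exporter:9115` service address are assumptions, and `write_sample_scrape_config` is a hypothetical helper.

```bash
# Write an illustrative scrape job for HTTP probes to the file named in $1.
write_sample_scrape_config() {
  cat > "$1" <<'EOF'
scrape_configs:
  - job_name: http-service-health
    scrape_interval: 30s                 # assumed; the guide states 15-60 seconds
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://192.168.50.229:8000   # Paperless-NGX
    relabel_configs:
      # Pass the original target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the Blackbox Exporter (assumed service name)
      - target_label: __address__
        replacement: blackbox-exporter:9115
EOF
}
```

Where `promtool` is available, a file that also contains a `global` section can be validated with `promtool check config <file>` before it is copied to the host.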
### Blackbox Configuration
**File**: `/opt/configs/monitoring/blackbox.yml`

**Modules**:
- **http_2xx**: HTTP health checks with status code validation
- **tcp_connect**: TCP connectivity testing
- **icmp**: Network ping testing
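
A module file with these three modules could look roughly like the following. This is a sketch with assumed timeouts, not the deployed `blackbox.yml`, and `write_sample_blackbox_config` is a hypothetical helper.

```bash
# Write an illustrative Blackbox Exporter module file to the path in $1.
write_sample_blackbox_config() {
  cat > "$1" <<'EOF'
modules:
  http_2xx:
    prober: http
    timeout: 5s                 # assumed value
    http:
      valid_status_codes: []    # empty list falls back to the default, any 2xx
  tcp_connect:
    prober: tcp
    timeout: 5s                 # assumed value
  icmp:
    prober: icmp
    timeout: 5s                 # assumed value
EOF
}
```

The module names match the `module` parameter that Prometheus passes on each probe, so a renamed module must be renamed in the scrape configuration as well.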
### Grafana Configuration
**Datasources**: `/opt/configs/monitoring/provisioning/datasources/`
- Auto-configured Prometheus data source

**Dashboards**: `/opt/configs/monitoring/provisioning/dashboards/`
- Infrastructure Overview dashboard
- System Overview dashboard
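
A provisioned datasource is a small YAML file in the datasources directory. The sketch below shows the general shape only: the datasource name and the `http://prometheus:9090` URL are assumptions (9090 being the internal Prometheus port noted earlier), and `write_sample_datasource` is a hypothetical helper.

```bash
# Write an illustrative Grafana datasource provisioning file to the path in $1.
write_sample_datasource() {
  cat > "$1" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumed service name on the monitoring network
    isDefault: true
EOF
}
```

Grafana reads provisioning files at startup, so a change here normally takes effect after the Grafana service restarts.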
## Monitoring Targets
### Current Targets (15 total)
#### Service Health (6 targets)
- Paperless-NGX (192.168.50.229:8000)
- Paperless-AI (192.168.50.229:3000)
- Nextcloud (192.168.50.229:8081)
- Home Assistant (192.168.50.181:8123)
- Portainer (192.168.50.181:9000)
- AppFlowy (192.168.50.66:9080)
#### Database Health (4 targets)
- Redis (192.168.50.229:6379)
- PostgreSQL (192.168.50.229:5432)
- MariaDB (192.168.50.229:3306)
- Mosquitto MQTT broker (192.168.50.229:1883)
#### System Monitoring (5 targets)
- Prometheus (192.168.50.229:9091)
- Grafana (192.168.50.229:3002)
- Node Exporter (192.168.50.229:9100)
- Blackbox Exporter (192.168.50.229:9115)
- Prometheus Internal (localhost:9090)
## Performance Characteristics
### Data Collection
- **Total Metrics**: 784 different metrics
- **Scrape Intervals**: 15-60 seconds per job
- **Data Retention**: 30 days
- **Storage**: Local persistent volumes
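
The scrape interval and retention together determine how many samples each time series holds. As a rough worked example, the hypothetical helper below computes samples per series; it says nothing about total disk use, which also depends on the number of series and on compression.

```bash
# Samples stored per time series over the retention window.
# $1 = retention in days, $2 = scrape interval in seconds
samples_per_series() {
  echo $(( $1 * 86400 / $2 ))
}
```

With 30-day retention, `samples_per_series 30 15` gives 172800 and `samples_per_series 30 60` gives 43200, so the fastest jobs store four times as many samples per series as the slowest.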
### Resource Usage
- **Prometheus**: 1GB memory, 0.5 CPU cores
- **Grafana**: 1GB memory, 0.5 CPU cores
- **Node Exporter**: 256MB memory, 0.25 CPU cores
- **Blackbox Exporter**: 256MB memory, 0.25 CPU cores
### Network Impact
- **Internal Traffic**: Minimal (monitoring network)
- **External Access**: Via Caddy reverse proxy only
- **Data Transfer**: Compressed metrics over HTTP
## Troubleshooting
### Common Issues
#### Service Not Starting
```bash
# Check service status
docker service ls | grep monitoring
# View service logs
docker service logs monitoring_prometheus
docker service logs monitoring_grafana
```
#### Metrics Not Collecting
```bash
# Check Prometheus targets
curl "http://192.168.50.229:9091/api/v1/targets"
# Test individual exporters
curl "http://192.168.50.229:9100/metrics" | head -10
curl "http://192.168.50.229:9115/metrics" | head -10
```
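The targets response is a large JSON document, so a quick count of healthy and unhealthy targets is often easier to read. This sketch uses only `grep` and `wc` and assumes each active target carries a `"health"` field with the value `up` or `down`; targets in any other state, such as `unknown`, are not counted. `count_target_health` is a hypothetical helper.

```bash
# Count active targets by health state from /api/v1/targets JSON on stdin.
# Prints "<up> <down>".
count_target_health() {
  json="$(cat)"
  up="$(printf '%s' "$json" | grep -oE '"health"[[:space:]]*:[[:space:]]*"up"' | wc -l | tr -d ' ')"
  down="$(printf '%s' "$json" | grep -oE '"health"[[:space:]]*:[[:space:]]*"down"' | wc -l | tr -d ' ')"
  echo "$up $down"
}
```

Usage: `curl -s "http://192.168.50.229:9091/api/v1/targets" | count_target_health`. A non-zero second number points to targets worth inspecting in the full response.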
#### Dashboard Not Loading
```bash
# Check Grafana logs
docker service logs monitoring_grafana
# Verify datasource configuration
curl "http://192.168.50.229:3002/api/datasources" -u admin:admin123
```
### Health Checks
#### Prometheus Health
```bash
curl "http://192.168.50.229:9091/-/healthy"
```
#### Grafana Health
```bash
curl "http://192.168.50.229:3002/api/health"
```
#### Node Exporter Health
```bash
# Node Exporter is checked through its metrics endpoint
curl -sf "http://192.168.50.229:9100/metrics" > /dev/null && echo "Node Exporter OK"
```
#### Blackbox Exporter Health
```bash
curl "http://192.168.50.229:9115/-/healthy"
```
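The individual checks can be combined into one sweep. The sketch below is illustrative: the function names are hypothetical, and Node Exporter is probed at `/metrics` on the assumption that it does not serve a dedicated health path.

```bash
# Print "<name> <url>" for each health endpoint on the given host.
health_endpoints() {
  host="$1"
  printf 'prometheus http://%s:9091/-/healthy\n' "$host"
  printf 'grafana http://%s:3002/api/health\n' "$host"
  printf 'node-exporter http://%s:9100/metrics\n' "$host"
  printf 'blackbox-exporter http://%s:9115/-/healthy\n' "$host"
}

# Probe every endpoint; -f makes curl fail on HTTP errors such as 404.
# Returns non-zero if any endpoint fails.
check_all() {
  status=0
  while read -r name url; do
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      status=1
    fi
  done <<EOF
$(health_endpoints "$1")
EOF
  return "$status"
}
```

Running `check_all 192.168.50.229` prints one line per component, and the exit status makes it usable from cron or another script.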
## Maintenance
### Regular Tasks
#### Update Configurations
```bash
# Copy updated configs
scp configs/monitoring/*.yml root@192.168.50.229:/opt/configs/monitoring/
scp configs/monitoring/dashboards/*.json root@192.168.50.229:/opt/configs/monitoring/provisioning/dashboards/
# Redeploy stack
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"
```
#### Backup Monitoring Data
```bash
# Backup Prometheus data
# Single quotes make $(pwd) and $(date) expand on the remote host, so the
# archive is written to the remote working directory (root's home by default)
ssh root@192.168.50.229 'docker run --rm -v monitoring_prometheus_data:/data -v "$(pwd)":/backup alpine tar czf "/backup/prometheus_backup_$(date +%Y%m%d).tar.gz" -C /data .'
# Backup Grafana data
ssh root@192.168.50.229 'docker run --rm -v monitoring_grafana_data:/data -v "$(pwd)":/backup alpine tar czf "/backup/grafana_backup_$(date +%Y%m%d).tar.gz" -C /data .'
```
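A backup is only useful if the archive can be read back. As a minimal sketch, the hypothetical `verify_backup` helper below checks that an archive is a valid gzip stream and lists at least one entry; it does not compare the contents with the source volume.

```bash
# Return 0 when the archive is a readable gzip-compressed tar with at least one entry.
verify_backup() {
  archive="$1"
  [ -s "$archive" ] || return 1                 # must exist and be non-empty
  gzip -t "$archive" 2>/dev/null || return 1    # gzip stream integrity
  entries="$(tar -tzf "$archive" 2>/dev/null | wc -l | tr -d ' ')"
  [ "$entries" -gt 0 ]
}
```

It can be run on the host that holds the archives, for example `verify_backup prometheus_backup_20250830.tar.gz && echo "archive OK"`; the file name here is only an example of the naming pattern used above.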
#### Clean Up Old Data
```bash
# Prometheus automatically manages retention (30 days)
# Grafana data is persistent and should be backed up regularly
```
### Scaling Considerations
#### Horizontal Scaling
- **Prometheus**: Can be scaled with remote storage (Thanos/Cortex)
- **Grafana**: Can be scaled with external database
- **Node Exporter**: One per host (already optimal)
- **Blackbox Exporter**: Can be scaled for high-frequency checks
#### Vertical Scaling
- **Memory**: Increase limits for high-metric environments
- **CPU**: Adjust based on scrape frequency and target count
- **Storage**: Expand volumes for longer retention
## Security Considerations
### Network Security
- **Internal Communication**: Isolated monitoring network
- **External Access**: HTTPS-only via Caddy
- **Authentication**: Grafana login required for dashboard access
### Data Security
- **Metrics**: No sensitive data in metrics
- **Logs**: Monitor for sensitive information
- **Backups**: Encrypt backup files
### Access Control
- **Grafana**: Admin account; replace the initial `admin123` password with a strong one
- **Prometheus**: Read-only access via web interface
- **Exporters**: No authentication (internal network only)
## Future Enhancements
### Planned Improvements
1. **AlertManager**: Add alerting and notification system
2. **cAdvisor**: Container resource monitoring
3. **Application Exporters**: Service-specific metrics
4. **Centralized Logging**: Log aggregation with Loki
### Optional Enhancements
1. **Distributed Tracing**: Request flow tracking
2. **APM**: Application performance monitoring
3. **Synthetic Monitoring**: User journey testing
4. **Automated Incident Response**: Self-healing capabilities
---
**Last Updated**: August 30, 2025

**Version**: 1.0

**Status**: Production Ready