
admin 705a2757c1 Major infrastructure migration and Vaultwarden PostgreSQL troubleshooting

COMPREHENSIVE CHANGES:

INFRASTRUCTURE MIGRATION:
- Migrated services to Docker Swarm on OMV800 (192.168.50.229)
- Deployed PostgreSQL database for Vaultwarden migration
- Updated all stack configurations for Docker Swarm compatibility
- Added comprehensive monitoring stack (Prometheus, Grafana, Blackbox)
- Implemented proper secret management for all services

VAULTWARDEN POSTGRESQL MIGRATION:
- Attempted migration from SQLite to PostgreSQL for NFS compatibility
- Created PostgreSQL stack with proper user/password configuration
- Built custom Vaultwarden image with PostgreSQL support
- Troubleshot persistent SQLite fallback issue despite PostgreSQL config
- Identified known issue where Vaultwarden silently falls back to SQLite
- Added ENABLE_DB_WAL=false to prevent filesystem compatibility issues
- Current status: Old Vaultwarden on lenovo410 still working, new one has config issues

PAPERLESS SERVICES:
- Successfully deployed Paperless-NGX and Paperless-AI on OMV800
- Both services running on ports 8000 and 3000 respectively
- Caddy configuration updated for external access
- Services accessible via paperless.pressmess.duckdns.org and paperless-ai.pressmess.duckdns.org

CADDY CONFIGURATION:
- Updated Caddyfile on Surface (192.168.50.254) for new service locations
- Fixed Vaultwarden reverse proxy to point to new Docker Swarm service
- Removed old notification hub reference that was causing conflicts
- All services properly configured for external access via DuckDNS

BACKUP AND DISCOVERY:
- Created comprehensive backup system for all hosts
- Generated detailed discovery reports for infrastructure analysis
- Implemented automated backup validation scripts
- Created migration progress tracking and verification reports

MONITORING STACK:
- Deployed Prometheus, Grafana, and Blackbox monitoring
- Created infrastructure and system overview dashboards
- Added proper service discovery and alerting configuration
- Implemented performance monitoring for all critical services

DOCUMENTATION:
- Reorganized documentation into logical structure
- Created comprehensive migration playbook and troubleshooting guides
- Added hardware specifications and optimization recommendations
- Documented all configuration changes and service dependencies

CURRENT STATUS:
- Paperless services: ✅ Working and accessible externally
- Vaultwarden: ❌ PostgreSQL configuration issues, old instance still working
- Monitoring: ✅ Deployed and operational
- Caddy: ✅ Updated and working for external access
- PostgreSQL: ✅ Database running, connection issues with Vaultwarden

NEXT STEPS:
- Continue troubleshooting Vaultwarden PostgreSQL configuration
- Consider alternative approaches for Vaultwarden migration
- Validate all external service access
- Complete final migration validation

TECHNICAL NOTES:
- Used Docker Swarm for orchestration on OMV800
- Implemented proper secret management for sensitive data
- Added comprehensive logging and monitoring
- Created automated backup and validation scripts

2025-08-30 20:18:44 -04:00

11 KiB


Monitoring Stack Deployment Guide

Overview

The HomeAudit monitoring stack provides comprehensive infrastructure monitoring using industry-standard tools:

Prometheus: Metrics collection and storage
Grafana: Data visualization and dashboards
Node Exporter: System metrics collection
Blackbox Exporter: Service health monitoring

Architecture

Components

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Prometheus    │    │     Grafana     │    │  Node Exporter  │
│   (Port 9091)   │    │   (Port 3002)   │    │   (Port 9100)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                       ┌─────────────────┐
                       │Blackbox Exporter│
                       │   (Port 9115)   │
                       └─────────────────┘

Network Configuration

Monitoring Network: Internal communication between components
Caddy Public Network: External access via reverse proxy
Host Network Access: Node Exporter accesses system metrics

Deployment

Prerequisites

Docker Swarm: Initialized and operational
Networks: monitoring-network and caddy-public created
Storage: Persistent volumes for Prometheus and Grafana data
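
The prerequisites above can be checked before deploying. The following is a minimal sketch, assuming it runs on the Swarm manager; the function names are hypothetical, and the --attachable flag is an assumption that may not match how the networks were originally created.

```shell
# Hypothetical helper: succeeds when this node's Swarm state is "active".
swarm_is_active() {
    [ "$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)" = "active" ]
}

# Hypothetical helper: create an overlay network only if it does not exist yet.
# --attachable is an assumption; drop it if only Swarm services use the network.
ensure_overlay_network() {
    name="$1"
    if docker network inspect "$name" >/dev/null 2>&1; then
        echo "exists: $name"
    else
        docker network create --driver overlay --attachable "$name" >/dev/null &&
            echo "created: $name"
    fi
}

# Example (run on the Swarm manager):
#   swarm_is_active || echo "Swarm is not initialized on this node"
#   ensure_overlay_network monitoring-network
#   ensure_overlay_network caddy-public
```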

Deployment Commands

# Deploy monitoring stack
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"

# Check service status
ssh root@192.168.50.229 "docker service ls | grep monitoring"

# View service logs
ssh root@192.168.50.229 "docker service logs monitoring_prometheus"
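
After a deploy, it can help to confirm that every service has reached its desired replica count. This is a small sketch under the assumption that the stack's services are all prefixed with monitoring_; the function name is hypothetical.

```shell
# Reads "NAME REPLICAS" lines (for example "monitoring_grafana 1/1") on stdin
# and prints every service whose running count differs from its desired count.
unconverged_services() {
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Example (run on the Swarm manager); empty output means all services converged:
#   docker service ls --filter name=monitoring_ \
#       --format '{{.Name}} {{.Replicas}}' | unconverged_services
```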

Service Configuration

Prometheus

Image: prom/prometheus:v2.47.0
Port: 9091 (external), 9090 (internal)
Storage: 30-day retention
Scrape Interval: 15-60 seconds
Configuration: /opt/configs/monitoring/prometheus-production.yml

Grafana

Image: grafana/grafana:10.1.2
Port: 3002 (external), 3000 (internal)
Login: admin/admin123
Plugins: Clock, Simple JSON, Pie Chart
Provisioning: Auto-configured datasources and dashboards

Node Exporter

Image: prom/node-exporter:v1.6.1
Port: 9100
Access: Host filesystem for system metrics
Filters: Excludes system and container filesystems

Blackbox Exporter

Image: prom/blackbox-exporter:v0.24.0
Port: 9115
Modules: HTTP, TCP, ICMP health checks
Configuration: /opt/configs/monitoring/blackbox.yml

Metrics Collection

System Metrics (Node Exporter)

CPU Metrics

# CPU Usage Percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU Load Average (three separate series; query each on its own)
node_load1
node_load5
node_load15

Memory Metrics

# Memory Usage Percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Memory Breakdown
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_Cached_bytes
node_memory_Buffers_bytes
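
As a worked example of the memory usage expression, the sketch below applies the same arithmetic to two byte counts. The helper name is hypothetical and the input values are illustrative, not measurements from any host in this guide.

```shell
# Hypothetical helper mirroring the expression (1 - available / total) * 100.
mem_used_pct() {
    awk -v total="$1" -v avail="$2" 'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Illustrative values: 16 GiB total, 4 GiB available.
mem_used_pct 17179869184 4294967296   # prints 75.0
```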

Disk Metrics

# Disk Usage Percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})) * 100

# Disk I/O
rate(node_disk_io_time_seconds_total[5m])

Network Metrics

# Network I/O
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

Service Health Metrics (Blackbox Exporter)

HTTP Health Checks

# Service Availability
probe_success{job="http-service-health"}

# Response Time
probe_duration_seconds{job="http-service-health"}

# HTTP Status Codes
probe_http_status_code{job="http-service-health"}

TCP Health Checks

# Database Connectivity
probe_success{job="tcp-service-health"}

# Connection Time
probe_duration_seconds{job="tcp-service-health"}
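
The probe_success values above come from the Blackbox Exporter's /probe endpoint, which can also be called by hand to test a single target. A minimal sketch, assuming the exporter is reachable on the port listed in this guide and that the module names match blackbox.yml; both function names are hypothetical.

```shell
# Extracts the probe_success sample value (1 = up, 0 = down) from the
# Prometheus text format that the /probe endpoint returns.
probe_success_value() {
    awk '$1 == "probe_success" { print $2 }'
}

# Hypothetical wrapper: probe one target through one Blackbox module.
blackbox_probe() {
    module="$1"; target="$2"
    curl -fsS -G "http://192.168.50.229:9115/probe" \
        --data-urlencode "module=$module" \
        --data-urlencode "target=$target" | probe_success_value
}

# Example:
#   blackbox_probe http_2xx http://192.168.50.229:8000
#   blackbox_probe tcp_connect 192.168.50.229:5432
```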

Dashboards

Infrastructure Overview Dashboard

Purpose: Service health and availability monitoring

Panels:

HTTP Service Health Status: Visual status of web services
TCP Service Health Status: Database and backend service status
Service Response Time: Performance tracking over time
HTTP Service Availability Summary: Count of healthy services
Service Details Table: Detailed status of all monitored services

Metrics Used:

probe_success{job="http-service-health"}
probe_success{job="tcp-service-health"}
probe_duration_seconds{job="http-service-health"}
sum(probe_success{job="http-service-health"})

System Overview Dashboard

Purpose: Comprehensive system resource monitoring

Panels:

CPU Usage: Real-time CPU utilization trends
Memory Usage: Memory consumption and availability
Disk Usage: Storage space and I/O monitoring
Network I/O: Network traffic analysis
System Load: Load average tracking
System Info: Hardware and OS information

Metrics Used:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
(1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})) * 100
rate(node_network_receive_bytes_total{device!="lo"}[5m])
node_load1
node_load5
node_load15
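
The dashboard expressions can also be checked outside Grafana through the Prometheus HTTP API. The sketch below assumes the external port from this guide; the function name and the PROM_URL variable are hypothetical.

```shell
# Hypothetical helper: run an instant query against the Prometheus HTTP API.
# PROM_URL is an assumed override; it defaults to the external port in this guide.
prom_query() {
    curl -fsS -G "${PROM_URL:-http://192.168.50.229:9091}/api/v1/query" \
        --data-urlencode "query=$1"
}

# Example:
#   prom_query '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'
#   prom_query 'node_load1'
```

Passing the expression through --data-urlencode avoids hand-escaping the braces, quotes, and spaces that PromQL expressions usually contain.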

Configuration Files

Prometheus Configuration

File: /opt/configs/monitoring/prometheus-production.yml

Key Sections:

Global: Scrape intervals and evaluation settings
Scrape Configs: Target definitions for each monitoring job
Relabel Configs: Metric transformation and labeling

Blackbox Configuration

File: /opt/configs/monitoring/blackbox.yml

Modules:

http_2xx: HTTP health checks with status code validation
tcp_connect: TCP connectivity testing
icmp: Network ping testing

Grafana Configuration

Datasources: /opt/configs/monitoring/provisioning/datasources/

Auto-configured Prometheus data source

Dashboards: /opt/configs/monitoring/provisioning/dashboards/

Infrastructure Overview dashboard
System Overview dashboard

Monitoring Targets

Current Targets (15 total)

Service Health (6 targets)

Paperless-NGX (192.168.50.229:8000)
Paperless-AI (192.168.50.229:3000)
Nextcloud (192.168.50.229:8081)
Home Assistant (192.168.50.181:8123)
Portainer (192.168.50.181:9000)
AppFlowy (192.168.50.66:9080)

Database Health (4 targets)

Redis (192.168.50.229:6379)
PostgreSQL (192.168.50.229:5432)
MariaDB (192.168.50.229:3306)
Mosquitto (192.168.50.229:1883)

System Monitoring (5 targets)

Prometheus (192.168.50.229:9091)
Grafana (192.168.50.229:3002)
Node Exporter (192.168.50.229:9100)
Blackbox Exporter (192.168.50.229:9115)
Prometheus Internal (localhost:9090)

Performance Characteristics

Data Collection

Total Metrics: 784 different metrics
Scrape Intervals: 15-60 seconds per job
Data Retention: 30 days
Storage: Local persistent volumes

Resource Usage

Prometheus: 1GB memory, 0.5 CPU cores
Grafana: 1GB memory, 0.5 CPU cores
Node Exporter: 256MB memory, 0.25 CPU cores
Blackbox Exporter: 256MB memory, 0.25 CPU cores

Network Impact

Internal Traffic: Minimal (monitoring network)
External Access: Via Caddy reverse proxy only
Data Transfer: Compressed metrics over HTTP

Troubleshooting

Common Issues

Service Not Starting

# Check service status
docker service ls | grep monitoring

# View service logs
docker service logs monitoring_prometheus
docker service logs monitoring_grafana

Metrics Not Collecting

# Check Prometheus targets
curl "http://192.168.50.229:9091/api/v1/targets"

# Test individual exporters
curl "http://192.168.50.229:9100/metrics" | head -10
curl "http://192.168.50.229:9115/metrics" | head -10
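
The targets response is a large JSON document, so a quick count of unhealthy targets is often easier to read. This is a deliberately crude sketch that matches text instead of parsing JSON; the function name is hypothetical.

```shell
# Counts occurrences of "health":"<state>" in the JSON read from stdin.
# A plain text match so the sketch needs no JSON tooling; use a real JSON
# parser for anything beyond a quick count.
count_targets_with_health() {
    grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# Example:
#   curl -fsS "http://192.168.50.229:9091/api/v1/targets" | count_targets_with_health down
```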

Dashboard Not Loading

# Check Grafana logs
docker service logs monitoring_grafana

# Verify datasource configuration
curl "http://192.168.50.229:3002/api/datasources" -u admin:admin123

Health Checks

Prometheus Health

curl "http://192.168.50.229:9091/-/healthy"

Grafana Health

curl "http://192.168.50.229:3002/api/health"

Node Exporter Health

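# Node Exporter serves /metrics but is not documented to serve /-/healthy; expect HTTP 200 here
curl -s -o /dev/null -w "%{http_code}\n" "http://192.168.50.229:9100/metrics"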

Blackbox Exporter Health

curl "http://192.168.50.229:9115/-/healthy"
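
The individual checks can be combined into one pass over the stack. A minimal sketch, assuming the ports listed in this guide; both function names are hypothetical, and Node Exporter is checked through /metrics, which it is known to serve.

```shell
# Hypothetical helper: print "ok" or "FAIL" for one named endpoint.
check_endpoint() {
    name="$1"; url="$2"
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
        echo "ok   $name"
    else
        echo "FAIL $name"
    fi
}

# Hypothetical wrapper over the four services in this guide.
check_monitoring_stack() {
    host="${1:-192.168.50.229}"
    check_endpoint prometheus "http://$host:9091/-/healthy"
    check_endpoint grafana    "http://$host:3002/api/health"
    check_endpoint node       "http://$host:9100/metrics"
    check_endpoint blackbox   "http://$host:9115/-/healthy"
}

# Example:
#   check_monitoring_stack
```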

Maintenance

Regular Tasks

Update Configurations

# Copy updated configs
scp configs/monitoring/*.yml root@192.168.50.229:/opt/configs/monitoring/
scp configs/monitoring/dashboards/*.json root@192.168.50.229:/opt/configs/monitoring/provisioning/dashboards/

# Redeploy stack
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"
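
Validating the Prometheus configuration before redeploying can catch syntax errors early. The sketch below uses promtool, which ships in the prom/prometheus image; it assumes it runs on the Docker host that holds /opt/configs/monitoring, and the function name is hypothetical.

```shell
# Hypothetical helper: validate the Prometheus config with promtool from the
# same image version the stack uses. The config directory is mounted read-only.
validate_prometheus_config() {
    docker run --rm \
        -v /opt/configs/monitoring:/cfg:ro \
        --entrypoint promtool \
        prom/prometheus:v2.47.0 \
        check config /cfg/prometheus-production.yml
}

# Example (on the Docker host, before redeploying the stack):
#   validate_prometheus_config && echo "config OK"
```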

Backup Monitoring Data

# Single quotes make $(pwd) and $(date) expand on the remote host, not locally;
# the archives are written to the remote working directory.

# Backup Prometheus data
ssh root@192.168.50.229 'docker run --rm -v monitoring_prometheus_data:/data -v "$(pwd)":/backup alpine tar czf "/backup/prometheus_backup_$(date +%Y%m%d).tar.gz" -C /data .'

# Backup Grafana data
ssh root@192.168.50.229 'docker run --rm -v monitoring_grafana_data:/data -v "$(pwd)":/backup alpine tar czf "/backup/grafana_backup_$(date +%Y%m%d).tar.gz" -C /data .'
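
A backup is only useful if the archive can be read back. The following sketch checks that a file is a non-empty, listable gzip tar archive; the function name is hypothetical, and it does not verify that the contents are a consistent Prometheus or Grafana data set.

```shell
# Hypothetical helper: succeeds only if the file is non-empty and tar can list
# at least one entry from it as a gzip-compressed archive.
verify_backup_archive() {
    [ -s "$1" ] && [ "$(tar -tzf "$1" 2>/dev/null | wc -l | tr -d ' ')" -gt 0 ]
}

# Example:
#   verify_backup_archive "prometheus_backup_$(date +%Y%m%d).tar.gz" || echo "backup unreadable"
```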

Clean Up Old Data

# Prometheus automatically manages retention (30 days)
# Grafana data is persistent and should be backed up regularly

Scaling Considerations

Horizontal Scaling

Prometheus: Can be scaled with remote storage (Thanos/Cortex)
Grafana: Can be scaled with external database
Node Exporter: One per host (already optimal)
Blackbox Exporter: Can be scaled for high-frequency checks

Vertical Scaling

Memory: Increase limits for high-metric environments
CPU: Adjust based on scrape frequency and target count
Storage: Expand volumes for longer retention
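
Memory and CPU limits can be raised on a running service without editing the stack file. A minimal sketch with illustrative values; the function name is hypothetical, and a later docker stack deploy reapplies whatever the stack file specifies, so lasting changes belong there as well.

```shell
# Hypothetical helper: raise the resource limits of one Swarm service in place.
set_service_limits() {
    service="$1"; memory="$2"; cpus="$3"
    docker service update --limit-memory "$memory" --limit-cpu "$cpus" "$service"
}

# Example (illustrative values, run on the Swarm manager):
#   set_service_limits monitoring_prometheus 2G 1.0
```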

Security Considerations

Network Security

Internal Communication: Isolated monitoring network
External Access: HTTPS-only via Caddy
Authentication: Grafana login required for dashboard access

Data Security

Metrics: No sensitive data in metrics
Logs: Monitor for sensitive information
Backups: Encrypt backup files

Access Control

Grafana: Admin user; replace the default admin/admin123 login listed under Service Configuration with a strong password
Prometheus: Read-only access via web interface
Exporters: No authentication (internal network only)

Future Enhancements

Planned Improvements

AlertManager: Add alerting and notification system
cAdvisor: Container resource monitoring
Application Exporters: Service-specific metrics
Centralized Logging: Log aggregation with Loki

Optional Enhancements

Distributed Tracing: Request flow tracking
APM: Application performance monitoring
Synthetic Monitoring: User journey testing
Automated Incident Response: Self-healing capabilities

Last Updated: August 30, 2025
Version: 1.0
Status: Production Ready
