
admin 705a2757c1 Major infrastructure migration and Vaultwarden PostgreSQL troubleshooting

COMPREHENSIVE CHANGES:

INFRASTRUCTURE MIGRATION:
- Migrated services to Docker Swarm on OMV800 (192.168.50.229)
- Deployed PostgreSQL database for Vaultwarden migration
- Updated all stack configurations for Docker Swarm compatibility
- Added comprehensive monitoring stack (Prometheus, Grafana, Blackbox)
- Implemented proper secret management for all services

VAULTWARDEN POSTGRESQL MIGRATION:
- Attempted migration from SQLite to PostgreSQL for NFS compatibility
- Created PostgreSQL stack with proper user/password configuration
- Built custom Vaultwarden image with PostgreSQL support
- Troubleshot persistent SQLite fallback issue despite PostgreSQL config
- Identified known issue where Vaultwarden silently falls back to SQLite
- Added ENABLE_DB_WAL=false, since SQLite's write-ahead log (WAL) mode does not work reliably on network filesystems such as NFS
- Current status: Old Vaultwarden on lenovo410 still working, new one has config issues
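One way to narrow down the SQLite fallback is to check the exact DATABASE_URL value the container receives. Vaultwarden appears to choose its backend from the URL prefix:

- `mysql:` selects MySQL/MariaDB.
- `postgresql:` or `postgres:` selects PostgreSQL.
- Anything else, including an unset variable, is treated as an SQLite path.

That rule is an assumption based on Vaultwarden's documented DATABASE_URL formats and is worth confirming for the version in use. The `vw_db_backend` and `vw_check_url` helpers below are hypothetical and only mirror that assumed rule. A value with stray quotes or a misspelled scheme therefore shows up before redeploying.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a DATABASE_URL using the assumed Vaultwarden rule
# (prefix match on the scheme; anything unrecognised means SQLite).
vw_db_backend() {
  local url=${1-}
  case $url in
    mysql:*)                 echo "mysql" ;;
    postgresql:*|postgres:*) echo "postgresql" ;;
    *)                       echo "sqlite" ;;
  esac
}

# Warn when a value clearly meant for PostgreSQL would still be classified as SQLite,
# e.g. literal quotes copied from a compose file or leading whitespace.
vw_check_url() {
  local url=${1-} backend
  backend=$(vw_db_backend "$url")
  if [ "$backend" = "sqlite" ] && [[ $url == *postgres* ]]; then
    # The URL itself is not echoed because it contains the password.
    echo "WARN: DATABASE_URL mentions postgres but would fall back to SQLite"
    return 1
  fi
  echo "backend=$backend"
}

# Example on the node running the task (the "vaultwarden" name filter is an assumption):
# vw_check_url "$(docker exec "$(docker ps -qf name=vaultwarden | head -n1)" printenv DATABASE_URL)"
```

If the URL is supplied through a secret file rather than an environment variable, `printenv` returns nothing. In that case this check reports `sqlite` and says nothing about the real configuration.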

PAPERLESS SERVICES:
- Successfully deployed Paperless-NGX and Paperless-AI on OMV800
- Both services running on ports 8000 and 3000 respectively
- Caddy configuration updated for external access
- Services accessible via paperless.pressmess.duckdns.org and paperless-ai.pressmess.duckdns.org

CADDY CONFIGURATION:
- Updated Caddyfile on Surface (192.168.50.254) for new service locations
- Fixed Vaultwarden reverse proxy to point to new Docker Swarm service
- Removed old notification hub reference that was causing conflicts
- All services properly configured for external access via DuckDNS

BACKUP AND DISCOVERY:
- Created comprehensive backup system for all hosts
- Generated detailed discovery reports for infrastructure analysis
- Implemented automated backup validation scripts
- Created migration progress tracking and verification reports
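The validation scripts themselves are not reproduced here. The sketch below shows what such a check could involve. It assumes each backup is a gzip-compressed tarball stored next to a `.sha256` file written by `sha256sum` at backup time; that naming scheme and the `verify_backup` function are hypothetical. The helper confirms that the checksum still matches and that the archive can be read end to end.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify_backup ARCHIVE
# Assumes ARCHIVE.sha256 was created with: sha256sum ARCHIVE > ARCHIVE.sha256
verify_backup() {
  local archive=$1
  local sums="${archive}.sha256"
  local expected actual

  if [ ! -f "$archive" ] || [ ! -f "$sums" ]; then
    echo "MISSING $archive"
    return 1
  fi

  # Compare only the hash field, so the path recorded in the .sha256 file
  # does not need to match the current working directory.
  expected=$(cut -d' ' -f1 "$sums")
  actual=$(sha256sum "$archive" | cut -d' ' -f1)
  if [ "$expected" != "$actual" ]; then
    echo "BAD-CHECKSUM $archive"
    return 1
  fi

  # Listing the whole archive forces gzip to read every block,
  # which catches truncated or corrupted streams.
  if ! tar -tzf "$archive" > /dev/null 2>&1; then
    echo "UNREADABLE $archive"
    return 1
  fi

  echo "OK $archive"
}

# Example (/srv/backups is a placeholder path):
# for f in /srv/backups/*.tar.gz; do verify_backup "$f"; done
```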

MONITORING STACK:
- Deployed Prometheus, Grafana, and Blackbox monitoring
- Created infrastructure and system overview dashboards
- Added proper service discovery and alerting configuration
- Implemented performance monitoring for all critical services

DOCUMENTATION:
- Reorganized documentation into logical structure
- Created comprehensive migration playbook and troubleshooting guides
- Added hardware specifications and optimization recommendations
- Documented all configuration changes and service dependencies

CURRENT STATUS:
- Paperless services: ✅ Working and accessible externally
- Vaultwarden: ❌ PostgreSQL configuration issues, old instance still working
- Monitoring: ✅ Deployed and operational
- Caddy: ✅ Updated and working for external access
- PostgreSQL: ✅ Database running, connection issues with Vaultwarden

NEXT STEPS:
- Continue troubleshooting Vaultwarden PostgreSQL configuration
- Consider alternative approaches for Vaultwarden migration
- Validate all external service access
- Complete final migration validation
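For the external-access validation step, a minimal smoke test could request each public hostname and report the HTTP status. The `check_http` and `check_all` helpers are hypothetical. The hostnames in the example comment are the ones listed in this document.

```bash
#!/usr/bin/env bash
# Hypothetical smoke test for externally exposed services.
# check_http prints "UP <code> <url>" for any 2xx/3xx response, else "DOWN <code> <url>".
check_http() {
  local url=$1 code
  # -o /dev/null drops the body and -w prints only the status code;
  # curl reports 000 when no HTTP response arrived (DNS, TLS or connect failure).
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || true
  case $code in
    2??|3??) echo "UP $code $url" ;;
    *)       echo "DOWN $code $url"; return 1 ;;
  esac
}

# check_all URL... checks every URL and returns non-zero if any is down.
check_all() {
  local url failed=0
  for url in "$@"; do
    check_http "$url" || failed=1
  done
  return "$failed"
}

# Example:
# check_all https://paperless.pressmess.duckdns.org https://paperless-ai.pressmess.duckdns.org \
#   https://grafana.pressmess.duckdns.org https://prometheus.pressmess.duckdns.org
```

Redirects are not followed, so a 3xx counts as up. An endpoint that answers 401 behind a login would be reported DOWN; widen the pattern if that is expected.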

TECHNICAL NOTES:
- Used Docker Swarm for orchestration on OMV800
- Implemented proper secret management for sensitive data
- Added comprehensive logging and monitoring
- Created automated backup and validation scripts

2025-08-30 20:18:44 -04:00

6.0 KiB


HomeAudit - Infrastructure Migration and Monitoring

A comprehensive home infrastructure audit, migration, and monitoring system for Docker Swarm deployment.

🏗️ Infrastructure Overview

Current Deployment Status

✅ Paperless Stack: Paperless-NGX (port 8000) + Paperless AI (port 3000) on OMV800
✅ Monitoring Stack: Prometheus + Grafana + Node Exporter + Blackbox Exporter
✅ Caddy Reverse Proxy: SSL termination and domain routing
🔄 Migration Progress: 85% complete

Device Inventory

Device	IP	Role	Status
OMV800	192.168.50.229	Docker Swarm Manager	✅ Active
Surface	192.168.50.254	Caddy Reverse Proxy	✅ Active
jonathan-2518f5u	192.168.50.181	Worker Node	✅ Active
lenovo420	192.168.50.66	Worker Node	✅ Active
audrey	192.168.50.145	Worker Node	✅ Active
fedora	192.168.50.225	Worker Node	✅ Active

📊 Monitoring Stack

Components

Prometheus (port 9091): Metrics collection and storage
Grafana (port 3002): Data visualization and dashboards
Node Exporter (port 9100): System metrics collection
Blackbox Exporter (port 9115): Service health monitoring
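Blackbox Exporter can also be probed by hand through its `/probe` endpoint, which takes `module` and `target` query parameters and returns a `probe_success` metric. The sketch below rests on two assumptions: that port 9115 is published on OMV800 (192.168.50.229), and that the exporter configuration defines the commonly used `http_2xx` module. The helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Percent-encode a string for a URL query parameter (ASCII input assumed).
urlencode() {
  local s=$1 out='' c i
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case $c in
      [A-Za-z0-9.~_-]) out+=$c ;;                   # RFC 3986 unreserved characters
      *) printf -v c '%%%02X' "'$c"; out+=$c ;;   # everything else becomes %XX
    esac
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: build a Blackbox Exporter probe URL.
# BLACKBOX_URL defaults to the assumed OMV800 address and the port listed above.
blackbox_probe_url() {
  local target=$1 module=${2:-http_2xx}
  local base=${BLACKBOX_URL:-http://192.168.50.229:9115}
  printf '%s/probe?module=%s&target=%s\n' "$base" "$(urlencode "$module")" "$(urlencode "$target")"
}

# Example: curl -s "$(blackbox_probe_url https://paperless.pressmess.duckdns.org)" | grep '^probe_success'
```

In the probe output, `probe_success 1` means the probe succeeded and `0` means it failed. Building the URL explicitly also makes it easy to paste into a browser while debugging.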

Metrics Coverage

15 Active Targets: Services, system, and health checks
784 Metrics: Comprehensive infrastructure monitoring
Near-Real-Time Data: 15-60 second scrape intervals
30-day Retention: Historical trend analysis

Dashboards

Infrastructure Overview: Service health and availability
System Overview: CPU, memory, disk, network monitoring

Access URLs

Grafana: https://grafana.pressmess.duckdns.org (admin/admin123)
Prometheus: https://prometheus.pressmess.duckdns.org

🔧 Services Status

Active Services

Paperless-NGX: Document management (port 8000)
Paperless AI: AI-powered document processing (port 3000)
Nextcloud: File storage and sync (port 8081)
Home Assistant: Home automation (port 8123)
Portainer: Container management (port 9000)
AppFlowy: Note-taking (port 9080)

Database and Messaging Services

PostgreSQL: Primary database
MariaDB: Secondary database
Redis: Caching layer
Mosquitto: MQTT broker
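Blackbox Exporter's TCP probes are the ongoing check for these backends. For a one-off manual check from any Linux host with bash, a plain TCP connect is often enough. The `tcp_check` helper below is hypothetical. The example ports are the upstream defaults (PostgreSQL 5432, MariaDB 3306, Redis 6379, Mosquitto 1883). That these ports are published on OMV800 is an assumption, since this README does not list them.

```bash
#!/usr/bin/env bash
# Hypothetical one-off TCP reachability check using bash's /dev/tcp redirection.
# It only proves that something accepts connections on HOST:PORT,
# not that the database or broker behind it is healthy.
tcp_check() {
  local host=$1 port=$2 timeout_s=${3:-3}
  # Open the socket in a child bash so `timeout` can kill a hanging connect.
  if timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "open $host:$port"
  else
    echo "closed $host:$port"
    return 1
  fi
}

# Example, assuming default ports published on OMV800:
# for port in 5432 3306 6379 1883; do tcp_check 192.168.50.229 "$port"; done
```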

🚀 Quick Start

1. Access Monitoring

```bash
# Grafana dashboard (`open` is macOS; use xdg-open on Linux)
open https://grafana.pressmess.duckdns.org
# Login: admin / admin123

# Prometheus metrics
open https://prometheus.pressmess.duckdns.org
```

2. Check Service Health

```bash
# View all monitoring targets
curl "http://192.168.50.229:9091/api/v1/targets"

# Check which targets are up (1) or down (0)
curl "http://192.168.50.229:9091/api/v1/query?query=up"
```

3. Monitor System Resources

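```bash
# -G with --data-urlencode lets curl encode the PromQL; raw braces or
# brackets in the URL itself would be parsed as curl globs.
# CPU usage (%) per instance, averaged across cores
curl -G "http://192.168.50.229:9091/api/v1/query" \
  --data-urlencode 'query=100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

# Memory usage (%) per instance
curl -G "http://192.168.50.229:9091/api/v1/query" \
  --data-urlencode 'query=(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'
```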

📁 Project Structure

```
HomeAudit/
├── stacks/                    # Docker Swarm stacks
│   └── monitoring/           # Monitoring stack configuration
├── configs/                  # Configuration files
│   └── monitoring/          # Prometheus, Grafana configs
├── scripts/                  # Utility scripts
├── dev_documentation/        # Detailed documentation
└── comprehensive_discovery_results/  # Audit results
```

🔍 Monitoring Features

System Monitoring

CPU Usage: Per-core and overall utilization
Memory Usage: Total, available, cached, buffers
Disk Usage: Space, I/O, mount points
Network I/O: Bytes sent/received per interface
System Load: 1m, 5m, 15m averages
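The disk and load figures above map onto standard Node Exporter metrics: `node_filesystem_avail_bytes`, `node_filesystem_size_bytes`, `node_load1`, `node_load5` and `node_load15`. A hypothetical `prom_query` wrapper that lets curl percent-encode the PromQL makes such queries easy to run by hand. The default address is the Prometheus endpoint given in this README. Excluding `tmpfs` and `overlay` mounts is an assumption about which filesystems matter.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the Prometheus instant-query endpoint.
# PROM_URL defaults to the Prometheus address used in this README.
prom_query() {
  local base=${PROM_URL:-http://192.168.50.229:9091}
  # -G sends the --data-urlencode pair as an encoded query string;
  # -f makes curl exit non-zero without printing a body on HTTP errors.
  curl -sfG --max-time 10 "$base/api/v1/query" --data-urlencode "query=$1"
}

# Examples:
# Disk usage (%) per filesystem, skipping tmpfs and overlay mounts
# prom_query '100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})'
# 1-minute load average (node_load5 and node_load15 work the same way)
# prom_query 'node_load1'
```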

Service Monitoring

HTTP Health Checks: Web service availability
TCP Health Checks: Database and backend services
Response Times: Service performance tracking
Availability Metrics: Uptime and reliability

Infrastructure Monitoring

Docker Swarm: Service health and resource usage
Container Metrics: Resource consumption per container
Network Connectivity: Inter-service communication
Hardware Health: System temperature and status

🛠️ Maintenance

Update Monitoring Stack

```bash
# Deploy updated configuration
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c final-monitoring.yml monitoring"

# Check service status
ssh root@192.168.50.229 "docker service ls | grep monitoring"
```
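`docker service ls` reports replicas as running/desired, for example `1/1`. Feed it the `--format '{{.Name}} {{.Replicas}}'` template and a small filter can print only services that are not fully running. That is easier to scan than grep output as the stack grows. The `unhealthy_services` helper is hypothetical. It returns non-zero when it finds a problem, so it can also gate a cron job.

```bash
#!/usr/bin/env bash
# Hypothetical filter: read "NAME REPLICAS" lines (e.g. "monitoring_grafana 1/1")
# and print only services whose running count differs from the desired count.
unhealthy_services() {
  local name replicas rest running desired found=0
  # The `|| [ -n "$name" ]` clause also handles a final line without a newline.
  while read -r name replicas rest || [ -n "$name" ]; do
    [ -n "$name" ] || continue
    running=${replicas%%/*}
    desired=${replicas#*/}
    if [ "$running" != "$desired" ]; then
      echo "$name $replicas"
      found=1
    fi
  done
  return "$found"
}

# Example:
# ssh root@192.168.50.229 "docker service ls --filter name=monitoring --format '{{.Name}} {{.Replicas}}'" | unhealthy_services
```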

View Logs

```bash
# Prometheus logs
ssh root@192.168.50.229 "docker service logs monitoring_prometheus"

# Grafana logs
ssh root@192.168.50.229 "docker service logs monitoring_grafana"
```
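`docker service logs` accepts `--since`, `--tail` and `--timestamps`, which keeps output manageable. Prometheus and Grafana both write logfmt-style lines with a level field, so a case-insensitive filter can surface warnings and errors. Some older Grafana releases used `lvl=eror` instead of `level=error`. The `log_problems` helper is a hypothetical sketch, and the exact level spellings are an assumption to check against the versions deployed.

```bash
#!/usr/bin/env bash
# Hypothetical filter: keep only warning/error/critical lines from logfmt-style logs.
# Matches level=warn, level=error, level=ERROR, lvl=eror, level=crit, and so on.
log_problems() {
  # grep exits 1 when nothing matches; treat that as "no problems", not a failure.
  grep -Ei '(level|lvl)=(warn|eror|err|crit)' || true
}

# Example (last hour of Prometheus logs, stderr merged on the remote side):
# ssh root@192.168.50.229 "docker service logs --since 1h --timestamps monitoring_prometheus 2>&1" | log_problems
```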

📈 Performance Metrics

Current System Specs

Total Memory: 31GB
CPU Cores: Multi-core system
Storage: SSD-based
Network: Gigabit connectivity

Monitoring Performance

Scrape Interval: 15-60 seconds
Data Retention: 30 days
Metrics Count: 784 different metrics
Target Health: 15/15 targets healthy
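With 30-day retention, local disk use can be estimated from the rule of thumb in the Prometheus storage documentation:

needed disk space ≈ retention time in seconds × ingested samples per second × bytes per sample

Prometheus averages roughly 1 to 2 bytes per sample. Samples per second is approximately the number of active time series divided by the scrape interval. The 784 figure above may count metric names rather than time series (each label combination is its own series). The series count in the example below is therefore a placeholder; the real value is reported by the `prometheus_tsdb_head_series` metric. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# prom_disk_estimate SERIES SCRAPE_INTERVAL_S RETENTION_DAYS [BYTES_PER_SAMPLE]
# Rough TSDB size in bytes per the Prometheus docs' rule of thumb:
#   retention_seconds * samples_per_second * bytes_per_sample,
# with samples_per_second ~= series / scrape_interval. WAL overhead is not included.
prom_disk_estimate() {
  local series=$1 interval=$2 days=$3 bytes=${4:-2}
  local retention_s=$(( days * 86400 ))
  # Multiply before dividing so integer arithmetic does not truncate early.
  echo $(( retention_s * series * bytes / interval ))
}

# Example with a placeholder of 10,000 series, 15 s scrapes, 30 days, 2 bytes/sample:
# prom_disk_estimate 10000 15 30 2   # 3456000000 bytes, roughly 3.2 GiB
```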

🔮 Future Enhancements

Planned Improvements

Alertmanager: Smart alerting and notifications
cAdvisor: Container resource monitoring
Application Exporters: Database and service-specific metrics
Centralized Logging: Log aggregation and analysis

Optional Enhancements

Distributed Tracing: Request flow tracking
APM: Application performance monitoring
Synthetic Monitoring: User journey testing
Automated Incident Response: Self-healing infrastructure

📞 Support

For issues or questions:

Check the monitoring dashboards for system health
Review service logs for error details
Consult the comprehensive documentation in dev_documentation/
Check the migration status in comprehensive_discovery_results/

Last Updated: August 30, 2025
Monitoring Status: ✅ Fully Operational
Migration Progress: 85% Complete
