# Post-Migration Monitoring Optimization Guide

## 📊 **Current Monitoring Status**

### ✅ **Healthy Services (23 Active Targets)**

- **Prometheus**: Operational with 30-day retention
- **Grafana**: Operational with dashboard provisioning
- **Node Exporter**: System metrics collection
- **Blackbox Exporter**: HTTP/TCP health checks
- **Database Exporters**: PostgreSQL, MariaDB, Redis monitoring
- **Container Metrics**: cAdvisor for Docker container performance
- **Vaultwarden**: Now being monitored (external + internal endpoints)

### 🔍 **Currently Monitored Services**

- **HTTP Services**: Paperless-NGX, Paperless-AI, Nextcloud, Home Assistant, Portainer, AppFlowy
- **TCP Services**: Redis, PostgreSQL, MariaDB, Mosquitto
- **System Metrics**: CPU, Memory, Disk, Network (via Node Exporter)
- **Database Performance**: Query times, connection pools, slow queries
- **Container Performance**: Per-container resource usage, bottlenecks
- **Cache Performance**: Redis hit rates, memory usage, operations

### ❌ **Missing from Monitoring**

- **Log Aggregation**: No centralized logging system
- **Application Metrics**: No custom business metrics for each service
- **Network Monitoring**: Limited network traffic analysis
- **Security Monitoring**: No container security event monitoring
- **Custom Alerting**: Basic alerting rules only

## 🚀 **Immediate Monitoring Enhancements**

### **1. Add Vaultwarden Monitoring**

```bash
# Deploy Vaultwarden-specific monitoring
scp stacks/monitoring/vaultwarden-monitoring.yml root@192.168.50.229:/opt/stacks/monitoring/
scp stacks/monitoring/vaultwarden-blackbox.yml root@192.168.50.229:/opt/stacks/monitoring/
ssh root@192.168.50.229 "cd /opt/stacks/monitoring && docker stack deploy -c vaultwarden-monitoring.yml vaultwarden-monitoring"
```

### **2. Update Prometheus Configuration**

```bash
# Copy the updated Prometheus config, then force a service restart to load it
scp configs/monitoring/prometheus-production.yml root@192.168.50.229:/opt/configs/monitoring/
ssh root@192.168.50.229 "docker service update --force monitoring_prometheus"
```

### **3. Add Vaultwarden Dashboard to Grafana**

```bash
# Copy the dashboard definition, then force a service restart so provisioning picks it up
scp configs/monitoring/grafana/dashboards/vaultwarden-dashboard.json root@192.168.50.229:/opt/configs/monitoring/grafana/dashboards/
ssh root@192.168.50.229 "docker service update --force monitoring_grafana"
```

## 📈 **Advanced Monitoring Optimizations**

### **4. Database Performance Monitoring**

```yaml
# Add to prometheus-production.yml (under scrape_configs)
- job_name: 'postgresql-exporter'
  scrape_interval: 30s
  static_configs:
    - targets: ['192.168.50.229:9187']

- job_name: 'mariadb-exporter'
  scrape_interval: 30s
  static_configs:
    - targets: ['192.168.50.229:9104']
```

### **5. Caddy Reverse Proxy Monitoring**

```yaml
# Add Caddy metrics endpoint
# Note: Caddy's admin API binds to localhost:2019 by default, so it must be
# configured to listen on an address Prometheus can reach.
- job_name: 'caddy-metrics'
  scrape_interval: 30s
  metrics_path: /metrics
  static_configs:
    - targets: ['192.168.50.225:2019']  # Caddy admin API
```

### **6. Application-Specific Metrics**

```yaml
# Custom application metrics
- job_name: 'vaultwarden-custom'
  scrape_interval: 60s
  static_configs:
    - targets: ['192.168.50.229:9092']  # Vaultwarden exporter
```

## 🎯 **Dashboard Optimization Checklist**

### **Current Dashboard Issues**

- ❌ **Vaultwarden Dashboard**: Not deployed
- ❌ **Infrastructure Overview**: Duplicate dashboard error in logs
- ❌ **Service Dependencies**: No dependency mapping
- ❌ **Alerting**: No alert rules configured

### **Dashboard Improvements**

1. **Fix Duplicate Dashboard Error**

   ```bash
   # Remove the duplicate dashboard file
   # No -it here: this ssh invocation does not allocate a TTY
   ssh root@192.168.50.229 "docker exec \$(docker ps -q -f name=grafana) rm /etc/grafana/provisioning/dashboards/infrastructure-overview.json"
   ```

2. **Add Service Dependency Dashboard**

   ```json
   {
     "title": "Service Dependencies",
     "panels": [
       {
         "title": "Service Health Matrix",
         "type": "table",
         "targets": [
           {
             "expr": "up",
             "format": "table"
           }
         ]
       }
     ]
   }
   ```

3. **Create Alerting Rules**

   ```yaml
   # prometheus-rules.yml
   groups:
     - name: vaultwarden
       rules:
         - alert: VaultwardenDown
           expr: up{job="vaultwarden-monitoring"} == 0
           for: 1m
           labels:
             severity: critical
           annotations:
             summary: "Vaultwarden is down"
   ```

## 🔧 **Monitoring Infrastructure Optimization**

### **7. Resource Optimization**

```yaml
# Optimize Prometheus retention and compression
command:
  - --storage.tsdb.retention.time=30d
  - --storage.tsdb.retention.size=10GB
  - --storage.tsdb.wal-compression
  - --web.enable-lifecycle
```

### **8. High Availability Setup**

```yaml
# Deploy multiple Prometheus instances, at most one per node
deploy:
  replicas: 2
  placement:
    max_replicas_per_node: 1
```

### **9. Backup and Recovery**

```bash
#!/bin/bash
# backup-monitoring.sh
# Automated backup script: archives the Prometheus data volume into the
# current directory as prometheus-YYYYMMDD.tar.gz
docker run --rm \
  -v prometheus_data:/data \
  -v "$(pwd)":/backup \
  alpine tar czf "/backup/prometheus-$(date +%Y%m%d).tar.gz" -C /data .
```

## 📋 **Service Migration Monitoring Checklist**

### **For Each New Service Migration:**

1. **Pre-Migration**
   - [ ] Add service to Prometheus targets
   - [ ] Create service-specific dashboard
   - [ ] Set up health checks
   - [ ] Configure alerting rules

2. **During Migration**
   - [ ] Monitor service availability
   - [ ] Track performance metrics
   - [ ] Verify data integrity
   - [ ] Check error rates

3. **Post-Migration**
   - [ ] Validate all metrics are collected
   - [ ] Test dashboard functionality
   - [ ] Verify alerting works
   - [ ] Document service dependencies

## 🎯 **Next Steps Priority**

### **High Priority**

1. ✅ Deploy Vaultwarden monitoring
2. ✅ Fix Caddy labels in monitoring stack
3. ✅ Add Vaultwarden dashboard
4. ✅ Fix duplicate dashboard error
5. ✅ Add database exporters

### **Medium Priority**

1. ✅ Implement alerting rules
2. ✅ Create service dependency mapping
3. ✅ Optimize Prometheus retention
4. ✅ Add Caddy metrics

### **Low Priority**

1. ✅ High availability setup
2. ✅ Advanced application metrics
3. ✅ Custom business metrics
4. ✅ Automated backup system

## 🚀 **Future Enhancement To-Dos**

### **1. Log Aggregation System (Loki + Grafana)**

- [ ] **Deploy Loki** for centralized log collection
- [ ] **Configure log shipping** from all containers
- [ ] **Create log-based dashboards** in Grafana
- [ ] **Set up log-based alerting** for critical errors
- [ ] **Implement log retention policies** (30-90 days)
- [ ] **Add log correlation** with metrics for troubleshooting
- [ ] **Create log search and filtering** capabilities

### **2. Application-Specific Metrics**

- [ ] **Vaultwarden custom metrics** (user count, vault size, sync status)
- [ ] **Nextcloud metrics** (file operations, user activity, storage usage)
- [ ] **Home Assistant metrics** (automation triggers, device states, energy usage)
- [ ] **Database custom metrics** (slow queries, connection pool status, backup status)
- [ ] **Custom business metrics** (service usage patterns, user engagement)

### **3. Network Monitoring Enhancement**

- [ ] **Deploy SNMP monitoring** for network devices
- [ ] **Implement NetFlow analysis** for traffic patterns
- [ ] **Add bandwidth monitoring** per service/container
- [ ] **Create network topology mapping**
- [ ] **Monitor network latency** between services
- [ ] **Set up network security monitoring** (unusual traffic patterns)

### **4. Security Monitoring (Falco)**

- [ ] **Deploy Falco** for container security monitoring
- [ ] **Configure security rules** for common attack patterns
- [ ] **Set up security event alerting**
- [ ] **Create security dashboards** in Grafana
- [ ] **Implement anomaly detection** for suspicious activities
- [ ] **Add compliance monitoring** (PCI, GDPR, etc.)

### **5. Advanced Alerting and Notification**

- [ ] **Deploy AlertManager** for advanced alert routing
- [ ] **Configure multiple notification channels** (email, Slack, Discord, SMS)
- [ ] **Implement alert escalation** policies
- [ ] **Create alert templates** with rich formatting
- [ ] **Set up alert grouping** and deduplication
- [ ] **Add alert acknowledgment** and resolution tracking

### **6. Performance Optimization**

- [ ] **Implement metric cardinality optimization**
- [ ] **Add metric relabeling** for better organization
- [ ] **Optimize scrape intervals** based on service criticality
- [ ] **Implement metric caching** for frequently accessed data
- [ ] **Add query optimization** for complex dashboards
- [ ] **Set up metric aggregation** for long-term trends

### **7. High Availability and Disaster Recovery**

- [ ] **Deploy multiple Prometheus instances** with federation
- [ ] **Set up Grafana clustering** for high availability
- [ ] **Implement monitoring data backup** and recovery procedures
- [ ] **Create monitoring failover** procedures
- [ ] **Add cross-datacenter monitoring** if applicable
- [ ] **Document disaster recovery** runbooks

### **8. Custom Dashboards and Visualizations**

- [ ] **Create executive summary dashboards** for business stakeholders
- [ ] **Add capacity planning dashboards** for resource forecasting
- [ ] **Implement cost monitoring** dashboards (if applicable)
- [ ] **Create SLA/SLO tracking** dashboards
- [ ] **Add custom Grafana plugins** for specialized visualizations
- [ ] **Implement dashboard templating** for dynamic content

### **9. Integration and Automation**

- [ ] **Integrate with CI/CD pipelines** for deployment monitoring
- [ ] **Add monitoring as code** (Infrastructure as Code for monitoring)
- [ ] **Implement automated dashboard creation** for new services
- [ ] **Set up monitoring self-service** for developers
- [ ] **Add API monitoring** for external service dependencies
- [ ] **Create monitoring API** for external integrations

### **10. Compliance and Governance**

- [ ] **Implement audit logging** for monitoring system access
- [ ] **Add role-based access control** for monitoring dashboards
- [ ] **Create compliance dashboards** for regulatory requirements
- [ ] **Implement data retention policies** for monitoring data
- [ ] **Add monitoring system health** self-monitoring
- [ ] **Create monitoring documentation** and runbooks

## 📊 **Monitoring Metrics to Track**

### **Infrastructure Metrics**

- CPU, Memory, Disk usage per node
- Network traffic and latency
- Container resource usage
- Service availability and uptime

### **Application Metrics**

- Response times and throughput
- Error rates and status codes
- Database connection pools
- Cache hit rates

### **Business Metrics**

- User activity and sessions
- Data growth rates
- Backup success rates
- Security events

## 🔍 **Troubleshooting Guide**

### **Common Issues**

1. **Dashboard Not Loading**: Check Grafana logs for errors
2. **Metrics Missing**: Verify Prometheus targets are up
3. **High Resource Usage**: Optimize retention and scrape intervals
4. **Alert Not Firing**: Check alert rule syntax and thresholds

### **Debug Commands**

```bash
# Check Prometheus targets
curl -s "http://192.168.50.229:9091/api/v1/targets" | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'

# Check Grafana dashboards (the search API lists them; add credentials
# if anonymous access is disabled)
curl -s "http://192.168.50.229:3002/api/search?type=dash-db" | jq '.[] | {title: .title, id: .id}'

# Check service logs
docker service logs monitoring_prometheus --tail 50
docker service logs monitoring_grafana --tail 50
```

## 📈 **Success Metrics**

### **Operational Metrics**

- ✅ 99.9% uptime for monitoring services
- ✅ <5 second dashboard load times
- ✅ <30 second alert delivery
- ✅ 100% target coverage

### **Performance Metrics**

- ✅ <1GB Prometheus memory usage
- ✅ <10 second query response times
- ✅ <5% storage growth per month
- ✅ <100ms scrape duration per target

---

**Last Updated**: 2025-08-31

**Next Review**: After Vaultwarden monitoring deployment

**Status**: Ready for implementation with comprehensive future enhancement roadmap
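
The "100% target coverage" goal under Success Metrics can be spot-checked from the same targets endpoint used in the debug commands. The following is a minimal sketch, not part of the existing stack: `target_coverage` is a hypothetical helper name, and the commented usage assumes `jq` is installed and that Prometheus is reachable on the address shown in the debug commands.

```bash
#!/bin/bash
# target_coverage (hypothetical helper): reads one health value per line
# ("up", "down", "unknown") on stdin and prints the percentage of targets
# reporting "up" as an integer, rounded down. Prints 0 for empty input.
target_coverage() {
  awk '
    NF          { total++ }   # count every non-empty line as one target
    $1 == "up"  { up++ }      # count healthy targets
    END {
      if (total == 0) { print 0; exit }
      print int(up * 100 / total)
    }
  '
}

# Example usage (assumption: jq is available and the endpoint matches the
# debug commands):
# curl -s "http://192.168.50.229:9091/api/v1/targets" \
#   | jq -r '.data.activeTargets[].health' \
#   | target_coverage
```

A result below 100 means at least one active target is not healthy; the per-target jq query in the debug commands shows which job and instance to investigate.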