Monitoring and Alerting Guide

Complete Monitoring Strategy for CIM Document Processor

🎯 Overview

This guide covers monitoring and alerting for the CIM Document Processor: system health, performance metrics, error tracking, and operational alerts.


📊 Monitoring Architecture

Monitoring Stack

  • Application Monitoring: Custom logging with Winston
  • Infrastructure Monitoring: Google Cloud Monitoring
  • Error Tracking: Structured error logging
  • Performance Monitoring: Custom metrics and timing
  • User Analytics: Usage tracking and analytics

Monitoring Layers

  1. Application Layer - Service health and performance
  2. Infrastructure Layer - Cloud resources and availability
  3. Business Layer - User activity and document processing
  4. Security Layer - Authentication and access patterns

🔍 Key Metrics to Monitor

Application Performance Metrics

Document Processing Metrics

interface ProcessingMetrics {
  uploadSuccessRate: number;        // % of successful uploads
  processingTime: number;           // Average processing time (ms)
  queueLength: number;              // Number of pending documents
  errorRate: number;                // % of processing errors
  throughput: number;               // Documents processed per hour
}

API Performance Metrics

interface APIMetrics {
  responseTime: number;             // Average response time (ms)
  requestRate: number;              // Requests per minute
  errorRate: number;                // % of API errors
  activeConnections: number;        // Current active connections
  timeoutRate: number;              // % of request timeouts
}
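The interfaces above describe aggregated values, but the guide does not show how they are derived. A minimal sketch of one way to compute the API figures from raw per-request samples follows. `RequestSample`, `ApiSummary`, and `summarizeRequests` are hypothetical names introduced for illustration, and treating 5xx responses as errors is an assumption.

```typescript
// Hypothetical raw sample recorded once per request (not part of the original guide).
interface RequestSample {
  durationMs: number;
  statusCode: number;
  timedOut: boolean;
}

// Subset of APIMetrics that can be derived from a batch of samples.
interface ApiSummary {
  responseTime: number; // average response time (ms)
  errorRate: number;    // % of requests with a 5xx status (assumed definition of "error")
  timeoutRate: number;  // % of requests that timed out
}

function summarizeRequests(samples: RequestSample[]): ApiSummary {
  // Avoid dividing by zero when no traffic was recorded in the window
  if (samples.length === 0) {
    return { responseTime: 0, errorRate: 0, timeoutRate: 0 };
  }
  const total = samples.length;
  const totalDuration = samples.reduce((sum, s) => sum + s.durationMs, 0);
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  const timeouts = samples.filter((s) => s.timedOut).length;
  return {
    responseTime: totalDuration / total,
    errorRate: (errors / total) * 100,
    timeoutRate: (timeouts / total) * 100,
  };
}
```

Computing rates per time window (for example, the last five minutes of samples) keeps the percentages comparable with the alert thresholds used later in this guide.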

Storage Metrics

interface StorageMetrics {
  uploadSpeed: number;              // MB/s upload rate
  storageUsage: number;             // % of storage used
  fileCount: number;                // Total files stored
  retrievalTime: number;            // Average file retrieval time
  errorRate: number;                // % of storage errors
}

Infrastructure Metrics

Server Metrics

  • CPU Usage: Average and peak CPU utilization
  • Memory Usage: RAM usage and garbage collection
  • Disk I/O: Read/write operations and latency
  • Network I/O: Bandwidth usage and connection count

Database Metrics

  • Connection Pool: Active and idle connections
  • Query Performance: Average query execution time
  • Storage Usage: Database size and growth rate
  • Error Rate: Database connection and query errors

Cloud Service Metrics

  • Firebase Auth: Authentication success/failure rates
  • Firebase Storage: Upload/download success rates
  • Supabase: Database performance and connection health
  • Google Cloud: Document AI processing metrics

🚨 Alerting Strategy

Alert Severity Levels

🔴 Critical Alerts

Immediate Action Required

  • System downtime or unavailability
  • Authentication service failures
  • Database connection failures
  • Storage service failures
  • Security breaches or suspicious activity

🟡 Warning Alerts

Attention Required

  • High error rates (>5%)
  • Performance degradation
  • Resource usage approaching limits
  • Unusual traffic patterns
  • Service degradation

🟢 Informational Alerts

Monitoring Only

  • Normal operational events
  • Scheduled maintenance
  • Performance improvements
  • Usage statistics

Alert Channels

Primary Channels

  • Email: Critical alerts to operations team
  • Slack: Real-time notifications to development team
  • PagerDuty: Escalation for critical issues
  • SMS: Emergency alerts for system downtime

Secondary Channels

  • Dashboard: Real-time monitoring dashboard
  • Logs: Structured logging for investigation
  • Metrics: Time-series data for trend analysis
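The severity levels and channels above can be tied together with a small routing table. The sketch below is one possible mapping, not the project's actual configuration: the `Severity` and `Channel` types, the `channelsBySeverity` table, and the `channelsFor` helper are all hypothetical, and routing informational alerts to the dashboard and logs is an assumption based on their "Monitoring Only" description.

```typescript
type Severity = 'critical' | 'warning' | 'info';
type Channel = 'email' | 'slack' | 'pagerduty' | 'sms' | 'dashboard' | 'logs';

// Assumed default routing; adjust to match the team's real escalation policy.
const channelsBySeverity: Record<Severity, Channel[]> = {
  critical: ['email', 'slack', 'pagerduty'],
  warning: ['slack', 'dashboard'],
  info: ['dashboard', 'logs'],
};

function channelsFor(severity: Severity, systemDown = false): Channel[] {
  // Copy the array so callers cannot mutate the routing table
  const channels = [...channelsBySeverity[severity]];
  // SMS is reserved for system downtime, as listed under Primary Channels
  if (severity === 'critical' && systemDown) {
    channels.push('sms');
  }
  return channels;
}
```

Keeping the routing in one table makes it easy to review who is notified for each severity without reading every alert handler.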

📈 Monitoring Implementation

Application Logging

Structured Logging Setup

// utils/logger.ts
import winston from 'winston';

// Shared application logger: JSON entries with timestamps and error stack traces.
export const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  // Attached to every log entry so logs can be filtered by service
  defaultMeta: { service: 'cim-processor' },
  transports: [
    // Errors only
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    // Everything at 'info' level and above
    new winston.transports.File({ filename: 'combined.log' }),
    // Human-readable console output
    new winston.transports.Console({
      format: winston.format.simple()
    })
  ]
});

Performance Monitoring

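// middleware/performance.ts
import { Request, Response, NextFunction } from 'express';
import { logger } from '../utils/logger';

const SLOW_REQUEST_THRESHOLD_MS = 5000;

// Logs every request with its duration and warns when a request is slow.
export const performanceMonitor = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();
  
  // 'finish' fires once the response has been fully sent
  res.on('finish', () => {
    const duration = Date.now() - start;
    const { method, path } = req;
    // The status code lives on the response, not the request
    const { statusCode } = res;
    
    logger.info('API Request', {
      method,
      path,
      statusCode,
      duration,
      userAgent: req.get('User-Agent'),
      ip: req.ip
    });
    
    // Log a warning for slow requests
    if (duration > SLOW_REQUEST_THRESHOLD_MS) {
      logger.warn('Slow API Request', {
        method,
        path,
        duration,
        threshold: SLOW_REQUEST_THRESHOLD_MS
      });
    }
  });
  
  next();
};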
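The middleware above times HTTP requests only. Document processing and other background work can be timed with the same start/stop pattern. The following is a minimal sketch under that assumption; `startTimer`, `TimerResult`, and the injectable `now` clock are hypothetical names, and the 5000 ms default simply mirrors the slow-request threshold used above.

```typescript
const SLOW_THRESHOLD_MS = 5000; // mirrors the slow-request threshold in the middleware

interface TimerResult {
  durationMs: number;
  slow: boolean;
}

// Starts a timer and returns a stop function that reports the elapsed time.
// The clock is injectable so the helper can be exercised without real delays.
function startTimer(
  thresholdMs: number = SLOW_THRESHOLD_MS,
  now: () => number = Date.now
): () => TimerResult {
  const start = now();
  return () => {
    const durationMs = now() - start;
    // Strictly greater than, matching the `duration > 5000` check in the middleware
    return { durationMs, slow: durationMs > thresholdMs };
  };
}
```

A caller would invoke `startTimer()` before the operation, call the returned function once it completes, and pass `durationMs` to whatever records processing time.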

Error Tracking

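// middleware/errorHandler.ts
import { Request, Response, NextFunction } from 'express';
import { logger } from '../utils/logger';
import { sendCriticalAlert } from '../alerts/alertHandlers';

// Express error-handling middleware. Express recognizes it by its four-argument signature.
export const errorHandler = (error: Error, req: Request, res: Response, next: NextFunction) => {
  const errorInfo = {
    message: error.message,
    stack: error.stack,
    method: req.method,
    path: req.path,
    userAgent: req.get('User-Agent'),
    ip: req.ip,
    timestamp: new Date().toISOString()
  };
  
  logger.error('Application Error', errorInfo);
  
  // Alert on critical errors
  if (error.message.includes('Database connection failed') || 
      error.message.includes('Authentication failed')) {
    // Send the alert without awaiting it; a failing alert channel must not block the response
    sendCriticalAlert('System Error', errorInfo).catch((alertError) => {
      logger.error('Failed to send critical alert', { alertError: String(alertError) });
    });
  }
  
  // If the response has already started, hand off to Express's default error handler
  if (res.headersSent) {
    return next(error);
  }
  
  res.status(500).json({ error: 'Internal server error' });
};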

Health Checks

Application Health Check

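// routes/health.ts
import { Router, Request, Response } from 'express';
import {
  checkDatabaseHealth,
  checkStorageHealth,
  // checkAuthHealth and checkAIHealth are assumed to follow the same pattern
  // as the checks shown in utils/healthChecks.ts
  checkAuthHealth,
  checkAIHealth
} from '../utils/healthChecks';

const router = Router();

router.get('/health', async (req: Request, res: Response) => {
  // Run the checks in parallel so one slow dependency does not delay the others
  const [database, storage, auth, ai] = await Promise.all([
    checkDatabaseHealth(),
    checkStorageHealth(),
    checkAuthHealth(),
    checkAIHealth()
  ]);
  const services = { database, storage, auth, ai };
  
  // The system is healthy only if every dependency is healthy
  const isHealthy = Object.values(services).every(service => service.status === 'healthy');
  
  // 503 tells load balancers and uptime monitors that the service is unavailable
  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    services
  });
});

export default router;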

Service Health Checks

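// utils/healthChecks.ts
// `supabase` and `firebase` are the project's already-initialized clients.

// Convert anything thrown into a readable message
const toMessage = (error: unknown): string =>
  error instanceof Error ? error.message : String(error);

export const checkDatabaseHealth = async () => {
  try {
    const start = Date.now();
    // supabase-js returns query failures in `error` instead of throwing,
    // so the result must be checked explicitly
    const { error } = await supabase.from('documents').select('count').limit(1);
    if (error) throw new Error(error.message);
    const responseTime = Date.now() - start;
    
    return {
      status: 'healthy',
      responseTime,
      timestamp: new Date().toISOString()
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      error: toMessage(error),
      timestamp: new Date().toISOString()
    };
  }
};

export const checkStorageHealth = async () => {
  try {
    const start = Date.now();
    // Fetching bucket metadata is a cheap round trip that fails if storage is unreachable
    await firebase.storage().bucket().getMetadata();
    const responseTime = Date.now() - start;
    
    return {
      status: 'healthy',
      responseTime,
      timestamp: new Date().toISOString()
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      error: toMessage(error),
      timestamp: new Date().toISOString()
    };
  }
};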
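The checks above share the same shape, and none of them bounds how long a dependency may take to answer, so a hung connection would stall the `/health` endpoint. A minimal sketch of a shared wrapper that adds a timeout is shown below. `HealthResult`, `withTimeout`, and `runHealthCheck` are hypothetical names, and the 2000 ms default is an arbitrary assumption. The same wrapper could back `checkAuthHealth` and `checkAIHealth`, which the health route references but this guide does not show.

```typescript
interface HealthResult {
  status: 'healthy' | 'unhealthy';
  responseTime?: number; // ms, present only when the probe succeeded
  error?: string;        // present only when the probe failed or timed out
  timestamp: string;
}

// Rejects if the given promise does not settle within timeoutMs.
function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Health check timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Runs a probe (any async call that throws on failure) and reports the result
// in the same shape as the health checks above.
async function runHealthCheck(
  probe: () => Promise<unknown>,
  timeoutMs = 2000
): Promise<HealthResult> {
  const start = Date.now();
  try {
    await withTimeout(probe(), timeoutMs);
    return {
      status: 'healthy',
      responseTime: Date.now() - start,
      timestamp: new Date().toISOString(),
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      error: error instanceof Error ? error.message : String(error),
      timestamp: new Date().toISOString(),
    };
  }
}
```

With this wrapper, each service check reduces to a single call that passes the service-specific probe, and a timed-out dependency is reported as unhealthy instead of blocking the response.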

📊 Dashboard and Visualization

Monitoring Dashboard

Real-time Metrics

  • System Status: Overall system health indicator
  • Active Users: Current number of active users
  • Processing Queue: Number of documents in processing
  • Error Rate: Current error percentage
  • Response Time: Average API response time

Performance Charts

  • Throughput: Documents processed over time
  • Error Trends: Error rates over time
  • Resource Usage: CPU, memory, and storage usage
  • User Activity: User sessions and interactions

Alert History

  • Recent Alerts: Last 24 hours of alerts
  • Alert Trends: Alert frequency over time
  • Resolution Time: Time to resolve issues
  • Escalation History: Alert escalation patterns

Custom Metrics

Business Metrics

// metrics/businessMetrics.ts
import { logger } from '../utils/logger';
// updateMetric(name, value) is assumed to be provided by the project's metrics store.

export const trackDocumentProcessing = (documentId: string, processingTime: number) => {
  logger.info('Document Processing Complete', {
    documentId,
    processingTime,
    timestamp: new Date().toISOString()
  });
  
  // Count the document and record its processing time as a sample for the average
  updateMetric('documents_processed', 1);
  updateMetric('avg_processing_time', processingTime);
};

export const trackUserActivity = (userId: string, action: string) => {
  logger.info('User Activity', {
    userId,
    action,
    timestamp: new Date().toISOString()
  });
  
  // Count the action overall and per action type
  updateMetric('user_actions', 1);
  updateMetric(`action_${action}`, 1);
};
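The tracking functions above call `updateMetric`, which this guide does not define. One possible in-memory implementation is sketched below, assuming each metric keeps a running count and sum so that both totals and averages can be read back. `MetricState`, `metrics`, and `readMetric` are hypothetical names. State held this way is per process and lost on restart, so a production setup would more likely forward values to a metrics backend.

```typescript
interface MetricState {
  count: number; // number of samples recorded
  sum: number;   // sum of all recorded values
}

// Hypothetical in-memory store, keyed by metric name.
const metrics = new Map<string, MetricState>();

function updateMetric(name: string, value: number): void {
  const state = metrics.get(name) ?? { count: 0, sum: 0 };
  state.count += 1;
  state.sum += value;
  metrics.set(name, state);
}

// Returns the total and the mean of everything recorded under `name`.
function readMetric(name: string): { total: number; average: number } {
  const state = metrics.get(name);
  if (!state || state.count === 0) {
    return { total: 0, average: 0 };
  }
  return { total: state.sum, average: state.sum / state.count };
}
```

Under this scheme, `readMetric('documents_processed').total` gives the document count and `readMetric('avg_processing_time').average` gives the mean processing time.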

🔔 Alert Configuration

Alert Rules

Critical Alerts

// alerts/criticalAlerts.ts
export const criticalAlertRules = {
  systemDown: {
    condition: 'health_check_fails > 3',
    action: 'send_critical_alert',
    message: 'System is down - immediate action required'
  },
  
  authFailure: {
    condition: 'auth_error_rate > 10%',
    action: 'send_critical_alert',
    message: 'Authentication service failing'
  },
  
  databaseDown: {
    condition: 'db_connection_fails > 5',
    action: 'send_critical_alert',
    message: 'Database connection failed'
  }
};
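Conditions such as `health_check_fails > 3` need a counter behind them. The sketch below assumes the rule means consecutive failures, which the guide does not state explicitly. `ConsecutiveFailureTracker` is a hypothetical helper that fires once when the failure streak first exceeds the threshold and resets on the next success.

```typescript
class ConsecutiveFailureTracker {
  private failures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Records one check result and returns true when an alert should be sent.
  record(success: boolean): boolean {
    if (success) {
      // Any success ends the streak
      this.failures = 0;
      return false;
    }
    this.failures += 1;
    // Fire only once, on the first failure that exceeds the threshold,
    // so a long outage does not produce an alert on every check
    return this.failures === this.threshold + 1;
  }
}
```

For the `systemDown` rule, a tracker created with a threshold of 3 would trigger the critical alert on the fourth failed health check in a row.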

Warning Alerts

// alerts/warningAlerts.ts
export const warningAlertRules = {
  highErrorRate: {
    condition: 'error_rate > 5%',
    action: 'send_warning_alert',
    message: 'High error rate detected'
  },
  
  slowResponse: {
    condition: 'avg_response_time > 3000ms',
    action: 'send_warning_alert',
    message: 'API response time degraded'
  },
  
  highResourceUsage: {
    condition: 'cpu_usage > 80% OR memory_usage > 85%',
    action: 'send_warning_alert',
    message: 'High resource usage detected'
  }
};
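The rule objects above express conditions as strings, which something must parse and evaluate. An alternative sketch, assuming the metrics are available as a typed snapshot, replaces each string with a predicate function so the compiler checks the metric names. `MetricsSnapshot`, `AlertRule`, `warningRules`, and `evaluateRules` are hypothetical names; the thresholds are copied from the warning rules above.

```typescript
// Hypothetical snapshot of current metric values.
interface MetricsSnapshot {
  errorRate: number;       // %
  avgResponseTime: number; // ms
  cpuUsage: number;        // %
  memoryUsage: number;     // %
}

interface AlertRule {
  name: string;
  message: string;
  isTriggered: (m: MetricsSnapshot) => boolean;
}

const warningRules: AlertRule[] = [
  {
    name: 'highErrorRate',
    message: 'High error rate detected',
    isTriggered: (m) => m.errorRate > 5,
  },
  {
    name: 'slowResponse',
    message: 'API response time degraded',
    isTriggered: (m) => m.avgResponseTime > 3000,
  },
  {
    name: 'highResourceUsage',
    message: 'High resource usage detected',
    isTriggered: (m) => m.cpuUsage > 80 || m.memoryUsage > 85,
  },
];

// Returns every rule whose condition holds for the given snapshot.
function evaluateRules(rules: AlertRule[], snapshot: MetricsSnapshot): AlertRule[] {
  return rules.filter((rule) => rule.isTriggered(snapshot));
}
```

A scheduler could call `evaluateRules` at a fixed interval and pass each triggered rule's message to the warning alert handler.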

Alert Actions

Alert Handlers

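// alerts/alertHandlers.ts
import { logger } from '../utils/logger';
// sendEmailAlert, sendSlackAlert, sendPagerDutyAlert, and updateDashboard are
// assumed to be the project's channel-specific senders.

type AlertDetails = Record<string, unknown>;

// Log every channel that failed to deliver the alert
const logFailures = (title: string, results: PromiseSettledResult<unknown>[]) => {
  results.forEach(result => {
    if (result.status === 'rejected') {
      logger.error('Alert delivery failed', { title, reason: String(result.reason) });
    }
  });
};

export const sendCriticalAlert = async (title: string, details: AlertDetails) => {
  // Send to multiple channels. allSettled waits for every channel and never
  // rejects, so one failing channel does not mask the others.
  const results = await Promise.allSettled([
    sendEmailAlert(title, details),
    sendSlackAlert(title, details),
    sendPagerDutyAlert(title, details)
  ]);
  logFailures(title, results);
  
  logger.error('Critical Alert Sent', { title, details });
};

export const sendWarningAlert = async (title: string, details: AlertDetails) => {
  // Send to monitoring channels
  const results = await Promise.allSettled([
    sendSlackAlert(title, details),
    updateDashboard(title, details)
  ]);
  logFailures(title, results);
  
  logger.warn('Warning Alert Sent', { title, details });
};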

📋 Operational Procedures

Incident Response

Critical Incident Response

  1. Immediate Assessment

    • Check system health endpoints
    • Review recent error logs
    • Assess impact on users
  2. Communication

    • Send immediate alert to operations team
    • Update status page
    • Notify stakeholders
  3. Investigation

    • Analyze error logs and metrics
    • Identify root cause
    • Implement immediate fix
  4. Resolution

    • Deploy fix or rollback
    • Verify system recovery
    • Document incident

Post-Incident Review

  1. Incident Documentation

    • Timeline of events
    • Root cause analysis
    • Actions taken
    • Lessons learned
  2. Process Improvement

    • Update monitoring rules
    • Improve alert thresholds
    • Enhance response procedures

Maintenance Procedures

Scheduled Maintenance

  1. Pre-Maintenance

    • Notify users in advance
    • Prepare rollback plan
    • Set maintenance mode
  2. During Maintenance

    • Monitor system health
    • Track maintenance progress
    • Handle any issues
  3. Post-Maintenance

    • Verify system functionality
    • Remove maintenance mode
    • Update documentation

🔧 Monitoring Tools

Application Monitoring

  • Winston: Structured logging
  • Custom Metrics: Business-specific metrics
  • Health Checks: Service availability monitoring

Infrastructure Monitoring

  • Google Cloud Monitoring: Cloud resource monitoring
  • Firebase Console: Firebase service monitoring
  • Supabase Dashboard: Database monitoring

Alert Management

  • Slack: Team notifications
  • Email: Critical alerts
  • PagerDuty: Incident escalation
  • Custom Dashboard: Real-time monitoring

Implementation Checklist

Setup Phase

  • Configure structured logging
  • Implement health checks
  • Set up alert rules
  • Create monitoring dashboard
  • Configure alert channels

Operational Phase

  • Monitor system metrics
  • Review alert effectiveness
  • Update alert thresholds
  • Document incidents
  • Improve procedures

📈 Performance Optimization

Monitoring-Driven Optimization

Performance Analysis

  • Identify Bottlenecks: Use metrics to find slow operations
  • Resource Optimization: Monitor resource usage patterns
  • Capacity Planning: Use trends to plan for growth

Continuous Improvement

  • Alert Tuning: Adjust thresholds based on patterns
  • Process Optimization: Streamline operational procedures
  • Tool Enhancement: Improve monitoring tools and dashboards

This guide provides the foundation for monitoring the CIM Document Processor, with the goals of high availability and fast response to issues.