# Monitoring and Alerting Guide
## Complete Monitoring Strategy for CIM Document Processor

### 🎯 Overview

This document provides guidance for monitoring and alerting in the CIM Document Processor, covering system health, performance metrics, error tracking, and operational alerts.

---
## 📊 Monitoring Architecture
### Monitoring Stack

- **Application Monitoring**: Custom logging with Winston
- **Infrastructure Monitoring**: Google Cloud Monitoring
- **Error Tracking**: Structured error logging
- **Performance Monitoring**: Custom metrics and timing
- **User Analytics**: Usage tracking and analytics

### Monitoring Layers

1. **Application Layer** - Service health and performance
2. **Infrastructure Layer** - Cloud resources and availability
3. **Business Layer** - User activity and document processing
4. **Security Layer** - Authentication and access patterns

---
## 🔍 Key Metrics to Monitor
### Application Performance Metrics

#### **Document Processing Metrics**

```typescript
interface ProcessingMetrics {
  uploadSuccessRate: number;  // % of successful uploads
  processingTime: number;     // Average processing time (ms)
  queueLength: number;        // Number of pending documents
  errorRate: number;          // % of processing errors
  throughput: number;         // Documents processed per hour
}
```

#### **API Performance Metrics**

```typescript
interface APIMetrics {
  responseTime: number;       // Average response time (ms)
  requestRate: number;        // Requests per minute
  errorRate: number;          // % of API errors
  activeConnections: number;  // Current active connections
  timeoutRate: number;        // % of request timeouts
}
```

#### **Storage Metrics**

```typescript
interface StorageMetrics {
  uploadSpeed: number;        // MB/s upload rate
  storageUsage: number;       // % of storage used
  fileCount: number;          // Total files stored
  retrievalTime: number;      // Average file retrieval time (ms)
  errorRate: number;          // % of storage errors
}
```
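These metric interfaces are derived values; they have to be computed from raw counters somewhere. A minimal sketch of that derivation, assuming hypothetical in-memory counters (`Counters`, `deriveProcessingMetrics`, and the field names are illustrative, not part of the codebase):

```typescript
// Hypothetical raw counters; a real deployment would keep these in a metrics
// library or push them to Cloud Monitoring rather than plain fields.
interface Counters {
  uploadsAttempted: number;
  uploadsSucceeded: number;
  processingErrors: number;
  documentsProcessed: number;
  totalProcessingMs: number;
}

// Derive ProcessingMetrics-style values from raw counters over a time window.
function deriveProcessingMetrics(c: Counters, windowHours: number) {
  const uploadSuccessRate =
    c.uploadsAttempted === 0 ? 100 : (c.uploadsSucceeded / c.uploadsAttempted) * 100;
  const errorRate =
    c.documentsProcessed === 0 ? 0 : (c.processingErrors / c.documentsProcessed) * 100;
  const processingTime =
    c.documentsProcessed === 0 ? 0 : c.totalProcessingMs / c.documentsProcessed;
  return {
    uploadSuccessRate,                              // %
    errorRate,                                      // %
    processingTime,                                 // average ms per document
    throughput: c.documentsProcessed / windowHours, // documents per hour
  };
}
```

Guarding the zero-denominator cases matters: a freshly started instance with no traffic should report a clean 100% success rate, not `NaN`.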
### Infrastructure Metrics

#### **Server Metrics**
- **CPU Usage**: Average and peak CPU utilization
- **Memory Usage**: RAM usage and garbage collection
- **Disk I/O**: Read/write operations and latency
- **Network I/O**: Bandwidth usage and connection count

#### **Database Metrics**
- **Connection Pool**: Active and idle connections
- **Query Performance**: Average query execution time
- **Storage Usage**: Database size and growth rate
- **Error Rate**: Database connection and query errors

#### **Cloud Service Metrics**
- **Firebase Auth**: Authentication success/failure rates
- **Firebase Storage**: Upload/download success rates
- **Supabase**: Database performance and connection health
- **Google Cloud**: Document AI processing metrics

---
## 🚨 Alerting Strategy
### Alert Severity Levels

#### **🔴 Critical Alerts**
**Immediate Action Required**
- System downtime or unavailability
- Authentication service failures
- Database connection failures
- Storage service failures
- Security breaches or suspicious activity

#### **🟡 Warning Alerts**
**Attention Required**
- High error rates (>5%)
- Performance degradation
- Resource usage approaching limits
- Unusual traffic patterns
- Service degradation

#### **🟢 Informational Alerts**
**Monitoring Only**
- Normal operational events
- Scheduled maintenance
- Performance improvements
- Usage statistics
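The three severity levels map naturally to a union type and a routing table. A sketch, assuming the channel names below are simple string tags (the real channel integrations live elsewhere):

```typescript
type Severity = 'critical' | 'warning' | 'info';

// Placeholder channel routing mirroring the severity levels above;
// the channel identifiers are illustrative tags, not real integrations.
const channelsFor: Record<Severity, string[]> = {
  critical: ['email', 'slack', 'pagerduty', 'sms'],
  warning: ['slack', 'dashboard'],
  info: ['dashboard'],
};

// Return the channels an alert of the given severity should fan out to.
function routeAlert(severity: Severity): string[] {
  return channelsFor[severity];
}
```

Keeping the mapping in one table makes it easy to audit which severities page a human and which only reach the dashboard.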
### Alert Channels

#### **Primary Channels**
- **Email**: Critical alerts to operations team
- **Slack**: Real-time notifications to development team
- **PagerDuty**: Escalation for critical issues
- **SMS**: Emergency alerts for system downtime

#### **Secondary Channels**
- **Dashboard**: Real-time monitoring dashboard
- **Logs**: Structured logging for investigation
- **Metrics**: Time-series data for trend analysis

---
## 📈 Monitoring Implementation
### Application Logging

#### **Structured Logging Setup**

```typescript
// utils/logger.ts
import winston from 'winston';

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: { service: 'cim-processor' },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
    new winston.transports.Console({
      format: winston.format.simple()
    })
  ]
});

// Export so middleware and routes share the same logger instance.
export default logger;
```
#### **Performance Monitoring**

```typescript
// middleware/performance.ts
import { Request, Response, NextFunction } from 'express';
import logger from '../utils/logger';

export const performanceMonitor = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - start;
    const { method, path } = req;
    const statusCode = res.statusCode; // the status lives on the response, not the request

    logger.info('API Request', {
      method,
      path,
      statusCode,
      duration,
      userAgent: req.get('User-Agent'),
      ip: req.ip
    });

    // Alert on slow requests
    if (duration > 5000) {
      logger.warn('Slow API Request', {
        method,
        path,
        duration,
        threshold: 5000
      });
    }
  });

  next();
};
```
#### **Error Tracking**

```typescript
// middleware/errorHandler.ts
import { Request, Response, NextFunction } from 'express';
import logger from '../utils/logger';
import { sendCriticalAlert } from '../alerts/alertHandlers';

// Express recognizes error middleware by its four-argument signature,
// so `next` must stay in the parameter list even though it is unused.
export const errorHandler = (error: Error, req: Request, res: Response, next: NextFunction) => {
  const errorInfo = {
    message: error.message,
    stack: error.stack,
    method: req.method,
    path: req.path,
    userAgent: req.get('User-Agent'),
    ip: req.ip,
    timestamp: new Date().toISOString()
  };

  logger.error('Application Error', errorInfo);

  // Alert on critical errors
  if (error.message.includes('Database connection failed') ||
      error.message.includes('Authentication failed')) {
    // Send critical alert
    sendCriticalAlert('System Error', errorInfo);
  }

  res.status(500).json({ error: 'Internal server error' });
};
```
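The substring checks in the handler can be factored into a small classifier, which keeps the list of critical patterns in one place and easy to extend. A sketch (`isCriticalError` is an illustrative helper; the two patterns mirror the handler above):

```typescript
// Message fragments that should escalate to a critical alert; extend as needed.
const CRITICAL_PATTERNS = [
  'Database connection failed',
  'Authentication failed',
];

// True if the error message matches any known critical pattern.
function isCriticalError(error: Error): boolean {
  return CRITICAL_PATTERNS.some((pattern) => error.message.includes(pattern));
}
```

Matching on message strings is fragile; if the codebase later adopts typed error classes, the classifier can switch to `instanceof` checks without touching the middleware.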
### Health Checks

#### **Application Health Check**

```typescript
// routes/health.ts
import { Router, Request, Response } from 'express';
import { checkDatabaseHealth, checkStorageHealth, checkAuthHealth, checkAIHealth } from '../utils/healthChecks';

const router = Router();

router.get('/health', async (req: Request, res: Response) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    services: {
      database: await checkDatabaseHealth(),
      storage: await checkStorageHealth(),
      auth: await checkAuthHealth(),
      ai: await checkAIHealth()
    }
  };

  const isHealthy = Object.values(health.services).every(service => service.status === 'healthy');
  health.status = isHealthy ? 'healthy' : 'unhealthy';

  res.status(isHealthy ? 200 : 503).json(health);
});

export default router;
```
#### **Service Health Checks**

```typescript
// utils/healthChecks.ts
import { supabase } from '../config/supabase'; // assumes a shared Supabase client module
import firebase from 'firebase-admin';

export const checkDatabaseHealth = async () => {
  try {
    const start = Date.now();
    await supabase.from('documents').select('count').limit(1);
    const responseTime = Date.now() - start;

    return {
      status: 'healthy',
      responseTime,
      timestamp: new Date().toISOString()
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      error: error instanceof Error ? error.message : String(error),
      timestamp: new Date().toISOString()
    };
  }
};

export const checkStorageHealth = async () => {
  try {
    const start = Date.now();
    await firebase.storage().bucket().getMetadata();
    const responseTime = Date.now() - start;

    return {
      status: 'healthy',
      responseTime,
      timestamp: new Date().toISOString()
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      error: error instanceof Error ? error.message : String(error),
      timestamp: new Date().toISOString()
    };
  }
};
```
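The individual checks all share the same timing/try-catch shape, so a wrapper can remove the duplication and guarantee every check reports the same fields. A sketch (`runHealthCheck` and `HealthResult` are illustrative names; the probe callbacks would be the actual service calls):

```typescript
// Discriminated union matching the result shape of the checks above.
type HealthResult =
  | { status: 'healthy'; responseTime: number; timestamp: string }
  | { status: 'unhealthy'; error: string; timestamp: string };

// Wrap any async probe in the shared timing and error-capture pattern.
async function runHealthCheck(probe: () => Promise<unknown>): Promise<HealthResult> {
  const start = Date.now();
  try {
    await probe();
    return {
      status: 'healthy',
      responseTime: Date.now() - start,
      timestamp: new Date().toISOString(),
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      error: error instanceof Error ? error.message : String(error),
      timestamp: new Date().toISOString(),
    };
  }
}
```

With this in place, `checkDatabaseHealth` reduces to `runHealthCheck(() => supabase.from('documents').select('count').limit(1))`, and adding a new service check is a one-liner.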
---

## 📊 Dashboard and Visualization

### Monitoring Dashboard

#### **Real-time Metrics**
- **System Status**: Overall system health indicator
- **Active Users**: Current number of active users
- **Processing Queue**: Number of documents in processing
- **Error Rate**: Current error percentage
- **Response Time**: Average API response time

#### **Performance Charts**
- **Throughput**: Documents processed over time
- **Error Trends**: Error rates over time
- **Resource Usage**: CPU, memory, and storage usage
- **User Activity**: User sessions and interactions

#### **Alert History**
- **Recent Alerts**: Last 24 hours of alerts
- **Alert Trends**: Alert frequency over time
- **Resolution Time**: Time to resolve issues
- **Escalation History**: Alert escalation patterns
### Custom Metrics

#### **Business Metrics**

```typescript
// metrics/businessMetrics.ts
import logger from '../utils/logger';

export const trackDocumentProcessing = (documentId: string, processingTime: number) => {
  logger.info('Document Processing Complete', {
    documentId,
    processingTime,
    timestamp: new Date().toISOString()
  });

  // Update metrics
  updateMetric('documents_processed', 1);
  updateMetric('avg_processing_time', processingTime);
};

export const trackUserActivity = (userId: string, action: string) => {
  logger.info('User Activity', {
    userId,
    action,
    timestamp: new Date().toISOString()
  });

  // Update metrics
  updateMetric('user_actions', 1);
  updateMetric(`action_${action}`, 1);
};
```
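`updateMetric` is used above but not defined anywhere in this guide. A minimal in-memory version for illustration (a real deployment would push to Cloud Monitoring or a metrics library instead; note that a plain accumulator cannot compute a true average for `avg_processing_time`, which would also need a sample count):

```typescript
// Naive in-memory counters; values are lost on restart, which is
// acceptable for a sketch but not for production metrics.
const metrics = new Map<string, number>();

// Add `value` to the named counter, creating it at zero if absent.
function updateMetric(name: string, value: number): void {
  metrics.set(name, (metrics.get(name) ?? 0) + value);
}

// Read a counter, defaulting to zero for unknown names.
function getMetric(name: string): number {
  return metrics.get(name) ?? 0;
}
```

The `Map` keeps arbitrary metric names (including the templated `action_${action}` keys above) without any registration step.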
---

## 🔔 Alert Configuration

### Alert Rules

#### **Critical Alerts**

```typescript
// alerts/criticalAlerts.ts
export const criticalAlertRules = {
  systemDown: {
    condition: 'health_check_fails > 3',
    action: 'send_critical_alert',
    message: 'System is down - immediate action required'
  },

  authFailure: {
    condition: 'auth_error_rate > 10%',
    action: 'send_critical_alert',
    message: 'Authentication service failing'
  },

  databaseDown: {
    condition: 'db_connection_fails > 5',
    action: 'send_critical_alert',
    message: 'Database connection failed'
  }
};
```
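The `condition` strings above are declarative and need an evaluator somewhere. One option is to keep rules in a structured form so no string parsing is required; a sketch (`AlertRule` and `evaluateRules` are illustrative, with thresholds mirroring two of the rules above):

```typescript
// Structured form of an alert rule: fire when `metric` exceeds `threshold`.
interface AlertRule {
  metric: string;
  threshold: number;
  message: string;
}

const rules: AlertRule[] = [
  { metric: 'health_check_fails', threshold: 3, message: 'System is down - immediate action required' },
  { metric: 'db_connection_fails', threshold: 5, message: 'Database connection failed' },
];

// Return the message of every rule whose current metric value exceeds
// its threshold; missing metrics are treated as zero.
function evaluateRules(current: Record<string, number>): string[] {
  return rules
    .filter((rule) => (current[rule.metric] ?? 0) > rule.threshold)
    .map((rule) => rule.message);
}
```

Percentage-based conditions such as `auth_error_rate > 10%` fit the same shape once the rate is stored as a number (e.g. `10` for 10%).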
#### **Warning Alerts**

```typescript
// alerts/warningAlerts.ts
export const warningAlertRules = {
  highErrorRate: {
    condition: 'error_rate > 5%',
    action: 'send_warning_alert',
    message: 'High error rate detected'
  },

  slowResponse: {
    condition: 'avg_response_time > 3000ms',
    action: 'send_warning_alert',
    message: 'API response time degraded'
  },

  highResourceUsage: {
    condition: 'cpu_usage > 80% OR memory_usage > 85%',
    action: 'send_warning_alert',
    message: 'High resource usage detected'
  }
};
```
### Alert Actions

#### **Alert Handlers**

```typescript
// alerts/alertHandlers.ts
import logger from '../utils/logger';
// Channel senders (sendEmailAlert, sendSlackAlert, sendPagerDutyAlert,
// updateDashboard) are implemented in their own channel modules.

export const sendCriticalAlert = async (title: string, details: any) => {
  // Send to multiple channels
  await Promise.all([
    sendEmailAlert(title, details),
    sendSlackAlert(title, details),
    sendPagerDutyAlert(title, details)
  ]);

  logger.error('Critical Alert Sent', { title, details });
};

export const sendWarningAlert = async (title: string, details: any) => {
  // Send to monitoring channels
  await Promise.all([
    sendSlackAlert(title, details),
    updateDashboard(title, details)
  ]);

  logger.warn('Warning Alert Sent', { title, details });
};
```
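The channel senders themselves are out of scope above. For Slack, one testable approach is to separate payload construction from delivery; a sketch assuming Slack's incoming-webhook message format (`buildSlackPayload` is an illustrative helper; the actual HTTP POST to the webhook URL is omitted so the sketch has no network dependency):

```typescript
// Build a Slack incoming-webhook payload for an alert; delivery would be
// a POST of this object to the configured webhook URL.
function buildSlackPayload(title: string, details: Record<string, unknown>) {
  return {
    text: `*${title}*`,
    attachments: [
      {
        color: '#d00000', // red bar for alert visibility
        text: '```' + JSON.stringify(details, null, 2) + '```',
      },
    ],
  };
}
```

Keeping the payload builder pure means alert formatting can be unit-tested without mocking HTTP, and the same builder can serve both critical and warning paths with a different `color`.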
---

## 📋 Operational Procedures

### Incident Response

#### **Critical Incident Response**
1. **Immediate Assessment**
   - Check system health endpoints
   - Review recent error logs
   - Assess impact on users

2. **Communication**
   - Send immediate alert to operations team
   - Update status page
   - Notify stakeholders

3. **Investigation**
   - Analyze error logs and metrics
   - Identify root cause
   - Implement immediate fix

4. **Resolution**
   - Deploy fix or rollback
   - Verify system recovery
   - Document incident

#### **Post-Incident Review**
1. **Incident Documentation**
   - Timeline of events
   - Root cause analysis
   - Actions taken
   - Lessons learned

2. **Process Improvement**
   - Update monitoring rules
   - Improve alert thresholds
   - Enhance response procedures

### Maintenance Procedures

#### **Scheduled Maintenance**
1. **Pre-Maintenance**
   - Notify users in advance
   - Prepare rollback plan
   - Set maintenance mode

2. **During Maintenance**
   - Monitor system health
   - Track maintenance progress
   - Handle any issues

3. **Post-Maintenance**
   - Verify system functionality
   - Remove maintenance mode
   - Update documentation

---
## 🔧 Monitoring Tools
### Recommended Tools

#### **Application Monitoring**
- **Winston**: Structured logging
- **Custom Metrics**: Business-specific metrics
- **Health Checks**: Service availability monitoring

#### **Infrastructure Monitoring**
- **Google Cloud Monitoring**: Cloud resource monitoring
- **Firebase Console**: Firebase service monitoring
- **Supabase Dashboard**: Database monitoring

#### **Alert Management**
- **Slack**: Team notifications
- **Email**: Critical alerts
- **PagerDuty**: Incident escalation
- **Custom Dashboard**: Real-time monitoring

### Implementation Checklist

#### **Setup Phase**
- [ ] Configure structured logging
- [ ] Implement health checks
- [ ] Set up alert rules
- [ ] Create monitoring dashboard
- [ ] Configure alert channels

#### **Operational Phase**
- [ ] Monitor system metrics
- [ ] Review alert effectiveness
- [ ] Update alert thresholds
- [ ] Document incidents
- [ ] Improve procedures
---

## 📈 Performance Optimization

### Monitoring-Driven Optimization

#### **Performance Analysis**
- **Identify Bottlenecks**: Use metrics to find slow operations
- **Resource Optimization**: Monitor resource usage patterns
- **Capacity Planning**: Use trends to plan for growth

#### **Continuous Improvement**
- **Alert Tuning**: Adjust thresholds based on observed patterns
- **Process Optimization**: Streamline operational procedures
- **Tool Enhancement**: Improve monitoring tools and dashboards

---

This guide provides the foundation for effective monitoring of the CIM Document Processor, supporting high availability and fast response to issues.