# Google Cloud Storage Integration

This document describes the Google Cloud Storage (GCS) integration implementation for the CIM Document Processor backend.

## Overview

The GCS integration replaces the previous local file storage system with a cloud-only approach using Google Cloud Storage. This provides:

- **Scalability**: No local storage limitations
- **Reliability**: Google's infrastructure with 99.9%+ availability
- **Security**: IAM-based access control and encryption
- **Cost-effectiveness**: Pay only for what you use
- **Global access**: Files accessible from anywhere

## Configuration

### Environment Variables

The following environment variables are required for GCS integration:

```bash
# Google Cloud Configuration
GCLOUD_PROJECT_ID=your-project-id
GCS_BUCKET_NAME=your-bucket-name
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json
```

### Service Account Setup

1. Create a service account in Google Cloud Console
2. Grant the following roles:
   - `Storage Object Admin` (for full bucket access)
   - `Storage Object Viewer` (for read-only access if needed)
3. Download the JSON key file as `serviceAccountKey.json`
4. Place it in the `backend/` directory

### Bucket Configuration

1. Create a GCS bucket in your Google Cloud project
2. Configure bucket settings:
   - **Location**: Choose a region close to your users
   - **Storage class**: Standard (for frequently accessed files)
   - **Access control**: Uniform bucket-level access (recommended)
   - **Public access**: Prevent public access (files are private by default)

## Implementation Details

### File Storage Service

The `FileStorageService` class provides the following operations:

#### Core Operations

- **Upload**: `storeFile(file, userId)` - Upload files to GCS with metadata
- **Download**: `getFile(filePath)` - Download files from GCS
- **Delete**: `deleteFile(filePath)` - Delete files from GCS
- **Exists**: `fileExists(filePath)` - Check if a file exists
- **Info**: `getFileInfo(filePath)` - Get file metadata and info

#### Advanced Operations

- **List**: `listFiles(prefix, maxResults)` - List files with prefix filtering
- **Copy**: `copyFile(sourcePath, destinationPath)` - Copy files within GCS
- **Move**: `moveFile(sourcePath, destinationPath)` - Move files within GCS
- **Signed URLs**: `generateSignedUrl(filePath, expirationMinutes)` - Generate temporary access URLs
- **Cleanup**: `cleanupOldFiles(prefix, daysOld)` - Remove old files
- **Stats**: `getStorageStats(prefix)` - Get storage statistics

#### Error Handling & Retry Logic

- **Exponential backoff**: Retries with increasing delays (1s, 2s, 4s)
- **Configurable retries**: Default 3 attempts per operation
- **Comprehensive logging**: All operations logged with context
- **Graceful failures**: Operations return null/false on failure instead of throwing

### File Organization

Files are organized in GCS using the following structure:

```
bucket-name/
├── uploads/
│   ├── user-id-1/
│   │   ├── timestamp-filename1.pdf
│   │   └── timestamp-filename2.pdf
│   └── user-id-2/
│       └── timestamp-filename3.pdf
└── processed/
    ├── user-id-1/
    │   └── processed-files/
    └── user-id-2/
        └── processed-files/
```

### File Metadata

Each uploaded file includes metadata:

```json
{
  "originalName": "document.pdf",
  "userId": "user-123",
  "uploadedAt": "2024-01-15T10:30:00Z",
  "size": "1048576"
}
```

## Usage Examples

### Basic File Operations

```typescript
import { fileStorageService } from '../services/fileStorageService';

// Upload a file
const uploadResult = await fileStorageService.storeFile(file, userId);
if (uploadResult.success) {
  console.log('File uploaded:', uploadResult.fileInfo);
}

// Download a file
const fileBuffer = await fileStorageService.getFile(gcsPath);
if (fileBuffer) {
  // Process the file buffer
}

// Delete a file
const deleted = await fileStorageService.deleteFile(gcsPath);
if (deleted) {
  console.log('File deleted successfully');
}
```

### Advanced Operations

```typescript
// List the user's files
const userFiles = await fileStorageService.listFiles(`uploads/${userId}/`);

// Generate a signed URL for temporary access
const signedUrl = await fileStorageService.generateSignedUrl(gcsPath, 60);

// Copy a file to the processed directory
await fileStorageService.copyFile(
  `uploads/${userId}/original.pdf`,
  `processed/${userId}/processed.pdf`
);

// Get storage statistics
const stats = await fileStorageService.getStorageStats(`uploads/${userId}/`);
console.log(`User has ${stats.totalFiles} files, ${stats.totalSize} bytes total`);
```

## Testing

### Running Integration Tests

```bash
# Test GCS integration
npm run test:gcs
```

The test script performs the following operations:

1. **Connection Test**: Verifies GCS bucket access
2. **Upload Test**: Uploads a test file
3. **Existence Check**: Verifies the file exists
4. **Metadata Retrieval**: Gets file information
5. **Download Test**: Downloads and verifies content
6. **Signed URL**: Generates a temporary access URL
7. **Copy/Move**: Tests file operations
8. **Listing**: Lists files in a directory
9. **Statistics**: Gets storage stats
10. **Cleanup**: Removes test files
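The same upload → exists → download → cleanup sequence can be rehearsed locally without a real bucket. The sketch below is a hypothetical in-memory stand-in (not the real `FileStorageService`) whose method names mirror the service; it is only meant to illustrate the call sequence the test script exercises.

```typescript
// Hypothetical in-memory stand-in for the storage service. Method names
// mirror fileStorageService, but the backing store is just a Map, so the
// sequence below runs anywhere with no GCS credentials.
class InMemoryStorage {
  private files = new Map<string, Buffer>();

  async storeFile(content: Buffer, userId: string, name: string) {
    // Mirrors the uploads/<userId>/<timestamp>-<name> layout described above
    const path = `uploads/${userId}/${Date.now()}-${name}`;
    this.files.set(path, content);
    return { success: true, path };
  }

  async fileExists(path: string): Promise<boolean> {
    return this.files.has(path);
  }

  async getFile(path: string): Promise<Buffer | null> {
    return this.files.get(path) ?? null;
  }

  async deleteFile(path: string): Promise<boolean> {
    return this.files.delete(path);
  }
}

// Upload -> existence check -> download -> cleanup, as in the test script
async function runSequence() {
  const storage = new InMemoryStorage();
  const { path } = await storage.storeFile(Buffer.from('hello'), 'user-1', 'test.pdf');
  const exists = await storage.fileExists(path);   // true after upload
  const content = await storage.getFile(path);     // round-trips the bytes
  const deleted = await storage.deleteFile(path);  // cleanup succeeds
  const gone = await storage.fileExists(path);     // false after cleanup
  return { exists, content: content?.toString(), deleted, gone };
}
```

Against the real service the same calls hit GCS, so each step can fail and should be checked (the service returns null/false rather than throwing).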
### Manual Testing

```typescript
// Test the connection
const connected = await fileStorageService.testConnection();
console.log('GCS connected:', connected);

// Test with a real file
const mockFile = {
  originalname: 'test.pdf',
  filename: 'test.pdf',
  path: '/path/to/local/file.pdf',
  size: 1024,
  mimetype: 'application/pdf'
};
const result = await fileStorageService.storeFile(mockFile, 'test-user');
```

## Security Considerations

### Access Control

- **Service Account**: Uses a least-privilege service account
- **Bucket Permissions**: Files are private by default
- **Signed URLs**: Temporary access for specific files
- **User Isolation**: Files organized by user ID

### Data Protection

- **Encryption**: GCS provides encryption at rest and in transit
- **Metadata**: Sensitive information is stored in object metadata
- **Cleanup**: Automatic cleanup of old files
- **Audit Logging**: All operations logged for audit

## Performance Optimization

### Upload Optimization

- **Resumable Uploads**: Large files can be resumed if interrupted
- **Parallel Uploads**: Multiple files can be uploaded simultaneously
- **Chunked Uploads**: Large files are uploaded in chunks

### Download Optimization

- **Streaming**: Files can be streamed instead of loaded entirely into memory
- **Caching**: Consider implementing client-side caching
- **CDN**: Use Cloud CDN for frequently accessed files

## Monitoring and Logging

### Log Levels

- **INFO**: Successful operations
- **WARN**: Retry attempts and non-critical issues
- **ERROR**: Failed operations and critical issues

### Metrics to Monitor

- **Upload Success Rate**: Percentage of successful uploads
- **Download Latency**: Time to download files
- **Storage Usage**: Total storage and file count
- **Error Rates**: Failed operations by type

## Troubleshooting

### Common Issues

1. **Authentication Errors**
   - Verify the service account key file exists
   - Check service account permissions
   - Ensure the project ID is correct
2. **Bucket Access Errors**
   - Verify the bucket exists
   - Check bucket permissions
   - Ensure the bucket name is correct
3. **Upload Failures**
   - Check file size limits
   - Verify network connectivity
   - Review error logs for specific issues
4. **Download Failures**
   - Verify the file exists in GCS
   - Check file permissions
   - Review network connectivity

### Debug Commands

```bash
# Test GCS connection
npm run test:gcs

# Check environment variables
echo $GCLOUD_PROJECT_ID
echo $GCS_BUCKET_NAME

# Verify service account
gcloud auth activate-service-account --key-file=serviceAccountKey.json
```

## Migration from Local Storage

### Migration Steps

1. **Backup**: Ensure all local files are backed up
2. **Upload**: Upload existing files to GCS
3. **Update Paths**: Update database records with GCS paths
4. **Test**: Verify all operations work with GCS
5. **Cleanup**: Remove local files after verification

### Migration Script

```typescript
// Example migration script
async function migrateToGCS() {
  const localFiles = await getLocalFiles();
  for (const file of localFiles) {
    const uploadResult = await fileStorageService.storeFile(file, file.userId);
    if (uploadResult.success) {
      await updateDatabaseRecord(file.id, uploadResult.fileInfo);
    }
  }
}
```

## Cost Optimization

### Storage Classes

- **Standard**: For frequently accessed files
- **Nearline**: For files accessed less than once per month
- **Coldline**: For files accessed less than once per quarter
- **Archive**: For long-term storage

### Lifecycle Management

- **Automatic Cleanup**: Remove old files automatically
- **Storage Class Transitions**: Move files to cheaper storage classes
- **Compression**: Compress files before upload

## Future Enhancements

### Planned Features

- **Multi-region Support**: Distribute files across regions
- **Versioning**: File version control
- **Backup**: Automated backup to a secondary bucket
- **Analytics**: Detailed usage analytics
- **Webhooks**: Notifications for file events

### Integration Opportunities

- **Cloud Functions**: Process files on upload
- **Cloud Run**: Serverless file processing
- **BigQuery**: Analytics on file metadata
- **Cloud Logging**: Centralized logging
- **Cloud Monitoring**: Performance monitoring
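Of the opportunities above, "Cloud Functions: Process files on upload" maps most directly onto the layout described under File Organization: a function triggered when an object is finalized could route new `uploads/` objects to the `processed/` prefix. The helper below is an illustrative sketch of that routing decision only (the function name and behavior are not part of the current service):

```typescript
// Illustrative routing rule for an on-upload trigger: objects under
// uploads/<userId>/ map to processed/<userId>/. Anything else, including
// objects already under processed/ (which would otherwise re-trigger the
// function and loop), is skipped by returning null.
function processedDestination(objectName: string): string | null {
  const match = objectName.match(/^uploads\/([^/]+)\/(.+)$/);
  if (!match) return null; // not a fresh upload; ignore
  const [, userId, fileName] = match;
  return `processed/${userId}/${fileName}`;
}
```

For example, `processedDestination('uploads/user-123/report.pdf')` yields `'processed/user-123/report.pdf'`, while any object already under `processed/` yields `null`. Guarding against retrigger loops like this is the main design concern when wiring storage triggers to functions that write back into the same bucket.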