cim_summary/backend/GCS_INTEGRATION_README.md
Commit 6057d1d7fd (Jon): 🔧 Fix authentication and document upload issues
## What was done:
- Fixed Firebase Admin initialization to use default credentials for Firebase Functions
- Updated frontend to use the correct Firebase Functions URL (was using the Cloud Run URL)
- Added comprehensive debugging to the authentication middleware
- Added debugging to the file upload middleware and CORS handling
- Added debug buttons to the frontend for troubleshooting authentication
- Enhanced error handling and logging throughout the stack

## Current issues:
- Document upload still returns 400 Bad Request despite authentication working
- GET requests work fine (200 OK) but POST upload requests fail
- Frontend authentication is working correctly (valid JWT tokens)
- Backend authentication middleware is working (rejects invalid tokens)
- CORS is configured correctly and allowing requests

## Root cause analysis:
- Authentication is NOT the issue (tokens are valid, GET requests work)
- The problem appears to be in the file upload handling or multer configuration
- Request reaches the server but fails during upload processing
- Need to identify exactly where in the upload pipeline the failure occurs

## TODO next steps:
1. 🔍 Check Firebase Functions logs after next upload attempt to see debugging output
2. 🔍 Verify whether the request reaches the upload middleware (look for 'Upload middleware called' logs)
3. 🔍 Check if file validation is triggered (look for '🔍 File filter called' logs)
4. 🔍 Identify specific error in upload pipeline (multer, file processing, etc.)
5. 🔍 Test with smaller file or different file type to isolate issue
6. 🔍 Check if issue is with Firebase Functions file size limits or timeout
7. 🔍 Verify multer configuration and file handling in Firebase Functions environment

## Technical details:
- Frontend: https://cim-summarizer.web.app
- Backend: https://us-central1-cim-summarizer.cloudfunctions.net/api
- Authentication: Firebase Auth with JWT tokens (working correctly)
- File upload: Multer with memory storage for immediate GCS upload
- Debug buttons available in production frontend for troubleshooting
2025-07-31 16:18:53 -04:00


# Google Cloud Storage Integration
This document describes the Google Cloud Storage (GCS) integration implementation for the CIM Document Processor backend.
## Overview
The GCS integration replaces the previous local file storage system with a cloud-only approach using Google Cloud Storage. This provides:
- **Scalability**: No local storage limitations
- **Reliability**: Google's infrastructure with 99.9%+ availability
- **Security**: IAM-based access control and encryption
- **Cost-effectiveness**: Pay only for what you use
- **Global access**: Files accessible from anywhere
## Configuration
### Environment Variables
The following environment variables are required for GCS integration:
```bash
# Google Cloud Configuration
GCLOUD_PROJECT_ID=your-project-id
GCS_BUCKET_NAME=your-bucket-name
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json
```
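Since a missing variable typically surfaces later as an opaque GCS error, it can help to validate this configuration at startup. A minimal sketch (the helper names are illustrative, not part of the service):

```typescript
// Fail fast at startup if required GCS configuration is missing.
// Variable names match the table above; requireEnv/loadGcsConfig are illustrative.
const REQUIRED_VARS = [
  'GCLOUD_PROJECT_ID',
  'GCS_BUCKET_NAME',
  'GOOGLE_APPLICATION_CREDENTIALS',
] as const;

function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadGcsConfig() {
  const [projectId, bucketName, keyFilename] = REQUIRED_VARS.map((n) => requireEnv(n));
  return { projectId, bucketName, keyFilename };
}
```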
### Service Account Setup
1. Create a service account in Google Cloud Console
2. Grant the following roles:
- `Storage Object Admin` (full control of objects in the bucket)
- `Storage Object Viewer` (for read-only access if needed)
3. Download the JSON key file as `serviceAccountKey.json`
4. Place it in the `backend/` directory
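The steps above can also be done from the CLI. An illustrative sketch with `gcloud` (the account, project, and bucket names are placeholders):

```shell
# Create the service account (names are placeholders)
gcloud iam service-accounts create cim-storage \
  --project=your-project-id \
  --display-name="CIM storage service account"

# Grant object admin on the bucket only (least privilege)
gcloud storage buckets add-iam-policy-binding gs://your-bucket-name \
  --member="serviceAccount:cim-storage@your-project-id.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

# Download the JSON key into backend/
gcloud iam service-accounts keys create backend/serviceAccountKey.json \
  --iam-account=cim-storage@your-project-id.iam.gserviceaccount.com
```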
### Bucket Configuration
1. Create a GCS bucket in your Google Cloud project
2. Configure bucket settings:
- **Location**: Choose a region close to your users
- **Storage class**: Standard (for frequently accessed files)
- **Access control**: Uniform bucket-level access (recommended)
- **Public access**: Prevent public access (files are private by default)
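These bucket settings can be applied in one `gcloud` command; a sketch with placeholder names and an assumed region:

```shell
# Create a private, uniformly-controlled Standard bucket (names/region are placeholders)
gcloud storage buckets create gs://your-bucket-name \
  --project=your-project-id \
  --location=us-central1 \
  --default-storage-class=STANDARD \
  --uniform-bucket-level-access \
  --public-access-prevention
```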
## Implementation Details
### File Storage Service
The `FileStorageService` class provides the following operations:
#### Core Operations
- **Upload**: `storeFile(file, userId)` - Upload files to GCS with metadata
- **Download**: `getFile(filePath)` - Download files from GCS
- **Delete**: `deleteFile(filePath)` - Delete files from GCS
- **Exists**: `fileExists(filePath)` - Check if file exists
- **Info**: `getFileInfo(filePath)` - Get file metadata and info
#### Advanced Operations
- **List**: `listFiles(prefix, maxResults)` - List files with prefix filtering
- **Copy**: `copyFile(sourcePath, destinationPath)` - Copy files within GCS
- **Move**: `moveFile(sourcePath, destinationPath)` - Move files within GCS
- **Signed URLs**: `generateSignedUrl(filePath, expirationMinutes)` - Generate temporary access URLs
- **Cleanup**: `cleanupOldFiles(prefix, daysOld)` - Remove old files
- **Stats**: `getStorageStats(prefix)` - Get storage statistics
#### Error Handling & Retry Logic
- **Exponential backoff**: Retries with increasing delays (1s, 2s, 4s)
- **Configurable retries**: Default 3 attempts per operation
- **Comprehensive logging**: All operations logged with context
- **Graceful failures**: Operations return null/false on failure instead of throwing
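The retry behavior described above can be sketched as a small wrapper. This is a minimal illustration of the pattern, not the service's actual implementation:

```typescript
// Retry an async operation with exponential backoff: delays double from
// baseDelayMs (1s, 2s, 4s with the defaults described above). On exhaustion
// it returns null instead of throwing, matching the "graceful failures" rule.
async function withRetries<T>(
  operation: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000
): Promise<T | null> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await operation();
    } catch (err) {
      console.warn(`Attempt ${i + 1}/${attempts} failed:`, err);
      if (i === attempts - 1) break; // out of retries
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  return null; // graceful failure instead of throwing
}
```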
### File Organization
Files are organized in GCS using the following structure:
```
bucket-name/
├── uploads/
│   ├── user-id-1/
│   │   ├── timestamp-filename1.pdf
│   │   └── timestamp-filename2.pdf
│   └── user-id-2/
│       └── timestamp-filename3.pdf
└── processed/
    ├── user-id-1/
    │   └── processed-files/
    └── user-id-2/
        └── processed-files/
```
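A path builder matching this layout might look like the following sketch (the helper name and sanitization rule are illustrative assumptions):

```typescript
// Build an object path of the form uploads/<userId>/<timestamp>-<filename>.
// The timestamp prefix keeps repeated uploads of the same name from colliding.
function buildUploadPath(userId: string, originalName: string): string {
  // Illustrative sanitization: keep word characters, dots, and hyphens.
  const safeName = originalName.replace(/[^\w.\-]/g, '_');
  return `uploads/${userId}/${Date.now()}-${safeName}`;
}
```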
### File Metadata
Each uploaded file includes metadata:
```json
{
  "originalName": "document.pdf",
  "userId": "user-123",
  "uploadedAt": "2024-01-15T10:30:00Z",
  "size": "1048576"
}
```
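Note that GCS custom metadata values are strings, which is why `size` is stringified above. A sketch of building this payload (the helper name is illustrative; in `@google-cloud/storage` the result would be passed under `metadata.metadata` when saving the file):

```typescript
// Build the custom-metadata payload shown above. GCS custom metadata values
// must be strings, so the numeric byte size is stringified.
function buildFileMetadata(originalName: string, userId: string, sizeBytes: number) {
  return {
    originalName,
    userId,
    uploadedAt: new Date().toISOString(),
    size: String(sizeBytes),
  };
}
```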
## Usage Examples
### Basic File Operations
```typescript
import { fileStorageService } from '../services/fileStorageService';

// Upload a file
const uploadResult = await fileStorageService.storeFile(file, userId);
if (uploadResult.success) {
  console.log('File uploaded:', uploadResult.fileInfo);
}

// Download a file
const fileBuffer = await fileStorageService.getFile(gcsPath);
if (fileBuffer) {
  // Process the file buffer
}

// Delete a file
const deleted = await fileStorageService.deleteFile(gcsPath);
if (deleted) {
  console.log('File deleted successfully');
}
```
### Advanced Operations
```typescript
// List the user's files
const userFiles = await fileStorageService.listFiles(`uploads/${userId}/`);

// Generate a signed URL valid for 60 minutes
const signedUrl = await fileStorageService.generateSignedUrl(gcsPath, 60);

// Copy a file to the processed directory
await fileStorageService.copyFile(
  `uploads/${userId}/original.pdf`,
  `processed/${userId}/processed.pdf`
);

// Get storage statistics
const stats = await fileStorageService.getStorageStats(`uploads/${userId}/`);
console.log(`User has ${stats.totalFiles} files, ${stats.totalSize} bytes total`);
```
## Testing
### Running Integration Tests
```bash
# Test GCS integration
npm run test:gcs
```
The test script performs the following operations:
1. **Connection Test**: Verifies GCS bucket access
2. **Upload Test**: Uploads a test file
3. **Existence Check**: Verifies file exists
4. **Metadata Retrieval**: Gets file information
5. **Download Test**: Downloads and verifies content
6. **Signed URL**: Generates temporary access URL
7. **Copy/Move**: Tests file operations
8. **Listing**: Lists files in directory
9. **Statistics**: Gets storage stats
10. **Cleanup**: Removes test files
### Manual Testing
```typescript
// Test the connection
const connected = await fileStorageService.testConnection();
console.log('GCS connected:', connected);

// Test with a mock file object
const mockFile = {
  originalname: 'test.pdf',
  filename: 'test.pdf',
  path: '/path/to/local/file.pdf',
  size: 1024,
  mimetype: 'application/pdf'
};
const result = await fileStorageService.storeFile(mockFile, 'test-user');
```
## Security Considerations
### Access Control
- **Service Account**: Uses least-privilege service account
- **Bucket Permissions**: Files are private by default
- **Signed URLs**: Temporary access for specific files
- **User Isolation**: Files organized by user ID
### Data Protection
- **Encryption**: GCS provides encryption at rest and in transit
- **Metadata**: Keep sensitive information out of object metadata
- **Cleanup**: Automatic cleanup of old files
- **Audit Logging**: All operations logged for audit
## Performance Optimization
### Upload Optimization
- **Resumable Uploads**: Large files can be resumed if interrupted
- **Parallel Uploads**: Multiple files can be uploaded simultaneously
- **Chunked Uploads**: Large files uploaded in chunks
### Download Optimization
- **Streaming**: Files can be streamed instead of loaded entirely into memory
- **Caching**: Consider implementing client-side caching
- **CDN**: Use Cloud CDN for frequently accessed files
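The streaming pattern can be sketched with Node's stream utilities. In the real service the source would be `bucket.file(gcsPath).createReadStream()` from `@google-cloud/storage`; here the source is a parameter so the pattern stands on its own:

```typescript
import { Readable, Writable } from 'stream';
import { pipeline } from 'stream/promises';

// Stream a file to any writable sink (e.g. an Express response) without
// buffering the whole object in memory. pipeline() handles backpressure
// and propagates errors from either side.
async function streamFile(source: Readable, sink: Writable): Promise<void> {
  await pipeline(source, sink);
}
```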
## Monitoring and Logging
### Log Levels
- **INFO**: Successful operations
- **WARN**: Retry attempts and non-critical issues
- **ERROR**: Failed operations and critical issues
### Metrics to Monitor
- **Upload Success Rate**: Percentage of successful uploads
- **Download Latency**: Time to download files
- **Storage Usage**: Total storage and file count
- **Error Rates**: Failed operations by type
## Troubleshooting
### Common Issues
1. **Authentication Errors**
- Verify service account key file exists
- Check service account permissions
- Ensure project ID is correct
2. **Bucket Access Errors**
- Verify bucket exists
- Check bucket permissions
- Ensure bucket name is correct
3. **Upload Failures**
- Check file size limits
- Verify network connectivity
- Review error logs for specific issues
4. **Download Failures**
- Verify file exists in GCS
- Check file permissions
- Review network connectivity
### Debug Commands
```bash
# Test GCS connection
npm run test:gcs

# Check environment variables
echo $GCLOUD_PROJECT_ID
echo $GCS_BUCKET_NAME

# Verify service account
gcloud auth activate-service-account --key-file=serviceAccountKey.json
```
## Migration from Local Storage
### Migration Steps
1. **Backup**: Ensure all local files are backed up
2. **Upload**: Upload existing files to GCS
3. **Update Paths**: Update database records with GCS paths
4. **Test**: Verify all operations work with GCS
5. **Cleanup**: Remove local files after verification
### Migration Script
```typescript
// Example migration script (getLocalFiles and updateDatabaseRecord are app-specific helpers)
async function migrateToGCS() {
  const localFiles = await getLocalFiles();
  for (const file of localFiles) {
    const uploadResult = await fileStorageService.storeFile(file, file.userId);
    if (uploadResult.success) {
      await updateDatabaseRecord(file.id, uploadResult.fileInfo);
    }
  }
}
```
## Cost Optimization
### Storage Classes
- **Standard**: For frequently accessed files
- **Nearline**: For files accessed less than once per month
- **Coldline**: For files accessed less than once per quarter
- **Archive**: For long-term storage
### Lifecycle Management
- **Automatic Cleanup**: Remove old files automatically
- **Storage Class Transitions**: Move files to cheaper storage classes
- **Compression**: Compress files before upload
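An illustrative lifecycle policy combining both ideas (the bucket name and age thresholds are placeholders, not the project's actual policy):

```shell
# Example policy: move objects to Nearline after 30 days, delete after 365.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    { "action": { "type": "SetStorageClass", "storageClass": "NEARLINE" },
      "condition": { "age": 30 } },
    { "action": { "type": "Delete" },
      "condition": { "age": 365 } }
  ]
}
EOF
gcloud storage buckets update gs://your-bucket-name --lifecycle-file=lifecycle.json
```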
## Future Enhancements
### Planned Features
- **Multi-region Support**: Distribute files across regions
- **Versioning**: File version control
- **Backup**: Automated backup to secondary bucket
- **Analytics**: Detailed usage analytics
- **Webhooks**: Notifications for file events
### Integration Opportunities
- **Cloud Functions**: Process files on upload
- **Cloud Run**: Serverless file processing
- **BigQuery**: Analytics on file metadata
- **Cloud Logging**: Centralized logging
- **Cloud Monitoring**: Performance monitoring