## What was done

- ✅ Fixed Firebase Admin initialization to use default credentials for Firebase Functions
- ✅ Updated frontend to use the correct Firebase Functions URL (it was using the Cloud Run URL)
- ✅ Added comprehensive debugging to the authentication middleware
- ✅ Added debugging to the file upload middleware and CORS handling
- ✅ Added debug buttons to the frontend for troubleshooting authentication
- ✅ Enhanced error handling and logging throughout the stack

## Current issues

- ❌ Document upload still returns 400 Bad Request despite authentication working
- ❌ GET requests work fine (200 OK) but POST upload requests fail

Ruled out:

- ✅ Frontend authentication is working correctly (valid JWT tokens)
- ✅ Backend authentication middleware is working (rejects invalid tokens)
- ✅ CORS is configured correctly and allowing requests

## Root cause analysis

- Authentication is NOT the issue (tokens are valid, GET requests work)
- The problem appears to be in the file upload handling or multer configuration
- The request reaches the server but fails during upload processing
- Need to identify exactly where in the upload pipeline the failure occurs

## TODO next steps

1. 🔍 Check Firebase Functions logs after the next upload attempt to see the debugging output
2. 🔍 Verify whether the request reaches the upload middleware (look for '🔍 Upload middleware called' logs)
3. 🔍 Check whether file validation is triggered (look for '🔍 File filter called' logs)
4. 🔍 Identify the specific error in the upload pipeline (multer, file processing, etc.)
5. 🔍 Test with a smaller file or a different file type to isolate the issue
6. 🔍 Check whether the issue is with Firebase Functions file size limits or timeouts
7. 🔍 Verify the multer configuration and file handling in the Firebase Functions environment

## Technical details

- Frontend: https://cim-summarizer.web.app
- Backend: https://us-central1-cim-summarizer.cloudfunctions.net/api
- Authentication: Firebase Auth with JWT tokens (working correctly)
- File upload: Multer with memory storage for immediate GCS upload
- Debug buttons available in the production frontend for troubleshooting

# Google Cloud Storage Integration

This document describes the Google Cloud Storage (GCS) integration implementation for the CIM Document Processor backend.

## Overview

The GCS integration replaces the previous local file storage system with a cloud-only approach using Google Cloud Storage. This provides:

- **Scalability**: No local storage limitations
- **Reliability**: Google's infrastructure with 99.9%+ availability
- **Security**: IAM-based access control and encryption
- **Cost-effectiveness**: Pay only for what you use
- **Global access**: Files accessible from anywhere

## Configuration

### Environment Variables

The following environment variables are required for GCS integration:

```bash
# Google Cloud Configuration
GCLOUD_PROJECT_ID=your-project-id
GCS_BUCKET_NAME=your-bucket-name
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json
```
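
A minimal sketch of consuming these variables at startup (assuming a Node.js backend; `loadGcsConfig` is a hypothetical helper, not part of the existing codebase):

```typescript
// Hypothetical startup helper: read the GCS settings above and fail fast
// when a required variable is missing. The resulting projectId/keyFilename
// pair matches the options accepted by `new Storage(...)` from
// @google-cloud/storage; bucketName is used separately to select the bucket.
interface GcsConfig {
  projectId: string;
  bucketName: string;
  keyFilename: string;
}

function loadGcsConfig(
  env: Record<string, string | undefined> = process.env
): GcsConfig {
  const projectId = env.GCLOUD_PROJECT_ID;
  const bucketName = env.GCS_BUCKET_NAME;
  // Default matches the path used in this project.
  const keyFilename =
    env.GOOGLE_APPLICATION_CREDENTIALS ?? './serviceAccountKey.json';
  if (!projectId || !bucketName) {
    throw new Error('GCLOUD_PROJECT_ID and GCS_BUCKET_NAME must be set');
  }
  return { projectId, bucketName, keyFilename };
}
```

Failing at startup when a variable is missing surfaces misconfiguration immediately, rather than as an opaque error on the first upload.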

### Service Account Setup

1. Create a service account in Google Cloud Console
2. Grant the following roles:
   - `Storage Object Admin` (for full bucket access)
   - `Storage Object Viewer` (for read-only access if needed)
3. Download the JSON key file as `serviceAccountKey.json`
4. Place it in the `backend/` directory

### Bucket Configuration

1. Create a GCS bucket in your Google Cloud project
2. Configure bucket settings:
   - **Location**: Choose a region close to your users
   - **Storage class**: Standard (for frequently accessed files)
   - **Access control**: Uniform bucket-level access (recommended)
   - **Public access**: Prevent public access (files are private by default)

## Implementation Details

### File Storage Service

The `FileStorageService` class provides the following operations:

#### Core Operations

- **Upload**: `storeFile(file, userId)` - Upload files to GCS with metadata
- **Download**: `getFile(filePath)` - Download files from GCS
- **Delete**: `deleteFile(filePath)` - Delete files from GCS
- **Exists**: `fileExists(filePath)` - Check whether a file exists
- **Info**: `getFileInfo(filePath)` - Get file metadata and info

#### Advanced Operations

- **List**: `listFiles(prefix, maxResults)` - List files with prefix filtering
- **Copy**: `copyFile(sourcePath, destinationPath)` - Copy files within GCS
- **Move**: `moveFile(sourcePath, destinationPath)` - Move files within GCS
- **Signed URLs**: `generateSignedUrl(filePath, expirationMinutes)` - Generate temporary access URLs
- **Cleanup**: `cleanupOldFiles(prefix, daysOld)` - Remove old files
- **Stats**: `getStorageStats(prefix)` - Get storage statistics

#### Error Handling & Retry Logic

- **Exponential backoff**: Retries with increasing delays (1s, 2s, 4s)
- **Configurable retries**: Default of 3 attempts per operation
- **Comprehensive logging**: All operations logged with context
- **Graceful failures**: Operations return `null`/`false` on failure instead of throwing
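
The retry pattern above can be sketched as a small wrapper (an illustration of the pattern, not the actual `FileStorageService` internals):

```typescript
// Delay before the (attempt+1)th retry: 1s, 2s, 4s with the defaults above.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Generic retry wrapper illustrating the "graceful failure" contract:
// after the final attempt the caller gets null instead of an exception.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 3
): Promise<T | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt === maxAttempts - 1) {
        console.error(`operation failed after ${maxAttempts} attempts`, err);
        return null;
      }
      // Wait 1s, then 2s, then 4s between attempts.
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  return null;
}
```

Callers then check for `null` rather than wrapping every storage call in `try/catch`.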

### File Organization

Files are organized in GCS using the following structure:

```
bucket-name/
├── uploads/
│   ├── user-id-1/
│   │   ├── timestamp-filename1.pdf
│   │   └── timestamp-filename2.pdf
│   └── user-id-2/
│       └── timestamp-filename3.pdf
└── processed/
    ├── user-id-1/
    │   └── processed-files/
    └── user-id-2/
        └── processed-files/
```
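
The `uploads/` naming in the tree above can be produced by a small helper (a sketch; the exact timestamp format used inside `FileStorageService` may differ):

```typescript
// Build the object path shown in the tree above:
// uploads/<userId>/<timestamp>-<originalName>
function buildUploadPath(
  userId: string,
  originalName: string,
  now: Date = new Date()
): string {
  return `uploads/${userId}/${now.getTime()}-${originalName}`;
}
```

Prefixing the timestamp keeps repeated uploads of the same filename from colliding, and the per-user prefix makes `listFiles(`uploads/${userId}/`)` a natural isolation boundary.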

### File Metadata

Each uploaded file includes metadata:

```json
{
  "originalName": "document.pdf",
  "userId": "user-123",
  "uploadedAt": "2024-01-15T10:30:00Z",
  "size": "1048576"
}
```
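
Assembling that metadata from a Multer-style file object might look like the sketch below (field names mirror the JSON above; `size` is kept as a string because GCS custom metadata values are strings):

```typescript
interface UploadMetadata {
  originalName: string;
  userId: string;
  uploadedAt: string;
  size: string;
}

// Build the custom metadata attached to each upload from a Multer-style
// file object (illustrative; not the exact FileStorageService code).
function buildMetadata(
  file: { originalname: string; size: number },
  userId: string,
  now: Date = new Date()
): UploadMetadata {
  return {
    originalName: file.originalname,
    userId,
    uploadedAt: now.toISOString(),
    size: String(file.size), // GCS custom metadata values are strings
  };
}
```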

## Usage Examples

### Basic File Operations

```typescript
import { fileStorageService } from '../services/fileStorageService';

// Upload a file
const uploadResult = await fileStorageService.storeFile(file, userId);
if (uploadResult.success) {
  console.log('File uploaded:', uploadResult.fileInfo);
}

// Download a file
const fileBuffer = await fileStorageService.getFile(gcsPath);
if (fileBuffer) {
  // Process the file buffer
}

// Delete a file
const deleted = await fileStorageService.deleteFile(gcsPath);
if (deleted) {
  console.log('File deleted successfully');
}
```

### Advanced Operations

```typescript
// List a user's files
const userFiles = await fileStorageService.listFiles(`uploads/${userId}/`);

// Generate a signed URL for temporary access (valid for 60 minutes)
const signedUrl = await fileStorageService.generateSignedUrl(gcsPath, 60);

// Copy a file to the processed directory
await fileStorageService.copyFile(
  `uploads/${userId}/original.pdf`,
  `processed/${userId}/processed.pdf`
);

// Get storage statistics
const stats = await fileStorageService.getStorageStats(`uploads/${userId}/`);
console.log(`User has ${stats.totalFiles} files, ${stats.totalSize} bytes total`);
```

## Testing

### Running Integration Tests

```bash
# Test GCS integration
npm run test:gcs
```

The test script performs the following operations:

1. **Connection Test**: Verifies GCS bucket access
2. **Upload Test**: Uploads a test file
3. **Existence Check**: Verifies the file exists
4. **Metadata Retrieval**: Gets file information
5. **Download Test**: Downloads and verifies content
6. **Signed URL**: Generates a temporary access URL
7. **Copy/Move**: Tests file operations
8. **Listing**: Lists files in a directory
9. **Statistics**: Gets storage stats
10. **Cleanup**: Removes test files
### Manual Testing

```typescript
// Test the connection
const connected = await fileStorageService.testConnection();
console.log('GCS connected:', connected);

// Test with a mock Multer-style file object
const mockFile = {
  originalname: 'test.pdf',
  filename: 'test.pdf',
  path: '/path/to/local/file.pdf',
  size: 1024,
  mimetype: 'application/pdf'
};

const result = await fileStorageService.storeFile(mockFile, 'test-user');
```

## Security Considerations

### Access Control

- **Service Account**: Uses a least-privilege service account
- **Bucket Permissions**: Files are private by default
- **Signed URLs**: Temporary access for specific files
- **User Isolation**: Files organized by user ID

### Data Protection

- **Encryption**: GCS provides encryption at rest and in transit
- **Metadata**: Only non-sensitive bookkeeping fields (original name, user ID, timestamps) are stored in object metadata
- **Cleanup**: Automatic cleanup of old files
- **Audit Logging**: All operations logged for audit
## Performance Optimization

### Upload Optimization

- **Resumable Uploads**: Large files can be resumed if interrupted
- **Parallel Uploads**: Multiple files can be uploaded simultaneously
- **Chunked Uploads**: Large files uploaded in chunks

### Download Optimization

- **Streaming**: Files can be streamed instead of loaded entirely into memory
- **Caching**: Consider implementing client-side caching
- **CDN**: Use Cloud CDN for frequently accessed files

## Monitoring and Logging

### Log Levels

- **INFO**: Successful operations
- **WARN**: Retry attempts and non-critical issues
- **ERROR**: Failed operations and critical issues

### Metrics to Monitor

- **Upload Success Rate**: Percentage of successful uploads
- **Download Latency**: Time to download files
- **Storage Usage**: Total storage and file count
- **Error Rates**: Failed operations by type
## Troubleshooting

### Common Issues

1. **Authentication Errors**
   - Verify the service account key file exists
   - Check service account permissions
   - Ensure the project ID is correct

2. **Bucket Access Errors**
   - Verify the bucket exists
   - Check bucket permissions
   - Ensure the bucket name is correct

3. **Upload Failures**
   - Check file size limits
   - Verify network connectivity
   - Review error logs for specific issues

4. **Download Failures**
   - Verify the file exists in GCS
   - Check file permissions
   - Review network connectivity

### Debug Commands

```bash
# Test GCS connection
npm run test:gcs

# Check environment variables
echo $GCLOUD_PROJECT_ID
echo $GCS_BUCKET_NAME

# Verify service account
gcloud auth activate-service-account --key-file=serviceAccountKey.json
```

## Migration from Local Storage

### Migration Steps

1. **Backup**: Ensure all local files are backed up
2. **Upload**: Upload existing files to GCS
3. **Update Paths**: Update database records with GCS paths
4. **Test**: Verify all operations work with GCS
5. **Cleanup**: Remove local files after verification

### Migration Script

```typescript
// Example migration script
async function migrateToGCS() {
  const localFiles = await getLocalFiles();

  for (const file of localFiles) {
    const uploadResult = await fileStorageService.storeFile(file, file.userId);
    if (uploadResult.success) {
      await updateDatabaseRecord(file.id, uploadResult.fileInfo);
    }
  }
}
```

## Cost Optimization

### Storage Classes

- **Standard**: For frequently accessed files
- **Nearline**: For files accessed less than once per month
- **Coldline**: For files accessed less than once per quarter
- **Archive**: For long-term storage

### Lifecycle Management

- **Automatic Cleanup**: Remove old files automatically
- **Storage Class Transitions**: Move files to cheaper storage classes
- **Compression**: Compress files before upload
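
The cleanup and transition policies above can be expressed declaratively on the bucket. One possible configuration (a sketch; the 30/90-day thresholds and the `uploads/` prefix are illustrative assumptions, not values used by this project):

```json
{
  "rule": [
    {
      "action": { "type": "SetStorageClass", "storageClass": "NEARLINE" },
      "condition": { "age": 30, "matchesPrefix": ["uploads/"] }
    },
    {
      "action": { "type": "Delete" },
      "condition": { "age": 90, "matchesPrefix": ["uploads/"] }
    }
  ]
}
```

Applied with `gsutil lifecycle set lifecycle.json gs://your-bucket-name`, this moves uploads to Nearline after 30 days and deletes them after 90, complementing the application-level `cleanupOldFiles` helper.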

## Future Enhancements

### Planned Features

- **Multi-region Support**: Distribute files across regions
- **Versioning**: File version control
- **Backup**: Automated backup to a secondary bucket
- **Analytics**: Detailed usage analytics
- **Webhooks**: Notifications for file events

### Integration Opportunities

- **Cloud Functions**: Process files on upload
- **Cloud Run**: Serverless file processing
- **BigQuery**: Analytics on file metadata
- **Cloud Logging**: Centralized logging
- **Cloud Monitoring**: Performance monitoring