cim_summary/backend/GCS_INTEGRATION_README.md
Jon 6057d1d7fd 🔧 Fix authentication and document upload issues
## What was done:
- Fixed Firebase Admin initialization to use default credentials for Firebase Functions
- Updated frontend to use the correct Firebase Functions URL (it was using the Cloud Run URL)
- Added comprehensive debugging to authentication middleware
- Added debugging to file upload middleware and CORS handling
- Added debug buttons to the frontend for troubleshooting authentication
- Enhanced error handling and logging throughout the stack

## Current issues:
- Document upload still returns 400 Bad Request despite authentication working
- GET requests work fine (200 OK) but POST upload requests fail
- Frontend authentication is working correctly (valid JWT tokens)
- Backend authentication middleware is working (it rejects invalid tokens)
- CORS is configured correctly and allowing requests

## Root cause analysis:
- Authentication is NOT the issue (tokens are valid, GET requests work)
- The problem appears to be in the file upload handling or multer configuration
- Request reaches the server but fails during upload processing
- Need to identify exactly where in the upload pipeline the failure occurs

## TODO next steps:
1. 🔍 Check Firebase Functions logs after next upload attempt to see debugging output
2. 🔍 Verify if the request reaches the upload middleware (look for 'Upload middleware called' logs)
3. 🔍 Check if file validation is triggered (look for '🔍 File filter called' logs)
4. 🔍 Identify specific error in upload pipeline (multer, file processing, etc.)
5. 🔍 Test with smaller file or different file type to isolate issue
6. 🔍 Check if issue is with Firebase Functions file size limits or timeout
7. 🔍 Verify multer configuration and file handling in Firebase Functions environment

## Technical details:
- Frontend: https://cim-summarizer.web.app
- Backend: https://us-central1-cim-summarizer.cloudfunctions.net/api
- Authentication: Firebase Auth with JWT tokens (working correctly)
- File upload: Multer with memory storage for immediate GCS upload
- Debug buttons available in production frontend for troubleshooting
2025-07-31 16:18:53 -04:00


Google Cloud Storage Integration

This document describes the Google Cloud Storage (GCS) integration implementation for the CIM Document Processor backend.

Overview

The GCS integration replaces the previous local file storage system with a cloud-only approach using Google Cloud Storage. This provides:

  • Scalability: No local storage limitations
  • Reliability: Google's infrastructure with 99.9%+ availability
  • Security: IAM-based access control and encryption
  • Cost-effectiveness: Pay only for what you use
  • Global access: Files accessible from anywhere

Configuration

Environment Variables

The following environment variables are required for GCS integration:

# Google Cloud Configuration
GCLOUD_PROJECT_ID=your-project-id
GCS_BUCKET_NAME=your-bucket-name
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json
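
Since every GCS operation depends on these variables, it can help to fail fast at startup if any are missing. A minimal sketch (the function name is illustrative, not part of the actual codebase):

```typescript
// Throw at startup if any required GCS variable is unset or empty.
// Variable names come from this README's configuration section.
function assertGcsEnv(env: Record<string, string | undefined>): void {
  const required = [
    "GCLOUD_PROJECT_ID",
    "GCS_BUCKET_NAME",
    "GOOGLE_APPLICATION_CREDENTIALS",
  ];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required GCS environment variables: ${missing.join(", ")}`);
  }
}
```

Call `assertGcsEnv(process.env)` before initializing the storage client so misconfiguration surfaces immediately rather than on the first upload.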

Service Account Setup

  1. Create a service account in Google Cloud Console
  2. Grant the following roles:
    • Storage Object Admin (for full bucket access)
    • Storage Object Viewer (for read-only access if needed)
  3. Download the JSON key file as serviceAccountKey.json
  4. Place it in the backend/ directory

Bucket Configuration

  1. Create a GCS bucket in your Google Cloud project
  2. Configure bucket settings:
    • Location: Choose a region close to your users
    • Storage class: Standard (for frequently accessed files)
    • Access control: Uniform bucket-level access (recommended)
    • Public access: Prevent public access (files are private by default)
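
Assuming the `gcloud`/`gsutil` CLI is installed and authenticated, the settings above can be applied at bucket creation time (project ID, bucket name, and region are placeholders):

```shell
# Create the bucket in a chosen region with uniform bucket-level access (-b on)
gsutil mb -p your-project-id -l us-central1 -b on gs://your-bucket-name

# Enforce public access prevention so objects can never be made public
gsutil pab set enforced gs://your-bucket-name
```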

Implementation Details

File Storage Service

The FileStorageService class provides the following operations:

Core Operations

  • Upload: storeFile(file, userId) - Upload files to GCS with metadata
  • Download: getFile(filePath) - Download files from GCS
  • Delete: deleteFile(filePath) - Delete files from GCS
  • Exists: fileExists(filePath) - Check if file exists
  • Info: getFileInfo(filePath) - Get file metadata and info

Advanced Operations

  • List: listFiles(prefix, maxResults) - List files with prefix filtering
  • Copy: copyFile(sourcePath, destinationPath) - Copy files within GCS
  • Move: moveFile(sourcePath, destinationPath) - Move files within GCS
  • Signed URLs: generateSignedUrl(filePath, expirationMinutes) - Generate temporary access URLs
  • Cleanup: cleanupOldFiles(prefix, daysOld) - Remove old files
  • Stats: getStorageStats(prefix) - Get storage statistics

Error Handling & Retry Logic

  • Exponential backoff: Retries with increasing delays (1s, 2s, 4s)
  • Configurable retries: Default 3 attempts per operation
  • Comprehensive logging: All operations logged with context
  • Graceful failures: Operations return null/false on failure instead of throwing
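
The retry behavior above can be sketched as follows. The names (`withRetry`, `backoffDelayMs`) are illustrative, not the actual FileStorageService internals:

```typescript
// Delay before retrying a given 0-based attempt: 1s, 2s, 4s by default.
function backoffDelayMs(attempt: number, baseDelayMs = 1000): number {
  return baseDelayMs * 2 ** attempt;
}

// Run an async operation with up to maxAttempts tries; on final failure,
// return null instead of throwing, matching the "graceful failures" behavior.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch {
      if (attempt === maxAttempts - 1) return null;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, baseDelayMs)));
    }
  }
  return null;
}
```

Returning null keeps callers simple (a single truthiness check), at the cost of losing the original error; the real service compensates by logging each failure with context.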

File Organization

Files are organized in GCS using the following structure:

bucket-name/
├── uploads/
│   ├── user-id-1/
│   │   ├── timestamp-filename1.pdf
│   │   └── timestamp-filename2.pdf
│   └── user-id-2/
│       └── timestamp-filename3.pdf
└── processed/
    ├── user-id-1/
    │   └── processed-files/
    └── user-id-2/
        └── processed-files/
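
A path builder matching the layout above might look like this (the function name is assumed; the actual service may construct paths differently):

```typescript
// Build the GCS object path uploads/<userId>/<timestamp>-<filename>.
// Using milliseconds-since-epoch gives a sortable, collision-resistant prefix.
function buildUploadPath(userId: string, originalName: string, now: Date = new Date()): string {
  const timestamp = now.getTime();
  return `uploads/${userId}/${timestamp}-${originalName}`;
}
```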

File Metadata

Each uploaded file includes metadata:

{
  "originalName": "document.pdf",
  "userId": "user-123",
  "uploadedAt": "2024-01-15T10:30:00Z",
  "size": "1048576"
}
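
Assembling that metadata object could be sketched as below. Note that GCS custom metadata values are strings, hence the stringified size; the helper name is illustrative:

```typescript
interface UploadMetadata {
  originalName: string;
  userId: string;
  uploadedAt: string;
  size: string;
}

// Build the custom metadata attached to each uploaded object.
// Field names match the example in this README.
function buildMetadata(
  originalName: string,
  userId: string,
  sizeBytes: number,
  now: Date = new Date()
): UploadMetadata {
  return {
    originalName,
    userId,
    uploadedAt: now.toISOString(),
    size: String(sizeBytes),
  };
}
```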

Usage Examples

Basic File Operations

import { fileStorageService } from '../services/fileStorageService';

// Upload a file
const uploadResult = await fileStorageService.storeFile(file, userId);
if (uploadResult.success) {
  console.log('File uploaded:', uploadResult.fileInfo);
}

// Download a file
const fileBuffer = await fileStorageService.getFile(gcsPath);
if (fileBuffer) {
  // Process the file buffer
}

// Delete a file
const deleted = await fileStorageService.deleteFile(gcsPath);
if (deleted) {
  console.log('File deleted successfully');
}

Advanced Operations

// List user's files
const userFiles = await fileStorageService.listFiles(`uploads/${userId}/`);

// Generate signed URL for temporary access
const signedUrl = await fileStorageService.generateSignedUrl(gcsPath, 60);

// Copy file to processed directory
await fileStorageService.copyFile(
  `uploads/${userId}/original.pdf`,
  `processed/${userId}/processed.pdf`
);

// Get storage statistics
const stats = await fileStorageService.getStorageStats(`uploads/${userId}/`);
console.log(`User has ${stats.totalFiles} files, ${stats.totalSize} bytes total`);

Testing

Running Integration Tests

# Test GCS integration
npm run test:gcs

The test script performs the following operations:

  1. Connection Test: Verifies GCS bucket access
  2. Upload Test: Uploads a test file
  3. Existence Check: Verifies file exists
  4. Metadata Retrieval: Gets file information
  5. Download Test: Downloads and verifies content
  6. Signed URL: Generates temporary access URL
  7. Copy/Move: Tests file operations
  8. Listing: Lists files in directory
  9. Statistics: Gets storage stats
  10. Cleanup: Removes test files

Manual Testing

// Test connection
const connected = await fileStorageService.testConnection();
console.log('GCS connected:', connected);

// Test with a real file
const mockFile = {
  originalname: 'test.pdf',
  filename: 'test.pdf',
  path: '/path/to/local/file.pdf',
  size: 1024,
  mimetype: 'application/pdf'
};

const result = await fileStorageService.storeFile(mockFile, 'test-user');

Security Considerations

Access Control

  • Service Account: Uses least-privilege service account
  • Bucket Permissions: Files are private by default
  • Signed URLs: Temporary access for specific files
  • User Isolation: Files organized by user ID

Data Protection

  • Encryption: GCS provides encryption at rest and in transit
  • Metadata: Avoid placing sensitive information in object metadata, which is visible to anyone with read access to the object
  • Cleanup: Automatic cleanup of old files
  • Audit Logging: All operations logged for audit

Performance Optimization

Upload Optimization

  • Resumable Uploads: Large files can be resumed if interrupted
  • Parallel Uploads: Multiple files can be uploaded simultaneously
  • Chunked Uploads: Large files uploaded in chunks

Download Optimization

  • Streaming: Files can be streamed instead of loaded entirely into memory
  • Caching: Consider implementing client-side caching
  • CDN: Use Cloud CDN for frequently accessed files

Monitoring and Logging

Log Levels

  • INFO: Successful operations
  • WARN: Retry attempts and non-critical issues
  • ERROR: Failed operations and critical issues

Metrics to Monitor

  • Upload Success Rate: Percentage of successful uploads
  • Download Latency: Time to download files
  • Storage Usage: Total storage and file count
  • Error Rates: Failed operations by type

Troubleshooting

Common Issues

  1. Authentication Errors

    • Verify service account key file exists
    • Check service account permissions
    • Ensure project ID is correct
  2. Bucket Access Errors

    • Verify bucket exists
    • Check bucket permissions
    • Ensure bucket name is correct
  3. Upload Failures

    • Check file size limits
    • Verify network connectivity
    • Review error logs for specific issues
  4. Download Failures

    • Verify file exists in GCS
    • Check file permissions
    • Review network connectivity

Debug Commands

# Test GCS connection
npm run test:gcs

# Check environment variables
echo $GCLOUD_PROJECT_ID
echo $GCS_BUCKET_NAME

# Verify service account
gcloud auth activate-service-account --key-file=serviceAccountKey.json

Migration from Local Storage

Migration Steps

  1. Backup: Ensure all local files are backed up
  2. Upload: Upload existing files to GCS
  3. Update Paths: Update database records with GCS paths
  4. Test: Verify all operations work with GCS
  5. Cleanup: Remove local files after verification

Migration Script

// Example migration script
async function migrateToGCS() {
  const localFiles = await getLocalFiles();
  
  for (const file of localFiles) {
    const uploadResult = await fileStorageService.storeFile(file, file.userId);
    if (uploadResult.success) {
      await updateDatabaseRecord(file.id, uploadResult.fileInfo);
    }
  }
}

Cost Optimization

Storage Classes

  • Standard: For frequently accessed files
  • Nearline: For files accessed less than once per month
  • Coldline: For files accessed less than once per quarter
  • Archive: For long-term storage

Lifecycle Management

  • Automatic Cleanup: Remove old files automatically
  • Storage Class Transitions: Move files to cheaper storage classes
  • Compression: Compress files before upload
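
Storage class transitions and automatic cleanup can be configured declaratively with a GCS lifecycle policy. A sketch (the age thresholds and bucket name are examples, not project settings):

```shell
# lifecycle.json: move objects to Nearline after 30 days, delete after 365
cat > lifecycle.json <<'EOF'
{
  "rule": [
    { "action": { "type": "SetStorageClass", "storageClass": "NEARLINE" }, "condition": { "age": 30 } },
    { "action": { "type": "Delete" }, "condition": { "age": 365 } }
  ]
}
EOF

# Apply the policy to the bucket
gsutil lifecycle set lifecycle.json gs://your-bucket-name
```

A bucket-level policy like this removes the need to run `cleanupOldFiles` on a schedule, though the application-level cleanup remains useful for user-triggered deletes.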

Future Enhancements

Planned Features

  • Multi-region Support: Distribute files across regions
  • Versioning: File version control
  • Backup: Automated backup to secondary bucket
  • Analytics: Detailed usage analytics
  • Webhooks: Notifications for file events

Integration Opportunities

  • Cloud Functions: Process files on upload
  • Cloud Run: Serverless file processing
  • BigQuery: Analytics on file metadata
  • Cloud Logging: Centralized logging
  • Cloud Monitoring: Performance monitoring