# Database Schema Documentation

## Complete Database Structure for CIM Document Processor

### 🎯 Overview

This document provides comprehensive documentation of the database schema for the CIM Document Processor, including all tables, relationships, indexes, and data structures.

---

## 🗄️ Database Architecture

### Technology Stack

- **Database**: PostgreSQL (via Supabase)
- **ORM**: Supabase Client (TypeScript)
- **Migrations**: SQL migration files
- **Backup**: Supabase automated backups

### Database Features

- **JSONB Support**: Flexible analysis data storage
- **UUID Primary Keys**: Secure document identification
- **Row Level Security**: Per-user data isolation
- **Full-Text Search**: Document content search
- **Vector Storage**: AI embeddings and similarity search

---

## 📊 Core Tables

### Documents Table

**Purpose**: Primary table for storing document metadata and processing results

```sql
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id TEXT NOT NULL,
  original_file_name TEXT NOT NULL,
  file_path TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  status TEXT NOT NULL DEFAULT 'uploaded',
  extracted_text TEXT,
  generated_summary TEXT,
  summary_pdf_path TEXT,
  analysis_data JSONB,
  error_message TEXT,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
```

**Columns**:

- `id` - Unique document identifier (UUID)
- `user_id` - User who owns the document
- `original_file_name` - Original uploaded file name
- `file_path` - Storage path for the document
- `file_size` - File size in bytes
- `status` - Processing status (`uploaded`, `processing`, `completed`, `failed`, `cancelled`)
- `extracted_text` - Text extracted from the document
- `generated_summary` - AI-generated summary
- `summary_pdf_path` - Path to the generated PDF report
- `analysis_data` - Structured analysis results (JSONB)
- `error_message` - Error message if processing failed
- `created_at` - Document creation timestamp
- `updated_at` - Last update timestamp
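Because the application reads this table through the Supabase TypeScript client, a row type kept in sync with the schema catches drift at compile time. A minimal sketch — the `DocumentRow` and `isDocumentStatus` names are illustrative, not part of the schema:

```typescript
// Illustrative row type mirroring the documents table defined above.
type DocumentStatus = 'uploaded' | 'processing' | 'completed' | 'failed' | 'cancelled';

interface DocumentRow {
  id: string;                                    // UUID
  user_id: string;
  original_file_name: string;
  file_path: string;
  file_size: number;                             // bytes
  status: DocumentStatus;
  extracted_text: string | null;
  generated_summary: string | null;
  summary_pdf_path: string | null;
  analysis_data: Record<string, unknown> | null; // JSONB payload
  error_message: string | null;
  created_at: string;                            // timestamps arrive as ISO strings
  updated_at: string;
}

// Narrowing helper: validates the status column before trusting it,
// since the column is TEXT at the database level.
function isDocumentStatus(value: string): value is DocumentStatus {
  return ['uploaded', 'processing', 'completed', 'failed', 'cancelled'].includes(value);
}

console.log(isDocumentStatus('processing')); // true
console.log(isDocumentStatus('archived'));   // false
```

With such a type in place, a typical fetch would be parameterized as `supabase.from('documents').select('*').eq('user_id', uid)`, with rows cast to `DocumentRow[]`.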
**Indexes**:

```sql
CREATE INDEX idx_documents_user_id ON documents(user_id);
CREATE INDEX idx_documents_status ON documents(status);
CREATE INDEX idx_documents_created_at ON documents(created_at);
CREATE INDEX idx_documents_analysis_data ON documents USING GIN (analysis_data);
```

### Users Table

**Purpose**: User authentication and profile information

```sql
CREATE TABLE users (
  id TEXT PRIMARY KEY,
  name TEXT,
  email TEXT UNIQUE NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
```

**Columns**:

- `id` - Firebase user ID
- `name` - User display name
- `email` - User email address
- `created_at` - Account creation timestamp
- `updated_at` - Last update timestamp

**Indexes**:

```sql
CREATE INDEX idx_users_email ON users(email);
```

> **Note**: The `UNIQUE` constraint on `email` already creates an index, so `idx_users_email` is redundant and can be dropped.

### Processing Jobs Table

**Purpose**: Background job tracking and management

```sql
CREATE TABLE processing_jobs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  user_id TEXT NOT NULL,
  job_type TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  priority INTEGER DEFAULT 0,
  attempts INTEGER DEFAULT 0,
  max_attempts INTEGER DEFAULT 3,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  error_message TEXT,
  result_data JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
```

**Columns**:

- `id` - Unique job identifier
- `document_id` - Associated document
- `user_id` - User who initiated the job
- `job_type` - Type of processing job
- `status` - Job status (`pending`, `running`, `completed`, `failed`, `cancelled`)
- `priority` - Job priority (higher = more important)
- `attempts` - Number of processing attempts
- `max_attempts` - Maximum allowed attempts
- `started_at` - Job start timestamp
- `completed_at` - Job completion timestamp
- `error_message` - Error message if failed
- `result_data` - Job result data (JSONB)
- `created_at` - Job creation timestamp
- `updated_at` - Last update timestamp

**Indexes**:

```sql
CREATE INDEX idx_processing_jobs_document_id ON processing_jobs(document_id);
CREATE INDEX idx_processing_jobs_user_id ON processing_jobs(user_id);
CREATE INDEX idx_processing_jobs_status ON processing_jobs(status);
CREATE INDEX idx_processing_jobs_priority ON processing_jobs(priority);
```

---

## 🤖 AI Processing Tables

### Agentic RAG Sessions Table

**Purpose**: Track AI processing sessions and results

```sql
CREATE TABLE agentic_rag_sessions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  user_id TEXT NOT NULL,
  strategy TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  total_agents INTEGER DEFAULT 0,
  completed_agents INTEGER DEFAULT 0,
  failed_agents INTEGER DEFAULT 0,
  overall_validation_score DECIMAL(3,2),
  processing_time_ms INTEGER,
  api_calls_count INTEGER DEFAULT 0,
  total_cost DECIMAL(10,4),
  reasoning_steps JSONB,
  final_result JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  completed_at TIMESTAMP
);
```

**Columns**:

- `id` - Unique session identifier
- `document_id` - Associated document
- `user_id` - User who initiated processing
- `strategy` - Processing strategy used
- `status` - Session status
- `total_agents` - Total number of AI agents
- `completed_agents` - Successfully completed agents
- `failed_agents` - Failed agents
- `overall_validation_score` - Quality validation score
- `processing_time_ms` - Total processing time
- `api_calls_count` - Number of API calls made
- `total_cost` - Total cost of processing
- `reasoning_steps` - AI reasoning process (JSONB)
- `final_result` - Final analysis result (JSONB)
- `created_at` - Session creation timestamp
- `completed_at` - Session completion timestamp

**Indexes**:

```sql
CREATE INDEX idx_agentic_rag_sessions_document_id ON agentic_rag_sessions(document_id);
CREATE INDEX idx_agentic_rag_sessions_user_id ON agentic_rag_sessions(user_id);
CREATE INDEX idx_agentic_rag_sessions_status ON agentic_rag_sessions(status);
CREATE INDEX idx_agentic_rag_sessions_strategy ON agentic_rag_sessions(strategy);
```

### Agent Executions Table

**Purpose**: Track individual AI agent executions

```sql
CREATE TABLE agent_executions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  session_id UUID REFERENCES agentic_rag_sessions(id) ON DELETE CASCADE,
  agent_name TEXT NOT NULL,
  agent_type TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  input_data JSONB,
  output_data JSONB,
  error_message TEXT,
  execution_time_ms INTEGER,
  api_calls INTEGER DEFAULT 0,
  cost DECIMAL(10,4),
  validation_score DECIMAL(3,2),
  created_at TIMESTAMP DEFAULT NOW(),
  completed_at TIMESTAMP
);
```

**Columns**:

- `id` - Unique execution identifier
- `session_id` - Associated processing session
- `agent_name` - Name of the AI agent
- `agent_type` - Type of agent
- `status` - Execution status
- `input_data` - Input data for the agent (JSONB)
- `output_data` - Output data from the agent (JSONB)
- `error_message` - Error message if failed
- `execution_time_ms` - Execution time in milliseconds
- `api_calls` - Number of API calls made
- `cost` - Cost of this execution
- `validation_score` - Quality validation score
- `created_at` - Execution creation timestamp
- `completed_at` - Execution completion timestamp

**Indexes**:

```sql
CREATE INDEX idx_agent_executions_session_id ON agent_executions(session_id);
CREATE INDEX idx_agent_executions_agent_name ON agent_executions(agent_name);
CREATE INDEX idx_agent_executions_status ON agent_executions(status);
```

### Quality Metrics Table

**Purpose**: Track quality metrics for AI processing

```sql
CREATE TABLE quality_metrics (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  session_id UUID REFERENCES agentic_rag_sessions(id) ON DELETE CASCADE,
  metric_name TEXT NOT NULL,
  metric_value DECIMAL(10,4),
  metric_type TEXT NOT NULL,
  threshold_value DECIMAL(10,4),
  passed BOOLEAN,
  details JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);
```

**Columns**:

- `id` - Unique metric identifier
- `session_id` - Associated processing session
- `metric_name` - Name of the quality metric
- `metric_value` - Actual metric value
- `metric_type` - Type of metric (accuracy, completeness, etc.)
- `threshold_value` - Threshold for passing
- `passed` - Whether the metric passed its threshold
- `details` - Additional metric details (JSONB)
- `created_at` - Metric creation timestamp

**Indexes**:

```sql
CREATE INDEX idx_quality_metrics_session_id ON quality_metrics(session_id);
CREATE INDEX idx_quality_metrics_metric_name ON quality_metrics(metric_name);
CREATE INDEX idx_quality_metrics_passed ON quality_metrics(passed);
```

---

## 🔍 Vector Database Tables

### Document Chunks Table

**Purpose**: Store document chunks with vector embeddings

```sql
CREATE TABLE document_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index INTEGER NOT NULL,
  content TEXT NOT NULL,
  embedding VECTOR(1536),
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);
```

**Columns**:

- `id` - Unique chunk identifier
- `document_id` - Associated document
- `chunk_index` - Sequential chunk index
- `content` - Chunk text content
- `embedding` - Vector embedding (1536 dimensions)
- `metadata` - Chunk metadata (JSONB)
- `created_at` - Chunk creation timestamp

**Indexes**:

```sql
CREATE INDEX idx_document_chunks_document_id ON document_chunks(document_id);
CREATE INDEX idx_document_chunks_chunk_index ON document_chunks(chunk_index);
CREATE INDEX idx_document_chunks_embedding ON document_chunks USING ivfflat (embedding vector_cosine_ops);
```

### Search Analytics Table

**Purpose**: Track vector search usage and performance

```sql
CREATE TABLE search_analytics (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id TEXT NOT NULL,
  query_text TEXT NOT NULL,
  results_count INTEGER,
  search_time_ms INTEGER,
  success BOOLEAN,
  error_message TEXT,
  created_at TIMESTAMP DEFAULT NOW()
);
```

**Columns**:

- `id` - Unique search identifier
- `user_id` - User who performed the search
- `query_text` - Search query text
- `results_count` - Number of results returned
- `search_time_ms` - Search execution time
- `success` - Whether the search was successful
- `error_message` - Error message if failed
- `created_at` - Search timestamp

**Indexes**:

```sql
CREATE INDEX idx_search_analytics_user_id ON search_analytics(user_id);
CREATE INDEX idx_search_analytics_created_at ON search_analytics(created_at);
CREATE INDEX idx_search_analytics_success ON search_analytics(success);
```

---

## 📈 Analytics Tables

### Performance Metrics Table

**Purpose**: Track system performance metrics

```sql
CREATE TABLE performance_metrics (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  metric_name TEXT NOT NULL,
  metric_value DECIMAL(10,4),
  metric_unit TEXT,
  tags JSONB,
  timestamp TIMESTAMP DEFAULT NOW()
);
```

**Columns**:

- `id` - Unique metric identifier
- `metric_name` - Name of the performance metric
- `metric_value` - Metric value
- `metric_unit` - Unit of measurement
- `tags` - Additional tags (JSONB)
- `timestamp` - Metric timestamp

**Indexes**:

```sql
CREATE INDEX idx_performance_metrics_name ON performance_metrics(metric_name);
CREATE INDEX idx_performance_metrics_timestamp ON performance_metrics(timestamp);
```

### Usage Analytics Table

**Purpose**: Track user usage patterns

```sql
CREATE TABLE usage_analytics (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id TEXT NOT NULL,
  action_type TEXT NOT NULL,
  action_details JSONB,
  ip_address INET,
  user_agent TEXT,
  created_at TIMESTAMP DEFAULT NOW()
);
```

**Columns**:

- `id` - Unique analytics identifier
- `user_id` - User who performed the action
- `action_type` - Type of action performed
- `action_details` - Action details (JSONB)
- `ip_address` - User IP address
- `user_agent` - User agent string
- `created_at` - Action timestamp

**Indexes**:

```sql
CREATE INDEX idx_usage_analytics_user_id ON usage_analytics(user_id);
CREATE INDEX idx_usage_analytics_action_type ON usage_analytics(action_type);
CREATE INDEX idx_usage_analytics_created_at ON usage_analytics(created_at);
```

---

## 🔗 Table Relationships

### Primary Relationships

```mermaid
erDiagram
    users ||--o{ documents : "owns"
    documents ||--o{ processing_jobs : "has"
    documents ||--o{ agentic_rag_sessions : "has"
    agentic_rag_sessions ||--o{ agent_executions : "contains"
    agentic_rag_sessions ||--o{ quality_metrics : "has"
    documents ||--o{ document_chunks : "contains"
    users ||--o{ search_analytics : "performs"
    users ||--o{ usage_analytics : "generates"
```

### Foreign Key Constraints

> **Note**: `processing_jobs`, `agentic_rag_sessions`, `agent_executions`, `quality_metrics`, and `document_chunks` already declare these references inline in their `CREATE TABLE` statements. Run the corresponding `ALTER TABLE` only where the inline reference was omitted; otherwise it creates a duplicate constraint.

```sql
-- Documents table constraints
ALTER TABLE documents
ADD CONSTRAINT fk_documents_user_id
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE;

-- Processing jobs table constraints
ALTER TABLE processing_jobs
ADD CONSTRAINT fk_processing_jobs_document_id
FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE;

-- Agentic RAG sessions table constraints
ALTER TABLE agentic_rag_sessions
ADD CONSTRAINT fk_agentic_rag_sessions_document_id
FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE;

-- Agent executions table constraints
ALTER TABLE agent_executions
ADD CONSTRAINT fk_agent_executions_session_id
FOREIGN KEY (session_id) REFERENCES agentic_rag_sessions(id) ON DELETE CASCADE;

-- Quality metrics table constraints
ALTER TABLE quality_metrics
ADD CONSTRAINT fk_quality_metrics_session_id
FOREIGN KEY (session_id) REFERENCES agentic_rag_sessions(id) ON DELETE CASCADE;

-- Document chunks table constraints
ALTER TABLE document_chunks
ADD CONSTRAINT fk_document_chunks_document_id
FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE;
```

---

## 🔐 Row Level Security (RLS)

### Documents Table RLS

```sql
-- Enable RLS
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Policy: Users can only access their own documents
CREATE POLICY "Users can view own documents" ON documents
  FOR SELECT USING (auth.uid()::text = user_id);

CREATE POLICY "Users can insert own documents" ON documents
  FOR INSERT WITH CHECK (auth.uid()::text = user_id);

CREATE POLICY "Users can update own documents" ON documents
  FOR UPDATE USING (auth.uid()::text = user_id);

CREATE POLICY "Users can delete own documents" ON documents
  FOR DELETE USING (auth.uid()::text = user_id);
```

> **Note**: These policies compare Supabase's `auth.uid()` with the stored `user_id`. Since `users.id` is a Firebase user ID, the auth tokens must be issued so that `auth.uid()` resolves to the same identifier; otherwise every policy check fails.

### Processing Jobs Table RLS

```sql
-- Enable RLS
ALTER TABLE processing_jobs ENABLE ROW LEVEL SECURITY;

-- Policy: Users can only access their own jobs
CREATE POLICY "Users can view own jobs" ON processing_jobs
  FOR SELECT USING (auth.uid()::text = user_id);

CREATE POLICY "Users can insert own jobs" ON processing_jobs
  FOR INSERT WITH CHECK (auth.uid()::text = user_id);

CREATE POLICY "Users can update own jobs" ON processing_jobs
  FOR UPDATE USING (auth.uid()::text = user_id);
```

---

## 📊 Data Types and Constraints

### Status Enums

```sql
-- Document status enum
CREATE TYPE document_status AS ENUM (
  'uploaded', 'processing', 'completed', 'failed', 'cancelled'
);

-- Job status enum
CREATE TYPE job_status AS ENUM (
  'pending', 'running', 'completed', 'failed', 'cancelled'
);

-- Session status enum
CREATE TYPE session_status AS ENUM (
  'pending', 'processing', 'completed', 'failed', 'cancelled'
);
```

> **Note**: The table definitions above store status values in `TEXT` columns. These enum types are available if the columns are later migrated to use them for stricter validation.

### Check Constraints

```sql
-- File size constraint (max 100 MB)
ALTER TABLE documents
ADD CONSTRAINT check_file_size
CHECK (file_size > 0 AND file_size <= 104857600);

-- Processing time constraint
ALTER TABLE agentic_rag_sessions
ADD CONSTRAINT check_processing_time
CHECK (processing_time_ms >= 0);

-- Validation score constraint (scores are normalized to 0-1)
ALTER TABLE quality_metrics
ADD CONSTRAINT check_validation_score
CHECK (metric_value >= 0 AND metric_value <= 1);
```

---

## 🔄 Migration Scripts

### Initial Schema Migration

```sql
-- Migration: 001_create_initial_schema.sql
BEGIN;

-- Create users table
CREATE TABLE users (
  id TEXT PRIMARY KEY,
  name TEXT,
  email TEXT UNIQUE NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

-- Create documents table
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id TEXT NOT NULL,
  original_file_name TEXT NOT NULL,
  file_path TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  status TEXT NOT NULL DEFAULT 'uploaded',
  extracted_text TEXT,
  generated_summary TEXT,
  summary_pdf_path TEXT,
  analysis_data JSONB,
  error_message TEXT,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

-- Create indexes
CREATE INDEX idx_documents_user_id ON documents(user_id);
CREATE INDEX idx_documents_status ON documents(status);
CREATE INDEX idx_documents_created_at ON documents(created_at);

-- Enable RLS
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

COMMIT;
```

### Add Vector Support Migration

```sql
-- Migration: 002_add_vector_support.sql
BEGIN;

-- Enable vector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create document chunks table
CREATE TABLE document_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index INTEGER NOT NULL,
  content TEXT NOT NULL,
  embedding VECTOR(1536),
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Create vector indexes
CREATE INDEX idx_document_chunks_document_id ON document_chunks(document_id);
CREATE INDEX idx_document_chunks_embedding ON document_chunks USING ivfflat (embedding vector_cosine_ops);

COMMIT;
```

---

## 📈 Performance Optimization

### Query Optimization

```sql
-- Optimize document queries with composite indexes
CREATE INDEX idx_documents_user_status ON documents(user_id, status);
CREATE INDEX idx_documents_user_created ON documents(user_id, created_at DESC);

-- Optimize processing job queries
CREATE INDEX idx_processing_jobs_user_status ON processing_jobs(user_id, status);
CREATE INDEX idx_processing_jobs_priority_status ON processing_jobs(priority DESC, status);

-- Optimize analytics queries
CREATE INDEX idx_usage_analytics_user_action ON usage_analytics(user_id, action_type);
CREATE INDEX idx_performance_metrics_name_time ON performance_metrics(metric_name, timestamp DESC);
```

### Partitioning Strategy

```sql
-- Partition the documents table by creation date.
-- Note: this only works if the parent table was created as a partitioned
-- table, e.g. CREATE TABLE documents (...) PARTITION BY RANGE (created_at);
CREATE TABLE documents_2024 PARTITION OF documents
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE documents_2025 PARTITION OF documents
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
```

---

## 🔍 Monitoring and Maintenance

### Database Health Queries

```sql
-- Check table size
SELECT pg_size_pretty(pg_total_relation_size('documents')) AS documents_size;

-- Check column statistics (planner estimates)
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'documents';

-- Check index usage
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'documents';

-- Check slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the columns are total_time/mean_time)
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
WHERE query LIKE '%documents%'
ORDER BY mean_exec_time DESC
LIMIT 10;
```

### Maintenance Procedures

```sql
-- Vacuum and analyze tables
VACUUM ANALYZE documents;
VACUUM ANALYZE processing_jobs;
VACUUM ANALYZE agentic_rag_sessions;

-- Update statistics only (VACUUM ANALYZE above already refreshes them;
-- run ANALYZE alone between vacuums)
ANALYZE documents;
ANALYZE processing_jobs;
ANALYZE agentic_rag_sessions;
```

---

This schema documentation describes the database structure, relationships, and optimization strategies for the CIM Document Processor.