# Financial Data Extraction: Implementation Plan

## Overview

This document provides a step-by-step implementation plan to fix the financial data extraction issue by utilizing Document AI's structured table data.

---

## Phase 1: Quick Win Implementation (RECOMMENDED START)

**Timeline**: 1-2 hours
**Expected Improvement**: 60-70% accuracy gain
**Risk**: Low - additive changes, no breaking modifications

### Step 1.1: Update DocumentAIOutput Interface

**File**: `backend/src/services/documentAiProcessor.ts`

**Current (lines 15-25):**

```typescript
interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: Array<{ rows: number; columns: number }>; // ❌ Just counts, no structure
  pages: Array<{...}>;
  mimeType: string;
}
```

**Updated:**

```typescript
export interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: {
    pageNumber: number;
    confidence: number;
  };
  rawTable?: any; // Keep original for debugging
}

interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: StructuredTable[]; // ✅ Full structure
  pages: Array<{...}>;
  mimeType: string;
}
```

### Step 1.2: Add Table Text Extraction Helper

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Add after line 51 (after constructor)

```typescript
/**
 * Extract text from a Document AI layout object using text anchors
 * Based on Google's best practices: https://cloud.google.com/document-ai/docs/handle-response
 */
private getTextFromLayout(layout: any, documentText: string): string {
  try {
    const textAnchor = layout?.textAnchor;
    if (!textAnchor?.textSegments || textAnchor.textSegments.length === 0) {
      return '';
    }

    // Get the first segment (most common case)
    const segment = textAnchor.textSegments[0];
    const startIndex = parseInt(segment.startIndex || '0');
    const endIndex = parseInt(segment.endIndex || documentText.length.toString());

    // Validate indices
    if (startIndex < 0 || endIndex > documentText.length || startIndex >= endIndex) {
      logger.warn('Invalid text anchor indices', { startIndex, endIndex, docLength:
        documentText.length });
      return '';
    }

    return documentText.substring(startIndex, endIndex).trim();
  } catch (error) {
    logger.error('Failed to extract text from layout', {
      error: error instanceof Error ? error.message : String(error),
      layout
    });
    return '';
  }
}
```

### Step 1.3: Add Structured Table Extraction

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Add after getTextFromLayout method

```typescript
/**
 * Extract structured tables from Document AI response
 * Preserves column alignment and table structure
 */
private extractStructuredTables(document: any, documentText: string): StructuredTable[] {
  const tables: StructuredTable[] = [];

  try {
    const pages = document.pages || [];
    logger.info('Extracting structured tables from Document AI response', {
      pageCount: pages.length
    });

    for (const page of pages) {
      const pageTables = page.tables || [];
      const pageNumber = page.pageNumber || 0;

      logger.info('Processing page for tables', { pageNumber, tableCount: pageTables.length });

      for (let tableIndex = 0; tableIndex < pageTables.length; tableIndex++) {
        const table = pageTables[tableIndex];

        try {
          // Extract headers from first header row
          const headers: string[] = [];
          if (table.headerRows && table.headerRows.length > 0) {
            const headerRow = table.headerRows[0];
            for (const cell of headerRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              headers.push(cellText);
            }
          }

          // Extract data rows
          const rows: string[][] = [];
          for (const bodyRow of table.bodyRows || []) {
            const row: string[] = [];
            for (const cell of bodyRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              row.push(cellText);
            }
            if (row.length > 0) {
              rows.push(row);
            }
          }

          // Only add tables with content
          if (headers.length > 0 || rows.length > 0) {
            tables.push({
              headers,
              rows,
              position: {
                pageNumber,
                confidence: table.confidence || 0.9
              },
              rawTable: table // Keep for debugging
            });

            logger.info('Extracted structured table', {
              pageNumber,
              tableIndex,
              headerCount: headers.length,
              rowCount: rows.length,
              headers: headers.slice(0, 10) // Log first 10 headers
            });
          }
        } catch (tableError) {
          logger.error('Failed to extract table', {
            pageNumber,
            tableIndex,
            error: tableError instanceof Error ? tableError.message : String(tableError)
          });
        }
      }
    }

    logger.info('Structured table extraction completed', { totalTables: tables.length });
  } catch (error) {
    logger.error('Failed to extract structured tables', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return tables;
}
```

### Step 1.4: Update processWithDocumentAI to Use Structured Tables

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Update lines 462-482

**Current:**

```typescript
// Extract tables
const tables = document.pages?.flatMap(page =>
  page.tables?.map(table => ({
    rows: table.headerRows?.length || 0,
    columns: table.bodyRows?.[0]?.cells?.length || 0
  })) || []
) || [];
```

**Updated:**

```typescript
// Extract structured tables with full content
const tables = this.extractStructuredTables(document, text);
```

### Step 1.5: Pass Tables to Agentic RAG Processor

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Update line 337 (processLargeDocument call)

**Current:**

```typescript
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {}
);
```

**Updated:**

```typescript
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  { structuredTables: documentAiOutput.tables || [] }
);
```

### Step 1.6: Update Agentic RAG Processor Signature

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update lines 41-48

**Current:**

```typescript
async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
  } = {}
)
```

**Updated:**

```typescript
async processLargeDocument(
  documentId: string,
  text:
    string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
    structuredTables?: StructuredTable[];
  } = {}
)
```

### Step 1.7: Add Import for StructuredTable Type

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Add to imports at top (around line 1-6)

```typescript
import type { StructuredTable } from './documentAiProcessor';
```

### Step 1.8: Create Financial Table Identifier

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Add after line 503 (after calculateCosineSimilarity)

```typescript
/**
 * Identify if a structured table contains financial data
 * Uses heuristics to detect financial tables vs. other tables
 */
private isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const allRowsText = table.rows.map(row => row.join(' ').toLowerCase()).join(' ');

  // Check for year/period indicators in headers
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);

  // Check for financial metrics in rows
  const financialMetrics = [
    'revenue', 'sales', 'ebitda', 'ebit', 'profit', 'margin',
    'gross profit', 'operating income', 'net income', 'cash flow',
    'earnings', 'assets', 'liabilities', 'equity'
  ];
  const hasFinancialMetrics = financialMetrics.some(metric =>
    allRowsText.includes(metric)
  );

  // Check for currency/percentage values
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(allRowsText);

  // A financial table should have periods AND (metrics OR currency values)
  const isFinancial = hasPeriods && (hasFinancialMetrics || hasCurrency);

  if (isFinancial) {
    logger.info('Identified financial table', {
      headers: table.headers,
      rowCount: table.rows.length,
      pageNumber: table.position.pageNumber
    });
  }

  return isFinancial;
}

/**
 * Format a structured table as markdown for better LLM comprehension
 * Preserves column alignment and makes tables human-readable
 */
private
formatTableAsMarkdown(table: StructuredTable): string {
  const lines: string[] = [];

  // Add header row
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }

  // Add data rows
  for (const row of table.rows) {
    lines.push(`| ${row.join(' | ')} |`);
  }

  return lines.join('\n');
}
```

### Step 1.9: Update Chunk Creation to Include Financial Tables

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update createIntelligentChunks method (lines 115-158)

**Add after line 118:**

```typescript
// Extract structured tables from options
const structuredTables = (options as any)?.structuredTables || [];
```

**Add after line 119 (inside the method, before semantic chunking):**

```typescript
// PRIORITY: Create dedicated chunks for financial tables
if (structuredTables.length > 0) {
  logger.info('Processing structured tables for chunking', {
    documentId,
    tableCount: structuredTables.length
  });

  for (let i = 0; i < structuredTables.length; i++) {
    const table = structuredTables[i];
    const isFinancial = this.isFinancialTable(table);

    // Format table as markdown for better readability
    const markdownTable = this.formatTableAsMarkdown(table);

    chunks.push({
      id: `${documentId}-table-${i}`,
      content: markdownTable,
      chunkIndex: chunks.length,
      startPosition: -1, // Tables don't have text positions
      endPosition: -1,
      sectionType: isFinancial ?
        'financial-table' : 'table',
      metadata: {
        isStructuredTable: true,
        isFinancialTable: isFinancial,
        tableIndex: i,
        pageNumber: table.position.pageNumber,
        headerCount: table.headers.length,
        rowCount: table.rows.length,
        structuredData: table // Preserve original structure
      }
    });

    logger.info('Created chunk for structured table', {
      documentId,
      tableIndex: i,
      isFinancial,
      chunkId: chunks[chunks.length - 1].id,
      contentPreview: markdownTable.substring(0, 200)
    });
  }
}
```

### Step 1.10: Pin Financial Tables in Extraction

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update extractPass1CombinedMetadataFinancial method (around line 1190-1260)

**Add before the return statement (around line 1259):**

```typescript
// Identify and pin financial table chunks to ensure they're always included
const financialTableChunks = chunks.filter(
  chunk => chunk.metadata?.isFinancialTable === true
);

logger.info('Financial table chunks identified for pinning', {
  documentId,
  financialTableCount: financialTableChunks.length,
  chunkIds: financialTableChunks.map(c => c.id)
});

// Combine deterministic financial chunks with structured table chunks
const allPinnedChunks = [
  ...pinnedChunks,
  ...financialTableChunks
];
```

**Update the return statement to use allPinnedChunks:**

```typescript
return await this.extractWithTargetedQuery(
  documentId,
  text,
  financialChunks,
  query,
  targetFields,
  7,
  allPinnedChunks // ✅ Now includes both deterministic and structured tables
);
```

---

## Testing Phase 1

### Test 1.1: Verify Table Extraction

```bash
# Monitor logs for table extraction
cd backend
npm run dev

# Look for log entries:
# - "Extracting structured tables from Document AI response"
# - "Extracted structured table"
# - "Identified financial table"
```

### Test 1.2: Upload a CIM Document

```bash
# Upload a test document and check processing
curl -X POST http://localhost:8080/api/documents/upload \
  -F "file=@test-cim.pdf" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

### Test 1.3: Verify Financial Data Populated

Check the database or API response for:

- `financialSummary.financials.fy3.revenue` - Should have values
- `financialSummary.financials.fy2.ebitda` - Should have values
- NOT "Not specified in CIM" for fields that exist in tables

### Test 1.4: Check Logs for Success Indicators

```bash
# Should see:
# ✅ "Identified financial table" - confirms tables detected
# ✅ "Created chunk for structured table" - confirms chunking worked
# ✅ "Financial table chunks identified for pinning" - confirms pinning worked
# ✅ "Deterministic financial data merged successfully" - confirms data merged
```

---

### Baseline & Post-Change Metrics

Collect before/after numbers so we can validate the expected accuracy lift and know when to pull in the hybrid fallback:

1. Instrument the processing metadata (see `FINANCIAL_EXTRACTION_ANALYSIS.md`) with `tablesFound`, `financialTablesIdentified`, `structuredParsingUsed`, `textParsingFallback`, and `financialDataPopulated`.
2. Run ≥20 recent CIMs through the current pipeline and record aggregate stats (mean/median for the above plus sample `documentId`s with `tablesFound === 0`).
3. Repeat after deploying Phase 1 and Phase 2 changes; paste the numbers back into the analysis doc so Success Criteria reference real data instead of estimates.
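The aggregation in step 2 can be sketched as a small standalone helper. `RunTelemetry` mirrors the metadata fields proposed in this plan; `summarizeTelemetry` is a hypothetical utility for the analysis workflow, not an existing module:

```typescript
// Aggregate per-document telemetry into the before/after stats described above.
// RunTelemetry mirrors the processing-metadata fields proposed in this plan.
interface RunTelemetry {
  documentId: string;
  tablesFound: number;
  financialTablesIdentified: number;
  structuredParsingUsed: boolean;
  textParsingFallback: boolean;
  financialDataPopulated: boolean;
}

function summarizeTelemetry(runs: RunTelemetry[]) {
  const rate = (pred: (r: RunTelemetry) => boolean) =>
    runs.length === 0 ? 0 : runs.filter(pred).length / runs.length;

  return {
    runCount: runs.length,
    meanTablesFound:
      runs.reduce((sum, r) => sum + r.tablesFound, 0) / Math.max(runs.length, 1),
    populatedRate: rate(r => r.financialDataPopulated),
    structuredParsingRate: rate(r => r.structuredParsingUsed),
    fallbackRate: rate(r => r.textParsingFallback),
    // Documents with no tables are candidates for the hybrid fallback path
    noTableDocs: runs.filter(r => r.tablesFound === 0).map(r => r.documentId)
  };
}
```

Running this over the per-document metadata before and after each phase yields directly comparable populated-rate and fallback-rate numbers to paste into the analysis doc.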
---

## Expected Results After Phase 1

### Before Phase 1:

```json
{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      },
      "fy2": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      }
    }
  }
}
```

### After Phase 1:

```json
{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "$45.2M",
        "revenueGrowth": "N/A",
        "ebitda": "$8.5M",
        "ebitdaMargin": "18.8%"
      },
      "fy2": {
        "revenue": "$52.8M",
        "revenueGrowth": "16.8%",
        "ebitda": "$10.2M",
        "ebitdaMargin": "19.3%"
      }
    }
  }
}
```

---

## Phase 2: Enhanced Deterministic Parsing (Optional)

**Timeline**: 2-3 hours
**Expected Additional Improvement**: +15-20% accuracy
**Trigger**: If Phase 1 results are below 70% accuracy

### Step 2.1: Create Structured Table Parser

**File**: Create `backend/src/services/structuredFinancialParser.ts`

```typescript
import { logger } from '../utils/logger';
import type { StructuredTable } from './documentAiProcessor';
import type { ParsedFinancials, FinancialPeriod } from './financialTableParser';

/**
 * Parse financials directly from Document AI structured tables
 * This is more reliable than parsing from flattened text
 */
export function parseFinancialsFromStructuredTable(
  table: StructuredTable
): ParsedFinancials {
  const result: ParsedFinancials = { fy3: {}, fy2: {}, fy1: {}, ltm: {} };

  try {
    // 1. Identify period columns from headers
    const periodMapping = mapHeadersToPeriods(table.headers);

    logger.info('Structured table period mapping', {
      headers: table.headers,
      periodMapping
    });

    // 2. Process each row to extract metrics
    for (let rowIndex = 0; rowIndex < table.rows.length; rowIndex++) {
      const row = table.rows[rowIndex];
      if (row.length === 0) continue;

      const metricName = row[0].toLowerCase();

      // Match against known financial metrics
      const fieldName = identifyMetricField(metricName);
      if (!fieldName) continue;

      // 3.
      // Assign values to correct periods
      periodMapping.forEach((period, columnIndex) => {
        if (!period) return; // Skip unmapped columns

        const value = row[columnIndex + 1]; // +1 because first column is metric name
        if (!value || value.trim() === '') return;

        // 4. Validate value type matches field
        if (isValidValueForField(value, fieldName)) {
          result[period][fieldName] = value.trim();

          logger.debug('Mapped structured table value', {
            period,
            field: fieldName,
            value: value.trim(),
            row: rowIndex,
            column: columnIndex
          });
        }
      });
    }

    logger.info('Structured table parsing completed', {
      fy3: result.fy3,
      fy2: result.fy2,
      fy1: result.fy1,
      ltm: result.ltm
    });
  } catch (error) {
    logger.error('Failed to parse structured financial table', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return result;
}

/**
 * Map header columns to financial periods (fy3, fy2, fy1, ltm)
 */
function mapHeadersToPeriods(headers: string[]): Array<keyof ParsedFinancials | null> {
  const periodMapping: Array<keyof ParsedFinancials | null> = [];

  for (const header of headers) {
    const normalized = header.trim().toUpperCase().replace(/\s+/g, '');
    let period: keyof ParsedFinancials | null = null;

    // Check for LTM/TTM
    if (normalized.includes('LTM') || normalized.includes('TTM')) {
      period = 'ltm';
    }
    // Check for year patterns
    else if (/FY[-\s]?1$|FY[-\s]?2024|2024/.test(normalized)) {
      period = 'fy1'; // Most recent full year
    } else if (/FY[-\s]?2$|FY[-\s]?2023|2023/.test(normalized)) {
      period = 'fy2'; // Second most recent year
    } else if (/FY[-\s]?3$|FY[-\s]?2022|2022/.test(normalized)) {
      period = 'fy3'; // Third most recent year
    }
    // Generic FY pattern - assign based on position
    else if (/FY\d{2}/.test(normalized)) {
      // Will be assigned based on relative position
      period = null; // Handle in second pass
    }

    periodMapping.push(period);
  }

  // Second pass: fill in generic FY columns based on position
  // Most recent on right, oldest on left (common CIM format)
  let fyIndex = 1;
  for (let i = periodMapping.length - 1; i >= 0; i--) {
    if (periodMapping[i] === null &&
        /FY/i.test(headers[i])) {
      if (fyIndex === 1) periodMapping[i] = 'fy1';
      else if (fyIndex === 2) periodMapping[i] = 'fy2';
      else if (fyIndex === 3) periodMapping[i] = 'fy3';
      fyIndex++;
    }
  }

  return periodMapping;
}

/**
 * Identify which financial field a metric name corresponds to
 */
function identifyMetricField(metricName: string): keyof FinancialPeriod | null {
  const name = metricName.toLowerCase();

  // Check growth before the base revenue metric, otherwise a
  // "Revenue Growth" row would match the plain "revenue" pattern first
  if (/revenue\s*growth|yoy|y\/y|year[-\s]*over[-\s]*year/.test(name)) {
    return 'revenueGrowth';
  }
  if (/^revenue|^net sales|^total sales|^top\s+line/.test(name)) {
    return 'revenue';
  }
  if (/gross\s*profit/.test(name)) {
    return 'grossProfit';
  }
  if (/gross\s*margin/.test(name)) {
    return 'grossMargin';
  }
  if (/ebitda\s*margin|adj\.?\s*ebitda\s*margin/.test(name)) {
    return 'ebitdaMargin';
  }
  if (/ebitda|adjusted\s*ebitda|adj\.?\s*ebitda/.test(name)) {
    return 'ebitda';
  }

  return null;
}

/**
 * Validate that a value is appropriate for a given field
 */
function isValidValueForField(value: string, field: keyof FinancialPeriod): boolean {
  const trimmed = value.trim();

  // Margin and growth fields should have % (or be explicitly N/A)
  if (field.includes('Margin') || field.includes('Growth')) {
    return trimmed.toLowerCase() === 'n/a' || (/\d/.test(trimmed) && trimmed.includes('%'));
  }

  // Revenue, profit, EBITDA should have $ or numbers
  if (['revenue', 'grossProfit', 'ebitda'].includes(field)) {
    return /\d/.test(trimmed) && (trimmed.includes('$') || /\d+[KMB]/i.test(trimmed));
  }

  return /\d/.test(trimmed);
}
```

### Step 2.2: Integrate Structured Parser

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update multi-pass extraction (around line 1063-1088)

**Add import:**

```typescript
import { parseFinancialsFromStructuredTable } from './structuredFinancialParser';
```

**Update financial extraction logic (around line 1066-1088):**

```typescript
// Try structured table parsing first (most reliable)
try {
  const structuredTables = (options as any)?.structuredTables || [];
  const financialTables
    = structuredTables.filter((t: StructuredTable) => this.isFinancialTable(t));

  if (financialTables.length > 0) {
    logger.info('Attempting structured table parsing', {
      documentId,
      financialTableCount: financialTables.length
    });

    // Try each financial table until we get good data
    for (const table of financialTables) {
      const parsedFromTable = parseFinancialsFromStructuredTable(table);

      if (this.hasStructuredFinancialData(parsedFromTable)) {
        deterministicFinancials = parsedFromTable;
        deterministicFinancialChunk = this.buildDeterministicFinancialChunk(documentId, parsedFromTable);

        logger.info('Structured table parsing successful', {
          documentId,
          tableIndex: financialTables.indexOf(table),
          fy3: parsedFromTable.fy3,
          fy2: parsedFromTable.fy2,
          fy1: parsedFromTable.fy1,
          ltm: parsedFromTable.ltm
        });

        break; // Found good data, stop trying tables
      }
    }
  }
} catch (structuredParserError) {
  logger.warn('Structured table parsing failed, falling back to text parser', {
    documentId,
    error: structuredParserError instanceof Error ? structuredParserError.message : String(structuredParserError)
  });
}

// Fallback to text-based parsing if structured parsing failed
if (!deterministicFinancials) {
  try {
    const { parseFinancialsFromText } = await import('./financialTableParser');
    const parsedFinancials = parseFinancialsFromText(text);
    // ... existing code
  } catch (parserError) {
    // ...
    // existing error handling
  }
}
```

---

## Rollback Plan

If Phase 1 causes issues:

### Quick Rollback (5 minutes)

```bash
git checkout HEAD -- backend/src/services/documentAiProcessor.ts
git checkout HEAD -- backend/src/services/optimizedAgenticRAGProcessor.ts
npm run build
npm start
```

### Feature Flag Approach (Recommended)

Add an environment variable to control the new behavior:

```typescript
// backend/src/config/env.ts
export const config = {
  features: {
    useStructuredTables: process.env.USE_STRUCTURED_TABLES === 'true'
  }
};
```

Then wrap the new code:

```typescript
if (config.features.useStructuredTables) {
  // Use structured tables
} else {
  // Use old flat text approach
}
```

---

## Success Criteria

### Phase 1 Success:

- ✅ 60%+ of CIM documents have populated financial data (validated via new telemetry)
- ✅ No regression in processing time (< 10% increase acceptable)
- ✅ No errors in table extraction pipeline
- ✅ Structured tables logged in console

### Phase 2 Success:

- ✅ 85%+ of CIM documents have populated financial data or fall back to the hybrid path when `tablesFound === 0`
- ✅ Column alignment accuracy > 95%
- ✅ Reduction in "Not specified in CIM" responses

---

## Monitoring & Debugging

### Key Metrics to Track

```typescript
// Add to processing result metadata:
{
  tablesFound: number;
  financialTablesIdentified: number;
  structuredParsingUsed: boolean;
  textParsingFallback: boolean;
  financialDataPopulated: boolean;
}
```

### Log Analysis Queries

```bash
# Find documents with no tables
grep "totalTables: 0" backend.log

# Find failed table extractions
grep "Failed to extract table" backend.log

# Find successful financial extractions
grep "Structured table parsing successful" backend.log
```

---

## Next Steps After Implementation

1. **Run on historical documents**: Reprocess 10-20 existing CIMs to compare before/after
2. **A/B test**: Process new documents with both old and new system, compare results
3. **Tune thresholds**: Adjust financial table identification heuristics based on results
4. **Document findings**: Update this plan with actual results and lessons learned

---

## Resources

- [Document AI Table Extraction Docs](https://cloud.google.com/document-ai/docs/handle-response)
- [Financial Parser (current)](backend/src/services/financialTableParser.ts)
- [Financial Extractor (unused)](backend/src/utils/financialExtractor.ts)
- [Analysis Document](FINANCIAL_EXTRACTION_ANALYSIS.md)
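---

## Appendix: Heuristic Sanity Check (Optional)

The Phase 1 heuristics can be exercised outside the pipeline on a toy table. The sketch below re-implements `isFinancialTable` and `formatTableAsMarkdown` as free functions (the production versions are private class methods), with an abbreviated metric list and invented sample values:

```typescript
interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: { pageNumber: number; confidence: number };
}

// Standalone mirror of the isFinancialTable heuristic from Step 1.8
// (metric list abbreviated for the sketch)
function looksFinancial(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const allRowsText = table.rows.map(r => r.join(' ').toLowerCase()).join(' ');
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);
  const metrics = ['revenue', 'ebitda', 'gross profit', 'net income'];
  const hasMetrics = metrics.some(m => allRowsText.includes(m));
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(allRowsText);
  return hasPeriods && (hasMetrics || hasCurrency);
}

// Standalone mirror of formatTableAsMarkdown from Step 1.8
function toMarkdown(table: StructuredTable): string {
  const lines: string[] = [];
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }
  for (const row of table.rows) lines.push(`| ${row.join(' | ')} |`);
  return lines.join('\n');
}

// Invented sample values, shaped like a typical CIM financial summary table
const sample: StructuredTable = {
  headers: ['Metric', 'FY2022', 'FY2023', 'LTM'],
  rows: [
    ['Revenue', '$45.2M', '$52.8M', '$55.1M'],
    ['EBITDA', '$8.5M', '$10.2M', '$10.9M']
  ],
  position: { pageNumber: 12, confidence: 0.95 }
};
```

Running `looksFinancial(sample)` should return `true` (period headers plus currency values), and `toMarkdown(sample)` produces the pipe-delimited chunk content that Step 1.9 would embed.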