Major release with significant performance improvements and new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.
Financial Data Extraction: Implementation Plan
Overview
This document provides a step-by-step implementation plan to fix the financial data extraction issue by utilizing Document AI's structured table data.
Phase 1: Quick Win Implementation (RECOMMENDED START)
Timeline: 1-2 hours
Expected Improvement: 60-70% accuracy gain
Risk: Low - additive changes, no breaking modifications
Step 1.1: Update DocumentAIOutput Interface
File: backend/src/services/documentAiProcessor.ts
Current (lines 15-25):
interface DocumentAIOutput {
text: string;
entities: Array<{...}>;
tables: Array<any>; // ❌ Just counts, no structure
pages: Array<any>;
mimeType: string;
}
Updated:
export interface StructuredTable {
headers: string[];
rows: string[][];
position: {
pageNumber: number;
confidence: number;
};
rawTable?: any; // Keep original for debugging
}
interface DocumentAIOutput {
text: string;
entities: Array<{...}>;
tables: StructuredTable[]; // ✅ Full structure
pages: Array<any>;
mimeType: string;
}
Step 1.2: Add Table Text Extraction Helper
File: backend/src/services/documentAiProcessor.ts
Location: Add after line 51 (after constructor)
/**
* Extract text from a Document AI layout object using text anchors
* Based on Google's best practices: https://cloud.google.com/document-ai/docs/handle-response
*/
private getTextFromLayout(layout: any, documentText: string): string {
try {
const textAnchor = layout?.textAnchor;
if (!textAnchor?.textSegments || textAnchor.textSegments.length === 0) {
return '';
}
// Get the first segment (most common case)
const segment = textAnchor.textSegments[0];
const startIndex = parseInt(segment.startIndex || '0', 10);
const endIndex = parseInt(segment.endIndex || documentText.length.toString(), 10);
// Validate indices
if (startIndex < 0 || endIndex > documentText.length || startIndex >= endIndex) {
logger.warn('Invalid text anchor indices', { startIndex, endIndex, docLength: documentText.length });
return '';
}
return documentText.substring(startIndex, endIndex).trim();
} catch (error) {
logger.error('Failed to extract text from layout', {
error: error instanceof Error ? error.message : String(error),
layout
});
return '';
}
}
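To make the anchor arithmetic concrete, here is a standalone, runnable sketch of the same logic with a mocked layout (the Layout/TextSegment shapes are simplified stand-ins for Document AI's types; the sample text and indices are invented):

```typescript
interface TextSegment { startIndex?: string; endIndex?: string; }
interface Layout { textAnchor?: { textSegments?: TextSegment[] }; }

// Slice the cell's text out of the full document text using its anchor indices
function getTextFromLayout(layout: Layout | undefined, documentText: string): string {
  const segments = layout?.textAnchor?.textSegments;
  if (!segments || segments.length === 0) return '';
  const seg = segments[0];
  const start = parseInt(seg.startIndex ?? '0', 10);
  const end = parseInt(seg.endIndex ?? String(documentText.length), 10);
  if (start < 0 || end > documentText.length || start >= end) return '';
  return documentText.substring(start, end).trim();
}

const docText = 'Revenue  $45.2M  $52.8M';
const cellLayout: Layout = {
  textAnchor: { textSegments: [{ startIndex: '0', endIndex: '7' }] }
};
// getTextFromLayout(cellLayout, docText) → 'Revenue'
```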
Step 1.3: Add Structured Table Extraction
File: backend/src/services/documentAiProcessor.ts
Location: Add after getTextFromLayout method
/**
* Extract structured tables from Document AI response
* Preserves column alignment and table structure
*/
private extractStructuredTables(document: any, documentText: string): StructuredTable[] {
const tables: StructuredTable[] = [];
try {
const pages = document.pages || [];
logger.info('Extracting structured tables from Document AI response', {
pageCount: pages.length
});
for (const page of pages) {
const pageTables = page.tables || [];
const pageNumber = page.pageNumber || 0;
logger.info('Processing page for tables', {
pageNumber,
tableCount: pageTables.length
});
for (let tableIndex = 0; tableIndex < pageTables.length; tableIndex++) {
const table = pageTables[tableIndex];
try {
// Extract headers from first header row
const headers: string[] = [];
if (table.headerRows && table.headerRows.length > 0) {
const headerRow = table.headerRows[0];
for (const cell of headerRow.cells || []) {
const cellText = this.getTextFromLayout(cell.layout, documentText);
headers.push(cellText);
}
}
// Extract data rows
const rows: string[][] = [];
for (const bodyRow of table.bodyRows || []) {
const row: string[] = [];
for (const cell of bodyRow.cells || []) {
const cellText = this.getTextFromLayout(cell.layout, documentText);
row.push(cellText);
}
if (row.length > 0) {
rows.push(row);
}
}
// Only add tables with content
if (headers.length > 0 || rows.length > 0) {
tables.push({
headers,
rows,
position: {
pageNumber,
confidence: table.confidence || 0.9
},
rawTable: table // Keep for debugging
});
logger.info('Extracted structured table', {
pageNumber,
tableIndex,
headerCount: headers.length,
rowCount: rows.length,
headers: headers.slice(0, 10) // Log first 10 headers
});
}
} catch (tableError) {
logger.error('Failed to extract table', {
pageNumber,
tableIndex,
error: tableError instanceof Error ? tableError.message : String(tableError)
});
}
}
}
logger.info('Structured table extraction completed', {
totalTables: tables.length
});
} catch (error) {
logger.error('Failed to extract structured tables', {
error: error instanceof Error ? error.message : String(error)
});
}
return tables;
}
Step 1.4: Update processWithDocumentAI to Use Structured Tables
File: backend/src/services/documentAiProcessor.ts
Location: Update lines 462-482
Current:
// Extract tables
const tables = document.pages?.flatMap(page =>
page.tables?.map(table => ({
rows: table.headerRows?.length || 0,
columns: table.bodyRows?.[0]?.cells?.length || 0
})) || []
) || [];
Updated:
// Extract structured tables with full content
const tables = this.extractStructuredTables(document, text);
Step 1.5: Pass Tables to Agentic RAG Processor
File: backend/src/services/documentAiProcessor.ts
Location: Update line 337 (processLargeDocument call)
Current:
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
documentId,
extractedText,
{}
);
Updated:
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
documentId,
extractedText,
{
structuredTables: documentAiOutput.tables || []
}
);
Step 1.6: Update Agentic RAG Processor Signature
File: backend/src/services/optimizedAgenticRAGProcessor.ts
Location: Update lines 41-48
Current:
async processLargeDocument(
documentId: string,
text: string,
options: {
enableSemanticChunking?: boolean;
enableMetadataEnrichment?: boolean;
similarityThreshold?: number;
} = {}
)
Updated:
async processLargeDocument(
documentId: string,
text: string,
options: {
enableSemanticChunking?: boolean;
enableMetadataEnrichment?: boolean;
similarityThreshold?: number;
structuredTables?: StructuredTable[];
} = {}
)
Step 1.7: Add Import for StructuredTable Type
File: backend/src/services/optimizedAgenticRAGProcessor.ts
Location: Add to imports at top (around line 1-6)
import type { StructuredTable } from './documentAiProcessor';
Step 1.8: Create Financial Table Identifier
File: backend/src/services/optimizedAgenticRAGProcessor.ts
Location: Add after line 503 (after calculateCosineSimilarity)
/**
* Identify if a structured table contains financial data
* Uses heuristics to detect financial tables vs. other tables
*/
private isFinancialTable(table: StructuredTable): boolean {
const headerText = table.headers.join(' ').toLowerCase();
const allRowsText = table.rows.map(row => row.join(' ').toLowerCase()).join(' ');
// Check for year/period indicators in headers
const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);
// Check for financial metrics in rows
const financialMetrics = [
'revenue', 'sales', 'ebitda', 'ebit', 'profit', 'margin',
'gross profit', 'operating income', 'net income', 'cash flow',
'earnings', 'assets', 'liabilities', 'equity'
];
const hasFinancialMetrics = financialMetrics.some(metric =>
allRowsText.includes(metric)
);
// Check for currency/percentage values
const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(allRowsText);
// A financial table should have periods AND (metrics OR currency values)
const isFinancial = hasPeriods && (hasFinancialMetrics || hasCurrency);
if (isFinancial) {
logger.info('Identified financial table', {
headers: table.headers,
rowCount: table.rows.length,
pageNumber: table.position.pageNumber
});
}
return isFinancial;
}
/**
* Format a structured table as markdown for better LLM comprehension
* Preserves column alignment and makes tables human-readable
*/
private formatTableAsMarkdown(table: StructuredTable): string {
const lines: string[] = [];
// Add header row
if (table.headers.length > 0) {
lines.push(`| ${table.headers.join(' | ')} |`);
lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
}
// Add data rows
for (const row of table.rows) {
lines.push(`| ${row.join(' | ')} |`);
}
return lines.join('\n');
}
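The two helpers can be sanity-checked outside the class with a simplified table shape (values invented):

```typescript
interface SimpleTable { headers: string[]; rows: string[][]; }

// Same heuristic as above: periods in headers AND (metrics OR currency) in rows
function isFinancialTable(table: SimpleTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const rowsText = table.rows.map(r => r.join(' ').toLowerCase()).join(' ');
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);
  const hasMetrics = ['revenue', 'sales', 'ebitda', 'margin'].some(m => rowsText.includes(m));
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(rowsText);
  return hasPeriods && (hasMetrics || hasCurrency);
}

// Same markdown rendering as the class method
function formatTableAsMarkdown(table: SimpleTable): string {
  const lines: string[] = [];
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }
  for (const row of table.rows) lines.push(`| ${row.join(' | ')} |`);
  return lines.join('\n');
}

const income: SimpleTable = {
  headers: ['Metric', 'FY2022', 'FY2023'],
  rows: [['Revenue', '$45.2M', '$52.8M']]
};
// isFinancialTable(income) → true
// formatTableAsMarkdown(income):
// | Metric | FY2022 | FY2023 |
// | --- | --- | --- |
// | Revenue | $45.2M | $52.8M |
```

A management-bios table (`headers: ['Name', 'Title']`) has no period columns, so the heuristic correctly rejects it.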
Step 1.9: Update Chunk Creation to Include Financial Tables
File: backend/src/services/optimizedAgenticRAGProcessor.ts
Location: Update createIntelligentChunks method (lines 115-158)
Add after line 118:
// Extract structured tables from options
const structuredTables = (options as any)?.structuredTables || [];
Add after line 119 (inside the method, before semantic chunking):
// PRIORITY: Create dedicated chunks for financial tables
if (structuredTables.length > 0) {
logger.info('Processing structured tables for chunking', {
documentId,
tableCount: structuredTables.length
});
for (let i = 0; i < structuredTables.length; i++) {
const table = structuredTables[i];
const isFinancial = this.isFinancialTable(table);
// Format table as markdown for better readability
const markdownTable = this.formatTableAsMarkdown(table);
chunks.push({
id: `${documentId}-table-${i}`,
content: markdownTable,
chunkIndex: chunks.length,
startPosition: -1, // Tables don't have text positions
endPosition: -1,
sectionType: isFinancial ? 'financial-table' : 'table',
metadata: {
isStructuredTable: true,
isFinancialTable: isFinancial,
tableIndex: i,
pageNumber: table.position.pageNumber,
headerCount: table.headers.length,
rowCount: table.rows.length,
structuredData: table // Preserve original structure
}
});
logger.info('Created chunk for structured table', {
documentId,
tableIndex: i,
isFinancial,
chunkId: chunks[chunks.length - 1].id,
contentPreview: markdownTable.substring(0, 200)
});
}
}
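The chunking step above can be sketched in isolation; chunkTables is a hypothetical helper that shows only the id and sectionType behavior, with the other chunk fields omitted:

```typescript
interface TableChunk {
  id: string;
  content: string;
  sectionType: 'financial-table' | 'table';
}

// Each structured table becomes its own chunk with a stable, index-based id
function chunkTables(
  documentId: string,
  tables: { markdown: string; isFinancial: boolean }[]
): TableChunk[] {
  return tables.map((t, i) => ({
    id: `${documentId}-table-${i}`,
    content: t.markdown,
    sectionType: t.isFinancial ? 'financial-table' : 'table'
  }));
}

const chunks = chunkTables('doc-123', [
  { markdown: '| Revenue | $45.2M |', isFinancial: true },
  { markdown: '| Name | Title |', isFinancial: false }
]);
// chunks[0].id → 'doc-123-table-0', sectionType 'financial-table'
```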
Step 1.10: Pin Financial Tables in Extraction
File: backend/src/services/optimizedAgenticRAGProcessor.ts
Location: Update extractPass1CombinedMetadataFinancial method (around line 1190-1260)
Add before the return statement (around line 1259):
// Identify and pin financial table chunks to ensure they're always included
const financialTableChunks = chunks.filter(
chunk => chunk.metadata?.isFinancialTable === true
);
logger.info('Financial table chunks identified for pinning', {
documentId,
financialTableCount: financialTableChunks.length,
chunkIds: financialTableChunks.map(c => c.id)
});
// Combine deterministic financial chunks with structured table chunks
const allPinnedChunks = [
...pinnedChunks,
...financialTableChunks
];
Update the return statement to use allPinnedChunks:
return await this.extractWithTargetedQuery(
documentId,
text,
financialChunks,
query,
targetFields,
7,
allPinnedChunks // ✅ Now includes both deterministic and structured tables
);
Testing Phase 1
Test 1.1: Verify Table Extraction
# Monitor logs for table extraction
cd backend
npm run dev
# Look for log entries:
# - "Extracting structured tables from Document AI response"
# - "Extracted structured table"
# - "Identified financial table"
Test 1.2: Upload a CIM Document
# Upload a test document and check processing
curl -X POST http://localhost:8080/api/documents/upload \
-F "file=@test-cim.pdf" \
-H "Authorization: Bearer YOUR_TOKEN"
Test 1.3: Verify Financial Data Populated
Check the database or API response for:
- `financialSummary.financials.fy3.revenue` - Should have values
- `financialSummary.financials.fy2.ebitda` - Should have values
- NOT "Not specified in CIM" for fields that exist in tables
Test 1.4: Check Logs for Success Indicators
# Should see:
✅ "Identified financial table" - confirms tables detected
✅ "Created chunk for structured table" - confirms chunking worked
✅ "Financial table chunks identified for pinning" - confirms pinning worked
✅ "Deterministic financial data merged successfully" - confirms data merged
Baseline & Post-Change Metrics
Collect before/after numbers so we can validate the expected accuracy lift and know when to pull in the hybrid fallback:
- Instrument the processing metadata (see `FINANCIAL_EXTRACTION_ANALYSIS.md`) with `tablesFound`, `financialTablesIdentified`, `structuredParsingUsed`, `textParsingFallback`, and `financialDataPopulated`.
- Run ≥20 recent CIMs through the current pipeline and record aggregate stats (mean/median for the above plus sample `documentId`s with `tablesFound === 0`).
- Repeat after deploying Phase 1 and Phase 2 changes; paste the numbers back into the analysis doc so Success Criteria reference real data instead of estimates.
Expected Results After Phase 1
Before Phase 1:
{
"financialSummary": {
"financials": {
"fy3": {
"revenue": "Not specified in CIM",
"ebitda": "Not specified in CIM"
},
"fy2": {
"revenue": "Not specified in CIM",
"ebitda": "Not specified in CIM"
}
}
}
}
After Phase 1:
{
"financialSummary": {
"financials": {
"fy3": {
"revenue": "$45.2M",
"revenueGrowth": "N/A",
"ebitda": "$8.5M",
"ebitdaMargin": "18.8%"
},
"fy2": {
"revenue": "$52.8M",
"revenueGrowth": "16.8%",
"ebitda": "$10.2M",
"ebitdaMargin": "19.3%"
}
}
}
}
Phase 2: Enhanced Deterministic Parsing (Optional)
Timeline: 2-3 hours
Expected Additional Improvement: +15-20% accuracy
Trigger: If Phase 1 results are below 70% accuracy
Step 2.1: Create Structured Table Parser
File: Create backend/src/services/structuredFinancialParser.ts
import { logger } from '../utils/logger';
import type { StructuredTable } from './documentAiProcessor';
import type { ParsedFinancials, FinancialPeriod } from './financialTableParser';
/**
* Parse financials directly from Document AI structured tables
* This is more reliable than parsing from flattened text
*/
export function parseFinancialsFromStructuredTable(
table: StructuredTable
): ParsedFinancials {
const result: ParsedFinancials = {
fy3: {},
fy2: {},
fy1: {},
ltm: {}
};
try {
// 1. Identify period columns from headers
const periodMapping = mapHeadersToPeriods(table.headers);
logger.info('Structured table period mapping', {
headers: table.headers,
periodMapping
});
// 2. Process each row to extract metrics
for (let rowIndex = 0; rowIndex < table.rows.length; rowIndex++) {
const row = table.rows[rowIndex];
if (row.length === 0) continue;
const metricName = row[0].toLowerCase();
// Match against known financial metrics
const fieldName = identifyMetricField(metricName);
if (!fieldName) continue;
// 3. Assign values to correct periods
periodMapping.forEach((period, columnIndex) => {
if (!period) return; // Skip unmapped columns
const value = row[columnIndex + 1]; // +1 because first column is metric name
if (!value || value.trim() === '') return;
// 4. Validate value type matches field
if (isValidValueForField(value, fieldName)) {
result[period][fieldName] = value.trim();
logger.debug('Mapped structured table value', {
period,
field: fieldName,
value: value.trim(),
row: rowIndex,
column: columnIndex
});
}
});
}
logger.info('Structured table parsing completed', {
fy3: result.fy3,
fy2: result.fy2,
fy1: result.fy1,
ltm: result.ltm
});
} catch (error) {
logger.error('Failed to parse structured financial table', {
error: error instanceof Error ? error.message : String(error)
});
}
return result;
}
/**
* Map header columns to financial periods (fy3, fy2, fy1, ltm)
*/
function mapHeadersToPeriods(headers: string[]): Array<keyof ParsedFinancials | null> {
const periodMapping: Array<keyof ParsedFinancials | null> = [];
for (const header of headers) {
const normalized = header.trim().toUpperCase().replace(/\s+/g, '');
let period: keyof ParsedFinancials | null = null;
// Check for LTM/TTM
if (normalized.includes('LTM') || normalized.includes('TTM')) {
period = 'ltm';
}
// Check for year patterns
else if (/FY[-\s]?1$|FY[-\s]?2024|2024/.test(normalized)) {
period = 'fy1'; // Most recent full year
}
else if (/FY[-\s]?2$|FY[-\s]?2023|2023/.test(normalized)) {
period = 'fy2'; // Second most recent year
}
else if (/FY[-\s]?3$|FY[-\s]?2022|2022/.test(normalized)) {
period = 'fy3'; // Third most recent year
}
// Generic FY pattern - assign based on position
else if (/FY\d{2}/.test(normalized)) {
// Will be assigned based on relative position
period = null; // Handle in second pass
}
periodMapping.push(period);
}
// Second pass: fill in generic FY columns based on position
// Most recent on right, oldest on left (common CIM format)
let fyIndex = 1;
for (let i = periodMapping.length - 1; i >= 0; i--) {
if (periodMapping[i] === null && /FY/i.test(headers[i])) {
if (fyIndex === 1) periodMapping[i] = 'fy1';
else if (fyIndex === 2) periodMapping[i] = 'fy2';
else if (fyIndex === 3) periodMapping[i] = 'fy3';
fyIndex++;
}
}
return periodMapping;
}
/**
* Identify which financial field a metric name corresponds to
*/
function identifyMetricField(metricName: string): keyof FinancialPeriod | null {
const name = metricName.toLowerCase();
if (/^revenue|^net sales|^total sales|^top\s+line/.test(name)) {
return 'revenue';
}
if (/gross\s*profit/.test(name)) {
return 'grossProfit';
}
if (/gross\s*margin/.test(name)) {
return 'grossMargin';
}
if (/ebitda\s*margin|adj\.?\s*ebitda\s*margin/.test(name)) {
return 'ebitdaMargin';
}
if (/ebitda|adjusted\s*ebitda|adj\.?\s*ebitda/.test(name)) {
return 'ebitda';
}
if (/revenue\s*growth|yoy|y\/y|year[-\s]*over[-\s]*year/.test(name)) {
return 'revenueGrowth';
}
return null;
}
/**
* Validate that a value is appropriate for a given field
*/
function isValidValueForField(value: string, field: keyof FinancialPeriod): boolean {
const trimmed = value.trim();
// Margin and growth fields should have %
if (field.includes('Margin') || field.includes('Growth')) {
return /\d/.test(trimmed) && (trimmed.includes('%') || trimmed.toLowerCase() === 'n/a');
}
// Revenue, profit, EBITDA should have $ or numbers
if (['revenue', 'grossProfit', 'ebitda'].includes(field)) {
return /\d/.test(trimmed) && (trimmed.includes('$') || /\d+[KMB]/i.test(trimmed));
}
return /\d/.test(trimmed);
}
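The positional second pass in mapHeadersToPeriods is the subtlest piece of the parser; here is a standalone sketch of just that fallback (mapGenericFyHeaders is a hypothetical, reduced version that only handles LTM and generic FY columns):

```typescript
type Period = 'fy1' | 'fy2' | 'fy3' | 'ltm' | null;

// Generic FY headers are assigned right-to-left, newest year (fy1)
// rightmost, matching the common CIM layout of oldest-on-the-left
function mapGenericFyHeaders(headers: string[]): Period[] {
  const mapping: Period[] = headers.map(h => (/LTM|TTM/i.test(h) ? 'ltm' : null));
  let fyIndex = 1;
  for (let i = mapping.length - 1; i >= 0; i--) {
    if (mapping[i] === null && /FY/i.test(headers[i])) {
      if (fyIndex === 1) mapping[i] = 'fy1';
      else if (fyIndex === 2) mapping[i] = 'fy2';
      else if (fyIndex === 3) mapping[i] = 'fy3';
      fyIndex++;
    }
  }
  return mapping;
}

// mapGenericFyHeaders(['Metric', 'FY21', 'FY22', 'FY23', 'LTM'])
//   → [null, 'fy3', 'fy2', 'fy1', 'ltm']
```

Non-FY columns (like the leading metric-name column) stay `null` and are skipped when values are assigned.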
Step 2.2: Integrate Structured Parser
File: backend/src/services/optimizedAgenticRAGProcessor.ts
Location: Update multi-pass extraction (around line 1063-1088)
Add import:
import { parseFinancialsFromStructuredTable } from './structuredFinancialParser';
Update financial extraction logic (around line 1066-1088):
// Try structured table parsing first (most reliable)
try {
const structuredTables = (options as any)?.structuredTables || [];
const financialTables = structuredTables.filter((t: StructuredTable) => this.isFinancialTable(t));
if (financialTables.length > 0) {
logger.info('Attempting structured table parsing', {
documentId,
financialTableCount: financialTables.length
});
// Try each financial table until we get good data
for (const table of financialTables) {
const parsedFromTable = parseFinancialsFromStructuredTable(table);
if (this.hasStructuredFinancialData(parsedFromTable)) {
deterministicFinancials = parsedFromTable;
deterministicFinancialChunk = this.buildDeterministicFinancialChunk(documentId, parsedFromTable);
logger.info('Structured table parsing successful', {
documentId,
tableIndex: financialTables.indexOf(table),
fy3: parsedFromTable.fy3,
fy2: parsedFromTable.fy2,
fy1: parsedFromTable.fy1,
ltm: parsedFromTable.ltm
});
break; // Found good data, stop trying tables
}
}
}
} catch (structuredParserError) {
logger.warn('Structured table parsing failed, falling back to text parser', {
documentId,
error: structuredParserError instanceof Error ? structuredParserError.message : String(structuredParserError)
});
}
// Fallback to text-based parsing if structured parsing failed
if (!deterministicFinancials) {
try {
const { parseFinancialsFromText } = await import('./financialTableParser');
const parsedFinancials = parseFinancialsFromText(text);
// ... existing code
} catch (parserError) {
// ... existing error handling
}
}
Rollback Plan
If Phase 1 causes issues:
Quick Rollback (5 minutes)
git checkout HEAD -- backend/src/services/documentAiProcessor.ts
git checkout HEAD -- backend/src/services/optimizedAgenticRAGProcessor.ts
npm run build
npm start
Feature Flag Approach (Recommended)
Add environment variable to control new behavior:
// backend/src/config/env.ts
export const config = {
features: {
useStructuredTables: process.env.USE_STRUCTURED_TABLES === 'true'
}
};
Then wrap new code:
if (config.features.useStructuredTables) {
// Use structured tables
} else {
// Use old flat text approach
}
Success Criteria
Phase 1 Success:
- ✅ 60%+ of CIM documents have populated financial data (validated via new telemetry)
- ✅ No regression in processing time (< 10% increase acceptable)
- ✅ No errors in table extraction pipeline
- ✅ Structured tables logged in console
Phase 2 Success:
- ✅ 85%+ of CIM documents have populated financial data or fall back to the hybrid path when `tablesFound === 0`
- ✅ Reduction in "Not specified in CIM" responses
Monitoring & Debugging
Key Metrics to Track
// Add to processing result
metadata: {
tablesFound: number;
financialTablesIdentified: number;
structuredParsingUsed: boolean;
textParsingFallback: boolean;
financialDataPopulated: boolean;
}
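Once the metadata above is recorded per document, a small aggregation yields the baseline numbers the plan calls for (summarize is a hypothetical helper; the sample runs are invented, but the field names match the metadata shape above):

```typescript
interface ExtractionMetadata {
  documentId: string;
  tablesFound: number;
  financialTablesIdentified: number;
  structuredParsingUsed: boolean;
  financialDataPopulated: boolean;
}

// Aggregate per-document metadata into the before/after stats the plan tracks
function summarize(runs: ExtractionMetadata[]) {
  const n = runs.length;
  return {
    documents: n,
    populatedRate: runs.filter(r => r.financialDataPopulated).length / n,
    noTableDocs: runs.filter(r => r.tablesFound === 0).map(r => r.documentId)
  };
}

const runs: ExtractionMetadata[] = [
  { documentId: 'a', tablesFound: 3, financialTablesIdentified: 1, structuredParsingUsed: true, financialDataPopulated: true },
  { documentId: 'b', tablesFound: 0, financialTablesIdentified: 0, structuredParsingUsed: false, financialDataPopulated: false }
];
// summarize(runs) → { documents: 2, populatedRate: 0.5, noTableDocs: ['b'] }
```

The `noTableDocs` list is exactly the set of documents that should be routed to the hybrid fallback called out in the Success Criteria.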
Log Analysis Queries
# Find documents with no tables
grep "totalTables: 0" backend.log
# Find failed table extractions
grep "Failed to extract table" backend.log
# Find successful financial extractions
grep "Structured table parsing successful" backend.log
Next Steps After Implementation
- Run on historical documents: Reprocess 10-20 existing CIMs to compare before/after
- A/B test: Process new documents with both old and new system, compare results
- Tune thresholds: Adjust financial table identification heuristics based on results
- Document findings: Update this plan with actual results and lessons learned