Major release with significant performance improvements and new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.

---
# Financial Data Extraction: Implementation Plan

## Overview

This document provides a step-by-step implementation plan to fix the financial data extraction issue by utilizing Document AI's structured table data.

---

## Phase 1: Quick Win Implementation (RECOMMENDED START)

**Timeline**: 1-2 hours
**Expected Improvement**: 60-70% accuracy gain
**Risk**: Low - additive changes, no breaking modifications

### Step 1.1: Update DocumentAIOutput Interface

**File**: `backend/src/services/documentAiProcessor.ts`

**Current (lines 15-25):**
```typescript
interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: Array<any>; // ❌ Just counts, no structure
  pages: Array<any>;
  mimeType: string;
}
```

**Updated:**
```typescript
export interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: {
    pageNumber: number;
    confidence: number;
  };
  rawTable?: any; // Keep original for debugging
}

interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: StructuredTable[]; // ✅ Full structure
  pages: Array<any>;
  mimeType: string;
}
```

### Step 1.2: Add Table Text Extraction Helper

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Add after line 51 (after constructor)

```typescript
/**
 * Extract text from a Document AI layout object using text anchors
 * Based on Google's best practices: https://cloud.google.com/document-ai/docs/handle-response
 */
private getTextFromLayout(layout: any, documentText: string): string {
  try {
    const textAnchor = layout?.textAnchor;
    if (!textAnchor?.textSegments || textAnchor.textSegments.length === 0) {
      return '';
    }

    // Get the first segment (most common case)
    const segment = textAnchor.textSegments[0];
    const startIndex = parseInt(segment.startIndex || '0');
    const endIndex = parseInt(segment.endIndex || documentText.length.toString());

    // Validate indices
    if (startIndex < 0 || endIndex > documentText.length || startIndex >= endIndex) {
      logger.warn('Invalid text anchor indices', { startIndex, endIndex, docLength: documentText.length });
      return '';
    }

    return documentText.substring(startIndex, endIndex).trim();
  } catch (error) {
    logger.error('Failed to extract text from layout', {
      error: error instanceof Error ? error.message : String(error),
      layout
    });
    return '';
  }
}
```

### Step 1.3: Add Structured Table Extraction

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Add after getTextFromLayout method

```typescript
/**
 * Extract structured tables from Document AI response
 * Preserves column alignment and table structure
 */
private extractStructuredTables(document: any, documentText: string): StructuredTable[] {
  const tables: StructuredTable[] = [];

  try {
    const pages = document.pages || [];
    logger.info('Extracting structured tables from Document AI response', {
      pageCount: pages.length
    });

    for (const page of pages) {
      const pageTables = page.tables || [];
      const pageNumber = page.pageNumber || 0;

      logger.info('Processing page for tables', {
        pageNumber,
        tableCount: pageTables.length
      });

      for (let tableIndex = 0; tableIndex < pageTables.length; tableIndex++) {
        const table = pageTables[tableIndex];

        try {
          // Extract headers from first header row
          const headers: string[] = [];
          if (table.headerRows && table.headerRows.length > 0) {
            const headerRow = table.headerRows[0];
            for (const cell of headerRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              headers.push(cellText);
            }
          }

          // Extract data rows
          const rows: string[][] = [];
          for (const bodyRow of table.bodyRows || []) {
            const row: string[] = [];
            for (const cell of bodyRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              row.push(cellText);
            }
            if (row.length > 0) {
              rows.push(row);
            }
          }

          // Only add tables with content
          if (headers.length > 0 || rows.length > 0) {
            tables.push({
              headers,
              rows,
              position: {
                pageNumber,
                confidence: table.confidence || 0.9
              },
              rawTable: table // Keep for debugging
            });

            logger.info('Extracted structured table', {
              pageNumber,
              tableIndex,
              headerCount: headers.length,
              rowCount: rows.length,
              headers: headers.slice(0, 10) // Log first 10 headers
            });
          }
        } catch (tableError) {
          logger.error('Failed to extract table', {
            pageNumber,
            tableIndex,
            error: tableError instanceof Error ? tableError.message : String(tableError)
          });
        }
      }
    }

    logger.info('Structured table extraction completed', {
      totalTables: tables.length
    });

  } catch (error) {
    logger.error('Failed to extract structured tables', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return tables;
}
```
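
The traversal above can be exercised without Document AI by hand-building a page object in the same shape: cells carry `textAnchor.textSegments` with string `startIndex`/`endIndex` offsets into the full document text, grouped into `headerRows` and `bodyRows`. Everything in the sketch below — the document text, offsets, and values — is invented for illustration:

```typescript
// Minimal standalone sketch of the table traversal against a hand-built mock.
// The document text and byte offsets below are invented.
type MockLayout = { textAnchor: { textSegments: { startIndex?: string; endIndex?: string }[] } };

const documentText = "Metric FY23 Revenue $45.2M";

function textFromLayout(layout: MockLayout, docText: string): string {
  const seg = layout.textAnchor.textSegments[0];
  const start = parseInt(seg.startIndex ?? "0", 10);
  const end = parseInt(seg.endIndex ?? String(docText.length), 10);
  return docText.substring(start, end).trim();
}

// Cells reference byte ranges of documentText, as Document AI responses do.
const mockTable = {
  headerRows: [{ cells: [
    { layout: { textAnchor: { textSegments: [{ startIndex: "0", endIndex: "6" }] } } },   // "Metric"
    { layout: { textAnchor: { textSegments: [{ startIndex: "7", endIndex: "11" }] } } },  // "FY23"
  ] }],
  bodyRows: [{ cells: [
    { layout: { textAnchor: { textSegments: [{ startIndex: "12", endIndex: "19" }] } } }, // "Revenue"
    { layout: { textAnchor: { textSegments: [{ startIndex: "20", endIndex: "26" }] } } }, // "$45.2M"
  ] }],
};

const headers = mockTable.headerRows[0].cells.map(c => textFromLayout(c.layout, documentText));
const rows = mockTable.bodyRows.map(r => r.cells.map(c => textFromLayout(c.layout, documentText)));

console.log(headers); // [ 'Metric', 'FY23' ]
console.log(rows);    // [ [ 'Revenue', '$45.2M' ] ]
```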

### Step 1.4: Update processWithDocumentAI to Use Structured Tables

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Update lines 462-482

**Current:**
```typescript
// Extract tables
const tables = document.pages?.flatMap(page =>
  page.tables?.map(table => ({
    rows: table.headerRows?.length || 0,
    columns: table.bodyRows?.[0]?.cells?.length || 0
  })) || []
) || [];
```

**Updated:**
```typescript
// Extract structured tables with full content
const tables = this.extractStructuredTables(document, text);
```

### Step 1.5: Pass Tables to Agentic RAG Processor

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Update line 337 (processLargeDocument call)

**Current:**
```typescript
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {}
);
```

**Updated:**
```typescript
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {
    structuredTables: documentAiOutput.tables || []
  }
);
```

### Step 1.6: Update Agentic RAG Processor Signature

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update lines 41-48

**Current:**
```typescript
async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
  } = {}
)
```

**Updated:**
```typescript
async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
    structuredTables?: StructuredTable[];
  } = {}
)
```

### Step 1.7: Add Import for StructuredTable Type

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Add to imports at top (around line 1-6)

```typescript
import type { StructuredTable } from './documentAiProcessor';
```

### Step 1.8: Create Financial Table Identifier

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Add after line 503 (after calculateCosineSimilarity)

```typescript
/**
 * Identify if a structured table contains financial data
 * Uses heuristics to detect financial tables vs. other tables
 */
private isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const allRowsText = table.rows.map(row => row.join(' ').toLowerCase()).join(' ');

  // Check for year/period indicators in headers
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);

  // Check for financial metrics in rows
  const financialMetrics = [
    'revenue', 'sales', 'ebitda', 'ebit', 'profit', 'margin',
    'gross profit', 'operating income', 'net income', 'cash flow',
    'earnings', 'assets', 'liabilities', 'equity'
  ];
  const hasFinancialMetrics = financialMetrics.some(metric =>
    allRowsText.includes(metric)
  );

  // Check for currency/percentage values
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(allRowsText);

  // A financial table should have periods AND (metrics OR currency values)
  const isFinancial = hasPeriods && (hasFinancialMetrics || hasCurrency);

  if (isFinancial) {
    logger.info('Identified financial table', {
      headers: table.headers,
      rowCount: table.rows.length,
      pageNumber: table.position.pageNumber
    });
  }

  return isFinancial;
}

/**
 * Format a structured table as markdown for better LLM comprehension
 * Preserves column alignment and makes tables human-readable
 */
private formatTableAsMarkdown(table: StructuredTable): string {
  const lines: string[] = [];

  // Add header row
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }

  // Add data rows
  for (const row of table.rows) {
    lines.push(`| ${row.join(' | ')} |`);
  }

  return lines.join('\n');
}
```
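
As a sanity check, the markdown formatting logic can be replicated as a standalone function and run on an invented table; the class method above does the same thing for a `StructuredTable`:

```typescript
// Standalone replica of the markdown formatting above; the sample table is invented.
interface Table { headers: string[]; rows: string[][] }

function toMarkdown(table: Table): string {
  const lines: string[] = [];
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(" | ")} |`);
    lines.push(`| ${table.headers.map(() => "---").join(" | ")} |`);
  }
  for (const row of table.rows) {
    lines.push(`| ${row.join(" | ")} |`);
  }
  return lines.join("\n");
}

const sample: Table = {
  headers: ["Metric", "FY22", "FY23"],
  rows: [["Revenue", "$45.2M", "$52.8M"], ["EBITDA", "$8.5M", "$10.2M"]],
};

console.log(toMarkdown(sample));
// | Metric | FY22 | FY23 |
// | --- | --- | --- |
// | Revenue | $45.2M | $52.8M |
// | EBITDA | $8.5M | $10.2M |
```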

### Step 1.9: Update Chunk Creation to Include Financial Tables

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update createIntelligentChunks method (lines 115-158)

**Add after line 118:**
```typescript
// Extract structured tables from options
const structuredTables = (options as any)?.structuredTables || [];
```

**Add after line 119 (inside the method, before semantic chunking):**
```typescript
// PRIORITY: Create dedicated chunks for financial tables
if (structuredTables.length > 0) {
  logger.info('Processing structured tables for chunking', {
    documentId,
    tableCount: structuredTables.length
  });

  for (let i = 0; i < structuredTables.length; i++) {
    const table = structuredTables[i];
    const isFinancial = this.isFinancialTable(table);

    // Format table as markdown for better readability
    const markdownTable = this.formatTableAsMarkdown(table);

    chunks.push({
      id: `${documentId}-table-${i}`,
      content: markdownTable,
      chunkIndex: chunks.length,
      startPosition: -1, // Tables don't have text positions
      endPosition: -1,
      sectionType: isFinancial ? 'financial-table' : 'table',
      metadata: {
        isStructuredTable: true,
        isFinancialTable: isFinancial,
        tableIndex: i,
        pageNumber: table.position.pageNumber,
        headerCount: table.headers.length,
        rowCount: table.rows.length,
        structuredData: table // Preserve original structure
      }
    });

    logger.info('Created chunk for structured table', {
      documentId,
      tableIndex: i,
      isFinancial,
      chunkId: chunks[chunks.length - 1].id,
      contentPreview: markdownTable.substring(0, 200)
    });
  }
}
```

### Step 1.10: Pin Financial Tables in Extraction

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update extractPass1CombinedMetadataFinancial method (around line 1190-1260)

**Add before the return statement (around line 1259):**
```typescript
// Identify and pin financial table chunks to ensure they're always included
const financialTableChunks = chunks.filter(
  chunk => chunk.metadata?.isFinancialTable === true
);

logger.info('Financial table chunks identified for pinning', {
  documentId,
  financialTableCount: financialTableChunks.length,
  chunkIds: financialTableChunks.map(c => c.id)
});

// Combine deterministic financial chunks with structured table chunks
const allPinnedChunks = [
  ...pinnedChunks,
  ...financialTableChunks
];
```

**Update the return statement to use allPinnedChunks:**
```typescript
return await this.extractWithTargetedQuery(
  documentId,
  text,
  financialChunks,
  query,
  targetFields,
  7,
  allPinnedChunks // ✅ Now includes both deterministic and structured tables
);
```

---

## Testing Phase 1

### Test 1.1: Verify Table Extraction
```bash
# Monitor logs for table extraction
cd backend
npm run dev

# Look for log entries:
# - "Extracting structured tables from Document AI response"
# - "Extracted structured table"
# - "Identified financial table"
```

### Test 1.2: Upload a CIM Document
```bash
# Upload a test document and check processing
curl -X POST http://localhost:8080/api/documents/upload \
  -F "file=@test-cim.pdf" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

### Test 1.3: Verify Financial Data Populated
Check the database or API response for:
- `financialSummary.financials.fy3.revenue` - Should have values
- `financialSummary.financials.fy2.ebitda` - Should have values
- NOT "Not specified in CIM" for fields that exist in tables

### Test 1.4: Check Logs for Success Indicators
```bash
# Should see:
# ✅ "Identified financial table" - confirms tables detected
# ✅ "Created chunk for structured table" - confirms chunking worked
# ✅ "Financial table chunks identified for pinning" - confirms pinning worked
# ✅ "Deterministic financial data merged successfully" - confirms data merged
```
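
Test 1.3 can also be scripted rather than eyeballed. Below is a sketch that flags any field still carrying the placeholder; the `financialSummary` shape follows the examples later in this document, so adjust the paths if your response differs:

```typescript
// Sketch of an automated check for Test 1.3: flag any period field still
// carrying the "Not specified in CIM" placeholder. Field paths assume the
// financialSummary shape shown in this document; sample values are invented.
type Period = Record<string, string>;
type Financials = Record<string, Period>;

function findUnpopulatedFields(financials: Financials): string[] {
  const missing: string[] = [];
  for (const [period, fields] of Object.entries(financials)) {
    for (const [field, value] of Object.entries(fields)) {
      if (value === "Not specified in CIM") {
        missing.push(`${period}.${field}`);
      }
    }
  }
  return missing;
}

const resp: Financials = {
  fy3: { revenue: "$45.2M", ebitda: "Not specified in CIM" },
  fy2: { revenue: "$52.8M", ebitda: "$10.2M" },
};

console.log(findUnpopulatedFields(resp)); // [ 'fy3.ebitda' ]
```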

---

### Baseline & Post-Change Metrics

Collect before/after numbers so we can validate the expected accuracy lift and know when to pull in the hybrid fallback:

1. Instrument the processing metadata (see `FINANCIAL_EXTRACTION_ANALYSIS.md`) with `tablesFound`, `financialTablesIdentified`, `structuredParsingUsed`, `textParsingFallback`, and `financialDataPopulated`.
2. Run ≥20 recent CIMs through the current pipeline and record aggregate stats (mean/median for the above plus sample `documentId`s with `tablesFound === 0`).
3. Repeat after deploying Phase 1 and Phase 2 changes; paste the numbers back into the analysis doc so Success Criteria reference real data instead of estimates.

---

## Expected Results After Phase 1

### Before Phase 1:
```json
{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      },
      "fy2": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      }
    }
  }
}
```

### After Phase 1:
```json
{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "$45.2M",
        "revenueGrowth": "N/A",
        "ebitda": "$8.5M",
        "ebitdaMargin": "18.8%"
      },
      "fy2": {
        "revenue": "$52.8M",
        "revenueGrowth": "16.8%",
        "ebitda": "$10.2M",
        "ebitdaMargin": "19.3%"
      }
    }
  }
}
```

---

## Phase 2: Enhanced Deterministic Parsing (Optional)

**Timeline**: 2-3 hours
**Expected Additional Improvement**: +15-20% accuracy
**Trigger**: If Phase 1 results are below 70% accuracy

### Step 2.1: Create Structured Table Parser

**File**: Create `backend/src/services/structuredFinancialParser.ts`

```typescript
import { logger } from '../utils/logger';
import type { StructuredTable } from './documentAiProcessor';
import type { ParsedFinancials, FinancialPeriod } from './financialTableParser';

/**
 * Parse financials directly from Document AI structured tables
 * This is more reliable than parsing from flattened text
 */
export function parseFinancialsFromStructuredTable(
  table: StructuredTable
): ParsedFinancials {
  const result: ParsedFinancials = {
    fy3: {},
    fy2: {},
    fy1: {},
    ltm: {}
  };

  try {
    // 1. Identify period columns from headers
    const periodMapping = mapHeadersToPeriods(table.headers);

    logger.info('Structured table period mapping', {
      headers: table.headers,
      periodMapping
    });

    // 2. Process each row to extract metrics
    for (let rowIndex = 0; rowIndex < table.rows.length; rowIndex++) {
      const row = table.rows[rowIndex];
      if (row.length === 0) continue;

      const metricName = row[0].toLowerCase();

      // Match against known financial metrics
      const fieldName = identifyMetricField(metricName);
      if (!fieldName) continue;

      // 3. Assign values to correct periods
      // periodMapping is index-aligned with table.headers (and therefore with
      // row), so the metric-name column maps to null and is skipped below
      periodMapping.forEach((period, columnIndex) => {
        if (!period) return; // Skip unmapped columns

        const value = row[columnIndex];
        if (!value || value.trim() === '') return;

        // 4. Validate value type matches field
        if (isValidValueForField(value, fieldName)) {
          result[period][fieldName] = value.trim();

          logger.debug('Mapped structured table value', {
            period,
            field: fieldName,
            value: value.trim(),
            row: rowIndex,
            column: columnIndex
          });
        }
      });
    }

    logger.info('Structured table parsing completed', {
      fy3: result.fy3,
      fy2: result.fy2,
      fy1: result.fy1,
      ltm: result.ltm
    });

  } catch (error) {
    logger.error('Failed to parse structured financial table', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return result;
}

/**
 * Map header columns to financial periods (fy3, fy2, fy1, ltm)
 */
function mapHeadersToPeriods(headers: string[]): Array<keyof ParsedFinancials | null> {
  const periodMapping: Array<keyof ParsedFinancials | null> = [];

  for (const header of headers) {
    const normalized = header.trim().toUpperCase().replace(/\s+/g, '');
    let period: keyof ParsedFinancials | null = null;

    // Check for LTM/TTM
    if (normalized.includes('LTM') || normalized.includes('TTM')) {
      period = 'ltm';
    }
    // Check for year patterns
    else if (/FY[-\s]?1$|FY[-\s]?2024|2024/.test(normalized)) {
      period = 'fy1'; // Most recent full year
    }
    else if (/FY[-\s]?2$|FY[-\s]?2023|2023/.test(normalized)) {
      period = 'fy2'; // Second most recent year
    }
    else if (/FY[-\s]?3$|FY[-\s]?2022|2022/.test(normalized)) {
      period = 'fy3'; // Third most recent year
    }
    // Generic FY pattern - assign based on position
    else if (/FY\d{2}/.test(normalized)) {
      // Will be assigned based on relative position
      period = null; // Handle in second pass
    }

    periodMapping.push(period);
  }

  // Second pass: fill in generic FY columns based on position
  // Most recent on right, oldest on left (common CIM format)
  let fyIndex = 1;
  for (let i = periodMapping.length - 1; i >= 0; i--) {
    if (periodMapping[i] === null && /FY/i.test(headers[i])) {
      if (fyIndex === 1) periodMapping[i] = 'fy1';
      else if (fyIndex === 2) periodMapping[i] = 'fy2';
      else if (fyIndex === 3) periodMapping[i] = 'fy3';
      fyIndex++;
    }
  }

  return periodMapping;
}

/**
 * Identify which financial field a metric name corresponds to
 */
function identifyMetricField(metricName: string): keyof FinancialPeriod | null {
  const name = metricName.toLowerCase();

  if (/^revenue|^net sales|^total sales|^top\s+line/.test(name)) {
    return 'revenue';
  }
  if (/gross\s*profit/.test(name)) {
    return 'grossProfit';
  }
  if (/gross\s*margin/.test(name)) {
    return 'grossMargin';
  }
  if (/ebitda\s*margin|adj\.?\s*ebitda\s*margin/.test(name)) {
    return 'ebitdaMargin';
  }
  if (/ebitda|adjusted\s*ebitda|adj\.?\s*ebitda/.test(name)) {
    return 'ebitda';
  }
  if (/revenue\s*growth|yoy|y\/y|year[-\s]*over[-\s]*year/.test(name)) {
    return 'revenueGrowth';
  }

  return null;
}

/**
 * Validate that a value is appropriate for a given field
 */
function isValidValueForField(value: string, field: keyof FinancialPeriod): boolean {
  const trimmed = value.trim();

  // Margin and growth fields should have %, or be explicitly "N/A"
  if (field.includes('Margin') || field.includes('Growth')) {
    return (/\d/.test(trimmed) && trimmed.includes('%')) || trimmed.toLowerCase() === 'n/a';
  }

  // Revenue, profit, EBITDA should have $ or scaled numbers
  if (['revenue', 'grossProfit', 'ebitda'].includes(field)) {
    return /\d/.test(trimmed) && (trimmed.includes('$') || /\d+[KMB]/i.test(trimmed));
  }

  return /\d/.test(trimmed);
}
```
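
The positional fallback in `mapHeadersToPeriods` is the subtle part: generic `FY##` columns are claimed right-to-left, so the rightmost one becomes `fy1`. A simplified standalone sketch (explicit-year matching omitted, header values invented) shows the intended mapping:

```typescript
// Simplified standalone sketch of the positional fallback: generic FY##
// columns are assigned right-to-left as fy1 (most recent), fy2, fy3, after
// explicit LTM/TTM columns are claimed. The metric-name column maps to null.
type PeriodKey = "fy1" | "fy2" | "fy3" | "ltm" | null;

function mapPeriods(headers: string[]): PeriodKey[] {
  const mapping: PeriodKey[] = headers.map(h => {
    const n = h.trim().toUpperCase();
    if (n.includes("LTM") || n.includes("TTM")) return "ltm";
    return null;
  });
  // Right-to-left: the rightmost generic FY column is the most recent year.
  let fyIndex = 1;
  for (let i = mapping.length - 1; i >= 0; i--) {
    if (mapping[i] === null && /FY\d{2}/i.test(headers[i])) {
      if (fyIndex === 1) mapping[i] = "fy1";
      else if (fyIndex === 2) mapping[i] = "fy2";
      else if (fyIndex === 3) mapping[i] = "fy3";
      fyIndex++;
    }
  }
  return mapping;
}

console.log(mapPeriods(["Metric", "FY22", "FY23", "FY24", "LTM"]));
// [ null, 'fy3', 'fy2', 'fy1', 'ltm' ]
```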

### Step 2.2: Integrate Structured Parser

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update multi-pass extraction (around line 1063-1088)

**Add import:**
```typescript
import { parseFinancialsFromStructuredTable } from './structuredFinancialParser';
```

**Update financial extraction logic (around line 1066-1088):**
```typescript
// Try structured table parsing first (most reliable)
try {
  const structuredTables = (options as any)?.structuredTables || [];
  const financialTables = structuredTables.filter((t: StructuredTable) => this.isFinancialTable(t));

  if (financialTables.length > 0) {
    logger.info('Attempting structured table parsing', {
      documentId,
      financialTableCount: financialTables.length
    });

    // Try each financial table until we get good data
    for (const table of financialTables) {
      const parsedFromTable = parseFinancialsFromStructuredTable(table);

      if (this.hasStructuredFinancialData(parsedFromTable)) {
        deterministicFinancials = parsedFromTable;
        deterministicFinancialChunk = this.buildDeterministicFinancialChunk(documentId, parsedFromTable);

        logger.info('Structured table parsing successful', {
          documentId,
          tableIndex: financialTables.indexOf(table),
          fy3: parsedFromTable.fy3,
          fy2: parsedFromTable.fy2,
          fy1: parsedFromTable.fy1,
          ltm: parsedFromTable.ltm
        });
        break; // Found good data, stop trying tables
      }
    }
  }
} catch (structuredParserError) {
  logger.warn('Structured table parsing failed, falling back to text parser', {
    documentId,
    error: structuredParserError instanceof Error ? structuredParserError.message : String(structuredParserError)
  });
}

// Fallback to text-based parsing if structured parsing failed
if (!deterministicFinancials) {
  try {
    const { parseFinancialsFromText } = await import('./financialTableParser');
    const parsedFinancials = parseFinancialsFromText(text);
    // ... existing code
  } catch (parserError) {
    // ... existing error handling
  }
}
```

---

## Rollback Plan

If Phase 1 causes issues:

### Quick Rollback (5 minutes)
```bash
git checkout HEAD -- backend/src/services/documentAiProcessor.ts
git checkout HEAD -- backend/src/services/optimizedAgenticRAGProcessor.ts
npm run build
npm start
```

### Feature Flag Approach (Recommended)
Add an environment variable to control the new behavior:

```typescript
// backend/src/config/env.ts
export const config = {
  features: {
    useStructuredTables: process.env.USE_STRUCTURED_TABLES === 'true'
  }
};
```

Then wrap the new code:
```typescript
if (config.features.useStructuredTables) {
  // Use structured tables
} else {
  // Use old flat text approach
}
```

---

## Success Criteria

### Phase 1 Success:
- ✅ 60%+ of CIM documents have populated financial data (validated via new telemetry)
- ✅ No regression in processing time (< 10% increase acceptable)
- ✅ No errors in table extraction pipeline
- ✅ Structured tables logged in console

### Phase 2 Success:
- ✅ 85%+ of CIM documents have populated financial data or fall back to the hybrid path when `tablesFound === 0`
- ✅ Column alignment accuracy > 95%
- ✅ Reduction in "Not specified in CIM" responses

---

## Monitoring & Debugging

### Key Metrics to Track
```typescript
// Add to processing result
metadata: {
  tablesFound: number;
  financialTablesIdentified: number;
  structuredParsingUsed: boolean;
  textParsingFallback: boolean;
  financialDataPopulated: boolean;
}
```

### Log Analysis Queries
```bash
# Find documents with no tables
grep "totalTables: 0" backend.log

# Find failed table extractions
grep "Failed to extract table" backend.log

# Find successful financial extractions
grep "Structured table parsing successful" backend.log
```

---

## Next Steps After Implementation

1. **Run on historical documents**: Reprocess 10-20 existing CIMs to compare before/after
2. **A/B test**: Process new documents with both the old and new system, compare results
3. **Tune thresholds**: Adjust financial table identification heuristics based on results
4. **Document findings**: Update this plan with actual results and lessons learned

---

## Resources

- [Document AI Table Extraction Docs](https://cloud.google.com/document-ai/docs/handle-response)
- [Financial Parser (current)](backend/src/services/financialTableParser.ts)
- [Financial Extractor (unused)](backend/src/utils/financialExtractor.ts)
- [Analysis Document](FINANCIAL_EXTRACTION_ANALYSIS.md)