Major release with significant performance improvements and a new processing strategy.

## Core Changes
- Implemented `simple_full_document` processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern `defineSecret` approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs. environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.

---
# Financial Data Extraction Issue: Root Cause Analysis & Solution

## Executive Summary

**Problem**: Financial data shows as "Not specified in CIM" even when tables exist in the PDF.

**Root Cause**: Document AI's structured table data is being **completely ignored** in favor of flattened text, causing the parser to fail.

**Impact**: ~80-90% of financial tables fail to parse correctly.

---
## Current Pipeline Analysis

### Stage 1: Document AI Processing ✅ (Working but underutilized)

```typescript
// documentAiProcessor.ts:408-482
private async processWithDocumentAI() {
  const [result] = await this.documentAiClient.processDocument(request);
  const { document } = result;

  // ✅ Extracts structured tables
  const tables = document.pages?.flatMap(page =>
    page.tables?.map(table => ({
      rows: table.headerRows?.length || 0,              // ❌ Only counting!
      columns: table.bodyRows?.[0]?.cells?.length || 0  // ❌ Not using!
    }))
  );

  // ❌ PROBLEM: Only returns flat text, throws away table structure
  return { text: document.text, entities, tables, pages };
}
```

**What Document AI Actually Provides:**
- `document.pages[].tables[]` - Fully structured tables with:
  - `headerRows[]` - Column headers with cell text via layout anchors
  - `bodyRows[]` - Data rows with aligned cell values
  - `layout` - Text positions in the original document
  - `cells[]` - Individual cell data with rowSpan/colSpan

**What We're Using:** Only `document.text` (flattened)

---
### Stage 2: Text Extraction ❌ (Losing structure)

```typescript
// documentAiProcessor.ts:151-207
const extractedText = await this.extractTextFromDocument(fileBuffer, fileName, mimeType);
// Returns: "FY-3 FY-2 FY-1 LTM Revenue $45.2M $52.8M $61.2M $58.5M EBITDA $8.5M..."
// Lost: Column alignment, row structure, table boundaries
```

**Original PDF Table:**
```
                FY-3     FY-2     FY-1     LTM
Revenue         $45.2M   $52.8M   $61.2M   $58.5M
Revenue Growth  N/A      16.8%    15.9%    (4.4)%
EBITDA          $8.5M    $10.2M   $12.1M   $11.5M
EBITDA Margin   18.8%    19.3%    19.8%    19.7%
```

**What Parser Receives (flattened):**
```
FY-3 FY-2 FY-1 LTM Revenue $45.2M $52.8M $61.2M $58.5M Revenue Growth N/A 16.8% 15.9% (4.4)% EBITDA $8.5M $10.2M $12.1M $11.5M EBITDA Margin 18.8% 19.3% 19.8% 19.7%
```

---
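To make the loss concrete, here is a small standalone sketch (sample string taken from the table above, abbreviated) showing why column-splitting heuristics have nothing left to work with once rows are joined:

```typescript
// Once rows are joined into one line, a whitespace split yields a single
// token stream with no row or column boundaries.
const flattened =
  "FY-3 FY-2 FY-1 LTM Revenue $45.2M $52.8M $61.2M $58.5M " +
  "EBITDA $8.5M $10.2M $12.1M $11.5M";

// The column-splitting heuristic (2+ spaces or tabs) finds nothing to
// split on, so the whole line lands in one undifferentiated "column".
const columns = flattened.split(/\s{2,}|\t/);
console.log(columns.length); // 1

// Splitting on any whitespace gives tokens, but no way to tell which
// row or column each token came from.
const tokens = flattened.split(/\s+/);
console.log(tokens.length); // 14
```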
### Stage 3: Deterministic Parser ❌ (Fighting lost structure)

```typescript
// financialTableParser.ts:181-406
export function parseFinancialsFromText(fullText: string): ParsedFinancials {
  // 1. Find header line with year tokens (FY-3, FY-2, etc.)
  // ❌ PROBLEM: Years might be on different lines now

  // 2. Look for revenue/EBITDA rows within 20 lines
  // ❌ PROBLEM: Row detection works, but...

  // 3. Extract numeric tokens and assign to columns
  // ❌ PROBLEM: Can't determine which number belongs to which column!
  // Numbers are just in sequence: $45.2M $52.8M $61.2M $58.5M
  // Are these revenues for FY-3, FY-2, FY-1, LTM? Or something else?

  // Result: Returns empty {} or incorrect mappings
}
```

**Failure Points:**
1. **Header Detection** (lines 197-278): Requires period tokens in ONE line
   - Flattened text scatters tokens across multiple lines
   - Scoring system can't find tables with both revenue AND EBITDA
2. **Column Alignment** (lines 160-179): Assumes tokens map to buckets by position
   - No way to know which token belongs to which column
   - Whitespace-based alignment is lost
3. **Multi-line Tables**: Financial tables often span multiple lines per row
   - Parser combines 2-3 lines but still can't reconstruct columns

---
### Stage 4: LLM Extraction ⚠️ (Limited context)

```typescript
// optimizedAgenticRAGProcessor.ts:1552-1641
private async extractWithTargetedQuery() {
  // 1. RAG selects ~7 most relevant chunks
  // 2. Each chunk truncated to 1500 chars
  // 3. Total context: ~10,500 chars

  // ❌ PROBLEM: Financial tables might be:
  // - Split across multiple chunks
  // - Not in the top 7 most "similar" chunks
  // - Truncated mid-table
  // - Still in flattened format anyway
}
```

---
## Unused Assets

### 1. Document AI Table Structure (BIGGEST MISS)
**Location**: Available in the Document AI response but never used

**What It Provides:**
```typescript
document.pages[0].tables[0] = {
  layout: { /* table position */ },
  headerRows: [{
    cells: [
      { layout: { textAnchor: { start: 123, end: 127 } } }, // "FY-3"
      { layout: { textAnchor: { start: 135, end: 139 } } }, // "FY-2"
      // ...
    ]
  }],
  bodyRows: [{
    cells: [
      { layout: { textAnchor: { start: 200, end: 207 } } }, // "Revenue"
      { layout: { textAnchor: { start: 215, end: 222 } } }, // "$45.2M"
      { layout: { textAnchor: { start: 230, end: 237 } } }, // "$52.8M"
      // ...
    ]
  }]
}
```

**How to Use:**
```typescript
function getTableText(layout, documentText) {
  const start = layout.textAnchor.textSegments[0].startIndex;
  const end = layout.textAnchor.textSegments[0].endIndex;
  return documentText.substring(start, end);
}
```
### 2. Financial Extractor Utility
**Location**: `src/utils/financialExtractor.ts` (lines 1-159)

**Features:**
- Robust column splitting: `/\s{2,}|\t/` (2+ spaces or tabs)
- Clean value parsing with K/M/B multipliers
- Percentage and negative number handling
- Better than the current parser, but still operates on flat text

**Status**: Never imported or used anywhere in the codebase

---
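The multiplier-aware value parsing can be sketched as follows. This is a hypothetical reconstruction in the spirit of `financialExtractor.ts`, not its actual code; `parseFinancialValue` is an illustrative name:

```typescript
// Hypothetical sketch of multiplier-aware value parsing, in the spirit of
// financialExtractor.ts (the real function names there may differ).
function parseFinancialValue(raw: string): number | null {
  // Normalize: strip currency symbols, commas, percent signs, and whitespace.
  // Note: percent values come back as bare numbers (4.4, not 0.044).
  let s = raw.trim().replace(/[$,%]/g, "");

  // Accounting-style negatives: (4.4) means -4.4
  let sign = 1;
  if (/^\(.*\)$/.test(s)) {
    sign = -1;
    s = s.slice(1, -1);
  }

  // Detect an optional K/M/B multiplier suffix
  const match = s.match(/^(-?\d+(?:\.\d+)?)([kmb])?$/i);
  if (!match) return null;

  const multipliers: Record<string, number> = { k: 1e3, m: 1e6, b: 1e9 };
  const base = parseFloat(match[1]);
  const mult = match[2] ? multipliers[match[2].toLowerCase()] : 1;
  return sign * base * mult;
}

console.log(parseFinancialValue("$45M"));   // 45000000
console.log(parseFinancialValue("(4.4)%")); // -4.4
console.log(parseFinancialValue("N/A"));    // null
```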
## Root Cause Summary

| Issue | Impact | Severity |
|-------|--------|----------|
| Document AI table structure ignored | 100% structure loss | 🔴 CRITICAL |
| Only flat text used for parsing | Parser can't align columns | 🔴 CRITICAL |
| financialExtractor.ts not used | Missing better parsing logic | 🟡 MEDIUM |
| RAG chunks miss complete tables | LLM has incomplete data | 🟡 MEDIUM |
| No table-aware chunking | Financial sections fragmented | 🟡 MEDIUM |

---
## Baseline Measurements & Instrumentation

Before changing the pipeline, capture hard numbers so we can prove the fix works and spot remaining gaps. Add the following telemetry to the processing result (also referenced in `IMPLEMENTATION_PLAN.md`):

```typescript
metadata: {
  tablesFound: structuredTables.length,
  financialTablesIdentified: structuredTables.filter(isFinancialTable).length,
  structuredParsingUsed: Boolean(deterministicFinancialsFromTables),
  textParsingFallback: !deterministicFinancialsFromTables,
  financialDataPopulated: hasPopulatedFinancialSummary(result)
}
```

**Baseline checklist (run on ≥20 recent CIM uploads):**

1. Count how many documents have `tablesFound > 0` but `financialDataPopulated === false`.
2. Record the average/median `tablesFound`, `financialTablesIdentified`, and current financial fill rate.
3. Log sample `documentId`s where `tablesFound === 0` (helps scope Phase 3 hybrid work).

Paste the aggregated numbers back into this doc so the Success Metrics are grounded in actual data rather than estimates.

---
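The checklist aggregation could be sketched as below, assuming a hypothetical `TelemetryRecord` shape that mirrors the metadata block above (the field names are the same; the record type itself does not yet exist):

```typescript
// Hypothetical sketch of the baseline aggregation; TelemetryRecord mirrors
// the metadata block above but is an assumed shape, not existing code.
interface TelemetryRecord {
  documentId: string;
  tablesFound: number;
  financialTablesIdentified: number;
  financialDataPopulated: boolean;
}

function summarizeBaseline(records: TelemetryRecord[]) {
  // Checklist item 1: tables were found, but no financials were extracted
  const tablesButNoData = records.filter(
    r => r.tablesFound > 0 && !r.financialDataPopulated
  );

  // Checklist item 3: candidates for the hybrid (no-structured-tables) path
  const noTables = records.filter(r => r.tablesFound === 0);

  const avg = (xs: number[]) =>
    xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;

  return {
    total: records.length,
    tablesButNoDataCount: tablesButNoData.length,
    avgTablesFound: avg(records.map(r => r.tablesFound)),
    fillRate: avg(records.map(r => (r.financialDataPopulated ? 1 : 0))),
    noTableDocumentIds: noTables.map(r => r.documentId)
  };
}
```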
## Recommended Solution Architecture

### Phase 1: Use Document AI Table Structure (HIGHEST IMPACT)

**Implementation:**
```typescript
// NEW: documentAiProcessor.ts
interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: { page: number; confidence: number };
}

private extractStructuredTables(document: any, text: string): StructuredTable[] {
  const tables: StructuredTable[] = [];

  for (const page of document.pages || []) {
    for (const table of page.tables || []) {
      // Extract headers
      const headers = table.headerRows?.[0]?.cells?.map(cell =>
        this.getTextFromLayout(cell.layout, text)
      ) || [];

      // Extract data rows
      const rows = table.bodyRows?.map(row =>
        row.cells.map(cell => this.getTextFromLayout(cell.layout, text))
      ) || [];

      tables.push({ headers, rows, position: { page: page.pageNumber, confidence: 0.9 } });
    }
  }

  return tables;
}

private getTextFromLayout(layout: any, documentText: string): string {
  const segments = layout.textAnchor?.textSegments || [];
  if (segments.length === 0) return '';

  const start = parseInt(segments[0].startIndex || '0');
  const end = parseInt(segments[0].endIndex || documentText.length.toString());

  return documentText.substring(start, end).trim();
}
```

**Return Enhanced Output:**
```typescript
interface DocumentAIOutput {
  text: string;
  entities: Array<any>;
  tables: StructuredTable[]; // ✅ Now usable!
  pages: Array<any>;
  mimeType: string;
}
```
### Phase 2: Financial Table Classifier

**Purpose**: Identify which tables are financial data

```typescript
// NEW: services/financialTableClassifier.ts
export function isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const firstRowText = table.rows[0]?.join(' ').toLowerCase() || '';

  // Check for year/period indicators
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd/.test(headerText);

  // Check for financial metrics (flatten first — each entry in rows is a string[])
  const hasMetrics = /(revenue|ebitda|sales|profit|margin|cash flow)/i.test(
    table.rows.slice(0, 5).flat().join(' ')
  );

  // Check for currency values
  const hasCurrency = /\$[\d,]+|\d+[km]|\d+\.\d+%/.test(firstRowText);

  return hasPeriods && (hasMetrics || hasCurrency);
}
```
### Phase 3: Enhanced Financial Parser

**Use structured tables instead of flat text:**

```typescript
// UPDATED: financialTableParser.ts
export function parseFinancialsFromStructuredTable(
  table: StructuredTable
): ParsedFinancials {
  const result: ParsedFinancials = { fy3: {}, fy2: {}, fy1: {}, ltm: {} };

  // 1. Parse headers to identify periods
  const buckets = yearTokensToBuckets(
    table.headers.map(h => normalizePeriodToken(h))
  );

  // 2. For each row, identify the metric
  for (const row of table.rows) {
    const metricName = row[0].toLowerCase();
    const values = row.slice(1); // Skip first column (metric name)

    // 3. Match metric to field
    for (const [field, matcher] of Object.entries(ROW_MATCHERS)) {
      if (matcher.test(metricName)) {
        // 4. Assign values to buckets (GUARANTEED ALIGNMENT!)
        buckets.forEach((bucket, index) => {
          if (bucket && values[index]) {
            result[bucket][field] = values[index];
          }
        });
      }
    }
  }

  return result;
}
```

**Key Improvement**: Column alignment is **guaranteed** because:
- Headers and values come from the same table structure
- Index positions are preserved
- No string parsing or whitespace guessing needed
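A self-contained illustration of the index-alignment guarantee, using simplified stand-ins for `yearTokensToBuckets` and `ROW_MATCHERS` (the real helpers in `financialTableParser.ts` are more thorough):

```typescript
// Simplified stand-in for the period-bucket mapping.
type Bucket = "fy3" | "fy2" | "fy1" | "ltm" | null;

const toBucket = (header: string): Bucket => {
  const h = header.toLowerCase();
  if (h.includes("fy-3")) return "fy3";
  if (h.includes("fy-2")) return "fy2";
  if (h.includes("fy-1")) return "fy1";
  if (h.includes("ltm")) return "ltm";
  return null; // e.g. the metric-name column
};

// Sample table from this document, already in structured form.
const headers = ["Metric", "FY-3", "FY-2", "FY-1", "LTM"];
const revenueRow = ["Revenue", "$45.2M", "$52.8M", "$61.2M", "$58.5M"];

// Map each header (minus the metric column) to a bucket, then zip by index.
const buckets = headers.slice(1).map(toBucket);
const values = revenueRow.slice(1);

const revenue: Partial<Record<Exclude<Bucket, null>, string>> = {};
buckets.forEach((bucket, i) => {
  if (bucket && values[i]) revenue[bucket] = values[i];
});

console.log(revenue);
// { fy3: '$45.2M', fy2: '$52.8M', fy1: '$61.2M', ltm: '$58.5M' }
```

Because headers and values are indexed within the same table object, the zip by index can never mis-assign a value to the wrong period.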
### Phase 4: Table-Aware Chunking

**Store financial tables as special chunks:**

```typescript
// UPDATED: optimizedAgenticRAGProcessor.ts
private async createIntelligentChunks(
  text: string,
  documentId: string,
  tables: StructuredTable[]
): Promise<ProcessingChunk[]> {
  const chunks: ProcessingChunk[] = [];

  // 1. Create dedicated chunks for financial tables
  for (const table of tables.filter(isFinancialTable)) {
    chunks.push({
      id: `${documentId}-financial-table-${chunks.length}`,
      content: this.formatTableAsMarkdown(table),
      chunkIndex: chunks.length,
      sectionType: 'financial-table',
      metadata: {
        isFinancialTable: true,
        tablePosition: table.position,
        structuredData: table // ✅ Preserve structure!
      }
    });
  }

  // 2. Continue with normal text chunking
  // ...
}

private formatTableAsMarkdown(table: StructuredTable): string {
  const header = `| ${table.headers.join(' | ')} |`;
  const separator = `| ${table.headers.map(() => '---').join(' | ')} |`;
  const rows = table.rows.map(row => `| ${row.join(' | ')} |`);

  return [header, separator, ...rows].join('\n');
}
```
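As a sanity check, a standalone version of the formatter (stripped of the class context) applied to a slice of the sample table shows exactly what the LLM will see:

```typescript
// Standalone copy of the markdown formatter for verification.
function formatTableAsMarkdown(headers: string[], rows: string[][]): string {
  const header = `| ${headers.join(' | ')} |`;
  const separator = `| ${headers.map(() => '---').join(' | ')} |`;
  const body = rows.map(row => `| ${row.join(' | ')} |`);
  return [header, separator, ...body].join('\n');
}

const md = formatTableAsMarkdown(
  ["Metric", "FY-3", "FY-2"],
  [["Revenue", "$45.2M", "$52.8M"]]
);
console.log(md);
// | Metric | FY-3 | FY-2 |
// | --- | --- | --- |
// | Revenue | $45.2M | $52.8M |
```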
### Phase 5: Priority Pinning for Financial Chunks

**Ensure financial tables are always included in the LLM context:**

```typescript
// UPDATED: optimizedAgenticRAGProcessor.ts
private async extractPass1CombinedMetadataFinancial() {
  // 1. Find all financial table chunks
  const financialTableChunks = chunks.filter(
    c => c.metadata?.isFinancialTable === true
  );

  // 2. PIN them to always be included
  return await this.extractWithTargetedQuery(
    documentId,
    text,
    chunks,
    query,
    targetFields,
    7,
    financialTableChunks // ✅ Always included!
  );
}
```

---
## Implementation Phases & Priorities

### Phase 1: Quick Win (1-2 hours) - RECOMMENDED START
**Goal**: Use Document AI tables immediately (matches `IMPLEMENTATION_PLAN.md` Phase 1)

**Planned changes:**
1. Extract structured tables in `documentAiProcessor.ts`.
2. Pass tables (and metadata) to `optimizedAgenticRAGProcessor`.
3. Emit dedicated financial-table chunks that preserve structure.
4. Pin financial chunks so every RAG/LLM pass sees them.

**Expected Improvement**: 60-70% accuracy gain (verify via new instrumentation).

### Phase 2: Enhanced Parsing (2-3 hours)
**Goal**: Deterministic extraction from structured tables before falling back to text (see `IMPLEMENTATION_PLAN.md` Phase 2).

**Planned changes:**
1. Implement `parseFinancialsFromStructuredTable()` and reuse existing deterministic merge paths.
2. Add a classifier that flags which structured tables are financial.
3. Update merge logic to favor structured data yet keep the text/LLM fallback.

**Expected Improvement**: 85-90% accuracy (subject to measured baseline).

### Phase 3: LLM Optimization (1-2 hours)
**Goal**: Better context for the LLM when tables are incomplete or absent (aligns with `HYBRID_SOLUTION.md` Phase 2/3).

**Planned changes:**
1. Format tables as markdown and raise chunk limits for financial passes.
2. Prioritize and pin financial chunks in `extractPass1CombinedMetadataFinancial`.
3. Inject explicit "find the table" instructions into the prompt.

**Expected Improvement**: 90-95% accuracy when Document AI tables exist; otherwise falls back to the hybrid regex/LLM path.

### Phase 4: Integration & Testing (2-3 hours)
**Goal**: Ensure backward compatibility and document measured improvements

**Planned changes:**
1. Keep the legacy text parser as a fallback whenever `tablesFound === 0`.
2. Capture the telemetry outlined earlier and publish before/after numbers.
3. Test against a labeled CIM set covering: clean tables, multi-line rows, scanned PDFs (no structured tables), and partial data cases.

---
### Handling Documents With No Structured Tables

Even after Phases 1-2, some CIMs (e.g., scans or image-only tables) will have `tablesFound === 0`. When that happens:

1. Trigger the enhanced preprocessing + regex route from `HYBRID_SOLUTION.md` (Phase 1).
2. Surface an explicit warning in metadata/logs so analysts know the deterministic path was skipped.
3. Feed the isolated table text (if any) plus surrounding context into the LLM with the financial prompt upgrades from Phase 3.

This ensures the hybrid approach only engages when the Document AI path truly lacks structured tables, keeping maintenance manageable while covering the remaining gap.

---
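This routing guard can be sketched as follows; `chooseFinancialPath`, `ExtractionContext`, and the warning text are hypothetical names for illustration, not existing code:

```typescript
// Hypothetical sketch of the fallback routing between the deterministic
// structured-table path and the hybrid regex + LLM path.
interface ExtractionContext {
  tablesFound: number;
  warnings: string[];
}

function chooseFinancialPath(ctx: ExtractionContext): "structured" | "hybrid" {
  if (ctx.tablesFound > 0) {
    // Deterministic path: parse Document AI's structured tables first
    return "structured";
  }
  // No structured tables (e.g., scanned/image-only PDF): flag it loudly,
  // then fall back to the regex + LLM hybrid route.
  ctx.warnings.push(
    "No structured tables from Document AI; deterministic path skipped"
  );
  return "hybrid";
}

const ctx: ExtractionContext = { tablesFound: 0, warnings: [] };
console.log(chooseFinancialPath(ctx)); // "hybrid"
```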
## Success Metrics

| Metric | Current | Phase 1 | Phase 2 | Phase 3 |
|--------|---------|---------|---------|---------|
| Financial data extracted | 10-20% | 60-70% | 85-90% | 90-95% |
| Tables identified | 0% | 80% | 90% | 95% |
| Column alignment accuracy | 10% | 95% | 98% | 99% |
| Processing time | 45s | 42s | 38s | 35s |

---
## Code Quality Improvements

### Current Issues:
1. ❌ Document AI tables extracted but never used
2. ❌ `financialExtractor.ts` exists but never imported
3. ❌ Parser assumes flat text has structure
4. ❌ No table-specific chunking strategy

### After Implementation:
1. ✅ Full use of Document AI's structured data
2. ✅ Multi-tier extraction strategy (structured → fallback → LLM)
3. ✅ Table-aware chunking and RAG
4. ✅ Guaranteed column alignment
5. ✅ Better error handling and logging

---
## Alternative Approaches Considered

### Option 1: Better Regex Parsing (REJECTED)
**Reason**: Can't solve the fundamental problem of lost structure

### Option 2: Use Only LLM (REJECTED)
**Reason**: Expensive, slower, and less accurate than structured extraction

### Option 3: Replace Document AI (REJECTED)
**Reason**: Document AI works fine; we're just not using it properly

### Option 4: Manual Table Markup (REJECTED)
**Reason**: Not scalable; requires user intervention

---
## Conclusion

The issue is **NOT** a parsing problem or an LLM problem.

The issue is an **architecture problem**: we're extracting structured tables from Document AI and then **throwing away the structure**.

**The fix is simple**: use the data we're already getting.

**Recommended action**: Implement Phase 1 (Quick Win) immediately for a 60-70% improvement, then evaluate whether Phases 2-3 are needed based on results.