Add single-pass CIM processor: 2 LLM calls, ~2.5 min processing

New processing strategy `single_pass_quality_check` replaces the multi-pass
agentic RAG pipeline (15-25 min) with a streamlined 2-call approach:

1. Full-document LLM extraction (Sonnet): single call with complete CIM text
2. Delta quality-check (Haiku): reviews extraction, returns only corrections

Key changes:
- New singlePassProcessor.ts with extraction + quality check flow
- llmService: qualityCheckCIMDocument() with delta-only corrections array
- llmService: improved prompt requiring professional inferences for
  qualitative fields instead of defaulting to "Not specified in CIM"
- Removed deterministic financial parser from single-pass flow (the LLM
  outperforms it; the parser matched footnotes and narrative text as financials)
- Default strategy changed to single_pass_quality_check
- Completeness scoring with diagnostic logging of empty fields

Tested on 2 real CIMs: 100% completeness, correct financials, ~150s each.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
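
The TypeScript side of this change is not shown in this view, so the following is only a minimal sketch of how a delta-only corrections array and a completeness score might fit together. The names `Extraction`, `Correction`, `applyCorrections`, and `scoreCompleteness` are hypothetical, as is the assumption that the extraction is a flat map of string fields; the real shapes in singlePassProcessor.ts and llmService may differ.

```typescript
// Hypothetical shapes: a flat field map and a list of field-level corrections.
type Extraction = Record<string, string | null>;

interface Correction {
  field: string; // name of the field to overwrite
  value: string; // corrected value proposed by the quality-check call
}

// Placeholder string quoted in the commit message.
const PLACEHOLDER = "Not specified in CIM";

// Apply only the delta; fields without a correction keep the extracted value.
// Returns a new object and leaves the input untouched.
function applyCorrections(extraction: Extraction, corrections: Correction[]): Extraction {
  const merged: Extraction = { ...extraction };
  for (const { field, value } of corrections) {
    merged[field] = value;
  }
  return merged;
}

// A field counts as empty when it is missing, null, blank, or still the placeholder.
function isEmptyField(value: string | null | undefined): boolean {
  return value === null || value === undefined || value.trim() === "" || value.trim() === PLACEHOLDER;
}

// Completeness as the rounded percentage of non-empty fields,
// plus the names of the empty fields for diagnostic logging.
function scoreCompleteness(extraction: Extraction): { score: number; emptyFields: string[] } {
  const fields = Object.keys(extraction);
  const emptyFields = fields.filter((name) => isEmptyField(extraction[name]));
  const filled = fields.length - emptyFields.length;
  const score = fields.length === 0 ? 0 : Math.round((filled / fields.length) * 100);
  return { score, emptyFields };
}
```

Returning only corrections keeps the second call's output small, and merging them on the caller's side means an empty corrections array leaves the first-pass extraction unchanged.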
@@ -1,12 +1,12 @@
--- Fix vector search timeout by adding document_id filtering and optimizing the query
--- This prevents searching across all documents and only searches within a specific document
+-- Fix vector search timeout by pre-filtering on document_id BEFORE vector search
+-- When document_id is provided, this avoids the full IVFFlat index scan (26K+ rows)
+-- and instead computes distances on only ~80 chunks per document.
 
--- Drop the old function (handle all possible signatures)
+-- Drop old function signatures
 DROP FUNCTION IF EXISTS match_document_chunks(vector(1536), float, int);
 DROP FUNCTION IF EXISTS match_document_chunks(vector(1536), float, int, text);
 
--- Create optimized function with document_id filtering
--- document_id is TEXT (varchar) in the actual schema
+-- Create optimized function that branches based on whether document_id is provided
 CREATE OR REPLACE FUNCTION match_document_chunks (
   query_embedding vector(1536),
   match_threshold float,
@@ -15,29 +15,51 @@ CREATE OR REPLACE FUNCTION match_document_chunks (
 )
 RETURNS TABLE (
   id UUID,
-  document_id TEXT,
+  document_id VARCHAR(255),
   content text,
   metadata JSONB,
   chunk_index INT,
   similarity float
 )
-LANGUAGE sql STABLE
+LANGUAGE plpgsql STABLE
 AS $$
-  SELECT
-    document_chunks.id,
-    document_chunks.document_id,
-    document_chunks.content,
-    document_chunks.metadata,
-    document_chunks.chunk_index,
-    1 - (document_chunks.embedding <=> query_embedding) AS similarity
-  FROM document_chunks
-  WHERE document_chunks.embedding IS NOT NULL
-    AND (filter_document_id IS NULL OR document_chunks.document_id = filter_document_id)
-    AND 1 - (document_chunks.embedding <=> query_embedding) > match_threshold
-  ORDER BY document_chunks.embedding <=> query_embedding
-  LIMIT match_count;
+BEGIN
+  IF filter_document_id IS NOT NULL THEN
+    -- FAST PATH: Pre-filter by document_id using btree index, then compute
+    -- vector distances on only that document's chunks (~80 rows).
+    -- This completely bypasses the IVFFlat index scan.
+    RETURN QUERY
+    SELECT
+      dc.id,
+      dc.document_id,
+      dc.content,
+      dc.metadata,
+      dc.chunk_index,
+      1 - (dc.embedding <=> query_embedding) AS similarity
+    FROM document_chunks dc
+    WHERE dc.document_id = filter_document_id
+      AND dc.embedding IS NOT NULL
+      AND 1 - (dc.embedding <=> query_embedding) > match_threshold
+    ORDER BY dc.embedding <=> query_embedding
+    LIMIT match_count;
+  ELSE
+    -- SLOW PATH: Search across all documents using IVFFlat index.
+    -- Only used when no document_id filter is provided.
+    RETURN QUERY
+    SELECT
+      dc.id,
+      dc.document_id,
+      dc.content,
+      dc.metadata,
+      dc.chunk_index,
+      1 - (dc.embedding <=> query_embedding) AS similarity
+    FROM document_chunks dc
+    WHERE dc.embedding IS NOT NULL
+      AND 1 - (dc.embedding <=> query_embedding) > match_threshold
+    ORDER BY dc.embedding <=> query_embedding
+    LIMIT match_count;
+  END IF;
+END;
 $$;
 
--- Add comment explaining the optimization
-COMMENT ON FUNCTION match_document_chunks IS 'Optimized vector search that filters by document_id first to prevent timeouts. Always pass filter_document_id when searching within a specific document.';
-
+COMMENT ON FUNCTION match_document_chunks IS 'Vector search with fast document-scoped path. When filter_document_id is provided, uses btree index to pre-filter (~80 rows) instead of scanning the full IVFFlat index (26K+ rows).';
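
To make the branching in match_document_chunks easier to follow, here is a small in-memory TypeScript sketch of the same selection logic: optionally narrow to one document, skip rows without an embedding, keep rows whose cosine similarity exceeds the threshold, order by similarity, and cap the result count. `Chunk`, `Match`, `cosineSimilarity`, and `matchDocumentChunks` are hypothetical names used only for illustration. The sketch always computes exact similarities, so it does not reproduce the approximate IVFFlat behaviour of the unfiltered SQL path.

```typescript
// Hypothetical in-memory stand-in for a document_chunks row.
interface Chunk {
  id: string;
  documentId: string;
  content: string;
  embedding: number[] | null;
}

interface Match {
  id: string;
  documentId: string;
  content: string;
  similarity: number;
}

// Cosine similarity; the SQL computes the same quantity as 1 - (embedding <=> query_embedding).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function matchDocumentChunks(
  chunks: Chunk[],
  queryEmbedding: number[],
  matchThreshold: number,
  matchCount: number,
  filterDocumentId: string | null = null
): Match[] {
  // Analogue of the fast path: narrow to one document before computing any distance.
  const candidates =
    filterDocumentId !== null
      ? chunks.filter((chunk) => chunk.documentId === filterDocumentId)
      : chunks;

  const matches: Match[] = [];
  for (const chunk of candidates) {
    if (chunk.embedding === null) continue; // embedding IS NOT NULL
    const similarity = cosineSimilarity(chunk.embedding, queryEmbedding);
    if (similarity > matchThreshold) {
      matches.push({
        id: chunk.id,
        documentId: chunk.documentId,
        content: chunk.content,
        similarity,
      });
    }
  }

  // Highest similarity first, which matches ORDER BY distance ascending.
  matches.sort((left, right) => right.similarity - left.similarity);
  return matches.slice(0, matchCount); // LIMIT match_count
}
```

With roughly 80 chunks per document, the filtered branch computes distances for only a small slice of the 26K+ rows, which is the effect the migration relies on.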