## Best Practices for Debugging with Cursor: Becoming a Senior Developer-Level Debugger

Transform Cursor into an elite debugging partner with these comprehensive strategies, workflow optimizations, and hidden power features that professional developers use to maximize productivity.

### Core Debugging Philosophy: Test-Driven Development with AI

**Write Tests First, Always**

The single most effective debugging strategy is pairing Test-Driven Development (TDD) with Cursor. It gives you verifiable proof that code works before deployment[^1][^2][^3].

**Workflow:**

- Start with: "Write tests first, then the code, then run the tests and update the code until tests pass"[^1]
- Enable YOLO mode (Settings → scroll down → enable YOLO mode) so Cursor can automatically run tests and build commands, iterating until they pass[^1][^4]
- Let the AI cycle through test failures autonomously—it will fix lint errors and test failures without manual intervention[^1][^5]

**YOLO Mode Configuration:**

Add this prompt to the YOLO settings:

```
any kind of tests are always allowed like vitest, npm test, nr test, etc. also basic build commands like build, tsc, etc. creating files and making directories (like touch, mkdir, etc) is always ok too
```

This enables autonomous iteration on builds and tests[^1][^4].

### Advanced Debugging Techniques

**1. Log-Driven Debugging Workflow**

When facing persistent bugs, use this iterative logging approach[^1][^6]:

- Tell Cursor: "Please add logs to the code to get better visibility into what is going on so we can find the fix. I'll run the code and feed you the log results"[^1]
- Run your code and collect the log output
- Paste the raw logs back into Cursor: "Here's the log output. What do you now think is causing the issue? And how do we fix it?"[^1]
- Cursor will propose targeted fixes based on actual runtime behavior

**For Firebase Projects:**

Use the logger SDK with proper severity levels[^7]:

```javascript
const logger = require("firebase-functions/logger");

// Log with structured data
logger.error("API call failed", {
  endpoint: endpoint,
  statusCode: response.status,
  userId: userId
});
```
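The "add logs for visibility" step above can be sketched as a small wrapper that instruments a suspect function without changing its behavior. This is a hedged, self-contained example: `withLogging` and `parsePort` are illustrative names invented here, not Cursor or Firebase APIs:

```javascript
// with-logging.js — wrap a suspect function so each call logs its inputs,
// output, and failures; paste the resulting output back into Cursor.
function withLogging(name, fn) {
  return (...args) => {
    console.log(`[${name}] called with`, JSON.stringify(args));
    try {
      const result = fn(...args);
      console.log(`[${name}] returned`, JSON.stringify(result));
      return result;
    } catch (err) {
      console.error(`[${name}] threw`, err.message);
      throw err; // rethrow so behavior is unchanged apart from the logs
    }
  };
}

// Hypothetical suspect function, for illustration only.
const parsePort = withLogging("parsePort", (value) => {
  const port = Number(value);
  if (!Number.isInteger(port) || port <= 0) throw new Error(`bad port: ${value}`);
  return port;
});

parsePort("8080"); // logs the arguments and the returned value
```

The point is that the logs capture real runtime values, so Cursor reasons from evidence rather than guessing from source alone.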

**2. Autonomous Workflow with Plan-Approve-Execute Pattern**

Use Cursor in Project Manager mode for complex debugging tasks[^5][^8]:

**Setup `.cursorrules` file:**

```
You are working with me as PM/Technical Approver while you act as developer.
- Work from PRD file one item at a time
- Generate detailed story file outlining approach
- Wait for approval before executing
- Use TDD for implementation
- Update story with progress after completion
```

**Workflow:**

- Agent creates a story file breaking down the fix in detail
- You review and approve the approach
- Agent executes using TDD
- Agent runs tests until all pass
- Agent pushes changes with a clear commit message[^5][^8]

This prevents the AI from going off-track and ensures deliberate, verifiable fixes.

### Context Management Mastery

**3. Strategic Use of @ Symbols**

Master these context references for precise debugging[^9][^10]:

- `@Files` - Reference specific files
- `@Folders` - Include entire directories
- `@Code` - Reference specific functions/classes
- `@Docs` - Pull in library documentation (add libraries via Settings → Cursor Settings → Docs)[^4][^9]
- `@Web` - Search current information online
- `@Codebase` - Search the entire codebase (Chat only)
- `@Lint Errors` - Reference current lint errors (Chat only)[^9]
- `@Git` - Access git history and recent changes
- `@Recent Changes` - View recent modifications

**Pro tip:** Stack multiple @ symbols in one prompt for comprehensive context[^9].

**4. Reference Open Editors Strategy**

Keep the AI focused by managing context deliberately[^11]:

- Close all irrelevant tabs
- Open only files related to the current debugging task
- Use `@` to reference open editors
- This prevents the AI from getting confused by unrelated code[^11]

**5. Context7 MCP for Up-to-Date Documentation**

Integrate Context7 MCP to eliminate outdated API suggestions[^12][^13][^14]:

**Installation:**

```json
// ~/.cursor/mcp.json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp@latest"]
    }
  }
}
```

**Usage:**

```
use context7 for latest documentation on [library name]
```

Add to your cursor rules:

```
When referencing documentation for any library, use the context7 MCP server for lookups to ensure up-to-date information
```

### Power Tools and Integrations

**6. Browser Tools MCP for Live Debugging**

Debug live applications by connecting Cursor directly to your browser[^15][^16]:

**Setup:**

1. Clone the browser-tools-mcp repository
2. Install the Chrome extension
3. Configure MCP in Cursor settings:

```json
{
  "mcpServers": {
    "browser-tools": {
      "command": "node",
      "args": ["/path/to/browser-tools-mcp/server.js"]
    }
  }
}
```

4. Run the server: `npm start`

**Features:**

- "Investigate what happens when users click the pay button and resolve any JavaScript errors"
- "Summarize these console logs and identify recurring errors"
- "Which API calls are failing?"
- Automatically captures screenshots, console logs, network requests, and DOM state[^15][^16]

**7. Sequential Thinking MCP for Complex Problems**

For intricate debugging requiring multi-step reasoning[^17][^18][^19]:

**Installation:**

```json
{
  "mcpServers": {
    "sequential-thinking": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"]
    }
  }
}
```

**When to use:**

- Breaking down complex bugs into manageable steps
- Problems where the full scope isn't clear initially
- Analysis that might need course correction
- Maintaining context over multiple debugging steps[^17]

Add to cursor rules:

```
Use Sequential thinking for complex reflections and multi-step debugging
```

**8. Firebase Crashlytics MCP Integration**

Connect Crashlytics directly to Cursor for AI-powered crash analysis[^20][^21]:

**Setup:**

1. Enable BigQuery export in Firebase Console → Project Settings → Integrations
2. Generate a Firebase service account JSON key
3. Configure MCP:

```json
{
  "mcpServers": {
    "crashlytics": {
      "command": "node",
      "args": ["/path/to/mcp-crashlytics-server/dist/index.js"],
      "env": {
        "GOOGLE_SERVICE_ACCOUNT_KEY": "/path/to/service-account.json",
        "BIGQUERY_PROJECT_ID": "your-project-id",
        "BIGQUERY_DATASET_ID": "firebase_crashlytics"
      }
    }
  }
}
```

**Usage:**

- "Fetch the latest Crashlytics issues for my project"
- "Add a note to issue xyz summarizing the investigation"
- Use the `crashlytics:connect` command for a structured debugging flow[^20][^21]

### Cursor Rules & Configuration

**9. Master .cursorrules Files**

Create powerful project-specific rules[^22][^23][^24]:

**Structure:**

```markdown
# Project Overview
[High-level description of what you're building]

# Tech Stack
- Framework: [e.g., Next.js 14]
- Language: TypeScript (strict mode)
- Database: [e.g., PostgreSQL with Prisma]

# Critical Rules
- Always use strict TypeScript - never use `any`
- Never modify files without explicit approval
- Always read relevant files before making changes
- Log all exceptions in catch blocks using Crashlytics

# Deprecated Patterns (DO NOT USE)
- Old API: `oldMethod()` ❌
- Use instead: `newMethod()` ✅

# Common Bugs to Document
[Add bugs you encounter here so they don't recur]
```

**Pro Tips:**

- Document bugs you encounter in .cursorrules so the AI avoids them in future[^23]
- Use cursor.directory for template examples[^11][^23]
- Stack multiple rule files: global rules + project-specific + feature-specific[^24]
- Use the `.cursor/rules` directory for organized rule management[^24][^25]

**10. Global Rules Configuration**

Set personal coding standards in Settings → Rules for AI[^11][^4]:

```
- Always prefer strict types over any in TypeScript
- Ensure answers are brief and to the point
- Propose alternative solutions when stuck
- Skip unnecessary elaborations
- Emphasize technical specifics over general advice
- Always examine relevant files before taking action
```

**11. Notepads for Reusable Context**

Use Notepads to store debugging patterns and common fixes[^11][^26][^27][^28]:

**Create notepads for:**

- Common error patterns and solutions
- Debugging checklists for specific features
- File references for complex features
- Standard prompts like "code review" or "vulnerability search"

**Usage:**

Reference notepads in prompts to quickly load debugging context without retyping[^27][^28].

### Keyboard Shortcuts for Speed

**Essential Debugging Shortcuts**[^29][^30][^31]:

**Core AI Commands:**

- `Cmd/Ctrl + K` - Inline editing (fastest for quick fixes)[^1][^32][^30]
- `Cmd/Ctrl + L` - Open AI chat[^30][^31]
- `Cmd/Ctrl + I` - Open Composer[^30]
- `Cmd/Ctrl + Shift + I` - Full-screen Composer[^30]

**When to use what:**

- Use `Cmd+K` for fast, localized changes to selected code[^1][^32]
- Use `Cmd+L` for questions and explanations[^31]
- Use `Cmd+I` (Composer) for multi-file changes and complex refactors[^32][^4]

**Navigation:**

- `Cmd/Ctrl + P` - Quick file open[^29][^33]
- `Cmd/Ctrl + Shift + O` - Go to symbol in file[^33]
- `Ctrl + G` - Go to line (for stack traces)[^33]
- `F12` - Go to definition[^29]

**Terminal:**

- ``Cmd/Ctrl + ` `` - Toggle terminal[^29][^30]
- `Cmd + K` in the terminal - Clear terminal (note: may need a custom keybinding)[^34][^35]

### Advanced Workflow Strategies

**12. Agent Mode with Plan Mode**

Use Plan Mode for complex debugging[^36][^37]:

1. Hit `Cmd+N` for a new chat
2. Press `Shift+Tab` to toggle Plan Mode
3. Describe the bug or feature
4. The agent researches the codebase and creates a detailed plan
5. Review and approve before implementation

**Agent mode benefits:**

- Autonomous exploration of the codebase
- Edits multiple files
- Runs commands automatically
- Fixes errors iteratively[^37][^38]

**13. Composer Agent Mode Best Practices**

For large-scale debugging and refactoring[^39][^5][^4]:

**Setup:**

- Always use Agent mode (toggle in Composer)
- Enable YOLO mode for autonomous execution[^5][^4]
- Start with clear, detailed problem descriptions

**Workflow:**

1. Describe the complete bug context in detail
2. Let the Agent plan the approach
3. The Agent will:
   - Pull relevant files automatically
   - Run terminal commands as needed
   - Iterate on test failures
   - Fix linting errors autonomously[^4]

**Recovery strategies:**

- If the Agent goes off-track, hit stop immediately
- Say: "Wait, you're way off track here. Reset, recalibrate"[^1]
- Use Composer history to restore checkpoints[^40][^41]

**14. Index Management**

Keep your codebase index fresh[^11]:

**Manual resync:**

Settings → Cursor Settings → Resync Index

**Why this matters:**

- An outdated index causes incorrect suggestions
- The AI may reference deleted files
- Resyncing prevents hallucinations about code structure[^11]

**15. Error Pattern Recognition**

Watch for these warning signs and intervene[^1][^42]:

- The AI repeatedly apologizing
- The same error occurring 3+ times
- Complexity escalating unexpectedly
- The AI asking the same diagnostic questions repeatedly

**When you see these:**

- Stop the current chat
- Start a fresh conversation with better context
- Add specific constraints to prevent loops
- Use "explain your thinking" to understand the AI's logic[^42]

### Firebase-Specific Debugging

**16. Firebase Logging Best Practices**

Structure logs for effective debugging[^7][^43]:

**Severity levels:**

```javascript
logger.debug("Detailed diagnostic info")
logger.info("Normal operations")
logger.warn("Warning conditions")
logger.error("Error conditions", { context: details })
logger.write({ severity: "EMERGENCY", message: "Critical failure" })
```

**Add context:**

```javascript
// Tag user IDs for filtering
Crashlytics.setUserIdentifier(userId)

// Log exceptions with context
Crashlytics.logException(error)
Crashlytics.log(priority, tag, message)
```

**View logs:**

- Firebase Console → Functions → Logs
- Cloud Logging for advanced filtering
- Filter by severity, user ID, version[^43]

**17. Version and User Tagging**

Enable precise debugging of production issues[^43]:

```javascript
// Set version
Crashlytics.setCustomKey("app_version", "1.2.3")

// Set user identifier
Crashlytics.setUserIdentifier(userId)

// Add custom context
Crashlytics.setCustomKey("feature_flag", "beta_enabled")
```

Filter crashes in the Firebase Console by version and user to isolate issues.

### Meta-Strategies

**18. Minimize Context Pollution**

**Project-level tactics:**

- Use `.cursorignore` (similar to `.gitignore`) to exclude unnecessary files[^44]
- Keep only relevant documentation indexed[^4]
- Close unrelated editor tabs before asking questions[^11]

**19. Commit Often**

Let Cursor handle commits[^40]:

```
Push all changes, update story with progress, write clear commit message, and push to remote
```

This creates restoration points if debugging goes sideways.

**20. Multi-Model Strategy**

Don't rely on one model[^4][^45]:

- Use Claude 3.5 Sonnet for complex reasoning and file generation[^5][^8]
- Try different models if stuck
- Some tasks work better with specific models

**21. Break Down Complex Debugging**

When debugging fails repeatedly[^39][^40]:

- Break the problem into the smallest possible sub-tasks
- Start new chats for discrete issues
- Ask the AI to explain its approach before implementing
- Use sequential prompts rather than one massive request

### Troubleshooting Cursor Itself

**When Cursor Misbehaves:**

**Context loss issues:**[^46][^47][^48]

- Check for .mdc glob attachment issues in settings
- Disable workbench/editor auto-attachment if it causes crashes[^46]
- Start a new chat if context becomes corrupted[^48]

**Agent loops:**[^47]

- Stop immediately when looping is detected
- Provide explicit, numbered steps
- Use a "complete step 1, then stop and report" approach
- Restart with clearer constraints

**Rule conflicts:**[^49][^46]

- User rules may not apply automatically - use a project .cursorrules file instead[^49]
- Test rules by asking the AI to recite them
- Check that rules are being loaded (the AI should mention them in responses)[^46]

### Ultimate Debugging Checklist

Before starting any debugging session:

**Setup:**

- [ ] Enable YOLO mode
- [ ] Configure .cursorrules with project specifics
- [ ] Resync the codebase index
- [ ] Close irrelevant files
- [ ] Add relevant documentation to Cursor docs

**During Debugging:**

- [ ] Write tests first before fixing
- [ ] Add logging at critical points
- [ ] Use @ symbols to reference exact files
- [ ] Let the Agent run tests autonomously
- [ ] Stop immediately if the AI goes off-track
- [ ] Commit frequently with clear messages

**Advanced Tools (when needed):**

- [ ] Context7 MCP for up-to-date docs
- [ ] Browser Tools MCP for live debugging
- [ ] Sequential Thinking MCP for complex issues
- [ ] Crashlytics MCP for production errors

**Recovery Strategies:**

- [ ] Use Composer checkpoints to restore state
- [ ] Start a new chat with git diff context if lost
- [ ] Ask the AI to recite its instructions to verify context
- [ ] Use Plan Mode to reset the approach

By implementing these strategies systematically, you transform Cursor from a coding assistant into an elite debugging partner that operates at senior developer level. The key is combining AI autonomy (YOLO mode, Agent mode) with human oversight (TDD, plan approval, checkpoints) to create a powerful, verifiable debugging workflow[^1][^5][^8][^4].
[^1]: https://www.builder.io/blog/cursor-tips

[^2]: https://cursorintro.com/insights/Test-Driven-Development-as-a-Framework-for-AI-Assisted-Development

[^3]: https://www.linkedin.com/posts/richardsondx_i-built-tdd-for-cursor-ai-agents-and-its-activity-7330360750995132416-Jt5A

[^4]: https://stack.convex.dev/6-tips-for-improving-your-cursor-composer-and-convex-workflow

[^5]: https://www.reddit.com/r/cursor/comments/1iga00x/refined_workflow_for_cursor_composer_agent_mode/

[^6]: https://www.sidetool.co/post/how-to-use-cursor-for-efficient-code-review-and-debugging/

[^7]: https://firebase.google.com/docs/functions/writing-and-viewing-logs

[^8]: https://forum.cursor.com/t/composer-agent-refined-workflow-detailed-instructions-and-example-repo-for-practice/47180

[^9]: https://learncursor.dev/features/at-symbols

[^10]: https://cursor.com/docs/context/symbols

[^11]: https://www.reddit.com/r/ChatGPTCoding/comments/1hu276s/how_to_use_cursor_more_efficiently/

[^12]: https://dev.to/mehmetakar/context7-mcp-tutorial-3he2

[^13]: https://github.com/upstash/context7

[^14]: https://apidog.com/blog/context7-mcp-server/

[^15]: https://www.reddit.com/r/cursor/comments/1jg0in6/i_cut_my_browser_debugging_time_in_half_using_ai/

[^16]: https://www.youtube.com/watch?v=K5hLY0mytV0

[^17]: https://mcpcursor.com/server/sequential-thinking

[^18]: https://apidog.com/blog/mcp-sequential-thinking/

[^19]: https://skywork.ai/skypage/en/An-AI-Engineer's-Deep-Dive:-Mastering-Complex-Reasoning-with-the-sequential-thinking-MCP-Server-and-Claude-Code/1971471570609172480

[^20]: https://firebase.google.com/docs/crashlytics/ai-assistance-mcp

[^21]: https://lobehub.com/mcp/your-username-mcp-crashlytics-server

[^22]: https://trigger.dev/blog/cursor-rules

[^23]: https://www.youtube.com/watch?v=Vy7dJKv1EpA

[^24]: https://www.reddit.com/r/cursor/comments/1ik06ol/a_guide_to_understand_new_cursorrules_in_045/

[^25]: https://cursor.com/docs/context/rules

[^26]: https://forum.cursor.com/t/enhanced-productivity-persistent-notepads-smart-organization-and-project-integration/60757

[^27]: https://iroidsolutions.com/blog/mastering-cursor-ai-16-golden-tips-for-next-level-productivity

[^28]: https://dev.to/heymarkkop/my-top-cursor-tips-v043-1kcg

[^29]: https://www.dotcursorrules.dev/cheatsheet

[^30]: https://cursor101.com/en/cursor/cheat-sheet

[^31]: https://mehmetbaykar.com/posts/top-15-cursor-shortcuts-to-speed-up-development/

[^32]: https://dev.to/romainsimon/4-tips-for-a-10x-productivity-using-cursor-1n3o

[^33]: https://skywork.ai/blog/vibecoding/cursor-2-0-workflow-tips/

[^34]: https://forum.cursor.com/t/command-k-and-the-terminal/7265

[^35]: https://forum.cursor.com/t/shortcut-conflict-for-cmd-k-terminal-clear-and-ai-window/22693

[^36]: https://www.youtube.com/watch?v=WVeYLlKOWc0

[^37]: https://cursor.com/docs/agent/modes

[^38]: https://forum.cursor.com/t/10-pro-tips-for-working-with-cursor-agent/137212

[^39]: https://ryanocm.substack.com/p/137-10-ways-to-10x-your-cursor-workflow

[^40]: https://forum.cursor.com/t/add-the-best-practices-section-to-the-documentation/129131

[^41]: https://www.nocode.mba/articles/debug-vibe-coding-faster

[^42]: https://www.siddharthbharath.com/coding-with-cursor-beginners-guide/

[^43]: https://www.letsenvision.com/blog/effective-logging-in-production-with-firebase-crashlytics

[^44]: https://www.ellenox.com/post/mastering-cursor-ai-advanced-workflows-and-best-practices

[^45]: https://forum.cursor.com/t/best-practices-setups-for-custom-agents-in-cursor/76725

[^46]: https://www.reddit.com/r/cursor/comments/1jtc9ej/cursors_internal_prompt_and_context_management_is/

[^47]: https://forum.cursor.com/t/endless-loops-and-unrelated-code/122518

[^48]: https://forum.cursor.com/t/auto-injected-summarization-and-loss-of-context/86609

[^49]: https://github.com/cursor/cursor/issues/3706

---

CLEANUP_PLAN.md
# Project Cleanup Plan

## Files Found for Cleanup

### 🗑️ Category 1: SAFE TO DELETE (Backups & Temp Files)

**Backup Files:**

- `backend/.env.backup` (4.1K, Nov 4)
- `backend/.env.backup.20251031_221937` (4.1K, Oct 31)
- `backend/diagnostic-report.json` (1.9K, Oct 31)

**Total Space:** ~10KB

**Action:** DELETE - These are temporary diagnostic/backup files

---

### 📄 Category 2: REDUNDANT DOCUMENTATION (Consider Deleting)

**Analysis Reports (Already in Git History):**

- `CLEANUP_ANALYSIS_REPORT.md` (staged for deletion)
- `CLEANUP_COMPLETION_REPORT.md` (staged for deletion)
- `DOCUMENTATION_AUDIT_REPORT.md` (staged for deletion)
- `DOCUMENTATION_COMPLETION_REPORT.md` (staged for deletion)
- `FRONTEND_DOCUMENTATION_SUMMARY.md` (staged for deletion)
- `LLM_DOCUMENTATION_SUMMARY.md` (staged for deletion)
- `OPERATIONAL_DOCUMENTATION_SUMMARY.md` (staged for deletion)

**Action:** ALREADY STAGED FOR DELETION - Git will handle these

**Duplicate/Outdated Guides:**

- `BETTER_APPROACHES.md` (untracked)
- `DEPLOYMENT_INSTRUCTIONS.md` (untracked) - Duplicate of `DEPLOYMENT_GUIDE.md`?
- `IMPLEMENTATION_GUIDE.md` (untracked)
- `LLM_ANALYSIS.md` (untracked)

**Action:** REVIEW, THEN DELETE if redundant with other docs

---

### 🛠️ Category 3: DIAGNOSTIC SCRIPTS (28 total)

**Keep These (Core Utilities):**

- `check-database-failures.ts` ✅ (used in troubleshooting)
- `check-current-processing.ts` ✅ (monitoring)
- `test-openrouter-simple.ts` ✅ (testing)
- `test-full-llm-pipeline.ts` ✅ (testing)
- `setup-database.ts` ✅ (setup)

**Consider Deleting (One-Time Use):**

- `check-current-job.ts` (redundant with check-current-processing)
- `check-table-schema.ts` (one-time diagnostic)
- `check-third-party-services.ts` (one-time diagnostic)
- `comprehensive-diagnostic.ts` (one-time diagnostic)
- `create-job-direct.ts` (testing helper)
- `create-job-for-stuck-document.ts` (one-time fix)
- `create-test-job.ts` (testing helper)
- `diagnose-processing-issues.ts` (one-time diagnostic)
- `diagnose-upload-issues.ts` (one-time diagnostic)
- `fix-table-schema.ts` (one-time fix)
- `mark-stuck-as-failed.ts` (one-time fix)
- `monitor-document-processing.ts` (redundant)
- `monitor-system.ts` (redundant)
- `setup-gcs-permissions.ts` (one-time setup)
- `setup-processing-jobs-table.ts` (one-time setup)
- `test-gcs-integration.ts` (one-time test)
- `test-job-creation.ts` (testing helper)
- `test-linkage.ts` (one-time test)
- `test-llm-processing-offline.ts` (testing)
- `test-openrouter-quick.ts` (redundant with test-openrouter-simple)
- `test-postgres-connection.ts` (one-time test)
- `test-production-upload.ts` (one-time test)
- `test-staging-environment.ts` (one-time test)

**Action:** ARCHIVE or DELETE ~18-20 scripts

---

### 📁 Category 4: SHELL SCRIPTS & SQL

**Shell Scripts:**

- `backend/scripts/check-document-status.sh` (shell version; a TS version exists)
- `backend/scripts/sync-firebase-config.sh` (one-time use)
- `backend/scripts/sync-firebase-config.ts` (one-time use)
- `backend/scripts/run-sql-file.js` (utility, keep?)
- `backend/scripts/verify-schema.js` (one-time use)

**SQL Directory:**

- `backend/sql/` (contains migration scripts?)

**Action:** REVIEW - Keep utilities, delete one-time scripts

---
|
||||
|
||||
---

### 📝 Category 5: DOCUMENTATION TO KEEP

**Essential Docs:**

- `README.md` ✅
- `QUICK_START.md` ✅
- `backend/TROUBLESHOOTING_PLAN.md` ✅ (just created)
- `DEPLOYMENT_GUIDE.md` ✅
- `CONFIGURATION_GUIDE.md` ✅
- `DATABASE_SCHEMA_DOCUMENTATION.md` ✅
- `BPCP CIM REVIEW TEMPLATE.md` ✅

**Consider Consolidating:**

- Multiple service `.md` files in `backend/src/services/`
- Multiple component `.md` files in `frontend/src/`

---
## Recommended Action Plan

### Phase 1: Safe Cleanup (No Risk)

```bash
# Delete backup files
rm backend/.env.backup*
rm backend/diagnostic-report.json

# Clear old logs (keep last 7 days)
find backend/logs -name "*.log" -mtime +7 -delete
```
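The `find … -delete` above is irreversible, so it is worth dry-running the same expression with `-print` first to see exactly which files would go. A sketch against a scratch directory (the real target is `backend/logs`):

```shell
# Dry run of the log cleanup. A scratch directory stands in for
# backend/logs so the commands are safe to try anywhere.
logs=$(mktemp -d)
touch -d "10 days ago" "$logs/old.log"   # simulated stale log (would be deleted)
touch "$logs/fresh.log"                  # recent log (should survive)

# -print lists what -delete would remove, without removing anything
find "$logs" -name "*.log" -mtime +7 -print
```

Once the listed files look right, swap `-print` back to `-delete`.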
### Phase 2: Remove One-Time Diagnostic Scripts

```bash
cd backend/src/scripts

# Delete one-time diagnostics
rm check-table-schema.ts
rm check-third-party-services.ts
rm comprehensive-diagnostic.ts
rm create-job-direct.ts
rm create-job-for-stuck-document.ts
rm create-test-job.ts
rm diagnose-processing-issues.ts
rm diagnose-upload-issues.ts
rm fix-table-schema.ts
rm mark-stuck-as-failed.ts
rm setup-gcs-permissions.ts
rm setup-processing-jobs-table.ts
rm test-gcs-integration.ts
rm test-job-creation.ts
rm test-linkage.ts
rm test-openrouter-quick.ts
rm test-postgres-connection.ts
rm test-production-upload.ts
rm test-staging-environment.ts
```
### Phase 3: Remove Redundant Documentation

```bash
cd /home/jonathan/Coding/cim_summary

# Delete untracked redundant docs
rm BETTER_APPROACHES.md
rm LLM_ANALYSIS.md
rm IMPLEMENTATION_GUIDE.md

# If DEPLOYMENT_INSTRUCTIONS.md is duplicate:
# rm DEPLOYMENT_INSTRUCTIONS.md
```
### Phase 4: Consolidate Service Documentation

Move documentation into inline code comments instead of keeping separate `.md` files.
---

## Estimated Space Saved

- Backup files: ~10KB
- Diagnostic scripts: ~50-100KB
- Documentation: ~50KB
- Old logs: Variable (could be 100s of KB)

**Total:** ~200-300KB (not huge, but a cleaner project)
## Recommendation

- **Execute Phase 1 immediately** (safe, no risk)
- **Execute Phase 2 after review** (can always recreate scripts)
- **Hold Phase 3** until you confirm docs are redundant
- **Hold Phase 4** for later refactoring

Would you like me to execute the cleanup?
@@ -1,143 +0,0 @@
# Cleanup Completed - Summary Report

**Date:** $(date)

## ✅ Phase 1: Backup & Temporary Files (COMPLETED)

**Deleted:**

- `backend/.env.backup` (4.1K)
- `backend/.env.backup.20251031_221937` (4.1K)
- `backend/diagnostic-report.json` (1.9K)

**Total:** ~10KB

---
## ✅ Phase 2: One-Time Diagnostic Scripts (COMPLETED)

**Deleted 19 scripts from `backend/src/scripts/`:**

1. check-table-schema.ts
2. check-third-party-services.ts
3. comprehensive-diagnostic.ts
4. create-job-direct.ts
5. create-job-for-stuck-document.ts
6. create-test-job.ts
7. diagnose-processing-issues.ts
8. diagnose-upload-issues.ts
9. fix-table-schema.ts
10. mark-stuck-as-failed.ts
11. setup-gcs-permissions.ts
12. setup-processing-jobs-table.ts
13. test-gcs-integration.ts
14. test-job-creation.ts
15. test-linkage.ts
16. test-openrouter-quick.ts
17. test-postgres-connection.ts
18. test-production-upload.ts
19. test-staging-environment.ts

**Remaining scripts (9):**

- check-current-job.ts
- check-current-processing.ts
- check-database-failures.ts
- monitor-document-processing.ts
- monitor-system.ts
- setup-database.ts
- test-full-llm-pipeline.ts
- test-llm-processing-offline.ts
- test-openrouter-simple.ts

**Total:** ~100KB

---
## ✅ Phase 3: Redundant Documentation & Scripts (COMPLETED)

**Deleted Documentation:**

- BETTER_APPROACHES.md
- LLM_ANALYSIS.md
- IMPLEMENTATION_GUIDE.md
- DOCUMENT_AUDIT_GUIDE.md
- DEPLOYMENT_INSTRUCTIONS.md (duplicate)

**Deleted Backend Docs:**

- backend/MIGRATION_GUIDE.md
- backend/PERFORMANCE_OPTIMIZATION_OPTIONS.md

**Deleted Shell Scripts:**

- backend/scripts/check-document-status.sh
- backend/scripts/sync-firebase-config.sh
- backend/scripts/sync-firebase-config.ts
- backend/scripts/verify-schema.js
- backend/scripts/run-sql-file.js

**Total:** ~50KB

---
## ✅ Phase 4: Old Log Files (COMPLETED)

**Deleted logs older than 7 days:**

- backend/logs/upload.log (0 bytes, Aug 2)
- backend/logs/app.log (39K, Aug 14)
- backend/logs/exceptions.log (26K, Aug 15)
- backend/logs/rejections.log (0 bytes, Aug 15)

**Total:** ~65KB

**Logs directory size after cleanup:** 620K

---
## 📊 Summary Statistics

| Category | Files Deleted | Space Saved |
|----------|---------------|-------------|
| Backups & Temp | 3 | ~10KB |
| Diagnostic Scripts | 19 | ~100KB |
| Documentation | 7 | ~50KB |
| Shell Scripts | 5 | ~10KB |
| Old Logs | 4 | ~65KB |
| **TOTAL** | **38** | **~235KB** |

---
## 🎯 What Remains

### Essential Scripts (9):

- Database checks and monitoring
- LLM testing and pipeline tests
- Database setup

### Essential Documentation:

- README.md
- QUICK_START.md
- DEPLOYMENT_GUIDE.md
- CONFIGURATION_GUIDE.md
- DATABASE_SCHEMA_DOCUMENTATION.md
- backend/TROUBLESHOOTING_PLAN.md
- BPCP CIM REVIEW TEMPLATE.md

### Reference Materials (Kept):

- `backend/sql/` directory (migration scripts for reference)
- Service documentation (.md files in src/services/)
- Recent logs (< 7 days old)

---
## ✨ Project Status After Cleanup

**Project is now:**

- ✅ Leaner (38 fewer files)
- ✅ More maintainable (removed one-time scripts)
- ✅ Better organized (removed duplicate docs)
- ✅ Kept all essential utilities and documentation

**Next recommended actions:**

1. Commit these changes to git
2. Review remaining 9 scripts - consolidate if needed
3. Consider archiving `backend/sql/` to a separate repo if not needed

---

**Cleanup completed successfully!**
@@ -1,370 +0,0 @@
# Full Documentation Plan
## Comprehensive Documentation Strategy for CIM Document Processor

### 🎯 Project Overview

This plan outlines a systematic approach to create complete, accurate, and LLM-optimized documentation for the CIM Document Processor project. The documentation will cover all aspects of the system from high-level architecture to detailed implementation guides.

---
## 📋 Documentation Inventory & Status

### ✅ Existing Documentation (Good Quality)
- `README.md` - Project overview and quick start
- `APP_DESIGN_DOCUMENTATION.md` - System architecture
- `AGENTIC_RAG_IMPLEMENTATION_PLAN.md` - AI processing strategy
- `PDF_GENERATION_ANALYSIS.md` - PDF optimization details
- `DEPLOYMENT_GUIDE.md` - Deployment instructions
- `ARCHITECTURE_DIAGRAMS.md` - Visual architecture
- `DOCUMENTATION_AUDIT_REPORT.md` - Accuracy audit

### ⚠️ Existing Documentation (Needs Updates)
- `codebase-audit-report.md` - May need updates
- `DEPENDENCY_ANALYSIS_REPORT.md` - May need updates
- `DOCUMENT_AI_INTEGRATION_SUMMARY.md` - May need updates

### ❌ Missing Documentation (To Be Created)
- Individual service documentation
- API endpoint documentation
- Database schema documentation
- Configuration guide
- Testing documentation
- Troubleshooting guide
- Development workflow guide
- Security documentation
- Performance optimization guide
- Monitoring and alerting guide

---
## 🏗️ Documentation Architecture

### Level 1: Project Overview
- **README.md** - Entry point and quick start
- **PROJECT_OVERVIEW.md** - Detailed project description
- **ARCHITECTURE_OVERVIEW.md** - High-level system design

### Level 2: System Architecture
- **APP_DESIGN_DOCUMENTATION.md** - Complete architecture
- **ARCHITECTURE_DIAGRAMS.md** - Visual diagrams
- **DATA_FLOW_DOCUMENTATION.md** - System data flow
- **INTEGRATION_GUIDE.md** - External service integration

### Level 3: Component Documentation
- **SERVICES/** - Individual service documentation
- **API/** - API endpoint documentation
- **DATABASE/** - Database schema and models
- **FRONTEND/** - Frontend component documentation

### Level 4: Implementation Guides
- **CONFIGURATION_GUIDE.md** - Environment setup
- **DEPLOYMENT_GUIDE.md** - Deployment procedures
- **TESTING_GUIDE.md** - Testing strategies
- **DEVELOPMENT_WORKFLOW.md** - Development processes

### Level 5: Operational Documentation
- **MONITORING_GUIDE.md** - Monitoring and alerting
- **TROUBLESHOOTING_GUIDE.md** - Common issues and solutions
- **SECURITY_GUIDE.md** - Security considerations
- **PERFORMANCE_GUIDE.md** - Performance optimization

---
## 📊 Documentation Priority Matrix

### 🔴 High Priority (Critical for LLM Agents)
1. **Service Documentation** - All backend services
2. **API Documentation** - Complete endpoint documentation
3. **Configuration Guide** - Environment and setup
4. **Database Schema** - Data models and relationships
5. **Error Handling** - Comprehensive error documentation

### 🟡 Medium Priority (Important for Development)
1. **Frontend Documentation** - React components and services
2. **Testing Documentation** - Test strategies and examples
3. **Development Workflow** - Development processes
4. **Performance Guide** - Optimization strategies
5. **Security Guide** - Security considerations

### 🟢 Low Priority (Nice to Have)
1. **Monitoring Guide** - Monitoring and alerting
2. **Troubleshooting Guide** - Common issues
3. **Integration Guide** - External service integration
4. **Data Flow Documentation** - Detailed data flow
5. **Project Overview** - Detailed project description

---
## 🚀 Implementation Plan

### Phase 1: Core Service Documentation (Week 1)
**Goal**: Document all backend services for LLM agent understanding

#### Day 1-2: Critical Services
- [ ] `unifiedDocumentProcessor.ts` - Main orchestrator
- [ ] `optimizedAgenticRAGProcessor.ts` - AI processing engine
- [ ] `llmService.ts` - LLM interactions
- [ ] `documentAiProcessor.ts` - Document AI integration

#### Day 3-4: File Management Services
- [ ] `fileStorageService.ts` - Google Cloud Storage
- [ ] `pdfGenerationService.ts` - PDF generation
- [ ] `uploadMonitoringService.ts` - Upload tracking
- [ ] `uploadProgressService.ts` - Progress tracking

#### Day 5-7: Data Management Services
- [ ] `agenticRAGDatabaseService.ts` - Analytics and sessions
- [ ] `vectorDatabaseService.ts` - Vector embeddings
- [ ] `sessionService.ts` - Session management
- [ ] `jobQueueService.ts` - Background processing
### Phase 2: API Documentation (Week 2)
**Goal**: Complete API endpoint documentation

#### Day 1-2: Document Routes
- [ ] `documents.ts` - Document management endpoints
- [ ] `monitoring.ts` - Monitoring endpoints
- [ ] `vector.ts` - Vector database endpoints

#### Day 3-4: Controller Documentation
- [ ] `documentController.ts` - Document controller
- [ ] `authController.ts` - Authentication controller

#### Day 5-7: API Integration Guide
- [ ] API authentication guide
- [ ] Request/response examples
- [ ] Error handling documentation
- [ ] Rate limiting documentation
### Phase 3: Database & Models (Week 3)
**Goal**: Complete database schema and model documentation

#### Day 1-2: Core Models
- [ ] `DocumentModel.ts` - Document data model
- [ ] `UserModel.ts` - User data model
- [ ] `ProcessingJobModel.ts` - Job processing model

#### Day 3-4: AI Models
- [ ] `AgenticRAGModels.ts` - AI processing models
- [ ] `agenticTypes.ts` - AI type definitions
- [ ] `VectorDatabaseModel.ts` - Vector database model

#### Day 5-7: Database Schema
- [ ] Complete database schema documentation
- [ ] Migration documentation
- [ ] Data relationships and constraints
- [ ] Query optimization guide
### Phase 4: Configuration & Setup (Week 4)
**Goal**: Complete configuration and setup documentation

#### Day 1-2: Environment Configuration
- [ ] Environment variables guide
- [ ] Configuration validation
- [ ] Service account setup
- [ ] API key management

#### Day 3-4: Development Setup
- [ ] Local development setup
- [ ] Development environment configuration
- [ ] Testing environment setup
- [ ] Debugging configuration

#### Day 5-7: Production Setup
- [ ] Production environment setup
- [ ] Deployment configuration
- [ ] Monitoring setup
- [ ] Security configuration
### Phase 5: Frontend Documentation (Week 5)
**Goal**: Complete frontend component and service documentation

#### Day 1-2: Core Components
- [ ] `App.tsx` - Main application component
- [ ] `DocumentUpload.tsx` - Upload component
- [ ] `DocumentList.tsx` - Document listing
- [ ] `DocumentViewer.tsx` - Document viewing

#### Day 3-4: Service Components
- [ ] `authService.ts` - Authentication service
- [ ] `documentService.ts` - Document service
- [ ] Context providers and hooks
- [ ] Utility functions

#### Day 5-7: Frontend Integration
- [ ] Component interaction patterns
- [ ] State management documentation
- [ ] Error handling in frontend
- [ ] Performance optimization
### Phase 6: Testing & Quality Assurance (Week 6)
**Goal**: Complete testing documentation and quality assurance

#### Day 1-2: Testing Strategy
- [ ] Unit testing documentation
- [ ] Integration testing documentation
- [ ] End-to-end testing documentation
- [ ] Test data management

#### Day 3-4: Quality Assurance
- [ ] Code quality standards
- [ ] Review processes
- [ ] Performance testing
- [ ] Security testing

#### Day 5-7: Continuous Integration
- [ ] CI/CD pipeline documentation
- [ ] Automated testing
- [ ] Quality gates
- [ ] Release processes
### Phase 7: Operational Documentation (Week 7)
**Goal**: Complete operational and maintenance documentation

#### Day 1-2: Monitoring & Alerting
- [ ] Monitoring setup guide
- [ ] Alert configuration
- [ ] Performance metrics
- [ ] Health checks

#### Day 3-4: Troubleshooting
- [ ] Common issues and solutions
- [ ] Debug procedures
- [ ] Log analysis
- [ ] Error recovery

#### Day 5-7: Maintenance
- [ ] Backup procedures
- [ ] Update procedures
- [ ] Scaling strategies
- [ ] Disaster recovery

---
## 📝 Documentation Standards

### File Naming Convention
- Use descriptive, lowercase names with hyphens
- Include component type in filename
- Example: `unified-document-processor-service.md`

### Content Structure
- Use consistent section headers with emojis
- Include file information header
- Provide usage examples
- Include error handling documentation
- Add LLM agent notes

### Code Examples
- Include TypeScript interfaces
- Provide realistic usage examples
- Show error handling patterns
- Include configuration examples

### Cross-References
- Link related documentation
- Reference external resources
- Include version information
- Maintain consistency across documents

---
## 🔍 Quality Assurance

### Documentation Review Process
1. **Technical Accuracy** - Verify against actual code
2. **Completeness** - Ensure all aspects are covered
3. **Clarity** - Ensure clear and understandable
4. **Consistency** - Maintain consistent style and format
5. **LLM Optimization** - Optimize for AI agent understanding

### Review Checklist
- [ ] All code examples are current and working
- [ ] API documentation matches implementation
- [ ] Configuration examples are accurate
- [ ] Error handling documentation is complete
- [ ] Performance metrics are realistic
- [ ] Links and references are valid
- [ ] LLM agent notes are included
- [ ] Cross-references are accurate

---
## 📊 Success Metrics

### Documentation Quality Metrics
- **Completeness**: 100% of services documented
- **Accuracy**: Zero inaccurate references
- **Clarity**: Clear and understandable content
- **Consistency**: Consistent style and format

### LLM Agent Effectiveness Metrics
- **Understanding Accuracy**: LLM agents comprehend codebase
- **Modification Success**: Successful code modifications
- **Error Reduction**: Reduced LLM-generated errors
- **Development Speed**: Faster development with LLM assistance

### User Experience Metrics
- **Onboarding Time**: Reduced time for new developers
- **Issue Resolution**: Faster issue resolution
- **Feature Development**: Faster feature implementation
- **Code Review Efficiency**: More efficient code reviews

---
## 🎯 Expected Outcomes

### Immediate Benefits
1. **Complete Documentation Coverage** - All components documented
2. **Accurate References** - No more inaccurate information
3. **LLM Optimization** - Optimized for AI agent understanding
4. **Developer Onboarding** - Faster onboarding for new developers

### Long-term Benefits
1. **Maintainability** - Easier to maintain and update
2. **Scalability** - Easier to scale development team
3. **Quality** - Higher code quality through better understanding
4. **Efficiency** - More efficient development processes

---
## 📋 Implementation Timeline

### Week 1: Core Service Documentation
- Complete documentation of all backend services
- Focus on critical services first
- Ensure LLM agent optimization

### Week 2: API Documentation
- Complete API endpoint documentation
- Include authentication and error handling
- Provide usage examples

### Week 3: Database & Models
- Complete database schema documentation
- Document all data models
- Include relationships and constraints

### Week 4: Configuration & Setup
- Complete configuration documentation
- Include environment setup guides
- Document deployment procedures

### Week 5: Frontend Documentation
- Complete frontend component documentation
- Document state management
- Include performance optimization

### Week 6: Testing & Quality Assurance
- Complete testing documentation
- Document quality assurance processes
- Include CI/CD documentation

### Week 7: Operational Documentation
- Complete monitoring and alerting documentation
- Document troubleshooting procedures
- Include maintenance procedures

---

This comprehensive documentation plan ensures that the CIM Document Processor project will have complete, accurate, and LLM-optimized documentation that supports efficient development and maintenance.
@@ -1,888 +0,0 @@
# Financial Data Extraction: Hybrid Solution
## Better Regex + Enhanced LLM Approach

## Philosophy

Rather than a major architectural refactor, this solution enhances what's already working:

1. **Smarter regex** to catch more table patterns
2. **Better LLM context** to ensure financial tables are always seen
3. **Hybrid validation** where regex and LLM cross-check each other

---
## Problem Analysis (Refined)

### Current Issues:
1. **Regex is too strict** - Misses valid table formats
2. **LLM gets incomplete context** - Financial tables truncated or missing
3. **No cross-validation** - Regex and LLM don't verify each other
4. **Table structure lost** - But we can preserve it better with preprocessing

### Key Insight:
The LLM is actually VERY good at understanding financial tables, even in messy text. We just need to:

- Give it the RIGHT chunks (always include financial sections)
- Give it MORE context (increase chunk size for financial data)
- Give it BETTER formatting hints (preserve spacing/alignment where possible)

**When to use this hybrid track:** Rely on the telemetry described in `FINANCIAL_EXTRACTION_ANALYSIS.md` / `IMPLEMENTATION_PLAN.md`. If a document finishes Phase 1/2 processing with `tablesFound === 0` or `financialDataPopulated === false`, route it through the hybrid steps below so we only pay the extra cost when the structured-table path truly fails.
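The routing gate above can be sketched in a few lines. This is a hedged illustration: `tablesFound` and `financialDataPopulated` are the telemetry fields named in the analysis docs, while the `ProcessingTelemetry` interface and `needsHybridFallback` name are assumptions for the example:

```typescript
// Illustrative types; the real telemetry record lives in the pipeline code.
interface ProcessingTelemetry {
  tablesFound: number;
  financialDataPopulated: boolean;
}

// Route to the hybrid track only when the structured-table path failed.
function needsHybridFallback(t: ProcessingTelemetry): boolean {
  return t.tablesFound === 0 || t.financialDataPopulated === false;
}

const healthy = needsHybridFallback({ tablesFound: 3, financialDataPopulated: true });
const reroute = needsHybridFallback({ tablesFound: 0, financialDataPopulated: true });
```

`healthy` is `false` (skip the extra cost) and `reroute` is `true` (send the document through the hybrid steps).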
---

## Solution Architecture

### Three-Tier Extraction Strategy

```
Tier 1: Enhanced Regex Parser (Fast, Deterministic)
    ↓ (if successful)
    ✓ Use regex results
    ↓ (if incomplete/failed)

Tier 2: LLM with Enhanced Context (Powerful, Flexible)
    ↓ (extract from full financial sections)
    ✓ Fill in gaps from Tier 1
    ↓ (if still missing data)

Tier 3: LLM Deep Dive (Focused, Exhaustive)
    ↓ (targeted re-scan of entire document)
    ✓ Final gap-filling
```

---
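The cascade above can be expressed as a gap-filling fold over extractors, where cheaper tiers run first and later tiers only fill fields the earlier ones left empty. A minimal sketch, assuming illustrative names (`Financials`, `extractWithFallback`) and two stub tiers, not the actual service API:

```typescript
// Two representative fields; the real record has many more metrics.
interface Financials { revenue?: number; ebitda?: number }

type Extractor = (text: string) => Financials;

// Run tiers in order; stop once all required fields are filled.
// Earlier (cheaper, more deterministic) tiers win on conflicts.
function extractWithFallback(text: string, tiers: Extractor[]): Financials {
  const result: Financials = {};
  for (const tier of tiers) {
    if (result.revenue !== undefined && result.ebitda !== undefined) break;
    const partial = tier(text);
    result.revenue ??= partial.revenue; // fill only the gaps
    result.ebitda ??= partial.ebitda;
  }
  return result;
}

// Stub tiers: the regex pass finds revenue only; the LLM pass fills EBITDA.
const regexTier: Extractor = () => ({ revenue: 45.2 });
const llmTier: Extractor = () => ({ revenue: 44.9, ebitda: 12.1 });

const out = extractWithFallback("", [regexTier, llmTier]);
```

Because Tier 1 already set `revenue`, the LLM's slightly different `44.9` is ignored and only `ebitda` is taken from Tier 2, which is exactly the "fill in gaps from Tier 1" behavior in the diagram.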
## Implementation Plan

## Phase 1: Enhanced Regex Parser (2-3 hours)

### 1.1: Improve Text Preprocessing

**Goal**: Preserve table structure better before regex parsing

**File**: Create `backend/src/utils/textPreprocessor.ts`
```typescript
/**
 * Enhanced text preprocessing to preserve table structures
 * Attempts to maintain column alignment from PDF extraction
 */

export interface PreprocessedText {
  original: string;
  enhanced: string;
  tableRegions: TextRegion[];
  metadata: {
    likelyTableCount: number;
    preservedAlignment: boolean;
  };
}

export interface TextRegion {
  start: number;
  end: number;
  type: 'table' | 'narrative' | 'header';
  confidence: number;
  content: string;
}

/**
 * Identify regions that look like tables based on formatting patterns
 */
export function identifyTableRegions(text: string): TextRegion[] {
  const regions: TextRegion[] = [];
  const lines = text.split('\n');

  let currentRegion: TextRegion | null = null;
  let regionStart = 0;
  let linePosition = 0;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const nextLine = lines[i + 1] || '';

    const isTableLike = detectTableLine(line, nextLine);

    if (isTableLike.isTable && !currentRegion) {
      // Start new table region
      currentRegion = {
        start: linePosition,
        end: linePosition + line.length,
        type: 'table',
        confidence: isTableLike.confidence,
        content: line
      };
      regionStart = i;
    } else if (isTableLike.isTable && currentRegion) {
      // Extend current table region
      currentRegion.end = linePosition + line.length;
      currentRegion.content += '\n' + line;
      currentRegion.confidence = Math.max(currentRegion.confidence, isTableLike.confidence);
    } else if (!isTableLike.isTable && currentRegion) {
      // End table region
      if (currentRegion.confidence > 0.5 && (i - regionStart) >= 3) {
        regions.push(currentRegion);
      }
      currentRegion = null;
    }

    linePosition += line.length + 1; // +1 for newline
  }

  // Add final region if exists
  if (currentRegion && currentRegion.confidence > 0.5) {
    regions.push(currentRegion);
  }

  return regions;
}

/**
 * Detect if a line looks like part of a table
 */
function detectTableLine(line: string, nextLine: string): { isTable: boolean; confidence: number } {
  let score = 0;

  // Check for multiple aligned numbers
  const numberMatches = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numberMatches && numberMatches.length >= 3) {
    score += 0.4; // Multiple numbers = likely table row
  }

  // Check for consistent spacing (indicates columns)
  const hasConsistentSpacing = /\s{2,}/.test(line); // 2+ spaces = column separator
  if (hasConsistentSpacing && numberMatches) {
    score += 0.3;
  }

  // Check for year/period patterns
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) {
    score += 0.3;
  }

  // Check for financial keywords
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) {
    score += 0.2;
  }

  // Bonus: Next line also looks like a table
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) {
    score += 0.2;
  }

  return {
    isTable: score > 0.5,
    confidence: Math.min(score, 1.0)
  };
}

/**
 * Enhance text by preserving spacing in table regions
 */
export function preprocessText(text: string): PreprocessedText {
  const tableRegions = identifyTableRegions(text);

  // For now, return original text with identified regions
  // In the future, could normalize spacing, align columns, etc.

  return {
    original: text,
    enhanced: text, // TODO: Apply enhancement algorithms
    tableRegions,
    metadata: {
      likelyTableCount: tableRegions.length,
      preservedAlignment: true
    }
  };
}

/**
 * Extract just the table regions as separate texts
 */
export function extractTableTexts(preprocessed: PreprocessedText): string[] {
  return preprocessed.tableRegions
    .filter(region => region.type === 'table' && region.confidence > 0.6)
    .map(region => region.content);
}
```
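To sanity-check the scoring heuristic in isolation, here is a minimal re-implementation of the `detectTableLine` logic (copied so the example runs standalone; the sample lines are invented) applied to a table row versus plain prose:

```typescript
// Minimal standalone copy of the detectTableLine scoring heuristic above.
function scoreTableLine(line: string, nextLine: string): number {
  let score = 0;
  const numbers = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numbers && numbers.length >= 3) score += 0.4;                       // several numbers => row
  if (/\s{2,}/.test(line) && numbers) score += 0.3;                       // wide gaps => columns
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) score += 0.3;  // period tokens
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) score += 0.2;
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) score += 0.2;  // table continues
  return Math.min(score, 1.0);
}

const row   = "Revenue     $10.2M    $12.8M    $13.5M";
const prose = "The company grew steadily over the period.";

const rowScore   = scoreTableLine(row, prose);  // numbers + spacing + keyword
const proseScore = scoreTableLine(prose, "");   // nothing table-like
```

The row clears the `0.5` table threshold while the prose line scores zero, which is the separation the heuristic relies on.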
### 1.2: Enhance Financial Table Parser

**File**: `backend/src/services/financialTableParser.ts`

**Add new patterns to catch more variations:**
```typescript
// ENHANCED: More flexible period token regex (add around line 21)
// Matches: FY-1 / FY 2, 2021 / FY2022A, FY1-FY4, LTM / TTM, CY21 / CY22,
// and quarters like "Q1 FY23" or "Q4 2022".
// Note: JavaScript regexes have no free-spacing `x` flag, so the pattern
// must be written compactly with the explanation kept in comments here.
const PERIOD_TOKEN_REGEX =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;

// ENHANCED: Better money regex to catch more formats (update line 22)
// Matches: $1,234.5M | 1,234.5M | (1,234.5M) for negatives | plain numbers
const MONEY_REGEX =
  /(?:\$\s*[\d,]+(?:\.\d+)?(?:\s*[KMB])?|[\d,]+(?:\.\d+)?\s*[KMB]|\([\d,]+(?:\.\d+)?(?:\s*[KMB])?\)|[\d,]+(?:\.\d+)?)/g;

// ENHANCED: Better percentage regex (update line 23)
// Matches: 12.5% | (12.5%) | 12.5 pct | NM / N/A placeholders
const PERCENT_REGEX = /(?:\(?[\d,]+\.?\d*\s*%\)?|[\d,]+\.?\d*\s*pct|NM|N\/A)/gi;
```
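A quick check of what these patterns pick up on CIM-style lines (the regexes are re-declared locally so the snippet runs standalone; the sample header and data lines are invented):

```typescript
// Local copies of the compact period/money patterns for a standalone demo.
const PERIOD_RE =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;
const MONEY_RE =
  /(?:\$\s*[\d,]+(?:\.\d+)?(?:\s*[KMB])?|[\d,]+(?:\.\d+)?\s*[KMB]|\([\d,]+(?:\.\d+)?(?:\s*[KMB])?\)|[\d,]+(?:\.\d+)?)/g;

const headerLine = "($ in millions)        FY2021A   FY2022A   LTM";
const dataLine   = "Revenue                $45.2     $52.8     $58.1";

const periods = headerLine.match(PERIOD_RE) ?? []; // period column labels
const amounts = dataLine.match(MONEY_RE) ?? [];    // dollar values in the row
```

Here `periods` captures the three column labels (`FY2021A`, `FY2022A`, `LTM`) and `amounts` captures the three dollar figures, which is the raw material the header/row matchers work from.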
**Add multi-pass header detection:**

```typescript
// ADD after line 278 (after current header detection)

// ENHANCED: Multi-pass header detection if first pass failed
if (bestHeaderIndex === -1) {
  logger.info('First pass header detection failed, trying relaxed patterns');

  // Second pass: Look for ANY line with 3+ numbers and a year pattern
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const hasYearPattern = /20\d{2}|FY|LTM|TTM/i.test(line);
    const numberCount = (line.match(/[\d,]+/g) || []).length;

    if (hasYearPattern && numberCount >= 3) {
      // Look at next 10 lines for financial keywords
      const lookAhead = lines.slice(i + 1, i + 11).join(' ');
      const hasFinancialKeywords = /revenue|ebitda|sales|profit/i.test(lookAhead);

      if (hasFinancialKeywords) {
        logger.info('Relaxed header detection found candidate', {
          headerIndex: i,
          headerLine: line.substring(0, 100)
        });

        // Try to parse this as header
        const tokens = tokenizePeriodHeaders(line);
        if (tokens.length >= 2) {
          bestHeaderIndex = i;
          bestBuckets = yearTokensToBuckets(tokens);
          bestHeaderScore = 50; // Lower confidence than primary detection
          break;
        }
      }
    }
  }
}
```
**Add fuzzy row matching:**
|
||||
|
||||
```typescript
|
||||
// ENHANCED: Add after line 354 (in the row matching loop)
|
||||
// If exact match fails, try fuzzy matching
|
||||
|
||||
if (!ROW_MATCHERS[field].test(line)) {
|
||||
// Try fuzzy matching (partial matches, typos)
|
||||
const fuzzyMatch = fuzzyMatchFinancialRow(line, field);
|
||||
if (!fuzzyMatch) continue;
|
||||
}
|
||||
|
||||
// ADD this helper function
|
||||
function fuzzyMatchFinancialRow(line: string, field: string): boolean {
|
||||
const lineLower = line.toLowerCase();
|
||||
|
||||
switch (field) {
|
||||
case 'revenue':
|
||||
return /rev\b|sales|top.?line/.test(lineLower);
|
||||
case 'ebitda':
|
||||
return /ebit|earnings.*operations|operating.*income/.test(lineLower);
|
||||
case 'grossProfit':
|
||||
return /gross.*profit|gp\b/.test(lineLower);
|
||||
case 'grossMargin':
|
||||
return /gross.*margin|gm\b|gross.*%/.test(lineLower);
|
||||
case 'ebitdaMargin':
|
||||
return /ebitda.*margin|ebitda.*%|margin.*ebitda/.test(lineLower);
|
||||
case 'revenueGrowth':
|
||||
return /revenue.*growth|growth.*revenue|rev.*growth|yoy|y.y/.test(lineLower);
|
||||
default:
|
||||
return false;
|
||||
}
|
||||
}
|
||||
```
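
As a quick standalone sanity check (a sketch, not part of the service code), the fuzzy patterns can be exercised directly; the patterns below are copied verbatim from `fuzzyMatchFinancialRow`:

```typescript
// Standalone check of the fuzzy row patterns used by fuzzyMatchFinancialRow.
// Lines are lowercased before matching, as in the helper above.
const fuzzyPatterns: Record<string, RegExp> = {
  revenue: /rev\b|sales|top.?line/,
  ebitda: /ebit|earnings.*operations|operating.*income/,
  grossMargin: /gross.*margin|gm\b|gross.*%/,
};

console.log(fuzzyPatterns.revenue.test('net sales grew 12% year over year')); // true
console.log(fuzzyPatterns.ebitda.test('adj. ebitda of $8.5m'));               // true
console.log(fuzzyPatterns.grossMargin.test('total backlog'));                 // false
```

Note that the `ebitda` pattern also matches lines such as "ebitda margin", which is why the integration snippet tries the exact `ROW_MATCHERS` first and falls back to fuzzy matching only on failure.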

---

## Phase 2: Enhanced LLM Context Delivery (2-3 hours)

### 2.1: Financial Section Prioritization

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Improve the `prioritizeFinancialChunks` method (around line 1265):**

```typescript
// ENHANCED: Much more aggressive financial chunk prioritization
private prioritizeFinancialChunks(chunks: ProcessingChunk[]): ProcessingChunk[] {
  const scoredChunks = chunks.map(chunk => {
    const content = chunk.content.toLowerCase();
    let score = 0;

    // TIER 1: Strong financial indicators (high score)
    const tier1Patterns = [
      /financial\s+summary/i,
      /historical\s+financials/i,
      /financial\s+performance/i,
      /income\s+statement/i,
      /financial\s+highlights/i,
    ];
    tier1Patterns.forEach(pattern => {
      if (pattern.test(content)) score += 100;
    });

    // TIER 2: Contains both periods AND metrics (very likely financial table)
    const hasPeriods = /\b(20[12]\d|FY[-\s]?\d{1,2}|LTM|TTM)\b/i.test(content);
    const hasMetrics = /(revenue|ebitda|sales|profit|margin)/i.test(content);
    const hasNumbers = /\$[\d,]+|[\d,]+[KMB]/i.test(content);

    if (hasPeriods && hasMetrics && hasNumbers) {
      score += 80; // Very likely financial table
    } else if (hasPeriods && hasMetrics) {
      score += 50;
    } else if (hasPeriods && hasNumbers) {
      score += 30;
    }

    // TIER 3: Multiple financial keywords
    const financialKeywords = [
      'revenue', 'ebitda', 'gross profit', 'margin', 'sales',
      'operating income', 'net income', 'cash flow', 'growth'
    ];
    const keywordMatches = financialKeywords.filter(kw => content.includes(kw)).length;
    score += keywordMatches * 5;

    // TIER 4: Has year progression (2021, 2022, 2023)
    const years = content.match(/20[12]\d/g);
    if (years && years.length >= 3) {
      score += 25; // Sequential years = likely financial table
    }

    // TIER 5: Multiple currency values
    const currencyMatches = content.match(/\$[\d,]+(?:\.\d+)?[KMB]?/gi);
    if (currencyMatches) {
      score += Math.min(currencyMatches.length * 3, 30);
    }

    // TIER 6: Section type boost
    if (chunk.sectionType && /financial|income|statement/i.test(chunk.sectionType)) {
      score += 40;
    }

    return { chunk, score };
  });

  // Sort by score and return
  const sorted = scoredChunks.sort((a, b) => b.score - a.score);

  // Log top financial chunks for debugging
  logger.info('Financial chunk prioritization results', {
    topScores: sorted.slice(0, 5).map(s => ({
      chunkIndex: s.chunk.chunkIndex,
      score: s.score,
      preview: s.chunk.content.substring(0, 100)
    }))
  });

  return sorted.map(s => s.chunk);
}
```

### 2.2: Increase Context for Financial Pass

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Update Pass 1 to use more chunks and larger context:**

```typescript
// ENHANCED: Update line 1259 (extractPass1CombinedMetadataFinancial)
// Change from 7 chunks to 12 chunks, and increase the character limit

const maxChunks = 12; // Was 7 - give the LLM more context for financials
const maxCharsPerChunk = 3000; // Was 1500 - don't truncate tables as aggressively

// And update line 1595 in extractWithTargetedQuery
const maxCharsPerChunk = options?.isFinancialPass ? 3000 : 1500;
```

### 2.3: Enhanced Financial Extraction Prompt

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Update the Pass 1 query (around line 1196-1240) to be more explicit:**

```typescript
// ENHANCED: Much more detailed extraction instructions
const query = `Extract deal information, company metadata, and COMPREHENSIVE financial data.

CRITICAL FINANCIAL TABLE EXTRACTION INSTRUCTIONS:

I. LOCATE FINANCIAL TABLES
Look for sections titled: "Financial Summary", "Historical Financials", "Financial Performance",
"Income Statement", "P&L", "Key Metrics", "Financial Highlights", or similar.

Financial tables typically appear in these formats:

FORMAT 1 - Row-based:
                FY 2021   FY 2022   FY 2023   LTM
Revenue         $45.2M    $52.8M    $61.2M    $58.5M
Revenue Growth  N/A       16.8%     15.9%     (4.4%)
EBITDA          $8.5M     $10.2M    $12.1M    $11.5M

FORMAT 2 - Column-based:
Metric             | Value
-------------------|---------
FY21 Revenue       | $45.2M
FY22 Revenue       | $52.8M
FY23 Revenue       | $61.2M

FORMAT 3 - Inline:
Revenue grew from $45.2M in FY2021 to $52.8M in FY2022 (+16.8%) and $61.2M in FY2023 (+15.9%)

II. EXTRACTION RULES

1. PERIOD IDENTIFICATION
   - FY-3, FY-2, FY-1 = Three most recent FULL fiscal years (not projections)
   - LTM/TTM = Most recent 12-month period
   - Map year labels: If you see "FY2021, FY2022, FY2023, LTM Sep'23", then:
     * FY2021 → fy3
     * FY2022 → fy2
     * FY2023 → fy1
     * LTM Sep'23 → ltm

2. VALUE EXTRACTION
   - Extract EXACT values as shown: "$45.2M", "16.8%", etc.
   - Preserve formatting: "$45.2M" not "45.2" or "45200000"
   - Include negative indicators: "(4.4%)" or "-4.4%"
   - Use "N/A" or "NM" if explicitly stated (not "Not specified")

3. METRIC IDENTIFICATION
   - Revenue = "Revenue", "Net Sales", "Total Sales", "Top Line"
   - EBITDA = "EBITDA", "Adjusted EBITDA", "Adj. EBITDA"
   - Margins = Look for "%" after metric name
   - Growth = "Growth %", "YoY", "Y/Y", "Change %"

4. DEAL OVERVIEW
   - Extract: company name, industry, geography, transaction type
   - Extract: employee count, deal source, reason for sale
   - Extract: CIM dates and metadata

III. QUALITY CHECKS

Before submitting your response:
- [ ] Did I find at least 3 distinct fiscal periods?
- [ ] Do I have Revenue AND EBITDA for at least 2 periods?
- [ ] Did I preserve exact number formats from the document?
- [ ] Did I map the periods correctly (newest = fy1, oldest = fy3)?

IV. WHAT TO DO IF TABLE IS UNCLEAR

If the table is hard to parse:
- Include the ENTIRE table section in your analysis
- Extract what you can with confidence
- Mark unclear values as "Not specified in CIM" only if truly absent
- DO NOT guess or interpolate values

V. ADDITIONAL FINANCIAL DATA

Also extract:
- Quality of earnings notes
- EBITDA adjustments and add-backs
- Revenue growth drivers
- Margin trends and analysis
- CapEx requirements
- Working capital needs
- Free cash flow comments`;
```

---

## Phase 3: Hybrid Validation & Cross-Checking (1-2 hours)

### 3.1: Create Validation Layer

**File**: Create `backend/src/services/financialDataValidator.ts`

```typescript
import { logger } from '../utils/logger';
import type { ParsedFinancials } from './financialTableParser';
import type { CIMReview } from './llmSchemas';

export interface ValidationResult {
  isValid: boolean;
  confidence: number;
  issues: string[];
  corrections: ParsedFinancials;
}

/**
 * Cross-validate financial data from multiple sources
 */
export function validateFinancialData(
  regexResult: ParsedFinancials,
  llmResult: Partial<CIMReview>
): ValidationResult {
  const issues: string[] = [];
  // Note: shallow copy - the nested period objects are shared with regexResult
  const corrections: ParsedFinancials = { ...regexResult };
  let confidence = 1.0;

  // Extract LLM financials
  const llmFinancials = llmResult.financialSummary?.financials;

  if (!llmFinancials) {
    return {
      isValid: true,
      confidence: 0.5,
      issues: ['No LLM financial data to validate against'],
      corrections: regexResult
    };
  }

  // Validate each period
  const periods: Array<keyof ParsedFinancials> = ['fy3', 'fy2', 'fy1', 'ltm'];

  for (const period of periods) {
    const regexPeriod = regexResult[period];
    const llmPeriod = llmFinancials[period];

    if (!llmPeriod) continue;

    // Compare revenue
    if (regexPeriod.revenue && llmPeriod.revenue) {
      const match = compareFinancialValues(regexPeriod.revenue, llmPeriod.revenue);
      if (!match.matches) {
        issues.push(`${period} revenue mismatch: Regex="${regexPeriod.revenue}" vs LLM="${llmPeriod.revenue}"`);
        confidence -= 0.1;

        // Trust LLM if regex value looks suspicious
        if (match.llmMoreCredible) {
          corrections[period].revenue = llmPeriod.revenue;
        }
      }
    } else if (!regexPeriod.revenue && llmPeriod.revenue && llmPeriod.revenue !== 'Not specified in CIM') {
      // Regex missed it, LLM found it
      corrections[period].revenue = llmPeriod.revenue;
      issues.push(`${period} revenue: Regex missed, using LLM value: ${llmPeriod.revenue}`);
    }

    // Compare EBITDA
    if (regexPeriod.ebitda && llmPeriod.ebitda) {
      const match = compareFinancialValues(regexPeriod.ebitda, llmPeriod.ebitda);
      if (!match.matches) {
        issues.push(`${period} EBITDA mismatch: Regex="${regexPeriod.ebitda}" vs LLM="${llmPeriod.ebitda}"`);
        confidence -= 0.1;

        if (match.llmMoreCredible) {
          corrections[period].ebitda = llmPeriod.ebitda;
        }
      }
    } else if (!regexPeriod.ebitda && llmPeriod.ebitda && llmPeriod.ebitda !== 'Not specified in CIM') {
      corrections[period].ebitda = llmPeriod.ebitda;
      issues.push(`${period} EBITDA: Regex missed, using LLM value: ${llmPeriod.ebitda}`);
    }

    // Fill in other fields from LLM if regex didn't get them
    const fields: Array<keyof typeof regexPeriod> = [
      'revenueGrowth', 'grossProfit', 'grossMargin', 'ebitdaMargin'
    ];

    for (const field of fields) {
      if (!regexPeriod[field] && llmPeriod[field] && llmPeriod[field] !== 'Not specified in CIM') {
        corrections[period][field] = llmPeriod[field];
      }
    }
  }

  logger.info('Financial data validation completed', {
    confidence,
    issueCount: issues.length,
    issues: issues.slice(0, 5)
  });

  return {
    isValid: confidence > 0.6,
    confidence,
    issues,
    corrections
  };
}

/**
 * Compare two financial values to see if they match
 */
function compareFinancialValues(
  value1: string,
  value2: string
): { matches: boolean; llmMoreCredible: boolean } {
  const clean1 = value1.replace(/[$,\s]/g, '').toUpperCase();
  const clean2 = value2.replace(/[$,\s]/g, '').toUpperCase();

  // Exact match
  if (clean1 === clean2) {
    return { matches: true, llmMoreCredible: false };
  }

  // Check if numeric values are close (within 5%)
  const num1 = parseFinancialValue(value1);
  const num2 = parseFinancialValue(value2);

  if (num1 && num2) {
    const percentDiff = Math.abs((num1 - num2) / num1);
    if (percentDiff < 0.05) {
      // Values are close enough
      return { matches: true, llmMoreCredible: false };
    }

    // Large difference - trust the value with more precision
    const precision1 = (value1.match(/\./g) || []).length;
    const precision2 = (value2.match(/\./g) || []).length;

    return {
      matches: false,
      llmMoreCredible: precision2 > precision1
    };
  }

  return { matches: false, llmMoreCredible: false };
}

/**
 * Parse a financial value string to a number
 */
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');

  let multiplier = 1;
  if (/M$/i.test(clean)) {
    multiplier = 1000000;
  } else if (/K$/i.test(clean)) {
    multiplier = 1000;
  } else if (/B$/i.test(clean)) {
    multiplier = 1000000000;
  }

  const numStr = clean.replace(/[MKB]/i, '');
  const num = parseFloat(numStr);

  return isNaN(num) ? null : num * multiplier;
}
```
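
To make the unit handling concrete, here is `parseFinancialValue` exercised standalone. The body is copied from above, except that the K/M/B suffix check is anchored to the end of the string, a minor tightening:

```typescript
// Standalone copy of parseFinancialValue for a quick check of the unit multipliers.
// The suffix is anchored with `$` here, a slight tightening of the version above.
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');

  let multiplier = 1;
  if (/M$/i.test(clean)) multiplier = 1000000;
  else if (/K$/i.test(clean)) multiplier = 1000;
  else if (/B$/i.test(clean)) multiplier = 1000000000;

  const num = parseFloat(clean.replace(/[MKB]$/i, ''));
  return isNaN(num) ? null : num * multiplier;
}

console.log(parseFinancialValue('$45.2M')); // 45200000
console.log(parseFinancialValue('850K'));   // 850000
console.log(parseFinancialValue('N/A'));    // null
```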

### 3.2: Integrate Validation into Processing

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Add after line 1137 (after merging partial results):**

```typescript
// ENHANCED: Cross-validate regex and LLM results
if (deterministicFinancials) {
  logger.info('Validating deterministic financials against LLM results');

  const { validateFinancialData } = await import('./financialDataValidator');
  const validation = validateFinancialData(deterministicFinancials, mergedData);

  logger.info('Validation results', {
    documentId,
    isValid: validation.isValid,
    confidence: validation.confidence,
    issueCount: validation.issues.length
  });

  // Use validated/corrected data
  if (validation.confidence > 0.7) {
    deterministicFinancials = validation.corrections;
    logger.info('Using validated corrections', {
      documentId,
      corrections: validation.corrections
    });
  }

  // Merge validated data
  this.mergeDeterministicFinancialData(mergedData, deterministicFinancials, documentId);
} else {
  logger.info('No deterministic financial data to validate', { documentId });
}
```

---

## Phase 4: Text Preprocessing Integration (1 hour)

### 4.1: Apply Preprocessing to Document AI Text

**File**: `backend/src/services/documentAiProcessor.ts`

**Add preprocessing before passing to RAG:**

```typescript
// ADD import at top
import { preprocessText, extractTableTexts } from '../utils/textPreprocessor';

// UPDATE line 83 (processWithAgenticRAG method)
private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<any> {
  try {
    logger.info('Processing extracted text with Agentic RAG', {
      documentId,
      textLength: extractedText.length
    });

    // ENHANCED: Preprocess text to identify table regions
    const preprocessed = preprocessText(extractedText);

    logger.info('Text preprocessing completed', {
      documentId,
      tableRegionsFound: preprocessed.tableRegions.length,
      likelyTableCount: preprocessed.metadata.likelyTableCount
    });

    // Extract table texts separately for better parsing
    const tableSections = extractTableTexts(preprocessed);

    // Import and use the optimized agentic RAG processor
    const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');

    const result = await optimizedAgenticRAGProcessor.processLargeDocument(
      documentId,
      extractedText,
      {
        preprocessedData: preprocessed, // Pass preprocessing results
        tableSections: tableSections // Pass isolated table texts
      }
    );

    return result;
  } catch (error) {
    // ... existing error handling
  }
}
```

---

## Expected Results

### Current State (Baseline):
```
Financial data extraction rate: 10-20%
Typical result: "Not specified in CIM" for most fields
```

### After Phase 1 (Enhanced Regex):
```
Financial data extraction rate: 35-45%
Improvement: Better pattern matching catches more tables
```

### After Phase 2 (Enhanced LLM):
```
Financial data extraction rate: 65-75%
Improvement: LLM sees financial tables more reliably
```

### After Phase 3 (Validation):
```
Financial data extraction rate: 75-85%
Improvement: Cross-validation fills gaps and corrects errors
```

### After Phase 4 (Preprocessing):
```
Financial data extraction rate: 80-90%
Improvement: Table structure preservation helps both regex and LLM
```

---

## Implementation Priority

### Start Here (Highest ROI):
1. **Phase 2.1** - Financial Section Prioritization (30 min, +30% accuracy)
2. **Phase 2.2** - Increase LLM Context (15 min, +15% accuracy)
3. **Phase 2.3** - Enhanced Prompt (30 min, +20% accuracy)

**Total: 1.5 hours for ~50-60% improvement**

### Then Do:
4. **Phase 1.2** - Enhanced Parser Patterns (1 hour, +10% accuracy)
5. **Phase 3.1-3.2** - Validation (1.5 hours, +10% accuracy)

**Total: 4 hours for ~70-80% improvement**

### Optional:
6. **Phase 1.1, 4.1** - Text Preprocessing (2 hours, +10% accuracy)

---

## Testing Strategy

### Test 1: Baseline Measurement
```bash
# Process 10 CIMs and record extraction rate
npm run test:pipeline
# Record: How many financial fields are populated?
```

### Test 2: After Each Phase
```bash
# Same 10 CIMs, measure improvement
npm run test:pipeline
# Compare against baseline
```

### Test 3: Edge Cases
- PDFs with rotated pages
- PDFs with merged table cells
- PDFs with multi-line headers
- Narrative-only financials (no tables)

---

## Rollback Plan

Each phase is additive and can be disabled via feature flags:

```typescript
// config/env.ts
export const features = {
  enhancedRegexParsing: process.env.ENHANCED_REGEX === 'true',
  enhancedLLMContext: process.env.ENHANCED_LLM === 'true',
  financialValidation: process.env.VALIDATE_FINANCIALS === 'true',
  textPreprocessing: process.env.PREPROCESS_TEXT === 'true'
};
```

Each flag defaults to off; a phase runs only when its variable is explicitly set to `true`. To disable a phase, set its variable to `false` (e.g. `ENHANCED_REGEX=false` turns off enhanced regex parsing).
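
At each call site the guard is a plain conditional; a minimal sketch (flag names as above, `parseFinancials` is a hypothetical stand-in for the real parser entry point):

```typescript
// Sketch: gate a phase behind its environment flag, falling back to the
// original path when the flag is off or unset.
function isEnabled(flag: string): boolean {
  return process.env[flag] === 'true';
}

// Hypothetical entry point; the real code would branch into the Phase 1 parser.
function parseFinancials(text: string): string {
  if (isEnabled('ENHANCED_REGEX')) {
    return 'enhanced'; // placeholder for the enhanced parser path
  }
  return 'baseline';   // original behavior, unchanged
}
```

Because the check is `=== 'true'`, every phase is off by default in any environment where the variable is missing or mistyped, which keeps rollback a pure configuration change.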

---

## Success Metrics

| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Financial data extracted | 10-20% | 80-90% | % of fields populated |
| Processing time | 45s | <60s | End-to-end time |
| False positives | Unknown | <5% | Manual validation |
| Column misalignment | ~50% | <10% | Check FY mapping |

---

## Next Steps

1. Implement Phase 2 (Enhanced LLM) first - biggest impact, lowest risk
2. Test with 5-10 real CIM documents
3. Measure improvement
4. If accuracy exceeds 70%, stop; if not, add Phases 1 and 3
5. Keep Phase 4 as an optional enhancement

The LLM is actually very good at this - we just need to give it the right context!


---

# Financial Data Extraction: Implementation Plan

## Overview

This document provides a step-by-step implementation plan to fix the financial data extraction issue by utilizing Document AI's structured table data.

---

## Phase 1: Quick Win Implementation (RECOMMENDED START)

**Timeline**: 1-2 hours
**Expected Improvement**: 60-70% accuracy gain
**Risk**: Low - additive changes, no breaking modifications

### Step 1.1: Update DocumentAIOutput Interface

**File**: `backend/src/services/documentAiProcessor.ts`

**Current (lines 15-25):**
```typescript
interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: Array<any>; // ❌ Just counts, no structure
  pages: Array<any>;
  mimeType: string;
}
```

**Updated:**
```typescript
export interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: {
    pageNumber: number;
    confidence: number;
  };
  rawTable?: any; // Keep original for debugging
}

interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: StructuredTable[]; // ✅ Full structure
  pages: Array<any>;
  mimeType: string;
}
```

### Step 1.2: Add Table Text Extraction Helper

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Add after line 51 (after the constructor)

```typescript
/**
 * Extract text from a Document AI layout object using text anchors
 * Based on Google's best practices: https://cloud.google.com/document-ai/docs/handle-response
 */
private getTextFromLayout(layout: any, documentText: string): string {
  try {
    const textAnchor = layout?.textAnchor;
    if (!textAnchor?.textSegments || textAnchor.textSegments.length === 0) {
      return '';
    }

    // Get the first segment (most common case)
    const segment = textAnchor.textSegments[0];
    const startIndex = parseInt(segment.startIndex || '0', 10);
    const endIndex = parseInt(segment.endIndex || documentText.length.toString(), 10);

    // Validate indices
    if (startIndex < 0 || endIndex > documentText.length || startIndex >= endIndex) {
      logger.warn('Invalid text anchor indices', { startIndex, endIndex, docLength: documentText.length });
      return '';
    }

    return documentText.substring(startIndex, endIndex).trim();
  } catch (error) {
    logger.error('Failed to extract text from layout', {
      error: error instanceof Error ? error.message : String(error),
      layout
    });
    return '';
  }
}
```
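
One caveat: the helper reads only the first text segment, while Document AI may return several segments per anchor (for example when a cell's text is split across layout blocks). A variant that concatenates all segments, assuming the same layout shape, could look like this:

```typescript
// Sketch: concatenate every textSegment of a Document AI textAnchor.
// `layout` mirrors the shape consumed by getTextFromLayout above.
function getAllSegmentsText(layout: any, documentText: string): string {
  const segments = layout?.textAnchor?.textSegments ?? [];
  let out = '';
  for (const seg of segments) {
    const start = parseInt(seg.startIndex ?? '0', 10);
    const end = parseInt(seg.endIndex ?? '0', 10);
    // Skip segments with out-of-range or inverted indices
    if (start >= 0 && end <= documentText.length && start < end) {
      out += documentText.substring(start, end);
    }
  }
  return out.trim();
}

const doc = 'Revenue $45.2M EBITDA $8.5M';
const layout = { textAnchor: { textSegments: [
  { startIndex: '0', endIndex: '7' },    // "Revenue"
  { startIndex: '14', endIndex: '21' },  // " EBITDA"
] } };
console.log(getAllSegmentsText(layout, doc)); // "Revenue EBITDA"
```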

### Step 1.3: Add Structured Table Extraction

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Add after getTextFromLayout method

```typescript
/**
 * Extract structured tables from Document AI response
 * Preserves column alignment and table structure
 */
private extractStructuredTables(document: any, documentText: string): StructuredTable[] {
  const tables: StructuredTable[] = [];

  try {
    const pages = document.pages || [];
    logger.info('Extracting structured tables from Document AI response', {
      pageCount: pages.length
    });

    for (const page of pages) {
      const pageTables = page.tables || [];
      const pageNumber = page.pageNumber || 0;

      logger.info('Processing page for tables', {
        pageNumber,
        tableCount: pageTables.length
      });

      for (let tableIndex = 0; tableIndex < pageTables.length; tableIndex++) {
        const table = pageTables[tableIndex];

        try {
          // Extract headers from first header row
          const headers: string[] = [];
          if (table.headerRows && table.headerRows.length > 0) {
            const headerRow = table.headerRows[0];
            for (const cell of headerRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              headers.push(cellText);
            }
          }

          // Extract data rows
          const rows: string[][] = [];
          for (const bodyRow of table.bodyRows || []) {
            const row: string[] = [];
            for (const cell of bodyRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              row.push(cellText);
            }
            if (row.length > 0) {
              rows.push(row);
            }
          }

          // Only add tables with content
          if (headers.length > 0 || rows.length > 0) {
            tables.push({
              headers,
              rows,
              position: {
                pageNumber,
                confidence: table.confidence || 0.9
              },
              rawTable: table // Keep for debugging
            });

            logger.info('Extracted structured table', {
              pageNumber,
              tableIndex,
              headerCount: headers.length,
              rowCount: rows.length,
              headers: headers.slice(0, 10) // Log first 10 headers
            });
          }
        } catch (tableError) {
          logger.error('Failed to extract table', {
            pageNumber,
            tableIndex,
            error: tableError instanceof Error ? tableError.message : String(tableError)
          });
        }
      }
    }

    logger.info('Structured table extraction completed', {
      totalTables: tables.length
    });

  } catch (error) {
    logger.error('Failed to extract structured tables', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return tables;
}
```

### Step 1.4: Update processWithDocumentAI to Use Structured Tables

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Update lines 462-482

**Current:**
```typescript
// Extract tables
const tables = document.pages?.flatMap(page =>
  page.tables?.map(table => ({
    rows: table.headerRows?.length || 0,
    columns: table.bodyRows?.[0]?.cells?.length || 0
  })) || []
) || [];
```

**Updated:**
```typescript
// Extract structured tables with full content
const tables = this.extractStructuredTables(document, text);
```

### Step 1.5: Pass Tables to Agentic RAG Processor

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Update line 337 (processLargeDocument call)

**Current:**
```typescript
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {}
);
```

**Updated:**
```typescript
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {
    structuredTables: documentAiOutput.tables || []
  }
);
```

### Step 1.6: Update Agentic RAG Processor Signature

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update lines 41-48

**Current:**
```typescript
async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
  } = {}
)
```

**Updated:**
```typescript
async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
    structuredTables?: StructuredTable[];
  } = {}
)
```

### Step 1.7: Add Import for StructuredTable Type

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Add to imports at top (around line 1-6)

```typescript
import type { StructuredTable } from './documentAiProcessor';
```

### Step 1.8: Create Financial Table Identifier

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Add after line 503 (after calculateCosineSimilarity)

```typescript
/**
 * Identify if a structured table contains financial data
 * Uses heuristics to detect financial tables vs. other tables
 */
private isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const allRowsText = table.rows.map(row => row.join(' ').toLowerCase()).join(' ');

  // Check for year/period indicators in headers
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);

  // Check for financial metrics in rows
  const financialMetrics = [
    'revenue', 'sales', 'ebitda', 'ebit', 'profit', 'margin',
    'gross profit', 'operating income', 'net income', 'cash flow',
    'earnings', 'assets', 'liabilities', 'equity'
  ];
  const hasFinancialMetrics = financialMetrics.some(metric =>
    allRowsText.includes(metric)
  );

  // Check for currency/percentage values
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(allRowsText);

  // A financial table should have periods AND (metrics OR currency values)
  const isFinancial = hasPeriods && (hasFinancialMetrics || hasCurrency);

  if (isFinancial) {
    logger.info('Identified financial table', {
      headers: table.headers,
      rowCount: table.rows.length,
      pageNumber: table.position.pageNumber
    });
  }

  return isFinancial;
}

/**
 * Format a structured table as markdown for better LLM comprehension
 * Preserves column alignment and makes tables human-readable
 */
private formatTableAsMarkdown(table: StructuredTable): string {
  const lines: string[] = [];

  // Add header row
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }

  // Add data rows
  for (const row of table.rows) {
    lines.push(`| ${row.join(' | ')} |`);
  }

  return lines.join('\n');
}
```
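
For instance, a three-period table renders like a standard markdown table. The sketch below replays the same formatting logic outside the class:

```typescript
// Standalone sketch of formatTableAsMarkdown's output shape.
interface SimpleTable { headers: string[]; rows: string[][]; }

function toMarkdown(table: SimpleTable): string {
  const lines: string[] = [];
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }
  for (const row of table.rows) {
    lines.push(`| ${row.join(' | ')} |`);
  }
  return lines.join('\n');
}

const md = toMarkdown({
  headers: ['Metric', 'FY2022', 'FY2023'],
  rows: [['Revenue', '$52.8M', '$61.2M'], ['EBITDA', '$10.2M', '$12.1M']],
});
console.log(md);
// | Metric | FY2022 | FY2023 |
// | --- | --- | --- |
// | Revenue | $52.8M | $61.2M |
// | EBITDA | $10.2M | $12.1M |
```

Rendering period labels as a header row is what preserves the column-to-period mapping that plain text extraction tends to destroy.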

### Step 1.9: Update Chunk Creation to Include Financial Tables

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update createIntelligentChunks method (lines 115-158)

**Add after line 118:**
```typescript
// Extract structured tables from options
const structuredTables = (options as any)?.structuredTables || [];
```

**Add after line 119 (inside the method, before semantic chunking):**
```typescript
// PRIORITY: Create dedicated chunks for financial tables
if (structuredTables.length > 0) {
  logger.info('Processing structured tables for chunking', {
    documentId,
    tableCount: structuredTables.length
  });

  for (let i = 0; i < structuredTables.length; i++) {
    const table = structuredTables[i];
    const isFinancial = this.isFinancialTable(table);

    // Format table as markdown for better readability
    const markdownTable = this.formatTableAsMarkdown(table);

    chunks.push({
      id: `${documentId}-table-${i}`,
      content: markdownTable,
      chunkIndex: chunks.length,
      startPosition: -1, // Tables don't have text positions
      endPosition: -1,
      sectionType: isFinancial ? 'financial-table' : 'table',
      metadata: {
        isStructuredTable: true,
        isFinancialTable: isFinancial,
        tableIndex: i,
        pageNumber: table.position.pageNumber,
        headerCount: table.headers.length,
        rowCount: table.rows.length,
        structuredData: table // Preserve original structure
      }
    });

    logger.info('Created chunk for structured table', {
      documentId,
      tableIndex: i,
      isFinancial,
      chunkId: chunks[chunks.length - 1].id,
      contentPreview: markdownTable.substring(0, 200)
    });
  }
}
```

### Step 1.10: Pin Financial Tables in Extraction

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update extractPass1CombinedMetadataFinancial method (around lines 1190-1260)

**Add before the return statement (around line 1259):**
```typescript
// Identify and pin financial table chunks to ensure they're always included
const financialTableChunks = chunks.filter(
  chunk => chunk.metadata?.isFinancialTable === true
);

logger.info('Financial table chunks identified for pinning', {
  documentId,
  financialTableCount: financialTableChunks.length,
  chunkIds: financialTableChunks.map(c => c.id)
});

// Combine deterministic financial chunks with structured table chunks
const allPinnedChunks = [
  ...pinnedChunks,
  ...financialTableChunks
];
```

**Update the return statement to use allPinnedChunks:**
```typescript
return await this.extractWithTargetedQuery(
  documentId,
  text,
  financialChunks,
  query,
  targetFields,
  7,
  allPinnedChunks // ✅ Now includes both deterministic and structured tables
);
```

---

## Testing Phase 1

### Test 1.1: Verify Table Extraction
```bash
# Monitor logs for table extraction
cd backend
npm run dev

# Look for log entries:
# - "Extracting structured tables from Document AI response"
# - "Extracted structured table"
# - "Identified financial table"
```

### Test 1.2: Upload a CIM Document
```bash
# Upload a test document and check processing
curl -X POST http://localhost:8080/api/documents/upload \
  -F "file=@test-cim.pdf" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

### Test 1.3: Verify Financial Data Populated
Check the database or API response for:
- `financialSummary.financials.fy3.revenue` - Should have values
- `financialSummary.financials.fy2.ebitda` - Should have values
- NOT "Not specified in CIM" for fields that exist in tables
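The spot-check above can be scripted. A minimal sketch, assuming the `financialSummary.financials` shape used throughout this plan; `checkFinancials` is a hypothetical helper, not part of the codebase:

```typescript
// Flag financial fields that are still unpopulated or carry the
// "Not specified in CIM" sentinel. Field layout mirrors financialSummary.
type Period = Record<string, string>;
type Financials = Record<string, Period>;

function checkFinancials(financials: Financials): string[] {
  const missing: string[] = [];
  for (const [period, metrics] of Object.entries(financials)) {
    for (const [field, value] of Object.entries(metrics)) {
      if (!value || value === 'Not specified in CIM') {
        missing.push(`${period}.${field}`);
      }
    }
  }
  return missing;
}

// Example: fy3.revenue populated, fy2.ebitda still missing
const unpopulated = checkFinancials({
  fy3: { revenue: '$45.2M' },
  fy2: { ebitda: 'Not specified in CIM' }
});
console.log(unpopulated); // → ['fy2.ebitda']
```

Running this against each reprocessed document gives a quick pass/fail list before digging into logs.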

### Test 1.4: Check Logs for Success Indicators
```bash
# Should see:
# ✅ "Identified financial table" - confirms tables detected
# ✅ "Created chunk for structured table" - confirms chunking worked
# ✅ "Financial table chunks identified for pinning" - confirms pinning worked
# ✅ "Deterministic financial data merged successfully" - confirms data merged
```

---

### Baseline & Post-Change Metrics

Collect before/after numbers so we can validate the expected accuracy lift and know when to pull in the hybrid fallback:

1. Instrument the processing metadata (see `FINANCIAL_EXTRACTION_ANALYSIS.md`) with `tablesFound`, `financialTablesIdentified`, `structuredParsingUsed`, `textParsingFallback`, and `financialDataPopulated`.
2. Run ≥20 recent CIMs through the current pipeline and record aggregate stats (mean/median for the above plus sample `documentId`s with `tablesFound === 0`).
3. Repeat after deploying the Phase 1 and Phase 2 changes; paste the numbers back into the analysis doc so the Success Criteria reference real data instead of estimates.
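The aggregation in step 2 can be sketched as follows. The record shape mirrors the metrics listed above; `summarize` is a hypothetical helper, not existing code:

```typescript
// Aggregate per-document telemetry into before/after comparison stats.
interface ProcessingMetadata {
  documentId: string;
  tablesFound: number;
  financialTablesIdentified: number;
  structuredParsingUsed: boolean;
  financialDataPopulated: boolean;
}

function summarize(runs: ProcessingMetadata[]) {
  const n = runs.length;
  const mean = (f: (r: ProcessingMetadata) => number) =>
    runs.reduce((sum, r) => sum + f(r), 0) / n;
  return {
    meanTablesFound: mean(r => r.tablesFound),
    populatedRate: mean(r => (r.financialDataPopulated ? 1 : 0)),
    // Sample documentIds to investigate for the hybrid fallback
    zeroTableDocs: runs.filter(r => r.tablesFound === 0).map(r => r.documentId)
  };
}

const stats = summarize([
  { documentId: 'a', tablesFound: 3, financialTablesIdentified: 1, structuredParsingUsed: true, financialDataPopulated: true },
  { documentId: 'b', tablesFound: 0, financialTablesIdentified: 0, structuredParsingUsed: false, financialDataPopulated: false }
]);
console.log(stats); // → { meanTablesFound: 1.5, populatedRate: 0.5, zeroTableDocs: ['b'] }
```

Run it once on the pre-change batch and once post-deploy; the `populatedRate` delta is the accuracy lift the Success Criteria ask for.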

---

## Expected Results After Phase 1

### Before Phase 1:
```json
{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      },
      "fy2": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      }
    }
  }
}
```

### After Phase 1:
```json
{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "$45.2M",
        "revenueGrowth": "N/A",
        "ebitda": "$8.5M",
        "ebitdaMargin": "18.8%"
      },
      "fy2": {
        "revenue": "$52.8M",
        "revenueGrowth": "16.8%",
        "ebitda": "$10.2M",
        "ebitdaMargin": "19.3%"
      }
    }
  }
}
```

---

## Phase 2: Enhanced Deterministic Parsing (Optional)

**Timeline**: 2-3 hours
**Expected Additional Improvement**: +15-20% accuracy
**Trigger**: If Phase 1 results are below 70% accuracy

### Step 2.1: Create Structured Table Parser

**File**: Create `backend/src/services/structuredFinancialParser.ts`

```typescript
import { logger } from '../utils/logger';
import type { StructuredTable } from './documentAiProcessor';
import type { ParsedFinancials, FinancialPeriod } from './financialTableParser';

/**
 * Parse financials directly from Document AI structured tables
 * This is more reliable than parsing from flattened text
 */
export function parseFinancialsFromStructuredTable(
  table: StructuredTable
): ParsedFinancials {
  const result: ParsedFinancials = {
    fy3: {},
    fy2: {},
    fy1: {},
    ltm: {}
  };

  try {
    // 1. Identify period columns from headers
    const periodMapping = mapHeadersToPeriods(table.headers);

    logger.info('Structured table period mapping', {
      headers: table.headers,
      periodMapping
    });

    // 2. Process each row to extract metrics
    for (let rowIndex = 0; rowIndex < table.rows.length; rowIndex++) {
      const row = table.rows[rowIndex];
      if (row.length === 0) continue;

      const metricName = row[0].toLowerCase();

      // Match against known financial metrics
      const fieldName = identifyMetricField(metricName);
      if (!fieldName) continue;

      // 3. Assign values to correct periods
      periodMapping.forEach((period, columnIndex) => {
        if (!period) return; // Skip unmapped columns

        const value = row[columnIndex + 1]; // +1 because first column is metric name
        if (!value || value.trim() === '') return;

        // 4. Validate value type matches field
        if (isValidValueForField(value, fieldName)) {
          result[period][fieldName] = value.trim();

          logger.debug('Mapped structured table value', {
            period,
            field: fieldName,
            value: value.trim(),
            row: rowIndex,
            column: columnIndex
          });
        }
      });
    }

    logger.info('Structured table parsing completed', {
      fy3: result.fy3,
      fy2: result.fy2,
      fy1: result.fy1,
      ltm: result.ltm
    });

  } catch (error) {
    logger.error('Failed to parse structured financial table', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return result;
}

/**
 * Map header columns to financial periods (fy3, fy2, fy1, ltm)
 */
function mapHeadersToPeriods(headers: string[]): Array<keyof ParsedFinancials | null> {
  const periodMapping: Array<keyof ParsedFinancials | null> = [];

  for (const header of headers) {
    const normalized = header.trim().toUpperCase().replace(/\s+/g, '');
    let period: keyof ParsedFinancials | null = null;

    // Check for LTM/TTM
    if (normalized.includes('LTM') || normalized.includes('TTM')) {
      period = 'ltm';
    }
    // Check for year patterns
    else if (/FY[-\s]?1$|FY[-\s]?2024|2024/.test(normalized)) {
      period = 'fy1'; // Most recent full year
    }
    else if (/FY[-\s]?2$|FY[-\s]?2023|2023/.test(normalized)) {
      period = 'fy2'; // Second most recent year
    }
    else if (/FY[-\s]?3$|FY[-\s]?2022|2022/.test(normalized)) {
      period = 'fy3'; // Third most recent year
    }
    // Generic FY pattern - assign based on position
    else if (/FY\d{2}/.test(normalized)) {
      // Will be assigned based on relative position
      period = null; // Handle in second pass
    }

    periodMapping.push(period);
  }

  // Second pass: fill in generic FY columns based on position
  // Most recent on right, oldest on left (common CIM format)
  let fyIndex = 1;
  for (let i = periodMapping.length - 1; i >= 0; i--) {
    if (periodMapping[i] === null && /FY/i.test(headers[i])) {
      if (fyIndex === 1) periodMapping[i] = 'fy1';
      else if (fyIndex === 2) periodMapping[i] = 'fy2';
      else if (fyIndex === 3) periodMapping[i] = 'fy3';
      fyIndex++;
    }
  }

  return periodMapping;
}

/**
 * Identify which financial field a metric name corresponds to
 */
function identifyMetricField(metricName: string): keyof FinancialPeriod | null {
  const name = metricName.toLowerCase();

  if (/^revenue|^net sales|^total sales|^top\s+line/.test(name)) {
    return 'revenue';
  }
  if (/gross\s*profit/.test(name)) {
    return 'grossProfit';
  }
  if (/gross\s*margin/.test(name)) {
    return 'grossMargin';
  }
  if (/ebitda\s*margin|adj\.?\s*ebitda\s*margin/.test(name)) {
    return 'ebitdaMargin';
  }
  if (/ebitda|adjusted\s*ebitda|adj\.?\s*ebitda/.test(name)) {
    return 'ebitda';
  }
  if (/revenue\s*growth|yoy|y\/y|year[-\s]*over[-\s]*year/.test(name)) {
    return 'revenueGrowth';
  }

  return null;
}

/**
 * Validate that a value is appropriate for a given field
 */
function isValidValueForField(value: string, field: keyof FinancialPeriod): boolean {
  const trimmed = value.trim();

  // Margin and growth fields should have %
  if (field.includes('Margin') || field.includes('Growth')) {
    // Accept 'n/a' on its own (it contains no digit, so it must be
    // checked before the digit requirement)
    return trimmed.toLowerCase() === 'n/a' || (/\d/.test(trimmed) && trimmed.includes('%'));
  }

  // Revenue, profit, EBITDA should have $ or numbers
  if (['revenue', 'grossProfit', 'ebitda'].includes(field)) {
    return /\d/.test(trimmed) && (trimmed.includes('$') || /\d+[KMB]/i.test(trimmed));
  }

  return /\d/.test(trimmed);
}
```
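The header-to-period mapping is the riskiest part of this parser, so it is worth exercising standalone before wiring it in. The sketch below inlines a trimmed copy of the mapping logic above (explicit FY1/FY2/FY3, year, and LTM/TTM patterns only, without the positional second pass):

```typescript
// Trimmed, standalone copy of the header-to-period mapping, for
// spot-checking against a typical CIM header row.
type PeriodKey = 'fy1' | 'fy2' | 'fy3' | 'ltm';

function mapHeaders(headers: string[]): Array<PeriodKey | null> {
  return headers.map(header => {
    const n = header.trim().toUpperCase().replace(/\s+/g, '');
    if (n.includes('LTM') || n.includes('TTM')) return 'ltm';
    if (/FY-?1$|2024/.test(n)) return 'fy1';
    if (/FY-?2$|2023/.test(n)) return 'fy2';
    if (/FY-?3$|2022/.test(n)) return 'fy3';
    return null; // metric-name column or unrecognized period
  });
}

console.log(mapHeaders(['Metric', 'FY 2022', 'FY 2023', 'FY 2024', 'LTM Jun-25']));
// → [ null, 'fy3', 'fy2', 'fy1', 'ltm' ]
```

Note that the first column maps to `null`, which is why the row loop above reads values at `columnIndex + 1`.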
### Step 2.2: Integrate Structured Parser

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update multi-pass extraction (around lines 1063-1088)

**Add import:**
```typescript
import { parseFinancialsFromStructuredTable } from './structuredFinancialParser';
```

**Update financial extraction logic (around lines 1066-1088):**
```typescript
// Try structured table parsing first (most reliable)
try {
  const structuredTables = (options as any)?.structuredTables || [];
  const financialTables = structuredTables.filter((t: StructuredTable) => this.isFinancialTable(t));

  if (financialTables.length > 0) {
    logger.info('Attempting structured table parsing', {
      documentId,
      financialTableCount: financialTables.length
    });

    // Try each financial table until we get good data
    for (const table of financialTables) {
      const parsedFromTable = parseFinancialsFromStructuredTable(table);

      if (this.hasStructuredFinancialData(parsedFromTable)) {
        deterministicFinancials = parsedFromTable;
        deterministicFinancialChunk = this.buildDeterministicFinancialChunk(documentId, parsedFromTable);

        logger.info('Structured table parsing successful', {
          documentId,
          tableIndex: financialTables.indexOf(table),
          fy3: parsedFromTable.fy3,
          fy2: parsedFromTable.fy2,
          fy1: parsedFromTable.fy1,
          ltm: parsedFromTable.ltm
        });
        break; // Found good data, stop trying tables
      }
    }
  }
} catch (structuredParserError) {
  logger.warn('Structured table parsing failed, falling back to text parser', {
    documentId,
    error: structuredParserError instanceof Error ? structuredParserError.message : String(structuredParserError)
  });
}

// Fallback to text-based parsing if structured parsing failed
if (!deterministicFinancials) {
  try {
    const { parseFinancialsFromText } = await import('./financialTableParser');
    const parsedFinancials = parseFinancialsFromText(text);
    // ... existing code
  } catch (parserError) {
    // ... existing error handling
  }
}
```

---

## Rollback Plan

If Phase 1 causes issues:

### Quick Rollback (5 minutes)
```bash
git checkout HEAD -- backend/src/services/documentAiProcessor.ts
git checkout HEAD -- backend/src/services/optimizedAgenticRAGProcessor.ts
npm run build
npm start
```

### Feature Flag Approach (Recommended)
Add an environment variable to control the new behavior:

```typescript
// backend/src/config/env.ts
export const config = {
  features: {
    useStructuredTables: process.env.USE_STRUCTURED_TABLES === 'true'
  }
};
```

Then wrap the new code:
```typescript
if (config.features.useStructuredTables) {
  // Use structured tables
} else {
  // Use old flat text approach
}
```

---

## Success Criteria

### Phase 1 Success:
- ✅ 60%+ of CIM documents have populated financial data (validated via new telemetry)
- ✅ No regression in processing time (< 10% increase acceptable)
- ✅ No errors in table extraction pipeline
- ✅ Structured tables logged in console

### Phase 2 Success:
- ✅ 85%+ of CIM documents have populated financial data or fall back to the hybrid path when `tablesFound === 0`
- ✅ Column alignment accuracy > 95%
- ✅ Reduction in "Not specified in CIM" responses

---

## Monitoring & Debugging

### Key Metrics to Track
```typescript
// Add to processing result
metadata: {
  tablesFound: number;
  financialTablesIdentified: number;
  structuredParsingUsed: boolean;
  textParsingFallback: boolean;
  financialDataPopulated: boolean;
}
```

### Log Analysis Queries
```bash
# Find documents with no tables
grep "totalTables: 0" backend.log

# Find failed table extractions
grep "Failed to extract table" backend.log

# Find successful financial extractions
grep "Structured table parsing successful" backend.log
```

---

## Next Steps After Implementation

1. **Run on historical documents**: Reprocess 10-20 existing CIMs to compare before/after
2. **A/B test**: Process new documents with both the old and new systems, then compare results
3. **Tune thresholds**: Adjust the financial table identification heuristics based on results
4. **Document findings**: Update this plan with actual results and lessons learned
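For the A/B test in step 2, a field-level diff of the two systems' outputs makes regressions obvious. A minimal sketch; `diffFinancials` is a hypothetical helper comparing the financialSummary periods, not existing code:

```typescript
// Compare old vs. new extraction output field by field.
type Period = Record<string, string>;

function diffFinancials(
  oldRun: Record<string, Period>,
  newRun: Record<string, Period>
): string[] {
  const changes: string[] = [];
  for (const period of Object.keys(newRun)) {
    // Union of field names so removed fields are reported too
    const fields = new Set([
      ...Object.keys(oldRun[period] ?? {}),
      ...Object.keys(newRun[period])
    ]);
    for (const field of fields) {
      const before = oldRun[period]?.[field] ?? '(absent)';
      const after = newRun[period][field] ?? '(absent)';
      if (before !== after) changes.push(`${period}.${field}: ${before} -> ${after}`);
    }
  }
  return changes;
}

console.log(diffFinancials(
  { fy3: { revenue: 'Not specified in CIM' } },
  { fy3: { revenue: '$45.2M' } }
));
// → [ 'fy3.revenue: Not specified in CIM -> $45.2M' ]
```

A run with many `-> (absent)` entries would signal a regression rather than an improvement.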

---

## Resources

- [Document AI Table Extraction Docs](https://cloud.google.com/document-ai/docs/handle-response)
- [Financial Parser (current)](backend/src/services/financialTableParser.ts)
- [Financial Extractor (unused)](backend/src/utils/financialExtractor.ts)
- [Analysis Document](FINANCIAL_EXTRACTION_ANALYSIS.md)

README.md
@@ -38,10 +38,12 @@

### Documentation
- `APP_DESIGN_DOCUMENTATION.md` - Complete system architecture
- `AGENTIC_RAG_IMPLEMENTATION_PLAN.md` - AI processing strategy
- `PDF_GENERATION_ANALYSIS.md` - PDF generation optimization
- `DEPLOYMENT_GUIDE.md` - Deployment instructions
- `ARCHITECTURE_DIAGRAMS.md` - Visual architecture documentation
- `QUICK_START.md` - Quick start guide
- `TESTING_STRATEGY_DOCUMENTATION.md` - Testing guidelines
- `TROUBLESHOOTING_GUIDE.md` - Troubleshooting guide

### Configuration
- `backend/src/config/` - Environment and service configuration

@@ -94,9 +96,9 @@ cd frontend && npm run dev
- **uploadMonitoringService.ts** - Real-time upload tracking

### 3. Data Management
- **agenticRAGDatabaseService.ts** - Analytics and session management
- **vectorDatabaseService.ts** - Vector embeddings and search
- **sessionService.ts** - User session management
- **jobQueueService.ts** - Background job processing
- **jobProcessorService.ts** - Job execution logic

## 📊 Processing Strategies

@@ -188,7 +190,7 @@ Structured CIM Review data including:
## 🧪 Testing

### Test Structure
- **Unit Tests**: Jest for backend, Vitest for frontend
- **Unit Tests**: Vitest for backend and frontend
- **Integration Tests**: End-to-end testing
- **API Tests**: Supertest for backend endpoints

@@ -203,15 +205,12 @@ Structured CIM Review data including:

### Technical Documentation
- [Application Design Documentation](APP_DESIGN_DOCUMENTATION.md) - Complete system architecture
- [Agentic RAG Implementation Plan](AGENTIC_RAG_IMPLEMENTATION_PLAN.md) - AI processing strategy
- [PDF Generation Analysis](PDF_GENERATION_ANALYSIS.md) - PDF optimization details
- [Architecture Diagrams](ARCHITECTURE_DIAGRAMS.md) - Visual system design
- [Deployment Guide](DEPLOYMENT_GUIDE.md) - Deployment instructions

### Analysis Reports
- [Codebase Audit Report](codebase-audit-report.md) - Code quality analysis
- [Dependency Analysis Report](DEPENDENCY_ANALYSIS_REPORT.md) - Dependency management
- [Document AI Integration Summary](DOCUMENT_AI_INTEGRATION_SUMMARY.md) - Google Document AI setup
- [Quick Start Guide](QUICK_START.md) - Getting started
- [Testing Strategy](TESTING_STRATEGY_DOCUMENTATION.md) - Testing guidelines
- [Troubleshooting Guide](TROUBLESHOOTING_GUIDE.md) - Common issues and solutions

## 🤝 Contributing

@@ -121,10 +121,20 @@ EMAIL_WEEKLY_RECIPIENT=jpressnell@bluepointcapital.com

#SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6InNlcnZpY2Vfcm9sZSIsImlhdCI6MTc1MzgxNjY3OCwiZXhwIjoyMDY5MzkyNjc4fQ.f9PUzL1F8JqIkqD_DwrGBIyHPcehMo-97jXD8hee5ss

#SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI

#OPENROUTER_API_KEY=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

#ANTHROPIC_API_KEY=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

#OPENAI_API_KEY=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

SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6InNlcnZpY2Vfcm9sZSIsImlhdCI6MTc1MzgxNjY3OCwiZXhwIjoyMDY5MzkyNjc4fQ.f9PUzL1F8JqIkqD_DwrGBIyHPcehMo-97jXD8hee5ss

SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI

OPENROUTER_API_KEY=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

ANTHROPIC_API_KEY=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

OPENAI_API_KEY=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
OPENAI_API_KEY=sk-proj-dFNxetn-sm08kbZ8IpFROe0LgVQev3lEsyfrGNqdYruyW_mLATHXVGee3ay55zkDHDBYR_XX4T3BlbkFJ2mJVmqt5u58hqrPSLhDsoN6HPQD_vyQFCqtlePYagbcnAnRDcleK06pYUf-Z3NhzfD-ONkEoMA

142
backend/.env.bak3
Normal file
@@ -0,0 +1,142 @@
# Node Environment
NODE_ENV=testing

# Firebase Configuration (Testing Project) - ✅ COMPLETED
FB_PROJECT_ID=cim-summarizer-testing
FB_STORAGE_BUCKET=cim-summarizer-testing.firebasestorage.app
FB_API_KEY=AIzaSyBNf58cnNMbXb6VE3sVEJYJT5CGNQr0Kmg
FB_AUTH_DOMAIN=cim-summarizer-testing.firebaseapp.com

# Supabase Configuration (Testing Instance) - ✅ COMPLETED
SUPABASE_URL=https://gzoclmbqmgmpuhufbnhy.supabase.co

# Google Cloud Configuration (Testing Project) - ✅ COMPLETED
GCLOUD_PROJECT_ID=cim-summarizer-testing
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=575027767a9291f6
GCS_BUCKET_NAME=cim-processor-testing-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-processor-testing-processed
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey-testing.json

# LLM Configuration (Same as production but with cost limits) - ✅ COMPLETED
LLM_PROVIDER=anthropic
LLM_MAX_COST_PER_DOCUMENT=1.00
LLM_ENABLE_COST_OPTIMIZATION=true
LLM_USE_FAST_MODEL_FOR_SIMPLE_TASKS=true

# Email Configuration (Testing) - ✅ COMPLETED
EMAIL_HOST=smtp.gmail.com
EMAIL_PORT=587
EMAIL_USER=press7174@gmail.com
EMAIL_FROM=press7174@gmail.com
WEEKLY_EMAIL_RECIPIENT=jpressnell@bluepointcapital.com

# Vector Database (Testing)
VECTOR_PROVIDER=supabase

# Testing-specific settings
RATE_LIMIT_MAX_REQUESTS=1000
RATE_LIMIT_WINDOW_MS=900000
AGENTIC_RAG_DETAILED_LOGGING=true
AGENTIC_RAG_PERFORMANCE_TRACKING=true
AGENTIC_RAG_ERROR_REPORTING=true

# Week 8 Features Configuration
# Cost Monitoring
COST_MONITORING_ENABLED=true
USER_DAILY_COST_LIMIT=50.00
USER_MONTHLY_COST_LIMIT=500.00
DOCUMENT_COST_LIMIT=10.00
SYSTEM_DAILY_COST_LIMIT=1000.00

# Caching Configuration
CACHE_ENABLED=true
CACHE_TTL_HOURS=168
CACHE_SIMILARITY_THRESHOLD=0.85
CACHE_MAX_SIZE=10000

# Microservice Configuration
MICROSERVICE_ENABLED=true
MICROSERVICE_MAX_CONCURRENT_JOBS=5
MICROSERVICE_HEALTH_CHECK_INTERVAL=30000
MICROSERVICE_QUEUE_PROCESSING_INTERVAL=5000

# Processing Strategy
PROCESSING_STRATEGY=document_ai_agentic_rag
ENABLE_RAG_PROCESSING=true
ENABLE_PROCESSING_COMPARISON=false

# Agentic RAG Configuration
AGENTIC_RAG_ENABLED=true
AGENTIC_RAG_MAX_AGENTS=6
AGENTIC_RAG_PARALLEL_PROCESSING=true
AGENTIC_RAG_VALIDATION_STRICT=true
AGENTIC_RAG_RETRY_ATTEMPTS=3
AGENTIC_RAG_TIMEOUT_PER_AGENT=60000

# Agent-Specific Configuration
AGENT_DOCUMENT_UNDERSTANDING_ENABLED=true
AGENT_FINANCIAL_ANALYSIS_ENABLED=true
AGENT_MARKET_ANALYSIS_ENABLED=true
AGENT_INVESTMENT_THESIS_ENABLED=true
AGENT_SYNTHESIS_ENABLED=true
AGENT_VALIDATION_ENABLED=true

# Quality Control
AGENTIC_RAG_QUALITY_THRESHOLD=0.8
AGENTIC_RAG_COMPLETENESS_THRESHOLD=0.9
AGENTIC_RAG_CONSISTENCY_CHECK=true

# Logging Configuration
LOG_LEVEL=debug
LOG_FILE=logs/testing.log

# Security Configuration
BCRYPT_ROUNDS=10

# Database Configuration (Testing)
DATABASE_HOST=db.supabase.co
DATABASE_PORT=5432
DATABASE_NAME=postgres
DATABASE_USER=postgres
DATABASE_PASSWORD=your-testing-supabase-password

# Redis Configuration (Testing - using in-memory for testing)
REDIS_URL=redis://localhost:6379
REDIS_HOST=localhost
REDIS_PORT=6379
ALLOWED_FILE_TYPES=application/pdf
MAX_FILE_SIZE=52428800

GCLOUD_PROJECT_ID=324837881067
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=abb95bdd56632e4d
GCS_BUCKET_NAME=cim-processor-testing-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-processor-testing-processed
OPENROUTER_USE_BYOK=true

# Email Configuration
EMAIL_SECURE=false
EMAIL_WEEKLY_RECIPIENT=jpressnell@bluepointcapital.com

#SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6InNlcnZpY2Vfcm9sZSIsImlhdCI6MTc1MzgxNjY3OCwiZXhwIjoyMDY5MzkyNjc4fQ.f9PUzL1F8JqIkqD_DwrGBIyHPcehMo-97jXD8hee5ss

#SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI

#OPENROUTER_API_KEY=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

#ANTHROPIC_API_KEY=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

#OPENAI_API_KEY=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

SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6InNlcnZpY2Vfcm9sZSIsImlhdCI6MTc1MzgxNjY3OCwiZXhwIjoyMDY5MzkyNjc4fQ.f9PUzL1F8JqIkqD_DwrGBIyHPcehMo-97jXD8hee5ss

SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI

OPENROUTER_API_KEY=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

ANTHROPIC_API_KEY=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

OPENAI_API_KEY=sk-proj-dFNxetn-sm08kbZ8IpFROe0LgVQev3lEsyfrGNqdYruyW_mLATHXVGee3ay55zkDHDBYR_XX4T3BlbkFJ2mJVmqt5u58hqrPSLhDsoN6HPQD_vyQFCqtlePYagbcnAnRDcleK06pYUf-Z3NhzfD-ONkEoMA
LLM_MODEL=claude-3-7-sonnet-latest
LLM_MAX_TOKENS=16000

141
backend/.env.bak4
Normal file
@@ -0,0 +1,141 @@
# Node Environment
NODE_ENV=testing

# Firebase Configuration (Testing Project) - ✅ COMPLETED
FB_PROJECT_ID=cim-summarizer-testing
FB_STORAGE_BUCKET=cim-summarizer-testing.firebasestorage.app
FB_API_KEY=AIzaSyBNf58cnNMbXb6VE3sVEJYJT5CGNQr0Kmg
FB_AUTH_DOMAIN=cim-summarizer-testing.firebaseapp.com

# Supabase Configuration (Testing Instance) - ✅ COMPLETED
SUPABASE_URL=https://gzoclmbqmgmpuhufbnhy.supabase.co

# Google Cloud Configuration (Testing Project) - ✅ COMPLETED
GCLOUD_PROJECT_ID=cim-summarizer-testing
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=575027767a9291f6
GCS_BUCKET_NAME=cim-processor-testing-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-processor-testing-processed
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey-testing.json

# LLM Configuration (Same as production but with cost limits) - ✅ COMPLETED
LLM_PROVIDER=anthropic
LLM_MAX_COST_PER_DOCUMENT=1.00
LLM_ENABLE_COST_OPTIMIZATION=true
LLM_USE_FAST_MODEL_FOR_SIMPLE_TASKS=true

# Email Configuration (Testing) - ✅ COMPLETED
EMAIL_HOST=smtp.gmail.com
EMAIL_PORT=587
EMAIL_USER=press7174@gmail.com
EMAIL_FROM=press7174@gmail.com
WEEKLY_EMAIL_RECIPIENT=jpressnell@bluepointcapital.com

# Vector Database (Testing)
VECTOR_PROVIDER=supabase

# Testing-specific settings
RATE_LIMIT_MAX_REQUESTS=1000
RATE_LIMIT_WINDOW_MS=900000
AGENTIC_RAG_DETAILED_LOGGING=true
AGENTIC_RAG_PERFORMANCE_TRACKING=true
AGENTIC_RAG_ERROR_REPORTING=true

# Week 8 Features Configuration
# Cost Monitoring
COST_MONITORING_ENABLED=true
USER_DAILY_COST_LIMIT=50.00
USER_MONTHLY_COST_LIMIT=500.00
DOCUMENT_COST_LIMIT=10.00
SYSTEM_DAILY_COST_LIMIT=1000.00

# Caching Configuration
CACHE_ENABLED=true
CACHE_TTL_HOURS=168
CACHE_SIMILARITY_THRESHOLD=0.85
CACHE_MAX_SIZE=10000

# Microservice Configuration
MICROSERVICE_ENABLED=true
MICROSERVICE_MAX_CONCURRENT_JOBS=5
MICROSERVICE_HEALTH_CHECK_INTERVAL=30000
MICROSERVICE_QUEUE_PROCESSING_INTERVAL=5000

# Processing Strategy
PROCESSING_STRATEGY=document_ai_agentic_rag
ENABLE_RAG_PROCESSING=true
ENABLE_PROCESSING_COMPARISON=false

# Agentic RAG Configuration
AGENTIC_RAG_ENABLED=true
AGENTIC_RAG_MAX_AGENTS=6
AGENTIC_RAG_PARALLEL_PROCESSING=true
AGENTIC_RAG_VALIDATION_STRICT=true
AGENTIC_RAG_RETRY_ATTEMPTS=3
AGENTIC_RAG_TIMEOUT_PER_AGENT=60000

# Agent-Specific Configuration
AGENT_DOCUMENT_UNDERSTANDING_ENABLED=true
AGENT_FINANCIAL_ANALYSIS_ENABLED=true
AGENT_MARKET_ANALYSIS_ENABLED=true
AGENT_INVESTMENT_THESIS_ENABLED=true
AGENT_SYNTHESIS_ENABLED=true
AGENT_VALIDATION_ENABLED=true

# Quality Control
AGENTIC_RAG_QUALITY_THRESHOLD=0.8
AGENTIC_RAG_COMPLETENESS_THRESHOLD=0.9
AGENTIC_RAG_CONSISTENCY_CHECK=true

# Logging Configuration
LOG_LEVEL=debug
LOG_FILE=logs/testing.log

# Security Configuration
BCRYPT_ROUNDS=10

# Database Configuration (Testing)
DATABASE_HOST=db.supabase.co
DATABASE_PORT=5432
DATABASE_NAME=postgres
DATABASE_USER=postgres
DATABASE_PASSWORD=your-testing-supabase-password

# Redis Configuration (Testing - using in-memory for testing)
REDIS_URL=redis://localhost:6379
REDIS_HOST=localhost
REDIS_PORT=6379
ALLOWED_FILE_TYPES=application/pdf
MAX_FILE_SIZE=52428800

GCLOUD_PROJECT_ID=324837881067
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=abb95bdd56632e4d
GCS_BUCKET_NAME=cim-processor-testing-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-processor-testing-processed
OPENROUTER_USE_BYOK=true

# Email Configuration
EMAIL_SECURE=false
EMAIL_WEEKLY_RECIPIENT=jpressnell@bluepointcapital.com

#SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6InNlcnZpY2Vfcm9sZSIsImlhdCI6MTc1MzgxNjY3OCwiZXhwIjoyMDY5MzkyNjc4fQ.f9PUzL1F8JqIkqD_DwrGBIyHPcehMo-97jXD8hee5ss

#SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI

#OPENROUTER_API_KEY=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

#ANTHROPIC_API_KEY=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

#OPENAI_API_KEY=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


SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI

OPENROUTER_API_KEY=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
|
||||
|
||||
ANTHROPIC_API_KEY=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
|
||||
|
||||
OPENAI_API_KEY=sk-proj-dFNxetn-sm08kbZ8IpFROe0LgVQev3lEsyfrGNqdYruyW_mLATHXVGee3ay55zkDHDBYR_XX4T3BlbkFJ2mJVmqt5u58hqrPSLhDsoN6HPQD_vyQFCqtlePYagbcnAnRDcleK06pYUf-Z3NhzfD-ONkEoMA
|
||||
LLM_MODEL=claude-3-7-sonnet-latest
|
||||
LLM_MAX_TOKENS=16000
|
||||
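A file this large makes it easy for a required variable to go missing between environments. One way to catch that is a fail-fast startup check; below is a minimal, hypothetical sketch (the helper name and the choice of required keys are illustrative, not part of this repo), using variable names from the file above:

```typescript
// Hypothetical startup check: report required env vars that are missing
// or empty, instead of failing deep inside a request handler later.
const REQUIRED_ENV_VARS = [
  'NODE_ENV',
  'SUPABASE_URL',
  'SUPABASE_ANON_KEY',
  'GCLOUD_PROJECT_ID',
  'LLM_MODEL',
] as const;

function findMissingEnvVars(env: Record<string, string | undefined>): string[] {
  // Keep the declaration order so the error message is stable.
  return REQUIRED_ENV_VARS.filter((name) => !env[name]);
}

// Example: a partial environment missing two required keys.
const missing = findMissingEnvVars({
  NODE_ENV: 'testing',
  SUPABASE_URL: 'https://example.supabase.co',
  LLM_MODEL: 'claude-3-7-sonnet-latest',
});
console.log(missing); // → ['SUPABASE_ANON_KEY', 'GCLOUD_PROJECT_ID']
```

In a real service this would run once at boot against `process.env` and throw if the returned array is non-empty.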
140
backend/.env.pre-clean-20251110-023705.bak
Normal file
@@ -0,0 +1,140 @@
# Node Environment
NODE_ENV=testing

# Firebase Configuration (Testing Project) - ✅ COMPLETED
FB_PROJECT_ID=cim-summarizer-testing
FB_STORAGE_BUCKET=cim-summarizer-testing.firebasestorage.app
FB_API_KEY=AIzaSyBNf58cnNMbXb6VE3sVEJYJT5CGNQr0Kmg
FB_AUTH_DOMAIN=cim-summarizer-testing.firebaseapp.com

# Supabase Configuration (Testing Instance) - ✅ COMPLETED
SUPABASE_URL=https://gzoclmbqmgmpuhufbnhy.supabase.co

# Google Cloud Configuration (Testing Project) - ✅ COMPLETED
GCLOUD_PROJECT_ID=cim-summarizer-testing
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=575027767a9291f6
GCS_BUCKET_NAME=cim-processor-testing-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-processor-testing-processed
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey-testing.json

# LLM Configuration (Same as production but with cost limits) - ✅ COMPLETED
LLM_PROVIDER=anthropic
LLM_MAX_COST_PER_DOCUMENT=1.00
LLM_ENABLE_COST_OPTIMIZATION=true
LLM_USE_FAST_MODEL_FOR_SIMPLE_TASKS=true

# Email Configuration (Testing) - ✅ COMPLETED
EMAIL_HOST=smtp.gmail.com
EMAIL_PORT=587
EMAIL_USER=press7174@gmail.com
EMAIL_FROM=press7174@gmail.com
WEEKLY_EMAIL_RECIPIENT=jpressnell@bluepointcapital.com

# Vector Database (Testing)
VECTOR_PROVIDER=supabase

# Testing-specific settings
RATE_LIMIT_MAX_REQUESTS=1000
RATE_LIMIT_WINDOW_MS=900000
AGENTIC_RAG_DETAILED_LOGGING=true
AGENTIC_RAG_PERFORMANCE_TRACKING=true
AGENTIC_RAG_ERROR_REPORTING=true

# Week 8 Features Configuration
# Cost Monitoring
COST_MONITORING_ENABLED=true
USER_DAILY_COST_LIMIT=50.00
USER_MONTHLY_COST_LIMIT=500.00
DOCUMENT_COST_LIMIT=10.00
SYSTEM_DAILY_COST_LIMIT=1000.00

# Caching Configuration
CACHE_ENABLED=true
CACHE_TTL_HOURS=168
CACHE_SIMILARITY_THRESHOLD=0.85
CACHE_MAX_SIZE=10000

# Microservice Configuration
MICROSERVICE_ENABLED=true
MICROSERVICE_MAX_CONCURRENT_JOBS=5
MICROSERVICE_HEALTH_CHECK_INTERVAL=30000
MICROSERVICE_QUEUE_PROCESSING_INTERVAL=5000

# Processing Strategy
PROCESSING_STRATEGY=document_ai_agentic_rag
ENABLE_RAG_PROCESSING=true
ENABLE_PROCESSING_COMPARISON=false

# Agentic RAG Configuration
AGENTIC_RAG_ENABLED=true
AGENTIC_RAG_MAX_AGENTS=6
AGENTIC_RAG_PARALLEL_PROCESSING=true
AGENTIC_RAG_VALIDATION_STRICT=true
AGENTIC_RAG_RETRY_ATTEMPTS=3
AGENTIC_RAG_TIMEOUT_PER_AGENT=60000

# Agent-Specific Configuration
AGENT_DOCUMENT_UNDERSTANDING_ENABLED=true
AGENT_FINANCIAL_ANALYSIS_ENABLED=true
AGENT_MARKET_ANALYSIS_ENABLED=true
AGENT_INVESTMENT_THESIS_ENABLED=true
AGENT_SYNTHESIS_ENABLED=true
AGENT_VALIDATION_ENABLED=true

# Quality Control
AGENTIC_RAG_QUALITY_THRESHOLD=0.8
AGENTIC_RAG_COMPLETENESS_THRESHOLD=0.9
AGENTIC_RAG_CONSISTENCY_CHECK=true

# Logging Configuration
LOG_LEVEL=debug
LOG_FILE=logs/testing.log

# Security Configuration
BCRYPT_ROUNDS=10

# Database Configuration (Testing)
DATABASE_HOST=db.supabase.co
DATABASE_PORT=5432
DATABASE_NAME=postgres
DATABASE_USER=postgres
DATABASE_PASSWORD=your-testing-supabase-password

# Redis Configuration (Testing - using in-memory for testing)
REDIS_URL=redis://localhost:6379
REDIS_HOST=localhost
REDIS_PORT=6379
ALLOWED_FILE_TYPES=application/pdf
MAX_FILE_SIZE=52428800

GCLOUD_PROJECT_ID=324837881067
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=abb95bdd56632e4d
GCS_BUCKET_NAME=cim-processor-testing-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-processor-testing-processed
OPENROUTER_USE_BYOK=true

# Email Configuration
EMAIL_SECURE=false
EMAIL_WEEKLY_RECIPIENT=jpressnell@bluepointcapital.com

#SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6InNlcnZpY2Vfcm9sZSIsImlhdCI6MTc1MzgxNjY3OCwiZXhwIjoyMDY5MzkyNjc4fQ.f9PUzL1F8JqIkqD_DwrGBIyHPcehMo-97jXD8hee5ss

#SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI

#OPENROUTER_API_KEY=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

#ANTHROPIC_API_KEY=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

#OPENAI_API_KEY=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


OPENROUTER_API_KEY=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

ANTHROPIC_API_KEY=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

OPENAI_API_KEY=sk-proj-dFNxetn-sm08kbZ8IpFROe0LgVQev3lEsyfrGNqdYruyW_mLATHXVGee3ay55zkDHDBYR_XX4T3BlbkFJ2mJVmqt5u58hqrPSLhDsoN6HPQD_vyQFCqtlePYagbcnAnRDcleK06pYUf-Z3NhzfD-ONkEoMA

LLM_MODEL=claude-3-7-sonnet-latest
LLM_MAX_TOKENS=16000
144
backend/.env.pre-clean-20251110-144822.bak
Normal file
@@ -0,0 +1,144 @@
# Node Environment
NODE_ENV=testing

# Firebase Configuration (Testing Project) - ✅ COMPLETED
FB_PROJECT_ID=cim-summarizer-testing
FB_STORAGE_BUCKET=cim-summarizer-testing.firebasestorage.app
FB_API_KEY=AIzaSyBNf58cnNMbXb6VE3sVEJYJT5CGNQr0Kmg
FB_AUTH_DOMAIN=cim-summarizer-testing.firebaseapp.com

# Supabase Configuration (Testing Instance) - ✅ COMPLETED
SUPABASE_URL=https://gzoclmbqmgmpuhufbnhy.supabase.co

# Google Cloud Configuration (Testing Project) - ✅ COMPLETED
GCLOUD_PROJECT_ID=cim-summarizer-testing
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=575027767a9291f6
GCS_BUCKET_NAME=cim-processor-testing-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-processor-testing-processed
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey-testing.json

# LLM Configuration (Same as production but with cost limits) - ✅ COMPLETED
LLM_PROVIDER=anthropic
LLM_MAX_COST_PER_DOCUMENT=1.00
LLM_ENABLE_COST_OPTIMIZATION=true
LLM_USE_FAST_MODEL_FOR_SIMPLE_TASKS=true

# Email Configuration (Testing) - ✅ COMPLETED
EMAIL_HOST=smtp.gmail.com
EMAIL_PORT=587
EMAIL_USER=press7174@gmail.com
EMAIL_FROM=press7174@gmail.com
WEEKLY_EMAIL_RECIPIENT=jpressnell@bluepointcapital.com

# Vector Database (Testing)
VECTOR_PROVIDER=supabase

# Testing-specific settings
RATE_LIMIT_MAX_REQUESTS=1000
RATE_LIMIT_WINDOW_MS=900000
AGENTIC_RAG_DETAILED_LOGGING=true
AGENTIC_RAG_PERFORMANCE_TRACKING=true
AGENTIC_RAG_ERROR_REPORTING=true

# Week 8 Features Configuration
# Cost Monitoring
COST_MONITORING_ENABLED=true
USER_DAILY_COST_LIMIT=50.00
USER_MONTHLY_COST_LIMIT=500.00
DOCUMENT_COST_LIMIT=10.00
SYSTEM_DAILY_COST_LIMIT=1000.00

# Caching Configuration
CACHE_ENABLED=true
CACHE_TTL_HOURS=168
CACHE_SIMILARITY_THRESHOLD=0.85
CACHE_MAX_SIZE=10000

# Microservice Configuration
MICROSERVICE_ENABLED=true
MICROSERVICE_MAX_CONCURRENT_JOBS=5
MICROSERVICE_HEALTH_CHECK_INTERVAL=30000
MICROSERVICE_QUEUE_PROCESSING_INTERVAL=5000

# Processing Strategy
PROCESSING_STRATEGY=document_ai_agentic_rag
ENABLE_RAG_PROCESSING=true
ENABLE_PROCESSING_COMPARISON=false

# Agentic RAG Configuration
AGENTIC_RAG_ENABLED=true
AGENTIC_RAG_MAX_AGENTS=6
AGENTIC_RAG_PARALLEL_PROCESSING=true
AGENTIC_RAG_VALIDATION_STRICT=true
AGENTIC_RAG_RETRY_ATTEMPTS=3
AGENTIC_RAG_TIMEOUT_PER_AGENT=60000

# Agent-Specific Configuration
AGENT_DOCUMENT_UNDERSTANDING_ENABLED=true
AGENT_FINANCIAL_ANALYSIS_ENABLED=true
AGENT_MARKET_ANALYSIS_ENABLED=true
AGENT_INVESTMENT_THESIS_ENABLED=true
AGENT_SYNTHESIS_ENABLED=true
AGENT_VALIDATION_ENABLED=true

# Quality Control
AGENTIC_RAG_QUALITY_THRESHOLD=0.8
AGENTIC_RAG_COMPLETENESS_THRESHOLD=0.9
AGENTIC_RAG_CONSISTENCY_CHECK=true

# Logging Configuration
LOG_LEVEL=debug
LOG_FILE=logs/testing.log

# Security Configuration
BCRYPT_ROUNDS=10

# Database Configuration (Testing)
DATABASE_HOST=db.supabase.co
DATABASE_PORT=5432
DATABASE_NAME=postgres
DATABASE_USER=postgres
DATABASE_PASSWORD=your-testing-supabase-password

# Redis Configuration (Testing - using in-memory for testing)
REDIS_URL=redis://localhost:6379
REDIS_HOST=localhost
REDIS_PORT=6379
ALLOWED_FILE_TYPES=application/pdf
MAX_FILE_SIZE=52428800

GCLOUD_PROJECT_ID=324837881067
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=abb95bdd56632e4d
GCS_BUCKET_NAME=cim-processor-testing-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-processor-testing-processed
OPENROUTER_USE_BYOK=true

# Email Configuration
EMAIL_SECURE=false
EMAIL_WEEKLY_RECIPIENT=jpressnell@bluepointcapital.com

#SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6InNlcnZpY2Vfcm9sZSIsImlhdCI6MTc1MzgxNjY3OCwiZXhwIjoyMDY5MzkyNjc4fQ.f9PUzL1F8JqIkqD_DwrGBIyHPcehMo-97jXD8hee5ss

#SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI




LLM_MODEL=claude-3-7-sonnet-latest
LLM_MAX_TOKENS=16000

SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6InNlcnZpY2Vfcm9sZSIsImlhdCI6MTc1MzgxNjY3OCwiZXhwIjoyMDY5MzkyNjc4fQ.f9PUzL1F8JqIkqD_DwrGBIyHPcehMo-97jXD8hee5ss

SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6Imd6b2NsbWJxbWdtcHVodWZibmh5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTM4MTY2NzgsImV4cCI6MjA2OTM5MjY3OH0.Jg8cAKbujDv7YgeLCeHsOkgkP-LwM-7fAXVIHno0pLI

OPENROUTER_API_KEY=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

ANTHROPIC_API_KEY=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

OPENAI_API_KEY=sk-proj-dFNxetn-sm08kbZ8IpFROe0LgVQev3lEsyfrGNqdYruyW_mLATHXVGee3ay55zkDHDBYR_XX4T3BlbkFJ2mJVmqt5u58hqrPSLhDsoN6HPQD_vyQFCqtlePYagbcnAnRDcleK06pYUf-Z3NhzfD-ONkEoMA
169
backend/src/config/constants.ts
Normal file
@@ -0,0 +1,169 @@
/**
 * Application-wide constants
 * Centralized location for model configurations, cost rates, timeouts, and other constants
 */

/**
 * LLM Model Cost Rates (USD per 1M tokens)
 * Used for cost estimation in LLM service
 */
export const LLM_COST_RATES: Record<string, { input: number; output: number }> = {
  'claude-3-opus-20240229': { input: 15, output: 75 },
  'claude-sonnet-4-5-20250929': { input: 3, output: 15 }, // Sonnet 4.5
  'claude-3-5-sonnet-20241022': { input: 3, output: 15 },
  'claude-haiku-4-5-20251015': { input: 0.25, output: 1.25 }, // Haiku 4.5 (released Oct 15, 2025)
  'claude-3-5-haiku-20241022': { input: 0.25, output: 1.25 },
  'claude-3-5-haiku-latest': { input: 0.25, output: 1.25 },
  'gpt-4o': { input: 5, output: 15 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
};

/**
 * Default cost rate fallback (used when model not found in cost rates)
 */
export const DEFAULT_COST_RATE = LLM_COST_RATES['claude-3-5-sonnet-20241022'];

/**
 * OpenRouter Model Name Mappings
 * Maps Anthropic model names to OpenRouter API format
 */
export const OPENROUTER_MODEL_MAPPINGS: Record<string, string> = {
  // Claude 4.x models
  'claude-sonnet-4-5-20250929': 'anthropic/claude-sonnet-4.5',
  'claude-sonnet-4': 'anthropic/claude-sonnet-4.5',
  'claude-haiku-4-5-20251015': 'anthropic/claude-haiku-4.5',
  'claude-haiku-4': 'anthropic/claude-haiku-4.5',
  'claude-opus-4': 'anthropic/claude-opus-4',

  // Claude 3.7 models
  'claude-3-7-sonnet-latest': 'anthropic/claude-3.7-sonnet',
  'claude-3-7-sonnet': 'anthropic/claude-3.7-sonnet',

  // Claude 3.5 models
  'claude-3-5-sonnet-20241022': 'anthropic/claude-3.5-sonnet',
  'claude-3-5-sonnet': 'anthropic/claude-3.5-sonnet',
  'claude-3-5-haiku-20241022': 'anthropic/claude-3.5-haiku',
  'claude-3-5-haiku-latest': 'anthropic/claude-3.5-haiku',
  'claude-3-5-haiku': 'anthropic/claude-3.5-haiku',

  // Claude 3.0 models
  'claude-3-haiku': 'anthropic/claude-3-haiku',
  'claude-3-opus': 'anthropic/claude-3-opus',
};

/**
 * Map Anthropic model name to OpenRouter format
 * Handles versioned and generic model names
 */
export function mapModelToOpenRouter(model: string): string {
  // Check direct mapping first
  if (OPENROUTER_MODEL_MAPPINGS[model]) {
    return OPENROUTER_MODEL_MAPPINGS[model];
  }

  // Handle pattern-based matching for versioned models. Note: dated names
  // such as 'claude-3-5-sonnet-20241022' contain a '4' in the date, so the
  // broad includes('4') checks only behave correctly because dated names
  // are expected to be caught by the direct mapping above.
  if (model.includes('claude')) {
    if (model.includes('sonnet') && model.includes('4')) {
      return 'anthropic/claude-sonnet-4.5';
    } else if (model.includes('haiku') && (model.includes('4-5') || model.includes('4.5'))) {
      return 'anthropic/claude-haiku-4.5';
    } else if (model.includes('haiku') && model.includes('4')) {
      return 'anthropic/claude-haiku-4.5';
    } else if (model.includes('opus') && model.includes('4')) {
      return 'anthropic/claude-opus-4';
    } else if (model.includes('sonnet') && (model.includes('4.5') || model.includes('4-5'))) {
      return 'anthropic/claude-sonnet-4.5';
    } else if (model.includes('sonnet') && model.includes('3.7')) {
      return 'anthropic/claude-3.7-sonnet';
    } else if (model.includes('sonnet') && model.includes('3.5')) {
      return 'anthropic/claude-3.5-sonnet';
    } else if (model.includes('haiku') && model.includes('3.5')) {
      return 'anthropic/claude-3.5-haiku';
    } else if (model.includes('haiku') && model.includes('3')) {
      return 'anthropic/claude-3-haiku';
    } else if (model.includes('opus') && model.includes('3')) {
      return 'anthropic/claude-3-opus';
    }

    // Fallback: try to construct from model name
    return `anthropic/${model}`;
  }

  // Return model as-is if no mapping found
  return model;
}

/**
 * LLM Timeout Constants (in milliseconds)
 */
export const LLM_TIMEOUTS = {
  DEFAULT: 180000, // 3 minutes
  COMPLEX_ANALYSIS: 360000, // 6 minutes for complex CIM analysis
  OPENROUTER_DEFAULT: 360000, // 6 minutes for OpenRouter
  ABORT_BUFFER: 10000, // 10 seconds buffer before wrapper timeout
  SDK_BUFFER: 10000, // 10 seconds buffer for SDK timeout
} as const;

/**
 * Token Estimation Constants
 */
export const TOKEN_ESTIMATION = {
  CHARS_PER_TOKEN: 4, // Rough estimation: 1 token ≈ 4 characters for English text
  INPUT_OUTPUT_RATIO: 0.8, // Assume 80% input, 20% output for cost estimation
} as const;

/**
 * Default LLM Configuration Values
 */
export const LLM_DEFAULTS = {
  MAX_TOKENS: 16000,
  TEMPERATURE: 0.1,
  PROMPT_BUFFER: 500,
  MAX_INPUT_TOKENS: 200000,
  DEFAULT_MAX_TOKENS_SIMPLE: 3000,
  DEFAULT_TEMPERATURE_SIMPLE: 0.3,
} as const;

/**
 * OpenRouter API Configuration
 */
export const OPENROUTER_CONFIG = {
  BASE_URL: 'https://openrouter.ai/api/v1/chat/completions',
  HTTP_REFERER: 'https://cim-summarizer-testing.firebaseapp.com',
  X_TITLE: 'CIM Summarizer',
} as const;

/**
 * Retry Configuration
 */
export const RETRY_CONFIG = {
  MAX_ATTEMPTS: 3,
  INITIAL_DELAY_MS: 1000, // 1 second
  MAX_DELAY_MS: 10000, // 10 seconds
  BACKOFF_MULTIPLIER: 2,
} as const;

/**
 * Cost Estimation Helper
 * Estimates cost for a given number of tokens and model
 */
export function estimateLLMCost(tokens: number, model: string): number {
  const rates = LLM_COST_RATES[model] || DEFAULT_COST_RATE;
  if (!rates) {
    return 0;
  }

  const inputCost = (tokens * TOKEN_ESTIMATION.INPUT_OUTPUT_RATIO * rates.input) / 1000000;
  const outputCost = (tokens * (1 - TOKEN_ESTIMATION.INPUT_OUTPUT_RATIO) * rates.output) / 1000000;

  return inputCost + outputCost;
}

/**
 * Token Count Estimation Helper
 * Rough estimation based on character count
 */
export function estimateTokenCount(text: string): number {
  return Math.ceil(text.length / TOKEN_ESTIMATION.CHARS_PER_TOKEN);
}
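The cost helper in `constants.ts` splits an overall token count 80% input / 20% output and prices each side at the model's per-1M-token rate. A minimal standalone sketch of that arithmetic, with the Sonnet rates copied from `LLM_COST_RATES` (the function name here is illustrative):

```typescript
// Rates for claude-3-5-sonnet-20241022, USD per 1M tokens (from LLM_COST_RATES).
const RATES = { input: 3, output: 15 };
const INPUT_OUTPUT_RATIO = 0.8; // TOKEN_ESTIMATION.INPUT_OUTPUT_RATIO

// Same math as estimateLLMCost: price the input share and output share separately.
function estimateCost(tokens: number): number {
  const inputCost = (tokens * INPUT_OUTPUT_RATIO * RATES.input) / 1_000_000;
  const outputCost = (tokens * (1 - INPUT_OUTPUT_RATIO) * RATES.output) / 1_000_000;
  return inputCost + outputCost;
}

// 10,000 tokens: (8,000 × $3 + 2,000 × $15) / 1M = $0.024 + $0.030 = $0.054
console.log(estimateCost(10_000).toFixed(3)); // "0.054"
```

Because output tokens cost 5× input tokens here, the fixed 80/20 split can noticeably under- or over-estimate jobs whose real ratio differs, which is worth keeping in mind when tuning `DOCUMENT_COST_LIMIT`-style budgets.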
@@ -1,31 +0,0 @@
// This file is deprecated - use Supabase client instead
// Kept for compatibility with legacy code that might import it

import { getSupabaseServiceClient } from './supabase';
import { logger } from '../utils/logger';

// Legacy pool interface for backward compatibility
const createLegacyPoolInterface = () => {
  const supabase = getSupabaseServiceClient();

  return {
    query: async (text: string, params?: any[]) => {
      logger.warn('Using legacy pool.query - consider migrating to Supabase client directly');

      // This is a basic compatibility layer - for complex queries, use Supabase directly
      throw new Error('Legacy pool.query not implemented - use Supabase client directly');
    },

    end: async () => {
      logger.info('Legacy pool.end() called - no action needed for Supabase');
    }
  };
};

// Create legacy pool interface
const pool = createLegacyPoolInterface();

// Log that we're using Supabase instead of PostgreSQL
logger.info('Database connection configured for Supabase (cloud-native)');

export default pool;
@@ -1,129 +0,0 @@
import { Request, Response } from 'express';
import { AuthenticatedRequest } from '../middleware/auth';
import logger from '../utils/logger';

export interface RegisterRequest extends Request {
  body: {
    email: string;
    name: string;
    password: string;
  };
}

export interface LoginRequest extends Request {
  body: {
    email: string;
    password: string;
  };
}

export interface RefreshTokenRequest extends Request {
  body: {
    refreshToken: string;
  };
}

/**
 * DEPRECATED: Legacy auth controller
 * All auth functions are now handled by Firebase Auth
 */
export const authController = {
  async register(_req: RegisterRequest, res: Response): Promise<void> {
    logger.warn('Legacy register endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy registration is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async login(_req: LoginRequest, res: Response): Promise<void> {
    logger.warn('Legacy login endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy login is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async refreshToken(_req: RefreshTokenRequest, res: Response): Promise<void> {
    logger.warn('Legacy refresh token endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy token refresh is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async logout(_req: AuthenticatedRequest, res: Response): Promise<void> {
    logger.warn('Legacy logout endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy logout is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async getProfile(_req: AuthenticatedRequest, res: Response): Promise<void> {
    logger.warn('Legacy profile endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy profile access is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async updateProfile(_req: AuthenticatedRequest, res: Response): Promise<void> {
    logger.warn('Legacy profile update endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy profile updates are disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async changePassword(_req: AuthenticatedRequest, res: Response): Promise<void> {
    logger.warn('Legacy password change endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy password changes are disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async deleteAccount(_req: AuthenticatedRequest, res: Response): Promise<void> {
    logger.warn('Legacy account deletion endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy account deletion is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async verifyEmail(_req: Request, res: Response): Promise<void> {
    logger.warn('Legacy email verification endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy email verification is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async requestPasswordReset(_req: Request, res: Response): Promise<void> {
    logger.warn('Legacy password reset endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy password reset is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  },

  async resetPassword(_req: Request, res: Response): Promise<void> {
    logger.warn('Legacy password reset endpoint is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy password reset is disabled. Use Firebase Auth instead.',
      error: 'DEPRECATED_ENDPOINT'
    });
  }
};
@@ -1,107 +0,0 @@
import { Request, Response, NextFunction } from 'express';
import logger from '../utils/logger';

export interface AuthenticatedRequest extends Request {
  user?: import('firebase-admin').auth.DecodedIdToken;
}

/**
 * DEPRECATED: Legacy authentication middleware
 * Use Firebase Auth instead via ../middleware/firebaseAuth
 */
export async function authenticateToken(
  _req: AuthenticatedRequest,
  res: Response,
  _next: NextFunction
): Promise<void> {
  logger.warn('Legacy auth middleware is deprecated. Use Firebase Auth instead.');
  res.status(501).json({
    success: false,
    message: 'Legacy authentication is disabled. Use Firebase Auth instead.'
  });
}

// Alias for backward compatibility
export const auth = authenticateToken;

/**
 * DEPRECATED: Role-based authorization middleware
 */
export function requireRole(_allowedRoles: string[]) {
  return (_req: AuthenticatedRequest, res: Response, _next: NextFunction): void => {
    logger.warn('Legacy role-based auth is deprecated. Use Firebase Auth instead.');
    res.status(501).json({
      success: false,
      message: 'Legacy role-based authentication is disabled. Use Firebase Auth instead.'
    });
  };
}

/**
 * DEPRECATED: Admin-only middleware
 */
export function requireAdmin(
  _req: AuthenticatedRequest,
  res: Response,
  _next: NextFunction
): void {
  logger.warn('Legacy admin auth is deprecated. Use Firebase Auth instead.');
  res.status(501).json({
    success: false,
    message: 'Legacy admin authentication is disabled. Use Firebase Auth instead.'
  });
}

/**
 * DEPRECATED: User or admin middleware
 */
export function requireUserOrAdmin(
  _req: AuthenticatedRequest,
  res: Response,
  _next: NextFunction
): void {
  logger.warn('Legacy user/admin auth is deprecated. Use Firebase Auth instead.');
  res.status(501).json({
    success: false,
    message: 'Legacy user/admin authentication is disabled. Use Firebase Auth instead.'
  });
}

/**
 * DEPRECATED: Optional authentication middleware
 */
export async function optionalAuth(
  _req: AuthenticatedRequest,
  _res: Response,
  next: NextFunction
): Promise<void> {
  logger.debug('Legacy optional auth is deprecated. Use Firebase Auth instead.');
  // For optional auth, we just continue without authentication
  next();
}

/**
 * DEPRECATED: Rate limiting middleware
 */
export function authRateLimit(
  _req: Request,
  _res: Response,
  next: NextFunction
): void {
  next();
}

/**
 * DEPRECATED: Logout middleware
 */
export async function logout(
  _req: AuthenticatedRequest,
  res: Response,
  _next: NextFunction
): Promise<void> {
  logger.warn('Legacy logout is deprecated. Use Firebase Auth instead.');
  res.status(501).json({
    success: false,
    message: 'Legacy logout is disabled. Use Firebase Auth instead.'
  });
}
@@ -0,0 +1,232 @@
-- Migration: Add financial extraction monitoring tables
-- Created: 2025-01-XX
-- Description: Track financial extraction accuracy, errors, and API call patterns

-- Table to track financial extraction events
CREATE TABLE IF NOT EXISTS financial_extraction_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  job_id UUID REFERENCES processing_jobs(id) ON DELETE SET NULL,
  user_id UUID REFERENCES users(id) ON DELETE SET NULL,

  -- Extraction details
  extraction_method TEXT NOT NULL, -- 'deterministic_parser', 'llm_haiku', 'llm_sonnet', 'fallback'
  model_used TEXT, -- e.g., 'claude-3-5-haiku-latest', 'claude-sonnet-4-5-20250514'
  attempt_number INTEGER DEFAULT 1,

  -- Results
  success BOOLEAN NOT NULL,
  has_financials BOOLEAN DEFAULT FALSE,
  periods_extracted TEXT[], -- Array of periods found: ['fy3', 'fy2', 'fy1', 'ltm']
  metrics_extracted TEXT[], -- Array of metrics: ['revenue', 'ebitda', 'ebitdaMargin', etc.]

  -- Validation results
  validation_passed BOOLEAN,
  validation_issues TEXT[], -- Array of validation warnings/errors
  auto_corrections_applied INTEGER DEFAULT 0, -- Number of auto-corrections (e.g., margin fixes)

  -- API call tracking
  api_call_duration_ms INTEGER,
  tokens_used INTEGER,
  cost_estimate_usd DECIMAL(10, 6),
  rate_limit_hit BOOLEAN DEFAULT FALSE,

  -- Error tracking
  error_type TEXT, -- 'rate_limit', 'validation_failure', 'api_error', 'timeout', etc.
  error_message TEXT,
  error_code TEXT,

  -- Timing
  processing_time_ms INTEGER,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Indexes for common queries. PostgreSQL does not support inline INDEX
-- clauses inside CREATE TABLE, so these are separate statements.
CREATE INDEX IF NOT EXISTS idx_financial_extraction_events_document_id ON financial_extraction_events(document_id);
CREATE INDEX IF NOT EXISTS idx_financial_extraction_events_created_at ON financial_extraction_events(created_at DESC);
CREATE INDEX IF NOT EXISTS idx_financial_extraction_events_success ON financial_extraction_events(success);
CREATE INDEX IF NOT EXISTS idx_financial_extraction_events_method ON financial_extraction_events(extraction_method);

-- Table to track API call patterns (for rate limit prevention)
CREATE TABLE IF NOT EXISTS api_call_tracking (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  provider TEXT NOT NULL, -- 'anthropic', 'openai', 'openrouter'
  model TEXT NOT NULL,
  endpoint TEXT NOT NULL, -- 'financial_extraction', 'full_extraction', etc.

  -- Call details
  timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  duration_ms INTEGER,
  success BOOLEAN NOT NULL,
  rate_limit_hit BOOLEAN DEFAULT FALSE,
  retry_attempt INTEGER DEFAULT 0,

  -- Token usage
  input_tokens INTEGER,
  output_tokens INTEGER,
  total_tokens INTEGER,

  -- Cost tracking
  cost_usd DECIMAL(10, 6),

  -- Error details (if failed)
  error_type TEXT,
  error_message TEXT
);

-- Indexes for rate limit tracking
CREATE INDEX IF NOT EXISTS idx_api_call_tracking_provider_model ON api_call_tracking(provider, model);
CREATE INDEX IF NOT EXISTS idx_api_call_tracking_timestamp ON api_call_tracking(timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_api_call_tracking_rate_limit ON api_call_tracking(rate_limit_hit, timestamp DESC);
|
||||
-- Table for aggregated metrics (updated periodically)
|
||||
CREATE TABLE IF NOT EXISTS financial_extraction_metrics (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
metric_date DATE NOT NULL UNIQUE,
|
||||
|
||||
-- Success metrics
|
||||
total_extractions INTEGER DEFAULT 0,
|
||||
successful_extractions INTEGER DEFAULT 0,
|
||||
failed_extractions INTEGER DEFAULT 0,
|
||||
success_rate DECIMAL(5, 4), -- 0.0000 to 1.0000
|
||||
|
||||
-- Method breakdown
|
||||
deterministic_parser_count INTEGER DEFAULT 0,
|
||||
llm_haiku_count INTEGER DEFAULT 0,
|
||||
llm_sonnet_count INTEGER DEFAULT 0,
|
||||
fallback_count INTEGER DEFAULT 0,
|
||||
|
||||
-- Accuracy metrics
|
||||
avg_periods_extracted DECIMAL(3, 2), -- Average number of periods extracted
|
||||
avg_metrics_extracted DECIMAL(5, 2), -- Average number of metrics extracted
|
||||
validation_pass_rate DECIMAL(5, 4),
|
||||
avg_auto_corrections DECIMAL(5, 2),
|
||||
|
||||
-- Performance metrics
|
||||
avg_processing_time_ms INTEGER,
|
||||
avg_api_call_duration_ms INTEGER,
|
||||
p95_processing_time_ms INTEGER,
|
||||
p99_processing_time_ms INTEGER,
|
||||
|
||||
-- Cost metrics
|
||||
total_cost_usd DECIMAL(10, 2),
|
||||
avg_cost_per_extraction_usd DECIMAL(10, 6),
|
||||
|
||||
-- Error metrics
|
||||
rate_limit_errors INTEGER DEFAULT 0,
|
||||
validation_errors INTEGER DEFAULT 0,
|
||||
api_errors INTEGER DEFAULT 0,
|
||||
timeout_errors INTEGER DEFAULT 0,
|
||||
|
||||
-- Updated timestamp
|
||||
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
|
||||
|
||||
INDEX idx_financial_extraction_metrics_date ON financial_extraction_metrics(metric_date DESC)
|
||||
);
|
||||
|
||||
-- Function to update daily metrics (can be called by a scheduled job)
|
||||
CREATE OR REPLACE FUNCTION update_financial_extraction_metrics(target_date DATE DEFAULT CURRENT_DATE)
|
||||
RETURNS VOID AS $$
|
||||
DECLARE
|
||||
v_total INTEGER;
|
||||
v_successful INTEGER;
|
||||
v_failed INTEGER;
|
||||
v_success_rate DECIMAL(5, 4);
|
||||
v_deterministic INTEGER;
|
||||
v_haiku INTEGER;
|
||||
v_sonnet INTEGER;
|
||||
v_fallback INTEGER;
|
||||
v_avg_periods DECIMAL(3, 2);
|
||||
v_avg_metrics DECIMAL(5, 2);
|
||||
v_validation_pass_rate DECIMAL(5, 4);
|
||||
v_avg_auto_corrections DECIMAL(5, 2);
|
||||
v_avg_processing_time INTEGER;
|
||||
v_avg_api_duration INTEGER;
|
||||
v_p95_processing INTEGER;
|
||||
v_p99_processing INTEGER;
|
||||
v_total_cost DECIMAL(10, 2);
|
||||
v_avg_cost DECIMAL(10, 6);
|
||||
v_rate_limit_errors INTEGER;
|
||||
v_validation_errors INTEGER;
|
||||
v_api_errors INTEGER;
|
||||
v_timeout_errors INTEGER;
|
||||
BEGIN
|
||||
-- Calculate metrics for the target date
|
||||
SELECT
|
||||
COUNT(*),
|
||||
COUNT(*) FILTER (WHERE success = true),
|
||||
COUNT(*) FILTER (WHERE success = false),
|
||||
CASE WHEN COUNT(*) > 0 THEN COUNT(*) FILTER (WHERE success = true)::DECIMAL / COUNT(*) ELSE 0 END,
|
||||
COUNT(*) FILTER (WHERE extraction_method = 'deterministic_parser'),
|
||||
COUNT(*) FILTER (WHERE extraction_method = 'llm_haiku'),
|
||||
COUNT(*) FILTER (WHERE extraction_method = 'llm_sonnet'),
|
||||
COUNT(*) FILTER (WHERE extraction_method = 'fallback'),
|
||||
COALESCE(AVG(array_length(periods_extracted, 1)), 0),
|
||||
COALESCE(AVG(array_length(metrics_extracted, 1)), 0),
|
||||
CASE WHEN COUNT(*) > 0 THEN COUNT(*) FILTER (WHERE validation_passed = true)::DECIMAL / COUNT(*) ELSE 0 END,
|
||||
COALESCE(AVG(auto_corrections_applied), 0),
|
||||
COALESCE(AVG(processing_time_ms), 0)::INTEGER,
|
||||
COALESCE(AVG(api_call_duration_ms), 0)::INTEGER,
|
||||
COALESCE(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY processing_time_ms), 0)::INTEGER,
|
||||
COALESCE(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY processing_time_ms), 0)::INTEGER,
|
||||
COALESCE(SUM(cost_estimate_usd), 0),
|
||||
CASE WHEN COUNT(*) > 0 THEN COALESCE(SUM(cost_estimate_usd), 0) / COUNT(*) ELSE 0 END,
|
||||
COUNT(*) FILTER (WHERE error_type = 'rate_limit'),
|
||||
COUNT(*) FILTER (WHERE error_type = 'validation_failure'),
|
||||
COUNT(*) FILTER (WHERE error_type = 'api_error'),
|
||||
COUNT(*) FILTER (WHERE error_type = 'timeout')
|
||||
INTO
|
||||
v_total, v_successful, v_failed, v_success_rate,
|
||||
v_deterministic, v_haiku, v_sonnet, v_fallback,
|
||||
v_avg_periods, v_avg_metrics, v_validation_pass_rate, v_avg_auto_corrections,
|
||||
v_avg_processing_time, v_avg_api_duration, v_p95_processing, v_p99_processing,
|
||||
v_total_cost, v_avg_cost,
|
||||
v_rate_limit_errors, v_validation_errors, v_api_errors, v_timeout_errors
|
||||
FROM financial_extraction_events
|
||||
WHERE DATE(created_at) = target_date;
|
||||
|
||||
-- Insert or update metrics
|
||||
INSERT INTO financial_extraction_metrics (
|
||||
metric_date, total_extractions, successful_extractions, failed_extractions,
|
||||
success_rate, deterministic_parser_count, llm_haiku_count, llm_sonnet_count,
|
||||
fallback_count, avg_periods_extracted, avg_metrics_extracted,
|
||||
validation_pass_rate, avg_auto_corrections, avg_processing_time_ms,
|
||||
avg_api_call_duration_ms, p95_processing_time_ms, p99_processing_time_ms,
|
||||
total_cost_usd, avg_cost_per_extraction_usd, rate_limit_errors,
|
||||
validation_errors, api_errors, timeout_errors, updated_at
|
||||
) VALUES (
|
||||
target_date, v_total, v_successful, v_failed, v_success_rate,
|
||||
v_deterministic, v_haiku, v_sonnet, v_fallback,
|
||||
v_avg_periods, v_avg_metrics, v_validation_pass_rate, v_avg_auto_corrections,
|
||||
v_avg_processing_time, v_avg_api_duration, v_p95_processing, v_p99_processing,
|
||||
v_total_cost, v_avg_cost,
|
||||
v_rate_limit_errors, v_validation_errors, v_api_errors, v_timeout_errors,
|
||||
NOW()
|
||||
)
|
||||
ON CONFLICT (metric_date) DO UPDATE SET
|
||||
total_extractions = EXCLUDED.total_extractions,
|
||||
successful_extractions = EXCLUDED.successful_extractions,
|
||||
failed_extractions = EXCLUDED.failed_extractions,
|
||||
success_rate = EXCLUDED.success_rate,
|
||||
deterministic_parser_count = EXCLUDED.deterministic_parser_count,
|
||||
llm_haiku_count = EXCLUDED.llm_haiku_count,
|
||||
llm_sonnet_count = EXCLUDED.llm_sonnet_count,
|
||||
fallback_count = EXCLUDED.fallback_count,
|
||||
avg_periods_extracted = EXCLUDED.avg_periods_extracted,
|
||||
avg_metrics_extracted = EXCLUDED.avg_metrics_extracted,
|
||||
validation_pass_rate = EXCLUDED.validation_pass_rate,
|
||||
avg_auto_corrections = EXCLUDED.avg_auto_corrections,
|
||||
avg_processing_time_ms = EXCLUDED.avg_processing_time_ms,
|
||||
avg_api_call_duration_ms = EXCLUDED.avg_api_call_duration_ms,
|
||||
p95_processing_time_ms = EXCLUDED.p95_processing_time_ms,
|
||||
p99_processing_time_ms = EXCLUDED.p99_processing_time_ms,
|
||||
total_cost_usd = EXCLUDED.total_cost_usd,
|
||||
avg_cost_per_extraction_usd = EXCLUDED.avg_cost_per_extraction_usd,
|
||||
rate_limit_errors = EXCLUDED.rate_limit_errors,
|
||||
validation_errors = EXCLUDED.validation_errors,
|
||||
api_errors = EXCLUDED.api_errors,
|
||||
timeout_errors = EXCLUDED.timeout_errors,
|
||||
updated_at = NOW();
|
||||
END;
|
||||
$$ LANGUAGE plpgsql;
|
||||
|
||||
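Because the migration's `update_financial_extraction_metrics` function upserts on `metric_date`, it is idempotent per day and safe to re-run. A sketch of invoking it manually or from a scheduler, assuming the pg_cron extension is installed (the `'2025-01-15'` date and the `financial-metrics-daily` job name are hypothetical examples, not part of the migration):

```sql
-- Compute metrics for today on demand
SELECT update_financial_extraction_metrics();

-- Backfill a specific day (hypothetical date)
SELECT update_financial_extraction_metrics('2025-01-15');

-- Schedule a nightly run for the previous day (assumes pg_cron is available)
SELECT cron.schedule('financial-metrics-daily', '5 0 * * *',
  'SELECT update_financial_extraction_metrics(CURRENT_DATE - 1)');
```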
364 backend/src/scripts/testing/compare-processing-methods.ts (Executable file)
@@ -0,0 +1,364 @@
#!/usr/bin/env ts-node

/**
 * Comparison Test: Parallel Processing vs Sequential Processing
 *
 * This script tests the new parallel processing methodology against
 * the current production (sequential) methodology to measure:
 * - Processing time differences
 * - API call counts
 * - Accuracy/completeness
 * - Rate limit safety
 */

import * as dotenv from 'dotenv';
import * as path from 'path';
import * as fs from 'fs';
import { simpleDocumentProcessor } from '../services/simpleDocumentProcessor';
import { parallelDocumentProcessor } from '../services/parallelDocumentProcessor';
import { documentAiProcessor } from '../services/documentAiProcessor';
import { logger } from '../utils/logger';

// Load environment variables
dotenv.config({ path: path.join(__dirname, '../../.env') });

interface ComparisonResult {
  method: 'sequential' | 'parallel';
  success: boolean;
  processingTime: number;
  apiCalls: number;
  completeness: number;
  sectionsExtracted: string[];
  error?: string;
  financialData?: any;
}

interface TestResults {
  documentId: string;
  fileName: string;
  sequential: ComparisonResult;
  parallel: ComparisonResult;
  improvement: {
    timeReduction: number; // percentage
    timeSaved: number; // milliseconds
    apiCallsDifference: number;
    completenessDifference: number;
  };
}

/**
 * Calculate completeness score for a CIMReview
 */
function calculateCompleteness(data: any): number {
  if (!data) return 0;

  let totalFields = 0;
  let filledFields = 0;

  const countFields = (obj: any, prefix = '') => {
    if (obj === null || obj === undefined) return;

    if (typeof obj === 'object' && !Array.isArray(obj)) {
      Object.keys(obj).forEach(key => {
        const value = obj[key];
        const fieldPath = prefix ? `${prefix}.${key}` : key;

        // Recurse into nested objects; arrays and primitives count as leaf fields.
        // (Fixed: the original tested Array.isArray(obj) here instead of value.)
        if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
          countFields(value, fieldPath);
        } else {
          totalFields++;
          if (value && value !== 'Not specified in CIM' && value !== 'N/A' && value !== '') {
            filledFields++;
          }
        }
      });
    }
  };

  countFields(data);
  return totalFields > 0 ? (filledFields / totalFields) * 100 : 0;
}

/**
 * Get list of sections extracted
 */
function getSectionsExtracted(data: any): string[] {
  const sections: string[] = [];

  if (data?.dealOverview) sections.push('dealOverview');
  if (data?.businessDescription) sections.push('businessDescription');
  if (data?.marketIndustryAnalysis) sections.push('marketIndustryAnalysis');
  if (data?.financialSummary) sections.push('financialSummary');
  if (data?.managementTeamOverview) sections.push('managementTeamOverview');
  if (data?.preliminaryInvestmentThesis) sections.push('preliminaryInvestmentThesis');

  return sections;
}

/**
 * Test a single document with both methods
 */
async function testDocument(
  documentId: string,
  userId: string,
  filePath: string
): Promise<TestResults> {
  console.log('\n' + '='.repeat(80));
  console.log(`Testing Document: ${path.basename(filePath)}`);
  console.log('='.repeat(80));

  // Read file
  const fileBuffer = fs.readFileSync(filePath);
  const fileName = path.basename(filePath);
  const mimeType = 'application/pdf';

  // Extract text once (shared between both methods)
  console.log('\n📄 Extracting text with Document AI...');
  const extractionResult = await documentAiProcessor.extractTextOnly(
    documentId,
    userId,
    fileBuffer,
    fileName,
    mimeType
  );

  if (!extractionResult || !extractionResult.text) {
    throw new Error('Failed to extract text from document');
  }

  const extractedText = extractionResult.text;
  console.log(`✅ Text extracted: ${extractedText.length} characters`);

  const results: TestResults = {
    documentId,
    fileName,
    sequential: {} as ComparisonResult,
    parallel: {} as ComparisonResult,
    improvement: {
      timeReduction: 0,
      timeSaved: 0,
      apiCallsDifference: 0,
      completenessDifference: 0,
    },
  };

  // Test Sequential Method (Current Production)
  console.log('\n' + '-'.repeat(80));
  console.log('🔄 Testing SEQUENTIAL Method (Current Production)');
  console.log('-'.repeat(80));

  try {
    const sequentialStart = Date.now();
    const sequentialResult = await simpleDocumentProcessor.processDocument(
      documentId + '_sequential',
      userId,
      extractedText,
      { fileBuffer, fileName, mimeType }
    );
    const sequentialTime = Date.now() - sequentialStart;

    results.sequential = {
      method: 'sequential',
      success: sequentialResult.success,
      processingTime: sequentialTime,
      apiCalls: sequentialResult.apiCalls,
      completeness: calculateCompleteness(sequentialResult.analysisData),
      sectionsExtracted: getSectionsExtracted(sequentialResult.analysisData),
      error: sequentialResult.error,
      financialData: sequentialResult.analysisData?.financialSummary,
    };

    console.log(`✅ Sequential completed in ${(sequentialTime / 1000).toFixed(2)}s`);
    console.log(`   API Calls: ${sequentialResult.apiCalls}`);
    console.log(`   Completeness: ${results.sequential.completeness.toFixed(1)}%`);
    console.log(`   Sections: ${results.sequential.sectionsExtracted.join(', ')}`);
  } catch (error) {
    results.sequential = {
      method: 'sequential',
      success: false,
      processingTime: 0,
      apiCalls: 0,
      completeness: 0,
      sectionsExtracted: [],
      error: error instanceof Error ? error.message : String(error),
    };
    console.log(`❌ Sequential failed: ${results.sequential.error}`);
  }

  // Wait a bit between tests to avoid rate limits
  console.log('\n⏳ Waiting 5 seconds before parallel test...');
  await new Promise(resolve => setTimeout(resolve, 5000));

  // Test Parallel Method (New)
  console.log('\n' + '-'.repeat(80));
  console.log('⚡ Testing PARALLEL Method (New)');
  console.log('-'.repeat(80));

  try {
    const parallelStart = Date.now();
    const parallelResult = await parallelDocumentProcessor.processDocument(
      documentId + '_parallel',
      userId,
      extractedText,
      { fileBuffer, fileName, mimeType }
    );
    const parallelTime = Date.now() - parallelStart;

    results.parallel = {
      method: 'parallel',
      success: parallelResult.success,
      processingTime: parallelTime,
      apiCalls: parallelResult.apiCalls,
      completeness: calculateCompleteness(parallelResult.analysisData),
      sectionsExtracted: getSectionsExtracted(parallelResult.analysisData),
      error: parallelResult.error,
      financialData: parallelResult.analysisData?.financialSummary,
    };

    console.log(`✅ Parallel completed in ${(parallelTime / 1000).toFixed(2)}s`);
    console.log(`   API Calls: ${parallelResult.apiCalls}`);
    console.log(`   Completeness: ${results.parallel.completeness.toFixed(1)}%`);
    console.log(`   Sections: ${results.parallel.sectionsExtracted.join(', ')}`);
  } catch (error) {
    results.parallel = {
      method: 'parallel',
      success: false,
      processingTime: 0,
      apiCalls: 0,
      completeness: 0,
      sectionsExtracted: [],
      error: error instanceof Error ? error.message : String(error),
    };
    console.log(`❌ Parallel failed: ${results.parallel.error}`);
  }

  // Calculate improvements
  if (results.sequential.success && results.parallel.success) {
    results.improvement.timeSaved = results.sequential.processingTime - results.parallel.processingTime;
    results.improvement.timeReduction = results.sequential.processingTime > 0
      ? (results.improvement.timeSaved / results.sequential.processingTime) * 100
      : 0;
    results.improvement.apiCallsDifference = results.parallel.apiCalls - results.sequential.apiCalls;
    results.improvement.completenessDifference = results.parallel.completeness - results.sequential.completeness;
  }

  return results;
}

/**
 * Print comparison results
 */
function printComparisonResults(results: TestResults): void {
  console.log('\n' + '='.repeat(80));
  console.log('📊 COMPARISON RESULTS');
  console.log('='.repeat(80));

  console.log('\n📈 Performance Metrics:');
  console.log(`   Sequential Time: ${(results.sequential.processingTime / 1000).toFixed(2)}s`);
  console.log(`   Parallel Time:   ${(results.parallel.processingTime / 1000).toFixed(2)}s`);

  if (results.improvement.timeSaved > 0) {
    console.log(`   ⚡ Time Saved: ${(results.improvement.timeSaved / 1000).toFixed(2)}s (${results.improvement.timeReduction.toFixed(1)}% faster)`);
  } else {
    console.log(`   ⚠️ Time Difference: ${(Math.abs(results.improvement.timeSaved) / 1000).toFixed(2)}s (${Math.abs(results.improvement.timeReduction).toFixed(1)}% ${results.improvement.timeReduction < 0 ? 'slower' : 'faster'})`);
  }

  console.log('\n🔢 API Calls:');
  console.log(`   Sequential: ${results.sequential.apiCalls}`);
  console.log(`   Parallel:   ${results.parallel.apiCalls}`);
  if (results.improvement.apiCallsDifference !== 0) {
    const sign = results.improvement.apiCallsDifference > 0 ? '+' : '';
    console.log(`   Difference: ${sign}${results.improvement.apiCallsDifference}`);
  }

  console.log('\n✅ Completeness:');
  console.log(`   Sequential: ${results.sequential.completeness.toFixed(1)}%`);
  console.log(`   Parallel:   ${results.parallel.completeness.toFixed(1)}%`);
  if (results.improvement.completenessDifference !== 0) {
    const sign = results.improvement.completenessDifference > 0 ? '+' : '';
    console.log(`   Difference: ${sign}${results.improvement.completenessDifference.toFixed(1)}%`);
  }

  console.log('\n📋 Sections Extracted:');
  console.log(`   Sequential: ${results.sequential.sectionsExtracted.join(', ') || 'None'}`);
  console.log(`   Parallel:   ${results.parallel.sectionsExtracted.join(', ') || 'None'}`);

  // Compare financial data if available
  if (results.sequential.financialData && results.parallel.financialData) {
    console.log('\n💰 Financial Data Comparison:');
    const seqFinancials = results.sequential.financialData.financials;
    const parFinancials = results.parallel.financialData.financials;

    ['fy3', 'fy2', 'fy1', 'ltm'].forEach(period => {
      const seqRev = seqFinancials?.[period]?.revenue;
      const parRev = parFinancials?.[period]?.revenue;
      const match = seqRev === parRev ? '✅' : '❌';
      console.log(`   ${period.toUpperCase()} Revenue: ${match} Sequential: ${seqRev || 'N/A'} | Parallel: ${parRev || 'N/A'}`);
    });
  }

  console.log('\n' + '='.repeat(80));

  // Summary
  if (results.improvement.timeReduction > 0) {
    console.log(`\n🎉 Parallel processing is ${results.improvement.timeReduction.toFixed(1)}% faster!`);
  } else if (results.improvement.timeReduction < 0) {
    console.log(`\n⚠️ Parallel processing is ${Math.abs(results.improvement.timeReduction).toFixed(1)}% slower (may be due to rate limiting or overhead)`);
  } else {
    console.log(`\n➡️ Processing times are similar`);
  }
}

/**
 * Main test function
 */
async function main() {
  const args = process.argv.slice(2);

  if (args.length === 0) {
    console.error('Usage: ts-node compare-processing-methods.ts <pdf-file-path> [userId] [documentId]');
    console.error('\nExample:');
    console.error('  ts-node compare-processing-methods.ts ~/Downloads/stax-cim.pdf');
    process.exit(1);
  }

  const filePath = args[0];
  const userId = args[1] || 'test-user-' + Date.now();
  const documentId = args[2] || 'test-doc-' + Date.now();

  if (!fs.existsSync(filePath)) {
    console.error(`❌ File not found: ${filePath}`);
    process.exit(1);
  }

  console.log('\n🚀 Starting Processing Method Comparison Test');
  console.log(`   File: ${filePath}`);
  console.log(`   User ID: ${userId}`);
  console.log(`   Document ID: ${documentId}`);

  try {
    const results = await testDocument(documentId, userId, filePath);
    printComparisonResults(results);

    // Save results to file
    const resultsFile = path.join(__dirname, `../../comparison-results-${Date.now()}.json`);
    fs.writeFileSync(resultsFile, JSON.stringify(results, null, 2));
    console.log(`\n💾 Results saved to: ${resultsFile}`);

    process.exit(0);
  } catch (error) {
    console.error('\n❌ Test failed:', error);
    process.exit(1);
  }
}

// Run if executed directly
if (require.main === module) {
  main().catch(error => {
    console.error('Fatal error:', error);
    process.exit(1);
  });
}

export { testDocument, printComparisonResults };
// Interfaces must be re-exported as types (avoids errors under isolatedModules)
export type { ComparisonResult, TestResults };
511 backend/src/services/financialExtractionMonitoringService.ts (Normal file)
@@ -0,0 +1,511 @@
import { logger } from '../utils/logger';
import getSupabaseClient from '../config/supabase';

export interface FinancialExtractionEvent {
  documentId: string;
  jobId?: string;
  userId?: string;
  extractionMethod: 'deterministic_parser' | 'llm_haiku' | 'llm_sonnet' | 'fallback';
  modelUsed?: string;
  attemptNumber?: number;
  success: boolean;
  hasFinancials?: boolean;
  periodsExtracted?: string[];
  metricsExtracted?: string[];
  validationPassed?: boolean;
  validationIssues?: string[];
  autoCorrectionsApplied?: number;
  apiCallDurationMs?: number;
  tokensUsed?: number;
  costEstimateUsd?: number;
  rateLimitHit?: boolean;
  errorType?: 'rate_limit' | 'validation_failure' | 'api_error' | 'timeout' | 'other';
  errorMessage?: string;
  errorCode?: string;
  processingTimeMs?: number;
}

export interface FinancialExtractionMetrics {
  totalExtractions: number;
  successfulExtractions: number;
  failedExtractions: number;
  successRate: number;
  deterministicParserCount: number;
  llmHaikuCount: number;
  llmSonnetCount: number;
  fallbackCount: number;
  avgPeriodsExtracted: number;
  avgMetricsExtracted: number;
  validationPassRate: number;
  avgAutoCorrections: number;
  avgProcessingTimeMs: number;
  avgApiCallDurationMs: number;
  p95ProcessingTimeMs: number;
  p99ProcessingTimeMs: number;
  totalCostUsd: number;
  avgCostPerExtractionUsd: number;
  rateLimitErrors: number;
  validationErrors: number;
  apiErrors: number;
  timeoutErrors: number;
}

export interface ApiCallTracking {
  provider: 'anthropic' | 'openai' | 'openrouter';
  model: string;
  endpoint: 'financial_extraction' | 'full_extraction' | 'other';
  durationMs?: number;
  success: boolean;
  rateLimitHit?: boolean;
  retryAttempt?: number;
  inputTokens?: number;
  outputTokens?: number;
  totalTokens?: number;
  costUsd?: number;
  errorType?: string;
  errorMessage?: string;
}

export interface FinancialExtractionHealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  successRate: number;
  avgProcessingTime: number;
  rateLimitRisk: 'low' | 'medium' | 'high';
  recentErrors: number;
  recommendations: string[];
  timestamp: Date;
}

/**
 * Service for monitoring financial extraction accuracy, errors, and API call patterns.
 *
 * This service is designed to be safe for parallel processing:
 * - Uses database-backed storage (not in-memory)
 * - All operations are atomic
 * - No shared mutable state
 * - Thread-safe for concurrent access
 */
class FinancialExtractionMonitoringService {
  private readonly RATE_LIMIT_WINDOW_MS = 60000; // 1 minute window
  private readonly RATE_LIMIT_THRESHOLD = 50; // Max calls per minute per provider/model
  private readonly HEALTH_THRESHOLDS = {
    successRate: {
      healthy: 0.95,
      degraded: 0.85,
    },
    avgProcessingTime: {
      healthy: 30000, // 30 seconds
      degraded: 120000, // 2 minutes
    },
    maxRecentErrors: 10,
  };

  /**
   * Track a financial extraction event
   * Thread-safe: Uses database insert, safe for parallel processing
   */
  async trackExtractionEvent(event: FinancialExtractionEvent): Promise<void> {
    try {
      const supabase = getSupabaseClient();
      const { error } = await supabase
        .from('financial_extraction_events')
        .insert({
          document_id: event.documentId,
          job_id: event.jobId ?? null,
          user_id: event.userId ?? null,
          extraction_method: event.extractionMethod,
          model_used: event.modelUsed ?? null,
          attempt_number: event.attemptNumber ?? 1,
          success: event.success,
          has_financials: event.hasFinancials ?? false,
          periods_extracted: event.periodsExtracted ?? [],
          metrics_extracted: event.metricsExtracted ?? [],
          // ?? (not ||) so an explicit `false` is stored instead of being coerced to null
          validation_passed: event.validationPassed ?? null,
          validation_issues: event.validationIssues ?? [],
          auto_corrections_applied: event.autoCorrectionsApplied ?? 0,
          api_call_duration_ms: event.apiCallDurationMs ?? null,
          tokens_used: event.tokensUsed ?? null,
          cost_estimate_usd: event.costEstimateUsd ?? null,
          rate_limit_hit: event.rateLimitHit ?? false,
          error_type: event.errorType ?? null,
          error_message: event.errorMessage ?? null,
          error_code: event.errorCode ?? null,
          processing_time_ms: event.processingTimeMs ?? null,
        });

      if (error) {
        logger.error('Failed to track financial extraction event', {
          error: error.message,
          documentId: event.documentId,
        });
      } else {
        logger.debug('Tracked financial extraction event', {
          documentId: event.documentId,
          method: event.extractionMethod,
          success: event.success,
        });
      }
    } catch (error) {
      // Don't throw - monitoring failures shouldn't break processing
      logger.error('Error tracking financial extraction event', {
        error: error instanceof Error ? error.message : String(error),
        documentId: event.documentId,
      });
    }
  }

  /**
   * Track an API call for rate limit monitoring
   * Thread-safe: Uses database insert, safe for parallel processing
   */
  async trackApiCall(call: ApiCallTracking): Promise<void> {
    try {
      const supabase = getSupabaseClient();
      const { error } = await supabase
        .from('api_call_tracking')
        .insert({
          provider: call.provider,
          model: call.model,
          endpoint: call.endpoint,
          duration_ms: call.durationMs ?? null,
          success: call.success,
          rate_limit_hit: call.rateLimitHit ?? false,
          retry_attempt: call.retryAttempt ?? 0,
          input_tokens: call.inputTokens ?? null,
          output_tokens: call.outputTokens ?? null,
          total_tokens: call.totalTokens ?? null,
          cost_usd: call.costUsd ?? null,
          error_type: call.errorType ?? null,
          error_message: call.errorMessage ?? null,
        });

      if (error) {
        logger.error('Failed to track API call', {
          error: error.message,
          provider: call.provider,
          model: call.model,
        });
      }
    } catch (error) {
      // Don't throw - monitoring failures shouldn't break processing
      logger.error('Error tracking API call', {
        error: error instanceof Error ? error.message : String(error),
        provider: call.provider,
        model: call.model,
      });
    }
  }

  /**
   * Check if we're at risk of hitting rate limits
   * Thread-safe: Uses database query, safe for parallel processing
   */
  async checkRateLimitRisk(
    provider: 'anthropic' | 'openai' | 'openrouter',
    model: string
  ): Promise<'low' | 'medium' | 'high'> {
    try {
      const supabase = getSupabaseClient();
      const windowStart = new Date(Date.now() - this.RATE_LIMIT_WINDOW_MS);

      const { data, error } = await supabase
        .from('api_call_tracking')
        .select('id')
        .eq('provider', provider)
        .eq('model', model)
        .gte('timestamp', windowStart.toISOString())
        .limit(this.RATE_LIMIT_THRESHOLD + 1);

      if (error) {
        logger.warn('Failed to check rate limit risk', {
          error: error.message,
          provider,
          model,
        });
        return 'low'; // Default to low risk if we can't check
      }

      const callCount = data?.length || 0;

      if (callCount >= this.RATE_LIMIT_THRESHOLD) {
        return 'high';
      } else if (callCount >= this.RATE_LIMIT_THRESHOLD * 0.7) {
        return 'medium';
      } else {
        return 'low';
      }
    } catch (error) {
      logger.error('Error checking rate limit risk', {
        error: error instanceof Error ? error.message : String(error),
        provider,
        model,
      });
      return 'low'; // Default to low risk on error
    }
  }

  /**
   * Get metrics for a time period
   * Thread-safe: Uses database query, safe for parallel processing
   */
  async getMetrics(hours: number = 24): Promise<FinancialExtractionMetrics | null> {
    try {
      const cutoffTime = new Date(Date.now() - hours * 60 * 60 * 1000);

      // Get aggregated metrics from the metrics table if available
      const supabase = getSupabaseClient();
      const { data: metricsData, error: metricsError } = await supabase
        .from('financial_extraction_metrics')
        .select('*')
        .gte('metric_date', cutoffTime.toISOString().split('T')[0])
        .order('metric_date', { ascending: false })
        .limit(1);

      if (!metricsError && metricsData && metricsData.length > 0) {
        const m = metricsData[0];
        return {
          totalExtractions: m.total_extractions || 0,
          successfulExtractions: m.successful_extractions || 0,
          failedExtractions: m.failed_extractions || 0,
          // DECIMAL columns come back as strings, so parseFloat with a string fallback
          successRate: parseFloat(m.success_rate || '0'),
          deterministicParserCount: m.deterministic_parser_count || 0,
          llmHaikuCount: m.llm_haiku_count || 0,
          llmSonnetCount: m.llm_sonnet_count || 0,
          fallbackCount: m.fallback_count || 0,
          avgPeriodsExtracted: parseFloat(m.avg_periods_extracted || '0'),
          avgMetricsExtracted: parseFloat(m.avg_metrics_extracted || '0'),
          validationPassRate: parseFloat(m.validation_pass_rate || '0'),
          avgAutoCorrections: parseFloat(m.avg_auto_corrections || '0'),
          avgProcessingTimeMs: m.avg_processing_time_ms || 0,
          avgApiCallDurationMs: m.avg_api_call_duration_ms || 0,
          p95ProcessingTimeMs: m.p95_processing_time_ms || 0,
          p99ProcessingTimeMs: m.p99_processing_time_ms || 0,
          totalCostUsd: parseFloat(m.total_cost_usd || '0'),
          avgCostPerExtractionUsd: parseFloat(m.avg_cost_per_extraction_usd || '0'),
          rateLimitErrors: m.rate_limit_errors || 0,
          validationErrors: m.validation_errors || 0,
          apiErrors: m.api_errors || 0,
          timeoutErrors: m.timeout_errors || 0,
        };
      }

      // Fallback: Calculate from events if metrics table is empty
|
||||
const { data: eventsData, error: eventsError } = await supabase
|
||||
.from('financial_extraction_events')
|
||||
.select('*')
|
||||
.gte('created_at', cutoffTime.toISOString());
|
||||
|
||||
if (eventsError) {
|
||||
logger.error('Failed to get financial extraction metrics', {
|
||||
error: eventsError.message,
|
||||
});
|
||||
return null;
|
||||
}
|
||||
|
||||
if (!eventsData || eventsData.length === 0) {
|
||||
return this.getEmptyMetrics();
|
||||
}
|
||||
|
||||
// Calculate metrics from events
|
||||
const total = eventsData.length;
|
||||
const successful = eventsData.filter(e => e.success).length;
|
||||
const failed = total - successful;
|
||||
const successRate = total > 0 ? successful / total : 0;
|
||||
|
||||
const processingTimes = eventsData
|
||||
.map(e => e.processing_time_ms)
|
||||
.filter(t => t !== null && t !== undefined) as number[];
|
||||
const avgProcessingTime = processingTimes.length > 0
|
||||
? Math.round(processingTimes.reduce((a, b) => a + b, 0) / processingTimes.length)
|
||||
: 0;
|
||||
|
||||
const p95ProcessingTime = processingTimes.length > 0
|
||||
? Math.round(this.percentile(processingTimes, 0.95))
|
||||
: 0;
|
||||
const p99ProcessingTime = processingTimes.length > 0
|
||||
? Math.round(this.percentile(processingTimes, 0.99))
|
||||
: 0;
|
||||
|
||||
return {
|
||||
totalExtractions: total,
|
||||
successfulExtractions: successful,
|
||||
failedExtractions: failed,
|
||||
successRate,
|
||||
deterministicParserCount: eventsData.filter(e => e.extraction_method === 'deterministic_parser').length,
|
||||
llmHaikuCount: eventsData.filter(e => e.extraction_method === 'llm_haiku').length,
|
||||
llmSonnetCount: eventsData.filter(e => e.extraction_method === 'llm_sonnet').length,
|
||||
fallbackCount: eventsData.filter(e => e.extraction_method === 'fallback').length,
|
||||
avgPeriodsExtracted: this.avgArrayLength(eventsData.map(e => e.periods_extracted)),
|
||||
avgMetricsExtracted: this.avgArrayLength(eventsData.map(e => e.metrics_extracted)),
|
||||
validationPassRate: this.calculatePassRate(eventsData.map(e => e.validation_passed)),
|
||||
avgAutoCorrections: this.avg(eventsData.map(e => e.auto_corrections_applied || 0)),
|
||||
avgProcessingTimeMs: avgProcessingTime,
|
||||
avgApiCallDurationMs: this.avg(eventsData.map(e => e.api_call_duration_ms).filter(t => t !== null && t !== undefined) as number[]),
|
||||
p95ProcessingTimeMs: p95ProcessingTime,
|
||||
p99ProcessingTimeMs: p99ProcessingTime,
|
||||
totalCostUsd: eventsData.reduce((sum, e) => sum + (parseFloat(e.cost_estimate_usd || 0)), 0),
|
||||
avgCostPerExtractionUsd: total > 0
|
||||
? eventsData.reduce((sum, e) => sum + (parseFloat(e.cost_estimate_usd || 0)), 0) / total
|
||||
: 0,
|
||||
rateLimitErrors: eventsData.filter(e => e.error_type === 'rate_limit').length,
|
||||
validationErrors: eventsData.filter(e => e.error_type === 'validation_failure').length,
|
||||
apiErrors: eventsData.filter(e => e.error_type === 'api_error').length,
|
||||
timeoutErrors: eventsData.filter(e => e.error_type === 'timeout').length,
|
||||
};
|
||||
} catch (error) {
|
||||
logger.error('Error getting financial extraction metrics', {
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
});
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Get health status for financial extraction
|
||||
*/
|
||||
async getHealthStatus(): Promise<FinancialExtractionHealthStatus> {
|
||||
const metrics = await this.getMetrics(24);
|
||||
const recommendations: string[] = [];
|
||||
|
||||
if (!metrics) {
|
||||
return {
|
||||
status: 'unhealthy',
|
||||
successRate: 0,
|
||||
avgProcessingTime: 0,
|
||||
rateLimitRisk: 'low',
|
||||
recentErrors: 0,
|
||||
recommendations: ['Unable to retrieve metrics'],
|
||||
timestamp: new Date(),
|
||||
};
|
||||
}
|
||||
|
||||
// Determine status based on thresholds
|
||||
let status: 'healthy' | 'degraded' | 'unhealthy' = 'healthy';
|
||||
|
||||
if (metrics.successRate < this.HEALTH_THRESHOLDS.successRate.degraded) {
|
||||
status = 'unhealthy';
|
||||
recommendations.push(`Success rate is low (${(metrics.successRate * 100).toFixed(1)}%). Investigate recent failures.`);
|
||||
} else if (metrics.successRate < this.HEALTH_THRESHOLDS.successRate.healthy) {
|
||||
status = 'degraded';
|
||||
recommendations.push(`Success rate is below target (${(metrics.successRate * 100).toFixed(1)}%). Monitor closely.`);
|
||||
}
|
||||
|
||||
if (metrics.avgProcessingTimeMs > this.HEALTH_THRESHOLDS.avgProcessingTime.degraded) {
|
||||
if (status === 'healthy') status = 'degraded';
|
||||
recommendations.push(`Average processing time is high (${(metrics.avgProcessingTimeMs / 1000).toFixed(1)}s). Consider optimization.`);
|
||||
}
|
||||
|
||||
if (metrics.rateLimitErrors > 0) {
|
||||
if (status === 'healthy') status = 'degraded';
|
||||
recommendations.push(`${metrics.rateLimitErrors} rate limit errors detected. Consider reducing concurrency or adding delays.`);
|
||||
}
|
||||
|
||||
// Check rate limit risk for common providers/models
|
||||
const anthropicRisk = await this.checkRateLimitRisk('anthropic', 'claude-3-5-haiku-latest');
|
||||
const sonnetRisk = await this.checkRateLimitRisk('anthropic', 'claude-sonnet-4-5-20250514');
|
||||
const rateLimitRisk: 'low' | 'medium' | 'high' =
|
||||
anthropicRisk === 'high' || sonnetRisk === 'high' ? 'high' :
|
||||
anthropicRisk === 'medium' || sonnetRisk === 'medium' ? 'medium' : 'low';
|
||||
|
||||
if (rateLimitRisk === 'high') {
|
||||
recommendations.push('High rate limit risk detected. Consider reducing parallel processing or adding delays between API calls.');
|
||||
} else if (rateLimitRisk === 'medium') {
|
||||
recommendations.push('Medium rate limit risk. Monitor API call patterns closely.');
|
||||
}
|
||||
|
||||
return {
|
||||
status,
|
||||
successRate: metrics.successRate,
|
||||
avgProcessingTime: metrics.avgProcessingTimeMs,
|
||||
rateLimitRisk,
|
||||
recentErrors: metrics.failedExtractions,
|
||||
recommendations,
|
||||
timestamp: new Date(),
|
||||
};
|
||||
}
|
||||
|
||||
/**
|
||||
* Update daily metrics (should be called by a scheduled job)
|
||||
*/
|
||||
async updateDailyMetrics(date: Date = new Date()): Promise<void> {
|
||||
try {
|
||||
const supabase = getSupabaseClient();
|
||||
const { error } = await supabase.rpc('update_financial_extraction_metrics', {
|
||||
target_date: date.toISOString().split('T')[0],
|
||||
});
|
||||
|
||||
if (error) {
|
||||
logger.error('Failed to update daily metrics', {
|
||||
error: error.message,
|
||||
date: date.toISOString(),
|
||||
});
|
||||
} else {
|
||||
logger.info('Updated daily financial extraction metrics', {
|
||||
date: date.toISOString(),
|
||||
});
|
||||
}
|
||||
} catch (error) {
|
||||
logger.error('Error updating daily metrics', {
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
date: date.toISOString(),
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// Helper methods
|
||||
private getEmptyMetrics(): FinancialExtractionMetrics {
|
||||
return {
|
||||
totalExtractions: 0,
|
||||
successfulExtractions: 0,
|
||||
failedExtractions: 0,
|
||||
successRate: 0,
|
||||
deterministicParserCount: 0,
|
||||
llmHaikuCount: 0,
|
||||
llmSonnetCount: 0,
|
||||
fallbackCount: 0,
|
||||
avgPeriodsExtracted: 0,
|
||||
avgMetricsExtracted: 0,
|
||||
validationPassRate: 0,
|
||||
avgAutoCorrections: 0,
|
||||
avgProcessingTimeMs: 0,
|
||||
avgApiCallDurationMs: 0,
|
||||
p95ProcessingTimeMs: 0,
|
||||
p99ProcessingTimeMs: 0,
|
||||
totalCostUsd: 0,
|
||||
avgCostPerExtractionUsd: 0,
|
||||
rateLimitErrors: 0,
|
||||
validationErrors: 0,
|
||||
apiErrors: 0,
|
||||
timeoutErrors: 0,
|
||||
};
|
||||
}
|
||||
|
||||
private avg(values: number[]): number {
|
||||
if (values.length === 0) return 0;
|
||||
return values.reduce((a, b) => a + b, 0) / values.length;
|
||||
}
|
||||
|
||||
private avgArrayLength(arrays: (string[] | null)[]): number {
|
||||
const lengths = arrays
|
||||
.filter(a => a !== null && a !== undefined)
|
||||
.map(a => a!.length);
|
||||
return this.avg(lengths);
|
||||
}
|
||||
|
||||
private calculatePassRate(passed: (boolean | null)[]): number {
|
||||
const valid = passed.filter(p => p !== null);
|
||||
if (valid.length === 0) return 0;
|
||||
const passedCount = valid.filter(p => p === true).length;
|
||||
return passedCount / valid.length;
|
||||
}
|
||||
|
||||
private percentile(sorted: number[], p: number): number {
|
||||
if (sorted.length === 0) return 0;
|
||||
const sortedCopy = [...sorted].sort((a, b) => a - b);
|
||||
const index = Math.ceil(sortedCopy.length * p) - 1;
|
||||
return sortedCopy[Math.max(0, Math.min(index, sortedCopy.length - 1))];
|
||||
}
|
||||
}
|
||||
|
||||
export const financialExtractionMonitoringService = new FinancialExtractionMonitoringService();
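The `percentile` helper above uses the nearest-rank method: sort a defensive copy, take index `ceil(n * p) - 1`, and clamp it to the valid range. A standalone sketch of the same logic, useful when wiring dashboards against the p95/p99 values this service reports:

```typescript
// Standalone mirror of FinancialExtractionMonitoringService.percentile:
// nearest-rank selection on a sorted copy, clamped to valid indices.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.ceil(sorted.length * p) - 1;
  return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
}

// With 20 samples, p95 picks rank ceil(20 * 0.95) = 19, i.e. index 18.
const samples = Array.from({ length: 20 }, (_, i) => (i + 1) * 100); // 100..2000
console.log(percentile(samples, 0.95)); // 1900
console.log(percentile(samples, 0.5));  // 1000
```

Note the parameter is named `sorted` in the service even though the input need not be pre-sorted; the method sorts a copy internally, so callers can pass raw timing arrays.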
112
backend/src/services/llmPrompts/cimSystemPrompt.ts
Normal file
@@ -0,0 +1,112 @@
/**
 * CIM System Prompt Builder
 * Generates the system prompt for CIM document analysis
 */

export function getCIMSystemPrompt(focusedFields?: string[]): string {
  const focusInstruction = focusedFields && focusedFields.length > 0
    ? `\n\nPRIORITY AREAS FOR THIS PASS (extract these thoroughly, but still extract ALL other fields):\n${focusedFields.map(f => `- ${f}`).join('\n')}\n\nFor this pass, prioritize extracting the fields listed above with extra thoroughness. However, you MUST still extract ALL fields in the template. Do NOT use "Not specified in CIM" for any field unless you have thoroughly searched the entire document and confirmed the information is truly not present. Be especially thorough in extracting all nested fields within the priority areas.`
    : '';

  return `You are a world-class private equity investment analyst at BPCP (Blue Point Capital Partners), operating at the analytical depth and rigor of top-tier PE firms (KKR, Blackstone, Apollo, Carlyle). Your task is to analyze Confidential Information Memorandums (CIMs) with the precision, depth, and strategic insight expected by BPCP's investment committee. Return a comprehensive, structured JSON object that follows the BPCP CIM Review Template format EXACTLY.${focusInstruction}

CRITICAL REQUIREMENTS:
1. **JSON OUTPUT ONLY**: Your entire response MUST be a single, valid JSON object. Do not include any text or explanation before or after the JSON object.
2. **BPCP TEMPLATE FORMAT**: The JSON object MUST follow the BPCP CIM Review Template structure exactly as specified.
3. **COMPLETE ALL FIELDS**: You MUST provide a value for every field. Use "Not specified in CIM" for any information that is not available in the document.
4. **NO PLACEHOLDERS**: Do not use placeholders like "..." or "TBD". Use "Not specified in CIM" instead.
5. **PROFESSIONAL ANALYSIS**: The content should be high-quality and suitable for BPCP's investment committee.
6. **BPCP FOCUS**: Focus on companies in 5+MM EBITDA range in consumer and industrial end markets, with emphasis on M&A, technology & data usage, supply chain and human capital optimization.
7. **BPCP PREFERENCES**: BPCP prefers companies which are founder/family-owned and within driving distance of Cleveland and Charlotte.
8. **EXACT FIELD NAMES**: Use the exact field names and descriptions from the BPCP CIM Review Template.
9. **FINANCIAL DATA**: For financial metrics, use actual numbers if available, otherwise use "Not specified in CIM".
10. **VALID JSON**: Ensure your response is valid JSON that can be parsed without errors.

FINANCIAL VALIDATION FRAMEWORK:
Before finalizing any financial extraction, you MUST perform these validation checks:

**Magnitude Validation**:
- Revenue should typically be $10M+ for target companies (if less, verify you're using the PRIMARY table, not a subsidiary)
- EBITDA should typically be $1M+ and positive for viable targets
- If FY-3 revenue is $64M, FY-2 should be similar magnitude (e.g., $50M-$90M), not $2.9M or $10 - this indicates column misalignment

**Trend Validation**:
- Revenue should generally increase or be stable year-over-year (FY-3 → FY-2 → FY-1)
- Large sudden drops (>50%) or increases (>200%) may indicate misaligned columns or wrong table
- EBITDA should follow similar trends to revenue (unless margin expansion/contraction is explicitly explained)

**Cross-Period Consistency**:
- If FY-3 revenue = $64M and FY-2 revenue = $71M, growth should be ~11% (not 1000% or -50%)
- Margins should be relatively stable across periods (within 10-15 percentage points unless explained)
- EBITDA margins should be 5-50% (typical range), gross margins 20-80%

**Multi-Table Cross-Reference**:
- Cross-reference primary table with executive summary financial highlights
- Verify consistency between detailed financials and summary tables
- Check appendices for additional financial detail or adjustments
- If discrepancies exist, note them and use the most authoritative source (typically the detailed historical table)

**Calculation Validation**:
- Verify revenue growth percentages match: ((Current - Prior) / Prior) * 100
- Verify margins match: (Metric / Revenue) * 100
- If calculations don't match, use the explicitly stated values from the table

PE INVESTOR PERSONA & METHODOLOGY:
You operate with the analytical rigor and strategic depth of top-tier private equity firms. Your analysis should demonstrate:

**Value Creation Focus**:
- Identify specific, quantifiable value creation opportunities (e.g., "Margin expansion of 200-300 bps through pricing optimization and cost reduction, potentially adding $2-3M EBITDA")
- Assess operational improvement potential (supply chain, technology, human capital)
- Evaluate M&A and add-on acquisition potential with specific rationale
- Quantify potential impact where possible (EBITDA improvement, revenue growth, multiple expansion)

**Risk Assessment Depth**:
- Categorize risks by type: operational, financial, market, execution, regulatory, technology
- Assess both probability and impact (high/medium/low)
- Identify mitigating factors and management's risk management approach
- Distinguish between deal-breakers and manageable risks

**Strategic Analysis Frameworks**:
- **Porter's Five Forces**: Assess competitive intensity, supplier power, buyer power, threat of substitutes, threat of new entrants
- **SWOT Analysis**: Synthesize strengths, weaknesses, opportunities, threats from the CIM
- **Value Creation Playbook**: Revenue growth (organic/inorganic), margin expansion, operational improvements, multiple expansion
- **Comparable Analysis**: Reference industry benchmarks, comparable company multiples, recent transaction multiples where mentioned

**Industry Context Integration**:
- Reference industry-specific metrics and benchmarks (e.g., SaaS: ARR growth, churn, CAC payback; Manufacturing: inventory turns, days sales outstanding)
- Consider sector-specific risks and opportunities (regulatory changes, technology disruption, consolidation trends)
- Evaluate market position relative to industry standards (market share, growth vs market, margin vs peers)

COMMON MISTAKES TO AVOID:
1. **Subsidiary vs Parent Table Confusion**: Primary table shows values in millions ($64M), subsidiary tables show thousands ($20,546). Always use the PRIMARY table.
2. **Column Misalignment**: Count columns carefully - ensure values align with their period columns. Verify trends make sense.
3. **Projections vs Historical**: Ignore tables marked with "E", "P", "PF", "Projected", "Forecast" - only extract historical data.
4. **Unit Confusion**: "$20,546 (in thousands)" = $20.5M, not $20,546M. Always check table footnotes for units.
5. **Missing Cross-Validation**: Don't extract financials in isolation - cross-reference with executive summary, narrative text, appendices.
6. **Generic Analysis**: Avoid generic statements like "strong management team" - provide specific details (years of experience, track record, specific achievements).
7. **Incomplete Risk Assessment**: Don't just list risks - assess impact, probability, and mitigations. Categorize by type.
8. **Vague Value Creation**: Instead of "operational improvements", specify "reduce SG&A by 150 bps through shared services consolidation, adding $1.5M EBITDA".

ANALYSIS QUALITY REQUIREMENTS:
- **Financial Precision**: Extract exact financial figures, percentages, and growth rates. Calculate CAGR where possible. Validate all calculations.
- **Competitive Intelligence**: Identify specific competitors with market share context, competitive positioning (leader/follower/niche), and differentiation drivers.
- **Risk Assessment**: Evaluate both stated and implied risks, categorize by type, assess impact and probability, identify mitigations.
- **Growth Drivers**: Identify specific revenue growth drivers with quantification (e.g., "New product line launched in 2023, contributing $5M revenue in FY-1").
- **Management Quality**: Assess management experience with specific details (years in role, prior companies, track record), evaluate retention risk and succession planning.
- **Value Creation**: Identify specific value creation levers with quantification guidance (e.g., "Pricing optimization: 2-3% price increase on 60% of revenue base = $1.8-2.7M revenue increase").
- **Due Diligence Focus**: Highlight areas requiring deeper investigation, prioritize by investment decision impact (deal-breakers vs nice-to-know).
- **Key Questions Detail**: Provide detailed, contextual questions (2-3 sentences each) explaining why each question matters for the investment decision.
- **Investment Thesis Detail**: Provide comprehensive analysis with specific examples, quantification where possible, and strategic rationale. Each item should include: what, why it matters, quantification if possible, investment impact.

DOCUMENT ANALYSIS APPROACH:
- Read the entire document systematically, paying special attention to financial tables, charts, appendices, and footnotes
- Cross-reference information across different sections for consistency (executive summary vs detailed sections vs appendices)
- Extract both explicit statements and implicit insights (read between the lines for risks, opportunities, competitive position)
- Focus on quantitative data while providing qualitative context and strategic interpretation
- Identify any inconsistencies or areas requiring clarification (note discrepancies and their potential significance)
- Consider industry context and market dynamics when evaluating opportunities and risks (benchmark against industry standards)
- Use document structure (headers, sections, page numbers) to locate and validate information
- Check footnotes for adjustments, definitions, exclusions, and important context
`;
}
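The focused-pass mechanism in `getCIMSystemPrompt` hinges on the `focusInstruction` ternary: no fields yields an empty string, otherwise a bulleted priority block is appended to the persona line. A minimal sketch of just that branch (the full instruction text is truncated here for brevity):

```typescript
// Isolated copy of the focusInstruction branch from getCIMSystemPrompt.
// The real template appends further instructions after the bullet list.
function buildFocusInstruction(focusedFields?: string[]): string {
  return focusedFields && focusedFields.length > 0
    ? `\n\nPRIORITY AREAS FOR THIS PASS (extract these thoroughly, but still extract ALL other fields):\n${focusedFields.map(f => `- ${f}`).join('\n')}`
    : '';
}

console.log(buildFocusInstruction(['Financial Performance', 'Management Team']));
console.log(JSON.stringify(buildFocusInstruction())); // ""
```

This lets a multi-pass extraction reuse one prompt builder: early passes pass no fields, later repair passes pass only the fields that came back as "Not specified in CIM".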
14
backend/src/services/llmPrompts/index.ts
Normal file
@@ -0,0 +1,14 @@
/**
 * LLM Prompt Builders
 * Centralized exports for all prompt builders
 *
 * Note: Due to the large size of prompt templates, individual prompt builders
 * are kept in llmService.ts for now. This file serves as a placeholder for
 * future modularization when prompts are fully extracted.
 */

// Re-export prompt builders when they are extracted
// For now, prompts remain in llmService.ts to maintain functionality

export { getCIMSystemPrompt } from './cimSystemPrompt';
38
backend/src/services/llmProviders/baseProvider.ts
Normal file
@@ -0,0 +1,38 @@
/**
 * Base LLM Provider Interface
 * Defines the contract for all LLM provider implementations
 */

import { LLMRequest, LLMResponse } from '../../types/llm';

/**
 * Base interface for LLM providers
 */
export interface ILLMProvider {
  call(request: LLMRequest): Promise<LLMResponse>;
}

/**
 * Base provider class with common functionality
 */
export abstract class BaseLLMProvider implements ILLMProvider {
  protected apiKey: string;
  protected defaultModel: string;
  protected maxTokens: number;
  protected temperature: number;

  constructor(
    apiKey: string,
    defaultModel: string,
    maxTokens: number,
    temperature: number
  ) {
    this.apiKey = apiKey;
    this.defaultModel = defaultModel;
    this.maxTokens = maxTokens;
    this.temperature = temperature;
  }

  abstract call(request: LLMRequest): Promise<LLMResponse>;
}
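A concrete provider only has to implement `call`. The sketch below is hypothetical: `EchoProvider` does not exist in the codebase, and the `LLMRequest`/`LLMResponse` shapes are simplified stand-ins for the real interfaces in `../../types/llm` (which carry more fields). The constructor is also condensed with parameter properties, whereas the source assigns fields manually; the behavior is the same.

```typescript
// Simplified stand-ins for the shapes imported from '../../types/llm'.
interface LLMRequest { prompt: string; model?: string; }
interface LLMResponse { success: boolean; content: string; error?: string; }

interface ILLMProvider {
  call(request: LLMRequest): Promise<LLMResponse>;
}

abstract class BaseLLMProvider implements ILLMProvider {
  constructor(
    protected apiKey: string,
    protected defaultModel: string,
    protected maxTokens: number,
    protected temperature: number
  ) {}
  abstract call(request: LLMRequest): Promise<LLMResponse>;
}

// Hypothetical provider: echoes the prompt instead of hitting an API,
// showing the shape a real Anthropic/OpenAI/OpenRouter subclass would take.
class EchoProvider extends BaseLLMProvider {
  async call(request: LLMRequest): Promise<LLMResponse> {
    const model = request.model ?? this.defaultModel;
    return { success: true, content: `[${model}] ${request.prompt}` };
  }
}

new EchoProvider('key', 'echo-1', 1000, 0.3)
  .call({ prompt: 'ping' })
  .then(r => console.log(r.content)); // [echo-1] ping
```

Because callers depend only on `ILLMProvider`, a stub like this can replace the real providers in tests without touching `llmService.ts`.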
11
backend/src/services/llmProviders/index.ts
Normal file
@@ -0,0 +1,11 @@
/**
 * LLM Provider Exports
 * Centralized exports for all LLM provider implementations
 */

// Providers will be exported here when extracted from llmService.ts
// For now, providers remain in llmService.ts to maintain functionality

export type { ILLMProvider } from './baseProvider';
export { BaseLLMProvider } from './baseProvider';
@@ -3,6 +3,7 @@ import { logger } from '../utils/logger';
 import { z } from 'zod';
 import { CIMReview, cimReviewSchema } from './llmSchemas';
 import { defaultCIMReview } from './unifiedDocumentProcessor';
+import { financialExtractionMonitoringService } from './financialExtractionMonitoringService';
 
 export interface LLMRequest {
   prompt: string;
@@ -112,7 +113,7 @@ class LLMService
       maxTokens: options?.maxTokens || 3000,
       temperature: options?.temperature !== undefined ? options.temperature : 0.3,
       model: options?.model || this.defaultModel
-    });
+    }, 'other');
 
     if (!response.success || !response.content) {
       throw new Error(response.error || 'LLM generation failed');
@@ -251,7 +252,7 @@ class LLMService
       model: selectedModel,
       maxTokens: config.llm.maxTokens,
       temperature: config.llm.temperature,
-    });
+    }, 'full_extraction');
 
     if (!response.success) {
       logger.error('LLM API call failed', {
@@ -357,7 +358,11 @@
   /**
    * Call the appropriate LLM API
    */
-  private async callLLM(request: LLMRequest): Promise<LLMResponse> {
+  private async callLLM(request: LLMRequest, endpoint: 'financial_extraction' | 'full_extraction' | 'other' = 'other'): Promise<LLMResponse> {
     const startTime = Date.now();
+    const model = request.model || this.defaultModel;
+    let rateLimitHit = false;
 
     try {
+      // Use configured timeout from config.llm.timeoutMs (default 6 minutes for complex analysis)
+      // Increased from 3 minutes to handle complex CIM analysis even with RAG reduction
@@ -373,7 +378,7 @@
       // CRITICAL DEBUG: Log which provider method we're calling
       logger.info('Calling LLM provider method', {
         provider: this.provider,
-        model: request.model || this.defaultModel,
+        model: model,
         willCallOpenRouter: this.provider === 'openrouter',
         willCallAnthropic: this.provider === 'anthropic',
         willCallOpenAI: this.provider === 'openai'
@@ -393,13 +398,57 @@
         }
       })();
 
-      return await Promise.race([llmPromise, timeoutPromise]);
+      const response = await Promise.race([llmPromise, timeoutPromise]);
+      const durationMs = Date.now() - startTime;
+
+      // Track API call asynchronously (non-blocking)
+      financialExtractionMonitoringService.trackApiCall({
+        provider: this.provider as 'anthropic' | 'openai' | 'openrouter',
+        model: model,
+        endpoint: endpoint,
+        durationMs: durationMs,
+        success: response.success,
+        rateLimitHit: rateLimitHit,
+        inputTokens: response.usage?.promptTokens,
+        outputTokens: response.usage?.completionTokens,
+        totalTokens: response.usage?.totalTokens,
+        costUsd: this.estimateCost(
+          (response.usage?.promptTokens || 0) + (response.usage?.completionTokens || 0),
+          model
+        ),
+        errorType: response.success ? undefined : 'api_error',
+        errorMessage: response.error,
+      }).catch(err => {
+        // Don't let monitoring failures break processing
+        logger.debug('Failed to track API call (non-critical)', { error: err.message });
+      });
+
+      return response;
     } catch (error) {
+      const durationMs = Date.now() - startTime;
+      const errorMessage = error instanceof Error ? error.message : 'Unknown error';
+      rateLimitHit = errorMessage.toLowerCase().includes('rate limit');
+
+      // Track failed API call asynchronously (non-blocking)
+      financialExtractionMonitoringService.trackApiCall({
+        provider: this.provider as 'anthropic' | 'openai' | 'openrouter',
+        model: model,
+        endpoint: endpoint,
+        durationMs: durationMs,
+        success: false,
+        rateLimitHit: rateLimitHit,
+        errorType: rateLimitHit ? 'rate_limit' : (errorMessage.includes('timeout') ? 'timeout' : 'api_error'),
+        errorMessage: errorMessage,
+      }).catch(err => {
+        // Don't let monitoring failures break processing
+        logger.debug('Failed to track API call (non-critical)', { error: err.message });
+      });
+
       logger.error('LLM API call failed', error);
       return {
         success: false,
         content: '',
-        error: error instanceof Error ? error.message : 'Unknown error',
+        error: errorMessage,
       };
     }
   }
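The catch path above classifies failures before handing them to the monitoring service: a case-insensitive `'rate limit'` match wins, then a (case-sensitive) `'timeout'` match, else `'api_error'`. A standalone sketch of that categorization; the function name `classifyLLMError` is mine, since the source inlines this logic in `callLLM`:

```typescript
// Mirrors the error categorization in callLLM's catch block.
type LLMErrorType = 'rate_limit' | 'timeout' | 'api_error';

function classifyLLMError(errorMessage: string): LLMErrorType {
  // Rate-limit detection lowercases first, as the diff does.
  const rateLimitHit = errorMessage.toLowerCase().includes('rate limit');
  if (rateLimitHit) return 'rate_limit';
  // Timeout detection is a plain (case-sensitive) substring check.
  if (errorMessage.includes('timeout')) return 'timeout';
  return 'api_error';
}

console.log(classifyLLMError('429: Rate limit exceeded'));      // rate_limit
console.log(classifyLLMError('Request timeout after 360000ms')); // timeout
console.log(classifyLLMError('500 Internal Server Error'));      // api_error
```

One consequence of the asymmetry worth noting: a provider message like "Request Timeout" (capital T) would fall through to `api_error` under this scheme.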
@@ -901,7 +950,7 @@
       ? `\n\nPRIORITY AREAS FOR THIS PASS (extract these thoroughly, but still extract ALL other fields):\n${focusedFields.map(f => `- ${f}`).join('\n')}\n\nFor this pass, prioritize extracting the fields listed above with extra thoroughness. However, you MUST still extract ALL fields in the template. Do NOT use "Not specified in CIM" for any field unless you have thoroughly searched the entire document and confirmed the information is truly not present. Be especially thorough in extracting all nested fields within the priority areas.`
       : '';
 
-    return `You are an expert investment analyst at BPCP (Blue Point Capital Partners) reviewing a Confidential Information Memorandum (CIM). Your task is to analyze CIM documents and return a comprehensive, structured JSON object that follows the BPCP CIM Review Template format EXACTLY.${focusInstruction}
+    return `You are a world-class private equity investment analyst at BPCP (Blue Point Capital Partners), operating at the analytical depth and rigor of top-tier PE firms (KKR, Blackstone, Apollo, Carlyle). Your task is to analyze Confidential Information Memorandums (CIMs) with the precision, depth, and strategic insight expected by BPCP's investment committee. Return a comprehensive, structured JSON object that follows the BPCP CIM Review Template format EXACTLY.${focusInstruction}
 
 CRITICAL REQUIREMENTS:
 1. **JSON OUTPUT ONLY**: Your entire response MUST be a single, valid JSON object. Do not include any text or explanation before or after the JSON object.
@@ -915,24 +964,91 @@ CRITICAL REQUIREMENTS:
|
||||
9. **FINANCIAL DATA**: For financial metrics, use actual numbers if available, otherwise use "Not specified in CIM".
|
||||
10. **VALID JSON**: Ensure your response is valid JSON that can be parsed without errors.
|
||||
|
||||
FINANCIAL VALIDATION FRAMEWORK:
|
||||
Before finalizing any financial extraction, you MUST perform these validation checks:
|
||||
|
||||
**Magnitude Validation**:
|
||||
- Revenue should typically be $10M+ for target companies (if less, verify you're using the PRIMARY table, not a subsidiary)
|
||||
- EBITDA should typically be $1M+ and positive for viable targets
|
||||
- If FY-3 revenue is $64M, FY-2 should be similar magnitude (e.g., $50M-$90M), not $2.9M or $10 - this indicates column misalignment
|
||||
|
||||
**Trend Validation**:
|
||||
- Revenue should generally increase or be stable year-over-year (FY-3 → FY-2 → FY-1)
|
||||
- Large sudden drops (>50%) or increases (>200%) may indicate misaligned columns or wrong table
|
||||
- EBITDA should follow similar trends to revenue (unless margin expansion/contraction is explicitly explained)
|
||||
|
||||
**Cross-Period Consistency**:
|
||||
- If FY-3 revenue = $64M and FY-2 revenue = $71M, growth should be ~11% (not 1000% or -50%)
|
||||
- Margins should be relatively stable across periods (within 10-15 percentage points unless explained)
|
||||
- EBITDA margins should be 5-50% (typical range), gross margins 20-80%
|
||||
|
||||
**Multi-Table Cross-Reference**:
|
||||
- Cross-reference primary table with executive summary financial highlights
|
||||
- Verify consistency between detailed financials and summary tables
|
||||
- Check appendices for additional financial detail or adjustments
|
||||
- If discrepancies exist, note them and use the most authoritative source (typically the detailed historical table)
|
||||
|
||||
**Calculation Validation**:
|
||||
- Verify revenue growth percentages match: ((Current - Prior) / Prior) * 100
|
||||
- Verify margins match: (Metric / Revenue) * 100
|
||||
- If calculations don't match, use the explicitly stated values from the table
|
||||
|
||||
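The validation framework above can be sketched in code. This is a minimal illustration, not part of the diffed file: the function names, the warnings-instead-of-errors design, and the exact warning wording are assumptions; the thresholds come directly from the rules above.

```typescript
// Minimal sketch of the validation framework; all values are in $M.
interface PeriodFinancials {
  revenue: number; // $M
  ebitda: number;  // $M
}

// ((Current - Prior) / Prior) * 100
function yoyGrowthPct(prior: number, current: number): number {
  return ((current - prior) / prior) * 100;
}

// (Metric / Revenue) * 100
function marginPct(metric: number, revenue: number): number {
  return (metric / revenue) * 100;
}

// Magnitude, margin-range, and trend checks; returns warnings so a reviewer
// can decide whether a flagged value is a real error or an explained outlier.
function validatePeriods(periods: PeriodFinancials[]): string[] {
  const warnings: string[] = [];
  periods.forEach((p, i) => {
    if (p.revenue < 10) warnings.push(`period ${i}: revenue $${p.revenue}M < $10M (subsidiary table?)`);
    if (p.ebitda < 1) warnings.push(`period ${i}: EBITDA $${p.ebitda}M < $1M`);
    const m = marginPct(p.ebitda, p.revenue);
    if (m < 5 || m > 50) warnings.push(`period ${i}: EBITDA margin ${m.toFixed(1)}% outside 5-50%`);
    if (i > 0) {
      const g = yoyGrowthPct(periods[i - 1].revenue, p.revenue);
      if (g < -50 || g > 200) warnings.push(`period ${i}: growth ${g.toFixed(1)}% suggests column misalignment`);
    }
  });
  return warnings;
}
```

For instance, the $64M → $2.9M misalignment example above trips both the magnitude check and the >50% drop check.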
PE INVESTOR PERSONA & METHODOLOGY:
You operate with the analytical rigor and strategic depth of top-tier private equity firms. Your analysis should demonstrate:

**Value Creation Focus**:
- Identify specific, quantifiable value creation opportunities (e.g., "Margin expansion of 200-300 bps through pricing optimization and cost reduction, potentially adding $2-3M EBITDA")
- Assess operational improvement potential (supply chain, technology, human capital)
- Evaluate M&A and add-on acquisition potential with specific rationale
- Quantify potential impact where possible (EBITDA improvement, revenue growth, multiple expansion)

**Risk Assessment Depth**:
- Categorize risks by type: operational, financial, market, execution, regulatory, technology
- Assess both probability and impact (high/medium/low)
- Identify mitigating factors and management's risk management approach
- Distinguish between deal-breakers and manageable risks

**Strategic Analysis Frameworks**:
- **Porter's Five Forces**: Assess competitive intensity, supplier power, buyer power, threat of substitutes, threat of new entrants
- **SWOT Analysis**: Synthesize strengths, weaknesses, opportunities, threats from the CIM
- **Value Creation Playbook**: Revenue growth (organic/inorganic), margin expansion, operational improvements, multiple expansion
- **Comparable Analysis**: Reference industry benchmarks, comparable company multiples, recent transaction multiples where mentioned

**Industry Context Integration**:
- Reference industry-specific metrics and benchmarks (e.g., SaaS: ARR growth, churn, CAC payback; Manufacturing: inventory turns, days sales outstanding)
- Consider sector-specific risks and opportunities (regulatory changes, technology disruption, consolidation trends)
- Evaluate market position relative to industry standards (market share, growth vs market, margin vs peers)

COMMON MISTAKES TO AVOID:
1. **Subsidiary vs Parent Table Confusion**: Primary table shows values in millions ($64M), subsidiary tables show thousands ($20,546). Always use the PRIMARY table.
2. **Column Misalignment**: Count columns carefully - ensure values align with their period columns. Verify trends make sense.
3. **Projections vs Historical**: Ignore tables marked with "E", "P", "PF", "Projected", "Forecast" - only extract historical data.
4. **Unit Confusion**: "$20,546 (in thousands)" = $20.5M, not $20,546M. Always check table footnotes for units.
5. **Missing Cross-Validation**: Don't extract financials in isolation - cross-reference with executive summary, narrative text, appendices.
6. **Generic Analysis**: Avoid generic statements like "strong management team" - provide specific details (years of experience, track record, specific achievements).
7. **Incomplete Risk Assessment**: Don't just list risks - assess impact, probability, and mitigations. Categorize by type.
8. **Vague Value Creation**: Instead of "operational improvements", specify "reduce SG&A by 150 bps through shared services consolidation, adding $1.5M EBITDA".
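The unit-confusion mistake (#4 above) is mechanical enough to sketch as a helper. The function name is an assumption for illustration; the conversion itself follows the rule "$20,546 (in thousands)" = $20.5M.

```typescript
// Illustrative helper for mistake #4: converting a thousands-denominated
// table value into the "$XX.XM" format used elsewhere in this prompt.
function thousandsToMillions(raw: string): string {
  const thousands = Number(raw.replace(/[$,]/g, "")); // "$20,546" -> 20546
  return `$${(thousands / 1000).toFixed(1)}M`;        // -> "$20.5M"
}
```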
ANALYSIS QUALITY REQUIREMENTS:
- **Financial Precision**: Extract exact financial figures, percentages, and growth rates. Calculate CAGR where possible.
- **Competitive Intelligence**: Identify specific competitors, market positions, and competitive advantages.
- **Risk Assessment**: Evaluate both stated and implied risks, including operational, financial, and market risks.
- **Growth Drivers**: Identify specific revenue growth drivers, market expansion opportunities, and operational improvements.
- **Management Quality**: Assess management experience, track record, and post-transaction intentions.
- **Value Creation**: Identify specific value creation levers that align with BPCP's expertise.
- **Due Diligence Focus**: Highlight areas requiring deeper investigation and specific questions for management.
- **Key Questions Detail**: Provide detailed, contextual questions and next steps. Avoid brief bullet points - write in full sentences with proper explanation of context and investment significance.
- **Investment Thesis Detail**: Provide comprehensive analysis of attractions, risks, value creation opportunities, and strategic alignment. Avoid brief bullet points - write in full sentences with proper context and investment significance.
- **Financial Precision**: Extract exact financial figures, percentages, and growth rates. Calculate CAGR where possible. Validate all calculations.
- **Competitive Intelligence**: Identify specific competitors with market share context, competitive positioning (leader/follower/niche), and differentiation drivers.
- **Risk Assessment**: Evaluate both stated and implied risks, categorize by type, assess impact and probability, identify mitigations.
- **Growth Drivers**: Identify specific revenue growth drivers with quantification (e.g., "New product line launched in 2023, contributing $5M revenue in FY-1").
- **Management Quality**: Assess management experience with specific details (years in role, prior companies, track record), evaluate retention risk and succession planning.
- **Value Creation**: Identify specific value creation levers with quantification guidance (e.g., "Pricing optimization: 2-3% price increase on 60% of revenue base = $1.8-2.7M revenue increase").
- **Due Diligence Focus**: Highlight areas requiring deeper investigation, prioritize by investment decision impact (deal-breakers vs nice-to-know).
- **Key Questions Detail**: Provide detailed, contextual questions (2-3 sentences each) explaining why each question matters for the investment decision.
- **Investment Thesis Detail**: Provide comprehensive analysis with specific examples, quantification where possible, and strategic rationale. Each item should include: what, why it matters, quantification if possible, investment impact.
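The "Calculate CAGR where possible" requirement reduces to one formula; a minimal sketch, assuming both values are in the same units and `years` is the number of periods between them:

```typescript
// CAGR as a percentage: ((end / start)^(1 / years) - 1) * 100
function cagrPct(startValue: number, endValue: number, years: number): number {
  return (Math.pow(endValue / startValue, 1 / years) - 1) * 100;
}
```

For example, revenue growing from $58M to $71M over two fiscal years is roughly a 10.6% CAGR.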
DOCUMENT ANALYSIS APPROACH:
- Read the entire document carefully, paying special attention to financial tables, charts, and appendices
- Cross-reference information across different sections for consistency
- Extract both explicit statements and implicit insights
- Focus on quantitative data while providing qualitative context
- Identify any inconsistencies or areas requiring clarification
- Consider industry context and market dynamics when evaluating opportunities and risks
- Read the entire document systematically, paying special attention to financial tables, charts, appendices, and footnotes
- Cross-reference information across different sections for consistency (executive summary vs detailed sections vs appendices)
- Extract both explicit statements and implicit insights (read between the lines for risks, opportunities, competitive position)
- Focus on quantitative data while providing qualitative context and strategic interpretation
- Identify any inconsistencies or areas requiring clarification (note discrepancies and their potential significance)
- Consider industry context and market dynamics when evaluating opportunities and risks (benchmark against industry standards)
- Use document structure (headers, sections, page numbers) to locate and validate information
- Check footnotes for adjustments, definitions, exclusions, and important context
`;
}

@@ -952,17 +1068,17 @@ Please correct these errors and generate a new, valid JSON object. Pay close att
const jsonTemplate = `{
"dealOverview": {
"targetCompanyName": "Target Company Name",
"industrySector": "Industry/Sector",
"geography": "Geography (HQ & Key Operations)",
"dealSource": "Deal Source",
"transactionType": "Transaction Type",
"dateCIMReceived": "Date CIM Received",
"dateReviewed": "Date Reviewed",
"reviewers": "Reviewer(s)",
"cimPageCount": "CIM Page Count",
"statedReasonForSale": "Stated Reason for Sale (if provided)",
"employeeCount": "Number of employees (if stated in document)"
"targetCompanyName": "Target Company Name", // Format: Use exact legal entity name (e.g., "ABC Company, Inc." not just "ABC Company")
"industrySector": "Industry/Sector", // Format: Specific classification (e.g., "Specialty Chemicals" not just "Chemicals", "B2B Software/SaaS" not just "Software")
"geography": "Geography (HQ & Key Operations)", // Format: "City, State" (e.g., "Cleveland, OH" not just "Cleveland"). Include multiple locations if mentioned.
"dealSource": "Deal Source", // Format: Investment bank or firm name (e.g., "Harris Williams", "Capstone Partners"). Look in cover page, headers, footers, contact pages.
"transactionType": "Transaction Type", // Format: Examples: "Control Buyout", "Minority Investment", "Growth Equity", "Recapitalization"
"dateCIMReceived": "Date CIM Received", // Format: "YYYY-MM-DD" or "Month DD, YYYY" (e.g., "2024-03-15" or "March 15, 2024")
"dateReviewed": "Date Reviewed", // Format: "YYYY-MM-DD" or "Month DD, YYYY"
"reviewers": "Reviewer(s)", // Format: Comma-separated names (e.g., "John Smith, Jane Doe")
"cimPageCount": "CIM Page Count", // Format: Number only (e.g., "45" not "45 pages")
"statedReasonForSale": "Stated Reason for Sale (if provided)", // Format: Full sentence or paragraph explaining reason
"employeeCount": "Number of employees (if stated in document)" // Format: Number only (e.g., "250" not "approximately 250 employees")
},
"businessDescription": {
"coreOperationsSummary": "Core Operations Summary (3-5 sentences)",
@@ -991,36 +1107,36 @@ Please correct these errors and generate a new, valid JSON object. Pay close att
"financialSummary": {
"financials": {
"fy3": {
"revenue": "Revenue amount for FY-3 (oldest historical year, typically 3 years ago)",
"revenueGrowth": "N/A (baseline year)",
"grossProfit": "Gross profit amount for FY-3",
"grossMargin": "Gross margin % for FY-3",
"ebitda": "EBITDA amount for FY-3",
"ebitdaMargin": "EBITDA margin % for FY-3"
"revenue": "Revenue amount for FY-3 (oldest historical year, typically 3 years ago)", // Format: "$XX.XM" (e.g., "$64.2M"). Must be $10M+ for target companies. If <$10M, likely wrong table (subsidiary table with values in thousands). Examples: "$64.2M" ✓, "$1.2B" ✓, "$2.9M" ✗ (too low, wrong table), "$64,200,000" ✗ (wrong format)
"revenueGrowth": "N/A (baseline year)", // Format: "N/A" for baseline year. Validation: FY-3 should always be "N/A" for revenue growth (it's the baseline). Do NOT calculate growth for FY-3. Examples: "N/A" ✓, "0%" ✗ (wrong - use N/A), "16.8%" ✗ (wrong - FY-3 has no prior year)
"grossProfit": "Gross profit amount for FY-3", // Format: "$XX.XM" or "$XX.XB". Validation: Should be positive and less than revenue.
"grossMargin": "Gross margin % for FY-3", // Format: "XX.X%" (e.g., "40.0%"). Validation: Should be 20-80% typical range. Calculate: (Gross Profit / Revenue) * 100 if not stated.
"ebitda": "EBITDA amount for FY-3", // Format: "$XX.XM" or "$XX.XB". Validation: Should be $1M+ and positive for viable targets. Should be less than revenue.
"ebitdaMargin": "EBITDA margin % for FY-3" // Format: "XX.X%" (e.g., "29.7%"). Validation: Should be 5-50% typical range. Calculate: (EBITDA / Revenue) * 100 if not stated. Cross-validate with stated margin.
},
"fy2": {
"revenue": "Revenue amount for FY-2 (2 years ago)",
"revenueGrowth": "Revenue growth % for FY-2 (year-over-year from FY-3)",
"grossProfit": "Gross profit amount for FY-2",
"grossMargin": "Gross margin % for FY-2",
"ebitda": "EBITDA amount for FY-2",
"ebitdaMargin": "EBITDA margin % for FY-2"
"revenue": "Revenue amount for FY-2 (2 years ago)", // Format: "$XX.XM" (e.g., "$71.0M"). Validation: Should be similar magnitude to FY-3 (e.g., if FY-3=$64M, FY-2 should be $50M-$90M, not $2.9M). Trend check: Should generally increase or be stable. Examples: "$71.0M" ✓ (if FY-3=$64M), "$2.9M" ✗ (wrong magnitude, column misalignment)
"revenueGrowth": "Revenue growth % for FY-2 (year-over-year from FY-3)", // Format: "XX.X%" or "(XX.X)%" for negative. Calculate if not provided: ((FY-2 - FY-3) / FY-3) * 100. Examples: "16.8%" ✓ (if FY-3=$64M, FY-2=$71M), "(4.4)%" ✓ (negative growth), "1000%" ✗ (unrealistic, likely misalignment), "N/A" ✗ (wrong - only FY-3 should be N/A)
"grossProfit": "Gross profit amount for FY-2", // Format: "$XX.XM" or "$XX.XB". Validation: Should follow revenue trends.
"grossMargin": "Gross margin % for FY-2", // Format: "XX.X%". Validation: Should be relatively stable (within 10-15pp of FY-3 unless explained).
"ebitda": "EBITDA amount for FY-2", // Format: "$XX.XM" or "$XX.XB". Validation: Should follow revenue trends.
"ebitdaMargin": "EBITDA margin % for FY-2" // Format: "XX.X%". Validation: Should be relatively stable (within 10-15pp of FY-3 unless explained). Cross-validate calculation.
},
"fy1": {
"revenue": "Revenue amount for FY-1 (1 year ago, most recent full fiscal year)",
"revenueGrowth": "Revenue growth % for FY-1 (year-over-year from FY-2)",
"grossProfit": "Gross profit amount for FY-1",
"grossMargin": "Gross margin % for FY-1",
"ebitda": "EBITDA amount for FY-1",
"ebitdaMargin": "EBITDA margin % for FY-1"
"revenue": "Revenue amount for FY-1 (1 year ago, most recent full fiscal year)", // Format: "$XX.XM" (e.g., "$71.0M"). Validation: Should be similar magnitude to FY-2. Trend check: Should generally increase or be stable from FY-2. Examples: "$71.0M" ✓ (if FY-2=$71M), "$10" ✗ (wrong format, missing M), "$71M revenue" ✗ (extra text)
"revenueGrowth": "Revenue growth % for FY-1 (year-over-year from FY-2)", // Format: "XX.X%" or "(XX.X)%". Calculate if not provided: ((FY-1 - FY-2) / FY-2) * 100. Cross-validate with stated growth. Examples: "0.0%" ✓ (no growth), "15.9%" ✓, "(4.4)%" ✓ (negative), "16.8 percent" ✗ (wrong format - use %)
"grossProfit": "Gross profit amount for FY-1", // Format: "$XX.XM" or "$XX.XB". Validation: Should follow revenue trends.
"grossMargin": "Gross margin % for FY-1", // Format: "XX.X%". Validation: Should be relatively stable across periods.
"ebitda": "EBITDA amount for FY-1", // Format: "$XX.XM" or "$XX.XB". Validation: Should follow revenue trends, typically positive.
"ebitdaMargin": "EBITDA margin % for FY-1" // Format: "XX.X%". Validation: Should be relatively stable. Cross-validate calculation with revenue and EBITDA.
},
"ltm": {
"revenue": "Revenue amount for LTM (Last Twelve Months, most recent trailing period)",
"revenueGrowth": "Revenue growth % for LTM (year-over-year from FY-1)",
"grossProfit": "Gross profit amount for LTM",
"grossMargin": "Gross margin % for LTM",
"ebitda": "EBITDA amount for LTM",
"ebitdaMargin": "EBITDA margin % for LTM"
"revenue": "Revenue amount for LTM (Last Twelve Months, most recent trailing period)", // Format: "$XX.XM" (e.g., "$76.0M"). Validation: Should be similar magnitude to FY-1. May be higher or lower depending on recent performance. Examples: "$76.0M" ✓ (if FY-1=$71M), "$76M" ✓, "76 million" ✗ (wrong format)
"revenueGrowth": "Revenue growth % for LTM (year-over-year from FY-1)", // Format: "XX.X%" or "(XX.X)%". Calculate if not provided: ((LTM - FY-1) / FY-1) * 100. Note: LTM may span different time period than FY-1. Examples: "7.0%" ✓, "(2.5)%" ✓ (negative), "N/A" ✗ (calculate if possible)
"grossProfit": "Gross profit amount for LTM", // Format: "$XX.XM" or "$XX.XB". Validation: Should follow revenue trends.
"grossMargin": "Gross margin % for LTM", // Format: "XX.X%". Validation: Should be relatively stable.
"ebitda": "EBITDA amount for LTM", // Format: "$XX.XM" or "$XX.XB". Validation: Should follow revenue trends.
"ebitdaMargin": "EBITDA margin % for LTM" // Format: "XX.X%". Validation: Should be relatively stable. Cross-validate calculation.
}
},
"qualityOfEarnings": "Quality of earnings/adjustments impression",
@@ -1037,17 +1153,17 @@ Please correct these errors and generate a new, valid JSON object. Pay close att
"organizationalStructure": "Organizational Structure Overview (Impression)"
},
"preliminaryInvestmentThesis": {
"keyAttractions": "Key Attractions / Strengths (Why Invest?) - Provide 5-8 detailed strengths and attractions. For each, explain the specific advantage, provide context from the CIM, and explain why it makes this an attractive investment opportunity. Focus on competitive advantages, market position, and growth potential.",
"potentialRisks": "Potential Risks / Concerns (Why Not Invest?) - Identify 5-8 specific risks and concerns. For each risk, explain the nature of the risk, its potential impact on the investment, and any mitigating factors mentioned in the CIM. Consider operational, financial, market, and execution risks.",
"valueCreationLevers": "Initial Value Creation Levers (How PE Adds Value) - List 5-8 specific value creation opportunities. For each lever, explain how BPCP's expertise and resources could create value, provide specific examples of potential improvements, and estimate the potential impact on EBITDA or growth.",
"alignmentWithFundStrategy": "Alignment with Fund Strategy - Provide a comprehensive analysis of alignment with BPCP's strategy. Address: EBITDA range fit (5+MM), industry focus (consumer/industrial), geographic preferences (Cleveland/Charlotte driving distance), value creation expertise (M&A, technology, supply chain, human capital), and founder/family ownership. Explain specific areas of strategic fit and any potential misalignments."
"keyAttractions": "Key Attractions / Strengths (Why Invest?) - Provide 5-8 detailed strengths and attractions. Format: Numbered list "1. ... 2. ..." with each item 2-3 sentences. For each, explain the specific advantage, provide context from the CIM, include quantification (numbers, percentages, metrics), and explain why it makes this an attractive investment opportunity. Focus on competitive advantages, market position, and growth potential. Example: "1. Market-leading position with 25% market share in the $2.5B specialty chemicals market, providing pricing power and competitive moat. This supports 2-3x revenue growth potential through market expansion and pricing optimization."",
"potentialRisks": "Potential Risks / Concerns (Why Not Invest?) - Identify 5-8 specific risks and concerns. Format: Numbered list "1. ... 2. ..." with each item 2-3 sentences. Categorize by type (operational, financial, market, execution, regulatory, technology). For each risk, explain the nature, assess probability (High/Medium/Low) and impact (High/Medium/Low), identify mitigations, and indicate if deal-breaker. Include specific examples from CIM. Example: "1. Customer concentration risk (Operational): Top 3 customers represent 45% of revenue, creating significant revenue risk if any customer is lost. Probability: Medium, Impact: High. Mitigation: Management has long-term contracts with these customers. Deal-breaker: No, but requires careful due diligence."",
"valueCreationLevers": "Initial Value Creation Levers (How PE Adds Value) - List 5-8 specific value creation opportunities. Format: Numbered list "1. ... 2. ..." with each item 2-3 sentences. For each lever, specify the opportunity, quantify potential impact (dollars or percentages), explain implementation approach, provide timeline, and indicate confidence level. Example: "1. Margin expansion through pricing optimization: 2-3% price increase on 60% of revenue base could add $1.8-2.7M revenue. Implementation: Leverage BPCP's pricing expertise and market analysis. Timeline: 12-18 months. Confidence: High based on strong market position."",
"alignmentWithFundStrategy": "Alignment with Fund Strategy - Provide a comprehensive analysis of alignment with BPCP's strategy. Address: EBITDA range fit (5+MM) with score 1-10, industry focus (consumer/industrial) with score 1-10, geographic preferences (Cleveland/Charlotte driving distance) with score 1-10, value creation expertise (M&A, technology, supply chain, human capital) with score 1-10, and founder/family ownership with score 1-10. Provide overall alignment score and explain specific areas of strategic fit and any potential misalignments."
},
"keyQuestionsNextSteps": {
"criticalQuestions": "Critical Questions Arising from CIM Review - Provide 5-8 specific, detailed questions that require deeper investigation. Each question should be 2-3 sentences explaining the context and why it's important for the investment decision.",
"missingInformation": "Key Missing Information / Areas for Diligence Focus - List 5-8 specific areas where additional information is needed. For each area, explain what information is missing, why it's critical, and how it would impact the investment decision.",
"preliminaryRecommendation": "Preliminary Recommendation - Provide a clear recommendation (Proceed, Pass, or Proceed with Caution) with brief justification.",
"rationaleForRecommendation": "Rationale for Recommendation (Brief) - Provide 3-4 key reasons supporting your recommendation, focusing on the most compelling factors.",
"proposedNextSteps": "Proposed Next Steps - List 5-8 specific, actionable next steps in order of priority. Each step should include who should be involved and the expected timeline."
"criticalQuestions": "Critical Questions Arising from CIM Review - Provide 5-8 specific, detailed questions. Format: Numbered list "1. ... 2. ..." with each item 2-3 sentences. Each question should explain the context, why it's important for the investment decision, and indicate priority (Deal-breaker, High, Medium, Nice-to-know). Example: "1. What is the customer retention rate for contracts expiring in the next 12 months? This is critical because 30% of revenue comes from contracts expiring in the next year, and retention rate will significantly impact revenue projections and valuation. Priority: High Impact."",
"missingInformation": "Key Missing Information / Areas for Diligence Focus - List 5-8 specific areas. Format: Numbered list "1. ... 2. ..." with each item 2-3 sentences. For each area, explain what information is missing (be specific), why it's critical, how it would impact the investment decision, and indicate priority. Example: "1. Detailed breakdown of revenue by customer segment with historical trends. This is critical because understanding segment growth rates and profitability is essential for revenue projections and valuation. Missing this information makes it difficult to assess growth sustainability. Priority: High Impact."",
"preliminaryRecommendation": "Preliminary Recommendation - Format: One of "Proceed", "Pass", or "Proceed with Caution". Provide clear recommendation with brief justification focusing on most compelling factors.",
"rationaleForRecommendation": "Rationale for Recommendation (Brief) - Format: 3-4 sentences or bullet points. Provide 3-4 key reasons supporting your recommendation, focusing on the most compelling factors (investment attractions, risks, strategic fit, value creation potential).",
"proposedNextSteps": "Proposed Next Steps - Format: Numbered list "1. ... 2. ..." with 5-8 items, each 2-3 sentences. List specific, actionable next steps in order of priority. Each step should include who should be involved and the expected timeline. Example: "1. Schedule management call to discuss customer retention and contract renewal pipeline. Involve: Investment team lead, deal sponsor. Timeline: Within 1 week.""
}
}`;
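Requirement 1 ("JSON OUTPUT ONLY") and the correction prompt referenced in the hunk headers imply a parse-and-retry loop around the model call. A minimal sketch of that loop, assuming a `callModel` helper that is not shown in this diff:

```typescript
// Parse the model's response as JSON; on failure, re-prompt with the parse
// error up to maxAttempts times. `callModel` is an assumed helper, not a
// function defined in this file.
async function parseWithRetry(
  callModel: (prompt: string) => Promise<string>,
  prompt: string,
  maxAttempts = 3,
): Promise<unknown> {
  let lastError = "";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const fullPrompt = attempt === 0
      ? prompt
      : `${prompt}\n\nPrevious output was invalid JSON (${lastError}). Please correct these errors and generate a new, valid JSON object.`;
    const raw = await callModel(fullPrompt);
    try {
      return JSON.parse(raw);
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`Model did not return valid JSON after ${maxAttempts} attempts: ${lastError}`);
}
```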
@@ -1154,6 +1270,76 @@ Correct Extraction:
- LTM = LTM Mar-25 = $76M revenue, $27M EBITDA
- IGNORE 2025E (projection, marked with "E")

**Example 5: Only 2 Periods (Edge Case)**
Table Header: "2023 2024"
Revenue Row: "$64M $71M"
EBITDA Row: "$19M $24M"

Correct Extraction:
- FY-3 = Not specified in CIM (only 2 years provided)
- FY-2 = 2023 = $64M revenue, $19M EBITDA (older year)
- FY-1 = 2024 = $71M revenue, $24M EBITDA (most recent year)
- LTM = Not specified in CIM (no LTM column)

**Example 6: Only 3 Periods (Edge Case)**
Table Header: "2022 2023 2024"
Revenue Row: "$58M $64M $71M"
EBITDA Row: "$17M $19M $24M"

Correct Extraction:
- FY-3 = 2022 = $58M revenue, $17M EBITDA (oldest year)
- FY-2 = 2023 = $64M revenue, $19M EBITDA (middle year)
- FY-1 = 2024 = $71M revenue, $24M EBITDA (most recent year)
- LTM = Not specified in CIM (no LTM column)

**Example 7: Thousands Format with Conversion**
Table Header: "2021 2022 2023 2024"
Note: "(All amounts in thousands)"
Revenue Row: "$45,200 $52,800 $61,200 $58,500"
EBITDA Row: "$8,500 $10,200 $12,100 $11,500"

Correct Extraction (convert to millions):
- FY-3 = 2021 = $45.2M revenue, $8.5M EBITDA
- FY-2 = 2022 = $52.8M revenue, $10.2M EBITDA
- FY-1 = 2023 = $61.2M revenue, $12.1M EBITDA
- LTM = 2024 = $58.5M revenue, $11.5M EBITDA

**Example 8: Negative Values in Parentheses**
Table Header: "FY-3 FY-2 FY-1 LTM"
Revenue Row: "$64M $71M $71M $76M"
Revenue Growth: "N/A 10.9% 0.0% 7.0%"
EBITDA Row: "$19M $24M $24M $27M"
EBITDA Margin: "29.7% 33.8% 33.8% 35.5%"

Correct Extraction:
- FY-3 = $64M revenue, $19M EBITDA, 29.7% EBITDA margin, N/A revenue growth
- FY-2 = $71M revenue, $24M EBITDA, 33.8% EBITDA margin, 10.9% revenue growth
- FY-1 = $71M revenue, $24M EBITDA, 33.8% EBITDA margin, 0.0% revenue growth
- LTM = $76M revenue, $27M EBITDA, 35.5% EBITDA margin, 7.0% revenue growth

**Example 9: Fiscal Year End Different from Calendar Year**
Table Header: "FYE Mar 2022 FYE Mar 2023 FYE Mar 2024 LTM Jun 2024"
Revenue Row: "$58M $64M $71M $76M"
EBITDA Row: "$17M $19M $24M $27M"

Correct Extraction (use fiscal years, not calendar):
- FY-3 = FYE Mar 2022 = $58M revenue, $17M EBITDA
- FY-2 = FYE Mar 2023 = $64M revenue, $19M EBITDA
- FY-1 = FYE Mar 2024 = $71M revenue, $24M EBITDA
- LTM = LTM Jun 2024 = $76M revenue, $27M EBITDA

**Example 10: Pro Forma vs Historical (Use Historical Only)**
HISTORICAL TABLE (Use This):
Table Header: "2021 2022 2023 2024"
Revenue Row: "$45.2M $52.8M $61.2M $58.5M"

PRO FORMA TABLE (Ignore - Shows Adjusted/Projected):
Table Header: "2021 2022 2023 2024"
Revenue Row: "$48.5M $55.2M $64.1M $61.8M"
Note: "Pro forma includes acquisition of XYZ Corp"

Correct Extraction: Use HISTORICAL table only. Ignore pro forma adjustments.
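The period-mapping rule these examples illustrate (most recent historical year → FY-1, older years fill FY-2/FY-3, LTM kept, projection columns dropped) can be sketched as follows. The function name and the column-label heuristics are assumptions about typical CIM table headers, not part of the diffed prompt:

```typescript
// Map raw table column headers to the fy1/fy2/fy3/ltm slots used above.
function mapPeriods(columns: string[]): Record<string, string> {
  // Drop anything marked "E"/"P"/"PF"/"Projected"/"Forecast" (projections).
  const isProjection = (c: string) => /\b(\d{4}\s*(E|P|PF)|Projected|Forecast)\b/i.test(c);
  const ltm = columns.find((c) => /LTM|TTM/i.test(c));
  const historicalYears = columns.filter((c) => c !== ltm && !isProjection(c));
  const mapped: Record<string, string> = {};
  const labels = ["fy1", "fy2", "fy3"]; // newest first
  historicalYears.slice(-3).reverse().forEach((year, i) => { mapped[labels[i]] = year; });
  if (ltm) mapped.ltm = ltm;
  return mapped; // missing slots stay absent ("Not specified in CIM")
}
```

With only two year columns (Example 5) this yields fy1/fy2 and leaves fy3 unset; with a mixed header like "2023 2024 LTM Mar-25 2025E" (Example 4) it drops the 2025E projection.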
DETAILED ANALYSIS INSTRUCTIONS:
|
||||
1. **Financial Analysis**: Extract exact revenue, EBITDA, and margin figures from the PRIMARY historical financial table. Calculate growth rates and trends. Note any adjustments or add-backs.
|
||||
2. **Competitive Position**: Identify specific competitors, market share, and competitive advantages. Assess barriers to entry.
|
||||
@@ -1176,45 +1362,74 @@ ${jsonTemplate}
|
||||
|
||||
IMPORTANT: Replace all placeholder text with actual information from the CIM document. If information is not available, use "Not specified in CIM". Ensure all financial metrics are properly formatted as strings. Provide detailed, actionable insights suitable for investment decision-making.
|
||||
|
||||
CRITICAL FINANCIAL EXTRACTION RULES:
|
||||
STRUCTURED EXTRACTION WORKFLOW:
|
||||
|
||||
**Step 1: Find the Right Table**
|
||||
- Look for tables showing the TARGET COMPANY's historical financial performance
|
||||
- Tables may be labeled: "Financial Summary", "Historical Financials", "Income Statement", "P&L", "Financial Performance"
|
||||
- IGNORE: Market projections, industry benchmarks, competitor data, forward-looking estimates
|
||||
**Phase 1: Document Structure Analysis**
|
||||
1. Identify document sections using headers, page numbers, and table of contents
|
||||
2. Locate key sections: Executive Summary, Financial Summary, Market Analysis, Business Description, Management Team, Appendices
|
||||
3. Note page numbers for financial tables, charts, and key data points
|
||||
4. Identify document metadata (company name, dates, deal source) from cover page, headers, footers
|
||||
|
||||
**Step 2: Identify Periods (Flexible Approach)**
|
||||
Financial tables can have different formats. Here's how to map them:
|
||||
**Phase 2: Financial Data Extraction (with Cross-Validation)**
|
||||
1. Locate PRIMARY historical financial table (see financial extraction rules below)
|
||||
2. Extract financial metrics from primary table
|
||||
3. Cross-reference with executive summary financial highlights
|
||||
4. Verify consistency between detailed financials and summary statements
|
||||
5. Check appendices for additional financial detail or adjustments
|
||||
6. Validate calculations (growth rates, margins) for mathematical consistency
|
||||
7. If discrepancies exist, use the most authoritative source (typically detailed historical table)
|
||||
|
||||
*Format A: Years shown (2021, 2022, 2023, 2024)*
|
||||
- FY-3 = Oldest year (e.g., 2021 or 2022)
|
||||
- FY-2 = Second oldest year (e.g., 2022 or 2023)
|
||||
- FY-1 = Most recent full fiscal year (e.g., 2023 or 2024)
|
||||
- LTM = Look for "LTM", "TTM", "Last Twelve Months", or trailing period
|
||||
**Phase 3: Business & Market Analysis**
1. Extract business description from multiple sections (overview, operations, products/services)
2. Cross-reference customer information across sections (customer base, concentration, contracts)
3. Extract market data (TAM/SAM, growth rates, trends) from market analysis section
4. Identify competitive landscape from multiple mentions (competitor list, market position, differentiation)
5. Validate market size claims against industry benchmarks where possible

*Format B: Periods shown (FY-3, FY-2, FY-1, LTM)*
- Use them directly as labeled

**Phase 4: Investment Analysis Synthesis**
1. Synthesize investment attractions from financial performance, market position, competitive advantages
2. Identify risks from multiple sources (risk section, financial analysis, market dynamics, operational factors)
3. Develop value creation opportunities based on identified operational improvements, M&A potential, technology opportunities
4. Assess BPCP alignment using quantitative scoring where possible (EBITDA fit: 1-10, Industry fit: 1-10, etc.)

*Format C: Mixed (2023, 2024, LTM Mar-25, 2025E)*
- Use actual years for FY-3, FY-2, FY-1
- Use LTM/TTM for LTM
- IGNORE anything with "E", "P", "PF" (estimates/projections)
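Formats A through C above reduce to one mapping rule. A minimal TypeScript sketch (the function name `mapPeriods` and the header shapes are our assumptions, not part of the pipeline's code):

```typescript
// Map table column headers to FY labels per Formats A-C.
// Assumes headers are plain years ("2023"), LTM/TTM labels, or
// estimate columns like "2025E" / "PF 2024" (which are dropped).
function mapPeriods(headers: string[]): Record<string, string> {
  const map: Record<string, string> = {};
  let years: number[] = [];
  for (const h of headers) {
    if (/LTM|TTM/i.test(h)) { map['LTM'] = h; continue; } // trailing-period column
    if (/^\d{4}$/.test(h)) years.push(Number(h));         // "2025E", "PF 2024" fail this test
  }
  years.sort((a, b) => a - b);
  if (!map['LTM'] && years.length >= 4) {
    // Format A with four plain years: treat the most recent as the trailing period
    map['LTM'] = String(years[years.length - 1]);
    years = years.slice(0, -1);
  }
  const recent = years.slice(-3);                          // up to three most recent full years
  const labels = ['FY-3', 'FY-2', 'FY-1'].slice(3 - recent.length);
  recent.forEach((y, i) => { map[labels[i]] = String(y); });
  return map;
}

// Format C example: ["2023", "2024", "LTM Mar-25", "2025E"]
mapPeriods(['2023', '2024', 'LTM Mar-25', '2025E']);
// -> { LTM: 'LTM Mar-25', 'FY-2': '2023', 'FY-1': '2024' }
```

With only two historical years available, the sketch assigns FY-2 and FY-1 and leaves FY-3 unset, matching the inference rules later in the prompt.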
FORMAT STANDARDIZATION REQUIREMENTS:

**Step 3: Extract Values Carefully**
- Read from the CORRECT column for each period
- Extract EXACT values as shown ($64M, $71M, 29.3%, etc.)
- Preserve the format (don't convert $64M to $64,000,000)

**Currency Values**:
✓ CORRECT: "$64.2M", "$1.2B", "$20.5M" (from thousands: "$20,546 (in thousands)" → "$20.5M")
✗ INCORRECT: "$64,200,000", "$64M revenue", "64.2 million", "$64.2 M" (space before M), "64.2M" (missing $)

**Step 4: Validate Your Extraction**
- Check that values make sense: If FY-3 revenue is $64M, FY-2 should be similar magnitude (e.g., $50M-$90M), not $2.9M or $10
- Revenue should typically be $10M+ for target companies
- EBITDA should typically be $1M+ and positive
- Margins should be 5-50% for EBITDA margin
- If values seem wrong, you may have misaligned columns - double-check

**Percentages**:
✓ CORRECT: "29.3%", "15.8%", "(4.4)%" (negative in parentheses)
✗ INCORRECT: "29.3 percent", "29.3", "-4.4%", "29.3 %" (space before %), "-29.3%" (negatives must use parentheses, not a minus sign)
**Step 5: If Uncertain**
- If you can't find the table, can't identify periods clearly, or values don't make sense → use "Not specified in CIM"
- Better to leave blank than extract wrong data

**Dates**:
✓ CORRECT: "2024-03-15", "March 15, 2024"
✗ INCORRECT: "03/15/2024", "15-Mar-2024", "March 2024" (missing day), "2024" (missing month/day)

**Growth Rates**:
✓ CORRECT: "16.8%", "(4.4)%" (negative in parentheses), "0.0%" (zero growth)
✗ INCORRECT: "16.8 percent", "-4.4%", "16.8" (missing %), "N/A" (unless truly not calculable)

**Lists**:
✓ CORRECT: "1. First item with 2-3 sentences providing specific details and context. 2. Second item with quantification and investment significance. 3. Third item..."
✗ INCORRECT: "1. Brief point\n2. Another brief point" (too short), "First item. Second item." (not numbered), "1. Item one 2. Item two" (missing line breaks)

**Company Names**:
✓ CORRECT: "ABC Company, Inc.", "XYZ Corporation", "DEF LLC" (preserve exact legal entity name)
✗ INCORRECT: "ABC Company" (if document says "ABC Company, Inc."), "ABC" (abbreviated), "abc company" (wrong capitalization)

**Geographic Locations**:
✓ CORRECT: "Cleveland, OH", "Charlotte, NC", "New York, NY"
✗ INCORRECT: "Cleveland" (missing state), "OH" (missing city), "Cleveland, Ohio" (use state abbreviation), "Cleveland, OH, USA" (unnecessary country)

CONTEXT-AWARE EXTRACTION GUIDANCE:

- **Use Document Structure**: Reference section headers, page numbers, and table locations when extracting data
- **Cross-Section Validation**: If company name appears in multiple places, ensure consistency
- **Table Context**: When extracting from tables, note the table title, section, and page number for validation
- **Narrative Context**: When extracting from narrative text, include surrounding context (e.g., "Management stated that..." vs "The CIM indicates...")
- **Appendix References**: Check appendices for detailed financials, management bios, market research, competitive analysis
- **Footnotes**: Always check footnotes for adjustments, definitions, exclusions, and important context

SPECIAL REQUIREMENTS FOR KEY QUESTIONS & NEXT STEPS:
- **Critical Questions**: Provide 5-8 detailed questions, each 2-3 sentences long, explaining the context and investment significance
@@ -2053,7 +2268,7 @@ IMPORTANT: Replace all placeholder text with actual information from the CIM doc
        model: selectedModel,
        maxTokens,
        temperature: config.llm.temperature,
      });
    }, 'financial_extraction');

    if (!response.success) {
      logger.error('Financial extraction LLM API call failed', {
@@ -2166,8 +2381,11 @@ IMPORTANT: Replace all placeholder text with actual information from the CIM doc
      const calculatedMargin = (ebitdaValue / revValue) * 100;
      const marginDiff = Math.abs(calculatedMargin - marginValue);

      // If margin difference is > 5 percentage points, there may be an issue
      if (marginDiff > 5 && revValue > 0) {
      // If margin difference is > 15 percentage points, this is a critical error
      // Examples: 95% when should be 22%, or 15% when should be 75%
      if (marginDiff > 15 && revValue > 0) {
        validationIssues.push(`CRITICAL: EBITDA margin mismatch for ${period}: stated ${marginValue}% vs calculated ${calculatedMargin.toFixed(1)}% (diff: ${marginDiff.toFixed(1)}pp) - likely column misalignment`);
      } else if (marginDiff > 5 && revValue > 0) {
        validationIssues.push(`EBITDA margin mismatch for ${period}: stated ${marginValue}% vs calculated ${calculatedMargin.toFixed(1)}%`);
      }

@@ -2175,6 +2393,11 @@ IMPORTANT: Replace all placeholder text with actual information from the CIM doc
      if (marginValue < 0 || marginValue > 60) {
        validationIssues.push(`EBITDA margin for ${period} is outside typical range (${marginValue}%)`);
      }

      // Additional check: If calculated margin is reasonable but stated margin is way off, flag it
      if (calculatedMargin >= 0 && calculatedMargin <= 60 && marginDiff > 15) {
        validationIssues.push(`Consider using calculated margin (${calculatedMargin.toFixed(1)}%) instead of stated margin (${marginValue}%) for ${period}`);
      }
    }
  }
});
@@ -2350,9 +2573,210 @@ If ANY validation check fails, you likely have:
- Misaligned columns (values in wrong period columns)
- Extraction error (read the table again carefully)

**Step 5: If Uncertain**
**Step 5: Cross-Table Validation (CRITICAL)**
After extracting from the PRIMARY table, you MUST perform systematic cross-validation with other financial sources. Follow this structured workflow:

**Cross-Validation Workflow**:

1. **Extract from PRIMARY table first**:
   - Complete your extraction from the PRIMARY historical financial table
   - Note the key metrics: revenue, EBITDA, gross profit, margins for each period

2. **Check Executive Summary for Key Metrics**:
   - Search executive summary section for financial highlights
   - Look for mentions of revenue, EBITDA, or key financial figures
   - Extract the values mentioned in executive summary

3. **Calculate Discrepancy Percentage**:
   - Compare PRIMARY table values with executive summary values
   - Calculate discrepancy: |(Primary Table Value - Executive Summary Value) / Primary Table Value| * 100
   - Example: If PRIMARY shows $64M and executive summary shows $68M, discrepancy = |(64-68)/64| * 100 = 6.25%
4. **If Discrepancy >10%, Investigate**:
   When discrepancy exceeds 10%, systematically investigate:

   a. **Check if Executive Summary uses Adjusted/Pro Forma Numbers**:
      - Look for terms: "Adjusted EBITDA", "Pro Forma", "Normalized", "Run-Rate"
      - Executive summary may show adjusted figures (with add-backs, pro forma adjustments)
      - PRIMARY table typically shows historical/actual results
      - If this is the case, discrepancy is expected - use PRIMARY table values

   b. **Check if Period Definitions Differ**:
      - Verify fiscal year end matches (e.g., PRIMARY table may use FYE Mar 2024, executive summary may reference calendar 2024)
      - Check if LTM calculation dates differ
      - If periods differ, discrepancy is expected - use PRIMARY table periods

   c. **Determine Which Source is More Authoritative**:
      - PRIMARY detailed table is typically most authoritative (shows actual historical results)
      - Executive summary may be rounded, adjusted, or use different definitions
      - Use PRIMARY table as authoritative source unless executive summary explicitly states it's using different/adjusted numbers

   d. **Document Discrepancies in qualityOfEarnings Field**:
      - If discrepancy >10% and cannot be explained by adjustments/period differences, note it
      - Example: "Executive summary shows $68M revenue FY-1 vs $64M in detailed table (6.25% discrepancy). Using detailed table value as authoritative."
      - If discrepancy is due to adjustments, note: "Executive summary shows Adjusted EBITDA of $27M (includes $3M add-backs) vs $24M in historical table."

5. **Cross-Reference with Summary Financial Tables**:
   - If CIM has both detailed and summary financial tables, cross-check key metrics
   - Summary tables may be rounded or use different formatting
   - Use detailed PRIMARY table for complete data, summary table for validation only
   - If summary table differs significantly (>10%), investigate and document

6. **Check Appendix Financials**:
   - Review appendices for additional financial detail or adjustments
   - Look for: "Adjusted EBITDA" tables, "Normalized Financials", "Quality of Earnings" adjustments
   - Note any significant adjustments, add-backs, or one-time items mentioned
   - Document these in qualityOfEarnings field

7. **Validate with Narrative Text References**:
   - Scan narrative sections for financial mentions (e.g., "revenue grew from $64M to $71M")
   - Use these as validation checks, not primary sources
   - If narrative contradicts PRIMARY table by >10%, investigate which is correct
   - Typically, PRIMARY table is more reliable than narrative text

**Final Decision Rule**:
- **Use PRIMARY table as authoritative source** unless:
  - Executive summary explicitly states it's using adjusted/pro forma numbers AND you need adjusted values
  - Period definitions clearly differ (in which case, use PRIMARY table periods)
  - PRIMARY table is clearly a subsidiary/segment table (values in thousands, not millions)
- **Always document** significant discrepancies (>10%) in qualityOfEarnings field
- **Better to use PRIMARY table** than executive summary if uncertain
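The discrepancy calculation and decision rule above can be sketched in TypeScript. This is an illustrative helper, not the pipeline's actual code; the names `crossCheck` and `CrossCheckResult` are ours:

```typescript
// Cross-check a primary-table value against an executive-summary value.
interface CrossCheckResult {
  discrepancyPct: number;   // |(primary - summary) / primary| * 100
  usePrimary: boolean;      // per the decision rule, the primary table wins
  note?: string;            // text to record in the qualityOfEarnings field
}

function crossCheck(primaryValue: number, summaryValue: number, label: string): CrossCheckResult {
  const discrepancyPct = Math.abs((primaryValue - summaryValue) / primaryValue) * 100;
  if (discrepancyPct <= 10) {
    return { discrepancyPct, usePrimary: true };
  }
  // >10%: still prefer the PRIMARY table, but document the gap for investigation
  return {
    discrepancyPct,
    usePrimary: true,
    note: `${label}: primary table shows ${primaryValue} vs ${summaryValue} in executive summary ` +
          `(${discrepancyPct.toFixed(1)}% discrepancy). Using primary table value as authoritative.`,
  };
}

// Worked example from step 3: $64M primary vs $68M executive summary
crossCheck(64, 68, 'FY-1 revenue');
// -> discrepancyPct 6.25, usePrimary true, no note (below the 10% threshold)
```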
**Step 6: Enhanced Unit Conversion Handling**
Handle various unit formats explicitly:

1. **Thousands Format**:
   - Look for footnotes: "(in thousands)", "(000s)", "($000)"
   - Example: "$20,546 (in thousands)" = $20.5M (divide by 1,000, round to 1 decimal)
   - Example: "$20,546K" = $20.5M
   - Always check table footnotes for unit indicators

2. **Millions Format**:
   - "$64M", "$64.2M", "$64,200,000" all = $64.2M
   - Preserve format: Use "$64.2M" not "$64,200,000"

3. **Billions Format**:
   - "$1.2B", "$1,200M" = $1.2B
   - Convert billions to millions if needed: $1.2B = $1,200M

4. **Negative Numbers**:
   - Parentheses: "(4.4)" = negative 4.4
   - Minus sign: "-4.4" = negative 4.4
   - For percentages: "(4.4)%" = negative 4.4%
   - For currency: "($2.5M)" = negative $2.5M

5. **Currency Symbols**:
   - "$" = US dollars (most common)
   - "€" = Euros (convert if needed, note in extraction)
   - "£" = British pounds (convert if needed, note in extraction)
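The unit rules in Step 6 can be exercised with a small normalizer. A minimal sketch, assuming values arrive as raw strings plus a flag for an "(in thousands)" footnote; `toMillions` and `fmtMillions` are hypothetical names:

```typescript
// Normalize a raw currency string to a number of millions.
function toMillions(raw: string, inThousands: boolean): number {
  const trimmed = raw.trim();
  // Parentheses or a leading minus sign mean negative: "($2.5M)", "(4.4)", "-4.4"
  const negative = /^\(.*\)$/.test(trimmed) || trimmed.startsWith('-');
  // Strip $, commas, parentheses, and minus signs before parsing
  const cleaned = trimmed.replace(/[($,)\-]/g, '').trim();
  const match = cleaned.match(/^([\d.]+)\s*([KMB])?$/i);
  if (!match) throw new Error(`Unrecognized value: ${raw}`);
  let value = parseFloat(match[1]);
  const suffix = (match[2] ?? '').toUpperCase();
  if (suffix === 'B') value *= 1000;                             // billions -> millions
  else if (suffix === 'K' || (suffix === '' && inThousands)) value /= 1000; // thousands -> millions
  // 'M' or plain millions: leave as-is
  return negative ? -value : value;
}

// Format a positive value back in the prompt's canonical "$X.XM" style
const fmtMillions = (raw: string, inThousands = false) =>
  `$${toMillions(raw, inThousands).toFixed(1)}M`;

fmtMillions('$20,546', true);  // -> "$20.5M", matching the thousands example above
```

Negative currency values would still need the parenthesized "($2.5M)" presentation required by the format rules; the formatter above only covers the positive case.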
**Step 7: Missing Data Inference Rules**
When to infer vs when to require explicit data:

1. **Calculate Growth Rates**:
   - If revenue for FY-3 and FY-2 are available, calculate FY-2 growth: ((FY-2 - FY-3) / FY-3) * 100
   - If growth rate is explicitly stated, use that; otherwise calculate
   - FY-3 growth should be "N/A" (baseline year)

2. **Calculate Margins**:
   - If revenue and EBITDA available, calculate margin: (EBITDA / Revenue) * 100
   - If margin explicitly stated, use that; otherwise calculate
   - If calculated margin differs significantly (>5pp) from stated, note the discrepancy

3. **Infer Missing Periods**:
   - If only 2 periods available, assign to FY-2 and FY-1 (most recent periods)
   - If only 3 periods available, assign to FY-3, FY-2, FY-1
   - Do NOT infer values - only infer period assignments

4. **Do NOT Infer**:
   - Do NOT make up financial values
   - Do NOT estimate missing periods
   - Do NOT assume trends continue
   - If data is missing, use "Not specified in CIM"
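The growth and margin calculations in Step 7, combined with the negative-in-parentheses format rule, can be sketched as two helpers (names are illustrative):

```typescript
// Year-over-year growth, formatted per the prompt's rules:
// one decimal, negatives in parentheses, "N/A" for the baseline year.
function growthPct(current: number, prior: number): string {
  if (!prior) return 'N/A';                       // baseline year or missing prior
  const g = ((current - prior) / prior) * 100;
  const v = Math.abs(g).toFixed(1);
  return g < 0 ? `(${v})%` : `${v}%`;
}

// EBITDA margin, calculated only when both inputs exist.
function ebitdaMarginPct(ebitda: number, revenue: number): string {
  if (!revenue) return 'Not specified in CIM';
  return `${((ebitda / revenue) * 100).toFixed(1)}%`;
}

// FY-3 = $64M revenue -> FY-2 = $71M revenue
growthPct(71, 64);        // -> "10.9%"
ebitdaMarginPct(24, 71);  // -> "33.8%"
```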
**Step 8: Table Type Classification**
Identify and handle different table types:

1. **Historical Financial Table (USE THIS)**:
   - Shows actual past performance
   - Labeled: "Historical Financials", "Actual Results", "Reported Financials"
   - Contains years or periods (2021, 2022, 2023, FY-1, FY-2, etc.)
   - No "E", "P", "PF", "Projected", "Forecast" markers

2. **Projected/Forward-Looking Table (IGNORE)**:
   - Shows future estimates
   - Labeled: "Projections", "Forecast", "Budget", "Plan"
   - Contains "E", "P", "PF" markers or future years
   - IGNORE these - only extract historical data

3. **Pro Forma/Adjusted Table (USE WITH CAUTION)**:
   - Shows adjusted or normalized results
   - Labeled: "Pro Forma", "Adjusted", "Normalized", "Run-Rate"
   - May include add-backs, adjustments, or acquisition impacts
   - Note adjustments but prefer historical table if both available

4. **Segment/Subsidiary Table (IGNORE FOR PRIMARY)**:
   - Shows individual business units or subsidiaries
   - Values typically in thousands (smaller magnitude)
   - Use only if no consolidated table available

5. **Consolidated Table (USE THIS)**:
   - Shows combined company results
   - Labeled: "Consolidated", "Combined", "Total"
   - Values typically in millions (larger magnitude)
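The five table types above can be approximated with a label-based heuristic. A sketch under the assumption that the table title or caption is available as a string; the patterns are ours and would need tuning against real CIM layouts:

```typescript
type TableType = 'historical' | 'projected' | 'proforma' | 'segment' | 'consolidated';

// Classify a table from its title/caption text, checking the
// "ignore" categories before the "use" categories.
function classifyTable(title: string): TableType {
  const t = title.toLowerCase();
  if (/projection|forecast|budget|plan|\b\d{4}e\b/.test(t)) return 'projected';
  if (/pro forma|adjusted|normalized|run-rate/.test(t)) return 'proforma';
  if (/segment|subsidiary|division/.test(t)) return 'segment';
  if (/consolidated|combined|total/.test(t)) return 'consolidated';
  return 'historical'; // default: treat as historical/actual results
}
```

Checking "projected" and "pro forma" first matters: a title like "Pro Forma Projections" should be dropped, not used with caution.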
**Step 9: Footnote Integration**
Always check footnotes for critical information:

1. **Adjustments and Add-backs**:
   - Footnotes may explain EBITDA adjustments, add-backs, or one-time items
   - Note these in qualityOfEarnings field
   - Example: "EBITDA includes $2M in management fees add-back"

2. **Definitions**:
   - Footnotes may define "EBITDA", "Adjusted EBITDA", "Revenue" (gross vs net)
   - Use these definitions to ensure correct extraction

3. **Exclusions**:
   - Footnotes may exclude certain items (discontinued operations, divestitures)
   - Note these exclusions

4. **Units and Basis**:
   - Footnotes may specify units (thousands, millions) or currency
   - Critical for correct extraction

5. **Period Definitions**:
   - Footnotes may clarify fiscal year end, LTM calculation date, stub periods
   - Use this to correctly map periods
**Step 10: Temporal Context Handling**
Handle various time period formats:

1. **Fiscal Year Ends**:
   - "FYE Mar 2024" = fiscal year ending March 2024
   - "FY 2024" may mean different things (calendar vs fiscal)
   - Check document for fiscal year end definition
   - Use fiscal year, not calendar year, if specified

2. **LTM Calculation Dates**:
   - "LTM Mar 2024" = last twelve months through March 2024
   - "TTM Jun 2024" = trailing twelve months through June 2024
   - Note the calculation date for context

3. **Stub Periods**:
   - Some tables show partial periods (e.g., "6M 2024" = 6 months)
   - Typically not used for FY-3, FY-2, FY-1 (use full years)
   - May be used for LTM if recent acquisition

4. **Calendar vs Fiscal**:
   - Calendar year: Jan 1 - Dec 31
   - Fiscal year: Varies (e.g., Apr 1 - Mar 31, Oct 1 - Sep 30)
   - Use fiscal year if specified, otherwise assume calendar

**Step 11: If Uncertain**
- If you can't find the PRIMARY table, can't identify periods clearly, or values don't make sense → use "Not specified in CIM"
- Better to leave blank than extract wrong data
- If multiple tables exist and you're unsure which is primary, use the one with largest revenue values (typically $20M-$1B+)
FEW-SHOT EXAMPLES - Correct Financial Table Extraction:

@@ -2427,6 +2851,103 @@ Correct Extraction:
- FY-1 = 2024 = $71M revenue, $24M EBITDA (most recent year)
- LTM = Not specified in CIM (no LTM column)

**Example 7: Multiple Tables with Conflicting Values - Identifying PRIMARY**
Scenario: Document contains multiple financial tables with different values.

TABLE A (in Executive Summary):
Revenue: $68M (FY-1), $75M (LTM)
Note: "Adjusted for pro forma acquisition of XYZ Corp"

TABLE B (in Financial Summary section):
Historical Financials:
FY-3: $64M revenue, $19M EBITDA
FY-2: $71M revenue, $24M EBITDA
FY-1: $71M revenue, $24M EBITDA
LTM: $76M revenue, $27M EBITDA
Note: "Actual historical results"

Correct Extraction:
- Use TABLE B (Historical Financials) as PRIMARY table - it shows actual historical results
- TABLE A shows adjusted/pro forma numbers (not historical)
- Extract: FY-1 = $71M revenue (from TABLE B), not $68M (from TABLE A)
- Note discrepancy in qualityOfEarnings: "Executive summary shows adjusted revenue of $68M vs $71M actual historical (4.2% difference due to pro forma adjustments)"
**Example 8: Table with Merged Cells or Irregular Formatting**
Table appears with merged cells or irregular spacing:
```
                 2021     2022     2023     2024
Revenue         $45.2M   $52.8M   $61.2M   $58.5M
Revenue Growth     N/A    16.8%    15.9%   (4.4)%
Gross Profit    $18.1M   $21.1M   $24.5M   $23.4M
Gross Margin     40.0%    40.0%    40.0%    40.0%
EBITDA           $8.5M   $10.2M   $12.1M   $11.5M
EBITDA Margin    18.8%    19.3%    19.8%    19.7%
```

Note: Some rows may have merged cells or irregular spacing. Count columns carefully.

Correct Extraction:
- Identify column positions: Column 1 = 2021, Column 2 = 2022, Column 3 = 2023, Column 4 = 2024
- Extract values by column position, not by visual alignment
- FY-3 = 2021 = $45.2M revenue, $8.5M EBITDA, 18.8% EBITDA margin
- FY-2 = 2022 = $52.8M revenue, $10.2M EBITDA, 19.3% EBITDA margin
- FY-1 = 2023 = $61.2M revenue, $12.1M EBITDA, 19.8% EBITDA margin
- LTM = 2024 = $58.5M revenue, $11.5M EBITDA, 19.7% EBITDA margin
**Example 9: Table with Footnotes Containing Critical Adjustments**
Table Header: "FY-3 FY-2 FY-1 LTM"
Revenue Row: "$64M $71M $71M $76M"
EBITDA Row: "$19M $24M $24M $27M"
Footnote 1: "EBITDA includes $2M management fees add-back in each period"
Footnote 2: "LTM period is through March 2024"
Footnote 3: "All amounts in millions of US dollars"

Correct Extraction:
- Extract values as shown: FY-1 = $71M revenue, $24M EBITDA
- Document adjustments in qualityOfEarnings: "EBITDA includes $2M management fees add-back per period. Historical EBITDA without the add-back would be $17M, $22M, $22M, $25M for FY-3, FY-2, FY-1, LTM respectively."
- Note LTM calculation date: "LTM through March 2024"
- Use footnotes to understand adjustments and period definitions

**Example 10: Pro Forma vs Historical Side-by-Side Comparison**
Table shows both Historical and Pro Forma columns:

Table Header: "Historical Results    Pro Forma (Adjusted)"
"FY-3 FY-2 FY-1 LTM    FY-3 FY-2 FY-1 LTM"
Revenue Row: "$64M $71M $71M $76M    $68M $75M $75M $80M"
EBITDA Row: "$19M $24M $24M $27M    $22M $27M $27M $30M"
Note: "Pro Forma includes acquisition of ABC Corp and add-backs"

Correct Extraction:
- Use HISTORICAL columns (first 4 columns) for extraction
- IGNORE Pro Forma columns (last 4 columns) - these are adjusted, not historical
- Extract: FY-1 = $71M revenue, $24M EBITDA (from Historical, not $75M/$27M from Pro Forma)
- Document in qualityOfEarnings: "Pro forma adjustments add $4M revenue and $3M EBITDA per period. Historical results shown above."

**Example 11: Partial Table with Only 3 Periods (Edge Case)**
Table Header: "2022 2023 2024"
Revenue Row: "$58M $64M $71M"
EBITDA Row: "$17M $19M $24M"
Note: "Historical financials for last 3 years"

Correct Extraction:
- FY-3 = 2022 = $58M revenue, $17M EBITDA (oldest year)
- FY-2 = 2023 = $64M revenue, $19M EBITDA (middle year)
- FY-1 = 2024 = $71M revenue, $24M EBITDA (most recent year)
- LTM = Not specified in CIM (no LTM column provided)

**Example 12: Table with Thousands Format Requiring Conversion**
Table Header: "2021 2022 2023 2024"
Note: "(All amounts in thousands)"
Revenue Row: "$45,200 $52,800 $61,200 $58,500"
EBITDA Row: "$8,500 $10,200 $12,100 $11,500"

Correct Extraction (convert to millions):
- FY-3 = 2021 = $45.2M revenue, $8.5M EBITDA ($45,200K ÷ 1,000 = $45.2M)
- FY-2 = 2022 = $52.8M revenue, $10.2M EBITDA ($52,800K ÷ 1,000 = $52.8M)
- FY-1 = 2023 = $61.2M revenue, $12.1M EBITDA ($61,200K ÷ 1,000 = $61.2M)
- LTM = 2024 = $58.5M revenue, $11.5M EBITDA ($58,500K ÷ 1,000 = $58.5M)
- CRITICAL: Always check footnotes for unit indicators before extracting
CIM Document Text:
${text}

@@ -2485,7 +3006,7 @@ IMPORTANT: Extract ONLY financial data. Return ONLY the financialSummary section
 * Get system prompt for financial extraction
 */
private getFinancialSystemPrompt(): string {
  return `You are an expert financial analyst at BPCP (Blue Point Capital Partners) specializing in extracting historical financial data from CIM documents. Your task is to extract ONLY the financial summary section from the CIM document.
  return `You are an expert financial analyst at BPCP (Blue Point Capital Partners) specializing in extracting historical financial data from CIM documents with 100% accuracy. Your task is to extract ONLY the financial summary section from the CIM document.

CRITICAL REQUIREMENTS:
1. **JSON OUTPUT ONLY**: Your entire response MUST be a single, valid JSON object containing ONLY the financialSummary section.
@@ -2495,7 +3016,103 @@ CRITICAL REQUIREMENTS:
5. **PERIOD MAPPING**: Correctly map periods (FY-3, FY-2, FY-1, LTM) from various table formats (years, FY-X, mixed).
6. **IF UNCERTAIN**: Use "Not specified in CIM" rather than extracting incorrect data.

Focus exclusively on financial data extraction. Do not extract any other sections.`;
EXPANDED VALIDATION FRAMEWORK:
Before finalizing extraction, perform these validation checks:

**Magnitude Validation**:
- Revenue should typically be $10M+ for target companies (if less, verify you're using PRIMARY table, not subsidiary)
- EBITDA should typically be $1M+ and positive for viable targets
- If FY-3 revenue is $64M, FY-2 should be similar magnitude (e.g., $50M-$90M), not $2.9M or $10 - this indicates column misalignment

**Trend Validation**:
- Revenue should generally increase or be stable year-over-year (FY-3 → FY-2 → FY-1)
- Large sudden drops (>50%) or increases (>200%) may indicate misaligned columns or wrong table
- EBITDA should follow similar trends to revenue (unless margin expansion/contraction is explicitly explained)

**Margin Reasonableness**:
- EBITDA margins should be 5-50% (typical range for most businesses)
- Gross margins should be 20-80% (typical range)
- Margins should be relatively stable across periods (within 10-15 percentage points unless explained)
- If margins are outside these ranges, verify you're using the correct table and calculations

**Cross-Period Consistency**:
- If FY-3 revenue = $64M and FY-2 revenue = $71M, growth should be ~11% (not 1000% or -50%)
- Verify growth rates match: ((Current - Prior) / Prior) * 100
- Verify margins match: (Metric / Revenue) * 100
- If calculations don't match, use the explicitly stated values from the table

**Calculation Validation**:
- Revenue growth: ((Current Year - Prior Year) / Prior Year) * 100
- EBITDA margin: (EBITDA / Revenue) * 100
- Gross margin: (Gross Profit / Revenue) * 100
- If calculated values differ significantly (>5pp) from stated values, note the discrepancy
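The magnitude, trend, and margin checks above translate directly into code. A sketch with illustrative names and the thresholds from the text (not the pipeline's actual validator):

```typescript
// One extracted period, in $ millions.
interface PeriodData { revenue: number; ebitda: number; }

// Run the magnitude / margin / trend checks over periods in
// chronological order (FY-3 first); returns human-readable issues.
function validatePeriods(periods: PeriodData[]): string[] {
  const issues: string[] = [];
  periods.forEach((p, i) => {
    if (p.revenue < 10) {
      issues.push(`Period ${i}: revenue $${p.revenue}M < $10M - check for subsidiary table`);
    }
    const margin = (p.ebitda / p.revenue) * 100;
    if (margin < 5 || margin > 50) {
      issues.push(`Period ${i}: EBITDA margin ${margin.toFixed(1)}% outside 5-50% range`);
    }
    if (i > 0) {
      const growth = ((p.revenue - periods[i - 1].revenue) / periods[i - 1].revenue) * 100;
      if (growth < -50 || growth > 200) {
        issues.push(`Period ${i}: revenue change ${growth.toFixed(1)}% suggests column misalignment`);
      }
    }
  });
  return issues;
}

// $64M/$19M -> $71M/$24M -> $71M/$24M: every check passes
validatePeriods([
  { revenue: 64, ebitda: 19 },
  { revenue: 71, ebitda: 24 },
  { revenue: 71, ebitda: 24 },
]);
// -> []
```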
COMMON MISTAKES TO AVOID (Error Prevention):
1. **Subsidiary vs Parent Table Confusion**:
   - PRIMARY table shows values in millions ($64M, $71M)
   - Subsidiary tables show thousands ($20,546, $26,352)
   - Always use the PRIMARY table with larger values

2. **Projections vs Historical**:
   - Ignore tables marked with "E", "P", "PF", "Projected", "Forecast"
   - Only extract from historical/actual results tables

3. **Thousands vs Millions**:
   - "$20,546 (in thousands)" = $20.5M, not $20,546M
   - Always check table footnotes for unit indicators
   - If revenue < $10M, you're likely using wrong table

4. **Column Misalignment**:
   - Count columns carefully - ensure values align with their period columns
   - Verify trends make sense (revenue generally increases or is stable)
   - If values seem misaligned, double-check column positions

5. **Missing Cross-Validation**:
   - Don't extract financials in isolation
   - Cross-reference with executive summary financial highlights
   - Verify consistency between detailed financials and summary statements

6. **Unit Conversion Errors**:
   - Parentheses for negative: "(4.4)" = negative 4.4
   - Currency symbols: "$" = US dollars, "€" = Euros, "£" = British pounds
   - Always check footnotes for unit definitions
CONFIDENCE SCORING:
Flag uncertain extractions by considering:
- **High Confidence**: Values from clear PRIMARY table, match executive summary, calculations consistent
- **Medium Confidence**: Values from PRIMARY table but some ambiguity (e.g., unclear period mapping)
- **Low Confidence**: Values don't match executive summary, calculations inconsistent, or unclear table type
- If confidence is low, use "Not specified in CIM" rather than guessing

ALTERNATIVE EXTRACTION METHODS:
If PRIMARY table is not found or unclear:

1. **Narrative Text Extraction**:
   - Search for financial mentions in narrative text (e.g., "revenue grew from $64M to $71M")
   - Use these as validation checks, not primary sources
   - Extract only if explicitly stated with specific values

2. **Executive Summary Financial Highlights**:
   - Check executive summary for key financial metrics
   - May contain revenue, EBITDA highlights
   - Use as secondary source if primary table unavailable

3. **Appendix Financials**:
   - Check appendices for detailed financial statements
   - May contain full income statements or P&L
   - Use if main section table is unclear

4. **Chart/Graph Data**:
   - Some CIMs show financial data in charts
   - Extract if values are clearly readable
   - Note that charts may be approximate

5. **Multiple Table Comparison**:
   - If multiple tables exist, compare values
   - Use the table with largest revenue values (typically PRIMARY)
   - Cross-validate key metrics across tables

Focus exclusively on financial data extraction. Do not extract any other sections. Prioritize accuracy over completeness - better to leave a field blank than extract incorrect data.`;
}

/**
backend/src/services/llmUtils/costCalculator.ts (new file, 15 lines)
@@ -0,0 +1,15 @@
/**
 * Cost Calculation Utilities
 * Estimates LLM API costs based on token usage and model
 */

import { estimateLLMCost } from '../../config/constants';

/**
 * Estimate cost for a given number of tokens and model
 * Uses the centralized cost estimation from constants
 */
export function estimateCost(tokens: number, model: string): number {
  return estimateLLMCost(tokens, model);
}
backend/src/services/llmUtils/index.ts (new file, 9 lines)
@@ -0,0 +1,9 @@
/**
 * LLM Utility Functions
 * Centralized exports for all LLM utility functions
 */

export { extractJsonFromResponse } from './jsonExtractor';
export { estimateTokenCount, truncateText } from './tokenEstimator';
export { estimateCost } from './costCalculator';
backend/src/services/llmUtils/jsonExtractor.ts (new file, 184 lines)
@@ -0,0 +1,184 @@
/**
 * JSON Extraction Utilities
 * Extracts JSON from LLM responses, handling various formats and edge cases
 */

import { logger } from '../../utils/logger';
import { LLM_COST_RATES, DEFAULT_COST_RATE, estimateLLMCost, estimateTokenCount } from '../../config/constants';

/**
 * Extract JSON from LLM response content
 * Handles various formats: ```json blocks, plain JSON, truncated responses
 */
export function extractJsonFromResponse(content: string): any {
  try {
    // First, try to find JSON within ```json ... ```
    const jsonBlockStart = content.indexOf('```json');
    logger.info('JSON extraction - checking for ```json block', {
      jsonBlockStart,
      hasJsonBlock: jsonBlockStart !== -1,
      contentLength: content.length,
      contentEnds: content.substring(content.length - 50),
    });

    if (jsonBlockStart !== -1) {
      const jsonContentStart = content.indexOf('\n', jsonBlockStart) + 1;
      let closingBackticks = -1;

      // Try to find \n``` first (most common)
      const newlineBackticks = content.indexOf('\n```', jsonContentStart);
      if (newlineBackticks !== -1) {
        closingBackticks = newlineBackticks + 1;
      } else {
        // Fallback: look for ``` at the very end
        if (content.endsWith('```')) {
          closingBackticks = content.length - 3;
        } else {
          closingBackticks = content.length;
          logger.warn('LLM response has no closing backticks, using entire content');
        }
      }

      logger.info('JSON extraction - found block boundaries', {
        jsonContentStart,
        closingBackticks,
        newlineBackticks,
        contentEndsWithBackticks: content.endsWith('```'),
        isValid: closingBackticks > jsonContentStart,
      });

      if (jsonContentStart > 0 && closingBackticks > jsonContentStart) {
        const jsonStr = content.substring(jsonContentStart, closingBackticks).trim();

        logger.info('JSON extraction - extracted string', {
          jsonStrLength: jsonStr.length,
          startsWithBrace: jsonStr.startsWith('{'),
          jsonStrPreview: jsonStr.substring(0, 300),
        });

        if (jsonStr && jsonStr.startsWith('{')) {
          try {
            // Use brace matching to get the complete root object
            let braceCount = 0;
            let rootEndIndex = -1;
            for (let i = 0; i < jsonStr.length; i++) {
              if (jsonStr[i] === '{') braceCount++;
              else if (jsonStr[i] === '}') {
                braceCount--;
                if (braceCount === 0) {
                  rootEndIndex = i;
                  break;
                }
              }
            }
            if (rootEndIndex !== -1) {
              const completeJsonStr = jsonStr.substring(0, rootEndIndex + 1);
              logger.info('Brace matching succeeded', {
                originalLength: jsonStr.length,
                extractedLength: completeJsonStr.length,
                extractedPreview: completeJsonStr.substring(0, 200),
              });
              return JSON.parse(completeJsonStr);
            } else {
              logger.warn('Brace matching failed to find closing brace', {
                jsonStrLength: jsonStr.length,
                jsonStrPreview: jsonStr.substring(0, 500),
              });
            }
          } catch (e) {
            logger.error('Brace matching threw error, falling back to regex', {
              error: e instanceof Error ? e.message : String(e),
              stack: e instanceof Error ? e.stack : undefined,
            });
          }
        }
      }
    }

    // Fallback to regex match
    logger.warn('Using fallback regex extraction');
    const jsonMatch = content.match(/```json\n([\s\S]+)\n```/);
    if (jsonMatch && jsonMatch[1]) {
      logger.info('Regex extraction found JSON', {
        matchLength: jsonMatch[1].length,
        matchPreview: jsonMatch[1].substring(0, 200),
      });
      return JSON.parse(jsonMatch[1]);
    }

    // Try to find JSON within ``` ... ```
    const codeBlockMatch = content.match(/```\n([\s\S]*?)\n```/);
    if (codeBlockMatch && codeBlockMatch[1]) {
      return JSON.parse(codeBlockMatch[1]);
    }

    // If that fails, try to find the largest valid JSON object
    const startIndex = content.indexOf('{');
    if (startIndex === -1) {
      throw new Error('No JSON object found in response');
    }

    // Try to find the complete JSON object by matching braces
|
||||
let braceCount = 0;
|
||||
let endIndex = -1;
|
||||
|
||||
for (let i = startIndex; i < content.length; i++) {
|
||||
if (content[i] === '{') {
|
||||
braceCount++;
|
||||
} else if (content[i] === '}') {
|
||||
braceCount--;
|
||||
if (braceCount === 0) {
|
||||
endIndex = i;
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (endIndex === -1) {
|
||||
// If we can't find a complete JSON object, the response was likely truncated
|
||||
const partialJson = content.substring(startIndex);
|
||||
const openBraces = (partialJson.match(/{/g) || []).length;
|
||||
const closeBraces = (partialJson.match(/}/g) || []).length;
|
||||
const isTruncated = openBraces > closeBraces;
|
||||
|
||||
logger.warn('Attempting to recover from truncated JSON response', {
|
||||
contentLength: content.length,
|
||||
partialJsonLength: partialJson.length,
|
||||
openBraces,
|
||||
closeBraces,
|
||||
isTruncated,
|
||||
endsAbruptly: !content.trim().endsWith('}') && !content.trim().endsWith('```')
|
||||
});
|
||||
|
||||
// If clearly truncated (more open than close braces), throw a specific error
|
||||
if (isTruncated && openBraces - closeBraces > 2) {
|
||||
throw new Error(`Response was truncated due to token limit. Expected ${openBraces - closeBraces} more closing braces. Increase maxTokens limit.`);
|
||||
}
|
||||
|
||||
// Try to find the last complete object or array
|
||||
const lastCompleteMatch = partialJson.match(/(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})/);
|
||||
if (lastCompleteMatch && lastCompleteMatch[1]) {
|
||||
return JSON.parse(lastCompleteMatch[1]);
|
||||
}
|
||||
|
||||
// If that fails, try to find the last complete key-value pair
|
||||
const lastPairMatch = partialJson.match(/(\{[^{}]*"[^"]*"\s*:\s*"[^"]*"[^{}]*\})/);
|
||||
if (lastPairMatch && lastPairMatch[1]) {
|
||||
return JSON.parse(lastPairMatch[1]);
|
||||
}
|
||||
|
||||
throw new Error(`Unable to extract valid JSON from truncated response. Response appears incomplete (${openBraces} open braces, ${closeBraces} close braces). Increase maxTokens limit.`);
|
||||
}
|
||||
|
||||
const jsonString = content.substring(startIndex, endIndex + 1);
|
||||
return JSON.parse(jsonString);
|
||||
} catch (error) {
|
||||
logger.error('Failed to extract JSON from LLM response', {
|
||||
error,
|
||||
contentLength: content.length,
|
||||
contentPreview: content.substring(0, 1000)
|
||||
});
|
||||
throw new Error(`JSON extraction failed: ${error instanceof Error ? error.message : 'Unknown error'}`);
|
||||
}
|
||||
}
|
||||
|
||||
56 backend/src/services/llmUtils/tokenEstimator.ts Normal file
@@ -0,0 +1,56 @@
/**
 * Token Estimation Utilities
 * Estimates token counts and handles text truncation
 */

import { estimateTokenCount as estimateTokens, TOKEN_ESTIMATION } from '../../config/constants';

/**
 * Estimate token count for text
 * Uses the constant from config for consistency
 */
export function estimateTokenCount(text: string): number {
  return estimateTokens(text);
}

/**
 * Truncate text to fit within token limit while preserving sentence boundaries
 */
export function truncateText(text: string, maxTokens: number): string {
  // Convert token limit to character limit (approximate)
  const maxChars = maxTokens * TOKEN_ESTIMATION.CHARS_PER_TOKEN;

  if (text.length <= maxChars) {
    return text;
  }

  // Try to truncate at sentence boundaries for better context preservation
  const truncated = text.substring(0, maxChars);

  // Find the last sentence boundary (period, exclamation, question mark followed by space)
  const sentenceEndRegex = /[.!?]\s+/g;
  let lastMatch: RegExpExecArray | null = null;
  let match: RegExpExecArray | null;

  while ((match = sentenceEndRegex.exec(truncated)) !== null) {
    if (match.index < maxChars * 0.95) { // Only use if within 95% of limit
      lastMatch = match;
    }
  }

  if (lastMatch) {
    // Truncate at sentence boundary
    return text.substring(0, lastMatch.index + lastMatch[0].length).trim();
  }

  // Fallback: truncate at the last word boundary
  const lastSpaceIndex = truncated.lastIndexOf(' ');
  if (lastSpaceIndex > maxChars * 0.9) {
    return text.substring(0, lastSpaceIndex).trim();
  }

  // Final fallback: hard truncate
  return truncated.trim();
}
||||
@@ -632,18 +632,50 @@ export class OptimizedAgenticRAGProcessor {
 * This query represents what we're looking for in the document
 */
private createCIMAnalysisQuery(): string {
  return `Confidential Information Memorandum CIM document analysis including:
- Executive summary and deal overview
- Company name, industry sector, transaction type, geography
- Business description and core operations
- Key products and services, unique value proposition
- Customer base overview and customer concentration
- Market size, growth rate, industry trends
- Competitive landscape and market position
- Financial summary with revenue, EBITDA, margins, growth rates
- Management team overview
- Investment thesis and key questions
- Transaction details and deal structure`;
  return `Confidential Information Memorandum (CIM) document comprehensive analysis with priority weighting:

**HIGH PRIORITY (Weight: 10/10)** - Critical for investment decision:
- Historical financial performance table with revenue, EBITDA, gross profit, margins, and growth rates for FY-3, FY-2, FY-1, and LTM periods
- Executive summary financial highlights and key metrics
- Investment thesis, key attractions, risks, and value creation opportunities
- Deal overview including target company name, industry sector, transaction type, geography, deal source

**HIGH PRIORITY (Weight: 9/10)** - Essential investment analysis:
- Market analysis including total addressable market (TAM), serviceable addressable market (SAM), market growth rates, CAGR
- Competitive landscape analysis with key competitors, market position, market share, competitive differentiation
- Business description including core operations, key products and services, unique value proposition, revenue mix
- Management team overview including key leaders, management quality assessment, post-transaction intentions

**MEDIUM PRIORITY (Weight: 7/10)** - Important context:
- Customer base overview including customer segments, customer concentration risk, top customers percentage, contract length, recurring revenue
- Industry trends, drivers, tailwinds, headwinds, regulatory environment
- Barriers to entry, competitive moats, basis of competition
- Quality of earnings analysis, EBITDA adjustments, addbacks, capital expenditures, working capital intensity, free cash flow quality

**MEDIUM PRIORITY (Weight: 6/10)** - Supporting information:
- Key supplier dependencies, supply chain risks, supplier concentration
- Organizational structure, reporting relationships, depth of team
- Revenue growth drivers, margin stability analysis, profitability trends
- Critical questions for management, missing information, preliminary recommendation, proposed next steps

**LOWER PRIORITY (Weight: 4/10)** - Additional context:
- Transaction details and deal structure
- CIM document dates, reviewers, page count, stated reason for sale, employee count
- Geographic locations and operating locations
- Market dynamics and macroeconomic factors

**SEMANTIC SPECIFICITY ENHANCEMENTS**:
Use specific financial terminology: "historical financial performance table", "income statement", "P&L statement", "financial summary table", "consolidated financials", "revenue growth year-over-year", "EBITDA margin percentage", "gross profit margin", "trailing twelve months LTM", "fiscal year FY-1 FY-2 FY-3"

Use specific market terminology: "total addressable market TAM", "serviceable addressable market SAM", "compound annual growth rate CAGR", "market share percentage", "competitive positioning", "barriers to entry", "competitive moat", "market leader", "niche player"

Use specific investment terminology: "investment thesis", "value creation levers", "margin expansion opportunities", "add-on acquisition potential", "operational improvements", "M&A strategy", "preliminary recommendation", "due diligence questions"

**CONTEXT ENRICHMENT**:
- Document structure hints: Look for section headers like "Financial Summary", "Market Analysis", "Competitive Landscape", "Management Team", "Investment Highlights"
- Table locations: Financial tables typically in "Financial Summary" or "Historical Financials" sections, may also be in appendices
- Appendix references: Check appendices for detailed financials, management bios, market research, competitive analysis
- Page number context: Note page numbers for key sections and tables for validation`;
}

/**
@@ -1390,18 +1422,73 @@ export class OptimizedAgenticRAGProcessor {
  pinnedChunks: ProcessingChunk[] = []
): Promise<{ data: Partial<CIMReview>; apiCalls: number }> {
  const query = `Extract deal information, company metadata, and comprehensive financial data including:
- Target company name, industry sector, geography, deal source (the investment bank or firm marketing the deal - look for names like "Capstone Partners", "Harris Williams", "Raymond James", etc. in the document header, footer, or contact information), transaction type
- CIM document dates, reviewers, page count, stated reason for sale, employee count
- CRITICAL: Extract ALL fields completely. Do NOT use "Not specified in CIM" unless you have thoroughly searched the entire document and confirmed the information is truly not present. Be thorough and extract all available information.
- CRITICAL: Find and extract financial tables with historical data. Look for tables showing:
  * Revenue (also called "Net Sales", "Total Revenue") for FY-3, FY-2, FY-1, and LTM (Last Twelve Months) or TTM (Trailing Twelve Months)

**DEAL SOURCE EXTRACTION (Enhanced Patterns)**:
The deal source is the investment bank or firm marketing the deal. Look for it in these locations:
- **Cover Page**: Check the cover page for investment bank logos, names, or "Prepared by" statements
- **Document Header/Footer**: Look for investment bank names in headers or footers on each page
- **Contact Information Page**: Check for "For inquiries contact" or "Contact Information" sections
- **Common Investment Bank Names**: Look for firms like "Capstone Partners", "Harris Williams", "Raymond James", "Jefferies", "Piper Sandler", "William Blair", "Stifel", "Baird", "Lincoln International", "Duff & Phelps", "Houlihan Lokey", "Moelis", "Lazard", "Goldman Sachs", "Morgan Stanley", "JPMorgan", "Bank of America", "Wells Fargo", "Citigroup", "Credit Suisse", "UBS", "Deutsche Bank", "Barclays", "RBC Capital Markets", "TD Securities", "BMO Capital Markets", "CIBC World Markets", "Scotiabank", "National Bank Financial", "Canaccord Genuity", "Desjardins Securities", "Laurentian Bank Securities", "Cormark Securities", "Eight Capital", "GMP Securities", "GMP Capital Markets"
- **Email Domains**: Investment banks often have distinctive email domains (e.g., "@harriswilliams.com", "@capstonepartners.com")
- **Phone Numbers**: May be listed with area codes (e.g., "(212)" for NYC-based banks, "(312)" for Chicago-based banks)
- If no investment bank is found, look for "M&A Advisor", "Financial Advisor", "Transaction Advisor", or similar terms

**METADATA CROSS-VALIDATION**:
- **Company Name**: Extract from cover page, executive summary, and business description. Verify consistency across all mentions. Use the exact legal entity name if provided (e.g., "ABC Company, Inc." not just "ABC Company")
- **Industry Sector**: Look in executive summary, business description, and market analysis. Cross-reference to ensure consistency. Use specific industry classifications (e.g., "Specialty Chemicals" not just "Chemicals", "B2B Software/SaaS" not just "Software")
- **Geography**: Extract headquarters location and key operating locations. Format as "City, State" (e.g., "Cleveland, OH" not just "Cleveland"). Check for multiple locations mentioned in operations section

**DATE EXTRACTION (Enhanced Handling)**:
- **CIM Document Date**: Look for "Date:", "As of:", "Prepared:", "Dated:" on cover page or first page. Handle formats: "March 15, 2024", "3/15/2024", "2024-03-15", "Mar 2024", "Q1 2024"
- **Review Date**: Extract when the CIM was reviewed (may be different from document date)
- **Fiscal Year End**: Look for "Fiscal Year End:", "FYE:", "Fiscal Year:", "Year End:" - common formats: "March 31", "Dec 31", "Sep 30", "Jun 30"
- **LTM Calculation Date**: If LTM period is shown, note the calculation date (e.g., "LTM Mar 2024" = through March 2024)

**EMPLOYEE COUNT CONTEXT**:
Look for employee count in these locations:
- **Executive Summary**: Often mentioned in company overview
- **Company Overview Section**: May have "About Us" or "Company Profile" with headcount
- **Organizational Chart**: If org chart is provided, may indicate approximate headcount
- **Business Description**: May mention "team of X employees" or "X-person organization"
- **Management Section**: May reference "X employees across Y locations"
- Format as number only (e.g., "250" not "approximately 250 employees")

**FINANCIAL TABLE DETECTION (Enhanced Instructions)**:
CRITICAL: Find and extract financial tables with historical data. Use these context clues to identify PRIMARY vs subsidiary tables:

1. **Table Location Indicators**:
   - PRIMARY tables are usually in main "Financial Summary" or "Historical Financials" sections
   - Subsidiary tables may be in appendices or segment breakdown sections
   - PRIMARY table typically appears before subsidiary tables

2. **Value Magnitude Indicators**:
   - PRIMARY table: Values in millions ($64M, $71M, $76M) - typical for target companies
   - Subsidiary table: Values in thousands ($20,546, $26,352) - for segments or subsidiaries
   - If revenue < $10M, you're likely looking at wrong table

3. **Table Title Indicators**:
   - PRIMARY: "Financial Summary", "Historical Financials", "Income Statement", "P&L", "Financial Performance", "Key Metrics", "Consolidated Financials"
   - Subsidiary: "Segment Results", "Division Performance", "[Subsidiary Name] Financials", "Business Unit Results"

4. **Table Structure Indicators**:
   - PRIMARY: Shows consolidated company results with 3-4 periods (FY-3, FY-2, FY-1, LTM)
   - Subsidiary: Shows individual business units, may have different period structures

5. **Cross-Reference Validation**:
   - Check executive summary for financial highlights - should match PRIMARY table magnitude
   - If executive summary says "$64M revenue" but table shows "$20,546", you're using wrong table

Look for tables showing:
  * Revenue (also called "Net Sales", "Total Revenue", "Top Line") for FY-3, FY-2, FY-1, and LTM (Last Twelve Months) or TTM (Trailing Twelve Months)
  * Revenue growth percentages (YoY, year-over-year, % change)
  * EBITDA (also called "Adjusted EBITDA", "Adj. EBITDA") for all periods
  * EBITDA (also called "Adjusted EBITDA", "Adj. EBITDA", "EBITDA (Adjusted)") for all periods
  * EBITDA margin percentages for all periods
  * Gross profit and gross margin percentages for all periods
- Financial tables may be labeled as: "Financial Summary", "Historical Financials", "Income Statement", "P&L", "Financial Performance", "Key Metrics", or similar
- Tables typically have column headers with years (2021, 2022, 2023, 2024, FY2021, FY2022, FY2023, FY2024) or periods (FY-3, FY-2, FY-1, LTM, TTM)

CRITICAL: Extract ALL fields completely. Do NOT use "Not specified in CIM" unless you have thoroughly searched the entire document and confirmed the information is truly not present. Be thorough and extract all available information.

EXAMPLE FINANCIAL TABLE FORMAT:
Financial tables in CIMs typically look like this:
              FY-3    FY-2    FY-1    LTM
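The magnitude cross-check described in the table-detection instructions (executive-summary revenue vs extracted-table revenue) can be sketched as a post-extraction guard. The function name, threshold, and unit assumptions here are illustrative only, not part of the committed code.

```typescript
// Hedged sketch of the cross-reference validation step: if the executive
// summary cites revenue around $64M but the extracted table is orders of
// magnitude off, the extractor likely picked a subsidiary/segment table.
// Both inputs are assumed to be in $M; the 10x threshold is an assumption.
function looksLikeWrongTable(summaryRevenueM: number, tableRevenueM: number): boolean {
  if (tableRevenueM <= 0) return true; // empty or unparsed table
  const ratio = summaryRevenueM / tableRevenueM;
  return ratio > 10 || ratio < 0.1; // >10x mismatch in either direction
}

console.log(looksLikeWrongTable(64, 20.5)); // within 10x → false
console.log(looksLikeWrongTable(64, 0.02)); // $64M vs ~$20K scale → true
```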
@@ -1563,27 +1650,67 @@ IMPORTANT EXTRACTION RULES:
  chunks: ProcessingChunk[]
): Promise<{ data: Partial<CIMReview>; apiCalls: number }> {
  const query = `Extract market analysis, business operations, and management information including:
- CRITICAL: Extract ALL fields completely. Do NOT use "Not specified in CIM" unless you have thoroughly searched the entire document and confirmed the information is truly not present. Be thorough and extract all available information.
- Total addressable market (TAM) size estimates and calculations
- Serviceable addressable market (SAM) and target market sizing
- Market growth rates, CAGR historical and projected
- Industry trends, drivers, tailwinds and headwinds
- Competitive landscape and key competitor identification
- Company's market position, ranking, and market share
- Basis of competition and competitive differentiation
- Barriers to entry and competitive moats
- Core business operations and operational model description
- Key products, services, and service lines with revenue mix
- Unique value proposition and competitive differentiation
- Customer base overview, segments, and customer types
- Customer concentration risk, top customers percentage
- Contract length, recurring revenue, and retention rates
- Key supplier dependencies and supply chain risks
- Management team structure and key leaders
- CEO, CFO, COO, and executive leadership bios and backgrounds
- Management quality, experience, and track record
- Post-transaction management intentions and rollover
- Organizational structure, reporting relationships, depth of team`;

CRITICAL: Extract ALL fields completely. Do NOT use "Not specified in CIM" unless you have thoroughly searched the entire document and confirmed the information is truly not present. Be thorough and extract all available information.

**MARKET ANALYSIS FRAMEWORK (TAM/SAM/SOM Methodology)**:
- **Total Addressable Market (TAM)**: The total market demand for a product or service. Extract size estimates, calculation methodology, and data sources. Format as "$XX.XB" or "$XX.XM" with time period (e.g., "$5.2B in 2024")
- **Serviceable Addressable Market (SAM)**: The portion of TAM that can be reached with current products/services. Extract size and how it's defined (geographic, product, customer segment limitations)
- **Serviceable Obtainable Market (SOM)**: The portion of SAM that can realistically be captured. Extract market share targets, growth plans, and capture strategy
- **Market Growth Rates**: Extract historical CAGR (Compound Annual Growth Rate) and projected growth rates. Include time periods (e.g., "8.5% CAGR from 2020-2024, projected 7.2% CAGR 2024-2028")
- **Market Sizing Approaches**: Note if TAM/SAM calculated using top-down (industry reports, government data) or bottom-up (customer count × average spend) methodology

**COMPETITIVE INTELLIGENCE DEPTH**:
- **Key Competitors**: Identify specific competitor names, not just generic descriptions. Include both direct competitors (same products/services) and indirect competitors (alternative solutions)
- **Market Share Context**: Extract market share percentages if stated (e.g., "Company X holds 15% market share, second largest player"). If not stated, infer from market position descriptions
- **Competitive Positioning**: Identify where company ranks (e.g., "#1", "#2", "Top 3", "Top 5", "Market leader", "Niche player", "Follower"). Extract specific rankings if provided
- **Differentiation Drivers**: Extract specific competitive advantages (technology, service, pricing, distribution, brand, customer relationships). Quantify where possible (e.g., "30% faster delivery than competitors")
- **Competitive Dynamics**: Extract information about competitive intensity (fragmented vs consolidated market, price competition vs differentiation, new entrants, consolidation trends)

**CUSTOMER ANALYSIS DEPTH**:
- **Customer Lifetime Value (LTV)**: Extract if mentioned, or calculate if data available (average contract value × contract length × retention rate)
- **Churn Rates**: Extract customer churn percentages, retention rates, or renewal rates. Include time periods (e.g., "95% annual retention rate")
- **Expansion Rates**: Extract upsell/cross-sell rates, expansion revenue, or net revenue retention (NRR). Format as percentages (e.g., "120% NRR in FY-1")
- **Contract Terms**: Extract typical contract length, renewal terms, pricing models (fixed, variable, usage-based, subscription). Include specific details (e.g., "3-year contracts with 2-year renewal options")
- **Pricing Models**: Extract pricing structure (per seat, per transaction, percentage of revenue, fixed fee, etc.). Include pricing levels if mentioned
- **Customer Segments**: Extract detailed customer segmentation (by size, industry, geography, product usage). Include revenue mix by segment if available
- **Customer Concentration Risk**: Extract top 5 and top 10 customer percentages. Include specific customer names if mentioned and their revenue contribution
- **Recurring Revenue**: Extract recurring revenue percentage, subscription revenue, or contract-based revenue. Distinguish between MRR/ARR if applicable

**SUPPLIER RISK ASSESSMENT**:
- **Supplier Concentration**: Extract top supplier percentages, single-source dependencies, or supplier concentration metrics
- **Switching Costs**: Extract information about supplier switching difficulty, contract terms, or lock-in factors
- **Dependency Analysis**: Identify critical suppliers, sole-source relationships, or suppliers that would be difficult to replace
- **Supply Chain Resilience**: Extract information about supply chain risks, geographic concentration, backup suppliers, or supply chain diversification
- **Supplier Relationships**: Extract information about supplier relationships (long-term contracts, strategic partnerships, preferred vendor status)

**INDUSTRY TREND ANALYSIS**:
- **Tailwinds (Positive Trends)**: Extract growth drivers, favorable market conditions, technology adoption, regulatory changes, demographic trends, economic factors supporting growth
- **Headwinds (Negative Trends)**: Extract challenges, unfavorable market conditions, technology disruption, regulatory risks, competitive threats, economic headwinds
- **Regulatory Changes**: Extract information about regulatory environment, compliance requirements, pending regulations, or regulatory risks/opportunities
- **Technology Disruptions**: Extract information about technology trends, digital transformation, automation, AI/ML adoption, or technology threats/opportunities
- **Consolidation Trends**: Extract information about industry consolidation, M&A activity, roll-up strategies, or market structure changes

**BARRIERS TO ENTRY FRAMEWORK (Porter's Framework)**:
- **Capital Requirements**: Extract information about capital intensity, investment requirements, or barriers to entry from capital perspective
- **Regulatory Barriers**: Extract licenses, certifications, regulatory approvals, or compliance requirements that create barriers
- **Technology Barriers**: Extract proprietary technology, patents, R&D requirements, or technical expertise needed
- **Brand/Distribution Barriers**: Extract brand strength, customer relationships, distribution channels, or market presence that create barriers
- **Economies of Scale**: Extract information about scale advantages, cost structure, or operational efficiencies that create barriers
- **Switching Costs**: Extract customer switching costs, integration requirements, or lock-in factors that protect market position

**BUSINESS OPERATIONS**:
- Core business operations and operational model description (how the business operates day-to-day)
- Key products, services, and service lines with revenue mix (percentage breakdown if available)
- Unique value proposition and competitive differentiation (why customers choose this company)
- Operational capabilities and core competencies

**MANAGEMENT TEAM**:
- Management team structure and key leaders (CEO, CFO, COO, Head of Sales, etc.)
- CEO, CFO, COO, and executive leadership bios and backgrounds (years of experience, prior companies, track record)
- Management quality, experience, and track record (specific achievements, industry recognition)
- Post-transaction management intentions and rollover (will management stay, equity rollover, retention plans)
- Organizational structure, reporting relationships, depth of team (org chart details if available)`;

const targetFields = [
  'marketIndustryAnalysis.*',
@@ -1767,24 +1894,177 @@ IMPORTANT EXTRACTION RULES:
|
||||
text: string,
|
||||
chunks: ProcessingChunk[]
|
||||
): Promise<{ data: Partial<CIMReview>; apiCalls: number }> {
|
||||
const query = `Synthesize investment analysis and strategic assessment including:
|
||||
- CRITICAL: Extract ALL fields completely. Do NOT use "Not specified in CIM" unless you have thoroughly searched the entire document and confirmed the information is truly not present. Be thorough and extract all available information.
|
||||
- Key investment attractions, strengths, and reasons to invest
|
||||
- Investment highlights and compelling attributes
|
||||
- Potential risks, concerns, and reasons not to invest
|
||||
- Red flags and areas of concern
|
||||
- Value creation opportunities and levers for PE value-add
|
||||
- Operational improvements and margin expansion opportunities
|
||||
- M&A and add-on acquisition potential
|
||||
- Technology enablement and digital transformation opportunities
|
||||
- Alignment with BPCP fund strategy (5MM+ EBITDA, consumer/industrial sectors)
|
||||
- Geographic fit with Cleveland/Charlotte proximity
|
||||
- Founder/family ownership alignment
|
||||
- Critical questions for management and due diligence
|
||||
- Missing information and gaps requiring further investigation
|
||||
- Preliminary recommendation (Pass/Pursue/More Info)
|
||||
- Rationale for recommendation
|
||||
- Proposed next steps and action items`;
|
||||
const query = `Synthesize investment analysis and strategic assessment using world-class PE investment thesis framework:
|
||||
|
||||
CRITICAL: Extract ALL fields completely. Do NOT use "Not specified in CIM" unless you have thoroughly searched the entire document and confirmed the information is truly not present. Be thorough and extract all available information.
|
||||
|
||||
**INVESTMENT THESIS FRAMEWORK (Standard PE Framework)**:
|
||||
Structure your analysis using these four pillars:
|
||||
|
||||
1. **Investment Highlights**: Key attractions, strengths, and reasons to invest
|
||||
- Market position and competitive advantages
|
||||
- Financial performance and growth trajectory
|
||||
- Management team quality and track record
|
||||
- Market opportunity and growth potential
|
||||
- Operational excellence and scalability
|
||||
|
||||
2. **Value Creation Plan**: Specific levers for value creation
|
||||
- Revenue growth opportunities (organic and inorganic)
|
||||
- Margin expansion potential
|
||||
- Operational improvements
|
||||
- M&A and add-on acquisition strategy
|
||||
- Technology and digital transformation
|
||||
- Multiple expansion potential
|
||||
|
||||
3. **Risk Assessment**: Comprehensive risk evaluation
|
||||
- Operational risks (execution, competitive, market)
|
||||
- Financial risks (leverage, liquidity, customer concentration)
|
||||
- Market risks (industry trends, regulatory, economic)
|
||||
- Execution risks (integration, management, technology)
|
||||
- Regulatory risks (compliance, changes, approvals)
|
||||
- Technology risks (disruption, obsolescence, cybersecurity)
|
||||
|
||||
4. **Strategic Rationale**: Why this investment makes sense
|
||||
- Alignment with fund strategy
|
||||
- Fit with portfolio companies
|
||||
- Market timing and opportunity
|
||||
- Competitive positioning
|
||||
- Exit potential and multiple expansion
|
||||
|
||||
**VALUE CREATION PLAYBOOK (Quantification Guidance)**:

For each value creation lever, provide:
- **Specific Opportunity**: What exactly can be improved (e.g., "Reduce SG&A by 150 bps through shared services consolidation")
- **Quantification**: Potential impact in dollars or percentages (e.g., "adding $1.5M EBITDA" or "200-300 bps margin expansion")
- **Implementation Approach**: How BPCP would execute (e.g., "Leverage BPCP's shared services platform and procurement expertise")
- **Timeline**: Expected time to realize value (e.g., "12-18 months")
- **Confidence Level**: High/Medium/Low based on CIM evidence

Value Creation Levers to Evaluate:
- **Revenue Growth**: Pricing optimization, new products/services, market expansion, sales force effectiveness, customer acquisition
- **Margin Expansion**: Cost reduction, pricing power, operational efficiency, supply chain optimization, procurement improvements
- **M&A Strategy**: Add-on acquisition targets, roll-up opportunities, platform expansion, geographic expansion
- **Operational Improvements**: Technology enablement, process optimization, automation, shared services, best practices
- **Multiple Expansion**: Market position improvement, growth acceleration, margin expansion, strategic value
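The quantification guidance above mixes basis-point and dollar framings of the same lever. As a minimal sketch (helper name is an illustration, not from this codebase), the conversion between them is simple arithmetic:

```typescript
// Illustrative helper: converts a basis-point margin improvement on a given
// revenue base into an EBITDA dollar impact. 100 bps = 1.00% of revenue.
function ebitdaImpactFromBps(revenue: number, bps: number): number {
  return (revenue * bps) / 10_000;
}

// e.g. the prompt's example: 150 bps of SG&A reduction on $100M of revenue
// corresponds to roughly $1.5M of incremental EBITDA.
const impact = ebitdaImpactFromBps(100_000_000, 150);
```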
**RISK CATEGORIZATION (Structured Risk Types)**:

Categorize each risk by type and assess:

1. **Operational Risks**:
   - Execution risk (can management deliver on plans?)
   - Competitive risk (market position, competitive response)
   - Market risk (demand, pricing, customer behavior)
   - Operational risk (supply chain, quality, capacity)

2. **Financial Risks**:
   - Leverage risk (debt levels, interest rate exposure)
   - Liquidity risk (cash flow, working capital)
   - Customer concentration risk (top customer dependency)
   - Revenue risk (contract renewal, churn, pricing)

3. **Market Risks**:
   - Industry trends (growth, consolidation, disruption)
   - Regulatory changes (compliance, approvals, restrictions)
   - Economic factors (recession, inflation, interest rates)
   - Technology disruption (new technologies, obsolescence)

4. **Execution Risks**:
   - Integration risk (if M&A involved)
   - Management risk (retention, capability, succession)
   - Technology risk (implementation, cybersecurity, obsolescence)
   - Cultural risk (organizational change, employee retention)

5. **Regulatory Risks**:
   - Compliance requirements
   - Pending regulations
   - Regulatory approvals needed
   - Industry-specific regulations

6. **Technology Risks**:
   - Technology disruption
   - Cybersecurity threats
   - Technology obsolescence
   - Digital transformation challenges

For each risk, assess:
- **Probability**: High/Medium/Low likelihood of occurring
- **Impact**: High/Medium/Low impact on investment if it occurs
- **Mitigation**: How the risk can be managed or mitigated
- **Deal-Breaker Status**: Is this a deal-breaker or manageable?
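The category/probability/impact/mitigation/deal-breaker structure above maps naturally onto a typed record. A sketch under assumed names (this is not the application's actual schema):

```typescript
// Hypothetical shape for the structured risk assessment the prompt requests;
// field and type names are illustrative assumptions.
type Level = 'High' | 'Medium' | 'Low';

type RiskCategory =
  | 'Operational' | 'Financial' | 'Market'
  | 'Execution' | 'Regulatory' | 'Technology';

interface RiskAssessment {
  category: RiskCategory;
  description: string;
  probability: Level;
  impact: Level;
  mitigation: string;
  dealBreaker: boolean;
}

// Simple severity rank so High-probability/High-impact risks sort to the top.
const LEVEL_SCORE: Record<Level, number> = { High: 3, Medium: 2, Low: 1 };

function severity(risk: RiskAssessment): number {
  return LEVEL_SCORE[risk.probability] * LEVEL_SCORE[risk.impact];
}
```

A multiplicative score is one common convention for probability-times-impact ranking; a lookup matrix would work equally well.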
**BPCP ALIGNMENT SCORING (Quantitative Assessment)**:

Provide quantitative scores (1-10) for each alignment criterion:

1. **EBITDA Fit**: Score 1-10 based on EBITDA range ($5MM+ target; higher scores for the $5-20MM range)
2. **Industry Fit**: Score 1-10 based on consumer/industrial sector focus
3. **Geographic Fit**: Score 1-10 based on proximity to Cleveland/Charlotte (driving distance)
4. **Value Creation Fit**: Score 1-10 based on alignment with BPCP expertise (M&A, technology, supply chain, human capital)
5. **Ownership Fit**: Score 1-10 based on founder/family ownership (preferred)
6. **Growth Potential**: Score 1-10 based on growth trajectory and market opportunity
7. **Management Quality**: Score 1-10 based on management team assessment

Provide an overall alignment score and specific areas of fit/misalignment.
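The prompt asks for seven criterion scores plus an overall score but does not specify how the overall is derived. One plausible reading, sketched with assumed names (an unweighted mean; the real aggregation, if any, is not shown in this diff):

```typescript
// Hypothetical aggregation of the seven 1-10 criterion scores into a single
// overall alignment score via a simple mean, rounded to one decimal place.
interface AlignmentScores {
  ebitdaFit: number;
  industryFit: number;
  geographicFit: number;
  valueCreationFit: number;
  ownershipFit: number;
  growthPotential: number;
  managementQuality: number;
}

function overallAlignmentScore(s: AlignmentScores): number {
  const values = Object.values(s);
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return Math.round(mean * 10) / 10; // one decimal place
}
```

A weighted mean (e.g., weighting EBITDA fit more heavily) would be a natural refinement if some criteria matter more to the fund.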
**COMPARABLE ANALYSIS**:
- **Comparable Companies**: Identify and reference comparable companies mentioned in the CIM (competitors, peers)
- **Transaction Multiples**: Extract transaction multiples if mentioned (revenue multiples, EBITDA multiples)
- **Industry Benchmarks**: Reference industry benchmarks for margins, growth rates, multiples
- **Valuation Context**: Use comparables to assess valuation attractiveness
**MANAGEMENT ASSESSMENT DEPTH**:
- **Experience Scoring**: Assess years of experience, prior company success, industry expertise (score 1-10)
- **Track Record Analysis**: Evaluate specific achievements, growth under management, operational improvements
- **Retention Risk**: Assess likelihood of management staying post-transaction (High/Medium/Low)
- **Succession Planning**: Evaluate depth of team, key person risk, succession plans
- **Management Equity**: Assess management rollover, alignment of interests, incentive structure
**DUE DILIGENCE PRIORITIZATION**:

Rank questions and missing information by investment decision impact:

1. **Deal-Breakers (Priority 1)**: Questions that could kill the deal if not answered favorably
   - Financial accuracy and quality of earnings
   - Major customer concentration or retention risk
   - Regulatory approvals or compliance issues
   - Management retention and succession

2. **High Impact (Priority 2)**: Questions that significantly affect valuation or investment thesis
   - Growth assumptions and market opportunity
   - Competitive position and differentiation
   - Operational improvements and value creation
   - M&A strategy and add-on potential

3. **Medium Impact (Priority 3)**: Questions that affect investment structure or terms
   - Working capital requirements
   - Capital expenditure needs
   - Technology requirements
   - Integration considerations

4. **Nice-to-Know (Priority 4)**: Questions that provide additional context but don't affect the core decision
   - Industry trends and benchmarks
   - Competitive dynamics
   - Market research and analysis

For each question, explain:
- **Context**: Why this question matters
- **Investment Impact**: How the answer affects the investment decision
- **Priority**: Deal-breaker, High, Medium, or Nice-to-know
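The four priority tiers above imply a total ordering that downstream code could use to sort extracted questions. A sketch under assumed names (the Question shape is hypothetical, not this application's type):

```typescript
// Hypothetical ordering of diligence questions by the four priority tiers.
const PRIORITY_ORDER = ['Deal-breaker', 'High', 'Medium', 'Nice-to-know'] as const;
type Priority = typeof PRIORITY_ORDER[number];

interface DiligenceQuestion {
  question: string;
  priority: Priority;
}

// Returns a new array sorted most-critical-first, leaving the input untouched.
function sortByPriority(questions: DiligenceQuestion[]): DiligenceQuestion[] {
  return [...questions].sort(
    (a, b) => PRIORITY_ORDER.indexOf(a.priority) - PRIORITY_ORDER.indexOf(b.priority)
  );
}
```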
**INVESTMENT ATTRACTIONS**:

Extract 5-8 detailed strengths. For each, include:
- **What**: The specific advantage or strength
- **Why It Matters**: Why this makes the investment attractive
- **Quantification**: Numbers, percentages, or metrics if available
- **Investment Impact**: How this affects the investment thesis
**PRELIMINARY RECOMMENDATION**:

Provide a clear recommendation: "Proceed", "Pass", or "Proceed with Caution".
Include a brief justification focusing on the most compelling factors.
**RATIONALE FOR RECOMMENDATION**:

Provide 3-4 key reasons supporting the recommendation, focusing on:
- Most compelling investment attractions
- Most significant risks or concerns
- Strategic fit and alignment
- Value creation potential`;
const targetFields = [
  'preliminaryInvestmentThesis.keyAttractions',

@@ -1874,11 +2154,19 @@ IMPORTANT EXTRACTION RULES:
      pinnedChunks: pinnedChunks.length
    });

    // Call LLM with the reduced text, focused fields, and detailed extraction instructions from RAG query
    // Enhance extraction instructions with field-specific templates and dynamic generation
    const enhancedExtractionInstructions = this.buildEnhancedExtractionInstructions(
      ragQuery,
      targetFields,
      selectedChunks,
      options
    );

    // Call LLM with the reduced text, focused fields, and enhanced extraction instructions
    // NOTE: To use Haiku for faster processing, set LLM_MODEL=claude-haiku-4-5-20251001
    // or use OpenRouter with model: anthropic/claude-haiku-4.5
    // Pass targetFields as focusedFields and ragQuery as extractionInstructions to fully utilize agentic RAG
    const result = await llmService.processCIMDocument(reducedText, 'BPCP CIM Review Template', undefined, targetFields, ragQuery);
    // Pass targetFields as focusedFields and enhancedExtractionInstructions as extractionInstructions to fully utilize agentic RAG
    const result = await llmService.processCIMDocument(reducedText, 'BPCP CIM Review Template', undefined, targetFields, enhancedExtractionInstructions);

    if (!result.success || !result.jsonOutput) {
      logger.warn('Targeted extraction pass returned no data', { documentId, ragQuery: ragQuery.substring(0, 50) });
@@ -1900,6 +2188,140 @@ IMPORTANT EXTRACTION RULES:
    }
  }
  /**
   * Build enhanced extraction instructions with field-specific templates and dynamic generation
   */
  private buildEnhancedExtractionInstructions(
    baseQuery: string,
    targetFields: string[],
    selectedChunks: ProcessingChunk[],
    options?: {
      isFinancialPass?: boolean;
    }
  ): string {
    // Categorize target fields for field-specific instruction templates
    const financialFields = targetFields.filter(f => f.includes('financial') || f.includes('revenue') || f.includes('ebitda') || f.includes('margin') || f.includes('profit'));
    const marketFields = targetFields.filter(f => f.includes('market') || f.includes('industry') || f.includes('competitive'));
    const businessFields = targetFields.filter(f => f.includes('business') || f.includes('customer') || f.includes('supplier') || f.includes('product'));
    const managementFields = targetFields.filter(f => f.includes('management') || f.includes('team') || f.includes('leader'));
    const investmentFields = targetFields.filter(f => f.includes('investment') || f.includes('thesis') || f.includes('risk') || f.includes('attraction') || f.includes('valueCreation'));

    // Detect document characteristics from chunks
    const hasFinancialTables = selectedChunks.some(chunk => chunk.metadata?.isFinancialTable === true);
    const hasStructuredTables = selectedChunks.some(chunk => chunk.metadata?.isStructuredTable === true);
    const hasProjections = selectedChunks.some(chunk =>
      chunk.content.match(/\b(20\d{2}[EP]|Projected|Forecast|Budget|Plan)\b/i)
    );
    const hasAppendices = selectedChunks.some(chunk =>
      chunk.content.match(/\b(Appendix|Exhibit|Attachment)\b/i)
    );

    let enhancedInstructions = baseQuery + '\n\n';

    // Add field-specific instruction templates
    if (financialFields.length > 0) {
      enhancedInstructions += `**FINANCIAL FIELD EXTRACTION TEMPLATE**:
- **Table Detection**: ${hasFinancialTables ? 'Financial tables detected in document. Use PRIMARY table with values in millions ($20M-$1B+), not subsidiary tables with values in thousands.' : 'Search for financial tables in "Financial Summary", "Historical Financials", "Income Statement" sections.'}
- **Period Mapping**: Identify FY-3 (oldest), FY-2, FY-1 (most recent full year), LTM (trailing period). Handle various formats (years, FY-X, mixed).
- **Value Extraction**: Extract exact values, preserve format ($64M, 29.3%, etc.). Cross-reference with executive summary financial highlights.
- **Validation**: Verify magnitude ($10M+ revenue), trends (generally increasing/stable), margins (5-50% EBITDA, 20-80% gross), calculations (growth rates, margins).
- **Cross-Reference**: ${hasAppendices ? 'Check appendices for additional financial detail or adjustments.' : 'If appendices exist, check for detailed financials.'}
- **Table Type**: ${hasProjections ? 'IGNORE projection tables (marked with E, P, PF, Projected, Forecast). Only extract historical data.' : 'Extract from historical financial tables only.'}

`;
    }

    if (marketFields.length > 0) {
      enhancedInstructions += `**MARKET FIELD EXTRACTION TEMPLATE**:
- **Market Sizing**: Extract TAM/SAM/SOM with methodology (top-down vs bottom-up), data sources, time periods.
- **Growth Rates**: Extract historical and projected CAGR with time periods (e.g., "8.5% CAGR 2020-2024, projected 7.2% CAGR 2024-2028").
- **Competitive Analysis**: Extract specific competitor names, market share percentages, competitive positioning (#1, #2, Top 3, etc.), differentiation drivers.
- **Industry Trends**: Categorize as tailwinds (positive) or headwinds (negative). Include regulatory changes, technology disruptions, consolidation trends.
- **Barriers to Entry**: Use Porter's framework - capital, regulatory, technology, brand/distribution, economies of scale, switching costs.

`;
    }

    if (businessFields.length > 0) {
      enhancedInstructions += `**BUSINESS FIELD EXTRACTION TEMPLATE**:
- **Operations**: Extract core operations description, operational model, day-to-day business processes.
- **Products/Services**: Extract specific products/services with revenue mix percentages if available. Include service lines, product categories.
- **Customer Analysis**: Extract customer segments, LTV, churn rates, expansion rates (NRR), contract terms, pricing models, customer concentration (top 5, top 10 %).
- **Supplier Analysis**: Extract supplier concentration, switching costs, dependency analysis, supply chain resilience, critical suppliers.
- **Value Proposition**: Extract specific reasons customers choose this company (technology, service, pricing, distribution, brand, relationships).

`;
    }

    if (managementFields.length > 0) {
      enhancedInstructions += `**MANAGEMENT FIELD EXTRACTION TEMPLATE**:
- **Key Leaders**: Extract CEO, CFO, COO, Head of Sales, and other key executives with specific titles.
- **Experience**: Extract years of experience, prior companies, track record, specific achievements, industry recognition.
- **Quality Assessment**: Score 1-10 based on experience, track record, industry expertise. Provide specific examples.
- **Retention Risk**: Assess likelihood of staying post-transaction (High/Medium/Low). Extract rollover equity, retention plans.
- **Succession Planning**: Evaluate depth of team, key person risk, succession plans, organizational structure.

`;
    }

    if (investmentFields.length > 0) {
      enhancedInstructions += `**INVESTMENT FIELD EXTRACTION TEMPLATE**:
- **Attractions**: Extract 5-8 strengths with specificity, quantification, context, investment significance. Format: Numbered list, 2-3 sentences each.
- **Risks**: Extract 5-8 risks categorized by type (operational, financial, market, execution, regulatory, technology). Assess probability, impact, mitigations, deal-breaker status.
- **Value Creation**: Extract 5-8 levers with specific opportunity, quantification, implementation approach, timeline, confidence level.
- **Alignment**: Score 1-10 for each BPCP criterion (EBITDA fit, industry fit, geographic fit, value creation fit, ownership fit, growth potential, management quality).

`;
    }

    // Add validation instructions
    enhancedInstructions += `**CROSS-REFERENCE VALIDATION INSTRUCTIONS**:
- Validate extracted data against other document sections (executive summary, detailed sections, appendices).
- If company name appears in multiple places, ensure consistency.
- If financial data appears in multiple places, use most authoritative source (typically detailed historical table).
- Cross-check market data with competitive landscape section.
- Verify management information across management team section and organizational structure.

`;

    // Add dynamic instructions based on document characteristics
    if (hasFinancialTables) {
      enhancedInstructions += `**DOCUMENT CHARACTERISTIC: Financial Tables Detected**
- Primary financial table identified. Extract from this table, cross-reference with executive summary.
- Verify table is PRIMARY (values in millions) not subsidiary (values in thousands).
- Check for multiple financial tables - use the one with largest revenue values.

`;
    }

    if (hasStructuredTables) {
      enhancedInstructions += `**DOCUMENT CHARACTERISTIC: Structured Tables Detected**
- Structured tables available. Use these for accurate financial extraction.
- Cross-reference structured table data with narrative text for validation.

`;
    }

    if (hasAppendices) {
      enhancedInstructions += `**DOCUMENT CHARACTERISTIC: Appendices Detected**
- Check appendices for additional financial detail, management bios, market research, competitive analysis.
- Appendices may contain detailed information not in main sections.

`;
    }

    if (hasProjections) {
      enhancedInstructions += `**DOCUMENT CHARACTERISTIC: Projections Detected**
- IGNORE projection tables (marked with E, P, PF, Projected, Forecast, Budget, Plan).
- Only extract from historical/actual results tables.
- If both historical and projected tables exist, use historical only.

`;
    }

    return enhancedInstructions;
  }
  private hasStructuredFinancialData(financials?: ParsedFinancials | null): boolean {
    if (!financials) return false;
    const periods: Array<keyof ParsedFinancials> = ['fy3', 'fy2', 'fy1', 'ltm'];

@@ -2207,9 +2629,125 @@ IMPORTANT EXTRACTION RULES:
      return f.split('.').join(' ');
    }).join(', ');

    return `Find specific information about: ${fieldDescriptions}.
Look for data tables, appendices, exhibits, footnotes, and detailed sections that contain: ${fieldDescriptions}.
Extract exact values, numbers, percentages, names, and detailed information.`;
    // Categorize fields for field-specific search strategies
    const financialFields = fields.filter(f => f.includes('financial') || f.includes('revenue') || f.includes('ebitda') || f.includes('margin') || f.includes('profit'));
    const marketFields = fields.filter(f => f.includes('market') || f.includes('industry') || f.includes('competitive') || f.includes('TAM') || f.includes('SAM'));
    const businessFields = fields.filter(f => f.includes('business') || f.includes('customer') || f.includes('supplier') || f.includes('product') || f.includes('service'));
    const managementFields = fields.filter(f => f.includes('management') || f.includes('team') || f.includes('leader') || f.includes('organizational'));
    const dealFields = fields.filter(f => f.includes('deal') || f.includes('transaction') || f.includes('source') || f.includes('geography'));

    // Generate alternative phrasings for better search
    const generateAlternativePhrasings = (fieldPath: string): string[] => {
      const alternatives: string[] = [];
      const parts = fieldPath.split('.');

      // Add synonyms and related terms
      if (fieldPath.includes('revenue')) {
        alternatives.push('net sales', 'total sales', 'top line', 'revenue', 'sales revenue');
      }
      if (fieldPath.includes('ebitda')) {
        alternatives.push('EBITDA', 'adjusted EBITDA', 'adj EBITDA', 'earnings before interest taxes depreciation amortization');
      }
      if (fieldPath.includes('market')) {
        alternatives.push('market size', 'TAM', 'total addressable market', 'market opportunity', 'addressable market');
      }
      if (fieldPath.includes('customer')) {
        alternatives.push('customer', 'client', 'customer base', 'customer concentration', 'top customers');
      }
      if (fieldPath.includes('competitor')) {
        alternatives.push('competitor', 'competition', 'competitive landscape', 'rival', 'peer');
      }

      return alternatives.length > 0 ? alternatives : [fieldPath];
    };

    const allAlternatives = fields.flatMap(f => generateAlternativePhrasings(f));
    const uniqueAlternatives = [...new Set(allAlternatives)];

    let query = `Find specific information about: ${fieldDescriptions}.\n\n`;

    // Field-specific search strategies
    if (financialFields.length > 0) {
      query += `**FINANCIAL DATA SEARCH STRATEGY**:\n`;
      query += `- Search for financial tables, income statements, P&L statements, financial summaries\n`;
      query += `- Look in "Financial Summary", "Historical Financials", "Income Statement" sections\n`;
      query += `- Check appendices for detailed financial statements\n`;
      query += `- Cross-reference with executive summary financial highlights\n`;
      query += `- Extract exact numbers, preserve format ($64M, 29.3%, etc.)\n`;
      query += `- Verify calculations (growth rates, margins) for consistency\n\n`;
    }

    if (marketFields.length > 0) {
      query += `**MARKET DATA SEARCH STRATEGY**:\n`;
      query += `- Search in "Market Analysis", "Industry Overview", "Competitive Landscape" sections\n`;
      query += `- Look for market size estimates, growth rates, CAGR calculations\n`;
      query += `- Check for industry reports, market research references\n`;
      query += `- Extract TAM/SAM/SOM estimates with methodology\n`;
      query += `- Identify competitive positioning and market share data\n\n`;
    }

    if (businessFields.length > 0) {
      query += `**BUSINESS DATA SEARCH STRATEGY**:\n`;
      query += `- Search in "Business Description", "Company Overview", "Products & Services" sections\n`;
      query += `- Look for customer information in "Customer Base", "Sales & Marketing" sections\n`;
      query += `- Extract product/service details, revenue mix, customer segments\n`;
      query += `- Check for customer concentration data, contract terms, pricing models\n`;
      query += `- Look for supplier information in "Operations", "Supply Chain" sections\n\n`;
    }

    if (managementFields.length > 0) {
      query += `**MANAGEMENT DATA SEARCH STRATEGY**:\n`;
      query += `- Search in "Management Team", "Leadership", "Organizational Structure" sections\n`;
      query += `- Look for management bios, experience, track record\n`;
      query += `- Extract organizational charts, reporting relationships\n`;
      query += `- Check for post-transaction intentions, retention plans\n\n`;
    }

    if (dealFields.length > 0) {
      query += `**DEAL DATA SEARCH STRATEGY**:\n`;
      query += `- Search cover page, headers, footers for deal source\n`;
      query += `- Look in "Deal Overview", "Transaction Summary" sections\n`;
      query += `- Extract transaction type, deal structure, dates\n`;
      query += `- Check contact information pages for investment bank names\n\n`;
    }

    // Alternative phrasing
    query += `**ALTERNATIVE SEARCH TERMS**:\n`;
    query += `Also search using these related terms: ${uniqueAlternatives.slice(0, 10).join(', ')}\n\n`;

    // Context-aware queries
    query += `**CONTEXT-AWARE SEARCH**:\n`;
    query += `- If company name is known, search for "[Company Name] [field]" (e.g., "ABC Company revenue")\n`;
    query += `- Use section headers to locate relevant information (e.g., "Financial Summary" for financial data)\n`;
    query += `- Check footnotes, appendices, and exhibits for additional detail\n`;
    query += `- Look for tables, charts, and graphs that may contain the information\n\n`;

    // Inference rules
    query += `**INFERENCE RULES**:\n`;
    if (financialFields.some(f => f.includes('revenueGrowth'))) {
      query += `- If revenue for two periods is available, calculate growth: ((Current - Prior) / Prior) * 100\n`;
    }
    if (financialFields.some(f => f.includes('Margin'))) {
      query += `- If revenue and profit metric available, calculate margin: (Metric / Revenue) * 100\n`;
    }
    query += `- Do NOT infer values - only calculate if base data is available\n`;
    query += `- If calculation is possible, use calculated value; otherwise use "Not specified in CIM"\n\n`;

    // Cross-section search
    query += `**CROSS-SECTION SEARCH**:\n`;
    query += `- If financial data missing, check executive summary for financial highlights\n`;
    query += `- If market data missing, check competitive landscape section for market context\n`;
    query += `- If customer data missing, check business description for customer mentions\n`;
    query += `- If management data missing, check organizational structure or leadership sections\n`;
    query += `- Search related sections that may contain the information indirectly\n\n`;

    query += `**EXTRACTION REQUIREMENTS**:\n`;
    query += `- Extract exact values, numbers, percentages, names, and detailed information\n`;
    query += `- Preserve original format (currency, percentages, dates)\n`;
    query += `- If information is truly not available after thorough search, use "Not specified in CIM"\n`;
    query += `- Be thorough - check all sections, appendices, footnotes, and exhibits`;

    return query;
  }
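The inference rules embedded in the query above permit exactly two derived metrics. They can be made concrete as standalone helpers (hypothetical names, shown only to pin down the arithmetic the prompt allows):

```typescript
// Growth between two periods, per the prompt's rule:
// ((Current - Prior) / Prior) * 100
function revenueGrowthPct(current: number, prior: number): number {
  return ((current - prior) / prior) * 100;
}

// Margin of a profit metric against revenue, per the prompt's rule:
// (Metric / Revenue) * 100
function marginPct(metric: number, revenue: number): number {
  return (metric / revenue) * 100;
}
```

Note the prompt's guardrail: these are only computed when both base figures are actually present in the CIM; nothing is inferred from a single value.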
  /**
@@ -2304,18 +2842,85 @@ Extract exact values, numbers, percentages, names, and detailed information.`;
    const contextChunks = chunks.slice(0, 5); // Use first 5 chunks for context
    const context = contextChunks.map(c => c.content).join('\n\n');
// Determine field-specific quality criteria
|
||||
const getQualityCriteria = (fieldName: string): string => {
|
||||
if (fieldName.includes('Attractions') || fieldName.includes('Strengths')) {
|
||||
return `QUALITY CRITERIA FOR KEY ATTRACTIONS:
|
||||
- **Specificity**: Each item should identify a specific advantage (e.g., "Market-leading position with 25% market share" not "strong market position")
|
||||
- **Quantification**: Include numbers, percentages, or metrics where possible (e.g., "$64M revenue", "15% CAGR", "95% retention rate")
|
||||
- **Context**: Explain why this matters for the investment (e.g., "provides pricing power and competitive moat")
|
||||
- **Investment Significance**: Connect to investment thesis (e.g., "supports 2-3x revenue growth potential")`;
|
||||
} else if (fieldName.includes('Risks') || fieldName.includes('Concerns')) {
|
||||
return `QUALITY CRITERIA FOR RISKS:
|
||||
- **Risk Type**: Categorize by type (operational, financial, market, execution, regulatory, technology)
|
||||
- **Impact Assessment**: Assess probability (High/Medium/Low) and impact (High/Medium/Low)
|
||||
- **Mitigation**: Identify how risk can be managed or mitigated
|
||||
- **Deal-Breaker Status**: Indicate if this is a deal-breaker or manageable risk
|
||||
- **Specificity**: Provide specific examples from CIM (e.g., "Top 3 customers represent 45% of revenue" not "customer concentration")`;
|
||||
} else if (fieldName.includes('Value Creation') || fieldName.includes('Levers')) {
|
||||
return `QUALITY CRITERIA FOR VALUE CREATION LEVERS:
|
||||
- **Specific Opportunity**: What exactly can be improved (e.g., "Reduce SG&A by 150 bps through shared services")
|
||||
- **Quantification**: Potential impact in dollars or percentages (e.g., "adding $1.5M EBITDA" or "200-300 bps margin expansion")
|
||||
- **Implementation Approach**: How BPCP would execute (e.g., "Leverage BPCP's shared services platform")
|
||||
- **Timeline**: Expected time to realize value (e.g., "12-18 months")
|
||||
- **Confidence Level**: High/Medium/Low based on CIM evidence`;
|
||||
} else if (fieldName.includes('Questions') || fieldName.includes('Critical')) {
|
||||
return `QUALITY CRITERIA FOR CRITICAL QUESTIONS:
|
||||
- **Context**: 2-3 sentences explaining why this question matters
|
||||
- **Investment Impact**: How the answer affects the investment decision
|
||||
- **Priority**: Deal-breaker, High, Medium, or Nice-to-know
|
||||
- **Specificity**: Ask specific, actionable questions (e.g., "What is the customer retention rate for contracts expiring in the next 12 months?" not "What about customer retention?")`;
|
||||
} else if (fieldName.includes('Missing Information')) {
|
||||
return `QUALITY CRITERIA FOR MISSING INFORMATION:
|
||||
- **What's Missing**: Specific information needed (e.g., "Detailed breakdown of revenue by customer segment" not "more customer data")
|
||||
- **Why Critical**: Why this information is critical for investment decision
|
||||
- **Investment Impact**: How missing information affects valuation or investment thesis
|
||||
- **Priority**: Deal-breaker, High, Medium, or Nice-to-know`;
|
||||
}
|
||||
return '';
|
||||
};
|
||||
|
||||
const qualityCriteria = getQualityCriteria(fieldName);
|
||||
|
||||
const prompt = currentCount < 5
|
||||
? `The following list has ${currentCount} items but needs exactly ${targetCount} items (between 5-8).
|
||||
|
||||
Current ${fieldName}:
|
||||
${currentValue}
|
||||
|
||||
${qualityCriteria}
|
||||
|
||||
**PRIORITIZATION LOGIC**:
|
||||
When expanding the list, prioritize items that:
|
||||
1. Are most important for investment decision-making
|
||||
2. Have specific details, numbers, or metrics from the CIM
|
||||
3. Cover different aspects (don't overlap with existing items)
|
||||
4. Provide actionable insights for PE investors
|
||||
|
||||
**INVESTMENT DEPTH REQUIREMENTS**:
|
||||
Each item must include:
|
||||
- **What**: The specific point, risk, opportunity, or question
|
||||
- **Why It Matters**: Why this is important for the investment decision
|
||||
- **Quantification**: Numbers, percentages, or metrics if available
|
||||
- **Investment Impact**: How this affects the investment thesis, valuation, or decision
|
||||
|
||||
**CONSISTENCY CHECKS**:
|
||||
- Ensure items don't overlap or duplicate each other
|
||||
- Each item should cover a distinct aspect
|
||||
- Items should be comprehensive and cover different dimensions
|
||||
- Maintain consistent format and depth across all items
|
||||
|
||||
**FORMAT STANDARDIZATION**:
|
||||
- Use numbered format: "1. [item text] 2. [item text]" etc.
|
||||
- Each item: 2-3 sentences with specific details
|
||||
- Include specific examples, numbers, or metrics from CIM
|
||||
- Connect to investment significance
|
||||
|
||||
Based on the CIM document context below, expand this list to exactly ${targetCount} items.
|
||||
Add ${targetCount - currentCount} new items that fit the theme and context.
|
||||
Each item should be 2-3 sentences with specific details.
|
||||
Add ${targetCount - currentCount} new items that fit the theme, meet quality criteria, and provide investment-grade insights.
|
||||
|
||||
Document Context:
|
||||
${context.substring(0, 3000)}
|
||||
${context.substring(0, 4000)}
|
||||
|
||||
Return ONLY the new numbered list (format: 1. ... 2. ... etc.), nothing else.
|
||||
Do not include any preamble or explanation.`
|
||||
@@ -2324,10 +2929,39 @@ Do not include any preamble or explanation.`

Current ${fieldName}:
${currentValue}

${qualityCriteria}

**PRIORITIZATION LOGIC**:
When consolidating, prioritize items that:
1. Are most important for investment decision-making
2. Have specific details, numbers, or metrics from the CIM
3. Cover different aspects (avoid merging items that cover distinct topics)
4. Provide actionable insights for PE investors

**INVESTMENT DEPTH REQUIREMENTS**:
Each item must include:
- **What**: The specific point, risk, opportunity, or question
- **Why It Matters**: Why this is important for the investment decision
- **Quantification**: Numbers, percentages, or metrics if available
- **Investment Impact**: How this affects the investment thesis, valuation, or decision

**CONSISTENCY CHECKS**:
- Merge only items that overlap or cover the same topic
- Keep items that cover distinct aspects separate
- Ensure comprehensive coverage of different dimensions
- Maintain consistent format and depth across all items

**FORMAT STANDARDIZATION**:
- Use numbered format: "1. [item text] 2. [item text]" etc.
- Each item: 2-3 sentences with specific details
- Include specific examples, numbers, or metrics from CIM
- Connect to investment significance

Consolidate this list to exactly ${targetCount} items by:
- Merging similar or overlapping points
- Merging similar or overlapping points (only if they cover the same topic)
- Keeping the most important and specific items
- Maintaining 2-3 sentences per item with specific details
- Ensuring each item meets investment depth requirements

Return ONLY the new numbered list (format: 1. ... 2. ... etc.), nothing else.
Do not include any preamble or explanation.`;

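The prompt fragments above end with a strict output contract: a bare numbered list, no preamble. That contract is what makes the model's response machine-parseable. A minimal sketch of how such a prompt could be assembled and its response split back into items; `buildExpandPrompt` and `parseNumberedList` are illustrative names, not identifiers from this repository:

```typescript
// Sketch only: illustrates the expand-prompt contract shown in the diff above.

function buildExpandPrompt(
  currentItems: string[],
  targetCount: number,
  context: string
): string {
  const currentCount = currentItems.length;
  return [
    `Based on the CIM document context below, expand this list to exactly ${targetCount} items.`,
    `Add ${targetCount - currentCount} new items that fit the theme, meet quality criteria, and provide investment-grade insights.`,
    '',
    'Document Context:',
    context.substring(0, 4000), // cap the context window, as in the prompt above
    '',
    'Return ONLY the new numbered list (format: 1. ... 2. ... etc.), nothing else.',
    'Do not include any preamble or explanation.',
  ].join('\n');
}

// The "numbered list only" contract makes the response trivially parseable:
function parseNumberedList(text: string): string[] {
  return text
    .split(/(?:^|\s)\d+\.\s+/) // split on "1. ", "2. ", ... markers
    .map(s => s.trim())
    .filter(s => s.length > 0);
}
```

The hard cap at 4,000 characters mirrors the diff's move from `context.substring(0, 3000)` to `context.substring(0, 4000)`.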
606 backend/src/services/parallelDocumentProcessor.ts (Normal file)
@@ -0,0 +1,606 @@
import { logger } from '../utils/logger';
import { llmService } from './llmService';
import { CIMReview } from './llmSchemas';
import { financialExtractionMonitoringService } from './financialExtractionMonitoringService';
import { defaultCIMReview } from './unifiedDocumentProcessor';

// Use the same ProcessingResult interface as other processors
interface ProcessingResult {
  success: boolean;
  summary: string;
  analysisData: CIMReview;
  processingStrategy: 'parallel_sections' | 'simple_full_document' | 'document_ai_agentic_rag';
  processingTime: number;
  apiCalls: number;
  error: string | undefined;
}

interface SectionExtractionResult {
  section: string;
  success: boolean;
  data: Partial<CIMReview>;
  error?: string;
  apiCalls: number;
  processingTime: number;
}

/**
 * Parallel Document Processor
 *
 * Strategy: Extract independent sections in parallel to reduce processing time
 * - Financial extraction (already optimized with Haiku)
 * - Business description
 * - Market analysis
 * - Deal overview
 * - Management team
 * - Investment thesis
 *
 * Safety features:
 * - Rate limit risk checking before parallel execution
 * - Automatic fallback to sequential if risk is high
 * - API call tracking to prevent exceeding limits
 */
class ParallelDocumentProcessor {
  private readonly MAX_CONCURRENT_EXTRACTIONS = 2; // Limit parallel API calls (Anthropic has concurrent connection limits)
  private readonly RATE_LIMIT_RISK_THRESHOLD: 'low' | 'medium' | 'high' = 'medium'; // Fallback to sequential if risk >= medium

  /**
   * Process document with parallel section extraction
   */
  async processDocument(
    documentId: string,
    userId: string,
    text: string,
    options: any = {}
  ): Promise<ProcessingResult> {
    const startTime = Date.now();
    let totalApiCalls = 0;

    try {
      logger.info('Parallel processor: Starting', {
        documentId,
        textLength: text.length,
      });

      // Check rate limit risk before starting parallel processing
      const rateLimitRisk = await this.checkRateLimitRisk();

      if (rateLimitRisk === 'high') {
        logger.warn('High rate limit risk detected, falling back to sequential processing', {
          documentId,
          risk: rateLimitRisk,
        });
        // Fallback to simple processor
        const { simpleDocumentProcessor } = await import('./simpleDocumentProcessor');
        return await simpleDocumentProcessor.processDocument(documentId, userId, text, options);
      }

      // Extract sections in parallel
      const sections = await this.extractSectionsInParallel(documentId, userId, text, options);
      totalApiCalls = sections.reduce((sum, s) => sum + s.apiCalls, 0);

      // Merge all section results
      const analysisData = this.mergeSectionResults(sections);

      // Generate summary
      const summary = this.generateSummary(analysisData);

      const processingTime = Date.now() - startTime;

      logger.info('Parallel processor: Completed', {
        documentId,
        processingTime,
        apiCalls: totalApiCalls,
        sectionsExtracted: sections.filter(s => s.success).length,
        totalSections: sections.length,
      });

      return {
        success: true,
        summary,
        analysisData: analysisData as CIMReview,
        processingStrategy: 'parallel_sections',
        processingTime,
        apiCalls: totalApiCalls,
        error: undefined,
      };
    } catch (error) {
      const processingTime = Date.now() - startTime;
      logger.error('Parallel processor: Failed', {
        documentId,
        error: error instanceof Error ? error.message : String(error),
        processingTime,
      });

      return {
        success: false,
        summary: '',
        analysisData: defaultCIMReview,
        processingStrategy: 'parallel_sections',
        processingTime,
        apiCalls: totalApiCalls,
        error: error instanceof Error ? error.message : String(error),
      };
    }
  }

  /**
   * Check rate limit risk across all providers/models
   */
  private async checkRateLimitRisk(): Promise<'low' | 'medium' | 'high'> {
    try {
      // Check risk for common models
      const anthropicHaikuRisk = await financialExtractionMonitoringService.checkRateLimitRisk(
        'anthropic',
        'claude-3-5-haiku-latest'
      );
      const anthropicSonnetRisk = await financialExtractionMonitoringService.checkRateLimitRisk(
        'anthropic',
        'claude-sonnet-4-5-20250514'
      );

      // Return highest risk
      if (anthropicHaikuRisk === 'high' || anthropicSonnetRisk === 'high') {
        return 'high';
      } else if (anthropicHaikuRisk === 'medium' || anthropicSonnetRisk === 'medium') {
        return 'medium';
      } else {
        return 'low';
      }
    } catch (error) {
      logger.warn('Failed to check rate limit risk, defaulting to low', {
        error: error instanceof Error ? error.message : String(error),
      });
      return 'low'; // Default to low risk on error
    }
  }

  /**
   * Extract sections in parallel with concurrency control
   */
  private async extractSectionsInParallel(
    documentId: string,
    userId: string,
    text: string,
    options: any
  ): Promise<SectionExtractionResult[]> {
    const sections = [
      { name: 'financial', extractor: () => this.extractFinancialSection(documentId, userId, text, options) },
      { name: 'dealOverview', extractor: () => this.extractDealOverviewSection(documentId, text) },
      { name: 'businessDescription', extractor: () => this.extractBusinessDescriptionSection(documentId, text) },
      { name: 'marketAnalysis', extractor: () => this.extractMarketAnalysisSection(documentId, text) },
      { name: 'managementTeam', extractor: () => this.extractManagementTeamSection(documentId, text) },
      { name: 'investmentThesis', extractor: () => this.extractInvestmentThesisSection(documentId, text) },
    ];

    // Process sections in batches to respect concurrency limits
    const results: SectionExtractionResult[] = [];

    for (let i = 0; i < sections.length; i += this.MAX_CONCURRENT_EXTRACTIONS) {
      const batch = sections.slice(i, i + this.MAX_CONCURRENT_EXTRACTIONS);

      logger.info(`Processing batch ${Math.floor(i / this.MAX_CONCURRENT_EXTRACTIONS) + 1} of sections`, {
        documentId,
        batchSize: batch.length,
        sections: batch.map(s => s.name),
      });

      // Retry logic for concurrent connection limit errors
      let batchResults = await Promise.allSettled(
        batch.map(section => section.extractor())
      );

      // Check for concurrent connection limit errors and retry with sequential processing
      const hasConcurrentLimitError = batchResults.some(result =>
        result.status === 'rejected' &&
        result.reason instanceof Error &&
        (result.reason.message.includes('concurrent connections') ||
          result.reason.message.includes('429'))
      );

      if (hasConcurrentLimitError) {
        logger.warn('Concurrent connection limit hit, retrying batch sequentially', {
          documentId,
          batchSize: batch.length,
        });

        // Retry each section sequentially with delay
        batchResults = [];
        for (const section of batch) {
          try {
            const result = await section.extractor();
            batchResults.push({ status: 'fulfilled' as const, value: result });
            // Small delay between sequential calls
            await new Promise(resolve => setTimeout(resolve, 1000));
          } catch (error) {
            batchResults.push({
              status: 'rejected' as const,
              reason: error instanceof Error ? error : new Error(String(error))
            });
          }
        }
      }

      batchResults.forEach((result, index) => {
        if (result.status === 'fulfilled') {
          results.push(result.value);
        } else {
          logger.error(`Section extraction failed: ${batch[index].name}`, {
            documentId,
            error: result.reason,
          });
          results.push({
            section: batch[index].name,
            success: false,
            data: {},
            error: result.reason instanceof Error ? result.reason.message : String(result.reason),
            apiCalls: 0,
            processingTime: 0,
          });
        }
      });

      // Small delay between batches to respect rate limits
      if (i + this.MAX_CONCURRENT_EXTRACTIONS < sections.length) {
        await new Promise(resolve => setTimeout(resolve, 1000)); // Increased to 1s delay between batches
      }
    }

    return results;
  }

  /**
   * Extract financial section (already optimized with Haiku)
   */
  private async extractFinancialSection(
    documentId: string,
    userId: string,
    text: string,
    options: any
  ): Promise<SectionExtractionResult> {
    const startTime = Date.now();

    try {
      // Run deterministic parser first
      let deterministicFinancials: any = null;
      try {
        const { parseFinancialsFromText } = await import('./financialTableParser');
        const parsedFinancials = parseFinancialsFromText(text);
        const hasData = parsedFinancials.fy3?.revenue || parsedFinancials.fy2?.revenue ||
          parsedFinancials.fy1?.revenue || parsedFinancials.ltm?.revenue;
        if (hasData) {
          deterministicFinancials = parsedFinancials;
        }
      } catch (parserError) {
        logger.debug('Deterministic parser failed in parallel extraction', {
          error: parserError instanceof Error ? parserError.message : String(parserError),
        });
      }

      const financialResult = await llmService.processFinancialsOnly(
        text,
        deterministicFinancials || undefined
      );

      const processingTime = Date.now() - startTime;

      if (financialResult.success && financialResult.jsonOutput?.financialSummary) {
        return {
          section: 'financial',
          success: true,
          data: { financialSummary: financialResult.jsonOutput.financialSummary },
          apiCalls: 1,
          processingTime,
        };
      } else {
        return {
          section: 'financial',
          success: false,
          data: {},
          error: financialResult.error,
          apiCalls: 1,
          processingTime,
        };
      }
    } catch (error) {
      return {
        section: 'financial',
        success: false,
        data: {},
        error: error instanceof Error ? error.message : String(error),
        apiCalls: 0,
        processingTime: Date.now() - startTime,
      };
    }
  }

  /**
   * Extract deal overview section
   */
  private async extractDealOverviewSection(
    documentId: string,
    text: string
  ): Promise<SectionExtractionResult> {
    const startTime = Date.now();

    try {
      const result = await llmService.processCIMDocument(
        text,
        'BPCP CIM Review Template',
        undefined, // No existing analysis
        ['dealOverview'], // Focus only on deal overview fields
        'Extract only the deal overview information: company name, industry, geography, deal source, transaction type, dates, reviewers, page count, and reason for sale.'
      );

      const processingTime = Date.now() - startTime;

      if (result.success && result.jsonOutput?.dealOverview) {
        return {
          section: 'dealOverview',
          success: true,
          data: { dealOverview: result.jsonOutput.dealOverview },
          apiCalls: 1,
          processingTime,
        };
      } else {
        return {
          section: 'dealOverview',
          success: false,
          data: {},
          error: result.error,
          apiCalls: 1,
          processingTime,
        };
      }
    } catch (error) {
      return {
        section: 'dealOverview',
        success: false,
        data: {},
        error: error instanceof Error ? error.message : String(error),
        apiCalls: 0,
        processingTime: Date.now() - startTime,
      };
    }
  }

  /**
   * Extract business description section
   */
  private async extractBusinessDescriptionSection(
    documentId: string,
    text: string
  ): Promise<SectionExtractionResult> {
    const startTime = Date.now();

    try {
      const result = await llmService.processCIMDocument(
        text,
        'BPCP CIM Review Template',
        undefined,
        ['businessDescription'],
        'Extract only the business description: core operations, products/services, value proposition, customer base, and supplier information.'
      );

      const processingTime = Date.now() - startTime;

      if (result.success && result.jsonOutput?.businessDescription) {
        return {
          section: 'businessDescription',
          success: true,
          data: { businessDescription: result.jsonOutput.businessDescription },
          apiCalls: 1,
          processingTime,
        };
      } else {
        return {
          section: 'businessDescription',
          success: false,
          data: {},
          error: result.error,
          apiCalls: 1,
          processingTime,
        };
      }
    } catch (error) {
      return {
        section: 'businessDescription',
        success: false,
        data: {},
        error: error instanceof Error ? error.message : String(error),
        apiCalls: 0,
        processingTime: Date.now() - startTime,
      };
    }
  }

  /**
   * Extract market analysis section
   */
  private async extractMarketAnalysisSection(
    documentId: string,
    text: string
  ): Promise<SectionExtractionResult> {
    const startTime = Date.now();

    try {
      const result = await llmService.processCIMDocument(
        text,
        'BPCP CIM Review Template',
        undefined,
        ['marketIndustryAnalysis'],
        'Extract only the market and industry analysis: market size, growth rate, industry trends, competitive landscape, and barriers to entry.'
      );

      const processingTime = Date.now() - startTime;

      if (result.success && result.jsonOutput?.marketIndustryAnalysis) {
        return {
          section: 'marketAnalysis',
          success: true,
          data: { marketIndustryAnalysis: result.jsonOutput.marketIndustryAnalysis },
          apiCalls: 1,
          processingTime,
        };
      } else {
        return {
          section: 'marketAnalysis',
          success: false,
          data: {},
          error: result.error,
          apiCalls: 1,
          processingTime,
        };
      }
    } catch (error) {
      return {
        section: 'marketAnalysis',
        success: false,
        data: {},
        error: error instanceof Error ? error.message : String(error),
        apiCalls: 0,
        processingTime: Date.now() - startTime,
      };
    }
  }

  /**
   * Extract management team section
   */
  private async extractManagementTeamSection(
    documentId: string,
    text: string
  ): Promise<SectionExtractionResult> {
    const startTime = Date.now();

    try {
      const result = await llmService.processCIMDocument(
        text,
        'BPCP CIM Review Template',
        undefined,
        ['managementTeamOverview'],
        'Extract only the management team information: key leaders, quality assessment, post-transaction intentions, and organizational structure.'
      );

      const processingTime = Date.now() - startTime;

      if (result.success && result.jsonOutput?.managementTeamOverview) {
        return {
          section: 'managementTeam',
          success: true,
          data: { managementTeamOverview: result.jsonOutput.managementTeamOverview },
          apiCalls: 1,
          processingTime,
        };
      } else {
        return {
          section: 'managementTeam',
          success: false,
          data: {},
          error: result.error,
          apiCalls: 1,
          processingTime,
        };
      }
    } catch (error) {
      return {
        section: 'managementTeam',
        success: false,
        data: {},
        error: error instanceof Error ? error.message : String(error),
        apiCalls: 0,
        processingTime: Date.now() - startTime,
      };
    }
  }

  /**
   * Extract investment thesis section
   */
  private async extractInvestmentThesisSection(
    documentId: string,
    text: string
  ): Promise<SectionExtractionResult> {
    const startTime = Date.now();

    try {
      const result = await llmService.processCIMDocument(
        text,
        'BPCP CIM Review Template',
        undefined,
        ['preliminaryInvestmentThesis'],
        'Extract only the investment thesis: key attractions, potential risks, value creation levers, and alignment with BPCP fund strategy.'
      );

      const processingTime = Date.now() - startTime;

      if (result.success && result.jsonOutput?.preliminaryInvestmentThesis) {
        return {
          section: 'investmentThesis',
          success: true,
          data: { preliminaryInvestmentThesis: result.jsonOutput.preliminaryInvestmentThesis },
          apiCalls: 1,
          processingTime,
        };
      } else {
        return {
          section: 'investmentThesis',
          success: false,
          data: {},
          error: result.error,
          apiCalls: 1,
          processingTime,
        };
      }
    } catch (error) {
      return {
        section: 'investmentThesis',
        success: false,
        data: {},
        error: error instanceof Error ? error.message : String(error),
        apiCalls: 0,
        processingTime: Date.now() - startTime,
      };
    }
  }

  /**
   * Merge results from all sections
   */
  private mergeSectionResults(results: SectionExtractionResult[]): Partial<CIMReview> {
    const merged: Partial<CIMReview> = { ...defaultCIMReview };

    results.forEach(result => {
      if (result.success) {
        Object.assign(merged, result.data);
      }
    });

    return merged;
  }

  /**
   * Generate summary from analysis data
   */
  private generateSummary(data: Partial<CIMReview>): string {
    const parts: string[] = [];

    if (data.dealOverview?.targetCompanyName) {
      parts.push(`Target: ${data.dealOverview.targetCompanyName}`);
    }
    if (data.dealOverview?.industrySector) {
      parts.push(`Industry: ${data.dealOverview.industrySector}`);
    }
    if (data.financialSummary?.financials?.ltm?.revenue) {
      parts.push(`LTM Revenue: ${data.financialSummary.financials.ltm.revenue}`);
    }
    if (data.financialSummary?.financials?.ltm?.ebitda) {
      parts.push(`LTM EBITDA: ${data.financialSummary.financials.ltm.ebitda}`);
    }

    return parts.join(' | ') || 'CIM analysis completed';
  }
}

export const parallelDocumentProcessor = new ParallelDocumentProcessor();

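The batching loop in `extractSectionsInParallel` above reduces to a reusable pattern: run tasks a fixed number at a time with `Promise.allSettled`, and rerun a batch one task at a time when a rejection looks like a concurrency-limit error. A self-contained sketch of that pattern; the helper name and the 429 heuristic are illustrative, not code from this repository, and the inter-batch sleep from the production code is omitted:

```typescript
// Sketch of batch-limited parallelism with a sequential fallback on 429-style errors.
async function runInBatches<T>(
  tasks: Array<() => Promise<T>>,
  batchSize: number
): Promise<Array<PromiseSettledResult<Awaited<T>>>> {
  const results: Array<PromiseSettledResult<Awaited<T>>> = [];

  for (let i = 0; i < tasks.length; i += batchSize) {
    const batch = tasks.slice(i, i + batchSize);
    let settled = await Promise.allSettled(batch.map(t => t()));

    // If any rejection looks like a concurrent-connection limit,
    // fall back to running this batch one task at a time.
    const hitLimit = settled.some(
      r => r.status === 'rejected' && String(r.reason).includes('429')
    );
    if (hitLimit) {
      settled = [];
      for (const task of batch) {
        try {
          settled.push({ status: 'fulfilled', value: await task() });
        } catch (err) {
          settled.push({ status: 'rejected', reason: err });
        }
      }
    }
    results.push(...settled);
  }
  return results;
}
```

`Promise.allSettled` (rather than `Promise.all`) is what lets one failed section degrade gracefully instead of aborting the whole batch.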
@@ -5,6 +5,7 @@ import { llmService } from './llmService';
import { CIMReview } from './llmSchemas';
import { cimReviewSchema } from './llmSchemas';
import { defaultCIMReview } from './unifiedDocumentProcessor';
import { financialExtractionMonitoringService } from './financialExtractionMonitoringService';

interface ProcessingResult {
  success: boolean;

@@ -111,12 +112,14 @@ class SimpleDocumentProcessor {

    });

    let financialData: CIMReview['financialSummary'] | null = null;
    const financialExtractionStartTime = Date.now();
    try {
      const financialResult = await llmService.processFinancialsOnly(
        extractedText,
        deterministicFinancials || undefined
      );
      apiCalls += 1;
      const financialExtractionDuration = Date.now() - financialExtractionStartTime;

      if (financialResult.success && financialResult.jsonOutput?.financialSummary) {
        financialData = financialResult.jsonOutput.financialSummary;

@@ -124,13 +127,92 @@

          documentId,
          hasFinancials: !!financialData.financials
        });

        // Track successful financial extraction event
        const financials = financialData.financials;
        const periodsExtracted: string[] = [];
        const metricsExtractedSet = new Set<string>();

        if (financials) {
          ['fy3', 'fy2', 'fy1', 'ltm'].forEach(period => {
            const periodData = financials[period as keyof typeof financials];
            if (periodData) {
              // Check if period has any data
              const hasData = periodData.revenue || periodData.ebitda || periodData.grossProfit;
              if (hasData) {
                periodsExtracted.push(period);

                // Track which metrics are present
                if (periodData.revenue) metricsExtractedSet.add('revenue');
                if (periodData.revenueGrowth) metricsExtractedSet.add('revenueGrowth');
                if (periodData.grossProfit) metricsExtractedSet.add('grossProfit');
                if (periodData.grossMargin) metricsExtractedSet.add('grossMargin');
                if (periodData.ebitda) metricsExtractedSet.add('ebitda');
                if (periodData.ebitdaMargin) metricsExtractedSet.add('ebitdaMargin');
              }
            }
          });
        }

        // Determine extraction method
        const extractionMethod = deterministicFinancials
          ? 'deterministic_parser'
          : (financialResult.model?.includes('haiku') ? 'llm_haiku' : 'llm_sonnet');

        // Track extraction event (non-blocking)
        financialExtractionMonitoringService.trackExtractionEvent({
          documentId,
          userId,
          extractionMethod: extractionMethod as 'deterministic_parser' | 'llm_haiku' | 'llm_sonnet' | 'fallback',
          modelUsed: financialResult.model,
          success: true,
          hasFinancials: !!financials,
          periodsExtracted,
          metricsExtracted: Array.from(metricsExtractedSet),
          processingTimeMs: financialExtractionDuration,
          apiCallDurationMs: financialExtractionDuration, // Approximate
          tokensUsed: financialResult.inputTokens + financialResult.outputTokens,
          costEstimateUsd: financialResult.cost,
        }).catch(err => {
          logger.debug('Failed to track financial extraction event (non-critical)', { error: err.message });
        });
      } else {
        // Track failed financial extraction event
        const extractionMethod = deterministicFinancials
          ? 'deterministic_parser'
          : 'llm_haiku'; // Default assumption

        financialExtractionMonitoringService.trackExtractionEvent({
          documentId,
          userId,
          extractionMethod: extractionMethod as 'deterministic_parser' | 'llm_haiku' | 'llm_sonnet' | 'fallback',
          success: false,
          errorType: 'api_error',
          errorMessage: financialResult.error,
          processingTimeMs: Date.now() - financialExtractionStartTime,
        }).catch(err => {
          logger.debug('Failed to track financial extraction event (non-critical)', { error: err.message });
        });

        logger.warn('Financial extraction failed, will try in main extraction', {
          documentId,
          error: financialResult.error
        });
      }
    } catch (financialError) {
      // Track error event
      financialExtractionMonitoringService.trackExtractionEvent({
        documentId,
        userId,
        extractionMethod: deterministicFinancials ? 'deterministic_parser' : 'llm_haiku',
        success: false,
        errorType: 'api_error',
        errorMessage: financialError instanceof Error ? financialError.message : String(financialError),
        processingTimeMs: Date.now() - financialExtractionStartTime,
      }).catch(err => {
        logger.debug('Failed to track financial extraction event (non-critical)', { error: err.message });
      });

      logger.warn('Financial extraction threw error, will try in main extraction', {
        documentId,
        error: financialError instanceof Error ? financialError.message : String(financialError)

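The period/metric tally in the hunk above (which periods carried data, and which metrics appeared anywhere) can be isolated as a small pure function. A sketch with hypothetical types, not the repository's actual `CIMReview` shapes:

```typescript
// Illustrative types; the real financialSummary schema lives in llmSchemas.
interface PeriodFinancials {
  revenue?: string;
  ebitda?: string;
  grossProfit?: string;
}
type Financials = Partial<Record<'fy3' | 'fy2' | 'fy1' | 'ltm', PeriodFinancials>>;

// Tally which periods have any data, and which metrics appear in any period.
function tallyExtraction(financials: Financials) {
  const periodsExtracted: string[] = [];
  const metrics = new Set<string>();
  (['fy3', 'fy2', 'fy1', 'ltm'] as const).forEach(period => {
    const p = financials[period];
    if (p && (p.revenue || p.ebitda || p.grossProfit)) {
      periodsExtracted.push(period);
      if (p.revenue) metrics.add('revenue');
      if (p.grossProfit) metrics.add('grossProfit');
      if (p.ebitda) metrics.add('ebitda');
    }
  });
  return { periodsExtracted, metricsExtracted: Array.from(metrics) };
}
```

Keeping the tally pure makes the monitoring payload easy to test independently of the LLM call.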
@@ -559,32 +641,46 @@ Focus on finding these specific fields in the document. Extract exact values, nu
          periodData.revenue = 'Not specified in CIM';
        }

        // Check 2: Detect unusual growth patterns (suggests misaligned columns)
        // Find adjacent periods to check growth
        const periodOrder = ['fy3', 'fy2', 'fy1', 'ltm'];
        const currentIndex = periodOrder.indexOf(period);
        if (currentIndex > 0) {
          const prevPeriod = periodOrder[currentIndex - 1];
          const prevValue = extractNumericValue(financials[prevPeriod]?.revenue || '');
          if (prevValue !== null && prevValue > 0) {
            const growth = ((currentValue - prevValue) / prevValue) * 100;
            // Flag if growth is >200% or < -50% (unusual for year-over-year)
            if (growth > 200 || growth < -50) {
              logger.warn('Detected unusual revenue growth pattern - may indicate misaligned columns', {
                period,
                prevPeriod,
                currentValue: currentValue,
                prevValue: prevValue,
                growth: `${growth.toFixed(1)}%`,
                reason: `Unusual growth (${growth > 0 ? '+' : ''}${growth.toFixed(1)}%) between ${prevPeriod} and ${period} - may indicate column misalignment`
              });
              // Don't reject - just log as warning, as this might be legitimate
            }
        // Check 2: Revenue should generally increase or be stable (FY-1/LTM shouldn't be much lower than FY-2/FY-3)
        // Exception: If this is FY-3 and others are higher, that's normal
        if (period !== 'fy3' && currentValue < minOtherValue * 0.5 && currentValue < avgOtherValue * 0.6) {
          logger.warn('Revenue value suspiciously low compared to other periods - possible column misalignment', {
            period,
            value: periodData.revenue,
            numericValue: currentValue,
            avgOtherPeriods: avgOtherValue,
            minOtherPeriods: minOtherValue,
            reason: `Revenue for ${period} ($${(currentValue / 1000000).toFixed(1)}M) is <50% of minimum other period ($${(minOtherValue / 1000000).toFixed(1)}M) - may indicate column misalignment`
          });
          // Don't reject automatically, but flag for review - this often indicates wrong column
        }

        // Check 3: Detect unusual growth patterns (suggests misaligned columns)
        // Find adjacent periods to check growth
        const periodOrder = ['fy3', 'fy2', 'fy1', 'ltm'];
        const currentIndex = periodOrder.indexOf(period);
        if (currentIndex > 0) {
          const prevPeriod = periodOrder[currentIndex - 1];
          const prevValue = extractNumericValue(financials[prevPeriod]?.revenue || '');
          if (prevValue !== null && prevValue > 0) {
            const growth = ((currentValue - prevValue) / prevValue) * 100;
            // Flag if growth is >200% or < -50% (unusual for year-over-year)
            if (growth > 200 || growth < -50) {
              logger.warn('Detected unusual revenue growth pattern - may indicate misaligned columns', {
                period,
                prevPeriod,
                currentValue: currentValue,
                prevValue: prevValue,
                growth: `${growth.toFixed(1)}%`,
                reason: `Unusual growth (${growth > 0 ? '+' : ''}${growth.toFixed(1)}%) between ${prevPeriod} and ${period} - may indicate column misalignment`
              });
              // Don't reject - just log as warning, as this might be legitimate
            }
          }
        }
      }
    }
  }
}

// Validate EBITDA - should be reasonable
if (periodData.ebitda && periodData.ebitda !== 'Not specified in CIM') {

@@ -620,40 +716,83 @@ Focus on finding these specific fields in the document. Extract exact values, nu
const marginMatch = marginStr.match(/(-?\d+(?:\.\d+)?)/);
if (marginMatch) {
  const marginValue = parseFloat(marginMatch[1]);
  // Reject margins outside reasonable range (-10% to 60%)
  // Negative margins are possible but should be within reason
  if (marginValue < -10 || marginValue > 60) {
    logger.warn('Rejecting invalid EBITDA margin', {
      period,
      value: marginStr,
      numericValue: marginValue,
      reason: `Margin (${marginValue}%) outside reasonable range (-10% to 60%)`
    });
    periodData.ebitdaMargin = 'Not specified in CIM';
  } else {
    // Cross-validate: Check margin consistency with revenue and EBITDA
    const revValue = extractNumericValue(periodData.revenue || '');
    const ebitdaValue = extractNumericValue(periodData.ebitda || '');
    if (revValue !== null && ebitdaValue !== null && revValue > 0) {
      const calculatedMargin = (ebitdaValue / revValue) * 100;
      const marginDiff = Math.abs(calculatedMargin - marginValue);
      // If margin difference is > 10 percentage points, flag it
      if (marginDiff > 10) {
        logger.warn('EBITDA margin mismatch detected', {

  // First, try to calculate margin from revenue and EBITDA to validate
  const revValue = extractNumericValue(periodData.revenue || '');
  const ebitdaValue = extractNumericValue(periodData.ebitda || '');

  if (revValue !== null && ebitdaValue !== null && revValue > 0) {
    const calculatedMargin = (ebitdaValue / revValue) * 100;
    const marginDiff = Math.abs(calculatedMargin - marginValue);

    // If margin difference is > 15 percentage points, auto-correct it
    // This catches cases like 95% when it should be 22%, or 15% when it should be 75%
    if (marginDiff > 15) {
      logger.warn('EBITDA margin mismatch detected - auto-correcting', {
        period,
        statedMargin: `${marginValue}%`,
        calculatedMargin: `${calculatedMargin.toFixed(1)}%`,
        difference: `${marginDiff.toFixed(1)}pp`,
        revenue: periodData.revenue,
        ebitda: periodData.ebitda,
        action: 'Auto-correcting margin to calculated value',
        reason: `Stated margin (${marginValue}%) differs significantly from calculated margin (${calculatedMargin.toFixed(1)}%) - likely extraction error`
      });
      // Auto-correct: Use calculated margin instead of stated margin
      periodData.ebitdaMargin = `${calculatedMargin.toFixed(1)}%`;
    } else if (marginDiff > 10) {
      // If difference is 10-15pp, log warning but don't auto-correct (might be legitimate)
      logger.warn('EBITDA margin mismatch detected', {
        period,
        statedMargin: `${marginValue}%`,
        calculatedMargin: `${calculatedMargin.toFixed(1)}%`,
        difference: `${marginDiff.toFixed(1)}pp`,
        revenue: periodData.revenue,
        ebitda: periodData.ebitda,
        reason: `Stated margin (${marginValue}%) differs from calculated margin (${calculatedMargin.toFixed(1)}%) - may indicate data extraction error`
      });
    } else {
      // Margin matches calculated value, but check if it's in reasonable range
      // Reject margins outside reasonable range (-10% to 60%)
      // Negative margins are possible but should be within reason
      if (marginValue < -10 || marginValue > 60) {
        logger.warn('EBITDA margin outside reasonable range - using calculated value', {
          period,
          statedMargin: `${marginValue}%`,
          value: marginStr,
          numericValue: marginValue,
          calculatedMargin: `${calculatedMargin.toFixed(1)}%`,
          difference: `${marginDiff.toFixed(1)}pp`,
          revenue: periodData.revenue,
          ebitda: periodData.ebitda,
          reason: `Stated margin (${marginValue}%) differs significantly from calculated margin (${calculatedMargin.toFixed(1)}%) - may indicate data extraction error`
          reason: `Stated margin (${marginValue}%) outside reasonable range (-10% to 60%), but calculated margin (${calculatedMargin.toFixed(1)}%) is valid - using calculated`
        });
        // Don't reject - just log as warning
        // Use calculated margin if it's in reasonable range
        if (calculatedMargin >= -10 && calculatedMargin <= 60) {
          periodData.ebitdaMargin = `${calculatedMargin.toFixed(1)}%`;
        } else {
          periodData.ebitdaMargin = 'Not specified in CIM';
        }
      }
    }
  } else {
    // Can't calculate margin, so just check if stated margin is in reasonable range
|
||||
if (marginValue < -10 || marginValue > 60) {
|
||||
logger.warn('Rejecting invalid EBITDA margin', {
|
||||
period,
|
||||
value: marginStr,
|
||||
numericValue: marginValue,
|
||||
reason: `Margin (${marginValue}%) outside reasonable range (-10% to 60%)`
|
||||
});
|
||||
periodData.ebitdaMargin = 'Not specified in CIM';
|
||||
}
|
||||
}
|
||||
|
||||
// Check margin consistency across periods (margins should be relatively stable)
|
||||
if (periodData.ebitdaMargin && periodData.ebitdaMargin !== 'Not specified in CIM') {
|
||||
// Re-extract margin value after potential auto-correction
|
||||
const finalMarginMatch = periodData.ebitdaMargin.match(/(-?\d+(?:\.\d+)?)/);
|
||||
const finalMarginValue = finalMarginMatch ? parseFloat(finalMarginMatch[1]) : marginValue;
|
||||
|
||||
// Check margin consistency across periods (margins should be relatively stable)
|
||||
const otherMargins = otherPeriods
|
||||
// Get other periods for cross-period validation
|
||||
const otherPeriodsForMargin = periods.filter(p => p !== period && financials[p]?.ebitdaMargin);
|
||||
const otherMargins = otherPeriodsForMargin
|
||||
.map(p => {
|
||||
const margin = financials[p]?.ebitdaMargin;
|
||||
if (!margin || margin === 'Not specified in CIM') return null;
|
||||
@@ -664,15 +803,15 @@ Focus on finding these specific fields in the document. Extract exact values, nu
|
||||
|
||||
if (otherMargins.length > 0) {
|
||||
const avgOtherMargin = otherMargins.reduce((a, b) => a + b, 0) / otherMargins.length;
|
||||
const marginDiff = Math.abs(marginValue - avgOtherMargin);
|
||||
const marginDiff = Math.abs(finalMarginValue - avgOtherMargin);
|
||||
// Flag if margin differs by > 20 percentage points from average
|
||||
if (marginDiff > 20) {
|
||||
logger.warn('EBITDA margin inconsistency across periods', {
|
||||
period,
|
||||
margin: `${marginValue}%`,
|
||||
margin: `${finalMarginValue}%`,
|
||||
avgOtherPeriods: `${avgOtherMargin.toFixed(1)}%`,
|
||||
difference: `${marginDiff.toFixed(1)}pp`,
|
||||
reason: `Margin for ${period} (${marginValue}%) differs significantly from average of other periods (${avgOtherMargin.toFixed(1)}%) - may indicate extraction error`
|
||||
reason: `Margin for ${period} (${finalMarginValue}%) differs significantly from average of other periods (${avgOtherMargin.toFixed(1)}%) - may indicate extraction error`
|
||||
});
|
||||
// Don't reject - just log as warning
|
||||
}
|
||||
|
||||
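The threshold logic above (auto-correct beyond 15pp of disagreement, warn between 10pp and 15pp, accept otherwise) is easy to factor into a pure, testable helper. A minimal sketch; `classifyMargin` is an illustrative name, not part of the codebase:

```typescript
type MarginAction = 'accept' | 'warn' | 'auto-correct';

// Mirrors the validation rules: >15pp difference between the stated margin and
// the margin recomputed from revenue/EBITDA auto-corrects, 10-15pp only warns.
function classifyMargin(statedPct: number, revenue: number, ebitda: number): MarginAction {
  const calculated = (ebitda / revenue) * 100;
  const diff = Math.abs(calculated - statedPct);
  if (diff > 15) return 'auto-correct';
  if (diff > 10) return 'warn';
  return 'accept';
}
```

Extracting the decision this way lets the thresholds be unit-tested without mocking the logger or the extraction pipeline.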
**backend/src/types/document.ts** (new file, 54 lines)
```typescript
/**
 * Shared types for document-related operations
 */

/**
 * Document status types
 */
export type DocumentStatus =
  | 'pending'
  | 'uploading'
  | 'processing'
  | 'completed'
  | 'failed'
  | 'cancelled';

/**
 * Document metadata
 */
export interface DocumentMetadata {
  id: string;
  userId: string;
  fileName: string;
  fileSize: number;
  mimeType: string;
  status: DocumentStatus;
  createdAt: Date;
  updatedAt: Date;
  processingStartedAt?: Date;
  processingCompletedAt?: Date;
  error?: string;
}

/**
 * Document upload options
 */
export interface DocumentUploadOptions {
  fileName: string;
  mimeType: string;
  fileSize: number;
  userId: string;
}

/**
 * Document processing metadata
 */
export interface DocumentProcessingMetadata {
  documentId: string;
  userId: string;
  strategy: string;
  processingTime?: number;
  apiCalls?: number;
  error?: string;
}
```
**backend/src/types/job.ts** (new file, 60 lines)
```typescript
/**
 * Shared types for job processing
 */

/**
 * Job status types
 */
export type JobStatus =
  | 'pending'
  | 'processing'
  | 'completed'
  | 'failed'
  | 'cancelled';

/**
 * Job priority levels
 */
export type JobPriority = 'low' | 'normal' | 'high' | 'urgent';

/**
 * Processing job interface
 */
export interface ProcessingJob {
  id: string;
  documentId: string;
  userId: string;
  status: JobStatus;
  priority: JobPriority;
  createdAt: Date;
  updatedAt: Date;
  startedAt?: Date;
  completedAt?: Date;
  error?: string;
  retryCount: number;
  maxRetries: number;
  metadata?: Record<string, any>;
}

/**
 * Job queue configuration
 */
export interface JobQueueConfig {
  maxConcurrentJobs: number;
  retryDelay: number;
  maxRetries: number;
  timeout: number;
}

/**
 * Job processing result
 */
export interface JobProcessingResult {
  success: boolean;
  jobsProcessed: number;
  jobsCompleted: number;
  jobsFailed: number;
  processingTime: number;
  errors?: string[];
}
```
**backend/src/types/llm.ts** (new file, 56 lines)
```typescript
/**
 * Shared types for LLM services
 */

import { CIMReview, cimReviewSchema } from '../services/llmSchemas';
import { z } from 'zod';

/**
 * LLM request interface
 */
export interface LLMRequest {
  prompt: string;
  systemPrompt?: string;
  maxTokens?: number;
  temperature?: number;
  model?: string;
}

/**
 * LLM response interface
 */
export interface LLMResponse {
  success: boolean;
  content: string;
  usage?: {
    promptTokens: number;
    completionTokens: number;
    totalTokens: number;
  };
  error?: string;
}

/**
 * CIM analysis result from LLM processing
 */
export interface CIMAnalysisResult {
  success: boolean;
  jsonOutput?: CIMReview;
  error?: string;
  model: string;
  cost: number;
  inputTokens: number;
  outputTokens: number;
  validationIssues?: z.ZodIssue[];
}

/**
 * LLM provider types
 */
export type LLMProvider = 'anthropic' | 'openai' | 'openrouter';

/**
 * LLM endpoint types for tracking
 */
export type LLMEndpoint = 'financial_extraction' | 'full_extraction' | 'other';
```
**backend/src/types/processing.ts** (new file, 63 lines)
```typescript
/**
 * Shared types for document processing
 */

import { CIMReview } from '../services/llmSchemas';

/**
 * Processing strategy types
 */
export type ProcessingStrategy =
  | 'document_ai_agentic_rag'
  | 'simple_full_document'
  | 'parallel_sections'
  | 'document_ai_multi_pass_rag';

/**
 * Standard processing result for document processors
 */
export interface ProcessingResult {
  success: boolean;
  summary: string;
  analysisData: CIMReview;
  processingStrategy: ProcessingStrategy;
  processingTime: number;
  apiCalls: number;
  error?: string;
}

/**
 * Extended processing result for RAG processors with chunk information
 */
export interface RAGProcessingResult extends ProcessingResult {
  totalChunks?: number;
  processedChunks?: number;
  averageChunkSize?: number;
  memoryUsage?: number;
}

/**
 * Processing options for document processors
 */
export interface ProcessingOptions {
  strategy?: ProcessingStrategy;
  fileBuffer?: Buffer;
  fileName?: string;
  mimeType?: string;
  enableSemanticChunking?: boolean;
  enableMetadataEnrichment?: boolean;
  similarityThreshold?: number;
  structuredTables?: any[];
  [key: string]: any; // Allow additional options
}

/**
 * Document AI processing result
 */
export interface DocumentAIProcessingResult {
  success: boolean;
  content: string;
  metadata?: any;
  error?: string;
}
```
**backend/src/utils/errorHandlers.ts** (new file, 204 lines)
```typescript
/**
 * Common Error Handling Utilities
 * Shared error handling patterns used across services
 */

import { logger } from './logger';

/**
 * Extract error message from any error type
 */
export function extractErrorMessage(error: unknown): string {
  if (error instanceof Error) {
    return error.message;
  }
  if (typeof error === 'string') {
    return error;
  }
  if (error && typeof error === 'object') {
    const errorObj = error as Record<string, any>;
    return errorObj.message || errorObj.error || String(error);
  }
  return String(error);
}

/**
 * Extract error stack trace
 */
export function extractErrorStack(error: unknown): string | undefined {
  if (error instanceof Error) {
    return error.stack;
  }
  return undefined;
}

/**
 * Extract detailed error information for logging
 */
export function extractErrorDetails(error: unknown): {
  name?: string;
  message: string;
  stack?: string;
  type: string;
  value?: any;
} {
  if (error instanceof Error) {
    return {
      name: error.name,
      message: error.message,
      stack: error.stack,
      type: 'Error',
    };
  }

  return {
    message: extractErrorMessage(error),
    type: typeof error,
    value: error,
  };
}

/**
 * Check if error is a timeout error
 */
export function isTimeoutError(error: unknown): boolean {
  const message = extractErrorMessage(error);
  return message.toLowerCase().includes('timeout') ||
    message.toLowerCase().includes('timed out') ||
    message.toLowerCase().includes('exceeded');
}

/**
 * Check if error is a rate limit error
 */
export function isRateLimitError(error: unknown): boolean {
  if (error && typeof error === 'object') {
    const errorObj = error as Record<string, any>;
    return errorObj.status === 429 ||
      errorObj.code === 429 ||
      errorObj.error?.type === 'rate_limit_error' ||
      extractErrorMessage(error).toLowerCase().includes('rate limit');
  }
  return false;
}

/**
 * Check if error is retryable
 */
export function isRetryableError(error: unknown): boolean {
  // Timeout errors are retryable
  if (isTimeoutError(error)) {
    return true;
  }

  // Rate limit errors are retryable (with backoff)
  if (isRateLimitError(error)) {
    return true;
  }

  // Network/connection errors are retryable
  const message = extractErrorMessage(error).toLowerCase();
  if (message.includes('network') ||
    message.includes('connection') ||
    message.includes('econnrefused') ||
    message.includes('etimedout')) {
    return true;
  }

  // 5xx server errors are retryable
  if (error && typeof error === 'object') {
    const errorObj = error as Record<string, any>;
    const status = errorObj.status || errorObj.statusCode;
    if (status && status >= 500 && status < 600) {
      return true;
    }
  }

  return false;
}

/**
 * Extract retry delay from rate limit error
 */
export function extractRetryAfter(error: unknown): number {
  if (error && typeof error === 'object') {
    const errorObj = error as Record<string, any>;
    const retryAfter = errorObj.headers?.['retry-after'] ||
      errorObj.error?.retry_after ||
      errorObj.retryAfter;
    if (retryAfter) {
      return typeof retryAfter === 'number' ? retryAfter : parseInt(retryAfter, 10);
    }
  }
  return 60; // Default 60 seconds
}

/**
 * Log error with structured context
 */
export function logErrorWithContext(
  error: unknown,
  context: Record<string, any>,
  level: 'error' | 'warn' | 'info' = 'error'
): void {
  const errorMessage = extractErrorMessage(error);
  const errorStack = extractErrorStack(error);
  const errorDetails = extractErrorDetails(error);

  const logData = {
    ...context,
    error: {
      message: errorMessage,
      stack: errorStack,
      details: errorDetails,
      isRetryable: isRetryableError(error),
      isTimeout: isTimeoutError(error),
      isRateLimit: isRateLimitError(error),
    },
    timestamp: new Date().toISOString(),
  };

  if (level === 'error') {
    logger.error('Error occurred', logData);
  } else if (level === 'warn') {
    logger.warn('Warning occurred', logData);
  } else {
    logger.info('Info', logData);
  }
}

/**
 * Create a standardized error object
 */
export function createStandardError(
  message: string,
  code?: string,
  statusCode?: number,
  retryable?: boolean
): Error & { code?: string; statusCode?: number; retryable?: boolean } {
  const error = new Error(message) as Error & { code?: string; statusCode?: number; retryable?: boolean };
  if (code) error.code = code;
  if (statusCode) error.statusCode = statusCode;
  if (retryable !== undefined) error.retryable = retryable;
  return error;
}

/**
 * Wrap async function with error handling
 */
export async function withErrorHandling<T>(
  fn: () => Promise<T>,
  context: Record<string, any>,
  onError?: (error: unknown) => void
): Promise<T> {
  try {
    return await fn();
  } catch (error) {
    logErrorWithContext(error, context);
    if (onError) {
      onError(error);
    }
    throw error;
  }
}
```
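A typical consumer of predicates like `isRetryableError` is a retry loop with exponential backoff. A minimal sketch with the retry predicate injected as a parameter so the example stands alone; `withRetry` is an illustrative name, not part of the file above:

```typescript
// Retry an async operation, backing off exponentially between attempts.
// `isRetryable` decides which failures are worth retrying (e.g. timeouts, 5xx).
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxRetries = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up once retries are exhausted or the error is not retryable.
      if (attempt >= maxRetries || !isRetryable(error)) throw error;
      // Exponential backoff: base, 2x base, 4x base, ...
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

For rate-limit errors, the backoff delay could instead be taken from `extractRetryAfter` so the server's `retry-after` hint is honored.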
**docs/PROMPT_ENGINEERING_ANALYSIS.md** (new file, 603 lines)
# Prompt Engineering Deep Dive Analysis
## CIM Document Processing System

**Analysis Date**: 2025-01-XX
**Analyst**: AI Prompt Engineering Specialist
**Scope**: 12 prompt constructs across 2 core service files

---

## Executive Summary

This analysis identifies **15 specific, actionable recommendations** to optimize AI prompts across the CIM processing system. Recommendations are prioritized by success likelihood and implementation difficulty, targeting improvements in:

1. **Financial Accuracy**: 100% correctness of extracted financial data
2. **Data Completeness**: All key data extracted (management, customers, KPIs)
3. **Format & Readability**: Correct, easy-to-read, programmatically parsable JSON
4. **Processing Efficiency**: Fastest possible processing and response
5. **Insight Quality**: World-class PE investor-level analysis

**Key Findings**:
- Strong foundation with comprehensive validation frameworks
- Opportunities for enhanced few-shot examples and cross-validation
- Potential for improved RAG query specificity and dynamic instruction generation
- Need for more structured PE investment framework integration

---

## Analysis Methodology

Each prompt construct was evaluated against:
- Current effectiveness for the 5 objectives
- Specific weaknesses and gaps
- Improvement opportunities, weighing success likelihood against implementation difficulty
- Recommended enhancements prioritized by impact

---

## Detailed Recommendations

### QUICK WINS (High Success, Low Difficulty)

#### Recommendation 1: Add Financial Table Detection Examples with Edge Cases
**Location**: `buildFinancialPrompt` (`llmService.ts:2460-2845`)

**Current State**:
- Has 10 few-shot examples covering various formats
- Missing examples for: multi-table scenarios, conflicting data, partial tables, merged cells

**Proposed Improvement**:
Add 3-5 additional few-shot examples covering:
- Multiple tables with conflicting values (how to identify the PRIMARY table)
- Tables with merged cells or irregular formatting
- Partial tables (only 2-3 periods available)
- Tables with footnotes containing critical adjustments
- Pro forma vs historical side-by-side comparison

**Success Likelihood**: **High** (90%)
**Implementation Difficulty**: **Low** (2-3 hours)
**Expected Impact**:
- Financial Accuracy: +5-8% (better handling of edge cases)
- Data Completeness: +2-3% (fewer "Not specified" results for valid data)
- Format & Readability: Neutral
- Processing Efficiency: Neutral
- Insight Quality: Neutral

**Code Reference**: Lines 2721-2792 in `llmService.ts`

---

#### Recommendation 2: Enhance JSON Template with Inline Validation Hints
**Location**: `buildCIMPrompt` (`llmService.ts:1069-1361`)

**Current State**:
- The JSON template has format comments but lacks validation hints
- No examples of correct vs incorrect values

**Proposed Improvement**:
Add inline validation examples to the JSON template:
```json
"revenue": "Revenue amount for FY-3", // Format: "$XX.XM" (e.g., "$64.2M"). Must be $10M+ for target companies. If <$10M, likely wrong table.
"revenueGrowth": "N/A (baseline year)", // Format: "XX.X%" or "N/A". Calculate if not provided: ((FY-2 - FY-3) / FY-3) * 100
```

**Success Likelihood**: **High** (85%)
**Implementation Difficulty**: **Low** (1-2 hours)
**Expected Impact**:
- Financial Accuracy: +3-5% (clearer format expectations)
- Format & Readability: +5-7% (better format consistency)
- Data Completeness: +1-2%
- Processing Efficiency: Neutral
- Insight Quality: Neutral

**Code Reference**: Lines 1069-1361 in `llmService.ts`

---

#### Recommendation 3: Add Explicit Format Standardization Examples
**Location**: `buildCIMPrompt` (`llmService.ts:1395-1410`)

**Current State**:
- Format requirements are stated but lack concrete examples
- No examples of incorrect formats to avoid

**Proposed Improvement**:
Add "DO/DON'T" format examples:
```
**Currency Values**:
✓ CORRECT: "$64.2M", "$1.2B", "$20.5M" (from thousands)
✗ INCORRECT: "$64,200,000", "$64M revenue", "64.2 million"

**Percentages**:
✓ CORRECT: "29.3%", "(4.4)%" (negative)
✗ INCORRECT: "29.3 percent", "29.3", "-4.4%"
```

**Success Likelihood**: **High** (88%)
**Implementation Difficulty**: **Low** (1 hour)
**Expected Impact**:
- Format & Readability: +8-10% (dramatically better format consistency)
- Financial Accuracy: +2-3% (fewer parsing errors)
- Processing Efficiency: +2-3% (less post-processing needed)
- Data Completeness: Neutral
- Insight Quality: Neutral

**Code Reference**: Lines 1395-1410 in `llmService.ts`

---

#### Recommendation 4: Enhance Cross-Table Validation Instructions
**Location**: `buildFinancialPrompt` (`llmService.ts:2562-2584`)

**Current State**:
- Cross-table validation is mentioned but lacks a step-by-step process
- No specific validation rules for discrepancies

**Proposed Improvement**:
Add a structured cross-validation workflow:
```
**Step 5: Cross-Table Validation (CRITICAL)**
1. Extract from the PRIMARY table first
2. Check the executive summary for key metrics (revenue, EBITDA)
3. If a discrepancy is >10%, investigate:
   - Is the executive summary using adjusted/pro forma numbers?
   - Is the PRIMARY table using different period definitions?
   - Which source is more authoritative? (Usually the detailed table)
4. Document any discrepancies in the qualityOfEarnings field
5. Use the PRIMARY table as the authoritative source unless the executive summary explicitly states adjustments
```

**Success Likelihood**: **High** (87%)
**Implementation Difficulty**: **Low** (2 hours)
**Expected Impact**:
- Financial Accuracy: +6-9% (better handling of discrepancies)
- Data Completeness: +2-3% (captures adjustments)
- Format & Readability: Neutral
- Processing Efficiency: Neutral
- Insight Quality: +1-2% (better quality-of-earnings notes)

**Code Reference**: Lines 2562-2584 in `llmService.ts`

---

### HIGH-IMPACT IMPROVEMENTS (High Success, Medium Difficulty)

#### Recommendation 5: Implement Multi-Pass Financial Validation
**Location**: `extractPass1CombinedMetadataFinancial` + new validation method

**Current State**:
- Financial extraction happens in a single pass
- Validation occurs within the prompt but not systematically

**Proposed Improvement**:
Add a post-extraction validation pass:
1. After Pass 1 extraction, run a validation check
2. If validation fails (magnitude, trends, calculations), trigger a targeted re-extraction
3. Use a focused prompt asking the LLM to re-check specific periods/metrics
4. Compare results and flag discrepancies
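The validate-then-re-extract loop can be sketched as a pure check over the Pass 1 output. The `PeriodFinancials` shape and the `validate` function below are illustrative stand-ins, not the actual codebase API; the real extraction output is string-formatted and richer:

```typescript
// Hypothetical, simplified extraction result for one period.
interface PeriodFinancials { revenue: number; ebitda: number; ebitdaMargin: number; }

interface ValidationIssue { period: string; field: string; detail: string; }

// Systematic post-extraction checks; any returned issue would trigger a
// focused re-extraction prompt for that period/field.
function validate(financials: Record<string, PeriodFinancials>): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (const [period, p] of Object.entries(financials)) {
    if (p.revenue <= 0) {
      issues.push({ period, field: 'revenue', detail: 'non-positive revenue' });
      continue;
    }
    const calc = (p.ebitda / p.revenue) * 100;
    if (Math.abs(calc - p.ebitdaMargin) > 10) {
      issues.push({
        period,
        field: 'ebitdaMargin',
        detail: `stated ${p.ebitdaMargin}% vs calculated ${calc.toFixed(1)}%`,
      });
    }
  }
  return issues;
}
```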
**Success Likelihood**: **High** (82%)
**Implementation Difficulty**: **Medium** (6-8 hours)
**Expected Impact**:
- Financial Accuracy: +10-15% (catches errors before final output)
- Data Completeness: +3-5% (fills gaps found during validation)
- Format & Readability: +2-3%
- Processing Efficiency: -5-8% (the additional pass adds time)
- Insight Quality: +2-3%

**Code Reference**: New method needed in `optimizedAgenticRAGProcessor.ts` after line 1562

---

#### Recommendation 6: Enhance RAG Query with Field-Specific Semantic Boosts
**Location**: `createCIMAnalysisQuery` (`optimizedAgenticRAGProcessor.ts:634-678`)

**Current State**:
- Priority weighting exists but is generic
- Semantic specificity is good but could be more targeted

**Proposed Improvement**:
Add field-specific semantic boost patterns:
```
**FINANCIAL DATA SEMANTIC BOOSTS** (Weight: 15/10 for financial chunks):
- Boost: "historical financial performance table", "income statement", "P&L statement"
- Boost: "revenue for FY-3 FY-2 FY-1 LTM", "EBITDA margin percentage"
- Boost: "trailing twelve months", "fiscal year end", "last twelve months"
- Penalize: "projected", "forecast", "budget", "plan" (unless explicitly historical)

**MARKET DATA SEMANTIC BOOSTS** (Weight: 12/10 for market chunks):
- Boost: "total addressable market TAM", "serviceable addressable market SAM"
- Boost: "market share percentage", "competitive positioning", "market leader"
- Boost: "compound annual growth rate CAGR", "market growth rate"
```

**Success Likelihood**: **High** (80%)
**Implementation Difficulty**: **Medium** (4-5 hours)
**Expected Impact**:
- Data Completeness: +5-8% (better chunk retrieval)
- Financial Accuracy: +3-5% (more relevant context)
- Processing Efficiency: +3-5% (fewer irrelevant chunks)
- Format & Readability: Neutral
- Insight Quality: +2-4% (better context for analysis)

**Code Reference**: Lines 634-678 in `optimizedAgenticRAGProcessor.ts`

---

#### Recommendation 7: Add Dynamic Few-Shot Example Selection
**Location**: `buildCIMPrompt` + new helper method

**Current State**:
- Fixed set of 10 financial examples
- Examples don't adapt to document characteristics

**Proposed Improvement**:
Create dynamic example selection based on detected document characteristics:
- If the document's fiscal year end differs from the calendar year: include fiscal year examples
- If the document uses thousands format: include conversion examples
- If the document has only 2-3 periods: include partial period examples
- If the document has pro forma tables: include pro forma vs historical examples
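The selection helper can be a simple trait-to-example lookup. A minimal sketch under stated assumptions: the `DocTraits` flags, the example strings, and `selectExamples` are all hypothetical names; real detection and example text live in the prompt-building code:

```typescript
// Hypothetical document-characteristic flags; detection happens upstream.
interface DocTraits {
  fiscalYearEnd: boolean;
  thousandsFormat: boolean;
  fewPeriods: boolean;
  proForma: boolean;
}

// Placeholder few-shot snippets keyed by the trait they cover.
const EXAMPLES: Record<keyof DocTraits, string> = {
  fiscalYearEnd: 'Example: table with FYE June 30 columns ...',
  thousandsFormat: 'Example: values in $000s converted to $M ...',
  fewPeriods: 'Example: only FY-1 and LTM available ...',
  proForma: 'Example: pro forma vs historical columns side by side ...',
};

// Include only the few-shot examples matching the detected traits,
// keeping the prompt short for documents that don't need them.
function selectExamples(traits: DocTraits): string[] {
  return (Object.keys(EXAMPLES) as (keyof DocTraits)[])
    .filter(key => traits[key])
    .map(key => EXAMPLES[key]);
}
```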
**Success Likelihood**: **High** (78%)
**Implementation Difficulty**: **Medium** (5-6 hours)
**Expected Impact**:
- Financial Accuracy: +8-12% (examples match the document format)
- Data Completeness: +3-5%
- Format & Readability: +2-3%
- Processing Efficiency: Neutral (selection is fast)
- Insight Quality: Neutral

**Code Reference**: New helper method in `llmService.ts`; modify `buildCIMPrompt` around line 1226

---

#### Recommendation 8: Enhance Gap-Filling Query with Field-Specific Inference Rules
**Location**: `createGapFillingQuery` (`optimizedAgenticRAGProcessor.ts:2626-2750`)

**Current State**:
- Has inference rules, but they are generic
- Missing field-specific calculation rules

**Proposed Improvement**:
Add comprehensive field-specific inference rules:
```
**FINANCIAL FIELD INFERENCE RULES**:
- revenueGrowth: If revenue for 2 periods is available, calculate: ((Current - Prior) / Prior) * 100
- ebitdaMargin: If revenue and EBITDA are available, calculate: (EBITDA / Revenue) * 100
- grossMargin: If revenue and grossProfit are available, calculate: (Gross Profit / Revenue) * 100
- CAGR: If multiple periods are available, calculate: ((End/Start)^(1/Periods) - 1) * 100

**MARKET FIELD INFERENCE RULES**:
- Market share: If TAM and company revenue are available, calculate: (Revenue / TAM) * 100
- Market growth: If TAM for 2 periods is available, calculate the growth rate

**BUSINESS FIELD INFERENCE RULES**:
- Customer concentration: If top customers are mentioned, sum their percentages
- Recurring revenue %: If MRR/ARR and total revenue are available, calculate the percentage
```
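The financial inference formulas above translate directly into small pure functions, which also makes them usable for validating LLM-stated values. A sketch; function names are illustrative, and formatting the results as "XX.X%" strings is left to the caller:

```typescript
// ((Current - Prior) / Prior) * 100
const revenueGrowth = (current: number, prior: number): number =>
  ((current - prior) / prior) * 100;

// (EBITDA / Revenue) * 100
const ebitdaMargin = (ebitda: number, revenue: number): number =>
  (ebitda / revenue) * 100;

// ((End / Start) ^ (1 / Periods) - 1) * 100, where Periods counts intervals, not data points.
const cagr = (start: number, end: number, periods: number): number =>
  (Math.pow(end / start, 1 / periods) - 1) * 100;
```

For example, revenue of $100M growing to $121M over two years gives a CAGR of 10%.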
**Success Likelihood**: **High** (85%)
**Implementation Difficulty**: **Medium** (4-5 hours)
**Expected Impact**:
- Data Completeness: +8-12% (calculates missing derived fields)
- Financial Accuracy: +3-5% (validates through calculation)
- Format & Readability: +1-2%
- Processing Efficiency: Neutral
- Insight Quality: +2-3%

**Code Reference**: Lines 2725-2734 in `optimizedAgenticRAGProcessor.ts`

---

#### Recommendation 9: Add PE Investment Framework Scoring Template
**Location**: `extractPass5InvestmentThesis` (`optimizedAgenticRAGProcessor.ts:1897-2067`)

**Current State**:
- BPCP alignment scoring exists but lacks a detailed scoring rubric
- No examples of high vs low scores

**Proposed Improvement**:
Add a detailed scoring rubric with examples:
```
**BPCP ALIGNMENT SCORING RUBRIC** (1-10 scale):

1. **EBITDA Fit** (Target: $5-20MM):
   - 10: $5-20MM EBITDA, perfect fit
   - 8: $3-5MM or $20-30MM, good fit with growth potential
   - 5: $1-3MM or $30-50MM, acceptable but outside the sweet spot
   - 3: <$1MM or >$50MM, poor fit

2. **Industry Fit** (Consumer/Industrial):
   - 10: Pure consumer or industrial, core focus
   - 8: Mixed consumer/industrial, good fit
   - 5: Adjacent sector (e.g., healthcare services), acceptable
   - 3: Outside focus (e.g., tech, healthcare), poor fit

[Continue for all 7 criteria with specific examples]
```
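A rubric like this can also be enforced in code rather than left entirely to the LLM, so scores stay deterministic for numeric criteria. A sketch of the EBITDA-fit bands from the rubric above; `scoreEbitdaFit` is an illustrative name, not an existing function:

```typescript
// Score the EBITDA-fit criterion on the 1-10 rubric scale.
// Bands mirror the rubric: $5-20MM -> 10, $3-5MM / $20-30MM -> 8,
// $1-3MM / $30-50MM -> 5, everything else -> 3.
function scoreEbitdaFit(ebitdaMM: number): number {
  if (ebitdaMM >= 5 && ebitdaMM <= 20) return 10;
  if ((ebitdaMM >= 3 && ebitdaMM < 5) || (ebitdaMM > 20 && ebitdaMM <= 30)) return 8;
  if ((ebitdaMM >= 1 && ebitdaMM < 3) || (ebitdaMM > 30 && ebitdaMM <= 50)) return 5;
  return 3; // <$1MM or >$50MM
}
```

Non-numeric criteria such as industry fit still need the LLM (or an analyst), but anchoring the quantitative criteria in code removes one source of scoring drift.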
**Success Likelihood**: **High** (83%)
**Implementation Difficulty**: **Medium** (3-4 hours)
**Expected Impact**:
- Insight Quality: +10-15% (more consistent, quantitative scoring)
- Data Completeness: +2-3% (ensures all criteria are scored)
- Format & Readability: +3-5% (standardized scores)
- Financial Accuracy: Neutral
- Processing Efficiency: Neutral

**Code Reference**: Lines 1994-2005 in `optimizedAgenticRAGProcessor.ts`

---

### STRATEGIC ENHANCEMENTS (High Success, High Difficulty)

#### Recommendation 10: Implement Multi-Pass Cross-Validation System
**Location**: New validation service + integration points

**Current State**:
- Each pass extracts independently
- No systematic cross-validation between passes

**Proposed Improvement**:
Create a validation service that:
1. Runs cross-validation checks after all passes complete
2. Identifies inconsistencies (e.g., the company name differs, financials don't match)
3. Triggers targeted re-extraction for inconsistent fields
4. Maintains a validation log for debugging
**Success Likelihood**: **High** (75%)
|
||||
**Implementation Difficulty**: **High** (12-15 hours)
|
||||
**Expected Impact**:
|
||||
- Financial Accuracy: +12-18% (catches cross-pass inconsistencies)
|
||||
- Data Completeness: +5-8% (fills gaps found during validation)
|
||||
- Format & Readability: +3-5%
|
||||
- Processing Efficiency: -8-12% (additional validation pass)
|
||||
- Insight Quality: +5-7%
|
||||
|
||||
**Code Reference**: New file: `backend/src/services/crossValidationService.ts`
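The core of such a service could look like the sketch below. The `PassOutput` shape is hypothetical (the real pass outputs in `optimizedAgenticRAGProcessor.ts` will differ); the point is the conflict-detection logic:

```typescript
// Sketch of a cross-pass validation check (hypothetical shapes).
interface PassOutput {
  pass: number;
  fields: Record<string, string | number | null>;
}

interface Inconsistency {
  field: string;
  values: Array<{ pass: number; value: string | number }>;
}

// Compare each field across passes; any field reported with conflicting
// non-null values is flagged for targeted re-extraction.
function crossValidate(passes: PassOutput[]): Inconsistency[] {
  const seen = new Map<string, Array<{ pass: number; value: string | number }>>();
  for (const p of passes) {
    for (const [field, value] of Object.entries(p.fields)) {
      if (value === null) continue; // missing values are gaps, not conflicts
      const list = seen.get(field) ?? [];
      list.push({ pass: p.pass, value });
      seen.set(field, list);
    }
  }
  const issues: Inconsistency[] = [];
  for (const [field, values] of seen) {
    const distinct = new Set(values.map(v => String(v.value)));
    if (distinct.size > 1) issues.push({ field, values }); // conflict found
  }
  return issues;
}
```

The returned list doubles as the validation log and as the work queue for targeted re-extraction.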
---

#### Recommendation 11: Add Context-Aware Prompt Adaptation
**Location**: `buildEnhancedExtractionInstructions` + document analysis

**Current State**:
- Dynamic instructions exist but are rule-based
- Doesn't adapt to document-specific patterns

**Proposed Improvement**:
Add document pattern detection and adaptive prompts:
1. Analyze document structure (sections, table locations, format patterns)
2. Detect document "type" (e.g., "bank-prepared CIM", "company-prepared", "auction process")
3. Adapt prompts based on detected patterns:
   - Bank-prepared: Emphasize executive summary cross-reference
   - Company-prepared: Emphasize narrative text extraction
   - Auction: Emphasize competitive positioning

**Success Likelihood**: **Medium-High** (70%)
**Implementation Difficulty**: **High** (10-12 hours)
**Expected Impact**:
- Financial Accuracy: +8-12% (better extraction for document type)
- Data Completeness: +6-10% (targets right sections)
- Format & Readability: +2-3%
- Processing Efficiency: +5-8% (more targeted extraction)
- Insight Quality: +4-6%

**Code Reference**: Enhance `buildEnhancedExtractionInstructions` in `optimizedAgenticRAGProcessor.ts:2194-2322`
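The document-type detection in step 2 could start as simple keyword heuristics. The patterns below are illustrative assumptions only; a real implementation would also inspect section structure and table locations as step 1 describes:

```typescript
// Rule-based sketch of CIM "type" detection (hypothetical heuristics).
type CimType = "bank-prepared" | "company-prepared" | "auction" | "unknown";

function detectCimType(text: string): CimType {
  const t = text.toLowerCase();
  // Auction processes usually announce a formal bid timeline.
  if (/indications? of interest|bid deadline|process letter/.test(t)) return "auction";
  // Bank-prepared CIMs typically carry an advisor attribution.
  if (/prepared by .*(partners|capital|advisors|securities)/.test(t)) return "bank-prepared";
  // Company-prepared decks tend to read as management presentations.
  if (/management presentation|founded by/.test(t)) return "company-prepared";
  return "unknown";
}
```

The detected type would then select which prompt emphasis (executive summary cross-reference, narrative extraction, or competitive positioning) gets appended to the extraction instructions.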
---

#### Recommendation 12: Implement Confidence Scoring and Uncertainty Handling
**Location**: New confidence scoring system + prompt enhancements

**Current State**:
- Confidence scoring is mentioned in `getFinancialSystemPrompt` but not used
- No systematic uncertainty handling

**Proposed Improvement**:
1. Add confidence scores to the extraction output (High/Medium/Low)
2. For Low-confidence fields, trigger targeted re-extraction
3. Add uncertainty indicators to the JSON output
4. Use confidence scores to prioritize gap-filling

**Success Likelihood**: **Medium-High** (72%)
**Implementation Difficulty**: **High** (10-12 hours)
**Expected Impact**:
- Financial Accuracy: +10-15% (flags uncertain extractions)
- Data Completeness: +5-8% (targeted re-extraction)
- Format & Readability: +2-3% (uncertainty indicators)
- Processing Efficiency: -5-10% (additional passes for low confidence)
- Insight Quality: +3-5%

**Code Reference**: New method in `llmService.ts`, modify schema in `llmSchemas.ts`
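Steps 2 and 4 amount to a prioritized re-extraction queue. A minimal sketch, assuming a hypothetical per-field shape (the real schema change would live in `llmSchemas.ts`):

```typescript
// Sketch of confidence-driven re-extraction prioritization.
type Confidence = "High" | "Medium" | "Low";

interface ExtractedField {
  name: string;
  value: string | null;
  confidence: Confidence;
}

// Missing and Low-confidence fields are queued, lowest confidence first,
// so gap-filling spends its extra passes where uncertainty is highest.
function prioritizeReExtraction(fields: ExtractedField[]): string[] {
  const rank: Record<Confidence, number> = { Low: 0, Medium: 1, High: 2 };
  return fields
    .filter(f => f.value === null || f.confidence === "Low")
    .sort((a, b) => rank[a.confidence] - rank[b.confidence])
    .map(f => f.name);
}
```

This is also why the recommendation carries schema risk: the `confidence` field must be added to the output contract that downstream systems consume.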
---

#### Recommendation 13: Add PE Investment Thesis Template with Examples
**Location**: `extractPass5InvestmentThesis` (`optimizedAgenticRAGProcessor.ts:1897-2067`)

**Current State**:
- Framework exists but lacks concrete examples
- No "good vs bad" investment thesis examples

**Proposed Improvement**:
Add a comprehensive investment thesis template with examples:

```
**EXAMPLE: HIGH-QUALITY INVESTMENT THESIS**

Key Attractions:
1. Market-leading position with 25% market share in $2.5B TAM, providing pricing power and competitive moat. Revenue grew 15% CAGR over 3 years to $64M, demonstrating strong execution. This market position supports 2-3x revenue growth potential through geographic expansion and product line extensions.

[Continue with 4-7 more examples showing specificity, quantification, and investment impact]

**EXAMPLE: LOW-QUALITY INVESTMENT THESIS (AVOID)**

Key Attractions:
1. Strong market position. [TOO VAGUE - lacks specificity, quantification, investment impact]
2. Good management team. [TOO GENERIC - no details, no track record, no investment significance]
```

**Success Likelihood**: **High** (88%)
**Implementation Difficulty**: **Medium-High** (6-8 hours)
**Expected Impact**:
- Insight Quality: +15-20% (dramatically better investment thesis quality)
- Data Completeness: +3-5% (ensures all required elements)
- Format & Readability: +5-7% (consistent structure)
- Financial Accuracy: Neutral
- Processing Efficiency: Neutral

**Code Reference**: Lines 1901-2067 in `optimizedAgenticRAGProcessor.ts`
---

#### Recommendation 14: Enhance List Field Repair with Document-Specific Context
**Location**: `repairListField` (`optimizedAgenticRAGProcessor.ts:2832-3000`)

**Current State**:
- Uses first 5 chunks for context (4000 chars)
- Doesn't prioritize most relevant chunks

**Proposed Improvement**:
1. Use RAG to find the most relevant chunks for the specific field being repaired
2. Increase context to 6000-8000 chars for better understanding
3. Add field-specific chunk prioritization (e.g., for "risks", prioritize risk sections)
4. Include examples of high-quality items from similar fields

**Success Likelihood**: **High** (80%)
**Implementation Difficulty**: **Medium** (5-6 hours)
**Expected Impact**:
- Insight Quality: +8-12% (better list item quality)
- Data Completeness: +3-5% (more comprehensive lists)
- Format & Readability: +2-3%
- Financial Accuracy: Neutral
- Processing Efficiency: -2-3% (more context processing)

**Code Reference**: Lines 2841-2926 in `optimizedAgenticRAGProcessor.ts`
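The field-specific prioritization in step 3 could be as simple as keyword scoring before falling back to full semantic retrieval. The keyword map below is an illustrative assumption, not the repository's actual configuration:

```typescript
// Sketch of field-specific chunk prioritization for list-field repair.
// (Hypothetical keyword map; the current code just takes the first 5 chunks.)
const FIELD_KEYWORDS: Record<string, string[]> = {
  risks: ["risk", "dependence", "concentration", "litigation", "competition"],
  growthOpportunities: ["growth", "expansion", "pipeline", "whitespace"],
};

// Score each chunk by keyword hits for the field being repaired and keep
// the top N, instead of blindly taking the first chunks in document order.
function prioritizeChunks(field: string, chunks: string[], topN = 5): string[] {
  const keywords = FIELD_KEYWORDS[field] ?? [];
  return chunks
    .map(chunk => ({
      chunk,
      score: keywords.filter(k => chunk.toLowerCase().includes(k)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(c => c.chunk);
}
```

Combined with the larger 6000-8000 char budget, this spends the extra context on chunks that actually mention the field being repaired.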
---

#### Recommendation 15: Add Structured Extraction Workflow with Checkpoints
**Location**: `buildCIMPrompt` (`llmService.ts:1365-1394`)

**Current State**:
- Workflow exists but is linear
- No validation checkpoints

**Proposed Improvement**:
Add a checkpoint-based workflow:

```
**Phase 1: Document Structure Analysis** [CHECKPOINT: Verify sections identified]
1. Identify document sections...
2. Locate key sections...
[VALIDATION: If <5 sections found, expand search]

**Phase 2: Financial Data Extraction** [CHECKPOINT: Validate financial table found]
1. Locate PRIMARY historical financial table
[VALIDATION: If revenue <$10M, search for alternative table]
2. Extract financial metrics...
[VALIDATION: Verify magnitude, trends, calculations]
```

**Success Likelihood**: **Medium-High** (75%)
**Implementation Difficulty**: **High** (8-10 hours)
**Expected Impact**:
- Financial Accuracy: +10-15% (catches errors at checkpoints)
- Data Completeness: +5-8% (expands search when needed)
- Format & Readability: +2-3%
- Processing Efficiency: -5-8% (additional validation steps)
- Insight Quality: +3-5%

**Code Reference**: Lines 1365-1394 in `llmService.ts`
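If the checkpoints were enforced in orchestration code rather than only in the prompt text, each phase becomes run → validate → recover → re-validate. A minimal sketch with hypothetical phase and validator shapes:

```typescript
// Sketch of a checkpoint-gated extraction phase (hypothetical shapes).
interface Phase<T> {
  name: string;
  run: () => T;
  validate: (result: T) => boolean; // the [CHECKPOINT] / [VALIDATION] rule
  recover: (result: T) => T;        // e.g., expand the search
}

// Run a phase; if its checkpoint fails, apply the recovery step once and
// re-validate before the workflow is allowed to continue.
function runPhase<T>(phase: Phase<T>): T {
  let result = phase.run();
  if (!phase.validate(result)) {
    result = phase.recover(result);
    if (!phase.validate(result)) {
      throw new Error(`Checkpoint failed for phase: ${phase.name}`);
    }
  }
  return result;
}
```

The single-retry-then-fail policy is one design choice; it bounds the processing-time cost that the Expected Impact section flags.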
---

## Prioritized Implementation Roadmap

### Phase 1: Quick Wins (Week 1-2)
**Total Effort**: 6-8 hours
**Expected Impact**: +15-25% improvement across objectives

1. ✅ Recommendation 3: Add Explicit Format Standardization Examples
2. ✅ Recommendation 2: Enhance JSON Template with Inline Validation Hints
3. ✅ Recommendation 1: Add Financial Table Detection Examples with Edge Cases
4. ✅ Recommendation 4: Enhance Cross-Table Validation Instructions

### Phase 2: High-Impact Improvements (Week 3-4)
**Total Effort**: 22-28 hours
**Expected Impact**: +25-40% improvement across objectives

5. ✅ Recommendation 6: Enhance RAG Query with Field-Specific Semantic Boosts
6. ✅ Recommendation 8: Enhance Gap-Filling Query with Field-Specific Inference Rules
7. ✅ Recommendation 9: Add PE Investment Framework Scoring Template
8. ✅ Recommendation 7: Add Dynamic Few-Shot Example Selection
9. ✅ Recommendation 13: Add PE Investment Thesis Template with Examples

### Phase 3: Strategic Enhancements (Week 5-8)
**Total Effort**: 40-50 hours
**Expected Impact**: +30-50% improvement across objectives

10. ✅ Recommendation 14: Enhance List Field Repair with Document-Specific Context
11. ✅ Recommendation 5: Implement Multi-Pass Financial Validation
12. ✅ Recommendation 10: Implement Multi-Pass Cross-Validation System
13. ✅ Recommendation 11: Add Context-Aware Prompt Adaptation
14. ✅ Recommendation 12: Implement Confidence Scoring and Uncertainty Handling
15. ✅ Recommendation 15: Add Structured Extraction Workflow with Checkpoints

---
## Success Metrics

### Baseline (Current State)
- Financial Accuracy: ~85-90% (estimated)
- Data Completeness: ~80-85% (estimated)
- Format Consistency: ~75-80% (estimated)
- Processing Speed: Baseline
- Investment Quality: ~7/10 (estimated)

### Target (After All Recommendations)
- Financial Accuracy: **>99%** (validated against manual review)
- Data Completeness: **>95%** (excluding truly unavailable data)
- Format Consistency: **>98%** (adherence to format specifications)
- Processing Speed: **<30% increase** (despite improvements)
- Investment Quality: **>8.5/10** (investment committee feedback)

---
## Risk Assessment

### Low Risk (Recommendations 1-4, 6-9, 13-14)
- Well-defined scope
- Clear implementation path
- Low chance of breaking existing functionality
- Easy to roll back if needed

### Medium Risk (Recommendations 5, 10, 11, 15)
- More complex implementation
- May require architectural changes
- Testing required to ensure no regressions
- May impact processing time

### High Risk (Recommendation 12)
- Requires schema changes
- May impact downstream systems
- Requires comprehensive testing
- Most complex to implement

---
## Conclusion

This analysis identifies **15 specific, actionable recommendations** to optimize AI prompts across the CIM processing system. The recommendations are prioritized by success likelihood and implementation difficulty, with a clear roadmap for implementation over 8 weeks.

**Key Takeaways**:
1. **Quick wins** can deliver 15-25% improvement with minimal effort
2. **High-impact improvements** can deliver 25-40% improvement with moderate effort
3. **Strategic enhancements** can deliver 30-50% improvement but require significant effort

**Recommended Approach**:
- Start with Phase 1 (Quick Wins) to build momentum
- Validate improvements with real CIM documents
- Iterate based on results before moving to Phase 2
- Consider Phase 3 enhancements based on business priorities and resource availability

All recommendations include specific code references, success likelihood assessments, and expected impact on the 5 core objectives.