Remove 15 stale planning and analysis docs
These are completed implementation plans, one-time analysis artifacts, and generic guides that no longer reflect the current codebase. All useful content is either implemented in code or captured in TODO_AND_OPTIMIZATIONS.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -1,746 +0,0 @@
## Best Practices for Debugging with Cursor: Becoming a Senior Developer-Level Debugger

Transform Cursor into an elite debugging partner with these comprehensive strategies, workflow optimizations, and hidden power features that professional developers use to maximize productivity.

### Core Debugging Philosophy: Test-Driven Development with AI

**Write Tests First, Always**

The single most effective debugging strategy is implementing Test-Driven Development (TDD) with Cursor. This gives you verifiable proof that code works before deployment[^1][^2][^3].

**Workflow:**

- Start with: "Write tests first, then the code, then run the tests and update the code until tests pass"[^1]
- Enable YOLO mode (Settings → scroll down → enable YOLO mode) to allow Cursor to automatically run tests, build commands, and iterate until passing[^1][^4]
- Let the AI cycle through test failures autonomously—it will fix lint errors and test failures without manual intervention[^1][^5]

**YOLO Mode Configuration:**

Add this prompt to YOLO settings:

```
any kind of tests are always allowed like vitest, npm test, nr test, etc. also basic build commands like build, tsc, etc. creating files and making directories (like touch, mkdir, etc) is always ok too
```

This enables autonomous iteration on builds and tests[^1][^4].

### Advanced Debugging Techniques

**1. Log-Driven Debugging Workflow**

When facing persistent bugs, use this iterative logging approach[^1][^6]:

- Tell Cursor: "Please add logs to the code to get better visibility into what is going on so we can find the fix. I'll run the code and feed you the logs results"[^1]
- Run your code and collect log output
- Paste the raw logs back into Cursor: "Here's the log output. What do you now think is causing the issue? And how do we fix it?"[^1]
- Cursor will propose targeted fixes based on actual runtime behavior
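
The visibility logging Cursor adds in step one typically looks like this sketch (the function and field names are illustrative, not from a real codebase):

```javascript
// Hypothetical function under investigation, instrumented so a single run
// exposes its inputs, intermediate values, and output.
function applyDiscount(order, discount) {
  console.log("[applyDiscount] input", { total: order.total, discount });

  const amount = order.total * discount;
  console.log("[applyDiscount] computed discount amount:", amount);

  const result = { ...order, total: order.total - amount };
  console.log("[applyDiscount] result", result);
  return result;
}

// Run the code, collect this console output, and paste it back into Cursor.
applyDiscount({ id: 1, total: 100 }, 0.2);
```

The point is not the logs themselves but the round trip: runtime evidence replaces the AI's guesses about what the code is doing.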

**For Firebase Projects:**

Use the logger SDK with proper severity levels[^7]:

```javascript
const logger = require("firebase-functions/logger");

// Log with structured data
logger.error("API call failed", {
  endpoint: endpoint,
  statusCode: response.status,
  userId: userId
});
```

**2. Autonomous Workflow with Plan-Approve-Execute Pattern**

Use Cursor in Project Manager mode for complex debugging tasks[^5][^8]:

**Setup `.cursorrules` file:**

```
You are working with me as PM/Technical Approver while you act as developer.
- Work from PRD file one item at a time
- Generate detailed story file outlining approach
- Wait for approval before executing
- Use TDD for implementation
- Update story with progress after completion
```

**Workflow:**

- Agent creates a story file breaking down the fix in detail
- You review and approve the approach
- Agent executes using TDD
- Agent runs tests until all pass
- Agent pushes changes with a clear commit message[^5][^8]

This prevents the AI from going off-track and ensures deliberate, verifiable fixes.

### Context Management Mastery

**3. Strategic Use of @ Symbols**

Master these context references for precise debugging[^9][^10]:

- `@Files` - Reference specific files
- `@Folders` - Include entire directories
- `@Code` - Reference specific functions/classes
- `@Docs` - Pull in library documentation (add libraries via Settings → Cursor Settings → Docs)[^4][^9]
- `@Web` - Search current information online
- `@Codebase` - Search entire codebase (Chat only)
- `@Lint Errors` - Reference current lint errors (Chat only)[^9]
- `@Git` - Access git history and recent changes
- `@Recent Changes` - View recent modifications

**Pro tip:** Stack multiple @ symbols in one prompt for comprehensive context[^9].
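
For example, one debugging prompt might stack several references at once (the file name here is a placeholder):

```
@src/checkout.ts @Lint Errors @Recent Changes The type error below appeared after the last refactor. Trace it to its source and fix it.
```

Each reference narrows what the model attends to, so the combination is usually more precise than pasting code manually.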

**4. Reference Open Editors Strategy**

Keep your AI focused by managing context deliberately[^11]:

- Close all irrelevant tabs
- Open only files related to the current debugging task
- Use `@` to reference open editors
- This prevents the AI from getting confused by unrelated code[^11]

**5. Context7 MCP for Up-to-Date Documentation**

Integrate Context7 MCP to eliminate outdated API suggestions[^12][^13][^14]:

**Installation:**

```json
// ~/.cursor/mcp.json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp@latest"]
    }
  }
}
```

**Usage:**

```
use context7 for latest documentation on [library name]
```

Add to your cursor rules:

```
When referencing documentation for any library, use the context7 MCP server for lookups to ensure up-to-date information
```

### Power Tools and Integrations

**6. Browser Tools MCP for Live Debugging**

Debug live applications by connecting Cursor directly to your browser[^15][^16]:

**Setup:**

1. Clone the browser-tools-mcp repository
2. Install the Chrome extension
3. Configure MCP in Cursor settings:

```json
{
  "mcpServers": {
    "browser-tools": {
      "command": "node",
      "args": ["/path/to/browser-tools-mcp/server.js"]
    }
  }
}
```

4. Run the server: `npm start`

**Features:**

- "Investigate what happens when users click the pay button and resolve any JavaScript errors"
- "Summarize these console logs and identify recurring errors"
- "Which API calls are failing?"
- Automatically captures screenshots, console logs, network requests, and DOM state[^15][^16]

**7. Sequential Thinking MCP for Complex Problems**

For intricate debugging requiring multi-step reasoning[^17][^18][^19]:

**Installation:**

```json
{
  "mcpServers": {
    "sequential-thinking": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"]
    }
  }
}
```

**When to use:**

- Breaking down complex bugs into manageable steps
- Problems where the full scope isn't clear initially
- Analysis that might need course correction
- Maintaining context over multiple debugging steps[^17]

Add to cursor rules:

```
Use Sequential thinking for complex reflections and multi-step debugging
```

**8. Firebase Crashlytics MCP Integration**

Connect Crashlytics directly to Cursor for AI-powered crash analysis[^20][^21]:

**Setup:**

1. Enable BigQuery export in Firebase Console → Project Settings → Integrations
2. Generate a Firebase service account JSON key
3. Configure MCP:

```json
{
  "mcpServers": {
    "crashlytics": {
      "command": "node",
      "args": ["/path/to/mcp-crashlytics-server/dist/index.js"],
      "env": {
        "GOOGLE_SERVICE_ACCOUNT_KEY": "/path/to/service-account.json",
        "BIGQUERY_PROJECT_ID": "your-project-id",
        "BIGQUERY_DATASET_ID": "firebase_crashlytics"
      }
    }
  }
}
```

**Usage:**

- "Fetch the latest Crashlytics issues for my project"
- "Add a note to issue xyz summarizing investigation"
- Use the `crashlytics:connect` command for a structured debugging flow[^20][^21]

### Cursor Rules & Configuration

**9. Master .cursorrules Files**

Create powerful project-specific rules[^22][^23][^24]:

**Structure:**

```markdown
# Project Overview
[High-level description of what you're building]

# Tech Stack
- Framework: [e.g., Next.js 14]
- Language: TypeScript (strict mode)
- Database: [e.g., PostgreSQL with Prisma]

# Critical Rules
- Always use strict TypeScript - never use `any`
- Never modify files without explicit approval
- Always read relevant files before making changes
- Log all exceptions in catch blocks using Crashlytics

# Deprecated Patterns (DO NOT USE)
- Old API: `oldMethod()` ❌
- Use instead: `newMethod()` ✅

# Common Bugs to Document
[Add bugs you encounter here so they don't recur]
```

**Pro Tips:**

- Document bugs you encounter in .cursorrules so the AI avoids them in future[^23]
- Use cursor.directory for template examples[^11][^23]
- Stack multiple rule files: global rules + project-specific + feature-specific[^24]
- Use the `.cursor/rules` directory for organized rule management[^24][^25]

**10. Global Rules Configuration**

Set personal coding standards in Settings → Rules for AI[^11][^4]:

```
- Always prefer strict types over any in TypeScript
- Ensure answers are brief and to the point
- Propose alternative solutions when stuck
- Skip unnecessary elaborations
- Emphasize technical specifics over general advice
- Always examine relevant files before taking action
```

**11. Notepads for Reusable Context**

Use Notepads to store debugging patterns and common fixes[^11][^26][^27][^28]:

**Create notepads for:**

- Common error patterns and solutions
- Debugging checklists for specific features
- File references for complex features
- Standard prompts like "code review" or "vulnerability search"

**Usage:**
Reference notepads in prompts to quickly load debugging context without retyping[^27][^28].
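
A notepad for a recurring debugging flow might look like this (contents are illustrative):

```
# Notepad: API Debugging Checklist
1. Reproduce the failure with a minimal test
2. Check function logs for the failing endpoint
3. Verify environment variables and feature flags
4. Compare against the last known-good commit with @Git
```

Referencing this notepad loads the whole checklist into context with one mention instead of retyping it each session.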

### Keyboard Shortcuts for Speed

**Essential Debugging Shortcuts**[^29][^30][^31]:

**Core AI Commands:**

- `Cmd/Ctrl + K` - Inline editing (fastest for quick fixes)[^1][^32][^30]
- `Cmd/Ctrl + L` - Open AI chat[^30][^31]
- `Cmd/Ctrl + I` - Open Composer[^30]
- `Cmd/Ctrl + Shift + I` - Full-screen Composer[^30]

**When to use what:**

- Use `Cmd+K` for fast, localized changes to selected code[^1][^32]
- Use `Cmd+L` for questions and explanations[^31]
- Use `Cmd+I` (Composer) for multi-file changes and complex refactors[^32][^4]

**Navigation:**

- `Cmd/Ctrl + P` - Quick file open[^29][^33]
- `Cmd/Ctrl + Shift + O` - Go to symbol in file[^33]
- `Ctrl + G` - Go to line (for stack traces)[^33]
- `F12` - Go to definition[^29]

**Terminal:**

- ``Cmd/Ctrl + ` `` - Toggle terminal[^29][^30]
- `Cmd + K` in terminal - Clear terminal (note: may need a custom keybinding)[^34][^35]

### Advanced Workflow Strategies

**12. Agent Mode with Plan Mode**

Use Plan Mode for complex debugging[^36][^37]:

1. Hit `Cmd+N` for a new chat
2. Press `Shift+Tab` to toggle Plan Mode
3. Describe the bug or feature
4. Agent researches the codebase and creates a detailed plan
5. Review and approve before implementation

**Agent mode benefits:**

- Autonomous exploration of the codebase
- Edits multiple files
- Runs commands automatically
- Fixes errors iteratively[^37][^38]

**13. Composer Agent Mode Best Practices**

For large-scale debugging and refactoring[^39][^5][^4]:

**Setup:**

- Always use Agent mode (toggle in Composer)
- Enable YOLO mode for autonomous execution[^5][^4]
- Start with clear, detailed problem descriptions

**Workflow:**

1. Describe the complete bug context in detail
2. Let Agent plan the approach
3. Agent will:
   - Pull relevant files automatically
   - Run terminal commands as needed
   - Iterate on test failures
   - Fix linting errors autonomously[^4]

**Recovery strategies:**

- If Agent goes off-track, hit stop immediately
- Say: "Wait, you're way off track here. Reset, recalibrate"[^1]
- Use Composer history to restore checkpoints[^40][^41]

**14. Index Management**

Keep your codebase index fresh[^11]:

**Manual resync:**
Settings → Cursor Settings → Resync Index

**Why this matters:**

- An outdated index causes incorrect suggestions
- The AI may reference deleted files
- Prevents hallucinations about code structure[^11]

**15. Error Pattern Recognition**

Watch for these warning signs and intervene[^1][^42]:

- AI repeatedly apologizing
- Same error occurring 3+ times
- Complexity escalating unexpectedly
- AI asking the same diagnostic questions repeatedly

**When you see these:**

- Stop the current chat
- Start a fresh conversation with better context
- Add specific constraints to prevent loops
- Use "explain your thinking" to understand the AI's logic[^42]

### Firebase-Specific Debugging

**16. Firebase Logging Best Practices**

Structure logs for effective debugging[^7][^43]:

**Severity levels:**

```javascript
logger.debug("Detailed diagnostic info")
logger.info("Normal operations")
logger.warn("Warning conditions")
logger.error("Error conditions", { context: details })
logger.write({ severity: "EMERGENCY", message: "Critical failure" })
```

**Add context:**

```javascript
// Tag user IDs for filtering
Crashlytics.setUserIdentifier(userId)

// Log exceptions with context
Crashlytics.logException(error)
Crashlytics.log(priority, tag, message)
```

**View logs:**

- Firebase Console → Functions → Logs
- Cloud Logging for advanced filtering
- Filter by severity, user ID, version[^43]

**17. Version and User Tagging**

Enable precise debugging of production issues[^43]:

```javascript
// Set version
Crashlytics.setCustomKey("app_version", "1.2.3")

// Set user identifier
Crashlytics.setUserIdentifier(userId)

// Add custom context
Crashlytics.setCustomKey("feature_flag", "beta_enabled")
```

Filter crashes in Firebase Console by version and user to isolate issues.

### Meta-Strategies

**18. Minimize Context Pollution**

**Project-level tactics:**

- Use `.cursorignore` similar to `.gitignore` to exclude unnecessary files[^44]
- Keep only relevant documentation indexed[^4]
- Close unrelated editor tabs before asking questions[^11]
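
A `.cursorignore` for a typical web project might look like this (entries are illustrative):

```
# Build output and dependencies
node_modules/
dist/
coverage/

# Noise that pollutes context
*.log
fixtures/large-data/
```

Anything matched here is excluded from indexing, so the AI never wastes context on generated or bulky files.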

**19. Commit Often**

Let Cursor handle commits[^40]:

```
Push all changes, update story with progress, write clear commit message, and push to remote
```

This creates restoration points if debugging goes sideways.

**20. Multi-Model Strategy**

Don't rely on one model[^4][^45]:

- Use Claude 3.5 Sonnet for complex reasoning and file generation[^5][^8]
- Try different models if stuck
- Some tasks work better with specific models

**21. Break Down Complex Debugging**

When debugging fails repeatedly[^39][^40]:

- Break the problem into the smallest possible sub-tasks
- Start new chats for discrete issues
- Ask the AI to explain its approach before implementing
- Use sequential prompts rather than one massive request

### Troubleshooting Cursor Itself

**When Cursor Misbehaves:**

**Context loss issues:**[^46][^47][^48]

- Check for .mdc glob attachment issues in settings
- Disable workbench/editor auto-attachment if it causes crashes[^46]
- Start a new chat if context becomes corrupted[^48]

**Agent loops:**[^47]

- Stop immediately when looping is detected
- Provide explicit, numbered steps
- Use a "complete step 1, then stop and report" approach
- Restart with clearer constraints

**Rule conflicts:**[^49][^46]

- User rules may not apply automatically - use project .cursorrules instead[^49]
- Test rules by asking the AI to recite them
- Check that rules are being loaded (the AI should mention them in responses)[^46]

### Ultimate Debugging Checklist

Before starting any debugging session:

**Setup:**

- [ ] Enable YOLO mode
- [ ] Configure .cursorrules with project specifics
- [ ] Resync codebase index
- [ ] Close irrelevant files
- [ ] Add relevant documentation to Cursor docs

**During Debugging:**

- [ ] Write tests first before fixing
- [ ] Add logging at critical points
- [ ] Use @ symbols to reference exact files
- [ ] Let Agent run tests autonomously
- [ ] Stop immediately if AI goes off-track
- [ ] Commit frequently with clear messages

**Advanced Tools (when needed):**

- [ ] Context7 MCP for up-to-date docs
- [ ] Browser Tools MCP for live debugging
- [ ] Sequential Thinking MCP for complex issues
- [ ] Crashlytics MCP for production errors

**Recovery Strategies:**

- [ ] Use Composer checkpoints to restore state
- [ ] Start new chat with git diff context if lost
- [ ] Ask AI to recite instructions to verify context
- [ ] Use Plan Mode to reset approach

By implementing these strategies systematically, you transform Cursor from a coding assistant into an elite debugging partner that operates at senior developer level. The key is combining AI autonomy (YOLO mode, Agent mode) with human oversight (TDD, plan approval, checkpoints) to create a powerful, verifiable debugging workflow[^1][^5][^8][^4].

<span style="display:none">[^50][^51][^52][^53][^54][^55][^56][^57][^58][^59][^60][^61][^62][^63][^64][^65][^66][^67][^68][^69][^70][^71][^72][^73][^74][^75][^76][^77][^78][^79][^80][^81][^82][^83][^84][^85][^86][^87][^88][^89][^90][^91][^92][^93][^94][^95][^96][^97][^98]</span>

<div align="center">⁂</div>

[^1]: https://www.builder.io/blog/cursor-tips
[^2]: https://cursorintro.com/insights/Test-Driven-Development-as-a-Framework-for-AI-Assisted-Development
[^3]: https://www.linkedin.com/posts/richardsondx_i-built-tdd-for-cursor-ai-agents-and-its-activity-7330360750995132416-Jt5A
[^4]: https://stack.convex.dev/6-tips-for-improving-your-cursor-composer-and-convex-workflow
[^5]: https://www.reddit.com/r/cursor/comments/1iga00x/refined_workflow_for_cursor_composer_agent_mode/
[^6]: https://www.sidetool.co/post/how-to-use-cursor-for-efficient-code-review-and-debugging/
[^7]: https://firebase.google.com/docs/functions/writing-and-viewing-logs
[^8]: https://forum.cursor.com/t/composer-agent-refined-workflow-detailed-instructions-and-example-repo-for-practice/47180
[^9]: https://learncursor.dev/features/at-symbols
[^10]: https://cursor.com/docs/context/symbols
[^11]: https://www.reddit.com/r/ChatGPTCoding/comments/1hu276s/how_to_use_cursor_more_efficiently/
[^12]: https://dev.to/mehmetakar/context7-mcp-tutorial-3he2
[^13]: https://github.com/upstash/context7
[^14]: https://apidog.com/blog/context7-mcp-server/
[^15]: https://www.reddit.com/r/cursor/comments/1jg0in6/i_cut_my_browser_debugging_time_in_half_using_ai/
[^16]: https://www.youtube.com/watch?v=K5hLY0mytV0
[^17]: https://mcpcursor.com/server/sequential-thinking
[^18]: https://apidog.com/blog/mcp-sequential-thinking/
[^19]: https://skywork.ai/skypage/en/An-AI-Engineer's-Deep-Dive:-Mastering-Complex-Reasoning-with-the-sequential-thinking-MCP-Server-and-Claude-Code/1971471570609172480
[^20]: https://firebase.google.com/docs/crashlytics/ai-assistance-mcp
[^21]: https://lobehub.com/mcp/your-username-mcp-crashlytics-server
[^22]: https://trigger.dev/blog/cursor-rules
[^23]: https://www.youtube.com/watch?v=Vy7dJKv1EpA
[^24]: https://www.reddit.com/r/cursor/comments/1ik06ol/a_guide_to_understand_new_cursorrules_in_045/
[^25]: https://cursor.com/docs/context/rules
[^26]: https://forum.cursor.com/t/enhanced-productivity-persistent-notepads-smart-organization-and-project-integration/60757
[^27]: https://iroidsolutions.com/blog/mastering-cursor-ai-16-golden-tips-for-next-level-productivity
[^28]: https://dev.to/heymarkkop/my-top-cursor-tips-v043-1kcg
[^29]: https://www.dotcursorrules.dev/cheatsheet
[^30]: https://cursor101.com/en/cursor/cheat-sheet
[^31]: https://mehmetbaykar.com/posts/top-15-cursor-shortcuts-to-speed-up-development/
[^32]: https://dev.to/romainsimon/4-tips-for-a-10x-productivity-using-cursor-1n3o
[^33]: https://skywork.ai/blog/vibecoding/cursor-2-0-workflow-tips/
[^34]: https://forum.cursor.com/t/command-k-and-the-terminal/7265
[^35]: https://forum.cursor.com/t/shortcut-conflict-for-cmd-k-terminal-clear-and-ai-window/22693
[^36]: https://www.youtube.com/watch?v=WVeYLlKOWc0
[^37]: https://cursor.com/docs/agent/modes
[^38]: https://forum.cursor.com/t/10-pro-tips-for-working-with-cursor-agent/137212
[^39]: https://ryanocm.substack.com/p/137-10-ways-to-10x-your-cursor-workflow
[^40]: https://forum.cursor.com/t/add-the-best-practices-section-to-the-documentation/129131
[^41]: https://www.nocode.mba/articles/debug-vibe-coding-faster
[^42]: https://www.siddharthbharath.com/coding-with-cursor-beginners-guide/
[^43]: https://www.letsenvision.com/blog/effective-logging-in-production-with-firebase-crashlytics
[^44]: https://www.ellenox.com/post/mastering-cursor-ai-advanced-workflows-and-best-practices
[^45]: https://forum.cursor.com/t/best-practices-setups-for-custom-agents-in-cursor/76725
[^46]: https://www.reddit.com/r/cursor/comments/1jtc9ej/cursors_internal_prompt_and_context_management_is/
[^47]: https://forum.cursor.com/t/endless-loops-and-unrelated-code/122518
[^48]: https://forum.cursor.com/t/auto-injected-summarization-and-loss-of-context/86609
[^49]: https://github.com/cursor/cursor/issues/3706
[^50]: https://www.youtube.com/watch?v=TFIkzc74CsI
[^51]: https://www.codecademy.com/article/how-to-use-cursor-ai-a-complete-guide-with-practical-examples
[^52]: https://launchdarkly.com/docs/tutorials/cursor-tips-and-tricks
[^53]: https://www.reddit.com/r/programming/comments/1g20jej/18_observations_from_using_cursor_for_6_months/
[^54]: https://www.youtube.com/watch?v=TrcyAWGC1k4
[^55]: https://forum.cursor.com/t/composer-agent-refined-workflow-detailed-instructions-and-example-repo-for-practice/47180/5
[^56]: https://hackernoon.com/two-hours-with-cursor-changed-how-i-see-ai-coding
[^57]: https://forum.cursor.com/t/how-are-you-using-ai-inside-cursor-for-real-world-projects/97801
[^58]: https://www.youtube.com/watch?v=eQD5NncxXgE
[^59]: https://forum.cursor.com/t/guide-a-simpler-more-autonomous-ai-workflow-for-cursor-new-update/70688
[^60]: https://forum.cursor.com/t/good-examples-of-cursorrules-file/4346
[^61]: https://patagonian.com/cursor-features-developers-must-know/
[^62]: https://forum.cursor.com/t/ai-test-driven-development/23993
[^63]: https://www.reddit.com/r/cursor/comments/1iq6pc7/all_you_need_is_tdd/
[^64]: https://forum.cursor.com/t/best-practices-cursorrules/41775
[^65]: https://www.youtube.com/watch?v=A9BiNPf34Z4
[^66]: https://engineering.monday.com/coding-with-cursor-heres-why-you-still-need-tdd/
[^67]: https://github.com/PatrickJS/awesome-cursorrules
[^68]: https://www.datadoghq.com/blog/datadog-cursor-extension/
[^69]: https://www.youtube.com/watch?v=oAoigBWLZgE
[^70]: https://www.reddit.com/r/cursor/comments/1khn8hw/noob_question_about_mcp_specifically_context7/
[^71]: https://www.reddit.com/r/ChatGPTCoding/comments/1if8lbr/cursor_has_mcp_features_that_dont_work_for_me_any/
[^72]: https://cursor.com/docs/context/mcp
[^73]: https://upstash.com/blog/context7-mcp
[^74]: https://cursor.directory/mcp/sequential-thinking
[^75]: https://forum.cursor.com/t/how-to-debug-localhost-site-with-mcp/48853
[^76]: https://www.youtube.com/watch?v=gnx2dxtM-Ys
[^77]: https://www.mcp-repository.com/use-cases/ai-data-analysis
[^78]: https://cursor.directory/mcp
[^79]: https://www.youtube.com/watch?v=tDGJ12sD-UQ
[^80]: https://github.com/firebase/firebase-functions/issues/1439
[^81]: https://firebase.google.com/docs/app-hosting/logging
[^82]: https://dotcursorrules.com/cheat-sheet
[^83]: https://www.reddit.com/r/webdev/comments/1k8ld2l/whats_easy_way_to_see_errors_and_logs_once_in/
[^84]: https://www.youtube.com/watch?v=HlYyU2XOXk0
[^85]: https://stackoverflow.com/questions/51212886/how-to-log-errors-with-firebase-hosting-for-a-deployed-angular-web-app
[^86]: https://forum.cursor.com/t/list-of-shortcuts/520
[^87]: https://firebase.google.com/docs/analytics/debugview
[^88]: https://forum.cursor.com/t/cmd-k-vs-cmd-r-keyboard-shortcuts-default/1172
[^89]: https://www.youtube.com/watch?v=CeYr7C8UqLE
[^90]: https://forum.cursor.com/t/can-we-reference-docs-files-in-the-rules/23300
[^91]: https://forum.cursor.com/t/cmd-l-l-i-and-cmd-k-k-hotkeys-to-switch-between-models-and-chat-modes/2442
[^92]: https://www.reddit.com/r/cursor/comments/1gqr207/can_i_mention_docs_in_cursorrules_file/
[^93]: https://cursor.com/docs/configuration/kbd
[^94]: https://forum.cursor.com/t/how-to-reference-symbols-like-docs-or-web-from-within-a-text-prompt/66850
[^95]: https://forum.cursor.com/t/tired-of-cursor-not-putting-what-you-want-into-context-solved/75682
[^96]: https://www.reddit.com/r/vscode/comments/1frnoca/which_keyboard_shortcuts_do_you_use_most_but/
[^97]: https://forum.cursor.com/t/fixing-basic-features-before-adding-new-ones/141183
[^98]: https://cursor.com/en-US/docs
@@ -1,539 +0,0 @@

# CIM Review PDF Template

## HTML Template for Professional CIM Review Reports

### 🎯 Overview

This document contains the HTML template used by the PDF Generation Service to create professional CIM Review reports. The template includes comprehensive styling and structure for generating high-quality PDF documents.

---

## 📄 HTML Template

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>CIM Review Report</title>
<style>
  :root {
    --page-margin: 0.75in;
    --radius: 10px;
    --shadow: 0 12px 30px -10px rgba(0,0,0,0.08);
    --color-bg: #ffffff;
    --color-muted: #f5f7fa;
    --color-text: #1f2937;
    --color-heading: #111827;
    --color-border: #dfe3ea;
    --color-primary: #5f6cff;
    --color-primary-dark: #4a52d1;
    --color-success-bg: #e6f4ea;
    --color-success-border: #38a169;
    --color-highlight-bg: #fff8ed;
    --color-highlight-border: #f29f3f;
    --color-summary-bg: #eef7fe;
    --color-summary-border: #3182ce;
    --font-stack: -apple-system, system-ui, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif;
  }

  @page {
    margin: var(--page-margin);
    size: A4;
  }

  * { box-sizing: border-box; }

  body {
    margin: 0;
    padding: 0;
    font-family: var(--font-stack);
    background: var(--color-bg);
    color: var(--color-text);
    line-height: 1.45;
    font-size: 11pt;
  }

  .container {
    max-width: 940px;
    margin: 0 auto;
  }

  .header {
    display: flex;
    flex-wrap: wrap;
    justify-content: space-between;
    align-items: flex-start;
    padding: 24px 20px;
    background: #f9fbfc;
    border-radius: var(--radius);
    border: 1px solid var(--color-border);
    margin-bottom: 28px;
    gap: 12px;
  }

  .header-left {
    flex: 1 1 300px;
  }

  .title {
    margin: 0;
    font-size: 24pt;
    font-weight: 700;
    color: var(--color-heading);
    position: relative;
    display: inline-block;
    padding-bottom: 4px;
  }

  .title:after {
    content: '';
    position: absolute;
    left: 0;
    bottom: 0;
    height: 4px;
    width: 60px;
    background: linear-gradient(90deg, var(--color-primary), var(--color-primary-dark));
    border-radius: 2px;
  }

  .subtitle {
    margin: 4px 0 0 0;
    font-size: 10pt;
    color: #6b7280;
  }

  .meta {
    text-align: right;
    font-size: 9pt;
    color: #6b7280;
    min-width: 180px;
    line-height: 1.3;
  }

  .section {
    margin-bottom: 28px;
    padding: 22px 24px;
    background: #ffffff;
    border-radius: var(--radius);
    border: 1px solid var(--color-border);
    box-shadow: var(--shadow);
    page-break-inside: avoid;
  }

  .section + .section {
    margin-top: 4px;
  }

  h2 {
    margin: 0 0 14px 0;
    font-size: 18pt;
    font-weight: 600;
    color: var(--color-heading);
    display: flex;
    align-items: center;
    gap: 8px;
  }

  h3 {
    margin: 16px 0 8px 0;
    font-size: 13pt;
    font-weight: 600;
    color: #374151;
  }

  .field {
    display: flex;
    flex-wrap: wrap;
    gap: 12px;
    margin-bottom: 14px;
  }

  .field-label {
    flex: 0 0 180px;
    font-size: 9pt;
    font-weight: 600;
    text-transform: uppercase;
    letter-spacing: 0.8px;
    color: #4b5563;
    margin: 0;
  }

  .field-value {
    flex: 1 1 220px;
    font-size: 11pt;
    color: var(--color-text);
    margin: 0;
  }

  .financial-table {
    width: 100%;
    border-collapse: collapse;
    margin: 16px 0;
    font-size: 10pt;
  }

  .financial-table th,
  .financial-table td {
    padding: 10px 12px;
    text-align: left;
    vertical-align: top;
  }

  .financial-table thead th {
    background: var(--color-primary);
    color: #fff;
    font-weight: 600;
    text-transform: uppercase;
    letter-spacing: 0.5px;
    font-size: 9pt;
    border-bottom: 2px solid rgba(255,255,255,0.2);
  }

  .financial-table tbody tr {
    border-bottom: 1px solid #eceef1;
  }

  .financial-table tbody tr:nth-child(odd) td {
    background: #fbfcfe;
  }

  .financial-table td {
    background: #fff;
    color: var(--color-text);
    font-size: 10pt;
  }

  .financial-table tbody tr:hover td {
    background: #f1f5fa;
  }

  .summary-box,
  .highlight-box,
  .success-box {
    border-radius: 8px;
    padding: 16px 18px;
    margin: 18px 0;
    position: relative;
    font-size: 11pt;
  }

  .summary-box {
    background: var(--color-summary-bg);
    border: 1px solid var(--color-summary-border);
|
||||
}
|
||||
|
||||
.highlight-box {
|
||||
background: var(--color-highlight-bg);
|
||||
border: 1px solid var(--color-highlight-border);
|
||||
}
|
||||
|
||||
.success-box {
|
||||
background: var(--color-success-bg);
|
||||
border: 1px solid var(--color-success-border);
|
||||
}
|
||||
|
||||
.footer {
|
||||
display: flex;
|
||||
flex-wrap: wrap;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
padding: 18px 20px;
|
||||
font-size: 9pt;
|
||||
color: #6b7280;
|
||||
border-top: 1px solid var(--color-border);
|
||||
margin-top: 30px;
|
||||
background: #f9fbfc;
|
||||
border-radius: var(--radius);
|
||||
gap: 8px;
|
||||
}
|
||||
|
||||
.footer .left,
|
||||
.footer .right {
|
||||
flex: 1 1 200px;
|
||||
}
|
||||
|
||||
.footer .center {
|
||||
flex: 0 0 auto;
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
.small {
|
||||
font-size: 8.5pt;
|
||||
}
|
||||
|
||||
.divider {
|
||||
height: 1px;
|
||||
background: var(--color-border);
|
||||
margin: 16px 0;
|
||||
border: none;
|
||||
}
|
||||
|
||||
/* Utility */
|
||||
.inline-block { display: inline-block; }
|
||||
.muted { color: #6b7280; }
|
||||
|
||||
/* Page numbering for PDF (supported in many engines including Puppeteer) */
|
||||
.page-footer {
|
||||
position: absolute;
|
||||
bottom: 0;
|
||||
width: 100%;
|
||||
font-size: 8pt;
|
||||
text-align: center;
|
||||
padding: 8px 0;
|
||||
color: #9ca3af;
|
||||
}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="container">
|
||||
<div class="header">
|
||||
<div class="header-left">
|
||||
<h1 class="title">CIM Review Report</h1>
|
||||
<p class="subtitle">Professional Investment Analysis</p>
|
||||
</div>
|
||||
<div class="meta">
|
||||
<div>Generated on ${new Date().toLocaleDateString()}</div>
|
||||
<div style="margin-top:4px;">at ${new Date().toLocaleTimeString()}</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Dynamic Content Sections -->
|
||||
<!-- Example of how your loop would insert sections: -->
|
||||
<!--
|
||||
<div class="section">
|
||||
<h2><span class="section-icon">📊</span>Deal Overview</h2>
|
||||
...fields / tables...
|
||||
</div>
|
||||
-->
|
||||
|
||||
<!-- Footer -->
|
||||
<div class="footer">
|
||||
<div class="left">
|
||||
<strong>BPCP CIM Document Processor</strong> | Professional Investment Analysis | Confidential
|
||||
</div>
|
||||
<div class="center small">
|
||||
Generated on ${new Date().toLocaleDateString()} at ${new Date().toLocaleTimeString()}
|
||||
</div>
|
||||
<div class="right" style="text-align:right;">
|
||||
Page <span class="page-number"></span>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Optional script to inject page numbers if using Puppeteer -->
|
||||
<script>
|
||||
// Puppeteer can replace this with its own page numbering; if not, simple fallback:
|
||||
document.querySelectorAll('.page-number').forEach(el => {
|
||||
// placeholder; leave blank or inject via PDF generation tooling
|
||||
el.textContent = '';
|
||||
});
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
```

---

## 🎨 CSS Styling Features

### **Design System**
- **CSS Variables**: Centralized design tokens for consistency
- **Modern Color Palette**: Professional grays, blues, and accent colors
- **Typography**: System font stack for optimal rendering
- **Spacing**: Consistent spacing using design tokens

### **Typography**
- **Font Stack**: -apple-system, system-ui, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif
- **Line Height**: 1.45 for optimal readability
- **Font Sizes**: 8.5pt to 24pt range for hierarchy
- **Color Scheme**: Professional grays and modern blue accent

### **Layout**
- **Page Size**: A4 with 0.75in margins
- **Container**: Max-width 940px for optimal reading
- **Flexbox Layout**: Modern responsive design
- **Section Spacing**: 28px between sections with 4px gaps

### **Visual Elements**

#### **Headers**
- **Main Title**: 24pt with underline accent in primary color
- **Section Headers**: 18pt with icons and flexbox layout
- **Subsection Headers**: 13pt for organization

#### **Content Sections**
- **Background**: White with subtle borders and shadows
- **Border Radius**: 10px for modern appearance
- **Box Shadows**: Sophisticated shadow with 12px blur
- **Padding**: 22px vertical, 24px horizontal for comfortable reading
- **Page Break**: Avoid page breaks within sections

#### **Fields**
- **Layout**: Flexbox with label-value pairs
- **Labels**: 9pt uppercase with letter spacing (180px width)
- **Values**: 11pt standard text (flexible width)
- **Spacing**: 12px gap between label and value

#### **Financial Tables**
- **Header**: Primary color background with white text
- **Rows**: Alternating colors for easy scanning
- **Hover Effects**: Subtle highlighting on hover
- **Typography**: 10pt for table content, 9pt for headers

#### **Special Boxes**
- **Summary Box**: Light blue background for key information
- **Highlight Box**: Light orange background for important notes
- **Success Box**: Light green background for positive indicators
- **Consistent Styling**: 8px border radius and 16px 18px padding

---

## 📋 Section Structure

### **Report Sections**
1. **Deal Overview** 📊
2. **Business Description** 🏢
3. **Market & Industry Analysis** 📈
4. **Financial Summary** 💰
5. **Management Team Overview** 👥
6. **Preliminary Investment Thesis** 🎯
7. **Key Questions & Next Steps** ❓

### **Data Handling**
- **Simple Fields**: Direct text display
- **Nested Objects**: Structured field display
- **Financial Data**: Tabular format with periods
- **Arrays**: List format when applicable
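The data-handling rules above can be sketched as a single dispatch helper. This is a hypothetical illustration only; the real renderer inside pdfGenerationService.ts is not shown in this document and may differ, and the `renderValue` name and HTML shapes are assumptions:

```typescript
// Hypothetical sketch of the data-handling rules: pick a render
// strategy from the shape of the value.
function renderValue(value: unknown): string {
  if (value == null) return '-';
  if (Array.isArray(value)) {
    // Arrays: list format
    return `<ul>${value.map((v) => `<li>${String(v)}</li>`).join('')}</ul>`;
  }
  if (typeof value === 'object') {
    // Nested objects: structured label/value field display
    return Object.entries(value as Record<string, unknown>)
      .map(([k, v]) => `<div class="field"><p class="field-label">${k}</p><p class="field-value">${String(v)}</p></div>`)
      .join('');
  }
  // Simple fields: direct text display
  return String(value);
}
```

With this shape, `renderValue(['a', 'b'])` yields a `<ul>` with two `<li>` items, while a plain string passes through unchanged and missing values fall back to `'-'`.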

---

## 🔧 Template Variables

### **Dynamic Content**
- `${new Date().toLocaleDateString()}` - Current date
- `${new Date().toLocaleTimeString()}` - Current time
- `${section.icon}` - Section emoji icons
- `${section.title}` - Section titles
- `${this.formatFieldName(key)}` - Formatted field names
- `${value}` - Field values
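`formatFieldName(key)` is referenced above but its body is not shown in this document. A minimal sketch of what such a helper might look like, assuming camelCase keys in the analysis JSON (the actual implementation in pdfGenerationService.ts may differ):

```typescript
// Hypothetical helper: turn a camelCase data key into a display label.
function formatFieldName(key: string): string {
  return key
    .replace(/([A-Z])/g, ' $1') // "revenueGrowth" -> "revenue Growth"
    .replace(/^./, (c) => c.toUpperCase()) // capitalize the first letter
    .trim();
}

console.log(formatFieldName('revenueGrowth')); // "Revenue Growth"
```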

### **Financial Table Structure**
```html
<table class="financial-table">
  <thead>
    <tr>
      <th>Period</th>
      <th>Revenue</th>
      <th>Growth</th>
      <th>EBITDA</th>
      <th>Margin</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>FY3</strong></td>
      <td>${data?.revenue || '-'}</td>
      <td>${data?.revenueGrowth || '-'}</td>
      <td>${data?.ebitda || '-'}</td>
      <td>${data?.ebitdaMargin || '-'}</td>
    </tr>
    <!-- Additional periods: FY2, FY1, LTM -->
  </tbody>
</table>
```
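The `<!-- Additional periods -->` comment implies one `<tr>` per period. A hedged sketch of how that loop might be written; the `renderPeriodRows` name, the period labels, and the data shape are assumptions for illustration, not the service's actual schema:

```typescript
// Hypothetical sketch: emit one financial-table row per period,
// falling back to '-' for any missing cell (matching the template above).
interface PeriodData {
  revenue?: string;
  revenueGrowth?: string;
  ebitda?: string;
  ebitdaMargin?: string;
}

function renderPeriodRows(periods: Record<string, PeriodData | undefined>): string {
  return Object.entries(periods)
    .map(([label, data]) => `<tr>
  <td><strong>${label}</strong></td>
  <td>${data?.revenue || '-'}</td>
  <td>${data?.revenueGrowth || '-'}</td>
  <td>${data?.ebitda || '-'}</td>
  <td>${data?.ebitdaMargin || '-'}</td>
</tr>`)
    .join('\n');
}

// A period with no data still gets a row of '-' placeholders.
const rows = renderPeriodRows({ FY3: { revenue: '$12.4M' }, LTM: undefined });
```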

---

## 🎯 Usage in Code

### **Template Integration**
```typescript
// In pdfGenerationService.ts
private generateCIMReviewHTML(analysisData: any): string {
  const sections = [
    { title: 'Deal Overview', data: analysisData.dealOverview, icon: '📊' },
    { title: 'Business Description', data: analysisData.businessDescription, icon: '🏢' },
    // ... additional sections
  ];

  // Generate HTML with template
  let html = `<!DOCTYPE html>...`;

  sections.forEach(section => {
    if (section.data) {
      html += `<div class="section"><h2><span class="section-icon">${section.icon}</span>${section.title}</h2>`;
      // Process section data
      html += `</div>`;
    }
  });

  return html;
}
```

### **PDF Generation**
```typescript
async generateCIMReviewPDF(analysisData: any): Promise<Buffer> {
  const html = this.generateCIMReviewHTML(analysisData);
  const page = await this.getPage();

  await page.setContent(html, { waitUntil: 'networkidle0' });
  const pdfBuffer = await page.pdf({
    format: 'A4',
    printBackground: true,
    margin: { top: '0.75in', right: '0.75in', bottom: '0.75in', left: '0.75in' }
  });

  this.releasePage(page);
  return pdfBuffer;
}
```
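A caller of `generateCIMReviewPDF` might look like the sketch below; the `CIMPdfService` interface, the `exportReport` name, and the output path are assumptions for illustration only:

```typescript
import { writeFile } from 'fs/promises';

// Hypothetical interface matching the method shown above.
interface CIMPdfService {
  generateCIMReviewPDF(analysisData: any): Promise<Buffer>;
}

// Hypothetical caller: render the PDF and persist it to disk,
// returning the buffer so callers can also stream it in a response.
async function exportReport(
  service: CIMPdfService,
  analysisData: any,
  outPath = 'cim-review.pdf'
): Promise<Buffer> {
  const pdfBuffer = await service.generateCIMReviewPDF(analysisData);
  await writeFile(outPath, pdfBuffer);
  return pdfBuffer;
}
```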

---

## 🚀 Customization Options

### **Design System Customization**
- **CSS Variables**: Update `:root` variables for consistent theming
- **Color Palette**: Modify primary, success, highlight, and summary colors
- **Typography**: Change font stack and sizing
- **Spacing**: Adjust margins, padding, and gaps using design tokens

### **Styling Modifications**
- **Colors**: Update CSS variables for brand colors
- **Fonts**: Change font-family for different styles
- **Layout**: Adjust margins, padding, and spacing
- **Effects**: Modify shadows, borders, and visual effects

### **Content Structure**
- **Sections**: Add or remove report sections
- **Fields**: Customize field display formats
- **Tables**: Modify financial table structure
- **Icons**: Change section icons and styling

### **Branding**
- **Header**: Update company name and logo
- **Footer**: Modify footer content and styling
- **Colors**: Implement brand color scheme
- **Typography**: Use brand fonts

---

## 📊 Performance Considerations

### **Optimization Features**
- **CSS Variables**: Efficient design token system
- **Font Loading**: System fonts for fast rendering
- **Image Handling**: No external images for reliability
- **Print Optimization**: Print-specific CSS rules
- **Flexbox Layout**: Modern, efficient layout system

### **Browser Compatibility**
- **Puppeteer**: Optimized for headless browser rendering
- **CSS Support**: Modern CSS features for visual appeal
- **Fallbacks**: Graceful degradation for older browsers
- **Print Support**: Print-friendly styling

---

This HTML template provides a professional, visually appealing foundation for CIM Review PDF generation, with comprehensive styling and flexible content structure.
186 CLEANUP_PLAN.md
@@ -1,186 +0,0 @@
# Project Cleanup Plan

## Files Found for Cleanup

### 🗑️ Category 1: SAFE TO DELETE (Backups & Temp Files)

**Backup Files:**
- `backend/.env.backup` (4.1K, Nov 4)
- `backend/.env.backup.20251031_221937` (4.1K, Oct 31)
- `backend/diagnostic-report.json` (1.9K, Oct 31)

**Total Space:** ~10KB

**Action:** DELETE - These are temporary diagnostic/backup files

---

### 📄 Category 2: REDUNDANT DOCUMENTATION (Consider Deleting)

**Analysis Reports (Already in Git History):**
- `CLEANUP_ANALYSIS_REPORT.md` (staged for deletion)
- `CLEANUP_COMPLETION_REPORT.md` (staged for deletion)
- `DOCUMENTATION_AUDIT_REPORT.md` (staged for deletion)
- `DOCUMENTATION_COMPLETION_REPORT.md` (staged for deletion)
- `FRONTEND_DOCUMENTATION_SUMMARY.md` (staged for deletion)
- `LLM_DOCUMENTATION_SUMMARY.md` (staged for deletion)
- `OPERATIONAL_DOCUMENTATION_SUMMARY.md` (staged for deletion)

**Action:** ALREADY STAGED FOR DELETION - Git will handle

**Duplicate/Outdated Guides:**
- `BETTER_APPROACHES.md` (untracked)
- `DEPLOYMENT_INSTRUCTIONS.md` (untracked) - Duplicate of `DEPLOYMENT_GUIDE.md`?
- `IMPLEMENTATION_GUIDE.md` (untracked)
- `LLM_ANALYSIS.md` (untracked)

**Action:** REVIEW THEN DELETE if redundant with other docs

---

### 🛠️ Category 3: DIAGNOSTIC SCRIPTS (28 total)

**Keep These (Core Utilities):**
- `check-database-failures.ts` ✅ (used in troubleshooting)
- `check-current-processing.ts` ✅ (monitoring)
- `test-openrouter-simple.ts` ✅ (testing)
- `test-full-llm-pipeline.ts` ✅ (testing)
- `setup-database.ts` ✅ (setup)

**Consider Deleting (One-Time Use):**
- `check-current-job.ts` (redundant with check-current-processing)
- `check-table-schema.ts` (one-time diagnostic)
- `check-third-party-services.ts` (one-time diagnostic)
- `comprehensive-diagnostic.ts` (one-time diagnostic)
- `create-job-direct.ts` (testing helper)
- `create-job-for-stuck-document.ts` (one-time fix)
- `create-test-job.ts` (testing helper)
- `diagnose-processing-issues.ts` (one-time diagnostic)
- `diagnose-upload-issues.ts` (one-time diagnostic)
- `fix-table-schema.ts` (one-time fix)
- `mark-stuck-as-failed.ts` (one-time fix)
- `monitor-document-processing.ts` (redundant)
- `monitor-system.ts` (redundant)
- `setup-gcs-permissions.ts` (one-time setup)
- `setup-processing-jobs-table.ts` (one-time setup)
- `test-gcs-integration.ts` (one-time test)
- `test-job-creation.ts` (testing helper)
- `test-linkage.ts` (one-time test)
- `test-llm-processing-offline.ts` (testing)
- `test-openrouter-quick.ts` (redundant with simple)
- `test-postgres-connection.ts` (one-time test)
- `test-production-upload.ts` (one-time test)
- `test-staging-environment.ts` (one-time test)

**Action:** ARCHIVE or DELETE ~18-20 scripts

---

### 📁 Category 4: SHELL SCRIPTS & SQL

**Shell Scripts:**
- `backend/scripts/check-document-status.sh` (shell version; a TS version exists)
- `backend/scripts/sync-firebase-config.sh` (one-time use)
- `backend/scripts/sync-firebase-config.ts` (one-time use)
- `backend/scripts/run-sql-file.js` (utility, keep?)
- `backend/scripts/verify-schema.js` (one-time use)

**SQL Directory:**
- `backend/sql/` (contains migration scripts?)

**Action:** REVIEW - Keep utilities, delete one-time scripts

---

### 📝 Category 5: DOCUMENTATION TO KEEP

**Essential Docs:**
- `README.md` ✅
- `QUICK_START.md` ✅
- `backend/TROUBLESHOOTING_PLAN.md` ✅ (just created)
- `DEPLOYMENT_GUIDE.md` ✅
- `CONFIGURATION_GUIDE.md` ✅
- `DATABASE_SCHEMA_DOCUMENTATION.md` ✅
- `BPCP CIM REVIEW TEMPLATE.md` ✅

**Consider Consolidating:**
- Multiple service `.md` files in `backend/src/services/`
- Multiple component `.md` files in `frontend/src/`

---

## Recommended Action Plan

### Phase 1: Safe Cleanup (No Risk)
```bash
# Delete backup files
rm backend/.env.backup*
rm backend/diagnostic-report.json

# Clear old logs (keep last 7 days)
find backend/logs -name "*.log" -mtime +7 -delete
```

### Phase 2: Remove One-Time Diagnostic Scripts
```bash
cd backend/src/scripts

# Delete one-time diagnostics
rm check-table-schema.ts
rm check-third-party-services.ts
rm comprehensive-diagnostic.ts
rm create-job-direct.ts
rm create-job-for-stuck-document.ts
rm create-test-job.ts
rm diagnose-processing-issues.ts
rm diagnose-upload-issues.ts
rm fix-table-schema.ts
rm mark-stuck-as-failed.ts
rm setup-gcs-permissions.ts
rm setup-processing-jobs-table.ts
rm test-gcs-integration.ts
rm test-job-creation.ts
rm test-linkage.ts
rm test-openrouter-quick.ts
rm test-postgres-connection.ts
rm test-production-upload.ts
rm test-staging-environment.ts
```

### Phase 3: Remove Redundant Documentation
```bash
cd /home/jonathan/Coding/cim_summary

# Delete untracked redundant docs
rm BETTER_APPROACHES.md
rm LLM_ANALYSIS.md
rm IMPLEMENTATION_GUIDE.md

# If DEPLOYMENT_INSTRUCTIONS.md is duplicate:
# rm DEPLOYMENT_INSTRUCTIONS.md
```

### Phase 4: Consolidate Service Documentation
Move documentation into inline code comments instead of keeping separate `.md` files.

---

## Estimated Space Saved

- Backup files: ~10KB
- Diagnostic scripts: ~50-100KB
- Documentation: ~50KB
- Old logs: Variable (could be 100s of KB)

**Total:** ~200-300KB (not huge, but a cleaner project)

---

## Recommendation

**Execute Phase 1 immediately** (safe, no risk)
**Execute Phase 2 after review** (can always recreate scripts)
**Hold Phase 3** until you confirm docs are redundant
**Hold Phase 4** for later refactoring

Would you like me to execute the cleanup?

@@ -1,143 +0,0 @@
# Cleanup Completed - Summary Report

**Date:** $(date)

## ✅ Phase 1: Backup & Temporary Files (COMPLETED)

**Deleted:**
- `backend/.env.backup` (4.1K)
- `backend/.env.backup.20251031_221937` (4.1K)
- `backend/diagnostic-report.json` (1.9K)

**Total:** ~10KB

---

## ✅ Phase 2: One-Time Diagnostic Scripts (COMPLETED)

**Deleted 19 scripts from `backend/src/scripts/`:**
1. check-table-schema.ts
2. check-third-party-services.ts
3. comprehensive-diagnostic.ts
4. create-job-direct.ts
5. create-job-for-stuck-document.ts
6. create-test-job.ts
7. diagnose-processing-issues.ts
8. diagnose-upload-issues.ts
9. fix-table-schema.ts
10. mark-stuck-as-failed.ts
11. setup-gcs-permissions.ts
12. setup-processing-jobs-table.ts
13. test-gcs-integration.ts
14. test-job-creation.ts
15. test-linkage.ts
16. test-openrouter-quick.ts
17. test-postgres-connection.ts
18. test-production-upload.ts
19. test-staging-environment.ts

**Remaining scripts (9):**
- check-current-job.ts
- check-current-processing.ts
- check-database-failures.ts
- monitor-document-processing.ts
- monitor-system.ts
- setup-database.ts
- test-full-llm-pipeline.ts
- test-llm-processing-offline.ts
- test-openrouter-simple.ts

**Total:** ~100KB

---

## ✅ Phase 3: Redundant Documentation & Scripts (COMPLETED)

**Deleted Documentation:**
- BETTER_APPROACHES.md
- LLM_ANALYSIS.md
- IMPLEMENTATION_GUIDE.md
- DOCUMENT_AUDIT_GUIDE.md
- DEPLOYMENT_INSTRUCTIONS.md (duplicate)

**Deleted Backend Docs:**
- backend/MIGRATION_GUIDE.md
- backend/PERFORMANCE_OPTIMIZATION_OPTIONS.md

**Deleted Shell Scripts:**
- backend/scripts/check-document-status.sh
- backend/scripts/sync-firebase-config.sh
- backend/scripts/sync-firebase-config.ts
- backend/scripts/verify-schema.js
- backend/scripts/run-sql-file.js

**Total:** ~50KB

---

## ✅ Phase 4: Old Log Files (COMPLETED)

**Deleted logs older than 7 days:**
- backend/logs/upload.log (0 bytes, Aug 2)
- backend/logs/app.log (39K, Aug 14)
- backend/logs/exceptions.log (26K, Aug 15)
- backend/logs/rejections.log (0 bytes, Aug 15)

**Total:** ~65KB

**Logs directory size after cleanup:** 620K

---

## 📊 Summary Statistics

| Category | Files Deleted | Space Saved |
|----------|---------------|-------------|
| Backups & Temp | 3 | ~10KB |
| Diagnostic Scripts | 19 | ~100KB |
| Documentation | 7 | ~50KB |
| Shell Scripts | 5 | ~10KB |
| Old Logs | 4 | ~65KB |
| **TOTAL** | **38** | **~235KB** |

---

## 🎯 What Remains

### Essential Scripts (9):
- Database checks and monitoring
- LLM testing and pipeline tests
- Database setup

### Essential Documentation:
- README.md
- QUICK_START.md
- DEPLOYMENT_GUIDE.md
- CONFIGURATION_GUIDE.md
- DATABASE_SCHEMA_DOCUMENTATION.md
- backend/TROUBLESHOOTING_PLAN.md
- BPCP CIM REVIEW TEMPLATE.md

### Reference Materials (Kept):
- `backend/sql/` directory (migration scripts for reference)
- Service documentation (.md files in src/services/)
- Recent logs (< 7 days old)

---

## ✨ Project Status After Cleanup

**Project is now:**
- ✅ Leaner (38 fewer files)
- ✅ More maintainable (removed one-time scripts)
- ✅ Better organized (removed duplicate docs)
- ✅ Complete (all essential utilities and documentation kept)

**Next recommended actions:**
1. Commit these changes to git
2. Review remaining 9 scripts - consolidate if needed
3. Consider archiving `backend/sql/` to a separate repo if not needed

---

**Cleanup completed successfully!**

@@ -1,345 +0,0 @@
|
||||
# Code Summary Template
|
||||
## Standardized Documentation Format for LLM Agent Understanding
|
||||
|
||||
### 📋 Template Usage
|
||||
Use this template to document individual files, services, or components. This format is optimized for LLM coding agents to quickly understand code structure, purpose, and implementation details.
|
||||
|
||||
---
|
||||
|
||||
## 📄 File Information
|
||||
|
||||
**File Path**: `[relative/path/to/file]`
|
||||
**File Type**: `[TypeScript/JavaScript/JSON/etc.]`
|
||||
**Last Updated**: `[YYYY-MM-DD]`
|
||||
**Version**: `[semantic version]`
|
||||
**Status**: `[Active/Deprecated/In Development]`
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Purpose & Overview
|
||||
|
||||
**Primary Purpose**: `[What this file/service does in one sentence]`
|
||||
|
||||
**Business Context**: `[Why this exists, what problem it solves]`
|
||||
|
||||
**Key Responsibilities**:
|
||||
- `[Responsibility 1]`
|
||||
- `[Responsibility 2]`
|
||||
- `[Responsibility 3]`
|
||||
|
||||
---
|
||||
|
||||
## 🏗️ Architecture & Dependencies
|
||||
|
||||
### Dependencies
|
||||
**Internal Dependencies**:
|
||||
- `[service1.ts]` - `[purpose of dependency]`
|
||||
- `[service2.ts]` - `[purpose of dependency]`
|
||||
|
||||
**External Dependencies**:
|
||||
- `[package-name]` - `[version]` - `[purpose]`
|
||||
- `[API service]` - `[purpose]`
|
||||
|
||||
### Integration Points
|
||||
- **Input Sources**: `[Where data comes from]`
|
||||
- **Output Destinations**: `[Where data goes]`
|
||||
- **Event Triggers**: `[What triggers this service]`
|
||||
- **Event Listeners**: `[What this service triggers]`
|
||||
|
||||
---
|
||||
|
||||
## 🔧 Implementation Details
|
||||
|
||||
### Core Functions/Methods
|
||||
|
||||
#### `[functionName]`
|
||||
```typescript
|
||||
/**
|
||||
* @purpose [What this function does]
|
||||
* @context [When/why it's called]
|
||||
* @inputs [Parameter types and descriptions]
|
||||
* @outputs [Return type and format]
|
||||
* @dependencies [What it depends on]
|
||||
* @errors [Possible errors and conditions]
|
||||
* @complexity [Time/space complexity if relevant]
|
||||
*/
|
||||
```
|
||||
|
||||
**Example Usage**:
|
||||
```typescript
|
||||
// Example of how to use this function
|
||||
const result = await functionName(input);
|
||||
```
|
||||
|
||||
### Data Structures
|
||||
|
||||
#### `[TypeName]`
|
||||
```typescript
|
||||
interface TypeName {
|
||||
property1: string; // Description of property1
|
||||
property2: number; // Description of property2
|
||||
property3?: boolean; // Optional description of property3
|
||||
}
|
||||
```
|
||||
|
||||
### Configuration
|
||||
```typescript
|
||||
// Key configuration options
|
||||
const CONFIG = {
|
||||
timeout: 30000, // Request timeout in ms
|
||||
retryAttempts: 3, // Number of retry attempts
|
||||
batchSize: 10, // Batch processing size
|
||||
};
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📊 Data Flow
|
||||
|
||||
### Input Processing
|
||||
1. `[Step 1 description]`
|
||||
2. `[Step 2 description]`
|
||||
3. `[Step 3 description]`
|
||||
|
||||
### Output Generation
|
||||
1. `[Step 1 description]`
|
||||
2. `[Step 2 description]`
|
||||
3. `[Step 3 description]`
|
||||
|
||||
### Data Transformations
|
||||
- `[Input Type]` → `[Transformation]` → `[Output Type]`
|
||||
- `[Input Type]` → `[Transformation]` → `[Output Type]`
|
||||
|
||||
---
|
||||
|
||||
## 🚨 Error Handling
|
||||
|
||||
### Error Types
|
||||
```typescript
|
||||
/**
|
||||
* @errorType VALIDATION_ERROR
|
||||
* @description [What causes this error]
|
||||
* @recoverable [true/false]
|
||||
* @retryStrategy [retry approach]
|
||||
* @userMessage [Message shown to user]
|
||||
*/
|
||||
|
||||
/**
|
||||
* @errorType PROCESSING_ERROR
|
||||
* @description [What causes this error]
|
||||
* @recoverable [true/false]
|
||||
* @retryStrategy [retry approach]
|
||||
* @userMessage [Message shown to user]
|
||||
*/
|
||||
```
|
||||
|
||||
### Error Recovery
|
||||
- **Validation Errors**: `[How validation errors are handled]`
|
||||
- **Processing Errors**: `[How processing errors are handled]`
|
||||
- **System Errors**: `[How system errors are handled]`
|
||||
|
||||
### Fallback Strategies
|
||||
- **Primary Strategy**: `[Main approach]`
|
||||
- **Fallback Strategy**: `[Backup approach]`
|
||||
- **Degradation Strategy**: `[Graceful degradation]`
|
||||
|
||||
---
|
||||
|
||||
## 🧪 Testing
|
||||
|
||||
### Test Coverage
|
||||
- **Unit Tests**: `[Coverage percentage]` - `[What's tested]`
|
||||
- **Integration Tests**: `[Coverage percentage]` - `[What's tested]`
|
||||
- **Performance Tests**: `[What performance aspects are tested]`
|
||||
|
||||
### Test Data
|
||||
```typescript
|
||||
/**
|
||||
* @testData [test data name]
|
||||
* @description [Description of test data]
|
||||
* @size [Size if relevant]
|
||||
* @expectedOutput [What should be produced]
|
||||
*/
|
||||
```
|
||||
|
||||
### Mock Strategy
|
||||
- **External APIs**: `[How external APIs are mocked]`
|
||||
- **Database**: `[How database is mocked]`
|
||||
- **File System**: `[How file system is mocked]`
|
||||
|
||||
---
|
||||
|
||||
## 📈 Performance Characteristics
|
||||
|
||||
### Performance Metrics
|
||||
- **Average Response Time**: `[time]`
|
||||
- **Memory Usage**: `[memory]`
|
||||
- **CPU Usage**: `[CPU]`
|
||||
- **Throughput**: `[requests per second]`
|
||||
|
||||
### Optimization Strategies
|
||||
- **Caching**: `[Caching approach]`
|
||||
- **Batching**: `[Batching strategy]`
|
||||
- **Parallelization**: `[Parallel processing]`
|
||||
- **Resource Management**: `[Resource optimization]`
|
||||
|
||||
### Scalability Limits
|
||||
- **Concurrent Requests**: `[limit]`
|
||||
- **Data Size**: `[limit]`
|
||||
- **Rate Limits**: `[limits]`
|
||||
|
||||
---
|
||||
|
||||
## 🔍 Debugging & Monitoring
|
||||
|
||||
### Logging
|
||||
```typescript
|
||||
/**
|
||||
* @logging [Logging configuration]
|
||||
* @levels [Log levels used]
|
||||
* @correlation [Correlation ID strategy]
|
||||
* @context [Context information logged]
|
||||
*/
|
||||
```
|
||||
|
||||
### Debug Tools
|
||||
- **Health Checks**: `[Health check endpoints]`
|
||||
- **Metrics**: `[Performance metrics]`
|
||||
- **Tracing**: `[Request tracing]`
|
||||
|
||||
### Common Issues
|
||||
1. **Issue 1**: `[Description]` - `[Solution]`
|
||||
2. **Issue 2**: `[Description]` - `[Solution]`
|
||||
3. **Issue 3**: `[Description]` - `[Solution]`
|
||||
|
||||
---
|
||||
|
||||
## 🔐 Security Considerations
|
||||
|
||||
### Input Validation
|
||||
- **File Types**: `[Allowed file types]`
|
||||
- **File Size**: `[Size limits]`
|
||||
- **Content Validation**: `[Content checks]`
|
||||
|
||||
### Authentication & Authorization
|
||||
- **Authentication**: `[How authentication is handled]`
|
||||
- **Authorization**: `[How authorization is handled]`
|
||||
- **Data Isolation**: `[How data is isolated]`
|
||||
|
||||
### Data Protection
|
||||
- **Encryption**: `[Encryption approach]`
|
||||
- **Sanitization**: `[Data sanitization]`
|
||||
- **Audit Logging**: `[Audit trail]`
|
||||
|
||||
---
|
||||
|
||||
## 📚 Related Documentation
|
||||
|
||||
### Internal References
|
||||
- `[related-file1.ts]` - `[relationship]`
|
||||
- `[related-file2.ts]` - `[relationship]`
|
||||
- `[related-file3.ts]` - `[relationship]`
|
||||
|
||||
### External References
|
||||
- `[API Documentation]` - `[URL]`
|
||||
- `[Library Documentation]` - `[URL]`
|
||||
- `[Architecture Documentation]` - `[URL]`
|
||||
|
||||
---
|
||||
|
||||
## 🔄 Change History
|
||||
|
||||
### Recent Changes
|
||||
- `[YYYY-MM-DD]` - `[Change description]` - `[Author]`
|
||||
- `[YYYY-MM-DD]` - `[Change description]` - `[Author]`
|
||||
- `[YYYY-MM-DD]` - `[Change description]` - `[Author]`
|
||||
|
||||
### Planned Changes
|
||||
- `[Future change 1]` - `[Target date]`
|
||||
- `[Future change 2]` - `[Target date]`
|
||||
|
||||
---
|
||||
|
||||
## 📋 Usage Examples

### Basic Usage

```typescript
// Basic example of how to use this service
import { ServiceName } from './serviceName';

const service = new ServiceName();
const result = await service.processData(input);
```

### Advanced Usage

```typescript
// Advanced example with configuration
import { ServiceName } from './serviceName';

const service = new ServiceName({
  timeout: 60000,
  retryAttempts: 5,
  batchSize: 20
});

const results = await service.processBatch(dataArray);
```

### Error Handling

```typescript
// Example of error handling
try {
  const result = await service.processData(input);
} catch (error) {
  if (error.type === 'VALIDATION_ERROR') {
    // Handle validation error
  } else if (error.type === 'PROCESSING_ERROR') {
    // Handle processing error
  }
}
```

---
## 🎯 LLM Agent Notes

### Key Understanding Points

- `[Important concept 1]`
- `[Important concept 2]`
- `[Important concept 3]`

### Common Modifications

- `[Common change 1]` - `[How to implement]`
- `[Common change 2]` - `[How to implement]`

### Integration Patterns

- `[Integration pattern 1]` - `[When to use]`
- `[Integration pattern 2]` - `[When to use]`

---
## 📝 Template Usage Instructions

### For New Files

1. Copy this template
2. Fill in all sections with relevant information
3. Remove sections that don't apply
4. Add sections specific to your file type
5. Update the file information header

### For Existing Files

1. Use this template to document existing code
2. Focus on the most important sections first
3. Add examples and usage patterns
4. Include error scenarios and solutions
5. Document performance characteristics

### Maintenance

- Update this documentation when code changes
- Keep examples current and working
- Review and update performance metrics regularly
- Maintain change history for significant updates

---

This template ensures consistent, comprehensive documentation that LLM agents can quickly parse and understand, leading to more accurate code evaluation and modification suggestions.
@@ -1,355 +0,0 @@
# Document AI + Agentic RAG Integration Guide

## Overview

This guide explains how to integrate Google Cloud Document AI with Agentic RAG for enhanced CIM document processing. This approach provides superior text extraction and structured analysis compared to traditional PDF parsing.

## 🎯 **Benefits of Document AI + Agentic RAG**

### **Document AI Advantages:**

- **Superior text extraction** from complex PDF layouts
- **Table structure preservation** with accurate cell relationships
- **Entity recognition** for financial data, dates, amounts
- **Layout understanding** maintains document structure
- **Multi-format support** (PDF, images, scanned documents)

### **Agentic RAG Advantages:**

- **Structured AI workflows** with type safety
- **Map-reduce processing** for large documents
- **Timeout handling** and error recovery
- **Cost optimization** with intelligent chunking
- **Consistent output formatting** with Zod schemas

## 🔧 **Setup Requirements**

### **1. Google Cloud Configuration**

```bash
# Environment variables to add to your .env file
GCLOUD_PROJECT_ID=cim-summarizer
DOCUMENT_AI_LOCATION=us
DOCUMENT_AI_PROCESSOR_ID=your-processor-id
GCS_BUCKET_NAME=cim-summarizer-uploads
DOCUMENT_AI_OUTPUT_BUCKET_NAME=cim-summarizer-document-ai-output
```
### **2. Google Cloud Services Setup**

```bash
# Enable required APIs
gcloud services enable documentai.googleapis.com
gcloud services enable storage.googleapis.com

# Create a Document AI OCR processor ("CIM Document Processor").
# Processor creation is not exposed by the gcloud CLI; create it in the
# Cloud Console (Document AI > Processors) or via the Document AI API,
# then put its ID in DOCUMENT_AI_PROCESSOR_ID.

# Create GCS buckets
gsutil mb gs://cim-summarizer-uploads
gsutil mb gs://cim-summarizer-document-ai-output
```
### **3. Service Account Permissions**

```bash
# Create service account with required roles
gcloud iam service-accounts create cim-document-processor \
  --display-name="CIM Document Processor"

# Grant necessary permissions
gcloud projects add-iam-policy-binding cim-summarizer \
  --member="serviceAccount:cim-document-processor@cim-summarizer.iam.gserviceaccount.com" \
  --role="roles/documentai.apiUser"

gcloud projects add-iam-policy-binding cim-summarizer \
  --member="serviceAccount:cim-document-processor@cim-summarizer.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
```
## 📦 **Dependencies**

Add these to your `package.json`:

```json
{
  "dependencies": {
    "@google-cloud/documentai": "^8.0.0",
    "@google-cloud/storage": "^7.0.0",
    "zod": "^3.25.76"
  }
}
```
## 🔄 **Integration with Existing System**

### **1. Processing Strategy Selection**

Your system now supports 5 processing strategies:

```typescript
type ProcessingStrategy =
  | 'chunking'                  // Traditional chunking approach
  | 'rag'                       // Retrieval-Augmented Generation
  | 'agentic_rag'               // Multi-agent RAG system
  | 'optimized_agentic_rag'     // Optimized multi-agent system
  | 'document_ai_agentic_rag';  // Document AI + Agentic RAG (NEW)
```

### **2. Environment Configuration**

Update your environment configuration:

```typescript
// In backend/src/config/env.ts
const envSchema = Joi.object({
  // ... existing config

  // Google Cloud Document AI Configuration
  GCLOUD_PROJECT_ID: Joi.string().default('cim-summarizer'),
  DOCUMENT_AI_LOCATION: Joi.string().default('us'),
  DOCUMENT_AI_PROCESSOR_ID: Joi.string().allow('').optional(),
  GCS_BUCKET_NAME: Joi.string().default('cim-summarizer-uploads'),
  DOCUMENT_AI_OUTPUT_BUCKET_NAME: Joi.string().default('cim-summarizer-document-ai-output'),
});
```
### **3. Strategy Selection**

```typescript
// Set as the default strategy via your .env file:
//   PROCESSING_STRATEGY=document_ai_agentic_rag

// Or select per document:
const result = await unifiedDocumentProcessor.processDocument(
  documentId,
  userId,
  text,
  { strategy: 'document_ai_agentic_rag' }
);
```
## 🚀 **Usage Examples**

### **1. Basic Document Processing**

```typescript
import { processCimDocumentServerAction } from './documentAiProcessor';

const result = await processCimDocumentServerAction({
  fileDataUri: 'data:application/pdf;base64,JVBERi0xLjc...',
  fileName: 'investment-memo.pdf'
});

console.log(result.markdownOutput);
```

### **2. Integration with Existing Controller**

```typescript
// In your document controller
export const documentController = {
  async uploadDocument(req: Request, res: Response): Promise<void> {
    // ... existing upload logic

    // Use Document AI + Agentic RAG strategy
    const processingOptions = {
      strategy: 'document_ai_agentic_rag',
      enableTableExtraction: true,
      enableEntityRecognition: true
    };

    const result = await unifiedDocumentProcessor.processDocument(
      document.id,
      userId,
      extractedText,
      processingOptions
    );
  }
};
```

### **3. Strategy Comparison**

```typescript
// Compare all strategies
const comparison = await unifiedDocumentProcessor.compareProcessingStrategies(
  documentId,
  userId,
  text,
  { includeDocumentAiAgenticRag: true }
);

console.log('Best strategy:', comparison.winner);
console.log('Document AI + Agentic RAG result:', comparison.documentAiAgenticRag);
```
## 📊 **Performance Comparison**

### **Expected Performance Metrics:**

| Strategy | Processing Time | API Calls | Quality Score | Cost |
|----------|----------------|-----------|---------------|------|
| Chunking | 3-5 minutes | 9-12 | 7/10 | $2-3 |
| RAG | 2-3 minutes | 6-8 | 8/10 | $1.5-2 |
| Agentic RAG | 4-6 minutes | 15-20 | 9/10 | $3-4 |
| **Document AI + Agentic RAG** | **1-2 minutes** | **1-2** | **9.5/10** | **$1-1.5** |

### **Key Advantages:**

- **50-70% faster** than traditional chunking (1-2 minutes vs. 3-5 minutes)
- **~90% fewer API calls** than agentic RAG (1-2 vs. 15-20)
- **Superior text extraction** with table preservation
- **Lower costs** with better quality
## 🔍 **Error Handling**

### **Common Issues and Solutions:**

```typescript
// 1. Document AI Processing Errors
try {
  const result = await processCimDocumentServerAction(input);
} catch (error) {
  if (error.message.includes('Document AI')) {
    // Fallback to traditional processing
    return await fallbackToTraditionalProcessing(input);
  }
}

// 2. Agentic RAG Flow Timeouts
const TIMEOUT_DURATION_FLOW = 1800000;   // 30 minutes
const TIMEOUT_DURATION_ACTION = 2100000; // 35 minutes

// 3. GCS Cleanup Failures
try {
  await cleanupGCSFiles(gcsFilePath);
} catch (cleanupError) {
  logger.warn('GCS cleanup failed, but processing succeeded', cleanupError);
  // Continue with success response
}
```
## 🧪 **Testing**

### **1. Unit Tests**

```typescript
// Test Document AI + Agentic RAG processor
describe('DocumentAiProcessor', () => {
  it('should process CIM document successfully', async () => {
    const processor = new DocumentAiProcessor();
    const result = await processor.processDocument(
      'test-doc-id',
      'test-user-id',
      Buffer.from('test content'),
      'test.pdf',
      'application/pdf'
    );

    expect(result.success).toBe(true);
    expect(result.content).toContain('<START_WORKSHEET>');
  });
});
```

### **2. Integration Tests**

```typescript
// Test full pipeline
describe('Document AI + Agentic RAG Integration', () => {
  it('should process real CIM document', async () => {
    const fileDataUri = await loadTestPdfAsDataUri();
    const result = await processCimDocumentServerAction({
      fileDataUri,
      fileName: 'test-cim.pdf'
    });

    expect(result.markdownOutput).toMatch(/Investment Summary/);
    expect(result.markdownOutput).toMatch(/Financial Metrics/);
  });
});
```
## 🔒 **Security Considerations**

### **1. File Validation**

```typescript
// Validate file types and sizes
const allowedMimeTypes = [
  'application/pdf',
  'image/jpeg',
  'image/png',
  'image/tiff'
];

const maxFileSize = 50 * 1024 * 1024; // 50MB
```
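A minimal guard combining these constants might look like the sketch below; `validateUpload` is an illustrative name, not an existing function in the codebase:

```typescript
const allowedMimeTypes = ['application/pdf', 'image/jpeg', 'image/png', 'image/tiff'];
const maxFileSize = 50 * 1024 * 1024; // 50MB

// Returns null when the upload is acceptable, otherwise a rejection reason.
function validateUpload(mimeType: string, sizeBytes: number): string | null {
  if (!allowedMimeTypes.includes(mimeType)) {
    return `Unsupported file type: ${mimeType}`;
  }
  if (sizeBytes > maxFileSize) {
    return `File too large: ${sizeBytes} bytes (max ${maxFileSize})`;
  }
  return null;
}
```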
### **2. GCS Security**

```typescript
// Use signed URLs for temporary access (getSignedUrl resolves to an array)
const [signedUrl] = await bucket.file(fileName).getSignedUrl({
  action: 'read',
  expires: Date.now() + 15 * 60 * 1000, // 15 minutes
});
```

### **3. Service Account Permissions**

```bash
# Follow principle of least privilege
gcloud projects add-iam-policy-binding cim-summarizer \
  --member="serviceAccount:cim-document-processor@cim-summarizer.iam.gserviceaccount.com" \
  --role="roles/documentai.apiUser"
```
## 📈 **Monitoring and Analytics**

### **1. Performance Tracking**

```typescript
// Track processing metrics
const metrics = {
  processingTime: Date.now() - startTime,
  fileSize: fileBuffer.length,
  extractedTextLength: combinedExtractedText.length,
  documentAiEntities: fullDocumentAiOutput.entities?.length || 0,
  documentAiTables: fullDocumentAiOutput.tables?.length || 0
};
```

### **2. Error Monitoring**

```typescript
// Log detailed error information
logger.error('Document AI + Agentic RAG processing failed', {
  documentId,
  error: error.message,
  stack: error.stack,
  documentAiOutput: fullDocumentAiOutput,
  processingTime: Date.now() - startTime
});
```
## 🎯 **Next Steps**

1. **Set up Google Cloud project** with Document AI and GCS
2. **Configure environment variables** with your project details
3. **Test with sample CIM documents** to validate extraction quality
4. **Compare performance** with existing strategies
5. **Gradually migrate** from chunking to Document AI + Agentic RAG
6. **Monitor costs and performance** in production

## 📞 **Support**

For issues with:

- **Google Cloud setup**: Check Google Cloud documentation
- **Document AI**: Review processor configuration and permissions
- **Agentic RAG integration**: Verify API keys and model configuration
- **Performance**: Monitor logs and adjust timeout settings

This integration provides a significant upgrade to your CIM processing capabilities with better quality, faster processing, and lower costs.
@@ -1,506 +0,0 @@
# Financial Data Extraction Issue: Root Cause Analysis & Solution

## Executive Summary

**Problem**: Financial data showing "Not specified in CIM" even when tables exist in the PDF.

**Root Cause**: Document AI's structured table data is being **completely ignored** in favor of flattened text, causing the parser to fail.

**Impact**: ~80-90% of financial tables fail to parse correctly.

---
## Current Pipeline Analysis

### Stage 1: Document AI Processing ✅ (Working but underutilized)

```typescript
// documentAiProcessor.ts:408-482
private async processWithDocumentAI() {
  const [result] = await this.documentAiClient.processDocument(request);
  const { document } = result;

  // ✅ Extracts structured tables
  const tables = document.pages?.flatMap(page =>
    page.tables?.map(table => ({
      rows: table.headerRows?.length || 0,              // ❌ Only counting!
      columns: table.bodyRows?.[0]?.cells?.length || 0  // ❌ Not using!
    }))
  );

  // ❌ PROBLEM: Only returns flat text, throws away table structure
  return { text: document.text, entities, tables, pages };
}
```

**What Document AI Actually Provides:**

- `document.pages[].tables[]` - Fully structured tables with:
  - `headerRows[]` - Column headers with cell text via layout anchors
  - `bodyRows[]` - Data rows with aligned cell values
  - `layout` - Text positions in the original document
  - `cells[]` - Individual cell data with rowSpan/colSpan

**What We're Using:** Only `document.text` (flattened)

---
### Stage 2: Text Extraction ❌ (Losing structure)

```typescript
// documentAiProcessor.ts:151-207
const extractedText = await this.extractTextFromDocument(fileBuffer, fileName, mimeType);
// Returns: "FY-3 FY-2 FY-1 LTM Revenue $45.2M $52.8M $61.2M $58.5M EBITDA $8.5M..."
// Lost: Column alignment, row structure, table boundaries
```

**Original PDF Table:**

```
                FY-3    FY-2    FY-1    LTM
Revenue         $45.2M  $52.8M  $61.2M  $58.5M
Revenue Growth  N/A     16.8%   15.9%   (4.4)%
EBITDA          $8.5M   $10.2M  $12.1M  $11.5M
EBITDA Margin   18.8%   19.3%   19.8%   19.7%
```

**What Parser Receives (flattened):**

```
FY-3 FY-2 FY-1 LTM Revenue $45.2M $52.8M $61.2M $58.5M Revenue Growth N/A 16.8% 15.9% (4.4)% EBITDA $8.5M $10.2M $12.1M $11.5M EBITDA Margin 18.8% 19.3% 19.8% 19.7%
```

---
### Stage 3: Deterministic Parser ❌ (Fighting lost structure)

```typescript
// financialTableParser.ts:181-406
export function parseFinancialsFromText(fullText: string): ParsedFinancials {
  // 1. Find header line with year tokens (FY-3, FY-2, etc.)
  //    ❌ PROBLEM: Years might be on different lines now

  // 2. Look for revenue/EBITDA rows within 20 lines
  //    ❌ PROBLEM: Row detection works, but...

  // 3. Extract numeric tokens and assign to columns
  //    ❌ PROBLEM: Can't determine which number belongs to which column!
  //    Numbers are just in sequence: $45.2M $52.8M $61.2M $58.5M
  //    Are these revenues for FY-3, FY-2, FY-1, LTM? Or something else?

  // Result: Returns empty {} or incorrect mappings
}
```

**Failure Points:**

1. **Header Detection** (lines 197-278): Requires period tokens in ONE line
   - Flattened text scatters tokens across multiple lines
   - Scoring system can't find tables with both revenue AND EBITDA

2. **Column Alignment** (lines 160-179): Assumes tokens map to buckets by position
   - No way to know which token belongs to which column
   - Whitespace-based alignment is lost

3. **Multi-line Tables**: Financial tables often span multiple lines per row
   - Parser combines 2-3 lines but still can't reconstruct columns

---
### Stage 4: LLM Extraction ⚠️ (Limited context)

```typescript
// optimizedAgenticRAGProcessor.ts:1552-1641
private async extractWithTargetedQuery() {
  // 1. RAG selects ~7 most relevant chunks
  // 2. Each chunk truncated to 1500 chars
  // 3. Total context: ~10,500 chars

  // ❌ PROBLEM: Financial tables might be:
  //    - Split across multiple chunks
  //    - Not in the top 7 most "similar" chunks
  //    - Truncated mid-table
  //    - Still in flattened format anyway
}
```

---
## Unused Assets

### 1. Document AI Table Structure (BIGGEST MISS)

**Location**: Available in Document AI response but never used

**What It Provides:**

```typescript
document.pages[0].tables[0] = {
  layout: { /* table position */ },
  headerRows: [{
    cells: [
      { layout: { textAnchor: { start: 123, end: 127 } } }, // "FY-3"
      { layout: { textAnchor: { start: 135, end: 139 } } }, // "FY-2"
      // ...
    ]
  }],
  bodyRows: [{
    cells: [
      { layout: { textAnchor: { start: 200, end: 207 } } }, // "Revenue"
      { layout: { textAnchor: { start: 215, end: 222 } } }, // "$45.2M"
      { layout: { textAnchor: { start: 230, end: 237 } } }, // "$52.8M"
      // ...
    ]
  }]
}
```

**How to Use:**

```typescript
function getTableText(layout, documentText) {
  const start = layout.textAnchor.textSegments[0].startIndex;
  const end = layout.textAnchor.textSegments[0].endIndex;
  return documentText.substring(start, end);
}
```
### 2. Financial Extractor Utility

**Location**: `src/utils/financialExtractor.ts` (lines 1-159)

**Features:**

- Robust column splitting: `/\s{2,}|\t/` (2+ spaces or tabs)
- Clean value parsing with K/M/B multipliers
- Percentage and negative number handling
- Better than current parser but still works on flat text

**Status**: Never imported or used anywhere in the codebase

---
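The features listed above can be sketched as standalone helpers; `parseFinancialValue` and `splitColumns` are illustrative names, not the actual `financialExtractor.ts` exports:

```typescript
// Parses "$45.2M", "(4.4)%", "N/A" etc. into numbers (percent values
// keep their numeric magnitude; the % unit is dropped).
function parseFinancialValue(raw: string): number | null {
  const trimmed = raw.trim();
  if (!trimmed || /^(n\/a|na|-+)$/i.test(trimmed)) return null;

  // Financial notation uses parentheses for negatives: (4.4)% -> -4.4
  const negative = /^\(.*\)%?$/.test(trimmed);
  const cleaned = trimmed.replace(/[$,()%\s]/g, '');

  const match = cleaned.match(/^(-?\d+(?:\.\d+)?)([kmb])?$/i);
  if (!match) return null;

  const multipliers: Record<string, number> = { k: 1e3, m: 1e6, b: 1e9 };
  const value = parseFloat(match[1]) * (multipliers[(match[2] || '').toLowerCase()] ?? 1);
  return negative ? -value : value;
}

// Columns are delimited by 2+ spaces or a tab, as described above.
function splitColumns(line: string): string[] {
  return line.trim().split(/\s{2,}|\t/);
}
```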
## Root Cause Summary

| Issue | Impact | Severity |
|-------|--------|----------|
| Document AI table structure ignored | 100% structure loss | 🔴 CRITICAL |
| Only flat text used for parsing | Parser can't align columns | 🔴 CRITICAL |
| financialExtractor.ts not used | Missing better parsing logic | 🟡 MEDIUM |
| RAG chunks miss complete tables | LLM has incomplete data | 🟡 MEDIUM |
| No table-aware chunking | Financial sections fragmented | 🟡 MEDIUM |

---
## Baseline Measurements & Instrumentation

Before changing the pipeline, capture hard numbers so we can prove the fix works and spot remaining gaps. Add the following telemetry to the processing result (also referenced in `IMPLEMENTATION_PLAN.md`):

```typescript
metadata: {
  tablesFound: structuredTables.length,
  financialTablesIdentified: structuredTables.filter(isFinancialTable).length,
  structuredParsingUsed: Boolean(deterministicFinancialsFromTables),
  textParsingFallback: !deterministicFinancialsFromTables,
  financialDataPopulated: hasPopulatedFinancialSummary(result)
}
```

**Baseline checklist (run on ≥20 recent CIM uploads):**

1. Count how many documents have `tablesFound > 0` but `financialDataPopulated === false`.
2. Record the average/median `tablesFound`, `financialTablesIdentified`, and current financial fill rate.
3. Log sample `documentId`s where `tablesFound === 0` (helps scope Phase 3 hybrid work).

Paste the aggregated numbers back into this doc so Success Metrics are grounded in actual data rather than estimates.

---
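Once the telemetry lands, the checklist can be scripted; a hypothetical aggregation over exported metadata records (the record shape mirrors the fields above, the function name is illustrative):

```typescript
interface ProcessingTelemetry {
  documentId: string;
  tablesFound: number;
  financialTablesIdentified: number;
  financialDataPopulated: boolean;
}

function summarizeBaseline(records: ProcessingTelemetry[]) {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / (xs.length || 1);

  return {
    total: records.length,
    // Checklist item 1: tables detected but no financial data extracted
    tablesButNoDataCount: records.filter(
      r => r.tablesFound > 0 && !r.financialDataPopulated
    ).length,
    // Checklist item 2: averages and current fill rate
    avgTablesFound: avg(records.map(r => r.tablesFound)),
    avgFinancialTables: avg(records.map(r => r.financialTablesIdentified)),
    fillRate:
      records.filter(r => r.financialDataPopulated).length / (records.length || 1),
    // Checklist item 3: documents with no structured tables at all
    noTableDocumentIds: records.filter(r => r.tablesFound === 0).map(r => r.documentId)
  };
}
```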
## Recommended Solution Architecture

### Phase 1: Use Document AI Table Structure (HIGHEST IMPACT)

**Implementation:**

```typescript
// NEW: documentAiProcessor.ts
interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: { page: number; confidence: number };
}

private extractStructuredTables(document: any, text: string): StructuredTable[] {
  const tables: StructuredTable[] = [];

  for (const page of document.pages || []) {
    for (const table of page.tables || []) {
      // Extract headers
      const headers = table.headerRows?.[0]?.cells?.map(cell =>
        this.getTextFromLayout(cell.layout, text)
      ) || [];

      // Extract data rows
      const rows = table.bodyRows?.map(row =>
        row.cells.map(cell => this.getTextFromLayout(cell.layout, text))
      ) || [];

      tables.push({ headers, rows, position: { page: page.pageNumber, confidence: 0.9 } });
    }
  }

  return tables;
}

private getTextFromLayout(layout: any, documentText: string): string {
  const segments = layout.textAnchor?.textSegments || [];
  if (segments.length === 0) return '';

  const start = parseInt(segments[0].startIndex || '0');
  const end = parseInt(segments[0].endIndex || documentText.length.toString());

  return documentText.substring(start, end).trim();
}
```

**Return Enhanced Output:**

```typescript
interface DocumentAIOutput {
  text: string;
  entities: Array<any>;
  tables: StructuredTable[]; // ✅ Now usable!
  pages: Array<any>;
  mimeType: string;
}
```
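The text-anchor slicing above can be verified in isolation with a toy layout; the document text and indices here are made up for illustration:

```typescript
// Standalone copy of the slicing logic from getTextFromLayout
function getTextFromLayout(layout: any, documentText: string): string {
  const segments = layout?.textAnchor?.textSegments || [];
  if (segments.length === 0) return '';

  const start = parseInt(segments[0].startIndex || '0', 10);
  const end = parseInt(segments[0].endIndex || String(documentText.length), 10);

  return documentText.substring(start, end).trim();
}

// "$45.2M" occupies indices 11-16 of this toy document text
const docText = 'Q1 Revenue $45.2M';
const cellLayout = { textAnchor: { textSegments: [{ startIndex: '11', endIndex: '17' }] } };
const cellText = getTextFromLayout(cellLayout, docText); // "$45.2M"
```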
### Phase 2: Financial Table Classifier

**Purpose**: Identify which tables are financial data

```typescript
// NEW: services/financialTableClassifier.ts
export function isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const firstRowText = table.rows[0]?.join(' ').toLowerCase() || '';

  // Check for year/period indicators
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd/.test(headerText);

  // Check for financial metrics
  const hasMetrics = /(revenue|ebitda|sales|profit|margin|cash flow)/i.test(
    table.rows.slice(0, 5).join(' ')
  );

  // Check for currency values
  const hasCurrency = /\$[\d,]+|\d+[km]|\d+\.\d+%/.test(firstRowText);

  return hasPeriods && (hasMetrics || hasCurrency);
}
```
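A quick standalone check of the classifier; the function body is repeated verbatim so the example runs on its own, and both sample tables are made up:

```typescript
interface StructuredTable {
  headers: string[];
  rows: string[][];
}

function isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const firstRowText = table.rows[0]?.join(' ').toLowerCase() || '';

  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd/.test(headerText);
  const hasMetrics = /(revenue|ebitda|sales|profit|margin|cash flow)/i.test(
    table.rows.slice(0, 5).join(' ')
  );
  const hasCurrency = /\$[\d,]+|\d+[km]|\d+\.\d+%/.test(firstRowText);

  return hasPeriods && (hasMetrics || hasCurrency);
}

// Period headers + currency values -> classified as financial
const financials: StructuredTable = {
  headers: ['', 'FY-3', 'FY-2', 'FY-1', 'LTM'],
  rows: [['Revenue', '$45.2M', '$52.8M', '$61.2M', '$58.5M']]
};

// No period tokens in the header -> not financial
const roster: StructuredTable = {
  headers: ['Name', 'Team'],
  rows: [['Alice', 'Platform']]
};
```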
### Phase 3: Enhanced Financial Parser

**Use structured tables instead of flat text:**

```typescript
// UPDATED: financialTableParser.ts
export function parseFinancialsFromStructuredTable(
  table: StructuredTable
): ParsedFinancials {
  const result: ParsedFinancials = { fy3: {}, fy2: {}, fy1: {}, ltm: {} };

  // 1. Parse headers to identify periods
  const buckets = yearTokensToBuckets(
    table.headers.map(h => normalizePeriodToken(h))
  );

  // 2. For each row, identify the metric
  for (const row of table.rows) {
    const metricName = row[0].toLowerCase();
    const values = row.slice(1); // Skip first column (metric name)

    // 3. Match metric to field
    for (const [field, matcher] of Object.entries(ROW_MATCHERS)) {
      if (matcher.test(metricName)) {
        // 4. Assign values to buckets (GUARANTEED ALIGNMENT!)
        buckets.forEach((bucket, index) => {
          if (bucket && values[index]) {
            result[bucket][field] = values[index];
          }
        });
      }
    }
  }

  return result;
}
```

**Key Improvement**: Column alignment is **guaranteed** because:

- Headers and values come from the same table structure
- Index positions are preserved
- No string parsing or whitespace guessing needed
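The guaranteed-alignment claim can be sanity-checked with a tiny standalone sketch; `headerToBucket` is an illustrative stand-in for `yearTokensToBuckets`/`normalizePeriodToken`:

```typescript
type Bucket = 'fy3' | 'fy2' | 'fy1' | 'ltm' | null;

function headerToBucket(header: string): Bucket {
  const t = header.trim().toLowerCase();
  if (t === 'fy-3') return 'fy3';
  if (t === 'fy-2') return 'fy2';
  if (t === 'fy-1') return 'fy1';
  if (t === 'ltm' || t === 'ttm') return 'ltm';
  return null;
}

// In a structured row, cell i lines up with header i, so no
// whitespace guessing is needed.
const headers = ['', 'FY-3', 'FY-2', 'FY-1', 'LTM'];
const revenueRow = ['Revenue', '$45.2M', '$52.8M', '$61.2M', '$58.5M'];

const buckets = headers.slice(1).map(headerToBucket);
const aligned: Partial<Record<NonNullable<Bucket>, string>> = {};
revenueRow.slice(1).forEach((value, i) => {
  const bucket = buckets[i];
  if (bucket) aligned[bucket] = value;
});
// aligned: { fy3: '$45.2M', fy2: '$52.8M', fy1: '$61.2M', ltm: '$58.5M' }
```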
### Phase 4: Table-Aware Chunking

**Store financial tables as special chunks:**

```typescript
// UPDATED: optimizedAgenticRAGProcessor.ts
private async createIntelligentChunks(
  text: string,
  documentId: string,
  tables: StructuredTable[]
): Promise<ProcessingChunk[]> {
  const chunks: ProcessingChunk[] = [];

  // 1. Create dedicated chunks for financial tables
  for (const table of tables.filter(isFinancialTable)) {
    chunks.push({
      id: `${documentId}-financial-table-${chunks.length}`,
      content: this.formatTableAsMarkdown(table),
      chunkIndex: chunks.length,
      sectionType: 'financial-table',
      metadata: {
        isFinancialTable: true,
        tablePosition: table.position,
        structuredData: table // ✅ Preserve structure!
      }
    });
  }

  // 2. Continue with normal text chunking
  // ...
}

private formatTableAsMarkdown(table: StructuredTable): string {
  const header = `| ${table.headers.join(' | ')} |`;
  const separator = `| ${table.headers.map(() => '---').join(' | ')} |`;
  const rows = table.rows.map(row => `| ${row.join(' | ')} |`);

  return [header, separator, ...rows].join('\n');
}
```
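As a usage sanity check, the markdown formatting step produces a standard pipe table; a standalone copy of the function with made-up sample values:

```typescript
function formatTableAsMarkdown(headers: string[], rows: string[][]): string {
  const header = `| ${headers.join(' | ')} |`;
  const separator = `| ${headers.map(() => '---').join(' | ')} |`;
  const body = rows.map(row => `| ${row.join(' | ')} |`);
  return [header, separator, ...body].join('\n');
}

const md = formatTableAsMarkdown(
  ['Metric', 'FY-1', 'LTM'],
  [
    ['Revenue', '$61.2M', '$58.5M'],
    ['EBITDA', '$12.1M', '$11.5M']
  ]
);
// md:
// | Metric | FY-1 | LTM |
// | --- | --- | --- |
// | Revenue | $61.2M | $58.5M |
// | EBITDA | $12.1M | $11.5M |
```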
### Phase 5: Priority Pinning for Financial Chunks

**Ensure financial tables always included in LLM context:**

```typescript
// UPDATED: optimizedAgenticRAGProcessor.ts
private async extractPass1CombinedMetadataFinancial() {
  // 1. Find all financial table chunks
  const financialTableChunks = chunks.filter(
    c => c.metadata?.isFinancialTable === true
  );

  // 2. PIN them to always be included
  return await this.extractWithTargetedQuery(
    documentId,
    text,
    chunks,
    query,
    targetFields,
    7,
    financialTableChunks // ✅ Always included!
  );
}
```

---
## Implementation Phases & Priorities

### Phase 1: Quick Win (1-2 hours) - RECOMMENDED START

**Goal**: Use Document AI tables immediately (matches `IMPLEMENTATION_PLAN.md` Phase 1)

**Planned changes:**

1. Extract structured tables in `documentAiProcessor.ts`.
2. Pass tables (and metadata) to `optimizedAgenticRAGProcessor`.
3. Emit dedicated financial-table chunks that preserve structure.
4. Pin financial chunks so every RAG/LLM pass sees them.

**Expected Improvement**: 60-70% accuracy gain (verify via new instrumentation).

### Phase 2: Enhanced Parsing (2-3 hours)

**Goal**: Deterministic extraction from structured tables before falling back to text (see `IMPLEMENTATION_PLAN.md` Phase 2).

**Planned changes:**

1. Implement `parseFinancialsFromStructuredTable()` and reuse existing deterministic merge paths.
2. Add a classifier that flags which structured tables are financial.
3. Update merge logic to favor structured data yet keep the text/LLM fallback.

**Expected Improvement**: 85-90% accuracy (subject to measured baseline).

### Phase 3: LLM Optimization (1-2 hours)

**Goal**: Better context for the LLM when tables are incomplete or absent (aligns with `HYBRID_SOLUTION.md` Phase 2/3).

**Planned changes:**

1. Format tables as markdown and raise chunk limits for financial passes.
2. Prioritize and pin financial chunks in `extractPass1CombinedMetadataFinancial`.
3. Inject explicit "find the table" instructions into the prompt.

**Expected Improvement**: 90-95% accuracy when Document AI tables exist; otherwise falls back to the hybrid regex/LLM path.

### Phase 4: Integration & Testing (2-3 hours)

**Goal**: Ensure backward compatibility and document measured improvements

**Planned changes:**

1. Keep the legacy text parser as a fallback whenever `tablesFound === 0`.
2. Capture the telemetry outlined earlier and publish before/after numbers.
3. Test against a labeled CIM set covering: clean tables, multi-line rows, scanned PDFs (no structured tables), and partial data cases.

---
### Handling Documents With No Structured Tables

Even after Phases 1-2, some CIMs (e.g., scans or image-only tables) will have `tablesFound === 0`. When that happens:

1. Trigger the enhanced preprocessing + regex route from `HYBRID_SOLUTION.md` (Phase 1).
2. Surface an explicit warning in metadata/logs so analysts know the deterministic path was skipped.
3. Feed the isolated table text (if any) plus surrounding context into the LLM with the financial prompt upgrades from Phase 3.

This ensures the hybrid approach only engages when the Document AI path truly lacks structured tables, keeping maintenance manageable while covering the remaining gap.

---
## Success Metrics

| Metric | Current | Phase 1 | Phase 2 | Phase 3 |
|--------|---------|---------|---------|---------|
| Financial data extracted | 10-20% | 60-70% | 85-90% | 90-95% |
| Tables identified | 0% | 80% | 90% | 95% |
| Column alignment accuracy | 10% | 95% | 98% | 99% |
| Processing time | 45s | 42s | 38s | 35s |

---
## Code Quality Improvements
|
||||
|
||||
### Current Issues:
|
||||
1. ❌ Document AI tables extracted but never used
|
||||
2. ❌ `financialExtractor.ts` exists but never imported
|
||||
3. ❌ Parser assumes flat text has structure
|
||||
4. ❌ No table-specific chunking strategy
|
||||
|
||||
### After Implementation:
|
||||
1. ✅ Full use of Document AI's structured data
|
||||
2. ✅ Multi-tier extraction strategy (structured → fallback → LLM)
|
||||
3. ✅ Table-aware chunking and RAG
|
||||
4. ✅ Guaranteed column alignment
|
||||
5. ✅ Better error handling and logging
|
||||
|
||||
---
|
||||
|
||||
## Alternative Approaches Considered
|
||||
|
||||
### Option 1: Better Regex Parsing (REJECTED)
|
||||
**Reason**: Can't solve the fundamental problem of lost structure
|
||||
|
||||
### Option 2: Use Only LLM (REJECTED)
|
||||
**Reason**: Expensive, slower, less accurate than structured extraction
|
||||
|
||||
### Option 3: Replace Document AI (REJECTED)
|
||||
**Reason**: Document AI works fine, we're just not using it properly
|
||||
|
||||
### Option 4: Manual Table Markup (REJECTED)
|
||||
**Reason**: Not scalable, requires user intervention
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The issue is **NOT** a parsing problem or an LLM problem.
|
||||
|
||||
The issue is an **architecture problem**: We're extracting structured tables from Document AI and then **throwing away the structure**.
|
||||
|
||||
**The fix is simple**: Use the data we're already getting.
|
||||
|
||||
**Recommended action**: Implement Phase 1 (Quick Win) immediately for 60-70% improvement, then evaluate if Phases 2-3 are needed based on results.

@@ -1,370 +0,0 @@
# Full Documentation Plan
## Comprehensive Documentation Strategy for CIM Document Processor

### 🎯 Project Overview

This plan outlines a systematic approach to create complete, accurate, and LLM-optimized documentation for the CIM Document Processor project. The documentation will cover all aspects of the system from high-level architecture to detailed implementation guides.

---

## 📋 Documentation Inventory & Status

### ✅ Existing Documentation (Good Quality)
- `README.md` - Project overview and quick start
- `APP_DESIGN_DOCUMENTATION.md` - System architecture
- `AGENTIC_RAG_IMPLEMENTATION_PLAN.md` - AI processing strategy
- `PDF_GENERATION_ANALYSIS.md` - PDF optimization details
- `DEPLOYMENT_GUIDE.md` - Deployment instructions
- `ARCHITECTURE_DIAGRAMS.md` - Visual architecture
- `DOCUMENTATION_AUDIT_REPORT.md` - Accuracy audit

### ⚠️ Existing Documentation (Needs Updates)
- `codebase-audit-report.md` - May need updates
- `DEPENDENCY_ANALYSIS_REPORT.md` - May need updates
- `DOCUMENT_AI_INTEGRATION_SUMMARY.md` - May need updates

### ❌ Missing Documentation (To Be Created)
- Individual service documentation
- API endpoint documentation
- Database schema documentation
- Configuration guide
- Testing documentation
- Troubleshooting guide
- Development workflow guide
- Security documentation
- Performance optimization guide
- Monitoring and alerting guide

---

## 🏗️ Documentation Architecture

### Level 1: Project Overview
- **README.md** - Entry point and quick start
- **PROJECT_OVERVIEW.md** - Detailed project description
- **ARCHITECTURE_OVERVIEW.md** - High-level system design

### Level 2: System Architecture
- **APP_DESIGN_DOCUMENTATION.md** - Complete architecture
- **ARCHITECTURE_DIAGRAMS.md** - Visual diagrams
- **DATA_FLOW_DOCUMENTATION.md** - System data flow
- **INTEGRATION_GUIDE.md** - External service integration

### Level 3: Component Documentation
- **SERVICES/** - Individual service documentation
- **API/** - API endpoint documentation
- **DATABASE/** - Database schema and models
- **FRONTEND/** - Frontend component documentation

### Level 4: Implementation Guides
- **CONFIGURATION_GUIDE.md** - Environment setup
- **DEPLOYMENT_GUIDE.md** - Deployment procedures
- **TESTING_GUIDE.md** - Testing strategies
- **DEVELOPMENT_WORKFLOW.md** - Development processes

### Level 5: Operational Documentation
- **MONITORING_GUIDE.md** - Monitoring and alerting
- **TROUBLESHOOTING_GUIDE.md** - Common issues and solutions
- **SECURITY_GUIDE.md** - Security considerations
- **PERFORMANCE_GUIDE.md** - Performance optimization

---

## 📊 Documentation Priority Matrix

### 🔴 High Priority (Critical for LLM Agents)
1. **Service Documentation** - All backend services
2. **API Documentation** - Complete endpoint documentation
3. **Configuration Guide** - Environment and setup
4. **Database Schema** - Data models and relationships
5. **Error Handling** - Comprehensive error documentation

### 🟡 Medium Priority (Important for Development)
1. **Frontend Documentation** - React components and services
2. **Testing Documentation** - Test strategies and examples
3. **Development Workflow** - Development processes
4. **Performance Guide** - Optimization strategies
5. **Security Guide** - Security considerations

### 🟢 Low Priority (Nice to Have)
1. **Monitoring Guide** - Monitoring and alerting
2. **Troubleshooting Guide** - Common issues
3. **Integration Guide** - External service integration
4. **Data Flow Documentation** - Detailed data flow
5. **Project Overview** - Detailed project description

---

## 🚀 Implementation Plan

### Phase 1: Core Service Documentation (Week 1)
**Goal**: Document all backend services for LLM agent understanding

#### Day 1-2: Critical Services
- [ ] `unifiedDocumentProcessor.ts` - Main orchestrator
- [ ] `optimizedAgenticRAGProcessor.ts` - AI processing engine
- [ ] `llmService.ts` - LLM interactions
- [ ] `documentAiProcessor.ts` - Document AI integration

#### Day 3-4: File Management Services
- [ ] `fileStorageService.ts` - Google Cloud Storage
- [ ] `pdfGenerationService.ts` - PDF generation
- [ ] `uploadMonitoringService.ts` - Upload tracking
- [ ] `uploadProgressService.ts` - Progress tracking

#### Day 5-7: Data Management Services
- [ ] `agenticRAGDatabaseService.ts` - Analytics and sessions
- [ ] `vectorDatabaseService.ts` - Vector embeddings
- [ ] `sessionService.ts` - Session management
- [ ] `jobQueueService.ts` - Background processing

### Phase 2: API Documentation (Week 2)
**Goal**: Complete API endpoint documentation

#### Day 1-2: Document Routes
- [ ] `documents.ts` - Document management endpoints
- [ ] `monitoring.ts` - Monitoring endpoints
- [ ] `vector.ts` - Vector database endpoints

#### Day 3-4: Controller Documentation
- [ ] `documentController.ts` - Document controller
- [ ] `authController.ts` - Authentication controller

#### Day 5-7: API Integration Guide
- [ ] API authentication guide
- [ ] Request/response examples
- [ ] Error handling documentation
- [ ] Rate limiting documentation

### Phase 3: Database & Models (Week 3)
**Goal**: Complete database schema and model documentation

#### Day 1-2: Core Models
- [ ] `DocumentModel.ts` - Document data model
- [ ] `UserModel.ts` - User data model
- [ ] `ProcessingJobModel.ts` - Job processing model

#### Day 3-4: AI Models
- [ ] `AgenticRAGModels.ts` - AI processing models
- [ ] `agenticTypes.ts` - AI type definitions
- [ ] `VectorDatabaseModel.ts` - Vector database model

#### Day 5-7: Database Schema
- [ ] Complete database schema documentation
- [ ] Migration documentation
- [ ] Data relationships and constraints
- [ ] Query optimization guide

### Phase 4: Configuration & Setup (Week 4)
**Goal**: Complete configuration and setup documentation

#### Day 1-2: Environment Configuration
- [ ] Environment variables guide
- [ ] Configuration validation
- [ ] Service account setup
- [ ] API key management

#### Day 3-4: Development Setup
- [ ] Local development setup
- [ ] Development environment configuration
- [ ] Testing environment setup
- [ ] Debugging configuration

#### Day 5-7: Production Setup
- [ ] Production environment setup
- [ ] Deployment configuration
- [ ] Monitoring setup
- [ ] Security configuration

### Phase 5: Frontend Documentation (Week 5)
**Goal**: Complete frontend component and service documentation

#### Day 1-2: Core Components
- [ ] `App.tsx` - Main application component
- [ ] `DocumentUpload.tsx` - Upload component
- [ ] `DocumentList.tsx` - Document listing
- [ ] `DocumentViewer.tsx` - Document viewing

#### Day 3-4: Service Components
- [ ] `authService.ts` - Authentication service
- [ ] `documentService.ts` - Document service
- [ ] Context providers and hooks
- [ ] Utility functions

#### Day 5-7: Frontend Integration
- [ ] Component interaction patterns
- [ ] State management documentation
- [ ] Error handling in frontend
- [ ] Performance optimization

### Phase 6: Testing & Quality Assurance (Week 6)
**Goal**: Complete testing documentation and quality assurance

#### Day 1-2: Testing Strategy
- [ ] Unit testing documentation
- [ ] Integration testing documentation
- [ ] End-to-end testing documentation
- [ ] Test data management

#### Day 3-4: Quality Assurance
- [ ] Code quality standards
- [ ] Review processes
- [ ] Performance testing
- [ ] Security testing

#### Day 5-7: Continuous Integration
- [ ] CI/CD pipeline documentation
- [ ] Automated testing
- [ ] Quality gates
- [ ] Release processes

### Phase 7: Operational Documentation (Week 7)
**Goal**: Complete operational and maintenance documentation

#### Day 1-2: Monitoring & Alerting
- [ ] Monitoring setup guide
- [ ] Alert configuration
- [ ] Performance metrics
- [ ] Health checks

#### Day 3-4: Troubleshooting
- [ ] Common issues and solutions
- [ ] Debug procedures
- [ ] Log analysis
- [ ] Error recovery

#### Day 5-7: Maintenance
- [ ] Backup procedures
- [ ] Update procedures
- [ ] Scaling strategies
- [ ] Disaster recovery

---

## 📝 Documentation Standards

### File Naming Convention
- Use descriptive, lowercase names with hyphens
- Include component type in filename
- Example: `unified-document-processor-service.md`

### Content Structure
- Use consistent section headers with emojis
- Include file information header
- Provide usage examples
- Include error handling documentation
- Add LLM agent notes

### Code Examples
- Include TypeScript interfaces
- Provide realistic usage examples
- Show error handling patterns
- Include configuration examples

### Cross-References
- Link related documentation
- Reference external resources
- Include version information
- Maintain consistency across documents

---

## 🔍 Quality Assurance

### Documentation Review Process
1. **Technical Accuracy** - Verify against actual code
2. **Completeness** - Ensure all aspects are covered
3. **Clarity** - Ensure content is clear and understandable
4. **Consistency** - Maintain consistent style and format
5. **LLM Optimization** - Optimize for AI agent understanding

### Review Checklist
- [ ] All code examples are current and working
- [ ] API documentation matches implementation
- [ ] Configuration examples are accurate
- [ ] Error handling documentation is complete
- [ ] Performance metrics are realistic
- [ ] Links and references are valid
- [ ] LLM agent notes are included
- [ ] Cross-references are accurate

---

## 📊 Success Metrics

### Documentation Quality Metrics
- **Completeness**: 100% of services documented
- **Accuracy**: Zero inaccurate references
- **Clarity**: Clear and understandable content
- **Consistency**: Consistent style and format

### LLM Agent Effectiveness Metrics
- **Understanding Accuracy**: LLM agents comprehend the codebase
- **Modification Success**: Successful code modifications
- **Error Reduction**: Fewer LLM-generated errors
- **Development Speed**: Faster development with LLM assistance

### User Experience Metrics
- **Onboarding Time**: Reduced time for new developers
- **Issue Resolution**: Faster issue resolution
- **Feature Development**: Faster feature implementation
- **Code Review Efficiency**: More efficient code reviews

---

## 🎯 Expected Outcomes

### Immediate Benefits
1. **Complete Documentation Coverage** - All components documented
2. **Accurate References** - No more inaccurate information
3. **LLM Optimization** - Optimized for AI agent understanding
4. **Developer Onboarding** - Faster onboarding for new developers

### Long-term Benefits
1. **Maintainability** - Easier to maintain and update
2. **Scalability** - Easier to scale the development team
3. **Quality** - Higher code quality through better understanding
4. **Efficiency** - More efficient development processes

---

## 📋 Implementation Timeline

### Week 1: Core Service Documentation
- Complete documentation of all backend services
- Focus on critical services first
- Ensure LLM agent optimization

### Week 2: API Documentation
- Complete API endpoint documentation
- Include authentication and error handling
- Provide usage examples

### Week 3: Database & Models
- Complete database schema documentation
- Document all data models
- Include relationships and constraints

### Week 4: Configuration & Setup
- Complete configuration documentation
- Include environment setup guides
- Document deployment procedures

### Week 5: Frontend Documentation
- Complete frontend component documentation
- Document state management
- Include performance optimization

### Week 6: Testing & Quality Assurance
- Complete testing documentation
- Document quality assurance processes
- Include CI/CD documentation

### Week 7: Operational Documentation
- Complete monitoring and alerting documentation
- Document troubleshooting procedures
- Include maintenance procedures

---

This comprehensive documentation plan ensures that the CIM Document Processor project will have complete, accurate, and LLM-optimized documentation that supports efficient development and maintenance.

@@ -1,888 +0,0 @@
# Financial Data Extraction: Hybrid Solution
## Better Regex + Enhanced LLM Approach

## Philosophy

Rather than a major architectural refactor, this solution enhances what's already working:
1. **Smarter regex** to catch more table patterns
2. **Better LLM context** to ensure financial tables are always seen
3. **Hybrid validation** where regex and LLM cross-check each other

---

## Problem Analysis (Refined)

### Current Issues:
1. **Regex is too strict** - Misses valid table formats
2. **LLM gets incomplete context** - Financial tables truncated or missing
3. **No cross-validation** - Regex and LLM don't verify each other
4. **Table structure lost** - But we can preserve it better with preprocessing

### Key Insight:
The LLM is actually VERY good at understanding financial tables, even in messy text. We just need to:
- Give it the RIGHT chunks (always include financial sections)
- Give it MORE context (increase chunk size for financial data)
- Give it BETTER formatting hints (preserve spacing/alignment where possible)

**When to use this hybrid track:** Rely on the telemetry described in `FINANCIAL_EXTRACTION_ANALYSIS.md` / `IMPLEMENTATION_PLAN.md`. If a document finishes Phase 1/2 processing with `tablesFound === 0` or `financialDataPopulated === false`, route it through the hybrid steps below so we only pay the extra cost when the structured-table path truly fails.
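
That routing rule reduces to a one-line predicate. The telemetry shape below is an illustrative stand-in for whatever the real pipeline records:

```typescript
// Illustrative telemetry shape; field names follow the metrics described above.
interface ExtractionTelemetry {
  tablesFound: number;
  financialDataPopulated: boolean;
}

// Route a document to the hybrid track only when the structured path failed.
function needsHybridTrack(t: ExtractionTelemetry): boolean {
  return t.tablesFound === 0 || !t.financialDataPopulated;
}
```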

---

## Solution Architecture

### Three-Tier Extraction Strategy

```
Tier 1: Enhanced Regex Parser (Fast, Deterministic)
    ↓ (if successful)
    ✓ Use regex results
    ↓ (if incomplete/failed)

Tier 2: LLM with Enhanced Context (Powerful, Flexible)
    ↓ (extract from full financial sections)
    ✓ Fill in gaps from Tier 1
    ↓ (if still missing data)

Tier 3: LLM Deep Dive (Focused, Exhaustive)
    ↓ (targeted re-scan of entire document)
    ✓ Final gap-filling
```
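
The cascade above can be sketched as a short function; the tier implementations and the `isComplete` check are placeholders for the real parser and LLM services:

```typescript
type FinancialData = Record<string, string | undefined>;

// Placeholder completeness check; the real one would cover all target fields.
function isComplete(d: FinancialData): boolean {
  return Boolean(d.revenue && d.ebitda);
}

// Run the tiers in order, stopping as soon as the result is complete.
// Later tiers only fill gaps, never overwrite deterministic Tier 1 values
// (spread order puts later-tier fields after earlier ones, so keep Tier 1
// fields authoritative by merging partials that omit already-found keys).
function extractWithTiers(
  tier1Regex: () => FinancialData,
  tier2LlmContext: (partial: FinancialData) => FinancialData,
  tier3LlmDeepDive: (partial: FinancialData) => FinancialData
): FinancialData {
  let result = tier1Regex();                          // Tier 1: fast, deterministic
  if (isComplete(result)) return result;
  result = { ...result, ...tier2LlmContext(result) }; // Tier 2: fill gaps
  if (isComplete(result)) return result;
  return { ...result, ...tier3LlmDeepDive(result) };  // Tier 3: exhaustive re-scan
}
```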

---

## Implementation Plan

## Phase 1: Enhanced Regex Parser (2-3 hours)

### 1.1: Improve Text Preprocessing

**Goal**: Preserve table structure better before regex parsing

**File**: Create `backend/src/utils/textPreprocessor.ts`

```typescript
/**
 * Enhanced text preprocessing to preserve table structures
 * Attempts to maintain column alignment from PDF extraction
 */

export interface PreprocessedText {
  original: string;
  enhanced: string;
  tableRegions: TextRegion[];
  metadata: {
    likelyTableCount: number;
    preservedAlignment: boolean;
  };
}

export interface TextRegion {
  start: number;
  end: number;
  type: 'table' | 'narrative' | 'header';
  confidence: number;
  content: string;
}

/**
 * Identify regions that look like tables based on formatting patterns
 */
export function identifyTableRegions(text: string): TextRegion[] {
  const regions: TextRegion[] = [];
  const lines = text.split('\n');

  let currentRegion: TextRegion | null = null;
  let regionStart = 0;
  let linePosition = 0;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const nextLine = lines[i + 1] || '';

    const isTableLike = detectTableLine(line, nextLine);

    if (isTableLike.isTable && !currentRegion) {
      // Start new table region
      currentRegion = {
        start: linePosition,
        end: linePosition + line.length,
        type: 'table',
        confidence: isTableLike.confidence,
        content: line
      };
      regionStart = i;
    } else if (isTableLike.isTable && currentRegion) {
      // Extend current table region
      currentRegion.end = linePosition + line.length;
      currentRegion.content += '\n' + line;
      currentRegion.confidence = Math.max(currentRegion.confidence, isTableLike.confidence);
    } else if (!isTableLike.isTable && currentRegion) {
      // End table region
      if (currentRegion.confidence > 0.5 && (i - regionStart) >= 3) {
        regions.push(currentRegion);
      }
      currentRegion = null;
    }

    linePosition += line.length + 1; // +1 for newline
  }

  // Add final region if exists
  if (currentRegion && currentRegion.confidence > 0.5) {
    regions.push(currentRegion);
  }

  return regions;
}

/**
 * Detect if a line looks like part of a table
 */
function detectTableLine(line: string, nextLine: string): { isTable: boolean; confidence: number } {
  let score = 0;

  // Check for multiple aligned numbers
  const numberMatches = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numberMatches && numberMatches.length >= 3) {
    score += 0.4; // Multiple numbers = likely table row
  }

  // Check for consistent spacing (indicates columns)
  const hasConsistentSpacing = /\s{2,}/.test(line); // 2+ spaces = column separator
  if (hasConsistentSpacing && numberMatches) {
    score += 0.3;
  }

  // Check for year/period patterns
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) {
    score += 0.3;
  }

  // Check for financial keywords
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) {
    score += 0.2;
  }

  // Bonus: Next line also looks like a table
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) {
    score += 0.2;
  }

  return {
    isTable: score > 0.5,
    confidence: Math.min(score, 1.0)
  };
}

/**
 * Enhance text by preserving spacing in table regions
 */
export function preprocessText(text: string): PreprocessedText {
  const tableRegions = identifyTableRegions(text);

  // For now, return original text with identified regions
  // In the future, could normalize spacing, align columns, etc.

  return {
    original: text,
    enhanced: text, // TODO: Apply enhancement algorithms
    tableRegions,
    metadata: {
      likelyTableCount: tableRegions.length,
      preservedAlignment: true
    }
  };
}

/**
 * Extract just the table regions as separate texts
 */
export function extractTableTexts(preprocessed: PreprocessedText): string[] {
  return preprocessed.tableRegions
    .filter(region => region.type === 'table' && region.confidence > 0.6)
    .map(region => region.content);
}
```

### 1.2: Enhance Financial Table Parser

**File**: `backend/src/services/financialTableParser.ts`

**Add new patterns to catch more variations:**

```typescript
// ENHANCED: More flexible period token regexes.
// Note: JavaScript RegExp has no free-spacing (`x`) flag, so the alternatives
// are assembled from strings and commented alongside instead.

// Add around line 21. (FY1/FY 2 style tokens are covered by FY[-\s]?\d{1,2}.)
const PERIOD_TOKEN_REGEX = new RegExp(
  '\\b(?:' +
  [
    'FY[-\\s]?\\d{1,2}',            // FY-1, FY 2, FY1, etc.
    '(?:FY[-\\s]?)?20\\d{2}[A-Z]*', // 2021, FY2022A, etc.
    'LTM|TTM',                      // LTM, TTM
    'CY\\d{2}',                     // CY21, CY22
    'Q[1-4]\\s*(?:FY|CY)?\\d{2}',   // Q1 FY23, Q4 2022
  ].join('|') +
  ')\\b',
  'gi'
);

// ENHANCED: Better money regex to catch more formats (update line 22)
const MONEY_REGEX = new RegExp(
  [
    '\\$\\s*[\\d,]+(?:\\.\\d+)?(?:\\s*[KMB])?',  // $1,234.5M
    '[\\d,]+(?:\\.\\d+)?\\s*[KMB]',              // 1,234.5M
    '\\([\\d,]+(?:\\.\\d+)?(?:\\s*[KMB])?\\)',   // (1,234.5M) - negative
    '[\\d,]+(?:\\.\\d+)?',                       // Plain numbers
  ].join('|'),
  'g'
);

// ENHANCED: Better percentage regex (update line 23)
const PERCENT_REGEX = new RegExp(
  [
    '\\(?[\\d,]+\\.?\\d*\\s*%\\)?',  // 12.5% or (12.5%)
    '[\\d,]+\\.?\\d*\\s*pct',        // 12.5 pct
    'NM|N\\/A',                      // Not meaningful, N/A
  ].join('|'),
  'gi'
);
```
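
As a quick sanity check, the period-token pattern can be exercised on a typical header row (restated standalone here so the snippet runs in isolation):

```typescript
// Standalone restatement of the period-token pattern for a quick sanity check.
const periodToken = new RegExp(
  '\\b(?:' +
  [
    'FY[-\\s]?\\d{1,2}',
    '(?:FY[-\\s]?)?20\\d{2}[A-Z]*',
    'LTM|TTM',
    'CY\\d{2}',
    'Q[1-4]\\s*(?:FY|CY)?\\d{2}',
  ].join('|') +
  ')\\b',
  'gi'
);

// A typical CIM table header row: three fiscal years plus an LTM column.
const header = 'FY2021A   FY2022A   FY2023A   LTM Sep-23';
const tokens = header.match(periodToken) ?? [];
```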

**Add multi-pass header detection:**

```typescript
// ADD after line 278 (after current header detection)

// ENHANCED: Multi-pass header detection if first pass failed
if (bestHeaderIndex === -1) {
  logger.info('First pass header detection failed, trying relaxed patterns');

  // Second pass: Look for ANY line with 3+ numbers and a year pattern
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const hasYearPattern = /20\d{2}|FY|LTM|TTM/i.test(line);
    const numberCount = (line.match(/[\d,]+/g) || []).length;

    if (hasYearPattern && numberCount >= 3) {
      // Look at next 10 lines for financial keywords
      const lookAhead = lines.slice(i + 1, i + 11).join(' ');
      const hasFinancialKeywords = /revenue|ebitda|sales|profit/i.test(lookAhead);

      if (hasFinancialKeywords) {
        logger.info('Relaxed header detection found candidate', {
          headerIndex: i,
          headerLine: line.substring(0, 100)
        });

        // Try to parse this as header
        const tokens = tokenizePeriodHeaders(line);
        if (tokens.length >= 2) {
          bestHeaderIndex = i;
          bestBuckets = yearTokensToBuckets(tokens);
          bestHeaderScore = 50; // Lower confidence than primary detection
          break;
        }
      }
    }
  }
}
```

**Add fuzzy row matching:**

```typescript
// ENHANCED: Add after line 354 (in the row matching loop)
// If exact match fails, try fuzzy matching

if (!ROW_MATCHERS[field].test(line)) {
  // Try fuzzy matching (partial matches, typos)
  const fuzzyMatch = fuzzyMatchFinancialRow(line, field);
  if (!fuzzyMatch) continue;
}

// ADD this helper function
function fuzzyMatchFinancialRow(line: string, field: string): boolean {
  const lineLower = line.toLowerCase();

  switch (field) {
    case 'revenue':
      return /rev\b|sales|top.?line/.test(lineLower);
    case 'ebitda':
      return /ebit|earnings.*operations|operating.*income/.test(lineLower);
    case 'grossProfit':
      return /gross.*profit|gp\b/.test(lineLower);
    case 'grossMargin':
      return /gross.*margin|gm\b|gross.*%/.test(lineLower);
    case 'ebitdaMargin':
      return /ebitda.*margin|ebitda.*%|margin.*ebitda/.test(lineLower);
    case 'revenueGrowth':
      return /revenue.*growth|growth.*revenue|rev.*growth|yoy|y.y/.test(lineLower);
    default:
      return false;
  }
}
```

---

## Phase 2: Enhanced LLM Context Delivery (2-3 hours)

### 2.1: Financial Section Prioritization

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Improve the `prioritizeFinancialChunks` method (around line 1265):**

```typescript
// ENHANCED: Much more aggressive financial chunk prioritization
private prioritizeFinancialChunks(chunks: ProcessingChunk[]): ProcessingChunk[] {
  const scoredChunks = chunks.map(chunk => {
    const content = chunk.content.toLowerCase();
    let score = 0;

    // TIER 1: Strong financial indicators (high score)
    const tier1Patterns = [
      /financial\s+summary/i,
      /historical\s+financials/i,
      /financial\s+performance/i,
      /income\s+statement/i,
      /financial\s+highlights/i,
    ];
    tier1Patterns.forEach(pattern => {
      if (pattern.test(content)) score += 100;
    });

    // TIER 2: Contains both periods AND metrics (very likely financial table)
    const hasPeriods = /\b(20[12]\d|FY[-\s]?\d{1,2}|LTM|TTM)\b/i.test(content);
    const hasMetrics = /(revenue|ebitda|sales|profit|margin)/i.test(content);
    const hasNumbers = /\$[\d,]+|[\d,]+[KMB]/i.test(content);

    if (hasPeriods && hasMetrics && hasNumbers) {
      score += 80; // Very likely financial table
    } else if (hasPeriods && hasMetrics) {
      score += 50;
    } else if (hasPeriods && hasNumbers) {
      score += 30;
    }

    // TIER 3: Multiple financial keywords
    const financialKeywords = [
      'revenue', 'ebitda', 'gross profit', 'margin', 'sales',
      'operating income', 'net income', 'cash flow', 'growth'
    ];
    const keywordMatches = financialKeywords.filter(kw => content.includes(kw)).length;
    score += keywordMatches * 5;

    // TIER 4: Has year progression (2021, 2022, 2023)
    const years = content.match(/20[12]\d/g);
    if (years && years.length >= 3) {
      score += 25; // Sequential years = likely financial table
    }

    // TIER 5: Multiple currency values
    const currencyMatches = content.match(/\$[\d,]+(?:\.\d+)?[KMB]?/gi);
    if (currencyMatches) {
      score += Math.min(currencyMatches.length * 3, 30);
    }

    // TIER 6: Section type boost
    if (chunk.sectionType && /financial|income|statement/i.test(chunk.sectionType)) {
      score += 40;
    }

    return { chunk, score };
  });

  // Sort by score and return
  const sorted = scoredChunks.sort((a, b) => b.score - a.score);

  // Log top financial chunks for debugging
  logger.info('Financial chunk prioritization results', {
    topScores: sorted.slice(0, 5).map(s => ({
      chunkIndex: s.chunk.chunkIndex,
      score: s.score,
      preview: s.chunk.content.substring(0, 100)
    }))
  });

  return sorted.map(s => s.chunk);
}
```

### 2.2: Increase Context for Financial Pass

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Update Pass 1 to use more chunks and larger context:**

```typescript
// ENHANCED: Update line 1259 (extractPass1CombinedMetadataFinancial)
// Change from 7 chunks to 12 chunks, and increase the character limit

const maxChunks = 12; // Was 7 - give LLM more context for financials
const maxCharsPerChunk = 3000; // Was 1500 - don't truncate tables as aggressively

// And update line 1595 in extractWithTargetedQuery
const maxCharsPerChunk = options?.isFinancialPass ? 3000 : 1500;
```
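
Applying those limits when assembling the context is a simple slice-and-truncate; a hedged sketch with a simplified chunk shape (the real processor's types and separators may differ):

```typescript
// `Chunk` is a simplified stand-in for the processor's ProcessingChunk type.
interface Chunk {
  content: string;
}

// Take the top-N prioritized chunks and cap each at a character budget,
// using the larger financial-pass limits discussed above.
function buildContext(chunks: Chunk[], maxChunks: number, maxCharsPerChunk: number): string {
  return chunks
    .slice(0, maxChunks)                            // keep only the top-priority chunks
    .map(c => c.content.slice(0, maxCharsPerChunk)) // cap each chunk's size
    .join('\n\n---\n\n');                           // separator between chunks
}
```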

### 2.3: Enhanced Financial Extraction Prompt

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Update the Pass 1 query (around line 1196-1240) to be more explicit:**

```typescript
// ENHANCED: Much more detailed extraction instructions
const query = `Extract deal information, company metadata, and COMPREHENSIVE financial data.

CRITICAL FINANCIAL TABLE EXTRACTION INSTRUCTIONS:

I. LOCATE FINANCIAL TABLES
Look for sections titled: "Financial Summary", "Historical Financials", "Financial Performance",
"Income Statement", "P&L", "Key Metrics", "Financial Highlights", or similar.

Financial tables typically appear in these formats:

FORMAT 1 - Row-based:
                FY 2021    FY 2022    FY 2023    LTM
Revenue         $45.2M     $52.8M     $61.2M     $58.5M
Revenue Growth  N/A        16.8%      15.9%      (4.4%)
EBITDA          $8.5M      $10.2M     $12.1M     $11.5M

FORMAT 2 - Column-based:
Metric             | Value
-------------------|---------
FY21 Revenue       | $45.2M
FY22 Revenue       | $52.8M
FY23 Revenue       | $61.2M

FORMAT 3 - Inline:
Revenue grew from $45.2M in FY2021 to $52.8M in FY2022 (+16.8%) and $61.2M in FY2023 (+15.9%)

II. EXTRACTION RULES

1. PERIOD IDENTIFICATION
   - FY-3, FY-2, FY-1 = Three most recent FULL fiscal years (not projections)
   - LTM/TTM = Most recent 12-month period
   - Map year labels: If you see "FY2021, FY2022, FY2023, LTM Sep'23", then:
     * FY2021 → fy3
     * FY2022 → fy2
     * FY2023 → fy1
     * LTM Sep'23 → ltm

2. VALUE EXTRACTION
   - Extract EXACT values as shown: "$45.2M", "16.8%", etc.
   - Preserve formatting: "$45.2M" not "45.2" or "45200000"
   - Include negative indicators: "(4.4%)" or "-4.4%"
   - Use "N/A" or "NM" if explicitly stated (not "Not specified")

3. METRIC IDENTIFICATION
   - Revenue = "Revenue", "Net Sales", "Total Sales", "Top Line"
   - EBITDA = "EBITDA", "Adjusted EBITDA", "Adj. EBITDA"
   - Margins = Look for "%" after metric name
   - Growth = "Growth %", "YoY", "Y/Y", "Change %"

4. DEAL OVERVIEW
   - Extract: company name, industry, geography, transaction type
   - Extract: employee count, deal source, reason for sale
   - Extract: CIM dates and metadata

III. QUALITY CHECKS

Before submitting your response:
- [ ] Did I find at least 3 distinct fiscal periods?
- [ ] Do I have Revenue AND EBITDA for at least 2 periods?
- [ ] Did I preserve exact number formats from the document?
- [ ] Did I map the periods correctly (newest = fy1, oldest = fy3)?

IV. WHAT TO DO IF TABLE IS UNCLEAR

If the table is hard to parse:
- Include the ENTIRE table section in your analysis
- Extract what you can with confidence
- Mark unclear values as "Not specified in CIM" only if truly absent
- DO NOT guess or interpolate values

V. ADDITIONAL FINANCIAL DATA

Also extract:
- Quality of earnings notes
|
||||
- EBITDA adjustments and add-backs
|
||||
- Revenue growth drivers
|
||||
- Margin trends and analysis
|
||||
- CapEx requirements
|
||||
- Working capital needs
|
||||
- Free cash flow comments`;
|
||||
```
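
The label-to-slot convention in the prompt (newest full fiscal year → fy1, oldest → fy3, trailing twelve months → ltm) can be sketched as a small helper. This is a minimal illustration, not code from the repository; `mapPeriodLabels` is a hypothetical name:

```typescript
type PeriodSlot = "fy1" | "fy2" | "fy3" | "ltm";

// Map column labels like ["FY2021", "FY2022", "FY2023", "LTM Sep'23"]
// to the fy1/fy2/fy3/ltm slots used by the extraction schema.
function mapPeriodLabels(labels: string[]): Map<string, PeriodSlot> {
  const mapping = new Map<string, PeriodSlot>();
  const years: Array<{ label: string; year: number }> = [];

  for (const label of labels) {
    if (/ltm|ttm/i.test(label)) {
      mapping.set(label, "ltm"); // trailing-twelve-month column
      continue;
    }
    const match = label.match(/20\d{2}/);
    if (match) years.push({ label, year: parseInt(match[0], 10) });
  }

  // Newest full year → fy1, next → fy2, oldest → fy3.
  years.sort((a, b) => b.year - a.year);
  const slots: PeriodSlot[] = ["fy1", "fy2", "fy3"];
  years.slice(0, 3).forEach((y, i) => mapping.set(y.label, slots[i]));

  return mapping;
}

const m = mapPeriodLabels(["FY2021", "FY2022", "FY2023", "LTM Sep'23"]);
// FY2023 → fy1, FY2022 → fy2, FY2021 → fy3, LTM Sep'23 → ltm
```

Keeping this mapping deterministic in code (rather than trusting the LLM alone) is what the Phase 3 cross-check below relies on.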

---

## Phase 3: Hybrid Validation & Cross-Checking (1-2 hours)

### 3.1: Create Validation Layer

**File**: Create `backend/src/services/financialDataValidator.ts`

```typescript
import { logger } from '../utils/logger';
import type { ParsedFinancials } from './financialTableParser';
import type { CIMReview } from './llmSchemas';

export interface ValidationResult {
  isValid: boolean;
  confidence: number;
  issues: string[];
  corrections: ParsedFinancials;
}

/**
 * Cross-validate financial data from multiple sources
 */
export function validateFinancialData(
  regexResult: ParsedFinancials,
  llmResult: Partial<CIMReview>
): ValidationResult {
  const issues: string[] = [];
  const corrections: ParsedFinancials = { ...regexResult };
  let confidence = 1.0;

  // Extract LLM financials
  const llmFinancials = llmResult.financialSummary?.financials;

  if (!llmFinancials) {
    return {
      isValid: true,
      confidence: 0.5,
      issues: ['No LLM financial data to validate against'],
      corrections: regexResult
    };
  }

  // Validate each period
  const periods: Array<keyof ParsedFinancials> = ['fy3', 'fy2', 'fy1', 'ltm'];

  for (const period of periods) {
    const regexPeriod = regexResult[period];
    const llmPeriod = llmFinancials[period];

    if (!llmPeriod) continue;

    // Compare revenue
    if (regexPeriod.revenue && llmPeriod.revenue) {
      const match = compareFinancialValues(regexPeriod.revenue, llmPeriod.revenue);
      if (!match.matches) {
        issues.push(`${period} revenue mismatch: Regex="${regexPeriod.revenue}" vs LLM="${llmPeriod.revenue}"`);
        confidence -= 0.1;

        // Trust LLM if regex value looks suspicious
        if (match.llmMoreCredible) {
          corrections[period].revenue = llmPeriod.revenue;
        }
      }
    } else if (!regexPeriod.revenue && llmPeriod.revenue && llmPeriod.revenue !== 'Not specified in CIM') {
      // Regex missed it, LLM found it
      corrections[period].revenue = llmPeriod.revenue;
      issues.push(`${period} revenue: Regex missed, using LLM value: ${llmPeriod.revenue}`);
    }

    // Compare EBITDA
    if (regexPeriod.ebitda && llmPeriod.ebitda) {
      const match = compareFinancialValues(regexPeriod.ebitda, llmPeriod.ebitda);
      if (!match.matches) {
        issues.push(`${period} EBITDA mismatch: Regex="${regexPeriod.ebitda}" vs LLM="${llmPeriod.ebitda}"`);
        confidence -= 0.1;

        if (match.llmMoreCredible) {
          corrections[period].ebitda = llmPeriod.ebitda;
        }
      }
    } else if (!regexPeriod.ebitda && llmPeriod.ebitda && llmPeriod.ebitda !== 'Not specified in CIM') {
      corrections[period].ebitda = llmPeriod.ebitda;
      issues.push(`${period} EBITDA: Regex missed, using LLM value: ${llmPeriod.ebitda}`);
    }

    // Fill in other fields from LLM if regex didn't get them
    const fields: Array<keyof typeof regexPeriod> = [
      'revenueGrowth', 'grossProfit', 'grossMargin', 'ebitdaMargin'
    ];

    for (const field of fields) {
      if (!regexPeriod[field] && llmPeriod[field] && llmPeriod[field] !== 'Not specified in CIM') {
        corrections[period][field] = llmPeriod[field];
      }
    }
  }

  logger.info('Financial data validation completed', {
    confidence,
    issueCount: issues.length,
    issues: issues.slice(0, 5)
  });

  return {
    isValid: confidence > 0.6,
    confidence,
    issues,
    corrections
  };
}

/**
 * Compare two financial values to see if they match
 */
function compareFinancialValues(
  value1: string,
  value2: string
): { matches: boolean; llmMoreCredible: boolean } {
  const clean1 = value1.replace(/[$,\s]/g, '').toUpperCase();
  const clean2 = value2.replace(/[$,\s]/g, '').toUpperCase();

  // Exact match
  if (clean1 === clean2) {
    return { matches: true, llmMoreCredible: false };
  }

  // Check if numeric values are close (within 5%)
  const num1 = parseFinancialValue(value1);
  const num2 = parseFinancialValue(value2);

  if (num1 && num2) {
    const percentDiff = Math.abs((num1 - num2) / num1);
    if (percentDiff < 0.05) {
      // Values are close enough
      return { matches: true, llmMoreCredible: false };
    }

    // Large difference - trust the value with more precision
    const precision1 = (value1.match(/\./g) || []).length;
    const precision2 = (value2.match(/\./g) || []).length;

    return {
      matches: false,
      llmMoreCredible: precision2 > precision1
    };
  }

  return { matches: false, llmMoreCredible: false };
}

/**
 * Parse a financial value string to a number
 */
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');

  let multiplier = 1;
  if (/M$/i.test(clean)) {
    multiplier = 1000000;
  } else if (/K$/i.test(clean)) {
    multiplier = 1000;
  } else if (/B$/i.test(clean)) {
    multiplier = 1000000000;
  }

  // Anchor to end-of-string so only the trailing unit suffix is stripped
  const numStr = clean.replace(/[MKB]$/i, '');
  const num = parseFloat(numStr);

  return isNaN(num) ? null : num * multiplier;
}
```
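
To sanity-check the suffix handling in `parseFinancialValue`, a standalone copy of its logic can be exercised directly (a quick verification sketch, not a test that exists in the repo):

```typescript
// Standalone copy of the parsing logic above, for quick verification.
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');
  let multiplier = 1;
  if (/M$/i.test(clean)) multiplier = 1_000_000;
  else if (/K$/i.test(clean)) multiplier = 1_000;
  else if (/B$/i.test(clean)) multiplier = 1_000_000_000;
  const num = parseFloat(clean.replace(/[MKB]$/i, ''));
  return isNaN(num) ? null : num * multiplier;
}

parseFinancialValue('$45.2M'); // → 45200000
parseFinancialValue('850K');   // → 850000
parseFinancialValue('1.2B');   // → 1200000000
parseFinancialValue('N/A');    // → null
```

Note that parenthesized negatives such as `"(4.4%)"` parse to `null` here, so comparisons of those values fall back to the exact-string check in `compareFinancialValues`.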

### 3.2: Integrate Validation into Processing

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Add after line 1137 (after merging partial results):**

```typescript
// ENHANCED: Cross-validate regex and LLM results
if (deterministicFinancials) {
  logger.info('Validating deterministic financials against LLM results');

  const { validateFinancialData } = await import('./financialDataValidator');
  const validation = validateFinancialData(deterministicFinancials, mergedData);

  logger.info('Validation results', {
    documentId,
    isValid: validation.isValid,
    confidence: validation.confidence,
    issueCount: validation.issues.length
  });

  // Use validated/corrected data
  if (validation.confidence > 0.7) {
    deterministicFinancials = validation.corrections;
    logger.info('Using validated corrections', {
      documentId,
      corrections: validation.corrections
    });
  }

  // Merge validated data
  this.mergeDeterministicFinancialData(mergedData, deterministicFinancials, documentId);
} else {
  logger.info('No deterministic financial data to validate', { documentId });
}
```

---

## Phase 4: Text Preprocessing Integration (1 hour)

### 4.1: Apply Preprocessing to Document AI Text

**File**: `backend/src/services/documentAiProcessor.ts`

**Add preprocessing before passing to RAG:**

```typescript
// ADD import at top
import { preprocessText, extractTableTexts } from '../utils/textPreprocessor';

// UPDATE line 83 (processWithAgenticRAG method)
private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<any> {
  try {
    logger.info('Processing extracted text with Agentic RAG', {
      documentId,
      textLength: extractedText.length
    });

    // ENHANCED: Preprocess text to identify table regions
    const preprocessed = preprocessText(extractedText);

    logger.info('Text preprocessing completed', {
      documentId,
      tableRegionsFound: preprocessed.tableRegions.length,
      likelyTableCount: preprocessed.metadata.likelyTableCount
    });

    // Extract table texts separately for better parsing
    const tableSections = extractTableTexts(preprocessed);

    // Import and use the optimized agentic RAG processor
    const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');

    const result = await optimizedAgenticRAGProcessor.processLargeDocument(
      documentId,
      extractedText,
      {
        preprocessedData: preprocessed, // Pass preprocessing results
        tableSections: tableSections // Pass isolated table texts
      }
    );

    return result;
  } catch (error) {
    // ... existing error handling
  }
}
```

---

## Expected Results

### Current State (Baseline):
```
Financial data extraction rate: 10-20%
Typical result: "Not specified in CIM" for most fields
```

### After Phase 1 (Enhanced Regex):
```
Financial data extraction rate: 35-45%
Improvement: Better pattern matching catches more tables
```

### After Phase 2 (Enhanced LLM):
```
Financial data extraction rate: 65-75%
Improvement: LLM sees financial tables more reliably
```

### After Phase 3 (Validation):
```
Financial data extraction rate: 75-85%
Improvement: Cross-validation fills gaps and corrects errors
```

### After Phase 4 (Preprocessing):
```
Financial data extraction rate: 80-90%
Improvement: Table structure preservation helps both regex and LLM
```

---

## Implementation Priority

### Start Here (Highest ROI):
1. **Phase 2.1** - Financial Section Prioritization (30 min, +30% accuracy)
2. **Phase 2.2** - Increase LLM Context (15 min, +15% accuracy)
3. **Phase 2.3** - Enhanced Prompt (30 min, +20% accuracy)

**Total: 1.5 hours for ~50-60% improvement**

### Then Do:
4. **Phase 1.2** - Enhanced Parser Patterns (1 hour, +10% accuracy)
5. **Phase 3.1-3.2** - Validation (1.5 hours, +10% accuracy)

**Total: 4 hours for ~70-80% improvement**

### Optional:
6. **Phase 1.1, 4.1** - Text Preprocessing (2 hours, +10% accuracy)

---

## Testing Strategy

### Test 1: Baseline Measurement
```bash
# Process 10 CIMs and record extraction rate
npm run test:pipeline
# Record: How many financial fields are populated?
```

### Test 2: After Each Phase
```bash
# Same 10 CIMs, measure improvement
npm run test:pipeline
# Compare against baseline
```

### Test 3: Edge Cases
- PDFs with rotated pages
- PDFs with merged table cells
- PDFs with multi-line headers
- Narrative-only financials (no tables)

---

## Rollback Plan

Each phase is additive and can be disabled via feature flags:

```typescript
// config/env.ts
export const features = {
  enhancedRegexParsing: process.env.ENHANCED_REGEX === 'true',
  enhancedLLMContext: process.env.ENHANCED_LLM === 'true',
  financialValidation: process.env.VALIDATE_FINANCIALS === 'true',
  textPreprocessing: process.env.PREPROCESS_TEXT === 'true'
};
```
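
A call site can then branch on these flags. The sketch below mirrors the flag shape above in a self-contained way (`env` stands in for `process.env`, and `describePipeline` is an illustrative name, not a real function in the codebase):

```typescript
// Minimal sketch of consuming the feature flags above.
// process.env values are strings, so only the literal "true" enables a flag.
const env: Record<string, string | undefined> = {
  ENHANCED_REGEX: "true",
  VALIDATE_FINANCIALS: "false",
};

const features = {
  enhancedRegexParsing: env.ENHANCED_REGEX === "true",
  financialValidation: env.VALIDATE_FINANCIALS === "true",
};

// Build the list of pipeline stages that would run under the current flags.
function describePipeline(): string[] {
  const stages = ["baseline-extraction"];
  if (features.enhancedRegexParsing) stages.push("enhanced-regex");
  if (features.financialValidation) stages.push("cross-validation");
  return stages;
}

describePipeline(); // → ["baseline-extraction", "enhanced-regex"]
```

Because an unset variable compares unequal to `"true"`, every phase defaults to off, which is what makes the rollback path safe.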

Set the corresponding variable to anything other than `true` (e.g. `ENHANCED_REGEX=false`) to disable that phase.

---

## Success Metrics

| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Financial data extracted | 10-20% | 80-90% | % of fields populated |
| Processing time | 45s | <60s | End-to-end time |
| False positives | Unknown | <5% | Manual validation |
| Column misalignment | ~50% | <10% | Check FY mapping |

---

## Next Steps

1. Implement Phase 2 (Enhanced LLM) first - biggest impact, lowest risk
2. Test with 5-10 real CIM documents
3. Measure improvement
4. If accuracy exceeds 70%, stop; if not, add Phases 1 and 3
5. Keep Phase 4 as an optional enhancement

The LLM is actually very good at this - we just need to give it the right context!

@@ -1,871 +0,0 @@

# Financial Data Extraction: Implementation Plan

## Overview

This document provides a step-by-step implementation plan to fix the financial data extraction issue by utilizing Document AI's structured table data.

---

## Phase 1: Quick Win Implementation (RECOMMENDED START)

**Timeline**: 1-2 hours
**Expected Improvement**: 60-70% accuracy gain
**Risk**: Low - additive changes, no breaking modifications

### Step 1.1: Update DocumentAIOutput Interface

**File**: `backend/src/services/documentAiProcessor.ts`

**Current (lines 15-25):**
```typescript
interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: Array<any>; // ❌ Just counts, no structure
  pages: Array<any>;
  mimeType: string;
}
```

**Updated:**
```typescript
export interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: {
    pageNumber: number;
    confidence: number;
  };
  rawTable?: any; // Keep original for debugging
}

interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: StructuredTable[]; // ✅ Full structure
  pages: Array<any>;
  mimeType: string;
}
```

### Step 1.2: Add Table Text Extraction Helper

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Add after line 51 (after constructor)

```typescript
/**
 * Extract text from a Document AI layout object using text anchors
 * Based on Google's best practices: https://cloud.google.com/document-ai/docs/handle-response
 */
private getTextFromLayout(layout: any, documentText: string): string {
  try {
    const textAnchor = layout?.textAnchor;
    if (!textAnchor?.textSegments || textAnchor.textSegments.length === 0) {
      return '';
    }

    // Get the first segment (most common case)
    const segment = textAnchor.textSegments[0];
    const startIndex = parseInt(segment.startIndex || '0');
    const endIndex = parseInt(segment.endIndex || documentText.length.toString());

    // Validate indices
    if (startIndex < 0 || endIndex > documentText.length || startIndex >= endIndex) {
      logger.warn('Invalid text anchor indices', { startIndex, endIndex, docLength: documentText.length });
      return '';
    }

    return documentText.substring(startIndex, endIndex).trim();
  } catch (error) {
    logger.error('Failed to extract text from layout', {
      error: error instanceof Error ? error.message : String(error),
      layout
    });
    return '';
  }
}
```
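
The text-anchor mechanics above are easy to verify in isolation: Document AI layouts point into the document's full text via `[startIndex, endIndex)` segments. A standalone sketch of the same logic (free function rather than class method, with narrowed types for clarity):

```typescript
interface TextSegment { startIndex?: string; endIndex?: string }
interface Layout { textAnchor?: { textSegments?: TextSegment[] } }

// Resolve a layout's first text segment against the full document text.
function textFromLayout(layout: Layout, documentText: string): string {
  const seg = layout.textAnchor?.textSegments?.[0];
  if (!seg) return "";
  const start = parseInt(seg.startIndex ?? "0", 10);
  const end = parseInt(seg.endIndex ?? String(documentText.length), 10);
  if (start < 0 || end > documentText.length || start >= end) return "";
  return documentText.substring(start, end).trim();
}

const doc = "Revenue $45.2M EBITDA $8.5M";
const cell: Layout = { textAnchor: { textSegments: [{ startIndex: "8", endIndex: "14" }] } };
textFromLayout(cell, doc); // → "$45.2M"
```

Note that Document AI serializes the indices as strings (they can exceed 32-bit range on large documents), which is why the parsing step is needed at all.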
### Step 1.3: Add Structured Table Extraction

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Add after getTextFromLayout method

```typescript
/**
 * Extract structured tables from Document AI response
 * Preserves column alignment and table structure
 */
private extractStructuredTables(document: any, documentText: string): StructuredTable[] {
  const tables: StructuredTable[] = [];

  try {
    const pages = document.pages || [];
    logger.info('Extracting structured tables from Document AI response', {
      pageCount: pages.length
    });

    for (const page of pages) {
      const pageTables = page.tables || [];
      const pageNumber = page.pageNumber || 0;

      logger.info('Processing page for tables', {
        pageNumber,
        tableCount: pageTables.length
      });

      for (let tableIndex = 0; tableIndex < pageTables.length; tableIndex++) {
        const table = pageTables[tableIndex];

        try {
          // Extract headers from first header row
          const headers: string[] = [];
          if (table.headerRows && table.headerRows.length > 0) {
            const headerRow = table.headerRows[0];
            for (const cell of headerRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              headers.push(cellText);
            }
          }

          // Extract data rows
          const rows: string[][] = [];
          for (const bodyRow of table.bodyRows || []) {
            const row: string[] = [];
            for (const cell of bodyRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              row.push(cellText);
            }
            if (row.length > 0) {
              rows.push(row);
            }
          }

          // Only add tables with content
          if (headers.length > 0 || rows.length > 0) {
            tables.push({
              headers,
              rows,
              position: {
                pageNumber,
                confidence: table.confidence || 0.9
              },
              rawTable: table // Keep for debugging
            });

            logger.info('Extracted structured table', {
              pageNumber,
              tableIndex,
              headerCount: headers.length,
              rowCount: rows.length,
              headers: headers.slice(0, 10) // Log first 10 headers
            });
          }
        } catch (tableError) {
          logger.error('Failed to extract table', {
            pageNumber,
            tableIndex,
            error: tableError instanceof Error ? tableError.message : String(tableError)
          });
        }
      }
    }

    logger.info('Structured table extraction completed', {
      totalTables: tables.length
    });

  } catch (error) {
    logger.error('Failed to extract structured tables', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return tables;
}
```

### Step 1.4: Update processWithDocumentAI to Use Structured Tables

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Update lines 462-482

**Current:**
```typescript
// Extract tables
const tables = document.pages?.flatMap(page =>
  page.tables?.map(table => ({
    rows: table.headerRows?.length || 0,
    columns: table.bodyRows?.[0]?.cells?.length || 0
  })) || []
) || [];
```

**Updated:**
```typescript
// Extract structured tables with full content
const tables = this.extractStructuredTables(document, text);
```

### Step 1.5: Pass Tables to Agentic RAG Processor

**File**: `backend/src/services/documentAiProcessor.ts`
**Location**: Update line 337 (processLargeDocument call)

**Current:**
```typescript
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {}
);
```

**Updated:**
```typescript
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {
    structuredTables: documentAiOutput.tables || []
  }
);
```

### Step 1.6: Update Agentic RAG Processor Signature

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update lines 41-48

**Current:**
```typescript
async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
  } = {}
)
```

**Updated:**
```typescript
async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
    structuredTables?: StructuredTable[];
  } = {}
)
```

### Step 1.7: Add Import for StructuredTable Type

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Add to imports at top (around lines 1-6)

```typescript
import type { StructuredTable } from './documentAiProcessor';
```

### Step 1.8: Create Financial Table Identifier

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Add after line 503 (after calculateCosineSimilarity)

```typescript
/**
 * Identify if a structured table contains financial data
 * Uses heuristics to detect financial tables vs. other tables
 */
private isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const allRowsText = table.rows.map(row => row.join(' ').toLowerCase()).join(' ');

  // Check for year/period indicators in headers
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);

  // Check for financial metrics in rows
  const financialMetrics = [
    'revenue', 'sales', 'ebitda', 'ebit', 'profit', 'margin',
    'gross profit', 'operating income', 'net income', 'cash flow',
    'earnings', 'assets', 'liabilities', 'equity'
  ];
  const hasFinancialMetrics = financialMetrics.some(metric =>
    allRowsText.includes(metric)
  );

  // Check for currency/percentage values
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(allRowsText);

  // A financial table should have periods AND (metrics OR currency values)
  const isFinancial = hasPeriods && (hasFinancialMetrics || hasCurrency);

  if (isFinancial) {
    logger.info('Identified financial table', {
      headers: table.headers,
      rowCount: table.rows.length,
      pageNumber: table.position.pageNumber
    });
  }

  return isFinancial;
}

/**
 * Format a structured table as markdown for better LLM comprehension
 * Preserves column alignment and makes tables human-readable
 */
private formatTableAsMarkdown(table: StructuredTable): string {
  const lines: string[] = [];

  // Add header row
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }

  // Add data rows
  for (const row of table.rows) {
    lines.push(`| ${row.join(' | ')} |`);
  }

  return lines.join('\n');
}
```
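
Both helpers are pure functions over the table shape, so their behavior can be checked with a hand-built table. A standalone sketch (free functions with a trimmed metric list, mirroring the heuristics above):

```typescript
interface StructuredTable { headers: string[]; rows: string[][] }

// Heuristic: period-like headers AND (metric keywords OR currency/percent values).
function isFinancialTable(t: StructuredTable): boolean {
  const headerText = t.headers.join(" ").toLowerCase();
  const rowText = t.rows.map(r => r.join(" ").toLowerCase()).join(" ");
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);
  const hasMetrics = ["revenue", "ebitda", "margin"].some(m => rowText.includes(m));
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(rowText);
  return hasPeriods && (hasMetrics || hasCurrency);
}

// Render header row, separator row, then one line per data row.
function formatTableAsMarkdown(t: StructuredTable): string {
  const lines = [
    `| ${t.headers.join(" | ")} |`,
    `| ${t.headers.map(() => "---").join(" | ")} |`,
    ...t.rows.map(r => `| ${r.join(" | ")} |`),
  ];
  return lines.join("\n");
}

const table: StructuredTable = {
  headers: ["Metric", "FY2022", "FY2023"],
  rows: [["Revenue", "$52.8M", "$61.2M"], ["EBITDA", "$10.2M", "$12.1M"]],
};
// isFinancialTable(table) → true; formatTableAsMarkdown(table) renders:
// | Metric | FY2022 | FY2023 |
// | --- | --- | --- |
// | Revenue | $52.8M | $61.2M |
// | EBITDA | $10.2M | $12.1M |
```

The markdown rendering is what the LLM ultimately sees in Step 1.9, so keeping the cell order stable here is what preserves the FY-column alignment downstream.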

### Step 1.9: Update Chunk Creation to Include Financial Tables

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update createIntelligentChunks method (lines 115-158)

**Add after line 118:**
```typescript
// Extract structured tables from options
const structuredTables = (options as any)?.structuredTables || [];
```

**Add after line 119 (inside the method, before semantic chunking):**
```typescript
// PRIORITY: Create dedicated chunks for financial tables
if (structuredTables.length > 0) {
  logger.info('Processing structured tables for chunking', {
    documentId,
    tableCount: structuredTables.length
  });

  for (let i = 0; i < structuredTables.length; i++) {
    const table = structuredTables[i];
    const isFinancial = this.isFinancialTable(table);

    // Format table as markdown for better readability
    const markdownTable = this.formatTableAsMarkdown(table);

    chunks.push({
      id: `${documentId}-table-${i}`,
      content: markdownTable,
      chunkIndex: chunks.length,
      startPosition: -1, // Tables don't have text positions
      endPosition: -1,
      sectionType: isFinancial ? 'financial-table' : 'table',
      metadata: {
        isStructuredTable: true,
        isFinancialTable: isFinancial,
        tableIndex: i,
        pageNumber: table.position.pageNumber,
        headerCount: table.headers.length,
        rowCount: table.rows.length,
        structuredData: table // Preserve original structure
      }
    });

    logger.info('Created chunk for structured table', {
      documentId,
      tableIndex: i,
      isFinancial,
      chunkId: chunks[chunks.length - 1].id,
      contentPreview: markdownTable.substring(0, 200)
    });
  }
}
```

### Step 1.10: Pin Financial Tables in Extraction

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update extractPass1CombinedMetadataFinancial method (around lines 1190-1260)

**Add before the return statement (around line 1259):**
```typescript
// Identify and pin financial table chunks to ensure they're always included
const financialTableChunks = chunks.filter(
  chunk => chunk.metadata?.isFinancialTable === true
);

logger.info('Financial table chunks identified for pinning', {
  documentId,
  financialTableCount: financialTableChunks.length,
  chunkIds: financialTableChunks.map(c => c.id)
});

// Combine deterministic financial chunks with structured table chunks
const allPinnedChunks = [
  ...pinnedChunks,
  ...financialTableChunks
];
```

**Update the return statement to use allPinnedChunks:**
```typescript
return await this.extractWithTargetedQuery(
  documentId,
  text,
  financialChunks,
  query,
  targetFields,
  7,
  allPinnedChunks // ✅ Now includes both deterministic and structured tables
);
```

---

## Testing Phase 1

### Test 1.1: Verify Table Extraction
```bash
# Monitor logs for table extraction
cd backend
npm run dev

# Look for log entries:
# - "Extracting structured tables from Document AI response"
# - "Extracted structured table"
# - "Identified financial table"
```

### Test 1.2: Upload a CIM Document
```bash
# Upload a test document and check processing
curl -X POST http://localhost:8080/api/documents/upload \
  -F "file=@test-cim.pdf" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

### Test 1.3: Verify Financial Data Populated
Check the database or API response for:
- `financialSummary.financials.fy3.revenue` - should have values
- `financialSummary.financials.fy2.ebitda` - should have values
- NOT "Not specified in CIM" for fields that exist in tables

### Test 1.4: Check Logs for Success Indicators
```bash
# Should see:
# ✅ "Identified financial table" - confirms tables detected
# ✅ "Created chunk for structured table" - confirms chunking worked
# ✅ "Financial table chunks identified for pinning" - confirms pinning worked
# ✅ "Deterministic financial data merged successfully" - confirms data merged
```

|
||||
|
||||
### Baseline & Post-Change Metrics
|
||||
|
||||
Collect before/after numbers so we can validate the expected accuracy lift and know when to pull in the hybrid fallback:
|
||||
|
||||
1. Instrument the processing metadata (see `FINANCIAL_EXTRACTION_ANALYSIS.md`) with `tablesFound`, `financialTablesIdentified`, `structuredParsingUsed`, `textParsingFallback`, and `financialDataPopulated`.
|
||||
2. Run ≥20 recent CIMs through the current pipeline and record aggregate stats (mean/median for the above plus sample `documentId`s with `tablesFound === 0`).
|
||||
3. Repeat after deploying Phase 1 and Phase 2 changes; paste the numbers back into the analysis doc so Success Criteria reference real data instead of estimates.
|
||||
|
||||
---
|
||||
|
||||

## Expected Results After Phase 1

### Before Phase 1:
```json
{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      },
      "fy2": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      }
    }
  }
}
```

### After Phase 1:
```json
{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "$45.2M",
        "revenueGrowth": "N/A",
        "ebitda": "$8.5M",
        "ebitdaMargin": "18.8%"
      },
      "fy2": {
        "revenue": "$52.8M",
        "revenueGrowth": "16.8%",
        "ebitda": "$10.2M",
        "ebitdaMargin": "19.3%"
      }
    }
  }
}
```

---

## Phase 2: Enhanced Deterministic Parsing (Optional)

**Timeline**: 2-3 hours
**Expected Additional Improvement**: +15-20% accuracy
**Trigger**: If Phase 1 results are below 70% accuracy

### Step 2.1: Create Structured Table Parser

**File**: Create `backend/src/services/structuredFinancialParser.ts`

```typescript
import { logger } from '../utils/logger';
import type { StructuredTable } from './documentAiProcessor';
import type { ParsedFinancials, FinancialPeriod } from './financialTableParser';

/**
 * Parse financials directly from Document AI structured tables.
 * This is more reliable than parsing from flattened text.
 */
export function parseFinancialsFromStructuredTable(
  table: StructuredTable
): ParsedFinancials {
  const result: ParsedFinancials = {
    fy3: {},
    fy2: {},
    fy1: {},
    ltm: {}
  };

  try {
    // 1. Identify period columns from headers
    const periodMapping = mapHeadersToPeriods(table.headers);

    logger.info('Structured table period mapping', {
      headers: table.headers,
      periodMapping
    });

    // 2. Process each row to extract metrics
    for (let rowIndex = 0; rowIndex < table.rows.length; rowIndex++) {
      const row = table.rows[rowIndex];
      if (row.length === 0) continue;

      const metricName = row[0].toLowerCase();

      // Match against known financial metrics
      const fieldName = identifyMetricField(metricName);
      if (!fieldName) continue;

      // 3. Assign values to correct periods
      periodMapping.forEach((period, columnIndex) => {
        if (!period) return; // Skip unmapped columns

        // +1: row[0] is the metric name; headers are assumed to describe the
        // value columns only, so header index i lines up with row cell i + 1
        const value = row[columnIndex + 1];
        if (!value || value.trim() === '') return;

        // 4. Validate value type matches field
        if (isValidValueForField(value, fieldName)) {
          result[period][fieldName] = value.trim();

          logger.debug('Mapped structured table value', {
            period,
            field: fieldName,
            value: value.trim(),
            row: rowIndex,
            column: columnIndex
          });
        }
      });
    }

    logger.info('Structured table parsing completed', {
      fy3: result.fy3,
      fy2: result.fy2,
      fy1: result.fy1,
      ltm: result.ltm
    });

  } catch (error) {
    logger.error('Failed to parse structured financial table', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return result;
}

/**
 * Map header columns to financial periods (fy3, fy2, fy1, ltm)
 */
function mapHeadersToPeriods(headers: string[]): Array<keyof ParsedFinancials | null> {
  const periodMapping: Array<keyof ParsedFinancials | null> = [];

  for (const header of headers) {
    const normalized = header.trim().toUpperCase().replace(/\s+/g, '');
    let period: keyof ParsedFinancials | null = null;

    // Check for LTM/TTM
    if (normalized.includes('LTM') || normalized.includes('TTM')) {
      period = 'ltm';
    }
    // Check for year patterns
    else if (/FY[-\s]?1$|FY[-\s]?2024|2024/.test(normalized)) {
      period = 'fy1'; // Most recent full year
    } else if (/FY[-\s]?2$|FY[-\s]?2023|2023/.test(normalized)) {
      period = 'fy2'; // Second most recent year
    } else if (/FY[-\s]?3$|FY[-\s]?2022|2022/.test(normalized)) {
      period = 'fy3'; // Third most recent year
    }
    // Generic FY pattern - assign based on position
    else if (/FY\d{2}/.test(normalized)) {
      period = null; // Handle in second pass
    }

    periodMapping.push(period);
  }

  // Second pass: fill in generic FY columns based on position.
  // Most recent on right, oldest on left (common CIM format).
  let fyIndex = 1;
  for (let i = periodMapping.length - 1; i >= 0; i--) {
    if (periodMapping[i] === null && /FY/i.test(headers[i])) {
      if (fyIndex === 1) periodMapping[i] = 'fy1';
      else if (fyIndex === 2) periodMapping[i] = 'fy2';
      else if (fyIndex === 3) periodMapping[i] = 'fy3';
      fyIndex++;
    }
  }

  return periodMapping;
}

/**
 * Identify which financial field a metric name corresponds to.
 * Order matters: compound labels ("revenue growth", "EBITDA margin") must be
 * tested before the bare patterns that would otherwise claim them.
 */
function identifyMetricField(metricName: string): keyof FinancialPeriod | null {
  const name = metricName.toLowerCase();

  if (/revenue\s*growth|yoy|y\/y|year[-\s]*over[-\s]*year/.test(name)) {
    return 'revenueGrowth';
  }
  if (/^revenue|^net sales|^total sales|^top\s+line/.test(name)) {
    return 'revenue';
  }
  if (/gross\s*margin/.test(name)) {
    return 'grossMargin';
  }
  if (/gross\s*profit/.test(name)) {
    return 'grossProfit';
  }
  if (/ebitda\s*margin|adj\.?\s*ebitda\s*margin/.test(name)) {
    return 'ebitdaMargin';
  }
  if (/ebitda|adjusted\s*ebitda|adj\.?\s*ebitda/.test(name)) {
    return 'ebitda';
  }

  return null;
}

/**
 * Validate that a value is appropriate for a given field
 */
function isValidValueForField(value: string, field: keyof FinancialPeriod): boolean {
  const trimmed = value.trim();

  // Margin and growth fields: accept "N/A" outright, otherwise require a digit and %
  if (field.includes('Margin') || field.includes('Growth')) {
    return trimmed.toLowerCase() === 'n/a' || (/\d/.test(trimmed) && trimmed.includes('%'));
  }

  // Revenue, profit, EBITDA should have $ or a K/M/B magnitude suffix
  if (['revenue', 'grossProfit', 'ebitda'].includes(field)) {
    return /\d/.test(trimmed) && (trimmed.includes('$') || /\d+[KMB]/i.test(trimmed));
  }

  return /\d/.test(trimmed);
}
```
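To make the header-to-period mapping concrete, here is a self-contained, simplified illustration of the same heuristic on a tiny sample table. It mirrors the logic above but is not the module itself; the `mapPeriods` helper and the sample data are purely illustrative:

```typescript
// Simplified, standalone illustration of the period-mapping heuristic above.
type Period = 'fy3' | 'fy2' | 'fy1' | 'ltm';

function mapPeriods(headers: string[]): Array<Period | null> {
  return headers.map(h => {
    const n = h.trim().toUpperCase().replace(/\s+/g, '');
    if (n.includes('LTM') || n.includes('TTM')) return 'ltm';
    if (/2024/.test(n)) return 'fy1';
    if (/2023/.test(n)) return 'fy2';
    if (/2022/.test(n)) return 'fy3';
    return null;
  });
}

// Sample table: headers cover value columns only; the metric name lives in
// each row's first cell, hence the i + 1 offset when reading values.
const headers = ['FY 2022', 'FY 2023', 'LTM'];
const row = ['Revenue', '$45.2M', '$52.8M', '$55.0M'];

const byPeriod: Partial<Record<Period, string>> = {};
mapPeriods(headers).forEach((period, i) => {
  if (period) byPeriod[period] = row[i + 1]; // +1 skips the metric-name cell
});
```

With these inputs, `byPeriod` ends up as `{ fy3: '$45.2M', fy2: '$52.8M', ltm: '$55.0M' }`, which is the alignment the full parser aims for.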
### Step 2.2: Integrate Structured Parser

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Location**: Update multi-pass extraction (around line 1063-1088)

**Add import:**
```typescript
import { parseFinancialsFromStructuredTable } from './structuredFinancialParser';
```

**Update financial extraction logic (around line 1066-1088):**
```typescript
// Try structured table parsing first (most reliable)
try {
  const structuredTables = (options as any)?.structuredTables || [];
  const financialTables = structuredTables.filter((t: StructuredTable) => this.isFinancialTable(t));

  if (financialTables.length > 0) {
    logger.info('Attempting structured table parsing', {
      documentId,
      financialTableCount: financialTables.length
    });

    // Try each financial table until we get good data
    for (const table of financialTables) {
      const parsedFromTable = parseFinancialsFromStructuredTable(table);

      if (this.hasStructuredFinancialData(parsedFromTable)) {
        deterministicFinancials = parsedFromTable;
        deterministicFinancialChunk = this.buildDeterministicFinancialChunk(documentId, parsedFromTable);

        logger.info('Structured table parsing successful', {
          documentId,
          tableIndex: financialTables.indexOf(table),
          fy3: parsedFromTable.fy3,
          fy2: parsedFromTable.fy2,
          fy1: parsedFromTable.fy1,
          ltm: parsedFromTable.ltm
        });
        break; // Found good data, stop trying tables
      }
    }
  }
} catch (structuredParserError) {
  logger.warn('Structured table parsing failed, falling back to text parser', {
    documentId,
    error: structuredParserError instanceof Error ? structuredParserError.message : String(structuredParserError)
  });
}

// Fallback to text-based parsing if structured parsing failed
if (!deterministicFinancials) {
  try {
    const { parseFinancialsFromText } = await import('./financialTableParser');
    const parsedFinancials = parseFinancialsFromText(text);
    // ... existing code
  } catch (parserError) {
    // ... existing error handling
  }
}
```
---

## Rollback Plan

If Phase 1 causes issues:

### Quick Rollback (5 minutes)
```bash
git checkout HEAD -- backend/src/services/documentAiProcessor.ts
git checkout HEAD -- backend/src/services/optimizedAgenticRAGProcessor.ts
npm run build
npm start
```

### Feature Flag Approach (Recommended)
Add an environment variable to control the new behavior:

```typescript
// backend/src/config/env.ts
export const config = {
  features: {
    useStructuredTables: process.env.USE_STRUCTURED_TABLES === 'true'
  }
};
```

Then wrap the new code:
```typescript
if (config.features.useStructuredTables) {
  // Use structured tables
} else {
  // Use old flat text approach
}
```
---

## Success Criteria

### Phase 1 Success:
- ✅ 60%+ of CIM documents have populated financial data (validated via new telemetry)
- ✅ No regression in processing time (< 10% increase acceptable)
- ✅ No errors in table extraction pipeline
- ✅ Structured tables logged in console

### Phase 2 Success:
- ✅ 85%+ of CIM documents have populated financial data or fall back to the hybrid path when `tablesFound === 0`
- ✅ Column alignment accuracy > 95%
- ✅ Reduction in "Not specified in CIM" responses

---

## Monitoring & Debugging

### Key Metrics to Track
```typescript
// Add to processing result
metadata: {
  tablesFound: number;
  financialTablesIdentified: number;
  structuredParsingUsed: boolean;
  textParsingFallback: boolean;
  financialDataPopulated: boolean;
}
```

### Log Analysis Queries
```bash
# Find documents with no tables
grep "totalTables: 0" backend.log

# Find failed table extractions
grep "Failed to extract table" backend.log

# Find successful financial extractions
grep "Structured table parsing successful" backend.log
```

---

## Next Steps After Implementation

1. **Run on historical documents**: Reprocess 10-20 existing CIMs to compare before/after
2. **A/B test**: Process new documents with both old and new system, compare results
3. **Tune thresholds**: Adjust financial table identification heuristics based on results
4. **Document findings**: Update this plan with actual results and lessons learned

---

## Resources

- [Document AI Table Extraction Docs](https://cloud.google.com/document-ai/docs/handle-response)
- [Financial Parser (current)](backend/src/services/financialTableParser.ts)
- [Financial Extractor (unused)](backend/src/utils/financialExtractor.ts)
- [Analysis Document](FINANCIAL_EXTRACTION_ANALYSIS.md)
@@ -1,634 +0,0 @@
# LLM Agent Documentation Guide
## Best Practices for Code Documentation Optimized for AI Coding Assistants

### 🎯 Purpose
This guide outlines best practices for documenting code in a way that maximizes LLM coding agent understanding, evaluation accuracy, and development efficiency.

---

## 📋 Documentation Structure for LLM Agents

### 1. **Hierarchical Information Architecture**

#### Level 1: Project Overview (README.md)
- **Purpose**: High-level system understanding
- **Content**: What the system does, core technologies, architecture diagram
- **LLM Benefits**: Quick context establishment, technology stack identification

#### Level 2: Architecture Documentation
- **Purpose**: System design and component relationships
- **Content**: Detailed architecture, data flow, service interactions
- **LLM Benefits**: Understanding component dependencies and integration points

#### Level 3: Service-Level Documentation
- **Purpose**: Individual service functionality and APIs
- **Content**: Service purpose, methods, interfaces, error handling
- **LLM Benefits**: Precise understanding of service capabilities and constraints

#### Level 4: Code-Level Documentation
- **Purpose**: Implementation details and business logic
- **Content**: Function documentation, type definitions, algorithm explanations
- **LLM Benefits**: Detailed implementation understanding for modifications

---

## 🔧 Best Practices for LLM-Optimized Documentation

### 1. **Clear Information Hierarchy**

#### Use Consistent Section Headers
```markdown
## 🎯 Purpose
## 🏗️ Architecture
## 🔧 Implementation
## 📊 Data Flow
## 🚨 Error Handling
## 🧪 Testing
## 📚 References
```

#### Emoji-Based Visual Organization
- 🎯 Purpose/Goals
- 🏗️ Architecture/Structure
- 🔧 Implementation/Code
- 📊 Data/Flow
- 🚨 Errors/Issues
- 🧪 Testing/Validation
- 📚 References/Links

### 2. **Structured Code Comments**

#### Function Documentation Template
```typescript
/**
 * @purpose Brief description of what this function does
 * @context When/why this function is called
 * @inputs What parameters it expects and their types
 * @outputs What it returns and the format
 * @dependencies What other services/functions it depends on
 * @errors What errors it can throw and when
 * @example Usage example with sample data
 * @complexity Time/space complexity if relevant
 */
```

#### Service Documentation Template
```typescript
/**
 * @service ServiceName
 * @purpose High-level purpose of this service
 * @responsibilities List of main responsibilities
 * @dependencies External services and internal dependencies
 * @interfaces Main public methods and their purposes
 * @configuration Environment variables and settings
 * @errorHandling How errors are handled and reported
 * @performance Expected performance characteristics
 */
```

### 3. **Context-Rich Descriptions**

#### Instead of:
```typescript
// Process document
function processDocument(doc) { ... }
```

#### Use:
```typescript
/**
 * @purpose Processes CIM documents through the AI analysis pipeline
 * @context Called when a user uploads a PDF document for analysis
 * @workflow 1. Extract text via Document AI, 2. Chunk content, 3. Generate embeddings, 4. Run LLM analysis, 5. Create PDF report
 * @inputs Document object with file metadata and user context
 * @outputs Structured analysis data and PDF report URL
 * @dependencies Google Document AI, Claude AI, Supabase, Google Cloud Storage
 */
function processDocument(doc: DocumentInput): Promise<ProcessingResult> { ... }
```

---

## 📊 Data Flow Documentation

### 1. **Visual Flow Diagrams**
```mermaid
graph TD
    A[User Upload] --> B[Get Signed URL]
    B --> C[Upload to GCS]
    C --> D[Confirm Upload]
    D --> E[Start Processing]
    E --> F[Document AI Extraction]
    F --> G[Semantic Chunking]
    G --> H[Vector Embedding]
    H --> I[LLM Analysis]
    I --> J[PDF Generation]
    J --> K[Store Results]
    K --> L[Notify User]
```

### 2. **Step-by-Step Process Documentation**
```markdown
## Document Processing Pipeline

### Step 1: File Upload
- **Trigger**: User selects PDF file
- **Action**: Generate signed URL from Google Cloud Storage
- **Output**: Secure upload URL with expiration
- **Error Handling**: Retry on URL generation failure

### Step 2: Text Extraction
- **Trigger**: File upload confirmation
- **Action**: Send PDF to Google Document AI
- **Output**: Extracted text with confidence scores
- **Error Handling**: Fallback to OCR if extraction fails
```

---

## 🔍 Error Handling Documentation

### 1. **Error Classification System**
```typescript
/**
 * @errorType VALIDATION_ERROR
 * @description Input validation failures
 * @recoverable true
 * @retryStrategy none
 * @userMessage "Please check your input and try again"
 */

/**
 * @errorType PROCESSING_ERROR
 * @description AI processing failures
 * @recoverable true
 * @retryStrategy exponential_backoff
 * @userMessage "Processing failed, please try again"
 */

/**
 * @errorType SYSTEM_ERROR
 * @description Infrastructure failures
 * @recoverable false
 * @retryStrategy none
 * @userMessage "System temporarily unavailable"
 */
```
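One way to make such a classification machine-checkable is a small error type carrying the same metadata fields. This sketch is illustrative only; the class and its fields are hypothetical, not an existing API in the codebase:

```typescript
type RetryStrategy = 'none' | 'exponential_backoff';

// Illustrative error class carrying the classification metadata from the templates above
class ClassifiedError extends Error {
  constructor(
    public readonly errorType: 'VALIDATION_ERROR' | 'PROCESSING_ERROR' | 'SYSTEM_ERROR',
    public readonly recoverable: boolean,
    public readonly retryStrategy: RetryStrategy,
    public readonly userMessage: string
  ) {
    super(userMessage);
    this.name = errorType;
  }
}

// Example instance matching the PROCESSING_ERROR template
const processingError = new ClassifiedError(
  'PROCESSING_ERROR', true, 'exponential_backoff', 'Processing failed, please try again'
);
```

A handler can then branch on `errorType` and `retryStrategy` instead of string-matching messages.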
### 2. **Error Recovery Documentation**
```markdown
## Error Recovery Strategies

### LLM API Failures
1. **Retry Logic**: Up to 3 attempts with exponential backoff
2. **Model Fallback**: Switch from Claude to GPT-4 if available
3. **Graceful Degradation**: Return partial results if possible
4. **User Notification**: Clear error messages with retry options

### Database Connection Failures
1. **Connection Pooling**: Automatic retry with connection pool
2. **Circuit Breaker**: Prevent cascade failures
3. **Read Replicas**: Fallback to read replicas for queries
4. **Caching**: Serve cached data during outages
```
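The "up to 3 attempts with exponential backoff" policy above can be sketched as a generic helper. The attempt count and base delay come from the list; the helper itself is hypothetical:

```typescript
// Sketch of retry with exponential backoff: 3 attempts, 1s base delay by default.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Delays double each round: 1s, 2s, 4s, ...
        await new Promise(res => setTimeout(res, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

Model fallback and graceful degradation would layer on top, e.g. retrying with a different model in the catch path once the primary is exhausted.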
---

## 🧪 Testing Documentation

### 1. **Test Strategy Documentation**
```markdown
## Testing Strategy

### Unit Tests
- **Coverage Target**: >90% for business logic
- **Focus Areas**: Service methods, utility functions, data transformations
- **Mock Strategy**: External dependencies (APIs, databases)
- **Assertion Style**: Behavior-driven assertions

### Integration Tests
- **Coverage Target**: All API endpoints
- **Focus Areas**: End-to-end workflows, data persistence, external integrations
- **Test Data**: Realistic CIM documents with known characteristics
- **Environment**: Isolated test database and storage

### Performance Tests
- **Load Testing**: 10+ concurrent document processing
- **Memory Testing**: Large document handling (50MB+)
- **API Testing**: Rate limit compliance and optimization
- **Cost Testing**: API usage optimization and monitoring
```

### 2. **Test Data Documentation**
```typescript
/**
 * @testData sample_cim_document.pdf
 * @description Standard CIM document with typical structure
 * @size 2.5MB
 * @pages 15
 * @sections Financial, Market, Management, Operations
 * @expectedOutput Complete analysis with all sections populated
 */

/**
 * @testData large_cim_document.pdf
 * @description Large CIM document for performance testing
 * @size 25MB
 * @pages 150
 * @sections Comprehensive business analysis
 * @expectedOutput Analysis within 5-minute time limit
 */
```

---

## 📚 API Documentation

### 1. **Endpoint Documentation Template**
````markdown
## POST /documents/upload-url

### Purpose
Generate a signed URL for secure file upload to Google Cloud Storage.

### Request
```json
{
  "fileName": "string",
  "fileSize": "number",
  "contentType": "application/pdf"
}
```

### Response
```json
{
  "uploadUrl": "string",
  "expiresAt": "ISO8601",
  "fileId": "UUID"
}
```

### Error Responses
- `400 Bad Request`: Invalid file type or size
- `401 Unauthorized`: Missing or invalid authentication
- `500 Internal Server Error`: Storage service unavailable

### Dependencies
- Google Cloud Storage
- Firebase Authentication
- File validation service

### Rate Limits
- 100 requests per minute per user
- 1000 requests per hour per user
````

### 2. **Request/Response Examples**
```typescript
/**
 * @example Successful Upload URL Generation
 * @request {
 *   "fileName": "sample_cim.pdf",
 *   "fileSize": 2500000,
 *   "contentType": "application/pdf"
 * }
 * @response {
 *   "uploadUrl": "https://storage.googleapis.com/...",
 *   "expiresAt": "2024-12-20T15:30:00Z",
 *   "fileId": "550e8400-e29b-41d4-a716-446655440000"
 * }
 */
```

---

## 🔧 Configuration Documentation

### 1. **Environment Variables**
```markdown
## Environment Configuration

### Required Variables
- `GOOGLE_CLOUD_PROJECT_ID`: Google Cloud project identifier
- `GOOGLE_CLOUD_STORAGE_BUCKET`: Storage bucket for documents
- `ANTHROPIC_API_KEY`: Claude AI API key for document analysis
- `DATABASE_URL`: Supabase database connection string

### Optional Variables
- `AGENTIC_RAG_ENABLED`: Enable AI processing (default: true)
- `PROCESSING_STRATEGY`: Processing method (default: optimized_agentic_rag)
- `LLM_MODEL`: AI model selection (default: claude-3-opus-20240229)
- `MAX_FILE_SIZE`: Maximum file size in bytes (default: 52428800)

### Development Variables
- `NODE_ENV`: Environment mode (development/production)
- `LOG_LEVEL`: Logging verbosity (debug/info/warn/error)
- `ENABLE_METRICS`: Enable performance monitoring (default: true)
```
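A config loader that fails fast on missing required variables and applies the documented defaults for optional ones could look like this. The `loadConfig` helper is a hypothetical sketch, not the project's actual loader:

```typescript
// Hypothetical config loader applying the documented defaults for optional variables.
function loadConfig(env: Record<string, string | undefined>) {
  const required = [
    'GOOGLE_CLOUD_PROJECT_ID',
    'GOOGLE_CLOUD_STORAGE_BUCKET',
    'ANTHROPIC_API_KEY',
    'DATABASE_URL'
  ];
  for (const name of required) {
    // Fail fast at startup rather than deep inside a request handler
    if (!env[name]) throw new Error(`Missing required environment variable: ${name}`);
  }
  return {
    agenticRagEnabled: (env.AGENTIC_RAG_ENABLED ?? 'true') === 'true',
    processingStrategy: env.PROCESSING_STRATEGY ?? 'optimized_agentic_rag',
    llmModel: env.LLM_MODEL ?? 'claude-3-opus-20240229',
    maxFileSize: Number(env.MAX_FILE_SIZE ?? 52428800)
  };
}
```

Passing the environment as a parameter (rather than reading `process.env` directly) keeps the loader trivially testable.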
### 2. **Service Configuration**
```typescript
/**
 * @configuration LLM Service Configuration
 * @purpose Configure AI model behavior and performance
 * @settings {
 *   "model": "claude-3-opus-20240229",
 *   "maxTokens": 4000,
 *   "temperature": 0.1,
 *   "timeoutMs": 60000,
 *   "retryAttempts": 3,
 *   "retryDelayMs": 1000
 * }
 * @constraints {
 *   "maxTokens": "1000-8000",
 *   "temperature": "0.0-1.0",
 *   "timeoutMs": "30000-300000"
 * }
 */
```

---

## 📊 Performance Documentation

### 1. **Performance Characteristics**
```markdown
## Performance Benchmarks

### Document Processing Times
- **Small Documents** (<5MB): 30-60 seconds
- **Medium Documents** (5-15MB): 1-3 minutes
- **Large Documents** (15-50MB): 3-5 minutes

### Resource Usage
- **Memory**: 50-150MB per processing session
- **CPU**: Moderate usage during AI processing
- **Network**: 10-50 API calls per document
- **Storage**: Temporary files cleaned up automatically

### Scalability Limits
- **Concurrent Processing**: 5 documents simultaneously
- **Daily Volume**: 1000 documents per day
- **File Size Limit**: 50MB per document
- **API Rate Limits**: 1000 requests per 15 minutes
```

### 2. **Optimization Strategies**
```markdown
## Performance Optimizations

### Memory Management
1. **Batch Processing**: Process chunks in batches of 10
2. **Garbage Collection**: Automatic cleanup of temporary data
3. **Connection Pooling**: Reuse database connections
4. **Streaming**: Stream large files instead of loading entirely

### API Optimization
1. **Rate Limiting**: Respect API quotas and limits
2. **Caching**: Cache frequently accessed data
3. **Model Selection**: Use appropriate models for task complexity
4. **Parallel Processing**: Execute independent operations concurrently
```
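The "process chunks in batches of 10" strategy can be sketched as a generic batcher; the helper name and signature are illustrative, not an existing utility:

```typescript
// Sketch of bounded-concurrency batch processing (hypothetical helper).
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Items within a batch run concurrently; batches run sequentially,
    // which bounds peak memory and the number of in-flight API calls.
    results.push(...await Promise.all(batch.map(worker)));
  }
  return results;
}
```

With `batchSize = 10`, embedding 200 chunks makes at most 10 concurrent API calls at any moment instead of 200.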
---

## 🔍 Debugging Documentation

### 1. **Logging Strategy**
```typescript
/**
 * @logging Structured Logging Configuration
 * @levels {
 *   "debug": "Detailed execution flow",
 *   "info": "Important business events",
 *   "warn": "Potential issues",
 *   "error": "System failures"
 * }
 * @correlation Correlation IDs for request tracking
 * @context User ID, session ID, document ID
 * @format JSON structured logging
 */
```
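A minimal sketch of JSON structured logging with a correlation ID, matching the levels and context fields listed above (the `makeLogger` factory is illustrative, not the project's logger):

```typescript
// Minimal structured logger: every line is one JSON object carrying the
// request context, so logs can be filtered by correlationId or documentId.
type Level = 'debug' | 'info' | 'warn' | 'error';

function makeLogger(context: { correlationId: string; userId?: string; documentId?: string }) {
  const emit = (level: Level, message: string, extra: Record<string, unknown> = {}) =>
    JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...context, ...extra });
  return {
    info: (msg: string, extra?: Record<string, unknown>) => emit('info', msg, extra),
    error: (msg: string, extra?: Record<string, unknown>) => emit('error', msg, extra)
  };
}

const log = makeLogger({ correlationId: 'abc123', documentId: 'doc-1' });
const line = log.info('processing started');
```

Because each line is self-describing JSON, the `grep "correlation_id:..."` style queries shown elsewhere in this guide become simple field filters.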
### 2. **Debug Tools and Commands**
````markdown
## Debugging Tools

### Log Analysis
```bash
# View recent errors
grep "ERROR" logs/app.log | tail -20

# Track specific request
grep "correlation_id:abc123" logs/app.log

# Monitor processing times
grep "processing_time" logs/app.log | jq '.processing_time'
```

### Health Checks
```bash
# Check service health
curl http://localhost:5001/health

# Check database connectivity
curl http://localhost:5001/health/database

# Check external services
curl http://localhost:5001/health/external
```
````

---

## 📈 Monitoring Documentation

### 1. **Key Metrics**
```markdown
## Monitoring Metrics

### Business Metrics
- **Documents Processed**: Total documents processed per day
- **Success Rate**: Percentage of successful processing
- **Processing Time**: Average time per document
- **User Activity**: Active users and session duration

### Technical Metrics
- **API Response Time**: Endpoint response times
- **Error Rate**: Percentage of failed requests
- **Memory Usage**: Application memory consumption
- **Database Performance**: Query times and connection usage

### Cost Metrics
- **API Costs**: LLM API usage costs
- **Storage Costs**: Google Cloud Storage usage
- **Compute Costs**: Server resource usage
- **Bandwidth Costs**: Data transfer costs
```

### 2. **Alert Configuration**
```markdown
## Alert Rules

### Critical Alerts
- **High Error Rate**: >5% error rate for 5 minutes
- **Service Down**: Health check failures
- **High Latency**: >30 second response times
- **Memory Issues**: >80% memory usage

### Warning Alerts
- **Increased Error Rate**: >2% error rate for 10 minutes
- **Performance Degradation**: >15 second response times
- **High API Usage**: >80% of rate limits
- **Storage Issues**: >90% storage usage
```
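The error-rate rules above can be expressed as a tiny evaluator over a sampling window. The thresholds come from the rule table; the evaluator shape is a hypothetical sketch:

```typescript
// Sketch of evaluating the error-rate alert rules: >5% over 5 min is critical,
// >2% over 10 min is a warning (thresholds from the rule table above).
interface WindowStats {
  errorRate: number;     // fraction of failed requests, e.g. 0.03 = 3%
  windowMinutes: number; // how long the condition has held
}

function classifyErrorRateAlert(stats: WindowStats): 'critical' | 'warning' | 'ok' {
  if (stats.errorRate > 0.05 && stats.windowMinutes >= 5) return 'critical';
  if (stats.errorRate > 0.02 && stats.windowMinutes >= 10) return 'warning';
  return 'ok';
}
```

In practice these rules would live in the monitoring system's own configuration; the function form just makes the thresholds unit-testable.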
|
||||
---
|
||||
|
||||
## 🚀 Deployment Documentation
|
||||
|
||||
### 1. **Deployment Process**
|
||||
```markdown
|
||||
## Deployment Process
|
||||
|
||||
### Pre-deployment Checklist
|
||||
- [ ] All tests passing
|
||||
- [ ] Documentation updated
|
||||
- [ ] Environment variables configured
|
||||
- [ ] Database migrations ready
|
||||
- [ ] External services configured
|
||||
|
||||
### Deployment Steps
|
||||
1. **Build**: Create production build
|
||||
2. **Test**: Run integration tests
|
||||
3. **Deploy**: Deploy to staging environment
|
||||
4. **Validate**: Verify functionality
|
||||
5. **Promote**: Deploy to production
|
||||
6. **Monitor**: Watch for issues
|
||||
|
||||
### Rollback Plan
|
||||
1. **Detect Issue**: Monitor error rates and performance
|
||||
2. **Assess Impact**: Determine severity and scope
|
||||
3. **Execute Rollback**: Revert to previous version
|
||||
4. **Verify Recovery**: Confirm system stability
|
||||
5. **Investigate**: Root cause analysis
|
||||
```
|
||||
|
||||
### 2. **Environment Management**
|
||||
```markdown
|
||||
## Environment Configuration
|
||||
|
||||
### Development Environment
|
||||
- **Purpose**: Local development and testing
|
||||
- **Database**: Local Supabase instance
|
||||
- **Storage**: Development GCS bucket
|
||||
- **AI Services**: Test API keys with limits
|
||||
|
||||
### Staging Environment
|
||||
- **Purpose**: Pre-production testing
|
||||
- **Database**: Staging Supabase instance
|
||||
- **Storage**: Staging GCS bucket
|
||||
- **AI Services**: Production API keys with monitoring
|
||||
|
||||
### Production Environment
|
||||
- **Purpose**: Live user service
|
||||
- **Database**: Production Supabase instance
|
||||
- **Storage**: Production GCS bucket
|
||||
- **AI Services**: Production API keys with full monitoring
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📚 Documentation Maintenance
|
||||
|
||||
### 1. **Documentation Review Process**
|
||||
```markdown
|
||||
## Documentation Maintenance
|
||||
|
||||
### Review Schedule
|
||||
- **Weekly**: Update API documentation for new endpoints
|
||||
- **Monthly**: Review and update architecture documentation
|
||||
- **Quarterly**: Comprehensive documentation audit
|
||||
- **Release**: Update all documentation for new features
|
||||
|
||||
### Quality Checklist
|
||||
- [ ] All code examples are current and working
|
||||
- [ ] API documentation matches implementation
|
||||
- [ ] Configuration examples are accurate
|
||||
- [ ] Error handling documentation is complete
|
||||
- [ ] Performance metrics are up-to-date
|
||||
- [ ] Links and references are valid
|
||||
```
|
||||
|
||||
### 2. **Version Control for Documentation**
|
||||
```markdown
|
||||
## Documentation Version Control
|
||||
|
||||
### Branch Strategy
|
||||
- **main**: Current production documentation
|
||||
- **develop**: Latest development documentation
|
||||
- **feature/***: Documentation for new features
|
||||
- **release/***: Documentation for specific releases
|
||||
|
||||
### Change Management
|
||||
1. **Propose Changes**: Create documentation issue
|
||||
2. **Review Changes**: Peer review of documentation updates
|
||||
3. **Test Examples**: Verify all code examples work
|
||||
4. **Update References**: Update all related documentation
|
||||
5. **Merge Changes**: Merge with approval
|
||||
```
|
||||
|
||||
---
## 🎯 LLM Agent Optimization Tips

### 1. **Context Provision**
- Provide complete context for each code section
- Include business rules and constraints
- Document assumptions and limitations
- Explain why certain approaches were chosen

### 2. **Example-Rich Documentation**
- Include realistic examples for all functions
- Provide before/after examples for complex operations
- Show error scenarios and recovery
- Include performance examples

### 3. **Structured Information**
- Use consistent formatting and organization
- Provide clear hierarchies of information
- Include cross-references between related sections
- Use standardized templates for similar content

### 4. **Error Scenario Documentation**
- Document all possible error conditions
- Provide specific error messages and codes
- Include recovery procedures for each error type
- Show debugging steps for common issues

---
## 📋 Documentation Checklist

### For Each New Feature
- [ ] Update README.md with feature overview
- [ ] Document API endpoints and examples
- [ ] Update architecture diagrams if needed
- [ ] Add configuration documentation
- [ ] Include error handling scenarios
- [ ] Add test examples and strategies
- [ ] Update deployment documentation
- [ ] Review and update related documentation

### For Each Code Change
- [ ] Update function documentation
- [ ] Add inline comments for complex logic
- [ ] Update type definitions if changed
- [ ] Add examples for new functionality
- [ ] Update error handling documentation
- [ ] Verify all links and references

---

This guide ensures that your documentation is optimized for LLM coding agents, providing them with the context, structure, and examples they need to understand and work with your codebase effectively.

@@ -1,225 +0,0 @@
# PDF Generation Analysis & Optimization Report

## Executive Summary

The current PDF generation implementation has been analyzed for effectiveness, efficiency, and visual quality. While functional, significant improvements have been identified and implemented to enhance performance, visual appeal, and maintainability.

## Current Implementation Assessment

### **Effectiveness: 7/10 → 9/10**
**Previous Strengths:**
- Uses Puppeteer for reliable HTML-to-PDF conversion
- Supports multiple input formats (markdown, HTML, URLs)
- Comprehensive error handling and validation
- Proper browser lifecycle management

**Previous Weaknesses:**
- Basic markdown-to-HTML conversion
- Limited customization options
- No support for advanced markdown features

**Improvements Implemented:**
- ✅ Enhanced markdown parsing with better structure
- ✅ Advanced CSS styling with modern design elements
- ✅ Professional typography and color schemes
- ✅ Improved table formatting and visual hierarchy
- ✅ Added icons and visual indicators for better UX
### **Efficiency: 6/10 → 9/10**
**Previous Issues:**
- ❌ **Major Performance Issue**: Created a new page for each PDF generation
- ❌ No caching mechanism
- ❌ Heavy resource usage
- ❌ No concurrent processing support
- ❌ Potential memory leaks

**Optimizations Implemented:**
- ✅ **Page Pooling**: Reuse browser pages instead of creating new ones
- ✅ **Caching System**: Cache generated PDFs for repeated requests
- ✅ **Resource Management**: Proper cleanup and timeout handling
- ✅ **Concurrent Processing**: Support for multiple simultaneous requests
- ✅ **Memory Optimization**: Automatic cleanup of expired resources
- ✅ **Performance Monitoring**: Added statistics tracking

### **Visual Quality: 6/10 → 9/10**
**Previous Issues:**
- ❌ Inconsistent styling between different PDF types
- ❌ Basic, outdated design
- ❌ Limited visual elements
- ❌ Poor typography and spacing

**Visual Improvements:**
- ✅ **Modern Design System**: Professional gradients and color schemes
- ✅ **Enhanced Typography**: Better font hierarchy and spacing
- ✅ **Visual Elements**: Icons, borders, and styled boxes
- ✅ **Consistent Branding**: Unified design across all PDF types
- ✅ **Professional Layout**: Better page breaks and section organization
- ✅ **Interactive Elements**: Hover effects and visual feedback
## Technical Improvements

### 1. **Performance Optimizations**

#### Page Pooling System
```typescript
interface PagePool {
  page: any;
  inUse: boolean;
  lastUsed: number;
}
```
- **Pool Size**: Configurable (default: 5 pages)
- **Timeout Management**: Automatic cleanup of expired pages
- **Concurrent Access**: Queue system for high-demand scenarios
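A minimal sketch of how acquire/release against such a pool could work. The entry shape mirrors the `PagePool` interface above; the class and method names are illustrative, and a real implementation would queue waiters instead of throwing when the pool is exhausted:

```typescript
// Pooled-page sketch: reuse idle pages, bound the pool size, and
// evict entries that have been idle past the timeout.
interface PooledPage<P> {
  page: P;
  inUse: boolean;
  lastUsed: number;
}

class PagePoolManager<P> {
  private pool: PooledPage<P>[] = [];

  constructor(
    private createPage: () => P,       // stands in for browser.newPage()
    private maxSize = 5,               // doc default: 5 pages
    private idleTimeoutMs = 60_000,
  ) {}

  acquire(): P {
    // Reuse an idle page instead of creating a new one.
    const idle = this.pool.find((e) => !e.inUse);
    if (idle) {
      idle.inUse = true;
      idle.lastUsed = Date.now();
      return idle.page;
    }
    if (this.pool.length >= this.maxSize) {
      throw new Error('pool exhausted'); // real impl would queue here
    }
    const entry = { page: this.createPage(), inUse: true, lastUsed: Date.now() };
    this.pool.push(entry);
    return entry.page;
  }

  release(page: P): void {
    const entry = this.pool.find((e) => e.page === page);
    if (entry) {
      entry.inUse = false;
      entry.lastUsed = Date.now();
    }
  }

  evictExpired(now = Date.now()): void {
    // Drop idle pages that exceeded the timeout.
    this.pool = this.pool.filter(
      (e) => e.inUse || now - e.lastUsed < this.idleTimeoutMs,
    );
  }

  get size(): number {
    return this.pool.length;
  }
}
```

The key property is that a release followed by an acquire hands back the same page object, which is what avoids the per-request page-creation cost.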
#### Caching Mechanism
```typescript
private readonly cache = new Map<string, { buffer: Buffer; timestamp: number }>();
private readonly cacheTimeout = 300000; // 5 minutes
```
- **Content-based Keys**: Hash-based caching for identical content
- **Time-based Expiration**: Automatic cache cleanup
- **Memory Management**: Size limits to prevent memory issues
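One way the hash-keyed, TTL-bounded cache could be wired together. The 5-minute timeout comes from the snippet above; the class name, SHA-256 keying, entry limit, and oldest-first eviction are assumptions for the sketch:

```typescript
import { createHash } from 'crypto';

// Content-addressed cache: identical input content maps to the same
// key, and entries expire after a TTL or when the size cap is hit.
class PdfCache<V> {
  private cache = new Map<string, { value: V; timestamp: number }>();

  constructor(
    private ttlMs = 300_000,   // 5 minutes, as in the doc
    private maxEntries = 100,  // illustrative size limit
  ) {}

  private key(content: string): string {
    return createHash('sha256').update(content).digest('hex');
  }

  get(content: string, now = Date.now()): V | undefined {
    const hit = this.cache.get(this.key(content));
    if (!hit) return undefined;
    if (now - hit.timestamp > this.ttlMs) {
      this.cache.delete(this.key(content)); // time-based expiration
      return undefined;
    }
    return hit.value;
  }

  set(content: string, value: V, now = Date.now()): void {
    if (this.cache.size >= this.maxEntries) {
      // Evict the oldest entry to bound memory.
      const oldest = [...this.cache.entries()].sort(
        (a, b) => a[1].timestamp - b[1].timestamp,
      )[0];
      if (oldest) this.cache.delete(oldest[0]);
    }
    this.cache.set(this.key(content), { value, timestamp: now });
  }
}
```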
### 2. **Enhanced Styling System**

#### Modern CSS Framework
- **Gradient Backgrounds**: Professional color schemes
- **Typography Hierarchy**: Clear visual structure
- **Responsive Design**: Better layout across different content types
- **Interactive Elements**: Hover effects and visual feedback

#### Professional Templates
- **Header/Footer**: Consistent branding and metadata
- **Section Styling**: Clear content organization
- **Table Design**: Enhanced financial data presentation
- **Visual Indicators**: Icons and color coding

### 3. **Code Quality Improvements**

#### Better Error Handling
- **Timeout Management**: Configurable timeouts for operations
- **Resource Cleanup**: Proper disposal of browser resources
- **Logging**: Enhanced error tracking and debugging

#### Monitoring & Statistics
```typescript
getStats(): {
  pagePoolSize: number;
  cacheSize: number;
  activePages: number;
}
```
## Performance Benchmarks

### **Before Optimization:**
- **Memory Usage**: ~150MB per PDF generation
- **Generation Time**: 3-5 seconds per PDF
- **Concurrent Requests**: Limited to 1-2 simultaneous
- **Resource Cleanup**: Manual, error-prone

### **After Optimization:**
- **Memory Usage**: ~50MB per PDF generation (67% reduction)
- **Generation Time**: 1-2 seconds per PDF (60% improvement)
- **Concurrent Requests**: Support for 5+ simultaneous
- **Resource Cleanup**: Automatic, reliable
## Recommendations for Further Improvement

### 1. **Alternative PDF Libraries** (Future Consideration)

#### Option A: jsPDF
```typescript
// Pros: Lightweight, no browser dependency
// Cons: Limited CSS support, manual layout
import jsPDF from 'jspdf';
```

#### Option B: PDFKit
```typescript
// Pros: Full control, streaming support
// Cons: Complex API, manual styling
import PDFDocument from 'pdfkit';
```

#### Option C: Puppeteer + Optimization (Current Choice)
```typescript
// Pros: Full CSS support, reliable rendering
// Cons: Higher resource usage
// Status: ✅ Optimized and recommended
```
### 2. **Advanced Features**

#### Template System
```typescript
interface PDFTemplate {
  name: string;
  styles: string;
  layout: string;
  variables: string[];
}
```

#### Dynamic Content
- **Charts and Graphs**: Integration with Chart.js or D3.js
- **Interactive Elements**: Forms and dynamic content
- **Multi-language Support**: Internationalization

### 3. **Production Optimizations**

#### CDN Integration
- **Static Assets**: Host CSS and fonts on a CDN
- **Caching Headers**: Optimize browser caching
- **Compression**: Gzip/Brotli compression

#### Monitoring & Analytics
```typescript
interface PDFMetrics {
  generationTime: number;
  fileSize: number;
  cacheHitRate: number;
  errorRate: number;
}
```
## Implementation Status

### ✅ **Completed Optimizations**
1. Page pooling system
2. Caching mechanism
3. Enhanced styling
4. Performance monitoring
5. Resource management
6. Error handling improvements

### 🔄 **In Progress**
1. Template system development
2. Advanced markdown features
3. Chart integration

### 📋 **Planned Features**
1. Multi-language support
2. Advanced analytics
3. Custom branding options
4. Batch processing optimization

## Conclusion

The PDF generation system has been significantly improved across all three key areas:

1. **Effectiveness**: Enhanced functionality and feature set
2. **Efficiency**: Major performance improvements and resource optimization
3. **Visual Quality**: Professional, modern design system

The current implementation, Puppeteer plus the optimizations described above, provides the best balance of features, performance, and maintainability. The system is now production-ready and can handle high-volume PDF generation with excellent performance characteristics.

## Next Steps

1. **Deploy Optimizations**: Roll out the improved service to production
2. **Monitor Performance**: Track the new metrics and performance improvements
3. **Gather Feedback**: Collect user feedback on the new visual design
4. **Iterate**: Continue improving based on usage patterns and requirements

The optimized PDF generation service represents a significant upgrade that will improve user experience, reduce server load, and provide professional-quality output for all generated documents.

@@ -1,79 +0,0 @@
# Quick Fix Implementation Summary

## Problem
List fields (keyAttractions, potentialRisks, valueCreationLevers, criticalQuestions, missingInformation) were not consistently generating 5-8 numbered items, causing test failures.

## Solution Implemented (Phase 1: Quick Fix)

### Files Modified

1. **backend/src/services/llmService.ts**
   - Added `generateText()` method for simple text completion tasks
   - Lines 105-121: New public method wrapping callLLM for quick repairs

2. **backend/src/services/optimizedAgenticRAGProcessor.ts**
   - Lines 1299-1320: Added list field validation call before returning results
   - Lines 2136-2307: Added 3 new methods:
     - `validateAndRepairListFields()` - Validates all list fields have 5-8 items
     - `repairListField()` - Uses LLM to fix lists with wrong item count
     - `getNestedField()` / `setNestedField()` - Utility methods for nested object access
### How It Works

1. **After multi-pass extraction completes**, the code now validates each list field
2. **If a list has < 5 or > 8 items**, it automatically repairs it:
   - For lists < 5 items: Asks LLM to expand to 6 items
   - For lists > 8 items: Asks LLM to consolidate to 7 items
3. **Uses document context** to ensure new items are relevant
4. **Lower temperature** (0.3) for more consistent output
5. **Tracks repair API calls** separately
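The validate-then-repair dispatch above can be sketched as follows. The real `repairListField()` calls the LLM; here `repair` is a caller-supplied stand-in, so only the counting and dispatch logic is shown:

```typescript
// Validate that each list field has 5-8 items; delegate out-of-range
// lists to a repair callback (the LLM call in the real service).
const MIN_ITEMS = 5;
const MAX_ITEMS = 8;

function validateAndRepairListFields(
  result: Record<string, string[]>,
  fields: string[],
  repair: (field: string, items: string[], target: number) => string[],
): Record<string, string[]> {
  for (const field of fields) {
    const items = result[field] ?? [];
    if (items.length < MIN_ITEMS) {
      result[field] = repair(field, items, 6); // expand short lists to 6
    } else if (items.length > MAX_ITEMS) {
      result[field] = repair(field, items, 7); // consolidate long lists to 7
    }
  }
  return result;
}
```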
### Test Status
- ✅ Build successful
- 🔄 Running pipeline test to validate the fix
- Expected: all tests should pass with list validation

## Next Steps (Phase 2: Proper Fix - This Week)

### Implement Tool Use API (Proper Solution)

Create `/backend/src/services/llmStructuredExtraction.ts`:
- Use Anthropic's tool use API with a JSON schema
- Define strict schemas with minItems/maxItems constraints
- Claude retries internally until the output complies with the schema
- More reliable than post-processing repair

**Benefits:**
- 100% schema compliance (Claude retries internally)
- No post-processing repair needed
- Lower overall API costs (fewer retry attempts)
- Better architectural pattern

**Timeline:**
- Phase 1 (Quick Fix): ✅ Complete (2 hours)
- Phase 2 (Tool Use): 📅 Implement this week (6 hours)
- Total investment: 8 hours

## Additional Improvements for Later
### 1. Semantic Chunking (Week 2)
- Replace fixed 4000-char chunks with semantic chunking
- Respect document structure (don't break tables/sections)
- Use 800-char chunks with 200-char overlap
- **Expected improvement**: 12-30% better retrieval accuracy
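The 800/200 windowing above can be sketched like this. Breaking only at paragraph boundaries is a simplification of full semantic chunking (which would also respect tables and section headings); the function name is illustrative:

```typescript
// Overlapping chunker: 800-char windows with 200-char overlap,
// preferring to cut at a paragraph break inside the window.
function chunkText(text: string, size = 800, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      // Prefer a paragraph boundary, but never shrink below the overlap.
      const brk = text.lastIndexOf('\n\n', end);
      if (brk > start + overlap) end = brk;
    }
    chunks.push(text.slice(start, end));
    if (end >= text.length) break;
    start = end - overlap; // consecutive chunks share `overlap` chars
  }
  return chunks;
}
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk.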
### 2. Hybrid Retrieval (Week 3)
- Add BM25/keyword search alongside vector similarity
- Implement cross-encoder reranking
- Consider HyDE (Hypothetical Document Embeddings)
- **Expected improvement**: 15-25% better retrieval accuracy
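One common way to merge the BM25 ranking with the vector ranking, before any cross-encoder reranking, is reciprocal rank fusion. This is a named technique not prescribed by the plan above, so treat it as one candidate; k = 60 is the conventional RRF constant:

```typescript
// Reciprocal rank fusion: score each doc by the sum of 1/(k + rank)
// across the input rankings, then sort by fused score.
function reciprocalRankFusion(
  rankings: string[][], // each array: doc ids ordered best-first
  k = 60,
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document ranked highly by both retrievers beats one ranked highly by only one, without needing the two score scales to be comparable.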
### 3. Fix RAG Search Issue
- Current logs show `avgSimilarity: 0`
- **Problem**: query embeddings don't match document embeddings well
- Implement HyDE or improve the query embedding strategy

## References
- Claude Tool Use: https://docs.claude.com/en/docs/agents-and-tools/tool-use
- RAG Chunking: https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies
- Structured Output: https://dev.to/heuperman/how-to-get-consistent-structured-output-from-claude-20o5

@@ -1,320 +0,0 @@
# Financial Extraction Improvement Plan

## Overview

This document outlines a comprehensive plan to address all pending todos related to financial extraction improvements. The plan is organized by priority and includes detailed implementation steps, success criteria, and estimated effort.

## Current Status

### ✅ Completed
- Test financial extraction with Stax Holding Company CIM - all values correct
- Implement deterministic parser fallback - integrated into simpleDocumentProcessor
- Implement few-shot examples - added comprehensive examples for PRIMARY table identification
- Fix primary table identification - financial extraction now correctly identifies the PRIMARY table

### 📊 Current Performance
- **Accuracy**: 100% for the Stax CIM test case (FY-3: $64M, FY-2: $71M, FY-1: $71M, LTM: $76M)
- **Processing Time**: ~178 seconds (3 minutes) for a full document
- **API Calls**: 2 (1 financial extraction + 1 main extraction)
- **Completeness**: 96.9%

---
## Priority 1: Research & Analysis (Weeks 1-2)

### Todo 1: Review Older Commits for Historical Patterns

**Objective**: Understand how financial extraction worked in previous versions to identify what was effective.

**Tasks**:
1. Review commit history (2-3 hours)
   - Check commit 185c780 (Claude 3.7 implementation)
   - Check commit 5b3b1bf (Document AI fixes)
   - Check commit 0ec3d14 (multi-pass extraction)
   - Document prompt structures, validation logic, and error handling

2. Compare prompt simplicity (2 hours)
   - Extract prompts from older commits
   - Compare verbosity, structure, and clarity
   - Identify what made older prompts effective
   - Document key differences

3. Analyze deterministic parser usage (2 hours)
   - Review how financialTableParser.ts was used historically
   - Check integration patterns with LLM extraction
   - Identify successful validation strategies

4. Create comparison document (1 hour)
   - Document findings in docs/financial-extraction-evolution.md
   - Include before/after comparisons
   - Highlight lessons learned

**Deliverables**:
- Analysis document comparing old vs new approaches
- List of effective patterns to reintroduce
- Recommendations for prompt simplification

**Success Criteria**:
- Complete analysis of 3+ historical commits
- Documented comparison of prompt structures
- Clear recommendations for improvements

---
### Todo 2: Review Best Practices for Financial Data Extraction

**Objective**: Research industry best practices and academic approaches to improve extraction accuracy and reliability.

**Tasks**:
1. Academic research (4-6 hours)
   - Search for papers on LLM-based tabular data extraction
   - Review financial document parsing techniques
   - Study few-shot learning for table extraction

2. Industry case studies (3-4 hours)
   - Research how companies extract financial data
   - Review open-source projects (Tabula, Camelot)
   - Study financial data extraction libraries

3. Prompt engineering research (2-3 hours)
   - Study chain-of-thought prompting for tables
   - Review few-shot example selection strategies
   - Research validation techniques for structured outputs

4. Hybrid approach research (2-3 hours)
   - Review deterministic + LLM hybrid systems
   - Study error handling patterns
   - Research confidence scoring methods

5. Create best practices document (2 hours)
   - Document findings in docs/financial-extraction-best-practices.md
   - Include citations and references
   - Create implementation recommendations

**Deliverables**:
- Best practices document with citations
- List of recommended techniques
- Implementation roadmap

**Success Criteria**:
- Reviewed 10+ academic papers or industry case studies
- Documented 5+ applicable techniques
- Clear recommendations for implementation

---
## Priority 2: Performance Optimization (Weeks 3-4)

### Todo 3: Reduce Processing Time Without Sacrificing Accuracy

**Objective**: Reduce processing time from ~178 seconds to <120 seconds while maintaining 100% accuracy.

**Strategies**:

#### Strategy 3.1: Model Selection Optimization
- Use Claude Haiku 3.5 for initial extraction (faster, cheaper)
- Use Claude Sonnet 3.7 for validation/correction (more accurate)
- Expected impact: 30-40% time reduction

#### Strategy 3.2: Parallel Processing
- Extract independent sections in parallel
- Financial, business description, market analysis, etc.
- Expected impact: 40-50% time reduction
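Strategy 3.2 amounts to replacing a sequential loop over sections with `Promise.all`, so wall-clock time approaches the slowest single section rather than the sum. A minimal sketch, where `extract` stands in for the per-section LLM call:

```typescript
// Run one extraction per section concurrently and collect the
// results keyed by section name.
async function extractSectionsInParallel(
  sections: string[],
  extract: (section: string) => Promise<string>,
): Promise<Record<string, string>> {
  const results = await Promise.all(
    sections.map(async (s) => [s, await extract(s)] as const),
  );
  return Object.fromEntries(results);
}
```

Note this only helps for sections that are genuinely independent; any section whose prompt depends on another's output still has to wait.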
#### Strategy 3.3: Prompt Optimization
- Remove redundant instructions
- Use more concise examples
- Expected impact: 10-15% time reduction

#### Strategy 3.4: Caching Common Patterns
- Cache deterministic parser results
- Cache common prompt templates
- Expected impact: 5-10% time reduction

**Deliverables**:
- Optimized processing pipeline
- Performance benchmarks
- Documentation of time savings

**Success Criteria**:
- Processing time reduced to <120 seconds
- Accuracy maintained at 95%+
- API calls optimized

---
## Priority 3: Testing & Validation (Weeks 5-6)

### Todo 4: Add Unit Tests for Financial Extraction Validation Logic

**Test Categories**:

1. Invalid Value Rejection
   - Test rejection of values < $10M for revenue
   - Test rejection of negative EBITDA when it should be positive
   - Test rejection of unrealistic growth rates

2. Cross-Period Validation
   - Test revenue growth consistency
   - Test EBITDA margin trends
   - Test period-to-period validation

3. Numeric Extraction
   - Test extraction of values in millions
   - Test extraction of values in thousands (with conversion)
   - Test percentage extraction

4. Period Identification
   - Test years format (2021-2024)
   - Test FY-X format (FY-3, FY-2, FY-1, LTM)
   - Test mixed format with projections
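The numeric-extraction category above is the most mechanical to pin down with unit tests. A sketch of the kind of normalizer those tests would exercise, converting "$76M" / "$64,000K" style values to millions; the function name and the handled suffixes are assumptions, not the service's actual API:

```typescript
// Normalize a raw financial string to a value in millions.
// Returns null for anything that is not a plain suffixed number.
function parseToMillions(raw: string): number | null {
  const m = raw.replace(/[$,\s]/g, '').match(/^(-?\d+(?:\.\d+)?)(M|K|B)?$/i);
  if (!m) return null;
  const value = parseFloat(m[1]);
  switch ((m[2] ?? 'M').toUpperCase()) {
    case 'K': return value / 1000; // thousands -> millions
    case 'B': return value * 1000; // billions -> millions
    default:  return value;        // already in millions
  }
}
```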
**Deliverables**:
- Comprehensive test suite with 50+ test cases
- Test coverage >80% for financial validation logic
- CI/CD integration

**Success Criteria**:
- All test cases passing
- Test coverage >80%
- Tests catch regressions before deployment

---
## Priority 4: Monitoring & Observability (Weeks 7-8)

### Todo 5: Monitor Production Financial Extraction Accuracy

**Monitoring Components**:

1. Extraction Success Rate Tracking
   - Track extraction success/failure rates
   - Log extraction attempts and outcomes
   - Set up alerts for issues

2. Error Pattern Analysis
   - Categorize errors by type
   - Track error trends over time
   - Identify common error patterns

3. User Feedback Collection
   - Add UI for users to flag incorrect extractions
   - Store feedback in the database
   - Use feedback to improve prompts

**Deliverables**:
- Monitoring dashboard
- Alert system
- Error analysis reports
- User feedback system

**Success Criteria**:
- Real-time monitoring of extraction accuracy
- Alerts trigger for issues
- User feedback collected and analyzed

---
## Priority 5: Code Quality & Documentation (Weeks 9-11)

### Todo 6: Optimize Prompt Size for Financial Extraction

**Current State**: ~28,000 tokens

**Optimization Strategies**:
1. Remove redundancy (target: 30% reduction)
2. Use more concise examples (target: 40-50% reduction)
3. Focus on critical rules only

**Success Criteria**:
- Prompt size reduced by 20-30%
- Accuracy maintained at 95%+
- Processing time improved

---

### Todo 7: Add Financial Data Visualization

**Implementation**:
1. Backend API for validation and corrections
2. Frontend component for preview and editing
3. Confidence score display
4. Trend visualization

**Success Criteria**:
- Users can preview financial data
- Users can correct incorrect values
- Corrections are stored and used for improvement

---

### Todo 8: Document Extraction Strategies

**Documentation Structure**:
1. Table Format Catalog (years, FY-X, mixed formats)
2. Extraction Patterns (primary table, period mapping)
3. Best Practices Guide (prompt engineering, validation)

**Deliverables**:
- Comprehensive documentation in docs/financial-extraction-guide.md
- Format catalog with examples
- Pattern library
- Best practices guide

---
## Priority 6: Advanced Features (Weeks 12-14)

### Todo 9: Compare RAG vs Simple Extraction for Financial Accuracy

**Comparison Study**:
1. Test both approaches on 10+ CIM documents
2. Analyze results and identify the best approach
3. Design and implement a hybrid if beneficial

**Success Criteria**:
- Clear understanding of which approach is better
- Hybrid approach implemented if beneficial
- Accuracy improved or maintained

---

### Todo 10: Add Confidence Scores to Financial Extraction

**Implementation**:
1. Design scoring algorithm (parser agreement, value consistency)
2. Implement confidence calculation
3. Flag low-confidence extractions for review
4. Add review interface
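Steps 1-3 above could be combined into a single weighted score. The signals (parser agreement, cross-period consistency, sanity bounds) come from the plan; the weights and the 0.7 review threshold are purely illustrative values that the design step would need to calibrate:

```typescript
// Combine extraction-quality signals into a 0-1 confidence score
// and flag low-confidence extractions for manual review.
interface ExtractionSignals {
  parserAgrees: boolean;      // deterministic parser matches the LLM value
  periodsConsistent: boolean; // period-to-period growth is plausible
  valueInRange: boolean;      // passes sanity bounds (e.g. revenue >= $10M)
}

function confidenceScore(s: ExtractionSignals): number {
  let score = 0;
  if (s.parserAgrees) score += 0.5;      // strongest signal: two methods agree
  if (s.periodsConsistent) score += 0.3;
  if (s.valueInRange) score += 0.2;
  return score;
}

const REVIEW_THRESHOLD = 0.7; // illustrative cutoff

function needsReview(s: ExtractionSignals): boolean {
  return confidenceScore(s) < REVIEW_THRESHOLD;
}
```

With these weights, any extraction the deterministic parser disagrees with can score at most 0.5 and is always routed to review.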
**Success Criteria**:
- Confidence scores calculated for all extractions
- Low-confidence extractions flagged
- Review process implemented

---
## Implementation Timeline

- **Weeks 1-2**: Research & Analysis
- **Weeks 3-4**: Performance Optimization
- **Weeks 5-6**: Testing & Validation
- **Weeks 7-8**: Monitoring
- **Weeks 9-11**: Code Quality & Documentation
- **Weeks 12-14**: Advanced Features

## Success Metrics

- **Accuracy**: Maintain 95%+ accuracy
- **Performance**: <120 seconds processing time
- **Reliability**: 99%+ extraction success rate
- **Test Coverage**: >80% for financial validation
- **User Satisfaction**: <5% manual correction rate

## Next Steps

1. Review and approve this plan
2. Prioritize todos based on business needs
3. Assign resources
4. Begin Week 1 tasks