sales-data-analysis/.cursor/rules/data_loading.md
Jonathan Pressnell cf0b596449 Initial commit: sales analysis template
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-06 09:16:34 -05:00

# Data Loading Rules
## CRITICAL: Always Use data_loader.py
**NEVER load data directly with `pd.read_csv()`. Always use:**
```python
from data_loader import load_sales_data
from config import get_data_path
df = load_sales_data(get_data_path())
```
## Why This Matters
The `data_loader.py` module implements fallback logic to maximize date coverage:
1. **Primary:** Parse the primary date column (from `config.DATE_COLUMN`)
2. **Fallback 1:** Use the fallback date columns (from `config.DATE_FALLBACK_COLUMNS`) where the primary value is missing
3. **Fallback 2:** Use the `Year` column if both are missing
4. **Result:** The maximum date coverage possible, without silently dropping rows
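The repository's actual implementation isn't reproduced here, but the three-step fallback above might look roughly like this sketch (the column names and the helper name `parse_dates_with_fallback` are illustrative assumptions, not the real `data_loader.py` API):

```python
# Hypothetical sketch of the date-fallback logic; column names are assumptions.
import pandas as pd

DATE_COLUMN = "Date"                    # stands in for config.DATE_COLUMN
DATE_FALLBACK_COLUMNS = ["Order Date"]  # stands in for config.DATE_FALLBACK_COLUMNS

def parse_dates_with_fallback(df: pd.DataFrame) -> pd.DataFrame:
    """Fill the primary date column from fallbacks, then from a Year column."""
    if DATE_COLUMN in df.columns:
        dates = pd.to_datetime(df[DATE_COLUMN], errors="coerce")
    else:
        dates = pd.Series(pd.NaT, index=df.index)
    # Fallback 1: fill still-missing dates from each configured fallback column
    for col in DATE_FALLBACK_COLUMNS:
        if col in df.columns:
            dates = dates.fillna(pd.to_datetime(df[col], errors="coerce"))
    # Fallback 2: as a last resort, use January 1 of the Year column
    if "Year" in df.columns:
        year_dates = pd.to_datetime(df["Year"].astype("Int64").astype(str),
                                    format="%Y", errors="coerce")
        dates = dates.fillna(year_dates)
    df = df.copy()
    df[DATE_COLUMN] = dates
    return df
```

Note that `errors="coerce"` turns unparseable values into `NaT` rather than raising, which is what lets each fallback layer fill only the rows the previous layer missed.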
## What data_loader.py Provides
- **Date Column:** Properly parsed datetime with fallback logic
- **Year:** Extracted year (100% coverage via fallback)
- **YearMonth:** Period format for monthly aggregations
- **Revenue Column:** Converted to numeric (from config.REVENUE_COLUMN)
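The derived columns above could plausibly be produced along these lines (a sketch only; `add_derived_columns` and the default column names are assumptions, not the real loader's internals):

```python
# Hypothetical sketch of the derived-column step; names are assumptions.
import pandas as pd

def add_derived_columns(df: pd.DataFrame,
                        date_col: str = "Date",
                        revenue_col: str = "Revenue") -> pd.DataFrame:
    """Add Year and YearMonth from a parsed date column; coerce revenue to numeric."""
    df = df.copy()
    df["Year"] = df[date_col].dt.year
    # Period dtype ("2024-01") groups cleanly for monthly aggregations
    df["YearMonth"] = df[date_col].dt.to_period("M")
    df[revenue_col] = pd.to_numeric(df[revenue_col], errors="coerce")
    return df
```

A typical monthly aggregation then reduces to `df.groupby("YearMonth")[revenue_col].sum()`.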
## Column Configuration
Before using the loader, configure the column names in `config.py`:
- `REVENUE_COLUMN`: Your revenue/amount column name
- `DATE_COLUMN`: Primary date column name
- `DATE_FALLBACK_COLUMNS`: List of fallback date columns
- `CUSTOMER_COLUMN`: Customer/account column name
- Other columns as needed
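For a dataset with conventional column names, the relevant part of `config.py` might look like this (the values shown are illustrative placeholders, not defaults shipped with the template):

```python
# config.py (excerpt) -- illustrative values; adapt to your dataset's columns
REVENUE_COLUMN = "Revenue"                      # revenue/amount column
DATE_COLUMN = "Date"                            # primary date column
DATE_FALLBACK_COLUMNS = ["Order Date", "Ship Date"]  # tried in order if Date is missing
CUSTOMER_COLUMN = "Customer"                    # customer/account column
```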
## Common Mistakes
**WRONG:**
```python
df = pd.read_csv('sales_data.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna(subset=['Date']) # May drop significant data!
```
**CORRECT:**
```python
from data_loader import load_sales_data
from config import get_data_path
df = load_sales_data(get_data_path()) # Uses fallback logic
```
## Data File Location
The data file path is configured in `config.py`:
- `DATA_FILE`: Filename (e.g., 'sales_data.csv')
- `DATA_DIR`: Optional subdirectory (defaults to current directory)
- Use `get_data_path()` to get the full path
## Validation
After loading, validate data structure:
```python
from data_loader import validate_data_structure
is_valid, msg = validate_data_structure(df)
if not is_valid:
    print(f"ERROR: {msg}")
```
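The validator's internals aren't shown in this rules file; a minimal sketch of the `(is_valid, message)` contract might look like this (the required-column set is an assumption):

```python
# Hypothetical sketch of validate_data_structure(); required columns are assumed.
import pandas as pd

REQUIRED_COLUMNS = ["Date", "Revenue"]  # illustrative required set

def validate_data_structure(df: pd.DataFrame) -> tuple[bool, str]:
    """Return (is_valid, message) describing any structural problem found."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return False, f"Missing required columns: {missing}"
    if df.empty:
        return False, "DataFrame is empty"
    return True, "OK"
```

Returning a `(bool, str)` pair instead of raising keeps validation non-fatal, so callers can decide whether a structural problem should stop the analysis.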