# Data Loading Rules

## CRITICAL: Always Use data_loader.py

**NEVER load data directly with `pd.read_csv()`. Always use:**

```python
from data_loader import load_sales_data
from config import get_data_path

df = load_sales_data(get_data_path())
```

## Why This Matters

`data_loader.py` implements fallback logic to maximize date coverage:

1. **Primary:** Parse the primary date column (from `config.DATE_COLUMN`)
2. **Fallback 1:** Use the fallback date columns if the primary date is missing (from `config.DATE_FALLBACK_COLUMNS`)
3. **Fallback 2:** Use the Year column if the primary and fallback dates are all missing
4. **Result:** Maximum date coverage possible
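
The fallback order above could be sketched as follows. This is only an illustration, not the actual contents of `data_loader.py`: the function name `resolve_dates`, the sample column names, and the choice of January 1 for year-only rows are all assumptions.

```python
import pandas as pd

# Hypothetical column names; the real ones come from config.py.
DATE_COLUMN = "Date"
DATE_FALLBACK_COLUMNS = ["InvoiceDate", "OrderDate"]
YEAR_COLUMN = "Year"


def resolve_dates(df: pd.DataFrame) -> pd.Series:
    """Return one datetime per row, filling gaps from fallback columns."""
    # Step 1: parse the primary date column; unparseable values become NaT.
    dates = pd.to_datetime(df[DATE_COLUMN], errors="coerce")

    # Step 2: fill remaining gaps from each fallback date column, in order.
    for col in DATE_FALLBACK_COLUMNS:
        if col in df.columns:
            dates = dates.fillna(pd.to_datetime(df[col], errors="coerce"))

    # Step 3: as a last resort, use January 1 of the Year column (an assumption).
    if YEAR_COLUMN in df.columns:
        years = pd.to_numeric(df[YEAR_COLUMN], errors="coerce")
        year_strings = years.map(
            lambda y: f"{int(y)}-01-01" if pd.notna(y) else None
        )
        dates = dates.fillna(pd.to_datetime(year_strings, errors="coerce"))

    return dates
```

Rows that still have no date after all three steps remain `NaT`, so a caller can count them with `resolve_dates(df).isna().sum()` before deciding how to handle them.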

## What data_loader.py Provides

- **Date Column:** Properly parsed datetime with fallback logic
- **Year:** Extracted year (100% coverage via fallback)
- **YearMonth:** Period format for monthly aggregations
- **Revenue Column:** Converted to numeric (from `config.REVENUE_COLUMN`)
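
As a rough sketch of how these derived columns might be built with pandas (the helper name `add_derived_columns` and the column names are hypothetical, and the date fallback logic is left out for brevity):

```python
import pandas as pd

# Hypothetical stand-ins for config.DATE_COLUMN and config.REVENUE_COLUMN.
DATE_COLUMN = "Date"
REVENUE_COLUMN = "Revenue"


def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with parsed dates, Year, YearMonth, and numeric revenue."""
    out = df.copy()
    # Parse dates; values that cannot be parsed become NaT.
    out[DATE_COLUMN] = pd.to_datetime(out[DATE_COLUMN], errors="coerce")
    # Calendar year of each parsed date.
    out["Year"] = out[DATE_COLUMN].dt.year
    # Monthly period (e.g. 2023-03), convenient as a groupby key.
    out["YearMonth"] = out[DATE_COLUMN].dt.to_period("M")
    # Convert revenue to numbers; non-numeric values become NaN.
    out[REVENUE_COLUMN] = pd.to_numeric(out[REVENUE_COLUMN], errors="coerce")
    return out
```

With columns in this shape, a monthly total is a one-liner such as `out.groupby("YearMonth")["Revenue"].sum()`.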

## Column Configuration

Before using the loader, configure column names in `config.py`:
- `REVENUE_COLUMN`: Your revenue/amount column name
- `DATE_COLUMN`: Primary date column name
- `DATE_FALLBACK_COLUMNS`: List of fallback date columns
- `CUSTOMER_COLUMN`: Customer/account column name
- Other columns as needed
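
A `config.py` along these lines would cover the settings listed above. The values are placeholders for illustration, and `missing_required_columns` is a hypothetical helper, not something the project is known to provide:

```python
# config.py (illustrative values only; replace with your dataset's column names)
REVENUE_COLUMN = "Revenue"
DATE_COLUMN = "Date"
DATE_FALLBACK_COLUMNS = ["InvoiceDate", "OrderDate"]
CUSTOMER_COLUMN = "Customer"


def missing_required_columns(available: list[str]) -> list[str]:
    """Hypothetical helper: list configured columns absent from a dataset."""
    required = [REVENUE_COLUMN, DATE_COLUMN, CUSTOMER_COLUMN]
    return [name for name in required if name not in available]
```

Running such a check against the header row of the data file is a quick way to catch a misspelled column name before any analysis starts.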

## Common Mistakes

❌ **WRONG:**
```python
import pandas as pd

df = pd.read_csv('sales_data.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna(subset=['Date'])  # May drop significant data!
```

✅ **CORRECT:**
```python
from data_loader import load_sales_data
from config import get_data_path

df = load_sales_data(get_data_path())  # Uses fallback logic
```

## Data File Location

The data file path is configured in `config.py`:
- `DATA_FILE`: Filename (e.g., 'sales_data.csv')
- `DATA_DIR`: Optional subdirectory (defaults to current directory)
- Use `get_data_path()` to get the full path

## Validation

After loading, validate the data structure:
```python
from data_loader import validate_data_structure

is_valid, msg = validate_data_structure(df)
if not is_valid:
    print(f"ERROR: {msg}")
```
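
For orientation, a validator with the same `(is_valid, msg)` return shape might look like the sketch below. The function name `check_required_columns` and the list of required columns are assumptions; the checks that `validate_data_structure` actually performs are defined in `data_loader.py`.

```python
import pandas as pd

# Hypothetical list of columns the loaded DataFrame is expected to contain.
REQUIRED_COLUMNS = ["Date", "Year", "YearMonth", "Revenue"]


def check_required_columns(df: pd.DataFrame) -> tuple[bool, str]:
    """Return (True, "OK") or (False, reason) for a loaded DataFrame."""
    # Report every missing column at once rather than stopping at the first.
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        return False, f"Missing columns: {', '.join(missing)}"
    # A frame with the right columns but no rows is still unusable.
    if df.empty:
        return False, "DataFrame has no rows"
    return True, "OK"
```

Returning a message alongside the flag lets the caller decide whether to print, log, or raise.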