# Data Loading Rules

## CRITICAL: Always Use data_loader.py
**NEVER** load data directly with `pd.read_csv()`. Always use:

```python
from data_loader import load_sales_data
from config import get_data_path

df = load_sales_data(get_data_path())
```
## Why This Matters

`data_loader.py` implements fallback logic to maximize date coverage:

- **Primary**: Parse the primary date column (from `config.DATE_COLUMN`)
- **Fallback 1**: Use the fallback date columns if the primary is missing (from `config.DATE_FALLBACK_COLUMNS`)
- **Fallback 2**: Use the `Year` column if both are missing
- **Result**: The maximum date coverage possible
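The fallback sequence above could be sketched roughly as follows. This is a minimal illustration, not the actual `data_loader.py` code; the column names and the helper name `parse_dates_with_fallback` are assumptions standing in for the values configured in `config.py`:

```python
import pandas as pd

# Assumed stand-ins for config.DATE_COLUMN / config.DATE_FALLBACK_COLUMNS
DATE_COLUMN = "Order Date"
DATE_FALLBACK_COLUMNS = ["Ship Date", "Invoice Date"]

def parse_dates_with_fallback(df: pd.DataFrame) -> pd.Series:
    """Return a datetime Series, filling gaps from fallbacks, then Year."""
    dates = pd.Series(pd.NaT, index=df.index)
    for col in [DATE_COLUMN, *DATE_FALLBACK_COLUMNS]:
        if col in df.columns:
            parsed = pd.to_datetime(df[col], errors="coerce")
            dates = dates.fillna(parsed)  # only fills rows still missing
    if "Year" in df.columns:
        # Last resort: January 1 of the Year column
        year_dates = pd.to_datetime(df["Year"].astype(str),
                                    format="%Y", errors="coerce")
        dates = dates.fillna(year_dates)
    return dates
```

The key design point is that each fallback only fills rows that are still missing, so a later, coarser source (like `Year`) never overwrites a precise date parsed earlier.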
## What data_loader.py Provides

- **Date column**: Properly parsed datetime with fallback logic
- **Year**: Extracted year (100% coverage via fallback)
- **YearMonth**: Period format for monthly aggregations
- **Revenue column**: Converted to numeric (from `config.REVENUE_COLUMN`)
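The derived columns listed above could be computed along these lines. This is a hedged sketch, assuming a parsed `Date` column already exists; the real logic lives inside `load_sales_data`, and the default column names here are placeholders:

```python
import pandas as pd

def add_derived_columns(df: pd.DataFrame, date_col: str = "Date",
                        revenue_col: str = "Revenue") -> pd.DataFrame:
    """Illustrative sketch: add Year / YearMonth and coerce revenue."""
    df = df.copy()
    df["Year"] = df[date_col].dt.year                  # extracted year
    df["YearMonth"] = df[date_col].dt.to_period("M")   # monthly Period
    df[revenue_col] = pd.to_numeric(df[revenue_col], errors="coerce")
    return df
```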
## Column Configuration

Before using, configure the column names in `config.py`:

- `REVENUE_COLUMN`: Your revenue/amount column name
- `DATE_COLUMN`: Primary date column name
- `DATE_FALLBACK_COLUMNS`: List of fallback date columns
- `CUSTOMER_COLUMN`: Customer/account column name
- Other columns as needed
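An illustrative `config.py` fragment might look like this. The column names below are placeholder assumptions; set them to match your dataset:

```python
# config.py (illustrative values only — adjust to your data)
REVENUE_COLUMN = "Revenue"
DATE_COLUMN = "Order Date"
DATE_FALLBACK_COLUMNS = ["Ship Date", "Invoice Date"]
CUSTOMER_COLUMN = "Customer Name"
```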
## Common Mistakes

❌ **WRONG**:

```python
df = pd.read_csv('sales_data.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna(subset=['Date'])  # May silently drop a significant share of rows!
```

✅ **CORRECT**:

```python
from data_loader import load_sales_data
from config import get_data_path

df = load_sales_data(get_data_path())  # Uses fallback logic
```
## Data File Location

The data file path is configured in `config.py`:

- `DATA_FILE`: Filename (e.g., `'sales_data.csv'`)
- `DATA_DIR`: Optional subdirectory (defaults to the current directory)
- Use `get_data_path()` to get the full path
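A helper like `get_data_path()` could be as simple as the sketch below. This is an assumption about its shape, not the actual `config.py` implementation:

```python
import os

DATA_FILE = "sales_data.csv"   # illustrative values
DATA_DIR = ""                  # empty means current directory

def get_data_path() -> str:
    """Join the optional DATA_DIR with DATA_FILE into a full path."""
    return os.path.join(DATA_DIR, DATA_FILE) if DATA_DIR else DATA_FILE
```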
## Validation

After loading, validate the data structure:

```python
from data_loader import validate_data_structure

is_valid, msg = validate_data_structure(df)
if not is_valid:
    print(f"ERROR: {msg}")
```
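A validator of that shape might perform checks along these lines. This is a hedged sketch only; the required column names are assumptions, and the real checks live in `data_loader.py`:

```python
import pandas as pd

# Assumed required columns — the actual list comes from data_loader.py
REQUIRED_COLUMNS = ["Date", "Year", "Revenue"]

def validate_data_structure(df: pd.DataFrame):
    """Return (is_valid, message) describing the first problem found."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return False, f"Missing required columns: {missing}"
    if df["Date"].isna().all():
        return False, "Date column contains no parsed dates"
    return True, "OK"
```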