# Data Loading Rules

## CRITICAL: Always Use data_loader.py

**NEVER load data directly with `pd.read_csv()`. Always use:**

```python
from data_loader import load_sales_data
from config import get_data_path

df = load_sales_data(get_data_path())
```

## Why This Matters

The `data_loader.py` implements intelligent fallback logic to ensure 100% date coverage:

1. **Primary:** Parse primary date column (from config.DATE_COLUMN)
2. **Fallback 1:** Use fallback date columns if primary is missing (from config.DATE_FALLBACK_COLUMNS)
3. **Fallback 2:** Use Year column if both missing
4. **Result:** Maximum date coverage possible

## What data_loader.py Provides

- **Date Column:** Properly parsed datetime with fallback logic
- **Year:** Extracted year (100% coverage via fallback)
- **YearMonth:** Period format for monthly aggregations
- **Revenue Column:** Converted to numeric (from config.REVENUE_COLUMN)

## Column Configuration

Before using, configure column names in `config.py`:

- `REVENUE_COLUMN`: Your revenue/amount column name
- `DATE_COLUMN`: Primary date column name
- `DATE_FALLBACK_COLUMNS`: List of fallback date columns
- `CUSTOMER_COLUMN`: Customer/account column name
- Other columns as needed

## Common Mistakes

❌ **WRONG:**

```python
df = pd.read_csv('sales_data.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna(subset=['Date'])  # May drop significant data!
```

✅ **CORRECT:**

```python
from data_loader import load_sales_data
from config import get_data_path

df = load_sales_data(get_data_path())  # Uses fallback logic
```

## Data File Location

The data file path is configured in `config.py`:

- `DATA_FILE`: Filename (e.g., 'sales_data.csv')
- `DATA_DIR`: Optional subdirectory (defaults to current directory)
- Use `get_data_path()` to get the full path

## Validation

After loading, validate data structure:

```python
from data_loader import validate_data_structure

is_valid, msg = validate_data_structure(df)
if not is_valid:
    print(f"ERROR: {msg}")
```
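
## Illustrative Sketch: Date Fallback Logic

To make the fallback order from "Why This Matters" concrete, here is a minimal sketch of how such logic could look. It is not the actual contents of `data_loader.py`. The function name `apply_date_fallback`, the column names (`Date`, `InvoiceDate`, `OrderDate`, `Revenue`), and the module-level constants standing in for `config.py` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the values normally read from config.py
DATE_COLUMN = "Date"
DATE_FALLBACK_COLUMNS = ["InvoiceDate", "OrderDate"]
REVENUE_COLUMN = "Revenue"


def apply_date_fallback(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: fill dates and years using the documented fallback order."""
    out = df.copy()

    # Primary: parse the primary date column; unparseable values become NaT
    if DATE_COLUMN in out.columns:
        dates = pd.to_datetime(out[DATE_COLUMN], errors="coerce")
    else:
        dates = pd.Series(pd.NaT, index=out.index, dtype="datetime64[ns]")

    # Fallback 1: fill remaining gaps from each fallback date column, in order
    for col in DATE_FALLBACK_COLUMNS:
        if col in out.columns:
            dates = dates.fillna(pd.to_datetime(out[col], errors="coerce"))
    out[DATE_COLUMN] = dates

    # Year comes from the parsed date where one exists
    year = dates.dt.year

    # Fallback 2: use the Year column for rows that still have no date
    if "Year" in out.columns:
        year = year.fillna(pd.to_numeric(out["Year"], errors="coerce"))
    out["Year"] = year.astype("Int64")  # nullable integer keeps missing years as <NA>

    # Monthly period for aggregations (stays missing where no date was found)
    out["YearMonth"] = dates.dt.to_period("M")

    # Revenue converted to numeric; non-numeric entries become NaN
    if REVENUE_COLUMN in out.columns:
        out[REVENUE_COLUMN] = pd.to_numeric(out[REVENUE_COLUMN], errors="coerce")
    return out


# Small made-up sample: each row exercises a different step of the fallback
sample = pd.DataFrame({
    "Date": ["2021-03-15", None, None],
    "InvoiceDate": [None, "2022-07-01", None],
    "Year": [2021, 2022, 2023],
    "Revenue": ["100.5", "200", "n/a"],
})
loaded = apply_date_fallback(sample)
print(loaded[["Date", "Year", "YearMonth", "Revenue"]])
```

In this sample the third row has no usable date at all, so it keeps a missing `Date` but still receives a `Year` from the Year column. With the `dropna` approach shown under Common Mistakes, both the second and third rows would have been dropped.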