sales-data-analysis/.cursor/rules/data_loading.md
Commit cf0b596449 by Jonathan Pressnell (co-authored by Cursor <cursoragent@cursor.com>), 2026-02-06: Initial commit: sales analysis template


# Data Loading Rules

## CRITICAL: Always Use data_loader.py

**NEVER** load data directly with `pd.read_csv()`. Always use:

```python
from data_loader import load_sales_data
from config import get_data_path

df = load_sales_data(get_data_path())
```

## Why This Matters

`data_loader.py` implements a fallback chain to maximize date coverage:

1. **Primary:** parse the primary date column (`config.DATE_COLUMN`).
2. **Fallback 1:** if the primary column is missing, try the fallback date columns (`config.DATE_FALLBACK_COLUMNS`).
3. **Fallback 2:** if both are missing, fall back to the `Year` column.
4. **Result:** the maximum date coverage possible.
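The chain above could be sketched roughly like this. This is an illustration only; the function name `parse_dates_with_fallback` and the column names `OrderDate` and `ShipDate` are hypothetical, and the real implementation lives in `data_loader.py`:

```python
import pandas as pd

def parse_dates_with_fallback(df, primary, fallbacks=(), year_col="Year"):
    """Try the primary date column, then each fallback column,
    then a bare Year column (parsed as January 1 of that year)."""
    dates = pd.Series(pd.NaT, index=df.index)  # starts as all-missing datetimes
    for col in (primary, *fallbacks):
        if col in df.columns:
            dates = dates.fillna(pd.to_datetime(df[col], errors="coerce"))
    if year_col in df.columns:
        # Year alone still anchors the row to a point in time.
        years = pd.to_datetime(df[year_col].astype(str), format="%Y",
                               errors="coerce")
        dates = dates.fillna(years)
    return dates
```

Each `fillna` only touches rows still missing a date, so earlier (more precise) sources always win over later fallbacks.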

## What data_loader.py Provides

- **Date column:** properly parsed datetimes, with fallback logic
- **`Year`:** extracted year (full coverage via the fallback chain)
- **`YearMonth`:** period format for monthly aggregations
- **Revenue column:** converted to numeric (`config.REVENUE_COLUMN`)
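As an illustration of those derived columns (the `Date` and `Revenue` names here are placeholders; `load_sales_data` uses the names configured in `config.py`):

```python
import pandas as pd

# Hypothetical raw frame; load_sales_data derives columns like these.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-15", "2024-03-02"]),
    "Revenue": ["1200.50", "980"],  # revenue often arrives as strings
})
df["Year"] = df["Date"].dt.year                 # extracted year
df["YearMonth"] = df["Date"].dt.to_period("M")  # monthly Period for groupby
df["Revenue"] = pd.to_numeric(df["Revenue"], errors="coerce")  # numeric revenue
```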

## Column Configuration

Before using the loader, configure the column names in `config.py`:

- `REVENUE_COLUMN`: your revenue/amount column name
- `DATE_COLUMN`: primary date column name
- `DATE_FALLBACK_COLUMNS`: list of fallback date columns
- `CUSTOMER_COLUMN`: customer/account column name
- other columns as needed
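A `config.py` might look like the sketch below; every value is an example for illustration, not a default shipped with the template:

```python
# config.py -- illustrative values only; adjust to your dataset's columns.
REVENUE_COLUMN = "Revenue"                           # revenue/amount column
DATE_COLUMN = "OrderDate"                            # primary date column
DATE_FALLBACK_COLUMNS = ["ShipDate", "InvoiceDate"]  # tried in order
CUSTOMER_COLUMN = "CustomerID"                       # customer/account column
```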

## Common Mistakes

**WRONG:**

```python
df = pd.read_csv('sales_data.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna(subset=['Date'])  # May silently drop rows whose dates live in a fallback column!
```

**CORRECT:**

```python
from data_loader import load_sales_data
from config import get_data_path

df = load_sales_data(get_data_path())  # Uses fallback logic
```

## Data File Location

The data file path is configured in `config.py`:

- `DATA_FILE`: filename (e.g., `'sales_data.csv'`)
- `DATA_DIR`: optional subdirectory (defaults to the current directory)
- use `get_data_path()` to build the full path
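`get_data_path()` presumably just joins these two settings; a minimal sketch under that assumption (the real helper lives in `config.py`):

```python
import os

DATA_FILE = "sales_data.csv"  # example filename
DATA_DIR = ""                 # optional subdirectory; "" means current directory

def get_data_path():
    """Join DATA_DIR and DATA_FILE into the path handed to the loader."""
    return os.path.join(DATA_DIR, DATA_FILE) if DATA_DIR else DATA_FILE
```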

## Validation

After loading, validate the data structure:

```python
from data_loader import validate_data_structure

is_valid, msg = validate_data_structure(df)
if not is_valid:
    print(f"ERROR: {msg}")
```
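The `(is_valid, msg)` contract suggests an implementation along these lines. This is a sketch only: `REQUIRED_COLUMNS` and the message strings are assumptions, not the template's actual code.

```python
import pandas as pd

REQUIRED_COLUMNS = ["Date", "Year", "Revenue"]  # assumed minimum set

def validate_data_structure(df):
    """Return (is_valid, message) for the loaded frame."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return False, f"missing required columns: {missing}"
    if df.empty:
        return False, "no rows loaded"
    return True, "ok"
```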